2016-04-15

Use IPython/Jupyter Notebook with Hortonworks to run PySpark

First, follow the instructions on the official webpage to install the Hortonworks Sandbox VM, its prerequisites, Python 2.7, pip, a few machine learning packages, and Jupyter (the successor to IPython Notebook):
http://hortonworks.com/hadoop-tutorial/using-ipython-notebook-with-apache-spark/

However, instead of doing

    pip install "ipython[notebook]"

I did

    pip install jupyter

The two should be equivalent for this purpose.

Then skip the IPython profile configuration that the tutorial introduces. Instead, install findspark, which locates the Spark installation and adds it to sys.path at runtime:

    pip install findspark

Then run the following two lines to start the Jupyter notebook:

    source /opt/rh/python27/enable
    jupyter notebook --port 8889 --notebook-dir='/usr/hdp/current/spark-client/' --ip='*' --no-browser

Alternatively, put the two lines in a .sh file and run that file whenever you want to start the notebook, as sketched below. This is similar to what the official webpage instructs, but without creating an IPython profile.
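
For example, a wrapper script along these lines should work (the file name start_notebook.sh is just a suggestion):

    #!/bin/bash
    # Enable the Python 2.7 software collection, then launch Jupyter
    # against the default Hortonworks SPARK_HOME.
    source /opt/rh/python27/enable
    jupyter notebook --port 8889 \
        --notebook-dir='/usr/hdp/current/spark-client/' \
        --ip='*' --no-browser

Make it executable with "chmod +x start_notebook.sh", then start the notebook with "./start_notebook.sh".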

Forward port 8889 from VirtualBox to the host machine as described in the tutorial; the VM needs to be restarted before the notebook is accessible from the host.
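
If you prefer the command line to the VirtualBox GUI, the same NAT port-forwarding rule can be added with VBoxManage while the VM is powered off. The VM name "Hortonworks Sandbox" below is an assumption; substitute whatever "VBoxManage list vms" reports for your machine:

    # Forward host port 8889 to guest port 8889 on the first NAT adapter.
    # "Hortonworks Sandbox" is a guess at the VM name; check `VBoxManage list vms`.
    VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "jupyter,tcp,,8889,,8889"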

Finally, every time you open a new notebook and want to write PySpark code, run the following lines before anything else:

    import findspark
    findspark.init('/usr/hdp/current/spark-client')  # default SPARK_HOME on Hortonworks
    import pyspark
    sc = pyspark.SparkContext()

Note that the path "/usr/hdp/current/spark-client" passed to findspark.init(), which also appears in the options when starting the notebook, is the default SPARK_HOME on Hortonworks.
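
To confirm the SparkContext is actually working, a quick sanity check like the following should run in the notebook (the numbers are arbitrary):

    # Distribute a small range across the cluster and aggregate it.
    rdd = sc.parallelize(range(100))
    print(rdd.sum())    # expect 4950
    print(rdd.count())  # expect 100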

PS. Hortonworks is so much faster and handier than Cloudera and Oracle Big Data Lite!