2016-04-15
First, follow the instructions on the official webpage to install the Hortonworks Sandbox VM and its prerequisites, plus Python 2.7, pip, a few machine learning packages, and Jupyter (the successor to IPython Notebook).
http://hortonworks.com/hadoop-tutorial/using-ipython-notebook-with-apache-spark/
However, instead of running
pip install "ipython[notebook]"
I ran
pip install jupyter
which should be equivalent.
Then skip the IPython profile configuration described in the tutorial. Instead, install "findspark":
pip install findspark
Then run these two lines to start the Jupyter notebook:
source /opt/rh/python27/enable
jupyter notebook --port 8889 --notebook-dir='/usr/hdp/current/spark-client/' --ip='*' --no-browser
Alternatively, you can put these two lines in a .sh file and run that file each time to start the notebook. This is similar to what the official webpage describes, but without creating an IPython profile.
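For reference, such a script could look like the following (the filename start_notebook.sh is my own choice, not from the tutorial):

```shell
#!/bin/bash
# Enable the Software Collections Python 2.7 environment
source /opt/rh/python27/enable
# Start Jupyter on port 8889, rooted at the Spark client directory,
# listening on all interfaces so the host machine can reach it
jupyter notebook --port 8889 \
    --notebook-dir='/usr/hdp/current/spark-client/' \
    --ip='*' --no-browser
```

Make it executable once with chmod +x start_notebook.sh, then launch the notebook with ./start_notebook.sh.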
Forward port 8889 from VirtualBox to the host machine as described in the tutorial; the VM needs a restart before the notebook is reachable from the host machine.
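If you prefer the command line to the VirtualBox GUI, the same port-forwarding rule can be added with VBoxManage. The VM name "Hortonworks Sandbox" below is an assumption; check yours with VBoxManage list vms:

```shell
# Forward host port 8889 to guest port 8889 on the first NAT adapter.
# Run this while the VM is powered off; the rule format is
# "name,protocol,hostip,hostport,guestip,guestport" (empty IPs = all).
VBoxManage modifyvm "Hortonworks Sandbox" --natpf1 "jupyter,tcp,,8889,,8889"
```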
Then, at the top of every new notebook in which we want to write PySpark code, run the following lines first:
import findspark
findspark.init('/usr/hdp/current/spark-client')  # point findspark at SPARK_HOME
import pyspark
sc = pyspark.SparkContext()
Note that the path "/usr/hdp/current/spark-client" passed to findspark.init() (and used in the --notebook-dir option when starting the notebook) is the default SPARK_HOME on the Hortonworks Sandbox.
PS. Hortonworks is so much faster and handier than Cloudera and Oracle Big Data Lite!!!!!