2015-07-30

A nice experience with Scalable Machine Learning

Just when I had learned the basics of MapReduce and was about to start learning Hadoop, this course came my way and caught my eye.

Apache Spark is said to be tens to hundreds of times faster than Hadoop at highly parallelized computations, mainly because it keeps data in main memory rather than on disk, avoiding the disk reads and writes that are a major source of computational cost. As I understand it, another benefit is that it allows multiple map and reduce steps in a more complex pipeline, by caching data and intermediate results for repeated use.
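The multi-stage idea can be sketched in plain Python, with no Spark required; the Spark calls mentioned in the comments are only analogies, and the toy corpus is made up for illustration:

```python
from collections import Counter

# A toy corpus standing in for a distributed dataset (hypothetical data).
lines = [
    "spark caches data in memory",
    "hadoop writes data to disk",
    "spark reuses cached data",
]

# Map stage: emit (word, 1) pairs, analogous to
# rdd.flatMap(str.split).map(lambda w: (w, 1)) in Spark.
pairs = [(w, 1) for line in lines for w in line.split()]

# Reduce stage: sum the counts per key,
# analogous to reduceByKey(lambda a, b: a + b).
counts = Counter()
for word, n in pairs:
    counts[word] += n

# Because `counts` stays in memory, a second map/reduce pass can reuse it
# without rereading the input -- the analogue of calling .cache() on an RDD.
frequent = {w: c for w, c in counts.items() if c > 1}
```

In real Spark the two stages would run across a cluster, but the shape of the computation, and the payoff of keeping the intermediate result in memory for the next stage, is the same.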

The course used an IPython notebook throughout, running inside a VM provisioned with Vagrant and VirtualBox. My basic knowledge of Python was quite sufficient for the weekly hands-on coding assignments, supplemented, of course, by further self-study through online resources.

Unlike many other online courses, the videos give no instructions on the procedures and code needed for the homework. I had to work out how each problem should be solved, and which program modules to write, largely on my own. This took a lot of effort, but it was also great fun when I finally solved each problem.

Even with some Python experience, I didn't find the IPython notebook very convenient. The pro is that it shows the result right below each block of code after you run it; the con is that you quickly forget which variables you created a few moments ago and have to scroll up a lot to find them. I would prefer the Qt console in the future. It is much handier to work in consoles like those of RStudio, MATLAB, or the Octave GUI, where the side panels keep track of what is in the workspace.

One piece of good news is that Spark has improved its R support in the most recent release. To me, NumPy and pandas are not as straightforward as R for array/matrix computations and data frame operations. I will be very interested in trying R on Spark soon.
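To give a flavor of the comparison, here is how two common R idioms translate into NumPy and pandas; this is a small made-up example, not code from the course:

```python
import numpy as np
import pandas as pd

# Matrix multiplication: R's `A %*% B` becomes A.dot(B) (or A @ B) in NumPy.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
C = A.dot(B)

# Data frame aggregation: R's `aggregate(value ~ group, df, mean)` becomes
# a groupby in pandas.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})
means = df.groupby("group")["value"].mean()
```

The operations map one-to-one, but R's notation for them is terser, which is much of what I mean by "straightforward" above.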

After this course, I plan to practice more with decision trees, random forests, SVMs, neural networks, naive Bayes, time series, and MCMC, in both R and Python, on and off Spark, to build a more complete machine learning profile.
