1. Explore the data first to get a rough view of its distribution. Plot it or calculate the covariance matrix to see whether the variables are correlated. If they are all nearly uncorrelated, PCA is unnecessary.
2. Libraries and functions used:
numpy.linalg.eigh
numpy.argsort
* Note: when NumPy arrays represent matrices, use np.dot() for the matrix product and np.outer() for the outer product; the * operator multiplies element by element.
numpy.kron(matrix1, matrix2)
numpy.eye(n)
numpy.tile(matrix, (dimension1, dimension2,...))
3. The workflow of chaining multiple map-reduce passes to solve big-n and big-d problems. I need to rewatch the lecture videos on this topic several times to keep in mind how it works.
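
The three points above can be tied together in a minimal sketch. It is not the course's lab code: the function names (estimate_covariance, pca, covariance_map_reduce) and the toy data are made up for illustration, and plain Python lists of array chunks stand in for Spark partitions so the example runs without a cluster. It assumes the big-n, small-d case, where the d x d covariance matrix fits in memory and can be built as a sum of outer products.

```python
from functools import reduce

import numpy as np


def estimate_covariance(data):
    """Covariance matrix of `data` (n samples in rows, d features in columns)."""
    centered = data - data.mean(axis=0)
    return centered.T.dot(centered) / data.shape[0]


def pca(data, k=2):
    """Top-k principal components, projected scores, and sorted eigenvalues."""
    cov = estimate_covariance(data)
    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    order = np.argsort(eig_vals)[::-1]      # indices from largest to smallest
    components = eig_vecs[:, order[:k]]     # d x k matrix, one component per column
    scores = (data - data.mean(axis=0)).dot(components)
    return components, scores, eig_vals[order]


def covariance_map_reduce(partitions):
    """Same covariance, computed in two map-reduce style passes over chunks."""
    # Pass 1: map each chunk to (count, column sums), reduce by adding them up
    n, total = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1]),
        ((len(p), p.sum(axis=0)) for p in partitions),
    )
    mean = total / n

    # Pass 2: map each chunk to its sum of outer products, reduce by adding
    def partial_scatter(chunk):
        return sum(np.outer(row - mean, row - mean) for row in chunk)

    scatter = reduce(np.add, map(partial_scatter, partitions))
    return scatter / n


# Toy data (an assumption for this sketch): 200 samples, 3 correlated features
rng = np.random.RandomState(0)
mixing = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.1]])
data = rng.randn(200, 3).dot(mixing)

# Step 1: look at the correlations before deciding whether PCA is worthwhile
print(np.corrcoef(data, rowvar=False))

# Step 2: PCA through eigh and argsort
components, scores, eig_vals = pca(data, k=2)

# Step 3: the chunked version should agree with the in-memory one
cov_mr = covariance_map_reduce(np.array_split(data, 4))
print(np.allclose(cov_mr, estimate_covariance(data)))
```

In Spark the chunks would presumably be an RDD and the two passes would be map and reduce calls on it, but that part is left out here because it depends on the cluster setup.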
Course: Scalable Machine Learning
Software: Apache Spark
Platform: Spark Vagrant
Language: Python
IDE: IPython
MOOC provider: BerkeleyX-edX