2016-03-22

WeekLog: busy and messy, but fruitful

Having been drowning in two Kaggle competitions for three weeks, I find myself even busier now that I am back in the real world: three more projects to go for my coursework.

Having spent most of my time creating, modifying, and tuning machine learning models, engineering features in the data sets, and fixing bugs in the VMs, I found I could barely speak or listen. Thankfully, I recovered soon after some group meetings with my classmates, talking about how to go on with the data mining project.

My Kaggles didn't go very well. After some hard work and quick learning, I could only rank around 500 and 1000. There is more feature engineering to do and more hidden patterns to discover. However, I have to stop now, just to get ready for the hard work to come.

Group meetings are really good for practicing your communication skills. I usually explain everything in an abstract manner, taking it for granted that everybody will understand. That's not always true. I have to give concrete examples to show what I have in mind.

Sometimes it's like teaching. I have to stop once in a while, see what feedback I get, and then adjust the way I speak accordingly. If they understand everything, I just go on. If people get stuck, I should take the responsibility and try different ways to describe my point. Sometimes drawing on a piece of paper helps a lot, especially if there is an architecture or workflow in your mind. Metaphors are not a good idea, with the risk that people think you are not serious enough. Always pause and listen for feedback, and try to start again from where people can easily understand.

Of course, I have to learn from others as well. Today a group member gave me a good idea of how MapReduce should be used. I had always struggled with what MapReduce can do -- not so good for machine learning and not specifically for data storage, so what is its use anyway? Today my teammate told me in a simple way: it's just a kind of SQL! That is a perfect way to explain it to me. I appreciate that so much.
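To see why the "it's just a kind of SQL" analogy clicks, here is a toy sketch (plain Python, made-up data, not our project's code): map emits key–value pairs like a SELECT, the shuffle groups them like a GROUP BY, and reduce aggregates each group like a SUM().

```python
from itertools import groupby
from operator import itemgetter

# Toy data: (country, amount) pairs, as if rows of a table.
rows = [("us", 3), ("cn", 5), ("us", 2), ("de", 1), ("cn", 4)]

# Map: emit (key, value) pairs -- like the SELECT clause choosing columns.
mapped = [(country, amount) for country, amount in rows]

# Shuffle: bring pairs with the same key together -- the GROUP BY.
mapped.sort(key=itemgetter(0))

# Reduce: aggregate each group -- the SUM() over each key.
totals = {key: sum(v for _, v in pairs)
          for key, pairs in groupby(mapped, key=itemgetter(0))}

print(totals)  # {'cn': 9, 'de': 1, 'us': 5}
```

In other words, a MapReduce job is morally `SELECT key, SUM(value) FROM rows GROUP BY key`, just spread across many machines.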

After all, I made great progress. Learned the new xgboost method as a tree model and the fancy h2o for deep learning. Reinvented and tested a Bayesian way of re-encoding categorical attributes, though I proved that the method is no better than assigning arbitrary integers in this specific use case. Installed Spark on Ubuntu and got it working with Hadoop and Anaconda Python 2.7. Fixed connection bugs when using Eclipse after updating my Hadoop version. Helped a lot of people with coding and bug fixing...
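I didn't describe the Bayesian re-encoding above, so here is one common variant of the idea as a minimal sketch (my own illustration with made-up data, not necessarily the exact formula I tested): replace each category with its target rate, shrunk toward the global prior so rare categories don't get noisy estimates.

```python
from collections import defaultdict

# Hypothetical data: a categorical feature and a binary target.
cats = ["a", "a", "b", "b", "b", "c"]
ys   = [ 1,   0,   1,   1,   0,   1 ]

prior = sum(ys) / len(ys)   # global target rate
k = 2.0                     # smoothing strength (pseudo-count weight)

sums, counts = defaultdict(float), defaultdict(int)
for c, y in zip(cats, ys):
    sums[c] += y
    counts[c] += 1

# Posterior-mean style estimate: with few observations the prior
# dominates; with many observations the category's own rate takes over.
encoding = {c: (sums[c] + k * prior) / (counts[c] + k) for c in counts}

print({c: round(v, 3) for c, v in encoding.items()})
```

The resulting numbers replace the category labels as a single numeric feature; in my use case this turned out no better than arbitrary integer codes, probably because tree models can split on the integers just as well.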

Days have gone by quickly, and I feel more like a data scientist than ever before, especially when even my 'technical guy' needs me to explain the algorithms to him. I take it as encouragement.

Next we will build ensemble models for the data mining project. Specifically, I want to do stacked generalization. It will be an interesting one. There is a lot of literature to read. I am finally on the right path. Just keep going.
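For the record, the basic shape of stacked generalization I have in mind looks like this minimal scikit-learn sketch (synthetic data and arbitrary base models, not the project's actual pipeline): base models produce out-of-fold predictions, and a meta-model learns to combine them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base_models = [DecisionTreeClassifier(random_state=0),
               KNeighborsClassifier()]

# Level 0: out-of-fold predictions on the training set, so the
# meta-model never sees a base model scoring its own training fold.
meta_tr = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the full training set to build test features.
meta_te = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# Level 1: a simple meta-model combines the base predictions.
meta = LogisticRegression().fit(meta_tr, y_tr)
print("stacked accuracy:", accuracy_score(y_te, meta.predict(meta_te)))
```

The out-of-fold step is the whole point: feeding the meta-model in-sample predictions would let it learn to trust overfit base models.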

