2018-06-11

Towards an end-to-end analytics product

The last six months went surprisingly fast. With fewer ad hoc analytics and reporting tasks coming from clients, I have been more exposed to the solution design work for our current and future products. Let me call them analytics as a product and reporting as a product. This is valuable experience for making progress in my career as a data scientist.

I always like to ask myself (and talk to the relevant people when possible) before starting a new analytics task: 1) who wants the analysis/report, 2) what are the deliverables, and 3) what do they really want to know or understand.

When it comes to something bigger, a project that aims at a complete product, things become more complex. Apart from the schedule and budget, which I am not discussing here, I have to consider the following.

1. What is the business value it will bring? The answer may be short but take a long time to get. To stay in line with the business goals, we want to know which stage our business is at and what we have promised to current or upcoming clients. We want to be able to show clearly that this project will add to our promised advantages in the market within an affordable amount of time and effort.

2. Based on the business value, what is the minimum viable product, and what are the premium features on top of that? This is actually the part I used to focus on: in meetings I would usually concentrate on what the clients may want and what is not necessary. Now I am thinking about these questions within a bigger framework and trying to assign different weights to different parts based on the business value they may bring.
  The whole analytics/reporting product may have a modularised and/or layered structure. We may have a kernel module that is the minimum viable product, with more features accessible to clients who pay more, so that we can offer different plans, like free/silver/platinum, to different clients.

3. Coming back to the product itself, we want to design not only a static analytics result or report, but also a dynamic process with pipelines and interactions. This extends the three questions above: the people who want the analysis/report, the deliverables, and what they really want to know.

3.1 When we consider the people who want the analytics product, we need to construct a behaviour model of them and design towards it.
  Do they just want to receive an Excel workbook with tables in it, or do they want to open a webpage and see a dashboard with interactive tools, where they can get the latest information with a single click? Do we allow them to write into part of our database, for example to approve or reject the changes we suggest? How are they likely to use such features?
  Do we design this product only for external clients, or is it also for internal developers who may use some of the analytics results? Maybe we need to create both external and internal APIs for different uses and usage patterns.

3.2 When we consider the deliverables, we not only hand over the final report/dashboard/API calls, but also need to establish a whole set of infrastructure and a pipeline of processes behind them.
  For example, we may need to build helper tables in the database and a process to refresh them daily or hourly with data from the production database. We need a strategy and a mechanism to cope with situations where nothing is refreshed, where the refreshing process fails, and where the server is down. When there are dirty or missing parts in the production database, how do we deal with them? When there is a change request to alter the values of some records, how do we ensure the changed values are queried instead of the old ones, how do we enable rolling back, and how do we ensure nothing is broken? Some tiny decisions require very careful consideration and a good understanding of the business (a small sketch of a fail-safe refresh follows at the end of this subsection).
  We need to design the tables so that different clients can read and write different parts independently without conflicts. When the results are fed to other developers in the company, we need to communicate enough to make sure the results are used only where they are needed, without breaking things elsewhere.
  And regarding the methods used in the actual analytics work, including machine learning methods, statistical tests and performance measurement, we need to demonstrate that they are appropriate, accurate, or at least outperform other methods currently used in the industry. This is mostly required internally, but sometimes it can be a selling point for marketing.
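
As a concrete illustration of the fail-safe refresh idea above, here is a minimal sketch in R. The connection con, the table name and the helper pull_from_production() are all hypothetical, and the exact rename SQL differs between database engines.

    library(DBI)

    # pull_from_production() is a hypothetical helper that queries the
    # production database and returns a data frame, or throws an error
    # when the source is unavailable.
    refresh_helper_table <- function(con, table_name = "homework_summary") {
      fresh <- tryCatch(pull_from_production(), error = function(e) NULL)
      if (is.null(fresh) || nrow(fresh) == 0) {
        # nothing usable came back: keep yesterday's table and flag the failure
        warning("Refresh skipped: production data unavailable; old table kept.")
        return(invisible(FALSE))
      }
      # load into a staging table first, then swap, so that a half-written
      # refresh never replaces the table the reports are reading from
      staging <- paste0(table_name, "_staging")
      dbWriteTable(con, staging, fresh, overwrite = TRUE)
      dbExecute(con, paste("DROP TABLE IF EXISTS", table_name))
      dbExecute(con, paste("ALTER TABLE", staging, "RENAME TO", table_name))
      invisible(TRUE)
    }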

3.3 "What do they really want to know"
  Usually this is not clarified until several rounds of communication, and in reality sometimes not until the work has reached mid-stage. We really want this question clarified as early as possible, and ideally we can settle it at the solution design stage.
  To give an example, when they say they want to understand how students are performing in homework, there is a chance that in the end they only care about the most recent week for each student.
  So a more precise description is needed every time, because we need to implement different queries for different questions, even though the questions sound similar to each other (a small sketch follows at the end of this subsection). This also applies to data processing, machine learning and statistical methods: answering different questions needs different methods or strategies.
  The design of the end deliverables will also address this question, by generating the most informative and relevant figures and charts for the clients, possibly with interactivity that lets them navigate, search, roll up, drill down, and perhaps approve or reject what we suggest, all depending on what they really want to know.
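
To make the "different queries for similar-sounding questions" point concrete, here is a tiny sketch, assuming a hypothetical data frame homework with columns student_id, week and score:

    library(dplyr)

    # "How are students performing in homework?" -- the overall picture
    overall <- homework %>%
      group_by(student_id) %>%
      summarise(avg_score = mean(score, na.rm = TRUE),
                n_pieces  = n())

    # "How did each student do recently?" -- only the most recent week matters
    latest_week <- homework %>%
      group_by(student_id) %>%
      filter(week == max(week)) %>%
      summarise(latest_avg = mean(score, na.rm = TRUE))

The two questions sound almost the same, but the queries, and any statistics built on top of them, are different.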

This is what I have learned recently. I may not implement all of it in the near future, but it is beneficial to bear these ideas in mind.

2018-04-17

An April of mixed feelings

I had not recharged for a long while. Recently I started the Deep Learning courses on Coursera. The first two weeks only revisit concepts I already know well, although the interview with Geoffrey Hinton touched on a few terms I do not quite understand. For newcomers like us the great man cannot really offer very targeted advice, but his advice on reading the literature resonates with me: do not read too many papers. How do you quantify that? His answer: stop when you feel that every author in the field is turning a blind eye to the same mistake somewhere. That is actually how I used to read the literature in theoretical ecology and systems biology; back then I just did not know what to do after stopping.

Coincidentally, for work I recently read a few papers on Deep Knowledge Tracing, which is also based on Deep Learning, and I felt I had not really grasped the essentials. At this point one thing to do is to study their code on Github; the other is to learn Andrew Ng's fundamentals course properly and then start over and reinvent the wheel independently. The "first principles" idea that has been so popular these last two years does have a basis in reality. Together with Hinton's advice, it gives me a deeper appreciation of the saying that a true master passes on the essence in one sentence, while a false one needs ten thousand books.

My search for a new employer over the past six months has not gone well. I did not expect that, almost two years into this field, I would still face so many restrictions; my career has not broken through the clouds into sunshine the way I imagined. One reason is that I have not blended theory and practice seamlessly in my work; another is that I did not keep learning Python and relied too heavily on R. As for hands-on experience with big data platforms, there is little I can do: the use cases at work are few. A perfectly good Spark cluster is used by us only for raw data processing; we have not even run something as basic as logistic regression on it, which is a real waste. So here is a small goal: try to complete a small machine learning project under the constraint that the data cannot all be loaded into memory.

Also, my daughter is about to turn one. I thank heaven for giving us such a lovely daughter. I wish I never had to leave her for even a day, but for a better life in the future, perhaps short separations are necessary. Many things in this world do not go as we wish; all we can do is work as hard as we can on every possibility we can imagine.

2017-04-30

Gamification of learning is not a surrender to indolence

It is almost a consensus in the e-learning industry that at least some kind of gamification should be included in any online, digitalised or so-called "intelligent" or "smart" learning product or platform. Generally, gamification is recognised as an enhancer of engagement and learning outcomes for students, or for trainees in non-school settings.

Some key elements of gamification are immediate feedback (telling you about your latest achievement right now, rather than a week after the quiz), progress design, randomised rewards (a diversity of medals, badges, cumulative points, etc.), and many others that make the learning experience more fun than pain. They are behaviour-controlling (or, to put it more softly, nudging) methods that have proved effective in improving students' engagement with learning activities, as well as pass rates and grades.

Not surprisingly, gamification in learning is still doubted or resisted by a fair proportion of school teachers, not only because it sounds like playing games, but also due to a deep worry that an essential part of education could be missed: education is not only about learning skill points, passing exams and gaining higher marks. It is also about learning to overcome difficulties and gaining resilience and persistence, which are part of becoming a better self.

As e-learning designers and developers, we must respect this perspective from school teachers. Personally I totally agree that education is not only for better grades, but also for a better self. Is resilience or persistence an essential part of good personal character? Maybe not everyone agrees, but I would accept it with 70-80% confidence.

In fact, if you have ever played games on a computer or smartphone, you cannot forget the hard times when you got stuck at a stage or were defeated by a boss again and again, and the moments when you finally made the breakthrough. You may be in a bad mood for a while, or even get "addicted" to being frustrated by the little devils in the electronic box, but eventually you find your own way to get through, or to get out. Whether that is good or bad depends on how the game is designed to affect you, but either way it is a story from which you gain some experience in coping with difficulties.

Likewise, in gamified learning we don't even need to deliberately design hard times for students; they will have their struggles when they get there. What they need is the right amount of help: not so little that they stop making an effort, and not so much hand-holding that everything becomes too easy. Unlike badly designed games, we don't want to break a student's heart; we want to place good support there. That is one of the areas where we must cooperate closely with school teachers when developing learning products. And again, a really good e-learning product or platform is one that helps and empowers teachers, not one that tries to replace them.

Another issue raised by school teachers is "discipline". Personally I would rather call it "reliability": keeping well behaved in class is discipline, while refraining from interrupting the learning activities is reliability, and ultimately it is reliability that is the desired trait. Can we build the training of reliability into gamified learning as well, or do we just design a tool that helps teachers do that job? These remain open options.

2017-04-27

Data mining is still a human work

The term data mining, as introduced in lectures and the literature, is not so straightforward to many people; the explanations often come across as too geeky or nerdy, depending on how they are told.

Taken literally, "data mining" can be any activity that tries to find useful information in a bunch of data. (That is not to say the term is lexically wrong, but while coal mining is mining for coal and gold mining is mining for gold, data mining is definitely not mining for data; rather, it is mining data for information and insights.)

Suppose we look into a set of shopping data and spot that somebody has suddenly begun to buy more expensive yogurt than before, and has kept doing so for a couple of weeks. We have apparently found something interesting. This person may have just come into some money. Another guess is that they were always putting up with poor-tasting yogurt and have finally found this great treat. We want to know which of the two possibilities is right, so we look further into the shopping records. Aha, this person has also bought a few other new brands of better quality. Now we are fairly sure this person has money to spend, and is quite likely to accept a few big deals we want to offer.

This could be a great example of data mining. No technical skills or knowledge of statistics are really needed; it takes an analytical mind, a good eye for abnormal patterns, and perhaps some attention to alternative explanations. Sometimes people call this 'attention to detail', which in my opinion should refer to something else.

And all of this is still in the realm of postulation and falsification, or, say, a game of guess and prove. It needs a human understanding of human affairs, which machines are not yet good at by themselves, at least not for the moment. Whatever technical infrastructure and skills are applied later, for automation or for scalability, they are built on top of the pioneering human thoughts.
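
Once the human guess has been formed, the spotting itself can of course be automated. Here is a minimal sketch in R, assuming a hypothetical data frame purchases with columns customer_id, week, category and price, and an arbitrary threshold:

    library(dplyr)

    # flag customers whose recent yogurt purchases are clearly pricier
    # than their own history
    recent_cutoff <- max(purchases$week) - 1   # the last two weeks

    upgraders <- purchases %>%
      filter(category == "yogurt") %>%
      group_by(customer_id) %>%
      summarise(past_avg   = mean(price[week <  recent_cutoff]),
                recent_avg = mean(price[week >= recent_cutoff])) %>%
      filter(!is.na(past_avg), recent_avg > 1.5 * past_avg)   # arbitrary threshold

The idea of what to look for, and of the alternative explanations, still comes from the human.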

Machine learning is another term often mentioned together with data mining. Depending on how complex the data mining job is, machine learning can be a core part of it, helping us complete the computation and model fitting. But the machine needs us to feed in good, relevant data and to tell it how to measure the results; without humans to do that, nothing good can be expected. So here is the conclusion: machine learning is a job for machines, but data mining is still human work.

2017-04-26

Criticisms as they are

We may receive lots of criticisms, no matter if we are starting a new business, switching to a new career, developing a new product, proposing a new theory, or making an important choice in our life.

The most helpful ones come from a higher level of knowledge, or from a synthesised range of perspectives. They bring a more comprehensive understanding of what we are attempting, or help us look at things from a higher dimension and be aware of our position within a bigger picture. Without these criticisms we may fail absurdly, at a cost we could have avoided.

The less helpful ones come from seemingly irrelevant angles. The critics hold totally different standards and values, and may be full of prejudice. However, they sometimes turn out to be less irrelevant than we thought, bringing complementary information that we did not realise we needed.

The useless ones come from narrower mindsets and a poverty of knowledge. Usually they are easy to identify, because we know that we know better. However, some of them are so nicely packaged and beautifully presented that we may think they are insightful anyway. If they happen to be encouraging, just take them as flattery and relax; if they are really discouraging, ignore them as much as we can.

The harmful ones are made to be detrimental to you. They look either really inspiring or extremely embarrassing, with a mix of truth and myth that you do not know very well; they appear informative but in fact mislead you and your collaborators. Once you respond to them or defy them, you are pushed in the wrong direction. Be prepared to identify them at an early stage, and do everything to get rid of any impact they impose on you.

However, it is common to find it tricky to distinguish the good criticisms from the bad. There is no need to panic; we just need to learn our lessons, do our research, and upgrade our knowledge, skills and vision, so that we are better equipped to cope with criticism. Life-long learning is still the best solution, not only for this but also for many other challenges in our lives.

2017-04-18

Package to read and write Excel workbook in R: openxlsx or xlsx?

Recently I tried both R packages, xlsx and openxlsx, to read and write Excel workbooks (in xlsx format). Here I list a few things I find useful when choosing between them based on the aim and context. These notes are based on xlsx 0.5.7 and openxlsx 4.0.0.

1. For some reason, xlsx uses more memory and runs slower than openxlsx, perhaps due to its use of the Java runtime. When dealing with data tables bigger than a few MB on a computer with very limited RAM, it is better to use openxlsx.

2. When writing content in languages other than English, special UTF characters sometimes come back improperly encoded from the database. In this case the xlsx package usually gets everything right without your attention when writing to the workbook, whereas with openxlsx you have to parse the UTF content correctly into local memory before it is written. (This issue is unlikely to happen in an OS with a graphical UI. If it does happen, using hex and unhex functions in the SQL queries can solve most problems.)

3. Want to add pictures (jpg or png) to a workbook? Use xlsx.

4. With xlsx, column widths can be automatically set after the data frame is added. In openxlsx, they can only be manually set with explicit values.

5. To read part of a worksheet and then format it as an HTML table, openxlsx is the better option.

6. To apply more than one cell style (font style, background colour, border, etc.) to a few cell ranges in a worksheet, openxlsx is much easier to use, and it runs faster with less memory than xlsx.

7. When none of the above special requirements apply, openxlsx is much handier to install and to use. A minimal sketch of a typical openxlsx write is shown below.
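
Here is a minimal sketch of the openxlsx side (points 4-7 above): writing a styled data frame and reading part of it back. The file name, sheet name, styles and cut-off are just examples.

    library(openxlsx)

    df <- data.frame(student = c("A", "B", "C"), score = c(78, 91, 64))

    wb <- createWorkbook()
    addWorksheet(wb, "scores")

    header_style <- createStyle(textDecoration = "bold", fgFill = "#DDEBF7",
                                border = "bottom")
    low_style    <- createStyle(fontColour = "#9C0006", fgFill = "#FFC7CE")

    writeData(wb, "scores", df, headerStyle = header_style)
    # highlight scores below 70 (+1 offsets the header row)
    addStyle(wb, "scores", low_style,
             rows = which(df$score < 70) + 1, cols = 2, gridExpand = TRUE)

    setColWidths(wb, "scores", cols = 1:2, widths = c(15, 10))  # explicit widths (point 4)
    saveWorkbook(wb, "scores.xlsx", overwrite = TRUE)

    # read back only part of the sheet (point 5)
    scores_back <- read.xlsx("scores.xlsx", sheet = "scores", rows = 1:3)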

2017-04-07

Data science is a new norm of science? Not that new

When I first read Anderson's "The End of Theory" in 2015, it had been seven years since the article was published. To me, it did make some sense by declaring that theories, or even the whole system of science, which is composed of axioms and theorems, may not be so necessary if we can just interpret everything by what data, or data mining, tells us.

To me it was exactly in line with the idea of Mendeleev's periodic table: when we are unsure of what rules are behind the phenomena, something based on empirical observations (or loads of experiment results plus literature) may provide a massive and inclusive table like a dictionary, where we can just look up the answer to whatever question we ask. In the big data era, it is just more feasible to feed the work towards this table with the rich quantity and quality of data. Well, when we really have got this table, why do we still want theories? We don't. This lookup table is more than enough for us.

But at that moment, I was going mad seeking after theories in life science, which I believed would be a key element to make real breakthroughs in lots of areas, from cancer research to induced stem cells. The idea of a lookup table looked not a nice thing to me. At least it was a bit ahead of time.

Interestingly, I had just read another article, "On the Tendencies of Motion", an ironical fake research paper written by a group of real and serious scientists. This is an article wrote in 1981, about 27 years earlier that the last one. To some extent, this article did better than Anderson did in explaining how Data Science is carried out. In a really clumsy manner, using dirty, heterogeneous and inconsistent data, and a really slow "computer" composed of "brothers of the monastic orders, each working an abacus and linked in the appropriate parallel and serial circuits by the abbots", this fake research project nearly re-discovered Newton's 2nd and 3rd laws, approximately.

The fake and clumsy research mocked in that article, is now actually becoming the norm of data science, thanks to the improvement in computational hardware and software, so called big data infrastructures. Why do we give up effort in seeking after theories, causation, rules, or the "truth"? Because we can. The power of enhanced data processing and computation capabilities has made all this possible. That's an argument which makes some sense.

2016-07-20

Analyst and modeler: my new job

After so many days of unemployment I finally got a job I had been dreaming of. Now I am doing mathematical modelling for my new job!!! But I am just not in the mood to celebrate. I simply deserve it, and that is my belief.

Yet I have other challenges to meet. One is to learn to program in Scala. It is such a weird language, but it is good for resetting my brain to be young and fresh again. I will worry less, do my job, and start my real career. From now on.

2016-04-15

Use IPython | Jupyter notebook with Hortonworks to run PySpark

First, just follow the instructions on the official webpage to install the Hortonworks Sandbox VM, the prerequisites, Python 2.7, pip, a few machine learning packages, and Jupyter (the new IPython Notebook).
http://hortonworks.com/hadoop-tutorial/using-ipython-notebook-with-apache-spark/

However, instead of doing

    pip install "ipython[notebook]"

I did

    pip install jupyter

And they should be equivalent.

Then skip the IPython configuration introduced on that page. Instead, install "findspark":

    pip install findspark

Then just run these two lines to start the Jupyter notebook:

    source /opt/rh/python27/enable
    jupyter notebook --port 8889 --notebook-dir='/usr/hdp/current/spark-client/' --ip='*' --no-browser

Alternatively, put the two lines in an .sh file and run that file every time you want to start the notebook. This is similar to what the official webpage instructs, but without creating a profile.

Forward port 8889 from VirtualBox to the host machine as instructed; the VM needs to be restarted before the notebook can be accessed from the host machine.

And every time we open a new notebook and want to write PySpark code, run the following lines first:

    import findspark
    findspark.init('/usr/hdp/current/spark-client')  # point findspark at SPARK_HOME
    import pyspark
    sc = pyspark.SparkContext()  # the SparkContext used by the rest of the notebook

Note that the path "/usr/hdp/current/spark-client" passed to findspark.init(), and also used in the options when starting the notebook, is the default SPARK_HOME path on Hortonworks.

PS. Hortonworks is so much faster and handier than Cloudera and Oracle Big Data Lite!!!!!

2016-03-22

WeekLog: busy and messy, but fruitful

After drowning in two Kaggle competitions for three weeks, I find myself even busier back in the real world: three more projects to go as my coursework.

Having spent most of my time creating, modifying and tuning machine learning models, engineering features in the data sets and fixing bugs in the VMs, I found I could barely speak or listen. Thankfully, I recovered soon after a group meeting with my classmates, talking about how to carry on with the data mining project.

My Kaggle entries didn't go very well: after some hard work and quick learning, I could only rank around 500 and 1000. There is more feature engineering to do and more hidden patterns to discover, but I have to stop now to get ready for the hard work ahead.

Group meetings are really good practice for communication skills. I usually describe everything in an abstract manner, taking it for granted that everybody will understand. That's not always true; I have to give concrete examples to show what I have in mind.

Sometimes it's like teaching. I have to stop once in a while, see what feedback I get, and then adjust the way I speak accordingly. If people understand everything, I just go on; if they get into trouble, I should take responsibility and try different ways to describe my point. Sometimes drawing on a piece of paper helps a lot to show an idea, especially if there is an architecture or workflow in your mind. Metaphors are not a good idea, as people may think you are not serious enough. Always pause and listen for feedback, and try to start again from where people can easily follow.

Of course, I have to learn from others as well. Today a group member gave me a good idea of how MapReduce should be used. I was always struggling with what MapReduce is actually for: not so good for machine learning and not specifically for data storage, so what is its use anyway? Today my mate put it simply: it's just a kind of SQL! That is a perfect way to explain it to me, and I appreciate it so much.

After all, I made great progress: learned the new xgboost method as a tree model and the fancy h2o for deep learning; reinvented and tested a Bayesian way of re-encoding categorical attributes (see the sketch below), though it proved no better than assigning arbitrary integers in this specific use case; installed Spark on Ubuntu and got it working with Hadoop and Anaconda Python 2.7; fixed connection bugs in Eclipse after updating my Hadoop version; and helped a lot of people with coding and bug fixing...
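
For the record, here is a minimal sketch of what I mean by the Bayesian re-encoding: replace each category with the outcome mean of that category, shrunk towards the global mean when the category is rare. The column names and the prior weight are hypothetical.

    # category: a character/factor column; y: the numeric (or 0/1) outcome
    bayes_encode <- function(category, y, prior_weight = 20) {
      global_mean <- mean(y)
      stats <- aggregate(y, by = list(category = category),
                         FUN = function(v) c(n = length(v), m = mean(v)))
      n <- stats$x[, "n"]; m <- stats$x[, "m"]
      # shrink rare categories towards the global mean
      encoded <- (n * m + prior_weight * global_mean) / (n + prior_weight)
      setNames(encoded, stats$category)
    }

    # usage (as.character so we index by name, not by factor level):
    # train$cat_enc <- bayes_encode(train$cat, train$label)[as.character(train$cat)]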

The days have gone quickly, and I am feeling more like a data scientist than ever before, especially when even my 'technical guy' needs me to explain the algorithms to him. I take it as encouragement.

Next we will build ensemble models for the data mining project. Specifically, I want to do stacked generalisation. It will be an interesting one, with a lot of literature to read. I am on the path I have finally chosen. Just keep going.


2016-03-03

A work-around to cope with count numbers when the pre-built package has no support for Poisson regression

Here is a work-around for dealing with count numbers with pre-built regression / binary classification models, and how to calculate both RMSE and RMSLE in each case.

Notation:

    y and yval: the values of the label (outcome) column for the training and validation sets
    y_new and yval_new: the values of the converted label column for the training and validation sets
    pred: the predicted values returned by the pre-built model (C5.0, rpart, neuralnet, etc.), which can be compared with y_new
    pred_O: the values converted back from pred, which can be compared with y

1. If using linear regression:

Convert the label:

                  y_new = log(y + 10^(-15))    for any y == 0
                  y_new = log(y)               for any y > 0

and use y_new as the outcome of an ordinary linear regression; this effectively approximates a Poisson (log-link) regression.

The modelling code may look like

            model = lm(y_new ~ x1 + x2, data = mydata)

To predict,

            pred = predict(model, validation_data)
            pred_O = exp(pred)

To measure the performance of the prediction,

        RMSE:
            RMSE_val = sqrt(sum((pred_O - yval) ^ 2) / length(yval))
        RMSLE:
            RMSLE_val = sqrt(sum((pred - yval_new) ^ 2) / length(yval))

(Strictly speaking, RMSLE is defined with log(y + 1); using yval_new = log(yval) here is an approximation that is fine when the counts are not tiny.)

2. If using logistic regression for count numbers:

Convert the label:

                  y_new = y / (y + 1)

and use y_new for a logistic regression, provided the model allows the label values to be non-integers between 0 and 1. (R's glm may warn about non-integer responses; for this work-around the warning can be ignored.)

The modelling code may look like

            model = glm(y_new ~ x1 + x2, data = mydata, family = binomial())

To predict,

        pred = predict(model, validation_data, type = "response")
        pred_O = pred / (1 - pred)

To measure the performance of the prediction,

        RMSE_val = sqrt(sum((pred_O - yval) ^ 2) / length(yval))

        RMSLE_val = sqrt(sum((log(pred_O + 1) - log(yval + 1)) ^ 2) / length(yval))

                  = sqrt(sum((log(1 / (1 - pred)) - log(yval + 1)) ^ 2) / length(yval))
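
Finally, here is a self-contained sketch of work-around 1 on simulated data, so the conversion and the two measures can be checked end to end. The data, split and column names are made up for illustration, and the simulated counts are kept at 1 or above so the log conversion stays tame.

    set.seed(42)
    n  <- 500
    x1 <- rnorm(n); x2 <- rnorm(n)
    y  <- 1 + rpois(n, lambda = exp(0.8 * x1 - 0.3 * x2))   # simulated counts

    train <- 1:400
    mydata          <- data.frame(x1 = x1[train], x2 = x2[train])
    validation_data <- data.frame(x1 = x1[-train], x2 = x2[-train])
    yval <- y[-train]

    # the conversion described above (10^-15 keeps any zero counts finite)
    to_log <- function(v) log(v + ifelse(v == 0, 1e-15, 0))
    mydata$y_new <- to_log(y[train])
    yval_new     <- to_log(yval)

    model  <- lm(y_new ~ x1 + x2, data = mydata)
    pred   <- predict(model, validation_data)
    pred_O <- exp(pred)

    RMSE_val  <- sqrt(sum((pred_O - yval) ^ 2) / length(yval))
    RMSLE_val <- sqrt(sum((pred - yval_new) ^ 2) / length(yval))
    c(RMSE = RMSE_val, RMSLE = RMSLE_val)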