2016-02-09

VT-x unavailable for VirtualBox? Blame Avast

Recently, after I updated VirtualBox from 5.0.10 to 5.0.12 and then 5.0.14, none of my VMs would start; each failed with an error message saying VT-x was unavailable. It took me half a day to realize that Avast was the culprit.

However, I couldn't find any way to make Avast allow VirtualBox to use hardware VT-x.

I found a workaround here (but I haven't tried it yet):
"Open Avast, go to settings-> Troubleshooting and uncheck “Enable hardware-assisted virtualization” then restart your PC"

2015-12-25

Notes: benchmarking MongoDB with YCSB - need more RAM

With YCSB, each record is 1KB by default. In MongoDB, however, each record takes slightly more than 2KB of storage.

As a result, a database/collection with 1 million records/documents will occupy more than 2GB.

When loading these records, or running 1 million operations on them with a uniform request distribution, the machine needs much more than 4GB of RAM.

In my tests, 6GB of RAM maintained reasonable performance, while 4GB led to a collapse in performance, especially for scan operations.
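
The storage figure above is simple arithmetic. A rough sketch in Python, assuming the roughly 2KB-per-record footprint noted above (the function name and the use of decimal GB are my own choices, not anything from YCSB or MongoDB):

```python
def estimate_storage_gb(record_count, bytes_per_record=2 * 1024):
    """Rough storage estimate in GB (10**9 bytes) for a YCSB-loaded collection.

    bytes_per_record defaults to 2KB, the approximate per-record footprint
    observed in MongoDB for YCSB's default 1KB records (an approximation).
    """
    return record_count * bytes_per_record / 1e9


size_gb = estimate_storage_gb(1_000_000)
print(f"1 million records: about {size_gb:.2f} GB")
```

This only estimates data size; how much RAM is needed on top of that depends on the workload and has to be measured.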

I used MongoDB 3.0.7 and 3.2.0 with YCSB 0.5.0.

2015-12-03

Notes: Preparation for benchmarking Cassandra 2.2.3 with YCSB 0.5.0 on Ubuntu 14.04 64bit

# To install Cassandra on Ubuntu, use the method from
# https://www.digitalocean.com/community/tutorials/how-to-install-cassandra-and-run-a-single-node-cluster-on-ubuntu-14-04

# Note: YCSB 0.5.0 does not yet support the latest versions of Cassandra (3.0+), so install Cassandra 2.2.3.

# To check that it is running as a service:
sudo service cassandra status

# To start and stop the service
sudo service cassandra start
sudo service cassandra stop

# Check the node with nodetool:
nodetool status
# You may also pass the host and the JMX port (7199 by default; 9042 is the CQL native port, not the JMX port):
nodetool -h 127.0.0.1 -p 7199 status

# To operate databases, use the cql shell in terminal:
cqlsh

# If you run cassandra in the foreground while logged in as "hduser", you need to change the ownership of these folders to the current user (not necessary otherwise):
# sudo chown -R hduser:hadoop /var/log/cassandra
# sudo chown -R hduser:hadoop /var/lib/cassandra

# When running cassandra as a service with sudo, the ownership of the above folders
# should be changed back to "cassandra":
# sudo chown -R cassandra /var/log/cassandra
# sudo chown -R cassandra /var/lib/cassandra

# Before using YCSB, open cqlsh and create the keyspace and table in Cassandra as below:

CREATE KEYSPACE ycsb WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 1 };
USE ycsb;
CREATE TABLE usertable (
    y_id varchar primary key,
    field0 varchar,
    field1 varchar,
    field2 varchar,
    field3 varchar,
    field4 varchar,
    field5 varchar,
    field6 varchar,
    field7 varchar,
    field8 varchar,
    field9 varchar);

# Now Cassandra has a keyspace and table ready to accept YCSB workloads.

# Some parts of the workloads failed due to write timeouts. To overcome this issue:
# Find the "cassandra.yaml" file (if you installed Cassandra with sudo apt-get, it should be under /etc/cassandra).
# Find the timeout settings in the file, such as:
write_request_timeout_in_ms: 2000
# Change this value and the other similar settings to somewhere between 30000 and 60000, so that most timeouts disappear.
# Restart the cassandra service (sudo service cassandra restart) or reboot Ubuntu, then check that the service is running by typing in the terminal:
sudo service cassandra status
# If it is not running, check the repair method from the website linked above.
# Cassandra is now all set.
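
The timeout edit described above can also be scripted. This is only a sketch in Python: the helper name `raise_timeouts` is hypothetical, the sample text is a small illustrative fragment rather than a full cassandra.yaml, and the script works on a string, so reading and writing the real file under /etc/cassandra (which needs sudo) is left out.

```python
import re

# Illustrative fragment only; a real cassandra.yaml has many more settings.
SAMPLE_YAML = """\
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
truncate_request_timeout_in_ms: 60000
"""

# Matches uncommented lines such as "write_request_timeout_in_ms: 2000".
TIMEOUT_LINE = re.compile(
    r"^(\w*request_timeout_in_ms):[ \t]*(\d+)[ \t]*$", re.MULTILINE
)


def raise_timeouts(yaml_text, new_ms=30000):
    """Raise every *request_timeout_in_ms value below new_ms up to new_ms.

    Values that are already higher than new_ms are left unchanged.
    """
    def bump(match):
        name, old_ms = match.group(1), int(match.group(2))
        return f"{name}: {max(old_ms, new_ms)}"

    return TIMEOUT_LINE.sub(bump, yaml_text)


print(raise_timeouts(SAMPLE_YAML))
```

Commented-out settings (lines starting with `#`) are deliberately not touched, so the result should still be checked by eye before restarting Cassandra.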

# For YCSB:
# Copy the following two files into your "ycsb-0.5.0/cassandra2-binding/lib" directory (downloaded from http://www.slf4j.org/download.html):
slf4j-api-1.7.13.jar
slf4j-simple-1.7.13.jar

# Remove or rename the following file in the same directory:
slf4j-api-1.6.4.jar

# Then off you go!!

# Example:
# Under the ycsb-0.5.0 directory, run in the terminal:

./bin/ycsb load cassandra2-cql -P workloads/workloada -p hosts=localhost
./bin/ycsb run cassandra2-cql -P workloads/workloada -p hosts=localhost

# After running workloads, clear the usertable:
# In the terminal:
cqlsh
# In the cql shell:
use ycsb;
truncate usertable;
exit;

# So everything is done.

2015-11-20

Some comments about artificial neural networks

I just want to share some of my understanding from the Machine Learning course (Coursera, Stanford University, taught by Andrew Ng). I finished this course in early 2015 and have become very interested in neural networks (NN). It really changed my thinking, in that one non-linear, complex system can "explain" another arbitrarily complex system.

When I talk about NN with my academic friends, they mostly worry about overfitting and the inability to find the "global optimum". Their second worry is the required number of training samples, a concern my friends in industry share. At this moment, I would not worry about the first issue.

By nature, NN is meant to solve problems that human recognition is good at but linear computer algorithms are not, and indeed NN is like humans. Humans have their prejudices, understand the same thing from many different perspectives, and do the same thing in different (sometimes wrong) ways. NN has adopted all of these, in the form of overfitting and the existence of multiple local optima.

Take this plain example: many of us walk every day. We move our feet, legs, backbones and arms in very different ways, yet we all walk (almost) equally well! We accept these multiple local optima and don't care about a "globally optimal" way of walking.

Our life experience at an early age strongly affects how we behave in adulthood, sometimes causing misalignment when we cope with new situations. This is a typical "overfitting" phenomenon. However, most of us don't really think of this as a problem to solve. We accept it as a fact of life, if not the beauty of life. And yes, we keep learning from new experience as we grow up, using new "training samples" to reshape our ways of doing things, which is probably a practical solution.

As another simple example of "overfitting", just think how many times you have seen a "face" in some object in your house.

However, I do care about the second issue, the sample size. Complex questions require a large number of training samples, but we always worry about how big is big enough. In practice we want the training samples to cover "all different situations", but we often don't really know whether this requirement is fully met.

And there is a dilemma here: when the training samples are indeed "enough", we may not need to train a model at all. If the training samples already cover all possible situations, all we have to do is use them as a lookup table for new data. Too little training data is useless, while too much training data makes the model pointless. In practice we have to face this question.

Looking at the first issue from another angle, NN has opened a door to re-recognizing the world. There may be more than one "correct way" to explain the things in this world. Our previous ways of optimizing procedures, maximizing profit, doing things, and even explaining physical, chemical and biological rules, may only be some "local optima". NN may provide totally different ways of doing these things, which may be equally good or even better, though sometimes strange (like Google's "Deep Dream").


2015-09-04

Notes: risk management in corporate financial decision making

Futures/forward contract

Spot price: real time market price

Call/put options:

Allow the option to expire/lapse

When the risk is price rise:
buy the right to buy = buy a call = long call
sell the right to buy = sell a call = short call

When the risk is price fall:
buy the right to sell = buy a put = long put
sell the right to sell = sell a put = short put

Exercise price:  the price at which to set up a cap or floor

In-the-money; at-the-money; out-of-the-money

Six factors impacting the value of call/put options:
Four of them impact the intrinsic value (effect on the call value; effect on the put value):
1. Exercise price: negative; positive
2. Asset value: positive; negative
3. Interest rate: positive; negative
4. Dividends (e.g. if the traded underlying asset is a share): negative; positive

Two of them impact the time value:
5. Volatility
6. Expiry time
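
The direction of the first two factors can be checked with a small Python sketch. It uses the standard immediate-exercise definition of intrinsic value, which involves only the asset price and the exercise price; interest rate, dividends, volatility and expiry time are left out, and the function names and prices are made up for illustration.

```python
def call_intrinsic(asset_price, exercise_price):
    """Value of exercising a call right now: buy at the exercise price."""
    return max(asset_price - exercise_price, 0.0)


def put_intrinsic(asset_price, exercise_price):
    """Value of exercising a put right now: sell at the exercise price."""
    return max(exercise_price - asset_price, 0.0)


def moneyness(asset_price, exercise_price, kind="call"):
    """Classify an option as in-, at- or out-of-the-money."""
    if asset_price == exercise_price:
        return "at-the-money"
    if kind == "call":
        in_the_money = asset_price > exercise_price
    else:
        in_the_money = asset_price < exercise_price
    return "in-the-money" if in_the_money else "out-of-the-money"


# Illustrative prices only.
for exercise_price in (100, 105):
    print(
        exercise_price,
        call_intrinsic(110, exercise_price),
        put_intrinsic(110, exercise_price),
        moneyness(110, exercise_price, "call"),
        moneyness(110, exercise_price, "put"),
    )
```

With the asset at 110, raising the exercise price from 100 to 105 lowers the call's intrinsic value from 10 to 5, matching "negative to the call value" in factor 1.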

European / American style options: a European option can be exercised only on the expiry date, while an American option can also be exercised before it.






2015-08-03

Notes from the MITx Big Data course: NoSQL and NewSQL

NoSQL and NewSQL

Prof. Samuel Madden

1. Traditional DBMS features:

1) Record oriented (row store);
2) Schema (carefully structured);
3) Use SQL;
4) Do transactions (ACID semantics)

2. For high throughput, high velocity data

1) Failures in a single node may not be a problem; it is OK to 'failover', so 100% consistency is not required;
2) Some data are not relational, e.g. JSON;
3) A strict data schema may not be necessary in the face of such diverse data sources;

Definition: Data model: The way data is stored/represented in a database system.

3. NoSQL

NoSQL is non-relational. It's either a Key-Value store or a document store.
1) Key-Value: can only put (write) and get (read);
2) Document: like key-value, but with values as dictionaries or XML/json documents.
No ACID (limited atomicity and consistency), no schema.
Easy to understand, implement and program; easy to distribute across nodes;

Eventual consistency + majority read/write protocol to cope with replica failures. Note that the majority read protocol includes the 'most recent' rule.
Without the majority read/write protocol, the eventual consistency strategy may give wrong answers.
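
A toy Python sketch of the majority read/write idea follows. It is my own illustration, not code from the course: the class `QuorumStore`, the logical clock and the replica ids are all hypothetical. Each write goes to a majority of replicas with a timestamp, and each read asks a majority and keeps the most recent version, so any read set overlaps any write set in at least one replica.

```python
class QuorumStore:
    """Toy replicated key-value store using majority reads and writes."""

    def __init__(self, replica_count=5):
        self.replicas = [dict() for _ in range(replica_count)]
        self.majority = replica_count // 2 + 1
        self.clock = 0  # logical timestamp, increased on every write

    def write(self, key, value, replica_ids):
        """Store (timestamp, value) on the given replicas; needs a majority."""
        targets = set(replica_ids)
        if len(targets) < self.majority:
            raise ValueError("write needs a majority of replicas")
        self.clock += 1
        for replica_id in targets:
            self.replicas[replica_id][key] = (self.clock, value)

    def read(self, key, replica_ids):
        """Ask a majority of replicas and return the most recent value."""
        targets = set(replica_ids)
        if len(targets) < self.majority:
            raise ValueError("read needs a majority of replicas")
        versions = [
            self.replicas[replica_id][key]
            for replica_id in targets
            if key in self.replicas[replica_id]
        ]
        if not versions:
            return None
        # The 'most recent' rule: keep the version with the highest timestamp.
        return max(versions, key=lambda version: version[0])[1]


store = QuorumStore(replica_count=5)
store.write("x", "old", [0, 1, 2])
store.write("x", "new", [2, 3, 4])  # replicas 0 and 1 are now stale

print(store.read("x", [0, 1, 2]))   # majority read, overlaps the last write
print(store.replicas[0]["x"][1])    # asking a single stale replica instead
```

The last line shows the failure mode mentioned above: a single replica can still hold the old value until it catches up.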

4. NewSQL: H-Store (by MIT)
Provides ACID and SQL, but with throughput as high as NoSQL.
Reduces caching, logging/recovery and locking time to boost speed.
Column store, in-memory, data partitioning.
H-Store does serial execution instead of concurrent execution, thus avoiding locking.
H-Store does compact logging to avoid heavy-weight recovery.

Partition - single thread in each partition
- separated procedures can be expressed using SQL, but are predeclared when setting up the customized H-Store system, rather than arbitrarily composed by users.
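
The partition idea can be sketched in Python as follows. This is a loose illustration under my own assumptions, not H-Store's actual design or API: the classes, the two procedures and the modulo routing are hypothetical, and the partitions are run one after another here instead of each on its own thread.

```python
from collections import deque


def deposit(balances, account_id, amount):
    balances[account_id] = balances.get(account_id, 0) + amount


def withdraw(balances, account_id, amount):
    # Only withdraw if the balance is sufficient.
    if balances.get(account_id, 0) >= amount:
        balances[account_id] -= amount


# Procedures are declared up front, not composed ad hoc by users.
PROCEDURES = {"deposit": deposit, "withdraw": withdraw}


class Partition:
    """Owns a slice of the data and runs its queue serially, so no locks."""

    def __init__(self):
        self.balances = {}
        self.queue = deque()

    def run_all(self):
        while self.queue:
            name, args = self.queue.popleft()
            PROCEDURES[name](self.balances, *args)


class PartitionedStore:
    def __init__(self, partition_count=2):
        self.partitions = [Partition() for _ in range(partition_count)]

    def submit(self, name, account_id, amount):
        if name not in PROCEDURES:
            raise KeyError(f"unknown procedure: {name}")
        # Route each transaction to the partition that owns the account.
        partition = self.partitions[account_id % len(self.partitions)]
        partition.queue.append((name, (account_id, amount)))

    def run_all(self):
        for partition in self.partitions:
            partition.run_all()


bank = PartitionedStore(partition_count=2)
bank.submit("deposit", 1, 100)
bank.submit("withdraw", 1, 30)
bank.submit("deposit", 2, 50)
bank.run_all()
print([partition.balances for partition in bank.partitions])
```

Because every transaction touches a single account, each one lands on exactly one partition, which is the kind of workload the notes call easy to partition.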

H-Store can run 25 times as many transactions as traditional DBMSs.

OLTP workloads are mostly easy to partition.

5. Summary

New DBMSs should support new data models (key-value, documents), high availability (use replica for high-throughput data and ensure consistency),

Notes from the MITx Big Data course: new DBMSs

Trends in Database management systems: in-memory column format.

Basic knowledge:
ACID: atomicity, consistency, isolation, durability

1. Column store
Column store is 50-100 times faster than row store.
The reasons are
1) when you read something with a query, you may want only a few columns, but a row-based store will give you all the columns of each row, most of which you don't want;
2) columns are easy to encode and compress;
3) there are no headers for each record;
4) column-wise picking by an executor is faster because of vector processing.
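
Reasons 1) and 2) can be illustrated with plain Python lists. This is a toy sketch with made-up data; real column stores use far more elaborate encodings, and the helper names are my own.

```python
# Row store: each record keeps all of its columns together.
rows = [
    {"id": 1, "country": "NZ", "amount": 10},
    {"id": 2, "country": "NZ", "amount": 20},
    {"id": 3, "country": "NZ", "amount": 5},
    {"id": 4, "country": "AU", "amount": 7},
]

# Column store: the same data, with one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_from_rows(rows):
    """Has to visit every full record just to read one field."""
    return sum(row["amount"] for row in rows)


def total_from_columns(columns):
    """Reads only the one column the query needs."""
    return sum(columns["amount"])


def run_length_encode(values):
    """Compress runs of repeated values into (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return [tuple(pair) for pair in encoded]


print(total_from_rows(rows), total_from_columns(columns))
print(run_length_encode(columns["country"]))
```

Values within one column tend to repeat, which is why a simple scheme like run-length encoding already shrinks the "country" column here.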

Native column store systems:
HP/Vertica, SAP/Hana, Paraccel(Amazon), SAP/Sybase/IQ
Native row store systems:
MS, Oracle (latest Oracle db 12c is doing column store as an option), DB2, Netezza
In transition:
Teradata, Asterdata, Greenplum

Example with Vertica:
read some records into main memory and present them in row-store form;
Once changes are made, rotate the rows into columns, encode them and write onto disk.
Paraccel and Hana are similar.

2. Ways to reduce computation cost (time)

1) To reduce time for isolation:
locking: isolate rows being edited from other editors
latching: isolate data being processed in one thread from other threads
New ways to avoid locking while dealing with isolation:
concurrency control:
  MVCC (NuoDB, Hekaton)
  Time stamp order (H-Store, VoltDB)
  Lightweight combination of time stamp order and dynamic locking (Calvin, Dora)
Use single-thread systems.

2) To reduce memory loading time:

Transaction processing databases are not as big as data warehouses. They can fit well in main memory (buffer pool).
1TB of memory is already on the market ($30k).
Anti-caching is going to be popular.

3) To reduce logging time:
i. do command logging (logical logging) instead of data changelog (dynamic logging).
ii. with replicas in big data structures, there is no need for logging and recovery for plain failures. You just 'failover' and that's fine.
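
A small Python sketch of the command logging idea in i.: the log stores the commands themselves, and recovery replays them in order. The command names and the in-memory dictionary are hypothetical, and this only works because every command is deterministic.

```python
def apply_command(state, command):
    """Apply one logged command to the state (a dict of account balances)."""
    name, account_id, amount = command
    if name == "deposit":
        state[account_id] = state.get(account_id, 0) + amount
    elif name == "withdraw":
        state[account_id] = state.get(account_id, 0) - amount
    else:
        raise ValueError(f"unknown command: {name}")


def recover(command_log):
    """Rebuild the state from scratch by replaying the log in order."""
    state = {}
    for command in command_log:
        apply_command(state, command)
    return state


live_state = {}
command_log = []
for command in [("deposit", "a", 100), ("withdraw", "a", 30), ("deposit", "b", 5)]:
    command_log.append(command)  # log the command, not the changed data
    apply_command(live_state, command)

print(recover(command_log) == live_state)
```

Each log entry is one short tuple regardless of how much data the command changes, which is where the time saving comes from.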

Recommended new OLTP (Online transaction processing) systems: MS Hekaton, SAP Hana, VoltDB, MemSQL, SQLFire,... They are 100 times faster than old systems.
Notes from outside the course: Oracle 12c has dual row and column based formats, both can do in-memory OLTP. The column format uses memory only while the row format can be stored on disk.

3. NoSQL (MongoDB, Cassandra, ...): comments from Professor Mike Stonebraker

1) NoSQL advocates giving up ACID, which is not good. Actually, NoSQL systems are now pursuing better ACID support.
2) NoSQL advocates 'schema later', which is possible in some cases but not really good practice.

4. Array database systems (such as SciDB)
Good for complex analyses requiring matrix computation. Its prospects depend on how much demand there is for such analyses in the market.
Examples include data mining that looks into the covariance between pairs of stock prices over time.
SciDB provides array SQL.

5. Graph database (Neo4j, ...)
Also good to do OLTP.

6. Hadoop, Hive, Pig
MapReduce is only useful for highly parallel computations.
Hive (Facebook) and Pig (Yahoo) are top layers providing SQL-like queries and manipulations, which are more likely to stay active in the market.
HDFS, the bottom level file system, is dying.
Facts: Facebook runs ~2500 nodes of MapReduce.
A normal SQL aggregate running on Hadoop is 100x slower than on a column store DBMS.
A matrix computation analysis on Hadoop is 100x slower than on an array based DBMS.

Cloudera Impala (an execution engine implementing Hive) is becoming more like SAP Hana or Vertica.

Hortonworks and Facebook are doing the same thing, developing Hive-interfaced data warehousing systems without MapReduce.
Other data warehousing systems may also well have supported Hive, however.

Notes from others:
Google Dremel also runs SQL-like queries over very large data, though on its own execution engine rather than on MapReduce. Systems like it usually do very simple queries and don't need fault tolerance.
Google Tenzing, in contrast, is designed for complex queries on top of MapReduce clusters.
Apache Spark can run on the same Hadoop clusters as MapReduce but keeps data in memory.
Shark implements an in-memory, column-oriented store on top of Spark (Hive on Spark), so we can use SQL with Spark.
These new systems can partition data in memory. This makes it easy to run transactions on each partition separately and simultaneously, thus avoiding concurrency control.

Conclusion:
A. Data warehouse is going to be a column store market;
B. OLTP will be a main memory market;
C. Array based and Graph DBMSs may get traction;
D. NoSQL currently is not good if you want ACID.

2015-07-30

A nice experience with Scalable Machine Learning

Just when I had learned the basics of MapReduce and was about to start learning Hadoop, this course came my way and caught my eye.

Apache Spark is said to be tens to hundreds of times more efficient than Hadoop at highly parallelized computations, mainly because it keeps data in main memory rather than on disk, cutting the time spent on disk reads and writes, which is a major source of computation cost. To my understanding, another benefit is that it allows multiple map and reduce steps in a more complex procedure, by caching data and intermediate variables for repeated use.

The IPython notebook, integrated in a VM installed with Vagrant and VirtualBox, was used throughout the course. My basic knowledge of Python was quite sufficient for the weekly hands-on coding work, though of course with further self-learning through online resources.

Unlike many other online courses, the videos contain no instructions about the procedures and code to be used in the homework. I had to work out mainly by myself how the problems should be solved and what program modules should be coded. This took me a lot of effort, and actually gave me lots of fun when I finally solved each problem.

Even with some experience of Python, I didn't find the IPython notebook very convenient. The pro is that it shows the result right below each block of code after you run it; the con is that you quickly forget which variables you created a few moments ago and have to scroll up a lot to find them. I would prefer to use the Qt console in future. Consoles like RStudio, Matlab or the Octave GUI are much handier, because their side panels keep track of what you have in the workspace.

The good news is that the most recent version of Spark has better support for R. To me, numpy and pandas are not as straightforward as R for array/matrix computation and data frame operations. I will be very interested in trying R on Spark very soon.

After this course, I plan to practice more of decision tree, random forest, svm, neural network, naive Bayes, time series and MCMC across R, Python on and off Spark, to get a more complete profile in machine learning.

Notes of Scalable Machine learning - PCA

 1. Explore the data to get a rough view of the distribution. Plot the data or calculate the covariance matrix to see whether the variables are correlated. If they are all nearly uncorrelated, PCA is unnecessary.

 2. Used libraries and functions:
numpy.linalg.eigh
numpy.argsort
* Note the use of np.dot() and np.outer() on numpy arrays when they are matrices
numpy.kron(matrix1, matrix2)
numpy.eye(n)
numpy.tile(matrix, (dimension1, dimension2,...))

3. The workflow of multiple map-reduce processes to solve big n and big d problems. I need to review the videos on this topic several times to keep in mind how it works.
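
The numpy functions listed in item 2 fit together roughly like this for a small, single-machine PCA. This is my own minimal sketch, not the course's solution; the function `pca` and the toy data are hypothetical, and the distributed big n / big d versions from item 3 are not covered.

```python
import numpy as np


def pca(data, k):
    """Return the top-k principal components, the scores and all eigenvalues.

    data is an (n, d) array with one observation per row.
    """
    centered = data - data.mean(axis=0)
    # Sample covariance matrix, shape (d, d).
    covariance = centered.T.dot(centered) / data.shape[0]
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]  # indices, largest eigenvalue first
    components = eigenvectors[:, order[:k]]
    scores = centered.dot(components)
    return components, scores, eigenvalues[order]


# Hypothetical toy data: the second variable is roughly twice the first.
rng = np.random.RandomState(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

components, scores, eigenvalues = pca(data, k=1)
explained = eigenvalues[0] / eigenvalues.sum()
print(components.ravel(), round(float(explained), 4))
```

Because the two toy variables are strongly correlated, almost all of the variance falls on the first component, which is the situation item 1 says to look for before using PCA.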

Course: Scalable Machine Learning
Software: Apache Spark
Platform: Spark Vagrant
Language: Python
IDE: IPython notebook
MOOC provider: BerkeleyX (edX)


2015-07-19

Learning data analysis

I am studying to be a data analyst for the industrial world.

Coming out of academia, I was first shocked by the fact that people won't assume you know things like linear regression, classification, clustering and PCA well; you have to prove that you are proficient in them.

This is especially so when all your past degrees are in biology, rather than in disciplines popularly thought of as 'numeric' or 'technical', such as maths, physics or statistics. If I ever got the chance to be asked, I could only offer my transcripts to show that I got high marks in maths, statistics and physics in my undergraduate study, and even in mathematical ecology (actually quantitative ecology) in my master's study.

That's why I am going to pursue a degree in statistics, or more directly in data analytics, to fill this gap.

Of course I still have to learn more skills by myself to be better equipped for a career change. I have to use R more proficiently for industrial use, writing more efficient code and producing more professional illustrations. I have to improve my Python skills to match my R, and also to work with advanced platforms like Hadoop and Spark for large-scale data analysis. I have to learn not only SQL but also NoSQL and NewSQL for data ETL.

I also need knowledge of operations, supply chain and finance. These are the fields where data analytics is most deployed.

Learning can be endless, and I should pause at some stage and make use of what I have learned, ideally by starting a small project that solves a realistic problem.

This is just a record of what I have been doing for the past two years. Too much thinking is a no-no at this time; it has halted me too often. Take action and don't hesitate. Things will work out when you get close.

2015-06-05

The two children problem: language ambiguity

I happened to see this question in a forum and found it very interesting.

You know your friend has two children, and you know one of them is a girl. Suppose the probability of each sex for each child is simply 1/2 (i.e. ignoring biological and social details). What is the probability that both children are girls?

There is actually an explanation of this question on Wikipedia (Boy or girl paradox). The answer really depends on how you understand the words "you know one of them is a girl".

Interpretation 1: 

Suppose we call the two children A and B as we don't know them, then the interpretation is "either you know A is a girl, or you know B is a girl". This is equivalent to saying that "you happened to see one of them and thus knew that it is a girl, but you don't know if it was A or B you saw".

From this interpretation, we may give the answer 1/2, i.e. the probability that "both children are girls" == "the other child is also a girl" is 1/2.

Interpretation 2: 

Still suppose we call the two children A and B, but this is unimportant for this interpretation. The interpretation is that we know nothing but "at least one of A and B is a girl". 

From this interpretation, we may give the answer 1/3, i.e. the probability that "both children are girls" is 1/3.

Using Bayes' theorem we can also clearly see that each interpretation makes sense in its own right.
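
Both answers can be checked by brute-force enumeration. A short Python sketch, assuming each child is independently a girl or a boy with probability 1/2 and, for interpretation 1, that the child you happen to see is equally likely to be A or B (the variable names are my own):

```python
from fractions import Fraction
from itertools import product

# All equally likely families, written as (child A, child B).
families = list(product("GB", repeat=2))

# Interpretation 2: all we know is "at least one of A and B is a girl".
at_least_one_girl = [family for family in families if "G" in family]
p_interpretation_2 = Fraction(
    sum(family == ("G", "G") for family in at_least_one_girl),
    len(at_least_one_girl),
)

# Interpretation 1: you see one child at random (index 0 = A, 1 = B)
# and that child turns out to be a girl.
observations = [(family, seen) for family in families for seen in (0, 1)]
girl_seen = [(family, seen) for family, seen in observations if family[seen] == "G"]
p_interpretation_1 = Fraction(
    sum(family == ("G", "G") for family, _ in girl_seen),
    len(girl_seen),
)

print("Interpretation 1:", p_interpretation_1)
print("Interpretation 2:", p_interpretation_2)
```

The difference comes from the conditioning event: interpretation 2 keeps three of the four families, while interpretation 1 keeps four of the eight (family, child seen) pairs, two of which belong to the two-girl family.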

From Wikipedia:
Martin Gardner, the original author of this problem, later acknowledged that the second interpretation was ambiguous, because you must clarify how people came to know that "at least one of A and B is a girl", and the way they came to know it is likely to be the same as in the first interpretation.

However, there are realistic ways to arrive at the second interpretation. For example, you may spot your friend buying girls' toys and get confirmation that they are for their own kids, or you may have heard your friend once say "Thank god I don't have two boys".

Then how do we decide whether to use interpretation 1 or 2? It really depends on how the question is phrased. There are a few ways to distinguish them if you are asked a similar question:

1. Whether it is the family (parent) or one of the children that is randomly selected and observed. The answer is 1/3 for the family and 1/2 for the child.

2. Whether or not there is "identification" of the two children, no matter if it is implicit or explicit, ordered or unordered. Usually if the phrase "at least one of them" appears, we know there is no identification and the answer should be 1/3. Otherwise the answer will be 1/2.

3. Whatever your examiner/boss/supervisor decides:P