2016-03-03

A work-around to cope with count numbers, when the pre-built package has no support for Poisson regression

##### Here is a work-around for modelling count data with pre-built regression / binary classification models,
##### and how to calculate both RMSE and RMSLE for each approach.

### Notation:
# y and yval: values in the label (outcome) column for the training and validation sets

# y_new and yval_new: values in the converted label column for the training and validation sets

# pred: the predicted values returned by the pre-built model (C5.0, rpart, neuralnet, etc.), which are on the converted scale and can be compared with yval_new

# pred_O: the values of "pred" converted back to the original count scale, which can be compared with yval

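1. If using linear regression:
Convert the label:
                  y_new = log(y + 10^(-15))       for any y == 0; and
                  y_new = log(y)       for any y greater than 0
and use y_new as the label for linear regression (or any similar regression model). This fits a log-linear model, which approximates Poisson regression with a log link but is not identical to it. Note that log(10^(-15)) is about -34.5, so zero counts become extreme values in y_new.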
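As a minimal sketch of the conversion above in R (the helper name `to_log_label` and the sample counts are hypothetical, used only for illustration):

```r
# Hypothetical helper: log-transform counts, replacing log(0) with log(10^(-15))
to_log_label <- function(y, eps = 10^(-15)) {
  ifelse(y == 0, log(y + eps), log(y))
}

y <- c(0, 1, 3, 10)        # made-up counts
y_new <- to_log_label(y)
print(y_new)               # the first entry is log(10^(-15)), roughly -34.5
```

Applying exp() to y_new recovers the original counts, up to the tiny offset used for zeros.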

The modelling code may look like
            model = lm(y_new ~ x1 + x2, data = mydata)

To predict,
            pred = predict(model, validation_data)
            pred_O = exp(pred)
           
To measure the performance of the prediction,
        RMSE:
            RMSE_val = sqrt(sum((pred_O - yval) ^ 2) / length(yval))
        RMSLE:
            RMSLE_val = sqrt(sum((pred - yval_new) ^ 2) / length(yval))
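Putting the steps for the linear regression approach together, here is a rough end-to-end sketch. The data are simulated, and `make_data` and `convert` are hypothetical helpers that are not part of the original post; only `lm`, `predict`, and the formulas come from the text above.

```r
set.seed(1)  # simulated data, purely illustrative

# Hypothetical data generator: counts whose mean depends on x1 and x2
make_data <- function(n) {
  x1 <- runif(n)
  x2 <- runif(n)
  y  <- rpois(n, lambda = exp(0.5 + x1 - 0.5 * x2))
  data.frame(x1 = x1, x2 = x2, y = y)
}

mydata          <- make_data(500)
validation_data <- make_data(200)

# Convert the labels: log(y + 10^(-15)) for zeros, log(y) otherwise
convert <- function(y) ifelse(y == 0, log(y + 10^(-15)), log(y))
mydata$y_new <- convert(mydata$y)
yval     <- validation_data$y
yval_new <- convert(yval)

model  <- lm(y_new ~ x1 + x2, data = mydata)
pred   <- predict(model, validation_data)   # predictions on the log scale
pred_O <- exp(pred)                         # back on the count scale

RMSE_val  <- sqrt(sum((pred_O - yval) ^ 2) / length(yval))
RMSLE_val <- sqrt(sum((pred - yval_new) ^ 2) / length(yval))
print(c(RMSE = RMSE_val, RMSLE = RMSLE_val))
```

Because zero counts map to about -34.5 on the log scale, they can weigh heavily on both the fit and the RMSLE. A smaller shift such as log(y + 1) is a common alternative, but it is a different conversion from the one used in this post.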

2. If using logistic regression for count numbers:
Convert the label:
                  y_new = y/(y+1)
which maps the counts 0, 1, 2, ... into the interval [0, 1). Use y_new for logistic regression, but only if the model allows label values to be non-integer (between 0 and 1).
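A quick check of this conversion and its inverse, y = y_new / (1 - y_new), might look like the following sketch (the sample counts are made up):

```r
y <- c(0, 1, 3, 10)            # made-up counts
y_new  <- y / (y + 1)          # 0, 0.5, 0.75, 0.909...
y_back <- y_new / (1 - y_new)  # inverse conversion back to counts
print(all.equal(y_back, y))
```

The same inverse is what turns a predicted value between 0 and 1 back into a predicted count.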

The modelling code may look like
            model = glm(y_new ~ x1 + x2, data = mydata, family=binomial())
           
To predict,
        pred = predict(model, validation_data, type="response")
        pred_O = pred / (1 - pred)
       
To measure the performance of the prediction,
        RMSE_val = sqrt(sum((pred_O - yval) ^ 2) / length(yval))
       
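        RMSLE_val = sqrt(sum((log(pred_O + 1) - log(yval + 1)) ^ 2) / length(yval))
                 
                  = sqrt(sum((log(1 / (1 - pred)) - log(yval + 1)) ^ 2) / length(yval))

        (the second form uses pred_O + 1 = 1 / (1 - pred); the +1 on yval keeps the logarithm finite when yval is 0)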
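The logistic regression approach can be sketched end to end in the same way. This is an illustration under stated assumptions, not a definitive implementation: the data are simulated, `make_data` is a hypothetical helper, and `quasibinomial()` is swapped in for `binomial()` because it gives the same coefficient estimates without R's warning about non-integer labels. The sketch uses log(yval + 1) on the label side so that the RMSLE stays defined when yval contains zeros and matches the +1 on the prediction side.

```r
set.seed(2)  # simulated data, purely illustrative

# Hypothetical data generator: counts whose mean depends on x1 and x2
make_data <- function(n) {
  x1 <- runif(n)
  x2 <- runif(n)
  y  <- rpois(n, lambda = exp(0.5 + x1 - 0.5 * x2))
  data.frame(x1 = x1, x2 = x2, y = y)
}

mydata          <- make_data(500)
validation_data <- make_data(200)
mydata$y_new <- mydata$y / (mydata$y + 1)   # labels now lie in [0, 1)
yval <- validation_data$y

# Assumption: quasibinomial() replaces binomial() to avoid the non-integer warning
model  <- glm(y_new ~ x1 + x2, data = mydata, family = quasibinomial())
pred   <- predict(model, validation_data, type = "response")
pred_O <- pred / (1 - pred)                 # back on the count scale

RMSE_val  <- sqrt(sum((pred_O - yval) ^ 2) / length(yval))
RMSLE_val <- sqrt(sum((log(pred_O + 1) - log(yval + 1)) ^ 2) / length(yval))
# Equivalent form, since pred_O + 1 = 1 / (1 - pred)
RMSLE_alt <- sqrt(sum((log(1 / (1 - pred)) - log(yval + 1)) ^ 2) / length(yval))
print(c(RMSE = RMSE_val, RMSLE = RMSLE_val, RMSLE_alt = RMSLE_alt))
```

The two RMSLE values should agree up to floating-point error, which is a handy sanity check on the back-conversion.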
           
