##### A workaround for modelling count data with pre-built regression / binary classification models,
##### and how to calculate both RMSE and RMSLE for each approach.
### Notation:
# y and yval: values in the label (i.e. outcome) column for the training and validation sets
# y_new and yval_new: values in the converted label column for the training and validation sets
# pred: the predicted values returned by the pre-built model (C5.0, rpart, neuralnet, etc.), which can be compared with yval_new
# pred_O: the values converted back from pred, which can be compared with yval
1. If using linear regression:
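Convert the label:
y_new = log(y + 10^(-15)) for any y == 0, and
y_new = log(y) for any y greater than 0
Then use y_new as the label for linear regression or any similar model. This fits a log-linear model, which resembles Poisson regression (both work on the log scale) but is not identical to it: Poisson regression models log(E[y]), while this models E[log(y)].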
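As a minimal sketch of the conversion above, assuming a small made-up count vector `y` (not taken from the original post), both cases can be handled with one `ifelse` call:

```r
# Hypothetical example counts, made up for illustration
y <- c(0, 1, 3, 10)

# log(y + 10^(-15)) where y == 0, log(y) otherwise
y_new <- ifelse(y == 0, log(y + 10^(-15)), log(y))

# Back-transform to the count scale
y_back <- exp(y_new)

print(y_new)
print(y_back)
```

Note that a zero count maps to log(10^(-15)), roughly -34.5, which sits far from the other converted values, so zeros can pull a least-squares fit strongly.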
The modelling code may look like
model = lm(y_new ~ x1 + x2, data = mydata)
To predict,
pred = predict(model, validation_data)
pred_O = exp(pred)
To measure the performance of the prediction,
RMSE:
RMSE_val = sqrt(sum((pred_O - yval) ^ 2) / length(yval))
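RMSLE (here the log-scale error against the converted label; the more common definition compares log(pred_O + 1) with log(yval + 1)):
RMSLE_val = sqrt(sum((pred - yval_new) ^ 2) / length(yval))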
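The steps of approach 1 can be put together as follows. This is only a sketch on simulated data: the data-generating code, the seed, the 150/50 split and the name `train_data` are assumptions for illustration, not part of the original post.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated count data (hypothetical)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- rpois(n, lambda = exp(0.5 + x1 + 0.5 * x2))
mydata <- data.frame(x1 = x1, x2 = x2, y = y)

# Convert the label: log(y + 10^(-15)) for zeros, log(y) otherwise
mydata$y_new <- ifelse(mydata$y == 0, log(mydata$y + 10^(-15)), log(mydata$y))

# Simple split into training and validation sets
train_data      <- mydata[1:150, ]
validation_data <- mydata[151:200, ]
yval     <- validation_data$y
yval_new <- validation_data$y_new

# Fit on the converted label, predict, and convert back
model  <- lm(y_new ~ x1 + x2, data = train_data)
pred   <- predict(model, newdata = validation_data)
pred_O <- exp(pred)

# Error on the count scale and on the converted (log) scale
RMSE_val  <- sqrt(sum((pred_O - yval) ^ 2) / length(yval))
RMSLE_val <- sqrt(sum((pred - yval_new) ^ 2) / length(yval))
print(c(RMSE = RMSE_val, RMSLE = RMSLE_val))
```

If the validation set contains zeros, the log-scale error can be dominated by those rows, because their converted label is about -34.5.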
2. If using logistic regression for count numbers:
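Convert the label:
y_new = y / (y + 1)
Use y_new as the label for logistic regression, provided the model accepts non-integer label values between 0 and 1. (In R, glm with family=binomial() does fit such labels but warns about non-integer successes.)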
The modelling code may look like
model = glm(y_new ~ x1 + x2, data = mydata, family=binomial())
To predict,
pred = predict(model, validation_data, type="response")
pred_O = pred / (1 - pred)
To measure the performance of the prediction,
RMSE:
RMSE_val = sqrt(sum((pred_O - yval) ^ 2) / length(yval))
RMSLE (using log(yval + 1), so that zero counts stay finite):
RMSLE_val = sqrt(sum((log(pred_O + 1) - log(yval + 1)) ^ 2) / length(yval))
          = sqrt(sum((log(1 / (1 - pred)) - log(yval + 1)) ^ 2) / length(yval))
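Approach 2 can be sketched end to end in the same way. The simulated data, the seed, the split and the name `train_data` are again assumptions for illustration. Two deliberate choices differ from the snippets above and should be read as such: `quasibinomial()` is swapped in for `binomial()` because it gives the same coefficient estimates without the non-integer warning, and the RMSLE compares against log(yval + 1) so that zero counts do not produce infinite values.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated count data (hypothetical)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- rpois(n, lambda = exp(0.5 + x1 + 0.5 * x2))
mydata <- data.frame(x1 = x1, x2 = x2, y = y)

# Convert the label into [0, 1)
mydata$y_new <- mydata$y / (mydata$y + 1)

# Simple split into training and validation sets
train_data      <- mydata[1:150, ]
validation_data <- mydata[151:200, ]
yval <- validation_data$y

# quasibinomial() is an assumption here: same point estimates as binomial(),
# but no warning for non-integer labels
model  <- glm(y_new ~ x1 + x2, data = train_data, family = quasibinomial())
pred   <- predict(model, newdata = validation_data, type = "response")
pred_O <- pred / (1 - pred)  # invert y_new = y / (y + 1)

RMSE_val  <- sqrt(sum((pred_O - yval) ^ 2) / length(yval))
RMSLE_val <- sqrt(sum((log(pred_O + 1) - log(yval + 1)) ^ 2) / length(yval))

# Equivalent form, since pred_O + 1 = 1 / (1 - pred)
RMSLE_alt <- sqrt(sum((log(1 / (1 - pred)) - log(yval + 1)) ^ 2) / length(yval))
print(c(RMSE = RMSE_val, RMSLE = RMSLE_val, RMSLE_alt = RMSLE_alt))
```

The two RMSLE expressions should agree up to floating-point rounding, which gives a quick check that the back-conversion was applied correctly.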