Monday, December 10, 2012

Linear Regression :: when no fanciness is needed

I am a firm believer in using the right tool for the job, and sometimes a simple hammer can be more effective than a jackhammer. When it comes to data mining, such can be the case with linear regression, so I'd like to put in a few words about it.

The most powerful and most widely used techniques in data mining (e.g. support vector machines, neural networks) usually follow the classification style of learning. These techniques are trained on labelled data, and the resulting model predicts which class a new instance belongs to. The data they operate on is nominal (discrete values that can't be compared) or ordinal (values that can be ordered, but with no meaningful distance between them). If such algorithms need to deal with numeric data, the usual trick is to bin it into discrete intervals (e.g. instead of the numeric range of values {148, 134, 152, 80, 256, 462, 57}, use {large, medium, small} and express each value in terms of these constants). But what if this breakdown is not granular enough for you, and what you want to predict is not a class but a numeric quantity? Then regression is something you should consider.
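As a toy illustration of that binning step, here is a minimal Java sketch; the cut-off points and labels are arbitrary, picked only for this example:

    // Toy discretization: map raw numeric values onto coarse ordinal labels.
    // The thresholds (100 and 200) are arbitrary, chosen just for the example.
    public class Binning {
        static String bin(double value) {
            if (value < 100) return "small";
            if (value < 200) return "medium";
            return "large";
        }

        public static void main(String[] args) {
            double[] values = {148, 134, 152, 80, 256, 462, 57};
            for (double v : values) {
                System.out.println(v + " -> " + bin(v));
            }
        }
    }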

Linear regression is the simplest of the regressions, but sometimes it does the job nicely. Use it if you have a range of numeric values over time, for example (or over some other numeric quantity), and want to see where the trend is going. What linear regression does is fit a line through the data (see the figures below), so that you can see which direction the data is taking and predict future values. For instance, in time series analysis, if you fit a regression line to historic data, you can predict future values simply by reading them off the extended line.

Of course, this model is only useful if it represents your data well. To find out whether it does, you need the coefficient of determination, or R². This value, which ranges from 0 to 1, is the proportion of the variance in the data that the model explains, i.e. the higher, the better. I have found one little problem with it, though. If the data does not ascend or descend, but hovers around a horizontal line like in Figure 2, then R² is very low, even when the data is close to the regression line. I've put two toy examples of data and their corresponding linear regressions in the two figures below. You can see that in both cases the data is pretty close to the regression line, but while R² for Fig. 1 is high as expected, Fig. 2 has a very low one. The reason is in the definition: R² = 1 - SS_res/SS_tot, i.e. it measures how much better the fitted line does than a horizontal baseline sitting at the mean of the data. When the best-fit line is itself nearly horizontal, it can barely beat that baseline, so R² stays near zero no matter how tightly the points hug the line. In other words, the coefficient of determination is not a useful measure when the regression line is close to parallel to the x axis.
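To make the mechanics concrete, here is a minimal Java sketch of simple least-squares fitting plus the R² calculation. It is an illustration of the standard formulas, not the full implementation from my repo:

    // Minimal simple linear regression: fit y = a + b*x by least squares
    // and report the coefficient of determination (R^2).
    public class LinearRegression {
        final double intercept, slope, r2;

        LinearRegression(double[] x, double[] y) {
            int n = x.length;
            double sumX = 0, sumY = 0;
            for (int i = 0; i < n; i++) { sumX += x[i]; sumY += y[i]; }
            double meanX = sumX / n, meanY = sumY / n;

            // Slope b = sum((x - meanX)(y - meanY)) / sum((x - meanX)^2)
            double sxy = 0, sxx = 0;
            for (int i = 0; i < n; i++) {
                sxy += (x[i] - meanX) * (y[i] - meanY);
                sxx += (x[i] - meanX) * (x[i] - meanX);
            }
            slope = sxy / sxx;
            intercept = meanY - slope * meanX;

            // R^2 = 1 - SS_res / SS_tot, where SS_tot is the sum of squares
            // around the mean (the "horizontal line" baseline).
            double ssRes = 0, ssTot = 0;
            for (int i = 0; i < n; i++) {
                double predicted = intercept + slope * x[i];
                ssRes += (y[i] - predicted) * (y[i] - predicted);
                ssTot += (y[i] - meanY) * (y[i] - meanY);
            }
            r2 = 1 - ssRes / ssTot;
        }

        double predict(double x) { return intercept + slope * x; }
    }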

Figure 1. Linear regression on sample data. Coefficient of determination (R²) is 0.903
Figure 2. Linear regression on sample data. Coefficient of determination (R²) is 0.007
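
To reproduce the effect numerically, you can run two toy data sets through the LinearRegression sketch above: one with a clear trend and one hovering around a constant. The numbers are invented for the illustration, but the outcome follows from the formula: on the flat set SS_tot is already tiny, so even tiny residuals drive R² toward zero.

    public class R2Demo {
        public static void main(String[] args) {
            double[] x = {1, 2, 3, 4, 5, 6};

            // Clear upward trend: SS_tot is large, so R^2 comes out high.
            double[] trending = {1.1, 2.0, 2.9, 4.2, 4.8, 6.1};

            // Nearly flat data: the residuals are just as small, but SS_tot
            // is tiny too, so R^2 collapses toward zero.
            double[] flat = {5.02, 4.97, 5.01, 4.98, 5.03, 4.99};

            System.out.println("trending R^2 = " + new LinearRegression(x, trending).r2);
            System.out.println("flat R^2     = " + new LinearRegression(x, flat).r2);
        }
    }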

Nitty-gritty

The images were generated in the R statistical environment. The regression formula and the coefficient of determination were calculated by my Java implementation of linear regression, found in my Git repo.
