
Ruggero Bettinardi, PhD

Machine Learning in Pills: Linear Regression

Linear Regression is one of the oldest Supervised learning methods.

The idea is to fit a line to the data, and then use this line to predict new target values. If we have two input features to predict our target, we fit a plane to the data instead, and if we have more than two input features, we fit a hyperplane.

To keep everything simple, let's assume we have to fit a line to the data, as we only have one input feature. The line equation is:

y = a·x + b

Where y is the target variable (i.e. the value we want to predict), x is the input feature we use to predict it, and a (the line slope) and b (the line intercept) are the regression coefficients.


The fitting is performed by evaluating different values for the regression coefficients and choosing those that minimize some error function (also called loss function or cost function): a function that quantifies how large the errors (called residuals) are when we predict the target values using a simple line.

Let's consider an example.

We can assume that, in general, the bigger the eggs of a dinosaur, the bigger the adult dino will be, suggesting a possibly linear relationship.

We can therefore use Linear Regression to predict the most probable adult size of a given dinosaur just by knowing the size of its eggs.

To find the best-fitting line that we will use to make our predictions, we choose the coefficients that minimize a loss function computed over the residuals (i.e. the errors): the distances between every observed data point and the regression line. In the plot, the regression line is drawn in green, whereas all the residual distances are colored in blue.
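To make this concrete, here is a minimal sketch (with made-up toy numbers, not data from the post) of how the sum of squared residuals is computed for one candidate line:

    import numpy as np

    # Toy data (hypothetical): egg size vs. adult dino size
    egg_size = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    dino_size = np.array([2.3, 4.1, 5.8, 8.2, 9.9])

    # One candidate line: y_hat = a*x + b
    a, b = 2.0, 0.1
    predictions = a * egg_size + b

    # Residuals: the distance between each observation and the line
    residuals = dino_size - predictions

    # Ordinary Least Squares loss: the sum of squared residuals.
    # Fitting means searching for the (a, b) that minimize this value.
    loss = np.sum(residuals ** 2)
    print(loss)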

  • Linear Regression is a Supervised Method
  • Use it when the target variable is continuous (e.g. prices, blood metabolites' levels...)
  • Only makes sense if we can assume that input and target variables are linearly correlated
  • Simple linear regression: when only one input feature (regressor)
  • Multiple linear regression: when predicting using more than one input feature (2+ regressors)
  • The standard fitting criterion in Linear Regression is Ordinary Least Squares (OLS), which minimizes the sum of squared residuals

Linear Regression with sklearn

Let's see how to perform linear regression using sklearn. For that, we will use the diabetes toy dataset, which can be loaded directly from sklearn.

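A minimal sketch of how to load it (the variable names here are my own choice):

    from sklearn.datasets import load_diabetes

    # The diabetes dataset: 442 patients, 10 standardized input features,
    # and a continuous target measuring disease progression after one year.
    X, y = load_diabetes(return_X_y=True)
    print(X.shape, y.shape)  # (442, 10) (442,)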

We will first show how to perform Simple Linear Regression by selecting only one of the input features in the dataset, and later we will perform Multiple Linear Regression by using all of them.

Simple Linear Regression

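A minimal sketch, assuming we keep only the BMI feature (column 2) as the single regressor and score with R² on a held-out test split (the exact feature and split used originally may differ):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)

    # Keep a single input feature (here: BMI, column 2)
    X_single = X[:, [2]]

    X_train, X_test, y_train, y_test = train_test_split(
        X_single, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)

    print("slope a:    ", model.coef_[0])
    print("intercept b:", model.intercept_)
    print("test R2:    ", model.score(X_test, y_test))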

Multiple Linear Regression

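And a corresponding sketch using all ten input features (again, the train/test split is my own assumption):

    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)

    # Use all 10 input features (2+ regressors -> multiple linear regression)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    model = LinearRegression().fit(X_train, y_train)
    print("test R2:", model.score(X_test, y_test))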

As expected, using all input variables increases the performance of our prediction: R² goes from 0.3 to 0.4. This is not surprising, as the more variables we use to predict, the better our model will approximate the target variable.


However, increasing the number of input variables also makes our model more prone to overfitting; moreover, the more variables we use to predict, the harder it becomes to know which variables are actually the ones most responsible for the relationship we see.


Some of the methods used to overcome these issues are Lasso, Ridge, and ElasticNet regression, which I will cover in later posts with hands-on Python examples.
