## Multiple regression

**Multiple regression analysis is a technique to derive a line that can predict an outcome based on a list of variables.**

In univariate regression only one predictor variable is used. The example used there was whether students with higher motivation tend to get better results. That may be true, but motivation is probably not the only aspect that predicts the result. Other factors might be good predictors too, such as time spent on learning, intelligence, familiarity with the subject, and so on. Now we don’t want to calculate the best-fitting line with only one predictor, but with a list of predictors. The equation we are looking for is this one:

**y = a + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{3} + ……**

In which: y = the predicted value of the dependent variable

a = the intercept (the value where the line crosses the y-axis)

b_{i} = the regression coefficient: the weight by which the value of variable x_{i} is multiplied

x_{i} = the value of predictor i

The formula can be made as long as you wish. Just add more variables and see whether they have predictive value. If not, these variables can be omitted from the regression equation.
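To make the equation concrete, here is a minimal sketch of plugging values into it. The intercept, coefficients, and predictor values below are made up purely for illustration:

```python
# Predict a test score from a hypothetical multiple regression equation:
# y = a + b1*x1 + b2*x2 + b3*x3
# All numbers below are illustrative, not from real data.
a = 2.0                      # intercept
b = [0.8, 1.5, 0.3]          # coefficients for hours studied, motivation, familiarity
x = [10, 4, 6]               # one student's measured values for those predictors

y_pred = a + sum(bi * xi for bi, xi in zip(b, x))
print(y_pred)                # 2.0 + 8.0 + 6.0 + 1.8 = 17.8
```

Adding another predictor simply means appending one more coefficient to `b` and one more measured value to `x`.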

**What is multiple regression used for?**

The multiple regression line is used to predict an outcome. So if you think a test result is influenced by time spent on learning, motivation, intelligence, and familiarity with the subject, just measure these aspects, put the measured values into the formula, and calculate the predicted score.

That is easily said, but first you need to know the regression line. Regression analysis therefore has two steps: finding the regression line and testing the regression line. Most of the time in research only step one is done. The second step is usually omitted in research, but it can be very useful in daily life: for instance in predicting economic welfare in a country, identifying the aspects that influence consumers’ buying behaviour, predicting influences on healthcare, and so on.

**How to calculate the multiple regression line**

It is not easy to calculate this regression line by hand. It is done with matrix calculation, and I am sorry, but I will not explain that here. We have computers to do the calculation, and I rely on their procedures. These procedures have been tested over and over, so I do not question the outcome.
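If you are curious what the computer actually does, a least-squares solver finds the coefficients that minimise the squared prediction errors. A minimal sketch with `numpy`, on a tiny made-up data set constructed so that y = 1 + 2·x1 + 0.5·x2 exactly:

```python
import numpy as np

# Tiny illustrative data set: two predictors, five observations.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([4.0, 5.5, 9.0, 10.5, 14.0])   # generated as 1 + 2*x1 + 0.5*x2

# Prepend a column of ones so the intercept a is estimated too,
# then solve the least-squares matrix equation.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
a, b1, b2 = coef
print(a, b1, b2)             # recovers (approximately) 1.0, 2.0, 0.5
```

Real statistical software adds standard errors, t-tests, and diagnostics on top of this, but the core calculation is the same matrix solve.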

Keep in mind that adding a new variable to the equation changes all regression coefficients. So even small changes in the set of predictors used in the equation affect the predictive value of all predictors.

**Multiple Regression and Multiple Correlation**

With only one predictor the equation can be evaluated with the correlation coefficient (r). With more predicting variables the equation is evaluated with the multiple correlation coefficient (R). To be more precise, r and R themselves aren’t usually reported, but their squared value, denoted by R^{2}.

Adding extra predictors can only increase R^{2} (or leave it unchanged); excluding a predictor can only decrease it (or leave it unchanged). Testing the difference between two equations is done with hierarchical regression analysis.
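That R^{2} never drops when a predictor is added can be seen in a small simulation. This sketch uses made-up data in which the second predictor is pure noise, yet R^{2} still does not decrease:

```python
import numpy as np

def r_squared(X, y):
    # R^2 = 1 - SS_residual / SS_total, for an OLS fit with intercept.
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    ss_total = (y - y.mean()) @ (y - y.mean())
    return 1 - (resid @ resid) / ss_total

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)                  # unrelated to y by construction
y = 3 + 2 * x1 + rng.normal(size=50)

r2_one = r_squared(x1[:, None], y)                      # one predictor
r2_two = r_squared(np.column_stack([x1, x2]), y)        # both predictors
print(r2_one, r2_two)                     # r2_two is at least r2_one
```

Whether that small increase is statistically meaningful is exactly what the hierarchical test below addresses.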

**Testing the influence of variables in the equation**

If the regression coefficient of a variable (b) is zero, the regression line is horizontal: the variable does not influence the dependent variable. That is certainly the case in a univariate situation. In a multiple regression in which a lot of predictors are used, the statistical effect of a single factor might get lost; that predictor then becomes redundant. In combination with all the other predictors, the aspect is overshadowed by the other aspects that predict the test score. This is the basic idea of mediation.

A combination of two predictors might also give a stronger effect than either alone. For instance, the combination of motivation and spending a lot of time on learning might lead to a higher score on the test than each predictor would predict on its own. This is the basic idea of moderation.
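A common way to model such a combined (moderation) effect is to add a product term x1·x2 to the equation. A minimal sketch with made-up, noise-free data generated from score = 20 + 1·motivation + 1.5·hours + 0.4·motivation·hours:

```python
import numpy as np

rng = np.random.default_rng(1)
motivation = rng.uniform(0, 10, 100)
hours = rng.uniform(0, 10, 100)
# Illustrative data: the interaction coefficient 0.4 is the moderation effect.
score = 20 + 1.0 * motivation + 1.5 * hours + 0.4 * motivation * hours

# Fit y = a + b1*x1 + b2*x2 + b3*(x1*x2) by least squares.
X = np.column_stack([np.ones(100), motivation, hours, motivation * hours])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
print(coef)        # recovers (approximately) [20, 1.0, 1.5, 0.4]
```

A non-zero coefficient on the product column means the effect of one predictor depends on the level of the other.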

If you want to compare several regression lines, hierarchical regression should be used. With this test the change in R^{2} is statistically tested, and the influence of all predictors on the dependent variable can be compared.

**The standardised regression coefficient**

In the output of most statistical software programmes two types of regression coefficients are presented. The first one is the unstandardised regression coefficient, denoted by b. Its value depends on the scale of the variable. The advantage of this notation is that it corresponds with real life.

The second one is the standardised regression coefficient (Beta), based on standardising the variables to a mean of 0 and a standard deviation of 1. The advantage of this notation is that the regression coefficients of different predictors can be compared directly.
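The two notations are linked by a simple conversion: Beta = b · sd(x) / sd(y). A minimal sketch with illustrative, made-up numbers:

```python
# Convert an unstandardised coefficient b into a standardised Beta.
# All values below are made up for illustration.
b = 0.8            # raw coefficient: score points per extra hour studied
sd_x = 5.0         # standard deviation of hours studied
sd_y = 10.0        # standard deviation of the test score

beta = b * sd_x / sd_y
print(beta)        # 0.4
```

So a predictor measured in hours and one measured on a 1-to-5 scale, which have incomparable b values, can still be compared through their Betas.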