Linear Regression Analysis using Excel

by Dr Muhammad Al-Salamah

Linear regression and multiple linear regression in general is used, among other purposes, to establish a functional relationship between a production line characteristic, commonly called the response, and a control factor that is supposed to have an effect on the response. This functional relationship has the general form:

linear regression model

In this form, the response is given by Y and the factors are represented by the x's. What is missing in this relationship are the values of the regression coefficients. This tutorial will explain how Excel can be used to compute the least squares values for the regression coefficients and perform very important significance tests on the regression model and the factors.

Almarai Yogurt زبادي المراعي

Almarai Yogurt Filling Operation

In Almarai yogurt filling operation, it has been noticed that there is a high variability in filling volume measured in ml The industrial engineer has suspected that the filling volume variability could be caused by three main factors; feed rate of containers, temperature of yogurt, and the length in time the machine has been in operation since the start of the shift. The industrial engineer suspects that there could be other factors affecting the filling volume; but only the stated factors can be easily controlled.

He understands that the filling volume will not be the same if it is repeated with the same values of feed rate, temperature, and operation duration. The variability in the filling volume is caused by a random error, that for practical purposes is not important. But, he knows very well that, in regression analysis, the error is supposed to be normally distributed with mean zero and constant variance. This error includes the effects of factors that are not included in the analysis; such as the age of the machine, viscosity of the liquid, etc.

This first step in the analysis is to collect data. What he has done was to observe the filling operation at different values of the control variables. The way the values of the control variables are chosen can be either random or the values available from historical records. In general, it is recommended not to revert to either options; it will be better if the values to the control variables are chosen carefully to reflect the best possible values and their contribution to the response.

For the yogurt filling operation, the industrial engineer has managed to collect 25 observations of the filling volume for different value combinations of the feed rate, temperate, and operation duration. The data are organized in the following table:

Observation Filling
volume
(ml)
Feed
rate
(parts per hour)
Temperature
of the liquid
(oC)
Operation
duration
(hours)
1 235.5 69 75.3 5.5
2 214.6 54 79 1.1
3 231.8 59 72.5 4.4
4 219 61 79 2.1
5 221.8 63 71.2 2.6
6 221.5 68 77.4 2.7
7 232.5 57 71.9 4.4
8 236.7 60 79.8 5.5
9 223.4 62 74.5 2.9
10 218.9 62 72.2 2
11 234.9 51 77.7 5.1
12 222.1 65 79.5 2.9
13 228 54 70.3 3.6
14 218.4 56 79.5 1.7
15 216.5 56 71.7 1.4
16 218 57 77.6 1.9
17 234.7 70 72 5.3
18 220.9 65 79.8 2.5
19 230.2 52 77.5 4.2
20 214 68 74 1.1
21 218.8 57 70.9 1.8
22 237.2 52 78.3 5.4
23 239.5 53 74.2 5.9
24 221.9 53 70.6 2.2
25 232.5 66 80 4.8

The industrial engineer is interested to know more about the filling operation and the contribution of the control variables on the response. To make scientifically based conclusions, he has to test some relevant hypotheses, like

  • how the feed rate, temperature, and operation duration collectively contribute to the variability or change in the filling volume.
  • how the filling volume is affected by the individual control variables.
  • how good the regression model in its entirety.

Data Analysis Tool

The Data Analysis tool in Excel is used for most statistical studies. The regression utility under the Data Analysis is for regression analysis, including analysis of variance, ANOVA.

The Data Analysis tool is available under the Data tab:

Excel Data Analysis

When the Data Analysis is clicked, the Data Analysis box opens up:

Excel Data analysis tool

From the list of analysis tools, choose Regression. The Regression dialog box looks like this:

Excel Regression dialog box

In the Input Y Range, this field should point to the cells containing the values of y's. In the Input X Range, the cells containing the values of x's are referenced in this field. This Regression dialog box has many parameters that could be of interest in the regression analysis. One particularly important parameter is the Confidence Level. By default, the confidence level is set to 95%. If another value is desired, it can be changed in this field. The confidence level will be used in the construction of the confidence intervals on the regression parameters. For the current example, we want to, in addition to the 95% level, the 90% level; therefore, 90 is inserted in this field. If the analysis of the residuals will be required, the choices in the relevant areas in the Regression dialog box should be checked:

Excel Regression Dialog box

The Regression tool is going to generate three tables in a new worksheet. These tables are the regression statistics, ANOVA, and statistics on the regression coefficients. The regression statistics table provides general information on the regression model, such as the value of the coefficient of determination, commonly referred to as R2.

The analysis of variance is used to test the significant of the overall regression and how well it explains the variability in filling volume. The ANOVA table organizes the calculations involved in the analysis of variance:

Excel ANOVA table

The most important components in the ANOVA table are the estimate of the variance of the error and the p-value. In general, the variance measures the spread of the data; hence, in regression, the error variance quantifies the spread of the actual observation of the filling volume about the regression model.

The p-value is a measure of the degree of the insignificance of the regression model in explaining the variability in the filling volume. The significance of the regression model increases as the p-value decreases. In the above table, the p-value is almost zero, which indicates the control variables affect significantly the filling volume. In general, the p-value is compared to a pre-chosen alpha; and as long as the p-value is less than the value of alpha, we say the regression model is significant.

The last table the Regression tool generates is the estimates table. The table does not show the estimates of the regression coefficients only, but it shows other important values, such as the p-value for the individual coefficients and the confidence intervals:

regression analysis table

Each row of the table is for one coefficient. The second column of the table gives the estimates of the coefficients. The third column shows the standard error, which is defined as the square root of the variance. The last four columns show the confidence intervals for the individual coefficients. Particularly important are the p-values, which are interpreted the same way the p-value in the ANOVA table (the smaller, the better).

You can download the linear regression data.xls to practice regression analysis using Excel.

Note: this is a made-up example for illustration purposes.