CHAPTER EIGHT

Linear Regression

 

 

8.1 Scatter Diagram

 

Example 8.1 A chemical engineer is investigating the effect of process operating temperature ($x$) on product yield ($y$). The study results in the following data:

 

Temperature (x)   100   110   120   130   140   150   160   170   180   190
Yield (y)          45    51    54    61    66    70    74    78    85    89

 

(Hines and Montgomery, 1990, p. 457). Check whether there is a linear relationship between temperature and product yield.

 

Solution Make a data file with $x$ (temperature) as Var1 and $y$ (product yield) as Var2. Then follow the steps below in Statistica to get a scatter diagram.

 

  1. Graphs
  2. Scatterplot
  3. Click Variables (Select Var1 and Var2) / OK
  4. Click Advanced (In Graph Type click Regular & in Fit click Off)
  5. OK

 

Since the scatter diagram (see Figure 8.1) between temperature (Var1) and product yield (Var2) shows a linear trend, it is reasonable to estimate the line of best fit.

Figure 8.1 Scatterplot

 


8.2 The Correlation Coefficient

 

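The strength of the linear relationship between $x$ and $y$ is measured by the sample correlation coefficient, defined by

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}},$$

where

$$S_{xx} = \sum (x_i - \bar{x})^2, \qquad S_{yy} = \sum (y_i - \bar{y})^2, \qquad S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}).$$

In Example 8.1, we have $S_{xx} = 8250$, $S_{yy} = 1932.1$ and $S_{xy} = 3985$, so that

$$r = \frac{3985}{\sqrt{8250 \times 1932.1}} \approx 0.9981.$$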
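As a cross-check outside Statistica, the correlation coefficient can be computed directly from its definition. The following is a minimal Python sketch using only the standard library; the function name `correlation` is illustrative and does not come from the text.

```python
from math import sqrt

# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]


def correlation(x, y):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


r = correlation(x, y)
print(round(r, 4))  # prints 0.9981
```

A value this close to 1 agrees with the strong linear trend seen in the scatter diagram.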

 

                                               

 

8.3 Estimating the Line of Best Fit

 

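The simple linear regression model is of the form

$$y = \beta_0 + \beta_1 x + \varepsilon,$$

where $E(y \mid x) = \beta_0 + \beta_1 x$ is the conditional mean of $y$ at $x$,

$y$ and $x$ are respectively the dependent (response) and independent (explanatory) variables,

$\varepsilon$ is the random error component,

and $\beta_0$ and $\beta_1$ are the intercept and the slope of the regression line respectively.

The least squares estimators of the regression parameters are given by

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$

Once the parameters are estimated, the equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is called the estimated regression line, the prediction line, the line of best fit or the least squares line. Note that $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$ can be used as a point estimate of the conditional mean of $y$ at $x_0$, or as a predictor of the response at $x_0$.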

 

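For the data in Example 8.1, $\bar{x} = 145$, $\bar{y} = 67.3$,

$\hat{\beta}_1 = 3985 / 8250 = 0.48303$ and $\hat{\beta}_0 = 67.3 - 0.48303(145) = -2.7394$,

so that the line of best fit is given by

$\hat{y} = -2.7394 + 0.48303\,x$.

At a temperature of 140 °C, we predict the yield to be $\hat{y} = -2.7394 + 0.48303(140) = 64.885$.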
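The least squares formulas above can also be checked with a few lines of code. This is a hedged sketch in plain Python, not a replacement for the Statistica procedure that follows; the helper name `fit_line` is an assumption introduced for illustration.

```python
# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]


def fit_line(x, y):
    """Return the least squares intercept b0 and slope b1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1


b0, b1 = fit_line(x, y)
y_hat_140 = b0 + b1 * 140    # predicted yield at 140 degrees C
print(f"b0 = {b0:.4f}, b1 = {b1:.5f}, prediction at 140 = {y_hat_140:.4f}")
```

The printed estimates should agree with the hand computation to the number of decimals shown there.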

 

Estimating $\beta_0$ and $\beta_1$ Using Statistica

In the Basic Statistics and Tables module, the estimates of $\beta_0$ and $\beta_1$ can be found by simply plotting a scatter graph of the dependent variable against the independent variable with a linear fit. While you are in the Basic Statistics and Tables module, follow the steps:

  1. Enter the values of $x$ in one column, say Var1, and the corresponding $y$ values in another column, say Var2
  2. Graphs / Stats 2D Graphs / Scatterplots
  3. Variables / X = Var1 and Y = Var2 / OK
  4. In Advanced, select Regular for Graph Type and Linear for Fit
  5. OK

For the data in Example 8.1 you will get the fitted line $y = -2.7394 + 0.4830\,x$ (Figure 8.2), which means that the estimates of $\beta_0$ and $\beta_1$ are given by $-2.7394$ and $0.4830$ respectively. Thus, the predicted simple linear regression model is given by

$$\hat{y} = -2.7394 + 0.4830\,x.$$

Figure 8.2 A Graph of Y versus X


8.4 Sources of Variation

 

The variation in the dependent variable $y$ is attributed partly to the variation in the independent variable $x$. The rest is attributed to what is called the Sum of Squares due to Errors, defined by $SSE = \sum (y_i - \hat{y}_i)^2$, which can be calculated from the following table:

 

 

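  x      y     $\hat{y}$    $y - \hat{y}$    $(y - \hat{y})^2$
100     45     45.5636      -0.5636          0.3176
110     51     50.3939       0.6061          0.3674
120     54     55.2242      -1.2242          1.4987
130     61     60.0545       0.9455          0.8940
140     66     64.8848       1.1152          1.2437
150     70     69.7151       0.2849          0.0812
160     74     74.5454      -0.5454          0.2975
170     78     79.3757      -1.3757          1.8926
180     85     84.2060       0.7940          0.6304
190     89     89.0363      -0.0363          0.0013
Total                                        7.2244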

 

Thus SSE = 7.2244. Compare these errors (or residuals) with those obtained by Statistica.
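The residual table can be reproduced programmatically. The sketch below is illustrative only (plain Python, variable names chosen here); it refits the line of Example 8.1 and lists the fitted values, residuals and squared residuals.

```python
# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
sse = sum(e ** 2 for e in residuals)

# One row per observation: x, y, fitted value, residual, squared residual.
for xi, yi, fi, e in zip(x, y, fitted, residuals):
    print(f"{xi:5d} {yi:5d} {fi:9.4f} {e:9.4f} {e * e:8.4f}")
print(f"SSE = {sse:.4f}")
```

Computed without intermediate rounding, the total comes out at about 7.2242; the 7.2244 in the table reflects rounding each squared residual to four decimals before summing.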

 

Predicted and Residual Values Using Statistica

Follow the steps:

 

  1. Statistics / Multiple regression
  2. Variables (select the dependent and independent variables) / OK / OK
  3. Click Residuals / assumptions / prediction
  4. Click Perform residual analysis
  5. In Advanced, click Summary: Residuals & predicted (to get Figure 8.3)

Figure 8.3 Predicted and Residual Values

 

Decomposition of the Sum of Squares

 

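It can be proved that $TSS = SSR + SSE$, where $TSS = \sum (y_i - \bar{y})^2$ is the Total Sum of Squares, $SSR = \sum (\hat{y}_i - \bar{y})^2$ is the Sum of Squares due to Regression and $SSE = \sum (y_i - \hat{y}_i)^2$ is the Sum of Squares of Errors, also known as the residual sum of squares.

The coefficient of determination is defined by

$$R^2 = \frac{SSR}{TSS} = 1 - \frac{SSE}{TSS}.$$

In Example 8.1, $TSS = 1932.1$ and $SSR = 1924.8757$, so that $SSE = TSS - SSR = 7.2243$. Note that the expression $SSE = \sum (y_i - \hat{y}_i)^2$ may not be computationally efficient, since it requires every fitted value.

The coefficient of determination is $R^2 = 1924.8757 / 1932.1 = 0.9963$.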
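The decomposition can be verified numerically. The sketch below (plain Python, names chosen here for illustration) uses the shortcut $SSR = \hat{\beta}_1 S_{xy}$, which avoids computing each fitted value, and compares the resulting SSE with the direct sum of squared residuals.

```python
# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

tss = syy              # total sum of squares
ssr = b1 * sxy         # sum of squares due to regression (shortcut form)
sse = tss - ssr        # error sum of squares by subtraction
r2 = ssr / tss         # coefficient of determination

# Direct definition of SSE, for comparison with the shortcut.
sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(f"TSS = {tss:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}, R^2 = {r2:.4f}")
```

Both routes to SSE should agree up to floating-point rounding.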

 

 

Calculation of Sums of Squares Using Statistica

 

To calculate the sum of squares using Statistica follow the steps:

 

  1. Enter the values of $x$ in one column, say Var1, and the corresponding $y$ values in another column, say Var2.
  2. Statistics / Multiple Regression
  3. Variables (select the dependent and independent variables) / OK / OK
  4. In Advanced, click ANOVA (Overall Goodness of Fit)

The resulting spreadsheet shows the Total Sum of Squares (TSS), the sum of squares due to regression (SSR) and the sum of squares due to errors (SSE), the mean squares and the $F$ value.

For the data in Example 8.1 above, we have $TSS = 1932.10$, $SSR = 1924.8758$ and $SSE = 7.2242$ (Figure 8.4).

Figure 8.4 Analysis of Variance

 


8.5 Confidence Interval Estimation of Regression Parameters

 

 

Confidence Interval (CI) for the Slope Parameter

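A $100(1-\alpha)\%$ CI for $\beta_1$ is given by

$$\hat{\beta}_1 \pm t_{\alpha/2}\,\frac{s}{\sqrt{S_{xx}}},$$

where $t_{\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the t-distribution with $n-2$ degrees of freedom, $S_{xx} = \sum (x_i - \bar{x})^2$, and

$s^2 = SSE/(n-2)$, the estimate for $\sigma^2$.

For the data in Example 8.1, $n = 10$, $s^2 = 7.2242/8 = 0.9030$, $S_{xx} = 8250$ and $t_{0.025} = 2.306$, and thus a 95% CI for $\beta_1$ is given by

$$0.48303 \pm 2.306\sqrt{0.9030/8250} = 0.48303 \pm 0.02413.$$

In other words,

$0.4589 \le \beta_1 \le 0.5072$.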
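The interval for the slope can be reproduced as follows. This is a sketch in plain Python; the critical value 2.306 is taken from the text (t-distribution with 8 degrees of freedom) rather than computed, because the standard library has no t quantile function.

```python
from math import sqrt

# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]
T_CRIT = 2.306  # t(0.025; 8), table value quoted in the text

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx

mse = (syy - b1 * sxy) / (n - 2)   # s^2, the estimate of sigma^2
se_b1 = sqrt(mse / sxx)            # standard error of the slope
lower, upper = b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1
print(f"95% CI for the slope: [{lower:.4f}, {upper:.4f}]")
```

Because the interval excludes 0, it already suggests that temperature has a real linear effect on yield.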

 

Confidence Interval for the Conditional Mean

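A $100(1-\alpha)\%$ CI for the conditional mean at $x = x_0$ is given by

$$\hat{y}_0 \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},$$

where $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$, $s^2 = SSE/(n-2)$ and $S_{xx} = \sum (x_i - \bar{x})^2$.

In Example 8.1, a 95% CI for the conditional mean at 140 °C is given by

$$64.8848 \pm 2.306 \sqrt{0.9030\left(\frac{1}{10} + \frac{(140 - 145)^2}{8250}\right)},$$

i.e. $64.1815 \le E(y \mid x = 140) \le 65.5882$.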

 

The above problem can be solved using Statistica following the steps:

 

  1. Statistics
  2. Multiple Regression
  3. Click variables (Choose Dependent and Independent Variable say Var2 and Var1)
  4. OK/OK
  5. Click Residuals/Assumptions/Prediction
  6. Click Compute Confidence Limits (Checked by default )
  7. Click Predict Dependent Variable (Under Common value put 140) Click Apply, then OK

 

We find that a 95% confidence interval for the conditional mean of $y$ at $x = 140$ is given by

[64.18146, 65.58823] (see Figure 8.5).

Figure 8.5 Confidence Interval for Mean

 

 

8.6 Prediction Interval (PI) for a Future Observation

 

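A $100(1-\alpha)\%$ PI for a future observation $y_0$ at $x = x_0$ is given by

$$\hat{y}_0 \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},$$

where $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$, $s^2 = SSE/(n-2)$ and $S_{xx} = \sum (x_i - \bar{x})^2$.

For the data in Example 8.1, a 95% prediction interval for the yield at 140 °C is given by

$$64.8848 \pm 2.306 \sqrt{0.9030\left(1 + \frac{1}{10} + \frac{(140 - 145)^2}{8250}\right)},$$

i.e. $62.5834 \le y_0 \le 67.1863$.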

 

The problem can be solved using Statistica following the steps:

 

  1. Statistics
  2. Multiple Regression
  3. Click variables (Choose Dependent and Independent Variable say Var2 and Var1)
  4. OK / OK
  5. Click Residuals/Assumptions/Prediction
  6. Click Compute Prediction Limits
  7. Click Predict Dependent Variable to enter a fixed value of $x$, say 140
  8. Click Apply, then OK

 

It is predicted with 95% confidence that the product yield will be in the interval

[62.58338, 67.18632] (see Figure 8.6).

Figure 8.6 Prediction Interval for Product Yield
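The confidence interval for the conditional mean (Section 8.5) and the prediction interval for a future observation differ only by the extra 1 under the square root. The following plain-Python sketch makes that explicit; the function name `interval` and its `new_observation` flag are illustrative assumptions, and the critical value 2.306 is the table value for 8 degrees of freedom.

```python
from math import sqrt

# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]
T_CRIT = 2.306  # t(0.025; 8), table value

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
mse = (syy - b1 * sxy) / (n - 2)


def interval(x0, new_observation=False):
    """95% interval at x0: CI for the mean, or PI if new_observation is True."""
    y0 = b0 + b1 * x0
    extra = 1.0 if new_observation else 0.0   # the "1 +" term of the PI
    half = T_CRIT * sqrt(mse * (extra + 1 / n + (x0 - xbar) ** 2 / sxx))
    return y0 - half, y0 + half


ci = interval(140)
pi = interval(140, new_observation=True)
print("CI for the mean:", [round(v, 4) for v in ci])
print("PI for a new observation:", [round(v, 4) for v in pi])
```

The prediction interval is always the wider of the two, since it also accounts for the variability of a single new observation around its mean.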

 

8.7 Testing the Slope of the Regression Line

 

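The following table lists the possible hypotheses about the slope $\beta_1$, with the rejection region and the p-value in each case. Here $\beta_{10}$ denotes the hypothesized value of the slope, $t$ is the observed value of the test statistic $T$, and the t-distribution has $n-2$ degrees of freedom.

Hypotheses about $\beta_1$ and their respective rejection regions and p-values

Null hypothesis               Alternative hypothesis           Rejection Region              p-value
$H_0: \beta_1 = \beta_{10}$   $H_1: \beta_1 < \beta_{10}$      $t < -t_{\alpha}$             $P(T < t)$
$H_0: \beta_1 = \beta_{10}$   $H_1: \beta_1 > \beta_{10}$      $t > t_{\alpha}$              $P(T > t)$
$H_0: \beta_1 = \beta_{10}$   $H_1: \beta_1 \neq \beta_{10}$   $|t| > t_{\alpha/2}$          $2P(T > |t|)$

The test statistic for these hypotheses is

$$T = \frac{\hat{\beta}_1 - \beta_{10}}{s/\sqrt{S_{xx}}}, \qquad s^2 = \frac{SSE}{n-2}, \qquad S_{xx} = \sum (x_i - \bar{x})^2.$$

The hypothesis $H_0: \beta_1 = 0$ is known as the hypothesis of the significance of the regression.

In Example 8.1, test the hypothesis $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$ at significance level $\alpha = 0.05$.

The value of the test statistic is

$$t = \frac{0.48303}{\sqrt{0.90303/8250}} = 46.1689.$$

Since $t = 46.1689 > t_{0.025} = 2.306$ (8 degrees of freedom), we reject $H_0: \beta_1 = 0$ in favor of the alternative hypothesis $H_1: \beta_1 \neq 0$, and conclude that the regression is significant.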
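The test statistic for the significance of the regression can be checked with a short computation. This is a sketch in plain Python under the same setup as Example 8.1; the critical value 2.306 is the table value quoted in the text, not something computed here.

```python
from math import sqrt

# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]
T_CRIT = 2.306  # t(0.025; 8), table value

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
mse = (syy - b1 * sxy) / (n - 2)

beta_10 = 0.0                                # hypothesized slope under H0
t_stat = (b1 - beta_10) / sqrt(mse / sxx)    # observed test statistic
reject = abs(t_stat) > T_CRIT                # two-sided test at the 5% level
print(f"t = {t_stat:.4f}, reject H0: {reject}")
```

Changing `beta_10` tests any other hypothesized slope with the same statistic.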

 

8.8 Testing the Significance of the Regression by Analysis of Variance

 

In order to test the hypothesis $H_0: \beta_1 = 0$ at the 5% significance level using an F test, we reproduce here the ANOVA table of Example 8.1 shown in Figure 8.4.

 

 

 

SV            SS             DF    MS           F
Regression    1924.875758    1     1924.8757    2131.5738
Error         7.22424243     8     0.90303
Total         1932.10        9

 

                

 

The test statistic for the above hypothesis is

$F = \dfrac{MSR}{MSE}$.

The observed value of the test statistic is $f = 2131.5738$. Since $f = 2131.5738 > F_{0.05;\,1,\,8} = 5.32$, the critical value from the F distribution with 1 and $n - 2 = 8$ degrees of freedom, we reject $H_0: \beta_1 = 0$ in favor of the alternative hypothesis $H_1: \beta_1 \neq 0$ at the 5% level of significance.
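The ANOVA quantities are easy to recompute, and doing so also shows that in simple linear regression the F statistic is the square of the t statistic of Section 8.7. The sketch below is plain Python; the critical value 5.32 is the usual table value for F with 1 and 8 degrees of freedom at the 5% level, stated here as an assumption rather than computed.

```python
from math import sqrt

# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]
F_CRIT = 5.32  # F(0.05; 1, 8), table value (about 2.306 squared)

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx

ssr = b1 * sxy            # regression sum of squares, 1 degree of freedom
sse = syy - ssr           # error sum of squares, n - 2 degrees of freedom
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse
t_stat = b1 / sqrt(mse / sxx)

print(f"MSR = {msr:.4f}, MSE = {mse:.5f}, F = {f_stat:.4f}")
print(f"t^2 = {t_stat ** 2:.4f}, reject H0: {f_stat > F_CRIT}")
```

The two tests therefore always lead to the same conclusion for a two-sided alternative in the one-predictor case.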

 

Testing the Significance of the Regression Using Statistica

Using a t-test we follow the steps:

 

  1. Enter the values of $x$ in one column, say Var1, and the corresponding $y$ values in another column, say Var2.
  2. Statistics / Multiple Regression (to get Figure 8.7), then click Advanced

Figure 8.7 Multiple Regression Setup

  3. Variables (select the dependent and independent variables) / OK
  4. OK (to get Figure 8.8)

Figure 8.8 Regression Results

  5. In Advanced, click Summary: Regression results. You will get Figure 8.9.

 

 

 

 

 

 

 

 

 


Figure 8.9 Regression Summary

The estimates of the regression parameters are $\hat{\beta}_0 = -2.7394$ and $\hat{\beta}_1 = 0.4830$ (see column B of Figure 8.9).

Since the p-value for testing the null hypothesis $H_0: \beta_1 = 0$ against the alternative hypothesis $H_1: \beta_1 \neq 0$ is reported as 0 (see the column labeled p-level), we reject the null hypothesis at any usual significance level $\alpha$ in favor of the alternative hypothesis.

 

 

Using an F test, follow the steps:

 

  1. Enter the values of $x$ in one column, say Var1, and the corresponding $y$ values in another column, say Var2.
  2. Statistics / Multiple Regression (to get Figure 8.7), then click Advanced
  3. Variables (select the dependent and independent variables) / OK / OK
  4. In Advanced, click ANOVA (Overall Goodness of Fit)

For the data in Example 8.1, we have $F = 2131.57$ with a p-value of essentially 0 (see Figure 8.4), so that we reject the null hypothesis $H_0: \beta_1 = 0$ at any usual $\alpha$ and accept the alternative hypothesis $H_1: \beta_1 \neq 0$, indicating that the regression of $y$ on $x$ is significant whether we are testing at the 1% or the 5% level of significance.

 

8.9 Checking Model Assumptions

 

We now discuss how to verify the assumptions that the random errors are normally distributed and that they have a constant variance.

 

Checking the Assumption of Normality

To check the assumption that the errors follow a normal distribution, a normal probability plot of residuals is drawn.  If the plot is approximately linear, then the assumption is justified, otherwise, the assumption is not justified. 

In Statistica, to get the normal probability plot of residuals, we follow the steps below assuming that we have the data of Example 8.1 in the Multiple Regression Module.

  1. Statistics / Multiple Regression
  2. Variable / Select the dependent and independent variables / OK / OK
  3. In Residual / assumption / prediction, click Perform residual analysis
  4. In Quick, click Normal plot of residuals

 

Since the normal probability plot of residuals (see Figure 8.10) for the data in Example 8.1 exhibits a linear trend, the normality assumption is justified.

 

 

 

 

 

 

 

 

 

 

 

 

 


Figure 8.10 Normal probability Plot
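The idea behind the normal probability plot can be sketched numerically: sort the residuals, pair them with approximate normal scores, and see how close the pairs are to a straight line. The Python sketch below uses Blom's plotting positions $(i - 0.375)/(n + 0.25)$, which is one common convention and an assumption here; Statistica may use slightly different positions. The correlation between the two columns is used only as a rough summary of linearity.

```python
from math import sqrt
from statistics import NormalDist

# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]


def correlation(u, v):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / sqrt(suu * svv)


n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Expected normal scores for the ordered residuals (Blom's positions).
ordered = sorted(residuals)
scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

for z, e in zip(scores, ordered):
    print(f"{z:8.3f} {e:9.4f}")
r_qq = correlation(scores, ordered)
print(f"correlation between normal scores and ordered residuals: {r_qq:.3f}")
```

A correlation close to 1 corresponds to a nearly straight normal probability plot; with only 10 observations this is an informal check rather than a formal test.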

 

           

 

Checking the Assumption of Constancy of Variance

To check the assumption that the errors have a constant variance, a graph of the residuals versus the independent variable is plotted.  If the graph shows no pattern, then the assumption is justified.  Otherwise, the assumption is not justified.

 

In Statistica, to plot the graph of the residuals versus the independent variable, we proceed as follows assuming that we have the data in multiple regression module:

 

 

  1. Statistics / Multiple Regression
  2. Variable / Select the dependent and independent variables / OK
  3. In Residuals / assumptions / prediction, click Perform residual analysis
  4. Under Type of residual, click Raw residuals
  5. In Residuals, click Residuals vs. independent var. and select the independent variable

The graph in Figure 8.11 is the residual plot for Example 8.1. Since it does not exhibit any pattern, we conclude that the constant variance assumption is justified. The value $r \approx 0$ reported with the plot states that the raw residuals and the temperatures are almost uncorrelated.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 


    Figure 8.11 Graph of Residuals vs Independent Variable
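The near-zero correlation between the raw residuals and the temperatures can be confirmed directly. The sketch below is plain Python with illustrative names; it lists the points of the residual plot and computes that correlation.

```python
from math import sqrt

# Data of Example 8.1: temperature (x) and product yield (y).
x = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
y = [45, 51, 54, 61, 66, 70, 74, 78, 85, 89]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Points of the residual plot: residual against the independent variable.
for xi, e in zip(x, residuals):
    print(f"{xi:5d} {e:9.4f}")

# Correlation between x and the residuals (the residual mean is zero).
sxe = sum((xi - xbar) * e for xi, e in zip(x, residuals))
see = sum(e ** 2 for e in residuals)
r_xe = sxe / sqrt(sxx * see)
print(f"correlation between x and residuals: {r_xe:.6f}")
```

For a least squares fit with an intercept this correlation is zero by construction, so it cannot reveal non-constant variance on its own; the absence of a funnel or curved pattern in the plot is what supports the assumption.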

 

 

8.10 Multiple Linear Regression

 

The multiple regression model is a mathematical model which explains the relationship between the dependent variable and two or more independent variables. For example, a manufacturer may want to model the quality ($y$) of a product as a function of the temperature ($x_1$) and pressure ($x_2$) at which it is produced.

The multiple linear regression model with $k$ independent variables is given by

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon,$$

where $\beta_0$ and $\varepsilon$ are the intercept and the random error term respectively. We shall refer to the $\beta_i$ ($i = 0, 1, \ldots, k$) in the model as the regression parameters.

Example 8.2 Consider the problem of predicting gasoline mileage ($y$, in miles per gallon), where the independent variables are fuel octane rating ($x_1$) and average speed ($x_2$, in miles per hour). The sample data obtained from 20 test runs with cars at various speeds are as follows:

 

 


Mileage (y)   Octane (x1)   Speed (x2)
24.8          88            52
30.6          93            60
31.1          91            28
28.2          90            52
31.6          90            55
29.9          89            46
31.5          92            58
27.2          87            46
33.3          94            55
32.6          95            62
30.6          88            47
28.1          89            58
25.2          90            63
35.0          93            54
29.2          91            53
31.9          92            52
27.7          89            52
31.7          94            53
34.2          93            54
30.1          91            58

 

Estimate the linear regression model and interpret your results.

 

Solution To solve this problem by using Statistica, you must be in Multiple Regression module and follow the steps:

 

  1. Enter the values of each variable in a separate column ( or variable)
  2. Statistics / Multiple Regression
  3. Variables ( select the dependent variable from the list on the left)
  4. Hold down the Ctrl key and select the independent variables from the list on the right.
  5. OK
  6. In Advanced Click Summary: Regression Results

 

For the data in Example 8.2, the estimates of the regression parameters $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_2$ can be read from the column labeled B in Figure 8.12.

Figure 8.12 Regression Summary

 

Thus, the predicted multiple linear regression model for the given data is

 

If 'average speed' ($x_2$ or Var3) is held fixed, it is estimated that a 1-unit increase in octane ($x_1$ or Var2) would result in a 1.0193-unit increase in the expected 'gasoline mileage'. Similarly, if 'octane' ($x_1$ or Var2) is held fixed, it is estimated that a 1-unit increase in average speed ($x_2$ or Var3) would result in a decrease in the expected 'gasoline mileage' equal to the magnitude of the Var3 coefficient in Figure 8.12.
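For readers without Statistica, the least squares estimates in a multiple regression can be obtained by solving the normal equations $(X^{T}X)\hat{\beta} = X^{T}y$. The sketch below is a minimal plain-Python illustration; the function names `solve` and `fit_multiple` are assumptions made here, and the small data set is made up so that the exact answer is known. It is not the data of Example 8.2.

```python
def solve(a, b):
    """Solve the square system a * beta = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of the column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_multiple(rows, y):
    """rows: one tuple of predictor values per observation.
    Returns the estimates [b0, b1, ..., bk]."""
    X = [[1.0, *row] for row in rows]              # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up illustration: y = 1 + 2*x1 - 3*x2 exactly, so the fit should recover it.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coef = fit_multiple(rows, y)
print([round(c, 6) for c in coef])
```

To apply it to Example 8.2, pass the (octane, speed) pairs as `rows` and the mileage values as `y`; the result can then be compared with column B of Figure 8.12.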

 


Exercises

 

8.1     (cf. Devore, J. L., 2000, 510). The following data represent the burner area liberation rate ($x$) and the NOx emission rate ($y$):

x:   100   125   125   150   150   200   200   250   250   300   300   350   400
y:   150   140   180   210   190   320   280   400   430   440   390   600   610

(a)    Assuming that the simple linear regression model is valid, obtain the least squares estimate of the true regression line.

(b)   What is the estimate of the expected NOx emission rate when the burner area liberation rate equals 225?

(c)    Estimate the amount by which you expect the NOx emission rate to change when the burner area liberation rate is decreased by 50.

 

8.2     (cf. Devore, J. L., 2000, 510). The following data represent the wet deposition of NO3 ($x$) and lichen N in % dry weight ($y$):

x:   0.05   0.10   0.11   0.12   0.31   0.42   0.58   0.68   0.68   0.73   0.85
y:   0.48   0.55   0.48   0.50   0.58   0.52   0.86   1.0    0.86   0.88   1.04

(a)    What are the least squares estimates of $\beta_0$ and $\beta_1$?

(b)   Predict lichen N for an NO3 deposition value of 0.5.

(c)    Test the significance of regression at the 5% level of significance.

 

8.3     (Devore, J. L., 2000, 510). The following data represent $x$ = available travel space in feet and $y$ = separation distance:

x:   12.8    12.9    12.9    13.6    14.5    14.6    15.1    17.5    19.5    20.8
y:    5.5     6.2     6.3     7.0     7.8     8.3     7.1    10.0    10.8    11.0

(a)    Derive the equation of the estimated line.

(b)   What separation distance would you predict if the available travel space value is 15.0?

 

8.4     (Devore, J. L., 2000, 511). Consider the following data set in which the variables of interest are $x$ = commuting distance and $y$ = commuting time:

x:   15   16   17   18   19   20    5   10   15    20   25   50     5   10   15   20   25   50
y:   42   45   35   42   49   46   16   32   44   45   63   115   8   16   22   23   31   60

Obtain the least squares estimate of the regression model.

 

8.5     (cf. Devore, J. L., 2000, 584). Soil and sediment adsorption, the extent to which chemicals collect in a condensed form on the surface, is an important characteristic influencing the effectiveness of pesticides and various agricultural chemicals. The article “Adsorption of Phosphate, Arsenate, Methanearsonate, and Cacodylate by Lake and Stream Sediments: Comparisons with Soils” (J. of Environ. Qual., 1984: 499-504) gives the accompanying data on $y$ = phosphate adsorption index, $x_1$ = amount of extractable iron, and $x_2$ = amount of extractable aluminum.

x1:   61   175   111   124   130   173   169   169   160   244   257   333   199
x2:   13    21    24    23    64    38    33    61    39    71   112    88    54
y:     4    18    14    18    26    26    21    30    28    36    65    62    40

(a)    Find the least squares estimates of the parameters and write the equation of the estimated model.

(b)   Make a prediction of the adsorption index resulting from extractable iron = 250 and extractable aluminum = 55.

(c)    Test the null hypothesis that  against the alternative hypothesis that  at 5% level of significance.

 

8.6     (Johnson, R. A., 2000, 345). The following table shows how many weeks a sample of 6 persons have worked at an automobile inspection station and the number of cars each one inspected between noon and 2 P.M. on a given day:

Number of weeks employed (x):    2    7    9    1    5   12
Number of cars inspected (y):   13   21   23   14   15   21

(a)      Find the equation of the least squares line, which will enable us to predict $y$ in terms of $x$.

(b)     Use the result of part (a) to estimate how many cars someone who has been working at the inspection station for 8 weeks can be expected to inspect during the given 2-hour period.

 

8.7     (cf. Devore, J. L., 2000, 590). An investigation of a die casting process resulted in the accompanying data on $x_1$ = furnace temperature, $x_2$ = die close time and $y$ = temperature difference on the die surface (“A Multiple Objective Decision Making Approach for Assessing Simultaneous Improvement in Die Life and Casting Quality in a Die Casting Process,” Quality Engineering, 1994: 371-383).

x1:   1250   1300   1350   1250   1300   1250   1300   1350   1350
x2:      6      7      6      7      6      8      8      7      8
y:      80     95    101     85     92     87     96    106    108

(a)    Write the equation of the estimated model.

(b)   Test the null hypothesis that  against the alternative hypothesis that  at 5% level of significance.

 

 

8.8     (cf. Johnson, R. A., 2000, 334). The following are measurements of the air velocity and evaporation coefficient of burning fuel droplets in an impulse engine:

Air velocity (cm/sec):                 20     60     100    140    180    220    160    300    340    380
Evaporation coefficient (mm2/sec):     0.18   0.37   0.35   0.78   0.56   0.75   1.18   1.36   1.17   1.65

(a)    Fit a straight line to the data by the method of least squares and use it to estimate the evaporation coefficient of a droplet when the air velocity is 190 cm/sec.

(b)   Test the null hypothesis that  against the alternative hypothesis at the 0.05 level of significance.

 

8.9     (Johnson, R. A., 2000, 344). A chemical company, wishing to study the effect of extraction time on the efficiency of an extraction operation, obtained the data shown in the following table:

Extraction time (minutes) (x):     27   45   41   19   35   39   19   49   15   31
Extraction efficiency (%) (y):     57   64   80   46   62   72   52   77   57   68

(a)    Draw a scattergram to verify that a straight line will provide a good fit to the data.

(b)   Draw a straight line to predict the extraction efficiency one can expect when the extraction time is 35 minutes.

 

8.11   (cf. Johnson, R. A., 2000, 347). The cost of manufacturing a lot of a certain product depends on the lot size, as shown by the following sample data:

Cost (Dollars):   30   70   140   270   530   1010   2500   5020
Lot Size:          1    5    10    25    50    100    250    500

(a)    Draw a scattergram to verify the assumption that the relationship is linear, letting lot size be $x$ and cost be $y$.

(b)   Fit a straight line to these data by the method of least squares, using lot size as the independent variable, and draw its graph on the diagram obtained in part (a).

 

8.12     (Johnson, R. A., 2000, 345). In the following table, $x$ is the tensile force applied to a steel specimen in thousands of pounds, and $y$ is the resulting elongation in thousandths of an inch:

x:    1    2    3    4    5    6
y:   14   33   40   63   76   85

(a)      Graph the data to verify that it is reasonable to assume that the regression of $y$ on $x$ is linear.

(b)     Find the equation of the least squares line, and use it to predict the elongation when the tensile force is 3.5 thousand pounds.

8.13   (Johnson, R. A., 2000, 385). The following are data on the number of twists required to break a certain kind of forged alloy bar and the percentages of two alloying elements present in the metal:

No. of twists (y):        41   49   69   65   40   50   58   57   31   36   44   57   19   31   33   43
% of Element A (x1):       1    2    3    4    1    2    3    4    1    2    3    4    1    2    3    4
% of Element B (x2):       5    5    5    5   10   10   10   10   15   15   15   15   20   20   20   20

Fit a least squares regression plane and use its equation to estimate the number of twists required to break one of the bars when $x_1$ = 2.5 and $x_2$ = 12.

 

8.14   (Johnson, R. A., 2000, 263). Twelve specimens of cold-reduced sheet steel, having different copper contents and annealing temperatures, are measured for hardness with the following results:

Hardness:          78.9   65.1   55.2   56.4   80.9   69.7   57.4   55.4   85.3   71.8   60.7   58.9
Copper content:    0.02   0.02   0.02   0.02   0.10   0.10   0.10   0.10   0.18   0.18   0.18   0.18
Annealing temp.:   1000   1100   1200   1300   1000   1100   1200   1300   1000   1100   1200   1300

Fit an equation of the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$, where $x_1$ represents the copper content, $x_2$ represents the annealing temperature, and $y$ represents the hardness.

 

 

8.15     Suppose the following data give the mass of adults in kilograms sampled from three villages:

 

71.5          68.0

62.0          65.0

60.5          62.8

63.3          73.5

71.3         73.1

58.4          58.7

73.6          74.1

64.7          72.5

65.5          58.1

78.1          73.6

66.0          73.0

50.6          58.5

66.5          76.5

72.6          66.8

62.1          52.6

74.3         73.3

71.1         71.9

64.5          58.8

76.3          76.0

65.4          65.0

60.6          63.7

62.0          64.0

59.1          69.9

62.9          61.6

62.8          69.6

69.7          69.8

60.2          67.2

72.9          69.2

77.1          78.5

63.4          58.2

 

(a)    Assuming that these samples are independent, run t-tests to determine which of the villages have identical mean mass of adults, stating clearly the hypotheses you are testing. State your conclusions based on the p-value as well as the t-value.

(b)   State the assumption under which your tests are valid.

 

8.16     (cf. Dougherty, 1990, 595). When smoothing a surface with an abrasive, the roughness of the finished surface decreases as the abrasive grain becomes finer. The following data give measurements of surface roughness $y$ (in micrometers) in terms of the grit numbers $x$ of the grains, finer grains possessing larger grit numbers.

x:   24     30     36     46     54     60
y:   0.34   0.30   0.28   0.22   0.19   0.18

(a) Draw a scatter diagram. Do you recommend fitting a linear regression model?

(b) How strong is the linear correlation between the two variables?

(c) Do you think that there is strong nonlinear correlation between the two variables?

 

8.17     (cf. Johnson, R. A., 2000, 578). The article “How to Optimize and Control the Wire Bonding Process: Part II” (Solid State Technology, Jan. 1991: 67-72) described an experiment carried out to assess the impact of the variables $x_1$ = force (gm), $x_2$ = power (mW), $x_3$ = temperature (°C), and $x_4$ = time (ms) on $y$ = ball bond shear strength (gm). The following data were generated to be consistent with the information given in the article:

Observation   Force   Power   Temperature   Time   Strength
 1            30      60      175           15     26.2
 2            40      60      175           15     26.3
 3            30      90      175           15     39.8
 4            40      90      175           15     39.7
 5            30      60      225           15     38.6
 6            40      60      225           15     35.5
 7            30      90      225           15     48.8
 8            40      90      225           15     37.8
 9            30      60      175           25     26.6
10            40      60      175           25     23.4
11            30      90      175           25     38.6
12            40      90      175           25     52.1
13            30      60      225           25     39.5
14            40      60      225           25     32.3
15            30      90      225           25     43.0
16            40      90      225           25     56.0
17            25      75      200           20     35.2
18            45      75      200           20     46.9
19            35      45      200           20     22.7
20            35      105     200           20     58.7
21            35      75      150           20     34.5
22            35      75      250           20     44.0
23            35      75      200           10     35.7
24            35      75      200           30     41.8
25            35      75      200           20     36.5
26            35      75      200           20     37.6
27            35      75      200           20     40.3
28            35      75      200           20     46.0
29            35      75      200           20     27.8
30            35      75      200           20     40.3

(a)    Find the least squares estimates of the parameters and write the equation of the estimated model.

(b)   Make a prediction of the strength resulting from a force of 35 gm, power of 75 mW, temperature of 200 degrees and time of 20 ms.

(c)    Test the null hypothesis that  against the alternative hypothesis that  at 5% level of significance.