Linear Regression

May 12 2021

Objectives

Understand linear regression and describe a case study
Compute and interpret coefficients in a linear regression analysis in R.
Interpolate regression model in R to produce a raster layer.

Science

What are we doing?

Explain
Predict

Let’s stop for a moment before we begin and discuss science. Are there a few scientists out there? Will we ever explain and predict everything about soil? Or plants? Or animals? No. In high school I had a science teacher, Mr. Wilkening, who always asked if anyone had questions about the assignments due that day. One time, he simply asked, “Any questions?” As a joke someone asked, “What is the meaning of life?” He calmly replied, “42.” Then he asked, “Any other questions?” Unfortunately, no one in the class, including me, understood his clever reference to the popular book “The Hitchhiker’s Guide to the Galaxy” at the time. Has anyone read this book? I still haven’t. We will never know everything there is to know about soil (or the meaning of life). We can, however, explain and predict many interesting things about soil, and today we will discuss one way of doing both with a statistical analysis called linear regression.

Connections

Why is it windy in Iowa?

Missouri sucks and Minnesota blows
No connection

Where do babies come from?

Storks deliver babies
No connection

Why are basements in Iowa full of cracks?

Soils in Iowa contain high amounts of shrink-swell clays
Connection

If we want to predict or explain something, like why it’s windy in Iowa, we need to look for connections. If this question reminds you of a joke, you of course know the answer: it’s because Missouri sucks and Minnesota blows. This might be a funny joke (if you’re from Iowa). Unfortunately, there isn’t a connection between how lousy a person thinks a couple of surrounding states are and the average wind speed of their own state. This example demonstrates the importance of having a connection between a dependent variable and an independent variable. The dependent variable is the one we are trying to predict. The independent variable is the one we are using to make the prediction. Another famous example demonstrating the importance of a connection is an attempt to explain where babies come from. If you plot the human birth rate against the number of storks across several European countries, as Robert Matthews, a professor at Aston University, has already done, you might come to the incorrect conclusion that storks deliver babies. If I ask why basements in Iowa are full of cracks, you might predict it has something to do with the soil surrounding the basements. This is an example of a dependent and an independent variable that do have a connection.

Predictions and Predicting with a Function

Predictions

None of us will report to work on Sunday
The average price of a gallon of gas in the US will be $100.00 on January 1

Predicting with a function

\[y= \beta_0 + \beta_1x + \epsilon\]

So let’s make some predictions! Making predictions is easy. I predict none of us will report to work on Sunday. This is probably a good prediction, since Sunday is not part of the typical work week. I predict the average price of a gallon of gas in the US will be $100.00 on January 1. This is far less likely to be a good prediction. We don’t want to just make predictions, we want to make good predictions. How? We can use a function. This is a linear prediction function which includes a random error term. Y is the dependent variable, what we want to predict. Beta zero is a constant term, the intercept. It represents the predicted value of y when x = 0. What is x? X is the independent variable, what you are using to make the prediction. Going back now to beta one, this term is the slope of the line: the predicted change in y for each one-unit change in x. Finally, at the end of the function is epsilon, the random error term I mentioned before. Without the random error term, the function can only represent an exact linear relationship between y and x. Including the random error term allows us to account for any deviation of the actual y values in our dataset from our predictions. To put it another way, the random error term accounts for other unpredictable factors. Why are we using a linear function? Our topic is linear regression, right? It turns out lines can do a pretty good job of predicting dependent variables as long as the slope doesn’t change as x changes. If it did, it wouldn’t really be a line.
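A minimal sketch of the prediction function in R, using simulated data. The intercept, slope, and error standard deviation below are made-up values chosen only for illustration; they do not come from any dataset in this lesson.

```r
# Simulate data from y = beta0 + beta1 * x + epsilon (all values are illustrative assumptions)
set.seed(42)
beta0 <- 2      # intercept: predicted y when x = 0
beta1 <- 0.5    # slope: predicted change in y per one-unit change in x
x <- seq(0, 10, length.out = 50)
epsilon <- rnorm(length(x), mean = 0, sd = 1)  # random error term

# Without epsilon the points fall exactly on the line
y_exact <- beta0 + beta1 * x

# With epsilon the observations scatter around the line
y <- y_exact + epsilon

# Fitting a line to the noisy data gives estimates near the values used above
sim_fit <- lm(y ~ x)
coef(sim_fit)
```

The estimated coefficients will not equal 2 and 0.5 exactly, because the error term moves each observation off the line.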

Assumptions

Linear
Independent
Homoscedastic
Normal

For linear regression to work well, there are 4 assumptions we must understand. As I just mentioned, the relationship between what you are using to predict and what’s being predicted needs to be linear. Suppose I wanted to predict percent clay in soils with argillic horizons using soil depth. The typical “clay bulge” occurring with depth doesn’t fit a linear relationship, so linear regression won’t work well in this case. The second assumption is the independence of errors. This means the errors are not related to each other. Errors may not be independent if they tend to be similar within specific conditions, for example among observations collected close together. The third assumption is that the errors are homoscedastic, meaning they have a constant variance across the range of predictions. If the variance is not constant (heteroscedastic), the function might, for example, be accurate when predicting low values but not when predicting high values. Finally, the last assumption made in linear regression is that the errors are normally distributed. We will revisit these assumptions a bit later.

Interpreting the Linear Regression Model

Let’s go back to the function again with some visuals. I generated this plot from a sample dataset available in R called trees. This dataset contains 31 observations of the height, girth, and volume of black cherry trees. This plot shows height on the y axis and volume on the x axis. You can see the equation here again. I’ve added annotations to the plot to help explain each term in the equation, and an example of what the terms would be in this particular case. First, the y term, which is our y axis: Height. Then beta zero, which is the value of y where the line crosses or “intercepts” the y axis. This also corresponds to the value of y when x equals zero. The next term, beta one, is the calculated slope of the line. We can calculate this by dividing the change in y, shown as delta y on the plot, by the change in x. The x term is the x axis, Volume, and finally epsilon, the error term, is the difference between an actual observation and the line.
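The terms described above can also be pulled out numerically. This sketch uses base R's `lm()` on the built-in `trees` data rather than the `ols()` function used later in the lesson; the volume of 30 used for the prediction is an arbitrary illustrative value.

```r
# Fit Height as a function of Volume on the built-in trees dataset
tree_fit <- lm(Height ~ Volume, data = trees)

# beta zero (intercept) and beta one (slope)
b <- coef(tree_fit)
b

# Predicted height for a hypothetical tree with a volume of 30
pred_30 <- b[["(Intercept)"]] + b[["Volume"]] * 30

# epsilon: the difference between each observed height and the line
eps <- trees$Height - fitted(tree_fit)
head(eps)
```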

Building the Linear Regression Model

Ordinary Least Squares
\[\beta_1= \frac{\sum(x_i - \bar x) (y_i - \bar y)} {\sum(x_i - \bar x)^2}\]
\[\beta_0= \bar y - \beta_1 \times \bar x\]
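As a check on the two ordinary least squares formulas, they can be computed by hand and compared with the coefficients from `lm()`. This is a sketch on the built-in `trees` data, with Volume as x and Height as y.

```r
x <- trees$Volume
y <- trees$Height

# Slope: sum of cross-products of deviations over sum of squared x deviations
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: forces the line through the point (mean of x, mean of y)
beta0 <- mean(y) - beta1 * mean(x)

c(beta0 = beta0, beta1 = beta1)

# The hand-computed values should agree with lm()
ols_check <- all.equal(unname(coef(lm(Height ~ Volume, data = trees))), c(beta0, beta1))
ols_check
```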

Examples

Wills et al., 2013

Carbon equivalent correction regression factor: \[OC_{dc}= 0.25 + 0.86(OC_{wc})\] where

$OC_{dc}=$ organic carbon by dry combustion (%)

$OC_{wc}=$ organic carbon by wet combustion (%)
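Applying a published regression like this one is just arithmetic. The helper below is a sketch; the function name `oc_wet_to_dry` and the input values are hypothetical and only illustrate the equation above.

```r
# Convert organic carbon by wet combustion (%) to the dry combustion equivalent (%)
# using OC_dc = 0.25 + 0.86 * OC_wc (function name is a hypothetical choice)
oc_wet_to_dry <- function(oc_wc) {
  0.25 + 0.86 * oc_wc
}

# Illustrative wet combustion values of 0.5, 1, and 2 percent
oc_dc <- oc_wet_to_dry(c(0.5, 1, 2))
oc_dc
# 0.68 1.11 1.97
```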

Examples

Simple vs. Multiple Linear Regression

Simple linear regression (SLR):
Y is predicted from one independent variable ($x$)
Multiple linear regression (MLR): Y is predicted from two or more independent variables ($x_1,x_2,x_3...$)
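In R the difference between the two shows up only in the formula. A brief sketch with the built-in `trees` data, using `lm()`; adding Girth as a second predictor is an illustrative choice, not a recommended model.

```r
# Simple linear regression: one independent variable
slr <- lm(Height ~ Volume, data = trees)

# Multiple linear regression: two independent variables
mlr <- lm(Height ~ Volume + Girth, data = trees)

coef(slr)  # intercept plus one slope
coef(mlr)  # intercept plus one slope per predictor
```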

Assumptions

Linear
Independent
Homoscedastic
Normal

Testing Model Assumptions

Normality
- Histograms
- QQ plots
- Residual plots
Outliers
- QQ plots
- Box plots
Multicollinearity
- Correlation $\geq0.7$ or $\leq-0.7$ indicates highly correlated predictors
- Variance inflation factors
Homoscedasticity
- Residual plots
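The checks listed above can be sketched with base R graphics. This example assumes an `lm()` fit on the built-in `trees` data. The variance inflation factor is computed by hand as 1 / (1 - R²) from regressing one predictor on the other, which is the standard definition rather than a call to a package function.

```r
fit_lm <- lm(Height ~ Volume, data = trees)
res <- residuals(fit_lm)

# Normality: histogram and QQ plot of the residuals
hist(res, main = "Histogram of residuals", xlab = "Residuals")
qqnorm(res)
qqline(res, lty = 2)

# Outliers: box plot of the residuals
boxplot(res, horizontal = TRUE, main = "Box plot of residuals")

# Homoscedasticity: residuals should form an even band around zero
plot(fitted(fit_lm), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Multicollinearity: flag a pair of candidate predictors with |r| >= 0.7
r <- cor(trees[, c("Girth", "Volume")])
high_cor <- abs(r["Girth", "Volume"]) >= 0.7

# Variance inflation factor for Girth if Girth and Volume were both predictors
vif_girth <- 1 / (1 - summary(lm(Girth ~ Volume, data = trees))$r.squared)
```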

Heteroscedasticity

Analysis of Residuals

Heteroscedasticity
- causes estimates of regression coefficients to be less precise
Non-normality
- compromises interpretability of significance tests of the regression coefficients
Multicollinearity
- over-estimates the variances of the regression coefficients
Spatial Autocorrelation
- results in an underestimation of the standard error of the estimates of the regression coefficients and a bias towards rejecting the $H_0$ that the value of the coefficient is zero

Interpreting Model Results

fit

## Linear Regression Model
##  
##  ols(formula = Height ~ Volume, data = d)
##  
##                  Model Likelihood    Discrimination    
##                        Ratio Test           Indexes    
##  Obs      31    LR chi2     13.73    R2       0.358    
##  sigma5.1931    d.f.            1    R2 adj   0.336    
##  d.f.     29    Pr(> chi2) 0.0002    g        4.150    
##  
...

Interpreting Model Results

fit

...
##  
##  Residuals
##  
##       Min       1Q   Median       3Q      Max 
##  -10.7777  -2.9722  -0.1515   2.0804  10.6426 
##  
##  
##            Coef    S.E.   t     Pr(>|t|)
##  Intercept 69.0034 1.9744 34.95 <0.0001 
##  Volume     0.2319 0.0577  4.02 0.0004  
## 
...

The next part of the summary lists the estimate of the y-intercept and the slope of each independent variable. It also provides the standard error.

The t-values test the hypothesis that each coefficient is different from 0. You can get a t-value by dividing the coefficient by its standard error. The magnitude of the t-value also gives a rough indication of the importance of a variable in the model.

The next column to the right is the two-tail p-values which test the hypothesis that each coefficient is different from 0.

The last part is the residual standard error (labeled sigma in the ols output), which is the square root of the residual sum of squares divided by the degrees of freedom. For a model with one predictor, the degrees of freedom are calculated by taking the number of observations and subtracting 2 (n-2).
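The quantities in the coefficient table can be recomputed by hand. This sketch refits the same model with base R's `lm()` (an assumption made so the example is self-contained) and checks the results against the values `summary()` reports.

```r
fit_lm <- lm(Height ~ Volume, data = trees)
est <- coef(summary(fit_lm))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t-value = coefficient / standard error
t_val <- est[, "Estimate"] / est[, "Std. Error"]

# Degrees of freedom for one predictor: n - 2
df_res <- nrow(trees) - 2

# Two-tailed p-value from the t distribution
p_val <- 2 * pt(abs(t_val), df = df_res, lower.tail = FALSE)

# Residual standard error = sqrt(residual sum of squares / degrees of freedom)
rse <- sqrt(sum(residuals(fit_lm)^2) / df_res)

round(cbind(t_val, p_val), 4)
rse
```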

$R^2$ shows the proportion of the variance of Y explained by X. Adjusted $R^2$ shows the same as $R^2$ but adjusted for the # of cases and # of variables. When the # of variables is small and the # of cases is very large, Adj $R^2$ is closer to $R^2$. Adjusted $R^2$ provides a more honest measure of the association between X and Y.

The F-statistic tests the null hypothesis that all the slope coefficients in the model are 0.

The p-value of the model tests whether $R^2$ is different from 0.

Diagnostic Plots

Residuals vs Fitted
Quantile - Quantile (QQ)
Spread-Location
Leverage Plot

Residuals vs Fitted

library(ggplot2)
# fit is the ols() model of Height ~ Volume
# Residuals against fitted values, with a loess smooth and a dashed zero line
ggplot() +
  geom_point(aes(x = fit$fitted.values, y = fit$residuals)) +
  geom_smooth(aes(x = fit$fitted.values, y = fit$residuals),
              method = "loess", formula = "y ~ x", se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted Values", y = "Residuals")

QQ Plot

# Normal Q-Q plot of the raw (unstandardized) model residuals
ggplot() +
  geom_qq(aes(sample = fit$residuals)) +
  geom_qq_line(aes(sample = fit$residuals), linetype = "dashed") +
  labs(title = "Normal Q-Q", x = "Theoretical Quantiles \n ols(Height ~ Volume)",
       y = "Residuals")

Spread-Location

# Square root of the absolute residuals against fitted values
ggplot() +
  geom_point(aes(x = fit$fitted.values, y = sqrt(abs(fit$residuals)))) +
  geom_smooth(aes(x = fit$fitted.values, y = sqrt(abs(fit$residuals))),
              method = "loess", formula = "y ~ x", se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Scale-Location", x = "Fitted Values \n ols(Height ~ Volume)",
       y = "Square Root of |Residuals|")

Leverage Plot

# Relabel the ols fit as an lm object so that cooks.distance() uses the lm method
class(fit) <- "lm"
# Raw residuals against leverage, with point size scaled by Cook's distance
ggplot() +
  geom_point(aes(x = hat(fit$fitted.values), y = fit$residuals, size = cooks.distance(fit))) +
  geom_smooth(aes(x = hat(fit$fitted.values), y = fit$residuals),
              method = "loess", formula = "y ~ x", se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Residuals vs Leverage", x = "Leverage \n ols(Height ~ Volume)",
       y = "Residuals")

Unlike the other plots, this time patterns are irrelevant. Leverage is a measure of how much each data point influences the regression. Because the regression must pass through the centroid, points that lie far from the centroid have greater leverage, and their leverage increases if there are fewer points nearby. As a result, leverage reflects both the distance from the centroid and the isolation of a point. The plot also shows values of Cook’s distance, which measures how much the regression would change if a point were deleted. Cook’s distance is increased by leverage and by large residuals: a point far from the centroid with a large residual can severely alter the regression. On this plot, you want to see that the blue smoothed line stays close to the horizontal dashed line and that no points have a large Cook’s distance (i.e. > 0.5). What do you notice about this plot?
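Leverage and Cook’s distance can also be inspected as numbers. The sketch below uses a base R `lm()` fit on the built-in `trees` data and the base functions `hatvalues()` and `cooks.distance()`. For a single predictor, leverage has the closed form 1/n + (x - x̄)² / Σ(x - x̄)², which makes the "distance from the centroid" idea explicit.

```r
fit_lm <- lm(Height ~ Volume, data = trees)

# Leverage (diagonal of the hat matrix) and Cook's distance for each observation
lev <- hatvalues(fit_lm)
cook <- cooks.distance(fit_lm)

# Closed-form leverage for simple linear regression
x <- trees$Volume
lev_manual <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# The observation farthest from the mean of x has the greatest leverage
which.max(lev_manual)

# Observations exceeding the 0.5 rule of thumb for Cook's distance, if any
which(cook > 0.5)
```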

Exercise: Linear Regression

http://ncss-tech.github.io/stats_for_soil_survey/book2/linear-regression.html

Summary

Linear regression models are intuitive, quick to execute, and easy to interpret, making them useful for NASIS calculations and pedotransfer functions.
Due to the non-linear nature of environmental data, data transformations or deletions are often needed to meet model assumptions.
Tacit knowledge is needed throughout model development.

Linear Regression in R

\[ols(formula, data)\]

formula $response \sim predictor_1+predictor_2+predictor_x$
data specifies the dataset
fit detailed statistical summary

library(rms)
d <- trees
dd <- datadist(d)
options(datadist = "dd")
fit <- ols(Height ~ Volume, d)
fit

Linear Regression in R - Diagnostics Tests

cor() - correlation matrix

hist() - histogram

vif() - variance-inflation and generalized variance inflation factors for linear and generalized linear models (in car package)

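cor(d)
hist(d$Height)
library(car)
# car's vif() needs a model with at least two predictors, so Girth is added here
fit2 <- lm(Height ~ Volume + Girth, data = d)
vif(fit2)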

Other Types of Regression

Step-wise regression - typically used for exploratory data analysis or datasets with large sets of predictors; not recommended for modeling due to its unreliability

Weighted least squares regression - potentially useful when the homoscedasticity assumption is violated in OLS; gives more weight to observations with small error variance. WLS is not recommended unless the variance structure is known.

Logistic regression - useful when predicting a binary outcome from a set of continuous predictor variables.

Review Questions

Kahoot (http://kahoot.it)

Additional Resources

Sum of Squares Regression: $SSR = \sum(\hat{y}-\bar{y})^2$

Sum of Squares Error: $SSE = \sum(y-\hat{y})^2$

Total Sum of Squares: $SST = \sum(y - \bar{y})^2 = SSR + SSE$

where

$y=$ observed

$\hat{y}=$ predicted

$\bar{y}=$ mean of the observed values
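The three sums of squares are sums of squared deviations, and the total splits exactly into the regression and error parts for a least squares fit with an intercept. A short sketch on the built-in `trees` data, using `lm()`:

```r
fit_lm <- lm(Height ~ Volume, data = trees)

y     <- trees$Height   # observed
y_hat <- fitted(fit_lm) # predicted
y_bar <- mean(y)        # mean of the observed values

SSR <- sum((y_hat - y_bar)^2)  # explained by the regression
SSE <- sum((y - y_hat)^2)      # left over in the residuals
SST <- sum((y - y_bar)^2)      # total variation in y

c(SSR = SSR, SSE = SSE, SST = SST)

# The decomposition SST = SSR + SSE holds up to floating-point error
ss_check <- all.equal(SST, SSR + SSE)
ss_check
```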

Additional Resources

Root Mean Square Error $=\sqrt{MSE}$

$R^2=\frac{SSR}{SST}$

$R^2(adj)=1 - \big( \frac{N - 1}{N - K - 1}\big) \frac{SSE}{SST}$

Residual Standard Error $=\sqrt{\frac{SSE}{N - 2}}$

$t=\frac{\beta}{SE_\beta}$

N = number of observations
K = number of variables
$\beta$ = coefficient
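These formulas can be checked against the model output shown earlier ($R^2$ of 0.358, adjusted $R^2$ of 0.336, sigma of 5.1931). The sketch assumes the same Height ~ Volume model, refit here with `lm()`, so K = 1.

```r
fit_lm <- lm(Height ~ Volume, data = trees)
y <- trees$Height

N <- length(y)  # number of observations
K <- 1          # number of variables

SSE <- sum(residuals(fit_lm)^2)
SST <- sum((y - mean(y))^2)
SSR <- SST - SSE

r2     <- SSR / SST
r2_adj <- 1 - ((N - 1) / (N - K - 1)) * SSE / SST
rse    <- sqrt(SSE / (N - 2))  # equals sqrt(MSE) when K = 1

round(c(R2 = r2, R2_adj = r2_adj, RSE = rse), 4)
```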

Additional Resources - ANOVA

Source	SS	DF	MS	F
Regression (or explained)	$SSR$	$K$	$MSR = \frac{SSR}{K}$	$F=\frac{MSR}{MSE}$
Error (or residual)	$SSE$	$N-K-1$	$MSE = \frac{SSE}{(N-K-1)}$
Total	$SST$	$N-1$	$MST=\frac{SST}{(N-1)}$

An alternative formula for F, which is sometimes useful when the original data are not available (e.g. when reading someone else’s article) is
$F=\frac{R^2\times(N-K-1)}{(1-R^2)\times K}$

where N = number of observations and K = number of variables
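Both routes to F give the same number, which can be confirmed on the Height ~ Volume model. This sketch uses `lm()` on the built-in `trees` data and compares the results with the F-statistic stored by `summary()`.

```r
fit_lm <- lm(Height ~ Volume, data = trees)
y <- trees$Height
N <- length(y)
K <- 1

SSE <- sum(residuals(fit_lm)^2)
SST <- sum((y - mean(y))^2)
SSR <- SST - SSE

# ANOVA route: ratio of mean squares
MSR <- SSR / K
MSE <- SSE / (N - K - 1)
f_anova <- MSR / MSE

# Alternative route: from R^2 alone
r2 <- SSR / SST
f_r2 <- (r2 * (N - K - 1)) / ((1 - r2) * K)

c(f_anova = f_anova, f_r2 = f_r2)

# p-value of the overall model from the F distribution
p_model <- pf(f_anova, df1 = K, df2 = N - K - 1, lower.tail = FALSE)
```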

References

Bishop, TFA, and AB McBratney. 2001. “A Comparison of Prediction Methods for the Creation of Field-Extent Soil Property Maps.” Geoderma 103 (1-2): 149–60.

Faraway, Julian J. 2002. “Practical Regression and Anova Using R.” University of Bath.

Holland, Steven. 2011. “Data Analysis in Geosciences.” http://strata.uga.edu/6370/rtips/regressionPlots.html.

Matthews, Robert Andrew. 2000. “Storks Deliver Babies (P = 0.008).” Teaching Statistics 22 (2): 36–38.

Seybold, Cathy A, Paul R Finnell, and Moustafa A Elrashidi. 2009. “Estimating Total Acidity from Soil Properties Using Linear Models.” Soil Science 174 (2): 88–93.

Whittingham, Mark J, Philip A Stephens, Richard B Bradbury, and Robert P Freckleton. 2006. “Why Do We Still Use Stepwise Modelling in Ecology and Behaviour?” Journal of Animal Ecology 75 (5): 1182–9.

Wills, Skye, Cathy Seybold, Joe Chiaretti, Cleiton Sequeira, and Larry West. 2013. “Quantifying Tacit Knowledge About Soil Organic Carbon Stocks Using Soil Taxa and Official Soil Series Descriptions.” Soil Science Society of America Journal 77 (5): 1711–23.
