May 12 2021

Objectives

  • Understand linear regression and describe a case study
  • Compute and interpret coefficients in a linear regression analysis in R.
  • Interpolate a regression model in R to produce a raster layer.

Science

What are we doing?

  • Explain
  • Predict

Connections

Why is it windy in Iowa?

  • Missouri sucks and Minnesota blows
  • No connection

Where do babies come from?

  • Storks deliver babies
  • No connection

Why are basements in Iowa full of cracks?

  • Soils in Iowa contain high amounts of shrink-swell clays
  • Connection

Predictions and Predicting with a Function

Predictions

  • None of us will report to work on Sunday
  • The average price of a gallon of gas in the US will be $100.00 on January 1

Predicting with a function

  • \[y= \beta_0 + \beta_1x + \epsilon\]

Assumptions

  • Linear
  • Independent
  • Homoscedastic
  • Normal

Interpreting the Linear Regression Model

Building the Linear Regression Model

  • Ordinary Least Squares
  • \[\beta_1= \frac{\sum(x_i - \bar x) (y_i - \bar y)} {\sum(x_i - \bar x)^2}\]
  • \[\beta_0= \bar y - \beta_1 \times \bar x\]
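The two ordinary least squares formulas above can be checked by hand. The following is a minimal sketch that assumes the built-in `trees` dataset used later in these slides, with `Volume` as \(x\) and `Height` as \(y\); the variable names `b0` and `b1` are illustrative only.

```r
# Built-in trees data: 31 trees with Girth, Height, and Volume
x <- trees$Volume
y <- trees$Height

# Slope and intercept from the closed-form OLS formulas
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Compare the hand-computed values with the coefficients from lm()
c(intercept = b0, slope = b1)
coef(lm(Height ~ Volume, data = trees))
```

Both calls should return the same pair of coefficients, since `lm()` solves the same least squares problem.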

Examples

Wills et al., 2013

Carbon equivalent correction regression factor: \[OC_{dc}= 0.25 + 0.86(OC_{wc})\] where

\(OC_{dc}=\) organic carbon by dry combustion (%)

\(OC_{wc}=\) organic carbon by wet combustion (%)
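As a small illustration, the correction equation can be wrapped in a function and applied to a vector of values. The function name `oc_wc_to_dc` and the input values are hypothetical and are not taken from Wills et al. (2013).

```r
# Hypothetical helper that applies the correction equation quoted above
oc_wc_to_dc <- function(oc_wc) {
  0.25 + 0.86 * oc_wc
}

# Made-up wet combustion values (%), for illustration only
oc_wc <- c(0.5, 1, 2, 5)

# Estimated dry combustion values (%)
oc_wc_to_dc(oc_wc)
```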

Simple vs. Multiple Linear Regression

  • Simple linear regression (SLR):
    Y is predicted from one independent variable (\(x\))
  • Multiple linear regression (MLR): Y is predicted from two or more independent variables (\(x_1,x_2,x_3...\))
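The difference between the two model types shows up directly in the model formula. This sketch assumes the built-in `trees` dataset and uses base R `lm()`; choosing `Girth` as the second predictor is an arbitrary illustration.

```r
# Simple linear regression: one predictor
slr <- lm(Height ~ Volume, data = trees)

# Multiple linear regression: two predictors
mlr <- lm(Height ~ Volume + Girth, data = trees)

# One slope for the SLR, one slope per predictor for the MLR
coef(slr)
coef(mlr)
```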

Assumptions

  • Linear
  • Independent
  • Homoscedastic
  • Normal

Testing Model Assumptions

  • Normality
    • Histograms
    • QQ plots
    • Residual plots
  • Outliers
    • QQ plots
    • Box plots
  • Multicollinearity
    • Correlation \(\geq0.7\) or \(\leq-0.7\) indicates highly correlated
    • Variance inflation factors
  • Homoscedasticity
    • Residual plots
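Several of the checks listed above can be sketched with base R alone. The example assumes the built-in `trees` dataset and a two-predictor model chosen only for illustration. With exactly two predictors, the variance inflation factor reduces to \(1/(1-r^2)\), where \(r\) is the correlation between them, so it can be computed without an extra package.

```r
fit_lm <- lm(Height ~ Volume + Girth, data = trees)
res <- residuals(fit_lm)

# Normality and outliers: histogram, QQ plot, and box plot of the residuals
hist(res)
qqnorm(res)
qqline(res)
boxplot(res)

# Homoscedasticity: residuals against fitted values
plot(fitted(fit_lm), res)
abline(h = 0, lty = 2)

# Multicollinearity: correlation between the two predictors
r <- cor(trees$Volume, trees$Girth)
abs(r) >= 0.7

# Variance inflation factor for the two-predictor case
vif_manual <- 1 / (1 - r^2)
vif_manual
```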

Heteroscedasticity

Analysis of Residuals

  • Heteroscedasticity
    • causes estimates of regression coefficients to be less precise
  • Non-normality
    • compromises interpretability of significance tests of the regression coefficients
  • Multicollinearity
    • inflates the variances of the regression coefficients
  • Spatial Autocorrelation
    • results in an underestimation of the standard error of the estimates of the regression coefficients and a bias towards rejecting the \(H_0\) that the value of the coefficient is zero

Interpreting Model Results

fit
## Linear Regression Model
##  
##  ols(formula = Height ~ Volume, data = d)
##  
##                  Model Likelihood    Discrimination    
##                        Ratio Test           Indexes    
##  Obs      31    LR chi2     13.73    R2       0.358    
##  sigma5.1931    d.f.            1    R2 adj   0.336    
##  d.f.     29    Pr(> chi2) 0.0002    g        4.150    
##  
...

Interpreting Model Results

fit
...
##  
##  Residuals
##  
##       Min       1Q   Median       3Q      Max 
##  -10.7777  -2.9722  -0.1515   2.0804  10.6426 
##  
##  
##            Coef    S.E.   t     Pr(>|t|)
##  Intercept 69.0034 1.9744 34.95 <0.0001 
##  Volume     0.2319 0.0577  4.02 0.0004  
## 
...
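The coefficient table can be read as simple arithmetic: each t value is the coefficient divided by its standard error, and a prediction is the intercept plus the slope times the predictor. The sketch below re-enters the printed values by hand; the 29 residual degrees of freedom come from the output shown earlier, and the volume of 30 is an arbitrary example value.

```r
# Values copied from the printed coefficient table
coef_tab <- data.frame(
  term = c("Intercept", "Volume"),
  coef = c(69.0034, 0.2319),
  se   = c(1.9744, 0.0577)
)

# t statistic = coefficient / standard error
coef_tab$t <- coef_tab$coef / coef_tab$se

# Two-sided p-value with 29 residual degrees of freedom
coef_tab$p <- 2 * pt(-abs(coef_tab$t), df = 29)
coef_tab

# Predicted height for a tree with a volume of 30 (arbitrary example value)
pred_height <- coef_tab$coef[1] + coef_tab$coef[2] * 30
pred_height
```

The slope means that each additional unit of volume is associated with an increase of about 0.23 units in predicted height.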

Diagnostic Plots

  • Residuals vs Fitted
  • Quantile - Quantile (QQ)
  • Spread-Location
  • Leverage Plot

Residuals vs Fitted

library(ggplot2)

# Residuals against fitted values; the loess smoother should stay close to zero
ggplot() +
  geom_point(aes(x = fit$fitted.values, y = fit$residuals)) +
  geom_smooth(aes(x = fit$fitted.values, y = fit$residuals),
              method = "loess", formula = y ~ x, se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted Values", y = "Residuals")

QQ Plot

# The raw residuals are plotted here, so the y axis is labeled accordingly
ggplot() +
  geom_qq(aes(sample = fit$residuals)) +
  geom_qq_line(aes(sample = fit$residuals), linetype = "dashed") +
  labs(title = "Normal Q-Q", x = "Theoretical Quantiles \n ols(Height ~ Volume)",
       y = "Sample Quantiles (Residuals)")

Spread-Location

# Square root of the absolute residuals against fitted values
ggplot() +
  geom_point(aes(x = fit$fitted.values, y = sqrt(abs(fit$residuals)))) +
  geom_smooth(aes(x = fit$fitted.values, y = sqrt(abs(fit$residuals))),
              method = "loess", formula = y ~ x, se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Scale-Location", x = "Fitted Values \n ols(Height ~ Volume)",
       y = "Square Root of |Residuals|")

Leverage Plot

# Reclass the ols fit so that cooks.distance() dispatches to its lm method
class(fit) <- "lm"
ggplot() +
  geom_point(aes(x = hat(fit$fitted.values), y = fit$residuals, size = cooks.distance(fit))) +
  geom_smooth(aes(x = hat(fit$fitted.values), y = fit$residuals),
              method = "loess", formula = y ~ x, se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Residuals vs Leverage", x = "Leverage \n ols(Height ~ Volume)",
       y = "Residuals")

Exercise: Linear Regression

Summary

  • Linear regression models are intuitive, quick to execute, and easy to interpret, making them useful for NASIS calculations and pedotransfer functions.
  • Due to the non-linear nature of environmental data, data transformations or deletions are often needed to meet model assumptions.
  • Tacit knowledge is needed throughout model development.

Linear Regression in R

\[ols(formula, data)\]

  • formula: \(response \sim predictor_1+predictor_2+predictor_x\)
  • data: specifies the dataset
  • fit: detailed statistical summary

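library(rms)
d <- trees                 # built-in dataset: Girth, Height, Volume
dd <- datadist(d)          # distribution summaries that rms uses for its methods
options(datadist = "dd")
fit <- ols(Height ~ Volume, data = d)
fit                        # print the model summary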

Linear Regression in R - Diagnostics Tests

cor() - correlation matrix

hist() - histogram

vif() - variance-inflation and generalized variance inflation factors for linear and generalized linear models (in car package)

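cor(d)
hist(d$Height)
library(car)
# car::vif() stops with an error for models with fewer than two terms,
# so fit a model with two predictors before computing VIFs
fit_mlr <- lm(Height ~ Volume + Girth, data = d)
vif(fit_mlr)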

Other Types of Regression

Step-wise regression - typically used for exploratory data analysis or datasets with large sets of predictors; not recommended for modeling due to its unreliability

Weighted least squares regression - potentially useful when the homoscedasticity assumption is violated in OLS; gives more weight to observations with small error variance. WLS is not recommended unless the variance structure is known.
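A weighted fit can be sketched with the `weights` argument of base R `lm()`. The weights below are purely hypothetical: they assume the error variance is proportional to `Volume`, which has not been established for the `trees` data, so the example shows the mechanics rather than a recommended model.

```r
# Ordinary least squares fit for reference
ols_fit <- lm(Height ~ Volume, data = trees)

# Hypothetical weights: assumes error variance proportional to Volume
w <- 1 / trees$Volume

# Weighted least squares: observations with smaller assumed variance count more
wls_fit <- lm(Height ~ Volume, data = trees, weights = w)

# Compare the two sets of coefficients
rbind(ols = coef(ols_fit), wls = coef(wls_fit))
```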

Logistic regression - useful when predicting a binary outcome from a set of continuous and/or categorical predictor variables.
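A logistic regression can be sketched with base R `glm()` and a binomial family. The binary response `tall` is a made-up variable, created here by splitting `Height` at its median only so that the `trees` data has a binary outcome to model.

```r
d_bin <- trees

# Hypothetical binary outcome: 1 if the tree is taller than the median height
d_bin$tall <- as.integer(d_bin$Height > median(d_bin$Height))

# Logistic regression of the binary outcome on Volume
logit_fit <- glm(tall ~ Volume, data = d_bin, family = binomial)
coef(logit_fit)

# Predicted probability of being "tall" at three example volumes
new_d <- data.frame(Volume = c(15, 30, 60))
p_tall <- predict(logit_fit, newdata = new_d, type = "response")
p_tall
```

The coefficients are on the log-odds scale, while `type = "response"` returns probabilities between 0 and 1.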

Review Questions

Additional Resources

Sum of Squares Regression \(SSR = \sum(\hat{y}-\bar{y})^2\)

Sum of Squares Error \(SSE = \sum(y-\hat{y})^2\)

Total Sum of Squares \(SST = \sum(y - \bar{y})^2 = SSR + SSE\)

where

\(y=\) observed

\(\hat{y}=\) predicted

\(\bar{y}=\) mean of the observed values
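The three sums of squares are sums of squared deviations, and the identity \(SST = SSR + SSE\) can be verified numerically. This sketch assumes the built-in `trees` dataset and a base R `lm()` fit of `Height` on `Volume`.

```r
fit_lm <- lm(Height ~ Volume, data = trees)

y     <- trees$Height      # observed
y_hat <- fitted(fit_lm)    # predicted
y_bar <- mean(y)           # mean of the observed values

ssr <- sum((y_hat - y_bar)^2)  # regression (explained) sum of squares
sse <- sum((y - y_hat)^2)      # error (residual) sum of squares
sst <- sum((y - y_bar)^2)      # total sum of squares

# The decomposition holds for an OLS fit with an intercept
all.equal(sst, ssr + sse)
```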

Additional Resources

Root Mean Square Error \(=\sqrt{MSE}\)

\(R^2=\frac{SSR}{SST}\)

\(R^2(adj)=1 - \big( \frac{N - 1}{N - K - 1}\big) \frac{SSE}{SST}\)

Residual Standard Error \(=\sqrt{\frac{SSE}{N - K - 1}}\) (which is \(\sqrt{\frac{SSE}{N - 2}}\) for simple linear regression)

\(t=\frac{\beta}{SE_\beta}\)

N = number of observations
K = number of predictor variables
\(\beta\) = coefficient
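These statistics can be reproduced from the sums of squares and compared with what `summary()` reports. The sketch assumes the built-in `trees` dataset and a base R `lm()` fit of `Height` on `Volume`, so \(N = 31\) and \(K = 1\).

```r
fit_lm <- lm(Height ~ Volume, data = trees)
y <- trees$Height

n <- length(y)   # number of observations
k <- 1           # number of predictor variables

sse <- sum(residuals(fit_lm)^2)
sst <- sum((y - mean(y))^2)
ssr <- sst - sse

r2     <- ssr / sst
r2_adj <- 1 - ((n - 1) / (n - k - 1)) * (sse / sst)
rse    <- sqrt(sse / (n - k - 1))

# Hand-computed values next to those from summary()
s <- summary(fit_lm)
c(r2 = r2, r2_adj = r2_adj, rse = rse)
c(r2 = s$r.squared, r2_adj = s$adj.r.squared, rse = s$sigma)
```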

Additional Resources - ANOVA

| Source | SS | DF | MS | F |
|---|---|---|---|---|
| Regression (or explained) | \(SSR\) | \(K\) | \(MSR = \frac{SSR}{K}\) | \(F=\frac{MSR}{MSE}\) |
| Error (or residual) | \(SSE\) | \(N-K-1\) | \(MSE = \frac{SSE}{(N-K-1)}\) | |
| Total | \(SST\) | \(N-1\) | \(MST=\frac{SST}{(N-1)}\) | |

An alternative formula for F, which is sometimes useful when the original data are not available (e.g. when reading someone else’s article) is
\(F=\frac{R^2\times(N-K-1)}{(1-R^2)\times K}\)

where N = number of observations and K = number of variables
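The alternative formula can be checked against the F statistic that R reports. This is a sketch assuming the built-in `trees` dataset and a base R `lm()` fit of `Height` on `Volume`, with \(N = 31\) and \(K = 1\).

```r
fit_lm <- lm(Height ~ Volume, data = trees)
s <- summary(fit_lm)

n  <- nrow(trees)   # number of observations
k  <- 1             # number of predictor variables
r2 <- s$r.squared

# F computed from R-squared alone
f_alt <- (r2 * (n - k - 1)) / ((1 - r2) * k)

# F statistic reported by summary()
f_lm <- unname(s$fstatistic["value"])

c(f_alt = f_alt, f_lm = f_lm)
```

With a single predictor, this F value also equals the square of the slope's t statistic.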

References

Bishop, TFA, and AB McBratney. 2001. “A Comparison of Prediction Methods for the Creation of Field-Extent Soil Property Maps.” Geoderma 103 (1-2): 149–60.

Faraway, Julian J. 2002. “Practical Regression and Anova Using R.” University of Bath, Bath.

Holland, Steven. 2011. “Data Analysis in Geosciences.” http://strata.uga.edu/6370/rtips/regressionPlots.html.

Matthews, Robert Andrew. 2000. “Storks Deliver Babies (P = 0.008).” Teaching Statistics 22 (2): 36–38.

Seybold, Cathy A, Paul R Finnell, and Moustafa A Elrashidi. 2009. “Estimating Total Acidity from Soil Properties Using Linear Models.” Soil Science 174 (2): 88–93.

Whittingham, Mark J, Philip A Stephens, Richard B Bradbury, and Robert P Freckleton. 2006. “Why Do We Still Use Stepwise Modelling in Ecology and Behaviour?” Journal of Animal Ecology 75 (5): 1182–9.

Wills, Skye, Cathy Seybold, Joe Chiaretti, Cleiton Sequeira, and Larry West. 2013. “Quantifying Tacit Knowledge About Soil Organic Carbon Stocks Using Soil Taxa and Official Soil Series Descriptions.” Soil Science Society of America Journal 77 (5): 1711–23.