- Understand linear regression and describe a case study.
- Compute and interpret coefficients in a linear regression analysis in R.
- Interpolate a regression model in R to produce a raster layer.
May 12 2021
What are we doing?
Why is it windy in Iowa?
Where do babies come from?
Why are basements in Iowa full of cracks?
Predictions
Predicting with a function
Wills et al., 2013
Carbon equivalent correction regression factor: \[OC_{dc}= 0.25 + 0.86(OC_{wc})\] where
\(OC_{dc}=\) organic carbon by dry combustion (%)
\(OC_{wc}=\) organic carbon by wet combustion (%)
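As a minimal sketch of "predicting with a function", the correction equation can be wrapped in a small R function. The function name `oc_wet_to_dry` and the input values are hypothetical; only the intercept (0.25) and slope (0.86) come from the equation above.

```r
# Hypothetical helper: convert organic carbon by wet combustion (%)
# to its dry-combustion equivalent using the equation quoted above.
oc_wet_to_dry <- function(oc_wc) {
  0.25 + 0.86 * oc_wc
}

# Illustrative wet-combustion values (%), not taken from the source
oc_wc <- c(0.5, 1.0, 2.0)
oc_dc <- oc_wet_to_dry(oc_wc)
oc_dc
```

Because the function is vectorized, it can be applied to a whole column of laboratory measurements at once.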
fit
## Linear Regression Model
## 
## ols(formula = Height ~ Volume, data = d)
## 
##                 Model Likelihood    Discrimination
##                       Ratio Test           Indexes
## Obs       31    LR chi2     13.73    R2       0.358
## sigma5.1931    d.f.            1    R2 adj   0.336
## d.f.      29    Pr(> chi2) 0.0002    g        4.150
## ...
fit
## ...
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max
## -10.7777  -2.9722  -0.1515   2.0804  10.6426
## 
## 
##           Coef    S.E.   t     Pr(>|t|)
## Intercept 69.0034 1.9744 34.95 <0.0001
## Volume     0.2319 0.0577  4.02 0.0004
## ...
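One way to read the coefficient table is to plug the printed estimates back into the fitted line, Height = Intercept + slope × Volume. The sketch below does this by hand; the Volume value of 30 is an arbitrary illustrative input, and the variable names are assumptions.

```r
# Coefficients copied from the printed ols() summary above
b0 <- 69.0034   # intercept: predicted Height when Volume is 0
b1 <- 0.2319    # slope: change in predicted Height per unit increase in Volume

# Predicted Height for an illustrative Volume of 30 (chosen arbitrarily)
volume_new  <- 30
height_pred <- b0 + b1 * volume_new
height_pred
```

The slope is the quantity of interest here: each additional unit of Volume is associated with an increase of about 0.23 units in predicted Height.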
library(ggplot2)  # plotting functions used for all four diagnostic plots

ggplot() +
  geom_point(aes(x = fit$fitted.values, y = fit$residuals)) +
  geom_smooth(aes(x = fit$fitted.values, y = fit$residuals),
    method = "loess", formula = y ~ x, se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted", x = "Fitted Values", y = "Residuals")

ggplot() +
  geom_qq(aes(sample = fit$residuals)) +
  geom_qq_line(aes(sample = fit$residuals), linetype = "dashed") +
  labs(title = "Normal Q-Q", x = "Theoretical Quantiles \n ols(Height ~ Volume)",
    y = "Residuals")

ggplot() +
  geom_point(aes(x = fit$fitted.values, y = sqrt(abs(fit$residuals)))) +
  geom_smooth(aes(x = fit$fitted.values, y = sqrt(abs(fit$residuals))),
    method = "loess", formula = y ~ x, se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Scale-Location", x = "Fitted Values \n ols(Height ~ Volume)",
    y = "Square Root of |Residuals|")

class(fit) <- "lm"  # reclass the ols fit as lm so cooks.distance() can dispatch on it
ggplot() +
  geom_point(aes(x = hat(fit$fitted.values), y = fit$residuals,
    size = cooks.distance(fit))) +
  geom_smooth(aes(x = hat(fit$fitted.values), y = fit$residuals),
    method = "loess", formula = y ~ x, se = FALSE) +
  geom_abline(intercept = 0, slope = 0, linetype = "dashed") +
  labs(title = "Residuals vs Leverage", x = "Leverage \n ols(Height ~ Volume)",
    y = "Residuals")
http://ncss-tech.github.io/stats_for_soil_survey/book2/linear-regression.html
\[ols(formula, data)\]
formula \(response \sim predictor_1+predictor_2+predictor_x\)
data specifies the dataset
fit prints a detailed statistical summary
library(rms)
d <- trees
dd <- datadist(d)  # summarize variable distributions for rms
options(datadist = "dd")
fit <- ols(Height ~ Volume, data = d)
fit
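For readers without `rms` installed, base R's `lm()` fits the same ordinary least squares model. The sketch below is an alternative to `ols()`, not the approach used in the slides; the object name `fit_lm` is an assumption.

```r
# Base R fit of the same model on the built-in trees dataset
fit_lm <- lm(Height ~ Volume, data = trees)

# Intercept and slope; these should agree with the ols() output
coef(fit_lm)

# Proportion of variance in Height explained by Volume
summary(fit_lm)$r.squared
```

`summary(fit_lm)` prints the residual quantiles and coefficient table in a layout similar to the `ols()` summary.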
cor() - correlation matrix
hist() - histogram
vif() - variance-inflation and generalized variance inflation factors for linear and generalized linear models (in car package)
cor(d)
hist(d$Height)
library(car)
vif(lm(Height ~ Volume + Girth, data = d))  # vif() needs at least two predictors, so Girth is added here
Step-wise regression - typically used for exploratory data analysis or datasets with large sets of predictors; not recommended for modeling due to its unreliability
Weighted least squares regression - potentially useful when the homoscedasticity assumption is violated in OLS; gives more weight to observations with small error variance. WLS is not recommended unless the variance structure is known.
Logistic regression - useful when predicting a binary outcome from a set of continuous predictor variables.
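Two of the alternatives listed above can be sketched with base R alone. The weights `1 / Volume` are an assumption made purely for illustration (the text notes that WLS needs a known variance structure), and the `mtcars` dataset is used only because it ships with R and contains a binary variable.

```r
# Weighted least squares: lm() with a weights argument.
# The weights 1 / Volume are illustrative only, not a recommended choice.
wls_fit <- lm(Height ~ Volume, data = trees, weights = 1 / Volume)
coef(wls_fit)

# Logistic regression: glm() with a binomial family.
# mtcars$am is a 0/1 variable (transmission type) in base R's datasets.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(logit_fit)

# Predicted probability that am = 1 for a car with wt = 3 (illustrative input)
p_manual <- predict(logit_fit, newdata = data.frame(wt = 3), type = "response")
p_manual
```

In the logistic model the coefficients are on the log-odds scale, so `type = "response"` is needed to get a probability.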
Sum of Squares Regression: \(SSR = \sum(\hat{y}-\bar{y})^2\)
Sum of Squares Error: \(SSE = \sum(y-\hat{y})^2\)
Total Sum of Squares: \(SST = \sum(y - \bar{y})^2 = SSR + SSE\)
where
\(y=\) observed
\(\hat{y}=\) predicted
\(\bar{y}=\) mean of the observed values
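The three sums of squares can be computed directly from the observed and fitted values, with each sum taken over squared deviations. The sketch below uses the `trees` model from earlier, refit with base R's `lm()` so that it is self-contained; the variable names are assumptions.

```r
fit_lm <- lm(Height ~ Volume, data = trees)

y     <- trees$Height     # observed values
y_hat <- fitted(fit_lm)   # predicted values
y_bar <- mean(y)          # mean of the observed values

SSR <- sum((y_hat - y_bar)^2)  # variation explained by the regression
SSE <- sum((y - y_hat)^2)      # residual (unexplained) variation
SST <- sum((y - y_bar)^2)      # total variation

# For an OLS fit with an intercept, SST equals SSR + SSE
all.equal(SST, SSR + SSE)
```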
Root Mean Square Error \(=\sqrt{MSE}\)
\(R^2=\frac{SSR}{SST}\)
\(R^2(adj)=1 - \big( \frac{N - 1}{N - K - 1}\big) \frac{SSE}{SST}\)
Residual Standard Error \(=\sqrt{\frac{SSE}{N - K - 1}}\), which reduces to \(\sqrt{\frac{SSE}{N - 2}}\) for a single predictor (\(K = 1\))
\(t=\frac{\beta}{SE_\beta}\)
N = number of observations
K = number of predictor variables
\(\beta\) = coefficient
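As a check on the formulas above, the fit statistics can be rebuilt by hand and compared with what `summary()` reports. This is a sketch using base R's `lm()` on the `trees` data; the variable names are assumptions.

```r
fit_lm <- lm(Height ~ Volume, data = trees)
y <- trees$Height
N <- length(y)   # number of observations
K <- 1           # number of predictor variables

SSE <- sum(residuals(fit_lm)^2)
SST <- sum((y - mean(y))^2)
SSR <- SST - SSE

r2     <- SSR / SST
r2_adj <- 1 - ((N - 1) / (N - K - 1)) * SSE / SST
rse    <- sqrt(SSE / (N - K - 1))   # residual standard error

# t statistic for each coefficient: estimate divided by its standard error
coefs  <- coef(summary(fit_lm))
t_vals <- coefs[, "Estimate"] / coefs[, "Std. Error"]

c(R2 = r2, R2_adj = r2_adj, RSE = rse)
t_vals
```

The hand-computed values should match the `R2`, `R2 adj`, `sigma`, and `t` entries in the model summary shown earlier.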
Source | SS | DF | MS | F |
---|---|---|---|---|
Regression (or explained) | \(SSR\) | \(K\) | \(MSR = \frac{SSR}{K}\) | \(F=\frac{MSR}{MSE}\) |
Error (or residual) | \(SSE\) | \(N-K-1\) | \(MSE = \frac{SSE}{(N-K-1)}\) | |
Total | \(SST\) | \(N-1\) | \(MST=\frac{SST}{(N-1)}\) | |
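The ANOVA table can be filled in by hand for the `trees` model and compared with R's built-in `anova()`. This is a sketch under the same setup as before (base R `lm()`, one predictor); the variable names are assumptions.

```r
fit_lm <- lm(Height ~ Volume, data = trees)
y <- trees$Height
N <- length(y)
K <- 1

SSR <- sum((fitted(fit_lm) - mean(y))^2)
SSE <- sum(residuals(fit_lm)^2)

MSR <- SSR / K              # mean square for regression
MSE <- SSE / (N - K - 1)    # mean square error
F_stat <- MSR / MSE

# Upper-tail probability of the F distribution with (K, N - K - 1) d.f.
p_value <- pf(F_stat, df1 = K, df2 = N - K - 1, lower.tail = FALSE)

c(F = F_stat, p = p_value)
anova(fit_lm)   # built-in table for comparison
```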
An alternative formula for F, which is sometimes useful when the original data are not available (e.g. when reading someone else’s article) is
\(F=\frac{R^2\times(N-K-1)}{(1-R^2)\times K}\)
where N = number of observations and K = number of predictor variables
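As a worked example of the alternative formula, the sketch below uses only the numbers printed in the model summary earlier (R2 = 0.358, 31 observations, one predictor), as if the original data were unavailable. The variable names are assumptions.

```r
# Values read off the printed ols() summary
r2 <- 0.358   # R2
N  <- 31      # number of observations
K  <- 1       # number of predictor variables

F_from_r2 <- (r2 * (N - K - 1)) / ((1 - r2) * K)
F_from_r2

# With a single predictor, F equals the squared t statistic of the slope,
# up to rounding in the printed values
t_volume <- 4.02
t_volume^2
```

Both routes give an F of roughly 16.2 on 1 and 29 degrees of freedom.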
Bishop, TFA, and AB McBratney. 2001. “A Comparison of Prediction Methods for the Creation of Field-Extent Soil Property Maps.” Geoderma 103 (1-2): 149–60.
Faraway, Julian J. 2002. “Practical Regression and Anova Using R.” University of Bath, Bath.
Holland, Steven. 2011. “Data Analysis in Geosciences.” http://strata.uga.edu/6370/rtips/regressionPlots.html.
Matthews, Robert Andrew. 2000. “Storks Deliver Babies (P = 0.008).” Teaching Statistics 22 (2): 36–38.
Seybold, Cathy A, Paul R Finnell, and Moustafa A Elrashidi. 2009. “Estimating Total Acidity from Soil Properties Using Linear Models.” Soil Science 174 (2): 88–93.
Whittingham, Mark J, Philip A Stephens, Richard B Bradbury, and Robert P Freckleton. 2006. “Why Do We Still Use Stepwise Modelling in Ecology and Behaviour?” Journal of Animal Ecology 75 (5): 1182–9.
Wills, Skye, Cathy Seybold, Joe Chiaretti, Cleiton Sequeira, and Larry West. 2013. “Quantifying Tacit Knowledge About Soil Organic Carbon Stocks Using Soil Taxa and Official Soil Series Descriptions.” Soil Science Society of America Journal 77 (5): 1711–23.