Pesky ‘Linear Model’ Assumptions (review)

Linearity - Y is linearly related to X.
Independence of Error - the error (residual) is independent for each value of X.
Homoscedasticity - the variation around the regression line is constant for all values of X.
Normality - the values of Y are normally distributed at each value of X.

Comparison between Linear Models and GLMs

Type	Distributions	Estimation Method	Goodnesss of Fit
Linear	Gaussian (i.e. Normal)	least squares	variance
GLM	Any Exponential Family	maximum-likelihood	deviance

Generalized Linear Models (fewer assumptions)

Family (or Distribution)	Default Link Function	Data Type	Example
Gaussian	identity	interval or ratio	clay content
Binomial	logit	binary (yes/no) or binomial (proportions)	presense of mollisols
Poisson	log	counts	# of species

Overview - Logistic Regression

Response = binary (yes or no)(1 or 0) or binomial (0 to 1)(0 to 100%)
Prediction = probability ( e.g. 0 - 100%)
A threshold value, often 0.5, is used to determine if the prediction is part of the modeled class

Example 1: Probability of Mollisols

Beaudette & O’Geen, 2009

Example 2: Probability of Red Clay

Evans & Hartemink, 2014

Example 3: Probability of Ponding

NRCS Unpublished

Example 4: Aggregate representation of genetic soil horizons

2015 Digital Soil Morphometrics talk

Logistic Link Function

Does this look familiar?

Fitting

spod.glm <- glm(spod ~ dem10m + eastness + northness + maxent, family = binomial, data = wv)
summary(spod.glm)

## 
## Call:
## glm(formula = spod ~ dem10m + eastness + northness + maxent, 
##     family = binomial, data = wv)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7958  -0.7920  -0.4946   0.8902   2.3825  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.450480   2.328311   3.200 0.001375 ** 
## dem10m      -0.008974   0.002347  -3.823 0.000132 ***
## eastness    -0.864715   0.256272  -3.374 0.000740 ***
## northness    0.675560   0.230950   2.925 0.003443 ** 
## maxent       0.031468   0.008573   3.671 0.000242 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 307.11  on 249  degrees of freedom
## Residual deviance: 254.47  on 245  degrees of freedom
## AIC: 264.47
## 
## Number of Fisher Scoring iterations: 4

Accuracy - rms R package

library(rms)
spod.lrm <- lrm(spod ~ dem10m + eastness + northness + maxent, data = wv)
print(spod.lrm)

## Logistic Regression Model
##  
##  lrm(formula = spod ~ dem10m + eastness + northness + maxent, 
##      data = wv)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           250    LR chi2      52.63    R2       0.268    C       0.773    
##   FALSE        174    d.f.             4    g        1.339    Dxy     0.546    
##   TRUE          76    Pr(> chi2) <0.0001    gr       3.817    gamma   0.546    
##  max |deriv| 1e-09                          gp       0.234    tau-a   0.232    
##                                             Brier    0.167                     
##  
##            Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept  7.4505 2.3284  3.20  0.0014  
##  dem10m    -0.0090 0.0023 -3.82  0.0001  
##  eastness  -0.8647 0.2563 -3.37  0.0007  
##  northness  0.6756 0.2310  2.93  0.0034  
##  maxent     0.0315 0.0086  3.67  0.0002  
##

Accuracy - modEvA R package

library(modEvA)
cbind(d.squared     = Dsquared(spod.glm), 
      adj.d.squared = Dsquared(spod.glm, adjust = TRUE)
      )

##      d.squared adj.d.squared
## [1,] 0.1713869     0.1578586

rbind(RsqGLM(spod.glm))

##      CoxSnell  Nagelkerke McFadden  Tjur      sqPearson
## [1,] 0.1898509 0.268436   0.1713869 0.2054845 0.209848

Accuracy - Confusion Matrix

library(caret)
cm <- table(observed = wv$spod, predicted = predict(spod.glm, type = "response") > 0.5)
confusionMatrix(cm, positive = "TRUE")

## Confusion Matrix and Statistics
## 
##         predicted
## observed FALSE TRUE
##    FALSE   158   16
##    TRUE     38   38
##                                           
##                Accuracy : 0.784           
##                  95% CI : (0.7278, 0.8334)
##     No Information Rate : 0.784           
##     P-Value [Acc > NIR] : 0.536375        
##                                           
##                   Kappa : 0.4443          
##                                           
##  Mcnemar's Test P-Value : 0.004267        
##                                           
##             Sensitivity : 0.7037          
##             Specificity : 0.8061          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.9080          
##              Prevalence : 0.2160          
##          Detection Rate : 0.1520          
##    Detection Prevalence : 0.3040          
##       Balanced Accuracy : 0.7549          
##                                           
##        'Positive' Class : TRUE            
##

Accuracy - Probability Threshold

Interpreting the Coefficients

# extract coefficient, stored in log(odds)
coef(spod.glm)

##  (Intercept)       dem10m     eastness    northness       maxent 
##  7.450480458 -0.008974187 -0.864714623  0.675560332  0.031468219

# convert coefficient to odds
exp(coef(spod.glm))

##  (Intercept)       dem10m     eastness    northness       maxent 
## 1720.6896668    0.9910660    0.4211717    1.9651338    1.0319686

Interpreting the Coefficients - Partial Effect Plots

library(visreg); par(mfcol = c(2, 2))
visreg(spod.glm, scale = "response", ylab = "spodic probability")

Interpreting the Coeficiencts - Deviance Explained

anova(spod.glm)

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: spod
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev
## NULL                        249     307.11
## dem10m     1   8.0805       248     299.03
## eastness   1  23.2714       247     275.76
## northness  1   7.0218       246     268.73
## maxent     1  14.2606       245     254.47

Interpreting the Coeficiencts - Deviance Explained

Checking Assumptions - Multicollinearity

Leads to unstable coefficients and large SE for coefficients
Rules of thumb vary, but a square root of vif greater than 2 or 3 indicates an unacceptable value.

##    dem10m  eastness northness    maxent 
##  1.196760  1.020541  1.014372  1.230978

Checking Assumptions - Residual Plots

Checking Assumptions - Resdiual Plots

library(car)
residualPlots(spod.glm, fit = FALSE, tests = FALSE)

Variable Selection

spod.glm2 <- glm(spod ~., data = wv)
spod.step <- step(spod.glm2, trace = FALSE)

summary(spod.step)

## 
## Call:
## glm(formula = spod ~ downslpgra + eastness + mirref + ndvi + 
##     northeastn + northness + planc100 + relpos11 + slp50 + solar + 
##     tanc75 + pred, data = wv)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8701  -0.2469  -0.1020   0.2881   0.8774  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.487e-01  9.936e-01   0.552  0.58129    
## downslpgra  -5.105e-03  1.773e-03  -2.880  0.00434 ** 
## eastness     3.272e+04  2.158e+04   1.517  0.13069    
## mirref      -6.202e-01  3.761e-01  -1.649  0.10047    
## ndvi         1.403e+00  7.509e-01   1.869  0.06286 .  
## northeastn  -4.627e+04  3.051e+04  -1.517  0.13070    
## northness    3.272e+04  2.157e+04   1.517  0.13070    
## planc100     3.674e-01  2.211e-01   1.662  0.09788 .  
## relpos11    -3.006e-01  1.988e-01  -1.512  0.13185    
## slp50       -1.179e-02  4.369e-03  -2.699  0.00745 ** 
## solar       -6.784e-07  4.510e-07  -1.504  0.13387    
## tanc75       1.287e+00  6.877e-01   1.871  0.06253 .  
## pred         9.768e-01  2.165e-01   4.512 1.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1611452)
## 
##     Null deviance: 52.896  on 249  degrees of freedom
## Residual deviance: 38.191  on 237  degrees of freedom
## AIC: 267.76
## 
## Number of Fisher Scoring iterations: 2

Variable Selection

AIC (Akaike Information Criterion) - smaller is better
drop1() reports the size of the model without the variable, thus drop smaller AIC values

drop1(spod.step, test = "Chisq")

## Single term deletions
## 
## Model:
## spod ~ downslpgra + eastness + mirref + ndvi + northeastn + northness + 
##     planc100 + relpos11 + slp50 + solar + tanc75 + pred
##            Df Deviance    AIC scaled dev.  Pr(>Chi)    
## <none>          38.191 267.76                          
## downslpgra  1   39.528 274.36      8.5982  0.003365 ** 
## eastness    1   38.562 268.17      2.4146  0.120208    
## mirref      1   38.630 268.61      2.8521  0.091254 .  
## ndvi        1   38.754 269.42      3.6577  0.055810 .  
## northeastn  1   38.562 268.17      2.4146  0.120208    
## northness   1   38.562 268.17      2.4146  0.120209    
## planc100    1   38.636 268.65      2.8961  0.088794 .  
## relpos11    1   38.560 268.16      2.4002  0.121317    
## slp50       1   39.366 273.33      7.5703  0.005934 ** 
## solar       1   38.556 268.13      2.3753  0.123266    
## tanc75      1   38.756 269.42      3.6669  0.055502 .  
## pred        1   41.472 286.36     20.6016 5.655e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable Selection - Deviance Explained

References

Beaudette, D. E., & O’Geen, A. T, 2009. Quantifying the aspect effect: an application of solar radiation modeling for soil survey. Soil Science Society of America Journal, 73:1345-1352

Evans, D.M. and Hartemink, A.E., 2014. Digital soil mapping of a red clay subsoil covered by loess. Geoderma, 230:296-304.

Additional reading

Lane, P.W., 2002. Generalized linear models in soil science. European Journal of Soil Science 53, 241- 251. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2002.00440.x/abstract

James, G., D. Witten, T. Hastie, and R. Tibshirani, 2014. An Introduction to Statistical Learning: with Applications in R. Springer, New York. http://www-bcf.usc.edu/~gareth/ISL/

Hengl, T. 2009. A Practical Guide to Geostatistical Mapping, 2nd Edt. University of Amsterdam, www.lulu.com, 291 p. ISBN 978-90-9024981-0. http://spatial-analyst.net/book/system/files/Hengl_2009_GEOSTATe2c0w.pdf

Chapter 5 - Logistic Regression

Objectives

Pesky ‘Linear Model’ Assumptions (review)

Comparison between Linear Models and GLMs

Generalized Linear Models (fewer assumptions)

Overview - Logistic Regression

Example 1: Probability of Mollisols

Example 2: Probability of Red Clay

Example 3: Probability of Ponding

Example 4: Aggregate representation of genetic soil horizons

Logistic Link Function

Logistic Fit

Fitting

Accuracy - rms R package

Accuracy - modEvA R package

Accuracy - Confusion Matrix

Accuracy - Probability Threshold

Interpreting the Coefficients

Interpreting the Coefficients - Partial Effect Plots

Interpreting the Coeficiencts - Deviance Explained

Interpreting the Coeficiencts - Deviance Explained

Checking Assumptions - Multicollinearity

Checking Assumptions - Residual Plots

Checking Assumptions - Resdiual Plots

Variable Selection

Variable Selection

Variable Selection - Deviance Explained

Final Spodic Model

Questions

References

Additional reading