Chapter 5 - Logistic Regression

Stephen Roecker and Tom D’Avello

2020-03-13

Objectives

Pesky ‘Linear Model’ Assumptions (review)

Comparison between Linear Models and GLMs

Type Distributions Estimation Method Goodnesss of Fit
Linear Gaussian (i.e. Normal) least squares variance
GLM Any Exponential Family maximum-likelihood deviance

Generalized Linear Models (fewer assumptions)

Family (or Distribution) Default Link Function Data Type Example
Gaussian identity interval or ratio clay content
Binomial logit binary (yes/no) or binomial (proportions) presense of mollisols
Poisson log counts # of species

Overview - Logistic Regression

Example 1: Probability of Mollisols

Beaudette & O’Geen, 2009

Beaudette & O’Geen, 2009

Example 2: Probability of Red Clay

Evans & Hartemink, 2014

Evans & Hartemink, 2014

Example 3: Probability of Ponding

NRCS Unpublished

NRCS Unpublished

Example 4: Aggregate representation of genetic soil horizons

2015 Digital Soil Morphometrics talk

Logistic Fit

Fitting

spod.glm <- glm(spod ~ dem10m + eastness + northness + maxent, family = binomial, data = wv)
summary(spod.glm)
## 
## Call:
## glm(formula = spod ~ dem10m + eastness + northness + maxent, 
##     family = binomial, data = wv)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7958  -0.7920  -0.4946   0.8902   2.3825  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.450480   2.328311   3.200 0.001375 ** 
## dem10m      -0.008974   0.002347  -3.823 0.000132 ***
## eastness    -0.864715   0.256272  -3.374 0.000740 ***
## northness    0.675560   0.230950   2.925 0.003443 ** 
## maxent       0.031468   0.008573   3.671 0.000242 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 307.11  on 249  degrees of freedom
## Residual deviance: 254.47  on 245  degrees of freedom
## AIC: 264.47
## 
## Number of Fisher Scoring iterations: 4

Accuracy - rms R package

library(rms)
spod.lrm <- lrm(spod ~ dem10m + eastness + northness + maxent, data = wv)
print(spod.lrm)
## Logistic Regression Model
##  
##  lrm(formula = spod ~ dem10m + eastness + northness + maxent, 
##      data = wv)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           250    LR chi2      52.63    R2       0.268    C       0.773    
##   FALSE        174    d.f.             4    g        1.339    Dxy     0.546    
##   TRUE          76    Pr(> chi2) <0.0001    gr       3.817    gamma   0.546    
##  max |deriv| 1e-09                          gp       0.234    tau-a   0.232    
##                                             Brier    0.167                     
##  
##            Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept  7.4505 2.3284  3.20  0.0014  
##  dem10m    -0.0090 0.0023 -3.82  0.0001  
##  eastness  -0.8647 0.2563 -3.37  0.0007  
##  northness  0.6756 0.2310  2.93  0.0034  
##  maxent     0.0315 0.0086  3.67  0.0002  
## 

Accuracy - modEvA R package

library(modEvA)
cbind(d.squared     = Dsquared(spod.glm), 
      adj.d.squared = Dsquared(spod.glm, adjust = TRUE)
      )
##      d.squared adj.d.squared
## [1,] 0.1713869     0.1578586
rbind(RsqGLM(spod.glm))
##      CoxSnell  Nagelkerke McFadden  Tjur      sqPearson
## [1,] 0.1898509 0.268436   0.1713869 0.2054845 0.209848

Accuracy - Confusion Matrix

library(caret)
cm <- table(observed = wv$spod, predicted = predict(spod.glm, type = "response") > 0.5)
confusionMatrix(cm, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##         predicted
## observed FALSE TRUE
##    FALSE   158   16
##    TRUE     38   38
##                                           
##                Accuracy : 0.784           
##                  95% CI : (0.7278, 0.8334)
##     No Information Rate : 0.784           
##     P-Value [Acc > NIR] : 0.536375        
##                                           
##                   Kappa : 0.4443          
##                                           
##  Mcnemar's Test P-Value : 0.004267        
##                                           
##             Sensitivity : 0.7037          
##             Specificity : 0.8061          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 0.9080          
##              Prevalence : 0.2160          
##          Detection Rate : 0.1520          
##    Detection Prevalence : 0.3040          
##       Balanced Accuracy : 0.7549          
##                                           
##        'Positive' Class : TRUE            
## 

Accuracy - Probability Threshold

Interpreting the Coefficients

# extract coefficient, stored in log(odds)
coef(spod.glm)
##  (Intercept)       dem10m     eastness    northness       maxent 
##  7.450480458 -0.008974187 -0.864714623  0.675560332  0.031468219
# convert coefficient to odds
exp(coef(spod.glm))
##  (Intercept)       dem10m     eastness    northness       maxent 
## 1720.6896668    0.9910660    0.4211717    1.9651338    1.0319686

Interpreting the Coefficients - Partial Effect Plots

library(visreg); par(mfcol = c(2, 2))
visreg(spod.glm, scale = "response", ylab = "spodic probability")

Interpreting the Coeficiencts - Deviance Explained

anova(spod.glm)
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: spod
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev
## NULL                        249     307.11
## dem10m     1   8.0805       248     299.03
## eastness   1  23.2714       247     275.76
## northness  1   7.0218       246     268.73
## maxent     1  14.2606       245     254.47

Interpreting the Coeficiencts - Deviance Explained

Checking Assumptions - Multicollinearity

##    dem10m  eastness northness    maxent 
##  1.196760  1.020541  1.014372  1.230978

Checking Assumptions - Residual Plots

Checking Assumptions - Resdiual Plots

library(car)
residualPlots(spod.glm, fit = FALSE, tests = FALSE)

Variable Selection

spod.glm2 <- glm(spod ~., data = wv)
spod.step <- step(spod.glm2, trace = FALSE)

summary(spod.step)
## 
## Call:
## glm(formula = spod ~ downslpgra + eastness + mirref + ndvi + 
##     northeastn + northness + planc100 + relpos11 + slp50 + solar + 
##     tanc75 + pred, data = wv)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8701  -0.2469  -0.1020   0.2881   0.8774  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.487e-01  9.936e-01   0.552  0.58129    
## downslpgra  -5.105e-03  1.773e-03  -2.880  0.00434 ** 
## eastness     3.272e+04  2.158e+04   1.517  0.13069    
## mirref      -6.202e-01  3.761e-01  -1.649  0.10047    
## ndvi         1.403e+00  7.509e-01   1.869  0.06286 .  
## northeastn  -4.627e+04  3.051e+04  -1.517  0.13070    
## northness    3.272e+04  2.157e+04   1.517  0.13070    
## planc100     3.674e-01  2.211e-01   1.662  0.09788 .  
## relpos11    -3.006e-01  1.988e-01  -1.512  0.13185    
## slp50       -1.179e-02  4.369e-03  -2.699  0.00745 ** 
## solar       -6.784e-07  4.510e-07  -1.504  0.13387    
## tanc75       1.287e+00  6.877e-01   1.871  0.06253 .  
## pred         9.768e-01  2.165e-01   4.512 1.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1611452)
## 
##     Null deviance: 52.896  on 249  degrees of freedom
## Residual deviance: 38.191  on 237  degrees of freedom
## AIC: 267.76
## 
## Number of Fisher Scoring iterations: 2

Variable Selection

drop1(spod.step, test = "Chisq")
## Single term deletions
## 
## Model:
## spod ~ downslpgra + eastness + mirref + ndvi + northeastn + northness + 
##     planc100 + relpos11 + slp50 + solar + tanc75 + pred
##            Df Deviance    AIC scaled dev.  Pr(>Chi)    
## <none>          38.191 267.76                          
## downslpgra  1   39.528 274.36      8.5982  0.003365 ** 
## eastness    1   38.562 268.17      2.4146  0.120208    
## mirref      1   38.630 268.61      2.8521  0.091254 .  
## ndvi        1   38.754 269.42      3.6577  0.055810 .  
## northeastn  1   38.562 268.17      2.4146  0.120208    
## northness   1   38.562 268.17      2.4146  0.120209    
## planc100    1   38.636 268.65      2.8961  0.088794 .  
## relpos11    1   38.560 268.16      2.4002  0.121317    
## slp50       1   39.366 273.33      7.5703  0.005934 ** 
## solar       1   38.556 268.13      2.3753  0.123266    
## tanc75      1   38.756 269.42      3.6669  0.055502 .  
## pred        1   41.472 286.36     20.6016 5.655e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variable Selection - Deviance Explained

Final Spodic Model

Questions

?….

References

Beaudette, D. E., & O’Geen, A. T, 2009. Quantifying the aspect effect: an application of solar radiation modeling for soil survey. Soil Science Society of America Journal, 73:1345-1352

Evans, D.M. and Hartemink, A.E., 2014. Digital soil mapping of a red clay subsoil covered by loess. Geoderma, 230:296-304.

Additional reading

Lane, P.W., 2002. Generalized linear models in soil science. European Journal of Soil Science 53, 241- 251. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2002.00440.x/abstract

James, G., D. Witten, T. Hastie, and R. Tibshirani, 2014. An Introduction to Statistical Learning: with Applications in R. Springer, New York. http://www-bcf.usc.edu/~gareth/ISL/

Hengl, T. 2009. A Practical Guide to Geostatistical Mapping, 2nd Edt. University of Amsterdam, www.lulu.com, 291 p. ISBN 978-90-9024981-0. http://spatial-analyst.net/book/system/files/Hengl_2009_GEOSTATe2c0w.pdf