Chapter 5 Generalized Linear Models

5.1 Introduction

Generalized linear models (GLM) as the name implies are a generalization of the linear modeling framework to allow for the modeling of response variables (e.g. soil attributes) with non-normal distributions and heterogeneous variances. Whereas linear models are designed for predicting continuous soil properties such as clay content or soil temperature, GLM can be used to predict the presence/absence of argillic horizons (i.e. logistic regression) or counts of a plant species along a transect (i.e. Poisson regression). These generalizations greatly expand the applicability of the linear modeling framework, while still allowing for a similar fitting procedure and interpretation of the resulting models.

In the past in order to handle non-linearity and heterogeneous variances, transformations have been made to the response variable, such as the log(y). However, such transformations complicate the models interpretation because the results refer to the transformed scale (e.g. log(y))(Lane 2002). These response transformations are not guaranteed to achieve both normality and constant variance simultaneously. GLM approaches transform the response, but also preserve the scale of the response, and provide separate functions to transform the mean response and variance, known as the link and variance functions respectively. So instead of looking like this:

\(f(y) = \beta_{0} + \beta_{1}x + \varepsilon\)

you get this:

\(g(\mu)\) or \(\eta = \beta_{0} + \beta_{1}x + \varepsilon\)

with \(g(\mu)\) or \(\eta\) symbolizing the link function.

Another alteration of the classical linear model is that with GLM the coefficients are estimated iteratively by maximum likelihood estimation instead of ordinary least squares. This results in the GLM minimizing the deviance, instead of the sum of squares. However, for the Gaussian (i.e. normal) distributions the deviance and sum of squares are equivalent.

5.2 Logistic Regression

Logistic regression is a specific type of GLM designed to model data that has a binomial distribution (i.e. presence/absence, yes/no, or proportional data), which in statistical learning parlance is considered a classification problem. For binomial data the logit link transform is generally used. The effect of the logit transform can be seen in the following figure. It creates a sigmoidal curve, which enhances the separation between the two groups. It also has the effect of ensuring that the values range between 0 and 1.

When comparing a simple linear model vs a simple logistic model we can see the effect of the logit transform on the relationship between the response and predictor variable. As before it follows a sigmoidal curve and prevents predictions from exceeding 0 and 1.

## Warning: Removed 19 rows containing missing values or values outside the scale range
## (`geom_line()`).

5.3 Examples

Example 1: Probability of Mollisols (Beaudette and O’Geen 2009)
Example 1: Probability of Mollisols (Beaudette and O’Geen 2009)
Example 2: Probability of Red Clay (Evans and Hartemink 2014)
Example 2: Probability of Red Clay (Evans and Hartemink 2014)
Example 3: Probability of Ponding (NRCS Unpublished)
Example 3: Probability of Ponding (NRCS Unpublished)

5.4 Exercise

Now that we’ve discussed some of the basic background GLM theory we’ll move on to a real exercise, and address any additional theory where it relates to specific steps in the modeling process. The examples selected for this chapter come from Joshua Tree National Park (JTNP)(i.e. CA794) in the Mojave desert. The problem tackled here is a familiar one: Where can I expect to find argillic horizons on fan piedmonts? Argillic horizons within the Mojave are typically found on fan remnants, which are a stable landform that is a remnant of the Pleistocene (Peterson 1981). Despite the low relief of most fans, fan remnants are uplands in the sense that they generally don’t receive run-on or active deposition.

With this dataset we’ll encounter some challenges. To start with, fan piedmont landscapes typically have relatively little relief. Since most of our predictors will be derivatives of elevation, that won’t leave us with much to work with. Also, our elevation data comes from the USGS National Elevation dataset (NED), which provides considerably less detail than say LiDAR or IFSAR data (Shi et al. 2012). Lastly our pedon dataset like most in NASIS, hasn’t received near as much quality control as have the components. So we’ll need to wrangle some of the pedon data before we can analyze it. These are all typical problems encountered in any data analysis and should be good practice. Ideally, it would be more interesting to try and model individual soil series with argillic horizons, but due to some of the challenges previously mentioned it would be difficult with this dataset. However, at the end we’ll look at one simple approach to try and separate individual soil series with argillic horizons.

5.4.1 Load packages

To start, as always we need to load some extra packages. This will become a familiar routine every time you start R. Most of the basic functions we need to develop a logistic regression model are contained in base R, but the following contain some useful spatial and data manipulation functions. Believe it or not we will use all of them and more.

library(aqp)     # soil data classes
library(raster)  # raster data functions
library(sf)      # vector data functions
library(mapview) # interactive mapping
library(ggplot2) # graphing
library(dplyr)   # data manipulation
library(tidyr)   # data manipulation
library(rms)     # additional regression tools
library(caret)   # additional modeling tools

5.4.2 Read in data

Hopefully like all good soil scientists and ecological site specialists you enter your field data into NASIS. Better yet hopefully someone else did it for you! Once data are captured in NASIS it is much easier to import the data into R, extract the pieces you need, manipulate it, model it, etc. If it’s not entered into NASIS, it may as well not exist. For this exercise we’ll load a cached dataset on GitHub.

# pedons <- fetchNASIS()
githubURL <- url("https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ca794_pedons.RData")
load(githubURL)
# Examine the SoilProfileCollection object
View(pedons)

5.5 Exploratory analysis (EDA)

5.5.1 Data wrangling

Generally before we begin modeling you should spend some time exploring the data. By examining a simple summary we can quickly see the breakdown of how many argillic horizons we have. Unfortunately, odds are good that all the argillic horizons haven’t been consistently populated in the diagnostic horizon table like they should be. Luckily for us, the desert argillic horizons always pop up in the taxonomic name, so we can use pattern matching to extract it. By doing this we gain an additional 11 pedons with argillic horizons and are able to label the missing values (i.e. NA). At a minimum for modeling purposes we probably need 10 pedons of the target we’re interested in and a total of 100 observations overall.

# Check consistency of argillic horizon population

# extract the site table from the SoilProfileCollection object
s <- site(pedons) 


# tabulate column
count(s, argillic.horizon)
##   argillic.horizon   n
## 1            FALSE 790
## 2             TRUE 263
## 3               NA 167
# Extract argillic presence from the taxonomic subgroup
s <- mutate(s, argillic = grepl("arg", taxsubgrp))


# tabulate column
count(s, argillic)
##   argillic    n
## 1    FALSE 1022
## 2     TRUE  198

Ideally, if the diagnostic horizon table had been populated consistently we could have used the upper depth to diagnostic feature to filter out argillic horizons that start below 50cm, which may not be representative of “good” argillic horizons and may therefore have gotten correlated to a Torripsamments anyway. Not only are unrepresentative sites confusing for scientists, they’re equally confusing for models. However, as we saw earlier, some pedons don’t appear to be fully populated, so we’ll stick with those pedons that have the argillic specified in their taxonomic subgroup name, since it gives us the biggest sample.

# extract the diagnostic table from the SoilProfileCollection object
diagnostic_hz(pedons) %>%
  # create a new variable
  mutate(argillic.horizon50 = featkind == "argillic horizon" & featdept < 50) %>%
  # tabulate column
  count(argillic.horizon50)
##   argillic.horizon50    n
## 1              FALSE 1935
## 2               TRUE  222
## 3                 NA    4

5.5.2 Geomorphic data

Another obvious place to look is at the geomorphic data in the site table. This information is intended to help differentiate where our soil observations exist on the landscape. If populated consistently it could potentially be used in future disaggregation efforts, as demonstrated by Nauman and Thompson (2014).

# Landform vs argillic presence
s %>%
  # subset rows using logical expression
  filter(argillic == TRUE) %>%
  # cross-tabulate columns
  count(landform, argillic) %>%
  # subset rows by landform.string with > 3 observations
  filter(n > 3) %>%
  # sort data by decreasing n
  arrange(-n)
##                      landform argillic  n
## 1                fan remnants     TRUE 72
## 2                  hillslopes     TRUE 28
## 3                  fan aprons     TRUE 19
## 4                       hills     TRUE 15
## 5                   pediments     TRUE  9
## 6             mountain slopes     TRUE  8
## 7                        <NA>     TRUE  7
## 8               alluvial fans     TRUE  6
## 9                   low hills     TRUE  5
## 10 fan aprons on fan remnants     TRUE  4
## 11                  mountains     TRUE  4
# generalize the landform.string
s <- mutate(
  s, 
  landform_generic = ifelse(grepl("fan|terrace|sheet|drainageway|wash", s$landform), "fan", "hill")
  )

Examining the above frequency table we can see that argillic horizons occur predominantly on fan remnants as was alluded too earlier. However, they also seem to occur frequently on other landforms - some of which are curious combinations of landforms or redundant terms.

# Hillslope position

# # subset rows using logical expression
s_sub <- filter(s, landform_generic == "fan") 


# Hillslope
s_sub %>%
  # cross-tabulate columns
  count(hillslopeprof, argillic) %>%
  # group by variable and calculate percent
  group_by(argillic) %>%
  mutate(pct = round(n / sum(n) * 100)) %>%
  # reshape to wide format
  select(-n) %>%
  pivot_wider(names_from = argillic, values_from = pct)
## # A tibble: 6 × 3
##   hillslopeprof `FALSE` `TRUE`
##   <fct>           <dbl>  <dbl>
## 1 summit             16     39
## 2 shoulder            4      2
## 3 backslope          14     20
## 4 footslope           2      1
## 5 toeslope           16      4
## 6 <NA>               48     34
# slope shape
s_sub %>%
  # create a new variable
  mutate(SS = paste(shapedown, shapeacross)) %>%
  # cross-tabulate columns
  count(SS, argillic) %>%
  # calculate percent by group
  group_by(argillic) %>%
  mutate(pct = round(n / sum(n) * 100)) %>%
  # reshape to wide format
  select(-n) %>%
  pivot_wider(names_from = argillic, values_from = pct)
## # A tibble: 11 × 3
##    SS              `FALSE` `TRUE`
##    <chr>             <dbl>  <dbl>
##  1 NA NA                11     10
##  2 concave concave       1     NA
##  3 concave convex        0     NA
##  4 concave linear        4      3
##  5 convex concave        0      1
##  6 convex convex         7      7
##  7 convex linear         6      9
##  8 linear NA             0     NA
##  9 linear concave        6      1
## 10 linear convex        21     32
## 11 linear linear        44     38

Looking at the hillslope position of fan landforms we can see a slightly higher proportion of argillic horizons are found on summits, while less are found on toeslopes. Slope shape doesn’t seem to provide any useful information for distinguishing argillic horizons.

s_long <- s %>%
  # subset rows using logical expression
  filter(landform_generic == "fan") %>%
  # extract columns
  select(argillic, bedrckdepth, slope, elev, surface_total_frags_pct) %>%
  # reshape to long format
  pivot_longer(cols = c(bedrckdepth, slope, elev, surface_total_frags_pct))


# examine results
head(s_long)
## # A tibble: 6 × 3
##   argillic name                    value
##   <lgl>    <chr>                   <dbl>
## 1 FALSE    bedrckdepth                NA
## 2 FALSE    slope                      28
## 3 FALSE    elev                     1761
## 4 FALSE    surface_total_frags_pct    85
## 5 FALSE    bedrckdepth                11
## 6 FALSE    slope                      60
# plot results
ggplot(s_long, aes(x = argillic, y = value)) +
  geom_boxplot() +
  facet_wrap(~ name, scale = "free")

Looking at boxplots of our numeric variables we can see none of them show much separation between the presense/absense of argillic horizons.

5.5.3 Soil Scientist Bias

Next we’ll look at soil scientist bias. The question being: Are some soil scientists more likely to describe argillic horizons than others? Due to the excess number of soil scientist that have worked on CA794, including detailees, we’ve filtered the names of soil scientist to include just the top 3 mappers and given priority to the most senior soil scientists when they occur together.

# Filter and sort most frequent soil scientists by seniority
s <- mutate(
  s,
  old = descname,
  descname2 = NA,
  descname2 = ifelse(grepl("Stephen", old), "Stephen", descname2), # least senior
  descname2 = ifelse(grepl("Paul",    old), "Paul",    descname2),
  descname2 = ifelse(grepl("Peter",   old), "Peter",   descname2)  # most senior
  )
s %>%
  # subset rows using logical expression
  filter(landform_generic == "fan") %>%
  # cross-tabulate columns
  count(descname2, argillic) %>%
  # calculate percent by group
  group_by(descname2) %>%
  mutate(pct = round(n / sum(n) * 100)) %>%
  # reshape the results into a table
  select(-n) %>%
  pivot_wider(names_from = argillic, values_from = pct)
## # A tibble: 4 × 3
## # Groups:   descname2 [4]
##   descname2 `FALSE` `TRUE`
##   <chr>       <dbl>  <dbl>
## 1 Paul           73     27
## 2 Peter          90     10
## 3 Stephen        84     16
## 4 <NA>           78     22

For fan landforms, one of the soil scientists seems more likely than the others to describe argillic horizons. However while this information is suggestive, it is far from definitive in showing a potential bias because it doesn’t take into account other factors. We’ll examine this more closely later.

5.5.4 Plot coordinates

Where do our points plot? To start we need to convert them to a spatial object first. Then we can create an interactive we map using mapview. Also, if we wish we can also export the locations as a Shapefile.

# subset rows with only complete X & Y coordinates 
s_sub <- filter(s, complete.cases(x_std, y_std))


# convert s_sub data.frame to a spatial object
s_sf <- st_as_sf(s_sub,
                 # define coordinates
                 coords = c("x_std", "y_std"),
                 # define coordinate reference system using ESPG code
                 crs = 4326
                 ) %>%
  # reproject to equal area coordinate reference system using ESPG code
  st_transform(crs = 5070)


# Download soil survey area boundaries
ca794 <- soilDB::fetchSDA_spatial("CA794", geom.src = "sapolygon") %>%
  # convert to spatial object
  st_as_sf() %>%
  # reproject to equal area coordinate reference system using ESPG code
  st_transform(crs = 5070)

# Plot interactive map
mapview(ca794, alpha.regions = 0, lwd = 2) +
  mapview(s_sf, zcol = "argillic")

5.6 Exercise 1: View the data

  • Examine the interactive map.
  • Hypothesize what predictor variables/landscape relationships might be predictive.

5.6.1 Extracting spatial data

Prior to any spatial analysis or modeling, you will need to develop a suite of geodata files that can be intersected with your field data locations. This is, in and of itself a difficult task, and should be facilitated by your Regional GIS Specialist. Typically, these geodata files would primarily consist of derivatives from a DEM or satellite imagery. Prior to any prediction it is also necessary to ensure the geodata files have the same projection, extent, and cell size. Once we have the necessary files we can construct a list in R of the file names and paths, read the geodata into R, and then extract the geodata values where they intersect with field data.

# load raster stack from GitHub
githubURL <- url("https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ca794_rs.rds")
geodata_r <- readRDS(githubURL)


# Extract the geodata and add to a data frame
data <- raster::extract(geodata_r, as(s_sf, "Spatial"), sp = TRUE)@data

# Modify some of the geodata variables
data <- mutate(
  data, 
  cluster  = factor(cluster),
  twi_sc   = abs(twi - 13.8), # 13.8 = twi median
  gsi      = (ls_3 - ls_1) / (ls_3 + ls_2 + ls_1),
  ndvi     = (ls_4 - ls_3) / (ls_4 + ls_3),
  )

5.6.2 Examine spatial data

With our spatial data in hand, we can now see whether any of the variables will help us separate the presence/absence of argillic horizons. Because we’re dealing with a classification problem, we’ll compare the numeric variables using boxplots. What we’re looking for are variables with the least amount of overlap in their distribution (i.e. the greatest separation in their median values).

URL <- url("https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata")
load(URL)

train <- data

# Select argillic horizons with "arg" in the subgroup name and on fans
# Argillic horizons that occur on hills and mountains more than likely form by different process, and therefore would require a different model.train$argillic 
train <- mutate(
  train, 
  argillic  = factor(ifelse(grepl("arg", taxsubgrp) & train$mrvbf > 0.15, "yes", "no")),
  ch_log  = log(ch + 1),
  z2str_log = log(z2str + 1),
  ch    = NULL,
  z2str = NULL
  )

train <- select(train, c(argillic, z:z2str_log))

data_l <- train %>%
  select(-cluster, -geo) %>%
  pivot_longer(col = - argillic) %>%
  filter(!is.na(value))

ggplot(data_l, aes(x = argillic, y = value)) +
  geom_boxplot() +
  facet_wrap(~ name, scales = "free")

5.7 Constructing the model

R has several functions for fitting linear models. The most common is arguably the glm() function from the stats R package, which is loaded by default. The glm() function is also extended thru the use of several additional packages such as the car and caret R packages. Another noteworthy R package for logistic regrssion is rms, which offers the lrm() function. The rms R package (Harrell 2015) offers an ‘almost’ comprehesive alternative to glm() and it’s accessory function. It is difficult to objectively functions say which approach is better. Therefore methods both methods will be demonstrated. Look for comments (i.e. #) below referring to rms, stats, caret or visreg.

# stats

fit_glm <- glm(argillic ~ z + slp + twi_sc + ch_log + z2str_log + mrrtf + solar + precip + precipsum + temp +tc_1 + tc_2 + tc_3 + k + th + u + cluster, data = train, family = binomial)


# rms

library(rms)

dd <- datadist(train)
options(datadist = "dd")

fit_lrm <- lrm(argillic ~ z + slp + twi_sc + ch_log + z2str_log + mrrtf + solar + precip + precipsum + temp + tc_1 + tc_2 + tc_3 + k + th + u, data = train, x = TRUE, y = TRUE)

5.7.1 Diagnostic

5.7.1.1 Residual plots

One unfortunate side effect of side effect of logistic regression is that the default residual plots are not interpretable. However the partial residual plots can be useful for identifying outliers and nonlinear trends.

par(mfcol = c(2, 2))

plot(fit_glm)

termplot(fit_glm, partial.resid = TRUE, col.res = "black", pch = 16)

5.7.1.2 Multicolinearity

As we mentioned earlier multicolinearity should be avoided. To assess a model for multicolinearity we can compute the variance inflation factor (VIF). Its square root indicates the amount of increase in the predictor coefficients standard error. A value greater than 3 indicates a doubling the standard error. Rules of thumb vary, but a square root of vif greater than 2 or 3 indicates an unacceptable value.

# vif() function from the rms or car packages
sqrt(vif(fit_lrm))

# or 

sqrt(vif(fit_lrm)) > 3

5.7.2 Variable Selection & model validation

Modeling is an iterative process that cycles between fitting and evaluating alternative models. Compared to tree and forest models, linear and generalized models typically require more scrunity from the user. Automated model selection procedures are available, but should not be taken at face value because they may result in complex and unstable models. This is in part due to correlation amongst the predictive variables that can confuse the model. Also, the order in which the variables are included or excluded from the model effects the significance of the other variables, and thus several weak predictors might mask the effect of one strong predictor. Regardless of the approach used, variable selection is probably the most controversial aspect of linear modeling.

Both the rms and caret packages offer methods for variable selection and cross-validation. In this instance the rms approach is a bit more convinent and faster, with the one line call to validate().

set.seed(42)

# rms
## stepwise selection and validation
validate(fit_lrm, bw = TRUE)

The results for validate() above show which variables were retained and deleted. Below we can see a dot matrix of which variables were retained in during the 40 iterations of the bootstrapping. In addition, below we can see the difference between the training and test accuracy and error metrics. Remember that it is the test accuracy we should pay attention too.

5.7.3 Final model & accuracy

Once we’ve settled on a final model we can fit it and validate it. The validate() function can also be used for cross-validation and will get multiple probability-based accuracy metrics, including: \(R^2\), the Brier score (B), Tjur’s D (D), and many others.

final_lrm <- lrm(argillic ~ slp + twi_sc + tc_1 + tc_2 + precipsum, data = train, x = TRUE, y = TRUE)

validate(final_lrm, method = "crossvalidation", B = 10)
##           index.orig training    test optimism index.corrected  n
## Dxy           0.7060   0.7059  0.6944   0.0115          0.6945 10
## R2            0.3202   0.3206  0.3214  -0.0008          0.3210 10
## Intercept     0.0000   0.0000  0.0492  -0.0492          0.0492 10
## Slope         1.0000   1.0000  1.0469  -0.0469          1.0469 10
## Emax          0.0000   0.0000  0.0185   0.0185          0.0185 10
## D             0.1684   0.1686  0.1648   0.0039          0.1646 10
## U            -0.0017  -0.0019 -0.0054   0.0034         -0.0052 10
## Q             0.1701   0.1705  0.1701   0.0004          0.1697 10
## B             0.0755   0.0755  0.0761  -0.0006          0.0761 10
## g             3.7232   3.7273  3.9348  -0.2075          3.9307 10
## gp            0.1311   0.1312  0.1302   0.0009          0.1301 10
# Compare the metrics from rms::validate() to aqp::brierScore() and modEvA::RsqGLM()
pred <- predict(final_lrm, train, type = "fitted")
test <- data.frame(pred, obs = train$argillic == "yes")

# Brier score
brierScore(test, classLabels = "pred", actual = "obs")
## [1] 0.07219881
# Nagelkerke D2 & Tjur's D
unlist(modEvA::RsqGLM(pred = test$pred, obs = test$obs, plot = FALSE))
##   CoxSnell Nagelkerke   McFadden       Tjur  sqPearson 
##  0.1557253  0.3201934  0.2540906  0.1842274  0.1879082

Because we’re dealing with a classification problem, we have to consider both errors of commission (Type I) and omission (Type II), or their corresponding accuracies of sensitivity and positive predicted value respectively. Before we can assess the error, however, we need to select a probability threshold.

  • Sensitivity and specificity examine how well the ground truth or reference data compares to the predictions.
  • Positive and negative predicted values examine the inverse concept of how well the predictions match the reference data as a function of their prevalence (see help(confusionMatrix).
# examine possible thresholds
ggplot(test, aes(x = pred, fill = obs)) +
  geom_density(alpha = 0.5) +
  geom_vline(aes(xintercept = 0.5), lty = "dashed") +
  xlab("probability") +
  scale_x_continuous(breaks = seq(0, 1, 0.2))

test$predictedClass <- ifelse(test$pred > 0.25, TRUE, FALSE)

# Confusion Matrix
cm <- table(predicted = test$predictedClass, observed = test$obs)
confusionMatrix(cm, positive = "TRUE")
## Confusion Matrix and Statistics
## 
##          observed
## predicted FALSE TRUE
##     FALSE   954   56
##     TRUE     92   65
##                                           
##                Accuracy : 0.8732          
##                  95% CI : (0.8527, 0.8917)
##     No Information Rate : 0.8963          
##     P-Value [Acc > NIR] : 0.994946        
##                                           
##                   Kappa : 0.397           
##                                           
##  Mcnemar's Test P-Value : 0.004015        
##                                           
##             Sensitivity : 0.5372          
##             Specificity : 0.9120          
##          Pos Pred Value : 0.4140          
##          Neg Pred Value : 0.9446          
##              Prevalence : 0.1037          
##          Detection Rate : 0.0557          
##    Detection Prevalence : 0.1345          
##       Balanced Accuracy : 0.7246          
##                                           
##        'Positive' Class : TRUE            
## 
# Spatially variable accuracy
test$cluster <- train$cluster
temp <- test %>%
  group_by(cluster) %>%
  summarise(
    TP       = sum(predictedClass == TRUE  & obs, na.rm = TRUE),
    FN       = sum(predictedClass == FALSE & obs, na.rm = TRUE),
    sensitivity = TP / (TP + FN),
    n  = length(obs)
    )

ggplot(temp, aes(x = cluster, y = sensitivity, size = n)) +
  geom_point()

  • Discuss the variability of the predictions across the clusters, perhaps different models need to be constructed in each cluster, some clusters appear to be dominated by specific soil series, these data aren’t clean enough (nor are the series concepts usually) to model series separately, however, we could use the clusters as an additional model to attempt to separate the series. Do the hyperthermic clusters perform differently.

5.7.4 Model effects

# Model accuracy, and slopes (e.g. coefficents)
final_lrm
## Frequencies of Missing Values Due to Each Variable
##  argillic       slp    twi_sc      tc_1      tc_2 precipsum 
##         8        52        53        52        52        52 
## 
## Logistic Regression Model
## 
## lrm(formula = argillic ~ slp + twi_sc + tc_1 + tc_2 + precipsum, 
##     data = train, x = TRUE, y = TRUE)
## 
## 
##                         Model Likelihood      Discrimination    Rank Discrim.    
##                               Ratio Test             Indexes          Indexes    
## Obs           1167    LR chi2     197.55      R2       0.320    C       0.853    
##  no           1046    d.f.             5     R2(5,1167)0.152    Dxy     0.706    
##  yes           121    Pr(> chi2) <0.0001    R2(5,325.4)0.447    gamma   0.707    
## max |deriv| 0.0004                            Brier    0.075    tau-a   0.131    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -2.3859 1.8877 -1.26  0.2062  
## slp       -0.1968 0.0394 -5.00  <0.0001 
## twi_sc    -0.4272 0.0887 -4.82  <0.0001 
## tc_1      -0.0104 0.0076 -1.36  0.1730  
## tc_2       0.0259 0.0237  1.09  0.2748  
## precipsum  0.0743 0.0247  3.01  0.0026
# Chi-square
anova(final_lrm)
##                 Wald Statistics          Response: argillic 
## 
##  Factor     Chi-Square d.f. P     
##  slp        24.97      1    <.0001
##  twi_sc     23.20      1    <.0001
##  tc_1        1.86      1    0.1730
##  tc_2        1.19      1    0.2748
##  precipsum   9.06      1    0.0026
##  TOTAL      85.86      5    <.0001
# Model Effects
summary(final_lrm)
##              Effects              Response : argillic 
## 
##  Factor      Low      High     Diff.   Effect    S.E.    Lower 0.95 Upper 0.95
##  slp           3.5798  24.4420 20.8620 -4.106300 0.82180 -5.7170000 -2.495700 
##   Odds Ratio   3.5798  24.4420 20.8620  0.016468      NA  0.0032895  0.082443 
##  twi_sc        1.3002   4.6222  3.3219 -1.419100 0.29461 -1.9966000 -0.841720 
##   Odds Ratio   1.3002   4.6222  3.3219  0.241920      NA  0.1358000  0.430970 
##  tc_1        103.6200 144.7600 41.1350 -0.427900 0.31405 -1.0434000  0.187630 
##   Odds Ratio 103.6200 144.7600 41.1350  0.651880      NA  0.3522400  1.206400 
##  tc_2         46.3230  60.8050 14.4820  0.374410 0.34284 -0.2975500  1.046400 
##   Odds Ratio  46.3230  60.8050 14.4820  1.454100      NA  0.7426400  2.847300 
##  precipsum    26.4350  35.6600  9.2253  0.685080 0.22765  0.2388900  1.131300 
##   Odds Ratio  26.4350  35.6600  9.2253  1.983900      NA  1.2698000  3.099600
plot(summary(final_lrm))

# Plot Effects
ggplot(Predict(final_lrm, fun = plogis))

  • View the results in ArcGIS and examine the accuracy at individual points
  • Discuss the effects of data quality, including both NASIS and GIS
  • Discuss how the modeling process isn’t an end in itself, but serves to uncover trends, possibly generate additional questions and direct future investigations

5.8 Generate spatial predictions

# Custom function to return the predictions and their standard errors
library(raster)

# wrapper function, to help with prediction at each grid cell
predfun <- function(model, data) {
  
  # predict presence of argillic horizon (outcome = TRUE)
  # result is vector of probabilities
  prob <- predict(model, data, type = "fitted")
  
  # compute Shannon entropy from both (FALSE | TRUE) probabilities
  H <- apply(cbind(prob, 1 - prob), 1, shannonEntropy)
  
  # combine Pr(Argillic == TRUE) and Shannon entropy
  res <- cbind(prob, H)
  
  # return to calling function
  return(res)
  }
  
# Generate spatial predictions
r <- predict(geodata_r, final_lrm, fun = predfun, index = 1:2, progress = "text")
library(raster)

# argillic probability
plot(r[[1]], col = viridis::viridis(10))

# argillic Shannon Entropy
plot(r[[2]], col = viridis::viridis(10))

5.9 Exercise

  1. Construct a dataset of shallow soils as shown below.
URL <- url("https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata")
load(URL)

train2 <- aqp::allocate(pedons, to = "ST Diagnostic Features")
train2 <- train2 %>%
  mutate(shallow = grepl("contact", train2$featkind) & train2$featdept) %>%
  arrange(peiid, -shallow, featdept) %>%
  filter(shallow == TRUE & ! duplicated(peiid)) %>%
  right_join(data, by = "peiid") %>%
  mutate(shallow = ifelse(is.na(shallow), FALSE, shallow)) %>%
  select(z:mast, pc_1:u, ndvi:gsi, shallow)
  1. Fit a GLM

  2. Perform a variable selection

    • Does the automatic selection make sense?
  3. Assess the model accuracy and plot the fit

    • How do the train vs test accuracies compare?
  4. Summarize and plot the model effects

    • Which variable has the steepest slope?
    • Which variable has the greatest effect?
    • Which variable has the highest Chi-squared?

5.10 Additional reading

The application of generalized linear models in soil science are described by Lane (2002).@james2021 provide a useful introduction to logistic regression.

References

Beaudette, D. E., and A. T. O’Geen. 2009. “Quantifying the Aspect Effect: An Application of Solar Radiation Modeling for Soil Survey.” Soil Science Society of America Journal 73 (4): 1345–52. https://doi.org/10.2136/sssaj2008.0229.
Evans, D. M., and Alfred E. Hartemink. 2014. “Digital Soil Mapping of a Red Clay Subsoil Covered by Loess.” Geoderma 230-231 (October): 296–304. https://doi.org/10.1016/j.geoderma.2014.03.013.
Harrell, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer Series in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-19425-7.
Lane, P. W. 2002. “Generalized Linear Models in Soil Science.” European Journal of Soil Science 53 (2): 241–51. https://doi.org/10.1046/j.1365-2389.2002.00440.x.
Peterson, Frederick F. 1981. “Landforms of the Basin and Range Province Defined for Soil Survey.” Reno, Nevada. https://jornada.nmsu.edu/files/Peterson_LandformsBasinRangeProvince.pdf.
Shi, Xun, Lyndsey Girod, Robert Long, Roger DeKett, Jessica Philippe, and Tom Burke. 2012. “A Comparison of LiDAR-Based DEMs and USGS-Sourced DEMs in Terrain Analysis for Knowledge-Based Digital Soil Mapping.” Geoderma 170 (January): 217–26. https://doi.org/10.1016/j.geoderma.2011.11.020.