Introduction

Historically, various iterations of soil survey have engaged primarily in descriptive science. More specifically, soil survey has been concerned with the characterization and categorization of various soil phenomenon into discrete groups. The goal of this descriptive approach was to facilitate useful interpretations of the landscape for the purpose of agriculture, silviculture, engineering, etc. Soil scientists developed soil surveys using tacit knowledge derived from observations and recognition of patterns on the landscapes they worked.

One of the most common relationships developed from tacit knowledge is the relationship between soil moisture regime (SMR) and plant community. In the Northern Rocky Mountains MLRA (43A), it has been common practice to estimate the SMR based on Forest Service Habitat Types (HT). However, depending on this qualitative relationship reduces our ability to accurately estimate SMR in the absence of ideal site conditions (i.e., post-fire). Here, I attempt to quantify the relationship between xeric and udic SMR and common environmental data. I then develop a model for SMR using environmental covariates.

Objectives

  1. Evaluate the distribution of environmental covariates grouped by xeric and udic soil moisture regimes (SMR), as well as the most common habitat type (HT) classification.
  2. Utilize tree-based modeling techniques to generate a decision tree for the classification of soil moisture regimes.

Project Area

The project area is the portion of MLRA 43A that is within the Moscow office’s area of responsibility. This area spans from the Idaho-Canadian border down into central Idaho (Fig 1.). It is an incredibly diverse landscape ranging from glaciated mountains in the north to steep river canyons in the south. Parent materials consist of glacial till, loess covered hills, and eroding Columbia basalt.

Fig 1. Map of Study Area. Polygons represent Land Resource Units (LRU) within MLRA.

Fig 1. Map of Study Area. Polygons represent Land Resource Units (LRU) within MLRA.

Results

Objective 1: Evaluating Environmental Covariates

Point data pulled from NASIS were used in these analyses. The original data consisted of 1044 observations. These data were used to extract raster values for heat load index (HLI), mean annual precipitation (MAP), average minimum (tmin) and maximum temperature (tmax), potential evapotranspiration (ETPOT), and elevation. All rasters were resampled to a resolution of 10m.

HLI is a unitless index representing an estimation of direct incident radiation, and is calculated using slope and aspect. The HLI layer used here was created with the spatialEco::hli() function. ETPOT was calculated using the ETpot tool in SAGA (Hargraeves and Samani, 1985; Conrad et al., 2015). Temperature and precipitation data were acquired from PRISM. Elevation data was acquired from the Geospatial Data Gateway.


# DATA CLEANING
# set missing values to NA
pnts <- pnts %>%
  mutate_if(is.character, ~replace(., . == " ", NA))

# check NAs
table(is.na(pnts$txmstcl))
## 
## FALSE  TRUE 
##   865   179
# 179 missing values, lets see if we can populate those based on taxonomy
pnts$txmstcl[grepl("xer|Xer",pnts$taxclnm)] <- "xeric"
pnts$txmstcl[grepl("ud|Ud",pnts$taxclnm)] <- "udic"
pnts$txmstcl[grepl("aqu|Aqu",pnts$taxclnm)] <- "aquic"

# check
table(is.na(pnts$txmstcl))
## 
## FALSE  TRUE 
##   905   139
# now only 139, lots of missing taxonomy names as well (113)

# eliminate unused groups
pnts$txmstcl[grepl("perudic",pnts$txmstcl)] <- "udic" #n = 1
pnts$txmstcl[grepl("aquic",pnts$txmstcl)] <- NA #n = 93

# set to factors
pnts$txmstcl <- as.factor(pnts$txmstcl)

# summarize quantiles for each covariate grouped by SMR
# create function for mode
getmode <- function(v) {
  # function finds mode with smallest index (i.e. the first mode found if there are ties)
  uniqv <-  unique(v)
  uniqv[which.max(tabulate(match(v,uniqv)))]
}

# mos_esds is prepared data without associated geometry

smr_p <- mos_esds %>% filter(map_ras != 0) %>% filter(!is.na(txmstcl)) %>%
  group_by(txmstcl) %>%
  summarise(p10 = quantile(map_ras,.1),
            p50 = quantile(map_ras,.5),
            p90 = quantile(map_ras,.9),
            habitat_type = getmode(na.omit(ht)),
            n = n())

smr_e <- mos_esds %>% filter(etpot != 0) %>% filter(!is.na(txmstcl)) %>%
  group_by(txmstcl) %>%
  summarise(p10 = quantile(etpot,.1),
            p50 = quantile(etpot,.5),
            p90 = quantile(etpot,.9),
            habitat_type = getmode(na.omit(ht)),
            n = n())

smr_h <- mos_esds %>% filter(hli != 0) %>% filter(!is.na(txmstcl)) %>%
  group_by(txmstcl) %>%
  summarise(p10 = quantile(hli,.1),
            p50 = quantile(hli,.5),
            p90 = quantile(hli,.9),
            habitat_type = getmode(na.omit(ht)),
            n = n())
Fig 2. Shaded Density Plot of Mean Annual Precipitation (A), Average Potential Evapotranspiration (B), and Heat Load Index (C) by soil moisture regime.

Fig 2. Shaded Density Plot of Mean Annual Precipitation (A), Average Potential Evapotranspiration (B), and Heat Load Index (C) by soil moisture regime.




Table 1 Descriptive Statistics of Mean Annual Precipitation (in)
SMR 10th 50th 90th Habitat Type n
udic 28.3585 36.3580 44.0738 western redcedar/queencup beadlily 473
xeric 23.4549 28.0894 37.2382 grand fir/ninebark 336
Table 2 Descriptive Statistics of Potential Evapotranspiration (mm)
SMR 10th 50th 90th Habitat Type n
udic 134.9688 156.9698 165.0735 western redcedar/queencup beadlily 473
xeric 148.2561 159.4336 183.6904 grand fir/ninebark 336
Table 3 Descriptive Statistics of Heat Load Index
SMR 10th 50th 90th Habitat Type n
udic 0.4357 0.660 0.8441 western redcedar/queencup beadlily 473
xeric 0.6018 0.722 0.8814 grand fir/ninebark 336

Habitat type : most commonly classified habitat type.
n : total number of observations.



Objective 2: Classifying SMR with Tree-based Models

A regression tree model (rpart()) was trained using a 70:30 (train:test) data split. The model was then pruned according using the lowest x-relative error. This produced a final model using MAP,HLI, minimum annual temperature (tmin), maximum annual temperature (tmax), and ETPOT as independent variables. The model was then validated using the test set, and had a final accuracy of 0.75.


Cleaning data and developing the model

# clean data for covariates
mos2 <- filter(mos_esds, !is.na(hli))
mos2 <- filter(mos2, !is.na(txmstcl))

# reproducible
set.seed(9724)

# split data into 70:30 train and test set
datasplit <- sample(nrow(mos2), nrow(mos2) * 0.7)
train <- mos2[datasplit,]
test <- mos2[-datasplit,]

# run regression tree model
smr_model <- rpart(txmstcl ~ elev_ras + etpot + map_ras + hli + slp_fld + tmax_ras + tmin_ras,
                   data = train,
                   method = "class")

# prune tree based on lowest X-Relative Error
smr_prun <- prune(smr_model, cp = 0.010593)
prp(smr_prun, branch = .8, yesno = 2, box.palette = "BuRd", type = 2, extra = 100, branch.lty = 2)
Fig 3. Decision Tree for Soil Moisture Regime. Percentage represents observations in node.

Fig 3. Decision Tree for Soil Moisture Regime. Percentage represents observations in node.



Fig. 4. Variable Importance Plot. Importance scaled from 0 to 100.

Fig. 4. Variable Importance Plot. Importance scaled from 0 to 100.


Validating the model

# validate model using test dataset
pred <- predict(smr_prun, newdata = test, type = "class")
cm <- confusionMatrix(pred, test$txmstcl)

Table 4. Rpart Model Validation Metrics.
Accuracy Kappa Sensitivity Specificity Precision
0.75 0.48 0.81 0.66 0.77


Next, a random forest model (randomForest()) was trained using the same data split as above. Predictor variables were chosen using the Boruta() algorithm, and model parameters were chosen using the caret::train() function. The final independent variables were elevation, ETPOT, MAP, HLI, tmax, and tmin. The model was then validated using the test set, and had a final accuracy of 0.79.


Developing a Random Forest Model

# run the Boruta algorithm
library(Boruta)
xvars <- c('elev_ras','etpot','map_ras','hli','slp_fld','tmax_ras','tmin_ras')
smr_bor <- Boruta(y = train$txmstcl, x = train[, xvars], maxRuns = 35, doTrace = 1)

# plot variable importance and selected features
# make ggplot
# data transformations
bxbr <-  as.data.frame(smr_bor[["ImpHistory"]])
bxbr2 <- melt(bxbr)
t <- as.data.frame(smr_bor[["finalDecision"]])
t$variable <- rownames(t)
names(t)[1] <- "decision"
bxbr2 <- merge(bxbr2, t)
bxbr2$variable <-  with(bxbr2, reorder(variable,value, median))


ggplot(data = bxbr2, aes(x = variable, y = value, fill = decision)) +
  geom_boxplot() +
  ylab(label = "Importance") +
  xlab(label = "Attributes")
Fig 5. Variable Importance Plot.

Fig 5. Variable Importance Plot.


# fit model, automatically tune, and cross-validate on Boruta variables
# extract the selected feature variables
vars <- getSelectedAttributes(smr_bor)

# automate model parameters
tc <- trainControl(
  method = "cv", number = 10, returnResamp = "all",
  savePredictions = TRUE,
  search = "grid",
  verboseIter = FALSE
)

smr_tr_bor <- train(y = train$txmstcl, x = train[, vars],
                    method = "ranger",
                    importance = "permutation",
                    trControl = tc)

# train model with parameters from smr_tr_bor$finalModel
library(ranger)
yvar <- "txmstcl"
smr_rf <- ranger(txmstcl ~ .,
                       data = train[c(yvar,vars)],
                       min.node.size = 1,
                       num.trees = 500,
                       mtry = 2,
                       splitrule = "extratrees")

# validate
pred <- predict(smr_rf, data = test, type = "response")
test$predictions <- pred[["predictions"]]
cm2 <- confusionMatrix(test$predictions, test$txmstcl)


Table 5. Random Forest Model Validation Metrics.
Accuracy Kappa Sensitivity Specificity Precision
0.79 0.56 0.81 0.66 0.77



Discussion

Although the foundations of soil survey were laid with descriptive science, advancements in technology and science- as well as changing stakeholder needs- have made it necessary for soil survey to adopt a more quantitative approach. As part of this endeavor, it is becoming increasingly important to analyze our previous, tacit understandings of soil-landscape relationships. In other words, taking the soil scientist’s ‘mental-model’ of the landscape and quantifying it using statistical techniques. I have shown here that quantifying the mental-model of soil moisture regimes (SMR) is not only possible with statistical methods, but a viable option as well given the small amount of environmental covariates used.

Comparing the distributions of mean annual precipitation (MAP), potential evapotranspiration (ETPOT), and heat load index (HLI) across my data set of 809 observations (Fig 2), I found that all distributions overlapped to some extent. However, xeric soil moisture regime tended to have lower median values for MAP and higher median values for HLI (Tables 1 & 3). Interestingly, the distribution of xeric ETPOT was bimodal (Fig 2), but still trended toward higher values. The overlapping distributions show that individual environmental variables are not indicative of SMR by themselves, and specific site conditions, such as slope and elevation, likely modulate the importance of these variables. Therefore, I developed two tree-based models which utilized the predictive capacity of multiple environmental covariates.

The first model trained was a regression tree model using the rpart package. The second model trained was a random forest model using the randomForest package. Both models performed well, with the random forest model performing slightly better with a validation accuracy of 79% (Table 5). The final covariates in each model were similar, with each model choosing ETPOT, MAP, HLI, tmax and tmin (Figs 4 & 5). For each model, MAP scored as the most important variable. Conversely, the random forest model had HLI as the second most important variable, whereas the regression tree model had it as the least important. The random forest model seems to follow the logic proposed in the mental-model, where precipitation modulated by the effects of aspect and slope exert the greatest control over SMR. However, the random forest model is a ‘black-box’, and one can not visualize the hundreds of decision trees within the model. For this reason, the output of the regression tree model may be more user-friendly (Fig 3).

In summary, the traditional mental-model for SMR depends heavily on the habitat type present at the site. This model has worked well because vegetation is highly responsive to the same environmental conditions that influence SMR- such as precipitation, slope, and aspect. For this project I accurately classified SMR based on a set of readily available environmental covariates using tree-based models. The development of these models will allow for the classification of SMR in the absence of vegetation data. However, the accuracy of these models depends on the accuracy of the point data SMR classifications. Since the SMR has been estimated based on the habitat type, variation in describers’ plant knowledge and ocular assessments of species abundances may influence the accuracy of the SMR classification. This further supports the need for objective classification techniques. Therefore, future projects could improve upon these models by using data classified from soil moisture sensors.

References

Conrad, O., Bechtel, B., Bock, M., Dietrich, H., Fischer, E., Gerlitz, L., Wehberg, J., Wichmann, V., and Böhner, J. (2015): System for Automated Geoscientific Analyses (SAGA) v. 2.1.4, Geosci. Model Dev., 8, 1991-2007, doi:10.5194/gmd-8-1991-2015.

Hargreaves, G.H., Samani, Z.A. (1985): Reference crop evapotranspiration from ambient air temperatures.. Paper presented in ASAE Regional Meeting, Grand Junction, Colorado. http://cagesun.nmsu.edu/~zsamani/papers/Hargreaves_Samani_85.pdf.