Historically, various iterations of soil survey have engaged primarily in descriptive science. More specifically, soil survey has been concerned with the characterization and categorization of various soil phenomenon into discrete groups. The goal of this descriptive approach was to facilitate useful interpretations of the landscape for the purpose of agriculture, silviculture, engineering, etc. Soil scientists developed soil surveys using tacit knowledge derived from observations and recognition of patterns on the landscapes they worked.
One of the most common relationships developed from tacit knowledge is the relationship between soil moisture regime (SMR) and plant community. In the Northern Rocky Mountains MLRA (43A), it has been common practice to estimate the SMR based on Forest Service Habitat Types (HT). However, depending on this qualitative relationship reduces our ability to accurately estimate SMR in the absence of ideal site conditions (i.e., post-fire). Here, I attempt to quantify the relationship between xeric and udic SMR and common environmental data. I then develop a model for SMR using environmental covariates.
The project area is the portion of MLRA 43A that is within the Moscow office’s area of responsibility. This area spans from the Idaho-Canadian border down into central Idaho (Fig 1.). It is an incredibly diverse landscape ranging from glaciated mountains in the north to steep river canyons in the south. Parent materials consist of glacial till, loess covered hills, and eroding Columbia basalt.
Point data pulled from NASIS were used in these analyses. The original data consisted of 1044 observations. These data were used to extract raster values for heat load index (HLI), mean annual precipitation (MAP), average minimum (tmin) and maximum temperature (tmax), potential evapotranspiration (ETPOT), and elevation. All rasters were resampled to a resolution of 10m.
HLI is a unitless index representing an estimation of direct incident
radiation, and is calculated using slope and aspect. The HLI layer used
here was created with the spatialEco::hli()
function. ETPOT
was calculated using the ETpot tool in SAGA (Hargraeves and Samani,
1985; Conrad et al., 2015). Temperature and precipitation data were
acquired from PRISM. Elevation data was acquired from the Geospatial
Data Gateway.
# DATA CLEANING
# set missing values to NA
pnts <- pnts %>%
mutate_if(is.character, ~replace(., . == " ", NA))
# check NAs
table(is.na(pnts$txmstcl))
##
## FALSE TRUE
## 865 179
# 179 missing values, lets see if we can populate those based on taxonomy
pnts$txmstcl[grepl("xer|Xer",pnts$taxclnm)] <- "xeric"
pnts$txmstcl[grepl("ud|Ud",pnts$taxclnm)] <- "udic"
pnts$txmstcl[grepl("aqu|Aqu",pnts$taxclnm)] <- "aquic"
# check
table(is.na(pnts$txmstcl))
##
## FALSE TRUE
## 905 139
# now only 139, lots of missing taxonomy names as well (113)
# eliminate unused groups
pnts$txmstcl[grepl("perudic",pnts$txmstcl)] <- "udic" #n = 1
pnts$txmstcl[grepl("aquic",pnts$txmstcl)] <- NA #n = 93
# set to factors
pnts$txmstcl <- as.factor(pnts$txmstcl)
# summarize quantiles for each covariate grouped by SMR
# create function for mode
getmode <- function(v) {
# function finds mode with smallest index (i.e. the first mode found if there are ties)
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v,uniqv)))]
}
# mos_esds is prepared data without associated geometry
smr_p <- mos_esds %>% filter(map_ras != 0) %>% filter(!is.na(txmstcl)) %>%
group_by(txmstcl) %>%
summarise(p10 = quantile(map_ras,.1),
p50 = quantile(map_ras,.5),
p90 = quantile(map_ras,.9),
habitat_type = getmode(na.omit(ht)),
n = n())
smr_e <- mos_esds %>% filter(etpot != 0) %>% filter(!is.na(txmstcl)) %>%
group_by(txmstcl) %>%
summarise(p10 = quantile(etpot,.1),
p50 = quantile(etpot,.5),
p90 = quantile(etpot,.9),
habitat_type = getmode(na.omit(ht)),
n = n())
smr_h <- mos_esds %>% filter(hli != 0) %>% filter(!is.na(txmstcl)) %>%
group_by(txmstcl) %>%
summarise(p10 = quantile(hli,.1),
p50 = quantile(hli,.5),
p90 = quantile(hli,.9),
habitat_type = getmode(na.omit(ht)),
n = n())
SMR | 10th | 50th | 90th | Habitat Type | n |
---|---|---|---|---|---|
udic | 28.3585 | 36.3580 | 44.0738 | western redcedar/queencup beadlily | 473 |
xeric | 23.4549 | 28.0894 | 37.2382 | grand fir/ninebark | 336 |
SMR | 10th | 50th | 90th | Habitat Type | n |
---|---|---|---|---|---|
udic | 134.9688 | 156.9698 | 165.0735 | western redcedar/queencup beadlily | 473 |
xeric | 148.2561 | 159.4336 | 183.6904 | grand fir/ninebark | 336 |
SMR | 10th | 50th | 90th | Habitat Type | n |
---|---|---|---|---|---|
udic | 0.4357 | 0.660 | 0.8441 | western redcedar/queencup beadlily | 473 |
xeric | 0.6018 | 0.722 | 0.8814 | grand fir/ninebark | 336 |
Habitat type : most commonly classified habitat type.
n : total number of observations.
A regression tree model (rpart()
) was trained using a
70:30 (train:test) data split. The model was then pruned according using
the lowest x-relative error. This produced a final model using MAP,HLI,
minimum annual temperature (tmin), maximum annual temperature (tmax),
and ETPOT as independent variables. The model was then validated using
the test set, and had a final accuracy of 0.75.
Cleaning data and developing the model
# clean data for covariates
mos2 <- filter(mos_esds, !is.na(hli))
mos2 <- filter(mos2, !is.na(txmstcl))
# reproducible
set.seed(9724)
# split data into 70:30 train and test set
datasplit <- sample(nrow(mos2), nrow(mos2) * 0.7)
train <- mos2[datasplit,]
test <- mos2[-datasplit,]
# run regression tree model
smr_model <- rpart(txmstcl ~ elev_ras + etpot + map_ras + hli + slp_fld + tmax_ras + tmin_ras,
data = train,
method = "class")
# prune tree based on lowest X-Relative Error
smr_prun <- prune(smr_model, cp = 0.010593)
prp(smr_prun, branch = .8, yesno = 2, box.palette = "BuRd", type = 2, extra = 100, branch.lty = 2)
Validating the model
# validate model using test dataset
pred <- predict(smr_prun, newdata = test, type = "class")
cm <- confusionMatrix(pred, test$txmstcl)
Accuracy | Kappa | Sensitivity | Specificity | Precision |
---|---|---|---|---|
0.75 | 0.48 | 0.81 | 0.66 | 0.77 |
Next, a random forest model (randomForest()
) was trained
using the same data split as above. Predictor variables were chosen
using the Boruta()
algorithm, and model parameters were
chosen using the caret::train()
function. The final
independent variables were elevation, ETPOT, MAP, HLI, tmax, and tmin.
The model was then validated using the test set, and had a final
accuracy of 0.79.
Developing a Random Forest Model
# run the Boruta algorithm
library(Boruta)
xvars <- c('elev_ras','etpot','map_ras','hli','slp_fld','tmax_ras','tmin_ras')
smr_bor <- Boruta(y = train$txmstcl, x = train[, xvars], maxRuns = 35, doTrace = 1)
# plot variable importance and selected features
# make ggplot
# data transformations
bxbr <- as.data.frame(smr_bor[["ImpHistory"]])
bxbr2 <- melt(bxbr)
t <- as.data.frame(smr_bor[["finalDecision"]])
t$variable <- rownames(t)
names(t)[1] <- "decision"
bxbr2 <- merge(bxbr2, t)
bxbr2$variable <- with(bxbr2, reorder(variable,value, median))
ggplot(data = bxbr2, aes(x = variable, y = value, fill = decision)) +
geom_boxplot() +
ylab(label = "Importance") +
xlab(label = "Attributes")
# fit model, automatically tune, and cross-validate on Boruta variables
# extract the selected feature variables
vars <- getSelectedAttributes(smr_bor)
# automate model parameters
tc <- trainControl(
method = "cv", number = 10, returnResamp = "all",
savePredictions = TRUE,
search = "grid",
verboseIter = FALSE
)
smr_tr_bor <- train(y = train$txmstcl, x = train[, vars],
method = "ranger",
importance = "permutation",
trControl = tc)
# train model with parameters from smr_tr_bor$finalModel
library(ranger)
yvar <- "txmstcl"
smr_rf <- ranger(txmstcl ~ .,
data = train[c(yvar,vars)],
min.node.size = 1,
num.trees = 500,
mtry = 2,
splitrule = "extratrees")
# validate
pred <- predict(smr_rf, data = test, type = "response")
test$predictions <- pred[["predictions"]]
cm2 <- confusionMatrix(test$predictions, test$txmstcl)
Accuracy | Kappa | Sensitivity | Specificity | Precision |
---|---|---|---|---|
0.79 | 0.56 | 0.81 | 0.66 | 0.77 |
Although the foundations of soil survey were laid with descriptive science, advancements in technology and science- as well as changing stakeholder needs- have made it necessary for soil survey to adopt a more quantitative approach. As part of this endeavor, it is becoming increasingly important to analyze our previous, tacit understandings of soil-landscape relationships. In other words, taking the soil scientist’s ‘mental-model’ of the landscape and quantifying it using statistical techniques. I have shown here that quantifying the mental-model of soil moisture regimes (SMR) is not only possible with statistical methods, but a viable option as well given the small amount of environmental covariates used.
Comparing the distributions of mean annual precipitation (MAP), potential evapotranspiration (ETPOT), and heat load index (HLI) across my data set of 809 observations (Fig 2), I found that all distributions overlapped to some extent. However, xeric soil moisture regime tended to have lower median values for MAP and higher median values for HLI (Tables 1 & 3). Interestingly, the distribution of xeric ETPOT was bimodal (Fig 2), but still trended toward higher values. The overlapping distributions show that individual environmental variables are not indicative of SMR by themselves, and specific site conditions, such as slope and elevation, likely modulate the importance of these variables. Therefore, I developed two tree-based models which utilized the predictive capacity of multiple environmental covariates.
The first model trained was a regression tree model using the
rpart
package. The second model trained was a random forest
model using the randomForest
package. Both models performed
well, with the random forest model performing slightly better with a
validation accuracy of 79% (Table 5). The final covariates in each model
were similar, with each model choosing ETPOT, MAP, HLI, tmax and tmin
(Figs 4 & 5). For each model, MAP scored as the most important
variable. Conversely, the random forest model had HLI as the second most
important variable, whereas the regression tree model had it as the
least important. The random forest model seems to follow the logic
proposed in the mental-model, where precipitation modulated by the
effects of aspect and slope exert the greatest control over SMR.
However, the random forest model is a ‘black-box’, and one can not
visualize the hundreds of decision trees within the model. For this
reason, the output of the regression tree model may be more
user-friendly (Fig 3).
In summary, the traditional mental-model for SMR depends heavily on the habitat type present at the site. This model has worked well because vegetation is highly responsive to the same environmental conditions that influence SMR- such as precipitation, slope, and aspect. For this project I accurately classified SMR based on a set of readily available environmental covariates using tree-based models. The development of these models will allow for the classification of SMR in the absence of vegetation data. However, the accuracy of these models depends on the accuracy of the point data SMR classifications. Since the SMR has been estimated based on the habitat type, variation in describers’ plant knowledge and ocular assessments of species abundances may influence the accuracy of the SMR classification. This further supports the need for objective classification techniques. Therefore, future projects could improve upon these models by using data classified from soil moisture sensors.
Conrad, O., Bechtel, B., Bock, M., Dietrich, H., Fischer, E., Gerlitz, L., Wehberg, J., Wichmann, V., and Böhner, J. (2015): System for Automated Geoscientific Analyses (SAGA) v. 2.1.4, Geosci. Model Dev., 8, 1991-2007, doi:10.5194/gmd-8-1991-2015.
Hargreaves, G.H., Samani, Z.A. (1985): Reference crop evapotranspiration from ambient air temperatures.. Paper presented in ASAE Regional Meeting, Grand Junction, Colorado. http://cagesun.nmsu.edu/~zsamani/papers/Hargreaves_Samani_85.pdf.