Introduction

An Example

Consider this situation: you have a collection of pedons that have been correlated to a named soil series (or component) and would like to objectively compute a range in characteristics (“low-rv-high” values) and horizon depths. As with most collections of pedon data, there may be considerable variation in description style and horizons used, horizon depths, and number of horizons described:

alt text

In this scenario, there are several obvious “micro-correlation” decisions that need to be made before horizons can be grouped for aggregation. For example, what horizonation prototype scheme (e.g., A-Bt1-Bt2-Bt3-Cr-R) best conveys the concept of this soil series or soil component? Does it make sense to group [Bt3, Bt4, BCt, CBt] horizons for aggregation? Or what about grouping [Bt3, 2Bt3] horizons? Do [BA, AB] horizons occur frequently enough to be included in the horizonation prototype?

Based on your knowledge of the area, pedon 2 might be a good “typical” pedon to use in developing a horizonation prototype. After careful review of the data and consultation with your crew, a new set of labels are assigned to each horizon (red labels in figure above) that define groups over which soil properties will be aggregated. These new labels define functionally similar groups that may span multiple genetic horizons.

Formalized Approach

This document describes an aggregation strategy based on the use of “generalized horizon labels” (GHL) through a combination of narrative, R code, and figures. You can follow along in an R session by copying code from the grey boxes and pasting it into the R console.

Here is a basic outline of the process:

  1. Select a set of GHL that best represents a group of pedons to be aggregated. This could be based on series descriptions, expert knowledge, or even inspection of the most frequently occurring horizon designations.

  2. Assign GHL to each horizon using whatever information is available for grouping horizons. This micro-correlation of horizon designations will likely require slightly different “rules” in each instance. Careful inspection of horizon designation and observed properties is critical.

  3. Evaluate GHL assignments and manually refine as needed.

  4. Keep track of final GHL assignments in NASIS or text file.

  5. Estimate a most likely horizonation e.g., top and bottom depths for each generalized horizon label.

  6. Compute range in characteristics, aka low-rv-high values, for clay, sand, pH, etc. over GHL. (next document in this series)

Setup R Envionment

If you have never used the aqp or soildb packages before, you will likely need to install them. This only needs to be done once.

# stable version from CRAN + dependencies
install.packages('ape', dep=TRUE) 
install.packages('latticeExtra', dep=TRUE)
install.packages('plyr', dep=TRUE) 
install.packages('aqp', dep=TRUE) 
install.packages('soilDB', dep=TRUE)
install.packages('sharpshootR', dep=TRUE)

Once you have all of the R packages on which this document depends, it is a good idea to load them. R packages must be installed anytime you change versions of R (e.g., after an upgrade) and loaded anytime you want to access functions from within those packages.

library(aqp)
library(soilDB)
library(sharpshootR)
library(igraph)
library(ape)
library(latticeExtra)
library(lattice)
library(cluster)
library(MASS)

Sample Data

While the methods outlined in this document can be applied to any collection of pedons, it is convenient to work with a standardized set of data. You can follow along with the analysis by copying code from the following blocks and running it in your R session. The sample data used in this document is based on 15 soil profiles that have been correlated to the Loafercreek soil series from the Sierra Nevada Foothill Region of California. Note that the internal structure of the loafercreek data is identical to the structure returned by fetchNASIS() from the soilDB package. All horizon-level values are pulled from the pedon horizon table of the pedons being analyzed.

# load sample data from the soilDB package
data(loafercreek, package = 'soilDB')
# keep only the first n pedons
pedons <- loafercreek[1:20, ]
# tighter margins
par(mar = c(0, 0, 0, 1))
# profile sketches with customization of style
plotSPC(pedons, print.id = FALSE, max.depth = 150, cex.names = 0.66, width = 0.35, name.style = 'center-center', depth.axis = list(style = 'compact', line = -2))

15 pedons correlated to the Loafercreek soil series.


Optional: Follow Along with Your Data

The following code block demonstrates how to pull data in using the fetchNASIS() function from the soilDB package.

# first load the desired data set within NASIS into your NASIS selected set
# then load data from the NASIS selected set into R
pedons <- fetchNASIS(from='pedons')
# optionally subset the data by taxon name - enter your taxon name
pedons <- pedons[grep(pattern='ENTER_YOUR_TAXON_NAME', pedons$taxonname, ignore.case=TRUE), ]

Methods

Selection of Generalized Horizon Labels

Generalized horizon labels represent an expert-guided selection of designations that were consistently observed in the field and are meaningful in terms of soil morphology and management. The very first step in this process is to tabulate the number of times each horizon designation occurs.

sort(table(pedons$hzname), decreasing=TRUE)
## 
##    A  Bt1  Bt2   BA   Cr    R  Bt3   Bw 2Bt2   A1   A2  BAt  BCt   Bt 2Bt1 2Bt3   2R  Bt4    C  2BC 
##   16   15   14   10    9    8    7    4    3    3    3    3    3    3    2    2    2    2    2    1 
## 2BCt  2Cr 2Crt   BC  Bw1  Bw2  Crt   H1   Oi 
##    1    1    1    1    1    1    1    1    1

In this case, [A, Bt1, Bt2, BA] horizon designations appear to be a good starting point. However, relying on field experience and expert knowledge of these soils, cross-checking against the OSD, and talking with other soil scientists may be more important than the previous tabulation.

Constructing a graph of transitions from one horizon to the next (“transition probabilities”) can highlight those horizon designations that are most commonly used together. A follow-up tutorial on transition probability matrix interpretation is planned.

tp <- hzTransitionProbabilities(pedons, 'hzname')
par(mar = c(1, 1, 1, 1))
plotSoilRelationGraph(tp, graph.mode = 'directed', edge.arrow.size = 0.5, edge.scaling.factor = 2, vertex.label.cex = 0.75, vertex.label.family = 'sans')

Graph constructed from transition probabilities.


A quick summary of horizon depth mid-points (e.g., average depth of horizon) can help in organizing the various designations and possibly give some clues as to how they can be grouped. The following plot is called a box and whisker plot.

# compute horizon mid-points
pedons$mid <- with(horizons(pedons), (hzdept + hzdepb) / 2)

# sort horizon designation by group-wise median values
hz.designation.by.median.depths <- names(sort(tapply(pedons$mid, pedons$hzname, median)))

# plot the distribution of horizon mid-points by designation
bwplot(mid ~ factor(hzname, levels=hz.designation.by.median.depths), 
       data=horizons(pedons), 
       ylim=c(155, -5), ylab='Horizon Mid-Point Depth (cm)', 
       scales=list(y=list(tick.number=10)), 
       panel=function(...) {
  panel.abline(h=seq(0, 140, by=10), v=1:length(hz.designation.by.median.depths), col=grey(0.8), lty=3)
    panel.bwplot(...)
})

Range in horizon depth mid-point for original horizon designations.


Next, a similar summary of soil properties (clay content and total rock fragment volume) is presented. The goal is to determine which horizons designations can be grouped and which generalized horizon labels to assign to each group.

# box and wisker plot by clay content
bwplot(clay ~ factor(hzname, levels=hz.designation.by.median.depths), 
       data=horizons(pedons), 
       ylab='Clay Content (%)', 
       scales=list(y=list(tick.number=10)), 
       panel=function(...) {
  panel.abline(h=seq(0, 100, by=5), v=1:length(hz.designation.by.median.depths), col=grey(0.8), lty=3)
  panel.bwplot(...)
})

Range in clay content for original horizon designations.


# box and wisker plot by total rock fragment volume
bwplot(total_frags_pct ~ factor(hzname, levels=hz.designation.by.median.depths), 
       data=horizons(pedons), 
       ylab='Total Rock Fragment Volume (%)', 
       scales=list(y=list(tick.number=10)), 
       panel=function(...) {
  panel.abline(h=seq(0, 100, by=10), v=1:length(hz.designation.by.median.depths), col=grey(0.8), lty=3)
  panel.bwplot(...)
})

Range in total rock fragment content for original horizon designations.


Sometimes looking at thematic soil profile sketches can be informative.

# color horizons by clay content
par(mar = c(0, 0, 3, 1))
plotSPC(pedons, print.id = FALSE, max.depth = 150, cex.names = 0.66, width = 0.35, name.style = 'center-center', depth.axis = list(style = 'compact', line = -2), color='clay')

Horizon colors are based on clay content.


Assignment of Generalized Horizon Labels

Once a set of generalized horizon labels have been determined, a corresponding set of regular expression (REGEX) rules are developed to convert field-described designations into GHL. Pattern matching with REGEX will typically assign useful GHL; however, there will always be cases where manual intervention is required. More on that will follow.

From the above analysis and the OSD, it seems like the following sequence of GHL is appropriate: (A, Bt1, Bt2, Bt3, Cr, R) – an A horizon, followed by 3 Bt horizons, then Cr, and finally R. For each GHL, a corresponding REGEX rule is needed. For example, '^A$|Ad|Ap' will match ‘A’, ‘Ad’, and ‘Ap’.

# save our GHL
n <- c('A','Bt1','Bt2','Bt3','Cr','R')
# REGEX rules
p <- c('^A$|Ad|Ap|A1|A2',
       'Bt1$|^Bt$',
       '^Bt2$',
       '^Bt3|^Bt4|CBt$|BCt$|2Bt|2CB$|^C$|2BC',
       'Cr',
       'R')

Apply GHL pattern-matching rules, save to a new column called genhz, and cross-tabulate the occurrence of GHL and original designations.

pedons$genhz <- generalize.hz(pedons$hzname, n, p)
# cross-tabulate original horizon designations and GHL
tab <- table(pedons$genhz, pedons$hzname)
addmargins(tab)
##           
##            2BC 2BCt 2Bt1 2Bt2 2Bt3 2Cr 2Crt  2R   A  A1  A2  BA BAt  BC BCt  Bt Bt1 Bt2 Bt3 Bt4  Bw
##   A          0    0    0    0    0   0    0   0  16   3   3   0   0   0   0   0   0   0   0   0   0
##   Bt1        0    0    0    0    0   0    0   0   0   0   0   0   0   0   0   3  15   0   0   0   0
##   Bt2        0    0    0    0    0   0    0   0   0   0   0   0   0   0   0   0   0  14   0   0   0
##   Bt3        1    1    2    3    2   0    0   0   0   0   0   0   0   0   3   0   0   0   7   2   0
##   Cr         0    0    0    0    0   1    1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   R          0    0    0    0    0   0    0   2   0   0   0   0   0   0   0   0   0   0   0   0   0
##   not-used   0    0    0    0    0   0    0   0   0   0   0  10   3   1   0   0   0   0   0   0   4
##   Sum        1    1    2    3    2   1    1   2  16   3   3  10   3   1   3   3  15  14   7   2   4
##           
##            Bw1 Bw2   C  Cr Crt  H1  Oi   R Sum
##   A          0   0   0   0   0   0   0   0  22
##   Bt1        0   0   0   0   0   0   0   0  18
##   Bt2        0   0   0   0   0   0   0   0  14
##   Bt3        0   0   2   0   0   0   0   0  23
##   Cr         0   0   0   9   1   0   0   0  12
##   R          0   0   0   0   0   0   0   8  10
##   not-used   1   1   0   0   0   1   1   0  22
##   Sum        1   1   2   9   1   1   1   8 121

In the above cross-tabulation, we can see that a couple of original designations were not matched (not-used in the table) by our REGEX rules, namely, BA, Bw, and Oi horizons. For this example, we are going to assume that those horizons are not common enough for inclusion in our set of GHL.

It is also possible to display GHL assignment as a network graph.

# convert contingency table -> adj. matrix
m <- genhzTableToAdjMat(tab)
# plot using a function from the sharpshootR package
par(mar = c(1,1,1,1))
plotSoilRelationGraph(m, graph.mode = 'directed', edge.arrow.size = 0.5)

Evaluation of Generalized Horizon Labels

An initial evaluation of our GHL assignment can be accomplished by plotting profile sketches with horizons colored by GHL. The result looks correct, but further investigation may be warranted.

# make a palette of colors, last color is for not-used class
cols <- c(grey(0.33), 'orange', 'orangered', 'chocolate', 'green', 'blue', 'yellow')

# assign a color to each generalized horizon label
hz.names <- levels(pedons$genhz)
pedons$genhz.soil_color <- cols[match(pedons$genhz, hz.names)]

# plot generalized horizons via color and add a legend
par(mar = c(2,1,0,2))
plotSPC(pedons, print.id = FALSE, max.depth = 150, cex.names = 0.66, width = 0.35, name.style = 'center-center', depth.axis = list(style = 'compact', line = -2), color = 'genhz.soil_color')
legend('bottomleft', legend=hz.names, pt.bg=c(cols), pch=22, bty='n', cex=1)

Horizon colors are based on assigned GHL.


A box and whisker plot of the depth-ranges for each of the generalized horizon labels helps to visualize the degree of overlap. Slicing the collection of profiles along 1-cm intervals generates a more precise figure.

# slice profile collection from 0-150 cm
s <- dice(pedons, 0:150 ~ genhz)
# convert horizon name back to factor, using original levels
s$genhz <- factor(s$genhz, levels = hz.names)
# plot depth-ranges of generalized horizon slices
bwplot(hzdept ~ genhz, data=horizons(s), 
       ylim=c(155, -5), ylab='Generalized Horizon Depth (cm)', 
       scales=list(y=list(tick.number=10)), asp=1, 
       panel=function(...) {
          panel.abline(h=seq(0, 140, by=10), v=1:length(hz.names),col=grey(0.8), lty=3)
            panel.bwplot(...)
          }
       )

Range in GHL horizon depth.


Advanced: Multivariate Soil Property Summary

The evaluation of generalized horizon labels typically requires a review of more information than field-described horizon labels and depth. For example, clay content, sand content, pH, total rock fragment volume, and horizon mid-points are soil properties that can be used to differentiate horizons– as long as the data are populated. In this document, clay content, total rock fragment volume, and horizon mid-points are used. Multivariate comparisons are commonly based on the concept of pair-wise dissimilarity and subsequent reduction of the resulting distance matrix into a simpler representation. Non-metric multidimensional scaling (nMDS) is one method for reducing the distance matrix into a new set of coordinates in two-dimensional space. A related document explores this idea further.

# store the column names of our variables of interest
vars <- c('clay', 'mid', 'total_frags_pct')

# result is a list of several items
hz.eval <- evalGenHZ(pedons, 'genhz', vars)

# extract MDS coords
pedons$mds.1 <- hz.eval$horizons$mds.1
pedons$mds.2 <- hz.eval$horizons$mds.2

# extract silhouette widths and neighbor
pedons$sil.width <- hz.eval$horizons$sil.width
pedons$neighbor <- hz.eval$horizons$neighbor

In the following figure, generalized horizon labels are plotted at nMDS coordinates as colored circles, along with original horizon designations and pedon IDs. Ideally, groupings of GHL should form relatively homogeneous clusters in this figure; however, there will always be some overlap at the edges. Heterogeneous regions should be inspected in cases where the assigned GHL may not make sense. In this example, the original designation of “Bt3” for pedon ID 09CKS036 does not seem to fit into the “Bt3” generalized horizon. According to the soil properties used (clay content, total rock fragments, horizon mid-point), the horizon may better fit the “Bt2” generalized horizon. In addition, it appears that “Bw” and “BA” horizons can be included into the “A” generalized horizon. Decisions on issues such as this can only be made based on field experience or at least some level of expert knowledge about the soils in question. Once a decision has been made, additions to the GHL rules (e.g. adding “Bw” and “BA” to the “A” GHL) and possibly manual adjustment can be used to accomodate the changes.

# convert pedons to a data.frame
pedons.df <- as(pedons, 'data.frame')
# plot generalized horizon labels at MDS coordinates
mdsplot <- xyplot(mds.2 ~ mds.1, groups=genhz, data=pedons.df, 
                  xlab='', ylab='', aspect=1,
                  scales=list(draw=FALSE), 
                  auto.key=list(columns=length(levels(pedons.df$genhz))), 
                  par.settings=list(
                    superpose.symbol=list(pch=16, cex=3, alpha=0.5)
                  )
)

# annotate with original hzname and pedon ID
mdsplot +
  layer(panel.abline(h=0, v=0, col='grey', lty=3)) + 
  layer(panel.text(pedons.df$mds.1, pedons.df$mds.2, pedons.df$hzname, cex=0.85, font=2, pos=3)) +
  layer(panel.text(pedons.df$mds.1, pedons.df$mds.2, pedons.df$pedon_id, cex=0.55, font=1, pos=1))

Horizon designations and GHL as plotted along nMDS axes.


The silhouette width metric is a useful indicator of how consistently group labels partition observations in a dissimilarity matrix. Silhouette widths closer to 1 indicate excellent partitioning, while values closer to 0 indicate less effective partitioning. Silhouette widths less than 0 are often associated with inappropriate label assignment. A sketch of profiles with horizons colored by silhouette width can indicate possible horizons with inappropriate GHL assignment or a pedon that does not fit within the concept of Loafercreek. For example, many of the horizons associated with pedon ID 09CKS036 do not appear to fit within the collection of soil properties associated with generalized horizon labels “Bt1”, “Bt2”, and “Bt3”. Keep in mind that this evaluation is only based on three soil properties– expert judgement will be required for a final assignment of GHL.

# plot silhouette width metric
par(mar = c(0, 0, 3, 1))
plot(pedons, label='pedon_id', cex.id = 0.5, max.depth = 150, cex.names = 0.66, width = 0.35, name.style = 'center-center', depth.axis = list(style = 'compact', line = -2), color='sil.width')

Blue and green colors warrant further investigation.


The following table contains information on those horizons with silhouette widths less than 0. The “neighbor” column contains the next closest GHL (in terms of the soil properties used in the pair-wise dissimilarity calculation), which may be more appropriate to use.

# index those horizons with silhouette widths less than 0
check.idx <- which(pedons.df$sil.width < 0)
# sort this index based on min sil.width
check.idx.sorted <- check.idx[order(pedons.df$sil.width[check.idx])]
# list those pedons/horizons that may need some further investigation
pedons.df[check.idx.sorted, c('peiid', 'pedon_id', 'hzname', 'genhz', 'neighbor', 'sil.width', vars)]
##      peiid      pedon_id hzname genhz neighbor    sil.width clay  mid total_frags_pct
## 5   115595          300J    Bt2   Bt2      Bt3 -0.304496846   35 62.5               0
## 72  338022      07RJV089    Bt2   Bt2      Bt1 -0.260129476   25 38.0              10
## 65  338021      07RJV086    Bt2   Bt2      Bt1 -0.250695590   22 29.0              30
## 60  307963      07SKC014     Bt   Bt1      Bt3 -0.243816842   29 66.0               8
## 85  338024      07RJV105    Bt1   Bt1        A -0.232022190   20 24.0               5
## 86  338024      07RJV105    Bt2   Bt2      Bt1 -0.226833882   20 38.0              25
## 66  338021      07RJV086    Bt3   Bt3      Bt2 -0.206187598   25 44.5              35
## 16  115827 S2000CA007012    Bt1   Bt1        A -0.196307412   20 22.5              10
## 55  307961      07SKC016    Bt2   Bt2      Bt3 -0.191380187   32 54.5              15
## 37  268820      07SKC003    Bt2   Bt2      Bt1 -0.174281799   32 28.0               5
## 10  115819 S2000CA007009    Bt2   Bt2      Bt1 -0.163204476   29 39.5               5
## 112 338028      07RJV115    Bt2   Bt2      Bt3 -0.156297021   32 43.0               5
## 73  338022      07RJV089    Bt3   Bt3      Bt2 -0.140498159   25 52.0              25
## 118 338029      07RJV119   2Bt1   Bt3      Bt1 -0.127800749   32 32.0              10
## 71  338022      07RJV089    Bt1   Bt1        A -0.121385623   22 24.0               5
## 105 338027      07RJV111    Bt2   Bt2      Bt1 -0.109751782   32 35.5               5
## 120 338029      07RJV119    2BC   Bt3      Bt2 -0.097935655   30 57.0              40
## 36  268820      07SKC003    Bt1   Bt1        A -0.090766960   19 16.5              24
## 27  207242 S2007CA109005    BCt   Bt3      Bt2 -0.090679589   16 96.5              45
## 9   115819 S2000CA007009    Bt1   Bt1        A -0.085728747   25 19.0               5
## 74  338022      07RJV089    Bt4   Bt3      Bt2 -0.085165952   30 66.0              45
## 17  115827 S2000CA007012    Bt2   Bt2      Bt1 -0.056866078   28 44.0              25
## 39  268820      07SKC003   2BCt   Bt3      Bt2 -0.042981244   28 56.0              23
## 104 338027      07RJV111    Bt1   Bt1      Bt2 -0.037714883   30 20.0              50
## 26  207242 S2007CA109005    Bt2   Bt2      Bt3 -0.033069807   18 89.0              20
## 111 338028      07RJV115    Bt1   Bt1      Bt2 -0.024421743   25 25.5              50
## 80  338023      07RJV096    Bt2   Bt2      Bt1 -0.013245810   25 40.5              50
## 25  207242 S2007CA109005    Bt1   Bt1      Bt2 -0.007499182   18 66.0              15

The folowing table contains mean and standard deviations (values in parenthesis) of soil properties (clay content, total rock fragments, horizon mid-point) and silhouette widths summarized by GHL. Generalized horizons with the largest silhouette widths are the most internally consistent, while those with silhouette widths close to 0 are the least consistent.

hz.eval$stats
##      genhz         clay           mid total_frags_pct    sil.width
## 1        A 16.19 (3.08)   3.91 (2.35)     7.73 (7.67)  0.59 (0.16)
## 2      Bt1    24.38 (4)  29.39 (15.3)   15.89 (15.48) -0.01 (0.14)
## 3      Bt2 28.36 (5.51) 43.75 (15.96)    22.64 (21.1) -0.13 (0.11)
## 4      Bt3 34.67 (9.84)  54.87 (13.3)   22.17 (26.89)  0.08 (0.16)
## 5       Cr     NaN (NA) 86.96 (21.86)           0 (0)     NaN (NA)
## 6        R     NaN (NA) 131.75 (9.41)           0 (0)     NaN (NA)
## 7 not-used  19.7 (3.57) 16.75 (13.06)   13.18 (18.12)     NaN (NA)

Plot profiles with only those horizons with silhouette widths less than 0 flagged.

# add a column containing a color (red) that flags horizons with silhouette width less than 0
pedons$sil.flag <- ifelse(pedons$sil.width < 0, 'red', 'white')

par(mar = c(0, 0, 1, 1))
plotSPC(pedons, label = 'pedon_id', cex.id = 0.5, max.depth = 150, cex.names = 0.66, width = 0.35, name.style = 'center-center', depth.axis = list(style = 'compact', line = -2), color='sil.flag')

Horizons with red colors warrant further investigation.


Saving GHL to NASIS

GHL assignment as described thus far applies only to in-memory objects in the current R session. In order to use these GHL in further analysis or perform manual adjustments, the GHL need to be saved externally. If you are a NASIS user (working with data derived from NASIS) then the following code will create a text file that can be read by NASIS and stored in the dspcomplayerid field of the pedon horizon table. The “Update horizon group aggregations” calculation (“Calculations” -> “Pedon Horizon” -> “Update horizon group aggregations”) expects a text file at the C:/data/horizon_agg.txt location on your system, containing phiid and generalized horizon labels.

Note that some people prefer to adjust GHL assignments in R while others prefer to make adjustments after loading the data into NASIS.

# clear-out any existing files
rules.file <- 'C:/data/horizon_agg.txt'
write.table(data.frame(), file=rules.file, row.names=FALSE, quote=FALSE, na='', col.names=FALSE, sep='|')
# extract horizon data
h <- horizons(pedons)
# strip-out 'not-used' genhz labels and retain horizon ID and genhz assignment
h <- h[which(h$genhz != 'not-used'), c('phiid', 'genhz')]
# append to NASIS import file
write.table(h, file=rules.file, row.names=FALSE, quote=FALSE, na='', col.names=FALSE, sep='|', append=TRUE)

Concluding Remarks

In the next document, we will work with pedon data from NASIS and use our GHL to generate estimates of “low-rv-high” style range in characteristics. An alternative approach to assignment of GHL can be performed with the “Update comp layer id by hzname (horizon group Aggregation)” calculation. This calculation, however, performs only basic pattern matching and likely requires considerable manual intervention.


This document is based on aqp version 2.0 and soilDB version 2.7.8.