D.E. Beaudette, J.M. Skovlin

2018-01-26

This document is based on `aqp`

version 1.15.3 and `soilDB`

version 2.0-1.

Consider the situation: you have a collection of pedons that have been correlated to a named soil series (or component) and would like to *objectively* compute a range in characteristics (“low-rv-high” values) and horizon depths. As with most collections of pedon data there may be considerable variation in description style and horizons used, horizon depths, and number of horizons described:

In this contrived example, there are several obvious “micro-correlation” decisions that need to be made before horizons can be grouped for aggregation. For example, what horizonation prototype scheme will be used (e.g. A-Bt1-Bt2-Bt3-Cr-R) and best conveys the concept of this soil series or soil component? Does it make sense to group {Bt3, Bt4, BCt, CBt} horizons for aggregation? Along those lines, what about grouping {Bt3, 2Bt3} horizons? Do {BA, AB} horizons occur frequently enough to be included in the horizonation prototype?

Based on your knowledge of the area pedon 2 might be a good “typical” pedon from which a horizonation prototype can be developed. After careful review of the data and consultation with your crew, a new set of labels are assigned to each horizon (red labels in figure above) that define groups over which soil properties will be aggregated. These new labels define *functionally-similar* groups that may span multiple genetic horizons.

If you have never used the aqp or soildb packages before, you will likely need to install them. This only needs to be done once.

```
# stable version from CRAN + dependencies
install.packages('ape', dep=TRUE)
install.packages('latticeExtra', dep=TRUE)
install.packages('plyr', dep=TRUE)
install.packages('aqp', dep=TRUE)
install.packages('soilDB', dep=TRUE)
```

Now that you have all of the R packages that this document depends on, it would be a good idea to load them. R packages must be **installed** anytime you change versions of R (e.g. after an upgrade), and **loaded** anytime you want to access functions from within those packages.

```
library(aqp)
library(soilDB)
library(ape)
library(latticeExtra)
library(plyr)
library(lattice)
library(cluster)
library(MASS)
```

While the methods outlined in this document can be applied to any collection of pedons, it is convenient to work with a standardized set of data. You can follow along with the analysis by copying code from the following blocks and running it in your **R** session. The sample data used in this document is based on 30 soil profiles that have been correlated to the Loafercreek soil series from the Sierra Nevada Foothill Region of California. Note that the internal structure of the `loafercreek`

data is identical to the structure returned by `fetchNASIS()`

from the soilDB package. All horizon-level values are pulled from the pedon horizon table of the pedons being analyzed.

```
# load sample data from the soilDB package
data(loafercreek, package = 'soilDB')
# keep only the first 30 pedons
pedons <- loafercreek[1:30, ]
# plot profile sketches
par(mar=c(0,0,0,0))
plot(pedons, name='hzname', print.id=FALSE, cex.names=0.8, axis.line.offset=-4)
```

The following code block demonstrates how to pull data in using the `fetchNASIS()`

function from the soilDB package.

```
# first load the desired data set within NASIS into your NASIS selected set
# then load data from the NASIS selected set into R
pedons <- fetchNASIS(from='pedons')
# optionally subset the data by taxon name - enter your taxon name
pedons <- pedons[grep(pattern='ENTER_YOUR_TAXON_NAME', f$taxonname, ignore.case=TRUE), ]
```

Quantiles (also called percentiles) are a convenient way to express where a number lies within a distribution. For example, the 5th-percentile of a set of numbers is the value that splits the lowest 5% of the data from the rest. The 50th-percentile, commonly referred to as the median, splits the data exactly in half and is a good alternative to the mean for summarizing central tendency. The 25th and 75th percentiles or 5th and 95th percentiles form useful brackets around the spread around the median. A box and whisker plot uses quantiles to display the central tendency, spread, and balance of a distribution. Unlike the mean and standard deviation, quantiles offer a relatively robust (outliers, distribution shape, skewness, etc.) definition of central tendency (RV) and spread (LOW, HIGH).

The following R code demonstrates the relationship between frequency distribution, quantiles, and box-whisker plot for a set of 500 normally-distributed values with a mean of 10 and standard deviation of 2.

```
# simulate 500 values from the normal distribution: with mean = 10, sd= 2
set.seed(1010101)
x <- rnorm(n=500, mean=10, sd=2)
# compute the 5th, 25th, 50th, 75th, and 95th percentiles of x
q <- quantile(x, probs=c(0.05, 0.25, 0.5, 0.75, 0.95))
# plot a smoothed frequency distribution
plot(density(x), main='Quantile Demonstration', ylim=c(0, 0.3), ylab='', xlab='', axes=FALSE)
# mark quantiles we computed above
abline(v=q, lty=3, col='red')
text(x=q, y=0.1, labels=c('5th', '25th', '50th', '75th', '95th'))
# overlay a box-whisker plot
boxplot(x, at=0.25, add=TRUE, horizontal=TRUE, boxwex=0.1, border='DarkBlue', axes=FALSE)
# overlay lines at the original values
rug(x, side=3, col='DarkBlue')
# add x-axis
axis(side=1, at=pretty(x))
```

Once a set of generalized horizon labels have been determined a corresponding set of regular expression (REGEX) rules were developed to convert field-described designations into GHL. Pattern matching with REGEX will typically assign useful GHL, however, there will always be cases where manual intervention is required. More on that later.

From the above analysis and the OSD, it seems like the following sequence of GHL are appropriate: (A, Bt1, Bt2, Bt3, Cr, R)– an A horizon, followed by 3 Bt horizons, then Cr and finally R. For each GHL we need a corresponding REGEX rule. For example, `'^A$|Ad|Ap'`

will match ‘A’, ‘Ad’, and ‘Ap’.

```
# save our GHL
n <- c('A','Bt1','Bt2','Bt3','Cr','R')
# REGEX rules
p <- c('^A$|Ad|Ap',
'Bt1$',
'^Bt2$',
'^Bt3|^Bt4|CBt$|BCt$|2Bt|2CB$|^C$',
'Cr',
'R')
```

Apply GHL pattern-matching rules and save to a new column called `genhz`

and cross-tabulate the occurrence of GHL and original designations.

```
pedons$genhz <- generalize.hz(pedons$hzname, n, p)
# cross-tabulate original horizon designations and GHL
addmargins(table(pedons$genhz, pedons$hzname))
```

```
##
## 2BCt 2Bt1 2Bt2 2Bt3 2Bt4 2Cr 2Crt 2R A A1 A2 BA BAt BC BCt Bt Bt1 Bt2 Bt3 Bt4
## A 0 0 0 0 0 0 0 0 27 0 0 0 0 0 0 0 0 0 0 0
## Bt1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 27 0 0 0
## Bt2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 26 0 0
## Bt3 1 1 2 3 1 0 0 0 0 0 0 0 0 0 6 0 0 0 17 4
## Cr 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## R 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## not-used 0 0 0 0 0 0 0 0 0 3 3 16 5 4 0 3 0 0 0 0
## Sum 1 1 2 3 1 2 1 1 27 3 3 16 5 4 6 3 27 26 17 4
##
## Bw Bw1 Bw2 C Cr Crt Oi R Sum
## A 0 0 0 0 0 0 0 0 27
## Bt1 0 0 0 0 0 0 0 0 27
## Bt2 0 0 0 0 0 0 0 0 26
## Bt3 0 0 0 4 0 0 0 0 39
## Cr 0 0 0 0 11 2 0 0 16
## R 0 0 0 0 0 0 0 19 20
## not-used 4 1 1 0 0 0 2 0 42
## Sum 4 1 1 4 11 2 2 19 197
```

From the above cross-tabulation, we can see that a couple of original designations were not matched (`not-used`

in the table) by our REGEX rules: BA, Bw, and Oi horizons. In this example, we are going to make the assumption that those horizons aren’t common enough for inclusion in our set of GHL.