Chapter 3 Exploratory Data Analysis

NASIS has an abundance of tabular data. These records capture a large amount of raw information about our soil series and map unit component concepts. In many cases, the data are not as clear to interpret or detailed as we would like. Before embarking on developing statistical models and generating predictions, it is essential to understand your data. This chapter will demonstrate ways to characterize a variety of different data types.

John Tukey ((Tukey 1977)) advocated the practice of exploratory data analysis (EDA) as a critical part of the scientific process:

“No catalog of techniques can convey a willingness to look for what can be seen, whether or not anticipated. Yet this is at the heart of exploratory data analysis. The graph paper and transparencies are there, not as a technique, but rather as a recognition that the picture examining eye is the best finder we have of the wholly unanticipated.”

Fortunately, we can dispense with the graph paper and use software that makes it easy to develop graphical output and descriptive statistics for our data.

3.1 Objectives (Exploratory Data Analysis)

This chapter will demonstrate the following:

  • Methods for estimating Low, RV, and High values
  • Methods for visualizing soil data
  • Methods for data transformation
  • Effects of weighting and missing values on summary statistics

3.2 Statistics

Descriptive statistics include:

  • Mean - arithmetic average
  • Weighted Mean - weighted arithmetic average
  • Median - middle value
  • Mode - most frequent value
  • Standard Deviation - variation around the mean
  • Interquartile Range - range encompasses 50% of the values
  • Kurtosis - peakedness of the data distribution
  • Skewness - symmetry of the data distribution

Graphical methods include:

  • Histogram - a bar plot where each bar represents the frequency of observations for a given range of values
  • Density estimation - an estimation of the frequency distribution based on the sample data
  • Quantile-quantile plot - a plot of the actual data values against a normal distribution
  • Box plots - a visual representation of median, quartiles, symmetry, skewness, and outliers
  • Scatter plots - a graphical display of one variable plotted on the x axis and another on the y axis
  • Radial plots - plots formatted for the representation of circular data

3.3 Data Inspection

Before you start an EDA, you should inspect your data and correct all typos and blatant errors. EDA can then be used to identify additional errors such as outliers and help you determine the appropriate statistical analyses. For this chapter we’ll use the loafercreek dataset from the CA630 Soil Survey Area.

library(dplyr)

# Load from the the loakercreek dataset
data("loafercreek", package = "soilDB") 

# Extract the horizon table
h <- aqp::horizons(loafercreek)

# Construct generalized horizon designations
n <- c("A",
       "BAt",
       "Bt1",
       "Bt2",
       "Cr",
       "R")
# REGEX rules
p <- c("A",
       "BA|AB",
       "Bt|Bw",
       "Bt3|Bt4|2B|C",
       "Cr",
       "R")

# Compute genhz labels and add to loafercreek dataset
h$genhz <- aqp::generalize.hz(h$hzname, n, p)

# Examine genhz vs hznames (wide format)
table(h$genhz, h$hzname)
##           
##            2BC 2BCt 2Bt1 2Bt2 2Bt3 2Bt4 2Bt5 2CB 2CBt 2Cr 2Crt 2R  A A1 A2 AB ABt Ad Ap  B BA BAt BC BCt Bt Bt1 Bt2
##   A          0    0    0    0    0    0    0   0    0   0    0  0 97  7  7  0   0  1  1  0  0   0  0   0  0   0   0
##   BAt        0    0    0    0    0    0    0   0    0   0    0  0  0  0  0  1   0  0  0  0 31   8  0   0  0   0   0
##   Bt1        0    0    0    0    0    0    0   0    0   0    0  0  0  0  0  0   2  0  0  0  0   0  0   0  8  93  88
##   Bt2        1    1    3    8    8    6    1   1    1   0    0  0  0  0  0  0   0  0  0  0  0   0  4  16  0   0   0
##   Cr         0    0    0    0    0    0    0   0    0   4    2  0  0  0  0  0   0  0  0  0  0   0  0   0  0   0   0
##   R          0    0    0    0    0    0    0   0    0   0    0  6  0  0  0  0   0  0  0  0  0   0  0   0  0   0   0
##   not-used   0    0    0    0    0    0    0   0    0   0    0  0  0  0  0  0   0  0  0  1  0   0  0   0  0   0   0
##           
##            Bt3 Bt4 Bw Bw1 Bw2 Bw3  C CBt Cd Cr Cr/R Crt H1 Oi  R Rt
##   A          0   0  0   0   0   0  0   0  0  0    0   0  0  0  0  0
##   BAt        0   0  0   0   0   0  0   0  0  0    0   0  0  0  0  0
##   Bt1        0   0 10   2   2   1  0   0  0  0    0   0  0  0  0  0
##   Bt2       47   8  0   0   0   0  6   6  1  0    0   0  0  0  0  0
##   Cr         0   0  0   0   0   0  0   0  0 49    0  20  0  0  0  0
##   R          0   0  0   0   0   0  0   0  0  0    1   0  0  0 40  1
##   not-used   0   0  0   0   0   0  0   0  0  0    0   0  1 24  0  0
# Examine matching pairs (long format)
h %>% group_by(genhz, hzname) %>% count()
## # A tibble: 43 × 3
## # Groups:   genhz, hzname [43]
##    genhz hzname     n
##    <fct> <chr>  <int>
##  1 A     A         97
##  2 A     A1         7
##  3 A     A2         7
##  4 A     Ad         1
##  5 A     Ap         1
##  6 BAt   AB         1
##  7 BAt   BA        31
##  8 BAt   BAt        8
##  9 Bt1   ABt        2
## 10 Bt1   Bt         8
## # ℹ 33 more rows

As noted in Chapter 1, a visual examination of the raw data is possible by clicking on the dataset in the environment tab, or via commandline:

View(h) 

This view is fine for a small dataset, but can be cumbersome for larger ones. The summary() function can be used to quickly summarize a dataset however, even for our small example dataset, the output can be voluminous. Therefore in the interest of saving space we’ll only look at a sample of columns.

h %>%
  select(genhz, claytotest, total_frags_pct, phfield, effclass) %>%
  summary()
##       genhz       claytotest    total_frags_pct    phfield       effclass        
##  A       :113   Min.   :10.00   Min.   : 0.00   Min.   :4.90   Length:626        
##  BAt     : 40   1st Qu.:18.00   1st Qu.: 0.00   1st Qu.:6.00   Class :character  
##  Bt1     :206   Median :22.00   Median : 5.00   Median :6.30   Mode  :character  
##  Bt2     :118   Mean   :23.63   Mean   :13.88   Mean   :6.18                     
##  Cr      : 75   3rd Qu.:28.00   3rd Qu.:20.00   3rd Qu.:6.50                     
##  R       : 48   Max.   :60.00   Max.   :95.00   Max.   :7.00                     
##  not-used: 26   NA's   :167                     NA's   :381

The summary() function is known as a generic R function. It will return a summary for any R object.

Because h is a data frame, we get a summary of each column. Factors will be summarized by their frequency (i.e., number of observations), while numeric or integer variables will print out a five number summary, and characters simply print their length. The number of missing observations for any variable will also be printed if they are present. If any of these metrics look unfamiliar to you, don’t worry we’ll cover them shortly.

When you do have missing data and the function you want to run will not run with missing values, the following options are available:

  1. Exclude all rows or columns that contain missing values using the function na.exclude(), such as h2 <- na.exclude(h). However this can be wasteful because it removes all rows (e.g., horizons), regardless if the row only has 1 missing value. Instead it’s sometimes best to create a temporary copy of the variable in question and then remove the missing variables, such as claytotest <- na.exclude(h$claytotest).

  2. Replace missing values with another value, such as zero, a global constant, or the mean or median value for that column, such as h$claytotest <- ifelse(is.na(h$claytotest), 0, h$claytotest) # or h[is.na(h$claytotest), ] <- 0.

  3. Read the help file for the function you’re attempting to use. Many functions have additional arguments for dealing with missing values, such as na.rm.

A quick check for typos would be to examine the list of levels for a factor or character, such as:

# just for factors
levels(h$genhz)
## [1] "A"        "BAt"      "Bt1"      "Bt2"      "Cr"       "R"        "not-used"
# for characters and factors
sort(unique(h$hzname)) 
##  [1] "2BC"  "2BCt" "2Bt1" "2Bt2" "2Bt3" "2Bt4" "2Bt5" "2CB"  "2CBt" "2Cr"  "2Crt" "2R"   "A"    "A1"   "A2"   "AB"  
## [17] "ABt"  "Ad"   "Ap"   "B"    "BA"   "BAt"  "BC"   "BCt"  "Bt"   "Bt1"  "Bt2"  "Bt3"  "Bt4"  "Bw"   "Bw1"  "Bw2" 
## [33] "Bw3"  "C"    "CBt"  "Cd"   "Cr"   "Cr/R" "Crt"  "H1"   "Oi"   "R"    "Rt"

If the unique() function returned typos such as “BT” or “B t”, you could either fix your original dataset or you could make an adjustment in R, such as:

h$hzname <- ifelse(h$hzname == "BT", "Bt", h$hzname)

Typo errors such as these are a common problem with old pedon data in NASIS.

3.4 Exercise 1: Fetch and Inspect Data

Save the code you use in an R script, add answers as comments, and send to your mentor. Exercises 2 and 3 will build on Exercise 1.

  1. Load the gopheridge dataset found within the soilDB package or use your own data.

  2. Inspect the gopheridge object. How many sites or profiles are in the collection? How many horizons?

  3. Apply the generalized horizon rules below or develop your own, see the following job-aid and the Pattern Matching section of Chapter 2 for details.

# gopheridge rules
n <- c('A', 'Bt1', 'Bt2', 'Bt3','Cr','R')
p <- c('^A|BA$', 'Bt1|Bw','Bt$|Bt2', 'Bt3|CBt$|BCt', 'Cr', 'R')
  1. Summarize the hzdept, genhz, texture_class, sandtotest, and fine gravel columns using the summary() function.

3.5 Descriptive Statistics

Table 3.1: Short Description of Descriptive Statistics and R Functions
Parameter NASIS Description R function
Mean RV ? arithmetic average mean()
Weighted Mean RV weighted arithmetic average weighted.mean()
Median RV middle value, 50% quantile median()
Mode RV most frequent value sort(table(), decreasing = TRUE)[1]
Standard Deviation L & H ? variation around mean sd()
Quantiles L & H percent rank of values, such that all values are <= p quantile()

3.5.1 Measures of Central Tendency

These measures are used to determine the mid-point of the range of observed values. In NASIS speak this should ideally be equivalent to the representative value (RV) for numeric and integer data. The mean and median are the most commonly used statistics for central tendency.

3.5.1.1 Mean

Mean - is the arithmetic average all are familiar with, formally expressed as: \(\bar{x} =\frac{\sum_{i=1}^{n}x_i}{n}\) which sums ( \(\sum\) ) all the X values in the sample and divides by the number (n) of samples. It is assumed that all references in this document refer to samples rather than a population.

The mean clay content from the loafercreek dataset may be determined:

mean(h$claytotest, na.rm = TRUE)
## [1] 23.62767

3.5.1.2 Median

Median is the middle measurement of a sample set, and as such is a more robust estimate of central tendency than the mean. This is known as the middle or 50th quantile, meaning there are an equal number of samples with values less than and greater than the median. For example, assuming there are 21 samples, sorted in ascending order, the median would be the 11th sample.

The median from the sample dataset may be determined:

median(h$claytotest, na.rm = TRUE)
## [1] 22

3.5.1.3 Mode

Mode - is the most frequent measurement in the sample. The use of mode is typically reserved for factors, which we will discuss shortly. One issue with using the mode for numeric data is that the data need to be rounded to the level of desired precision. R does not include a function for calculating the mode, but we can calculate it using the following example.

# sort and select the 1st value, which will be the mode
sort(table(round(h$claytotest)), decreasing = TRUE)[1] 
## 25 
## 42

3.5.1.4 Frequencies

Frequencies

To summarize factors and characters we can examine their frequency or number of observations. This is accomplished using the table() or summary() functions.

table(h$genhz)
## 
##        A      BAt      Bt1      Bt2       Cr        R not-used 
##      113       40      206      118       75       48       26
# or

summary(h$genhz)
##        A      BAt      Bt1      Bt2       Cr        R not-used 
##      113       40      206      118       75       48       26

This gives us a count of the number of observations for each horizon. If we want to see the comparison between two different factors or characters, we can include two variables.

table(h$genhz, h$texcl)
##           
##              c  cl   l scl sic sicl sil  sl
##   A          0   0  78   0   0    0  27   6
##   BAt        0   2  31   0   0    1   4   1
##   Bt1        2  44 127   4   1    5  20   1
##   Bt2       16  54  29   5   1    3   5   0
##   Cr         1   0   0   0   0    0   0   0
##   R          0   0   0   0   0    0   0   0
##   not-used   0   1   0   0   0    0   0   0
# or

h %>% 
  count(genhz, texcl)
##       genhz texcl   n
## 1         A     l  78
## 2         A   sil  27
## 3         A    sl   6
## 4         A  <NA>   2
## 5       BAt    cl   2
## 6       BAt     l  31
## 7       BAt  sicl   1
## 8       BAt   sil   4
## 9       BAt    sl   1
## 10      BAt  <NA>   1
## 11      Bt1     c   2
## 12      Bt1    cl  44
## 13      Bt1     l 127
## 14      Bt1   scl   4
## 15      Bt1   sic   1
## 16      Bt1  sicl   5
## 17      Bt1   sil  20
## 18      Bt1    sl   1
## 19      Bt1  <NA>   2
## 20      Bt2     c  16
## 21      Bt2    cl  54
## 22      Bt2     l  29
## 23      Bt2   scl   5
## 24      Bt2   sic   1
## 25      Bt2  sicl   3
## 26      Bt2   sil   5
## 27      Bt2  <NA>   5
## 28       Cr     c   1
## 29       Cr  <NA>  74
## 30        R  <NA>  48
## 31 not-used    cl   1
## 32 not-used  <NA>  25

We can also add margin totals to the table or convert the table frequencies to proportions.

# append the table with row and column sums
addmargins(table(h$genhz, h$texcl))
##           
##              c  cl   l scl sic sicl sil  sl Sum
##   A          0   0  78   0   0    0  27   6 111
##   BAt        0   2  31   0   0    1   4   1  39
##   Bt1        2  44 127   4   1    5  20   1 204
##   Bt2       16  54  29   5   1    3   5   0 113
##   Cr         1   0   0   0   0    0   0   0   1
##   R          0   0   0   0   0    0   0   0   0
##   not-used   0   1   0   0   0    0   0   0   1
##   Sum       19 101 265   9   2    9  56   8 469
# calculate the proportions relative to the rows, margin = 1 calculates for rows, margin = 2 calculates for columns, margin = NULL calculates for total observations
table(h$genhz, h$texture_class) %>%
  prop.table(margin = 1) %>%
  round(2) * 100
##           
##             br   c  cb  cl  gr   l  pg scl sic sicl sil  sl spm
##   A          0   0   0   0   0  70   0   0   0    0  24   5   0
##   BAt        0   0   0   5   0  79   0   0   0    3  10   3   0
##   Bt1        0   1   0  22   0  62   0   2   0    2  10   0   0
##   Bt2        0  14   1  46   2  25   1   4   1    3   4   0   0
##   Cr        97   2   0   0   2   0   0   0   0    0   0   0   0
##   R        100   0   0   0   0   0   0   0   0    0   0   0   0
##   not-used   0   0   0   4   0   0   0   0   0    0   0   0  96

To determine the mean by a group or category, use the group_by and summarize functions:

h %>%
  group_by(genhz) %>%
  summarize(clay_avg = mean(claytotest, na.rm = TRUE),
            clay_med = median(claytotest, na.rm = TRUE))
## # A tibble: 7 × 3
##   genhz    clay_avg clay_med
##   <fct>       <dbl>    <dbl>
## 1 A            16.2       16
## 2 BAt          19.5       19
## 3 Bt1          24.0       24
## 4 Bt2          31.2       30
## 5 Cr           15         15
## 6 R           NaN         NA
## 7 not-used    NaN         NA

3.5.2 Measures of Dispersion

These are measures used to determine the spread of values around the mid-point. This is useful to determine if the samples are spread widely across the range of observations or concentrated near the mid-point. In NASIS speak these values might equate to the low (L) and high (H) values for numeric and integer data.

3.5.2.1 Variance

Variance is a positive value indicating deviation from the mean:

\(s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2} {n - 1}\)

This is the square of the sum of the deviations from the mean, divided by the number of samples minus 1. It is commonly referred to as the sum of squares. As the deviation increases, the variance increases. Conversely, if there is no deviation, the variance will equal 0. As a squared value, variance is always positive. Variance is an important component for many statistical analyses including the most commonly referred to measure of dispersion, the standard deviation. Variance for the sample dataset is:

var(h$claytotest, na.rm=TRUE)
## [1] 64.27607

3.5.2.2 Standard Deviation

Standard Deviation is the square root of the variance:

\(s = \sqrt\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2} {n - 1}\)

The units of the standard deviation are the same as the units measured. From the formula you can see that the standard deviation is simply the square root of the variance. Standard deviation for the sample dataset is:

sd(h$claytotest, na.rm = TRUE)
## [1] 8.017236
# or

sqrt(var(h$claytotest, na.rm = TRUE))
## [1] 8.017236

3.5.2.3 Coefficient of Variation

Coefficient of Variation (CV) is a relative (i.e., unitless) measure of standard deviation:

\(CV = \frac{s}{\bar{x}} \times 100\)

CV is calculated by dividing the standard deviation by the mean and multiplying by 100. Since standard deviation varies in magnitude with the value of the mean, the CV is useful for comparing relative variation amongst different datasets. However Webster (2001) discourages using CV to compare different variables. Webster (2001) also stresses that CV is reserved for variables that have an absolute 0, like clay content. CV may be calculated for the sample dataset as:

# remove NA values and create a new variable
clay <- na.exclude(h$claytotest)

sd(clay) / mean(clay) * 100
## [1] 33.93156

3.5.2.4 Quantiles (Percentiles)

Quantiles (a.k.a. Percentiles) - the percentile is the value that cuts off the first nth percent of the data values when sorted in ascending order.

The default for the quantile() function returns the min, 25th percentile, median or 50th percentile, 75th percentile, and max, known as the five number summary originally proposed by Tukey. Other probabilities however can be used. At present the 5th, 50th, and 95th are being proposed for determining the range in characteristics (RIC) for a given soil property.

quantile(h$claytotest, na.rm = TRUE)
##   0%  25%  50%  75% 100% 
##   10   18   22   28   60
# or

quantile(h$claytotest, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)
##   5%  50%  95% 
## 13.0 22.0 38.1

Thus, for the five number summary 25% of the observations fall between each of the intervals. Quantiles are a useful metric because they are largely unaffected by the distribution of the data, and have a simple interpretation.

The National Soil Survey Handbook Part 618.2 specifies that new aggregate component data in NASIS should use quantiles for low, RV, and high values. For low and high the 5th, 10th, 20th and 80th, 90th, 95th can be used, respectively, according to which percentiles best capture the spread of a particular dataset. For newly populated information in NASIS, the RV is used to approximate the 50th percentile (median) of a dataset.

3.5.2.5 Range

Range is the difference between the highest and lowest measurement of a group. Using a vector of sample data (clay) the range can be may be determined with:

range(clay)
## [1] 10 60

which returns the minimum and maximum values observed.

To get the width of the range you can do:

diff(range(clay))
## [1] 50
# or

max(clay) - min(clay)
## [1] 50

3.5.2.6 Interquartile Range

Interquartile Range (IQR) is the range from the upper (75%) quartile to the lower (25%) quartile. This represents 50% of the observations occurring in the mid-range of a sample. IQR is a robust measure of dispersion, unaffected by the distribution of data. In soil survey lingo you could consider the IQR to estimate the central concept of a soil property. IQR may be calculated for the sample dataset as:

IQR(clay)
## [1] 10
# or

diff(quantile(clay, p = c(0.25, 0.75)))
## 75% 
##  10

3.5.3 Weighted Summaries

Weighted Mean - is the weighted arithmetic average. In the ordinary arithmetic average, each data point is assigned the same weight, whereas in a weighted average the contribution of each data point to the mean can vary.

The function weighted.mean() takes two major inputs: x, a vector of data values, and w, a vector of weights. Like other statistics, you can remove missing values with na.rm=TRUE.

# calculate horizon thickness
h$hzthk <- h$hzdepb - h$hzdept

# weighted mean (across all horizons and profiles)
weighted.mean(h$claytotest, h$hzthk, na.rm = TRUE)
## [1] 25.24812

The above example is applied to all profiles and all horizons. We only get one value across 106 profiles and 626 horizons.

More commonly we use the weighted mean for depth-weighted averages in individual pedons or components and, also, for mapunit averages.

Depth-weighted averages commonly use horizon thicknesses as weights. Mapunit averages commonly use component percentages as weights.

When we want to scale the contribution of particular observations or variables to a final result, weights may be derived from expert knowledge or some other data source.

3.5.3.1 Profile-specific Weighted Calculations

A more practical application of weighted.mean() is Particle Size Control Section (PSCS) weighted average properties. For this we need to use the horizon thickness within the control section as weights.

The aqp function trunc() allows you to remove portions of profiles that are outside specified depth intervals (such as the PSCS boundaries). profileApply() allows for iterating over all profiles in a SoilProfileCollection to call some function on each one.

Here we calculate the PSCS weighted mean clay content for each profile in loafercreek:

# calculate a "truncated" SPC with just the PSCS; drop = FALSE retains empty profiles for missing PSCS
loafercreek_pscs <- trunc(loafercreek, loafercreek$psctopdepth, loafercreek$pscbotdepth, drop = FALSE)

# calculate weighted mean for each profile, using the truncated horizon thicknesses as weights
loafercreek$pscs_clay <- profileApply(loafercreek_pscs, function(p) {
  weighted.mean(p$claytotest, w = p$hzdepb - p$hzdept, na.rm = TRUE)
})

# inspect
loafercreek$pscs_clay
##   [1] 28.76471 28.76471 25.88000      NaN 16.00000 37.32000 32.63415      NaN 33.78571 29.57576 29.00000 23.02632
##  [13] 25.28000 23.36364 26.25000 34.12000 47.69565 30.73171 28.32609 32.00000 27.12500 27.76316 25.00000 28.60870
##  [25] 23.39535 22.32000 24.10000 20.91600 22.52000 20.48780 44.38000 37.88000 28.70000 28.80000 36.46000 23.78000
##  [37] 30.72000 24.33333 23.76000 21.78000 28.72727 20.83871 19.56000 25.48000 28.92000 33.95122 20.18000 18.86486
##  [49] 18.12500 25.56000 17.34000 25.84000 19.00000 28.37500 27.10870 24.72000 26.20370 25.13514 26.00000 24.38000
##  [61] 24.20000 26.24000 27.44444 29.90000 24.00000 42.08000 28.34000 29.36842 23.10000 24.56818 29.10000 24.16000
##  [73] 28.82000      NaN 18.77500      NaN 45.82917 25.25161      NaN 20.50000 21.28889 24.86000 32.34000 33.36585
##  [85] 31.04000 23.05405 17.22000 26.52000 29.22000 20.00000 24.60000 19.74000 18.25000 20.70000 31.84000 21.97959
##  [97] 32.44737 29.56000      NaN 22.09412 21.34000 27.46154 26.68000 20.10526 31.00000 22.65000

Note that several of these pedons have NaN values in the result. This can happen even with na.rm=TRUE because pedons either do not have both psctopdepth and pscbotdepth populated in NASIS or all horizons in the PSCS have NA clay content.

# inspect new site-level column pscs_clay
head(site(loafercreek)[, c("peiid", "upedonid", "psctopdepth", "pscbotdepth", "pscs_clay")])

# plot just the pedons with missing pscs_clay
plot(subset(loafercreek, is.na(pscs_clay)), color = "claytotest")

You always need to pay attention to the possibility of missing data or weights because missing values can dramatically the final result–sometimes in unexpected ways!

3.5.3.2 Other Weighted Statistics

There are many other weighted statistics available. Essentially all will use a matrix of data values and a matrix of weights to scale the contribution of data values to the final statistic value.

The package Hmisc defines several weighted variants of common summary statistics, notably Hmisc::wtd.quantile(). The DescTools package also defines Quantile() which strictly follows the Eurostat definition of a weighted quantile.

3.5.4 Correlation

A correlation matrix is a table of the calculated correlation coefficients of all variables. This provides a quantitative measure to guide the decision making process. The following will produce a correlation matrix for the sp4 dataset:

h$hzdepm <- (h$hzdepb + h$hzdept) / 2 # Compute the middle horizon depth

h %>%
  select(hzdepm, claytotest, sandtotest, total_frags_pct, phfield) %>%
  cor(use = "complete.obs") %>%
  round(2)
##                 hzdepm claytotest sandtotest total_frags_pct phfield
## hzdepm            1.00       0.59      -0.08            0.50   -0.03
## claytotest        0.59       1.00      -0.17            0.28    0.13
## sandtotest       -0.08      -0.17       1.00           -0.05    0.12
## total_frags_pct   0.50       0.28      -0.05            1.00   -0.16
## phfield          -0.03       0.13       0.12           -0.16    1.00

Variables are perfectly correlated with themselves so they have a correlation coefficient of 1.0. Negative values indicate a negative relationship between variables.

What is considered “highly correlated”? A good rule of thumb is anything with a value of 0.7 or greater is considered highly correlated.

3.6 Exercise 2: Compute Descriptive Statistics

Add to your existing R script from Exercise 1, add answers as comments, and send to your mentor. Exercise 3 will build on Exercises 1 and 2.

  1. Aggregate by genhz and calculate several descriptive statistics for hzdept, gravel and phfield.
  2. Cross-tabulate geomposhill and argillic.horizon from the site table, as a percentage.
  3. Compute a correlation matrix between hzdept, gravel and phfield.

3.7 Graphical Methods

Now that we’ve checked for missing values and typos and made corrections, we can graphically examine the sample data distribution of our data. Frequency distributions are useful because they can help us visualize the center (e.g., RV) and spread or dispersion (e.g., low and high) of our data. Typically in introductory statistics the normal (i.e., Gaussian) distribution is emphasized.

Table 3.2: Short Description of Graphical Methods
Plot Types Description
Bar a plot where each bar represents the frequency of observations for a ‘group’
Histogram a plot where each bar represents the frequency of observations for a ‘given range of values’
Density an estimation of the frequency distribution based on the sample data
Quantile-Quantile a plot of the actual data values against a normal distribution
Box-Whisker a visual representation of median, quartiles, symmetry, skewness, and outliers
Scatter & Line a graphical display of one variable plotted on the x axis and another on the y axis
Table 3.3: Comparison of R’s 3 Graphing Systems and their Equivalent Functions for Plotting
Plot Types Base R lattice ggplot geoms
Bar barplot() barchart() geom_bar()
Histogram hist() histogram() geom_histogram()
Density plot(density()) densityplot() geom_density()
Quantile-Quantile qqnorm() qq() geom_qq()
Box-Whisker boxplot() bwplot() geom_boxplot()
Scatter & Line plot() xyplot geom_point()

3.7.1 Distributions

3.7.2 Bar Plot

A bar plot is a graphical display of the frequency (i.e. number of observations (count or n)), such as soil texture, that fall within a given class. It is a graphical alternative to to the table() function.

library(ggplot2)

# bar plot
ggplot(h, aes(x = texcl)) +
  geom_bar()

3.7.3 Histogram

A histogram is similar to a bar plot, except that instead of summarizing categorical data, it categorizes a continuous variable like clay content into non-overlappying intervals for the sake of display. The number of intervals can be specified by the user, or can be automatically determined using an algorithm, such as nclass.Sturges(). Since histograms are dependent on the number of bins, for small datasets they’re not the best method of determining the shape of a distribution.

ggplot(h, aes(x = claytotest)) +
  geom_histogram(bins = nclass.Sturges(h$claytotest))

3.7.4 Density Curve

A density estimation, also known as a Kernel density plot, generally provides a better visualization of the shape of the distribution in comparison to the histogram. Compared to the histogram where the y-axis represents the number or percent (i.e., frequency) of observations, the y-axis for the density plot represents the probability of observing any given value, such that the area under the curve equals one. One curious feature of the density curve is the hint of two peaks (i.e. bimodal distribution?). Given that our sample includes a mixture of surface and subsurface horizons, we may have two different populations. However, considering how much the two distributions overlap, it seems impractical to separate them in this instance.

ggplot(h, aes(x = claytotest)) +
  geom_density()

3.7.5 Box plots

Box plots are a graphical representation of the five number summary, depicting quartiles (i.e. the 25%, 50%, and 75% quantiles), minimum, maximum and outliers (if present). Boxplots convey the shape of the data distribution, the presence of extreme values, and the ability to compare with other variables using the same scale, providing an excellent tool for screening data, determining thresholds for variables and developing working hypotheses.

The parts of the boxplot are shown in the figure below. The “box” of the boxplot is defined as the 1st quartile and the 3rd quartile. The median, or 2nd quartile, is the dark line in the box. The whiskers (typically) show data that is 1.5 * IQR above and below the 3rd and 1st quartile. Any data point that is beyond a whisker is considered an outlier.

That is not to say the outlier points are in error, just that they are extreme compared to the rest of the data set. However, you may want to evaluate these points to ensure that they are correct.

ggplot(h, aes(x = genhz, y = claytotest)) +
  geom_boxplot()

The above box plot shows a steady increase in clay content with depth. Notice the outliers in the box plots, identified by the individual circles.

3.7.6 Quantile comparison plots (QQplot)

A QQ plot is a plot of the actual data values against a normal distribution (which has a mean of 0 and standard deviation of 1).

# QQ Plot for Clay
ggplot(h, aes(sample = claytotest)) + 
  geom_qq() +
  geom_qq_line()

# QQ Plot for Frags
ggplot(h, aes(sample = total_frags_pct)) + 
  geom_qq() +
  geom_qq_line()

If the data set is perfectly symmetric (i.e. normal), the data points will form a straight line. Overall this plot shows that our clay example is more or less symmetric. However the second plot shows that our rock fragments are far from evenly distributed.

A more detailed explanation of QQ plots may be found on Wikipedia:
https://en.wikipedia.org/wiki/QQ_plot

3.7.7 The ‘Normal’ distribution

What is a normal distribution and why should you care? Many statistical methods are based on the properties of a normal distribution. Applying certain methods to data that are not normally distributed can give misleading or incorrect results. Most methods that assume normality are robust enough for all data except the very abnormal. This section is not meant to be a recipe for decision making, but more an extension of tools available to help you examine your data and proceed accordingly. The impact of normality is most commonly seen for parameters used by pedologists for documenting the ranges of a variable (i.e., Low, RV and High values). Often a rule-of thumb similar to: “two standard deviations” is used to define the low and high values of a variable. This is fine if the data are normally distributed. However, if the data are skewed, using the standard deviation as a parameter does not provide useful information of the data distribution. The quantitative measures of Kurtosis (peakedness) and Skewness (symmetry) can be used to assist in accessing normality and can be found in the fBasics package, but Webster (2001) cautions against using significance tests for assessing normality. The preceding sections and chapters will demonstrate various methods to cope with alternative distributions.

A Gaussian distribution is often referred to as “Bell Curve”, and has the following properties:

  1. Gaussian distributions are symmetric around their mean
  2. The mean, median, and mode of a Gaussian distribution are equal
  3. The area under the curve is equal to 1.0
  4. Gaussian distributions are denser in the center and less dense in the tails
  5. Gaussian distributions are defined by two parameters, the mean and the standard deviation
  6. 68% of the area under the curve is within one standard deviation of the mean
  7. Approximately 95% of the area of a Gaussian distribution is within two standard deviations of the mean

Viewing a histogram or density plot of your data provides a quick visual reference for determining normality. Distributions are typically normal, Bimodal or Skewed:

Examples of different types of distributions
Examples of different types of distributions

Occasionally distributions are Uniform, or nearly so:

With the loafercreek dataset the mean and median for clay were only slightly different, so we can safely assume that we have a normal distribution. However many soil variables often have a non-normal distribution. For example, let’s look at graphical examination of the mean vs. median for clay and rock fragments:

The solid lines represent the breakpoint for the mean and standard deviations. The dashed lines represents the median and quantiles. The median is a more robust measure of central tendency compared to the mean. In order for the mean to be a useful measure, the data distribution must be approximately normal. The further the data departs from normality, the less meaningful the mean becomes.

The median always represents the same thing independent of the data distribution, namely, 50% of the samples are below and 50% are above the median. The example for clay again indicates that distribution is approximately normal.

However for rock fragments, we commonly see a long tailed distribution (e.g., skewed). Using the mean in this instance would overestimate the rock fragments. Although in this instance the difference between the mean and median is only 9 percent.

3.7.8 Scatterplots and Line Plots

Plotting points of one ratio or interval variable against another is a scatter plot. Plots can be produced for a single or multiple pairs of variables. Many independent variables are often under consideration in soil survey work. This is especially common when GIS is used, which offers the potential to correlate soil attributes with a large variety of raster datasets.

The purpose of a scatterplot is to see how one variable relates to another. With modeling in general the goal is parsimony (i.e., simple). The goal is to determine the fewest number of variables required to explain or describe a relationship. If two variables explain the same thing, i.e., they are highly correlated, only one variable is needed. The scatterplot provides a perfect visual reference for this.

Create a basic scatter plot using the loafercreek dataset.

# scatter plot
ggplot(h, aes(x = claytotest, y = hzdepm)) +
  geom_point() +
  ylim(100, 0)

# line plot
ggplot(h, aes(y = claytotest, x = hzdepm, group = peiid)) +
  geom_line() +
  coord_flip() +
  xlim(100, 0)

This plots clay on the X axis and depth on the X axis. As shown in the scatterplot above, there is a moderate correlation between these variables.

The function below produces a scatterplot matrix for all the numeric variables in the dataset. This is a good command to use for determining rough linear correlations for continuous variables.

library(GGally)
  
h %>%
  select(hzdepm, claytotest, phfield, total_frags_pct) %>%
  ggpairs()

3.7.9 3rd Dimension - Color, Shape, Size, Layers, etc…

3.7.9.1 Color and Groups

# scatter plot
ggplot(h, aes(x = claytotest, y = hzdepm, color = genhz)) +
  geom_point(size = 3) +
  ylim(100, 0)

# density plot
ggplot(h, aes(x = claytotest, color = genhz)) +
  geom_density(size = 2)

# bar plot
ggplot(h, aes(x = genhz, fill = texture_class)) +
  geom_bar()

# box plot
ggplot(h, aes(x = genhz, y = claytotest)) + 
  geom_boxplot()

# heat map (pseudo bar plot)
s <- aqp::site(loafercreek)

ggplot(s, aes(x = landform_string, y = pmkind)) + 
  geom_tile(alpha = 0.2) 

3.7.9.2 Facets - box plots

library(tidyr)

# convert to long format
df <- h %>% 
  select(peiid, genhz, hzdepm, claytotest, phfield, total_frags_pct) %>% 
  pivot_longer(cols = c("claytotest", "phfield", "total_frags_pct"))

ggplot(df, aes(x = genhz, y = value)) +
  geom_boxplot() +
  xlab("genhz") +
  facet_wrap(~ name, scales = "free_y")

3.7.9.3 Facets - depth plots

data(loafercreek, package = "soilDB")

s <- aqp::slab(loafercreek, 
               fm = ~ claytotest + phfield + total_frags_pct,
               slab.structure = 0:100,
               slab.fun = function(x) quantile(x, c(0.1, 0.5, 0.9), na.rm = TRUE))

ggplot(s, aes(x = top, y = X50.)) +
  # plot median
  geom_line() +
  # plot 10th & 90th quantiles
  geom_ribbon(aes(ymin = X10., ymax = X90., x = top), alpha = 0.2) +
  # invert depths
  xlim(c(100, 0)) +
  # flip axis
  coord_flip() +
  facet_wrap(~ variable, scales = "free_x")

3.8 Exercise 3: Graphical Methods

Add to your existing R script from Exercise 2, add answers as comments, and send to your mentor.

  1. Create a faceted boxplot of genhz vs gravel and phfield.

  2. Create a facted depth plot for gravel and phfield.

3.9 Transformations

Slope aspect and pH are two common variables warranting special consideration.

3.9.1 pH

There is debate about the best way to average pH since is it a log transformed variable. Remember, pHs of 6 and 5 correspond to hydrogen ion concentrations of 0.000001 and 0.00001 respectively. The actual average is 5.26; -log10((0.000001 + 0.00001) / 2). If no conversions are made for pH, the mean and sd in the summary are considered the geometric mean and sd, not the arithmetic. The wider the pH range, the greater the difference between the geometric and arithmetic mean. The difference between the arithmetic average of 5.26 and the geometric average of 5.5 is small. Boyd, Tucker, and Viriyatum (2011) examined the issue in detail, and suggests that regardless of the method used, the choice should be documented.

If you have a table with pH values and wish to calculate the arithmetic mean using R, this example will work:

# arithmetic mean
log10(mean(1 / 10^-h$phfield, na.rm = TRUE)) 
## [1] 6.371026
# geometric mean
mean(h$phfield, na.rm = TRUE) 
## [1] 6.18

3.9.2 Circular data

Slope aspect - requires the use of circular statistics for numerical summaries, or graphical interpretation using circular plots. For example, if soil map units being summarized have a uniform distribution of slope aspects ranging from 335 degrees to 25 degrees, the Zonal Statistics tool in ArcGIS would return a mean of 180.

The most intuitive means available for evaluating and describing slope aspect are circular plots available with the circular package in R and the radial plot option in the TEUI Toolkit. The circular package in R will also calculate circular statistics like mean, median, quartiles etc.

library(circular)

# Extract the site table
s <- aqp::site(loafercreek) 

aspect <- s$aspect_field
aspect <- circular(aspect, template="geographic", units="degrees", modulo="2pi")

summary(aspect)
##        n     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.      Rho     NA's 
## 101.0000  12.0000 255.0000 195.0000 199.5000 115.0000  20.0000   0.1772   5.0000

The numeric output is fine, but a following graphic is more revealing, which shows the dominant Southwest slope aspect.

rose.diag(aspect, bins = 8, col="grey")

3.9.3 Texture Class and Fine-earth Separates

Many pedon descriptions include soil texture class and modifiers, but lack numeric estimates such as clay content and rock fragments percentage. This lack of “continuous” numeric data makes it difficult to analyze and estimate certain properties precisely.

While numeric data on textural separates may be missing, it can still be estimated by the class ranges and averages. NASIS has many calculations used to estimate missing values. To facilitate this process in R, several new functions have recently been added to the aqp package. These new aqp functions are intended to impute missing values or check existing values. The ssc_to_texcl() function uses the same logic as the particle size estimator calculation in NASIS to classify sand and clay into texture class. The results are stored in data(soiltexture) and used by texcl_to_ssc() as a lookup table to convert texture class to sand, silt and clay. The function texcl_to_ssc() replicates the functionality described by Levi (2017). Unlike the other functions, texture_to_taxpartsize() is intended to be computed on weighted averages within the family particle size control section. Below is a demonstration of these new aqp R functions.

library(aqp)
library(soiltexture)

# example of texcl_to_ssc(texcl)
texcl <- c("s", "ls", "l", "cl", "c")
test <- texcl_to_ssc(texcl)
head(cbind(texcl, test))

# example of ssc_to_texcl()
ssc <- data.frame(
  CLAY = c(55, 33, 18, 6, 3),
  SAND = c(20, 33, 42, 82, 93),
  SILT = c(25, 34, 40, 12, 4)
)

texcl <- ssc_to_texcl(sand = ssc$SAND, clay = ssc$CLAY)
ssc_texcl <- cbind(ssc, texcl)
head(ssc_texcl)

# plot on a textural triangle
TT.plot(
  class.sys = "USDA-NCSS.TT",
  tri.data = ssc_texcl,
  pch = 16,
  col = "blue"
)

# example of texmod_to_fragvoltol()
frags <- c("gr", "grv", "grx", "pgr", "pgrv", "pgrx")
test <- texmod_to_fragvoltot(frags)[1:4]
head(test)

# example of texture_to_taxpartsize()
tex <- data.frame(
  texcl = c("c", "cl", "l", "ls", "s"),
  clay  = c(55, 33, 18, 6, 3),
  sand  = c(20, 33, 42, 82, 93),
  fragvoltot = c(35, 15, 34, 60, 91)
)

tex$fpsc <- texture_to_taxpartsize(
  texcl = tex$texcl,
  clay = tex$clay,
  sand = tex$sand,
  fragvoltot = tex$fragvoltot
)

head(tex)

3.10 The Shiny Package

Shiny is an R package which combines R programming with the interactivity of the web.

install.packages("shiny")

Methods for Use

  • Online
  • Locally

The shiny package, created by RStudio, enables users to not only use interactive applications created by others, but to build them as well.

3.10.1 Online

Easiest Method

The ability to use a shiny app online is one of the most useful features of the package. All of the R code is executed on a remote computer which sends the results over a live http connection. Very little is required of the user in order to obtain results.

3.10.2 Locally

No Internet required once configured

  • Install R and RStudio (done)
  • Install Required packages (app dependent)
  • Download, open in RStudio and click “Run App”

The online method may be easy for the user, but it does require a significant amount of additional effort for the creator. We won’t get into those details here! The local method, however simply requires the creator to share a single app.R file. It is the user which needs to put forth the additional effort.

3.10.3 Web App Demonstration

Online:

Local:

Online apps such as the North Central Region Web App are useful tools, available for all to use during soil survey, ecological site development, or other evaluations. The North Central Region app is however limited to data which is already available online, such as SDA (Soil Data Access) and NASIS (National Soil Information System) Web Reports.

It is also reliant on the successful operation of those data systems. If the NASIS Web Reports or SDA is down for maintenance, the app fails. Local apps have the ability to leverage local data systems more easily like NASIS or other proprietary data.

3.10.4 Troubleshooting Errors

  1. Reload the app and try again. (refresh browser, or click stop, and run app again in RStudio) When the app throws an error, it stops. All other tabs/reports will no longer function until the app is reloaded.

  2. Read the getting started section on the home page. This is a quick summary of tips to avoid issues, and will be updated as needed.

  3. Check to see if SDA and NASIS Web Reports are operational, if they aren’t working, then the app won’t work either.

  4. Double check your query inputs. (typos, wildcards, null data, and too much data, can be common issues)

  5. 5 minutes of inactivity will cause the connection to drop, be sure you are actively using the app.

  6. Run the app locally - the online app does not show console activity, which can be a big help when identifying problems.

  7. Check the app issues page to see if the issue you are experiencing is already documented. (Polite but not required)

  8. Contact the app author ()

When you run into trouble, there are a few steps you can take on your own to get things working again. This list may help you get your issue resolved. If not, contact () for assistance.

3.10.5 Shiny App Embedding

Shiny apps are extremely versatile, they can be embedded into presentations, Markdown, or HTML Those same formats can also be embedded in to a shiny app. This is a very simple example of a shiny app which consists of an interactive dropdown menu which controls what region is displayed in the bar chart. Let’s take a look at the code.

3.10.5.1 Shiny App Code

shinyApp(
  
  # Use a fluid Bootstrap layout
  ui = fluidPage(    
    
    # Give the page a title
    titlePanel("Telephones by region"),
    
    # Generate a row with a sidebar
    sidebarLayout(      
      
      # Define the sidebar with one input
      sidebarPanel(
        selectInput("region", "Region:", 
                    choices = colnames(datasets::WorldPhones)), 
        hr(),
        helpText("Data from AT&T (1961) The World's Telephones.")
      ),
      
      # Create a spot for the barplot
      mainPanel(
        plotOutput("phonePlot")  
      )
      
    )
  ),
  # Define a server function for the Shiny app
  server = function(input, output) {
    
    # Fill in the spot we created for a plot
    output$phonePlot <- renderPlot({
      
      # Render a barplot
      barplot(
        datasets::WorldPhones[, input$region] * 1000,
        main = input$region,
        ylab = "Number of Telephones",
        xlab = "Year"
      )
    })
  }
)

Shiny apps consist of a ui and a server. The ui is the part of the shiny app the user sees, the user interface. In the ui, a user can choose or enter inputs for processing and viewing the results. The server takes the inputs, performs some data processing and rendering of those inputs and generates the outputs for the ui to display.

3.10.6 Questions

  1. What new features in RStudio are available for you to use once the shiny package is installed?

  2. The Region 11 Web App uses which two data sources for reports?

  3. If an error occurs while using the Region 11 Web App, what should you do?

3.10.7 Examples

3.10.7.1 NASIS Reports

The NASIS Reports button is a link to a master list of NASIS Web reports for Regions 10 and 11.

3.10.7.2 Water Table

The query method option allows you to choose between MUKEY, NATSYM, MUNAME. It also has a radio button for switching between flooding and ponding.

Plots and Data Tables are on separate sub-menu items.

3.10.7.3 Organic Matter

Same options as the Water Table Tab except no radio button. The query method option allows you to choose between MUKEY, NATSYM, MUNAME.

Plots and Data Tables are on separate sub-menu items.

3.10.7.4 Project Report

The project report can accept multiple projects. Use the semicolon (;) as a separator. You can also save a copy of the report by clicking the link below the submit button.

3.10.7.5 Project Extent

Type in Year, type or select office and type project name for the Project extent query. Zoom and pan to view extent. Use the layers button to change the basemap or toggle layers. Click the link below the submit button to download a .zip containing the extent as a ESRI shapefile.

3.10.7.6 Long Range Plan

Enter an office symbol to generate a long range plan report.

3.10.7.7 Interpretations

Enter the national mapunit symbol to plot all available interpretations for the mapunit from SDA.

3.11 Exercise 4: Using the North Central Region Web App

3.11.0.1 Project Report

Use the project report to generate a report on a project in your own area. Save the results and explain the results of pH plots for one of your components.

3.11.0.2 Project Extent

Map an extent of a project. How many layers are available to choose from as a basemap? How many layers can be toggled on or off?

3.11.0.3 Long Range Plan

Choose an office to generate a long range plan. What is the highest acreage project for 2025?

3.12 soilReports

One of the strengths of NASIS is that it that has many queries and reports to access the complex data. This makes it easy for the average user to load their data, process it, and run numerous reports.

The soilReports R package is essentially just a collection of R Markdown (.Rmd) documents.

R Markdown is a plain text markup format for creating reproducible, dynamic reports with R. These .Rmd files can be used to generate HTML, PDF, Word, Markdown documents with a variety of forms, templates and applications.

Install the soilReports package. This package is updated regularly and should be installed from GitHub.

# Install the soilReports package from GitHub
remotes::install_github("ncss-tech/soilReports", dependencies = FALSE, build = FALSE)

To view the list of available reports first load the package then use the listReports() function.

# Load the soilReports package
library(soilReports)

# List reports
listReports()
##                                     name version
## 1                     national/DT-report     1.0
## 2             national/NASIS-site-export     1.0
## 3  region11/component_summary_by_project     0.1
## 4      region11/lab_summary_by_taxonname     1.0
## 5  region11/mupolygon_summary_by_project     0.1
## 6    region11/pedon_summary_by_taxonname     1.1
## 7                       region2/dmu-diff     0.7
## 8                    region2/dmu-summary     0.4
## 9             region2/edit-soil-features   0.2.1
## 10                   region2/gdb-acreage     1.0
## 11       region2/mlra-comparison-dynamic     0.1
## 12               region2/mlra-comparison     2.0
## 13       region2/mu-comparison-dashboard   0.0.0
## 14                 region2/mu-comparison   4.0.3
## 15                    region2/mu-summary       1
## 16                 region2/pedon-summary     1.0
## 17                    region2/QA-summary     0.6
## 18           region2/shiny-pedon-summary     1.1
## 19                region2/spatial-pedons     1.0
## 20                     templates/minimal     1.0
##                                                                                              description
## 1                                                          Create interactive data tables from CSV files
## 2                                                                    Export NASIS Sites to Spatial Layer
## 3                                                           summarize component data for an MLRA project
## 4                                                          summarize lab data from NASIS Lab Layer table
## 5                                                           summarize mupolygon layer from a geodatabase
## 6                                                          summarize field pedons from NASIS pedon table
## 7                                                                         Differences between select DMU
## 8                                                                                     DMU Summary Report
## 9                                 Generate summaries of NASIS components for EDIT Soil Features sections
## 10                                                            Geodatabase Mapunit Acreage Summary Report
## 11                                    compare MLRA/LRU-scale delineations, based on mu-comparison report
## 12                                                  compare MLRA using pre-made, raster sample databases
## 13            interactively subset and summarize SSURGO data for input to `region2/mu-comparison` report
## 14                     compare stack of raster data, sampled from polygons associated with 1-8 map units
## 15                                     summarize raster data for a large collection of map unit polygons
## 16                                      Generate summaries from NASIS pedons and associated spatial data
## 17                                                                                     QA Summary Report
## 18                        Interactively subset and summarize NASIS pedon data from one or more map units
## 19 Visualize NASIS pedons in interactive HTML report with overlays of SSURGO, STATSGO or custom polygons
## 20                                                                        A minimal soilReports template

3.12.1 Extending soilReports

Each report in soilReports has a “manifest” that describes any dependencies, configuration files or inputs for your R Markdown report document. If you can identify these things it is easy to convert your own R-based analyses to the soilReports format.

The .Rmd file allows R code and text with Markdown markup to be mingled in the same document and then “executed” like an R script.

3.13 Exercise 5: Run Lab Summary By Taxon Name Soil Report

For another exercise you can use the region11/lab_summary_by_taxonname report report to summarize laboratory data for a soil series. This report requires you to get some data from the Pedon “NCSS Lab” tables in NASIS.

3.13.1 Requirements

  • Data are properly populated, otherwise the report may fail. Common examples include:

    • Horizon depths don’t lineup
    • Either the Pedon or Site tables isn’t loaded
  • ODBC connection to NASIS is set up

  • Beware each report has a unique configuration file that may need to be edited.

3.13.2 Instructions

  1. Load your NASIS selected set. Run a query such as “POINT - Pedon/Site/NCSSlabdata by upedonid and Taxon Name” from the Region 11 report folder to load your selected set. Be sure to target both the site, pedon and lab layer tables. Remove from your selected set the pedons and sites you wish to exclude from the report.

  2. Copy the lab summary to your working directory.

copyReport(reportName = "region11/lab_summary_by_taxonname", outputDir = "C:/workspace2/lab_sum")
  1. Examine the report folder contents.

The report is titled report.Rmd. Notice there are several other support files. The parameters for the report are contained in the config.R file.

  1. Check or create a genhz_rules file for a soil series. In order to aggregate the pedons by horizon designation, a genhz_rules file (e.g., Miami_rules.R) is needed. See above.

If none exists see the following job aid on how to create one, Assigning Generalized Horizon Labels.

Pay special attention to how caret ^ and dollar $ symbols are used in REGEX. They function as anchors for the beginning and end of the string, respectively.

  • A ^ placed before an A horizon, ^A, will match any horizon designation that starts with A, such as Ap, Ap1, but not something merely containing A, such as BA.

  • Placing a $ after a Bt horizon, Bt$, will match any horizon designation that ends with Bt, such as 2Bt or 3Bt, but not something with a vertical subdivision, such as Bt2.

  • Wrapping pattern with both ^ and $ symbols will result only in exact matches – i.e. that start and end with the contents between ^ and $. For example ^[AC]$, will only match A or C, not Ap, Ap2, or Cg.

  1. Execute the report.

Command-line approach

# Set report parameters
series <- "Miami"
genhz_rules <- "C:/workspace2/lab_sum/Miami_rules.R"

# report file path
report_path <- "C:/workspace2/lab_sum/report.Rmd"

# Run the report
render(input = report_path, 
       output_dir = "C:/workspace2", 
       output_file = "C:/workspace2/lab_sum.html", 
       envir = new.env(), 
       params = list(series = series,
                     genhz_rules = genhz_rules
                     )
       )

Manual approach

Open the report.Rmd, hit the Knit drop down arrow, and select Knit with Parameters.

  1. Save the report. The report is automatically saved upon creation in the same folder as the R report. However, it is given the same generic name as the R report (i.e., “C:/workspace/lab_sum/report.html”), and will be overwritten the next time the report is run. Therefore, if you wish to save the report, rename the .html file to a name of your choosing and/or convert it to a PDF.

Also, beware when opening the .html file with Internet Explorer – be sure to click on “Allow blocked content” if prompted. Otherwise, Internet Explorer will alter the formatting of tables etc. within the document.

Sample pedon report

Brief summary of steps:

  • Load your selected set with the pedon and site table for an existing GHL file, or make your own (highly encouraged)
  • Run the lab_summary_by_taxonname.Rmd report on a soil series of your choice.
  • Show your work and submit the results to your mentor.

3.14 Exercise 6: Run Shiny Pedon Summary

The region2/shiny-pedon-summary report is an interactive Shiny-based report that uses flexdashboard to help the user subset and summarize NASIS pedons from a graphical interface.

  • You can try a ShinyApps.io demo here

The “Shiny Pedon Summary” allows one to rapidly generate reports from a large set of pedons in their NASIS selected set.

The left INPUT sidebar has numerous options for subsetting pedon data. Specifically, you can change REGEX patterns for mapunit symbol, taxon name, local phase, and User Pedon ID. Also, you can use the drop down boxes to filter on taxon kind or compare different “modal”/RV pedons.

Example: Analyzing the Loafercreek Taxadjuncts
Example: Analyzing the Loafercreek Taxadjuncts
  1. Create an instance of the region2/shiny-pedon-summary report with soilReports:
# create new instance of reports 
library(soilReports)

# set path for shiny-pedon-summary report instance
shinyped.path <- "C:/workspace2/chapter3/shiny-pedon"

# create directories (if needed)
if(!dir.exists(shinyped.path))
  dir.create(shinyped.path, recursive = TRUE)

# get report dependencies
reportSetup('region2/shiny-pedon-summary')

# copy report contents to target path
copyReport('region2/shiny-pedon-summary', outputDir = shinyped.path, overwrite = TRUE)
  1. Update the config.R file

You can update the config.R file in “C:/workspace2/chapter3/shiny-pedon” (or wherever you installed the report) to use the soilDB datasets loafercreek and gopheridge by setting demo_mode <- TRUE. This is the simplest way to demonstrate how this report works. Alternately, when demo_mode <- FALSE, pedons will be loaded from your NASIS selected set.

config.R also allows you to specify a shapefile for overlaying the points on – to determine mapunit symbol – as well as several raster data sources whose values will be extracted at point locations and summarized. The demo dataset does not use either of these by default, due to large file sizes.

Furthermore, a default (very general) set of REGEX generalized horizon patterns is provided to assign generalized horizon labels for provisional grouping. These provided patterns are unlikely to cover ALL cases, and always need to be modified for final correlation. That said, they do a decent job of making a first-pass correlation for diverse types of soils.

The default config.R settings use the general patterns: use_regex_ghz <- TRUE. You are welcome to modify the defaults. If you want to use the values you have populated in NASIS Pedon Horizon Component Layer ID, set use_regex_ghz <- FALSE.

  1. Run the report via shiny.Rmd

This report uses the Shiny flexdashboard interface. Open up shiny.Rmd and click the “Run Document” button to start the report. This will load the pedon and spatial data specified in config.R.

NOTE: If a Viewer Window does not pop-up right away, click the gear icon to the right of the “Run Document” button. Be sure the option “Preview in Window” is checked, then click “Run Document” again.

  1. All of the subset parameters are in the left-hand sidebar. Play around with all of these options – the graphs and plots in the tabs to the right will automatically update as you make changes.

  2. When you like what you have, you can export a non-interactive HTML file for your records. To do this, first, set the “Report name:” box to something informative – this will be your report output file name. Then, scroll down to the bottom of the INPUT sidebar and click “Export Report” button. Check the “output” folder (subdirectory of where you installed the report) for your results.

3.15 Exercise 7: Run Mapunit Comparison

Another popular report in soilReports is the region2/mu-comparison report.

This report uses constant density sampling (sharpshootR::constantDensitySampling()) to extract numeric and categorical values from multiple raster data sources that overlap a set of user-supplied polygons.

In this example, we clip a small portion of SSURGO polygons from the CA630 soil survey area extent. We then select a small set of mapunit symbols (5012, 3046, 7083, 7085, 7088) that occur within the clipping extent. These mapunits have soil forming factors we expect to contrast with one another in several ways. You can inspect other mapunit symbols by changing mu.set in config.R.

  1. Download the demo data:
# set up ch4 path and path for report
mucomp.path <- "C:/workspace2/chapter3/mucomp"

# create any directories that may be missing
if(!dir.exists(mucomp.path)) {
  dir.create(mucomp.path, recursive = TRUE)
}

mucomp.zip <- file.path(mucomp.path, 'mucomp-data.zip')

# download raster data, SSURGO clip from CA630, and sample script for clipping your own raster data 
download.file('https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/chapter_3-mucomp-data/mucomp-data.zip', mucomp.zip)

unzip(mucomp.zip, exdir = mucomp.path, overwrite = TRUE)
  1. Create an instance of the region2/mu-comparison report with soilReports:
# create new instance of reports 
library(soilReports)

# get report dependencies
reportSetup('region2/mu-comparison')

# create report instance
copyReport('region2/mu-comparison', outputDir = mucomp.path, overwrite = TRUE)

If you want, you can now set up the default config.R that is created by copyReport() to work with your own data. OR you can use the “sample” config.R file (called new_config.R) in the ZIP file downloaded above.

  1. Run this code to replace the default config.R with the sample data config.R:
# copy config file containing relative paths to rasters downloaded above
file.copy(from = file.path(mucomp.path, "new_config.R"), 
          to = file.path(mucomp.path, "config.R"), 
          overwrite = TRUE)
  1. Open report.Rmd in the C:/workspace2/chapter3/mucomp folder and click the “Knit” button at the top of the RStudio source pane to run the report.

  2. Inspect the report output HTML file, as well as the spatial and tabular data output in the output folder.

Question: What differences can you see between the five mapunit symbols that were analyzed?

3.16 Additional Reading (Exploratory Data Analysis)

Healy, K., 2018. Data Visualization: a practical introduction. Princeton University Press. http://socviz.co/

Helsel, D.R., Hirsch, R.M., Ryberg, K.R., Archfield, S.A., and Gilroy, E.J., 2020, Statistical methods in water resources: U.S. Geological Survey Techniques and Methods, book 4, chap. A3, 458 p., https://doi.org/10.3133/tm4a3. [Supersedes USGS Techniques of Water-Resources Investigations, book 4, chap. A3, version 1.1.]

Kabacoff, R.I., 2018. Data Visualization in R. https://rkabacoff.github.io/datavis/

Peng, R. D., 2016. Exploratory Data Analysis with R. Leanpub. https://bookdown.org/rdpeng/exdata/

Wilke, C.O., 2019. Fundamentals of Data Visualization. O’Reily Media, Inc. https://clauswilke.com/dataviz/

References

Boyd, Claude E., Craig S. Tucker, and Rawee Viriyatum. 2011. “Interpretation of pH, Acidity, and Alkalinity in Aquaculture and Fisheries.” North American Journal of Aquaculture 73 (4): 403–8. https://doi.org/10.1080/15222055.2011.620861.
Tukey, John Wilder. 1977. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co. https://archive.org/details/exploratorydataa0000tuke_7616.
Webster, R. 2001. “Statistics to Support Soil Research and Their Presentation.” European Journal of Soil Science 52 (2): 331–40. https://doi.org/10.1046/j.1365-2389.2001.00383.x.