Chapter 2 The Data We Use

2.1 Objectives (The Data We Use)

Expand on basic R skills from Chapter 1
- Inspect and work with different data types
- Perform operations on data such as filtering and aggregation
- Begin to explore regular expression (regex) patterns for text data
Work with Soil Data Sources and Structures
- Use the soilDB package to load data into R
- Understand the SoilProfileCollection (SPC) object
- Learn about the data checks in the fetchNASIS() function

2.2 The Structure of Soil Data

What if you could extract, organize, and visualize data from NASIS and many other commonly used soil database sources with a couple of lines of code?

The aqp (Algorithms for Quantitative Pedology) and soilDB packages enable data to be fetched from various sources and cast into a SoilProfileCollection (SPC) object. Tabular and spatial data objects fetched via soilDB and processed using aqp methods can simplify the process of working with commonly used soil data.

2.2.1 Package References

We load aqp and soilDB packages using the library() command.

# load the libraries
library(aqp)
library(soilDB)

The manual pages for soilDB and aqp are accessible online and from the Help tab in RStudio.

2.2.2 Importance of Pedon Data

The importance of pedon data for present and future work cannot be overstated. These data represent decades of on-the-ground observations of the soil resource for a given area.

As difficult as it may be to take the time to enter legacy pedon data, it is vitally important that we capture this resource and get these data into NASIS as an archive of point observations.

2.2.3 Some Issues With Pedon Data

Making and documenting observations of soil requires hard work. Digging is difficult, and writing soil descriptions is time consuming!
Our confidence in observations typically weakens with the depth of the material described.
- If we acknowledge this, which we must, then how do we deal with it in pedon data?
  - Use a cutoff depth, for example 100 cm, can be used to truncate observations to a zone of greater confidence.
  - Show the relative confidence of the data with depth.

2.3 Challenges with Pedon Data

Consistency
- Missing data
Confidence in the observations
- Uncertainty with depth
Description style differences
- Depth described, horizonation usage styles
Legacy data vintage
- Decadal span of data
- Taxonomy updates, horizon nomenclature changes
Location confidence
- Origin of the location information
- Datum used for data collection
- Accuracy for GPS values at the time of data collection

2.4 The SoilProfileCollection

The SoilProfileCollection class (SPC) provided by the aqp package is a specialized structure for soil data analysis. It simplifies the process of working with collections of data associated with soil profiles, e.g., site-level, horizon-level, spatial, diagnostic horizons, and other metadata.

A SoilProfileCollection is similar to the NASIS Site/Pedon “object” in that it provides generalizations, specific routines and rules about data tables and their relationships.

The SoilProfileCollection is an S4 R object. S4 objects have slots. Of primary importance, are the slots for site-level and horizon-level data.

In many ways the SPC is more adaptable than the NASIS “Pedon” concept because it is more general. However, the SPC is not as expressive as the complex hierarchy of objects in NASIS, which are more aligned with data archival vs. analysis.

2.4.1 SoilProfileCollection methods

Many “familiar” methods are defined for the SoilProfileCollection object. Some are unique, and others operate like more common functions of vector and data.frame objects, such as nrow() (“how many horizons?”) or length() (“how many sites/pedons?”).

Perhaps most importantly, when you access the site data (with site(<object>)) or the horizon data (with horizons(<object>)) of a SoilProfileCollection, you get a data.frame object that you can use like any other you might use or make in R.

2.4.1.1 Promoting a `data.frame` to SoilProfileCollection

The SoilProfileCollection object is a collection of 1-dimensional profile descriptions, of the type conventionally described on a Form 232, or of tabular data returned from laboratory.

The object is “horizon data forward” in that you start with the layers, and can add or create site-level attributes by normalization, joins, and calculation.

Most of the time if you are using your NASIS data, or an official database, there are defined ways of getting the data “into” an SPC. For example, fetchOSD returns a SoilProfileCollection that has been assembled from horizon and site level attributes gleaned from the OSDs text, Soil Classification database, and other sources.

In the pre-course, we had you set up a process so you could connect to your local NASIS instance to “fetch” data and have methods like fetchNASIS put things together for you.

This input to make a SoilProfileCollection can be represented most simply as a data.frame with unique site or profile ID and depth combinations for each horizon or layer–for example, a subset of the phorizon or chorizon table in NASIS.

A simple demonstration of “tabular horizon data” is the sp4 data set bundled with aqp: some serpentine soil profiles stored in a data.frame in the aqp package (after McGahan et al., 2009).

library(aqp)

# Load sample serpentine soil data (McGahan et al., 2009)
data(sp4, package = "aqp")

# this is a data.frame
# same as if loaded from CSV file etc.
class(sp4)

## [1] "data.frame"

# inspect the first couple of rows
head(sp4)

##       id name top bottom   K   Mg  Ca CEC_7 ex_Ca_to_Mg sand silt clay   CF
## 1 colusa    A   0      3 0.3 25.7 9.0  23.0        0.35   46   33   21 0.12
## 2 colusa  ABt   3      8 0.2 23.7 5.6  21.4        0.23   42   31   27 0.27
## 3 colusa  Bt1   8     30 0.1 23.2 1.9  23.7        0.08   40   28   32 0.27
## 4 colusa  Bt2  30     42 0.1 44.3 0.3  43.0        0.01   27   18   55 0.16
## 5  glenn    A   0      9 0.2 21.9 4.4  18.8        0.20   54   20   25 0.55
## 6  glenn   Bt   9     34 0.3 18.9 4.5  27.5        0.20   49   18   34 0.84

To convert this horizon data into a SoilProfileCollection, we need to identify three parameters: idname, top, and bottom. These parameters refer to the columns of unique profile IDs, top depths and bottom depths, respectively.

There are a couple of important constraints and considerations:

records (rows) represent horizons
profiles are uniquely identified by a column (user pedon ID, pedon record ID, etc.)
profiles IDs cannot contain missing values (NA)
horizon top and bottom depths are identified by column names
ideally there are no gaps, overlap, or missing top/bottom depths (more on that later)

Use a formula to specify column names in the data.frame, in this case "id", "top" and "bottom".

#      profile ID ~ top depth + bottom depth
depths(sp4) <- id ~ top + bottom

# note new class
class(sp4)

## [1] "SoilProfileCollection"
## attr(,"package")
## [1] "aqp"

# compact summary
sp4

## SoilProfileCollection with 10 profiles and 30 horizons
## profile ID: id  |  horizon ID: hzID 
## Depth range: 16 - 49 cm
## 
## ----- Horizons (6 / 30 rows  |  10 / 14 columns) -----
##      id hzID top bottom name   K   Mg  Ca CEC_7 ex_Ca_to_Mg
##  colusa    1   0      3    A 0.3 25.7 9.0  23.0        0.35
##  colusa    2   3      8  ABt 0.2 23.7 5.6  21.4        0.23
##  colusa    3   8     30  Bt1 0.1 23.2 1.9  23.7        0.08
##  colusa    4  30     42  Bt2 0.1 44.3 0.3  43.0        0.01
##   glenn    5   0      9    A 0.2 21.9 4.4  18.8        0.20
##   glenn    6   9     34   Bt 0.3 18.9 4.5  27.5        0.20
## [... more horizons ...]
## 
## ----- Sites (6 / 10 rows  |  1 / 1 columns) -----
##         id
##     colusa
##      glenn
##      kings
##   mariposa
##  mendocino
##       napa
## [... more sites ...]
## 
## Spatial Data:
## [EMPTY]

The formula expresses the idea that a profile id defined by set of top and bottom depths.

NOTE: A character vector with same names has the same effect, and can be easier to “program” with than the formula-based syntax.

depths(sp4) <- c("id", "top", "bottom")

2.4.2 Promoting to “Spatial” SoilProfileCollection

You can also use the SoilProfileCollection to manage the information about a profile’s position on the Earth.

Chapter 4 will cover spatial data in greater detail, and the SoilProfileCollection Reference has a section on Spatial Data. For now know you can use the initSpatial<- method to define “X” and “Y” coordinate columns and the coordinate reference system in one line:

sp4$x <- runif(10)
sp4$y <- runif(10) # dummy XY coordinates

initSpatial(sp4, crs = "OGC:CRS84") <- ~ x + y

This is new syntax introduced in aqp 2.0, the older syntax uses the coordinates<- and proj4string<- methods.

2.4.3 Extracting Site and Horizon Data

You can extract values from the collection’s @site and @horizon slots using the site() and horizons() functions. These create data.frame objects that are separate from the SoilProfileCollection.

# extract site data from SPC into new data.frame 's'
# note that it only contains profile IDs
s <- site(sp4)
str(s)

## 'data.frame':    10 obs. of  1 variable:
##  $ id: chr  "colusa" "glenn" "kings" "mariposa" ...

# extract horizon data from SPC into new data.frame 'h'
h <- horizons(sp4)
str(h)

## 'data.frame':    30 obs. of  14 variables:
##  $ id         : chr  "colusa" "colusa" "colusa" "colusa" ...
##  $ name       : chr  "A" "ABt" "Bt1" "Bt2" ...
##  $ top        : int  0 3 8 30 0 9 0 4 13 0 ...
##  $ bottom     : int  3 8 30 42 9 34 4 13 40 3 ...
##  $ K          : num  0.3 0.2 0.1 0.1 0.2 0.3 0.2 0.6 0.8 0.6 ...
##  $ Mg         : num  25.7 23.7 23.2 44.3 21.9 18.9 12.1 12.1 17.7 28.3 ...
##  $ Ca         : num  9 5.6 1.9 0.3 4.4 4.5 1.4 7 4.4 5.8 ...
##  $ CEC_7      : num  23 21.4 23.7 43 18.8 27.5 23.7 18 20 29.3 ...
##  $ ex_Ca_to_Mg: num  0.35 0.23 0.08 0.01 0.2 0.2 0.58 0.51 0.25 0.2 ...
##  $ sand       : int  46 42 40 27 54 49 43 36 27 42 ...
##  $ silt       : int  33 31 28 18 20 18 55 49 45 26 ...
##  $ clay       : int  21 27 32 55 25 34 3 15 27 32 ...
##  $ CF         : num  0.12 0.27 0.27 0.16 0.55 0.84 0.5 0.75 0.67 0.25 ...
##  $ hzID       : chr  "1" "2" "3" "4" ...

2.4.4 Methods like `data.frame`

The base R functions for accessing and setting data.frame columns by name such as $ and [[ work for SoilProfileCollection objects, too. Review data.frame methods:

[[ and $: single columns in data.frame, by name
- x[['variable']]
- x$variable
[: combinations of rows and columns, by name or index
- x[i, ]: specified rows, all columns
- x[, j]: all rows, specified columns
- x[i, j]: specified rows, specified columns

See Chapter 1 and the Chapter 2 Appendix for additional details and examples.

2.4.4.1 Column Access by Name: `$` and `[[`

# sp4 is a SoilProfileCollection
sp4$clay

##  [1] 21 27 32 55 25 34  3 15 27 32 25 31 33 13 21 23 15 17 12 19 14 14 22 25 40 51 67 24 25 32

sp4[['clay']]

##  [1] 21 27 32 55 25 34  3 15 27 32 25 31 33 13 21 23 15 17 12 19 14 14 22 25 40 51 67 24 25 32

# horizon data.frame
h$clay

##  [1] 21 27 32 55 25 34  3 15 27 32 25 31 33 13 21 23 15 17 12 19 14 14 22 25 40 51 67 24 25 32

h[['clay']]

##  [1] 21 27 32 55 25 34  3 15 27 32 25 31 33 13 21 23 15 17 12 19 14 14 22 25 40 51 67 24 25 32

# use $<- or [[<- to set proportional clay content
sp4$clay <- sp4[['clay']] / 100

# undo what we did above; back to percentage
sp4[['clay']] <- sp4$clay * 100

# create new site variable ("numberone" recycled for all sites)
site(sp4)$newvar1 <- "numberone"

# create new horizon variable ("numbertwo" recycled for all horizons)
horizons(sp4)$newvar2 <- "numbertwo"

2.4.4.2 Row Access: `[`

The SoilProfileCollection also has [ (“single bracket”), but with a different interpretation from the [i, j] indexing of data.frame objects.

In a data.frame you have object[row, column, drop=TRUE]; the result is a data.frame (or a vector with default drop=TRUE).
In a SoilProfileCollection you have object[site, horizon]; the result is a SoilProfileCollection.

# i-index: first 2 profiles, all horizons
sp4[1:2, ]

## SoilProfileCollection with 2 profiles and 6 horizons
## profile ID: id  |  horizon ID: hzID 
## Depth range: 34 - 42 cm
## 
## ----- Horizons (6 / 6 rows  |  10 / 15 columns) -----
##      id hzID top bottom name   K   Mg  Ca CEC_7 ex_Ca_to_Mg
##  colusa    1   0      3    A 0.3 25.7 9.0  23.0        0.35
##  colusa    2   3      8  ABt 0.2 23.7 5.6  21.4        0.23
##  colusa    3   8     30  Bt1 0.1 23.2 1.9  23.7        0.08
##  colusa    4  30     42  Bt2 0.1 44.3 0.3  43.0        0.01
##   glenn    5   0      9    A 0.2 21.9 4.4  18.8        0.20
##   glenn    6   9     34   Bt 0.3 18.9 4.5  27.5        0.20
## 
## ----- Sites (2 / 2 rows  |  2 / 2 columns) -----
##      id   newvar1
##  colusa numberone
##   glenn numberone
## 
## Spatial Data:
## [EMPTY]

# j-index: all profiles; first 2 horizons of each profile
sp4[, 1:2]

## SoilProfileCollection with 10 profiles and 20 horizons
## profile ID: id  |  horizon ID: hzID 
## Depth range: 5 - 40 cm
## 
## ----- Horizons (6 / 20 rows  |  10 / 15 columns) -----
##      id hzID top bottom name   K   Mg  Ca CEC_7 ex_Ca_to_Mg
##  colusa    1   0      3    A 0.3 25.7 9.0  23.0        0.35
##  colusa    2   3      8  ABt 0.2 23.7 5.6  21.4        0.23
##   glenn    5   0      9    A 0.2 21.9 4.4  18.8        0.20
##   glenn    6   9     34   Bt 0.3 18.9 4.5  27.5        0.20
##   kings    7   0      4    A 0.2 12.1 1.4  23.7        0.58
##   kings    8   4     13  Bt1 0.6 12.1 7.0  18.0        0.51
## [... more horizons ...]
## 
## ----- Sites (6 / 10 rows  |  2 / 2 columns) -----
##         id   newvar1
##     colusa numberone
##      glenn numberone
##      kings numberone
##   mariposa numberone
##  mendocino numberone
##       napa numberone
## [... more sites ...]
## 
## Spatial Data:
## [EMPTY]

When you use the [ function, everything in the SoilProfileCollection is subset simultaneously depending on the constraints specified by the indices.

# First profile, first 2 horizons
horizons(sp4[1, 1:2])

##       id name top bottom   K   Mg  Ca CEC_7 ex_Ca_to_Mg sand silt clay   CF hzID   newvar2
## 1 colusa    A   0      3 0.3 25.7 9.0  23.0        0.35   46   33   21 0.12    1 numbertwo
## 2 colusa  ABt   3      8 0.2 23.7 5.6  21.4        0.23   42   31   27 0.27    2 numbertwo

All slots in the collection have a relationship to the site or i-index. When you remove sites (profiles), all associated records (e.g. spatial, diagnostics, horizons, etc.) in the object are removed.

Similarly, when all horizons are removed (say, you request the 6th j-index from a profile that has only 5 layers), the site index and all associated data are removed from the collection.

Exercise 1: Assemble a SoilProfileCollection from several CSV files

Save the code you use in an R script, add answers as comments, and send to your mentor.

Link to exercise R code

Questions:

Run the code in the linked R file and answer these questions.

How many profiles (sites) and horizons are in the granite SoilProfileCollection? How many in andesite? See length() and nrow() functions for SoilProfileCollection objects.
Which profile, and which horizon in that profile, has the highest ratio of oxalate-extractable Fe to dithionite-citrate-extractable Fe (Fe_o_to_Fe_d)?

2.5 Using the `soilDB` Package

The soilDB package for R provides functions for accessing data stored in NASIS, KSSL, SDA, SoilWeb, SoilGrids, and other sources.

These high-level ‘fetch’ functions bundle or wrap lower-level ‘get’ functions which access internal database interfaces to NASIS and other data sources. The ODBC connection to NASIS that you set up during the pre-course is an example of an internal database interface.

Basic data checks are run within ‘fetch’ functions. These checks ensure the basic integrity of the data as it is queried and moved from its existing structure into an SPC. There are times when it is useful to use the lower-level get functions individually. They generally return single data.frame or list of data.frame.

You can set up scripts to make custom queries against these or other sources on your own. There is an example at the end of this chapter.

For now we will start with the ‘fetch’ functions and others that will get you a large variety of data you can use for soil and ecological site analyses. There are many more functions available in soilDB, see https://ncss-tech.github.io/soilDB/reference/index.html for details.

2.5.1 soilDB functions for tabular data

soilDB functions are the quickest way to get up and running:

fetchNASIS()
- Gets and re-packages data from a local NASIS database.
  - soilDB Vignette Columns in fetchNASIS(from="pedons")
fetchVegdata()
- Gets Vegetation Plot and related/child tables into a list from a local NASIS database.
fetchNASISLabData()
- Gets KSSL laboratory pedon/horizon layer data from a local NASIS database.
fetchNASISWebReport()
SDA_query()
- Can be used to access SSURGO, STATSGO (spatial and tabular), and Lab DataMart snapshots
- Submits queries to the Soil Data Access system.
fetchLDM()
- Gets KSSL data from the Lab Data Mart snapshot in Soil Data Access
fetchSDA()
- Fetches legend/mapunit/component/horizon data from Soil Data Access.
fetchKSSL()
- Gets KSSL data from the SoilWeb system via BBOX, MLRA, or series name query.
  - KSSL Data Demo
  - Water Retention Curve Development from KSSL Data
fetchOSD()
- Fetches a limited subset of horizon- and site-level attributes for named soil series from the SoilWeb system.
  - Querying Soil Series Data
OSDquery()
- Full-text searching of OSD sections.
  - Querying Soil Series Data
fetchSCAN()
- Queries soil and climate data from USDA-NRCS SCAN Stations.
  - A Unified Interface to SCAN/SNOTEL Data
fetchHenry()
- Downloads data from the Henry Mount Soil Climate Database.
  - Henry Mount Soil Climate Database Tutorial
fetchPedonPC()
- Fetches commonly used site and horizon data from a PedonPC (MS Access) database.

2.5.2 Open Database Connectivity (ODBC) Connection to NASIS

After setting up an ODBC connection, as you did as part of the pre-course, you can use R to access data from a selected set defined in your local NASIS database.

How to Create an ODBC Connection to local NASIS database for R.

Does NASIS need to be open and running to query data using soilDB?

No, fetchNASIS() works whether the NASIS application is running or not. You just need to make sure that the data you want has been loaded into your selected set.

2.5.3 `fetchNASIS()`

The fetchNASIS() convenience function extracts data from a NASIS selected set via Structured Query Language (SQL).

Note that the import process in fetchNASIS(), and the other methods, is not comprehensive. It does not pull every column for every table related to pedon data out of NASIS.

Instead, it pulls essential / commonly used pedon and horizon data. Higher level functions like fetchNASIS() bundle a series of lower-level queries to get specific parts of the Pedon or Component data structures. Much of the nested complexity of NASIS is simplified in the resulting object.

Many-to-one relationships are “flattened” where possible by fetchNASIS(). This aggregates the data from various tables into one “site” record with related horizon records, per profile.

You can see the child tables that are aggregated using the get_extended_data_from_NASIS() method, which returns a named list of child table sources that can be joined to the SoilProfileCollection made with fetchNASIS() using the internal record IDs.

2.5.3.1 fetchNASIS arguments

fetchNASIS() has a number of different arguments:

from = ‘pedons’ or ‘components’ or ‘pedon_report’
- This option allows you to select which data you want to load from NASIS. Choosing either ‘pedons’ or ‘components’ will load data from your local database. If ‘pedon_report’ is specified then it will load data from the text file generated by the NASIS report ‘fetchNASIS’ (run offline). This is useful for loading more than 20,000 pedons at one time, such for an entire Soil Survey Region.
url = string specifying the (temporary) URL for the NASIS pedon_report output generated by the fetchNASIS NASIS Report that can be run “Offline Against National Database” EXAMPLE OUTPUT (MT663)
SS = TRUE/FALSE
- The Selected Set (SS) option allows you to choose whether you want the data to load from your current selected set in NASIS or from the local database tables. The default is set to TRUE so if unspecified fetchNASIS() will always load from the data in the selected set.
stringAsFactors = NULL
- This option is no longer used. See soilDB::NASISDomainsAsFactor()
rmHzErrors = TRUE/FALSE
- Setting this value to TRUE removes pedons that do not pass checks for horizon depth consistency. Default: FALSE
nullFragsAreZero = TRUE/FALSE
- Setting this value to TRUE (the default) converts null entries for rock fragment volumes to 0. This is typically the right assumption because rock fragment data are typically populated only when observed. If you know that your data contain a combination of omitted information (e.g. no rock fragment volumes are populated) then consider setting this argument to FALSE.
soilColorState = ‘moist’ or ‘dry’
- Select dry or moist colors to be converted and placed into a horizon-level attribute called soil_color. The default is set to ‘moist’ unless specified. Moist and dry colors are also stored in moist_soil_color and dry_soil_color.
lab = TRUE/FALSE
- This option allows for loading the data associated with horizons that may be in the phlabresults table. The default is set to FALSE, which will not load records from the phlabresults table.
fill = TRUE/FALSE
- This option adds horizons to pedons or components that lack horizon data in the database, preserving their “position” within the output SoilProfileCollection object. Default is FALSE such that profiles without horizons are omitted from the result.
dropAdditional = TRUE/FALSE
- Used only for from='components' with duplicates = TRUE. Prevent “duplication” of mustatus == "additional" mapunits? Default: TRUE
dropNonRepresentative = TRUE/FALSE
- Used only for from='components' with duplicates = TRUE. Prevent “duplication” of non-representative data mapunits? Default: TRUE
duplicates = FALSE
- Used only for from='components'. Duplicate components for all instances of use (i.e. one for each legend data mapunit is used on; optionally for additional mapunits, and/or non-representative data mapunits?). This will include columns from get_component_correlation_data_from_NASIS_db() that identify which legend(s) each component is used on. NOTE: This requires that parent tables above “component” (such as legend, mapunit, correlation and datamapunit) are loaded in your NASIS database/selected set.
dsn = NULL
- Optional: custom path to an SQLite snapshot or DBIConnection object to database with NASIS schema. See soilDB::createStaticNASIS().

2.5.4 The `gopheridge` Dataset

The gopheridge sample data set is a sample R object returned from fetchNASIS() in a self-contained .rda file stored in soilDB.

Open RStudio, and set up the environment by loading packages and the gopheridge sample dataset.

library(aqp)
library(soilDB)

# load example dataset
data(gopheridge, package = "soilDB")

# what kind of object is this?
class(gopheridge)

## [1] "SoilProfileCollection"
## attr(,"package")
## [1] "aqp"

# what does the internal structure look like?
str(gopheridge, 2)

## Formal class 'SoilProfileCollection' [package "aqp"] with 8 slots
##   ..@ idcol       : chr "peiid"
##   ..@ hzidcol     : chr "phiid"
##   ..@ depthcols   : chr [1:2] "hzdept" "hzdepb"
##   ..@ metadata    :List of 6
##   ..@ horizons    :'data.frame': 317 obs. of  76 variables:
##   ..@ site        :'data.frame': 52 obs. of  172 variables:
##   ..@ diagnostic  :'data.frame': 164 obs. of  4 variables:
##   ..@ restrictions:'data.frame': 56 obs. of  8 variables:

There are many columns in the data.frame objects within a fetchNASIS() SoilProfileCollection.

The following guide is part of the soilDB documentation and describes the columns that are returned from a fetchNASIS(from="pedons") call. In the future this guide will be extended to describe component data.

fetchNASIS(from=“pedons”) column description guide

With siteNames() and horizonNames() we can view the names for eachslot respectively in the gopheridge object.

# the fields at the site and horizon levels within the SPC
siteNames(gopheridge)

##   [1] "peiid"                                    "siteiid"                                 
##   [3] "ecositeid"                                "ecositenm"                               
##   [5] "ecositecorrdate"                          "es_classifier"                           
##   [7] "siteecositehistory.classifier"            "es_selection_method"                     
##   [9] "upedonid"                                 "siteobsiid"                              
##  [11] "usiteid"                                  "site_id"                                 
##  [13] "pedon_id"                                 "obsdate"                                 
##  [15] "obs_date"                                 "utmzone"                                 
##  [17] "utmeasting"                               "utmnorthing"                             
##  [19] "horizdatnm"                               "longstddecimaldegrees"                   
##  [21] "latstddecimaldegrees"                     "gpspositionalerror"                      
##  [23] "describer"                                "descname"                                
##  [25] "pedonpurpose"                             "pedontype"                               
##  [27] "pedlabsampnum"                            "labdatadescflag"                         
##  [29] "tsectstopnum"                             "tsectinterval"                           
##  [31] "utransectid"                              "tsectkind"                               
##  [33] "tsectselmeth"                             "erocl"                                   
##  [35] "elev_field"                               "slope_field"                             
##  [37] "aspect_field"                             "elev"                                    
##  [39] "slope"                                    "aspect"                                  
##  [41] "ecostatename"                             "ecostateid"                              
##  [43] "commphasename"                            "commphaseid"                             
##  [45] "plantassocnm"                             "earthcovkind1"                           
##  [47] "earthcovkind2"                            "bedrckdepth"                             
##  [49] "bedrckkind"                               "bedrckhardness"                          
##  [51] "pmgroupname"                              "hillslopeprof"                           
##  [53] "geomslopeseg"                             "shapeacross"                             
##  [55] "shapedown"                                "slopecomplex"                            
##  [57] "drainagecl"                               "geomposhill"                             
##  [59] "geomposmntn"                              "geompostrce"                             
##  [61] "geomposflats"                             "swaterdepth"                             
##  [63] "flodfreqcl"                               "floddurcl"                               
##  [65] "flodmonthbeg"                             "pondfreqcl"                              
##  [67] "ponddurcl"                                "pondmonthbeg"                            
##  [69] "climstaid"                                "climstanm"                               
##  [71] "climstatype"                              "ffd"                                     
##  [73] "map"                                      "reannualprecip"                          
##  [75] "airtempa"                                 "soiltempa"                               
##  [77] "airtemps"                                 "soiltemps"                               
##  [79] "airtempw"                                 "soiltempw"                               
##  [81] "surface_fine_gravel"                      "surface_gravel"                          
##  [83] "surface_cobbles"                          "surface_stones"                          
##  [85] "surface_boulders"                         "surface_channers"                        
##  [87] "surface_flagstones"                       "surface_parafine_gravel"                 
##  [89] "surface_paragravel"                       "surface_paracobbles"                     
##  [91] "surface_parastones"                       "surface_paraboulders"                    
##  [93] "surface_parachanners"                     "surface_paraflagstones"                  
##  [95] "surface_unspecified"                      "surface_total_frags_pct_nopf"            
##  [97] "surface_total_frags_pct"                  "othervegid"                              
##  [99] "othervegclass"                            "site_state"                              
## [101] "site_county"                              "site_mlra"                               
## [103] "slope_shape"                              "surface_fgravel"                         
## [105] "classdate"                                "classifier"                              
## [107] "classtype"                                "taxonname"                               
## [109] "localphase"                               "taxonkind"                               
## [111] "seriesstatus"                             "taxclname"                               
## [113] "taxpartsize"                              "taxorder"                                
## [115] "taxsuborder"                              "taxgrtgroup"                             
## [117] "taxsubgrp"                                "soiltaxedition"                          
## [119] "osdtypelocflag"                           "taxmoistcl"                              
## [121] "taxtempregime"                            "taxfamother"                             
## [123] "taxreaction"                              "taxfamhahatmatcl"                        
## [125] "psctopdepth"                              "pscbotdepth"                             
## [127] "selection_method"                         "ochric.epipedon"                         
## [129] "argillic.horizon"                         "paralithic.materials"                    
## [131] "lithic.contact"                           "paralithic.contact"                      
## [133] "abrupt.textural.change"                   "duripan"                                 
## [135] "cambic.horizon"                           "andic.soil.properties"                   
## [137] "umbric.epipedon"                          "aquic.conditions"                        
## [139] "slickensides"                             "mollic.epipedon"                         
## [141] "mottles.with.chroma.2.or.less"            "calcic.horizon"                          
## [143] "anthropic.epipedon"                       "kandic.horizon"                          
## [145] "densic.contact"                           "densic.materials"                        
## [147] "lithologic.discontinuity"                 "redox.depletions.with.chroma.2.or.less"  
## [149] "redox.concentrations"                     "reduced.matrix"                          
## [151] "histic.epipedon"                          "albic.horizon"                           
## [153] "spodic.horizon"                           "fibric.soil.materials"                   
## [155] "hemic.soil.materials"                     "sapric.soil.materials"                   
## [157] "volcanic.glass"                           "folistic.epipedon"                       
## [159] "strongly.contrasting.particle.size.class" "episaturation"                           
## [161] "endosaturation"                           "secondary.carbonates"                    
## [163] "human.transported.material"               "glossic.horizon"                         
## [165] "spodic.materials"                         "albic.materials"                         
## [167] "landform_string"                          "landscape_string"                        
## [169] "microfeature_string"                      "geomicrorelief_string"                   
## [171] "pmkind"                                   "pmorigin"

horizonNames(gopheridge)

##  [1] "phiid"                "peiid"                "hzname"               "dspcomplayerid"      
##  [5] "hzdept"               "hzdepb"               "bounddistinct"        "boundtopo"           
##  [9] "claytotest"           "silttotest"           "sandtotest"           "clay"                
## [13] "silt"                 "sand"                 "fragvoltot"           "texture"             
## [17] "texcl"                "lieutex"              "phfield"              "effclass"            
## [21] "labsampnum"           "rupresblkdry"         "rupresblkmst"         "rupresblkcem"        
## [25] "stickiness"           "plasticity"           "ksatpedon"            "texture_class"       
## [29] "hzID"                 "d_hue"                "d_value"              "d_chroma"            
## [33] "dry_soil_color"       "d_r"                  "d_g"                  "d_b"                 
## [37] "d_sigma"              "m_hue"                "m_value"              "m_chroma"            
## [41] "moist_soil_color"     "m_r"                  "m_g"                  "m_b"                 
## [45] "m_sigma"              "soil_color"           "fine_gravel"          "gravel"              
## [49] "cobbles"              "stones"               "boulders"             "channers"            
## [53] "flagstones"           "parafine_gravel"      "paragravel"           "paracobbles"         
## [57] "parastones"           "paraboulders"         "parachanners"         "paraflagstones"      
## [61] "unspecified"          "total_frags_pct_nopf" "total_frags_pct"      "art_fgr"             
## [65] "art_gr"               "art_cb"               "art_st"               "art_by"              
## [69] "art_ch"               "art_fl"               "art_unspecified"      "total_art_pct"       
## [73] "huartvol_cohesive"    "huartvol_penetrable"  "huartvol_innocuous"   "huartvol_persistent"

2.5.4.1 Make profile sketches

The plotSPC() or plot() function generates sketches of a SoilProfileCollection object based on horizon depths, designations, and colors. The SoilProfileCollection Reference vignette contains many examples demonstrating way in which these sketches can be customized. The Soil Profile Sketches tutorial contains additional examples that demonstrate ways to customize soil profile sketches.

See ?plotSPC for a detailed list of arguments and examples.

The fetchNASIS() function automatically converts moist Munsell colors into R-style colors, available in the soil_color horizon level attribute. This color is used by default in plotSPC()

par(mar = c(1, 1, 1, 1))

# omitting pedon IDs and horizon designations
plotSPC(gopheridge,
        print.id = FALSE,
        name = NA,
        width = 0.3)
title('Pedons from the `gopheridge` sample dataset', line = -0.5)

Additional examples / documentation related to soil profile sketches:

2.5.4.2 Pedon Data Checks

When you load pedons using the fetchNASIS() function, the following data checks are performed:

Inconsistent horizon boundaries. Pedons with inconsistent horizon boundaries are not loaded when rmHzErrors = TRUE. In most cases, this occurs when the bottom depth of a horizon is not the same as the upper depth of the next lower horizon.

##   hzname top bot
## 1      A   0  30
## 2    Bt1  38  56
## 3    Bt2  56 121
## 4     Bk 121 135
## 5      R 135  NA

Note the issue above. The bottom depth of the A horizon and the upper depth of the Bt1 horizon should be the same: either 30 or 38 cm. The correct depth needs to be determined and fixed in the database.

Missing lower horizon depths. Offending horizons are fixed by replacing the missing bottom depth with the top depth plus 1 cm. In the case of the profile shown above, a bottom depth of 137 cm would be inserted where the depth is missing.
Sites missing pedon records. Data without corresponding horizons are not loaded.

2.5.4.3 Find Data with Errors

If errors in the pedon data are detected when loading data using fetchNASIS(), the following “get” commands can trace them back to the corresponding records in NASIS.

These access an option that is stored in a special object called an environment associated with the soilDB package – they generally contain vectors of IDs to help locating the problematic data.

get('sites.missing.pedons', envir = soilDB.env)
- Returns user site ID for sites missing pedons
get('dup.pedon.ids', envir = soilDB.env)
- Returns user pedon ID for sites with duplicate pedon ID
get('bad.pedon.ids', envir = soilDB.env)
- Returns user pedon ID for pedons with inconsistent horizon depths
get('bad.horizons', envir = soilDB.env)
- Returns a data.frame of horizon-level information for pedons with inconsistent horizon depths

Here is a full list of ID vectors that can be created when errors are detected:

# fetchNASIS("pedons")
get('sites.missing.pedons', envir = get_soilDB_env())
get('dup.pedon.ids', envir = get_soilDB_env())
get('bad.pedon.ids', envir = get_soilDB_env())
get('missing.bottom.depths', envir=get_soilDB_env())
get('top.bottom.equal', envir=get_soilDB_env())
get('bad.horizons', envir = get_soilDB_env())
get('rock.fragment.volume.gt100.phiid', envir=get_soilDB_env())
get('artifact.volume.gt100.phiid', envir=get_soilDB_env())
get('multisiteobs.surface', envir=get_soilDB_env())
get('surface.fragment.cover.gt100.siteobsiid', envir=get_soilDB_env())
get('multiple.labsampnum.per.phiid', envir=get_soilDB_env())

# fetchNASISLabData()
get('bad.labpedon.ids', envir=get_soilDB_env())

# fetchNASIS("components"), fetchNASISWebReport()
get('dupe.coiids', envir=get_soilDB_env())
get('dupe.muiids', envir=get_soilDB_env())
get('multiple.mu.per.dmu', envir=get_soilDB_env())
get("component.hz.problems", envir = get_soilDB_env())
get('multiple.ecosite.per.coiid', envir=get_soilDB_env())
get('multiple.otherveg.per.coiid', envir=get_soilDB_env())

# Soil Data Access
get('dup.compmgrp.cokeyrvindictor', envir=get_soilDB_env())
get('component.hz.problems', envir=get_soilDB_env())
get('component.ecosite.problems', envir = get_soilDB_env())

These get() calls access variables stored in the package environment soilDB.env. The variables only exist if there are “problems” / values found by the data checks. If you fix the errors in the NASIS database and the checks don’t find any errors then these vectors of IDs will not be defined.

2.6 Basic Data Inspection

Now that you’ve loaded some data, you can look at additional ways to summarize and interact with data elements.

2.6.1 `table()`: Tabulation and Cross Tabulation

The base R table() function is very useful for quick summary operations. It returns a named vector with the amount of each unique level of the a given vector.

The numeric vector of “counts” is commonly combined with other functions such as sort(), order(), prop.table(), is.na() to identify abundance, proportions, or missing data (NA).

# load required packages
library(aqp)
library(soilDB)

data("gopheridge", package = "soilDB")

# for these examples, we use the gopheridge object as our "selected set"
pedons <- gopheridge 

## you can use fetchNASIS to load your own data, like this:
# pedons <- fetchNASIS()

# summarize which soil taxa we have loaded
table(pedons$taxonname)

## 
## Gopheridge 
##         52

# sort taxonomic names in descending order
sort(table(pedons$taxonname), decreasing = TRUE)

## Gopheridge 
##         52

# could do the same thing for taxonomic subgroups 
table(pedons$taxsubgrp)

## 
## mollic haploxeralfs  typic haploxerepts  ultic haploxeralfs  ultic haploxerolls 
##                   1                   6                  44                   1

sort(table(pedons$taxsubgrp), decreasing = TRUE)

## 
##  ultic haploxeralfs  typic haploxerepts mollic haploxeralfs  ultic haploxerolls 
##                  44                   6                   1                   1

We can convert counts in the table() result into proportions with prop.table():

prop.table(table(pedons$taxsubgrp))

## 
## mollic haploxeralfs  typic haploxerepts  ultic haploxeralfs  ultic haploxerolls 
##          0.01923077          0.11538462          0.84615385          0.01923077

table() can be used to get counts over multiple dimensions of factor levels. This is called cross tabulation.

For instance, let’s cross tabulate taxonomic subgroup (taxsubgrp) and the particle size family class (taxpartsize) for pedons

table(pedons$taxsubgrp, pedons$taxpartsize)

##                      
##                       clayey-skeletal coarse-loamy fine fine-loamy loamy-skeletal
##   mollic haploxeralfs               0            0    0          0              1
##   typic haploxerepts                1            0    0          0              5
##   ultic haploxeralfs                2            0    1          1             40
##   ultic haploxerolls                0            1    0          0              0

As expected with gopheridge the majority of pedons are Loamy-skeletal Ultic Haploxeralfs.

Since pedons contains site and horizon level data, when cross tabulating we need to be sure pair a column from the site data with other site-level columns, and the same for horizon-level. This is because all arguments to table() must have the same length.

For example, let’s cross-tabulate horizon designation (hzname) with horizon texture class (texcl).

We can use the addmargins() function to add the row and column sums to the margins of the table for easier interpretation when there are many rows/columns in the result.

addmargins(table(pedons$hzname, pedons$texcl))

##        
##           c  cl   l scl sic sicl sil  sl Sum
##   2BCt5   0   1   0   0   0    0   0   0   1
##   2Bt1    0   0   3   0   0    0   0   0   3
##   2Bt2    1   2   1   1   0    0   0   0   5
##   2Bt3    1   3   1   0   0    0   0   0   5
##   2Bt4    2   0   0   0   0    0   0   0   2
##   2CBt    0   0   0   1   0    0   0   0   1
##   2Cr     0   0   0   0   0    0   0   0   0
##   2Crt    0   0   0   0   0    0   0   0   0
##   2R      0   0   0   0   0    0   0   0   0
##   A       0   0  33   0   0    0  12   3  48
##   A1      0   0   2   0   0    0   2   0   4
##   A2      0   0   2   0   0    0   2   0   4
##   A3      0   0   1   0   0    0   0   0   1
##   AB      0   0   2   0   0    0   2   0   4
##   BA      0   0  17   0   0    1   0   0  18
##   BC      0   0   4   0   0    0   0   0   4
##   BCt     0   2   2   1   0    0   1   0   6
##   Bt      0   0   1   1   1    0   0   0   3
##   Bt1     0   3  28   1   0    3   5   0  40
##   Bt2     3  15  14   1   1    3   1   0  38
##   Bt3     4   8   6   1   0    1   1   0  21
##   Bt4     1   2   2   0   0    0   0   0   5
##   Bw      0   0   6   0   0    0   1   0   7
##   Bw1     0   0   5   0   0    0   0   0   5
##   Bw2     0   0   5   0   0    0   0   0   5
##   Bw3     0   0   3   0   0    0   0   0   3
##   C       1   0   1   0   0    0   0   0   2
##   C/Brt   0   1   0   0   0    0   0   0   1
##   CBt     0   0   0   1   0    0   0   0   1
##   Cr      0   0   1   0   0    0   0   0   1
##   Crt     0   0   0   0   0    0   0   0   0
##   Ct      0   0   1   0   0    0   0   0   1
##   Oe      0   0   0   0   0    0   0   0   0
##   Oi      0   0   0   0   0    0   0   0   0
##   R       0   0   0   0   0    0   0   0   0
##   Sum    13  37 141   8   2    8  27   3 239

2.6.2 Missing Values

Missing values are encoded in R as NA. The table() function has a useNA argument that affects whether NA values are counted.

Use is.na() to return a logical value that identifies missing data.

table(pedons$taxsubgrp, useNA = "ifany")

## 
## mollic haploxeralfs  typic haploxerepts  ultic haploxeralfs  ultic haploxerolls 
##                   1                   6                  44                   1

#  is.na(...) 
table(is.na(pedons$taxsubgrp))

## 
## FALSE 
##    52

#  is NOT NA !is.na(...)
table(!is.na(pedons$taxsubgrp))

## 
## TRUE 
##   52

# it can also be applied to horizon level columns in the SPC
sort(table(pedons$texture), decreasing=TRUE)

## 
##        BR         L      GR-L     GRV-L     CBV-L       SPM     GRX-L       SIL    GRV-CL    CBV-CL 
##        58        36        33        24        18        14        12        12         9         8 
##    GR-SIL     CBX-L    GRX-CL    CBX-CL   GRV-SIL        CL   GRV-SCL     GRX-C       MPM        SL 
##         7         6         5         4         4         3         3         3         3         3 
##      CB-L     GR-CL   GRX-SCL     PGR-C    PGRX-L      SICL    STV-CL     STV-L     STX-C     STX-L 
##         2         2         2         2         2         2         2         2         2         2 
##         C      CB-C     CB-CL    CB-SCL    CB-SIL   CBV-SIL   CBX-SCL      CN-L   CN-SICL     CNX-L 
##         1         1         1         1         1         1         1         1         1         1 
##  CNX-SICL     FLV-L      GR-C    GR-SIC  GRV-SICL   GRX-SIC   GRX-SIL  PCB-SICL PCBV-SICL     PCN-C 
##         1         1         1         1         1         1         1         1         1         1 
##   PCNX-CL    PGRV-C   PGRV-CL  PGRX-SCL  PGRX-SIL      ST-L     STV-C    STX-CL  STX-SICL 
##         1         1         1         1         1         1         1         1         1

2.6.3 Logical Operators

Logical operators act on vectors for the purposes of comparison.

== “EQUAL TO”
!= “NOT EQUAL TO”
< LESS than
- LESS than or equal to <=
> GREATER than
- GREATER than or equal to >=
%in% Equivalent to IN () in SQL; same logic as match() but returns a logical, not integer
- Example: pedons$taxpartsize %in% c('loamy-skeletal', 'sandy-skeletal')
- Returns a vector of TRUE/FALSE equal in length to left-hand side
& logical AND
| logical OR

Any logical operation on NA returns NA, so it is important to make sure your data are complete before relying on logical expressions to find conditions of interest.

2.6.4 Pattern Matching

The following examples use the grep() function to pattern match within the data. We do this to create an index of the SoilProfileCollection for records that match the specified pattern, and then use that index to filter to specific sites and profiles.

Patterns are specified using regular expression (REGEX) syntax.

This process can be applied to many different columns in the SPC based on how you need to filter the data. This example pattern matches on the taxsubgrp column, but another useful application might be to pattern match on geomorphology or parent material.

Say we want to see what the variation of particle size classes are within a specific subgroup. We can use grep() to create a row index, then apply that index to the SoilProfileCollection.

# create a numeric index for pedons with taxsubgroup containing 'typic'
idx <- grep('typic', pedons$taxsubgrp)
idx

## [1] 11 12 13 14 26 50

# use square bracket notation to subset 'typic' soils in `subset1` object
subset1 <- pedons[idx,]

# or use the index directly to summarize taxpartsize for 'typic' soils
sort(table(pedons$taxpartsize[idx]), decreasing = TRUE)

## 
##  loamy-skeletal clayey-skeletal 
##               5               1

Note: grep() below has an invert argument (default FALSE). This option is very useful for excluding the results of the pattern matching process by inverting whatever the result is. grepl() is the logical version of grep(), so you can invert it using the logical NOT operator: !.

Another method is to create an index using which() function. which() takes any logical vector (or expression), and it returns the indices (positions) where that expression returns TRUE. The use of which becomes more important when there are missing values (NA) in an expression.

Do a graphical check to see the “typic” profiles are selected.

Plot them in R using the SoilProfileCollection “plot” method (e.g., specialized version of the generic plot() function).

# adjust margins
par(mar=c(1,0,0,1))

# plot the first 10 profiles of subset1
plot(subset1[1:10, ], label = 'taxsubgrp', max.depth = 60)
title('Pedons with the word "typic" at subgroup-level of Soil Taxonomy', line=-2)

2.6.4.1 Resources for Regular Expressions

A great way to test out regular expressions, and get in-depth explanations of how the syntax is working, is to use an online expression builder. For example, Regex101 or RegExr.

https://regex101.com/ & https://regexr.com/ - Online regular expression testers
For more information on using regular expressions in grep() for pattern matching operations, see:
- GNU Regular-expression-syntax
- Quick Start Guide :http://www.regular-expressions.info/quickstart.html

Quick check: Review the chapter content, or run the commands, and review the documentation, to answer the questions.

True or False: grepl() returns a numeric vector
True or False: which(grepl('typic', pedons$taxsubgrp)) is the same as grep('typic', pedons$taxsubgrp).

2.6.4.2 Statistics for Soil Survey: REGEX Operator Reference

. One character, any character
* Zero-or-more Quantifier (of previous token)
+ One-or-more Quantifier (of previous token)
{n} quantifier where n is the the number of a match “repeats” (of previous token)
[A-Z!] ONE capital letter, or an exclamation mark
[0-9]{2} TWO numbers (using { quantifier)
| is equivalent to OR:
- Example: grep('loamy|sandy', c("loamy-skeletal","sandy","sandy-skeletal"))
  - “loamy OR sandy”
^ Anchor to beginning of string / line:
- Example: grep('^sandy', c("loamy-skeletal","sandy","sandy-skeletal"))
  - “STARTS WITH sandy”
$ Anchor to end of string / line:
- Example: grep('skeletal$', c("loamy-skeletal","sandy","sandy-skeletal"))
  - “ENDS WITH skeletal”
\\b Anchor to word boundary:
- Example: grep('\\bmesic', c("mesic","thermic","isomesic"))
  - “WORD STARTS WITH mesic” (e.g. not “isomesic”)

2.6.5 Filtering

A variety of methods are available to subset or “filter” R data objects, from a simple data.frame or vector, to something more complex like a Spatial object or a SoilProfileCollection.

You can index many R objects using numeric or logical expressions as above. There are also methods that make this process a little easier.

The base R method for this is subset() and it works on data.frame objects. It is nice because you can specify column names without explicitly referencing the data set, since subset uses non-standard evaluation of expressions passed as arguments.

2.6.5.1 Filtering with `aqp::subset()`

We use the SoilProfileCollection subset method, where we first specify a data (pedons) object then we can write expressions for the columns that exist in that object.

Here, we combine two logical expressions to find taxsubgrp containing "alfs" (Alfisols) with obsdate before January 1st, 2010.

subset2 <- subset(pedons, grepl("alfs", taxsubgrp) & obsdate < as.POSIXlt("2010-01-01"))

# check taxonomic range of particle size classes in the data
# overwhelmingly these are described as loamy-skeletal ultic haploxeralfs
sort(table(subset2$taxsubgrp), decreasing = TRUE)

## 
##  ultic haploxeralfs mollic haploxeralfs 
##                  28                   1

sort(table(subset2$taxpartsize), decreasing = TRUE)

## 
##  loamy-skeletal clayey-skeletal            fine      fine-loamy 
##              25               2               1               1

# check year described and taxpartsize
table(subset2$taxpartsize, substr(subset2$obsdate, 0, 4))

##                  
##                   2007 2008 2009
##   clayey-skeletal    1    0    1
##   fine               1    0    0
##   fine-loamy         1    0    0
##   loamy-skeletal    19    1    5

# a double equal sign '==' is used for exact character or numeric criteria
subset3 <- subset(subset2, taxpartsize == 'loamy-skeletal')

table(subset3$taxpartsize)

## 
## loamy-skeletal 
##             25

par(mar = c(0, 0, 2, 1))
plotSPC(subset3[1:12, ], print.id = FALSE)
title('Loamy-skeletal Ultic Haploxeralfs')

2.6.6 Dates and Times

Dates and times use special object types in R. The Unix time, also known as “Posix time,” is a system for describing a point in time. Unix epoch is a whole number value that is the number of seconds elapsed since the 00:00:00 UTC on 1 January 1970 minus leap seconds.

We can use logical comparison operators on dates and times if their string representation such as "01/01/1970" is converted to a common base R UNIX time representation known as POSIXlt or POSIXct.

This conversion accounts for important things such as timezone using your computer’s locale–which is important to keep in mind.

When converting to POSIX time several unambiguous (year-month-day) date time formats can be detected.

For instance, if you want to convert a date in the common month-day-year format, you need to specify the format argument:

as.POSIXct(24*60*60, origin = "1970-01-01", tz = "UTC")

## [1] "1970-01-02 UTC"

By default the timezone will match your current timezone. Dates without times are treated as being at midnight UTC.

You can customize the timezone with tz argument:

as.POSIXlt("01/01/1970", tz = "UTC", format = "%m/%d/%Y")

## [1] "1970-01-01 UTC"

POSIXlt and POSIXct objects can be formatted with the format() function. strptime() can be used to parse character vectors into date/times. You use as.POSIXlt() with character input, and as.POSIXct() with numeric input.

R also has the Date class which can be used for formatting calendar dates. You can convert POSIXlt/POSIXct objects to Date objects with as.Date()

Exercise 2: O Horizon Thickness

This exercise uses a synthetic transect generated with R code. The 10 hypothetical pedons along the transect have been described using a standardized set of horizon designations: Oe, Oa, A, Cg1, Cg2. The thicknesses of layers, the depth to gleyed matrix, and soil organic carbon content vary between pedons.

We will start to examine this dataset graphically, and then we will use some functions defined for the SoilProfileCollection to extract information from the data.

Save the code you use in an R script, add answers as comments, and send to your mentor.

First, we will load the sample data:

load(url("https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/book/02/ohz.rda"))
plotSPC(ohz)

The aqp function thicknessOf() can be used to calculate the thickness of horizons matching a regular expression pattern. For example, to calculate the total O horizon thickness we can match the pattern "O":

ohz_thickness <- thicknessOf(ohz, pattern = "O", prefix = "O_")
ohz_thickness

##      id O_thickness
## 1  P-01          23
## 2  P-02          40
## 3  P-03          37
## 4  P-04          42
## 5  P-05          24
## 6  P-06          45
## 7  P-07          43
## 8  P-08          33
## 9  P-09          47
## 10 P-10          42

We will use the site() LEFT JOIN method to merge the calculated thickness into the SoilProfileCollection.

site(ohz) <- ohz_thickness

Now, let’s plot the profiles in order of total O horizon thickness. We draw dashed lines at 20 and 40cm to indicate the thresholds for histic epipedon and Histosols soil order:

plotSPC(ohz, plot.order = order(ohz$O_thickness))
abline(h = c(20, 40), lty = 2)

Using the ohz object answer the following questions. Use techniques from the previous sections. Check your logic by comparing to the profile sketch.

Check if any soil organic carbon values are NA. Make a profile sketch (plotSPC()) using organic carbon percentage ("soc") to color the profiles. Then calculate a new column using the function is.na() and use that to color the profiles. Which profiles have missing data? In which horizons?
Tabulate horizon designations (ohz$name) using table(). How many of each horizon designation are present?
Use regular expressions to count the number of "Cg" horizons in the collection. How many Cg horizons have organic carbon data in the "soc" column?
Filter the ohz object using aqp::subset() to select pedons where O_thickness column is greater than 20 and less than 40. This is the histic epipedon thickness requirement. How many pedons have a histic epipedon?
Filter the ohz object using aqp::subset() to select pedons where O_thickness column is greater than 40. This is the Histosols order thickness requirement. How many pedons are Histosols?

2.7 Generalized Horizon Labels

After review of the data, a new set of more general labels can be assigned to each horizon to define groups over which soil properties will be aggregated. These are called “Generalized Horizon Labels” (GHL). These new labels define functionally similar groups that usually span multiple genetic horizons.

Consider this situation: you have a collection of pedons that have been correlated to a named soil series or component. You would like to objectively compute a range in characteristics (“low-rv-high” values) and the typical range of horizon depths.

Most collections of pedon data have variation in description style and horizons used, horizon depths, and number of horizons described.

In the scenario depicted in the image above there are several obvious “micro-correlation” decisions that can be to group horizons for aggregation.

For example: what horizon prototype scheme (e.g., A-Bt1-Bt2-Bt3-Cr-R) best conveys the concept of this soil series or soil component? Does it make sense to group the Bt3, Bt4, BCt, and CBt horizons for aggregation? What about grouping Bt3 and 2Bt3 horizons? Do BA and AB horizons occur frequently enough to be included as their own group?

2.7.1 Formal Approach

Generalized horizon labels are an aggregation strategy. Here is a basic outline of the process:

Select a set of GHL that best represents a group of pedons to be aggregated. This could be based on series descriptions, expert knowledge, or inspection of the most frequently described horizon designations.
Assign GHL to each horizon using whatever information is available for grouping horizons. This micro-correlation of horizon designations will often require slightly different rules for each dataset and group. Careful inspection of horizon designation and observed properties is critical.
Evaluate GHL assignments and manually refine as needed.
Keep track of final GHL assignments in NASIS (dspcomplayerid) column, an R script, or a text file.
Compute range in characteristics, aka low-rv-high values, for clay, sand, pH, etc. using the labels as grouping variables.
Estimate most likely top and bottom depths for each GHL group.

2.7.2 Functions for Handling Generalized Horizon Labels

The aqp package defines several functions that can be used to assign GHLs or summarize the data associated with them.

The function generalize.hz() can be used to apply new labels based on regular expression pattern matching. The function GHL() can be used to get or set the column containing generalized horizon labels for a SoilProfileCollection.

Exercise 4 in this chapter will let you work through the process of summarizing data using GHLs with your own data. But first we continue with tools to inspect your data retrieved from NASIS with soilDB. It is very important that basic QC/QA be performed on your data before proceeding with any statistical summaries.

2.8 `fetchNASIS()` data checks

fetchNASIS() does a lot of the hard work for you by pulling out a variety of child tables and joining them to site/horizon tables.

This results in a SoilProfileCollection object that has many of the NASIS data columns. A variety of internal processes are performed to get data that conforms with site or horizon tables. Output is generated in the console that you should always look at.

We will walk through a few of the most common outputs.

Quality control and limited “filling” of fragments, horizon depth data, site data, tables

replacing missing lower horizon depths with top depth + 1cm ... [19 horizons]
-> QC: horizon errors detected:
    Use `get('bad.pedon.ids', envir=soilDB.env)` for pedon record IDs (peiid)
    Use `get('bad.horizons', envir=soilDB.env)` for horizon designations
-> QC: pedons missing bottom hz depths:
    Use `get('missing.bottom.depths', envir=soilDB.env)` for pedon record IDs (peiid)

Notes about default settings and handling of NULL (missing data elements or records)

NOTE: all records are missing surface fragment cover
NOTE: all records are missing rock fragment volume

2.8.1 Inspecting Results

We will analyze occurrence of andic soil properties in pedons from MT647. This is a demonstration of inspecting a NASIS dataset by splitting it based on a specific diagnostic feature.

We will download a “selected set” from the course website from an .rda file to save you the effort of crafting your selected set just for this example.

Downloading and installing the .rda is equivalent to NASIS query SSRO_Northwest: _PedonPC_Plus_DataDump_select for User Site ID: %MT647%, NASIS Site: SSRO_Northwest, and NASIS Group: NW-MIS Point Data.

After populating your NASIS selected set mt647 is created in R with mt647 <- fetchNASIS(rmHzErrors=TRUE) and mt647err with mt647err <- fetchNASIS(rmHzErrors=FALSE). rmHzErrors=FALSE is the default behavior, we are creating two objects to demonstrate how to identify and fix common horizon errors.

2.8.1.1 Load Example Data

To load the sample object data into R, just use load() and the path to the .rda file. In this case we are using a remote path, so we use url() to open the connection:

load(url("https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/book/02/mt647.rda"))

2.8.2 Select Pedons with Andic Soil Properties

length(mt647)

## [1] 481

table(site(mt647)$andic.soil.properties, useNA = "ifany")

## 
## FALSE  TRUE  <NA> 
##     2    83   396

# get just the profiles with andic.soil.properties == TRUE
mt647.asp <- subset(mt647, andic.soil.properties)
mt647.asp

## SoilProfileCollection with 83 profiles and 446 horizons
## profile ID: peiid  |  horizon ID: phiid 
## Depth range: 20 - 305 cm
## 
## ----- Horizons (6 / 446 rows  |  10 / 77 columns) -----
##   peiid   phiid hzdept hzdepb hzname texture pedon_id dspcomplayerid bounddistinct boundtopo
##  828140 4005861      0      2     Oe    <NA>  P23-134           <NA>          <NA>      <NA>
##  828140 4005863      2     13      E   CBV-L  P23-134           <NA>         clear    smooth
##  828140 4005859     13     23    Bs1   CBV-L  P23-134           <NA>         clear    smooth
##  828140 4005862     23     33    Bs2   CBV-L  P23-134           <NA>       gradual    smooth
##  828140 4005864     33     69     BC GRX-FSL  P23-134           <NA>       gradual    smooth
##  828140 4005860     69    152     2C GRX-FSL  P23-134           <NA>          <NA>      <NA>
## [... more horizons ...]
## 
## ----- Sites (6 / 83 rows  |  10 / 132 columns) -----
##  siteiid  peiid ecositeid ecositenm ecositecorrdate es_classifier siteecositehistory.classifier
##   845415 828140      <NA>      <NA>            <NA>          <NA>                          <NA>
##   845429 828152      <NA>      <NA>            <NA>          <NA>                          <NA>
##   845430 828153      <NA>      <NA>            <NA>          <NA>                          <NA>
##   845432 828155      <NA>      <NA>            <NA>          <NA>                          <NA>
##   845434 828157      <NA>      <NA>            <NA>          <NA>                          <NA>
##   845451 828176      <NA>      <NA>            <NA>          <NA>                          <NA>
##  es_selection_method upedonid siteobsiid
##                 <NA>  P23-134     821442
##                 <NA>  P92-052     821456
##                 <NA>  P92-046     821457
##                 <NA>  P93-053     821459
##                 <NA>  P91-103     821461
##                 <NA>  P89-011     821478
## [... more sites ...]
## 
## Spatial Data:
## [EMPTY]

length(mt647.asp)

## [1] 83

We can compare this to what we see in the NASIS Pedon Diagnostic Features table:

2.8.2.1 Using `fetchNASIS()` output to find errors

Any profiles that have have logic errors detected are stored in soilDB.env bad.pedon.ids variable after you run fetchNASIS. If this variable does not exist either you have not run fetchNASIS() in the current session or there are no errors.

# these are the troublesome user pedon IDs
get('bad.pedon.ids', envir = soilDB.env)

##  [1] "P93-037" "P90-008" "P90-004" "P90-009" "P93-058" "P93-059" "P90-025" "P90-002" "P90-012" "P92-001"
## [11] "P92-085" "P90-019" "P90-010" "P90-015" "P91-094" "P92-029" "P92-076" "P93-026" "P93-035" "P93-063"
## [21] "P93-064" "P93-041" "P93-043" "P93-044" "P93-083" "P93-112" "P93-113" "P93-124" "P93-001" "P96-007"
## [31] "P91-025" "P93-078" "P92-044" "P91-112" "P92-038" "P90-018" "P93-057" "P93-084" "P90-016" "P92-063"
## [41] "P92-048" "P93-052" "F01-230" "F95-420" "F95-114" "F96-205" "P75-006" "P91-105" "P91-059"

# rmHzErrors=TRUE removes the "bad" illogical pedons
any(mt647$upedonid %in% get('bad.pedon.ids', envir=soilDB.env))

## [1] FALSE

# mt647err is created with rmHzErrors=FALSE
any(mt647err$upedonid %in% get('bad.pedon.ids', envir=soilDB.env))

## [1] TRUE

# mt647.asp is a subset of mt647, so  no errors
any(mt647.asp$upedonid %in% get('bad.pedon.ids', envir=soilDB.env))

## [1] FALSE

When fetchNASIS(..., rmHzErrors = TRUE) any horizons that were omitted from the SoilProfileCollection will be stored in the bad.horizons variable in the soilDB.env environment.

head(get('bad.horizons', envir=soilDB.env))

##      peiid   phiid upedonid hzname hzdept hzdepb
## 67  868038 4270406  F01-230      E     NA     NA
## 68  868038 4270407  F01-230     Bw     NA     NA
## 95  868072 4270514  F95-114   <NA>     NA     NA
## 99  868048 4270437  F95-420   <NA>     NA     NA
## 111 868074 4270519  F96-205      B     NA     NA
## 112 868074 4270521  F96-205      C     NA     NA

2.8.2.2 Logic Checks for the SoilProfileCollection

The aqp package has several functions that do logic checks on SoilProfileCollection objects. The main method that does this in aqp is checkHzDepthLogic() which returns a data.frame of results of running four logic tests on the horizon data from each profile.

Checks for:

bottom depths less than top depth / bad top depth order ("depthLogic")
bottom depths equal to top depth ("sameDepth")
overlaps and gaps ("overlapOrGap")
missing depths ("missingDepth")

logic_tests <- checkHzDepthLogic(mt647err)

# look at first few (look OK; valid == TRUE)
head(logic_tests)

##    peiid valid depthLogic sameDepth missingDepth overlapOrGap
## 1 828138  TRUE      FALSE     FALSE        FALSE        FALSE
## 2 828139  TRUE      FALSE     FALSE        FALSE        FALSE
## 3 828140  TRUE      FALSE     FALSE        FALSE        FALSE
## 4 828141  TRUE      FALSE     FALSE        FALSE        FALSE
## 5 828142  TRUE      FALSE     FALSE        FALSE        FALSE
## 6 828143  TRUE      FALSE     FALSE        FALSE        FALSE

# these all have overlapOrGap errors
head(logic_tests[!logic_tests$valid, ])

##     peiid valid depthLogic sameDepth missingDepth overlapOrGap
## 11 828148 FALSE      FALSE     FALSE        FALSE         TRUE
## 23 828160 FALSE      FALSE     FALSE        FALSE         TRUE
## 25 828162 FALSE      FALSE     FALSE        FALSE         TRUE
## 34 828171 FALSE      FALSE     FALSE        FALSE         TRUE
## 35 828172 FALSE      FALSE     FALSE        FALSE         TRUE
## 36 828173 FALSE      FALSE     FALSE        FALSE         TRUE

# join the logic test data into the site table
site(mt647err) <- logic_tests

Use the $valid vector in result to select profiles with depth logic issues (logic_tests$valid == FALSE)

bad.profiles <- subset(mt647err, !valid)
bad.profiles

## SoilProfileCollection with 49 profiles and 338 horizons
## profile ID: peiid  |  horizon ID: phiid 
## Depth range: 15 - 152 cm
## 
## ----- Horizons (6 / 338 rows  |  10 / 77 columns) -----
##   peiid   phiid hzdept hzdepb hzname texture pedon_id dspcomplayerid bounddistinct boundtopo
##  828148 4005908      0      3     Oe    <NA>  P93-037           <NA>          <NA>      <NA>
##  828148 4005907      3      5     Oi    <NA>  P93-037           <NA>          <NA>      <NA>
##  828148 4005905      5     15      E  CB-FSL  P93-037           <NA>         clear    smooth
##  828148 4005911     15     23     Bs  CB-SIL  P93-037           <NA>        abrupt    smooth
##  828148 4005909     23     25     2E CBV-FSL  P93-037           <NA>         clear    smooth
##  828148 4005910     25     58     2B CBV-FSL  P93-037           <NA>       gradual    smooth
## [... more horizons ...]
## 
## ----- Sites (6 / 49 rows  |  10 / 137 columns) -----
##   peiid siteiid ecositeid ecositenm ecositecorrdate es_classifier siteecositehistory.classifier
##  828148  845423      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828160  845435      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828162  845437      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828171  845446      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828172  845447      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828173  845448      <NA>      <NA>            <NA>          <NA>                          <NA>
##  es_selection_method upedonid siteobsiid
##                 <NA>  P93-037     821450
##                 <NA>  P90-008     821462
##                 <NA>  P90-004     821464
##                 <NA>  P90-009     821473
##                 <NA>  P93-058     821474
##                 <NA>  P93-059     821475
## [... more sites ...]
## 
## Spatial Data:
## [EMPTY]

length(bad.profiles)

## [1] 49

Alternately, we can keep only the “valid” ones (where logic_tests$valid == TRUE):

good.profiles <- subset(mt647err, logic_tests$valid)
good.profiles

## SoilProfileCollection with 481 profiles and 2536 horizons
## profile ID: peiid  |  horizon ID: phiid 
## Depth range: 14 - 1552 cm
## 
## ----- Horizons (6 / 2536 rows  |  10 / 77 columns) -----
##   peiid   phiid hzdept hzdepb hzname texture pedon_id dspcomplayerid bounddistinct boundtopo
##  828138 4005848      0      5      O    <NA>  P91-043           <NA>          <NA>      <NA>
##  828138 4005852      5     18      E    GR-L  P91-043           <NA>         clear    smooth
##  828138 4005850     18     38    Bw1   GRV-L  P91-043           <NA>       gradual      wavy
##  828138 4005849     38     51    Bw2 CBV-FSL  P91-043           <NA>       gradual      wavy
##  828138 4005853     51     71     BC  CBV-SL  P91-043           <NA>         clear      wavy
##  828138 4005851     71     81      R    <NA>  P91-043           <NA>          <NA>      <NA>
## [... more horizons ...]
## 
## ----- Sites (6 / 481 rows  |  10 / 137 columns) -----
##   peiid siteiid ecositeid ecositenm ecositecorrdate es_classifier siteecositehistory.classifier
##  828138  845413      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828139  845414      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828140  845415      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828141  845416      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828142  845417      <NA>      <NA>            <NA>          <NA>                          <NA>
##  828143  845418      <NA>      <NA>            <NA>          <NA>                          <NA>
##  es_selection_method upedonid siteobsiid
##                 <NA>  P91-043     821440
##                 <NA>  P91-029     821441
##                 <NA>  P23-134     821442
##                 <NA>  P93-025     821443
##                 <NA>  P93-075     821444
##                 <NA>  P93-074     821445
## [... more sites ...]
## 
## Spatial Data:
## [EMPTY]

length(good.profiles)

## [1] 481

Once you have created a subset SoilProfileCollection, you can manually inspect the data in that set to identify areas where there are issues. Ideally you would fix these in NASIS. If you don’t have the ability to fix them in NASIS, you might write some R code into your analysis script to apply patches to missing or invalid data before proceeding with summaries.

2.9 Extended Data Functions

Additional data related to both site and horizon information can be fetched using the get_extended_data_from_NASIS() function. This function returns a named list object with several tables that fetchNASIS() draws from.

There are a variety of calculated fields that are included in the default fetchNASIS() result based on these extended data. Sometimes you need to work with a less aggregated representation of the source data to chase down data issues.

2.9.1 Elements of `get_extended_data_from_NASIS()`

Ecological Site History ("ecositehistory")
Diagnostic Features ("diagnostic")
Diagnostic Feature TRUE/FALSE Summary ("diagHzBoolean")
Restrictions ("restriction")
Fragment and Texture Summaries
- Horizon Fragments ("frag_summary")
- Horizon Artifacts ("art_summary")
- Surface Fragments ("surf_frag_summary")
- Texture Class Modifiers ("texmodifier")
Geomorphic Table Summaries ("geomorph")
Parent Material Summaries ("pm")
Taxonomic History ("taxhistory")
Site Text Notes w/ Photo links("photo")
Horizon Structure ("struct")
Horizon Designation ("hzdesgn")

2.9.2 Load Example Data

Below is a summary of additional information that can be readily brought into R from your NASIS selected set via the get_extended_data_from_NASIS() function.

Before continuing, imagine opening the NASIS client, populating your selected set with 2015MT663% using a query like “NSSC Pangaea – POINT-Pedon/Site by User Pedon ID”

Load the data like we did above.

load(url("https://github.com/ncss-tech/stats_for_soil_survey/raw/master/data/book/02/mt663.rda"))

## fetch extended site and horizon data from local NASIS
# mt663 <- fetchNASIS()
# mt663ext <- get_extended_data_from_NASIS_db()

We could use the get_extended_data_from_NASIS_db() function if 2015MT663% or other data were in the selected set, but we will use the mt663ext data we loaded from the .rda file.

The column names are the names of variables that you could join to your site or horizon data by various means. Generally these variable names, with a few exceptions, mirror the NASIS 7 data model names.

# site and pedon related extended data

# list all dataframes in the extended data
str(mt663ext, 1)

## List of 14
##  $ ecositehistory:'data.frame':  0 obs. of  7 variables:
##  $ siteaoverlap  :'data.frame':  127 obs. of  5 variables:
##  $ diagnostic    :'data.frame':  292 obs. of  4 variables:
##  $ diagHzBoolean :'data.frame':  115 obs. of  20 variables:
##  $ restriction   :'data.frame':  11 obs. of  8 variables:
##  $ frag_summary  :'data.frame':  561 obs. of  18 variables:
##  $ art_summary   :'data.frame':  561 obs. of  14 variables:
##  $ texmodifier   :'data.frame':  622 obs. of  5 variables:
##  $ geomorph      :'data.frame':  241 obs. of  9 variables:
##  $ taxhistory    :'data.frame':  115 obs. of  23 variables:
##  $ photo         :'data.frame':  793 obs. of  4 variables:
##  $ pm            :'data.frame':  179 obs. of  9 variables:
##  $ struct        :'data.frame':  444 obs. of  6 variables:
##  $ hzdesgn       :'data.frame':  561 obs. of  19 variables:

# vegetation data summary
colnames(mt663ext$ecositehistory)

## [1] "siteiid"                       "ecositeid"                     "ecositenm"                    
## [4] "ecositecorrdate"               "es_classifier"                 "siteecositehistory.classifier"
## [7] "es_selection_method"

# diagnostic features
colnames(mt663ext$diagnostic)

## [1] "peiid"    "featkind" "featdept" "featdepb"

# surface rock fragments
colnames(mt663ext$surf_frag_summary)

## NULL

# geomorphic description
colnames(mt663ext$geomorph)

## [1] "peiid"          "geomicrorelief" "geommicelev"    "geomfmod"       "geomfname"      "geomfeatid"    
## [7] "existsonfeat"   "geomfiidref"    "geomftname"

# taxonomic history data
colnames(mt663ext$taxhistory)

##  [1] "peiid"            "classdate"        "classifier"       "classtype"        "taxonname"       
##  [6] "localphase"       "taxonkind"        "seriesstatus"     "taxclname"        "taxpartsize"     
## [11] "taxorder"         "taxsuborder"      "taxgrtgroup"      "taxsubgrp"        "soiltaxedition"  
## [16] "osdtypelocflag"   "taxmoistcl"       "taxtempregime"    "taxfamother"      "taxreaction"     
## [21] "taxfamhahatmatcl" "psctopdepth"      "pscbotdepth"

# linked photo stored in site textnotes
colnames(mt663ext$photo)

## [1] "siteiid"   "recdate"   "textcat"   "imagepath"

# site parent materials
colnames(mt663ext$pm)

## [1] "siteiid"    "seqnum"     "pmorder"    "pmdept"     "pmdepb"     "pmmodifier" "pmgenmod"  
## [8] "pmkind"     "pmorigin"

###
### horizon related extended data
### 

# rock fragments 
colnames(mt663ext$frag_summary)

##  [1] "phiid"                "fine_gravel"          "gravel"               "cobbles"             
##  [5] "stones"               "boulders"             "channers"             "flagstones"          
##  [9] "parafine_gravel"      "paragravel"           "paracobbles"          "parastones"          
## [13] "paraboulders"         "parachanners"         "paraflagstones"       "unspecified"         
## [17] "total_frags_pct_nopf" "total_frags_pct"

# soil texture modifers
colnames(mt663ext$texmodifier)

## [1] "peiid"  "phiid"  "phtiid" "seqnum" "texmod"

# soil structure data
colnames(mt663ext$struct)

## [1] "phiid"         "structgrade"   "structsize"    "structtype"    "structid"      "structpartsto"

2.9.3 Visualizing Common Landforms

The following code generates a simple graphical summary of the 10 most commonly occurring "landform_string" (a calculated field in fetchNASIS()) to inspect which are the most common.

# load data from a NASIS selected set (or sample object)
pedons <- mt663

# create 'lf' object of landform factors sorted in descending order
lf <- sort(table(pedons$landform_string), decreasing = TRUE)

# plot top 10 or length, whichever is shorter
Hmisc::dotchart2(lf[1:pmin(10, length(lf))],
                 col = 'black',
                 xlim = c(0, max(lf)),
                 cex.labels = 0.75)

For a challenge and to further inspect your own data try the above code with some other summaries of geomorphic data produced by fetchNASIS().

You can swap landform_string for: landscape_string (landscape), hillslopeprof (2D), geomposmntn, geomposhill, geompostrce, geomposflats (3D), slope_shape, shapeacross, shapedown (slope shape across/down), microfeature_string (microfeature), or geomicrorelief_string (site observation microrelief).

2.9.4 Diagnostic Features

2.9.4.1 Boolean Diagnostic Features in `fetchNASIS()`

If diagnostic features are populated in the Pedon Diagnostic Features table in NASIS, then Boolean (TRUE or FALSE, or “logical”) fields are created for each diagnostic feature type found in the data brought in by fetchNASIS.

These fields can be used to model presence / absence of a diagnostic soil feature by extracting the site data from the SoilProfileCollection with site().

2.9.4.2 Thickness from Diagnostic Features Table

The following is an example of how you could use the diagnostic features (if populated!) from the extended data to determine the thickness of a diagnostic feature of interest.

# get diagnostic features associated with pedons loaded from selected set
d <- diagnostic_hz(mt663)

# summary of the diagnostic features in your data!
unique(d$featkind)

##  [1] "ochric epipedon"          "cambic horizon"           "lithic contact"          
##  [4] "mollic epipedon"          "argillic horizon"         "redox concentrations"    
##  [7] "andic soil properties"    "secondary carbonates"     "sapric soil materials"   
## [10] "aquic conditions"         "reduced matrix"           "albic horizon"           
## [13] "spodic horizon"           "glossic horizon"          "spodic materials"        
## [16] "lithologic discontinuity" "densic materials"         "umbric epipedon"         
## [19] "albic materials"          NA

# tabulate
sort(table(droplevels(factor(d$featkind))), decreasing = TRUE)

## 
##          ochric epipedon           cambic horizon         argillic horizon          mollic epipedon 
##                       61                       54                       43                       42 
##    andic soil properties           lithic contact     secondary carbonates          umbric epipedon 
##                       30                       20                        7                        7 
##            albic horizon           spodic horizon          glossic horizon           reduced matrix 
##                        6                        4                        3                        3 
##    sapric soil materials lithologic discontinuity          albic materials         aquic conditions 
##                        3                        2                        1                        1 
##         densic materials     redox concentrations         spodic materials 
##                        1                        1                        1

# subset argillic horizons
d <- d[d$featkind == 'argillic horizon', ]

# create a new column and subtract the upper from the lower depth
d$argillic_thickness_cm <- d$featdepb - d$featdept

# create another new column with the upper depth to the diagnostic feature
d$depth_to_argillic_cm <- d$featdept

# omit NA values
d <- na.omit(d)

# subset to pedon records IDs and calculated thickness
d <- d[, c('peiid', 'argillic_thickness_cm', 'depth_to_argillic_cm')]
head(d)

##      peiid argillic_thickness_cm depth_to_argillic_cm
## 7  1092610                    56                   30
## 24 1092617                    38                   34
## 26 1092618                    29                   23
## 28 1092619                    38                   32
## 30 1092620                    29                   24
## 33 1092621                    23                   19

# left-join with existing site data
site(mt663) <- d

# plot as histogram
par(mar = c(4.5, 4.5, 1, 1))

# note additional arguments to adjust figure labels
hist(
  mt663$argillic_thickness_cm, 
  xlab = 'Thickness of argillic (cm)', 
  main = '',
  las = 1
)

hist(
  mt663$depth_to_argillic_cm, 
  xlab = 'Depth to argillic top depth (cm)', 
  main = '',
  las = 1
)

Quick check: What can you do with the boolean diagnostic feature data stored in the site table of a fetchNASIS() SoilProfileCollection?

2.9.4.3 Diagnostic Feature Diagrams

# work up diagnostic plot based on the mt663 dataset loaded above
library(aqp)
library(soilDB)
library(sharpshootR)

# can limit which diagnostic features to show by setting 'v' manually
v <- c('ochric.epipedon', 'mollic.epipedon', 'andic.soil.properties', 
       'argillic.horizon', 'cambic.horizon', 
       'lithic.contact')

# the default concatenated landform_string may have multiple levels
#  depending on how the geomorphic tables were populated
#  these are concatenated using the ampersand (&) character
#  so take the first string split using ampersand as a delimiter
mt663$first_landform <- sapply(strsplit(mt663$landform_string, "&"), 
                                  function(x) x[[1]])

# plot with diagnostic features ordered according to co-occurrence
# v: site-level attributes to consider
# k: number of clusters to identify
diagnosticPropertyPlot(
  mt663[1:30, ], v = v, k = 5,
  grid.label = 'usiteid', 
  dend.label = 'first_landform', 
  sort.vars = TRUE
)

Exercise 3: Diagnostic Horizons in Your Own Data

Save the code you use in an R script, add answers as comments, and send to your mentor.

Use the following script to generate a diagnostic-feature diagram for the pedon data you’ve loaded from your NASIS selected set. Alternately, you may use the MT663 data from the example above, just substitute the object mt663 for f.

You can select a subset of desired diagnostic properties or use all diagnostic feature columns.

library(aqp)
library(soilDB)
library(sharpshootR)

# Load data
f <- fetchNASIS(from = 'pedons')

# ... May need to use subset() to reduce the number of pedons!

# get all diagnostic feature columns from site data 
# by pattern matching on '[.]' in the site attribute names
# this is not a generic rule, but works here
idx <- grep('[.]', siteNames(f))
v <- siteNames(f)[idx]

# inspect v
v

# insert diagnostics of interest from the possible list in 'v'
v <- c('ochric.epipedon', 'cambic.horizon', 
       'argillic.horizon', 'paralithic.contact',
       'lithic.contact')

# generate diagnostic property diagram
diagnosticPropertyPlot(
  f, v = v, k = 5, 
  grid.label = 'usiteid', 
  dend.label = 'taxonname'
)

Questions:

Use the results of site(), horizons(), diagnostic_hz() to answer the following questions about your own data. These functions return data.frame objects from a SoilProfileCollection derived from your local NASIS data.

How many pedons do you have in your selected set? Check the number of rows in the site portion of the SoilProfileCollection with nrow(), or use the method length() on the collection.
How many diagnostic features do you have for those pedons? How many different "featkind"? You can calculate the number of unique values with with length() and unique()
Interpret: What diagnostic features did you choose to plot? Did you notice any patterns in the diagnostic features that tend to occur together?

For more information on generating diagnostic feature diagrams, see the following tutorial:

Diagnostic Feature Property Plots.

2.10 Custom Queries to Local NASIS Database

fetchNASIS() and related convenience functions are wrappers around commonly used chunks of SQL (Structured Query Language). Queries of the NASIS local database can be written in T-SQL which is the dialect of SQL used to communicate with Microsoft SQL Server. This is the connection that you set for the pre-course.

To create a DBIConnection object that can access the NASIS local database, use dbConnectNASIS() or simply NASIS(). This will look for an ODBC connection of the name "nasis_local" and will authenticate with the read-only credentials.

You can also pass an existing DBIConnection or path to SQLite file as the dsn argument to connect to an alternate source. The default driver type used for NASIS connections is OdbcConnection from the odbc package.

To query or execute SQL commands on a connection you created, you can use dbGetQuery(connection, query), where connection is your DBIConnection object and query is the SQL to execute.

The soilDB package also includes a helper function dbQueryNASIS(), which functions just like dbGetQuery() except the default is to close the connection after returning the result. This can be convenient for interactive use or cases where a function needs to only execute a single query, as it will prevent you from leaving connections open accidentally.

2.10.1 Example: Site Soil Temperature Data from CA792 (Sequoia-Kings Canyon National Park)

The following example will return all records in your selected set sitesoiltemp table, along with a couple of fields from the site, siteobs, and pedon tables. This is a convenient way to collect all of the field-based soil temperature data associated with the pedons in your selected set for further analysis.

You can use the CA792 (Sequoia-Kings Canyon National Park) pedons as an example. Use a query that searches user pedon ID for the following pattern %CA792% to download and populate a selected set in the NASIS client.

library(soilDB)

# write query as a character string
q <- "SELECT siteiid as siteiid, peiid, usiteid, obsdate, soitemp, soitempdep 
       FROM site_View_1
       INNER JOIN siteobs_View_1 ON site_View_1.siteiid = siteobs_View_1.siteiidref
       LEFT OUTER JOIN sitesoiltemp_View_1 ON siteobs_View_1.siteobsiid = sitesoiltemp_View_1.siteobsiidref
       LEFT OUTER JOIN pedon_View_1 ON siteobs_View_1.siteobsiid = pedon_View_1.siteobsiidref
      ORDER BY obsdate, siteiid;"

# setup connection to local NASIS
channel <- NASIS()

# exec query
d <- dbQueryNASIS(channel, q)

The functions dbConnectNASIS() (alias NASIS()) and dbQueryNASIS() allow you to create a connection to the NASIS local database and send queries to that connection, respectively. By default, dbQueryNASIS() will close your connection after completing the query; you can change this by setting close=FALSE.

# check results
str(d)

# remove records missing values
d <- na.omit(d)

# tabulate unique soil depths
table(d$soitempdep)

# extract doy of year
d$doy <- as.integer(format(d$obsdate, "%j"))

# when where measurements collected?
hist(
  d$doy,
  xlim = c(1, 366),
  las = 1,
  main = 'Soil Temperature Measurements',
  xlab = 'Day of Year'
)

# soil temperature by day of year
plot(
  soitemp ~ doy,
  data = d,
  main = 'Soil Temperature Measurements (CA792)\nNASIS "Site Soil Temperature" table',
  type = 'p',
  pch = 21, 
  bg = 'royalblue',
  xlim = c(1, 366),
  ylim = c(-1, 26),
  xlab = 'Day of Year',
  ylab = 'Soil Temperature at 50cm (deg C)',
  las = 1
)

# vernal equinox, summer solstice, autumnal equinox, winter solstice
x <- as.Date(c('2022-03-20', '2022-06-21', '2022-09-23', '2022-12-21'))

# convert dates -> Julian date, or day-of-year
doy <- as.integer(format(x, "%j"))

# add vertical lines
abline(v = doy, lty = 3, col = grey(0.45))

# annotate
# pos argument: left-center offset
text(
  x = doy, 
  y = -1, 
  labels = c('vernal equinox', 'summer solstice', 'autumnal equinox', 'winter solstice'), 
  pos = 2,
  cex = 0.75,
  font = 3
)

# box/whisker plot showing distribution of day-of-year 
# when measurements (pedon descriptions) were collected
boxplot(d$doy, at = 26, horizontal = TRUE, add = TRUE, lty = 1, width = 1, col = 'white', axes = FALSE)

Exercise 4: Generalized Horizons with Loafercreek

Here we will introduce the concept of using regular expressions to apply “Generalized Horizon Labels” based on the field soil description horizon designations.

This demonstrates one way of grouping horizon data for determining the Range in Characteristics of layers within a Soil Series or a SSURGO component.

Generalized Horizon Labels with Loafercreek

For the exercise during class we will give you some time to read over the materials and apply the process to the sample loafercreek data. The code to apply the process to loafercreek is given. Then we ask that you apply the process to your own data, adjusting your generalized horizon patterns as needed for your local soils.

This may take more time than we have during class itself, so you should follow up as needed with your mentor to complete the exercise. Chapter 3 Exploratory Data Analysis will get deeper into some of the topics that are referenced in the loafercreek code, such as summary statistics on grouped data. You will send a table and profile plots to your mentor when you are done.

Further discussion of generalized horizon labels to order profiles based on horizon-level properties can be found in this tutorial:

Pair-Wise Distances by Generalized Horizon Labels tutorial

Chapter 2 The Data We Use

2.1 Objectives (The Data We Use)

2.2 The Structure of Soil Data

2.2.1 Package References

2.2.2 Importance of Pedon Data

2.2.3 Some Issues With Pedon Data

2.3 Challenges with Pedon Data

2.4 The SoilProfileCollection

2.4.1 SoilProfileCollection methods

2.4.1.1 Promoting a data.frame to SoilProfileCollection

2.4.2 Promoting to “Spatial” SoilProfileCollection

2.4.3 Extracting Site and Horizon Data

2.4.4 Methods like data.frame

2.4.4.1 Column Access by Name: $ and [[

2.4.4.2 Row Access: [

Exercise 1: Assemble a SoilProfileCollection from several CSV files

2.5 Using the soilDB Package

2.5.1 soilDB functions for tabular data

2.5.2 Open Database Connectivity (ODBC) Connection to NASIS

2.5.3 fetchNASIS()

2.5.3.1 fetchNASIS arguments

2.5.4 The gopheridge Dataset

2.5.4.1 Make profile sketches

2.5.4.2 Pedon Data Checks

2.5.4.3 Find Data with Errors

2.6 Basic Data Inspection

2.6.1 table(): Tabulation and Cross Tabulation

2.6.2 Missing Values

2.6.3 Logical Operators

2.6.4 Pattern Matching

2.6.4.1 Resources for Regular Expressions

2.6.4.2 Statistics for Soil Survey: REGEX Operator Reference

2.6.5 Filtering

2.6.5.1 Filtering with aqp::subset()

2.6.6 Dates and Times

Exercise 2: O Horizon Thickness

2.7 Generalized Horizon Labels

2.7.1 Formal Approach

2.7.2 Functions for Handling Generalized Horizon Labels

2.8 fetchNASIS() data checks

2.8.1 Inspecting Results

2.8.1.1 Load Example Data

2.8.2 Select Pedons with Andic Soil Properties

2.8.2.1 Using fetchNASIS() output to find errors

2.8.2.2 Logic Checks for the SoilProfileCollection

2.9 Extended Data Functions

2.9.1 Elements of get_extended_data_from_NASIS()

2.9.2 Load Example Data

2.9.3 Visualizing Common Landforms

2.9.4 Diagnostic Features

2.9.4.1 Boolean Diagnostic Features in fetchNASIS()

2.9.4.2 Thickness from Diagnostic Features Table

2.9.4.3 Diagnostic Feature Diagrams

Exercise 3: Diagnostic Horizons in Your Own Data

2.10 Custom Queries to Local NASIS Database

2.10.1 Example: Site Soil Temperature Data from CA792 (Sequoia-Kings Canyon National Park)

Exercise 4: Generalized Horizons with Loafercreek

2.4.1.1 Promoting a `data.frame` to SoilProfileCollection

2.4.4 Methods like `data.frame`

2.4.4.1 Column Access by Name: `$` and `[[`

2.4.4.2 Row Access: `[`

2.5 Using the `soilDB` Package

2.5.3 `fetchNASIS()`

2.5.4 The `gopheridge` Dataset

2.6.1 `table()`: Tabulation and Cross Tabulation

2.6.5.1 Filtering with `aqp::subset()`

2.8 `fetchNASIS()` data checks

2.8.2.1 Using `fetchNASIS()` output to find errors

2.9.1 Elements of `get_extended_data_from_NASIS()`

2.9.4.1 Boolean Diagnostic Features in `fetchNASIS()`