A quick tutorial on how to search for and accommodate messy data. Inconsistent horizon depths, missing horizons, overlapping horizons, and other such mayhem can lead to unexpected results. It is best to search for and filter (or fix) these kind of errors before proceeding to analyze soil profile data. While classes and methods within the aqp package are fairly tolerant of messy data, it is recommended that you apply these tests to your data before feeding into aqp functions.

Sample Data

Copy and paste this code into an R session to familiarize yourself with the sample data set used in this tutorial.

library(aqp)

# load sample data set, a simple data.frame object with horizon-level data from 10 profiles
data(sp1)
str(sp1)

# optionally read about it...
# ?sp1

# upgrade to SoilProfileCollection
# 'id' is the name of the column containing the profile ID
# 'top' is the name of the column containing horizon upper boundaries
# 'bottom' is the name of the column containing horizon lower boundaries
depths(sp1) <- id ~ top + bottom

# check it out:
class(sp1)
print(sp1)
plot(sp1)

Checking for Bad Data

Lets break some good data to make a point. Checking the horizon logic of a SoilProfileCollection object is performed by the checkHzDepthLogic function. Note that the kable function from the knitr package is used for nice-looking output in the tutorial, you wouldn’t normally use this in an interactive session. It is handy for use in RMarkdown reports.

# load required libraries
library(aqp)
library(knitr)

# load sample data and promote to SoilProfileCollection
data(sp1)
depths(sp1) <- id ~ top + bottom

# make a copy of the example data set, we will insert errors later
bad <- sp1

# insert a missing horizon boundary in profile P001
bad$top[4] <- NA

# create an overlapping horizon in profile P002
bad$top[9] <- 15

# create a depth logic error in profile P003
bad$bottom[13] <- 12

# perform battery of checks
sp1.chk <- checkHzDepthLogic(sp1)
bad.chk <- checkHzDepthLogic(bad)

# graphical reference
par(mar=c(0, 0, 0, 0))
plot(sp1)

Check the original data; looks like there is a problem with top/bottom depths sharing the save value! It is fairly common for Cr and R horizons to be described with only a top depth or equal top / bottom depths.

Note that we are using [ (square bracket) style indexing of data.frame and SoilProfileCollection objects. Be sure to review the SoilProfileCollection Object Reference for details.

kable(sp1.chk)
id valid depthLogic sameDepth missingDepth overlapOrGap
P001 FALSE FALSE TRUE FALSE FALSE
P002 TRUE FALSE FALSE FALSE FALSE
P003 TRUE FALSE FALSE FALSE FALSE
P004 TRUE FALSE FALSE FALSE FALSE
P005 TRUE FALSE FALSE FALSE FALSE
P006 TRUE FALSE FALSE FALSE FALSE
P007 TRUE FALSE FALSE FALSE FALSE
P008 TRUE FALSE FALSE FALSE FALSE
P009 TRUE FALSE FALSE FALSE FALSE
# highlight the problem
idx <- which(sp1.chk$sameDepth)
kable(sp1.chk[idx, ])
id valid depthLogic sameDepth missingDepth overlapOrGap
P001 FALSE FALSE TRUE FALSE FALSE
# check horizon data for this profile
# using profile-indexing to select the profile
# using column-indexing to select the first 8 columns from horizon attributes
kable(horizons(sp1[idx, ])[, 1:8])
group id top bottom bound_distinct bound_topography name texture
2 P001 0 2 A S A1 GRVLS
2 P001 2 14 G S A2 GRVLS
2 P001 14 49 G S AB GRVLS
2 P001 49 57 C S BA GRVLS
2 P001 57 89 C S Bt GRVSCL
2 P001 89 89 NA NA Rt NA

Check the “bad” version of sp1 that we intentionally broke.

kable(bad.chk)
id valid depthLogic sameDepth missingDepth overlapOrGap
P001 FALSE TRUE TRUE TRUE FALSE
P002 FALSE FALSE FALSE FALSE TRUE
P003 FALSE TRUE FALSE FALSE TRUE
P004 TRUE FALSE FALSE FALSE FALSE
P005 TRUE FALSE FALSE FALSE FALSE
P006 TRUE FALSE FALSE FALSE FALSE
P007 TRUE FALSE FALSE FALSE FALSE
P008 TRUE FALSE FALSE FALSE FALSE
P009 TRUE FALSE FALSE FALSE FALSE
# merge test results into site-level attributes of SoilProfileCollection
site(bad) <- bad.chk
# convert validity test to a factor
bad$valid <- factor(bad$valid)

# plot valid vs. invalid hz depth test results
# par() is used to tighten default margins
par(mar=c(0,0,3,0))
groupedProfilePlot(bad, groups='valid', group.name.offset = -20)
title('Valid Horizonation Check')

What happens if you attempt to use profiles with horizon depth errors? Functions like slice can tolerate some errors, but not overlaps.

try(slice(bad, 1:100 ~ .))
## Error : bad horizonation in IDs: P002

You can ignore the error by setting strict=FALSE, but there are still warnings printed to the console.

slice(bad, 1:100 ~ ., strict = FALSE)

Filtering out Bad Data

idx <- which(bad.chk$valid)

# keep the good ones
good <- bad[idx, ]

# check
par(mar=c(0,0,0,0))
plot(good)

Data Cleaning

Soil data often contain records where the lower depth of the deepest horizon is missing (NA) or the same as the top depth (R, Cd, Cr, etc.). These data can either be filtered out or “cleaned” by replacing the missing lower depth with the corresponding upper depth.

Note: this functionality will eventually be put into a proper aqp function.

# replace missing lower boundaries with top + 10
idx <- which(!is.na(sp1$top) & is.na(sp1$bottom))   
if(length(idx) > 0) {
  sp1$bottom[idx] <- sp1$top[idx] + 10
}

# replace same top/bottom with top + 10
idx <- which(sp1$top == sp1$bottom)
if(length(idx) > 0) {
  sp1$bottom[idx] <- sp1$bottom[idx] + 10
}

# check data-cleaning:
kable(checkHzDepthLogic(sp1))

Older soils data with may have O horizons that start above the soil surface (e.g. "Oe 3 to 0 cm). Data containing these type of horizons must either be filtered or fixed before use with classes or methods defined in the aqp package. Functions in the aqp and soilDB packages always nag about horizon depth errors. It is in your best interest to fix these errors if you can, as output from plotSPC, slice, slab and related functions may not be reliable.


This document is based on aqp version 1.27.