The Loafercreek series (fine-loamy, mixed, superactive, thermic Ultic Haploxeralfs) consists of soils on foothills underlain by nearly vertically bedded metavolcanic rocks.
The Loafercreek soils are moderately deep (50 to 100 cm) to a paralithic contact.
The Loafercreek series was established in Butte County (CA612) and is now mapped in Calaveras and Tuolumne Counties (CA630), as well as portions of Mariposa and Stanislaus Counties.
The metamorphic belt in the Sierra Nevada Foothills comprises Mesozoic- and Paleozoic-age metamorphic rocks. The metavolcanic rocks, dominantly what is called greenstone, are derived from metamorphosed intermediate to mafic igneous rocks. These metamorphosed volcanics are closely associated with sedimentary rocks of marine origin.
During the California Gold Rush, miners recognized that the metavolcanic rocks in this area form distinctive “tombstone”-like outcrops. You can see some examples of this in the image below.
Areas of soils conceptually related to Loafercreek have been mapped primarily as Auburn in the Sierra Nevada Foothills.
Many areas (mapunits) of Auburn soils are outside the range of the family (12th edition taxonomy) of the series as it is currently defined. Auburn soils were historically mapped as Ruptic-Lithic Xerochrepts.
Soil survey manuscript descriptions of map units modeled after the old series concepts show that their definition spans, at a minimum, shallow to moderately deep depth classes, likely with deeper inclusions.
The Loafercreek series concept might fit the range in characteristics of the deeper (moderately deep) areas within Auburn mapunits.
In this demo we will explore the aqp function profileApply() for soil-profile-based analyses and workflows. We use a dataset called loafercreek from soilDB.
This is a sample dataset that roughly corresponds to the “extended” set of pedons (including taxadjuncts, similar soils, etc.) correlated to the series.
Loafercreek’s siblings are the soils that geographically occupy the same landscapes and parent materials – often they are components in the same mapunits.
In some mapunits, sibling soils are not named components; they fall within a named component’s range via the inclusion of “similar soils.” Alternately, they can be implied by “bracketing” of dissimilar components.
In CA630, Central Sierra Foothills Area, California (Parts of Calaveras & Tuolumne Counties), the shallow metavolcanic soils with argillic horizons (Ultic Haploxeralfs) are predominantly the Bonanza series, with some areas of Dunstone. Fine particle-size class (PSC), moderately deep soils are similar to the Argonaut concept. Skeletal PSC soils are Jasperpeak (shallow, Lithic Haploxeralfs) and Gopheridge (moderately deep, Ultic Haploxeralfs). Soils with a deep bedrock restriction are called Motherlode (fine-loamy) or Gardellones (loamy-skeletal).
In this series of demos, we are going to use R-based pedon summaries to explore properties of the Loafercreek soils found during the soil survey inventory in CA630.
This demo has three worked “examples” involving the use of profileApply() for summarizing pedon data, as well as some troubleshooting information and example code (some of which also uses profileApply()).
You can use this R script version of the demo document to avoid having to copy and paste all the code. This will allow you to focus on interpreting the output. This version has all of the text as comments.
Readers are encouraged to run all of the code in their own IDE/text editor/console. Also, you are encouraged to use ?function.name to view R manual pages whenever you encounter a function() you do not recognize.
You can get the latest development versions of aqp, soilDB and sharpshootR using the remotes package:
# install remotes if needed
# install.packages("remotes")
remotes::install_github('ncss-tech/aqp', dependencies = FALSE, build = FALSE)
remotes::install_github('ncss-tech/soilDB', dependencies = FALSE, build = FALSE)
remotes::install_github('ncss-tech/sharpshootR', dependencies = FALSE, build = FALSE)
loafercreek
To get soil data out of the database and into an R object, we typically use the soilDB package. We also need aqp, which provides the basic data structure we use to hold pedon data: the SoilProfileCollection object.
library(aqp)
library(soilDB)
One of the built-in datasets provided by soilDB is loafercreek. Let’s load it.
data("loafercreek")
A SoilProfileCollection (SPC) is an S4 object that contains site (spc@site) and horizon (spc@horizons) slots – each containing a single data.frame.
The contents (slots) of a SPC should be accessed using horizons(spc), site(spc), coordinates(spc), etc.
For instance:
# access the clay attribute from the horizons data frame
horizons(spc)$clay
# add new site data by LEFT JOIN on UNIQUE site ID (assumed to be present in both spc and new.site.data)
site(spc) <- new.site.data
SPC data (sites, horizons) can also be accessed in ways similar to a base R data.frame via square bracket notation: spc[your.site.index, your.horizon.index].
In practice, you are usually indexing either a site-level OR a horizon-level attribute. Because of the flexibility built into the SPC data access methods, be aware of the length of any index you are using, and ensure you are getting what you expect.
We use the data.frame-like bracket notation to get a few profiles (by specifying a site-index) and plot them.
my.sub.set <- loafercreek[3:6, ]
# number of rows in my.sub.set@site (# of sites or profiles)
nrow(site(my.sub.set))
## [1] 4
Next we will make a plot of just the subset.
Put User Pedon ID labels on each profile (along the left-hand side with id.style = "side").
Make the horizon designation text 50% larger than the default (cex.names = 0.75).
Note that we start the process by adjusting figure margins with par().
# adjust margins; units are lines of text: bottom, left, top, right; adjust as needed
par(mar=c(0, 0, 0, 1))
# make a SoilProfileCollection plot
plotSPC(my.sub.set, label = 'pedon_id',
id.style = "side", cex.names = 0.75,
x.idx.offset = 0.1)
Setting the x.idx.offset (e.g. x.idx.offset = 0.1) can sometimes help with displaying plots with small numbers of profiles. This is a byproduct of some of the limitations of the base R graphics system.
To get these SPC plots to look “nice” there are many other options for adjusting offsets for plots and axis positions, as well as labels, colors, legends, etc., that can be found under ?plotSPC.
Now we will take a look at the main parts of the loafercreek SPC.
Number of sites in the loafercreek@site data.frame
length(loafercreek)
## [1] 106
Number of horizons in the loafercreek@horizons data.frame
nrow(loafercreek)
## [1] 626
You can see from the ratio of rows in loafercreek@horizons to rows in loafercreek@site (626:106) that there is a many:one relationship.
Many:one relationships are very common at lower levels (i.e. the NASIS child tables of Site, Component, or Pedon Horizon).
This is how much of the information we describe in soil survey (geomorphology, color, structure, rock fragments, etc.) is encoded in soil data structures.
It is convenient to have some of the contents of the child tables “flattened” to site- or horizon-level attributes, so that they are 1:1 with either the number of records in site(loafercreek) or the number of records in horizons(loafercreek).
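For instance, counting horizon records per pedon collapses a many:one relationship into one value per site – a quick sketch using only functions introduced above:

# count horizon records per pedon: a 1:1, site-level summary
hz.counts <- table(horizons(loafercreek)$peiid)
head(hz.counts)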
fetchOSD()
Here is an example showing a SoilProfileCollection created using the soilDB function fetchOSD().
A list of soil series names (series.names) that are geographically or conceptually associated with metavolcanic rocks in the Sierra Nevada Foothills is supplied.
The extended argument is set to TRUE to return a list containing a SoilProfileCollection of type location descriptions parsed from the Official Series Description (OSD), along with additional extended data.
library(soilDB)
series.names <- c("Argonaut", "Auburn", "Bonanza",
"Exchequer", "Gardellones", "Gopheridge",
"Jasperpeak", "Loafercreek", "Motherlode",
"Sobrante")
osds <- fetchOSD(soils = series.names, extended = TRUE)
# adjust margins; units are lines of text: bottom, left, top, right; modify as needed
par(mar=c(0, 0, 0, 1))
plotSPC(osds$SPC, max.depth=200)
fetchKSSL()
Deeper soils on metavolcanic Sierra Nevada foothills (thermic, typic xeric) commonly have a pedogenic clay increase (argillic horizon). Due to the quantity of mafic minerals in the parent rock, the soils can weather to be quite red.
The KSSL data show a substantial amount of dithionite-citrate extractable (pedogenic) iron for Loafercreek [all taxon kinds; n=10 profiles, h=33 horizons with non-NA Fe-d].
The median Fe-d for Loafercreek KSSL data is 3.1%, with a minimum of 2% and maximum of 7.8%, by mass.
These are somewhat crude statistics, as summarizing across all horizons in the SPC obscures between-pedon variation and/or covariance with depth. Regardless, these values are relatively high.
Check for yourself. Use soilDB::fetchKSSL() to make a SPC of the lab pedons correlated to Loafercreek. Or, you can get KSSL pedons for any other taxonname, MLRA, or rectangular bounding box. What other attributes are available from KSSL?
library(soilDB)
# use the `series` argument to specify taxonname. or use `mlra` or `bbox`
# ?fetchKSSL for details
k <- fetchKSSL(series = "loafercreek")
# count the number of profiles (pedons) in `k`
n.pedons <- length(k)
# calculate some basic univariate summary statistics on _all_ horizons in the SPC
median(k$fe_dith, na.rm = TRUE)
min(k$fe_dith, na.rm = TRUE)
max(k$fe_dith, na.rm = TRUE)
# here you would inspect the data further ...
# other attributes or by horizon designation, perhaps?
A SPC can be promoted to contain an object of type SpatialPoints. This is the same idea as promoting an object like a data.frame to SpatialPointsDataFrame.
The spatial slot (spc@sp) will contain the information.
loafercreek contains NCSS-standard (WGS84 decimal degrees) latitude (y_std) and longitude (x_std) in the loafercreek@site table.
Let’s make it Spatial!
# set spatial coordinates to create a Spatial object
coordinates(loafercreek) <- ~ x_std + y_std
Inspect the spatial slot.
slot(loafercreek, 'sp')
## class : SpatialPoints
## features : 106
## extent : -121.4412, -120.2611, 37.68111, 39.51897 (xmin, xmax, ymin, ymax)
## crs : NA
OK, so the spatial points are in there… but the coordinate reference system (CRS) is NA.
We know what the CRS is. The data originate from NASIS, where it is standard to use WGS84 decimal degrees. Decimal degrees are the values contained in x_std and y_std when we fetchNASIS(). So, let’s set the CRS for loafercreek.
# when you set the proj4string, be sure it matches the formula
# and the system/format of the data you sent to coordinates() above
proj4string(loafercreek) <- '+proj=longlat +datum=WGS84'
slot(loafercreek, 'sp')
## class : SpatialPoints
## features : 106
## extent : -121.4412, -120.2611, 37.68111, 39.51897 (xmin, xmax, ymin, ymax)
## crs : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
Looks good. These SpatialPoints are safe to spTransform() (e.g. to UTM) because their coordinate reference system is now defined.
Note that it is not good practice to use the @ slots to edit data, though they are available for inspection. Generally, slots are for internal use by S4 objects; special functions (like coordinates() and proj4string()) are designed to access/alter them.
You can get a separate SpatialPointsDataFrame (with just the site-level data) from an SPC using the code below. We “coerce” the loafercreek SoilProfileCollection to SpatialPointsDataFrame using a custom S4 method that is part of the internal definition of the SPC.
# loafercreek is a SoilProfileCollection
class(loafercreek)
## [1] "SoilProfileCollection"
## attr(,"package")
## [1] "aqp"
# coerce the loafercreek object to SpatialPointsDataFrame
loafercreek.spdf <- as(loafercreek, 'SpatialPointsDataFrame')
# loafercreek.spdf is a SpatialPointsDataFrame WITH JUST THE SITE DATA
# the S4 method defined in the SPC does this by design
class(loafercreek.spdf)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
nrow(loafercreek.spdf)
## [1] 106
Note that in the coercion call as() you should NOT use site(loafercreek), as that would return the ordinary data.frame loafercreek@site (without the SpatialPoints stored in loafercreek@sp).
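For example, a minimal sketch of reprojecting the points to UTM (this assumes the sp and rgdal packages are installed; spTransform() needs rgdal for sp objects):

library(sp)
# project the pedon locations to UTM zone 10N (EPSG:26910)
loafercreek.utm <- spTransform(loafercreek.spdf, CRS('+init=epsg:26910'))
proj4string(loafercreek.utm)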
# plot pedon locations as points
plot(loafercreek.spdf, sub = "Loafercreek Pedon Locations", pch = 19, cex = 0.5)
# "add" county level maps from maps package
maps::map(database = 'county', regions = 'CA', add = TRUE)
# add a neatline
box()
Note that two pedons from CA612 are included (one is the type location), but clearly most of the pedons are clustered further south (in CA630).
You can build your own SoilProfileCollection by reading site/horizon data from flat files (e.g. with read.csv) or by querying a database into data.frames.
Here we construct some fake site and horizon data.frames for demonstration.
If you go the flat-file route, your files should mimic the format of the site and horizon data objects we create with R code below.
Horizon data are related to site data via a site ID column (in this example id).
For most applications it is expected that each unique site ID has a set of non-overlapping top and bottom depths (in this example top and bottom).
Site and horizon IDs can be accessed from an SPC using profile_id() and hzID(). The column names used for storing those IDs are accessible with idname() and hzidname(). That means you can name the unique ID and depth variables whatever you want.
## make 10 site ids `id`, each with site-level attributes `di` and `mlra`
new.site.data <- data.frame(id = 1:10, di = 10 - 0:9, mlra = "18")
head(new.site.data)
## id di mlra
## 1 1 10 18
## 2 2 9 18
## 3 3 8 18
## 4 4 7 18
## 5 5 6 18
## 6 6 5 18
## or read your data from file
# your.site.data <- read.csv("your.site.data.csv")
## make 10 random horizon datasets, with site id, top and bottom
## horizon designation and 5 random horizon level attributes (p1, p2...)
your.spc <- do.call('rbind', lapply(new.site.data[['id']], random_profile))
head(your.spc)
## id top bottom name p1 p2 p3 p4 p5
## 1 1 0 20 H1 -2.063383 -5.230548 5.020055 3.350988 -9.239306
## 2 1 20 47 H2 -8.630441 -28.434761 6.816337 2.592330 1.490403
## 3 1 47 57 H3 -24.833698 -23.588662 8.870240 -9.105395 13.054354
## 4 1 57 71 H4 -33.254646 -24.264202 6.033142 -5.547809 10.744784
## 5 1 71 87 H5 -36.134003 -34.593398 18.754336 -4.282854 13.941819
## 6 1 87 113 H6 -41.770485 -47.490161 22.944712 -22.955786 28.945171
## or read your data from file.
# your.spc <- read.csv("your.horizon.data.csv")
# promote horizon data.frame (in this case 10 random profiles) to SPC
depths(your.spc) <- id ~ top + bottom
# merge site data into the site slot of the SPC based on common site id `id`
site(your.spc) <- new.site.data
# calculate horizon thickness and ratio of variable p1 to p2
new.horizon.data <- data.frame(id=your.spc$id,
hzID=your.spc$hzID,
hz_thickness=(your.spc$bottom - your.spc$top),
ratio=(your.spc$p1 / your.spc$p2))
# merge horizon data into the horizon slot of the SPC
# based on common site id `id` and horizon id `hzID`
horizons(your.spc) <- new.horizon.data
# new.horizon.data contains both unique site and horizon ID, so merge is possible
# check to see if the horizon data merge worked
head(your.spc)
## Object of class SoilProfileCollection
## Number of profiles: 6
## Depth range: 33-113 cm
##
## Horizon attributes:
## id hzID top bottom name p1 p2 p3 p4 p5 hz_thickness
## 1 1 1 0 20 H1 -2.063383 -5.230548 5.020055 3.350988 -9.239306 20
## 2 1 2 20 47 H2 -8.630441 -28.434761 6.816337 2.592330 1.490403 27
## 3 1 3 47 57 H3 -24.833698 -23.588662 8.870240 -9.105395 13.054354 10
## 4 1 4 57 71 H4 -33.254646 -24.264202 6.033142 -5.547809 10.744784 14
## 5 1 5 71 87 H5 -36.134003 -34.593398 18.754336 -4.282854 13.941819 16
## 6 1 6 87 113 H6 -41.770485 -47.490161 22.944712 -22.955786 28.945171 26
## ratio
## 1 0.3944870
## 2 0.3035173
## 3 1.0527811
## 4 1.3705230
## 5 1.0445347
## 6 0.8795608
##
## Sampling site attributes:
## id di mlra
## 1 1 10 18
## 2 2 9 18
## 3 3 8 18
## 4 4 7 18
## 5 5 6 18
## 6 6 5 18
# inspect attribute names storing critical SPC information
idname(your.spc)
## [1] "id"
hzidname(your.spc)
## [1] "hzID"
horizonDepths(your.spc)
## [1] "top" "bottom"
head(profile_id(your.spc)) #unique site/profile ids
## [1] "1" "2" "3" "4" "5" "6"
head(hzID(your.spc)) #unique horizon ids (assigned at time of SPC creation)
## [1] 1 2 3 4 5 6
soilDB provides a variety of ways to import soil data into SPCs.
fetchOSD() for getting profile information from series type locations.
fetchKSSL() for querying data from a snapshot of the Kellogg Soil Survey Laboratory database.
fetchNASIS() for accessing NASIS pedons and components via an ODBC connection.
fetchSDA_component() & SDA_query() for SSURGO tabular or spatial data via Soil Data Access (SDA).
Watch out for future NASIS, SDA and flat file specific modules.
The packages aqp and sharpshootR provide functions for interacting with and summarizing the contents of SPCs.
profileApply()
The aqp function profileApply() allows us to evaluate a function on each profile in a SPC.
It emulates the base R *apply functionality (apply, lapply, mapply, etc.), but instead of iterating over an array, list, or matrix, it iterates over profiles within a SPC.
The number of results returned is equal to the number of profiles in the SPC.
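For instance, a minimal call (analogous to sapply() over a list) returns one value per profile in loafercreek:

# one result per profile: bottom depth of the deepest described horizon
deepest <- profileApply(loafercreek, function(p) max(horizons(p)$hzdepb, na.rm = TRUE))
head(deepest)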
profileApply() - method in aqp for SoilProfileCollection objects
profileApply(object, FUN, simplify=TRUE, ...)
Arguments
object - a SoilProfileCollection
FUN - a function to be applied to each profile within the collection
simplify - logical, should the result list be simplified to a vector?
... - further arguments passed to FUN
Here, we use the aqp function estimateSoilDepth() to demonstrate how profileApply() works.
estimateSoilDepth() uses a regular expression (regex pattern) to match horizon designations and returns the top depth of the first horizon matching a given pattern.
The default settings are designed for bedrock restrictions, so the default pattern matches horizon designations Cr, R or Cd (p = "Cr|R|Cd").
Create a numeric vector depth.to.contact to hold the result of calling estimateSoilDepth() on each profile in loafercreek.
depth.to.contact <- profileApply(loafercreek, estimateSoilDepth)
Let’s see how our depth.to.contact data look using a density plot.
density() provides an estimate of the probability density function (pdf) for a single variable. Probability densities are estimated from sample data using a specific kernel and smoothing window. The default kernel is the “gaussian” (normal) distribution. See ?density.
# look at a density (continuous frequency) plot; depth on x axis
plot(density(depth.to.contact, na.rm = TRUE))
The median (66 cm) is the 0.50 quantile of depth.to.contact.
Quantiles are values occurring within the range [min = 50, max = 148] of a “sample” (in this case the sample is 106 depths to contact).
Let’s summarize depth.to.contact using quantile().
quantile(depth.to.contact,
probs = c(0,0.01,0.05,0.25,0.5,0.75,0.95,0.99,1),
na.rm = TRUE)
## 0% 1% 5% 25% 50% 75% 95% 99% 100%
## 50.00 51.00 53.25 60.00 66.00 76.00 92.00 98.90 148.00
A sample quantile “splits” your data at a particular level (specified with probs) based on the estimated probability mass along the range [min, max] of your data (depth.to.contact).
The minimum and maximum of your sample data correspond to the quantiles at probability levels 0 and 1, respectively. When the nominal probability level is expressed as a percentage, quantiles are called percentiles.
So, when you say a range is the “5th to 95th percentiles,” you are excluding 10 percent of the observed data and describing the range of the “central” 90%.
Where is the 5th to 95th percentile range on the graph above? Does it describe the central tendency of the data well?
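One way to check is to re-draw the density plot and mark those percentiles:

# re-draw the density plot, then mark the 5th and 95th percentiles
plot(density(depth.to.contact, na.rm = TRUE))
abline(v = quantile(depth.to.contact, probs = c(0.05, 0.95), na.rm = TRUE), lty = 2)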
You can estimate quantiles at different probability levels by changing the argument probs, which takes a numeric vector of probabilities [0, 1] you would like quantiles for.
There are several methods for calculating quantiles depending on the variable/analysis of interest (see the type argument in ?quantile).
There is one deep “Loafercreek” in loafercreek with a bedrock contact at 148 cm. Say, for example, you didn’t expect any pedons in your dataset that far outside the “Loafercreek series” depth class.
We use the %in% operator, with a site-level attribute/index specified first, to produce a logical vector. We will then index loafercreek with that vector to make SPCs based on the “bad” pedon.
bad.peiid <- c("542129")
# SPC with just the "bad" pedon (this one isn't that bad)
deep.one <- loafercreek[site(loafercreek)$peiid %in% bad.peiid, ]
length(deep.one)
## [1] 1
# the inverse: loafercreek without the "bad" one
loafernew <- loafercreek[!(site(loafercreek)$peiid %in% bad.peiid), ]
length(loafernew)
## [1] 105
You can clearly see the “deep” observation in the loafercreek$bedrckdepth density() plots above. You also may have noticed 148 was well above the 99th percentile when we looked at quantiles. The deep pedon is correlated to Motherlode (see loafercreek$taxonname). If you keep the default estimateSoilDepth() settings, the depth returned for the deep pedon is 148. But we could have excluded it.
If we tweak the no.contact.* settings to return NA if the contact isn’t found within 100 cm (moderately deep), the depth returned is NA. Had we used this setting, all deep observations (just one) would be excluded (returning NA).
estimateSoilDepth(deep.one)
## [1] 148
estimateSoilDepth(deep.one, no.contact.depth = 100, no.contact.assigned = NA)
## [1] NA
By changing the default settings of estimateSoilDepth() you can calculate many types of “depth-to-X feature.” estimateSoilDepth() works using text-based matching on any horizon-level character attribute (examples: dry hue, texture class, in-lieu texture, effervescence class, cementation class).
Possible modifications to the default pattern to calculate depth to other types of diagnostic features would be duripan (p = 'qm') or carbonates (p = 'k'). However, as mentioned, the feature whose depth we are calculating doesn’t have to be part of a horizon designation.
More “similar” sets of soils might be expected to have less variation in the depth to an “important” characteristic.
Your analysis should always consider that, for individual profiles, feature X may not occur at any depth, or that you may have insufficient data to determine whether it occurs.
Here are some other examples:
depth to lithologic discontinuity - estimateSoilDepth(pedon, p = '^2')
depth to gley (reduced; dominant color chroma of 2 or less) - estimateSoilDepth(pedon, p = 'g')
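For instance, a sketch applying the discontinuity pattern to every profile in loafercreek with profileApply(); the column name depth.to.discontinuity is our own, and NA is returned where no match occurs above 200 cm:

# depth to lithologic discontinuity for each profile; NA if none found above 200 cm
loafercreek$depth.to.discontinuity <- profileApply(loafercreek, estimateSoilDepth,
                                                   p = '^2', no.contact.depth = 200,
                                                   no.contact.assigned = NA)
table(is.na(loafercreek$depth.to.discontinuity))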
Using R to encode logical constraints on data (“calculated”) will help you to ensure the values populated in the database (“stored”) are consistent with one another.
As an example of how you might do this, we will compare the calculated values for depth to contact (which are based on Pedon Horizon designation and depth) with the ones pre-loaded in the loafercreek dataset (populated in the Site Bedrock NASIS child table).
If the stored and calculated match, they will plot on the 1:1 line.
# plot "stored vs. calculated" depth to contact
plot(loafercreek$bedrckdepth ~ depth.to.contact, xlim = c(0,200), ylim = c(0,200))
# add a 1:1 line with intercept 0 and slope 1
abline(0, 1)
Some plot off the 1:1 line – these might be worth inspecting in NASIS! We trust the calculation (this time).
Of course, in practice, you need to be careful about things like the cementation class of bedrock layers, and the possibility of multiple horizons not meeting, e.g., densic criteria. These values could be written to a file to flag NASIS record IDs in need of review and potential revision.
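As a sketch of that idea, you could collect the mismatched records and write them out for review (the column selection and file name here are hypothetical):

# find profiles where stored and calculated depth to contact disagree
mismatch.idx <- which(loafercreek$bedrckdepth != depth.to.contact)
# write the corresponding site records to file for NASIS review
# write.csv(site(loafercreek)[mismatch.idx, c('peiid', 'pedon_id', 'bedrckdepth')],
#           file = 'bedrock-depth-review.csv', row.names = FALSE)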
Let’s keep the calculated values, and update the site attribute in the SPC, because we will use the bedrock depth values later to check something else. It will be good to know we can rely on those values being consistently determined.
The depth.to.contact variable we created above is a numeric vector of depths to a bedrock contact, calculated for each profile using profileApply() and estimateSoilDepth() on loafercreek.
The loafercreek dataset already has a variable called bedrckdepth in the loafercreek@site table. We want to replace it with the new values. The code would be the same if bedrckdepth were new.
Access the loafercreek site attribute bedrckdepth using $ and set its value to depth.to.contact with the assignment operator <-.
loafercreek$bedrckdepth <- depth.to.contact
Since length(depth.to.contact) == length(loafercreek), the SPC is smart enough to know we are editing a loafercreek@site attribute.
Tip: To remove an attribute from a data.frame, use the assignment operator to set it to NULL.
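A quick demonstration with an ordinary data.frame:

# create a throwaway data.frame, then drop column `b` by setting it to NULL
d <- data.frame(a = 1:3, b = 4:6)
d$b <- NULL
names(d)
## [1] "a"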
To resolve ambiguity about whether you are accessing loafercreek@site or loafercreek@horizons, you need to specify a site- or horizon-index (length equal to the number of rows) and/or use site(loafercreek) or horizons(loafercreek).
Unpredictable results or errors may arise if you do not specify which portion of the SPC you are indexing when you reference loafercreek.
Furthermore, the site- or horizon-index you use should not contain NA.
# create new variable with clay proportion, calculated from %
loafercreek$new.hz.level.variable <- loafercreek$clay / 100
Since length(loafercreek$clay) == nrow(loafercreek) (the number of horizons in the SPC), the SPC is smart enough to know we are editing a horizon attribute called new.hz.level.variable. new.hz.level.variable does not exist in loafercreek@horizons, so it will be created.
In R, you define your own functions using function() and the assignment operator <-.
your.function.name <- function(...) { 2 + 2 }
A function takes zero or more inputs (...), executes some code (2 + 2), and produces some result or output (4).
The curly braces { } denote the body of a function. The input variables (specified inside the parentheses of function(...)) are only “in scope” inside the braces. Variables created inside the body of the function while it is executing are destroyed when the function returns.
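A quick demonstration of scope (this assumes no variable named y already exists in your workspace):

f <- function(x) {
  y <- x + 1  # `y` exists only while f() is executing
  return(y)
}
f(1)
## [1] 2
exists("y")
## [1] FALSE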
If executing the logic in a function body brings the process to a return() statement, the result is returned to the parent function or interface (e.g. console) from which the function was called. If it reaches stop() or warning(), an error or warning will be returned (a small demonstration follows the temperature example below).
A function typically returns a value with return(). It also may:
return the result of the last expression within the function body
perform various other operations with the data (e.g. make plots to a graphics device)
manipulate objects outside scope of the function
The basic format for defining an R function is:
# your input `temp_F` is a numeric vector in degrees Fahrenheit
fahrenheit_to_celsius <- function(temp_F) {
# perform the conversion from input to output
temp_C <- ((temp_F - 32) * (5 / 9))
# return output, a numeric vector in degrees C
return(temp_C)
}
Let’s use it on a vector of three temperatures in degrees Fahrenheit.
# a numeric vector in degrees fahrenheit; e.g. Soil Temperature Regime breaks
temp.regime.F <- c(46.4, 59, 71.6)
# call the function we defined, and store the output
temp.regime.C <- fahrenheit_to_celsius(temp.regime.F)
temp.regime.C
## [1] 8 15 22
A bit contrived, but if you understand these mechanics you can write very powerful analyses quickly that can be then applied in batch.
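Before moving on, here is a small made-up sketch showing return(), warning() and stop() together in one function (safe_sqrt() is our own name, not an existing function):

safe_sqrt <- function(x) {
  if (!is.numeric(x))
    stop("x must be numeric")               # raises an error and exits
  if (any(x < 0, na.rm = TRUE))
    warning("negative values produce NaN")  # raises a warning, keeps going
  return(sqrt(x))                           # returns the result
}
safe_sqrt(c(4, 9))
## [1] 2 3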
Let’s apply functions to soils. The function defined below calculates the profile maximum clay content. It is designed for use with profileApply().
The argument p is an individual pedon. profileApply() iterates over the loafercreek profiles one by one, passing each to the function profileMaxClay() for evaluation.
profileMaxClay() applies the function max() to the attribute clay in a profile p.
profileMaxClay <- function(p) {
# access the numeric vector `clay` from the horizons data.frame
d <- horizons(p)$clay
# if all `d` are NA, max(d, na.rm = TRUE) returns -Inf with a warning
if(all(is.na(d)))
return(NA) # so if a pedon `p` has no clay data... return NA pre-emptively
# calculate the maximum value in `d`, omitting NA
return(max(d, na.rm = TRUE))
}
Let’s apply profileMaxClay() to all profiles in loafercreek and look at the distribution of the results.
We find that our profile maximum clay contents span the fine-loamy clay content range (18-35%), with quite a few of them higher than that range (>35% clay).
# calculate the max clay content in all profiles
loafercreek$maxclay <- profileApply(loafercreek, profileMaxClay)
Let’s visualize profile maximum clay with a density plot.
We use a rectangular kernel (smoother) for this demo. Compare the result with kernel = "gaussian".
# look at the density plot (estimated probability density function) of maximum clay contents
plot(density(loafercreek$maxclay, na.rm = TRUE, kernel = "rectangular"))
# calculate quantiles
quantile(loafercreek$maxclay, probs = c(0,0.01,0.05,0.25,0.5,0.75,0.95,0.99,1), na.rm = TRUE)
## 0% 1% 5% 25% 50% 75% 95% 99% 100%
## 18.000 18.000 20.000 25.200 30.000 35.000 48.200 51.572 60.000
Another value that could be of interest is the DEPTH to the maximum clay content. Feel free to modify profileMaxClay() to return that.
Or, better yet, write a generalized function to return arbitrary horizon-level attributes.
For this extension of profileMaxClay(), the function we just wrote, we add the attr argument. attr takes a character string (an attribute in horizonNames(loafercreek)) as an argument.
Instead of just returning the maximum clay content, the value of the requested attribute attr is returned from the first horizon with the maximum clay content.
For example: you could return “hzdept” (horizon top depth), “phfield” (field-measured pH), or “total_frags_pct” (total fragment volume). We will use it to get the top depth.
profileMaxClayAttr <- function(p, attr = "clay") {
# get the horizon data
h <- horizons(p)
# get clay content in vector d (you could change this...)
d <- h$clay
# pre-emptively return NA if all clay are NA
if(all(is.na(d)))
return(NA)
# what horizon(s) have maximum clay content (in profile)
d.max.idx <- which(d == max(d, na.rm = TRUE))
# there may be multiple horizons with the max clay content...
# we just care about the first one (closest to the surface)
# flattening a (possible) many:one relationship
d.max.idx.first <- d.max.idx[1]
# return an arbitrary attribute `attr`
# default returns clay from horizon of max clay
# you set attr to return attr from horizon of max clay
return(h[d.max.idx.first, attr])
}
This is a simple stub function you can alter and expand as needed. It contains basic NA handling, subsetting, example logic, etc. Of course, you can change the logic that selects the horizon(s) of interest (i.e. not clay content, not the first horizon) and the summary statistic (max()) for your own analysis.
But, for this example, we calculate the max clay for each profile, and then the depth to that clay content, again for each profile.
# overwrite with identical values as obtained with profileMaxClay()
# from new generalized function! using default attr='clay'
loafercreek$maxclay <- profileApply(loafercreek, profileMaxClayAttr)
# calculate the DEPTH TO first horizon with max clay content in all profiles
# note we change default attr to return hzdept instead of clay
loafercreek$maxclaydepth <- profileApply(loafercreek,
profileMaxClayAttr, attr="hzdept")
Inspect the depth to maximum clay distribution. It appears less skewed than the clay contents themselves.
# look at the density plot (estimated probability density function)
# of minimum depth to maximum clay content
plot(density(loafercreek$maxclaydepth, na.rm = TRUE))
# calculate quantiles
quantile(loafercreek$maxclaydepth,
probs = c(0,0.01,0.05,0.25,0.5,0.75,0.95,0.99,1),
na.rm = TRUE)
## 0% 1% 5% 25% 50% 75% 95% 99% 100%
## 10.0 10.0 18.9 33.0 40.0 49.5 71.3 81.1 86.0
Food for thought:
How often does the maximum clay content occur in the particle size control section, or only in a small portion of it? See aqp::estimatePSCS().
Create a function called profileMinimumKsat() to identify potential soil mapunit component limitations due to low hydraulic conductivity.
Use soilDB and fetchNASIS_components() or fetchSDA_component() to create a SPC, or use your own data if you have Ksat information.
Calculate the minimum Ksat and the shallowest depth to that minimum Ksat for all components in your SPC (using profileApply()).
Bonus: add an argument to ignore Ksat values below a user-specified depth before calculating the minimum. A minimal sketch follows.
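A minimal sketch of such a function, patterned on profileMaxClay() above; the horizon-level attribute name ksat_r is an assumption (it matches component horizon data), so adapt attr to your data:

profileMinimumKsat <- function(p, attr = 'ksat_r') {
  # extract the Ksat values for one profile/component
  # (the attribute name is an assumption; change as needed)
  d <- horizons(p)[[attr]]
  # pre-emptively return NA when there are no Ksat data
  if (all(is.na(d)))
    return(NA)
  return(min(d, na.rm = TRUE))
}
# e.g.: my.components$minksat <- profileApply(my.components, profileMinimumKsat)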
We will see if two indicators of landscape stability and pedogenic development (clay content and soil color “redness”) are related within the loafercreek dataset.
We will use the Hurst redness index concept as defined below (hereafter “redness”). Then we will define a “cutoff depth” that is “appropriate” for the soils in our set, to simulate a taxonomic or interpretive limit of some sort. Readers are encouraged to consider alternate concepts of redness, or ways they can alter this workflow for their needs.
This workflow applies to any numeric criterion applied to horizon data / profiles. There are many possible examples available in U.S. Soil Taxonomy. For this example, we chose a morphologic, rather than a taxonomic, index – but it is something that fits nicely with the data (and lack thereof) in loafercreek and is not too computationally complex.
We showed at the beginning of this document (the fetchKSSL() demo) that there can be quite a bit of “pedogenic” iron in Loafercreek and the soils correlated to it.
In areas with younger colluvial material at the surface, or due to variation within the parent rock, the colors may range duller (lower chroma) and yellower. Commonly, the weathered bedrock has 10YR or yellower hues.
Our hypothesis is: soils that are shallower to a “red” redness index will have higher clay content because they have older, more weathered materials near the surface. We won’t be measuring the age or specific pedogenic iron content, but we hope to use soil color as a surrogate.
In these relatively old, residual, bedrock-controlled landscapes, several different colluvial “events” (or historic periods of greater hillslope transport activity) may have buried residual materials – which contain enough iron to weather to be quite red.
In the most active portions of the landscape, little to no true soil material “residuum” may be present. The soil materials may consist primarily of colluvium deposited on top of “scoured” bedrock. Some of that colluvium may have come from eroded argillic horizons, or may have been in place long enough to weather.
Conversely, in other landscape positions, erosion could have exposed highly weathered materials at the surface.
We hypothesize that the variation we see in loafercreek is in part due to the complex colluvial history of parts of these landscapes. The natural variation is coupled with unclear or inconsistent indicators of lithologic discontinuities in the field, as well as in the profile descriptions.
We will calculate a classic “redness index” from Hurst (1977), Visual estimation of iron in saprolite. We will use dry soil color to apply Hurst’s method to our data.
Basically, each horizon dry hue (\(H\)) is converted to a numeric equivalent (\(H^\star\)). This is a linearization of a portion of the Munsell color wheel.
To obtain the “redness index,” \(H^\star\) is multiplied by the ratio of dry value (\(L\); lightness) to dry chroma (\(C\)):
\(Redness\ Index = H^\star \cdot (L / C)\)
First, we need to convert the d_hue data in loafercreek over to these \(H^\star\) values. Let’s create a look-up table.
# create a named numeric vector as a "lookup table"
hue.lookup.table <- seq(5, 22.5, 2.5)
names(hue.lookup.table) <- c('5R','7.5R','10R','2.5YR',
'5YR','7.5YR','10YR','2.5Y')
Look at the look-up table.
Dry Hue (H) | H* |
---|---|
5R | 5.0 |
7.5R | 7.5 |
10R | 10.0 |
2.5YR | 12.5 |
5YR | 15.0 |
7.5YR | 17.5 |
10YR | 20.0 |
2.5Y | 22.5 |
Redder hues have lower \(H^\star\) values – this will lower the \(Redness\ Index\) assigned to a particular Hue-Value-Chroma combination.
# determine H* using the lookup table
hstar <- hue.lookup.table[loafercreek$d_hue]
# calculate Hurst (1977) "Redness Index" H*(L/C)
loafercreek$hri <- hstar * loafercreek$d_value / loafercreek$d_chroma
The ratio of value to chroma (\(L / C\)) helps discern the systematic relationships in value and chroma where hue fails on its own.
Looking at the equation for the \(Redness\ Index\), we can see that if value (\(L\)) and chroma (\(C\)) are both high or both low, the \(L / C\) ratio is near one, so the \(H^\star\) (dry hue) dominates the redness index.
At high value and low chroma, the \(Redness\ Index\) gets larger.
Conversely, at low value and high chroma, the \(Redness\ Index\) gets smaller.
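For example, work through the index for two dry colors using the H* lookup table above:

# 10YR 5/3 (dull brown): H* = 20, L = 5, C = 3
20 * (5 / 3)
## [1] 33.33333
# 5YR 3/8 (strong red): H* = 15, L = 3, C = 8
15 * (3 / 8)
## [1] 5.625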
We will classify our data based on the top depth of the shallowest horizon with a \(Redness\ Index\) less than or equal to 20.
This provisional threshold was chosen for this demonstration after some preliminary inspection of the data (similar to the sections that follow) and review of Figure 3 from Hurst (1977) in light of the Loafercreek data.
loafercreek$horizon.is.red <- loafercreek$hri <= 20
summary(loafercreek$horizon.is.red)
## Mode FALSE TRUE NA's
## logical 173 152 301
Let’s use profileApply() to calculate the top depth of the first horizon meeting our redness criterion in each profile of loafercreek.
To do the work, we define an anonymous function that takes the logical vector horizon.is.red from each individual profile’s horizon data, finds which horizon indexes have a TRUE value, and, from that set of horizons, takes the top depth of the first (shallowest) one.
loafercreek$depth.to.red <- profileApply(loafercreek, function(p) {
# access horizon slot of a single profile `p` from loafercreek
h <- horizons(p)
# calculate indices where horizon.is.red == TRUE, take the first
shallowest.match.idx <- which(h$horizon.is.red)[1]
# return top depth of first horizon matching
return(h[shallowest.match.idx, 'hzdept'])
})
Next we will make a density plot. We will cut off the density plot at the maximum depth (+1) to redness we found in our data.
# calculate top depth of deepest horizon that met our redness criteria
density.cutoff <- max(loafercreek$depth.to.red, na.rm=T)+1
density.cutoff
## [1] 54
plot(density(loafercreek$depth.to.red,
from = 0, to = density.cutoff, na.rm=T))
Here is the subset of pedons that have a red color (a Hurst redness index of 20 or less) within 20 cm of the surface: “20-in-20.”
We use the aqp function subsetProfiles() and a site-level attribute (s) matched using a character string containing a logical expression.
# subset of profiles with depth.to.red <= 20
sub1 <- subsetProfiles(loafercreek, s = 'depth.to.red <= 20')
Profiles can also be subsetted on a horizon-level attribute (h).
Here is how you could do that if you wanted to ignore the depth-threshold part of the example analysis (which is a bit arbitrary – even though it appears to capture a “distinct” group of soils based on the density() plot):
# subset of profiles with any horizon.is.red == TRUE
sub1.alternate <- subsetProfiles(loafercreek, h = 'horizon.is.red == TRUE')
Let’s count the number of profiles in sub1 and sub1.alternate.
# has red color within 20 cm
length(sub1)
## [1] 44
# has red color at any depth
length(sub1.alternate)
## [1] 54
OK. We will continue on with just sub1 – since there is a big drop-off in density around 20 cm, we want to keep those observations separate for now.
Let’s make a plot to visually inspect the group.
# adjust margins; units are lines of text: bottom, left, top, right
# be sure to leave room for legend
par(mar=c(0, 0, 3, 2))
# convert legend variable to factor for coloring
sub1$horizon.is.red <- factor(sub1$horizon.is.red)
# calculate the plotting order
plotting.order <- order(sub1$depth.to.red)
# make a plot; use some exaggerated red and brown colors to display
# horizon class membership for our threshold
plotSPC(sub1, max.depth = 100, print.id = FALSE,
plot.order = plotting.order,
axis.line.offset = -0.5, name = '',
col.palette = parseMunsell(c('10YR 5/3', '5YR 3/8')),
width = 0.4, color = 'horizon.is.red')
# we sorted the plot on "depth to red" to add a line to guide our eye
# plot a white line, with black dots along it
lines(1:length(sub1), sub1$depth.to.red[plotting.order], col = "white", lwd = 2)
lines(1:length(sub1), sub1$depth.to.red[plotting.order], lwd = 3, lty = 3)
# add an axis with depth to redness (corresponds to dotted line)
axis(side = 1, at = 1:length(sub1), cex.axis = 0.5,
labels = sub1$depth.to.red[plotting.order])
mtext(text = 'Depth to Red (cm)', side = 1, line = 2.5)
OK. So sub1 contains the pedons that are shallow (<20 cm) to red colors. What about the rest of them?
We will put them in an SPC called sub2.
To make sub2, we use SPC square bracket notation, the %in% operator, and a logical vector for subsetting.
Since peiid (pedon record ID) exists in both the loafercreek@horizons and loafercreek@site tables, we have to specify which one we want to use.
To do that, we use a site-level index – an index with the same length as the number of sites/profiles in the SPC.
By accessing peiid from loafercreek@site with site(), you specify which peiid you want (and therefore its length).
Furthermore, the peiid from loafercreek is used on the left-hand side of the %in% operator to yield a vector of length equal to the length of loafercreek. If we had put it on the right-hand side, the vector would have the length of sub1 – which is not what we want.
We use that logical vector to index loafercreek to obtain the pedon record IDs that are NOT IN sub1.
# create a logical vector of `peiids` in `loafercreek` that are IN sub1,
# then invert it with NOT (!)
not.in.sub1 <- !(site(loafercreek)$peiid %in% site(sub1)$peiid)
# square bracket SPC subsetting
sub2 <- loafercreek[not.in.sub1,]
# how many in the "remainder" set?
length(sub2)
## [1] 62
Hmm…
par(mar=c(0, 0, 3, 2))
# convert legend variable to factor for coloring
sub2$horizon.is.red <- factor(sub2$horizon.is.red)
# depth cut off at 100cm, hide IDs and make it clear which have data
plotSPC(sub2, max.depth=100, print.id = FALSE,
axis.line.offset=-0.5, name = '',
width = 0.4, color = 'horizon.is.red',
col.palette = parseMunsell(c('10YR 5/3', '5YR 3/8')),
col.legend.cex=1.25, col.label='Horizon is red?')
OK. So, in the two profile plots we can visually see four cases.
some pedons have redness within 20cm depth (many at the surface)
some pedons have redness at greater than 20cm depth
some pedons do not have redness at any depth
some pedons do not have dry color information in any horizon
Also, affecting the first three cases: some pedons have only one or a few observations of dry color, with the rest NA – i.e. incomplete data.
Perhaps dry colors were only recorded for the surface to rule out an epipedon?
Incomplete data can be particularly concerning for analyses based on small datasets. You might have noticed by now that there are more moist color observations than dry ones in loafercreek.
Unusual estimates from incomplete cases can disproportionately affect the distribution of the group they are correlated to when treated as if they were “complete.”
We will deal with the NA after the next section. First we will briefly inspect the first subset sub1 of “shallow to red” profiles.
If we look at the distribution of our redness index across all horizons in sub1 with a density() plot, we see that the data span both sides of our provisional threshold of 20… with a peak on the “red” side of the line (smaller redness index = redder). This makes sense, since we cut our full dataset at 20 to make sub1.
# plot estimate of the probability density function for redness index
plot(density(sub1$hri, na.rm=T))
# add a vertical line at HRI=20 (where we put our "break" in the data)
abline(v=20, lty=2, lwd=2, col="RED")
We will group redness values by horizon label using genhz (NASIS Comp. Layer ID).
# modify the "genhz" (NASIS Comp. Layer ID) to combine BA and A
# we do this because there is only one BA horizon with dry color data
loafercreek$genhz[loafercreek$genhz == "BA"] <- "A"
# make a list of data frames split by "genhz" (NASIS Comp. Layer ID)
genhz.list <- split(horizons(sub1), f=sub1$genhz)
# calculate some quantiles of Hurst Redness Index for each genhz
# lapply applies an anonymous function to each data frame in genhz.list
qtiles.by.genhz <- do.call('rbind', lapply(genhz.list, function(d) {
n.obs <- sum(!is.na(d$hri))
names(n.obs) <- "n.obs"
return(c(quantile(d$hri,
probs=c(0,0.05,0.25,0.5,0.75,0.95,1),
na.rm=TRUE), n.obs))
}))
# remove NA rows
qtiles.by.genhz <- qtiles.by.genhz[complete.cases(qtiles.by.genhz),]
Let’s look at the quantiles of the redness results for the generalized soil horizons in sub1.
| 0% | 5% | 25% | 50% | 75% | 95% | 100% | n.obs |
---|---|---|---|---|---|---|---|---|
A | 10 | 12 | 15 | 19 | 25 | 32 | 40 | 52 |
Bt1 | 10 | 10 | 12 | 16 | 20 | 26 | 35 | 42 |
Bt2 | 8 | 10 | 12 | 17 | 19 | 29 | 40 | 47 |
Bt3 | 10 | 10 | 12 | 14 | 16 | 18 | 19 | 4 |
BCt | 6 | 10 | 17 | 19 | 22 | 29 | 34 | 10 |
Based on the above table, it appears that there are some trends in terms of generalized (correlated) genetic horizons: the lower argillic is redder – it has a lower \(Redness\ Index\).
This is a portion of loafercreek that had red colors at shallow depths; therefore, as you might expect, large proportions of the data are below the redness index threshold of 20 that we set.
We want to see if the provisional separation using the 20 cm depth separates distinct morphologies. To do that we will make a new site-level attribute called red.shallow to portray pedon class membership with respect to the “redness” depth threshold.
We do this using the typical spc$variable.name <- new.value assignment discussed above in the SoilProfileCollection section. new.value is either of length 1 (assigned to all sites) or length(spc) for a site-level variable.
loafercreek$red.shallow <- loafercreek$depth.to.red <= 20
NA values
We saw in the profile plots that the first group, sub1, pretty uniformly met the desired criteria. A few pedons had only a little bit of dry hue data, but it was red within 20 cm.
However, quite a few pedons in the “remainder” group sub2 have some issues.
It contains pedons where the depth to red is greater than 20 cm. It also includes some pedons that have dry color data, but no red dry colors. Furthermore, there are quite a few pedons with no dry hue data at all (all horizons NA).
52 out of 106 pedons returned NA for red.shallow in loafercreek. That includes the latter two groups (not red and no data). We need to fix that.
Check how many are NA for the grouping variable, relative to the total.
sum(is.na(loafercreek$red.shallow))
length(loafercreek)
We need to carefully consider the values that are NA in our data when calculating most summary statistics. It is OK to omit NA, or allow it to exist without contributing anything, but you need to know why the values are NA.
Oftentimes it is fine to use na.rm = TRUE because:
There is good reason the records are NA (e.g. bedrock)
The pattern of absence of the missing data is “random” and not “systematic”
We can’t run the function we need to run in the presence of NA
in the variable of interest so estimates are not skewed by pedons without complete data. However, it is inexcusable if valid observations inadvertently get kicked out of an analysis.
In this analysis, we want to treat the soils that have dry color data (but no red colors) differently from the soils with no dry color data at all.
We can easily check which pedons have all NA for dry color using profileApply(), and determine which of these cases are met. Then we can assign them a new group or put them in one of the existing groups.
Below, an anonymous function calls is.na(), which returns a logical (TRUE/FALSE) vector depending on whether each input element was NA or not. It has the same length as the input vector (in this case the horizon-level variable horizon.is.red).
Then it calls the function all(), which returns TRUE only if all of the inputs are TRUE. Note: any() is similarly useful.
all.na <- profileApply(loafercreek,
function(p) {
return(all(is.na(p$horizon.is.red)))
})
We want to make it so the only loafercreek$red.shallow values that remain NA are for the profiles with no dry color data at all.
# change value of profiles that have NA red.shallow, without all NA color
loafercreek$red.shallow[is.na(loafercreek$red.shallow) & !all.na] <- 2
We assign the value 2 because, conventionally, FALSE == 0 and TRUE == 1 (or any value greater than zero). We do this in anticipation of creating a categorical variable (factor) to hold our redness classes. The 2 group will be the pedons that have color data to “disprove” their redness.
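A quick illustration of that convention:

# logicals coerce to 0 and 1, leaving 2 free for a third class
as.numeric(c(FALSE, TRUE, NA))
## [1]  0  1 NA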
Now look at the groups we calculated.
summary(factor(loafercreek$red.shallow))
## 0 1 2 NA's
## 10 44 21 31
We want to use as many of the pedon descriptions as we can in this hypothetical classification based on depth to “redness.” Now there are fewer NA than we started with, but we still have some. The NA values are a problem for a site-level grouping variable, or any index for that matter.
SPCs and SPC-related functions (like many other R functions) require no NA in grouping/index variables, or will give warnings.
Assigning these pedons a class is a valid thing to do, as systematic exclusion of data – as is often necessary with NA – is not a statistically sound practice. If there are large numbers of pedons we can’t accommodate with a calculation and they return NA, it becomes highly questionable whether it is an appropriate analysis to attempt with the data we have.
Just because a grouping scheme describes the pedons you like, doesn’t mean you can apply it reliably to profiles in general, or, that it will be broadly applicable to other sets of soil profiles.
Create a SPC that is the subset of pedons in loafercreek that have one or more NA for clay in their soil horizons. Hint: ?any; and, are there any horizons you should exclude first?
The remaining NA values (and corresponding site/pedon data) need to be omitted when subsetting a SPC based on red.shallow. We will leave them in for now and filter them out as needed downstream.
Inspection of the plot of pedons with our grouping variable equal to NA correctly shows only pedons that lack dry hue data at all depths.
# adjust margins; units are lines of text: bottom, left, top, right
par(mar=c(0, 0, 0, 2))
plotSPC(loafercreek[which(is.na(loafercreek$red.shallow)),],
color = 'd_hue', max.depth=100,
print.id = FALSE, name = '', axis.line.offset=-0.5)
Looks good. And, likewise, here is the grouped profile plot showing our “redness” groups. They all have color data to back them up.
# adjust margins; units are lines of text: bottom, left, top, right
# be sure to leave room for legend
par(mar=c(0, 0, 3, 2))
loafercreek$horizon.is.red <- factor(loafercreek$horizon.is.red)
# NOTE: GPPs DO allow NA in the grouping variable; plots as '<missing>'
# ...but a warning is generated
groupedProfilePlot(loafercreek,
group.name.cex = 1.5, groups = 'red.shallow',
color = 'horizon.is.red', max.depth = 100,
print.id = FALSE, name = '', width = 0.4, divide.hz = FALSE,
col.palette = parseMunsell(c('10YR 5/3', '5YR 3/8')),
col.legend.cex = 1.25, col.label = 'Horizon is red?')
So we have FOUR categories… Let’s make a new factor to handle these redness subsets.
The new labels are in parentheses:
“less than 20 cm to redness” (RED == 1)
“more than 20 cm to redness” (DEEPRED == 0)
“no redness” (NOTRED == 2)
all dry color NA (NODATA == NA)
# make some labels
#loafercreek$red.shallow : 1 0 2 NA
redness.levels <- c("RED","DEEPRED", "NOTRED","NODATA")
# set the value of the NAs to 3
loafercreek$red.shallow[is.na(loafercreek$red.shallow)] <- 3
# create a site-level grouping factor
loafercreek$redness.class <- factor(loafercreek$red.shallow,
levels = c(1,0,2,3),
labels = redness.levels)
Summarize our new factor.
summary(loafercreek$redness.class)
## RED DEEPRED NOTRED NODATA
## 44 10 21 31
# adjust margins; units are lines of text: bottom, left, top, right
# be sure to leave room for legend
par(mar=c(0, 0, 3, 2))
groupedProfilePlot(loafercreek, max.depth=100,
group.name.cex = 1.5, groups = 'redness.class',
color = 'horizon.is.red', print.id = FALSE,
name = '', axis.line.offset=-0.5,
width = 0.4, divide.hz = FALSE,
col.palette = parseMunsell(c('10YR 5/3', '5YR 3/8')),
col.legend.cex=1.25, col.label='Horizon is red?')
In the second example of this demo (under Defining your own functions), we calculated the profile maximum clay.
We now will make a plot of the maxclay probability density distribution for each of our different redness groups:
loafercreek (All pedons)
RED
DEEPRED
NOTRED
NODATA
What do you see in the density plot below?
Of course, that is just a portion of our data. We have three other classes if we include the NODATA class (which we must). Here’s some code to make a plot like the one above for a lot of groups with base R graphics.
# compare groups versus full set. Empty plot.
plot(density(loafercreek$maxclay, na.rm = TRUE),
type="n", xlim = c(0, 60), ylim = c(0, 0.1),
main = "Probability density of profile\n maximum clay content by \"redness\" class",
sub = "Maximum Clay %; Subsets versus full `loafercreek` group")
# set up plotting arguments
line.labelz <- c("ALL", levels(loafercreek$redness.class))
line.colorz <- c("BLACK","DARKRED","RED","BLUE","PURPLE")
plot.lty <- c(3,1,1,1,1)
# CREATE DATA FRAME (NO FACTORS TO PRESERVE ORDERING)
plot.params <- data.frame(labels=line.labelz,
line.color=line.colorz,
lty=plot.lty, stringsAsFactors=FALSE)
# make a base plot with base R apply() :)
res <- apply(plot.params, MARGIN=1, FUN=function(c) {
idx <- loafercreek$redness.class %in% c[['labels']]
if(all(!idx)) # handle 'ALL' which is not a factor level... it is all factor levels
idx <- !idx
lines(density(loafercreek$maxclay[idx], na.rm = TRUE, from=0, to=60),
lty=as.numeric(c[['lty']]), col=c[['line.color']], lwd = 2)
})
legend(x = 45, y = 0.1025, cex = 0.9,
legend = plot.params$labels, col = plot.params$line.color,
lwd = 2, lty = plot.params$lty)
There is a lot of overlap – which makes sense – all of these soils in loafercreek more or less correlate as “similar soils” to Loafercreek.
When all of these groups are considered together, it makes sense that there is only one, relatively broadly defined, dominant taxon used for these soils.
table(toupper(loafercreek$taxonname))
##
## ARGONAUT DUNSTONE GOPHERIDGE LOAFERCREEK MILLVILLA MOTHERLODE
## 1 1 4 98 1 1
table(toupper(loafercreek$taxonkind))
##
## SERIES TAXADJUNCT
## 90 14
The RED group has more variation (a “broader peak”), with more probability density in the >35% clay region of the plot than the others.
Both the DEEPRED and NOTRED groups have some high clay contents, but they are more isolated, and their peaks appear lower than the RED group’s. The DEEPRED max peak is greater than the NOTRED max peak.
But remember, we actually only had about n = 10 profiles in the DEEPRED group, so comparing it to the more data-rich groups, or reading into observed differences too strongly, might not be appropriate.
We will use the NASIS Component Layer ID loafercreek$genhz and the redness class loafercreek$redness.class we assigned above.
Note that some horizons in the dataset are not assigned generalized horizons, because that is how they were populated in the NASIS Component Layer ID when loafercreek was created.
In practice, you need to be certain that you are not excluding important variation by omitting a grouping label. If I were correlating these soils, I generally would like to see all horizons correlated to a generalized horizon, and would probably not use as much subdivision in the Bt horizons.
I would rather see infrequently observed horizons (e.g. Oi, Oe, C in loafercreek) that may not be justifiable with the existing data than random exclusion. If there are enough of them, they may warrant a note in, e.g., the OSD or component description.
Using regular expressions to programmatically assign labels is one way to make a first-pass correlation based on horizon designation. See generalize.hz() and the tutorials on the AQP homepage.
Also, see the Loafercreek-themed GENHZ demo, where more complete genhz labels are assigned.
# modify the "genhz" (NASIS Comp. Layer ID) to combine BA and A
# we do this because there is only one BA horizon with dry color data
loafercreek$genhz[loafercreek$genhz == "BA"] <- "A"
loafercreek$genhz[is.na(loafercreek$genhz)] <- "not-assigned"
# create a horizon level redness grouping factor from site
# TODO: make an easier method for doing this in the SPC object
loafercreek$redness <- merge(horizons(loafercreek),
site(loafercreek)[,c('peiid','redness.class')],
all.x = TRUE)$redness.class
# make a list of data frames split by "redness" and "genhz" (NASIS Comp. Layer ID)
red.genhz.list <- split(horizons(loafercreek),
f = list(loafercreek$redness, loafercreek$genhz))
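As an aside on the TODO above: newer versions of aqp include a helper for this site-to-horizon join. A one-line alternative to the `merge()` step, assuming `denormalize()` is available in your aqp version:
# site-to-horizon join (assumes your aqp version provides denormalize())
loafercreek$redness <- denormalize(loafercreek, 'redness.class')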
Now we define an anonymous function to calculate clay quantiles for each unique `genhz`. We also append the number of observations (`n.obs`) as a column at the end of each row.
# calculate some quantiles of clay content for each redness.class*genhz
qtiles.redgenhz <- do.call('rbind', lapply(red.genhz.list, function(d) {
# add number of observations (named numeric vector) to output
n.obs <- sum(!is.na(d$clay))
names(n.obs) <- "n.obs"
# calculate quantiles, concatenate n.obs & return
return(data.frame(q = t(quantile(d$clay,
probs = c(0,0.05,0.25,0.5,0.75,0.95,1),
na.rm = TRUE)),
n.obs,
redness = d$redness[1],
genhz = d$genhz[1]))
}))
Make pretty tables that will look nice in the .Rmd using `knitr::kable()`.
# print a list of data frames, split by redness class
library(knitr)
red.clayl <- split(qtiles.redgenhz, f = qtiles.redgenhz$redness)
res <- lapply(red.clayl, function(d) {
rownames(d) <- d$genhz
print(kable(d[c("A", "Bt1", "Bt2", "Bt3", "BCt", "Cr", "not-assigned"), ],
caption = "Selected Quantiles of Clay Content -
grouped by NASIS Component Layer ID & Redness Group"))
})
  | q.0. | q.5. | q.25. | q.50. | q.75. | q.95. | q.100. | n.obs | redness | genhz |
---|---|---|---|---|---|---|---|---|---|---|
A | 11 | 11.00 | 14.00 | 16 | 19.25 | 24.63 | 28.0 | 52 | RED | A |
Bt1 | 12 | 15.10 | 18.25 | 22 | 27.00 | 32.27 | 41.0 | 43 | RED | Bt1 |
Bt2 | 17 | 20.55 | 24.00 | 28 | 34.00 | 43.00 | 51.4 | 52 | RED | Bt2 |
Bt3 | 26 | 26.75 | 29.75 | 32 | 32.75 | 34.50 | 35.0 | 6 | RED | Bt3 |
BCt | 16 | 19.00 | 26.00 | 32 | 35.00 | 42.80 | 44.0 | 13 | RED | BCt |
Cr | 15 | 15.00 | 15.00 | 15 | 15.00 | 15.00 | 15.0 | 1 | RED | Cr |
not-assigned | 14 | 14.50 | 18.80 | 25 | 29.50 | 50.00 | 50.0 | 31 | RED | not-assigned |
  | q.0. | q.5. | q.25. | q.50. | q.75. | q.95. | q.100. | n.obs | redness | genhz |
---|---|---|---|---|---|---|---|---|---|---|
A | 13 | 14.3 | 16.0 | 17 | 19.75 | 23.35 | 24 | 14 | DEEPRED | A |
Bt1 | 18 | 19.1 | 20.0 | 22 | 25.25 | 29.80 | 32 | 12 | DEEPRED | Bt1 |
Bt2 | 20 | 23.0 | 25.0 | 27 | 30.00 | 51.00 | 60 | 13 | DEEPRED | Bt2 |
Bt3 | 22 | 22.8 | 26.0 | 30 | 40.00 | 48.00 | 50 | 3 | DEEPRED | Bt3 |
BCt | 30 | 30.5 | 32.5 | 35 | 40.00 | 44.00 | 45 | 3 | DEEPRED | BCt |
Cr | 19 | 19.0 | 19.0 | 19 | 19.00 | 19.00 | 19 | 1 | DEEPRED | Cr |
not-assigned | NA | NA | NA | NA | NA | NA | NA | 0 | DEEPRED | not-assigned |
  | q.0. | q.5. | q.25. | q.50. | q.75. | q.95. | q.100. | n.obs | redness | genhz |
---|---|---|---|---|---|---|---|---|---|---|
A | 10 | 11.10 | 15.000 | 17 | 18.5 | 23.90 | 28 | 32 | NOTRED | A |
Bt1 | 14 | 14.15 | 19.475 | 22 | 25.0 | 29.75 | 34 | 22 | NOTRED | Bt1 |
Bt2 | 16 | 19.00 | 23.000 | 25 | 32.0 | 40.00 | 50 | 31 | NOTRED | Bt2 |
Bt3 | 19 | 19.20 | 20.000 | 22 | 23.0 | 44.60 | 50 | 5 | NOTRED | Bt3 |
BCt | 25 | 25.00 | 25.000 | 25 | 26.0 | 29.20 | 30 | 5 | NOTRED | BCt |
Cr | 32 | 32.00 | 32.000 | 32 | 32.0 | 32.00 | 32 | 1 | NOTRED | Cr |
not-assigned | NA | NA | NA | NA | NA | NA | NA | 0 | NOTRED | not-assigned |
  | q.0. | q.5. | q.25. | q.50. | q.75. | q.95. | q.100. | n.obs | redness | genhz |
---|---|---|---|---|---|---|---|---|---|---|
A | 12 | 12.50 | 14.00 | 16.0 | 18.00 | 22.50 | 24 | 31 | NODATA | A |
Bt1 | 16 | 17.00 | 18.75 | 20.0 | 24.25 | 27.65 | 33 | 28 | NODATA | Bt1 |
Bt2 | 10 | 20.00 | 23.00 | 27.0 | 28.75 | 35.00 | 42 | 46 | NODATA | Bt2 |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
BCt | 30 | 30.15 | 30.75 | 31.5 | 32.25 | 32.85 | 33 | 2 | NODATA | BCt |
Cr | 33 | 33.00 | 33.00 | 33.0 | 33.00 | 33.00 | 33 | 1 | NODATA | Cr |
not-assigned | 14 | 14.80 | 18.00 | 26.0 | 28.00 | 28.00 | 28 | 5 | NODATA | not-assigned |
Several of the less-red groups have some very high observed clay contents, though mostly at greater depth. There do appear to be slight, possibly meaningful, offsets in the group medians – but their ranges overlap considerably.
There is no reason to expect clay content and redness to always be correlated in the same way everywhere on the landscape. Also, it makes sense that the deep red soils might have their high clay contents deeper in the profile.
To continue this analysis, I would like to fill in the gaps in the generalized horizon labels, and then extend it to the rest of the profile description data in NASIS from this metavolcanic landscape. Investigating spatial relationships between these groups, or attributes other than `clay`, might also be of interest.
summary(loafercreek$redness.class)
##     RED DEEPRED  NOTRED  NODATA
##      44      10      21      31
`loafercreek` is a biased subset of the “full sample” of CA630 pedons that occur on these landscapes. And those (numerous) pedons are a (hopefully less and less) biased sample of the real world. But that is a story for another day.
`profileApply()` - method in aqp for SoilProfileCollection objects

`profileApply(object, FUN, simplify=TRUE, ...)`

Arguments:
- `object` - a SoilProfileCollection
- `FUN` - a function to be applied to each profile within the collection
- `simplify` - logical; should the result list be simplified to a vector?
- `...` - further arguments passed to `FUN`
If you are having trouble with a `profileApply()`-based routine:
- Make sure the function arguments you are using are appropriate for your analysis.
- Try your function (without `profileApply()`) on single profiles where you know the result, to make sure it works.
- Know the data type you are expecting for the result, and make sure you are getting it.
Here we evaluate `estimateSoilDepth()` for one pedon in `loafercreek` (`loafercreek[1]`). We know to expect a numeric value (a depth) when it is called via `profileApply()`.
Which is exactly what we get – except the `profileApply()` result has the `peiid` in the `names` attribute. It is a named numeric vector.
# create an SPC with just one pedon (try different ones)
just.one <- loafercreek[1]
the.result <- estimateSoilDepth(just.one)
the.result
## [1] 79
applied.result <- profileApply(just.one, estimateSoilDepth)
applied.result
## 64505
## 79
To prove that calling the function manually and via `profileApply()` gives equal results, use a logical comparison:
(the.result == applied.result)
## 64505
## TRUE
To access the `names` attribute (containing the SPC unique site ID) of the `profileApply()` result:
names(applied.result)
## [1] "64505"
str(applied.result)
## Named int 79
## - attr(*, "names")= chr "64505"
Check if a `peiid` from the `names` attribute is in a list of other `peiid` values (e.g. from `loafercreek`). What site index matches the result `peiid`? What horizon index(es)?
# logical vector length of left hand side (1)
# "is left hand side character vector IN right hand side character vector?"
names(applied.result) %in% loafercreek$peiid
## [1] TRUE
# get site index (we took the first profile at the beginning)
# NB the %in% result is a logical vector as long as the number of sites/profiles in loafercreek
which(site(loafercreek)$peiid %in% names(applied.result))
## [1] 1
# If we don't specify site(), we get `peiid` from @horizons.
# So we get several horizon indexes, all matching peiid and belonging to the first site
# NB This may or may NOT be what you expect/want!
which(loafercreek$peiid %in% names(applied.result))
## [1] 1 2 3 4 5 6 7 8
The default behavior of `profileApply()` is to `simplify` the result from a list to a vector.
To `simplify`, or not to `simplify`?

If you are using `profileApply()` and trying to return a more complex data type like a data.frame, or need list output, you will benefit from setting `simplify = FALSE`.
Here, we call `profileApply()` without specifying `simplify`; the default (`simplify = TRUE`) from the function definition of `profileApply()` is used.
The result, recognized as a set of numbers, is simplified from a list to a named numeric vector (typical of the `*apply()` family of functions). This is what you want nine times out of ten for univariate results.
numeric.vector <- profileApply(loafercreek, estimateSoilDepth)
head(numeric.vector, 3) # named numeric vector, names are peiid
##  64505 115595 115819
##     79     74     74
class(numeric.vector) # numeric
## [1] "numeric"
typeof(numeric.vector) # double precision numeric
## [1] "double"
Note the different data type (list) of the “unsimplified” `profileApply()` result.
a.list <- profileApply(loafercreek, estimateSoilDepth, simplify = FALSE)
head(a.list, 3) # a named list, names are peiid
## $`64505`
## [1] 79
##
## $`115595`
## [1] 74
##
## $`115819`
## [1] 74
class(a.list) # list can contain a mix of any data type
## [1] "list"
typeof(a.list[[1]]) # the first element of this list is numeric (integer)
## [1] "integer"
str(unlist(a.list)) # create a named numeric vector from list (since all are numeric)
## Named num [1:106] 79 74 74 79 99 73 64 55 58 61 ...
## - attr(*, "names")= chr [1:106] "64505" "115595" "115819" "115827" ...
str(as.numeric(a.list)) # create an UNnamed numeric vector from list
## num [1:106] 79 74 74 79 99 73 64 55 58 61 ...
If the function you are trying to `profileApply()` works for a few single, well-behaved pedons, your problems may be related to the source data – edge cases, errors, things you didn’t plan for.
If it is a function you designed, add a bit of code to print out the ID, so you can tell whether `profileApply()` is failing on a particular pedon.
If you did not write the function, or would prefer not to edit it, write a wrapper function that prints the ID of each pedon before passing the pedon and other arguments on to the problematic function, and call your wrapper function with `profileApply()` instead. A sketch of such a wrapper follows.
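Here, `problematic.function` is a hypothetical stand-in for whatever function is misbehaving:
# debugging wrapper: print the profile ID, then call the function under test
# NB `problematic.function` is a placeholder -- substitute your own function
debug.wrap <- function(p, ...) {
  print(profile_id(p)) # trace which pedon is being processed
  problematic.function(p, ...)
}
# usage: profileApply(loafercreek, debug.wrap, simplify = FALSE)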
Say you want to calculate the depth-weighted average of a numeric property over some depth interval [`tdepth`, `bdepth`] for all profiles in an SPC, using this function:
depth.weighted.average <- function(spc, tdepth, bdepth, attr, ...) {
#expand `attr` in formula
custom.formula <- formula(paste0(tdepth,":",bdepth," ~ ", paste0(attr, collapse=" + ")))
# calculate a depth-weighted average using aqp::slice()
return(mean(slice(spc, custom.formula, just.the.data=TRUE)[[attr]],
na.rm = TRUE))
}
We want to write a wrapper function around `depth.weighted.average()`, tailored for a specific analysis.
Our goal is to make sure the data we pass to the function are not going to cause errors (all `NA`) and are trustworthy – that is, not “a bunch” of `NA`, where “a bunch” is a user-defined threshold.
`NA` will be removed silently by `mean(..., na.rm=TRUE)`. Removing `NA` means summary statistics (in this case the mean) will suffer from being derived from a dataset with missing data.
If all inputs are `NA`, removing them leaves a zero-length input vector, which often causes an error (or a `NaN` result) downstream.
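A quick base R illustration of the all-`NA` case – note that `mean()` itself quietly returns `NaN` here; it is usually downstream code expecting a number that actually fails:
mean(c(NA, NA), na.rm = TRUE)  # NaN -- nothing is left after removing NA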
To handle the above cases, we define the wrapper function `dwt.mean.wrap()`, which we will use to handle calls to `depth.weighted.average()`.
#dwt.mean.wrap()
# x - a SoilProfileCollection; usually single-profile from profileApply()
# na.threshold - specifies the proportion (by number of records) of NA for warning
#
# Pedons with all NA for `attr` return NA (with a message containing peiid)
# Generate a warning if NA is over a threshold (with message containing peiid)
dwt.mean.wrap <- function(x, na.threshold, attr, ...) {
# create local variable to limit SPC accessor calls
my.attr <- horizons(x)[[attr]]
# if all records are NA don't try to process the data, flag & return NA
if(all(is.na(my.attr))) {
print(paste0("All horizons are `NA` \'", attr, "\' -- ", profile_id(x)))
return(NA)
}
# calculate proportion of NA records and check against threshold
prop.na <- sum(is.na(my.attr)) / length(my.attr)
if(prop.na > na.threshold) {
print(paste0("Greater than ", round(na.threshold * 100),
"% of horizons are `NA` \'", attr, "\' - upedonid: ", profile_id(x)))
#we will not omit it, just flag it so we have the ID
}
# we used x and attr to make the output above; depth.weighted.average()
# also needs tdepth and bdepth from the call to profileApply(), passed on via `...`
# (naming attr explicitly avoids it being positionally matched to tdepth)
return(depth.weighted.average(x, attr = attr, ...))
}
Now we calculate the 25-75 cm depth-weighted average clay content for all `loafercreek` pedons. The wrapper excludes pedons where clay is all `NA`, and warns about pedons with more than 40% `NA` (based on number of records).
wrapper.test <- profileApply(loafercreek,
dwt.mean.wrap,
na.threshold = 0.40,
tdepth = 25, bdepth = 75,
attr = 'clay')
## [1] "All horizons are `NA` 'clay' -- 307574"
## [1] "All horizons are `NA` 'clay' -- 338045"
## [1] "Greater than 40% of horizons are `NA` 'clay' - upedonid: 414938"
## [1] "Greater than 40% of horizons are `NA` 'clay' - upedonid: 477042"
## [1] "Greater than 40% of horizons are `NA` 'clay' - upedonid: 477055"
## [1] "All horizons are `NA` 'clay' -- 533172"
## [1] "All horizons are `NA` 'clay' -- 533337"
## [1] "All horizons are `NA` 'clay' -- 533889"
## [1] "All horizons are `NA` 'clay' -- 640647"
## [1] "All horizons are `NA` 'clay' -- 1193371"
plot(density(wrapper.test, na.rm=TRUE),
main = "25-75cm depth-weighted average clay content\nall `loafercreek` pedons")
sum(is.na(wrapper.test))
## [1] 7
n = 99 in the plot; n = 7 `NA`. Sums to 106.
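We can check that bookkeeping directly:
# 106 profiles total: 99 contribute to the density plot, 7 are NA
length(wrapper.test) # 106
sum(!is.na(wrapper.test)) # 99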
So, we added some checks for entirely-missing and mostly-missing `clay` data, and printed out the offending site IDs. The number of pedons with all `NA` matches the number of observations we would expect. Cool.
Modify `dwt.mean.wrap()` to check the depth-weighted proportion of `NA` instead of the proportion of records that are `NA`. That is, consider how much thickness of the profile is `NA` – not how many horizons.
How many records are flagged at the 0.4 `na.threshold` using the depth-weighted approach?
HINT: use depth to bedrock (`loafercreek$bedrckdepth`) as your total thickness – exclude the thickness of any non-soil layers. A starting-point sketch (one possible approach) follows.
Hurst, V.J. (1977). Visual estimation of iron in saprolite. Geological Society of America Bulletin, 88:174-176.
- Diagnostic horizon validation (dark surface epipedon, argillic horizon)
- Particle size control section validation
- Texture and rock fragment validation
- Spatial comparisons of pedon observations
- Expanding the `loafercreek` and `gopheridge` datasets: geographically associated soils
#diagnostic horizon validation (dark surface, argillic)
#particle size control section depths and weighted-average calculation
#horizon-level validations and preparing SPC objects for analysis
#do spatial example.. can we predict where the red/clayey ones are?
#bring in the shallow and skeletal and deep data from the Loafercreek mapunits