1 The “Scenario”

You have a collection of pedons that have been correlated to a soil series or component that you would would like to compute the Range in Characteristics (“low-rv-high” values; RIC) for.


2 Objective

For your exercise, we ask you to calculate Range in Characteristic(s) for a soil series or component.

To do this, you will assign Generalized Horizon Labels (GHLs) to pedons from your area of responsibility. These labels will be a grouping variable to help you to determine the RIC for one (or more) properties of your choice.

One way we can create GHLs is by matching patterns in the field horizon designations to correlate horizon data to a simpler set of labels. We have started to call this assignment of GHLs micro-correlation.

2.1 So, what do I have to hand in?

  1. A SoilProfileCollection plot - showing the GHLs as horizon colors and field designation as the labels along the side of each profile.

  2. Table of Range in Characteristics for your selected property in each GHL.

  3. An R file, with comments for any changes you made to the default analysis. You may use this R file as a template.

2.1.1 Format and Required Files

Alternately, you may do the entire assignment in a .Rmd file (R Markdown) and submit the “knitted” HTML result.

You will need to move the code from the .R file to a .Rmd yourself. You can create a basic .Rmd from RStudio using File >> New File >> R Markdown… Here is a cheatsheet for basic R Markdown syntax: RStudio R Markdown Cheat Sheet

If you create an .Rmd file, include both the .Rmd and the knitted HTML result. If you use an R script, include the R script and PDF or screenshots of Results #1 and #2.

NOTE: There are some Exercise Tips at the end of the document to help you get going on modifications for your own analyses.

Send the results to your mentor with your first and last name in the file names. It may also be helpful to include a copy of your input pedon data (.Rda). Instructions for saving this can be found below.

3 Instructions

  1. Query NASIS database to load your selected set with some pedons. Replace the call to data("loafercreek") and the subsetting of pedons (loafercreek[1:20,]) with a routine that brings in your own data (e.g. using fetchNASIS()). fetchNASIS. If you want to get the pedons associated with a particular taxon name, try the NASIS query: NSSC_Pangaea: POINT - Pedon/Site/NCSSlabdata by taxonname.

  2. Create a subset (in R) of soils to summarize (call it pedons). You should have about n = 20 for a manageable demo that fits nicely in a SoilProfileCollection plot, but you are welcome to try the same routines on larger sets. STAT 1010: Chapter 2 - Filtering

  3. Inspect the field horizon designations (look at the pedons in R or NASIS, if needed). Think about which field horizon designation(s) should correlate to each “prototype” horizon.

  4. Decide on “prototype” horizon designation scheme. Think of the “prototype” as as set of general horizon labels that are related – like the list of horizon designations that you provide for the Range in Characteristics in an OSD, or the list of layers you include in a SSURGO component.

  5. Write a set of regular expressions (REGEX patterns) (you’ll need one pattern per generalized horizon) to do the correlations you thought about in #4. Test and learn more about regular expressions here: https://regexr.com/

  6. Cross tabulate your GHLs against the field horizon designations. This will show a table of the mapping from “old” to “new” (“field” to “correlated”). Use the table() function covered in Chapter 2.

  7. Check if any horizon designations were NOT assigned a label (have label “not-used”). At a minimum you should be able to answer the question: “Which horizons were not assigned?” Bonus points if you can answer “Why [those horizons] weren’t assigned?

  8. Repeat steps 3 through 7 as needed. You don’t need to get the patterns perfect but we want you to think about how you could/would “improve” them, especially if they don’t work as intended.

  9. Do a statistical summary on your horizon groups to produce a range in characteristics. You can use the example given below for loafercreek (mean/sd/min/max clay content). Alternately you do something else of interest, such as summarizing the range in thickness of “B” horizons, or the range in depth to bedrock; or tabulating texture classes or colors used by horizon.

Be prepared to discuss issues you had with your mentor. In particular, what “decisions” did you have to make that might have influenced your final correlations and range in characteristics?


3.1 This document is an example

This document takes you through a demo of the exercise using a subset of the loafercreek dataset from the soilDB package. You are encouraged to run through the code with loafercreek before attempting it on your own data.

After reviewing this workflow, and with the help of your mentor, you should be able to apply this technique to your own data.

This assignment integrates several R/data analysis skills as well as brings on the “Great Unknown” of NASIS data inputs from across the country. With this type of uncharted territory, there is a lot of room for learning new things and dealing with new problems.

If your code does not work at first do not be discouraged.

Feel free to contact Andrew Brown (), or your assigned mentor, if you have questions, issues or comments.

4 Getting started (with Loafercreek)

First read over and run the code in this document using the first 20 pedons from loafercreek as a demonstration. This will help you get comfortable with the process.

Then apply the same strategy to NASIS pedons from your area of responsibility, adjusting patterns and summaries as needed.

For your exercise, replace the next block of code with code to get your own data.

If you use fetchNASIS() please note that it produces QC output that may be relevant to your analysis. 2.7: fetchNASIS data checks

library(aqp, warn.conflicts = FALSE)
library(soilDB)

## STEP 1 ----

# load sample `loafercreek` data from the soilDB package
data("loafercreek")

## STEP 2  ----
# keep only the first 20 pedons
pedons <- loafercreek[1:20, ]

# plot profile sketches
par(mar = c(0, 0, 2, 1))
plot(pedons, name = 'hzname', print.id = FALSE)

In order to help your mentor debug any issues you may be having, it is helpful to also provide them with an RData file containing the SoilProfileCollection of pedons, as below.

# after loading your data as a SoilProfileCollection, save it
save(pedons, file = "my_pedons.Rda")

That way your mentor won’t have to re-create your selected set in NASIS to inspect your data.

5 Generalized Horizon Labels

Why use Generalized Horizon Labels?

We use Generalized Horizon Labels (GHL) to simplify the grouping in our input data. As soil scientists we put a lot of effort into our descriptions. We work hard to describe changes in profiles with corresponding changes in horizons. However, we don’t all necessarily have the same opinions on how that should be done.

Therefore, we generalize across profile descriptions, to deal with variation in:

  • description style / horizons designations used

  • horizon depths / boundaries

  • number of horizons described

When creating summaries of data we need a way to “relate” observations of particular horizons from particular pedons back to the typical set of horizons found in the “group” the data belong to (e.g. a series or a component).

Maybe we could use all the unique horizon designations in the data?

5.1 Inspect Field Designations

And then create a summary for each group?

## STEP 3 ----

# tabulate hzname
table(pedons$hzname)
## 
##  2BC 2BCt 2Bt1 2Bt2 2Bt3  2Cr 2Crt   2R    A   A1   A2   BA  BAt   BC  BCt   Bt 
##    1    1    2    3    2    1    1    2   16    3    3   10    3    1    3    3 
##  Bt1  Bt2  Bt3  Bt4   Bw  Bw1  Bw2    C   Cr  Crt   H1   Oi    R 
##   15   14    7    2    4    1    1    2    9    1    1    1    8
# these are the _unique_ horizon designations in our subset `pedons`
unique(pedons$hzname)
##  [1] "A"    "BA"   "BAt"  "Bt1"  "Bt2"  "Cr"   "Bt3"  "Oi"   "Crt"  "R"   
## [11] "H1"   "Bw"   "BCt"  "A1"   "A2"   "2Bt3" "2BCt" "2Cr"  "Bt"   "2Bt1"
## [21] "2Bt2" "2Crt" "Bw1"  "Bw2"  "BC"   "Bt4"  "C"    "2R"   "2BC"

With most decent-sized datasets, you will have a lot of groups when taking this simple approach to grouping.

Here we have 29 different horizon designations. Nobody would attempt to make separate ranges for each unique group, especially with such a small amount of data in some of the groups.

Depending on things like depth class or the nature of the parent material, the number of horizon RICs provided in a series or component will vary.

Many series concepts provide RICs for only a couple generalized horizons (say, just an A and a Bt in a very deep soil) and avoid providing ranges for transitional horizons/vertical subdivisions. Other descriptions may have more layers and properties broken out.

The great thing about the GHL approach is that you can “test” the effect of adding/removing groups. This doesn’t have to be a formal test, necessarily, but you can see how the inclusion of more groups, or different rules affects the way your data are partitioned. Then you can decide if it adds interpretive value (i.e. a layer with “significantly” different properties) to have more or less groups based on the data you have.

5.2 “Micro-correlation”

First, you will need some general labels appropriate for the soil you are studying. This the list of horizon labels occurs in your hypothetical, idealized, “typical” soil. For instance, the horizons that occur in the OSD/TUD/component pedon or some generalization of them would be a good start.

For this exercise we will try to produce a set of REGEX patterns that correlate the field-observed horizon designations to your prototype horizons.

Let’s take a look at the horizon designations from the Loafercreek OSD for inspiration. If you are trying this on a series of your own, you will need to replace the series name argument (must be in quotes).

l <- fetchOSD('loafercreek')
l$hzname
## [1] "Oi"  "A"   "BAt" "Bt1" "Bt2" "Bt3" "Crt" "R"

There are quite a few horizons in the OSD pedon. We might not be able to produce a unique RIC for each subdivision of the Bt. And we probably don’t want to, even if we could. So we will have to generalize.

With generalized horizon labels (GHLs), correlation decisions are being made on a horizon basis (in addition to at the pedon level), so we call it a “micro-correlation.”

In this process, we are determine what data from each pedon contributes to each Range in Characteristics. This has always been a part of Soil Correlation–we are just making it explicit and reproducible by using R to track our “decisions” at the horizon level.

A simple micro-correlation might be: “this transitional AB horizon has ‘A’ as the first designation so it is be more like an ‘A’ than a ‘Bt’ horizon”. More complex decisions take into account multiple properties beyond the horizon designation (such as clay content, color, or texture class).

Grouping horizon observations by horizon designation is an excellent way to begin to explore the properties of a set of profiles. There are patterns and connotations in the way we are trained to designate soil horizons that will often yield useful groupings.

You can (and should) look at more than just horizon designation. Often unusual data sneak through the cracks, either getting in a group they shouldn’t, or not getting matched at all–these need to be addressed with specific patterns or manual adjustments.

Here is an example of a prototype for horizonation for Loafercreek. It is a broad generalization of the labels we found in the Loafercreek OSD pedon horizons above.

Our prototype labels include an surface horizon (“A”), upper transitional horizon (“BA”), argillic horizon (“Bt”), and a bedrock contact (“Cr”):

## STEP 4 ----

# create 4 GHLs: A, upper transitional, argillic and bedrock
prototype.labels <- c('A',
                      'BA',
                      'Bt',
                      'Cr')

Evaluating the pros and cons of a single group for the argillic horizon (Bt) versus splitting out upper and lower (Bt1/Bt2), for instance, would be a great analysis to evaluate for your own extensions of this demo. We are deliberately keeping the example very general.

5.3 Regular Expressions

The vector prototype.labels has 4 values in it. Therefore, patterns.to.match must also contain 4 patterns.

Define the paired patterns for each prototype.label

## STEP 5 ----

# REGEX rules describing mapping from field data to prototype.labels
patterns.to.match <- c('^A',
                       '^B[^Ct]*$',
                       'B.*t',
                       'Cr|R')

Here is a brief explanation of the function of each of the 4 patterns:

  1. If the horizon designation starts with "A", it goes in the “A” generalized horizon group

  2. If the horizon designation starts with "B", and does not contain “C” or "t" after "B", it goes in the "BA" generalized horizon group

  3. If the horizon designation contains "B" followed by anything, and then "t", it goes in the "Bt" generalized horizon group

  4. If the horizon contains "R" or "Cr", it goes in the "Cr" generalized horizon group

More tips:

Pay special attention to how caret ^ and dollar $ symbols are used in REGEX. They function as anchors for the beginning and end of the string, respectively.

  • A ^ placed before an A horizon, ^A, will match any horizon designation that starts with A, such as Ap, Ap1, but not something merely containing A, such as BA.

  • Placing a $ after a Bt horizon, Bt$, will match any horizon designation that ends with Bt, such as 2Bt or 3Bt, but not something with a vertical subdivision, such as Bt2.

  • Wrapping pattern with both ^ and $ symbols will result only in exact matches – i.e. that start and end with the contents between ^ and $. For example ^A$, will only match A, not Ap, Ap2, or Cg.

  • A pair of square brackets matches a single character, but could be one of many individual characters. For instance [BC]+t matches “one or more of B or C followed by t”. This also makes use of + a quantifier operator (one or more). * is another quantifier denoting zero or more.

Test and learn more about regular expressions here: https://regexr.com/

5.4 generalize.hz()

We use the aqp function generalize.hz() to apply the patterns in patterns.to.match to pedons$hzname and return the corresponding new GHL for horizons where a match is made.

Summarizing the previous two subsections, we created:

  • 4 GHLs (prototype.labels); and,

  • 4 regular expression patterns (patterns.to.match) to assign data to (prototype.labels)

Importantly, the label for the last of the set of patterns to match is returned – so if the first and fourth pattern match the same horizon, the final result contains the fourth label.

Note loafercreek (and other SPCs coming out of fetchNASIS()) already have a horizon-level variable called genhz which has the contents of the NASIS Pedon Horizon Component Layer ID (dspcomplayerid) by default (when populated).

At the end of this document there is an optional guide for importing the labels assigned by R into NASIS, which requires creating a special |-delimited .txt file.

Since we don’t want to overwrite the data that came out of NASIS at this point, we will create a new horizon-level variable newgenhz to hold our preliminary GHL assignments.

# apply prototype labels `new` to horizons matching `pat`
pedons$newgenhz <- generalize.hz(x = pedons$hzname, new = prototype.labels, pat = patterns.to.match)

5.5 Cross Tabulate Results

That’s it. We have generalized the horizons. Let’s take a look at how our patterns did.

We “cross-tabulate” the results of generalize.hz() with the input data to see how our field-data got mapped to the new labels.

In particular we want to see if any horizons in the input data got “missed” by our patterns or if horizons are getting correlated to labels we did not expect.

## STEP 6 ----
# cross-tabulate results
oldvsnew <- addmargins(table(pedons$newgenhz, pedons$hzname))
oldvsnew
##           
##            2BC 2BCt 2Bt1 2Bt2 2Bt3 2Cr 2Crt  2R   A  A1  A2  BA BAt  BC BCt  Bt
##   A          0    0    0    0    0   0    0   0  16   3   3   0   0   0   0   0
##   BA         0    0    0    0    0   0    0   0   0   0   0  10   0   0   0   0
##   Bt         0    1    2    3    2   0    0   0   0   0   0   0   3   0   3   3
##   Cr         0    0    0    0    0   1    1   2   0   0   0   0   0   0   0   0
##   not-used   1    0    0    0    0   0    0   0   0   0   0   0   0   1   0   0
##   Sum        1    1    2    3    2   1    1   2  16   3   3  10   3   1   3   3
##           
##            Bt1 Bt2 Bt3 Bt4  Bw Bw1 Bw2   C  Cr Crt  H1  Oi   R Sum
##   A          0   0   0   0   0   0   0   0   0   0   0   0   0  22
##   BA         0   0   0   0   4   1   1   0   0   0   0   0   0  16
##   Bt        15  14   7   2   0   0   0   0   0   0   0   0   0  55
##   Cr         0   0   0   0   0   0   0   0   9   1   0   0   8  22
##   not-used   0   0   0   0   0   0   0   2   0   0   1   1   0   6
##   Sum       15  14   7   2   4   1   1   2   9   1   1   1   8 121

In this table you see that columns correspond to all the different horizon designations found in the original data.

And the rows correspond to our GHLs.

The numbers in each cell show how many observations (horizons) have that combination of field designation and GHL.

Note that the ‘not-used’ class is the default result when none of the patterns match. You can set alternate values for no-match case with generalize.hz(..., non.matching.code = 'alternate-not-used-code').

# find which columns are greater than zero in row 'not-used'
col.idx.not.used <- which(oldvsnew['not-used',] > 0)

# what column indexes (field horizon designations) did not get mapped onto a row (generalized hz label)?
col.idx.not.used
## 2BC  BC   C  H1  Oi Sum 
##   1  14  24  27  28  30
# show just those columns
oldvsnew[, col.idx.not.used]
##           
##            2BC  BC   C  H1  Oi Sum
##   A          0   0   0   0   0  22
##   BA         0   0   0   0   0  16
##   Bt         0   0   0   0   0  55
##   Cr         0   0   0   0   0  22
##   not-used   1   1   2   1   1   6
##   Sum        1   1   2   1   1 121

For the loafercreek example, we see that 2 “BC”, 2 “C” 1 “H1” and 1 “Oi” horizons did not match any pattern.

Since we require a “t” to be in the “Bt” group, and “C” is not allowed in the “BA” group, the “BC” falls through the cracks. Likewise, “C” and “Oi” did not have patterns created to match them.

So, let’s say we’ve decided we don’t want these ‘not-used’ horizons lumped with our ‘A’, ‘BA’, ‘Bt’ OR ‘Cr’ groups. Therefore, we either need to add additional pairs of labels and patterns to match them OR leave them as ‘not-used’.

5.6 Discussion (RE: Loafercreek)

Since there are only a handful of observations for the BC/C’s and O’s (4 and 1 of each, respectively) they may not be particularly “representative” for the “Loafercreek series.”

The lack of clay films (“t” subscript) may be a thing in common between “BC” and “C” – could they be combined?

If you were trying to apply generalized labels to Loafercreek, you could test the idea that they have an unusually large volume of rock fragments (horizons(loafercreek)$total_frags_pct) – maybe some of them do and some don’t.

You could compare the range derived for your “C” to the range for “BC” to help you decide if they are similar to one another or not (if you were considering lumping them together). Do they have similar clay contents and colors?We will lump them for this demo, since it will be a small group with this subset no matter what.

If we had more observations of the Oi we could estimate its thickness using the transition probabilities between GHLs. In this case (Loafercreek), they are seldom more than a few centimeters thick and are not much of an “O” horizon to speak of, so we have left this class out for now.

We apply the patterns as before, but create another GHL variable pedons$newgenhz2 to hold the new result. This is to illustrate that the development of GHL patterns is an iterative process and your first pass may be far from perfect.

For a new BC label pattern we match all horizons that contain C and have zero or more characters that are NOT t and put them in the BC group.

Because of the ordering of patterns, Cr will be matched by patterns 4 and 5, but only the label for pattern 5 (Cr) will be assigned. Let’s assign the new labels:

## REPEAT STEPS 4 AND 5 ----

# create 5 generalized horizons: A, upper transitional, argillic, lower-transitional and bedrock
prototype.labels.v2 <- c('A',
                         'BA',
                         'Bt',
                         'BC',
                         'Cr')

# REGEX rules describing mapping from field data to prototype.labels
patterns.to.match.v2 <- c('^A',
                          '^B[^Ct]*$',
                          'B.*t',
                          'C[^t]*',
                          'Cr|R')

# use generalize.hz() to apply a set of patterns and paired labels
# to the `pedons$hzname` character vector containing field designations
pedons$newgenhz2 <- generalize.hz(x = pedons$hzname,
                                  new = prototype.labels.v2,
                                  pat = patterns.to.match.v2)

Now we cross-tabulate again, showing only not-used data.

## REPEAT STEP 6 ----

# create a second cross-tabulation, using the updated genhz
oldvsnew2 <- addmargins(table(pedons$newgenhz2, pedons$hzname))

# find which table columns are greater than zero in row 'not-used'
col.idx.not.used <- which(oldvsnew2['not-used',] > 0)

# show just those columns
oldvsnew2[, col.idx.not.used]
##           
##             H1  Oi Sum
##   A          0   0  22
##   BA         0   0  16
##   Bt         0   0  51
##   BC         0   0   8
##   Cr         0   0  22
##   not-used   1   1   2
##   Sum        1   1 121

As you can see, the BC and C horizons that were not-used before are now correlated to the BC group.

The only horizon data that are not-used are the 2 Oi horizons. You can compare pedons$newgenhz2 with the labels we created before pedons$newgenhz and the labels loaded from NASIS Pedon Horizon Component Layer ID pedons$genhz to see the differences.

# check for equality (assignment 1 versus assignment 2)
pedons$newgenhz == pedons$newgenhz2

5.7 Visualizing Profile Sketches

Let’s recreate the graph we did at the beginning, only now we will color horizons in the plot based on their GHL. This will make it clear how our patterns simplified the grouping of the pedon horizon data, and also provide us with a visual check on our logic.

Compare the coloring (based on pedons$newgenhz2) with the field horizon designations (pedons$hzname) to the right of each profile.

## RESULT #1

# plot profile sketches - first 20 profiles; color by gen hz.
par(mar = c(0, 0, 3, 1))
plotSPC(pedons,
        name = 'hzname',
        color = 'newgenhz2',
        print.id = FALSE)

Here are a few things that are evident for the Loafercreek example: Our upper transitional horizon (‘BA’ group) captures ‘BA’ as well as ‘Bw’. The bulk of the profile is the argillic horizon (Bt). Some pedons have lower gradational horizons (BC or C). Most pedons have Cr or Cr over R, but we treat the paralithic and lithic contacts equivalently for this demo.

In RStudio you can “Export” a plot from the drop down menu at top of “Plots” pane (after you run the code to make the plot).

Or save the plot using R code. See ?pdf, ?jpg, ?dev.off helpfiles for how to capture output sent to a graphics device (by plot()) and save it to a file instead of sending it to the “Plots” pane.

We compare the the number of original horizon designations from the field data with the number of unique generalized horizon labels.

# original field data (27 levels)
length(unique(pedons$hzname))
## [1] 29
# new generalized data (6 levels, including not-used)
length(unique(pedons$newgenhz2))
## [1] 6

We went from 27 levels or “groups” in the field data to 5 groups “as correlated” (4 soil horizons + bedrock)

Let’s look at how we can generate RICs based on the labels we assigned (and subsequently revised).

5.8 Summaries by Generalized Horizon

Here we use dplyr to produce statistical summaries for each of our generalized horizons.

We group the horizon data into sub-data.frames using the GHLs we assigned in (pedons$newgenhz2) as the grouping variable. Then we do some statistics on each “piece” (using summarize) and combine the results for review.

## STEP 9 ----

# get the horizon data frame out of the SPC
hzdata <- horizons(pedons)

library(dplyr, warn.conflicts = FALSE)

# summarize horizon groups with single summary statistics 
#   using mean, sd, min, max, quantile
res_df <- hzdata %>% 
            group_by(newgenhz2) %>%
            summarize(clay_mean = mean(clay, na.rm = TRUE),
                      clay_sd = sd(clay, na.rm = TRUE),
                      clay_min = min(clay, na.rm = TRUE),
                      clay_max = max(clay, na.rm = TRUE),
                      clay_Q05 = quantile(clay, probs = 0.05, na.rm = TRUE),
                      clay_Q50 = quantile(clay, probs = 0.5, na.rm = TRUE),
                      clay_Q95 = quantile(clay, probs = 0.95, na.rm = TRUE))
res_df
Summary Statistics for Generalized Horizons
newgenhz2 clay_mean clay_sd clay_min clay_max clay_Q05 clay_Q50 clay_Q95
A 16.19048 3.076021 11 22 12.00 15 22.00
BA 19.25000 3.356586 12 26 14.25 19 24.50
Bt 29.32653 8.716335 16 60 18.40 29 47.20
BC 27.33333 6.439462 16 35 18.25 29 33.75
Cr NaN NA Inf -Inf NA NA NA
not-used NaN NA Inf -Inf NA NA NA

This is an implementation of a Range in Characteristics for clay content. The 5th to 95th percentile range clay_Q05 to clay_Q95 would, given the available data, include 90% of the observed data in each group, with the median clay_Q50 being a central value.

Here is a summary of why it is beneficial to use percentiles for summary of soil survey data / development of Range in Characteristics:

Question: How do Loafercreek clay content mean, standard deviation, minimum and maximum compare to the selected quantiles? Why?

Question: Do you think there was any benefit (in terms of representing RIC for clay content) to adding the BC generalized horizon group? What else might you change about this analysis, now that you see the RIC?

You can save the table as a .csv or .Rda (Rdata object - in a file) file using write.csv() or save().

# save result #2 to file

# save a text-based (comma-separated) version of the result table
write.csv(res_df, file = "Your_RIC_table_output.csv")

# save a binary file representation of the R object containing result table
save(res_df, file = "Your_RIC_table_output.Rda")

To continue with your work, you might need these groups to be populated in NASIS Component Layer ID – learn how to do that next.

6 Optional: Saving to NASIS dspcomplayerid

NOTE: THIS IS NOT REQUIRED PART OF THE EXERCISE – PROVIDED FOR YOUR INFORMATION AND FUTURE USE

In order to use/save GHLs in further analysis or perform manual adjustments, they need to be saved externally.

If you are a NASIS user then the following code will create a text file that can be read by NASIS and stored in the dspcomplayerid field of the Pedon Horizon table.

The NASIS Pedon Horizon Calculation “Update horizon group aggregations using a text file” uses a text file C:/data/horizon_agg.txt, which contains each phiid (pedon horizon unique record ID) paired with a label to assign.

Here is the code to make a NASIS horizon group aggregation text file. This will write newgenhz out to the horizon_agg.txt file out for each phiid in your object pedons.

# set output path
genhz.file <- 'C:/data/horizon_agg.txt'

# update genhz.var if you change the site(pedons) column with labels
genhz.var <- 'newgenhz'

# write blank output (gets rid of any old assignments saved in the file)
write.table(
  data.frame(),
  file = genhz.file,
  row.names = FALSE,
  quote = FALSE,
  na = '',
  col.names = FALSE,
  sep = '|'
)

# extract horizon data.frame
h <- horizons(pedons)

# strip-out 'not-used' genhz labels and retain horizon ID and genhz assignment
h <- h[which(h[[genhz.var]] != 'not-used'), c('phiid', genhz.var)]

# append to NASIS import file
write.table(
  h,
  file = genhz.file,
  row.names = FALSE,
  quote = FALSE,
  na = '',
  col.names = FALSE,
  sep = '|',
  append = TRUE
)

To import the file, run NASIS Pedon Horizon Calculation “Update horizon group aggregations using a text file.”

Some people prefer to adjust assignments in R while others prefer to make adjustments after loading the data into NASIS. Some combination of the two may be required depending on the type and extent of adjustments that need to be made.

Typically, NASIS is good for making final specific changes to relatively small numbers of micro-correlation decisions, whereas wholeseale re-assignments that affect many records in a consistent/programmatically-discernible way can be implemented much more efficiently in R.

You can also store temporary results in RData files.

# after updating genhz, save a new copy of the data
save(pedons, file = "my_pedons_genhz.Rda")

7 Exercise Tips

Use fetchNASIS() to get pedons from your selected set.

# then load data from the NASIS selected set into an R object called `pedons`
pedons <- fetchNASIS(from = 'pedons')

Of course, you first need to query some from your NASIS Local Database to have them in there.

Then subset your fetchNASIS() result to create a smaller group of pedons, called pedons.

# optionally subset the data, FOR INSTANCE: by taxon name - replace Loafercreek with your taxon name
pedons <- pedons[grep(pattern = 'Loafercreek', x = f$taxonname, ignore.case = TRUE),]

Instead of using a hard-coded numeric index (for example: 1:20), you could subset your selected set using text-matching on a site/pedon attribute, for example, taxon name.

To subset on taxon name, we used the function grep() to return just the numeric indices where x = f$taxonname matches our pattern (pattern='Loafercreek'). We set ignore.case=TRUE so we will match “LOAFERCREEK”, “loafercreek” and “Loafercreek” – along with any other oddly-capitalized variants that might exist. There are numerous other attributes that we could have subsetted on. Finally, we use the data.frame notation for subsetting a SoilProfileCollection.

For this assignment, you must to do some sort of subset of your selected set using R – but it does not need to be complex.

Use any site or horizon level attribute. See the function aqp::subset() for a slick way to do this for site- or horizon-level variables.

Now that you have seen the full demonstration and read the tips for applying this workflow to your own data, please return to the top of the document.

Run the code and perform analysis for your own pedons instead of loafercreek. Consider the results and discussion that were provided for Loafercreek, BUT instead, consider the conditions in your data and adjust as needed. Discuss any issues you may have with your mentor.


This document is a demonstration of concepts from the presentation “Soil Data Aggregation in R” found here.

The contents are based on the ‘Assigning and Using Generalized Horizon Labels’ tutorial found here.