Component Interpretation Summary via SDA

You probably need to install some packages before this will work. Have a look at this tutorial to get setup.

Lets get some interpretations that are relevant to MLRA 18 (Sierra Nevada Foothills) soils so that we can compare all components named Amador and Millvilla. Just for fun lets throw in a very different soil from MLRA 17 (San Joaquin Valley); Hanford.

Note that SDA has a 100,000 row limit, so we have to be a little clever when writing the queries. In this case, letting the database filter out NULL fuzzy ratings and the reasons for ratings.

library(aqp)
library(soilDB)
library(reshape2)
library(lattice)
library(cluster)
library(ape)
library(sharpshootR)
library(plyr)


# beware, there are hard limits (100k rows) on what can be returned by SDA
q <- "SELECT component.cokey, compname, mrulename, interplr, interplrc
FROM legend
INNER JOIN mapunit ON mapunit.lkey = legend.lkey
INNER JOIN component ON component.mukey = mapunit.mukey
INNER JOIN cointerp ON component.cokey = cointerp.cokey
WHERE
-- exclude STATSGO
areasymbol != 'US'
AND compname IN ('Millvilla', 'Amador', 'Hanford')
AND seqnum = 0
AND mrulename IN ('ENG - Construction Materials; Topsoil', 
'ENG - Sewage Lagoons', 'ENG - Septic Tank Absorption Fields', 
'ENG - Unpaved Local Roads and Streets', 
'AGR - California Revised Storie Index (CA)')
AND interplr IS NOT NULL;"

# query and check
x <- SDA_query(q)
kable(head(x))

cokey	compname	mrulename	interplr	interplrc
22539852	Hanford	AGR - California Revised Storie Index (CA)	0.837	Grade 1 - Excellent
22539852	Hanford	ENG - Construction Materials; Topsoil	0.741	Fair
22539852	Hanford	ENG - Sewage Lagoons	0.400	Somewhat limited
22539852	Hanford	ENG - Septic Tank Absorption Fields	0.954	Somewhat limited
22539852	Hanford	ENG - Unpaved Local Roads and Streets	0.400	Somewhat limited
22539782	Hanford	AGR - California Revised Storie Index (CA)	0.796	Grade 2 - Good

OK, so how do the fuzzy numbers (interplr values) for these components compare?

# compute number of rules and soils
n.soils <- length(unique(x$compname))
n.rules <- length(unique(x$mrulename))

# compare population with box-whisker plot
bwplot(compname ~ interplr | mrulename, data=x, layout=c(1,n.rules), strip=strip.custom(bg=grey(0.8)))

Those figures were useful, sometimes it is nice to see the median values for each interpretation (rows) and component (columns).

s <- ddply(x, c('mrulename', 'compname'), plyr::summarize, 
           low=quantile(interplr, probs=0.05, na.rm=TRUE), 
           rv=quantile(interplr, probs=0.5, na.rm=TRUE), 
           high=quantile(interplr, probs=0.95, na.rm=TRUE))

knitr::kable(dcast(s, mrulename ~ compname, value.var = 'rv'), caption = "Median Fuzzy Ratings")

Median Fuzzy Ratings
mrulename	Amador	Hanford	Millvilla
AGR - California Revised Storie Index (CA)	0.219	0.837	0.318
ENG - Construction Materials; Topsoil	0.000	0.820	0.000
ENG - Septic Tank Absorption Fields	1.000	1.000	1.000
ENG - Sewage Lagoons	1.000	1.000	1.000
ENG - Unpaved Local Roads and Streets	0.500	0.041	1.000

What about the categorical ratings? Note that the kable function from the knitr package makes nice HTML tables for us.

knitr::kable(xtabs(~ interplrc + compname , data=x, subset= mrulename == 'ENG - Construction Materials; Topsoil'), caption="ENG - Construction Materials; Topsoil")

ENG - Construction Materials; Topsoil
	Amador	Hanford	Millvilla
Fair	0	95	4
Poor	23	10	13

knitr::kable(xtabs(~ interplrc + compname , data=x, subset= mrulename == 'AGR - California Revised Storie Index (CA)'), caption = "AGR - California Revised Storie Index (CA)'")

AGR - California Revised Storie Index (CA)’
	Amador	Hanford	Millvilla
Grade 1 - Excellent	0	75	0
Grade 2 - Good	0	28	0
Grade 3 - Fair	0	2	7
Grade 4 - Poor	14	0	10
Grade 5 - Very Poor	9	0	0

knitr::kable(xtabs(~ interplrc + compname , data=x, subset= mrulename == 'ENG - Septic Tank Absorption Fields'), caption = "ENG - Septic Tank Absorption Fields")

ENG - Septic Tank Absorption Fields
	Amador	Hanford	Millvilla
Somewhat limited	0	21	0
Very limited	23	84	17

knitr::kable(xtabs(~ interplrc + compname , data=x, subset= mrulename == 'ENG - Sewage Lagoons'), caption = "ENG - Sewage Lagoons")

ENG - Sewage Lagoons
	Amador	Hanford	Millvilla
Somewhat limited	0	18	0
Very limited	23	87	17

Interpretation Based Similarity

Here is a crazy idea, what if we could sort a collection of soil series based on a subset of relevant interpretations? We can, as long as some assumptions are made:

there exists a small set of interpretations that reliably describe some aspect of “similarity”
the mean fuzzy rating is a realistic description of central tendency
in aggregate, the collection of components named for a soil series will approximate the central tendency of that series

Lets try it with a small set of MLRA 17, 18, and 22A soils. NULL ratings will confound interpretation of the results–there are no “AGR - Pesticide Loss Potential-Leaching” ratings for components named “Dunstone”. I have tried to select a small set of relevant interpretations, but clearly there are many possibilities. Soil profile sketches (derived from Official Series Descriptions) are used in the final figure. Note that these are not the same as the component data that were used to generate the interpretation-based similarity.

# set list of soil series (component names)
soil.list <- c('Pardee', 'Yolo', 'Capay', 'Aiken', 'Amador', 'Pentz', 'Sobrante',
'Argonaut', 'Toomes', 'Jocal', 'Holland', 'Auburn', 'Dunstone', 
'Hanford', 'Redding', 'Columbia', 'San Joaquin', 'Fresno')

# set list of relevant interpretations
interp.list <- c('ENG - Construction Materials; Topsoil', 
'ENG - Sewage Lagoons', 'ENG - Septic Tank Absorption Fields', 
'ENG - Unpaved Local Roads and Streets', 
'AGR - California Revised Storie Index (CA)')

# compose query
q <- paste0("SELECT compname, mrulename, AVG(interplr) as interplr_mean
FROM legend
INNER JOIN mapunit ON mapunit.lkey = legend.lkey
INNER JOIN component ON component.mukey = mapunit.mukey
INNER JOIN cointerp ON component.cokey = cointerp.cokey
WHERE 
-- exclude STATSGO
areasymbol != 'US'
AND compname IN ", format_SQL_in_statement(soil.list), "
AND seqnum = 0
AND mrulename IN ", format_SQL_in_statement(interp.list), "
AND interplr IS NOT NULL
GROUP BY compname, mrulename;")

# send query
x <- SDA_query(q)

# reshape long -> wide
x.wide <- dcast(x, compname ~ mrulename, value.var = 'interplr_mean')
knitr::kable(x.wide, digits = 3, caption="Mean Fuzzy Ratings for Select Soil Series")

Mean Fuzzy Ratings for Select Soil Series
compname	AGR - California Revised Storie Index (CA)	ENG - Construction Materials; Topsoil	ENG - Septic Tank Absorption Fields	ENG - Sewage Lagoons	ENG - Unpaved Local Roads and Streets
Aiken	0.737	0.211	1.000	0.972	0.794
Amador	0.226	0.000	1.000	1.000	0.691
Argonaut	0.441	0.124	1.000	1.000	0.994
Auburn	0.299	0.001	1.000	1.000	0.997
Capay	0.505	0.166	1.000	0.819	1.000
Columbia	0.610	0.445	0.958	0.962	0.850
Dunstone	0.233	0.000	1.000	1.000	0.788
Fresno	0.140	0.000	1.000	1.000	0.399
Hanford	0.845	0.694	0.987	0.872	0.209
Holland	0.601	0.100	0.981	0.999	0.880
Jocal	0.617	0.075	0.938	0.994	0.938
Pardee	0.246	0.000	1.000	1.000	1.000
Pentz	0.247	0.003	1.000	1.000	0.707
Redding	0.300	0.021	1.000	1.000	0.797
San Joaquin	0.263	0.204	1.000	1.000	0.542
Sobrante	0.438	0.071	1.000	1.000	0.859
Toomes	0.175	0.000	1.000	1.000	1.000
Yolo	0.828	0.846	0.906	0.614	0.608

# set row names to uppercase compname
# this will allow us to integrate with an SPC of OSD data later
row.names(x.wide) <- toupper(x.wide$compname)

# create distance matrix
# leave out compname
d <- daisy(x.wide[, -1])

# cluster via divisive hierachical method
# labels are recovered from row names -> distance matrix dimnames
h <- diana(d)

# fetch typical profiles from official series descriptions
osds <- fetchOSD(soil.list)

# magic happens here
# hang profile sketches from interpretation-based clustering
# may have to tinker with arguments to get figure right
par(mar=c(2,1,2,1))
plotProfileDendrogram(osds, clust = diana(d), scaling.factor = 0.008, max.depth=200, width = 0.3, name.style = 'center-center', shrink = TRUE)
title('Component Similarity via Interpretation Fuzzy Values')
mtext('profile sketches from OSDs', side = 1)

Interesting. Discussion to be continued…

This document is based on soilDB version 2.7.7.

Component Interpretation Summary via SDA

D.E. Beaudette

2023-03-21

Interpretation Based Similarity