1 Introduction

This is a short tutorial on how to interact with the Soil Data Access (SDA) web-service using R. Queries are written using a dialect of SQL. On first glance SQL appears similar to the language used to write NASIS queries and reports, however, these are two distinct languages. Soil Data Access is a “window” into the spatial and tabular data associated with the current SSURGO snapshot. Queries can contain spatial and tabular filters. If you are new to SDA or SQL, have a look at this page. Here is a SSURGO-specific library of SQL code, organized by task.

Spatial queries can be included in SQL statements submitted to SDA as long as the geometry has first been transformed to WGS84 geographic coordinates and formatted as “well known text” (WKT). The sf and terra packages provide functionality for converting between coordinate systems. Coordinate reference system definitions (a “CRS”) are typically provided using EPSG codes. You can search for various CRS definitions in a variety of formats using spatialreference.org/.

The soilDB library for R provides a helper function (SDA_query()) for submitting queries to SDA, processing the result, and reformatting the results into a rectangular table (a data.frame). Most of the work required to use the SDA_query() function will be writing SQL to describe the columns you would like returned and how the data should be filtered and possibly grouped. The SDA_spatialQuery() function simplifies the process of using spatial data (sf, spatVector, or Spatial* objects) to query map unit polygons.

Follow along with the blocks of code below by copying / pasting into a new R “script” document. Each block of command can be run by pasting into the R console, or by “stepping through” lines of code by moving the cursor to the top of a block (in the R script panel) and repeatedly pressing ctrl + enter. Note that knittr:kable is used to create cleaner output in the tutorial and isn’t required while following along.

If you are feeling adventurous, have a look at a draft tutorial on queries that return geometry from SDA. Additional tips on advanced spatial queries can be found here.

1.1 Critical Notes

1.1.1 SSURGO vs. STATSGO

SSURGO (1:24k soil survey) and STATSGO (1:250k soil survey) records are stored together in SDA. Therefore, it is critical that evey query to SDA include some kind of filter for selecting the appropriate records. Filtering strategies include:

explicit exclusion of STATSGO records, via legend.areasymbol != 'US' in the WHERE clause
implicit exclusion of STATSGO records, via SSURGO areasymbol in the WHERE clause
spatial queries using SDA helper functions: e.g. SDA_Get_Mukey_from_intersection_with_WktWgs84()
explicit selection of SSURGO / STATSGO records by record ID: e.g. mukey, cokey, etc.

Explicit exclusion of STATSGO records:

SELECT [...]
FROM legend
INNER JOIN mapunit mu ON mu.lkey = legend.lkey
INNER JOIN component co ON mu.mukey = co.mukey
WHERE legend.areasymbol != 'US' ;

Implicit exclusion of STATSGO records:

SELECT [...]
FROM legend
INNER JOIN mapunit mu ON mu.lkey = legend.lkey
INNER JOIN component co ON mu.mukey = co.mukey
WHERE legend.areasymbol = 'CA113' ;

1.1.2 Integer Columns

Be sure to explicitly CAST() integer columns when using operators such as multiplication, division, etc.. or when values may be very large. Queries that work with large sets of muacres from the mapunit table are one such case. Implicit casting is accomplished via operations that include higher precision values (e.g. include values to the right of the decimal point).

Integer overflow, result is NULL.

SELECT 
SUM(muacres) AS ac 
FROM legend
INNER JOIN mapunit mu ON mu.lkey = legend.lkey
WHERE legend.areasymbol != 'US' ;

Casting muacres to numeric data type, result is 2,280,193,353 acres.

SELECT 
SUM(CAST(muacres AS NUMERIC)) AS ac 
FROM legend
INNER JOIN mapunit mu ON mu.lkey = legend.lkey
WHERE legend.areasymbol != 'US' ;

Implicit casting to numeric via division by floating point value, result is 2280.193 million acres.

SELECT 
SUM(muacres / 1000000.0) AS ac 
FROM legend
INNER JOIN mapunit mu ON mu.lkey = legend.lkey
WHERE legend.areasymbol != 'US' ;

Scaling of muacres by component percentage, as converted to a fraction. Result is wrong: 500 million acres.

SELECT 
SUM((comppct_r / 100) * muacres) / 1000000 AS ac 
FROM legend
INNER JOIN mapunit mu ON mu.lkey = legend.lkey
INNER JOIN component AS co ON mu.mukey = co.mukey 
WHERE legend.areasymbol != 'US' ;

Casting comppct_r and muacres to numeric, result is correct: 2242.39 million acres. Note that there are still some components with NULL comppct_r, will be fixed eventually.

SELECT 
SUM((CAST(comppct_r AS NUMERIC) / 100) * CAST(muacres AS NUMERIC)) / 1000000 AS ac 
FROM legend
INNER JOIN mapunit mu ON mu.lkey = legend.lkey
INNER JOIN component AS co ON mu.mukey = co.mukey 
WHERE legend.areasymbol != 'US' ;

Use this last pattern whenever converting component percentages to a fraction, or scaling the muacres field.

1.2 Install Required R Packages

You only need to do this once. If you haven’t installed these packages, then copy the code below and paste into the RStudio “console” pane.

# run these commands in the R console
# stable version from CRAN + dependencies
install.packages("aqp", dep = TRUE)
install.packages("soilDB", dep = TRUE)
install.packages("terra", dep = TRUE)
install.packages("sf", dep = TRUE)

2 Simple Queries

Now that you have the required packages, load them into the current R session.

library(aqp)
library(soilDB)
library(terra)
library(sf)

When was the CA653 survey area last exported to SSURGO?

SDA_query("SELECT areasymbol, saverest FROM sacatalog WHERE areasymbol = 'CA653'")

##   areasymbol            saverest
## 1      CA653 9/1/2022 2:12:46 PM

Are there any survey areas that haven’t been updated since Jan 1, 2019? Just STATSGO.

SDA_query("SELECT areasymbol, saverest FROM sacatalog WHERE saverest < '01/01/2019'")

##   areasymbol               saverest
## 1         US 10/13/2016 12:28:22 PM

What is the most recently updated survey in CA?

SDA_query("SELECT areasymbol, saverest FROM sacatalog WHERE areasymbol LIKE 'CA%' ORDER BY saverest DESC")[1, ]

##   areasymbol             saverest
## 1      CA663 9/15/2022 2:27:44 PM

A simple query from the component table, for a single map unit: mukey = '461958'. This is a SSURGO map unit key, therefore STATSGO records are implicitly removed from the results.

q <- "SELECT 
mukey, cokey, comppct_r, compname, taxclname
FROM component
WHERE mukey = '461958'"

# run the query
res <- SDA_query(q)

# check
knitr::kable(head(res))

mukey	cokey	comppct_r	compname	taxclname
461958	22732982	4	Galt	NA
461958	22732983	3	Kimball	NA
461958	22732984	1	Unnamed	NA
461958	22732985	85	San Joaquin	Fine, mixed, thermic Abruptic Durixeralfs
461958	22732986	4	Bruella	NA
461958	22732987	3	Hedge	NA

Get a list of map units that contain “Amador” as minor component. Note that this type of query requires explicit exclusion of STATSGO records.

q <- "SELECT 
muname, mapunit.mukey, cokey, compname, comppct_r
FROM legend
INNER JOIN mapunit ON mapunit.lkey = legend.lkey
INNER JOIN component on mapunit.mukey = component.mukey
WHERE
-- exclude STATSGO
legend.areasymbol != 'US'
AND compname LIKE '%amador%'
AND majcompflag = 'No'"

# run the query
res <- SDA_query(q)

# check
knitr::kable(head(res))

muname	mukey	cokey	compname	comppct_r
Pardee-Ranchoseco complex, 3 to 15 percent slopes	461931	22732564	Amador	3
Pardee cobbly loam, 2 to 15 percent slopes	2924951	23459437	Amador	10
Pardee-Ranchoseco complex, 3 to 15 percent slopes	1420721	22485048	Amador	3
Hicksville sandy clay loam, 0 to 2 percent slopes, occasionally flooded	1612051	22521674	Amador	3
Ultic Haploxeralfs-Mollic Haploxeralfs complex, 3 to 30 percent slopes	2924833	22523521	Amador	5
Mckeonhills clay, 5 to 15 percent slopes	1540892	23467419	Amador	1

# optionally save the results to CSV file
# write.csv(res, file='path-to-file.csv', row.names=FALSE)

Get basic map unit and component data for a single survey area, Yolo County (CA113). There is no need to exclude STATSGO records because we are specifying a SSURGO areasymbol in the WHERE clause.

q <- "SELECT 
component.mukey, cokey, comppct_r, compname, taxclname, 
taxorder, taxsuborder, taxgrtgroup, taxsubgrp
FROM legend
INNER JOIN mapunit ON mapunit.lkey = legend.lkey
INNER JOIN component ON component.mukey = mapunit.mukey
WHERE legend.areasymbol = 'CA113'"

# run the query
res <- SDA_query(q)

# check
knitr::kable(head(res))

mukey	cokey	comppct_r	compname	taxclname	taxorder	taxsuborder	taxgrtgroup	taxsubgrp
459154	22734238	100	Water	NA	NA	NA	NA	NA
459204	22734512	100	Gravel pits	NA	NA	NA	NA	NA
459205	22734700	100	Water	NA	NA	NA	NA	NA
459208	22734523	5	Sehorn	NA	NA	NA	NA	NA
459208	22734524	5	Corning	NA	NA	NA	NA	NA
459208	22734525	85	Balcom	Fine-loamy, mixed, thermic Calcixerollic Xerochrepts	Inceptisols	Ochrepts	Xerochrepts	Calcixerollic Xerochrepts

Cross tabulate the occurrence of components named “Auburn” and “Dunstone” with survey areasymbol. Note that this type of query requires explicit exclusion of STATSGO records.

q <- "SELECT areasymbol, component.mukey, cokey, comppct_r, compname, compkind, taxclname
FROM legend
INNER JOIN mapunit ON mapunit.lkey = legend.lkey
INNER JOIN component ON component.mukey = mapunit.mukey
WHERE compname IN ('Auburn', 'Dunstone')
-- exclude STATSGO
AND areasymbol != 'US'
ORDER BY areasymbol, compname"

res <- SDA_query(q)

xtabs(~ areasymbol + compname, data=res)

##           compname
## areasymbol Auburn Dunstone
##      CA067      9        0
##      CA607     16        0
##      CA612      8       19
##      CA618     29        1
##      CA619     24        1
##      CA620     14        0
##      CA624     24        0
##      CA628     20        0
##      CA630      2        2
##      CA632      2        1
##      CA644     15        1
##      CA648      6        0
##      CA649     11        3
##      CA707     11        0
##      CA731      0        2
##      CA750      0        2

3 Queries Using Simple Spatial Filters

Get the map unit key and name at a single, manually-defined point (-121.77100 37.368402). Spatial queries using SDA helper functions automatically exclude STATSGO records.

q <- "SELECT mukey, muname
FROM mapunit
WHERE mukey IN (
SELECT * from SDA_Get_Mukey_from_intersection_with_WktWgs84('point(-121.77100 37.368402)')
)"

res <- SDA_query(q)
knitr::kable(res)

mukey	muname
1882921	Diablo clay, 15 to 30 percent slopes, MLRA 15

Get the map names and mukey values for a 1000m buffer around a manually-defined point (-121.77100 37.368402). The buffer() function from the {terra} package can conveniently derive planar buffers from geographic coordinates.

# the query point is in geographic coordinates / WGS84 datum
p <- vect(cbind(-121.77100, 37.368402), crs = 'epsg:4326')

# 1000m buffer (planar coordinates)
# result is still geographic coordinates
p.buff <- terra::buffer(p, 1000)

# convert to WKT
p.wkt <- as.data.frame(p.buff, geom = 'wkt')$geometry

q <- paste0("SELECT mukey, muname
FROM mapunit
WHERE mukey IN (
SELECT * from SDA_Get_Mukey_from_intersection_with_WktWgs84('", p.wkt, "')
)")

res <- SDA_query(q)
knitr::kable(head(res))

mukey	muname
456983	Diablo clay, 9 to 15 percent slopes
456993	Gaviota loam, 15 to 30 percent slopes
457017	Los Gatos-Gaviota complex, 50 to 75 percent slopes
1882920	Diablo clay, 30 to 50 percent slopes, MLRA 15
1882921	Diablo clay, 15 to 30 percent slopes, MLRA 15
1882923	Alo-Altamont complex, 15 to 30 percent slopes

Get some component data for a manually-defined bounding box, defined in WGS84 geographic coordinates.

# define a bounding box: xmin, xmax, ymin, ymax
#
#         +-------------------(ymax, xmax)
#         |                        |
#         |                        |
#     (ymin, xmin) ----------------+

# bbox extent
bb <- ext(-120.9, -120.8, 37.7, 37.8)

# convert to polygons
bb <- vect(bb, crs = 'epsg:4326')

# convert bounding box to WKT
bb <- as.data.frame(bb, geom = 'wkt')$geometry

# compose query, using WKT BBOX as filtering criteria
q <- paste0("SELECT mukey, cokey, compname, comppct_r
            FROM component 
            WHERE mukey IN (SELECT DISTINCT mukey FROM SDA_Get_Mukey_from_intersection_with_WktWgs84('", bb, "') )
            ORDER BY mukey, cokey, comppct_r DESC")

res <- SDA_query(q)

# check
knitr::kable(head(res))

mukey	cokey	compname	comppct_r
462527	23471173	Madera	10
462527	23471174	Alamo	85
462527	23471175	San Joaquin	5
462554	23471263	Corning	85
462554	23471264	Redding	5
462554	23471265	Montpellier	10

Compute polygon area via SDA and double-check via terra::expanse(). Note the complex syntax required to convert polygons into “geography” type, thus invoking spherical geometry and resulting in area values with units of square meters. Computing polygon area server-side requires a more complex SQL statement, but is much more efficient because the associated polygon data do not need to be transferred from SDA to the local R session.

# much faster (10x or more), because area is calculated server-side
a <- SDA_query("
SELECT mukey,
GEOGRAPHY::STGeomFromWKB(
  mupolygongeo.STUnion(
    mupolygongeo.STStartPoint()
  ).STAsBinary(), 4326
).STArea() * 0.000247105 AS area_ac
FROM mupolygon WHERE mukey = '418254'
")

# slow, polygon geometry must be converted to text and downloaded
x <- fetchSDA_spatial('418254')
x <- vect(x)
# compute area in sq. meters then convert to acres
x$area <-expanse(x) * 0.000247105

# note that the distribution of polygon area is identical
options(scipen = 10)
round(quantile(a$area_ac))

##    0%   25%   50%   75%  100% 
##     1     7    15    36 16632

round(quantile(x$area))

##    0%   25%   50%   75%  100% 
##     1     7    15    36 16632

4 Helper Functions

4.1 `fetchSDA()`

library(ggplot2)
library(gridExtra)

# query soil components by areasymbol and musym
test <- fetchSDA(WHERE = "areasymbol = 'IN005' AND musym = 'MnpB2'")

# profile plot
par(mar = c(0, 0, 0, 2.5))
plotSPC(test, name.style = 'center-center', cex.names = 0.8)

# convert the data for depth plot
clay_slice <- horizons(dice(test, 0:200 ~ claytotal_l + claytotal_r + claytotal_h))
names(clay_slice) <- gsub("claytotal_", "", names(clay_slice))

om_slice <- horizons(dice(test, 0:200 ~ om_l + om_r + om_h))
names(om_slice) <- gsub("om_", "", names(om_slice))

test2 <- rbind(data.frame(clay_slice, var = "clay"),
              data.frame(om_slice, var = "om")
)

h <- merge(test2, site(test)[c("nationalmusym", "cokey", "compname", "comppct_r")],
          by = "cokey",
          all.x = TRUE
)

# depth plot of clay content by soil component
gg_comp <- function(x) {
  ggplot(x) +
    geom_line(aes(y = r, x = hzdept_r)) +
    geom_line(aes(y = r, x = hzdept_r)) +
    geom_ribbon(aes(ymin = l, ymax = h, x = hzdept_r), alpha = 0.2) +
    xlim(200, 0) +
    xlab("depth (cm)") +
    facet_grid(var ~ nationalmusym + paste(compname, comppct_r)) +
    coord_flip()
}
g1 <- gg_comp(subset(h, var == "clay"))
g2 <- gg_comp(subset(h, var == "om"))

grid.arrange(g1, g2)

# query all Miami major components
s <- get_component_from_SDA(WHERE = "compname = 'Miami' \n
              AND majcompflag = 'Yes' AND areasymbol != 'US'")


# landform vs 3-D morphometry
test <- {
  subset(s, ! is.na(landform) | ! is.na(geompos)) ->.;
  split(., .$drainagecl, drop = TRUE) ->.;
  lapply(., function(x) {
    test = data.frame()
    test = as.data.frame(table(x$landform, x$geompos))
    test$compname   = x$compname[1]
    test$drainagecl = x$drainagecl[1]
    names(test)[1:2] <- c("landform", "geompos")
    return(test)
  }) ->.;
  do.call("rbind", .) ->.;
  .[.$Freq > 0, ] ->.;
  within(., {
    landform = reorder(factor(landform), Freq, max)
    geompos  = reorder(factor(geompos),  Freq, max)
    geompos  = factor(geompos, levels = rev(levels(geompos)))
  }) ->.;
}

test$Freq2 <- cut(test$Freq,
                  breaks = c(0, 5, 10, 25, 50, 100, 150),
                  labels = c("<5", "5-10", "10-25", "25-50", "50-100", "100-150")
)


ggplot(test, aes(x = geompos, y = landform, fill = Freq2)) +
  geom_tile(alpha = 0.5) + facet_wrap(~ paste0(compname, "\n", drainagecl)) +
  discrete_scale("colour", "viridis", function(n) viridisLite::viridis(n)) +
  theme(aspect.ratio = 1, axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  ggtitle("Landform vs 3-D Morphometry for Miami Major Components on SDA")

4.2 `SDA_spatialQuery()`

Spatial queries are usually simpler with a two step process via SDA_spatialQuery() to get a set of intersecting map unit keys, then SDA_query() to return additional tabular data. Some things to keep in mind when using SDA_spatial_query():

use what = 'mupolygon' to return polygon features and computed area in acres (slower, higher resource usage)
use what = 'mukey' to return only map unit keys (faster, lower resource usage)
spatial filters must have a valid CRS defined
spatial queries for more than 1000 features should probably be done using a local copy of the map unit polygons

Long-running queries and queries that consume too many server resources will be canceled by SDA. See ?SDA_spatialQuery for details.

# query map unit keys intersecting with the 1000m buffered point defined above
mu <- SDA_spatialQuery(p.buff, what = 'mukey', geomIntersection = TRUE)

# create a query to get component data for resulting map unit keys
sql <- sprintf("SELECT mukey, cokey, compname, comppct_r
               FROM component
               WHERE mukey IN %s
               ORDER BY cokey, comppct_r DESC", format_SQL_in_statement(mu$mukey)
) 

co <- SDA_query(sql)

knitr::kable(head(co))

mukey	cokey	compname	comppct_r
2435467	23469156	Gaviota	85
2435467	23469157	Vallecitos	10
2435467	23469158	Rock outcrop	5
2435468	23469159	Gaviota	88
2435468	23469160	Vallecitos	10
2435468	23469161	Rock outcrop	2

Try downloading some polygon geometry using the same buffered point. The geomIntersection argument to SDA_spatialQuery() toggles overlap (FALSE) vs. intersection (TRUE).

# use the buffered point geometry to query SDA
# requesting spatial intersection vs. overlapping features
p.mu.polys <- SDA_spatialQuery(p.buff, what = 'mupolygon', geomIntersection = TRUE)

Graphical description of the previous steps: query point, 1000m buffer, buffer bounding box (BBOX), and intersection of buffered point / map unit polygons.

# plot intersecting map unit polygons from SDA
plot(p.mu.polys, axes = FALSE, mar = c(1, 2, 0, 0))

# add buffered point
lines(p.buff, col = 'red', lwd = 2)

# add bounding box around buffered point
lines(ext(p.buff), col = 'RoyalBlue')

# add original point
points(p, col='orange', pch = 15, cex = 2)

legend('bottomleft', legend=c('query point', '1000m buffer', 'buffer BBOX', 'polygon intersection'), col=c('orange', 'red', 'royalblue', 'black'), lwd = c(NA, 2, 2, 2), pch=c(15, NA, NA, NA), xpd = FALSE, cex = 0.8)

Let’s apply the SDA_spatialQuery() function to some real data; KSSL pedons correlated to “Auburn”. Not all of these pedons have coordinates, so we will have to do some filtering first. See the in-line comments for details on each line of code.

# get KSSL pedons with taxonname = Auburn
# result is a SoilProfileCollection object
# coordinates will be WGS84 GCS
auburn <- fetchKSSL('auburn')

# keep only those pedons with valid coordinates
auburn <- subset(auburn, subset = !is.na(x) & !is.na(y))

# extract site data with coordinates
s <- site(auburn)

# init spatVector
sp <- vect(s, geom = c('x', 'y'), crs = 'epsg:4326')

# perform SDA query on collection of points
mu.data <- SDA_spatialQuery(sp, what = 'mupolygon')

# use local spatial intersection to link source + SDA data
# row-order is preserved
sp$mukey <- terra::extract(mu.data, sp)$mukey

# join results to original SoilProfileCollection using 'pedlabsampnum'
site(auburn) <- as.data.frame(sp)[, c('pedlabsampnum', 'mukey')]

Check the results and plot the “Auburn” KSSL pedons, grouped by intersecting map unit key and coloring horizons according to clay content.

# plot profiles, grouped by mukey
# color horizons with clay content
par(mar = c(0, 2, 4, 2))

groupedProfilePlot(auburn, groups = 'mukey', name.style = 'center-center', group.name.cex = 0.65, color = 'clay', id.style = 'side', label = 'pedon_id', max.depth = 100)

# describe IDs
mtext('user pedon ID', side = 2, line = -1.5)
mtext('mukey', side = 3, line = -1, at = c(0, 0), adj = 0)

This document is based on soilDB (2.7.8) and aqp (2.0.2).

Soil Data Access (SDA) Tutorial

D.E. Beaudette, A.G. Brown, S.M. Roecker

2023-09-21

1 Introduction

1.1 Critical Notes

1.1.1 SSURGO vs. STATSGO

1.1.2 Integer Columns

1.2 Install Required R Packages

2 Simple Queries

3 Queries Using Simple Spatial Filters

4 Helper Functions

4.1 `fetchSDA()`

4.2 `SDA_spatialQuery()`

Soil Data Access (SDA) Tutorial

D.E. Beaudette, A.G. Brown, S.M. Roecker

2023-09-21

1 Introduction

1.1 Critical Notes

1.1.1 SSURGO vs. STATSGO

1.1.2 Integer Columns

1.2 Install Required R Packages

2 Simple Queries

3 Queries Using Simple Spatial Filters

4 Helper Functions

4.1 fetchSDA()

4.2 SDA_spatialQuery()

4.1 `fetchSDA()`

4.2 `SDA_spatialQuery()`