R and RStudio have been installed on all computers with NASIS and are typically updated and CCE approved once a year. USDA machines may be 1 to 3 versions behind the latest available version for public download. Having an outdated version of R though rarely creates a problem, although warnings may appear. RStudio is a powerful and productive user interface for R and is the recommended software for soil scientists.

Tips for using R:

NRCS Resources for R

Soil Data Aggregation using R

Introduction to R Webinar

Statistics for Soil Survey

NCSS GitHub Repositories

Package ‘aqp’ manual

Package ‘soilDB’ manual

Package ‘sharpshootR’ manual

Introduction to soilDB

Tips on Getting NASIS Data into R

AQP Tutorials

Region 11

Additional Resources for Learning R

Quick R website

Simple R tutorial

R Manuals

Comprehensive R Archive Network (CRAN) Task View

Gardener’s own

David Rossiter’s R applications and lecture notes

Use R for Digital Soil Mapping Tutorial by Brendan Malone

NPS Inventory and Monitoring lecture notes

Geographic Data Analysis and Visualization: Topics and Examples

1 Import/Export/Save

# set working directory
setwd("C:/workspace") #R uses / instead of \ for file paths

# view current working directory
getwd( )

# reads a file in a table format with a header row and create a dataframe called x
x <- read.table(filename, header = TRUE, sep = " ")

Additional tips for read.table( ):

–the default separator sep = " " is any whitespace

–use header = TRUE to read the first line as a header of column names

–use as.is = TRUE to prevent character vectors from being converted to factors

–use skip = n to skip n lines before reading data

# reads a CSV file with a header row and create a dataframe called x
x <- read.csv("filename", header = TRUE)
# writes a dataframe to a tab delimited text file with a header row
# use the 'sep = ' argument to specify additional output file formats
write.table(dataframe, file = "filename.txt", sep = "\t", header = TRUE)
# writes a dataframe to a comma delimitated file with a header row
write.csv(dataframe, file = "filename.csv", header = TRUE)
# saves all objects
save.image( )
# saves R workspace to a file named workspace.RData
save.image(file = "workspace.RData")
# saves a history of the commands that have been executed during a R session as sand.Rhistory
savehistory(file = "sand.Rhistory")  

# loads a history file (sand.Rhistory in this example) containing a list of previously executed commands 
loadhistory(file = "sand.Rhistory")

# displays a list of all previous commands (max.show = Inf)
# maximum number of lines displayed can be specified using the max.show argument
history(max.show = Inf)

2 Packages

The aqp package will be used in the following examples:

# installs a package and its dependencies
install.packages("aqp", dep = TRUE )
# loads package 
# uninstalls package 

3 Help

# help window for a command (fetchNASIS in this example); package argument is optional
help(fetchNASIS, package = soilDB)
# placing a ? before a specified command will also open a help window if the package containing the command has been installed and loaded

4 Graphics

# plot (numeric) data against the index (1:N)

# bivariate plot of x (on the x-axis) and y (on the y-axis)
# can be executed after a plot is generated to add labels to a plot
title(main = "NULL", sub = "NULL", xlab = "NULL", ylab = "NULL")
# graphing parameters 
par( ) 

# adjust margin size (vector of 4 numbers, 1 per side) (default is c(5, 4, 4, 2) + 0.1)
par(mar = c(2, 2, 1, 1))

# adjust the number of plots per row and per column on the canvas (filled row-wise)
par(mfrow = c(2, 1))    #2 rows and 1 column
plot(x, y)
plot(x, z)

See the par help page for more graphing parameter options.

# adds one or more straight lines to a plot
abline( )   
# will close the active graphics window
dev.off( )  

# will open a new plot window
dev.new( )
# Will close all graphics windows
graphics.off( ) 

5 Managing Variables and Objects

5.1 Examining Data Structure

The letter “x” will be used to represent a R object in the following examples:

# invokes a spreadsheet-style data viewer on a matrix or dataframe object
# displays a list of all objects in the search path
# displays the internal structure of an R object 

# displays the object class
# displays the column names of an R object 

# displays the first (head) or last (tail) parts of a vector, matrix, table, data frame or function 
# returns the number of rows

# returns the number of columns

# returns the object length 
# returns the unique values for factors or characters in a vector, dataframe, or array
# levels(x) will produce the same output, but is only applicable to factors

Example: unique(SPC$landform.string) would return a list of all landforms in the SPC object

5.2 Editing and Summarizing Data

# prints a brief summary of the results of various model fitting functions or each column in a matrix or dataframe
# coerces the column to a factor

# logical flag to determine if the levels in a factor should be regarded as ordered (in the order given)

# coerces the column to be numeric

# returns a permutation which rearranges its first argument into ascending or descending order
# order is used when dealing with a sequence of numeric, complex, character or logical vectors, all of the same length
# when dealing with tables, the sort( ) argument can be used to order data
# removes rows in a dataframe, matrix, or vector that contain NA values 
# invokes a spreadsheet-like text editor where edits can be saved if a new object is specified
new.object <- edit(x)
# writes an ASCII text representation of a R object to a file or connection
# useful for converting a list (ex: "2011MT081001" "2011MT081009") into a comma delimited string (ex: c("2011MT0810001", "2011MT0810009")) for use in queries
dput(x, file = " ")
# displays a contingency table of the counts at each combination of factor levels

Example: table(data$subgroup, data$order) would print out a contingency table of frequency counts by subgroup and order

# sorts (or order) a vector or factor (partially) in ascending or descending order

Example: sort(table(object$subgroup), decreasing = TRUE) would sort a table depicting a frequency count of subgroup in decreasing order

Example: sort(unique(data$hzname)) would return a vector of unique horizon names in ascending order

# the concatenate command combines values into a vector or list
# the output type is determined from the highest type of the components in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression
c( )
# returns the TRUE indices of a logical object, allowing for array indices
which( )

Additional syntax options for use in which() criteria:

Operator Meaning Example Result
!= not-equal-to character “string” which(s$partsizeclass != 'sandy-skeletal') would return the row numbers of object s where the particle size class was everything except sandy-skeletal
== equal to which(s$partsizeclass == 'sandy-skeletal') would return the row numbers of object s where the particle size class is sandy-skeletal
< > less than, greater than which(s$depthtobedrock > 102) would return the row numbers where depth to bedrock is greater than 102
<= >= less than equal to, greater than equal to which(s$depthtobedrock >= 102) would return the row numbers where depth to bedrock is greater than or equal to 102
%in% equivalent to IN() in SQL which(s$partsizeclass %in% c('loamy-skeletal', 'sandy-skeletal')) would return the row numbers of object s where the particle size class is sandy-skeletal or loamy-skeletal
# a value with the same shape as test filled with elements from either yes or no 
ifelse(test, yes, no)

Example: ifelse(as.numeric(data$lime) == 2, 1, 0) would return a numerical vector of 1s and 0s, 1 representing lime = 2 and 0 representing lime = to anything other than 2)

# the grepl( ) command will search for matches to an argument pattern within each element of a character vector
grepl(pattern, vector)

Example: grepl('dyst', subgroup) would pattern match to find the rows with the text dyst in the object called subgroup

Additional syntax options for use in REGEX (regular expression) pattern matching:

Operator Meaning Example Result
equiavlent to “OR” in SQL grep('loamy | sandy', f$partsize_class) would return TRUE for pedons with loamy or sandy partsize_class
^ anchors to the left side of the string grep('^sandy', f$partsize_class) would return TRUE for pedons with partsize_class beginning with sandy
$ anchors to the right side of the string grep('skeletal$', f$partsize_class) would return TRUE for pedons with partsize_class ending in skeletal
& equivalent to “AND” in SQL grep('loamy & sandy', f$partsize_class) would return TRUE for pedons with loamy and sandy partsize_class
# a generic command that evaluates expr in a local environment constructed from data. Data can be a list, a dataframe, or an integer. 
with(data, expr)

Example: with(data, table(hillslope_pos, argillic.horizon, useNA = "ifany")) would generate a frequency table from data with two columns: hillslope_pos and argillic.horizon. This table would include all NA values since useNA is set to “ifany”.

Example: with(data, round(prop.table(table(hillslope_pos, argillic.horizon, useNA = "ifany"), 2) * 100)) would calculate the portions (%) of hillslope_pos with and without argillic horizons described. The “2” calculates the proportions relative to the column totals and the *100 converts the output values to percentages.

# subsets vectors, matrices or dataframes.
subset(data, select = )

Example: subset(s, surface_gravel > 0, select= c(landform.string, surface.gravel)) would generate a table listing only the row numbers, surface.gravel, and landform.string where surface.gravel is greater than 0.

# splits the data into subsets, computes summary statistic for each, and returns the result in a convenient form
aggregate(column1 ~ column2, data = , statistic)

Example: aggregate(clay ~ horizon, data = h, mean) would generate a table of mean clay content for each horizon

Example: aggregate(clay ~ horizon, data = h, summary) would generate a table of the minimum, 1st and 3rd quantiles, median, mean, and maximum clay content for each horizon

# rounds the values in its first argument to the specified number of decimal places (default 0)

Example: table(round(data$clay)) would generate a table with rounded clay values

# adds column and row summaries to a table made up of one or more vectors (in this example, x and z)
addmargins(table(x, z))
# returns a proportions table
prop.table(table(x), margin = NULL)*100

Example: round(prop.table(table(data$horizon, data$texture_class), margin = 1) * 100) would generate a rounded proportions table relative to the rows, margin = 1 calculates for rows, margin = 2 calculates for columns, margin = NULL calculates for total observations

# combines two vectors (x and z in this example), creating a dataframe
data.frame(x, z)
# binds columns or rows into a matrix
cbind( )
rbind( )

6 Exploratory Data Analysis

6.1 Distributions

# computes a histogram of the given numerical vector. 
# in the lattice package, histogram( ) can be used for nicer histogram plots. In the example, 3 separate histograms (for clay, sand, and total_frags_pct) are plotted
histogram(~ clay + sand + total_frags_pct, data = h)

Example: histogram(~ clay | horizon, data = h) would plot multiple histograms, one for every horizon in the object h

# plots the computed density plot
dp <- density

# in the lattice package, densityplot( ) produces a Kernel Density Plot 

Example: densityplot(~ clay + sand, data = h, auto.key = TRUE) would compute a density plot with two curves (clay and sand), removing NA values

# generates a QQplot with a line that represents the quantiles of a normal distribution
# if the data are normally distributed, the data points will fall very close to the QQline


6.2 Central Tendency

# returns the arithmetic average (mean) of a numerical vector after removing NA values  
mean(x, na.rm = TRUE) 
# returns the middle measurement of a sample set, and as such is a more robust estimate of central tendency than the mean
# median is also known as the middle or 50th quantile, meaning that there are an equal number of samples with values less than and greater than the median

# returns the most frequent measurements in the sample (mode)
sort(table(round(data$column)), decreasing = TRUE)[1]

6.3 Dispersion

# computes the variance (sum of squares) of a numeric vector, matrix, or dataframe

# computes the standard deviation of a numeric vector 
# the default for the quantile( ) function returns the min, 25th percentile, median or 50th percentile, 75th percentile, and max from a numeric vector

# returns the minimum and maximum values for a numeric vector
# produces a "box-and-whiskers" plot where y represents a numeric vector of data values to be split into groups according to variable grp (usually a factor) 
boxplot(y ~ grp, data = )
plot(y ~ grp, data = )

# the bwplot command in the lattice package can also be used to generate boxplots
bwplot(y ~ grp, data = )    

6.4 Association

# creates an object called vars that contains numerical vectors (clay, sand, and phfield in the example below)
vars <- c("clay", "sand", "phfield")

# creates a correlation matrix (table of the calculated correlation coefficients)
round(cor(data[vars], use = "complete.obs"), 2) #2 specifies decimal places
# generates a scatterplot of two numeric variables (in this example, x and z)
plot(x ~ z, data = )

# scatterplot command in the lattice package
xyplot(x ~ z, data = )

# in the lattice package, produces a scatterplot matrix for all numeric variables (vars) in a dataset

# generates a scatterplot matrix for all numeric variables (vars)
# in the circular package, the circular ( ) command creates a circular object (in the example, aspect) to provide a numerical summary
aspect <- data$aspect_field
aspect <- circular(aspect, template = "geographic", units = "degrees", modulo = "2pi")

# also in the circular package, the rose.diag( ) command generates a rose diagram plot of circular data (in this example, aspect) 
rose.diag(aspect, bins = 8, col = "grey")

7 Spatial

7.1 Points and General Mapping Commands

# in the sp package, the coordinates command promotes a dataframe to a spatial object or retrieves spatial coordinates from a Spatial object
coordinates(dataframe) <- ~ x + y

# in the sp package, the proj4string command sets the projection of a spatial object. In this example, the projection is set to WGS84 
proj4string(spatialobject) <- '+proj = longlat + datum = WGS84'

Projection Codes

# in the maps package, will plot counties and states. The xlim = c( , ) and ylim = c( , ) parameters allow the user to "zoom in or out" by setting the x and y coordinate limits
map('county', 'statename')
# a generic command to draw a sequence of points at the specified spatial extent 
# returns the bounding box coordinates for a spatial object
# in the rgdal package, the spTransform command will reproject spatial objects 
spTransform(spatialobject, CRS("+init = epsg:code"))

7.2 Raster

The series of commands in the example below use the raster package to specify a file path, concatenate a list of raster files in the specified file path, and combine the file directory and file names so that the rasters can be stacked:

# set file path
folder <- "C:/geodata/"

# list of file names
files <- c(
  elev   = "ned30m.tif", 
  slope  = "ned30m_slope5.tif",    
  aspect = "ned30m_aspect5.tif",     
  twi    = "ned30m_wetness.tif")

# combine the folder directory and file names
geodata_f <- sapply(files, function(x) paste0(folder, x))

# create a raster stack
geodata_r <- stack(geodata_f)
# alternatively, if all rasters are in the working directory (and only those rasters), the example below can be used to create a raster stack if the raster package is loaded
# the pattern argument can be used to specify input raster file extension

rasters <- stack(list.files(getwd( ),pattern = "img$", full.names = FALSE))
# extracts raster data to point data (stored as a spatial object)
# in the example below, the output dataframe object "data" will contain the following columns: pedon_id, taxonname, x_std, y_std, tax_subgroup and extracted raster values for each row in the spatial object for each raster in the raster stack

data <- data.frame( as.data.frame(spatialobject)[c("pedon_id", "taxonname", "x_std", "y_std", "tax_subgroup")], extract(rasterstack, spatialobject))

# in the following example, the data2 dataframe will contain extracted raster values and the single column specified from the spatial object
data2 <- data.frame(spatialobject$column, extract(rasterstack, spatialobject))
# export raster using the raster package
writeRaster(r, filename = "C:/workspace/raster.tif", format = "GTiff", progress = "text", overwrite = TRUE)
# in the raster package, the projectRaster function will project the values of a raster object to a new raster object with another projection

projectRaster(from, to, method = " ")

7.3 Vector

# import vector data using the rgdal package
readOGR(dsn = "C:/workspace/polygon.shp", layer = "polygon")

# export vector data to ESRI shapefile using the rgdal package 
writeOGR(pol, dsn = "C:/workspace/polygon.shp", layer = "polygon", driver = "ESRI Shapefile", overwrite_layer = TRUE)

8 soilDB and AQP

8.1 Convenience Functions in the soilDB Package

  • fetchNASISLabData()
    • Get KSSL laboratory pedon/horizon layer data from a local NASIS database.
  • fetchNASIS_component_data()
    • Get selected NASIS mapunit and component data from a local NASIS database (experimental).
  • fetchKSSL()
    • Get KSSL data via BBOX, MLRA, or series name query, from the SoilWeb system.
      • For more information, check out the following tutorial: KSSL data demo.
  • fetchOSD()
    • Fetch a limited subset of horizon- and site-level attributes for named soil series from the SoilWeb system.
  • fetchRaCA()
    • Get Rapid Carbon Assessment (RaCA) data by State, geographic bounding-box, RaCA site ID, or series query from the SoilWeb system.
      • For more information, check out the following tutorial: RaCA data demo.
  • fetchSCAN()
    • Query soil and climate data from USDA-NRCS SCAN Stations (experimental).
  • fetchHenry()
  • fetchPedonPC()
    • Fetch commonly used site/horizon data from a PedonPC v.5 database.
  • SDA_query
    • Submit queries to the Soil Data Access system.

8.2 Get Functions in the soilDB Package

  • get(‘sites.missing.pedons’, envir=soilDB.env)
    • Returns user site ID’s for sites missing pedons.
  • get(‘dup.pedon.ids’, envir=soilDB.env)
    • Returns pedon ID’s for sites with duplicate pedon ID’s.
  • get(‘bad.pedon.ids’, envir=soilDB.env)
    • Returns user pedon ID’s for pedons with inconsistent horizon depths.
  • get(‘bad.horizons’, envir=soilDB.env)
    • Returns a dataframe of horizon-level information for pedons with inconsistent horizon depths.
  • get_extended_data_from_NASIS_db( )
    • Extracts accessory tables and summaries from a local NASIS Database.

8.3 Functions Specific to a Soil Profile Collection Object

In the examples below, SPC will be used to represent a soil profile collection object.

# returns a list of the site column names in the soil profile collection object 

# returns a list of the horizon column names in the soil profile collection object 

# creates an object called s that contains the site information from a soil profile collection object
s <- site(SPC)

# creates an object called h that contains the horizon information from a soil profile collection object
h <- horizons(SPC)

# converts SPC into a data frame called x
x <- as(SPC, 'data.frame')
# invokes a help document containing a detailed list of arguments and examples for plotting soil profile collections
# creates an object called x that contains only the site_id, x_std, and y_std columns out of the site table
x <- site(SPC)[, c('site_id', 'x_std', 'y_std')]
# use of grepl( ) to filter and create an index, apply that index to the SPC, and plot the first 10 pedons in the newly created index object
# invert = TRUE would pattern match for everything except for lithic
idx <- grep('lithic', SPC$tax_subgroup, invert = FALSE)

# save this subset of 'lithic' soils for later use  
lithic.subgroup <- SPC[idx, ]

# or use the index directly to summarize a field
sort(table(f$part_size_class[idx]), decreasing = TRUE)

# plot first 10 profiles in lithic.subgroup index labeled by site_id
plot(lithic.subgroup[1:10, ], label = 'site_id')

If objects are analogous to nouns in a spoken language, then functions are analogous to verbs. A function defines an action that (usually) results in the creation of an object. An object that a function “acts on” is referred to as an “argument” of that function.

# the example below pattern matches for "t" in the hzname table and creates a dataframe of pedons with hzname "t" with the peiid, phiid, hzname, hzdept, hzdepb, clay, and phfield columns 

findBtHorizons <- function(i) {
  # extract horizons for current profile
  h <- horizons(i) 
  # search for pattern 't' in horizon designations
  idx <- grep('t', h$hzname)
     # subset these horizons
  h2 <- h[idx, ] 
  # subset columns in resulting dataframe
  res <- h2[, c('peiid', 'phiid',  
  'hzname', 'hzdept', 'hzdepb', 
  'clay', 'phfield')]
  # return data

“i” is a temporary object created inside of the context of this function based on the argument supplied; in the example above, the argument ‘i’ represents a single soil profile.

# the profileApply function in the AQP package will take the function in the first example to left and apply it to all profiles in a soil 
l <- profileApply(SPC, FUN = findBtHorizons, simplify = FALSE)

# convert list into a dataframe
# ldply is a command in the plyr package
Bt.horizons <- ldply(l)

The ddply( ) function is a convenient way to iterate over groups of rows in a dataframe and compute summaries (similar to GROUP BY in SQL).

Example: Bt.horizons.top <- ddply(Bt.horizons, 'peiid', summarise, depth_to_argillic_cm = min (hzdept)) would create a new object called Bt.horizons.top that would contain a summary of the minimum top depth of the argillic horionz for each peiid.

This summary could then be added to the soil profile collection, joining on peiid:

# rejoins the Bt.horizons.top dataframe to the site data in the SPC using the peiid
site(SPC) <- Bt.horizons.top

# NOTE: the when used in conjunction with site( ), the assignment operator performs a left-join
# the slab command in the AQP package can be used to aggregate soil properties along user-defined 'slabs', and optionally within groups

slab( )

Example: slab(SPC, fm = peiid ~ clay + sand, slab.structure = c(25,100), slab.fun = mean, na.rm = TRUE) would generate a table summarizing the mean clay and mean of sand content for each peiid between 25 and 100 cm.

In the sharpshootR package, the diagnosticPropertyPlot command can be used to generate a graphical description of the presence/absence of soil diagnostic properties.

diagnosticPropertyPlot(f, v, k, grid.label = 'pedon_id', dend.label = 'pedon_id')
# where: 
#   f = a SoilProfileCollection object
#   v = a character vector of site-level attribute names that are boolean (e.g. TRUE/FALSE) data
#   k = an integer, number of groups to highlight
#   grid.label = name of a site-level attribute (usually unique) annotating the y-axis of the grid
#   dend.label = name of a site-level attribute (usually unique) annotating dendrogram terminal leaves


v <- c('ochric.epipedon', 'cambic.horizon', 'argillic.horizon', 'paralithic.contact', 'lithic.contact')

# generate diagnostic property diagram
diagnosticPropertyPlot(SPC, v, k = 5, grid.label = 'site_id', dend.label = 'taxonname', sort.vars = FALSE)