Chapter 1 Introduction to R

1.1 Outline

  1. Course Overview
    1. Review Course Objectives
    2. Why is this training needed?
    3. Why is the course organized this way?
  2. What is R?
    1. Why should I use R?
    2. What can R do?
  3. How do I get started?
    1. RStudio interface
    2. What are packages?
    3. How to navigate the Help tab
    4. How to save files
  4. Manipulating data
    1. Loading & viewing data
    2. Filtering, transforming, merging, aggregating and reshaping data
    3. Exporting data

1.2 Course Overview

1.2.1 Course Objectives

  • Develop solutions to investigate soil survey correlation problems and update activities.
  • Evaluate investigations for interpretive results and determine how to proceed.
  • Summarize data for population in NASIS.
  • Analyze spatial data to investigate soil-landscape relationships
  • Help to pursue the question “why”

1.2.2 Why is this training needed?

1.2.3 Why is the course organized this way?

  • Our best judgment for condensing what could be six university-level courses into 24 hours
  • Mixture of slides and script enabled web pages is new for NRCS
  • The web content is a long-term investment and should serve as a permanent reference
  • Feel free to provide guidance for improving the class for future offerings

1.3 What is R?

R is a free, open-source software environment and programming language developed in 1995 at the University of Auckland for statistical computing and graphics (Ihaka and Gentleman, 1996). Since then R has become one of the dominant software environments for data analysis and is used by a variety of scientific disciplines, including soil science, ecology, and geoinformatics (Environmetrics CRAN Task View; Spatial CRAN Task View). R is particularly popular for its graphical capabilities, but it is also prized for its GIS capabilities, which make it relatively easy to generate raster-based models. More recently, R has also gained several packages designed specifically for analyzing soil data.

  1. a software environment:
    • statistics
    • graphics
    • programming
    • calculator
    • GIS
  2. a language to explore, summarize, and model data
    • functions = verbs
    • objects = nouns



1.3.1 Why Should I Learn R?

While the vast majority of people use Microsoft Excel for data analysis, R offers numerous advantages, such as:

  1. Cost. R is free! (“Free as in free speech, not free beer.”)

  2. Reproducible Research (self-documenting, repeatable)

    • repeatable:
      • code + output in a single document (‘I want the right answer, not a quick answer’ - Paul Finnell)
      • easier the next time (humorous example)
      • numerous Excel horror stories of scientific studies gone wrong exist (TED Talk)
    • scalable: applicable to small or large problems
  3. R in a Community

  4. Learning Resources (quantity and quality)

  5. R is ‘becoming’ the new norm (paradigm shift?) “If we don’t accept these challenges, others who are less qualified will; and soil scientists will be displaced by apathy.” (Arnold and Wilding, 1992)

While some people find the use of a command-line environment daunting, it is becoming a necessary skill for scientists as the volume and variety of data have grown. Thus scripting or programming has become a third language for many scientists, in addition to their native language and discipline-specific terminology. Other popular programming languages include: SQL (i.e. NASIS), Python (i.e. ArcGIS), and JavaScript.


ODBC and GDAL link R to nearly all possible formats/interfaces


1.3.2 What can R do?

1.3.3 Packages

  • Base R (functionality is extended through packages)
    • basic summaries of quantitative or qualitative data
    • data exploration via graphics
    • GIS data processing and analysis
  • Soil Science R Packages
    • aqp - visualization, aggregation, classification
    • soilDB - access to commonly used soil databases
    • soilReports - handful of report templates
    • soiltexture - textural triangles
  • Ecology R packages
    • vegan - ordination, diversity analysis, etc.
    • dismo - species distribution modeling

1.3.3.1 Soil Science Applications

1.3.3.1.1 Create Maps

1.3.3.1.2 Draw Soil Profiles

1.3.3.1.3 Draw Depth Plots

1.3.3.1.4 Estimate the Range in Characteristics (RIC)
variable  genhz  pct10  median  pct90
clay      A         13      16     22
clay      BAt       16      19     25
clay      Bt1       18      24     32
clay      Bt2       22      30     44
clay      Cr        15      15     15
phfield   A          6       6      7
phfield   BAt        5       6      6
phfield   Bt1        5       6      7

1.4 RStudio: An Integrated Development Environment (IDE) for R

RStudio is an integrated development environment (IDE) that allows you to interact with R more readily. RStudio is similar to the standard RGui, but is considerably more user-friendly. It has more drop-down menus, windows with multiple tabs, and many customization options. The first time you open RStudio, you will see three windows. A fourth window is hidden by default, but can be opened by clicking the File drop-down menu, then New File, and then R Script. Detailed information on using RStudio can be found at RStudio’s website.

RStudio Windows / Tabs Location Description
Console Window lower-left location where commands are entered and the output is printed
Source Tabs upper-left built-in text editor
Environment Tab upper-right interactive list of loaded R objects
History Tab upper-right list of key strokes entered into the Console
Files Tab lower-right file explorer to navigate C drive folders
Plots Tab lower-right output location for plots
Packages Tab lower-right list of installed packages
Help Tab lower-right output location for help commands and help search window
Viewer Tab lower-right advanced tab for local web content

1.5 Rcmdr (R Commander): A Graphical User Interface for R

While we recommend the use of RStudio for some of the reasons listed above, many people new to R (or infrequent users) might benefit from a graphical user interface (GUI) that allows the user to run basic functions by pointing and clicking.

Luckily for beginners, R has the R Commander (Rcmdr) GUI, which is similar to JMP. Rcmdr was created by John Fox for his introductory statistics students so they could see how the software worked without learning a large number of function names and arguments. Rcmdr is a great way to begin familiarizing yourself with R and statistics within a GUI environment.

Regrettably, we know of no GUI that allows users to perform the majority of the soil survey applications demonstrated in this course, and thus Rcmdr won’t be covered. For those who wish to pursue Rcmdr, alternative instructions can be viewed at Andy Chang & G. Jay Kerns’ website.

To take a quick peek at Rcmdr, open it by entering the following commands into the R console.

install.packages("Rcmdr")
library(Rcmdr)

1.6 R basics

  • R is command-line driven. It requires you to type or copy-and-paste commands after a command prompt (>) that appears when you open R. This is called the “Read-Eval-Print-Loop” or REPL. After typing a command in the R console and pressing Enter on your keyboard, the command will run.

  • If your command is not complete, R issues a continuation prompt (signified by a plus sign: +).

  • R is case sensitive. Make sure your spelling and capitalization are correct.

  • Commands in R are also called functions. The basic format of a function in R is: object <- function.name(argument_1 = data, argument_2 = TRUE).

  • The up arrow (↑) on your keyboard can be used to bring up previous commands that you’ve typed in the R console.

  • Any text that you do not want R to act on (such as comments, notes, or instructions) needs to be preceded by the # symbol (a.k.a. hash-tag, comment, pound, or number symbol). R ignores the remainder of the script line following #.

# Addition
1 + 1
## [1] 2
# Multiplication
10 * 10
## [1] 100
# Compute Logarithm
log10(100)
## [1] 2
# Print Text
"Hello World"
## [1] "Hello World"
# Combine a List of Values
c(1, 2)
## [1] 1 2
# Create sequence of values
1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
# Loading built-in datasets
data(npk)

# Plot histogram
hist(npk$yield)

# Assignment
test1 <- 1
test2 <- "Hello World"

# Print
print(test1)
## [1] 1
test2
## [1] "Hello World"

1.7 Managing Packages

Packages are collections of additional functions that can be loaded on demand. They commonly include example data that can be used to demonstrate those functions. Although R comes with many common statistical functions and models, most of our work requires additional packages.

1.7.1 Installing Packages

To use a package, you must first install it and then load it. These steps can be done at the command line or using the Packages Tab. Examples of both approaches are provided below. R packages only need to be installed once (until R is upgraded or re-installed). Every time you start a new R session, however, you need to load every package that you intend to use in that session.

Within the Packages tab you will see a list of all the packages currently installed on your computer, and two buttons labeled “Install” and “Update”. To install a new package, simply click the Install button. You can install one or more packages at a time by separating their names with commas.

To find out what packages are installed on your computer, use the following commands:

library() 

# or

installed.packages()

One useful package for soil scientists is the soiltexture package, which allows you to plot soil textural triangles. The following commands show how to install it, along with several other soil packages, if you do not currently have them installed:

# CRAN (static version)
install.packages(c("aqp", "soilDB", "soilReports", "soiltexture"))

# GitHub (development version)
devtools::install_github("ncss-tech/soilDB", dependencies = FALSE, upgrade_dependencies = FALSE, build = FALSE)

1.7.2 Loading Packages

Once a package is installed, it must be loaded into the R session to be used.

library(soilDB)

You can also load packages using the Packages Tab, by checking the box next to the package name. Documentation about the soilDB package is available from the help functions in R.

help(package = "soilDB")

This help command opens the package’s documentation index in the Help tab.

1.8 Getting Help

R has extensive documentation, numerous mailing lists, and countless books (many of which are free and listed at the end of each chapter of this course).

To learn more about the function you are using and the options and arguments available, learn to help yourself by taking advantage of some of the following help functions in RStudio:

  1. Use the Help tab in the lower-right Window to search commands (such as hist) or topics (such as histogram).

  2. Type help(read.csv) or ?read.csv in the Console window to bring up a help page. Results will appear in the Help tab in the lower right-hand window. Certain functions may require quotations, such as help("+").
# Help file for a function
help(read.csv) # or ?read.csv

# Help files for a package
help(package = "soiltexture")

1.9 Documenting your work

RStudio’s Source Tabs serve as a built-in text editor. Prior to executing R functions at the Console, commands are typically written down (or scripted). Scripting is essentially showing your work.

The sequence of functions necessary to complete a task is scripted in order to document or automate that task.

While scripting may seem cumbersome at first, it ultimately saves time in the long run, particularly for repetitive tasks (humorous YouTube Video on Scripting).

Benefits include:

  • allows others to reproduce your work, which is the foundation of science
  • serves as instruction/reminder on how to perform a task
  • allows rapid iteration, which saves time and allows the evaluation of incremental changes
  • reduces the chance of human error

1.9.1 Basic Tips for Scripting

To write a script, simply open a new R script file by clicking File>New File>R Script. Within the text editor type out a sequence of functions.

  • Place each function (e.g. read.csv()) on a separate line.
  • If a function has a long list of arguments, place each argument on a separate line.
  • A command can be executed from the text editor by placing the cursor on a line and typing Ctrl + Enter, or by clicking the Run button.
  • An entire R script file can be executed by clicking the Source button.

1.10 Organizing your work

When you first begin a project, you should create a new folder and place within it all the data and code associated with the project. This vastly simplifies the process of accessing your files from R, and it is also a good habit because it makes it easy to pick up where you left off and to find your data later. Within R, your project folder is also known as your working directory. This directory is the default location where your plots and other R output are saved.

Essentially, you want to have the inputs for your code to be found in the working directory so that you can refer to them using relative file paths. Relative file paths make it easier if you move the folder containing your script(s) around. Or, if you share it with someone else, they will have little issue getting your code to work on their own file system.

NOTE: Beware when specifying any file paths that R uses forward slashes / instead of back slashes \. Back slashes are reserved for use as an escape character.
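A minimal sketch of both points (the data/ subfolder and file name here are hypothetical):

```r
# Forward slashes work on every platform; backslashes must be doubled in R strings
abs_fwd  <- "C:/workspace2/sand_example.csv"
abs_back <- "C:\\workspace2\\sand_example.csv"   # the same path, but harder to read

# file.path() assembles paths with forward slashes for you
rel <- file.path("data", "sand_example.csv")     # relative to the working directory
```

A relative path like this keeps working if the project folder is moved or shared, so long as the working directory is set to the project folder.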

1.10.1 Setting the Working Directory

Before you begin working in R, you should set your working directory to your project folder; for example, “C:\workspace2\projectx…”.

To change the working directory in RStudio, select main menu Session >> Set Working Directory >> …. Or, from the “Files” tab click More >> Set As Working Directory to use the current location of the “Files” tab as your working directory.

Setting the working directory can also be done via the Console with the setwd() command:

setwd("C:/workspace2")

To check the file path of the current working directory (which should now be “C:\workspace2”), type:

getwd()

1.10.2 RStudio Projects (.Rproj files)

You can also manage your working directory using RStudio Projects. An RStudio Project file (.Rproj) is analogous to, for example, a .mxd file for ArcMap. It contains information about the specific settings you may have set for a “project”.

You open or create projects using the drop-down menu in the top right-hand corner of the RStudio window (shown below).

RStudio Project Menu

Here is what a typical Project drop-down menu looks like:

RStudio Project Menu (expanded)

  • You can create new projects from existing or new directories with “New Project…”.

  • When you click “Open Project…”, your working directory is automatically set to the .Rproj file’s location – this is extremely handy

  • Any projects you have created/used recently will show up in the “Project List”

1.11 Saving your work

In R, you can save several types of files to keep track of the work you do. The file types include: workspace, script, history, and graphics. It is important to save often because R, like any other software, may crash periodically.

Such problems are especially likely when working with large files. You can save your workspace in R via the command line or the File menu.

1.11.0.1 R script (.R)

An R script is simply a text file of R commands that you’ve typed.

You may want to save your scripts (whether they were written in R Editor or another program such as Notepad) so that you can reference them in the future, edit them as needed, and keep track of what you’ve done.

To save R scripts in RStudio, simply click the save button from your R script tab. Save scripts with the .R extension.

R assumes that script files are saved with that extension. If you are using another text editor, you won’t need to worry about saving your scripts in R. You can open text files in the RStudio text editor, but beware of copying and pasting from Word files, as discussed below.

To open an R script, click the file icon.

1.11.0.2 Microsoft Word Files

Using Microsoft Word to write or save R scripts is generally a bad idea.

Certain keyboard characters, such as quotation marks, are not stored the same way in Word (they are converted to “left”- and “right”-handed “smart” quotes). The difference is hard to distinguish visually, but code containing them will not run in R.
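As a quick base R illustration, compare a straight ASCII quote with the curly “left” quote that Word substitutes:

```r
straight <- "\""      # the ASCII quotation mark that R expects
curly    <- "\u201C"  # Word's "left" smart quote (U+201C)

identical(straight, curly)  # FALSE: R treats them as different characters
```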

Also, pasting your R code or output into Word documents manually is not reproducible, so while it may work in a pinch, it ultimately costs you time.

You can use the word_document R Markdown output format to automatically “knit” .docx files from R code, which is very handy for quickly producing a nice-looking document!

1.11.0.3 R Markdown (.Rmd)

R Markdown (.Rmd) files combine narrative text with embedded R code, which makes analyses self-documenting and reproducible. They also power interactive reports (e.g. Shiny) and the “Notebook” interface in RStudio.

This document is made in bookdown! You can make websites with blogdown, etc. You can knit visually appealing, high-quality documents to rich HTML, PDF, or Word output.

These are all based on the powerful pandoc engine and the tools in the R Markdown ecosystem.

1.11.0.4 R history (.Rhistory)

An R history file is a copy of all of the keystrokes entered at the console. You can think of it as a brute-force way of saving your work. It can be useful if you didn’t document all your steps in an R script file.

Like an R file, an Rhistory file is simply a text file that lists all of the commands that you’ve executed. It does not keep a record of the results.

To load or save your R history from the History Tab click the Open File or Save button. If you load an Rhistory file, your previous commands will again become available with the up-arrow and down-arrow keys.

You can also use the command line to load or save your history.

savehistory(file = "sand.Rhistory")  
loadhistory(file = "sand.Rhistory")  
history(max.show=Inf) #displays all previous commands

1.11.0.5 R Graphics

Graphic outputs can be saved in various formats.

Format            Function
pdf               pdf("graphic.pdf")
windows metafile  win.metafile("graphic.wmf")
png               png("graph.png")
jpeg              jpeg("graph.jpg")
bmp               bmp("graph.bmp")
postscript        postscript("graph.ps")

To save a graphic: (1) click the Plots Tab window, (2) click the Export button, (3) choose your desired format, (4) modify the export settings as you desire, and (5) click Save.

The R command for saving a graphic is:

png(file = "npk_yield.png")
plot(npk$yield)
dev.off()

The first line opens a PNG graphics device that will write to a file named “npk_yield.png” in the working directory. The second line plots the yield column of the npk data object. The third line closes the graphics device, which writes the file.

1.12 Exercise 1

Using the examples discussed thus far as a guide, demonstrate your mastery of the material by performing the following tasks.

  1. Create an R script file, demonstrate 3 basic R functions, and comment (#) your code.
  2. Install the FedData R package from CRAN and GitHub. Save the commands in your R script file.
  3. Load the FedData R package and read the help file for the get_ssurgo function within the FedData package. What is the 1st input/argument? Save the R command in your R script.
  4. Save your R script, and forward to your instructor.

1.13 Loading Data

R can load a variety of data formats; however, tabular data is by far the most common and is what we will spend the majority of our time working with. Typically, tabular data is stored in spreadsheets (e.g. .txt, .csv, .xlsx), databases (e.g. NASIS), or webpages (.html). Within R, tabular data is stored as a data.frame.

1.13.0.1 Text files

Text files are a preferred format for storing and transferring small datasets. One basic command for importing text files into R is read.csv(). The command is followed by the file name or URL, and then some optional arguments that control how the file is read.

These files can either be imported into R by clicking the Import Dataset >> From Text buttons from the Environment tab, or by typing the following command into the R console:

# from working directory
sand <- read.csv("C:/workspace2/sand_example.csv")
# from URL
sand <- read.csv("https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/sand_example.csv") 
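Those optional arguments control how the file is parsed. A self-contained sketch (using read.csv()’s text argument instead of a file so it runs anywhere; the column names are made up):

```r
csv_text <- "musym,clay\nMnpB2,NA\nCrsG,35"

d <- read.csv(
  text             = csv_text, # a file path or URL works the same way
  header           = TRUE,     # the first line holds the column names
  na.strings       = "NA",     # strings to treat as missing values
  stringsAsFactors = FALSE     # keep character columns as character
)
```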

1.13.0.2 Excel files

R can import Excel files, but generally speaking it is a bad idea to rely on Excel. Excel has a dangerous default which automatically converts data with common notations to its standard formats without warning or notice. For example, the character string “11-JUN” entered into a cell automatically becomes the date 6/11/2021, even though the data is still displayed as 11-JUN. The only way to avoid this default behavior is to manually import your data into Excel via the Data Tab > Get External Data ribbon, and manually set the data type of all your columns to text. Failure to do so has resulted in numerous retracted research articles (Washington Post Article). Warnings aside, Excel files are very common and are a format most people are familiar with, so we will illustrate how to bring them into R.

Download the sand Excel dataset from GitHub at https://github.com/ncss-tech/stats_for_soil_survey/blob/master/data/Pre-course/R_sand/sand_example.xlsx

Excel datasets can either be imported into R by clicking the Import Dataset >> From Excel buttons from the Environment tab, or by typing the following command into the R console:

library(readxl)

sand_example <- read_excel("sand_example.xlsx")

1.13.0.3 NASIS (Web) Reports

NASIS provides a plethora of reports, many of which can be read into R for analysis. The soilDB R package provides a series of functions to read data from NASIS, either using a local database connection or via HTML web reports. Similar functions also exist for accessing tabular data from Soil Data Access. More details on soilDB will be provided in the next chapter; for now we’ll illustrate how to access some example datasets for manipulating tabular data.

library(soilDB)

# get projects
prj <- get_project_from_NASISWebReport(mlrassoarea = "11-IND", fiscalyear = 2020)

# get legends
leg <- get_legend_from_NASISWebReport(mlraoffice = "Indi%", areasymbol = "%")

# get map units
mu  <- get_mapunit_from_NASISWebReport(areasymbol = c("IN001", "IN11%"))

1.14 Data manipulation

Before we can do any sort of analysis, we often need to manipulate our data somehow. Estimates vary, but an analyst typically spends 80% of their time manipulating data and only 20% actually analyzing or modeling it. Tasks generally involve filtering, transforming, merging, aggregating, and reshaping data.

R has many functions and packages for manipulating data frames, but within the past several years a family of packages, known as the tidyverse, have been developed to simplify interacting with data frames (or tibbles). Within the tidyverse the most commonly used packages are dplyr and tidyr. Many of the tidyverse function names are patterned after SQL syntax.

We will review the most common functions you need to know in order to accomplish the majority of data manipulation tasks.

1.14.1 Viewing and Removing Data

Once a file is imported, it is imperative that you check that R imported your data correctly. Make sure numerical data are imported as numerical, that your column headings are preserved, etc. To view the data, simply click on the mu dataset listed in the Environment tab. This opens a separate window that displays a spreadsheet-like view.

Additionally you can use the following functions to view your data in R.

Function Description
print() prints the entire object (avoid with large tables)
head() prints the first 6 lines of your data
str() shows the data structure of an R object
names() lists the column names (i.e., headers) of your data
ls() lists all the R objects in your workspace

Try entering the following commands to view the mu dataset in R:

str(mu)

names(mu)

head(mu)

ls()

A data object is anything you’ve created or imported and assigned a name to in R. The Environment tab allows you to see what data objects are in your R session and expand their structure. The objects you have created so far (e.g. prj, leg, mu, sand) should be listed. If you want to delete all data objects from your R session, you can click the broom icon in the Environment tab. Otherwise you can type:

# Remove all R objects
rm(list = ls(all = TRUE)) 

# Remove individual objects
rm(mu, leg, sand)

1.14.2 Filtering or Subsetting Data

When analyzing data in NASIS, filtering is typically accomplished by loading your selected set with only the records you’re interested in. However, it is often useful or necessary to subset your data after it’s loaded. This can allow you to isolate interesting records within large datasets. For these reasons R has numerous options/functions for filtering data.

Data frames can be filtered by both columns and rows, using names, position (e.g. column 1, row 5), or logical indices (e.g. TRUE/FALSE). Another particularly useful feature is pattern matching, which uses regular expressions to select data, similar to the LIKE statement in SQL.

**Filtering with names and numerical indices**

# Filtering with names
mu$areasymbol                             # select column names using $
mu[, c("areasymbol", "musym")]            # select column names using []
mu[c("1", "2"), ]                         # select row names using []
mu[c("1", "2"), c("areasymbol", "musym")] # select column and row names using []


# Filtering by position
mu[1, ]          # select first row
mu[, 1]          # select first column
mu[2, 2]         # select second row and second column
mu[c(1, 2, 3), ] # select multiple rows
mu[c(-1, -2), ]  # drop multiple rows

**Logical Operators**

  • == equal-to (the equivalent of = in SQL)
  • != not-equal-to
  • <, >, <=, >= less than, greater than, less than or equal to, and greater than or equal to
  • & equivalent to AND in SQL and Soil Taxonomy; must match both conditions
  • | equivalent to OR in SQL and Soil Taxonomy; must match at least one condition
  • %in% equivalent to IN () in SQL (e.g. mu$areasymbol %in% c("IN001", "IN111"))
  • grepl() equivalent to LIKE in SQL (e.g. grepl("^IN", mu$areasymbol))

**Filtering with logicals**

# Standard evaluation with base R []

# Filtering with logicals
mu[mu$areasymbol == "IN001", ]                # select rows that equal IN001
mu[mu$areasymbol != "IN001", ]                # select rows that do not equal IN001
mu[, names(mu) == "areasymbol"]               # select columns that equal areasymbol
mu[, names(mu) %in% c("areasymbol", "musym")] # select columns that match areasymbol and musym 


# Non-standard evaluation with tidyverse

library(dplyr)

# Filtering rows
filter(mu, areasymbol == "IN001")
filter(mu, areasymbol != "IN001")
filter(mu, areasymbol %in% c("IN001", "IN111"))
filter(mu, muacres > 0)

# Select columns
select(mu, areasymbol, musym)

# Slice  rows
slice(mu, 1:5)
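A short sketch of grepl() pattern matching, using a made-up stand-in for the mu table so it runs without a NASIS connection:

```r
# Made-up stand-in for the mu table
mu_toy <- data.frame(areasymbol = c("IN001", "IN111", "CA630"),
                     musym      = c("MnpB2", "CrsG", "7088"))

# Rows whose areasymbol starts with "IN" (a regular expression, not SQL's LIKE "IN%")
in_rows <- mu_toy[grepl("^IN", mu_toy$areasymbol), ]

# the dplyr equivalent would be: filter(mu_toy, grepl("^IN", areasymbol))
```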

1.14.3 Transforming Data

Transforming allows you to create new columns by converting, computing on, or combining data within existing columns.

mu <- mutate(mu, 
             # convert to hectares
             muhectares = muacres * 0.4047,
             # convert muname to TRUE or FALSE if Miami is present using pattern matching
             miami      = grepl("Miami", muname),
             # compute % minor component
             n_minor    = n_component - n_majcompflag,
             # combine columns
             key        = paste(areasymbol, musym)
             )

1.14.4 Sorting Data

Sorting allows you to rearrange your data. Beware: R has several similar functions for sorting (e.g. sort and order) that only work with specific data types. The tidyverse function arrange is designed to work with data frames.

# sort ascending
arrange(mu, areasymbol, muname)

# sort descending
arrange(mu, desc(areasymbol), desc(muname))

1.14.5 Piping Data

Another particularly useful feature, provided by the magrittr package and used throughout the tidyverse, is the pipe operator (%>%).

f(x,y) becomes x %>% f(y)

The “pipe” is something that occurs in many programming languages and computer contexts. It allows output from one expression to be passed as input to the first argument of the next function. This allows sequences of commands to be read from left to right, rather than from the inside out.

# non-piping example 1
mu_sub <- filter(mu, areasymbol == "IN001")
mu_sub <- mutate(mu_sub, pct_100less = pct_component < 100) 

# non-piping example 2
mu_sub <- mutate(filter(mu, areasymbol == "IN001"), pct_100less = pct_component < 100)

# piping
mu_sub <- mu %>% 
  filter(areasymbol == "IN001") %>% 
  mutate(pct_100less = pct_component < 100)

1.14.6 Merging/Joining or Combining Data

**Joining**

When working with tabular data you often have two or more tables you need to join. There are several ways to join tables, which differ in the direction of the join and the columns joined on.

# inner join
leg_mu <- inner_join(leg, mu, by = c("liid", "areasymbol"))

# left join
leg_mu <- left_join(leg, mu, by = c("liid"))

# right_join
leg_mu <- right_join(leg, mu, by = "liid")

**Combining**

If your tables have the same structure (i.e. the same columns), or the same length and order, you may simply combine them: for example, two different mapunit tables.

# combine rows
rbind(mu, mu)
rbind(mu, leg) # won't work


# combine columns
cbind(mu, mu)  # beware combining tables with duplicate column names
cbind(mu, areasymbol_2 = mu$areasymbol)
cbind(mu, leg) # won't work

1.14.7 Aggregating or Grouping Data

Because soil data has multiple dimensions (e.g. properties and depths) and levels of organization (e.g. many to one relationships), it is often necessary to aggregate it. For example, when we wish to make a map we often need to aggregate over components and then map units. Depending on the data type this aggregation may involve taking a weighted average or selecting the dominant condition.

The group_by function defines the groups over which we wish to summarize the data.

mu_agg <- mu %>% 
  group_by(grpname, areasymbol) %>%
  summarize(sum_muacres = sum(muacres),
            n_musym     = length(musym)
            )
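The weighted average mentioned above can be sketched in base R with made-up component data (the real workflow would use the actual component table):

```r
# Made-up component table: several components per map unit
comp <- data.frame(musym   = c("A", "A", "B", "B", "B"),
                   comppct = c(60, 40, 50, 30, 20),   # component percentages
                   clay    = c(20, 30, 15, 25, 35))   # % clay

# Percentage-weighted average clay for each map unit
wt_clay <- sapply(split(comp, comp$musym),
                  function(x) weighted.mean(x$clay, x$comppct))
# wt_clay["A"] is 24, wt_clay["B"] is 22
```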

1.14.8 Reshaping Data

Typically data is stored in what is known as a wide format, where each column contains a different variable (e.g. depth, clay, sand, rocks). However, sometimes it is necessary to reshape or pivot to a long format, where multiple columns are compressed into two new columns: one containing the old column names, and another containing the values from the old columns. This is particularly useful when combining multiple variables into a single plot.

library(tidyr)

# Simplify mu example dataset
mu2 <- mu %>%
  select(grpname, areasymbol, musym, muacres, n_component, pct_hydric) %>%
  slice(1:5)
print(mu2)

# Pivot long
mu2_long <- pivot_longer(mu2, cols = c(muacres, n_component, pct_hydric))
print(mu2_long)

# Pivot wide
mu2_wide <- pivot_wider(mu2_long, names_from = name)
print(mu2_wide)

1.14.9 Exporting Data

To export data from R, use the write.csv() or write.dbf() functions (write.dbf() comes from the foreign package). Since we have already set our working directory, R automatically saves the file into the working directory.

write.csv(mu_agg, file = "mu_agg.csv")


library(foreign)

write.dbf(mu_agg, file = "mu_agg.dbf")

1.15 Exercise 2

  1. Create a new R script file.
  2. Load a 2021 project table from mlrassoarea = "2-SON" and a mapunit table from areasymbol = c("CA630", "CA649").
  3. Filter your project table for rows where the projectapprovedflag == TRUE.
  4. Calculate the acreage of hydric soils for each map unit by multiplying muacres * pct_hydric.
  5. Join your filtered project table to the mapunit table using a left join.
  6. Aggregate the total acreage of hydric soils for each soil survey area.
  7. Save your R script and forward to your instructor.

1.16 Review

Given what you now know about R, try to answer the following questions:

  1. Can you think of a situation where an existing hypothesis or conventional wisdom was not repeatable?

  2. What are packages?

  3. What is GitHub?

  4. Where can you get help?

  5. What is a data frame?

  6. What are 3 ways you can manipulate a data frame?

1.17 Additional Reading (Introduction)

1.18 References (Introduction)

Brevik, E.C., J.A. Homburg, B.A. Miller, T.E. Fenton, J.A. Doolittle, and S.J. Indorante, 2016. Selected highlights in American soil science history from the 1980s to the mid-2010s. Catena 146:128-146.

Chaney, N., E. Wood, A.B. McBratney, J.W. Hempel, T.W. Nauman, C.W. Brungard, and N.P. Odgers, 2016. POLARIS: A 30-meter probabilistic soil series map of the contiguous United States. Geoderma 274:54-67. https://www.sciencedirect.com/science/article/pii/S0016706116301434

Hennemann, G.R., and D.G. Rossiter, 2004. Training needs for the next generation of soil surveyors. International Conference on Innovative Techniques in Soil Survey, 22-26 March 2004, Cha-Am, Thailand. http://www.css.cornell.edu/faculty/dgr2/Docs/ChaAm/ChaAmKeynoteHennemann.pdf

Ihaka, R., and R. Gentleman, 1996. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5(3):299-314. https://www.stat.auckland.ac.nz/~ihaka/downloads/R-paper.pdf

Kempen, B., D. Brus, J. Stoorvogel, G. Heuvelink, and F. de Vries, 2012. Efficiency comparison of conventional and digital soil mapping for updating soil maps. Soil Science Society of America Journal 76(6):2095-2115. https://acsess.onlinelibrary.wiley.com/doi/10.2136/sssaj2011.0424

MacMillan, R., D. Moon, and R. Coupe, 2007. Automated predictive ecological mapping in a Forest Region in B.C., Canada, 2001-2005. Geoderma 140(4):353-373. www.sciencedirect.com/science/article/pii/S0016706107001152

Mausbach, 2003. The Importance of Statistical Documentation - Keeping Soil Survey Information Relevant in the 21st Century. 2003 National Cooperative Soil Survey Conference, Plymouth, MA. https://www.nrcs.usda.gov/Internet/FSE_DOCUMENTS/nrcs142p2_051833.pdf

Ramcharan, A., T. Hengl, T. Nauman, C. Brungard, S. Waltman, S. Wills, and J. Thompson, 2017. Soil Property and Class Maps of the Conterminous United States at 100-Meter Spatial Resolution. Soil Science Society of America Journal 82(1):186-201. https://acsess.onlinelibrary.wiley.com/doi/10.2136/sssaj2017.04.0122