1 R Fundamentals

The following examples are meant to be copied from this document and pasted into R, where they can be run interactively. Comments (green text with a “#” sign at left) briefly describe the function of the code in each line. Further documentation on objects and functions from the aqp and soilDB packages can be accessed by typing help(soilDB) or help(aqp) at the R console. The general form for a help request is ?function_name.

1.1 Classes of Objects Used in R

One of R's strengths is that it can store and manipulate data in many ways. Below are examples of ways to create and reference information in several data types that are commonly used with soil data. Within the R session, objects contain information that is loaded from files, extracted from NASIS, created on the fly, or calculated by some function. If none of the base classes are sufficient for a task, it is possible to define custom classes. The SoilProfileCollection is one such example.

Objects in R are analogous to nouns in a spoken language. They are labels for things we encounter in life. While the meaning of a noun in a spoken language generally doesn’t change (e.g., the meaning of the word “apple” doesn’t randomly change), the contents of objects in R can be modified or re-assigned at any time by using the assignment operator (<-).
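As a minimal sketch of re-assignment, the following lines create an object and then change its contents. The object name 'horizon_depth' and its values are made up for illustration.

```r
# assign a value to a new object; 'horizon_depth' is a hypothetical name
horizon_depth <- 25
horizon_depth

# re-assign: the old value is replaced
horizon_depth <- 30

# an object can also be updated from its own current value
horizon_depth <- horizon_depth + 5
horizon_depth
```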

1.1.1 Vectors

Vectors are a fundamental object in the R language. Each vector holds an ordered series of one or more numbers, characters (commonly called strings), or Boolean (true/false) values, all of the same type.

# implicit vector creation using a sequence from 1 to 10
1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

# numeric vector example: clay percent values
clay <- c(10, 12, 15, 26, 30)

# 'c()' is the concatenate function
# the concatenated values are assigned to an object named 'clay' using '<-' (or '=')
# print the values of 'clay' by typing its name at the R console and pressing Enter
clay

## [1] 10 12 15 26 30

# character vector: taxonomic subgroup
subgroup <- c("typic haplocryepts","andic haplocryepts","typic dystrocryepts")  
subgroup

## [1] "typic haplocryepts"  "andic haplocryepts"  "typic dystrocryepts"

# logical vector: diagnostic feature presence/absence
# note that TRUE and FALSE must be capitalized
andic <- c(FALSE,TRUE,FALSE) 
andic

## [1] FALSE  TRUE FALSE

1.1.1.1 Referencing elements of a vector

Specific elements from a vector are accessed with square brackets, e.g., clay[i]. Note that the examples below use vectors to reference elements of another vector.

# 2nd and 4th elements of vector 'clay' from above
clay[c(2, 4)]

## [1] 12 26

# 1st and 3rd elements of vector 'subgroup' from above
subgroup[c(1, 3)]

## [1] "typic haplocryepts"  "typic dystrocryepts"

# everything but the first element of vector 'andic'
andic[-1]

## [1]  TRUE FALSE

# re-order clay values using a sequence from 5 to 1
clay[5:1]

## [1] 30 26 15 12 10

So what’s the deal with the square bracket notation?

Access elements i from vector x: x[i].
Exclude elements i from vector x: x[-i].
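Square brackets also accept a logical vector, which is a common way to select elements that meet a condition. The sketch below assumes a stand-in vector named 'clay_pct' (a hypothetical name, chosen so the 'clay' object above is left untouched).

```r
# hypothetical clay percent values
clay_pct <- c(10, 12, 15, 26, 30)

# a comparison returns one TRUE/FALSE value per element
is_clayey <- clay_pct > 20
is_clayey

# a logical vector inside the brackets keeps only the TRUE positions
clay_pct[is_clayey]

# which() converts the logical vector to integer positions
which(is_clayey)

# brackets also work on the left side of '<-' to replace selected elements
clay_pct[1] <- 11
clay_pct
```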

1.1.1.2 Vectorized evaluation: Implicit looping

Most functions in R are “vectorized.” This means that operations such as addition (the + function) on vectors will automatically iterate through every element. Following are some examples to demonstrate these concepts.

# clay values from above
# divide by 100; notice that the division is applied to every element of 'clay'
clay / 100

## [1] 0.10 0.12 0.15 0.26 0.30

# search for the text 'dyst' in elements of 'subgroup'...more on this pattern matching process in chapter 2a!
grepl('dyst', subgroup)

## [1] FALSE FALSE  TRUE

# multiply two vectors of the same length
c(5, 5) * c(1, 2)

## [1]  5 10

# be careful, operations on vectors of different lengths result in "recycling"!
# this is helpful at times, but can be a common source of confusion
10 * c(1, 2)

## [1] 10 20

c(1, 10, 100) + c(1, 2)

## Warning in c(1, 10, 100) + c(1, 2): longer object length is not a multiple
## of shorter object length

## [1]   2  12 101

# what is actually happening here in this recycling of vectors?
# taking the following elements from above, first+first, second+second, third+first
# 1+1=2, 10+2=12, 100+1=101
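One way to make recycling visible is to spell it out with rep_len(), and one way to guard against accidental recycling is to check lengths first. This is only a sketch; the object names are hypothetical, and whether a length check is worthwhile depends on the task.

```r
# same vectors as in the recycling example above
a <- c(1, 10, 100)
b <- c(1, 2)

# rep_len() spells out what recycling does to the shorter vector
b_recycled <- rep_len(b, length.out = length(a))
b_recycled

# the explicit version gives the same result, without the warning
a + b_recycled

# a simple guard: stop with an error unless the lengths match
v1 <- c(5, 5)
v2 <- c(1, 2)
stopifnot(length(v1) == length(v2))
v1 * v2
```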

1.1.2 Dataframes

Dataframes are central to most work in R. They describe rectangular data, which can be thought of as a spreadsheet with rows and columns. Each column in the dataframe is constrained to a single data type: numeric, date-time, character, Boolean, and so on. Note that each column of a dataframe is a vector, and all column vectors must be the same length (hence the adjective “rectangular”).

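# take the character and logical vectors created above and combine them into a more useful dataframe
# we'll use the data.frame() function to glue these two vectors together into object 'd'
# stringsAsFactors = TRUE converts text to factors, matching the output shown in this section
# (this was the default behavior before R 4.0.0)
d <- data.frame(subgroup, andic, stringsAsFactors = TRUE)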
d

##              subgroup andic
## 1  typic haplocryepts FALSE
## 2  andic haplocryepts  TRUE
## 3 typic dystrocryepts FALSE

The dataframe was created, but the column names are not very informative. Two useful functions for working with column names are names() and colnames(). For a dataframe, both return a vector of column names, and both can be used on the left side of an assignment to rename columns.

# get the column names of a dataframe
names(d)

## [1] "subgroup" "andic"

# we can use 'names()' and 'c()' to rename the columns in a dataframe
names(d) <- c('tax_subgroup', 'andic.soil.properties')
d

##          tax_subgroup andic.soil.properties
## 1  typic haplocryepts                 FALSE
## 2  andic haplocryepts                  TRUE
## 3 typic dystrocryepts                 FALSE
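Renaming a single column without retyping every name is a common need. A minimal sketch follows, using a stand-in dataframe named 'd2' (a hypothetical name, so that 'd' is left alone).

```r
# a small stand-in dataframe
d2 <- data.frame(subgroup = c("typic haplocryepts", "andic haplocryepts"),
                 andic = c(FALSE, TRUE))

# names() on the left side of '<-' accepts an index, so one column can be renamed
names(d2)[names(d2) == 'andic'] <- 'andic.soil.properties'
names(d2)

# for a dataframe, colnames() returns the same vector as names()
identical(names(d2), colnames(d2))
```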

1.1.2.1 Referencing within dataframes

Note in dataframe d that each row has an index number in front of it. Using the square brackets notation, you can reference any part of a dataframe: rows or columns or specific row- and column-selections. Here are some examples:

# format: dataframe_name[rows, columns]
d[1, ] # first row of dataframe

##         tax_subgroup andic.soil.properties
## 1 typic haplocryepts                 FALSE

d[, 1] # first column of dataframe

## [1] typic haplocryepts  andic haplocryepts  typic dystrocryepts
## Levels: andic haplocryepts typic dystrocryepts typic haplocryepts

d[2, 2] # second row, second column

## [1] TRUE

# In dataframes we can also use the '$' symbol to reference vector columns within a specific dataframe object
# format: dataframe_name$column_name
d$tax_subgroup

## [1] typic haplocryepts  andic haplocryepts  typic dystrocryepts
## Levels: andic haplocryepts typic dystrocryepts typic haplocryepts

# Other useful functions for checking objects and working with dataframes
# the 'str()' function will show you the structure of an object and the data types of the vectors within it
str(d)

## 'data.frame':    3 obs. of  2 variables:
##  $ tax_subgroup         : Factor w/ 3 levels "andic haplocryepts",..: 3 1 2
##  $ andic.soil.properties: logi  FALSE TRUE FALSE

# 'class()' will tell you the object type or data type
class(d)

## [1] "data.frame"

class(d$tax_subgroup)

## [1] "factor"

# use 'colnames()' to get a vector of column names from a dataframe
colnames(d)

## [1] "tax_subgroup"          "andic.soil.properties"

# ncol and nrow provide column and row dimensions
ncol(d)

## [1] 2

nrow(d)

## [1] 3

# building on what we've learned above, we can use the square bracket notation on a dataframe to re-order columns
d <- d[ ,c('andic.soil.properties', 'tax_subgroup')]
d

##   andic.soil.properties        tax_subgroup
## 1                 FALSE  typic haplocryepts
## 2                  TRUE  andic haplocryepts
## 3                 FALSE typic dystrocryepts

# another way to do this is to use column index numbers within the concatenate function
# although the column indexes are not displayed, each column has an index number assigned to it
# here, c(2, 1) swaps the two columns back to their original order
d <- d[ , c(2,1)]
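The same bracket notation can select rows that meet a condition. The sketch below uses a stand-in dataframe named 'sites' (a hypothetical name) so that it can be run on its own.

```r
# a small stand-in dataframe
sites <- data.frame(tax_subgroup = c("typic haplocryepts", "andic haplocryepts", "typic dystrocryepts"),
                    andic.soil.properties = c(FALSE, TRUE, FALSE))

# rows where the logical column is TRUE, all columns
sites[sites$andic.soil.properties, ]

# rows matching a pattern in a text column, one column only
sites[grepl('typic', sites$tax_subgroup), 'tax_subgroup']

# count the rows that do not have andic soil properties
sum(!sites$andic.soil.properties)
```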

How would you remove a vector or column from a dataframe?

d$tax_subgroup <- NULL will remove this column from the dataframe.
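Two ways of dropping a column are sketched below on a stand-in dataframe. The names 'd3', 'd4', and 'keep' are hypothetical, and the 'clay' column is added only for illustration.

```r
# a small stand-in dataframe with three columns
d3 <- data.frame(tax_subgroup = c("typic haplocryepts", "andic haplocryepts", "typic dystrocryepts"),
                 andic.soil.properties = c(FALSE, TRUE, FALSE),
                 clay = c(10, 12, 15))

# option 1: assign NULL to the column (modifies 'd3' in place)
d3$clay <- NULL
names(d3)

# option 2: keep every column except the named one, saving the result as a new object
keep <- setdiff(names(d3), 'tax_subgroup')
d4 <- d3[, keep, drop = FALSE]

# drop = FALSE keeps the result a dataframe even when only one column remains
class(d4)
names(d4)
```

Option 2 leaves the original object unchanged, which can be helpful when the full dataframe is still needed later.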

1.1.3 Factors

Factors are designed for encoding group labels. They resemble character vectors, but internally each value is stored as an integer code that points to one of a fixed set of levels. This object (and some possible issues that it can create) is described in a later section.

# generate a factor representation of the characters contained in the word 'pedology'
# also setting the range of possibilities to the letters of the alphabet
x <- factor(substring("pedology", 1:8, 1:8), levels = letters)
# note that the object 'x' knows that there are 26 possible levels
x

## [1] p e d o l o g y
## Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z

# another way to do the same thing - the substring() is parsing the string 'pedology' for us in the above example
#x <- factor(c('p', 'e', 'd', 'o', 'l', 'o', 'g', 'y'), levels=letters)

# what happens when the levels are not specified?
factor(substring("mississippi", 1:11, 1:11))

##  [1] m i s s i s s i p p i
## Levels: i m p s

# when levels are not specified, they default to the sorted unique values found in the data
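To see how a factor stores group labels, it can help to look at the levels and the underlying integer codes side by side. The drainage classes below are hypothetical values used only for illustration.

```r
# hypothetical drainage classes for five sites
drainage <- factor(c('well', 'poorly', 'well', 'moderately well', 'well'))

# levels are stored once, sorted alphabetically by default
levels(drainage)

# each element is stored as an integer code pointing at a level
as.integer(drainage)

# table() counts observations per level
table(drainage)

# as.character() recovers the original labels
as.character(drainage)
```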

1.1.4 Lists

Lists are similar to the dataframe class but without limitations on the length of each element. Lists are commonly used to store tree-like data structures or “ragged” data: elements with varying length. List elements can contain just about anything, making them one of the most flexible objects in R. Let’s look at some examples.

# make a list with named elements
l <- list('favorite shovels'=c('sharpshooter', 'gibbs digger', 'auger', 'rock bar', 'backhoe!'),
          'food'=c('apples', 'bread', 'cheese', 'vienna sausages', 'lutefisk'),
          'numbers I like'=c(12, 1, 5, 16, 25, 68),
          'chips I like' =c('plantain', 'tortilla', 'potato', '10YR 3/2'),
          'email messages deleted'=c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE))
# check the list
l

## $`favorite shovels`
## [1] "sharpshooter" "gibbs digger" "auger"        "rock bar"    
## [5] "backhoe!"    
## 
## $food
## [1] "apples"          "bread"           "cheese"          "vienna sausages"
## [5] "lutefisk"       
## 
## $`numbers I like`
## [1] 12  1  5 16 25 68
## 
## $`chips I like`
## [1] "plantain" "tortilla" "potato"   "10YR 3/2"
## 
## $`email messages deleted`
## [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE

# access the first element of the list, note the double square brackets
l[[1]]

## [1] "sharpshooter" "gibbs digger" "auger"        "rock bar"    
## [5] "backhoe!"

# access a specific value within a list element: double square brackets select the element, and single brackets select the position within it
l[[1]][4]

## [1] "rock bar"

# access the element by name, for example 'food'
l[['food']]

## [1] "apples"          "bread"           "cheese"          "vienna sausages"
## [5] "lutefisk"

# convert a dataframe into a list
as.list(d)

## $tax_subgroup
## [1] typic haplocryepts  andic haplocryepts  typic dystrocryepts
## Levels: andic haplocryepts typic dystrocryepts typic haplocryepts
## 
## $andic.soil.properties
## [1] FALSE  TRUE FALSE

# a list of lists
list.of.lists <- list('pedon_1'=list('top'=c(0,10,25,55), 'bottom'=c(10,25,55,76), 'pH'=c(6.8,6.6,6.5,6.4)))
list.of.lists

## $pedon_1
## $pedon_1$top
## [1]  0 10 25 55
## 
## $pedon_1$bottom
## [1] 10 25 55 76
## 
## $pedon_1$pH
## [1] 6.8 6.6 6.5 6.4

# convert list of elements with equal length into a data.frame
as.data.frame(list.of.lists)

##   pedon_1.top pedon_1.bottom pedon_1.pH
## 1           0             10        6.8
## 2          10             25        6.6
## 3          25             55        6.5
## 4          55             76        6.4

How would you change the column names on the dataframe above?

names(...dataframe_object_name...) <- c('depth_top', 'depth_bottom', 'field_pH')

Use the names() function on the dataframe and supply the concatenated new names.
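Applied to the nested list from above, the renaming might look like the following sketch. The object names 'pedon_list' and 'h' are hypothetical; the list is rebuilt here so that the example can be run on its own.

```r
# rebuild the nested list from above under a different name
pedon_list <- list('pedon_1' = list('top' = c(0, 10, 25, 55),
                                    'bottom' = c(10, 25, 55, 76),
                                    'pH' = c(6.8, 6.6, 6.5, 6.4)))

# convert to a dataframe and store the result so that it can be modified
h <- as.data.frame(pedon_list)

# replace all three column names at once
names(h) <- c('depth_top', 'depth_bottom', 'field_pH')
h

# sapply() applies a function to every list element; here, the length of each
sapply(pedon_list$pedon_1, length)
```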

1.1.5 Matrix

The matrix object is used to describe rectangular data of a single datatype: numbers, characters, Boolean values, and so on.

# make a 5x5 matrix of 0's
m <- matrix(0, nrow=5, ncol=5)
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

# operations by scalar (single value) are vectorized
m <- m + 1
m * 5

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5    5    5    5    5
## [2,]    5    5    5    5    5
## [3,]    5    5    5    5    5
## [4,]    5    5    5    5    5
## [5,]    5    5    5    5    5

# notice that the result of m * 5 is displayed but because the output wasn't assigned to 'm' it remains unchanged from m + 1
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    1    1    1    1    1
## [3,]    1    1    1    1    1
## [4,]    1    1    1    1    1
## [5,]    1    1    1    1    1

# now use the square bracket notation to get / set values within the matrix: m[row, col]
# set row 1, col 1 to 0
m[1,1] <- 0
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    1    1    1    1
## [2,]    1    1    1    1    1
## [3,]    1    1    1    1    1
## [4,]    1    1    1    1    1
## [5,]    1    1    1    1    1

# access diagonal and upper/lower triangles, useful for chapter 5
# note: assigning text values converts the entire matrix from numeric to character
m[upper.tri(m)] <- 'U'
m[lower.tri(m)] <- 'L'
diag(m) <- 'D'
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,] "D"  "U"  "U"  "U"  "U" 
## [2,] "L"  "D"  "U"  "U"  "U" 
## [3,] "L"  "L"  "D"  "U"  "U" 
## [4,] "L"  "L"  "L"  "D"  "U" 
## [5,] "L"  "L"  "L"  "L"  "D"

# many functions return matrix objects
# create a matrix of the sequences 1:10 and 1:10, then multiply every combination by applying a function
outer(1:10, 1:10, FUN='*')

##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]    2    4    6    8   10   12   14   16   18    20
##  [3,]    3    6    9   12   15   18   21   24   27    30
##  [4,]    4    8   12   16   20   24   28   32   36    40
##  [5,]    5   10   15   20   25   30   35   40   45    50
##  [6,]    6   12   18   24   30   36   42   48   54    60
##  [7,]    7   14   21   28   35   42   49   56   63    70
##  [8,]    8   16   24   32   40   48   56   64   72    80
##  [9,]    9   18   27   36   45   54   63   72   81    90
## [10,]   10   20   30   40   50   60   70   80   90   100
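A few more base R functions are commonly used with matrices. The short sketch below uses a small matrix named 'm2' (a hypothetical name) so that the 'm' object above is not modified.

```r
# fill a 2x3 matrix from a vector; values are placed column by column by default
m2 <- matrix(1:6, nrow = 2, ncol = 3)
m2

# dim() returns the number of rows and columns
dim(m2)

# an entire row or column is selected by leaving the other index empty
m2[1, ]
m2[, 2]

# rowSums() and colSums() summarize across one dimension
rowSums(m2)
colSums(m2)

# t() transposes: rows become columns
dim(t(m2))
```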

1.2 Review of Common Data Type Definitions

1.2.1 Measurement Scales

The data type of a variable refers to the measurement scale used to record it. Four measurement scales are recognized, listed here in order of decreasing precision:

Ratio - Measurements having a constant interval size and a true zero point. Examples include measurements of length, weight, volume, rates, length of time, counts of items, and temperature in Kelvin.

Interval - Measurements having a constant interval size but no true zero point. Examples include temperature (excluding Kelvin), direction (e.g., slope aspect), and time of day. Specific statistical procedures are available to handle circular data like slope aspect.

Ordinal - Members of a set are differentiated by rank. Examples include soil interpretation classes (e.g., slight, moderate, severe) and soil structure grade (e.g., structureless, weak, moderate, strong).

Nominal (Categorical) - Members of a set are differentiated by kind. Examples include vegetation classes, soil map units, and geologic units.

The data type controls the type of statistical operation that can be performed (Stevens, 1946).
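In R, ordinal and nominal data are often represented with ordered and unordered factors, respectively. The sketch below shows one way to do this; the object names and values are hypothetical.

```r
# hypothetical structure grades recorded for four horizons
grade <- c('weak', 'strong', 'moderate', 'weak')

# ordinal: an ordered factor records the ranking of the classes
grade_ord <- factor(grade,
                    levels = c('structureless', 'weak', 'moderate', 'strong'),
                    ordered = TRUE)
grade_ord

# rank comparisons are meaningful for ordered factors
grade_ord >= 'moderate'

# the highest grade observed
max(grade_ord)

# nominal: an unordered factor only distinguishes kinds
geology <- factor(c('granite', 'basalt', 'granite'))
is.ordered(geology)
```

Listing the levels explicitly matters here: without the 'levels' argument, the classes would be ranked alphabetically rather than by grade.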

1.2.2 Continuous and Discrete Data

Continuous Data - Any measured value; data that can take any value within a range. For example, the depth of an Ap horizon could range from 20 cm to 30 cm, with an infinite number of possible values in between, limited only by the precision of the measurement device.

Discrete Data - Data with exact values. For example, the number of Quercus alba seedlings observed in a square meter plot, the number of legs on a dog, or the presence/absence of a feature or phenomenon.
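R does not enforce this distinction, but one common convention is to store continuous measurements as doubles, counts as integers, and presence/absence as logical values. The following is a sketch of that convention with hypothetical object names and values.

```r
# continuous: measured Ap horizon thickness in cm
ap_thickness <- c(20.5, 27.25, 30)
typeof(ap_thickness)

# discrete: seedling counts per plot; the 'L' suffix creates integers
seedlings <- c(0L, 3L, 12L)
typeof(seedlings)

# discrete: presence/absence is naturally a logical vector
redox_present <- c(TRUE, FALSE, TRUE)

# sum() of a logical vector counts the TRUE values
sum(redox_present)
```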

1.2.3 Accuracy and Precision

Accuracy is the closeness of a measured value to its actual (true) value.

Precision is the closeness of repeated measurements to each other.
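The difference between the two ideas can be illustrated with repeated readings of a sample whose true value is known. All names and numbers below are hypothetical: two pH meters read the same reference sample five times each.

```r
# hypothetical reference value and repeated readings from two pH meters
true_pH <- 6.5
meter_a <- c(6.4, 6.6, 6.5, 6.5, 6.5)   # centered on the true value
meter_b <- c(7.0, 7.1, 7.0, 6.9, 7.0)   # same spread, but offset from the true value

# accuracy: how far the average reading is from the true value (bias)
mean(meter_a) - true_pH
mean(meter_b) - true_pH

# precision: how much repeated readings vary among themselves
sd(meter_a)
sd(meter_b)
```

In this sketch both meters are equally precise (same standard deviation), but only the first is accurate.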

1.2.4 Significant Figures

Significant figures are the digits in a number that define the precision of a measurement. The value 6 cm has one significant digit; the implied range is 1 cm, and the true value lies between 5.50 and 6.49 cm. The value 6.2 cm has two significant digits; the implied range is 0.1 cm, and the true value lies between 6.150 and 6.249 cm. The implied precision is greater for the number 6.0 cm than for 6 cm.
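Base R provides signif() and round() for controlling the number of digits carried in a result. A brief sketch follows; 'thickness' is a hypothetical measurement.

```r
# a hypothetical measurement carried to more digits than are meaningful
thickness <- 6.2468

# signif() keeps a given number of significant digits
signif(thickness, 1)
signif(thickness, 2)

# round() instead keeps a given number of decimal places
round(thickness, 1)
round(123.456, 1)
signif(123.456, 1)

# numerically, 6 and 6.0 are the same value; the implied precision exists only in how the number is written
6 == 6.0
format(6, nsmall = 1)
```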

1.3 Additional Resources

Venables, W.N., D.M. Smith, and the R Core Team. 2015. Introduction to R, Notes on R: A programming environment for data analysis and graphics, version (3.2.3, 2015-12-10). https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf.

Wickham, H. 2014. Advanced R. CRC Press, New York. http://adv-r.had.co.nz/.

Chapter 2, Appendix - R Objects and Data Types

Jay Skovlin, Dylan Beaudette, Stephen Roecker

2018-04-09