Chapter 3 - Exploratory Data Analysis

Stephen Roecker and Tom D’Avello

2021-02-03

Objectives

Review methods for estimating Low, RV, and High values
Review different methods for visualizing soil data

Range in Characteristics (RIC)

What do the L, RV, and H clay values represent?
- What taxonomic unit do they refer to?
What are the L and H for texture?

hzname	clay_l	clay_r	clay_h	texture
Ap	7	18	26.1	sil
Bt1	24	27	35.0	sicl
2Bt2	27	31	35.0	cl
2BCt	15	22	25.0	l
2Cd	10	15	20.0	l

Descriptive Statistics - Estimates of L, RV & H

Parameter	NASIS	Description	R function
Mean	RV	arithmetic average	mean()
Median	RV	middle value, 50% quantile	median()
Mode	RV	most frequent value	sort(table(), decreasing = TRUE)[1]
Standard Deviation	L & H	variation	sd()
Quantiles	L & H	percent rank of values, such that p percent of the values are <= the quantile	quantile()
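
The functions in the table above can be tried on a small vector. The sketch below is illustrative only: the clay values are made up (they are not taken from loafercreek), and the 5% and 95% probabilities are just one possible choice for L and H.

```r
# Hypothetical clay percentages (illustrative values, not real data)
clay <- c(10, 12, 15, 18, 18, 18, 20, 22, 22, 25, 28, 35, NA)

# Candidate RV estimates; na.rm = TRUE skips the missing value
clay_mean   <- mean(clay, na.rm = TRUE)
clay_median <- median(clay, na.rm = TRUE)

# table() drops NA by default; the name of the first sorted entry is the mode
clay_mode <- as.numeric(names(sort(table(clay), decreasing = TRUE)[1]))

# Candidate L and H estimates
clay_sd <- sd(clay, na.rm = TRUE)
clay_q  <- quantile(clay, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)

print(c(mean = clay_mean, median = clay_median, mode = clay_mode, sd = clay_sd))
print(clay_q)
```

Note that the 50% quantile and the median are the same number, which is why the median is often paired with quantiles when reporting L, RV, and H.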

Descriptive Statistics - Numeric Data

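library(aqp)     # horizons()
library(soilDB)  # loafercreek dataset
library(dplyr)   # %>% and select()

data("loafercreek")
h <- horizons(loafercreek)
h$texture_class <- factor(h$texture_class)

# summarize numeric and categorical horizon attributes
h %>%
  select(clay, phfield, total_frags_pct, texture_class) %>%
  summary()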

##       clay          phfield     total_frags_pct texture_class
##  Min.   :10.00   Min.   :4.90   Min.   : 0.00   l      :265  
##  1st Qu.:18.00   1st Qu.:6.00   1st Qu.: 0.00   br     :105  
##  Median :22.00   Median :6.30   Median : 5.00   cl     :101  
##  Mean   :23.63   Mean   :6.18   Mean   :13.88   sil    : 56  
##  3rd Qu.:28.00   3rd Qu.:6.50   3rd Qu.:20.00   spm    : 22  
##  Max.   :60.00   Max.   :7.00   Max.   :95.00   (Other): 52  
##  NA's   :167     NA's   :381                    NA's   : 25

Descriptive Statistics - Categorical Data

table(h$genhz, h$texture_class, useNA = "ifany")

##       
##        br  c cb cl gr  l pg scl sic sicl sil sl spm <NA>
##   2BCt  0  6  0  3  0  0  0   0   0    0   0  0   0    0
##   2Bt   0  5  0  1  0  0  0   0   0    0   0  0   0    0
##   A     0  0  0  1  0 97  0   0   0    1  29  7   0    3
##   BA    0  0  0  0  0  2  0   0   0    0   0  0   0    0
##   BCt   0  2  0  8  0  7  0   1   0    1   1  0   0    1
##   Bt    0  0  0 17  0 37  0   0   0    0   1  1   0    0
##   Bt1   0  1  0 13  0 57  0   3   0    1  12  0   0    2
##   Bt2   0  4  0 40  0 45  0   4   2    6   8  0   0    0
##   Bt3   0  0  0  6  0  2  0   0   0    0   0  0   0    0
##   Cr   55  1  1  1  3  1  1   1   0    0   0  0   0   10
##   Oi    0  0  0  0  0  0  0   0   0    0   0  0   4    0
##   R    43  0  0  0  0  0  0   0   0    0   0  0   0    2
##   <NA>  7  0  0 11  0 17  0   0   0    0   5  0  18    7
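
Counts like those above are often easier to compare as proportions. The following is a minimal sketch using a small made-up data frame (the `hz` object and its values are hypothetical) rather than the loafercreek horizons.

```r
# Hypothetical horizon records (illustrative only)
hz <- data.frame(
  genhz   = c("A", "A", "A", "Bt1", "Bt1", "Bt2", "Bt2", "Bt2"),
  texture = c("l", "l", "sil", "l", "cl", "cl", "cl", NA)
)

# Cross-tabulate, keeping missing textures as their own column
tab <- table(hz$genhz, hz$texture, useNA = "ifany")

# Row-wise proportions: share of each texture within a generalized horizon
tab_prop <- prop.table(tab, margin = 1)

print(tab)
print(round(tab_prop, 2))
```

Using `margin = 1` makes each row sum to 1, so the result reads as "within this horizon, what fraction of records has each texture".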

Data Inspection - Missing Data

Exclude all rows that contain missing values using the function na.exclude(), such as h2 <- na.exclude(h). However, this can be wasteful because it removes every affected row (e.g., horizon), even if the row has only one missing value. Instead, it’s sometimes best to create a temporary copy of the variable in question and then remove the missing values, such as clay <- na.exclude(h$clay).
Replace missing values with another value, such as zero, a global constant, or the mean or median value for that column, such as h$clay <- ifelse(is.na(h$clay), 0, h$clay) # or h$clay[is.na(h$clay)] <- 0.
Read the help file for the function you’re attempting to use. Many functions have additional arguments for dealing with missing values, such as na.rm.
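
The three options above can be compared side by side. This is a minimal sketch on a made-up data frame (`h_demo` and its values are hypothetical); replacing with the median is just one of the replacement choices listed.

```r
# Hypothetical horizon data with gaps (illustrative only)
h_demo <- data.frame(
  hzname  = c("A", "Bt1", "Bt2", "Cr"),
  clay    = c(18, 27, NA, NA),
  phfield = c(6.0, NA, 6.4, NA)
)

# Option 1: drop every row that has any missing value
h_complete <- na.exclude(h_demo)

# Less wasteful: drop missing values from a single variable only
clay <- na.exclude(h_demo$clay)

# Option 2: replace missing clay values, here with the column median
h_filled <- h_demo
h_filled$clay[is.na(h_filled$clay)] <- median(h_demo$clay, na.rm = TRUE)

# Option 3: let the function skip missing values itself
clay_mean <- mean(h_demo$clay, na.rm = TRUE)

print(nrow(h_complete))  # rows left after dropping incomplete rows
print(length(clay))      # clay values left after dropping NA
print(h_filled$clay)
print(clay_mean)
```

Without `na.rm = TRUE`, `mean(h_demo$clay)` returns `NA`, which is usually the first sign that a column contains missing values.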

Graphical Methods - Descriptions

Plot Types	Description
Bar	a plot where each bar represents the frequency of observations for a ‘group’
Histogram	a plot where each bar represents the frequency of observations for a ‘given range of values’
Density	an estimation of the frequency distribution based on the sample data
Quantile-Quantile	a plot of the quantiles of the actual data values against the quantiles of a normal distribution
Box-Whisker	a visual representation of median, quartiles, symmetry, skewness, and outliers
Scatter & Line	a graphical display of one variable plotted on the x axis and another on the y axis

Graphical Methods - Functions

Plot Types	Base R	lattice	ggplot geoms
Bar	barplot()	barchart()	geom_bar()
Histogram	hist()	histogram()	geom_histogram()
Density	plot(density())	densityplot()	geom_density()
Quantile-Quantile	qqnorm()	qqmath()	geom_qq()
Box-Whisker	boxplot()	bwplot()	geom_boxplot()
Scatter & Line	plot()	xyplot()	geom_point()
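
The base R column of the table above can be exercised with a few lines. This sketch uses simulated values (the `clay`, `depth`, and `texture` vectors are made up for illustration), so the shapes of the plots say nothing about real soils.

```r
# Simulated data (illustrative only)
set.seed(42)
clay    <- rnorm(100, mean = 22, sd = 5)
depth   <- runif(100, min = 0, max = 100)
texture <- sample(c("l", "cl", "sil"), 100, replace = TRUE)

# Bar: frequency of observations per group
barplot(table(texture))

# Histogram: frequency of observations per range of values
hist(clay)

# Density: smoothed estimate of the frequency distribution
clay_dens <- density(clay)
plot(clay_dens)

# Quantile-Quantile: data quantiles against normal quantiles
qqnorm(clay)
qqline(clay)

# Box-Whisker: median, quartiles, and outliers by group
boxplot(clay ~ texture)

# Scatter: one variable on each axis
plot(clay, depth)
```

The lattice and ggplot2 functions in the other columns produce the same plot types with their own syntax.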

Distributions

The Normal Distribution

Mean & SD vs Median & Quantiles
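
One way to see the contrast named in this slide title is to compute both summaries on skewed data. The sketch below assumes a made-up vector of rock fragment percentages (`frags` is hypothetical), and the 10% and 90% probabilities are an arbitrary choice for L and H.

```r
# Hypothetical, right-skewed rock fragment percentages (illustrative only)
frags <- c(0, 0, 0, 0, 5, 5, 10, 15, 20, 95)

# Mean and standard deviation
frags_mean <- mean(frags)
frags_sd   <- sd(frags)
low_sd     <- frags_mean - frags_sd   # falls below zero for these values
high_sd    <- frags_mean + frags_sd

# Median and quantiles
frags_median <- median(frags)
frags_q      <- quantile(frags, probs = c(0.1, 0.9))

print(c(L = low_sd, RV = frags_mean, H = high_sd))
print(c(L = frags_q[["10%"]], RV = frags_median, H = frags_q[["90%"]]))
```

For these values the mean minus one standard deviation is negative, which is not a possible percentage, while the quantiles always stay within the range of the observed data.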

Distributions

Relationships

3rd Dimension - Color, Shape, Size, Layers, etc…

3rd Dimension - Color

3rd Dimension - Groups

3rd Dimension - Facets

3rd Dimension - Correlation

ggplot2 basics

library(ggplot2)

h <- horizons(loafercreek)

# set the data, figure axes, and aesthetics
p <- ggplot(data = h, aes(x = clay, y = hzdept, color = genhz))

# add layers
p + geom_point()

Summary of Components

Additional Reading

Healy, K., 2018. Data Visualization: a practical introduction. Princeton University Press. http://socviz.co/

Helsel, D.R., and R.M. Hirsch, 2002. Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, Chapter A3. U.S. Geological Survey. 522 pages. http://pubs.usgs.gov/twri/twri4a3/

Kabacoff, R.I., 2015. R in Action. Manning Publications Co. Shelter Island, NY. https://www.statmethods.net/

Kabacoff, R.I., 2018. Data Visualization in R. https://rkabacoff.github.io/datavis/

Peng, R. D., 2016. Exploratory Data Analysis with R. Leanpub. https://bookdown.org/rdpeng/exdata/

Wilke, C.O., 2019. Fundamentals of Data Visualization. O’Reilly Media, Inc. https://serialmentor.com/dataviz/