Chapter 3 - Exploratory Data Analysis

Stephen Roecker and Tom D’Avello

2021-02-03

Objectives

Range in Characteristics (RIC)

hzname  clay_l  clay_r  clay_h  texture
Ap           7      18    26.1  sil
Bt1         24      27    35.0  sicl
2Bt2        27      31    35.0  cl
2BCt        15      22    25.0  l
2Cd         10      15    20.0  l

Descriptive Statistics - Estimates of L, RV & H

Parameter           NASIS  Description                                          R function
Mean                RV     arithmetic average                                   mean()
Median              RV     middle value, 50% quantile                           median()
Mode                RV     most frequent value                                  sort(table(), decreasing = TRUE)[1]
Standard Deviation  L & H  variation around the mean                            sd()
Quantiles           L & H  value at or below which a proportion p of data fall  quantile()
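
As a minimal sketch of how these functions could yield L, RV, and H estimates, the example below uses a small hypothetical vector of clay percentages (the values are invented for illustration and are not taken from any dataset in this chapter). Using the 5th and 95th percentiles for L and H is an assumption made here, not a rule stated above.

```r
# hypothetical clay percentages (illustrative values only), with one missing value
clay <- c(18, 22, 22, 25, 27, 31, 35, NA)

# RV candidates
rv_mean   <- mean(clay, na.rm = TRUE)
rv_median <- median(clay, na.rm = TRUE)

# mode: table() drops NA by default; the names of the table are character,
# so convert the most frequent one back to numeric
rv_mode <- as.numeric(names(sort(table(clay), decreasing = TRUE)[1]))

# L and H candidates
clay_sd <- sd(clay, na.rm = TRUE)
lrh <- quantile(clay, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)  # assumed percentiles

rv_mean
rv_median   # 25
rv_mode     # 22
clay_sd
lrh
```

Quantiles are often preferred over the mean and standard deviation for skewed data, because they do not assume a symmetric distribution.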

Descriptive Statistics - Numeric Data

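library(aqp)     # horizons()
library(soilDB)  # loafercreek sample data
library(dplyr)   # %>% and select()

data("loafercreek")
h <- horizons(loafercreek)
h$texture_class <- factor(h$texture_class)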

h %>%
  select(clay, phfield, total_frags_pct, texture_class) %>%
  summary()
##       clay          phfield     total_frags_pct texture_class
##  Min.   :10.00   Min.   :4.90   Min.   : 0.00   l      :265  
##  1st Qu.:18.00   1st Qu.:6.00   1st Qu.: 0.00   br     :105  
##  Median :22.00   Median :6.30   Median : 5.00   cl     :101  
##  Mean   :23.63   Mean   :6.18   Mean   :13.88   sil    : 56  
##  3rd Qu.:28.00   3rd Qu.:6.50   3rd Qu.:20.00   spm    : 22  
##  Max.   :60.00   Max.   :7.00   Max.   :95.00   (Other): 52  
##  NA's   :167     NA's   :381                    NA's   : 25

Descriptive Statistics - Categorical Data

table(h$genhz, h$texture_class, useNA = "ifany")
##       
##        br  c cb cl gr  l pg scl sic sicl sil sl spm <NA>
##   2BCt  0  6  0  3  0  0  0   0   0    0   0  0   0    0
##   2Bt   0  5  0  1  0  0  0   0   0    0   0  0   0    0
##   A     0  0  0  1  0 97  0   0   0    1  29  7   0    3
##   BA    0  0  0  0  0  2  0   0   0    0   0  0   0    0
##   BCt   0  2  0  8  0  7  0   1   0    1   1  0   0    1
##   Bt    0  0  0 17  0 37  0   0   0    0   1  1   0    0
##   Bt1   0  1  0 13  0 57  0   3   0    1  12  0   0    2
##   Bt2   0  4  0 40  0 45  0   4   2    6   8  0   0    0
##   Bt3   0  0  0  6  0  2  0   0   0    0   0  0   0    0
##   Cr   55  1  1  1  3  1  1   1   0    0   0  0   0   10
##   Oi    0  0  0  0  0  0  0   0   0    0   0  0   4    0
##   R    43  0  0  0  0  0  0   0   0    0   0  0   0    2
##   <NA>  7  0  0 11  0 17  0   0   0    0   5  0  18    7

Data Inspection - Missing Data

  1. Exclude all rows that contain missing values using the function na.exclude(), such as h2 <- na.exclude(h). However, this can be wasteful because it removes every row (e.g., horizon) that has a missing value, even if only one value in that row is missing. Instead, it’s sometimes best to create a temporary copy of the variable in question and then remove the missing values, such as clay <- na.exclude(h$clay).
  2. Replace missing values with another value, such as zero, a global constant, or the mean or median value for that column, such as h$clay <- ifelse(is.na(h$clay), 0, h$clay) # or h$clay[is.na(h$clay)] <- 0.
  3. Read the help file for the function you’re attempting to use. Many functions have additional arguments for dealing with missing values, such as na.rm.
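
The three options above can be sketched on a small hypothetical data frame. The object names h_demo, h_complete, and h_filled, and all values, are invented for illustration; replacing with the median is only one of the replacement choices listed.

```r
# hypothetical horizon data with missing values (illustrative only)
h_demo <- data.frame(
  hzname  = c("A", "Bt1", "Bt2", "Cr"),
  clay    = c(18, NA, 31, NA),
  phfield = c(6.1, 6.3, NA, 6.5)
)

# option 1a: drop every row with any missing value (only 1 of 4 rows survives)
h_complete <- na.exclude(h_demo)

# option 1b: drop missing values from a single variable instead
clay <- na.exclude(h_demo$clay)

# option 2: replace missing values in one column, here with the column median
h_filled <- h_demo
h_filled$clay[is.na(h_filled$clay)] <- median(h_demo$clay, na.rm = TRUE)

# option 3: let the function handle missing values
mean(h_demo$clay)                # NA, because missing values propagate
mean(h_demo$clay, na.rm = TRUE)  # 24.5
```

Note that replacing missing values changes the distribution of the variable, so the replaced copy is kept separate from the original here.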

Graphical Methods - Descriptions

Plot Types         Description
Bar                a plot where each bar represents the frequency of observations for a ‘group’
Histogram          a plot where each bar represents the frequency of observations for a ‘given range of values’
Density            an estimation of the frequency distribution based on the sample data
Quantile-Quantile  a plot of the sample quantiles against the quantiles of a normal distribution
Box-Whisker        a visual representation of median, quartiles, symmetry, skewness, and outliers
Scatter & Line     a graphical display of one variable plotted on the x axis and another on the y axis

Graphical Methods - Functions

Plot Types         Base R           lattice        ggplot2 geoms
Bar                barplot()        barchart()     geom_bar()
Histogram          hist()           histogram()    geom_histogram()
Density            plot(density())  densityplot()  geom_density()
Quantile-Quantile  qqnorm()         qqmath()       geom_qq()
Box-Whisker        boxplot()        bwplot()       geom_boxplot()
Scatter & Line     plot()           xyplot()       geom_point()
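
As a rough illustration of the base R functions, the sketch below draws one panel per plot type from simulated data. The clay values and horizon labels are randomly generated placeholders, not measurements, and the panel layout is just one convenient choice.

```r
# simulated clay percentages and horizon labels (hypothetical data)
set.seed(42)
clay  <- rnorm(100, mean = 25, sd = 5)
genhz <- rep(c("A", "Bt1", "Bt2", "Cr"), each = 25)

op <- par(mfrow = c(2, 3))                 # six panels in one figure
barplot(table(genhz), main = "Bar")        # frequency per group
hist(clay, main = "Histogram")             # frequency per range of values
plot(density(clay), main = "Density")      # smoothed frequency distribution
qqnorm(clay, main = "Quantile-Quantile")   # sample vs normal quantiles
qqline(clay)                               # reference line for the Q-Q plot
boxplot(clay ~ genhz, main = "Box-Whisker")
plot(seq_along(clay), clay, main = "Scatter")
par(op)                                    # restore previous graphics settings
```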

Distributions

The Normal Distribution

Mean & SD vs Median & Quantiles
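
One way to see the difference between the two pairs of statistics is to add a single extreme value to a small sample. The sketch below uses invented clay values; it is an illustration under that assumption, not an analysis of real data.

```r
clay     <- c(18, 20, 22, 24, 26)  # hypothetical values
clay_out <- c(clay, 90)            # same values plus one extreme observation

# the mean and standard deviation shift strongly with the extreme value
mean(clay)       # 22
mean(clay_out)   # about 33.3
sd(clay)
sd(clay_out)

# the median and quartiles shift far less
median(clay)     # 22
median(clay_out) # 23
quantile(clay,     probs = c(0.25, 0.75))
quantile(clay_out, probs = c(0.25, 0.75))
```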

Distributions

Relationships

3rd Dimension - Color, Shape, Size, Layers, etc…

3rd Dimension - Color

3rd Dimension - Groups

3rd Dimension - Facets

3rd Dimension - Correlation

ggplot2 basics

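library(aqp)      # horizons()
library(soilDB)   # loafercreek sample data
library(ggplot2)

data("loafercreek")
h <- horizons(loafercreek)

# set the data and map variables to aesthetics (x, y, color)
p <- ggplot(data = h, aes(x = clay, y = hzdept, color = genhz))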

# add layers
p + geom_point()

Summary of Components

Additional Reading

Healy, K., 2018. Data Visualization: A Practical Introduction. Princeton University Press. http://socviz.co/

Helsel, D.R., and R.M. Hirsch, 2002. Statistical Methods in Water Resources. Techniques of Water-Resources Investigations, Book 4, Chapter A3. U.S. Geological Survey. 522 pages. http://pubs.usgs.gov/twri/twri4a3/

Kabacoff, R.I., 2015. R in Action. Manning Publications Co., Shelter Island, NY. https://www.statmethods.net/

Kabacoff, R.I., 2018. Data Visualization in R. https://rkabacoff.github.io/datavis/

Peng, R. D., 2016. Exploratory Data Analysis with R. Leanpub. https://bookdown.org/rdpeng/exdata/

Wilke, C.O., 2019. Fundamentals of Data Visualization. O’Reilly Media, Inc. https://serialmentor.com/dataviz/