Chapter 3 - Exploratory Data Analysis

Stephen Roecker and Tom D’Avello



Range in Characteristics (RIC)

hzname  clay_l  clay_r  clay_h  texture
Ap           7      18    26.1  sil
Bt1         24      27    35.0  sicl
2Bt2        27      31    35.0  cl
2BCt        15      22    25.0  l
2Cd         10      15    20.0  l

Descriptive Statistics - Estimates of L, RV & H

Parameter           NASIS  Description                                               R function
Mean                RV     arithmetic average                                        mean()
Median              RV     middle value, 50% quantile                                median()
Mode                RV     most frequent value                                       sort(table(), decreasing = TRUE)[1]
Standard Deviation  L & H  spread of values around the mean                          sd()
Quantiles           L & H  values at or below which a proportion p of the data fall  quantile()
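
As a minimal sketch of how the functions in the table map to low, RV, and high estimates, the example below uses a small made-up vector of clay percentages (hypothetical values, not taken from loafercreek). The choice of the 5th and 95th percentiles for L and H is an assumption made only for illustration.

```r
# hypothetical clay percentages for one horizon (illustrative values only)
clay <- c(18, 20, 22, 22, 24, 25, 27, 30, 35)

# candidate RV estimates
clay_mean   <- mean(clay)
clay_median <- median(clay)
# mode: most frequent value; names() returns it as a character string
clay_mode   <- names(sort(table(clay), decreasing = TRUE))[1]

# candidate L & H estimates
clay_sd <- sd(clay)
clay_q  <- quantile(clay, probs = c(0.05, 0.5, 0.95))  # assumed percentiles

round(c(mean = clay_mean, median = clay_median, sd = clay_sd), 1)
clay_mode
clay_q
```

Note that the 50% quantile and the median are the same value, so either can serve as the RV.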

Descriptive Statistics - Numeric Data

library(aqp); library(soilDB); library(dplyr)
data("loafercreek", package = "soilDB")
h <- horizons(loafercreek)
h$texture_class <- factor(h$texture_class)

h %>%
  select(clay, phfield, total_frags_pct, texture_class) %>%
  summary()
##       clay          phfield     total_frags_pct texture_class
##  Min.   :10.00   Min.   :4.90   Min.   : 0.00   l      :265  
##  1st Qu.:18.00   1st Qu.:6.00   1st Qu.: 0.00   br     :105  
##  Median :22.00   Median :6.30   Median : 5.00   cl     :101  
##  Mean   :23.63   Mean   :6.18   Mean   :13.88   sil    : 56  
##  3rd Qu.:28.00   3rd Qu.:6.50   3rd Qu.:20.00   spm    : 22  
##  Max.   :60.00   Max.   :7.00   Max.   :95.00   (Other): 52  
##  NA's   :167     NA's   :381                    NA's   : 25

Descriptive Statistics - Categorical Data

table(h$genhz, h$texture_class, useNA = "ifany")
##        br  c cb cl gr  l pg scl sic sicl sil sl spm <NA>
##   2BCt  0  6  0  3  0  0  0   0   0    0   0  0   0    0
##   2Bt   0  5  0  1  0  0  0   0   0    0   0  0   0    0
##   A     0  0  0  1  0 97  0   0   0    1  29  7   0    3
##   BA    0  0  0  0  0  2  0   0   0    0   0  0   0    0
##   BCt   0  2  0  8  0  7  0   1   0    1   1  0   0    1
##   Bt    0  0  0 17  0 37  0   0   0    0   1  1   0    0
##   Bt1   0  1  0 13  0 57  0   3   0    1  12  0   0    2
##   Bt2   0  4  0 40  0 45  0   4   2    6   8  0   0    0
##   Bt3   0  0  0  6  0  2  0   0   0    0   0  0   0    0
##   Cr   55  1  1  1  3  1  1   1   0    0   0  0   0   10
##   Oi    0  0  0  0  0  0  0   0   0    0   0  0   4    0
##   R    43  0  0  0  0  0  0   0   0    0   0  0   0    2
##   <NA>  7  0  0 11  0 17  0   0   0    0   5  0  18    7
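
Raw counts are often easier to compare once converted to proportions. The sketch below, which uses a small made-up data frame (the hz object and its values are hypothetical), shows the same table() call followed by prop.table() and addmargins() from base R.

```r
# hypothetical horizon data (illustrative values only)
hz <- data.frame(
  genhz   = c("A", "A", "A", "Bt1", "Bt1", "Bt2", "Bt2", "Cr"),
  texture = c("l", "l", "sil", "l", "cl", "cl", "cl", NA)
)

# counts, keeping missing values as their own column
tab <- table(hz$genhz, hz$texture, useNA = "ifany")
tab

# row proportions: share of each texture within a generalized horizon
tab_prop <- prop.table(tab, margin = 1)
round(tab_prop, 2)

# append row and column totals to the counts
addmargins(tab)
```

With margin = 1 each row sums to 1, so the proportions describe the texture makeup of each generalized horizon rather than of the whole dataset.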

Data Inspection - Missing Data

  1. Exclude all rows that contain missing values using the function na.exclude(), such as h2 <- na.exclude(h). However, this can be wasteful because it removes every row (e.g., horizon) that has a missing value in any column, even if only 1 value is missing. Instead, it is sometimes best to create a temporary copy of the variable in question and then remove its missing values, such as clay <- na.exclude(h$clay).
  2. Replace missing values with another value, such as zero, a global constant, or the mean or median value for that column, such as h$clay <- ifelse(is.na(h$clay), 0, h$clay) # or h$clay[is.na(h$clay)] <- 0.
  3. Read the help file for the function you’re attempting to use. Many functions have additional arguments for dealing with missing values, such as na.rm.
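
The three options above can be sketched on a small made-up data frame (the hz object and its values are hypothetical). Replacing with the median is only one of the choices listed in option 2.

```r
# hypothetical horizon data with missing values (illustrative only)
hz <- data.frame(
  hzname = c("A", "Bt1", "Bt2", "Cr"),
  clay   = c(18, NA, 31, NA),
  ph     = c(6.1, 6.3, NA, 6.5)
)

# 1. drop missing values: whole rows vs. a single variable
hz_complete <- na.exclude(hz)       # keeps only rows with no NA in any column
clay_only   <- na.exclude(hz$clay)  # keeps the non-missing clay values

# 2. replace missing values, here with the column median
hz$clay_filled <- ifelse(is.na(hz$clay), median(hz$clay, na.rm = TRUE), hz$clay)

# 3. use a function's own argument for handling missing values
clay_mean_na <- mean(hz$clay)                # NA, because missing values propagate
clay_mean    <- mean(hz$clay, na.rm = TRUE)  # computed from non-missing values only

nrow(hz_complete)
length(clay_only)
hz$clay_filled
```

In this example, dropping incomplete rows keeps 1 of 4 horizons, while dropping missing values from clay alone keeps 2 of 4 values, which illustrates why row-wise exclusion can be wasteful.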

Graphical Methods - Descriptions

Plot Types         Description
Bar                a plot where each bar represents the frequency of observations for a ‘group’
Histogram          a plot where each bar represents the frequency of observations for a ‘given range of values’
Density            an estimation of the frequency distribution based on the sample data
Quantile-Quantile  a plot of the sample quantiles against the quantiles of a normal distribution
Box-Whisker        a visual representation of median, quartiles, symmetry, skewness, and outliers
Scatter & Line     a graphical display of one variable plotted on the x axis and another on the y axis

Graphical Methods - Functions

Plot Types         Base R           lattice        ggplot2 geoms
Bar                barplot()        barchart()     geom_bar()
Histogram          hist()           histogram()    geom_histogram()
Density            plot(density())  densityplot()  geom_density()
Quantile-Quantile  qqnorm()         qqmath()       geom_qq()
Box-Whisker        boxplot()        bwplot()       geom_boxplot()
Scatter & Line     plot()           xyplot()       geom_point()
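
As a quick illustration of the Base R column, the sketch below draws each plot type from simulated data. The variables clay, ph, and genhz are simulated stand-ins, not values from loafercreek.

```r
set.seed(42)

# simulated, roughly normal values (not real data)
clay <- rnorm(100, mean = 25, sd = 5)
ph   <- rnorm(100, mean = 6.2, sd = 0.4)
# hypothetical grouping variable
genhz <- factor(rep(c("A", "Bt1", "Bt2", "Cr"), times = c(40, 30, 20, 10)))

barplot(table(genhz))        # bar: counts per group
h_out <- hist(clay)          # histogram: counts per range of values
plot(density(clay))          # density: smoothed frequency distribution
qqnorm(clay); qqline(clay)   # quantile-quantile: sample vs. normal quantiles
boxplot(clay ~ genhz)        # box-whisker: distribution of clay by group
plot(x = clay, y = ph)       # scatter: one variable on each axis
```

hist() also returns its bin counts invisibly, so the result can be saved (as h_out here) and inspected without redrawing the plot.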


The Normal Distribution

Mean & SD vs Median & Quantiles
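
A minimal sketch of the contrast named in this heading, using made-up rock fragment percentages (hypothetical values) that are right-skewed with one extreme observation:

```r
# hypothetical rock fragment percentages, right-skewed with one extreme value
frags <- c(0, 0, 0, 2, 5, 5, 10, 15, 20, 95)

# the mean and SD are pulled toward the extreme value
frags_mean <- mean(frags)
frags_sd   <- sd(frags)

# the median and quartiles are resistant to it
frags_median <- median(frags)
frags_q      <- quantile(frags, probs = c(0.25, 0.75))

c(mean = frags_mean, median = frags_median)
frags_q

# mean minus 1 SD falls below zero here, which is impossible for a percentage
frags_mean - frags_sd
```

For skewed data such as this, the median and quantiles give low, RV, and high estimates that stay within the observed range, whereas the mean and standard deviation assume a roughly symmetric distribution.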



3rd Dimension - Color, Shape, Size, Layers, etc…

3rd Dimension - Color

3rd Dimension - Groups

3rd Dimension - Facets

3rd Dimension - Correlation

ggplot2 basics

library(ggplot2)

h <- horizons(loafercreek)

# set figure axis and aesthetics
p <- ggplot(data = h, aes(x = clay, y = hzdept, color = genhz))

# add layers
p + geom_point()
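
The same pattern extends to additional layers, scales, and facets. The sketch below uses a small made-up data frame (the hz object, the site column, and all values are hypothetical) so it can run without the loafercreek data; reversing the y axis is an illustrative choice so that depth increases downward.

```r
library(ggplot2)

# hypothetical horizon data (illustrative values only)
hz <- data.frame(
  clay   = c(12, 18, 24, 30, 15, 20, 28, 33),
  hzdept = c(0, 20, 45, 80, 0, 15, 40, 70),
  genhz  = rep(c("A", "Bt1", "Bt2", "Cr"), times = 2),
  site   = rep(c("site 1", "site 2"), each = 4)
)

# set figure axes and aesthetics
p <- ggplot(data = hz, aes(x = clay, y = hzdept, color = genhz))

# add components with `+`; each addition returns a new plot object
p2 <- p +
  geom_point(size = 3) +        # layer: one point per horizon
  scale_y_reverse() +           # scale: depth increases downward
  facet_wrap(~ site) +          # facets: one panel per site
  labs(x = "clay (%)", y = "top depth (cm)")

p2
```

Because p is stored as an object, alternative layers or facets can be tried by adding them to p without repeating the data and aesthetic mappings.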

Summary of Components

Additional Reading

Healy, K., 2018. Data Visualization: a practical introduction. Princeton University Press.

Helsel, D.R., and R.M. Hirsch, 2002. Statistical Methods in Water Resources. Techniques of Water-Resources Investigations, Book 4, Chapter A3. U.S. Geological Survey. 522 pages.

Kabacoff, R.I., 2015. R in Action. Manning Publications Co. Shelter Island, NY.

Kabacoff, R.I., 2018. Data Visualization in R.

Peng, R. D., 2016. Exploratory Data Analysis with R. Leanpub.

Wilke, C.O., 2019. Fundamentals of Data Visualization. O’Reilly Media, Inc.