This document is based on aqp version 1.19.01.
This document is an attempt at addressing some of the issues related to accuracy and uncertainty that I have brought up during discussions of raster (soil class) mapping standards. As such, the following is a combination of soap box moments, demonstrations of methods, todo items for my future self, and references. Honestly, before going any further, be sure to read the recent paper by Rossiter, Zeng, and Zhang (2017).
This article does a fine job of summarizing the major differences between classification and prediction. Most of the modeling frameworks we will be using or evaluating generate predictions in the form of probabilities (prediction). At some point the stack of probabilities will be converted into a single map depicting the most likely class at each pixel (classification). The iterative assessment of model performance (e.g. accuracy and uncertainty) should happen as part of the prediction phase via metrics such as the Brier score (accuracy) and Shannon entropy (uncertainty). An evaluation of the final classification is helpful for communicating accuracy to a wider audience (e.g. percent correctly classified or weighted \(\tau\)) but should not be the primary means of optimizing model performance.
The "Digital Soil Mapping" chapter (5) from the latest Soil Survey Manual describes two commonly used metrics for the description of accuracy and uncertainty: overall accuracy / percent correctly classified (PCC) and the confusion index (CI) of Burrough, van Gaans, and Hootsmans (1997). These methods are widely used and implementation is simple.
Given the complex nature of class mapping results (e.g. stack of class probabilities) and inherent (likely quantifiable) similarity of soil classes, I think that we should explore options for a more robust suite of accuracy and uncertainty metrics. Furthermore, it is my opinion that our quantification of uncertainty should be integrated over all classes (e.g. Shannon entropy).
The \(\tau\) statistic of Rossiter, Zeng, and Zhang (2017) is a more reliable and nuanced representation of accuracy than the PCC.
The \(\tau\) statistic can be upgraded with additional knowledge given the availability of 1) prior understanding of class proportions, and/or, 2) meaningful parameterization of pair-wise class distances.
There can be consensus on the formulation of approximate pair-wise distances within a given modeling domain. Pair-wise distances may not necessarily be the same across modeling domains or projects.
Brier scores are an option for an even more nuanced representation of accuracy, as they integrate all predicted probabilities.
The confusion index of Burrough, van Gaans, and Hootsmans (1997) is an unstable metric when the number of predicted classes is large and when the most likely classes are associated with low probabilities.
Shannon entropy (log base 2) is a more reliable representation of uncertainty than the confusion index, especially when the number of possible classes varies by project. A universally reliable representation of uncertainty is even more important when several methods are used concurrently.
There should be a way to integrate pair-wise distances into the Shannon entropy (or related method) and Brier scores; maybe we will discover those here.
Our current QC/QA process is based on many forms of evaluation, accumulates some degree of subjectivity, and relies heavily on qualitative forms of information (field experience, institutional knowledge, etc.). On the opposite side of the spectrum, the validation of raster mapping is often claimed to be free of subjective interference and entirely quantitative. Those are "good things" that we should always strive for; however, the simplicity of calculating a "percent correctly classified" can interfere with a more nuanced evaluation of accuracy. As I mentioned on the phone (and implicitly volunteered for), a validation "score" might be more meaningful than any single validation metric.
One such score might combine several of the metrics described below.
I strongly believe that we need a robust suite of metrics primarily for internal discussion and evaluation of raster mapping products; even more so when complex modeling frameworks such as randomForest or neural nets are used.
Accuracy and uncertainty metrics are primarily vehicles for understanding, re-calibrating (as needed), and communicating statistical models as part of the development and QA/QC process.
Consider a supervised classification that generates predictions for 5 possible soil classes. Suites of predicted probabilities fall into 3 general cases:
Even though these are simulated data, the three cases above demonstrate common modeling scenarios where classification uncertainty ranges from very low ("Case 3") in some areas to quite high ("Case 1") in others. These three cases could easily be associated with real situations:
Brier scores (Brier 1950, Harrell (2001)) quantify agreement between observed classes and predicted probabilities: \[ B = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} { ( p_{ij} - y_{ij} )^{2} } \] where \(B\) is an index of agreement between predicted probabilities, \(\mathbf{p}\), and class labels, \(\mathbf{y}\), over \(n\) observations and \(c\) possible classes. Larger values suggest less agreement between probabilities and observed class labels.
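As a minimal sketch of the arithmetic (not the aqp implementation, which is demonstrated later), the score can be computed by hand for a small set of hypothetical predictions:

```r
# toy example: 3 observations, 3 possible classes
# each row of p is a vector of predicted probabilities
p <- matrix(c(0.70, 0.20, 0.10,
              0.30, 0.40, 0.30,
              0.10, 0.10, 0.80),
            nrow = 3, byrow = TRUE)

# observed classes, as column indices into p
actual <- c(1, 2, 3)

# one-hot encoding of the observed classes
y <- diag(ncol(p))[actual, ]

# mean (over observations) of summed squared differences
B <- mean(rowSums((p - y)^2))
round(B, 3)
# [1] 0.247
```

Note that the second observation, with its nearly uniform probabilities, contributes most of the total score.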
Follow-up:
What about a weighted version of this score, based on a re-statement of the distance matrix?
The weighted \(\tau\) of Rossiter, Zeng, and Zhang (2017) is implemented in aqp::tauW(). This paper contains some discussion of a weighted version of Shannon entropy, using the subset of similarities between predicted classes and the actual class.
example | brier.score | tau.equal | tau.actual | PCC |
---|---|---|---|---|
Case 1 | 0.74 | 0.19 | 0.14 | 0.35 |
Case 2 | 0.73 | 0.25 | 0.19 | 0.40 |
Case 3 | 0.31 | 0.78 | 0.44 | 0.82 |
\[ H = -\sum_{i=1}^{n}{p_{i} * log_{2}(p_{i})} \]
where \(H\) is an index of uncertainty associated with predicted probabilities, \(\mathbf{p}\), of encountering classes \(i\) through \(n\). Smaller values imply less entropy (more information). Given equal class probabilities, \(H\) increases as the number of classes increases.
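A minimal sketch of the calculation (the aqp implementation is demonstrated later); this version assumes all probabilities are greater than zero:

```r
# Shannon entropy (log base 2) for a vector of probabilities > 0
shannon_h <- function(p) -sum(p * log(p, base = 2))

# maximum entropy: equal probabilities over 4 classes -> log2(4) = 2 bits
shannon_h(rep(0.25, 4))
# [1] 2

# a confident prediction over the same 4 classes has lower entropy
shannon_h(c(0.85, 0.05, 0.05, 0.05))
```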
Kempen et al. (2009) described a normalized version of Shannon entropy that is constrained to values between 0 and 1:
\[ H = -\sum_{i=1}^{n}{p_{i} * log_{n}(p_{i})} \] where \(H\) is an index of uncertainty associated with predicted probabilities, \(\mathbf{p}\), of encountering classes \(i\) through \(n\). This representation may be conveniently contained within the range of \([0,1]\), however, it cannot be used to compare uncertainty from models using different numbers of classes.
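A quick sketch illustrates why the normalized version cannot be compared across models: equal class probabilities always give the maximum value of 1, regardless of how many classes are involved.

```r
# normalized Shannon entropy: log base n, where n = number of classes
shannon_h_norm <- function(p) -sum(p * log(p, base = length(p)))

# uniform probabilities always yield 1, whether there are 3 classes...
shannon_h_norm(rep(1/3, 3))
# [1] 1

# ...or 10 classes, so the two models appear equally uncertain
shannon_h_norm(rep(1/10, 10))
# [1] 1
```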
It is my recommendation that the \(log_{2}\) version of Shannon H be used as our primary metric of uncertainty for predictive soil class mapping. The normalized version of \(H\) might be useful for understanding uncertainty within a single model or in such cases that comparisons between \(CI\) and normalized \(H\) need to be made on the same scale.
Following Burrough, van Gaans, and Hootsmans (1997):
\[ \mathit{CI} = [1 - (\mu_{max} - \mu_{max-1})] \] where \(\mathit{CI}\) is an index of confusion between the first most likely, \(\mu_{max}\), and second most likely, \(\mu_{max-1}\), class probabilities.
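A minimal sketch of the calculation, using two hypothetical probability vectors:

```r
# confusion index: 1 - (difference between the two largest probabilities)
ci <- function(p) {
  p.sorted <- sort(p, decreasing = TRUE)
  1 - (p.sorted[1] - p.sorted[2])
}

# two nearly-tied most likely classes -> high confusion
ci(c(0.35, 0.34, 0.11, 0.10, 0.10))
# [1] 0.99

# a single dominant class -> low confusion
ci(c(0.80, 0.10, 0.05, 0.03, 0.02))
# [1] 0.3
```

Note that the CI ignores all but the two largest probabilities, which is part of why it can be unstable when many low-probability classes are involved.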
Examples from the simulated data:
example | A | B | C | D | E | Shannon.H | CI |
---|---|---|---|---|---|---|---|
Case 1 | 0.06 | 0.16 | 0.10 | 0.24 | 0.44 | 2.00 | 0.79 |
Case 2 | 0.09 | 0.18 | 0.06 | 0.33 | 0.34 | 2.07 | 0.99 |
Case 3 | 0.09 | 0.04 | 0.03 | 0.03 | 0.81 | 1.04 | 0.27 |
(Figure: Confusion Index and Shannon H panels, computed from a probabilistic soil class model generating predictions for 5 possible classes, across the 3 general cases of predicted probabilities.)
Still working on this.
A simple extension (weighted entropy score) of the standard version is described in this paper:
\[ \mathit{WES} = -\sum_{i=1}^{n}{w_{i} * p_{i} * log_{n}(p_{i})} \] where \(w_{i}\) is the weight associated with \(p_{i}\). This isn't quite applicable, as the weights we are interested in are based on pair-wise distances between classes. More promising ideas:
Pairwise mutual information seems like a convenient extension of Shannon H that would easily accommodate pairwise distances.
The aqp package has implementations of Shannon entropy and Brier score; there are many other implementations, but these are convenient for soil survey work. Consider the following table of predicted probabilities (classes A, B, C, D, E) and the observed class (actual).
library(aqp)
# example data
d <- structure(list(A = c(0.0897243494322252, 0.0537087411977284,
0.0643087579284512, 0.0582791533521884, 0.0655491726966812, 0.0878056947034425,
0.0550727743006022, 0.10724015754623, 0.0332599961787985, 0.0555131608754956
), B = c(0.191110141078936, 0.187244044389649, 0.119214057525671,
0.198461646003737, 0.161851348940294, 0.172157251906694, 0.113611770097243,
0.178697159594029, 0.194607795787689, 0.188977055949146), C = c(0.121941735763077,
0.0770539012535731, 0.0977753159795662, 0.0774293724263895, 0.072198187957068,
0.0366921003115242, 0.151033286139089, 0.0974443429098862, 0.124876574685048,
0.0864142563046045), D = c(0.351108807309283, 0.322120077305279,
0.440632731639948, 0.401063395801608, 0.312647702445919, 0.304193047630158,
0.270239142407351, 0.258895264130713, 0.422747316475851, 0.252724366285052
), E = c(0.246114966416479, 0.359873235853771, 0.278069136926363,
0.264766432416077, 0.387753587960038, 0.399151905448182, 0.410043027055715,
0.357723075819142, 0.224508316872614, 0.416371160585702), id = c("1",
"10", "100", "1000", "101", "102", "103", "104", "105", "106"
), actual = c("D", "B", "D", "E", "D", "D", "E", "E", "D", "E"
)), .Names = c("A", "B", "C", "D", "E", "id", "actual"), row.names = c(NA,
10L), class = "data.frame")
# check it out
# predictions, and actual, observed class
head(d)
## A B C D E id actual
## 1 0.08972435 0.1911101 0.12194174 0.3511088 0.2461150 1 D
## 2 0.05370874 0.1872440 0.07705390 0.3221201 0.3598732 10 B
## 3 0.06430876 0.1192141 0.09777532 0.4406327 0.2780691 100 D
## 4 0.05827915 0.1984616 0.07742937 0.4010634 0.2647664 1000 E
## 5 0.06554917 0.1618513 0.07219819 0.3126477 0.3877536 101 D
## 6 0.08780569 0.1721573 0.03669210 0.3041930 0.3991519 102 D
Brier scores (accuracy) are computed over all predictions and associated observed classes. Smaller is better.
# compute Brier score from all predictions
brierScore(d, classLabels = c('A', 'B', 'C', 'D', 'E'), actual = 'actual')
## [1] 0.5833992
Shannon entropy (uncertainty) is computed from each vector of predicted probabilities. Larger values suggest less certainty, more diversity, or less information, depending on the interpretation of entropy.
# shannon entropy for first row, could be a single pixel or obs. point
shannonEntropy(d[1, c('A', 'B', 'C', 'D', 'E')])
## [1] 2.166525
# compute shannon entropy for all rows
apply(d[, c('A', 'B', 'C', 'D', 'E')], 1, shannonEntropy)
## 1 2 3 4 5 6 7 8 9 10
## 2.166525 2.021157 1.982791 2.024063 2.011094 1.971243 2.036219 2.151995 2.006615 2.018874
Variations on the \(\tau\) statistic (pending).
Brier, Glenn W. 1950. “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review 78 (1): 1–3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2.
Burrough, P.A., P.F.M. van Gaans, and R. Hootsmans. 1997. “Continuous Classification in Soil Survey: Spatial Correlation, Confusion and Boundaries.” Geoderma 77: 115–35. doi:10.1016/S0016-7061(97)00018-9.
Harrell, Frank E. 2001. Regression Modeling Strategies. Springer Series in Statistics. New York, NY: Springer.
Kempen, Bas, Dick J. Brus, Gerard B.M. Heuvelink, and Jetse J. Stoorvogel. 2009. “Updating the 1:50,000 Dutch Soil Map Using Legacy Soil Data: A Multinomial Logistic Regression Approach.” Geoderma 151: 311–26. doi:10.1016/j.geoderma.2009.04.023.
Rossiter, David G., Rong Zeng, and Gan-Lin Zhang. 2017. “Accounting for Taxonomic Distance in Accuracy Assessment of Soil Class Predictions.” Geoderma 292 (Supplement C): 118–27. doi:10.1016/j.geoderma.2017.01.012.