The choice of modeling framework has a significant impact on the quantity of data required, the flexibility to account for non-linearities and interactions, the potential for over-fitting, model performance, and the degree to which the final model can be interpreted. This document and related commentary provide useful background on the interplay between model generality, interpretation, and performance.
TODO: package up these data to be used as an example.
The following figure goes along with a bit of conversation I had (below) with some of my colleagues on the topic of modeling soil temperature regime (STR) using a variety of frameworks. The strong bioclimatic gradient across MLRAs 17, 18, 22A, and 22B made it possible to use a modeling framework that generated reasonable predictions and resulted in an interpretable model.
Essentially, each framework (MLR, regression trees, randomForest, etc.) is useful for different tasks, but thinking about the most appropriate framework ahead of time is time well spent. Predictions aren’t the same as science or understanding. Sometimes we need the former, sometimes we need the latter, and sometimes we need both.
The x-axis is elevation, in most cases the dominant driver of soil temperature and soil temperature regime in this area. The y-axis is the conditional probability of each of several STR.
The top panel represents the smooth surface fit by multinomial logistic regression (MLR). These smooth surfaces lend themselves to testable interpretations such as “the transition between STR per 1,000’ of elevation gain follows XXX”. This is far more useful when the model includes other factors such as annual beam radiance and the effect of cold air drainages. Absolute accuracy is sacrificed for a general (e.g. continuous over predictors) representation of the system that can support inference. Another example: “at elevation XXX, what is the average effect of moving from a south-facing slope to a north-facing slope?”.
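As a rough illustration of why the smooth MLR surface supports statements about rates of change, the sketch below evaluates a multinomial logistic model with elevation as the only predictor. The class names, the coefficients in `COEF`, and the helper functions are all hypothetical and chosen for round numbers; they are not fitted values from the figure.

```python
import math

# Hypothetical (intercept, slope) pairs; the slope is the change in the linear
# predictor per 1,000 ft of elevation. "thermic" acts as the reference class.
COEF = {
    "thermic": (0.0, 0.0),
    "mesic": (-6.0, 2.0),
    "frigid": (-18.0, 4.0),
}

def str_probabilities(elev_kft):
    """Softmax of the linear predictors: Pr(STR | elevation)."""
    scores = {k: a + b * elev_kft for k, (a, b) in COEF.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    expd = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(expd.values())
    return {k: v / total for k, v in expd.items()}

def crossover(class_a, class_b):
    """Elevation (1,000 ft) at which two classes are equally probable."""
    a1, b1 = COEF[class_a]
    a2, b2 = COEF[class_b]
    return (a1 - a2) / (b2 - b1)

print(crossover("thermic", "mesic"))  # 3.0
print(crossover("mesic", "frigid"))   # 6.0
print(str_probabilities(4.5))
```

With these assumed coefficients, the log-odds of mesic vs. thermic increase by 2.0 for every 1,000’ of elevation gain, which is the kind of interpretable quantity that the tree-based frameworks do not provide directly.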
The second panel down represents the hard thresholds generated by an algorithm from the tree-based classification framework (e.g. recursive partitioning trees via rpart). Accuracy is about the same as the MLR approach and the hard breaks can be interpreted as “reasonable” cut points or thresholds that may have links to physical processes. Note that the cut points identified by this framework are very close to the 50% probability cross-over points in the MLR panel. The result is a (potentially) pragmatic partitioning of reality that can support decisions but not inference (e.g. “rate of change in Pr(STR) vs. 1000’ elevation gain”).
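The hard thresholds can be illustrated with a single recursive-partitioning step. The sketch below searches for the elevation cut point that minimizes the weighted Gini impurity of the two resulting groups. It is a minimal, one-split illustration on made-up data, not a reimplementation of rpart (which also handles multiple predictors, repeated splitting, and pruning); the function names and the sample values are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, weighted impurity) for the best single cut on xs."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_score, best_threshold = float("inf"), None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point exists between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score, best_threshold = score, (lo + hi) / 2
    return best_threshold, best_score

# Made-up observations: elevation in 1,000 ft and the observed STR.
elev = [1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0]
regime = ["thermic"] * 4 + ["mesic"] * 4

print(best_split(elev, regime))  # (3.0, 0.0)
```

The result is a single number ("thermic below 3,000’, mesic above") that is easy to act on, but it carries no information about how quickly the probabilities change near the threshold.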
The third panel down represents the nearly-exact (over?) fitting of STR probabilities generated by the random forest framework. This approach builds thousands of classification trees and (roughly) averages them together. The result is incredible (unbelievable?) within-training-set accuracy (99% here), at the expense of an interpretable model. That isn’t always a problem: sometimes predictions are all that we have time for. That said, this framework requires a 100x larger training sample (vs. MLR) and an independent validation data set before it can be trusted on new data.
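The gap between within-training-set accuracy and accuracy on independent data can be sketched without a full random forest. In the example below, a 1-nearest-neighbor classifier stands in as a learner that memorizes its training data; it is explicitly not a random forest. The synthetic data, the 20% label-noise rate, the 3,000’ boundary, and the function names are all assumptions made for illustration.

```python
import random

def make_data(n, rng):
    """Synthetic (elevation in 1,000 ft, STR) pairs with 20% label noise."""
    rows = []
    for _ in range(n):
        elev = rng.uniform(0.0, 6.0)
        label = "mesic" if elev > 3.0 else "thermic"
        if rng.random() < 0.2:  # flip the label to mimic noisy observations
            label = "thermic" if label == "mesic" else "mesic"
        rows.append((elev, label))
    return rows

def predict_1nn(train, elev):
    """Label of the training row whose elevation is closest to elev."""
    return min(train, key=lambda row: abs(row[0] - elev))[1]

def accuracy(train, rows):
    """Fraction of rows whose label matches the 1-NN prediction."""
    hits = sum(predict_1nn(train, x) == y for x, y in rows)
    return hits / len(rows)

rng = random.Random(42)
train = make_data(200, rng)
holdout = make_data(200, rng)  # independent draw, never used for fitting

train_acc = accuracy(train, train)      # each point is its own nearest neighbor
holdout_acc = accuracy(train, holdout)  # closer to what new data would give
print(train_acc, holdout_acc)
```

A memorizing learner scores perfectly on the data it was trained on regardless of how noisy those data are, so only the holdout figure says anything about performance on new observations.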