This is a short demonstration of why map unit summary reports are based on a sampling of overlapping pixels (from raster data sources) instead of using all overlapping pixels.
Nearly all statistical methods are based on the premise that meaningful descriptions of an underlying population can be derived from a subset of the population, or sample. The efficiency of a sample is (mostly) dependent on three parameters: the shape of the population distribution, the size of the sample and the degree to which values in the population are related (autocorrelation). A population with a simple distribution, such as the normal, can be characterized with a smaller sample than more complex distributions. The effect of sample size makes intuitive sense: more observations (e.g. larger sample) will give a better description of the population, regardless of distribution. Characterization of populations with a low degree of autocorrelation require relatively larger samples as compared to populations with a high degree of autocorrelation.
The normal distribution, with a mean of 12% and standard deviation of 2%. Vertical lines mark the mean +/- 1 standard deviation.
For the sake of demonstration, lets pretend that we know some details about the population of possible clay contents associated with all A horizons, within a given soil series: the distribution shape is approximately normal, has a mean of 12% and a standard deviation of 2%. We can simulate these conditions by drawing a large number (1,000) of samples from the normal distribution. Having “access” to the clay content population is analogous to collecting all pixels (e.g. slope values) that overlap with a collection of map unit polygons. Next, draw a small sample (5% of the observations) from the population. This is analogous to the approach used by the MU summary reports: selecting pixels at a constant sampling density of approximately 1-5 points per acre.
It is clear that a very small sample can adequately describe an underlying population. A larger sample would have been required if the distribution of the population were more complex.
Lets investigate the “stability” of the median as we increase the sample size from 1 to 1,000–using the same A horizon clay content example from above. We will operationally define “stability” as the interval spanning the 5th–95th percentiles of estimated median values over 100 (sampling) replications.
After about 50 samples, we are able to estimate the population median to within approximately 1% clay content (absolute), 90% of the time (e.g. within the 5th–95th percentile interval).
The degree of autocorrelation within a population (and the effect on sampling constraints) is harder to intuit and best demonstrated by example. Consider two simulated time-series type populations generated by processes that have relatively high (0.7) and low (0.01) degrees of autocorrelation.
Clearly, more observations (e.g. a larger sample size) would be required to describe patterns in the lower figure vs. the upper figure.
Spatial patterns also exhibit varying degrees of autocorrelation (spatial autocorrelation). Spatial autocorrelation is closely related to scale: spatial processes that generate finer-scale phenomena such as drainage pattern or surface slope typically exhibit lower spatial autocorrelation than processes that generate coarse-scale phenomena such as climate. Moran’s I is one of the most common methods for quantifying the degree of spatial autocorrelation.
Consider the following simulated spatial patterns with relatively low and high degrees of spatial autocorrelation.
A larger sample would be required to describe spatial patterns in the left-hand panel vs. right-hand panel above.
Fortunately for us, map unit concepts are (ideally) crafted to constrain variation within delineations. This means that spatial autocorrelation within map unit delineations is typically much greater than for an entire survey area or MLRA. Using a typical map unit from the CA630 survey and the surface slope raster (10m DEM), the median Moran’s I within all delineations is 0.90, while the Moran’s I for the entire survey area is 0.48. The high degree of spatial autocorrelation within map units makes it possible to generate meaningful summaries of raster data using only 1-2% of the overlapping pixels.
The MU Summary / Comparison reports developed for Region 2 staff draw heavily from the concepts of statistical sampling described above. Summaries are derived from samples drawn from the population of pixels that overlap collections of map unit polygons. The optimal sampling density will vary based on the following parameters:
Ideally a sampling size is selected to minimize resource utilization (staff time, computer time, etc.) while maximizing the reliability (stability) of estimated percentiles and proportions.
Lets investigate samples derived from the following raster maps and map unit “7089”. This map unit covers the thermic band of steeply sloping hills and mountains, underlain by metavolcanic rocks. Each raster was loaded into RAM and therefore time requirements represent optimal conditions. Processing times for rasters that do not fit into RAM (e.g. regional-scale elevation or slope maps) is typically 4 to 8 times longer.
Moran’s I values were evaluated over the entire extent (bounding-box) of “7089” polygons.
Raster | Moran’s I | Grid Size (m) | Density (pixels / acre) | Processing: all pixels (seconds) | Processing: 1 pt/ac. (seconds) |
---|---|---|---|---|---|
Elevation | 0.97 | 10 | 40.000 | 40 | 2 |
Slope | 0.48 | 10 | 40.000 | 40 | 2 |
Beam Radiance | 0.14 | 30 | 4.500 | 10 | 1 |
PRISM MAAT | 0.97 | 800 | 0.006 | 1 | 1 |
Note that the use of all overlapping pixels would require approximately 40 seconds of computer time to summarize each combination of (comparable) map units and 10m raster data. Therefore, a comparison between four map units, drawing from four 10m data sources would require approximately 10 minutes. The use of sampling (at the default 1 sample / ac.) would require approximately 30 seconds. Apart from spatial data processing time, the creation of data visualizations (e.g. box and whisker or density plots) would take much longer if all overlapping pixels were used.
The median polygon area for this map unit is approximately 64 acres–at a sampling density of 1 point per acre this would result in 64 samples collected from each raster, per polygon. The 5th and 95th percentiles for polygon area are 18 and 506 acres. These might seem like small sample sizes, however, the summaries presented in the MU reports are based on samples collected from all polygons associated with a map unit; 16,856 acres (16,856 samples at 1 point per acre) in the case of our example.
Here we compare select percentiles derived from samples (at several densities) to the percentiles derived from the all pixels–the “population”.
10m elevation (meters) raster summaries at several sampling intensities.
sampling.density | Q0 | Q5 | Q25 | Q50 | Q75 | Q95 | Q100 | n | percent pixels sampled |
---|---|---|---|---|---|---|---|---|---|
0.01 pts/ac | 120.3 | 196.1 | 318.3 | 400.3 | 517.2 | 661.5 | 764.4 | 116 | 0.0 |
0.1 pts/ac | 78.7 | 193.7 | 321.1 | 405.4 | 523.0 | 673.2 | 825.7 | 1388 | 0.2 |
0.5 pts/ac | 66.5 | 195.5 | 321.4 | 404.7 | 521.8 | 673.2 | 839.4 | 7145 | 1.1 |
1 pts/ac | 65.1 | 195.0 | 320.9 | 404.4 | 521.5 | 673.1 | 841.2 | 14368 | 2.1 |
2 pts/ac | 64.1 | 195.4 | 321.4 | 404.5 | 521.7 | 673.0 | 845.3 | 28773 | 4.3 |
5 pts/ac | 63.4 | 195.5 | 321.3 | 404.4 | 521.5 | 673.0 | 845.5 | 72088 | 10.7 |
population | 63.1 | 195.4 | 321.3 | 404.4 | 521.5 | 672.9 | 847.2 | 675069 | 100.0 |
10m surface slope (percent) raster summaries at several sampling intensities.
sampling.density | Q0 | Q5 | Q25 | Q50 | Q75 | Q95 | Q100 | n | percent pixels sampled |
---|---|---|---|---|---|---|---|---|---|
0.01 pts/ac | 2 | 16.4 | 29.7 | 37.9 | 46.7 | 62.2 | 85.3 | 119 | 0.0 |
0.1 pts/ac | 0 | 16.5 | 29.8 | 37.6 | 47.3 | 64.5 | 109.2 | 1386 | 0.2 |
0.5 pts/ac | 0 | 16.6 | 29.8 | 37.8 | 47.4 | 64.5 | 127.4 | 7147 | 1.1 |
1 pts/ac | 0 | 16.8 | 29.8 | 37.9 | 47.5 | 64.6 | 132.3 | 14355 | 2.1 |
2 pts/ac | 0 | 16.7 | 29.8 | 37.8 | 47.5 | 64.6 | 142.5 | 28770 | 4.3 |
5 pts/ac | 0 | 16.6 | 29.8 | 37.8 | 47.5 | 64.5 | 146.9 | 72066 | 10.7 |
population | 0 | 16.7 | 29.8 | 37.8 | 47.4 | 64.5 | 152.5 | 675069 | 100.0 |
30m annual beam radiance (mega-Joules per square meter) raster summaries at several sampling intensities.
sampling.density | Q0 | Q5 | Q25 | Q50 | Q75 | Q95 | Q100 | n | percent pixels sampled |
---|---|---|---|---|---|---|---|---|---|
0.01 pts/ac | 34120 | 42401 | 50157 | 55152 | 61052 | 71547 | 77128 | 118 | 0 |
0.1 pts/ac | 28599 | 42057 | 50211 | 55447 | 60935 | 70587 | 78558 | 1391 | 2 |
0.5 pts/ac | 25280 | 42025 | 50235 | 55383 | 60978 | 71095 | 78841 | 7145 | 10 |
1 pts/ac | 24207 | 42025 | 50263 | 55395 | 61003 | 71080 | 79280 | 14361 | 19 |
2 pts/ac | 23550 | 42043 | 50271 | 55419 | 61031 | 71160 | 79306 | 28768 | 38 |
5 pts/ac | 22918 | 42045 | 50269 | 55418 | 61047 | 71127 | 79495 | 72072 | 96 |
population | 22918 | 42043 | 50253 | 55394 | 60991 | 71077 | 79296 | 74973 | 100 |
800m PRSIM MAAT (degrees celsius) raster summaries at several sampling intensities.
sampling.density | Q0 | Q5 | Q25 | Q50 | Q75 | Q95 | Q100 | n | percent pixels sampled |
---|---|---|---|---|---|---|---|---|---|
0.01 pts/ac | 14.7 | 15.2 | 15.7 | 15.9 | 16.2 | 16.5 | 16.6 | 118 | 41.8 |
0.1 pts/ac | 14.6 | 15.2 | 15.7 | 15.9 | 16.2 | 16.5 | 16.9 | 1399 | 496.1 |
0.5 pts/ac | 14.6 | 15.2 | 15.7 | 15.9 | 16.2 | 16.5 | 16.9 | 7148 | 2534.8 |
1 pts/ac | 14.6 | 15.2 | 15.7 | 15.9 | 16.2 | 16.5 | 16.9 | 14355 | 5090.4 |
2 pts/ac | 14.6 | 15.2 | 15.7 | 15.9 | 16.2 | 16.5 | 16.9 | 28803 | 10213.8 |
5 pts/ac | 14.6 | 15.2 | 15.7 | 15.9 | 16.2 | 16.5 | 16.9 | 72067 | 25555.7 |
population | 14.8 | 15.2 | 15.6 | 16.0 | 16.2 | 16.5 | 16.9 | 282 | 100.0 |
There is a trade-off when sampling raster data of varying resolution: an optimal sampling density at 10m (e.g. surface slope) may be excessive at 800m (e.g. PRISM data) resolution. Spatial autocorrelation will affect the selection of an optimal sampling density. For example, surface curvature will require more samples than surface slope, which in turn require more samples than elevation.
Sampling implies stochasticity–each sampling event will result in a slightly different estimate. We can test the “stability” of sample-based estimates, like the median, via simulation.
For each combination of raster layer and collection of map unit polygons:
How do we interpret the above figure? For the collection of “7089” map unit polygons:
A sampling density between 1 and 2 points per acre should be sufficient for most of the commonly used 10m to 30m data sources, and excessive for PRISM data (800m). The sampling density “sweet spot” for most combinations of map units and raster data sources is on the order of 1 point per acre.
There are two cases that require further consideration of an “optimal” sampling density: 1) map unit polygons of limited extent, and 2) map units composed of very small delineations (less than 5 acres). In these cases it would be wise to increase the sampling density to a value between 2 and 5 points per acre.