MU Summary Reports: Sampling Theory

A short justification for using some vs. all pixels

D.E. Beaudette

2021-09-24

Sampling a single map unit polygon (43 acres) at several target densities. Raster data are from a 10m slope (%) map; 1760 pixels within this polygon.

Sampling a single map unit polygon (43 acres) at several target densities. Raster data are from a 10m slope (%) map; 1760 pixels within this polygon.

This is a short demonstration of why map unit summary reports are based on a sampling of overlapping pixels (from raster data sources) instead of using all overlapping pixels.

Introduction

Nearly all statistical methods are based on the premise that meaningful descriptions of an underlying population can be derived from a subset of the population, or sample. The efficiency of a sample is (mostly) dependent on three parameters: the shape of the population distribution, the size of the sample and the degree to which values in the population are related (autocorrelation). A population with a simple distribution, such as the normal, can be characterized with a smaller sample than more complex distributions. The effect of sample size makes intuitive sense: more observations (e.g. larger sample) will give a better description of the population, regardless of distribution. Characterization of populations with a low degree of autocorrelation require relatively larger samples as compared to populations with a high degree of autocorrelation.

The population and the sample

The normal distribution, with a mean of 12% and standard deviation of 2%. Vertical lines mark the mean +/- 1 standard deviation. The normal distribution, with a mean of 12% and standard deviation of 2%. Vertical lines mark the mean +/- 1 standard deviation.

For the sake of demonstration, lets pretend that we know some details about the population of possible clay contents associated with all A horizons, within a given soil series: the distribution shape is approximately normal, has a mean of 12% and a standard deviation of 2%. We can simulate these conditions by drawing a large number (1,000) of samples from the normal distribution. Having “access” to the clay content population is analogous to collecting all pixels (e.g. slope values) that overlap with a collection of map unit polygons. Next, draw a small sample (5% of the observations) from the population. This is analogous to the approach used by the MU summary reports: selecting pixels at a constant sampling density of approximately 1-5 points per acre.

Box and whisker representation of the simulated population and associated sample. Notches represent an approximate confidence interval around the median.

Box and whisker representation of the simulated population and associated sample. Notches represent an approximate confidence interval around the median.

Density representation of the simulated population and associated sample.

Density representation of the simulated population and associated sample.

It is clear that a very small sample can adequately describe an underlying population. A larger sample would have been required if the distribution of the population were more complex.

Testing the limits

Lets investigate the “stability” of the median as we increase the sample size from 1 to 1,000–using the same A horizon clay content example from above. We will operationally define “stability” as the interval spanning the 5th–95th percentiles of estimated median values over 100 (sampling) replications.

After about 50 samples, we are able to estimate the population median to within approximately 1% clay content (absolute), 90% of the time (e.g. within the 5th–95th percentile interval).

Autocorrelation

The degree of autocorrelation within a population (and the effect on sampling constraints) is harder to intuit and best demonstrated by example. Consider two simulated time-series type populations generated by processes that have relatively high (0.7) and low (0.01) degrees of autocorrelation.

Autocorrelation: 0.7. Soil temperature measurements collected at depth typically have high degree of autocorrelation.

Autocorrelation: 0.7. Soil temperature measurements collected  at depth typically have high degree of autocorrelation.

Autocorrelation: 0.01. Air quality measurements collected over a large city typically have a lower degree of autocorrelation.

Autocorrelation: 0.01. Air quality measurements collected over a large city typically have a lower degree of autocorrelation.

Clearly, more observations (e.g. a larger sample size) would be required to describe patterns in the lower figure vs. the upper figure.

Spatial autocorrelation

Spatial patterns also exhibit varying degrees of autocorrelation (spatial autocorrelation). Spatial autocorrelation is closely related to scale: spatial processes that generate finer-scale phenomena such as drainage pattern or surface slope typically exhibit lower spatial autocorrelation than processes that generate coarse-scale phenomena such as climate. Moran’s I is one of the most common methods for quantifying the degree of spatial autocorrelation.

Consider the following simulated spatial patterns with relatively low and high degrees of spatial autocorrelation.

Surface slope maps typically resemble the scale and degree of spatial autocorrelation of the left-hand panel. The spatial properties of the right-hand panel resembe those of a climatic parameter such as mean annual air temperature.

Surface slope maps typically resemble the scale and degree of spatial autocorrelation of the left-hand panel. The spatial properties of the right-hand panel resembe those of a climatic parameter such as mean annual air temperature.

A larger sample would be required to describe spatial patterns in the left-hand panel vs. right-hand panel above.

Fortunately for us, map unit concepts are (ideally) crafted to constrain variation within delineations. This means that spatial autocorrelation within map unit delineations is typically much greater than for an entire survey area or MLRA. Using a typical map unit from the CA630 survey and the surface slope raster (10m DEM), the median Moran’s I within all delineations is 0.90, while the Moran’s I for the entire survey area is 0.48. The high degree of spatial autocorrelation within map units makes it possible to generate meaningful summaries of raster data using only 1-2% of the overlapping pixels.

Map unit summaries

The MU Summary / Comparison reports developed for Region 2 staff draw heavily from the concepts of statistical sampling described above. Summaries are derived from samples drawn from the population of pixels that overlap collections of map unit polygons. The optimal sampling density will vary based on the following parameters:

Ideally a sampling size is selected to minimize resource utilization (staff time, computer time, etc.) while maximizing the reliability (stability) of estimated percentiles and proportions.

An example from the CA630 initial soil survey

Lets investigate samples derived from the following raster maps and map unit “7089”. This map unit covers the thermic band of steeply sloping hills and mountains, underlain by metavolcanic rocks. Each raster was loaded into RAM and therefore time requirements represent optimal conditions. Processing times for rasters that do not fit into RAM (e.g. regional-scale elevation or slope maps) is typically 4 to 8 times longer.

Moran’s I values were evaluated over the entire extent (bounding-box) of “7089” polygons.

Raster Moran’s I Grid Size (m) Density (pixels / acre) Processing: all pixels (seconds) Processing: 1 pt/ac. (seconds)
Elevation 0.97 10 40.000 40 2
Slope 0.48 10 40.000 40 2
Beam Radiance 0.14 30 4.500 10 1
PRISM MAAT 0.97 800 0.006 1 1

Note that the use of all overlapping pixels would require approximately 40 seconds of computer time to summarize each combination of (comparable) map units and 10m raster data. Therefore, a comparison between four map units, drawing from four 10m data sources would require approximately 10 minutes. The use of sampling (at the default 1 sample / ac.) would require approximately 30 seconds. Apart from spatial data processing time, the creation of data visualizations (e.g. box and whisker or density plots) would take much longer if all overlapping pixels were used.

Polygon area distribution for map unit “7089”.

Polygon area distribution for map unit "7089".

The median polygon area for this map unit is approximately 64 acres–at a sampling density of 1 point per acre this would result in 64 samples collected from each raster, per polygon. The 5th and 95th percentiles for polygon area are 18 and 506 acres. These might seem like small sample sizes, however, the summaries presented in the MU reports are based on samples collected from all polygons associated with a map unit; 16,856 acres (16,856 samples at 1 point per acre) in the case of our example.

Results: sampling density and population

Here we compare select percentiles derived from samples (at several densities) to the percentiles derived from the all pixels–the “population”.

10m elevation (meters) raster summaries at several sampling intensities.

sampling.density Q0 Q5 Q25 Q50 Q75 Q95 Q100 n percent pixels sampled
0.01 pts/ac 120.3 196.1 318.3 400.3 517.2 661.5 764.4 116 0.0
0.1 pts/ac 78.7 193.7 321.1 405.4 523.0 673.2 825.7 1388 0.2
0.5 pts/ac 66.5 195.5 321.4 404.7 521.8 673.2 839.4 7145 1.1
1 pts/ac 65.1 195.0 320.9 404.4 521.5 673.1 841.2 14368 2.1
2 pts/ac 64.1 195.4 321.4 404.5 521.7 673.0 845.3 28773 4.3
5 pts/ac 63.4 195.5 321.3 404.4 521.5 673.0 845.5 72088 10.7
population 63.1 195.4 321.3 404.4 521.5 672.9 847.2 675069 100.0

10m surface slope (percent) raster summaries at several sampling intensities.

sampling.density Q0 Q5 Q25 Q50 Q75 Q95 Q100 n percent pixels sampled
0.01 pts/ac 2 16.4 29.7 37.9 46.7 62.2 85.3 119 0.0
0.1 pts/ac 0 16.5 29.8 37.6 47.3 64.5 109.2 1386 0.2
0.5 pts/ac 0 16.6 29.8 37.8 47.4 64.5 127.4 7147 1.1
1 pts/ac 0 16.8 29.8 37.9 47.5 64.6 132.3 14355 2.1
2 pts/ac 0 16.7 29.8 37.8 47.5 64.6 142.5 28770 4.3
5 pts/ac 0 16.6 29.8 37.8 47.5 64.5 146.9 72066 10.7
population 0 16.7 29.8 37.8 47.4 64.5 152.5 675069 100.0

30m annual beam radiance (mega-Joules per square meter) raster summaries at several sampling intensities.

sampling.density Q0 Q5 Q25 Q50 Q75 Q95 Q100 n percent pixels sampled
0.01 pts/ac 34120 42401 50157 55152 61052 71547 77128 118 0
0.1 pts/ac 28599 42057 50211 55447 60935 70587 78558 1391 2
0.5 pts/ac 25280 42025 50235 55383 60978 71095 78841 7145 10
1 pts/ac 24207 42025 50263 55395 61003 71080 79280 14361 19
2 pts/ac 23550 42043 50271 55419 61031 71160 79306 28768 38
5 pts/ac 22918 42045 50269 55418 61047 71127 79495 72072 96
population 22918 42043 50253 55394 60991 71077 79296 74973 100

800m PRSIM MAAT (degrees celsius) raster summaries at several sampling intensities.

sampling.density Q0 Q5 Q25 Q50 Q75 Q95 Q100 n percent pixels sampled
0.01 pts/ac 14.7 15.2 15.7 15.9 16.2 16.5 16.6 118 41.8
0.1 pts/ac 14.6 15.2 15.7 15.9 16.2 16.5 16.9 1399 496.1
0.5 pts/ac 14.6 15.2 15.7 15.9 16.2 16.5 16.9 7148 2534.8
1 pts/ac 14.6 15.2 15.7 15.9 16.2 16.5 16.9 14355 5090.4
2 pts/ac 14.6 15.2 15.7 15.9 16.2 16.5 16.9 28803 10213.8
5 pts/ac 14.6 15.2 15.7 15.9 16.2 16.5 16.9 72067 25555.7
population 14.8 15.2 15.6 16.0 16.2 16.5 16.9 282 100.0

There is a trade-off when sampling raster data of varying resolution: an optimal sampling density at 10m (e.g. surface slope) may be excessive at 800m (e.g. PRISM data) resolution. Spatial autocorrelation will affect the selection of an optimal sampling density. For example, surface curvature will require more samples than surface slope, which in turn require more samples than elevation.

Stability of the estimated median

Sampling implies stochasticity–each sampling event will result in a slightly different estimate. We can test the “stability” of sample-based estimates, like the median, via simulation.

For each combination of raster layer and collection of map unit polygons:

  1. collect values from raster using a fixed sampling density
  2. estimate the median of these values
  3. repeat steps 1 and 2 many times (10 times in our example)
  4. compute with width of the 5th–95th percentile interval
  5. scale the resulting interval by the population median
  6. increase sampling density and repeat steps 1 through 5

Recall that “stability” is the width of the 5th–95th percentile interval of median estimates. Scaling this interval by the population median puts the values into practical context.

Recall that "stability" is the width of the 5th--95th percentile interval of median estimates. Scaling this interval by the population median puts the values into practical context.

How do we interpret the above figure? For the collection of “7089” map unit polygons:

Summary

A sampling density between 1 and 2 points per acre should be sufficient for most of the commonly used 10m to 30m data sources, and excessive for PRISM data (800m). The sampling density “sweet spot” for most combinations of map units and raster data sources is on the order of 1 point per acre.

There are two cases that require further consideration of an “optimal” sampling density: 1) map unit polygons of limited extent, and 2) map units composed of very small delineations (less than 5 acres). In these cases it would be wise to increase the sampling density to a value between 2 and 5 points per acre.

Test out the “stability” of different sampling density values using your own map units and raster data sources.