This is a short demonstration of why map unit summary reports are based on a sampling of overlapping pixels (from raster data sources) instead of using all overlapping pixels.

Introduction

Nearly all statistical methods are based on the premise that meaningful descriptions of an underlying population can be derived from a subset of the population, or sample. The efficiency of a sample is (mostly) dependent on three parameters: the shape of the population distribution, the size of the sample and the degree to which values in the population are related (autocorrelation). A population with a simple distribution, such as the normal, can be characterized with a smaller sample than more complex distributions. The effect of sample size makes intuitive sense: more observations (e.g. larger sample) will give a better description of the population, regardless of distribution. Characterization of populations with a low degree of autocorrelation require relatively larger samples as compared to populations with a high degree of autocorrelation.

The population and the sample

The normal distribution, with a mean of 12% and standard deviation of 2%. Vertical lines mark the mean +/- 1 standard deviation.

For the sake of demonstration, lets pretend that we know some details about the population of possible clay contents associated with all A horizons, within a given soil series: the distribution shape is approximately normal, has a mean of 12% and a standard deviation of 2%. We can simulate these conditions by drawing a large number (1,000) of samples from the normal distribution. Having “access” to the clay content population is analogous to collecting all pixels (e.g. slope values) that overlap with a collection of map unit polygons. Next, draw a small sample (5% of the observations) from the population. This is analogous to the approach used by the MU summary reports: selecting pixels at a constant sampling density of approximately 1-5 points per acre.

Box and whisker representation of the simulated population and associated sample. Notches represent an approximate confidence interval around the median.

Density representation of the simulated population and associated sample.

It is clear that a very small sample can adequately describe an underlying population. A larger sample would have been required if the distribution of the population were more complex.

Testing the limits

Lets investigate the “stability” of the median as we increase the sample size from 1 to 1,000–using the same A horizon clay content example from above. We will operationally define “stability” as the interval spanning the 5th–95th percentiles of estimated median values over 100 (sampling) replications.

After about 50 samples, we are able to estimate the population median to within approximately 1% clay content (absolute), 90% of the time (e.g. within the 5th–95th percentile interval).

Autocorrelation

The degree of autocorrelation within a population (and the effect on sampling constraints) is harder to intuit and best demonstrated by example. Consider two simulated time-series type populations generated by processes that have relatively high (0.7) and low (0.01) degrees of autocorrelation.

Autocorrelation: 0.7. Soil temperature measurements collected at depth typically have high degree of autocorrelation.

Autocorrelation: 0.01. Air quality measurements collected over a large city typically have a lower degree of autocorrelation.

Clearly, more observations (e.g. a larger sample size) would be required to describe patterns in the lower figure vs. the upper figure.

Spatial autocorrelation

Spatial patterns also exhibit varying degrees of autocorrelation (spatial autocorrelation). Spatial autocorrelation is closely related to scale: spatial processes that generate finer-scale phenomena such as drainage pattern or surface slope typically exhibit lower spatial autocorrelation than processes that generate coarse-scale phenomena such as climate. Moran’s I is one of the most common methods for quantifying the degree of spatial autocorrelation.

Consider the following simulated spatial patterns with relatively low and high degrees of spatial autocorrelation.

Surface slope maps typically resemble the scale and degree of spatial autocorrelation of the left-hand panel. The spatial properties of the right-hand panel resembe those of a climatic parameter such as mean annual air temperature.

A larger sample would be required to describe spatial patterns in the left-hand panel vs. right-hand panel above.

Fortunately for us, map unit concepts are (ideally) crafted to constrain variation within delineations. This means that spatial autocorrelation within map unit delineations is typically much greater than for an entire survey area or MLRA. Using a typical map unit from the CA630 survey and the surface slope raster (10m DEM), the median Moran’s I within all delineations is 0.90, while the Moran’s I for the entire survey area is 0.48. The high degree of spatial autocorrelation within map units makes it possible to generate meaningful summaries of raster data using only 1-2% of the overlapping pixels.

Map unit summaries

The MU Summary / Comparison reports developed for Region 2 staff draw heavily from the concepts of statistical sampling described above. Summaries are derived from samples drawn from the population of pixels that overlap collections of map unit polygons. The optimal sampling density will vary based on the following parameters:

raster cell size and extent: large rasters will take longer to sample, especially if the file cannot fit into available memory
resources: a larger sample size requires more time and memory (RAM)
autocorrelation / spatial complexity: larger autocorrelation requires relatively smaller sample sizes
map unit size / scale: smaller, more detailed map units will require more samples to effectively characterize

Ideally a sampling size is selected to minimize resource utilization (staff time, computer time, etc.) while maximizing the reliability (stability) of estimated percentiles and proportions.

An example from the CA630 initial soil survey

Lets investigate samples derived from the following raster maps and map unit “7089”. This map unit covers the thermic band of steeply sloping hills and mountains, underlain by metavolcanic rocks. Each raster was loaded into RAM and therefore time requirements represent optimal conditions. Processing times for rasters that do not fit into RAM (e.g. regional-scale elevation or slope maps) is typically 4 to 8 times longer.

Moran’s I values were evaluated over the entire extent (bounding-box) of “7089” polygons.

Raster	Moran’s I	Grid Size (m)	Density (pixels / acre)	Processing: all pixels (seconds)	Processing: 1 pt/ac. (seconds)
Elevation	0.97	10	40.000	40	2
Slope	0.48	10	40.000	40	2
Beam Radiance	0.14	30	4.500	10	1
PRISM MAAT	0.97	800	0.006	1	1

Note that the use of all overlapping pixels would require approximately 40 seconds of computer time to summarize each combination of (comparable) map units and 10m raster data. Therefore, a comparison between four map units, drawing from four 10m data sources would require approximately 10 minutes. The use of sampling (at the default 1 sample / ac.) would require approximately 30 seconds. Apart from spatial data processing time, the creation of data visualizations (e.g. box and whisker or density plots) would take much longer if all overlapping pixels were used.

Polygon area distribution for map unit “7089”.

The median polygon area for this map unit is approximately 64 acres–at a sampling density of 1 point per acre this would result in 64 samples collected from each raster, per polygon. The 5th and 95th percentiles for polygon area are 18 and 506 acres. These might seem like small sample sizes, however, the summaries presented in the MU reports are based on samples collected from all polygons associated with a map unit; 16,856 acres (16,856 samples at 1 point per acre) in the case of our example.

Results: sampling density and population

Here we compare select percentiles derived from samples (at several densities) to the percentiles derived from the all pixels–the “population”.

10m elevation (meters) raster summaries at several sampling intensities.

sampling.density	Q0	Q5	Q25	Q50	Q75	Q95	Q100	n	percent pixels sampled
0.01 pts/ac	120.3	196.1	318.3	400.3	517.2	661.5	764.4	116	0.0
0.1 pts/ac	78.7	193.7	321.1	405.4	523.0	673.2	825.7	1388	0.2
0.5 pts/ac	66.5	195.5	321.4	404.7	521.8	673.2	839.4	7145	1.1
1 pts/ac	65.1	195.0	320.9	404.4	521.5	673.1	841.2	14368	2.1
2 pts/ac	64.1	195.4	321.4	404.5	521.7	673.0	845.3	28773	4.3
5 pts/ac	63.4	195.5	321.3	404.4	521.5	673.0	845.5	72088	10.7
population	63.1	195.4	321.3	404.4	521.5	672.9	847.2	675069	100.0

10m surface slope (percent) raster summaries at several sampling intensities.

sampling.density	Q0	Q5	Q25	Q50	Q75	Q95	Q100	n	percent pixels sampled
0.01 pts/ac	2	16.4	29.7	37.9	46.7	62.2	85.3	119	0.0
0.1 pts/ac	0	16.5	29.8	37.6	47.3	64.5	109.2	1386	0.2
0.5 pts/ac	0	16.6	29.8	37.8	47.4	64.5	127.4	7147	1.1
1 pts/ac	0	16.8	29.8	37.9	47.5	64.6	132.3	14355	2.1
2 pts/ac	0	16.7	29.8	37.8	47.5	64.6	142.5	28770	4.3
5 pts/ac	0	16.6	29.8	37.8	47.5	64.5	146.9	72066	10.7
population	0	16.7	29.8	37.8	47.4	64.5	152.5	675069	100.0

30m annual beam radiance (mega-Joules per square meter) raster summaries at several sampling intensities.

sampling.density	Q0	Q5	Q25	Q50	Q75	Q95	Q100	n	percent pixels sampled
0.01 pts/ac	34120	42401	50157	55152	61052	71547	77128	118	0
0.1 pts/ac	28599	42057	50211	55447	60935	70587	78558	1391	2
0.5 pts/ac	25280	42025	50235	55383	60978	71095	78841	7145	10
1 pts/ac	24207	42025	50263	55395	61003	71080	79280	14361	19
2 pts/ac	23550	42043	50271	55419	61031	71160	79306	28768	38
5 pts/ac	22918	42045	50269	55418	61047	71127	79495	72072	96
population	22918	42043	50253	55394	60991	71077	79296	74973	100

800m PRSIM MAAT (degrees celsius) raster summaries at several sampling intensities.

sampling.density	Q0	Q5	Q25	Q50	Q75	Q95	Q100	n	percent pixels sampled
0.01 pts/ac	14.7	15.2	15.7	15.9	16.2	16.5	16.6	118	41.8
0.1 pts/ac	14.6	15.2	15.7	15.9	16.2	16.5	16.9	1399	496.1
0.5 pts/ac	14.6	15.2	15.7	15.9	16.2	16.5	16.9	7148	2534.8
1 pts/ac	14.6	15.2	15.7	15.9	16.2	16.5	16.9	14355	5090.4
2 pts/ac	14.6	15.2	15.7	15.9	16.2	16.5	16.9	28803	10213.8
5 pts/ac	14.6	15.2	15.7	15.9	16.2	16.5	16.9	72067	25555.7
population	14.8	15.2	15.6	16.0	16.2	16.5	16.9	282	100.0

There is a trade-off when sampling raster data of varying resolution: an optimal sampling density at 10m (e.g. surface slope) may be excessive at 800m (e.g. PRISM data) resolution. Spatial autocorrelation will affect the selection of an optimal sampling density. For example, surface curvature will require more samples than surface slope, which in turn require more samples than elevation.

Stability of the estimated median

Sampling implies stochasticity–each sampling event will result in a slightly different estimate. We can test the “stability” of sample-based estimates, like the median, via simulation.

For each combination of raster layer and collection of map unit polygons:

collect values from raster using a fixed sampling density
estimate the median of these values
repeat steps 1 and 2 many times (10 times in our example)
compute with width of the 5th–95th percentile interval
scale the resulting interval by the population median
increase sampling density and repeat steps 1 through 5

Recall that “stability” is the width of the 5th–95th percentile interval of median estimates. Scaling this interval by the population median puts the values into practical context.

How do we interpret the above figure? For the collection of “7089” map unit polygons:

Elevation: at 0.5 pts/ac. the estimated median is within 1% (4 meters) of the population median 90% of the time.
Slope: at 1 pts/ac. the estimated median is within 1% (0.3 % slope) of the population median 90% of the time.
Beam Radiance: at 0.1 pts/ac. the estimated median less than 1% (385 MJ/sq. meter) of the population median 90% of the time.
PRSIM MAAT: at 0.01 pts/ac. the estimated median is less than 1% (0.09 deg. C) of the population median 90% of the time.

Summary

A sampling density between 1 and 2 points per acre should be sufficient for most of the commonly used 10m to 30m data sources, and excessive for PRISM data (800m). The sampling density “sweet spot” for most combinations of map units and raster data sources is on the order of 1 point per acre.

There are two cases that require further consideration of an “optimal” sampling density: 1) map unit polygons of limited extent, and 2) map units composed of very small delineations (less than 5 acres). In these cases it would be wise to increase the sampling density to a value between 2 and 5 points per acre.

Test out the “stability” of different sampling density values using your own map units and raster data sources.

MU Summary Reports: Sampling Theory

A short justification for using some vs. all pixels

D.E. Beaudette

2021-09-24