Clustering of Image Colors

Introduction

This exercise demonstrates several aspects of numerical taxonomy as applied to a three-band image, in this case a photograph. A similar workflow could be applied to a stack of DEM and climate surface derivatives. After working through this exercise, use a different photograph of approximately the same image dimensions.

Background

Paraphrased from Numerical Taxonomy (Sneath and Sokal, 1973).

What is a good classification?
Is classification “A” better than classification “B”?
How similar are two classifications?
Is there significant structure in a set of operational taxonomic units (pixels, soil profiles, climate stations, etc.)?
Do two or more clusters, delineated by prior procedure, differ in their newly observed properties?
Do two or more clusters, delineated by numerical taxonomy, differ in any of the properties on which they were clustered?

Objectives

In this example we will apply ideas from the Numerical Taxonomy / Ordination module to a photograph. These methods can also be applied to a stack of terrain/climate derivatives for the purposes of initial or update soil survey work. The basic work-flow looks something like this:

load stack of conformal grids: same coordinate reference system, grid size, grid extent
exhaustive sampling of grids (random, regular, etc.) at points
sub-sample points via cLHS
compute distance matrix
apply clustering algorithms
visualize: dendrogram (if possible), nMDS, etc.
evaluate optimal number of clusters for PAM-style clustering (silhouette width scree plot)
evaluate goodness of nMDS solution (stress scree plot and Shepard diagram)
fit simple model (LDA, etc.) to predict cluster ID at every pixel in the stack
make predictions: class membership and most-likely class
visualize predictions as an image
evaluate class-wise uncertainty with Shannon entropy
estimate nMDS scores for each pixel of the example image

Several parts of this demonstration depend on functions that have a non-deterministic outcome:

cLHS-sampling via clhs::clhs()
nMDS via MASS::sammon()
partitioning around medoids clustering via cluster::pam()

In order to create repeatable results, the following examples use the set.seed() function to control randomness. This can be a useful trick when it is neccessary to generate the same output each time a report or analysis is run. However, one should not use set.seed() when a truly random outcome is required.

Setup

Load packages that we will be using.

library(aqp)
library(terra)
library(cluster)
library(ape)
library(farver)
library(MASS)
library(colorspace)
library(viridisLite)
library(clhs)

Use an image with well-defined colors to demonstrate methods / syntax.

# I like airplanes
# URL to example image, you can swap out for any image you like
# be sure to coordinate the image URL and file extension
url <- 'https://github.com/ncss-tech/stats_for_soil_survey/raw/master/exercises/numerical-taxonomy/photo.jpg'
tf <- tempfile(fileext = '.jpg')

# download, this may fail if you are on the VPN
download.file(url, destfile = tf, mode = 'wb')

# or, point to local file that can be read by GDAL:
# tf <- 'photo.jpg'

# load as 3-band SpatRast object
# bands: red, green, blue (sRGB color space)
r <- rast(tf)

# optionally, use your own image or stack of GIS data
# r <- rast(...)

Inspect the resulting SpatRast object.

# re-name bands for access later
names(r) <- c('r', 'g', 'b')

# object properties
print(r)

## class       : SpatRaster 
## dimensions  : 640, 628, 3  (nrow, ncol, nlyr)
## resolution  : 1, 1  (x, y)
## extent      : 0, 628, 0, 640  (xmin, xmax, ymin, ymax)
## coord. ref. :  
## source      : file44043044313.jpg 
## colors RGB  : 1, 2, 3 
## names       : r, g, b

Display 3-band stack as an RGB image.

plotRGB(r)

Example image, a red model airplane leaning on the propellor, against a background of patchy green and light green grass.

Color Conversion

Convert to CIELAB color space, much more useful for analysis of color. I’ll abbreviate CIELAB as LAB for simplicity.

# copy SpatRast / update with LAB color coordinates
r.lab <- r

# simple trick to directly access / replace values of a SpatRast
values(r.lab) <- convert_colour(values(r), from = 'rgb', to = 'lab', white_from = 'D65')
names(r.lab) <- c('l', 'a', 'b')

# check
# bands are L, A, B color space coordinates
# positive "A" coordinates are "red"
plot(r.lab, col = mako(25), axes = FALSE)

Example image split into CIELAB color coordinates.

Sampling

Exhaustive sampling, extracting LAB color coordinates.

s <- spatSample(r.lab, size = 5000, method = 'regular', as.points = TRUE)

Sub-sample using cLHS algorithm. Much easier to compute distance matrix from a small, representative sample.

# note temporary conversion to sf object
# required by clhs()
# result is an integer index to samples
set.seed(42)
idx <- clhs(sf::st_as_sf(s), size = 250)

Visual check on cLHS-selected samples (blue).

plotRGB(r, axes = FALSE)
# points(s, pch = 3, col = 'white', cex = 0.25)
points(s[idx, ], pch = 1, col = 'blue', cex = 1.5)

Example image, with clhs-defined sampling points.

Note that that there are very few samples covering the (black) windows and wheels.

Visualize color space coordinates.

# keep only the cLHS samples
s <- s[idx, ]

# convert SpatVect -> data.frame
s.lab <- as.data.frame(s)

## convert CIE LAB values extracted at smples -> sRGB
s.rgb <- convert_colour(s.lab[, c('l', 'a', 'b')], from = 'lab', to = 'rgb', white_from = 'D65')

# convert to R colors, hex notation
s.cols <- rgb(s.rgb, maxColorValue = 255)


# helper function for paired scatterplot,
# see ?pairs for further examples
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) {
      cex.cor <- 0.8 / strwidth(txt)
    }
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

# 6-variable scatter plot matrix
# CIELAB color space coordinates
# sRGB color space coordinates
pairs(
  cbind(
    s.lab[, c('l', 'a', 'b')],
    s.rgb[, c('r', 'g', 'b')]
  ),
  pch = 16, 
  col = s.cols,
  labels = c('CIE L', 'CIE A', 'CIE B', 'sRGB red', 'sRGB green', 'sRGB blue'),
  upper.panel = panel.cor,
  gap = 0
)

Scatter plot matrix of CIELAB and sRGB color space coordinates, at sampling points, symbolized using image colors. Pearson correlation coefficients are printed in the upper triangle of the matrix.

Distance Matrix

Compute pair-wise distances via ΔE₀₀ color contrast metric.

## compute pair-wise distances
# using the CIE2000 color contrast metric
# perceptual (avg. human vision) vs. Euclidean distance
d <- compare_colour(
  from = s.lab[, c('l', 'a', 'b')], 
  to = s.lab[, c('l', 'a', 'b')], 
  from_space = 'lab', 
  method = 'CIE2000'
)

# convert full-matrix form to simpler 'dist' representation
d <- as.dist(d)

# check for 0 distances, can cause problems for nMDS
summary(d)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2067 16.2310 30.4435 31.3316 46.1968 85.4657

# if there are any, set those to a small value
d[which(d == 0)] <- quantile(d[which(d > 0)], prob = 0.001, na.rm = TRUE)

Ordination by nMDS

There are many implementations of non-metric multidimensional scaling; Sammon’s Non-Linear Mapping is a good place to start.

Recall that an ordination is a projection of the high-dimensionality distance matrix to a small number of dimensions, usually 2 or 3. This projection incurs a loss of fidelity. A scree plot is a qualitative way to visually determine a reasonable number of dimensions required for a given distance matrix. Look for the “elbow” in the scree plot for a reasonable starting point.

.seq <- 1:6
.stress <- sapply(.seq, function(i) {
  .res <- sammon(d, k = i)$stress
})

plot(x = .seq, y = .stress, type = "b", las = 1, ylab = "Stress", xlab = 'Number of Dimensions', main = "Sammon's Non-Linear Mapping (nMDS)")

Scree plot of nMDS stress values as a function of ordination rank.

In this case, it would appear that 2 dimensions are sufficient.

Generate an nMDS solution for our distance matrix. Output is always slightly different, unless controlling the randomness via set.seed().

# "fixed" randomness, 
# only important when you want the same result every time
set.seed(42)

mds <- sammon(d)

## Initial stress        : 0.01512
## stress after  10 iters: 0.00610, magic = 0.500
## stress after  20 iters: 0.00599, magic = 0.500
## stress after  30 iters: 0.00596, magic = 0.500

# check resulting structure
str(mds)

## List of 3
##  $ points: num [1:250, 1:2] -35.72 -35.93 -35.42 -18.22 -3.58 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##  $ stress: num 0.00596
##  $ call  : language sammon(d = d)

A scatter plot of nMDS scores, with symbols defined by the original image colors at sample points. Dashed lines show nMDS solution origin at (0,0).

# simple visualization of nMDS ordination
# colors from sRGB color coordinates, converted to hex notation
par(mar = c(2, 2, 1, 1))
plot(mds$points, pch = 16, col = s.cols, cex = 2, axes = FALSE)
# add horizontal / vertical lines to see "origin" of nMDS coordinates
abline(h = 0, v = 0, lty = 3)
box()
mtext('nMDS Axis 1', side = 1, line = 0.5, font = 2)
mtext('nMDS Axis 2', side = 2, line = 0.5, font = 2)

Ordination by nMDS.

Shepard plot. For more options, see related functions from the vegan package.

# generate data for Shepard Diagram
shep <- Shepard(d, mds$points) 

plot(shep, pch = ".", asp = 1, xlab = 'Original Distance', ylab = 'nMDS Distance', main = 'Shepard Diagram\n2 dimensions', las = 1)
lines(shep$x, shep$yf, type = "S", col = 2, lwd = 2)
abline(0, 1, col = 4, lty = 2, lwd = 2)

Shepard plot for ordination above. Values close to the 1:1 line (dashed blue line) indicate a reasonable ordination solution.

See nMDS-eval-01 for additional examples.

Cluster Analysis

Divisive Hierarchical Clustering

Note that the hclust object is often used as a common format when converting from one object class to another. For example, converting the divisive hierarchical clustering object created by diana() to phylo class requires a trip through the hclust format.

# divisive hierarchical clustering of distance matrix
dd <- diana(d)

# convert cluster pkg object -> hclust obj. -> phylo obj.
h <- as.phylo(as.hclust(dd))

# attempt rotation of the dendrogram leaves
# using rank-order from the first nMDS axis above
h <- rotateConstr(h, constraint = order(mds$points[, 1]))

Using the plot method for phylo class objects, as provided by the ape package.

par(mar = c(0, 0, 2, 0))

plot(
  h, 
  show.tip.label = FALSE, 
  direct = 'down', 
  type = 'phylogram', 
  y.lim = c(0, 50),
  main = sprintf("Divisive Coef: %s", round(dd$dc, 2))
)

# make a block of tip label colors
# using repeated calls, varying vertical offsets
tiplabels(pch = 15, col = s.cols, offset = 1.5)
tiplabels(pch = 15, col = s.cols, offset = 2.5)
tiplabels(pch = 15, col = s.cols, offset = 3.5)

Dendrogram representation of pair-wise distances, after divisive hierarchical clustering.

K-medoids Clustering

Look for the “elbow” in the scree plot of average sill width vs. number of clusters. Search for the lowest possible average sill width and number of clusters.

# sequence of candidate clusters
.seq <- 2:8
# iterate over requested clusters
# retain avg. sill width metric
.silwidth <- sapply(.seq, function(i) {
  pam(d, k = i, diss = TRUE)$silinfo$avg.width
})

plot(x = .seq, y = .silwidth, type = "b", las = 1, ylab = "Average Sill Width", xlab = 'Number of Clusters', main = "Partitioning Around Medoids")

Scree plot of average sill width vs. number of clusters.

Five clusters seems about right.

# partitioning around medoids
# using the distance matrix vs. data matrix (diss = TRUE)
set.seed(42)
cl <- pam(d, k = 5, diss = TRUE)

# save clustering vector, these are integers
# interpret as categories
s.lab$cl <- factor(cl$clustering)

# plot method for partioning-style clustering (cluster package)
clusplot(cl, col.p = s.cols, plotchar = FALSE, col.clus = 'black')

Cluster arrangement as visualized on axes of first and second principal components.

Extract medoid colors in sRGB color space (for viewing on the screen).

# cl$medoids is an index to the original data matrix
# will use these later
medoid.cols <- rgb(s.rgb[cl$medoids, ], maxColorValue = 255)

# quick check on medoid colors
par(mar = c(0, 0, 0, 0))
# using aqp::soilPalette() for convenience
soilPalette(medoid.cols, lab = 1:length(medoid.cols))

Color swatches of medoid colors.

Visualize PAM clustering structure via previous ordination solution.

# simple visualization of nMDS ordination
# colors based on PAM medoid colors
par(mar = c(2, 2, 1, 1))
plot(mds$points, axes = FALSE, type = 'n')
text(mds$points, labels = cl$clustering, col = medoid.cols[cl$clustering], cex = 1, font = 2)
# add horizontal / vertical lines to see "origin" of nMDS coordinates
abline(h = 0, v = 0, lty = 3)
box()
mtext('nMDS Axis 1', side = 1, line = 0.5, font = 2)
mtext('nMDS Axis 2', side = 2, line = 0.5, font = 2)

$PAM clusters arranged by nMDS ordination.$

PAM clusters arranged by nMDS ordination.

Supervised Classification

Supervised classification to interpolate between point samples, using linear discriminant analysis.

# linear discriminant analysis (LDA) model works well here
# simple and efficient
m <- lda(cl ~ l + a + b, data = s.lab)

Demonstrate prediction output from lda object using some of the training data. The result is a list:

class is the most likely class, as a factor
posterior are the class-wise probabilities, as a matrix
x are the LDA “scores” (not used here)

# use only the first 6 rows of data
p <- predict(m, newdata = s.lab[1:6, ])

p$class

## [1] 1 1 1 2 3 3
## Levels: 1 2 3 4 5

round(p$posterior, 3)

##       1     2     3     4 5
## 1 1.000 0.000 0.000 0.000 0
## 2 1.000 0.000 0.000 0.000 0
## 3 1.000 0.000 0.000 0.000 0
## 4 0.295 0.702 0.000 0.002 0
## 5 0.000 0.010 0.815 0.175 0
## 6 0.000 0.000 0.976 0.024 0

Linear discriminant function scatter plot matrix.

plot(m, col = s.cols, fig.alt='Scatter plot matrix of linear discriminant scores.', fig.cap='Scatter plot matrix of linear discriminant scores.')

Wrapper functions around predict method used by LDA models, as applied to multi-band spatRaster objects.

# prediction for most-likely class
pred.class <- function(model, data) {
  predict(model, data)$class
}

# class probabilities
pred.pr <- function(model, data) {
  predict(model, data)$posterior
}

Predict over entire stack (3-band) of LAB color coordinates.

# most-likely cluster
p.class <- predict(r.lab, m, fun = pred.class)

# cluster probabilities
p.pr <- predict(r.lab, m, fun = pred.pr)

Cluster Images (Maps)

Most-likely cluster (by pixel), followed by individual cluster probabilities.

par(mfrow = c(2, 3))

# most-likely cluster
plot(p.class, col = medoid.cols, axes = FALSE, main = 'Most-likely Cluster ', mar = c(0.5, 0.1, 2.5, 2))

for(i in 1:5) {
  
  # smooth color ramp
  # from white -> medoid color i
  cr <- colorRampPalette(c('white', medoid.cols[i]), space = 'Lab')
  
  plot(
    p.pr[[i]], 
    col = cr(10), 
    axes = FALSE, 
    mar = c(0.5, 0.1, 2.5, 2),
    main = sprintf("Cluster %s", i)
  )
}

Clustering vector and class-wise probabilities for all pixels of the example image.

Shannon entropy (by pixel). Higher values suggest more uncertainty within predictions.

# apply aqp::shannonEntropy() function at each stack of pixels
# stack represents predicted class-wise probability
H <- app(p.pr, shannonEntropy)

plot(H, col = viridis(25), axes = FALSE, main = 'Shannon Entropy (log base 2)')

nMDS Images (Maps)

Using multiple linear regression, we can estimate the nMDS scores for all pixels in the 3-band image.

# row-order is preserved
# transfer nMDS scores to source data
# 250 sub-samples
s.lab$mds1 <- mds$points[, 1]
s.lab$mds2 <- mds$points[, 2]

# multiple linear regression of nMDS scores
# via CIE LAB color coordinates
lm.mds1 <- lm(mds1 ~ l + a + b, data = s.lab)
lm.mds2 <- lm(mds2 ~ l + a + b, data = s.lab)

# internal validation metrics, looks OK
summary(lm.mds1)

## 
## Call:
## lm(formula = mds1 ~ l + a + b, data = s.lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6800 -1.1275  0.2431  1.3296 10.5498 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -31.356258   0.534929  -58.62   <2e-16 ***
## l             0.511635   0.008601   59.48   <2e-16 ***
## a            -0.542203   0.005250 -103.27   <2e-16 ***
## b             0.229445   0.008698   26.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.297 on 246 degrees of freedom
## Multiple R-squared:  0.9872, Adjusted R-squared:  0.987 
## F-statistic:  6318 on 3 and 246 DF,  p-value: < 2.2e-16

summary(lm.mds2)

## 
## Call:
## lm(formula = mds2 ~ l + a + b, data = s.lab)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.2250 -0.9468  0.1316  1.2130 11.4810 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.273817   0.486844   78.62   <2e-16 ***
## l           -0.774179   0.007828  -98.90   <2e-16 ***
## a           -0.366447   0.004778  -76.69   <2e-16 ***
## b            0.251507   0.007916   31.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.09 on 246 degrees of freedom
## Multiple R-squared:  0.9812, Adjusted R-squared:  0.981 
## F-statistic:  4291 on 3 and 246 DF,  p-value: < 2.2e-16

# spatial prediction
p.mds1 <- predict(r.lab, lm.mds1)
p.mds2 <- predict(r.lab, lm.mds2)

Attempt a simple interpretation of nMDS scores (axes).

# colors
mds1.cols <- colorRampPalette(c('red', 'green'), space = 'Lab')(25)
mds2.cols <- rev(grey.colors(25))

# maps
par(mfcol = c(1, 2))
plot(p.mds1, col = mds1.cols, axes = FALSE, main = 'nMDS Axis 1', mar = c(2, 1, 2, 3))
mtext(text = '"red \U2192 green"', side = 1, line = 0.5)

plot(p.mds2, col = mds2.cols, axes = FALSE, main = 'nMDS Axis 2', mar = c(2, 1, 2, 3))
mtext(text = '"light \U2192 dark"', side = 1, line = 0.5)

Images of nMDS scores, estimated at each pixel of the example image.