How to classify a GeoTIFF file using NumPy

Dr. Huidae Cho
Institute for Environmental and Spatial Analysis...University of North Georgia

1   Introduction

Try these exercises first:

In this cell extraction exercise, we extracted cell values as 1-dimensional arrays, but they are not maps and lost location information. Now, let’s try to classify a GeoTIFF file into multiple classes and retain cell locations as 2-dimensional arrays. Use elevation.tif.

classified-elevation.png

2   Python code

from osgeo import gdal
import numpy as np
import matplotlib.pyplot as plt

def read_geotiff(filename):
    ds = gdal.Open(filename)
    band = ds.GetRasterBand(1)
    arr = band.ReadAsArray()
    del ds, band
    return arr

elev_arr = read_geotiff("elevation.tif")
elev_mean = elev_arr.mean()
elev_std = elev_arr.std()

# use np.where() to extract 2-dimensional subsets of elev_arr
low_arr = np.where(elev_arr < elev_mean-elev_std, elev_arr, np.nan)
middle_arr = np.where((elev_arr >= elev_mean-elev_std) &
                      (elev_arr <= elev_mean+elev_std), elev_arr, np.nan)
high_arr = np.where(elev_arr > elev_mean+elev_std, elev_arr, np.nan)

class_arr = np.where(elev_arr < elev_mean-elev_std, 1,
                     np.where(elev_arr <= elev_mean+elev_std, 2, 3))

plt.subplot(321)
plt.imshow(elev_arr)

plt.subplot(322)
# low_arr is a 2-dimensional array, so we need to use plt.imshow() to plot it
plt.imshow(low_arr)

plt.subplot(323)
plt.imshow(middle_arr)

plt.subplot(324)
plt.imshow(high_arr)

plt.subplot(325)
plt.imshow(class_arr)

plt.show()

3   Explanation

Import necessary modules and define the read_geotiff() function.

from osgeo import gdal
import numpy as np
import matplotlib.pyplot as plt

def read_geotiff(filename):
    ds = gdal.Open(filename)
    band = ds.GetRasterBand(1)
    arr = band.ReadAsArray()
    del ds, band
    return arr

Read data as a 2-dimensional NumPy array and caculate its statistics.

elev_arr = read_geotiff("elevation.tif")
elev_mean = elev_arr.mean()
elev_std = elev_arr.std()

We use the np.where() function to extract subsets of the 2-dimensional array. It takes three arguments including a boolean array, a true array or value, and a false array or value. Its output has the same shape as that of the first boolean array argument. The second and third arguments should be either of the same shape as the first argument or a single value. For each element in the boolean array, if its value is True, the value in the true array at the same location as in the first boolean array or the true value is returned to that location in the 2-dimensional output array. The same thing happens for False values in the boolean array.

In low_arr, we just want to store low elevation cells whose value is less than the mean minus standard deviation, and nothing else for the other cells. We can use np.nan to represent NoData cells. Here, nan means Not-a-Number. Repeat the same syntax for middle_arr and high_arr.

# use np.where() to extract 2-dimensional subsets of elev_arr
low_arr = np.where(elev_arr < elev_mean-elev_std, elev_arr, np.nan)
middle_arr = np.where((elev_arr >= elev_mean-elev_std) &
                      (elev_arr <= elev_mean+elev_std), elev_arr, np.nan)
high_arr = np.where(elev_arr > elev_mean+elev_std, elev_arr, np.nan)

This time, instead of extracting 2-dimensional cells for low, middle, and high bands, let’s classify them as 1, 2, and 3, respectively, in one array. We still use np.where(), but we use it in a nested manner. In short, we use this syntax: np.where(condition_1, value_1, np.where(condition_2, value_2, value_3)). If condition_1 is true, value_1 is used. Otherwise, np.where(condition_2, value_2, value_3) is used. The false case for condition_1 is again a conditional array. If condition_2 is true, value_2 is used. Otherwise, value_3 is used. Logically, it’s similar to this if-elif-else construct.

if condition_1 is True:
    return value_1
elif condition_2 is True:
    return value_2
else:
    return value_3

However, the construct above is only for single value comparisons and we still need to use np.where() to perform cell-by-cell comparisons in a 2-dimensional array.

First, extract low elevation cells using elev_arr < elev_mean-elev_std. If this condition is true, use 1. Otherwise, use np.where(elev_arr <= elev_mean+elev_std, 2, 3). In the otherwise case, we still need to evaluate this truthiness using the second np.where(). If a cell in elev_arr is not less than the mean minus standard deviation (elev_arr < elev_mean-elev_std), but is less than or equal to the mean plus standard deviation (elev_arr <= elev_mean+elev_std), use 2. Otherwise (elev_arr is not less than elev_mean-elev_std and not less than or equal to elev_mean+elev_std), use 3. The last case is actually for elevation cells whose value is greater than the mean plus standard deviation.

class_arr = np.where(elev_arr < elev_mean-elev_std, 1,
                     np.where(elev_arr <= elev_mean+elev_std, 2, 3))

The low_arr, middle_arr, high_arr, and class_arr arrays are all 2-dimensional and can be plotted using plt.imshow().

plt.subplot(321)
plt.imshow(elev_arr)

plt.subplot(322)
# low_arr is a 2-dimensional array, so we need to use plt.imshow() to plot it
plt.imshow(low_arr)

plt.subplot(323)
plt.imshow(middle_arr)

plt.subplot(324)
plt.imshow(high_arr)

plt.subplot(325)
plt.imshow(class_arr)

plt.show()