How to classify a GeoTIFF file using NumPy
1 Introduction
Try these exercises first:
- How to read a GeoTIFF file as a NumPy array using GDAL
- Exercise: Plot a GeoTIFF file
- Exercise: Calculate the mean cell value of a GeoTIFF file
- How to extract cell values from a GeoTIFF file using NumPy
In this cell extraction exercise, we extracted cell values as 1-dimensional arrays, but they are not maps and lost location information. Now, let’s try to classify a GeoTIFF file into multiple classes and retain cell locations as 2-dimensional arrays. Use elevation.tif.
2 Python code
from osgeo import gdal
import numpy as np
import matplotlib.pyplot as plt
def read_geotiff(filename):
ds = gdal.Open(filename)
band = ds.GetRasterBand(1)
arr = band.ReadAsArray()
del ds, band
return arr
elev_arr = read_geotiff("elevation.tif")
elev_mean = elev_arr.mean()
elev_std = elev_arr.std()
# use np.where() to extract 2-dimensional subsets of elev_arr
low_arr = np.where(elev_arr < elev_mean-elev_std, elev_arr, np.nan)
middle_arr = np.where((elev_arr >= elev_mean-elev_std) &
(elev_arr <= elev_mean+elev_std), elev_arr, np.nan)
high_arr = np.where(elev_arr > elev_mean+elev_std, elev_arr, np.nan)
class_arr = np.where(elev_arr < elev_mean-elev_std, 1,
np.where(elev_arr <= elev_mean+elev_std, 2, 3))
plt.subplot(321)
plt.imshow(elev_arr)
plt.subplot(322)
# low_arr is a 2-dimensional array, so we need to use plt.imshow() to plot it
plt.imshow(low_arr)
plt.subplot(323)
plt.imshow(middle_arr)
plt.subplot(324)
plt.imshow(high_arr)
plt.subplot(325)
plt.imshow(class_arr)
plt.show()
3 Explanation
Import necessary modules and define the read_geotiff()
function.
from osgeo import gdal
import numpy as np
import matplotlib.pyplot as plt
def read_geotiff(filename):
ds = gdal.Open(filename)
band = ds.GetRasterBand(1)
arr = band.ReadAsArray()
del ds, band
return arr
Read data as a 2-dimensional NumPy array and caculate its statistics.
elev_arr = read_geotiff("elevation.tif")
elev_mean = elev_arr.mean()
elev_std = elev_arr.std()
We use the np.where()
function to extract subsets of the 2-dimensional array. It takes three arguments including a boolean array, a true array or value, and a false array or value. Its output has the same shape as that of the first boolean array argument. The second and third arguments should be either of the same shape as the first argument or a single value. For each element in the boolean array, if its value is True
, the value in the true array at the same location as in the first boolean array or the true value is returned to that location in the 2-dimensional output array. The same thing happens for False
values in the boolean array.
In low_arr
, we just want to store low elevation cells whose value is less than the mean minus standard deviation, and nothing else for the other cells. We can use np.nan
to represent NoData cells. Here, nan
means Not-a-Number. Repeat the same syntax for middle_arr
and high_arr
.
# use np.where() to extract 2-dimensional subsets of elev_arr
low_arr = np.where(elev_arr < elev_mean-elev_std, elev_arr, np.nan)
middle_arr = np.where((elev_arr >= elev_mean-elev_std) &
(elev_arr <= elev_mean+elev_std), elev_arr, np.nan)
high_arr = np.where(elev_arr > elev_mean+elev_std, elev_arr, np.nan)
This time, instead of extracting 2-dimensional cells for low, middle, and high bands, let’s classify them as 1, 2, and 3, respectively, in one array. We still use np.where()
, but we use it in a nested manner. In short, we use this syntax: np.where(condition_1, value_1, np.where(condition_2, value_2, value_3))
. If condition_1
is true, value_1
is used. Otherwise, np.where(condition_2, value_2, value_3)
is used. The false case for condition_1
is again a conditional array. If condition_2
is true, value_2
is used. Otherwise, value_3
is used. Logically, it’s similar to this if-elif-else
construct.
if condition_1 is True:
return value_1
elif condition_2 is True:
return value_2
else:
return value_3
However, the construct above is only for single value comparisons and we still need to use np.where()
to perform cell-by-cell comparisons in a 2-dimensional array.
First, extract low elevation cells using elev_arr < elev_mean-elev_std
. If this condition is true, use 1
. Otherwise, use np.where(elev_arr <= elev_mean+elev_std, 2, 3)
. In the otherwise case, we still need to evaluate this truthiness using the second np.where()
. If a cell in elev_arr
is not less than the mean minus standard deviation (elev_arr < elev_mean-elev_std
), but is less than or equal to the mean plus standard deviation (elev_arr <= elev_mean+elev_std
), use 2. Otherwise (elev_arr
is not less than elev_mean-elev_std
and not less than or equal to elev_mean+elev_std
), use 3. The last case is actually for elevation cells whose value is greater than the mean plus standard deviation.
class_arr = np.where(elev_arr < elev_mean-elev_std, 1,
np.where(elev_arr <= elev_mean+elev_std, 2, 3))
The low_arr
, middle_arr
, high_arr
, and class_arr
arrays are all 2-dimensional and can be plotted using plt.imshow()
.
plt.subplot(321)
plt.imshow(elev_arr)
plt.subplot(322)
# low_arr is a 2-dimensional array, so we need to use plt.imshow() to plot it
plt.imshow(low_arr)
plt.subplot(323)
plt.imshow(middle_arr)
plt.subplot(324)
plt.imshow(high_arr)
plt.subplot(325)
plt.imshow(class_arr)
plt.show()