Characteristics and measures of hydrologic data

Dr. Huidae Cho
Department of Civil Engineering...New Mexico State University

1   Why do we need to know the characteristics of hydrologic data?

Knowing the characteristics of the data you want to analyze is crucial.

Many statistical tools require certain assumptions about data.

We don’t want to make false assumptions about hydrologic data!

2   How do we measure those characteristics?

The population is usually of infinite size ⇒ We estimate its characteristics from a subset of the population (a sample).

3   Characteristics of hydrologic data

  • A lower bound of zero in most cases; negative values are rare (any examples?)
  • Presence of outliers, often high ones
  • Positive skewness
  • Non-normal distribution
  • Data reported only above or below some threshold (censored data)
  • Seasonal patterns
  • Autocorrelation
  • Dependence on other variables

4   Measures of central tendency

  • Mean (typically arithmetic mean)
  • Median
  • Mode

4.1   Arithmetic mean: A classical measure of central tendency

$\def\mean#1{\bar{#1}}$ \begin{equation} \mean{X}= \sum_{i=1}^n\frac{X_i}{n}= \sum_{i=1}^k\mean{X}_i\frac{n_i}{n}= \mean{X}_{(j)}\frac{n-1}{n}+X_j\frac{1}{n}= \mean{X}_{(j)}+\left(X_j-\mean{X}_{(j)}\right)\frac{1}{n} \end{equation}

Sensitive to outliers.
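The right-most form of the equation can be checked numerically. The following is a minimal sketch in R that borrows the two datasets from Example 1.1; the variable names (`mean_without_j`, `mean_updated`) are illustrative assumptions and not part of the original notes.

```r
# Datasets (a) and (b) from Example 1.1; they differ only in the last value
a <- c(2, 4, 8, 9, 11, 11, 12)
b <- c(2, 4, 8, 9, 11, 11, 120)

n <- length(b)
j <- n                          # single out the last observation (the outlier)
mean_without_j <- mean(b[-j])   # mean of the other n - 1 observations

# Leave-one-out mean plus the correction contributed by observation j
mean_updated <- mean_without_j + (b[j] - mean_without_j) / n

mean(a)         # mean without the outlier
mean(b)         # mean with the outlier
mean_updated    # should agree with mean(b)
```

A single observation shifts the mean by its distance from the mean of the others, divided by n, which is why one large value can pull the mean far from the bulk of the data.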

4.2   Median: A resistant measure of central tendency

$\def\median{\text{Median}}$ \begin{equation} \median= \begin{cases} X_{(n+1)/2}&\text{if $n$ is odd}\\ \frac{1}{2}\left(X_{n/2}+X_{n/2+1}\right)&\text{if $n$ is even} \end{cases} \end{equation} where the $X_i$ are sorted from smallest to largest.

Less sensitive to outliers.
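As a rough illustration, the definition above can be written directly in R and compared with the built-in median(). The helper name `median_by_definition` is hypothetical; the datasets are those of Example 1.1.

```r
# Hypothetical helper: median computed from the odd/even definition
median_by_definition <- function(x) {
  x <- sort(x)                     # the definition assumes sorted data
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                 # odd n: the middle value
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2  # even n: average of the two middle values
  }
}

a <- c(2, 4, 8, 9, 11, 11, 12)
b <- c(2, 4, 8, 9, 11, 11, 120)

median_by_definition(a)   # 9
median_by_definition(b)   # 9, unchanged by the outlier
median(b)                 # built-in function for comparison
```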

4.3   Mode

The value that occurs most often in a dataset; most meaningful for discrete data.
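Base R has no function for the statistical mode (mode() reports an object's storage mode instead), so one possible sketch counts occurrences with table(). The helper name `sample_mode` is an assumption made for illustration.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector
sample_mode <- function(x) {
  counts <- table(x)                             # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]  # keep all ties
  as.numeric(modes)
}

sample_mode(c(2, 4, 8, 9, 11, 11, 12))   # 11
sample_mode(c(1, 1, 2, 2, 3))            # two modes: 1 and 2
```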

4.4   Geometric mean

$\def\gmean{\text{GM}}$ \begin{equation} \gmean=\left(\prod_{i=1}^nX_i\right)^{1/n} \end{equation} where $X_i>0$.

Useful for positively skewed datasets.
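A minimal sketch of the geometric mean in R follows. It uses the equivalent form exp(mean(log(x))), which avoids overflow of the product for long records; the helper name `geometric_mean` is hypothetical, and the data are dataset (b) of Example 1.1.

```r
# Hypothetical helper: geometric mean via logarithms (requires all x > 0)
geometric_mean <- function(x) {
  stopifnot(all(x > 0))
  exp(mean(log(x)))
}

b <- c(2, 4, 8, 9, 11, 11, 120)

geometric_mean(b)          # via logarithms
prod(b)^(1 / length(b))    # direct form of the equation, same value
mean(b)                    # arithmetic mean, pulled higher by the outlier
```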

5   Measures of variability

5.1   Sample variance: A classical measure of variability

\begin{equation} s^2=\sum_{i=1}^n\frac{\left(X_i-\mean{X}\right)^2}{n-1} \end{equation}
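The equation can be checked against R's var(), which also divides by n - 1. This is only a sketch; `var_by_definition` is a hypothetical helper, and the datasets are those of Example 1.1.

```r
# Hypothetical helper: sample variance from the definition
var_by_definition <- function(x) {
  sum((x - mean(x))^2) / (length(x) - 1)
}

a <- c(2, 4, 8, 9, 11, 11, 12)
b <- c(2, 4, 8, 9, 11, 11, 120)

var_by_definition(a)
var(a)                  # built-in function, same value
var(b) / var(a)         # squared deviations let one outlier inflate the variance
```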

5.2   Interquartile range (IQR): A resistant measure of variability

Percentiles $P_j$ can be calculated from a dataset sorted from smallest to largest, $X_i$ for $i=1,\cdots,n$: \begin{equation} P_j=X_{(n+1)\cdot j} \end{equation} where $j$ is the fraction of interest (e.g., $j=0.25$ for the 25th percentile). The interquartile range (IQR) can then be calculated as follows: $\def\iqr{\text{IQR}}$ \begin{equation} \iqr=P_{0.75}-P_{0.25} \end{equation}

What if $(n+1)\cdot j$ is not an integer? Interpolate between the two adjacent sorted values. In hydrology, we typically use the Weibull plotting position (type=6 in quantile() in R).
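The percentile rule with linear interpolation might be sketched as below and compared with quantile(type = 6). The helper `weibull_quantile` is hypothetical; clamping the position to the range 1 to n is an assumption chosen to mirror how quantile() treats positions beyond the ends of the data.

```r
# Hypothetical helper: percentile at position (n + 1) * p with linear interpolation
weibull_quantile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- (n + 1) * p           # possibly non-integer position in the sorted data
  h <- min(max(h, 1), n)     # clamp to the valid index range
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])
}

a <- c(2, 4, 8, 9, 11, 11, 12)   # dataset (a) from Example 1.1

weibull_quantile(a, 0.25)                 # position 2, so 4
weibull_quantile(a, 0.75)                 # position 6, so 11
weibull_quantile(c(2, 4, 8, 9), 0.25)     # position 1.25, so 2.5
quantile(a, c(0.25, 0.75), type = 6)      # built-in equivalent
IQR(a, type = 6)                          # 7
IQR(a)                                    # 5 with R's default type = 7
```

Note that IQR() and quantile() default to type = 7, so the type argument has to be set explicitly to reproduce the Weibull plotting position.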

5.3   Median absolute deviation (MAD): A resistant measure of variability

$\def\mad{\text{MAD}}$ \begin{equation} \mad(X)=\median{\left(\left|X_i-\median(X)\right|\right)} \end{equation}
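A short sketch of the MAD in R follows. The built-in mad() multiplies the result by 1.4826 by default, so constant = 1 is needed to match the equation as written here. The helper name `mad_by_definition` is hypothetical; the datasets are those of Example 1.1.

```r
# Hypothetical helper: median of absolute deviations from the median
mad_by_definition <- function(x) {
  median(abs(x - median(x)))
}

a <- c(2, 4, 8, 9, 11, 11, 12)
b <- c(2, 4, 8, 9, 11, 11, 120)

mad_by_definition(a)     # 2
mad_by_definition(b)     # 2, unaffected by the outlier
mad(b, constant = 1)     # built-in function without the default scaling
```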

5.4   Coefficient of variation (CV): A nondimensional measure of variability

$\def\cv{\text{CV}}$ \begin{equation} \cv=\frac{s}{\mean{X}} \end{equation} where $s$ is the sample standard deviation.

Useful for comparing the degree of variability between datasets with different units or magnitudes.
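As a brief illustration, taking the CV as the sample standard deviation divided by the mean (the usual definition), it can be computed in one line. The helper name `coef_var` is hypothetical; the datasets are those of Example 1.1.

```r
# Hypothetical helper: coefficient of variation (standard deviation over mean)
coef_var <- function(x) {
  sd(x) / mean(x)
}

a <- c(2, 4, 8, 9, 11, 11, 12)
b <- c(2, 4, 8, 9, 11, 11, 120)

coef_var(a)
coef_var(b)         # larger, because of the outlier
coef_var(10 * a)    # same as coef_var(a): rescaling the data does not change CV
```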

6   Example 1.1

  • Dataset (a): 2, 4, 8, 9, 11, 11, 12
  • Dataset (b): 2, 4, 8, 9, 11, 11, 120
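One way to work through this example is to tabulate classical and resistant measures for both datasets side by side. The sketch below is only a suggestion; the helper name `measures` and the choice of columns are assumptions.

```r
a <- c(2, 4, 8, 9, 11, 11, 12)
b <- c(2, 4, 8, 9, 11, 11, 120)

# Hypothetical helper: classical and resistant measures for one dataset
measures <- function(x) {
  c(mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    IQR    = IQR(x, type = 6),       # Weibull plotting position
    MAD    = mad(x, constant = 1))   # unscaled MAD
}

m <- rbind(a = measures(a), b = measures(b))
round(m, 2)
```

The mean and standard deviation change substantially between (a) and (b), while the median, IQR, and MAD stay the same.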

7   Homework: Measures of hydrologic data

  1. Define your own function for Eq. (1.10).
  2. Solve Exercise 2.

Submit your R file with comments.

8   Exercise: Streamflow data cleaning and measures

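# Read the file without a header and inspect the first rows
q <- read.table("usgs_streamflow.txt")
q[1:10,]
# Read again, using the first non-comment line as column names
q <- read.table("usgs_streamflow.txt", header=TRUE)
q[1:10,]
colnames(q)
# Try skipping the first line of the file and compare the result
q <- read.table("usgs_streamflow.txt", header=TRUE, skip=1)
q[1:10,]
# Go back to the header-only version
q <- read.table("usgs_streamflow.txt", header=TRUE)
q[1:10,]
# Drop the first row, which is not a streamflow record, and renumber the rows
q <- q[-1,]
q[1:10,]
rownames(q) <- 1:nrow(q)
q[1:10,]
# Give the fourth and fifth columns short names
colnames(q)[1:3]
colnames(q) <- c(colnames(q)[1:3], "q", "qcode")
plot(q[,"q"], pch=20, type="l")
plot(q[,"q"], pch=20, type="l", log="y")
hist(q[,"q"])
# Replace entries that contain letters with NA, then check that none remain
q[grep("[A-Za-z]", q[,"q"]),"q"] <- NA
q[grep("[A-Za-z]", q[,"q"]),"q"]
# The column was likely read as text because of the non-numeric entries;
# convert it to numeric before computing statistics
q[,"q"] <- as.numeric(as.character(q[,"q"]))
hist(q[,"q"])
# Classical measures, ignoring missing values
mean(q[,"q"], na.rm=TRUE)
var(q[,"q"], na.rm=TRUE)
sd(q[,"q"], na.rm=TRUE)
# Locate the missing values
is.na(q[,"q"])
(1:nrow(q))[is.na(q[,"q"])]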