Characteristics and measures of hydrologic data
- 1 Why do we need to know the characteristics of hydrologic data?
- 2 How do we measure those characteristics?
- 3 Characteristics of hydrologic data
- 4 Measures of central tendency
- 5 Measures of variability
- 6 Example 1.1
- 7 Homework: Measures of hydrologic data
- 8 Exercise: Streamflow data cleaning and measures
1 Why do we need to know the characteristics of hydrologic data?
Knowing the characteristics of the data you want to analyze is crucial.
Many statistical tools require certain assumptions about data.
We don’t want to make false assumptions about hydrologic data!
2 How do we measure those characteristics?
The population is usually infinite in size, so its characteristics cannot be measured directly. We estimate them from a subset of the population (a sample).
3 Characteristics of hydrologic data
- A lower bound of zero in most cases; negative values are rare (any examples?)
- Presence of outliers, often high ones
- Positive skewness
- Non-normal distribution
- Data reported with some thresholds (censored data)
- Seasonal patterns
- Autocorrelation
- Dependence on other variables
4 Measures of central tendency
- Mean (typically arithmetic mean)
- Median
- Mode
4.1 Arithmetic mean: A classical measure of central tendency
$\def\mean#1{\bar{#1}}$ \begin{equation} \mean{X}= \sum_{i=1}^n\frac{X_i}{n}= \sum_{i=1}^k\mean{X}_i\frac{n_i}{n}= \mean{X}_{(j)}\frac{n-1}{n}+X_j\frac{1}{n}= \mean{X}_{(j)}+\left(X_j-\mean{X}_{(j)}\right)\frac{1}{n} \end{equation}
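where $\mean{X}_i$ and $n_i$ are the mean and size of the $i$-th of $k$ subgroups, and $\mean{X}_{(j)}$ is the mean of the $n-1$ observations excluding $X_j$. The last form shows that a single observation $X_j$ shifts the mean by $\left(X_j-\mean{X}_{(j)}\right)/n$, so the mean is sensitive to outliers.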
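As a quick check of the last form of the equation, the sketch below singles out the largest value of dataset (b) from Example 1.1 and rebuilds the full mean from the mean of the remaining values. The variable names are illustrative only.

```r
# Dataset (b) from Example 1.1; the last value is a high outlier
x <- c(2, 4, 8, 9, 11, 11, 120)
n <- length(x)
j <- n                               # index of the observation to single out

mean_all       <- sum(x) / n         # arithmetic mean of all n values
mean_without_j <- mean(x[-j])        # mean of the other n - 1 values

# Last form of the equation: X_j shifts the mean by (X_j - mean_without_j) / n
mean_updated <- mean_without_j + (x[j] - mean_without_j) / n

all.equal(mean_all, mean_updated)    # TRUE
```

Here the six smaller values average 7.5, and the single value 120 pulls the mean up to 165/7, or about 23.6.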
4.2 Median: A resistant measure of central tendency
$\def\median{\text{Median}}$ \begin{equation} \median= \begin{cases} X\left(\frac{n+1}{2}\right)&\text{if $n$ is odd}\\ \frac{1}{2}\left[X\left(\frac{n}{2}\right)+X\left(\frac{n}{2}+1\right)\right]&\text{if $n$ is even} \end{cases} \end{equation}
where $X(i)$ denotes the $i$-th smallest observation. Less sensitive to outliers.
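The two cases of the definition can be written out directly. This is a minimal sketch for illustration; `median_by_definition` is a hypothetical name, and in practice R's built-in `median()` does the same job.

```r
# Median from the definition: the middle value (odd n) or the
# average of the two middle values (even n) of the sorted data
median_by_definition <- function(x) {
  x <- sort(x)                       # sort() also drops NA values
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

b <- c(2, 4, 8, 9, 11, 11, 120)      # Example 1.1, dataset (b), n = 7
median_by_definition(b)              # odd n: 4th smallest value, 9
median_by_definition(b[-7])          # even n: (8 + 9) / 2 = 8.5
```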
4.3 Mode
The value that occurs most often in a discrete dataset.
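R's built-in `mode()` returns the storage type of an object, not the statistical mode. One possible sketch, assuming numeric input, counts the values with `table()`; `sample_mode` is a hypothetical helper name.

```r
# Statistical mode: the value(s) with the highest frequency
sample_mode <- function(x) {
  counts <- table(x)                            # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]   # all values tied for the top count
  as.numeric(top)                               # assumes x is numeric
}

sample_mode(c(2, 4, 8, 9, 11, 11, 12))          # dataset (a): 11
```

If several values tie for the highest count, this sketch returns all of them.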
4.4 Geometric mean
$\def\gmean{\text{GM}}$ \begin{equation} \gmean=\left(\prod_{i=1}^nX_i\right)^{1/n} \end{equation} where $X_i>0$.
Useful for positively skewed datasets.
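The logarithm of the geometric mean is the arithmetic mean of the logarithms, which gives a convenient way to compute it. The sketch below uses that identity; `geometric_mean` is a hypothetical name, not a base R function.

```r
# Geometric mean via logs: exp(mean(log(x))) equals prod(x)^(1/n)
# but avoids overflow of the product for long records
geometric_mean <- function(x) {
  stopifnot(all(x > 0))              # defined only for strictly positive values
  exp(mean(log(x)))
}

b <- c(2, 4, 8, 9, 11, 11, 120)      # Example 1.1, dataset (b)
geometric_mean(b)
prod(b)^(1 / length(b))              # same value from the product form
mean(b)                              # the arithmetic mean is larger
```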
5 Measures of variability
5.1 Sample variance: A classical measure of variability
\begin{equation} s^2=\sum_{i=1}^n\frac{\left(X_i-\mean{X}\right)^2}{n-1} \end{equation}
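A short check, using dataset (b) from Example 1.1, that the formula with the $n-1$ denominator is what R's `var()` computes:

```r
b <- c(2, 4, 8, 9, 11, 11, 120)      # Example 1.1, dataset (b)
n <- length(b)

s2 <- sum((b - mean(b))^2) / (n - 1) # sample variance from the definition
all.equal(s2, var(b))                # TRUE
sqrt(s2)                             # sample standard deviation, same as sd(b)
```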
5.2 Interquartile range (IQR): A resistant measure of variability
Percentiles $P_j$, for $0<j<1$, can be calculated from a dataset sorted from smallest to largest, $X_i$ for $i=1,\cdots,n$: \begin{equation} P_j=X_{(n+1)\cdot j} \end{equation} and the interquartile range (IQR) can be calculated as follows: $\def\iqr{\text{IQR}}$ \begin{equation} \iqr=P_{0.75}-P_{0.25} \end{equation}
What if $(n+1)\cdot j$ is not an integer? We interpolate between the two adjacent sorted values. In hydrology, we typically use the Weibull plotting position (type=6 in quantile() in R).
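The interpolation step can be sketched as follows. `weibull_percentile` is a hypothetical helper written for illustration; it interpolates linearly at position $(n+1)\cdot j$ and clamps positions that fall outside $1,\cdots,n$, which is intended to mirror `quantile(x, j, type=6)`.

```r
# Percentile at the (possibly fractional) position (n + 1) * j
weibull_percentile <- function(x, j) {
  x <- sort(x)
  n <- length(x)
  pos <- (n + 1) * j
  pos <- min(max(pos, 1), n)         # clamp to the available range 1..n
  lower <- floor(pos)
  upper <- ceiling(pos)
  # linear interpolation between the two neighbouring sorted values
  x[lower] + (pos - lower) * (x[upper] - x[lower])
}

x <- c(2, 4, 8, 9, 11, 11)           # n = 6, so (n + 1) * 0.25 = 1.75
weibull_percentile(x, 0.25)          # 2 + 0.75 * (4 - 2) = 3.5
quantile(x, 0.25, type = 6)          # built-in equivalent
weibull_percentile(x, 0.75) - weibull_percentile(x, 0.25)   # IQR
IQR(x, type = 6)                     # built-in equivalent
```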
5.3 Median absolute deviation (MAD): A resistant measure of variability
$\def\mad{\text{MAD}}$ \begin{equation} \mad(X)=\median{\left(\left|X_i-\median(X)\right|\right)} \end{equation}
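Note that R's `mad()` multiplies the result by 1.4826 by default; setting `constant = 1` gives the MAD as defined above. A small sketch with the two datasets from Example 1.1:

```r
a <- c(2, 4, 8, 9, 11, 11, 12)       # Example 1.1, dataset (a)
b <- c(2, 4, 8, 9, 11, 11, 120)      # Example 1.1, dataset (b)

# MAD from the definition: median of absolute deviations from the median
median(abs(a - median(a)))           # 2
median(abs(b - median(b)))           # 2, unchanged by the outlier

# Built-in equivalent without the default scaling constant
mad(b, constant = 1)                 # 2
```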
5.4 Coefficient of variation (CV): A nondimensional measure of variability
$\def\cv{\text{CV}}$ \begin{equation} \cv=\frac{s}{\mean{X}} \end{equation} where $s=\sqrt{s^2}$ is the sample standard deviation.
Useful for comparing the degree of variability among datasets with different units or magnitudes.
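Because the CV is a ratio of two quantities with the same units, rescaling the data leaves it unchanged. The sketch below illustrates this; `cv` is a hypothetical helper, and the conversion factor is only an example of a unit change.

```r
# Coefficient of variation: standard deviation divided by the mean
cv <- function(x) sd(x) / mean(x)

a <- c(2, 4, 8, 9, 11, 11, 12)       # Example 1.1, dataset (a)
cv(a)
cv(a * 0.0283168)                    # same CV after a unit conversion
all.equal(cv(a), cv(a * 0.0283168))  # TRUE
```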
6 Example 1.1
- Dataset (a): 2, 4, 8, 9, 11, 11, 12
- Dataset (b): 2, 4, 8, 9, 11, 11, 120
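The two datasets differ only in the last value (12 versus 120). One way to work through the example, assuming the aim is to compare classical and resistant measures, is to compute them side by side. The helper `measures` is hypothetical.

```r
a <- c(2, 4, 8, 9, 11, 11, 12)       # dataset (a)
b <- c(2, 4, 8, 9, 11, 11, 120)      # dataset (b)

# Classical and resistant measures for one dataset
measures <- function(x) {
  c(mean   = mean(x),
    median = median(x),
    var    = var(x),
    iqr    = IQR(x, type = 6),       # Weibull plotting position
    mad    = mad(x, constant = 1),   # unscaled MAD
    cv     = sd(x) / mean(x))
}

res <- sapply(list(a = a, b = b), measures)   # one column per dataset
round(res, 2)
```

The median (9), IQR (7), and MAD (2) are identical for both datasets, while the mean, variance, and CV change substantially with the single outlier.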
7 Homework: Measures of hydrologic data
- Define your own function for Eq. (1.10).
- Solve Exercise 2.
Submit your R file with comments.
8 Exercise: Streamflow data cleaning and measures
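# First attempt: read the file without treating the first row as a header
q <- read.table("usgs_streamflow.txt")
q[1:10,]
# Second attempt: use the first row as column names
q <- read.table("usgs_streamflow.txt", header=TRUE)
q[1:10,]
colnames(q)
# Third attempt: skip=1 skips the first line of the file before reading the header
q <- read.table("usgs_streamflow.txt", header=TRUE, skip=1)
q[1:10,]
# Go back to the header=TRUE version and drop the first data row instead
q <- read.table("usgs_streamflow.txt", header=TRUE)
q[1:10,]
q <- q[-1,]
q[1:10,]
# Renumber the rows after the deletion
rownames(q) <- 1:nrow(q)
q[1:10,]
# Keep the first three column names and rename the fourth and fifth columns
colnames(q)[1:3]
colnames(q) <- c(colnames(q)[1:3], "q", "qcode")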
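# Time series of flow on linear and log axes
plot(q[,"q"], pch=20, type="l")
plot(q[,"q"], pch=20, type="l", log="y")
# hist() requires numeric input and stops with an error if the column was read as text
hist(q[,"q"])
# Replace entries containing letters with NA, then confirm that none remain
q[grep("[A-Za-z]", q[,"q"]),"q"] <- NA
q[grep("[A-Za-z]", q[,"q"]),"q"]
# Convert the cleaned column to numeric so that hist(), mean(), var() and sd() work
q[,"q"] <- as.numeric(as.character(q[,"q"]))
hist(q[,"q"])
mean(q[,"q"], na.rm=TRUE)
var(q[,"q"], na.rm=TRUE)
sd(q[,"q"], na.rm=TRUE)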
# Logical flag for missing flows, then the row numbers where they occur
is.na(q[,"q"])
which(is.na(q[,"q"]))