Graphical data analysis
1 Why do we use graphical data analysis?
- Provides the analyst insight into the data under scrutiny ⇒ Exploratory data analysis
- Illustrates important concepts when presenting the results to others ⇒ Visual and intuitive, no better ways to convince your audience!
Graphical methods are inductive procedures: we analyze the data in order to formulate a theory.
They also help avoid misinterpreting numerical summaries of the data.
2 Histograms
Histograms show
- Central tendency
- Variability
- Symmetry
Suitable for discrete data.
2.1 Histogram vs. probability density function
Histograms tend to converge to the probability density function (pdf) of the population if the sample size becomes large enough.
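A minimal sketch of this convergence, assuming simulated standard normal data. The seed, sample sizes, and bin edges are illustrative choices, not part of the notes. It measures the largest gap between the histogram density and the true pdf at the bin midpoints.

```r
# Illustrative only: simulated standard normal samples, not real data
set.seed(1)
breaks <- seq(-10, 10, by = 0.5)       # assumed fixed bin edges
mids   <- head(breaks, -1) + 0.25      # bin midpoints

# Largest gap between the histogram density and the normal pdf at the midpoints
max_gap <- function(n) {
  h <- hist(rnorm(n), breaks = breaks, plot = FALSE)
  max(abs(h$density - dnorm(mids)))
}

gap_small <- max_gap(50)     # small sample: ragged histogram
gap_large <- max_gap(1e5)    # large sample: much closer to the pdf
c(small = gap_small, large = gap_large)
```

The gap for the large sample should be much smaller, although some gap always remains because the bins have a finite width.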
2.2 Bin size
A sample of $n$ observations is divided into $k$ equal intervals (bins).
Iman and Conover (1983) suggest that $k$ should be the smallest integer such that $2^k\geq n$. Can you find this $k$ in R?
Bin size is the range divided by $k$.
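One possible way to find this $k$ in R is to take the ceiling of $\log_2 n$. The sketch below assumes a hypothetical simulated sample named Q; the data and seed are for illustration only.

```r
# Hypothetical sample: 50 simulated lognormal values (illustration only)
set.seed(42)
Q <- rlnorm(50, meanlog = 5, sdlog = 1)

n <- length(Q)
k <- ceiling(log2(n))               # smallest integer k with 2^k >= n
bin_size <- diff(range(Q)) / k      # range divided by k

# k + 1 equally spaced break points give exactly k bins
breaks <- seq(min(Q), max(Q), length.out = k + 1)
h <- hist(Q, breaks = breaks, plot = FALSE)
k   # 6, because 2^5 = 32 < 50 <= 64 = 2^6
```

Passing breaks = breaks to hist() forces these $k$ bins. Without it, hist() chooses its own bins.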
Histograms are sensitive to the bin size. See Figures 2.2 and 2.3. Generally, using hist(Q) without other arguments should be good.
3 Quantile plots
Just as histograms approximate the pdf, quantile plots approximate the cumulative distribution function (cdf).
They are also referred to as empirical cumulative distribution functions (ecdf).
3.1 Advantages of quantile plots
Three important advantages over histograms and boxplots:
- Arbitrary categories are not required (no bins as in histograms)
- All of the data are displayed (no ranges as in boxplots)
- Every point has a distinct position (no overlap)
3.2 Construction of quantile plots
The dataset is sorted from smallest to largest.
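Data values go on the x-axis and plotting positions on the y-axis. For the y-axis, we usually use the Weibull plotting position: \begin{equation} p_i=\frac{i}{n+1} \end{equation} where $i$ is the rank of the observation in the sorted dataset and $n$ is the sample size.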
Tied values (same $x$'s) will have different positions (different $y$'s) ⇒ Vertical jump!
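The construction can be sketched in a few lines of R, using the Weibull position $p_i = i/(n+1)$. The vector Q below is a hypothetical sample, invented to include a tie.

```r
# Hypothetical small sample with a tie (two values of 12), illustration only
Q <- c(12, 7, 30, 12, 19, 4, 55, 23)

Q_sorted <- sort(Q)          # smallest to largest
n <- length(Q_sorted)
i <- seq_len(n)              # ranks 1..n
p <- i / (n + 1)             # Weibull plotting positions

# Data values on the x-axis, plotting positions on the y-axis
plot(Q_sorted, p, type = "b",
     xlab = "Q", ylab = "Cumulative frequency (Weibull)")
```

The two tied values of 12 receive ranks 3 and 4, so they plot at the same $x$ with different $y$, which produces the vertical jump.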
3.3 Plotting positions
See other plotting positions in Table 2.2.
Why do we prefer the Weibull plotting position?
- Most important, $p_n<1$, recognizing the existence of a nonzero probability of exceeding the maximum observation
- Has long been used in the United States
- Used in Bulletin 17C for determining flood frequencies in the United States
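To illustrate the first point, the sketch below compares the Weibull position $i/(n+1)$ with the simpler $i/n$, which is what R's ecdf() assigns to sorted distinct values. The sample size is an arbitrary choice for illustration.

```r
n <- 10                      # assumed sample size, illustration only
i <- seq_len(n)

p_weibull <- i / (n + 1)     # Weibull plotting position
p_naive   <- i / n           # position used by ecdf() for distinct values

max(p_weibull)   # 10/11: leaves room for exceeding the largest observation
max(p_naive)     # 1: implies the maximum can never be exceeded

# ppoints() with a = 0 reproduces the Weibull positions
all.equal(p_weibull, ppoints(n, a = 0))
```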
4 Boxplots
Boxplots summarize
- Center of the data (median, box centerline)
- Variation or spread (interquartile range $\text{IQR}=Q_3-Q_1$, box height, 50% of the data)
- Skewness (quartile skew, relative size of box halves)
- Presence or absence of unusual values and their magnitude (outliers, usually more than $1.5\times\text{IQR}$ below $Q_1$ or above $Q_3$)
Remember, boxplots are always drawn at data points.
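The quantities behind a boxplot can be computed by hand, as in the following sketch. The sample Q is hypothetical and was chosen to contain one unusually large value. Note that boxplot() itself uses hinges from fivenum(), which can differ slightly from quantile() for some sample sizes.

```r
# Hypothetical sample with one unusually large value (illustration only)
Q <- c(4, 7, 12, 12, 19, 23, 30, 55, 140)

q1  <- quantile(Q, 0.25, names = FALSE)
med <- median(Q)
q3  <- quantile(Q, 0.75, names = FALSE)
iqr <- q3 - q1                      # same as IQR(Q)

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- Q[Q < lower_fence | Q > upper_fence]

# Whiskers end at the most extreme observations inside the fences
whiskers <- range(Q[Q >= lower_fence & Q <= upper_fence])

boxplot(Q, ylab = "Q")
```

For this sample the fences fall at -15 and 57, the whiskers end at the observations 4 and 55, and 140 is flagged as an outlier.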
5 Probability (Q-Q) plots
- Theoretical cdf vs. ecdf
- Two ecdfs
Compare Figures 2.8 and 2.9. Which one makes it easier to compare the dataset with its theoretical cdf?
Plotting at $(Z_i, Q_i)$ where
- $Z_i=F_N^{-1}(p_i)$ where $p_i=\frac{i}{n+1}$
- $Q_i=\bar{Q}+Z_i\cdot s_Q$
See Figure 2.10 for its use as an exceedance probability plot.
See Figure 2.11 for skewness (symmetry vs. asymmetry) and kurtosis (heavy vs. light tail).
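A minimal sketch of a normal probability plot built from these definitions. The sample Q is hypothetical. The observed values are plotted against $Z_i$, and the straight line is the theoretical normal quantile $\bar{Q}+Z\cdot s_Q$.

```r
# Hypothetical right-skewed sample (illustration only)
Q <- c(4, 7, 12, 12, 19, 23, 30, 55, 140)

Q_sorted <- sort(Q)
n <- length(Q_sorted)
p <- seq_len(n) / (n + 1)    # Weibull plotting positions p_i
Z <- qnorm(p)                # Z_i = F_N^{-1}(p_i), standard normal quantiles

# Observed values against normal quantiles
plot(Z, Q_sorted, xlab = "Normal quantile Z", ylab = "Q")

# Theoretical normal line: Q = mean(Q) + Z * sd(Q)
abline(a = mean(Q), b = sd(Q))
```

Points that fall close to the line suggest an approximately normal sample. Systematic curvature away from the line points to skewness or tails that are heavy or light.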
6 Scatterplots
Scatterplots illustrate the relation between two variables.
We often use smooth curves (smooths in short) to simplify these plots. See Figure 2.22.
- LOcally WEighted Scatterplot Smoothing (LOWESS) (Cleveland and McGill, 1984a; Cleveland, 1985)
- LOESS (Cleveland et al., 1992), loess in R
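Both smooths are available in base R through lowess() and loess(). The sketch below assumes simulated data; the variables x and y, the seed, and the span values are illustrative choices (the spans shown are the functions' defaults).

```r
# Hypothetical data: a noisy curved relation (illustration only)
set.seed(7)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + 0.1 * x^2 + rnorm(100, sd = 0.5)

plot(x, y)

# lowess(): returns a list with sorted x and smoothed y; f is the smoother span
lw <- lowess(x, y, f = 2/3)
lines(lw, col = "blue")

# loess(): formula interface; span plays a similar role to f
lo <- loess(y ~ x, span = 0.75)
lines(x, predict(lo), col = "red", lty = 2)

# A straight-line fit for comparison
abline(lm(y ~ x), col = "gray")
```

A smaller span follows the data more closely, and a larger span gives a smoother curve.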
7 Homework: Univariate regression analysis
Download your data with two variables for a USGS daily station. Perform a linear regression analysis and LOWESS analysis in QMD. Zip your data and QMD file in one Zip file and upload it so I can try it myself.
8 Graphs for multivariate data
See “Graphcs for multivariate data.qmd.”