Graphical data analysis

Dr. Huidae Cho
Department of Civil Engineering, New Mexico State University

1   Why do we use graphical data analysis?

  • Provides the analyst insight into the data under scrutiny ⇒ Exploratory data analysis
  • Illustrates important concepts when presenting the results to others ⇒ Visual and intuitive, no better way to convince your audience!

Graphical data analysis is an inductive procedure: we analyze the data first in order to formulate a theory.

It also helps us avoid misunderstanding numerical measures of the data.

2   Histograms

Histograms show

  • Central tendency
  • Variability
  • Symmetry

Suitable for discrete data.

2.1   Histogram vs. probability density function

Histograms tend to converge to the probability density function (pdf) of the population if the sample size becomes large enough.

2.2   Bin size

A sample of $n$ observations is divided into $k$ equal intervals (bins).

Iman and Conover (1983) suggest that $k$ should be the smallest integer such that $2^k\geq n$. Can you find this $k$ in R?

Bin size is the range divided by $k$.
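One possible answer to the question above, as a minimal R sketch. The vector Q holds made-up values used only for illustration, and the names n, k, and bin_size are assumptions, not part of the original notes.

```r
# Hypothetical sample of n = 20 observations (illustrative values only)
Q <- c(12, 15, 9, 22, 30, 18, 25, 11, 14, 40,
       35, 17, 21, 28, 16, 13, 19, 24, 33, 27)
n <- length(Q)

# Smallest integer k such that 2^k >= n (Iman and Conover, 1983)
k <- ceiling(log2(n))

# Bin size is the range of the data divided by k
bin_size <- diff(range(Q)) / k

k         # 5, because 2^4 = 16 < 20 <= 2^5 = 32
bin_size  # (40 - 9) / 5 = 6.2
```

An equivalent approach is a loop that increases k from 0 until 2^k >= n, which avoids relying on log2() altogether.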

Histograms are sensitive to the bin size. See Figures 2.2 and 2.3. Generally, calling hist(Q) without other arguments gives a reasonable result; by default, R chooses the number of bins with Sturges' rule.
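To see the sensitivity for yourself, the sketch below builds two histograms of the same data: one with R's defaults and one with $k$ equal bins from the $2^k\geq n$ rule. The data in Q are hypothetical, and the names breaks_k, h_default, and h_k are chosen here only for illustration.

```r
# Hypothetical sample (illustrative values only)
Q <- c(12, 15, 9, 22, 30, 18, 25, 11, 14, 40,
       35, 17, 21, 28, 16, 13, 19, 24, 33, 27)

# Default histogram; plot = FALSE returns the bin counts without drawing
h_default <- hist(Q, plot = FALSE)

# Histogram with k equal-width bins spanning the range of the data
k <- ceiling(log2(length(Q)))
breaks_k <- seq(min(Q), max(Q), length.out = k + 1)
h_k <- hist(Q, breaks = breaks_k, plot = FALSE)

# Compare the bin edges and counts of the two versions
h_default$breaks
h_default$counts
h_k$breaks
h_k$counts

# Draw both side by side
par(mfrow = c(1, 2))
hist(Q, main = "Default bins")
hist(Q, breaks = breaks_k, main = "k equal bins")
```

Passing a vector of break points, as done here, forces those exact bins; passing a single number to breaks is treated by hist() only as a suggestion.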

3   Quantile plots

As histograms are an approximation of the pdf, quantile plots are an approximation of the cumulative distribution function (cdf).

They are also referred to as empirical cumulative distribution functions (ecdf).

3.1   Advantages of quantile plots

Three important advantages over histograms and boxplots:

  • Arbitrary categories are not required (no bins as in histograms)
  • All of the data are displayed (no ranges as in boxplots)
  • Every point has a distinct position (no overlap)

3.2   Construction of quantile plots

The dataset is sorted from smallest to largest.

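Data values go on the x-axis and plotting positions on the y-axis. For the y-axis, we usually use the Weibull plotting position: \begin{equation} p_i=\frac{i}{n+1} \end{equation} where $i$ is the rank of the observation in the sorted dataset and $n$ is the sample size.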

Tied values (same $x$'s) will have different positions (different $y$'s) ⇒ Vertical jump!
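The construction can be sketched in a few lines of R. The data are hypothetical and deliberately contain a tied value (15 appears three times) so that the vertical jump is visible; the variable names are assumptions made for this example.

```r
# Hypothetical sample with ties (illustrative values only)
Q <- c(12, 15, 9, 22, 15, 18, 25, 11, 15, 40)
n <- length(Q)

# Step 1: sort from smallest to largest
Q_sorted <- sort(Q)

# Step 2: Weibull plotting position p_i = i / (n + 1) for rank i
i <- seq_len(n)
p <- i / (n + 1)

# Step 3: data values on the x-axis, plotting positions on the y-axis
plot(Q_sorted, p, type = "b",
     xlab = "Q", ylab = "Plotting position (cumulative frequency)")

# The three tied values share one x but get three different positions
data.frame(rank = i, Q = Q_sorted, p = round(p, 3))
```

Note that R's built-in ecdf() steps by 1/n at each observation rather than using the Weibull position, so its curve reaches 1 at the largest value.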

3.3   Plotting positions

See other plotting positions in Table 2.2.

Why do we prefer the Weibull plotting position?

  • Most important, $p_n<1$, recognizing the existence of a nonzero probability of exceeding the maximum observation
  • Has long been used in the United States
  • Used in Bulletin 17C for determining flood frequencies in the United States
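The first point can be checked numerically. In the sketch below, the position $i/n$ is shown purely for contrast and is not claimed to be one of the entries in Table 2.2; n = 10 is an arbitrary choice.

```r
n <- 10
i <- seq_len(n)

# Weibull plotting position
p_weibull <- i / (n + 1)

# Simple i/n position, shown only for contrast (reaches 1 at i = n)
p_simple <- i / n

max(p_weibull)      # n / (n + 1) = 10/11, strictly less than 1
1 - max(p_weibull)  # 1/11: probability left for exceeding the maximum
max(p_simple)       # 1: no probability left above the largest observation

# Base R's ppoints() with a = 0 gives the same Weibull positions
all.equal(p_weibull, ppoints(n, a = 0))
```

ppoints(n, a) computes $(i-a)/(n+1-2a)$, so other values of a give other members of the same family of plotting positions.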

4   Boxplots

Boxplots summarize

  • Center of the data (median, box centerline)
  • Variation or spread (interquartile range $\text{IQR}=Q_3-Q_1$, box height, 50% of the data)
  • Skewness (quartile skew, relative size of box halves)
  • Presence or absence of unusual values and their magnitude (outliers, usually more than $1.5\times\text{IQR}$ below $Q_1$ or above $Q_3$)

Remember, boxplot whiskers are always drawn to actual data points, not to the $1.5\times\text{IQR}$ limits themselves.
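The quantities a boxplot summarizes can be computed directly, as in this hedged R sketch. The data are hypothetical, with 80 included as a deliberately unusual value, and names such as lower_fence and whisker_high are assumptions for illustration.

```r
# Hypothetical sample; 80 is a deliberately unusual value
Q <- c(9, 11, 12, 14, 15, 18, 22, 25, 30, 80)

# Center and spread
med <- median(Q)
q1 <- unname(quantile(Q, 0.25))
q3 <- unname(quantile(Q, 0.75))
iqr <- q3 - q1

# Limits at 1.5 x IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values beyond the limits are flagged as outliers
outliers <- Q[Q < lower_fence | Q > upper_fence]

# Whiskers end at the most extreme data points inside the limits
whisker_low  <- min(Q[Q >= lower_fence])
whisker_high <- max(Q[Q <= upper_fence])

c(median = med, Q1 = q1, Q3 = q3, IQR = iqr)
outliers                      # 80
c(whisker_low, whisker_high)  # 9 and 30, both actual observations

# R's own summary used by boxplot()
boxplot.stats(Q)
boxplot(Q, horizontal = TRUE)
```

boxplot() uses Tukey's hinges rather than quantile(), so its box edges can differ slightly from q1 and q3 above, but for these data the flagged outlier and whisker ends are the same.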

5   Probability (Q-Q) plots

  • Theoretical cdf vs. ecdf
  • Two ecdfs

Compare Figures 2.8 and 2.9. Which one makes it easier to compare the dataset with its theoretical cdf?

Plotting at $(Z_i, Q_i)$ where

  • $Z_i=F_N^{-1}(p_i)$ where $p_i=\frac{i}{n+1}$
  • $Q_i=\bar{Q}+Z_i\cdot s_Q$
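A minimal R sketch of a normal probability plot built from these two formulas. It assumes the sorted observations are plotted as points against $Z_i$ and that $\bar{Q}+Z_i\cdot s_Q$ is drawn as the theoretical normal line; the data and variable names are hypothetical.

```r
# Hypothetical sample (illustrative values only)
Q <- c(12, 15, 9, 22, 30, 18, 25, 11, 14, 40)
n <- length(Q)

# Weibull plotting positions p_i = i / (n + 1)
p <- seq_len(n) / (n + 1)

# Standard normal quantiles Z_i = F_N^{-1}(p_i)
Z <- qnorm(p)

# Quantiles of a normal distribution with the sample mean and sd
Q_theory <- mean(Q) + Z * sd(Q)

# Points: sorted observations; line: theoretical normal quantiles
plot(Z, sort(Q), xlab = "Normal quantile Z", ylab = "Q")
lines(Z, Q_theory)
```

Points that bend away from the line at the ends suggest skewness or tails heavier or lighter than normal. Base R's qqnorm() and qqline() draw a similar plot, although they use a different default plotting position and fit the line through the quartiles.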

See Figure 2.10 for its use as an exceedance probability plot.

See Figure 2.11 for skewness (symmetry vs. asymmetry) and kurtosis (heavy vs. light tail).

6   Scatterplots

Scatterplots illustrate the relation between two variables.

We often use smooth curves (smooths for short) to simplify these plots. See Figure 2.22.

  • LOcally WEighted Scatterplot Smoothing (LOWESS) (Cleveland and McGill, 1984a; Cleveland, 1985)
  • LOESS (Cleveland et al., 1992), loess in R

See LOESS (aka LOWESS) and Example of LOESS Computations
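Both smooths are available in base R as lowess() and loess(). The sketch below adds each to a scatterplot of hypothetical data; the x and y values are made up, and f = 2/3 and span = 0.75 are simply the functions' default smoothing parameters written out explicitly.

```r
# Hypothetical paired data (illustrative values only)
x <- 1:20
y <- c(2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.4, 7.8, 9.1, 9.6,
       10.2, 10.1, 10.9, 11.2, 11.0, 11.8, 11.6, 12.1, 12.0, 12.4)

# LOWESS: returns a list with sorted x and the smoothed y
lw <- lowess(x, y, f = 2/3)

# LOESS: a model object; predict() gives the fitted values
lo <- loess(y ~ x, span = 0.75)
lo_fit <- predict(lo)

# Scatterplot with both smooths overlaid
plot(x, y)
lines(lw, lty = 1)
lines(x, lo_fit, lty = 2)
legend("topleft", legend = c("lowess", "loess"), lty = c(1, 2))
```

Smaller values of f or span follow the data more closely, while larger values give a smoother curve. The two functions use different defaults, so their curves need not coincide.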

7   Homework: Univariate regression analysis

Download your data with two variables for a USGS daily station. Perform a linear regression analysis and LOWESS analysis in QMD. Zip your data and QMD file in one Zip file and upload it so I can try it myself.

8   Graphs for multivariate data

See “Graphs for multivariate data.qmd”.