Tags: ce 582, lectures, presentations

Graphical data analysis

CE 582

Dr. Huidae Cho

Department of Civil Engineering...New Mexico State University

Contents

1 Why do we use graphical data analysis?
2 Histograms
- 2.1 Histogram vs. probability density function
- 2.2 Bin size
3 Quantile plots
- 3.1 Advantages of quantile plots
- 3.2 Construction of quantile plots
- 3.3 Plotting positions
4 Boxplots
5 Probability (Q-Q) plots
6 Scatterplots
7 Homework: Univariate regression analysis
8 Graphs for multivariate data

1 Why do we use graphical data analysis?

Provides the analyst insight into the data under scrutiny ⇒ Exploratory data analysis
Illustrates important concepts when presenting the results to others ⇒ Visual and intuitive; there is no better way to convince your audience!

It is an inductive procedure: we analyze the data first in order to formulate a theory.

It helps avoid misinterpreting summary measures of the data.

2 Histograms

Histograms show

Central tendency
Variability
Symmetry

Suitable for discrete data.

2.1 Histogram vs. probability density function

Histograms tend to converge to the probability density function (pdf) of the population as the sample size becomes large enough.
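A minimal R sketch of this convergence, assuming a standard normal population and two hypothetical sample sizes (30 and 3000, neither taken from the lecture). With `freq = FALSE` the bars are scaled to densities, so the histogram can be overlaid directly on the pdf:

```r
set.seed(582)                      # arbitrary seed so the sketch is reproducible

small <- rnorm(30)                 # hypothetical small sample
large <- rnorm(3000)               # hypothetical large sample

op <- par(mfrow = c(1, 2))         # two panels side by side
for (s in list(small, large)) {
  # freq = FALSE scales the bars to densities (total bar area = 1)
  h <- hist(s, freq = FALSE, main = paste("n =", length(s)), xlab = "value")
  curve(dnorm(x), add = TRUE, lwd = 2)   # population pdf for comparison
}
par(op)

# Check on the last histogram: bar heights times bin widths sum to 1
area <- sum(h$density * diff(h$breaks))
```

The bars in the right-hand panel should follow the curve much more closely than those in the left-hand panel.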

2.2 Bin size

A sample of $n$ observations is divided into $k$ equal intervals (bins).

Iman and Conover (1983) suggest that $k$ should be the smallest integer such that $2^k\geq n$. Can you find this $k$ in R?

Bin size is the range divided by $k$.
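One possible way to apply this rule in R is sketched below. The vector `Q` is a hypothetical synthetic sample standing in for your own data, and the variable names are illustrative only:

```r
set.seed(582)
Q <- rlnorm(100, meanlog = 5, sdlog = 1)  # hypothetical flows; replace with your data

n <- length(Q)
k <- ceiling(log2(n))            # smallest integer k with 2^k >= n
bin_size <- diff(range(Q)) / k   # range divided by k

# k equal-width bins need k + 1 break points spanning the data range
breaks <- seq(min(Q), max(Q), length.out = k + 1)
h <- hist(Q, breaks = breaks, main = paste("k =", k))
```

For $n=100$ this gives $k=7$, because $2^6=64<100\leq 2^7=128$.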

Histograms are sensitive to the bin size. See Figures 2.2 and 2.3. Generally, calling hist(Q) without other arguments gives a reasonable result; by default, R chooses the bins using Sturges' rule.

3 Quantile plots

As histograms are an approximation of the pdf, quantile plots are an approximation of the cumulative distribution function (cdf).

They are also referred to as empirical cumulative distribution functions (ecdf).

3.1 Advantages of quantile plots

Three important advantages over histograms and boxplots:

Arbitrary categories are not required (no bins as in histograms)
All of the data are displayed (no ranges as in boxplots)
Every point has a distinct position (no overlap)

3.2 Construction of quantile plots

The dataset is sorted from smallest to largest.

Data values go on the x-axis and plotting positions on the y-axis. For the y-axis, we usually use the Weibull plotting position: \begin{equation} p_i=\frac{i}{n+1} \end{equation} where $i$ is the rank of the sorted observation and $n$ is the sample size.

Tied values (same $x$'s) will have different positions (different $y$'s) ⇒ Vertical jump!
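The construction can be sketched in a few lines of R. The data vector below is hypothetical and deliberately contains a three-way tie so the vertical jump is visible:

```r
Q <- c(12, 7, 7, 15, 9, 21, 7, 30)  # hypothetical data with a three-way tie at 7

x <- sort(Q)                  # smallest to largest
n <- length(x)
i <- seq_len(n)               # ranks 1..n
p <- i / (n + 1)              # Weibull plotting positions

plot(x, p, type = "b", xlab = "Q", ylab = "Cumulative frequency",
     ylim = c(0, 1))
```

The three observations equal to 7 share the same $x$ but receive positions $1/9$, $2/9$, and $3/9$. The base R call `ppoints(n, a = 0)` returns the same positions.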

3.3 Plotting positions

See other plotting positions in Table 2.2.

Why do we prefer the Weibull plotting position?

Most important, $p_n<1$, recognizing the existence of a nonzero probability of exceeding the maximum observation
Has long been used in the United States
Used in Bulletin 17C for determining flood frequencies in the United States
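To see the first reason numerically, the sketch below contrasts the Weibull position with the naive position $i/n$. The naive formula is used here only as a contrast; it is an assumption of this example and not necessarily one of the entries in Table 2.2. The sample size is hypothetical.

```r
n <- 25                        # hypothetical sample size
i <- seq_len(n)

p_weibull <- i / (n + 1)       # Weibull: largest value gets n / (n + 1)
p_naive   <- i / n             # naive: largest value gets exactly 1

# Exceedance probability assigned to the maximum observation
exceed_weibull <- 1 - p_weibull[n]   # 1 / (n + 1), nonzero
exceed_naive   <- 1 - p_naive[n]     # 0, i.e. "can never be exceeded"
```

With the naive position, the largest observation on record would be treated as an upper bound of the population, which is rarely defensible for flood data.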

4 Boxplots

Boxplots summarize

Center of the data (median, box centerline)
Variation or spread (interquartile range $\text{IQR}=Q_3-Q_1$, box height, 50% of the data)
Skewness (quartile skew, relative size of box halves)
Presence or absence of unusual values and their magnitude (outliers, usually more than $1.5\times\text{IQR}$ below $Q_1$ or above $Q_3$)

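Remember, the parts of a boxplot are always drawn at data points: whiskers end at actual observations, not at the computed $1.5\times\text{IQR}$ limits.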
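A small R sketch of these quantities, using a hypothetical ten-value dataset with one unusual value. `boxplot.stats()` returns the statistics that `boxplot()` draws. Note that R uses hinges for the box edges, which can differ slightly from the quartiles returned by `quantile()`:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # hypothetical data with one unusual value

b <- boxplot.stats(x, coef = 1.5)        # the statistics boxplot() draws
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points plotted individually as outliers

# Upper limit for outliers: upper hinge + 1.5 * (box height)
upper_fence <- b$stats[4] + 1.5 * diff(b$stats[c(2, 4)])

boxplot(x, horizontal = TRUE)
```

Here the upper limit is 15.5, but the upper whisker stops at 9, the largest observation inside the limit, and 30 is plotted on its own.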

5 Probability (Q-Q) plots

Theoretical cdf vs. ecdf
Two ecdfs

Compare Figures 2.8 and 2.9. In which one is it easier to compare the dataset with its theoretical cdf?

Plotting at $(Z_i, Q_i)$ where

$Z_i=F_N^{-1}(p_i)$ where $p_i=\frac{i}{n+1}$
$Q_i=\bar{Q}+Z_i\cdot s_Q$
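A hedged R sketch of a normal probability plot follows. It assumes one reading of the notation above: the sorted observations are plotted against $Z_i$, and $Q_i$ traces the straight line that a normal distribution with the same mean and standard deviation would follow. The lognormal sample is hypothetical.

```r
set.seed(582)
Q <- rlnorm(50, meanlog = 5, sdlog = 0.8)  # hypothetical right-skewed sample

n <- length(Q)
p <- seq_len(n) / (n + 1)        # Weibull plotting positions p_i
Z <- qnorm(p)                    # Z_i = F_N^{-1}(p_i), standard normal quantiles
Q_normal <- mean(Q) + Z * sd(Q)  # Q_i: normal with the sample mean and sd

plot(Z, sort(Q), xlab = "Normal quantiles", ylab = "Q")  # observed, sorted
lines(Z, Q_normal)               # reference line the points follow if normal
```

Base R also offers `qqnorm()` and `qqline()`, but they use different default plotting positions and a reference line through the quartiles, so the picture will not match this sketch exactly.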

See Figure 2.10 for its use as an exceedance probability plot.

See Figure 2.11 for skewness (symmetry vs. asymmetry) and kurtosis (heavy vs. light tail).

6 Scatterplots

Scatterplots illustrate the relation between two variables.

We often use smooth curves (smooths for short) to simplify these plots. See Figure 2.22.

LOcally WEighted Scatterplot Smoothing (LOWESS) (Cleveland and McGill, 1984a; Cleveland, 1985)
LOESS (Cleveland et al., 1992), loess in R

See LOESS (aka LOWESS) and Example of LOESS Computations
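Both smooths are available in base R. The sketch below uses hypothetical synthetic data, and the values `f = 2/3` and `span = 0.75` are simply the function defaults written out, not recommendations from the lecture:

```r
set.seed(582)
x <- runif(80, 0, 10)                          # hypothetical explanatory variable
y <- sin(x) + 0.1 * x^2 + rnorm(80, sd = 0.5)  # hypothetical nonlinear response

plot(x, y)

# LOWESS: f is the fraction of points used in each local fit
sm_lowess <- lowess(x, y, f = 2/3)
lines(sm_lowess, lwd = 2)

# LOESS: span plays the same role; predict on a sorted grid for plotting
fit <- loess(y ~ x, span = 0.75)
grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
lines(grid$x, predict(fit, newdata = grid), lty = 2, lwd = 2)
```

A smaller `f` or `span` follows the data more closely, while a larger one gives a smoother curve.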

7 Homework: Univariate regression analysis

Download data with two variables for a USGS daily station. Perform a linear regression analysis and a LOWESS analysis in a QMD file. Put your data and QMD file in a single Zip file and upload it so I can run it myself.

8 Graphs for multivariate data

See “Graphcs for multivariate data.qmd.”
