# Histogram and boxplot

Institute for Environmental and Spatial Analysis...University of North Georgia

Contents

## 1 What is a histogram?

Have you ever heard of a histogram or seen something like this?

Statistically speaking, a histogram shows the distribution of data. There are data ranges or bins on the x-axis and count or frequency on the y-axis.

## 2 Bins

Bins in the histogram have an equal width or interval of data. They can be either numeric intervals or categorical classes.

## 3 Frequency

Frequency represents in each bin how often events occur or how many data points are found.

## 4 Drawing a histogram manually

- Sort data in an ascending order.
- Determine your bin width.
- Count how many data points fall in each bin.
- Draw a bar at each bin with the bin width as its width and its count as the height.

## 5 Exercise: Manual histogram

Original data

NAME10 | PopDens |
---|---|

Lanier | 50 |

Bryan | 66 |

Appling | 36 |

Rabun | 43 |

Bleckley | 60 |

Fayette | 535 |

Jefferson | 32 |

Seminole | 34 |

Camden | 65 |

Glynn | 136 |

Polk | 133 |

Morgan | 50 |

Talbot | 17 |

Schley | 30 |

Union | 65 |

Sorted by PopDens

NAME10 | PopDens |
---|---|

Talbot | 17 |

Schley | 30 |

Jefferson | 32 |

Seminole | 34 |

Appling | 36 |

Rabun | 43 |

Morgan | 50 |

Lanier | 50 |

Bleckley | 60 |

Camden | 65 |

Union | 65 |

Bryan | 66 |

Polk | 133 |

Glynn | 136 |

Fayette | 535 |

Counts for a bin width of 50

Class | PopDens | Count |
---|---|---|

Low | [0, 50] | 8 |

Medium | (50, 100] | 4 |

High | (100, 150] | 2 |

Too High | (150, 550] | 1 |

## 6 Exercise: Histograms in Excel

There are three methods for creating histograms in Excel:

- Insert a histogram
`=FREQUENCY()`

- Data analysis add-in

### 6.1 Clean up data

- Download Counties_Georgia.xlsx.
- Open it in Excel and enable editing.
- Delete unnecessary columns.
- Select column headers A–D and right click ⇒ Delete.
- Repeat until only NAME10, totpop10, and Sq_Miles columns are left.

- Select all three columns and adjust their widths by double-clicking the border between any two column headers (red line).
- Decrease decimal in the Sq_Miles column and adjust its width.

### 6.2 Calculate the population density

### 6.3 “Insert a histogram” method

### 6.4 “`=FREQUENCY()`

” method

- For this method, you need to know the minimum and maximum population densities first.

- Enter “Min PopDens10:” and “Max PopDens10:” in cells F1 and F2, respectively.
- Enter
`=MIN(D2:D160)`

and`=MAX(D2:D160)`

in cells G1 and G2, respectively.

- Enter bins with an interval of 200 starting from 0 up to 2600 in cells G5–G18.
- Enter labels.
- Select cells H5–H19. Yes, select one extra cell past the last bin.
- Enter
`=FREQUENCY(D2:D160, G5:G18)`

and hit Ctrl+Shift+Enter. - Select cells F6–F18 and H6–H18. Use Ctrl to select disconnected cell groups.
- Insert ⇒ Charts ⇒ 2-D Column.

### 6.5 “Data analysis add-in” method

## 7 Drawing a boxplot manually

- Sort data in an ascending order.
- Add two new columns: Count and Percentile
- For each row, count the number of data points above and in the current row.
- Enter the count in the Count column.
- Divide the count by the total number of rows and multiply it by 100.
- Enter the percentile in the Percentile column.
- Find the middle row if there are an odd number of records. Otherwise, calculate the average of two middle rows. This is the median.
- Repeat the median calculation for the upper and lower halves. They are the 1st (Q1) and 3rd (Q3) quartiles, respectively.
- Calculate the inter-quartile range by subtracting Q3 from Q1. $\text{IQR}=\text{Q3}-\text{Q1}$
- Identify outliers falling outside 1.5IQR from Q1 and Q3.
- Draw a box using Q1 and Q3.
- Draw whiskers using the lower and upper extremes.
- Draw outliers if any.

There are different methods for calculating quartiles, but we will use this simple method.

## 8 Exercise: Manual boxplot

Original data

NAME10 | PopDens |
---|---|

Lanier | 50 |

Bryan | 66 |

Appling | 36 |

Rabun | 43 |

Bleckley | 60 |

Fayette | 535 |

Jefferson | 32 |

Seminole | 34 |

Camden | 65 |

Glynn | 136 |

Polk | 133 |

Morgan | 50 |

Talbot | 17 |

Schley | 30 |

Union | 65 |

Sorted by PopDens and calculated percentiles

NAME10 | PopDens | Count | Percentile | Note |
---|---|---|---|---|

Talbot | 17 | 1 | 6.7 | Lower extreme |

Schley | 30 | 2 | 13.3 | |

Jefferson | 32 | 3 | 20.0 | |

Seminole | 34 | 4 | 26.7 | Q1 |

Appling | 36 | 5 | 33.3 | |

Rabun | 43 | 6 | 40.0 | |

Morgan | 50 | 8 | 53.3 | |

Lanier | 50 | 8 | 53.3 | Median |

Bleckley | 60 | 9 | 60.0 | |

Camden | 65 | 11 | 73.3 | |

Union | 65 | 11 | 73.3 | |

Bryan | 66 | 12 | 80.0 | Q3, Upper extreme |

Polk | 133 | 13 | 86.7 | Upper outlier |

Glynn | 136 | 14 | 93.3 | Upper outlier |

Fayette | 535 | 15 | 100.0 | Upper outlier |

- $\text{IQR} = 66 - 34 = 32$
- Lower outliers: Below $34 - 1.5\times 32 = -14$. There are no lower outliers.
- Upper outliers: Above $66 + 1.5\times 32 = 114$. Polk, Glynn, and Fayette are upper outliers.

## 9 Reading materials

## 10 Homework: Histogram and boxplot

- Create a histogram of Georgia population by county with five labels: Very low, low, medium, high, very high.
- The minimum and maximum are 8.25 and 2552.29, respectively.
- Divide 2600 by 5. That is your bin width.
- Create a new histogram.

- Create two boxplots for 2010 and 2020 Georgia population by county.
- Find the most recent Georgia population data by county. Just google “Georgia population by county” and use the first link.
- Copy and paste this data into a new Excel worksheet.
- Sort by county.
- Copy the 2020 population column into the 2010 population worksheet.
- Calculate the 2020 population density.
- Use the Excel boxplot tool to create two boxplots side by side.

Please submit two separate Excel files:

- Histogram.xlsx
- Boxplots.xlsx