Histogram and boxplot
Institute for Environmental and Spatial Analysis...University of North Georgia
Contents
1 What is a histogram?
Have you ever heard of a histogram or seen something like this?
Statistically speaking, a histogram shows the distribution of data. There are data ranges or bins on the x-axis and count or frequency on the y-axis.
2 Bins
Bins in the histogram have an equal width or interval of data. They can be either numeric intervals or categorical classes.
3 Frequency
Frequency represents in each bin how often events occur or how many data points are found.
4 Drawing a histogram manually
- Sort data in an ascending order.
- Determine your bin width.
- Count how many data points fall in each bin.
- Draw a bar at each bin with the bin width as its width and its count as the height.
5 Exercise: Manual histogram
Original data
NAME10 | PopDens |
---|---|
Lanier | 50 |
Bryan | 66 |
Appling | 36 |
Rabun | 43 |
Bleckley | 60 |
Fayette | 535 |
Jefferson | 32 |
Seminole | 34 |
Camden | 65 |
Glynn | 136 |
Polk | 133 |
Morgan | 50 |
Talbot | 17 |
Schley | 30 |
Union | 65 |
Sorted by PopDens
NAME10 | PopDens |
---|---|
Talbot | 17 |
Schley | 30 |
Jefferson | 32 |
Seminole | 34 |
Appling | 36 |
Rabun | 43 |
Morgan | 50 |
Lanier | 50 |
Bleckley | 60 |
Camden | 65 |
Union | 65 |
Bryan | 66 |
Polk | 133 |
Glynn | 136 |
Fayette | 535 |
Counts for a bin width of 50
Class | PopDens | Count |
---|---|---|
Low | [0, 50] | 8 |
Medium | (50, 100] | 4 |
High | (100, 150] | 2 |
Too High | (150, 550] | 1 |
6 Exercise: Histograms in Excel
There are three methods for creating histograms in Excel:
- Insert a histogram
=FREQUENCY()
- Data analysis add-in
6.1 Clean up data
- Download Counties_Georgia.xlsx.
- Open it in Excel and enable editing.
- Delete unnecessary columns.
- Select column headers A–D and right click ⇒ Delete.
- Repeat until only NAME10, totpop10, and Sq_Miles columns are left.
- Select all three columns and adjust their widths by double-clicking the border between any two column headers (red line).
- Decrease decimal in the Sq_Miles column and adjust its width.
6.2 Calculate the population density
6.3 “Insert a histogram” method
6.4 “=FREQUENCY()
” method
- For this method, you need to know the minimum and maximum population densities first.
- Enter “Min PopDens10:” and “Max PopDens10:” in cells F1 and F2, respectively.
- Enter
=MIN(D2:D160)
and=MAX(D2:D160)
in cells G1 and G2, respectively.
- Enter bins with an interval of 200 starting from 0 up to 2600 in cells G5–G18.
- Enter labels.
- Select cells H5–H19. Yes, select one extra cell past the last bin.
- Enter
=FREQUENCY(D2:D160, G5:G18)
and hit Ctrl+Shift+Enter. - Select cells F6–F18 and H6–H18. Use Ctrl to select disconnected cell groups.
- Insert ⇒ Charts ⇒ 2-D Column.
6.5 “Data analysis add-in” method
7 Drawing a boxplot manually
- Sort data in an ascending order.
- Add two new columns: Count and Percentile
- For each row, count the number of data points above and in the current row.
- Enter the count in the Count column.
- Divide the count by the total number of rows and multiply it by 100.
- Enter the percentile in the Percentile column.
- Find the middle row if there are an odd number of records. Otherwise, calculate the average of two middle rows. This is the median.
- Repeat the median calculation for the upper and lower halves. They are the 1st (Q1) and 3rd (Q3) quartiles, respectively.
- Calculate the inter-quartile range by subtracting Q3 from Q1. $\text{IQR}=\text{Q3}-\text{Q1}$
- Identify outliers falling outside 1.5IQR from Q1 and Q3.
- Draw a box using Q1 and Q3.
- Draw whiskers using the lower and upper extremes.
- Draw outliers if any.
There are different methods for calculating quartiles, but we will use this simple method.
8 Exercise: Manual boxplot
Original data
NAME10 | PopDens |
---|---|
Lanier | 50 |
Bryan | 66 |
Appling | 36 |
Rabun | 43 |
Bleckley | 60 |
Fayette | 535 |
Jefferson | 32 |
Seminole | 34 |
Camden | 65 |
Glynn | 136 |
Polk | 133 |
Morgan | 50 |
Talbot | 17 |
Schley | 30 |
Union | 65 |
Sorted by PopDens and calculated percentiles
NAME10 | PopDens | Count | Percentile | Note |
---|---|---|---|---|
Talbot | 17 | 1 | 6.7 | Lower extreme |
Schley | 30 | 2 | 13.3 | |
Jefferson | 32 | 3 | 20.0 | |
Seminole | 34 | 4 | 26.7 | Q1 |
Appling | 36 | 5 | 33.3 | |
Rabun | 43 | 6 | 40.0 | |
Morgan | 50 | 8 | 53.3 | |
Lanier | 50 | 8 | 53.3 | Median |
Bleckley | 60 | 9 | 60.0 | |
Camden | 65 | 11 | 73.3 | |
Union | 65 | 11 | 73.3 | |
Bryan | 66 | 12 | 80.0 | Q3, Upper extreme |
Polk | 133 | 13 | 86.7 | Upper outlier |
Glynn | 136 | 14 | 93.3 | Upper outlier |
Fayette | 535 | 15 | 100.0 | Upper outlier |
- $\text{IQR} = 66 - 34 = 32$
- Lower outliers: Below $34 - 1.5\times 32 = -14$. There are no lower outliers.
- Upper outliers: Above $66 + 1.5\times 32 = 114$. Polk, Glynn, and Fayette are upper outliers.
9 Reading materials
10 Homework: Histogram and boxplot
- Create a histogram of Georgia population by county with five labels: Very low, low, medium, high, very high.
- The minimum and maximum are 8.25 and 2552.29, respectively.
- Divide 2600 by 5. That is your bin width.
- Create a new histogram.
- Create two boxplots for 2010 and 2020 Georgia population by county.
- Find the most recent Georgia population data by county. Just google “Georgia population by county” and use the first link.
- Copy and paste this data into a new Excel worksheet.
- Sort by county.
- Copy the 2020 population column into the 2010 population worksheet.
- Calculate the 2020 population density.
- Use the Excel boxplot tool to create two boxplots side by side.
Please submit two separate Excel files:
- Histogram.xlsx
- Boxplots.xlsx