library(tidyverse)
redlining_data <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/redlining/metro-grades.csv")
The data that will be used here is one of FiveThirtyEight’s data sources for their “The Lasting Legacy of Redlining” article.1 This data set has the 2020 total population estimates by race/ethnicity (based on the 2020 census) for the combined zones of each redlining grade (which can range from A (“Best”) to D (“Hazardous”)2) from the Home Owners’ Loan Corporation’s (HOLC) maps drawn in 1935-40. The maps were provided by the Mapping Inequality project.
The data set also includes population estimates for the area surrounding each metropolitan area’s HOLC map, along with location quotients (LQs)3 for each racial/ethnic group and HOLC grade. LQs are read by how close they are to 1, a value that indicates a group is proportionally represented. You can find more information on the data set, variables, and calculations in the GitHub link listed above.
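To make the idea of a location quotient concrete, here is a minimal sketch of the general definition: a group's share of a small area divided by its share of a larger reference area. The numbers are made up for illustration, and the `ref_white_pop` and `ref_total_pop` columns are hypothetical names; which reference area FiveThirtyEight used for the `lq_*` columns should be checked against their documentation.

```r
library(tibble)
library(dplyr)

# Toy numbers, invented for illustration (not rows from the data set).
# ref_white_pop and ref_total_pop are hypothetical column names for the
# larger reference area that the small area is compared against.
toy <- tribble(
  ~holc_grade, ~white_pop, ~total_pop, ~ref_white_pop, ~ref_total_pop,
  "A",         800,        1000,       6000,           10000,
  "D",         300,        1000,       6000,           10000
)

toy_lq <- toy |>
  mutate(
    share_in_zone = white_pop / total_pop,         # group's share of the small area
    share_in_ref  = ref_white_pop / ref_total_pop, # group's share of the reference area
    lq_white      = share_in_zone / share_in_ref   # above 1: overrepresented; below 1: underrepresented
  )

print(toy_lq)
```

In this toy example the grade A row has an LQ above 1 and the grade D row has an LQ below 1, because the group's share is larger than the reference share in the first case and smaller in the second.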
One unit of observation corresponds to a metropolitan area’s combined total population estimates, surrounding area population estimates, percentages of both, and LQs for all of the zones in that area that correspond to the same HOLC grade.
For example, the first observation describes the aforementioned data points for all locations within Akron, OH that have an “A” HOLC grade. Reading the columns from left to right, the first row tells us that:
It is important to note that the population estimates are rounded to integers and the percentages are rounded to two decimal places.
Our goal is to understand the correlation between the grade assigned by the Home Owners’ Loan Corporation (HOLC grade) and population, particularly the white population, which is the most populous and dominant population in the history of the United States. Because of the long history of racial discrimination in the United States, we want to explore whether the idea of white privilege has an effect on HOLC grades. We will explore this correlation through four graphs using variables such as white population, percentage of white population, and their corresponding HOLC grades.
Figure 1 is a bar graph that takes in one categorical explanatory variable (HOLC grade) and outputs a numeric variable (White population).
population_by_holc_grade <- redlining_data |>
  group_by(holc_grade) |>
  summarize(
    White = sum(white_pop),
    Black = sum(black_pop),
    Hispanic = sum(hisp_pop),
    Asian = sum(asian_pop),
    Other = sum(other_pop)
  )
ggplot(data = population_by_holc_grade, aes(x = holc_grade, y = White)) +
  geom_col() +
  labs(
    x = "HOLC grade",
    y = "White population",
    title = "The total number of white population in areas with different HOLC grades"
  ) +
  scale_y_continuous(labels = scales::label_comma())
Figure 2 extends Figure 1 with a second categorical explanatory variable (race, in addition to HOLC grade). Instead of showing only the White population, this bar graph shows the population of every racial/ethnic group in areas with different HOLC grades. Figure 2 shows that white residents are the largest group in areas with higher HOLC grades (A and B).
population_by_holc_grade_clean <- population_by_holc_grade |>
  pivot_longer(
    cols = c(White:Other),
    names_to = "race",
    values_to = "count"
  )
ggplot(data = population_by_holc_grade_clean, aes(x = holc_grade, y = count, fill = race)) +
  geom_col(position = "dodge") +
  labs(
    x = "Grade assigned by the Home Owners' Loan Corporation (HOLC grade)",
    y = "Total population",
    fill = "Race",
    title = "The total population of all race in areas with different HOLC grades"
  ) +
  scale_y_continuous(labels = scales::label_comma())
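The reshaping step used for Figure 2 may be easier to follow on a tiny table. The sketch below uses invented counts (not values from the data set) to show how `pivot_longer()` turns one column per race into one row per grade and race, which is the layout `fill = race` needs.

```r
library(tibble)
library(tidyr)

# Invented counts for illustration only
wide <- tribble(
  ~holc_grade, ~White, ~Black,
  "A",         10,     2,
  "B",         20,     5
)

# One row per (holc_grade, race) pair instead of one column per race
long <- wide |>
  pivot_longer(
    cols = White:Black,
    names_to = "race",
    values_to = "count"
  )

print(long)
```

Two grades and two race columns give four rows in the long table, with the former column names stored in `race` and the cell values stored in `count`.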
However, we cannot make conclusions based solely on the size of the white population, which is the largest population in the U.S. We also need to consider the percentage of the white population in each HOLC zone. Figure 3 is a histogram that takes in one numeric explanatory variable (Percentage of white population) and outputs the number of HOLC zones. It counts the number of HOLC zones with different percentages of white population.
ggplot(data = redlining_data, aes(x = pct_white)) +
geom_histogram(binwidth = 12.5, boundary = 50, color = "white") +
labs(
y = "Number of HOLC zones",
x = "Percentage of white population"
)
Finally, we extend Figure 3 by adding the categorical explanatory variable (HOLC grade). Figure 4 counts the number of HOLC zones at each percentage of white population, broken down by HOLC grade (A-D).
ggplot(data = redlining_data, aes(x = pct_white, fill = holc_grade)) +
geom_histogram(binwidth = 12.5, boundary = 50, color = "white") +
labs(
y = "Number of HOLC zones",
x = "Percentage of white population",
fill = "HOLC grade"
)
We can see from Figure 4 that HOLC zones with a higher percentage of white population tend to have higher HOLC grades. In other words, the figure suggests a correlation between the percentage of white population and HOLC grades.
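One way to put a number on this relationship is a rank correlation that treats the HOLC grade as ordered from A to D. The sketch below is an illustration rather than part of the original analysis: it uses only the `pct_white` values of the Akron and Albany rows that appear in the data printout in this report, and the choice of Spearman's coefficient is our assumption.

```r
# pct_white for the first eight rows shown in this report's data printout
# (grades A-D for two metropolitan areas); far too few rows for a real
# conclusion, so this only illustrates the calculation.
zones <- data.frame(
  holc_grade = rep(c("A", "B", "C", "D"), times = 2),
  pct_white  = c(66.8, 61.2, 64.9, 40.8, 72.9, 58.9, 56.0, 33.9)
)

# Encode the grade as a rank: A = 1 (best) ... D = 4 (worst)
zones$grade_rank <- match(zones$holc_grade, c("A", "B", "C", "D"))

# Spearman's rho only uses the ordering of the grades, not their spacing
rho <- cor(zones$grade_rank, zones$pct_white, method = "spearman")
print(rho)
```

A negative value means the percentage of white population falls as the grade moves from A toward D. Running the same calculation on the full `redlining_data` would give a more meaningful estimate.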
library(knitr)
Table 1 lists all the variable names in the data set, divided by the kind of data they hold: categorical or numeric. Based on a glimpse of the data, the data set contains both categorical and numeric variables, and the number of numeric variables far exceeds the number of categorical variables. Of the two categorical variables, metro_area is nominal and holc_grade is ordinal.
table_variables <- tibble(
  variables = c("metro_area", "holc_grade", "white_pop", "black_pop", "hisp_pop", "asian_pop", "other_pop", "total_pop", "pct_white", "pct_black", "pct_hisp", "pct_asian", "pct_other", "lq_white", "lq_black", "lq_asian", "lq_other", "surr_area_white_pop", "surr_area_black_pop", "surr_area_hisp_pop", "surr_area_asian_pop", "surr_area_other_pop", "surr_area_pct_white", "surr_area_pct_black", "surr_area_pct_hisp", "surr_area_pct_asian", "surr_area_pct_other"),
type = c("categorical", "categorical", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric", "numeric")
)
kable(table_variables)
variables | type |
---|---|
metro_area | categorical |
holc_grade | categorical |
white_pop | numeric |
black_pop | numeric |
hisp_pop | numeric |
asian_pop | numeric |
other_pop | numeric |
total_pop | numeric |
pct_white | numeric |
pct_black | numeric |
pct_hisp | numeric |
pct_asian | numeric |
pct_other | numeric |
lq_white | numeric |
lq_black | numeric |
lq_asian | numeric |
lq_other | numeric |
surr_area_white_pop | numeric |
surr_area_black_pop | numeric |
surr_area_hisp_pop | numeric |
surr_area_asian_pop | numeric |
surr_area_other_pop | numeric |
surr_area_pct_white | numeric |
surr_area_pct_black | numeric |
surr_area_pct_hisp | numeric |
surr_area_pct_asian | numeric |
surr_area_pct_other | numeric |
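A table like Table 1 can also be derived from the data instead of typed by hand. The following sketch is an alternative to the approach above, not the method used in this report. It builds the variable/type table from a small stand-in tibble, assuming that every non-numeric column should be labeled categorical; the two rows are the Akron grade A and B values shown in this report's data printout.

```r
library(tibble)

# Small stand-in for redlining_data (two rows, four of the columns)
toy <- tibble(
  metro_area = c("Akron, OH", "Akron, OH"),
  holc_grade = c("A", "B"),
  white_pop  = c(24702, 41531),
  pct_white  = c(66.8, 61.2)
)

# Label each column "numeric" or "categorical" based on its storage type
table_variables_auto <- tibble(
  variables = names(toy),
  type = unname(ifelse(vapply(toy, is.numeric, logical(1)), "numeric", "categorical"))
)

print(table_variables_auto)
```

Replacing `toy` with `redlining_data` would classify every column in the real data set the same way.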
After getting a general sense of this data, we created Table 2, which is a table of summary statistics for four variables: “holc_grade”, “white_pop”, “total_pop”, and “pct_white”. The categorical variable “holc_grade” was used to group the numeric variables “white_pop”, “total_pop”, and “pct_white”. Based on these statistics, areas with a higher HOLC grade have a higher percentage of white residents: the mean of “pct_white” falls from 73.78 for grade A to 39.39 for grade D. In general, white individuals also make up the largest share of the total population in areas with higher HOLC grades, especially areas with HOLC grades A and B.
redlining_data |>
  group_by(holc_grade) |>
  skimr::skim("white_pop", "total_pop", "pct_white") |>
  rename(`Missing Obs.` = n_missing)
Name | group_by(redlining_data, … |
Number of rows | 551 |
Number of columns | 28 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | holc_grade |
Variable type: numeric
skim_variable | holc_grade | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist | Missing Obs. |
---|---|---|---|---|---|---|---|---|---|---|---|
white_pop | A | 1 | 11641.85 | 21984.19 | 188.00 | 1679.50 | 4566.00 | 14625.25 | 205702.00 | ▇▁▁▁▁ | 0 |
white_pop | B | 1 | 30901.47 | 74403.34 | 787.00 | 4855.00 | 10171.00 | 24992.25 | 752223.00 | ▇▁▁▁▁ | 0 |
white_pop | C | 1 | 49202.88 | 126446.11 | 158.00 | 7395.00 | 15605.00 | 39031.00 | 1164087.00 | ▇▁▁▁▁ | 0 |
white_pop | D | 1 | 23757.23 | 84629.09 | 183.00 | 2797.00 | 6204.50 | 15791.25 | 914030.00 | ▇▁▁▁▁ | 0 |
total_pop | A | 1 | 17312.38 | 34613.85 | 228.00 | 2913.75 | 6048.00 | 19347.75 | 313867.00 | ▇▁▁▁▁ | 0 |
total_pop | B | 1 | 63200.23 | 183446.18 | 1848.00 | 8010.00 | 19707.50 | 42165.00 | 1903251.00 | ▇▁▁▁▁ | 0 |
total_pop | C | 1 | 136444.43 | 470333.41 | 351.00 | 14620.00 | 32594.00 | 92564.00 | 4558038.00 | ▇▁▁▁▁ | 0 |
total_pop | D | 1 | 78458.62 | 304296.54 | 842.00 | 9139.00 | 15899.50 | 41134.00 | 3217775.00 | ▇▁▁▁▁ | 0 |
pct_white | A | 1 | 73.78 | 15.21 | 11.29 | 68.53 | 77.54 | 83.04 | 94.12 | ▁▁▁▆▇ | 0 |
pct_white | B | 1 | 59.94 | 17.89 | 6.63 | 50.03 | 62.86 | 72.12 | 90.97 | ▁▂▃▇▅ | 0 |
pct_white | C | 1 | 48.65 | 19.48 | 6.99 | 33.69 | 48.43 | 63.34 | 87.65 | ▃▇▇▇▅ | 0 |
pct_white | D | 1 | 39.39 | 20.88 | 3.77 | 22.56 | 39.85 | 53.08 | 86.17 | ▆▆▇▃▃ | 0 |
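If skimr is not available, a comparable grouped summary can be sketched with dplyr alone. This is an alternative we are suggesting, not the code that produced Table 2, and it runs on only the first eight rows shown in this report's data printout, so its numbers will not match the table.

```r
library(dplyr)
library(tibble)

# First eight rows of the data printout in this report (Akron and Albany)
first_rows <- tribble(
  ~holc_grade, ~white_pop, ~pct_white,
  "A",         24702,      66.8,
  "B",         41531,      61.2,
  "C",         73105,      64.9,
  "D",         6179,       40.8,
  "A",         16989,      72.9,
  "B",         26644,      58.9,
  "C",         56878,      56.0,
  "D",         16806,      33.9
)

# One row of summary statistics per HOLC grade
grade_summary <- first_rows |>
  group_by(holc_grade) |>
  summarize(
    n_rows           = n(),
    mean_pct_white   = mean(pct_white),
    median_white_pop = median(white_pop),
    .groups = "drop"
  )

print(grade_summary)
```

Swapping `first_rows` for `redlining_data` would produce per-grade means comparable to the `mean` column of Table 2.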
In this section of the project, the data has been manipulated in order to show the specific variables we are working with in our data set. Later, each of these variables will be described further along with what we have used these variables for.
redlining_data |> select(metro_area, holc_grade, white_pop, pct_white, pct_black, pct_hisp, pct_asian, pct_other)
# A tibble: 551 × 8
metro_area holc_grade white_pop pct_white pct_black pct_hisp pct_asian
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Akron, OH A 24702 66.8 23.3 2.59 1.86
2 Akron, OH B 41531 61.2 24.3 3.26 4.96
3 Akron, OH C 73105 64.9 20.3 2.79 5.58
4 Akron, OH D 6179 40.8 45.7 3.75 3
5 Albany-Schenecta… A 16989 72.9 7.8 5.65 8.57
6 Albany-Schenecta… B 26644 58.9 15.7 9.58 5.55
7 Albany-Schenecta… C 56878 56.0 16.5 10.2 6.26
8 Albany-Schenecta… D 16806 33.9 39.4 13.5 4.42
9 Allentown-Bethle… A 1076 66.6 4.38 22.7 1.3
10 Allentown-Bethle… B 16774 58.2 6.81 27.6 2.54
# ℹ 541 more rows
# ℹ 1 more variable: pct_other <dbl>
Final list and description of Data:
metro_area (categorical variable): is the metropolitan area where the data was collected
holc_grade (categorical ordinal variable): is the grade given to the area
white_pop (numeric variable): is the white population in that area
pct_white (numeric variable): is the percentage of white people living in that area
pct_black (numeric variable): is the percentage of black people living in that area
pct_hisp (numeric variable): is the percentage of Hispanic people living in that area
pct_asian (numeric variable): is the percentage of Asian people living in that area
pct_other (numeric variable): is the percentage of people who do not have a specified race living in that area
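Because holc_grade is ordinal, it can help to store it as an ordered factor so that R knows A ranks above D. The sketch below shows the idea on a small vector; the data set itself stores the grade as plain text, so this conversion is an optional step we are suggesting, not something done in the analysis above.

```r
# Grades in arbitrary order, as they might appear in a column
grades <- c("D", "A", "C", "B")

# Declare the order explicitly: A is the best grade, D the worst
grades_ordered <- factor(grades, levels = c("A", "B", "C", "D"), ordered = TRUE)

# Sorting and comparisons now follow the A-to-D order
print(sort(grades_ordered))
print(grades_ordered < "C")
```

With the full data, the same conversion could be applied inside `mutate()` before plotting or computing rank-based statistics.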
This data set examines how redlining divided different groups of people into different areas.
Description:
Throughout the project, this data has helped us understand how redlining works. In particular, we examined the racial breakdown of each zone and how it relates to several variables in the data set. One critical factor we looked into was the grade given to each area. This grade informed many of our visualizations and gives critical insight into how redlining works.
This data is about redlining. Redlining is the practice of refusing a loan or insurance to someone because they live in an area deemed to be a poor financial risk. It is a form of racial discrimination and a big problem in America. To fix problems like this, it is important to have data that shows the numbers behind the problem. When analyzing data about racial discrimination, it is important not to skew the data in a way that makes it look like there is no racial discrimination at play (Lloyd 2016). As a group, we thought it was important to look into redlining because of the long history of racial discrimination in the United States, and we wanted to explore whether the idea of white privilege has an effect on HOLC grades.
Does this data offer a fair and just statistical picture of redlining?
Are there any gaps in the data that we think are important to make note of?
Are there biases that we are going in with that could change the way that we interact with the data?
Where might further research take us? Are there certain racial groups that we could focus in on instead of looking at a more general data picture?
Our original goal was to understand the correlation between the grade assigned by the Home Owners’ Loan Corporation (HOLC grade) and population, particularly the white population, which is the most populous and dominant population in the history of the United States. We sought to explore this correlation through four graphs using variables such as white population, percentage of white population, and their corresponding HOLC grades, and the final project meets that goal. We think this project fits best with Lab #10, where we talked about data science ethics, and Lab #11, which built upon Lab #10 by covering codes of ethics for data science. It is important that data analysis is correct, but if the analysis is discriminatory, it should be disregarded. Ethics is extremely important in data science. Beyond weighing the ethical questions discussed in Labs 10 and 11, it is also important to shed light on data that illuminates issues of discrimination, which we found the redlining data did.
1. https://projects.fivethirtyeight.com/redlining/
2. You can read more about the redlining zone classifications here.
3. More information on location quotients.