Food Composition

Author

REDACTED

Rotation 1: Selection of Data

Agricultural-based food dataset

Image of Tractor Work in Agricultural Field

This dataset is derived from the US Department of Agriculture’s Food Composition Database. It covers a wide range of foods common in the US, from vegetables to meats, and records the amounts of minerals and vitamins found in each. Our analysis focuses on nutrition, specifically comparing proteins, carbohydrates, fibers, and sugars.

Rotation 2: Basic Graphics

Code
library(tidyverse)
food_dataset_raw <- read.csv("https://corgis-edu.github.io/corgis/datasets/csv/food/food.csv")
library(janitor)
food_dataset_clean <- clean_names(food_dataset_raw)
glimpse(food_dataset_clean)
Rows: 7,083
Columns: 38
$ category                       <chr> "Milk", "Milk", "Milk", "Milk", "Milk",…
$ description                    <chr> "Milk, human", "Milk, NFS", "Milk, whol…
$ nutrient_data_bank_number      <int> 11000000, 11100000, 11111000, 11111100,…
$ data_alpha_carotene            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ data_beta_carotene             <int> 7, 4, 7, 7, 7, 1, 0, 3, 1, 3, 1, 2, 1, …
$ data_beta_cryptoxanthin        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ data_carbohydrate              <dbl> 6.89, 4.87, 4.67, 4.46, 4.67, 5.19, 4.8…
$ data_cholesterol               <int> 14, 8, 12, 14, 12, 5, 2, 8, 5, 8, 5, 3,…
$ data_choline                   <dbl> 16.0, 17.9, 17.8, 16.0, 17.8, 17.4, 16.…
$ data_fiber                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ data_lutein_and_zeaxanthin     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ data_lycopene                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ data_niacin                    <dbl> 0.177, 0.110, 0.105, 0.043, 0.105, 0.11…
$ data_protein                   <dbl> 1.03, 3.34, 3.28, 3.10, 3.28, 3.38, 3.4…
$ data_retinol                   <int> 60, 58, 31, 28, 31, 58, 137, 83, 58, 83…
$ data_riboflavin                <dbl> 0.036, 0.137, 0.138, 0.105, 0.138, 0.14…
$ data_selenium                  <dbl> 1.8, 1.9, 1.9, 2.0, 1.9, 2.1, 2.1, 1.8,…
$ data_sugar_total               <dbl> 6.89, 4.89, 4.81, 4.46, 4.81, 4.96, 4.8…
$ data_thiamin                   <dbl> 0.014, 0.057, 0.056, 0.020, 0.056, 0.05…
$ data_water                     <dbl> 87.50, 89.04, 88.10, 88.20, 88.10, 89.7…
$ data_fat_monosaturated_fat     <dbl> 1.658, 0.426, 0.688, 0.999, 0.688, 0.21…
$ data_fat_polysaturated_fat     <dbl> 0.497, 0.065, 0.108, 0.128, 0.108, 0.03…
$ data_fat_saturated_fat         <dbl> 2.009, 1.164, 1.860, 2.154, 1.860, 0.56…
$ data_fat_total_lipid           <dbl> 4.38, 1.99, 3.20, 3.46, 3.20, 0.95, 0.1…
$ data_major_minerals_calcium    <int> 32, 126, 123, 101, 123, 126, 204, 126, …
$ data_major_minerals_copper     <dbl> 0.052, 0.001, 0.001, 0.010, 0.001, 0.00…
$ data_major_minerals_iron       <dbl> 0.03, 0.00, 0.00, 0.05, 0.00, 0.00, 0.0…
$ data_major_minerals_magnesium  <int> 3, 12, 12, 5, 12, 12, 11, 12, 12, 12, 1…
$ data_major_minerals_phosphorus <int> 14, 103, 101, 86, 101, 103, 101, 103, 1…
$ data_major_minerals_potassium  <int> 51, 157, 150, 253, 150, 159, 166, 159, …
$ data_major_minerals_sodium     <int> 17, 39, 38, 3, 38, 39, 52, 39, 39, 39, …
$ data_major_minerals_zinc       <dbl> 0.17, 0.42, 0.41, 0.38, 0.41, 0.43, 0.4…
$ data_vitamins_vitamin_a_rae    <int> 61, 59, 32, 29, 32, 58, 137, 83, 58, 83…
$ data_vitamins_vitamin_b12      <dbl> 0.05, 0.56, 0.54, 0.36, 0.54, 0.61, 0.3…
$ data_vitamins_vitamin_b6       <dbl> 0.011, 0.060, 0.061, 0.034, 0.061, 0.06…
$ data_vitamins_vitamin_c        <dbl> 5.0, 0.1, 0.0, 0.9, 0.0, 0.0, 1.0, 0.2,…
$ data_vitamins_vitamin_e        <dbl> 0.08, 0.03, 0.05, 0.08, 0.05, 0.02, 0.0…
$ data_vitamins_vitamin_k        <dbl> 0.3, 0.2, 0.3, 0.3, 0.3, 0.1, 0.0, 0.2,…
Code
ggplot(food_dataset_clean, aes(x = data_protein, y = data_sugar_total)) +
  geom_jitter(size = 1) +
  theme(legend.position = "right") +
  labs(y = "Grams of Sugar in a Food Item", x = "Total Grams of Protein in a Food Item")
Figure 1: Protein vs. Sugar in Food Items

Figure 1: 2 Numerical Variables

In the first scatterplot, Figure 1, the amount of protein in each food item is on the x-axis and the amount of total sugar is on the y-axis. While most of the points are clustered towards the right of the graph, there may be a negative association between the amount of sugar a food has and the amount of protein it has. So, a food with lots of sugar might not have a lot of protein, and a food with lots of protein may not have a lot of sugar.
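One way to check the visual impression of a negative association is to compute a rank correlation. The following is a minimal sketch, not part of the original analysis: the helper name `protein_sugar_cor` is hypothetical, and the toy data reuses only the first six protein and sugar values printed by `glimpse()` above, so it says nothing about the full dataset.

```r
# Toy data: the first six protein and total-sugar values shown by glimpse().
# All six rows are milk products, so this is NOT representative of the dataset.
toy <- data.frame(
  data_protein     = c(1.03, 3.34, 3.28, 3.10, 3.28, 3.38),
  data_sugar_total = c(6.89, 4.89, 4.81, 4.46, 4.81, 4.96)
)

# Hypothetical helper: Spearman rank correlation between protein and sugar.
# Spearman is an assumption here; it is less sensitive to skew and clustering
# than the default Pearson correlation.
protein_sugar_cor <- function(df) {
  cor(df$data_protein, df$data_sugar_total,
      method = "spearman", use = "complete.obs")
}

# A value near -1 would support a negative association, near 0 no monotone trend.
protein_sugar_cor(toy)
```

Calling `protein_sugar_cor(food_dataset_clean)` would apply the same check to the full data.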

Figure 2: 2 Categorical Variables

The only other categorical variable in this dataset is the description of the food items, so I will make another plot with numerical variables.

Code
ggplot(food_dataset_clean, aes(x = data_major_minerals_iron, y = data_major_minerals_calcium)) +
  # Fill by water content so the mapped variable matches the legend title
  geom_col(aes(fill = data_water)) +
  scale_fill_viridis_c(name = "Water (g)") +
  # Scale limits drop foods outside the window instead of zooming in on them
  scale_x_continuous(name = "Iron (mg)", limits = c(0, 0.25)) +
  scale_y_continuous(name = "Calcium (mg)", limits = c(0, 150))
Figure 2: Iron vs. Calcium in a Food & Water Percentage

In Figure 2, I limited the foods I examined to those that had at most 0.25 mg of iron. Because most of the observations for iron appeared to be below 1 mg, I decided to limit my scale to make the graph more readable. Iron (mg) is on the x-axis and calcium (mg) is on the y-axis. The fill color of each bar shows the amount of water (g) in the food. Looking at the graph, the foods that had less than 0.25 mg of iron also seem to have a lot of calcium (many reaching almost 150 mg). Therefore, there might be a positive association between calcium and iron levels. However, since this plot is limited to foods with iron levels below 0.25 mg, it would be difficult to make that association for the entirety of the data. Most of the food items also appear to have low amounts of water (shown by the purple fill color).
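Because `limits` on a continuous ggplot2 scale removes out-of-range observations rather than zooming, it can help to count how much of the data the window keeps before reading a trend from the plot. The sketch below uses a hypothetical toy data frame; only the column names come from `food_dataset_clean`.

```r
library(dplyr)

# Hypothetical rows for illustration; column names follow food_dataset_clean.
toy <- data.frame(
  data_major_minerals_iron    = c(0.03, 0.00, 0.05, 0.30, 1.20),
  data_major_minerals_calcium = c(32, 126, 101, 140, 20)
)

# Keep only the foods inside the plotting window used in Figure 2.
in_window <- toy |>
  filter(between(data_major_minerals_iron, 0, 0.25),
         between(data_major_minerals_calcium, 0, 150))

# Share of foods the window retains. A small share means the plot describes
# only a narrow slice of the dataset.
share_kept <- nrow(in_window) / nrow(toy)
share_kept
```

Filtering explicitly like this also makes the restriction visible in the code instead of hiding it inside the scale definitions.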

Figure 3: 1 Categorical, 1 Numerical Variable

Code
milkorcheese <- food_dataset_clean |>
  filter(category %in% c("Milk", "Cheese", "Beef", "Fish"))
Code
ggplot(milkorcheese, aes(x = data_fiber, y = category)) +
  # geom_col() stacks the fiber values, so each bar's length is the category total
  geom_col(fill = "steelblue") +
  labs(y = "Category of Food", x = "Total Grams of Fiber in a Food Group")
Figure 3: Amount of Total Fiber (g) in Different Food Groups

In the bar graph above, Figure 3, I limited the categories of food to Fish, Cheese, Beef, and Milk. On the y-axis, I plotted the category of food. On the x-axis, I plotted the total grams of fiber found in each food group. Looking at the graph, the category “beef” has the most total fiber while the category “fish” has the least. From this graph, we may be able to hypothesize that beef, on average, has more fiber than fish, cheese, or milk. However, we would need to see how many observations are in each category to check whether the beef category simply contains more foods, which would by itself produce a higher fiber total.
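The caveat about group sizes can be checked by putting the number of foods, the total fiber, and the mean fiber side by side. This is a small sketch on hypothetical values, assuming the same `category` and `data_fiber` columns as `milkorcheese`; the numbers are made up for illustration.

```r
library(dplyr)

# Hypothetical fiber values; only the column names match the real data.
toy <- data.frame(
  category   = c("Beef", "Beef", "Beef", "Fish", "Milk", "Cheese"),
  data_fiber = c(0.5, 0.6, 1.6, 0.0, 0.0, 0.2)
)

fiber_summary <- toy |>
  group_by(category) |>
  summarize(
    n_foods     = n(),              # how many foods are in the category
    total_fiber = sum(data_fiber),  # what a stacked bar would show
    mean_fiber  = mean(data_fiber)  # comparable across unequal group sizes
  ) |>
  arrange(desc(total_fiber))

fiber_summary
```

If a category leads on `total_fiber` but not on `mean_fiber`, its bar is long mainly because it contains more foods.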

Rotation 3: Tables

Code
library(knitr)
food_dataset_clean |>
  names() |>
  kable()
x
category
description
nutrient_data_bank_number
data_alpha_carotene
data_beta_carotene
data_beta_cryptoxanthin
data_carbohydrate
data_cholesterol
data_choline
data_fiber
data_lutein_and_zeaxanthin
data_lycopene
data_niacin
data_protein
data_retinol
data_riboflavin
data_selenium
data_sugar_total
data_thiamin
data_water
data_fat_monosaturated_fat
data_fat_polysaturated_fat
data_fat_saturated_fat
data_fat_total_lipid
data_major_minerals_calcium
data_major_minerals_copper
data_major_minerals_iron
data_major_minerals_magnesium
data_major_minerals_phosphorus
data_major_minerals_potassium
data_major_minerals_sodium
data_major_minerals_zinc
data_vitamins_vitamin_a_rae
data_vitamins_vitamin_b12
data_vitamins_vitamin_b6
data_vitamins_vitamin_c
data_vitamins_vitamin_e
data_vitamins_vitamin_k

From these variable names, I expect to mostly see numeric data, and a few categorical variables for “category” and “description.”

After checking, the data basically matches what I expected to see, although I didn’t expect the data bank ID number to be numeric. I don’t see why it would be useful for an identifier to be stored as a number. The only categorical variables are the category of food and the description.
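The type check described here can be done programmatically, and the identifier can be converted to text if it should be treated as a label. The sketch below is one possible approach on a two-row toy data frame whose values are copied from the `glimpse()` output above; converting the ID is a suggestion, not something the original analysis did.

```r
# Two rows copied from the glimpse() output above.
toy <- data.frame(
  category                  = c("Milk", "Milk"),
  description               = c("Milk, human", "Milk, NFS"),
  nutrient_data_bank_number = c(11000000L, 11100000L),
  data_protein              = c(1.03, 3.34),
  stringsAsFactors = FALSE
)

# Class of every column, to compare against what the variable names suggest.
col_types <- sapply(toy, class)
col_types

# Treat the ID as a label so it is never averaged or plotted as a quantity.
toy$nutrient_data_bank_number <- as.character(toy$nutrient_data_bank_number)
```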

Code
food_dataset_clean |>
  group_by(category) |>
  summarize(numb_category = n()) |>
  arrange(desc(numb_category)) |>
  head(12) |>
  kable()
Table 1: The 12 categories with the highest number of individual foods recorded in this dataset, which will be used for further summary.
category                          numb_category
Infant formula                              173
Rice                                        143
Bread                                       119
Potato                                      105
Cookie                                      100
Frankfurter or hot dog sandwich              95
Coffee                                       91
Egg omelet or scrambled egg                  82
Pie                                          78
Beef                                         77
Cheese                                       60
Pasta                                        52
Code
food_dataset_clean |>
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omelet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_carb = mean(data_carbohydrate), median_carb = median(data_carbohydrate), max_carb = max(data_carbohydrate)) |>
  arrange(desc(mean_carb)) |>
  kable()
Table 2: The mean, median, and maximum amount of carbohydrates in the 12 largest categories of foods, arranged by descending mean.
category                          mean_carb  median_carb  max_carb
Cookie                            68.526667        66.26     80.44
Bread                             51.695454        48.30     65.40
Pie                               39.440000        36.35     48.00
Rice                              21.658182        21.26     24.86
Potato                            20.787778        17.91     41.44
Frankfurter or hot dog sandwich   20.265000        20.39     23.92
Pasta                             19.178000        17.13     30.68
Coffee                            13.087500         3.96     74.04
Beef                               9.256667        12.51     15.54
Infant formula                     7.065333         6.99      7.45
Cheese                             3.722000         3.88      5.27
Code
food_dataset_clean |>
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omelet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_chol = mean(data_cholesterol), median_chol = median(data_cholesterol), max_chol = max(data_cholesterol)) |>
  arrange(desc(mean_chol)) |>
  kable()
Table 3: The mean, median, and maximum amount of cholesterol in the 12 largest categories of foods, arranged by descending mean.
category                           mean_chol  median_chol  max_chol
Cheese                            62.2000000         89.0        98
Beef                              58.6666667         46.5       119
Frankfurter or hot dog sandwich   34.7500000         34.0        46
Pasta                             13.2000000         13.0        25
Pie                               10.4285714          0.0        38
Potato                            10.0000000          0.0        37
Cookie                             7.0000000          0.0        42
Bread                              4.7272727          0.0        41
Coffee                             1.2500000          1.0         4
Rice                               1.0909091          0.0        11
Infant formula                     0.6666667          0.0         2
Code
food_dataset_clean |>
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omelet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_fiber = mean(data_fiber), median_fiber = median(data_fiber), max_fiber = max(data_fiber)) |>
  arrange(desc(mean_fiber)) |>
  kable()
Table 4: The mean, median, and maximum amount of fiber in the 12 largest categories of foods, arranged by descending mean.
category                          mean_fiber  median_fiber  max_fiber
Bread                               4.254545          4.40        7.8
Cookie                              3.511111          2.10       14.3
Pasta                               2.160000          2.00        2.5
Pie                                 2.142857          1.80        3.8
Frankfurter or hot dog sandwich     2.112500          2.25        3.5
Potato                              1.844444          1.50        3.8
Rice                                1.754546          1.60        4.1
Beef                                0.650000          0.55        1.6
Coffee                              0.237500          0.00        1.9
Cheese                              0.040000          0.00        0.2
Infant formula                      0.000000          0.00        0.0

From these tables, we can identify the category with the highest mean amount of carbohydrates (cookies), the highest mean amount of cholesterol (cheese), and the highest mean amount of fiber (bread). The tables generally reinforce what I know about foods: the highest categories in Table 2 are breads and sweets, and the highest categories in Table 3 are meats and cheeses. Table 4 was more surprising, showing that sweets and breads have a comparatively good amount of fiber and that infant formula has none.
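Tables 2 to 4 restrict the data to a fixed set of category names. In R, comparing a column to a vector with `==` works element by element and recycles the shorter vector, whereas `%in%` tests set membership, so the two can keep different rows. The toy vectors below are hypothetical and only illustrate the difference.

```r
# Hypothetical category column and the set of categories we want to keep.
categories <- c("Rice", "Bread", "Rice", "Bread", "Pie", "Rice")
wanted     <- c("Rice", "Bread")

# `==` recycles `wanted` as Rice, Bread, Rice, Bread, Rice, Bread and compares
# position by position, so the final "Rice" is compared with "Bread" and dropped.
by_equals <- categories == wanted

# `%in%` asks, for each element, whether it appears anywhere in `wanted`.
by_in <- categories %in% wanted

sum(by_equals)  # rows kept by element-wise comparison
sum(by_in)      # rows kept by membership test
```

When the goal is "keep every row whose category is in this list", `%in%` is the form that expresses it.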

Rotation 4: Detailed Description of Dataset

3 Variables that are most critical to the Dataset: Category of food, Amount of Carbohydrates in grams, and Amount of Protein in Grams

Code
subset_food_clean <- food_dataset_clean |>
  select(category, data_carbohydrate, data_protein) |>
  filter(category %in% c("Milk", "Cheese", "Beef", "Fish", "Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omelet or scrambled egg", "Pie", "Pasta"))
glimpse(subset_food_clean)
Rows: 1,149
Columns: 3
$ category          <chr> "Milk", "Milk", "Milk", "Milk", "Milk", "Milk", "Mil…
$ data_carbohydrate <dbl> 6.89, 4.87, 4.67, 4.46, 4.67, 5.19, 4.85, 4.91, 5.19…
$ data_protein      <dbl> 1.03, 3.34, 3.28, 3.10, 3.28, 3.38, 3.40, 3.35, 3.38…

Our data comes from the United States Department of Agriculture’s Food Composition Database. It records categories of foods and the nutrients in them, such as the amounts of vitamins and minerals each food contains. The variables we consider most important are the category of food, the amount of carbohydrates, and the amount of protein. For the categories, we chose foods we thought were main parts of the average American’s diet: various types of dairy, protein sources, carbohydrate-rich foods, and common drinks. We think it would be interesting to see which of these common foods have the most carbohydrates and protein, two nutrients consumers often look at. Carbohydrates and protein provide roughly the same energy per gram (about 4 kcal), yet carbohydrates are often blamed for weight gain, so it may be worth learning which foods contain unexpectedly many carbohydrates. We were also interested in comparing the amount of protein across common foods and different kinds of animal products.

  • Category: a categorical variable that describes the food category the observation falls under. For our project, we focused on the following food categories: Milk, Cheese, Beef, Infant formula, Fish, Rice, Bread, Potato, Cookie, Frankfurter or hot dog sandwich, Coffee, Egg omelet or scrambled egg, Pie, and Pasta (the most common food categories).

  • Carbohydrates: a numerical variable which measures the amount of carbohydrates within a certain food (in grams).

  • Protein: a numerical variable which measures the amount of protein within a certain food (in grams).
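With these three variables, the comparison the project is interested in reduces to a per-category summary of carbohydrates and protein. The following sketch shows one way to compute it with `across()`. The Milk values are the first two rows from the `glimpse()` output above; the Cookie values are hypothetical placeholders.

```r
library(dplyr)

# Milk rows come from the glimpse() output; Cookie rows are hypothetical.
toy <- data.frame(
  category          = c("Milk", "Milk", "Cookie", "Cookie"),
  data_carbohydrate = c(6.89, 4.87, 66.0, 70.0),
  data_protein      = c(1.03, 3.34, 5.0, 6.0)
)

# Mean grams of carbohydrate and protein per category, in one pass.
macro_summary <- toy |>
  group_by(category) |>
  summarize(across(c(data_carbohydrate, data_protein), mean,
                   .names = "mean_{.col}"))

macro_summary
```

Applying the same pipeline to `subset_food_clean` would give one row per selected food category.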

Rotation 5: Ethics

Statement of Ethical considerations

This data is from the United States Department of Agriculture’s Food Composition Database. The dataset contains a large amount of data on the nutritional value of many foods, with the purpose of giving people with dietary restrictions quick and easy access to that information. The data can have many uses. As mentioned previously, people with dietary restrictions can quickly access information on foods and food companies. This data, joined with other datasets, can help researchers find the types of nutrition that food deserts fail to provide to the people living in them (Reynolds, Block, and Bradley 2018). Lastly, this database can be used by grocery store owners to sort their produce and buy a balanced selection to stock the shelves in their store. One possible harm is to the food companies themselves: consumers may not know of certain nutritional aspects of their food and may use the newfound information from the data to sue big food companies.

Tie the data to the Final Project’s goal

This data can potentially be used for future SDS 100 classes. In lab 9, when learning how to create tables, we ordered fruits. I think it would be interesting to see how many kinds of tables and plots you can create from the data. The exercise can be to add some nutritional data about the fruits and/or other produce to the table.

References

Reynolds, Kristin, Daniel Block, and Katharine Bradley. 2018. “Food Justice Scholar-Activism and Activist-Scholarship.” ACME: An International Journal for Critical Geographies 17 (4): 988–98. https://acme-journal.org/index.php/acme/article/view/1735.