Food Composition

Agricultural-based food dataset

Image of Tractor Work in Agricultural Field

This dataset is derived from the US Department of Agriculture’s Food Composition Database. It covers a wide range of foods common in the US, from vegetables to meats, and records the amounts of minerals and vitamins found in each. Our analysis focuses on nutrition, specifically comparing proteins, carbohydrates, fibers, and sugars.

Code

library(tidyverse)
food_dataset_raw <- read.csv("https://corgis-edu.github.io/corgis/datasets/csv/food/food.csv")
library(janitor)
food_dataset_clean <- clean_names(food_dataset_raw)

Code

food_dataset_clean <- food_dataset_clean |>
  select(category, data_carbohydrate, data_protein, data_cholesterol, data_fiber, data_sugar_total, data_major_minerals_iron, data_major_minerals_calcium) |>
  # keep only the food categories of interest
  filter(category %in% c("Milk", "Cheese", "Beef", "Fish", "Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omlet or scrambled egg", "Pie", "Pasta"))

glimpse(food_dataset_clean)

Rows: 1,149
Columns: 8
$ category                    <chr> "Milk", "Milk", "Milk", "Milk", "Milk", "M…
$ data_carbohydrate           <dbl> 6.89, 4.87, 4.67, 4.46, 4.67, 5.19, 4.85, …
$ data_protein                <dbl> 1.03, 3.34, 3.28, 3.10, 3.28, 3.38, 3.40, …
$ data_cholesterol            <int> 14, 8, 12, 14, 12, 5, 2, 8, 5, 8, 5, 3, 5,…
$ data_fiber                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ data_sugar_total            <dbl> 6.89, 4.89, 4.81, 4.46, 4.81, 4.96, 4.85, …
$ data_major_minerals_iron    <dbl> 0.03, 0.00, 0.00, 0.05, 0.00, 0.00, 0.04, …
$ data_major_minerals_calcium <int> 32, 126, 123, 101, 123, 126, 204, 126, 126…

Our data comes from the United States Department of Agriculture’s Food Composition Database. It lists categories of foods and the nutrients in them, such as the amounts of vitamins, minerals, and other nutrients each food contains. The variables we consider most important are the category of food, the amount of carbohydrates, and the amount of protein. For the categories, we chose foods we thought were main parts of the average American’s diet: various types of dairy, protein, carbohydrates, and important drinks. We think it would be interesting to see which of these common foods have the most carbohydrates and protein, two nutrients consumers often look for. Carbohydrates provide roughly the same energy per gram as protein (about 4 kcal) but are often blamed for weight gain, so it may be useful to learn which foods contain unexpectedly many carbohydrates. Because protein is a nutrient many consumers seek out, we were also interested in comparing the amount of protein across common foods and different kinds of animal products.

The 3 variables most critical to the dataset: category of food, amount of carbohydrates in grams, and amount of protein in grams

Category: a categorical variable that describes the food category the observation falls under. For our project, we focused on the following food categories: Milk, Cheese, Beef, Infant formula, Fish, Rice, Bread, Potato, Cookie, Frankfurter or hot dog sandwich, Coffee, Egg omlet or scrambled egg, Pie, and Pasta (the most common food categories).
Carbohydrates: a numerical variable which measures the amount of carbohydrates within a certain food (in grams).
Protein: a numerical variable which measures the amount of protein within a certain food (in grams).
Cholesterol: a numerical variable which measures the amount of cholesterol within a certain food (in milligrams).
Fiber: a numerical variable which measures the indigestible portion of food derived from plants (in grams).
Sugar: a numerical variable which measures the amount of sugar from all sources, naturally occurring in the food or added (in grams).
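
As a minimal sketch of the comparison described above, the snippet below averages carbohydrates and protein by category. The `toy_foods` tibble and its values are made up for illustration; only the column names mirror the cleaned dataset, and the name `macro_summary` is hypothetical.

```r
library(dplyr)

# Hypothetical toy rows; values are illustrative, not taken from the USDA data
toy_foods <- tibble(
  category = c("Milk", "Milk", "Beef", "Beef", "Bread", "Bread"),
  data_carbohydrate = c(5, 7, 0, 2, 48, 52),
  data_protein = c(3, 3.4, 26, 30, 9, 11)
)

# Mean carbohydrate and protein per category, highest mean protein first
macro_summary <- toy_foods |>
  group_by(category) |>
  summarize(
    mean_carb = mean(data_carbohydrate),
    mean_protein = mean(data_protein)
  ) |>
  arrange(desc(mean_protein))

macro_summary
```

Replacing `toy_foods` with `food_dataset_clean` should give the same kind of summary for the real categories.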

Summary Tables

Code

library(knitr)

food_dataset_clean |>
  group_by(category) |>
  summarize(numb_category = n()) |>
  arrange(desc(numb_category)) |>
  head(12) |>
  kable()

Table 1: Table 1 shows the 12 categories with the highest number of individual foods recorded in this dataset, which will be used for further summary.

category	numb_category
Infant formula	173
Rice	143
Bread	119
Potato	105
Cookie	100
Frankfurter or hot dog sandwich	95
Coffee	91
Pie	78
Beef	77
Cheese	60
Pasta	52
Milk	31

Code

food_dataset_clean |>
  # %in% tests set membership; `==` would recycle the vector and silently drop most rows
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omlet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_carb = mean(data_carbohydrate), median_carb = median(data_carbohydrate), max_carb = max(data_carbohydrate)) |>
  arrange(desc(mean_carb)) |>
  kable()

Table 2: Table 2 shows the mean, median, and maximum amount of carbohydrates in the 12 largest categories of foods, arranged by descending mean.

category	mean_carb	median_carb	max_carb
Cookie	70.105000	71.800	79.00
Bread	49.670000	51.300	61.42
Pie	35.838571	38.510	42.60
Rice	22.166667	21.385	32.50
Frankfurter or hot dog sandwich	21.790000	20.810	27.05
Potato	21.563750	18.995	41.44
Pasta	17.250000	17.250	18.36
Coffee	15.216250	6.295	86.28
Beef	7.618333	7.050	16.71
Infant formula	7.135714	7.200	7.45
Cheese	5.560000	3.880	9.99

Code

food_dataset_clean |>
  # %in% tests set membership; `==` would recycle the vector and silently drop most rows
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omlet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_chol = mean(data_cholesterol), median_chol = median(data_cholesterol), max_chol = max(data_cholesterol)) |>
  arrange(desc(mean_chol)) |>
  kable()

Table 3: Table 3 shows the mean, median, and maximum amount of cholesterol in the 12 largest categories of foods, arranged by descending mean.

category	mean_chol	median_chol	max_chol
Beef	51.6666667	42.5	105
Cheese	43.6000000	18.0	96
Frankfurter or hot dog sandwich	34.3750000	31.0	55
Pie	28.8571429	7.0	122
Pasta	17.5000000	17.5	27
Potato	9.8750000	3.0	37
Cookie	8.3750000	0.0	58
Bread	5.7000000	0.0	49
Rice	2.5000000	0.0	18
Coffee	2.2500000	1.0	6
Infant formula	0.4285714	0.0	2

Code

food_dataset_clean |>
  # %in% tests set membership; `==` would recycle the vector and silently drop most rows
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omlet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_fiber = mean(data_fiber), median_fiber = median(data_fiber), max_fiber = max(data_fiber)) |>
  arrange(desc(mean_fiber)) |>
  kable()

Table 4: Table 4 shows the mean, median, and maximum amount of fiber in the 12 largest categories of foods, arranged by descending mean.

category	mean_fiber	median_fiber	max_fiber
Bread	4.1800000	3.45	9.2
Pasta	2.4000000	2.40	2.7
Frankfurter or hot dog sandwich	2.1625000	2.15	3.6
Cookie	2.0250000	1.95	3.0
Potato	1.9500000	1.85	3.8
Rice	1.2916667	1.10	1.9
Pie	1.2000000	0.80	2.6
Beef	0.7333333	0.55	1.8
Cheese	0.1600000	0.00	0.4
Infant formula	0.0500000	0.00	0.7
Coffee	0.0000000	0.00	0.0

From these tables, we can note the categories with the highest mean amounts: cookies for carbohydrates, beef for cholesterol (with cheese second), and bread for fiber. The tables generally reinforce what I know about foods: the highest categories in Table 2 are breads and sweets, and the highest categories in Table 3 are meats and cheeses. Table 4 presented more surprises, showing that sweets and breads actually have a comparatively good amount of fiber, while infant formula has almost none and coffee has none.
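
The summary tables are restricted to a fixed set of categories, and how that restriction is written matters. The sketch below, on a made-up `category` vector, contrasts `==` against a vector (which recycles the right-hand side position by position) with `%in%` (which tests set membership). The names `keep`, `recycled`, and `member` are hypothetical.

```r
# Hypothetical category column, for illustration only
category <- c("Rice", "Bread", "Rice", "Bread", "Pie", "Rice")
keep <- c("Bread", "Rice")

# `==` recycles `keep` along `category`, comparing element 1 with "Bread",
# element 2 with "Rice", element 3 with "Bread", and so on
recycled <- category == keep

# `%in%` asks, for each element, whether it appears anywhere in `keep`
member <- category %in% keep

sum(recycled)  # rows kept by the recycled comparison
sum(member)    # rows kept by the membership test
```

With these toy values the recycled comparison keeps a single row, while the membership test keeps every "Rice" and "Bread" row, so group summaries computed after an `==` filter may rest on only a fraction of each category.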

Data Visualizations

Code

ggplot(food_dataset_clean, aes(x = data_protein, y = data_sugar_total)) +
  geom_jitter(size = 1) +
  theme(legend.position = "right") +
  labs(y = "Grams of Sugar in a Food Item", x = "Total Grams of Protein in a Food Item")

Figure 1: Protein vs. Sugar in Food Items

Figure 1: 2 Numerical Variables

In the first scatterplot, Figure 1, the amount of protein a food item had in the data was placed on the x-axis while the amount of total sugar a food item had was placed on the y-axis. While most of the points are clustered towards the right of the graph, it appears that there might be a negative slope in the data. Specifically, it appears as though there may be a negative association between the amount of sugar a food has and the amount of protein it has. So, a food with lots of sugar might not have a lot of protein, and a food with lots of protein may not have a lot of sugar.
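
One way to check a suspected negative association numerically is a correlation coefficient. This is a minimal sketch on made-up values, not a result from the dataset; the vectors `protein` and `sugar` are hypothetical stand-ins for `data_protein` and `data_sugar_total`.

```r
# Hypothetical protein and sugar values (grams), for illustration only
protein <- c(1, 3, 5, 10, 20, 25)
sugar <- c(30, 22, 18, 6, 2, 1)

# Pearson measures linear association
pearson <- cor(protein, sugar, method = "pearson")

# Spearman works on ranks, so it is less sensitive to skewed or clustered points
spearman <- cor(protein, sugar, method = "spearman")

c(pearson = pearson, spearman = spearman)
```

On the real data the analogous call would be `cor(food_dataset_clean$data_protein, food_dataset_clean$data_sugar_total)`. A value near zero would suggest the apparent slope in the scatterplot is weak.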

Figure 2: 2 Categorical Variables

The only other categorical variable in this dataset is the description of the food items, so I will make another plot with numerical variables.

Code

ggplot(food_dataset_clean, aes(x = data_major_minerals_iron, y = data_major_minerals_calcium)) +
  geom_col(aes(fill = data_protein)) +
  scale_fill_viridis_c(name = "Protein (g)") +  # fill is mapped to data_protein
  scale_x_continuous(name = "Iron (mg)", limits = c(0, 0.25)) +
  scale_y_continuous(name = "Calcium (mg)", limits = c(0, 150))

Figure 2: Iron vs. Calcium in a Food, with Protein as Fill

In Figure 2, I limited the foods I examined to those with at most 0.25 mg of iron. Because most of the observations for iron appeared to be below 1 mg, I limited the scale to make the graph more readable. The x-axis shows iron (mg) and the y-axis shows calcium (mg). The fill color of the bars is mapped to the data_protein column, so it shows the amount of protein (g) in each food. Looking at the graph, the foods with at most 0.25 mg of iron also seem to have a lot of calcium (many reaching almost 150 mg). Therefore, there might be a positive association between calcium and iron levels. However, since this plot is limited to foods with iron levels at or below 0.25 mg, it would be difficult to make that association for the entirety of the data. Most of the food items also appear to have low amounts of protein (shown by the purple fill color).
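
Setting `limits` on a continuous ggplot2 scale treats out-of-range values as missing by default, so those rows are dropped from the plot (with a warning). A rough way to see how much data such limits exclude is to count the rows explicitly. The sketch below uses a made-up data frame; `toy_minerals`, `in_range`, and `n_dropped` are hypothetical names, and the cutoffs simply mirror the axis limits used for Figure 2.

```r
# Hypothetical iron and calcium values, for illustration only
toy_minerals <- data.frame(
  data_major_minerals_iron = c(0.03, 0.10, 0.25, 0.80, 2.10),
  data_major_minerals_calcium = c(126, 90, 140, 20, 300)
)

# Rows that fall inside both axis limits
in_range <- subset(
  toy_minerals,
  data_major_minerals_iron <= 0.25 & data_major_minerals_calcium <= 150
)

# Number of rows the limits would exclude
n_dropped <- nrow(toy_minerals) - nrow(in_range)
n_dropped
```

Running the same count on `food_dataset_clean` would show what share of the foods the figure actually represents.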

Figure 3: 1 Categorical, 1 Numerical Variable

Code

milkorcheese <- food_dataset_clean |>
  filter(category == "Milk" | category == "Cheese" | category == "Beef" | category == "Fish")

Code

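# aes() needs lowercase x; geom_col() stacks the fiber values, so each bar is the category's total fiber
ggplot(milkorcheese, aes(x = data_fiber, y = category)) +
  geom_col(fill = "steelblue") +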
  labs(y = "Category of Food", x = "Total Grams of Fiber in a Food Group")

Figure 3: Amount of Total Fiber (g) in Different Food Groups

In the bar graph above, Figure 3, I decided to limit the categories of food I was interested in to Fish, Cheese, Beef, and Milk. On the y-axis, I plotted the category of food. On the x-axis, I plotted the total grams of fiber found in each food group. By looking at the graph, I can see that the food category “beef” has the most total fiber while the category “fish” has the least. From this graph, we may be able to hypothesize that beef, on average, has more fiber than fish, cheese, or milk. However, we would need to see how many observations (n) are in each category to check whether the beef category simply has more observations, which would result in a higher fiber total.
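
The caveat about group sizes can be checked by putting the count, total, and mean side by side. This is a minimal sketch on made-up values; `toy_fiber` and `fiber_summary` are hypothetical names, and the numbers are chosen only to show how a large group can have the highest total without the highest mean.

```r
library(dplyr)

# Hypothetical fiber values (grams); not taken from the USDA data
toy_fiber <- tibble(
  category = c("Beef", "Beef", "Beef", "Beef", "Cheese", "Cheese", "Fish"),
  data_fiber = c(0.5, 0.5, 0.5, 0.5, 0, 0.4, 1.2)
)

# Count, total, and mean fiber per category, largest total first
fiber_summary <- toy_fiber |>
  group_by(category) |>
  summarize(
    n = n(),
    total_fiber = sum(data_fiber),
    mean_fiber = mean(data_fiber)
  ) |>
  arrange(desc(total_fiber))

fiber_summary
```

In this toy example "Beef" has the largest total only because it has the most rows, while "Fish" has the largest mean. Applying the same summary to `milkorcheese` would show whether that is also the case in the real data.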

Ethics

This data is from the United States Department of Agriculture’s Food Composition Database. The data set contains a large amount of data on the nutritional value of many foods, giving people with dietary restrictions quick and easy access to this information. The data can have many uses. As mentioned previously, people with dietary restrictions can quickly look up information on foods and food companies. Joined with other data sets, it can help researchers identify the kinds of nutrition that food deserts fail to provide to the people living in them (Reynolds, Block, and Bradley 2018). Lastly, this database can be used by grocery store owners to sort and evenly buy produce to stock the shelves in their store. One potential harm is to the food companies themselves: consumers who were unaware of certain nutritional aspects of their food may use the newfound information from the data to sue big food companies.

Tie the data to the Final Project’s goal

This data can potentially be used for future SDS 100 classes. In lab 9, when learning how to create tables, we ordered fruits. I think it would be interesting to see how many kinds of tables and plots you can create from the data. The exercise can be to add some nutritional data about the fruits and/or other produce to the table.

References

Reynolds, Kristin, Daniel Block, and Katharine Bradley. 2018. “Food Justice Scholar-Activism and Activist-Scholarship.” ACME: An International Journal for Critical Geographies 17 (4): 988–98. https://acme-journal.org/index.php/acme/article/view/1735.