This dataset is derived from the US Department of Agriculture’s Food Composition Database. It covers a wide range of foods common in the US, from vegetables to meats, and records the amounts of minerals and vitamins found in each. Our analysis focuses on nutrition, specifically comparing protein, carbohydrates, fiber, and sugar.
library(tidyverse)
library(janitor)

# Read the raw CORGIS food data and standardize the column names to snake_case
food_dataset_raw <- read.csv("https://corgis-edu.github.io/corgis/datasets/csv/food/food.csv")
food_dataset_clean <- clean_names(food_dataset_raw)

# Keep the nutrient columns of interest and a set of common food categories
food_dataset_clean <- food_dataset_clean |>
  select(category, data_carbohydrate, data_protein, data_cholesterol, data_fiber, data_sugar_total, data_major_minerals_iron, data_major_minerals_calcium) |>
  filter(category %in% c("Milk", "Cheese", "Beef", "Fish", "Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omlet or scrambled egg", "Pie", "Pasta"))
glimpse(food_dataset_clean)
Rows: 1,149
Columns: 8
$ category <chr> "Milk", "Milk", "Milk", "Milk", "Milk", "M…
$ data_carbohydrate <dbl> 6.89, 4.87, 4.67, 4.46, 4.67, 5.19, 4.85, …
$ data_protein <dbl> 1.03, 3.34, 3.28, 3.10, 3.28, 3.38, 3.40, …
$ data_cholesterol <int> 14, 8, 12, 14, 12, 5, 2, 8, 5, 8, 5, 3, 5,…
$ data_fiber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ data_sugar_total <dbl> 6.89, 4.89, 4.81, 4.46, 4.81, 4.96, 4.85, …
$ data_major_minerals_iron <dbl> 0.03, 0.00, 0.00, 0.05, 0.00, 0.00, 0.04, …
$ data_major_minerals_calcium <int> 32, 126, 123, 101, 123, 126, 204, 126, 126…
Our data comes from the United States Department of Agriculture’s Food Composition Database. It includes categories of foods and the amounts of vitamins, minerals, and other nutrients each food contains. The variables we consider most important are the category of food, the amount of carbohydrates, and the amount of protein. For the categories, we chose foods we thought were main parts of the average American’s diet: various types of dairy, protein, carbohydrates, and important drinks. We think it would be interesting to see which of these common foods have the most carbohydrates and protein, two nutrients consumers often look at. Carbohydrates and protein provide roughly the same energy per gram (about 4 kcal), yet carbohydrates are often blamed for weight gain, so there may be value in learning which foods contain unexpectedly many carbohydrates. Because protein is also a nutrient consumers often look at, we were interested in comparing the amount of protein between common foods and different kinds of animal products.
The three variables most critical to the dataset are the category of food, the amount of carbohydrates in grams, and the amount of protein in grams.
library(knitr)

food_dataset_clean |>
  group_by(category) |>
  summarize(numb_category = n()) |>
  arrange(desc(numb_category)) |>
  head(12) |>
  kable()
category | numb_category |
---|---|
Infant formula | 173 |
Rice | 143 |
Bread | 119 |
Potato | 105 |
Cookie | 100 |
Frankfurter or hot dog sandwich | 95 |
Coffee | 91 |
Pie | 78 |
Beef | 77 |
Cheese | 60 |
Pasta | 52 |
Milk | 31 |
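The same per-category tally can be written more compactly with dplyr's `count()`, which combines the grouping and counting steps. The sketch below runs on a small made-up tibble (`toy_foods` is hypothetical, not the USDA data) so that it is self-contained.

```r
library(dplyr)

# Hypothetical toy data standing in for food_dataset_clean
toy_foods <- tibble(
  category = c("Rice", "Rice", "Bread", "Milk", "Rice", "Bread")
)

# count() groups by category and tallies the rows in each group;
# sort = TRUE orders the result from most to fewest rows
category_counts <- toy_foods |>
  count(category, sort = TRUE, name = "numb_category")

category_counts
```

On the real data, `food_dataset_clean |> count(category, sort = TRUE)` should give the same ranking as the `group_by()` and `summarize()` version.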
# %in% keeps every row whose category is listed; == against a vector would
# recycle it row by row and keep only a subset, giving different summaries
food_dataset_clean |>
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omlet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_carb = mean(data_carbohydrate), median_carb = median(data_carbohydrate), max_carb = max(data_carbohydrate)) |>
  arrange(desc(mean_carb)) |>
  kable()
category | mean_carb | median_carb | max_carb |
---|---|---|---|
Cookie | 70.105000 | 71.800 | 79.00 |
Bread | 49.670000 | 51.300 | 61.42 |
Pie | 35.838571 | 38.510 | 42.60 |
Rice | 22.166667 | 21.385 | 32.50 |
Frankfurter or hot dog sandwich | 21.790000 | 20.810 | 27.05 |
Potato | 21.563750 | 18.995 | 41.44 |
Pasta | 17.250000 | 17.250 | 18.36 |
Coffee | 15.216250 | 6.295 | 86.28 |
Beef | 7.618333 | 7.050 | 16.71 |
Infant formula | 7.135714 | 7.200 | 7.45 |
Cheese | 5.560000 | 3.880 | 9.99 |
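When filtering on several categories at once, `==` and `%in%` behave differently: `==` compares element by element and recycles the shorter vector, while `%in%` tests membership. A minimal sketch with made-up values (not the USDA data) shows how a matching row can be dropped:

```r
# Hypothetical category column with six rows
category <- c("Rice", "Bread", "Rice", "Bread", "Milk", "Rice")
wanted <- c("Rice", "Bread")

# `==` recycles `wanted` as Rice, Bread, Rice, Bread, Rice, Bread
# and compares position by position, so the last "Rice" is compared to "Bread"
matches_eq <- category == wanted

# `%in%` asks, for each row, whether its value appears anywhere in `wanted`
matches_in <- category %in% wanted

sum(matches_eq)
sum(matches_in)
```

Inside `filter()`, `category %in% wanted` therefore keeps every row whose category is listed, regardless of row position.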
# %in% tests membership, so every row in the listed categories is kept
food_dataset_clean |>
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omlet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_chol = mean(data_cholesterol), median_chol = median(data_cholesterol), max_chol = max(data_cholesterol)) |>
  arrange(desc(mean_chol)) |>
  kable()
category | mean_chol | median_chol | max_chol |
---|---|---|---|
Beef | 51.6666667 | 42.5 | 105 |
Cheese | 43.6000000 | 18.0 | 96 |
Frankfurter or hot dog sandwich | 34.3750000 | 31.0 | 55 |
Pie | 28.8571429 | 7.0 | 122 |
Pasta | 17.5000000 | 17.5 | 27 |
Potato | 9.8750000 | 3.0 | 37 |
Cookie | 8.3750000 | 0.0 | 58 |
Bread | 5.7000000 | 0.0 | 49 |
Rice | 2.5000000 | 0.0 | 18 |
Coffee | 2.2500000 | 1.0 | 6 |
Infant formula | 0.4285714 | 0.0 | 2 |
# %in% tests membership, so every row in the listed categories is kept
food_dataset_clean |>
  filter(category %in% c("Infant formula", "Rice", "Bread", "Potato", "Cookie", "Frankfurter or hot dog sandwich", "Coffee", "Egg omlet or scrambled egg", "Pie", "Beef", "Cheese", "Pasta")) |>
  group_by(category) |>
  summarize(mean_fiber = mean(data_fiber), median_fiber = median(data_fiber), max_fiber = max(data_fiber)) |>
  arrange(desc(mean_fiber)) |>
  kable()
category | mean_fiber | median_fiber | max_fiber |
---|---|---|---|
Bread | 4.1800000 | 3.45 | 9.2 |
Pasta | 2.4000000 | 2.40 | 2.7 |
Frankfurter or hot dog sandwich | 2.1625000 | 2.15 | 3.6 |
Cookie | 2.0250000 | 1.95 | 3.0 |
Potato | 1.9500000 | 1.85 | 3.8 |
Rice | 1.2916667 | 1.10 | 1.9 |
Pie | 1.2000000 | 0.80 | 2.6 |
Beef | 0.7333333 | 0.55 | 1.8 |
Cheese | 0.1600000 | 0.00 | 0.4 |
Infant formula | 0.0500000 | 0.00 | 0.7 |
Coffee | 0.0000000 | 0.00 | 0.0 |
From these tables, we can note the categories with the highest average amounts: cookies for carbohydrates, beef for cholesterol, and bread for fiber. The tables generally reinforce what I know about foods: the highest categories in Table 2 are breads and sweets, and the highest categories in Table 3 are meats and cheeses. Table 4 presented more surprises, showing that sweets and breads actually have a comparatively good amount of fiber, while infant formula has almost none and coffee has none.
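The three summary tables repeat the same steps for different columns. One possible way to compute them in a single pass is dplyr's `across()`, sketched here on a made-up tibble (`toy_foods` and its values are hypothetical, chosen only to keep the example runnable).

```r
library(dplyr)

# Hypothetical toy data standing in for food_dataset_clean
toy_foods <- tibble(
  category          = c("Bread", "Bread", "Beef", "Beef"),
  data_carbohydrate = c(50, 52, 6, 8),
  data_fiber        = c(4, 2, 1, 0)
)

# across() applies each summary function to each selected column,
# producing columns such as data_fiber_mean and data_fiber_max
nutrient_summary <- toy_foods |>
  group_by(category) |>
  summarize(
    across(
      c(data_carbohydrate, data_fiber),
      list(mean = mean, median = median, max = max)
    )
  )

nutrient_summary
```

The result has one row per category and one column per nutrient and statistic, which makes it easier to compare nutrients side by side than three separate tables.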
ggplot(food_dataset_clean, aes(x = data_protein, y = data_sugar_total)) +
  geom_jitter(size = 1) +
  theme(legend.position = "right") +
  labs(y = "Grams of Sugar in a Food Item", x = "Total Grams of Protein in a Food Item")
In the first scatterplot, Figure 1, the amount of protein in each food item is on the x-axis and the amount of total sugar is on the y-axis. While most of the points are clustered towards the right of the graph, there appears to be a negative slope in the data, which suggests a negative association between the amount of sugar a food has and the amount of protein it has. So, a food with lots of sugar might not have a lot of protein, and a food with lots of protein may not have a lot of sugar.
The only other categorical variable in this dataset is the description of the food items, so I will make another plot with numerical variables.
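One way to check an apparent negative association numerically is a rank correlation, which is less sensitive to a few extreme points than Pearson's correlation. The sketch below uses made-up values (illustrative only, not taken from the dataset).

```r
# Hypothetical protein and sugar values (grams) for six food items
protein <- c(1, 3, 5, 10, 20, 25)
sugar   <- c(40, 30, 6, 5, 1, 0)

# Spearman's rho is computed on ranks; values near -1 suggest that
# foods with more protein tend to have less sugar
rho <- cor(protein, sugar, method = "spearman")
rho
```

On the real data, the analogous call would be `cor(food_dataset_clean$data_protein, food_dataset_clean$data_sugar_total, method = "spearman")`; the result depends on the data and is not shown here.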
ggplot(food_dataset_clean, aes(x = data_major_minerals_iron, y = data_major_minerals_calcium)) +
  geom_col(aes(fill = data_protein)) +
  # The fill is mapped to data_protein, so the legend is labeled accordingly
  scale_fill_viridis_c(name = "Amount of Protein (g)") +
  scale_x_continuous(name = "Iron (mg)", limits = c(0, 0.25)) +
  scale_y_continuous(name = "Calcium (mg)", limits = c(0, 150))
In Figure 2, I limited the foods I examined to those with 0.25 mg of iron or less. Because most of the observations for iron appeared to be below 1 mg, I limited the scale to make the graph more readable. The x-axis shows iron and the y-axis shows calcium, both in milligrams. The fill color of the bars shows the amount of protein (g) in each food. Looking at the graph, the foods with less than 0.25 mg of iron also seem to have a lot of calcium (many reaching almost 150 mg). Therefore, there might be a relationship between calcium and iron levels. However, since this plot is limited to foods with iron levels below 0.25 mg, it would be difficult to make that claim for the entirety of the data. Most of the food items also appear to have low amounts of protein (shown by the purple fill color).
# Subset to four categories (the name milkorcheese is kept from the original)
milkorcheese <- food_dataset_clean |>
  filter(category %in% c("Milk", "Cheese", "Beef", "Fish"))

# geom_col() stacks the fiber values, so each bar's length is the category's total fiber
ggplot(milkorcheese, aes(x = data_fiber, y = category)) +
  geom_col(fill = "steelblue") +
  labs(y = "Category of Food", x = "Total Grams of Fiber in a Food Group")
In the bar graph above, Figure 3, I limited the categories of food to Fish, Cheese, Beef, and Milk. On the y-axis, I plotted the category of food. On the x-axis, I plotted the total grams of fiber found in each food group. Looking at the graph, I can see that the category “beef” has the largest total fiber while the category “fish” has the least. From this graph, we may be able to hypothesize that beef, on average, has more fiber than fish, cheese, or milk. However, we would need to check the number of observations in each category, since a category with more observations could have a higher fiber total simply because more values are summed.
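To separate "more fiber per item" from "more items", one option is to report the number of observations, the total, and the mean side by side. A minimal sketch on made-up values (the tibble `toy_foods` is hypothetical) shows how the total and the mean can disagree:

```r
library(dplyr)

# Hypothetical toy data: Beef has more rows than Fish
toy_foods <- tibble(
  category   = c("Beef", "Beef", "Beef", "Beef", "Fish"),
  data_fiber = c(1, 1, 1, 1, 2)
)

# Report count, total, and mean together so totals are not mistaken for averages
fiber_by_category <- toy_foods |>
  group_by(category) |>
  summarize(
    n_items     = n(),
    total_fiber = sum(data_fiber),
    mean_fiber  = mean(data_fiber)
  )

fiber_by_category
```

In this toy example Beef has the larger total only because it has more rows, while Fish has the larger mean. The same summary applied to `milkorcheese` would show whether that is happening in the real data.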
This data is from the United States Department of Agriculture’s Food Composition Database. The dataset contains a large amount of nutritional information for many foods and gives people with dietary restrictions quick and easy access to it. This data can have many uses. As mentioned previously, people with dietary restrictions can quickly look up information on foods and food companies. Joined with other datasets, it can help researchers identify the kinds of nutrition that food deserts fail to provide to the people living in them (Reynolds, Block, and Bradley 2018). Lastly, grocery store owners can use this database to sort produce and buy it evenly when stocking their shelves. One possible harm is to the food companies themselves: consumers who were unaware of certain nutritional aspects of their food may use the newfound information to sue big food companies.
This data can potentially be used for future SDS 100 classes. In Lab 9, when learning how to create tables, we ordered fruits. I think it would be interesting to see how many kinds of tables and plots can be created from the data. The exercise could be to add some nutritional data about the fruits and/or other produce to the table.