A Student’s Review of Expert Reviews of Chocolate Bars

Proposal

Author

Minwoo Kang, Jason Levine, Victoria Midkiff, Dani Trejo

library(tidyverse)
library(janitor)
library(maps)

chocolate <- read_csv("data/flavors_of_cacao.csv")

chocolate <- chocolate |>
  clean_names()

con2cont <- read_csv("data/countryContinent.csv")


world <- map_data("world")

Dataset

The chocolate data set contains information about chocolate bars and reviews written by experts. It contains 9 variables with 1795 observations. The data contain information about the company who makes the chocolate bars, the percent of cocoa, where the company is located, bean type & origin, and review (out of 5 stars). It comes from the kaggle data set, Chocolate Bar Ratings: Extensive EDA, by Will Canniford. The original data can be found here. Here is a small glimpse into what our data looks like. More detail can be found in our data dictionary located in data/README.md.

glimpse(chocolate)

Rows: 1,795
Columns: 9
$ company_maker_if_known           <chr> "A. Morin", "A. Morin", "A. Morin", "…
$ specific_bean_origin_or_bar_name <chr> "Agua Grande", "Kpime", "Atsane", "Ak…
$ ref                              <dbl> 1876, 1676, 1676, 1680, 1704, 1315, 1…
$ review_date                      <dbl> 2016, 2015, 2015, 2015, 2015, 2014, 2…
$ cocoa_percent                    <chr> "63%", "70%", "70%", "70%", "70%", "7…
$ company_location                 <chr> "France", "France", "France", "France…
$ rating                           <dbl> 3.75, 2.75, 3.00, 3.50, 3.50, 2.75, 3…
$ bean_type                        <chr> " ", " ", " ", " ", " ", "Criollo", "…
$ broad_bean_origin                <chr> "Sao Tome", "Togo", "Togo", "Togo", "…

The second data set, con2cont, contains information about different countries and the continents and sub-regions they are located in, their official nomenclature, and country/regional codes. The data set has 9 columns and 249 rows with each row representing a country. Through this data set, we will be mapping the different country name data of company locations with their respective continents. Since we are grouping the company locations with their continents, we only select the variables that contain information about the country names and the continent they are located in to be used in our analysis. The data comes from Kaggle and the original data can be found in this link and more description of the data set can be found in data/README.md.

glimpse(con2cont)

Rows: 249
Columns: 9
$ country         <chr> "Afghanistan", "\xc5land Islands", "Albania", "Algeria…
$ code_2          <chr> "AF", "AX", "AL", "DZ", "AS", "AD", "AO", "AI", "AQ", …
$ code_3          <chr> "AFG", "ALA", "ALB", "DZA", "ASM", "AND", "AGO", "AIA"…
$ country_code    <dbl> 4, 248, 8, 12, 16, 20, 24, 660, 10, 28, 32, 51, 533, 3…
$ iso_3166_2      <chr> "ISO 3166-2:AF", "ISO 3166-2:AX", "ISO 3166-2:AL", "IS…
$ continent       <chr> "Asia", "Europe", "Europe", "Africa", "Oceania", "Euro…
$ sub_region      <chr> "Southern Asia", "Northern Europe", "Southern Europe",…
$ region_code     <dbl> 142, 150, 150, 2, 9, 150, 2, 19, NA, 19, 19, 142, 19, …
$ sub_region_code <dbl> 34, 154, 39, 15, 61, 39, 17, 29, NA, 29, 5, 145, 29, 5…

chocolate <- chocolate |>
  mutate(
    company_location_country = case_when(
      company_location == "U.S.A." ~ "United States of America",
      company_location == "U.K." ~ "United Kingdom of Great Britain and Northern Ireland",
      company_location == "Wales" ~ "United Kingdom of Great Britain and Northern Ireland",
      company_location == "Scotland" ~ "United Kingdom of Great Britain and Northern Ireland",
      company_location == "Russia" ~ "Russian Federation",
      company_location == "South Korea" ~ "Korea (Republic of)",
      company_location == "Amsterdam" ~ "Netherlands",
      company_location == "Sao Tome" ~ "Sao Tome and Principe",
      company_location == "Bolivia" ~ "Bolivia (Plurinational State of)",
      company_location == "Venezuela" ~ "Venezuela (Bolivarian Republic of)",
      company_location == "St. Lucia" ~ "Saint Lucia",
      company_location == "Vietnam" ~ "Viet Nam",
      company_location == "Domincan Republic" ~ "Dominican Republic",
      company_location == "Niacragua" ~ "Nicaragua",
      company_location == "Eucador" ~ "Ecuador",
      TRUE ~ company_location
    )
  )

con2cont <- con2cont |>
  select(country, continent)

con2cont <- con2cont |>
  rename(company_location_country = country)

chocolate_merged <- chocolate |>
  left_join(con2cont,
            multiple = "all")

#| label: glimpsing-maps

glimpse(world)

Rows: 99,338
Columns: 6
$ long      <dbl> -69.89912, -69.89571, -69.94219, -70.00415, -70.06612, -70.0…
$ lat       <dbl> 12.45200, 12.42300, 12.43853, 12.50049, 12.54697, 12.59707, …
$ group     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ order     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17, 18, 1…
$ region    <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba…
$ subregion <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

world_df <- left_join(
  world,
  chocolate |>
    group_by(broad_bean_origin) |>
    summarise(rating = mean(rating)),
  by = c("region" = "broad_bean_origin")
)

Why we chose the chocolate dataset

We chose this dataset because we believe it had a variety of interesting variables to look at in relationship to the rating the chocolate bar was given. We were especially intrigued by the dataset containing both time and geospatial data. We believe that this will allow us to make informative and visually pleasing graphics. We also all love chocolate and want to understand what makes chocolate successful.

Questions

These are the two questions we want to answer.

What characteristics lead to the most successful chocolate bars?
How have preferences among chocolate bar reviewers changed over time?

Analysis plan

To answer the first question we will plot chocolate bar review scores against various explanatory variables. The first plot we will make will show the effect of cocoa percentage compared to reviews on a scatterplot. We will plot the cocoa percentage on the x-axis vs the review score on the y-axis for the 10 most popular companies. This will demonstrate the preference for cocoa percentage among chocolate bar reviewers and also highlight if trends differ between the popular companies. The second plot we will make will show the spatial relationship between the preferences for cocoa beans. This plot will be a world map that colors country based on the average rating for bars that contain beans originated from that country. This will be an interesting and athletically pleasing visualization to demonstrate spatial relations between the preferences of bean origins. To make this visualization we will use the maps library which contains the world data set. This contains the required mapping coordinates to create our visualization.
In visualizing the first plot of question (2), we plan to create line plots by year that show the relationship between the cocoa percentage of the chocolate bar and their ratings. This visualization will help us understand how the preference in cocoa percentage level has changed over the years from 2006 to 2017. In creating the variable with year information that will be used in faceting the plots, we will be grouping the year information provided in review_date in two’s and will create a new character variable using mutate function with conditional arguments that represents the assigned years (e.g. “2006 ~ 2007,”2008 ~ 2009”, etc). The variable will then be factored to be in chronological order. For the plot, our x-axis variable is the year and the y-axis variable is the cocoa percentage. The plot will be faceted into 6 time periods assigned within 2006 ~ 2017. Since we would only have a single line in each of the faceted grids, we will modify the color as well as the thickness (size) of the line for better readability. Our second plot will be similar to the first plot in that it will also be exploring how the preference for a certain coffee bean characteristic has changed over time. The plot will be looking at the relationship between the company location/origin and their review ratings for each of the year groups created in completing the first plot. Through this plot, we will be able to better understand which continental group of companies are preferred over time. In order to do this, we plan to implement stacked bar plots for each respective faceted year using geom_bar() function from ggplot2. To group the company locations according to their continents, I used an external dataset named “countryContinent.csv” to map the location data to their continents. I proceeded with data cleaning above using mutate() function to fix typos as well as different nomenclature for country names and left-joined the external dataset by the country names to map the continent information. Then similar to the faceting year variable, we will group the review ratings in 0.5 intervals (“0.0 ~ 0.5”, “0.5 ~ 1.0”) where the lower bound is exclusive with the exception of the first group. For our plot, we will count the number of ratings for each continent where the companies are located in and accumulate the count of continents for each level of rating using the stacked bar plots. Hence, the x-axis variable will be each review rating level and the y-axis variable will be the count of continents. We will differentiate the continents by using individual colors for easier readability.