Horror Movie Visualizations

STA/ISS 313 - Spring 2023 - Project 1

Author

Viz Villains: Christopher Tsai, Chris Liang, Jason Zhang, Kevin Ordet

Abstract

This project uses data extracted from The Movie Database (TMDB) through the TMDB API, using the R package httr. Its purpose is to analyze a) how factors surrounding horror as a genre (e.g., number of releases and budgets) have changed over time and b) how different subgenres are associated with certain features of horror films. We aim to create compelling visualizations that address these two research questions and deliver findings that can inform next steps for analysis.


Introduction

This dataset contains all recorded releases of films dating back to the 1950s that list horror as one of their genres. These films range from small productions to blockbusters. Key variables for each film include, but are not limited to, name and title, language, tagline/description, release date, popularity, vote count and average rating, total budget and revenue, runtime, and genre tags. Popularity, votes, and average rating come from The Movie Database (TMDB), an online movie rating platform.

Question 1: How has the popularity of horror movies changed over time?

Introduction

For this question, we wanted to look at how the popularity of horror movies changed from 1980 to 2022. The first step was to figure out how to measure popularity over time. We decided that the number of movies released in theaters in a given year (which we defined as having a budget of over $2,000,000 in 2022 dollars) would be a good estimator of the popularity of horror movies in that year. We also looked at English-language and foreign-language horror films separately to see whether their popularity trends differed.

To make these plots, we also needed the budget and profit data in the dataset. However, these amounts were given in the dollar value of the year each movie was released. To make them comparable, we found a dataset showing how much one 1950 dollar was worth in each year. Because we figured 2022 dollars would make more sense to the audience, we rescaled the index so that every budget and profit amount is expressed in 2022 dollars. We were interested in this question because, given the limitations of our data, it was one of the few that might show an interesting trend, and because we wanted to see how the climate/environment (political events, recent big horror movies, etc.) may affect the number of films released in a given year.
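The rescaling described above can be sketched in base R. This is only an illustration under stated assumptions: the index values are made-up placeholders (not real inflation figures), and the data frames and column names (`index`, `movies`, `dollar_1950_value`, `adj_budget`) are hypothetical rather than the ones used in our analysis.

```r
# Hypothetical index: what one 1950 dollar was worth in each year's dollars.
# These numbers are placeholders for illustration, not real inflation data.
index <- data.frame(
  year = c(1980, 2000, 2022),
  dollar_1950_value = c(3.4, 7.1, 12.0)
)

# Hypothetical movies with budgets in release-year dollars.
movies <- data.frame(
  title = c("A", "B", "C"),
  release_year = c(1980, 2000, 2022),
  budget = c(1e6, 1e6, 1e6)
)

# Look up each movie's release-year index value and the 2022 index value.
idx_release <- index$dollar_1950_value[match(movies$release_year, index$year)]
idx_2022 <- index$dollar_1950_value[index$year == 2022]

# Rescale: release-year dollars -> 1950 dollars -> 2022 dollars.
movies$adj_budget <- movies$budget / idx_release * idx_2022

# Flag movies above the $2,000,000 (2022 dollars) theatrical threshold.
movies$theatrical <- movies$adj_budget > 2e6
```

Dividing by the release-year index converts an amount to 1950 dollars, and multiplying by the 2022 index converts it forward again, so a budget from 2022 is left unchanged.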

Approach

To answer this question, we make two line plots using geom_line(), with geom_smooth() adding a trend line that shows the short-term trends in the data. We also make a scatter plot using geom_point() to look at individual observations.

We first made a scatter plot where each point corresponds to the total number of horror movies made for theatrical release in a given year. Because we wanted to look at individual observations (i.e., the year 2009 and how many movies were made then), a scatter plot with circles around the observations of interest was the best way to do so. We then made a line plot using geom_line() and used geom_smooth() to add a trend line. With this plot, we wanted to look first at the general trend of movie popularity over time and then relate it to the socioeconomic/political state of the United States at the time. The trend line from geom_smooth() makes it easier to see how the popularity of horror movies moved around certain events, and when horror movies began to increase in popularity, decrease in popularity, and peak.
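To make concrete what the trend line represents, here is a minimal sketch of a local regression (loess) fit in base R, which is the method geom_smooth() reported using for our data. The yearly counts below are simulated, and the names `yearly` and `trend` are hypothetical; as far as we know, the default span of 0.75 is the same in loess() and geom_smooth().

```r
# Simulated yearly counts; the real values come from horror_new.
set.seed(1)
yearly <- data.frame(release_year = 1980:2022)
yearly$count <- round(
  10 + 0.8 * (yearly$release_year - 1980) + rnorm(nrow(yearly), sd = 4)
)

# Fit a local regression of count on year with the default span of 0.75.
fit <- loess(count ~ release_year, data = yearly)

# Smoothed value at each observed year, i.e. the height of the trend line.
yearly$trend <- predict(fit)
```

A smaller span would follow year-to-year swings more closely, while a larger one would show only the long-run direction.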

The second plot shows the total budget and profit for each year using geom_line(), as well as a trendline for each time series using geom_smooth(). Additionally, at the bottom of the plot is a column plot of the total number of movies for each year using geom_col(). This visualization allows us to understand the popularity of horror movies over time without using a statistic like the mean or median, which would be heavily skewed by very small movies. We felt that the total profit and budget were better proxies for popularity, and the column plot at the bottom adds context.
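The yearly totals behind this second plot could be computed along the following lines. This is a sketch with made-up numbers, and the column names (`adj_budget`, `adj_profit`, `n_movies`) are assumptions for illustration.

```r
# Hypothetical per-movie data, with amounts in 2022 dollars.
movies <- data.frame(
  release_year = c(1999, 1999, 1999, 2000, 2000),
  adj_budget = c(50e6, 3e6, 2.5e6, 40e6, 4e6),
  adj_profit = c(200e6, -1e6, 0.5e6, 90e6, 2e6)
)

# Sum budget and profit within each release year (the two line series).
totals <- aggregate(
  cbind(adj_budget, adj_profit) ~ release_year,
  data = movies,
  FUN = sum
)

# Number of movies per year (the column plot at the bottom).
year_counts <- table(movies$release_year)
totals$n_movies <- as.vector(year_counts[as.character(totals$release_year)])
```

In this toy data the 1999 total profit is driven by a single large movie, while the median 1999 profit would be only $0.5 million, which illustrates why we preferred totals to a mean or median.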

Analysis

description1 <-
  "Horror Movies saw their popularity peak in 2010. Insidious and Paranormal Activity, both released in 2010, saw the highest dollars of profit per dollar of budget of all Horror Movies between 1980 and 2022." |>
  str_wrap(width = 40) #description for annotate

description2 <-
  "7 of the 16 Horror Movies with the biggest profit were released in 1999 or 2000, marking the start of a decade-long growth in the popularity of Horror Movies" |>
  str_wrap(width = 40)

description3 <-
  "Number of Horror Movies released in 2020 drastically decreases, likely due to COVID-19. The ability to make movies and the number of possible audience members decreased." |>
  str_wrap(width = 40)

plot_circle_purple <- horror_new |>
  group_by(release_year) |>
  summarise(count = n()) |>
  filter(release_year == 2009 |
           release_year == 2010) #create dataset for circles around plot points of note

plot_circle_blue <- horror_new |>
  group_by(release_year) |>
  summarise(count = n()) |>
  filter(release_year == 1999 | release_year == 2000)

plot_circle_black <- horror_new |>
  group_by(release_year) |>
  summarise(count = n()) |>
  filter(release_year == 2019 | release_year == 2020)

horror_new |>
  group_by(release_year) |>
  summarise(count = n()) |>
  ggplot(aes(x = release_year, y = count)) +
  geom_point(size = 1) +
  geom_point(
    data = plot_circle_purple,
    pch = 21,
    size = 5,
    colour = "purple"
  ) +
  geom_point(
    data = plot_circle_blue,
    pch = 21,
    size = 5,
    colour = "deepskyblue4"
  ) +
  geom_point(
    data = plot_circle_black,
    pch = 21,
    size = 5,
    colour = "black"
  ) + #adds circle around plot points of note
  annotate(
    "label",
    x = 2000,
    y = 67,
    label = description1,
    alpha = 0.6,
    size = 2.3,
    color = "purple"
  ) + #adds the annotation with the text in the same color as the annotation of note
  annotate(
    "label",
    x = 1995,
    y = 40,
    label = description2,
    alpha = 0.6,
    size = 2.3,
    color = "deepskyblue4"
  ) +
  geom_ellipse(aes(
    x0 = 2019.6,
    y0 = 30.7,
    a = 11.3,
    b = 1,
    angle =  -1.45 * pi / 3
  ),
  color = "black") + #draws an ellipse around the 2019 and 2020 points
  annotate(
    "label",
    x = 2012,
    y = 20,
    label = description3,
    alpha = 0.8,
    size = 2.3
  ) +
  labs(
    title = "How the number of horror movies launched for \ntheatrical release changed over time",
    subtitle = "From 1980-2020",
    x = "Year",
    y = "Number of Horror Movies"
  ) +
  theme(
    plot.title = element_text(size = 12),
    #increases size of plot title
    plot.subtitle = element_text(size = 10),
    axis.title = element_text(size = 10)
  ) #no legend settings needed: color is not mapped to a variable in this plot
Warning: Using the `size` aesthetic in this geom was deprecated in
ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and
  elsewhere instead.

description1 <-
  "9/11: Terrorists crash planes into the World Trade Center. Questions about American security arise." |>
  str_wrap(width = 40) #description to put in annotation

description2 <-
  "The stock market crashes in 2009. Questions about American economic security arise." |>
  str_wrap(width = 40)

horror_new |>
  mutate(en_language = ifelse(original_language == "en", "en", "non-en")) |>
  group_by(release_year, en_language) |>
  summarise(count = n()) |>
  ggplot(aes(x = release_year, y = count, color = en_language)) + #adds different line plot based on the language the movie was in
  geom_point(alpha = 0.3) +
  geom_line(alpha = 0.3) +
  geom_smooth(se = FALSE) +
  geom_segment(
    x = 2001,
    xend = 2001,
    y = -5,
    yend = 27,
    col = "black",
    linewidth = 1.5
  ) + #adds vertical line at 2001
  geom_segment(
    x = 2009,
    xend = 2009,
    y = -5,
    yend = 45,
    col = "black",
    linewidth = 1.5
  ) +
  annotate(
    "label",
    x = 1995,
    y = 35,
    label = description1,
    alpha = 0.6,
    size = 2.3
  ) +
  annotate(
    "label",
    x = 2015.5,
    y = 51,
    label = description2,
    alpha = 0.6,
    size = 2.3
  ) +
  labs(
    title = "Uncertainty and Anxiety: \nHow Horror Flicks' Popularity Correlates with a \nNeed for an Outlet of Fears",
    x = "Year",
    y = "Total Number of Movies for Theatrical Release",
    color = "Language"
  ) +
  theme(
    legend.key.size = unit(0.25, 'cm'),
    legend.key.height = unit(0.35, 'cm'),
    legend.key.width = unit(0.25, 'cm'),
    plot.title = element_text(size = 12),
    axis.title = element_text(size = 10)
  ) +
  scale_color_discrete(labels = c('English', 'Other'))
`summarise()` has grouped output by 'release_year'. You can
override using the `.groups` argument.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Discussion

The first thing we noticed in our plot was a peak in the number of horror movies released in theaters between 2008 and 2010. We wanted to understand whether this was an anomaly or whether these years were truly the most popular years for horror movies. We created a new variable measuring the dollars of profit a movie made per dollar of budget and saw that the two movies with the highest ratio of profit to budget were released in 2010, which supports our suspicion that 2008 to 2010 were the most popular years for horror movies. We also noticed that between 1999 and 2001 the number of horror movies released in theaters began to increase drastically. Here, we found that 7 of the 16 movies with the largest profit (in 2022 dollars) were released in 1999 or 2000. It is likely that movie directors and production companies noticed this potential for profit and began to release more horror movies. Finally, we noticed a sharp decrease in the number of horror movies released from 2019 to 2020. We think this was likely a result of the COVID-19 pandemic, which limited production companies' ability to film movies and shrank the audience to make movies for.
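The profit-per-budget-dollar variable mentioned in this discussion could be computed along these lines. This is a minimal sketch with made-up movies; the column names (`adj_budget`, `adj_profit`, `profit_per_dollar`) are assumptions for illustration.

```r
# Hypothetical movies, with amounts in 2022 dollars.
movies <- data.frame(
  title = c("Low-budget hit", "Blockbuster", "Flop"),
  adj_budget = c(2e6, 100e6, 20e6),
  adj_profit = c(150e6, 300e6, -5e6)
)

# Dollars of profit earned per dollar of budget.
movies$profit_per_dollar <- movies$adj_profit / movies$adj_budget

# Rank movies from the highest ratio to the lowest.
ranked <- movies[order(-movies$profit_per_dollar), ]
```

A ratio like this rewards small productions that earn far more than they cost, so it can rank a low-budget hit above a blockbuster with a larger absolute profit.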

We then wanted to delve into the socioeconomic/political state of the US and how it potentially affected the popularity of horror movies. Originally we had wanted to look at international films as well, but there were too few of them for us to be certain about the trend in their popularity. The graph showed that the number of horror movies released began to increase around 1996, with the most drastic increase in 2001 and 2002. The number of horror movies then peaked in 2009 and 2010. We again wanted to see whether any events in these years may have affected the socioeconomic/political state of the US, and there were. In 2001, Al-Qaeda terrorists crashed planes into the World Trade Center, causing uncertainty and panic about American national security. In 2009, the stock market crashed, causing concern and anxiety about American economic security. We then hypothesized that the popularity of horror movies drastically increased and peaked in these years because horror movies serve as an outlet for human fears, and in times of mass uncertainty and fear that outlet is needed. Following that idea, we would expect the number of horror movies to increase in the years following the uncertainty and panic caused by the COVID-19 pandemic.

Question 2: How do profitability and tagline sentiment differ across subgenres?

Introduction

We looked at several numerical variables for possible multivariate plots and found that this would be very difficult. The voting variables carry little meaning for our purposes, and the popularity variable is faulty because it is based on searches from the day the data were pulled. We also looked at inflation-adjusted budget, adjusted profit, and runtime, and saw no meaningful correlation between any pair of these variables: plotted on a scatterplot, they show just a random scatter of points.

As a result, we decided to look at more qualitative attributes of the data: the subgenres of the horror movies and the sentiments of their taglines. We wanted to understand whether these qualitative markers of a horror movie (How scary or negative does the tagline seem? Does the movie fall into a bucket of very profitable, profitable, or unprofitable?) vary with the subgenre of the film. Our visualizations therefore all split the data by the most popular subgenres (the top 6, as there was a large fall-off in movie counts afterwards). We are looking at whether movies differ in specific traits across subgenres.

Approach

We look primarily at bar plots and boxplots, because those types of plots are well suited to categorical variables. For bar plots, we use fill to represent the profit or sentiment category and facet by another variable (e.g., language) when we want a multivariate plot. Bar plots show the frequency of a certain event: the count of movies that are profitable or unprofitable, that fall under a certain type of sentiment, etc. So we often take that count or frequency as one axis and use facets or colors to represent the different characteristics of these movies. We use stacked bar plots, filled bar plots, etc. to represent the proportions of different characteristics (the profitability bucket of a movie or the sentiment split) by subgenre.
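To illustrate the difference between a stacked and a filled bar plot, the sketch below computes the two sets of bar heights by hand in base R. The data are made up, and the names `movies`, `counts`, and `props` are hypothetical.

```r
# Hypothetical subgenre and profit labels for seven movies.
movies <- data.frame(
  genre = c("Thriller", "Thriller", "Thriller", "Thriller",
            "Mystery", "Mystery", "Mystery"),
  profit = c("Very Profitable", "Profitable", "Unprofitable", "Unprofitable",
             "Very Profitable", "Very Profitable", "Unprofitable")
)

# Counts of each profit level within each subgenre (stacked bar heights).
counts <- table(movies$genre, movies$profit)

# Row-wise proportions, so each subgenre sums to 1 (filled bar heights).
props <- prop.table(counts, margin = 1)
```

Filled bars make subgenres of very different sizes comparable, at the cost of hiding how many movies each bar is based on.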

Boxplots let us see the distribution of a numerical variable broken down by subgenre. In this case, the numerical variable is the average sentiment of a movie’s tagline. Each tagline word that appears in the AFINN lexicon is assigned a sentiment score, and the average of those scores gives the sentiment of the whole tagline. This score is computed per movie, and by using fill/color to distinguish the genres, we can see how positive or negative the taglines are within each one.
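The per-tagline averaging can be sketched in base R as follows. The lexicon here is a toy stand-in with made-up scores (not the real AFINN values), and the taglines and the helper `avg_sentiment()` are hypothetical.

```r
# Toy lexicon standing in for AFINN; the scores are made up for illustration.
lexicon <- c(fear = -2, evil = -3, love = 3, dead = -3, hope = 2)

# Hypothetical taglines, named by movie.
taglines <- c(
  m1 = "Fear the evil within",
  m2 = "Love never stays dead",
  m3 = "Nothing to see here"
)

# Average the scores of the tagline words that appear in the lexicon.
# Taglines with no scored words get NA rather than 0.
avg_sentiment <- function(tagline) {
  words <- strsplit(tolower(tagline), "[^a-z']+")[[1]]
  scores <- lexicon[words[words %in% names(lexicon)]]
  if (length(scores) == 0) NA_real_ else mean(scores)
}

averagesent <- vapply(taglines, avg_sentiment, numeric(1))
```

Only words found in the lexicon contribute to the average, so a tagline with a single scored word gets that word's score, and a tagline with none drops out of the boxplot.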

Analysis

#pivoting longer with separate_rows
horror_new2 <- horror_new |>
  separate_rows(genre_names, sep = ",\\s+")

#popular genres
popular_genre_names <- horror_new2 |>
  group_by(genre_names) |>
  summarise(n = n()) |>
  mutate(prop = n / sum(n)) |>
  arrange(desc(prop)) |>
  head(6)


## Q2: Subgenres and Profitability

horror_new2_1 <- horror_new2 |>
  filter(genre_names %in% popular_genre_names$genre_names) |>
  mutate(en_language = ifelse(original_language == "en", "English", "Non-English")) |>
  mutate(profit = case_when(adj_profit2020 < 0 ~ "Unprofitable",
                            adj_profit2020 >= 0 & adj_profit2020 < adj_budget2020 ~ "Profitable",
                            adj_profit2020 >= adj_budget2020 ~ "Very Profitable")) |>
  mutate(profit = fct_relevel(profit, c("Very Profitable", "Profitable", "Unprofitable")),
         genre_names = fct_relevel(genre_names, c("Horror", "Action", "Science Fiction", "Drama", "Thriller", "Mystery")))


#plot 1 
ggplot(horror_new2_1, aes(y = genre_names, fill = profit)) +
  geom_bar(position = "fill") +
  scale_y_discrete(limits = rev) +
  labs(title = "Differences in Profitability Between Subgenres",
       subtitle = "for Horror Movies",
       x = "Proportion",
       y = "Subgenre",
       fill = "Profit Level") +
  scale_fill_manual(values = c("navy", "steelblue1", "lightblue"))  +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

ggplot(horror_new2_1, aes(y = genre_names, fill = profit)) +
  geom_bar(position = "fill") +
  facet_wrap(. ~ en_language, ncol = 1) +
  scale_y_discrete(limits = rev)+
  labs(title = "Differences in Profitability Between Subgenres",
       subtitle = "for Horror Movies, by Language",
       x = "Proportion",
       y = "Subgenre",
       fill = "Profit Level") +
  scale_fill_manual(values = c("navy", "steelblue1", "lightblue")) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(plot.subtitle = element_text(hjust = 0.5))

sentiment_horror <- horror_new2 |>
  select(title, genre_names, tagline) |>
  separate(tagline, into = paste0("word", 1:20), sep = " ") |>
  pivot_longer(cols = starts_with("word"), 
               names_to = "word_number", 
               values_to = "word") |>
  drop_na(word)
Warning: Expected 20 pieces. Additional pieces discarded in 42
rows [275, 276, 372, 777, 778, 779, 844, 845, 846, 1162, 1408,
1409, 1410, 1411, 1412, 1741, 1742, 1743, 1889, 1890, ...].
Warning: Expected 20 pieces. Missing pieces filled with `NA` in
3406 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, ...].
#create two dataframes based on different sentiment lexicons
sentiment_joined <- sentiment_horror |>
  inner_join(get_sentiments("bing"))
Joining, by = "word"
sentiment_afinn <- sentiment_horror |>
  inner_join(get_sentiments("afinn"))
Joining, by = "word"
afinn_summary <- sentiment_afinn |>
  group_by(title) |>
  summarise(averagesent = mean(value))

newsentiment <- horror_new2 |>
  inner_join(afinn_summary, by = c("title"))

#PLOTS
notstacked <- sentiment_joined |>
  group_by(genre_names, sentiment) |>
  filter(genre_names %in% popular_genre_names$genre_names) |>
  summarise(count = n()) |>
  ggplot(aes(x = fct_reorder(genre_names, count), y = count, fill = sentiment)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("navy", "steelblue1"),
                    labels = c("Negative", "Positive")) +
  theme(legend.position = "top") +
  labs(x = "Genres", y = "Count of Negative/Positive Scores \n by BING Lexicon",
       title = "The Sentiment of Words in Horror Movie Taglines",
       subtitle = "Counted for each word, by genre",
       fill = "Sentiment",
       caption = "The BING Lexicon produces \n negative or positive for a word")
`summarise()` has grouped output by 'genre_names'. You can
override using the `.groups` argument.
notstacked

## Q2 Sentiment by Proportion with BING Lexicon
stacked <- sentiment_joined |>
  group_by(genre_names, sentiment) |>
  filter(genre_names %in% popular_genre_names$genre_names) |>
  summarise(count = n()) |>
  ggplot(aes(x = fct_reorder(genre_names, count), y = count, fill = sentiment)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_fill_manual(values = c("navy", "steelblue1"),
                    labels = c("Negative", "Positive")) + 
  theme(legend.position = "top") +
  labs(x = "Genres",
       y = "Proportion of \n Positive/Negatives",
       title = "The Sentiment of Words in Horror Movie Taglines",
       subtitle = "Neg/Pos proportion of sentiments by genre",
       fill = "Sentiment",
       caption = "The BING Lexicon produces \n negative or positive for a word")
`summarise()` has grouped output by 'genre_names'. You can
override using the `.groups` argument.
stacked

## Q2: Sentiment by Subgenre by AFINN Lexicon 

newsentiment |>
  filter(genre_names %in% popular_genre_names$genre_names) |>
  ggplot(aes(x = fct_reorder(genre_names, averagesent), y = averagesent,
             fill = genre_names)) +
  geom_boxplot(show.legend = FALSE, color = "darkblue") +
  coord_flip() +
  scale_fill_manual(
    values = c("lightslateblue","steelblue3","steelblue1","dodgerblue","cornflowerblue","lightskyblue")) +
  labs(y = "Average Sentiment Score for Words in a Movie Tagline",
       x = "Genre or Subgenre",
       title = "AFINN Sentiment Score of Movie Taglines",
       subtitle = "By Genre of Movie",
       caption = "AFINN Score: Measure of positive/negative \n sentiment, from -5 to 5") +
  theme(
    plot.title = element_text(size = 20),
    plot.subtitle = element_text(size = 18),
    axis.title = element_text(size = 18))

Discussion

X(1-3 paragraphs): In the Discussion section, interpret the results of your analysis. Identify any trends revealed (or not revealed) by the plots. Speculate about why the data looks the way it does.