Proposal for Eurovision Data Visualization Report

Federico Arboleda, Ryan Hu, Chandler Naylon, Aryan Poonacha

Dataset

The dataset covers every song and contestant that has performed in Eurovision history, along with associated data such as year, host city and country, points received, and final rank. It contains 2005 observations and 18 variables. The accompanying eurovision-votes.csv records which countries voted for which other countries in each year of the contest.

We chose this dataset because it contains a substantial mix of numerical and categorical variables with little missing data, giving us significant freedom to create different visualizations. We also liked that the data is arranged by year, dating back to the first Eurovision in 1956, which will allow us to visualize relationships over time. Finally, since the United States does not participate in Eurovision and consequently gives the competition minimal publicity, we are excited to educate our audience on a largely unfamiliar topic.

Questions

The two questions we want to answer are:

  1. Is there a home country advantage in Eurovision?

  2. Does the order in which contestants perform affect their success?

Analysis Plan

Our first question asks whether contestants representing the host country tend to perform better than visiting contestants. We will first create a binary variable, from_host_country, that indicates whether a contestant was competing in their home country: it is “true” when host_country and artist_country match and “false” otherwise. Our first plot will be a histogram of total_points or rank for host-country participants, faceted by decade, to see whether home country advantage was more prevalent in some eras than in others. A ridge plot is an alternative, and we will choose between the two once we can compare both renderings. For our second plot, we will use total_points or rank as a measure of success and compare its distribution between host-country and non-host-country participants with a histogram faceted by from_host_country. A violin plot may make this comparison easier to read, and we will again decide which to use once we have seen both.
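As a rough sketch of the derivation described above, the flag and the decade used for faceting could be computed as follows. This is only an illustration in plain Python: the sample rows use made-up placeholder countries and points rather than real results, the helper name add_host_flag is hypothetical, and a Python boolean stands in for the “true”/“false” values.

```python
# Made-up placeholder rows; field names follow the variables named in the proposal.
rows = [
    {"year": 1956, "host_country": "Country A", "artist_country": "Country A", "total_points": 30},
    {"year": 1956, "host_country": "Country A", "artist_country": "Country B", "total_points": 12},
    {"year": 1987, "host_country": "Country C", "artist_country": "Country B", "total_points": 45},
]


def add_host_flag(row):
    """Return a copy of the row with from_host_country and decade added."""
    out = dict(row)
    # True when the contestant represents the country hosting that year's contest.
    out["from_host_country"] = row["host_country"] == row["artist_country"]
    # Round the year down to the start of its decade, e.g. 1987 -> 1980.
    out["decade"] = (row["year"] // 10) * 10
    return out


flagged = [add_host_flag(r) for r in rows]
```

An exact string match assumes the two country columns spell each country the same way; any naming differences would need to be reconciled before matching.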

Our second question asks whether the order in which contestants perform is related to their success in Eurovision. We plan to use running_order to represent performance order, year for the temporal dimension, and either total_points or rank to measure success. Since there may be too many ranks to visualize concisely, we can bucket them into broader categories, such as first 10, middle 10, and final 10. One plot will be a scatter plot of running_order against total_points over time, with points colored by decade (derived from year). Another will be a temporal line plot with year on the x-axis and running_order on the y-axis, with lines colored by whether the contestant had the most, second-most, or third-most total_points at each competition. To support this, we plan to create a new variable, top3, with levels “first”, “second”, “third”, and “other” that indicates whether a contestant came in first, second, third, or otherwise in their competition by total points. The line plot will include only the observations whose top3 value is “first”, “second”, or “third”.
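The top3 variable and the rank buckets described above might be derived along these lines. This is a minimal sketch under several assumptions: the input holds one competition per year, ties in total_points are broken arbitrarily by sort order, the bucket cutoffs at ranks 10 and 20 are illustrative, and the helper names (top3_label, add_top3, rank_bucket) and sample values are hypothetical.

```python
def top3_label(place):
    """Map a 1-based placement to the proposed top3 levels."""
    return {1: "first", 2: "second", 3: "third"}.get(place, "other")


def add_top3(rows):
    """Label each row by its placement on total_points within its year."""
    by_year = {}
    for row in rows:
        by_year.setdefault(row["year"], []).append(row)
    labeled = []
    for group in by_year.values():
        # Highest total_points first; ties keep their input order (an assumption).
        ranked = sorted(group, key=lambda r: r["total_points"], reverse=True)
        for place, row in enumerate(ranked, start=1):
            labeled.append({**row, "top3": top3_label(place)})
    return labeled


def rank_bucket(rank):
    """Bucket a final rank into broad categories (cutoffs are illustrative)."""
    if rank <= 10:
        return "first 10"
    if rank <= 20:
        return "middle 10"
    return "final 10"


# Made-up placeholder rows, not real contest results.
sample = [
    {"year": 2000, "artist_country": "Country A", "running_order": 1, "total_points": 100},
    {"year": 2000, "artist_country": "Country B", "running_order": 2, "total_points": 50},
    {"year": 2000, "artist_country": "Country C", "running_order": 3, "total_points": 75},
    {"year": 2000, "artist_country": "Country D", "running_order": 4, "total_points": 10},
]
labeled = add_top3(sample)
# Keep only the rows that would appear in the line plot.
line_plot_rows = [r for r in labeled if r["top3"] != "other"]
```

Filtering out the “other” level before plotting keeps the line plot to three series, one per placement.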

We considered merging in the voting data to examine voting blocs in Eurovision history, but we decided that the relationship between running order and success at the competition was the question we most wanted to explore in this analysis.

References

https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-05-17/readme.md