Proposal for Eurovision Data Visualization Report
Federico Arboleda, Ryan Hu, Chandler Naylon, Aryan Poonacha
Dataset
The dataset covers every song and contestant that has performed in Eurovision history, along with associated data such as year, host city and country, points received, and final rank. It contains 2005 observations and 18 variables. The accompanying eurovision-votes.csv tallies which countries voted for which other countries in each year of the contest.
We chose this dataset because it contains a substantial mix of numerical and categorical variables with little missing data, giving us significant freedom to create different visualizations. We also liked that the data is arranged by year, dating back to the first Eurovision in 1956, which will allow us to visualize relationships over time. Finally, since the United States does not participate in Eurovision and the competition consequently receives minimal publicity there, we are excited to educate our audience on a largely unfamiliar topic.
Questions
The two questions we want to answer are:
Is there a home country advantage in Eurovision?
Does the order in which contestants perform affect their success?
Analysis Plan
Our first question asks whether contestants from the host_country in each competition tend to perform better than foreign contestants. We will visualize this relationship (or lack thereof) by first creating a binary variable called from_host_country for each observation that indicates whether the contestant was competing in their home country (done by matching the host_country and artist_country variables and returning “true” if they match and “false” if not). Our first plot will be a histogram of total_points or final rank of host country participants, faceted by decade, to examine whether home country advantage was more prevalent in certain eras than others. A ridge plot could also be used here; we will decide between the two once we can compare both visualizations. For our second plot, we will use total_points or rank for each contestant as a measure of success and compare the distributions of these metrics between host country participants and non-host country participants through a histogram faceted by from_host_country. A violin plot could also make this comparison easier to read, but we will again decide which to use upon visualization.
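The derivation of from_host_country and the decade facet described above could look roughly like the following. This is a minimal sketch in Python using only the standard library, not the project's actual code: the rows are made-up placeholders rather than real contest results, and the helper names decade_of, add_host_flag, and mean_points_by_flag are our own assumptions.

```python
from statistics import mean

# Hypothetical placeholder rows, NOT real Eurovision results.
rows = [
    {"year": 1998, "host_country": "A", "artist_country": "A", "total_points": 120},
    {"year": 1998, "host_country": "A", "artist_country": "B", "total_points": 80},
    {"year": 2004, "host_country": "C", "artist_country": "B", "total_points": 60},
    {"year": 2004, "host_country": "C", "artist_country": "C", "total_points": 40},
]


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1998 -> 1990."""
    return (year // 10) * 10


def add_host_flag(row):
    """Return a copy of the row with from_host_country and decade added."""
    out = dict(row)
    # True when the contestant represents the country hosting that year.
    out["from_host_country"] = row["host_country"] == row["artist_country"]
    out["decade"] = decade_of(row["year"])
    return out


def mean_points_by_flag(data):
    """Average total_points separately for host and non-host contestants."""
    groups = {True: [], False: []}
    for r in data:
        groups[r["from_host_country"]].append(r["total_points"])
    return {flag: mean(vals) for flag, vals in groups.items() if vals}


flagged = [add_host_flag(r) for r in rows]
summary = mean_points_by_flag(flagged)
print(summary)
```

A group mean is only a quick sanity check on the new variable; the faceted histograms (or ridge and violin plots) would still be needed to compare the full distributions.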
Our second question looks at whether the running_order in which contestants perform has a relationship with their success in Eurovision. To answer it, we plan to use the variable running_order to represent the order in which contestants perform, year for a temporal visualization, and either total_points or rank to represent how they perform in the competition. In addition, since there may be too many ranks to visualize concisely, we can bucket the ranks into broader categories, such as first 10, middle 10, and final 10. One plot can visualize the relationship between running_order and total_points over time using a scatter plot, with points colored by decade (a transformation of year), while another can be a temporal line plot, with year on the x-axis and running_order on the y-axis, and lines colored by whether the contestant had the most, second-most, or third-most total_points at each competition. We plan to create a new variable top3 with levels “first”, “second”, “third”, and “other” that indicates whether a contestant came in first, second, third, or otherwise in their competition by total points. We will then plot the observations whose value is “first”, “second”, or “third” in this line plot.
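The rank bucketing and the top3 variable described above could be sketched as follows. This is an illustrative Python example using only the standard library, under several assumptions of our own: the rows are made-up placeholders rather than real results, the bucket cutoffs (ranks 1 to 10, 11 to 20, and everything after) are one possible reading of "first 10, middle 10, and final 10", ties in total_points are broken by input order, and the helper names rank_bucket and add_top3 are hypothetical.

```python
# Hypothetical placeholder rows for one contest year, NOT real results.
rows = [
    {"year": 2010, "artist_country": "A", "running_order": 1, "total_points": 50, "rank": 3},
    {"year": 2010, "artist_country": "B", "running_order": 2, "total_points": 90, "rank": 1},
    {"year": 2010, "artist_country": "C", "running_order": 3, "total_points": 70, "rank": 2},
    {"year": 2010, "artist_country": "D", "running_order": 4, "total_points": 10, "rank": 4},
]

TOP3_LABELS = ["first", "second", "third"]


def rank_bucket(rank):
    """Collapse a rank into a coarse category (cutoffs are an assumption)."""
    if rank <= 10:
        return "first 10"
    if rank <= 20:
        return "middle 10"
    return "final 10"


def add_top3(data):
    """Label each row first/second/third/other by total_points within its year."""
    by_year = {}
    for r in data:
        by_year.setdefault(r["year"], []).append(r)

    labeled = []
    for year in sorted(by_year):
        # Highest total_points first; the sort is stable, so ties keep input order.
        ordered = sorted(by_year[year], key=lambda r: r["total_points"], reverse=True)
        for position, r in enumerate(ordered):
            label = TOP3_LABELS[position] if position < 3 else "other"
            labeled.append({**r, "top3": label, "rank_bucket": rank_bucket(r["rank"])})
    return labeled


labeled = add_top3(rows)
# Only the top three finishers per year would feed the line plot.
line_plot_rows = [r for r in labeled if r["top3"] != "other"]
```

Deriving top3 from total_points rather than from rank follows the wording of the plan; if the two ever disagree in the real data, that choice would need revisiting.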
We considered merging data so that we could look at voting blocs in Eurovision history, but we decided that the relationship between running order and success at the competition is what we wanted to examine more deeply in this analysis.
References
https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-05-17/readme.md