Analyzing World Cup Goals

Proposal

library(tidyverse)

Dataset

matches <- read.csv("data/wcmatches.csv")
wc <- read.csv("data/worldcups.csv")

Our data visualization team, RGodz, has elected to use FIFA World Cup data for our investigation. The 2022 World Cup recently came to a close in Qatar and was full of sensational storylines: FIFA’s corruption leading up to the event, the likely end of Cristiano Ronaldo’s international football career, and Lionel Messi’s fairy-tale victory in the final against a star-studded French side led by Kylian Mbappé. Given the headlines and buzz generated by this most recent installment of the World Cup, our team thought this data would be particularly exciting and relevant to analyze.

This FIFA World Cup data was made available on kaggle.com by user Evan Gower. The data is split into two datasets, which we have renamed matches and wc. Although the most recent World Cup sparked our interest in the data, the 2022 World Cup is not included in these datasets; our team will discuss whether it makes sense to add those observations in the context of our project.

The matches dataset includes every game played in the World Cup, from the group stage through the final, from Uruguay’s 1930 tournament to Russia’s in 2018. Its variables include home_team and away_team, which identify the teams that played in a given match, and outcome, which states whether the match ended in a win for the home team, a win for the away team, or a draw. The matches dataset has 15 variables (all explained in the README.md file of our team’s data folder), and thus 15 columns, along with 900 rows/observations representing all of the World Cup matches played from 1930 to 2018.

Our second dataset, wc, includes more general information about each individual World Cup. Its variables include host (the nation that hosted the tournament), year (the year it was played), and the nations that placed first through fourth. The wc dataset has 10 columns for its 10 variables (also explained in the README.md file of our team’s data folder) and 21 rows/observations representing the 21 World Cups held from 1930 to 2018.

Questions

1: How has the total number of goals scored in a match changed over time, and does the round in which a match is played correlate with the total number of goals scored in a match?

2: Over the history of the FIFA World Cup, which nations have had the most success across all tournaments from 1930 to 2018, as measured by semi-final appearances and overall goal differential?

Analysis plan

To answer our first question, we will be looking at the variables home_score, away_score, year, and stage. Since we are interested in the number of goals scored per match, we will create a new variable, total_goals, that is the sum of home_score and away_score. We will also need to recode the stage variable so that all group stage games fall under the same category; they currently have different names based on the group (e.g., “Group A”, “Group B”) but all belong to the same round of a World Cup. No external data is needed for this analysis.

For our first visualization, our current plan is to create a violin plot of the distribution of goals scored for each year, faceted by round. Year will be on the x-axis and goals scored on the y-axis, with each facet showing the distribution of matches from a specific stage of the tournament. We will need to reorder the levels of the round variable so that the facets appear in an order that makes sense in context (the first facet shows the group stage and the last shows the final). There are multiple ways to display these three variables, so we may change the plot type depending on how well the violin plot shows the relationships we are interested in.

Following a suggestion from one of our peer reviews, our second visualization for this question will be a line plot with the average number of goals scored per game on the y-axis and year on the x-axis, with one line per tournament round, colored by round. It will display the change in average goals scored per game over the years for each round of the World Cup. This should give us a better sense of how goals scored per match differ by year and round on average, as opposed to looking at the full distributions in the violin plots.
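The data preparation described above can be sketched as follows. This is a minimal sketch on a hand-made toy data frame, not the real matches data: the stage labels ("Group 1", "Semifinals", "Final"), the factor level order, and the names toy_matches, goals_by_match, round, and avg_goals are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for `matches` (hypothetical values; real stage labels may differ).
toy_matches <- data.frame(
  year       = c(1930, 1930, 1930, 2018, 2018, 2018),
  stage      = c("Group 1", "Semifinals", "Final", "Group A", "Group B", "Final"),
  home_score = c(4, 6, 4, 5, 3, 4),
  away_score = c(1, 1, 2, 0, 3, 2),
  stringsAsFactors = FALSE
)

goals_by_match <- toy_matches %>%
  mutate(
    # Goals scored by both teams in the match.
    total_goals = home_score + away_score,
    # Collapse every "Group ..." label into a single group stage category.
    round = if_else(grepl("^Group", stage), "Group stage", stage),
    # Order rounds as they occur in a tournament (assumed level set).
    round = factor(round, levels = c("Group stage", "Semifinals", "Final"))
  )

# Average goals per match for each year and round (input to the line plot).
avg_goals <- goals_by_match %>%
  group_by(year, round) %>%
  summarize(avg_goals = mean(total_goals), .groups = "drop")

# One line per round, colored by round.
avg_plot <- ggplot(avg_goals, aes(x = year, y = avg_goals, color = round)) +
  geom_line() +
  geom_point()
```

With the real data, the levels vector would need to list every round that actually appears after recoding, since any label missing from it becomes NA.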

To answer our second question, we plan to create two visualizations ranking participating nations on their all-time World Cup performance. For our first visualization, we will be looking at the variables winner, second, third, and fourth from the wc dataset. We will count the number of times from 1930 to 2018 that each country finished in the top four of the tournament, indicated by being listed as winner, second, third, or fourth in the wc dataset. This count will be on the plot’s x-axis, with country names on the y-axis sorted in descending order of count. This visualization will give us a sense of which teams have had the most success with deep runs in the tournament over time.

To answer the second part of our question, we will use the matches dataset. Analyzing each country’s goal differential first requires some data manipulation. We need to pivot the matches data frame longer so that each match has two rows, with the home and away columns replaced by team and opponent columns, so that the second row of each pair holds the same match as the first with the two sides swapped. For example, if Spain lost a match to Morocco 1-2, the first row would contain 1 under the team_score column and 2 under the opponent_score column, and the following row would contain 2 under the team_score column and 1 under the opponent_score column. Goal differential will then be calculated as team_score - opponent_score. Lastly, we will group the table by country and summarize it using the sum of the new goal_differential column to produce the data frame we use in our analysis.

The visualization will be faceted by country (in descending order of aggregate goal differential) and consist of line graphs, drawn with geom_line(), that track each nation’s goal differential in each World Cup, with goal differential on the y-axis and the year of each World Cup on the x-axis. We will most likely limit the number of countries we display because so many have participated in the World Cup, but we will decide this once we have the data ready to visualize. This visualization will give us a sense of which teams have the highest overall goal differential while also letting us gauge how that has varied from tournament to tournament for each country. Our team may pivot to analyzing average goal differential instead of total if we feel it makes more sense in the context of the data, given the historic FIFA politics surrounding which nations have been permitted to participate.
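Both preparation steps for this question can be sketched as below, again on small hand-typed toy data rather than the real files. The names toy_wc, toy_matches, top_four, team_rows, by_tournament, and overall are hypothetical. Note that this sketch stacks a home-side view and an away-side view with bind_rows() to get two rows per match, a swapped-in alternative to the pivot described above. Summarizing by team and year for the line graphs is also an assumption about how the per-tournament values would be obtained.

```r
library(dplyr)
library(tidyr)

# --- Top-four counts from a toy version of `wc` ---
toy_wc <- data.frame(
  year   = c(2010, 2014, 2018),
  winner = c("Spain", "Germany", "France"),
  second = c("Netherlands", "Argentina", "Croatia"),
  third  = c("Germany", "Netherlands", "Belgium"),
  fourth = c("Uruguay", "Brazil", "England"),
  stringsAsFactors = FALSE
)

top_four <- toy_wc %>%
  # One row per (tournament, placement) pair.
  pivot_longer(c(winner, second, third, fourth),
               names_to = "placement", values_to = "country") %>%
  # Number of top-four finishes per country, largest first.
  count(country, sort = TRUE)

# --- Goal differential from a toy version of `matches` ---
toy_matches <- data.frame(
  year       = c(2018, 2018),
  home_team  = c("Spain", "France"),
  away_team  = c("Morocco", "Spain"),
  home_score = c(1, 3),
  away_score = c(2, 0),
  stringsAsFactors = FALSE
)

team_rows <- bind_rows(
  # Each match from the home side's perspective...
  toy_matches %>%
    transmute(year, team = home_team, opponent = away_team,
              team_score = home_score, opponent_score = away_score),
  # ...and the same match from the away side's perspective.
  toy_matches %>%
    transmute(year, team = away_team, opponent = home_team,
              team_score = away_score, opponent_score = home_score)
) %>%
  mutate(goal_differential = team_score - opponent_score)

# Per-tournament goal differential (for the faceted line graphs).
by_tournament <- team_rows %>%
  group_by(team, year) %>%
  summarize(goal_differential = sum(goal_differential), .groups = "drop")

# All-time goal differential (for ordering the facets).
overall <- team_rows %>%
  group_by(team) %>%
  summarize(total_goal_differential = sum(goal_differential), .groups = "drop") %>%
  arrange(desc(total_goal_differential))
```

If the team later switches to average goal differential, replacing sum() with mean() in the final summarize() would give the per-match average instead.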