The Evolution of the World Cup

Proposal

Dataset

The data covers every World Cup match ever played, from Uruguay in 1930 to Russia in 2018. It is split into two datasets: wcmatches and worldcups.

In the wcmatches dataset, columns include city (the city in which the match was played), outcome (which team won, or whether the match ended in a draw), and win_conditions (whether the winning side needed extra time or penalties to win the game).

In addition, the worldcups dataset includes a summary of each World Cup held, containing columns winners, games, goals_scored, and more.

We may also include data from the 2022 Qatar World Cup, available at this link.

Why We Picked this Dataset

We picked this dataset because we thought the subject would be fresh in people’s minds (the 2022 World Cup concluded less than two months ago) and therefore relevant to current discussion. We also felt that a narrative has emerged in recent years that teams previously considered “underdogs” have been catching up to the traditional powerhouses. We wanted to see whether this narrative is in fact true or just wishful thinking. We were also curious whether the introduction of penalties in the World Cup affected the number of goals scored per game. Finally, we liked that the data includes one dataset that looks at the World Cup at a macro level (one observation per World Cup) and another at a micro level (one observation per match).

Load Data

The wcmatches dataset has 900 observations and 15 variables. Each observation represents a different match played at a world cup.

The worldcups dataset has 21 observations and 10 variables. Each observation represents a world cup.

Questions

Question 1:

How have the goals per game and per tournament changed over time? Has the introduction of any new rules affected scores?

Question 2:

How has the representation of countries outside of Europe and South America changed over time? Has this changed by round?

Analysis plan

Merge Data

First, we merged the wcmatches dataset with the worldcups dataset by left joining worldcups onto wcmatches by year, which brings the tournament-level information for each World Cup over to each of its matches.
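A minimal sketch of this join is shown here, assuming dplyr is used. The two data frames are small hand-typed stand-ins for illustration only, and `world_cup_total` is the name the merged data carries in the code further down.

```r
library(dplyr)

# Small stand-ins for the real datasets (hand-typed for illustration only)
wcmatches <- data.frame(
  year       = c(1930, 1930, 1934),
  home_team  = c("France", "Uruguay", "Italy"),
  away_team  = c("Mexico", "Peru", "United States"),
  home_score = c(4, 1, 7),
  away_score = c(1, 0, 1)
)

worldcups <- data.frame(
  year         = c(1930, 1934),
  goals_scored = c(70, 70),
  games        = c(18, 17)
)

# Left join: keep every match and attach the matching tournament summary by year
world_cup_total <- left_join(wcmatches, worldcups, by = "year")

# A left join on a unique key should not change the number of matches
stopifnot(nrow(world_cup_total) == nrow(wcmatches))
```

Because year is unique in worldcups, the merged data should keep one row per match; a changed row count would point to duplicated years in the summary table.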

Note: We may merge in external data for the 2022 World Cup if we decide to include the most recent tournament in our analysis. Additionally, to enhance our analysis of penalty kicks, we may include external data with more detail on them.

Create New Variables

We initially created three variables to help us explore our first question. goalspergame is the number of goals scored per game. attendancepergame is the average attendance per game. gamesperteam is the number of games played per team. In addition to these, we will create other variables that capture factors that may affect goals scored per game, including variables that mark when changes such as VAR were implemented in the World Cup.
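The three derived variables could be computed along the following lines. This is a sketch under assumptions: the column names `teams` and `attendance` are assumed to exist in worldcups, the rows are hand-typed for illustration, and the exact formulas used in the analysis may differ.

```r
library(dplyr)

# Hand-typed stand-in for worldcups; `teams` and `attendance` are assumed column names
worldcups <- data.frame(
  year         = c(1930, 1934),
  goals_scored = c(70, 70),
  games        = c(18, 17),
  teams        = c(13, 16),
  attendance   = c(434000, 395000)
)

worldcups <- worldcups %>%
  mutate(
    goalspergame      = goals_scored / games,  # average goals per match
    attendancepergame = attendance / games,    # average crowd per match
    gamesperteam      = games / teams          # matches per participating team
  )
```

Note that games / teams is a ratio of matches to teams. Since every match involves two teams, the average number of matches a team actually plays would be twice this value.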

For our second question, we will create a variable continent that assigns a continent to each team, which will allow us to see the representation of each continent across World Cups. We will also create a variable that calculates the proportion of teams from each continent at each World Cup, so that we can visualize how this proportion changes from one tournament to the next.

# Fill missing values in the existing "win_conditions" column based on "outcome":
# home ("H") or away ("A") wins with no recorded condition are labelled as won in regulation
world_cup_total$win_conditions <-
  ifelse(
    is.na(world_cup_total$win_conditions) &
      world_cup_total$outcome == "H",
    paste(world_cup_total$home_team, "won in regulation"),
    ifelse(
      is.na(world_cup_total$win_conditions) &
        world_cup_total$outcome == "A",
      paste(world_cup_total$away_team, "won in regulation"),
      world_cup_total$win_conditions
    )
  )

To answer our second question, we will create a variable that calculates the proportion of teams from Europe/South America versus all other continents at each World Cup, which we will use in our visualization to observe how this proportion changes over time.
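One way to build the continent variable and the proportions is sketched here, assuming dplyr. The `continent_lookup` table is hypothetical and would need to cover every team in the real data, and the fixtures are made up for illustration.

```r
library(dplyr)

# Made-up fixtures for illustration; only year, home_team and away_team are needed
wcmatches <- data.frame(
  year      = c(1998, 1998, 2002, 2002),
  home_team = c("Brazil", "France", "Japan", "Senegal"),
  away_team = c("Mexico", "Brazil", "France", "Brazil")
)

# Hypothetical lookup table mapping each team to a continent
continent_lookup <- data.frame(
  team      = c("Brazil", "France", "Japan", "Senegal", "Mexico"),
  continent = c("South America", "Europe", "Asia", "Africa", "North America")
)

# One row per team per tournament, taken from both sides of each fixture
teams_by_year <- bind_rows(
  wcmatches %>% select(year, team = home_team),
  wcmatches %>% select(year, team = away_team)
) %>%
  distinct()

continent_props <- teams_by_year %>%
  left_join(continent_lookup, by = "team") %>%
  # Caution: a team missing from the lookup gets NA and silently lands in "Other"
  mutate(group = ifelse(continent %in% c("Europe", "South America"),
                        "Europe/South America", "Other")) %>%
  count(year, group) %>%
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
```

Counting distinct teams rather than matches keeps a team that plays many games from being weighted more heavily than one eliminated early.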

Question #1 Plan:

World Cup rules, and soccer in general, have changed over time. Changes include the introduction of penalties to break ties, VAR to review calls on the field, an increase in the number of teams at each World Cup, and more. There are also factors that affect players individually, such as the number of additional leagues and tournaments they play in.

The first plot will examine goal-scoring trends across World Cups in order to assess the effect of new rules on the outcomes of games, as well as changes in play over time. We plan to create a line graph showing the number of goals scored per game at each World Cup. (This is better than total goals, because the number of games has increased over time as the World Cup has added more teams.) Our x-axis will be year and our y-axis will be the average goals scored per game at that World Cup. We will also annotate the graph to show when certain changes were implemented in the World Cup, hopefully highlighting or explaining changes in the pattern. We may add another variable such as continent (as color) to see whether the balance of power across the world has shifted over time (in other words, have the goals per game of each continent become more balanced over time or grown further apart?). Using continent in the graph will also lead us into our second question.
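A rough sketch of such an annotated line graph, assuming ggplot2, might look like the following. The rows are hand-typed for illustration, and the `rule_changes` table is a hypothetical helper that currently holds only the introduction of VAR at the 2018 tournament; other rule changes would be added as extra rows.

```r
library(ggplot2)

# Hand-typed stand-in for worldcups (illustration only)
worldcups <- data.frame(
  year         = c(2006, 2010, 2014, 2018),
  goals_scored = c(147, 145, 171, 169),
  games        = c(64, 64, 64, 64)
)
worldcups$goalspergame <- worldcups$goals_scored / worldcups$games

# Hypothetical table of rule changes to annotate, one row per change
rule_changes <- data.frame(year = 2018, label = "VAR introduced")

p <- ggplot(worldcups, aes(x = year, y = goalspergame)) +
  geom_line() +
  geom_point() +
  # Dashed vertical line at each rule change
  geom_vline(data = rule_changes, aes(xintercept = year), linetype = "dashed") +
  # Label placed at the top of the panel, just left of the line
  geom_text(data = rule_changes, aes(x = year, y = Inf, label = label),
            vjust = 1.5, hjust = 1.1, inherit.aes = FALSE) +
  labs(x = "Year", y = "Goals per game")
```

Keeping the rule changes in their own data frame means new annotations can be added without touching the plotting code.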

Our second plot will be a deeper analysis of the effect of penalty kicks (PKs). A penalty kick is a very controversial and often game-deciding call on the field, and the World Cup has seen both the introduction of PKs and a change in their frequency (partially due to VAR) over its history. Our plan for the second plot is a line graph showing how the frequency of penalty kicks has changed over time. A possible extension would be to show two lines on the plot: one for total PKs taken per game and one for successful PKs, or one for PKs taken during the game and one for PKs taken during tie-breaking penalty shootouts. This would allow us to analyze the trends further and try to identify possible explanations.

Our overall goal is to analyze how games have changed over time in terms of goals scored. Higher-scoring games can reflect changes in the way teams are playing, which may be due to the factors we will consider in our graphs. Along with any extra variables we create in the process, the variables we intend to use are:

variable         class       description
year             double      the year the world cup took place
home_score       double      the number of goals the home team scored
away_score       double      the number of goals the away team scored
goals_scored     double      the number of goals scored in that year’s world cup
win_conditions   character   the method in which the game was decided (i.e. extra time, regulation time or penalties)

Question #2 Plan:

In its early years, the World Cup was dominated by European and South American teams, despite the “world” in its name. It had very little representation from Asia, Africa and Oceania. We hope to make two graphs that explore this question of representation.

For our first plot, we plan to make a graph showing the proportion of teams from Europe/South America compared to teams from all other continents at each World Cup. We will create a variable continent based on the country of each team. Then, for each World Cup, we will calculate the proportion of participating teams represented by each continent. With this, we will plot a time series where the x-axis is year, the y-axis is the proportion, and color shows the continent group.

For our second plot, we will explore representation across the rounds of a World Cup and analyze whether a potential increase in representation in early rounds leads to more representation in later rounds (with later rounds containing the “better” teams, as it is an elimination tournament). Our hypothesis for question 2.1 is that, as time progresses, the World Cup becomes more of a “world” event, with fewer teams from Europe and South America and more teams from Africa, Asia and Oceania. However, even with this expected increase in diversity at the start of the tournament, will the trend hold as teams are eliminated, or will we continue to end up with mostly European and South American teams from the quarterfinals to the final? We plan to make a segmented frequency bar chart with the rounds ordered on the x-axis from earliest to latest, filled by continent, and faceted by decade. Faceting by decade (or by every two decades) lets us see how representation across World Cup stages has changed over time.
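The chart described here could be sketched as follows, assuming dplyr and ggplot2. The `appearances` data frame is hypothetical (one made-up row per team appearance in a round), and the round labels in `round_levels` are assumptions that would need to match the stage names in the real data.

```r
library(dplyr)
library(ggplot2)

# Made-up rows for illustration: one row per team appearance in a round
appearances <- data.frame(
  year  = c(1998, 1998, 1998, 1998, 2002, 2002, 2002, 2002),
  stage = c("Group", "Group", "Final", "Final",
            "Group", "Group", "Quarterfinals", "Final"),
  group = c("Europe/South America", "Other", "Europe/South America",
            "Europe/South America", "Other", "Other", "Other",
            "Europe/South America")
)

# Assumed round labels, ordered from earliest to latest
round_levels <- c("Group", "Round of 16", "Quarterfinals", "Semifinals", "Final")

appearances <- appearances %>%
  mutate(
    stage  = factor(stage, levels = round_levels),  # fixes the x-axis order
    decade = floor(year / 10) * 10                  # e.g. 1998 -> 1990
  )

# position = "fill" scales each bar to 1, so segments show shares per round
p <- ggplot(appearances, aes(x = stage, fill = group)) +
  geom_bar(position = "fill") +
  facet_wrap(~ decade) +
  labs(x = "Round", y = "Share of teams", fill = "Continent group")
```

Converting stage to an ordered factor is what keeps the rounds in tournament order; without it, ggplot2 would sort them alphabetically.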