Statsquatch: A Data Analysis of Bigfoot Sightings

Proposal

suppressWarnings(library(tidyverse))
library(knitr)

Dataset

bigfoot <- readr::read_csv('data/bigfoot.csv')

glimpse(bigfoot)
Rows: 5,021
Columns: 29
$ ...1               <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ observed           <chr> "I was canoeing on the Sipsey river in Alabama. It …
$ location_details   <chr> NA, "East side of Prince William Sound", "Great swa…
$ county             <chr> "Winston County", "Valdez-Chitina-Whittier County",…
$ state              <chr> "Alabama", "Alaska", "Rhode Island", "Pennsylvania"…
$ season             <chr> "Summer", "Fall", "Fall", "Summer", "Spring", "Fall…
$ title              <chr> NA, NA, "Report 6496: Bicycling student has night e…
$ latitude           <dbl> NA, NA, 41.45000, NA, NA, 35.30110, 39.38745, 41.29…
$ longitude          <dbl> NA, NA, -71.50000, NA, NA, -99.17020, -81.67339, -7…
$ date               <date> NA, NA, 1974-09-20, NA, NA, 1973-09-28, 1971-08-01…
$ number             <dbl> 30680, 1261, 6496, 8000, 703, 9765, 4983, 31940, 56…
$ classification     <chr> "Class B", "Class A", "Class A", "Class B", "Class …
$ geohash            <chr> NA, NA, "drm5ucxrc0", NA, NA, "9y32z667yc", "dpjbj6…
$ temperature_high   <dbl> NA, NA, 78.17, NA, NA, 71.86, NA, 92.24, NA, NA, 74…
$ temperature_mid    <dbl> NA, NA, 73.425, NA, NA, 61.425, NA, 80.810, NA, NA,…
$ temperature_low    <dbl> NA, NA, 68.68, NA, NA, 50.99, NA, 69.38, NA, NA, 53…
$ dew_point          <dbl> NA, NA, 65.72, NA, NA, 51.03, NA, 67.34, 32.55, NA,…
$ humidity           <dbl> NA, NA, 0.86, NA, NA, 0.79, NA, 0.68, 0.45, NA, 0.7…
$ cloud_cover        <dbl> NA, NA, 0.86, NA, NA, 0.11, NA, 0.05, 0.00, NA, 0.6…
$ moon_phase         <dbl> NA, NA, 0.16, NA, NA, 0.07, NA, 0.76, 0.02, NA, 0.1…
$ precip_intensity   <dbl> NA, NA, 0.0000, NA, NA, NA, NA, 0.0000, 0.0000, NA,…
$ precip_probability <dbl> NA, NA, 0.00, NA, NA, NA, NA, 0.00, 0.00, NA, 0.70,…
$ precip_type        <chr> NA, NA, NA, NA, NA, "rain", NA, NA, NA, NA, "rain",…
$ pressure           <dbl> NA, NA, 1020.61, NA, NA, 1017.26, NA, 1016.80, 1012…
$ summary            <chr> NA, NA, "Foggy until afternoon.", NA, NA, "Partly c…
$ uv_index           <dbl> NA, NA, 4, NA, NA, 7, NA, 8, 8, NA, 6, 10, 6, 7, NA…
$ visibility         <dbl> NA, NA, 2.750, NA, NA, 10.000, NA, 6.922, 8.880, NA…
$ wind_bearing       <dbl> NA, NA, 198, NA, NA, 259, NA, 219, 285, NA, 262, 19…
$ wind_speed         <dbl> NA, NA, 6.92, NA, NA, 8.41, NA, 1.01, 4.01, NA, 0.4…

The Tibbles chose the Bigfoot dataset from Tidy Tuesday. The data originates from a publicly available database on the Bigfoot Field Researchers Organization (BFRO) website; the dataset, created by Timothy Renner, first became available on Data World in 2017. It contains 5,021 rows, each representing a separate Bigfoot sighting, and 28 descriptive columns, each providing details on the sighting. (The 29th column in the output above, `...1`, appears to be an unnamed row index from the CSV file.) The columns fall into three broad groups: lengthy free-text descriptions of the sighting, time and geographic details, and weather details.

The Tibbles opted to work with this dataset because of our shared predilection for the supernatural and the potential this information holds to prove once and for all the existence (or lack thereof) of one of the most iconic creatures in the North American imagination. It contains a wealth of observations, with plenty of categorical and numerical data well suited to comprehensive analysis and interpretation.

Questions

  1. How has the geographic distribution of Bigfoot sightings changed over time?
  2. Are certain weather conditions more commonly associated with Bigfoot sightings than others?

Analysis plan

  • Question One: How has the geographic distribution of Bigfoot sightings changed over time?
    • Create new ‘year’ variable from existing ‘date’ variable using the mutate() function

    • Figure 1: Map observations as points by the ‘longitude’ and ‘latitude’ variables, then color by ‘year’ using a gradient color scale (for example, oldest values shaded darker and newest values shaded brighter). If the data points overlap enough to obscure the trend, we may (a) use transparency to improve visibility, or (b) use a heat map instead, plotting point density faceted by decade.

    • Figure 2: Plot the geographic distribution of sightings as a stacked bar plot over the timeline of our available data, binning sightings into 5-year periods, coloring by state, and using the count of sightings as the bar height. This visualization will show whether states contribute different proportions of sightings to the national tallies over time, or whether their contributions are constant and show no significant change. If ‘state’ proves too difficult to visualize coherently (there are dozens in our dataset, even though a small minority contribute the most sightings), we will create a ‘region’ variable using U.S. Census categories.
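
The data preparation described in this plan could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not our final code: `toy` is a handful of made-up rows standing in for the real data, `period` and `region_lookup` are names we introduce here for illustration, and the region lookup uses base R's built-in `state.name` and `state.region` vectors, which label the Midwest "North Central" and do not include the District of Columbia.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Hypothetical toy rows standing in for the real bigfoot data (illustration only).
toy <- tibble::tibble(
  date  = as.Date(c("1974-09-20", "1973-09-28", "1971-08-01", "2010-07-04", NA)),
  state = c("Rhode Island", "Oklahoma", "Ohio", "Washington", "Alabama")
)

# Lookup table from base R's built-in state vectors (assumed stand-in for Census regions).
region_lookup <- tibble::tibble(
  state  = state.name,
  region = as.character(state.region)
)

sightings <- toy %>%
  filter(!is.na(date)) %>%          # undated rows cannot be placed on the timeline
  mutate(
    year   = year(date),            # new 'year' variable derived from 'date'
    period = floor(year / 5) * 5    # start year of each 5-year bin, e.g. 1974 -> 1970
  ) %>%
  left_join(region_lookup, by = "state")

# Stacked bar of sighting counts per 5-year period, colored by region.
fig2 <- ggplot(sightings, aes(x = period, fill = region)) +
  geom_bar() +
  labs(x = "5-year period (start year)", y = "Number of sightings", fill = "Region")
```

Swapping `fill = region` for `fill = state` gives the state-level version of the plot.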

Further Plan for Evaluation

For Figure 1, we will use the map plot to look for visible trends in the geographic distribution. For example, a cluster of points from one area during the 1980s (a national park, for example) might be evidence of an unusual pattern worth further observation. The Professor notes, correctly, that it may be difficult to draw formal conclusions from such a plot; however, we believe that when geographic data is available, good practice in a comprehensive overview includes looking at that data in its raw format and mapping it, even if it proves useless, before we apply arbitrary strata (like states). We have also adopted a few contingency options in case the first draft of the map is unusable.
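
The first-draft map and its two contingencies might look something like the sketch below. It assumes a ‘year’ variable already exists; the `toy` rows are made up for illustration, `decade` is a helper column we introduce here, and the alpha and bin values are arbitrary starting points rather than tuned choices.

```r
library(dplyr)
library(ggplot2)

# Hypothetical coordinates and years (illustration only).
toy <- tibble::tibble(
  longitude = c(-71.50, -99.17, -81.67, -122.30, -121.90),
  latitude  = c(41.45, 35.30, 39.39, 47.60, 46.80),
  year      = c(1974, 1973, 1971, 2005, 2008)
) %>%
  mutate(decade = floor(year / 10) * 10)   # e.g. 1974 -> 1970

# First draft with contingency (a): points colored by year, with transparency.
fig1_points <- ggplot(toy, aes(x = longitude, y = latitude, color = year)) +
  geom_point(alpha = 0.4) +
  scale_color_viridis_c() +   # default viridis: low (older) values dark, high (newer) bright
  coord_quickmap()

# Contingency (b): binned counts per map cell, one panel per decade.
fig1_density <- ggplot(toy, aes(x = longitude, y = latitude)) +
  geom_bin_2d(bins = 20) +
  facet_wrap(~ decade) +
  coord_quickmap()
```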

For Figure 2, we will look for evidence in our bar plot of a change in distribution over time by state. If states’ sighting rates (as a proportion of the nation’s, thereby adjusting for relative increases in sightings nationwide) rise or fall persistently over time, it may provide evidence that there is a relationship between time and the geographic distribution of sightings. If the sighting rates or proportions are constant, or if they seem to vary randomly, it may provide evidence that there is no such relationship.
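
As a rough sketch of the proportion calculation, each state's share of the national tally within a period can be computed as below. The `toy` rows are invented for illustration, and `sightings` and `share` are column names we introduce here.

```r
library(dplyr)
library(ggplot2)

# Hypothetical sightings already assigned to 5-year periods (illustration only).
toy <- tibble::tibble(
  period = c(1970, 1970, 1970, 1975, 1975, 1975, 1975),
  state  = c("Ohio", "Ohio", "Washington",
             "Ohio", "Washington", "Washington", "Washington")
)

state_share <- toy %>%
  count(period, state, name = "sightings") %>%    # sightings per state per period
  group_by(period) %>%
  mutate(share = sightings / sum(sightings)) %>%  # proportion of that period's national total
  ungroup()

# Bars of equal height, so changes in a state's share are visible across periods.
fig2_share <- ggplot(state_share, aes(x = period, y = share, fill = state)) +
  geom_col() +
  labs(x = "5-year period (start year)", y = "Share of national sightings")
```

In this toy example Ohio's share falls from 2/3 to 1/4 between the two periods; a persistent movement of that kind across many periods is the pattern we would be looking for.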

  • Question Two: Are certain weather conditions more commonly associated with Bigfoot sightings than others?

    • Figure 1: Using ‘dew_point’, ‘humidity’, ‘temperature_mid’, ‘cloud_cover’, ‘uv_index’, ‘visibility’, ‘wind_bearing’, and ‘wind_speed’, we will make density plots to determine under what conditions a Bigfoot sighting is most likely. After identifying the most notable variables, by finding the distributions with the least variance, we will make one figure showing multiple variables on a common x-axis.

    • Figure 2: Using ‘temperature_mid’, ‘humidity’, and the number of Bigfoot sightings, we will create a heat map. This graph would indicate whether there are any areas of high or low sightings that correspond to certain weather conditions.
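
One way the two weather figures could be approached is sketched below, using three of the listed variables. The `toy` values are made up for illustration, and `weather_long` and `spread` are names we introduce here. Note that raw variances are not directly comparable across variables measured in different units, so the ranking step is only a first pass and may need rescaled values.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical weather values (illustration only).
toy <- tibble::tibble(
  temperature_mid = c(73.4, 61.4, 80.8, 55.0, 66.2, 70.1),
  humidity        = c(0.86, 0.79, 0.68, 0.45, 0.70, NA),
  cloud_cover     = c(0.86, 0.11, 0.05, 0.00, 0.60, 0.30)
)

# Long format: one row per (variable, value) pair, dropping missing values.
weather_long <- toy %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value))

# One density panel per variable; x scales are free because the units differ.
fig_density <- ggplot(weather_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~ variable, scales = "free_x")

# Rank variables from least to most variance.
spread <- weather_long %>%
  group_by(variable) %>%
  summarise(variance = var(value), n = n()) %>%
  arrange(variance)

# Heat map: count of sightings in each temperature-by-humidity cell.
fig_heat <- toy %>%
  filter(!is.na(temperature_mid), !is.na(humidity)) %>%
  ggplot(aes(x = temperature_mid, y = humidity)) +
  geom_bin_2d(bins = 10)
```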

Further Plan for Evaluation

We will observe each of the figures produced in order to gauge which weather conditions (if any) coincide with the most Bigfoot sightings.