Lab 5 - Data Wrangling

Learning goals

  • Use the merge() function to join two datasets.
  • Look at several different ways of loading data from different sources.
  • Transform, filter, and mutate the data.

Lab description

In this lab we will gather COVID-19 data from the CDC and then process it extensively so the various datasets can be merged. We will also source population data from the US Census so that per-capita rates can be computed accurately.

Setup

1. Load in packages.

Load the data.table package (along with dtplyr and dplyr if you plan to work with those). Alternatively, you can choose to work with the tidyverse.

2. Census data.

The US Census API User Guide provides details on how to leverage this valuable resource. We are interested in vintage population estimates for years 2020 and 2021. Load in the population CSV and print out the first few rows. Hint: the first row will contain the variable names; this is OK, as we will fix it in the next exercise.

url <- 'https://raw.githubusercontent.com/dmcable/BIOSTAT620W26/main/data/covid/population.csv'
if (!file.exists("population.csv")) {
  options(timeout = 60)  # download.file() reads the timeout from options(), not from an argument
  download.file(
    url = url,
    destfile = "population.csv",
    method = "libcurl"
  )
}
population <- read.csv('population.csv')

3. Clean population data.

Examine the population data frame you just created. Notice that 1) it is not tidy, 2) the column types are not what we want, 3) there is a redundant index column, and 4) the variable names are stored in the first row. Here are some recommendations for how to clean the data:

  • Convert population to a tidy dataset.
  • Make the first row the header.
  • Remove the unnecessary index column and the state ID column.
  • Change the name of the column with state names to state_name.
  • Add a column with state abbreviations called state. Make sure you assign the abbreviations for DC and PR correctly.

Hint 1: Use the row_to_names function from the janitor package to make the first row the header. Hint 2: Use setNames(state.abb, state.name) to map state names to abbreviations.
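Here is a minimal sketch of these cleaning steps applied to a toy stand-in for the raw download (the column contents and names like POP_2020 are made up for illustration; the real census file will differ):

```r
library(janitor)  # row_to_names()
library(dplyr)
library(tidyr)

# Toy stand-in for the raw census download: the first row holds the real
# headers, and there is a redundant index column.
raw <- data.frame(
  V1 = c("index", "1", "2"),
  V2 = c("NAME", "Michigan", "Ohio"),
  V3 = c("POP_2020", "10077331", "11799448"),
  V4 = c("POP_2021", "10050811", "11780017")
)

population <- raw |>
  row_to_names(row_number = 1) |>        # promote the first row to the header
  select(-index) |>                      # drop the redundant index column
  rename(state_name = NAME) |>
  pivot_longer(starts_with("POP_"),      # tidy: one row per state-year
               names_to = "year", names_prefix = "POP_",
               values_to = "population") |>
  mutate(year = as.integer(year),
         population = as.numeric(population),
         # state.abb/state.name cover only the 50 states;
         # DC and PR need manual entries
         state = setNames(c(state.abb, "DC", "PR"),
                          c(state.name, "District of Columbia",
                            "Puerto Rico"))[state_name])
```

Your version will operate on the real population object, but the same pipeline shape (promote header, drop columns, pivot, fix types, add abbreviations) should carry over.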

4. Regions data.

The following URL:

url <- "https://raw.githubusercontent.com/dmcable/BIOSTAT620W26/main/data/covid/regions.json"

points to a JSON file that lists the states in each of the 10 Public Health Service (PHS) regions defined by the CDC. We want to add these regions to the population dataset. To facilitate this, create a data frame called regions that has columns state_name, region, and region_name. For this exercise, we recommend:

  • Split the states into separate rows. Hint: look up the unnest function.
  • One of the regions has a long name. Change it to something shorter.
  • Print the first few rows of regions.
  • Make sure that the region is a factor.

library(jsonlite)
url <- "https://raw.githubusercontent.com/dmcable/BIOSTAT620W26/main/data/covid/regions.json"
regions <- fromJSON(url) # use fromJSON to read as a data.frame
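fromJSON on a file like this typically returns one row per region with the member states stored in a list-column. A sketch of the unnesting steps on a made-up two-region example (the column names and region names here are invented for illustration):

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the parsed JSON: one row per region, with the
# member states stored in a list-column.
regions_raw <- tibble(
  region = c(1, 2),
  region_name = c("Boston",
                  "New York and New Jersey, Puerto Rico, and the Virgin Islands"),
  states = list(c("Maine", "Vermont"), c("New York", "New Jersey"))
)

regions <- regions_raw |>
  unnest(states) |>                 # split the states into separate rows
  rename(state_name = states) |>
  mutate(region = factor(region),
         # shorten the one long region name
         region_name = ifelse(nchar(region_name) > 30,
                              "NY/NJ/PR/VI", region_name))
```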

5. Process regions data.

Add a region and region name columns to the population data frame using the merging or joining methods we have learned. Print out the first few rows. Hint: consider left_join.
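The join can be sketched on toy tables like these; the key idea is that the real population and regions data frames share the state_name column:

```r
library(dplyr)

# Toy tables illustrating the join (values made up for illustration).
population <- tibble(state_name = c("Maine", "New York"),
                     population = c(1362359, 19835913))
regions    <- tibble(state_name = c("Maine", "New York"),
                     region = factor(c(1, 2)),
                     region_name = c("Boston", "NY/NJ/PR/VI"))

# left_join keeps every row of population and attaches the matching region
population <- left_join(population, regions, by = "state_name")
head(population)
```

A left join is the safe choice here: states missing from regions would get NA rather than being silently dropped.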

6. Load CDC data

From reading https://data.cdc.gov/ we learn that the endpoint https://data.cdc.gov/resource/pwn4-m3yp.json provides state-level data on SARS-CoV-2 cases. We use httr2 tools to download this into a data frame. Question: what would happen, and why, if you replaced "?$limit=10000000000" with ""?

library(httr2)

api <- "https://data.cdc.gov/resource/pwn4-m3yp.json"
request <- request(paste0(api, "?$limit=10000000000"))
response <- request |> req_perform() |> resp_body_string()
cases_raw <- fromJSON(response)
head(cases_raw)

7. Wrangle cases data

Wrangle cases_raw to produce a data frame with columns state, date (use the end date), and cases. For this task:

  • Make sure the cases are numeric and the dates are in Date format. Hint: check out the lubridate package.
  • Print out the first several rows.
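The type conversions can be sketched as follows; the column names end_date and new_cases below are made up for illustration, so substitute whatever the API actually returns:

```r
library(dplyr)
library(lubridate)

# Toy stand-in for cases_raw: the API returns every field as character,
# with timestamps in ISO form like "2021-01-09T00:00:00.000".
cases_raw <- tibble(
  state = c("MI", "MI"),
  end_date = c("2021-01-02T00:00:00.000", "2021-01-09T00:00:00.000"),
  new_cases = c("21561.0", "18040.0")
)

cases <- cases_raw |>
  mutate(cases = as.numeric(new_cases),
         date = as_date(ymd_hms(end_date))) |>  # parse timestamp, keep the date
  select(state, date, cases)
head(cases)
```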

8. Optional freestyle.

Here are some ideas:

  • Merge the cases dataframe with the population data.
  • Use the as_date, ymd_hms, epiweek and epiyear functions in the lubridate package to get a week and year for each data point.
  • Order your data by date within each state.
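The last two ideas can be sketched on a toy cases table (values invented for illustration); epiweek() and epiyear() return MMWR (CDC) weeks and years:

```r
library(dplyr)
library(lubridate)

# Toy cases data, deliberately out of date order.
cases <- tibble(
  state = c("MI", "MI", "OH"),
  date  = as_date(c("2021-01-09", "2021-01-02", "2021-01-02")),
  cases = c(18040, 21561, 19000)
)

cases <- cases |>
  mutate(mmwr_week = epiweek(date),   # MMWR week of the observation
         mmwr_year = epiyear(date)) |>
  arrange(state, date)                # order by date within each state
```

Note that epiyear can differ from the calendar year near January 1, which is exactly why it is the right grouping variable for weekly counts.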

9. Submission (due Tuesday, Feb 17 at 8:30am)

Submit through the course GitHub issues page.