# uncomment to run
# source("./code/wrangle.R")
# head(population)
Problem set 4
Total points: 15.
Introduction
In this problem set, we aim to use data visualization to explore the following questions:
- Based on SARS-CoV-2 cases, COVID-19 deaths, and hospitalizations, what periods defined the worst two waves of 2020-2021?
- Did states with higher vaccination rates experience lower COVID-19 death rates?
- Were there regional differences in vaccination rates?
We are not providing definitive answers to these questions but rather generating visualizations that may offer insights.
Objective
We will create a single data frame that contains relevant observations for each jurisdiction, for each Morbidity and Mortality Weekly Report (MMWR) period in 2020 and 2021. The key outcomes of interest are:
- SARS-CoV-2 cases
- COVID-19 hospitalizations
- COVID-19 deaths
- Individuals receiving their first COVID-19 vaccine dose
- Individuals receiving a booster dose
Task Breakdown
Your task is divided into three parts:
- Download the data: Retrieve population data from the US Census API and COVID-19 statistics from the CDC API.
- Wrangle the data: Clean and join the datasets to create a final table containing all the necessary information.
- Create visualizations: Generate graphs to explore potential insights into the questions posed above.
Instructions
- As usual, copy and place the pset-04-dataviz.qmd file in a new directory called p4.
- Within your p4 directory, create the following directory: code
- Inside the code directory, include the following files: funcs.R and wrangle.R
Detailed instructions follow for each of the tasks.
Download data
For this part we want the following:
- Save all your code in a file called wrangle.R that produces the final data frame.
- When executed, this code should save the final data frame in an RDA file in the data directory.
- (1 point) Copy the relevant code from the previous homework to create the population data frame. Put this code in the wrangle.R file in the code directory. Comment the code so we know where the population is created, where the regions are read in, and where we combine these.
Test that your wrangling code works by uncommenting and running the source("./code/wrangle.R") and head(population) lines at the top of this document.
- (1 point) In the previous problem set we wrote the following script to download cases data:
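library(httr2)
# CDC endpoint for the cases data
api <- "https://data.cdc.gov/resource/pwn4-m3yp.json"
# Request all rows and parse the JSON response into a data frame
cases_raw <- request(api) |>
  req_url_query("$limit" = 10000000) |>
  req_perform() |>
  resp_body_json(simplifyVector = TRUE)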
We are now going to download three other datasets from CDC that provide hospitalization, provisional COVID-19 deaths, and vaccine data. A different endpoint is provided for each one, but the requests are otherwise the same. To avoid rewriting the same code more than once, write a function called get_cdc_data that receives an endpoint and returns a data frame. Save this code in a file called funcs.R.
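One possible shape for such a helper is sketched below. It only wraps the request shown above in a function. The argument names endpoint and limit are illustrative choices, not prescribed by the assignment, and the calls are package-qualified so the file does not depend on httr2 already being attached.

```r
# Hypothetical sketch: wrap the download script above in a reusable function.
# `endpoint` and `limit` are illustrative argument names (assumptions).
get_cdc_data <- function(endpoint, limit = 10000000) {
  httr2::request(endpoint) |>
    httr2::req_url_query("$limit" = limit) |>   # ask for up to `limit` rows
    httr2::req_perform() |>                     # send the request
    httr2::resp_body_json(simplifyVector = TRUE)  # parse JSON into a data frame
}
```

With this saved in code/funcs.R, a call such as get_cdc_data("https://data.cdc.gov/resource/pwn4-m3yp.json") would mirror the original cases script.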
- (1 point) Use the function get_cdc_data to download the cases, hospitalization, deaths, and vaccination data and save the data frames. We recommend saving them into objects called cases_raw, hosp_raw, deaths_raw, and vax_raw.
- cases: https://data.cdc.gov/resource/pwn4-m3yp.json
- hospitalizations: https://data.cdc.gov/resource/39z2-9zu6.json
- deaths: https://data.cdc.gov/resource/r8kw-7aab.json
- vaccinations: https://data.cdc.gov/resource/rh2h-3yt2.json
### YOUR CODE HERE
# uncomment this to run:
# source("./code/funcs.R")
Take a look at all the data frames you just read in.
### Uncomment to run
#print(head(cases_raw))
#print(head(hosp_raw))
#print(head(deaths_raw))
#print(head(vax_raw))
Wrangling Challenge
In this section, you will wrangle the files downloaded in the previous step into a single data frame containing all the necessary information. We recommend using the following column names: date, state, cases, hosp, deaths, vax, booster, and population.
Key Considerations
Align reporting periods: Ensure that the time periods for which each outcome is reported are consistent. Specifically, calculate the totals for each Morbidity and Mortality Weekly Report (MMWR) period.
Harmonize variable names: To facilitate the joining of datasets, rename variables so that they match across all datasets.
- (1 point) One challenge is that the data frames use different column names to represent the same variable. Examine each data frame and report back 1) the name of the column with state abbreviations, 2) whether the data are reported yearly, monthly, weekly, or daily, and 3) all the column names that provide date information.
Outcome | Jurisdiction variable name | Rate | Time variable names |
---|---|---|---|
cases | | | |
hospitalizations | | | |
deaths | | | |
vaccines | | | |
- (1 point) Wrangle the cases data frame to keep state, MMWR year, MMWR week, and the total number of cases for that week in that state. Hint: Use the as_date, ymd_hms, epiweek, and epiyear functions in the lubridate package. Comment appropriately. Display the result.
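As a small illustration of the hinted functions, the sketch below converts a few made-up timestamps to MMWR year and week. The timestamp format is an assumption about how dates may arrive from a JSON API, so check it against the actual columns.

```r
library(lubridate)

# Made-up timestamps in an ISO-like format (assumed, not taken from the CDC data)
stamps <- c("2020-12-26T00:00:00.000",
            "2021-01-02T00:00:00.000",
            "2021-01-09T00:00:00.000")

dates <- as_date(ymd_hms(stamps))  # parse the date-time, then drop the time part
mmwr_year <- epiyear(dates)        # MMWR (epidemiological) year
mmwr_week <- epiweek(dates)        # MMWR week, which starts on Sunday

data.frame(date = dates, mmwr_year, mmwr_week)
```

Note that January 2, 2021 falls in MMWR week 53 of 2020, which is why epiyear is used here rather than the calendar year.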
### YOUR CODE HERE
- (1 point) Now repeat the same exercise for hospitalizations. Note that you will have to collapse the daily data into weekly data and keep the same columns as in the cases dataset, except keep total weekly hospitalizations instead of cases. Remove weeks with fewer than 7 days of reporting. Display your result and comment appropriately.
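The collapsing step can be illustrated on toy data. The sketch below uses made-up daily counts for a single state and a hypothetical helper column n_days to drop incomplete weeks. The real hospitalization data will have different column names.

```r
library(dplyr)
library(lubridate)

# Made-up daily counts: one full MMWR week (Sunday 2021-01-03 to
# Saturday 2021-01-09) plus two days of the following week.
daily <- data.frame(
  state = "MA",
  date = seq(make_date(2021, 1, 3), make_date(2021, 1, 11), by = "day"),
  hosp = 1:9
)

weekly <- daily |>
  mutate(mmwr_year = epiyear(date), mmwr_week = epiweek(date)) |>
  group_by(state, mmwr_year, mmwr_week) |>
  summarize(hosp = sum(hosp), n_days = n(), .groups = "drop") |>
  filter(n_days == 7) |>  # keep only weeks with all 7 days reported
  select(-n_days)

weekly
```

Only the complete week survives, with a weekly total of 28. The two-day partial week is dropped.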
### YOUR CODE HERE
- (1 point) Repeat what you did in the previous two exercises for provisional COVID-19 deaths. Display the result and comment appropriately.
### YOUR CODE HERE
- (1 point) Repeat this now for vaccination data. Keep the variables series_complete and booster along with state and date. Display the result and comment appropriately. Hint: use only the rows with date_type == 'Admin' so that the vaccine data are based on the day a dose was administered rather than the day it was reported.
### YOUR CODE HERE
- (1 point) Now we are ready to join the tables. We will only consider 2020 and 2021 as we don’t have population sizes for 2022. However, because we want to guarantee that all dates are included we will create a data frame with all possible weeks. We can use this:
library(tidyverse)
## Make dates data frame
all_dates <- data.frame(date = seq(make_date(2020, 1, 25),
                                   make_date(2021, 12, 31),
                                   by = "week")) |>
  mutate(date = ceiling_date(date, unit = "week", week_start = 7) - days(1)) |>
  mutate(mmwr_year = epiyear(date), mmwr_week = epiweek(date))
#Uncomment to run
#dates_and_pop <- cross_join(all_dates, data.frame(state = unique(population$state))) |> left_join(population, by = c("state", "mmwr_year" = "year"))
Now join all the tables to create your final table. Make sure it is ordered by date within each state. Call it dat. Show a few rows here.
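The joining pattern can be sketched on toy tables. All values and the object names grid, cases_toy, and deaths_toy below are made up for illustration. Starting from the complete grid of state-weeks and using left joins keeps every week, with NA where an outcome was not reported.

```r
library(dplyr)

# Toy stand-ins for the real tables (all values are made up)
grid <- data.frame(state = c("MA", "MA", "NY", "NY"),
                   mmwr_year = 2021, mmwr_week = c(1, 2, 1, 2))
cases_toy <- data.frame(state = c("MA", "NY", "NY"), mmwr_year = 2021,
                        mmwr_week = c(1, 1, 2), cases = c(10, 20, 30))
deaths_toy <- data.frame(state = c("MA", "MA"), mmwr_year = 2021,
                         mmwr_week = c(1, 2), deaths = c(1, 2))

keys <- c("state", "mmwr_year", "mmwr_week")

joined <- grid |>
  left_join(cases_toy, by = keys) |>    # keeps all rows of the grid
  left_join(deaths_toy, by = keys) |>
  arrange(state, mmwr_year, mmwr_week)  # ordered by week within each state

joined
```

In the real table, the date column from the dates grid could be used for ordering instead of the year and week pair.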
Data visualization: generate some plots
We are now ready to create some figures. For each question below, write code that generates a plot that addresses the question.
- (1 point) Plot a trend plot for cases, hospitalizations and deaths for each state. Color by region. Plot rates per 100,000 people. Place the plots on top of each other. Hint: Use pivot_longer and facet_wrap.
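To show how the two hinted functions fit together, here is a minimal sketch on made-up data. The column name region_name is a hypothetical stand-in for whatever the region column is called in your table, and only two outcomes are included to keep the example short.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up weekly counts for two states
toy <- data.frame(
  date = rep(as.Date(c("2021-01-02", "2021-01-09")), times = 2),
  state = rep(c("MA", "TX"), each = 2),
  region_name = rep(c("Northeast", "South"), each = 2),  # hypothetical column name
  population = rep(c(7e6, 29e6), each = 2),
  cases = c(700, 1400, 5800, 2900),
  deaths = c(7, 14, 58, 29)
)

# One row per state, date, and outcome, with the rate per 100,000 people
long <- toy |>
  pivot_longer(c(cases, deaths), names_to = "outcome", values_to = "count") |>
  mutate(rate = count / population * 100000)

# One line per state, colored by region; ncol = 1 stacks the panels vertically
p <- ggplot(long, aes(date, rate, group = state, color = region_name)) +
  geom_line() +
  facet_wrap(~outcome, ncol = 1, scales = "free_y")
```

Using free y-axis scales is a design choice here, since death rates are much smaller than case rates and would otherwise appear flat.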
### YOUR CODE HERE
- (1 point) To determine when vaccination started and when most of the population was vaccinated, compute the percent of the US population (including DC and Puerto Rico) vaccinated by date. Do the same for the booster. Then plot both percentages.
### YOUR CODE HERE
- (1 point) Plot the distribution of vaccination rates across states on July 1, 2021.
### YOUR CODE HERE
- (1 point) Is there a difference across regions? Generate a plot of your choice.
### YOUR CODE HERE
Discuss what the plot shows.
YOUR SHORT ANSWER HERE
- (1 point) Using the previous figures, identify a time period that meets the following criteria:
- A significant COVID-19 wave occurred across the United States.
- A sufficient number of people had been vaccinated.
Next, follow these steps:
- For each state, calculate the COVID-19 deaths per day per 100,000 people during the selected time period.
- Determine the vaccination rate (primary series) in each state as of the last day of the period.
- Create a scatter plot to visualize the relationship between these two variables:
- The x-axis should represent the vaccination rate.
- The y-axis should represent the deaths per day per 100,000 people.
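The two quantities in the steps above can be worked through on toy numbers. The sketch assumes that vax holds cumulative counts, so the value on the last date of the period is the number vaccinated by then. Verify this against your own wrangled table. The two-week period and all values are made up.

```r
library(dplyr)

# Made-up weekly data for two states over a hypothetical two-week period
toy <- data.frame(
  state = rep(c("MA", "TX"), each = 2),
  date = rep(as.Date(c("2021-08-07", "2021-08-14")), times = 2),
  deaths = c(14, 28, 140, 280),
  vax = c(4.2e6, 4.9e6, 11.6e6, 14.5e6),  # assumed cumulative primary series
  population = rep(c(7e6, 29e6), each = 2)
)

n_days <- 14  # two MMWR weeks of 7 days each

summary_toy <- toy |>
  group_by(state) |>
  summarize(
    # total deaths, divided by days in the period, per 100,000 people
    death_rate = sum(deaths) / n_days / population[1] * 100000,
    # share vaccinated as of the last date in the period
    vax_rate = vax[which.max(date)] / population[1],
    .groups = "drop"
  )

summary_toy
```

The resulting vax_rate and death_rate columns are the x and y variables a scatter plot would use, with one point per state.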
### YOUR CODE HERE
- (1 point) Repeat the exercise for the booster.
### YOUR CODE HERE