Problem set 3

Published

February 2, 2025

Max points: 12.

In the next problem set, we plan to explore the relationship between COVID-19 death rates and vaccination rates across US states by visually examining their correlation. This analysis will involve gathering COVID-19 related data from the CDC’s API and then extensively processing it to merge the various datasets. Since the population sizes of states vary significantly, we will focus on comparing rates rather than absolute numbers. To facilitate this, we will also source population data from the US Census to accurately calculate these rates.

In this problem set we will learn how to extract and wrangle data from the data US Census and CDC APIs.

All answers should be submitted in pset-03-wrangling.qmd. Be sure to include a rendered version of your file and a raw code file that successfully would render on a new computer.

  1. (1 point) Get an API key from the US Census at https://api.census.gov/data/key_signup.html. You can’t share this public key. But your code has to run on a TFs computer. Assume the TF will have a file in their working directory (i.e. in the BIOSTAT620_pset_sol/p3/ directory, assuming that you place pset-03-wrangling.qmd in the BIOSTAT620_pset_sol/p3/ folder) named census-key.R with the following one line of code:
census_key <- "A_CENSUS_KEY_THAT_WORKS"

Write a first line of code for your problem set that defines census_key by running the code in the file census-key.R.

## Your code here
  1. (1 point) The US Census API User Guide provides details on how to leverage this valuable resource. We are interested in vintage population estimates for years 2020 and 2021. From the documentation we find that the endpoint is:
url <- "https://api.census.gov/data/2021/pep/population"

Use the httr2 package to construct the following GET request.

https://api.census.gov/data/2021/pep/population?get=POP_2020,POP_2021,NAME&for=state:*&key=YOURKEYHERE

Create an object called request of class httr2_request with this URL as an endpoint. Print out request to check that the URL matches what we want.

library(httr2)
#request <- 
  1. (1 point) Make a request to the US Census API using the request object. Save the response to and object named response, and print it out here. Check the response status of your request and make sure it was successful. You can learn about status codes here.
#response <- 
  1. (1 point) Use a function from the httr2 package to determine the content type of your response (print it out).
# Your code here
  1. (1 point) Use just one line of code and one function to extract the data into a matrix. Print out the first few rows of the matrix (title: population). Hints: 1) Use the resp_body_json function. 2) The first row of the matrix will be the variable names and this OK as we will fix in the next exercise.
#population <- 
  1. (1 point) Examine the population matrix you just created. Notice that 1) it is not tidy, 2) the column types are not what we want, and 3) the first row is a header. Convert population to a tidy dataset. Remove the state ID column and change the name of the column with state names to state_name. Add a column with state abbreviations called state. Make sure you assign the abbreviations for DC and PR correctly. Hint: Use the janitor package to make the first row the header. Print out the first few rows of your cleaned dataset.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
#population <- population |> ## Use janitor row to names function
  # convert to tibble
  # remove state column
  # rename NAME column to state_name
  # use pivot_longer to tidy
  # remove POP_ from year
  # parse all relevant colunns to numeric
  # add state abbreviations using state.abb variable mapped from the state.name variable
  # use case_when to add abbreviations for DC and PR
  1. (1 point) As a check, make a barplot of states’ 2021 and 2022 populations. Show the state names in the y-axis ordered by descending population size. Hint: You will need to use facet_wrap.
# population |> 
  # reorder state
  # assign aesthetic mapping
  # use geom_col to plot barplot
  # flip coordinates
  # facet by year
  1. (1 point) The following URL:
url <- "https://github.com/datasciencelabs/2024/raw/refs/heads/main/data/regions.json"

points to a JSON file that lists the states in the 10 Public Health Service (PHS) defined by CDC. We want to add these regions to the population dataset. To facilitate this create a data frame called regions that has two columns state_name, region, region_name. One of the regions has a long name. Change it to something shorter. Print the first few rows of regions. Make sure that the region is a factor.

library(jsonlite)

Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':

    flatten
library(purrr)
url <- "https://github.com/datasciencelabs/2024/raw/refs/heads/main/data/regions.json"
# regions <- use fromJSON to read as a data.frame
# rename long region
# use unnest to split the states
# rename column as state_name
  1. (1 point) Add a region and region name columns to the population data frame using the joining methods we have learned. Print out the first few rows.
# population <- 
  1. (1 point) From reading https://data.cdc.gov/ we learn the endpoint https://data.cdc.gov/resource/pwn4-m3yp.json provides state level data from SARS-COV2 cases. Use the httr2 tools you have learned to download this into a data frame. Is all the data there? If not, comment on why.
api <- "https://data.cdc.gov/resource/pwn4-m3yp.json"
# cases_raw <- 
  1. (1 point) The reason you see exactly 1,000 rows is because CDC has a default limit. You can change this limit by adding $limit=10000000000 to the request. Rewrite the previous request to ensure that you receive all the data. Then wrangle the resulting data frame to produce a data frame with columns state, date (should be the end date) and cases. Make sure the cases are numeric and the dates are in Date ISO-8601 format. Print out the first several rows.
api <- "https://data.cdc.gov/resource/pwn4-m3yp.json"
# cases_raw <- 
  1. (1 point) For 2020 and 2021, make a time series plot of cases per 100,000 versus time for each state. Stratify the plot by region name and make a separate line plot for each state. Don’t use colors for this plot, but set alpha = 0.2 to make the plots more easily visable. Make sure to label your graph appropriately.
#cases_raw |>