## Problem set 3
Max points: 12.
In the next problem set, we plan to explore the relationship between COVID-19 death rates and vaccination rates across US states by visually examining their correlation. This analysis will involve gathering COVID-19 related data from the CDC’s API and then extensively processing it to merge the various datasets. Since the population sizes of states vary significantly, we will focus on comparing rates rather than absolute numbers. To facilitate this, we will also source population data from the US Census to accurately calculate these rates.
In this problem set we will learn how to extract and wrangle data from the US Census and CDC APIs.
All answers should be submitted in `pset-03-wrangling.qmd`. Be sure to include a rendered version of your file and a raw code file that would render successfully on a new computer.
- (1 point) Get an API key from the US Census at https://api.census.gov/data/key_signup.html. You should not share this key, but your code still has to run on a TF's computer. Assume the TF will have a file in their working directory (i.e., in the `BIOSTAT620_pset_sol/p3/` directory, assuming that you place `pset-03-wrangling.qmd` in the `BIOSTAT620_pset_sol/p3/` folder) named `census-key.R` with the following one line of code:
census_key <- "A_CENSUS_KEY_THAT_WORKS"
Write a first line of code for your problem set that defines `census_key` by running the code in the file `census-key.R`.
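As a rough illustration of the mechanism, `source()` runs the code in a file and leaves any objects it defines in the current environment. The sketch below uses a throwaway file in a temporary directory as a stand-in for `census-key.R`; the temporary path and the placeholder key are for illustration only.

```r
# Create a stand-in for census-key.R in a temporary directory (illustration only)
key_file <- file.path(tempdir(), "census-key.R")
writeLines('census_key <- "A_CENSUS_KEY_THAT_WORKS"', key_file)

# source() evaluates the file, which defines census_key in the current environment
source(key_file)

# Confirm the object now exists without printing the key itself
exists("census_key")
```

Keeping the key in a separate file means the submitted `.qmd` never contains the key itself.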
- (1 point) The US Census API User Guide provides details on how to leverage this valuable resource. We are interested in vintage population estimates for years 2020 and 2021. From the documentation we find that the endpoint is:
url <- "https://api.census.gov/data/2021/pep/population"
Use the httr2 package to construct the following GET request.
https://api.census.gov/data/2021/pep/population?get=POP_2020,POP_2021,NAME&for=state:*&key=YOURKEYHERE
Create an object called `request` of class `httr2_request` with this URL as an endpoint. Print out `request` to check that the URL matches what we want.
library(httr2)
#request <-
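One possible way to attach query parameters to an endpoint is `req_url_query()`. The sketch below is not the only approach; it assumes a recent httr2 version in which wrapping a value in `I()` opts out of percent-encoding, and it uses a placeholder key and a hypothetical object name (`example_request`). Building and printing a request does not contact the server.

```r
library(httr2)

url <- "https://api.census.gov/data/2021/pep/population"
census_key <- "YOURKEYHERE"  # placeholder; the real key comes from census-key.R

# I() asks httr2 to leave commas, the colon, and the asterisk unescaped
# (assumption: a recent httr2 version that supports I() in req_url_query)
example_request <- request(url) |>
  req_url_query(
    get = I("POP_2020,POP_2021,NAME"),
    `for` = I("state:*"),
    key = census_key
  )

# Printing shows the method and the full URL; no network call is made here
example_request
example_request$url
```

`for` is a reserved word in R, so it has to be quoted with backticks when used as an argument name.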
- (1 point) Make a request to the US Census API using the `request` object. Save the response to an object named `response`, and print it out here. Check the response status of your request and make sure it was successful. You can learn about status codes here.
#response <-
- (1 point) Use a function from the httr2 package to determine the content type of your response (print it out).
# Your code here
- (1 point) Use just one line of code and one function to extract the data into a matrix named `population`. Print out the first few rows of the matrix. Hints: 1) Use the `resp_body_json` function. 2) The first row of the matrix will be the variable names, and this is OK as we will fix it in the next exercise.
#population <-
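The status, content-type, and JSON-parsing helpers used in the last three exercises can be tried without a key or network access by building a mock response with httr2's `response()` function. The sketch below is only an illustration: the JSON body and its values are invented, and `mock` and `toy_matrix` are hypothetical names.

```r
library(httr2)

# Mock response with an invented JSON body shaped like an array of arrays
mock <- response(
  status_code = 200,
  headers = list(`Content-Type` = "application/json"),
  body = charToRaw(
    '[["POP_2020","NAME","state"],["100","Alabama","01"],["20","Alaska","02"]]'
  )
)

resp_status(mock)        # HTTP status code of the response
resp_content_type(mock)  # media type taken from the Content-Type header

# simplifyVector = TRUE is passed on to jsonlite, which collapses
# equal-length arrays of strings into a character matrix
toy_matrix <- resp_body_json(mock, simplifyVector = TRUE)
head(toy_matrix)
```

Because every inner array has the same length, the parsed result is a character matrix whose first row holds the variable names.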
- (1 point) Examine the `population` matrix you just created. Notice that 1) it is not tidy, 2) the column types are not what we want, and 3) the first row is a header. Convert `population` to a tidy dataset. Remove the state ID column and change the name of the column with state names to `state_name`. Add a column with state abbreviations called `state`. Make sure you assign the abbreviations for DC and PR correctly. Hint: Use the janitor package to make the first row the header. Print out the first few rows of your cleaned dataset.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
#population <- population |> ## Use janitor row to names function
# convert to tibble
# remove state column
# rename NAME column to state_name
# use pivot_longer to tidy
# remove POP_ from year
# parse all relevant columns to numeric
# add state abbreviations using state.abb variable mapped from the state.name variable
# use case_when to add abbreviations for DC and PR
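The steps listed in the comments above can be sketched on a small invented matrix. This is one possible ordering, not necessarily the intended one: it converts to a data frame before calling `row_to_names()`, the population values are made up, and `toy` and `toy_tidy` are hypothetical names. It assumes dplyr 1.1.0 or later for the `.default` argument of `case_when()`.

```r
library(tidyverse)
library(janitor)

# Invented character matrix whose first row holds the variable names
toy <- rbind(
  c("POP_2020", "POP_2021", "NAME", "state"),
  c("100", "110", "Alabama", "01"),
  c("20", "21", "District of Columbia", "11"),
  c("30", "29", "Puerto Rico", "72")
)

toy_tidy <- toy |>
  as.data.frame() |>
  row_to_names(row_number = 1) |>   # promote the first row to column names
  as_tibble() |>
  select(-state) |>                 # drop the state ID column
  rename(state_name = NAME) |>
  pivot_longer(starts_with("POP_"), names_to = "year", values_to = "population") |>
  mutate(
    year = as.numeric(str_remove(year, "POP_")),
    population = as.numeric(population),
    # state.abb and state.name cover the 50 states only, so DC and PR are set by hand
    state = case_when(
      state_name == "District of Columbia" ~ "DC",
      state_name == "Puerto Rico" ~ "PR",
      .default = state.abb[match(state_name, state.name)]
    )
  )

head(toy_tidy)
```

`match()` returns `NA` for names that are not in `state.name`, which is why the two special cases are handled explicitly.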
- (1 point) As a check, make a barplot of states' 2020 and 2021 populations. Show the state names on the y-axis ordered by descending population size. Hint: You will need to use `facet_wrap`.
# population |>
# reorder state
# assign aesthetic mapping
# use geom_col to plot barplot
# flip coordinates
# facet by year
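A minimal sketch of an ordered, faceted bar plot on invented data follows. The values and the names `toy_pop`, `toy_ordered`, and `p` are hypothetical, and `reorder()` is only one of several ways to order a factor.

```r
library(tidyverse)

# Invented populations for three placeholder states in two years
toy_pop <- tibble(
  state_name = rep(c("A", "B", "C"), times = 2),
  year = rep(c(2020, 2021), each = 3),
  population = c(5, 20, 10, 6, 19, 11)
)

# reorder() sorts the factor levels by the mean population of each state
toy_ordered <- toy_pop |>
  mutate(state_name = reorder(state_name, population))

p <- toy_ordered |>
  ggplot(aes(x = state_name, y = population)) +
  geom_col() +
  coord_flip() +        # puts the state names on the y-axis
  facet_wrap(~year)     # one panel per year

p
```

After `coord_flip()`, the last factor level is drawn at the top, so ordering the levels by increasing population puts the largest state first when reading down the axis.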
- (1 point) The following URL:
url <- "https://github.com/datasciencelabs/2024/raw/refs/heads/main/data/regions.json"
points to a JSON file that lists the states in the 10 Public Health Service (PHS) regions defined by CDC. We want to add these regions to the `population` dataset. To facilitate this, create a data frame called `regions` that has three columns: `state_name`, `region`, and `region_name`. One of the regions has a long name. Change it to something shorter. Print the first few rows of `regions`. Make sure that `region` is a factor.
library(jsonlite)
Attaching package: 'jsonlite'
The following object is masked from 'package:purrr':
flatten
library(purrr)
url <- "https://github.com/datasciencelabs/2024/raw/refs/heads/main/data/regions.json"
# regions <- use fromJSON to read as a data.frame
# rename long region
# use unnest to split the states
# rename column as state_name
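The general pattern of reading nested JSON and unnesting a list column can be sketched as below. The JSON string is invented and its layout (fields `region`, `region_name`, and `states`) is an assumption; the real file may be organized differently, so inspect it before reusing any of this.

```r
library(jsonlite)
library(dplyr)
library(tidyr)

# Hypothetical JSON in an assumed layout (not the contents of the real file)
toy_json <- '[
  {"region": 1, "region_name": "North", "states": ["A", "B"]},
  {"region": 2, "region_name": "A Very Long Name For The South", "states": ["C"]}
]'

toy_regions <- fromJSON(toy_json) |>
  mutate(
    # shorten the long name and store the region as a factor
    region_name = ifelse(region == 2, "South", region_name),
    region = factor(region)
  ) |>
  unnest(states) |>               # one row per state
  rename(state_name = states)

toy_regions
```

Here `fromJSON()` returns a data frame whose `states` column is a list of character vectors, and `unnest()` expands that column to one row per element.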
- (1 point) Add region and region name columns to the `population` data frame using the joining methods we have learned. Print out the first few rows.
# population <-
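As a small illustration of a join, the sketch below uses `left_join()` on two invented tables. The choice of `left_join()` and the names `toy_population`, `toy_lookup`, and `joined` are assumptions for the example.

```r
library(dplyr)

# Invented tables sharing a state_name key
toy_population <- tibble(
  state_name = c("A", "B", "C"),
  population = c(5, 20, 10)
)
toy_lookup <- tibble(
  state_name = c("A", "B"),
  region = factor(c(1, 2)),
  region_name = c("North", "South")
)

# left_join keeps every row of the left table; unmatched keys get NA
joined <- left_join(toy_population, toy_lookup, by = "state_name")
joined
```

A left join keeps states that have no region entry, which makes missing matches easy to spot as `NA` values.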
- (1 point) From reading https://data.cdc.gov/ we learn that the endpoint https://data.cdc.gov/resource/pwn4-m3yp.json provides state-level data on SARS-CoV-2 cases. Use the httr2 tools you have learned to download this into a data frame. Is all the data there? If not, comment on why.
api <- "https://data.cdc.gov/resource/pwn4-m3yp.json"
# cases_raw <-
- (1 point) You see exactly 1,000 rows because the CDC API has a default limit. You can change this limit by adding `$limit=10000000000` to the request. Rewrite the previous request to ensure that you receive all the data. Then wrangle the resulting data frame to produce a data frame with columns `state`, `date` (should be the end date), and `cases`. Make sure the cases are numeric and the dates are in `Date` ISO-8601 format. Print out the first several rows.
api <- "https://data.cdc.gov/resource/pwn4-m3yp.json"
# cases_raw <-
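The sketch below shows one way to add the `$limit` parameter with `req_url_query()` and then convert string columns to `Date` and numeric types. The request is only built, not sent. The toy table is invented, and its column names `end_date` and `new_cases` are assumptions that should be checked against the actual response.

```r
library(httr2)
library(dplyr)
library(lubridate)

api <- "https://data.cdc.gov/resource/pwn4-m3yp.json"

# Pass the limit as a string so it is not rendered in scientific notation
limited_request <- request(api) |>
  req_url_query(`$limit` = "10000000000")
limited_request$url

# Invented rows with assumed column names, mimicking string-typed JSON fields
toy_cases_raw <- tibble(
  state = c("AL", "AL"),
  end_date = c("2020-01-29T00:00:00.000", "2020-02-05T00:00:00.000"),
  new_cases = c("0.0", "12.0")
)

toy_cases <- toy_cases_raw |>
  mutate(
    date = as_date(ymd_hms(end_date)),  # parse the timestamp, keep only the date
    cases = as.numeric(new_cases)
  ) |>
  select(state, date, cases)

toy_cases
```

The `$` in the parameter name may be percent-encoded in the printed URL; the server decodes it back to `$limit`.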
- (1 point) For 2020 and 2021, make a time series plot of cases per 100,000 versus time for each state. Stratify the plot by region name and make a separate line plot for each state. Don't use colors for this plot, but set `alpha = 0.2` to make the plots more easily visible. Make sure to label your graph appropriately.
#cases_raw |>
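A rate per 100,000 is computed as cases divided by population, multiplied by 100,000. The sketch below applies that to invented data and draws one semi-transparent line per state, faceted by region. All values, the column layout, and the names `toy`, `toy_rates`, and `p` are hypothetical.

```r
library(tidyverse)

# Invented weekly counts for two placeholder states in two regions
toy <- tibble(
  state = rep(c("A", "B"), each = 3),
  region_name = rep(c("North", "South"), each = 3),
  date = rep(as.Date(c("2020-01-01", "2020-01-08", "2020-01-15")), times = 2),
  cases = c(1, 2, 4, 10, 5, 0),
  population = rep(c(1000, 5000), each = 3)
)

# Cases per 100,000 people
toy_rates <- toy |>
  mutate(rate = cases / population * 1e5)

p <- toy_rates |>
  ggplot(aes(x = date, y = rate, group = state)) +  # group gives one line per state
  geom_line(alpha = 0.2) +
  facet_wrap(~region_name) +
  labs(x = "Date", y = "Cases per 100,000", title = "Toy example of case rates by region")

p
```

Using the `group` aesthetic instead of `color` draws a separate line for each state without adding a color legend.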