There’s a lot of information in these slides! While tidyverse, the data.table package and Python all have a lot of useful functionality, we recommend sticking to the base R and tidyverse tools presented here. Slides covering material outside this scope will be marked with an asterisk (*); you should be cautious about using code from those slides!
Tidying data
We will work with a wide format dataset as an example:
One of the most used functions in tidyr is pivot_longer.
The first argument is a data frame, the one that will be converted.
We want to reshape the data so that each row represents a fertility observation.
We need three columns to store the year, country, and the observed value.
pivot_longer
In its current form, data from different years are in different columns with the year values stored in the column names.
The names_to and values_to arguments tell pivot_longer the column names we want to assign to the columns containing the current column names and observations, respectively.
The default names are name and value; in our case, better choices are year and fertility.
pivot_longer
Through cols, the second argument, we specify the columns containing observed values; these are the columns that will be pivoted.
The default is to pivot all columns so, in most cases, we have to specify the columns. In our example we want columns 1960, 1961 up to 2015.
pivot_longer
The code to pivot the fertility data therefore looks like this:
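A sketch of that call, assuming the wide table is stored in an object named wide_data (the object name is an assumption; the year columns run from 1960 to 2015, as described above):

```r
library(tidyr)

# wide_data is assumed to have a `country` column plus one column per year
tidy_data <- wide_data |>
  pivot_longer(cols = `1960`:`2015`,     # the columns being pivoted
               names_to = "year",        # column names go to the year column
               values_to = "fertility")  # cell values go to the fertility column
```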
# A tibble: 2 × 5
country `1960_fertility` `1960_life_expectancy` `1961_fertility`
<chr> <dbl> <dbl> <dbl>
1 Germany 2.41 69.3 2.44
2 South Korea 6.16 53.0 5.99
# ℹ 1 more variable: `1961_life_expectancy` <dbl>
Example
Note that the data is in wide format.
Note also that this table includes values for two variables, fertility and life expectancy, with the column names encoding which variable each column represents.
Encoding information in the column names is not recommended but, unfortunately, it is quite common.
Example
We start with the pivot_longer function, but we should no longer use the column name year for the new column since it also contains the variable type.
We will call it name, the default, for now:
raw_dat |> pivot_longer(-country) |> head()
# A tibble: 6 × 3
country name value
<chr> <chr> <dbl>
1 Germany 1960_fertility 2.41
2 Germany 1960_life_expectancy 69.3
3 Germany 1961_fertility 2.44
4 Germany 1961_life_expectancy 69.8
5 Germany 1962_fertility 2.47
6 Germany 1962_life_expectancy 70.0
Example
The result is not exactly what we refer to as tidy, since each observation is associated with two rows, not one.
We want to have the values from the two variables, fertility and life expectancy, in two separate columns.
The first challenge to achieve this is to separate the name column into the year and the variable type.
Example
Encoding multiple variables in a column name is such a common problem that the tidyr package includes functions to separate these columns into two or more:
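For instance, one way to do this (a sketch using pivot_longer's names_pattern argument; tidyr also offers helpers such as separate_wider_delim) is to split each name into its year digits and the remaining variable type, then spread the two variables into their own columns:

```r
library(tidyr)

raw_dat |>
  pivot_longer(-country,
               names_to = c("year", "name"),
               # first capture group: the year digits;
               # second: everything after the underscore (the variable type)
               names_pattern = "(\\d+)_(.*)") |>
  pivot_wider(names_from = name, values_from = value)
```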
The dtplyr R package translates dplyr (tidyverse) syntax to data.table, so that we can still use the dplyr verbs while at the same time leveraging the performance of data.table.
Loading the data
The data that we will be using is an already processed version of the MET dataset. We can download (and load) the data directly in our session using the following commands:
# Where are we getting the data from
met_url <- "https://github.com/dmcable/BIOSTAT620W26/raw/main/data/met/met_all.gz"

# Downloading the data to a tempfile (so it is destroyed afterwards)
# you can replace this with, for example, your own data:
tmp <- tempfile(pattern = "met", fileext = ".gz")
# tmp <- "met.gz"

# We should be downloading this ONLY IF it was not downloaded already;
# otherwise it is just a waste of time.
if (!file.exists(tmp)) {
  download.file(
    url = met_url,
    destfile = tmp
    # method = "libcurl", timeout = 1000 (you may need these options)
  )
}
Now we can load the data using the read.csv() or fread() functions.
Read the Data
In R (base)
# Reading the data
dat <- read.csv(tmp)
head(dat)
In R (data.table)
# Reading the data
dat <- fread(tmp)
head(dat)
dat <- as.data.frame(dat)
In Python
import datatable as dt
dat = dt.fread("met.gz")
dat.head(5)
Before we continue, let’s learn a bit more about data.table and dtplyr
* data.table and dtplyr: Data Table’s Syntax
As you have seen in previous lectures, in data.table everything happens within the square brackets. A common way to think of a data.table (DT) is DT[i, j, by]: take the rows selected by i, compute j, grouped by by.
Any time you see := in j, that is “assignment by reference.” Using = within j only works in some specific cases.
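A minimal illustration of the difference:

```r
library(data.table)

dt <- data.table(x = 1:3)

# `:=` modifies dt in place ("assignment by reference"); no copy is made
dt[, y := x * 2]

# `=` inside .( ) builds the columns of a brand-new result instead
dt[, .(total = sum(y))]
```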
* data.table and dtplyr: Data Table’s Syntax
Operations applied in j are evaluated within the data, meaning that names work as symbols, e.g.,
data("USArrests")
USArrests_dt <- data.table(USArrests)

# This returns an error
USArrests[, Murder]

# This works fine
USArrests_dt[, Murder]
Furthermore, we can do things like this:
USArrests_dt[, plot(Murder, UrbanPop)]
NULL
* data.table and dtplyr: Lazy table
The dtplyr package provides a way to translate dplyr verbs to data.table syntax.
The key lies in the function lazy_dt from dtplyr (see ?dtplyr::lazy_dt).
This function creates a wrapper that “points” to a data.table object.
* data.table and dtplyr: Lazy table (cont.)
# Creating a lazy table object
dat_ldt <- lazy_dt(dat, immutable = FALSE)

# We can use the address() function from data.table
address(dat)
address(dat_ldt$parent)
Question: What is the immutable = FALSE option used for?
* Selecting columns
How can we select the columns USAFID, lat, and lon, using data.table:
dat[, list(USAFID, lat, lon)]
# dat[, .(USAFID, lat, lon)]         # Alternative 1
# dat[, c("USAFID", "lat", "lon")]   # Alternative 2
# Question 3
# Drawing a sample
set.seed(123)
idx1 <- sample(which(is.na(dat$temp)), 1000)
idx2 <- sample(which(!is.na(dat$temp)), 1000)

# Visualizing the data
# make a map of the US, as we did last class
ggplot(map_data("state"), aes(x = long, y = lat)) +
  geom_map(aes(map_id = region), map = map_data("state"),
           col = "lightgrey", fill = "gray") +
  geom_jitter(
    data = dat[c(idx1, idx2), ],
    mapping = aes(x = lon, y = lat, col = is.na(temp)),
    inherit.aes = FALSE, alpha = .5, cex = 2
  )
Creating variables: Data types
logical: Boolean true/false type, e.g. dead/alive, sick/healthy, good/bad, yes/no, etc.
strings: string of characters (letters/symbols), e.g. names, text, etc.
integer: Numeric variable with no decimal (discrete), e.g. age, days, counts, etc.
double: Numeric variable with decimals (continuous), e.g. distance, expression level, time.
In C (and other languages), strings, integers, and doubles may be specified with a size, e.g. integers can be 8, 16, or 32 bits. This is relevant when managing large datasets, where saving space can be fundamental.
Creating variables: Special data types
Most programming languages have special types which are built using basic types. A few examples:
time: Could be date, date + time, or a combination of both. Usually it has a reference number defined as date 0. In R, the Date class has as reference 1970-01-01, in other words, “days since January 1st, 1970”.
categorical: Commonly used to represent strata/levels of variables, e.g. a variable “country” could be represented as a factor, where the data is stored as numbers but has a label.
ordinal: Similar to factor, but it has ordering, e.g. “satisfaction level: 5 very satisfied, …, 1 very unsatisfied”.
Other special data types could be ways to represent missings (usually described as na or NA), or special numeric types, e.g. +-Inf and Undefined (NaN).
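A quick illustration of these special types in R:

```r
# time: Dates are stored as days since the reference date 1970-01-01
d <- as.Date("1970-01-11")
unclass(d)    # the underlying number of days: 10 days after the origin

# categorical: factors store integers plus labels
f <- factor(c("US", "MX", "US"))
as.integer(f)
levels(f)

# ordinal: ordered factors allow comparisons between levels
o <- factor(c("low", "high"), levels = c("low", "medium", "high"),
            ordered = TRUE)
o[1] < o[2]

# special values: missing, infinities, and undefined
c(NA, Inf, -Inf, NaN)
```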
When storing/sharing datasets, it is a good practice to do it along a dictionary describing each column data type/format.
Question 3: What’s the best way to represent the following?
0, 1, 1, 0, 0, 1
Diabetes type 1, Diabetes type 2, Diabetes type 1, Diabetes type 2
on, off, off, on, on, on
5, 10, 1, 15, 0, 0, 1
1.0, 2.0, 10.0, 6.0
high, low, medium, medium, high
-1, 1, -1, -1, 1
.2, 1.5, .8, \(\pi\)
\(\pi\), \(\exp{1}\), \(\pi\), \(\pi\)
Variable creation
If we wanted to create two variables, elev^2 and the version of wind.sp scaled by its standard deviation, we could do the following:
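In base R, that could look like this (a sketch; it assumes the elevation column is named elev, as in the MET data):

```r
# Squared elevation
dat$elev2 <- dat$elev^2

# wind.sp scaled by its standard deviation (na.rm = TRUE skips missing values)
dat$wind.sp_scaled <- dat$wind.sp / sd(dat$wind.sp, na.rm = TRUE)
```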
Imagine that we needed to scale multiple variables by their SD and didn’t want to copy-paste this code several times. Here’s how we could do it automatically for a given list of variable names:
# Listing the names
names <- c("wind.sp", "temp", "atm.press")
for (var in names) {
  dat[, paste0(var, '_scaled')] <- dat[, var] / sd(dat[, var], na.rm = TRUE)
}
Why can’t we use dat$var inside the loop?
* Or with data.table
in_names  <- c("wind.sp", "temp", "atm.press")
out_names <- paste0(in_names, "_scaled")

dat[,
  c(out_names) := lapply(.SD, function(x) x / sd(x, na.rm = TRUE)),
  .SDcols = in_names
]

# Looking at the first 4
head(dat[, .SD, .SDcols = out_names], n = 4)
Key things to notice here: c(out_names), .SD, and .SDcols.
Variable creation (cont. 3)
In the case of dplyr, we could use the following
names <- c("wind.sp", "temp", "atm.press")
dat |>
  mutate(
    across(
      all_of(names),
      function(x) x / sd(x, na.rm = TRUE),
      .names = "{col}_scaled2"
    )
  ) |>
  # Just to print the last columns
  select(ends_with("_scaled2")) |>
  head(n = 4)
Don’t forget about loops! for loops and sapply may be slow on a dataset of this size, but they can be quite handy for creating variables that rely on complicated relationships between variables. Consider this a “brute force” approach. Vectorized methods will always be faster, but these can be easier to conceptualize and, in rare cases, may be the only option.
Consider the problem of creating a weird variable: wind.temp. It will take one of 4 possible values, based on the temperature and wind speed: cool & still, cool & windy, warm & still, or warm & windy. We will split each variable at its median value. Note that this code is too slow to actually run on a dataset this large.
Complex variable creation (cont 1)
Here’s how we would do that with the sapply function (and a custom, unnamed function):
# create the new variable one entry at a time
wind.temp <- sapply(1:nrow(dat), function(i) {
  if (is.na(dat$temp[i]) | is.na(dat$wind.sp[i])) {
    return(NA)
  }
  if (dat$temp[i] <= median(dat$temp, na.rm = TRUE)) {
    if (dat$wind.sp[i] <= median(dat$wind.sp, na.rm = TRUE)) {
      return('cool & still')
    } else {
      return('cool & windy')
    }
  } else {
    if (dat$wind.sp[i] <= median(dat$wind.sp, na.rm = TRUE)) {
      return('warm & still')
    } else {
      return('warm & windy')
    }
  }
})
Check: what would we need to change to add this variable to our dataset?
Complex variable creation (cont 2)
Here’s the code for doing that with a for loop:
# initialize a variable of all missing values
wind.temp <- rep(NA, nrow(dat))

# fill in the values one at a time
for (i in 1:nrow(dat)) {
  # skip missing entries; they stay NA from the initialization
  # (return() is not valid inside a for loop, only inside a function)
  if (!is.na(dat$temp[i]) & !is.na(dat$wind.sp[i])) {
    if (dat$temp[i] <= median(dat$temp, na.rm = TRUE)) {
      if (dat$wind.sp[i] <= median(dat$wind.sp, na.rm = TRUE)) {
        wind.temp[i] <- 'cool & still'
      } else {
        wind.temp[i] <- 'cool & windy'
      }
    } else {
      if (dat$wind.sp[i] <= median(dat$wind.sp, na.rm = TRUE)) {
        wind.temp[i] <- 'warm & still'
      } else {
        wind.temp[i] <- 'warm & windy'
      }
    }
  }
}
Check: why do we need to include na.rm=TRUE when calculating the medians?
Complex variable creation (cont 3)
Here’s a simple vectorized approach that will actually run on a large dataset. This works for our current case, but it’s still a brute force approach, because we had to specifically assign every possible value of our new variable. You can imagine that as the number of possible values increases, this code will get increasingly cumbersome.
# initialize a variable of all missing values
wind.temp <- rep(NA, nrow(dat))

# assign every possible value by subsetting
wind.temp[dat$temp <= median(dat$temp, na.rm = TRUE) &
          dat$wind.sp <= median(dat$wind.sp, na.rm = TRUE)] <- 'cool & still'
wind.temp[dat$temp <= median(dat$temp, na.rm = TRUE) &
          dat$wind.sp > median(dat$wind.sp, na.rm = TRUE)] <- 'cool & windy'
wind.temp[dat$temp > median(dat$temp, na.rm = TRUE) &
          dat$wind.sp <= median(dat$wind.sp, na.rm = TRUE)] <- 'warm & still'
wind.temp[dat$temp > median(dat$temp, na.rm = TRUE) &
          dat$wind.sp > median(dat$wind.sp, na.rm = TRUE)] <- 'warm & windy'
head(wind.temp)
While building the MET dataset, we dropped the State data.
We can use the original Stations dataset and merge it to the MET dataset.
But we cannot do it right away. We need to process the data somewhat first.
Merging data (cont. 1)
stations <- fread("https://noaa-isd-pds.s3.amazonaws.com/isd-history.csv")
stations <- as.data.frame(stations)
stations$USAF <- as.integer(stations$USAF)

# Dealing with NAs and 999999
stations$USAF[stations$USAF == 999999] <- NA
stations$CTRY[stations$CTRY == ""] <- NA
stations$STATE[stations$STATE == ""] <- NA

# Selecting the three relevant columns, and keeping unique records
stations <- unique(stations[, c('USAF', 'CTRY', 'STATE')])

# Dropping NAs
stations <- stations[!is.na(stations$USAF), ]

head(stations, n = 4)
USAF CTRY STATE
1 7018 <NA> <NA>
2 7026 AF <NA>
3 7070 AF <NA>
4 8260 <NA> <NA>
Merging data (cont. 2)
merge(
  # Data
  x = dat, y = stations,
  # List of variables to match
  by.x = "USAFID", by.y = "USAF",
  # Which obs to keep?
  all.x = TRUE, all.y = FALSE
) |> nrow()
[1] 2385443
This is more rows! The original dataset, dat, has 2377343 rows, which means the stations dataset has duplicated IDs. We can fix this:
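One way is to keep a single record per station ID (a sketch; which duplicate survives is arbitrary here):

```r
# Keep only the first record for each USAF ID
stations <- stations[!duplicated(stations$USAF), ]
```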
We can now use the merge() function to add the extra data:
dat <- merge(
  # Data
  x = dat, y = stations,
  # List of variables to match
  by.x = "USAFID", by.y = "USAF",
  # Which obs to keep?
  all.x = TRUE, all.y = FALSE
)

head(dat[, c('USAFID', 'WBAN', 'STATE')], n = 4)
USAFID WBAN STATE
1 690150 93121 CA
2 690150 93121 CA
3 690150 93121 CA
4 690150 93121 CA
What happens when you change the options all.x and all.y?
* Aggregating data: Adding grouped variables
Many times we need to either impute some data, or generate variables by strata.
If we, for example, wanted to impute missing temperature with the daily state average, we could use by together with the data.table::fcoalesce() function:
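A sketch of what that could look like (assuming dat is a data.table with STATE and day columns, as in the merged MET data):

```r
library(data.table)

# For each state/day stratum, fill in missing temperatures with the
# stratum average; fcoalesce() keeps the observed value when present
dat[, temp_imp := fcoalesce(temp, mean(temp, na.rm = TRUE)),
    by = .(STATE, day)]
```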