2025-01-16
Use install.packages to install the dslabs package.
Try out the following functions: sessionInfo, installed.packages.
Much of what we do in R is based on prebuilt functions.
Many are included in automatically loaded packages: stats, graphics, grDevices, utils, datasets, methods.
This subset of the R universe is referred to as R base.
Very popular packages not included in R base: ggplot2, dplyr, tidyr, and data.table.
Examples of prebuilt functions that we will use today: ls, rm, library, search, factor, list, exists, str, typeof, and class.
You can see the raw code for a function by typing its name without the parentheses: type ls on your console to see an example.
In R you can use ? or help to learn more about functions.
You can learn about a function, for example ls, using help("ls") or ?ls.
After defining a variable, use ls to see if it’s there. Also take a look at the Environment tab in RStudio.
Use rm to remove the variable you defined.
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces.
For more we recommend this guide.
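The workflow and naming convention above can be sketched as follows. The variable name murder_rate_threshold and its value are made up for this illustration.

```r
# Hypothetical variable, named with lower case words joined by underscores
murder_rate_threshold <- 2.5

# ls() lists the objects currently defined, so the new name shows up there
was_listed <- "murder_rate_threshold" %in% ls()
was_listed

# rm() removes the variable again
rm(murder_rate_threshold)
exists("murder_rate_threshold")   # FALSE once it has been removed
```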
The main data types in R are:
One-dimensional vectors: numeric, integer, logical, complex, character.
Factors
Lists: this includes data frames.
Arrays: Matrices are the most widely used.
Date and time
tibble
S4 objects
Many errors in R come from confusing data types.
str stands for structure and gives us information about an object.
typeof gives you the basic data type of the object. It reveals the lower-level, more fundamental type of an object in R’s memory.
class returns the class attribute of an object. The class of an object is essentially its type at a higher, often user-facing level.
Let’s see some examples:
[1] "list"
[1] "data.frame"
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
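The calls that produced the output above are not shown in these notes. As a minimal sketch, the same three functions can be tried on a small data frame built by hand. The object name states is hypothetical, and its values are copied from the first rows of the output above.

```r
# Small hand-made data frame, used only for illustration
states <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona"),
  region     = factor(c("South", "West", "West")),
  population = c(4779736, 710231, 6392017)
)

typeof(states)  # "list": a data frame is stored as a list of columns
class(states)   # "data.frame": the higher-level, user-facing class
str(states)     # compact overview of every column and its type
```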
Data frames are the most common class used in data analysis. A data frame is like a spreadsheet.
Usually, rows represent observations and columns represent variables.
Each variable can be a different data type.
You can see the first few rows of the content like this:
state abb region population total pop_rank
1 Alabama AL South 4779736 135 29
2 Alaska AK West 710231 19 5
3 Arizona AZ West 6392017 232 36
4 Arkansas AR South 2915918 93 20
5 California CA West 37253956 1257 51
6 Colorado CO West 5029196 65 30
Note that we used $.
This is called the accessor because it lets us access columns.
[1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934
[9] 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355
[17] 2853118 4339367 4533372 1328361 5773552 6547629 9883640 5303925
[25] 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179
[33] 19378102 9535483 672591 11536504 3751351 3831074 12702379 1052567
[41] 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540
[49] 1852994 5686986 563626
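A minimal sketch of the accessor on a hand-made data frame follows. The name states is hypothetical, and the use of rank to build a pop_rank column is an assumption about how that column in the table above was created, since the original call is not shown.

```r
# Small hand-made data frame, used only for illustration
states <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona"),
  population = c(4779736, 710231, 6392017)
)

# $ extracts a column as a vector
states$population

# $ can also be used on the left side to add a new column
# (assumption: pop_rank is the rank of the population, smallest = 1)
states$pop_rank <- rank(states$population)
states
```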
One way R confuses beginners is by having multiple ways of doing the same thing.
For example you can access the 4th column in the following five different ways:
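The original code is not shown here, so the five forms below are an assumption about which ones were meant. They are all standard ways to extract a column, sketched on a hand-made data frame with a hypothetical name.

```r
# Small hand-made data frame whose 4th column is population
states <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona"),
  abb        = c("AL", "AK", "AZ"),
  region     = factor(c("South", "West", "West")),
  population = c(4779736, 710231, 6392017)
)

v1 <- states$population        # accessor with the column name
v2 <- states[, "population"]   # row/column indexing by name
v3 <- states[["population"]]   # list-style indexing by name
v4 <- states[, 4]              # row/column indexing by position
v5 <- states[[4]]              # list-style indexing by position

# All five return the same numeric vector
all(identical(v1, v2), identical(v1, v3), identical(v1, v4), identical(v1, v5))
```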
The function with lets us use the column names as objects.
This is convenient to avoid typing the data frame name over and over:
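As a minimal sketch, the two computations below give the same result with and without with. The data frame states and the rate per 100,000 calculation are made up for this illustration.

```r
# Small hand-made data frame, used only for illustration
states <- data.frame(
  state      = c("Alabama", "Alaska", "Arizona"),
  population = c(4779736, 710231, 6392017),
  total      = c(135, 19, 232)
)

# Without with: the data frame name is typed for every column
rate_long <- states$total / states$population * 10^5

# With with: the column names are used directly as objects
rate_short <- with(states, total / population * 10^5)

identical(rate_long, rate_short)
```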
Often we have to create vectors.
The concatenate function c is the most basic way used to create vectors:
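A few ways of using c, as a minimal sketch. All names and values are made up for this illustration.

```r
# Numeric and character vectors
x <- c(2, 3, 5)
fruits <- c("apple", "banana", "cherry")

# c can also combine existing vectors with new values
z <- c(x, 7, 11)
length(z)

# Entries can be named when the vector is created
codes <- c(first = 10, second = 20, third = 30)
names(codes)
codes["second"]
```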
A convenient way to write 1:length(x) is seq_along(x).
One key distinction between data types you need to understand is the difference between factors and characters.
The murders dataset has examples of both.
Factors store levels and the label of each level.
This is useful for categorical data.
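The difference can be seen by converting a character vector to a factor. This is a minimal sketch with made-up values; the object names are hypothetical.

```r
region_chr <- c("South", "West", "West", "South", "Northeast")
region_fct <- factor(region_chr)

class(region_chr)        # "character"
class(region_fct)        # "factor"

# A factor stores the distinct levels once, sorted alphabetically by default
levels(region_fct)       # "Northeast" "South" "West"

# Each entry is stored as an integer code pointing at a level
as.integer(region_fct)   # 2 3 3 2 1

# Factors work naturally with functions that count categories
table(region_fct)
```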
In data analysis we often have to stratify continuous variables into categories.
The function cut helps us do this:
age <- c(5, 93, 18, 102, 14, 22, 45, 65, 67, 25, 30, 16, 21)
cut(age, c(0, 11, 27, 43, 59, 78, 96, Inf),
    labels = c("Alpha", "Zoomer", "Millennial", "X", "Boomer", "Silent", "Greatest"))
 [1] Alpha      Silent     Zoomer     Greatest   Zoomer     Zoomer
 [7] X          Boomer     Boomer     Zoomer     Millennial Zoomer
[13] Zoomer
Levels: Alpha Zoomer Millennial X Boomer Silent Greatest
The order of the levels can be changed with the levels argument of factor.
A common reason we need to change levels is to ensure R knows which stratum is the reference.
This is important for linear models because the first level is assumed to be the reference.
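A minimal sketch of two ways to control which level comes first: the levels argument of factor, and relevel, which is swapped in here as a shortcut when only the reference needs to change. The values are made up.

```r
region <- c("South", "West", "West", "South", "Northeast")

# Default: levels are sorted alphabetically, so "Northeast" is first
f_default <- factor(region)
levels(f_default)

# The levels argument sets the order explicitly
f_custom <- factor(region, levels = c("West", "South", "Northeast"))
levels(f_custom)

# relevel moves one level to the front and keeps the others in order
f_ref <- relevel(f_default, ref = "South")
levels(f_ref)
```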
We often want to order strata based on a summary statistic.
This is common in data visualization.
We can use reorder for this:
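As a minimal sketch, reorder is used below to sort the levels of a factor by the sum of a second variable within each level. The values are made up, and the choice of sum as the summary statistic is an assumption for this illustration.

```r
region <- factor(c("South", "West", "West", "South", "Northeast"))
total  <- c(135, 19, 32, 93, 500)   # made-up counts

levels(region)   # alphabetical: "Northeast" "South" "West"

# Order the levels by the sum of total within each region (ascending)
by_total <- reorder(region, total, FUN = sum)
levels(by_total) # "West" (51), "South" (228), "Northeast" (500)
```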
80000232 bytes
40000648 bytes
Exercise: How can we make this go much faster?
Levels that are no longer used can be removed with droplevels.
NA stands for not available.
Data analysts have to deal with NAs often.
The is.na function is key for dealing with NAs.
A related constant is NaN.
Unlike NA, which is a logical, NaN is a number.
It is a numeric that is Not a Number.
Here are some examples:
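The original examples are not shown here. A minimal sketch of how NA and NaN behave, with made-up values:

```r
x <- c(1, NA, 3)

is.na(x)                # FALSE TRUE FALSE
mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # 2: the NA is dropped before averaging

class(NA)               # "logical"
class(NaN)              # "numeric"

0 / 0                   # NaN
is.nan(0 / 0)           # TRUE
is.na(NaN)              # TRUE: is.na also flags NaN
is.nan(NA)              # FALSE: NA is not NaN
```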
When you do something inconsistent with data types, R tries to figure out what you mean and changes the data type accordingly.
We call this coercing.
R does not return an error and in some cases does not return a warning either.
This can cause confusion and unnoticed errors.
So it’s important to understand how and when it happens.
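A minimal sketch of automatic coercion happening without any error or warning. The values are made up.

```r
# Mixing numbers and characters: everything silently becomes character
x <- c(1, "b", 3)
x
class(x)                     # "character"

# Logicals are silently treated as 0 and 1 in arithmetic
TRUE + TRUE                  # 2
sum(c(TRUE, FALSE, TRUE))    # 2

# Mixing integers and doubles gives doubles
typeof(c(1L, 2.5))           # "double"
```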
Including NAs in arithmetical operations usually returns an NA.
You want to avoid automatic coercion and instead do it explicitly.
Most coercion functions start with the prefix as., for example as.numeric and as.character.
Here is an example.
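The original example is not shown here. A minimal sketch of explicit coercion with made-up values:

```r
x <- c("1", "2", "3")

num <- as.numeric(x)      # character to numeric
num + 1                   # arithmetic now works: 2 3 4

as.character(1:3)         # numeric to character: "1" "2" "3"
as.integer(c(TRUE, FALSE))

# Entries that cannot be converted become NA. R normally warns with
# "NAs introduced by coercion"; the warning is suppressed here on purpose.
y <- suppressWarnings(as.numeric(c("1", "b", "3")))
y                         # 1 NA 3
```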
Data frames are a type of list.
Lists permit components of different types and, unlike data frames, different lengths:
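A minimal sketch of a list whose components differ in both type and length. All names and values are made up for this illustration.

```r
# Hypothetical student record
record <- list(
  name       = "John Doe",
  student_id = 1234,
  grades     = c(95, 82, 91),
  passed     = TRUE
)

class(record)       # "list"
lengths(record)     # components have lengths 1, 1, 3, 1

# Components are accessed with $ or [[ ]], just like data frame columns
record$grades
record[["name"]]
```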
Matrices are another widely used data type.
They are similar to data frames except all entries need to be of the same type.
We will learn more about matrices in the High Dimensional Data Analysis part of the class.
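A minimal sketch of creating and indexing a matrix, and of the same-type requirement. The values are made up.

```r
# matrix fills column by column by default
mat <- matrix(1:12, nrow = 4, ncol = 3)
mat
dim(mat)        # 4 3

mat[2, 3]       # entry in row 2, column 3: 10
mat[, 1]        # first column: 1 2 3 4

# All entries must share one type, so mixing types coerces the whole matrix
mixed <- cbind(1:2, c("a", "b"))
typeof(mixed)   # "character"
```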
Note that R searches the Global Environment first.
Use search to see other environments R searches.
Note many prebuilt functions are in stats.
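The search order can be explored with a minimal sketch. Here a hypothetical user-defined function named sd is created only to show that it is found before the prebuilt sd from stats when the code is run at the console.

```r
search()[1]                      # ".GlobalEnv": searched first
"package:stats" %in% search()    # TRUE: stats is attached by default

# Hypothetical function that shadows the prebuilt sd, for illustration only
sd <- function(x) "my own sd"
shadowed <- sd(c(1, 2, 3))          # the user-defined version is found first
original <- stats::sd(c(1, 2, 3))   # the stats version is still reachable

rm(sd)                              # remove the shadowing definition
restored <- sd(c(1, 2, 3))          # the stats version is found again
c(shadowed, original, restored)
```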
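When two packages define a function with the same name, such as filter, you can specify the one you want using namespaces, for example stats::filter or dplyr::filter.
R uses object-oriented programming (OOP).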
It uses two approaches referred to as S3 and S4, respectively.
S3, the original approach, is more common.
The S4 approach is more similar to the conventions used by modern OOP languages.
Note that co2 is not numeric: it is a time series of class ts.
plot behaves differently with different classes.
For a ts object, plot actually calls the function plot.ts.
You can see all the plot functions that start with plot. by typing plot. and then pressing Tab.
The function plot will call different functions depending on the class of the arguments.
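The same S3 mechanism can be sketched with a small generic of our own. The generic describe and its methods are hypothetical and exist only for this illustration; they mimic how plot picks a method based on class.

```r
# Hypothetical generic: UseMethod picks a method based on class(x)
describe <- function(x, ...) UseMethod("describe")
describe.default    <- function(x, ...) "a generic object"
describe.ts         <- function(x, ...) "a time series"
describe.data.frame <- function(x, ...) "a data frame"

class(co2)                    # "ts"
describe(co2)                 # dispatches to describe.ts
describe(as.numeric(co2))     # no numeric method, falls back to describe.default
describe(data.frame(a = 1))   # dispatches to describe.data.frame

# plot follows the same pattern: plot.ts is the method used for ts objects
is.function(stats::plot.ts)
```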
Soon we will learn how to use the ggplot2 package to make plots.
R base does have functions for plotting though.
Some you should know about are:
plot - mainly for making scatterplots.
lines - add lines/curves to an existing plot.
hist - to make a histogram.
boxplot - makes boxplots.
image - uses color to represent entries in a matrix.
Although, in general, we recommend using ggplot2, R base plots are often better for quick exploratory plots.
For example, to make a histogram of values in x simply type:
To make a scatterplot of y versus x and then interpolate we type:
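The original calls are not shown here. A minimal sketch of both plots with simulated data follows; the data, the seed, and the object names are made up for this illustration.

```r
set.seed(1)                          # arbitrary seed so the example is reproducible
x <- sort(runif(50, 0, 10))          # made-up x values, sorted so lines run left to right
y <- sin(x) + rnorm(50, sd = 0.2)    # made-up y values

hist(x)        # histogram of the values in x

plot(x, y)     # scatterplot of y versus x
lines(x, y)    # interpolate by connecting the points with line segments
```

Note that lines adds to the plot that is already open, so it has to come after plot.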