If you are a biologist, chances are good you have heard about the programming language R, used for statistical computing and graphics. This is because it is free, open-source software with hundreds of packages available to aid analyses of biological data. We know that programming can be very intimidating at first, but learning R is like learning any natural language: it takes time. The USF OmicsHub provides this introductory course to help researchers such as you start your programming journey. You will not become an expert after this course, but you will have the basic foundations to continue learning R confidently.
“R” is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
The one downside of using R is that its default interface is not very user-friendly, so RStudio was developed as an Integrated Development Environment (IDE) for R to provide further functionality. It includes a console, a syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging, and managing your workspace.
To install R, go to the CRAN R project website and follow the links for your operating system. R must be installed before you can run RStudio. Download RStudio here.
We have created step-by-step instructions here for this process!
Let’s get familiar with RStudio
When you open RStudio, it may look something like this. There are four general windows highlighted above:
Console: where you can type commands and see outputs. The console is all you would see if you ran R in the command line without RStudio.
In the image above, you can see a “>” in this window. This is the prompt, where you can run commands one at a time by pressing the return key. If a plus sign appears in the console, R is waiting for you to finish an incomplete command. You can press the escape key to return to the prompt.
You can also clear the console by clicking the faint broom icon in the top-right corner of this window.
Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console.
In the image above, you can see that a script titled “Untitled1” is open but empty, starting at line 1. There are different types of R scripts, such as R Markdown files (.Rmd), but we will start with R scripts (.R) like this one. You can also see the “Run” button in the top right of this window. The first run action runs the line of code where your blinking text cursor is located. You may also select and highlight the code you want to run before clicking this option. The second run option runs the entire script.
In the top-left corner of this window, you can see a floppy disk image that lets you save this script.
Environment/History: the Environment tab shows all active objects, and the History tab keeps a record of the commands you have run.
Files/Plots/Packages/Help: is primarily used for displaying graphs and for using the help system but it also shows the folders on your local computer.
Now that we know our way around RStudio, we can begin running some code. We can start by practicing within the console, but you can save these commands in a script as notes for later.
R is an object-oriented language. Objects are entities R operates on. These can be individual values, data sets, statistical outputs, or specialized functions. If it is something to which you can assign a name, it is an object.
Let’s try creating a variable object using the assignment operator <-
a <- 10
b <- 20
c <- a + b
These are numeric objects. You can probably guess what the value of c
is! R can handle most simple types of math operators. Here are some more:
Arithmetic Operators
+
: Addition
-
: Subtraction
*
: Multiplication
/
: Division
^
or **
: Exponentiation
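As a quick illustration of the operators listed above, here is a minimal sketch using two throwaway variables (x and y are names chosen only for this example).

```r
x <- 7
y <- 2

x + y   # addition: 9
x - y   # subtraction: 5
x * y   # multiplication: 14
x / y   # division: 3.5
x ^ y   # exponentiation: 49
x ** y  # same as x ^ y: 49
```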
We can run these variables alone to print their value. We can also see them in our environment.
c
## [1] 30
Objects or functions defined will show up here.
Now let’s create a vector or one-dimensional array using the concatenate function c()
a <- c(1,2,5.3,6,-2,4/9) # numeric vector
b <- c("one","two","three") # character vector
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector
There are three basic examples of vectors shown above: numeric, character, and logical.
Numeric: this data type is straightforward. Vector a
includes examples of what can be considered numerical data.
Character: this vector is made up of strings - these are any values written within a pair of single or double quotes. For example, "1"
would be considered a character value within R even though we know 1 is a number. We can check what type an R object is by using the class()
command. For example, try running class(a)
.
class(a)
## [1] "numeric"
Logical: these vectors are made up of TRUE/FALSE and are typically a way to index another vector which will return values for which the logical vector is TRUE.
Let’s try indexing or subsetting a
where c
is TRUE using the bracket operators []
a[c]
## [1] 1.0 2.0 5.3 -2.0
This is one way to subset data, but logical vectors are typically the output of comparison and logical operators such as the following:
==
: equal
!=
: not equal
<
: less than
<=
: less than or equal
>
: greater than
>=
: greater than or equal
|
: or
!
: not
%in%
: in the set
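To see how these operators produce logical vectors, here is a small sketch. The vector ages holds made-up values chosen only for illustration.

```r
ages <- c(1, 4, 7, 10)   # hypothetical values for illustration

ages > 4                 # FALSE FALSE  TRUE  TRUE
ages != 7                #  TRUE  TRUE FALSE  TRUE
ages < 2 | ages >= 10    #  TRUE FALSE FALSE  TRUE
!(ages == 1)             # FALSE  TRUE  TRUE  TRUE
ages[ages <= 4]          # 1 4
```

The last line combines both ideas: the comparison creates a logical vector, and the brackets keep only the elements where it is TRUE.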
Let’s use character vectors of chemical elements and US states to practice using the operator %in%
chem <- c("Li","Fl","Ca","Na","Fe","Se","Rb","Ag")
us <- c("Se","Ak","Ct","Hi","Ks","Mi","Fl","Ca")
This subsets the character vector of US states which are also chemical elements.
💭EXERCISE: try to subset ‘us’ where states are not also chemical elements.
us[us %in% chem]
## [1] "Se" "Fl" "Ca"
Typically we deal with larger data and sometimes just want a summary of our data.
We can use the summary()
function to see how many US states are also chemical elements
summary(us %in% chem)
## Mode FALSE TRUE
## logical 5 3
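Because TRUE counts as 1 and FALSE as 0, a logical vector can also be summarized with ordinary arithmetic functions. The sketch below repeats the two vectors so it runs on its own; matches is a name introduced only for this example.

```r
chem <- c("Li","Fl","Ca","Na","Fe","Se","Rb","Ag")
us   <- c("Se","Ak","Ct","Hi","Ks","Mi","Fl","Ca")

matches <- us %in% chem
sum(matches)     # number of TRUE values: 3
mean(matches)    # proportion of TRUE values: 0.375
table(matches)   # counts of FALSE and TRUE
```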
TAKE NOTE |
---|
- In the code above, you may have noticed the # symbol. Text written after this symbol is not run as code; it is treated as a comment, which can be used to describe certain lines of code or simply to block out code. |
- You may have realized by now that R is white-space friendly meaning that you can leave spaces wherever except within names of variables or functions. The important thing is to be consistent. We will expand on style guidelines later. |
- However, R is case-sensitive. This means that variable x is not the same as X . That applies to pretty much everything in R; for example, the function subset() is not the same as Subset() . |
Below are some commonly used data structures in R that have two dimensions.
Matrix matrix()
: a homogeneous collection of values arranged in a two-dimensional rectangular layout
Data frame data.frame()
: we can think of a data frame as a rectangular list made up of vectors of the same length.
These sound very similar, but the main difference is that all values in a matrix have to be of the same type (numeric, character, logical, etc.), whereas the columns of a data frame can each be of a different type. Data frames are typically more common because they are easier to manipulate, but a lot of R functions coerce data frames to matrices since matrices are far more computationally efficient.
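Before moving on to data frames, here is a minimal sketch of matrix() and of the single-type rule. The object names m and mixed are chosen only for this illustration.

```r
# Fill a 2 x 3 matrix column by column (the default)
m <- matrix(1:6, nrow = 2, ncol = 3)
m
dim(m)               # 2 3
class(m[1, 1])       # "integer"

# Mixing types forces every element to one common type
mixed <- matrix(c(1, "a", TRUE, 4), nrow = 2)
class(mixed[1, 1])   # "character"
```

In the second matrix, even the numbers are stored as text, which is the same coercion that happens when a mixed data frame is converted with as.matrix().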
Here is an example of creating a data frame.
✏️NOTE: Creating a data frame is just combining a bunch of equal-length vectors inside data.frame(). The name given to each vector with = becomes the name of its column.
df <- data.frame(
name = c("Brad","Janet","Rocky","Magenta"),
cat_breed = c("Persian","Russian Blue", "Siamese","Ragdoll"),
sex = rep(c("Male","Female"),2), # same as c("Male","Female","Male","Female")
age_yrs = c(1:4) # same as c(1,2,3,4)
)
df
## name cat_breed sex age_yrs
## 1 Brad Persian Male 1
## 2 Janet Russian Blue Female 2
## 3 Rocky Siamese Male 3
## 4 Magenta Ragdoll Female 4
We can turn this data frame into a matrix using as.matrix()
✏️NOTE: remember that matrices in R have to be of a single type. Since the data frame contains character columns, the entire matrix is coerced to character type, including age_yrs.
as.matrix(df)
## name cat_breed sex age_yrs
## [1,] "Brad" "Persian" "Male" "1"
## [2,] "Janet" "Russian Blue" "Female" "2"
## [3,] "Rocky" "Siamese" "Male" "3"
## [4,] "Magenta" "Ragdoll" "Female" "4"
TAKE NOTE |
---|
- Variable and function names should be lowercase. Words should be separated by an underscore (_). |
- Try to avoid using names of existing functions and variables. Doing so will cause confusion for the readers of your code. |
- Variable names should be meaningful. For example, after cleaning a dataset named diseases , you don’t want to rename it Diseases or diseases2 but maybe diseases_clean . Camel case is also known to be harder to read and should be avoided. Ex. DiseasesClean |
Before we begin exploring datasets, let’s go into more detail on how functions work. We have introduced a few functions already such as summary()
and class()
so you may have been picking up on a syntactic pattern, but it’s important to understand the components of a function since you will most likely be creating your own or installing packages with new functions. Inside the parentheses after the function name is where arguments are given. If a function has more than one argument, they are separated by commas. The arguments are run through the body of the function (the code) to carry out a specific task and output a return object.
Here is the syntax of a function.
function_name <- function ( arg1, arg2,... ) {
statement1
statement2
etc..
return( output )
}
In the Help tab of the bottom-right pane, you can see the description and arguments of most functions by running “?” in front of the name of the function. For example,
?class
✏️NOTE: you can see in the Help tab that arguments have names. We can be explicit about which argument our input is assigned to by running class(x = us)
. This is relevant for more complicated functions.
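As a small sketch of why argument names matter, the calls below use the base R function seq(), whose first three arguments are named from, to, and by.

```r
seq(1, 10, 3)                    # positional matching: 1 4 7 10
seq(from = 1, to = 10, by = 3)   # the same call with named arguments
seq(by = 3, to = 10, from = 1)   # named arguments can be given in any order
seq(from = 1, to = 10)           # omitting 'by' gives steps of 1
```

Naming arguments makes a call easier to read and protects against passing values in the wrong order.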
Let’s make a simple function called quadruple()
which multiplies its inputs by four.
quadruple <- function(x) {
y <- x*4
return(y)
}
✏️NOTE: In simple terms, we have a placeholder argument ‘x’ which gets quadrupled in the body of our function. The quadrupled value is assigned to a new variable ‘y’ created inside the function. The return() function hands back the value of y as the output of the function; when we call the function in the console, that output is printed.
Now, we can call the function by assigning a numerical vector to ‘x’
v <- c(3,5.3,6,38,20,11)
quadruple(v)
## [1] 12.0 21.2 24.0 152.0 80.0 44.0
💭EXERCISE: Try creating a function that returns numbers less than 10. Use ‘v’ again to test your function.
It is likely the case that we will need to write our own functions so it is important to practice functional programming. If you are doing something more than once, it belongs in a function. Repeating the same blocks of code may make more sense to new programmers, but carrying out tasks within a function makes your code easier to read, fix, and maintain. To write these functions, you may need to include control statements.
Control statements are expressions used to control the execution and flow of the program based on the conditions provided. We will introduce a few popular control statements: if/else, for loops.
Before jumping into some code writing, it is recommended to map out or create a work flow of what we want our function to do.
Here is an example of an if/else statement. An if statement can be followed by an optional else statement, which is executed when the conditional expression is false.
The syntax of the if/else statement:
if (condition) {
statement1
} else {
statement2
}
Within the parentheses after ‘if’ is where the condition is expressed. If the condition is TRUE, then the body of code ‘statement1’ is run. If the condition is FALSE, then ‘statement2’ is run.
✏️NOTE: once a condition is met, none of the remaining else if’s or else’s will be tested.
Example of if/else statement in a function
We will use a simple example of this control statement used within a function. We want our function to take in two numerical variables and output which variable is greater or if they are equal. Since we have more than one condition, we add a nested condition by replacing ‘else’ with ‘else if’
Remember that we are just creating our user-defined function here so running this alone will not output anything until we ‘call’ it later but we should see compare_num
in our environment now.
compare_num <- function(var1,var2){
if (var2 > var1){
paste(var2, "is greater than", var1)
} else if(var1 == var2){
paste(var1,"and", var2, "are equal")
} else {
paste(var1,"is greater than", var2)}
}
Now that R recognizes the function, we can treat it as any built-in function by calling it.
💭EXERCISE: try assigning different numbers to variables a and b
a <- 100
b <- 81
compare_num(a,b)
## [1] "100 is greater than 81"
Now, let’s look at for loop statements. A loop is a way to repeat a sequence of instructions under certain conditions.
The syntax of the for loop statement:
for (var in sequence)
{
print(statement)
}
In this syntax, ‘var’ is the individual items being iterated through the ‘sequence’ - this can be a collection of objects like a vector, list, etc.
The ‘statement’ here again is the body of code to be run for each item in the sequence.
The ‘var’ here is typically represented as i
for index. It is not a variable we assign a value to but is a placeholder representing the elements of the vector. You may also see x
being used for this.
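It is important to wrap our output in the print()
function inside a for loop, or else nothing will be printed. Note that calling return() inside a loop within a function ends the function at the first iteration, so it cannot be used to show every element.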
Example of a for loop statement in a function
Let’s create a for loop function that takes a numerical vector and outputs the square root of each element.
get_sqrt <- function(vct){
for(i in vct)
print(sqrt(i))
}
x <- c(9,1,3.2,5,79,40)
get_sqrt(x)
## [1] 3
## [1] 1
## [1] 1.788854
## [1] 2.236068
## [1] 8.888194
## [1] 6.324555
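Printing inside a loop only displays the values. If we want to keep them for later use, one common pattern is to store each result in a vector. The sketch below is an alternative to get_sqrt() rather than a replacement for it, and roots is a name introduced only for this example.

```r
x <- c(9, 1, 3.2, 5, 79, 40)

# Pre-allocate a result vector, then fill it one element at a time
roots <- numeric(length(x))
for (i in seq_along(x)) {
  roots[i] <- sqrt(x[i])
}
roots

# sqrt() is vectorized, so the loop is not strictly needed here
identical(roots, sqrt(x))   # TRUE
```

Here i runs over the positions 1 to 6 instead of over the values themselves, which is what lets us write each result into the matching slot of roots.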
TAKE NOTE |
---|
- We have covered creating very simple functions, but before trying to write a function to carry out your task, GOOGLE FIRST. It is very likely that someone else has already asked the same question on Stack Overflow. |
- Remember that R is very literal! When writing functions, it can be easy to create errors by having misplaced brackets or missing commas somewhere. Practicing neat code in the beginning will make it easier to spot these simple mistakes. Follow R’s automatic line spacing. |
- When creating functions, it may be helpful to comment what each argument should be and what the function is doing. |
- Take the time to draw out a flow chart before writing your code. At each step of your algorithm, build upon your function. If at the end you have errors, do the opposite and work your way backwards! |
- Overwhelmed? Don’t worry! Creating user-defined functions can be difficult. The important thing to know is how they work and the concept behind them. |
R comes with pre-loaded data sets that we can grab using the function data()
. You can run the command alone to see these available data sets.
We will load the “iris” data set that includes information on 150 samples of flowers from the iris genus.
✏️NOTE: “iris” may show up in the environment as a <Promise>
but will show as a data frame when used in later code.
data(iris)
We will introduce more functions we can use to explore this data.
We can display the first few rows of our data by using head()
✏️NOTE: the default number of rows to display is 6. You can change the number of rows displayed by setting the ‘n’ argument. For example, head(iris, n = 10)
.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The tail()
function works similarly.
tail(iris,10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 141 6.7 3.1 5.6 2.4 virginica
## 142 6.9 3.1 5.1 2.3 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
Printing the column names of the dataset.
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
Since data frames can have variables with different data types, we can use str()
to check all of them at once.
In the first line of this output, it shows the class and dimensions of our input object. Dimensions of an object can also be found by running dim()
.
After each extraction operator $
, we can see the column (variable) name, its class, and the first few values in the column.
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
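The dimensions and column classes reported by str() can also be pulled out one at a time. A minimal sketch using base R functions (sapply() has not been introduced above; it applies a function to every column):

```r
data(iris)

dim(iris)            # rows then columns: 150 5
nrow(iris)           # 150
ncol(iris)           # 5
sapply(iris, class)  # class of each column as a named character vector
```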
Here, we are using the extraction operator $
to extract the “Species” variable.
head(iris$Species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
You can see that “Species” is a factor type variable. Factors are pre-defined values stored as ordered or unordered levels. The level labels of a factor can look like text or like numbers, but R stores them internally as integer codes. We can see that “Species” is a factor with three levels, shown in the str() output as the codes 1 to 3. Factors are important for statistical modeling and plotting. Other examples of common factor variables:
Gender: Factor w/ 2 levels “Female”, “Male”
Groups: Factor w/ 2 levels 0, 1
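To make the idea of levels and codes concrete, here is a small sketch. The groups vector is made up for illustration.

```r
# A hypothetical treatment variable
groups <- factor(c("control", "treated", "treated", "control"))

levels(groups)         # "control" "treated" (alphabetical by default)
as.integer(groups)     # underlying integer codes: 1 2 2 1
table(groups)          # count per level

levels(iris$Species)   # "setosa" "versicolor" "virginica"
nlevels(iris$Species)  # 3
```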
We have used summary()
on logical objects but it also works on other variable types and datasets.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
We have used the $
to extract individual columns but in some cases, we want to extract more than one column.
Here are some ways we can subset the iris data if we just want the petal measurements.
✏️NOTE: When we use the indexing brackets [] on multi-dimensional arrays, you have to specify rows and columns to subset using a comma. The syntax is [rows,columns].
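To make the [rows, columns] syntax in the note concrete, here is a short sketch that selects by position. Leaving one side of the comma empty means "keep everything" on that side.

```r
iris[1:3, ]                      # first three rows, all columns
iris[c(1, 51, 101), "Species"]   # rows 1, 51 and 101 of one column
iris[2, 3]                       # single value: row 2, column 3 (Petal.Length): 1.4
dim(iris[1:3, 1:2])              # 3 2
```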
💭EXERCISE: subset the iris data set by sepal length for just versicolor species
iris[, c("Petal.Length","Petal.Width")] # We can specify the columns we want to subset using a chr vec
iris[, c(3:4)] # subset by column position
iris[, c(-1:-2,-5)] # subset by removing unwanted columns
iris[, colnames(iris) %in% c("Petal.Length","Petal.Width")] # subset by a condition
All of these get the same result!
Petal.Length | Petal.Width |
---|---|
1.4 | 0.2 |
1.4 | 0.2 |
1.3 | 0.2 |
1.5 | 0.2 |
1.4 | 0.2 |
1.7 | 0.4 |
1.4 | 0.3 |
1.5 | 0.2 |
1.4 | 0.2 |
1.5 | 0.1 |
1.5 | 0.2 |
1.6 | 0.2 |
1.4 | 0.1 |
1.1 | 0.1 |
1.2 | 0.2 |
1.5 | 0.4 |
1.3 | 0.4 |
1.4 | 0.3 |
1.7 | 0.3 |
1.5 | 0.3 |
1.7 | 0.2 |
1.5 | 0.4 |
1.0 | 0.2 |
1.7 | 0.5 |
1.9 | 0.2 |
1.6 | 0.2 |
1.6 | 0.4 |
1.5 | 0.2 |
1.4 | 0.2 |
1.6 | 0.2 |
1.6 | 0.2 |
1.5 | 0.4 |
1.5 | 0.1 |
1.4 | 0.2 |
1.5 | 0.2 |
1.2 | 0.2 |
1.3 | 0.2 |
1.4 | 0.1 |
1.3 | 0.2 |
1.5 | 0.2 |
1.3 | 0.3 |
1.3 | 0.3 |
1.3 | 0.2 |
1.6 | 0.6 |
1.9 | 0.4 |
1.4 | 0.3 |
1.6 | 0.2 |
1.4 | 0.2 |
1.5 | 0.2 |
1.4 | 0.2 |
4.7 | 1.4 |
4.5 | 1.5 |
4.9 | 1.5 |
4.0 | 1.3 |
4.6 | 1.5 |
4.5 | 1.3 |
4.7 | 1.6 |
3.3 | 1.0 |
4.6 | 1.3 |
3.9 | 1.4 |
3.5 | 1.0 |
4.2 | 1.5 |
4.0 | 1.0 |
4.7 | 1.4 |
3.6 | 1.3 |
4.4 | 1.4 |
4.5 | 1.5 |
4.1 | 1.0 |
4.5 | 1.5 |
3.9 | 1.1 |
4.8 | 1.8 |
4.0 | 1.3 |
4.9 | 1.5 |
4.7 | 1.2 |
4.3 | 1.3 |
4.4 | 1.4 |
4.8 | 1.4 |
5.0 | 1.7 |
4.5 | 1.5 |
3.5 | 1.0 |
3.8 | 1.1 |
3.7 | 1.0 |
3.9 | 1.2 |
5.1 | 1.6 |
4.5 | 1.5 |
4.5 | 1.6 |
4.7 | 1.5 |
4.4 | 1.3 |
4.1 | 1.3 |
4.0 | 1.3 |
4.4 | 1.2 |
4.6 | 1.4 |
4.0 | 1.2 |
3.3 | 1.0 |
4.2 | 1.3 |
4.2 | 1.2 |
4.2 | 1.3 |
4.3 | 1.3 |
3.0 | 1.1 |
4.1 | 1.3 |
6.0 | 2.5 |
5.1 | 1.9 |
5.9 | 2.1 |
5.6 | 1.8 |
5.8 | 2.2 |
6.6 | 2.1 |
4.5 | 1.7 |
6.3 | 1.8 |
5.8 | 1.8 |
6.1 | 2.5 |
5.1 | 2.0 |
5.3 | 1.9 |
5.5 | 2.1 |
5.0 | 2.0 |
5.1 | 2.4 |
5.3 | 2.3 |
5.5 | 1.8 |
6.7 | 2.2 |
6.9 | 2.3 |
5.0 | 1.5 |
5.7 | 2.3 |
4.9 | 2.0 |
6.7 | 2.0 |
4.9 | 1.8 |
5.7 | 2.1 |
6.0 | 1.8 |
4.8 | 1.8 |
4.9 | 1.8 |
5.6 | 2.1 |
5.8 | 1.6 |
6.1 | 1.9 |
6.4 | 2.0 |
5.6 | 2.2 |
5.1 | 1.5 |
5.6 | 1.4 |
6.1 | 2.3 |
5.6 | 2.4 |
5.5 | 1.8 |
4.8 | 1.8 |
5.4 | 2.1 |
5.6 | 2.4 |
5.1 | 2.3 |
5.1 | 1.9 |
5.9 | 2.3 |
5.7 | 2.5 |
5.2 | 2.3 |
5.0 | 1.9 |
5.2 | 2.0 |
5.4 | 2.3 |
5.1 | 1.8 |
💭EXERCISE: Try to subset the iris dataset so that it is only showing measurements for versicolor species.
There are hundreds of books written about data visualization in R alone, but we will quickly create some plots to show R’s ability to create high-quality graphics.
Using plot()
on a data frame will create a scatterplot matrix showing the relationships between each pair of variables in the data frame.
plot(iris)
The plot()
function has many arguments to customize our graphs.
Main Base R Plot Parameters:
x
: x coordinates of points in the plot
y
: y coordinates of points in the plot
type
: type of plot to be drawn. Ex. “p” for points,“l” for lines,…more here
main
: overall title for the plot
xlab
: x axis label
ylab
: y axis label
pch
: shape of points. Ex. 0 for open squares, 17 closed triangles, more here
col
: color of points or lines. Run colors()
for predefined colors in R. R also recognizes HEX and RGB values. Ex. “white” or “#FFFFFF”. We can also color points or lines by factor.
las
: axis label style. The default (0) is always parallel to the axis; 1 = always horizontal, 2 = always perpendicular to the axis, and 3 = always vertical.
bty
: box type. The default draws a rectangle around the plot; “n” draws nothing around the plot.
cex
: the amount by which plotting text and symbols are scaled relative to the default size
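As a rough sketch of how several of these parameters combine in one call, the example below plots the petal measurements. The title, color, and sizes are arbitrary choices made for illustration.

```r
plot(x = iris$Petal.Length, y = iris$Petal.Width,
     main = "Iris Petal Measurements",
     xlab = "Petal Length",
     ylab = "Petal Width",
     pch = 17,           # closed triangles
     col = "darkgreen",  # one of the names listed by colors()
     las = 1,            # horizontal axis labels
     bty = "n",          # no box around the plot
     cex = 0.8)          # symbols drawn at 80% of the default size
```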
Let’s plot sepal length against sepal width.
plot(x = iris$Sepal.Length, y = iris$Sepal.Width,
type = "p",
main = "Sepal Flower Measurements of Iris Species",
xlab = "Sepal Length",
ylab = "Sepal Width")
This graph does what we want it to, but we can include more information and make this look more interesting.
plot(x = iris$Sepal.Length, y = iris$Sepal.Width,
type = "p",
main = "Sepal Flower Measurements of Iris Species",
xlab = "Sepal Length",
ylab = "Sepal Width",
pch = 20,
col = c("#332288","#AA4499","#88CCEE")[iris$Species])
legend(x = "topright",
legend = levels(iris$Species),
col = c("#332288","#AA4499","#88CCEE"),
pch = 20)
In the new code above, we assign a color to each level of ‘Species’ and include a legend to match colors to iris species. These colors were chosen because they are colorblind accessible. You can learn more data visualization tips from USF’s Genomics Seminar by Claus Wilke, Uppin your dataviz game. This recording is only available to USF students and faculty, but his book Fundamentals of Data Visualization can be accessed by everyone.
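The coloring works because a factor used inside brackets is treated as its integer codes, so each flower picks the color at the position of its species level. A minimal sketch (species_cols and point_cols are names introduced only for this example):

```r
species_cols <- c("#332288", "#AA4499", "#88CCEE")

as.integer(iris$Species)[c(1, 51, 101)]   # integer codes: 1 2 3
point_cols <- species_cols[iris$Species]  # one color per flower
length(point_cols)                        # 150
point_cols[c(1, 51, 101)]                 # "#332288" "#AA4499" "#88CCEE"
```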
Below are some other types of graphs R can plot.
Histogram of Iris Flower Sepal Length
hist(iris$Sepal.Length,
main = "Histogram of Iris Sepal Length",
xlab = "Sepal Length",
col = "purple" )
Heat map of Iris Data
dist()
is used to calculate the pairwise distances (dissimilarity) between the flowers in the iris data
heatmap(as.matrix(dist(iris[, 1:4])))
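To see what dist() returns before it is passed to heatmap(), here is a sketch on three made-up points where the distances are easy to check by hand.

```r
# Three hypothetical points in two dimensions
pts <- rbind(p1 = c(0, 0), p2 = c(3, 4), p3 = c(6, 8))

d <- dist(pts)             # Euclidean distance by default
as.matrix(d)               # symmetric matrix with zeros on the diagonal
as.matrix(d)["p1", "p2"]   # 5
as.matrix(d)["p1", "p3"]   # 10
```

Small values mean two rows are similar and large values mean they are far apart, which is what the heat map colors encode.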
Boxplot of Iris Data
boxplot(iris[,1:4],
col = c(rep("pink",2),rep("lightblue",2)),
main = "Boxplot of Iris Data")
💭EXERCISE: Use the built-in ‘pressure’ dataset to create a line graph of temperature against pressure.
TAKE NOTE |
---|
- We introduced a few functions to explore and clean our data, but unfortunately we cannot cover everything. A big part of programming is learning how to problem solve: how to search online and how to refer to the manual. |
- When producing graphs, first consider the best way to convey the information: a line graph, a bar graph, etc. It is important to invest sufficient time and effort in the process. |
- Make sure your graph communicates the information well. Omit needless graphical elements and use large font sizes. |
Now we can move forward with using our own data, exploring new packages, and utilizing other R tools.
Before loading our own data, we have to know what working directories are. The working directory is the location on your computer where R can read and save files. You can only have one working directory at a time.
Run getwd()
to print your working directory
getwd()
If you are using a Mac computer, your working directory may look like “/Users/path/to/my/directory/” but if you are on Windows, then your output may look like “c:/path/to/my/directory/”
If the files we need for analysis are in a different directory, then we can use setwd()
to change it.
First, let’s create a folder named “R_exercise” in our “Documents” and change our working directory to it.
✏️NOTE: check for correct back/forward slashes and upper/lower cases if you run into errors.
setwd("/Users/username/Documents/R_exercise")
You can also change your working directory by going to ‘Global Options…’ in the Tools tab, but it is important to know how file paths work and how to write them.
Run getwd()
to print your working directory. Check that you are in the right directory.
getwd()
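File paths can also be built and checked from within R. The sketch below is an assumption-laden illustration: it uses a temporary directory instead of your real “Documents” folder so that it can run anywhere, and exercise_dir, old_wd, and current are names made up for the example.

```r
# Build a path in a platform-independent way (temporary location for illustration)
exercise_dir <- file.path(tempdir(), "R_exercise")

# Create the folder only if it does not exist yet
if (!dir.exists(exercise_dir)) {
  dir.create(exercise_dir)
}

old_wd <- setwd(exercise_dir)   # setwd() invisibly returns the previous directory
current <- basename(getwd())    # last part of the current path
current

setwd(old_wd)                   # go back to where we started
```

Using file.path() avoids having to worry about forward versus backward slashes on different operating systems.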
We will take available stroke data from Kaggle provided by fedesoriano. Click here to download the data set. Unzip the file and move the .csv to your “R_exercise” folder.
We can run the list.files()
function to list all files in the directory whose path is given as a string. We can input getwd()
as a shortcut. We should see the file we just downloaded and transferred.
list.files(getwd())
## [1] "kaggle_stroke_data.csv"
To load csv files into our environment, we can use the read.csv()
function. Assign the data to ‘stroke’
✏️NOTE: Since our working directory contains the file we want to load, we can just specify the file name. If not, then the entire file path must be specified.
✏️NOTE: Remember that files are not always ideal for loading within R. Check the manual by running ?read.csv
and examine the arguments. Sometimes, you might need to change an argument setting.
stroke <- read.csv(file = "kaggle_stroke_data.csv")
After doing some analysis or data cleaning, we can save our results into a .csv file.
Let’s clean our stroke data set so that there are no missing data. We can typically check for missing values by running is.na()
on our data, but here it returns a logical matrix of all FALSE values. However, when we look at the ‘bmi’ column, we can see that unavailable data is recorded as the string “N/A”. Neither this string nor an empty string is recognized as missing by the is.na()
function, so it is important to examine our data carefully.
💭EXERCISE: What happens when we remove the ‘!’ operator in this code?
stroke_clean <- stroke[!stroke$bmi == "N/A",]
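An alternative approach, sketched here on a tiny made-up table rather than the actual stroke file, is to tell read.csv() which strings mean "missing" through its na.strings argument. The objects toy, toy_na, and toy_clean exist only for this illustration.

```r
# A three-row stand-in for a file that marks missing values as "N/A"
csv_text <- "id,bmi\n1,36.6\n2,N/A\n3,28.1"

toy <- read.csv(text = csv_text)
sum(is.na(toy$bmi))       # 0: the string "N/A" is not treated as missing

toy_na <- read.csv(text = csv_text, na.strings = "N/A")
sum(is.na(toy_na$bmi))    # 1: now recognized as NA
is.numeric(toy_na$bmi)    # TRUE: the column is read as numbers

toy_clean <- toy_na[!is.na(toy_na$bmi), ]
nrow(toy_clean)           # 2
```

Reading the values as NA has the side benefit that the column stays numeric instead of being read as text.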
Now that our data is clean, we want to save it. To export our data as a .csv, we can use write.csv()
. We will annotate each argument of the function with a comment.
write.csv(x = stroke_clean, # the data set we want to save
file = "kaggle_stroke_data(n=455).csv", # the filename we want to save as
row.names = FALSE # our dataset technically has row names (1-455) that we do not want in the file, so we set this argument to FALSE
)
The process works the same for different file types. For example, you can use read.table()
for .txt or tab-delimited text files. You can check this DataCamp tutorial for more examples of importing different files into R.
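As a minimal sketch of the tab-delimited case, the example below writes a small made-up data frame to a temporary .txt file and reads it back. The objects tmp, cats, and cats_in are hypothetical names used only here.

```r
tmp <- tempfile(fileext = ".txt")   # temporary file, removed when R closes
cats <- data.frame(name = c("Brad", "Janet"), age_yrs = c(1, 2))

# Write with tabs between columns
write.table(cats, file = tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back, telling R that the first line holds the column names
cats_in <- read.table(tmp, header = TRUE, sep = "\t")
cats_in
```

Unlike read.csv(), read.table() does not assume a header row or a separator, so both are stated explicitly.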
If you checked out that DataCamp tutorial or have already done some troubleshooting via Google then you most likely have been introduced to a new package. So far, we have only been using base R and built-in R functions, but many other useful R functions come in packages. These are free libraries of code written by R’s active user community.
1. Most packages can be installed from the official repository, **CRAN**, the Comprehensive R Archive Network. Packages submitted to CRAN are subject to testing, policies, and legal requirements. CRAN packages can be installed using the base R function `install.packages()`, and ALL packages can be loaded using `library()`.
✏️NOTE: Once you install a package, you don’t have to install it again unless it needs to be updated. You can check if a package needs updating in the lower right panel in RStudio next to the Files/Plots/Help tabs. You will need to reload packages in new R sessions.
install.packages("packagename")
library(packagename)
2. Authors also host packages on **GitHub**, where users can report bugs or suggest new features. To download R packages from GitHub, the 'devtools' package must be installed first.
install.packages("devtools")
devtools::install_github("username/packagename")
✏️NOTE: The ‘::’ operator after a package name accesses a function from that specific package. If you did not attach the package by running library(devtools)
, then using ‘::’ still lets you call that one function, because R loads the package in the background without attaching it. This is useful when several loaded packages contain functions with the same name. If you are using functions from a package more than once, it is usually more convenient to load it using the library function.
3. And if you are a bioinformatics researcher, then you will definitely be downloading from **Bioconductor**, a free, open source and open development software project for bioinformatics. Installing packages from Bioconductor is similar to downloading GitHub packages.
install.packages("BiocManager")
BiocManager::install("packagename")
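To try the ‘::’ operator without installing anything, the sketch below uses the stats package, which ships with R. The pattern in the second half, checking with requireNamespace() before installing, is a common convention rather than a requirement, and pkg_available is a name chosen for the example.

```r
# Call a function through its package name
stats::median(c(1, 3, 10))   # 3

# Check whether a package is installed before trying to use it
pkg_available <- requireNamespace("stats", quietly = TRUE)
pkg_available                # TRUE, since stats comes with R

# The same check can guard an installation step, for example:
# if (!requireNamespace("packagename", quietly = TRUE)) {
#   install.packages("packagename")
# }
```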
All packages include documentation. Without it, users would not know how to use the package. We have been able to access object documentation using ?
for individual functions, but it is only helpful if you know the function name and want to know what it does. It does not help you find functions for new problems.
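When you do not know the exact function name, base R offers a couple of search tools. This is only a sketch of two of them; hits is a name introduced for the example, and the search is restricted to the stats package to keep it quick.

```r
# List objects on the search path whose names start with "read."
apropos("^read\\.")

# Search the installed help pages for a keyword
# (typing ??median in the console runs a similar search across all packages)
hits <- help.search("median", package = "stats")
class(hits)   # "hsearch"
```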
When googling solutions to new problems, we might run into a package that looks useful but we need to learn more about it. For CRAN packages, you can find the CRAN page of the package by searching it by name at cran.r-project.org. From there, you will be able to access the reference manual. This manual provides object documentation for all the functions in the packages. The following is an example using ggplot2, a popular package for creating visualizations