Introduction to R

If you are a biologist, chances are good you have heard about the programming language R, used for statistical computing and graphics. It is popular because it is free, open-source software with hundreds of packages available to aid analyses of biological data. We know that programming can be very intimidating at first, but learning R is like learning any other language: it takes time. The USF OmicsHub provides this introductory course to help researchers like you start your programming journey. You will not become an expert after this course, but you will have the foundations to continue learning R confidently.

R vs. RStudio


“R” is a programming language and free software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

The one downside of R is that its default interface is not very user-friendly, so RStudio was developed as an Integrated Development Environment (IDE) for R to provide further functionality. It includes a console, a syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging, and managing your workspace.

R Installation Instructions


To install R, go to the CRAN R project website and follow the links for your operating system. R must be installed for RStudio to run. Download RStudio here.

We have created step-by-step instructions here for this process!

RStudio User-Interface


Let’s get familiar with RStudio.

When you open RStudio, it may look something like this. There are four general windows highlighted above:

  1. Console: where you can type commands and see outputs. The console is all you would see if you ran R in the command line without RStudio.

    • In the image above, you can see a “>” in this window. This is the prompt, where you can run commands one at a time by pressing the return key. If a plus sign appears in the console, R is waiting for you to finish an incomplete command. You can press the escape key to return to the prompt.

    • You can also clear the console by clicking the faint broom icon in the top-right corner of this window.

  2. Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console.

    • In the image above, you can see that a script titled “Untitled1” is open but empty, starting at line 1. There are different types of R files, such as R Markdown files (.Rmd), but we will start with R scripts (.R) like this one. You can also see the “Run” button in the top right of this window. The first run action runs the line of code where your blinking text cursor is located. You may also select and highlight the code you want to run before clicking this option. The second run option runs the entire script.

    • In the top-left corner of this window, you can see a floppy disk image that lets you save this script.

  3. Environment/History: the Environment tab shows all active objects, and the History tab keeps track of the commands you have run.

  4. Files/Plots/Packages/Help: is primarily used for displaying graphs and for using the help system but it also shows the folders on your local computer.

    • You can click through the sub-tabs of this window. The help tab includes other R resources and manuals.

The Basics - pt.1

Now that we know our way around RStudio, we can begin running some code. We can start by practicing within the console, but you can save these commands in a script as notes for later.

R Syntax


Creating new variables

R is an object-oriented language. Objects are entities R operates on. These can be individual values, data sets, statistical outputs, or specialized functions. If it is something to which you can assign a name, it is an object.

Let’s try creating a variable object using the assignment operator <-

a <- 10
b <- 20
c <- a + b

These are numeric objects. You can probably guess what the value of c is! R can handle most simple types of math operators. Here are some more:

Arithmetic Operators

+ : Addition

- : Subtraction

* : Multiplication

/ : Division

^ or ** : Exponentiation

We can run these variables alone to print their value. We can also see them in our environment.

c
## [1] 30

Objects or functions defined will show up here.
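
As a quick check of the arithmetic operators listed above, here is a minimal sketch. The names x and y are placeholders chosen for this example and are not part of the course material.

```r
# Hypothetical values used only for illustration
x <- 7
y <- 2

x + y   # addition
x - y   # subtraction
x * y   # multiplication
x / y   # division
x ^ y   # exponentiation (x ** y gives the same result)
```

Typing each line in the console prints its result, and x and y appear in the environment once they are assigned.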

Now let’s create a vector or one-dimensional array using the concatenate function c()

a <- c(1,2,5.3,6,-2,4/9) # numeric vector 
b <- c("one","two","three") # character vector 
c <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE) # logical vector 

There are three basic examples of vectors shown above: numeric, character, and logical.

Numeric: this data type is straightforward. Vector a includes examples of what can be considered numeric data.

Character: this vector is made up of strings - these are any values written within a pair of single or double quotes. For example, "1" would be considered a character value within R even though we know 1 is a number. We can check what type an R object is by using the class() command. For example, try running class(a).

class(a)
## [1] "numeric"
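
To see the difference between a number and a string that looks like a number, class() can be run on both. This is a small sketch; the names num_value and chr_value are made up for this example.

```r
num_value <- 1      # a number
chr_value <- "1"    # the same digit in quotes is a character string

class(num_value)
class(chr_value)

# as.numeric() converts the string back into a number
class(as.numeric(chr_value))
```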

Logical: these vectors are made up of TRUE/FALSE and are typically a way to index another vector which will return values for which the logical vector is TRUE.

Let’s try indexing or subsetting a where c is TRUE using the bracket operators []

a[c]
## [1]  1.0  2.0  5.3 -2.0

This is one way to subset data, but logical vectors are typically the output of comparison and logical operators such as the following:

==: equal

!=: not equal

< : less than

<= : less than or equal

> : greater than

>=: greater than or equal

| : or

! : not

%in% : in the set
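
Here is a short sketch of how these operators produce logical vectors that can then be used for subsetting. The vector temps is a hypothetical set of values invented for illustration.

```r
temps <- c(12, 25, 31, 18, 40)

temps > 20                        # one TRUE/FALSE per element
temps[temps > 20]                 # keep only the values above 20
temps[temps < 15 | temps > 35]    # "or": keep values outside the 15-35 range
temps[!(temps > 20)]              # "not": flips TRUE and FALSE
```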

Let’s use character vectors of chemical elements and US states to practice using the %in% operator.

chem <- c("Li","Fl","Ca","Na","Fe","Se","Rb","Ag")
us <- c("Se","Ak","Ct","Hi","Ks","Mi","Fl","Ca")

The code below subsets the character vector of US states to those that are also chemical elements.

us[us %in% chem]
## [1] "Se" "Fl" "Ca"

💭EXERCISE: try to subset ‘us’ where states are not also chemical elements.

Typically we deal with larger data and sometimes just want a summary of our data.

We can use the summary() function to see how many US states are also chemical elements.

summary(us %in% chem)
##    Mode   FALSE    TRUE 
## logical       5       3
TAKE NOTE
- In the code above, you may have noticed the # symbol. Text written after this symbol is not run as code and is treated as a comment, which can be used to describe certain lines of code or simply to block out code.
- You may have realized by now that R is white-space friendly meaning that you can leave spaces wherever except within names of variables or functions. The important thing is to be consistent. We will expand on style guidelines later.
- However, R is case-sensitive. This means that variable x is not the same as X. That applies to pretty much everything in R; for example, the function subset() is not the same as Subset().

More data types…

Below are some commonly used data types in R that have more than one dimension.

Matrix matrix(): a homogeneous collection of values arranged in a two-dimensional rectangular layout

Data frame data.frame(): we can think of a data frame as a rectangular list made up of vectors of the same length.

These sound very similar, but the main difference is that all values in a matrix have to be of the same type (numeric, character, logical, etc.), whereas the columns of a data frame can be of different types. Data frames are typically more common because they are easier to manipulate, but a lot of R functions coerce data frames to matrices since matrices are far more computationally efficient.
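
Before moving on to data frames, here is a minimal sketch of building a matrix with matrix(). The object name m and its values are placeholders for this example.

```r
# Fill a 2 x 3 matrix, column by column, with the numbers 1 to 6
m <- matrix(1:6, nrow = 2, ncol = 3)
m

dim(m)          # number of rows, then number of columns
m[2, 3]         # the value in row 2, column 3
class(m[1, 1])  # every element of a matrix shares one type
```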

Here is an example of creating a data frame.

✏️NOTE: Creating a data frame is just combining a bunch of vectors of the same length inside data.frame(). Each vector is given a name with =, and that name becomes the name of the column.

  • Remember that R is white-space friendly, but there are still guidelines when it comes to organizing our code. Here, the arguments of the function are placed on separate lines, neatly indented and spaced.
df <- data.frame(
  name = c("Brad","Janet","Rocky","Magenta"),
  cat_breed = c("Persian","Russian Blue", "Siamese","Ragdoll"),
  sex = rep(c("Male","Female"),2), # same as c("Male","Female","Male","Female")
  age_yrs = c(1:4) # same as c(1,2,3,4)
)

df
##      name    cat_breed    sex age_yrs
## 1    Brad      Persian   Male       1
## 2   Janet Russian Blue Female       2
## 3   Rocky      Siamese   Male       3
## 4 Magenta      Ragdoll Female       4

We can turn this data frame into a matrix using as.matrix().

✏️NOTE: remember that matrices in R have to be of the same type. Since the data frame contains character columns, the entire matrix is coerced to character type, including age_yrs.

as.matrix(df)
##      name      cat_breed      sex      age_yrs
## [1,] "Brad"    "Persian"      "Male"   "1"    
## [2,] "Janet"   "Russian Blue" "Female" "2"    
## [3,] "Rocky"   "Siamese"      "Male"   "3"    
## [4,] "Magenta" "Ragdoll"      "Female" "4"
TAKE NOTE
- Variable and function names should be lowercase. Words should be separated by an underscore (_).
- Try to avoid using names of existing functions and variables. Doing so will cause confusion for the readers of your code.
- Variable names should be meaningful. For example, after cleaning a dataset named diseases, you don’t want to rename it Diseases or diseases2 but maybe diseases_clean. Camel case is also known to be harder to read and should be avoided. Ex. DiseasesClean

How R Functions Work


Parts of a Function

Before we begin exploring datasets, let’s go into more detail on how functions work. We have introduced a few functions already, such as summary() and class(), so you may have been picking up on a syntactic pattern, but it’s important to understand the components of a function since you will most likely be creating your own or installing packages with new functions. Arguments go inside the parentheses that follow the function name. If a function has more than one argument, they are separated by commas. The arguments are run through the body of the function (the code) to carry out a specific task and output a return object.

Here is the syntax of a function.

function_name <- function(arg1, arg2, ...) {
  statement1
  statement2
  etc..
  return(output)
}

In the Help tab of the bottom-right pane, you can see the description and arguments of most functions by running “?” in front of the name of the function. For example,

?class

✏️NOTE: you can see in the Help tab that arguments have names. We can be explicit by assigning our input to an argument by name, for example class(x = us). This is relevant for more complicated functions.
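
Here is a sketch of the difference between matching arguments by position and by name, using the base R function seq(). Its from, to, and by arguments are described under ?seq.

```r
# Arguments matched by position
seq(1, 10, 3)

# The same call with the arguments named explicitly
seq(from = 1, to = 10, by = 3)

# Named arguments can be given in any order
seq(by = 3, to = 10, from = 1)
```

All three calls produce the same sequence; naming the arguments mainly makes the call easier to read.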

Let’s make a simple function called quadruple() which multiplies its inputs by four.

quadruple <- function(x) {
  y <- x*4
  return(y)
} 

✏️NOTE: In simple terms, we have a placeholder argument ‘x’ which gets quadrupled in the body of the function. The quadrupled number is assigned to a new variable ‘y’ created inside the function. The return() function hands the value of ‘y’ back to wherever the function was called; when the result is not assigned to an object, R prints it in the console.

Now, we can call the function by assigning a numerical vector to ‘x’

v <- c(3,5.3,6,38,20,11)
quadruple(v)
## [1]  12.0  21.2  24.0 152.0  80.0  44.0
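
Functions can take more than one argument, and an argument can be given a default value. The function multiply_by() below is a hypothetical extension of quadruple() written only for this example.

```r
# x: numeric vector; times: number to multiply by (defaults to 4)
multiply_by <- function(x, times = 4) {
  result <- x * times
  return(result)
}

multiply_by(c(1, 2, 3))              # uses the default, times = 4
multiply_by(c(1, 2, 3), times = 10)  # overrides the default

# The returned value can be stored in a new object instead of printed
stored <- multiply_by(5)
```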

💭EXERCISE: Try creating a function that returns numbers less than 10. Use ‘v’ again to test your function.

Writing functions using control statements

It is likely the case that we will need to write our own functions so it is important to practice functional programming. If you are doing something more than once, it belongs in a function. Repeating the same blocks of code may make more sense to new programmers, but carrying out tasks within a function makes your code easier to read, fix, and maintain. To write these functions, you may need to include control statements.

Control statements are expressions used to control the execution and flow of the program based on the conditions provided. We will introduce two popular control statements: if/else statements and for loops.

Before jumping into some code writing, it is recommended to map out or create a work flow of what we want our function to do.

Here is an example of an if/else statement. An if statement can be followed by an optional else statement, which is executed when the conditional expression is false.

The syntax of the if/else statement:

if (condition) {
  statement1
} else {
  statement2
}

Within the parentheses after ‘if’ is where the condition is expressed. If our input meets this condition, then the body of code ‘statement1’ is run. If the condition is not met for our input, then ‘statement2’ is run.

✏️NOTE: once a condition is met, none of the remaining ‘else if’ or ‘else’ branches are tested.
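
To illustrate that only the first matching branch runs, here is a small sketch. The variable score and the cut-offs are invented for this example.

```r
score <- 95

if (score >= 90) {
  grade <- "A"   # this branch runs, so the rest are skipped
} else if (score >= 80) {
  grade <- "B"   # also true for 95, but never tested
} else {
  grade <- "C"
}

grade
```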

 

Example of if/else statement in a function

We will use a simple example of this control statement within a function. We want our function to take in two numerical variables and output which variable is greater or whether they are equal. Since we have more than one condition, we add a second condition by inserting an ‘else if’ branch before the final ‘else’.

Remember that we are just creating our user-defined function here so running this alone will not output anything until we ‘call’ it later but we should see compare_num in our environment now.

compare_num <- function(var1, var2) {
  if (var2 > var1) {
    paste(var2, "is greater than", var1)
  } else if (var1 == var2) {
    paste(var1, "and", var2, "are equal")
  } else {
    paste(var1, "is greater than", var2)
  }
}

Now that R recognizes the function, we can treat it as any built-in function by calling it.

💭EXERCISE: try assigning different numbers to variables a and b

a <- 100
b <- 81

compare_num(a,b)
## [1] "100 is greater than 81"

Now, let’s look at for loop statements. A loop is a way to repeat a sequence of instructions under certain conditions.

The syntax of the for loop statement:

for (var in sequence)
{
print(statement)
}

In this syntax, ‘var’ is the individual items being iterated through the ‘sequence’ - this can be a collection of objects like a vector, list, etc.

The ‘statement’ here again is the body of code to be run for each item in the sequence.

The ‘var’ here is typically represented as i for index. It is not a variable we assign a value to but is a placeholder representing the elements of the vector. You may also see x being used for this.

Inside a for loop, results are not printed automatically, so it is important to wrap our output in print() or else nothing will be displayed. Note that calling return() inside a loop ends the function at the first iteration.
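
Here is a minimal sketch of a for loop on its own, outside of a function. The vector genes and its contents are placeholders for this example.

```r
genes <- c("BRCA1", "TP53", "EGFR")

for (gene in genes) {
  # Without print(), nothing would be shown for each iteration
  print(paste("Processing", gene))
}
```

Here ‘gene’ plays the role of ‘var’ and takes each value of ‘genes’ in turn.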

    

Example of a for loop statement in a function

Let’s create a for loop function that takes a numerical vector and outputs the square root of each element.

get_sqrt <- function(vct) {
  for (i in vct) {
    print(sqrt(i))
  }
}
x <- c(9,1,3.2,5,79,40)

get_sqrt(x)
## [1] 3
## [1] 1
## [1] 1.788854
## [1] 2.236068
## [1] 8.888194
## [1] 6.324555
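
Printing inside a loop displays each value but does not keep it. One possible way to collect the results instead is sketched below; collect_sqrt() is a hypothetical name used only here. Because sqrt() is vectorized, calling it directly on a whole vector gives the same values without a loop.

```r
collect_sqrt <- function(vct) {
  out <- numeric(length(vct))    # pre-allocate the result vector
  for (i in seq_along(vct)) {    # i is the position, vct[i] the value
    out[i] <- sqrt(vct[i])
  }
  return(out)
}

roots <- collect_sqrt(c(9, 1, 16))
roots
```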
TAKE NOTE
- We have covered creating very simple functions but before trying to write a function to carry out your task, GOOGLE FIRST. It is very likely that someone else has already asked the same question on Stack Overflow.
- Remember that R is very literal! When writing functions, it can be easy to create errors by having misplaced brackets or missing commas somewhere. Practicing neat code in the beginning will make it easier to spot these simple mistakes. Follow R’s automatic line spacing.
- When creating functions, it may be helpful to comment what each argument should be and what the function is doing.
- Take the time to draw out a flow chart before writing your code. At each step of your algorithm, build upon your function. If at the end you have errors, do the opposite and work your way backwards!
- Overwhelmed? Don’t worry! Creating user-defined functions can be difficult. The important thing to know is how they work and the concept behind them.

Examining our Data


R comes with pre-loaded data sets that we can grab using the function data(). You can run the command alone to see these available data sets.

We will load the “iris” data set that includes information on 150 samples of flowers from the iris genus.

✏️NOTE: “iris” may show up in the environment as a <Promise> but will show as a data frame when used in later code.

data(iris)

We will introduce more functions we can use to explore this data.

Extracting simple stats

We can display the first few rows of our data by using head()

✏️NOTE: the default for the number of rows to display is 6. You can change the number of rows displayed by assigning the ‘n’ argument. For example, head(iris, n = 10).

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The tail() function works similarly.

tail(iris,10)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 141          6.7         3.1          5.6         2.4 virginica
## 142          6.9         3.1          5.1         2.3 virginica
## 143          5.8         2.7          5.1         1.9 virginica
## 144          6.8         3.2          5.9         2.3 virginica
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

We can print the column names of the data set using colnames().

colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Since data frames can have variables with different data types, we can use str() to check all of them at once.

  • In the first line of this output, it shows the class and dimensions of our input object. Dimensions of an object can also be found by running dim().

  • After each extraction operator $, we can see the column (variable) name, its class, and the first few values in the column.

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
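
The dimensions and column classes reported by str() can also be retrieved on their own. Here is a short sketch using base R functions:

```r
data(iris)

dim(iris)            # number of rows, then number of columns
nrow(iris)           # number of rows (observations)
ncol(iris)           # number of columns (variables)
sapply(iris, class)  # the class of every column at once
```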

Here, we are using the extraction operator $ to extract the “Species” variable.

head(iris$Species)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

You can see that “Species” is a factor variable. Factors hold a pre-defined set of values stored as ordered or unordered levels. The level labels can be text or numbers. Here, “Species” is a factor with three levels; R stores each level internally as an integer code, which is why str() shows 1 1 1 ... after the level names. Factors are important for statistical modeling and plotting. Other examples of common factor variables:

  • Gender: Factor w/ 2 levels “Female”, “Male”

  • Groups: Factor w/ 2 levels 0, 1
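
Here is a sketch of creating a factor by hand with factor(). The vector treatment and its values are hypothetical.

```r
treatment <- factor(c("control", "drug", "drug", "control", "drug"))

levels(treatment)       # the distinct categories, sorted alphabetically
table(treatment)        # how many observations fall in each level
as.integer(treatment)   # the integer codes stored behind the labels
```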

We have used summary() on logical objects but it also works on other variable types and datasets.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

We have used the $ to extract individual columns but in some cases, we want to extract more than one column.

Here are some ways we can subset the iris data if we just want the petal measurements.

✏️NOTE: When we use the indexing brackets [] on multi-dimensional arrays, you have to specify rows and columns to subset using a comma. The syntax is [rows,columns].

💭EXERCISE: subset the iris data set by sepal length for just versicolor species

iris[, c("Petal.Length","Petal.Width")] # We can specify the columns we want to subset using a chr vec
iris[, c(3:4)] # subset by column position 
iris[, c(-1:-2,-5)] # subset by removing unwanted columns 
iris[, colnames(iris) %in% c("Petal.Length","Petal.Width")] # subset by a condition 
All of these get the same result!
Petal.Length Petal.Width
1.4 0.2
1.4 0.2
1.3 0.2
1.5 0.2
1.4 0.2
1.7 0.4
1.4 0.3
1.5 0.2
1.4 0.2
1.5 0.1
1.5 0.2
1.6 0.2
1.4 0.1
1.1 0.1
1.2 0.2
1.5 0.4
1.3 0.4
1.4 0.3
1.7 0.3
1.5 0.3
1.7 0.2
1.5 0.4
1.0 0.2
1.7 0.5
1.9 0.2
1.6 0.2
1.6 0.4
1.5 0.2
1.4 0.2
1.6 0.2
1.6 0.2
1.5 0.4
1.5 0.1
1.4 0.2
1.5 0.2
1.2 0.2
1.3 0.2
1.4 0.1
1.3 0.2
1.5 0.2
1.3 0.3
1.3 0.3
1.3 0.2
1.6 0.6
1.9 0.4
1.4 0.3
1.6 0.2
1.4 0.2
1.5 0.2
1.4 0.2
4.7 1.4
4.5 1.5
4.9 1.5
4.0 1.3
4.6 1.5
4.5 1.3
4.7 1.6
3.3 1.0
4.6 1.3
3.9 1.4
3.5 1.0
4.2 1.5
4.0 1.0
4.7 1.4
3.6 1.3
4.4 1.4
4.5 1.5
4.1 1.0
4.5 1.5
3.9 1.1
4.8 1.8
4.0 1.3
4.9 1.5
4.7 1.2
4.3 1.3
4.4 1.4
4.8 1.4
5.0 1.7
4.5 1.5
3.5 1.0
3.8 1.1
3.7 1.0
3.9 1.2
5.1 1.6
4.5 1.5
4.5 1.6
4.7 1.5
4.4 1.3
4.1 1.3
4.0 1.3
4.4 1.2
4.6 1.4
4.0 1.2
3.3 1.0
4.2 1.3
4.2 1.2
4.2 1.3
4.3 1.3
3.0 1.1
4.1 1.3
6.0 2.5
5.1 1.9
5.9 2.1
5.6 1.8
5.8 2.2
6.6 2.1
4.5 1.7
6.3 1.8
5.8 1.8
6.1 2.5
5.1 2.0
5.3 1.9
5.5 2.1
5.0 2.0
5.1 2.4
5.3 2.3
5.5 1.8
6.7 2.2
6.9 2.3
5.0 1.5
5.7 2.3
4.9 2.0
6.7 2.0
4.9 1.8
5.7 2.1
6.0 1.8
4.8 1.8
4.9 1.8
5.6 2.1
5.8 1.6
6.1 1.9
6.4 2.0
5.6 2.2
5.1 1.5
5.6 1.4
6.1 2.3
5.6 2.4
5.5 1.8
4.8 1.8
5.4 2.1
5.6 2.4
5.1 2.3
5.1 1.9
5.9 2.3
5.7 2.5
5.2 2.3
5.0 1.9
5.2 2.0
5.4 2.3
5.1 1.8
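
The same [rows,columns] syntax also selects rows, either by position or by a logical condition. Here is a sketch; the cut-off of 6.5 is arbitrary and chosen only for illustration.

```r
data(iris)

# Rows 1 to 3, two named columns
iris[1:3, c("Sepal.Length", "Species")]

# Rows where petal length exceeds 6.5, all columns
long_petals <- iris[iris$Petal.Length > 6.5, ]
nrow(long_petals)
```

Leaving the space before the comma empty keeps every row, and leaving the space after it empty keeps every column.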

💭EXERCISE: Try to subset the iris dataset so that it is only showing measurements for versicolor species.

Visualizing our data

There are hundreds of books written about data visualization in R alone, but we will quickly create some plots to show R’s ability to create high-quality graphics.

Using plot() on a data frame will create a scatterplot matrix showing the relationships between each pair of variables in the data frame.

plot(iris)

The plot() function has many arguments to customize our graphs.

Main Base R Plot Parameters:

x : x coordinates of points in the plot

y : y coordinates of points in the plot

type : type of plot to be drawn. Ex. “p” for points,“l” for lines,…more here

main : overall title for the plot

xlab : x axis label

ylab : y axis label

pch : shape of points. Ex. 0 for open squares, 17 closed triangles, more here

col : color of points or lines. Run colors() for predefined colors in R. R also recognizes HEX and RGB values. Ex. “white” or “#FFFFFF”. We can also color points or lines by factor.

las : axis label style. The default, 0, is parallel to the axis. 1 = always horizontal, 2 = always perpendicular to the axis, and 3 = always vertical.

bty : box type. The default draws a rectangle around the plot; “n” draws nothing around the plot.

cex : the amount by which plotting text and symbols are scaled relative to the default
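
Here is a sketch that combines several of the parameters listed above. It uses the built-in cars data set (columns speed and dist) so that it does not overlap with the iris examples; the title and label text are made up.

```r
data(cars)

plot(x = cars$speed, y = cars$dist,
     main = "Stopping Distance by Speed",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     pch = 17,          # closed triangles
     col = "#332288",   # a HEX color
     las = 1,           # horizontal axis labels
     bty = "n",         # no box around the plot
     cex = 1.2)         # points drawn 20% larger than the default
```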

Let’s plot sepal length against sepal width.

plot(x = iris$Sepal.Length, y = iris$Sepal.Width,
     type = "p",
     main = "Sepal Flower Measurements of Iris Species",
     xlab = "Sepal Length",
     ylab = "Sepal Width")

This graph does what we want it to, but we can include more information and make this look more interesting.

plot(x = iris$Sepal.Length, y = iris$Sepal.Width,
     type = "p",
     main = "Sepal Flower Measurements of Iris Species",
     xlab = "Sepal Length",
     ylab = "Sepal Width",
     pch = 20,
     col = c("#332288","#AA4499","#88CCEE")[iris$Species])
legend(x = "topright",
       legend = levels(iris$Species),
       col = c("#332288","#AA4499","#88CCEE"),
       pch = 20)

In the new code above, we assign colors for each level of ‘Species’ and include a legend to match colors to iris species. These colors were chosen because they are colorblind-accessible. You can learn more data visualization tips from USF’s Genomics Seminar by Claus Wilke, “Uppin your dataviz game”. This recording is only available to USF students and faculty, but his book Fundamentals of Data Visualization can be accessed by everyone.

Below are a few more types of graphs R can plot.

Histogram of Iris Flower Sepal Length

hist(iris$Sepal.Length,
     main = "Histogram of Iris Sepal Length",
     xlab = "Sepal Length",
     col = "purple" )

Heat map of Iris Data

dist() is used to calculate the distance (dissimilarity) between each pair of flowers in the iris data, based on the four measurement columns.

heatmap(as.matrix(dist(iris[, 1:4])))
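
To see what dist() returns before it is passed to heatmap(), here is a sketch on just the first three flowers. By default dist() computes Euclidean distances, so smaller values mean more similar flowers.

```r
data(iris)

d <- dist(iris[1:3, 1:4])   # pairwise distances between rows 1, 2 and 3
d

d_mat <- as.matrix(d)       # square, symmetric matrix with zeros on the diagonal
dim(d_mat)
```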

Boxplot of Iris Data

boxplot(iris[, 1:4],
        col = c(rep("pink", 2), rep("lightblue", 2)),
        main = "Boxplot of Iris Data")

💭EXERCISE: Use the built-in ‘pressure’ dataset to create a line graph of temperature against pressure.

TAKE NOTE
- We introduced a few functions to explore and clean our data but unfortunately, we cannot cover everything. A big part of programming is learning how to problem solve: how to search Google and how to refer to the manual.
- When producing graphs, first consider the best way to convey the information: a line graph, a bar graph, etc. It is important to invest sufficient time and effort in the process.
- Make sure your graph communicates the information well. Omit needless graphical elements and use large font sizes.

The Basics - pt.2

Now we can move forward with using our own data, exploring new packages, and utilizing other R tools.

Working directories


Before loading our own data, we have to know what working directories are. The working directory is the location on your computer where R can read and save files. You can only have one working directory at a time.

Run getwd() to print your working directory

getwd()

If you are using a Mac computer, your working directory may look like “/Users/path/to/my/directory/” but if you are on Windows, then your output may look like “c:/path/to/my/directory/”

If the files we need for analysis are in a different directory, then we can use setwd() to change it.

First, let’s create a folder named “R_exercise” in our “Documents” and change our working directory to it.

✏️NOTE: check for correct back/forward slashes and upper/lower cases if you run into errors.

setwd("/Users/username/Documents/R_exercise") 
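
File paths can also be built with file.path(), which joins folder names using forward slashes (R accepts these on Windows as well). The sketch below uses a temporary folder from tempdir() as a stand-in for “Documents” so that it runs on any computer; the folder name R_exercise_demo is made up.

```r
# Build a path inside R's temporary folder (a stand-in for "Documents")
demo_dir <- file.path(tempdir(), "R_exercise_demo")

# Create the folder if it does not exist yet
if (!dir.exists(demo_dir)) {
  dir.create(demo_dir)
}

old_wd <- setwd(demo_dir)   # setwd() returns the previous working directory
getwd()
setwd(old_wd)               # switch back when finished
```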

You can also set a default working directory by going to ‘Global Options…’ in the Tools tab, but it is important to know how file paths work and how to write them.

Run getwd() to print your working directory. Check that you are in the right directory.

getwd()

Importing and Exporting Data


Loading our own data

We will take available stroke data from Kaggle provided by fedesoriano. Click here to download the data set. Unzip the file and move the .csv to your “R_exercise” folder.

We can run the list.files() function to list all files in the path provided as a string. We can input getwd() as a shortcut. We should see the file we just downloaded and moved.

list.files(getwd())
## [1] "kaggle_stroke_data.csv"

To load csv files into our environment, we can use the read.csv() function. Assign the data to ‘stroke’

✏️NOTE: Since the file we want to load is in our working directory, we can just specify the file name. If not, then the entire file path must be specified.

✏️NOTE: Remember that files are not always ideal for loading within R. Check the manual by running ?read.csv and examine the arguments. Sometimes, you might need to change an argument setting.

stroke <- read.csv(file = "kaggle_stroke_data.csv")

Saving our data

After doing some analysis or data cleaning, we can save our results into a .csv file.

Let’s clean our stroke data set so that there are no missing data. We can typically check this by running is.na() on our data, but here it returns a logical matrix of all FALSE values. However, when you look at the ‘bmi’ column, you can see that unavailable data is indicated as “N/A”. This string, like an empty string, is not recognized by the is.na() function, so it is important to examine our data carefully.

💭EXERCISE: What happens when we remove the ‘!’ operator in this code?

stroke_clean <- stroke[!stroke$bmi == "N/A",]
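
Another option, following the earlier note about checking the arguments of read.csv(), is to declare which strings mean missing at import time with the na.strings argument. The sketch below writes a tiny made-up file to a temporary location rather than using the Kaggle data, so the column names and values are illustrative only.

```r
# Write a small hypothetical CSV file to a temporary location
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("id,bmi",
             "1,36.6",
             "2,N/A",
             "3,28.1"),
           tmp_file)

# Treat the string "N/A" as a missing value while reading
demo <- read.csv(file = tmp_file, na.strings = "N/A")

is.na(demo$bmi)                         # the missing entry is now detected
demo_clean <- demo[!is.na(demo$bmi), ]  # keep only rows with a bmi value
class(demo$bmi)                         # bmi is read as numeric, not character
```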

Now that our data is clean, we want to save it. To export our data as a .csv, we can use write.csv(). The comments below describe each argument of the function.

write.csv(x = stroke_clean, # the data set we want to save
          file = "kaggle_stroke_data(n=455).csv", # the filename we want to save as
          row.names = FALSE # do not write the automatic row names (1-455) as an extra column
)

The process works the same for different file types. For example, you can use read.table() for .txt or tab-delimited text files. You can check this DataCamp tutorial for more examples importing different files into R.
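
As a sketch of the same workflow for a tab-delimited text file, the example below writes a small made-up table with write.table() and reads it back with read.table(). The file lives in a temporary location and its contents are hypothetical.

```r
samples <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  reads = c(1200, 950, 1430)
)

tmp_txt <- tempfile(fileext = ".txt")

# sep = "\t" writes tab-separated columns; quote = FALSE leaves text unquoted
write.table(samples, file = tmp_txt, sep = "\t",
            row.names = FALSE, quote = FALSE)

# header = TRUE tells R that the first line holds the column names
samples_in <- read.table(file = tmp_txt, sep = "\t", header = TRUE)
samples_in
```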

R Packages


If you checked out that DataCamp tutorial or have already done some troubleshooting via Google then you most likely have been introduced to a new package. So far, we have only been using base R and built-in R functions, but many other useful R functions come in packages. These are free libraries of code written by R’s active user community.

How to install R packages

1. Most packages can be installed from the official repository, CRAN, the Comprehensive R Archive Network. Packages submitted to CRAN are subject to testing, policies, and legal requirements. CRAN packages can be installed using the base R function install.packages(), and ALL packages can be loaded using library().

✏️NOTE: Once you install a package, you don’t have to install it again unless it needs to be updated. You can check if a package needs updating in the lower-right panel in RStudio, next to the Files/Plots/Help tabs. You will need to reload packages in new R sessions.

install.packages("packagename")
library(packagename)

2. Authors also host packages on GitHub, where users can report bugs or suggest new features. To install R packages from GitHub, the ‘devtools’ package must be installed first.

install.packages("devtools")
devtools::install_github("username/packagename")

✏️NOTE: The ‘::’ operator placed after a package name calls a function from that specific package. If you did not attach the package by running library(devtools), then using ‘::’ still lets you use that one function from it. This is useful when two loaded packages contain functions with the same name. If you are using functions from a package more than once, then load it using the library() function.

3. And if you are a bioinformatics researcher, then you will definitely be downloading from Bioconductor - a free, open source and open development software project for bioinformatics. Installing packages from Bioconductor is similar to installing GitHub packages.

install.packages("BiocManager")
BiocManager::install("packagename")
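
Here is a sketch of the ‘::’ operator using the stats and utils packages, which ship with R, so nothing needs to be installed. The vector values is a placeholder.

```r
values <- c(2, 4, 4, 10)

# Call median() from the stats package explicitly
stats::median(values)

# Call head() from the utils package explicitly
utils::head(values, n = 2)
```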

Package Documentation

All packages include documentation. Without it, users would not know how to use the package. We have been able to access object documentation using ? for individual functions, but it is only helpful if you know the function name and want to know what it does. It does not help you find functions for a new problem.
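
When the exact function name is not known, base R offers keyword searches. Here is a sketch using apropos(), which lists objects whose names match a pattern; running ??csv in the console performs a broader search of the installed help pages.

```r
# List functions whose names contain "csv"
apropos("csv")

# List functions whose names start with "read."
apropos("^read\\.")
```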

When googling solutions to new problems, we might run into a package that looks useful but we need to learn more about it. For CRAN packages, you can find the CRAN page of the package by searching it by name at cran.r-project.org. From there, you will be able to access the reference manual. This manual provides object documentation for all the functions in the package. The following is an example using ggplot2, a popular package for creating visualizations.

  

Below the reference manual is where you can usually find the vignettes. Vignettes are helpful for looking at informative examples and how results can be interpreted. Some authors provide more than one vignette. The Bioconductor page for a package at www.bioconductor.org displays package information in a similar way to CRAN. You can search for the name of the package in the search bar located in the top right. The following is an example of how to access the vignettes for the biomaRt package. This package makes it easy to access and retrieve Ensembl data from R.
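
Vignettes of installed packages can also be listed from within R. The sketch below uses the 'grid' package only because it is installed with R; the commented lines assume that ggplot2 and biomaRt are already installed on your machine.

```r
# List the vignettes that ship with an installed package.
# 'grid' is used only because it is installed with R; swap in any package name.
grid_vignettes <- vignette(package = "grid")
print(grid_vignettes$results[, c("Item", "Title")])

# Browse the vignettes of a package in a web page. These calls open a viewer
# and assume the packages are installed, so they are commented out here.
# browseVignettes(package = "ggplot2")
# browseVignettes(package = "biomaRt")
```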

  

Other R tools


R Markdown

R Markdown is a file format for making dynamic documents with R. An R Markdown document is written in Markdown (an easy-to-write plain text format) and contains chunks of embedded R code. R Markdown scripts can be knitted to create HTML, PDF, or Word documents. The format is popular for keeping neat records of analyses. Documents can also be combined with HTML widgets to make them interactive. Below are some examples of R Markdown documents.

To create an R Markdown (.Rmd) file, go to the ‘File’ tab within RStudio and open ‘R Markdown…’ under the ‘New File’ options. It will prompt you to choose the file type you want to knit to. HTML is the default. A new .Rmd file always starts with a predefined layout containing a YAML header, some Markdown syntax, and chunks of code as an example.

Parts of an R Markdown Document

+ **YAML header** - the YAML header in R Markdown is the content provided between the '---' delimiters. 
---
title: "R Markdown Title"
author: "First and Last Name"
date: "date"
output: html_document
---
+ **Markdown text** - all markdown text occurs in the space of the script outside of the YAML header and chunks of code. Words can be italicized by surrounding the text with asterisks. Headers can also be created using #'s. 
# Header 1 
## Header 2 

Check out this R Markdown Cheat Sheet for more.

+ **Code Chunk** - Chunks of embedded code are formatted beginning with ` ```{r}` and ending in ` ``` `. Everything in these chunks is treated as regular R code. You can click the green arrow icon in the top right corner of the chunk to run the entire chunk of code. 

 

 

+ **Set up Chunk** - this automatic set-up chunk defines the global settings for every chunk of code in the R Markdown. For example, the following global option is set to "echo = TRUE" so that each chunk prints its code along with the output in the knitted R Markdown. Global options can be overridden by individual chunk settings. 
knitr::opts_chunk$set(echo = TRUE)
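
To see the parts above working together, the following sketch writes a minimal .Rmd file from R and knits it from code. It is an illustration under stated assumptions rather than the course's own template: the file content is made up, it is written to a temporary location, and the knitting step only runs if the 'rmarkdown' package and pandoc are available. The second chunk sets echo=FALSE to show an individual chunk setting overriding the global option.

```r
# Build the lines of a minimal R Markdown file. The chunk fence is three
# backticks; it is assembled with strrep() so it can be shown inside this block.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  'title: "R Markdown Title"',
  "output: html_document",
  "---",
  "",
  "# Header 1",
  "",
  "Some *italicized* text.",
  "",
  paste0(fence, "{r setup, include=FALSE}"),
  "knitr::opts_chunk$set(echo = TRUE)",
  fence,
  "",
  paste0(fence, "{r, echo=FALSE}"),
  "summary(cars)",
  fence
)

# Write the file to a temporary location so nothing in your project is touched.
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# Knit from code instead of the Knit button, if rmarkdown and pandoc are available.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(html_path)
}
```

In day-to-day use you would edit the .Rmd file directly in RStudio and click Knit; calling rmarkdown::render() is mainly useful when knitting from a script.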

💭EXERCISE: Try knitting the default .Rmd script by clicking the knit icon next to the save icon. How do you bold a word in Markdown?

Check out The Coding Club’s R Markdown Tutorial for more on creating R Markdowns.

R Projects

Good project organization is crucial! It is arguably easier to import/export data to and from the same directory, but having one file path for everything can get messy very quickly. Ideally, you want to lay out your project so that it is reproducible, easily understandable to others, and easy to come back to after a break. Each project should be in its own folder, with subfolders organizing each type of document:

/data : This folder should store raw input data used for analysis. Preprocessed or cleaned data should be stored in a subfolder or a completely different folder, such as /data/preprocessed or /outputs.

/scripts : R scripts should be stored in this folder. If there is more than one script to be run in a specific order, the order should be specified in the file name. Ex. 01_preprocess_stroke_data.R.

/outputs : Outputs exported from analyses should be stored in this folder. Figures can go in their own subfolder here or, if you have many figures, in a separate top-level folder.

✏️NOTE: this is a standard directory template for organizing your research project. Modify it to best fit your needs!
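
The folder template above can be created from R. This is a minimal sketch: the project name "stroke_project" is hypothetical, and the folders are created inside a temporary directory so the example does not touch your own files.

```r
# Create the folder layout described above inside a temporary directory.
# "stroke_project" is a hypothetical project name used for illustration.
project_root <- file.path(tempdir(), "stroke_project")
subfolders <- c("data", file.path("data", "preprocessed"), "scripts", "outputs")

for (folder in subfolders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Show the resulting folders, relative to the project root.
print(list.dirs(project_root, full.names = FALSE, recursive = TRUE))
```

For a real project, replace tempdir() with the location where you keep your research projects.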

More information on good project management can be found here.

Keeping track of different file paths can seem overwhelming, but RStudio has a tool called R Projects to deal with this issue. An R Project is defined by a file with the .Rproj extension stored in the project folder. The folder where we create our R Project becomes the working directory whenever the project is opened. This makes the project easy to return to and share with others.

Why not just use setwd() ?

setwd() is typically called with an absolute file path that is unique to your computer, making it difficult to share code with others. Also, what if you wanted to move or rename the project folder? The hard-coded path would no longer exist, and your code would produce errors.
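
The difference can be sketched as follows. Both paths are hypothetical examples; the point is that a relative path is resolved against the current working directory, so it keeps working when the whole project folder is moved or shared.

```r
# An absolute path only exists on one machine (hypothetical path, commented out):
# setwd("/Users/yourname/Documents/stroke_project")

# A relative path is resolved against the current working directory.
# file.path() joins the pieces with "/" on every operating system.
relative_path <- file.path("data", "counts.csv")
print(relative_path)

# The full location R would look in, given the current working directory.
full_path <- file.path(getwd(), relative_path)
print(full_path)
```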

How to Create an R Project

R Projects can be created by clicking ‘New Project…’ under the ‘File’ tab in the top left corner within RStudio. It will then prompt you to create the project in a new directory, create it in an existing directory, or check out a project from a version control repository.

New Directory: this option lets you create a new folder on your computer from within RStudio, if you have not already created the folder you would like to use for your R Project.

Existing Directory: this option lets you browse to the directory you would like your R Project directory set to.

Version Control: this option lets you check out a project from a version control repository, such as a repository hosted on GitHub.

After creating the project, you should see a file with the .Rproj extension in your directory, and RStudio will automatically switch to your R Project directory. You can return to your project session by opening this file.

Follow this Beginner’s Guide for more!

   

✏️NOTE: The OmicsHub training workshops run tutorials in RMarkdown and use R Projects.

Final Key Points

  • Get to know your data first. Before jumping into analyses, examine your data. Are all the variables the right type? Does anything need to be cleaned?

  • Practice neat code! Create variable names that are meaningful. Space code according to common style guides.

  • Remember to comment your code. The best way to learn how to program is to practice every day, but things happen! Make sure to comment your code so that your future self or others can understand what’s going on.

  • Menial tasks save heartache. Take time to organize your project from the very beginning. Use R Projects.

  • GOOGLE GOOGLE GOOGLE - and check the manual.

  • You’re not alone. There is a big R community. StackExchange and Biostars are wonderful resources for asking questions.

  • Stay motivated :) You don’t need to keep running through different R tutorials. Pick an area you are interested in like data visualization or exploratory analysis and focus on it. Do whatever helps YOU learn best.

Further Reading & Resources

Statistical Inference via Data Science A ModernDive book focusing on R and Tidyverse functions.

Babraham Bioinformatics As part of its work with the Babraham Institute, the Bioinformatics group runs a regular series of training courses on many aspects of bioinformatics.

swirl An interactive R tutorial within the R console.

R for Data Science This book will teach you how to do data science with R.

Reach out to us! - The USF Genomics Hub We are also employed full-time as researchers on various extramurally-funded projects. Please see our main USF Genomics Hub Services page to request computational assistance.

   

HAPPY CODING!!! 😁😁😁