7 Files and subsetting data

7.1 Reading from a file

In chapter 4 we created data frames with R functions. This was useful to help understand how data frames work in R. However, in real life you will most likely not do this very often. Instead you will have data files you need to analyse with R.

You can get your data into R by having R read your file.

Prior to using a file you will need to acquire it. Make a directory/folder called “Chapter_7” within your main directory/folder for this workshop. Download the following file into your “Chapter_7” directory:

Next set your working directory to this new directory (Directories). You will stay here for this chapter.

The next task to carry out is to read in the file “Liverpool_beaches.csv” . Before reading in the file we can check the contents of the file. This can be carried out by opening it with notepad (or similar text tool) or viewing the file with RStudio.

To view the file with RStudio:

Use the Files pane of the MISC window to navigate to the directory containing the file.
Click on the file name and then click “View File”
This will open a tab in the Source window matching the file name

You will notice that the values are separated by commas as this is a “comma separated value” (.csv) file. Additionally, this is the same data as the “beach_df_2” data frame you created in the chapter 5 exercises.

Note: Create a new R script file called “3-Files_and_subsetting.r” for this chapter’s scripts.

There are various functions to read in files into R. My favourite that I find most consistent is read.csv(). Use this function to read in the file “Liverpool_beaches.csv”:

liv_beaches_df <- read.csv("Liverpool_beaches.csv")

Have a look at the newly created data frame. Is it how you would like it?

The row names are empty and the beach names are in the first column. Let us fix this and make it so the beach names are the row names. This can be carried out by including the option row.names = 1 to specify the 1st column will be the row names:

liv_beaches_df <- read.csv("Liverpool_beaches.csv", row.names = 1)

We now know how to read in a csv file with read.csv. Therefore we will next read in a tab separated file containing the sales figures of fictional clothing stores through the seasons.

Download the file “Clothing_stores.tsv” into your “Chapter_7” directory and look at the file contents before reading them in.

Now use read.csv() to read in the file. We’ll set row.names = 1 again but we will also include the option sep = "\t". This option specifies the columns are separated (sep) by tabs ("\t").

clothing_df <- read.csv("Clothing_stores.tsv", row.names = 1, sep = "\t")

Look at the resulting data frame and you will notice the column names have been changed by R. This is annoying but thankfully there is an easy fix. Read in the data again with the inclusion of the parameter check.names = FALSE. This will stop the function read.csv() from ‘checking’ and ‘fixing’ the column names. I always use this option.

clothing_df <- read.csv("Clothing_stores.tsv", row.names = 1, 
                        sep = "\t", check.names = FALSE)

You may want to open excel files with R. Normally to do this I open the file in excel and save it as a .csv or a tab separated file and read this into R. Alternatively there are R packages that can directly read in excel files. If this is something you would like to do you can look at the following package:

https://readxl.tidyverse.org/

An important note is that reading in a file into R will not change the file. You are creating a new R object. Modifying this object will not alter the original file. Later in the materials we will look into how to create new files or overwrite files by writing.

7.2 Subsetting data

R allows you to specify specific points in R objects. This is one of the primary reasons R is so useful and flexible. With good use of assignment operators this allows for the subsetting of variables.

7.2.1 Vectors

We will start with vectors. Before carrying out and subsetting let us create some new vectors. We will use a new function to create these, seq().

Tip: Look at the resulting vectors and, use ?seq() or search online to understand the seq() function better.

even_seq <- seq(from = 0, to = 8, by = 2)
odd_seq <- seq(from = 1, to = 9, by = 2)
long_seq <- seq(from = 10, to = 300, by = 10)

Grand! Now let us subset the vectors with square brackets [].

Vectors are one-dimensional, we therefore provide the square brackets with one number or one range of numbers. The number/s we provide in the square brackets are the index.

Try out indexing/subsetting the vectors.

even_seq[2]
odd_seq[1]
long_seq[10]
even_seq[2:3]
odd_seq[1:4]
long_seq[21:24]
long_seq[24:21]
even_seq[c(2,3)]
odd_seq[c(1,3,2,5)]
long_seq[c(1,21,21:24,24:21,1)]
#As long as the contents within the [] equal numbers they will work
even_seq[seq(from = 1, to = 3, by = 2)]
even_seq[seq(from = 0, to = 5, by = 3)]
long_seq[seq(from = 1, to = 19, by = 2)]
even_seq[1*2]
odd_seq[2/1]
long_seq[(1:10)*2]

The vectors even_seq and odd_seq have the indexes 1,2,3,4, and 5 as they each contain 5 scalars. What if we try to use a higher number to index than is available?

even_seq[6]
even_seq[c(4,7)]
odd_seq[3:9]

As you can see the above all work with no complaints. Any indexes that are out of range will return a NA value. NA stands for ‘Not Available’. We will not go into how NA works in R too much. The most important thing to know about NA is that you will most likely get NA if you use operators or functions with NA. Below are a few examples:

#Will give NA
1 + NA
2 - NA
even_seq[2] * NA
odd_seq[5] / NA
#mean() function without NA
mean(even_seq[2:5])
#mean() function with NA
mean(c(1,2,3,4,5,NA))
mean(even_seq[2:7])

Above we subsetted vectors by specifying which indexes we want. We can also specify which indexes we don’t want:

even_seq[-2]
odd_seq[-3:-5]
long_seq[c(-1,-2,-6)]

The rep() function will replicate a scalar/vector a specified amount of times. We will use this function to overwrite our previously created variables with longer versions:

#Replicate vector even_seq 2 times
rep(x = even_seq, 2)
#Replicate vector even_seq 4 times and then assign even_seq as the newly created vector
even_seq <- rep(x = even_seq, 4)
#More examples
odd_seq <- rep(x = odd_seq, 4)
long_seq <- rep(x = long_seq, 3)

Logical operators can be used as indexes to subset vectors. Having a logical expression (i.e. 1 > 2) as the index will cause all TRUE positions to be included and all FALSE positions to be excluded.

Tip: If it is difficult to deduce what the below commands are doing you can run the part in the square brackets by itself. Remember if you highlight code in the script editor it will only run that part, excluding unhighlighted parts of script in the same line.

even_seq > 3
even_seq[even_seq  > 2]
odd_seq[odd_seq <= 1 ]
long_seq <- long_seq[long_seq < 50]

We will quickly look at a new operator, %%. This is the modulus operator, it divides two numbers and gives the remainder of the division.

With the modulus operator, logical expressions, and subsetting we can extract even or odd numbers from a vector:

#First some basic modulus examples
2%%2
3%%2
#Create a vector with numbers 0 to 9
single_digit_vec <- 0:9
#Extract even numbers then odd numbers from the vector
#We carry this out by determining if numbers are divisible by 2 or not
even_seq <- single_digit_vec[(single_digit_vec %% 2) == 0 ]
odd_seq <- single_digit_vec[(single_digit_vec %% 2) != 0]
#We can determine which numbers in a vector are divisible by any specific number
#Divisible by 3
#remember variable names cannot start with numbers
divis_3_vec <- single_digit_vec[(single_digit_vec %% 3) == 0]
#Divisible by 7
divis_7_vec <- single_digit_vec[(single_digit_vec %% 7) == 0]
#Try out other numbers!

7.2.2 Data frames

Data frames can be subset similar to vectors. As with vectors you can use []. Additionally, $ can be used to subset data frames.

Square brackets must be provided indexes for rows and for columns. The structure for this is df[row,column]. It is very useful to remember that R always wants rows first then columns second.

To practice subsetting data frames with square brackets we will read in a new file called - Census_2011_L_postcodes_counts.csv]. This contains UK 2011 census information on total residents for postcodes that start with the letter L.

Download the following file into your “Chapter_7” directory:

The source of the data is: https://www.nomisweb.co.uk/census/2011/postcode_headcounts_and_household_estimates

Read the downloaded file into R:

L_2011_census_df <- read.csv("Census_2011_L_postcodes_counts.csv",
                             check.names = FALSE,
                             row.names = 1)

Now for some subset commands:

#Scalar from the 1st row and 1st column
L_2011_census_df[1,1]
#Row names and column names can be used for indexing
#Scalar from the row called L10 1LD and the column called Area
L_2011_census_df["L10 1LD","Area"]
#More examples
L_2011_census_df[1:10,2]
L_2011_census_df[1:10,"District"]
L_2011_census_df[3,2:4]
L_2011_census_df["L10 1LD",2]
L_2011_census_df[1:10,"Total"]
L_2011_census_df[c(1,3,5,6),c("Total","Occupied_Households")]

Depending on how you subset a data frame you may get a scalar, vector, or data frame. Below describes which you will get based on the subsetting.

Scalar:
- Indexing to get a single value by choosing one row and one column.
- E.g. L_2011_census_df[1,1]
Vector:
- Indexing so you get multiple values from one column. This occurs as each column is in essence a vector.
- E.g. L_2011_census_df[1:10,2]
Data frame:
- Indexing so you get multiple values from a row or multiple rows. Subsetting a data frame like this provides you a data frame.
- E.g. L_2011_census_df[3,2:4] or L_2011_census_df[3:4,2:4]

A quick function to subset a data frame is head(). By default it will return the first 6 rows.

#Return first 6 rows
head(L_2011_census_df)
#Return first 10 rows
head(L_2011_census_df, 10)

The data frame is quite large. We will therefore use the head() function and the assignment operator to make the data frame smaller for further examples.

L_2011_census_df <- head(L_2011_census_df, 20)

To return all the rows of the specified columns you can leave the part before the comma empty. Similarly you can leave the part after the comma empty to return all of the columns of the specified rows. Leave both sides empty and you will get the entire data frame.

L_2011_census_df[,]
L_2011_census_df[,2]
L_2011_census_df[3,]
L_2011_census_df[,"District"]
L_2011_census_df[2:4,]

The sign $ allows you to indicate which column you would like from the data frame. This is done like so:

L_2011_census_df$Area
L_2011_census_df$District
L_2011_census_df$Total

You will notice that the above commands return vectors. We can therefore subset these vectors with []:

L_2011_census_df$Area[2]
L_2011_census_df$District[2]
L_2011_census_df$Total[4:7]

Below are a selection of useful functions that can be used on vectors.

#Sum the values of a numeric vector
sum(L_2011_census_df$Total)
#Mean of the values of a numeric vector
mean(L_2011_census_df$Total)

The above functions are useful but limiting if you are working with data frames. Thankfully there are also many functions used specifically for data frames (they can also be used for matrices).

#Sum numeric columns
colSums(L_2011_census_df[,3:6])
#Sum numeric rows
rowSums(L_2011_census_df[,4:5])
#Mean of numeric columns
colMeans(L_2011_census_df[,3:6])
#Mean of numeric rows
rowMeans(L_2011_census_df[,4:5])
#Summary information for each column
#This works for string and numeric columns with different outputs
summary(L_2011_census_df)

Try out some of the above commands with the entire data frame. Do they give an error? Is so, why?

Before we learn how to write data to a file I will introduce one more data frame associated function. t() which stands for transpose:

L_2011_census_df[3:5]
t(L_2011_census_df[,3:5])
summary(t(L_2011_census_df[,3:5]))

Try the above commands without subsetting the data frame. What is happening and why?

7.3 Writing to a file

Before we write data to a file we will create a new data frame from “L_2011_census_df”.

First I like to create a new variable from our old variable if there are many steps. This means if we make a mistake we can go back and recreate the new variable.

L_2011_census_t_df <- L_2011_census_df

Next step we will create a new column called “Average_per_occupied_households”.

NOTE: I am including many ways to subset columns as reminders. Normally I wouldn’t have so many different ways in one command.

NOTE: We are using “_” instead of spaces as R doesn’t particularly like spaces in column names. We will see how to use spaces later.

L_2011_census_t_df$Average_per_occupied_households <- 
  L_2011_census_t_df[,3] / L_2011_census_t_df[,"Occupied_Households"]

Have a look at the current data frame. You may notice an Inf value. This appears as when you divide a number by 0 in R you will get Inf. I am not sure how a Post code has 174 residents and 0 Occupied households but it doesn’t matter for us.

The final step before writing is to transpose the data frame leaving out the Area and District columns:

#Transpose dataframe
L_2011_census_t_df <- t(L_2011_census_t_df[,3:7])
#Check structure
str(L_2011_census_t_df)
#It is not a dataframe
#Let us therefore convert it to a data frame
L_2011_census_t_df <- as.data.frame(L_2011_census_t_df)
#Structure check
str(L_2011_census_t_df)

After all that let us write the data frame to a file called “Census_info_2011.csv”. When reading from a file I prefer read.csv(), however when writing to a file I prefer write.table(). With this function we will include the option sep="," to have commas as the column separators. We will also include the option col.names=NA to create an empty space above the row names. If this was not included then the first column name would be above the row names.

write.table(L_2011_census_t_df, file = "Census_info_2011.csv", sep = ",", col.names=NA)

Have a look at the file contents with RStudio.

Let us do it one more time with the clothing store info. First let us read in the file again in case you do not have it. Then we will create a total sales column and finally transpose the data frame:

#Read in
clothing_df <- read.csv("Clothing_stores.tsv", row.names = 1, 
                        sep = "\t", check.names = FALSE)
#Create total column
#We are referring to a column name with spaces
#Therefore we must surround the name with `
#The button for ` is left of the 1 key and below the esc key
clothing_df$`Total sales` <- rowSums(clothing_df)
#Transpose ensuring output is a data frame
clothing_t_df <- as.data.frame(t(clothing_df))

Write the data frame to a tab delimited file (.tsv). This time we will make it so the row and column names are not surrounded by quotes:

write.table(clothing_t_df, 
            "Clothing_stores_transposed.tsv", 
            sep = "\t",  
            col.names=NA, 
            quote = FALSE)

With the fundamentals of reading, subsetting data frames, and writing covered it is time to carry out some exercises.