21 Student grades example
In this section we are going to go through analysing datasets of students grades.
21.1 The student grades data
Download the five following files into your directory “Chapter_20-21”.
These files contain the marks secured by USA students in various subjects. The files are separated by anonymised race/ethnicity (groups A-E).
The files each contain the columns:
- gender: female or male.
- parental level of education: some high school, high school, associate’s degree, some college, bachelor’s degree, or master’s degree.
- lunch: Whether the student ate free/reduced lunches or standard lunches.
- test preparation course: Whether the student completed a test prep course (completed) or not (none).
- math score: The student’s math score.
- reading score: The student’s reading score.
- writing score: The student’s writing score.
Source of data: https://www.kaggle.com/spscientist/students-performance-in-exams
21.2 Analysis script
Before carrying on I would suggest to set your working directory into the directory “Chapter_20-21”.
Now create a new R Script and save it as “student_grades.R” in the “Chapter_20-21” directory. It is useful to have your scripts close to their relevant data.
Ensure you use annotations and code sections to keep your script organised and tidy.
One extra tip: I normally like to keep a setwd()
command at the top of my R Scripts so the script will set itself to the correct working directory. I just copy the relevant command from the console.
21.3 Reading in and preprocessing the data
We are going to first read in the data and carry out some changes to it. First we’ll carry this out on just the group A data.
Read in the group A TSV (tab separated value) file:
<- read.csv("student_grades_group_A.tsv",
group_a_df sep = "\t",
stringsAsFactors = TRUE,
check.names = FALSE)
When you first read in a file it is always good practice to check the resulting data frame’s contents:
#Check the top 5 rows
head(group_a_df)
#View the data frame
View(group_a_df)
#Summary of each of its columns
summary(group_a_df)
There are three main things we want to do with this data frame:
- Change the column names so spaces are underscores. We can carry this out with
gsub()
. - Change the order of the levels for the column “parental level of education” so they are in order from least to most education. We can do this with the
factor()
function. - Add a group column which will contain the information on what race/ethnicity group each student belongs to. This will contain all the same info for this data frame but it will be useful later on. We are happy for it to be a character column for now.
Let us carry out these tasks:
#Change spaces to _ in column names
colnames(group_a_df) <- gsub(pattern = " ",
replacement = "_",
x = colnames(group_a_df))
#check our change
colnames(group_a_df)
#Change the order of the parental level of education levels
<- c("some high school","high school",
edu_level_order "associate's degree","some college",
"bachelor's degree","master's degree")
$parental_level_of_education <- factor(
group_a_df$parental_level_of_education,
group_a_dflevels = edu_level_order
)#check our change
levels(group_a_df$parental_level_of_education)
#Add group column using the recycle rule
"group"] <- "group_A"
group_a_df[,#check the new column
$group group_a_df
Great we have created code to read in our file and preprocess the data frame. We have checked the output and know it produces the data frame in the format we want.
We could copy and paste the code four times and change the relevant variable names to read in groups B-E. This would be pretty annoying and time consuming though.
Instead let us create a function to do this for us. This will mostly use the same code above with variable names changed. There are a few changes though especially for the “group_name”” column creation.
Note: The variable names used within the function should not be the same as any you are using outside the function.
#Function name: "read_and_preprocess"
#Input: file_name
<- function(file_name){
read_and_preprocess #read in the file
<- read.csv(file_name, sep = "\t",
df stringsAsFactors = TRUE, check.names = FALSE)
#column name change
colnames(df) <- gsub(pattern = " ", replacement = "_", x = colnames(df))
#Parental level of education order
<- c("some high school","high school",
edu_level_order "associate's degree","some college",
"bachelor's degree","master's degree")
$parental_level_of_education <-
dffactor(df$parental_level_of_education, levels = edu_level_order)
#extract the group name from the file name
<- gsub(".*_group","group", file_name)
group_name <- gsub(".tsv","", group_name)
group_name #Add group column
$group <- group_name
df#Return df
return(df)
}
We have created the _function. Now to run it with all our files:
#We can specify the option name
#This is useful when we have lots of options
<- read_and_preprocess(file_name = "student_grades_group_A.tsv")
group_a_df #But we don't need to, which is useful with less options
<- read_and_preprocess("student_grades_group_B.tsv")
group_b_df <- read_and_preprocess("student_grades_group_C.tsv")
group_c_df <- read_and_preprocess("student_grades_group_D.tsv")
group_d_df <- read_and_preprocess("student_grades_group_E.tsv") group_e_df
21.4 Combining the data frames
A lot of the time we have multiple files but we want all the data to be in one data frame.
The first step is to create a list containing our data frames. A list is just like a vector except it can hold heterogeneous data (multiple classes) whilst a vector can only contain homogeneous data (one class).
We will use this so we can loop over our data frames. There is more to lists, if you would like to see more check out: https://rc2e.com/datastructures#intro-DataStructures
We’ll start with creating our list then using a for loop to print out the first 3 rows of each data frame. To create a list of data frames we use the function list()
just like we would with c()
.
#Create list
<- list(group_a_df, group_b_df, group_c_df, group_d_df, group_e_df)
df_lst #Loop through list
for (df in df_lst) {
#Print the first element of the group name column
print(df$group[1])
#Print the first 3 rows
print(head(x = df, n = 3))
}
With our list we can combine the data frames into one data frame with a for loop and rbind()
.
It would be very simple but rbind()
will not function if one of the data frames provided does not exist. We will therefore use an if statement to create a data frame for our combined data in the first loop.
We will use a new function, exists()
. It will return TRUE
if the provided variable exists and FALSE
if it does not. The provided variable name needs to be a character (a string in quotes in this case).
#First ensure the variable we want to create does not exist
rm(student_grades_df)
#Now to loop through our list
for (df in df_lst) {
#If statment using the exists() function
#If the variable student_grades_df exists
if (exists("student_grades_df")) {
#Combine student_grades_df with df by rows
<- rbind(student_grades_df, df)
student_grades_df else {
} #else if it doesn't exist assign df as student_grades_df
<- df
student_grades_df
}
}
#whilst we are here let us change the group column to a factor now
#No need to change the order as we are happy with it being alphabetical
$group <- factor(student_grades_df$group) student_grades_df
21.5 Saving the data frame
You can write the new data frame to a text file. This is good if you want to use the file outside of R.
What if you want to save a R object so you can use it later without having to run all the script you used to create it or reading in a file you created?
You can use the save()
function:
save(student_grades_df, file = "student_grades_df")
The above created a file called “students_grades_df” that is not human-readable. It contains the data frame student_grades_df
. The name of the file and variable do not have to match but I find it easier if they do.
You can then load in the file which will load the variable with its variable name. Before we load the data we will remove our current student_grades_df
object:
#Remove the data frame
rm(student_grades_df)
#Load in the data frame
load("student_grades_df")
Saving your R objects acts as very good checkpoints after code that takes a long time to run.
21.6 Scaling & Grading
We currently have the base scores (%) for the students but not their letter grade (A,B,C, or F in this case).
Before we provide the grades we are go to scale each student’s results. There are complicated ways to do this but we will get the mean score for each topic (math, writing, and reading). For each score below the mean we will minus 5 and for each score above will will add 5. Scores equal to the mean will not be affected.
#Create function to scale score vector
<- function(score_vec){
scale_score #Get mean score which we will round to 0 decimal places
<- round(mean(score_vec), digits = 0)
mean_score #Create an empty scale_score_vec to add to
<- c()
scale_score_vec #Loop through scores to scale them
for (i in 1:length(score_vec)) {
<- score_vec[i]
score #if to add 5 if larger than mean
if (score > mean_score) { score <- score + 5 }
#if to minus 5 if lower than mean
if (score < mean_score) { score <- score - 5 }
#We don't add or minus anything if the score is equal to the mean
#Add the scaled score to the scale_score_vec
<- c(scale_score_vec, score)
scale_score_vec
}#Return the created vector
return(scale_score_vec)
}
#Run the function on the topics to create new scaled score topics
$math_scaled_score <-
student_grades_dfscale_score(student_grades_df$math_score)
$reading_scaled_score <-
student_grades_dfscale_score(student_grades_df$reading_score)
$writing_scaled_score <-
student_grades_dfscale_score(student_grades_df$writing_score)
Now we have scaled the scores. Some students now have more than 100% and some have less than 0% but that is fine. It doesn’t matter for their final grade.
We will give out grades based on the scaled scores like so:
- A: >=95%
- B: >=80% and <95%
- C: >=60% and <80%
- F: <60%
We can carry this out with the below code. To create it I copied the last code block and altered it. This time we use else if
so the subsequent ifs only work if all the previous ifs were FALSE
.
#Create function to grade a vector
<- function(score_vec){
grade_score #Create empty grade_vec to add to
<- c()
grade_vec #Loop through scores to scale them
for (i in 1:length(score_vec)) {
#get current score
<- score_vec[i]
score #if statements with else ifs
if (score >= 95) {
<- "A"
grade_vec[i] else if (score >= 80) {
} <- "B"
grade_vec[i] else if (score >= 60) {
} <- "C"
grade_vec[i] else {
} <- "F"
grade_vec[i]
}
}#Return the created vector as a factor
return(factor(grade_vec))
}#Run the function on the topics to create the grade columns
$math_grade <-
student_grades_dfgrade_score(student_grades_df$math_scaled_score)
$reading_grade <-
student_grades_dfgrade_score(student_grades_df$reading_scaled_score)
$writing_grade <-
student_grades_dfgrade_score(student_grades_df$writing_scaled_score)
That is he last edit we will do to our data frame so lets save it.
save(student_grades_df, file = "student_grades_df")
21.7 Some quick plots
Below are some quick plots with their code.
First let us see the distribution of grades for the three topics:
#Math
#With plot()
plot(table(student_grades_df$math_grade))
#With barplot()
barplot(table(student_grades_df$math_grade))
#Reading
#With barplot()
barplot(table(student_grades_df$reading_grade))
#Writing
#With barplot()
barplot(table(student_grades_df$writing_grade))
It seems the grades are pretty low. Maybe that is to do with our scaling and harsh grading.
Quickly we’ll check the histogram of all the scores and all the scaled scores.
#scores
hist(
c(student_grades_df$math_score,
$reading_score,
student_grades_df$writing_score
student_grades_df
) )
#scaled scores
hist(
c(student_grades_df$math_scaled_score,
$reading_scaled_score,
student_grades_df$writing_scaled_score
student_grades_df
) )
We can see a big dip in the middle of the scaled scores but not in the non-scaled scores. Seems scaling just increased the distance between people who did better than others.
Last thing is we’ll create a boxplot of math scores separating the x axis by some metadata.
boxplot(
~test_preparation_course+lunch,
math_scoredata = student_grades_df)
I think that is plenty of coding for today and for this course. Thank you very much!
21.8 Tips for creating your own loops and functions
Before you go here are some quick tips.
- Create the code outside the loop/function to begin with.
- Test with a small object to start (e.g. a part of a data frame).
- Keep backups of code.
- If you encounter errors try running your code step by step to determine where the error is.
- It can useful to run the code inside your loop/function by setting the loop/function variable (e.g.
i <- 1
)