Lab 1
This tutorial will start getting you accustomed to R and RStudio. Please refer to Lecture Slides 1 for information on how to install R and RStudio. R is a free software environment for statistical computing and graphics. R is one of the fastest growing programming languages in the last 5 years, and is used in a wide array of industries. R has the advantage that it is free and open-source, and that thousands of users have contributed “add-on” packages that are readily downloadable by anyone. RStudio is a free extension to R, that we will be using for the course.
Open RStudio
Search your computer for RStudio.exe and open the program. It should look something like this:
Create a script file
Click on “File”, “New File”, “R Script”.
- In the top left is your Script file. R commands can be run from the R Script file, and saved at any time.
- In the bottom left is the Console window. Output is displayed here. R commands can be run from the Console, but not saved.
- In the top right is the Environment. Data and variables will be visible here.
- The bottom right will display graphics (e.g. histograms and scatterplots).
Use R as a calculator
R’s arithmetic operators include:
Operator | Function |
---|---|
+ | addition |
− | subtraction |
* | multiplication |
/ | division |
^ | exponentiation |
Type the following command into your R script: 1 + 2
. Run the command
by highlighting it, or making sure the cursor is active at the end of
the line, and clicking ``Run’’.
Ignore the [1]
that is printed out.
Create an object
You can create objects in R. Objects can be vectors, matrices, character
strings, data frames, etc. Create two different scalars (give them any
name you like, it doesn’t have to be a
and b
):
a <- 3
b <- 5
Note that you can copy-paste commands in the gray boxes directly into R.
We have created two new objects called a
and b
, and have assigned
them values using the assignment operator <-
(the “less than” symbol
followed by the “minus” symbol). Notice that a
and b
pop up in the
top-right of your screen. We can now refer to these objects by name:
a * b
## [1] 15
To create a vector in R we can use the “combine” function, c()
:
myvector <- c(1, 2, 4, 6, 7)
Simple functions in R
To call a function in R, type the name of the function, and the
arguments of the function in parentheses: functionname(arguments)
.
There are thousands of functions in R. Here are a few that we’ll need:
Function |
---|
sum() |
mean() |
var() |
summary() |
Try all of these functions on myvector
. For example:
sum(myvector)
## [1] 20
The sum()
function is looking for arguments that it can add together.
Put an object in the brackets, and the function will try to add. Try
mean(myvector)
. You can get help on a function by typing ?
followed
by the function name, and running the command. For example, try
?summary
.
Logical operators
Logical operators are used to determine whether something is TRUE
or
FALSE
. Some logical operators are:
Operator | Function |
---|---|
> | greater than |
== | equal to |
< | less than |
>= | greater than or equal to |
<= | less than or equal to |
!= | not equal to |
The operators &
“and” and |
“or” may be used to combine the above
operators.
Try entering the following commands:
8 > 4
## [1] TRUE
b == 6
## [1] FALSE
b > 2
## [1] TRUE
myvector > 3
## [1] FALSE FALSE TRUE TRUE TRUE
myvector > 3 & myvector < 7
## [1] FALSE FALSE TRUE TRUE FALSE
For the last example, R has checked to see whether each element in
myvector
is greater than 3 and less than 7.
We will use these logical operators to create subsamples. We can choose certain observations in the data set based on values of the variables.
Load data
The data for this tutorial was scraped by Abdulshaheed Alqunber and is originally from (https://www.vgchartz.com/). We are using a sub sample of the data, and only look at video game sales for games making at least $100,000 USD in global sales.
The data is in the “comma-separated-format” or “.csv”. This is a very simple and common format for storing data.
RStudio can read data from your computer, or from the internet. Load the data with:
mydata <- read.csv("https://rtgodwin.com/data/vidsales.csv")
We have created a new object called mydata
, and have assigned it the
values in the .csv
file using the assignment operator <-
. You could
have chosen a name different than mydata
.
The mydata
object shows up in the top-right of your screen. The sample
size is 4706 and there are 8 variables.
Click the “spreadsheet” icon next to the data set to view it. Take a
moment to look through the data, and make sure you have an idea of what
each of the variables are. The Sales
variable is either the total
global sales of the game, or the sales from bundling with consoles, and
is measured in millions of US dollars. Score
is the critic score from
video game reviewers, and is on a scale of 0 to 10, with 10 being best.
Notice that each video game (each observation in the data set) takes up
a different row, while the type of information on the video game (the
variables) takes up a different column. Close the “data” tab when you
are done viewing.
Explore the data
Click the blue arrow next to data
in the top-right of your screen.
Each variable name is listed, along with some info. Notice that the variables each have a type:
Name | Type |
---|---|
chr | A character string (words) |
num | Any real (continuous) number. |
int | Any integer. |
The “character” variables can be used to create dummy variables (more on this later!)
To explore the data, we can calculate “summary” statistics for all the variables:
summary(mydata)
## Name Genre ESRB_Rating Platform
## Length:4706 Length:4706 Length:4706 Length:4706
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Publisher Score Sales Year
## Length:4706 Min. : 1.00 Min. : 0.010 Min. :1985
## Class :character 1st Qu.: 6.50 1st Qu.: 0.150 1st Qu.:2004
## Mode :character Median : 7.50 Median : 0.420 Median :2008
## Mean : 7.27 Mean : 1.177 Mean :2007
## 3rd Qu.: 8.30 3rd Qu.: 1.100 3rd Qu.:2010
## Max. :10.00 Max. :82.860 Max. :2020
The min, max, quartiles, and mean have been calculated. For example, the
sample mean Score
is 7.27.
We can extract variables from the data set, and perform functions on
them. To calculate the sample variance for Sales
:
var(mydata$Sales)
## [1] 7.607421
Important: when we type mydata$Sales
we are getting the Sales
variable from within the mydata
data set.
Now, calculate the correlation between Sales
and Score
:
cor(mydata$Sales, mydata$Score)
## [1] 0.2634555
What does this tell you?
Sub-sampling
We can use the subset()
function, together with logical operators, to
create sub-samples.
Nintendo is my favourite publisher. Let’s see their critic scores vs. the critic scores from other publishers:
mean(subset(mydata$Score, mydata$Publisher == "Nintendo"))
## [1] 7.799296
mean(subset(mydata$Score, mydata$Publisher != "Nintendo"))
## [1] 7.21722
Notice how we have used logical operators to create two sub-samples: Nintendo games, and other games.
On your own, determine the total global sales in the data set for the publisher “Rockstar Games”.
sum(subset(mydata$Sales, mydata$Publisher == "Rockstar Games"))
Now, find all the video games that have received a perfect score of 10.
subset(mydata, mydata$Score == 10)
For the rest of the tutorial, we’ll use a sub-sample of all video games that have sales 2 million USD or more:
vid2 <- subset(mydata, mydata$Sales >= 2)
The above line creates a new data set, by selecting only the rows from
mydata
which have Sales >= 2
. The new data set shows up in the
top-right panel. Check the sample size.
Hint for Assignment 1, Question 2. Create a subsample using the “or” |
operator.
To create a subsample that includes two (or more) video game publishers (let’s take “Nintendo” and “Atari” as an example), we can use the “or” operator |
:
vid3 <- subset(mydata, Publisher == "Nintendo" | Publisher == "Atari")
The vid3
subset only contains only 2 different publishers!
Histograms and scatterplots
Visualization is important. Plot a histogram of critic scores:
hist(vid2$Score)
Many options are available to customize the histogram, see ?hist
.
Let’s add some labels, and control the number of “breakpoints”:
hist(vid2$Score,
main = "Histogram of video game critic scores",
xlab = "score",
breaks = 10)
The scatterplot is a widely used tool for visualizing the relationship
between two variables. Draw a scatterplot for Sales
and Score
,
adding a title and labeling the axis:
plot(vid2$Score, vid2$Sales,
main = "critic scores and video game sales",
xlab = "score", ylab = "Sales")
We can also change the color and style of the dots:
plot(vid2$Score, vid2$Sales,
col = 3,
pch = 16)
Type ?pch
to see the different styles, and Google to see the different
colours.
Create new variables
Let’s create a new variable that we’ll use to select the colour of each
data point. Make a new variable called G
inside of the vid2
dataset,
and use G
to control the colour of each data point. We begin by
setting all rows of the variable equal to 1, and then change the value
of G
based on the game’s Genre
. Copy and paste the code below:
vid2$G <- 1
vid2$G[vid2$Genre == "Action"] <- 2
vid2$G[vid2$Genre == "Sports"] <- 3
vid2$G[vid2$Genre == "Shooter"] <- 7
vid2$G[vid2$Genre == "Role-Playing"] <- 4
vid2$G[vid2$Genre == "Platform"] <- 5
vid2$G[vid2$Genre == "Racing"] <- 6
For example, if Genre
is equal to "Action"
, the variable G
gets a
value of 2. Look at this G
variable in the spreadsheet to understand
what has happened, Now, use this variable to determine the colour for
each data point:
plot(vid2$Score, vid2$Sales, col=vid2$G, pch=16,
main = "video game sales and scores by genre",
xlab = "score", ylab = "sales")
We need to add a legend to the plot to show what the colours mean:
plot(vid2$Score, vid2$Sales, col=vid2$G, pch=16)
legend("topleft",
legend = c("Action", "Sports", "Shooter", "Role-Playing", "Platform", "Racing", "Other"),
col=c(2, 3, 7, 4, 5, 6, 1), pch=16)
Import your work into Word
In the “plots” window, click “Export” and “Save as Image…”. This way, you can import graphics you create into Word, or another word processor.
Simple least-squares regression
What is the average increase in Sales
associated with an increase in
Score
? We can estimate a simple linear regression using the lm()
function (“lm” stands for “linear model”):
lm(Sales ~ Score, data = vid2)
## Call:
## lm(formula = Sales ~ Score, data = vid2)
##
## Coefficients:
## (Intercept) Score
## -2.1472 0.8912
The interpretation of the results is that an increase in critic score of 1 is associated with an average increase in sales of $0.89 million.
Save your script file
Make sure the top-left window is active, then click “File”, “Save As…”, and name your script file.
Clear your workspace
If you get too much junk floating around in R, you can clear the environment by clicking the “sweep” icon in the top-right, and start over.