Exploring Data Frames
Last updated on Feb 12, 2021
Estimated time 20 minutes
Overview
Questions
- How can I manipulate a data frame?
Objectives
- Use the dplyr package to manipulate data frames.
- Remove rows with NA values.
- Append two data frames.
- Understand what a factor is.
Adding columns and rows in data frames
We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. As such, if we want to add a new column, we can start by making a new vector.
R
age <- c(2, 3, 5)
cats
OUTPUT
coat weight likes_string
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
We can then add this as a column via:
R
cbind(cats, age)
OUTPUT
coat weight likes_string age
1 calico 2.1 1 2
2 black 5.0 0 3
3 tabby 3.2 1 5
Note that if we tried to add a vector of ages with a different number of entries than the number of rows in the data frame, it would fail:
R
age <- c(2, 3, 5, 12)
cbind(cats, age)
ERROR
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4
R
age <- c(2, 3)
cbind(cats, age)
ERROR
Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2
Why didn’t this work? Of course, R wants to see one element in our new column for every row in the table:
R
nrow(cats)
OUTPUT
[1] 3
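The length check described above can be made explicit before binding. The following is a minimal sketch, assuming a hand-built copy of cats (the data frame here is a reconstruction for illustration, not the file loaded earlier in the lesson). It also assigns the result back to cats, since cbind() returns a new data frame rather than changing the original.

```r
# Hypothetical reconstruction of the cats data frame shown above
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

age <- c(2, 3, 5)

# Only bind the new column when it has exactly one entry per row
if (length(age) == nrow(cats)) {
  # Assign the result: cbind() does not modify cats in place
  cats <- cbind(cats, age)
}

names(cats)
```

If the lengths differ, the condition is FALSE and cats is left unchanged instead of raising the error shown above.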
Factors
For an object containing the data type factor, each distinct value is called a level. In our case, the factor “coat” has 3 levels: “black”, “calico”, and “tabby”. R will only accept values that match one of the levels. If you add a value that is not an existing level, it becomes NA and R issues a warning.
When we try to add a row for a cat with a “tortoiseshell” coat, the warning tells us that “tortoiseshell” was not added to the coat factor, but 3.3 (a numeric), TRUE (a logical), and 9 (a numeric) were successfully added to weight, likes_string, and age, respectively, since those variables are not factors. To add a cat with a “tortoiseshell” coat successfully, first add “tortoiseshell” as a possible level of the factor.
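One way to add the level and then the row is sketched below. It assumes cats is rebuilt by hand with coat stored explicitly as a factor (in R 4.0 and later, data.frame() and read.csv() no longer create factors by default), and that the new row is appended with rbind() using the values quoted above.

```r
# Hypothetical reconstruction of cats, with coat stored as a factor
cats <- data.frame(
  coat = factor(c("calico", "black", "tabby")),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1),
  age = c(2, 3, 5)
)

# Register the new level first, so the value is accepted instead of becoming NA
levels(cats$coat) <- c(levels(cats$coat), "tortoiseshell")

# Append the new row; the unnamed list is matched to columns by position
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))

levels(cats$coat)
cats
```

An alternative is to convert the column with as.character() before adding the row, which gives up the fixed set of categories in exchange for accepting any value.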
Challenge
Let’s imagine that 1 cat year is equivalent to 7 human years.
Create a vector called human_age by multiplying cats$age by 7. Convert human_age to a factor. Convert human_age back to a numeric vector using the as.numeric() function. Now divide it by 7 to get the original ages back. Explain what happened.
INPUT
1. human_age <- cats$age * 7
2. human_age <- factor(human_age). as.factor(human_age) works just as well.
3. as.numeric(human_age) yields 1 2 3 4 4 because factors are stored as integers (here, 1:4), each of which is associated with a label (here, 28, 35, 56, and 63). Converting the factor to a numeric vector gives us the underlying integers, not the labels. If we want the original numbers, we need to convert human_age to a character vector (using as.character(human_age)) and then to a numeric vector (why does this work?). This comes up in real life when we accidentally include a character somewhere in a column of a .csv file that is supposed to contain only numbers, and forget to set stringsAsFactors=FALSE when we read in the data.
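The round trip can be tried on a standalone vector. This is a minimal sketch, assuming ages of 4, 5, 8, 9, and 9, which are hypothetical values chosen only because they reproduce the labels and integer codes quoted in the solution.

```r
# Hypothetical ages, chosen so the factor labels are 28, 35, 56 and 63
age <- c(4, 5, 8, 9, 9)

human_age <- age * 7
human_age <- factor(human_age)

# The underlying integer codes, not the labels
codes <- as.numeric(human_age)

# Go through character first to recover the labels as numbers
recovered <- as.numeric(as.character(human_age)) / 7

codes
recovered
```

Dividing codes by 7 would give fractions of the level positions, while recovered matches the original ages.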
Removing columns
We can also remove columns from our data frame. What if we want to remove the column “age”? We can remove it in two ways: by column number (its index) or by column name.
R
cats[, -4]
OUTPUT
coat weight likes_string
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
5 tortoiseshell 3.3 1
Notice the comma with nothing before it, indicating we want to keep all of the rows.
Alternatively, we can drop the column by using its name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of cats, and asks, “Does this element occur in the second argument?”
R
drop <- names(cats) %in% c("age")
cats[,!drop]
OUTPUT
coat weight likes_string
1 calico 2.1 1
2 black 5.0 0
3 tabby 3.2 1
5 tortoiseshell 3.3 1
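To see what %in% produces and to compare the two removal approaches, here is a minimal sketch on a hand-built copy of cats (a reconstruction for illustration, with age assumed to be the fourth column).

```r
# Hypothetical reconstruction of cats with an age column
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1),
  age = c(2, 3, 5)
)

# One logical value per column: TRUE only where the name is "age"
drop <- names(cats) %in% c("age")
drop

by_name <- cats[, !drop]   # keep every column that is not flagged
by_index <- cats[, -4]     # drop the fourth column by position

names(by_name)
names(by_index)
```

Dropping by name keeps working if the column order changes, whereas the numeric index silently removes whichever column happens to be fourth.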
We will cover subsetting with logical operators like %in% in more detail in the next episode. See the section Subsetting through other logical operations.
Key Points
- Use cbind() to add a new column to a data frame.
- Use rbind() to add a new row to a data frame.
- Remove rows from a data frame.
- Use na.omit() to remove rows from a data frame with NA values.
- Use levels() and as.character() to explore and manipulate factors.
- Use str(), summary(), nrow(), ncol(), dim(), colnames(), rownames(), head(), and typeof() to understand the structure of a data frame.
- Read in a csv file using read.csv().
- Understand what length() of a data frame represents.