R has many different data structures for different scenarios.
Lists
Lists are vectors whose elements can be any type of object. They are created using the list() function.
> x <- list(1, "two", c(3, 4))
In this example, we’ve defined x as a list consisting of three elements: the number 1, the string "two", and a vector, 3 4, of length 2. We can examine the structure of x using the str() function.
> str(x)
List of 3
$ : num 1
$ : chr "two"
$ : num [1:2] 3 4
Remember that each element of the list is a vector; 1 is a numeric vector of length 1, and "two" is a character vector of length 1.
One particularly interesting ability of a list is that it can contain other lists. Had we defined x as x <- list(1, "two", list(3, 4)), the str() function would have returned:
> str(x)
List of 3
$ : num 1
$ : chr "two"
$ :List of 2
..$ : num 3
..$ : num 4
This means that a list is a recursive object (you can test this with the is.recursive() function). In principle, lists can be nested to any depth.
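As a small sketch of what "recursive" means in practice, the snippet below rebuilds the nested list from the text and reaches into the inner list. The variable names (list_is_recursive, vector_is_recursive, inner_value) are only illustrative.

```r
# Same nested list as in the text
x <- list(1, "two", list(3, 4))

# Lists are recursive; atomic vectors are not
list_is_recursive <- is.recursive(x)           # TRUE
vector_is_recursive <- is.recursive(c(3, 4))   # FALSE

# [[ ]] extracts a single element, so it can be chained
# to reach a value stored inside the inner list
inner_value <- x[[3]][[2]]                     # 4
```

Chaining [[ ]] works one level at a time, so a deeper nesting just needs one more [[ ]] per level.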
Factors
A factor is a vector that stores categorical data, that is, data that can be classified into a finite number of categories. These categories are known as the levels of a factor.
Say you define x as a collection of the strings "a", "b", and "c": x <- c("b", "a", "b", "c", "a", "a").
Using the factor() function, you can have R convert the atomic character vector into a factor. R will automatically attempt to determine the levels of the factor; this will produce an error when factor() is given an argument that is non-atomic. Let’s take a look at the factor here:
> x <- c("b", "a", "b", "c", "a", "a")
> x <- factor(x)
> # this can also be written as x <- factor(c("b", "a", "b", "c", "a", "a"))
> x
[1] b a b c a a
Levels: a b c
> str(x)
Factor w/ 3 levels "a","b","c": 2 1 2 3 1 1
> levels(x)
[1] "a" "b" "c"
> table(x)
x
a b c
3 2 1
By using the factor() function on x, R categorized the values into “levels.” When x was printed, R returned the elements in their original order, but it also printed the levels of the factor. Examining the structure of x shows that x is a factor with three levels, lists the levels (alphabetically), and then shows which level each element of the factor corresponds to. So here, since "b" is alphabetically second, the 2 in 2 1 2 3 1 1 corresponds with "b".
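To make the link between the integer codes and the levels concrete, here is a minimal sketch that pulls the codes out with as.integer() and maps them back to labels. The names codes and labels are illustrative only.

```r
x <- factor(c("b", "a", "b", "c", "a", "a"))

# The integer codes that str(x) displays
codes <- as.integer(x)        # 2 1 2 3 1 1

# Each code is a position in levels(x), so indexing the levels
# by the codes recovers the original strings
labels <- levels(x)[codes]    # "b" "a" "b" "c" "a" "a"
```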
The levels() function returns a vector containing only the names of the different levels of the factor. So here, levels(x) returns the three levels "a", "b", and "c", in order (here from the lowest level to the highest).
The table() function gives a table summarizing the factor. Using table() on x returned the name of the variable, the levels of x, and then, underneath each level, the number of values in x that belong to it. So this table shows us that, in the factor x, there are three instances of the level "a", two instances of "b", and one instance of "c".
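The counts in a table can also be used programmatically rather than just read off the screen. The following sketch, with the illustrative names counts, count_a, and count_names, looks up a single count by level name.

```r
x <- factor(c("b", "a", "b", "c", "a", "a"))
counts <- table(x)

# Look up the count for one level by its name
count_a <- counts[["a"]]       # 3

# The names of the table entries are the factor levels
count_names <- names(counts)   # "a" "b" "c"
```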
If the levels of your factor need to be in a particular order, you can use the levels argument of factor() to define the order, and set the argument ordered to TRUE:
> x <- c("b", "a", "b", "c", "a", "a")
> x <- factor(x, levels = c("c", "b", "a"), ordered = TRUE)
> x
[1] b a b c a a
Levels: c < b < a
> str(x)
Ord.factor w/ 3 levels "c"<"b"<"a": 2 3 2 1 3 3
> levels(x)
[1] "c" "b" "a"
> table(x)
x
c b a
1 2 3
Now R returned the levels in the order specified by the vector given to the levels argument. The < (less than) symbol in the output of x and str(x) indicates that these levels are ordered, and str(x) reports that the object is an ordered factor.
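One practical consequence of an ordered factor is that comparisons follow the level order rather than the alphabet. This is a small sketch using the same ordering as above (c < b < a); the names is_ord, below_a, and highest are illustrative.

```r
x <- factor(c("b", "a", "b", "c", "a", "a"),
            levels = c("c", "b", "a"), ordered = TRUE)

is_ord <- is.ordered(x)           # TRUE

# Comparisons use the level order c < b < a, not alphabetical order
below_a <- x < "a"                # TRUE FALSE TRUE TRUE FALSE FALSE

# max() returns the highest level that occurs in x
highest <- as.character(max(x))   # "a"
```

On an unordered factor, the same comparison is not meaningful and R returns NA with a warning.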
Matrixes
A matrix is, in most cases, a two-dimensional atomic data structure (though a matrix may have only a single row or column, and a non-atomic matrix can be made from a list). To create a matrix, you can use the matrix() function on a vector with the nrow and/or ncol arguments. matrix(1:20, nrow = 5) will produce a matrix with five rows and four columns containing the numbers one through twenty. matrix(1:20, ncol = 4) produces the same matrix.
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
The matrix will fill by column unless the argument byrow is set to TRUE.
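The difference between the two fill orders is easiest to see side by side. The sketch below builds the same 5-by-4 matrix both ways; by_col and by_row are illustrative names.

```r
# Default: values fill down each column in turn
by_col <- matrix(1:20, nrow = 5)

# byrow = TRUE: values fill across each row in turn
by_row <- matrix(1:20, nrow = 5, byrow = TRUE)

first_row_by_col <- by_col[1, ]   # 1 6 11 16
first_row_by_row <- by_row[1, ]   # 1 2 3 4
```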
Note that R labels the rows and columns of the printed matrix with their position indexes. Since a matrix is two-dimensional, each cell is identified by a row index and a column index. If the matrix is stored in x (for example, x <- matrix(1:20, nrow = 5)), you can use an index vector in [] to return the value of an individual cell of the matrix. x[1,2] will return the value in row 1, column 2: 6. You can also use the index vector to return the values of whole rows or columns. x[1,] will return 1 6 11 16, the elements of the first row of the matrix.
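The indexing forms described above can be tried directly. This sketch assumes the matrix has been stored in x; the drop = FALSE variant is an addition not mentioned in the text, shown because selecting a single row otherwise returns a plain vector.

```r
x <- matrix(1:20, nrow = 5)

cell <- x[1, 2]            # 6: row 1, column 2
first_row <- x[1, ]        # 1 6 11 16
second_col <- x[, 2]       # 6 7 8 9 10

# By default a single row or column is returned as a plain vector;
# drop = FALSE keeps the result as a (1 x 4) matrix
row_as_matrix <- x[1, , drop = FALSE]
```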
You can also create a matrix by assigning dimensions to a vector using the dim() function, as shown here:
x <- 1:20
dim(x) <- c(5, 4)
This created the same matrix you saw earlier. With the dim() function, you can also redefine the dimensions of a matrix. dim(x) <- c(4,5) will “redraw” the matrix to have four rows and five columns.
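Redefining the dimensions does not move any data; it only changes how the same 20 values are laid out column by column. A short sketch (before and after are illustrative names) shows the same index landing on a different value:

```r
x <- 1:20
dim(x) <- c(5, 4)
before <- x[1, 2]   # 6: the second column starts at the 6th value

dim(x) <- c(4, 5)
after <- x[1, 2]    # 5: columns now hold four values each
```

The product of the new dimensions has to equal the length of the vector, otherwise R raises an error.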
Arrays
What happens if the vector you assign with dim() has more than two elements? If we had written dim(x) <- c(5, 2, 2) we would have created another data structure: an array.
Technically, a matrix is specifically a two-dimensional array, but arrays can have any number of dimensions. With x containing 20 elements (x <- 1:20), executing dim(x) <- c(5, 2, 2) would have given x three dimensions. R would represent this as a “series” of matrixes:
> x
, , 1

     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

, , 2

     [,1] [,2]
[1,]   11   16
[2,]   12   17
[3,]   13   18
[4,]   14   19
[5,]   15   20
In the case of an array, the “row” and “column” numbers remain in the same order, and R will show the other dimensions above each matrix. In this case, we received two matrixes (based on the third dimension given) of five rows (based on the first dimension given) and two columns (based on the second dimension given). R prints the matrixes with the earlier dimensions varying fastest, so if we had an array of four dimensions (say 5, 2, 2, 2), it would print matrixes , , 1, 1, then , , 2, 1, then , , 1, 2, and lastly , , 2, 2.
Again, you can use index vectors to find a particular element, or particular elements, of the array. In our three-dimensional array shown earlier, x[1, 2, 2] will return 16. You can see by the way R has printed the array that rows come before the first comma, columns come after the first comma, and the third dimension of the array comes after the second comma.
Data Frames
A data frame is a (generally) two-dimensional structure consisting of vectors of the same length. Data frames are used often, as they are the closest data structure in R to a spreadsheet or a relational database table. You can use the data.frame() function to create a data frame.
> x <- data.frame(y = 1:3, z = c("one", "two", "three"), stringsAsFactors = FALSE)
> x
y z
1 1 one
2 2 two
3 3 three
In this example, we have created a data frame with two columns and three rows. Using y = and z = defines the names of the columns, which will make them easier to access, manipulate, and analyze. Here, we’ve used the argument stringsAsFactors = FALSE to make column z an atomic character vector instead of a factor. In versions of R before 4.0.0, data.frame() converts character vectors into factors by default; from R 4.0.0 onward the default is FALSE, so the argument only matters on older versions.
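Once the columns are named, they can be pulled out by name. This sketch recreates the data frame from the example and shows a few common ways of accessing it; z_column, one_cell, n_rows, and n_cols are illustrative names.

```r
x <- data.frame(y = 1:3, z = c("one", "two", "three"),
                stringsAsFactors = FALSE)

# A column extracted with $ is an ordinary vector
z_column <- x$z           # "one" "two" "three"

# Rows and columns can be indexed like a matrix, using names or positions
one_cell <- x[2, "z"]     # "two"

n_rows <- nrow(x)         # 3
n_cols <- ncol(x)         # 2
```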
You can use the names() function to change the names of your columns. names(x) <- c("a", "b") provides a vector of new values to replace the column names, changing the columns to a and b. To change only certain columns, you can use an index vector to specify which column(s) to rename.
> names(x)[1] <- "a"
> x
a z
1 1 one
2 2 two
3 3 three
You can combine data frames with the cbind() function or the rbind() function. cbind() will add the columns of one data frame to another, as long as the frames have the same number of rows.
> cbind(x, data.frame(b = c("I", "II", "III"), stringsAsFactors = FALSE))
a z b
1 1 one I
2 2 two II
3 3 three III
rbind() will add the rows of one data frame to the rows of another, so long as the frames have the same number of columns and the same column names.
> rbind(x, data.frame(a = 4, z = "four"))
a z
1 1 one
2 2 two
3 3 three
4 4 four
cbind() and rbind() will also coerce vectors and matrixes of the proper lengths into a data frame, so long as one of the arguments of the bind function is a data frame. We could have used rbind(x, c(4, "four")) to take the data frame x we defined earlier and coerce the vector c(4, "four") to fit into the existing data frame. But coercion can affect the way your data frame stores your data. In this case, the vector c(4, "four") would have coerced the number 4 into the character string "4". Then the data frame would have coerced the entire first column into a character vector. For this reason, it is safer to use rbind() and cbind() to bind data frames with each other.
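The coercion described above can be checked by comparing the column types after each kind of bind. This is a minimal sketch that rebuilds the renamed data frame from the earlier examples; coerced and safe are illustrative names.

```r
x <- data.frame(a = 1:3, z = c("one", "two", "three"),
                stringsAsFactors = FALSE)

# Binding a bare vector: c(4, "four") is already a character vector,
# so column a ends up stored as character
coerced <- rbind(x, c(4, "four"))
coerced_class <- class(coerced$a)   # "character"

# Binding a data frame keeps each column's own type
safe <- rbind(x, data.frame(a = 4, z = "four", stringsAsFactors = FALSE))
safe_is_numeric <- is.numeric(safe$a)   # TRUE
```

Checking the column classes after a bind is a quick way to catch this kind of silent conversion.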