An Introduction to the R Language
Yet another language?
R is oriented toward mathematics rather than general-purpose computation, and has many similarities with Matlab and Octave; for example, it is accessible to people without a computer science background. Moreover, it is open source.
The support for mathematical computations shows in the libraries included out of the box for managing distributions, estimations, and inference tests. R also has a simplified syntax for mathematical expressions (e.g. the ~ operator to specify a regression model).
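As a small illustration of that built-in statistical support, here is a sketch that relies only on functions from the stats package shipped with R. The measurements are made up for the example.

```r
# Density, cumulative probability and quantile of the standard normal
dnorm(0)        # height of the density at 0
pnorm(0)        # P(X <= 0), which is 0.5 by symmetry
qnorm(0.975)    # quantile commonly used for 95% confidence intervals

# A one-sample t-test on hypothetical measurements, testing mean == 5
measurements <- c(5.1, 4.9, 5.3, 5.0, 4.8, 5.2)
result <- t.test(measurements, mu = 5)
result$p.value  # a large p-value means no evidence against mean == 5
```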
R is still strong on other core concepts, unlike Matlab: there are possibilities for classes and objects (a la Clojure), anonymous functions and closures, and named parameters. It includes a whole environment, more than a language interpreter: some functions implement graphing capabilities (farewell gnuplot) and a shell with completion and history.
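A minimal sketch of closures, anonymous functions, and named parameters might look like this; make_counter is a hypothetical name chosen for the example.

```r
# A closure: make_counter() returns a function that remembers `count`
make_counter <- function(start = 0) {
  count <- start
  function() {
    count <<- count + 1   # <<- updates the variable in the enclosing scope
    count
  }
}

counter <- make_counter(start = 10)  # argument passed by name
counter()   # 11
counter()   # 12

# An anonymous function passed to sapply()
squares <- sapply(1:3, function(x) x^2)
squares     # 1 4 9
```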
The practical side of R is taken care of by bundled utilities for reading data from files and databases (MySQL, Oracle, JDBC, ODBC), and for saving results (e.g. the whole current workspace or single variables).
Like Matlab and Octave, R can be used for quick prototyping; after an algorithm is implemented and validated, it can be translated into C or other languages. The reasons for the translation include better performance and the portability of the code across different machines, operating systems, or programmers.
The installation process depends on your system, but four Linux distributions are supported via repositories (with RPM and Deb packages).
People from other domains (like statistics) install R every day, so it's not like compiling the kernel or hunting for missing libraries. Everything is already compiled and automated, which is a plus compared with other niche languages that make you download different groups of JARs just because it is assumed a programmer can handle it.
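For files, a round trip could be sketched as follows. The data frame and the temporary paths are invented for the example; database access is left out because it needs a running server.

```r
# Write a small data frame to a temporary CSV file and read it back
samples <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))
csv_path <- tempfile(fileext = ".csv")
write.csv(samples, csv_path, row.names = FALSE)
restored <- read.csv(csv_path)

# Save a single variable to disk, remove it, and load it again
rdata_path <- tempfile(fileext = ".RData")
save(samples, file = rdata_path)
rm(samples)
load(rdata_path)   # `samples` is back in the workspace

# save.image(file = ...) would store the whole current workspace instead
```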
R basic syntax and data structures
Before beginning, you must know that R's assignment operator is <- and not =. = will work in many cases, but <- is more general as it can be used anywhere; it is the real equivalent of the assignment operator you have used in C-like languages.
> if (a = 4) 1
Error: unexpected '=' in "if (a ="
> if (a <- 4) 1
[1] 1
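Another place where the two operators differ is inside a function call. The sketch below uses mean() only as a convenient example.

```r
# Inside a call, = only names the argument x of mean();
# no variable called x is created in the workspace
mean(x = c(1, 2, 3))

# <- performs a real assignment even inside a call:
# y is created first, then passed to mean()
mean(y <- c(1, 2, 3))
y
```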
Moreover, as you can already see from the code above, there is no need for semicolons.
Numbers in R are either numeric (which actually means double-precision floating point) or integer.
> answer <- 42
> class(42)
[1] "numeric"
> answers <- c(42, 43)
> class(answers)
[1] "numeric"
> answers <- 42:43
> class(answers)
[1] "integer"
> class(as.integer(42))
[1] "integer"
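A few more calls make the distinction between the two number types easier to see; this is only a sketch using base functions.

```r
typeof(42)          # "double"
typeof(42L)         # "integer": the L suffix marks an integer literal
is.integer(42:43)   # TRUE
class(42L + 0.5)    # "numeric": mixing integers and doubles yields doubles
5L / 2L             # 2.5, since / always performs real division
5L %/% 2L           # 2, integer division
```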
Booleans are represented with the instances TRUE and FALSE:
> if (TRUE) 42
[1] 42
Strings are also a first-class type, with easier handling than with C libraries:
> message <- "hello"
> message
[1] "hello"
Basically, every R variable is a vector, again similarly to the case of Matlab/Octave; even scalars are just vectors of length 1. Vectors are created with the concatenation function c():
> my_vector <- c(1, 2, 3)
> my_vector
[1] 1 2 3
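Because everything is a vector, arithmetic and indexing work element by element. A short sketch:

```r
my_vector <- c(1, 2, 3)
length(42)                  # 1: a scalar is a vector of length 1
my_vector * 2               # element-wise: 2 4 6
my_vector + c(10, 20, 30)   # 11 22 33
my_vector[2]                # indexing starts at 1, so this is 2
my_vector[my_vector > 1]    # logical indexing: 2 3
```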
Lists, however, can store values of any type, while vectors must be homogeneous:
> list(42, "a")
[[1]]
[1] 42

[[2]]
[1] "a"

> c(42, "a")
[1] "42" "a"
Moreover, both lists and vectors can act as maps, since their keys can be strings.
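Here is a sketch of that map-like usage; the names and values are hypothetical.

```r
# A named vector: all values share one type
ages <- c(alice = 30, bob = 25)
ages["bob"]
names(ages)

# A named list: values can have different types
config <- list(host = "localhost", port = 8080)
config$port
config[["host"]]
config[["timeout"]] <- 30   # adding a new key
names(config)
```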
Matrices and data frames are more complex structures. Matrices are the evolution of vectors in 2 dimensions, while data frames are similar structures that can contain values of different types. The difference between them is the same as for vectors and lists.
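The two structures can be sketched as follows; the people data frame is invented for the example.

```r
# A matrix holds values of a single type and is filled column by column
m <- matrix(1:6, nrow = 2)
dim(m)      # 2 3
m[2, 3]     # row 2, column 3: 6

# A data frame holds one type per column, like a list of equal-length vectors
people <- data.frame(name = c("Ann", "Bob"), age = c(31, 27))
people$age
people[people$age > 30, "name"]   # names of the rows where age exceeds 30
str(people)                       # compact summary of the column types
```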
There are many more types, and useful functions bundled with the interpreter. If you come across some unknown calls, type ?entity at the prompt to load the corresponding man page; entity can be a function or a type name.
A quick example: linear regression
R can seamlessly perform linear regression, a staple problem in statistics and machine learning. The operation consists of finding the parameters of a linear combination that best fits several samples of input and output variables. In our case, we want to find the parameters q and m in the model correlated_data = q + m * data.
> data <- c(1, 2, 3, 4)
> data
[1] 1 2 3 4
> mean(data)
[1] 2.5
> var(data)
[1] 1.666667
> sd(data)
[1] 1.290994
> correlated_data <- c(2, 4, 7, 7.5)
> fm <- lm(correlated_data ~ data)
> fm

Call:
lm(formula = correlated_data ~ data)

Coefficients:
(Intercept)         data
       0.25         1.95

> attributes(fm)
$names
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

$class
[1] "lm"

> fm$coefficients
(Intercept)        data
       0.25        1.95
> fm$coefficients['data']
data
1.95
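To check the fitted values, q and m can also be computed from the closed-form least-squares formulas for a single predictor. This is a sketch for verification only; the new input value 5 is hypothetical.

```r
data <- c(1, 2, 3, 4)
correlated_data <- c(2, 4, 7, 7.5)
fm <- lm(correlated_data ~ data)

# Closed-form estimates: slope = cov(x, y) / var(x),
# intercept = mean(y) - slope * mean(x)
m <- cov(data, correlated_data) / var(data)
q <- mean(correlated_data) - m * mean(data)
c(q = q, m = m)   # 0.25 and 1.95, matching the coefficients of fm
coef(fm)          # accessor equivalent to fm$coefficients

# Predict the output for a new input value
predict(fm, newdata = data.frame(data = 5))
```

With q = 0.25 and m = 1.95, the prediction for an input of 5 is 0.25 + 1.95 * 5 = 10.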