R: Getting Started

If you’re like me, you love data.

Ok, so it’s likely that you aren’t like me, at least in the data-loving regard, but you have to admit that (1) data can be beautiful, and (2) data is how we understand the world. And I really mean you have to admit that; to do otherwise would be quite deluded. Or, euphemistically, New Age-y.

Unfortunately, a lot of data is presented in boring, confusing, and/or misleading ways. In a vain attempt to enable you, the reader, to make data pretty, or, at the least, understandable, I want to strongly encourage a movement away from the crappy graphs generated by Excel and OpenOffice and toward the fancy ones made by R. This transition is not simple and so, as with my Python introduction, I’m going to get you started and then feed you to the sharks. I’ll continue to post little tidbits as time goes on, but there are enough tutorials around the Webs already, so I will not do anything in depth. Plus, I hardly understand it myself.

At this point, you may be wondering what the letter “R” has to do with data or graphs. Well, R is essentially a programming language for statistics. And you can make pretty graphs with it. Let me demonstrate:

# Assumes y already holds 100 random numbers and is plotted against 1:100.
plot(1:100, y,
col=1:100,pch=1:100,main="Random Numbers",
ylab="Some Random Numbers",xlab="As a Function of X")

Yields the following graph:

Okay, so yeah, that’s an ugly graph. I’m simply making a point. Imagine trying to make that plot, as ugly as it is, in Excel or OpenOffice. Huge pain in the ass. While that code I typed up there looks cryptic now, I promise that it’s actually quite simple. Here’s another example, where x, y, and z are sets of 100 random numbers:



Now that was simple. In fact, I could have made the previous plot with the command plot(1:100,y) and it would have looked exactly the same, except that the axes would be labeled differently. Of course, there is one problem with that 3d scatterplot: it’s awfully hard to figure out where 3d points are sitting on a 2d surface. I wonder if there’s a way to get around that in R…


Badass, right?

So, how do we get from not knowing what R is to making 3d plots? Sadly, the learning curve is a little steep unless you’re already familiar with using text to make your computer do stuff. Let’s get started.

First, if you are running Linux you may already have R installed. If not, find it in Synaptic, or join the Windows users in downloading it from CRAN. Then install it. As with Python, you can work with R in an interactive mode or by writing files for R to read and run. We’ll stick with interactive mode for now.

In Windows, launch R from your start menu (it should be in a folder called ‘R’). In Linux, just type ‘R’ in the console (it must be capitalized) followed by a vigorous smack of the Enter key. In either case, you’ll see something like this:

Note that in Windows there is also an R “GUI” (Graphical User Interface), which is probably what you will see. It’s pretty much the same damn thing, except that it looks nicer and there are some menu options at the top to help you do various tasks. There is probably one in Linux as well, though I’ve never gotten around to finding out.

Now you’re ready to start learning! Remember that Python has a prompt that looks like “>>>”. Well, R has the same thing, only with a single “>”. Basically because it’s only 1/3 as awesome. I say that because there is actually a way to make Python do everything that R can do. But I haven’t even started to learn those tools yet, so in the meantime we’re staying with R.

Anyway, R is basically a scripting language, meaning that you can do stuff with it that other programming languages can do. It can do for-loops, if-statements, and so on. We’ll touch on that later. For now, just know this:

You can make variables and assign values to them just as you’d expect. Type the following into the prompt (followed by the Enter key) and see what happens:
  1. r = 'awesome'
  2. x = 1:100
  3. mean(x)
  4. sd(x)
  5. r
  6. x
  7. paste( x, r )
  8. y = c(1,2,3,4,5)
  9. y
  10. rep( y, 20 )
  11. plot( x, rep(y,20))
  12. plot( x[11:90], rep(y,16))
  13. summary(lm(rep(y,20)~x))
Notice a few things. In statements 1, 2, and 8, we assigned values to the variables r, x, and y. This means that when you have R use those variables (they’re called objects), R will return whatever values you made them contain. If you do something in R that isn’t an assignment, it’s going to give you some kind of output. So if you assign the string 'awesome' to r (note: you can surround strings with " " or ' ') and then just type the expression in statement 5, R returns the value assigned. In statement 2, we assigned x a range of numbers. In R, two numbers with a colon in between (e.g. 1:100) means a list of those two numbers with every whole number in between (strictly speaking, R calls this a vector, and a “list” is a different kind of object, but I’ll keep saying “list” here). So when you call x (statement 6 above) you should see all the values from 1 to 100.
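
Here is a minimal sketch of the assign-then-print idea, written so it also works when run from a file instead of the prompt. It uses `<-`, the arrow form of assignment that most R code uses; it does the same thing as `=` here. The `print()`, `length()`, and `head()` calls are my additions for illustration and are not in the numbered statements above.

```r
# Assignment prints nothing; it just stores the value under a name.
r <- "awesome"   # same effect as r = 'awesome'
x <- 1:100       # the whole numbers from 1 to 100

# Typing a bare name at the prompt prints what it holds.
# In a script, print() does the same job explicitly.
print(r)
print(length(x))   # how many values the colon operator produced
print(head(x, 5))  # just the first five, to keep the output short
```
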
Another way of assigning lists to a variable is the c(x,y,z) notation (statement 8). This basically says “stick the elements in between the parentheses into one list.” You can combine strings, numbers, and even other lists this way, though R flattens nested lists into a single one and turns numbers into strings if you mix the two.
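
A quick sketch of how c() behaves when you hand it more than plain numbers. The variable names `both` and `mixed` are made up for this example.

```r
y <- c(1, 2, 3, 4, 5)

# c() applied to two existing collections joins them end to end
# rather than nesting one inside the other.
both <- c(y, c(6, 7))
print(both)
print(length(both))

# Mixing numbers and strings converts every element to a string.
mixed <- c(1, "two", 3)
print(mixed)
print(class(mixed))
```

So if numbers you built with c() suddenly refuse to add up, check whether a stray string snuck in.
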
I also showed a few basic functions. Statement 10 shows you how to make a list repeat itself. The syntax is rep( values , # ), where values can be a single item or a list, and # is the number of times you want it repeated.  paste() can be used to stick multiple items together, with an output that is a string (this can be useful for labeling graphs). plot(), as you’ve seen before, generates a pleasant, bare-bones graph of your data.
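
To see rep() and paste() on something smaller than 100 values, here is a short sketch. The names `repeated`, `label`, and `title` are just illustrative, and the `sep` argument is one I haven’t used above; it sets what paste() puts between the pieces (a single space if you leave it out).

```r
y <- c(1, 2, 3, 4, 5)

# rep() with a count repeats the whole collection that many times.
repeated <- rep(y, 3)
print(repeated)

# paste() works element by element, so one string and three numbers
# give three labels.
label <- paste("Sample", 1:3)
print(label)

# Handy for building a graph title out of a computed value.
title <- paste("n =", length(repeated))
print(title)

# sep changes the separator.
print(paste("a", "b", sep = "-"))
```
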

In statement 12, I showed that you can selectively choose any elements you want out of a list. Normally, x contains the values 1-100. If you only want the middle 80 values (i.e. not the first or last 10), you can have R return only those using the square-bracket notation ‘[ ]’. The numbers inside the brackets refer to the positions you want. So, if you typed x[1], R would return the first value in the list x (which is 1). Since 11:90 is a list of the numbers 11-90, the expression x[11:90] is going to return all items in x from the 11th through the 90th position.
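
The square-bracket idea in a few lines, as a sketch. The negative-index form at the end is not something I used in the statements above; it is an alternative that drops positions instead of keeping them. The names `middle` and `also_middle` are made up for the example.

```r
x <- 1:100

print(x[1])            # a single position
middle <- x[11:90]     # positions 11 through 90
print(length(middle))  # count of values kept
print(range(middle))   # smallest and largest value kept

# A negative index removes the listed positions: here the first
# ten and the last ten.
also_middle <- x[-c(1:10, 91:100)]
print(identical(middle, also_middle))
```
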

The last item gives a glimpse of the statistics that R can do (stats is the main reason people use R in the sciences). Here lm() fits a straight line (a linear regression) to rep(y,20) as a function of x, and summary() reports the details of that fit. The expression is more complicated than the others, but I put it there to show how short an R expression can be and still give you quite detailed and valuable information.
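
If the one-liner in statement 13 is too dense, it can be pulled apart into steps. This is only a sketch: the names `response`, `fit`, and `fit_summary` are mine, and coef() is an extra function I’m adding to show that you can ask for just the intercept and slope.

```r
x <- 1:100
y <- c(1, 2, 3, 4, 5)
response <- rep(y, 20)   # the 1-5 pattern repeated to match x's length

# Fit a straight line: response as a function of x.
fit <- lm(response ~ x)

# coef() pulls out just the intercept and slope.
print(coef(fit))

# summary() adds standard errors, p-values, and R-squared.
fit_summary <- summary(fit)
print(fit_summary$r.squared)
```

Storing the fit in a variable first means you can poke at it afterward instead of rereading one big printout.
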

Alright, now dive in! As with Python, I’ll continue to post more useful things on using R, though I will refrain from doing a thorough tutorial. I’ll list some useful tutorials in a later post. Good luck!