More tips'n tRicks

Loads of great links with top tips: Paul E. Johnson's Rtips. Revival 2012! (also available as a PDF). More to come...

tipsntRicks on github. Moving this page to a github Jekyll-bootstrap page and a package.

TeachinMaterial repository with R teaching material.

Best pRactices

Doing the R things. Posts about code management, object-oriented programming and R development.

Running an R package's code. Use the example function: example(myFunction). For a vignette in Sweave format, the Stangle function (found in the utils package in current versions of R) will create an R script that can be sourced. [credit: Bioc mailing list]
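A minimal sketch of both approaches. The toy vignette and the variable name vignetteResult are made up for illustration, and Stangle is called as utils::Stangle, which is where current R versions provide it:

```r
## Run the examples from a help page (here base R's mean).
example("mean", ask = FALSE)

## Toy Sweave vignette, written to a temporary file for illustration only.
rnw <- tempfile(fileext = ".Rnw")
writeLines(c("\\documentclass{article}",
             "\\begin{document}",
             "<<toy-chunk>>=",
             "vignetteResult <- sum(1:10)",
             "@",
             "\\end{document}"),
           rnw)

## Extract the code chunks into an R script, then source that script.
script <- tempfile(fileext = ".R")
utils::Stangle(rnw, output = script, quiet = TRUE)
source(script)
vignetteResult
```

For an installed package, the vignette source files can usually be located with system.file("doc", package = "somePackage"), where somePackage is a placeholder.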

Getting help. Finding information with help.search("keyword"), apropos("keyword"), RSiteSearch("keyword") and library(sos); info <- ???keyword.
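A short sketch of the local search functions. The search terms are arbitrary examples, and the two calls that need network access or the sos package are left commented out:

```r
## Names of objects on the search path matching a regular expression.
readers <- apropos("^read\\.")
head(readers)

## Full-text search of the installed help pages (here restricted to stats).
hits <- help.search("regression", package = "stats")
class(hits)

## These need network access, and sos must be installed:
## RSiteSearch("regression")
## library(sos); info <- ???regression
```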

Graphics

Misc plotting. Plotting is one of R's strengths, and I occasionally come across nice examples that I would like to keep track of. These will be listed on this page.

grid and base integration. Some things are so simple with base graphics, while grid (and others) are so much better for others. Why choose, when one can get both together? Several solutions: the gridBase package by Paul Murrell himself, the uniPlot package (webpage, old CRAN page and user2011 slides) and a post with an example.

Sparklines can be created with the YaleToolKit and the sparkTable (user2011 slides) packages.

Animations. Have a go with the animation package. Here is also the user2011 presentation about the animatoR package, although I can't find the package itself.

Clustergrams (Schonlau, 2002) examine how cluster members are assigned to clusters as the number of clusters increases. Tal Galili posted notes, examples and R code about clustergrams.

Cookbook for R has a nice chapter about ggplot2 graphics. I like the multiplot function in the Multiple graphs on one page section particularly.

R – GUIs and interactivity

Creating R GUIs. The gWidgets package, a toolkit-independent API for building interactive GUIs. Nice illustrations: Creating GUIs in R with gWidgets by Richie Cotton and Demonstrating the Power of F Test with gWidgets by Yihui Xie.
Using R/tcltk, see R TclTk Examples by James Wettenhall and these updated ones by Philippe Grosjean, the package documentation and the 'tcl/tk return problem' thread (here and here) on the R mailing list.
Interfaces to the Qt framework from R. I remember seeing a very nice demonstration at the Bioconductor Dev day in 2009, in Seattle. qtbase and friends is not available from Bioconductor. Also, a thread about Link between Qt GUI and R.
Richie Cotton gave a jolly talk at the useR2011 meeting about Easy interactive ggplots. The aforementioned qt/R interaction is also a great combination for interactivity.
The traitr package is an interface for creating GUIs modeled in part after the traits UI module for python, based on the MVC design pattern.
Programming Graphical User Interfaces with R by John Verzani and Michael Lawrence is planned for mid 2012. Looking forward to it!

Interactive graphics. iPlots and the new implementation Acinonyx, aka iPlot eXtreme. Just awesome!

Data editors. Sometimes, it is just more convenient to visualise a big data frame or matrix in a graphical viewer. There is the rather archaic-looking edit. But this r-sig-gui thread provides many nice alternatives.

Bioconductor

Remove probes by subsetting CEL files. See this and this post from the Bioc mailing list for the removal of probes in CEL files. This post points to the effect of this on some downstream analysis (here GCRMA normalisation). Note that affxparser most certainly also has similar capabilities.

About limma. Methods for combining probes and contrasts in the multiple testing strategy:
global treats the entire matrix of t-statistics as a single vector of unrelated tests. In other words, all the contrasts are considered to be independent, and the p-values are adjusted as if you just had a bunch of independent t-tests.
separate is equivalent to using topTable separately for each coefficient in the linear model fit, and will give the same lists of probes if adjust.method is the same.
hierarchical adjusts down genes and then across contrasts.
nestedF adjusts down genes and then uses classifyTestsF to classify contrasts as significant or not for the selected genes.
The nestedF option is a bit more complicated. First, a bit of background. The F-statistic is used to determine if there are any differences between the samples, but it doesn't tell you which sample(s) are different. You have to fit contrasts to find out which sample(s) are different. So the idea with the nestedF is to adjust the p-values associated with the F-test to find which genes are differentially expressed in at least one sample. Now we have a list of genes that are differentially expressed, but we don't know for which sample(s) that may be true. The t-statistics associated with the contrasts are then inspected and the largest one (in absolute value) is considered significant. Now, there may be other contrasts that are significant as well, so the largest t-statistic is set to the same absolute value as the second largest t-statistic, and the F-statistic is calculated again. If the F-statistic is still significant, the second largest contrast is considered significant. This procedure is continued until the F-statistic is no longer significant.
The basic reasoning here is that the largest t-statistic for a set of contrasts is significant if the overall F-statistic is significant. By following this step-wise procedure, we can determine which contrasts are contributing to the overall significance of the F-statistic.
[credit: James W. MacDonald, Gordon K Smyth and limma help page]
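The difference between the separate and global strategies can be illustrated without limma, using base R's p.adjust on a made-up matrix of p-values. This is only a sketch of the idea, not limma's own code. The matrix p and its dimension names are hypothetical.

```r
set.seed(1)
## Hypothetical p-values: 6 probes (rows) by 3 contrasts (columns).
p <- matrix(runif(18)^3, nrow = 6,
            dimnames = list(paste0("probe", 1:6), paste0("contrast", 1:3)))

## "separate": each contrast (column) is adjusted on its own.
pSeparate <- apply(p, 2, p.adjust, method = "BH")

## "global": the whole matrix is adjusted as one long vector of tests.
pGlobal <- matrix(p.adjust(as.vector(p), method = "BH"),
                  nrow = nrow(p), dimnames = dimnames(p))

## Compare how many tests pass a 5% threshold under each strategy.
colSums(pSeparate < 0.05)
colSums(pGlobal < 0.05)
```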

Strings as variable names

It is sometimes useful to store in a variable a string that is itself a command or a variable name. First, have a look at Hadley Wickham's Computing on the language section.

> foo <- "bar"
> foo
[1] "bar"
> as.name(foo)
bar
> string <- "1:10"
> string
[1] "1:10"
> parse(text=string)
expression(1:10)
> eval(parse(text=string))
 [1]  1  2  3  4  5  6  7  8  9 10

And with assign and get:

> varName1 <- "varName2"
> varName1
[1] "varName2"
> assign(varName1,"123")
> varName1
[1] "varName2"
> get(varName1)
[1] "123"
> varName2
[1] "123"
      

From stackoverflow by Joris Meys, using substitute and deparse

> test <- function(x){
       y <- deparse(substitute(x))
       print(y)
       print(x)
 }
> var <- c("one","two","three")
> test(var)
[1] "var"
[1] "one"   "two"   "three"
      

Here is a nice utilisation of deparse, parse and substitute to implement C's ternary operator by SO user kohske.
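To give a flavour of the idea, here is a much simplified sketch. It is not kohske's code: it uses only substitute (no deparse or parse), and the function name ternary is made up. The two alternatives are written as a : b and captured unevaluated, so only the chosen one is ever evaluated.

```r
## Hypothetical helper: ternary(cond, a : b) returns a if cond is TRUE, else b.
ternary <- function(cond, alternatives) {
  alt <- substitute(alternatives)            # the unevaluated call `a : b`
  stopifnot(is.call(alt), identical(alt[[1]], as.name(":")))
  chosen <- if (isTRUE(cond)) alt[[2]] else alt[[3]]
  eval(chosen, parent.frame())               # evaluate only the chosen branch
}

ternary(3 > 2, "yes" : "no")
ternary(3 < 2, "yes" : "no")
```

Because `:` binds more tightly than the arithmetic operators, alternatives such as x + 1 : y would need parentheses, (x + 1) : y, for this sketch to work.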

Cross-talking

Interfacing R With Other Languages. A general approach consists of using the system() function and feeding it shell, perl -e, ruby -e, ... commands. See this page for a bit more elaboration on this technique.
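A minimal sketch of this technique, assuming a Unix-like system where echo is available. The perl call is guarded because perl may not be installed, and the commands themselves are arbitrary examples.

```r
## Run a shell command; intern = TRUE returns its output as a character vector.
shellOut <- system("echo hello from the shell", intern = TRUE)
shellOut

## Same idea with another interpreter, only if it is found on the PATH.
if (nzchar(Sys.which("perl"))) {
  perlOut <- system("perl -le 'print 6 * 7'", intern = TRUE)
  print(as.numeric(perlOut))
}

## system2() takes the command and its arguments separately.
echoOut <- system2("echo", args = c("a", "b"), stdout = TRUE)
```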

HPC and parallel under R

The CRAN Task View on High-Performance and Parallel Computing with R.

How to optimise an R script for 'multicore' on Cross Validated.

R scripting

#!/usr/bin/R. Another interesting thread on today's R mailing list about executable R scripts. The thread was initiated by Jason E. Aten, who shared a custom shell script that wraps up R '#!'-like scripting. Steve Lianoglou replied and pointed to Rscript and littler for scripting capabilities and the getopt and optparse libraries for argument parsing.
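As an illustration, a minimal executable script based on Rscript might look like the sketch below. The file name sum-n.R and the helper parseN are made up, and only base R's commandArgs is used for argument handling. For anything more elaborate, getopt or optparse would be the better choice.

```r
#!/usr/bin/env Rscript
## Hypothetical script: save as sum-n.R, make it executable with chmod +x,
## then run it as ./sum-n.R 10

## Turn the command-line arguments into a single integer (default 5).
parseN <- function(args, default = 5L) {
  if (length(args) == 0L) return(default)
  n <- suppressWarnings(as.integer(args[1L]))
  if (is.na(n)) default else n
}

n <- parseN(commandArgs(trailingOnly = TRUE))
cat("Sum of 1 to", n, "is", sum(seq_len(n)), "\n")
```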

apply et al.

apply and plyr reference. The apply functions are very handy. Here is a list of useful references that describe these functions: in the R News 2008-1 Help Desk, there is an article by Uwe Ligges and John Fox, how can I avoid this loop and make it faster, that introduces vectorization, and there is a very nice brief introduction to 'apply' in R by Neil Saunders that also illustrates the replicate and by functions.
See also the sweep function, which sweeps out a summary statistic from an array, as well as the scale function.
There is also Hadley Wickham's much appreciated plyr package and its *ply functions. There is, as always, plenty of documentation on the package's page.
A note about the apply functions and speed: although using apply has benefits in terms of readability (at least for R programmers), it is not true that it auto-magically vectorises the computation and is faster than a for loop -- see this short snippet or this post. I must say that I was a bit surprised by the overhead of the apply version. As specified by Prof. Brian Ripley on the R mailing list, apply() is just a wrapper for a for loop. Below is the for loop code of the apply function:

for (i in 1L:d2) {
  tmp <- FUN(newX[,i], ...)
  if(!is.null(tmp)) ans[[i]] <- tmp
}
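A small sketch on a made-up matrix shows that apply, an explicit loop and a truly vectorised function give the same answer, and how sweep relates to scale. Timings are not shown since they vary by machine. Wrapping each variant in system.time() on a larger matrix is an easy way to compare them.

```r
set.seed(42)
m <- matrix(rnorm(20), nrow = 5)   # toy data: 5 rows, 4 columns

## Column means three ways: apply, an explicit for loop, and colMeans.
viaApply <- apply(m, 2, mean)
viaLoop <- numeric(ncol(m))
for (i in seq_len(ncol(m))) viaLoop[i] <- mean(m[, i])
viaVectorised <- colMeans(m)

## sweep() subtracts a summary statistic from each column (default FUN is "-"),
## which matches scale() with centring only.
centredSweep <- sweep(m, 2, colMeans(m))
centredScale <- scale(m, center = TRUE, scale = FALSE)
all.equal(centredSweep, centredScale, check.attributes = FALSE)
```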

Standards

R Coding Standards. When writing code, it's always good to be rigorous in your coding style. Changing convention makes things more difficult to read and understand. Here are some R styles (that pretty much overlap, fortunately) that are safe to follow:
Google R style
Bioconductor coding standards
R coding standards in the R Internals manual
R style guide by Hadley Wickham
R Coding Conventions by Henrik Bengtsson (or here in pdf).
Emacs ess (emacs speaks statistics) mode is a great R coding environment, and will also help with proper indentation. I also recently found this thread about R standards, and updated the above list accordingly. This thread also has a nice discussion about S3 vs. S4 classes.

A bit of HistoRy

R (and S) names. This is a short summary of a thread of the R mailing list. I found it interesting and thought that it might be helpful to keep track of it here.
From http://stat.bell-labs.com/S/ and http://cm.bell-labs.com/stat/doc/94.11.ps by Rick Becker: By July, 1976, we decided to name the system. Acronyms were in abundance at Bell Laboratories, so it seemed sure that we would come up with one for our system, but no one seemed to be able to agree with any one else's suggestion: Interactive SCS (ISCS), Statistical Computing System (too confusing), Statistical Analysis System (already taken), etc. In the end, we decided that all of these names contained an `S' and, with the C programming language as a precedent, decided to name the system `S'. (We dropped the quotes by 1979). [posted by Ben Bolker]
From the FAQ 2.12 Why is R named R? The name is partly based on the (first) names of the first two R authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of the Bell Labs language `S' (see What is S?). [posted by Ted Harding]

R administration

Cross building and R development on Windows. It happens that collaborators using Windows computers need to use a package of mine that is still in early development and not yet submitted to any distribution/build system (like CRAN, Bioconductor or r-forge). It is not straightforward to distribute such code without putting extra burden on users.
This post is meant to reference useful information on how to compile R and build packages on Windows or how to use Linux to create Windows binaries. I have never done the former, but will try to keep essential information up-to-date, as this would be one way to provide Windows binaries. I did the latter following the Building Microsoft Windows Versions of R and R packages under Intel Linux by Jun Yan and A. J. Rossini (PDF, associated Makefile), but this information is outdated now, so I will try to provide viable alternatives here.

R on Windows. The best reference to compile R on Windows is the relevant section in the R Installation and Administration manual - Installing R under Windows and Windows toolset appendix. The necessary tools and resources are available at http://www.murdoch-sutherland.com/Rtools/, maintained by Duncan Murdoch. Also, check out the src/gnuwin32/INSTALL and src/gnuwin32/README.packages in the R source distribution.
Here is also a short summary that lists the steps to build a package under Windows.

Building farm. The most convenient way might be to submit the sources to the http://win-builder.r-project.org site. However, the package apparently also needs to pass check, which might not be the case for a package in early development, where for instance no documentation has been written yet. This also works when dependencies are not on CRAN, but are available on Bioconductor. Not sure about R-forge, though.

Install sources from Windows. Windows users can also use install.packages(...,type="source"), which should work as long as no code compilation is required. I have not had much luck with this option.

Cross-building. As stated in the R-admin manual, support for cross-building was withdrawn at R 2.9.0. Two links provide some information: (1) success(?) with cross-compiling R package with R 2.9.1 on the R-devel mailing list and (2) Compiling and Cross-compiling R packages for Windows (win32) blog post by Vinh Nguyen. Both seem a bit out-dated and are not very clear to me, but I should try harder to figure them out. Any additional pointers are very welcome!

Misc stats

Local regression is a powerful and established technique. In R, there are two functions to do it, namely lowess and loess, that have pretty different interfaces. Here are a few posts that clarify these differences: [BioC] lowess vs. loess (by Gordon Smyth) and [S] lowess and loess function differences.
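The difference in interface is easy to see on the cars dataset that ships with R. In this sketch the smoothing parameters are simply the documented defaults written out explicitly. Since the two functions also differ in other defaults (for example local polynomial degree and robustness iterations), the fitted curves are not expected to be identical.

```r
## lowess: takes x and y vectors, returns a list with sorted x and fitted y.
lw <- lowess(cars$speed, cars$dist, f = 2/3)
str(lw)

## loess: formula interface, returns a model object to use with predict().
lo <- loess(dist ~ speed, data = cars, span = 0.75)
loFit <- predict(lo, newdata = data.frame(speed = lw$x))

## To compare the two fits visually:
## plot(cars); lines(lw, col = "blue"); lines(lw$x, loFit, col = "red")
```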

Other

Serialisation. These two posts on SO are elegant illustrations of loading/attaching symbols saved into an RData file and of how to create a lazy-load database.
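A minimal sketch of the loading and attaching part, using base R only. The object names and the temporary file are made up, and lazy-load databases are not covered here.

```r
## Save two toy objects into a temporary RData file, then remove them.
rdata <- tempfile(fileext = ".RData")
savedVec <- 1:3
savedStr <- "hello"
save(savedVec, savedStr, file = rdata)
rm(savedVec, savedStr)

## Load into a dedicated environment rather than the global workspace;
## load() returns the names of the restored objects.
e <- new.env()
loaded <- load(rdata, envir = e)
e$savedVec

## Or attach the file to the search path, use its symbols, then detach it.
attach(rdata, name = "savedObjects")
fromAttached <- savedStr
detach("savedObjects", character.only = TRUE)
```

Loading into a separate environment avoids silently overwriting objects of the same name in the workspace.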

How R Searches and Finds Stuff. Super article by Suraj Gupta about frames, environments, namespaces and more.