Reproducible Data Analysis

1 Introduction

What follows is, as always, "work in progress", so please be patient and let me know if you find errors or have suggestions. I will try to keep links here to reproducible documents related to what my collaborators and I are doing.

2 Background

We have recently submitted a manuscript to The Journal of Physiology (Paris) in which we advocate the reproducible data analysis approach, more commonly dubbed reproducible research. A few files are associated with the manuscript; they illustrate the implementation of the idea using a simple example. These files are located at the same site as the manuscript. Here is the abstract of the manuscript:

Reproducible data analysis is an approach aiming at complementing classical printed scientific articles with everything required to independently reproduce the results they present. ''Everything'' covers here: the data, the computer codes and a precise description of how the code was applied to the data. A brief history of this approach is presented first, starting with what economists have been calling replication since the early eighties to end with what is now called reproducible research in computational data analysis oriented fields like statistics and signal processing. Since efficient tools are instrumental for a routine implementation of these approaches, a description of some of the available ones is presented next. A toy example demonstrates then the use of two open source software for reproducible data analysis: the ''Sweave family'' and the org-mode of emacs. The former is bound to R while the latter can be used with R, Matlab, Python and many more ''generalist'' data processing software. Both solutions can be used with Unix-like, Windows and Mac families of operating systems. It is argued that neuroscientists could communicate much more efficiently their results by adopting the reproducible research paradigm from their lab books all the way to their articles, thesis and books.

3 Resources

3.1 Software

  • Madagascar: "an open-source software package for multidimensional data analysis and reproducible computational experiments. Its mission is to provide: a convenient and powerful environment and a convenient technology transfer tool for researchers working with digital image and data processing in geophysics and related fields. Technology developed using the Madagascar project management system is transferred in the form of recorded processing histories, which become "computational recipes" to be verified, exchanged, and modified by users of the system."
  • Sumatra: "Sumatra is a tool for managing and tracking projects based on numerical simulation or analysis, with the aim of supporting reproducible research. It can be thought of as an automated electronic lab notebook for simulation/analysis projects."
  • MATLAB Report Generator: "Generate documentation for MATLAB applications and data".
  • StatWeave: "StatWeave is software whereby you can embed statistical code (e.g., SAS, R, Stata, etc.) into a LaTeX or OpenOffice document. After running StatWeave, the code listing, output, and graphs are added to the document."
  • The Sweave function of R: "Sweave is a tool that allows to embed the R code for complete data analyses in latex documents. The purpose is to create dynamic reports, which can be updated automatically if data or analysis change. Instead of inserting a prefabricated graph or table into the report, the master document contains the R code necessary to obtain it. When run through R, all data analysis output (tables, graphs, etc.) is created on the fly and inserted into a final latex document. The report can be automatically updated if data or analysis change, which allows for truly reproducible research."
  • The Babel extension of emacs' Org-Mode: "Babel is Org-mode's ability to execute source code within Org-mode documents. Org-mode is an Emacs major mode for doing almost anything with plain text."
  • More links can be found in the Reproducible Research "Task View" of CRAN.

3.2 Web sites

  • Reproducible Research: "Welcome on this site about reproducible research. This site is intended to gather a lot of information and useful links about reproducible research. As the authors (Patrick Vandewalle, Jelena Kovacevic and Martin Vetterli) are all doing research in signal/image processing, that will also be the main focus of this site. Follow the links in the text or in the navigation bar on the left to navigate through this site."
  • The Reproducible Research Repository: "This repository contains the reproducible research publications published at LCAV, EPFL."
  • Reproducible Research Planet!: "Reproducible Research Planet! is an educational non-profit organization of scientists, committed to encouraging and facilitating reproducible research in computational sciences. On this site you will find information and resources to make reproducible research a reality within your own institution."
  • The Dataverse Network Project: "To enable data archiving and preservation through re-formatting, standards and exchange protocols. To provide control and recognition for data owners through data management and persistent citations."
  • Wavelab: "WaveLab implements the concept of reproducible research. The idea is: An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. We make WaveLab available to make the full content of our scholarship available, enabling others to understand and reproduce our work."
  • Why Reproducible Research is the Right Thing.
  • Verifiable, reproducible research and computational science: a web site where talks given during a mini symposium at the SIAM Conference on Computational Science & Engineering (Reno, NV, on March 4, 2011) can be found. They are all interesting, and the one by Randy LeVeque is absolutely great!
  • Literate Programming and Reproducible Research: a page by Randy LeVeque presenting Python based solutions.

4 Software used

4.1 What is needed to reproduce what follows

To reproduce the following analysis examples you will need two pieces of open source software: R and GGobi. R is a "generalist" data analysis program, a bit like Igor and Matlab, with which neurophysiologists are much more likely to be familiar. Having also used both of these for a long time, I am clearly of the opinion that R is far superior to either of them. GGobi is designed to dynamically visualize high dimensional data. It is, in my opinion, the most important software for spike sorting. R and GGobi run on Windows, MacOS and Linux/Unix.

R is a "stand alone" application whose use can be simplified by combining it with emacs (if you are already familiar with this wonderful editor) thanks to the Emacs Speaks Statistics extension or by calling it through RStudio. I recommend the second solution for a smooth start if you have never heard of emacs. For long term use the first solution is the one of choice. Again both emacs and RStudio are open source software running on the three operating systems you're most likely to have at hand.

What follows does not strictly depend on R. You could, with patience and motivation, do approximately the same with Matlab (although the graphical part would be really painful to implement) as well as with Igor (although it would take you so much time that you're probably better off learning R). You can also try to perform spike sorting or other kinds of neurophysiological data analysis with dedicated software. The problem with such software is that it forces you to do the task the way the author did it, which was, at best, appropriate to the data he was working with. Sadly, methods developed for one data type rarely work out of the box for other data. I therefore recommend doing this task within generalist software which does not limit what you can do with the data.

4.2 What was used to produce these documents

The "source files" of these documents were written as Org-Mode files; Org-Mode is a mode of emacs which is enough, in my view, to justify switching to this editor. From a single source file, several outputs can be generated, such as an html file, a pdf file via \(\LaTeX\), etc. The very attractive feature is that you do not have to know anything about html or \(\LaTeX\) in order to produce a perfectly decent document. Even better, thanks to the Babel extension of Org-Mode, you can include R, Matlab, Python or Octave code in your document, making its results explicitly reproducible by any reader. You therefore get a wonderful tool to implement the reproducible research paradigm.
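
To give an idea of what such a source file looks like, here is a minimal sketch of an Org-Mode document with one embedded R code block. It is an illustration only and is not taken from the actual source files of these pages: the title, the heading, the seed and the variable name x are all made up for the example, and the data are simulated.

```org
#+TITLE: A minimal reproducible analysis (hypothetical example)

* Summary of simulated data

The block below is R code stored in the document itself. The
header argument ":results output" captures what R prints, and
":exports both" puts the code and its output in the exported file.

#+begin_src R :results output :exports both
## Illustrative data only: 100 draws from a standard normal
set.seed(20110101)
x <- rnorm(100)
summary(x)
#+end_src
```

With the cursor inside the block, C-c C-c evaluates the code and inserts the result below it, provided R has been enabled in the variable org-babel-load-languages and R itself is installed. Exporting the file to html or pdf then yields a document in which the code and its output sit next to the text that discusses them.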

5 Spike sorting examples

I'm currently re-writing my SpikeOMatic software and expect it to be a proper R package by the end of June 2011. Before this perhaps too optimistic date, you can find the current version of the sorting specific functions in the file sorting.R. I'm testing the new functions on some real data sets, like the Purkinje cells data and the Locust data, both available from my data page. Examples using other data sets will appear soon. Since the development of the functions is not finished yet, the analysis presented for the data sets is also in progress. I hope it will nevertheless be useful and that some readers will find mistakes or have comments.

5.1 Purkinje cells data set

The current status of the Purkinje cells data set analysis is available as:

5.2 Locust data set

The current status of the Locust data set analysis is available as:

Author: Christophe Pouzat

Org version 7.5 with Emacs version 23
