Software --- Data


[validate] : A key component of performing any genomics experiment is validation of significant features (genes, proteins, etc.). This software can be used to assess the statistical evidence for validation of a particular analysis/technology on the basis of a random sample of significant results. Written by Jeff Leek.

[sva] : It has been shown that genome-wide expression may be affected by environmental, demographic, genetic and technical factors, creating what we call expression heterogeneity. Surrogate variable analysis (SVA) is designed to identify, estimate, and incorporate into an analysis the sources of expression heterogeneity that are not captured by variables included in the model. SVA has been shown to reduce dependence across genes, stabilize false discovery rate estimates, and improve reproducibility of analyses. Written by Jeff Leek, Evan Johnson, Hilary Parker, Andrew Jaffe and John Storey.

[dks] : The explosive growth of high-dimensional data has resulted in an equally explosive growth in methods for analyzing high-dimensional data. Almost all of these methods rely on p-values, corrected p-values, or false discovery rate estimates for ranking and significance calculation. However, there is no clear standard for determining whether the p-values from a new multiple testing procedure are correct. The double Kolmogorov-Smirnov package consists of a set of R functions for diagnosing whether a multiple testing procedure gives correct null p-values using simulated data. Written by Jeff Leek.
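As a rough illustration of the diagnostic idea in the dks entry above, the sketch below simulates many studies of null p-values, tests each study's p-values for uniformity with a Kolmogorov-Smirnov test, and then tests the resulting first-level p-values for uniformity again. This is one plausible reading of "double Kolmogorov-Smirnov", written in Python rather than R. It is not the package's code or API. The function names (`ks_uniform_pvalue`, `double_ks`), the simulators, and the use of the asymptotic Kolmogorov distribution are all assumptions made for the example.

```python
import math
import random

def ks_uniform_pvalue(pvals):
    """One-sample KS test against Uniform(0, 1), using the asymptotic p-value."""
    x = sorted(pvals)
    n = len(x)
    # KS statistic: largest gap between the empirical CDF and the uniform CDF.
    d = max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(x))
    lam = math.sqrt(n) * d
    if lam < 0.2:
        # The limiting tail probability is numerically 1 in this range.
        return 1.0
    # Kolmogorov limiting distribution, written as an alternating series.
    tail = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                     for k in range(1, 101))
    return min(1.0, max(0.0, tail))

def double_ks(simulate_null_pvalues, n_studies=50):
    """Second-level KS test applied to the first-level KS p-values."""
    first_level = [ks_uniform_pvalue(simulate_null_pvalues())
                   for _ in range(n_studies)]
    return ks_uniform_pvalue(first_level)

rng = random.Random(1)

def correct_procedure(m=500):
    # Hypothetical procedure whose null p-values really are uniform.
    return [rng.random() for _ in range(m)]

def anticonservative_procedure(m=500):
    # Hypothetical broken procedure: null p-values are pushed toward zero.
    return [rng.random() ** 2 for _ in range(m)]

p_correct = double_ks(correct_procedure)
p_broken = double_ks(anticonservative_procedure)
print("correct procedure:", p_correct)
print("anti-conservative procedure:", p_broken)
```

A very small second-level p-value, as the anti-conservative simulator should produce, suggests that the procedure's null p-values are not uniform. Because the first-level p-values here are asymptotic approximations, the second-level result for a correct procedure is only approximately uniform.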

[myrna] : Myrna is a cloud computing tool for calculating differential gene expression in large RNA-seq datasets. Myrna uses Bowtie for short read alignment and R/Bioconductor for interval calculations, normalization, and statistical testing. These tools are combined in an automatic, parallel pipeline that runs in the cloud (Elastic MapReduce in this case), on a local Hadoop cluster, or on a single computer, exploiting multiple computers and CPUs wherever possible. Written by Ben Langmead, Kasper Hansen, and Jeff Leek.

[tspair] : A top scoring pair is a pair of genes whose relative ranks can be used to classify arrays according to a binary phenotype. A top scoring pair classifier has three advantages over standard classifiers: (1) the classifier is based on the relative ranks of genes and is more robust to normalization and preprocessing, (2) the classifier is based on a pair of genes and is likely to be more interpretable than a more complicated classifier, and (3) a classifier based on a small number of genes lends itself to diagnostic tests based on PCR that are both more rapid and cheaper than classifiers based on a large number of genes. Written by Jeff Leek.
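To make the top scoring pair idea in the tspair entry above concrete, here is a minimal sketch in Python (not the package's R interface). It assumes the common formulation in which each gene pair is scored by how much the frequency of "gene i is lower than gene j" differs between the two classes. The function names and the toy expression values are hypothetical and exist only for this example.

```python
from itertools import combinations

def top_scoring_pair(expr, labels):
    """expr[g][s] is the expression of gene g in sample s; labels[s] is 0 or 1."""
    n0, n1 = labels.count(0), labels.count(1)
    if n0 == 0 or n1 == 0 or len(expr) < 2:
        raise ValueError("need both classes and at least two genes")
    best = None
    for i, j in combinations(range(len(expr)), 2):
        # For every sample, record whether gene i is ranked below gene j.
        less = [a < b for a, b in zip(expr[i], expr[j])]
        p0 = sum(l for l, y in zip(less, labels) if y == 0) / n0
        p1 = sum(l for l, y in zip(less, labels) if y == 1) / n1
        score = abs(p0 - p1)
        if best is None or score > best[2]:
            # The last field records which class "gene i < gene j" points to.
            best = (i, j, score, p1 > p0)
    return best

def classify(sample, i, j, less_means_one):
    """Predict 0 or 1 from the relative order of the two genes only."""
    return int((sample[i] < sample[j]) == less_means_one)

# Toy data: 3 genes (rows) measured in 6 samples (columns).
expr = [
    [1.0, 2.0, 1.5, 5.0, 6.0, 5.5],  # low in class 0, high in class 1
    [4.0, 4.5, 3.8, 3.0, 2.5, 3.2],  # roughly the opposite pattern
    [2.0, 9.0, 1.0, 8.0, 2.2, 7.0],  # unrelated to the phenotype
]
labels = [0, 0, 0, 1, 1, 1]

i, j, score, less_means_one = top_scoring_pair(expr, labels)
new_sample = [1.2, 4.1, 3.0]
rescaled = [10 * v + 3 for v in new_sample]  # a monotone "normalization"
print(classify(new_sample, i, j, less_means_one))
print(classify(rescaled, i, j, less_means_one))
```

Because the prediction depends only on which of the two genes is larger within a sample, any monotone rescaling of that sample leaves the prediction unchanged. This is the sense in which such a classifier is robust to normalization and preprocessing.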

[svd dimension] : An R script for calculating the dimension of a latent factor model in high-dimensional data. Written by Jeff Leek.

[EDGE] : A comprehensive software package for the significance analysis of DNA microarray experiments -- for both standard and time course experiments -- based on our new optimal discovery procedure and time course methodology. Written by Jeff Leek, Alan Dabney, Eva Monsen, and John Storey.

[Data Scientist] : An R function to determine if you are a data scientist. This is just a goofy little function I wrote to determine if you are a "data scientist". It is published over on our blog Simply Statistics. Written by Jeff Leek with help from Rafa Irizarry and Roger Peng on the scoring system.

[googleCite] : A set of R functions to analyze your Google Scholar citations page and make a word cloud of your co-authors and paper titles. It is published over on our blog Simply Statistics. Written by John Muschelli, Andrew Jaffe, and Jeff Leek.

[twitterMap] : An R function to create a personalized map of your Twitter followers. It is published over on our blog Simply Statistics. Written by Jeff Leek.


Peer Review Experiment : The data and code from an experimental study of peer review. Participants solved and reviewed GRE problems and data was recorded about social interaction and accuracy. Developed by Jeff Leek, Margaret Taub, and Fernando Pineda.

ReCount : A database of aligned and pre-processed data from RNA-sequencing experiments. This data is ready for statistical analysis and can be used to develop new methods. Developed by Alyssa Frazee, Ben Langmead, and Jeff Leek.

Batch Effects Data : Data illustrating batch effects on various technologies. R code is also available to perform the analyses in our 2010 batch review paper (see Publications for more detail). Developed by Jeff Leek, Rob Scharpf, Hector Corrada Bravo, David Simcha, Ben Langmead, Evan Johnson, Don Geman, Keith Baggerly, and Rafael Irizarry.

Bladder batch package : A pre-normalized batch effect example available as a Bioconductor experiment data package. Developed by Jeff Leek.

Braincloud : Software for visualizing and analyzing genome-wide gene expression and genetic data from the developing human brain. The raw data are also available from the Download page. Developed by the Lieber Institute and the NIMH.