Johnson Lab

Mission Statement

To develop high-quality, cutting-edge computational algorithms for applications in personalized genomic medicine.


The development of personalized treatment regimens is an active area of current research in genomics. The focus of our research is to investigate core biological components that contribute to disease prognosis and development, and to develop computational tools and latent variable models that accurately determine optimal therapeutic regimens for individual patients. Our ultimate goal is to develop a comprehensive and integrated set of relevant, biologically interpretable computational tools for genomic studies in personalized medicine. Past research in our group has focused on the analysis of genome-wide data from ChIP-chip experiments (Johnson et al. 2006, PNAS; Song et al. 2007, Genome Biology; Gottardo et al. 2008, Biometrics; Johnson et al. 2009, Annals of Applied Statistics) and on issues in combining microarray data from multiple studies, batches, or microarray platforms (Johnson et al. 2007, Biostatistics; Leek et al. 2010, Nature Reviews Genetics; Bahr et al. 2010, in preparation). Currently, we are building on this work through four primary research objectives, described in the aims below.

Research Aim 1: Statistical tools and methods for next-generation sequencing in epigenomics. Next-generation sequencing (NGS) technologies are being applied to directly address many important data analysis problems in epigenomics. Specifically, building on our recently developed algorithms for array-based DNA methylation profiling (Rai et al. 2010, Cell; Bahr et al. 2010, submitted) as well as our probabilistic NGS read mapping algorithm (Clement et al. 2010, Bioinformatics), we are developing tools for estimating DNA methylation levels at base-level resolution in bisulfite sequencing (BS-Seq) experiments. In addition, we are adapting our previous multi-tiered continuous-time hidden Markov models (Johnson et al. 2009, Annals of Applied Statistics) to identify regions whose methylation differs from that of other tissues, other samples, or the genome average. Finally, we are working on methods for integrating epigenetic data from multiple epigenetic marks, datasets, and technology platforms, and for carrying these data into different downstream applications (e.g., motif searching).

Research Aim 2: Integrative growth signaling models to decipher complex cancer phenotypes. We aim to systematically profile the multi-tiered growth factor receptor networks (GFRNs) to investigate novel breast cancer phenotypes and the drugs that will be most effective against each subtype. Our studies will utilize our novel gene expression ‘barcoding’ approach, which classifies genes as active or inactive and improves researchers’ ability to characterize pathways, cluster patients, and link prognosis to pathway status (Piccolo et al. 2010, in preparation). Our barcoding method creates a probabilistic barcode that is applicable to all microarray and sequencing platforms, allowing for efficient combination of data from multiple labs, experiments, or platforms and facilitating downstream analysis. We have also developed a novel Bayesian regression model to project multiple pathway signatures into tumor samples and to identify the active signaling pathways in each individual sample (Finlinson et al. 2010, in preparation). In addition, we will develop Bayesian latent factor models that use the pathway signatures as prior information, allowing the active pathway profiles to be refined and adapted within each dataset and efficiently accounting for cell-type-specific pathway differences or any ‘rewiring’ due to cancer deregulation. Finally, an extension of our latent factor modeling approach will cluster patients into molecular subtypes based on their activation profiles and response to drug treatment.

Research Aim 3: Identification of rare variants causative of familial disease risk using whole exome sequencing. We are actively engaged in several whole exome sequencing projects to identify rare variants associated with familial disease risk. For example, we will utilize paired-end sequencing to screen the constitutional genomic DNA of high-risk parent-offspring pairs for structural variants and mutations in genes whose deregulation is associated with hereditary cancer development. We propose to assess the co-occurrence of mutations and variants in the same signaling pathways across high-risk families and test for significant association with hereditary breast cancer. We have recently developed our own NGS variant identification algorithm (Clement et al. 2011, submitted), which utilizes a probabilistic Pair-Hidden Markov Model (PHMM) for base calling and SNP detection that incorporates base uncertainty. In addition to our breast cancer cohort, we are also applying our methods to exome sequencing in a large Utah ADHD pedigree, in which preliminary results already show promising candidate ADHD-associated genes as well as a potential cause of a secondary disease, idiopathic hemolytic anemia, from which one patient suffers (Lyon et al. 2011, Discovery Medicine). Finally, we are working with a small Utah pedigree suffering from a severe, rare, X-linked disease (Rope et al. 2011, American Journal of Human Genetics).

Research Aim 4: Prediction, annotation, and dynamics of miRNA processing and target genes. Since their discovery a decade and a half ago, miRNAs have emerged as an important and widely studied regulatory mechanism, and they have been shown to play important roles in metabolism, immunity, and cancer, and in specific tissues and cell types such as the heart, liver, bone marrow, and stem cells. We have been actively processing RNA-seq data to provide more accurate annotation of known miRNAs (Warf et al. 2011, RNA) and to predict novel miRNAs (Johnson et al. 2010, Biometrics). Our current focus is on the dynamics of miRNA processing and the impact of RNA editing on this process (Warf et al. 2011, in preparation). In conjunction with these objectives, we have recently developed an algorithm for estimating the editing frequency of all types of RNA molecules from NGS data (Shepherd et al. 2011, in preparation).

Ongoing Projects

  • Breast cancer susceptibility
  • GNUMAP project: Probabilistic mapping of next-generation sequencing data and applications
  • Whole exome sequencing data analysis (applications in breast cancer, autism, ADHD, and rare genetic disorders)
  • Analysis of DNA methylation data from multiple platforms
  • Rapid diagnosis of infectious diseases

Post Doctorates
Changjin Hong
Stephen Piccolo
Ying Shen

Graduate Students

Mani Solaiappan, Biostatistics Program
Supriya Sharma, Molecular Medicine Program

Collaborators (outside of BU medical center)
Andrea Bild, PhD; University of Utah
Sean Tavtigian, PhD; University of Utah/Huntsman Cancer Institute
Barbara Graves, PhD; Howard Hughes Medical Institute
Mark Clement, PhD; Brigham Young University
Joshua Udall, PhD; Brigham Young University
Gholson Lyon, MD, PhD; Children’s Hospital of Philadelphia
Raphael Gottardo, PhD; Fred Hutchinson Cancer Research Center

Recent Publications

  1. Clement NL, Shepherd BA, Bodily P, Tumir-Ochir S, Gim Y, Lyon GJ, Snell Q, Clement MJ, Johnson WE (2011). GNUMAP-SNP: A probabilistic pair-hidden Markov model for SNP detection in next-generation sequencing data. Bioinformatics (submitted).
  2. Lyon GJ, Jiang T, Van Wijk R, Wang W, Bodily P, Xing J, Tian L, Robison R, Clement M, Yang L, Zhang P, Liu Y, Moore B, Glessner J, Elia J, Reimherr F, van Solinge W, Yandell M, Hakonarson H, Wang J, Johnson WE, Wei Z, Wang K (2011). Exome Sequencing and Unrelated Findings in the Context of Complex Disease Research: Ethical and Clinical Implications. Discovery Medicine (to appear).
  3. Cohen AL, Soldi R, Zhang H, Gustafson AM, Wilcox R, Welm BE, Chang JT, Johnson WE, Spira A, Jeffrey SS, Bild AH (2011). MATCH: Merging genomic and pharmacologic Analyses for Therapy CHoice. Molecular Systems Biology (to appear).
  4. Rope AF, Wang K, Evjenth R, Xing K, Johnston JJ, Swenson JJ, Johnson WE, Moore B, Huff CD, Bird LM, Carey JC, Opitz JM, Stevens CA, Jiang T, Schank C, Fain HD, Robison R, Dalley B, Chin S, South TS, Pysher TJ, Jorde LB, Hakonarson H, Lillehaug JR, Biesecker LG, Yandell M, Arnesen T, Lyon GJ (2011). Using VAAST to Identify an X-Linked Disorder Resulting in Lethality in Male Infants Due to N-Terminal Acetyltransferase Deficiency. American Journal of Human Genetics. doi: 10.1016/j.ajhg.2011.05.017.
  5. Warf MB, Johnson WE, Bass BL (2011). Improved annotation of C. elegans microRNAs by deep sequencing reveals structures associated with processing by Drosha and Dicer. RNA 17: 563-577.
  6. Johnson WE, Welker NC, Bass BL (2011). Dynamic Linear Model for the Identification of miRNAs in Next-generation Sequencing Data. Biometrics 67. doi: 10.1111/j.1541-0420.2010.01570.x.
  7. Leek JT, Scharpf R, Corrada-Bravo H, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11: 733-739.
  8. Rai K, Sarkar S, Broadbent T, Voas M, Grossman KF, Dehghanizadeh S, Hagos F, Li Y, Toth RK, Chidester S, Bahr TM, Johnson WE, Sklow B, Burt R, Cairns BR, Jones DA (2010). DNA Demethylase Activity Maintains Zebrafish Intestinal Cells in a Progenitor-like State Following Loss of APC. Cell 142 (6): 930-942.
  9. Thyagarajan B, Blaszczak AG, Chandler KJ, Watts JL, Johnson WE, Graves BJ (2010). ETS-4 Is a Transcriptional Regulator of Life Span in Caenorhabditis elegans. PLoS Genetics 6 (9): e1001125.
  10. Clement N, Snell Q, Clement M, Hollenhorst PC, Parwar J, Graves BJ, Cairns BR, Johnson WE (2010). The GNUMAP Algorithm: Unbiased Probabilistic Mapping of Oligonucleotides from Next-Generation Sequencing. Bioinformatics 26 (1): 38-45.