Over the past few decades, scientists have deposited vast amounts of raw genomics data into public repositories. My goal is to develop new models, databases, and tools that enable researchers to unlock the untapped potential of these data. I am eager to discover new techniques and to combine ideas in novel ways to tackle this challenge. See a sample of my research outputs below.

Hierarchical cell type classification with the Cell Ontology

Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. We present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification by considering the rich hierarchical structure of known cell types. Furthermore, CellO comes pre-trained on a novel, comprehensive dataset of healthy, untreated human primary samples from the Sequence Read Archive (SRA) which, to the best of our knowledge, is the most diverse curated collection of primary cell data to date.
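To illustrate what "considering the hierarchical structure" can mean in practice, here is a minimal sketch of keeping labels consistent with an ontology. It is not CellO's implementation: the `PARENTS` graph is a tiny hypothetical stand-in for the Cell Ontology, and the function names are invented for this example.

```python
# Toy is-a hierarchy (hypothetical; not the real Cell Ontology graph).
PARENTS = {
    "CD4-positive T cell": ["T cell"],
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": [],
}


def with_ancestors(label, parents=PARENTS):
    """Return the label together with every ancestor reachable via is-a edges."""
    seen = set()
    stack = [label]
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, []))
    return seen


def most_specific(labels, parents=PARENTS):
    """Keep only the labels that are not an ancestor of another label in the set."""
    labels = set(labels)
    implied = set()
    for label in labels:
        implied |= with_ancestors(label, parents) - {label}
    return labels - implied
```

A sample labeled "CD4-positive T cell" is implicitly also a T cell, a lymphocyte, and a leukocyte, so `with_ancestors` expands a label into that full set, while `most_specific` collapses a set of predictions back to its most informative terms.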


Standardizing metadata for large, public genomics databases

The NCBI’s Sequence Read Archive (SRA) promises great biological insight if its data could be analyzed in aggregate; however, the data remain largely underutilized, in part because the metadata associated with each sample are poorly structured. We developed the MetaSRA, a novel computational pipeline and associated database that standardizes the metadata for samples in the SRA by mapping each sample to biomedical ontologies. See my talk describing the MetaSRA from ISMB 2018.


Webtools for exploring public single-cell cancer data

Single-cell RNA-seq (scRNA-seq) profiles genome-wide gene expression at the single-cell level and thereby offers insight into cellular heterogeneity within a tissue. This is especially important in cancer, where heterogeneity of the tumor and the tumor microenvironment directly impacts the development, maintenance, and progression of disease. While publicly available scRNA-seq cancer datasets offer an unprecedented opportunity to better understand the mechanisms underlying tumor maintenance and progression, much of the available information has been underutilized, in part because of the lack of tools for aggregating and analyzing these data. We present CHARacterizing Tumor Subpopulations (CHARTS), a computational pipeline and web application for analyzing, characterizing, and integrating publicly available scRNA-seq cancer datasets. CHARTS is freely available at


Developing tools for querying large, public genomics databases

We’ve built Jupyter notebook-based tools atop the MetaSRA for constructing structured datasets from the SRA. The Case-Control Finder matches samples of a given condition or disease in the SRA to control samples. The Series Finder finds sets of samples that can be ordered by a continuous property such as age.
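The two kinds of query can be sketched roughly as follows. This is an illustrative toy, not the tools' actual code: the sample records, the field names (`study`, `tissue`, `disease`, `age`), and the matching rules (controls share the case's tissue; a series is drawn from a single study) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are illustrative only.
SAMPLES = [
    {"id": "S1", "study": "P1", "tissue": "liver", "disease": "hepatitis", "age": 40},
    {"id": "S2", "study": "P1", "tissue": "liver", "disease": None, "age": 35},
    {"id": "S3", "study": "P1", "tissue": "liver", "disease": None, "age": 52},
    {"id": "S4", "study": "P2", "tissue": "blood", "disease": None, "age": 30},
    {"id": "S5", "study": "P2", "tissue": "blood", "disease": "leukemia", "age": None},
]


def case_control_pairs(samples, disease):
    """Map each case of `disease` to disease-free samples from the same tissue."""
    controls = defaultdict(list)
    for s in samples:
        if s["disease"] is None:
            controls[s["tissue"]].append(s["id"])
    return {
        s["id"]: controls.get(s["tissue"], [])
        for s in samples
        if s["disease"] == disease
    }


def series_by(samples, key):
    """Per study, order samples by `key`; keep studies with >= 2 distinct values."""
    groups = defaultdict(list)
    for s in samples:
        if s.get(key) is not None:
            groups[s["study"]].append(s)
    series = {}
    for study, members in groups.items():
        if len({m[key] for m in members}) >= 2:
            ordered = sorted(members, key=lambda m: m[key])
            series[study] = [m["id"] for m in ordered]
    return series
```

On these toy records, `case_control_pairs(SAMPLES, "hepatitis")` pairs the one hepatitis sample with the two healthy liver samples, and `series_by(SAMPLES, "age")` returns an age-ordered series only for study P1, because P2 has just one sample with a recorded age.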
