(in order of presentations)

Day 1

Shannon McCurdy (Pachter/Barcellos Lab, Postdoc)

Title: Estimating High-Order Heritability components in GWAS
Authors: Shannon R. McCurdy, Lisa F. Barcellos, Or Zuk

Abstract: Most genome-wide association studies (GWAS) to date have focused on estimating the additive contributions of genetic variation to the phenotype of interest. Recent work on estimating the dominance contribution to the heritability of multiple phenotypic traits suggests that dominance frequently contributes little to the genetic variation (Zhu et al., 2015). However, higher-order dominant and epistatic contributions to the total genetic variation remain largely unexamined, in part because estimating them reliably is statistically challenging and requires very large sample sizes. Characterization of higher-order dominant and epistatic effects is important for at least two reasons: first, the characterization is a necessary step towards understanding the total genetic variation of a phenotypic trait, the broad-sense heritability; second, understanding the relative importance of the different orders of complex genetic interactions contributing to a specific trait could guide further study on specific interactions. We test and apply methods for estimating the heritability of complex traits and diseases for general genetic architectures that include higher-order dominant and epistatic effects. We explore these methods through simulation studies of phenotypes, and study the sample sizes and other population parameters required to estimate the higher-order contributions reliably. We also estimate higher-order dominant and epistatic effects for dozens of human complex traits, separately analyzing phenotypes collected by the electronic Medical Records and Genomics (eMERGE) Network. This cohort is ideal for estimating these effects due to its size and diversity.


Harneet Rishi (Arkin Lab, Biophysics Grad Student, Comp Bio DE)

Title: High-throughput CRISPRi as a platform for bacterial functional genomics
Authors: Harneet S. Rishi, Esteban Toro, Honglei Liu, Xiaowo Wang, Lei S. Qi, and Adam P. Arkin

Abstract: The advent of next-generation sequencing has led to an explosion of microbial genome sequences, and now high-throughput genetics efforts are becoming increasingly vital to finding the phenotype and, by extension, function associated with a given gene of interest. Here we apply the catalytically inactive dCas9 to conduct high-throughput transcriptional and regulatory studies in E. coli. Using an Agilent OLS library of 32992 unique sgRNAs, we target 4500 genes, 5400 promoters, 640 transcription factor binding sites (TFBSs), and 106 small RNAs (sRNAs) in the E. coli genome. By combining CRISPRi with next-generation sequencing, we are able to interrogate the fitness effect of transcriptional knockdown for each of the aforementioned genomic features in a single-pot experiment. Our fitness results agree well with current knockout databases, and our ability to induce transcriptional knockdown at any point during an experiment has allowed us to uncover condition-dependent phenotypes for essential genes. We also leverage the CRISPRi platform with flow cytometry to obtain morphology phenotypes in high-throughput. We first show that CRISPRi can generate filamentous cellular phenotypes and next present a flow-seq methodology that allows us to enrich for filamentous mutants in our library. Finally, we show that our single-pot flow-seq results agree well with single-genotype microscopy studies and provide novel phenotypes for several essential genes. Overall, HT-CRISPRi enables single-pot, precise measurements of fitness for a large set of genomic features and will prove useful in genome annotation studies of model and non-model organisms.


Sean Wu (Marshall Lab, Epidemiology Grad Student, Comp Bio DE)

Title: MICRO: An Eco-epidemiological Agent Based Framework for the Modeling of Mosquito-borne Pathogens

Abstract: As climate change, urban expansion, and new patterns of land use increase the frequency of human contact with mosquitoes, the threat of mosquito-borne pathogens looms large. The modeling of mosquito-transmitted diseases has a rich history, with the first differential-equation-based models developed in the early 1900s by Ronald Ross, a British doctor, naturalist, and amateur mathematician, after supervising a 1909 malaria control experiment in India. His models were expanded upon in the 1950s by George MacDonald, who developed ideas of superinfection, where hosts may concurrently carry multiple clonal variants of a single pathogen, as well as the important epidemiological concept of R_0.

With the advent of more sophisticated computational resources, and the development of dynamical systems and stochastic process theory, models of disease transmission have exploded in complexity. However, there is still a need for a flexible, modular modeling framework that, rather than developing orthogonally to classical models of transmission, builds upon them and introduces heterogeneity into mathematical models in a logically consistent manner. MICRO is a framework to simulate human-mosquito-pathogen interactions on a continuous landscape, based on the deep connections between continuous-time stochastic processes and differential equations. Because it simulates human, mosquito, and pathogen dynamics in continuous time, MICRO collapses to the classical assumptions of perfectly mixing populations when stripped of heterogeneity, but it is designed so that investigators can understand how adding individual-level heterogeneity results in the emergence of system-level deviations from classical theory.

MICRO is being developed as an open-source R package to facilitate ease of use and understanding of code, with computationally heavy elements implemented in object-oriented C++. It is our hope that MICRO can provide a logical basis to answer important questions in vector biology and pathogen evolution, and to support policy decisions when experiments are impossible or unethical to perform.
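
The link between continuous-time stochastic processes and differential equations mentioned in this abstract can be illustrated with a generic example. The sketch below is not MICRO code: it is a minimal, assumed illustration using a well-mixed SIR epidemic, with hypothetical function names (`gillespie_sir`, `euler_sir`) and arbitrary parameter values. It simulates the epidemic as a continuous-time Markov chain with the Gillespie algorithm and integrates the corresponding classical mean-field equations with a simple Euler scheme, so the two trajectories can be compared.

```python
import random


def gillespie_sir(n, i0, beta, gamma, t_max, seed=0):
    """Exact stochastic simulation of a well-mixed SIR epidemic.

    Returns a list of (time, S, I, R) tuples, one per event.
    All names and parameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    s, i, r = n - i0, i0, 0
    t = 0.0
    path = [(t, s, i, r)]
    while i > 0 and t < t_max:
        rate_infection = beta * s * i / n  # S -> I events
        rate_recovery = gamma * i          # I -> R events
        total_rate = rate_infection + rate_recovery
        # Waiting time to the next event is exponentially distributed.
        t += rng.expovariate(total_rate)
        if t > t_max:
            break
        # Choose which event fires, in proportion to its rate.
        if rng.random() * total_rate < rate_infection:
            s, i = s - 1, i + 1
        else:
            i, r = i - 1, r + 1
        path.append((t, s, i, r))
    return path


def euler_sir(n, i0, beta, gamma, t_max, dt=0.01):
    """Forward-Euler integration of the classical SIR differential equations."""
    s, i, r = float(n - i0), float(i0), 0.0
    for _ in range(int(t_max / dt)):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s = s - new_infections
        i = i + new_infections - new_recoveries
        r = r + new_recoveries
    return s, i, r


if __name__ == "__main__":
    path = gillespie_sir(n=1000, i0=10, beta=0.5, gamma=0.2, t_max=100.0)
    print("stochastic final state:", path[-1][1:])
    print("mean-field final state:", euler_sir(1000, 10, 0.5, 0.2, 100.0))
```

Averaged over many seeds and with a large population, the stochastic trajectories are expected to track the deterministic solution; individual-level heterogeneity would enter by letting the event rates differ between individuals.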


Nima Hejazi (van der Laan Lab, Biostatistics Grad Student)

Title: Data-Adaptive Estimation and Inference for Differential Methylation Analysis

Abstract: DNA methylation is amongst the best studied of epigenetic mechanisms impacting gene expression. While much attention has been paid to the proper normalization of bioinformatical data produced by DNA methylation assays, linear models remain the current standard for analyzing post-processed methylation data, for the ease they afford for both statistical inference and scientific interpretation. We present a new, general statistical algorithm for the model-free estimation of the differential methylation of DNA CpG sites, complete with straightforward and interpretable statistical inference for such estimates. The new approach leverages variable importance measures, a class of parameters arising in causal inference, in a manner that facilitates their use in obtaining targeted estimates of the importance of each CpG site. The proposed procedure is computationally efficient and self-contained, incorporating techniques to isolate a subset of candidate CpG sites based on cursory evidence of differential methylation and providing a multiple testing correction that appropriately controls the False Discovery Rate in such multi-stage analysis settings. The effectiveness of the new methodology is demonstrated by way of data analysis with real DNA methylation data, and a recently developed R package (“methyvim”; available via Bioconductor) that provides support for data analysis with this methodology is introduced.
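
For readers unfamiliar with False Discovery Rate control, the sketch below shows the standard Benjamini-Hochberg step-up procedure in plain Python. It is an assumed illustration only: the abstract does not specify the correction used, and this is the ordinary single-stage procedure, not the multi-stage correction described above or the methyvim API. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected
    by the standard Benjamini-Hochberg procedure at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda k: p_values[k])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical per-site p-values, for illustration only.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, alpha=0.05))
```

In a multi-stage analysis, where candidate sites are first screened and then tested, the effective number of tests must be accounted for so that the screening step does not invalidate the error control; the plain procedure above does not handle that.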


Sumayah Rahman (Banfield Lab, PMB Grad Student, Comp Bio DE)

Title: Investigating dynamics of antibiotic resistance through genome-resolved metagenomics

Abstract: Antibiotic resistance in pathogens is extensively studied, yet little is known about how antibiotic resistance genes of routine gut bacteria influence microbiome dynamics. Here, we leverage genomes from metagenomes to investigate how genes of the premature infant gut resistome correspond to the ability of bacteria to survive under certain environmental and clinical conditions. We find that formula feeding impacts the resistome. Random forest models corroborated by statistical tests revealed that the gut resistome of formula-fed infants is enriched in class D beta-lactamase genes. Interestingly, Clostridium difficile strains harboring this gene are at higher abundance in formula-fed infants compared to C. difficile lacking this gene. Likewise, organisms with genes for major facilitator superfamily drug efflux pumps have faster replication rates under all conditions, even in the absence of antibiotic therapy. Using a machine learning approach, we identified genes that are predictive of an organism’s direction of change in relative abundance after administration of vancomycin and cephalosporin antibiotics. The most accurate results were obtained by reducing annotated genomic data into five principal components classified by boosted decision trees. Among the genes involved in predicting whether an organism increased in relative abundance after treatment are those that encode subclass B2 beta-lactamases and transcriptional regulators of vancomycin resistance. This demonstrates that machine learning applied to genome-resolved metagenomics data can identify key genes for survival after antibiotics and predict how organisms in the gut microbiome will respond to antibiotic administration.


Nicolas Alexandre (Whiteman Lab, IB Grad Student, Comp Bio DE)

Title: The Genomic Architecture of Bill Shape in the Broad-tailed Hummingbird

Abstract: It is unclear why natural populations of hummingbirds exhibit intraspecies variation for bill shape, especially given Darwin’s observation that they are “specially adapted to the various kinds of flowers they visit”. Heritable variation for beak shape has been recently leveraged to identify genomic variants in Darwin’s finches, pointing to a genetic basis for these traits, yet it is unclear whether the same loci underlie this variation in other birds. Our species of interest, the broad-tailed hummingbird (BTH) (Selasphorus platycercus), migrates from wintering sites in Mexico to summer breeding grounds in the Rocky Mountains, where its breeding season coincides with the emergence of a multitude of flower species. In 2017, I measured bill traits in 500 BTH and found variation in multiple axes of bill shape: length, width, curvature, and depth. DNA is being extracted from blood collected from these individuals and will be sequenced on Illumina at low coverage to detect loci underlying bill shape variation. Narrow-sense heritability (NSH) will be estimated with a pedigree-free method, Genome-wide Complex Trait Analysis (GCTA). The genome is being assembled using a combination of PacBio and Dovetail Hi-C as a scaffold for Illumina sequences. This project aims to explore BTH genomic architecture by quantifying intraspecific variation and using low-coverage individual sequence data to perform a GWAS. It is predicted that loci associated with this bill shape variation will be identified genome-wide and will overlap with the a priori candidate loci detected in Darwin’s finches; these loci will subsequently be tested for signals of natural or balancing selection.


Karl Kumbier (Yu Lab, Statistics Grad Student)

Title: Iterative Random Forests to discover predictive and stable high-order interactions

Abstract: Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF) and Random Intersection Trees (RIT), and through extensive, biologically inspired simulations, we developed iterative Random Forests (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity for the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, iRF re-discovered the essential role of Zelda in early zygotic enhancer activation and identified novel third-order interactions. In human-derived cells, iRF re-discovered that H3K36me3 plays a central role in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry in genome biology, automating hypothesis generation for the discovery of new molecular mechanisms from genomic data.


David DeTomaso (Yosef Lab, Comp Bio Grad Student)

Title: Scalable Analysis of Single-Cell RNA-Sequencing Experiments

Abstract: In recent years, the development of droplet-based single-cell RNA-sequencing platforms has enabled a more powerful investigation of individual cell sub-populations within a heterogeneous sample. However, this opportunity is also accompanied by new computational challenges. In this talk I’ll present methods in development in the Yosef Lab to address these challenges, specifically for the identification of genes and pathways which drive heterogeneity in diverse cellular populations, and demonstrate their application on large, droplet-based single-cell datasets.

Day 2

Miaoyan Wang (Song Lab, Postdoc)

Title: Three-way clustering of multi-tissue gene expression data using tensor decomposition

Abstract: The advent of next generation sequencing methods has led to an increasing availability of large, multi-tissue datasets which contain gene expression measurements across different tissues and individuals. In this setting, variation in expression levels arises due to contributions specific to genes, tissues, individuals, and interactions thereof. Classical clustering methods are ill-suited to explore these three-way interactions and fail to fully extract the insights into transcriptome complexity and regulation contained in the data. Thus, to exploit the multi-mode structure of the data, new methods are required. To this end, we propose a tensor decomposition-based approach which permits the investigation of transcriptome variation across individuals and tissues simultaneously. Via simulation and application to the GTEx V6 data, we show that our semi-nonnegative tensor decomposition identifies three-dimensional clusters in tensorial data with high accuracy, and that the associated tensor-projection outperforms existing methods in detecting covariate- (e.g., age-, race-, and gender-) related genes by sharing information across similar tissues. Our analysis finds gene modules consistent with existing knowledge while also detecting novel candidate genes exhibiting either tissue-, individual-, or tissue-by-individual specificity. These identified genes and gene modules offer bases for future study, and the uncovered multi-way specificities provide a finer, more nuanced snapshot of transcriptome variation than previously possible. This is a joint work with Jonathan Fischer and Yun S. Song.


Calvin Chi (Barcellos Lab, Comp Bio Grad Student)

Title: Admixture Mapping Reveals Diverse Genetic Ancestry of HLA and non-HLA MS-Associated Alleles in Admixed Populations

Abstract: Multiple sclerosis (MS) is a complex neurodegenerative disease with highest prevalence among populations of northern European ancestry. The disparity of MS prevalence across the globe leads to questions regarding whether MS genetic risk factors could be traced to a predominant ancestry. We estimate local ancestry of MS-associated alleles in the largest assembled admixed population to date, with individuals from African American, Asian American, and Hispanic populations. Our results reveal a complex picture of the genetic ancestry of MS-associated alleles: the majority of human leukocyte antigen (HLA) alleles, including the prominent HLA-DRB1*15:01 risk allele, exhibit cosmopolitan origins. We also identify MS-associated HLA alleles that are likely ancestry-specific. Alleles from different ancestries could confer different risks: our case study of HLA-DRB1*15:01 shows evidence that the European allele confers twice the risk of the African allele. Out of the 200 established non-HLA MS alleles, we only replicate association of two alleles, rs405343 and rs6670198, in Asian Americans, and we find that most non-HLA MS alleles are admixed. Lastly, a genome-wide search of association between European ancestry and MS reveals a top signal from 2Mb to 3Mb on chromosome 8 in Hispanics, with cases being more European than controls at the locus. Overall, we find that admixed individuals with MS have both ancestry-specific and cosmopolitan MS-associated alleles. It is plausible that the higher prevalence of MS in European populations can be explained by a combination of greater risk exerted by alleles from European ancestry and higher frequency of protective alleles in non-European populations.


Jonathan Fischer (Song Lab, Statistics Grad Student)

Title: The role of mRNA decay factors in transcriptional regulation in Saccharomyces cerevisiae

Abstract: Recent work in Saccharomyces cerevisiae has revealed that the processes of mRNA synthesis and decay are linked, meaning gene expression is circular. In particular, elements of the so-called “synthegradosome” travel between the nucleus, where they regulate transcription initiation, and the cytoplasm, where they degrade mRNAs. Using native elongating transcript sequencing (NET-seq), we explore the active transcription profile in several decay factor knockouts to directly interrogate the roles of individual decay factors in transcription. After identifying the genes whose transcription is regulated by the considered decay factors, we perform enrichment analyses for a number of different biological annotations, including gene ontology (GO) and regulatory complex associations, among others. Of note are our observations of widespread enrichment signatures for genes whose steady state mRNA levels are associated with the SAGA complex, and, in a subset of mutants, GO terms related to translation and cellular energy production. We also probe changes in the Pol II distribution in the 5’ and 3’ ends of genes, finding that Pol II accumulates adjacent to the 3’ end of genes when these decay factors are absent. Moreover, we examine non-coding transcription and observe a marked decrease in such activity in the knockout strains, suggesting these decay factors somehow stimulate this process. We conclude with an investigation of how the hypothesized roles of decay factors as transcriptional regulators mesh with recent findings about the non-redundant roles of the SAGA and TFIID complexes in transcriptional regulation.


Olivia Solomon (Barcellos/Holland Lab, MPH Grad Student)

Title: Prenatal phthalate exposure and altered patterns of DNA methylation in cord blood

Abstract: Epigenetic changes such as DNA methylation may be a molecular mechanism through which environmental exposures affect health. Phthalates are known endocrine disruptors with ubiquitous exposures in the general population, including pregnant women, and they have been linked with a number of adverse health outcomes. We examined the association between in utero phthalate exposure and altered patterns of cord blood DNA methylation in 336 Mexican-American newborns. Concentrations of 11 phthalate metabolites were analyzed in maternal urine samples collected at 13 and 26 weeks gestation as a measure of fetal exposure. DNA methylation was assessed using the Infinium HumanMethylation 450K BeadChip, adjusting for cord blood cell composition. To identify differentially methylated regions (DMRs) that may be more informative than individual CpG sites, we used two different approaches, DMRcate and comb-p. Regional assessment by both methods identified 27 distinct DMRs, the majority of which were in relation to multiple phthalate metabolites. Most of the significant DMRs (67%) were observed for later pregnancy (26 weeks gestation). Further, 51% of the significant DMRs were associated with the di-(2-ethylhexyl) phthalate metabolites. Five individual CpG sites were associated with phthalate metabolite concentrations after multiple comparisons adjustment (FDR), all showing hypermethylation. Genes with DMRs were involved in inflammatory response (IRAK4 and ESM1), cancer (BRCA1 and LASP1), endocrine function (CNPY1), and male fertility (IFT140, TESC, and PRDM8). These results on differential DNA methylation in newborns with prenatal phthalate exposure provide new insights and targets to explore the mechanisms of adverse effects of phthalates on human health.


Byung-Ju Kim (Kim Lab, Visiting Scholar)

Title: Prediction of Inherited Genomic Susceptibility to 20 Common Cancer types by a Supervised Machine-Learning Method
Authors: Byung-Ju Kim, Sung-Hou Kim

Abstract: Prevention and early intervention are the most effective ways of avoiding or minimizing psychological, physical, and financial suffering from cancer. However, such proactive action requires the ability to predict the individual’s susceptibility to cancer with a measure of probability. Of the triad of cancer-causing factors (inherited genomic susceptibility, environmental, and lifestyle factors), the inherited genomic component may be derivable from the recent public availability of a large body of whole genome variation data. However, Genome-Wide Association Studies on genomic variations have so far shown limited success in predicting the inherited susceptibility to common cancers. We present here a multiple classification approach for predicting individuals’ inherited genomic susceptibility to acquire the most likely type among a panel of 20 common cancer types plus one “healthy” type by application of a supervised machine-learning method under competing conditions among the cohorts of the 21 types. This approach suggests that, depending on the types of 5,919 individuals in this study, the portion of the cohort of a cancer type that acquired the observed type due to mostly inherited susceptibility factors ranges from about 33 to 88% (or its corollary: the portion due to mostly environmental and lifestyle factors ranges from 12 to 67%). On an individual level, the method also predicts an individual’s inherited genomic susceptibility to acquire the other types, ranked with associated probabilities. These probabilities may provide practical information for individuals, health professionals, and health policy makers related to prevention and/or early intervention of cancer.