Nova Reader - Subject

Nebula: ultra-efficient mapping-free structural variant genotyper

2021-01-27T00:00

Large scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second and third generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to only specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We are proposing an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method Nebula utilizes the changes in the count of k-mers to predict the genotype of structural variants. We have shown that not only Nebula is an order of magnitude faster than mapping based approaches for genotyping structural variants, but also has comparable accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.

A benchmark and an algorithm for detecting germline transposon insertions and measuring de novo transposon insertion frequencies

2021-01-28T00:00

Transposons are genomic parasites, and their new insertions can cause instability and spur the evolution of their host genomes. Rapid accumulation of short-read whole-genome sequencing data provides a great opportunity for studying new transposon insertions and their impacts on the host genome. Although many algorithms are available for detecting transposon insertions, the task remains challenging and existing tools are not designed for identifying de novo insertions. Here, we present a new benchmark fly dataset based on PacBio long-read sequencing and a new method TEMP2 for detecting germline insertions and measuring de novo ‘singleton’ insertion frequencies in eukaryotic genomes. TEMP2 achieves high sensitivity and precision for detecting germline insertions when compared with existing tools using both simulated data in fly and experimental data in fly and human. Furthermore, TEMP2 can accurately assess the frequencies of de novo transposon insertions even with high levels of chimeric reads in simulated datasets; such chimeric reads often occur during the construction of short-read sequencing libraries. By applying TEMP2 to published data on hybrid dysgenic flies inflicted by de-repressed P-elements, we confirmed the continuous new insertions of P-elements in dysgenic offspring before they regain piRNAs for P-element repression. TEMP2 is freely available at Github: https://github.com/weng-lab/TEMP2.

DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism

2021-01-27T00:00

Subcellular localization of messenger RNAs (mRNAs), as a prevalent mechanism, gives precise and efficient control for the translation process. There is mounting evidence for the important roles of this process in a variety of cellular events. Computational methods for mRNA subcellular localization prediction provide a useful approach for studying mRNA functions. However, few computational methods were designed for mRNA subcellular localization prediction and their performance have room for improvement. Especially, there is still no available tool to predict for mRNAs that have multiple localization annotations. In this paper, we propose a multi-head self-attention method, DM3Loc, for multi-label mRNA subcellular localization prediction. Evaluation results show that DM3Loc outperforms existing methods and tools in general. Furthermore, DM3Loc has the interpretation ability to analyze RNA-binding protein motifs and key signals on mRNAs for subcellular localization. Our analyses found hundreds of instances of mRNA isoform-specific subcellular localizations and many significantly enriched gene functions for mRNAs in different subcellular localizations.

Network controllability-based algorithm to target personalized driver genes for discovering combinatorial drugs of individual patients

2021-01-12T00:00

Multiple driver genes in individual patient samples may cause resistance to individual drugs in precision medicine. However, current computational methods have not studied how to fill the gap between personalized driver gene identification and combinatorial drug discovery for individual patients. Here, we developed a novel structural network controllability-based personalized driver genes and combinatorial drug identification algorithm (CPGD), aiming to identify combinatorial drugs for an individual patient by targeting personalized driver genes from network controllability perspective. On two benchmark disease datasets (i.e. breast cancer and lung cancer datasets), performance of CPGD is superior to that of other state-of-the-art driver gene-focus methods in terms of discovery rate among prior-known clinical efficacious combinatorial drugs. Especially on breast cancer dataset, CPGD evaluated synergistic effect of pairwise drug combinations by measuring synergistic effect of their corresponding personalized driver gene modules, which are affected by a given targeting personalized driver gene set of drugs. The results showed that CPGD performs better than existing synergistic combinatorial strategies in identifying clinical efficacious paired combinatorial drugs. Furthermore, CPGD enhanced cancer subtyping by computationally providing personalized side effect signatures for individual patients. In addition, CPGD identified 90 drug combinations candidates from SARS-COV2 dataset as potential drug repurposing candidates for recently spreading COVID-19.

Complete minicircle genome of Leptomonas pyrrhocoris reveals sources of its non-canonical mitochondrial RNA editing events

2021-02-28T00:00

Uridine insertion/deletion (U-indel) editing of mitochondrial mRNA, unique to the protistan class Kinetoplastea, generates canonical as well as potentially non-productive editing events. While the molecular machinery and the role of the guide (g) RNAs that provide required information for U-indel editing are well understood, little is known about the forces underlying its apparently error-prone nature. Analysis of a gRNA:mRNA pair allows the dissection of editing events in a given position of a given mitochondrial transcript. A complete gRNA dataset, paired with a fully characterized mRNA population that includes non-canonically edited transcripts, would allow such an analysis to be performed globally across the mitochondrial transcriptome. To achieve this, we have assembled 67 minicircles of the insect parasite Leptomonas pyrrhocoris, with each minicircle typically encoding one gRNA located in one of two similar-sized units of different origin. From this relatively narrow set of annotated gRNAs, we have dissected all identified mitochondrial editing events in L. pyrrhocoris, the strains of which dramatically differ in the abundance of individual minicircle classes. Our results support a model in which a multitude of editing events are driven by a limited set of gRNAs, with individual gRNAs possessing an inherent ability to guide canonical and non-canonical editing.

Streamlining CRISPR spacer-based bacterial host predictions to decipher the viral dark matter

2021-03-02T00:00

Thousands of new phages have recently been discovered thanks to viral metagenomics. These phages are extremely diverse and their genome sequences often do not resemble any known phages. To appreciate their ecological impact, it is important to determine their bacterial hosts. CRISPR spacers can be used to predict hosts of unknown phages, as spacers represent biological records of past phage–bacteria interactions. However, no guidelines have been established to standardize host prediction based on CRISPR spacers. Additionally, there are no tools that use spacers to perform host predictions on large viral datasets. Here, we developed a set of tools that includes all the necessary steps for predicting the hosts of uncharacterized phages. We created a database of >11 million spacers and a program to execute host predictions on large viral datasets. Our host prediction approach uses biological criteria inspired by how CRISPR–Cas naturally work as adaptive immune systems, which make the results easy to interpret. We evaluated the performance using 9484 phages with known hosts and obtained a recall of 49% and a precision of 69%. We also found that this host prediction method yielded higher performance for phages that infect gut-associated bacteria, suggesting it is well suited for gut-virome characterization.

SurVirus: a repeat-aware virus integration caller

2021-01-14T00:00

A significant portion of human cancers are due to viruses integrating into human genomes. Therefore, accurately predicting virus integrations can help uncover the mechanisms that lead to many devastating diseases. Virus integrations can be called by analysing second generation high-throughput sequencing datasets. Unfortunately, existing methods fail to report a significant portion of integrations, while predicting a large number of false positives. We observe that the inaccuracy is caused by incorrect alignment of reads in repetitive regions. False alignments create false positives, while missing alignments create false negatives. This paper proposes SurVirus, an improved virus integration caller that corrects the alignment of reads which are crucial for the discovery of integrations. We use publicly available datasets to show that existing methods predict hundreds of thousands of false positives; SurVirus, on the other hand, is significantly more precise while it also detects many novel integrations previously missed by other tools, most of which are in repetitive regions. We validate a subset of these novel integrations, and find that the majority are correct. Using SurVirus, we find that HPV and HBV integrations are enriched in LINE and Satellite regions which had been overlooked, as well as discover recurrent HBV and HPV breakpoints in human genome-virus fusion transcripts.

CoBold: a method for identifying different functional classes of transient RNA structure features that can impact RNA structure formation in vivo

2020-10-23T00:00

RNA structure formation in vivo happens co-transcriptionally while the transcript is being made. The corresponding co-transcriptional folding pathway typically involves transient RNA structure features that are not part of the final, functional RNA structure. These transient features can play important functional roles of their own and also influence the formation of the final RNA structure in vivo. We here present CoBold, a computational method for identifying different functional classes of transient RNA structure features that can either aid or hinder the formation of a known reference RNA structure. Our method takes as input either a single RNA or a corresponding multiple-sequence alignment as well as a known reference RNA secondary structure and identifies different classes of transient RNA structure features that could aid or prevent the formation of the given RNA structure. We make CoBold available via a web-server which includes dedicated data visualisation.

Alignment free identification of clones in B cell receptor repertoires

2020-12-16T00:00

Following antigenic challenge, activated B cells rapidly expand and undergo somatic hypermutation, yielding groups of clonally related B cells with diversified immunoglobulin receptors. Inference of clonal relationships based on the receptor sequence is an essential step in many adaptive immune receptor repertoire sequencing studies. These relationships are typically identified by a multi-step process that involves: (i) grouping sequences based on shared V and J gene assignments, and junction lengths and (ii) clustering these sequences using a junction-based distance. However, this approach is sensitive to the initial gene assignments, which are error-prone, and fails to identify clonal relatives whose junction length has changed through accumulation of indels. Through defining a translation-invariant feature space in which we cluster the sequences, we develop an alignment free clonal identification method that does not require gene assignments and is not restricted to a fixed junction length. This alignment free approach has higher sensitivity compared to a typical junction-based distance method without loss of specificity and PPV. While the alignment free procedure identifies clones that are broadly consistent with the junction-based distance method, it also identifies clones with characteristics (multiple V or J gene assignments or junction lengths) that are not detectable with the junction-based distance method.

cola: an R/Bioconductor package for consensus partitioning through a general framework

2020-12-04T00:00

Classification of high-throughput genomic data is a powerful method to assign samples to subgroups with specific molecular profiles. Consensus partitioning is the most widely applied approach to reveal subgroups by summarizing a consensus classification from a list of individual classifications generated by repeatedly executing clustering on random subsets of the data. It is able to evaluate the stability of the classification. We implemented a new R/Bioconductor package, cola, that provides a general framework for consensus partitioning. With cola, various parameters and methods can be user-defined and easily integrated into different steps of an analysis, e.g., feature selection, sample classification or defining signatures. cola provides a new method named ATC (ability to correlate to other rows) to extract features and recommends spherical k-means clustering (skmeans) for subgroup classification. We show that ATC and skmeans have better performance than other commonly used methods by a comprehensive benchmark on public datasets. We also benchmark key parameters in the consensus partitioning procedure, which helps users to select optimal parameter values. Moreover, cola provides rich functionalities to apply multiple partitioning methods in parallel and directly compare their results, as well as rich visualizations. cola can automate the complete analysis and generates a comprehensive HTML report.

UniPath: a uniform approach for pathway and gene-set based analysis of heterogeneity in single-cell epigenome and transcriptome profiles

2020-12-04T00:00

Recent advances in single-cell open-chromatin and transcriptome profiling have created a challenge of exploring novel applications with a meaningful transformation of read-counts, which often have high variability in noise and drop-out among cells. Here, we introduce UniPath, for representing single-cells using pathway and gene-set enrichment scores by a transformation of their open-chromatin or gene-expression profiles. The robust statistical approach of UniPath provides high accuracy, consistency and scalability in estimating gene-set enrichment scores for every cell. Its framework provides an easy solution for handling variability in drop-out rate, which can sometimes create artefact due to systematic patterns. UniPath provides an alternative approach of dimension reduction of single-cell open-chromatin profiles. UniPath's approach of predicting temporal-order of single-cells using their pathway enrichment scores enables suppression of covariates to achieve correct order of cells. Analysis of mouse cell atlas using our approach yielded surprising, albeit biologically-meaningful co-clustering of cell-types from distant organs. By enabling an unconventional method of exploiting pathway co-occurrence to compare two groups of cells, our approach also proves to be useful in inferring context-specific regulations in cancer cells. Available at https://reggenlab.github.io/UniPathWeb/.

Entropy subspace separation-based clustering for noise reduction (ENCORE) of scRNA-seq data

2020-12-10T00:00

Single-cell RNA sequencing enables us to characterize the cellular heterogeneity in single cell resolution with the help of cell type identification algorithms. However, the noise inherent in single-cell RNA-sequencing data severely disturbs the accuracy of cell clustering, marker identification and visualization. We propose that clustering based on feature density profiles can distinguish informative features from noise. We named such strategy as ‘entropy subspace’ separation and designed a cell clustering algorithm called ENtropy subspace separation-based Clustering for nOise REduction (ENCORE) by integrating the ‘entropy subspace’ separation strategy with a consensus clustering method. We demonstrate that ENCORE performs superiorly on cell clustering and generates high-resolution visualization across 12 standard datasets. More importantly, ENCORE enables identification of group markers with biological significance from a hard-to-separate dataset. With the advantages of effective feature selection, improved clustering, accurate marker identification and high-resolution visualization, we present ENCORE to the community as an important tool for scRNA-seq data analysis to study cellular heterogeneity and discover group markers.

Decoding the epitranscriptional landscape from native RNA sequences

2020-07-25T00:00

Traditional epitranscriptomics relies on capturing a single RNA modification by antibody or chemical treatment, combined with short-read sequencing to identify its transcriptomic location. This approach is labor-intensive and may introduce experimental artifacts. Direct sequencing of native RNA using Oxford Nanopore Technologies (ONT) can allow for directly detecting the RNA base modifications, although these modifications might appear as sequencing errors. The percent Error of Specific Bases (%ESB) was higher for native RNA than unmodified RNA, which enabled the detection of ribonucleotide modification sites. Based on the %ESB differences, we developed a bioinformatic tool, epitranscriptional landscape inferring from glitches of ONT signals (ELIGOS), that is based on various types of synthetic modified RNA and applied to rRNA and mRNA. ELIGOS is able to accurately predict known classes of RNA methylation sites (AUC > 0.93) in rRNAs from Escherichiacoli, yeast, and human cells, using either unmodified in vitro transcription RNA or a background error model, which mimics the systematic error of direct RNA sequencing as the reference. The well-known DRACH/RRACH motif was localized and identified, consistent with previous studies, using differential analysis of ELIGOS to study the impact of RNA m⁶A methyltransferase by comparing wild type and knockouts in yeast and mouse cells. Lastly, the DRACH motif could also be identified in the mRNA of three human cell lines. The mRNA modification identified by ELIGOS is at the level of individual base resolution. In summary, we have developed a bioinformatic software package to uncover native RNA modifications.

Deploying MMEJ using MENdel in precision gene editing applications for gene therapy and functional genomics

2020-12-10T00:00

Gene-editing experiments commonly elicit the error-prone non-homologous end joining for DNA double-strand break (DSB) repair. Microhomology-mediated end joining (MMEJ) can generate more predictable outcomes for functional genomic and somatic therapeutic applications. We compared three DSB repair prediction algorithms – MENTHU, inDelphi, and Lindel – in identifying MMEJ-repaired, homogeneous genotypes (PreMAs) in an independent dataset of 5,885 distinct Cas9-mediated mouse embryonic stem cell DSB repair events. MENTHU correctly identified 46% of all PreMAs available, a ∼2- and ∼60-fold sensitivity increase compared to inDelphi and Lindel, respectively. In contrast, only Lindel correctly predicted predominant single-base insertions. We report the new algorithm MENdel, a combination of MENTHU and Lindel, that achieves the most predictive coverage of homogeneous out-of-frame mutations in this large dataset. We then estimated the frequency of Cas9-targetable homogeneous frameshift-inducing DSBs in vertebrate coding regions for gene discovery using MENdel. 47 out of 54 genes (87%) contained at least one early frameshift-inducing DSB and 49 out of 54 (91%) did so when also considering Cas12a-mediated deletions. We suggest that the use of MENdel helps researchers use MMEJ at scale for reverse genetics screenings and with sufficient intra-gene density rates to be viable for nearly all loss-of-function based gene editing therapeutic applications.

e-MutPath: computational modeling reveals the functional landscape of genetic mutations rewiring interactome networks

2020-11-19T00:00

Understanding the functional impact of cancer somatic mutations represents a critical knowledge gap for implementing precision oncology. It has been increasingly appreciated that the interaction profile mediated by a genomic mutation provides a fundamental link between genotype and phenotype. However, specific effects on biological signaling networks for the majority of mutations are largely unknown by experimental approaches. To resolve this challenge, we developed e-MutPath (edgetic Mutation-mediated Pathway perturbations), a network-based computational method to identify candidate ‘edgetic’ mutations that perturb functional pathways. e-MutPath identifies informative paths that could be used to distinguish disease risk factors from neutral elements and to stratify disease subtypes with clinical relevance. The predicted targets are enriched in cancer vulnerability genes, known drug targets but depleted for proteins associated with side effects, demonstrating the power of network-based strategies to investigate the functional impact and perturbation profiles of genomic mutations. Together, e-MutPath represents a robust computational tool to systematically assign functions to genetic mutations, especially in the context of their specific pathway perturbation effect.

Predicting regulatory variants using a dense epigenomic mapped CNN model elucidated the molecular basis of trait-tissue associations

2020-12-09T00:00

Assessing the causal tissues of human complex diseases is important for the prioritization of trait-associated genetic variants. Yet, the biological underpinnings of trait-associated variants are extremely difficult to infer due to statistical noise in genome-wide association studies (GWAS), and because >90% of genetic variants from GWAS are located in non-coding regions. Here, we collected the largest human epigenomic map from ENCODE and Roadmap consortia and implemented a deep-learning-based convolutional neural network (CNN) model to predict the regulatory roles of genetic variants across a comprehensive list of epigenomic modifications. Our model, called DeepFun, was built on DNA accessibility maps, histone modification marks, and transcription factors. DeepFun can systematically assess the impact of non-coding variants in the most functional elements with tissue or cell-type specificity, even for rare variants or de novo mutations. By applying this model, we prioritized trait-associated loci for 51 publicly-available GWAS studies. We demonstrated that CNN-based analyses on dense and high-resolution epigenomic annotations can refine important GWAS associations in order to identify regulatory loci from background signals, which yield novel insights for better understanding the molecular basis of human complex disease. We anticipate our approaches will become routine in GWAS downstream analysis and non-coding variant evaluation.

The proto-Nucleic Acid Builder: a software tool for constructing nucleic acid analogs

2020-12-09T00:00

The helical structures of DNA and RNA were originally revealed by experimental data. Likewise, the development of programs for modeling these natural polymers was guided by known structures. These nucleic acid polymers represent only two members of a potentially vast class of polymers with similar structural features, but that differ from DNA and RNA in the backbone or nucleobases. Xeno nucleic acids (XNAs) incorporate alternative backbones that affect the conformational, chemical, and thermodynamic properties of XNAs. Given the vast chemical space of possible XNAs, computational modeling of alternative nucleic acids can accelerate the search for plausible nucleic acid analogs and guide their rational design. Additionally, a tool for the modeling of nucleic acids could help reveal what nucleic acid polymers may have existed before RNA in the early evolution of life. To aid the development of novel XNA polymers and the search for possible pre-RNA candidates, this article presents the proto-Nucleic Acid Builder (https://github.com/GT-NucleicAcids/pnab), an open-source program for modeling nucleic acid analogs with alternative backbones and nucleobases. The torsion-driven conformation search procedure implemented here predicts structures with good accuracy compared to experimental structures, and correctly demonstrates the correlation between the helical structure and the backbone conformation in DNA and RNA.

Graphical Abstract

An artistic rendering of the proto-Nucleic Acid builder.

Optimized design of antisense oligomers for targeted rRNA depletion

2020-11-22T00:00

RNA sequencing (RNA-seq) is extensively used to quantify gene expression transcriptome-wide. Although often paired with polyadenylate (poly(A)) selection to enrich for messenger RNA (mRNA), many applications require alternate approaches to counteract the high proportion of ribosomal RNA (rRNA) in total RNA. Recently, digestion using RNaseH and antisense DNA oligomers tiling target rRNAs has emerged as an alternative to commercial rRNA depletion kits. Here, we present a streamlined, more economical RNaseH-mediated rRNA depletion with substantially lower up-front costs, using shorter antisense oligos only sparsely tiled along the target RNA in a 5-min digestion reaction. We introduce a novel Web tool, Oligo-ASST, that simplifies oligo design to target regions with optimal thermodynamic properties, and additionally can generate compact, common oligo pools that simultaneously target divergent RNAs, e.g. across different species. We demonstrate the efficacy of these strategies by generating rRNA-depletion oligos for Xenopus laevis and for zebrafish, which expresses two distinct versions of rRNAs during embryogenesis. The resulting RNA-seq libraries reduce rRNA to <5% of aligned reads, on par with poly(A) selection, and also reveal expression of many non-adenylated RNA species. Oligo-ASST is freely available at https://mtleelab.pitt.edu/oligo to design antisense oligos for any taxon or to target any abundant RNA for depletion.

Graphical Abstract

Oligo-ASST designs optimized antisense oligo sets to target RNAs for depletion.

TENET: gene network reconstruction using transfer entropy reveals key regulatory factors from single cell transcriptomic data

2020-11-10T00:00

Accurate prediction of gene regulatory rules is important towards understanding of cellular processes. Existing computational algorithms devised for bulk transcriptomics typically require a large number of time points to infer gene regulatory networks (GRNs), are applicable for a small number of genes and fail to detect potential causal relationships effectively. Here, we propose a novel approach ‘TENET’ to reconstruct GRNs from single cell RNA sequencing (scRNAseq) datasets. Employing transfer entropy (TE) to measure the amount of causal relationships between genes, TENET predicts large-scale gene regulatory cascades/relationships from scRNAseq data. TENET showed better performance than other GRN reconstructors, in identifying key regulators from public datasets. Specifically from scRNAseq, TENET identified key transcriptional factors in embryonic stem cells (ESCs) and during direct cardiomyocytes reprogramming, where other predictors failed. We further demonstrate that known target genes have significantly higher TE values, and TENET predicted higher TE genes were more influenced by the perturbation of their regulator. Using TENET, we identified and validated that Nme2 is a culture condition specific stem cell factor. These results indicate that TENET is uniquely capable of identifying key regulators from scRNAseq data.

Asymmetron: a toolkit for the identification of strand asymmetry patterns in biological sequences

2020-11-19T00:00

DNA strand asymmetries can have a major effect on several biological functions, including replication, transcription and transcription factor binding. As such, DNA strand asymmetries and mutational strand bias can provide information about biological function. However, a versatile tool to explore this does not exist. Here, we present Asymmetron, a user-friendly computational tool that performs statistical analysis and visualizations for the evaluation of strand asymmetries. Asymmetron takes as input DNA features provided with strand annotation and outputs strand asymmetries for consecutive occurrences of a single DNA feature or between pairs of features. We illustrate the use of Asymmetron by identifying transcriptional and replicative strand asymmetries of germline structural variant breakpoints. We also show that the orientation of the binding sites of 45% of human transcription factors analyzed have a significant DNA strand bias in transcribed regions, that is also corroborated in ChIP-seq analyses, and is likely associated with transcription. In summary, we provide a novel tool to assess DNA strand asymmetries and show how it can be used to derive new insights across a variety of biological disciplines.