Nova Reader - Subject

DeCompress: tissue compartment deconvolution of targeted mRNA expression panels using compressed sensing

2021-02-01T00:00

Targeted mRNA expression panels, measuring up to 800 genes, are used in academic and clinical settings due to low cost and high sensitivity for archived samples. Most samples assayed on targeted panels originate from bulk tissue comprised of many cell types, and cell-type heterogeneity confounds biological signals. Reference-free methods are used when cell-type-specific expression references are unavailable, but limited feature spaces render implementation challenging in targeted panels. Here, we present DeCompress, a semi-reference-free deconvolution method for targeted panels. DeCompress leverages a reference RNA-seq or microarray dataset from similar tissue to expand the feature space of targeted panels using compressed sensing. Ensemble reference-free deconvolution is performed on this artificially expanded dataset to estimate cell-type proportions and gene signatures. In simulated mixtures, four public cell line mixtures, and a targeted panel (1199 samples; 406 genes) from the Carolina Breast Cancer Study, DeCompress recapitulates cell-type proportions with less error than reference-free methods and finds biologically relevant compartments. We integrate compartment estimates into cis-eQTL mapping in breast cancer, identifying a tumor-specific cis-eQTL for CCR3 (C–C Motif Chemokine Receptor 3) at a risk locus. DeCompress improves upon reference-free methods without requiring expression profiles from pure cell populations, with applications in genomic analyses and clinical settings.

Nebula: ultra-efficient mapping-free structural variant genotyper

2021-01-27T00:00

Large-scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second- and third-generation whole-genome sequencing technologies. However, genotyping these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to specific types of variants and are generally prone to errors and ambiguities when genotyping complex events. We propose an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method, Nebula, uses changes in k-mer counts to predict the genotype of structural variants. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping structural variants, but also has accuracy comparable to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
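The k-mer counting idea can be illustrated with a small sketch. This is not Nebula's implementation: the function names, the toy alleles, the copy-number cut-offs (0.5 and 1.5) and the assumption that the expected per-haplotype k-mer depth is known are all hypothetical choices made for this example.

```python
from collections import Counter

def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_kmers(reads, k):
    """Count every k-mer observed across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

def genotype(reads, ref_allele, alt_allele, k, depth_per_copy):
    """Call 0/0, 0/1 or 1/1 from k-mers that occur only in the alt allele.

    depth_per_copy is the expected count of a k-mer carried by a single
    haplotype (assumed known here; a real tool would estimate it).
    """
    alt_only = set(kmers(alt_allele, k)) - set(kmers(ref_allele, k))
    if not alt_only:
        raise ValueError("no k-mers distinguish the two alleles")
    counts = count_kmers(reads, k)
    mean_count = sum(counts[km] for km in alt_only) / len(alt_only)
    copies = mean_count / depth_per_copy  # estimated number of alt haplotypes
    if copies < 0.5:
        return "0/0"
    if copies < 1.5:
        return "0/1"
    return "1/1"

# Toy deletion: the alt allele lacks the GGG present in the reference.
REF = "AAACCCGGGTTT"
ALT = "AAACCCTTT"
print(genotype([REF, ALT], REF, ALT, k=4, depth_per_copy=1))
```

No read is aligned to a reference here; only counts of allele-specific k-mers are used, which is what makes a counting approach cheap compared with mapping.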

A benchmark and an algorithm for detecting germline transposon insertions and measuring de novo transposon insertion frequencies

2021-01-28T00:00

Transposons are genomic parasites, and their new insertions can cause instability and spur the evolution of their host genomes. Rapid accumulation of short-read whole-genome sequencing data provides a great opportunity for studying new transposon insertions and their impacts on the host genome. Although many algorithms are available for detecting transposon insertions, the task remains challenging and existing tools are not designed for identifying de novo insertions. Here, we present a new benchmark fly dataset based on PacBio long-read sequencing and a new method TEMP2 for detecting germline insertions and measuring de novo ‘singleton’ insertion frequencies in eukaryotic genomes. TEMP2 achieves high sensitivity and precision for detecting germline insertions when compared with existing tools using both simulated data in fly and experimental data in fly and human. Furthermore, TEMP2 can accurately assess the frequencies of de novo transposon insertions even with high levels of chimeric reads in simulated datasets; such chimeric reads often occur during the construction of short-read sequencing libraries. By applying TEMP2 to published data on hybrid dysgenic flies inflicted by de-repressed P-elements, we confirmed the continuous new insertions of P-elements in dysgenic offspring before they regain piRNAs for P-element repression. TEMP2 is freely available at Github: https://github.com/weng-lab/TEMP2.

DM3Loc: multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism

2021-01-27T00:00

Subcellular localization of messenger RNAs (mRNAs) is a prevalent mechanism that gives precise and efficient control of the translation process. There is mounting evidence for the important roles of this process in a variety of cellular events. Computational methods for mRNA subcellular localization prediction provide a useful approach for studying mRNA functions. However, few computational methods have been designed for mRNA subcellular localization prediction, and their performance has room for improvement. In particular, no tool is yet available to predict localization for mRNAs that have multiple localization annotations. In this paper, we propose a multi-head self-attention method, DM3Loc, for multi-label mRNA subcellular localization prediction. Evaluation results show that DM3Loc generally outperforms existing methods and tools. Furthermore, DM3Loc is interpretable, allowing analysis of RNA-binding protein motifs and key signals on mRNAs for subcellular localization. Our analyses found hundreds of instances of mRNA isoform-specific subcellular localization and many significantly enriched gene functions for mRNAs in different subcellular localizations.

Single cell epigenetic visualization assay

2021-01-28T00:00

Characterization of the epigenetic status of individual cells remains a challenge. Current sequencing approaches have limited coverage, and it is difficult to assign an epigenetic status to the transcription state of individual gene alleles in the same cell. To address these limitations, a targeted microscopy-based epigenetic visualization assay (EVA) was developed for detection and quantification of epigenetic marks at genes of interest in single cells. The assay is based on an in situ biochemical reaction between an antibody-conjugated alkaline phosphatase bound to the epigenetic mark of interest, and a 5′-phosphorylated fluorophore-labeled DNA oligo tethered to a target gene by gene-specific oligonucleotides. When the epigenetic mark is present at the gene, phosphate group removal by the phosphatase protects the oligo from λ-exonuclease activity providing a quantitative fluorescent readout. We applied EVA to measure 5-methylcytosine (5mC) and H3K9Ac levels at different genes and the HIV-1 provirus in human cell lines. To link epigenetic marks to gene transcription, EVA was combined with RNA-FISH. Higher 5mC levels at the silenced compared to transcribed XIST gene alleles in female somatic cells validated this approach and demonstrated that EVA can be used to relate epigenetic marks to the transcription status of individual gene alleles.

Graphical Abstract

Epigenetic mark visualization at a gene of interest. Alkaline phosphatase (AP) is recruited to the epigenetic mark (purple) as an antibody conjugate. Gene specific oligonucleotides anchor phosphorylated sensor oligo (red) annealed to a detector oligo (green). When the epigenetic mark is present at the gene, the AP-dephosphorylated oligo survives subsequent λ-exonuclease treatment. The presence of the epigenetic mark at the gene is quantitated using the ratio of detector/sensor signal intensities (green/red).

Network controllability-based algorithm to target personalized driver genes for discovering combinatorial drugs of individual patients

2021-01-12T00:00

Multiple driver genes in individual patient samples may cause resistance to individual drugs in precision medicine. However, current computational methods have not addressed the gap between personalized driver gene identification and combinatorial drug discovery for individual patients. Here, we developed a novel structural network controllability-based personalized driver gene and combinatorial drug identification algorithm (CPGD), which aims to identify combinatorial drugs for an individual patient by targeting personalized driver genes from a network controllability perspective. On two benchmark disease datasets (i.e. breast cancer and lung cancer datasets), CPGD outperforms other state-of-the-art driver gene-focused methods in terms of discovery rate among previously known, clinically efficacious combinatorial drugs. On the breast cancer dataset in particular, CPGD evaluated the synergistic effect of pairwise drug combinations by measuring the synergistic effect of their corresponding personalized driver gene modules, which are affected by the set of personalized driver genes that the drugs target. The results showed that CPGD performs better than existing synergistic combinatorial strategies in identifying clinically efficacious paired combinatorial drugs. Furthermore, CPGD enhanced cancer subtyping by computationally providing personalized side effect signatures for individual patients. In addition, CPGD identified 90 candidate drug combinations from a SARS-CoV-2 dataset as potential drug repurposing candidates for the recently emerged COVID-19.

Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench

2021-02-01T00:00

As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although several computational methods can remove batch effects, evaluating which method performs best is not straightforward. Here, we present BatchBench (https://github.com/cellgeni/batchbench), a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. We apply BatchBench to eight methods, highlighting their methodological differences and assessing their performance and computational requirements across a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.

A novel all-in-one conditional knockout system uncovered an essential role of DDX1 in ribosomal RNA processing

2021-01-27T00:00

Generation of conditional knockout (cKO) and various gene-modified cells is laborious and time-consuming. Here, we established an all-in-one cKO system, which enables highly efficient generation of cKO cells and simultaneous gene modifications, including epitope tagging and reporter gene knock-in. We applied this system to mouse embryonic stem cells (ESCs) and generated RNA helicase Ddx1 cKO ESCs. The targeted cells displayed endogenous promoter-driven EGFP and FLAG-tagged DDX1 expression, and they were converted to Ddx1 KO via FLP recombinase. We further established TetFE ESCs, which carried a reverse tetracycline transactivator (rtTA) expression cassette and a tetracycline response element (TRE)-regulated FLPERT2 cassette in the Gt(ROSA26)Sor locus for instant and tightly regulated induction of gene KO. By utilizing TetFE Ddx1^F/F ESCs, we isolated highly pure Ddx1^F/F and Ddx1^−/− ESCs and found that loss of Ddx1 caused rRNA processing defects, thereby activating the ribosome stress–p53 pathway. We also demonstrated cKO of various genes in ESCs and homologous recombination-non-proficient human HT1080 cells. The frequency of cKO clones was remarkably high for both cell types and reached up to 96% when EGFP-positive clones were analyzed. This all-in-one cKO system will be a powerful tool for rapid and precise analyses of gene functions.

Solid-phase XRN1 reactions for RNA cleavage: application in single-molecule sequencing

2021-01-28T00:00

Modifications in RNA are numerous (∼170), far more than in DNA (∼5), which makes sequencing an RNA molecule to identify these modifications with next generation sequencing (NGS) highly challenging. Immobilizing an exoribonuclease, such as XRN1, on a solid support while maintaining its activity and its ability to cleave both canonical and modified ribonucleotides from an intact RNA molecule can be a viable approach for single-molecule RNA sequencing. In this study, we report an enzymatic reactor consisting of XRN1 covalently attached to a solid support as the groundwork for a novel RNA exosequencing technique. The covalent attachment of XRN1 to a plastic solid support was achieved using EDC/NHS coupling chemistry. Studies showed that the solid-phase digestion efficiency of model RNAs was 87.6 ± 2.8%, while the XRN1 solution-phase digestion efficiency for the same model was 78.3 ± 4.4%. The ability of immobilized XRN1 to digest methylated RNA containing m6A and m5C ribonucleotides was also demonstrated. Single-molecule fluorescence measurements of a single RNA transcript showed that immobilized XRN1 has a clipping rate of 26 ± 5 nt s⁻¹ and a processivity of >10.5 kb at 25°C.
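For a sense of scale, the reported clipping rate and processivity can be combined into an approximate digestion time. The short calculation below assumes a constant clipping rate and treats the ± 5 nt s⁻¹ spread as a simple range; both are simplifications made for this example, and the function name is hypothetical.

```python
def digestion_time_s(length_nt, rate_nt_per_s):
    """Seconds needed to clip length_nt nucleotides at a constant rate."""
    return length_nt / rate_nt_per_s

length_nt = 10_500          # lower bound on processivity (>10.5 kb)
for rate in (21, 26, 31):   # 26 ± 5 nt/s
    t = digestion_time_s(length_nt, rate)
    print(f"{rate} nt/s: {t:.0f} s ({t / 60:.1f} min)")
```

Under these assumptions a 10.5 kb transcript would be digested in roughly 5.6 to 8.3 minutes.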

The loopometer: a quantitative in vivo assay for DNA-looping proteins

2021-01-28T00:00

Proteins that can bring together separate DNA sites, either on the same or on different DNA molecules, are critical for a variety of DNA-based processes. However, there are no general and technically simple assays to detect proteins capable of DNA looping in vivo nor to quantitate their in vivo looping efficiency. Here, we develop a quantitative in vivo assay for DNA-looping proteins in Escherichia coli that requires only basic DNA cloning techniques and a LacZ assay. The assay is based on loop assistance, where two binding sites for the candidate looping protein are inserted internally to a pair of operators for the E. coli LacI repressor. DNA looping between the sites shortens the effective distance between the lac operators, increasing LacI looping and strengthening its repression of a lacZ reporter gene. Analysis based on a general model for loop assistance enables quantitation of the strength of looping conferred by the protein and its binding sites. We use this ‘loopometer’ assay to measure DNA looping for a variety of bacterial and phage proteins.

A method for characterizing Cas9 variants via a one-million target sequence library of self-targeting sgRNAs

2021-01-15T00:00

Detailed target-selectivity information and experiment-based efficacy prediction tools are primarily available for Streptococcus pyogenes Cas9 (SpCas9). One obstacle to developing such tools is the scarcity of accurate data. Here, we report a method termed ‘Self-targeting sgRNA Library Screen’ (SLS) for assaying the activity of Cas9 nucleases in bacteria using random target/sgRNA libraries of self-targeting sgRNAs. Exploiting more than a million different sequences, we demonstrate the use of the method with the SpCas9-HF1 variant to analyse its activity and reveal motifs that influence its target-selectivity. We have also developed an algorithm for predicting the activity of SpCas9-HF1 with an accuracy matching that of existing tools. SLS is a facile alternative to the much more expensive and laborious approaches currently in use and is capable of delivering sufficient data for most orthologs and variants of SpCas9.

A novel SHAPE reagent enables the analysis of RNA structure in living cells with unprecedented accuracy

2021-01-04T00:00

Due to the mounting evidence that RNA structure plays a critical role in regulating almost any physiological as well as pathological process, being able to accurately define the folding of RNA molecules within living cells has become a crucial need. We introduce here 2-aminopyridine-3-carboxylic acid imidazolide (2A3), as a general probe for the interrogation of RNA structures in vivo. 2A3 shows moderate improvements with respect to the state-of-the-art selective 2′-hydroxyl acylation analyzed by primer extension (SHAPE) reagent NAI on naked RNA under in vitro conditions, but it significantly outperforms NAI when probing RNA structure in vivo, particularly in bacteria, underlining its increased ability to permeate biological membranes. When used as a restraint to drive RNA structure prediction, data derived by SHAPE-MaP with 2A3 yields more accurate predictions than NAI-derived data. Due to its extreme efficiency and accuracy, we can anticipate that 2A3 will rapidly take over conventional SHAPE reagents for probing RNA structures both in vitro and in vivo.

Gene-specific mutagenesis enables rapid continuous evolution of enzymes in vivo

2021-01-06T00:00

Various in vivo mutagenesis methods have been developed to facilitate fast and efficient continuous evolution of proteins in cells. However, they either modify the DNA region that does not match the target gene, or suffer from low mutation rates. Here, we report a mutator, eMutaT7 (enhanced MutaT7), with very fast in vivo mutation rate and high gene-specificity in Escherichia coli. eMutaT7, a cytidine deaminase fused to an orthogonal RNA polymerase, can introduce up to ∼4 mutations per 1 kb per day, rivalling the rate in typical in vitro mutagenesis for directed evolution of proteins, and promotes rapid continuous evolution of model proteins for antibiotic resistance and allosteric activation. eMutaT7 provides a very simple and tunable method for continuous directed evolution of proteins, and suggests that the fusion of new DNA-modifying enzymes to the orthogonal RNA polymerase is a promising strategy to explore the expanded sequence space without compromising gene specificity.

Graphical Abstract

eMutaT7: a rapid in vivo mutagenesis method for continuous directed evolution of proteins.

Microbial single-strand annealing proteins enable CRISPR gene-editing tools with improved knock-in efficiencies and reduced off-target effects

2021-02-22T00:00

Several existing technologies enable short genomic alterations, including indels and short nucleotide variants; however, engineering larger genomic changes is more challenging because of reduced efficiency and precision. Here, we developed RecT Editor via Designer-Cas9-Initiated Targeting (REDIT), which leverages the phage single-stranded DNA-annealing protein (SSAP) RecT for mammalian genome engineering. Relative to Cas9-mediated homology-directed repair (HDR), REDIT yielded up to a 5-fold increase in the efficiency of inserting kilobase-scale exogenous sequences at defined genomic regions. We validated our REDIT approach using different formats and lengths of knock-in templates. We further demonstrated that REDIT tools using Cas9 nickase have efficient gene-editing activities and reduced off-target errors, measured using a combination of targeted sequencing, genome-wide indel, and insertion mapping assays. Our experiments inhibiting repair enzyme activities suggested that REDIT has the potential to overcome limitations of endogenous DNA repair steps. Finally, our REDIT method is applicable across cell types, including human stem cells, and is generalizable to different Cas9 enzymes.

SurVirus: a repeat-aware virus integration caller

2021-01-14T00:00

A significant portion of human cancers are due to viruses integrating into human genomes. Therefore, accurately predicting virus integrations can help uncover the mechanisms that lead to many devastating diseases. Virus integrations can be called by analysing second generation high-throughput sequencing datasets. Unfortunately, existing methods fail to report a significant portion of integrations, while predicting a large number of false positives. We observe that the inaccuracy is caused by incorrect alignment of reads in repetitive regions. False alignments create false positives, while missing alignments create false negatives. This paper proposes SurVirus, an improved virus integration caller that corrects the alignment of reads which are crucial for the discovery of integrations. We use publicly available datasets to show that existing methods predict hundreds of thousands of false positives; SurVirus, on the other hand, is significantly more precise while it also detects many novel integrations previously missed by other tools, most of which are in repetitive regions. We validate a subset of these novel integrations, and find that the majority are correct. Using SurVirus, we find that HPV and HBV integrations are enriched in LINE and Satellite regions which had been overlooked, as well as discover recurrent HBV and HPV breakpoints in human genome-virus fusion transcripts.

CoBold: a method for identifying different functional classes of transient RNA structure features that can impact RNA structure formation in vivo

2020-10-23T00:00

RNA structure formation in vivo happens co-transcriptionally while the transcript is being made. The corresponding co-transcriptional folding pathway typically involves transient RNA structure features that are not part of the final, functional RNA structure. These transient features can play important functional roles of their own and also influence the formation of the final RNA structure in vivo. We here present CoBold, a computational method for identifying different functional classes of transient RNA structure features that can either aid or hinder the formation of a known reference RNA structure. Our method takes as input either a single RNA or a corresponding multiple-sequence alignment as well as a known reference RNA secondary structure and identifies different classes of transient RNA structure features that could aid or prevent the formation of the given RNA structure. We make CoBold available via a web-server which includes dedicated data visualisation.

The CONJUDOR pipeline for multiplexed knockdown of gene pairs identifies RBBP-5 as a germ cell reprogramming barrier in C. elegans

2020-12-08T00:00

Multiple gene activities control complex biological processes such as cell fate specification during development and cellular reprogramming. Investigating the manifold gene functions in biological systems also requires simultaneous depletion of two or more gene activities. RNA interference-mediated knockdown (RNAi) is commonly used in Caenorhabditis elegans to assess essential genes, whose full knockout would otherwise lead to lethality or developmental arrest. RNAi is straightforward to apply by feeding worms with bacteria containing RNAi plasmids. However, the general approach of mixing bacterial RNAi clones to deplete two genes simultaneously often yields poor results. To address this issue, we developed a bacterial conjugation-mediated double RNAi technique, ‘CONJUDOR’. It allows RNAi bacteria to be combined for robust double RNAi with high throughput. To demonstrate the power of CONJUDOR for large-scale double RNAi screens, we conjugated RNAi against the histone chaperone gene lin-53 with more than 700 other chromatin factor genes. Thereby, we identified the Set1/MLL methyltransferase complex member RBBP-5 as a novel germ cell reprogramming barrier. Our findings demonstrate that CONJUDOR increases the efficiency and versatility of RNAi screens to examine interconnected biological processes in C. elegans with high throughput.

Graphical Abstract

CONJUDOR adapts bacterial conjugation to combine RNAi plasmids in bacteria for robust knock-down of multiple genes in the nematode C. elegans. CONJUDOR has been used for RNAi screening to deplete multiple gene pairs with high-throughput in order to identify barriers of germ cell reprogramming into GABAergic motor neurons. The Set1/MLL methyltransferase complex member RBBP-5 could be identified using CONJUDOR as a novel barrier for germ cell to neuron reprogramming.

CRISPRidentify: identification of CRISPR arrays using machine learning approach

2020-12-08T00:00

CRISPR–Cas are adaptive immune systems that degrade foreign genetic elements in archaea and bacteria. In carrying out their immune functions, CRISPR–Cas systems heavily rely on RNA components. These CRISPR (cr) RNAs are repeat-spacer units that are produced by processing of pre-crRNA, the transcript of CRISPR arrays, and guide Cas protein(s) to the cognate invading nucleic acids, enabling their destruction. Several bioinformatics tools have been developed to detect CRISPR arrays based solely on DNA sequences, but all these tools employ the same strategy of looking for repetitive patterns, which might correspond to CRISPR array repeats. The identified patterns are evaluated using a fixed, built-in scoring function, and arrays exceeding a cut-off value are reported. Here, we instead introduce a data-driven approach that uses machine learning to detect and differentiate true CRISPR arrays from false ones based on several features. Our CRISPR detection tool, CRISPRidentify, performs three steps: detection, feature extraction and classification based on manually curated sets of positive and negative examples of CRISPR arrays. The identified CRISPR arrays are then reported to the user accompanied by detailed annotation. We demonstrate that our approach identifies not only previously detected CRISPR arrays, but also CRISPR array candidates not detected by other tools. Compared to other methods, our tool has a drastically reduced false positive rate. In contrast to the existing tools, our approach not only provides the user with the basic statistics on the identified CRISPR arrays but also produces a certainty score as a practical measure of the likelihood that a given genomic region is a CRISPR array.

NOseq: amplicon sequencing evaluation method for RNA m⁶A sites after chemical deamination

2020-12-11T00:00

Methods for the detection of m⁶A by RNA-Seq technologies are increasingly sought after. We here present NOseq, a method to detect m⁶A residues in defined amplicons by virtue of their resistance to chemical deamination, effected by nitrous acid. Partial deamination in NOseq affects all exocyclic amino groups present in nucleobases and thus also changes sequence information. The method uses a mapping algorithm specifically adapted to the sequence degeneration caused by deamination events. Thus, m⁶A sites with partial modification levels of ∼50% were detected in defined amplicons, and this threshold can be lowered to ∼10% by combination with m⁶A immunoprecipitation. NOseq faithfully detected known m⁶A sites in human rRNA, and the long non-coding RNA MALAT1, and positively validated several m⁶A candidate sites, drawn from miCLIP data with an m⁶A antibody, in the transcriptome of Drosophila melanogaster. Conceptually related to bisulfite sequencing, NOseq presents a novel amplicon-based sequencing approach for the validation of m⁶A sites in defined sequences.
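The detection principle can be sketched as follows. The sketch is not the NOseq mapping algorithm: it assumes reads are already aligned to the amplicon without indels, that deaminated adenosine is read as G, and that a site is a candidate when its A retention exceeds the median retention of all A positions by an arbitrary margin. All names and thresholds are hypothetical.

```python
from statistics import median

def a_retention(reference, reads):
    """Fraction of reads that still show A at each reference A position."""
    retention = {}
    for pos, base in enumerate(reference):
        if base != "A":
            continue
        calls = [read[pos] for read in reads]
        retention[pos] = calls.count("A") / len(calls)
    return retention

def candidate_m6a_sites(reference, reads, margin=0.3):
    """Positions whose A retention exceeds the amplicon-wide median by margin."""
    retention = a_retention(reference, reads)
    if not retention:
        return []
    background = median(retention.values())
    return [pos for pos, frac in sorted(retention.items())
            if frac - background >= margin]

# Toy amplicon with A at positions 1, 3 and 5; only position 3 resists deamination.
REFERENCE = "GACATA"
READS = ["GGCATG", "GACATG", "GGTATA", "GGCATG"]
print(candidate_m6a_sites(REFERENCE, READS))
```

Because deamination is only partial, unmodified adenosines also retain some A signal, which is why the sketch compares each site against a background level instead of requiring complete conversion.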

Alignment free identification of clones in B cell receptor repertoires

2020-12-16T00:00

Following antigenic challenge, activated B cells rapidly expand and undergo somatic hypermutation, yielding groups of clonally related B cells with diversified immunoglobulin receptors. Inference of clonal relationships based on the receptor sequence is an essential step in many adaptive immune receptor repertoire sequencing studies. These relationships are typically identified by a multi-step process that involves: (i) grouping sequences based on shared V and J gene assignments, and junction lengths and (ii) clustering these sequences using a junction-based distance. However, this approach is sensitive to the initial gene assignments, which are error-prone, and fails to identify clonal relatives whose junction length has changed through accumulation of indels. Through defining a translation-invariant feature space in which we cluster the sequences, we develop an alignment free clonal identification method that does not require gene assignments and is not restricted to a fixed junction length. This alignment free approach has higher sensitivity compared to a typical junction-based distance method without loss of specificity and PPV. While the alignment free procedure identifies clones that are broadly consistent with the junction-based distance method, it also identifies clones with characteristics (multiple V or J gene assignments or junction lengths) that are not detectable with the junction-based distance method.
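One simple way to obtain a translation-invariant representation is to describe each sequence by its k-mer counts, which do not depend on where a motif occurs or on the total junction length. The sketch below uses that representation with cosine distance and single-linkage grouping. The abstract does not specify the feature space or clustering method, so these choices, the distance cut-off and the toy sequences are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

def kmer_profile(seq, k=3):
    """Position-independent k-mer counts of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_distance(p, q):
    """1 minus the cosine similarity of two k-mer count profiles."""
    dot = sum(p[km] * q[km] for km in p.keys() & q.keys())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / (norm_p * norm_q)

def cluster_clones(seqs, k=3, max_dist=0.5):
    """Single-linkage grouping: sequences within max_dist share a clone label."""
    profiles = [kmer_profile(s, k) for s in seqs]
    labels = list(range(len(seqs)))  # every sequence starts as its own clone
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if cosine_distance(profiles[i], profiles[j]) <= max_dist:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels

SEQS = [
    "TGTGCGAGAGATTAC",   # toy junction
    "TGTGCGAGAGGATTAC",  # same junction with a one-base insertion
    "TGTACCTTTGGCCAA",   # unrelated toy junction
]
print(cluster_clones(SEQS))
```

The first two toy sequences differ in length yet receive the same label, which is the situation a fixed-junction-length comparison cannot handle.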

NGS-based identification and tracing of microsatellite instability from minute amounts DNA using inter-Alu-PCR

2020-12-08T00:00

Sensitive detection of microsatellite instability (MSI) in tissue or liquid biopsies using next generation sequencing (NGS) has growing prognostic and predictive applications in cancer. However, the complexities of NGS make it cumbersome as compared to established multiplex-PCR detection of MSI. We present a new approach to detect MSI using inter-Alu-PCR followed by targeted NGS, that combines the practical advantages of multiplexed-PCR with the breadth of information provided by NGS. Inter-Alu-PCR employs poly-adenine repeats of variable length present in every Alu element and provides a massively-parallel, rapid approach to capture poly-A-rich genomic fractions within short 80–150bp amplicons generated from adjacent Alu-sequences. A custom-made software analysis tool, MSI-tracer, enables Alu-associated MSI detection from tissue biopsies or MSI-tracing at low-levels in circulating-DNA. MSI-associated indels at somatic-indel frequencies of 0.05–1.5% can be detected depending on the availability of matching normal tissue and the extent of instability. Due to the high Alu copy-number in human genomes, a single inter-Alu-PCR retrieves enough information for identification of MSI-associated-indels from ∼100 pg circulating-DNA, reducing current limits by ∼2-orders of magnitude and equivalent to circulating-DNA obtained from finger-sticks. The combined practical and informational advantages of inter-Alu-PCR make it a powerful tool for identifying tissue-MSI-status or tracing MSI-associated-indels in liquid biopsies.

cola: an R/Bioconductor package for consensus partitioning through a general framework

2020-12-04T00:00

Classification of high-throughput genomic data is a powerful method to assign samples to subgroups with specific molecular profiles. Consensus partitioning is the most widely applied approach to reveal subgroups by summarizing a consensus classification from a list of individual classifications generated by repeatedly executing clustering on random subsets of the data. It is able to evaluate the stability of the classification. We implemented a new R/Bioconductor package, cola, that provides a general framework for consensus partitioning. With cola, various parameters and methods can be user-defined and easily integrated into different steps of an analysis, e.g., feature selection, sample classification or defining signatures. cola provides a new method named ATC (ability to correlate to other rows) to extract features and recommends spherical k-means clustering (skmeans) for subgroup classification. We show that ATC and skmeans have better performance than other commonly used methods by a comprehensive benchmark on public datasets. We also benchmark key parameters in the consensus partitioning procedure, which helps users to select optimal parameter values. Moreover, cola provides rich functionalities to apply multiple partitioning methods in parallel and directly compare their results, as well as rich visualizations. cola can automate the complete analysis and generates a comprehensive HTML report.
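The core of consensus partitioning, as described above, is to cluster many random subsets and record how often each pair of samples lands in the same group. The following sketch shows that idea on one-dimensional toy data with a deliberately simple two-group clustering step. It is written in Python for illustration and does not use or mirror the cola package; the function names, the 80% subsampling fraction and the number of runs are assumptions.

```python
import random
from itertools import combinations

def two_means_1d(values):
    """Split 1-D values into two groups with Lloyd's algorithm, seeded at min and max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0] * len(values)
    while True:
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = sum(group0) / len(group0)
        new_hi = sum(group1) / len(group1)
        if (new_lo, new_hi) == (lo, hi):
            return labels
        lo, hi = new_lo, new_hi

def consensus_matrix(values, n_runs=50, fraction=0.8, seed=0):
    """For each sample pair, the share of co-sampled runs in which they clustered together."""
    rng = random.Random(seed)
    n = len(values)
    pairs = list(combinations(range(n), 2))
    together = dict.fromkeys(pairs, 0)
    sampled = dict.fromkeys(pairs, 0)
    size = max(2, int(fraction * n))
    for _ in range(n_runs):
        subset = sorted(rng.sample(range(n), size))
        labels = dict(zip(subset, two_means_1d([values[i] for i in subset])))
        for i, j in combinations(subset, 2):
            sampled[(i, j)] += 1
            together[(i, j)] += labels[i] == labels[j]
    return {p: together[p] / sampled[p] for p in pairs if sampled[p]}

# Two well-separated toy groups: samples 0-4 and samples 5-9.
VALUES = [0.0, 0.2, 0.4, 0.6, 0.8, 10.0, 10.2, 10.4, 10.6, 10.8]
CONSENSUS = consensus_matrix(VALUES)
```

Consensus values near 1 or 0 for every pair indicate a stable partition, while intermediate values flag samples whose assignment changes between subsets.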

Genome-wide integration site detection using Cas9 enriched amplification-free long-range sequencing

2020-12-08T00:00

The gene and cell therapy fields are advancing rapidly, with a potential to treat and cure a wide range of diseases, and lentivirus-based gene transfer agents are the vector of choice for many investigators. Early cases of insertional mutagenesis caused by gammaretroviral vectors highlighted that integration site (IS) analysis was a major safety and quality control checkpoint for lentiviral applications. The methods established to detect lentiviral integrations using next-generation sequencing (NGS) are limited by short read length, inadvertent PCR bias, low yield, or lengthy protocols. Here, we describe a new method to sequence IS using Amplification-free Integration Site sequencing (AFIS-Seq). AFIS-Seq is based on amplification-free, Cas9-mediated enrichment of high-molecular-weight chromosomal DNA suitable for long-range Nanopore MinION sequencing. This accessible and low-cost approach generates long reads enabling IS mapping with high certainty within a single day. We demonstrate proof-of-concept by mapping IS of lentiviral vectors in a variety of cell models and report up to 1600-fold enrichment of the signal. This method can be further extended to sequencing of Cas9-mediated integration of genes and to in vivo analysis of IS. AFIS-Seq uses long-read sequencing to facilitate safety evaluation of preclinical lentiviral vector gene therapies by providing IS analysis with improved confidence.

To mock or not: a comprehensive comparison of mock IP and DNA input for ChIP-seq

2020-12-21T00:00

Chromatin immunoprecipitation (IP) followed by sequencing (ChIP-seq) is the gold standard to detect transcription-factor (TF) binding sites in the genome. Its success depends on appropriate controls removing systematic biases. The predominantly used control, DNA input, corrects for uneven sonication, but not for nonspecific interactions of the IP antibody. Another type of control, ‘mock’ IP, corrects for both issues, but is not widely used because it is considered susceptible to technical noise. The tradeoff between the two control types has not been investigated systematically. Therefore, we generated comparable DNA input and mock IP experiments. Because mock IPs contain only nonspecific interactions, the sites predicted from them using DNA input indicate the spurious-site abundance. This abundance is highly correlated with the ‘genomic activity’ (e.g. chromatin openness). In particular, compared to cell lines, complex samples such as whole organisms have more spurious sites, probably because they contain multiple cell types, resulting in more expressed genes and more open chromatin. Consequently, DNA input and mock IP controls performed similarly for cell lines, whereas for complex samples, mock IP substantially reduced the number of spurious sites. However, DNA input is still informative; thus, we developed a simple framework integrating both controls, improving binding site detection.

UniPath: a uniform approach for pathway and gene-set based analysis of heterogeneity in single-cell epigenome and transcriptome profiles

2020-12-04T00:00

Recent advances in single-cell open-chromatin and transcriptome profiling have created the challenge of exploring novel applications through a meaningful transformation of read-counts, which often show high variability in noise and drop-out among cells. Here, we introduce UniPath, which represents single cells using pathway and gene-set enrichment scores obtained by transforming their open-chromatin or gene-expression profiles. The robust statistical approach of UniPath provides high accuracy, consistency and scalability in estimating gene-set enrichment scores for every cell. Its framework provides an easy solution for handling variability in drop-out rate, which can sometimes create artefacts due to systematic patterns. UniPath provides an alternative approach to dimension reduction of single-cell open-chromatin profiles. UniPath's approach of predicting the temporal order of single cells using their pathway enrichment scores enables suppression of covariates to achieve the correct order of cells. Analysis of the mouse cell atlas using our approach yielded surprising, albeit biologically meaningful, co-clustering of cell types from distant organs. By enabling an unconventional method of exploiting pathway co-occurrence to compare two groups of cells, our approach also proves to be useful in inferring context-specific regulations in cancer cells. Available at https://reggenlab.github.io/UniPathWeb/.

Parallel monitoring of RNA abundance, localization and compactness with correlative single molecule FISH on LR White embedded samples

2020-12-04T00:00

Single mRNA molecules are frequently detected by single molecule fluorescence in situ hybridization (smFISH) using branched DNA technology. While providing strong and background-reduced signals, the method is inefficient in detecting mRNAs within dense structures, in monitoring mRNA compactness and in quantifying abundant mRNAs. To overcome these limitations, we have hybridized slices of high pressure frozen, freeze-substituted and LR White embedded cells (LR White smFISH). mRNA detection is physically restricted to the surface of the resin. This enables single molecule detection of RNAs with accuracy comparable to RNA sequencing, irrespective of their abundance, while at the same time providing spatial information on RNA localization that can be complemented with immunofluorescence and electron microscopy, as well as array tomography. Moreover, LR White embedding restricts the number of available probe pair recognition sites for each mRNA to a small subset. As a consequence, differences in signal intensities between RNA populations reflect differences in RNA structures, and we show that the method can be employed to determine mRNA compactness. We apply the method to answer some outstanding questions related to trans-splicing, RNA granules and mitochondrial RNA editing in single-cellular trypanosomes and we show an example of differential gene expression in the metazoan Caenorhabditis elegans.

Entropy subspace separation-based clustering for noise reduction (ENCORE) of scRNA-seq data

2020-12-10T00:00

Single-cell RNA sequencing enables us to characterize cellular heterogeneity at single-cell resolution with the help of cell type identification algorithms. However, the noise inherent in single-cell RNA-sequencing data severely impairs the accuracy of cell clustering, marker identification and visualization. We propose that clustering based on feature density profiles can distinguish informative features from noise. We name this strategy ‘entropy subspace’ separation and designed a cell clustering algorithm called ENtropy subspace separation-based Clustering for nOise REduction (ENCORE) by integrating the ‘entropy subspace’ separation strategy with a consensus clustering method. We demonstrate that ENCORE achieves superior cell clustering and generates high-resolution visualization across 12 standard datasets. More importantly, ENCORE enables identification of group markers with biological significance from a hard-to-separate dataset. With the advantages of effective feature selection, improved clustering, accurate marker identification and high-resolution visualization, we present ENCORE to the community as an important tool for scRNA-seq data analysis to study cellular heterogeneity and discover group markers.

Customized optical mapping by CRISPR–Cas9 mediated DNA labeling with multiple sgRNAs

2020-11-24T00:00

Whole-genome mapping technologies have been developed as a complementary tool to provide scaffolds for genome assembly and structural variation analysis (1,2). We recently introduced a novel DNA labeling strategy based on a CRISPR–Cas9 genome editing system, which can target any 20 bp sequence. The labeling strategy is specifically useful in targeting repetitive sequences, and sequences not accessible to other labeling methods. In this report, we present customized mapping strategies that extend the applications of CRISPR–Cas9 DNA labeling. We first design a CRISPR–Cas9 labeling strategy to interrogate and differentiate the single allele differences in NGG protospacer adjacent motifs (PAM sequence). Combined with sequence motif labeling, we can pinpoint the single-base differences in highly conserved sequences. In the second strategy, we design mapping patterns across a genome by selecting sets of specific single-guide RNAs (sgRNAs) for labeling multiple loci of a genomic region or a whole genome. By developing and optimizing a single tube synthesis of multiple sgRNAs, we demonstrate the utility of CRISPR–Cas9 mapping with 162 sgRNAs targeting the 2 Mb Haemophilus influenzae chromosome. These CRISPR–Cas9 mapping approaches could be particularly useful for applications in defining long-distance haplotypes and pinpointing the breakpoints in large structural variants in complex genomes and microbial mixtures.
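As a rough illustration of the targeting rule mentioned above (a 20 bp target followed by an NGG PAM), the sketch below scans the forward strand of a sequence for candidate sites and shows how a single-base change in the PAM removes a site. The function name `find_cas9_sites` and the toy sequences are hypothetical; the reverse strand, sgRNA design and off-target scoring are deliberately ignored.

```python
import re

# A lookahead lets overlapping candidate sites be reported.
SITE_PATTERN = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")

def find_cas9_sites(sequence):
    """Return (start, protospacer, pam) for each 20-nt target followed by NGG.

    Forward strand only; a simplification for illustration.
    """
    seq = sequence.upper()
    return [(m.start(), m.group(1), m.group(2)) for m in SITE_PATTERN.finditer(seq)]

# Toy alleles (hypothetical): identical except for the last base of the PAM.
allele_with_pam = "A" * 20 + "TGG"
allele_without_pam = "A" * 20 + "TGA"

sites_with_pam = find_cas9_sites(allele_with_pam)
sites_without_pam = find_cas9_sites(allele_without_pam)
```

In this toy case only the first allele yields a candidate site, which is the kind of allele-specific difference a PAM-based labeling strategy relies on.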

Decoding the epitranscriptional landscape from native RNA sequences

2020-07-25T00:00

Traditional epitranscriptomics relies on capturing a single RNA modification by antibody or chemical treatment, combined with short-read sequencing to identify its transcriptomic location. This approach is labor-intensive and may introduce experimental artifacts. Direct sequencing of native RNA using Oxford Nanopore Technologies (ONT) can allow for directly detecting the RNA base modifications, although these modifications might appear as sequencing errors. The percent Error of Specific Bases (%ESB) was higher for native RNA than unmodified RNA, which enabled the detection of ribonucleotide modification sites. Based on the %ESB differences, we developed a bioinformatic tool, epitranscriptional landscape inferring from glitches of ONT signals (ELIGOS), that is based on various types of synthetic modified RNA and applied to rRNA and mRNA. ELIGOS is able to accurately predict known classes of RNA methylation sites (AUC > 0.93) in rRNAs from Escherichia coli, yeast, and human cells, using either unmodified in vitro transcription RNA or a background error model, which mimics the systematic error of direct RNA sequencing as the reference. The well-known DRACH/RRACH motif was localized and identified, consistent with previous studies, using differential analysis of ELIGOS to study the impact of RNA m⁶A methyltransferase by comparing wild type and knockouts in yeast and mouse cells. Lastly, the DRACH motif could also be identified in the mRNA of three human cell lines. ELIGOS identifies mRNA modifications at single-base resolution. In summary, we have developed a bioinformatic software package to uncover native RNA modifications.

CoolMPS: evaluation of antibody labeling based massively parallel non-coding RNA sequencing

2020-12-08T00:00

Results of massively parallel sequencing-by-synthesis vary depending on the sequencing approach. CoolMPS™ is a new sequencing chemistry that incorporates bases by labeled antibodies. To evaluate the performance, we sequenced 240 human non-coding RNA samples (dementia patients and controls) with and without CoolMPS. The Q30 value as indicator of the per base sequencing quality increased from 91.8 to 94%. The higher quality was reached across the whole read length. Likewise, the percentage of reads mapping to the human genome increased from 84.9 to 86.2%. For both technologies, we computed similar distributions between different RNA classes (miRNA, piRNA, tRNA, snoRNA and yRNA) and within the classes. While standard sequencing-by-synthesis recovered more annotated miRNAs, CoolMPS yielded more novel miRNAs. The correlation between the two methods was 0.97. Evaluating the diagnostic performance, we observed lower minimal P-values for CoolMPS (adjusted P-value of 0.0006 versus 0.0004) and larger effect sizes (Cohen's d of 0.878 versus 0.9). Validating 19 miRNAs resulted in a correlation of 0.852 between CoolMPS and reverse transcriptase-quantitative polymerase chain reaction. Comparison to data generated with Illumina technology confirmed a known shift in the overall RNA composition. With CoolMPS we evaluated a novel sequencing-by-synthesis technology showing high performance for the analysis of non-coding RNAs.

Translation elongation rate varies among organs and decreases with age

2020-12-02T00:00

There has been a surge of interest in targeting protein synthesis to treat diseases and extend lifespan. Despite the progress, few options are available to assess translation in live animals, as their complexity limits the repertoire of experimental tools to monitor and manipulate processes within organs and individual cells. In this study, we developed a labeling-free method for measuring organ- and cell-type-specific translation elongation rates in vivo. It is based on time-resolved delivery of translation initiation and elongation inhibitors in live animals followed by ribosome profiling. It also reports translation initiation sites in an organ-specific manner. Using this method, we found that the elongation rates differ by more than 50% among mouse organs and determined them to be 6.8, 5.0 and 4.3 amino acids per second for liver, kidney, and skeletal muscle, respectively. We further found that the elongation rate is reduced by 20% between young adulthood and mid-life. Thus, translation, a major metabolic process in cells, is tightly regulated at the level of elongation of nascent polypeptide chains.
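The reported rates can be checked with a few lines of arithmetic. The sketch below computes the relative difference between the fastest and slowest organ and, purely as an illustration, the time needed to elongate a protein of a given length; the 400-residue length is a hypothetical value, not a figure from the study.

```python
# Elongation rates quoted in the abstract (amino acids per second).
rates_aa_per_s = {"liver": 6.8, "kidney": 5.0, "skeletal muscle": 4.3}

fastest = max(rates_aa_per_s, key=rates_aa_per_s.get)
slowest = min(rates_aa_per_s, key=rates_aa_per_s.get)

# Relative difference of the fastest organ over the slowest one.
relative_difference = (
    rates_aa_per_s[fastest] - rates_aa_per_s[slowest]
) / rates_aa_per_s[slowest]

# Hypothetical protein length, chosen only to make the rates tangible.
PROTEIN_LENGTH_AA = 400
seconds_per_protein = {
    organ: PROTEIN_LENGTH_AA / rate for organ, rate in rates_aa_per_s.items()
}

print(f"{fastest} vs {slowest}: {relative_difference:.0%} faster")
for organ, seconds in seconds_per_protein.items():
    print(f"{organ}: {seconds:.1f} s for {PROTEIN_LENGTH_AA} residues")
```

Liver elongation comes out roughly 58% faster than skeletal muscle, consistent with the stated difference of more than 50%.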

A fast and robust iterative genome-editing method based on a Rock-Paper-Scissors strategy

2020-12-03T00:00

The production of optimized strains of a specific phenotype requires the construction and testing of a large number of genome modifications and combinations thereof. Most bacterial iterative genome-editing methods include essential steps to eliminate selection markers, or to cure plasmids. Additionally, the presence of escapers leads to time-consuming separate single clone picking and subsequent cultivation steps. Herein, we report a genome-editing method based on a Rock-Paper-Scissors (RPS) strategy. Each of three constructed sgRNA plasmids can cure, or be cured by, the other two plasmids in the system; plasmids from a previous round of editing can be cured while the current round of editing takes place. Due to the enhanced curing efficiency and embedded double check mechanism, separate steps for plasmid curing or confirmation are not necessary, and only two cultivation steps are needed per genome-editing round. This method was successfully demonstrated in Escherichia coli and Klebsiella pneumoniae with both gene deletions and replacements. To the best of our knowledge, this is the fastest and most robust iterative genome-editing method, and it requires the fewest cultivation steps, which reduces the possibility of spontaneous genome mutations.

CoolMPS for robust sequencing of single-nuclear RNAs captured by droplet-based method

2020-12-02T00:00

Massively-parallel single-cell and single-nucleus RNA sequencing (scRNA-seq, snRNA-seq) requires extensive sequencing to achieve proper per-cell coverage, making sequencing resources and availability of sequencers critical factors for conducting deep transcriptional profiling. CoolMPS is a novel sequencing-by-synthesis approach that relies on nucleotide labeling by re-usable antibodies, but whether it is applicable to snRNA-seq has not been tested. Here, we use a low-cost and off-the-shelf protocol to chemically convert libraries generated with the widely-used Chromium 10X technology to be sequenceable with CoolMPS technology. To assess the quality and performance of converted libraries sequenced with CoolMPS, we generated a snRNA-seq dataset from the hippocampus of young and old mice. Native libraries were sequenced on an Illumina Novaseq and libraries that were converted to be compatible with CoolMPS were sequenced on a DNBSEQ-400RS. CoolMPS-derived data faithfully replicated key characteristics of the native library dataset, including correct estimation of ambient RNA-contamination, detection of captured cells, cell clustering results, spatial marker gene expression, inter- and intra-replicate differences and gene expression changes during aging. In conclusion, our results show that CoolMPS provides a viable alternative to standard sequencing of RNA from droplet-based libraries.

A simplified strategy for titrating gene expression reveals new relationships between genotype, environment, and bacterial growth

2020-11-22T00:00

A lack of high-throughput techniques for making titrated, gene-specific changes in expression limits our understanding of the relationship between gene expression and cell phenotype. Here, we present a generalizable approach for quantifying growth rate as a function of titrated changes in gene expression level. The approach works by performing CRISPRi with a series of mutated single guide RNAs (sgRNAs) that modulate gene expression. To evaluate sgRNA mutation strategies, we constructed a library of 5927 sgRNAs targeting 88 genes in Escherichia coli MG1655 and measured the effects on growth rate. We found that a compounding mutational strategy, through which mutations are incrementally added to the sgRNA, presented a straightforward way to generate a monotonic and gradated relationship between mutation number and growth rate effect. We also implemented molecular barcoding to detect and correct for mutations that ‘escape’ the CRISPRi targeting machinery; this strategy unmasked deleterious growth rate effects obscured by the standard approach of ignoring escapers. Finally, we performed controlled environmental variations and observed that many gene-by-environment interactions go completely undetected at the limit of maximum knockdown, but instead manifest at intermediate expression perturbation strengths. Overall, our work provides an experimental platform for quantifying the phenotypic response to gene expression variation.

e-MutPath: computational modeling reveals the functional landscape of genetic mutations rewiring interactome networks

2020-11-19T00:00

Understanding the functional impact of cancer somatic mutations represents a critical knowledge gap for implementing precision oncology. It has been increasingly appreciated that the interaction profile mediated by a genomic mutation provides a fundamental link between genotype and phenotype. However, for the majority of mutations, the specific effects on biological signaling networks have not been determined experimentally. To resolve this challenge, we developed e-MutPath (edgetic Mutation-mediated Pathway perturbations), a network-based computational method to identify candidate ‘edgetic’ mutations that perturb functional pathways. e-MutPath identifies informative paths that could be used to distinguish disease risk factors from neutral elements and to stratify disease subtypes with clinical relevance. The predicted targets are enriched in cancer vulnerability genes and known drug targets but depleted for proteins associated with side effects, demonstrating the power of network-based strategies to investigate the functional impact and perturbation profiles of genomic mutations. Together, e-MutPath represents a robust computational tool to systematically assign functions to genetic mutations, especially in the context of their specific pathway perturbation effect.

Optimized design of antisense oligomers for targeted rRNA depletion

2020-11-22T00:00

RNA sequencing (RNA-seq) is extensively used to quantify gene expression transcriptome-wide. Although often paired with polyadenylate (poly(A)) selection to enrich for messenger RNA (mRNA), many applications require alternate approaches to counteract the high proportion of ribosomal RNA (rRNA) in total RNA. Recently, digestion using RNaseH and antisense DNA oligomers tiling target rRNAs has emerged as an alternative to commercial rRNA depletion kits. Here, we present a streamlined, more economical RNaseH-mediated rRNA depletion with substantially lower up-front costs, using shorter antisense oligos only sparsely tiled along the target RNA in a 5-min digestion reaction. We introduce a novel Web tool, Oligo-ASST, that simplifies oligo design to target regions with optimal thermodynamic properties, and additionally can generate compact, common oligo pools that simultaneously target divergent RNAs, e.g. across different species. We demonstrate the efficacy of these strategies by generating rRNA-depletion oligos for Xenopus laevis and for zebrafish, which expresses two distinct versions of rRNAs during embryogenesis. The resulting RNA-seq libraries reduce rRNA to <5% of aligned reads, on par with poly(A) selection, and also reveal expression of many non-adenylated RNA species. Oligo-ASST is freely available at https://mtleelab.pitt.edu/oligo to design antisense oligos for any taxon or to target any abundant RNA for depletion.

Graphical Abstract

Oligo-ASST designs optimized antisense oligo sets to target RNAs for depletion.

TENET: gene network reconstruction using transfer entropy reveals key regulatory factors from single cell transcriptomic data

2020-11-10T00:00

Accurate prediction of gene regulatory rules is important for understanding cellular processes. Existing computational algorithms devised for bulk transcriptomics typically require a large number of time points to infer gene regulatory networks (GRNs), are applicable only to a small number of genes and fail to detect potential causal relationships effectively. Here, we propose a novel approach, ‘TENET’, to reconstruct GRNs from single cell RNA sequencing (scRNAseq) datasets. Employing transfer entropy (TE) to measure the strength of causal relationships between genes, TENET predicts large-scale gene regulatory cascades/relationships from scRNAseq data. TENET showed better performance than other GRN reconstructors in identifying key regulators from public datasets. Specifically, from scRNAseq data, TENET identified key transcription factors in embryonic stem cells (ESCs) and during direct cardiomyocyte reprogramming, where other predictors failed. We further demonstrate that known target genes have significantly higher TE values, and that genes with higher TENET-predicted TE values were more influenced by the perturbation of their regulator. Using TENET, we identified and validated that Nme2 is a culture condition specific stem cell factor. These results indicate that TENET is uniquely capable of identifying key regulators from scRNAseq data.
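To make the transfer entropy idea concrete, here is a minimal plug-in estimator for two discrete series with a history length of one. It is a sketch under stated assumptions, not TENET's implementation: the function name `transfer_entropy`, the binary toy series and the history length are illustrative choices, and real expression data would first need ordering (for example along pseudotime) and a suitable density estimate.

```python
import random
from collections import Counter
from math import log2

def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, history length 1.

    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) ),
    where y1 is the next target value and y0, x0 are current values.
    """
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    n = sum(triples.values())
    next_and_past = Counter()   # counts of (y1, y0)
    past_and_source = Counter() # counts of (y0, x0)
    past = Counter()            # counts of y0
    for (y1, y0, x0), count in triples.items():
        next_and_past[(y1, y0)] += count
        past_and_source[(y0, x0)] += count
        past[y0] += count
    te = 0.0
    for (y1, y0, x0), count in triples.items():
        p_given_both = count / past_and_source[(y0, x0)]
        p_given_past = next_and_past[(y1, y0)] / past[y0]
        te += (count / n) * log2(p_given_both / p_given_past)
    return te

# Toy series (hypothetical): y copies x with a one-step delay, so x drives y.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(500)]
y = [0] + x[:-1]

te_x_to_y = transfer_entropy(x, y)
te_y_to_x = transfer_entropy(y, x)
```

In this toy setup the estimate from x to y is close to one bit while the reverse direction stays near zero, which is the asymmetry used to orient a regulator-target edge.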

Asymmetron: a toolkit for the identification of strand asymmetry patterns in biological sequences

2020-11-19T00:00

DNA strand asymmetries can have a major effect on several biological functions, including replication, transcription and transcription factor binding. As such, DNA strand asymmetries and mutational strand bias can provide information about biological function. However, a versatile tool to explore this does not exist. Here, we present Asymmetron, a user-friendly computational tool that performs statistical analysis and visualizations for the evaluation of strand asymmetries. Asymmetron takes as input DNA features provided with strand annotation and outputs strand asymmetries for consecutive occurrences of a single DNA feature or between pairs of features. We illustrate the use of Asymmetron by identifying transcriptional and replicative strand asymmetries of germline structural variant breakpoints. We also show that the orientation of the binding sites of 45% of human transcription factors analyzed has a significant DNA strand bias in transcribed regions, a bias that is also corroborated in ChIP-seq analyses and is likely associated with transcription. In summary, we provide a novel tool to assess DNA strand asymmetries and show how it can be used to derive new insights across a variety of biological disciplines.
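The notion of strand asymmetry for consecutive occurrences of one feature can be sketched as follows: count how often neighbouring features share a strand, then test the count against an even split. This is an illustrative assumption about how such a statistic might be computed, not Asymmetron's code or command-line interface; the function names and the toy strand list are hypothetical.

```python
from math import comb

def consecutive_orientation_counts(strands):
    """Count neighbouring feature pairs on the same vs. opposite strand.

    `strands` is a list of '+' / '-' labels for features sorted by position.
    """
    same = sum(a == b for a, b in zip(strands, strands[1:]))
    opposite = len(strands) - 1 - same
    return same, opposite

def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value (sum of outcomes no more likely than observed)."""
    probs = [comb(trials, i) * p ** i * (1 - p) ** (trials - i)
             for i in range(trials + 1)]
    observed = probs[successes]
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))

# Toy example (hypothetical): ten features along one chromosome.
strands = list("++++----++")
same, opposite = consecutive_orientation_counts(strands)
p_value = binomial_two_sided_p(same, same + opposite)
```

Here seven of nine neighbouring pairs share a strand; with so few features the exact test does not call this significant, which illustrates why many occurrences are needed before a strand bias can be claimed.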
