Nucleic Acids Research


CSVS, a crowdsourcing database of the Spanish population genetic variability

María Peña-Chilet, Gema Roldán, Javier Perez-Florido, Francisco M Ortuño, Rosario Carmona, Virginia Aquino, Daniel Lopez-Lopez, Carlos Loucera, Jose L Fernandez-Rueda, Asunción Gallego, Francisco García-Garcia, Anna González-Neira, Guillermo Pita, Rocío Núñez-Torres, Javier Santoyo-López,


https://doi.org/10.1093/nar/gkaa794, Volume: 49, Issue: D1, Pages: 1-8

Article Type: Research Article

Publisher: Oxford University Press



Abstract

The knowledge of the genetic variability of the local population is of utmost importance in personalized medicine and has been revealed as a critical factor for the discovery of new disease variants. Here, we present the Collaborative Spanish Variability Server (CSVS), which currently contains more than 2000 genomes and exomes of unrelated Spanish individuals. This database has been generated in a collaborative crowdsourcing effort collecting sequencing data produced by local genomic projects and for other purposes. Sequences have been grouped by ICD10 upper categories. A web interface allows querying the database removing one or more ICD10 categories. In this way, aggregated counts of allele frequencies of the pseudo-control Spanish population can be obtained for diseases belonging to the category removed. Interestingly, in addition to pseudo-control studies, some population studies can be made, as, for example, prevalence of pharmacogenomic variants, etc. In addition, this genomic data has been used to define the first Spanish Genome Reference Panel (SGRP1.0) for imputation. This is the first local repository of variability entirely produced by a crowdsourcing effort and constitutes an example for future initiatives to characterize local variability worldwide. CSVS is also part of the GA4GH Beacon network.

CSVS can be accessed at: http://csvs.babelomics.org/.

Peña-Chilet, Roldán, Perez-Florido, Ortuño, Carmona, Aquino, Lopez-Lopez, Loucera, Fernandez-Rueda, Gallego, García-Garcia, González-Neira, Pita, Núñez-Torres, Santoyo-López, Ayuso, Minguez, Avila-Fernandez, Corton, Moreno-Pelayo, Morin, Gallego-Martinez, Lopez-Escamez, Borrego, Antiñolo, Amigo, Salgado-Garrido, Pasalodos-Sanchez, Morte, The Spanish Exome Crowdsourcing Consortium (Fátima Al-Shahrour, Rafael Artuch, Javier Benitez, Luis Antonio Castaño, Ignacio del Castillo, Aitor Delmiro, Carmina Espinos, Roser González, Daniel Grinberg, Encarnación Guillén, Pablo Lapunzina, Esther Lopez, Ramón Martí, Montserrat Milá, José Mª Millán, Virginia Nunes, Francesc Palau, Belen Perez, Luis Pérez Jurado, Rosario Perona, Aurora Pujol, Feliciano Ramos, Antonia Ribes, Jordi Rosell, Eulalia Rovira, Jordi Surrallés, Isabel Tejada, Magdalena Ugarte), Carracedo, Alonso, and Dopazo: CSVS, a crowdsourcing database of the Spanish population genetic variability

INTRODUCTION

Sequencing technologies have experienced an unprecedented development during the last decade (1), which resulted in different international collaborative projects (2–4) that contributed to an extraordinary increase in the knowledge of the mutational spectrum of diseases. This generation of knowledge has been especially significant in diseases with high morbidity and mortality, caused by highly penetrant (typically protein-coding) variants (5,6). In fact, more than 4500 monogenic diseases can nowadays be directly diagnosed by personalized genomics (7), a possibility that might soon be extended to the whole spectrum of rare diseases with a genetic background (8). Among the strategies used to discover new disease variants, especially in monogenic disorders, frequency-based filtering has proven to be a very useful tool (9). The rationale is as follows: variants that are relatively common in a control population (common variation) are likely benign (10), while rare variants (especially if they have functional consequences) found in multiple affected cases but absent in the control population are likely to cause disease (11–13). These filters search for genes or variants present in all (or most) affected individuals but in none (or very few) of the unaffected control individuals. Therefore, the availability of healthy controls is a decisive factor for progress in the discovery of new disease determinants.
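The filtering rationale described above can be illustrated with a minimal Python sketch. The `Variant` record, its field names and both thresholds are hypothetical choices made for illustration only; they are not taken from the CSVS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A candidate variant with hypothetical case and control summaries."""
    name: str
    cases_carrying: int       # affected individuals carrying the variant
    total_cases: int          # affected individuals sequenced
    control_frequency: float  # allele frequency in the control population


def frequency_filter(variants, max_control_freq=0.005, min_case_fraction=0.8):
    """Keep variants seen in most cases but rare or absent in controls.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for v in variants:
        case_fraction = v.cases_carrying / v.total_cases
        if case_fraction >= min_case_fraction and v.control_frequency <= max_control_freq:
            kept.append(v.name)
    return kept


candidates = [
    Variant("var_common", cases_carrying=5, total_cases=5, control_frequency=0.12),
    Variant("var_rare_shared", cases_carrying=5, total_cases=5, control_frequency=0.0),
    Variant("var_rare_single", cases_carrying=1, total_cases=5, control_frequency=0.001),
]
print(frequency_filter(candidates))
```

Only the variant shared by all cases and absent from controls survives the filter, which is why the size and locality of the control set matter so much.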

From a historical perspective, the 1000 Genomes Project produced the first comprehensive catalogue of common human genetic variation (14). However, it is known that low-frequency (minor allele frequency, MAF, under 5%) and rare (MAF under 0.5%) variants, which are typically population-specific (15), are poorly represented in that catalogue (14). Indeed, recent studies have described a remarkable local component (16–18) and a high stratification level (19,20) in many rare variants with uncertain functional consequences. As a consequence, the risk of many diseases differs among human populations according to their genetic backgrounds (21,22). In fact, the knowledge of the genetic variability of the local population has been revealed as a critical factor for the discovery of new disease variants (23). All these observations highlight the need for population-specific catalogues of genetic variation (24). However, only a few initiatives to study genetic variation at the population level have been carried out to date. These include a whole-genome sequencing (WGS) study of 100 Malays (25); the Genome of the Netherlands, with low-resolution (∼13×) WGS data of 250 trio-families from across the entire country (15); the French-Canadian study of 109 exomes (26); the Medical Genome Project, which produced a catalog of the healthy Spanish population with almost 270 exomes (23); the 3000 Finnish genomes (27); the Icelandic population study of medium-resolution (∼20×) WGS of 2636 individuals (28); the high-resolution (>30×) WGS of 1070 healthy Japanese individuals (29); and the recent genetic analysis of the Iranian population (30).

In spite of their recognized usefulness, large-scale sequencing projects of cohorts of local ‘healthy’ populations require expensive consortium-based efforts to obtain a representative sample of the targeted population. Unfortunately, funding bodies that readily support research on diseases tend to be reluctant to fund projects that involve systematic sequencing of healthy individuals. In this scenario, a crowdsourcing strategy can provide a feasible alternative to traditional working schemas by organizing consortia that collect data from different groups, which ultimately benefit collectively from the sample size cooperatively obtained. Crowdsourcing is becoming a very popular strategy in biomedicine (31) and can be defined as ‘the process of getting services, information, labor or ideas by outsourcing through an open call, especially through the Internet’ (32). Recent examples of crowdsourced research have demonstrated increased accuracy in predicting breast cancer survival (33), response to drugs (34) or response to toxic compounds (35) from both clinical and genomic data, and show how ‘crowdsourced data science challenges can achieve in months what would take years through conventional research approaches’ (36).

MATERIALS AND METHODS

Subjects

The database contains detailed allelic frequencies corresponding to the MGP population, sequenced in the context of the Medical Genome Project (http://www.clinbioinfosspa.es/content/medical-genome-project), which includes 267 healthy, unrelated samples of Spanish origin (EGA, accession: EGAS00001000938), as well as other healthy controls and patients with different diseases, accompanied in some cases by unrelated, phenotypically healthy carriers. The sequences were contributed by different consortia and projects, including groups from the Spanish Network for Research in Rare Diseases (CIBERER), results from the EnoD (Undiagnosed Rare Diseases programme; https://www.ciberer.es/en/transversal-programmes/scientific-projects/undiagnosed-rare-diseases-programme-enod), the Project Genome 1000 Navarra (NAGEN 1000; https://www.nagen1000navarra.es/en), the RareGenomics network (https://www.rare-genomics.com/) from Madrid, and other research groups and initiatives across Spain (37,38), which currently add up to a total of 2027 genomic and exomic sequences of unrelated Spanish individuals.

Testing sample locality

Ensuring the Spanish locality of the samples uploaded to the CSVS is key for the project. We therefore developed a specific methodology to double-check the origin of each sample. Sequences belonging to different populations in the 1000 genomes project (14) were used to train a machine learning-based decision model to discriminate Spanish samples from the rest of the populations. First, SNPs with a MAF > 0.01 located in the genomic regions shared by all the samples were selected. Then, individual ancestry in 1000 genomes was estimated for 26 subpopulations using ADMIXTURE (39). Thus, each individual is described by a vector of 26 features that correspond to the probabilities of belonging to each of the 26 subpopulations of 1000 genomes. Finally, a machine learning binary classifier was built using a well-known variant of the gradient boosting machine: extreme gradient boosting (XGBoost) (40) (see Supplementary Methods for details).
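As an illustration of the first step only (SNP selection), the following Python sketch filters a toy genotype matrix. It assumes that "shared by all the samples" means a site with a genotype call in every sample, and it uses a hypothetical 0/1/2 genotype coding; the ADMIXTURE and XGBoost steps are not reproduced here.

```python
def minor_allele_frequency(genotypes):
    """MAF for one biallelic SNP from diploid genotypes coded 0, 1 or 2
    (number of alternative alleles); None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def select_snps(genotype_matrix, min_maf=0.01):
    """Return SNP ids called in every sample with MAF above min_maf."""
    selected = []
    for snp_id, genotypes in genotype_matrix.items():
        shared = all(g is not None for g in genotypes)
        if shared and minor_allele_frequency(genotypes) > min_maf:
            selected.append(snp_id)
    return selected


# Toy matrix: one row per SNP, one column per sample (hypothetical data).
toy = {
    "snp1": [0, 1, 2, 0, 1],     # polymorphic and fully called
    "snp2": [0, 0, 0, 0, 0],     # monomorphic, so MAF is 0
    "snp3": [1, None, 0, 2, 1],  # missing in one sample
    "snp4": [2, 2, 2, 1, 2],     # alternative allele is the major allele
}
print(select_snps(toy))
```

The SNPs that pass this filter would then feed the ancestry estimation, whose 26 output proportions per individual form the feature vector for the classifier.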

Testing sample kinship and outlier sample detection

A test to detect undesired samples based on the percentage of novel variants they introduce in the database, either by excess (potentially noisy sample) or by defect (close relative or individual already in the database), has also been used to populate the CSVS database. A leave-one-out cross-validation (LOOCV) strategy was used to build a distribution of the percentages of variants contributed by each single sample to the pool of variants present in the rest of the database. Samples were considered potential outliers if they fell more than 1.5 times the interquartile range below the first quartile or above the third quartile of the distribution obtained (see Supplementary Methods for details).
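The leave-one-out percentages and the interquartile-range rule can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions: each sample is represented as a set of hypothetical variant identifiers, and quartiles come from the default method of `statistics.quantiles`, which may differ from the one used by the authors.

```python
import statistics


def novel_variant_percentages(samples):
    """Leave-one-out: percentage of each sample's variants absent from all others."""
    percentages = {}
    for name, variants in samples.items():
        rest = set()
        for other, other_variants in samples.items():
            if other != name:
                rest |= other_variants
        novel = variants - rest
        percentages[name] = 100.0 * len(novel) / len(variants) if variants else 0.0
    return percentages


def iqr_outliers(percentages, k=1.5):
    """Return (low, high) sample names outside Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(list(percentages.values()), n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    low = sorted(n for n, p in percentages.items() if p < low_fence)
    high = sorted(n for n, p in percentages.items() if p > high_fence)
    return low, high


# Tiny example of the leave-one-out step (hypothetical variant ids).
toy_samples = {"a": {"v1", "v2"}, "b": {"v2", "v3"}, "c": {"v2"}}
print(novel_variant_percentages(toy_samples))

# Hypothetical percentages: "dup" mimics a duplicate or close relative,
# "noisy" mimics a sample with an excess of private variants.
example = {"s1": 10.0, "s2": 11.0, "s3": 12.0, "s4": 10.5,
           "s5": 11.5, "s6": 12.5, "dup": 0.5, "noisy": 40.0}
print(iqr_outliers(example))
```

Low outliers correspond to samples that add almost nothing new (possible relatives or duplicates), and high outliers to samples whose excess of private variants suggests technical noise.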

Construction of the reference imputation panel

Two alternative reference panels were created for comparison purposes using the Minimac3 imputation tool (41). Both include the CSVS WGS variant panel, composed of 228 samples, plus either (i) the entire 1000G reference panel (CSVS+1000G) or (ii) exclusively the Spanish population (IBS subpopulation) contained in the 1000G panel (CSVS+IBS). The four longest chromosomes (chromosomes 1–4) were used to estimate the correlation between real and imputed genotypes (r² parameter) and assess the imputation accuracy (see Supplementary Methods for details).
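For a single variant, the r² accuracy measure can be computed as the squared Pearson correlation between true genotypes and imputed dosages. The Python sketch below is a generic illustration with hypothetical data; it is not the Minimac3 implementation and does not reproduce the authors' evaluation pipeline.

```python
def squared_correlation(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes (0, 1, 2)
    and imputed dosages (floats in [0, 2]) for one variant."""
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equally sized vectors with at least 2 samples")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (d - mean_d)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return 0.0  # correlation is undefined for a constant vector; report 0
    return (cov * cov) / (var_t * var_d)


truth = [0, 1, 2, 1, 0, 0]
dosage = [0.1, 0.9, 1.8, 1.2, 0.0, 0.3]  # hypothetical imputed dosages
print(round(squared_correlation(truth, dosage), 3))
```

Averaging this value over variants within MAF bins gives the kind of accuracy curve used to compare reference panels.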

RESULTS

The CSVS database

Figure 1A shows how data contributed by different genomic projects undergo several quality control steps, including the artifact and kinship detection tests and the locality test described above. The original VCFs are then aggregated as counts of variants, binned by ICD10 (https://www.icd10data.com/) disease categories, and inserted in the CSVS database.
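The aggregation step can be pictured with a short Python sketch. The data layout (one category label plus a genotype dictionary per sample) and the variant identifier format are assumptions made for illustration; the actual CSVS schema is not described in this section.

```python
from collections import defaultdict


def aggregate_by_category(samples):
    """Collapse per-sample genotypes into genotype counts per category.

    samples: list of (category, {variant_id: genotype}) pairs, where
    genotype is the number of alternative alleles (0, 1 or 2).
    Returns {category: {variant_id: [n_hom_ref, n_het, n_hom_alt]}}.
    """
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0, 0]))
    for category, genotypes in samples:
        for variant_id, genotype in genotypes.items():
            counts[category][variant_id][genotype] += 1
    return {cat: dict(per_variant) for cat, per_variant in counts.items()}


# Hypothetical input: ICD10 chapter ranges used as category labels.
samples = [
    ("G00-G99", {"1:100:A:G": 1, "1:200:C:T": 0}),
    ("G00-G99", {"1:100:A:G": 2}),
    ("I00-I99", {"1:100:A:G": 0}),
]
aggregated = aggregate_by_category(samples)
print(aggregated)
```

Storing only counts per category means that individual-level genotypes never need to leave the contributing group, while category-aware frequencies remain computable.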

Figure 1.

(A) Data are contributed by different genomic projects and pass through different quality control steps, including an artefact and kinship test (which detects upper outliers, with an unexpectedly high ratio of private variants, most likely errors, and lower outliers, which are duplicates or closely related individuals) and a locality test, before being inserted in the database. (B) Initial CSVS page. (C) Query panel in the Search option. (D) List of variants found in the Spanish population within the selected region, along with complementary information on impact, conservation, frequencies in other populations and phenotype. (E) Genomic browser that displays the selected variant in its genomic context. (F) Saturation plot. (G) Updated contents of the database.

The CSVS interface

The initial screen (Figure 1B) requires the acceptance of the ‘Terms and conditions for the use of the CSVS database’ (http://csvs.babelomics.org/downloads/CSVSTermsAndConditions_use.pdf) before starting any operation. Once accepted, different options can be used.

The search option

This is the main option and allows querying the CSVS database. In the left panel (Figure 1C) queries can be done by gene symbol or by chromosomal regions. Also, one or several disease categories can be excluded, and variants can be highlighted using different types of scores (e.g. SIFT (42), Polyphen (43), CADD (44), Gerp (45)) as well as Sequence Ontology terms for the variation consequences.
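The effect of excluding disease categories on the reported frequency can be sketched as follows. The count layout (`[n_hom_ref, n_het, n_hom_alt]` per category and variant) and all names are hypothetical; this only illustrates the pseudo-control idea, not the CSVS query code.

```python
def pseudo_control_frequency(counts_by_category, variant_id, excluded=()):
    """Alternative-allele frequency after dropping the excluded categories.

    counts_by_category: {category: {variant_id: [n_hom_ref, n_het, n_hom_alt]}}
    Returns None when no genotyped individuals remain.
    """
    alt_alleles = 0
    total_alleles = 0
    for category, per_variant in counts_by_category.items():
        if category in excluded:
            continue
        n_hom_ref, n_het, n_hom_alt = per_variant.get(variant_id, (0, 0, 0))
        alt_alleles += n_het + 2 * n_hom_alt
        total_alleles += 2 * (n_hom_ref + n_het + n_hom_alt)
    return alt_alleles / total_alleles if total_alleles else None


# Hypothetical aggregated counts for one variant in three categories.
counts = {
    "G00-G99": {"v": [8, 2, 0]},
    "I00-I99": {"v": [10, 0, 0]},
    "H00-H59": {"v": [9, 1, 0]},
}
print(pseudo_control_frequency(counts, "v"))
print(pseudo_control_frequency(counts, "v", excluded={"G00-G99"}))
```

In this toy case the variant looks twice as frequent when the category enriched for carriers is kept, which is exactly the bias that removing the category of the disease under study is meant to avoid.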

The results of the query (Figure 1D) include a list of the positions for which variation has been found in the Spanish population, along with complementary data such as: chromosome, position, reference and alternative alleles, allelic frequencies in the Spanish population, allelic frequencies in the 1000 genomes populations and in the EVS populations, impact and conservation indexes (SIFT, Polyphen, CADD, Gerp), the worst of the consequence types assigned to the mutation, and the phenotypes, which correspond to known clinical information for the variants extracted from ClinVar (46) and COSMIC (47). These data are annotated interactively on each query using the CellBase (48) web services. A visualization of the variant in its genomic context is also provided, based on the Genome Maps browser (49). Additionally, more detailed information can be found on the population frequencies observed for the variant, the phenotype or the effect.

Contact request

An interesting option is the Contact request button, offered for any variant in the query results panel, which is a local equivalent of a Matchmaker exchange service (50), extensively used to contact the original contributor of a specific sequence.

Saturation plots

Saturation plots (Figure 1F) provide an interesting perspective on the general conservation of the gene studied and, consequently, on the possibilities of discovering new variants in it. Genes highly constrained to change will saturate soon, and a relatively low number of individuals will capture most of the mutations the gene can tolerate, while unconstrained genes will present a still growing slope, meaning that there are still many variants that can potentially be discovered. Discovering a new variant in a saturated gene (constrained to change) can be more relevant than the same finding in a non-saturated gene (unconstrained). Saturation has a clear functional component that can easily be revealed by enrichment analysis of the genes ranked by saturation. Thus, when genes are ranked by their relative saturation, enrichment analysis using enrichR (51) shows that highly saturated genes (constrained) are enriched in functional terms related to meiosis, cell signaling, proliferation and homeostasis, while the less saturated genes (unconstrained) are more related to sensory perception, immune response and similar functionalities (see Supplementary Results and Supplementary Figure S1). Figure 2 depicts how genes with high and low saturation are distributed along the chromosomes. Interestingly, sex chromosomes seem to be enriched in low saturated genes.
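A saturation curve of this kind can be sketched as the cumulative number of distinct variants observed in a gene as individuals are added. The Python example below uses a single fixed ordering of hypothetical samples; whether CSVS averages over random orderings or applies additional smoothing is not stated here, so this is only an illustration of the concept.

```python
def saturation_curve(samples):
    """Cumulative count of distinct variants as samples are added in order.

    samples: sequence of sets, each holding the variants one individual
    carries in the gene of interest.
    """
    seen = set()
    curve = []
    for variants in samples:
        seen |= set(variants)
        curve.append(len(seen))
    return curve


# Hypothetical genes: the first stops gaining variants early (saturated),
# the second keeps growing with every new individual (unsaturated).
constrained_gene = [{"a"}, {"a"}, {"a", "b"}, {"b"}, {"a"}]
unconstrained_gene = [{"a"}, {"b", "c"}, {"d"}, {"e", "f"}, {"g"}]
print(saturation_curve(constrained_gene))
print(saturation_curve(unconstrained_gene))
```

A flat tail indicates that additional individuals are unlikely to reveal new variants in the gene, whereas a tail that is still rising suggests that much of its tolerated variation remains unsampled.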

Figure 2.

Circos plot showing the different genes with high saturation (orange) and low saturation (green) along the chromosomes, which were significantly enriched in functional terms in Supplementary Figure S1.

Downloads and statistics

Partial or total downloads of the aggregated data are possible upon receipt of the corresponding data download agreement, duly signed.

The Stats option provides an updated view of the content of the CSVS database.

The Spanish Genome Reference Panel (SGRP1.0)

Supplementary Figure S2 shows the accuracy of the two reference panels derived for imputation in the Spanish population. Both reference panels including the CSVS WGS reference outperformed the 1000 genomes reference. The imputation accuracy increases when variants in rare sites were included (MAF > 0.005). The most realistic imputation panel includes CSVS and the IBS population of the 1000 genomes.

Variants of pharmacogenomic interest

Interindividual genetic variability in genes encoding drug-metabolizing enzymes and transporters has been linked to differences in the efficacy and toxicity of many medications. Moreover, genetic differences between human populations are becoming increasingly recognized as important factors accounting for interindividual variations in drug responsiveness (52,53). Approximately one-fifth of new drugs approved in the past years demonstrated differences in response across ethnic groups, leading to population-specific prescribing recommendations (54). In spite of the consensus about the existence of a relative homogeneity within European populations, population-specific differences in the Spanish population were recently reported (23). Using the individuals of the CSVS repository, we addressed how population-specific differences in genes involved in drug Absorption, Distribution, Metabolism, Excretion and Toxicity (ADMET) could affect the rates and risks of drug inefficacy and/or adverse drug reactions in the Spanish population. We estimated the allele frequencies of a total of 142 pharmacogenetic variants described in the PharmGKB database (55) with pharmacogenetic clinical recommendations (PharmGKB variants level 1A and 1B), and a total of 40 of these were found to be polymorphic in the CSVS. When compared with the allele frequencies calculated from genetic data of 30 000 European non-Finnish individuals (gnomAD (56)), no relevant frequency differences between the general European population and the Spanish population were observed, the most divergent variants being rs2228001 (level 1B) in the XPC gene, rs2108622 (level 1A) in the CYP4F2 gene and rs3892097 (level 1A) in the CYP2D6 gene (P-value ≤ 1 × 10⁻¹⁰). Regarding the non-polymorphic variants, we observed that all of them are low-frequency variants (lower than 0.00065) and we do not expect to find a heterozygous individual due to the sample size in our repository (Supplementary Table S1).
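The text above does not state which statistical test produced the reported P-values. As one possible way to compare an allele frequency between two populations, the following Python sketch applies a two-proportion z-test to allele counts; the choice of test and all the counts are assumptions for illustration only.

```python
import math


def two_proportion_z_test(alt_a, total_a, alt_b, total_b):
    """Two-sided z-test for a difference between two allele frequencies.

    alt_*: number of alternative alleles observed; total_*: number of
    alleles genotyped (twice the number of diploid individuals).
    Returns (z, p_value).
    """
    p_a = alt_a / total_a
    p_b = alt_b / total_b
    pooled = (alt_a + alt_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if se == 0.0:
        return 0.0, 1.0  # both samples monomorphic: no evidence of a difference
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Hypothetical allele counts: a local sample versus a large reference sample.
z, p = two_proportion_z_test(alt_a=400, total_a=4000, alt_b=3000, total_b=60000)
print(z, p)
```

With very low counts an exact test would be more appropriate than this normal approximation.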

Apart from the genetic variants already recommended for implementation in the clinical setting, it has been found that genetic variability with functional impact is governed by a few high-frequency variants in some genes, whereas the functionality of the majority of pharmacogenes is dominated by rare genetic variants (57). In addition, local variability in these ADMET genes could also be very relevant for explaining a substantial part of the unexplained inter-individual differences in drug response and toxicities at the population-specific level. It is therefore essential to have population-specific catalogs of these (mainly rare) pharma-variants available in order to explore their contribution to predictions of drug response. To examine this, we studied the variability of the Spanish population captured by our repository in a total of 421 well-known pharmacogenes involved in drug pharmacokinetics and/or drug response (Supplementary Table S2). High-impact variants within those pharmacogenes were defined according to the Variant Effect Predictor (58) as those having the following consequence types: frameshift, splice acceptor, splice donor, start lost, stop gained, stop lost, transcript ablation and transcript amplification. Additionally, missense variants were considered likely deleterious if they were categorized as deleterious by CONDEL (59) or had a LoFtool score (60) lower than the first quartile, corresponding to the most intolerant variants.
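The two variant classes defined above can be expressed as a small Python classifier. The consequence names follow the Sequence Ontology terms reported by the Variant Effect Predictor; the function signature, the input fields and the quartile value are hypothetical and would need to be adapted to the actual annotation output.

```python
# Sequence Ontology terms for the high-impact consequence types listed above.
HIGH_IMPACT_TERMS = frozenset({
    "frameshift_variant",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "start_lost",
    "stop_gained",
    "stop_lost",
    "transcript_ablation",
    "transcript_amplification",
})


def classify_variant(consequences, condel_label=None,
                     loftool_score=None, loftool_q1=None):
    """Label a variant as 'high_impact', 'deleterious_missense' or 'other'.

    consequences: iterable of consequence terms for the variant.
    condel_label: CONDEL category, e.g. 'deleterious' or 'neutral'.
    loftool_score / loftool_q1: gene score and the first-quartile cutoff
    (lower scores indicate more intolerant genes).
    """
    terms = set(consequences)
    if terms & HIGH_IMPACT_TERMS:
        return "high_impact"
    if "missense_variant" in terms:
        if condel_label == "deleterious":
            return "deleterious_missense"
        if (loftool_score is not None and loftool_q1 is not None
                and loftool_score < loftool_q1):
            return "deleterious_missense"
    return "other"


print(classify_variant(["stop_gained"]))
print(classify_variant(["missense_variant"], condel_label="deleterious"))
print(classify_variant(["missense_variant"], loftool_score=0.05, loftool_q1=0.2))
print(classify_variant(["synonymous_variant"]))
```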

As before, the same comparison with the corresponding European non-Finnish variants rendered a total of 318 high impact variants and 235 likely deleterious missense single nucleotide variants in the pharmacogenes studied. Interestingly, 18 (5.6%) high impact variants and 18 (7.6%) missense variants identified were present in our Spanish population while no heterozygotes were observed in these positions across ∼30 000 healthy individuals of the European non-Finnish population. Also, a non-negligible percentage of private variation was observed in these genes encoding proteins involved in drug metabolism, transport, and response, and this information can be used to pinpoint relevant private genetic variants to be included in the design of population-specific pharmacogenetic genotyping arrays to be utilized in the implementation of pharmacogenetic diagnostics in the clinical setting (Supplementary Table S3).

CSVS Beacon

Since 2017, CSVS has made its genomic information discoverable through the GA4GH Beacon network (https://beacon-network.org/). In order to improve the performance of the CSVS Beacon API, we set up an SQLite database specifically for this purpose. Although CSVS stores data in 1-based coordinates, it can respond to queries in either 1-based or 0-based coordinates (Beacon requests data in 0-based coordinates). A form to directly make Beacon-style queries is also available (http://ucscbeacon.clinbioinfosspa.es/).
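The coordinate handling can be illustrated with a minimal Python sketch that stores single-nucleotide variants in 1-based coordinates in an in-memory SQLite table and answers presence queries in either convention. The table layout and function names are hypothetical and do not describe the actual CSVS Beacon schema or API.

```python
import sqlite3


def build_index(variants):
    """Create an in-memory SQLite table of variants stored 1-based.

    variants: iterable of (chrom, pos, ref, alt) tuples with 1-based pos.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variant ("
        "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, "
        "PRIMARY KEY (chrom, pos, ref, alt))"
    )
    conn.executemany("INSERT INTO variant VALUES (?, ?, ?, ?)", variants)
    return conn


def beacon_exists(conn, chrom, start, ref, alt, zero_based=True):
    """Answer a presence query; 0-based starts are shifted to 1-based."""
    pos = start + 1 if zero_based else start
    row = conn.execute(
        "SELECT 1 FROM variant WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ? LIMIT 1",
        (chrom, pos, ref, alt),
    ).fetchone()
    return row is not None


conn = build_index([("1", 1000, "A", "G")])
print(beacon_exists(conn, "1", 999, "A", "G"))                     # 0-based query
print(beacon_exists(conn, "1", 1000, "A", "G", zero_based=False))  # 1-based query
```

Keeping a single stored convention and converting at the query boundary avoids off-by-one mismatches between clients that use different coordinate systems.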

DISCUSSION

The genetic variability of the local population is recognized as one of the most relevant factors in the discovery of new disease variants, especially in mendelian diseases (6,8,23). However, genomic data of healthy individuals belonging to the local population of interest are often scarce, if not unavailable. The CSVS provides an original solution to this problem. The CSVS is a continuously growing resource that collects genomic or exomic sequences of the Spanish local population, no matter whether these come from healthy or diseased individuals. The main objective is to use the repository as a pseudo-control population for finding new disease-causing variants and genes, with the idea that ‘disease A is a healthy control for disease B’. Although gene pleiotropy cannot be completely ruled out, data are binned at higher ICD10 disease categories, where this gene property can be considered negligible. In case of doubt, resources like Disgenet (61) can be used, and they will be incorporated in future CSVS versions to automatically exclude the proper disease categories. Since population-specific genomic data from individuals with different diseases are easier to collect than those from healthy donors, CSVS provides an example for the construction of population-specific pseudo-control repositories by means of crowdsourcing (31). Moreover, the CSVS Beacon and the Contact request option make CSVS a tool with high discoverability potential. Thus, CSVS lays the groundwork and serves as an example for future federated European infrastructures (62).

DATA AVAILABILITY

CSVS is an open resource available at http://csvs.babelomics.org/.

The CSVS code, as well as the code of the different tests used, is available in the corresponding GitHub repository: https://github.com/babelomics/CSVS.

ACKNOWLEDGEMENTS

The Spanish Exome Crowdsourcing Consortium is a de facto consortium currently composed of: Fátima Al-Shahrour, Rafael Artuch, Javier Benitez, Luis Antonio Castaño, Ignacio del Castillo, Aitor Delmiro, Carmina Espinos, Roser González, Daniel Grinberg, Encarnación Guillén, Pablo Lapunzina, Esther Lopez, Ramón Martí, Montserrat Milá, José Mª Millán, Virginia Nunes, Francesc Palau, Belen Perez, Luis Pérez Jurado, Rosario Perona, Aurora Pujol, Feliciano Ramos, Antonia Ribes, Jordi Rosell, Eulalia Rovira, Jordi Surrallés, Isabel Tejada and Magdalena Ugarte.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Spanish Ministry of Economy and Competitiveness [SAF2017-88908-R, PT17/0009/0006 to J.D.; PI19/00321 and CIBERER ACCI-06/07/0036 to C.A., PI14-948, PI17-1659 and CIBERER ACCI-06/07/0036 to M.A.M.P.]; Regional Government of Madrid, RAREGenomics-CM [B2017/BMD-3721 to C.A. and B2017/BMD3721 to M.A.M.P.]; all co-funded with European Regional Development Funds (ERDF) as well as EU H2020-INFRADEV-1-2015-1 ELIXIR-EXCELERATE [676559]; University Chair UAM-IIS-FJD of Genomic Medicine and the Ramon Areces Foundation also supported this work. Funding for open access charge: Spanish Ministry of Economy and Competitiveness [SAF2017-88908-R].

Conflict of interest statement. None declared.

REFERENCES

1. Mardis E.R. DNA sequencing technologies: 2006–2016. Nat. Protoc. 2017; 12:213.

2. Durbin R.M., Abecasis G.R., Altshuler D.L., Auton, Brooks L.D., Gibbs R.A., Hurles M.E., McVean G.A. A map of human genome variation from population-scale sequencing. Nature. 2010; 467:1061–1073.

3. Dunham, Kundaje, Aldred S.F., Collins P.J., Davis C.A., Doyle, Epstein C.B., Frietze, Harrow, Kaul et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57–74.

4. Fu, O’Connor T.D., Jun, Kang H.M., Abecasis, Leal S.M., Gabriel, Rieder M.J., Altshuler, Shendure et al. Analysis of 6,515 exomes reveals the recent origin of most human protein-coding variants. Nature. 2013; 493:216–220.

5. Boycott K.M., Rath, Chong J.X., Hartley, Alkuraya F.S., Baynam, Brookes A.J., Brudno, Carracedo, den Dunnen J.T. et al. International cooperation to enable the diagnosis of all rare genetic diseases. Am. J. Hum. Genet. 2017; 100:695–705.

6. Boycott K.M., Vanstone M.R., Bulman D.E., MacKenzie A.E. Rare-disease genetics in the era of next-generation sequencing: discovery to translation. Nat. Rev. Genet. 2013; 14:681–691.

7. Wenger A.M., Guturu, Bernstein J.A., Bejerano. Systematic reanalysis of clinical exome data yields additional diagnoses: implications for providers. Genet. Med. 2017; 19:209.

8. Boycott K.M., Hartley, Biesecker L.G., Gibbs R.A., Innes A.M., Riess, Belmont, Dunwoodie S.L., Jojic, Lassmann et al. A diagnosis for all rare genetic diseases: the horizon and the next frontiers. Cell. 2019; 177:32–37.

9. Rehm H.L., Bale S.J., Bayrak-Toydemir, Berg J.S., Brown K.K., Deignan J.L., Friez M.J., Funke B.H., Hegde M.R., Lyon. ACMG clinical laboratory standards for next-generation sequencing. Genet. Med. 2013; 15:733.

10. Lek, Karczewski K.J., Minikel E.V., Samocha K.E., Banks, Fennell, O’Donnell-Luria A.H., Ware J.S., Hill A.J., Cummings B.B. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016; 536:285.

11. Ng S.B., Turner E.H., Robertson P.D., Flygare S.D., Bigham A.W., Lee, Shaffer, Wong, Bhattacharjee, Eichler E.E. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009; 461:272–276.

12. Ng S.B., Buckingham K.J., Lee, Bigham A.W., Tabor H.K., Dent K.M., Huff C.D., Shannon P.T., Jabs E.W., Nickerson D.A. et al. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 2010; 42:30–35.

13. Ng S.B., Bigham A.W., Buckingham K.J., Hannibal M.C., McMillin M.J., Gildersleeve H.I., Beck A.E., Tabor H.K., Cooper G.M., Mefford H.C. et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat. Genet. 2010; 42:790–793.

14. Abecasis G.R., Auton, Brooks L.D., DePristo M.A., Durbin R.M., Handsaker R.E., Kang H.M., Marth G.T., McVean G.A. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491:56–65.

15. The Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat. Genet. 2014; 46:818–825.

16. Nelson M.R., Wegmann, Ehm M.G., Kessner, St Jean, Verzilli, Shen, Tang, Bacanu S.A., Fraser et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science. 2012; 337:100–104.

17. Kryukov G.V., Pennacchio L.A., Sunyaev S.R. Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am. J. Hum. Genet. 2007; 80:727–739.

18. Marth G.T., Yu, Indap A.R., Garimella, Gravel, Leong W.F., Tyler-Smith, Bainbridge, Blackwell, Zheng-Bradley et al. The functional spectrum of low-frequency coding variation. Genome Biol. 2011; 12:R84.

19. Mathieson, McVean. Differential confounding of rare and common variants in spatially structured populations. Nat. Genet. 2012; 44:243–246.

20. Moreno-Estrada, Gravel, Zakharia, McCauley J.L., Byrnes J.K., Gignoux C.R., Ortiz-Tello P.A., Martinez R.J., Hedges D.J., Morris R.W. et al. Reconstructing the population genetic history of the Caribbean. PLoS Genet. 2013; 9:e1003925.

21. Corona, Chen, Sikora, Morgan A.A., Patel C.J., Ramesh, Bustamante C.D., Butte A.J. Analysis of the genetic basis of disease in the context of worldwide human relationships and migration. PLoS Genet. 2013; 9:e1003447.

22. Fernandez R.M., Bleda, Luzon-Toro, Garcia-Alonso, Arnold, Sribudiani, Besmond, Lantieri, Doan, Ceccherini et al. Pathways systematically associated to Hirschsprung's disease. Orphanet. J. Rare. Dis. 2013; 8:187.

23. Dopazo, Amadoz, Bleda, Garcia-Alonso, Alemán, García-García, Rodriguez J.A., Daub J.T., Muntané, Rueda. 267 Spanish exomes reveal population-specific differences in disease-related genetic variation. Mol. Biol. Evol. 2016; 33:1205–1218.

24. Bustamante C.D., Burchard E.G., De la Vega F.M. Genomics for the world. Nature. 2011; 475:163–165.

25. Wong L.P., Ong R.T., Poh W.T., Liu, Chen, Li, Lam K.K., Pillai N.E., Sim K.S., Xu et al. Deep whole-genome sequencing of 100 southeast Asian Malays. Am. J. Hum. Genet. 2013; 92:52–66.

26. Casals, Hodgkinson, Hussin, Idaghdour, Bruat, de Maillard, Grenier J.C., Gbeha, Hamdan F.F., Girard et al. Whole-exome sequencing reveals a rapid change in the frequency of rare functional variants in a founding population of humans. PLoS Genet. 2013; 9:e1003815.

27. Lim E.T., Wurtz, Havulinna A.S., Palta, Tukiainen, Rehnstrom, Esko, Magi, Inouye, Lappalainen et al. Distribution and medical impact of loss-of-function variants in the Finnish founder population. PLoS Genet. 2014; 10:e1004494.

28. Gudbjartsson D.F., Helgason, Gudjonsson S.A., Zink, Oddson, Gylfason, Besenbacher, Magnusson, Halldorsson B.V., Hjartarson et al. Large-scale whole-genome sequencing of the Icelandic population. Nat. Genet. 2015; 47:435–444.

29. Nagasaki, Yasuda, Katsuoka, Nariai, Kojima, Kawai, Yamaguchi-Kabata, Yokozawa, Danjoh, Saito et al. Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals. Nat. Commun. 2015; 6:8018.

30. Fattahi, Beheshtian, Mohseni, Poustchi, Sellars, Nezhadi S.H., Amini, Arzhangi, Jalalvand, Jamali. Iranome: a catalog of genomic variations in the Iranian population. Hum. Mutat. 2019; 40:1968–1984.

31. Khare, Good B.M., Leaman, Su A.I., Lu. Crowdsourcing in biomedicine: challenges and opportunities. Brief. Bioinform. 2015; 17:23–32.

32. Estellés-Arolas, González-Ladrón-de-Guevara. Towards an integrated crowdsourcing definition. J. Inf. Sci. 2012; 38:189–200.

33. Margolin A.A., Bilal, Huang, Norman T.C., Ottestad, Mecham B.H., Sauerwine, Kellen M.R., Mangravite L.M., Furia M.D. et al. Systematic analysis of challenge-driven improvements in molecular prognostic models for breast cancer. Sci. Transl. Med. 2013; 5:181re1.

34. Plenge R.M., Greenberg J.D., Mangravite L.M., Derry J.M., Stahl E.A., Coenen M.J., Barton, Padyukov, Klareskog, Gregersen P.K. et al. Crowdsourcing genetic prediction of clinical utility in the rheumatoid arthritis responder challenge. Nat. Genet. 2013; 45:468–469.

35. Eduati, Mangravite L.M., Wang, Tang, Bare J.C., Huang, Norman, Kellen, Menden M.P., Yang et al. Prediction of human population responses to toxic compounds by a collaborative competition. Nat. Biotech. 2015; 33:933–940.

36. Davis, Button-Simons, Bensellak, Ahsen E.M., Checkley, Foster G.J., Su, Moussa, Mapiye, Khoo S.K. et al. Leveraging crowdsourcing to accelerate global health solutions. Nat. Biotechnol. 2019; 37:848–850.

37. Gallego-Martinez, Lopez-Escamez J.A. Genetic architecture of Meniere's disease. Hear. Res. 2019; 107872.

38. Gui, Schriemer, Cheng W.W., Chauhan R.K., Antiñolo, Berrios, Bleda, Brooks A.S., Brouwer R.W., Burns A.J. Whole exome sequencing coupled with unbiased functional analysis reveals new Hirschsprung disease genes. Genome Biol. 2017; 18:48.

39. Alexander D.H., Novembre, Lange. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009; 19:1655–1664.

40. Chen, Guestrin. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016; ACM, 785–794.

41. Das, Forer, Schönherr, Sidore, Locke A.E., Kwong, Vrieze S.I., Chew E.Y., Levy, McGue. Next-generation genotype imputation service and methods. Nat. Genet. 2016; 48:1284.

42. Ng P.C., Henikoff. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003; 31:3812–3814.

43. Adzhubei, Jordan D.M., Sunyaev S.R. Predicting functional effect of human missense mutations using PolyPhen‐2. Curr. Protoc. Hum. Genet. 2013; 76:7.20.21–27.20.41.

44. Kircher, Witten D.M., Jain, O’Roak B.J., Cooper G.M., Shendure. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 2014; 46:310–315.

45. Davydov E.V., Goode D.L., Sirota, Cooper G.M., Sidow, Batzoglou. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput. Biol. 2010; 6:e1001025.

46. Landrum M.J., Lee J.M., Benson, Brown G.R., Chao, Chitipiralla, Gu, Hart, Hoffman, Jang et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2017; 46:D1062–D1067.

47.

Tate

J.G.

, Bamford

, Jubb

H.C.

, Sondka

, Beare

D.M.

, Bindal

, Boutselakis

, Cole

C.G.

, Creatore

, Dawson

et al.

COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res.2018; 47:D941–D947.

48.

Bleda

, Tarraga

, de Maria

, Salavert

, Garcia-Alonso

, Celma

, Martin

, Dopazo

, Medina

CellBase, a comprehensive collection of RESTful web services for retrieving relevant biological information from heterogeneous sources. Nucleic Acids Res.2012; 40:W609–W614.

49.

Medina

, Salavert

, Sanchez

, de Maria

, Alonso

, Escobar

, Bleda

, Dopazo

Genome Maps, a new generation genome browser. Nucleic Acids Res.2013; 41:W41–W46.

50.

Philippakis

A.A.

, Azzariti

D.R.

, Beltran

, Brookes

A.J.

, Brownstein

C.A.

, Brudno

, Brunner

H.G.

, Buske

O.J.

, Carey

, Doll

The Matchmaker Exchange: a platform for rare disease gene discovery. Hum. Mutat.2015; 36:915–921.

51.

Kuleshov

M.V.

, Jones

M.R.

, Rouillard

A.D.

, Fernandez

N.F.

, Duan

, Wang

, Koplev

, Jenkins

S.L.

, Jagodnik

K.M.

, Lachmann

et al.

Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res.2016; 44:W90–W97.

52.

Kubo

, Ohara

, Tachikawa

, Cavallari

, Lee

, Wen

, Scordo

, Nutescu

, Perera

, Miyajima

Population differences in S-warfarin pharmacokinetics among African Americans, Asians and whites: their influence on pharmacogenetic dosing algorithms. Pharmacogenomics J.2017; 17:494–500.

53.

Meyer

U.A.

Pharmacogenetics–five decades of therapeutic lessons from genetic diversity. Nat. Rev. Genet.2004; 5:669–676.

54.

Ramamoorthy

, Pacanowski

, Bull

, Zhang

Racial/ethnic differences in drug disposition and response: review of recently approved drugs. Clin. Pharmacol. Ther.2015; 97:263–273.

55.

Barbarino

J.M.

, Whirl‐Carrillo

, Altman

R.B.

, Klein

T.E.

PharmGKB: A worldwide resource for pharmacogenomic information. Wiley Interdiscip. Rev. Syst. Biol. Med.2018; 10:e1417.

56.

Koch

Exploring human genomic diversity with gnomAD. Nat. Rev. Genet.2020; 21:448–448.

57.

Ingelman-Sundberg

, Mkrtchian

, Zhou

, Lauschke

V.M.

Integrating rare genetic variants into pharmacogenetic drug response predictions. Hum. Genomics. 2018; 12:26.

58.

McLaren

, Gil

, Hunt

S.E.

, Riat

H.S.

, Ritchie

G.R.

, Thormann

, Flicek

, Cunningham

The ensembl variant effect predictor. Genome Biol.2016; 17:122.

59.

González-Pérez

, López-Bigas

Improving the assessment of the outcome of nonsynonymous SNVs with a consensus deleteriousness score, Condel. Am. J. Hum. Genet.2011; 88:440–449.

60.

Fadista

, Oskolkov

, Hansson

, Groop

LoFtool: a gene intolerance score based on loss-of-function variants in 60 706 individuals. Bioinformatics. 2017; 33:471–474.

61.

Piñero

, Queralt-Rosinach

, Bravo

À.

, Deu-Pons

, Bauer-Mehren

, Baron

, Sanz

, Furlong

L.I.

DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database. 2015; 2015:bav028.

62.

Saunders

, Baudis

, Becker

, Beltran

, Béroud

, Birney

, Brooksbank

, Brunak

, Van den Bulcke

, Drysdale

Leveraging European infrastructures to access 1 million human genomes by 2022. Nat. Rev. Genet.2019; 20:693–701.