Goncalo Abecasis is an employee of Regeneron Pharmaceuticals. We declare no other conflicts of interest.
‡ This work was primarily completed while at the University of Michigan, Ann Arbor.
Gene-based association tests aggregate genotypes across multiple variants for each gene, providing an interpretable gene-level analysis framework for genome-wide association studies (GWAS). Early gene-based test applications often focused on rare coding variants; a more recent wave of gene-based methods, e.g. TWAS, use eQTLs to interrogate regulatory associations. Regulatory variants are expected to be particularly valuable for gene-based analysis, since most GWAS associations to date are non-coding. However, identifying causal genes from regulatory associations remains challenging and contentious. Here, we present a statistical framework and computational tool to integrate heterogeneous annotations with GWAS summary statistics for gene-based analysis, applied with comprehensive coding and tissue-specific regulatory annotations. We compare power and accuracy identifying causal genes across single-annotation, omnibus, and annotation-agnostic gene-based tests in simulation studies and an analysis of 128 traits from the UK Biobank, and find that incorporating heterogeneous annotations in gene-based association analysis increases power and performance identifying causal genes.
Gene-based association tests are statistical methods used in genome-wide association studies (GWAS) to identify genes that affect heritable traits. Gene-based tests are formed by aggregating genotypes across multiple genetic variants for each gene, often including only variants that are likely to affect gene function or regulation. In this work, we present a unified framework to integrate heterogeneous classes of functional variants in gene-based association analysis. This approach enables us to simultaneously assess multiple distinct biological mechanisms underlying GWAS association signals, and to construct powerful omnibus tests by aggregating across functional classes for each gene. We evaluated the performance of gene-based association test methods and strategies to identify causal genes by conducting extensive simulation studies, and by analyzing 128 human traits from the UK Biobank and comparing our results against lists of high-confidence putative causal genes. Our analysis suggests that incorporating heterogeneous functional variants in gene-based association tests increases power to detect gene-based association and helps identify causal genes.
Genome-wide association studies (GWAS) have identified thousands of genetic loci associated with complex traits [1]; however, the biological mechanisms underlying these associations are often poorly understood. Gene-based association tests can provide a more interpretable analysis framework compared to single-variant analysis, interrogating association at the gene level by aggregating genotypes across multiple variants for each gene. This strategy can also increase power to detect association by aggregating small effects across variants, reducing the burden of multiple testing, and weighting or filtering to prioritize functional variants [2, 3].
In gene-based analysis, variants are often grouped or weighted by putative functional effect, for example, a common strategy for exome analysis is to include only rare non-synonymous or loss-of-function (LoF) variants in gene-based tests such as SKAT and the CMC burden test [4, 5]. A more recent wave of gene-based methods, e.g. PrediXcan [6, 7] and TWAS [8], use eQTL variants to construct gene-based tests of association between the predicted genetic component of gene expression and GWAS trait. Incorporating regulatory variants is expected to be particularly valuable for gene-based analysis of complex traits, since most genetic associations discovered to date are in non-coding regions [9]. However, while coding variants generally implicate a single known gene, the gene(s) affected by regulatory variants are often less clear [10, 11].
Incorporating multiple types of annotation in gene-based analysis provides several advantages over analysis methods using annotations of a single type. First, including variants from multiple annotation categories is expected to increase accuracy (e.g., odds that the most significant gene at a locus is causal), since signals that overlap a single annotation type (e.g., eQTL variants) may be driven by linkage disequilibrium (LD) or pleiotropic regulatory effects [12, 13]. Second, it can increase power by increasing the signal-to-noise ratio, and capturing a wider range of possible mechanisms driving genetic associations with complex traits (e.g., [14–16]). For example, tests that incorporate both coding and eQTL variants are expected to have high power to detect both protein-altering associations as well as associations driven by effects on gene expression levels. One-dimensional annotation scores derived from multiple annotation data sets can be used to weight variants in gene-based tests (e.g., [17–19]), which can increase power by assigning higher weight to functional variants. However, aggregating variants separately for multiple annotation types and combining the result allows us to explicitly model multiple distinct genes and biological mechanisms underlying associations.
Here, we present a statistical framework and computational tool to integrate heterogeneous functional annotations with GWAS association summary statistics for gene-based analysis. We analyze a diverse set of functional annotation data including multiple tissue-specific eQTL annotation data sets, multiple epigenetic annotation sets mapping regulatory elements to putative target genes, coding variant annotations, and TSS annotations. We compare the performance of single-annotation, omnibus, and annotation-agnostic (not stratified or weighted by functional annotation) gene-based analysis methods through simulation studies, and by analyzing GWAS summary statistics from the UK Biobank [20]. Our contributions are to 1) expound a general statistical framework for gene-based analysis with heterogeneous functional annotations, which includes several existing single-annotation gene-based association methods as components or special cases; 2) provide a computationally efficient open-source tool for gene-based analysis from summary statistics; and 3) conduct a comprehensive analysis of statistical power and accuracy identifying causal genes across gene-based association methods through extensive simulation studies and analysis of GWAS data for 128 human traits.
We first outline a statistical framework and open-source tool for gene-based analysis with heterogeneous functional annotations. Next, we describe simulations to evaluate 1) the Type I error rates of gene-based test statistics, 2) statistical power, and 3) specificity to identify causal genes. Finally, we discuss applications to empirical data using GWAS summary statistics from the UK Biobank. We assess 1) the empirical power of gene-based tests by comparing the numbers of significant independent gene-based associations discovered for each UK Biobank trait, and 2) concordance with benchmark gene lists compiled from the ClinVar database [21] and the Human Phenotype Ontology (HPO) [22].
GAMBIT (Gene-based Analysis with oMniBus, Integrative Tests) is an open-source tool for calculating and combining annotation-stratified gene-based tests using GWAS summary statistics (single-variant association z-scores). Broadly, GAMBIT’s strategy is to first separately calculate single-annotation gene-based association tests stratified by functional annotation class, and aggregate across classes for each gene to construct omnibus gene-based tests (illustrated in Fig 1). Here and elsewhere, we refer to this omnibus test statistic as the GAMBIT gene-based test. GAMBIT calculates four general forms of gene based test statistics, described briefly in Table 1 and detailed in Materials and Methods. To account for LD between neighboring variants and genes, GAMBIT relies on an LD reference panel from an appropriately matched population (e.g., [23, 24]). GAMBIT is implemented in C++, open source, and freely available.

| Statistic | Null Distribution | References & Examples | |
|---|---|---|---|
| L-type | ∑k wk Zk | Burden [25, 26], PrediXcan [6], TWAS [8] | |
| Q-type |
![]() |
![]() | SKAT [27], SOCS [28] |
| M-type |
![]() | – | Min-P [29], MOCS [28] |
| ACAT |
![]() | ≈ Cauchy(0, ∑k wk) | [30, 59] |
| HMP | ∑k wk/(∑k′ wk′/pk′) | ≈ Landau(μ, π/2)−1 | [31] |
Basic gene-based test forms used in GAMBIT. Zk denotes the single-variant z-score association test statistic for variant k, with p-value
wk denotes the weight assigned to variant k. Any real-valued weights can be used in L-type tests, whereas Q-type, ACAT, and the harmonic mean p-value (HMP) require non-negative weights.
λk denotes the kth eigenvalue of diag(w)1/2
RZdiag(w)1/2, and each


GAMBIT analysis framework & workflow.
Broad overview of GAMBIT software pipeline. (1) GWAS association summary statistics (single-variant z-scores, or effect size estimates and standard errors) are cross-referenced and linked with multiple sets of functional annotations. (2) Annotated GWAS variants are cross-referenced with LD reference data (a haplotype reference panel to estimate LD as needed). (3) GWAS summary statistics, annotations, and LD estimates are used to calculate stratified gene-based test statistics. (4) Stratified gene-based tests are combined for each gene to construct omnibus test statistics. GAMBIT supports multiple single-annotation test methods and multiple omnibus test methods to combine single-annotation tests. Statistical tests are listed in Table 1; basic annotation types are illustrated in Fig 2 and listed in Table 2. A complete description of statistical methods and annotation types can be found in Materials and Methods.
We considered 5 broad annotation classes in our analysis: 1) proximity-based annotations, 2) coding annotations, 3) UTR regions, 4) enhancer and promoter regions, and 5) eQTL predictive weights. Each of these annotation classes comprises multiple subclasses; for example, annotations include non-synonymous, splice-site, and other variant categories; and eQTL variants are stratified by tissue. Briefly, we annotated coding and UTR variants using TabAnno [32] and EPACTS [33]; obtained enhancer element and enhancer-target gene weight annotations from RoadmapLinks [10, 34], GeneHancer [35], and JEME [11]; and pre-computed tissue-specific eQTL predictive weights from PredictDB [6, 7] and FUSION/TWAS [8]. Enhancer annotations were largely derived from NIH Roadmap Epigenomics and ENCODE project data [36, 37], as well as from the FANTOM Consortium [11, 38, 39]. All eQTL variant annotations were estimated using the GTEx project v7 data [40]. Fig 2 illustrates a subset of these annotations at the CELSR2 locus on chromosome 1; detailed descriptions of annotation data and statistical methods used to aggregate test statistics within and across classes are provided in Materials and Methods.


Regulatory annotation tracks and gene weights.
Illustration of primary regulatory annotation tracks used in GAMBIT gene-based analysis framework at the CELSR2 locus on chromosome 1. Top panel: Distance-to-transcription start site (dTSS) weights, calculated as wjk(α) = exp(−α|djk|), where djk is the number of base pairs between variant j and the TSS of gene k, shown for α = 10−5 (solid lines), α = 5 × 10−5 (dashed lines), and α = 10−4 (dotted lines). Gene bodies are indicated by arrows and variant locations are marked in black at y = 0. Middle panel: enhancer-to-target-gene confidence weights. Weights are shown for enhancer variant and target gene, and unique enhancer elements are marked by black lines at y = 0. Lower panel: tissue-specific eQTL weights for each gene. eQTL tissues are differentiated by shape.
We simulated GWAS summary statistics at 2,000 loci using haplotype data from the European subset of the 1000 Genomes Project (1KGP) Phase 3 reference panel [24]. Briefly, each locus was defined by first sampling a single causal protein-coding gene, aggregating all genes within 1 Mbp of the causal gene, and finally aggregating all variants assigned to one or more genes based on functional annotations or within ≤ 500kbp of any gene at the locus. For each of the 2,000 loci, we simulated genetic effects under four causal scenarios: 1) coding variants are causal, 2) eQTL variants are causal, 3) enhancer variants are causal, and 4) UTR variants are causal. For each locus and causal scenario, we varied the proportion of trait variance accounted for by variants at the locus
We compared performance identifying causal genes across 8 gene ranking methods: 1) ranking each gene by distance between its transcription start site (TSS) and the most significant independent single variant at the locus, 2) the Pascal SOCS test -log10p-value, which assigns equal weight to all variants within 500kbp of the gene body, 3) the omnibus test (“GAMBIT”) -log10p-value, and 4-8) -log10p-values for gene-based tests using each annotation class individually (listed in Table 2 and described in Materials and methods). As expected, test statistics calculated using the known causal annotation class alone were most accurate for identifying the causal gene (e.g., gene-based p-values using coding variants were most accurate when coding variants were causal); however, the GAMBIT omnibus test was nearly as accurate, and had the second-highest performance across simulation settings (Fig 3; S1 Fig). In practical applications, the causal mechanisms underlying associations are unknown and often heterogeneous across loci; in this case, we expect the GAMBIT omnibus testing strategy to be most accurate (Fig 3, right panel).

| Test Form | Annotation Subclasses | Annotated Variants | |
|---|---|---|---|
| dTSS | ACAT/HMP | dTSS-α value | Variants within 500kbp of TSS |
| CT-TWAS | L-type | eQTL tissue | eQTL variants across 48 tissues |
| Enhancers | Q-type; ACAT/HMP | Enhancer region | All enhancer variants |
| UTR | Q-type; ACAT/HMP | 3’ and 5’ UTR | 3’ and 5’ UTR variants |
| Coding | Q-type; ACAT/HMP | Variant type (e.g., missense, splice site) | Exonic variants |
Summary of variant types, default test methods, and default aggregation procedures for primary annotation classes in GAMBIT. Rationale and further details are provided in Materials and Methods.


Performance identifying causal genes in simulations.
Proportion of simulation replicates in which causal gene is top-ranked at its locus (y-axis) for each gene-based association or gene ranking method (x-axis & bar fill color) stratified by locus heritability
We also compared statistical power for each of the gene-based test methods at both causal and non-causal proximal genes at each simulated locus (Fig 4). For proximal genes, association signals are driven by LD and pleiotropic regulatory variants shared with the causal gene; thus, gene-based tests should ideally have high power for causal genes but comparatively low power for proximal genes. Similar to the previous analysis, gene-based tests using the causal annotation class alone had the highest power for causal genes and highest specificity (low power for proximal genes) across simulation settings. The omnibus test (“GAMBIT”) generally had the second-highest power for causal genes, and intermediate power for proximal genes. Thus, we expect the omnibus testing approach to be powerful and robust when causal mechanisms are unknown or heterogeneous across loci.


Statistical power to detect gene-based associations in simulations.
Statistical power (proportion of simulation replicates in which gene-based p-value ≤2.5 × 10−6 across loci; y-axis) for each gene-based testing approach (x-axis & color) stratified by locus heritability
To compare the power of gene-based tests in empirical data, we evaluated the numbers of significant independent gene-based associations detected for each method across 128 approximately independent GWAS traits in the UK Biobank (selection procedures are described in Materials and methods). The number of independent associations is calculated for each trait by selecting the most significant gene-based association p-value, masking all gene-based tests that include variants within 1 Mbp of variants for the selected gene, and repeating until all genes with Bonferroni-adjusted p-value ≤ 5% are either selected or masked. This procedure ensures that all selected genes are separated by at least 1 Mbp, and provides a conservative estimate of the number of significant independent signals. The omnibus test (“GAMBIT”) detected significantly more associations than other gene-based association methods overall (Fig 5A), and consistently detected more associations than other methods across a wide range of traits (Fig 5B). We also compared the numbers of significant associations for each method without filtering or LD pruning (Fig 6). Statistics that incorporate many variants over a broad region for each gene (e.g., dTSS-weighted tests) yield substantially more significant associations, as expected.


UK Biobank analysis: Numbers of significant independent associations detected.
Numbers of independent gene-based associations (at Bonferroni-corrected 5% significance level) detected by each method across 128 UK Biobank traits. Panel A: Total number of significant independent associations across traits (delineated by horizontal black lines) for each gene-based test; Wilcoxon signed-rank p-values (top) for paired comparisons between no. associations detected by the omnibus test (“GAMBIT”; red) versus Pascal/SOCS (blue) and single-annotation gene-based tests (green). The omnibus test detects significantly more associations than any individual constituent gene-based test or by Pascal/SOCS across UK Biobank traits. Panel B: Comparison of total numbers of genes detected across individual traits for the omnibus test (y-axis) versus single-annotation tests (x-axis).


UK Biobank analysis: Overlap between gene-based association methods.
Panel A: Total number of significant genes (p-value < 2.5e-6) for each method across all 128 traits. Unlike Fig 5, gene-based associations in Fig 6 are not filtered or LD pruned, and a single significant GWAS variant can produce multiple significant gene-based associations for a given method. Here, a larger number of significant genes does not necessarily suggest greater statistical power. Panel B: The i, jth heatmap element can be interpreted as the conditional probability that gene-based test i is significant given that gene-based test j is significant, which is estimated as the total number of overlapping significant genes between tests i and j divided by the total number of significant genes for test j.
We compiled lists of benchmark genes from the ClinVar database [21] and the Human Phenotype Ontology (HPO) [22] for 25 traits in the UK Biobank to compare the gene-based analysis methods identifying causal genes; procedures and selection criteria are detailed in Materials and Methods. Results are shown separately using the union and intersection of ClinVar and HPO benchmark genes; the latter gene set is expected to have higher specificity, albeit fewer genes. Performance identifying benchmark genes was assessed by ranking genes separately within each benchmark locus for each UK Biobank trait, where a benchmark locus is defined as the set of all genes within 1 Mbp of a genome-wide significant single-variant association that also is within 1 Mbp of a benchmark gene. To compare the performance of gene ranking methods, we calculated the fraction of loci at which the top-ranked gene coincides with a benchmark gene (Fig 7) and assessed receiver operating characteristic (ROC) and precision-recall curves for each method (S2 Fig).


UK Biobank analysis: Performance identifying benchmark genes.
Percentage of loci at which the benchmark gene (identified from HPO and/or ClinVar) is top-ranked for each gene-based association or gene ranking method. For each method, bars on the left (outlined in black) are calculated for benchmark loci present in both HPO and ClinVar (54 loci), and bars on the right (faded outline) are calculated using the union of all HPO and ClinVar loci (153 loci). Horizontal red lines indicate the expected percentage of top-ranked benchmark genes under the null hypothesis that gene rank and benchmark labels are independent. Error bars indicate 95% confidence intervals. TSS-to-top SNP refers to ranking genes by the distance between TSS and the most significant single variant at each causal locus; the dTSS-weighted gene-based test (dTSS) uses an exponential weight funcion to assign higher weight to variants nearer the TSS for each gene (Methods).
GAMBIT omnibus tests had the highest performance identifying benchmark genes among the gene ranking methods considered, particularly for the stricter gene set, although the difference was not statistically significant relative to most other gene ranking methods (Fig 7). Gene-based tests using coding variants alone had the second-highest performance (Fig 7; S2 Fig), which may reflect the enrichment for coding associations within the benchmark gene set (S3 Fig) caused by benchmark gene selection criteria (described in Materials and methods). Due to the over-representation of coding associations, Fig 7 may underestimate the impact of incorporating heterogeneous regulatory annotations for associated loci without an established benchmark gene.
Further inspection revealed a number of loci of biological or clinical interest. In the analysis of skin cancer in the UK Biobank, three melanin or melanogenesis-related genes (TYR, OCA2, and MC1R) and telomerase reverse transcriptase (TERT) were top-ranked by the omnibus test, but not top-ranked based on TSS-to-top-SNP distance, while all other benchmark genes for skin cancer were top-ranked by both methods or by neither. At the TERT locus, the lead GWAS variant was intronic, whereas the lead variants for TYR, OCA2, and MC1R were nonsynonymous. Unsurprisingly, the latter three benchmark genes were also top-ranked based on coding variant gene-based p-values; however, only TERT was top-ranked based on CT-TWAS.
Similarly, APOB, which encodes an apolipoprotein and is associated with autosomal dominant forms of hypercholesterolemia, was top-ranked by the omnibus test but not by TSS-to-top-SNP distance for disorders of lipoid metabolism in the UK Biobank. Despite being >150 Kbp from the intergenic lead GWAS variant, APOB was also top-ranked by all single-annotation gene-based tests individually. Conversely, TSHR, which encodes a thyroid horomone receptor, was top-ranked based on TSS-to-top-SNP distance but not by the omnibus test for thyrotoxicosis. In this case, the lead GWAS variant was intronic, and CT-TWAS was the only single-annotation gene-based test that ranked TSHR as the top gene at its locus; in this example, while the omnibus test for TSHR was significant, it was outranked by CEP128 at the locus. A complete table of results for benchmark genes is provided in Supplementary Materials.
Here, we introduced GAMBIT, a statistical framework and software tool for gene-based analysis with heterogeneous annotations. Our work makes several contributions to the field:
First, we conducted extensive simulation studies to systematically compare gene-based test methods across a range of plausible biological scenarios, and demonstrated pitfalls of test methods that use only a single annotation class. When the causal annotation class is misspecified, standard gene-based tests have limited power, and can be confounded by LD and pleiotropic regulatory variants that affect multiple genes. This may lead researchers to misidentify the genes and biological mechanisms that contribute to disease risk. Finemapping, co-localization, and conditional analysis can be applied to refine association signals and mitigate spurious inferences following gene-based analysis (e.g., [41–44]). By contrast, our omnibus testing strategy helps to ameliorate spurious inferences within the context of gene-based testing directly, and also has high power to detect associations across a range of causal mechanisms underlying genetic associations.
Second, we analyzed 128 traits from the UK Biobank to evaluate performance in empirical data across a range of complex traits and genetic architectures, and confirmed that incorporating annotations of many types and across many tissues increases power relative to standard methods. While our analysis of concordance with gold-standard causal genes was limited by the relatively small numbers of benchmark genes identified for UK Biobank traits and the inherent difficulty establishing causal genes underlying regulatory associations, we found suggestive evidence that incorporating diverse annotation types in gene-based analysis can improve performance identifying causal genes relative to standard approaches (e.g., ranking genes by distance to the most significant single variant) and gene based tests using a single annotation type.
Finally, we provide a unifying framework and easy-to-use software tool to incorporate heterogeneous functional annotations in gene-based analysis. From its inception, gene-based analysis was built on the premise that aggregating functional variants at the gene level can increase statistical power and help identify causal genes in GWAS [2]. Early gene based test methods were developed primarily for rare genic variants (e.g., [25, 26]), and early gene-based association analyses often used only deleterious coding variants (e.g., [45, 46]). However, functional genomics studies have shown that most functional variation is non-coding [37], and most variant associations discovered through GWAS to date occur in non-coding regions [1, 9], highlighting the importance of regulatory annotations for gene-based association analysis. The first gene-based tests developed explicitly for regulatory variation were TWAS and PrediXcan, which aggregate eQTL variants to construct proxy variables for tissue-specific gene expression levels using predictive weights estimated from external eQTL mapping data [7, 8]. However, functional and regulatory genomics projects have introduced a wealth of annotations with potential utility for gene-based analysis (e.g., [11, 35, 38, 47]).
The omnibus testing strategy used here is expected to perform best under sparse alternatives, e.g. when one or few annotation classes harbor causal variants at a given locus. When a larger fraction of annotation classes harbor independent signals at a single gene locus, this omnibus strategy may have less power than one that explicitly accounts for multiple simultaneous signal sources. While we did not explore this possibility in our simulations, it is an interesting question which we defer to future work.
Previous studies have evaluated the performance of gene-based tests under misspecification, e.g. by varying the proportion of causal variants and correlation structure of causal effects [48, 49]. In the present study, we evaluated the performance of single-annotation gene-based tests under misspecified causal mechanisms (for example, TWAS when a mechanism other than gene expression underlies the association signal). The former problem primarily concerns the statistical form of gene-based test and the distribution of causal effects, while the latter is more related to the informativeness of functional annotations and the overlap between classes of functional variation (e.g., Fig 6B). Our simulation studies also included basic forms of misspecification in the distribution of causal effects (e.g., when only a fraction of annotated variants in the causal annotation class have non-zero causal effects) and measurement error in functional annotations (e.g., by including a error term in TWAS weights). However, further research is needed to explicitly address model misspecification and annotation measurement error in gene-based analysis.
The utility of incorporating annotations in gene-based analysis depends crucially on the accuracy and comprehensiveness of the underlying annotation data sets. While we considered the case that causal variants may be misspecified, our simulations assumed that the confidence weights assigned to regulatory elements are well-calibrated, and that causal eQTL variants are annotated. Violations of these assumptions will reduce both power and accuracy in gene-based analysis, and may in part account for differences between our results with empirical versus simulated data. Current transcriptomic and epigenomic studies are generally limited to a subset of human tissues and cell-types, and are derived from data sets of limited sample size (e.g., [37, 47]). Thus, we expect current transcriptomic and epigenomic annotations to be incomplete and imprecise. Looking forward, larger and more comprehensive studies will enable more comprehensive and accurate annotations, increasing the utility of annotation-informed association analysis methods.
In summary, our work builds upon and generalizes previous gene-based association methods, providing a flexible framework for gene-based analysis with heterogeneous annotations that can be readily adapted when new annotation resources are developed and released.
We describe 1) gene-based association test statistics, 2) procedures to aggregate variants within each class of functional variation for gene-based analysis, 3) functional annotation data sources 4) procedures to simulate GWAS data using real genotype and functional annotation data, and 5) GWAS data from the UK Biobank to which we applied our methods.
Here, we review statistical methods to aggregate multiple variants for gene-based, region-based, or pathway association analysis. For convenience, we assume a quantitative trait and ignore the presence of covariates; however, our results can easily be adapted to other settings.
The oldest and most widely used gene-based tests are linear combinations of genotypes across variants [25, 26, 50], here referred to as L-type tests. We define the L-type test statistic as TL = (w⊤RZw)−1/2 w⊤ Z, where w is a vector of single-variant weights, Z is a vector of single-variant association statistics (where each Zj follows the standard normal distribution under the null hypothesis), and RZ is the correlation matrix of z-scores. Under the null hypothesis of no association, TL follows the standard normal distribution. The L-type test statistic TL can be computed from GWAS summary statistics (single-variant z-scores, or effect sizes and standard errors) and covariance estimates, and can be written either as linear combinations of single-variant association statistics or as linear combinations of genotypes [4, 51].
Examples of L-type tests include burden tests, which calculate burden scores as a weighted sum of rare, putatively deleterious mutations [25, 50]; the cohort allelic sums test (CAST) [52]; and TWAS/PrediXcan tests [6–8], which aggregate eQTL variants using predictive weights estimated from external data sets, e.g. from the GTEx project [40]. These can be viewed as tests of association between GWAS trait and an explicit proxy variable constructed as a linear combination of genotypes. Importantly, L-type tests rely on prior knowledge regarding the directions of effect across variants [27, 50]. For example, the signed weights used in burden tests often reflect the hypothesis that rare deleterious alleles increase risk for disease, and the predictive weights used in TWAS/PrediXcan reflect the hypothesis that gene expression mediates the associations between genotypes and complex trait.
Variance component tests and quadratic forms of single-variant association statistics comprise another widely used class of gene-based association methods, here referred to as Q-type (quadratic) tests. Q-type tests include VEGAS (or SOCS), defined as the sum of squared single-variant z-scores [28, 53]; the C-alpha test [54]; and SKAT, a weighted quadratic form of single-variant association statistics [27]. We define the Q-type test statistic as TQ = Z⊤ diag(w)Z, where diag(w) is a diagonal weight matrix and Z is a vector of single-variant association z-scores; under the null hypothesis of no association, TQ follows a mixture chi-squared distribution with mixture proportions equal to the eigenvalues of diag(w)1/2 RZ diag(w)1/2, where RZ is the correlation matrix of z-scores. In contrast to L-type tests, Q-type tests aggregate single-variant association statistics without prior knowledge or assumptions pertaining to the directions of effects across variants [27, 50]. While less tractable than L-type, analytical p-values for Q-type tests can be calculated using a variety of techniques to approximate the tail probabilities of multivariate normal quadratic forms (e.g., [55, 56]), which are far more efficient than permutation procedures or Monte Carlo methods [28, 57]. Q-type tests are most appropriate when a sizable proportion of variants are hypothesized to have non-zero effects of unknown and inconsistent direction [50].
Perhaps the simplest gene-based test is the maximum chi-squared statistic across variants (or equivalently, the minimum p-value), here referred to as M-type tests. Analytical p-values for M-type tests can be calculated by directly integrating the multivariate normal density of z-scores within the hypercube given by
The aggregated Cauchy association test (ACAT), a recently proposed method to combine multiple dependent p-values, can be used to construct gene-based tests by transforming single-variant association p-values using the Cauchy quantile and cumulative distribution functions, and computing a p-value

Another recently proposed method to combine multiple dependent p-values, the Harmonic Mean P-value (HMP; [31]), can similarly be used to construct gene-based tests by weighting p-values from single-variant association tests. The unadjusted HMP p-value is defined

While this statistic can be anti-conservative when directly interpreted as a p-value, Wilson (2019) showed that 1/pHMP follows a Landau distribution (with scale and location parameters given in Table 1), which can be used to compute an asymptotically exact HMP p-value. The Landau density function is

The simple forms of gene-based tests described above can be related and combined through a variety generalizations and extensions. Q-type and M-type can both be viewed as special cases of a statistic (∑j
wj|Zj|p)1/p, which is equivalent to Q-type when p = 2 and to M-type when p → ∞; this generalization has been used, for example, in the aSPU gene-based test [61]. Similarly, Q-type and L-type can both be viewed as special cases of a statistic
Here we describe methods to aggregate variants within each of the 5 major annotation classes considered in our analysis. Briefly, we use linear (L-type) tests to combine eQTL variants using signed predictive weights reflecting the alternative hypothesis that genotype effects on trait are mediated by gene expression levels. For coding and UTR variants, we use Q-type tests within each annotation subclass, reflecting the alternative hypothesis that genetic effects follow a symmetric mean-zero distribution. For dTSS-weighted tests, we use weighted dependent p-value combination procedures (ACAT or HMP), which can be viewed as an approximate test for the alternative hypothesis that a variant is causal with prior probability proportional to e−αdTSS.
Gene-based tests for coding variants are calculated by aggregating variants separately within each coding subclass (e.g., missense, nonsense, and synonymous) using Q-type (by default) test statistics. These stratified tests are then combined across subclasses using a p-value combination procedure (HMP or ACAT) to calculate a coding omnibus test for each gene.
Gene-based tests for UTR variants are calculated by aggregating variants separately within the 3’ and 5’ UTR regions using Q-type (by default) test statistics, and applying a p-value combination procedure (HMP or ACAT) to calculate a UTR omnibus test for each gene.
One of the most common heuristics to infer likely causal genes at non-coding GWAS loci is to rank genes by distance between their transcription start site (TSS) and the most significant single GWAS variant. This strategy is appealing given the strong enrichment of regulatory variants near TSS.
To incorporate distance-to-TSS (dTSS) and capture association signals at regulatory variants that are not well-annotated in gene-based analysis, we define the dTSS weights for gene k as
The optimal α value is expected to vary across loci, and likely depends on local gene density and other factors. However, ACAT and HMP can be applied again to calculate omnibus p-values by combining dTSS-weighted gene-based test p-values pk(αi) across multiple values α1, α2, … [30, 59]. By default, GAMBIT calculates overall dTSS-weighted test statistics by aggregating across α values 10−4, 5 × 10−5, 10−5, 5 × 10−6.
To capture association signals across regulatory elements that have been assigned to one or more target gene, we weight variants in regulatory elements by element-to-target-gene confidence scores, and aggregate variants for each gene using either ACAT, HMP, or Q-type gene-based test statistics. For example, we define the regulatory-element weighted Q-type test statistic as
Given a vector of weights bkt to predict expression levels for gene k in a given tissue or cell type t as a linear combination of normalized genotypes, we define the z-score TWAS test of association between predicted expression level and GWAS trait as
To aggregate test statistics across multiple tissues or cell-types, which we refer to as Cross-Tissue TWAS (CT-TWAS), we considered three approaches:
Q-type Cross-tissue Test (CT-Q): Calculating the sum of squared tissue-specific test statistics,
M-type Cross-tissue Test (CT-M): Calculating an analytic p-value for the maximum absolute test statistic maxt|Skt| using the multivariate normal joint density of tissue- or cell-type-specific test statistics Sk1, Sk2, … under the null hypothesis of no association, and
ACAT or HMP Cross-tissue Test (CT-A or CT-H): Combining tissue- or cell-type-specific p-values pkt = 2Φ(−|Skt|) using the ACAT method or HMP respectively.
CT-Q and CT-M require the cross-tissue correlation matrix RS with elements
In early versions of GAMBIT, we combined gene-based p-values across annotation classes for each gene using standard family-wise error rate (FWER) and false detection rate (FDR) controlling procedures. However, two recently proposed methods, the Aggregated Cauchy Association Test (ACAT; [59]) and Harmonic Mean P-value (HMP; [31]), provide more powerful approaches to combine multiple dependent p-values, and we have therefore implemented both of these methods in GAMBIT. Unlike gene-based tests such as SKAT, which are formulated as parametric tests in a generalized linear mixed model, these p-value combination methods are essentially non-parametric, assuming only that p-values are uniformly distributed under the null hypothesis. Like the exponential combination (EC) procedure [62], ACAT and HMP are powerful under sparse alternatives; however, unlike EC, they enable efficient analytic p-value calculation, with computation time linear in the number of p-values. We calculated ACAT p-values (defined above) using standard formula for the Cauchy quantile function and CDF. To calculate asymptotically exact HMP p-values, we adapted a C++ routine from the ROOT System [63] for computing the Landau distribution CDF following the derivations of Wilson (2019) for the asymptotic distribution of the HMP.
To identify regulatory genetic elements and their putative target genes, we used pre-computed annotation data sets from three existing methods: Joint Effects of Multiple Enhancers (JEME) [11], GeneHancer [35], and RoadmapLinks [10, 34, 64]. GeneHancer provides a global confidence score between each enhancer element and one or more putative target genes, while JEME and RoadmapLinks provide tissue- or cell-type-specific enhancer-target confidence scores. For the latter two data sets, we calculated overall enhancer-target confidence scores across tissues and cell types as the soft maximum (LogSumExp function) of tissue- or cell-type-specific scores for each enhancer-target pair. Descriptive statistics for each enhancer annotation dataset are provided in S2 Table.
To incorporate eQTL variants in gene-based analysis, we used pre-computed tissue-specific predictive weights for eGene expression estimated using GTEx v7 [47] from TWAS/FUSION (including elastic net and LASSO models) [8] and PredictDB [6, 7]. We generated a GAMBIT eWeight annotation files incorporating all available tissues and cell types for each data resource and predictive model. Descriptive statistics for each eQTL variants weight dataset are provided in S1 Table.
Here, we describe simulation procedures for GWAS summary statistics, configurations of causal genes, and causal variant effects.
We simulated GWAS traits under the model
We define the vector of single-variant association statistics (equivalent to t-test statistics from simple linear regression) for variants k = 1, 2, …, m as

We simulated GWAS association statistics Z by calculating
We used empirical functional annotation data to simulate causal genetic effects β, guided by the intuition that a variant’s functional effects ultimately determine its effects on complex traits. While minor allele frequency (MAF) was not explicitly used to select causal variants in simulations, this procedure induces an implicit relationship between MAF and causal status due to the relationship between MAF and functional annotations (S4 Fig). For each simulated causal locus, we selected a causal gene by sampling a single CCDS protein-coding gene, and defined proximal genes as any gene with TSS within 1 Mbp of the causal gene TSS. We then simulated single-variant GWAS summary statistics for all variants associated with any causal and proximal genes by proximity (≤ 1 Mbp) or functional annotations (e.g., eQTL variants).
We simulated causal genetic effects under 5 scenarios: 0) no association (null model), 1) coding association, 2) enhancer association, 3) eGene association, and 4) UTR association. For coding and UTR associations, we first selected the number of causal variants
For non-eQTL effects, we simulated the genetic effect for each causal variant βj from an iid normal distribution, scaled so that the total genetic variance at each locus is equal to
We used GWAS summary statistics (single-variant association effect size estimates, standard errors, and p-values) for a set of 1,403 traits in the UK Biobank [20] cohort calculated using SAIGE [68]. Genotype data were imputed using the Haplotype Reference Consortium panel [69], and filtered to include only variants with imputed MAC > 20 in the UK Biobank. We selected a subset of 189 traits for primary analysis by including only traits with effective sample size ≥ 5, 000, and ≥ 1 single-variant association p-value ≤2.5e-8. For our analysis of empirical power, we selected a subset of 128/189 traits by iteratively pruning pairs of correlated traits. Beginning with the most highly correlated pair of traits, we retained the trait with the larger number of significant independent single-variant associations (in the case of ties, we selected the trait with the most detailed description), and repeated this procedure until the maximum pairwise correlation-squared between traits was ≤0.10. Trait correlations were estimated from GWAS summary statistics as described in [70]. For our analysis of concordance with benchmark genes, we first selected a subset of 47 traits including only traits with ≥ 1 single-variant association p-value < 5e-10, excluding benign neoplasms, and including at most a single trait within each trait category. We identified ≥ 1 relevant benchmark genes for 25 of the original 47 traits.
Benchmark genes for each of the selected UK Biobank traits were identified using the ClinVar [21] and Human Phenotype Ontology (HPO) databases [22]. The HPO database explicitly links genes to traits, while the ClinVar database links traits to variants. To identify benchmark genes from ClinVar, we extracted protein-altering variants (frameshift, missense, nonsense, splice site, or stop-loss variants), and excluded variants with unknown or ambiguous molecular consequence (e.g., intergenic and intronic variants). Despite including only ClinVar genes with coding associations, we expect to capture some genes for which both rare coding variants and common regulatory variants contribute to disease risk. For each UK Biobank trait, we extracted all protein-altering ClinVar variants +/- 1 Mbp of a genome-wide significant UK Biobank variant, and manually selected ClinVar traits equivalent or closely related to the corresponding UK Biobank trait. We then annotated genes associated with one or more relevant ClinVar trait as a ClinVar benchmark gene. We identified benchmark genes from the HPO database by manually matching keywords between UK Biobank and HPO traits. A complete list of HPO/ClinVar traits and benchmark genes for each UK Biobank trait is provided in Supplementary Materials.
We gratefully acknowledge the participants and investigators of the 1000 Genomes Project and UK Biobank study. We thank Yaowu Liu for helpful discussions on p-value combination procedures; Xihong Lin for helpful comments and suggestions on the manuscript; and Sarah Gagliano, Jonas Billie Nielson, and Wei Zhou for assistance with data sets. This research has been conducted using the UK Biobank Resource under Application Number 24460.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71