PLoS Genetics
Home Immune fingerprinting through repertoire similarity
Immune fingerprinting through repertoire similarity
Immune fingerprinting through repertoire similarity

The authors have declared that no competing interests exist.

Article Type: research-article Article History
Abstract

Immune repertoires provide a unique fingerprint reflecting the immune history of individuals, with potential applications in precision medicine. However, the question of how personal that information is and how it can be used to identify individuals has not been explored. Here, we show that individuals can be uniquely identified from repertoires of just a few thousands lymphocytes. We present “Immprint,” a classifier using an information-theoretic measure of repertoire similarity to distinguish pairs of repertoire samples coming from the same versus different individuals. Using published T-cell receptor repertoires and statistical modeling, we tested its ability to identify individuals with great accuracy, including identical twins, by computing false positive and false negative rates < 10−6 from samples composed of 10,000 T-cells. We verified through longitudinal datasets that the method is robust to acute infections and that the immune fingerprint is stable for at least three years. These results emphasize the private and personal nature of repertoire data.

Immune repertoires are a trove of personal information: unique to each individual, they are also windows into their past and future health. Thanks to their potential for personalized medicine and progress of sequencing technologies, these repertoires are now routinely sequenced. As a consequence they raise the question of identifiability of samples. In this paper, we estimate the quantity of immune cells needed to associate two samples from the same individual: as little as a finger prick worth of blood can serve as an immune fingerprint that can distinguish even identical twins, without giving away genetic information about non-consenting relatives. We show that this fingerprint is stable through time, and is not erased during infections or vaccinations.

Dupic,Bensouda Koraichi,Minervina,Pogorelyy,Mora,Walczak,and Roopenian: Immune fingerprinting through repertoire similarity

1 Introduction

Personalized medicine is a frequent promise of next-generation sequencing. These high-throughput and low-cost sequencing technologies hold the potential of tailored treatment for each individual. However, progress comes with privacy concerns. Genome sequences cannot be anonymized: a genetic fingerprint is in itself enough to fully identify an individual, with the rare exception of monozygotic twins. The privacy risks brought by these pseudonymized genomes have been highlighted by multiple studies [13], and the approach is now routinely used by law enforcement. Sequencing experiments that focus on a limited number of expressed genes should be less prone to these concerns. However, as we will show, B- and T-cell receptor (BCR and TCR) genes are an exception to this rule.

BCR and TCR are randomly generated through somatic recombination [4], and the fate of each B- or T-cell clone depends on the environment and immune history. The immune T-cell repertoire, defined as the set of TCR expressed in an individual, has been hailed a faithful, personalized medical record, and repertoire sequencing (RepSeq) as a potential tool of choice in personalized medicine [59]. In this report, we describe how, from small quantities of blood (blood spot or heel prick), one can extract enough information to uniquely identify an individual, providing an immune fingerprint. The “Immprint” classifier analyzes this immune fingerprint to decide whether two samples were sampled from the same individual.

2 Results

Given two samples of peripheral blood containing respectively M1 and M2 T-cells, we want to distinguish between two hypothetical scenarios: either the two samples come from the same individual (“autologous” scenario), or they were obtained from two different individuals (“heterologous” scenario), see Fig 1A.

A) The two samples A and B can either originate from the same individual (autologous) or two different individuals (heterologous). In both scenarios, sequences can be shared between the two samples, but their quantity and quality vary. B) Schematic representation of the distribution of the S or I scores for multiple pairs of samples extracted from the same individual (in the autologous scenario) or the same pair of individuals (heterologous). The dashed vertical line represents the threshold value. C) Expected value of S and I for different pairs of samples, sampled from the same individual (in blue) or different ones (yellow). Green dots represent samples extracted from pairs of identical twins. The dashed lines represents the theoretical upper bound for heterologous repertoires (see Methods) for both S and I (γ = 12). D) Sampling distributions of S for 6 different patients (autologous case, each blue curve is one patient) or 6 different pairs of patients (heterologous, each yellow or green curve is a pair of patients) for M = 5000. The y-axis scale on the left is adapted to the heterologous distributions while the scale on the right corresponds to the (much wider) autologous ones. The 3 sampling distributions in green correspond to a pair of samples extracted from identical twins. E) Detection Error Trade-off (DET) graph for both summary statistics and different sample sizes M. I (γ = 12) outperforms S in all scenarios. F) AUROC (Area Under Receiver Operating Characteristic), as a function of M. The AUROC is a traditional measure of the quality of a binary classifier (a score closer to one indicates a better classifier). The results are shown for S and I both in the default case (only the β chain considered) or for the full (α—β) receptor.
Fig 1

A) The two samples A and B can either originate from the same individual (autologous) or two different individuals (heterologous). In both scenarios, sequences can be shared between the two samples, but their quantity and quality vary. B) Schematic representation of the distribution of the S or I scores for multiple pairs of samples extracted from the same individual (in the autologous scenario) or the same pair of individuals (heterologous). The dashed vertical line represents the threshold value. C) Expected value of S and I for different pairs of samples, sampled from the same individual (in blue) or different ones (yellow). Green dots represent samples extracted from pairs of identical twins. The dashed lines represents the theoretical upper bound for heterologous repertoires (see Methods) for both S and I (γ = 12). D) Sampling distributions of S for 6 different patients (autologous case, each blue curve is one patient) or 6 different pairs of patients (heterologous, each yellow or green curve is a pair of patients) for M = 5000. The y-axis scale on the left is adapted to the heterologous distributions while the scale on the right corresponds to the (much wider) autologous ones. The 3 sampling distributions in green correspond to a pair of samples extracted from identical twins. E) Detection Error Trade-off (DET) graph for both summary statistics and different sample sizes M. I (γ = 12) outperforms S in all scenarios. F) AUROC (Area Under Receiver Operating Characteristic), as a function of M. The AUROC is a traditional measure of the quality of a binary classifier (a score closer to one indicates a better classifier). The results are shown for S and I both in the default case (only the β chain considered) or for the full (α—β) receptor.

TCR are formed by two protein chains α and β. They each present a region of high somatic variability, labeled CDR3α and CDR3β, randomly generated during the recombination process. These regions are coded by short sequences (around 50 nucleotides), which are captured by RepSeq experiments. The two chains are usually not sequenced together so that the pairing information between α and β is lost. Most experiments focus on the β chain, and when not otherwise specified, the term “receptor sequence” in this paper will refer only to the nucleotide sequence of the TRB gene coding for this β chain (which include CDR3β). Similarly, as most cells expressing the same beta chain are clonally related, we will be using the terms “clone” and “clonotype” to refer to set of cells with the same nucleotide TRB sequence, even if they were produced in separate generation events and are not real biological clones (since we have no means of distinguishing the two cases). CDR3β sequences are very diverse, with more than 1040 possible sequences [10]. For comparison, the TCRβ repertoire of a given individual is composed of 108 to 1010 unique clonotypes [11, 12]. As a result, most of the sequences found in a repertoire are “private”.

2.1 Immprint scores

To discriminate between the autologous and heterologous scenarios, one can simply count the number of unique nucleotide receptor sequences shared between the two samples, which we call S. Samples coming from the same individual should have more receptors in common because T-cells are organized in clones of cells carrying the same TCR. By contrast, S should be low in pairs of samples from different individuals, in which sharing is due to rare convergent recombinations. Appropriately setting a threshold to jointly minimize the rates of false positives and false negatives (Fig 1B), we can use S as a classifier to distinguish autologous from heterologous samples. S is not normalized for sequencing depth and values of S should not be compared between samples of different size.

The S score can be improved upon by exploiting the fact that some receptors are much more likely than others to be generated during V(D)J-recombination, with variations in generation probability (Pgen, [1315]) spanning 15 orders of magnitude. Public sequences (with high Pgen) are likely to be found in multiple individuals [16], while rare sequences (low Pgen) are unlikely to be shared by different individuals, and thus provide strong evidence for the autologous scenario when found in both samples. To account for this information, we define the score:

which accounts for Shannon’s “surprise” ln (1/Pgen)—a measure of unexpectedness—associated with each shared sequence s, so that rare shared sequences count more than public ones. The constant γ depends on the repertoire’s clonal structure and is set to 12 in the following (see Methods for an information-theoretic derivation). Pgen is computed using models previously trained on data from multiple individuals [14]. Small differences reported between the Pgen of distinct individuals justify the use of a universal model [15].

2.2 Measuring error rates

We tested the classifiers based on the S and I scores on TCRβ RepSeq datasets from 656 individuals [17]—labeled according to their cytomegalovirus (CMV) serological status. Sequences were downsampled to mimic experiments where M1 = M2 = M cells were sampled and their receptors sequenced. The frequency of a particular clonotype in the sample (the proportion of cells expressing a particular beta chain) was estimated using the read counts of unique TCRβ sequences, and the mean values of S and I computed with a procedure designed to correct for the limited diversity of the sampled repertoire relative to the full repertoire, see Methods 4.3. Similar results may be obtained when M1 and M2 are different (see Methods). The clones most often shared between two autologous samples are also the most clonally expanded—and hence are probably antigen experienced. We verified that the sequence statistics of those expanded clonotypes did not differ from generic ones (S1 Fig).

In Fig 1C, we plot the mean value of S (over many draws of M = 5000 cells) for each individual (autologous scenario, in blue) and between pairs of different individuals (heterologous scenario, in yellow). The two scenarios are clearly discernable under both scores. This result holds for pairs of monozygotic twins obtained from a distinct dataset [18] (green dots), consistent with previous reports that twins differ almost as much in their repertoires as unrelated individuals [1820]. Heterologous scores (yellow dots) vary little, and may be bounded from above by a theoretical prediction (dashed line) based on a model of recombination [21] (see Methods). On the other hand, autologous scores (blue dots) show several orders of magnitude of variability across individuals. These variations stem from the clonal structure of the repertoire, and correlate with measures of diversity (S2 Fig), which is known to vary a lot between individuals and correlates with age [22], serological status, and infectious disease history [23, 24]. To explore the worst case scenario of discriminability, hereafter we will focus on the individual with the lowest autologous S found in the dataset.

The sampling process introduces an additional source of variability within each individual. Two samples of blood from the same individual do not contain the exact same receptors, and the values of S and I is expected to vary between replicates. Examples of these variations are shown in Fig 1D. The blue (respectively yellow) curves correspond to the sample distributions in the autologous (heterologous) scenario for different individuals (pairs of individuals). The distribution of S is well-approximated by a Poisson distribution, while I follows approximately a compound distribution of a normal and Poisson distributions (see Methods for details). Armed with these statistical models of variations, we can predict upper bounds for the false negative and false positive rates. As seen from the detection error trade-off (DET) graph Fig 1E, the Immprint classifier performs very well for a few thousand receptors with an advantage for I.

With 10, 000 cells, corresponding to ∼10 μL of blood, Immprint may simultaneously achieve a false positive rate of < 10−16 and false negative rate of < 10−6, allowing for the near-certain identification of an individual based on the I score in pairwise comparisons against the world population ∼1010. When a large reference repertoire has been collected (M1 = 1, 000, 000, corresponding to ∼1mL of blood), an individual can be identified with just 100 cells (S3 Fig).

The AUROC estimator (Area Under the Curve of the Receiver Operating Characteristic), a typical measure of a binary classifier performance, can be used to score the quality of the classifier with a number between 0.5 (chance) and 1 (perfect classification). The I score outperforms the S score (Fig 1F), particularly above moderate sample sizes (M ≈ 5000). Both scores can be readily generalized to the case of paired receptors, if the pairing of the two chains is available through single-cell sequencing [2527] or computational pairing [28], we can consider the sequence of the full receptor TCRαβ, instead of just TCRβ. The generation probability of the full TCR is then given by Pgen(α, β) = Pgen(α) × Pgen(β)[29]. Because coincidental sharing of both chains is substantially rarer than with the β chain alone, using the paired chain information greatly improves the classifier.

While this paper focuses on T-cells and TCR sequences, the structure of the B-cell receptors (BCR) repertoire is very similar to the TCR repertoire and we expect to find qualitatively similar results. As an example we use the dataset obtained in Ref. [30] to measure S and I for IGH chains (forming half of the BCR receptor), S4 Fig. We see that 5000 IgG+ memory B-cells are enough to reliably identify the individuals in the study. However, B-cells are 5 times less common in peripheral blood than T-cells, and somatic hypermutation tends to distort the statistics of receptors, reducing the reliability of our classifier. Hence, for practical applications, T-cells are better means of identification.

2.3 Evolution with time

The previous results used samples obtained at the same time. However, immune repertoires are not static: interaction with pathogens and natural aging modify their composition. The evolution of clonal frequencies will decrease Immprint’s reliability with time, especially if the individual has experienced immune challenges in the meantime.

To study the effect of short-term infections, we analyzed an experiment where 6 individuals were vaccinated with the yellow fever vaccine, which is regarded as a good model of acute infection, and their immune system was monitored regularly through blood draws [18]. We observe an only moderate drop in I caused by vaccination (Fig 2A).

A) Evolution of I (M = 5000) during vaccination, between a sample taken at day 0 (vaccination date) and at a later timepoint. Each color represents a different individual. Each pair timepoint/individual has two biological replicates. The dashed line represents the threshold value. B) Evolution of I between a sample taken at year 0 and a later timepoint. The red histogram corresponds to one of the individuals sampled in [18] and the blue curves show theoretical estimates, fitted to match (τ = 0.66). C) Evolution of the (normalized) mean of I (M = 5000) as a function of time for different values of the turnover rate τ. The dashed line represents the threshold value divided by the smallest value of It = 0 (M = 5000) in the data. The data points were obtained from the datasets [35] (yellow), [18] (green) and [22] (orange). Different markers indicate different individuals.
Fig 2

A) Evolution of I (M = 5000) during vaccination, between a sample taken at day 0 (vaccination date) and at a later timepoint. Each color represents a different individual. Each pair timepoint/individual has two biological replicates. The dashed line represents the threshold value. B) Evolution of I between a sample taken at year 0 and a later timepoint. The red histogram corresponds to one of the individuals sampled in [18] and the blue curves show theoretical estimates, fitted to match (τ = 0.66). C) Evolution of the (normalized) mean of I (M = 5000) as a function of time for different values of the turnover rate τ. The dashed line represents the threshold value divided by the smallest value of It = 0 (M = 5000) in the data. The data points were obtained from the datasets [35] (yellow), [18] (green) and [22] (orange). Different markers indicate different individuals.

This is consistent with the fact that infections lead to the strong expansion of only a limited number of clones, while the rest of the immune system stays stable [3134]. While other types of infections, auto-immune diseases, and cancers may affect Immprint in more substantial ways, our result suggests that it is relatively robust to changes in condition.

We then asked how stable Immprint is over long times. Using longitudinal datasets [18, 22, 35], we show in Fig 2 that the Immprint score I is only slightly reduced for samples collected at intervals of up to three years. For longer timescale, given the lack of longitudinal datasets, we turn to mathematical models [12, 3639] to describe the dynamics of the repertoire. Following the model of fluctuating growth rate described in Ref. [38], we define two typical evolutionary timescales for the immune system: τ, the typical turnover rate of T-cell clones, and θ, which represents the typical time for a clonotype to grow or shrink by a factor two as its growth rate fluctuates. The model predicts a power-law distribution for the clone-size distribution, with exponent −1 − τ/2θ. This exponent has been experimentally measured to be ≈ − 2 [10], which leaves us with a single parameter τ, and θ = τ/2. An example of simulated evolution of Immprint with time is shown in Fig 2B The highlighted histogram represents a data point at two years obtained from [18]. While a fit is possible for this specific individual (τ = 0.66 in Fig 2B), the τ parameter is not universal, and we expect it to vary between individuals, especially as a function of age. In Fig 2C we explore the effect on the stability of Immprint for a range of reasonable values for the clone turn-over rate τ, from 6 months to 10 years, encompassing both previous estimates of the parameter [38] and measured turnover rates for different types of T-cells [40]. While Fig 2 focuses on I, the behaviour is similar for S (see S5 Fig). We observe that under this model, for most individuals and bar exceptional events, Immprint should conserve its accuracy for years or even decades.

3 Discussion

In summary, the T-cells present in small blood samples provide a somatic and long-lived barcode of human individuality, which is robust to immune challenges and stable over time. While the uniqueness of the repertoire was a well known fact, we demonstrated that the most common T-cells clones are still diverse enough to uniquely define an individual and frequent enough to be reliably sampled multiple times. Unlike genome sequencing, repertoire sequencing can discriminate monozygotic twins with the same accuracy as unrelated individuals. Additionally, a person’s unique immune fingerprint can be completely wiped out by a hematopoietic stem cell transplant [41].

A potential complication in applying Immprint is the convergent evolution of repertoires: individuals who encounter similar pathogenic environments could share many receptors. While this phenomenon occurs [17, 42], its influence on the immune repertoire is low. For example, in the context of cytomegalovirus infection, shared TCR clones are only slightly more common in co-infected individuals [17], and the result of Immprint does not seem to be affected (S6 Fig). The possibility to discriminate between twins—who shared a common environment for part of their lives—also hints that in most cases the effects of environment-driven convergence is small. Nonetheless we cannot reject the possibility that this effect is stronger for some specific pathogens, or long and strongly overlapping infection histories with pathogens that severely modify immune repertoires. A limit case study to quantitatively investigate this effect would require looking at data from mice living in otherwise sterile environments that are exposed in a controlled way to the same pathogens at the same time, bearing in mind that the diversity of mice repertoires is smaller than that of humans.

The different datasets used cover a range of different sequencing methods (see 4.1), but different approaches may lead to slightly different threshold choices. In particular, in practical implementations, sequencing depth is an important concern. One needs enough coverage to sequence TCRβ genes from as many as possible of the T-cells present in the sample, in order to measure a more precise immune fingerprint. In addition, the specific calculations presented here only apply to peripheral blood cells. Specific cell types or cells extracted from tissue samples may have different clonal distributions and potentially different receptor statistics. For example the value of S in the autologous case varies between CD4+ and CD8+ T-cells (S7 Fig), although different individuals remain distinguishable using each subset.

Immprint is implemented in a python package and webapp (see Methods) allowing the user to determine the autologous or heterologous origin of a pair of repertoires. Beyond identifying individuals, the tool could be used to check for contamination or labelling errors between samples containing TCR information. The repertoire information used by Immprint can be garnered not only from RepSeq experiments, but also from RNA-Seq experiments, which contain thousands of immune receptor transcripts [43, 44]. Relatively small samples of immune repertoires are enough to uniquely identify an individual even among twins, with potential forensics applications. At the same time, unlike genetic data from genomic or mRNA sequencing, Immprint provides no information about kin relationships, very much like classical fingerprints, and avoids privacy concerns about disclosing genetic information shared with non consenting relatives.

4 Methods

4.1 Datasets & pre-processing

We use five independant RepSeq datasets in this study: (i) genomic DNA from Peripheral blood mononuclear cells (PBMCs) from 656 healthy donors [17]; (ii) cDNA of PBMCs sampled from three pairs of twins, before and after a yellow-fever vaccination [18]; (iii), (iv) two longitudinal studies of healthy adults [22, 35];(v) cDNA dataset of IGH genes (B-cells) from 9 individuals, with multiple replicates [30].

CDR3 nucleotide sequences were extracted with MIGEC [45] (for the second dataset) coupled with MiXCR [46]. We also extract the frequency of reads from the three datasets. The non-productive sequences were discarded (out-of-frame, non-functional V gene, or presence of a stop codon). The generation probability (Pgen) was computed using OLGA [47], with the default TCRβ model. The frequency of each clone was estimated by summing the frequencies of all reads that shared the same nucleotide CDR3 sequence and identical V,J genes.

The preprocessing code is distributed on the Git repository associated with the paper. We also developed a command-line tool (https://github.com/statbiophys/immprint) that discriminates between sample origins, and a companion webapp (https://immprint.herokuapp.com).

4.2 Discrimination scores

To discriminate between the autologous and heterologous scenarios, we introduce a log-likelihood ratio test between the two possibilities:

where y1(s) = 1 if the sequence s is found in sample 1, and 0 otherwise; likewise y2(s) = 1 if s is in sample 2. The sum runs over all potential sequences s, including unseen ones. To be present in a sample, a sequence s first has to be present in the repertoire. This occurs with probability 1-(1-p(s))Nc, where Nc is the total number of clonotypes in the repertoire, and p(s) is the probability of occurence of sequence s (resulting from generation and selection, see below). Second, it must be picked in a sample of size M, with probability 1 − (1 − f)MMf (assuming Mf ≪ 1) depending on its frequency f, which is distributed according to the clone size distribution ρ(f). We checked that f(s) and Pgen(s) were not correlated (S1 Fig) and that the effects of a shared infection between different individuals were limited to a handful of clones (S8 Fig). Then one can write
where we’ve used dfρ(f)f=1/Nc. For the heterologous case the probability factorizes as:
with

Since only the term y1(s) = y2(s) = 1 (shared sequences) is different between the autologous and heterologous cases, we obtain:

Further assuming Nc p(s) ≪ 1, and p(s) = Pgen(s)q−1 (where q accounts for selection [21] and Pgen(s) is the probability of sequence generation [14]), the score simplifies to Eq 1, with γ = −ln(qNcf2〉) = ln(q−1f〉/〈f2〉). The factor γ depends on unknown parameters of the model, but can be estimated assuming a power-law for the clone size distribution [48], ρ(f)∝ f−2 extending from f = 10−11 to f = 0.01, and q = 0.01 [21], yielding γ ≈ 12.24. Alternatively we optimized γ to minimize the AUROC, yielding γ ≈ 15 (S9 Fig). Since performance degrades quickly for larger values, we conservatively set γ = 12.

4.3 Estimating mean scores from RepSeq datasets

The sampling of M cells from blood is simulated using large repertoire datasets. In a bulk repertoire sequencing dataset, the absolute number of cells for each clonotype (cells with a specific receptor) is unknown, but the fraction of each clonotype can be estimated using the proportion of reads that are associated with this specific receptor. To estimate the autologous S and I of two samples of size M1 and M2 in the absence of true replicates, we computed their expected values from a single dataset containing N reads, from which two random subsamples of sizes M1 and M2 were taken. The mean value of S is equal to S=s(1-(1-f(s))M1)(1-(1-f(s))M2), where f(s) is the true (and unknown) frequency of sequence s. A naive estimate of 〈mS〉 may be obtained by repeatly resampling subsets of sizes M1 and M2 from the observed repertoire, calculate S for each draw, and average. One get the same result by replacing f(s) by f^s=n(s)/N in the previous formula, where n(s) is the number of s reads in the full dataset, and N = ∑s n(s). However, this naive estimate leads to a systematic overestimate of the sharing (visible when compared with biological replicates, see S10 Fig), simply because this procedure overestimates the probability of resampling rare sequences, in particular singletons whose true frequency may be much lower that 1/N. A similar bias occurs when computing I. To correct for this bias, we look for a function h(n) that satisfies for all f:

so that S and I can be well approximated by:

Expanding the right-hand side of Eq 9 into 4 terms, we find that h(n)=1-gM1(n)-gM2(n)+gM1+M2(n) satisfies Eq 9 provided that:

Under the change of variable x = f/(1−f), the expression becomes:

Identifying the polynomial coefficients in xn on both sides yields:

These corrected estimates agree with the direct estimates using biological replicates (S10 Fig).

Similarly, S and I in heterologous samples can be estimated using:

where n(s) and n′(s) are the empirical counts of sequence s in the two samples.

4.4 Theoretical upper bound on heterologous scores

When the two samples were extracted from two different individuals (heterologous scenario), we can use the universality of the recombination process to give upper bounds on the values of S and I. These bounds are represented by the dashed lines in Fig 1C). If two samples of respectively M1 and M2 cells, containing T1M1 and T2M2 unique sequences are extracted from two different individuals, the number of shared sequences between them is given by [21]:

p(s) is the probability of finding a sequence s in the blood. Following [21], we make the approximation p(s) = Pgen(s)q−1, where the q = 0.01 factor is the probability that a generated sequence passes selection. Then 〈p(s)〉 can be estimated from the mean over generated sequences. Similarly, we can estimate I as

which is also estimated from the mean over generated sequences.

4.5 Error rate estimates

To make the quantitative predictions shown in Fig 1, we need to constrain the tail behavior of the distributions of S and I, for the two scenarios.

The S statistic can be rewritten as a sum of Bernouilli variables over all possible sequences, each with a parameter corresponding to its probability of being present in both samples, either in the autologous or the heterologous case. Therefore S is a Poisson binomial distribution, a sum of independent Bernouilli variables with potentially different parameters. The variance and tails of that distribution are bounded by those of the Poisson distribution with the same mean, denoted by ma for the autologous case, and mh for the heterologous case (S11 Fig).

Thanks to that inequality, the rates of false negatives and false positives for a given threshold r are bounded by:

where Q is the regularized gamma function, which appears in the cumulative distribution function of the Poisson distribution. The mean autologous score ma is estimated from experimental data: we use the smallest value of S in the Emerson dataset and Eq 10. To compute mh, we use the semi-theoretical prediction made using the universality of the recombination process, Eq 17.

Similarly, I can be viewed as a sum of S independent random variables, all following the distribution of ln(1/Pgen)−γ. However, this distribution differs in the two scenarios. Sequences shared between more than one donor have an higher probability of being generated, their ln(Pgen) distribution has higher mean and smaller variance (S12 Fig).

The sum is composed of a relatively large number of variables in most realistic scenarios. Hence, we rely on the central limit theorem to approximate it by a normal distribution, of mean and variance proportional to S. Explicitly:

The AUROC are computed based on these estimates, by numerically integrating the true positive rate P(S,I<r|heterologous) with respect to the false negative rate P(S,I<r|autologous) as the threshold r is varied.

4.6 Modeling the evolution of autologous scores

We use the model of Ref. [38] to describe the dynamics of individual T-cell clone frequencies f under a fluctuating growth rate reflecting the changing state of the environment and the random nature of immune stimuli:

where η(t) is a Gaussian white noise with 〈η(t)〉 = 0 and 〈η(t)η(t′)〉 = δ(tt′).

With the change of variable x = ln(f), these dynamics simplify to a simple Brownian motion in log-frequency: ∂t x = −τ−1+ θ−1/2 η(t). In that equation, τ appears as the decay rate of the frequency, while θ is the timescale of the noise, interpreted as the typical time it takes for the frequency to rise or fall by a logarithmic unit owing to fluctuations. Considering a large population of clones, each with their independent frequency evolving according to Eq 24, and a source term at small f corresponding to thymic exports, one can show that the steady-state probability density function of f follows a power-law [38], ρ(f) ∝ fα, with exponent α = 1+ 2θ/τ. α was empirically found to be ≈2 in a wide variety of immune repertoires [10, 4850], implying 2θτ. The turn-over time τ is unknown, and was varied from 0.5 years to 10 years in the simulations.

We simulated the evolution of human TRB repertoires by starting with the empirical values of the frequencies of each observed clones, f(s,0)=f^(s,0)=n(s,0)/N from the analysed datasets. A sample of size M was randomly selected with respect to these frequencies, and the frequencies of the clones captured in that sample were then evolved with a time-step of 2 days using Euler-Maruyama’s method, which is exact in the case of Brownian motion. Clones with frequencies falling below 10−11 (corresponding to a single cell in the organism) were removed. At each time t > 0, we measured the mean value of S with the formula ∑s(1 − (1 − f(s, t))M) where the sum runs over the sequences captured in the initial sample.

Acknowledgements

The authors are grateful for the help of Natanael Spisak with the analysis of BCR repertoire datasets.

References

NHomer, SSzelinger, MRedman, DDuggan, WTembe, JMuehling, et al Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays. PLOS Genetics. 2008;4(8):e1000167 10.1371/journal.pgen.1000167

MNaveed, EAyday, EWClayton, JFellay, CAGunter, JPHubaux, et al Privacy in the Genomic Era. ACM computing surveys. 2015;48(1). 10.1145/2767007

LSweeney, AAbu, JWinn. Identifying Participants in the Personal Genome Project by Name. Rochester, NY: Social Science Research Network; 2013. ID 2257732.

NHozumi, STonegawa. Evidence for somatic rearrangement of immunoglobulin genes coding for variable and constant regions. Proc Natl Acad Sci USA. 1976;73(10):36283632. 10.1073/pnas.73.10.3628

HRobins. Immunosequencing: applications of immune repertoire deep sequencing. Current Opinion in Immunology. 2013;25(5):64652. 10.1016/j.coi.2013.09.017

MAttaf, EHuseby, AKSewell. αβ T cell receptors as predictors of health and disease. Cellular & molecular immunology. 2015;12(4):391399. 10.1038/cmi.2014.134

DJWoodsworth, MCastellarin, RaHolt. Sequence analysis of T-cell repertoires in health and disease. Genome medicine. 2013;5(10):98 10.1186/gm502

PBradley, PGThomas. Using T Cell Receptor Repertoires to Understand the Principles of Adaptive Immune Recognition. Annual Review of Immunology. 2019;37(1):547570. 10.1146/annurev-immunol-042718-041757

MMDavis, SDBoyd. Recent progress in the analysis of αβ T cell and B cell receptor repertoires. Current Opinion in Immunology. 2019;59:109114. 10.1016/j.coi.2019.05.012

10 

Mora T, Walczak A. Quantifying Lymphocyte Receptor Diversity. é. 2016;.

11 

QQi, YLiu, YCheng, JGlanville, DZhang, JYLee, et al Diversity and clonal selection in the human T-cell repertoire. Proceedings of the National Academy of Sciences of the United States of America. 2014;111(36):1313944. 10.1073/pnas.1409155111

12 

GLythe, RECallard, RLHoare, CMolina-París. How many TCR clonotypes does a body maintain? Journal of Theoretical Biology. 2016;389:214224.

13 

AMurugan, TMora, AMWalczak, CGCallan. Statistical inference of the generation probability of T-cell receptors from sequence repertoires. Proceedings of the National Academy of Sciences. 2012;109(40):1616116166. 10.1073/pnas.1212755109

14 

QMarcou, TMora, AMWalczak. High-Throughput Immune Repertoire Analysis with IGoR. Nature Communications. 2018;9(1):561 10.1038/s41467-018-02832-w

15 

ZSethna, GIsacchini, TDupic, TMora, AMWalczak, YElhanati. Population Variability in the Generation and Thymic Selection of T-Cell Repertoires. bioRxiv. 2020; p. 2020.01.08.899682.

16 

VVenturi, DAPrice, DCDouek, MPDavenport. The molecular basis for public T-cell responses? Nature Reviews Immunology. 2008;8(3):231238.

17 

ROEmerson, WSDeWitt, MVignali, JGravley, JKHu, EJOsborne, et al Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nature Genetics. 2017;49(5):659665 10.1038/ng.3822

18 

MVPogorelyy, AAMinervina, MPTouzel, ALSycheva, EAKomech, EIKovalenko, et al Precise Tracking of Vaccine-Responding T Cell Clones Reveals Convergent and Personalized Response in Identical Twins. Proceedings of the National Academy of Sciences. 2018; p. 201809642 10.1073/pnas.1809642115

19 

IVZvyagin, MVPogorelyy, MEIvanova, EAKomech, MShugay, DABolotin, et al Distinctive Properties of Identical Twins’ TCR Repertoires Revealed by High-Throughput Sequencing. Proceedings of the National Academy of Sciences of the United States of America. 2014;111(16):59805985. 10.1073/pnas.1319389111

20 

HTanno, TMGould, JRMcDaniel, WCao, YTanno, REDurrett, et al Determinants Governing T Cell Receptor α/β-Chain Pairing in Repertoire Formation of Identical Twins. Proceedings of the National Academy of Sciences. 2020;117(1):532540. 10.1073/pnas.1915008117

21 

YElhanati, ZSethna, CGCallan, TMora, AMWalczak. Predicting the spectrum of TCR repertoire sharing with a data-driven model of recombination. Immunological Reviews. 2018;284(1):167179. 10.1111/imr.12665

22 

OVBritanova, MShugay, EMMerzlyak, DBStaroverov, EVPutintseva, MATurchaninova, et al Dynamics of Individual T Cell Repertoires: From Cord Blood to Centenarians. The Journal of Immunology. 2016;196(12):50055013. 10.4049/jimmunol.1600005

23 

AWSylwester, BLMitchell, JBEdgar, CTaormina, CPelte, FRuchti, et al Broadly targeted human cytomegalovirus-specific CD4+ and CD8+ T cells dominate the memory compartments of exposed subjects. Journal of Experimental Medicine. 2005;202(5):673685. 10.1084/jem.20050882

24 

NKhan, NShariff, MCobbold, RBruton, JAAinsworth, AJSinclair, et al Cytomegalovirus Seropositivity Drives the CD8 T Cell Repertoire Toward Greater Clonality in Healthy Elderly Individuals. The Journal of Immunology. 2002;169(4):19841992. 10.4049/jimmunol.169.4.1984

25 

PDash, JLMcClaren, THOguin, WRothwell, BTodd, MYMorris, et al Paired analysis of TCRα and TCRβ chains at the single-cell level in mice. Journal of Clinical Investigation. 2011;121(1):288295. 10.1172/JCI44752

26 

DRedmond, APoran, OElemento. Single-Cell TCRseq: Paired Recovery of Entire T-Cell Alpha and Beta Chain Transcripts in T-Cell Receptors from Single-Cell RNAseq. Genome Medicine. 2016;8(1):80 10.1186/s13073-016-0335-7

27 

KGrigaityte, JACarter, SJGoldfless, EWJeffery, JRonald, YJiang, et al Single-cell sequencing reveals αβ chain pairing shapes the T cell repertoire. bioRxiv:213462 2017;

28 

BHowie, AMSherwood, ADBerkebile, JBerka, ROEmerson, DWWilliamson, et al High-throughput pairing of T cell receptor a and b sequences. Science translational medicine. 2015;7(301):301ra131 10.1126/scitranslmed.aac5624

29 

TDupic, QMarcou, AMWalczak, TMora. Genesis of the T-Cell Receptor. PLOS Computational Biology. 2019;15(3):e1006874 10.1371/journal.pcbi.1006874

30 

BBriney, AInderbitzin, CJoyce, DRBurton. Commonality despite Exceptional Diversity in the Baseline Human Antibody Repertoire. Nature. 2019;566(7744):393397. 10.1038/s41586-019-0879-y

31 

WSDeWitt, ROEmerson, PLindau, MVignali, TMSnyder, CDesmarais, et al Dynamics of the Cytotoxic T Cell Response to a Model of Acute Viral Infection. Journal of Virology. 2015;249(February):JVI.0347414. 10.1128/JVI.03474-14

32 

KWolf, THether, PGilchuk, AKumar, ARajeh, CSchiebout, et al Identifying and Tracking Low-Frequency Virus-Specific TCR Clonotypes Using High-Throughput Sequencing. Cell Reports. 2018;25(9):23692378.e4. 10.1016/j.celrep.2018.11.009

33 

QQi, MMCavanagh, SLe Saux, HNamKoong, CKim, ETurgano, et al Diversification of the antigen-specific T cell receptor repertoire after varicella zoster vaccination. Science Translational Medicine. 2016;8(332):332ra46332ra46. 10.1126/scitranslmed.aaf1725

34 

ALSycheva, MVPogorelyy, EAKomech, AAMinervina, IVZvyagin, DBStaroverov, et al Quantitative profiling reveals minor changes of T cell receptor repertoire in response to subunit inactivated influenza vaccine. Vaccine. 2018;36(12):15991605. 10.1016/j.vaccine.2018.02.027

35 

NDChu, HSBi, ROEmerson, AMSherwood, MEBirnbaum, HSRobins, et al Longitudinal Immunosequencing in Healthy People Reveals Persistent T Cell Receptors Rich in Highly Public Receptors. BMC Immunology. 2019;20(1):19 10.1186/s12865-019-0300-5

36 

JAMBorghans, RJDe Boer. Quantification of T-cell dynamics: From telomeres to DNA labeling. Immunological Reviews. 2007;216(1):3547. 10.1111/j.1600-065X.2007.00497.x

37 

VThomas-Vaslin, HKAltes, RJde Boer, DKlatzmann. Comprehensive Assessment and Mathematical Modeling of T Cell Population Dynamics and Homeostasis. The Journal of Immunology. 2008;180(4):22402250. 10.4049/jimmunol.180.4.2240

38 

JDesponds, TMora, AMWalczak. Fluctuating Fitness Shapes the Clone-Size Distribution of Immune Repertoires. Proceedings of the National Academy of Sciences. 2016;113(2):274279. 10.1073/pnas.1512977112

39 

PCde Greef, TOakes, BGerritsen, MIsmail, JMHeather, RHermsen, et al The naive t-cell receptor repertoire has an extremely broad distribution of clone sizes. eLife. 2020;9:124. 10.7554/eLife.49900

40 

RJDe Boer, ASPerelson. Quantifying T Lymphocyte Turnover. Journal of theoretical biology. 2013;327:4587. 10.1016/j.jtbi.2012.12.025

41 

SBuhler, FBettens, CDantin, SFerrari-Lacraz, MAnsari, ACMamez, et al Genetic T-cell receptor diversity at 1 year following allogeneic hematopoietic stem cell transplantation. Leukemia. 2020;34(5):14221432. 10.1038/s41375-019-0654-y

42 

MVPogorelyy, et al (2018) Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins. Proceedings of the National Academy of Sciences 115:1270412709. 10.1073/pnas.1809642115

43 

BLi, TLi, BWang, RDou, JZhang, JSLiu, et al Ultrasensitive detection of TCR hypervariable-region sequences in solid-tissue RNA–seq data. Nature Genetics. 2017;49(4):482483. 10.1038/ng.3820

44 

DABolotin, SPoslavsky, ANDavydov, FEFrenkel, LFanchi, OIZolotareva, et al Antigen receptor repertoire profiling from RNA-seq data. Nature Biotechnology. 2017;35(10):908911. 10.1038/nbt.3979

45 

MShugay, OVBritanova, EMMerzlyak, MATurchaninova, IZMamedov, TRTuganbaev, et al Towards Error-Free Profiling of Immune Repertoires. Nature Methods. 2014;11(6):653655. 10.1038/nmeth.2960

46 

DABolotin, SPoslavsky, IMitrophanov, MShugay, IZMamedov, EVPutintseva, et al MiXCR: Software for Comprehensive Adaptive Immunity Profiling. Nature Methods. 2015;12(5):380381. 10.1038/nmeth.3364

47 

ZSethna, YElhanati, CCallan, AMWalczak, TMora. OLGA: Fast Computation of Generation Probabilities of B- and T-Cell Receptor Amino Acid Sequences and Motifs. Bioinformatics. 2019;35(17):29742981. 10.1093/bioinformatics/btz035

48 

MPTouzel, AMWalczak, TMora. Inferring the immune response from repertoire sequencing. PLoS Computational Biology. 2020;16(4):121.

49 

JAWeinstein, NJiang, RAWhite, DSFisher, SRQuake. High-throughput sequencing of the zebrafish antibody repertoire. Science. 2009;324(5928):807810. 10.1126/science.1170020

50 

TOakes, JMHeather, KBest, RByng-Maddick, CHusovsky, MIsmail, et al Quantitative characterization of the T cell receptor repertoire of naïve and memory subsets using an integrated experimental and computational pipeline which is robust, economical, and versatile. Frontiers in Immunology. 2017;8(OCT):117.