The authors have declared that no competing interests exist.
Imaging Mass Cytometry (IMC) combines laser ablation and mass spectrometry to quantitate metal-conjugated primary antibodies incubated in intact tumor tissue slides. This strategy allows spatially-resolved multiplexing of dozens of simultaneous protein targets with 1μm resolution. Each slide is a spatial assay consisting of high-dimensional multivariate observations (m-dimensional feature space) collected at different spatial positions and capturing data from a single biological sample or even representative spots from multiple samples when using tissue microarrays. Often, each of these spatial assays could be characterized by several regions of interest (ROIs). To extract meaningful information from the multi-dimensional observations recorded at different ROIs across different assays, we propose to analyze such datasets using a two-step graph-based approach. We first construct for each ROI a graph representing the interactions between the m covariates and compute an m dimensional vector characterizing the steady state distribution among features. We then use all these m-dimensional vectors to construct a graph between the ROIs from all assays. This second graph is subjected to a nonlinear dimension reduction analysis, retrieving the intrinsic geometric representation of the ROIs. Such a representation provides the foundation for efficient and accurate organization of the different ROIs that correlates with their phenotypes. Theoretically, we show that when the ROIs have a particular bi-modal distribution, the new representation gives rise to a better distinction between the two modalities compared to the maximum a posteriori (MAP) estimator. We applied our method to predict the sensitivity to PD-1 axis blockers treatment of lung cancer subjects based on IMC data, achieving 97.3% average accuracy on two IMC datasets. This serves as empirical evidence that the graph of graphs approach enables us to integrate multiple ROIs and the intra-relationships between the features at each ROI, giving rise to an informative representation that is strongly associated with the phenotypic state of the entire image.
We propose a two-step graph-based analyses for high-dimensional multiplexed datasets characterizing ROIs and their inter-relationships. The first step consists of extracting the steady state distribution of the random walk on the graph, which captures the mutual relations between the covariates of each ROI. The second step employs a nonlinear dimensionality reduction on the steady state distributions to construct a map that unravels the intrinsic geometric structure of the ROIs. We theoretically show that when the ROIs have a two-class structure, our method accentuates the distinction between the classes. Particularly, in a setting with Gaussian distribution it outperforms the MAP estimator, implying that the mutual relations between the covariates within the ROIs and spatial coordinates are well captured by the steady state distributions. We apply our method to imaging mass cytometry (IMC). Our analysis provides a representation that facilitates prediction of the sensitivity to PD-1 axis blockers treatment of lung cancer subjects. Particularly, our approach achieves state of the art results with average accuracy of 97.3% on two IMC datasets.
This is a PLOS Computational Biology Methods paper.
Consider multi-feature observations collected at different spatial positions. Data structure of this type requires analysts to address two immediate natural questions. First is how to characterize the associations between the different features in each position. Second is how to organize the observations from different spatial positions into an informative representation.
We approach these two questions from the standpoint of manifold learning, which is a class of nonlinear dimensionality reduction techniques for high-dimensional data [1–4]. The common assumption in manifold learning is that the multi-feature observations lie on a hidden lower-dimensional manifold. Such an assumption facilitates the incorporation of geometric concepts such as metrics, geodesic distances, and embedding, into useful data analysis techniques. In order to learn a (continuous) manifold from discrete data samples, commonly-used manifold learning methods rely on the construction of a graph. Typically, the data samples form the graph nodes and the edges of the graph are determined according to some similarity notion that is usually application-specific.
In our work, we adhere to manifold learning techniques and propose a method consisting of two-step graph analysis. At the first stage, we build a graph for each spatial position, where the graph nodes are the multi-feature observations. The motivation to build such a graph rather than using the observations directly stems from an assumption that the information about the sample at each spatial position is better expressed by the mutual-relations between the features. Then, we define a random walk on this graph and build a characteristic vector of the respective spatial position by computing the steady state distribution (SSD) of the random walk. For analysis purposes, we define a new notion of heterogeneity, representing a statistical diversity of the multiple features, and show that the SSD characterizes each spatial position in terms of this heterogeneity. In addition, using this notion of heterogeneity, when the density of the observations at the spatial positions is bi-modal, we show that these SSDs can lead to an accurate identification of the two modes, outperforming the maximum a-posteriori (MAP) estimator [5] in a statistical setting with Gaussian distributions.
At the second stage, we build a graph whose nodes are the new characteristic vectors (SSDs) of all the spatial positions. We apply diffusion maps [4] to this second graph and obtain a low dimensional representation of the spatial positions. The dimension of the computed representation is determined by a nonlinear variant of the Jackstraw algorithm [6].
Broadly, the proposed algorithm could be viewed as building a graph of graphs. From a manifold learning standpoint, this two-step procedure could be viewed as inferring a manifold of manifolds. Namely, at the first stage, we recover the local manifolds that underlie the multiple features at each spatial location, and then, at the second stage, we recover the global manifold between the spatial positions, formed by the collection of all local manifolds. This standpoint is related to a large body of recent work involving the discovery and analysis of multi-manifold structures, e.g., alternating diffusion [7–10], multi-view diffusion maps [11], joint Laplacian diagonalization [12], to name just a few. Therefore, the proposed method can be viewed as a follow up work along this line of research.
We apply our method to imaging mass cytometry (IMC) [13–15]. IMC is a new technique for multiplexed simultaneous imaging of proteins and protein modifications at subcellular resolution, ideally suited to uncover molecular and structural alterations of diseased tissues such as in cancer. IMC analysis can also be used to study the composition of non-diseases tissue samples such as histology studies or molecular profiles. The acquired intensities of the protein expression levels are viewed as markers, providing important biological information on the tissues of interest. This acquisition procedure gives rise to multi-feature observations at different spatial positions, where the multiple features are the markers and a selected subset of the spatial positions are ROIs within pathology slides.
Our experimental study focuses on one of the important tasks of IMC data analysis: associating the response status of a patient to a therapeutic intervention with a high-dimensional spatial IMC sample from the relevant patients’ tissues. Here, we propose to recast this problem as a binary hypothesis testing problem. We assume that all the ROIs of each patient can be labeled by the patient’s response or non-response status. The collection of all ROIs from the patients’ cohort induces a bi-modal density of expression signatures. Then, given the protein expression levels within the ROIs of a certain tissue type, we ask whether the subject was responsive to treatment. We showcase the performance of the proposed method on two IMC cohorts consisting of samples taken from lung cancer subjects. We achieve a average 97.3% prediction accuracy of response to treatment (PD-1 axis blockers) in an unsupervised manner. This result outperforms competing methods, specifically, the results obtained by (i) diffusion maps (DM) [4] directly applied to the multi-feature observations, (ii) the heat kernel signature (HKS) [16], and (iii) the wave kernel signature (WKS) [17].
The study was approved by the Yale University Human Investigation Committee protocols #9505008219 and #1608018220; or local institutional protocols which approved the patient consent forms or, in some cases waiver of consent when retrospectively collected archive tissue was used in a de-identified manner.
We start by presenting the problem setting. Consider n data points from a hidden manifold
In the context of IMC, the data points represent the treatment outcome based on n spatial positions located at ROIs within pathology slides of tissues from several patients. At each spatial position i = 1, …, n, the observations fj(xi) for j = 1, …, m are the expression levels of m markers. Each observation
To simplify the presentation of our approach, we begin with an illustrative localization problem, which is simpler than the IMC problem. Suppose we have a surface
Our approach consists of two stages. At the first stage, we construct a graph for each data point xi in order to capture associations between its m multi-feature observations
At the next step, we define a second graph based on the SSDs, characterizing the points
We remark that the IMC problem and the localization problem share many aspects. For example, in both problems, the multi-feature observations are noisy and the mutual-relationship between them carry important information. Yet, there is a particular aspect that makes the IMC problem more challenging; while the points on the hidden manifold in a localization problem are homogeneous because they all represent location coordinates, the points in the IMC problem could be significantly different due to the large variability in the tissue structure. Importantly, the proposed method accommodates the joint processing of such different points through their representation by the SSD.
The first step of the proposed method is to construct an undirected weighted graph


Next, we define a random walk on the graph

The transition probability matrix Pi is self-adjoint and compact, and therefore, the spectral decomposition of



Interestingly, in the special case where t = 1, the probability of the node fk(xi) to stay in place is given by

Note that



We will use πi as a new characteristic vector, or a signature, of xi, and consequently, the induced pairwise distances ||πi − πi′||, where i, i′ ∈ 1, …, n, will be used as the desired distances between the respective graphs


Since λj is in descending order and in [0, 1], the weight they assign to the eigenvectors in (12) becomes smaller as t increases. As a result, the DKS can be viewed as a low-pass filter, which controls the spectral bandwidth. In addition, the DKS can be recast in terms of the diffusion distance, a notion of distance induced by diffusion maps [4] that was shown useful in a broad range of applications, e.g., [20–22]. For more details see Diffusion maps. Specifically, when t = 1, we can show that

We note that DKS has already appeared in previous work in the context of spectral distances in [23, 24], where it was shown that it describes the underlying geometry of
Note that ϵ is a scale parameter of the Gaussian kernel, where it can be used to infer locality. If ϵ is set to a small value, then πi captures local properties. Conversely, if ϵ is large, then πi represents the global structure. As a result, a multiscale signature can be formed, consisting of multiple SSDs πi computed with different values of ϵ.
The final stage of our method is building a low-dimensional representation of all the data points
That is, the global graph weights matrix W(2) is defined by

Second, we construct a random walk, denoting its transition probability matrix by P(2). Third, we apply the eigendecomposition to P(2). Fourth, we set the dimension of the new representation according to the variant of the Jackstraw method [6], as shown in Determining the dimension of data.
The entire method is summarized in Box 1 and a block diagram is illustrated in Fig 1.
Input: A set of multi-feature observations {fj(xi)} for i = 1, …, n and j = 1, …, m.
Output:
l-dimensional representation
For each xi:
Construct a global graph G(2) with vertex set
Build a random walk on G(2) with transition probability matrix P(2).
Apply eigenvalue decomposition to P(2) and obtain the eigenvectors
Determine the number of dimensions l as described in Determining the dimension of data.
Build the mapping:


Illustrative diagram of the proposed method.
(a) For each data point xi, we build a local graph
We propose a statistical model that allows for a tractable analysis, showing the advantages of the SSD signature. Consider a data point

By collecting the m observations




Definition 1 (empirical mean)
Given a set Γ and some real function on the set

Definition 2 (heterogeneity) Define the heterogeneity of a data point xi by


The heterogeneity
Definition 3 (weighted heterogeneity) Define gi as a weighted heterogeneity, whose jth element is given by

Under the considered statistical model, with the above definitions, the SSD πi can be written explicitly.
Proposition 1 The j-th element of πi can be approximated by

The derivation is based on the Taylor expansion of the Gaussian function in Eq (1), where ϵ is the scale of the function. The proof appears in S1 Appendix.
In order to give some intuition, we consider the following special cases, where the SSD assumes a simpler form.
Suppose that the random vectors of the observation functions are independent and identically distributed, i.e.,


Note that a small value is assigned to πi(k) if the weighted heterogeneity gi(k) is large. In contrast, a large value is assigned to πi(k) if gi(k) is small. As a consequence, πi(k) carries the heterogeneity information of the observations.
When the kernel scale ϵ → ∞, the information about the heterogeneity of each observation is lost, since the same weights are assigned to all the edges. As a consequence, πi becomes just a constant vector, given by

Suppose that
A naïve approach for binary hypothesis testing would be to directly compare the densities of the two hypotheses for each observation separately. Particularly, based on the realizations from only one observation function fj, the average probability of error attained in a Bayesian setting with the MAP estimator [5] is given by


According to [25], consider

We seek another more discriminative approach for binary hypothesis testing. For this purpose, we propose a method based on the SSDs. Since the obtained SSDs represent probability distributions, the average probability of error is given by

According to Proposition 1, the total variation between two SSDs associated with data points from two hypotheses can be explicitly expressed as the l1 distance given by

The total variation of the measurements in Eq (29) and the total variation between the SSDs in Eq (32) can be used to distinguish between the two hypotheses. In the following, we specify the conditions, under which the total variation based on the SSDs in Eq (31) is larger, and hence, leading to smaller error compared to the standard MAP estimator using a single observation as specified in Eq (28).
Proposition 2
Suppose

This means that not only the standard MAP estimator but also any estimator based directly on single channel observations cannot distinguish between the two hypotheses. Conversely, from Eq (32), the SSDs may carry a distinction capability, that is,

Proposition 2 demonstrates that there are cases where a single observation cannot be used for distinguishing between the two hypotheses. However, in such cases, the SSDs may enable us to distinguish the hypotheses due to possible differences in either the heterogeneity or the covariances. Proposition 2 is further demonstrated in the context of the localization toy problem in Simulation 2.
To further expand the analysis, we make the following assumptions.
Assumption 1 The empirical mean of the weighted heterogeneity is approximately the same under the two hypotheses:

Assumption 2 The empirical mean of the covariance matrices is approximately the same under the two hypotheses:

Note that if Assumptions (A.1) and (A.2) hold, implying that

Proposition 3
Suppose that Assumptions
(A.1)
and
(A.2)
hold. Suppose

In addition, suppose that (i) the covariance between the jth observation and the other observations under the two hypotheses is approximately equal

It follows that

This proposition implies that when the assumptions hold, the probability of error based on SSD, which indirectly takes into account the mutual-relations between all observations, facilitates a better distinction of the two hypotheses compared to the standard MAP estimator computed from the best sensor. This property is further demonstrated in the localization toy problem in Simulation 3.
Proposition 4
Suppose the conditions of Proposition 3 hold. In addition, suppose that the random vectors of the observations are independent and identically distributed, i.e.,

This proposition shows that the SSDs enable us a better distinction between the two hypotheses compared to the MAP estimator based on the distributions of the multi-feature observations. In other words, the heterogeneity comprising the SSD has a significant contribution to the ability to recover the information about the latent data points
IMC is a relatively new imaging method, which enables to examine tumors and tissues at subcellular resolutions, giving rise to images consisting of the intensities of multiple proteins [13–15]. This acquisition platform, combined with computational methods, has recently been the subject of many studies. Various image processing and analysis techniques for IMC datasets can be found in [26], where it is shown that single-cell segmentation can be accomplished successfully with supervised classifiers, leading to the characterization of cell co-occurrence and cell composition of different types of tissues and samples. In [27], an IMC dataset with 37 markers is used for cell segmentation and cell clustering based on random 125 × 125μm2 patches collected from breast cancer patients. This dataset is jointly analyzed with multi-platform genomics data, where it is shown that classifiers can be iteratively trained in a supervised manner to learn from the IMC pixels the corresponding cell expression levels. In [14], spanning-tree progression analysis combined with samples’ type provided by pathologists is used for cell population and cell transition identification. In contrast to these methods, our approach focuses on extracting the mutual-relationships between markers at likely tumor cells regions at large, circumventing cell segmentation.
In this work, our goal is to identify the sensitivity of lung cancer subjects to treatment with PD-1 axis blockers, given their IMC multiplexed observations. More specifically, we aim at a binary prediction task: identifying whether the subjects responded or did not respond to the treatment. We analyze two IMC datasets consisting of baseline/treatment tumor samples from non-small cell lung cancer subjects profiled with 29 markers, representing phenotype and functional properties of both tumor and immune cells. The markers are denoted by LipoR, VIM, T-BET, CD47, Cytokeratin, CD45RO, PD-L1, GAPDH, B7-H3, LAG-3, TIM-3, FOXP3, CD4, B7-H4, CD68, PD1, CD20, CD8, CD25, VISTA, KI-67, B2M, CD3, IDO-1, PD-L2, GZB, Histone 3, DNA1 and DNA2. The resolution of the IMC images is 1 μm2 per pixel. The first dataset, denoted Dataset 1, consists of 55 subjects (samples), and the second dataset, denoted Dataset 2, consists of 29 subjects (samples). These subjects received treatment with PD-1 axis blockers. Based on the clinical sensitivity to the treatment, the subjects are categorized as: durable clinical benefit (denoted with the label responders) and no durable benefit (denoted with the label non-responders).
Prior to our analysis, a standard pre-processing using z-score normalization is applied to each marker. We remark that the mean and the standard deviation are computed based only on pixels in which the marker has non-zero values. In addition, we apply a 3 × 3 median filter to every image in Dataset 2. We note that the median filter is not applied to Dataset 1, because the typical expression levels in this dataset are sparse, and in this situation, application of a filter may destroy the signal. This is demonstrated in S1 Fig, where we apply the median filter to the expression levels of few important markers and observe a degenerate result. The markers exhibiting such sparse expression levels in Dataset 1 are VIM, T-BET, CD45RO, PD-L1, GAPDH, B7-H3, LAG-3, FOXP3, B7-H4, PD1, CD20, CD8, CD25, VISTA, KI-67, CD3, PD-L2 and GZB. Conversely, in Dataset 2, the typical expression levels are dense, and as a result, the median filter enhances the content of the images.
Our analysis does not consider the entire image, but rather focuses on ROIs located at the highest Cytokeratin expression levels, as Cytokeratin is expressed only in the tumor cells. As noted above, this circumvents cell segmentation that is typically required in other analyses techniques. At each ROI, we consider a “stack” of m image patches from all the m markers, where each patch consists of d = b × b pixels. We select N ROIs per sample by searching the patches with the maximal mean value of Cytokeratin. This is implemented by a 2D convolution applied to the Cytokeratin image with a constant kernel of size b × b. We deliberately avoid patch overlap by assigning zero values to the pixels of each ROI, once it is selected.
Using the present work notation, the intrinsic representation of the information embodied at each ROI is denoted by xi, which is assumed to be a data point from some hidden manifold


An illustration of the IMC multi-feature observations from a single subject.
In the IMC data of a single subject, we focus first on the Cytokeratin marker, which is used as an indicator of tumor cells. We select N ROIs located at the highest Cytokeratin expression levels. These ROIs are patches of size b × b
μm2, where here b set to 35. We assume that these ROIs have intrinsic hidden states represented by xi. The expression of all 29 markers at these ROIs are viewed as the multi-feature observations
The identification of the subject’s response to treatment from the IMC data is based on the application of the proposed method described in Box 1. This algorithm fits well the problem at hand due to the following main reasons. First, directly comparing observations from the different markers, fj(xi), is inapplicable since each ROI comprises different cells and different tissue structures. Our method circumvents this problem by computing an intrinsic signature of the observations. Second, due to the different dynamics of the nominal values of the observations from the different markers, a naïve concatenation of the multi-feature observations is inadequate (in contrast to the localization example).
Before applying the proposed method, we test empirically that Assumptions (A.1) and (A.2) hold. Note that for this test only, the true response status is used. To test Assumption (A.1), we compute


We analyze the two datasets separately, since the datasets were collected around a year apart, and internal acquisition system parameters were modified during that time. We apply the proposed method presented in Box 1 to the observations,
We compare the SSD-based representation obtained by the propose method to other representations obtained by three competing algorithms. The first is a direct application of diffusion maps to the sets of multi-feature observations
Fig 3 presents the predictions of treatment response by the SSD, HKS, WKS based algorithms as well as DM. We show the 3D t-SNE visualization [28] of the patch representation obtained as an output of the different algorithms and the confusion matrix of the prediction obtained by an RBF SVM classifier applied to the patch representation. To complement the results, in Table 1, we present the area under the ROC curve (AUC) of the treatment response predictions. We note that the AUC obtained for Dataset 2 without the median filter pre-processing is 0.931. For each algorithm, the presented prediction results are based on the patch size and number of patches configuration that yielded the best empirical performance using cross-validation as described above. The best configuration (patch size b × b and number of patches per subject N) of each algorithm is presented in Fig 3 on the left.

| Proposed Method | DM | HKS | WKS | |
|---|---|---|---|---|
| Dataset 1 | 0.9455 | 0.8364 | 0.7455 | 0.7818 |
| Dataset 2 | 1 | 0.8276 | 0.9655 | 0.9310 |


Treatment response predictions.
(A) Prediction for Dataset 1. (B) Prediction for Dataset 2. In each panel, at each row, we plot the best algorithm configuration, the 3D t-SNE visualization of the patches representation colored by the response status, and the confusion matrix of the response prediction obtained by a leave-one-subject-out cross-validation using an RBF SVM classifier. The rows present the results of the different methods.
In the t-SNE plots, we observe that the unsupervised separation of the patches from responders and non-responders is most pronounced in the low-dimensional representation obtained by the proposed method. This distinct visual separation, which was obtained by the proposed algorithm without access to any response outcome information. The color in the figures indicates the response status. Note that the embedding is obtained by the unsupervised algorithm and the color labels are overlaid to demonstrate the degree of separability between patches from responders and non-responders, which implies that this patch representation is informative and useful for the subsequent response prediction. This result, which is obtained in an unsupervised manner, distinguishes the current work from the computational methods for IMC data described above that rely on supervised analysis. Indeed, we observe that the prediction accuracy obtained based on the proposed method is superior compared to the other three competing methods. We note that Dataset 1 consists of 26 responders and 29 non-responders, so that the chance level of accurate prediction of a subject with durable clinical benefit is 47.27%, and Dataset 2 consists of 11 responders and 18 non-responders, thus, the chance level of accurate prediction is 37.93%.
The derivations in Theoretical analysis imply that the capability to distinguish between the two hypotheses (sensitivity or insensitivity to treatment) highly depends on the mutual relationships between the markers, which we explicitly define and term heterogeneity (Definition 2). Since our empirical study demonstrates that our approach facilitates a distinct separation between ROIs of subjects according to their treatment response, by the theoretical analysis, we conclude that the heterogeneity between the markers is where the information about the sensitivity to treatment lies.
We examine the sensitivity of the tested algorithms to the choice of the hyperparameters: the number of ROIs per subject N and the size of the patch b × b. In S2 Fig, we plot heatmaps of the treatment prediction accuracy obtained based on different choices of hyperparameters. We observe that within a relatively wide range of parameter values, the performance of the proposed method in Box 1 is high and insensitive to the particular choice of parameters. In addition, we note that the range of patch size where high performance is attained is centered at size 45 × 45 μm2. As the size of tumor cells vary within a range of 10–30 μm2 (a mean of 20 μm2), and the size of a lymphocyte ranges from 8–12 μm2 (a mean of 10 μm2), it implies that high performance is attained when patches are likely including more than one cell and cell type. Conversely, we observe that the competing methods do not show the same degree of robustness to the choice of hyperparameters, and good performance is attained only for very particular (isolated) parameter values.
To further exploit the robustness of the proposed method, we implemented an ensemble of the classifiers based on different values of hyperparameters. The implementation of the combination is based on [29]. The performance of the combined classifiers is presented in S3 Fig. We observe that the prediction accuracy of this ensemble is comparable to the prediction accuracy obtained based on the classifier with the best parameter configuration, demonstrating that the particular choice of hyperparameters can be circumvented.
Our premise is that the resulting high prediction accuracy is attributed mainly to the unsupervised informative representation obtained by our approach, rather than the classifier type. To support this claim, we repeat the analysis by replacing the RBF SVM classifier with Random Forest [5]. The results are shown in S4 Fig and demonstrate comparable performance and similar trends.
To show that Stage 2 of the proposed method in Box 1 is essential, we plot in S5 Fig heatmaps of the multi-feature observations and the SSD features resulting from Stage 1 of the algorithm. The heatmaps of the multi-feature observations demonstrate that there is no obvious difference between responders and non-responders, implying that the response status prediction is a non-trivial task. In addition, inspection of the heatmaps with the SSD features does not reveal apparent distinction between responders and non-responders. Recalling that after Stage 2, as presented in the t-SNE plots in Fig 3, this distinction becomes evident, demonstrating the contribution of Stage 2.
We presented a two-step graph analysis approach. The first step is applied to the multi-feature observations of the data points, where their mutual-relationships are extracted. This step is implemented by computing the SSD of a random walk defined on the graph whose nodes are the observations. The resulting SSD can be viewed as a signature or a characteristic vector of the data point and is analogous to traditional signatures from other domains, such as the heat kernel signature (HKS) [16] or the wave kernel signature (WKS) [17] in the field of shape analysis, and PageRank [30] in web page ranking (see Related work). The second step is applied to the signatures obtained at the first step, for the purpose of constructing an intrinsic low-dimensional representation of all the data points.
Previous attempts to analyze such mulitplexed datasets involve various approaches, including direct comparisons of the marker expressions, the cell morphology, and interactions in cell neighborhoods, to name but a few [26]. Our method introduces a new approach, building a new representation of the multiplexed data in two steps. Since each of the steps involves a construction of a graph, the entire procedure can be viewed as building a graph of graphs.
While the algorithm is described in a general setting of multiplexed data fusion, our theoretical analysis is focused on binary hypothesis testing. In comparison with a traditional statistical estimation approach, we show that our method exhibits advantages, implying that the mutual-relationships between the multi-feature observations are well captured. In the context of IMC, this could minimize the effect of deviation in individual marker scores and cell/tissue heterogenenity.
We apply the proposed method to two IMC datasets and show that solely from the imaging data, we can distinguish between two different sensitivity levels to treatment. Since our approach does not rely on rigid prior knowledge or access to labels, it has the potential of identifying biological relevance of novel parameters or marker patterns in treatment responses by analyzing dominant factors contributing more to the model stratification. Importantly, we remark that in contrast to common practice, the proposed approach does not require cell-segmentation as a precursor.
In addition to the demonstrated advantage of the proposed method over the competing methods in terms of superior prediction accuracy of treatment response, the proposed method is also more computationally efficient. In S6 Fig, we present the run time of the two stages of the proposed method in Box 1 and compare them to the run time of the three competing methods. We note that the HKS and WKS replace the proposed SSD in Stage 1, but, their respective Stage 2 are similar. In Stage 1, we observe that the proposed method in Box 1 is faster than the competing methods. The main difference between the algorithms in Stage 1 is due to the fact that the SSD is proportional to the degree of the local graph, and as a result, it can be computed efficiently from the degree vector. Conversely, both HKS and WKS require eigenvalue decomposition, which is computationally more demanding. In addition, the direct application of DM just involves Stage 2 of the proposed algorithm. Seemingly, the run time of the algorithms in Stage 2 should have been the same, yet, we observe that DM is slower. This difference is attributed to the significantly different size of the feature space. In DM, the multi-feature observations are simply concatenated, giving rise to feature space (input of Stage 2) of size (b × b × m) × (P × N), where b × b is the size of patch, m is the number of biomarkers, and P is the number of subjects. Conversely, in Stage 2 of Algorithm 1, HKS, and WKS, the feature space (input of Stage 2) is only of size m × (P × N).
It is conceivable that the most important hyperparameter of our method is the scale parameter. In S7 Fig, we present a toy example demonstrating that different values of the scale parameter ϵ lead to multiscale signatures capturing local and global features. We demonstrate that different scales facilitate the extraction of different features of the data. In future work, we plan to further explore the role of the scale and to devise multiscale signatures. Another possible direction for future research relies on the fact that our method is general and can be extended to other multiplexed datasets. For example, hyper spectral imaging, sensor networks, spatial multiplexed proteomics, and spatial transcriptomics assays is a representative subset of distinct technologies from diverse domains of science and engineering that share common data structures. The data in all these modalities consist of high-dimensional multivariate observations (m-dimensional feature space) collected at different spatial positions, and therefore, can be analyzed using similar computational methodologies. Furthermore, in many studies practitioners collect datasets consisting of multiple spatial assays of this type, each capturing such data from a single biological sample, patient, or hyper spectral image, etc. Each of these spatial assays could be characterized by several regions of interest (ROIs), giving rise to a setting similar to the IMC problem considered here. Specifically, we plan to examine applications of the proposed graph of graphs analysis to spatial transcriptomics such as Slide-seq [31], High-Density Spatial Transcriptomics [32, 33], MIBI-TOF [34], and DBiT-seq [35].
Manifold learning is a class of nonlinear techniques that embeds high dimensional data points into a low dimensional space, relying on the assumption that the high-dimensional data lie on a low dimensional manifold
Consider a set of data points

Next, a random walk P on the data points is constructed by normalizing the weight matrix W

Since P is similar to a symmetric and positive-define matrix, P has a biorthogonal right- and left-eigenvectors

The diffusion distance

The diffusion maps is defined by [4]

We remark that in many cases, due to the typical fast decay of the eigenvalues of Pt, l can be set to be smaller than d, thereby achieving dimension reduction. In addition, φ1 is a constant vector and therefore is not used in the mapping.
It can be shown that the diffusion distance can be approximated by the eigenvalues and eigenvectors by [4]

A common problem in diffusion maps setting is how to choose the dimension l. The authors in [6] proposed Jackstraw to identify the number of principal components (PCs) in the context of principal component analysis (PCA) [37]. We present here a variant of Jackstraw, adapting it to diffusion maps.
Given a random walk P constructed from a set of b data points, the associated eigenvalues are 1 = λ1 ≥ λ2 ≥ … ≥ λb with corresponding right-eigenvectors

Note that the absolute values of the eigenvalues of
There are several shape analysis signatures obtained by spectral methods with different geometric properties such as isometry and deformation invariance [16, 17, 38, 39], related to the proposed method. One of the notable shape signatures is based on the heat diffusion on a shape, called Heat Kernel Signature (HKS) [16]. Broadly, the HKS is obtained by the eigenvalue decomposition of the heat kernel defined on the shape. In the context of our problem, since the heat kernel and the Laplace-Beltrami operator Δi share the same eigenbasis, and since the discrete graph Laplacian converges (point-wise) to the Laplace-Beltrami [40]



Similarly to the DKS in Eq (12), the HKS can also be viewed as a low-pass filter. Observe that, for small t, the HKS approximates the DKS by Taylor expansion; for other t values, the weights assigned by the HKS decay faster than the weights of DKS, and therefore, the DKS gives more attention to finer structures.
Another related shape signature is built by the wave function to the Schrödinger equation describing the quantum mechanical particles, called Wave Kernel Signature (WKS) [17]. Similarly to the HKS, the WKS is given by



In the context of web page ranking, there are several traditional algorithms based on the spectral analysis of directed graphs. Among them, the celebrated PageRank score is based on the stationary distribution of a random walk representing the popularity of linked web pages [30]. Hyper induced topic search (HITS) is another related algorithm, identifying the influential nodes using a random walk on a graph. There, the graph nodes are the web pages, which are divided into two groups: authorities and hubs [41]. Both algorithms address the problem of web page ranking, where PageRank depends on the incoming links whereas HITS focuses on the outgoing links.
To illustrate the challenge in the problem setting and the generality of the proposed solution, we present three simulations of different localization problems. The Matlab code is available in S1 Matlab Code.
Consider 800 objects on a 2-sphere in

In Fig 4A, we present our setting consisting of objects and sensors located on and near a sphere, respectively. The objects positions are marked by dots, the different regions are marked by different colors (red, black, blue and yellow), and the sensors positions are marked by green stars. At the bottom, we present the (high-dimensional) sensor observations. Each block consists of 100 × 200 scalar observations corresponding to the observations of a single sensor from each region, where d = 100 is the dimension of each observation and 200 is the number of the positions per region. Visually, it is evident that distinguishing between the different regions merely based on these observations is non trivial.


Illustration of localization toy problems.
(A) The sensors and objects locations on a sphere are marked by green stars and dots, respectively. The multi-sensor observations correspond to Simulation 1 (left), Simulation 2 (middle), and Simulation 3 (right). (B) Results of the application of our approach to Simulation 1: the SSDs obtained by the proposed method (left), the diffusion maps embedding (middle), and the localization confusion matrix obtained by a 10-fold cross-validation with an RBF SVM classifier (right). (C) A comparison between the localization accuracy obtained by the proposed method based on SSDs and the localization accuracy obtained based on the output of each sensor as well as the concatenation of the output from all the sensors. The localization accuracy is the number of correctly identified positions divided by the true number of total positions in each region.
For illustration purposes, we view the problem as a classification problem, where given the high-dimensional multi-feature observations, the task is to identify in which region the object at xi resides.
The observations from each sensor consisting of 800 object positions from four regions were processed in two stages: first, a 3D embedding is constructed by applying diffusion maps to the high-dimensional observations, and second, an RBF SVM is applied to the obtained embedding in order to classify the region. To evaluate the classifiers, we perform a 10-fold cross-validation. A similar two-step procedure is applied to the concatenation of the observations from all the sensors.
Table 2 presents the resulting classification accuracy, i.e., the number of correctly classified positions divided by the total number of positions in each region. The presented results are obtained using a 10-fold cross-validation. We observe that none of the sensors enables an accurate classification. Moreover, we show in Table 2 that a naïve concatenation of the observations from all the sensors do not yield a good classification either.

| Region | Concatenated sensors | Sensor 1 | Sensor 2 | Sensor 3 | Sensor 4 | Sensor 5 |
|---|---|---|---|---|---|---|
| Blue | 43% | 46% | 18% | 44.5% | 58% | 40.5% |
| Red | 54% | 15.5% | 52.5% | 45.5% | 48.5% | 55% |
| Black | 46.5% | 49.5% | 29% | 40% | 37.5% | 42% |
| Yellow | 38.5% | 33.5% | 52% | 30.5% | 42.5% | 50.5% |
The performances obtained by the 10-fold cross-validations are presented. The percentages indicate the number of correctly classified positions divided by the true number of total positions in each class.
Seemingly, in order to mitigate the problem, we could simply represent the objects positions by a vector of the variance of the observations. However, it would require prior knowledge about the sensing model, whereas our approach is model-free.
In Fig 4B, we present the classification results obtained by the proposed method. In the diffusion maps embedding constructed from the SSDs, we observe a clear separation between the four regions. Finally, in the confusion matrix of the classification, we observe that the proposed method leads to significantly better classification results compared to the results in Table 2. This demonstrates the importance of taking into account the mutual-relationships between the sensor observations, rather than processing the nominal values of the observations directly, which in this case, give rise to correct identification of the four regions.
The main purpose of this simulation is to demonstrate Proposition 2 from Binary hypothesis testing.
Consider n = 400 positions,
Here, we have 12 sensor observations

In this simulation, a direct computation can show that Assumptions (A.1) and (A.2) hold. Specifically, we note that


In Fig 4C, we present a comparison between the classification results obtained by our approach using SSDs and the classification results obtained by using the output of each sensor as well as the concatenation of the output from all the sensors. The results are evaluated with a 10-fold cross-validation. We observe that the classification obtained by proposed algorithm is significantly better than the classification obtained using the “raw” sensor outputs.
This simulation demonstrates Proposition 3 from Binary hypothesis testing. Consider n = 400 positions,
Note that here, only two sensor observations satisfy the conditions of Special Case 1, namely,
For each data point, we compute the difference between the left-hand side and the right-hand side of the inequality in Proposition 2 per sensor. Then, in the right-top figure in Fig 4A, we circle the sensors attaining the largest difference. As expected, we observe that the circled sensors are positioned in non-symmetric orientations with respect to the locations of the two regions.
Similarly to the previous simulation, we compare the classification results obtained by our proposed method to the results attained by applying the classification directly to the observations from each sensor separately and to the concatenation of the observations from all the sensors. In addition, we also compute the results of Algorithm 1 applied to all the sensors except Sensor 1 and Sensor 7, which are the least contributing according to the inequality in Proposition 2.
The classification results presented in Fig 4C imply that by removing the least contributing sensors, namely Sensor 1 and Sensor 7, the recomputed SSDs, denoted by
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43