Graph of graphs analysis for multiplexed data with application to imaging mass cytometry

PLoS Computational Biology

Home Graph of graphs analysis for multiplexed data with application to imaging mass cytometry

Ya-Wei Eileen Lin, Tal Shnitzer, Ronen Talmon, Franz Villarroel-Espindola, Shruti Desai, Kurt Schalper, Yuval Kluger

The authors have declared that no competing interests exist.

https://doi.org/10.1371/journal.pcbi.1008741, Volume: 17, Issue: 3, Pages: 1-30

Article Type: Research Article Article History

Publisher: Public Library of Science

Altmetric

Table of Contents

Introduction
Results
Materials and methods
Supporting information

Abstract

Imaging Mass Cytometry (IMC) combines laser ablation and mass spectrometry to quantitate metal-conjugated primary antibodies incubated in intact tumor tissue slides. This strategy allows spatially-resolved multiplexing of dozens of simultaneous protein targets with 1μm resolution. Each slide is a spatial assay consisting of high-dimensional multivariate observations (m-dimensional feature space) collected at different spatial positions and capturing data from a single biological sample or even representative spots from multiple samples when using tissue microarrays. Often, each of these spatial assays could be characterized by several regions of interest (ROIs). To extract meaningful information from the multi-dimensional observations recorded at different ROIs across different assays, we propose to analyze such datasets using a two-step graph-based approach. We first construct for each ROI a graph representing the interactions between the m covariates and compute an m dimensional vector characterizing the steady state distribution among features. We then use all these m-dimensional vectors to construct a graph between the ROIs from all assays. This second graph is subjected to a nonlinear dimension reduction analysis, retrieving the intrinsic geometric representation of the ROIs. Such a representation provides the foundation for efficient and accurate organization of the different ROIs that correlates with their phenotypes. Theoretically, we show that when the ROIs have a particular bi-modal distribution, the new representation gives rise to a better distinction between the two modalities compared to the maximum a posteriori (MAP) estimator. We applied our method to predict the sensitivity to PD-1 axis blockers treatment of lung cancer subjects based on IMC data, achieving 97.3% average accuracy on two IMC datasets. This serves as empirical evidence that the graph of graphs approach enables us to integrate multiple ROIs and the intra-relationships between the features at each ROI, giving rise to an informative representation that is strongly associated with the phenotypic state of the entire image.

We propose a two-step graph-based analyses for high-dimensional multiplexed datasets characterizing ROIs and their inter-relationships. The first step consists of extracting the steady state distribution of the random walk on the graph, which captures the mutual relations between the covariates of each ROI. The second step employs a nonlinear dimensionality reduction on the steady state distributions to construct a map that unravels the intrinsic geometric structure of the ROIs. We theoretically show that when the ROIs have a two-class structure, our method accentuates the distinction between the classes. Particularly, in a setting with Gaussian distribution it outperforms the MAP estimator, implying that the mutual relations between the covariates within the ROIs and spatial coordinates are well captured by the steady state distributions. We apply our method to imaging mass cytometry (IMC). Our analysis provides a representation that facilitates prediction of the sensitivity to PD-1 axis blockers treatment of lung cancer subjects. Particularly, our approach achieves state of the art results with average accuracy of 97.3% on two IMC datasets.

Lin,Shnitzer,Talmon,Villarroel-Espindola,Desai,Schalper,Kluger,and Xu: Graph of graphs analysis for multiplexed data with application to imaging mass cytometry

This is a PLOS Computational Biology Methods paper.

Introduction

Consider multi-feature observations collected at different spatial positions. Data structure of this type requires analysts to address two immediate natural questions. First is how to characterize the associations between the different features in each position. Second is how to organize the observations from different spatial positions into an informative representation.

We approach these two questions from the standpoint of manifold learning, which is a class of nonlinear dimensionality reduction techniques for high-dimensional data [1–4]. The common assumption in manifold learning is that the multi-feature observations lie on a hidden lower-dimensional manifold. Such an assumption facilitates the incorporation of geometric concepts such as metrics, geodesic distances, and embedding, into useful data analysis techniques. In order to learn a (continuous) manifold from discrete data samples, commonly-used manifold learning methods rely on the construction of a graph. Typically, the data samples form the graph nodes and the edges of the graph are determined according to some similarity notion that is usually application-specific.

In our work, we adhere to manifold learning techniques and propose a method consisting of two-step graph analysis. At the first stage, we build a graph for each spatial position, where the graph nodes are the multi-feature observations. The motivation to build such a graph rather than using the observations directly stems from an assumption that the information about the sample at each spatial position is better expressed by the mutual-relations between the features. Then, we define a random walk on this graph and build a characteristic vector of the respective spatial position by computing the steady state distribution (SSD) of the random walk. For analysis purposes, we define a new notion of heterogeneity, representing a statistical diversity of the multiple features, and show that the SSD characterizes each spatial position in terms of this heterogeneity. In addition, using this notion of heterogeneity, when the density of the observations at the spatial positions is bi-modal, we show that these SSDs can lead to an accurate identification of the two modes, outperforming the maximum a-posteriori (MAP) estimator [5] in a statistical setting with Gaussian distributions.

At the second stage, we build a graph whose nodes are the new characteristic vectors (SSDs) of all the spatial positions. We apply diffusion maps [4] to this second graph and obtain a low dimensional representation of the spatial positions. The dimension of the computed representation is determined by a nonlinear variant of the Jackstraw algorithm [6].

Broadly, the proposed algorithm could be viewed as building a graph of graphs. From a manifold learning standpoint, this two-step procedure could be viewed as inferring a manifold of manifolds. Namely, at the first stage, we recover the local manifolds that underlie the multiple features at each spatial location, and then, at the second stage, we recover the global manifold between the spatial positions, formed by the collection of all local manifolds. This standpoint is related to a large body of recent work involving the discovery and analysis of multi-manifold structures, e.g., alternating diffusion [7–10], multi-view diffusion maps [11], joint Laplacian diagonalization [12], to name just a few. Therefore, the proposed method can be viewed as a follow up work along this line of research.

We apply our method to imaging mass cytometry (IMC) [13–15]. IMC is a new technique for multiplexed simultaneous imaging of proteins and protein modifications at subcellular resolution, ideally suited to uncover molecular and structural alterations of diseased tissues such as in cancer. IMC analysis can also be used to study the composition of non-diseases tissue samples such as histology studies or molecular profiles. The acquired intensities of the protein expression levels are viewed as markers, providing important biological information on the tissues of interest. This acquisition procedure gives rise to multi-feature observations at different spatial positions, where the multiple features are the markers and a selected subset of the spatial positions are ROIs within pathology slides.

Our experimental study focuses on one of the important tasks of IMC data analysis: associating the response status of a patient to a therapeutic intervention with a high-dimensional spatial IMC sample from the relevant patients’ tissues. Here, we propose to recast this problem as a binary hypothesis testing problem. We assume that all the ROIs of each patient can be labeled by the patient’s response or non-response status. The collection of all ROIs from the patients’ cohort induces a bi-modal density of expression signatures. Then, given the protein expression levels within the ROIs of a certain tissue type, we ask whether the subject was responsive to treatment. We showcase the performance of the proposed method on two IMC cohorts consisting of samples taken from lung cancer subjects. We achieve a average 97.3% prediction accuracy of response to treatment (PD-1 axis blockers) in an unsupervised manner. This result outperforms competing methods, specifically, the results obtained by (i) diffusion maps (DM) [4] directly applied to the multi-feature observations, (ii) the heat kernel signature (HKS) [16], and (iii) the wave kernel signature (WKS) [17].

Results

Ethics statement

The study was approved by the Yale University Human Investigation Committee protocols #9505008219 and #1608018220; or local institutional protocols which approved the patient consent forms or, in some cases waiver of consent when retrospectively collected archive tissue was used in a de-identified manner.

Overview

We start by presenting the problem setting. Consider n data points ${x_{i}}_{i = 1}^{n}$ from a hidden manifold $M$ embedded in a high-dimensional Euclidean space $R^{k}$ . Assume we do not have direct access to these data points; instead, these data points are measured through m observation functions $f_{j} : M \to R^{d}$ , where j = 1, …, m is the index of the observation function. Given n multi-feature observations f_j(x_i) of the data point x_i for i = 1, …, n, each consisting of m features j = 1, …, m, our goal is to recover the data points x_i on the hidden manifold $M$ .

In the context of IMC, the data points represent the treatment outcome based on n spatial positions located at ROIs within pathology slides of tissues from several patients. At each spatial position i = 1, …, n, the observations f_j(x_i) for j = 1, …, m are the expression levels of m markers. Each observation $f_{j} (x_{i}) \in R^{d}$ is a patch of d pixels of the expression level image of marker j at position i.

To simplify the presentation of our approach, we begin with an illustrative localization problem, which is simpler than the IMC problem. Suppose we have a surface $M$ and objects located at x_i on $M$ . The locations x_i are hidden, but measured through m sensors, such that for each location x_i we have m multi-feature observations ${f_{j} (x_{i})}_{j = 1}^{m}$ . That is f_j(x_i) is a d-dimensional observation of sensor j when the object is at x_i. The goal is to recover the locations x_i on the surface $M$ given ${f_{j} (x_{i})}_{j = 1}^{m}$ .

Our approach consists of two stages. At the first stage, we construct a graph for each data point x_i in order to capture associations between its m multi-feature observations ${f_{j} (x_{i})}_{j = 1}^{m}$ . Capturing such mutual-relationships is natural in the context of localization problems, e.g., the triangulation property [18] in which the relative locations of the sensors are exploited. In addition, these mutual-relations are typically more robust to noise in comparison with the nominal values of the multi-feature observations, f_j(x_i), themselves. Concretely, consider the m observations ${f_{j} (x_{i})}_{j = 1}^{m}$ associated with the data point x_i. Each observation f_j(x_i) forms a single node in the graph, hereby denoted as node j, giving rise to a graph with a total of m nodes. The graph we consider is the complete graph, where the weights of the edges are determined based on the Euclidean distance between the corresponding observations: the weight of the edge connecting nodes j and k is proportional to exp{−‖f_j(x_i) − f_k(x_i)‖²}. Then, we compute the SSD of a random walk defined on this graph at each location. SSD has a vector form that embodies the multi-feature inter-relationships of the data point x_i.

At the next step, we define a second graph based on the SSDs, characterizing the points ${x_{i}}_{i = 1}^{n}$ . Concretely, each data point x_i is represented by a node, and the pairwise distances between the SSDs form the adjacency matrix of the graph. Then, we apply a particular manifold learning technique, diffusion maps [4], to this graph. This application facilitates the recovery of the hidden manifold $M$ in the sense that an embedding of the points x_i is learned, such that the distances between the embedded points respect a notion of an intrinsic distance (the diffusion distance [4]) on $M$ . The application of diffusion maps to the second graph in the context of the localization problem gives rise to an embedding that serves as an accurate representation of the hidden locations of the data point. In Localization toy problem, we demonstrate the proposed method on several simulations of localization toy problems.

We remark that the IMC problem and the localization problem share many aspects. For example, in both problems, the multi-feature observations are noisy and the mutual-relationship between them carry important information. Yet, there is a particular aspect that makes the IMC problem more challenging; while the points on the hidden manifold in a localization problem are homogeneous because they all represent location coordinates, the points in the IMC problem could be significantly different due to the large variability in the tissue structure. Importantly, the proposed method accommodates the joint processing of such different points through their representation by the SSD.

Proposed method

The first step of the proposed method is to construct an undirected weighted graph $G_{i} = (V_{i}, E_{i}, W_{i})$ for each data point $x_{i} \in M \subset R^{k}$ , where the vertex set is $V_{i} = {f_{1} (x_{i}), f_{2} (x_{i}), . . ., f_{m} (x_{i})}$ , the edge set is $E_{i} \subseteq V_{i} \times V_{i}$ , and the graph weights matrix $W_{i} \in R^{m \times m}$ is given by the Gaussian kernel

where k, l ∈ {1, …, m} and ϵ > 0 is a scale parameter. Note that since W_i is symmetric, W_i is diagonalizable. That is, there is a set of real eigenvalues

{λ_{j}}_{j = 1}^{m}

with a corresponding orthonormal basis of eigenvectors

{v_{j}}_{j = 1}^{m}

such that

Next, we define a random walk on the graph $G_{i}$ . Let $P_{i} \in R^{m \times m}$ be a row stochastic matrix given by

where D_i is a diagonal matrix whose diagonal elements are given by

D_{i} (k, k) = \sum_{l = 1}^{m} W_{i} (k, l)

. The value of P_i(k, l) can be interpreted as a transition probability from a vertex f_k(x_i) to a vertex f_l(x_i) in one step of a random walk on the graph

G_{i}

The transition probability matrix P_i is self-adjoint and compact, and therefore, the spectral decomposition of $P_{i}^{t}$ for t > 0 is given by

where

{ψ_{j}, ϕ_{j}}_{j = 1}^{m}

are the right- and left-eigenvectors with the corresponding eigenvalues

{λ_{j}}_{j = 1}^{m}

. By the construction of P_i from W_i, the relations between their respective eigenvectors are given by

and

Interestingly, in the special case where t = 1, the probability of the node f_k(x_i) to stay in place is given by

Note that $ϕ_{1} \in R^{m}$ is the left eigenvector of P_i corresponding to λ₁ = 1, satisfying:

where

P_{i}^{⊤}

is the transpose of the matrix P_i. Consider an arbitrary distribution vector

π_{0} \in R^{m}

and observe the expansion:

where

〈 π_{0}, ψ_{j} 〉 = π_{0}^{⊤} ψ_{j}

is the standard Euclidean product. In the limit t → ∞,

λ_{j}^{t} \to 0

for j > 1 and therefore

where

π_{i} \in R^{m}

since λ₁ = 1, ψ₁ = 1, and

\sum_{j = 1}^{m} π_{0} (j) = 1

. Since the random walk defined by P_i is irreducible, finite, and aperiodic [], the stationary distribution π_i is a unique stationary distribution. The convergence in and the uniqueness allow us to treat the stationary distribution in this case as the steady state distribution (SSD). Note that the SSD π_i ∝ D_i1, where

1 \in R^{m}

is an all-ones vector. In other words, π_i can be viewed as a normalized degrees vector of the graph

G_{i}

We will use π_i as a new characteristic vector, or a signature, of x_i, and consequently, the induced pairwise distances ||π_i − π_i′||, where i, i′ ∈ 1, …, n, will be used as the desired distances between the respective graphs $G_{i}$ and $G_{i^{'}}$ for recovering $M$ . At first glance, using π_i may seem too simplistic. Instead, one could use the broad spectral information. Consequently, define the Diffusion Kernel Signature (DKS) by

where

Since λ_j is in descending order and in [0, 1], the weight they assign to the eigenvectors in (12) becomes smaller as t increases. As a result, the DKS can be viewed as a low-pass filter, which controls the spectral bandwidth. In addition, the DKS can be recast in terms of the diffusion distance, a notion of distance induced by diffusion maps [4] that was shown useful in a broad range of applications, e.g., [20–22]. For more details see Diffusion maps. Specifically, when t = 1, we can show that

indicating that the SSD is a special case of DKS.

We note that DKS has already appeared in previous work in the context of spectral distances in [23, 24], where it was shown that it describes the underlying geometry of $M$ . We show in the following that the seemingly simple SSD, despite the lack of broad spectral information as in the DKS, still carries substantial information.

Note that ϵ is a scale parameter of the Gaussian kernel, where it can be used to infer locality. If ϵ is set to a small value, then π_i captures local properties. Conversely, if ϵ is large, then π_i represents the global structure. As a result, a multiscale signature can be formed, consisting of multiple SSDs π_i computed with different values of ϵ.

The final stage of our method is building a low-dimensional representation of all the data points ${x_{i}}_{i = 1}^{n}$ . To this end, we apply diffusion maps to the corresponding characteristic vectors (signatures) ${π_{i}}_{i = 1}^{n}$ as follows. First, we build a global graph G⁽²⁾ whose nodes are π_i and edge weights are determined by a Gaussian kernel based on the l₁ distance between π_i.

That is, the global graph weights matrix W⁽²⁾ is defined by

where k, l ∈ [1, n], ||⋅||₁ is the l₁ distance and ϵ′ > 0 is a scale parameter. We remark that common practice is to use the l₂ distance in the Gaussian kernel. The reason we use the l₁ distance is described in Binary hypothesis testing, which indeed leads to better empirical performance reported in Imaging mass cytometry (IMC).

Second, we construct a random walk, denoting its transition probability matrix by P⁽²⁾. Third, we apply the eigendecomposition to P⁽²⁾. Fourth, we set the dimension of the new representation according to the variant of the Jackstraw method [6], as shown in Determining the dimension of data.

The entire method is summarized in Box 1 and a block diagram is illustrated in Fig 1.

Box 1. A summary of the proposed method

Input: A set of multi-feature observations {f_j(x_i)} for i = 1, …, n and j = 1, …, m.

Output: l-dimensional representation $Ψ (x_{i}) \in R^{l}$ for i = 1, …, n.

For each x_i:

Construct a local graph $G_{i}$ with vertex set

edge set

E_{i} \subseteq V_{i} \times V_{i}

, and edge weights matrix

W_{i} \in R^{m \times m}

given in .

Build a random walk on the local graph $G_{i}$ with transition probability matrix P_i defined in (3).

Compute the SSD $π_{i} \in R^{m}$ of P_i.

Construct a global graph G⁽²⁾ with vertex set ${π_{i}}_{i = 1}^{n} \in R^{m \times n}$ and the graph weights matrix P⁽²⁾ given in Eq (14).

Build a random walk on G⁽²⁾ with transition probability matrix P⁽²⁾.

Apply eigenvalue decomposition to P⁽²⁾ and obtain the eigenvectors ${φ_{k}}_{k = 1}^{n}$ .

Determine the number of dimensions l as described in Determining the dimension of data.

Build the mapping: $x_{i} \mapsto {(φ_{1} (x_{i}), φ_{2} (x_{i}), \dots, φ_{l} (x_{i}))}^{⊤} ≜ Ψ (x_{i})$ for i = 1, …, n.

Fig 1

Illustrative diagram of the proposed method.

(a) For each data point x_i, we build a local graph $G_{i}$ based on its multi-feature observations ${f_{j} (x_{i})}_{j = 1}^{m}$ . (b) We construct a random walk with transition probabilities matrix P_i on $G_{i}$ . (c) We extract the SSD signature π_i from P_i. (e) We collect the SSDs of ${x_{i}}_{i = 1}^{n}$ into an SSD representation matrix. (g) Subsequently, the matrix is subjected to a nonlinear dimensionality reduction using diffusion maps by the construction of the global graph G⁽²⁾ and the corresponding random walk with P⁽²⁾. (f) Via eigenvalue decomposition, we obtain a low-dimensional representation Ψ(x_i) for i = 1, …, n, which is used in the subsequent tasks (g).

Theoretical analysis

We propose a statistical model that allows for a tractable analysis, showing the advantages of the SSD signature. Consider a data point $x_{i} \in M$ and denote the set of the observations by $S = {f_{j} (x_{i})}_{j = 1}^{m}$ . Assume the j-th observation $f_{j} (x_{i}) \in R^{d}$ is a realization of a d-dimensional random vector $V_{j}^{i}$ following a multivariate normal distribution given by

By collecting the m observations ${f_{j} (x_{i})}_{j = 1}^{m}$ and denoting the covariance matrix between the k and l observation functions by $Σ_{k, l}^{i}$ , we obtain an md-dimensional vector that can be viewed as a realization of the random vector $V^{i} = (V_{1}^{i}, \dots, V_{m}^{i})$ with the corresponding multivariate Gaussian distribution given by

where

and

such that

and

σ_{k, l}^{i}

is the covariance between the k-th and l-th random vectors. In words, μⁱ in is a vector of md elements, consisting of the concatenation of m vectors. We index each of these vector by j = 1, …, m. All the entries of the j-th vector are equal and are set to the mean observation of the j-th sensor (marker) at the i-th point (

μ_{j}^{i}

). In , Σⁱ is a matrix of size of md × md, which is the analogous concatenation of the covariance matrices of the observations. Specifically, the diagonal blocks are the diagonal matrices

σ_{j}^{i} I_{d}

, whose diagonal elements are the variances of the j-th observation, and the off-diagonal blocks are the diagonal matrices

σ_{k, l}^{i} I_{d}

, whose diagonal elements are the covariance between sensor (marker) k and sensor (marker) l.

Definition 1 (empirical mean) Given a set Γ and some real function on the set $q \in R^{| Γ |}$ , and a subset Ω ⊂ Γ, the empirical mean of q in Ω is defined by

Definition 2 (heterogeneity) Define the heterogeneity of a data point x_i by

where 〈μⁱ〉_S is given by Definition 1, M_i is an m × m diagonal matrix, whose diagonal elements are

μ_{j}^{i}

, and

The heterogeneity $h_{i} \in R^{m}$ captures the mutual-relationships between the expected values of the observations $μ_{j}^{i}$ of a particular data point x_i; if Θ(j)ⁱ significantly deviates from μⁱ, then h_i(j) is large. Conversely, if Θⁱ(j) is close to μⁱ, then h_i(j) is close to zero.

Definition 3 (weighted heterogeneity) Define g_i as a weighted heterogeneity, whose jth element is given by

Under the considered statistical model, with the above definitions, the SSD π_i can be written explicitly.

Proposition 1 The j-th element of π_i can be approximated by

The derivation is based on the Taylor expansion of the Gaussian function in Eq (1), where ϵ is the scale of the function. The proof appears in S1 Appendix.

In order to give some intuition, we consider the following special cases, where the SSD assumes a simpler form.

Special case 1

Suppose that the random vectors of the observation functions are independent and identically distributed, i.e., $Σ_{k, l}^{i} = 0$ ∀l ∈ {1, …, m} and k ≠ l. In this case, the k-th element of π_i is

where

Note that a small value is assigned to π_i(k) if the weighted heterogeneity g_i(k) is large. In contrast, a large value is assigned to π_i(k) if g_i(k) is small. As a consequence, π_i(k) carries the heterogeneity information of the observations.

Special case 2

When the kernel scale ϵ → ∞, the information about the heterogeneity of each observation is lost, since the same weights are assigned to all the edges. As a consequence, π_i becomes just a constant vector, given by

where

1 \in R^{m}

is an all-ones vector.

Binary hypothesis testing

Suppose that ${x_{i}}_{i = 1}^{n} \in M$ are realizations of a random variable X, which follows a bimodal distribution stemming from two hypotheses: $H_{1}$ and $H_{2}$ ; $H_{1}$ has probability α and $H_{2}$ has probability (1 − α), where 0 < α < 1. Denote the set of data points from hypothesis $H_{1}$ by Ω₁ and the set of data points from hypothesis $H_{2}$ by Ω₂. Recall that for each data point x_i, f_j(x_i) are the realizations of the elements of the random vector Vⁱ. Since Vⁱ depends on the random variable X, assume that Vⁱ also follows a bimodal distribution, which is induced by the bimodal distribution of X. Particularly, consider a Gaussian setting, where Vⁱ is sampled from $N (m_{1}, Σ_{1})$ with probability α and from $N (m_{2}, Σ_{2})$ with probability of (1 − α). Similarly, assume that the random variable $V_{j}^{i}$ follows a bimodal distribution: sampled from $N (μ_{j}^{1} 1_{d}, σ_{j}^{1} I_{d})$ with probability α and from $N (μ_{j}^{2} 1_{d}, σ_{j}^{2} I_{d})$ with probability (1 − α), where the respective probability density functions are $f (v | μ_{j}^{1}, σ_{j}^{1})$ and $f (v | μ_{j}^{2}, σ_{j}^{2})$ .

A naïve approach for binary hypothesis testing would be to directly compare the densities of the two hypotheses for each observation separately. Particularly, based on the realizations from only one observation function f_j, the average probability of error attained in a Bayesian setting with the MAP estimator [5] is given by

where

T V (N (μ_{j}^{1} 1_{d}, σ_{j}^{1} I_{d}), N (μ_{j}^{2} 1_{d}, σ_{j}^{2} I_{d}))

denotes the total variation between the realizations of

V_{j}^{i} | x_{i} \in Ω_{1}

and

V_{j}^{i} | x_{i} \in Ω_{2}

, defined by

According to [25], consider $\frac{| σ_{j}^{1} - σ_{j}^{2} |}{σ (j)} \leq \frac{2}{3}$ , where $σ (j) = max {σ_{j}^{1}, σ_{j}^{2}}$ . The total variation above between two Gaussian distributions is bounded by

We seek another more discriminative approach for binary hypothesis testing. For this purpose, we propose a method based on the SSDs. Since the obtained SSDs represent probability distributions, the average probability of error is given by

According to Proposition 1, the total variation between two SSDs associated with data points from two hypotheses can be explicitly expressed as the l₁ distance given by

which consists of three main components: the variances

σ_{k}^{1}

and

σ_{k}^{2}

, the weighted heterogeneities g₁ and g₂, and the covariance Σ.

The total variation of the measurements in Eq (29) and the total variation between the SSDs in Eq (32) can be used to distinguish between the two hypotheses. In the following, we specify the conditions, under which the total variation based on the SSDs in Eq (31) is larger, and hence, leading to smaller error compared to the standard MAP estimator using a single observation as specified in Eq (28).

Proposition 2 Suppose $μ_{j}^{1} = μ_{j}^{2}$ and $σ_{j}^{1} = σ_{j}^{2}$ , which imply by definition (or by Eq (30)) that the total variation between the distributions corresponding to the two hypotheses is zero, i.e.,

This means that not only the standard MAP estimator but also any estimator based directly on single channel observations cannot distinguish between the two hypotheses. Conversely, from Eq (32), the SSDs may carry a distinction capability, that is,

Proposition 2 demonstrates that there are cases where a single observation cannot be used for distinguishing between the two hypotheses. However, in such cases, the SSDs may enable us to distinguish the hypotheses due to possible differences in either the heterogeneity or the covariances. Proposition 2 is further demonstrated in the context of the localization toy problem in Simulation 2.

To further expand the analysis, we make the following assumptions.

Assumption 1 The empirical mean of the weighted heterogeneity is approximately the same under the two hypotheses:

Assumption 2 The empirical mean of the covariance matrices is approximately the same under the two hypotheses:

Note that if Assumptions (A.1) and (A.2) hold, implying that $2 ϵ m - 2 m ({〈 g^{2} 〉}_{Ω_{1}} + {〈 Σ 〉}_{Ω_{1}}) ≃ 2 ϵ m - 2 m ({〈 g^{2} 〉}_{Ω_{2}} + {〈 Σ 〉}_{Ω_{2}})$ , then the l₁ distance between the SSDs in Eq (32) can be recast as

Proposition 3 Suppose that Assumptions (A.1) and (A.2) hold. Suppose $μ_{j}^{1} = μ_{j}^{2}$ , which implies that the upper bound of the total variation at the j-th element in Eq (30) only depends on the variance

In addition, suppose that (i) the covariance between the jth observation and the other observations under the two hypotheses is approximately equal ${〈 Σ_{k, \cdot} 〉}_{Ω_{1}} ≃ {〈 Σ_{k, \cdot} 〉}_{Ω_{2}}$ , (ii) the empirical mean of the difference of variance and weighted heterogeneity of the two hypotheses is greater than the difference of j-th variance, i.e., $〈 {(σ^{1} - σ^{2})}^{⊤} (g_{1}^{2} - g_{2}^{2}) 〉 \geq | σ_{j}^{1} - σ_{j}^{2} |$ , and (iii) the weighted heterogeneity of $H_{1}$ is sufficiently large such that ${〈 g^{2} 〉}_{Ω_{1}} \geq ϵ - {〈 Σ 〉}_{Ω_{1}} - σ (j)$ . Then, the l₁ distance between the SSDs can be recast and bounded from below by

It follows that

This proposition implies that when the assumptions hold, the probability of error based on SSD, which indirectly takes into account the mutual-relations between all observations, facilitates a better distinction of the two hypotheses compared to the standard MAP estimator computed from the best sensor. This property is further demonstrated in the localization toy problem in Simulation 3.

Proposition 4 Suppose the conditions of Proposition 3 hold. In addition, suppose that the random vectors of the observations are independent and identically distributed, i.e., $σ_{k, j}^{1} = σ_{k, j}^{2} = 0$ ∀k ≠ l, then

This proposition shows that the SSDs enable us a better distinction between the two hypotheses compared to the MAP estimator based on the distributions of the multi-feature observations. In other words, the heterogeneity comprising the SSD has a significant contribution to the ability to recover the information about the latent data points ${x_{i}}_{i = 1}^{n}$ on the underlying manifold $M$ , thereby leading to accurate binary hypothesis testing.

Imaging mass cytometry (IMC)

IMC is a relatively new imaging method, which enables to examine tumors and tissues at subcellular resolutions, giving rise to images consisting of the intensities of multiple proteins [13–15]. This acquisition platform, combined with computational methods, has recently been the subject of many studies. Various image processing and analysis techniques for IMC datasets can be found in [26], where it is shown that single-cell segmentation can be accomplished successfully with supervised classifiers, leading to the characterization of cell co-occurrence and cell composition of different types of tissues and samples. In [27], an IMC dataset with 37 markers is used for cell segmentation and cell clustering based on random 125 × 125μm² patches collected from breast cancer patients. This dataset is jointly analyzed with multi-platform genomics data, where it is shown that classifiers can be iteratively trained in a supervised manner to learn from the IMC pixels the corresponding cell expression levels. In [14], spanning-tree progression analysis combined with samples’ type provided by pathologists is used for cell population and cell transition identification. In contrast to these methods, our approach focuses on extracting the mutual-relationships between markers at likely tumor cells regions at large, circumventing cell segmentation.

In this work, our goal is to identify the sensitivity of lung cancer subjects to treatment with PD-1 axis blockers, given their IMC multiplexed observations. More specifically, we aim at a binary prediction task: identifying whether the subjects responded or did not respond to the treatment. We analyze two IMC datasets consisting of baseline/treatment tumor samples from non-small cell lung cancer subjects profiled with 29 markers, representing phenotype and functional properties of both tumor and immune cells. The markers are denoted by LipoR, VIM, T-BET, CD47, Cytokeratin, CD45RO, PD-L1, GAPDH, B7-H3, LAG-3, TIM-3, FOXP3, CD4, B7-H4, CD68, PD1, CD20, CD8, CD25, VISTA, KI-67, B2M, CD3, IDO-1, PD-L2, GZB, Histone 3, DNA1 and DNA2. The resolution of the IMC images is 1 μm² per pixel. The first dataset, denoted Dataset 1, consists of 55 subjects (samples), and the second dataset, denoted Dataset 2, consists of 29 subjects (samples). These subjects received treatment with PD-1 axis blockers. Based on the clinical sensitivity to the treatment, the subjects are categorized as: durable clinical benefit (denoted with the label responders) and no durable benefit (denoted with the label non-responders).

Prior to our analysis, a standard pre-processing using z-score normalization is applied to each marker. We remark that the mean and the standard deviation are computed based only on pixels in which the marker has non-zero values. In addition, we apply a 3 × 3 median filter to every image in Dataset 2. We note that the median filter is not applied to Dataset 1, because the typical expression levels in this dataset are sparse, and in this situation, application of a filter may destroy the signal. This is demonstrated in S1 Fig, where we apply the median filter to the expression levels of few important markers and observe a degenerate result. The markers exhibiting such sparse expression levels in Dataset 1 are VIM, T-BET, CD45RO, PD-L1, GAPDH, B7-H3, LAG-3, FOXP3, B7-H4, PD1, CD20, CD8, CD25, VISTA, KI-67, CD3, PD-L2 and GZB. Conversely, in Dataset 2, the typical expression levels are dense, and as a result, the median filter enhances the content of the images.

Our analysis does not consider the entire image, but rather focuses on ROIs located at the highest Cytokeratin expression levels, as Cytokeratin is expressed only in the tumor cells. As noted above, this circumvents cell segmentation that is typically required in other analyses techniques. At each ROI, we consider a “stack” of m image patches from all the m markers, where each patch consists of d = b × b pixels. We select N ROIs per sample by searching the patches with the maximal mean value of Cytokeratin. This is implemented by a 2D convolution applied to the Cytokeratin image with a constant kernel of size b × b. We deliberately avoid patch overlap by assigning zero values to the pixels of each ROI, once it is selected.

Using the present work notation, the intrinsic representation of the information embodied at each ROI is denoted by x_i, which is assumed to be a data point from some hidden manifold $M$ . Our working hypothesis is that the distribution of ${x_{i}}_{i = 1}^{n}$ on the hidden manifold $M$ is bimodal, which is induced by the sensitivity of the subjects to the treatment; data points x_i at ROIs within tissues of responders are located in one region of the manifold, and data points x_i at ROIs within tissues from non-responders are located in another region of the manifold. Given a data point x_i, our two hypotheses, $H_{r}$ and $H_{n}$ , are whether x_i is a realization from the distribution of responders or non-responders, respectively. Here, n = N × P is the total number of ROIs across all the samples, where P is the number of samples and N is the number of ROIs we consider in each sample. Next, recall that the intrinsic representation x_i is hidden. Instead, the accessible observations are the expression levels of the markers at the ROIs, which are represented mathematically by the observation functions $f_{j} (x_{i}) \in R^{d}$ for j = 1, …, m, where m = 29 is the number of markers. The domain of the observation functions is the hidden intrinsic manifold $M$ and the range is of dimension d = b × b, which is the size of a patch of an image of one marker. Namely, the values assigned by f_j(x_i) to x_i correspond to the expression level of marker j at the ROI associated with x_i. The set of patches of all the markers at a specific ROI x_i is a set of multi-feature observations ${f_{j} (x_{i})}_{j = 1}^{29}$ . In Fig 2, we illustrate the IMC multiplexed observations from a single subject and the set up under consideration.

Fig 2

An illustration of the IMC multi-feature observations from a single subject.

In the IMC data of a single subject, we focus first on the Cytokeratin marker, which is used as an indicator of tumor cells. We select N ROIs located at the highest Cytokeratin expression levels. These ROIs are patches of size b × b μm², where here b set to 35. We assume that these ROIs have intrinsic hidden states represented by x_i. The expression of all 29 markers at these ROIs are viewed as the multi-feature observations ${f_{j} (x_{i})}_{j = 1}^{29}$ for i = 1, …, N.

The identification of the subject’s response to treatment from the IMC data is based on the application of the proposed method described in Box 1. This algorithm fits well the problem at hand due to the following main reasons. First, directly comparing observations from the different markers, f_j(x_i), is inapplicable since each ROI comprises different cells and different tissue structures. Our method circumvents this problem by computing an intrinsic signature of the observations. Second, due to the different dynamics of the nominal values of the observations from the different markers, a naïve concatenation of the multi-feature observations is inadequate (in contrast to the localization example).

Before applying the proposed method, we test empirically that Assumptions (A.1) and (A.2) hold. Note that for this test only, the true response status is used. To test Assumption (A.1), we compute

and to test Assumption we compute

where Ω_r and Ω_n denote the sets of responders and non-responders, respectively. The respective values we obtain for Dataset 1 are 0.12 and 0.08 and for Dataset 2 the values for responders and non-responders are 0.03 and 0.04. This implies that the conditions in the two assumptions are approximately satisfied.

We analyze the two datasets separately, since the datasets were collected around a year apart, and internal acquisition system parameters were modified during that time. We apply the proposed method presented in Box 1 to the observations, ${f_{j} (x_{i})}_{j = 1}^{29}$ for i = 1, …, n, resulting in a low-dimensional representation of the ROIs. Then, we apply to the low-dimensional representation an RBF SVM classifier with a leave-one-subject-out (LOSO) cross-validation [5] in order to predict the response to treatment. In order to assess the prediction performance for each subject, we compute the average of the prediction results of all the ROIs of that subject. We note that the prediction is based on features computed per patch (rather than per subject). Therefore, in practice, the number of samples used for the cross-validation is 55 × N for Dataset 1 and 29 × N for Dataset 2. Importantly, at each cross-validation iteration, all N patches of a subject were removed from the training set, and were only used for testing the classifier.

We compare the SSD-based representation obtained by the propose method to other representations obtained by three competing algorithms. The first is a direct application of diffusion maps to the sets of multi-feature observations ${f_{j} (x_{i})}_{j = 1}^{29}$ for i = 1, …, n. The second and the third are based on HKS [16] and WKS [17], respectively, replacing the SSD as the features of each ROI at the first stage of the proposed method in Box 1. The dimensions of the representation obtained by the proposed method and by the three competing algorithms are determined by a variant of the Jackstraw method, described in Determining the dimension of data.

Fig 3 presents the predictions of treatment response by the SSD, HKS, WKS based algorithms as well as DM. We show the 3D t-SNE visualization [28] of the patch representation obtained as an output of the different algorithms and the confusion matrix of the prediction obtained by an RBF SVM classifier applied to the patch representation. To complement the results, in Table 1, we present the area under the ROC curve (AUC) of the treatment response predictions. We note that the AUC obtained for Dataset 2 without the median filter pre-processing is 0.931. For each algorithm, the presented prediction results are based on the patch size and number of patches configuration that yielded the best empirical performance using cross-validation as described above. The best configuration (patch size b × b and number of patches per subject N) of each algorithm is presented in Fig 3 on the left.

Table 1

ROC AUC for predictions of treatment sensitivity.

	Proposed Method	DM	HKS	WKS
Dataset 1	0.9455	0.8364	0.7455	0.7818
Dataset 2	1	0.8276	0.9655	0.9310

Fig 3

Treatment response predictions.

(A) Prediction for Dataset 1. (B) Prediction for Dataset 2. In each panel, at each row, we plot the best algorithm configuration, the 3D t-SNE visualization of the patches representation colored by the response status, and the confusion matrix of the response prediction obtained by a leave-one-subject-out cross-validation using an RBF SVM classifier. The rows present the results of the different methods.

In the t-SNE plots, we observe that the unsupervised separation of the patches from responders and non-responders is most pronounced in the low-dimensional representation obtained by the proposed method. This distinct visual separation, which was obtained by the proposed algorithm without access to any response outcome information. The color in the figures indicates the response status. Note that the embedding is obtained by the unsupervised algorithm and the color labels are overlaid to demonstrate the degree of separability between patches from responders and non-responders, which implies that this patch representation is informative and useful for the subsequent response prediction. This result, which is obtained in an unsupervised manner, distinguishes the current work from the computational methods for IMC data described above that rely on supervised analysis. Indeed, we observe that the prediction accuracy obtained based on the proposed method is superior compared to the other three competing methods. We note that Dataset 1 consists of 26 responders and 29 non-responders, so that the chance level of accurate prediction of a subject with durable clinical benefit is 47.27%, and Dataset 2 consists of 11 responders and 18 non-responders, thus, the chance level of accurate prediction is 37.93%.

The derivations in Theoretical analysis imply that the capability to distinguish between the two hypotheses (sensitivity or insensitivity to treatment) highly depends on the mutual relationships between the markers, which we explicitly define and term heterogeneity (Definition 2). Since our empirical study demonstrates that our approach facilitates a distinct separation between ROIs of subjects according to their treatment response, by the theoretical analysis, we conclude that the heterogeneity between the markers is where the information about the sensitivity to treatment lies.

We examine the sensitivity of the tested algorithms to the choice of the hyperparameters: the number of ROIs per subject N and the size of the patch b × b. In S2 Fig, we plot heatmaps of the treatment prediction accuracy obtained based on different choices of hyperparameters. We observe that within a relatively wide range of parameter values, the performance of the proposed method in Box 1 is high and insensitive to the particular choice of parameters. In addition, we note that the range of patch size where high performance is attained is centered at size 45 × 45 μm². As the size of tumor cells vary within a range of 10–30 μm² (a mean of 20 μm²), and the size of a lymphocyte ranges from 8–12 μm² (a mean of 10 μm²), it implies that high performance is attained when patches are likely including more than one cell and cell type. Conversely, we observe that the competing methods do not show the same degree of robustness to the choice of hyperparameters, and good performance is attained only for very particular (isolated) parameter values.

To further exploit the robustness of the proposed method, we implemented an ensemble of the classifiers based on different values of hyperparameters. The implementation of the combination is based on [29]. The performance of the combined classifiers is presented in S3 Fig. We observe that the prediction accuracy of this ensemble is comparable to the prediction accuracy obtained based on the classifier with the best parameter configuration, demonstrating that the particular choice of hyperparameters can be circumvented.

Our premise is that the resulting high prediction accuracy is attributed mainly to the unsupervised informative representation obtained by our approach, rather than the classifier type. To support this claim, we repeat the analysis by replacing the RBF SVM classifier with Random Forest [5]. The results are shown in S4 Fig and demonstrate comparable performance and similar trends.

To show that Stage 2 of the proposed method in Box 1 is essential, we plot in S5 Fig heatmaps of the multi-feature observations and the SSD features resulting from Stage 1 of the algorithm. The heatmaps of the multi-feature observations demonstrate that there is no obvious difference between responders and non-responders, implying that the response status prediction is a non-trivial task. In addition, inspection of the heatmaps with the SSD features does not reveal apparent distinction between responders and non-responders. Recalling that after Stage 2, as presented in the t-SNE plots in Fig 3, this distinction becomes evident, demonstrating the contribution of Stage 2.

Discussion

We presented a two-step graph analysis approach. The first step is applied to the multi-feature observations of the data points, where their mutual-relationships are extracted. This step is implemented by computing the SSD of a random walk defined on the graph whose nodes are the observations. The resulting SSD can be viewed as a signature or a characteristic vector of the data point and is analogous to traditional signatures from other domains, such as the heat kernel signature (HKS) [16] or the wave kernel signature (WKS) [17] in the field of shape analysis, and PageRank [30] in web page ranking (see Related work). The second step is applied to the signatures obtained at the first step, for the purpose of constructing an intrinsic low-dimensional representation of all the data points.

Previous attempts to analyze such mulitplexed datasets involve various approaches, including direct comparisons of the marker expressions, the cell morphology, and interactions in cell neighborhoods, to name but a few [26]. Our method introduces a new approach, building a new representation of the multiplexed data in two steps. Since each of the steps involves a construction of a graph, the entire procedure can be viewed as building a graph of graphs.

While the algorithm is described in a general setting of multiplexed data fusion, our theoretical analysis is focused on binary hypothesis testing. In comparison with a traditional statistical estimation approach, we show that our method exhibits advantages, implying that the mutual-relationships between the multi-feature observations are well captured. In the context of IMC, this could minimize the effect of deviation in individual marker scores and cell/tissue heterogenenity.

We apply the proposed method to two IMC datasets and show that solely from the imaging data, we can distinguish between two different sensitivity levels to treatment. Since our approach does not rely on rigid prior knowledge or access to labels, it has the potential of identifying biological relevance of novel parameters or marker patterns in treatment responses by analyzing dominant factors contributing more to the model stratification. Importantly, we remark that in contrast to common practice, the proposed approach does not require cell-segmentation as a precursor.

In addition to the demonstrated advantage of the proposed method over the competing methods in terms of superior prediction accuracy of treatment response, the proposed method is also more computationally efficient. In S6 Fig, we present the run time of the two stages of the proposed method in Box 1 and compare them to the run time of the three competing methods. We note that the HKS and WKS replace the proposed SSD in Stage 1, but, their respective Stage 2 are similar. In Stage 1, we observe that the proposed method in Box 1 is faster than the competing methods. The main difference between the algorithms in Stage 1 is due to the fact that the SSD is proportional to the degree of the local graph, and as a result, it can be computed efficiently from the degree vector. Conversely, both HKS and WKS require eigenvalue decomposition, which is computationally more demanding. In addition, the direct application of DM just involves Stage 2 of the proposed algorithm. Seemingly, the run time of the algorithms in Stage 2 should have been the same, yet, we observe that DM is slower. This difference is attributed to the significantly different size of the feature space. In DM, the multi-feature observations are simply concatenated, giving rise to feature space (input of Stage 2) of size (b × b × m) × (P × N), where b × b is the size of patch, m is the number of biomarkers, and P is the number of subjects. Conversely, in Stage 2 of Algorithm 1, HKS, and WKS, the feature space (input of Stage 2) is only of size m × (P × N).

It is conceivable that the most important hyperparameter of our method is the scale parameter. In S7 Fig, we present a toy example demonstrating that different values of the scale parameter ϵ lead to multiscale signatures capturing local and global features. We demonstrate that different scales facilitate the extraction of different features of the data. In future work, we plan to further explore the role of the scale and to devise multiscale signatures. Another possible direction for future research relies on the fact that our method is general and can be extended to other multiplexed datasets. For example, hyper spectral imaging, sensor networks, spatial multiplexed proteomics, and spatial transcriptomics assays is a representative subset of distinct technologies from diverse domains of science and engineering that share common data structures. The data in all these modalities consist of high-dimensional multivariate observations (m-dimensional feature space) collected at different spatial positions, and therefore, can be analyzed using similar computational methodologies. Furthermore, in many studies practitioners collect datasets consisting of multiple spatial assays of this type, each capturing such data from a single biological sample, patient, or hyper spectral image, etc. Each of these spatial assays could be characterized by several regions of interest (ROIs), giving rise to a setting similar to the IMC problem considered here. Specifically, we plan to examine applications of the proposed graph of graphs analysis to spatial transcriptomics such as Slide-seq [31], High-Density Spatial Transcriptomics [32, 33], MIBI-TOF [34], and DBiT-seq [35].

Materials and methods

Diffusion maps

Manifold learning is a class of nonlinear techniques that embeds high dimensional data points into a low dimensional space, relying on the assumption that the high-dimensional data lie on a low dimensional manifold $M$ [2–4]. In order to “learn” the manifold from a discrete set of data points, a graph is typically defined, where the graph nodes are the data points and the edges are determined according to some similarity notion. Since the manifold information is entirely captured by its Laplacian, the discrete counterpart, the graph Laplacian is used to build a low-dimensional embedding that respects the manifold in some sense [36]. To this end, common practice is to compute and exploit the spectral decomposition of the graph Laplacian. Diffusion maps is one of these methods, which constructs a random walk on the graph and represents the data points in a low-dimensional space preserving the neighborhood information [4].

Consider a set of data points ${y_{i}}_{i = 1}^{b}$ , where $y_{i} \in R^{d}$ for i = 1, …, b. An undirected weighted graph $G = (V, E, W)$ is constructed from the data points, where the vertex set is $V = (y_{1}, y_{2}, . . ., y_{b})$ and the weights of the edges connecting two vertices are determined by a measure of similarity between any two data points, e.g., by

where i, j ∈ {1, …, b} and ϵ > 0 is a scale parameter. Common practice is to set ϵ as the median of the distances between the graph nodes. Note that ϵ implicitly induces a notion of locality: it can be viewed as the (squared) radius of the neighborhood around each node, so that only nodes within this radius are considered as neighbors in the graph.

Next, a random walk P on the data points is constructed by normalizing the weight matrix W

where

d (i) = \sum_{j = 1}^{n} W (i, j)

. P is the transition matrix of a Markov chain defined on the data points

{y_{i}}_{i = 1}^{b}

(graph vertexes), where the entry p(i, j) describes the probability of a random walk transitioning from the node y_i to the node y_j in a single step. Raising the transition matrix P to a power t can be viewed as applying the Markov chain to the data points t times.

Since P is similar to a symmetric and positive-define matrix, P has a biorthogonal right- and left-eigenvectors ${φ_{i}, υ_{i}}_{i = 1}^{b}$ with the eigenvalues 1 = λ₁ ≥ λ₂ ≥ … ≥ λ_b ≥ 0. Consequently, the spectral decomposition of P^t is given by

The diffusion distance $D_{t}^{2} (i, j)$ between two data points y_i and y_j in the data set is defined by

which measures the similarity of two points based on the evolution of their probability distributions, and depends on all possible paths of length t in the graph between any two points. Namely, if two points are connected by a large number of paths, then the diffusion distance between them will be small. Conversely, if there are only few paths connecting two points, then the diffusion distance between them will be large.

The diffusion maps is defined by [4]

We remark that in many cases, due to the typical fast decay of the eigenvalues of P^t, l can be set to be smaller than d, thereby achieving dimension reduction. In addition, φ₁ is a constant vector and therefore is not used in the mapping.

It can be shown that the diffusion distance can be approximated by the eigenvalues and eigenvectors by [4]

where equality is reached for l = b. Namely, the diffusion distance can be approximated by the Euclidean distance between the diffusion maps of the data points.

Determining the dimension of data

A common problem in diffusion maps setting is how to choose the dimension l. The authors in [6] proposed Jackstraw to identify the number of principal components (PCs) in the context of principal component analysis (PCA) [37]. We present here a variant of Jackstraw, adapting it to diffusion maps.

Given a random walk P constructed from a set of b data points, the associated eigenvalues are 1 = λ₁ ≥ λ₂ ≥ … ≥ λ_b with corresponding right-eigenvectors ${φ_{i}}_{i = 1}^{b}$ . Collect the eigenvalues into a vector, denoted by λ, where λ = (λ₁, λ₂, …, λ_b)^⊤. Let $P_{k}^{*}$ consist of the random permutation of P. Apply eigenvalue decomposition to $P_{k}^{*}$ and obtain the corresponding eigenvalues vector $λ_{k}^{*}$ . Repeat this shuffling procedure s times and obtain a set of vectors ${λ_{k}^{*}}_{k = 1}^{s}$ . The dimension of the representations is determined by

Note that the absolute values of the eigenvalues of $P_{k}^{*}$ are considered because $P_{k}^{*}$ is not necessarily symmetric and therefore its eigenvalues $λ_{k}^{*}$ are not guaranteed to be real.

Related work

Heat kernel signature

There are several shape analysis signatures obtained by spectral methods with different geometric properties such as isometry and deformation invariance [16, 17, 38, 39], related to the proposed method. One of the notable shape signatures is based on the heat diffusion on a shape, called Heat Kernel Signature (HKS) [16]. Broadly, the HKS is obtained by the eigenvalue decomposition of the heat kernel defined on the shape. In the context of our problem, since the heat kernel and the Laplace-Beltrami operator Δ_i share the same eigenbasis, and since the discrete graph Laplacian converges (point-wise) to the Laplace-Beltrami [40]

then, a discrete counterpart of HKS is given by

where

λ_k and ψ_k are the k-th eigenvalue and k-th eigenvector of the random walk, respectively, and t is the number of random walk steps on the graph. For more details on HKS, we refer the readers to [].

Similarly to the DKS in Eq (12), the HKS can also be viewed as a low-pass filter. Observe that, for small t, the HKS approximates the DKS by Taylor expansion; for other t values, the weights assigned by the HKS decay faster than the weights of DKS, and therefore, the DKS gives more attention to finer structures.

Wave kernel signature

Another related shape signature is built by the wave function to the Schrödinger equation describing the quantum mechanical particles, called Wave Kernel Signature (WKS) [17]. Similarly to the HKS, the WKS is given by

where

with

λ_k and ψ_k are the k-th eigenvalue and k-th eigenvector of the random walk, and t is number of random walk steps on the graph. While the DKS and HKS are viewed as lowpass filters, we note that the WKS can be viewed as a band-pass filter. For more details, we refer the readers to [].

Nodes ranking

In the context of web page ranking, there are several traditional algorithms based on the spectral analysis of directed graphs. Among them, the celebrated PageRank score is based on the stationary distribution of a random walk representing the popularity of linked web pages [30]. Hyper induced topic search (HITS) is another related algorithm, identifying the influential nodes using a random walk on a graph. There, the graph nodes are the web pages, which are divided into two groups: authorities and hubs [41]. Both algorithms address the problem of web page ranking, where PageRank depends on the incoming links whereas HITS focuses on the outgoing links.

Localization toy problem

To illustrate the challenge in the problem setting and the generality of the proposed solution, we present three simulations of different localization problems. The Matlab code is available in S1 Matlab Code.

Simulation 1

Consider 800 objects on a 2-sphere in $R^{3}$ that can be located at four different regions. Each region consists of 200 objects. The positions of the objects are measured by 5 sensors, giving rise to the following set of observations ${f_{j} (x_{i})}_{j = 1}^{5} \in R^{100}$ , where j is the index of the sensor and i is the index of the object (position). Each sensor measures the position in d = 100 coordinates in the following way

where the standard deviation of the measurement depends on the distance between the position of the sensor and the position of the object

σ_{i} = 20 exp (- ∥ x_{i} - s_{j} ∥_{2}^{2})

, I₁₀₀ is the identity matrix of size 100 × 100, ‖ ⋅ ‖₂ denotes the Euclidean norm and s_j denotes the 3D position of sensor j ∈ {1, …, 5}. In other words, each object position is captured by d = 100 realizations of a Gaussian random variable with variance that is proportional to the distance between the sensor position and the object position. Note that the positions are captured by the sensor through the variance, therefore, they are difficult to infer directly by the multi-feature observations.

In Fig 4A, we present our setting consisting of objects and sensors located on and near a sphere, respectively. The objects positions are marked by dots, the different regions are marked by different colors (red, black, blue and yellow), and the sensors positions are marked by green stars. At the bottom, we present the (high-dimensional) sensor observations. Each block consists of 100 × 200 scalar observations corresponding to the observations of a single sensor from each region, where d = 100 is the dimension of each observation and 200 is the number of the positions per region. Visually, it is evident that distinguishing between the different regions merely based on these observations is non trivial.

Fig 4

Illustration of localization toy problems.

(A) The sensors and objects locations on a sphere are marked by green stars and dots, respectively. The multi-sensor observations correspond to Simulation 1 (left), Simulation 2 (middle), and Simulation 3 (right). (B) Results of the application of our approach to Simulation 1: the SSDs obtained by the proposed method (left), the diffusion maps embedding (middle), and the localization confusion matrix obtained by a 10-fold cross-validation with an RBF SVM classifier (right). (C) A comparison between the localization accuracy obtained by the proposed method based on SSDs and the localization accuracy obtained based on the output of each sensor as well as the concatenation of the output from all the sensors. The localization accuracy is the number of correctly identified positions divided by the true number of total positions in each region.

For illustration purposes, we view the problem as a classification problem, where given the high-dimensional multi-feature observations, the task is to identify in which region the object at x_i resides.

The observations from each sensor consisting of 800 object positions from four regions were processed in two stages: first, a 3D embedding is constructed by applying diffusion maps to the high-dimensional observations, and second, an RBF SVM is applied to the obtained embedding in order to classify the region. To evaluate the classifiers, we perform a 10-fold cross-validation. A similar two-step procedure is applied to the concatenation of the observations from all the sensors.

Table 2 presents the resulting classification accuracy, i.e., the number of correctly classified positions divided by the total number of positions in each region. The presented results are obtained using a 10-fold cross-validation. We observe that none of the sensors enables an accurate classification. Moreover, we show in Table 2 that a naïve concatenation of the observations from all the sensors do not yield a good classification either.

Table 2

Localization accuracy from measurements of Simulation 1.

Region	Concatenated sensors	Sensor 1	Sensor 2	Sensor 3	Sensor 4	Sensor 5
Blue	43%	46%	18%	44.5%	58%	40.5%
Red	54%	15.5%	52.5%	45.5%	48.5%	55%
Black	46.5%	49.5%	29%	40%	37.5%	42%
Yellow	38.5%	33.5%	52%	30.5%	42.5%	50.5%

The performances obtained by the 10-fold cross-validations are presented. The percentages indicate the number of correctly classified positions divided by the true number of total positions in each class.

Seemingly, in order to mitigate the problem, we could simply represent the objects positions by a vector of the variance of the observations. However, it would require prior knowledge about the sensing model, whereas our approach is model-free.

In Fig 4B, we present the classification results obtained by the proposed method. In the diffusion maps embedding constructed from the SSDs, we observe a clear separation between the four regions. Finally, in the confusion matrix of the classification, we observe that the proposed method leads to significantly better classification results compared to the results in Table 2. This demonstrates the importance of taking into account the mutual-relationships between the sensor observations, rather than processing the nominal values of the observations directly, which in this case, give rise to correct identification of the four regions.

Simulation 2

The main purpose of this simulation is to demonstrate Proposition 2 from Binary hypothesis testing.

Consider n = 400 positions, ${x_{i}}_{i = 1}^{400}$ , such that $x_{i} \in S^{2} \subset R^{3}$ , which are sampled from two different regions on the sphere. Suppose that the positions of the objects follow a bimodal distribution as depicted in the middle-top figure in Fig 4A: an object is located in the blue region with probability $\frac{1}{2}$ and in the red region with probability $\frac{1}{2}$ . The two regions represent the two hypotheses, $H_{1}$ and $H_{2}$ , where each region consists of 200 positions. Note the symmetry in this setting, that is, the distances from the blue and red regions to the sensors are approximately the same.

Here, we have 12 sensor observations ${f_{j} (x_{i})}_{j = 1}^{12}$ , measuring the positions of the objects, located at s_j. The multi-feature observations are random samples from

where

f_{j} (x_{i}) \in R^{100}

, the standard deviation is

σ_{j}^{i} = 20 exp (- ∥ x_{i} - s_{j} ∥_{2}^{2})

, and the covariance

Σ_{k, l} \sim | N (0, 0.5) |

for

⌊ \frac{k}{100} ⌋, ⌊ \frac{l}{100} ⌋ \in [1, \dots, 12]

In this simulation, a direct computation can show that Assumptions (A.1) and (A.2) hold. Specifically, we note that

and

where Ω_r and Ω_b denote the sets of red or blue object locations, respectively. In addition, the conditions of Special Case 1 are satisfied, and thus, the total variation of these sensor observations is zero. In other words, using a single sensor is insufficient to distinguish between the two regions. Conversely, we show that since the SSDs take into account the covariance information between the sensors, they allow us to make this distinction.

In Fig 4C, we present a comparison between the classification results obtained by our approach using SSDs and the classification results obtained by using the output of each sensor as well as the concatenation of the output from all the sensors. The results are evaluated with a 10-fold cross-validation. We observe that the classification obtained by proposed algorithm is significantly better than the classification obtained using the “raw” sensor outputs.

Simulation 3

This simulation demonstrates Proposition 3 from Binary hypothesis testing. Consider n = 400 positions, ${x_{i}}_{i = 1}^{400}$ , such that $x_{i} \in S^{2} \subset R^{3}$ , which are sampled from two different regions on the sphere, as depicted in the right-top figure in Fig 4A. The rest of the setting remains as described in Simulation 2.

Note that here, only two sensor observations satisfy the conditions of Special Case 1, namely, $σ_{1}^{1} = σ_{1}^{2}$ and $σ_{7}^{1} = σ_{7}^{2}$ , and thus, the total variation of only these two sensor observations is zeros.

For each data point, we compute the difference between the left-hand side and the right-hand side of the inequality in Proposition 2 per sensor. Then, in the right-top figure in Fig 4A, we circle the sensors attaining the largest difference. As expected, we observe that the circled sensors are positioned in non-symmetric orientations with respect to the locations of the two regions.

Similarly to the previous simulation, we compare the classification results obtained by our proposed method to the results attained by applying the classification directly to the observations from each sensor separately and to the concatenation of the observations from all the sensors. In addition, we also compute the results of Algorithm 1 applied to all the sensors except Sensor 1 and Sensor 7, which are the least contributing according to the inequality in Proposition 2.

The classification results presented in Fig 4C imply that by removing the least contributing sensors, namely Sensor 1 and Sensor 7, the recomputed SSDs, denoted by $π_{i}^{1, 7}$ , lead to slightly improved classification accuracy.

References

JBTenenbaum, VDe Silva, JCLangford. A global geometric framework for nonlinear dimensionality reduction. science. 2000;290(5500):2319–2323. 10.1126/science.290.5500.2319

STRoweis, LKSaul. Nonlinear dimensionality reduction by locally linear embedding. science. 2000;290(5500):2323–2326. 10.1126/science.290.5500.2323

MBelkin, PNiyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation. 2003;15(6):1373–1396. 10.1162/089976603321780317

RCoifman, SLafon. Diffusion Maps. Appl Comput Harmon Anal. 2006;21:5–30. 10.1016/j.acha.2006.04.006

KPMurphy. Machine learning: a probabilistic perspective. MIT press; 2012.

NCChung, JDStorey. Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics. 2014;31(4):545–554. 10.1093/bioinformatics/btu674

RRLederman, RTalmon. Learning the geometry of common latent variables using alternating-diffusion. Applied and Computational Harmonic Analysis. 2015;.

RTalmon, HTWu. Latent common manifold learning with alternating diffusion: analysis and applications. Applied and Computational Harmonic Analysis. 2019;47(3):848–892. 10.1016/j.acha.2017.12.006

TShnitzer, MBen-Chen, LGuibas, RTalmon, HTWu. Recovering hidden components in multimodal data with composite diffusion operators. SIAM Journal on Mathematics of Data Science. 2019;1(3):588–616. 10.1137/18M1218157

OKatz, RTalmon, YLLo, HTWu. Alternating diffusion maps for multimodal data fusion. Information Fusion. 2019;45:346–360. 10.1016/j.inffus.2018.01.007

OLindenbaum, AYeredor, MSalhov, AAverbuch. Multi-view diffusion maps. Information Fusion. 2020;55:127–149. 10.1016/j.inffus.2019.08.005

DEynard, AKovnatsky, MMBronstein, KGlashoff, AMBronstein. Multimodal manifold analysis by simultaneous diagonalization of laplacians. IEEE transactions on pattern analysis and machine intelligence. 2015;37(12):2505–2517. 10.1109/TPAMI.2015.2408348

BBodenmiller, ERZunder, RFinck, TJChen, ESSavig, RVBruggner, et al. Multiplexed mass cytometry profiling of cellular states perturbed by small-molecule regulators. Nature biotechnology. 2012;30(9):858. 10.1038/nbt.2317

CGiesen, HAWang, DSchapiro, NZivanovic, AJacobs, BHattendorf, et al. Highly multiplexed imaging of tumor tissues with subcellular resolution by mass cytometry. Nature methods. 2014;11(4):417. 10.1038/nmeth.2869

QChang, OIOrnatsky, ISiddiqui, ALoboda, VIBaranov, DWHedley. Imaging mass cytometry. Cytometry part A. 2017;91(2):160–169. 10.1002/cyto.a.23053

Sun J, Ovsjanikov M, Guibas L. A concise and provably informative multi-scale signature based on heat diffusion. In: Computer graphics forum. vol. 28. Wiley Online Library; 2009. p. 1383–1392.

Aubry M, Schlickewei U, Cremers D. The wave kernel signature: A quantum mechanical approach to shape analysis. In: 2011 IEEE international conference on computer vision workshops (ICCV workshops). IEEE; 2011. p. 1626–1633.

Krause H. Localization theory for triangulated categories. arXiv preprint arXiv:08061324. 2008;.

LLovász, et al. Random walks on graphs: A survey. Combinatorics, Paul erdos is eighty. 1993;2(1):1–46.

RTalmon, ICohen, SGannot. Single-channel transient interference suppression with diffusion maps. IEEE transactions on audio, speech, and language processing. 2012;21(1):132–144. 10.1109/TASL.2012.2215593

DDov, RTalmon, ICohen. Audio-visual voice activity detection using diffusion maps. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2015;23(4):732–745. 10.1109/TASLP.2015.2405481

GMishne, ICohen. Multiscale anomaly detection using diffusion maps. IEEE Journal of selected topics in signal processing. 2012;7(1):111–123. 10.1109/JSTSP.2012.2232279

MMBronstein, AMBronstein. Shape recognition with spectral distances. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2010;33(5):1065–1071. 10.1109/TPAMI.2010.210

Cheng X, Mishne G. Spectral Embedding Norm: Looking Deep into the Spectrum of the Graph Laplacian. arXiv preprint arXiv:181010695. 2018;.

Devroye L, Mehrabian A, Reddad T. The total variation distance between high-dimensional Gaussians. arXiv preprint arXiv:181008693. 2018;.

HBaharlou, NPCanete, ALCunningham, ANHarman, EPatrick. Mass Cytometry Imaging for the Study of Human Diseases—Applications and Data Analysis Strategies. Frontiers in Immunology. 2019;10. 10.3389/fimmu.2019.02657

HRAli, HWJackson, VRZanotelli, EDanenberg, JRFischer, HBardwell, et al. Imaging mass cytometry and multiplatform genomics define the phenogenomic landscape of breast cancer. Nature Cancer. 2020;1(2):163–175. 10.1038/s43018-020-0026-6

LvdMaaten, GHinton. Visualizing data using t-SNE. Journal of machine learning research. 2008;9(Nov):2579–2605.

DMTax, MVan Breukelen, RPDuin, JKittler. Combining multiple classifiers by averaging or by multiplying? Pattern recognition. 2000;33(9):1475–1485. 10.1016/S0031-3203(99)00138-7

LPage, SBrin, RMotwani, TWinograd. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab; 1999.

SGRodriques, RRStickels, AGoeva, CAMartin, EMurray, CRVanderburg, et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science. 2019;363(6434):1463–1467. 10.1126/science.aaw1219

SVickovic, GEraslan, FSalmén, JKlughammer, LStenbeck, DSchapiro, et al. High-definition spatial transcriptomics for in situ tissue profiling. Nature methods. 2019;16(10):987–990. 10.1038/s41592-019-0548-y

SVickovic, GEraslan, JKlughammer, LStenbeck, FSalmen, TAijo, et al. High-density spatial transcriptomics arrays for in situ tissue profiling. bioRxiv. 2019; p. 563338.

LKeren, MBosse, SThompson, TRisom, KVijayaragavan, EMcCaffrey, et al. MIBI-TOF: A multiplexed imaging platform relates cellular phenotypes and tissue structure. Science Advances. 2019;5(10):eaax5851. 10.1126/sciadv.aax5851

Liu Y, Yang M, Deng Y, Su G, Guo C, Zhang D, et al. High-Spatial-Resolution Multi-Omics Atlas Sequencing of Mouse Embryos via Deterministic Barcoding in Tissue. Available at SSRN 3466428. 2019;.

PBérard, GBesson, SGallot. Embedding Riemannian manifolds by their heat kernel. Geometric & Functional Analysis GAFA. 1994;4(4):373–398. 10.1007/BF01896401

GHDunteman. Principal components analysis. 69. Sage; 1989.

Raviv D, Bronstein MM, Bronstein AM, Kimmel R. Volumetric heat kernel signatures. In: Proceedings of the ACM workshop on 3D object retrieval. ACM; 2010. p. 39–44.

Rustamov RM. Laplace-Beltrami eigenfunctions for deformation invariant shape representation. In: Proceedings of the fifth Eurographics symposium on Geometry processing. Eurographics Association; 2007. p. 225–233.

SMinakshisundaram, ÅPleijel. Some properties of the eigenfunctions of the Laplace-operator on Riemannian manifolds. Canadian Journal of Mathematics. 1949;1(3):242–256. 10.4153/CJM-1949-021-5

JMKleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM). 1999;46(5):604–632. 10.1145/324133.324140

Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, et al. 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 1912–1920.

Ma G, Lu CT, He L, Philip SY, Ragin AB. Multi-view graph embedding with hub detection for brain network analysis. In: 2017 IEEE International Conference on Data Mining (ICDM). IEEE; 2017. p. 967–972.