The author has declared that no competing interests exist.
Current address: Quantitative Life Sciences, The Abdus Salam International Centre for Theoretical Physics - ICTP, Trieste, Italy
Characterizing the relation between weight structure and input/output statistics is fundamental for understanding the computational capabilities of neural circuits. In this work, I study the problem of storing associations between analog signals in the presence of correlations, using methods from statistical mechanics. I characterize the typical learning performance in terms of the power spectrum of random input and output processes. I show that optimal synaptic weight configurations reach a capacity of 0.5 for any ratio of excitatory to inhibitory weights and have a peculiar synaptic distribution with a finite fraction of silent synapses. I further provide a link between typical learning performance and principal component analysis in single cases. These results may shed light on the synaptic profile of brain circuits, such as cerebellar structures, that are thought to engage in processing time-dependent signals and in on-line prediction.
A general analysis of learning with biological synaptic constraints in the presence of statistically structured signals is lacking. Here, analytical techniques from statistical mechanics are leveraged to analyze association storage between analog inputs and outputs with excitatory and inhibitory synaptic weights. The linear perceptron performance is characterized and a link is provided between the weight distribution and the correlations of input/output signals. This formalism can be used to predict the typical properties of perceptron solutions for single learning instances in terms of the principal component analysis of input and output data. This study provides a mean-field theory for sign-constrained regression of practical importance in neuroscience as well as in adaptive control applications.
At the most basic level, neuronal circuits are characterized by the subdivision into excitatory and inhibitory populations, a principle called Dale’s law. Even though the precise functional role of Dale’s law is not yet understood, synaptic sign constraints are pivotal in constructing biologically plausible models of synaptic plasticity in the brain [1–5]. The properties of synaptic couplings strongly impact the dynamics and response of neural circuits, thus playing a crucial role in shaping their computational capabilities. It has been argued that the statistics of synaptic weights in neural circuits could reflect a principle of optimality for information storage, both at the level of single-neuron weight distributions [6, 7] and of inter-cell synaptic correlations [8] (e.g. the overabundance of reciprocal connections). A number of theoretical studies, stemming from the pioneering Gardner approach [9], have investigated the computational capabilities of binary [10–13] and analog perceptrons [14, 15] in stylized classification and memorization tasks, using synthetic data. With some exceptions mentioned below, these studies considered random uncorrelated inputs and outputs, a usual approach in statistical learning theory. One interesting theoretical prediction is that non-negativity constraints imply that a finite fraction of synaptic weights is set to zero at critical capacity [6, 15, 16], a feature consistent with experimental synaptic weight distributions observed in some brain areas, e.g. the input fibers to Purkinje cells in the cerebellum.
The need to understand how the interaction between excitatory and inhibitory synapses mediates plasticity and dynamic homeostasis [17, 18] calls for the study of heterogeneous multi-population feed-forward and recurrent models. A plethora of mechanisms for excitatory-inhibitory (E-I) balance of input currents onto a neuron have been proposed [19, 20]. At the computational level, it has recently been shown that a peculiar scaling of excitation and inhibition with network size, originally introduced to account for the high variability of neural firing activity [21–27], carries the computational advantage of noise robustness and stability of memory states in associative memory networks [13].
Analyzing training and generalization performance in feed-forward and recurrent networks as a function of the statistical and geometrical structure of a task remains an open problem both in computational neuroscience and in statistical learning theory [28–32]. This calls for statistical models of the low-dimensional structure of data that are at the same time expressive and amenable to mathematical analysis. A few classical studies investigated the effect of “semantic” (among input patterns) and spatial (among neural units) correlations on random classification and memory retrieval [33–35]. The latter are important in the construction of associative memory networks for place cell formation in the hippocampal complex [36].
For reasons of mathematical tractability, the vast majority of analytical studies of binary and analog perceptron models have focused on the case where both inputs and outputs are independent and identically distributed. In this work, I relax this assumption and study optimal learning of input/output associations with real-world statistics, using a linear perceptron with heterogeneous synaptic weights. I introduce a mean-field theory of an analog perceptron with weight regularization and sign constraints, considering two different statistical models for input and output correlations. I derive its critical capacity in a random association task and study the statistical properties of the optimal synaptic weight vector across a diverse range of parameters.
This work is organized as follows. In the first section, I introduce the framework and provide the general definitions for the problem. I first consider a model of temporal (or, equivalently, “semantic”) correlations across inputs and output patterns, assuming statistical independence across neurons. I show that optimal solutions are insensitive to the fraction of E and I weights, as long as the external bias is learned. I derive the weight distribution and show that it is characterized by a finite fraction of zero weights also in the general case of E-I constraints and correlated signals. The assumption of independence is subsequently relaxed in order to provide a theory that depends on the spectrum of the sample covariance matrix and the dimensionality of the output signal along the principal components of the input. The implications of these results are discussed in the final section.
Consider the problem of linearly mapping a set of correlated inputs xiμ, with i = 1, …, N and μ = 1, …, P, from NE = fEN excitatory (E) and NI = (1 − fE)N inhibitory (I) neurons onto an output yμ using a synaptic vector w, in the presence of a learnable constant bias current b (Fig 1). To account for different statistical properties of E and I input rates, we write the elements of the input matrix as with


Schematic of the learning problem.
A linear perceptron receives N correlated signals (input rates of pre-synaptic neurons) xiμ and maps them to the output yμ through NE = fEN excitatory and NI = (1 − fE)N inhibitory plastic weights wi, plus an additional bias current b.
For a given input-output set, we are faced with the problem of minimizing the following regression loss (energy) function:

Optimizing with respect to the bias b naturally yields solutions w for which

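For concreteness, this optimization can be carried out numerically as a bound-constrained least-squares problem. The sketch below is illustrative rather than the code used in this work: the helper name fit_sign_constrained and the ridge-by-augmentation trick are assumptions, and the overall normalization of the loss may differ from the text. It minimizes a quadratic loss with an ℓ2 penalty of strength γ on the weights, sign constraints on the E and I components, and an unconstrained, unpenalized bias b.

```python
import numpy as np
from scipy.optimize import lsq_linear

def fit_sign_constrained(X, y, f_E=0.8, gamma=0.1):
    """Sign-constrained ridge regression with a free bias (illustrative sketch).

    X: (N, P) input matrix (neurons x patterns), y: (P,) output.
    Minimizes 0.5 * sum_mu (y_mu - w.x_mu - b)^2 + 0.5 * gamma * |w|^2
    subject to w_i >= 0 for excitatory and w_i <= 0 for inhibitory synapses.
    """
    N, P = X.shape
    N_E = int(f_E * N)

    # Design matrix: rows are patterns, last column carries the bias b.
    A = np.hstack([X.T, np.ones((P, 1))])

    # Ridge penalty via row augmentation; the bias column is not penalized.
    reg = np.hstack([np.sqrt(gamma) * np.eye(N), np.zeros((N, 1))])
    A_aug = np.vstack([A, reg])
    y_aug = np.concatenate([y, np.zeros(N)])

    # Box constraints implementing the sign constraints (bias unconstrained).
    lb = np.r_[np.zeros(N_E), np.full(N - N_E, -np.inf), -np.inf]
    ub = np.r_[np.full(N_E, np.inf), np.zeros(N - N_E), np.inf]

    res = lsq_linear(A_aug, y_aug, bounds=(lb, ub))
    return res.x[:N], res.x[N]          # (weights w, bias b)
```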
In order to derive a mean-field description for the typical properties of the learned synaptic vector w, we employ a statistical mechanics framework in which the minimizer of E is evaluated after averaging across all possible realizations of the input matrix X and output y. To do so, we compute the free energy density

The existence of weight vectors w with a given value of the regression loss E in the error regime (ϵ > 0) is described by the so-called overlap order parameter

In the absence of weight regularization (γ = 0), we define the critical capacity αc as the maximal load α = P/N for which the patterns xμ can be correctly mapped to their outputs yμ with zero error. When the synaptic weights are not sign-constrained, the critical capacity is obviously αc = 1, since the matrix X is typically full rank. In the sign-constrained case, αc is found to be the minimal value of α such that Eq (4) is satisfied for

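As an illustration of this definition, the capacity can be probed numerically by checking, for several loads α, whether the bound-constrained least-squares problem (with a learned bias and no regularization) attains zero error. The sizes and the i.i.d. Gaussian statistics below are illustrative choices, not the simulations reported in the figures.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
N, f_E = 500, 0.8
N_E = int(f_E * N)

for alpha in (0.3, 0.4, 0.6, 0.7):
    P = int(alpha * N)
    X = rng.standard_normal((P, N))          # i.i.d. inputs (patterns x neurons)
    y = rng.standard_normal(P)               # i.i.d. outputs
    A = np.hstack([X, np.ones((P, 1))])      # last column = learned bias b
    lb = np.r_[np.zeros(N_E), np.full(N - N_E, -np.inf), -np.inf]
    ub = np.r_[np.full(N_E, np.inf), np.zeros(N - N_E), np.inf]
    theta = lsq_linear(A, y, bounds=(lb, ub)).x
    mse = np.mean((A @ theta - y) ** 2)
    print(f"alpha = {alpha:.2f}   mean squared error = {mse:.2e}")
# Loads well below alpha_c ~ 0.5 are fitted with numerically zero error;
# loads well above it leave a finite residual error.
```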
In [15], the authors showed that, in the case with excitatory synapses only and uncorrelated inputs and outputs, αc approaches 0.5 in the limit when the quantity


Critical capacity and weight balance.
A: Average loss ϵ for a linear perceptron with a fraction fE = 0.8 of positive synaptic weights, in the case of i.i.d. input X and output y, for increasing values of the regularization γ. Parameters: N = 1000,
The independence of our results from the E/I ratio for an optimal bias current signals a local gauge invariance, as observed in [37, 38] for a sign-constrained binary perceptron. Indeed, defining gi = sign(wi), we can write the mean-removed output as
For a generic value of the bias current b, there are strong deviations from the condition in Eq (2). In Fig 2B, we compare the value of the average output
The deviation from Eq (2), shown here for a rapidly decaying covariance of the form
The theory developed thus far applies to a generic covariance matrix C. To connect the spectral properties of C with the signal dynamics, we further assume the xiμ to be N independent stationary discrete-time processes. In this case, Cμν = C(μ − ν) is a matrix of Toeplitz type [39], leading to the following expression for the average minimal loss density in the N → ∞ limit:



Eigenvalues of C and Fourier spectrum.
A: Examples of excitatory input signals xiμ (i ∈ E) with two different covariance matrices C. Top: rbf covariance, τ = 10. Bottom: exponential covariance
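The connection between the eigenvalues of a Toeplitz covariance and the power spectrum of the process can be checked directly. The sketch below uses illustrative parameters (rbf and exponential covariance forms as in the figure): it builds the two covariance matrices, samples one stationary signal, and compares the sorted eigenvalues of C with the spectrum estimated from its first row.

```python
import numpy as np
from scipy.linalg import toeplitz

P, tau = 512, 10.0
lags = np.arange(P)
C_rbf = toeplitz(np.exp(-0.5 * (lags / tau) ** 2))   # rbf covariance
C_exp = toeplitz(np.exp(-lags / tau))                # exponential covariance

# One stationary input signal x_{i mu} (mu = 1..P) with exponential covariance.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(np.zeros(P), C_exp, method="eigh", check_valid="ignore")

# Power spectrum from the first row c(t) of the Toeplitz matrix:
# C_hat(omega) ~ c(0) + 2 * sum_{t>0} c(t) cos(omega t) (tail wrap-around neglected).
c = C_rbf[0]
spectrum = 2.0 * np.real(np.fft.fft(c)) - c[0]

# By Szego's theorem, the eigenvalue distribution of the Toeplitz covariance
# approaches the distribution of power-spectrum values as P grows.
eigvals = np.sort(np.linalg.eigvalsh(C_rbf))
print(np.max(np.abs(eigvals - np.sort(spectrum))))   # small for tau << P
```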
As shown in Fig 4A for input x and output y with rbf covariance, the squared norm of the optimal synaptic vector w (red curve) is in general a non-monotonic function of α, with its maximum attained at larger values of α as the time constant τ increases. We also show the minimal loss density ϵ and the mean error ϵerr for γ = 0.1. The curves in Fig 4A are the same for any ratio fE: the use of an optimal bias current b cancels any asymmetry between the E and I populations. For finite γ, the minimal average loss ϵ for a given fE decreases as either σE or σI increases. For a given set of parameters fE and γ, the optimal bias b in general depends on the load α and on the structure of the covariance matrix C, as shown in Fig 4B.


Learning temporally structured signals.
A: Minimal loss ϵ, error ϵerr and norm of the weight vector w as a function of the load α for a linear perceptron trained on a time-correlated signal. Covariance matrix C is of rbf type with τ = 2. Parameters: N = 1000, fE = 0.8, γ = 0.1,
Using the same analytical machinery employed for the calculation of the free energy Eq (3), the probability distribution of the typical weight wi can be readily derived. This follows from a variant of the replica trick (Methods: Distribution of synaptic weights) that links the so-called entropic part of f to 〈p(wi)〉, expressed in terms of the saddle-point values of the same (conjugated) overlap parameters employed thus far. Interestingly, the optimal bias b implies that half of the synapses are zero, irrespective of fE and of the properties of the covariance matrix C. The probability density of the synaptic weights is composed of two zero-mean truncated Gaussian densities for the E and I components, plus a finite fraction p0 = 0.5 of zero weights.
We show in Fig 4C the shape of the optimal weight distribution for a linear perceptron with 80% excitatory synapses, trained on exponentially correlated x and y with a ratio σI/σE = 2. It is interesting to note that, in the presence of an optimal external current, neither the means of the Gaussian components nor the fraction of silent synapses depends on the specific properties of the input and output signals.
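These predictions can be checked numerically. For instance, reusing the fit_sign_constrained sketch introduced earlier (an illustrative helper, not the paper's code) on exponentially correlated signals, roughly half of the weights end up at zero, up to finite-size effects and the solver's tolerance.

```python
import numpy as np
from scipy.linalg import toeplitz

# Assumes fit_sign_constrained from the earlier sketch is in scope.
rng = np.random.default_rng(1)
N, P, tau = 600, 400, 5.0
C = toeplitz(np.exp(-np.arange(P) / tau))                 # exponential covariance

X = rng.multivariate_normal(np.zeros(P), C, size=N, method="eigh")  # (N, P) inputs
y = rng.multivariate_normal(np.zeros(P), C, method="eigh")          # correlated output

w, b = fit_sign_constrained(X, y, f_E=0.8, gamma=0.1)
p0 = np.mean(np.abs(w) < 1e-4 * np.abs(w).max())          # heuristic zero threshold
print(f"fraction of silent synapses: {p0:.2f}")           # close to 0.5
```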
This shape of the synaptic distribution has appeared in previous studies of both the binary [8, 11, 13] and the linear perceptron [15]. In the linear case with only excitatory synapses [15], for a fixed bias
The dynamic properties of input/output mappings affect the shape of the weight distribution in a computable manner. As an example, in a linear perceptron with non-negative synapses, the explicit dependence of the variance of the weights on the input and output auto-correlation time constant is shown in Fig 5A for various loads α. Previous work considered an analog perceptron with purely excitatory weights as a model for the graded rate response of Purkinje cells in the cerebellum [15]. In the presence of heterogeneity of synaptic properties across cells, a larger variance in their synaptic distribution is expected to be correlated with high frequency temporal fluctuations in input currents. Analogously, the auto-correlation of the typical signals being processed sets the value of the constant external current that a neuron must receive in order to optimize its capacity.
When the input and output have different covariance matrices Cx ≠ Cy, a joint diagonalization is not possible in general (Methods: EC, Energetic part). We can nevertheless write an expression (Eq (23)) that holds when input and output patterns are defined on a ring (with periodic boundary conditions) and use it as an approximation for the general case. Fig 5B shows good agreement between numerical experiment and theoretical predictions for the error ϵerr and the squared norm of the synaptic weight vector w, when input and output processes have two different time-constants τx and τy.


Input/Output time constants and learning performance.
A: Variance of synaptic weights (fE = 1) for a linear perceptron of dimension N = 1000 trained on rbf-correlated signals with increasing time constant τ for three different values of the load α. Parameters: γ = 0.1,
In the discussion thus far, we assumed independence across the “spatial” index i in the input. Input signals are often confined to a manifold of dimension smaller than N, a feature that can be described by various dimensionality measures, some of which rely on principal component analysis [40, 41]. In order to relax the independence assumption, we build on a framework originally introduced in the theory of spin glasses with orthogonal couplings [42–44] and further developed in the context of adaptive Thouless-Anderson-Palmer (TAP) equations [45–47]. In the TAP formalism, a set of mean-field equations is derived for a given instance of the random couplings (in our case, for a fixed input/output set). In its adaptive generalization [46], the structure of the TAP equations depends on the specific data distribution, in such a way that averaging the equations over the random couplings yields the same results as the replica approach. Here, following previous work in the context of information theory of linear vector channels and binary perceptrons [48–51], we employ an expression for an ensemble of rectangular random matrices and use the replica method to average over the input X and output y.
Let us write the input matrix
We show the validity of the mean-field approach by employing two different data models for the input signals. In the first example, valid for α ≤ 1, all the P vectors ξμ are orthogonal to each other. This yields an eigenvalue distribution of the simple form ρ(λ) = αδ(λ − 1) + (1 − α)δ(λ), for which the function


Sample-based PCA and learning performance.
A: First three components of inputs ξμ with Gaussian singular value spectrum s for two different values of σx (color coded top panels). Parameters: N = 100, P = 300. B: Average error ϵerr for three different singular value spectra of the input sample covariance matrix: orthogonal model and Gaussian model with increasing σx (see main text for definition of σx). Outputs are i.i.d Gaussian. Parameters: N = 1000, fE = 0.8, γ = 0.1,
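The orthogonal input model can be generated explicitly, for instance via a QR decomposition. In the sketch below, the normalization (each pattern of squared norm N, spectrum of (1/N) Σμ ξμ ξμᵀ) is an assumption chosen so that the nonzero eigenvalues equal 1, matching the density quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 1000, 300                                    # alpha = P / N = 0.3

Q, _ = np.linalg.qr(rng.standard_normal((N, P)))    # orthonormal columns, shape (N, P)
Xi = np.sqrt(N) * Q                                 # orthogonal patterns, |xi_mu|^2 = N

C_hat = Xi @ Xi.T / N                               # (N, N) sample covariance
eig = np.linalg.eigvalsh(C_hat)

# P eigenvalues equal to 1 and N - P equal to 0, i.e.
# rho(lambda) = alpha * delta(lambda - 1) + (1 - alpha) * delta(lambda).
print(np.isclose(eig, 1.0).sum(), np.isclose(eig, 0.0).sum())
```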
For N large enough (in practice, N ≳ 500), the statistics of single cases are well captured by the equations for the average case (self-averaging). To obtain a mean-field description of a single case, where a given input matrix X is used, we further assume access to the linear expansion cμ of the output y in the set {vμ} of columns of the V matrix, namely


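In practice, these expansion coefficients are obtained directly from the singular value decomposition of the input matrix. The sketch below uses illustrative sizes, and the cutoff K controlling how many input components carry the output is a hypothetical parameter used only for this example: it builds an output supported on the leading components and recovers its coefficients cν = vν · y.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 500, 400
X = rng.standard_normal((N, P))                   # a given input matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) V^T; Vt is (P, P) here

# Output supported on the leading K right singular vectors (low-dimensional case).
K = 20
c = np.zeros(P)
c[:K] = rng.standard_normal(K)
y = Vt.T @ c                                      # y = sum_nu c_nu v_nu

# Expansion coefficients recovered from the data: c_nu = v_nu . y
c_rec = Vt @ y
print(np.allclose(c_rec, c))                      # True
```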
In Fig 6C, we show results when the dimensionality of the output y along the (temporal) components of the input is modulated by taking
In this work, I investigated the properties of optimal solutions of a linear perceptron with sign-constrained synapses and correlated input/output signals, thus providing a general mean-field theory for constrained regression in the presence of correlations. I treated both the case where the ensemble covariances are known and the case where only the sample covariance is given. The latter approach, built on a rotational invariance assumption, makes it possible to link the regression performance to the input and output statistical properties expressed by principal component analysis.
I provided a general expression for the weight distribution in regularized regression and found that half of the weights are set to zero, irrespective of the fraction of excitatory weights, provided the bias is optimized. A similar shape of the synaptic distribution has been previously described for the binary perceptron with independent inputs at critical capacity, as well as in the theory of compressed sensing [54]. I elucidated the role of the optimal bias current and its relation to the optimal capacity and to the scaling of the solution weights. This analysis also sheds light on the structural properties of synaptic matrices that emerge when target-based methods are used to build biologically plausible functional models of rate and spiking networks.
The theory presented in this work is relevant to the effort of establishing quantitative comparisons between the synaptic profile of neural circuits involved in the temporal processing of dynamic signals, such as the cerebellum [55–57], and normative theories that take into account the temporal and geometrical complexity of computational tasks. Conversely, the construction of progressively more biologically plausible models of neural circuits calls for normative theories of learning in heterogeneous networks, which can be coupled to the dynamic mean-field analysis of E-I separated circuits [24, 25, 58].
As shown in this work, the interaction between the correlational structure of input signals, the synaptic metabolic cost, and the constant external current shapes the distribution of synaptic weights. In this respect, the results presented here offer a first approximation (static linear input-output associations) for accounting for heterogeneities in the fraction of E and I inputs to single cells in local circuits. Even though a heterogeneous linear neuron is capable of memorizing N/2 associations without error for any E/I ratio, the optimal bias does depend on fE, its minimal value being attained for fE = 0.5. The input current in turn sets the neuron’s operating regime and its input/output properties. Moreover, trading memorization accuracy (small output error ϵerr) for smaller weights (small |w|2) could be beneficial when synaptic costs are considered (γ > 0). It is therefore likely that, for an optimality principle of the 80/20 ratio to emerge, representational, dynamical, and metabolic effects should be examined together.
The importance of a theory of constrained regression with realistic input/output statistics goes beyond the realm of neuroscience. Non-negativity is commonly required to provide interpretable results in a wide variety of inference and learning problems. Off-line and on-line least-square estimation methods [59, 60] are also of great practical importance in adaptive control applications, where constraints on the parameter range are usually imposed by physical plausibility.
In this work, I assumed statistical independence between inputs and outputs. For the sake of biological plausibility, it would be interesting to consider more general input-output correlations in regression and binary discrimination tasks. The classical model for such correlations is the so-called teacher-student (TS) setup [61], where the output y is generated by a deterministic parameter-dependent transformation of the input x, with a structure similar to the trained neural architecture. The problem of input/output correlations is deeply related to the issue of optimal random nonlinear expansion both in statistical learning theory [62, 63] and in theoretical neuroscience [41, 64], with a history dating back to the Marr-Albus theory of pattern separation in the cerebellum [65]. In recent work, [28] introduced a promising generalization of TS in which labels are generated via a low-dimensional latent representation, and showed that this model captures the training dynamics of deep networks on real-world datasets.
A general analysis that fully takes into account spatio-temporal correlations in network models could shed light on the emergence of specific network motifs during training. In networks with non-linear dynamics, the mathematical treatment quickly becomes challenging even for simple learning rules. In recent years, interesting work has clarified the relation between learning and network motifs using a variety of mean-field approaches; examples are the study of associative learning in spin models [8] and the analysis of motif dynamics under simple learning rules in spiking networks [66]. Incorporating both the temporal aspects of learning and neural cross-correlations in E-I separated models with realistic input/output structure is an interesting topic for future work.
Using the Replica formalism [67], the free energy density is written as:

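For reference, the generic replica identity underlying this computation can be stated as follows. This is a standard textbook form, with the integration domain restricted to the sign-constrained weights; it is not meant to reproduce the model-specific expression above.

```latex
% Generic replica identity for the quenched free energy density.
% The bias b is treated as in the main text (optimized or integrated over),
% and D denotes the sign-constrained weight domain.
f \;=\; -\lim_{N\to\infty}\frac{1}{\beta N}\,\big\langle \ln Z \big\rangle_{X,y},
\qquad
\big\langle \ln Z \big\rangle \;=\; \lim_{n\to 0}\frac{\big\langle Z^{n}\big\rangle-1}{n},
\qquad
Z \;=\; \int_{\mathcal{D}} \mathrm{d}\mathbf{w}\; e^{-\beta E(\mathbf{w};X,y)} .
```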
To simplify the formulas, we introduce the




The total volume of configurations wa for fixed values of the overlap parameters is given by the entropic part:




In order to compute the energetic part, we first need to evaluate the average with respect to ξ and δy in Eq (11). Performing the two Gaussian integrals we get:






When Cx ≠ Cy, we can derive a similar expression under the assumption of a ring topology in pattern space (corresponding to periodic boundary conditions in the index μ): in this case, both covariance matrices are circulant and can be jointly diagonalized by the discrete Fourier transform [33, 34]. In the main text, we show that the expression

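A quick numerical check of this joint diagonalization is straightforward; the covariance shapes and sizes below are illustrative, and the only ingredients are the circular distance on the ring and the unitary DFT matrix.

```python
import numpy as np
from scipy.linalg import circulant

P, tau_x, tau_y = 64, 5.0, 2.0
d = np.minimum(np.arange(P), P - np.arange(P))    # circular distance on the ring
Cx = circulant(np.exp(-d / tau_x))                # circulant input covariance
Cy = circulant(np.exp(-d / tau_y))                # circulant output covariance

F = np.fft.fft(np.eye(P)) / np.sqrt(P)            # unitary DFT matrix
for C in (Cx, Cy):
    D = F @ C @ F.conj().T                        # diagonal up to round-off
    off_diag = np.abs(D - np.diag(np.diag(D))).max()
    print(f"max off-diagonal magnitude: {off_diag:.1e}")
```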
All in all, the free energy density in the saddle-point approximation is:

The saddle-point equations stemming from the entropic part can be written as:





In the β → ∞ limit, the uniqueness of the solution for γ > 0 implies that Δqw → 0. We therefore use the following scalings for the order parameters:












The synaptic weight distribution appearing in Eqs (28) and (29) can be obtained using a variant of the replica trick [6, 67]. Using the identity Z^{−1} = lim_{n→0} Z^{n−1}, the density of excitatory weights can be written as:






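Schematically, the quantity computed here is the disorder-averaged single-weight density; in generic form (not the model-specific result, which follows from the saddle-point order parameters above) the replica manipulation reads:

```latex
% Generic replica representation of the typical excitatory weight density:
% the delta function is inserted in one replica and the n -> 0 limit removes
% the normalization Z^{-1}.
\langle p_E(w)\rangle
  \;=\; \Big\langle \frac{1}{N_E}\sum_{i\in E}
        \big\langle \delta(w-w_i) \big\rangle_{\beta} \Big\rangle_{X,y},
\qquad
\big\langle \delta(w-w_i) \big\rangle_{\beta}
  \;=\; \lim_{n\to 0}\int \prod_{a=1}^{n}\mathrm{d}\mathbf{w}^{a}\;
        \delta\!\big(w-w_i^{1}\big)\,
        e^{-\beta\sum_{a=1}^{n} E(\mathbf{w}^{a})} .
```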
For the exponential covariance


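For completeness, assuming the discrete-time convention C(t) = σ² a^{|t|} with a = e^{−1/τ} (the normalization is an assumption; the text may use a slightly different one), the corresponding power spectrum is the discrete-time Lorentzian:

```latex
% Power spectrum of a discrete-time exponential covariance (assumed convention).
\hat{C}(\omega)
  \;=\; \sigma^{2}\sum_{t=-\infty}^{\infty} a^{|t|}\,e^{-i\omega t}
  \;=\; \sigma^{2}\,\frac{1-a^{2}}{1-2a\cos\omega+a^{2}},
\qquad a = e^{-1/\tau},\quad \omega\in[-\pi,\pi).
```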
Also in the case of a sample covariance matrix, we are interested in statistically structured inputs and outputs. An independent average over x and y would result in a simple dependence on the variance of y in the energetic part. To capture the geometric dependence between x and y, we thus extend the calculations in [50, 51] to the case where the linear expansion of yμ on the right singular vectors V⋅μ is known, by taking δyμ = ∑ν Vμν cν.
In order to compute the replicated cumulant generating function Eq (11), we again introduce overlap parameters
Using again the expressions














The final expression for the free energy density











Setting either K = 0 or λy = 0 reverts to the i.i.d. output case. In the special case of i.i.d. inputs, the eigenvalue distribution is Marchenko-Pastur

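For reference, under the normalization used for the orthogonal model above (sample covariance (1/N) Σμ xμ xμᵀ with unit-variance i.i.d. entries, so that the nonzero eigenvalues are of order one; the paper's convention may differ), the Marchenko-Pastur density takes the standard form:

```latex
% Marchenko-Pastur density under the assumed normalization, with alpha = P / N.
\rho(\lambda)
  \;=\; (1-\alpha)_{+}\,\delta(\lambda)
        \;+\; \frac{\sqrt{(\lambda_{+}-\lambda)(\lambda-\lambda_{-})}}{2\pi\lambda}\,
        \mathbf{1}_{[\lambda_{-},\,\lambda_{+}]}(\lambda),
\qquad
\lambda_{\pm} = \big(1 \pm \sqrt{\alpha}\,\big)^{2}.
```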
Let us also note that, in the simple unconstrained case, taking for simplicity





The author would like to thank L.F. Abbott and Francesco Fumarola for constructive criticism of an earlier version of the manuscript.