Optimal learning with excitatory and inhibitory synapses
Abstract

Characterizing the relation between weight structure and input/output statistics is fundamental for understanding the computational capabilities of neural circuits. In this work, I study the problem of storing associations between analog signals in the presence of correlations, using methods from statistical mechanics. I characterize the typical learning performance in terms of the power spectrum of random input and output processes. I show that optimal synaptic weight configurations reach a capacity of 0.5 for any fraction of excitatory to inhibitory weights and have a peculiar synaptic distribution with a finite fraction of silent synapses. I further provide a link between typical learning performance and principal components analysis in single cases. These results may shed light on the synaptic profile of brain circuits, such as cerebellar structures, that are thought to engage in processing time-dependent signals and performing on-line prediction.

Author summary

A general analysis of learning with biological synaptic constraints in the presence of statistically structured signals is lacking. Here, analytical techniques from statistical mechanics are leveraged to analyze the storage of associations between analog inputs and outputs with excitatory and inhibitory synaptic weights. The performance of the linear perceptron is characterized, and a link is provided between the weight distribution and the correlations of the input/output signals. This formalism can be used to predict the typical properties of perceptron solutions for single learning instances in terms of the principal component analysis of input and output data. This study provides a mean-field theory for sign-constrained regression of practical importance in neuroscience as well as in adaptive control applications.

Introduction

At the most basic level, neuronal circuits are characterized by their subdivision into excitatory and inhibitory populations, a principle called Dale’s law. Even though the precise functional role of Dale’s law is not yet understood, synaptic sign constraints are pivotal in constructing biologically plausible models of synaptic plasticity in the brain [1–5]. The properties of synaptic couplings strongly impact the dynamics and response of neural circuits, thus playing a crucial role in shaping their computational capabilities. It has been argued that the statistics of synaptic weights in neural circuits could reflect a principle of optimality for information storage, both at the level of single-neuron weight distributions [6, 7] and of inter-cell synaptic correlations [8] (e.g. the overabundance of reciprocal connections). A number of theoretical studies, stemming from the pioneering Gardner approach [9], have investigated the computational capabilities of binary [10–13] and analog perceptrons [14, 15] in stylized classification and memorization tasks, using synthetic data. With some exceptions mentioned in the following, these studies considered random uncorrelated inputs and outputs, a usual approach in statistical learning theory. One interesting theoretical prediction is that non-negativity constraints imply that a finite fraction of synaptic weights is set to zero at critical capacity [6, 15, 16], a feature consistent with the experimental synaptic weight distributions observed in some brain areas, e.g. the input fibers to Purkinje cells in the cerebellum.

The need to understand how the interaction between excitatory and inhibitory synapses mediates plasticity and dynamic homeostasis [17, 18] calls for the study of heterogeneous multi-population feed-forward and recurrent models. A plethora of mechanisms for the excitatory-inhibitory (E-I) balance of input currents onto a neuron have been proposed [19, 20]. At the computational level, it has recently been shown that a peculiar scaling of excitation and inhibition with network size, originally introduced to account for the high variability of neural firing activity [21–27], carries the computational advantage of noise robustness and stability of memory states in associative memory networks [13].

Analyzing training and generalization performance in feed-forward and recurrent networks as a function of the statistical and geometrical structure of a task remains an open problem in both computational neuroscience and statistical learning theory [28–32]. This calls for statistical models of the low-dimensional structure of data that are at the same time expressive and amenable to mathematical analysis. A few classical studies investigated the effect of “semantic” (among input patterns) and spatial (among neural units) correlations in random classification and memory retrieval [33–35]. The latter are important in the construction of associative memory networks for place-cell formation in the hippocampal complex [36].

For reasons of mathematical tractability, the vast majority of analytical studies of binary and analog perceptron models focused on the case where both inputs and outputs are independent and identically distributed. In this work, I relax this assumption and study optimal learning of input/output associations with realistic statistics using a linear perceptron with heterogeneous synaptic weights. I introduce a mean-field theory of an analog perceptron with weight regularization and sign constraints, considering two different statistical models for input and output correlations. I derive its critical capacity in a random association task and study the statistical properties of the optimal synaptic weight vector across a diverse range of parameters.

This work is organized as follows. In the first section, I introduce the framework and provide the general definitions for the problem. I first consider a model of temporal (or, equivalently, “semantic”) correlations across inputs and output patterns, assuming statistical independence across neurons. I show that optimal solutions are insensitive to the fraction of E and I weights, as long as the external bias is learned. I derive the weight distribution and show that it is characterized by a finite fraction of zero weights also in the general case of E-I constraints and correlated signals. The assumption of independence is subsequently relaxed in order to provide a theory that depends on the spectrum of the sample covariance matrix and the dimensionality of the output signal along the principal components of the input. The implications of these results are discussed in the final section.

Results

Mean-field theory with correlations

Consider the problem of linearly mapping a set of correlated inputs x_i^μ, with i = 1, …, N and μ = 1, …, P, from N_E = f_E N excitatory (E) and N_I = (1 − f_E)N inhibitory (I) neurons, onto an output y^μ using a synaptic vector w, in the presence of a learnable constant bias current b (Fig 1). To account for different statistical properties of E and I input rates, we write the elements of the input matrix as (X)_{iμ} ≡ x_i^μ = x̄_i + σ_i ξ_i^μ, with x̄_i = x̄_E for i ≤ f_E N and x̄_i = x̄_I for i > f_E N, and analogously for σ_i. At this stage, the quantities ξ_i^μ have unit variance and are uncorrelated across neurons: ⟨ξ_i^μ ξ_j^ν⟩ = δ_{ij} C_{μν}. In the following, we refer to x and y as signals and to μ as a time index, although we consider general “semantic” correlations across the patterns x^μ [34]. The output signal has average ⟨y^μ⟩ = ȳ and variance ⟨(y^μ − ȳ)²⟩ = σ_y². We initially consider output signals y^μ with the same temporal correlations as the input, namely ⟨δy^μ δy^ν⟩ = C_{μν}, where y^μ = ȳ + σ_y δy^μ.

Fig 1. Schematic of the learning problem.

A linear perceptron receives N correlated signals (the input rates of pre-synaptic neurons) x_i^μ and maps them onto the output y^μ through N_E = f_E N excitatory and N_I = (1 − f_E)N inhibitory plastic weights w_i, plus an additional bias current b.

For a given input-output set, we are faced with the problem of minimizing the following regression loss (energy) function:

E = \frac{1}{2} \sum_{\mu=1}^{P} \Bigl( \sum_{i=1}^{N} w_i x_i^\mu + b - y^\mu \Bigr)^2 + \frac{\gamma N}{2} \sum_{i=1}^{N} w_i^2, \qquad (1)

with w_i > 0 for i ≤ f_E N and w_i < 0 otherwise. The rationale for using a regularization term lies not only in alleviating the ill-conditioning due to input correlations, but also in controlling the metabolic cost of synaptic plasticity and transmission. Preliminary numerical experiments showed that the typical vector w solving this sign-constrained least-squares problem has a squared norm Σ_{i=1}^N w_i² = O(1), irrespective of the L2 regularization, as in the special case of i.i.d. input/output and non-negative synaptic weights [15]. Synaptic weights w_i are thus of O(1/√N), hence the scaling of the regularization term and of the bias current b = I√N. In order to consider a well-defined N → ∞ limit for E and for the spectrum of the matrix C, we take P = αN, with α called the load, as is customary in mean-field analyses of perceptron problems [9].
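As a concrete illustration of this setup (a minimal numerical sketch, not code from the original study), the sign-constrained ridge problem in Eq (1) with a learnable bias can be solved with a generic bounded least-squares routine; here scipy.optimize.lsq_linear is used, the L2 term is implemented by augmenting the design matrix, and all parameter values are illustrative choices.

```python
# Minimal sketch: minimize Eq (1) under E/I sign constraints with a learnable bias.
# Assumes the loss as reconstructed above (ridge coefficient gamma*N/2); illustrative parameters.
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
N, P, fE, gamma = 200, 80, 0.8, 0.1              # load alpha = P / N = 0.4
NE = int(fE * N)

X = rng.normal(1.0, 1.0, size=(N, P))            # i.i.d. inputs, mean 1, unit variance
y = rng.normal(1.0, 1.0, size=P)                 # i.i.d. output

# Unknowns are [w_1 ... w_N, b]; the last N rows implement the (gamma*N/2)*|w|^2 term.
A = np.vstack([np.hstack([X.T, np.ones((P, 1))]),
               np.hstack([np.sqrt(gamma * N) * np.eye(N), np.zeros((N, 1))])])
t = np.concatenate([y, np.zeros(N)])

lb = np.concatenate([np.zeros(NE), -np.inf * np.ones(N - NE), [-np.inf]])   # w_E >= 0, w_I <= 0
ub = np.concatenate([np.inf * np.ones(NE), np.zeros(N - NE), [np.inf]])     # bias b unconstrained

sol = lsq_linear(A, t, bounds=(lb, ub)).x
w, b = sol[:N], sol[-1]
eps = (0.5 * np.sum((X.T @ w + b - y) ** 2) + 0.5 * gamma * N * np.sum(w**2)) / N
print(f"loss density = {eps:.4f}, |w|^2 = {np.sum(w**2):.4f}, "
      f"silent fraction = {np.mean(np.abs(w) < 1e-6):.2f}")
```

The printed squared norm and fraction of silent synapses can be compared with the mean-field predictions discussed below.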

Optimizing with respect to the bias b naturally yields solutions w for which

\sum_{c \in \{E, I\}} N_c \, \bar{w}_c \, \bar{x}_c + b \simeq \bar{y}, \qquad (2)

where we call w̄_c = (1/N_c) Σ_{i∈c} w_i = O(1/√N) the average excitatory and inhibitory weight, with c ∈ {E, I}. We call this property balance, in that the same scaling of the mean E and I contributions is used in the balanced-state theory of neural circuits [21, 22, 24].

In order to derive a mean-field description of the typical properties of the learned synaptic vector w, we employ a statistical mechanics framework in which the minimizer of E is evaluated after averaging over all possible realizations of the input matrix X and the output y. To do so, we compute the free energy density

f = -\lim_{N \to \infty} \frac{1}{\beta N} \langle \log Z \rangle_{x,y}, \qquad (3)

where Z = ∫ dμ(w) e^{−βE} is the so-called partition function and the measure dμ(w) = Π_{i∈E} θ(w_i) dw_i Π_{k∈I} θ(−w_k) dw_k implements the sign constraints on the synaptic weight vector w. The brackets in Eq (3) stand for the quenched average over all the quantities x_i^μ and y^μ, and the inverse temperature β will allow us to select weight configurations w that minimize the energy E. The free energy density f acts as a generating function from which all the statistical quantities of interest can be calculated by appropriate differentiation and by taking the β → ∞ limit. In particular, we will be interested in the (normalized) average loss ϵ = ⟨E⟩/N and the error ϵ_err = (1/2N) ⟨|Xᵀw + b − y|²⟩, corresponding to the average value of the first term in Eq (1), where b is a P-dimensional vector containing b in every element. The average in Eq (3) can be computed in the N → ∞ limit with the help of the replica method, an analytical continuation technique that entails the introduction of a number n of formal replicas of the vector w. A general expression for f can be obtained in the large-N limit using the saddle-point method. The crucial quantity in our derivation is the (replicated) cumulant generating function Z_{ξ,δy} for the (mean-removed) input x and output y, which can be easily expressed as a function of the eigenvalues λ_μ, μ = 1, …, αN of the covariance matrix C, plus a set of order parameters to be evaluated self-consistently (Methods).

Critical capacity

The existence of weight vectors w with a given value of the regression loss E in the error regime (ϵ > 0) is described by the so-called overlap order parameter Δq̃_w. In the replica-based derivation of the mean-field theory, overlap parameters are introduced with the purpose of decoupling the w_i over the index i, and they represent the scalar product between two different configurations of the weights w (Methods: Replica formalism: ensemble covariance matrix (EC)). For finite β, the quantity Δq_w = Δq̃_w/β represents the variance of the synaptic weights across different solutions. In the asymptotic limit β → ∞ of Eq (3), a simple saddle-point equation for Δq̃_w can be derived when b is chosen to minimize Eq (1):

where ρ(λ) is the distribution of eigenvalues of C.

In the absence of weight regularization (γ = 0), we define the critical capacity α_c as the maximal load α = P/N for which the patterns x^μ can be correctly mapped onto their outputs y^μ with zero error. When the synaptic weights are not sign-constrained, the critical capacity is obviously α_c = 1, since the matrix X is typically full rank. In the sign-constrained case, α_c is found to be the minimal value of α such that Eq (4) is satisfied for 0 < Δq̃_w < +∞. Noting that the left-hand side of Eq (4) is a non-decreasing function of Δq̃_w with an asymptote at α, the order parameter Δq̃_w goes to +∞ as the critical capacity is approached from the right. We thus find for γ = 0 the surprisingly simple result:

\alpha_c = \frac{1}{2}. \qquad (5)

As shown in Fig 2A for the case of i.i.d. x and y, the loss exhibits a sharp increase at α = 0.5. This holds irrespective of the structure of the covariance matrix C and of the ratio of excitatory weights f_E. In Fig 2A, we also show the average minimal loss ϵ for increasing values of the regularization parameter γ.
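The location of the capacity can be checked numerically by sweeping the load at γ = 0; the sketch below (illustrative sizes, same solver strategy as above, not the simulation code used for Fig 2) shows the onset of a nonzero training error around α = 0.5.

```python
# Sketch: training error versus load alpha at gamma = 0; a nonzero error is expected
# to appear around alpha = 0.5 (finite-size effects smooth the transition).
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(1)
N, fE = 400, 0.8
NE = int(fE * N)
lb = np.concatenate([np.zeros(NE), -np.inf * np.ones(N - NE), [-np.inf]])
ub = np.concatenate([np.inf * np.ones(NE), np.zeros(N - NE), [np.inf]])

for alpha in [0.3, 0.45, 0.55, 0.7]:
    P = int(alpha * N)
    X = rng.normal(1.0, 1.0, size=(N, P))
    y = rng.normal(1.0, 1.0, size=P)
    A = np.hstack([X.T, np.ones((P, 1))])        # last column is the learnable bias
    res = lsq_linear(A, y, bounds=(lb, ub), tol=1e-12)
    eps_err = 0.5 * np.sum((A @ res.x - y) ** 2) / N
    print(f"alpha = {alpha:.2f}: training error density = {eps_err:.2e}")
```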

In [15], the authors showed that, in the case with excitatory synapses only and uncorrelated inputs and outputs, α_c approaches 0.5 in the limit where the quantity σ_y² x̄_E²/(I² σ_E²) goes to zero, and analyzed which conditions on the input and output statistics maximize capacity. Here we take a complementary approach, in which the x and y statistics are fixed and capacity is optimized within the error regime, so that the optimal bias I√N is well defined in terms of minimizing ⟨E⟩ at any load α. The bias optimization leads to a massive simplification of the saddle-point equations and makes the results independent of the E/I ratio and of the input/output statistics (Methods: EC, Saddle-point equations). One may observe that, in the particular case studied by [15], α_c is maximal for very large I, due to the divergence of the norm of w at critical capacity for an optimal bias in the absence of regularization.

Fig 2. Critical capacity and weight balance.

A: Average loss ϵ for a linear perceptron with f_E = 0.8 positive synaptic weights in the case of i.i.d. input X and output y, for increasing values of the regularization γ. Parameters: N = 1000, x̄_E = x̄_I = σ_E = σ_I = ȳ = σ_y = 1. Each point is an average across 50 samples. Full lines show the theoretical results. B: Mean-field component h̃ (left axis, purple) and weight-input correlation c (right axis, red) for increasing dimension N, in the case where the bias current b = I√N is either learned (I optimal) or fixed at the outset (I = −1), for f_E = 1, γ = 0.1, α = 0.8. Inputs X and output y are time-correlated with un-normalized Gaussian covariance C, τ = 10 (see text). The remaining parameters are as in A. The asymptotic value h̃ = ȳ = 1 is highlighted by the purple dotted line and the value c = 0 by the red dotted line, as guides for the eye.

The independence of our results from the E/I ratio for an optimal bias current signals a local gauge invariance, as observed by [37, 38] for a sign-constrained binary perceptron. Indeed, calling g_i = sign w_i, we can write the mean-removed output as Σ_{i=1}^N g_i |w_i| σ_i ξ_i^μ and redefine the ξ’s as g_i ξ_i^μ, without changing their occurrence probability. This establishes an equivalence with a linear perceptron with non-negative weights (see [37] for more details), once the mean contribution has been removed. Any residual dependence of α_c or ϵ on external parameters must therefore be ascribed to the volume of weights satisfying Eq (2), for a sub-optimal external current b.

For a generic value of the bias current b, there are strong deviations from the condition in Eq (2). In Fig 2B, we compare the value of the average output ȳ with h̃ ≡ Σ_{c∈{E,I}} N_c w̄_c x̄_c + b, and also plot the residual term c = (1/P) Σ_μ Σ_i δw_i x_i^μ, where we decomposed the weight vector components as w_i = w̄_c + δw_i for c ∈ {E, I}. The quantity c measures the weight-rate correlations that are responsible for the cancellation of the O(√N) bias.

The deviation from Eq (2), shown here for a rapidly decaying covariance of the form C_μν = e^{−|μ−ν|²/(2τ²)}, has been previously described in the context of a target-based learning algorithm used to build E-I-separated rate and spiking models of neural circuits capable of solving input/output tasks [3]. In this approach, a randomly initialized recurrent network n_T is driven by a low-dimensional signal z. Its currents are then used as targets to train the synaptic couplings of a second (rate or spiking) network n_S, in such a way that the desired output z can later be linearly decoded from the self-sustained activity of n_S. Each neuron of n_S has to independently learn an input/output mapping from firing rates x to currents y, using an on-line sign-constrained least-squares method. In the presence of an L2 regularization and a constant external current of O(√N), the on-line learning method typically converges to a solution for the recurrent synaptic weights for which Eq (2) does not hold. As also shown in [3], in the peculiar case of a self-sustained periodic dynamics (in which case the off-diagonal terms of the covariance matrix C_μν do not vanish for large μ or ν), the two contributions h̃ and c scale approximately like √N and cancel each other to produce an O(1) total average output ȳ = h̃ + c. In the effort to build heterogeneous functional network models, the emergence of synaptic connectivity compatible with the balanced scaling thus depends on the statistics of the incoming currents. Ad-hoc regularization can be avoided by adjusting the external current onto each neuron.

Power spectrum and synaptic distribution

The theory developed thus far applies to a generic covariance matrix C. To connect the spectral properties of C with the signal dynamics, we further assume the x_i to be N independent stationary discrete-time processes. In this case, C_μν = C(μ − ν) is a matrix of Toeplitz type [39], leading to the following expression for the average minimal loss density in the N → ∞ limit:

with Δq̃_w given by Eq (4). The function λ(ϕ) can be computed exactly in some cases (Methods: Power spectrum and synaptic distribution) and corresponds to the average power spectrum of the x and y stochastic processes. Fig 3 shows two representative input signals with Gaussian and exponential covariance matrices C (Fig 3A) and a comparison between the average power spectrum of the input and the analytical results for the eigenvalue spectrum of the matrix C (Fig 3B). From now on, we use the terms Gaussian and rbf (radial basis function) interchangeably to denote the un-normalized Gaussian covariance C_μν = e^{−(μ−ν)²/(2τ²)}.
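The correspondence between the eigenvalues of a Toeplitz covariance and the power spectrum can be checked directly; the following sketch (illustrative parameters, not the code behind Fig 3) compares the eigenvalues of an rbf Toeplitz matrix with the discrete Fourier transform of its wrapped covariance sequence, i.e. the circulant approximation underlying the Szegő-type argument.

```python
# Sketch: eigenvalues of the rbf Toeplitz covariance vs. its power spectrum (cf. Fig 3B).
import numpy as np
from scipy.linalg import toeplitz

P, tau = 1000, 10.0
lags = np.arange(P)
C = toeplitz(np.exp(-lags**2 / (2 * tau**2)))        # rbf covariance C(mu - nu)
eig = np.sort(np.linalg.eigvalsh(C))[::-1]

# Circulant (wrap-around) version: its eigenvalues are the DFT of the covariance
# sequence, i.e. the power spectrum evaluated on the discrete frequency grid.
c_wrap = np.exp(-np.minimum(lags, P - lags)**2 / (2 * tau**2))
spec = np.sort(np.fft.fft(c_wrap).real)[::-1]

print("largest eigenvalues :", eig[:3])
print("spectrum values     :", spec[:3])
```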

Fig 3. Eigenvalues of C and Fourier spectrum.

A: Examples of excitatory input signals x_i^μ (i ∈ E) with two different covariance matrices C. Top: rbf covariance, τ = 10. Bottom: exponential covariance C_μν = e^{−|μ−ν|/τ}, τ = 10. Parameters: x̄_E = 1, σ_E = 0.3. B: Theoretical eigenvalue spectrum of C with τ = 10 versus the average power spectrum for positive wave numbers across N = 2000 independent processes with P = 1000 time steps.

As shown in Fig 4A for the case of input x and output y with rbf covariance, the squared norm of the optimal synaptic vector w (red curve) is in general a non-monotonic function of α, its maximum being attained at larger values of α as the time constant τ increases. We also show the minimal loss density ϵ and the mean error ϵ_err for γ = 0.1. The curves in Fig 4A are the same for any ratio f_E: the use of an optimal bias current b cancels any asymmetry between the E and I populations. For finite γ, the minimal average loss ϵ for a given f_E decreases as either σ_E or σ_I increases. For a given set of parameters f_E and γ, the optimal bias b will in general depend on the load α and on the structure of the covariance matrix C, as shown in Fig 4B.

Fig 4. Learning temporally structured signals.

A: Minimal loss ϵ, error ϵ_err and norm of the weight vector w as a function of the load α for a linear perceptron trained on a time-correlated signal. The covariance matrix C is of rbf type with τ = 2. Parameters: N = 1000, f_E = 0.8, γ = 0.1, x̄_E = x̄_I = σ_E = σ_I = ȳ = σ_y = 1. B: Optimal bias b for the two sets of signals with rbf (black curve) and exponential (yellow curve) covariance C, with τ = 2. Theoretical curves show the value I√N + ȳ, where I has been computed from the saddle-point equations (Methods: EC, Saddle-point equations). Parameters as in A. Each point in A and B is an average across 50 samples. C: Probability density of the non-zero synaptic weights √N w_i for a linear perceptron with N = 1000 and a fraction f_E = 0.8 of excitatory weights, trained on P = 600 exponentially correlated inputs x and outputs y. The δ function at zero is omitted for better visualization. Parameters: τ = 10, γ = 0.1, x̄_E = x̄_I = 1, σ_I = 2σ_E = 0.4. The histogram is an average across 50 realizations of the input/output signals. Inset: full histogram of the synaptic weights √N w_i.

Using the same analytical machinery employed for the calculation of the free energy, Eq (3), the probability distribution of the typical weight w_i can easily be derived. This can be seen by employing a variant of the replica trick (Methods: Distribution of synaptic weights) that links the so-called entropic part of f to ⟨p(w_i)⟩, expressed in terms of the saddle-point values of the same (conjugate) overlap parameters employed thus far. Interestingly, the optimal bias b implies that half of the synapses are zero, irrespective of f_E and of the properties of the covariance matrix C. The probability density of the synaptic weights is composed of two zero-mean truncated Gaussian densities for the E and I components, plus a finite fraction p_0 = 0.5 of zero weights.

We show in Fig 4C the shape of the optimal weight distribution for a linear perceptron with 80% excitatory synapses, trained on exponentially correlated x and y with a ratio σ_I/σ_E = 2. It is interesting to note that, in the presence of an optimal external current, neither the means of the Gaussian components nor the fraction of silent synapses depends on the specific properties of the input and output signals.
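The claim that roughly half of the E and half of the I synapses are silent when the bias is learned can be checked numerically; the sketch below uses i.i.d. inputs with σ_I = 2σ_E and the same bounded least-squares strategy as above (illustrative parameters, not the simulation code used for Fig 4C).

```python
# Sketch: empirical weight distribution with a learned bias; the fraction of silent
# synapses in each population is expected to be close to 0.5.
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(2)
N, P, fE, gamma = 500, 300, 0.8, 0.1
NE = int(fE * N)
sigma = np.concatenate([0.2 * np.ones(NE), 0.4 * np.ones(N - NE)])   # sigma_I = 2 sigma_E
X = 1.0 + sigma[:, None] * rng.normal(size=(N, P))
y = 1.0 + rng.normal(size=P)

A = np.vstack([np.hstack([X.T, np.ones((P, 1))]),
               np.hstack([np.sqrt(gamma * N) * np.eye(N), np.zeros((N, 1))])])
t = np.concatenate([y, np.zeros(N)])
lb = np.concatenate([np.zeros(NE), -np.inf * np.ones(N - NE), [-np.inf]])
ub = np.concatenate([np.inf * np.ones(NE), np.zeros(N - NE), [np.inf]])
w = lsq_linear(A, t, bounds=(lb, ub)).x[:N]

silent = np.abs(w) < 1e-6
print("silent fraction (E, I):", silent[:NE].mean(), silent[NE:].mean())
print("std of nonzero sqrt(N)*w (E, I):",
      np.std(np.sqrt(N) * w[:NE][~silent[:NE]]),
      np.std(np.sqrt(N) * w[NE:][~silent[NE:]]))
```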

The shape of the synaptic distribution appeared in previous studies of both the binary [8, 11, 13] and the linear perceptron [15]. In the linear case with only excitatory synapses [15], for a fixed bias b, the fraction of zero E weights is larger than 0.5 at criticality, and it generally depends on the input parameters and on the load in the error region α ≥ α_c. Let us also mention that a similar property is apparent in the binary perceptron, where the scale of the typical solutions is set by the robustness [13] to input and output noise. For weights w_i = O(1/√N), the sparsity of critical solutions generically depends on the properties of the E and I inputs. For weights of O(1/N), robust solutions have a fraction of zero E weights generically larger than 0.5 [6, 11]. When inhibitory synapses are added, their weights are less sparse [11]. Interestingly, in the case without robustness, half of the E and I weights are zero at critical capacity for all f_E ≥ 0.5.

The dynamic properties of the input/output mapping affect the shape of the weight distribution in a computable manner. As an example, for a linear perceptron with non-negative synapses, the explicit dependence of the weight variance on the input and output auto-correlation time constant is shown in Fig 5A for various loads α. Previous work considered an analog perceptron with purely excitatory weights as a model for the graded rate response of Purkinje cells in the cerebellum [15]. In the presence of heterogeneous synaptic properties across cells, a larger variance of the synaptic distribution is expected to correlate with high-frequency temporal fluctuations of the input currents. Analogously, the auto-correlation of the typical signals being processed sets the value of the constant external current that a neuron must receive in order to optimize its capacity.

When the input and output have different covariance matrices C_x ≠ C_y, a joint diagonalization is not possible in general (Methods: EC, Energetic part). We can nevertheless write an expression (Eq (23)) that holds when input and output patterns are defined on a ring (with periodic boundary conditions) and use it as an approximation for the general case. Fig 5B shows good agreement between numerical experiments and theoretical predictions for the error ϵ_err and the squared norm of the synaptic weight vector w, when the input and output processes have two different time constants τ_x and τ_y.

Fig 5. Input/output time constants and learning performance.

A: Variance of the synaptic weights (f_E = 1) for a linear perceptron of dimension N = 1000 trained on rbf-correlated signals with increasing time constant τ, for three different values of the load α. Parameters: γ = 0.1, x̄_E = x̄_I = σ_E = σ_I = ȳ = σ_y = 1. B: Average error ϵ_err in the case where input and output signals have two different covariance matrices, for increasing time constant τ_y of the output signal y. Parameters: N = 1000, f_E = 0.8, γ = 0.1, x̄_E = x̄_I = ȳ = σ_y = 1, σ_I = 2σ_E = 0.6, C_x rbf with τ_x = 1, C_y rbf with various values of τ_y. Inset: norm of the weight vector w. Full lines show analytical results. Points are averages across 50 samples.

Sample covariance and dimensionality

In the discussion thus far, we assumed independence across the “spatial” index i of the input. It is often the case that input signals are confined to a manifold of dimension smaller than N, a feature that can be described by various dimensionality measures, some of which rely on principal component analysis [40, 41]. In order to relax the independence assumption, we build on a framework originally introduced in the theory of spin glasses with orthogonal couplings [42–44] and further developed in the context of adaptive Thouless-Anderson-Palmer (TAP) equations [45–47]. In the TAP formalism, a set of mean-field equations is derived for a given instance of the random couplings (in our case, for a fixed input/output set). In its adaptive generalization [46], the structure of the TAP equations depends on the specific data distribution, in such a way that averaging the equations over the random couplings yields the same results as the replica approach. Here, following previous work in the context of the information theory of linear vector channels and binary perceptrons [48–51], we employ an expression for an ensemble of rectangular random matrices and use the replica method to average over the input X and the output y.

Let us write the input matrix as (X)_{iμ} = x̄_i + σ_i ξ_i^μ, with ξ = USVᵀ, S being the matrix of singular values. To analyze the properties of the typical case, we start from a generic singular value distribution of S and consider i.i.d. outputs y^μ. In calculating the cumulant generating function Z_{ξ,δy}, we perform a homogeneous average over the left and right principal components U and V (Methods: SC, Energetic part). Calling ρ_{ξξᵀ}(λ) the eigenvalue distribution of the sample covariance matrix ξξᵀ, we can express Z_{ξ,δy} in terms of a function G_{ξ,δy} of an enlarged set of overlap parameters, which depends on the so-called Shannon transform [52] of ρ_{ξξᵀ}(λ), a quantity that measures the capacity of linear vector channels. The resulting self-consistent equations, which describe the statistical properties of the synaptic weights w_i, are expressed in terms of the Stieltjes transform of ρ_{ξξᵀ}(λ), an important tool in random matrix theory [53] (Methods: SC, Saddle-point equations).
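For reference, the two transforms used here are simple spectral averages; the sketch below evaluates them on the empirical eigenvalues of a sample covariance matrix (the ξξᵀ/N normalization and the argument values are illustrative choices, following the standard definitions in [52, 53]).

```python
# Sketch: Shannon and Stieltjes transforms of an empirical eigenvalue spectrum
# (standard definitions); normalization and arguments are illustrative.
import numpy as np

rng = np.random.default_rng(3)
N, P = 500, 1000
xi = rng.normal(size=(N, P))
lam = np.linalg.eigvalsh(xi @ xi.T / N)          # eigenvalues of the sample covariance

def shannon_transform(lam, g):
    """V(g) = < log(1 + g * lambda) >."""
    return np.mean(np.log1p(g * lam))

def stieltjes_transform(lam, z):
    """S(z) = < 1 / (lambda - z) >, for z outside the support (here z < 0)."""
    return np.mean(1.0 / (lam - z))

print("Shannon  V(1)   =", shannon_transform(lam, 1.0))
print("Stieltjes S(-1) =", stieltjes_transform(lam, -1.0))
```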

We show the validity of the mean-field approach by employing two different data models for the input signals. In the first example, valid for α ≤ 1, all the P vectors ξ^μ are orthogonal to each other. This yields an eigenvalue distribution of the simple form ρ(λ) = αδ(λ − 1) + (1 − α)δ(λ), for which the function G_{ξ,δy} can be computed explicitly [51]. Additionally, we use a synthetic model in which we explicitly set the singular value spectrum of ξ to be s(α) = χ e^{−α²/(2σ_x²)}, with χ a normalization factor ensuring that the matrix ξ has unit variance. The shape of the singular value spectrum s controls the spread of the data points ξ^μ in the N-dimensional input space, as shown in Fig 6A. As shown in Fig 6B for i.i.d. Gaussian outputs, learning degrades as σ_x decreases, since the inputs tend to be confined to a lower-dimensional subspace rather than being spread equally along the input dimensions.
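One possible construction of this synthetic model is sketched below (illustrative sizes; the normalization χ is enforced numerically): Haar-random U and V are drawn and the Gaussian-decaying singular value spectrum is imposed explicitly. The participation ratio of the PCA variances is printed as one common PCA-based dimensionality measure.

```python
# Sketch: inputs xi = U S V^T with a prescribed Gaussian-decaying singular value spectrum.
import numpy as np
from scipy.stats import ortho_group

N, P, sigma_x = 200, 300, 0.3
U = ortho_group.rvs(N, random_state=1)           # Haar-random left singular vectors
V = ortho_group.rvs(P, random_state=2)           # Haar-random right singular vectors

r = min(N, P)
a = np.arange(r) / N                             # index fraction along the spectrum
s = np.exp(-a**2 / (2 * sigma_x**2))
s *= np.sqrt(N * P / np.sum(s**2))               # chi: unit variance of the entries of xi

S = np.zeros((N, P))
S[np.arange(r), np.arange(r)] = s
xi = U @ S @ V.T

lam = np.linalg.eigvalsh(xi @ xi.T / N)          # PCA variances of the inputs
pr = lam.sum()**2 / np.sum(lam**2)               # participation-ratio dimensionality
print("entry variance ~", xi.var(), "; participation ratio =", pr)
```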

Fig 6. Sample-based PCA and learning performance.

A: First three components of inputs ξ^μ with Gaussian singular value spectrum s, for two different values of σ_x (color-coded top panels). Parameters: N = 100, P = 300. B: Average error ϵ_err for three different singular value spectra of the input sample covariance matrix: orthogonal model and Gaussian model with increasing σ_x (see main text for the definition of σ_x). Outputs are i.i.d. Gaussian. Parameters: N = 1000, f_E = 0.8, γ = 0.1, x̄_E = x̄_I = ȳ = σ_y = 1, σ_I = 2σ_E = 0.6. C: Average error ϵ_err for input with orthogonal-type covariance and output y with rbf-type covariance with decreasing σ_y (see main text for the definition of σ_y). All remaining parameters as in A. Full lines show analytical results. Points are averages across 50 samples.

For N large enough (in practice, for N ≳ 500), the statistics of single cases is well captured by the equations for the average case (self-averaging). To obtain a mean-field description for a single case, in which a given input matrix X is used, we further assume that we have access to the linear expansion c_μ of the output y on the set {v^μ} of the columns of the matrix V, namely y = ȳ + σ_y V c. The calculation can be carried out in a similar way and yields, for the average regression loss, the following result:

The average in Eq (6) is computed over the eigenvalues λ^x of the sample covariance matrix, which correspond to the PCA variances, and over λ_μ^y = c_μ² (Methods: SC, Energetic part). The quantity Λ̃_w can be computed from a set of self-consistent equations that link the order parameter Δq̃_w and the first two moments of the synaptic distribution. To better understand the role of the parameter Λ̃_w, it is instructive to compare Eq (6) with the corresponding result for unconstrained weights, which can be derived from the pseudo-inverse solution w* = (ξξᵀ + γ)^{−1} ξ y (Methods: SC, i.i.d. and unconstrained cases). The average loss is:

Comparing Eqs (7) and (8), we find that Λ̃_w acts as an implicit regularization in the sign-constrained case. The mean-field theory is thus carried out through a diagonalization over independent contributions along the components v^μ, with prescribed input and output variances λ^x and λ^y, respectively. The coupling between different components, induced by the averages ⟨⋅⟩_{x,y} and by the sign constraints, is incorporated in the effective regularization Λ̃_w, which acts equally on each component and depends only on the structure of the input x (see Eqs (56) and (67) in Methods).

In Fig 6C, we show results when the dimensionality of the output y along the (temporal) components of the input is modulated by taking c(α) = e^{−α²/(2σ_y²)}. The perceptron performance improves as the output signal spreads out across multiple components v^μ. The case of i.i.d. output is recovered by taking c_μ = 1.
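The corresponding output model can be sketched in the same way (the overall normalization of c and all sizes are illustrative choices; V stands for the matrix of right singular vectors from the construction above): the coefficients c_μ decay along the input components, and σ_y controls how many components carry the output.

```python
# Sketch: an output confined to the leading input components, y = ybar + V c with
# c_mu ~ exp(-mu^2 / (2 sigma_y^2)); the normalization of c is an illustrative choice.
import numpy as np
from scipy.stats import ortho_group

P, sigma_y = 300, 0.1
V = ortho_group.rvs(P, random_state=5)

a = np.arange(P) / P
c = np.exp(-a**2 / (2 * sigma_y**2))
c *= np.sqrt(P / np.sum(c**2))                   # keep the output variance of order one
y = 1.0 + V @ c                                  # ybar = 1; constant c_mu gives an i.i.d.-like output
print("output mean =", y.mean(), ", output variance =", y.var())
```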

Discussion

In this work, I investigated the properties of the optimal solutions of a linear perceptron with sign-constrained synapses and correlated input/output signals, thus providing a general mean-field theory for constrained regression in the presence of correlations. I treated both the case of known ensemble covariances and the case where the sample covariance is given. The latter approach, built on a rotational-invariance assumption, made it possible to link the regression performance to the input and output statistical properties expressed by principal component analysis.

I provided the general expression of the weight distribution for regularized regression and found that half of the weights are set to zero, irrespective of the fraction of excitatory weights, provided the bias is optimized. The shape of the synaptic distribution has been previously described in the binary perceptron with independent inputs at critical capacity, as well as in the theory of compressed sensing [54]. I elucidated the role of the optimal bias current and its relation to the optimal capacity and to the scaling of the solution weights. This analysis also sheds light on the structural properties of the synaptic matrices that emerge when target-based methods are used to build biologically plausible functional models of rate and spiking networks.

The theory presented in this work is relevant to the effort of establishing quantitative comparisons between the synaptic profile of neural circuits involved in the temporal processing of dynamic signals, such as the cerebellum [55–57], and normative theories that take into account the temporal and geometrical complexity of computational tasks. On the other hand, the construction of progressively more biologically plausible models of neural circuits calls for normative theories of learning in heterogeneous networks, which can be coupled to the dynamic mean-field analysis of E-I separated circuits [24, 25, 58].

As shown in this work, the interaction between the correlational structure of the input signals, the synaptic metabolic cost and the constant external current shapes the distribution of synaptic weights. In this respect, the results presented here offer a first approximation (static linear input-output associations) to account for the heterogeneity of the fraction of E and I inputs to single cells in local circuits. Even though a heterogeneous linear neuron is capable of memorizing N/2 associations without error for any E/I ratio, the optimal bias does depend on f_E, its minimal value being attained for f_E = 0.5. The input current in turn sets the neuron’s operating regime and its input/output properties. Moreover, trading memorization accuracy (small output error ϵ_err) for smaller weights (small |w|²) could be beneficial when synaptic costs are considered (γ > 0). It is therefore likely that, for an optimality principle for the 80/20 ratio to emerge from purely representational considerations, dynamical and metabolic effects should be examined together.

The importance of a theory of constrained regression with realistic input/output statistics goes beyond the realm of neuroscience. Non-negativity is commonly required to provide interpretable results in a wide variety of inference and learning problems. Off-line and on-line least-square estimation methods [59, 60] are also of great practical importance in adaptive control applications, where constraints on the parameter range are usually imposed by physical plausibility.

In this work, I assumed statistical independence between inputs and outputs. For the sake of biological plausibility, it would be interesting to consider more general input-output correlations for regression and binary discrimination tasks. The classical model for such correlations is provided by the so-called teacher-student (TS) approach [61], where the output y is generated by a deterministic, parameter-dependent transformation of the input x, with a structure similar to that of the trained neural architecture. The problem of input/output correlations is deeply related to the issue of optimal random nonlinear expansion, both in statistical learning theory [62, 63] and in theoretical neuroscience [41, 64], with a history dating back to the Marr-Albus theory of pattern separation in the cerebellum [65]. In a recent work, [28] introduced a promising generalization of TS, in which labels are generated via a low-dimensional latent representation, and it was shown that this model captures the training dynamics of deep networks on real-world datasets.

A general analysis that fully takes into account spatio-temporal correlations in network models could shed light on the emergence of specific network motifs during training. In networks with non-linear dynamics, the mathematical treatment quickly gets challenging even for simple learning rules. In recent years, interesting work has been done to clarify the relation between learning and network motifs, using a variety of mean-field approaches. Examples are the study of associative learning in spin models [8] and the analysis of motif dynamics for simple learning rules in spiking networks [66]. Incorporating both the temporal aspects of learning and neural cross-correlations in E-I separated models with realistic input/output structure is an interesting topic for future work.

Methods

Replica formalism: Ensemble covariance matrix (EC)

Using the replica formalism [67], the free energy density is written as:

f = -\lim_{N \to \infty} \frac{1}{\beta N} \lim_{n \to 0} \frac{1}{n} \log \langle Z^n \rangle_{x,y}.

The function Z^n can be computed by considering a finite number n of replicas of the vector w and subsequently performing an analytical continuation to real n. The introduction of n replicas allows one to factorize ⟨Z^n⟩_{x,y} over the individual weights w_i, at the cost of coupling different replicas once the averages over x and y are performed. Introducing a small set of overlap order parameters, factorization across replicas is restored, so that in the large-N limit the replicated partition function takes the form ⟨Z^n⟩_{x,y} = e^{−βNnf}. In the following, we will usually drop the subscript in the average ⟨⋅⟩_{x,y}.

To simplify the formulas, we introduce the O(1) weights J_i = √N σ_i w_i. In terms of these rescaled variables, the loss function in Eq (1) takes the form:

by virtue of x_i^μ = x̄_i + σ_i ξ_i^μ. We proceed by inserting the definitions M^a = (1/√N) Σ_{i=1}^N (x̄_i/σ_i) J_i^a + I√N and Δ_μ^a = (1/√N) Σ_{i=1}^N ξ_i^μ J_i^a − σ_y δy^μ with the aid of appropriate δ functions. The averaged replicated partition function ⟨Z^n⟩ is:
where:
In Eq (10), we used a Fourier representation of the δ functions and introduced the real variables u_μ^a as conjugate variables for Δ_μ^a. Analogously, we employed the purely imaginary M̂^a for the variables M^a. Once the average is carried out, the second cumulants of ξ and δy become coupled to replica-mixing terms of the form J_i^a J_i^b, which can be dealt with by introducing appropriate overlap order parameters N q_w^{ab} = Σ_{i=1}^N J_i^a J_i^b with the use of n(n + 1)/2 additional δ functions, together with their conjugate variables q̂_w^{ab}. Cumulants of higher order do not contribute in the large-N limit. Expanding the δ functions for the overlap parameters we get the expression
where the two contributions G_E and G_S, respectively called the energetic and the entropic part, will be calculated separately in the following for ease of exposition. Owing to the convexity of the regression problem, we use a replica-symmetric (RS) [67] ansatz q_w^{ab} = q_w + δ_{ab} Δq_w and M^a = M.

EC, Entropic part

The total volume of the configurations w^a for fixed values of the overlap parameters is given by the entropic part:

where we called η_c = x̄_c/σ_c, with c ∈ {E, I}, and η_i = η_E (η_i = η_I) if i ∈ E (i ∈ I). Using the RS ansatz for the conjugate parameters q̂_w^{ab} and M̂^a = M̂, we get:
Using the explicit definition of the measure dμ(J) ∝ Π_{i∈E} θ(J_i) dJ_i Π_{k∈I} θ(−J_k) dJ_k, one has, up to constant terms:
where we introduced the notation f_I = 1 − f_E and s_E = −s_I = 1. In order to disentangle the term Σ_{ab} J^a J^b = (Σ_a J^a)², we employ the so-called Hubbard-Stratonovich transformation e^{b²/2} = ∫ Dz e^{bz}, where Dz = dz e^{−z²/2}/√(2π). Taking the limit n → 0 one gets:

EC, Energetic part

In order to compute the energetic part, we first need to evaluate the average with respect to ξ and δy in Eq (11). Performing the two Gaussian integrals we get:

from which:
where we performed the translation Δ_μ^a + M^a − ȳ → Δ_μ^a. In the special case C_x = C_y ≡ C, we can use C = VΛVᵀ to jointly rotate Δ^a → VΔ^a and u^a → Vu^a, thus leaving scalar products invariant. By doing so, we obtain, within the RS ansatz:
where ζ_μ = Σ_ν V_μν. Using a Hubbard-Stratonovich transformation on the term Σ_{ab} u_μ^a u_μ^b, after some algebra, we obtain:
Observing that the free energy depends on M only through the term (M − ȳ)² in G_E, we conveniently eliminate the quantities ζ_μ at this stage, using the simple saddle-point relation
thus getting:
The brackets ⟨⋅⟩_λ in Eq (22) stand for an average over the eigenvalue distribution ρ(λ) of C in the N → ∞ limit, assuming self-averaging. A similar expression for G_E was previously derived in [34] for spherical weights, i.e. Σ_{i=1}^N w_i² = 1, in the presence of outputs y^μ generated by a teacher linear perceptron. To map Eq (45) of [34] onto Eq (22), one substitutes (1 − q) → Δq_w (observing that q^{aa} = 1 thanks to the spherical constraint) and sets R = 0, since the learning task here only involves pattern memorization.

When C_x ≠ C_y, we can derive a similar expression under the assumption of a ring topology in pattern space (corresponding to periodic boundary conditions in the index μ): in this case, both covariance matrices are circulant and may be jointly diagonalized by the discrete Fourier transform [33, 34]. In the main text, we show that the expression

yields good results also when Cx and Cy are covariance matrices of stationary discrete-time processes.

EC, Saddle-point equations

All in all, the free energy density in the saddle-point approximation is:

The saddle-point equations stemming from the entropic part can be written as:

where the averages 〈⋅〉J and 〈⋅〉z in Eqs (25)–(27) are taken with respect to the mean-field distribution of the J weights:
where z is a standard normal variable and θ is the Heaviside function: θ(x) = 1 when x > 0 and 0 otherwise. Eq (25) is obtained by differentiating Eq (24) with respect to q̂_w and then performing an integration by parts in z. Eq (26) is easily obtained by subtracting Eq (25) from the saddle-point condition for Δq̂_w, while Eq (27) originates from the derivative with respect to M̂.

In the β → ∞ limit, the uniqueness of the solution for γ > 0 implies that Δq_w → 0. We therefore use the following scalings for the order parameters:

while q_w = O(1). With this scaling, Eqs (25)–(27) take the form:
where G(x) = e^{−x²/2}/√(2π) and H(x) = ∫_x^{+∞} Dz. The two remaining saddle-point equations are:
Optimizing f with respect to the bias b = I√N immediately implies B = 0, by virtue of Eq (33), and greatly simplifies the saddle-point equations. Using the scaling assumptions Eqs (30)–(33) together with the saddle-point Eqs (34)–(38), we obtain Eq (4) of the main text, which is valid for any α when γ > 0. In the unregularized case (γ = 0), it describes solutions in the error regime α > α_c. The optimal bias b = I√N can be computed using Eq (36), which is valid up to an O(1) term equal to ȳ (Fig 4B). Keeping only the leading terms in the β → ∞ limit, Eq (24) can be written as:
From the definition of the free energy density −βNf = ⟨log ∫ dμ(w) e^{−βE}⟩, one has ⟨E⟩/N = ∂_β(βf). Using Eq (39) and the relevant saddle-point equations, the expression for the average minimal energy density is then:
Also, noting that ∂_γ E = (N/2) Σ_{i=1}^N w_i², we can compute the average squared norm of the weights v = ⟨Σ_{i=1}^N w_i²⟩ via v = 2∂_γ f. We thus obtain:
The error ϵ_err = (1/2N) ⟨|Xᵀw + b − y|²⟩ can then be computed as ϵ_err = ϵ − (γ/2) v.
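The relation v = 2∂_γ f (and hence the decomposition ϵ_err = ϵ − (γ/2)v) can be verified on a single instance by finite differences, assuming self-averaging; the sketch below reuses the bounded least-squares construction of the previous sections with illustrative i.i.d. data.

```python
# Sketch: check v = 2 * d(eps)/d(gamma) by central finite differences on one instance.
import numpy as np
from scipy.optimize import lsq_linear

def min_loss(X, y, fE, gamma):
    """Minimize Eq (1) with a learnable bias; return (loss density, |w|^2)."""
    N, P = X.shape
    NE = int(fE * N)
    A = np.vstack([np.hstack([X.T, np.ones((P, 1))]),
                   np.hstack([np.sqrt(gamma * N) * np.eye(N), np.zeros((N, 1))])])
    t = np.concatenate([y, np.zeros(N)])
    lb = np.concatenate([np.zeros(NE), -np.inf * np.ones(N - NE), [-np.inf]])
    ub = np.concatenate([np.inf * np.ones(NE), np.zeros(N - NE), [np.inf]])
    sol = lsq_linear(A, t, bounds=(lb, ub)).x
    w, b = sol[:N], sol[-1]
    E = 0.5 * np.sum((X.T @ w + b - y) ** 2) + 0.5 * gamma * N * np.sum(w**2)
    return E / N, np.sum(w**2)

rng = np.random.default_rng(7)
N, P, fE, gamma, dg = 300, 150, 0.8, 0.1, 1e-3
X = rng.normal(1.0, 1.0, size=(N, P))
y = rng.normal(1.0, 1.0, size=P)

eps_plus, _ = min_loss(X, y, fE, gamma + dg)
eps_minus, _ = min_loss(X, y, fE, gamma - dg)
_, v = min_loss(X, y, fE, gamma)
print("2 d(eps)/d(gamma) =", (eps_plus - eps_minus) / dg, "  vs  v =", v)
```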

Distribution of synaptic weights

The synaptic weight distribution appearing in Eqs (28) and (29) can be obtained using a variant of the replica trick [6, 67]. Using the identity Z^{−1} = lim_{n→0} Z^{n−1}, the density of excitatory weights can be written as:

where we picked the first E weight in the first replica, w_1^1, without loss of generality. The calculation proceeds along the same lines as for the entropic part above, since the energetic part does not depend on w^a explicitly. Isolating the first replica and taking the limit n → 0, one gets the expression
and analogously for the I weights. This expression holds for uncorrelated inputs and outputs and any fixed bias b, as well as for any correlated x and y with an optimal bias b, for which deviations from Eq (2) do not occur. In the β → ∞ limit, using the scaling relations Eqs (30)–(33), it can easily be shown that the mean-field probability density of the rescaled weights √N w_i is a superposition of a δ function at zero and two truncated Gaussian densities:
where the mean and standard deviation of the Gaussians G(⋅;M, Σ) are:
This weight density is valid for γ > 0 at any α and at critical capacity for γ = 0. The fraction of zero weights is given by:

Spectrum of exponential and rbf covariance

For the exponential covariance C_μν = e^{−|μ−ν|/τ}, one has [33]:

with x = e^{−1/τ}. In the rbf case C_μν = e^{−|μ−ν|²/(2τ²)}, the spectrum can be computed by Fourier series [39], yielding
with ϑ₃(z, q) = 1 + 2 Σ_{n=1}^∞ q^{n²} cos(2nz) the Jacobi theta function of the third kind.
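As a numerical check (a sketch with illustrative parameters), the eigenvalues of the exponential Toeplitz covariance can be compared with the classical closed form of its symbol, λ(φ) = (1 − x²)/(1 − 2x cos φ + x²) with x = e^{−1/τ}, obtained by summing the geometric Fourier series of x^{|k|}; this is the standard result for this Toeplitz symbol, not necessarily written in the same convention as the expression above.

```python
# Sketch: exponential Toeplitz covariance vs. the closed-form spectrum
# lambda(phi) = (1 - x^2) / (1 - 2 x cos(phi) + x^2), x = exp(-1/tau).
import numpy as np
from scipy.linalg import toeplitz

P, tau = 1000, 10.0
x = np.exp(-1.0 / tau)
C = toeplitz(x ** np.arange(P))
eig = np.sort(np.linalg.eigvalsh(C))[::-1]

phi = 2 * np.pi * np.arange(P) / P
symbol = np.sort((1 - x**2) / (1 - 2 * x * np.cos(phi) + x**2))[::-1]

print("largest eigenvalue vs. symbol maximum:", eig[0], symbol[0])
print("median relative deviation:", np.median(np.abs(eig - symbol) / symbol))
```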

Replica formalism: Sample covariance matrix (SC)

Also in the case of a sample covariance matrix, we are interested in statistically structured inputs and outputs. An independent average over x and y would result in a simple dependence on the variance of y in the energetic part. To capture the geometric dependence between x and y, we thus extend the calculations of [50, 51] to the case where the linear expansion of y^μ on the right singular vectors V^μ is known, by taking δy^μ = Σ_ν V_μν c_ν.

In order to compute the replicated cumulant generating function Eq (11), we again introduce overlap parameters qwab, whose volume is given by the previously computed entropic part GS. The fact that the entropic part is unchanged in turn implies that the mean-field weight distribution takes the form of Eq (44), with the values of {A, B, C} being determined by a new set of saddle-point equations.

SC, Energetic part

Using again the expressions (X)_{iμ} = x̄_i + σ_i ξ_i^μ and ξ = USVᵀ, the replicated cumulant generating function for the joint (mean-removed) input and output is:

where we used the change of variables J̃_i^a = Σ_k U_{ki} J_k^a and ũ_μ^a = Σ_k V_{kμ} u_k^a. The average in Eq (47) is taken over the joint distribution p(J̃^a, ũ^a) resulting from averaging over the Haar measure of the orthogonal matrices U and V. For a single replica, Z_{ξ,δy} only depends on the squared norms Q_w = (1/N) Σ_i J̃_i² and Q_u = (1/P) Σ_μ ũ_μ² of the two vectors J̃ and ũ. We can therefore write the average in the following way:
Introducing the Fourier representation of the δ functions, we are left with an expression involving an (N + P)-dimensional Gaussian integral:
where
and 1_K is the identity matrix of dimension K. Following [51], the determinant can easily be calculated:
where the limit is taken for N → ∞ and the average is with respect to the eigenvalue distribution ρ(λ^x). As for the quadratic part of the Gaussian integral, calling λ_k^y = c_k², we will use the shorthand
Considering now the replicated generating function, all the n(n + 1) cross-products J^a · J^b = J̃^a · J̃^b and u^a · u^b = ũ^a · ũ^b must be conserved under the multiplication by U and V. Together with the overlap parameters N q_w^{ab} = Σ_i J_i^a J_i^b, we additionally introduce the quantities P q_u^{ab} = Σ_μ u_μ^a u_μ^b, thus obtaining:
In the RS case, we again take q_w^{ab} = q_w + δ_{ab} Δq_w and, similarly for the u's, q_u^{ab} = q_u + δ_{ab} Δq_u. In the basis where both q_w^{ab} and q_u^{ab} are diagonal, the expression for Z_{ξ,δy} becomes
so, calling G_{ξ,δy} = (1/N) lim_{n→0} log Z_{ξ,δy}, we have:
with the function F given by:
and K(Λ_w, Λ_u) = ⟨Λ_w λ^y / (λ^x + Λ_w Λ_u)⟩_{λ^x, λ^y}. In Eq (54), it is understood that Λ_w and Λ_u are implied by the Legendre transform conditions:
The remaining terms in the energetic part G_E involve the q_u^{ab} overlaps and their conjugate parameters q̂_u^{ab}. Introducing an RS ansatz for q̂_u^{ab} analogous to the one used for q̂_w^{ab}, the calculation follows along the same lines as in the section EC, Energetic part. We get:
Eliminating M, q^u and Δq^u at the saddle-point in Eq (58), GE reduces to:

SC, Saddle-point equations

The final expression for the free energy density

implies the following saddle-point equations:
in addition to the entropic saddle-point Eqs (25)–(27), which are unchanged. The saddle-point values of the conjugate Legendre variables Λw, Λu greatly simplify the expression for the first and second derivatives of F. Indeed, from Eqs (61) and (62) one has:
or, setting Λ_w = βΛ̃_w:
In particular, Eq (56) shows that Δq̃_w is expressed in terms of the Stieltjes transform of ρ(λ^x), and the first term in Eq (55) is its Shannon transform. In the limit β → ∞, using the following additional scaling relations for the u overlaps:
we get the expression for the energy density:

SC, i.i.d. and unconstrained cases

Either setting K = 0 or λ^y = 0 reverts to the i.i.d. output case. In the special case of i.i.d. inputs, the eigenvalue distribution is the Marchenko-Pastur law

with λ_± = (1 ± √α)², from which F(Δq_w, Δq_u) = (α/2) Δq_w Δq_u. The saddle-point equations are essentially the same as those in the section EC, Saddle-point equations, with C_μν^x = C_μν^y = δ_μν.

Let us also note that, in the simple unconstrained case, taking for simplicity x¯i=0 and b = 0, the entropic part can be worked out to be, up to constant terms:

which, at the saddle point, implies Λ̃_w = γ. The mean-field distribution p(√N w) is a zero-mean Gaussian with variance v = q_w. Using the properties of the Hessian of the Legendre transform, it is easy to show that:
These expressions can also be derived from the pseudo-inverse solution (we take y¯=0 for simplicity) w* = (ξξT + γ)−1 ξy, by taking an average across ξ and y in the two expressions:
The i.i.d. output case also follows by performing independent averages over y and ξ.
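For completeness, the unconstrained pseudo-inverse solution can be checked against a generic least-squares fit of the equivalent augmented system; the sketch below uses illustrative sizes and x̄ = 0, b = 0 as in the text.

```python
# Sketch: unconstrained ridge solution w* = (xi xi^T + gamma I)^(-1) xi y versus a
# plain least-squares fit of the augmented system [xi^T; sqrt(gamma) I] -> [y; 0].
import numpy as np

rng = np.random.default_rng(8)
N, P, gamma = 200, 120, 0.1
xi = rng.normal(size=(N, P))
y = rng.normal(size=P)

w_closed = np.linalg.solve(xi @ xi.T + gamma * np.eye(N), xi @ y)

A = np.vstack([xi.T, np.sqrt(gamma) * np.eye(N)])
t = np.concatenate([y, np.zeros(N)])
w_lstsq = np.linalg.lstsq(A, t, rcond=None)[0]

print("max |w_closed - w_lstsq| =", np.max(np.abs(w_closed - w_lstsq)))
```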

Acknowledgements

The author would like to thank L.F. Abbott and Francesco Fumarola for constructive criticism of an earlier version of the manuscript.

References

1. Song HF, Yang GR, Wang XJ. Training Excitatory-Inhibitory Recurrent Neural Networks for Cognitive Tasks: A Simple and Flexible Framework. PLOS Computational Biology. 2016;12(2):1–30. 10.1371/journal.pcbi.1004792

2. Nicola W, Clopath C. Supervised learning in spiking neural networks with FORCE training. Nature Communications. 2017;8(1):2208. 10.1038/s41467-017-01827-3

3. Ingrosso A, Abbott LF. Training dynamically balanced excitatory-inhibitory networks. PLOS ONE. 2019;14(8):1–18. 10.1371/journal.pone.0220547

4. Kim CM, Chow CC. Learning recurrent dynamics in spiking networks. eLife. 2018;7:e37124. 10.7554/eLife.37124

5. Brendel W, Bourdoukan R, Vertechi P, Machens CK, Denève S. Learning to represent signals spike by spike. PLOS Computational Biology. 2020;16(3):1–23. 10.1371/journal.pcbi.1007692

6. Brunel N, Hakim V, Isope P, Nadal JP, Barbour B. Optimal Information Storage and the Distribution of Synaptic Weights: Perceptron versus Purkinje Cell. Neuron. 2004;43(5):745–757. 10.1016/S0896-6273(04)00528-8

7. Barbour B, Brunel N, Hakim V, Nadal JP. What can we learn from synaptic weight distributions? Trends in Neurosciences. 2007;30(12):622–629. 10.1016/j.tins.2007.09.005

8. Brunel N. Is cortical connectivity optimized for storing information? Nature Neuroscience. 2016;19(5):749–755. 10.1038/nn.4286

9. Gardner E. The space of interactions in neural network models. Journal of Physics A: Mathematical and General. 1988;21(1):257–270.

10. Clopath C, Nadal JP, Brunel N. Storage of correlated patterns in standard and bistable Purkinje cell models. PLoS Computational Biology. 2012;8(4):e1002448. 10.1371/journal.pcbi.1002448

11. Chapeton J, Fares T, LaSota D, Stepanyants A. Efficient associative memory storage in cortical circuits of inhibitory and excitatory neurons. Proceedings of the National Academy of Sciences. 2012;109(51):E3614–E3622. 10.1073/pnas.1211467109

12. Zhang D, Zhang C, Stepanyants A. Robust Associative Learning Is Sufficient to Explain the Structural and Dynamical Properties of Local Cortical Circuits. Journal of Neuroscience. 2019;39(35):6888–6904. 10.1523/JNEUROSCI.3218-18.2019

13. Rubin R, Abbott LF, Sompolinsky H. Balanced excitation and inhibition are required for high-capacity, noise-robust neuronal selectivity. Proceedings of the National Academy of Sciences. 2017;114(44):E9366–E9375. 10.1073/pnas.1705841114

14. Seung HS, Sompolinsky H, Tishby N. Statistical mechanics of learning from examples. Phys Rev A. 1992;45:6056–6091. 10.1103/PhysRevA.45.6056

15. Clopath C, Brunel N. Optimal Properties of Analog Perceptrons with Excitatory Weights. PLOS Computational Biology. 2013;9(2):1–6. 10.1371/journal.pcbi.1002919

16. Gutfreund H, Stein Y. Capacity of neural networks with discrete synaptic couplings. Journal of Physics A: Mathematical and General. 1990;23(12):2613–2630. 10.1088/0305-4470/23/12/036

17. Isaacson JS, Scanziani M. How Inhibition Shapes Cortical Activity. Neuron. 2011;72(2):231–243. 10.1016/j.neuron.2011.09.027

18. Field RE, D'amour JA, Tremblay R, Miehl C, Rudy B, Gjorgjieva J, et al. Heterosynaptic Plasticity Determines the Set Point for Cortical Excitatory-Inhibitory Balance. Neuron. 2020. 10.1016/j.neuron.2020.03.002

19. Hennequin G, Agnes EJ, Vogels TP. Inhibitory Plasticity: Balance, Control, and Codependence. Annual Review of Neuroscience. 2017;40(1):557–579. 10.1146/annurev-neuro-072116-031005

20. Ahmadian Y, Miller KD. What is the dynamical regime of cerebral cortex? arXiv:1908.10101. 2019.

21. van Vreeswijk C, Sompolinsky H. Chaos in Neuronal Networks with Balanced Excitatory and Inhibitory Activity. Science. 1996;274(5293):1724–1726. 10.1126/science.274.5293.1724

22. van Vreeswijk C, Sompolinsky H. Chaotic Balanced State in a Model of Cortical Circuits. Neural Comput. 1998;10(6):1321–1371. 10.1162/089976698300017214

23. Renart A, de la Rocha J, Bartho P, Hollender L, Parga N, Reyes A, et al. The Asynchronous State in Cortical Circuits. Science. 2010;327(5965):587–590. 10.1126/science.1179850

24. Kadmon J, Sompolinsky H. Transition to Chaos in Random Neuronal Networks. Phys Rev X. 2015;5:041030.

25. Harish O, Hansel D. Asynchronous Rate Chaos in Spiking Neuronal Circuits. PLOS Computational Biology. 2015;11(7):1–38. 10.1371/journal.pcbi.1004266

26. Brunel N. Dynamics of Sparsely Connected Networks of Excitatory and Inhibitory Spiking Neurons. Journal of Computational Neuroscience. 2000;8(3):183–208. 10.1023/A:1008925309027

27. Tsodyks MV, Sejnowski T. Rapid state switching in balanced cortical network models. Network: Computation in Neural Systems. 1995;6(2):111–124. 10.1088/0954-898X_6_2_001

28. Goldt S, Mézard M, Krzakala F, Zdeborová L. Modelling the influence of data structure on learning in neural networks: the hidden manifold model. arXiv:1909.11500. 2019.

29. Chung S, Lee DD, Sompolinsky H. Classification and Geometry of General Perceptual Manifolds. Phys Rev X. 2018;8:031003.

30. Cohen U, Chung S, Lee DD, Sompolinsky H. Separability and geometry of object manifolds in deep neural networks. Nature Communications. 2020;11(1):746. 10.1038/s41467-020-14578-5

31. Rotondo P, Lagomarsino MC, Gherardi M. Counting the learnable functions of geometrically structured data. Phys Rev Research. 2020;2:023169. 10.1103/PhysRevResearch.2.023169

32. Pastore M, Rotondo P, Erba V, Gherardi M. Statistical learning theory of structured data. arXiv:2005.10002. 2020.

33. Monasson R. Properties of neural networks storing spatially correlated patterns. Journal of Physics A: Mathematical and General. 1992;25(13):3701–3720. 10.1088/0305-4470/25/13/019

34. Tarkowski W, Lewenstein M. Learning from correlated examples in a perceptron. Journal of Physics A: Mathematical and General. 1993;26(15):3669–3679. 10.1088/0305-4470/26/15/017

35. Monasson R. Storage of spatially correlated patterns in autoassociative memories. Journal de Physique I. 1993;3(5):1141–1152. 10.1051/jp1:1993107

36. Battista A, Monasson R. Capacity-Resolution Trade-Off in the Optimal Learning of Multiple Low-Dimensional Manifolds by Attractor Neural Networks. Phys Rev Lett. 2020;124:048302. 10.1103/PhysRevLett.124.048302

37. Amit DJ, Wong KYM, Campbell C. Perceptron learning with sign-constrained weights. Journal of Physics A: Mathematical and General. 1989;22(12):2039–2045. 10.1088/0305-4470/22/12/009

38. Amit DJ, Campbell C, Wong KYM. The interaction space of neural networks with sign-constrained synapses. Journal of Physics A: Mathematical and General. 1989;22(21):4687–4693. 10.1088/0305-4470/22/21/030

39. Gray RM. Toeplitz and Circulant Matrices: A Review. Foundations and Trends in Communications and Information Theory. 2006;2(3):155–239. 10.1561/0100000006

40. Abbott LF, Rajan K, Sompolinsky H. Interactions between Intrinsic and Stimulus-Evoked Activity in Recurrent Neural Networks. arXiv:0912.3832. 2009.

41. Litwin-Kumar A, Harris KD, Axel R, Sompolinsky H, Abbott LF. Optimal Degrees of Synaptic Connectivity. Neuron. 2017;93(5):1153–1164.e7. 10.1016/j.neuron.2017.01.030

42. Marinari E, Parisi G, Ritort F. Replica field theory for deterministic models. II. A non-random spin glass with glassy behaviour. Journal of Physics A: Mathematical and General. 1994;27(23):7647–7668. 10.1088/0305-4470/27/23/011

43. Parisi G, Potters M. Mean-field equations for spin models with orthogonal interaction matrices. Journal of Physics A: Mathematical and General. 1995;28(18):5267–5285. 10.1088/0305-4470/28/18/016

44. Cherrier R, Dean DS, Lefèvre A. Role of the interaction matrix in mean-field spin glass models. Phys Rev E. 2003;67:046112. 10.1103/PhysRevE.67.046112

45. Opper M, Winther O. Tractable Approximations for Probabilistic Models: The Adaptive Thouless-Anderson-Palmer Mean Field Approach. Phys Rev Lett. 2001;86:3695–3699. 10.1103/PhysRevLett.86.3695

46. Opper M, Winther O. Adaptive and self-averaging Thouless-Anderson-Palmer mean-field theory for probabilistic modeling. Phys Rev E. 2001;64:056131. 10.1103/PhysRevE.64.056131

47. Opper M, Winther O. Expectation Consistent Approximate Inference. Journal of Machine Learning Research. 2005;6:2177–2204.

48. Takeda K, Uda S, Kabashima Y. Analysis of CDMA systems that are characterized by eigenvalue spectrum. Europhysics Letters (EPL). 2006;76(6):1193–1199. 10.1209/epl/i2006-10380-5

49. Kabashima Y. Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels. Journal of Physics: Conference Series. 2008;95:012001.

50. Shinzato T, Kabashima Y. Learning from correlated patterns by simple perceptrons. Journal of Physics A: Mathematical and Theoretical. 2008;42(1):015005. 10.1088/1751-8113/42/1/015005

51. Shinzato T, Kabashima Y. Perceptron capacity revisited: classification ability for correlated patterns. Journal of Physics A: Mathematical and Theoretical. 2008;41(32):324013. 10.1088/1751-8113/41/32/324013

52. Tulino AM, Verdú S. Random Matrix Theory and Wireless Communications. Foundations and Trends in Communications and Information Theory. 2004;1(1):1–182. 10.1561/0100000001

53. Tao T. Topics in Random Matrix Theory. Graduate Studies in Mathematics. American Mathematical Society. Available from: https://books.google.com/books?id=Hjq_JHLNPT0C.

54. Ganguli S, Sompolinsky H. Statistical Mechanics of Compressed Sensing. Phys Rev Lett. 2010;104:188701. 10.1103/PhysRevLett.104.188701

55. Marr D. A theory of cerebellar cortex. The Journal of Physiology. 1969;202(2):437–470. 10.1113/jphysiol.1969.sp008820

56. Wolpert DM, Miall RC, Kawato M. Internal models in the cerebellum. Trends in Cognitive Sciences. 1998;2(9):338–347. 10.1016/S1364-6613(98)01221-2

57. Herzfeld DJ, Kojima Y, Soetedjo R, Shadmehr R. Encoding of error and learning to correct that error by the Purkinje cells of the cerebellum. Nature Neuroscience. 2018;21(5):736–743. 10.1038/s41593-018-0136-y

58. Mastrogiuseppe F, Ostojic S. Intrinsically-generated fluctuating activity in excitatory-inhibitory networks. PLOS Computational Biology. 2017;13(4):1–40. 10.1371/journal.pcbi.1005498

59. Chen J, Richard C, Bermudez JM, Honeine P. Variants of Non-Negative Least-Mean-Square Algorithm and Convergence Analysis. IEEE Transactions on Signal Processing. 2014;62(15):3990–4005. 10.1109/TSP.2014.2332440

60. Nascimento VH, Zakharov YV. RLS Adaptive Filter With Inequality Constraints. IEEE Signal Processing Letters. 2016;23(5):752–756. 10.1109/LSP.2016.2551468

61. Engel A, Van den Broeck C. Statistical Mechanics of Learning. Cambridge University Press; 2001.

62. Mei S, Montanari A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv:1908.05355. 2019.

63. Gerace F, Loureiro B, Krzakala F, Mézard M, Zdeborová L. Generalisation error in learning with random features and the hidden manifold model. arXiv:2002.09339. 2020.

64. Babadi B, Sompolinsky H. Sparseness and Expansion in Sensory Representations. Neuron. 2014;83(5):1213–1226. 10.1016/j.neuron.2014.07.035

65. Cayco-Gajic NA, Silver RA. Re-evaluating Circuit Mechanisms Underlying Pattern Separation. Neuron. 2019;101(4):584–602. 10.1016/j.neuron.2019.01.044

66. Ocker GK, Litwin-Kumar A, Doiron B. Self-Organization of Microcircuits in Networks of Spiking Neurons with Plastic Synapses. PLOS Computational Biology. 2015;11(8):1–40. 10.1371/journal.pcbi.1004458

67. Mézard M, Parisi G, Virasoro M. Spin Glass Theory and Beyond. World Scientific Lecture Notes in Physics; 1987.