Cost function dependent barren plateaus in shallow parametrized quantum circuits

M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, Patrick J. Coles

https://doi.org/10.1038/s41467-021-21728-w, Volume: 12, Pages: null

Article Type: Research Article Article History

Publisher: Nature Publishing Group UK

- Facebook
- Twitter
- Linkedin
- Whatsapp
Altmetric

Table of Contents

Introduction
Results
Discussion
Methods
Supplementary information
Supplementary information

Abstract

Variational quantum algorithms (VQAs) optimize the parameters θ of a parametrized quantum circuit V(θ) to minimize a cost function C. While VQAs may enable practical applications of noisy quantum computers, they are nevertheless heuristic methods with unproven scaling. Here, we rigorously prove two results, assuming V(θ) is an alternating layered ansatz composed of blocks forming local 2-designs. Our first result states that defining C in terms of global observables leads to exponentially vanishing gradients (i.e., barren plateaus) even when V(θ) is shallow. Hence, several VQAs in the literature must revise their proposed costs. On the other hand, our second result states that defining C with local observables leads to at worst a polynomially vanishing gradient, so long as the depth of V(θ) is O(logn)

. Our results establish a connection between locality and trainability. We illustrate these ideas with large-scale simulations, up to 100 qubits, of a quantum autoencoder implementation.

Parametrised quantum circuits are a promising hybrid classical-quantum approach, but rigorous results on their effective capabilities are rare. Here, the authors explore the feasibility of training depending on the type of cost functions, showing that local ones are less prone to the barren plateau problem.

Keywords

Quantum Physics, Quantum information, Mathematics and computing, Information theory and computation

Cerezo,Sone,Volkoff,Cincio,and Coles: Cost function dependent barren plateaus in shallow parametrized quantum circuits

Introduction

One of the most important technological questions is whether Noisy Intermediate-Scale Quantum (NISQ) computers will have practical applications¹. NISQ devices are limited both in qubit count and in gate fidelity, hence preventing the use of quantum error correction.

The leading strategy to make use of these devices is variational quantum algorithms (VQAs)². VQAs employ a quantum computer to efficiently evaluate a cost function C, while a classical optimizer trains the parameters θ of a Parametrized Quantum Circuit (PQC) V(θ). The benefits of VQAs are three-fold. First, VQAs allow for task-oriented programming of quantum computers, which is important since designing quantum algorithms is non-intuitive. Second, VQAs make up for small qubit counts by leveraging classical computational power. Third, pushing complexity onto classical computers, while only running short-depth quantum circuits, is an effective strategy for error mitigation on NISQ devices.

There are very few rigorous scaling results for VQAs (with exception of one-layer approximate optimization^3–5). Ideally, in order to reduce gate overhead that arises when implementing on quantum hardware one would like to employ a hardware-efficient ansatz⁶ for V(θ). As recent large-scale implementations for chemistry⁷ and optimization⁸ applications have shown, this ansatz leads to smaller errors due to hardware noise. However, one of the few known scaling results is that deep versions of randomly initialized hardware-efficient ansatzes lead to exponentially vanishing gradients⁹. Very little is known about the scaling of the gradient in such ansatzes for shallow depths, and it would be especially useful to have a converse bound that guarantees non-exponentially vanishing gradients for certain depths. This motivates our work, where we rigorously investigate the gradient scaling of VQAs as a function of the circuit depth.

The other motivation for our work is the recent explosion in the number of proposed VQAs. The Variational Quantum Eigensolver (VQE) is the most famous VQA. It aims to prepare the ground state of a given Hamiltonian H = ∑_αc_ασ_α, with H expanded as a sum of local Pauli operators¹⁰. In VQE, the cost function is obviously the energy $C = ⟨ψ ∣ H ∣ ψ⟩$ of the trial state $∣ψ⟩$ . However, VQAs have been proposed for other applications, like quantum data compression¹¹, quantum error correction¹², quantum metrology¹³, quantum compiling^14–17, quantum state diagonalization^18,19, quantum simulation^20–23, fidelity estimation²⁴, unsampling²⁵, consistent histories²⁶, and linear systems^27–29. For these applications, the choice of C is less obvious. Put another way, if one reformulates these VQAs as ground-state problems (which can be done in many cases), the choice of Hamiltonian H is less intuitive. This is because many of these applications are abstract, rather than associated with a physical Hamiltonian.

We remark that polynomially vanishing gradients imply that the number of shots needed to estimate the gradient should grow as $O (poly (n))$ . In contrast, exponentially vanishing gradients (i.e., barren plateaus) imply that derivative-based optimization will have exponential scaling³⁰, and this scaling can also apply to derivative-free optimization³¹. Assuming a polynomial number of shots per optimization step, one will be able to resolve against finite sampling noise and train the parameters if the gradients vanish polynomially. Hence, we employ the term “trainable” for polynomially vanishing gradients.

In this work, we connect the trainability of VQAs to the choice of C. For the abstract applications in refs. ^11–29, it is important for C to be operational, so that small values of C imply that the task is almost accomplished. Consider an example of state preparation, where the goal is to find a gate sequence that prepares a target state $∣ψ_{0}⟩$ . A natural cost function is the square of the trace distance D_T between $∣ψ_{0}⟩$ and $∣ψ⟩ = V {(θ)}^{†} ∣0⟩$ , given by $C_{G} = D_{T} {(∣ψ_{0}⟩, ∣ψ⟩)}^{2}$ , which is equivalent to

with

O_{G} = 1 - ∣0⟩ ⟨0∣

. Note that

\sqrt{C_{G}} \geq ∣ ⟨ψ ∣ M ∣ ψ⟩ - ⟨ψ_{0} ∣ M ∣ ψ_{0}⟩ ∣

has a nice operational meaning as a bound on the expectation value difference for a POVM element M.

However, here we argue that this cost function and others like it exhibit exponentially vanishing gradients. Namely, we consider global cost functions, where one directly compares states or operators living in exponentially large Hilbert spaces (e.g., $∣ψ⟩$ and $∣ψ_{0}⟩$ ). These are precisely the cost functions that have operational meanings for tasks of interest, including all tasks in refs. ^11–29. Hence, our results imply that a non-trivial subset of these references will need to revise their choice of C.

Interestingly, we demonstrate vanishing gradients for shallow PQCs. This is in contrast to McClean et al.⁹, who showed vanishing gradients for deep PQCs. They noted that randomly initializing θ for a V(θ) that forms a 2-design leads to a barren plateau, i.e., with the gradient vanishing exponentially in the number of qubits, n. Their work implied that researchers must develop either clever parameter initialization strategies^32,33 or clever PQCs ansatzes^4,34,35. Similarly, our work implies that researchers must carefully weigh the balance between trainability and operational relevance when choosing C.

While our work is for general VQAs, barren plateaus for global cost functions were noted for specific VQAs and for a very specific tensor-product example by our research group^14,18, and more recently in²⁹. This motivated the proposal of local cost functions^{14,16,18,22,25–27}, where one compares objects (states or operators) with respect to each individual qubit, rather than in a global sense, and therein it was shown that these local cost functions have indirect operational meaning.

Our second result is that these local cost functions have gradients that vanish polynomially rather than exponentially in n, and hence have the potential to be trained. This holds for V(θ) with depth $O (\log n)$ . Figure 1 summarizes our two main results.

Fig. 1

Summary of our main results.

McClean et al.⁹ proved that a barren plateau can occur when the depth D of a hardware-efficient ansatz is $D \in O (poly (n))$ . Here we extend these results by providing bounds for the variance of the gradient of global and local cost functions as a function of D. In particular, we find that the barren plateau phenomenon is cost-function dependent. a For global cost functions (e.g., Eq. (1)), the landscape will exhibit a barren plateau essentially for all depths D. b For local cost functions (e.g., Eq. (2)), the gradient vanishes at worst polynomially and hence is trainable when $D \in O (\log (n))$ , while barren plateaus occur for $D \in O (poly (n))$ , and between these two regions the gradient transitions from polynomial to exponential decay.

Finally, we illustrate our main results for an important example: quantum autoencoders¹¹. Our large-scale numerics show that the global cost function proposed in¹¹ has a barren plateau. On the other hand, we propose a novel local cost function that is trainable, hence making quantum autoencoders a scalable application.

Results

Warm-up example

To illustrate cost-function-dependent barren plateaus, we first consider a toy problem corresponding to the state preparation problem in the Introduction with the target state being $∣0⟩$ . We assume a tensor-product ansatz of the form $V (θ) {= ⨂}_{j = 1}^{n} e^{- i θ^{j} σ_{x}^{(j)} / 2}$ , with the goal of finding the angles θ^j such that $V (θ) ∣0⟩ = ∣0⟩$ . Employing the global cost of (1) results in $C_{G} = 1 - \prod_{j = 1}^{n} \cos^{2} \frac{θ^{j}}{2}$ . The barren plateau can be detected via the variance of its gradient: $Var [\frac{\partial C_{G}}{\partial θ^{j}}] = \frac{1}{8} {(\frac{3}{8})}^{n - 1}$ , which is exponentially vanishing in n. Since the mean value is $⟨\frac{\partial C_{G}}{\partial θ^{j}}⟩ = 0$ , the gradient concentrates exponentially around zero.

On the other hand, consider a local cost function:

where

1_{\bar{j}}

is the identity on all qubits except qubit j. Note that C_L vanishes under the same conditions as C_G^14,16, C_L = 0 ⇔ C_G = 0. We find

C_{L} = 1 - \frac{1}{n} \sum_{j = 1}^{n} \cos^{2} \frac{θ^{j}}{2}

, and the variance of its gradient is

Var [\frac{\partial C_{L}}{\partial θ^{j}}] = \frac{1}{8 n^{2}}

, which vanishes polynomially with n and hence exhibits no barren plateau. Figure 2 depicts the cost landscapes of C_G and C_L for two values of n and shows that the barren plateau can be avoided here via a local cost function.

Fig. 2

Cost function landscapes.

a Two-dimensional cross-section through the landscape of $C_{G} = 1 - \prod_{j = 1}^{n} \cos^{2} (θ^{j} / 2)$ for n = 4 (blue) and n = 24 (orange). b The same cross-section through the landscape of $C_{L} = 1 - \frac{1}{n} \sum_{j = 1}^{n} \cos^{2} (θ^{j} / 2)$ is independent of n. In both cases, 200 Haar distributed points are shown, with very few (most) of these points lying in the valley containing the global minimum of C_G (C_L).

Moreover, this example allows us to delve deeper into the cost landscape to see a phenomenon that we refer to as a narrow gorge. While a barren plateau is associated with a flat landscape, a narrow gorge refers to the steepness of the valley that contains the global minimum. This phenomenon is illustrated in Fig. 2, where each dot corresponds to cost values obtained from randomly selected parameters θ. For C_G we see that very few dots fall inside the narrow gorge, while for C_L the narrow gorge is not present. Note that the narrow gorge makes it harder to train C_G since the learning rate of descent-based optimization algorithms must be exponentially small in order not to overstep the narrow gorge. The following proposition (proved in the Supplementary Note 2) formalizes the narrow gorge for C_G and its absence for C_L by characterizing the dependence on n of the probability C ⩽ δ. This probability is associated with the parameter space volume that leads to C ⩽ δ.

Proposition 1

Let θ^j be uniformly distributed on [−π, π] ∀j. For any δ ∈ (0, 1), the probability that C_G ≤ δ satisfies

For any

δ \in [\frac{1}{2}, 1]

, the probability that C_L ≤ δ satisfies

General framework

For our general results, we consider a family of cost functions that can be expressed as the expectation value of an operator O as follows

where ρ is an arbitrary quantum state on n qubits. Note that this framework includes the special case where ρ could be a pure state, as well as the more special case where

ρ = ∣0⟩ ⟨0∣

, which is the input state for many VQAs such as VQE. Moreover, in VQE, one chooses O = H, where H is the physical Hamiltonian. In general, the choice of O and ρ essentially defines the application of interest of the particular VQA.

It is typical to express O as a linear combination of the form $O = c_{0} 1 + \sum_{i = 1}^{N} c_{i} O_{i}$ . Here O_i ≠ $1$ , $c_{i} \in R$ , and we assume that at least one c_i ≠ 0. Note that C_G and C_L in (1) and (2) fall under this framework. In our main results below, we will consider two different choices of O that respectively capture our general notions of global and local cost functions and also generalize the aforementioned C_G and C_L.

As shown in Fig. 3a, V(θ) consists of L layers of m-qubit unitaries W_kl(θ_kl), or blocks, acting on alternating groups of m neighboring qubits. We refer to this as an Alternating Layered Ansatz. We remark that the Alternating Layered Ansatz will be a hardware-efficient ansatz so long as the gates that compose each block are taken from a set of gates native to a specific device. As depicted in Fig. 3c, the one dimensional Alternating Layered Ansatz can be readily implemented in devices with one-dimensional connectivity, as well as in devices with two-dimensional connectivity (such as that of IBM’s³⁶ and Google’s³⁷ quantum devices). That is, with both one- and two-dimensional hardware connectivity one can group qubits to form an Alternating Layered Ansatz as in Fig. 3a.

Fig. 3

Alternating Layered Ansatz.

a Each block W_kl acts on m qubits and is parametrized via (27). As shown, we define S_k as the m-qubit subsystem on which W_kL acts, where L is the last layer of V(θ). Given some block W, it is useful for our proofs (outlined in the Methods) to write $V (θ) = V_{R} (1_{\bar{w}} \otimes W) V_{L}$ , where V_R contains all gates in the forward light-cone $L$ of W. The forward light-cone $L$ is defined as all gates with at least one input qubit causally connected to the output qubits of W. We define $\bar{L}$ as the compliment of $L$ , S_w as the m-qubit subsystem on which W acts, and $S_{\bar{w}}$ as the n − m qubit subsystem on which W acts trivially. b The operator O_i acts nontrivially only in subsystem $S_{k - 1} \in S$ , while $O_{i^{'}}$ acts nontrivially on the first m/2 qubits of S_k+1, and on the second m/2 qubits of S_k. c Depiction of the Alternating Layered Ansatz with one- and two-dimensional connectivity. Each circle represents a physical qubit.

The index l = 1, …, L in W_kl(θ_kl) indicates the layer that contains the block, while k = 1, …, ξ indicates the qubits it acts upon. We assume n is a multiple of m, with n = mξ, and that m does not scale with n. As depicted in Fig. 3a, we define S_k as the m-qubit subsystem on which W_kL acts, and we define $S = {S_{k}}$ as the set of all such subsystems. Let us now consider a block W_kl(θ_kl) in the lth layer of the ansatz. For simplicity we henceforth use W to refer to a given W_kl(θ_kl). As shown in the Methods section, given a θ^ν ∈ θ_kl that parametrizes a rotation $e^{- i θ^{ν} σ_{ν} / 2}$ (with σ_ν a Pauli operator) inside a given block W, one can always express

where W_A and W_B contain all remaining gates in W, and are properly defined in the Methods section.

The contribution to the gradient ∇C from a parameter θ^ν in the block W is given by the partial derivative ∂_νC. While the value of ∂_νC depends on the specific parameters θ, it is useful to compute ${⟨ \partial_{ν} C ⟩}_{V}$ , i.e., the average gradient over all possible unitaries V(θ) within the ansatz. Such an average may not be representative near the minimum of C, although it does provide a good estimate of the expected gradient when randomly initializing the angles in V(θ). In the Methods Section we explicitly show how to compute averages of the form 〈…〉_V, and in the Supplementary Note 3 we provide a proof for the following Proposition.

Proposition 2

The average of the partial derivative of any cost function of the form (6) with respect to a parameter θ^ν in a block W of the ansatz in Fig. 3 is

provided that either W_A or W_B of () form a 1-design.

Here we recall that a t-design is an ensemble of unitaries, such that sampling over their distribution yields the same properties as sampling random unitaries from the unitary group with respect to the Haar measure up to the first t moments³⁸. The Methods section provides a formal definition of a t-design.

Proposition 2 states that the gradient is not biased in any particular direction. To analyze the trainability of C, we consider the second moment of its partial derivatives:

where we used the fact that

{⟨ \partial_{ν} C ⟩}_{V} = 0

. The magnitude of Var[∂_νC] quantifies how much the partial derivative concentrates around zero, and hence small values in () imply that the slope of the landscape will typically be insufficient to provide a cost-minimizing direction. Specifically, from Chebyshev’s inequality, Var[∂_νC] bounds the probability that the cost-function partial derivative deviates from its mean value (of zero) as

\Pr (∣ \partial_{ν} C ∣ \geq c) \leq Var [\partial_{ν} C] / c^{2}

for all c > 0.

Main results

Here we present our main theorems and corollaries, with the proofs sketched in the Methods and detailed in the Supplementary Information. In addition, in the Methods section we provide some intuition behind our main results by analyzing a generalization of the warm-up example where V(θ) is composed of a single layer of the ansatz in Fig. 3. This case bridges the gap between the warm-up example and our main theorems and also showcases the tools used to derive our main result.

The following theorem provides an upper bound on the variance of the partial derivative of a global cost function which can be expressed as the expectation value of an operator of the form

Specifically, we consider two cases of interest: (i) When N = 1 and each

{\hat{O}}_{1 k}

is a non-trivial projector (

{\hat{O}}_{1 k}^{2} = {\hat{O}}_{1 k} \neq 1

) of rank r_k acting on subsystem S_k, or (ii) When N is arbitrary and

{\hat{O}}_{i k}

is traceless with

Tr [{\hat{O}}_{i k}^{2}] \leq 2^{m}

(for example, when

{\hat{O}}_{i k} {= ⨂}_{j = 1}^{m} σ_{j}^{μ}

is a tensor product of Pauli operators

σ_{j}^{μ} \in {1_{j}, σ_{j}^{x}, σ_{j}^{y}, σ_{j}^{z}}

, with at least one

σ_{j}^{μ} \neq 1

). Note that case (i) includes C_G of () as a special case.

Theorem 1

Consider a trainable parameter θ^ν in a block W of the ansatz in Fig. 3. Let Var[∂_νC] be the variance of the partial derivative of a global cost function C (with O given by (10)) with respect to θ^ν. If W_A, W_B of (7), and each block in V(θ) form a local 2-design, then Var[∂_νC] is upper bounded by

(i)For N = 1 and when each ${\hat{O}}_{1 k}$ is a non-trivial projector, then defining $R = \prod_{k = 1}^{ξ} r_{k}^{2}$ , we have

(ii)For arbitrary N and when each ${\hat{O}}_{i k}$ satisfies $Tr [{\hat{O}}_{i k}] = 0$ and $Tr [{\hat{O}}_{i k}^{2}] \leq 2^{m}$ , then

From Theorem 1 we derive the following corollary.

Corollary 1

Consider the function F_n(L, l).

(i)Let N = 1 and let each ${\hat{O}}_{1 k}$ be a non-trivial projector, as in case (i) of Theorem 1. If $c_{1}^{2} R \in O (2^{n})$ and if the number of layers $L \in O (poly (\log (n)))$ , then

which implies that Var[∂_νC] is exponentially vanishing in n if m ⩾ 2.

(ii)Let N be arbitrary, and let each ${\hat{O}}_{i k}$ satisfy $Tr [{\hat{O}}_{i k}] = 0$ and $Tr [{\hat{O}}_{i k}^{2}] \leq 2^{m}$ , as in case (ii) of Theorem 1. If $N \in O (2^{n})$ , $c_{i} \in O (1)$ , and if the number of layers $L \in O (poly (\log (n)))$ , then

which implies that Var[∂_νC] is exponentially vanishing in n if m ⩾ 2.

Let us now make several important remarks. First, note that part (i) of Corollary 1 includes as a particular example the cost function C_G of (1). Second, part (ii) of this corollary also includes as particular examples operators with $N \in O (1)$ , as well as $N \in O (poly (n))$ . Finally, we remark that F_n(L, l) becomes trivial when the number of layers L is Ω(poly(n)), however, as we discuss below, we can still find that Var[∂_νC_G] vanishes exponentially in this case.

Our second main theorem shows that barren plateaus can be avoided for shallow circuits by employing local cost functions. Here we consider m-local cost functions where each ${\hat{O}}_{i}$ acts nontrivially on at most m qubits and (on these qubits) can be expressed as ${\hat{O}}_{i} = {\hat{O}}_{i}^{μ_{i}} \otimes {\hat{O}}_{i}^{μ^{'}}$ :

where

{\hat{O}}_{i}^{μ_{i}}

are operators acting on m/2 qubits which can be written as a tensor product of Pauli operators. Here, we assume the summation in Eq. () includes two possible cases as schematically shown in Fig. 3b: First, when

{\hat{O}}_{i}^{μ_{i}}

(

{\hat{O}}_{i}^{μ^{'}}

) acts on the first (last) m/2 qubits of a given S_k, and second, when

{\hat{O}}_{i}^{μ_{i}}

(

{\hat{O}}_{i}^{μ^{'}}

) acts on the last (first) m/2 qubits of a given S_k (S_k+1). This type of cost function includes any ultralocal cost function (i.e., where the

{\hat{O}}_{i}

are one-body) as in (), and also VQE Hamiltonians with up to m/2 neighbor interactions. Then, the following theorem holds.

Theorem 2

Consider a trainable parameter θ^ν in a block W of the ansatz in Fig. 3. Let Var[∂_νC] be the variance of the partial derivative of an m-local cost function C (with O given by (16)) with respect to θ^ν. W_A, W_B of (7), and each block in V(θ) form a local 2-design, then Var[∂_νC] is lower bounded by

with

where

i_{L}

is the set of i indices whose associated operators

{\hat{O}}_{i}

act on qubits in the forward light-cone

L

of W, and

k_{L_{B}}

is the set of k indices whose associated subsystems S_k are in the backward light-cone

L_{B}

of W. Here we defined the function

ϵ (M) = D_{HS} (M, Tr (M) 1 / d_{M})

where D_HS is the Hilbert–Schmidt distance and d_M is the dimension of the matrix M. In addition,

ρ_{k, k^{'}}

is the partial trace of the input state ρ down to the subsystems

S_{k} S_{k + 1} . . . S_{k^{'}}

Let us make a few remarks. First, note that the $ϵ ({\hat{O}}_{i})$ in the lower bound indicates that training V(θ) is easier when ${\hat{O}}_{i}$ is far from the identity. Second, the presence of $ϵ (ρ_{k, k^{'}})$ in G_n(L, l) implies that we have no guarantee on the trainability of a parameter θ^ν in W if ρ is maximally mixed on the qubits in the backwards light-cone.

From Theorem 2 we derive the following corollary for m-local cost functions, which guarantees the trainability of the ansatz for shallow circuits.

Corollary 2

Consider the function F_n(L, l). Let O be an operator of the form (16), as in Theorem 2. If at least one term $c_{i}^{2} ϵ (ρ_{k, k^{'}}) ϵ ({\hat{O}}_{i})$ in the sum in (18) vanishes no faster than Ω(1/poly(n)), and if the number of layers L is $O (\log (n))$ , then

On the other hand, if at least one term

c_{i}^{2} ϵ (ρ_{k, k^{'}}) ϵ ({\hat{O}}_{i})

in the sum in () vanishes no faster than

Ω (1 / 2^{poly (\log (n))})

, and if the number of layers is

O (poly (\log (n)))

, then

Hence, when L is $O (poly (\log (n)))$ there is a transition region where the lower bound vanishes faster than polynomially, but slower than exponentially.

We finally justify the assumption of each block being a local 2-design from the fact that shallow circuit depths lead to such local 2-designs. Namely, it has been shown that one-dimensional 2-designs have efficient quantum circuit descriptions, requiring $O (m^{2})$ gates to be exactly implemented³⁸, or $O (m)$ to be approximately implemented^39,40. Hence, an L-layered ansatz in which each block forms a 2-design can be exactly implemented with a depth $D \in O (m^{2} L)$ , and approximately implemented with $D \in O (m L)$ . For the case of two-dimensional connectivity, it has been shown that approximate 2-designs require a circuit depth of $O (\sqrt{m})$ to be implemented⁴⁰. Therefore, in this case the depth of the layered ansatz is $D \in O (\sqrt{m} L)$ . The latter shows that increasing the dimensionality of the circuit reduces the circuit depth needed to make each block a 2-design.

Moreover, it has been shown that the Alternating Layered Ansatz of Fig. 3 will form an approximate one-dimensional 2-design on n qubits if the number of layers is $O (n)$ ⁴⁰. Hence, for deep circuits, our ansatz behaves like a random circuit and we recover the barren plateau result of⁹ for both local and global cost functions.

Numerical simulations

As an important example to illustrate the cost-function-dependent barren plateau phenomenon, we consider quantum autoencoders^11,41–44. In particular, the pioneering VQA proposed in ref. ¹¹ has received significant literature attention, due to its importance to quantum machine learning and quantum data compression. Let us briefly explain the algorithm in ref. ¹¹.

Consider a bipartite quantum system AB composed of n_A and n_B qubits, respectively, and let ${p_{μ}, ∣ ψ_{μ} ⟩}$ be an ensemble of pure states on AB. The goal of the quantum autoencoder is to train a gate sequence V(θ) to compress this ensemble into the A subsystem, such that one can recover each state $∣ ψ_{μ} ⟩$ with high fidelity from the information in subsystem A. One can think of B as the “trash” since it is discarded after the action of V(θ).

To quantify the degree of data compression, ref. ¹¹ proposed a cost function of the form:

where

ρ_{AB}^{in} = \sum_{μ} p_{μ} ∣ ψ_{μ} ⟩ ⟨ ψ_{μ} ∣

is the ensemble-average input state,

ρ_{B}^{out} = \sum_{μ} p_{μ} {Tr}_{A} [∣ ψ^{'} ⟩ ⟨ ψ^{'} ∣]

is the ensemble-average trash state, and

∣ψ^{'}⟩ = V (θ) ∣ ψ_{μ} ⟩

. Equation () makes it clear that

C_{G}^{'}

has the form in (), and

O_{G}^{'} = 1_{AB} - 1_{A} \otimes ∣0⟩ ⟨0∣

is a global observable of the form in (). Hence, according to Corollary 1,

C_{G}^{'}

exhibits a barren plateau for large n_B. (Specifically, Corollary 1 applies in this context when n_A < n_B). As a result, large-scale data compression, where one is interested in discarding large numbers of qubits, will not be possible with

C_{G}^{'}

To address this issue, we propose the following local cost function

where

O_{L}^{'} = 1_{AB} - \frac{1}{n_{B}} \sum_{j = 1}^{n_{B}} 1_{A} \otimes ∣0⟩ {⟨0∣}_{j} \otimes 1_{\bar{j}}

, and

1_{\bar{j}}

is the identity on all qubits in B except the jth qubit. As shown in the Supplementary Note ,

C_{L}^{'}

satisfies

C_{L}^{'} \leq C_{G}^{'} \leq n_{B} C_{L}^{'}

, which implies that

C_{L}^{'}

is faithful (vanishing under the same conditions as

C_{G}^{'}

). Furthermore, note that

O_{L}^{'}

has the form in (). Hence Corollary 2 implies that

C_{L}^{'}

does not exhibit a barren plateau for shallow ansatzes.

Here we simulate the autoencoder algorithm to solve a simple problem where n_A = 1, and where the input state ensemble ${p_{μ}, ∣ ψ_{μ} ⟩}$ is given by

In order to analyze the cost-function-dependent barren plateau phenomenon, the dimension of subsystem B is gradually increased as n_B = 10, 15, …, 100.

Numerical results

In our heuristics, the gate sequence V(θ) is given by two layers of the ansatz in Fig. 4, so that the number of gates and parameters in V(θ) increases linearly with n_B. Note that this ansatz is a simplified version of the ansatz in Fig. 3, as we can only generate unitaries with real coefficients. All parameters in V(θ) were randomly initialized and as detailed in the Methods section, we employ a gradient-free training algorithm that gradually increases the number of shots per cost-function evaluation.

Fig. 4

Alternating Layered Ansatz for V(θ) employed in our numerical simulations.

Each layer is composed of control-Z gates acting on alternating pairs of neighboring qubits which are preceded and followed by single qubit rotations around the y-axis, $R_{y} (θ_{i}) = e^{- i θ_{i} σ_{y} / 2}$ . Shown is the case of two layers, n_A = 1, and n_B = 10 qubits. The number of variational parameters and gates scales linearly with n_B: for the case shown there are 71 gates and 51 parameters. While each block in this ansatz will not form an exact local 2-design, and hence does not fall under our theorems, one can still obtain a cost-function-dependent barren plateau.

Analysis of the n-dependence. Figure 5 shows representative results of our numerical implementations of the quantum autoencoder in ref. ¹¹ obtained by training V(θ) with the global and local cost functions respectively given by (22) and (23). Specifically, while we train with finite sampling, in the figures we show the exact cost-function values versus the number of iterations. Here, the top (bottom) axis corresponds to the number of iterations performed while training with $C_{G}^{'}$ ( $C_{L}^{'}$ ). For n_B = 10 and 15, Fig. 5 shows that we are able to train V(θ) for both cost functions. For n_B = 20, the global cost function initially presents a plateau in which the optimizing algorithm is not able to determine a minimizing direction. However, as the number of shots per function evaluation increases, one can eventually minimize $C_{G}^{'}$ . Such result indicates the presence of a barren plateau where the gradient takes small values which can only be detected when the number of shots becomes sufficiently large. In this particular example, one is able to start training at around 140 iterations.

Fig. 5

Cost versus number of iterations for the quantum autoencoder problem defined by Eqs. (25)–(26).

In all cases we employed two layers of the ansatz shown in Fig. 4, and we set n_A = 1, while increasing n_B = 10, 15, …, 100. The top (bottom) axis corresponds to the global cost function $C_{G}^{'}$ of Eq. (22) (local cost function $C_{L}^{'}$ of (23)). As can be seen, $C_{G}^{'}$ can be trained up to n_B = 20 qubits, while $C_{L}^{'}$ trained in all cases. These results indicate that global cost function presents a barren plateau even for a shallow depth ansatz, and this can be avoided by employing a local cost function.

When n_B > 20 we are unable to train the global cost function, while always being able to train our proposed local cost function. Note that the number of iterations is different for $C_{G}^{'}$ and $C_{L}^{'}$ , as for the global cost function case we reach the maximum number of shots in fewer iterations. These results indicate that the global cost function of (22) exhibits a barren plateau where the gradient of the cost function vanishes exponentially with the number of qubits, and which arises even for constant depth ansatzes. We remark that in principle one can always find a minimizing direction when training $C_{G}^{'}$ , although this would require a number of shots that increases exponentially with n_B. Moreover, one can see in Fig. 5 that randomly initializing the parameters always leads to $C_{G}^{'} \approx 1$ due to the narrow gorge phenomenon (see Proposition 1), i.e., where the probability of being near the global minimum vanishes exponentially with n_B.

On the other hand, Fig. 5 shows that the barren plateau is avoided when employing a local cost function since we can train $C_{L}^{'}$ for all considered values of n_B. Moreover, as seen in Fig. 5, $C_{L}^{'}$ can be trained with a small number of shots per cost-function evaluation (as small as 10 shots per evaluation).

Analysis of the L-dependence. The power of Theorem 2 is that it gives the scaling in terms of L. While one can substitute a function of n for L as we did in Corollary 2, one can also directly study the scaling with L (for fixed n). Figure 6 shows the dependence on L when training $C_{L}^{'}$ for the autoencoder example with n_A = 1 and n_B = 10. As one can see, the training becomes more difficult as L increases. Specifically, as shown in the inset it appears to become exponentially more difficult, as the number of shots needed to achieve a fixed cost value grows exponentially with L. This is consistent with (and hence verifies) our bound on the variance in Theorem 2, which vanishes exponentially in L, although we remark that this behavior can saturate for very large L⁹.

Fig. 6

Local cost $C_{L}^{'}$ versus number of iterations for the quantum autoencoder problem in Eqs. (25–26) with n_B = 10.

Each curve corresponds to a different number of layers L in the ansatz of Fig. 4 with L = 2,…, 20. Curves were averaged over 9 instances of the autoencoder. As the number of layers increases, the optimization becomes harder. Inset: Number of shots needed to reach cost values of $C_{L}^{'} = 0.02$ and $C_{L}^{'} = 0.05$ versus number of layers L. As L increases the number of shots needed to reach the indicated cost values appears to increase exponentially.

In summary, even though the ansatz employed in our heuristics is beyond the scope of our theorems, we still find cost-function-dependent barren plateaus, indicating that the cost-function dependent barren plateau phenomenon might be more general and go beyond our analytical results.

Discussion

While scaling results have been obtained for classical neural networks⁴⁵, very few such results exist for the trainability of parametrized quantum circuits, and more generally for quantum neural networks. Hence, rigorous scaling results are urgently needed for VQAs, which many researchers believe will provide the path to quantum advantage with near-term quantum computers. One of the few such results is the barren plateau theorem of ref. ⁹, which holds for VQAs with deep, hardware-efficient ansatzes.

In this work, we proved that the barren plateau phenomenon extends to VQAs with randomly initialized shallow Alternating Layered Ansatzes. The key to extending this phenomenon to shallow circuits was to consider the locality of the operator O that defines the cost function C. Theorem 1 presented a universal upper bound on the variance of the gradient for global cost functions, i.e., when O is a global operator. Corollary 1 stated the asymptotic scaling of this upper bound for shallow ansatzes as being exponentially decaying in n, indicating a barren plateau. Conversely, Theorem 2 presented a universal lower bound on the variance of the gradient for local cost functions, i.e., when O is a sum of local operators. Corollary 2 notes that for shallow ansatzes this lower bound decays polynomially in n. Taken together, these two results show that barren plateaus are cost-function-dependent, and they establish a connection between locality and trainability.

In the context of chemistry or materials science, our present work can inform researchers about which transformation to use when mapping a fermionic Hamiltonian to a spin Hamiltonian⁴⁶, i.e., Jordan-Wigner versus Bravyi–Kitaev⁴⁷. Namely, the Bravyi–Kitaev transformation often leads to more local Pauli terms, and hence (from Corollary 2) to a more trainable cost function. This fact was recently numerically confirmed⁴⁸.

Moreover, the fact that Corollary 2 is valid for arbitrary input quantum states may be useful when constructing variational ansatzes. For example, one could propose a growing ansatz method where one appends $\log (n)$ layers of the hardware-efficient ansatz to a previously trained (hence fixed) circuit. This could then lead to a layer-by-layer training strategy where the previously trained circuit can correspond to multiple layers of the same hardware-efficient ansatz.

We remark that our definition of a global operator (local operator) is one that is both non-local (local) and many body (few body). Therefore, the barren plateau phenomenon could be due to the many-bodiness of the operator rather than the non-locality of the operator; we leave the resolution of this question to future work. On the other hand, our Theorem 1 rules out the possibility that barren plateaus could be due to cardinality, i.e., the number of terms in O when decomposed as a sum of Pauli products⁴⁹. Namely, case (ii) of this theorem implies barren plateaus for O of essentially arbitrary cardinality, and hence cardinality is not the key variable at work here.

We illustrated these ideas for two examples VQAs. In Fig. 2, we considered a simple state-preparation example, which allowed us to delve deeper into the cost landscape and uncover another phenomenon that we called a narrow gorge, stated precisely in Proposition 1. In Fig. 5, we studied the more important example of quantum autoencoders, which have generated significant interest in the quantum machine learning community. Our numerics showed the effects of barren plateaus: for more than 20 qubits we were unable to minimize the global cost function introduced in¹¹. To address this, we introduced a local cost function for quantum autoencoders, which we were able to minimize for system sizes of up to 100 qubits.

There are several directions in which our results could be generalized in future work. Naturally, we hope to extend the narrow gorge phenomenon in Proposition 1 to more general VQAs. In addition, we hope in the future to unify our theorems 1 and 2 into a single result that bounds the variance as a function of a parameter that quantifies the locality of O. This would further solidify the connection between locality and trainability. Moreover, our numerics suggest that our theorems (which are stated for exact 2-designs) might be extendable in some form to ansatzes composed of simpler blocks, like approximate 2-designs³⁹.

We emphasize that while our theorems are stated for a hardware-efficient ansatz and for costs that are of the form (6), it remains an interesting open question as to whether other ansatzes, cost function, and architectures exhibit similar scaling behavior as that stated in our theorems. For instance, we have recently shown⁵⁰ that our results can be extended to a more general type of Quantum Neural Network called dissipative quantum neural networks⁵¹. Another potential example of interest could be the unitary-coupled cluster (UCC) ansatz in chemistry⁵², which is intended for use in the $O (poly (n))$ depth regime³⁴. Therefore it is important to study the key mathematical features of an ansatz that might allow one to go from trainability for $O (\log n)$ depth (which we guarantee here for local cost functions) to trainability for $O (poly n)$ depth.

Finally, we remark that some strategies have been developed to mitigate the effects of barren plateaus^32,33,53,54. While these methods are promising and have been shown to work in certain cases, they are still heuristic methods with no provable guarantees that they can work in generic scenarios. Hence, we believe that more work needs to be done to better understand how to prevent, avoid, or mitigate the effects of barren plateaus.

Methods

In this section, we provide additional details for the results in the main text, as well as a sketch of the proofs for our main theorems. We note that the proof of Theorem 2 comes before that of Theorem 1 since the latter builds on the former. More detailed proofs of our theorems are given in the Supplementary Information.

Variance of the cost function partial derivative

Let us first discuss the formulas we employed to compute Var[∂_νC]. Let us first note that without loss of generality, any block W_kl(θ_kl) in the Alternating Layered Ansatz can be written as a product of ζ_kl independent gates from a gate alphabet $A = {G_{μ} (θ)}$ as

where each

θ_{k l}^{ν}

is a continuous parameter. Here,

G_{ν} (θ_{k l}^{ν}) = R_{ν} (θ_{k l}^{ν}) Q_{ν}

where Q_ν is an unparametrized gate and

R_{ν} (θ_{k l}^{ν}) = e^{- i θ_{k l}^{ν} σ_{ν} / 2}

with σ_ν a Pauli operator. Note that W_kL denotes a block in the last layer of V(θ).

For the proofs of our results, it is helpful to conceptually break up the ansatz as follows. Consider a block W_kl(θ_kl) in the lth layer of the ansatz. For simplicity, we henceforth use W to refer to a given W_kl(θ_kl). Let S_w denote the m-qubit subsystem that contains the qubits W acts on, and let $S_{\bar{w}}$ be the (n − m) subsystem on which W acts trivially. Similarly, let $H_{w}$ and $H_{\bar{w}}$ denote the Hilbert spaces associated with S_w and $S_{\bar{w}}$ , respectively. Then, as shown in Fig. 3a, V(θ) can be expressed as

Here,

1_{\bar{w}}

is the identity on

H_{\bar{w}}

, and V_R contains the gates in the (forward) light-cone

L

of W, i.e., all gates with at least one input qubit causally connected to the output qubits of W. The latter allows us to define

S_{L}

as the subsystem of all qubits in

L

Let us here recall that the Alternating Layered Ansatz can be implemented with either a 1D or 2D square connectivity as schematically depicted in Fig. 3c. We remark that the following results are valid for both cases as the light-cone structure will be the same. Moreover, the notation employed in our proofs applies to both the 1D and 2D cases. Hence, there is no need to refer to the connectivity dimension in what follows.

Let us now assume that θ^ν is a parameter inside a given block W, we obtain from (6), (27), and (28)

with

Finally, from (29) we can derive a general formula for the variance:

which holds if W_A and W_B form independent 2-designs. Here, the summation runs over all bitstrings p, q,

p^{'}

q^{'}

of length 2^n−m. In addition, we defined

where

{Tr}_{\bar{w}}

indicates the trace over subsystem

S_{\bar{w}}

, and Ω_qp and Ψ_qp are operators on

H_{w}

defined as

We derive Eq. () in the Supplementary Note .

Computing averages over V

Here we introduce the main tools employed to compute quantities of the form 〈…〉_V. These tools are used throughout the proofs of our main results.

Let us first remark that if the blocks in V(θ) are independent, then any average over V can be computed by averaging over the individual blocks, i.e., ${⟨ \dots ⟩}_{V} = {⟨ \dots ⟩}_{W_{11}, \dots, W_{k l}, \dots} = {⟨ \dots ⟩}_{V_{L}, W, V_{R}}$ . For simplicity let us first consider the expectation value over a single block W in the ansatz. In principle 〈…〉_W can be approximated by varying the parameters in W and sampling over the resulting 2^m × 2^m unitaries. However, if W forms a t-design, this procedure can be simplified as it is known that sampling over its distribution yields the same properties as sampling random unitaries from the unitary group with respect to the unique normalized Haar measure.

Explicitly, the Haar measure is a uniquely defined left and right-invariant measure over the unitary group dμ(W), such that for any unitary matrix A ∈ U(2^m) and for any function f(W) we have

where the integration domain is assumed to be U(2^m) throughout this work. Consider a finite set

{W_{y}}_{y \in Y}

(of size ∣Y∣) of unitaries W_y, and let P_(t, t)(W) be an arbitrary polynomial of degree at most t in the matrix elements of W and at most t in those of W^†. Then, this finite set is a t-design if³⁸

From the general form of C in Eq. (6) we can see the cost function is a polynomial of degree at most 2 in the matrix elements of each block W_kl in V(θ), and at most 2 in those of ${(W_{k l})}^{†}$ . Then, if a given block W forms a 2-design, one can employ the following elementwise formula of the Weingarten calculus^55,56 to explicitly evaluate averages over W up to the second moment:

where w_ij are the matrix elements of W, and

Intuition behind the main results

The goal of this section is to provide some intuition for our main results. Specifically, we show here how the scaling of the cost function variance can be related to the number of blocks we have to integrate to compute ${⟨ \dots ⟩}_{V_{R}, V_{L}}$ , the locality of the cost functions, and with the number of layers in the ansatz.

First, we recall from Eq. (38) that integrating over a block leads to a coefficient of the order 1/2^2m. Hence, we see that the more blocks one integrates over, the worse the scaling can be.

We now generalize the warm-up example. Let V(θ) be a single layer of the alternating ansatz of Fig. 3, i.e., V(θ) is a tensor product of m-qubit blocks W_k: = W_k1, with k = 1, …, ξ (and with ξ = n/m), so that θ^ν is in the block $W_{k^{'}}$ . In the Supplementary Note 5 we generalize this scenario to the when the input state is not $∣0⟩$ , but instead is an arbitrary state ρ.

From (31), the partial derivative of the global cost function in (1) can be expressed as

where

υ = \frac{{(2^{m} - 1)}^{2} Tr [σ_{ν}^{2}]}{2^{2 m} {(2^{m + 1} - 1)}^{2}}

. From () we have that in order to compute () one needs to integrate over ξ − 1 blocks. Then, since each integration leads to a coefficient 1/2^2m the variance will scale as

O {() 1 / (2^{2 m})}^{ξ - 1} = O (1 / 2^{2 n})

. Hence, the scaling of the variance gets worse for each block we integrate (such that the block acts on qubits we are measuring).

On the other hand, for a local cost let us consider a single term in (3) where $j \in S_{\tilde{k}}$ , so that

Here, in contrast to the global case, we only have to integrate over a single block irrespective of the total number of qubits. Hence, we now find that the variance scales as

O (1 / n^{2})

, where we remark that the scaling is essentially given by the prefactor 1/n² in ().

Let us now briefly provide some intuition as to why the scaling of local cost gradients becomes exponentially vanishing with the number of layers as in Theorem 2. Consider the case when V(θ) contains L layers of the ansatz in Fig. 3. Moreover, as shown in Fig. 7, let W be in the first layer, and let O_i act on the m topmost qubits of $L$ . As schematically depicted in Fig. 7, we now have to integrating over L − 1 blocks. Then, as proved in the Supplementary Note 5, integrating over a block leads to a coefficient 2^m/2/(2^m + 1). Hence, after integrating L − 1 times, we obtain a coefficient $2^{m (L - 1) / 2} / {(2^{m} + 1)}^{L - 1}$ which vanishes no faster than $Ω (1 / poly (n))$ if $m L \in O (\log (n))$ .

Fig. 7

The block W is in the first layer of V(θ), and the operator O_i acts on the topmost m qubits in the forward light-cone $L$ of W.

Dashed thick lines indicate the backward light-cone of O_i. All but L − 1 blocks simplify to identity in Ω_qp of Eq. (34).

As we discuss below, for more general scenarios the computation of Var[∂_νC] becomes more complex.

Sketch of the proof of the main theorems

Here we present a sketch of the proof of Theorems 1 and 2. We refer the reader to the Supplementary Information for a detailed version of the proofs.

As mentioned in the previous subsection, if each block in V(θ) forms a local 2-design, then we can explicitly calculate expectation values 〈…〉_W via (38). Hence, to compute ${⟨ Δ Ω_{q p}^{p^{'} q^{'}} ⟩}_{V_{R}}$ , and ${⟨ Δ Ψ_{p q}^{p^{'} q^{'}} ⟩}_{V_{L}}$ in (31), one needs to algorithmically integrate over each block using the Weingarten calculus. In order to make such computation tractable, we employ the tensor network representation of quantum circuits.

For the sake of clarity, we recall that any two-qubit gate can be expressed as $U = \sum_{i j k l} U_{i j k l} ∣i j⟩ ⟨k l∣$ , where U_ijkl is a 2 × 2 × 2 × 2 tensor. Similarly, any block in the ansatz can be considered as a $2^{\frac{m}{2}} \times 2^{\frac{m}{2}} \times 2^{\frac{m}{2}} \times 2^{\frac{m}{2}}$ tensor. As schematically shown in Fig. 8a, one can use the circuit description of $Ω_{q p}^{i}$ and Ψ_pq to derive the tensor network representation of terms such as $Tr [Ω_{q p}^{i} Ω_{q^{'} p^{'}}^{j}]$ . Here, $Ω_{q p}^{i}$ is obtained from (34) by simply replacing O with O_i.

Fig. 8

Tensor-network representations of the terms relevant to Var[∂_νC].

a Representation of $Ω_{q p}^{i}$ of Eq. (34) (left), where the superscript indicates that O is replaced by O_i. In this illustration, we show the case of n = 2m qubits, and we denote $P_{p q} = ∣q⟩ ⟨p∣$ . We also show the representation of $Tr [Ω_{q p}^{i} Ω_{q^{'} p^{'}}^{j}]$ (middle) and of $Tr [Ω_{q p}^{i}] Tr [Ω_{q^{'} p^{'}}^{j}]$ (right). b By means of the Weingarten calculus we can algorithmically integrate over each block in the ansatz. After each integration one obtains four new tensors according to Eq. (38). Here we show the tensor obtained after the integrations $\int d μ (W) Tr [Ω_{q p}^{i} Ω_{q^{'} p^{'}}^{j}]$ and $\int d μ (W) Tr [Ω_{q p}^{i}] Tr [Ω_{q^{'} p^{'}}^{j}]$ , which are needed to compute ${⟨ Δ Ω_{q p}^{q^{'} p^{'}} ⟩}_{V_{R}}$ as in Eq. (31). c As shown in the Supplementary Note 1, the result of the integration is a sum of the form (49), where the deltas over p, q, $p^{'}$ , and $q^{'}$ arise from the contractions between P_pq and $P_{p^{'} q^{'}}$ .

In Fig. 8b we depict an example where we employ the tensor network representation of $Ω_{q p}^{i}$ to compute the average of $Tr [Ω_{q p}^{i} Ω_{q^{'} p^{'}}^{j}]$ , and $Tr [Ω_{q p}^{i}] Tr [Ω_{q^{'} p^{'}}^{j}]$ . As expected, after each integration one obtains a sum of four new tensor networks according to Eq. (38).

Proof of Theorem 2

Let us first consider an m-local cost function C where O is given by (16), and where ${\hat{O}}_{i}$ acts nontrivially in a given subsystem S_k of $S$ . In particular, when ${\hat{O}}_{i}$ is of this form the proof is simplified, although the more general proof is presented in the Supplementary Note 6. If $S_{k} ⊄ S_{L}$ we find $Ω_{q p}^{i} \propto 1_{w}$ , and hence

The latter implies that we only have to consider the operators

{\hat{O}}_{i}

which act on qubits inside of the forward light-cone

L

of W.

Then, as shown in the Supplementary Information

Here we remark that the proportionality factor contains terms of the form

δ_{{(p, q)}_{S_{\bar{w}}^{+}}} δ_{{(p^{'}, q^{'})}_{S_{\bar{w}}^{+}}} δ_{{(p, q^{'})}_{S_{\bar{w}}^{-}}} δ_{{(p^{'}, q)}_{S_{\bar{w}}^{-}}}

(where

S_{\bar{w}}^{+} \cup S_{\bar{w}}^{-} = S_{\bar{w}}

), which arises from the different tensor contractions of

P_{p q} = ∣q⟩ ⟨p∣

in Fig. 8c. It is then straightforward to show that

where we define

{\tilde{ρ}}^{-}

as the reduced states of

\tilde{ρ} = V_{L} ρ V_{L}^{†}

in the Hilbert spaces associated with subsystems

S_{w} \cup S_{\bar{w}}^{-}

. Here we recall that D_HS is the Hilbert–Schmidt distance

D_{H S} (ρ, σ) = Tr [{(ρ - σ)}^{2}]

By employing properties of D_HS one can show (see Supplementary Note 6)

where

{\tilde{ρ}}_{w} = {Tr}_{S_{\bar{w}}^{-}} [{\tilde{ρ}}^{-}]

. We can then leverage the tensor network representation of quantum circuits to algorithmically integrate over each block in V_L and compute

{⟨ D_{HS} ({\tilde{ρ}}_{w}, \frac{1}{2^{m}}) ⟩}_{V_{L}}

. One finds

with

t_{k, k^{'}} \geq \frac{2^{m l}}{{(2^{m} + 1)}^{2 l}}

\forall k, k^{'}

, and

ϵ (ρ_{k, k^{'}})

defined in Theorem 2. Combining these results leads to Theorem 2. Moreover, as detailed in the Supplementary information, Theorem 2 is also valid when O is of the form ().

Proof of Theorem 1

Let us now provide a sketch of the proof of Theorem 1, case (i). Here we denote for simplicity ${\hat{O}}_{k} : = {\hat{O}}_{1 k}$ . We leave the proof of case (ii) for the Supplementary Note 7. In this case there are now operators O_i which act outside of the forward light-cone $L$ of W. Hence, it is convenient to include in V_R not only all the gates in $L$ but also all the blocks in the final layer of V(θ) (i.e., all blocks W_kL, with k = 1, …ξ). We can define $S_{\bar{L}}$ as the compliment of $S_{L}$ , i.e., as the subsystem of all qubits which are not in $L$ (with associated Hilbert-space $H_{\bar{L}}$ ). Then, we have $V_{R} = V_{L} \otimes V_{\bar{L}}$ and $∣q⟩ ⟨p∣ = ∣q⟩ {⟨p∣}_{L} \otimes ∣q⟩ {⟨p∣}_{\bar{L}}$ , where we define $V_{\bar{L}} : = ⨂_{k \in k_{\bar{L}}} W_{k L}$ , $∣q⟩ {⟨p∣}_{L} : {= ⨂}_{k \in k_{L}} ∣q⟩ {⟨p∣}_{k}$ , and $∣q⟩ {⟨p∣}_{\bar{L}} : {= ⨂}_{k^{'} \in k_{\bar{L}}} ∣q⟩ {⟨p∣}_{k^{'}}$ . Here, we define $k_{L} : = {k : S_{k} \subseteq S_{L}}$ and $k_{\bar{L}} : = {k : S_{k} \subseteq S_{\bar{L}}}$ , which are the set of indices whose associated qubits are inside and outside $L$ , respectively. We also write $O = c_{0} 1 + c_{1} {\hat{O}}_{L} \otimes {\hat{O}}_{\bar{L}}$ , where we define ${\hat{O}}_{L} : {= ⨂}_{k \in k_{L}} {\hat{O}}_{k}$ and ${\hat{O}}_{\bar{L}} : {= ⨂}_{k^{'} \in k_{\bar{L}}} {\hat{O}}_{k^{'}}$ .

Using the fact that the blocks in V(θ) are independent we can now compute ${⟨ Δ Ω_{q p}^{q^{'} p^{'}} ⟩}_{V_{R}} = {⟨ Δ Ω_{q p}^{q^{'} p^{'}} ⟩}_{V_{\bar{L}}, V_{L}}$ . Then, from the definition of Ω_pq in Eq. (34) and the fact that one can always express

with

and where

{Tr}_{L \cap \bar{w}}

indicates the partial trace over the Hilbert-space associated with the qubits in

S_{L} \cap S_{\bar{w}}

. As detailed in the Supplementary Information we can use Eq. () to show that

On the other hand, as shown in the Supplementary Note 7 (and as schematically depicted in Fig. 8c), when computing the expectation value ${⟨ \dots ⟩}_{V_{L}}$ in (47), one obtains

where we defined

δ_{τ} = δ_{{(p, q)}_{S_{\bar{τ}}}} δ_{{(p^{'}, q^{'})}_{S_{\bar{τ}}}} δ_{{(p, q^{'})}_{S_{τ}}} δ_{{(p^{'}, q)}_{S_{τ}}}

t_{τ} \in R

S_{τ} \cup S_{\bar{τ}} = S_{L} \cap S_{\bar{w}}

(with

S_{τ} \neq \emptyset

), and

Here we use the notation

{Tr}_{x_{τ}}

to indicate the trace over the Hilbert space associated with subsystem

S_{x_{τ}}

, such that

S_{x_{τ}} \cup S_{y_{τ}} \cup S_{z_{τ}} = S_{L}

. As shown in the Supplementary Note , combining the deltas in Eqs. (), and () with

{⟨Δ Ψ_{p q}^{p^{'} q^{'}}⟩}_{V_{L}}

leads to Hilbert–Schmidt distances between two quantum states as in (). One can then use the following bounds

D_{HS} (ρ_{1}, ρ_{2}) \leq 2

Δ O_{τ}^{L} \leq \prod_{k \in k_{L}} r_{k}^{2}

, and ∑_τt_τ ≤ 2, along with some additional simple algebra explained in the Supplementary Information to obtain the upper bound in Theorem 1.

Ansatz and optimization method

Here we describe the gradient-free optimization method used in our heuristics. First, we note that all the parameters in the ansatz are randomly initialized. Then, at each iteration, one solves the following sub-space search problem: $\min_{s \in R^{d}} C (θ + A \cdot s)$ , where A is a randomly generated isometry, and s = (s₁, …, s_d) is a vector of coefficients to be optimized over. We used d = 10 in our simulations. Moreover, the training algorithm gradually increases the number of shots per cost-function evaluation. Initially, C is evaluated with 10 shots, and once the optimization reaches a plateau, the number of shots is increased by a factor of 3/2. This process is repeated until a termination condition on the value of C is achieved, or until we reach the maximum value of 10⁵ shots per function evaluation. While this is a simple variable-shot approach, we remark that a more advanced variable-shot optimizer can be found in ref. ⁵⁷.

Finally, let us remark that while we employ a sub-space search algorithm, in the presence of barren plateaus all optimization methods will (on average) fail unless the algorithm has a precision (i.e., number of shots) that grows exponentially with n. The latter is due to the fact that an exponentially vanishing gradient implies that on average the cost function landscape will essentially be flat, with the slope of the order of $O (1 / 2^{n})$ . Hence, unless one has a precision that can detect such small changes in the cost value, one will not be able to determine a cost minimization direction with gradient-based, or even with black-box optimizers such as the Nelder–Mead method^58–61.

Peer review information: Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-021-21728-w.

Acknowledgements

We thank Jacob Biamonte, Elizabeth Crosson, Burak Sahinoglu, Rolando Somma, Guillaume Verdon, and Kunal Sharma for helpful conversations. All authors were supported by the Laboratory Directed Research and Development (LDRD) program of Los Alamos National Laboratory (LANL) under project numbers 20180628ECR (for M.C.), 20190065DR (for A.S., L.C., and P.J.C.), and 20200677PRD1 (for T.V.). M.C. and A.S. were also supported by the Center for Nonlinear Studies at LANL. P.J.C. acknowledges initial support from the LANL ASC Beyond Moore’s Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Quantum Computing Application Teams program.

Author contributions

The project was conceived by M.C., L.C., and P.J.C. The manuscript was written by M.C., A.S., T.V., L.C., and P.J.C. T.V. proved Proposition 1. M.C. and A.S. proved Proposition 2 and Theorems 1–2. M.C., A.S., T.V., and P.J.C. proved Corollaries 1–2. M.C., A.S., T.V., L.C., and P.J.C. analyzed the quantum autoencoder. For the numerical results, T.V. performed the simulation in Fig. 2, and L.C. performed the simulation in Fig. 5.

Data availability

Data generated and analyzed during the current study are available from the corresponding author upon reasonable request.

Competing interests

The authors declare no competing interests.

References

Preskill

. Quantum computing in the NISQ era and beyond.

doi: 10.22331/q-2018-08-06-79

McClean

, Romero

, Babbush

, Aspuru-Guzik

. The theory of variational hybrid quantum-classical algorithms.

doi: 10.1088/1367-2630/18/2/023023

Farhi, E., Goldstone, J. & Gutmann, S. A quantum approximate optimization algorithm. Preprint at https://arxiv.org/abs/1411.4028 (2014).

Hadfield

. From the quantum approximate optimization algorithm to a quantum alternating operator ansatz.

doi: 10.3390/a12020034

Hastings, M. B. Classical and quantum bounded depth approximation algorithms. Preprint at https://arxiv.org/abs/1905.07047 (2019).

Kandala

. Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets.

doi: 10.1038/nature23879

Arute

. Hartree-fock on a superconducting qubit quantum computer.

doi: 10.1126/science.abb9811

Harrigan, Matthew P. et al. Quantum approximate optimization of non-planar graph problems on a planar superconducting processor. Nature Physics 1–5 (2021).

McClean

, Boixo

, Smelyanskiy

, Babbush

, Neven

. Barren plateaus in quantum neural network training landscapes.

doi: 10.1038/s41467-018-07090-4

10.

Peruzzo

. A variational eigenvalue solver on a photonic quantum processor.

doi: 10.1038/ncomms5213

11.

Romero

, Olson

, Aspuru-Guzik

. Quantum autoencoders for efficient compression of quantum data.

doi: 10.1088/2058-9565/aa8072

12.

Johnson, P. D., Romero, J., Olson, J., Cao, Y. & Aspuru-Guzik, A. QVECTOR: an algorithm for device-tailored quantum error correction. Preprint at https://arxiv.org/abs/1711.02249 (2017).

13.

Koczor

, Endo

, Jones

, Matsuzaki

, Benjamin

. Variational-state quantum metrology.

doi: 10.1088/1367-2630/ab965e

14.

Khatri

. Quantum-assisted quantum compiling.

doi: 10.22331/q-2019-05-13-140

15.

Jones, T. & Benjamin, S. C. Quantum compilation and circuit optimisation via energy dissipation. Preprint at https://arxiv.org/abs/1811.03147 (2019).

16.

Sharma

, Khatri

, Cerezo

, Coles

. Noise resilience of variational quantum compiling.

doi: 10.1088/1367-2630/ab784c

17.

Heya, K., Suzuki, Y., Nakamura, Y. & Fujii, K. Variational quantum gate optimization. Preprint at https://arxiv.org/abs/1810.12745 (2018).

18.

LaRose

, Tikku

, O’Neel-Judy

, Cincio

, Coles

. Variational quantum state diagonalization.

19.

Bravo-Prieto

, García-Martín

, Latorre

. Quantum singular value decomposer.

doi: 10.1103/PhysRevA.101.062310

20.

, Benjamin

. Efficient variational quantum simulator incorporating active error minimization.

21.

Heya, K., Nakanishi, K. M., Mitarai, K. & Fujii, K. Subspace variational quantum simulator. Preprint at https://arxiv.org/abs/1904.08566 (2019).

22.

Cirstoiu

. Variational fast forwarding for quantum simulation beyond the coherence time.

doi: 10.1038/s41534-019-0235-y

23.

Otten, M., Cortes, C. L. & Gray, S. K. Noise-resilient quantum dynamics using symmetry-preserving ansatzes. Preprint at https://arxiv.org/abs/1910.06284 (2019).

24.

Cerezo

, Poremba

, Cincio

, Coles

. Variational quantum fidelity estimation.

doi: 10.22331/q-2020-03-26-248

25.

Carolan

. Variational quantum unsampling on a quantum photonic processor.

doi: 10.1038/s41567-019-0747-6

26.

Arrasmith

, Cincio

, Sornborger

, Zurek

, Coles

. Variational consistent histories as a hybrid algorithm for quantum foundations.

doi: 10.1038/s41467-019-11417-0

27.

Bravo-Prieto, C. et al. Variational quantum linear solver. Preprint at https://arxiv.org/abs/1909.05820 (2019).

28.

Xu, X. et al. Variational algorithms for linear algebra. Preprint at https://arxiv.org/abs/1909.03898 (2019).

29.

Huang, H.-Y., Bharti, K. & Rebentrost, P. Near-term quantum algorithms for linear systems of equations. Preprint at https://arxiv.org/abs/1909.07344 (2019).

30.

Cerezo, M. & Coles, P. J. Impact of barren plateaus on the hessian and higher order derivatives. Preprint at https://arxiv.org/abs/2008.07454 (2020).

31.

Arrasmith, A., Cerezo, M., Czarnik, P., Cincio, L. & Coles, P. J. Effect of barren plateaus on gradient-free optimization. Preprint at https://arxiv.org/abs/2011.12245 (2020).

32.

Grant

, Wossnig

, Ostaszewski

, Benedetti

. An initialization strategy for addressing barren plateaus in parametrized quantum circuits.

doi: 10.22331/q-2019-12-09-214

33.

Verdon, G. et al. Learning to learn with quantum neural networks via classical neural networks. Preprint at https://arxiv.org/abs/1907.05415 (2019a).

34.

Lee

, Huggins

, Head-Gordon

, Whaley

. Generalized unitary coupled cluster wave functions for quantum computation.

doi: 10.1021/acs.jctc.8b01004

35.

Verdon, G. et al. Quantum graph neural networks. Preprint at https://arxiv.org/abs/1909.12264 (2019b).

36.

IBM Q: Quantum devices and simulators. https://www.research.ibm.com/ibm-q/technology/devices/.

37.

Arute

. Quantum supremacy using a programmable superconducting processor.

doi: 10.1038/s41586-019-1666-5

38.

Dankert

, Cleve

, Emerson

, Livine

. Exact and approximate unitary 2-designs and their application to fidelity estimation.

doi: 10.1103/PhysRevA.80.012304

39.

Brandao

FGSL

, Harrow

, Horodecki

. Local random quantum circuits are approximate polynomial-designs.

doi: 10.1007/s00220-016-2706-8

40.

Harrow, A. & Mehraban, S. Approximate unitary t-designs by short random quantum circuits using nearest-neighbor and long-range gates. Preprint at https://arxiv.org/abs/1809.06957 (2018).

41.

Wan

, Dahlsten

, Kristjánsson

, Gardner

, Kim

. Quantum generalisation of feedforward neural networks.

doi: 10.1038/s41534-017-0032-4

42.

Lamata

. Quantum autoencoders via quantum adders with genetic algorithms.

doi: 10.1088/2058-9565/aae22b

43.

Pepper

, Tischler

, Pryde

. Experimental realization of a quantum autoencoder: the compression of qutrits via machine learning.

doi: 10.1103/PhysRevLett.122.060501

44.

Verdon, G., Pye, J. & Broughton, M. A universal training algorithm for quantum deep learning. Preprint at https://arxiv.org/abs/1806.09729 (2018).

45.

Pennington, J. & Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR. org, 2017) pp. 2798–2806, http://proceedings.mlr.press/v70/pennington17a.html.

46.

Tranter

, Love

, Mintert

, Coveney

. A comparison of the bravyi–kitaev and jordan–wigner transformations for the quantum simulation of quantum chemistry.

doi: 10.1021/acs.jctc.8b00450

47.

Bravyi

, Kitaev

. Fermionic quantum computation.

doi: 10.1006/aphy.2002.6254

48.

Uvarov

, Biamonte

, Yudin

. Variational quantum eigensolver for frustrated quantum systems.

doi: 10.1103/PhysRevB.102.075104

49.

Biamonte, J. Universal variational quantum computation. Preprint at https://arxiv.org/abs/1903.04500 (2019).

50.

Sharma, K., Cerezo, M., Cincio, L. & Coles, P. J. Trainability of dissipative perceptron-based quantum neural networks. Preprint at https://arxiv.org/abs/2005.12458 (2020b).

51.

Beer

. Training deep quantum neural networks.

doi: 10.1038/s41467-020-14454-2

52.

Bartlett

, Musiał

. Coupled-cluster theory in quantum chemistry.

doi: 10.1103/RevModPhys.79.291

53.

Volkoff, T. & Coles, P. J. Large gradients via correlation in random parameterized quantum circuits. Quant. Sci. Technol. (2021). https://iopscience.iop.org/article/10.1088/2058-9565/abd891.

54.

Skolik

. Layerwise learning for quantum neural networks.

55.

Benoît

, Śniady

. Integration with respect to the haar measure on unitary, orthogonal and symplectic group.

doi: 10.1007/s00220-006-1554-3

56.

Puchała

, Miszczak

. Symbolic integration with respect to the haar measure on the unitary groups.

57.

Kübler

, Arrasmith

, Cincio

, Coles

. An adaptive optimizer for measurement-frugal variational algorithms.

doi: 10.22331/q-2020-05-11-263

58.

Nelder

, Mead

. A simplex method for function minimization.

doi: 10.1093/comjnl/7.4.308

59.

Paley

REAC

, Zygmund

. A note on analytic functions in the unit circle.

doi: 10.1017/S0305004100010112

60.

Fukuda

, König

, Nechita

. RTNI–a symbolic integrator for haar-random tensor networks.

doi: 10.1088/1751-8121/ab434b

61.

Nielsen, M. A. & Chuang, I. L. Quantum computation and quantum information: 10th Anniversary Edition, 10th ed. (Cambridge University Press, New York, NY, USA, 2011)