## Chairs: Steven Brunton (UW) and Urban Fasel (UW)

### Time: August 19th, 11:10am-12:00pm ET, 17:10-18:00 CET, 23:10-00:00 GMT+8

**Borrowing From the Future: Addressing Double Sampling in Model-free Control**, Yuhua Zhu (Stanford University), Zachary Izzo (Stanford); Lexing Ying (Stanford University)

*Paper Highlight, by Antonio Celani*

Bellman residual minimization with stochastic gradient descent (SGD) is a stable temporal-difference method whose usefulness is limited by the need for two independent samples of the next state, rather than just one; this second independent sample is often not available. To get around this issue, the authors extend the borrowing-from-the-future (BFF) algorithm to model-free control based on action-value functions. Their analysis shows that when the underlying dynamics vary slowly with respect to the actions, the algorithm approximates unbiased SGD well. The authors then confirm their theoretical findings with numerical simulations. This paper addresses an important issue in temporal-difference control with function approximation with great clarity, and paves the way toward further development of this algorithmic approach to reinforcement learning.
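The double-sampling issue and the BFF surrogate can be illustrated with a minimal sketch for a linear value function; `phi`, `traj`, and `r` below are illustrative stand-ins, not the paper's notation:

```python
import numpy as np

def bff_sgd_step(w, phi, traj, t, r, gamma, lr):
    """One SGD step on the Bellman residual for V(s) = w . phi(s).
    An unbiased gradient needs a second, independent draw of the next
    state; BFF fakes it by borrowing the increment of the *next*
    observed transition, which is accurate when dynamics change slowly."""
    s, s1, s2 = traj[t], traj[t + 1], traj[t + 2]
    delta = r(s) + gamma * w @ phi(s1) - w @ phi(s)   # TD error
    s1_tilde = s + (s2 - s1)                          # borrowed "second sample"
    grad = delta * (gamma * phi(s1_tilde) - phi(s))   # BFF gradient estimate
    return w - lr * grad
```

The key line is `s1_tilde`: instead of re-sampling the next state from `s` (impossible with a single trajectory), the increment of the following observed transition is reused.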

**Ground States of Quantum Many Body Lattice Models via Reinforcement Learning**, Willem Gispen (University of Cambridge), Austen Lamacraft (University of Cambridge)

*Paper Highlight, by Lin Lin*

This paper formulates the problem of finding the ground state of lattice quantum systems as a reinforcement learning (RL) problem. It focuses on two main setups: 1) mapping the Feynman-Kac representation to RL in continuous time; 2) mapping the stochastic representation of the Schrödinger equation to RL in discrete time. In the first case, imaginary-time evolution yields the ground state in the infinite-time limit; in the second case, two different ways of normalizing the transition probabilities are given. The authors propose a logarithmic transformation that turns the stochastic representation into a Bellman equation. The validity of the transformation rests on the assumption that the Hamiltonian is stoquastic, together with the Perron-Frobenius theorem. Compared to other techniques such as Stochastic Reconfiguration (SR), this RL method offers two benefits: 1) less data may be needed for each update step; 2) samples can be generated more efficiently.
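In the continuum, the logarithmic transformation is the classical Hopf-Cole change of variables; a schematic version (written here with $H = -\tfrac12\Delta + V$ for illustration, not in the paper's lattice notation) reads:

```latex
% Imaginary-time Schrodinger equation and its Feynman-Kac representation
\partial_\tau \psi = -H\psi, \qquad H = -\tfrac12\Delta + V,
\qquad
\psi(x,\tau) = \mathbb{E}\!\left[\, e^{-\int_0^\tau V(B_s)\,ds}\,
  \psi_0(B_\tau) \,\middle|\, B_0 = x \right].

% The log transform u = -\log\psi turns this linear equation into a
% Hamilton-Jacobi-Bellman equation, i.e. the value function of a
% stochastic optimal control problem:
\partial_\tau u = \tfrac12\Delta u - \tfrac12|\nabla u|^2 + V .
```

The quadratic term $\tfrac12|\nabla u|^2$ is what gives the equation its optimal-control (Bellman) interpretation, with the drift playing the role of the control.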

**Decentralized Multi-Agents by Imitation of a Centralized Controller**, Alex Lin (UCLA), Mark Debord (NAVAIR); Gary Hewer (NAVAIR); Katia Estabridis (NAVAIR); Guido F Montufar (UCLA / Max Planck Institute MIS); Stanley Osher (UCLA)

*Paper Highlight, by Dmitry Borisovich Rokhlin*

The paper concerns the construction of decentralized agent strategies with local observations via a dynamical version of supervised learning. The construction is based on a centralized controller (expert), which is trained under a full-observability assumption. The fruitful idea is to sequentially extend the training set by applying the current agent policies, and to train these policies on the extended set by penalizing their deviation from the expert policy with a suitable loss function. The authors provide interesting applications to complex learning tasks such as the StarCraft II game and cooperative navigation.
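The training loop described above can be sketched as follows (a minimal, DAgger-style imitation sketch; `expert`, `local_obs`, `env_rollout`, and `fit` are hypothetical placeholder callables, not the paper's API):

```python
import numpy as np

def imitate_centralized(expert, local_obs, env_rollout, fit, policies, n_iters):
    """Train decentralized policies to imitate a centralized expert.
    Each iteration rolls out the *current* agent policies, labels the
    visited global states with the expert's joint action, and refits
    each agent on its own local observations (dataset aggregation)."""
    data = [([], []) for _ in policies]          # per-agent (obs, label) pairs
    for _ in range(n_iters):
        for s in env_rollout(policies):          # states visited by the agents
            joint = expert(s)                    # expert sees the full state
            for i in range(len(policies)):
                data[i][0].append(local_obs(s, i))
                data[i][1].append(joint[i])
        policies = [fit(p, np.array(X), np.array(Y))
                    for p, (X, Y) in zip(policies, data)]
    return policies
```

Rolling out the current (imperfect) policies before querying the expert is the essential point: the agents are supervised on the state distribution they themselves induce, not only on the expert's.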

**Noise-Robust End-to-End Quantum Control using Deep Autoregressive Policy Networks**, Jiahao Yao (University of California, Berkeley), Paul Köttering (University of California, Berkeley); Hans Gundlach (University of California, Berkeley); Lin Lin (University of California, Berkeley); Marin Bukov (University of California, Berkeley)

*Paper Highlight, by Gautam Nallamala*

Machine learning models are a natural source of optimization methods for problems in quantum control, which are often formulated as the task of steering an initial quantum state to the ground state of a many-body system using an appropriate sequence of unitary evolution operators. Quantum many-body states are intrinsically high-dimensional, and the energy landscape has a complex, non-convex structure, which makes deep reinforcement learning (DRL) an appropriate framework for the task. In this paper, the authors develop a hybrid discrete-continuous DRL algorithm to prepare the ground state of a quantum Ising model. The task can be formulated as a goal-oriented “foraging” task: find an appropriate sequence of unitary operations that rotates an initial state toward a minimal-energy state (which is a priori unknown). Past work on the same setup either optimized the discrete choice of rotation operators and used black-box methods to find the duration of each operation (CD-QAOA), or considered only two operators and optimized the durations with a policy-gradient-based algorithm (PG-QAOA). While the former produced better solutions, the latter was found to be more robust to noise. This work combines insights from the two previous approaches and develops an algorithm (RL-QAOA) that achieves a good solution while also being noise-robust. The algorithm has potential applications beyond quantum control, for example in hierarchical control problems such as motor-sequence execution, where it is important not only to choose which actions from a broad class are to be executed, but also how each action is to be performed in a given context.
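The hybrid discrete-continuous action structure can be illustrated with a toy autoregressive sampler (a hedged sketch; `logits_fn` and `duration_fn` are hypothetical stand-ins for a policy network's two heads, not the paper's architecture):

```python
import numpy as np

def sample_protocol(logits_fn, duration_fn, horizon, rng=None):
    """Sample a control protocol autoregressively: at each step pick a
    discrete generator (which unitary to apply), then a continuous
    duration conditioned on that choice and the sequence so far."""
    rng = rng or np.random.default_rng(0)
    seq = []
    for _ in range(horizon):
        logits = np.asarray(logits_fn(seq), dtype=float)   # discrete head
        p = np.exp(logits - logits.max()); p /= p.sum()    # softmax over operators
        op = int(rng.choice(len(p), p=p))
        mu, sigma = duration_fn(seq, op)                   # continuous head
        tau = abs(rng.normal(mu, sigma))                   # duration >= 0
        seq.append((op, tau))
    return seq
```

Conditioning the duration head on the sampled operator (and both heads on the sequence so far) is what makes the policy autoregressive rather than a product of independent marginals.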

**Temporal-difference learning with nonlinear function approximation: lazy training and mean field regimes**, Andrea Agazzi (Duke University), Jianfeng Lu (Duke University)

*Paper Highlight, by Phan-Minh Nguyen*

The convergence behavior of the temporal-difference (TD) learning algorithm is an important open problem. The paper gives several positive results in the difficult nonlinear TD setting with wide two-layer neural networks, all via interesting connections with recent advances in the analysis of neural network dynamics: lazy and mean-field training. In particular, in the lazy regime, the paper neatly exploits the idea of linearization, which allows previous insights from the linear TD setting to be applied. In the mean-field regime, where linearization no longer holds, the paper again recognizes the applicability of a number of ideas from previous studies of mean-field gradient-descent training, such as Wasserstein flows, topological invariance, and the instability argument under universal approximation. It must be emphasized that these connections are not a priori obvious, and the technical work requires careful execution. With so much done in about 40 pages, the paper will be a nice read for those who seek new directions in theoretical reinforcement learning, as well as those who hope to find inspiration beyond the gradient-based supervised-learning context.
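The lazy-regime linearization can be summarized in one line: for a wide network $f(x;w)$ whose weights stay near their initialization $w_0$ during training, one writes (an illustrative schematic, not the paper's precise scaling):

```latex
f(x; w) \;\approx\; f(x; w_0) + \nabla_w f(x; w_0)^{\top} (w - w_0),
```

so nonlinear TD effectively reduces to linear TD with the fixed feature map $\phi(x) = \nabla_w f(x; w_0)$, and convergence results for the linear setting become available.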