## Chair: Rachel Ward (UT Austin)

### Time: August 16th, 16:20-17:10 CET, 10:20am-11:10am ET, 22:20-23:10 GMT+8

### Session Video

**Hessian-Aided Random Perturbation (HARP) Using Noisy Zeroth-Order Queries**, Jingyi Zhu (Alibaba Group)

*Paper Highlight*

This paper proposes a novel method for stochastic blackbox optimization using zeroth-order oracles (ZO), by using the ZO queries to estimate curvature information. The paper complements a nice theoretical analysis with compelling numerical experiments. All reviewers agreed this paper brings a very interesting contribution to the growing literature of black-box optimization.

**Average-Case Integrality Gap for Non-Negative Principal Component Analysis**, Afonso S Bandeira (ETH); Dmitriy Kunisky (New York University), Alex Wein (NYU)

*Paper Highlight, by Manuel Saenz*

“Although this work addresses the performance of a semi-definite program for studying non-negative principal component analysis, some of its relevance lies beyond this particular problem. Here the authors present very clear arguments to establish the limit of the performance of the algorithm under study; and contrast this limit with a previous conjecture built upon numerical experiments. All in all, it is a cautionary tale for the interpolation from finite size simulations. It is also a good example of the complex and interesting interaction between experiments and mathematics that is currently taking place in the field of Machine Learning.

**A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms**, Chao Ma (Princeton University), Lei Wu (Princeton University), Weinan E (Princeton University)

*Paper Highlight, by Pankaj Mehta*

The paper connects the continue-time limits of adaptive gradient descent methods, RMSProp and Adam, to the sign gradient descent algorithm and explores three types of typical phenomena in these adaptive algorithmsâ€™ training processes. By analyzing the signGD flow, this paper explains the fast initial convergence of these adaptive gradient algorithms with a learning rate approximating 0 and fixed momentum parameters. The connection, the convergence analysis, and experiments on verifying the three qualitative patterns are original and technically sound.

**BEAR: Sketching BFGS Algorithm for Ultra-High Dimensional Feature Selection in Sublinear Memory**, Amirali Aghazadeh (UC Berkeley), Vipul Gupta (UC Berkeley); Alex DeWeese (University of California); O. Ozan Koyluoglu (University of California, Berkeley); Ramchandran Kannan (University of California, Berkeley)

*Paper Highlight, by Michele Ceriotti*

As datasets grow in size, and featurizations grow in complexity, the sheer size of the design matrix limits the applicability of several machine-learning algorithms, including those to select the most relevant and informative features. The authors propose an algorithm to achieve second-order feature selection with a sub-linear memory footprint, by combining the BFGS quasi-Newton method with a Count Sketch storage of the approximate Hessian. They demonstrate dramatic improvements in both memory occupation and classification performance in comparison with first-order methods, making the treatment of very large datasets feasible on commodity hardware.

**Hessian Estimation via Stein’s Identity in Black-Box Problems**, Jingyi Zhu (Alibaba Group)

*Paper Highlight, by Alec Koppel*

This work puts forth a new Hessian estimation scheme for stochastic optimization problems when only a zero-th order oracle is available, i.e., the functional form of the objective is unavailable but one may acquire noisy function calls. Rather than construct a Hessian estimate based upon finite differencing (SPSA), which requires four zero-th order calls, the key insight of this work is to employ Stein’s identity for reducing the number of zero-th order calls. The asymptotic almost sure convergence of an approximate Newton method based upon this scheme is established, as well as its finite-time convergence rate, which alleviates some of the smoothness conditions required for related approaches. Moreover, the asymptotic characterization of the limiting covariance of the resulting parameter estimates is very interesting. Experimental results illustrate the merits of this approach in practice.

**Practical and Fast Momentum-Based Power Methods**, Tahseen Rabbani (University of Maryland), Apollo Jain (University of California, Berkeley); Arjun Rajkumar (University of Maryland); Furong Huang (University of Maryland)

*Paper Highlight, by Elizabeth Newman*

This paper proposes a delayed momentum power method (DMPower) and a delayed momentum streaming method (DMStream). Both rely on a pre-momentum phase to approximate the second largest eigenvalue which is then used in a power method with momentum to approximate the dominant eigenvalue and top eigenvector. The paper includes rigorous theoretical guarantees for the speed of convergence for each algorithm which notably does not rely on a priori knowledge of the second largest eigenvalue. These practical and fast algorithms are of great interest to the machine learning community for applications that use, e.g., spectral clustering or principal component analysis.