Learning Foundations I

Chair: Lenaic Chizat (EPFL)

Time: August 16th, 11:10am-12:00pm ET, 17:10-18:00 CET, 23:10-0:00 GMT+8

  • Generalization and Memorization: The Bias Potential Model, Hongkang Yang (Princeton University), Weinan E (Princeton University)

Paper Highlight, by Song Mei

Learning probability distributions, as in generative modeling and density estimation, is among the most essential tasks in machine learning, but its mathematical foundations are not yet well established. This paper provides a theoretical foundation for the bias potential model, a simple mathematical model of probability distributions in which the potential function is modeled by a one-hidden-layer neural network. Two remarkable contributions of this paper are: (1) it establishes approximation, generalization, and optimization efficiency results under mild assumptions on the target distribution, extending the theory of learning functions to learning distributions; (2) it shows that, although the model diverges or memorizes the samples in the long run, early stopping can be adopted to achieve good generalization.
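
To fix ideas, here is a minimal sketch of the model form, assuming the standard exponential-tilting convention for bias potentials; the sign of V, the base density p_0, and the scaling are illustrative assumptions and may differ from the paper's exact setup.

```latex
% Sketch of a bias potential model (conventions here are assumptions, not the paper's exact ones):
% the learned density is an exponential tilt of a base density p_0 by a potential V,
% and V is parameterized by a one-hidden-layer network.
\begin{align}
  p_V(x) &= \frac{e^{-V(x)}\, p_0(x)}{\int e^{-V(x')}\, p_0(x')\, \mathrm{d}x'},
  &
  V(x) &= \frac{1}{m}\sum_{j=1}^{m} a_j\, \sigma\!\left(w_j^\top x + b_j\right).
\end{align}
% Training fits the parameters (a_j, w_j, b_j) by a likelihood / KL-type objective on samples
% from the target distribution; early stopping is what prevents memorization of those samples.
```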

slides video paper

  • Orientation-Preserving Vectorized Distance Between Curves, Hasan Pourmahmoodaghababa (University of Utah), Jeff Phillips (University of Utah)

Paper Highlight, by Xiuyuan Cheng

The paper introduces a new distance for continuous curves that is orientation-preserving and provides an alternative to the Fréchet distance and Dynamic Time Warping. The method is based on landmarks: feature vectors are computed at a set of points in the ambient space by measuring their relationship (distance and direction) to the directed curves. The work generalizes earlier ideas on curve distances by taking curve orientation into account. The new curve distance is realizable in a lifted Euclidean space and can therefore be computed efficiently. Theoretically, the paper proves the stability of the distance to perturbations of the landmarks, as well as its relationship to the Hausdorff and Fréchet distances. Algorithmically, the new distance is less expensive to compute than existing methods and allows faster nearest-neighbor classification. Empirically, the new approach shows a significant advantage on a synthetic dataset. The work provides a promising tool for machine learning applications based on curve distances, such as shape analysis.
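
As a concrete illustration of the landmark idea, the sketch below vectorizes a directed polygonal curve by recording, at each landmark, a distance whose sign depends on which side of the directed curve the landmark lies, and compares curves by the Euclidean distance between the resulting vectors. This is a simplified stand-in for the paper's construction: the feature definition and function names are illustrative assumptions, not the authors' exact definitions.

```python
# Illustrative sketch (not the authors' exact construction): a landmark-based,
# orientation-aware vectorization of a directed polygonal curve.
import numpy as np

def signed_distance_feature(curve, landmarks):
    """curve: (n, 2) array of ordered vertices; landmarks: (k, 2) array of points."""
    feats = np.empty(len(landmarks))
    for i, q in enumerate(landmarks):
        best_d, best_sign = np.inf, 1.0
        for a, b in zip(curve[:-1], curve[1:]):           # each directed segment a -> b
            ab, aq = b - a, q - a
            t = np.clip(np.dot(aq, ab) / np.dot(ab, ab), 0.0, 1.0)
            p = a + t * ab                                # closest point on the segment
            d = np.linalg.norm(q - p)
            if d < best_d:
                cross = ab[0] * aq[1] - ab[1] * aq[0]     # > 0 if q lies left of a -> b
                best_d = d
                best_sign = np.sign(cross) if cross != 0 else 1.0
        feats[i] = best_sign * best_d                     # distance, signed by orientation
    return feats

def curve_distance(curve1, curve2, landmarks):
    v1 = signed_distance_feature(curve1, landmarks)
    v2 = signed_distance_feature(curve2, landmarks)
    return np.linalg.norm(v1 - v2) / np.sqrt(len(landmarks))

# Example: the same trajectory traversed in opposite directions is far apart under this
# distance, while the two traversals are indistinguishable for the Hausdorff distance.
rng = np.random.default_rng(0)
landmarks = rng.uniform(-1, 1, size=(20, 2))
xs = np.linspace(-1, 1, 50)
curve = np.stack([xs, np.sin(xs)], axis=1)
print(curve_distance(curve, curve[::-1], landmarks))
```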

slides video paper

  • Solvable Model for Inheriting the Regularization through Knowledge Distillation, Luca Saglietti (EPFL), Lenka Zdeborova (EPFL)

Paper Highlight, by Grant Rotskoff

This paper by Saglietti and Zdeborová introduces an analytically tractable model of knowledge distillation (KD), a type of transfer learning in which a complex model is used as the teacher network for a student network of reduced complexity. The authors develop a framework for evaluating the performance of the distilled model that resembles calculations arising in statistical mechanics: the weights of the teacher play the role of a quenched disorder, and the typical behavior of the student can be accessed by analyzing the Gibbs measure associated with the KD loss function in the low-temperature limit. The results show, in the context of a tractable Gaussian mixture model, that regularization is transferred in the learning process. Furthermore, the authors explain how the generalization properties of the teacher are inherited by the student. The paper also features a very detailed and well-explained replica calculation!
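
Schematically (with notation simplified and assumed here rather than copied from the paper), the student's typical behavior is read off from the Gibbs measure over student weights associated with the distillation loss, with the teacher weights held fixed as quenched disorder:

```latex
% Schematic of the statistical-mechanics setup described above: the teacher weights w_t
% are trained first and then held fixed ("quenched"); the typical student is described by
% the Gibbs measure over student weights w_s at inverse temperature beta, in the
% low-temperature limit beta -> infinity.
\begin{equation}
  \mu_\beta(w_s \mid w_t) \;=\;
  \frac{\exp\!\big(-\beta\, \mathcal{L}_{\mathrm{KD}}(w_s;\, w_t)\big)}
       {\int \exp\!\big(-\beta\, \mathcal{L}_{\mathrm{KD}}(w_s';\, w_t)\big)\, \mathrm{d}w_s'} .
\end{equation}
% Quenched averages over the teacher weights and the data are then computed with the
% replica method mentioned in the highlight.
```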

slides video paper

  • Dynamic Algorithms for Online Multiple Testing, Ziyu Xu (Carnegie Mellon University), Aaditya Ramdas (Carnegie Mellon University)

Paper Highlight, by Can Yang

The authors introduce SupLORD, a method for online multiple testing that aims to control the false discovery exceedance (FDX) and improve statistical power. The manuscript is well organized, with a very clear logical flow. The authors first provide a comprehensive introduction to online multiple testing and then highlight the three major contributions of SupLORD: delayed FDX control, dynamic scheduling, and theoretical improvements in FDR control. Accordingly, they show how SupLORD achieves these contributions through the derivation of the boost sequence, the formulation of dynamic scheduling, and a theoretical analysis of FDR control. Using simulation studies, the authors demonstrate the advantages of SupLORD over related methods. It is highly anticipated that SupLORD will find use in real applications of online multiple testing.
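
For readers new to the setting, the sketch below implements a generic LORD-style online testing rule: each hypothesis receives a test level spent from a wealth sequence, and budget is earned back at every rejection. It is only meant to illustrate the kind of procedure SupLORD builds on; it does not reproduce SupLORD's boost sequence, dynamic scheduling, or delayed FDX guarantee, and the spending sequence and constants are illustrative assumptions.

```python
# Illustrative LORD-style online testing rule (NOT SupLORD itself).
import numpy as np

def lord_style_online_test(p_values, alpha=0.05, w0=0.025):
    """Assign test levels alpha_t online and return boolean rejection decisions."""
    t_idx = np.arange(1, len(p_values) + 1)
    gamma = 1.0 / (t_idx * (t_idx + 1))          # a nonincreasing spending sequence
    gamma = gamma / gamma.sum()                  # normalized over the observed stream
    rejections, reject_times = [], []
    for t, p in enumerate(p_values):
        # budget spent from the initial wealth plus budget earned at past rejections
        alpha_t = gamma[t] * w0
        for j, tau in enumerate(reject_times):
            bonus = (alpha - w0) if j == 0 else alpha
            alpha_t += gamma[t - tau - 1] * bonus
        reject = p <= alpha_t
        rejections.append(reject)
        if reject:
            reject_times.append(t)
    return np.array(rejections)

# Example: a stream whose first 20 hypotheses are non-null (small p-values).
rng = np.random.default_rng(1)
p = np.concatenate([rng.beta(0.2, 5.0, size=20), rng.uniform(size=200)])
print(lord_style_online_test(p).sum(), "rejections")
```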

slides video paper

  • Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?, Mariia Seleznova (Ludwig Maximilian University of Munich), Gitta Kutyniok (Ludwig Maximilian University of Munich)

Paper Highlight, by Clement Hongler

This paper investigates, from a numerical viewpoint, how the initialization and training of finite-width neural networks differ from the behavior of infinite-width networks in the so-called NTK regime. It addresses this question simply and quite systematically, with a large number of experiments across widths, depths, initialization variances, and nonlinearities; the authors observe the NTK variance at initialization, its evolution during training, and the loss at the end of training, and they compare the results with theoretical NTK predictions derived in the article, in particular around the so-called edge-of-chaos transition. The article is very easy to read and effectively conveys, for many simple cases, a picture of when the NTK approximation holds and when it does not, making it a desirable addition to the literature.
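
As a small illustration of the kind of finite-width effect at stake (an assumed toy setup, not the paper's experiments), the sketch below estimates how much one empirical NTK entry of a one-hidden-layer ReLU network fluctuates across random initializations at several widths; in the infinite-width limit the kernel is deterministic, so the spread should shrink as the width grows.

```python
# Toy estimate of NTK fluctuations at initialization for f(x) = a^T relu(W x / sqrt(d)) / sqrt(m).
import numpy as np

def empirical_ntk(x1, x2, W, a):
    """Empirical NTK entry Theta(x1, x2) = grad_theta f(x1) . grad_theta f(x2)."""
    d, m = len(x1), len(a)
    z1, z2 = W @ x1 / np.sqrt(d), W @ x2 / np.sqrt(d)
    h1, h2 = np.maximum(z1, 0), np.maximum(z2, 0)             # hidden activations
    d1, d2 = (z1 > 0).astype(float), (z2 > 0).astype(float)   # ReLU derivatives
    grad_a = h1 @ h2 / m                                      # output-weight contribution
    grad_W = (a**2 * d1 * d2).sum() * (x1 @ x2) / (m * d)     # hidden-weight contribution
    return grad_a + grad_W

rng = np.random.default_rng(0)
d = 10
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
for m in [10, 100, 1000, 10000]:
    samples = []
    for _ in range(200):                                      # independent initializations
        W, a = rng.standard_normal((m, d)), rng.standard_normal(m)
        samples.append(empirical_ntk(x1, x2, W, a))
    print(f"width {m:6d}: mean {np.mean(samples):.3f}, std {np.std(samples):.3f}")
```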

slides video paper