Statistical Inference with Stochastic Gradient Algorithms

Abstract

Stochastic gradient algorithms are widely used for large-scale learning and inference problems. However, their use in practice is typically guided by heuristics and trial and error rather than rigorous, generalizable theory. We take a step toward better understanding the effect of the tuning parameters of these algorithms by characterizing the large-sample behavior of iterates of a very general class of preconditioned stochastic gradient algorithms with fixed step size, including stochastic gradient descent with and without additional Gaussian noise, momentum, and/or acceleration. We show that near a local optimum, the iterates converge weakly to paths of an Ornstein–Uhlenbeck process, and we provide sufficient conditions for the stationary distributions of the finite-sample processes to converge weakly to that of the limiting process. In particular, with appropriate choices of tuning parameters, the limiting stationary covariance can match the Bernstein–von Mises limit of the posterior, adjustments to the posterior for model misspecification, or the asymptotic distribution of the MLE; with naive tuning, the limit corresponds to none of these. Moreover, we argue that, in the large-sample regime, an essentially independent sample from the stationary distribution can be obtained after a fixed number of passes over the dataset. Our theoretical results show that properly tuned stochastic gradient algorithms offer a practical approach to obtaining inferences that are computationally efficient and statistically robust. These results are supported by several experiments on both simulated and real data.
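To make the algorithm class concrete, the following is a minimal sketch, not the paper's exact formulation: fixed-step-size preconditioned SGD on a quadratic objective near a local optimum, with an optional injected Gaussian noise term. The quadratic objective, the names stochastic_grad, C, h, and tau, and the specific preconditioner choice are all illustrative assumptions introduced here; the paper's actual tuning conditions determine which stationary covariance the iterates target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative quadratic objective near a local optimum theta_star:
# f(theta) = 0.5 * (theta - theta_star) @ A @ (theta - theta_star)
d = 2
theta_star = np.zeros(d)
A = np.array([[2.0, 0.3], [0.3, 1.0]])  # local Hessian (positive definite)

def stochastic_grad(theta, noise_scale=0.5):
    """Unbiased gradient estimate: the true gradient plus mean-zero noise,
    standing in for minibatch subsampling noise (an assumption here)."""
    return A @ (theta - theta_star) + noise_scale * rng.standard_normal(d)

# Fixed step size, (hypothetical) preconditioner, and injected-noise scale;
# these are the kinds of tuning parameters whose effect is characterized.
h = 0.05              # fixed step size
C = np.linalg.inv(A)  # preconditioner (here, inverse Hessian, for illustration)
tau = 0.0             # tau = 0 recovers SGD without additional Gaussian noise

theta = np.ones(d)    # start away from the optimum
burn_in, n_keep = 2_000, 10_000
samples = np.empty((n_keep, d))
for t in range(burn_in + n_keep):
    g = stochastic_grad(theta)
    theta = theta - h * C @ g + np.sqrt(2.0 * h * tau) * rng.standard_normal(d)
    if t >= burn_in:
        samples[t - burn_in] = theta

# Near theta_star the iterates behave like a discretized Ornstein-Uhlenbeck
# process, so the empirical covariance of the post-burn-in iterates estimates
# the stationary covariance discussed in the abstract.
print("stationary covariance estimate:\n", np.cov(samples.T))
```

In this sketch, whether np.cov(samples.T) approximates a posterior-like covariance, a misspecification-adjusted one, or the sampling covariance of the MLE depends entirely on how h, C, and tau are chosen; that dependence is what the paper's theory makes precise.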