Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank
Applied and Computational Harmonic Analysis (IF 2.5), Pub Date: 2023-09-06, DOI: 10.1016/j.acha.2023.101595
Hung-Hsu Chou, Carsten Gieshoff, Johannes Maly, Holger Rauhut

In deep learning, it is common to use more network parameters than training points. In such an over-parameterized regime, there are typically many networks that achieve zero training error, so the training algorithm induces an implicit bias on the computed solution. In practice, (stochastic) gradient descent tends to prefer solutions that generalize well, which offers a possible explanation for the success of deep learning. In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and an estimation problem. Although this setting is not over-parameterized, our analysis nevertheless provides insight into the phenomenon of implicit bias. In fact, we derive a rigorous analysis of the dynamics of vanilla gradient descent and characterize the dynamical convergence of the spectrum. We are able to accurately locate time intervals in which the effective rank of the iterates is close to the effective rank of a low-rank projection of the ground-truth matrix. In practice, these intervals can be used as early-stopping criteria when a certain regularity is desired. We also provide empirical evidence of implicit bias in more general scenarios, such as matrix sensing and random initialization. This suggests that deep learning prefers trajectories whose complexity, measured in terms of effective rank, is monotonically increasing, which we believe is a fundamental concept for the theoretical understanding of deep learning.
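The following is a minimal numerical sketch of the setting the abstract describes, not the authors' code: vanilla gradient descent on a depth-3 linear factorization W3 W2 W1 fitted to a rank-3 ground-truth matrix, with the spectrum and an effective rank tracked along the trajectory. The depth, step size, scaled-identity initialization, and the entropy-based effective-rank definition (Roy and Vetterli) are illustrative assumptions rather than the paper's exact experimental setup.

```python
# Sketch: gradient descent on a deep matrix factorization, tracking how the
# spectrum of the product W3 @ W2 @ W1 converges and how its effective rank
# grows along the trajectory. All hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, depth = 20, 3

# Rank-3 ground truth with well-separated singular values 5, 2, 0.5.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sigma = np.zeros(n)
sigma[:3] = [5.0, 2.0, 0.5]
Y = U @ np.diag(sigma) @ V.T


def product(factors):
    """Right-to-left product factors[-1] @ ... @ factors[0]; identity if empty."""
    out = np.eye(n)
    for F in factors:
        out = F @ out
    return out


def effective_rank(M, eps=1e-12):
    """exp of the entropy of the normalized singular-value distribution."""
    sv = np.linalg.svd(M, compute_uv=False)
    p = sv / (sv.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))


# Small scaled-identity initialization of every factor (a balanced start).
alpha = 0.1
W = [alpha * np.eye(n) for _ in range(depth)]

lr, steps = 0.02, 2000
for t in range(steps + 1):
    prod = product(W)
    R = prod - Y  # residual of the loss 0.5 * ||W3 W2 W1 - Y||_F^2
    if t % 200 == 0:
        top = np.linalg.svd(prod, compute_uv=False)[:3]
        print(f"step {t:4d}  loss {0.5 * np.linalg.norm(R)**2:9.4f}  "
              f"eff. rank {effective_rank(prod):5.2f}  "
              f"top singular values {np.round(top, 2)}")
    # Gradient of the loss with respect to each factor W_j:
    # d/dW_j 0.5*||A W_j B - Y||_F^2 = A^T (A W_j B - Y) B^T,
    # where A collects the factors applied after W_j and B those before it.
    grads = []
    for j in range(depth):
        A = product(W[j + 1:])
        B = product(W[:j])
        grads.append(A.T @ R @ B.T)
    W = [Wj - lr * G for Wj, G in zip(W, grads)]

# The singular values of the product are typically picked up one at a time,
# so the effective rank plateaus near 1, then 2, then 3 -- the stair-like,
# low-rank-biased trajectory the abstract refers to.
```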




Updated: 2023-09-06