Abstract
In this paper, we study an online learning algorithm with a robust loss function \(\mathcal {L}_{\sigma }\) for regression over a reproducing kernel Hilbert space (RKHS). The loss function \(\mathcal {L}_{\sigma }\), which involves a scaling parameter \(\sigma >0\), covers a wide range of commonly used robust losses. The proposed algorithm is thus a robust alternative to online least squares regression for estimating the conditional mean function. For properly chosen \(\sigma \) and step size, we show that the last iterate of this online algorithm achieves optimal capacity-independent convergence in the mean square distance. Moreover, if additional information on the underlying function space is available, we also establish optimal capacity-dependent rates for strong convergence in the RKHS. To the best of our knowledge, both results are new to the literature on online learning.
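The abstract does not spell out the update rule, so the following is only a minimal sketch of what a robust online kernel regression step might look like, not the authors' exact algorithm. It assumes the Welsch (correntropy-induced) loss \(\sigma ^2(1-\exp (-u^2/(2\sigma ^2)))\) as one instance of \(\mathcal {L}_{\sigma }\), a Gaussian kernel, and a user-supplied step size schedule. All function names and default values are hypothetical.

```python
import math


def gaussian_kernel(x, z, bandwidth=0.5):
    """Gaussian kernel on the real line (an assumed choice of RKHS)."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))


def welsch_weight(residual, sigma):
    """Derivative of sigma^2 * (1 - exp(-u^2 / (2 sigma^2))) divided by u.

    The weight is close to 1 for small residuals (least squares regime)
    and decays to 0 for large residuals, which damps outliers.
    """
    return math.exp(-(residual ** 2) / (2.0 * sigma ** 2))


def robust_online_kernel_regression(samples, sigma, step_size,
                                    kernel=gaussian_kernel):
    """One pass of online gradient descent in the RKHS, starting from f = 0.

    Each sample (x, y) is used once. The iterate is stored as a kernel
    expansion f = sum_i coeffs[i] * K(centers[i], .).
    """
    centers, coeffs = [], []
    for t, (x, y) in enumerate(samples, start=1):
        prediction = sum(c * kernel(z, x) for z, c in zip(centers, coeffs))
        residual = prediction - y
        # Gradient step: f <- f - eta_t * L_sigma'(residual) * K(x, .)
        centers.append(x)
        coeffs.append(-step_size(t) * welsch_weight(residual, sigma) * residual)
    return centers, coeffs


def evaluate(centers, coeffs, x, kernel=gaussian_kernel):
    """Evaluate the last iterate at the point x."""
    return sum(c * kernel(z, x) for z, c in zip(centers, coeffs))


# Usage: a short stream with one gross outlier in the response.
stream = [(0.1 * i, math.sin(0.1 * i)) for i in range(30)]
stream[10] = (1.0, 50.0)  # corrupted observation
centers, coeffs = robust_online_kernel_regression(
    stream, sigma=2.0, step_size=lambda t: 0.5 / math.sqrt(t))
last_iterate_at_one = evaluate(centers, coeffs, 1.0)
```

As \(\sigma \) grows, the weight tends to 1 and the update reduces to the online least squares step, which is the sense in which such a scheme is a robust alternative to least squares. The polynomially decaying step size used here is an illustrative choice only.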
Acknowledgements
The authors are grateful to the anonymous referees for their careful reading of this paper and suggestions. The work of Zheng-Chu Guo is supported by Zhejiang Provincial Natural Science Foundation of China (Project No. LR20A010001), National Natural Science Foundation of China (Project Nos. U21A20426 and 12271473), and Fundamental Research Funds for the Central Universities (Project No. 2021XZZX001). The work of Andreas Christmann is partially supported by German Science Foundation (DFG) under Grant CH 291/3-1. The work of Lei Shi is supported by the National Natural Science Foundation of China (Project Nos. 12171039 and 12061160462) and Shanghai Science and Technology Program (Project Nos. 21JC1400600 and 20JC1412700).
Additional information
Communicated by Thomas Strohmer.
About this article
Cite this article
Guo, ZC., Christmann, A. & Shi, L. Optimality of Robust Online Learning. Found Comput Math (2023). https://doi.org/10.1007/s10208-023-09616-9
DOI: https://doi.org/10.1007/s10208-023-09616-9