Abstract
Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps—approximations of the Knothe–Rosenblatt (KR) rearrangement—are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on properties of the optimization problem that arises in learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima; and we show for target distributions satisfying certain tail conditions that the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes.
Notes
For any \({\varvec{z}}\in \mathbb {R}^d\), \({\varvec{x}}=S^{-1}({\varvec{z}})\) can be computed recursively as \(x_{k}=T^{k}({\varvec{x}}_{< k},z_k)\) for \(k=1,\dots ,d\), where the function \(T^{k}({\varvec{x}}_{< k},\cdot )\) is the inverse of \(x_k\mapsto S_{k}({\varvec{x}}_{< k},x_k)\). In practice, evaluating \(T^{k}\) requires solving a root-finding problem which is guaranteed to have a unique (real) root, and for which the bisection method converges geometrically fast. Therefore, \(S^{-1}({\varvec{z}})\) can be evaluated to machine precision in negligible computational time.
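The recursive inversion described above can be sketched in a few lines. The code below is illustrative only: the interface `S_k(x_prev, x_k)` and the toy maps are assumptions, not the paper's implementation. It expands a bracket around the unique root and then bisects, so the bracket halves at every step (geometric convergence).

```python
def invert_component(S_k, x_prev, z_k, tol=1e-12):
    """Solve S_k(x_prev, x_k) = z_k for x_k by bisection.

    S_k is assumed strictly increasing in its last argument,
    so the root exists and is unique."""
    lo, hi = -1.0, 1.0
    # Expand the bracket until it contains the root.
    while S_k(x_prev, lo) > z_k:
        lo *= 2.0
    while S_k(x_prev, hi) < z_k:
        hi *= 2.0
    # Bisection: the bracket halves at each step.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if S_k(x_prev, mid) < z_k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def invert_triangular(S, z):
    """Evaluate S^{-1}(z) component by component: x_k depends only
    on the already-computed x_{<k} and on z_k."""
    x = []
    for k, S_k in enumerate(S):
        x.append(invert_component(S_k, x[:k], z[k]))
    return x
```

For example, with the toy triangular map \(S_1(x_1)=x_1^3+x_1\) and \(S_2(x_1,x_2)=x_1+x_2\), `invert_triangular` recovers \(x=(1,2)\) from \(z=(2,3)\).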
That is, \(\Vert v_1\otimes \cdots \otimes v_k\Vert _{V_k} = \Vert v_1\Vert _{L^2_{\eta _1}}\Vert v_2\Vert _{L^2_{\eta _2}}\cdots \Vert v_{k-1}\Vert _{L^2_{\eta _{k-1}}} \Vert v_k\Vert _{H^1_{\eta _k}}\) for any \(v_j\in L^2_{\eta _j}\), \(j<k\), and \(v_k\in H^1_{\eta _k}\).
References
Ambrogioni, L., Güçlü, U., van Gerven, M. A. and Maris, E. (2017). The kernel mixture network: A nonparametric method for conditional density estimation of continuous random variables. arXiv preprint arXiv:1705.07111.
Anderes, E. and Coram, M. (2012). A general spline representation for nonparametric and semiparametric density estimates using diffeomorphisms. arXiv preprint arXiv:1205.5314.
Baptista, R., Hosseini, B., Kovachki, N. B. and Marzouk, Y. (2023). Conditional sampling with monotone GANs: from generative models to likelihood-free inference. arXiv preprint arXiv:2006.06755v3.
Baptista, R., Marzouk, Y., Morrison, R. E. and Zahm, O. (2021). Learning non-Gaussian graphical models via Hessian scores and triangular transport. arXiv preprint arXiv:2101.03093.
Bertsekas, D. P. (1997). Nonlinear programming. Journal of the Operational Research Society 48 334–334.
Bigoni, D., Marzouk, Y., Prieur, C. and Zahm, O. (2022). Nonlinear dimension reduction for surrogate modeling using gradient information. Information and Inference: A Journal of the IMA.
Bishop, C. M. (1994). Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University.
Bogachev, V. I., Kolesnikov, A. V. and Medvedev, K. V. (2005). Triangular transformations of measures. Sbornik: Mathematics 196 309.
Boyd, J. P. (1984). Asymptotic coefficients of Hermite function series. Journal of Computational Physics 54 382–410.
Brennan, M., Bigoni, D., Zahm, O., Spantini, A. and Marzouk, Y. (2020). Greedy inference with structure-exploiting lazy maps. Advances in Neural Information Processing Systems 33.
Chang, S.-H., Cosman, P. C. and Milstein, L. B. (2011). Chernoff-type bounds for the Gaussian error function. IEEE Transactions on Communications 59 2939–2944.
Chkifa, A., Cohen, A. and Schwab, C. (2015). Breaking the curse of dimensionality in sparse polynomial approximation of parametric PDEs. Journal de Mathématiques Pures et Appliquées 103 400–428.
Cohen, A. (2003). Numerical analysis of wavelet methods. Elsevier.
Cohen, A. and Migliorati, G. (2018). Multivariate approximation in downward closed polynomial spaces. In Contemporary Computational Mathematics-A celebration of the 80th birthday of Ian Sloan 233–282. Springer.
Cui, T. and Dolgov, S. (2021). Deep composition of tensor trains using squared inverse Rosenblatt transports. Foundations of Computational Mathematics 1–60.
Cui, T., Dolgov, S. and Zahm, O. (2023). Scalable conditional deep inverse Rosenblatt transports using tensor trains and gradient-based dimension reduction. Journal of Computational Physics 485 112103.
Cui, T., Tong, X. T. and Zahm, O. (2022). Prior normalization for certified likelihood-informed subspace detection of Bayesian inverse problems. Inverse Problems 38 124002.
Dinh, L., Sohl-Dickstein, J. and Bengio, S. (2017). Density estimation using Real NVP. In International Conference on Learning Representations.
Durkan, C., Bekasov, A., Murray, I. and Papamakarios, G. (2019). Neural spline flows. In Advances in Neural Information Processing Systems 7509–7520.
El Moselhy, T. A. and Marzouk, Y. M. (2012). Bayesian inference with optimal maps. Journal of Computational Physics 231 7815–7850.
Huang, C.-W., Chen, R. T., Tsirigotis, C. and Courville, A. (2020). Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization. In International Conference on Learning Representations.
Huang, C.-W., Krueger, D., Lacoste, A. and Courville, A. (2018). Neural Autoregressive Flows. In International Conference on Machine Learning 2083–2092.
Irons, N. J., Scetbon, M., Pal, S. and Harchaoui, Z. (2022). Triangular flows for generative modeling: Statistical consistency, smoothness classes, and fast rates. In International Conference on Artificial Intelligence and Statistics 10161–10195. PMLR.
Jaini, P., Kobyzev, I., Yu, Y. and Brubaker, M. (2020). Tails of Lipschitz triangular flows. In International Conference on Machine Learning 4673–4681. PMLR.
Jaini, P., Selby, K. A. and Yu, Y. (2019). Sum-of-squares polynomial flow. In International Conference on Machine Learning 3009–3018.
Katzfuss, M. and Schäfer, F. (2023). Scalable Bayesian transport maps for high-dimensional non-Gaussian spatial fields. Journal of the American Statistical Association 1–15.
Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 10215–10224.
Kobyzev, I., Prince, S. and Brubaker, M. (2020). Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press.
Kufner, A. and Opic, B. (1984). How to define reasonably weighted Sobolev spaces. Commentationes Mathematicae Universitatis Carolinae 25 537–554.
Lezcano Casado, M. (2019). Trivializations for gradient-based optimization on manifolds. Advances in Neural Information Processing Systems 32 9157–9168.
Lichman, M. (2013). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.
Lueckmann, J.-M., Boelts, J., Greenberg, D., Goncalves, P. and Macke, J. (2021). Benchmarking simulation-based inference. In International Conference on Artificial Intelligence and Statistics 343–351. PMLR.
Mallat, S. (1999). A wavelet tour of signal processing. Elsevier.
Marzouk, Y., Moselhy, T., Parno, M. and Spantini, A. (2016). Sampling via Measure Transport: An Introduction. In Handbook of Uncertainty Quantification 1–41. Springer International Publishing.
Migliorati, G. (2015). Adaptive polynomial approximation by means of random discrete least squares. In Numerical Mathematics and Advanced Applications-ENUMATH 2013 547–554. Springer.
Migliorati, G. (2019). Adaptive approximation by optimal weighted least-squares methods. SIAM Journal on Numerical Analysis 57 2217–2245.
Morrison, R., Baptista, R. and Marzouk, Y. (2017). Beyond normality: Learning sparse probabilistic graphical models in the non-Gaussian setting. In Advances in Neural Information Processing Systems 2359–2369.
Muckenhoupt, B. (1972). Hardy’s inequality with weights. Studia Mathematica 44 31–38.
Nocedal, J. and Wright, S. (2006). Numerical optimization. Springer Science & Business Media.
Novak, E., Ullrich, M., Woźniakowski, H. and Zhang, S. (2018). Reproducing kernels of Sobolev spaces on \(\mathbb{R}^d\) and applications to embedding constants and tractability. Analysis and Applications 16 693–715.
Oord, A. V. D., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G. V. D., Lockhart, E., Cobo, L. C., Stimberg, F. et al. (2017). Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
Papamakarios, G. and Murray, I. (2016). Fast \(\varepsilon \)-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems 1028–1036.
Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S. and Lakshminarayanan, B. (2021). Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research 22 1–64.
Papamakarios, G., Pavlakou, T. and Murray, I. (2017). Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems 2338–2347.
Parno, M. D. and Marzouk, Y. M. (2018). Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification 6 645–682.
Radev, S. T., Mertens, U. K., Voss, A., Ardizzone, L. and Köthe, U. (2020). BayesFlow: Learning complex stochastic models with invertible neural networks. IEEE Transactions on Neural Networks and Learning Systems.
Ramsay, J. O. (1998). Estimating smooth monotone functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60 365–375.
Raskutti, G. and Uhler, C. (2018). Learning directed acyclic graph models based on sparsest permutations. Stat 7 e183.
Rezende, D. and Mohamed, S. (2015). Variational inference with normalizing flows. In International Conference on Machine Learning 1530–1538. PMLR.
Rosenblatt, M. (1952). Remarks on a multivariate transformation. The Annals of Mathematical Statistics 23 470–472.
Rothfuss, J., Ferreira, F., Walther, S. and Ulrich, M. (2019). Conditional density estimation with neural networks: Best practices and benchmarks. arXiv preprint arXiv:1903.00954.
Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Springer International Publishing.
Schäfer, F., Katzfuss, M. and Owhadi, H. (2021). Sparse Cholesky Factorization by Kullback–Leibler Minimization. SIAM Journal on Scientific Computing 43 A2019–A2046.
Schmuland, B. (1992). Dirichlet forms with polynomial domain. Math. Japon 37 1015–1024.
Schölkopf, B., Herbrich, R. and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory 416–426. Springer.
Shin, Y. E., Zhou, L. and Ding, Y. (2022). Joint estimation of monotone curves via functional principal component analysis. Computational Statistics & Data Analysis 166 107343.
Silverman, B. W. (1982). On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics 795–810.
Sisson, S. A., Fan, Y. and Tanaka, M. M. (2007). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 104 1760–1765.
Spantini, A., Baptista, R. and Marzouk, Y. (2022). Coupling techniques for nonlinear ensemble filtering. SIAM Review 64 921–953.
Spantini, A., Bigoni, D. and Marzouk, Y. (2018). Inference via low-dimensional couplings. The Journal of Machine Learning Research 19 2639–2709.
Tabak, E. G. and Turner, C. V. (2013). A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics 66 145–164.
Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M. and Sugiyama, M. (2020). Coupling-based invertible neural networks are universal diffeomorphism approximators. In Advances in Neural Information Processing Systems 33 3362–3373.
Trippe, B. L. and Turner, R. E. (2018). Conditional density estimation with Bayesian normalising flows. In Bayesian Deep Learning: NIPS 2017 Workshop.
Truong, T. T. and Nguyen, H.-T. (2021). Backtracking Gradient Descent Method and Some Applications in Large Scale Optimisation. Part 2: Algorithms and Experiments. Applied Mathematics & Optimization 84 2557–2586.
Uria, B., Murray, I. and Larochelle, H. (2013). RNADE: The real-valued neural autoregressive density-estimator. arXiv preprint arXiv:1306.0186.
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science 47. Cambridge University Press.
Vidakovic, B. (2009). Statistical modeling by wavelets 503. John Wiley & Sons.
Villani, C. (2008). Optimal transport: old and new 338. Springer Science & Business Media.
Wang, S. and Marzouk, Y. (2022). On minimax density estimation via measure transport. arXiv preprint arXiv:2207.10231.
Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.
Wehenkel, A. and Louppe, G. (2019). Unconstrained monotonic neural networks. In Advances in Neural Information Processing Systems 1543–1553.
Wenliang, L., Sutherland, D., Strathmann, H. and Gretton, A. (2019). Learning deep kernels for exponential family densities. In International Conference on Machine Learning 6737–6746.
Zahm, O., Cui, T., Law, K., Spantini, A. and Marzouk, Y. (2022). Certified dimension reduction in nonlinear Bayesian inverse problems. Mathematics of Computation 91 1789–1835.
Zech, J. and Marzouk, Y. (2022). Sparse approximation of triangular transports. Part II: the infinite dimensional case. Constructive Approximation 55 987–1036.
Zech, J. and Marzouk, Y. (2022). Sparse Approximation of triangular transports. Part I: the finite-dimensional case. Constructive Approximation 55 919–986.
Acknowledgements
RB, YM, and OZ gratefully acknowledge support from the INRIA associate team Unquestionable. RB and YM are also grateful for support from the AFOSR Computational Mathematics program (MURI award FA9550-15-1-0038) and the US Department of Energy AEOLUS center. RB acknowledges support from an NSERC PGSD-D fellowship. OZ also acknowledges support from the ANR JCJC project MODENA (ANR-21-CE46-0006-01).
Additional information
Communicated by Albert Cohen.
Appendices
Appendix A: Proofs and Theoretical Details
1.1 A.1. Proof of Proposition 1
Proof
Recall that the KR rearrangement \(S_{\text {KR}}\) is a transport map that satisfies \(S_{\text {KR}}^\sharp \eta =\pi \), where \(\eta \) is the density of the standard Gaussian measure on \(\mathbb {R}^d\) and \(\pi \) is the target density. Corollary 3.10 in [8] states that for any PDF \(\varrho \) on \(\mathbb {R}^d\) of the form \(\varrho ({\varvec{x}}):=f({\varvec{x}})\eta ({\varvec{x}})\) with \(f\log f\in L^1_\eta \), the inequality
holds, where T is the KR rearrangement such that \(T_\sharp \eta =\varrho \). Let S be an increasing lower triangular map as in (1) and let \(\varrho = S_\sharp \pi \). Thus, we have \(T = S\circ S_{\text {KR}}^{-1}\), and so, the left-hand side of (38) becomes
and the right-hand side becomes
which yields (4). \(\square \)
1.2 A.2. Convexity of \(s\mapsto \mathcal {J}_k(s)\)
Lemma 10
The optimization problem \(\min _{\{s:\partial _k s > 0\}} \mathcal {J}_{k}(s)\) is strictly convex.
Proof
Let \(s_{1}, s_{2}:\mathbb {R}^k \rightarrow \mathbb {R}\) be two functions such that \(\partial _{k} s_{1}({\varvec{x}}_{\le k}) > 0\) and \(\partial _{k} s_{2}({\varvec{x}}_{\le k}) > 0\). Let \( s_t = t s_1 + (1-t)s_2\) for \(0<t<1\). Then \(s_t\) also satisfies \(\partial _{k} s_{t}({\varvec{x}}_{\le k}) > 0\). Finally, because both \(\xi \mapsto \frac{1}{2}\xi ^2\) and \(\xi \mapsto -\log (\xi )\) are strictly convex functions, we have
which shows that \(\mathcal {J}_{k}\) is strictly convex. \(\square \)
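The display omitted above can be reconstructed if one assumes the standard KL-derived form of the component objective for triangular maps, \(\mathcal {J}_k(s) = \int \big (\frac{1}{2} s({\varvec{x}}_{\le k})^2 - \log \partial _k s({\varvec{x}}_{\le k})\big )\,\pi ({\varvec{x}})\,\text {d}{\varvec{x}}\) (the definition lies outside this excerpt, so this form is an assumption); the strict convexity of both integrands then gives, pointwise and hence under the integral, for \(s_1 \ne s_2\),

```latex
\mathcal{J}_k(s_t)
  = \int \Big( \tfrac{1}{2}\, s_t({\varvec{x}}_{\le k})^2
      - \log \partial_k s_t({\varvec{x}}_{\le k}) \Big)\, \pi({\varvec{x}})\,\mathrm{d}{\varvec{x}}
  \;<\; t\,\mathcal{J}_k(s_1) + (1-t)\,\mathcal{J}_k(s_2).
```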
1.3 A.3. Proof of Proposition 2
To prove Proposition 2, we need the following lemma.
Lemma 11
Let
Then
holds for any \(f\in H^1([0,1])\).
Proof
Because \(\mathcal {C}^{\infty }([0,1])\) is dense in \(H^1([0,1])\), it suffices to show (39) for any \(f\in \mathcal {C}^{\infty }([0,1])\). By the mean value theorem, there exists \(0\le z \le 1\) such that
Thus, we can write
This concludes the proof. \(\square \)
We now prove Proposition 2.
Proof
For any \(f\in V_k\), Lemma 11 permits us to write
where \(C_T=2 \sup _{0\le t\le 1} \eta _1(t)^{-1}\). \(\square \)
1.4 A.4. Proof of Proposition 3
The proof relies on Proposition 2 and on the following generalized integral Hardy inequality; see [39].
Lemma 12
Let \(\eta _{\le k}\) be the standard Gaussian density on \(\mathbb {R}^k\). Then there exists a constant \(C_{H}\) such that for any \(v \in L^2_{\eta }(\mathbb {R}^{k})\),
Proof of Lemma 12
Let us recall the integral Hardy inequality [39].
Theorem 13
(from [39]) For weight \(\rho :\mathbb {R}_{+} \rightarrow \mathbb {R}_{+}\) and \(u \in L^2_{\rho }(\mathbb {R})\), there exists a constant \(C_{H} < \infty \) such that
if and only if
We apply Theorem 13 with the one-dimensional standard Gaussian density \(\rho = \eta \) for \(x > 0\). In order to check condition (42), we need to show that
is bounded. Since \(x\mapsto D(x)\) is a continuous function with a finite limit as \(x \rightarrow 0\), it is sufficient to show that D(x) has a finite limit when \(x \rightarrow \infty \). For \(x > 1\), \(\int _{x}^{+\infty } e^{-t^2/2} \text {d}t \le e^{-x^2/2}\) and \(D(x)^2 \le e^{-x^2/2} \int _{0}^{x} e^{t^2/2} \text {d}t\). Furthermore, using integration by parts we have \(\int _{0}^{x} e^{t^2/2} \text {d}t = \int _{0}^{1} e^{t^2/2}\text {d}t + e^{x^2/2}/x - \sqrt{e} + \int _{1}^{x} e^{t^2/2}/t^2\text {d}t\). As \(x \rightarrow \infty \), the dominating term in the sum is \(e^{x^2/2}/x\). Thus, \(e^{-x^2/2} \int _{0}^{x} e^{t^2/2}\text {d}t\) behaves asymptotically as \(\mathcal {O}(\frac{1}{x})\), so that \(D(x)\rightarrow 0\) when \(x\rightarrow \infty \). Thus, condition (42) is satisfied.
Then, by the Hardy inequality in (41) for \(u \in L_{\eta }^2(\mathbb {R})\) we have
For the symmetric density \(\eta (x_{k}) = \eta (-x_{k})\), we also have
Combining the results in (43) and (44), we have
Setting \(u(t) = v({\varvec{x}}_{<k},t)\) and integrating both sides over \({\varvec{x}}_{<k} \in \mathbb {R}^{k-1}\) with the standard Gaussian weight function \(\eta ({\varvec{x}}_{<k})\) give the result. \(\square \)
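As a numerical sanity check of the asymptotics above (our own illustration, not part of the original proof), one can evaluate \(D(x)^2 = \big (\int _x^\infty e^{-t^2/2}\,\text {d}t\big )\big (\int _0^x e^{t^2/2}\,\text {d}t\big )\) directly, up to the normalizing constant of \(\eta \), which does not affect boundedness:

```python
import math

def tail(x):
    # int_x^inf exp(-t^2/2) dt = sqrt(pi/2) * erfc(x / sqrt(2))
    return math.sqrt(math.pi / 2.0) * math.erfc(x / math.sqrt(2.0))

def grow(x, n=20000):
    # int_0^x exp(t^2/2) dt via the composite trapezoidal rule
    h = x / n
    s = 0.5 * (1.0 + math.exp(x * x / 2.0))
    for i in range(1, n):
        s += math.exp((i * h) ** 2 / 2.0)
    return s * h

def D(x):
    return math.sqrt(tail(x) * grow(x))
```

Since \(\int _x^\infty e^{-t^2/2}\,\text {d}t \sim e^{-x^2/2}/x\) and \(\int _0^x e^{t^2/2}\,\text {d}t \sim e^{x^2/2}/x\), the product behaves like \(1/x^2\), so \(D(x)\) decays like \(1/x\), consistent with \(D(x)\rightarrow 0\) in the proof.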
We now prove Proposition 3.
Proof
By Proposition 2, by Lemma 12, and by the Lipschitz property of g, we can write
for any \(f_1,f_2\in V_k\). Furthermore, using the Lipschitz property of g we have
Combining (45) with (46), we obtain (19) with \(C = \sqrt{2(C_T+C_H L^2)+L^2}\).
It remains to show that \(\Vert \mathcal {R}_k(f) \Vert _{V_k}<\infty \) for any \(f\in V_k\). Setting \(f_1=f\) and \(f_2=0\) in (19), the triangle inequality yields
Because \(\mathcal {R}_k(0)\) is the affine function \({\varvec{x}}\mapsto g(0)x_k\), we have that \(\Vert \mathcal {R}_k(0) \Vert _{L^2_{\eta _{\le k}}}^2= g(0)^2\int x_k^2 \eta ({\varvec{x}})\text {d}{\varvec{x}}\) and \(\Vert \partial _k \mathcal {R}_k(0) \Vert _{L^2_{\eta _{\le k}}}^2 = g(0)^2\) are finite and so is \(\Vert \mathcal {R}_k(0) \Vert _{V_k}\). Thus, \(\mathcal {R}_k(f)\in V_k\) for all \(f\in V_k\). \(\square \)
1.5 A.5. Proof of Proposition 4
Proof
For any \(f\in V_k\), we have
Because Proposition 3 ensures \(\mathcal {R}_k(f)\in V_k\subset L^2_{\eta _{\le k}}\), we have that \(\mathcal {L}_{k}(f)\) is finite for any \(f\in V_k\). Now, for any \(f_{1},f_2\in V_k\), we can write
This shows that \(\mathcal {L}_{k}:V_k\rightarrow \mathbb {R}\) is continuous. To show that \(\mathcal {L}_{k}\) is differentiable, we let \(f,\varepsilon \in V_k\) so that
where \(\ell :V_k\rightarrow \mathbb {R}\) is the linear form defined by
If \(\ell \) is continuous, meaning that there exists a constant \(C_\ell \) such that \(|\ell (\varepsilon )|\le C_\ell \Vert \varepsilon \Vert _{V_k}\) for any \(\varepsilon \in V_k\), then the Riesz representation theorem yields a vector \(\nabla \mathcal {L}_k(f)\in V_k\) such that \(\ell (\varepsilon )=\langle \nabla \mathcal {L}_k(f),\varepsilon \rangle _{V_k}\). This proves that \(\mathcal {L}_k\) is differentiable everywhere.
To show that \(\ell \) is continuous, we write
where the second-to-last inequality also uses Proposition 2 and Lemma 12. This concludes the proof. \(\square \)
1.6 A.6. Proof of the Local Lipschitz Regularity (24)
Proposition 14
In addition to the assumptions of Theorem 4, we further assume there exists a constant \(L<\infty \) such that for all \(\xi ,\xi '\in \mathbb {R}\) we have
Then there exists \(M<\infty \) such that
for any \(f_1,f_2 \in \overline{V}_k\), where \(\overline{V}_k = \{f \in V_k, \partial _k f \in L^\infty \}\) is the space endowed with the norm \(\Vert f \Vert _{\overline{V}_k} = \Vert f\Vert _{V_k} + \Vert \partial _k f \Vert _{L^\infty }\).
Proof
Recall the definition (23) of \(\nabla \mathcal {L}_k(f)\)
Then, for any \(f_1,f_2\in \overline{V}_k\), we can write
where
For the first term A, we write
For the second term B, we write
For the third term C, we write
For the last term D, we write
Thus, because \(\Vert f_1-f_2\Vert _{V_k} \le \Vert f_1-f_2\Vert _{\overline{V}_k} \) we obtain
where
This concludes the proof. \(\square \)
1.7 A.7. Proof of Proposition 6
Proof
To show that \(\mathcal {R}_k(V_k)=\{\mathcal {R}_k(f):f\in V_k\}\) is convex, let \(f_1,f_2\in V_k\) and \(0\le \alpha \le 1\). We need to show that there exists \(f_\alpha \in V_k\) such that \(\mathcal {R}_k(f_\alpha ) = S_\alpha \), where
Let
It remains to show that \(f_\alpha \in V_k\), meaning that \(f_\alpha \in L^2_{\eta _{\le k}}\) and \(\partial _k f_\alpha \in L^2_{\eta _{\le k}}\). By convexity of \(\xi \mapsto g^{-1}(\xi )^2\), we have
Thus, \(\partial _k f_\alpha \in L^2_{\eta _{\le k}}\). Furthermore, we have
To show that the above quantity is finite, Proposition 2 permits us to write
which is finite. Finally, because \(g^{-1}(\partial _k S_\alpha ) =\partial _k f_\alpha \in L^2_{\eta _{\le k}}\) by (49), Lemma 12 yields
which is finite. We deduce that \(f_\alpha \in L^2_{\eta _{\le k}}\) and therefore that \(f_\alpha \in V_k\). \(\square \)
1.8 A.8. Proof of Proposition 5
Proof
Let \(s_1,s_2 \in V_{k}\) be strictly increasing functions with respect to \(x_k\) that satisfy \(\partial _k s_i({\varvec{x}}_{\le k}) \ge c\) for \(i=1,2\) and all \({\varvec{x}}_{\le k} \in \mathbb {R}^{k}\). By the Lipschitz property of \(g^{-1}\) on the domain \([c,\infty )\) with constant \(L_c\), we can write
Applying Proposition 2 to \(s_1,s_2 \in V_k\) and Lemma 12 to \(\partial _k \mathcal {R}_k^{-1}(s_i) = g^{-1}(\partial _k s_i) \in L_{\eta _{\le k}}^2\) for \(i=1,2\), we have
where the last inequality follows from (50).
It remains to show that \(\Vert \mathcal {R}_k^{-1}(s)\Vert _{V_k} < \infty \) for any \(s \in V_{k}\) such that \({\textrm{ess}}\,{\textrm{inf}\,}\partial _k s > 0\). Letting \(s_1 = s\) and \(s_2 = g(0)x_k\), the triangle inequality combined with (26) yields
The function \(\mathcal {R}_k^{-1}(g(0)x_k)\) is zero. Therefore, \(\Vert \mathcal {R}_k^{-1}(s) \Vert _{V_k} \le C_c\Vert s - g(0)x_k \Vert _{V_k} \le C_c(\Vert s \Vert _{V_k} + \Vert g(0)x_k \Vert _{V_{k}})\). For a linear function, \(\Vert g(0)x_k \Vert _{V_k}^2 = \Vert g(0)x_k \Vert ^2_{L^2_{\eta _{\le k}}} + \Vert g(0) \Vert ^2_{L^2_{\eta _{\le k}}} = 2\,g(0)^2\) is finite, and so, \(\Vert \mathcal {R}_k^{-1}(s) \Vert _{V_k} < \infty \) for \(s \in V_k\). Furthermore, if \(\partial _k s \ge c > 0\), then \(\partial _k \mathcal {R}_k^{-1}(s) = g^{-1}(\partial _k s) \ge g^{-1}(c) > -\infty \), and so, \({\textrm{ess}}\,{\textrm{inf}\,}\mathcal {R}_k^{-1}(s) > -\infty \). \(\square \)
1.9 A.9. Proof for the KR Rearrangement
Proof
Let \(S_{\text {KR},k}\) be the kth component of the KR rearrangement, given by composing the inverse CDF of the standard Gaussian marginal \(F_{\eta ,k}(x_k)\) with the CDF of the target’s kth marginal conditional \(F_{\pi _k}(x_k|{\varvec{x}}_{<k})\). That is,
The goal is to show \(S_{\text {KR},k} \in V_k\), that is, \(S_{\text {KR},k} \in L^2_{\eta _{\le k}}\) and \(\partial _k S_{\text {KR},k} \in L^2_{\eta _{\le k}}\).
First we show \(S_{\text {KR},k} \in L^2_{\eta _{\le k}}\). From condition (30), we have \(F_{\eta _k}^{-1}(C_1 F_{\eta _k}(x_k)) \le S_{\text {KR},k}(x_k|{\varvec{x}}_{<k}) \le F_{\eta _k}^{-1}(C_2 F_{\eta _k}(x_k))\) for some constants \(C_1,C_2 > 0\) so that
for all \({\varvec{x}}_{<k} \in \mathbb {R}^{k-1}\). To show that \(S_{\text {KR},k} \in L^2_{\eta _{\le k}}\), it is sufficient to prove that any function of the form \(x_k\mapsto F_{\eta _k}^{-1}( C F_{\eta _k}(x_k))\) is in \( L^2_{\eta _{\le k}}\) for any \(C>0\). From Theorems 1 and 2 in [11], there exist strictly positive constants \(\alpha _i,\beta _i > 0\) for \(i=1,2\) such that
for \(x_k > 0\). With a change of variable \(u=F_{\eta _k}(x_k)\), we obtain \(F^{-1}_{\eta _k}(u)^2 \le 1/\beta _{2} \log (\alpha _2/(1-u))\) for all \(u > F_{\eta _k}(0) = 1/2\). Letting \(u=C F_{\eta _k}(x_k)\) yields
for all \(x_k > \max \{ F_{\eta _k}^{-1}(1/(2C)), 0\}\). Using the same argument, we obtain a similar bound on \(F^{-1}_{\eta _k}( C F_{\eta _k}(x_k) )^2\) for all \(x_k\) smaller than a certain value. Together with the continuity of \(x_k\mapsto F^{-1}_{\eta _k}( C F_{\eta _k}(x_k) )^2\), these bounds ensure that \(x_k\mapsto F^{-1}_{\eta _k}( C F_{\eta _k}(x_k) )\) is in \(L^2_{\eta _{\le k}}\) for any C. Then \(S_{\text {KR},k} \in L^2_{\eta _{\le k}}\). Furthermore, we have \(S_{\text {KR},k}({\varvec{x}}_{\le k}) = \mathcal {O}(x_k)\) as \(|x_k| \rightarrow \infty \).
Now we show that \(\partial _k S_{\text {KR},k} \in L^2_{\eta _{\le k}}\) by showing \(\partial _k S_{\text {KR},k}\) is a continuous and bounded function. From the absolute continuity of \(\varvec{\mu }\) and \(\varvec{\nu }\), we have that
is continuous, where \(F_{\pi _k}^{-1}(\cdot |{\varvec{x}}_{<k})\) denotes the inverse of the map \(x_k \mapsto F_{\pi _k}(x_k|{\varvec{x}}_{<k})\) for each \({\varvec{x}}_{<k} \in \mathbb {R}^{k-1}\). Hence, it is sufficient to show that \(\partial _k S_{\text {KR},k}\) goes to a finite limit as \(|x_k| \rightarrow \infty \). For the right-hand limit, we can write
where in the second equality we used the inverse function theorem and the third equality follows from l’Hôpital’s rule. To analyze the ratio \(F_{\eta _k}^{-1}(u)/F_{\pi _k}^{-1}(u|{\varvec{x}}_{<k})\), we combine the lower bound in (30) and the bounds in (54) to get
Similarly, from the upper bound in (30) and the bounds in (54), we have
Thus, \(\partial _k S_{\text {KR},k}({\varvec{x}}_{\le k}) = \mathcal {O}(1)\) as \(x_k \rightarrow \infty \) (and analogously as \(x_k \rightarrow -\infty \)), and we have \(\partial _k S_{\text {KR},k} \in L^2_{\eta _{\le k}}\).
Lastly, taking the limit in (56) we have \(\lim _{x_k \rightarrow \infty } \partial _k S_{\text {KR},k}({\varvec{x}}_{\le k}) \ge \sqrt{\beta _2/\beta _1}\). For a target distribution \(\pi \) with full support, all marginal conditional densities satisfy \(\pi _k(x_k|{\varvec{x}}_{<k}) > 0\) for each \({\varvec{x}}_{\le k} \in \mathbb {R}^{k}\). Given that \(\partial _k S_{\text {KR},k}\) does not approach zero as \(|x_k| \rightarrow \infty \), we can find a constant \(c_k > 0\) such that \(\partial _k S_{\text {KR},k}({\varvec{x}}_{\le k}) \ge c_k\) for all \({\varvec{x}}_{\le k} \in \mathbb {R}^k\). This shows that \({\textrm{ess}}\,{\textrm{inf}\,}\partial _k S_{\text {KR},k} > 0\). \(\square \)
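The one-dimensional mechanics of the composition \(F_{\eta _k}^{-1}\circ F_{\pi _k}\) can be illustrated concretely. The example below is our own (not from the paper): it takes a Gaussian target \(\pi = N(1,4)\), for which the composition is exactly the affine map \(x\mapsto (x-1)/2\), consistent with the linear growth \(S_{\text {KR},k}({\varvec{x}}_{\le k}) = \mathcal {O}(x_k)\) and the strictly positive derivative established above.

```python
from statistics import NormalDist

eta = NormalDist()                  # standard Gaussian reference N(0, 1)
target = NormalDist(mu=1.0, sigma=2.0)  # illustrative target pi = N(1, 4)

def kr_map(x):
    """S = F_eta^{-1} o F_pi: pushes the target forward to eta.

    For a Gaussian target this composition is exactly affine,
    S(x) = (x - 1) / 2, with constant derivative 1/2 > 0."""
    return eta.inv_cdf(target.cdf(x))
```

For heavier- or lighter-tailed targets the map is no longer affine, and the tail condition (30) is exactly what keeps its growth linear and its derivative bounded away from zero.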
Appendix B: Multi-index Refinement for the Wavelet Basis
In this section, we show how to greedily enrich the index set \(\Lambda _t\) for a one-dimensional wavelet basis parameterized by the tuple of indices (l, q) representing the level l and translation q of each wavelet \(\psi _{(l,q)}\). To define the allowable indices, we construct a binary tree where each node is indexed by (l, q) and has two children with indices \((l+1,2q)\) and \((l+1,2q+1)\). The root of the tree has index (0, 0) and corresponds to the mother wavelet \(\psi \). Analogously to the downward-closed property for polynomial indices, we add a node to the tree (i.e., an index to \(\Lambda _t\)) only if its parent has already been added. Given any set \(\Lambda _t\), we define its reduced margin as
Then, the ATM algorithm with a wavelet basis follows from Algorithm 1 with this construction for the reduced margin at each iteration.
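This tree construction can be prototyped directly. The sketch below is illustrative: the scoring function passed to `enrich` is a placeholder for the gradient-based selection criterion used in Algorithm 1, not the paper's implementation.

```python
def children(node):
    """The two children of wavelet index (l, q) in the binary tree."""
    l, q = node
    return [(l + 1, 2 * q), (l + 1, 2 * q + 1)]

def reduced_margin(active):
    """Candidate indices: children of active nodes not yet active.

    `active` is assumed tree-downward-closed (it contains (0, 0) and
    the parent of each of its nodes), so every candidate's parent is
    active by construction."""
    margin = set()
    for node in active:
        for child in children(node):
            if child not in active:
                margin.add(child)
    return margin

def enrich(active, score):
    """One greedy refinement step: activate the best-scoring candidate."""
    best = max(reduced_margin(active), key=score)
    active.add(best)
    return best
```

Starting from the root, `reduced_margin({(0, 0)})` is `{(1, 0), (1, 1)}`, and repeated calls to `enrich` grow \(\Lambda _t\) while preserving downward-closedness.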
Appendix C: Architecture Details of Alternative Methods
In this section, we present the details of the alternative methods to ATM that we consider in Sect. 5.
For each normalizing flow model, we use the recommended stochastic gradient descent optimizer with a learning rate of \(10^{-3}\). We partition 10% of the samples in each training set to be validation samples and use the remaining samples for training the model. We select the optimal hyperparameters for each dataset by fitting the density with the training data and choosing the parameters that minimize the negative log-likelihood of the approximate density on the validation samples. We also use the validation samples to set the termination criteria during the optimization.
We follow the implementation of [52] to define the architectures of these models. The hyperparameters we consider for the neural networks in the MDN and NF models are: 2 hidden layers, 32 hidden units in each layer, \(\{5,10,20,50,100\}\) centers or flows, weight normalization, and a dropout probability of \(\{0,0.2\}\) for regularizing the neural networks during training. For CKDE and NKDE, we select the bandwidth of the kernel estimators using fivefold cross-validation.
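The fivefold cross-validation used to select the kernel bandwidths can be sketched as follows. This is a self-contained illustration with a plain one-dimensional Gaussian KDE; the actual CKDE/NKDE estimators and the bandwidth grid are not ours to specify, so both are stand-ins.

```python
import math
import random

def log_kde(x, data, h):
    """Log-density of a 1-D Gaussian KDE with bandwidth h at point x."""
    s = sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data)
    # Small floor avoids log(0) far from all kernel centers.
    return math.log(s / (len(data) * h * math.sqrt(2.0 * math.pi)) + 1e-300)

def cv_bandwidth(data, grid, folds=5, seed=0):
    """Pick the bandwidth maximizing held-out log-likelihood (k-fold CV)."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    chunks = [idx[f::folds] for f in range(folds)]

    def heldout_loglik(h):
        total = 0.0
        for chunk in chunks:
            test = set(chunk)
            train = [data[i] for i in idx if i not in test]
            total += sum(log_kde(data[i], train, h) for i in chunk)
        return total

    return max(grid, key=heldout_loglik)
```

On a few hundred roughly standard-normal samples, this selects an intermediate bandwidth over severely under- or over-smoothed alternatives.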
Cite this article
Baptista, R., Marzouk, Y. & Zahm, O. On the Representation and Learning of Monotone Triangular Transport Maps. Found Comput Math (2023). https://doi.org/10.1007/s10208-023-09630-x
Keywords
- Knothe–Rosenblatt rearrangement
- Normalizing flows
- Monotone functions
- Infinite-dimensional optimization
- Adaptive approximation
- Multivariate polynomials
- Wavelets
- Density estimation