Abstract
Training machine-learning models with synthetically generated data can alleviate the problem of data scarcity when acquiring diverse and sufficiently large datasets is costly and challenging. Here we show that cascaded diffusion models can be used to synthesize realistic whole-slide image tiles from latent representations of RNA-sequencing data from human tumours. Alterations in gene expression affected the composition of cell types in the generated synthetic image tiles, which accurately preserved the distribution of cell types and maintained the cell fraction observed in bulk RNA-sequencing data, as we show for lung adenocarcinoma, kidney renal papillary cell carcinoma, cervical squamous cell carcinoma, colon adenocarcinoma and glioblastoma. Machine-learning models pretrained with the generated synthetic data performed better than models trained from scratch. Synthetic data may accelerate the development of machine-learning models in scarce-data settings and allow for the imputation of missing data modalities.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
$29.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 digital issues and online access to articles
$99.00 per year
only $8.25 per issue
Buy this article
- Purchase on Springer Link
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
TCGA data can be downloaded from the GDC platform (https://portal.gdc.cancer.gov/). The two GEO series used in this study can be downloaded from the GEO platform: GSE50760 and GSE226069. The PBTA dataset can be downloaded from the Gabriella Miller Kids First Data Resource Portal (KF-DRC, https://kidsfirstdrc.org). Microsatellite-instability-status data can be downloaded from the Kaggle platform: https://www.kaggle.com/datasets/joangibert/tcga_coad_msi_mss_jpg. Case IDs used for this work as well as the RNA-seq encodings obtained for all experiments are available under an academic-use-only licence at https://rna-cdm.stanford.edu. One million synthetic images are available in the Dryad platform at https://doi.org/10.5061/dryad.6djh9w174 (ref. 77).
Code availability
A demo for generating synthetic images and the code are available under an academic-use-only licence at https://rna-cdm.stanford.edu.
References
Sung, H. et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 71, 209–249 (2021).
Jones, P. A. & Baylin, S. B. The epigenomics of cancer. Cell 128, 683–692 (2007).
Lujambio, A. & Lowe, S. W. The microcosmos of cancer. Nature 482, 347–355 (2012).
Frangioni, J. V. New technologies for human cancer imaging. J. Clin. Oncol. 26, 4012–4021 (2008).
Williams, B. J., Bottoms, D. & Treanor, D. Future-proofing pathology: the case for clinical adoption of digital pathology. J. Clin. Pathol. 70, 1010–1018 (2017).
Heindl, A., Nawaz, S. & Yuan, Y. Mapping spatial heterogeneity in the tumor microenvironment: a new era for digital pathology. Lab. Invest. 95, 377–384 (2015).
Cheng, J. et al. Identification of topological features in renal tumor microenvironment associated with patient survival. Bioinformatics 34, 1024–1030 (2018).
Coudray, N. et al. Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning. Nat. Med. 24, 1559–1567 (2018).
Castillo, D. et al. Integration of RNA-seq data with heterogeneous microarray data for breast cancer profiling. BMC Bioinformatics18, 506 (2017).
Yu, D. et al. Copy number variation in plasma as a tool for lung cancer prediction using Extreme Gradient Boosting (XGBoost) classifier. Thorac. Cancer 11, 95–102 (2020).
Maros, M. E. et al. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nat. Protoc. 15, 479–512 (2020).
Chen, R. J. et al. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 16144–16155 (IEEE, 2022).
Carrillo-Perez, F. et al. Machine-learning-based late fusion on multi-omics and multi-scale data for non-small-cell lung cancer diagnosis. J. Pers. Med. 12, 601 (2022).
Lee, C. & van der Schaar, M. A variational information bottleneck approach to multi-omics data integration. In International Conference on Artificial Intelligence and Statistics 1513–1521 (PMLR, 2021).
Chen, R. J. et al. Pathomic fusion: an integrated framework for fusing histopathology and genomic features for cancer diagnosis and prognosis. IEEE Trans. Med. Imaging 41, 757–770 (2020).
Cheerla, A. & Gevaert, O. Deep learning with multimodal representation for pancancer prognosis prediction. Bioinformatics 35, i446–i454 (2019).
Chen, R. J. et al. Pan-cancer integrative histology–genomic analysis via multimodal deep learning. Cancer Cell 40, 865–878 (2022).
Vanguri, R. S. et al. Multimodal integration of radiology, pathology and genomics for prediction of response to PD-(L) 1 blockade in patients with non-small cell lung cancer. Nat. Cancer 3, 1151–1164 (2022).
Lipkova, J. et al. Artificial intelligence for multimodal data integration in oncology. Cancer Cell 40, 1095–1110 (2022).
Weinstein, J. N. et al. The Cancer Genome Atlas Pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).
Jennings, C. N. et al. Bridging the gap with the UK Genomics Pathology Imaging Collection. Nat. Med. 28, 1107–1108 (2022).
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 41, D991–D995 (2012).
Quiros, A. C., Murray-Smith, R. & Yuan, K. PathologyGAN: learning deep representations of cancer tissue. In Proceedings of the Third Conference on Medical Imaging with Deep Learning 121, 669–695 (PMLR, 2020).
Quiros, A. C., Murray-Smith, R. & Yuan, K. Learning a low dimensional manifold of real cancer tissue with PathologyGAN. Preprint at https://arxiv.org/abs/1907.02644v5 (2020).
Viñas, R., Andrés-Terré, H., Liò, P. & Bryson, K. Adversarial generation of gene expression data. Bioinformatics 38, 730–737 (2022).
Mitra, R. & MacLean, A. L. RVAgene: generative modeling of gene expression time series data. Bioinformatics 37, 3252–3262 (2021).
Qiu, Y. L., Zheng, H. & Gevaert, O. Genomic data imputation with variational auto-encoders. Gigascience 9, giaa082 (2020).
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V. & Courville, A. C. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 5769–5779 (Curran Associates, 2017).
Metz, L., Poole, B., Pfau, D. & Sohl-Dickstein, J. Unrolled generative adversarial networks. Preprint at https://doi.org/10.48550/arXiv.1611.02163 (2016).
Salimans, T. et al. Improved techniques for training gans. In Advances in Neural Information Processing Systems 29 (eds Lee, D. et al.) 2234–2242 (Curran Associates, 2016).
Zhao, S., Song, J. & Ermon, S. Infovae: balancing learning and inference in variational autoencoders. Proc. AAAI Conf. Artif. Intell. 33, 5885–5892 (2019).
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional image generation with CLIP latents. Preprint at https://doi.org/10.48550/arXiv.2204.06125 (2022).
Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems 35, 36479–36494 (PMLR, 2022).
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning 2256–2265 (PMLR, 2015).
Radford, A. et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning 8748–8763 (PMLR, 2021).
Yu, K. H. et al. Association of omics features with histopathology patterns in lung adenocarcinoma. Cell Syst. 5, 620–627 (2017).
Fu, Y. et al. Pan-cancer computational histopathology reveals mutations, tumor composition and prognosis. Nat. Cancer 1, 800–810 (2020).
Schmauch, B. et al. A deep learning model to predict RNA-seq expression of tumours from whole slide images. Nat. Commun. 11, 3877 (2020).
McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at https://doi.org/10.48550/arXiv.1802.03426 (2018).
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. & Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30 (eds Guyon, I. et al.) 6629–6640 (Curran Associates, 2017).
Binkowski, M., Sutherland, D. J., Arbel, M. & Gretton, A. Demystifying MMD GANS. Preprint at https://doi.org/10.48550/arXiv.1801.01401 (2018).
Kim, S. K. et al. A nineteen gene-based risk score classifier predicts prognosis of colorectal cancer patients. Mol. Oncol. 8, 1653–1666 (2014).
Quintanal-Villalonga, A. et al. Comprehensive molecular characterization of lung tumors implicates AKT and MYC signaling in adenocarcinoma to squamous cell transdifferentiation. J. Hematol. Oncol. 14, 170 (2021).
Graham, S. et al. Hover-Net: simultaneous segmentation and classification of nuclei in multi-tissue histology images. Med. Image Anal. 58, 101563 (2019).
Karimi, E. et al. Single-cell spatial immune landscapes of primary and metastatic brain tumours. Nature 614, 555–563 (2023).
Han, S. et al. Rescuing defective tumor-infiltrating T-cell proliferation in glioblastoma patients. Oncol. Lett. 12, 2924–2929 (2016).
Steyaert, S. et al. Multimodal deep learning to predict prognosis in adult and pediatric brain tumors. Commun. Med. 3, 44 (2023).
Lehrer, M. et al. in Advances in Biology and Treatment of Glioblastoma (ed. Somasundaram, K.) 143–159 (Springer, 2017).
Yamashita, R. et al. Deep learning model for the prediction of microsatellite instability in colorectal cancer: a diagnostic study. Lancet Oncol. 22, 132–141 (2021).
Marisa, L. et al. Gene expression classification of colon cancer into molecular subtypes: characterization, validation, and prognostic value. PLoS Med. 10, e1001453 (2013).
Li, W. et al. High resolution histopathology image generation and segmentation through adversarial training. Med. Image Anal. 75, 102251 (2022).
Karras, T., Aittala, M., Aila, T. & Laine, S. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, 35, 26565–26577 (PMLR, 2022).
Chen, R. J., Lu, M. Y., Chen, T. Y., Williamson, D. F. & Mahmood, F. Synthetic data in machine learning for medicine and healthcare. Nat. Biomed. Eng. 5, 493–497 (2021).
Azizi, S. et al. Robust and efficient medical imaging with self-supervision. Preprint at https://doi.org/10.48550/arXiv.2205.09723 (2022).
Dries, R. et al. Advances in spatial transcriptomic data analysis. Genome Res. 31, 1706–1718 (2021).
Zheng, H., Brennan, K., Hernaez, M. & Gevaert, O. Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples. Gigascience 8, giz145 (2019).
Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
Lu, M. Y. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66 (1979).
Goode, A., Gilbert, B., Harkes, J., Jukic, D. & Satyanarayanan, M. OpenSlide: a vendor-neutral software foundation for digital pathology. J. Pathol. Inform. 4, 27 (2013).
Kather, J. N. et al. Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer. Nat. Med. 25, 1054–1056 (2019).
Ijaz, H. et al. Pediatric high-grade glioma resources from the Children’s Brain Tumor Tissue Consortium. Neuro Oncol. 22, 163–165 (2020).
Higgins, I. et al. beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations 1–13 (ICLR, 2017).
Hyvärinen, A. & Dayan, P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6, 695−709 (2005).
Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 23, 1661–1674 (2011).
Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 33, 6840–6851 (2020).
Ho, J. et al. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res. 23, 1–33 (2022).
Ronneberger, O., Fischer, P. & Brox, T. U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention (eds Navab, N. et al.) 234–241 (Springer, 2015).
Grill, J. B. et al. Bootstrap your own latent–a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020).
Kaiser, L. et al. Fast decoding in sequence models using discrete latent variables. Proc. Mach. Learn. Res. 80, 2390–2399 (2018).
Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
Newman, A. M. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 37, 773–782 (2019).
Harrell, F. E., Califf, R. M., Pryor, D. B., Lee, K. L. & Rosati, R. A. Evaluating the yield of medical tests. JAMA 247, 2543–2546 (1982).
Longato, E., Vettoretti, M. & Di Camillo, B. A practical perspective on the concordance index for the evaluation and selection of prognostic time-to-event models. J. Biomed. Inform. 108, 103496 (2020).
Graf, E., Schmoor, C., Sauerbrei, W. & Schumacher, M. Assessment and comparison of prognostic classification schemes for survival data. Stat. Med. 18, 2529–2545 (1999).
Carrillo-Perez, F. RNA-to-image multi-cancer synthesis using cascaded diffusion models, one million synthetic images. Dryad https://doi.org/10.5061/dryad.6djh9w174 (2023).
Acknowledgements
The results published here are in whole or in part based on data generated by the TCGA Research Network (https://www.cancer.gov/tcga). F.C.-P. was supported by MCIN/AEI/10.13039/501100011033 (grant number PID2021-128317OB-I00), Consejería de Universidad, Investigación e Innovación (grant number P20-00163), which are both funded by ‘ERDF A way of making Europe.’, and a Predoctoral scholarship from the Fulbright Spanish Commission. M.P. was supported by the Belgian American Educational Foundation and FWO (grant number 1161223N). Research reported here was further supported by the National Cancer Institute (NCI) (grant number R01 CA260271). This research used resources of the Argonne Leadership Computing Facility, a U.S. Department of Energy (DOE) Office of Science user facility at Argonne National Laboratory and is based on research supported by the U.S. DOE Office of Science-Advanced Scientific Computing Research Program, under Contract No. DE-AC02-06CH11357. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Author information
Authors and Affiliations
Contributions
F.C.-P., M.P. and O.G. conceived and designed the study. F.C.-P., M.P. and Y.Z. performed data preprocessing. F.C.-P. developed the code. T.N.N. and R.M. contributed to code optimization and parallel training. R.M. and T.N.N. provided access to the Argonne National Laboratory platform. J.S. performed the analysis of the clinical impact and analysed the digital pathology quality. Y.Z. obtained the deconvolved RNA-seq data. F.C.-P. and M.P. generated the figures. O.G. supervised the work and obtained the funding. F.C.-P. and O.G. wrote the manuscript with contributions and/or revisions from all authors.
Corresponding author
Ethics declarations
Competing interests
Stanford has submitted a provisional patent application for this work with patent number 18/538,743, United States, 2023. The authors declare no competing interests.
Peer review
Peer review information
Nature Biomedical Engineering thanks Moritz Gerstung, Ke Yuan and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Cell-percentage comparison between using bulk RNA-Seq and de-convolved expression.
a. Percentage of lymphocytes cells found by Hovernet in synthetic tiles generated using bulk RNA-Seq and haematopoietic de-convolved RNA-Seq. A significantly higher percentage of lymphocytes was found in all four out of five cancer types with a significantly p-value in four out of five of them (TCGA-CESC p-value = 0.15; TCGA-KIRP p-value = 6.08 × 10−21; TCGA-LUAD p-value = 9.86 × 10−16; TCGA-GBM p-value = 2.02 × 10−7; TCGA-COAD p-value = 1.07 × 10−22). The median difference is annotated in the plot per cancer type. b. UMAP projection of the bulk RNA-Seq expression (circles) and the counterpart deconvolved haematopoietic RNA-Seq (crosses). Clear differences can be observed in the expression, with a mean percentage difference of 7% across the cancer types, which corresponds to a similar increase in lymphocytes in the majority of the cancer types.
Extended Data Fig. 2 Microsatellite-instability-status prediction.
Comparison between a model trained from scratch and a model that have been pretrained using SimCLR on synthetic tiles, on a different number of real tiles sampled from the training set. Metrics are computed on a fivefold CV, and results correspond to those obtained on the different test sets. The model pretrained on the synthetic tiles always outperform the model trained from scratch, no matter the number of training samples that are used.
Supplementary information
Supplementary Information
Supplementary Figures and Tables.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Carrillo-Perez, F., Pizurica, M., Zheng, Y. et al. Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models. Nat. Biomed. Eng (2024). https://doi.org/10.1038/s41551-024-01193-8
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41551-024-01193-8