Two studies boost the use of synthetic data to advance cancer research

The UPM validates two methods to generate reliable synthetic data in cancer and survival analysis, key to training AI with few patients.

Two studies led by the Polytechnic University of Madrid (UPM) show that robust synthetic data can be created even when real data is limited, with direct applications in the study of cancer.

As detailed by the UPM, artificial intelligence (AI) requires large amounts of information to be trained. In the medical field, however, data is often scarce, highly varied, and difficult to share for ethical, legal, and confidentiality reasons. This difficulty is accentuated in areas such as oncology, rare diseases, or survival analyses, where gathering extensive patient databases is often unfeasible.

Faced with this scenario, a UPM group has carried out two complementary studies focused on improving the generation of synthetic data, that is, artificial records that reproduce the statistical patterns of the original data without replicating specific patients.

Based on the results obtained, the authors stress an idea relevant to the future of medical AI: "It is not enough to check whether synthetic data is useful for training a model for a specific task; it is also necessary to measure to what extent it really resembles the original data and whether it preserves complex relationships between variables."

For this reason, both studies highlight the importance of combining utility metrics with similarity metrics, which allows a more rigorous evaluation of the actual quality of synthetic data.

The first study, published in the journal 'Neurocomputing', proposes a methodology that allows generative models to learn better from very few real examples. The key element is the introduction of an "artificial inductive bias," a kind of prior mathematical guide that steers the model when it has very little data.

To this end, the team combined transfer learning and meta-learning techniques and analyzed different strategies, such as pre-training, model averaging, the so-called 'model-agnostic meta-learning' (MAML), and 'domain randomized search' (DRS).
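As a rough intuition for how knowledge from a larger, related dataset can guide a model that sees only a handful of real examples, the toy sketch below "pre-trains" a single parameter (a mean) on abundant source data and then adjusts it with five target samples. It is not the authors' method: the function names, the `prior_strength` parameter, and the simulated numbers are all hypothetical, chosen only to illustrate the idea of a prior acting as a guide.

```python
import random
import statistics


def fit_mean(samples):
    """'Train' the simplest possible model: estimate a mean from data."""
    return statistics.fmean(samples)


def fine_tune_mean(pretrained_mean, target_samples, prior_strength):
    """Blend a pretrained estimate with a few target samples.

    The pretrained mean counts as `prior_strength` pseudo-observations,
    so it dominates when target data is scarce and fades as data grows.
    (Hypothetical helper for illustration only.)
    """
    n = len(target_samples)
    return (prior_strength * pretrained_mean + sum(target_samples)) / (prior_strength + n)


rng = random.Random(0)  # fixed seed so the example is reproducible

# Simulated data (assumption): a large related cohort and a tiny target cohort.
source = [rng.gauss(50.0, 10.0) for _ in range(5000)]
target = [rng.gauss(55.0, 10.0) for _ in range(5)]

pretrained = fit_mean(source)   # learned from abundant related data
scratch = fit_mean(target)      # learned from the five target samples alone
tuned = fine_tune_mean(pretrained, target, prior_strength=5)

print(f"pretrained={pretrained:.2f} scratch={scratch:.2f} tuned={tuned:.2f}")
```

The tuned estimate always lies between the pretrained value and the from-scratch value; a larger `prior_strength` pulls it toward the pretrained one.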

The results show that transfer learning strategies generally performed best and markedly improved the quality of the generated synthetic data. In some experiments, the improvement reached up to 60 percent in Jensen-Shannon divergence, a metric that estimates how closely the distribution of synthetic data resembles that of real data.
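For readers unfamiliar with the metric, here is a minimal sketch of Jensen-Shannon divergence for one categorical variable, using only the Python standard library. The tumor-stage labels are made-up toy data, and the helper names are illustrative; the studies' own evaluation pipeline is not described in this article. With base-2 logarithms the value ranges from 0 (identical distributions) to 1 (no overlap), so lower is better.

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with a probability of 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # Mixture distribution: the midpoint between p and q.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def histogram(values, categories):
    """Relative frequency of each category, in a fixed category order."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


# Toy data (assumption): tumor stage in a real and a synthetic cohort.
stages = ["I", "II", "III", "IV"]
real = ["I", "II", "II", "III", "II", "I", "IV", "III"]
synthetic = ["I", "II", "II", "II", "III", "I", "III", "III"]

p = histogram(real, stages)
q = histogram(synthetic, stages)
score = js_divergence(p, q)
print(f"JS divergence: {score:.4f}")
```

Because the mixture `m` is positive wherever `p` or `q` is, the metric stays finite even when the synthetic data misses a category entirely, as with stage IV in this toy example.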

Application to Cancer and Survival Analysis

The second study, published in the 'IEEE Journal of Biomedical and Health Informatics', adapts this methodology to the biomedical field and validates it in oncological research and survival models. This type of analysis allows estimating the time until a relevant clinical event occurs, such as a relapse, tumor progression, or patient death, and is particularly affected by the lack of information.
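To make "time until an event" concrete, the sketch below implements the Kaplan-Meier estimator, a standard survival-analysis tool. The article does not say which survival models the study used, so this is offered only as background; the follow-up times and event flags are invented toy values. An event flag of 1 means the event (for example, a relapse) was observed, and 0 means the patient was censored, that is, follow-up ended before any event.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: follow-up time per patient
    events:    1 if the event was observed, 0 if the patient was censored
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)   # patients still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0      # events at time t
        leaving = 0       # patients leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            # Multiply by the fraction of at-risk patients who survive time t.
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Toy cohort (assumption): follow-up in months for five patients.
durations = [5, 8, 8, 12, 15]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"month {t}: estimated survival {s:.2f}")
```

Censored patients still count toward the at-risk group until they drop out. With only five patients each event moves the curve sharply, which illustrates why small cohorts make these estimates unstable.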

The work shows that the approach is also effective in this highly demanding environment and makes it possible to generate high-quality synthetic data even when the initial conditions are very restrictive.

The implications of this line of work are broad. In the words of Patricia Alonso, a researcher at the UPM: "Having reliable synthetic data can, on the one hand, facilitate the development and validation of AI tools in hospitals and research centers with scarce data and, on the other hand, favor studies in small cohorts, as well as open new avenues for collaboration and open science without compromising patient privacy."