
Organizations face a familiar dilemma: how can developers experiment and build models using realistic data without exposing sensitive customer information? Generative adversarial networks (GANs) offer a promising solution by creating synthetic data that mimics real datasets. In this post, we explore a practical approach: training a tabular GAN model in a secure production environment and then deploying that model in a development environment to generate synthetic data for training another model. We will use a finance-related scenario to illustrate how this pipeline works, discussing why it’s valuable and how to address key challenges along the way.
Why synthetic data for financial projects?
Working with financial data often means dealing with strict privacy regulations (e.g. GDPR, banking secrecy). Synthetic data acts as proxy data – it preserves the statistical characteristics of real-world data (distributions, correlations, etc.) without exposing actual sensitive records. This has several major benefits for finance projects:
1. Privacy preservation
Synthetic data does not contain real personal identifiers, so it poses no direct privacy risk to individuals and carries far less exposure if a data breach occurs. Developers can use realistic datasets without violating privacy regulations or confidentiality agreements.
2. Regulatory compliance
Since synthetic datasets are generated (not sampled from real customers), they help institutions share data internally or with partners without leaking personal information. This approach is privacy-by-design, ensuring compliance while still enabling data-driven innovation.
3. Data access and agility
Gaining access to production data can take a long time due to approval processes and data silos. Synthetic data can be generated quickly on demand, giving developers fast access to realistic data. This accelerates the model development lifecycle, because teams no longer have to wait for sanitized or masked data extracts.
4. Preserved business logic
Unlike random masking or anonymization which often destroy patterns and referential integrity, well-generated synthetic data retains the business logic and relationships of the original. This means analyses and models built on synthetic data produce reliable results akin to using real data. In fact, studies show models trained on high-quality synthetic data can achieve similar accuracy to models trained on original data. Also check out Using AI-generated synthetic data for easy and fast access to high quality data.
Training a Tabular GAN model in production
The first step is to train the GAN model in the production environment where the real financial data resides. We bring the compute to the data (instead of moving data around). Using PROC TABULARGAN here ensures that the real dataset never leaves the production servers during model training.
Why train in production? Because that’s where the truth is. The GAN needs to see the real data to learn its patterns.
Below is the example code to train our tabularGAN model – the documentation is available here.
* Configure the path of where the astore should be stored;
%let targetAstorePath = /export/pvs/sasdata/homes/gerdaw;

* Configure the interval variables;
%let intervalVariables = value clage;

* Configure the nominal variables;
%let nominalVariables = bad job;

proc tabularGAN data = sampsio.hmeq seed = 42 numSamples = 5;
    input &intervalVariables. / level = interval;
    input &nominalVariables. / level = nominal;
    gmm alpha = 1 maxClusters = 10 seed = 42 VB(maxVbIter = 30);
    aeOptimization ADAM LearningRate = 0.0001 numEpochs = 3;
    ganOptimization ADAM(beta1 = 0.55 beta2 = 0.95) numEpochs = 5;
    train embeddingDim = 64 miniBatchSize = 300 useOrigLevelFreq;
    saveState rStore = work.astore;
    output out = work.out;
run; quit;

Now we need to download our trained model so that we can move it to the development environment:

proc aStore;
    download rStore = work.astore store = "&targetAstorePath./gan_model.sasast";
run; quit;

And finally, in the development environment, we can generate new synthetic data that we can then use to train our ML models:

* Configure the path of where the astore was uploaded to;
%let targetAstorePath = /export/pvs/sasdata/homes/gerdaw;

* Number of target synthetic rows to generate;
%let numberOfSyntheticRows = 100;

data work.id;
    do i = 1 to &numberOfSyntheticRows.;
        output;
    end;
run;

proc aStore;
    upload rStore = work.gan_astore store = "&targetAstorePath./gan_model.sasast";
    score rStore = work.gan_astore out = work.synthetic_hmeq data = work.id copyVars = (_all_);
run; quit;
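To close the loop, a downstream model can now be trained on the synthetic table. The sketch below is a hypothetical example rather than part of the original pipeline: it assumes the work.synthetic_hmeq table produced by the score step above, and the choice of a simple logistic regression (PROC LOGISTIC) on the bad target is an assumption made for illustration, reusing the interval and nominal variables configured earlier.

```sas
* Hypothetical example: fit a simple model on the synthetic data;
* Assumes work.synthetic_hmeq was created by the score step above;
proc logistic data = work.synthetic_hmeq;
    * job is a nominal predictor, so declare it as a classification variable;
    class job;
    * Model the probability of a bad loan (bad = 1) from the synthetic features;
    model bad(event = '1') = value clage job;
run;
```

Because the synthetic data preserves the distributions and correlations of the original hmeq table, a model fitted this way should behave similarly to one trained on the real production data.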
Summary
So we trained a GAN model in production on real data, then moved that model into our development environment and generated new synthetic data to train new ML models:
Learn more
#synthetic #data #bridge #production #development