Adam Fletcher Data Scientist

Tech Focus Wed 2nd August, 2023

How synthetic data can speed up development

Access, governance and security are paramount when building data products. Incorrectly handled data can cause costly data breaches and reputational damage. However, putting these controls in place takes a large amount of time, which slows development. By using synthetic data, companies can safely shorten development time.

Using synthetic data as an alternative to production data means that developers can build tools in parallel with governance and security work, without putting production data at risk. This significantly reduces lead time because development no longer proceeds in a waterfall manner. It also helps to improve security testing, because synthetic data can be generated at the same scale as production data, which greatly reduces project risk.

Alternative Approaches

Traditional solutions to these problems usually involve either a small amount of manually created data or an anonymised production dataset. Both have drawbacks: a small dataset won’t reflect the load demands of production data, and there are proven ways to de-anonymise data and reveal sensitive information.

Synthetic data doesn’t contain any PII and can’t be de-anonymised, so it can safely be shared with teams in other territories, who can develop tools without ever needing access to production data.

Use Cases for Synthetic Data  

Reduce risk for data migrations 

During a data migration, synthetic data allows testing with realistic data loads without exposing production data, greatly reducing the risk of data leaks or errors. You can also generate larger volumes of synthetic data to check whether the system will be able to meet future requirements.

Single customer view 

If you are combining data from multiple sources, records may contain slightly different versions of the truth. An address might have changed, names could be spelled differently, or login locations could change.  How do you connect these disparate data sources accurately? 

Synthetic data allows you to build custom datasets that contain deliberate mismatches, along with unique identifiers that serve as ground truth for validating matches. This means the product’s matching capability can be tested and measured with accuracy statistics, improving its credibility.
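As a rough illustration of this idea, the sketch below builds two small synthetic sources that share a hidden entity_id, introduces spelling mismatches into one of them, and scores a matcher against the ground truth. Everything in it is an assumption made for the example: the name lists, the postcode field, the typo rule and the naive_match function are hypothetical stand-ins, not a real matching product.

```python
import random
from difflib import SequenceMatcher


def add_typo(text, rng):
    """Swap two adjacent characters to imitate a spelling mismatch."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def make_sources(n_entities, seed=0):
    """Build two sources describing the same entities, one with noisy names."""
    rng = random.Random(seed)
    first = ["Alice", "Bilal", "Chen", "Dana", "Elena", "Farid"]
    last = ["Smith", "Khan", "Wang", "Jones", "Petrov", "Okafor"]
    source_a, source_b = [], []
    for entity_id in range(n_entities):
        name = f"{rng.choice(first)} {rng.choice(last)}"
        postcode = str(rng.randint(10000, 99999))
        source_a.append({"entity_id": entity_id, "name": name, "postcode": postcode})
        # Roughly half of the records in source B get a misspelled name.
        noisy_name = add_typo(name, rng) if rng.random() < 0.5 else name
        source_b.append({"entity_id": entity_id, "name": noisy_name, "postcode": postcode})
    rng.shuffle(source_b)
    return source_a, source_b


def naive_match(a, b, threshold=0.8):
    """Hypothetical matcher under test: same postcode and similar name."""
    similarity = SequenceMatcher(None, a["name"], b["name"]).ratio()
    return a["postcode"] == b["postcode"] and similarity >= threshold


def evaluate(source_a, source_b):
    """Compare matcher decisions with the ground-truth entity_id."""
    tp = fp = fn = 0
    for a in source_a:
        for b in source_b:
            predicted = naive_match(a, b)
            actual = a["entity_id"] == b["entity_id"]
            tp += int(predicted and actual)
            fp += int(predicted and not actual)
            fn += int(actual and not predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


source_a, source_b = make_sources(200)
precision, recall = evaluate(source_a, source_b)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The matcher never sees entity_id; it is only used afterwards to score the results, which is what makes it usable as ground truth.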

Offshore workers 

In many cases, production data can’t leave specific territories. For example, data may not be allowed to leave the EU because of concerns around GDPR. This massively reduces the pool of workers available for projects. If an offshore team is able to develop using representative synthetic data, the organisation can stay within its data compliance obligations. The end product can then be deployed directly into production with maximum compatibility.

Machine learning and analytics 

Most synthetic data reflects production data in that data types match, as do data limits such as maximum and minimum values. However, if your workflow involves analytics or machine learning, the synthetic data also needs to follow the same distributions and correlations as the production data. With synthetic data of this kind, data scientists can work in lower-security environments on models that would typically require highly sensitive data.
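To make "same distributions and correlations" more concrete, here is a minimal sketch that generates two numeric columns with a chosen mean, standard deviation and correlation, then checks the correlation of the result. The column names (ages, incomes) and all summary statistics are invented for the example, and the approach assumes roughly normal data, which real production columns often are not.

```python
import math
import random
import statistics


def correlated_columns(n, mean_x, std_x, mean_y, std_y, rho, seed=0):
    """Draw n pairs from a bivariate normal with correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        xs.append(mean_x + std_x * z1)
        # Mixing z1 into y is what creates the correlation with x.
        ys.append(mean_y + std_y * (rho * z1 + math.sqrt(1 - rho**2) * z2))
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical target statistics, standing in for values agreed with the data owner.
ages, incomes = correlated_columns(10_000, 40, 12, 35_000, 9_000, rho=0.6)
print(f"correlation of synthetic columns: {pearson(ages, incomes):.2f}")
```

In this sketch only summary statistics are needed, not the underlying rows. Data with many columns or non-normal shapes would need a richer method.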

Generating synthetic data with machine learning brings its own challenges. To learn these distributions, the generating model must be given access to the production dataset, which can be a critical blocker for some projects. Another challenge is validating that the model is truly generating new values and not copying the production data.
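One simple first check for copying is to measure how many synthetic rows are exact duplicates of rows the model was trained on. The sketch below shows this under stated assumptions: rows are flat dictionaries with hashable values, and the function name exact_copy_rate and the toy rows are made up for illustration.

```python
def exact_copy_rate(training_rows, synthetic_rows):
    """Return the fraction of synthetic rows that also appear in the training rows."""
    training_set = {tuple(sorted(row.items())) for row in training_rows}
    copies = sum(
        tuple(sorted(row.items())) in training_set for row in synthetic_rows
    )
    return copies / len(synthetic_rows) if synthetic_rows else 0.0


# Toy rows invented for the example.
training = [
    {"age": 34, "city": "Leeds"},
    {"age": 51, "city": "Cardiff"},
]
synthetic = [
    {"age": 34, "city": "Leeds"},    # identical to a training row
    {"age": 47, "city": "Leeds"},
    {"age": 29, "city": "Cardiff"},
    {"age": 51, "city": "Bristol"},
]

rate = exact_copy_rate(training, synthetic)
print(rate)  # 0.25
```

An exact-match check only catches verbatim copies. Rows that are nearly identical to a training row would need a distance-based comparison, which this sketch does not attempt.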

Making Synthetic Data  

Generating typical synthetic data doesn’t require scanning or seeing the production data, with the exception of machine learning-generated data. It does, however, need some descriptive information such as table names, column names and high-level information about the type of data and any limits or patterns. 

For example, you might need to set upper and lower limits for numbers and dates that reflect the production data. When it comes to text information, it is important to know whether the column contains things like names or addresses, and the pattern of any IDs, such as two letters followed by four digits.
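The kind of descriptive information mentioned above can be captured in a small schema and used to drive generation. The following is a minimal sketch, not a production generator: the SCHEMA layout, column names, limits and name list are all hypothetical, and only the ID pattern (two letters followed by four digits) is taken from the example in the text.

```python
import random
import string
from datetime import date, timedelta

# Hypothetical column descriptions: metadata only, no production values.
SCHEMA = {
    "customer_id": {"type": "id", "letters": 2, "digits": 4},
    "balance": {"type": "float", "min": 0.0, "max": 10_000.0},
    "signup_date": {"type": "date", "min": date(2015, 1, 1), "max": date(2023, 8, 1)},
    "first_name": {"type": "choice", "values": ["Alice", "Bilal", "Chen", "Dana"]},
}


def generate_value(spec, rng):
    """Return one random value that respects the column description."""
    kind = spec["type"]
    if kind == "id":
        letters = "".join(rng.choices(string.ascii_uppercase, k=spec["letters"]))
        digits = "".join(rng.choices(string.digits, k=spec["digits"]))
        return letters + digits
    if kind == "float":
        return round(rng.uniform(spec["min"], spec["max"]), 2)
    if kind == "date":
        span_days = (spec["max"] - spec["min"]).days
        return spec["min"] + timedelta(days=rng.randint(0, span_days))
    if kind == "choice":
        return rng.choice(spec["values"])
    raise ValueError(f"Unknown column type: {kind}")


def generate_rows(schema, n_rows, seed=0):
    """Generate n_rows dictionaries, one value per column in the schema."""
    rng = random.Random(seed)
    return [
        {name: generate_value(spec, rng) for name, spec in schema.items()}
        for _ in range(n_rows)
    ]


rows = generate_rows(SCHEMA, n_rows=1000)
print(rows[0])
```

Because n_rows is just a parameter, the same schema can produce a handful of rows for unit tests or a much larger volume for load testing.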

Conclusion

Synthetic data is a vital tool for product development: it helps teams get up and running faster and build more safely, with proper load testing. If you’d like to know more about applications of synthetic data, get in touch with someone on our team.