Data is the essential fuel driving organizations’ advanced analytics and machine learning initiatives, but between privacy concerns and process issues, it’s not always easy for researchers to get their hands on what they need. A promising new avenue to explore is synthetic data, which can be shared and used in ways real-world data can’t. However, this emerging approach isn’t without risks or drawbacks, and it’s essential that organizations carefully explore where and how they invest their resources.
What Is Synthetic Data?
Synthetic data is artificially generated by an AI algorithm that has been trained on a real data set. It is designed to retain the predictive power of the original data, but it replaces that data rather than disguising or modifying it. The goal is to reproduce the statistical properties and patterns of an existing data set by modeling its probability distribution and then sampling from that model. The algorithm essentially creates new records that share the characteristics of the original data, so analyses run on them should lead to the same answers. Crucially, when the generator is built well, it is virtually impossible to reconstruct the original data (think personally identifiable information) from either the algorithm or the synthetic data it has created.
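To make the "model the distribution, then sample from it" idea concrete, here is a minimal sketch in Python. It is an illustration only: it assumes the real data is purely numeric and roughly Gaussian, and it swaps in a simple multivariate normal fit for the AI model described above. The column meanings (age, income), the stand-in "real" data, and the function names `fit_gaussian` and `sample_synthetic` are all hypothetical.

```python
import numpy as np


def fit_gaussian(real):
    """Estimate the mean vector and covariance matrix of the real data."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)  # columns are variables, rows are records
    return mean, cov


def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw brand-new rows from the fitted distribution."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)


# Stand-in for a real data set (hypothetical): age and a correlated income.
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=5000)
income = 1000 * age + rng.normal(0, 5000, size=5000)
real = np.column_stack([age, income])

# Step 1: model the probability distribution of the real data.
mean, cov = fit_gaussian(real)

# Step 2: sample new records from the model instead of copying real ones.
synthetic = sample_synthetic(mean, cov, n_rows=5000)

# The synthetic rows should show roughly the same age/income relationship.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation:      {real_corr:.3f}")
print(f"synthetic correlation: {synth_corr:.3f}")
```

No synthetic row is a copy of a real row, yet the correlation between the two columns is preserved, which is why an analysis run on either set should reach a similar conclusion. A simple fit like this carries no formal privacy guarantee on its own, and production generators typically use far richer models to handle categorical fields, skewed distributions, and rare records.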