Synthetic Data: Meet the Unsung Catalyst in AI Acceleration
- By Zeid Khater, Forrester
- July 06, 2024
Synthetic data is not a new phenomenon. Rules-based synthetic data has been around longer than most people realize and is commonly used in analytics for data augmentation, conjoint analysis, and simulation testing. Rules-based methods, however, lack flexibility and struggle with complex data distributions: assumptions made during rule creation don't always hold universally, and manually defining rules becomes impractical as datasets grow. Generative AI (GenAI) models such as generative adversarial networks (GANs) and variational autoencoders (VAEs) make it easier to produce realistic synthetic data quickly. They learn complex distributions directly from real data and generate higher-quality synthetic data, which can then train better-performing AI models.
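To make the contrast with rules-based methods concrete, here is a minimal sketch of the VAE approach for tabular data, assuming standardized numeric records. The names (TabularVAE, the toy dataset) are illustrative assumptions, not any vendor's reference implementation.

```python
# Minimal VAE for tabular synthetic data (illustrative sketch).
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.mu = nn.Linear(64, latent_dim)
        self.log_var = nn.Linear(64, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: sample the latent z while keeping gradients.
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.decoder(z), mu, log_var

def vae_loss(recon, x, mu, log_var):
    # Reconstruction error plus KL divergence from the standard normal prior.
    recon_loss = nn.functional.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl

model = TabularVAE(n_features=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
real = torch.randn(1024, 5)  # stand-in for standardized real records
for _ in range(200):
    recon, mu, log_var = model(real)
    loss = vae_loss(recon, real, mu, log_var)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Sampling: decode random latent vectors into brand-new synthetic rows.
with torch.no_grad():
    synthetic = model.decoder(torch.randn(500, 8))
```

The key design point is that no rules are hand-written: once trained, the decoder alone turns random latent noise into new rows, so the synthetic sample can be made arbitrarily large and is drawn from the learned distribution rather than looked up from stored records.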
Forrester defines synthetic data as:
Generated data of any type (e.g., structured, transactional, image, audio) that duplicates, mimics, or extrapolates from the real world but maintains no direct link to it, particularly for scenarios where real-world data is unavailable, unusable, or strictly regulated.
GenAI-based synthetic data is becoming the unsung hero of AI development. For example, we have synthetic data to thank for Microsoft's Phi-1 base model, which was trained on a curated, "textbook-quality" synthetic dataset rather than exclusively on traditional web data, an approach that appears to help mitigate toxic and biased content generation. Smaller models like these will continue to be crucial in scaling GenAI implementation for industry-specific use cases.
Synthetic data is also likely to grow in popularity because it can speed up AI model training by generating large, clean, relevant datasets. NVIDIA claims its Isaac Sim simulation application can help "train [computer vision models] 100 times faster." Synthetic data providers are emerging to democratize AI training, and their solutions are not limited to computer vision systems. Synthetic data provider Gretel, for example, released the world's largest open-source text-to-SQL synthetic dataset to assist developers in training their models on tabular data.
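As an illustration of how a developer might pull such a dataset, the sketch below uses the Hugging Face datasets library; the dataset id "gretelai/synthetic_text_to_sql" and its record layout are assumptions about where and how Gretel publishes it, so check the dataset card before relying on them.

```python
# Hedged sketch: load Gretel's open-source text-to-SQL dataset for training.
from datasets import load_dataset

# Dataset id is an assumption; verify it on the Hugging Face Hub.
ds = load_dataset("gretelai/synthetic_text_to_sql", split="train")
print(len(ds))  # number of synthetic prompt/SQL training pairs
print(ds[0])    # one record: a natural-language prompt with its SQL answer
```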
One of the most salient advantages of synthetic data for AI model training is data privacy. Because the generated data maintains no direct link to the original dataset, it cannot be traced back to its source. This attribute holds particular significance in sensitive domains such as healthcare, medical research, and financial services, where data utilization for AI training is highly regulated and requires strict adherence to privacy laws and regulations.
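That said, teams typically verify that a generator has not simply memorized real records before claiming privacy benefits. Below is a minimal sketch of one common heuristic, distance to closest record (DCR); the function name is ours, and DCR is a sanity check rather than a formal privacy guarantee such as differential privacy.

```python
# Distance to closest record (DCR): flag synthetic rows that sit suspiciously
# close to real rows, which can indicate memorization. Illustrative sketch.
import numpy as np

def distance_to_closest_record(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    # Euclidean distance from each synthetic row to its nearest real row.
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 5))      # stand-in for real records
synthetic = rng.normal(size=(200, 5))  # stand-in for generated records
dcr = distance_to_closest_record(synthetic, real)
print(f"median DCR: {np.median(dcr):.3f}, exact copies: {int((dcr == 0).sum())}")
```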
As AI continues its rapid expansion, the demand for training data escalates in tandem, and robust regulatory frameworks are being established in response. Synthetic data emerges as a viable solution, enabling faster model training to meet market demands while remaining fully compliant with regulatory constraints.
The original article is here.
The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends. Image credit: iStockphoto/Flashvector
Zeid Khater, Forrester
Zeid Khater is a Forrester analyst and covers customer data and analytics. His work helps customer insights professionals utilize their data to generate insights, reach the right customers, increase loyalty and profitability, and inform product strategy and service delivery. His research agenda also includes the uses and applications of third-party data.