4 minute read 1 Dec 2022

Tech in ESG

Synthetic data: fake is the new real

Related topics Technology Digital Consulting AI Data and decision intelligence

Digitally created synthetic data helps train AI models even for unprecedented conditions.

A leading US-based tech company that makes AI-based chips for automobiles wanted to train its deep-learning networks for autonomous vehicle applications such as object detection, safety monitoring, lane keeping and parking. Getting quality data (and avoiding poor data) from live tests on real cars and roads, and then analyzing it, would have meant gathering and curating thousands of pictures. Even that would have been expensive and still inadequate. The team found an alternative: it created thousands of data sets using micro-simulations of cars driving on virtual streets, modeled on real-world data such as road conditions, vehicle specifications and weather, and including dangerous or rare conditions.
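As a rough illustration of how simulated scenarios can cover rare conditions, here is a minimal Python sketch. It is not the company's actual pipeline: the parameter lists, the `sample_scenario` and `generate_scenarios` helpers and the `rare_event_rate` value are all hypothetical. The idea is simply that a generator can raise the frequency of edge cases well above what live road tests would capture.

```python
import random

# Hypothetical scenario parameters; the article does not list the real ones.
WEATHER = ["clear", "rain", "fog", "snow"]
ROAD = ["dry", "wet", "icy"]
RARE_EVENTS = ["pedestrian_jaywalking", "stalled_vehicle", "debris_on_road"]


def sample_scenario(rng, rare_event_rate=0.3):
    """Draw one virtual driving scenario.

    rare_event_rate is set deliberately higher than any real-world frequency
    so that edge cases are well represented in the generated set.
    """
    scenario = {
        "weather": rng.choice(WEATHER),
        "road": rng.choice(ROAD),
        "speed_kmh": round(rng.uniform(20, 120), 1),
        "rare_event": None,
    }
    if rng.random() < rare_event_rate:
        scenario["rare_event"] = rng.choice(RARE_EVENTS)
    return scenario


def generate_scenarios(n, seed=0, rare_event_rate=0.3):
    """Generate n scenarios reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [sample_scenario(rng, rare_event_rate) for _ in range(n)]


scenarios = generate_scenarios(1000)
rare_share = sum(s["rare_event"] is not None for s in scenarios) / len(scenarios)
print(f"share of scenarios with a rare event: {rare_share:.2f}")
```

In a real setup each scenario record would drive a simulator that renders labeled images; here it only shows how the mix of conditions can be controlled.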

Many organizations are turning to synthetic data, that is, data generated digitally using algorithms. Collecting and labeling high-quality data from the real world is time- and resource-intensive, and such data is often unavailable, inconsistent or biased. Moreover, real data sets may not cover every possible combination of conditions, especially edge cases. Synthetic data solves these problems to a great extent.

Synthetic data replicates the statistical characteristics and patterns of an existing data set by modeling its probability distribution and sampling from it, which is why it can offer the same predictive power as real data. It can also be generated for unseen conditions and events. Synthetic data can train AI models, test systems and build prototypes when actual data sets lack quality, volume or variety. It allows customization, avoids many privacy concerns and enables faster turnaround for product testing.
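To make "modeling the probability distribution and sampling from it" concrete, here is a minimal sketch using only the Python standard library. It assumes the simplest possible model, a two-column Gaussian, and the "real" age and income values are made up for the example. Production generators use far richer models, so this is an illustration of the principle rather than a recommended method.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(xs, ys):
    """Estimate means, standard deviations and correlation of two columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_bivariate_gaussian(params, n, rng):
    """Draw n synthetic (x, y) pairs that follow the fitted distribution."""
    rho = params["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mx"] + params["sx"] * z1
        # Mix z1 and z2 so that y has the fitted correlation with x.
        y = params["my"] + params["sy"] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        rows.append((x, y))
    return rows


rng = random.Random(42)
# Stand-in "real" data: invented age and income values, for illustration only.
real_age = [rng.gauss(40, 10) for _ in range(5000)]
real_income = [1000 * a + rng.gauss(0, 8000) for a in real_age]

params = fit_bivariate_gaussian(real_age, real_income)
synthetic = sample_bivariate_gaussian(params, 5000, rng)
synth_params = fit_bivariate_gaussian(*zip(*synthetic))
print(f"real rho={params['rho']:.2f}, synthetic rho={synth_params['rho']:.2f}")
```

No synthetic row corresponds to a real record, yet the means, spreads and correlation of the two sets should be close, which is what lets a model trained on one behave similarly on the other.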

Generated either from real data sets or created using existing models, synthetic data is propelling AI. According to Gartner, by 2024, 60% of the data used to develop AI and analytics projects will be synthetically generated.

Generating data

Two simultaneous trends are driving the demand for synthetic data. First, building and training AI/ML models requires large amounts of clean data, and generating high-quality synthetic data has become simpler. Second, data privacy rules are driving up the need for de-identified data to create large aggregate databases that can be used for more accurate analytics and AI models. Various groups, including the AI/ML community, businesses and government agencies, are adopting different data synthesis approaches to support model building, application development and data dissemination.

Generally, a synthetic data set, whether it contains binary, numerical and categorical data or unstructured data like images and video, is generated by deep learning generative models such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs) and diffusion models. Lately, transformer models have also been gaining popularity, particularly in natural language processing. Users can choose data that is either fully synthetic or partially synthetic (where only the sensitive information from the original data is hidden). Compared to rule-based test data, synthetic test data is easier to generate, which also reduces the cost of generating training data.
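The difference between fully and partially synthetic data can be sketched in a few lines of Python. The example below is a simplified assumption rather than a GAN or VAE: it keeps the non-sensitive columns as they are and replaces only a sensitive column with values drawn from a distribution fitted within each group. The records, the choice of `salary` as the sensitive field and the `partially_synthesize` helper are all invented for illustration, and this approach by itself gives no formal privacy guarantee.

```python
import random
import statistics
from collections import defaultdict

# Made-up records; "salary" is treated as the sensitive field.
records = [
    {"dept": "sales", "tenure_years": 3, "salary": 52000},
    {"dept": "sales", "tenure_years": 7, "salary": 61000},
    {"dept": "sales", "tenure_years": 1, "salary": 48000},
    {"dept": "eng", "tenure_years": 2, "salary": 83000},
    {"dept": "eng", "tenure_years": 9, "salary": 104000},
    {"dept": "eng", "tenure_years": 5, "salary": 95000},
]


def partially_synthesize(rows, sensitive_key, group_key, rng):
    """Return copies of rows with only the sensitive field replaced.

    The replacement is drawn from a Gaussian fitted to the sensitive values
    of the same group, so group-level statistics are roughly preserved.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[sensitive_key])
    fitted = {
        g: (statistics.fmean(v), statistics.pstdev(v)) for g, v in groups.items()
    }

    out = []
    for row in rows:
        mean, std = fitted[row[group_key]]
        new_row = dict(row)  # copy, so the original records stay untouched
        new_row[sensitive_key] = round(rng.gauss(mean, std))
        out.append(new_row)
    return out


synthetic_records = partially_synthesize(records, "salary", "dept", random.Random(7))
for row in synthetic_records:
    print(row)
```

A fully synthetic version would instead generate every column from a fitted model, so that no original value survives in the output.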

Providing insights: from healthcare to fashion

Data generated through simulation environments allows users to conduct “what if” analyses and design new test scenarios. This is particularly useful when no real data is available. During the COVID-19 pandemic, many of the AI models that healthcare professionals and researchers used required advanced computation. Researchers used large quantities of synthetic data that was based on actual patient data but not directly derived from individual records. Synthetic data was also used to study the spread and impact of the pandemic over time across densely tested geographic areas.

Use cases are emerging in sectors such as financial services, software testing, pharmaceuticals, manufacturing and distribution, retail and fashion. For instance, banks and financial services companies can use synthetic data to evaluate potential market behaviors, design algorithms for more equitable loan distribution, combat financial fraud and develop new products and services.

In the pharmaceutical industry, synthetic data is useful when handling large but sensitive samples, where regulatory restrictions and data privacy are challenges. It enables faster and better trials as well as cross-border research.

In agriculture, digitally generated data can help in developing computer vision applications for crop yield prediction, crop disease detection, fruit identification and plant growth modeling.

Natural language processing is an area where synthetic data is used widely, especially in training virtual voice assistants. In manufacturing, synthetic data is used to train AI/ML models for industrial robots, enabling factory automation and allowing robots to perform complex tasks on the production line. In retail, artificially generated data sets can train AI for autonomous check-out systems and cashier-less stores, and can support the study of customer demographics. Apart from these, advanced ML models trained on synthetic data help e-commerce companies improve warehousing and inventory management.

Synthetic data has multiple use cases and solves many of the problems associated with real-world data. It is, however, not a one-stop solution. There are significant risks and limitations, since the quality of the data generated largely depends on the quality of the model that created it. This means that biases can still exist, and the data can become obsolete quickly. Even so, advances in synthetic data generation will boost the accuracy of ML models and accelerate AI. Used with due caution, synthetic data has the potential to make software more trustworthy as well as transform the economics of data.
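The point that biases can persist is easy to demonstrate with a toy example. In the Python sketch below, the source sample is deliberately skewed, with made-up groups "A" and "B", and the generator is the simplest possible one: it learns category frequencies and samples from them. The `fit_categorical` and `sample_categorical` helpers are hypothetical names used only for this illustration.

```python
import random
from collections import Counter


def fit_categorical(values):
    """Learn the relative frequency of each category in the source data."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def sample_categorical(probs, n, rng):
    """Draw n synthetic category labels from the learned frequencies."""
    categories = list(probs)
    weights = [probs[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


rng = random.Random(1)
# Made-up, deliberately skewed source sample: group B is under-represented.
real_groups = ["A"] * 900 + ["B"] * 100

probs = fit_categorical(real_groups)
synthetic_groups = sample_categorical(probs, 10000, rng)
synth_share_b = synthetic_groups.count("B") / len(synthetic_groups)
print(f"share of group B: real={probs['B']:.2f}, synthetic={synth_share_b:.2f}")
```

Generating ten times more rows does not correct the imbalance, because the generator faithfully reproduces whatever skew its source data contains. Correcting it requires a deliberate choice, such as reweighting the learned frequencies before sampling.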

Summary

Synthetic data generated using algorithms can train AI models, test systems, and build prototypes when actual data sets lack quality, volume, and/or variety. It has the same predictive power as real data and is currently propelling AI and solving many of the problems associated with real-world data. There are multiple use cases across sectors, including financial services, automobiles, retail, healthcare, and pharmaceuticals. The challenge is that the quality of synthetic data generated depends on the quality of the model that created it.

About this article

By Alexy Thomas

EY India Technology Consulting Partner

Technology enthusiast, Data-driven.

EY refers to the global organization, and may refer to one or more, of the member firms of Ernst & Young Global Limited, each of which is a separate legal entity. Ernst & Young Global Limited, a UK company limited by guarantee, does not provide services to clients.

This material has been prepared for general informational purposes only and is not intended to be relied upon as accounting, tax, or other professional advice. Please refer to your advisors for specific advice.
