AI is always hungry. To clarify, the smartness in Artificial Intelligence (AI) is a function of how much data we allow our AI ‘engines’ to ingest and how well the datasets that comprise that data encapsulate a full and comprehensive image of the world we are attempting to digitize.
Only when we have fed AI engines fully do they run properly. But over-feeding is okay too: AI is able to chew through the repeats and deduplicate any extraneous data that it has already been fed. Besides, seeing the same dish served up more than once just affirms AI’s ability to know what shape, size and flavor it actually is.
There are plenty of baseline AI explanations out there. Most of them will remind us that AI doesn’t know the difference between a cat and dog (both have four legs, fur, a tail and big teeth) until it has worked through many thousands of example images. It is this challenge of getting enough images (and video) that makes training AI for planet Earth so tough.
AI basic training (20 weeks)
Industry experts at UK-based AI vision system training specialist Mindtech suggest that it can take up to 20 weeks to gather and properly annotate the baseline 100,000 real-world images necessary to start training a visual AI system to do something novel, like pick out a lost child in a busy shopping mall. It takes significantly more images to help a delivery robot service to safely navigate spaces where children are playing.
Data scientists working on these kinds of AI systems can spend up to 80% of their time gathering, cleaning and manually annotating real-world images so they can be digested by AI systems. Clearly, that doesn’t leave much room for network development or gleaning insights from the data.
There is, however, a way around this roadblock, but it requires taking a step back from our very human view of what is ‘real’ and what is ‘synthetic’ and instead approaching this industry-wide problem from the perspective of the AIs we’re looking to train. To an AI, well-created and perfectly annotated synthetic data (that is, images and video created by a computer) is just as good as, and in many cases better than, real-world data.
What is synthetic data?
Mindtech CEO Steve Harris defines synthetic data as follows: “Synthetic data in the context of AI vision systems refers to annotated images that have been created from computer generated, photo-realistic, 3D environments containing objects, things, people (or anything). The annotated images and metadata created are used to train AI vision systems and complement ‘real-world’ data. Synthetic data and real data are as authentic as each other. Synthetic data is comprised of binary bits and bytes, just like any other piece of information, while real data is synthetic in its own way too; the images and video that real data is extracted from are taken by cameras and processed by chips in a way that often introduces noise and interference. In this way, to a computer AI, synthetic data is actually cleaner than real-world data because there’s no unexpected noise.”
Harris further explains that one of the key advantages achievable here is that synthetic image data exists without bias or privacy issues (because it’s not really real, per se) and offers alternative angles, views, shapes, lighting, shading and so on for images of things that have existed (or still do exist) back on planet Earth. These are all the alternative image forms that would typically take an age to collate, photograph, capture, procure and ingest.
A case for corner case
Along with this, hundreds of thousands of what Mindtech’s Harris calls corner-case scenarios (camera location modeling, different lighting and other variables), which would be hard to create in the real world, can be quickly and easily created in a 3D virtual environment.
Nightmare or edge-case scenarios are perfect use cases for synthetic data too. We can safely recreate these scenarios (gun crime, objects on a busy road, a lost child in a shop) in a virtual world without harming anyone or anything. Mindtech illustrates how synthetic data is created on its platform.
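To give a rough sense of how quickly corner cases multiply once the variables are under software control, here is a minimal Python sketch. It is not Mindtech's platform or API; the variation axes, their values and the `enumerate_scenes` helper are all hypothetical and chosen only for illustration. Each combination of camera position, lighting and scenario becomes one scene description that a 3D renderer could then turn into annotated images.

```python
from itertools import product

# Hypothetical variation axes; the values are illustrative assumptions,
# not settings taken from any real synthetic data platform.
CAMERA_HEIGHTS_M = [2.5, 4.0, 6.0]
CAMERA_ANGLES_DEG = [0, 30, 60]
LIGHTING = ["daylight", "dusk", "fluorescent", "low_light"]
SCENARIOS = ["lost_child", "object_on_road"]


def enumerate_scenes():
    """Yield one scene configuration per combination of the axes above."""
    for height, angle, light, scenario in product(
        CAMERA_HEIGHTS_M, CAMERA_ANGLES_DEG, LIGHTING, SCENARIOS
    ):
        yield {
            "camera_height_m": height,
            "camera_angle_deg": angle,
            "lighting": light,
            "scenario": scenario,
        }


scenes = list(enumerate_scenes())
print(len(scenes))  # 3 * 3 * 4 * 2 = 72 scene configurations
```

Even these four short lists yield 72 distinct scenes, and every extra axis (weather, crowd density, clothing color) multiplies the total again, which is how the counts reach the hundreds of thousands.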
“Synthetic data gives companies the ability to create photo-realistic 3D worlds and extract unlimited synthetic data to train their visual AI models. Instead of spending months just finding 100,000 real-world images to train a visual AI system, a machine learning engineer can use a synthetic data software creation platform for AI training to generate the same number of high-quality images over the course of a couple of days,” said Harris.
Advocates of synthetic data argue that there is a huge opportunity to scale the AI industry exponentially through this technique if engineers are willing to embrace the synthetic data route.
It surprises many, but studies show that synthetic data actually enhances a machine learning model’s accuracy. According to McKinsey’s 2020 State of AI report into high-performing AI companies, 49% of them are already using synthetic data to train their AI models.
The big four factor
Harris suggests that most companies starting their AI journey face the challenge of accessing enough high-quality images and videos to train their visual AI systems. These companies (from start-ups to scale-ups and even global enterprise companies) are competing against the ‘big four’: Apple, Amazon, Facebook and Google. Google’s engineers for example have access to more than four trillion images stored in Google Photos.
“These giants of the tech world tend to restrict access to this wealth of potential training data. It secures their competitive advantage to develop new products and solutions and in the future monetize their datasets. They’re not immune to some of the problems that smaller companies face though. Finding the relevant images in the trillions available is non-trivial, and once found, they still need annotating,” said Mindtech’s Harris.
This whole subject is heating up. At the same time, data privacy regulations such as the EU’s General Data Protection Regulation (GDPR) are being more readily enforced. For example, in 2019, Microsoft deleted its database of 10 million images — the largest publicly available facial recognition data set at the time — due to data privacy concerns. The Mindtech team insists that these factors combine to create a scarcity of real-world visual data for those looking to develop and train new visual AI systems, leaving only the very largest tech companies able to compete.
Could synthetic data be an AI leveler?
“If we want to see a competitive landscape, with different companies able to develop AIs that better understand the world and human interaction, then it’s clear we will need to democratize access to training data, resolve privacy concerns around that data, and expedite the way data is annotated. The only solution is to use synthetic data: computer-generated 3D video and images,” explained Harris.
In terms of the optimum amount of synthetic data to use, academic research suggests the best training results come from data sets made up of 90% synthetic data and 10% real-world data. This research was backed up by Deloitte Consulting, which found that an AI model trained using 80% synthetic data had similar accuracy to a model trained on real data.
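For readers who want to see what a 90/10 split looks like in practice, the following is a minimal Python sketch of assembling such a training set. The `mix_datasets` function, the file names and the pool sizes are hypothetical and exist only for this example; the research cited above does not prescribe this particular procedure.

```python
import random


def mix_datasets(synthetic, real, total, synthetic_fraction=0.9, seed=0):
    """Draw a shuffled training set with the requested synthetic share.

    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    n_synthetic = round(total * synthetic_fraction)
    n_real = total - n_synthetic
    if n_synthetic > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples to build the requested mix")
    # Sample without replacement from each pool, then interleave.
    mixed = rng.sample(synthetic, n_synthetic) + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed


# Hypothetical image pools standing in for rendered and photographed data.
synthetic_pool = [f"synthetic_{i:05d}.png" for i in range(2000)]
real_pool = [f"real_{i:05d}.jpg" for i in range(500)]

train = mix_datasets(synthetic_pool, real_pool, total=1000)
n_synthetic = sum(name.startswith("synthetic_") for name in train)
print(n_synthetic, len(train) - n_synthetic)
```

Changing `synthetic_fraction` to 0.8 would reproduce the proportion in the Deloitte comparison, so the same helper can be used to test different ratios against a held-out set of real images.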
As already suggested above, synthetic data is also a way of eliminating biases that exist in real-world visual data. In the virtual world, different ethnicities, age groups and diversity in terms of color of clothing or sex are much easier to create. As data changes over time, it’s easier to reflect this in a virtual environment to avoid data drift impacting an AI model’s performance.
“Using synthetic data not only saves time and money, but it guarantees that the data AI is trained on is robust and does not create data privacy or ethical bottlenecks. I expect to see the influence synthetic data has on real-world AI solutions continue to grow in the coming years. By using synthetic training data, we can help machines better understand the way humans interact with each other and the world around them,” concluded Harris.
Synthetic data is a certainty
Gartner predicts that by 2024, 60% of data used for AI and data analytics projects will be synthetic, and that by 2030, synthetic data will have completely overtaken real data in AI models.
“There is a risk of false, early-stage perceptions surrounding the use of synthetic data in some circles. This is most likely due to the naming of the term itself, as anything ‘synthetic’ might naturally be thought of as plasticized, non-organic, or in some way fake. But, of course, there should be nothing more natural than machine learning tuition being driven by machine intelligence. Properly generated, managed, maintained and secured, synthetic data’s level of bias handling, safety, privacy and cadence represent a significant accelerator and enabler for the AI capabilities of tomorrow,” said Nelson Petracek, CTO, TIBCO Software.
In practice, synthetic data is already used in healthcare to train machines to monitor patients recovering from surgery; in security and surveillance systems to detect suspicious objects or unusual patterns of behavior inside shopping centers or sports arenas; and in delivery drones that need to understand the world around them.
The suggestion and technology proposition here is as follows: synthetic data might be synthetic in nature at its core, but its DNA stems from the real world and its application, use case and validation points are so tangible, multifarious and pragmatic, that it is in fact a really real reality.