• 11 August, 2020

Training CV Models with Synthetic Data (w/ Cesar Romero + Unity)

In this talk, Rsqrd AI welcomes Cesar Romero, Principal ML Engineer at Unity Technologies! Cesar speaks about the value and potential of synthetic data and how Unity has already invested in its development.

Real World Data

Challenges with Real World Data

Real world data is valuable, but there are limitations to collecting it and using it.

  • Privacy and Compliance – the data belongs to the users rather than to whoever collects it, and there are restrictions on how it can be used
  • Biased and Insufficient Real World Data – the data can carry inherent biases, and there may not be enough of it
  • Time Consuming – it takes a lot of time to collect and annotate data
  • Expensive – there are monetary costs to collecting data and processing it

Biases and Limited Real World Data

Biases in a dataset can result in unfavorable and inaccurate outcomes. Bias can come from anywhere, and it can have a big impact on the final result.

Cesar recounts a recent dilemma which gained a lot of traction on Twitter. Here is a sample of the Flickr Faces dataset:

Flickr faces dataset

As you can see, there is a good assortment of different ethnicities and characteristics. However, a system that generated faces using this dataset produced the following result:

Pixelated photo of Obama (left) and the generated face, which doesn't look like Obama (right)

As humans, we can recognize that the picture on the left is a pixelated image of Barack Obama. The output looks very different and has none of his recognizable features.

This generated a lot of discussion about where the bias came from and who is responsible. Is it the data? Is it the people? Cesar proposes viewing this as an opportunity for more awareness: if you simulate and generate the data you need, you control that data and can mitigate some of its biases.

Synthetic Data

Economics of Synthetic and Real Data

Cesar focuses on a key observation about synthetic data:

If your simulation can be created using game engines like Unity, it has full knowledge of what it’s rendering and therefore, you can think of getting labels kind of for free

Synthetic data has the potential to be significantly cheaper than real world data, since much of the cost of collecting and processing real world data goes away. Also, because you’re creating data and not collecting it, you can ultimately create much more data than you could collect.
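
To make the "labels for free" point concrete, here is a minimal sketch in plain Python rather than Unity. The toy `render_scene` function and its parameters are hypothetical and not part of any Unity API. The idea it illustrates: because the generator decides where each object goes, it can emit the bounding box together with the pixels, with no separate annotation step.

```python
import random

def render_scene(width, height, num_objects, rng):
    """Toy 'renderer': draws filled rectangles on a grid and returns
    the image together with the bounding boxes it already knows."""
    image = [[0] * width for _ in range(height)]  # 0 = background
    labels = []
    for object_id in range(1, num_objects + 1):
        # The scene generator chooses size and position itself.
        w = rng.randint(2, width // 2)
        h = rng.randint(2, height // 2)
        x = rng.randint(0, width - w)
        y = rng.randint(0, height - h)
        for row in range(y, y + h):
            for col in range(x, x + w):
                image[row][col] = object_id  # later objects occlude earlier ones
        # The label is a by-product of rendering, not a manual annotation.
        labels.append({"object_id": object_id, "bbox": (x, y, w, h)})
    return image, labels

rng = random.Random(0)  # fixed seed so the scene is reproducible
image, labels = render_scene(width=16, height=12, num_objects=3, rng=rng)
```

Each call returns an image and a matching list of `(x, y, width, height)` boxes, so generating more labeled examples only costs compute.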

Survey of Synthetic Data in Practice

Cesar presents some examples of synthetic data in the world of computer vision and how it’s been very effective.

One example is from Hinterstoisser et al 2019 from Google Research. They trained a system to do object detection on groceries using synthetic data. They rendered many images of grocery items at different angles, tracked other properties such as camera zoom, and used this synthetic data to train their model.

synthetic data has better results than real data
Source: Hinterstoisser et al 2019. An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detection

They found that the model trained purely on synthetic data achieved better accuracy than a model trained on real data.
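
Rendering items "at different angles" with varying camera zoom amounts to sampling render parameters at random for every image. The sketch below shows one way this could look. The `RenderParams` fields, the value ranges, and the lighting parameter are assumptions made for illustration and are not taken from the paper or from Unity.

```python
import random
from dataclasses import dataclass

@dataclass
class RenderParams:
    yaw_degrees: float      # rotation of the object around the vertical axis
    pitch_degrees: float    # tilt of the object
    camera_zoom: float      # 1.0 = nominal distance (hypothetical scale)
    light_intensity: float  # assumed extra property, not named in the talk

def sample_render_params(rng):
    """Draw one random configuration; ranges are illustrative only."""
    return RenderParams(
        yaw_degrees=rng.uniform(0.0, 360.0),
        pitch_degrees=rng.uniform(-90.0, 90.0),
        camera_zoom=rng.uniform(0.5, 2.0),
        light_intensity=rng.uniform(0.2, 1.0),
    )

rng = random.Random(42)  # fixed seed so a dataset can be regenerated
param_batch = [sample_render_params(rng) for _ in range(1000)]
```

Each sampled `RenderParams` would be handed to the renderer and stored next to the image, so the properties that produced every frame stay on record.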

Getting Started with Synthetic Data

Now that the benefits of synthetic data are clear, how do you get started? There are several options that can get you on your way.

Find a Simulator

Existing Simulators: There are existing simulators that can create data for you. Some examples for autonomous vehicles are the Carla simulator at carla.org and the LGSVL simulator at lgsvlsimulator.com (which, fun fact, was created in Unity).

Make your own Simulator Using a Game Engine: You may not find an existing simulator that fits your needs, and that’s okay. You can create your own content using a game engine or a scriptable renderer. Here is an example of grocery content created with Unity.

image made by game engine

Run the Simulator

After you have your content, you want to run the simulation to generate data. Initially, you’d want to test locally and generate perfectly labeled data for a single task. Eventually, you might want to scale up. You can generate millions of training examples on the cloud with a service like Unity Simulation. You can learn more at unity.com/products/simulation.

Analyze the Dataset

After putting all this effort and time into creating the data, it’s important to analyze it before you start training to see whether it is useful.

Unity released a Python package under the Apache 2 license that can help you do so. It comes with a Notebook with very useful visualizations. It also has some integration with Unity Simulation. You can learn more at pypi.org/project/datasetinsights.

visualizations of dataset
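
As a rough idea of what such an analysis looks at, the sketch below computes two basic statistics with the Python standard library only. It does not use the datasetinsights API. The annotation record format, the class names, and the `min_share` threshold are all hypothetical.

```python
from collections import Counter

# Hypothetical annotation records: one entry per generated image.
annotations = [
    {"image_id": 0, "labels": ["cereal_box", "candy", "candy"]},
    {"image_id": 1, "labels": ["cereal_box"]},
    {"image_id": 2, "labels": ["candy", "soup_can"]},
    {"image_id": 3, "labels": []},
]

def summarize(annotations):
    """Count objects per class and how many images hold N objects."""
    class_counts = Counter(
        label for record in annotations for label in record["labels"]
    )
    objects_per_image = Counter(len(record["labels"]) for record in annotations)
    return class_counts, objects_per_image

def rare_classes(class_counts, min_share):
    """Return classes whose share of all objects is below min_share."""
    total = sum(class_counts.values())
    if total == 0:
        return []
    return sorted(
        name for name, count in class_counts.items() if count / total < min_share
    )

class_counts, objects_per_image = summarize(annotations)
underrepresented = rare_classes(class_counts, min_share=0.2)
```

A skewed class distribution or many empty images found at this stage can be fixed by adjusting the simulation before any training time is spent.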

Cesar walks through the process with an object detection example using groceries, which can be found at 25:18 in the video.

Sneak Peek of what Unity is Doing (and you can do it too!)

Cesar gives a preview of some of the experiments they’re running. They were inspired by the Google paper mentioned earlier and found that their model trained on ~400k synthetic images and <100 real images outperformed the best model trained using only real images.
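
This recap does not say how the synthetic and real images were combined during training. One possible approach, shown here purely as an assumption, is to oversample the small real set so that every batch contains a fixed share of real images. The function name, batch size, and `real_fraction` value are hypothetical, and the dataset sizes are scaled down for the example.

```python
import random

def mixed_batches(synthetic, real, batch_size, real_fraction, num_batches, rng):
    """Yield batches in which a fixed share of samples comes from the real set."""
    num_real = max(1, round(batch_size * real_fraction))
    num_synth = batch_size - num_real
    for _ in range(num_batches):
        # Real images are drawn with replacement because the set is tiny.
        batch = rng.choices(real, k=num_real) + rng.sample(synthetic, num_synth)
        rng.shuffle(batch)
        yield batch

# Stand-ins for image records; sizes are scaled down from the talk's numbers.
synthetic = [("synthetic", i) for i in range(4000)]
real = [("real", i) for i in range(80)]

rng = random.Random(7)
batches = list(
    mixed_batches(synthetic, real, batch_size=32, real_fraction=0.25,
                  num_batches=5, rng=rng)
)
```

Without some form of oversampling, fewer than 100 real images mixed uniformly into roughly 400k synthetic ones would make up about 0.025% of the samples at most.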

You can see it for yourself, too! Here is the GitHub repo for the project where you can generate your own synthetic data and train your ML models: https://github.com/Unity-Technologies/SynthDet

Here is the overview pulled from the repo:

“SynthDet is an open source project that demonstrates an end-to-end object detection pipeline using synthetic image data. The project includes all the code and assets for generating a synthetic dataset in Unity. Using recent research, SynthDet utilizes Unity Perception package to generate highly randomized images of 64 common grocery products (example: cereal boxes and candy) and export them along with appropriate labels and annotations (2D bounding boxes). The synthetic dataset generated can then be used to train a deep learning based object detection model. This project is geared towards ML practitioners and enthusiasts who are actively exploring synthetic data or just looking to get started.”


Synthetic data has a lot of benefits, and the technology to create synthetic data has become a lot more accessible. Read more about synthetic data at Unity here on their blog!

Synthetic data: Simulating myriad possibilities to train robust machine learning models

Use Unity’s perception tools to generate and analyze synthetic data at scale to train your ML models

Cool Stuff to Check Out

Visualizing image classification to object detection to semantic segmentation: 6m 01s

Examples of synthetic data in research: 11m 00s

Interesting questions from the video:

  • How do you prevent potential bias creeping into the synthetic data? 35m 40s
  • Is there a study about the impact in the model performance using different render engines? 36m 53s
  • Do you use generic 3D models, like a cuboid mesh, to texture wrap similarly shaped objects (e.g. cereal box)? If yes, do you also use generic 3D models for human faces? 40m 37s

Flickr faces dataset

Obama Face Depixelizer Tweet 1

Obama Face Depixelizer Tweet 2

Unity Dataset Insights

An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detection

Some cool links:

Video: Rsqrd AI - Cesar Romero - Training CV Models with Synthetic Data

All information and ideas presented in this post are that of the speaker and the talk.
