Synthetic Data for AI with Aiven and ShadowTraffic

Generating artificial datasets that closely resemble real data without compromising sensitive information

In the world of AI and machine learning, access to high-quality, large-scale data is crucial for success. However, real-world datasets are often hard to come by due to concerns around privacy, cost, or regulatory restrictions. Synthetic data provides a practical solution to these challenges by generating artificial datasets that closely resemble real data without compromising sensitive information.

Apache Kafka®, a distributed event streaming platform, is a popular choice for building real-time data pipelines and streaming applications. By using synthetic data, Kafka topics can simulate real-world events, making it possible to test, train, and develop AI models in a more controlled and scalable environment. This approach not only accelerates AI workflows but also helps minimize the risks associated with using actual data.

In this article, we'll walk through how to use ShadowTraffic to simulate traffic to an Apache Kafka topic in Aiven. We'll demonstrate how to stream sensor readings to a Kafka topic, offering a robust and scalable setup.

ShadowTraffic provides a range of generators, functions, and modifiers that help shape the data stream, allowing you to create realistic datasets that closely resemble real-world scenarios. In our example, we'll walk through the steps to do just that.

Step 1. Create Aiven for Apache Kafka service

For this tutorial, we'll be using Aiven for Apache Kafka, which you can set up in just a few minutes. If you're new to Aiven, go ahead and create an account — you'll also get free credits to start your trial.

Once you're signed in, create a new Aiven for Apache Kafka service. This will serve as the backbone for our synthetic data stream.
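If you prefer working from a terminal, the Aiven CLI can create the service too. Here's a minimal sketch, assuming a service name of kafka-demo and a cloud region and plan of your choice:

avn service create kafka-demo \
  --service-type kafka \
  --cloud google-europe-west1 \
  --plan business-4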

Step 2. Prepare SSL credentials

To enable secure communication between ShadowTraffic and Apache Kafka, we’ll need to configure Java SSL keystore and truststore files. Follow these instructions to generate the required files. By the end of this step, you’ll have the following:

  • A truststore file in JKS format: client.truststore.jks
  • A truststore password
  • A keystore file in PKCS12 format: client.keystore.p12
  • A keystore password
  • A key password

Add the generated files to a folder named ssl; we'll need them later when we run ShadowTraffic via Docker.
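If you'd like a quick reference for this step, the conversion typically looks like the following, assuming you've downloaded ca.pem, service.cert, and service.key from your service's connection information page (each command prompts you to set the passwords you'll need later):

# Bundle the service certificate and private key into a PKCS12 keystore
openssl pkcs12 -export -inkey service.key -in service.cert \
  -out ssl/client.keystore.p12 -name service_key

# Import the CA certificate into a JKS truststore
keytool -import -file ca.pem -alias CA \
  -keystore ssl/client.truststore.jks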

Step 3. Quick start with ShadowTraffic

ShadowTraffic has excellent documentation that explains how to get started with its APIs. Begin by creating a license.env file (you can use a free trial for this tutorial).

ShadowTraffic relies on Docker. If you're new to the Docker ecosystem, follow these steps to set it up. Next, pull the ShadowTraffic Docker image by running the following command:

docker pull shadowtraffic/shadowtraffic

Step 4. Create configuration file

Now, let's create a configuration file for ShadowTraffic, which will instruct it on how to generate and stream the data. The configuration file consists of two key sections: connection information and data generation settings.

Connection configuration

To securely transmit data between your Aiven for Apache Kafka service and ShadowTraffic, we’ll need to use the keystore and truststore files you created earlier. Below is an example configuration that incorporates the keystore and truststore details:

{ "connections": { "dev-kafka": { "kind": "kafka", "producerConfigs": { "bootstrap.servers": "YOUR-KAFKRA-URI", "ssl.truststore.location": "/ssl/client.truststore.jks", "ssl.truststore.type": "JKS", "ssl.truststore.password": "YOUR-TRUSTSTORE-PASSWORD", "ssl.keystore.location": "/ssl/client.keystore.p12", "ssl.key.password": "YOUR-KEY-PASSWORD", "ssl.keystore.password": "YOUR-KEYSTORE-PASSWORD", "value.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer", "key.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer", "security.protocol": "SSL" } } } }

Replace YOUR-KAFKA-URI with the URI of your Apache Kafka cluster, and set the correct values for YOUR-TRUSTSTORE-PASSWORD, YOUR-KEY-PASSWORD, and YOUR-KEYSTORE-PASSWORD.
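You can copy the service URI from your service's Overview page in the Aiven Console, or fetch it with the Aiven CLI (kafka-demo is a placeholder for your service name):

avn service get kafka-demo --format '{service_uri}'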

We'll link the SSL keystore and truststore files to the Docker container later when running the docker run command. For now, you can keep the file locations as shown, but be sure to set the correct passwords for both the keystore and truststore.

Generators

Next, we’ll set up the data generator. Each record generated will use a sensor ID as the key, which will uniquely identify each sensor.

To generate the sensor ID, we’ll use ShadowTraffic's uuid function, which creates random IDs for us:

"key": { "sensorId": { "_gen": "uuid" } },

For the value of each record, we’ll simulate a PM2.5 (particulate matter) reading. This can be modeled with a normal (Gaussian) distribution: with a mean of 60 and a standard deviation of 5, most readings fall between 55 and 65. The timestamp will be generated dynamically at the time of data creation:

"value": { "pm25": { "_gen": "normalDistribution", "mean": 60, "sd": 5 }, "timestamp": { "_gen": "now" } },

However, to make the synthetic data stream more realistic, we can adjust the behavior based on the time of day. For this, we'll use the intervals construct, which lets us generate different patterns at different times: when the system’s current time matches one of the defined cron schedules, the corresponding data pattern is used.

Here’s how we can set up a PM2.5 generator that simulates different pollution levels throughout the day:

"pm25": { "_gen": "intervals", "intervals": [ [ "0 6-9 * * *", { "_gen": "normalDistribution", "mean": 100, "sd": 15 } ], [ "0 10-15 * * *", { "_gen": "normalDistribution", "mean": 60, "sd": 10 } ], [ "0 16-19 * * *", { "_gen": "normalDistribution", "mean": 90, "sd": 12 } ], [ "0 20-23 * * *", { "_gen": "normalDistribution", "mean": 50, "sd": 8 } ], [ "0 0-5 * * *", { "_gen": "normalDistribution", "mean": 40, "sd": 5 } ] ], "defaultValue": { "_gen": "normalDistribution", "mean": 60, "sd": 5 } },

Additionally, we can limit how frequently events are generated by using the throttleMs parameter, which makes the generator wait the specified number of milliseconds between events:

"throttleMs": { "_gen": "intervals", "intervals": [ [ "*/5 * * * *", 50 ], [ "*/2 * * * *", 1000 ] ], "defaultValue": 4000 }

Altogether, the configuration file looks like this:

{ "generators": [ { "topic": "airQualityReadings", "key": { "sensorId": { "_gen": "uuid" } }, "value": { "pm25": { "_gen": "normalDistribution", "mean": 60, "sd": 5 }, "timestamp": { "_gen": "now" } }, "localConfigs": { "pm25": { "_gen": "intervals", "intervals": [ [ "0 6-9 * * *", { "_gen": "normalDistribution", "mean": 100, "sd": 15 } ], [ "0 10-15 * * *", { "_gen": "normalDistribution", "mean": 60, "sd": 10 } ], [ "0 16-19 * * *", { "_gen": "normalDistribution", "mean": 90, "sd": 12 } ], [ "0 20-23 * * *", { "_gen": "normalDistribution", "mean": 50, "sd": 8 } ], [ "0 0-5 * * *", { "_gen": "normalDistribution", "mean": 40, "sd": 5 } ] ], "defaultValue": { "_gen": "normalDistribution", "mean": 60, "sd": 5 } }, "throttleMs": { "_gen": "intervals", "intervals": [ [ "*/5 * * * *", 50 ], [ "*/2 * * * *", 1000 ] ], "defaultValue": 4000 } } } ], "connections": { "dev-kafka": { "kind": "kafka", "producerConfigs": { "bootstrap.servers": "YOUR-KAFKRA-URI", "ssl.truststore.location": "/ssl/client.truststore.jks", "ssl.truststore.type": "JKS", "ssl.truststore.password": "YOUR-TRUSTSTORE-PASSWORD", "ssl.keystore.location": "/ssl/client.keystore.p12", "ssl.key.password": "YOUR-KEY-PASSWORD", "ssl.keystore.password": "YOUR-KEYSTORE-PASSWORD", "value.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer", "key.serializer": "io.shadowtraffic.kafka.serdes.JsonSerializer", "security.protocol": "SSL" } } } }

Here is the structure of the files we've created so far:
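.
├── config.json
├── license.env
└── ssl
    ├── client.keystore.p12
    └── client.truststore.jks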

Step 5. Run ShadowTraffic generator

Now it's time to run ShadowTraffic. From the folder where you’ve saved your files, run the following command:

docker run --env-file license.env \
  -v $(pwd)/ssl:/ssl \
  -v $(pwd)/config.json:/home/config.json \
  shadowtraffic/shadowtraffic:latest \
  --config /home/config.json

This command mounts your SSL files and configuration into the Docker container and starts ShadowTraffic.

Now you can go back to your Apache Kafka topic to see the stream of data as it flows through — this is where your synthetic sensor readings come to life!
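If you'd rather check from the command line, here's a sketch using the console consumer that ships with the Apache Kafka distribution; it reuses the same SSL files and placeholder passwords as the ShadowTraffic configuration:

# Client SSL settings, mirroring the ShadowTraffic connection config
cat > client.properties <<'EOF'
security.protocol=SSL
ssl.truststore.location=ssl/client.truststore.jks
ssl.truststore.password=YOUR-TRUSTSTORE-PASSWORD
ssl.keystore.type=PKCS12
ssl.keystore.location=ssl/client.keystore.p12
ssl.keystore.password=YOUR-KEYSTORE-PASSWORD
ssl.key.password=YOUR-KEY-PASSWORD
EOF

# Read the synthetic readings from the beginning of the topic
kafka-console-consumer.sh --bootstrap-server YOUR-KAFKA-URI \
  --topic airQualityReadings \
  --consumer.config client.properties \
  --from-beginning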

Conclusions and next steps

In this tutorial, we've demonstrated how to use ShadowTraffic to generate synthetic data and stream it to an Aiven for Apache Kafka service. By simulating real-world data with configurable generators, you can test and develop AI models without relying on sensitive or costly datasets.

Next, you can further explore ShadowTraffic’s advanced features and their usage with Aiven for Apache Kafka to incorporate this synthetic data into machine learning workflows.

Learn more about what you can do with Aiven and AI: