Jul 12, 2022
Why streaming data is essential to empower the 'Modern Data Stack'
The case for streaming over batch data continues to get stronger. Read on to see how our experience confirms this.
As a product-led company, Aiven is heavily invested in building a pioneering analytics function, so we are always looking for the best ways to capture and harvest data.
I’m Anton Heikinheimo and I work as a Data Engineer at Aiven, building reliable, scalable and maintainable data pipelines. Aiven has been growing exponentially, more than doubling its headcount every year since it was founded in 2016. The data engineering function, however, has not needed to grow since it was established in 2020; in fact, we have been able to power Aiven's internal analytics with just two data engineers. Championing open source and always preferring managed solutions has been essential to that success: open source has allowed us to build reliable software and control our costs as we've grown, while managed solutions have let us focus on building business logic rather than managing infrastructure.
The batch versus streaming debate has been going on for ages, and the consensus has long been that streaming is the way forward. Lately, several trends in the data space have tilted the comparison even further in favor of streaming. In this blog post we'll look at what those trends are and why streaming is essential to enabling them.
Why are data teams moving towards streaming?
The data teams that have had the privilege of rethinking their data stack in recent years have typically opted for a stack built around streaming and stream processing. This comes as no surprise when you consider the benefits of streaming pipelines compared to batch. Batch ingestion jobs come with a lower upfront cost because they are easier to set up, but anyone who has stayed around long enough knows the complications batch ingestion introduces, for example late-arriving dimensions (data being out of sync) and sacrifices forced on downstream applications by data delay. Setting up a streaming ingestion pipeline requires more upfront investment, but the main benefits fall into four groups:
- Data timeliness: applications built on top of the data warehouse don't need to wait hours for the data to arrive.
- Data quality and completeness: with the help of CDC (Change Data Capture) and stream processing, data teams can control the completeness and quality of the data before it gets ingested.
- Cost savings: by aggregating and joining data before ingestion, teams are typically able to save on storage and processing costs.
- Less operational overhead: data stacks built around stream processing don't need to worry about data being out of sync.
Data timeliness
Aside from traditional dashboarding and reporting, data teams now also work on operational use cases that directly affect the flow of business. A newer trend in the data space is reverse ETL, where data is sent from operational systems to the data warehouse, enriched there, and then sent back to the operational systems. An example of reverse ETL is marketing, where customer-facing messaging is personalized based on calculated values such as customer lifetime value (CLTV), customer segmentation and customer health score.
These new operational use cases in the data warehouse have greatly increased the importance of data timeliness. Internally at Aiven, we have made it one of our KPIs to reduce the time from when data is generated in the backend to when it arrives in its respective operational system. With Apache Kafka® you can transmit data as soon as it is generated, and with the help of a stream processing framework such as Apache Flink® you can enrich the data on the fly.
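As a rough illustration, here is what such an on-the-fly transformation can look like in Flink SQL. The table names, columns and connector settings below are hypothetical placeholders, not our actual pipeline:

```sql
-- Hypothetical sketch: read backend events from a Kafka topic the moment
-- they are produced, derive the fields the operational system needs,
-- and forward them to another topic with no batch schedule in between.
CREATE TABLE backend_events (
  customer_id STRING,
  event_type  STRING,
  plan        STRING,
  event_time  TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'backend_events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

CREATE TABLE enriched_events (
  customer_id STRING,
  event_type  STRING,
  is_paid     BOOLEAN,
  event_time  TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'enriched_events',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

-- A continuous query: every new event is transformed and emitted as it arrives.
INSERT INTO enriched_events
SELECT customer_id, event_type, plan <> 'free' AS is_paid, event_time
FROM backend_events;
```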
Data quality and completeness
The modern data stack is built around ELT (Extract Load Transform); the main difference from traditional ETL (Extract Transform Load) is that raw data is not transformed before being loaded into the data warehouse. One of the unintended consequences of ELT is that poor-quality data enters the staging area, often resulting in technical debt as data teams try to patch it up with complex transformations. ELT also makes it harder to get accurate historical data, since the data models are mutable, meaning the data gets erased and rewritten on each run. With mutable models and poor data quality, getting an accurate historical view becomes exponentially more difficult with every new feature introduced into the staging data.
Stream processing can be used to combat these issues by combining the best of ETL and ELT into something called ETLT (Extract Transform Load Transform). In ETLT, some preliminary transformations are performed during data ingestion, and these transformations can be handled by a stream processing tool like Apache Flink. The rationale for transforming during ingestion is that some data checks can be performed on the fly to guarantee data quality in the staging area; examples include checking for nulls, performing joins and enforcing a schema. At Aiven, for example, we join two event sources, API requests and API responses (a Flink SQL sketch of the join follows the list below). This serves two purposes:
- Validating that each response has a request
- Pre-joining the data on the fly and avoiding an expensive (and slow) join in the data warehouse
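A minimal Flink SQL sketch of that kind of join might look like the following; the table names, keys and time bounds are illustrative, and both tables are assumed to already be registered as Kafka-backed Flink tables with event-time attributes:

```sql
-- Hypothetical sketch: validate and pre-join requests with their responses
-- in flight, so the warehouse receives one complete record per request.
-- api_requests(request_id, endpoint, request_time) and
-- api_responses(request_id, status_code, response_time) are assumed to be
-- Kafka-backed tables with watermarks on request_time / response_time.
SELECT
  r.request_id,
  r.endpoint,
  r.request_time,
  s.status_code,
  s.response_time
FROM api_requests AS r
JOIN api_responses AS s
  ON  s.request_id = r.request_id
  AND s.response_time BETWEEN r.request_time
                          AND r.request_time + INTERVAL '5' MINUTE;
-- The inner interval join doubles as a validation step: requests with no
-- matching response within the window could instead be routed to a
-- dead-letter topic using a LEFT JOIN and an IS NULL filter.
```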
Another problem lies in data completeness. Batch processing polls the source tables on a predefined interval, for example hourly or daily. If changes in the underlying data happen more frequently than the polling interval, and the tables do not explicitly track state, you risk missing valuable information. You also risk breaking the integrity of the data: an immutable customer_action_log may contain an "email updated" action, yet the batch process never captured the new email value in the customer table because the customer changed it back before the next poll. This is why log-based database replication, for example using the Debezium CDC (Change Data Capture) connector, is far superior: the replication is driven by the database's native change log, so no intermediate state is missed.
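For instance, a Debezium changelog written to Kafka can be consumed directly as a changing table in Flink SQL; the topic and field names below are hypothetical:

```sql
-- Hypothetical sketch: expose a Debezium change stream as a Flink table.
-- Every insert, update and delete on the source customer table arrives as
-- a changelog row, so intermediate states are never skipped.
CREATE TABLE customers (
  customer_id BIGINT,
  email       STRING,
  updated_at  TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'pg_server.public.customer',          -- Debezium-style topic name
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'                      -- interpret the change events
);
```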
Lastly, stream processing opens up the possibility of GDPR-compliant data pipelines where PII is anonymized in transit, so you don't risk PII lying around in storage containers, logs or unused tables.
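As a sketch, a Flink SQL job can hash or drop PII before anything lands in the warehouse. The table and column names here are illustrative and both tables are assumed to already be registered as Kafka-backed tables; SHA256 is one of Flink's built-in functions:

```sql
-- Hypothetical sketch: anonymize PII while the data is still in motion,
-- so only pseudonymized values ever reach the staging area.
-- signup_events is the assumed source; anonymized_signups is the assumed
-- sink that the warehouse loader reads from.
INSERT INTO anonymized_signups
SELECT
  SHA256(email)  AS email_hash,   -- stable pseudonym, raw address never stored
  country,                        -- coarse attributes can pass through as-is
  signup_time
FROM signup_events;
```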
Cost savings
Significant cost savings can be achieved with stream processing by aggregating and enriching data in transit.
Pre-aggregating data is commonly done when ingesting metrics or other IoT data. Metrics often arrive at a granularity of seconds, and such volumes can skyrocket the costs of storage and compute (imagine a scenario with 200 thousand nodes all transmitting data every few seconds). At Aiven we use Apache Flink sliding windows to define the grain of the data at ingestion time and capture exactly the granularity that best serves the business. When grouping the data into buckets with Flink windows you need to define the aggregate functions; for us these are typically the min, max, average and a distribution metric.
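A sketch of such a sliding-window aggregation in Flink SQL follows; the metrics table, column names and window sizes are illustrative, not our production configuration:

```sql
-- Hypothetical sketch: reduce per-second node metrics to 5-minute sliding
-- windows (emitted every minute) before they are loaded into the warehouse.
CREATE TABLE node_metrics (
  node_id     STRING,
  cpu_percent DOUBLE,
  metric_time TIMESTAMP(3),
  WATERMARK FOR metric_time AS metric_time - INTERVAL '10' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'node_metrics',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json'
);

SELECT
  node_id,
  window_start,
  window_end,
  MIN(cpu_percent)        AS cpu_min,
  MAX(cpu_percent)        AS cpu_max,
  AVG(cpu_percent)        AS cpu_avg,
  STDDEV_POP(cpu_percent) AS cpu_stddev   -- a simple distribution metric
FROM TABLE(
  HOP(TABLE node_metrics, DESCRIPTOR(metric_time),
      INTERVAL '1' MINUTE,    -- slide
      INTERVAL '5' MINUTE))   -- window size
GROUP BY node_id, window_start, window_end;
```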
Another trend in the data space is the Activity Schema, where events and data from different sources are normalized and parsed into a single table. Flink is an excellent tool for this purpose, as it has good support for working with nested JSON structures in SQL.
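A rough sketch of that normalization in Flink SQL, assuming a raw event table with a nested payload column; all names here are hypothetical:

```sql
-- Hypothetical sketch: flatten differently shaped source events into one
-- activity stream with a shared schema (Activity Schema style).
-- raw_events is assumed to be a Kafka-backed table whose payload column is
-- a nested ROW containing the fields below; nested fields use dot notation.
INSERT INTO activity_stream
SELECT
  payload.customer.id  AS customer,
  'signed_up'          AS activity,
  event_time           AS ts,
  payload.plan         AS feature_1
FROM raw_events
WHERE event_type = 'signup'
UNION ALL
SELECT
  payload.customer.id,
  'opened_ticket',
  event_time,
  payload.ticket.severity
FROM raw_events
WHERE event_type = 'support_ticket';
```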
Less operational overhead
One of the struggles with batch ingestion is the operational overhead introduced by extracting many different sources on different schedules. Data is frequently out of sync, and often the only fix for the resulting failing data tests is to wait an hour or two and rerun the job; a single failed extraction can break the integrity of the whole dataset. Downtime is an unfortunate reality when working with databases, and with batch jobs dealing with it can be a headache. With streaming, and particularly Apache Flink, recovery is straightforward: transformations are based on event time rather than processing time, so the data stream essentially recovers without intervention.
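The key building block is the event-time attribute declared on the source table, for example (names and intervals are illustrative):

```sql
-- Hypothetical sketch: declaring an event-time attribute with a watermark.
-- Because downstream windows and joins are computed on event_time rather
-- than on the wall clock, a paused or backfilled stream simply catches up
-- and produces the same results it would have produced without the outage.
CREATE TABLE orders (
  order_id   STRING,
  amount     DOUBLE,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'kafka:9092',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);
```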
With streaming data ingestion these scenarios are unlikely, as data is transmitted as soon as it is generated. Lastly, in streaming we have the luxury of using open source tooling that has been tried and tested, namely Apache Kafka and Apache Flink. These tools are so mature that once your pipeline is up and running, it just works (magically). Because the tools are open source, there is also no risk of being locked in to a specific vendor or piece of software. Pricing is predictable and does not increase exponentially with usage; quite the opposite, economies of scale frequently come into play when working with Kafka and Flink.
Conclusion
We can conclude that recent trends in the data space have only strengthened the case for building the data warehouse around streaming. If the data team cannot provide a reliable data interface that keeps up with the velocity of the business, questions will either go unanswered or the business will be making decisions blindfolded. At Aiven we have managed to empower our analysts and business stakeholders by using open source tools like Apache Kafka and Apache Flink on our own platform. One unexpected benefit of working with Aiven's platform has been the close collaboration we have developed with our own product function: we help them identify the friction data engineers encounter day to day, which in turn helps them build products that solve real-world problems.