As organizations move towards building architectures and applications that utilize real-time streaming data, the challenge of data governance comes to the fore. Data governance is difficult enough when data is static, sitting in a database, but it becomes much harder when you need to trust data that is streamed in real time from a variety of sources.
For example, if an auditor requires that you show what personally identifiable information your organization holds on customers, finding that data quickly across hundreds of streaming events is no small task. Organizations need tools that help them understand their data pipelines and data dependencies when adopting ‘data in motion' platforms (typically, Apache Kafka), and they need tools to encourage teams to work responsibly with streaming data.
Data going into the platform needs to be trusted, and teams need ways of understanding and collaborating on data issues when things go wrong. Not all of this can be solved by tooling - data governance as an ethos needs to be encouraged too - but tools can go some way to providing a structure that supports ‘good behaviour'.
With this in mind, Confluent recently announced its Stream Governance suite to do just that. Given Confluent is a leading commercial provider of Apache Kafka and is spearheading this concept of ‘data in motion', it's important that the company also supports customers in their endeavours to better trust their data.
Confluent believes that organizations that learn to operate around a ‘central nervous system' of real-time streaming data will be the ones that succeed in the "next stage of digital disruption". However, this won't be possible unless organizations can figure out ways to get trusted streaming data into the hands of the many, not just the few, and safely share data across teams. This requires governance.
Stream Governance essentially provides tooling to allow both self-service data discoverability and controls for long-term data compatibility and compliance. It is formed of a number of components:
Stream Catalog - this allows individuals across teams to collaborate within a centralized, organized library designed for sharing, finding, and understanding the high-quality data needed to drive their projects forward. This includes topics, schemas, fields in schemas, clusters, and connectors. Confluent describes it as like a digital library for data in motion, allowing any user - experienced with streaming data or not - to put it to use.
Stream Lineage - teams need an easy way to comprehend how data is moving between different systems and applications within a business. Stream lineage aims to provide an end-to-end map of event streams with both a bird's-eye view and drill-down functionality. The hope is that with a better understanding of where data originated, where it's going, and how it's transformed, developers can move projects forward with assurance their work won't cause negative or unexpected downstream impact.
Schema Management UI - this aims to make it easier to understand which schemas exist, how they are defined, and where they are used. Confluent believes that standardising around well-defined and agreed-upon schema structures will allow teams to develop ‘resilient data in motion pipelines' and protect against system failures and corruption.
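To make the schema idea concrete, here is a minimal sketch of what one of those ‘well-defined and agreed-upon schema structures' might look like for an orders event stream. The field names and the `orders-value` subject are hypothetical, invented for illustration; the registration step shown in the comment uses Confluent's `confluent-kafka` Python client and a Schema Registry instance, neither of which this snippet requires to run.

```python
import json

# A hypothetical Avro schema for an "orders" event stream - the kind of
# well-defined structure producers and consumers standardise on. The
# field names here are illustrative, not taken from any real deployment.
ORDER_SCHEMA_STR = json.dumps({
    "type": "record",
    "name": "Order",
    "namespace": "com.example.retail",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "created_at", "type": "long", "doc": "epoch millis"},
    ],
})

# In a real pipeline this schema would be registered with Confluent's
# Schema Registry so the platform can validate data on the way in, e.g.
# (requires the confluent-kafka package and a running registry, so shown
# here only as a comment):
#
#   from confluent_kafka.schema_registry import SchemaRegistryClient, Schema
#   client = SchemaRegistryClient({"url": "http://localhost:8081"})
#   client.register_schema("orders-value", Schema(ORDER_SCHEMA_STR, "AVRO"))

schema = json.loads(ORDER_SCHEMA_STR)
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # the contract every producer and consumer agrees on
```

Once a schema like this is the agreed contract for a topic, the Schema Management UI is where teams would see that it exists, how it is defined, and which streams use it.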
Making sure bad data doesn't get in
We got the chance to speak with Ben Stopford, Lead Technologist, Office of the CTO at Confluent, where he discussed the importance of schemas in governance and in building trust for data in motion. He said:
You're establishing a schema for every concept in the business. So, if you're a retailer, for example, that could include orders and payments. You will have a set of schemas to describe the core entities in your business.
Anybody who needs data in your organization can then find that data, by searching for it. You go into Confluent Cloud, you type in the search box: ‘I want information about customers'.
That will come up with a whole bunch of different schemas and you can find out which of the event streams hold these different schemas, you're able to look at the data, and then there's metadata around it as well telling you where it came from. Then the UI in Cloud maps out a graph of all these flows that go through the organization.
Stream Governance provides teams with a visual chart, built on these validated schemas, that they can use to trace data pipelines backwards. If there's an issue, users can visually identify where the problem lies and alert others.
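The backwards tracing described above amounts to walking a lineage graph from a broken downstream artefact to its candidate sources. The sketch below shows the idea on a hypothetical topology - all node names are invented, and Stream Lineage builds the equivalent map automatically from live traffic rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: each key is a topic or app, and its value lists
# the upstream nodes it consumes from. Invented for illustration only.
UPSTREAM = {
    "revenue-report": ["orders-enriched"],
    "orders-enriched": ["orders", "customers"],
    "orders": ["checkout-service"],
    "customers": ["crm-connector"],
}

def trace_upstream(node: str) -> set:
    """Walk the lineage backwards from a broken node to every source
    that could have fed it bad data."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for parent in UPSTREAM.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# If the revenue report looks wrong, list every upstream candidate:
print(sorted(trace_upstream("revenue-report")))
# → ['checkout-service', 'crm-connector', 'customers', 'orders', 'orders-enriched']
```

A breadth-first walk like this is what the ‘drill-down' view effectively automates: start at the incorrect report, fan out to everything upstream, and narrow down where the bad data entered.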
Essentially, Stopford explained, Stream Governance goes some way to helping organizations understand their complex data ecosystems. He said:
In a complicated organization you have two fundamental problems. Data breaks all the time, so how do I have a stable system if the data is going to break? And secondly, if I want to do something new, how do I understand this complicated data ecosystem?
We have a variety of governance features, which work really well, because they both guard data that is going into this central nervous system, which is connecting all the different bits of the business together. But they also work on the other side, to allow you to work out where the data came from.
Nothing is ever perfect, so if you do get a bit of data that is bad, and your system breaks, or you get a report that's incorrect, you need to figure out where that data came from.
However, Stopford isn't naive about the fact that no matter how good Confluent's tooling is, or how sophisticated it gets over time, organizations need to put in the work too, to encourage teams to be good data citizens. That's as true of data at rest as it is of data in motion. He said:
The thing about data governance is that it's totally imperfect. It's sort of annoying in that regard. A lot of things in computer science are very exact. You can guarantee things. Governance isn't quite like that in a big organization, it tends to be messier. Which is why there has to be this socio-digital element to the strategy, which encourages and tries to make sure that people are complying, that they've got a vested interest and they're going to be good data citizens.
It's worth taking a look at this YouTube clip that Confluent has uploaded, which shows how Stream Governance works in practice. It provides a clear idea of not only how individuals can better understand their streaming data, but also how teams can use the tooling to work across the organization. We look forward to speaking to customers that are using Stream Governance to understand how it has changed how they work with data. But as Confluent gains traction with buyers, supporting how they can better work with event-based streaming data is essential. Data is hard, and nine times out of ten when we speak to end users, governance is one of the things they struggle with the most. If Confluent can continue to make it easier for buyers to work with their real-time streaming data, then it should pave the way for the proliferation of more use cases.