The field of data architecture is full of jargon, from data meshes to hubs and warehouses.
Two of the biggest buzzwords right now, though, are data lakes and data fabric – two different approaches to handling the often vast amounts of data a modern business ends up collecting.
The concept of a ‘data lake’ is a metaphor: if a lake stores water in whatever ‘natural’ form it arrives in, from rain, rivers or streams, a data lake stores data in whatever form it arrives in, from whichever part of your organisation is creating it, whether that data is structured or unstructured.
This was traditionally positioned in opposition to a data warehouse, where you define what kinds of data you’re going to store before you collate it, standardising and structuring your data as it comes in.
The flaw in that approach is obvious. If I run a business selling hats and have set up my data warehouse to record information about hats, but then I decide to branch out into selling shoes, I would have to change the structure of the warehouse to hold different kinds of product information. But a data lake doesn’t care whether the data is about shoes, hats or dinosaurs, or even what format it’s in. You can simply pour anything in there and figure out the rest later.
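The hats-and-shoes example can be sketched in a few lines of Python. The field names here are entirely hypothetical, but they illustrate the underlying distinction: a warehouse imposes structure when data is written, while a lake defers structure until data is read.

```python
import json

# Schema-on-write (warehouse-style): the table's columns are fixed up front.
# Branching out into shoes would mean altering this structure first.
hat_row = {"product_id": 1, "brim_width_cm": 7.5, "hat_size": "M"}

# Schema-on-read (lake-style): store whatever arrives, interpret it later.
# Hat and shoe records can sit side by side with different fields.
raw_events = [
    json.dumps({"product_id": 1, "brim_width_cm": 7.5, "hat_size": "M"}),
    json.dumps({"product_id": 2, "shoe_size_eu": 42, "lace_colour": "red"}),
]

# Structure is only imposed at the point of reading:
parsed = [json.loads(event) for event in raw_events]
shoe_records = [r for r in parsed if "shoe_size_eu" in r]
print(len(shoe_records))  # 1
```

The trade-off described above is visible even at this toy scale: the lake-style list accepts anything, but someone still has to know which fields to look for before the data becomes useful.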
What are the pros and cons of different data architectures?
Of course, every approach has its advantages and disadvantages. Data stored with a clear, pre-defined structure is easier to use, whereas making sense of everything that has been poured into a lake requires more specialist knowledge and will likely need to go through a data scientist before other people in your business can get actionable insights out of it. The unfiltered nature of the data can also present reliability and/or security issues.
But data lakes are more flexible, have lower storage costs and can support a broader range of uses. For instance, they’re often used in combination with machine learning, as raw unstructured data is often more suitable there than something that’s already been carefully tagged, filtered and labelled.
Unfortunately, there are broader problems with handling data that no storage solution can solve on its own. This is where data fabric comes in. While a business could in theory use a single storage solution, in practice this is rarely ideal because organisations tend to have such a wide range of use cases and demands on their data: a team focused on machine learning may have very different priorities to a team focused on compliance.
A data fabric establishes relationships and interoperability between all the data an organisation holds; the metaphor being that you can ‘knit’ all these different things together to create a single framework that accounts for all your data, without having to store all that data together in one place.
How do you match a data architecture to your business needs?
The actual architecture behind it will vary depending on business needs. Connecting different sources may be as simple as hooking them up via APIs, or as complex as matching data via artificial intelligence. The point is to do this within a clearly defined framework in order to ensure that everyone in the organisation can get access to the data they need, when they need it, without tying up technical and data resources, and without introducing data security and governance issues.
The difference from a data warehouse approach is that you don’t necessarily need to rigorously define how every individual piece of the framework is storing the data. Instead, you can simply bring in new components to the fabric as and when you need to.
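As a very rough sketch of that idea, the snippet below models a fabric as a registry of source connectors queried through one interface. Every name here is invented for illustration; a real fabric would sit behind APIs, catalogues and governance tooling rather than plain functions.

```python
# A toy "fabric" layer: each source keeps its own storage and format,
# and the fabric defines only how sources are queried, not how they
# store data. All system names and fields here are hypothetical.

def crm_lookup(entity_id):
    # Stand-in for an API call to a CRM system.
    return {"email": "jo@example.com"}

def hr_lookup(entity_id):
    # Stand-in for a query against an HR database.
    return {"department": "Sales"}

# New components can be registered as and when they're needed,
# without redefining how existing sources hold their data.
SOURCES = {"crm": crm_lookup, "hr": hr_lookup}

def fabric_view(entity_id):
    """Merge whatever each registered source knows about an entity."""
    record = {}
    for name, lookup in SOURCES.items():
        record.update(lookup(entity_id))
    return record

print(fabric_view("emp-001"))
```

The design point mirrors the text: adding a new source is a one-line registration, not a restructuring of everything already connected.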
A common use case for a data fabric is tracking identity throughout your data – whether that’s the human identity of a customer or employee, or the non-human identity of a machine or other entity.
John Pritchard, of identity data platform Radiant Logic, describes the issues and how a data fabric can tackle them. “It’s very common for organisations to have lots of systems that define their employees or customers, and a data fabric is often used to try to bring those together in a coherent and cohesive state,” he explains.
A business might hold lots of different information on an employee, for example: contact details, the types of training they’ve done, certifications to use a particular machine, any compliance procedures they’ve been through.
As Pritchard puts it: “Those types of data attributes tend to live in lots of different specialised systems. And the idea of a fabric is to try to bring them together. It’s necessary when you’ve got a lot of different things going on. That’s probably the big driver for most organisations. And in our world, it’s very common for that data to not exactly match each other.”
What’s at stake if you choose the wrong data architecture?
This kind of inconsistent and incomplete data can have big implications. “In our space, a lot of times that completeness relates to risk,” Pritchard says. “When identity data has either quality issues or data drift, then the systems that use that data to make, say, access decisions can be put at risk, because the data has become old. A fabric approach, connecting lots of different data sources, can sort of watch how the data is moving over time and assess it for its completeness.”
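The kind of drift-watching Pritchard describes could, in its simplest form, look something like the sketch below: compare when each connected source last updated its copy of an identity record and flag the stale ones. The record fields and the one-year threshold are illustrative assumptions, not any product's actual behaviour.

```python
from datetime import datetime, timedelta

# Hypothetical identity records for one employee, pulled from two
# connected sources; fields and threshold are assumptions for this sketch.
records = [
    {"user": "alice", "source": "hr", "updated": datetime(2024, 1, 10)},
    {"user": "alice", "source": "crm", "updated": datetime(2021, 6, 1)},
]

STALE_AFTER = timedelta(days=365)

def stale_sources(records, now):
    """Flag sources whose copy of the identity data has aged past the threshold."""
    return [r["source"] for r in records if now - r["updated"] > STALE_AFTER]

print(stale_sources(records, datetime(2024, 6, 1)))  # ['crm']
```

A system making access decisions from the CRM copy here would be working from data three years out of date, which is exactly the risk the quote points to.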
But a data fabric approach is about more than just making sure all the systems holding your data can talk to each other. It’s also about using all that data together to ensure you’re building a complete and up-to-date picture based on everything you know about a given entity, whether that’s a customer, a product or a business partner.
Daniel Wood, of development platform Unqork, highlights examples where this can be key. “When real-time analytics are required by finance and healthcare use cases, such as understanding patient data, performing fraud detection, or general monitoring and alerting, data fabrics can be incredibly useful due to the complex data integrations.”
How to piece together your data architecture
So, which should your organisation be using – a fabric or a lake? Well, that might not be quite the right way to look at it, because a data lake could well be one of the components stitched into the architecture of a data fabric.
What any tech leader needs to think about, first and foremost, is the type of data being collected and whether it needs to be shared across the business. A team doing a lot of work with IoT devices, machine learning or any other big data use case is likely to need a data lake of some kind.
The real question is whether, and when, that data needs to be used outside that particular silo, and how that process should be managed. That's the point at which a data fabric becomes useful.