Data is hugely important to our clients, yet the technology and tooling available to data practitioners change rapidly. We ask ourselves: What architectures should we use? What ways of working should we adopt? What new tools have we found useful? What looks great but actually gets in the way of making data useful?
These are our recommendations and insights into what we think you should use, explore or avoid when it comes to all things data.
Organisations struggle to use their data to drive decision making. Some common scenarios we see include:
Lack of trust in the data. Either information is out of date due to persistent failures in the flow of data or the data does not reliably represent what is actually happening in the organisation.
Inability to change the data. Existing data pipelines or reports cannot be changed for fear of breaking the existing information, so insights fail to keep pace with the changing organisation.
Inability to access the data. Data is locked away or held hostage in current operational systems and cannot be accessed to provide insight and to aid decision making.
Adopt and use modern engineering practices when building data pipelines to manage the flow of data, so that pipelines are reliable and can be changed safely and frequently.
Data pipelines are no different from any other kind of software and benefit from the same practices that have been proven to accelerate software delivery in other areas: infrastructure as code; configuration management of the pipeline; continuous integration and deployment; working in small batches; test-driven development; and monitoring and observation of the pipeline at various stages.
These practices have been shown to increase the ability to deliver software and to reduce rework. But in the data world, pipelines are often not developed and supported using these best practices, leading to failures in data pipelines, long lead times to make data accessible to end users and ultimately loss of trust in the data. Adopting modern engineering practices in a data space, such as DataOps, means that you can make data flow to people who need it more reliably, and create new flows more quickly, improving your ability to understand your business operations and customers.
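To make the test-driven development point concrete, here is a minimal sketch in Python. The record shape and field names are hypothetical; the idea is that writing each pipeline step as a small pure function makes it easy to test, and therefore safe to change frequently.

```python
# A hypothetical pipeline step: normalise raw order records before loading.
# Writing it as a pure function makes it straightforward to unit-test.

def normalise_order(raw: dict) -> dict:
    """Clean a single raw order record into the pipeline's canonical shape."""
    return {
        "order_id": str(raw["id"]).strip(),
        "amount_pence": round(float(raw["amount"]) * 100),
        "currency": raw.get("currency", "GBP").upper(),
    }

# A unit test in the TDD style: the expected output is written first,
# and the transform is implemented to satisfy it.
def test_normalise_order():
    raw = {"id": " 42 ", "amount": "19.99", "currency": "gbp"}
    assert normalise_order(raw) == {
        "order_id": "42",
        "amount_pence": 1999,
        "currency": "GBP",
    }

test_normalise_order()
```

Tests like this run in continuous integration on every change, which is what allows a pipeline to be modified safely and often.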
Organisations have invested in machine learning but are not seeing tangible benefits. Some scenarios we see are:
You can't adapt ML models fast enough or your ML models don't perform in production as they do in 'lab' environments.
There are many Machine Learning Proofs of Concept but they take too much time and effort to keep going or to operate at scale.
The organisation has a regulatory need to explain why a decision has been made - how do you make sure you can see with confidence why this loan/medical diagnosis/risk evaluation was made?
Bring modern engineering practices to bear on your organisation's machine learning capabilities - to reliably produce and operate ML-based services.
Data scientists are great at developing algorithms and generating insight but do not always have the software engineering skills to deploy a model to production quickly and reliably. Applying established DevOps techniques, such as infrastructure as code for management and deployment, monitoring and alerting of model performance, and a versioned model repository, enables rapid delivery, observability and experimentation with ML within an organisation.
Starting out with a steel thread to prove the usefulness of ML allows for quick assessment of delivered business value without heavy upfront costs.
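As an illustration of the monitoring-and-alerting point above, here is a small hedged sketch (the class and thresholds are hypothetical, not a specific tool's API): track a model's recent accuracy in production and flag when it drifts below an agreed level, so that 'lab' performance can be compared with real behaviour.

```python
from collections import deque
from statistics import mean

class ModelMonitor:
    """Sketch of production model monitoring: track recent prediction
    accuracy and flag when it drifts below a threshold."""

    def __init__(self, threshold: float, window: int = 100):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, predicted, actual) -> None:
        self.outcomes.append(1 if predicted == actual else 0)

    def healthy(self) -> bool:
        # Until the window fills, assume the model is healthy.
        if len(self.outcomes) < self.outcomes.maxlen:
            return True
        return mean(self.outcomes) >= self.threshold

# Example: a fraud model expected to stay above 75% accuracy.
monitor = ModelMonitor(threshold=0.75, window=4)
```

A real deployment would feed `record()` from labelled outcomes as they arrive and wire `healthy()` into an alerting system, alongside versioning the model artefact itself.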
Organisations have invested in data platforms/warehouses etc. but they are not being used by analysts, data scientists, or other business users.
Typical scenarios are:
We cannot drive quality into the data for our data scientists, analysts and operations managers. I cannot get the data in a format that works for them.
I have built a one-size-fits-all data platform but the data users don't like it because it's not the data they want; they don't believe the quality of the data and they don't like the way it's presented.
Adopt a Domain-Driven approach to the creation of data pipelines in which one team owns the quality and the delivery of the data for that domain.
Domain-Driven Design has been a tried and tested technique in software development for many years. Its approach generates components that have an isolated bounded context, making it easier to provide product ownership, organisational understanding and focused delivery with fewer cross-team dependencies and increased user engagement.
This approach has lagged in coming across to data processing, but it's one we have seen success with for the same reasons - what works for a microservice works for a data pipeline. Being experts in the domain of that data, the caretakers of the pipeline can ensure data quality, ease of use and clarity of domain representation. The power of this approach multiplies as more domain driven data is created in the organisation, as it is more easily shared and aggregated, sparking the creation of new business insights.
Organisations do not know whether their models are adding value or which models they should use.
How can they move beyond gut instinct or HiPPO (highest paid person's opinion) decision making? Typical scenarios are:
My teams are building lots of models. How do I know if they are making a difference to my business goals?
My data science team has built a number of models for the same purpose. Which model should I use?
Use multivariate testing to run multiple models at the same time and evaluate how they affect your business metrics.
A/B testing combines the common-sense approach of trying out different variations at the same time with statistical analysis to run experiments and measure the effect on business outcomes. There are powerful tools to support this, which help set up experiments and collect the measurements. In many cases it is possible to run multiple models at the same time, and to move business operations to the 'winning' model during the experiment - gaining business benefit whilst the experiment is still in flight.
Often the end result (increased revenue, reduction in fraud etc.) is too difficult or takes too long to evaluate to be of use experimentally. Instead, identify an intermediate metric which is correlated with the required end result.
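A minimal sketch of the statistical side, assuming the classic two-variant case with a conversion-style intermediate metric (the function name and counts are illustrative): a two-proportion z-test tells you whether the difference between the models' metrics is likely to be real or just noise.

```python
from math import erf, sqrt

def ab_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for 'variant B converts differently to variant A'.
    conv_* are conversions on the proxy metric, n_* are users exposed."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Model A: 120 conversions from 1000 users; model B: 150 from 1000.
p_value = ab_test(120, 1000, 150, 1000)
```

In practice an experimentation platform does this bookkeeping for you, but the underlying decision - 'is the winning model's lift statistically meaningful on our chosen metric?' - is exactly this calculation.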
Data technology architectures have become very complex.
What used to be a simple database running SQL queries is now a complex technology stack to process data at scale. I need to manage and hire for these but managing and getting insights from the data is becoming complex and expensive.
Organisations have very complex data analytics architectures, which put the support of data pipelines out of the hands of the data scientists and analysts who want to use them and who understand the data best. How can we reduce some of the complexity of these architectures?
Utilise hyper-scale cloud data warehouses and the power of SQL to implement key parts of the pipeline in a language that data analysts and scientists can use.
Modern cloud data warehouses like Snowflake or Google BigQuery allow compute and storage to scale independently, enabling data analysis to scale to petabytes of data. We no longer need to worry about initial sizing or cluster management, because this is managed automagically by the warehouse.
Using these technologies, data processing and data analytics can be undertaken wholly in SQL. SQL is a language well known to data analysts and data scientists - making it much easier for them to contribute to the development, operation and improvement of the data pipelines, as well as reducing the dependency on data engineering functions and thereby allowing organisations to move faster with their data.
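To sketch what 'wholly in SQL' means in practice, here is a tiny example using Python's built-in sqlite3 as a stand-in for a cloud warehouse such as BigQuery or Snowflake (the table and column names are made up). The transformation step itself is plain SQL that an analyst can read, review and change without touching engineering code.

```python
import sqlite3

# sqlite3 stands in here for a cloud warehouse; the point is that the
# pipeline's transformation is expressed entirely in SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('alice', 10.0), ('alice', 5.0), ('bob', 7.5);
""")

# The pipeline step: derive a per-customer summary entirely in SQL.
conn.execute("""
    CREATE TABLE customer_totals AS
    SELECT customer, SUM(amount) AS total, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer
""")

rows = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 15.0), ('bob', 7.5)]
```

In a real warehouse, tools in this style (dbt is a common choice) let analysts own such SQL models directly, with the warehouse handling the scaling.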
Organisations have created architectures to bring data together in a repository but are now unable to meet the needs of data users to access the data or to make data available.
Some common scenarios are:
It is difficult for data users to get access to new data sets.
It is difficult for data owners to easily provide data to other data users.
There are quality or usability issues in data from data platforms.
Explore implementing data mesh concepts and patterns in your organisation.
The data mesh architecture focuses on creating a domain driven data platform in place of centralised monolithic data stores. This approach embeds the learnings of digital platforms and applies them to data, to change the view of data from a hoarded commodity to a product in its own right.
It brings a common data storage pattern (the mesh) together with Domain-Driven Pipelines and Data Product Owners to power the end to end delivery of self-contained data products that can be accessed the same way from analytics tools, microservices and data processing tools across an organisation's estate.
They are self-serve platforms with an underpinning Data Infrastructure as a Platform team that enables the domain teams, and provides key common services such as security, data quality and data discoverability.
These all combine to create an architecture where data teams are self-reliant, have clear boundaries and direct interactions with data producers and end users, resulting in faster delivery of better quality data products across the organisation.
Organisations know there is valuable data in their business, but it remains unusable or accessible only to very few people or no-one at all.
Valuable data from business processes is ignored, trapped or treated only as useless exhaust. Organisations see data “held hostage in on-prem systems”.
Data is not usable because of gaps in the data or poorly formed or structured data. For example, phone numbers and addresses are not consistently formatted for use, while data codes and categories are poorly understood.
The same or similar data sets are being created many times in the organisation.
Explore thinking about your data as products with end users who should be treated as any other end user. Have dedicated people whose job is to make sure the data can be used by the people that need it.
Data as a product is a domain-bounded isolated data-set that has value to data users - such as database table(s) or an API. The users of a data product interact with it in an ad-hoc manner that isn't guided by a specific set of user interactions. This is what gives a data product the power to enable a data-driven organisation. For example, a single data product may be:
Surfaced through a business intelligence tool for user generated reporting.
Joined by data scientists with other data products to enable Machine Learning.
Brought into an operational data store for real-time usage by a microservice.
Leveraged by data engineers in a data pipeline to create new data products.
Data products need to be valuable to their users - they must be useful data; and they need to be trusted. Like the product owner role in application development, data product owners are accountable for making sure the data is successful and meets these needs. They make sure the data is accessible to the analysts, data scientists or business users who need it in a form that is right for them to use, and that it is of the quality required by them.
Thinking about data as a product with customers and assigning data product owners allows organisations to innovate and move faster because data governance is performed by someone that is accountable for this specific data domain and the value created by it.
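One hedged way to picture 'data as a product' in code (the class, fields and checks below are purely illustrative, not a standard or a tool): a data product carries an accountable owner, an explicit schema, and the quality checks its consumers rely on, rather than being an anonymous table.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: describing a data product explicitly, with an
# accountable owner and the quality checks its consumers depend on.
@dataclass
class DataProduct:
    name: str
    owner: str                        # the accountable data product owner
    schema: dict                      # column name -> expected type
    quality_checks: list = field(default_factory=list)

    def validate(self, record: dict) -> bool:
        """A record is valid if it matches the schema and passes every check."""
        if set(record) != set(self.schema):
            return False
        if not all(isinstance(record[c], t) for c, t in self.schema.items()):
            return False
        return all(check(record) for check in self.quality_checks)

customers = DataProduct(
    name="customers",
    owner="customer-domain-team",
    schema={"id": str, "email": str},
    # Quality means usable, not merely present: a bare 'not empty' is not enough.
    quality_checks=[lambda r: "@" in r["email"]],
)
```

The design point is that ownership and quality expectations travel with the data, so every consumer - BI tool, data scientist or microservice - gets the same contract.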
Organisations create data architectures and platforms but these get stuck in complex architectures that impede their ability to make data available to data analysts and scientists and other data users.
Explore implementing a paved road for the creation of data pipelines.
The paved road approach has been successful in accelerating the delivery of digital services. We recommend applying it to data pipelines. Create a 'Hello World' base repo with a simple pipeline from ingest to end-user accessibility, which includes observability, testing, a readme/wiki etc., so that dev teams can rapidly put together new ones.
With the exponential growth of data, and data work being spread all over an organisation in different teams, there is a need for some uniformity and shared practices. A paved road for data empowers teams to work on data by leveraging a set of self-serve tools and best practices. It means that teams can work easily with data, without losing the uniformity of tools and architectures across the organisation. It fits the Data Mesh architecture, which advises having a Data Infrastructure as a Platform team that is domain agnostic and focuses on creating the paved road. This is different from centralising data engineering; the team is horizontal across the organisation but acts as a facilitator to other teams in a self-serve approach.
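A hedged sketch of what the 'Hello World' base repo might contain (stage names and record shapes are invented for illustration): a skeleton pipeline with ingest, transform and load stages already wired together with logging, so a new team starts from working, observable code and only replaces the domain logic.

```python
import logging

# A hypothetical paved-road pipeline skeleton: each stage is a stub a
# team replaces with its own domain logic; observability is already wired in.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def ingest() -> list[dict]:
    log.info("ingest: reading source")
    return [{"value": 1}, {"value": 2}]        # stand-in for a real source

def transform(records: list[dict]) -> list[dict]:
    log.info("transform: %d records", len(records))
    return [{**r, "doubled": r["value"] * 2} for r in records]

def load(records: list[dict]) -> None:
    log.info("load: writing %d records", len(records))  # stand-in for a sink

def run() -> list[dict]:
    records = transform(ingest())
    load(records)
    return records

result = run()
```

A real template would add the rest of the paved road: CI configuration, tests, infrastructure as code and a readme, all ready to fork.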
Organisations want to find insights about their business but the data is spread over many systems.
Explore the latest tools for making federated queries over many data sources.
Organisations always struggle with getting the right data to the right users at the right speed. One key reason is that the data needed by users is spread over multiple sources. Traditional ETL and data warehousing models worked well for reporting purposes, but with the growth of advanced analytics use cases and near real-time data needs, these models are proving unsatisfactory.
Data virtualisation products can remove the ETL requirements by enabling federated queries across your data estate. Products in this growing space include AWS Athena, Denodo, Dremio and Trino. The ability to create a catalog of available curated data, de-couple users from sources, and provide a single rich interface lets you share data efficiently and safely, delivering faster reporting and analytics.
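The core idea of a federated query - one SQL statement spanning several sources - can be sketched with Python's built-in sqlite3, using ATTACH as a toy stand-in for what engines like Trino or Athena do across real, heterogeneous systems (the databases and tables here are invented):

```python
import sqlite3

# Two separate 'sources': the main database (orders) and an attached
# second database (a stand-in CRM), queried with a single federated join.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH ':memory:' AS crm")   # a second, independent store
conn.executescript("""
    CREATE TABLE main.orders (customer_id INT, amount REAL);
    INSERT INTO main.orders VALUES (1, 10.0), (2, 7.5);
    CREATE TABLE crm.customers (id INT, name TEXT);
    INSERT INTO crm.customers VALUES (1, 'alice'), (2, 'bob');
""")

# One query across both sources - no ETL step required.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders o
    JOIN crm.customers c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

A real virtualisation engine adds the hard parts this sketch skips: connectors to disparate systems, query push-down, security and a shared catalog.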
Inability to rapidly find the data needed for analysis or insight generation.
Data pipelines re-implemented many times over.
Explore the use of data catalogues/data discovery tools - find the right approach for your organisation.
One way to address this is by creating documentation for the data. Some organisations use tools like spreadsheets or wikis; some use other fully manual tools made for the purpose. But this documentation tends to be avoided, forgotten and consequently outdated. However, a new type of centralised data catalog is emerging which uses automation to generate the catalog, drawing on data lineage and usage patterns. Knowing which datasets are available to explore, who owns them, and how they were generated (data lineage) and used empowers data scientists and data analysts with a self-serve way to discover and explore the full value of data.
The ease and speed with which data can be explored and new products built is key in a data-driven organisation, so a data catalog which relies on automation, and not only on human-made documentation, is a must-have for data discoverability.
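A toy sketch of the automation idea, assuming sqlite3 as the data store being catalogued (real catalog tools crawl warehouses, lakes and pipelines in the same spirit): generate catalog entries from the data itself rather than from hand-written documentation, so the catalog cannot drift out of date.

```python
import sqlite3

# A stand-in data store with a couple of tables to be catalogued.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, email TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
""")

def build_catalog(conn) -> dict:
    """Introspect the database to produce table -> {column: type} entries."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # table_info rows are (cid, name, type, notnull, default, pk).
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = {name: col_type for _, name, col_type, *_ in cols}
    return catalog

catalog = build_catalog(conn)
```

A production catalog would enrich this with owners, lineage and usage statistics, but the principle is the same: the catalog is generated, not maintained by hand.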
Enterprise Data Models aim to provide a model of the key data items passing through an organisation. Often they feature as activities early on in data programmes as a critical artifact to inform the design and implementation of data architectures. This is a laudable aim, but adopting this approach leads to these sorts of scenarios:
Long delays in delivering value to users: An overemphasis on developing a monolithic EDM leads to it becoming an IT deliverable. This incurs a long implementation period, leading to loss of business engagement and a slow down in innovation. In the worst case the model is never finished.
The model is not adopted by developers: The EDM takes too long to develop, so we also see the EDM being out of sync with the actual implementation as teams work around this slow monolithic process and adopt their own niche models to meet their needs. The model becomes a zombie artifact.
Avoid the development of an enterprise data model before data can be ingested into the system.
Rather, starting with the needs of the data users in the initial use cases will lead you to a useful data model that can be iterated and developed over time.
Instead focus on building a Contributory Data Catalog that is built up as data of business value is ingested and utilised. This will grow into a vibrant and high value EDM that provides value to the business and enables faster innovation through increased trust in the data.
Centralising all data engineering in one function or team increases the distance from the users, increases time to value and leads to these sorts of scenarios:
Data engineering is seen as a blocker rather than as an enabler
A focus on technology rather than business needs because the central team lacks the domain context of how the data should be used.
Exhausted or demotivated engineering teams - the weight of an ever-growing backlog and the feeling that they are never meeting the needs of the business leads to frustration and demotivation in the centralised team.
Work seems to be prioritised according to data platform needs rather than the data users.
Centralised teams make the organisation less responsive to change and sit at a distance from the concerns and needs of the data users. They usually lack the context to fully understand the data being processed, so they are less likely to spot issues in the data and tend to hold a simplistic view of data quality (e.g. 'the field must not be empty') rather than understanding what good quality actually looks like for an element of data.
Avoid having a central function whose job is to service all needs for data users.
Instead, find ways of moving the work of providing data access closer to the end users - ideally ones that enable self-service of data provision by engineers attached to data product teams. In the Team Topologies world this might be an engineer in a stream aligned team which is creating a data product using services provided by a data platform team.
Too many organisations have fallen foul of the problem where IT has bought a product to solve a problem and it hasn't been adopted. The data area is no different. There is no shortage of vendors promising one stop solutions for data platforms or promising that you can manage all your dashboards and data in one place. But when they are implemented they fail to deliver the expected benefits.
The products have been well researched and are rated, for example, 'best of breed' or 'Technology Leaders' by respected technology evaluators - so why haven't they been embraced by the organisation? Meeting business needs - responding to market conditions or users' needs - is put on the back burner while the technology initiative is implemented. Time and time again we see these initiatives take longer and cost more than expected because the focus is not on meeting a business need. Instead they get tied up in meeting technical milestones unconnected with users before even considering a valuable use case.
Strategies should start by understanding the business problems you want to address with your data, and they should consider people and processes as well as the technology. Understand how the business wants to work (process) and how people will use the data (people); you can then ensure the products (technology) being chosen will meet the business needs. Of course, there are many great off-the-shelf products out there, which can be very beneficial. But they will not be the full solution. Before you commit to one, understand its boundaries and what skills (people - again!) you will need to make it work. We really recommend doing some technical spikes to get a feel for what the product can do and how you can include it in your development lifecycle (it's easy to forget about this point) before making your choice.
Environments in which the data pipelines can be constructed using visual programming approaches, such as drag and drop of components onto a canvas, have made it easier for non-coders to create data pipelines and democratise data.
Why should you avoid it?
Whilst we applaud the goals of these platforms to improve self-service of data pipelines, they often create challenges longer-term. The Continuous Delivery movement has identified a number of drivers for accelerating delivery and operation of software - practices such as Test Driven Development and Test Automation for deployment, working in small batches, and infrastructure as code.
Most WYSIWYG platforms are not developed with these approaches in mind. They are difficult to integrate into CI/CD/TDD approaches and infrastructure. For example, they are typically not provisioned with an ability to create unit tests, making continuous deployment and low-risk upgrade or maintenance difficult. They can be difficult to place under version control and, when they are, it is typically not possible to see the changes between commits. In some cases monitoring and alerting are not easy to integrate with the platforms.
These challenges make it difficult to maintain trustworthy pipelines. In our experience it is almost impossible to maintain the generated code that these tools produce. Whilst a simplistic pipeline is easy to demonstrate, it is remarkable how many projects using these tools still require specialised consultants to be involved long after the initial setup period.
We prefer to create our pipelines in code, as this allows us to benefit from all the Continuous Delivery techniques that software engineers have found to be the most efficient ways to create high-quality, trustworthy software. However, if you do choose to go down the WYSIWYG route, we have found that these tools require more quality assurance time and effort, as testing is shifted right. You typically cannot apply version control to the same degree as you can with a code solution, but try to find opportunities to apply it where you can. For example, we have found that infrastructure as code using tools like Terraform can be applied fruitfully, and tools like Liquibase can be used to manage the SQL (which you will almost certainly need to work with) and automate some of the QA testing.
Data Lakes are highly scalable storage areas that hold data in many different formats - unstructured, semi-structured (e.g. JSON files) or structured (e.g. Parquet files). Because they are simple storage areas it can be very easy to ingest data, which is attractive, but it can be tempting to ingest lots of data with an 'if we collect it they will come' mentality. A data lake built this way will often also lack key features which make data usable, such as discoverability or appropriate partitioning - all of which leads to the dismissal of Data Lakes as 'Data Swamps'.
Data is the lifeblood of business. A data lake or data warehouse is a way of storing some of that business data but the focus needs to be on the business and its requirements, not the building of the store or the ability to swallow massive amounts of data.
Focus on the business requirements and use them in conjunction with curated pipelines or distributed SQL tools.
Many of the failings of data lakes are not about the chosen technology or the designs that have been implemented, it's that they are built with a focus on ingesting large amounts of data rather than providing data for end users. So instead of focusing on building a data lake, focus on delivering a domain focused data platform, with use-cases for the data that meet the needs of the business and can be built out.
We are not saying never use Data Lakes - we have seen them successfully used as part of an ELT (extract load transform) architecture as the landing zone to drop raw data from source systems - the so-called Lakehouse architecture. They can also be a great choice if the data is of the same structure and you can apply tools like Presto or AWS Athena to provide querying and discoverability services.