Challenges in working with low-code tools in data engineering projects
In this article I share my experience of working as a data SDET on a data engineering project that used low-code tools, and the challenges we ran into along the way.
This article has been written in close collaboration with Simon Case – Head of Data at Equal Experts.
An introduction to low-code tools
A lot of data engineering work involves extracting data from operational systems (on-premise databases, workflow systems etc.), and ingesting it into an environment – such as a data warehouse – where it can be stored and queried without negatively impacting the operational services.
There are many ways to do this, and one of the challenges for a data engineering team is to decide which is right for the technical and business landscape they are working in. At Equal Experts we have a strong preference for approaches that fit into a continuous delivery methodology – continuous delivery has been shown to be the most efficient way of delivering software, and we have found that software focused on data is no different. This requires a code-based solution – with everything stored in a repo.
However, there are many tools which promise to streamline this process through the use of low-code or no-code technologies. These:
- Use a visual development approach – solutions are specified with drag and drop graphical user interfaces.
- Are aimed at allowing users with little or no software development experience to create solutions.
- Often have a large catalogue of connectors to external third-party sources, which enables the development team and data-savvy users to orchestrate data quickly and experiment with it.
Recently we worked with a client who had adopted a low-code data tool that enabled ETL/ELT data pipelines to be specified through a visual, drag-and-drop programming interface. This blog captures some of our experiences of working with these tools. Whilst you cannot reach the same levels of automation as with a purely code-based solution, we found some ways of improving the reliability of deployments.
What are the main challenges of a low-code/no-code solution?
From the perspective of developers who follow continuous delivery practices, a number of things prevented us from creating the development process we would have liked:
- Immature Git integration – the integration of low-code tools with Git for version control is still in its infancy. Whilst many offer some ability to commit changes to a Git repo, there is no easy way to run a git diff between branches, or between commits within a feature branch, to highlight differences and resolve merge conflicts. Teams have to be careful with their branching and merging strategy, allocating work by functional area so that developers do not work on the same jobs and trip over each other’s changes.
- No unit-testing framework – in a continuous delivery approach, continuous testing is critical. Having small tests associated with each change means developers can check that later changes won’t break the solution, without needing to run end-to-end tests. This hugely improves feedback loops by reducing the time to detect errors, and it also improves the ability to deliver features. In low-code or no-code solutions there is no easy way to create small, simple tests that check individual pieces of functionality (the sketch after this list shows the kind of test we were missing).
- Manual promotion – because unit testing was not possible, promotion to new environments (e.g. from development to UAT) required manual steps, and it was not possible to create a truly automated CI/CD pipeline.
- QA becomes a bottleneck – because we were not able to test in depth with unit tests, testing shifted right onto the QA team, and lots of manual tests were required for each release. Fortunately there were willing data users in the company who would test releases, including edge cases.
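To make the unit-testing point concrete, the sketch below shows the kind of check we would normally attach to every small change. It assumes the transformation logic can be expressed as an ordinary Python function – the mask_email helper here is a hypothetical example, not something exposed by the low-code tool – which is exactly what a drag-and-drop pipeline makes difficult.

```python
# A minimal sketch of the unit tests we were missing, runnable with pytest.
# mask_email is a hypothetical transformation; in the low-code tool the
# equivalent logic lived inside a drag-and-drop job with no way to test it
# in isolation.

def mask_email(value: str) -> str:
    """Keep the domain, mask the local part of an email address."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}" if domain else "***"

def test_mask_email_keeps_domain():
    assert mask_email("jane.doe@example.com") == "j***@example.com"

def test_mask_email_handles_values_without_an_at_sign():
    assert mask_email("not-an-email") == "***"
```

Tests like these run in seconds and can gate every commit; without them, the only way to validate a change was an end-to-end run in a shared environment.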
The approaches we took to improve our ability to deliver new data pipelines
We did find things from the modern software development toolbox that we could apply, and they improved the team’s ability to deliver a reliable and trustworthy solution:
- Infrastructure as code – for all data pipelines, infrastructure as code with tools like Terraform or CloudFormation is pretty much essential. Fortunately we found that we were able to use these with the low-code solution, and the orchestration tool could be hosted on EC2.
- Version controlling the database – like most data pipelines, a large part of the work was handled within the data warehouse (in our case Snowflake). We used Liquibase to version-control the database structures, which turned out to be a life-saver: it made it simple to reproduce the database across environments and saved us a lot of work.
- Automating QA with SodaSQL – we used the SodaSQL tool to create data quality tests, which reduced some of the manual testing effort, and we even managed to create tests to prevent regressions. It’s not as good as a true unit-testing framework, but it definitely improved system stability and reduced the time to deliver data pipelines (a plain-Python illustration of the kind of check we automated follows this list).
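SodaSQL defines its scans in YAML rather than code, so the sketch below is a rough plain-Python illustration of the kind of rule we automated, run against Snowflake with the snowflake-connector-python package. The table, column and connection details are placeholders, not the client’s actual setup.

```python
# A plain-Python illustration of the data quality checks we expressed in SodaSQL
# (SodaSQL itself uses YAML scan files). Table, column and connection details
# below are placeholders.

import os
import snowflake.connector

CHECKS = [
    ("row count is non-zero",
     "SELECT COUNT(*) FROM ORDERS",
     lambda n: n > 0),
    ("no missing customer ids",
     "SELECT COUNT(*) FROM ORDERS WHERE CUSTOMER_ID IS NULL",
     lambda n: n == 0),
]

conn = snowflake.connector.connect(
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    database="ANALYTICS",  # placeholder
    schema="PUBLIC",       # placeholder
)
cur = conn.cursor()

failures = []
for name, query, passes in CHECKS:
    cur.execute(query)
    (value,) = cur.fetchone()
    if not passes(value):
        failures.append(f"{name}: got {value}")

cur.close()
conn.close()

if failures:
    raise SystemExit("Data quality checks failed:\n" + "\n".join(failures))
print("All data quality checks passed")
```

Running checks like these after each deployment is what turned part of the manual QA effort into an automated gate.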
I hope you found these learnings useful. These insights and much more can be found in our data pipeline playbook.
If you want to talk to us about Data Engineering or our experiences in using low-code data tools please do get in touch.