Can DBT Macros replace UDFs in AWS Athena?

Edit: You might be interested in our updated solution to this problem: UDF Nirvana? A new way to integrate UDFs into the dbt ecosystem

Last year, I wrote about how you can use dbt to deploy and test User Defined Functions, or UDFs. I recently expected to use this technique with AWS Athena and was surprised to find that Athena only supports UDFs that are backed by an AWS Lambda function. There’s no simple SQL-based CREATE FUNCTION statement here.
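
To make the difference concrete, here’s a rough sketch (all names here are illustrative). The first statement is the kind of SQL-only UDF you can create in a warehouse like BigQuery; in Athena, the closest equivalent is declaring an external function inline in each query and pointing it at a Lambda that holds the implementation.

-- SQL-based UDF, as supported by warehouses like BigQuery
CREATE OR REPLACE FUNCTION my_dataset.is_positive_int(the_string STRING)
RETURNS BOOL
AS (REGEXP_CONTAINS(the_string, r'^[0-9]+$'));

-- Athena: the function must be declared per-query and backed by a Lambda
USING EXTERNAL FUNCTION is_positive_int(the_string VARCHAR) RETURNS BOOLEAN
LAMBDA 'my-udf-lambda'
SELECT is_positive_int(some_column) FROM some_table;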

I found myself without a way to capture, name, test and document reusable logic in Athena. I wasn’t ready to admit defeat and accept the horror that is copy-and-pasting logic, so – as I was using dbt anyway – I wondered: could I use dbt macros instead?

Macros v UDFs 

The main difference between macros and UDFs is where they reside. 

A UDF will be deployed into your data warehouse. It’s something you can find in the information schema and call from any query that you write. Any documentation provided will also be available in the data warehouse. 

In contrast, a macro only exists within dbt; it does not exist as an entity in the data warehouse. When you call a macro in a query, it is replaced by SQL at dbt compile time. Macros can still be used to achieve the “capture, name, test and document reusable logic” idea.

Solving problems with Macros v UDFs

Let’s take the is_positive_int UDF example from that post and solve the same problem with a macro:

  • Simple, one-line macro: is_positive_int

This macro example is very similar to the UDF version from the previous post. (I’m using SIMILAR TO because I’m testing with DuckDB, which doesn’t support BigQuery’s REGEXP_CONTAINS syntax). This example also shows how to add a little inline documentation using standard SQL comments.

{% macro is_positive_int(column_name) -%}
({{ column_name }} SIMILAR TO '^[0-9]+$' /* dbt macro {{ this }} */)
{%- endmacro %}

If we modify this test case to work with the macro and DuckDB, it would look like this: 

WITH examples AS (
  SELECT 'foo' the_string, FALSE expected
  UNION ALL SELECT '1111', TRUE
  UNION ALL SELECT '5 not this', FALSE
  UNION ALL SELECT 'still not this 5', FALSE
  UNION ALL SELECT NULL, NULL
)

, test AS (
 SELECT
    *,
      {{ is_positive_int('the_string') }} actual
  FROM examples
)

SELECT * FROM test WHERE actual != expected

As you can see, this is very similar to how we would approach the problem with a UDF. If we then run dbt test, this happens:

1 of 2 START test is_positive_int ........................ [RUN]
1 of 2 PASS is_positive_int .............................. [PASS in 0.07s]

The compiled SQL uses a comment as rudimentary documentation, pointing to the origin of the implementation. 

SELECT
  *,
  (the_string SIMILAR TO '^[0-9]+$' /* dbt macro "macro_testing"."main_dbt_test__audit"."is_positive_int" */) actual
FROM examples

A More Complex Example: Macro v UDF (rag_indicator) 

So far we have been successful in using macros, but what happens when we’re doing something more complicated? 

To illustrate, I’ll use a problem I’ve seen numerous times – how to fit a metric value to a red/amber/green traffic light status indicator.

In this example, our macro takes the column name plus the valid range and threshold boundaries as parameters, and maps each metric value to a traffic-light status or an error value:

{% macro rag_indicator(column_name, maximum_valid, green_lower_bound, amber_lower_bound, minimum_valid) -%}

(CASE
    WHEN {{ column_name }} > {{ maximum_valid }} THEN 'ERROR_TOO_HIGH'
    WHEN {{ column_name }} >= {{ green_lower_bound }} THEN 'GREEN'
    WHEN {{ column_name }} >= {{ amber_lower_bound }} THEN 'AMBER'
    WHEN {{ column_name }} >= {{ minimum_valid }} THEN 'RED'
    WHEN {{ column_name }} < {{ minimum_valid }} THEN 'ERROR_TOO_LOW'
    ELSE 'ERROR_INVALID'
END)

{%- endmacro %}

Again, we can exercise it with a test case covering the boundaries:

WITH examples AS (
  SELECT 6 the_value, 'ERROR_TOO_HIGH' expected
  UNION ALL SELECT 5.0000000000001, 'ERROR_TOO_HIGH'
  UNION ALL SELECT 5, 'GREEN'
  UNION ALL SELECT 4, 'GREEN'
  UNION ALL SELECT 3.9999999, 'AMBER'
  UNION ALL SELECT 3, 'AMBER'
  UNION ALL SELECT 2, 'AMBER'
  UNION ALL SELECT 1, 'RED'
  UNION ALL SELECT 0.5, 'RED'
  UNION ALL SELECT 0, 'RED'
  UNION ALL SELECT -0.000000000001, 'ERROR_TOO_LOW'
  UNION ALL SELECT NULL, 'ERROR_INVALID'
)

, test AS (
  SELECT
    *,
      {{ rag_indicator('the_value',
        maximum_valid=5,
        green_lower_bound=4,
        amber_lower_bound=2,
        minimum_valid=0) }} actual
  FROM examples
)

SELECT * FROM test WHERE actual != expected

The tests are passing, and if we look at the compiled SQL: 

SELECT
    *,
      (CASE
    WHEN the_value > 5 THEN 'ERROR_TOO_HIGH'
    WHEN the_value >= 4 THEN 'GREEN'
    WHEN the_value >= 2 THEN 'AMBER'
    WHEN the_value >= 0 THEN 'RED'
    WHEN the_value < 0 THEN 'ERROR_TOO_LOW'
    ELSE 'ERROR_INVALID'
END /* dbt macro "macro_testing"."main_dbt_test__audit"."rag_indicator" */) actual
  FROM examples

This is a very simple implementation of a RAG indicator function, but a reusable implementation means you can start simple, keep track of what’s using it and build a test suite to make sure it does what you expect it to do. This is much better than copy-and-pasting an implementation by hand and then trying to wrestle with the inevitable divergence as the codebase evolves.

Summary: Can dbt macros replace UDFs in AWS Athena?

We believe it is possible to deliver most of the benefits of SQL-based UDFs using dbt macros. We’ll wrap up by calling out a few differences to be aware of.

  • Using macros means that there will not be a defined entity representing the logic within the data warehouse. However, it also means you can avoid having a dependency on that entity and its location.
  • A macro-based approach can also result in more verbose compiled SQL, but comments can at least be used to indicate which macro is responsible for a chunk of SQL embedded in a query.
  • Calling one macro from within another is more fiddly than calling one UDF from within another, for example ensuring that quoting works correctly. The resulting SQL can get complicated fast, too.
  • Unlike a UDF, there’s nothing to control the scope of your macro, so if you’re not careful, operator precedence in the surrounding query can alter the result. A macro “add_one” with content “x + 1” will work fine in most contexts, but when used like this: “{{ add_one(100) }} * 2”, you get 102 instead of the expected 202. Wrap your macro SQL in brackets to ensure it behaves as an independent unit.
  • In a UDF you can declare the types of parameters and results that are checked at runtime – so if you say it takes an INT64 and someone tries to run it on a string, boom – nice clear error. You can’t do that with a macro, so you may need to be more defensive, for example explicitly casting values to ensure they are the correct type and force an error if not.
  • You can provide any documentation you want for a dbt macro – just add a schema file, as sketched below.
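
As a sketch of what that schema file might look like for the is_positive_int macro (the layout follows dbt’s standard macro properties; the wording is purely illustrative):

version: 2

macros:
  - name: is_positive_int
    description: >
      Returns TRUE if the given column contains only the digits 0-9,
      FALSE otherwise, and NULL for NULL input.
    arguments:
      - name: column_name
        type: string
        description: The column (or SQL expression) to check.

Running dbt docs generate should then surface this description in the generated documentation site, alongside your models.
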
What are experience vision maps?

Enterprise technology projects can involve hundreds of team members working across multiple platforms and business departments. Getting an overview of what’s happening, and keeping the overall goal in mind isn’t easy.

Experience vision maps (EVMs) can help by providing a single ‘story’ about large projects. They combine text and images to explain how solutions will transform often complex and ambiguous situations within organisations.

Typically, an EVM will tell the story of your future user or customer journey, as they move through a service. It will show the interactions that customers and staff have online and offline, enabled by digital services. The aim of an EVM is to provide a simpler way for people to understand the direction and ambition of a project, which supports better engagement and alignment among different stakeholders, teams and functions.

The real magic of the EVM is the process by which it is created. An EVM is a collaborative, shared vision created by everyone involved in delivering it – from front line staff to product, technical, legal and data teams. When people have co-created a vision of a better future there’s less need to convince them – it leads naturally to conversations like, “When can we have all this? How do we get started?”

Getting to this point takes hard work and commitment from the organisation. While the process can be fast, fun and engaging, it takes significant research and a deep understanding of the organisation’s challenges, culture, and customers. It can also be a challenge to commit the time needed for the co-creation workshops. But the payoff is worth it!

When to use enterprise vision maps?

EVMs are a quick and effective way to set out a vision for transforming products and services. They are most suitable when an organisation:

  • Is focused on transforming services, rather than optimising them
  • Has complex, messy problems without a single, obvious solution
  • Relies on multiple teams to deliver a service
  • Needs to improve and/or update its technology and service standards
  • Lacks a clear path to iterate from the current product to industry-leading standards

What do EVMs look like?

This is an example EVM we created with a travel client, showing how joined up customer information can be used to deliver better experiences at every touchpoint in the customer journey.

The EVM tells the story of a family as they book, plan, and travel to their holiday together, seamlessly supported by great digital and in-person customer support. This map is built on a shared vision of how the company wanted to use technology. This was then used to plan and prioritise the design and development of systems and services.

The anatomy of an Experience Vision Map

An EVM is a combination of human story and tech-agnostic system map. The story simplifies and humanises a complex ecosystem, making it accessible and engaging. The system map points to the technology that’s needed to deliver the target experience.

The key components of an experience vision map are: 

  • The Story 
  • The systems 
  • The ambition
  • The path to delivery 

Below, we outline in more detail what each of these components involves and how it helps organisations to deliver more successful outcomes for complex transformation projects.

The Story 

The heart of an EVM is the story of a customer, or a group of customers, moving through the client’s products and services over the full lifetime of their engagement. 

An EVM starts with the target or desired experience, described from the customer’s perspective. Like lots of good stories it has: 

  • A hero (the customer) 
  • Struggles to overcome (limitations of the current system) 
  • A saviour figure (the new technology), and 
  • A gang of friends to support the hero (supporting staff, customer service etc) 

An EVM story can take place over an extended period of time, like the example above where we followed the Campbell family as they booked their summer holiday, then into the next year and beyond.

By telling a story, the EVM introduces drama into the customer journey, with the technology coming to the rescue like a fairy godmother. The story also helps to engage with multiple audiences by creating a universally human scenario that draws people into an imaginary future scenario where the business is delivering better experiences. 

The Systems 

If the customer is the hero in our story, then the system map is the landscape. 

The Experience Vision Map shows every interaction the customer has on their journey with our digital services and systems. This makes it easier to see the key systems that will be needed to deliver the future experience and puts them into a human context. 

Systems that are commonly included in an EVM might include: 

  • app or website 
  • booking systems and CRM 
  • APIs and data management 
  • specialist technology and hardware, e.g. machine learning or customer terminals

The EVM shows the systems – and how they combine to enable the target experience –  at a very high level and tech-agnostically. At this stage it’s too early to come up with solutions.  Once we have the target experience mapped we can look at the whole landscape and assess what needs building, buying and integrating.

This is the bit of the EVM that leads directly to things that can be built, so the design team must work closely with technical and domain experts to make sure that the experience shared on the EVM is grounded in reality and is painting an ambitious but ultimately deliverable future. 

The ambition

Many people find change challenging and others want change to happen at a rapid pace. This means it’s important that the whole organisation has a shared view of their ambitions – what does ‘good’ look like? Where are we prepared to compromise? What is our goal? 

An EVM should articulate this shared ambition, making it easier to create alignment across multiple delivery teams, operations and leadership.   

Your ‘vision’ should be far enough into the future to encourage people to break free from current delivery constraints, but not so far out that it starts to look like a science-fiction story of flying cars and space travel. Let’s remember that companies change at different rates, and smaller companies might find it easier to accelerate transformation compared to bigger organisations. In general, though, an EVM should aim to look 2-3 years ahead. 

A path to delivery

So far, our EVM has looked at our customer’s future experience, the systems that will deliver those experiences, and identified our vision of the future. With these elements in place, it’s time to stand back and think about what it’s going to take to get there. It’s easy to paint a pretty picture of great customer experiences. It’s much harder to plot a realistic, achievable path to deliver that great experience. 

Using the EVM as a guide, the next step is to identify the key platforms, data, teams, technologies and processes that will be needed to enable the transformed customer experience. Depending on the complexity of the transformation needed and the organisation’s wider strategy, different approaches to planning can be used. 

It’s common to set out ‘swimlanes’ in the main map, and identify what it takes to deliver each component of the system from technology, data, people, and process perspectives.

The map is used to identify discrete projects that start to deliver the target vision.

Some processes might need to be further unpacked before creating swimlanes. By focusing on the customer experience, EVMs typically skim over a lot of the complexity behind moments where good technology ‘just works’, automagically, for the user. When this is the case, layering very high-level processes onto the EVM – just enough to be able to identify discrete projects and systems – can add useful clarity.

Finally, we use the prioritised and roughly scoped items to create a roadmap for delivering the vision. With all teams involved in creating the vision and aligned in the direction to move, we move into the delivery phase. 

Typical EVM project delivery outline

Although timescales can vary, it will usually take about eight weeks to create an experience vision map for a typical project, along with an agreed target experience and delivery roadmap.

EVMs are an efficient way to create a strategic vision for a product or service, and the plan will balance that speed with the necessary depth of understanding of the company and opportunities. A typical project outline could be:

Weeks 1-3: Research.
The design team needs to get a deep understanding of the business, and the challenges and opportunities it has in a short period of time. This will include:

  • Stakeholder interviews – this includes representatives of everyone involved in delivering the services
  • Customer interviews
  • Market and competitive landscape mapping

Week 4: Co-creation workshops.
Co-creation workshops bring together everyone from senior leadership to front-line staff. We collaboratively sketch future user journeys exploring many different scenarios. The conversations that happen in these workshops generate new ideas and ensure everyone is on the same page.

Weeks 5-6: Synthesis.
The workshops will generate a huge number of ideas across different scenarios. The design team pulls out patterns and themes from the sketches and combines the scenarios into one archetypal journey that represents the combined vision of the workshop participants.

Weeks 7-8: Review, planning, and roadmapping.
When the EVM is complete, it’s time to look at what needs to be done to get there! We identify the discrete projects required to move towards the vision; these are then prioritised and planned with full clarity about how they contribute to the wider vision and strategic goals.

In about eight weeks we have an ambitious, achievable, service transformation strategy and an organisation aligned on delivering it!

A data science and machine learning (DSML) platform is a set of tools and services that enable data scientists and machine learning engineers to develop models rapidly, and streamline data analytics. It should also help users to promote solutions from ideation to being operational more easily, helping organisations to bring new services to market faster. 

A DSML platform doesn’t only reduce time to market. It can also help development teams through improved reusability of features, governance and compliance support, cost optimisation, and better cross-team collaboration and knowledge sharing.

That said, the complex nature of work and the constantly evolving technology landscape can make building a DSML platform challenging. In this article, we will discuss some of the lessons learned while building a data science and machine learning platform for a leading mining company. 

1. Identify and understand the user  

Your strategy and product vision must be built upon a good understanding of what value the platform will bring to the business, who will be using it and what level of expertise they will have. Are they seasoned data scientists, or citizen analysts? What problems will this platform solve, and how does that tie in with the organisation’s future plans?

When we understand the user and their needs, this helps us to make better technology choices. For example, no code / low code tools might be an excellent choice for democratising data science, but an engineering-oriented toolset might be more suitable for a small group of highly experienced engineers and analysts.  

2. Think big, start small 

It’s easy to try and please everyone or over-engineer a solution when there is a vast number of tools available, and broad expectations from the product team. The problem with this is that the DSML platform could turn into a half-baked solution – and stay that way for a long time.

Instead, take the advice of Barry O’Reilly, author of Unlearn. He advises teams to ‘think big, and start small’. In other words, have a strategy to build a world-class, comprehensive solution but define a roadmap and start with a small use case that delivers a big difference to the organisation. When building a DSML platform, you might want to consider whether this could be batch analytics, online inference, streaming analytics or something else. Again, understanding the customer’s needs is critical. 

3. Get the right team 

It’s essential to have a cross-functional team covering all aspects of delivery to build a data science and machine learning platform quickly. Depending on the solution chosen, your team could include machine learning engineers, data engineers, cloud engineers, developers and data scientists to support users. It isn’t only about skills, but also privileges. In our case, the single most delayed feature was related to integration with a central Data Lake. Our data engineer did not have sufficient permissions to access it, which slowed down debugging and the introduction of changes. Tackling such dependencies early – by embedding the right people in the team or building relationships – pays off in faster delivery.

4. The right data science projects

A DSML platform will have constraints related to its architecture and the technology used. It’s vital to analyse these constraints, and share them with users. A good example here might be the support for Python and its ecosystem of libraries. A data scientist can easily use any of the available library versions in a local environment, but that might not necessarily be the case on the shared platform.

Projects that are in an advanced phase of development or already developed can be particularly tricky to migrate. They can use legacy tools that are not supported on the platform. Moreover, the way the legacy models were developed might make migration very expensive and in some cases, the end model might generate different predictions than the original one.

5. Enable growth

The number of platform users will usually increase over time so it’s important to ensure that the effort for day-to-day support from the platform team is manageable. 

Deliver self-service through the automation of procedures and privilege delegation to ensure that product teams can handle BAU tasks without being dependent on platform engineers.

The second big enabler for growing product teams is ensuring they know how to use the platform effectively. Learning at scale requires high-quality training materials for onboarding and building a community where users learn from each other.

The time savings can be used for adding more features but more importantly for engaging in meaningful conversations with users about how the platform should evolve. In our case, the Hassle Maps concept proved to be particularly useful here.

6. Prioritise service availability

Keeping services available is critical to productivity and to embedding trust in the platform. Machine learning can be very resource-demanding. It’s crucial to monitor the cloud infrastructure and react to changes in demand for CPU, memory and storage, as well as performance (API response time).

In the event of a production incident, it is invaluable to have a defined incident management process. The Google SRE book provides a great set of practices to ensure the high reliability of the service. In our case, introducing a triaging process with clear responsibilities together with blameless post-mortems resulted in 24 times faster MTTR (Mean Time To Recovery).

7. Share successes

Finally, don’t forget to celebrate and share your successes. Gather stories about how the product teams achieved their goals by utilising the platform. Couple them with metrics to show specific platform outcomes. These could be quantitative (for example active monthly users, time to market, number of API calls) or qualitative (NPS, user surveys). Share these successes with a wider audience through internal communication channels, as well as at sprint reviews and system demos.

Data governance is a challenge for many organisations, especially when it comes to knowing what data is available and where it is stored.

Typically, we would implement a data catalogue to address this challenge. But getting the most out of these technologies is about more than just installing a tool. In this post, we want to share how applying the tools and techniques of a product manager can help to address data governance problems.

Typical data management challenges

We often work with organisations that have built up a large and potentially very useful data estate. This type of organisation commonly experiences data management issues, including:

  • High turnaround time to discover information about business and data assets
  • Low confidence in data among analysts, due to limited shared context
  • Lack of confidence in data sets because of missing lineage
  • Siloed information leading to duplicated effort

These issues cause enormous frustration. Business managers can’t see if projects were successful because it is hard to locate data, while data analysts are unsure if a given data set is trustworthy. Every department has its own definition of what a customer is (and their own data sets).

What does product thinking look like? 

As product managers, we would apply a classic product-based approach to shape our response to these challenges. This is a discover -> prototype -> implement approach.

Discover

Conducting user research helps to identify challenges the organisation has with information management, and can include workshops and individual meetings. These meetings should allow you to identify personas, pain points, frustrations and where improvements can be made. It’s also an opportunity to learn about strategic technical drivers, which helps to identify the most appropriate tools and technology. 

By the end of the discovery stage, we had defined the bigger picture and the associated vision, goals and needs. We also had a defined set of success measures, which were:

  • Time saved by data analysts and other functions (either as a result of improved access, or because it was quicker to find the right data asset)
  • Employee empowerment – does a data catalogue improve employee knowledge of available data, and does it make them feel more empowered?
  • Increased speed to onboard new employees – does a data catalogue improve how quickly you can onboard someone onto an analyst team?
  • Reduced infrastructure and related costs – does a catalogue enable people to find existing data sets or reports, and does this lead to reduced infrastructure costs?

Prototype 

In a standard product approach, we will prototype the product to assess how people use it and evaluate whether the benefits are likely to be met. Some organisations might not be convinced about the need for a prototype, but it’s essential when developing new products and introducing new services, and can make a critical contribution to success.

If it is important to get feedback on the catalogue quickly, it’s not necessary to implement a prototype of the whole service. Whilst automating metadata entry is a big accelerator for data catalogues, for the prototype we use handcrafted templates so that we can get feedback on the user experience quickly and understand which metadata is most relevant.

Once a prototype is in place, having one to one sessions with users helps gather feedback on their use of the tool. We look at how users are being helped to do typical tasks along with a post-test questionnaire that measures acceptance, usability and user satisfaction with the data catalogue. Some of the important points that will come out of this kind of evaluation might include:  

  • The catalogue displays various metadata for data assets. Using the prototype we can assess which ones are most useful
  • Users can easily see where the data came from through a lineage function – this shows what the source is, and how close the asset is to the source. This really helps users assess their confidence in the data-set.
  • It also shows a snapshot view – a small sample of the data – which users should find helpful in understanding the content.

During the prototype phase we could also learn: 

  • What is the most popular tool? Search, or something else? 
  • Do users find it helpful to request access through a basket function that automates the work of manually finding out who owns the data, making contact with them, waiting for them to have the bandwidth to provide data access etc? 
  • How useful are crowdsourcing tools? In one recent project users liked that they could provide ratings and reviews of data sources. This helped other users find what they wanted.
  • What percentage of data sets are redundant, perhaps because they are duplicate or not useful?

This user feedback means that we are able to iterate on the prototype to improve the design. For example, in some instances, user feedback showed that users were  overwhelmed by the standard metadata. In this case, we created customised views based on their feedback. Additionally, new users struggled to understand how to use the tool, so we created an introduction session that walked  new users through how to use it. 

Finally, the initial catalogue organisation  was difficult to navigate so it was refined  based on user feedback. 

Implement 

The actual implementation of the production catalogue was undertaken by the catalogue vendor’s implementation partner. In this phase, the supplier continued to give direction and worked with the client to measure the success factors. 

We hope that this post has given some insight into some of the considerations when creating a data catalogue solution, and shows how product thinking isn’t just about user interfaces. It can also be applied to data to help organisations shape important data services.

What’s it like to be a developer at Equal Experts?

Abraham Marin-Perez and Subhash Gupta talk to fellow colleague Neha Datt about how they personally describe pairing, cross disciplinary pairing, TDD (#MyTestIsMyFirstClient), other practices that are typical at Equal Experts from an engineering perspective, a typical day, difficult parts of the job, and career growth.

Take a look to learn more:


Thank you Neha, Abraham and Subhash!

Want to find out more?  Get in touch