Can DBT Macros replace UDFs in AWS Athena?

Edit: You might be interested in our updated solution to this problem: UDF Nirvana? A new way to integrate UDFs into the dbt ecosystem

Last year, I wrote about how you can use dbt to deploy and test User Defined Functions, or UDFs. I recently expected to use this technique with AWS Athena and was surprised to find that Athena  only supports UDFs  that are backed by an AWS Lambda function. There’s no simple SQL-based CREATE FUNCTION statement here.

I find myself without a way to capture, name, test and document reusable logic in Athena. I wasn’t ready to admit defeat and accept the horror that is copy-and-pasting logic, so as I was using dbt anyway I wondered – could I use dbt macros instead?

Macros v UDFs 

The main difference between macros and UDFs is where they reside. 

A UDF will be deployed into your data warehouse. It’s something you can find in the information schema and call from any query that you write. Any documentation provided will also be available in the data warehouse. 

In contrast, a macro only exists within dbt, and it does not exist as an entity in the data warehouse. When you call a macro into a query, it will be replaced by SQL at dbt compile-time. Macros can be used to achieve the “capture, name, test and document reusable logic” idea.

Solving problems with Macros v UDF 

Let’s take the is_integer UDF example and solve that problem with a macro:

  • Simple, one-line macro: is_integer

This macro example is very similar to the UDF version from the previous post. (I’m using SIMILAR TO because I’m testing with DuckDB, which doesn’t support BigQuery’s REGEXP_REPLACE syntax). This example also shows how to add a little inline documentation using standard SQL comments.

{% macro is_positive_int(column_name) -%}
({{ column_name }} SIMILAR TO '^[0-9]+$' /* dbt macro {{ this }} */)
{%- endmacro %}

If we modify this test case to work with the macro and DuckDB, it would look like this: 

WITH examples AS (
  SELECT 'foo' the_string, FALSE expected
  UNION ALL SELECT '1111', TRUE
  UNION ALL SELECT '5 not this', FALSE
  UNION ALL SELECT 'still not this 5', FALSE
  UNION ALL SELECT NULL, NULL
)

, test AS (
 SELECT
    *,
      {{ is_positive_int('the_string') }} actual
  FROM examples
)

SELECT * FROM test WHERE actual != expected

As you can see, this is very similar to how we would approach the problem with a UDF. If we then run a dbt_test, this happens: 

1 of 2 START test is_positive_int ........................ [RUN]
1 of 2 PASS is_positive_int .............................. [PASS in 0.07s]

The compiled SQL uses a comment as rudimentary documentation, pointing to the origin of the implementation. 

SELECT
  *,
  (the_string SIMILAR TO '^[0-9]+$' /* dbt macro "macro_testing"."main_dbt_test__audit"."is_positive_int" */) actual
FROM examples

A More Complex Example: Macro v UDF (rag_indicator) 

So far we have been successful in using macros, but what happens when we’re doing something more complicated? 

To illustrate, I’ll use a problem I’ve seen numerous times – how to fit a metric value to a red/amber/green traffic light status indicator.

In this example, our macro takes the parameters and sets threshold and error values for the metric: 

{% macro rag_indicator(column_name, maximum_valid, green_lower_bound, amber_lower_bound, minimum_valid) -%}

(CASE
    WHEN {{ column_name }} > {{ maximum_valid }} THEN 'ERROR_TOO_HIGH'
    WHEN {{ column_name }} >= {{ green_lower_bound }} THEN 'GREEN'
    WHEN {{ column_name }} >= {{ amber_lower_bound }} THEN 'AMBER'
    WHEN {{ column_name }} >= {{ minimum_valid }} THEN 'RED'
    WHEN {{ column_name }} < {{ minimum_valid }} THEN 'ERROR_TOO_LOW'
    ELSE 'ERROR_INVALID'
END)

{%- endmacro %}

The results can be seen in our example: 

WITH examples AS (
  SELECT 6 the_value, 'ERROR_TOO_HIGH' expected
  UNION ALL SELECT 5.0000000000001, 'ERROR_TOO_HIGH'
  UNION ALL SELECT 5, 'GREEN'
  UNION ALL SELECT 4, 'GREEN'
  UNION ALL SELECT 3.9999999, 'AMBER'
  UNION ALL SELECT 3, 'AMBER'
  UNION ALL SELECT 2, 'AMBER'
  UNION ALL SELECT 1, 'RED'
  UNION ALL SELECT 0.5, 'RED'
  UNION ALL SELECT 0, 'RED'
  UNION ALL SELECT -0.000000000001, 'ERROR_TOO_LOW'
  UNION ALL SELECT NULL, 'ERROR_INVALID'
)

, test AS (
  SELECT
    *,
      {{ rag_indicator('the_value',
        maximum_valid=5,
        green_lower_bound=4,
        amber_lower_bound=2,
        minimum_valid=0) }} actual
  FROM examples
)

SELECT * FROM test WHERE actual != expected

The tests are passing, and if we look at the compiled SQL: 

SELECT
    *,
      (CASE
    WHEN the_value > 5 THEN 'ERROR_TOO_HIGH'
    WHEN the_value >= 4 THEN 'GREEN'
    WHEN the_value >= 2 THEN 'AMBER'
    WHEN the_value >= 0 THEN 'RED'
    WHEN the_value < 0 THEN 'ERROR_TOO_LOW'
    ELSE 'ERROR_INVALID'
END /* dbt macro "macro_testing"."main_dbt_test__audit"."rag_indicator" */) actual
  FROM examples

This is a very simple implementation of a RAG indicator function, but a reusable implementation means you can start simple, keep track of what’s using it and build a test suite to make sure it does what you expect it to do. . This is much better than copy–pasting an implementation by hand and then trying to wrestle with the inevitable divergence as the codebase evolves..

Summary: Can you deploy UDFs with Macros in AWS? 

We believe it is possible to deliver most of the benefits of SQL-based UDFs using dbt macros. We’ll wrap up by calling out a few differences to be aware of.

  • Using macros means that there will not be a defined entity representing the logic within the data warehouse. However, it also means you can avoid having a dependency on that entity and its location.
  • A macro-based approach also can result in more verbose compiled SQL, lock comments can at least be used to indicate which macro is responsible for a chunk of SQL embedded in a query. 
  • Calling one macro from within another is more fiddly than calling one UDF from within another, for example ensuring that quoting works correctly. The resulting SQL can get complicated fast, too.
  • Unlike a UDF there’s nothing to control the scope of your macro, so if you’re not careful, operator precedence in the surrounding query can alter the result. A macro “add_one” with content “x + 1” will work fine in most contexts, but when used like this: “{{ add_one(100) }} * 2”, you get 102 instead of the expected 202. Wrap your macro SQL in brackets to ensure it’s a more independent unit.
  • In a UDF you can declare the types of parameters and results that are checked at runtime – so if you say it takes an INT64 and someone tries to run it on a string, boom – nice clear error. You can’t do that with a macro, so you may need to be more defensive, for example explicitly casting values to ensure they are the correct type and force an error if not.
  • You can provide any documentation you want for a dbt macro. Just [add a schema file].

A data science and machine learning (DSML) platform is a set of tools and services that enable data scientists and machine learning engineers to develop models rapidly, and streamline data analytics. It should also help users to promote solutions from ideation to being operational more easily, helping organisations to bring new services to market faster. 

DSML doesn’t only reduce time to market. It can also help development teams through improved reusability of features, governance and compliance support, cost optimisation and improved cross-team collaboration and knowledge sharing. 

That said, the complex nature of work and the constantly evolving technology landscape can make building a DSML platform challenging. In this article, we will discuss some of the lessons learned while building a data science and machine learning platform for a leading mining company. 

1. Identify and understand the user  

Your strategy and product vision must be built upon a good understanding of what value the platform will bring to the business, who will be using it and what level of expertise they will have. Are they seasoned data scientists, or citizen analysts? What problems will this platform solve, and how does that tie in with the organisation’s future plans?

When we understand the user and their needs, this helps us to make better technology choices. For example, no code / low code tools might be an excellent choice for democratising data science, but an engineering-oriented toolset might be more suitable for a small group of highly experienced engineers and analysts.  

2. Think big, start small 

It’s easy to try and please everyone or over-engineer a solution when there are a vast amount of tools available, and broad expectations from the product team. The problem with this is that the DSML platform could turn into a half-baked solution – and stay that way for a long time.  

Instead, take the advice of Barry O’Reilly, author of Unlearn. He advises teams to ‘think big, and start small’. In other words, have a strategy to build a world-class, comprehensive solution but define a roadmap and start with a small use case that delivers a big difference to the organisation. When building a DSML platform, you might want to consider whether this could be batch analytics, online inference, streaming analytics or something else. Again, understanding the customer’s needs is critical. 

3. Get the right team 

It’s essential to have a cross-functional team covering all aspects of delivery to build a data science and machine learning platform quickly. Depending on the solution chosen, your team could include machine learning engineers, data engineers, cloud engineers, developers and data scientists to support users. It isn’t only about skills, but also privileges. In our case, the single most delayed feature was related to integration with a central Data Lake. Our data engineer did not have sufficient permissions to access it which slowed down debugging and the introduction of changes. Tackling such dependencies through embedding the right people in the team or building relationships early pays off in a faster delivery.

4. The right data science projects

A DSML platform will have constraints related to its architecture and technology used. It’s vital to analyse these constraints, and share them with users. A good example here might be the support for Python and its ecosystem of libraries. A data scientist can easily use any of the available library versions in a local environment but that might not be necessarily the case on the shared platform.

Projects that are in an advanced phase of development or already developed can be particularly tricky to migrate. They can use legacy tools that are not supported on the platform. Moreover, the way the legacy models were developed might make migration very expensive and in some cases, the end model might generate different predictions than the original one.

5. Enable growth

The number of platform users will usually increase over time so it’s important to ensure that the effort for day-to-day support from the platform team is manageable. 

Deliver self-service through the automation of procedures and privilege delegation to ensure that product teams can handle BAU tasks without being dependent on platform engineers.

The second big enabler for growing product teams is ensuring they know how to use the platform effectively. Learning at scale requires high-quality training materials for onboarding and building a community where users learn from each other.

The time savings can be used for adding more features but more importantly for engaging in meaningful conversations with users about how the platform should evolve. In our case, the Hassle Maps concept proved to be particularly useful here.

6. Prioritise service availability

Keeping services available is critical to productivity and embedding trust in the platform. Machine Learning can be very resource-demanding. It’s crucial to monitor the cloud infrastructure and react to the changes in demand for CPU, memory, storage but also performance (API response time).

In the event of a production incident, it is invaluable to have a defined incident management process. The Google SRE book provides a great set of practices to ensure the high reliability of the service. In our case, introducing a triaging process with clear responsibilities together with blameless post-mortems resulted in 24 times faster MTTR (Mean Time To Recovery).

7. Share successes

Finally, don’t forget to celebrate and share your successes. Gather stories about how the product teams achieved their goals by utilising the platform. Couple them with metrics to show specific platform outcomes. These could be quantitative (for example active monthly users, time to market, number of API calls) or qualitative (NPS, user surveys). Share these successes with a wider audience through internal communication channels but also on sprint reviews and system demos.

Data governance is a challenge for many organisations, especially when it comes to knowing what data is available and where it is stored.

Typically, we would implement a data catalogue to address this challenge. But getting the most out of these technologies is about more than just installing a tool. In this post, we want to share how applying the tools and techniques of a product manager can help to address data governance problems.

Typical data management challenges

We often work with organisations that have built up large and potentially very useful data estate. This type of organisation commonly experiences data management issues, including:

  High turnaround time to discover information around business and data assets

  Data analysts had low confidence in data due to limited shared context

  Lack of confidence in data sets because of a lack of lineage

  Siloed information leading to effort duplication

These issues cause enormous frustration. Business managers can’t see if projects were successful because it is hard to locate data, while data analysts are unsure if a given data set is trustworthy. Every department has its own definition of what a customer is (and their own data sets).

What does product thinking look like? 

As product managers, we would apply a classic product-based approach to shape our response to these challenges. This is: discover -> prototype-> implement approach

Discover

Conducting user research helps to identify challenges the organisation has with information management, and can include workshops and individual meetings. These meetings should allow you to identify personas, pain points, frustrations and where improvements can be made. It’s also an opportunity to learn about strategic technical drivers, which helps to identify the most appropriate tools and technology. 

By the end of the discovery stage, we had defined the bigger picture and associated vision, goals and nets. We also had a defined set of success measures, which were: 

  • Time saved by data analysts and other functions (either as a result of improved access  or because it was quicker to find the right data asset.)
  • Employee Empowerment  – does a data catalogue improve employee knowledge of available data and does it make them  feel more empowered?
  • Increased speed to onboard new employees. Does a data catalogue improve how quickly you can onboard someone onto an analyst team? 
  • Reduced Infrastructure and related costs – does a catalogue enable people to find existing data sets or reports and does this lead to reduced infrastructure costs?
Prototype 

In a standard product approach, we will prototype the product to assess how people use it and evaluate whether benefits are likely to be met.  Some organisations might not be  convinced about the need for a prototype, but it’s essential to developing new products and introducing new services, and can make a critical contribution to success. 

If it is important to get feedback on the catalogue quickly, it’s not necessary to implement a prototype of the  whole service. Whilst automating metadata entry is a big accelerator for data catalogues, for the  prototype we use handcrafted templates so that we can get feedback on the user experience quickly and understand what metadata was most relevant. 

Once a prototype is in place, having one to one sessions with users helps gather feedback on their use of the tool. We look at how users are being helped to do typical tasks along with a post-test questionnaire that measures acceptance, usability and user satisfaction with the data catalogue. Some of the important points that will come out of this kind of evaluation might include:  

  • The catalogue displays various metadata for data assets. Using the prototype we can assess which ones are most useful
  • Users can easily see where the data came from through a lineage function – this shows what the source is, and how close the asset is to the source. This really helps users assess their confidence in the data-set.
  • It also  shows a snapshot view  – a small sample of the data, users should find this helpful in understanding the content.

During the prototype phase we could also learn: 

  • What is the most popular tool? Search, or something else? 
  • Do users find it helpful to request access through a basket function that automates the work of manually finding out who owns the data, making contact with them, waiting for them to have the bandwidth to provide data access etc? 
  • How useful are crowdsourcing tools? In one recent project users liked that they could provide ratings and reviews of data sources. This helped other users find what they wanted.
  • What percentage of data sets are redundant, perhaps because they are duplicate or not useful?

This user feedback means that we are able to iterate on the prototype to improve the design. For example, in some instances, user feedback showed that users were  overwhelmed by the standard metadata. In this case, we created customised views based on their feedback. Additionally, new users struggled to understand how to use the tool, so we created an introduction session that walked  new users through how to use it. 

Finally, the initial catalogue organisation  was difficult to navigate so it was refined  based on user feedback. 

Implement 

The actual implementation of the production catalogue was undertaken by the catalogue vendor’s implementation partner. In this phase, the supplier continued to give direction and worked with the client to measure the success factors. 

We hope that this post has given some insight into some of the considerations when creating a data catalogue solution, and shows how product thinking isn’t just about user interfaces. It can also be applied to data to help organisations shape important data services.

As a technical architect at Equal Experts, I’m fortunate to work with clients at the start of their decision-making process for new, large-scale IT projects. One of the questions I’m often asked is: “Should we buy, or should we build?” Not surprisingly, I have views, and am always happy to share them.

Recently, a client gave me a novel take on this subject that I hadn’t considered before. He likened the costs associated with building software to making a chicken sandwich. Specifically, he told me the story of Andy George, who made a documentary series following his mission to make a chicken sandwich from scratch – and I do mean from scratch. He harvested wheat, ground flour, milked a cow, made cheese and butter, boiled sea water to make salt…you get the idea. At the end of his journey, he’d spent $1,500 and 6 months creating a single chicken sandwich, which he could have bought the same day from a local store, for less than $5.

It’s a fascinating story (you can watch the videos if you have time), but having considered it for a while, I don’t believe it’s a valid comparison.

We don’t need to make our own salt

Here’s the thing. When we build software, we don’t build it from scratch. We assemble all the base ingredients (common services, third-party libraries, databases, etc.), and then we build the important stuff: the business-specific functionality that makes your software different.

Returning to the chicken sandwich analogy, building custom software is like buying the ingredients (bread, mayo, chicken, salt, etc.) and then making the sandwich in the way that you want.

This is the gist of my argument. It’s rarely a question of build or buy; in most cases we build and buy.

If all you need is a cheeseburger, buy a cheeseburger

So, now I’m on a roll with the food analogies, let’s look at one of my own relating to buying commercial off-the-shelf (COTS) software.

When buying a COTS solution (for example, SAP or Salesforce), my view is that it’s like going to McDonald’s for a cheeseburger. The cheeseburger is fine (there’s nothing wrong with it), but you can’t customise it. At best, you might be able to ask for a different type of bun, or to leave out the sauce, but you can’t ask for Brie instead of cheddar, or to add a fried egg. If something isn’t part of the standard offering, then sorry, but you can’t have it.

This is perfectly good if all you want is a cheeseburger, you don’t need your cheeseburger to be different, and the cheeseburger in question fulfils your exact needs.

The same principle applies to software. If a COTS solution does what you need without customisation, and you don’t need it to differentiate your organisation from all the other organisations that are using it, then go the COTS route.

Of course, there are situations where this is the best approach – typically for internal, non-customer-facing systems (e.g. accounting systems).

A few tips for buying software

  • Try before you buy. No matter how slick a software company’s demo is, always try before you commit to anything. Even if the vendor has no official trial offering, negotiate one (six months, if you can).
  • Avoid long-term tie-ins (I recommend no more than one year).
  • Go for a module-based product and only buy what you need.
  • Don’t customise. Use the product out of the box and build additional features around it.
  • Ensure that the product follows your architectural standards and patterns (e.g. integration via RESTful APIs, event handling or scales on load).

Principles for when to build or buy

My key principles for the decision-making process:

  • Build what differentiates you. If something is core to your business and represents who/what you are, don’t rely on someone else to provide it.
  • Software developers can be very keen to develop everything from first principles! But if something isn’t core to your business and it’s not a differentiator, buy it.
  • Do what delivers value the fastest but maintain flexibility so you can make changes as requirements change/grow.
  • If you’re still not sure which way to go, build first, then test and reassess. This will clarify requirements and if it’s not what you want, you’ll have a better idea of what you need to buy.

What is a differentiator?

Gone are the days when software was something in the background that took care of the boring stuff and couldn’t be a differentiator. Now, software solutions are so often the thing that improves business processes, enhances customer experience and ultimately drives revenue generation.

So what qualifies as a differentiator?

  • If something is core to your business (i.e. important enough to be in your business plan or part of your business strategy), then it’s a differentiator.
  • If customers come to (or stay with) you because you do something that your competitors don’t – or you do it better – then it’s a differentiator.
  • If it’s a missing or poorly performing feature that might prompt customers to move to a competitor, then it’s a differentiator.
  • If something generates direct revenue, it’s a differentiator.
  • If the business is willing to change its processes to align with a software product, then it is probably not a differentiator.

Costs (and super powers)

One of the points I stress to clients is that we don’t build software or move to the cloud because it’s cheaper – often, it’s not. We do these things because they provide the freedom to be innovative, fast, and better than our competitors.

Of course, costs must be considered – but so too must value.

Total Cost of Ownership

During budgetary discussions, clients often focus on the cost of buying software licenses and support but neglect many items that impact Total Cost of Ownership (TCO).

I advise considering the future costs associated with:

  • Ongoing maintenance after the initial license period ends.
  • What would the cost be if you need to customise for a specific feature?
  • Upgrades of customised solutions. Once a standard product is customised, it’s often very tricky and thus costly to upgrade.
  • If you can’t make changes to your solution quickly and easily, you’ll be slow to market and risk missing opportunities.
  • Product switching. Committing to a large, upfront license fee can make switching solutions, in the future, prohibitive
  • Contract negotiations. Never underestimate the time and effort it takes to negotiate a contract with a major supplier. Once the lawyers get involved it can really stretch your delivery timelines.

True Value of Ownership

People are generally very familiar with return on investment (ROI) – invest £10m and get £12m back; it’s monetary and, therefore, very straightforward.

But value isn’t always reflected in direct revenue; it’s about other things – customer satisfaction, agility, innovation, etc. This is where building your own software can really come into its own.

In fact – not to be overly dramatic – but I’d argue that building software is a real-life super power. It’s where you can absolutely achieve far more value over buying a solution.

We don’t deliver software like we used to. Continuous Delivery makes the whole process fast and lean. So, when you choose to build:

  • Value is realised sooner. The bones of an idea can become fully-formed reality, incredibly fast – as fast as you want it to be. And of course, the sooner you get your solution to market, the sooner you’re realising value.
  • You only need to build what you need. Often, people buy huge, expensive systems for features they think they need, but actually never use.
  • You only need to build what is valuable.
  • Your only limit is imagination; you’ll never be constrained by someone else’s framework or roadmap.
  • You can build differentiators. If you can buy COTS software that does great things, so can your competitors – any potential differentiators are gone.
  • It’s your Intellectual Property (IP). If you’ve bought a solution and you ask the supplier for a customisation which adds value to their core product, it’s likely that they will offer it to other customers – you have no control over this.
  • Change is easy. As the context surrounding your business changes, your software follows suit.

Conclusions

In my conversations with clients, I completely understand when they say this:

“Of course, you’re going to say ‘build over buy’. You work at Equal Experts. That’s what you do.”

Yes, we’re in the business of building software, but we do this because we think the final solution will make your organisation unique, agile, successful and simply more effective.

Ultimately, this isn’t about diametrically opposing views. As I said way back at the start of this post, it’s not a question of whether to build or buy; almost invariably it’s how to build and buy.

When you consider what you need to build and where best to source standard functionality, it should always be in the context of maximising value – this is where greatness lies.