Can DBT Macros replace UDFs in AWS Athena?
Edit: You might be interested in our updated solution to this problem: UDF Nirvana? A new way to integrate UDFs into the dbt ecosystem
Last year, I wrote about how you can use dbt to deploy and test User Defined Functions, or UDFs. I recently expected to use this technique with AWS Athena and was surprised to find that Athena only supports UDFs that are backed by an AWS Lambda function. There’s no simple SQL-based CREATE FUNCTION statement here.
I find myself without a way to capture, name, test and document reusable logic in Athena. I wasn’t ready to admit defeat and accept the horror that is copy-and-pasting logic, so as I was using dbt anyway I wondered – could I use dbt macros instead?
Macros v UDFs
The main difference between macros and UDFs is where they reside.
A UDF will be deployed into your data warehouse. It’s something you can find in the information schema and call from any query that you write. Any documentation provided will also be available in the data warehouse.
In contrast, a macro only exists within dbt, and it does not exist as an entity in the data warehouse. When you call a macro into a query, it will be replaced by SQL at dbt compile-time. Macros can be used to achieve the “capture, name, test and document reusable logic” idea.
Solving problems with Macros v UDF
Let’s take the is_integer UDF example and solve that problem with a macro:
- Simple, one-line macro: is_integer
This macro example is very similar to the UDF version from the previous post. (I’m using SIMILAR TO because I’m testing with DuckDB, which doesn’t support BigQuery’s REGEXP_REPLACE syntax). This example also shows how to add a little inline documentation using standard SQL comments.
{% macro is_positive_int(column_name) -%}
({{ column_name }} SIMILAR TO '^[0-9]+$' /* dbt macro {{ this }} */)
{%- endmacro %}
If we modify this test case to work with the macro and DuckDB, it would look like this:
WITH examples AS (
SELECT 'foo' the_string, FALSE expected
UNION ALL SELECT '1111', TRUE
UNION ALL SELECT '5 not this', FALSE
UNION ALL SELECT 'still not this 5', FALSE
UNION ALL SELECT NULL, NULL
)
, test AS (
SELECT
*,
{{ is_positive_int('the_string') }} actual
FROM examples
)
SELECT * FROM test WHERE actual != expected
As you can see, this is very similar to how we would approach the problem with a UDF. If we then run a dbt_test, this happens:
1 of 2 START test is_positive_int ........................ [RUN]
1 of 2 PASS is_positive_int .............................. [PASS in 0.07s]
The compiled SQL uses a comment as rudimentary documentation, pointing to the origin of the implementation.
SELECT
*,
(the_string SIMILAR TO '^[0-9]+$' /* dbt macro "macro_testing"."main_dbt_test__audit"."is_positive_int" */) actual
FROM examples
A More Complex Example: Macro v UDF (rag_indicator)
So far we have been successful in using macros, but what happens when we’re doing something more complicated?
To illustrate, I’ll use a problem I’ve seen numerous times – how to fit a metric value to a red/amber/green traffic light status indicator.
In this example, our macro takes the parameters and sets threshold and error values for the metric:
{% macro rag_indicator(column_name, maximum_valid, green_lower_bound, amber_lower_bound, minimum_valid) -%}
(CASE
WHEN {{ column_name }} > {{ maximum_valid }} THEN 'ERROR_TOO_HIGH'
WHEN {{ column_name }} >= {{ green_lower_bound }} THEN 'GREEN'
WHEN {{ column_name }} >= {{ amber_lower_bound }} THEN 'AMBER'
WHEN {{ column_name }} >= {{ minimum_valid }} THEN 'RED'
WHEN {{ column_name }} < {{ minimum_valid }} THEN 'ERROR_TOO_LOW'
ELSE 'ERROR_INVALID'
END)
{%- endmacro %}
The results can be seen in our example:
WITH examples AS (
SELECT 6 the_value, 'ERROR_TOO_HIGH' expected
UNION ALL SELECT 5.0000000000001, 'ERROR_TOO_HIGH'
UNION ALL SELECT 5, 'GREEN'
UNION ALL SELECT 4, 'GREEN'
UNION ALL SELECT 3.9999999, 'AMBER'
UNION ALL SELECT 3, 'AMBER'
UNION ALL SELECT 2, 'AMBER'
UNION ALL SELECT 1, 'RED'
UNION ALL SELECT 0.5, 'RED'
UNION ALL SELECT 0, 'RED'
UNION ALL SELECT -0.000000000001, 'ERROR_TOO_LOW'
UNION ALL SELECT NULL, 'ERROR_INVALID'
)
, test AS (
SELECT
*,
{{ rag_indicator('the_value',
maximum_valid=5,
green_lower_bound=4,
amber_lower_bound=2,
minimum_valid=0) }} actual
FROM examples
)
SELECT * FROM test WHERE actual != expected
The tests are passing, and if we look at the compiled SQL:
SELECT
*,
(CASE
WHEN the_value > 5 THEN 'ERROR_TOO_HIGH'
WHEN the_value >= 4 THEN 'GREEN'
WHEN the_value >= 2 THEN 'AMBER'
WHEN the_value >= 0 THEN 'RED'
WHEN the_value < 0 THEN 'ERROR_TOO_LOW'
ELSE 'ERROR_INVALID'
END /* dbt macro "macro_testing"."main_dbt_test__audit"."rag_indicator" */) actual
FROM examples
This is a very simple implementation of a RAG indicator function, but a reusable implementation means you can start simple, keep track of what’s using it and build a test suite to make sure it does what you expect it to do. . This is much better than copy–pasting an implementation by hand and then trying to wrestle with the inevitable divergence as the codebase evolves..
Summary: Can you deploy UDFs with Macros in AWS?
We believe it is possible to deliver most of the benefits of SQL-based UDFs using dbt macros. We’ll wrap up by calling out a few differences to be aware of.
- Using macros means that there will not be a defined entity representing the logic within the data warehouse. However, it also means you can avoid having a dependency on that entity and its location.
- A macro-based approach also can result in more verbose compiled SQL, lock comments can at least be used to indicate which macro is responsible for a chunk of SQL embedded in a query.
- Calling one macro from within another is more fiddly than calling one UDF from within another, for example ensuring that quoting works correctly. The resulting SQL can get complicated fast, too.
- Unlike a UDF there’s nothing to control the scope of your macro, so if you’re not careful, operator precedence in the surrounding query can alter the result. A macro “add_one” with content “x + 1” will work fine in most contexts, but when used like this: “{{ add_one(100) }} * 2”, you get 102 instead of the expected 202. Wrap your macro SQL in brackets to ensure it’s a more independent unit.
- In a UDF you can declare the types of parameters and results that are checked at runtime – so if you say it takes an INT64 and someone tries to run it on a string, boom – nice clear error. You can’t do that with a macro, so you may need to be more defensive, for example explicitly casting values to ensure they are the correct type and force an error if not.
- You can provide any documentation you want for a dbt macro. Just [add a schema file].