Testing and deploying UDFs with dbt

Edit: You might be interested in our updated solution to this problem: UDF Nirvana? A new way to integrate UDFs into the dbt ecosystem

Inspired by our previous post on unit testing dbt and discussion in the dbt forums, I looked into how to get dbt to help me test and deploy UDFs. Although dbt doesn’t directly support UDFs, it turns out it’s actually fairly straightforward to define, test and reference them in your models. I’ll be using BigQuery in this post, but I think the techniques will apply to other warehouses that support UDFs like Snowflake, Redshift and Spark, too.

What’s a UDF and what does it have to do with dbt?

A UDF is a User-Defined Function. Like any other type of function, they’re really helpful to capture and name chunks of logic that you can apply in queries, and I talked about them before, starting from Tip 4 in our Data Warehouse Tips post. dbt takes care of deploying the tables and views that you might use your UDFs in, so robustly referencing a function from your models is important. When I say ‘robustly’, I mean ensuring that the function exists and that you’re calling the function from the appropriate location – for example, avoiding hard-coding your production functions everywhere. There’s no native support for referencing a UDF like there is with `source` or `ref`, nor any native way to deploy a function in a project. This post explains how you can solve those problems.

The example

I have a BigQuery dataset that I use for my personal development work. This is referenced in my own dbt profile. When I run dbt commands on my machine, they take effect here.

There are test and production datasets that my CI system has profiles for. I cannot run ad-hoc dbt commands in these datasets. When I push my changes, the CI system will run any dbt commands necessary to deploy my updates and run tests. I’ll not be talking about that in this post, but it’s important that I’m referencing the UDF that is deployed in the same environment that I’m working in – otherwise I can’t iterate safely, relying on my tests.

The example UDF I’ll use is a simple predicate function, located in `my_schema`, from string to bool that returns true if and only if the string value can be interpreted as a positive integer:

CREATE OR REPLACE FUNCTION my_schema.is_positive_int(a_string STRING)

RETURNS BOOLEAN

AS (

  REGEXP_CONTAINS(a_string, r'^[0-9]+$')

);

Defining a UDF in dbt

We need to write a dbt “macro” – a Jinja template – that parameterises where the UDF should go. In `macros/is_positive_int_udf.sql`:

{% macro is_positive_int_udf() %}

CREATE OR REPLACE FUNCTION {{ target.schema }}.is_positive_int(a_string STRING)

RETURNS BOOLEAN

AS (

  REGEXP_CONTAINS(a_string, r'^[0-9]+$')

);

{% endmacro %}

Note the macro name matches the file and that `target.schema` is templated out.

To have dbt actually deploy this for us, we add a hook in our `dbt_project.yml`:

on-run-start:

- '{{ is_positive_int_udf() }}'

Now, when we `dbt run`, the UDF will be deployed.

11:11:52  Running 1 on-run-start hooks

11:11:52  1 of 1 START hook: my_project.on-run-start.0.................... [RUN]

11:11:52  1 of 1 OK hook: my_project.on-run-start.0....................... [OK in 0.49s]

Kind of yucky in the console output with just list indices to identify what’s what –  but not the end of the world.

Testing a UDF in dbt

Now our UDF is schema-aware and is being deployed by dbt, we can write tests for it with minimal ceremony. In `tests/is_positive_int.sql`:

WITH examples AS (

  SELECT 'foo' the_string, FALSE expected

  UNION ALL SELECT '1111', TRUE

  UNION ALL SELECT '5 not this', FALSE

  UNION ALL SELECT 'still not this 5', FALSE

  UNION ALL SELECT NULL, NULL

)


, test AS (

  SELECT

 *,

  {{ target.schema }}.is_positive_int(the_string) actual

  FROM examples

)



SELECT * FROM test WHERE TO_JSON_STRING(actual) != TO_JSON_STRING(expected)

Running `dbt test` runs the pre-hooks and then the tests, perfect.

...

11:25:22  Running 1 on-run-start hooks

11:25:22  1 of 1 START hook: my_project.on-run-start.0.................... [RUN]

11:25:23  1 of 1 OK hook: my_project.on-run-start.0....................... [OK in 0.58s]

...

11:25:24  1 of 1 START test is_positive_int............................... [RUN]

11:25:25  1 of 1 PASS is_positive_int..................................... [PASS in 1.02s]



11:25:26  Finished running 1 tests, 1 hooks in 1.46s.

Using the UDF in a Model

We’ve actually already seen how to do this, in the test.

In `models/my_model.sql`:

SELECT {{ target.schema }}.is_positive_int(maybe_positive_int_column)

FROM {{ ref(...) }}

Summary

There you have it. In summary, to incorporate UDFs neatly into your dbt models:

  • Write a macro for your UDF, templating the target schema
  • Add a call to macro to `on-run-start` in your `dbt_project.yaml`
  • Write some tests, of course
  • Use the UDF in your models, templating the target schema