A Realistic Guide to Why and How You Should Test Your ETL — July 5, 2019

Software Development vs. ETL Development

If you talk to your resident C# developer about automated testing you’ll get one of two responses:

  1. an explanation of how they incorporate testing into their development pipeline
  2. an embarrassed look as they explain why they don’t have time to write tests, but they know they should!

What’s notable is that both of these answers take for granted that automated testing is a very good thing. Most enterprise software development cultures enforce this view in both subtle and not-so-subtle ways. It’s an established cultural norm in their world. Automated testing = good.

This isn’t the case with ETL development. Casual conversations with other ETL developers reveal a common thought: that’s why we have < insert person > who does QA. Sometimes it’s the developer themselves who is performing the role; sometimes it’s a coworker working with end users.

It’s easy to attribute these differing attitudes to the technical differences between C# development and ETL development, but I think it’s more than that. I believe there is a difference of cultures.

There’s an established pipeline for traditional software developers. Most people going into CS know they want to come out the other side as developers. For ETL developers, our community is smaller. I don’t remember ever hearing the term ETL until I’d already been working in industry for a few years. No career counselor ever discussed it with me, that’s for sure.

Maybe it’s because there are just fewer ETL jobs. People accidentally end up in this role out of day-to-day necessity, sort of like the Accidental DBA. A brief look at these subreddit stats gives an indication of how specific and niche ETL is.

Subreddit Stats

Note the logarithmic scale. I know subreddit stats aren’t the final word on a community’s size, but they’re definitely an indicator.

I bring up our distinct cultures because they probably shouldn’t be so distinct. Things like automated testing are a baseline assumption in most software development. Why not in ETL? ETL is software development. As of mid-2019, this is the beginning of Wikipedia’s article on Software Development:

“Software development is the process of conceiving, specifying, designing, programming, documenting, testing, and bug fixing involved in creating and maintaining applications, frameworks, or other software components. Software development is a process of writing and maintaining the source code, but in a broader sense, it includes all that is involved between the conception of the desired software through to the final manifestation of the software, sometimes in a planned and structured process. Therefore, software development may include research, new development, prototyping, modification, reuse, re-engineering, maintenance, or any other activities that result in software products.”

Now just replace the term “Software” with “ETL”. Sounds pretty familiar.

With that out of the way, I propose that you, my ETL-developing reader, shamelessly steal the best cultural norms of traditional software development. Source control. Automated testing. Automated build and deployment pipelines. If not for your users’ benefit, then for your own sanity.

ETL development is, without any qualification, software development and should be held to the same standards.

Automated Testing Is A Pillar Of ETL

Your product to end users is data. The ETL is actually sort of incidental, just a means to an end. When end users call you, it’s not about the ETL, it’s about the data.

If you aren’t running regular (daily?) validations of your production ETL, you are accruing technical debt. You need to be running regular data validation tests against the databases you develop.
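To make that concrete, here is a minimal sketch of what a scheduled daily validation could look like in Python. The invoices table, the load_date column, and the sqlite3 connection are all hypothetical stand-ins; swap in your own schema and database driver.

```python
# daily_validation.py - a minimal sketch of a scheduled data validation.
# Table names, columns, and thresholds are illustrative assumptions,
# not a real schema.
import sqlite3  # stand-in for your warehouse driver (pyodbc, psycopg2, ...)

def check_yesterday_loaded(conn):
    """Return a list of failure messages; empty means all checks passed."""
    failures = []

    # Freshness: did yesterday's load actually land any rows?
    row_count = conn.execute(
        "SELECT COUNT(*) FROM invoices "
        "WHERE load_date = DATE('now', '-1 day')"
    ).fetchone()[0]
    if row_count == 0:
        failures.append("No invoice rows loaded for yesterday.")

    # Sanity: negative invoice amounts should never appear.
    bad_amounts = conn.execute(
        "SELECT COUNT(*) FROM invoices WHERE amount < 0"
    ).fetchone()[0]
    if bad_amounts > 0:
        failures.append(f"{bad_amounts} invoices have negative amounts.")

    return failures

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # hypothetical warehouse
    problems = check_yesterday_loaded(conn)
    for p in problems:
        print(f"VALIDATION FAILURE: {p}")  # wire this up to email/Slack
    raise SystemExit(1 if problems else 0)
```

Schedule it with cron, SQL Agent, or whatever orchestrator you already have; the non-zero exit code becomes your alert hook.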

Tests = Documentation

During the discovery and design phase of ETL and database development, a lot of domain knowledge is collected. You organize discussions with end users. You analyze the source system data. If you’re anything like me, you spend a lot of time talking with more senior developers nearby.

As that project begins to wrap up and users start using the data, you’re inevitably going to reprioritize your time onto whatever is next. Slowly, that domain knowledge will start to leak away. Fiddly little details about specific columns will fade into vague remembrances of “Yeah, there’s something up with column X on table Y, but I can’t remember exactly what.” It reminds me of the forgetting curves you see online when people talk about spaced repetition. Or even the classic comics of developers being interrupted.

ETL is complex and very, very often there is a devil in the details. Those details will absolutely fade with time. A big difference between the product of junior developers versus senior developers is how maintainable their code is in the long run. Two years after you’re done with a project, are you going to remember the specific trade-off pros and cons related to a design decision on a single column, out of possibly hundreds of columns?

Tests help us document those details for our future selves.
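To make that concrete, here is a sketch of a test whose whole job is to pin down one of those fiddly details. Everything in it (the orders table, the ship_date rule, even the meeting the docstring cites) is a hypothetical example of the kind of knowledge that otherwise evaporates:

```python
# test_orders_etl.py - pytest sketch; table names and the business rule
# are invented examples of domain knowledge worth writing down as a test.
import sqlite3
import pytest

@pytest.fixture
def conn():
    c = sqlite3.connect("warehouse.db")  # hypothetical warehouse
    yield c
    c.close()

def test_ship_date_null_only_for_open_orders(conn):
    """Domain rule (hypothetical, from a meeting with fulfillment):
    ship_date may be NULL only while status = 'OPEN'. Closed orders
    must always have one."""
    violations = conn.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE ship_date IS NULL AND status <> 'OPEN'"
    ).fetchone()[0]
    assert violations == 0, f"{violations} closed orders missing ship_date"
```

Two years from now, the docstring remembers the rule and the test enforces it, whether or not you do.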

Skin In The Game

Validation tests give us time to proactively correct bad data. If you’re reading this post, it’s almost guaranteed that you’ve received an end user email regarding bad data. When it’s relatively new development, it’s maybe a little embarrassing. If it’s old development, and it’s been generating bad data for days, weeks, or months, it’s a gut-wrenching experience. 🍀 Good luck if that ETL is related to your accounting department; it could be career-affecting.

Make an early warning system. I implore you. Routine, automated validation tests can be your canary in the coal mine. Compare these two scenarios:

  1. A manager in billing emails you: “Hey Developer - it looks like something is up with the receivables aging report. I just checked this month’s numbers, and there’s no way that’s all we have outstanding.”
  2. You email the manager in billing: “Hey Manager - our system is picking up a discrepancy in the aging report’s data for this month. We’re digging into it. I’ll get back to you by end of day with more details as we uncover them.”

Scenario 2 looks professional. It looks like you are in charge of your system. It projects a sense of ownership. In Scenario 1, you and your system are now under the spotlight. Because your end user is the one initiating the response, this becomes a to-do item for them. From their perspective, their scope of responsibilities has just increased. Not only do they have their own people and work to worry about, now they’ve got to check up on the data folks too? All because you’re not paying attention to your system? 😬 “The spotlight looks like a prison break” - Tom Waits

Specific, Useful Data Testing Cases

Automated testing is important. ✔, got it. So how do you do it? The unfortunate truth is that testing is both important and hard to do right. It’s not hard in a loose sense. At its most basic, data testing consists of writing some sort of assertion -> after the ETL runs, assert query X returns result set Y. That’s pretty easy, right? I promise that with a little background and not much time, you can write a sort-of-bad test. Like many people, at some point in my education I took a class that had a chapter on software testing. At the end of that class I walked out saying I could write tests for my software. While I technically could write tests, I didn’t know how to write good tests. Like many oversimplifications, the description above doesn’t get at the subtle engineering question of how to create reliable tests that are maintainable for the long term.
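In its barest form, that assert-query-X-returns-result-set-Y pattern might look like the sketch below. The query and the expected rows are placeholders, not a real schema:

```python
# assert_query.py - the "assert query X returns result set Y" pattern
# at its most basic. Query and expected rows are illustrative placeholders.
import sqlite3

EXPECTED = [("EAST", 4), ("WEST", 7)]  # hypothetical known-good result set

def run_test():
    conn = sqlite3.connect("warehouse.db")  # hypothetical database
    actual = conn.execute(
        "SELECT region, office_count FROM dim_region ORDER BY region"
    ).fetchall()
    conn.close()
    return actual == EXPECTED

if __name__ == "__main__":
    print("PASS" if run_test() else "FAIL")
```

Easy to write, and just as easy to write badly: a hard-coded EXPECTED goes stale the moment the business opens a new office, which is exactly the maintainability trap the rest of this section is about.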

There are two keys to writing good automated data tests:

  1. Understand in what ways your data sets are important to your end users.
  2. Understand Type I and Type II errors.

#1 is a bit self-explanatory. The key is to know what is worth alerting on at all. It takes developer time to write tests, and based on their error rates (discussed next) they incur costs over time. A report viewed daily by your CEO is a report worth having daily (or even hourly) automated tests for, even if they take significant investment to create and maintain. On the other hand, if you ETLed a table for a friend that they rarely use, it’s hard to justify spending time creating tests for it.

#2 is about the sensitivity of your tests. I think we’ve all had the experience of installing a monitoring tool that then goes crazy with alerts and notifications. That kind of alert overload becomes meaningless noise. We’re trying to find the sweet spot where an alert means there is something we need to take action on, and a lack of alerts means everything is fine. In other words, we’re minimizing both false positives and false negatives.
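One common way to hunt for that sweet spot, sketched below with invented numbers, is to alert on a tolerance band around a trailing average instead of on exact matches, so ordinary day-to-day variation doesn’t trip the alarm:

```python
# tolerance_check.py - sketch of tuning test sensitivity with a tolerance
# band. The 15% threshold and the row counts are invented for illustration.
def row_count_alert(today, trailing_avg, tolerance=0.15):
    """Return True if today's row count deviates from the trailing
    average by more than `tolerance` (as a fraction)."""
    if trailing_avg == 0:
        return today != 0  # any rows at all are news if we normally get none
    deviation = abs(today - trailing_avg) / trailing_avg
    return deviation > tolerance

# An exact-match test would page you over 9,950 vs. 10,000 rows; this won't.
assert row_count_alert(today=9_950, trailing_avg=10_000.0) is False
assert row_count_alert(today=4_000, trailing_avg=10_000.0) is True
```

Widening the band trades fewer Type I alarms for more Type II risk; where you set it depends on how much your data genuinely varies day to day.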

Hidden Bombs, WIP Validation

A fun alerting / error-rate conversation: when should you be alerted about a fire at your office?

A Reasonable Person: Immediately. Fire represents an immediate danger to my body, and I’ve only got one of those. Pretty fond of it. So yeah, fire alarms, the whole deal. I want to know.

A Good Test Writer: So what about candles? They’re technically fire, and you really shouldn’t have them at work due to insurance.

A Reasonable Person: Okay, yeah, no fire alarm for candles. But still, a manager should be alerted pretty quickly that someone is doing that.

A Good Test Writer: W.r.t. candle fires, that sounds reasonable. What about smokers? When they light a cigarette, they’re technically creating a small fire.

A Reasonable Person: 🤔 I guess that’s expected behavior. There’d be no reason to alert anyone when someone lights a cigarette.

The Confusion Matrix - Type I and Type II Errors

You must tailor your testing methodology to the current and future scale of the ETL that you manage. If ETL is a minor part of your job, a couple of tests with occasional Type I errors won’t negate the utility of testing. If you develop for ten different ETL projects, each with tens of tables, each with dozens of columns, the false positives will compound until your alerts become exactly the kind of noise you were trying to avoid.

Say each tested column has a 0.5% (half a percent) chance of a Type I error per day. One project, five tables, ten columns per table -> 50 tested columns. The probability of at least one false positive on a given day: 1 - 0.995^50 ≈ 22% (the rough estimate 50 × 0.005 = 25% slightly overstates it, since it double-counts days with multiple false alarms).

When unit testing source code, the only velocity of change is how quickly your devs can kick out more code. The nature of data testing is that the subject of the test, the data itself, usually changes on its own every single day.

[Interactive calculator: given a Type I error rate (%), number of ETL projects, tables per project, and columns per table, it computes the false-positive likelihood on a weekend, per month, per year, and on a national holiday.]
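The original post embedded an interactive calculator here; the sketch below reproduces the same arithmetic in Python so you can plug in your own numbers. The default rates mirror the worked example above and are illustrative, not measured.

```python
# fp_calculator.py - reproduces the post's false-positive arithmetic.
# All default inputs are illustrative; plug in your own scale.

def false_positive_probability(
    type1_rate=0.005,        # per-column daily Type I error rate (0.5%)
    projects=1,
    tables_per_project=5,
    columns_per_table=10,
    days=1,
):
    """Chance of at least one false positive over `days` days of testing."""
    columns = projects * tables_per_project * columns_per_table
    p_clean_day = (1 - type1_rate) ** columns  # no false alarms in one day
    return 1 - p_clean_day ** days

if __name__ == "__main__":
    print(f"Per day:   {false_positive_probability(days=1):.1%}")
    print(f"Per month: {false_positive_probability(days=30):.1%}")
    print(f"Per year:  {false_positive_probability(days=365):.1%}")
```

Even at a modest 0.5% per-column rate, a month of daily runs makes at least one false positive all but certain, which is why per-test error rates matter so much at scale.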