
12.09.2025: Blog Series Part 1

How We Successfully Scaled Our Marketing Data Pipelines with dlt


Christian Shackleton on September 11, 2025

This blog post series is intended for data engineers, analytics engineers, and technical marketers who are familiar with basic ETL concepts, Python, and tools like Airflow or dbt. You don’t need deep expertise in any one platform, but a working understanding of how API data ingestion, orchestration, and cloud data warehouses typically fit together will help you get the most from this post.


When you work with marketing data, you’re juggling a growing stack of APIs, shifting schemas, and non-trivial cost considerations. In one of our internal projects, we managed our ingestion with hundreds of custom Airflow DAGs (Directed Acyclic Graphs). Each was hand-written, often with logic duplicated across scripts, and reading through them was a journey through changing code styles, inconsistent naming conventions, and patchy or out-of-date documentation. The setup worked until it didn’t: maintenance costs kept rising, scaling or extending it was harder still, and onboarding new colleagues felt like a dive into a labyrinth.

We didn’t need to rebuild our extraction layer; the logic there was stable. What we needed was a better loading layer for our data stack: something to move data to S3 and into Snowflake, and to apply standardization transformations. So we made a change. We adopted dlt (Data Load Tool), an open-source Python library that significantly simplified how we ingest, clean, and load data into our databases.

This is the start of the story of how we integrated dlt into our Marketing Data Stack, what we learned along the way, and how it helped us scale ingestion without scaling complexity. In later posts we will dive deeper into the challenges we encountered in adapting dlt to our setup and use case, so stick around.

The Problem with Legacy DAGs

Before dlt, our ingestion pipelines followed a pattern that will feel familiar to many data engineering teams: lots of custom Python scripts orchestrated by Airflow. Each new marketing API (Adobe, Meta, LinkedIn, etc.) and each account got its own DAG, its own schema definitions, and often its own quirks baked directly into the codebase.

Our original setup evolved organically: extract from an API, drop files into S3, load them into Snowflake with SQL statements. Each DAG repeated S3 staging logic, table creation statements, and sometimes transformation code.

This led to the following challenges:

  • Duplicated and overlapping DAGs cluttered Airflow, with inconsistent naming and metadata.
  • Hardcoded schemas and brittle transformations made updates error-prone, often requiring reruns of entire DAGs.
  • Inconsistent logging, retries, and normalization across pipelines created operational overhead.
  • Slow onboarding for new members who faced copy-paste reuse, legacy assumptions, and scattered schema definitions.
  • Incremental loads required manual handling, adding complexity and maintenance burden.

As the number of datasets grew, so did the operational overhead, and the consequences of this pattern became increasingly challenging:

  • Onboarding a new API source took days, even for similar JSON-based endpoints.
  • Maintaining schemas across DAGs became error-prone, especially when vendors updated their fields or nested structures.
  • Debugging failures meant tracing through individual scripts that were all slightly different in how they handled edge cases, retries, or logging.

Over time, the inefficiencies compounded. More time was spent maintaining pipelines than building new capabilities. Technical debt grew, documentation fell behind, and it became harder to bring new team members up to speed. Even though we had standardized some patterns, the system lacked true reusability and abstraction.

We didn’t just want a new way to move data; we wanted a framework that could:

  • Handle schema drift gracefully,
  • Reduce boilerplate code,
  • Let us configure, not re-code, pipelines,
  • And scale with the team as new data needs came online.

That’s where dlt came in.

Why We Chose dlt

dlt checked several key boxes for us right out of the gate:

  • It’s open source and Python-native, which meant we could plug it directly into our existing stack and workflows without a major rewrite.
  • It handles schema inference, evolution, and normalization out of the box, including nested JSON, a common challenge with marketing APIs.
  • It separates ingestion logic from orchestration, letting us focus on what the pipeline does, not how it’s scheduled.
  • It integrates cleanly with Snowflake, and supports S3 as a staging layer, which is ideal for our existing AWS-based architecture.
  • It’s configuration-driven, with support for TOML and environment-based settings, allowing us to manage source-specific logic without duplicating code.
  • It comes with built-in support for lineage, notifications, retries, and incremental loads, reducing the need for custom logic in every pipeline.

dlt struck the right balance between control and convenience.

Another important factor was dlt’s flexibility in deployment. Since it’s just Python, we could deploy it inside Airflow using the TaskFlow API. This meant:

  • No new infrastructure required.
  • Minimal developer re-training.
  • Easy version control and promotion via our existing CI/CD workflows.

We also appreciated the active open source community behind dlt. We were able to get help quickly on their Slack channel, and since we started using dlt, the documentation has been expanded considerably.

How We Implemented dlt

Once we decided to adopt dlt, we focused on designing an architecture that balanced modularity, reusability, and scalability, while staying grounded in tools our team already knew: Airflow, S3, and Snowflake.

We redesigned our pipelines around a clear separation of concerns:

  • Extraction handled via custom Python scripts and API calls
  • Staging handled via Amazon S3
  • Loading & light transformations handled via dlt
  • Core transformations managed in dbt Cloud, where we build staging, intermediate, and mart models
  • Handling of secrets via AWS Parameter Store

The diagram below shows how the pieces fit together:

Data Pipeline Flow with dlt

Key Implementation Details for dlt

  • Data extraction: Python scripts and dlt sources fetch raw data and drop CSV/JSON into a consistent S3 folder structure.
  • S3 staging: dlt reads from these S3 buckets, standardizes formats, and prepares data for Snowflake ingestion.
  • dlt ingestion: dlt converts files into Parquet, handles schema drift automatically, and loads data into Snowflake’s raw layer.
  • Light transformations: dlt applies only minimal renaming, type casting, and standardization in order to avoid complex transformations and keep pipelines simple.
  • dbt Cloud for transformations: All heavy lifting (joins, data quality checks, marts) is handled downstream in dbt, keeping business logic centralized.
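
As a sketch of the “consistent S3 folder structure” mentioned above, a small helper can make every pipeline stage files the same way. The `raw/source/account/date` layout below is a hypothetical example, not our exact production convention:

```python
from datetime import date


def build_staging_key(source: str, account_id: str, run_date: date, filename: str) -> str:
    """Build a deterministic S3 key so all pipelines stage files identically."""
    return f"raw/{source}/{account_id}/{run_date.isoformat()}/{filename}"


# Example: a Meta Ads export staged for ingestion
key = build_staging_key("meta_ads", "act_1234", date(2025, 9, 1), "campaigns.json")
print(key)  # raw/meta_ads/act_1234/2025-09-01/campaigns.json
```

Keeping key construction in one function means the dlt ingestion side can rely on the same layout when listing files to load.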

Working with Secrets

Managing credentials was one of the trickier aspects of integrating dlt into our stack. dlt’s default approach relies on .toml configuration files for storing secrets, but with the number of marketing sources we work with, storing that configuration in AWS Parameter Store quickly ran into its character limits.

Instead of abandoning Parameter Store, we took a hybrid approach:

  • Credentials remain stored as individual secure variables in AWS Parameter Store.
  • At runtime, we dynamically generate the .toml files that dlt expects.

This allowed us to keep secrets centralized, secure, and consistent with the rest of our infrastructure while still making dlt happy.
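
A minimal sketch of this runtime generation, with a plain dict standing in for the Parameter Store lookup (in production the values would come from something like boto3’s `ssm.get_parameter`; the `[sources.meta_ads]` section and key names are hypothetical):

```python
import pathlib


def render_secrets_toml(section: str, params: dict) -> str:
    """Render a dlt secrets.toml section from individual secret values."""
    lines = [f"[{section}]"]
    for key, value in params.items():
        lines.append(f'{key} = "{value}"')
    return "\n".join(lines) + "\n"


# Hypothetical credentials fetched at runtime from Parameter Store
secrets = {"client_id": "abc123", "client_secret": "s3cr3t"}
toml_text = render_secrets_toml("sources.meta_ads", secrets)

# Write to the default location where dlt resolves secrets
dlt_dir = pathlib.Path(".dlt")
dlt_dir.mkdir(exist_ok=True)
(dlt_dir / "secrets.toml").write_text(toml_text)
```

Because the file is regenerated on every run, rotated credentials in Parameter Store are picked up automatically without redeploying pipelines.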

Results of introducing dlt

Adopting dlt fundamentally changed how we manage marketing data pipelines. By moving from hand-written ingestion logic to a standardized, config-driven approach, we saw immediate benefits:

  • Faster onboarding: Adding a new API source now takes just a fraction of the time it did before.
  • Less duplication: Shared loading logic reduced code maintenance significantly.
  • Greater reliability: Automatic schema handling, retries, and logging mean fewer failed loads.

This shift freed up data engineering time to focus on business-critical transformations in dbt rather than spending resources debugging ingestion issues.

Advice for Teams Adopting dlt

If you’re considering dlt for your data pipelines, our biggest advice is: start simple! Begin with a small, low-risk pipeline to understand how configuration, schema handling, and retries work in your environment before scaling up.

A few tips from our experience:

  • Leverage staging: Using S3 as an intermediate layer gave us more control and observability over ingested data.
  • Keep transformations minimal in dlt: We found dlt works best when handling light transformations and letting dbt handle the heavy lifting.
  • Plan your configs early: Centralized configs are powerful. Plan how you’ll manage credentials, destinations, and environments upfront.
  • Expect some trial and error: Especially if you have multiple destinations or complex orchestration, anticipate needing a few tweaks.
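
On the configs point: dlt conventionally splits non-secret settings (config.toml) from credentials (secrets.toml) under a .dlt/ directory. A sketch of what a Snowflake destination block might look like; the field names follow dlt’s documented Snowflake credentials format, but treat the exact values as illustrative:

```toml
# .dlt/secrets.toml — one credentials block per destination or source
[destination.snowflake.credentials]
database = "ANALYTICS"
username = "LOADER"
password = "..."
host = "my_account"
warehouse = "LOAD_WH"
role = "LOADER_ROLE"
```

Agreeing on this layout early, per environment and per destination, saved us from re-plumbing configuration once the number of pipelines grew.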

For us, the learning curve was worth it. Once we had the basics in place, dlt allowed us to move faster, standardize ingestion, and spend more time on analytics instead of boilerplate code.

What’s Next?

This post focused on why we adopted dlt and how we integrated it into our data stack to streamline loading and light transformations. But as with any migration, we faced some unique challenges along the way.

In upcoming posts, we’ll dive deeper into topics like:

  • Managing Secrets at Scale: How we handle hundreds of credentials securely using AWS Parameter Store while keeping dlt happy with its .toml-based configuration model.
  • Multi-Pipeline Concurrency: Our approach to running dozens of dlt pipelines in parallel, resolving conflicts between configurations, and orchestrating two Snowflake destinations.
  • Schema Persistence & Evolution: Lessons learned about managing schema evolution (and devolution) with dlt.

If you’re exploring dlt or modernizing your own data pipelines, stay tuned for more on issues we encountered when implementing dlt, and our solutions.

And of course, we are happy to support you with an initial consultation or hands-on assistance if you need it!