Running dlt at Scale: Pipelines & Secrets in MWAA
This post on running dlt at scale is part of our dlt blog series. It assumes familiarity with Airflow, AWS, and Snowflake. If you’re new to dlt, start with our first post, where we cover why we adopted it and how we integrated it into our data stack.
The Challenge: Running dlt at Scale
dlt works exceptionally well when you have a single environment or when pipelines share the same sources and destinations. In these scenarios, using a single .toml config or environment variables containing credentials, schema definitions, and destination settings is enough, and local runs are smooth and predictable.
But that’s often not reality:
- We run dozens of concurrent dlt pipelines.
- We integrate with dozens of marketing sources, many with multiple accounts each using unique credentials.
- We maintain multiple Snowflake destinations.
- We orchestrate pipelines via MWAA (Amazon Managed Workflows for Apache Airflow), with code stored in GitHub and deployed through S3.
In this setup, dlt’s recommended static .toml configuration and its suggested use of environment variables quickly became a scaling bottleneck. The one-size-fits-all approach could not handle the diversity of sources, destinations, and concurrent pipeline runs in our cloud orchestration environment.
Where Things Broke Down
Once we tried to scale dlt beyond a single environment, the limitations of a static .toml configuration became apparent:
1. Parameter Store Limits
AWS caps the size of stored parameters. Packing dozens of explicitly named credentials into one .toml file pushed us over those limits. Our temporary workaround was to share credentials across pipelines to reduce duplication. That decision triggered the next issue.
2. Destination Switching
We needed to switch between different Snowflake environments, but managing them inside one static config required tricks like reusing pipeline names, which conflicted with dlt’s state management and introduced complexity.
3. Pipeline Name Conflicts
dlt links pipeline state to the pipeline name. To run pipelines concurrently, we needed unique pipeline names, but the static config didn’t allow secrets to resolve properly when names didn’t match.
4. Concurrent Writes
Multiple pipelines sharing the same pipeline name caused race conditions: mid-run, configs would overwrite each other, and Pipeline A’s secrets sometimes bled into Pipeline B’s run, producing unpredictable, difficult-to-debug behavior.
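The naming conflicts in points 3 and 4 come down to one rule: no two concurrent runs may share a pipeline name. A minimal sketch of one way to derive collision-free names; the helper and the use of Airflow’s run_id are our own illustration, not a dlt API:

```python
import re

def unique_pipeline_name(source: str, destination: str, run_id: str) -> str:
    """Build a collision-free dlt pipeline name for one concurrent run.

    dlt keys its working-directory state by pipeline name, so two runs that
    share a name will trample each other's state; folding in a per-run id
    (e.g. Airflow's run_id) guarantees uniqueness.
    """
    # Pipeline names should be simple identifiers, so strip anything
    # outside [a-zA-Z0-9_] from the run id before using it.
    safe_run = re.sub(r"[^a-zA-Z0-9_]", "_", run_id)
    return f"{source}__{destination}__{safe_run}"
```

With names derived this way, each Airflow run gets its own dlt state directory and the race conditions described above cannot occur.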
Rethinking How We Handle Secrets
It became clear that static .toml files were not going to work in our environment. With dozens of pipelines running concurrently, each needing unique credentials and targeting different destinations, we needed a dynamic, per-pipeline approach.
Our solution: generate and inject secrets at runtime, directly into each dlt pipeline, rather than relying on global environment variables or pre-defined config files.
Key Benefits of This Solution
- Per-pipeline isolation: Each DAG/task fetches only the credentials it needs from AWS Parameter Store and assigns them to dlt.secrets.
- Safe concurrent execution: Since dlt.secrets lives in memory per Python process, there is no risk of one pipeline overwriting another’s credentials.
- Dynamic configuration: Pipelines can target different buckets, Snowflake environments, or API credentials without interfering with each other.
- No global state or disk writes: Unlike environment variables, secrets are controlled explicitly at runtime, which is safer in a multi-tenant execution environment, and they disappear once the Airflow task completes.
Note: dlt currently supports Google Cloud Secrets Manager natively, but not AWS Parameter Store. This may change in future releases, so please always check the latest dlt documentation if you plan to rely on other secret backends.
Even though environment variables are standard practice for containerized workflows, in concurrent multi-DAG MWAA setups where values differ per DAG, our approach is actually more secure and reliable. Temporary secrets in memory never touch disk, and AWS SSM Parameter Store remains the source of truth.
Also, since this is running inside an Airflow task, none of the credentials persist beyond the task’s execution. Once the task finishes, everything is cleared from memory.
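In practice, the per-pipeline injection described above boils down to building a flat mapping of dlt’s dotted secret keys from one pipeline’s credentials and assigning it to dlt.secrets in the task’s own process. A minimal sketch; the helper is hypothetical, while the destination.*.credentials key convention is dlt’s own:

```python
def build_dlt_secret_items(creds: dict, destination: str = "filesystem") -> dict:
    """Map a plain credentials dict onto dlt's dotted secret keys.

    Each Airflow task builds this mapping from only its own credentials,
    then assigns the items to dlt.secrets, so nothing is shared across
    concurrently running pipelines.
    """
    prefix = f"destination.{destination}.credentials"
    return {f"{prefix}.{field}": value for field, value in creds.items()}
```

Injecting the result is then a simple loop: `for key, value in build_dlt_secret_items(creds).items(): dlt.secrets[key] = value`.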
Our Approach for Running dlt at Scale
- Store all credentials securely in AWS Parameter Store.
- Use Airflow connections to manage Snowflake credentials.
- Fetch temporary AWS session tokens at runtime.
- Dynamically inject secrets directly into dlt using dlt.secrets.
- Generate configs on the fly per pipeline, no static .toml files needed.
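The first step, fetching a single pipeline’s credentials from Parameter Store, might look like the sketch below. The one-JSON-parameter-per-pipeline path scheme (`/dlt/<pipeline_key>/credentials`) is an assumption for illustration, not our exact layout:

```python
import json

def fetch_pipeline_credentials(ssm_client, pipeline_key: str) -> dict:
    """Fetch one pipeline's credentials from AWS SSM Parameter Store.

    Storing one SecureString JSON parameter per pipeline keeps each entry
    well under the SSM size limit, instead of packing everything into one
    oversized .toml blob.
    """
    response = ssm_client.get_parameter(
        Name=f"/dlt/{pipeline_key}/credentials",  # hypothetical naming scheme
        WithDecryption=True,  # decrypt the SecureString server-side
    )
    return json.loads(response["Parameter"]["Value"])
```

In production, `ssm_client` would be `boto3.client("ssm")`; passing it in makes the function easy to test with a stub.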
Here’s a simplified version of our secrets injection function, which is run just before creating the dlt pipeline:
def set_dlt_secrets(snowflake_conn_id: str, bucket_name: str) -> None:
    """
    Loads AWS and Snowflake credentials into dlt secrets from Airflow connection metadata.

    dlt's filesystem destination uses boto3's default credential chain, but manually
    setting dlt.secrets["destination.filesystem.credentials.*"] overrides that chain.
    If any credential is defined, the rest become mandatory, so we set them all explicitly.
    """
    import json
    import logging

    import boto3
    import dlt
    from airflow.hooks.base import BaseHook

    logger = logging.getLogger(__name__)

    # Fetch Snowflake connection from Airflow
    snowflake_conn = BaseHook.get_connection(snowflake_conn_id)
    snowflake_extra = json.loads(snowflake_conn.extra) if snowflake_conn.extra else {}

    # Retrieve temporary AWS credentials from the current session
    session = boto3.Session()
    credentials = session.get_credentials().get_frozen_credentials()
    logger.info("🔑 Successfully retrieved AWS temporary credentials")

    # Set dlt secrets for S3
    dlt.secrets["destination.filesystem.bucket_url"] = f"s3://{bucket_name}"
    dlt.secrets["destination.filesystem.credentials.aws_access_key_id"] = credentials.access_key
    dlt.secrets["destination.filesystem.credentials.aws_secret_access_key"] = credentials.secret_key
    dlt.secrets["destination.filesystem.credentials.aws_session_token"] = credentials.token
    dlt.secrets["destination.filesystem.credentials.aws_default_region"] = "eu-central-1"

    # Set dlt secrets for Snowflake external stage
    # (extra fields assume Airflow's standard Snowflake connection format)
    dlt.secrets["destination.snowflake.credentials.username"] = snowflake_conn.login
    dlt.secrets["destination.snowflake.credentials.password"] = snowflake_conn.password
    dlt.secrets["destination.snowflake.credentials.host"] = snowflake_extra.get("account")
    dlt.secrets["destination.snowflake.credentials.warehouse"] = snowflake_extra.get("warehouse")
    dlt.secrets["destination.snowflake.credentials.database"] = snowflake_extra.get("database")
    dlt.secrets["destination.snowflake.credentials.role"] = snowflake_extra.get("role")
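The docstring’s caveat, that once any explicit credential is set dlt expects the full set, can be guarded against before the pipeline starts. A minimal sketch, using a plain dict standing in for dlt.secrets:

```python
REQUIRED_FS_FIELDS = (
    "aws_access_key_id",
    "aws_secret_access_key",
    "aws_session_token",
    "aws_default_region",
)

def missing_filesystem_fields(secrets) -> list:
    """Return filesystem credential fields that have not been injected yet.

    Once any explicit credential is set, dlt expects the full set, so we
    check for gaps before starting the pipeline rather than failing mid-run.
    """
    prefix = "destination.filesystem.credentials."
    return [f for f in REQUIRED_FS_FIELDS if f"{prefix}{f}" not in secrets]
```

Running this check right after injection turns a confusing mid-run credential error into an immediate, readable failure.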
Outcome of This Approach
This approach unlocked full pipeline isolation and eliminated configuration conflicts. Every Airflow DAG working with dlt now gets its own fresh, ephemeral config, containing only the secrets it needs for that run.
This allowed us to:
- Give each pipeline a unique name.
- Keep secrets isolated per pipeline.
- Scale to as many concurrent pipelines as we wanted without collisions, letting MWAA (Amazon Managed Workflows for Apache Airflow) simply orchestrate independent tasks.
What Were Our Learnings from Running dlt at Scale?
Running dlt at scale forced us to rethink how we manage secrets and configs:
- dlt’s defaults favor simplicity, not scale: .toml files work for local setups but break under concurrency.
- Dynamic configuration is key: Per-pipeline secrets and configs eliminate collisions.
- AWS Parameter Store works well when you avoid bundling everything into one large file.
- Isolation is the foundation: Every pipeline should manage its own secrets and schema independently.
- MWAA introduces constraints that make static configs impractical, but dynamic generation solves them elegantly.
What’s Next
In our next post, we’ll cover another challenge we faced: schema persistence and evolution. Marketing data demands change constantly, and sometimes dlt’s helpful tricks become a hindrance when handling shifting schemas.
If you need help scaling your data projects in the meantime, we are of course always happy to assist you.