Marketing Data Pipelines with Airflow: Idempotence & Reusability
So far in this Airflow blog series, we’ve covered modular task design, resilience with retries and alerting, and workflow patterns for clarity and scale. But there’s still a hidden danger in production pipelines: What happens when you rerun them?
If you’ve ever retried an Airflow DAG and ended up with duplicate rows in your data warehouse, you’ve experienced the pain of non-idempotent tasks. In marketing pipelines – with flaky APIs, delayed files, and retries happening all the time – this is a constant risk.
In this post, we’ll cover how to design Marketing Data pipelines with Airflow that are idempotent (safe to rerun) and reusable across contexts.
What Does Idempotency in Marketing Pipelines with Airflow Mean?
An idempotent task can be run multiple times without producing different results. Think of it like hitting “save” in a text editor: no matter how many times you click, your document doesn’t get duplicated.
✅ With idempotency: A retry fetches the same data once and loads it safely.
❌ Without idempotency: A retry fetches the same data again and appends duplicates into your warehouse.
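To make the contrast concrete, here is a minimal, warehouse-free sketch (plain Python structures standing in for a table; the record shape is illustrative):

```python
# A dict stands in for a warehouse table keyed by record ID. Running the
# keyed load twice leaves the "table" unchanged; the naive append-style
# load duplicates rows on every retry.

def append_load(table: list, records: list) -> None:
    # Non-idempotent: blindly appends, so retries create duplicates.
    table.extend(records)

def keyed_load(table: dict, records: list) -> None:
    # Idempotent: writes by primary key, so retries overwrite in place.
    for rec in records:
        table[rec["id"]] = rec

batch = [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 5}]

naive, safe = [], {}
for _ in range(2):  # simulate a retry running the load a second time
    append_load(naive, batch)
    keyed_load(safe, batch)

print(len(naive))  # 4 rows -- duplicated
print(len(safe))   # 2 rows -- same result as a single run
```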
Practical Techniques for Idempotency
1. Use Batch IDs
Each DAG run should have a unique identifier (commonly the execution date). Pass this batch_id through every task.
- Include the batch_id in S3 paths so retries overwrite the same files instead of creating new ones.
- Pass it into sensors to ensure the right files trigger downstream steps.
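As a minimal sketch of the first point: derive the batch_id from the run's logical date (in an Airflow template this would be {{ ds }}) and embed it in the S3 key, so a retry of the same run overwrites the same object instead of creating a new one. The path layout here is illustrative:

```python
from datetime import date

def s3_key(prefix: str, batch_id: str) -> str:
    # Deterministic: same batch_id -> same key, on the first run
    # and on every retry.
    return f"{prefix}/batch_id={batch_id}/ads.json"

# In a real DAG, batch_id would come from the run's logical date.
batch_id = date(2024, 5, 1).isoformat()
print(s3_key("marketing/linkedin", batch_id))
# -> marketing/linkedin/batch_id=2024-05-01/ads.json
```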
XCom vs Variable
In the example above I used Variable.set and Variable.get to pass values between tasks, but there’s actually a better-fitting alternative: XComs. With xcom_push and xcom_pull, Airflow lets you share data that only lives for the duration of a single DAG run.
- Variable.set saves a value in Airflow’s metadata database so it can be reused across DAG runs.
- Variable.get retrieves that saved value anywhere in your DAGs.
- xcom_push shares a small piece of data between tasks within the same DAG run.
- xcom_pull retrieves that value from another task in the same run.
That makes them perfect for things like batch IDs, file paths, or timestamps that change from run to run.
Variables, on the other hand, stick around across runs and are better suited for constants or configuration values such as API keys or client IDs.
If you use Variables for temporary data, you risk overwriting state or breaking idempotency when retries happen. XComs keep that short-lived information where it belongs, inside the run, and make your pipelines cleaner and safer to rerun.
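The pattern can be sketched as follows. In Airflow's TaskFlow API, a task's return value is pushed as an XCom and pulled by any downstream task that receives it as an argument; the functions below show that data flow as plain Python (in a real DAG each would carry an @task decorator, and the names are hypothetical):

```python
def make_batch_id(logical_date: str) -> str:
    # Return value would be pushed as an XCom: scoped to this DAG run only.
    return f"linkedin_{logical_date}"

def build_s3_path(batch_id: str) -> str:
    # Receives the upstream task's XCom, never a global Variable,
    # so a retry of this run sees exactly this run's batch_id.
    return f"s3://marketing-raw/batch_id={batch_id}/ads.json"

path = build_s3_path(make_batch_id("2024-05-01"))
print(path)  # -> s3://marketing-raw/batch_id=linkedin_2024-05-01/ads.json
```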
2. Fetch by Last Update Timestamp
When pulling from APIs, don’t fetch everything. Instead, fetch only the records created or modified after the latest last_modified_at timestamp in your warehouse. This approach reduces API load, prevents duplicates, and keeps the pipeline moving forward.
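A minimal sketch of this high-water-mark approach, with an illustrative record shape (in practice the filter would usually be a query parameter on the API call rather than a local filter):

```python
from datetime import datetime

def incremental_fetch(all_records: list, watermark: datetime) -> list:
    # Keep only records modified after the warehouse's high-water mark.
    return [r for r in all_records if r["last_modified_at"] > watermark]

records = [
    {"id": 1, "last_modified_at": datetime(2024, 5, 1)},
    {"id": 2, "last_modified_at": datetime(2024, 5, 3)},
]
# watermark = MAX(last_modified_at) already loaded in the warehouse
watermark = datetime(2024, 5, 2)
print([r["id"] for r in incremental_fetch(records, watermark)])  # -> [2]
```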
3. Incremental Loads Instead of Full Copies
Thanks to technique #2, you don’t need to copy everything on each run. Instead, use incremental loading strategies:
- With COPY INTO, you can append blindly (as we do in our LinkedIn DAG via the CopyFromExternalStageToSnowflakeOperator)
- With Snowflake’s MERGE statements, you can handle changes more intelligently:
- New records are inserted.
- Existing records are updated.
- Duplicates are avoided.
This ensures your data stays fresh without ballooning into messy duplicates.
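As an illustration, a MERGE keyed on the record's natural key might look like the statement below. Table and column names are hypothetical; in Airflow, a string like this could be handed to a Snowflake operator for execution:

```python
# Hypothetical Snowflake MERGE: upsert a staged batch of ad stats.
# Matched rows are updated, new rows are inserted, and rerunning the
# same batch produces no duplicates.
MERGE_SQL = """
MERGE INTO ads_stats AS t
USING ads_stats_staging AS s
    ON t.ad_id = s.ad_id AND t.stat_date = s.stat_date
WHEN MATCHED THEN UPDATE SET
    t.clicks = s.clicks,
    t.impressions = s.impressions,
    t.last_modified_at = s.last_modified_at
WHEN NOT MATCHED THEN INSERT
    (ad_id, stat_date, clicks, impressions, last_modified_at)
    VALUES (s.ad_id, s.stat_date, s.clicks, s.impressions, s.last_modified_at);
"""
```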
Reusability Through Parameterization
Idempotency pairs naturally with reusability. By parameterizing your DAGs with values like batch_id, client_id, or campaign_id, you can:
- Run the same DAG structure across multiple accounts.
- Standardize logic while isolating data.
- Scale to new clients without writing new DAGs.
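One way to sketch this is a small "factory" that stamps out a per-client configuration from the same template, so onboarding a new client means new parameters, not new code. The config shape and client IDs here are hypothetical:

```python
def make_pipeline_config(client_id: str, batch_id: str) -> dict:
    # Same DAG structure for every client; the parameters isolate the data.
    return {
        "dag_id": f"linkedin_ads_{client_id}",
        "s3_prefix": f"marketing/{client_id}/batch_id={batch_id}",
        "client_id": client_id,
    }

configs = [make_pipeline_config(c, "2024-05-01") for c in ("acme", "globex")]
print([c["dag_id"] for c in configs])
# -> ['linkedin_ads_acme', 'linkedin_ads_globex']
```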
Without idempotency, every retry is a gamble with your data warehouse integrity. With it, retries are safe and predictable.
Conclusion: From raw data to reliable answers
When data flows smoothly, your Sales & Marketing teams can make better decisions faster. Robust, idempotent marketing pipelines built with Airflow are the foundation for that.
We work with you to set up Airflow and Snowflake workflows, establish a measurement framework, and provide meaningful reports and dashboards. Everything is pragmatic, documented, and directly applicable.