May 28, 2026

Building a Production Data Pipeline on PPDM with Airflow and DuckDB

By John Wassilak

We’ve covered what PPDM is, how to scope an implementation, why projects fail, and how to reconcile the data that lands in the model. The question that comes next is the one engineers actually have to answer: what does the pipeline look like?

This isn’t a tutorial. It’s a design document for a production-grade ingestion pipeline that takes upstream data from source systems, lands it in PPDM-aligned tables, and makes it available for analytics. The tools are Apache Airflow and DuckDB. The architecture is something you can actually run.

The architecture in one sentence

Source systems feed Airflow DAGs. DAGs transform and load data into PPDM-aligned PostgreSQL tables. PostgreSQL exports snapshots to Parquet. DuckDB queries the Parquet for analytics.

Everything else is details about how to implement each layer.

Why Airflow

Airflow is the right orchestration tool for PPDM ingestion for reasons that have less to do with features and more to do with operational fit.

Upstream data pipelines have complex dependencies. Your production volumes can’t be allocated until the well master is current. Your working interests can’t be applied until the OCC operator change records have been ingested. Your monthly roll-up can’t run until all the prior-month corrections have landed. Airflow’s DAG model is explicitly designed for this kind of dependency management, and it expresses those relationships in a way that’s inspectable, debuggable, and modifiable without touching the pipeline logic.

Airflow also handles the scheduling and retry semantics that upstream pipelines actually need. The OCC posts new data at irregular intervals. Some production reporting files arrive late. Some vendor system exports fail and need to be re-pulled. A scheduler that handles backfills, retries, and SLA monitoring is worth the operational overhead, especially for pipelines that need to be reliable every month, not just when someone is watching.

The scheduling patterns and DAG design covered in Airflow Dataset Scheduling and Decoupled Pipelines in Airflow apply directly to PPDM ingestion.

The PPDM entities you actually pipeline first

Not all of PPDM needs to be in your ingestion pipeline on day one. The same subsetting principle that applies to the model applies to the pipelines. Build the ones that solve your current problem.

For most operators, that means three entities.

Well master. The canonical list of wells in your operating area, normalized to consistent identifiers. API number is the primary key in most implementations, but you’ll also need to map well names, permit numbers, and operator-assigned IDs across source systems. The well master DAG pulls from OCC permit records, your internal well register, and any vendor production accounting system that has a well list. It reconciles duplicates, enforces a canonical API number format, and updates effective dates for changes.

Production volumes. Monthly production by well, by meter, or by lease depending on your reporting structure. This is the highest-frequency data in the stack. Production corrections for prior periods are common, so the pipeline needs to handle upserts, not just inserts. The OCC monthly production extract is one source; your SCADA or production accounting system is another. The DAG needs to know which one is authoritative for any given well, and that’s a business decision, not a technical one.

Completion reports. Formation intervals, perforation depths, stimulation details, initial potential tests. This changes less frequently (only when a new completion happens) but is highly important for reservoir and decline analysis. The OCC Form 1002A is the source for Oklahoma operators; parsing it is covered in OCC Data Ingestion: Automating What Most Companies Still Do by Hand.

These three entities power the core reconciliation that matters most. Land data, working interests, and facilities can follow once the core is running and trusted.

What the OCC ingestion DAG looks like

An OCC ingestion DAG has a few distinct phases.

The first phase is extraction. Pull the OCC data files that have been posted since the last successful run. For monthly production, that’s a known schedule. For completion reports and operator changes, it’s closer to continuous monitoring with a daily check. The extractor downloads files, validates that they’re complete (OCC occasionally posts truncated files), and drops them into a staging area.

The second phase is parsing and normalization. OCC data arrives in formats that require interpretation. API numbers may be 14-digit or 10-digit. Well names are inconsistent across filing types. Volume units are sometimes ambiguous. The normalization step resolves these, applies your canonical API format, and flags records that can’t be automatically resolved for human review.

The third phase is loading into PPDM-aligned tables. Normalized records land in staging tables first, where a set of quality checks runs before promotion to the production schema. Checks include: does this API number exist in the well master? Does this production volume fall within the expected range for this well’s history? Is this operator change cross-referenced against the current working interest register? Records that fail go to a review queue, not to the trash.

The fourth phase is downstream notification. When the production tables are updated, the dependent DAGs get triggered via Airflow’s dataset API. The allocation run, the reconciliation report, the monthly roll-up. No polling, no manual hand-off.

Where DuckDB fits

The PPDM tables in PostgreSQL are your record-of-truth for operational data. They’re not optimized for the analytical queries your engineers and analysts want to run.

DuckDB fills that gap. After each pipeline run, an export job materializes key PPDM entities as Parquet files: one file per month of production data, one file for the current well master, one file for the completion records. DuckDB reads these files directly for analytical queries.

The result is fast. DuckDB runs columnar queries over Parquet with no server required. For aggregations like production by formation, decline curve inputs, or field-level roll-ups, performance is significantly faster than the same queries against a transactional PostgreSQL table with thousands of individual row accesses.

The other advantage is isolation. Analysts querying DuckDB are not competing with the Airflow ingestion jobs writing to PostgreSQL. The analytical layer is a read-only snapshot. It’s always slightly behind the operational data (by design), but for monthly analysis that’s not a constraint.

The simplest first implementation is a scheduled export from PostgreSQL to Parquet (a Python task in Airflow using psycopg2 and pyarrow), landed to object storage, and a DuckDB instance that reads from that location. One DAG, a few gigabytes of storage cost, and analysts can run queries the same day the OCC file is posted.

If you’re evaluating the platform decision for PPDM itself, the companion post PPDM in the Cloud: SQL Server, PostgreSQL, or Something Else? covers how DuckDB fits into the broader architecture.

What a production run actually looks like

When this is working, the monthly cycle looks like this.

OCC posts new production data roughly between the 10th and 15th of the following month. The Airflow monitor task detects the new files, triggers the extraction DAG, and the ingestion pipeline runs. Parse, normalize, stage, check, promote. If the checks pass, production data is in the PPDM tables within an hour of the OCC file being posted. If a check fails, the on-call analyst gets a notification with the specific records that need review.

The allocation DAG runs next, triggered automatically by the production dataset sensor. Working interests are applied, volumes are split, and the monthly revenue estimate lands in the accounting interface.

The Parquet export runs after the allocation. Analysts can open DuckDB queries against current production data before the end of the day the OCC file was posted.

The reconciliation report compares this month’s numbers against the well master and the land system’s working interest register. Discrepancies are flagged with the same review queue as the ingestion checks. The analyst team closes out the flagged items, and the monthly production report is done.

That’s the target state. Getting there takes work, and some of the organizational pieces are in Why Your PPDM Implementation Failed (and How to Try Again) and Reconciling Land and Production Data. The technical pipeline described here is the foundation. It doesn’t run itself, and it doesn’t make the business decisions the pipeline depends on.

Common mistakes

A few patterns that consistently create problems in production.

Not separating staging from production tables. If you load directly into the PPDM schema without a staging step, a bad source file overwrites good data with no way to recover. Always land in staging first, validate, then promote.

Running corrections as deletes. Production corrections from OCC (and from your own operations) should be handled as new records with effective dates, not as overwrites. If you delete and reinsert, you lose the history of what was corrected and when. PPDM’s temporal modeling supports this natively; use it.

Skipping the quality checks. The instinct is to build the checks later, once the pipeline is working. The checks are how you find out the pipeline isn’t working. Build them in the first version, even if they’re simple.

Building one monolithic DAG. An Airflow DAG that does extraction, normalization, loading, allocation, and export in one sequential job is a nightmare to debug and modify. Break it into composable DAGs connected by dataset dependencies. The architecture should be easy to read and easy to repair.

We Were Just at PPDM 2026

We spent April 27 through 29 at the PPDM Energy Data Convention in Houston. If you’re building PPDM pipelines and want to talk through the architecture, the OCC-specific patterns, or anything else in this stack, we’d love to hear what you’re working on.