We Run Our SDLC Out of Git
We put our entire SDLC in git. Requirements, decisions, task assignments, everything. Then we cancelled standup. Nobody complained. OK, I complained, which is apparently how you get assigned the blog post about it.
A practical reference for data pipeline patterns: loading strategies; slowly changing dimensions; change data capture; Lambda, Kappa, and Medallion architectures; reliability fundamentals like idempotency and atomic swap; and orchestration patterns.
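The atomic-swap idea in that list is small enough to sketch: write the new output under a temporary name, then rename it over the old one in a single step, so a crashed or rerun job never leaves readers a half-written file. The JSONL format and the helper name here are illustrative, not taken from the reference.

```python
import json
import os
import tempfile

def write_atomically(path: str, records: list) -> None:
    """Write records as JSONL to a temp file, then atomically swap it into place."""
    # The temp file must live on the same filesystem as the target,
    # or the final rename stops being atomic.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")
        # Atomic replace: readers see the old file or the new one, never a partial write.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

Because a rerun rebuilds the whole artifact and swaps it in, the write is also idempotent: running the job twice leaves the same file as running it once.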
The hyperscale data platform pitch is compelling, but it was designed for a different customer. Here is an honest look at where open source tools like PostgreSQL, DuckDB, Airflow, and dbt outperform proprietary platforms for most Oklahoma organizations, and when the proprietary option is actually the right call.
Stop coupling DAGs by time or ExternalTaskSensor. Airflow's dataset scheduling lets you wire pipelines together through the data they produce and consume, so the right DAGs run at the right time without the fragility of cross-DAG timing assumptions.
Oklahoma energy companies are sitting on enormous amounts of data spread across systems that were never designed to talk to each other. PPDM gives you a standard. Data engineering makes it actually work.
We've seen a trend of small-to-midsize Oklahoma businesses outgrowing their data setup. Here's how to tell if it's time to bring in a real data engineer.
If your Airflow Variables, Connections, and secrets exist only in the UI or in someone's memory, you don't have a config strategy; you have a time bomb. Here's how to actually fix that.
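One common fix, sketched here rather than the post's full approach, is to define Connections and Variables as environment variables that live in version control or come from a secrets manager, using Airflow's AIRFLOW_CONN_* and AIRFLOW_VAR_* conventions. The connection id and values below are made up:

```shell
# Connections: AIRFLOW_CONN_<CONN_ID> holds a connection URI.
# The id "warehouse" and the credentials are illustrative; the password
# is expanded from a secret provided by the environment, never hardcoded.
export AIRFLOW_CONN_WAREHOUSE="postgres://etl:${WAREHOUSE_PW}@db.internal:5432/analytics"

# Variables: AIRFLOW_VAR_<KEY> is what Variable.get("batch_size") reads.
export AIRFLOW_VAR_BATCH_SIZE='5000'
```

Anything defined this way is reproducible on every scheduler and worker, and never exists only in one person's browser session.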
Learn how to build portable, testable data pipelines by containerizing your ETL logic and using Airflow purely as a scheduler, keeping your code independent from any specific orchestration tool.
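The pattern can be sketched as a plain CLI entrypoint: the container image owns the ETL logic, and Airflow (or cron, or a shell) merely invokes it. The flag names and the empty run() body below are hypothetical stand-ins, not the article's code:

```python
# Orchestrator-agnostic ETL entrypoint: note there are no Airflow imports.
import argparse
import sys

def run(source: str, target: str) -> int:
    # Extract from `source` and load into `target` (elided in this sketch).
    print(f"extracting from {source}, loading into {target}")
    return 0  # process exit code: 0 on success, nonzero on failure

def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Containerized ETL job")
    parser.add_argument("--source", required=True)
    parser.add_argument("--target", required=True)
    args = parser.parse_args(argv)
    return run(args.source, args.target)

if __name__ == "__main__":
    sys.exit(main())
```

An orchestrator then runs the image with something like `docker run etl-image --source api --target warehouse`; swapping Airflow for any other scheduler changes nothing inside the image, which is the portability the teaser describes.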
Stop fighting for inbound VPN access. Put your Airflow workers where the data lives and let them call home.
This guide takes an existing Meltano proof of concept and upgrades it to a real data source and target: public REST API endpoints as the extractor, and PostgreSQL as the loader.
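As a rough idea of what that pairing looks like in meltano.yml (the plugin names are common community variants and the settings are illustrative, not the guide's exact configuration):

```yaml
plugins:
  extractors:
    - name: tap-rest-api-msdk   # generic REST API extractor; illustrative choice
  loaders:
    - name: target-postgres
      config:
        host: localhost
        port: 5432
        database: warehouse
        user: meltano             # password supplied via environment, not committed
```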
Containerize a Meltano EL pipeline with Docker to get a reproducible, self-contained workflow that produces a JSONL artifact.