At BeyondIRR, we ingest over 10GB of financial data daily — mutual fund NAVs, equity prices, bond yields, transaction histories, corporate actions. When this pipeline breaks, our clients can't see their portfolio values. That's bad. When it breaks at 3AM, it's worse.
Here's what we learned after two years of iterating on this pipeline, including the failures we'd rather forget.
## The Architecture
We built a three-layer pipeline: ingestion (pulling from 40+ data sources), transformation (normalising, validating, enriching), and serving (real-time reads for the portfolio dashboard). Each layer is independently deployable and observable.
```
# Simplified structure
ingestion/
├── amfi_scraper.py       # Mutual fund NAVs
├── nse_feed.py           # Equity prices
├── rbi_bonds.py          # Government bonds
└── transaction_sync.py   # Client transactions
transform/
├── normaliser.py
├── validator.py          # Schema + range checks
└── enricher.py           # FX rates, corporate actions
serve/
└── portfolio_api.py      # FastAPI + Redis cache
```
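The serving layer is the only part clients touch directly. Here's a minimal sketch of the read path; the endpoint shape, cache keys, TTL, and the `load_portfolio_from_store` helper are all assumptions for illustration, not our production code:

```python
# Illustrative read-through cache for portfolio_api.py; everything
# here is an assumption sketched for this post, not production code.
import json

import redis.asyncio as redis
from fastapi import FastAPI

app = FastAPI()
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


async def load_portfolio_from_store(client_id: str) -> dict:
    # Stand-in for the real read from the transformed data store.
    return {"client_id": client_id, "holdings": []}


@app.get("/portfolio/{client_id}")
async def get_portfolio(client_id: str) -> dict:
    key = f"portfolio:{client_id}"
    if (cached := await cache.get(key)) is not None:
        return json.loads(cached)  # hot path: serve straight from Redis
    # Cache miss: fall back to the store the transform layer writes to.
    portfolio = await load_portfolio_from_store(client_id)
    await cache.set(key, json.dumps(portfolio), ex=300)  # 5-minute TTL
    return portfolio
```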
## The Failures
The first version had a fatal flaw: it was a single Python script run on a cron job. When AMFI changed their CSV format mid-year, the whole thing silently produced NaN values for three days before anyone noticed. Clients were seeing their portfolios valued at ₹0.
Silent failures are the most dangerous kind. A pipeline that crashes loudly is better than one that silently produces wrong data.
Lesson learned: we added data quality gates at every layer boundary. If a fund's NAV deviates by more than 15% from the previous day's value, the pipeline halts and pages the on-call engineer.
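As a rough sketch of what such a gate looks like (the function names, pandas-based approach, and error type are illustrative, not our actual pipeline code):

```python
# Illustrative NAV deviation gate; names and structure are hypothetical.
import pandas as pd

DEVIATION_THRESHOLD = 0.15  # a >15% day-over-day move halts the run


class DataQualityError(Exception):
    """Raised when a batch fails a quality gate."""


def check_nav_deviation(today: pd.DataFrame, yesterday: pd.DataFrame) -> None:
    """Both frames are assumed to have 'scheme_code' and 'nav' columns."""
    merged = today.merge(yesterday, on="scheme_code", suffixes=("_new", "_old"))
    deviation = (merged["nav_new"] - merged["nav_old"]).abs() / merged["nav_old"]
    suspect = merged.loc[deviation > DEVIATION_THRESHOLD, "scheme_code"]
    if not suspect.empty:
        # Fail loudly before bad data reaches the serving layer;
        # the on-call page hangs off this exception.
        raise DataQualityError(
            f"{len(suspect)} schemes moved >15% day-over-day, "
            f"e.g. {suspect.head(3).tolist()}"
        )
```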
## What Actually Works
- Schema contracts between layers — treat each layer boundary like an API contract, and when it has to change, break it deliberately with a version bump, never by accident (first sketch below).
- Idempotent transforms — running the same data through the transform layer twice should produce the same result, so replays and backfills are safe (second sketch below).
- Dead letter queues — records that fail validation don't disappear; they go to a review queue for manual inspection and replay (third sketch below).
- Observability first — every pipeline run logs record count, null rate, processing time, and source latency (fourth sketch below). You can't debug what you can't see.
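First, the schema contract. A minimal sketch using Pydantic (assuming v2-style validators; the model and fields are illustrative, not our actual contract):

```python
# Hypothetical versioned contract for records crossing the
# ingestion -> transform boundary.
from datetime import date

from pydantic import BaseModel, Field, field_validator


class NavRecordV2(BaseModel):
    """Contract v2: the transform layer rejects anything that doesn't parse."""

    schema_version: int = Field(default=2, frozen=True)
    scheme_code: str
    nav_date: date
    nav: float

    @field_validator("nav")
    @classmethod
    def nav_must_be_positive(cls, v: float) -> float:
        # Range check at the boundary: a malformed upstream CSV fails
        # loudly here instead of flowing downstream as NaN or zero.
        if not v > 0:
            raise ValueError(f"NAV must be positive, got {v}")
        return v
```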
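Second, idempotency. The simplest version is a keyed upsert, so replaying a batch rewrites the same rows instead of duplicating them. Sketched here with SQLite; the table and column names are made up:

```python
# Idempotent write sketch: (scheme_code, nav_date) is the natural key,
# so running the same batch twice leaves the table unchanged.
import sqlite3


def ensure_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS navs ("
        "scheme_code TEXT, nav_date TEXT, nav REAL, "
        "UNIQUE (scheme_code, nav_date))"
    )


def upsert_navs(conn: sqlite3.Connection, rows: list[tuple[str, str, float]]) -> None:
    conn.executemany(
        """
        INSERT INTO navs (scheme_code, nav_date, nav)
        VALUES (?, ?, ?)
        ON CONFLICT (scheme_code, nav_date) DO UPDATE SET nav = excluded.nav
        """,
        rows,
    )
    conn.commit()
```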
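Third, the dead-letter path: a failed record gets parked with its error and raw payload so someone can inspect and replay it. This sketch reuses the hypothetical `NavRecordV2` from the contract sketch above, with an in-memory list standing in for a real queue or table:

```python
# Dead-letter sketch: validation failures are parked, not dropped.
import json
from datetime import datetime, timezone

dead_letters: list[dict] = []  # stand-in for a real queue or table


def process_batch(raw_records: list[dict]) -> list[NavRecordV2]:
    valid = []
    for raw in raw_records:
        try:
            valid.append(NavRecordV2(**raw))
        except Exception as exc:
            # Keep enough context to diagnose and replay later.
            dead_letters.append({
                "failed_at": datetime.now(timezone.utc).isoformat(),
                "error": str(exc),
                "payload": json.dumps(raw, default=str),
            })
    return valid
```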
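Fourth, observability. One way to emit the four metrics the post names is a structured log line per run; the logger setup and function shape here are assumptions:

```python
# Sketch of per-run pipeline metrics as structured logs; the field
# names follow the post, the rest is illustrative.
import json
import logging
import time

import pandas as pd

logger = logging.getLogger("pipeline")


def log_run_metrics(
    source: str, df: pd.DataFrame, started_at: float, source_latency_s: float
) -> None:
    logger.info(json.dumps({
        "source": source,
        "record_count": len(df),
        "null_rate": float(df.isna().mean().mean()),  # fraction of null cells
        "processing_time_s": round(time.monotonic() - started_at, 3),
        "source_latency_s": source_latency_s,
    }))
```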
After these changes, we went from 4-5 pipeline incidents per month to fewer than one per quarter. The 3AM pages became rare enough that when one did happen, the team actually felt alert instead of exhausted.