At BeyondIRR, we ingest over 10GB of financial data daily — mutual fund NAVs, equity prices, bond yields, transaction histories, corporate actions. When this pipeline breaks, our clients can't see their portfolio values. That's bad. When it breaks at 3AM, it's worse.

Here's what we learned after two years of iterating on this pipeline, including the failures we'd rather forget.

The Architecture

We built a three-layer pipeline: ingestion (pulling from 40+ data sources), transformation (normalising, validating, enriching), and serving (real-time reads for the portfolio dashboard). Each layer is independently deployable and observable.

# Simplified structure
ingestion/
  ├── amfi_scraper.py       # Mutual fund NAVs
  ├── nse_feed.py           # Equity prices  
  ├── rbi_bonds.py          # Government bonds
  └── transaction_sync.py   # Client transactions

transform/
  ├── normaliser.py
  ├── validator.py          # Schema + range checks
  └── enricher.py           # FX rates, corporate actions

serve/
  └── portfolio_api.py      # FastAPI + Redis cache
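
The boundaries between these layers can be sketched as three small functions, each with a typed hand-off. This is a minimal illustration with hypothetical names (`NavRecord`, `ingest`, `transform`, `serve`), not our actual modules:

```python
from dataclasses import dataclass

# Illustrative sketch of the layer boundaries. Names and the record
# shape are hypothetical, not the real BeyondIRR internals.

@dataclass
class NavRecord:
    scheme_code: str
    nav: float

def ingest(raw_rows):
    """Ingestion layer: parse raw (code, nav) rows into typed records."""
    return [NavRecord(code, float(nav)) for code, nav in raw_rows]

def transform(records):
    """Transform layer: normalise and drop obviously invalid records."""
    return [r for r in records if r.nav > 0]

def serve(records):
    """Serving layer: shape validated records for the read API."""
    return {r.scheme_code: r.nav for r in records}

snapshot = serve(transform(ingest([("120503", "101.42"), ("118834", "-1")])))
```

Because each hand-off is a plain data structure, each layer can be deployed, tested, and monitored on its own.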

The Failures

The first version had a fatal flaw: it was a single Python script run on a cron job. When AMFI changed their CSV format mid-year, the whole thing silently produced NaN values for three days before anyone noticed. Clients were seeing their portfolios valued at ₹0.

Silent failures are the most dangerous kind. A pipeline that crashes loudly is better than one that silently produces wrong data.
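
One way to make a parser crash loudly is to pin the expected CSV layout and refuse anything else. This is a sketch, not our actual AMFI parser; the header names are assumed for illustration:

```python
import csv
import io

# Assumed column layout, for illustration only. If the upstream file
# changes shape, we want an exception, not NaN rows.
EXPECTED_HEADER = ["scheme_code", "nav", "date"]

def parse_nav_csv(text):
    """Parse a NAV CSV, raising immediately if the layout changes."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        # Fail loudly: a crash here pages someone; silent NaNs do not.
        raise ValueError(f"Unexpected NAV CSV header: {header}")
    return [
        {"scheme_code": code, "nav": float(nav), "date": date}
        for code, nav, date in reader
    ]
```

With this in place, a mid-year format change stops the pipeline at ingestion instead of propagating garbage downstream.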

Lesson learned: we added data quality gates at every layer boundary. If the NAV for a mutual fund deviates by more than 15% from the previous day's value, the pipeline stops and pages the on-call engineer.
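
A deviation gate of this kind is a few lines of code. The sketch below uses the 15% threshold from above; the exception type and function names are hypothetical:

```python
DEVIATION_THRESHOLD = 0.15  # 15% day-over-day, as described above

class DataQualityError(Exception):
    """Raised to halt the pipeline and trigger an on-call page."""

def check_nav_deviation(scheme_code, today_nav, yesterday_nav,
                        threshold=DEVIATION_THRESHOLD):
    """Gate at a layer boundary: stop if NAV moved too far overnight."""
    if yesterday_nav <= 0:
        raise DataQualityError(
            f"{scheme_code}: invalid previous NAV {yesterday_nav}"
        )
    deviation = abs(today_nav - yesterday_nav) / yesterday_nav
    if deviation > threshold:
        raise DataQualityError(
            f"{scheme_code}: NAV moved {deviation:.1%} day-over-day"
        )
    return today_nav
```

The important design choice is that the gate raises rather than logging and continuing: a halted pipeline with a page beats three days of silent ₹0 portfolios.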

What Actually Works

After these changes, we went from 4-5 pipeline incidents per month to fewer than one per quarter. The 3AM pages became rare enough that when one did happen, the team actually felt alert instead of exhausted.