
Canvas Platform Data Ingestion

A production-style medallion pipeline (raw/curated/meta layers) that ingests Canvas-style JSONL into Azure SQL with rerun-safe watermarking, run auditing, and data quality checks.


Problem

Analytics teams need rerunnable ingestion of semi-structured data with auditing, quality checks, and minimal reprocessing.

Context and constraints

  • Source: Canvas-style JSONL
  • Target: Azure SQL
  • Architecture: raw → curated → meta
  • Requirement: rerun-safe incremental ingestion via watermarking
  • Operational needs: run auditing, DQ checks, schema-drift detection

Approach

Architecture

```mermaid
flowchart TB
  A[JSONL source] --> B[raw schema]
  B --> C[curated transforms]
  C --> D[analytics-ready tables]
  B --> M[meta.job_run]
  C --> Q[meta.dq_results]
  B --> S[meta.schema_drift]
  M --> R[watermark for incremental runs]
```
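
The `meta.job_run` auditing in the diagram can be sketched as a context manager that records start time, end time, and status around each job, so failed runs are visible alongside successful ones. Again sqlite3 stands in for Azure SQL, and the audit columns are illustrative assumptions.

```python
import sqlite3
import time
from contextlib import contextmanager

def setup_audit(conn):
    """Illustrative audit table; real schema would live in the meta layer."""
    conn.execute(
        "CREATE TABLE meta_job_run (job_name TEXT, started_at REAL, ended_at REAL, status TEXT)"
    )

@contextmanager
def audited_run(conn, job_name):
    """Record start/end/status of a job run so failures surface in meta."""
    cur = conn.execute(
        "INSERT INTO meta_job_run (job_name, started_at, status) VALUES (?, ?, 'running')",
        (job_name, time.time()),
    )
    run_id = cur.lastrowid
    try:
        yield run_id
    except Exception:
        # Mark the run failed before re-raising, so the audit trail is complete.
        conn.execute(
            "UPDATE meta_job_run SET status = 'failed', ended_at = ? WHERE rowid = ?",
            (time.time(), run_id),
        )
        conn.commit()
        raise
    conn.execute(
        "UPDATE meta_job_run SET status = 'success', ended_at = ? WHERE rowid = ?",
        (time.time(), run_id),
    )
    conn.commit()
```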

Implementation highlights

  • Watermark-driven incremental loads: each run reads only records newer than the last successful watermark recorded in meta.job_run
  • Run auditing: every job run is logged with status and row counts for operational visibility
  • Data quality checks with results persisted to meta.dq_results
  • Schema-drift detection on incoming JSONL, logged to meta.schema_drift
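
One way to sketch the DQ checks is as named SQL statements that each count violating rows, with outcomes persisted to the results table. The table and check names here (`meta_dq_results`, `curated_submissions`) are hypothetical, and sqlite3 again stands in for Azure SQL.

```python
import sqlite3

def setup_dq(conn):
    """Illustrative results table for persisted check outcomes."""
    conn.execute(
        "CREATE TABLE meta_dq_results (table_name TEXT, check_name TEXT, failed_rows INTEGER, passed INTEGER)"
    )

def run_dq_checks(conn, table, checks):
    """Run each check, persist the outcome, and return (name, passed) pairs.

    `checks` maps a check name to a SQL statement that counts violating rows,
    so a check passes exactly when its count is zero.
    """
    results = []
    for name, sql in checks.items():
        failed_rows = conn.execute(sql).fetchone()[0]
        passed = failed_rows == 0
        conn.execute(
            "INSERT INTO meta_dq_results (table_name, check_name, failed_rows, passed) VALUES (?, ?, ?, ?)",
            (table, name, failed_rows, int(passed)),
        )
        results.append((name, passed))
    conn.commit()
    return results
```

Expressing checks as plain SQL keeps them declarative and easy to review, and the persisted counts give the monitoring layer something to trend over time.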

Results and impact

Reduced reprocessing by limiting each load to new or changed records beyond the stored watermark; improved operational transparency through run auditing and data quality monitoring.
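
The schema-drift detection that feeds this transparency can be sketched as a key-set comparison between incoming JSONL records and the expected column set; anything new or missing would be logged to meta.schema_drift. The field names in the example are hypothetical.

```python
import json

def detect_schema_drift(jsonl_lines, expected_columns):
    """Compare keys seen in incoming JSONL against the expected column set.

    Returns fields that appeared upstream but are not yet modeled ("added")
    and expected fields that no longer arrive ("removed").
    """
    seen = set()
    for line in jsonl_lines:
        seen.update(json.loads(line).keys())
    return {
        "added": sorted(seen - expected_columns),    # new upstream fields
        "removed": sorted(expected_columns - seen),  # fields that disappeared
    }
```

Logging drift instead of failing the load lets ingestion continue while flagging upstream contract changes for review.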

Tech stack

Python, Azure SQL, JSONL ingestion, SQL transformations, Git/GitHub.

Links

What I'd improve next

Add alerting on DQ failures and contract tests to guard against upstream schema changes.