Canvas Platform Data Ingestion
A production-style medallion pipeline (raw/curated/meta) that ingests Canvas-style JSONL into Azure SQL with rerun-safe watermarking, run auditing, and data quality checks.
Problem
Analytics teams need rerunnable ingestion of semi-structured data with auditing, quality checks, and minimal reprocessing.
Context and constraints
- Source: Canvas-style JSONL
- Target: Azure SQL
- Architecture: raw → curated → meta
- Requirement: rerun-safe incremental ingestion via watermarking
- Operational needs: run auditing, DQ checks, schema-drift detection
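"Canvas-style JSONL" here means one JSON object per line. A minimal reader for that format might look like the sketch below; the field names (`id`, `name`, `updated_at`) are illustrative assumptions, not the repo's actual schema.

```python
import json
from typing import Iterator

def read_jsonl(lines) -> Iterator[dict]:
    """Yield one record per non-empty JSONL line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative Canvas-style records (field names are assumptions).
sample = (
    '{"id": 1, "name": "Course A", "updated_at": "2024-01-05"}\n'
    '\n'
    '{"id": 2, "name": "Course B", "updated_at": "2024-01-06"}\n'
)
records = list(read_jsonl(sample.splitlines()))
```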
Approach
- Implement medallion layers: raw landing tables, curated transforms, and meta operational tables
- Use watermark tables/logic to load only new or changed data
- Capture job runs and DQ results for traceability
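The watermark step above can be sketched as follows, using an in-memory SQLite database as a stand-in for Azure SQL; table and column names are illustrative, not the repo's actual schema. Rerunning the same batch loads nothing, which is what makes the pipeline rerun-safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta_watermark (source TEXT PRIMARY KEY, last_updated TEXT)")
conn.execute("INSERT INTO meta_watermark VALUES ('courses', '2024-01-05')")

def incremental_load(conn, source, records):
    """Load only records newer than the stored watermark, then advance it."""
    (wm,) = conn.execute(
        "SELECT last_updated FROM meta_watermark WHERE source = ?", (source,)
    ).fetchone()
    new = [r for r in records if r["updated_at"] > wm]
    if new:
        conn.execute(
            "UPDATE meta_watermark SET last_updated = ? WHERE source = ?",
            (max(r["updated_at"] for r in new), source),
        )
    return new

batch = [
    {"id": 1, "updated_at": "2024-01-04"},  # at or below watermark: skipped
    {"id": 2, "updated_at": "2024-01-06"},  # newer than watermark: loaded
]
loaded = incremental_load(conn, "courses", batch)
rerun = incremental_load(conn, "courses", batch)  # second run is a no-op
```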
Architecture
```mermaid
flowchart TB
  A[JSONL source] --> B[raw schema]
  B --> C[curated transforms]
  C --> D[analytics-ready tables]
  B --> M[meta.job_run]
  C --> Q[meta.dq_results]
  B --> S[meta.schema_drift]
  M --> R[watermark for incremental runs]
```
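The meta.job_run table in the diagram records each run's status and row counts for auditing. A minimal sketch, again using SQLite as a stand-in for Azure SQL with an illustrative schema:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_run (
    run_id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_name TEXT, started_at TEXT, finished_at TEXT,
    status TEXT, rows_loaded INTEGER)""")

def start_run(conn, job_name):
    """Open an audit record for a run and return its id."""
    cur = conn.execute(
        "INSERT INTO job_run (job_name, started_at, status) VALUES (?, ?, 'running')",
        (job_name, datetime.now(timezone.utc).isoformat()),
    )
    return cur.lastrowid

def finish_run(conn, run_id, status, rows_loaded):
    """Close the audit record with an outcome and row count."""
    conn.execute(
        "UPDATE job_run SET finished_at = ?, status = ?, rows_loaded = ? WHERE run_id = ?",
        (datetime.now(timezone.utc).isoformat(), status, rows_loaded, run_id),
    )

run_id = start_run(conn, "ingest_courses")
finish_run(conn, run_id, "succeeded", 42)
status, rows = conn.execute(
    "SELECT status, rows_loaded FROM job_run WHERE run_id = ?", (run_id,)
).fetchone()
```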
Implementation highlights
- Rerun-safe watermarking to reduce reprocessing
- Operational metadata for auditing and troubleshooting
- Data quality checks and schema-drift detection recorded in meta tables
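Schema-drift detection can be as simple as comparing incoming field names against an expected column set, and a basic DQ check can count null values in required fields. The sketch below illustrates both; the expected fields and check names are assumptions, not the repo's actual rules.

```python
EXPECTED_FIELDS = {"id", "name", "updated_at"}  # illustrative schema contract

def detect_drift(record, expected=EXPECTED_FIELDS):
    """Return fields added to or missing from the expected schema."""
    seen = set(record)
    return {"added": sorted(seen - expected), "missing": sorted(expected - seen)}

def dq_check_not_null(records, field):
    """Count records where a required field is null or absent."""
    return sum(1 for r in records if r.get(field) is None)

records = [
    {"id": 1, "name": "Course A", "updated_at": "2024-01-05", "term": "F24"},
    {"id": 2, "name": None, "updated_at": "2024-01-06"},
]
drift = detect_drift(records[0])       # "term" is an unexpected new field
null_names = dq_check_not_null(records, "name")
```

In the pipeline, results like these would be written to meta.schema_drift and meta.dq_results rather than returned in memory.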
Results and impact
Watermark-based incremental loads restrict each run to new or changed records, cutting reprocessing; run auditing and DQ results recorded in the meta schema improve operational transparency and troubleshooting.
Tech stack
Python, Azure SQL, JSONL ingestion, SQL transformations, Git/GitHub.
Links
- Repo: https://github.com/luyangsi/Canvas_pipeline
- Projects index: /projects/
What I'd improve next
Add alerting on DQ failures and contract tests to guard against upstream schema changes.