Canvas Platform Data Ingestion
A production-style medallion pipeline (raw/curated/meta) that ingests Canvas-style JSONL into Azure SQL with rerun-safe watermarking, run auditing, and data quality checks.
Problem
Analytics teams need rerunnable ingestion of semi-structured data with auditing, quality checks, and minimal reprocessing.
Context and constraints
- Source: Canvas-style JSONL
- Target: Azure SQL
- Architecture: raw → curated → meta
- Requirement: rerun-safe incremental ingestion via watermarking
- Operational needs: run auditing, DQ checks, schema-drift detection
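"Canvas-style JSONL" here means one JSON object per line. A minimal reader for that format might look like the sketch below; the field names (`id`, `name`, `updated_at`) are illustrative assumptions, not the repo's actual schema.

```python
import json
from typing import Iterator

def read_jsonl(lines) -> Iterator[dict]:
    """Yield one record per non-empty JSONL line, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative Canvas-style records (field names are assumptions).
sample = (
    '{"id": 1, "name": "Course A", "updated_at": "2024-01-05"}\n'
    '\n'
    '{"id": 2, "name": "Course B", "updated_at": "2024-01-06"}\n'
)
records = list(read_jsonl(sample.splitlines()))
```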
Approach
- Implement medallion layers: raw landing tables, curated transforms, and meta operational tables
- Use watermark tables/logic to load only new or changed data
- Capture job runs and DQ results for traceability
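The watermark step above can be sketched as follows, using an in-memory SQLite database as a stand-in for Azure SQL; table and column names are illustrative, not the repo's actual schema. Rerunning the same batch loads nothing, which is what makes the pipeline rerun-safe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta_watermark (source TEXT PRIMARY KEY, last_updated TEXT)")
conn.execute("INSERT INTO meta_watermark VALUES ('courses', '2024-01-05')")

def incremental_load(conn, source, records):
    """Load only records newer than the stored watermark, then advance it."""
    (wm,) = conn.execute(
        "SELECT last_updated FROM meta_watermark WHERE source = ?", (source,)
    ).fetchone()
    new = [r for r in records if r["updated_at"] > wm]
    if new:
        conn.execute(
            "UPDATE meta_watermark SET last_updated = ? WHERE source = ?",
            (max(r["updated_at"] for r in new), source),
        )
    return new

batch = [
    {"id": 1, "updated_at": "2024-01-04"},  # at or below watermark: skipped
    {"id": 2, "updated_at": "2024-01-06"},  # newer than watermark: loaded
]
loaded = incremental_load(conn, "courses", batch)
rerun = incremental_load(conn, "courses", batch)  # second run is a no-op
```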
Architecture
```mermaid
flowchart TB
  A[JSONL source] --> B[raw schema]
  B --> C[curated transforms]
  C --> D[analytics-ready tables]
  B --> M[meta.job_run]
  C --> Q[meta.dq_results]
  B --> S[meta.schema_drift]
  M --> R[watermark for incremental runs]
```
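The meta.job_run table in the diagram records each run's status and row counts for auditing. A minimal sketch, again using SQLite as a stand-in for Azure SQL with an illustrative schema:

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE job_run (
    run_id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_name TEXT, started_at TEXT, finished_at TEXT,
    status TEXT, rows_loaded INTEGER)""")

def start_run(conn, job_name):
    """Open an audit record for a run and return its id."""
    cur = conn.execute(
        "INSERT INTO job_run (job_name, started_at, status) VALUES (?, ?, 'running')",
        (job_name, datetime.now(timezone.utc).isoformat()),
    )
    return cur.lastrowid

def finish_run(conn, run_id, status, rows_loaded):
    """Close the audit record with an outcome and row count."""
    conn.execute(
        "UPDATE job_run SET finished_at = ?, status = ?, rows_loaded = ? WHERE run_id = ?",
        (datetime.now(timezone.utc).isoformat(), status, rows_loaded, run_id),
    )

run_id = start_run(conn, "ingest_courses")
finish_run(conn, run_id, "succeeded", 42)
status, rows = conn.execute(
    "SELECT status, rows_loaded FROM job_run WHERE run_id = ?", (run_id,)
).fetchone()
```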
Implementation highlights
- Rerun-safe watermarking to reduce reprocessing
- Operational metadata for auditing and troubleshooting
- Data quality checks and schema-drift detection recorded in meta tables
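Schema-drift detection can be as simple as comparing incoming field names against an expected column set, and a basic DQ check can count null values in required fields. The sketch below illustrates both; the expected fields and check names are assumptions, not the repo's actual rules.

```python
EXPECTED_FIELDS = {"id", "name", "updated_at"}  # illustrative schema contract

def detect_drift(record, expected=EXPECTED_FIELDS):
    """Return fields added to or missing from the expected schema."""
    seen = set(record)
    return {"added": sorted(seen - expected), "missing": sorted(expected - seen)}

def dq_check_not_null(records, field):
    """Count records where a required field is null or absent."""
    return sum(1 for r in records if r.get(field) is None)

records = [
    {"id": 1, "name": "Course A", "updated_at": "2024-01-05", "term": "F24"},
    {"id": 2, "name": None, "updated_at": "2024-01-06"},
]
drift = detect_drift(records[0])       # "term" is an unexpected new field
null_names = dq_check_not_null(records, "name")
```

In the pipeline, results like these would be written to meta.schema_drift and meta.dq_results rather than returned in memory.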
Results and impact
Watermark-based incremental loads restrict each run to new or changed records, cutting reprocessing; run auditing and DQ results recorded in the meta schema improve operational transparency and troubleshooting.
Tech stack
Python, Azure SQL, JSONL ingestion, SQL transformations, Git/GitHub.
Links
- Repo: https://github.com/luyangsi/Canvas_pipeline
- Projects index: /projects/
What I'd improve next
Add alerting on DQ failures and contract tests to guard against upstream schema changes.