Semiconductors Data Platform

An end-to-end data engineering platform simulating semiconductor fab operations — featuring a 4-layer medallion architecture, a custom YAML-based data quality rule engine with 15+ manufacturing-specific validation checks, and analytics for yield monitoring and equipment health tracking.


Problem

Semiconductor manufacturing generates high-volume operational data across wafers, batches, equipment, and process steps. Without a structured pipeline and quality framework, it becomes intractable to monitor yield degradation, detect equipment anomalies, or trace root causes. The challenge is building a production-grade platform that enforces referential integrity, validates process sequences, and surfaces health metrics for operational decision-making.

Context and constraints

  • Domain: Semiconductor fab operations (simulated data mirroring TSMC / Lam Research patterns)
  • Architecture: 4-layer medallion — raw → staging → data quality → curated
  • Requirement: Incremental processing with watermark-based change tracking; idempotent pipeline runs
  • Quality requirement: 15+ domain-specific validation rules enforced before promotion to curated layer
  • Analytics requirement: Yield monitoring by equipment and process step; equipment health (uptime, MTBF, alarm frequency)
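The incremental-processing requirement above can be sketched as a watermark check in pandas. This is a minimal illustration, not the project's actual implementation; the state file path, `event_ts` column, and function names are assumptions.

```python
import json
from pathlib import Path

import pandas as pd

# Hypothetical location for per-source watermark state.
WATERMARK_FILE = Path("state/watermark.json")


def load_watermark(source: str) -> pd.Timestamp:
    """Return the last processed timestamp for a source (epoch if none stored)."""
    if WATERMARK_FILE.exists():
        state = json.loads(WATERMARK_FILE.read_text())
        if source in state:
            return pd.Timestamp(state[source])
    return pd.Timestamp(0)


def save_watermark(source: str, ts: pd.Timestamp) -> None:
    """Persist the high-water mark so reruns skip already-processed rows."""
    state = json.loads(WATERMARK_FILE.read_text()) if WATERMARK_FILE.exists() else {}
    state[source] = ts.isoformat()
    WATERMARK_FILE.parent.mkdir(parents=True, exist_ok=True)
    WATERMARK_FILE.write_text(json.dumps(state))


def incremental_load(df: pd.DataFrame, source: str, ts_col: str = "event_ts") -> pd.DataFrame:
    """Select only rows newer than the stored watermark; rerunning on the same
    input returns an empty frame, which makes pipeline runs idempotent."""
    wm = load_watermark(source)
    ts = pd.to_datetime(df[ts_col])
    new_rows = df[ts > wm]
    if not new_rows.empty:
        save_watermark(source, ts.max())
    return new_rows
```

Because the watermark advances only after rows are selected, running the same batch twice promotes each record exactly once.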

Approach

Architecture

```mermaid
flowchart TB
  A[Fab Simulation Data] --> B[Raw Layer\ntimestamp-partitioned]
  B --> C[Staging Layer\nschema standardization + type casting]
  C --> D[DQ Validation Layer\n15+ rule engine checks]
  D --> E[Curated Layer\nstar schema — fact + dimension tables]
  E --> F1[Yield Monitoring\nby equipment & process step]
  E --> F2[Equipment Health\nuptime · MTBF · alarm frequency]
  E --> F3[Wafer Traceability\nroot cause analysis]
  D --> G[Quality Report\nfailed checks + rule details]
```

Data quality checks implemented

  • Referential integrity: Every wafer must have a valid batch_id; every batch must map to a valid equipment record
  • Process sequence validation: Manufacturing steps must follow correct ordering (e.g., lithography → etch → deposition → test); out-of-sequence records are flagged
  • Range validation: Test metrics (yield rate, thickness, resistivity) validated against domain-specific acceptable ranges
  • Temporal consistency: Flags records where a process end timestamp precedes its start timestamp, or where process windows overlap on the same equipment
  • Completeness checks: Required fields validated across wafer, batch, and equipment dimensions
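A YAML-driven rule engine like the one described can be sketched as follows. The rule schema, rule names, and `run_checks` function are hypothetical; the real engine covers 15+ checks, while this sketch shows only the range and completeness cases.

```python
import pandas as pd
import yaml  # PyYAML

# Hypothetical rule schema; the project's actual config format may differ.
RULES_YAML = """
rules:
  - name: yield_rate_in_range
    type: range
    column: yield_rate
    min: 0.0
    max: 1.0
  - name: batch_id_present
    type: not_null
    column: batch_id
"""


def run_checks(df: pd.DataFrame, config: dict) -> list[dict]:
    """Apply each configured rule to the frame; return one result per rule."""
    results = []
    for rule in config["rules"]:
        col = df[rule["column"]]
        if rule["type"] == "range":
            failed = df[(col < rule["min"]) | (col > rule["max"])]
        elif rule["type"] == "not_null":
            failed = df[col.isna()]
        else:
            raise ValueError(f"unknown rule type: {rule['type']}")
        results.append({"rule": rule["name"], "failed_rows": len(failed)})
    return results
```

Keeping rules in YAML means adding a new check is a config change rather than a code change, and the per-rule result records feed directly into the quality report shown in the architecture diagram.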

Results and impact

Fully automated monitoring pipeline with rerun-safe incremental ingestion. The DQ rule engine catches referential, sequential, and range violations before data reaches the analytics layer — analogous to abuse/anomaly detection in production systems. The curated star schema enables yield trend analysis and equipment health scoring for operational decision-making.
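A star schema makes queries like yield-by-equipment a simple join plus aggregation. This sketch uses invented table and column names (`fact_wafer_test`, `dim_equipment`, `equipment_key`) to illustrate the pattern, not the project's actual schema.

```python
import pandas as pd

# Hypothetical curated-layer tables with made-up sample data.
fact_wafer_test = pd.DataFrame({
    "wafer_id": [1, 2, 3, 4],
    "equipment_key": [10, 10, 20, 20],
    "yield_rate": [0.96, 0.94, 0.80, 0.78],
})
dim_equipment = pd.DataFrame({
    "equipment_key": [10, 20],
    "equipment_name": ["litho-01", "etch-02"],
})

# Join the fact table to the equipment dimension, then average yield per tool,
# worst performers first.
yield_by_equipment = (
    fact_wafer_test
    .merge(dim_equipment, on="equipment_key")
    .groupby("equipment_name", as_index=False)["yield_rate"]
    .mean()
    .sort_values("yield_rate")
)
```

The same join-and-aggregate shape extends to yield by process step or by time window by swapping the dimension table and group key.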

Tech stack

Python 3.9+, pandas, NumPy, YAML-based rule engine, CSV/Parquet storage, pytest, star schema design.

What I'd improve next

Replace CSV/Parquet storage with a columnar data warehouse (e.g., Azure Synapse or DuckDB) to enable real-time dashboard queries; add alerting when DQ rule failure rates cross a threshold; and integrate MTBF anomaly detection with statistical control charts.
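The control-chart idea could start as simply as mean ± 3σ limits on time-between-failure samples. This is a rough sketch under that assumption (an individuals chart with invented function names), not a finished design.

```python
import numpy as np


def control_limits(history: np.ndarray, k: float = 3.0) -> tuple[float, float]:
    """Mean ± k·sigma limits estimated from in-control historical samples."""
    mu = history.mean()
    sigma = history.std(ddof=1)  # sample standard deviation
    return mu - k * sigma, mu + k * sigma


def flag_anomalies(history: np.ndarray, new: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Boolean mask over new time-between-failure samples outside the limits."""
    lo, hi = control_limits(history, k)
    return (new < lo) | (new > hi)
```

A drop in time-between-failure below the lower limit would then trigger the same alerting path as DQ rule failures.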