Semiconductors Data Platform
An end-to-end data engineering platform simulating semiconductor fab operations — featuring a 4-layer medallion architecture, a custom YAML-based data quality rule engine with 15+ manufacturing-specific validation checks, and analytics for yield monitoring and equipment health tracking.
Problem
Semiconductor manufacturing generates high-volume operational data across wafers, batches, equipment, and process steps. Without a structured pipeline and quality framework, monitoring yield degradation, detecting equipment anomalies, and tracing root causes become intractable. The challenge is building a production-grade platform that enforces referential integrity, validates process sequences, and surfaces health metrics for operational decision-making.
Context and constraints
- Domain: Semiconductor fab operations (simulated data mirroring TSMC / Lam Research patterns)
- Architecture: 4-layer medallion — raw → staging → data quality → curated
- Requirement: Incremental processing with watermark-based change tracking; idempotent pipeline runs
- Quality requirement: 15+ domain-specific validation rules enforced before promotion to curated layer
- Analytics requirement: Yield monitoring by equipment and process step; equipment health (uptime, MTBF, alarm frequency)
Approach
- Implemented 4-layer medallion design: raw (timestamp-partitioned), staging (schema standardization, type casting), data quality validation, and curated (star schema with fact and dimension tables)
- Built a custom YAML-based rule engine for manufacturing-specific quality checks: referential integrity (wafer → batch → equipment), process sequence verification (lithography → etch → test ordering), temporal consistency, and range validation on test metrics
- Implemented watermark-based incremental ingestion to process only new/changed records, with idempotent re-run safety and graceful late-arriving data handling
- Designed curated layer as a star schema optimized for yield and equipment health analytics queries
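The watermark-based incremental ingestion described above can be sketched as follows. This is a minimal illustration, not the repo's actual implementation: the `state/watermark.json` location, the `updated_at` column name, and the `load_watermark`/`save_watermark` helpers are all assumptions made for the example.

```python
import json
from pathlib import Path

import pandas as pd

WATERMARK_FILE = Path("state/watermark.json")  # hypothetical state location


def load_watermark(source: str) -> pd.Timestamp:
    """Return the last processed timestamp for a source (epoch if none)."""
    if WATERMARK_FILE.exists():
        state = json.loads(WATERMARK_FILE.read_text())
        if source in state:
            return pd.Timestamp(state[source])
    return pd.Timestamp(0)


def save_watermark(source: str, ts: pd.Timestamp) -> None:
    """Persist the high-water mark for a source."""
    state = json.loads(WATERMARK_FILE.read_text()) if WATERMARK_FILE.exists() else {}
    state[source] = ts.isoformat()
    WATERMARK_FILE.parent.mkdir(parents=True, exist_ok=True)
    WATERMARK_FILE.write_text(json.dumps(state))


def incremental_load(source: str, df: pd.DataFrame) -> pd.DataFrame:
    """Select only records newer than the stored watermark, then advance it.

    Re-running on the same input is idempotent: rows at or below the
    watermark are skipped, so already-processed records never reload.
    """
    wm = load_watermark(source)
    new_rows = df[df["updated_at"] > wm]
    if not new_rows.empty:
        save_watermark(source, new_rows["updated_at"].max())
    return new_rows
```

Late-arriving records with timestamps below the watermark would be missed by this strict cutoff; a lookback window or per-partition watermarks is one common way to handle them.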
Architecture
```mermaid
flowchart TB
  A[Fab Simulation Data] --> B[Raw Layer\ntimestamp-partitioned]
  B --> C[Staging Layer\nschema standardization + type casting]
  C --> D[DQ Validation Layer\n15+ rule engine checks]
  D --> E[Curated Layer\nstar schema — fact + dimension tables]
  E --> F1[Yield Monitoring\nby equipment & process step]
  E --> F2[Equipment Health\nuptime · MTBF · alarm frequency]
  E --> F3[Wafer Traceability\nroot cause analysis]
  D --> G[Quality Report\nfailed checks + rule details]
```
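As an illustration of the equipment-health path, MTBF (mean time between failures) can be derived from a per-equipment failure log. The `equipment_id` and `failed_at` column names below are assumptions for the sketch, not the repo's actual schema.

```python
import pandas as pd


def mtbf_hours(events: pd.DataFrame) -> pd.Series:
    """Mean time between failures per equipment, in hours.

    Expects one row per failure event with `equipment_id` and `failed_at`
    columns (assumed names). MTBF is the mean gap between consecutive
    failures; equipment with fewer than two recorded failures yields NaN.
    """
    ordered = events.sort_values(["equipment_id", "failed_at"])
    # diff() within each group gives the gap to the previous failure;
    # the first event per equipment has no predecessor and becomes NaT.
    gaps = ordered.groupby("equipment_id")["failed_at"].diff()
    hours = gaps.dt.total_seconds().div(3600)
    return hours.groupby(ordered["equipment_id"]).mean()
```

Uptime and alarm frequency follow the same pattern: a groupby over the event log keyed on `equipment_id`, aggregated per reporting window.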
Data quality checks implemented
- Referential integrity: Every wafer must have a valid batch_id; every batch must map to a valid equipment record
- Process sequence validation: Manufacturing steps must follow correct ordering (e.g., lithography → etch → deposition → test); out-of-sequence records are flagged
- Range validation: Test metrics (yield rate, thickness, resistivity) validated against domain-specific acceptable ranges
- Temporal consistency: Prevents anomalies where process end timestamps precede start timestamps or overlap across equipment
- Completeness checks: Required fields validated across wafer, batch, and equipment dimensions
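A minimal sketch of how a YAML-driven rule engine of this kind might look, covering the range and referential-integrity checks above. The rule schema (`type`, `table`, `column`, `min`/`max`, `ref_table`/`ref_column` keys) is a plausible shape assumed for illustration; the repo's actual schema may differ.

```python
import pandas as pd
import yaml

# Hypothetical rule definitions; the real rule files may use other keys.
RULES_YAML = """
rules:
  - name: yield_rate_in_range
    type: range
    table: wafer
    column: yield_rate
    min: 0.0
    max: 1.0
  - name: wafer_has_valid_batch
    type: referential
    table: wafer
    column: batch_id
    ref_table: batch
    ref_column: batch_id
"""


def run_rules(tables: dict[str, pd.DataFrame]) -> list[dict]:
    """Evaluate every rule; return one record per failed rule with row indices."""
    failures = []
    for rule in yaml.safe_load(RULES_YAML)["rules"]:
        df = tables[rule["table"]]
        if rule["type"] == "range":
            # Rows whose metric falls outside the allowed band.
            bad = df[~df[rule["column"]].between(rule["min"], rule["max"])]
        elif rule["type"] == "referential":
            # Rows whose foreign key has no match in the referenced table.
            valid = set(tables[rule["ref_table"]][rule["ref_column"]])
            bad = df[~df[rule["column"]].isin(valid)]
        else:
            continue  # unknown rule types are skipped in this sketch
        if not bad.empty:
            failures.append({"rule": rule["name"], "rows": bad.index.tolist()})
    return failures
```

Keeping rules in YAML rather than code means adding a new check is a config change, and the failure records feed directly into the quality report shown in the architecture diagram.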
Results and impact
Tech stack
Python 3.9+, pandas, NumPy, YAML-based rule engine, CSV/Parquet storage, pytest, star schema design.
Links
- Repo: https://github.com/luyangsi/semiconductors_data_platform
- Projects index: /projects/
What I'd improve next
Replace CSV/Parquet storage with a columnar data warehouse (e.g., Azure Synapse or DuckDB) to enable real-time dashboard queries; add alerting when DQ rule failure rates cross a threshold; and integrate MTBF anomaly detection with statistical control charts.