ETL Pipeline: Bank Info
Extracts the "List of largest banks" table from Wikipedia, converts market capitalizations from USD using a local exchange-rate CSV, and loads the results to CSV and SQLite. Containerized with Docker and automated with a GitHub Actions CI/CD pipeline. Demonstrates production-ready ETL practices at small scale.
Problem
Demonstrate a clean, reproducible ETL workflow with portable execution and verifiable outputs.
Context and constraints
- Extract: Wikipedia "List of largest banks" (by market cap)
- Transform: convert USD to GBP/EUR/INR using local exchange-rate CSV
- Load: CSV + SQLite
- Operational: CLI, Makefile shortcuts, Docker, CI smoke test
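The transform step above can be sketched as a join-and-multiply against the local rates file. This is a minimal sketch, not the repo's actual code: the CSV layout (`Currency,Rate`, rates as units per 1 USD) and the column names (`MC_USD_Billion`, `MC_GBP_Billion`, ...) are illustrative assumptions.

```python
import io

import pandas as pd

# Hypothetical exchange-rate CSV: one row per target currency,
# each rate expressed as units of that currency per 1 USD.
RATES_CSV = """Currency,Rate
GBP,0.79
EUR,0.92
INR,83.10
"""


def convert_market_cap(df: pd.DataFrame, rates: pd.DataFrame) -> pd.DataFrame:
    """Add one market-cap column per target currency, rounded to 2 dp."""
    out = df.copy()
    for _, row in rates.iterrows():
        col = f"MC_{row['Currency']}_Billion"
        out[col] = (out["MC_USD_Billion"] * row["Rate"]).round(2)
    return out


banks = pd.DataFrame({"Name": ["Bank A", "Bank B"], "MC_USD_Billion": [500.0, 250.0]})
rates = pd.read_csv(io.StringIO(RATES_CSV))
print(convert_market_cap(banks, rates))
```

Keeping the rates in a local CSV (rather than calling a live FX API) makes the pipeline deterministic and easy to test offline.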
Approach
- Extract and normalize the Wikipedia table
- Type-cast and clean market cap values
- Apply currency conversion via local exchange-rate CSV
- Persist outputs to CSV and SQLite; validate with CI
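The clean-and-load steps above can be sketched as follows. In the real pipeline the raw table would come from `pd.read_html` on the Wikipedia article; here the sketch starts from the string values that call would return, and the column and table names are illustrative assumptions, not the repo's actual identifiers.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for the output of pd.read_html on the Wikipedia page:
# market caps arrive as strings, sometimes with footnote markers.
df = pd.DataFrame(
    {
        "Bank name": ["Bank A", "Bank B"],
        "Market cap (US$ billion)": ["599.93[1]", "231.52"],
    }
)

# Type-cast and clean: strip footnote markers like "[1]", cast to float.
df["MC_USD_Billion"] = (
    df["Market cap (US$ billion)"]
    .str.replace(r"\[.*?\]", "", regex=True)
    .astype(float)
)

# Persist to CSV (an in-memory buffer here) and to SQLite.
csv_buf = io.StringIO()
df.to_csv(csv_buf, index=False)

conn = sqlite3.connect(":memory:")
df.to_sql("largest_banks", conn, if_exists="replace", index=False)
count = conn.execute("SELECT COUNT(*) FROM largest_banks").fetchone()[0]
print(count)  # 2
```

`DataFrame.to_sql` accepts a raw `sqlite3` connection for SQLite targets, so no ORM layer is needed for a pipeline this size.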
Architecture
```mermaid
flowchart TB
  A[Wikipedia table] --> B[Extract]
  B --> C[Clean & normalize]
  C --> D[Join exchange rates CSV]
  D --> E[Currency conversion]
  E --> F[Write CSV]
  E --> G[Load SQLite]
  H[CI smoke test] --> I[Build verification]
```
Implementation highlights
- CLI + Makefile for consistent execution
- Dockerized runtime for environment parity
- CI smoke test to detect regressions
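A CI smoke test for a pipeline like this can stay very small: check that both outputs exist and that the SQLite table is non-empty. This is a sketch only; the file and table names (`banks.csv`, `banks.db`, `largest_banks`) are illustrative, not necessarily what the repo's workflow checks.

```python
import pathlib
import sqlite3


def smoke_test(csv_path: str, db_path: str, table: str) -> bool:
    """Return True if the pipeline produced a CSV and a non-empty table."""
    if not pathlib.Path(csv_path).exists():
        return False
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    except sqlite3.OperationalError:  # table missing or db unreadable
        return False
    finally:
        conn.close()
    return rows > 0
```

In GitHub Actions this would run after the pipeline step and fail the job on a `False` result, catching regressions without asserting anything about the (changing) upstream data values.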
Results and impact
An end-to-end ETL pipeline with operational-readiness patterns (Docker + CI). No runtime performance metrics are claimed.
Tech stack
Python, SQLite, Docker, GitHub Actions.
Links
- Repo: https://github.com/luyangsi/ETL_Pipeline_Bank_Info
- Projects index: /projects/
What I'd improve next
Add incremental refresh logic and schema validation to handle upstream HTML/table changes.
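The schema-validation idea could start as a fail-fast column check run right after extraction, so an upstream HTML change surfaces as a clear error rather than a silent mis-transform. A minimal sketch, with illustrative column names:

```python
import pandas as pd

# Columns the transform step depends on (names are illustrative).
EXPECTED_COLUMNS = {"Bank name", "Market cap (US$ billion)"}


def validate_schema(df: pd.DataFrame) -> None:
    """Raise if the extracted Wikipedia table is missing expected columns."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Upstream table schema changed; missing: {sorted(missing)}")
```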