Wikipedia API Extraction
A multilingual NLP pipeline that pairs EN↔ZH Wikipedia pages and uses multilingual embeddings to flag low-similarity sentence pairs as potential content mismatches.
Problem
Articles on the same topic can differ substantially between the English and Chinese editions of Wikipedia, and these cross-language differences can create inconsistencies. Finding mismatched content by hand does not scale, so it must be surfaced automatically.
Context and constraints
- Languages: English and Chinese Wikipedia
- Sensitive identity categories: race, gender/sex, nationality, age, religion
- Need: candidate generation + semantic similarity at document and sentence levels
Approach
- Extract pages and generate candidate EN↔ZH pairings
- Compute similarity with multilingual embeddings
- Surface low-similarity sentence pairs for downstream review
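The candidate-pairing step above can lean on MediaWiki's interlanguage links. A minimal sketch, assuming the standard `action=query&prop=langlinks` response shape; the sample data is illustrative, not a live API call:

```python
# Sketch of EN->ZH candidate pairing from a MediaWiki langlinks response.
# The nested "query"/"pages"/"langlinks" shape follows the standard
# action=query&prop=langlinks format; the sample below is illustrative.

def extract_zh_pairs(api_response: dict) -> list[tuple[str, str]]:
    """Map each English page title to its linked Chinese title, if any."""
    pairs = []
    for page in api_response.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            if link.get("lang") == "zh":
                pairs.append((page["title"], link["*"]))
    return pairs

# Illustrative response for one page (not a real API call).
sample = {
    "query": {
        "pages": {
            "534366": {
                "title": "Barack Obama",
                "langlinks": [{"lang": "zh", "*": "贝拉克·奥巴马"}],
            }
        }
    }
}

print(extract_zh_pairs(sample))  # [('Barack Obama', '贝拉克·奥巴马')]
```

Pages with no `zh` entry simply yield no pair, which keeps the downstream search space restricted to articles that exist in both editions.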
Architecture
flowchart LR
A[Wikipedia API] --> B[Page extraction]
B --> C[Candidate pairing EN↔ZH]
C --> D[Sentence segmentation]
D --> E[Multilingual embeddings]
E --> F[Similarity scoring]
F --> G[Low-similarity mismatches]
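The similarity-scoring and mismatch stages of the pipeline can be sketched as cosine similarity over sentence-embedding vectors, flagging pairs below a cutoff. The 4-dimensional vectors and the 0.5 threshold here are illustrative stand-ins for real multilingual sentence embeddings:

```python
import math

# Minimal sketch: score aligned EN/ZH sentence pairs by cosine similarity
# of their embedding vectors and flag pairs that fall below a threshold.
# Vectors and threshold are toy values standing in for real embeddings.

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def flag_mismatches(pairs, threshold=0.5):
    """Return (index, score) for sentence pairs scoring below threshold."""
    return [(i, cosine(en, zh)) for i, (en, zh) in enumerate(pairs)
            if cosine(en, zh) < threshold]

# Toy embeddings: pair 0 is well aligned, pair 1 is orthogonal (mismatch).
pairs = [
    ([1.0, 0.0, 0.2, 0.1], [0.9, 0.1, 0.2, 0.1]),
    ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]),
]
print(flag_mismatches(pairs))  # [(1, 0.0)]
```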
Implementation highlights
- Candidate pairing to reduce the search space
- Two-level scoring (document + sentence)
- Outputs designed for qualitative review and audits
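The two-level scoring idea can be illustrated by rolling sentence scores up into a document score and ranking pairs so the least-similar surface first. The aggregation rule (mean) and the scores below are assumptions for illustration, not the repository's exact method:

```python
# Sketch of two-level scoring: a document-level score aggregates its
# sentence-level similarity scores (mean, as an illustrative choice),
# and documents are ranked ascending so likely mismatches come first.

def document_score(sentence_scores: list[float]) -> float:
    return sum(sentence_scores) / len(sentence_scores) if sentence_scores else 0.0

def rank_for_review(docs: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank documents ascending by score so likely mismatches lead."""
    scored = [(title, document_score(s)) for title, s in docs.items()]
    return sorted(scored, key=lambda x: x[1])

# Toy sentence-level scores for two EN/ZH article pairs.
docs = {
    "Article A": [0.92, 0.88, 0.95],  # consistent pair
    "Article B": [0.40, 0.35, 0.90],  # partially mismatched pair
}
print(rank_for_review(docs))  # Article B ranks first
```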
Results and impact
Produced ranked candidate page pairs and low-similarity sentence pairs for targeted review; no quantitative accuracy metrics are reported.
Tech stack
Python, Wikipedia API, multilingual embeddings, text processing.
Links
- Repo: https://github.com/luyangsi/Wikipedia_API_Extraction
- Projects index: /projects/
What I'd improve next
Introduce human-in-the-loop validation and threshold calibration per category.
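One hypothetical way to calibrate thresholds per identity category: derive each category's cutoff from the distribution of its own similarity scores rather than a single global value. A simple nearest-rank lower percentile is sketched below; the category names and scores are illustrative:

```python
# Hedged sketch of per-category threshold calibration: each category's
# cutoff is a lower percentile of its own similarity-score distribution,
# instead of one global threshold. All data here is illustrative.

def percentile_threshold(scores: list[float], pct: float = 10.0) -> float:
    """Nearest-rank percentile of the sorted scores."""
    ordered = sorted(scores)
    k = int(round(pct / 100 * (len(ordered) - 1)))
    return ordered[max(0, min(len(ordered) - 1, k))]

scores_by_category = {
    "nationality": [0.2, 0.5, 0.6, 0.7, 0.8, 0.9],
    "religion": [0.4, 0.6, 0.65, 0.7, 0.75, 0.85],
}
thresholds = {cat: percentile_threshold(s) for cat, s in scores_by_category.items()}
print(thresholds)
```

A labeled sample from human-in-the-loop review could later replace the percentile with a threshold tuned against actual mismatch judgments.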