
Wikipedia API Extraction

A multilingual NLP pipeline that pairs EN↔ZH Wikipedia pages and flags low-similarity mismatch sentences using multilingual embeddings.


Problem

English and Chinese articles on the same topic can diverge in facts and framing, creating inconsistencies; surfacing these mismatches requires scalable, automated comparison rather than manual review.

Context and constraints

  • Languages: English and Chinese Wikipedia
  • Sensitive identity categories: race, gender/sex, nationality, age, religion
  • Need: candidate generation + semantic similarity at document and sentence levels
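Candidate generation can start from Wikipedia's interlanguage links. A minimal sketch, assuming the MediaWiki `langlinks` query; the parsing runs on a canned response here so no live request is needed:

```python
# Hypothetical sketch: pair an English article with its Chinese counterpart
# via the MediaWiki API's langlinks property (en.wikipedia.org/w/api.php).

def build_langlink_params(title, target_lang="zh"):
    """Query parameters for a langlinks lookup on a given page title."""
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllang": target_lang,
        "format": "json",
    }

def extract_pair(title, api_response):
    """Return (en_title, zh_title) if a Chinese interlanguage link exists."""
    pages = api_response["query"]["pages"]
    for page in pages.values():
        for link in page.get("langlinks", []):
            return (title, link["*"])
    return None  # no ZH counterpart; drop from the candidate set

# Example with a canned API response:
sample = {"query": {"pages": {"736": {"langlinks": [{"lang": "zh", "*": "人工智能"}]}}}}
print(extract_pair("Artificial intelligence", sample))
# → ('Artificial intelligence', '人工智能')
```

Pages with no ZH link are dropped early, which keeps the downstream embedding step from scoring unpaired documents.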

Approach

Architecture

```mermaid
flowchart LR
  A[Wikipedia API] --> B[Page extraction]
  B --> C[Candidate pairing EN↔ZH]
  C --> D[Sentence segmentation]
  D --> E[Multilingual embeddings]
  E --> F[Similarity scoring]
  F --> G[Low-similarity mismatches]
```
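The segmentation step above must handle both scripts: Chinese sentences end in 。！？ with no spaces, while English splits on terminal punctuation plus whitespace. A minimal regex sketch (a production pipeline would use a proper segmenter):

```python
import re

def split_sentences(text, lang):
    """Language-aware sentence splitting, keeping terminators attached."""
    if lang == "zh":
        parts = re.split(r"(?<=[。！？])", text)
    else:
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("今天天气很好。我们去公园。", "zh"))
# → ['今天天气很好。', '我们去公园。']
print(split_sentences("It works. Really!", "en"))
# → ['It works.', 'Really!']
```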

Implementation highlights
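The core scoring step can be sketched as follows: embed each aligned sentence pair with a multilingual encoder, compute cosine similarity, and flag pairs below a threshold. The toy 3-d vectors and the 0.5 cutoff below are illustrative placeholders, not the project's actual model output or calibrated threshold:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def flag_mismatches(pairs, embed, threshold=0.5):
    """Return (en, zh, score) for aligned pairs scoring below the threshold,
    sorted so the strongest mismatches come first."""
    flagged = []
    for en, zh in pairs:
        score = cosine(embed[en], embed[zh])
        if score < threshold:
            flagged.append((en, zh, score))
    return sorted(flagged, key=lambda t: t[2])

# Toy embeddings standing in for a multilingual encoder's output:
embed = {
    "The sky is blue.": [0.9, 0.1, 0.0],
    "天空是蓝色的。": [0.88, 0.12, 0.05],
    "He was born in 1970.": [0.1, 0.9, 0.1],
    "他生于1985年。": [0.1, 0.2, 0.9],
}
pairs = [("The sky is blue.", "天空是蓝色的。"),
         ("He was born in 1970.", "他生于1985年。")]
for en, zh, s in flag_mismatches(pairs, embed):
    print(f"{s:.2f}  {en}  |  {zh}")
```

In this toy run only the second pair (a factual divergence) falls below the cutoff; the near-paraphrase pair scores high and is left alone.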

Results and impact

Produced ranked candidate pairs and low-similarity sentence pairs for targeted human review; no formal accuracy evaluation was performed.

Tech stack

Python, Wikipedia API, multilingual embeddings, text processing.

Links

What I'd improve next

Introduce human-in-the-loop validation and calibrate similarity thresholds separately for each sensitive category.