
Crossref Retraction Metadata Analysis

A data quality research project that analyzes retraction-flag phrases in Crossref titles and tracks indexing drift between two DOI status snapshots (208 DOIs changed status).


Problem

Retraction metadata that is incorrect, or that drifts between snapshots, can propagate misinformation through the scholarly databases and analytics pipelines that consume it.

Context and constraints

  • Data: Crossref DOI metadata
  • Task: extract retraction/removal-notice phrases from titles and compare with retraction status
  • Time comparison: April 2023 vs July 2024
  • Observed drift: 208 DOIs marked retracted in April 2023 but not in July 2024
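
A minimal sketch of the extraction and comparison task, assuming a retraction or removal notice appears as a prefix at the start of the title. The phrase list, the pattern, and the function names `extract_flag_phrase` and `is_consistent` are illustrative assumptions, not the project's actual rules.

```python
import re
from typing import Optional

# Hypothetical phrase patterns (assumption: the real phrase list is not given here).
# A colon or a "notice"/"note" word is required so that ordinary titles such as
# "Removal of heavy metals from water" are not flagged.
FLAG_PATTERN = re.compile(
    r"^\s*(?:"
    r"(?:retracted|withdrawn|removed)(?:\s+article)?\s*:"
    r"|(?:retraction|withdrawal|removal)\s+(?:notice|note)\b"
    r"|retraction\s*:"
    r")",
    re.IGNORECASE,
)


def extract_flag_phrase(title: str) -> Optional[str]:
    """Return the leading retraction/removal phrase in a title, or None."""
    match = FLAG_PATTERN.match(title)
    if match is None:
        return None
    # Drop the trailing colon, collapse whitespace, and lower-case the phrase.
    phrase = match.group(0).strip().rstrip(":")
    return " ".join(phrase.split()).lower()


def is_consistent(title: str, marked_retracted: bool) -> bool:
    """True when the title flag and the retraction status agree."""
    return (extract_flag_phrase(title) is not None) == marked_retracted
```

For example, `extract_flag_phrase("Retraction notice: Some title")` yields `"retraction notice"`, and a record with that title whose status is not retracted would fail `is_consistent`.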

Approach

Architecture

flowchart TB
    A["Crossref metadata"] --> B["Snapshot Apr 2023"]
    A --> C["Snapshot Jul 2024"]
    B --> D["Phrase extraction"]
    C --> D
    D --> E["Consistency checks"]
    E --> F["Indexing drift report (208 DOIs)"]
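
The last two stages of the flowchart can be sketched as a comparison of two snapshots. This is an illustration only: it assumes each snapshot has already been reduced to a mapping from DOI to a boolean retraction flag, and the names `normalize_doi` and `find_status_drift` are hypothetical. DOIs missing from the later snapshot are skipped here, which is also an assumption.

```python
from typing import Dict, List


def normalize_doi(doi: str) -> str:
    """DOIs are case-insensitive, so compare them in lower case."""
    return doi.strip().lower()


def find_status_drift(earlier: Dict[str, bool], later: Dict[str, bool]) -> List[str]:
    """Return DOIs flagged retracted in `earlier` but explicitly not flagged in `later`."""
    earlier_norm = {normalize_doi(doi): flag for doi, flag in earlier.items()}
    later_norm = {normalize_doi(doi): flag for doi, flag in later.items()}
    return sorted(
        doi
        for doi, was_retracted in earlier_norm.items()
        # Assumption: a DOI absent from `later` is not counted as drift.
        if was_retracted and later_norm.get(doi) is False
    )


# Usage with placeholder DOIs (not real records):
snapshot_apr_2023 = {"10.5555/A1": True, "10.5555/b2": True, "10.5555/c3": False}
snapshot_jul_2024 = {"10.5555/a1": False, "10.5555/b2": True, "10.5555/c3": False}
drifted = find_status_drift(snapshot_apr_2023, snapshot_jul_2024)
```

Keeping the drift check as a pure function over two mappings makes it easy to rerun against any pair of snapshots.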
    

Implementation highlights

Results and impact

Identified indexing drift: 208 DOIs were marked retracted in the April 2023 snapshot but not in the July 2024 snapshot, which highlights the risk of relying on a single metadata source.

Tech stack

Python, metadata pipelines, data quality analysis.

Links

What I'd improve next

Add multi-source validation and automated anomaly detection for sudden status shifts.
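
A rough sketch of what those two checks might look like, not something the project implements today. The source names, the function names `sources_disagree` and `is_sudden_shift`, and the 1% default threshold are all placeholder assumptions.

```python
from typing import Dict, List, Set


def sources_disagree(status_by_source: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return DOIs whose retraction flag differs between at least two sources."""
    flags_seen: Dict[str, Set[bool]] = {}
    for statuses in status_by_source.values():
        for doi, flag in statuses.items():
            flags_seen.setdefault(doi.strip().lower(), set()).add(flag)
    return sorted(doi for doi, flags in flags_seen.items() if len(flags) > 1)


def is_sudden_shift(
    flipped: int, previously_retracted: int, max_flip_rate: float = 0.01
) -> bool:
    """Flag a snapshot when too large a share of retracted DOIs lost their flag.

    The 1% default is an arbitrary placeholder, not a tuned value.
    """
    if previously_retracted == 0:
        return False
    return flipped / previously_retracted > max_flip_rate


# Usage with placeholder DOIs and a hypothetical second source:
disagreements = sources_disagree(
    {
        "crossref": {"10.5555/a1": False, "10.5555/b2": True},
        "other_index": {"10.5555/a1": True, "10.5555/b2": True},
    }
)
```

A real threshold would need to be calibrated against the normal rate of status changes between snapshots.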