Crossref Retraction Metadata Analysis
A data-quality research project that analyzes retraction-flag phrases in article titles and tracks indexing drift across Crossref DOI status snapshots (208 affected DOIs).
Problem
Incorrect or drifting retraction metadata can propagate misinformation in scholarly databases and analytics pipelines.
Context and constraints
- Data: Crossref DOI metadata
- Task: extract retraction/removal-notice phrases from titles and compare with retraction status
- Time comparison: April 2023 vs July 2024
- Observed drift: 208 DOIs marked retracted in April 2023 but not in July 2024
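As a rough illustration of the snapshot step, a record from the Crossref REST API (`https://api.crossref.org/works/{doi}`) can be reduced to just the fields this analysis needs. The retraction check via the `update-to` relation is a simplifying assumption about how the snapshots were built, not a description of the project's actual data model:

```python
# Minimal sketch: reduce a Crossref "message" object to the fields used in
# the comparison. The "update-to" check is an assumption: it marks records
# that are themselves retraction notices updating another work.
def normalize_record(message: dict) -> dict:
    titles = message.get("title") or [""]
    updates = message.get("update-to") or []
    return {
        "doi": message.get("DOI", "").lower(),
        "title": titles[0],
        # True if this record updates another work as a retraction
        "is_retraction_notice": any(
            u.get("type") == "retraction" for u in updates
        ),
    }

sample = {
    "DOI": "10.1234/EXAMPLE.1",
    "title": ["RETRACTED: A study of example data"],
    "update-to": [{"DOI": "10.1234/example.0", "type": "retraction"}],
}
print(normalize_record(sample))
```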
Approach
- Collect DOI metadata snapshots
- Extract retraction phrases from titles
- Compare phrases to retraction status and report drift
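The phrase-extraction step could be sketched as a simple pattern match over titles. The phrase list below is illustrative only; the project's actual pattern set may differ:

```python
import re

# Illustrative retraction-signal phrases; the real project may use a
# different or longer list.
RETRACTION_PATTERN = re.compile(
    r"^\s*(notice of retraction|retracted|retraction|withdrawn|removed)\b[:\s]",
    re.IGNORECASE,
)

def extract_retraction_phrase(title: str):
    """Return the matched retraction-signal phrase (lowercased), or None."""
    m = RETRACTION_PATTERN.match(title)
    return m.group(1).lower() if m else None

print(extract_retraction_phrase("RETRACTED: A study of example data"))  # retracted
print(extract_retraction_phrase("A perfectly normal title"))            # None
```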
Architecture
flowchart TB
A["Crossref metadata"] --> B["Snapshot Apr 2023"]
A --> C["Snapshot Jul 2024"]
B --> D["Phrase extraction"]
C --> D
D --> E["Consistency checks"]
E --> F["Indexing drift report (208 DOIs)"]
Implementation highlights
- Snapshot-based comparisons to detect status changes
- Phrase extraction for retraction-signal consistency
- Documentation-first approach for reproducibility
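The snapshot-based comparison reduces to a set difference: which DOIs carried a retracted status in the earlier snapshot but not in the later one. A minimal sketch, assuming each snapshot is a DOI-to-flag mapping (an assumed shape, not the project's actual data model):

```python
# Sketch of the drift check: given two mappings of DOI -> retracted flag
# (one per snapshot), report DOIs whose retracted status disappeared.
def find_drift(earlier: dict, later: dict) -> list:
    return sorted(
        doi
        for doi, retracted in earlier.items()
        if retracted and not later.get(doi, False)
    )

apr_2023 = {"10.1/a": True, "10.1/b": True, "10.1/c": False}
jul_2024 = {"10.1/a": True, "10.1/b": False}  # 10.1/c absent in July
print(find_drift(apr_2023, jul_2024))  # ['10.1/b']
```

A DOI that is simply missing from the later snapshot but was never marked retracted (like `10.1/c` above) is not counted as drift.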
Results and impact
The comparison surfaced 208 DOIs that were marked retracted in the April 2023 snapshot but had lost that status by July 2024, highlighting the risk of relying on a single metadata source for retraction status.
Tech stack
Python, metadata pipelines, data-quality analysis.
Links
- Repo: https://github.com/luyangsi/Crossref_Metadata_Analysis
- Projects index: /projects/
What I'd improve next
Add multi-source validation and automated anomaly detection for sudden status shifts.
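One possible shape for that multi-source validation, sketched under assumptions (the source names and input format are hypothetical): flag a DOI when independent indexes disagree on its retraction status, or when its title carries a retraction phrase that no source corroborates.

```python
# Hypothetical multi-source cross-check. title_signals maps DOI -> whether
# the title contains a retraction phrase; sources maps source name ->
# (DOI -> retracted flag). All names here are illustrative.
def flag_anomalies(title_signals: dict, sources: dict) -> dict:
    anomalies = {}
    for doi, has_phrase in title_signals.items():
        votes = {name: src.get(doi, False) for name, src in sources.items()}
        if len(set(votes.values())) > 1:
            anomalies[doi] = f"sources disagree: {votes}"
        elif has_phrase and not any(votes.values()):
            anomalies[doi] = "title says retracted, but no source agrees"
    return anomalies

signals = {"10.1/a": True, "10.1/b": True}
srcs = {
    "crossref": {"10.1/a": True, "10.1/b": False},
    "other_index": {"10.1/a": True, "10.1/b": True},
}
print(flag_anomalies(signals, srcs))  # only 10.1/b is flagged
```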