Research Dataset Recommendation System
A research-support tool that matches research questions to public datasets across 25+ sources, cutting discovery time from ~2 hours to <5 minutes.
Problem
Finding an appropriate dataset for a research question usually means searching many separate portals by hand, which takes roughly 2 hours by this project's estimate.
Context and constraints
- Input: a research question (free text)
- Output: ranked dataset recommendations with access notes
- Catalog spans 25+ sources (e.g., CFPB, FRED, Census, WRDS)
- Evaluation scope: the only reported metric is the search-time saving; no other evaluation metrics are claimed
Approach
- Curate and maintain a dataset catalog across 25+ sources
- Convert questions into keywords/signals and match against the catalog
- Provide an end-to-end analysis example (CFPB complaints) using OLS regression and t-tests
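The keyword-matching step can be sketched as follows. This is a minimal illustration, not the repository's actual code: the `Dataset` class, the `STOPWORDS` set, the `extract_keywords` and `rank` functions, the catalog entries, and the access notes are all hypothetical placeholders, and the overlap-count scoring is an assumed stand-in for whatever ranking the project uses.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list; a real one would likely be longer.
STOPWORDS = {"a", "an", "and", "are", "do", "does", "for", "how",
             "in", "is", "of", "on", "the", "to", "what", "with"}


@dataclass(frozen=True)
class Dataset:
    name: str
    source: str
    keywords: frozenset
    access_note: str


# Toy catalog for illustration only; names, keywords, and notes are placeholders.
CATALOG = [
    Dataset("Consumer complaints", "CFPB",
            frozenset({"consumer", "complaints", "credit", "mortgage", "banks"}),
            "placeholder access note"),
    Dataset("Economic time series", "FRED",
            frozenset({"interest", "rates", "inflation", "unemployment"}),
            "placeholder access note"),
    Dataset("Population survey", "Census",
            frozenset({"income", "population", "housing", "demographics"}),
            "placeholder access note"),
]


def extract_keywords(question):
    """Lowercase the question, split into alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", question.lower())
    return {t for t in tokens if t not in STOPWORDS}


def rank(question, catalog, top_k=3):
    """Rank datasets by how many query keywords they share (assumed scoring)."""
    query = extract_keywords(question)
    scored = []
    for ds in catalog:
        overlap = len(query & ds.keywords)
        if overlap:  # skip datasets with no shared keywords
            scored.append((overlap, ds))
    # Highest overlap first; ties broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [ds for _, ds in scored[:top_k]]


if __name__ == "__main__":
    for ds in rank("Do mortgage complaints vary across banks?", CATALOG):
        print(ds.source, ds.name, ds.access_note)
```

Counting shared keywords keeps the example transparent; a weighted scheme such as TF-IDF would be a natural next step but is not something the project description confirms.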
Architecture
flowchart LR
A["Research question"]
B["Text parsing and keywords"]
C["Dataset catalog (25+ sources)"]
D["Matching and ranking"]
E["Recommendation list + access notes"]
F["Reproducible analysis notebook"]
A --> B
B --> C
C --> D
D --> E
E --> F
Implementation highlights
- Catalog design supports adding new sources with minimal friction
- Separation of concerns between ingestion and ranking logic
- Reproducible analysis workflow with publication-ready visuals
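One way to get low-friction source additions and a clean split between ingestion and ranking is a loader registry. The sketch below is an assumption about how such a design could look, not a description of the repository: `register_source`, `build_catalog`, the loader functions, and the record fields are hypothetical names.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical registry mapping a source name to its ingestion function.
_LOADERS: Dict[str, Callable[[], List[Record]]] = {}


def register_source(name: str):
    """Decorator that registers an ingestion function under a source name."""
    def decorator(fn: Callable[[], List[Record]]):
        _LOADERS[name] = fn
        return fn
    return decorator


# Adding a source means adding one decorated function; ranking code is untouched.
@register_source("CFPB")
def load_cfpb() -> List[Record]:
    # Placeholder record; a real loader would read the source's metadata.
    return [{"name": "Consumer complaints", "keywords": "consumer complaints credit"}]


@register_source("FRED")
def load_fred() -> List[Record]:
    return [{"name": "Economic time series", "keywords": "interest rates inflation"}]


def build_catalog() -> List[Record]:
    """Run every registered loader and tag each record with its source."""
    catalog: List[Record] = []
    for source in sorted(_LOADERS):
        for record in _LOADERS[source]():
            catalog.append({**record, "source": source})
    return catalog


if __name__ == "__main__":
    for entry in build_catalog():
        print(entry["source"], "|", entry["name"])
```

Because ranking would only ever see the normalized records returned by `build_catalog`, a new portal can be supported without changing the matching logic.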
Results and impact
Reported impact: manual dataset search time dropped from ~2 hours to <5 minutes. The project also includes an end-to-end CFPB complaints analysis using OLS regression and t-tests.
Tech stack
Python, pandas, statistical modeling (OLS), hypothesis testing (t-tests), Jupyter.
Links
- Repo: https://github.com/luyangsi/Research_Dataset_Recommender
- Projects index: /projects/
What I'd improve next
Add richer query understanding (synonyms, domain ontologies) and introduce relevance evaluation with labeled queries.
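Both ideas can be sketched briefly. Nothing below exists in the project yet: the `SYNONYMS` map, `expand_query`, `precision_at_k`, and `mean_precision_at_k` are hypothetical names, and precision@k is only one assumed choice of relevance metric among several that a labeled query set would allow.

```python
# Hypothetical synonym map; a real one might be derived from a domain ontology.
SYNONYMS = {
    "loans": {"credit", "mortgage"},
    "jobs": {"employment", "unemployment"},
}


def expand_query(keywords):
    """Return the keywords plus any synonyms listed for them."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded |= SYNONYMS.get(keyword, set())
    return expanded


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for name in ranked[:k] if name in relevant)
    return hits / k


def mean_precision_at_k(labeled_queries, recommend, k):
    """Average precision@k over a dict of {query: set of relevant names}."""
    scores = [
        precision_at_k(recommend(query), relevant, k)
        for query, relevant in labeled_queries.items()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(sorted(expand_query({"loans"})))
    print(precision_at_k(["A", "B", "C"], {"A", "C"}, k=2))
```

Here `recommend` stands for any function that maps a query string to a ranked list of dataset names, so the same harness could compare the current matcher against a synonym-expanded one.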