
Research Dataset Recommendation System

A research-support tool that matches research questions to public datasets across 25+ sources, cutting discovery time from ~2 hours to <5 minutes.


Problem

Finding a suitable dataset for a research question usually means searching many separate data portals by hand, a process the reported figures put at roughly 2 hours.

Context and constraints

  • Input: a research question (free text)
  • Output: ranked dataset recommendations with access notes
  • Catalog spans 25+ sources (e.g., CFPB, FRED, Census, WRDS)
  • Constraint: the only metric reported is the stated time saving; no other evaluation metrics are claimed

Approach

Architecture

flowchart LR
      A["Research question"]
      B["Text parsing and keywords"]
      C["Dataset catalog (25+ sources)"]
      D["Matching and ranking"]
      E["Recommendation list + access notes"]
      F["Reproducible analysis notebook"]

      A --> B
      B --> C
      C --> D
      D --> E
      E --> F
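
The flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the stopword list, the `Dataset` fields, the keyword-overlap score, and the three catalog entries (including their access notes) are all hypothetical stand-ins for whatever parsing and ranking the tool really uses.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple

# Hypothetical stopword list; the project does not say how keywords are extracted.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "and", "or", "to", "for",
             "do", "does", "how", "what", "is", "are"}


@dataclass
class Dataset:
    name: str
    source: str
    description: str
    access_note: str


def extract_keywords(text: str) -> Set[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def rank(question: str, catalog: List[Dataset], top_k: int = 3) -> List[Tuple[Dataset, int]]:
    """Score each dataset by keyword overlap with its description, best first."""
    keywords = extract_keywords(question)
    scored = []
    for ds in catalog:
        overlap = len(keywords & extract_keywords(ds.description))
        if overlap > 0:
            scored.append((ds, overlap))
    # Sort by descending score, then by name so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0].name))
    return scored[:top_k]


# Toy catalog: illustrative entries and placeholder access notes only.
CATALOG = [
    Dataset("Consumer Complaint Database", "CFPB",
            "consumer complaints about financial products such as mortgage and credit card",
            "placeholder: public download"),
    Dataset("Unemployment Rate", "FRED",
            "monthly unemployment rate for the united states",
            "placeholder: public API"),
    Dataset("American Community Survey", "Census",
            "household income, housing and demographic survey estimates",
            "placeholder: public API"),
]

results = rank("How do mortgage complaints vary by consumer income?", CATALOG)
for ds, score in results:
    print(f"{score}  {ds.source}: {ds.name} ({ds.access_note})")
```

With this toy catalog, the CFPB entry ranks first (three shared keywords) and the Census entry second (one shared keyword). The FRED entry shares no keywords and is dropped. Plain keyword overlap is chosen here only because it is the simplest scheme consistent with the "text parsing and keywords" step.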
    

Implementation highlights

Results and impact

Reported impact: manual dataset search time dropped from ~2 hours to <5 minutes. The project also includes an end-to-end analysis of CFPB data using OLS regression and t-tests.
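
For readers unfamiliar with the two techniques named above, here is a dependency-free sketch of a one-predictor OLS fit and a Welch two-sample t statistic. It is an assumption-laden illustration: the numbers are made up (not CFPB data), the function names are hypothetical, and the standard library is used in place of the pandas-based stack the project lists. The project does not say which t-test variant it used; Welch's is shown as one common choice.

```python
import math
from statistics import mean, variance


def ols_simple(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (one predictor)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def welch_t(a, b):
    """Welch's two-sample t statistic (does not assume equal variances)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Made-up toy numbers for illustration only.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]  # constructed to lie exactly on y = 1 + 2x
intercept, slope = ols_simple(x, y)

group_a = [4, 5, 6]
group_b = [1, 2, 3]
t_stat = welch_t(group_a, group_b)

print(f"intercept={intercept:.2f}, slope={slope:.2f}, t={t_stat:.3f}")
```

The sketch stops at the t statistic. Turning it into a p-value needs the t distribution, which a real analysis would typically get from a statistics library rather than compute by hand.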

Tech stack

Python, pandas, statistical modeling (OLS), hypothesis testing (t-tests), Jupyter.

Links

What I'd improve next

Add richer query understanding (synonyms, domain ontologies) and introduce relevance evaluation with labeled queries.
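
As a rough sketch of what those two improvements might look like: synonym expansion widens the keyword set before matching, and labeled queries allow a relevance score such as precision at k. Everything below is hypothetical, including the synonym map, the dataset IDs, the labels, and the choice of precision at k as the metric. None of it is something the project currently reports.

```python
# Hypothetical synonym map; a real one might come from a domain ontology.
SYNONYMS = {
    "jobless": {"unemployment"},
    "loan": {"credit", "mortgage"},
}


def expand(keywords):
    """Return the keywords plus any synonyms listed for them."""
    expanded = set(keywords)
    for kw in keywords:
        expanded |= SYNONYMS.get(kw, set())
    return expanded


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k recommendations that were labeled relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for dataset_id in top if dataset_id in relevant_ids) / len(top)


expanded = expand({"jobless", "rate"})

# One made-up labeled query: a ranked output and the IDs a reviewer marked relevant.
ranked = ["fred_unrate", "census_acs", "cfpb_complaints"]
relevant = {"fred_unrate", "cfpb_complaints"}
p_at_2 = precision_at_k(ranked, relevant, k=2)

print(sorted(expanded), p_at_2)
```

Here the expanded set gains "unemployment", and precision at 2 is 0.5 because one of the top two IDs is labeled relevant. Averaging such a score over a set of labeled queries would give a first relevance benchmark alongside the time-saving figure.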