Research Dataset Recommendation System

A research-support tool that matches free-text research questions to datasets across 25+ sources, reducing reported dataset discovery time from about 2 hours to under 5 minutes.

Problem

Finding a suitable dataset for a research question often means searching many separate data portals by hand, a process that took about 2 hours per question before this tool.

Context and constraints

Input: a research question (free text)
Output: ranked dataset recommendations with access notes
Catalog spans 25+ sources (e.g., CFPB, FRED, Census, WRDS)
Constraint: time savings is the only reported evaluation metric; no other metrics are claimed

Approach

Curate and maintain a dataset catalog across 25+ sources
Convert questions into keywords/signals and match against the catalog
Provide an end-to-end analysis example (CFPB complaints) using OLS regression and t-tests
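
The keyword-matching step above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: it assumes keyword overlap as the score, and the `CatalogEntry` fields, the stop-word list, and the three catalog rows are hypothetical placeholders.

```python
import re
from dataclasses import dataclass

# Hypothetical stop-word list; a real one would be larger.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to",
              "do", "does", "how", "what", "is", "are", "with", "by"}


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    source: str
    keywords: frozenset
    access_note: str


def extract_keywords(question):
    """Lowercase the question, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    return {t for t in tokens if t not in STOP_WORDS and len(t) > 2}


def rank(question, catalog, top_k=3):
    """Score each entry by keyword overlap; ties are broken by name."""
    query = extract_keywords(question)
    scored = []
    for entry in catalog:
        overlap = len(query & entry.keywords)
        if overlap:
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [entry for _, entry in scored[:top_k]]


# Illustrative catalog rows only; names, keywords, and notes are assumptions.
CATALOG = [
    CatalogEntry("Consumer complaints", "CFPB",
                 frozenset({"complaints", "consumer", "mortgage", "credit"}),
                 "illustrative access note"),
    CatalogEntry("Economic time series", "FRED",
                 frozenset({"interest", "rates", "inflation", "unemployment"}),
                 "illustrative access note"),
    CatalogEntry("Household survey", "Census",
                 frozenset({"income", "population", "housing"}),
                 "illustrative access note"),
]

for entry in rank("Do mortgage complaints rise with interest rates?", CATALOG):
    print(entry.source, "|", entry.name, "|", entry.access_note)
```

Overlap counting is the simplest possible scorer; it treats every keyword as equally informative, which is one reason weighting or synonym handling would be a natural refinement.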

Architecture

flowchart LR
      A["Research question"]
      B["Text parsing and keywords"]
      C["Dataset catalog (25+ sources)"]
      D["Matching and ranking"]
      E["Recommendation list + access notes"]
      F["Reproducible analysis notebook"]

      A --> B
      B --> C
      C --> D
      D --> E
      E --> F

Implementation highlights

Catalog design supports adding new sources with minimal friction
Separation of concerns between ingestion and ranking logic
Reproducible analysis workflow with publication-ready visuals
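
One way to get both properties in the first two points above is a loader registry: each source contributes one function, and ranking only ever sees plain records. The sketch below assumes that design; `register_source`, `build_catalog`, `rank_by_tag`, and the sample records are hypothetical names and data, not taken from the project.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]

# Registry of ingestion functions, keyed by source name (hypothetical design).
_LOADERS: Dict[str, Callable[[], Iterable[Record]]] = {}


def register_source(name):
    """Decorator that adds a loader to the registry under `name`."""
    def decorator(fn):
        _LOADERS[name] = fn
        return fn
    return decorator


# Adding a source means adding one decorated function; nothing else changes.
@register_source("CFPB")
def load_cfpb():
    yield {"title": "Consumer complaints", "tags": "complaints consumer finance"}


@register_source("FRED")
def load_fred():
    yield {"title": "Macro time series", "tags": "rates inflation macro"}


def build_catalog() -> List[Record]:
    """Ingestion: run every registered loader and stamp each record with its source."""
    catalog = []
    for source, loader in sorted(_LOADERS.items()):
        for record in loader():
            catalog.append({**record, "source": source})
    return catalog


def rank_by_tag(catalog: List[Record], term: str) -> List[Record]:
    """Ranking: works on plain records and knows nothing about loaders."""
    return [r for r in catalog if term in r["tags"].split()]


catalog = build_catalog()
print(rank_by_tag(catalog, "inflation"))
```

Because `rank_by_tag` depends only on the record shape, the ranking logic can be replaced or tested against a hand-written list without running any ingestion code.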

Results and impact

Reported impact: reduced manual dataset search time from ~2 hours to <5 minutes; included an end-to-end CFPB analysis using OLS regression and t-tests.
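
For readers unfamiliar with the two methods named above, the arithmetic behind a single-predictor OLS fit and a two-sample t-test can be sketched with the standard library alone. The write-up does not say which library or variables the CFPB analysis used, so the function names are hypothetical, the numbers are synthetic, and Welch's unequal-variance form of the t-test is an assumption.

```python
import math
from statistics import mean, variance


def ols_fit(x, y):
    """Least-squares slope and intercept for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Synthetic numbers for illustration only; not CFPB data.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
group_a = [12, 15, 11, 14, 13]
group_b = [9, 10, 8, 11, 7]

slope, intercept = ols_fit(x, y)
t_stat, df = welch_t(group_a, group_b)
print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"t={t_stat:.2f} df={df:.1f}")
```

Turning the t statistic into a p-value requires the t distribution's CDF, which a statistics library would normally supply; it is left out here to keep the sketch dependency-free.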

Tech stack

Python, pandas, statistical modeling (OLS), hypothesis testing (t-tests), Jupyter.

What I'd improve next

Add richer query understanding (synonyms, domain ontologies) and introduce relevance evaluation with labeled queries.
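
Both ideas can be prototyped in a few lines. The sketch below is a hypothetical starting point, not existing project code: the `SYNONYMS` table is a toy stand-in for a domain ontology, precision at k is one possible relevance measure among several, and the labeled example is made up for illustration rather than a measured result.

```python
# Toy synonym table standing in for a domain ontology (hypothetical entries).
SYNONYMS = {
    "loan": {"mortgage", "credit"},
    "jobs": {"employment", "unemployment"},
}


def expand(terms):
    """Return the query terms plus any synonyms listed for them."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that a human labeled relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for dataset in ranked[:k] if dataset in relevant)
    return hits / k


# Made-up labeled query: dataset ids "a" and "c" were judged relevant.
ranked_output = ["a", "b", "c", "d"]
labeled_relevant = {"a", "c"}

print(sorted(expand({"loan"})))
print(precision_at_k(ranked_output, labeled_relevant, k=2))
```

With a set of labeled queries in hand, averaging a measure like this across queries would give a baseline for comparing matching strategies before and after query expansion.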
