Scholare: Automated Literature Review Pipeline
An end-to-end, config-driven Python tool that searches academic literature, downloads papers, and generates structured research notes, ready to plug into any research topic.
Try it instantly, no installation required!
Scholare runs directly in your browser via Google Colab.
What It Does
- Searches free APIs: OpenAlex natively, plus preprint servers such as arXiv and bioRxiv/medRxiv.
- Enriches every result through Semantic Scholar: abstracts, TLDRs, DOIs, code/data hints (falls back to DOI lookups for highest accuracy).
- Discovers Open-Access Links dynamically using the Unpaywall API.
- Categorizes papers using configurable keyword rules.
- Downloads open-access PDFs into a local folder (with a --no-download CLI override).
- Generates visualizations: category distribution, open-access status, citation histogram, year timeline.
- Produces structured Markdown research notes: executive summary, taxonomy, top-cited, per-category breakdown with TLDRs, embedded charts, full paper index.
- Compares runs: pass a previous CSV to isolate newly discovered papers.
- Semantic Relevance Scoring (Optional): ranks papers using keyword heuristics or deep-learning embeddings via sentence-transformers.
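To make the open-access discovery step more concrete, here is a minimal sketch of how a link could be picked out of an Unpaywall response. It assumes the public Unpaywall v2 endpoint and its best_oa_location field; the helper names unpaywall_url and best_pdf_link are hypothetical and are not taken from Scholare's code. The sample record is made up for illustration, and no network request is made.

```python
from urllib.parse import quote

UNPAYWALL_BASE = "https://api.unpaywall.org/v2"


def unpaywall_url(doi, email):
    """Build the Unpaywall lookup URL for one DOI (hypothetical helper)."""
    return f"{UNPAYWALL_BASE}/{quote(doi, safe='/')}?email={quote(email)}"


def best_pdf_link(record):
    """Prefer a direct PDF link, fall back to the landing page, else None."""
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf") or location.get("url")


# Made-up response shaped like an Unpaywall record.
sample = {
    "is_oa": True,
    "best_oa_location": {
        "url": "https://example.org/paper",
        "url_for_pdf": "https://example.org/paper.pdf",
    },
}

print(unpaywall_url("10.1234/example", "you@example.com"))
print(best_pdf_link(sample))
```

Closed-access records typically carry no best_oa_location, which is why the helper returns None rather than raising.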
Setup
1. Prerequisites
- Python 3.10+
- (Optional) A Semantic Scholar API key from semanticscholar.org/product/api (highly recommended to avoid rate limits).
2. Install
From source (recommended for now):
git clone https://github.com/OWNER/scholare.git
cd scholare
python -m venv venv
# Activate
# Windows PowerShell:
.\venv\Scripts\Activate.ps1
# macOS / Linux:
source venv/bin/activate
pip install -e .
Eventually via PyPI:
3. Configure API Keys
Edit .env:
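The notebook example in this README sets S2_API_KEY and UNPAYWALL_EMAIL as environment variables, so those are presumably the names expected in .env as well (an assumption, since the .env template itself is not shown here). A minimal sketch for checking which of them are visible to the current Python process, using a hypothetical helper named missing_keys:

```python
import os

# Names taken from the notebook example in this README; whether .env
# uses exactly these names is an assumption.
EXPECTED_KEYS = ("S2_API_KEY", "UNPAYWALL_EMAIL")


def missing_keys(env=None):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_KEYS if not env.get(name)]


missing = missing_keys()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All expected keys are set.")
```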
4. Create Your Config
Edit my_config.json:
{
  "query": "your search query here",
  "limit": 30,
  "output_dir": "./my_output",
  "categories": {
    "Category A": ["keyword1", "keyword2"],
    "Other": []
  },
  "default_category": "Other",
  "download_pdfs": true,
  "sources": ["openalex", "arxiv", "biorxiv"],
  "search_intent": "your natural language description of what you are looking for",
  "use_embeddings": true,
  "compare_methods": false
}
| Field | Description |
|---|---|
| query | Search string (mapped appropriately across OpenAlex and preprints) |
| limit | Max number of papers to retrieve per API source |
| output_dir | Base output directory (subfolders auto-named by date + terms) |
| categories | Mapping from category name to keyword list, used for paper classification |
| default_category | Fallback when no keywords match |
| download_pdfs | Set false to skip PDF downloading by default |
| sources | (Optional) List of sources to query. Available: openalex, arxiv, biorxiv |
| search_intent | (Optional) Natural language phrase for semantic relevance scoring |
| use_embeddings | (Optional) Set to true to use sentence-transformers for ML-based relevance ranking (requires pip install scholare[ml]) |
| compare_methods | (Optional) Set to true to output both keyword and ML embedding scores for comparison in the CSV |
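As an illustration of how the categories and default_category fields could interact, the sketch below assigns the first category whose keywords occur in a paper's text. Case-insensitive substring matching and first-match-wins ordering are assumptions made for this example; Scholare's actual rules may differ, and the function name categorize is hypothetical.

```python
def categorize(text, categories, default_category="Other"):
    """Return the first category with a keyword found in the text."""
    lowered = text.lower()
    for name, keywords in categories.items():
        # An empty keyword list never matches, so it only acts as a fallback.
        if any(keyword.lower() in lowered for keyword in keywords):
            return name
    return default_category


rules = {
    "Privacy": ["differential privacy", "secure aggregation"],
    "Other": [],
}

print(categorize("Secure aggregation for federated learning", rules))  # Privacy
print(categorize("A survey of EEG decoding methods", rules))           # Other
```

Substring matching is simple but blunt: a very short keyword can match inside unrelated words, so longer phrases tend to give cleaner categories.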
Usage
CLI
# Run the pipeline
scholare --config my_config.json
# Skip downloading PDFs (overrides config)
scholare --config my_config.json --no-download
# Compare with a previous run
scholare --config my_config.json --previous-csv ./old_output/results.csv
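The idea behind --previous-csv can be sketched with the standard library alone: keep only the rows of the current results whose identifier was not present in the earlier run. The doi column name and the new_rows helper are assumptions for this example, not Scholare's documented schema or API.

```python
import csv
import io


def new_rows(previous_csv, current_csv, key="doi"):
    """Return rows from current_csv whose key is absent from previous_csv."""
    seen = {
        (row.get(key) or "").strip().lower()
        for row in csv.DictReader(previous_csv)
        if row.get(key)
    }
    return [
        row
        for row in csv.DictReader(current_csv)
        if (row.get(key) or "").strip().lower() not in seen
    ]


# In-memory stand-ins for two results.csv files.
previous = io.StringIO("doi,title\n10.1/a,Paper A\n")
current = io.StringIO("doi,title\n10.1/a,Paper A\n10.1/B,Paper B\n")

for row in new_rows(previous, current):
    print(row["title"])  # Paper B
```

Rows without an identifier are treated as new here, which errs on the side of showing a paper twice rather than hiding it.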
Programmatic & Cloud Notebooks (Colab / Kaggle)
Zero-install quick start: run Scholare directly in your browser!
No Python setup, no terminal, no installation. Just click and run.
You can also install the package manually in any cloud notebook, directly from GitHub.
Then, you can define your configuration natively in Python and pass it to the pipeline:
import os
from scholare.config import load_config
from scholare.pipeline import run_pipeline
# Setting API Keys:
# Method A: Direct Injection
# os.environ["S2_API_KEY"] = "your_key_here"
# os.environ["UNPAYWALL_EMAIL"] = "your_email@example.com"
# Method B: Secure Colab Secrets (Recommended)
# from google.colab import userdata
# os.environ["S2_API_KEY"] = userdata.get('S2_API_KEY')
# Define config as a dictionary mapping
my_config = {
    "query": "federated learning",
    "limit": 10,
    "output_dir": "./output",
    "categories": {"Privacy": ["dp"]},
    "default_category": "Other",
    "download_pdfs": False,
}
config_obj = load_config(my_config)
df = run_pipeline(config_obj)
print(f"Found {len(df)} papers")
[!TIP] See the full interactive Cloud Notebook Template (examples/cloud_notebook_template.ipynb) to get started immediately!
Output Structure
Each run creates a descriptive subfolder:
output/
└── 2026-02-25_EEG_BCI_melanin_bias/
    ├── papers/                  # Downloaded open-access PDFs
    ├── visualizations/          # PNG charts
    │   ├── category_distribution.png
    │   ├── open_access_status.png
    │   ├── citation_distribution.png
    │   └── year_distribution.png
    ├── research_notes.md        # Structured Markdown summary
    ├── results.csv              # Raw data
    └── new_discoveries.csv      # (if --previous-csv was used)
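Because each run folder starts with its date in YYYY-MM-DD form, sorting folder names alphabetically also sorts them chronologically. The sketch below uses that to locate the most recent run; latest_run is a hypothetical helper, and the demo folders are created in a temporary directory rather than a real output directory.

```python
import tempfile
from pathlib import Path


def latest_run(output_dir):
    """Return the newest run folder, relying on the date prefix in its name."""
    runs = sorted(
        (p for p in Path(output_dir).iterdir() if p.is_dir()),
        key=lambda p: p.name,
    )
    return runs[-1] if runs else None


# Demo with throwaway folders named like Scholare run directories.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("2026-02-24_first_run", "2026-02-25_second_run"):
        (Path(tmp) / name).mkdir()
    print(latest_run(tmp).name)  # 2026-02-25_second_run
```

The returned path can then be joined with results.csv to feed a later run's --previous-csv option.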
Example Configs
See the examples/ directory for ready-to-use configs:
- eeg_bias_config.json: EEG/BCI bias research
- federated_learning_config.json: Federated learning in healthcare
Roadmap
Scholare is actively growing. See ROADMAP.md for planned features including:
- More APIs: OpenAlex, Unpaywall, arXiv, Crossref, PubMed, CORE
- Export: BibTeX/RIS for Zotero & Mendeley
- Integration: Zotero/Mendeley library sync
- Chrome extension: one-click literature searches
- Analysis: citation networks, clustering, PRISMA diagrams
- UX: rich CLI, interactive HTML dashboards
- RAG Chat: interactive CLI chat across downloaded PDFs using local or cloud LLMs
Contributions welcome! Pick any item from the roadmap.
Contributing
We welcome contributions! See CONTRIBUTING.md for setup and guidelines.
Limitations
- Only open-access PDFs can be downloaded. Paywalled papers are noted with links.
- Without an API key, Semantic Scholar requests are rate-limited, which can slow enrichment on large searches. Adding an S2_API_KEY avoids this.
- Categorization is keyword-based (heuristic, not an AI classifier).
- bioRxiv search is simulated through the Crossref endpoint, so its results are handled slightly differently from those of other sources.
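For readers wondering what coping with rate limits can look like in practice, the following is a generic retry-with-exponential-backoff sketch. It does not describe Scholare's implementation: the RateLimited exception and the with_backoff helper are hypothetical names, and the flaky function merely simulates an API that rejects the first two calls.

```python
import time


class RateLimited(Exception):
    """Hypothetical error standing in for an HTTP 429 response."""


def with_backoff(call, retries=4, base_delay=1.0, sleep=time.sleep):
    """Run call(), waiting base_delay * 2**attempt between failed attempts."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimited:
            if attempt == retries - 1:
                raise  # Out of attempts: surface the error to the caller.
            sleep(base_delay * 2 ** attempt)


attempts = {"count": 0}


def flaky():
    """Simulated API call that is rate-limited twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimited()
    return "ok"


waits = []
print(with_backoff(flaky, sleep=waits.append))  # ok
print(waits)  # [1.0, 2.0]
```

Passing sleep as a parameter keeps the demo instant; leaving the default in place makes the helper actually pause between attempts.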
Package Structure
scholare/
├── __init__.py          # Public API
├── __main__.py          # CLI entry point
├── config.py            # Config loader
├── api.py               # OpenAlex, preprint, Unpaywall, Semantic Scholar clients
├── pipeline.py          # Main orchestration
├── downloader.py        # PDF downloading
├── notes.py             # Markdown research notes generator
├── visualizations.py    # Chart generation
└── utils.py             # Categorization & comparison helpers
License
MIT: use it, fork it, build on it.