# AGENTS.md

## Project overview

`paperlib` is a local-first paper library engine with a CLI.

It is designed to:

- import PDF papers into a structured local library
- convert PDFs into Markdown using external converters such as MinerU
- maintain stable per-paper metadata files and a searchable index database
- optionally generate AI-based structured summaries
- expose a clean CLI that is useful both for humans and for higher-level automation tools such as an arXiv daily digest workflow

`paperlib` is **not** primarily an AI app. AI summarization is an optional enrichment layer, not the core of the system.

The project should remain useful even when:

- no LLM API key is configured
- no summarization is enabled
- only import / convert / index / search features are used

---

## Core design principles

### 1. Local-first

User data lives locally in the paper library directory.

The library must remain usable without a server, web app, or remote database.

Prefer plain files plus SQLite over opaque internal state.

### 2. CLI-first

The CLI is the primary interface.

All important workflows should be accessible from the CLI.

The Python API is useful, but secondary.

### 3. JSON files are the source of truth

Per-paper JSON files in the library are the durable source of truth.

Examples:

- `meta.json`
- `summary.json`

SQLite is an index/cache layer, not the canonical data store.

This means:

- the index must be rebuildable from files
- `reindex` should be able to repair the database from on-disk records
- code must not assume the database alone is authoritative
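
To make the "rebuildable index" rule concrete, here is a minimal sketch of a rebuild that walks the library and repopulates SQLite from `meta.json` files. The function name `reindex`, the `papers` table, and its three columns are assumptions chosen for illustration; they are not the project's actual schema.

```python
import json
import sqlite3
from pathlib import Path


def reindex(library_root: Path, db_path: Path) -> int:
    """Rebuild a (hypothetical) papers table from every meta.json on disk.

    Returns the number of records indexed. The JSON files are only read,
    never modified, so the rebuild is safe to rerun.
    """
    db_path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        # Throw the old index away: it is a cache, not the source of truth.
        conn.execute("DROP TABLE IF EXISTS papers")
        conn.execute(
            "CREATE TABLE papers ("
            "paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
        )
        count = 0
        for meta_path in sorted((library_root / "papers").rglob("meta.json")):
            meta = json.loads(meta_path.read_text(encoding="utf-8"))
            conn.execute(
                "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
                (meta["paper_id"], meta.get("title"), str(meta_path)),
            )
            count += 1
        conn.commit()
        return count
    finally:
        conn.close()
```

Deleting the database file and calling this function again should produce the same index, which is the property the principle asks for.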

### 4. AI is optional enrichment

Importing, converting, indexing, listing, showing, and searching papers must work without AI.

AI summarization should be isolated behind a clean interface.

Do not make core workflows depend on an LLM provider.

### 5. Stable machine-readable interfaces

Important commands should support `--json` output so that other tools can consume them.

Examples:

- `paperlib import ... --json`
- `paperlib summarize ... --json`
- `paperlib show ... --json`
- `paperlib export ... --format json`

### 6. Small, explicit, inspectable components

Prefer simple and explicit logic over large hidden frameworks.

Keep components understandable:

- importer
- converter
- renderer
- summarizer
- search
- reindex
- doctor

Avoid unnecessary abstraction until there is a real need.

---

## Non-goals

The following are currently out of scope unless explicitly planned later:

- mandatory daemon architecture
- web UI
- multi-user remote service
- cloud-first design
- vector database as a required dependency
- opaque agent framework controlling the core library
- “fully autonomous research assistant” behavior

---

## Library data layout

The paper library on disk should be human-browsable.

A typical layout looks like:

```text
library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md

  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md

  db/
    paperlib.sqlite3

  cache/
```

Conventions:

- `meta.json` contains stable metadata and processing status
- `summary.json` contains structured AI-generated enrichment
- `summary.md` is rendered from `summary.json`
- `paper.md` is generated from the PDF by an external converter such as MinerU
- the database is rebuildable from the files above

---

## Data model boundaries

### `meta.json`

`meta.json` should contain deterministic or near-deterministic information, mostly from:

- import process
- file system state
- external paper metadata sources

Typical fields include:

- `paper_id`
- `source_type`
- `source_id`
- `title`
- `authors`
- `published_date`
- `updated_date`
- `categories`
- `pdf_path`
- `paper_md_path`
- `summary_json_path`
- `summary_md_path`
- `imported_at`
- `conversion_status`
- `summary_status`

Avoid putting speculative AI content into `meta.json`.
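
As an illustration of the field list above, the record could be modeled as a small dataclass. This is a sketch only: the field names come from the list, but the class name `PaperMeta`, the types, the defaults, and the `"pending"` status value are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class PaperMeta:
    """Hypothetical in-memory form of meta.json; types are assumed."""

    paper_id: str
    source_type: str
    source_id: str
    title: Optional[str] = None
    authors: list[str] = field(default_factory=list)
    published_date: Optional[str] = None
    updated_date: Optional[str] = None
    categories: list[str] = field(default_factory=list)
    pdf_path: Optional[str] = None
    paper_md_path: Optional[str] = None
    summary_json_path: Optional[str] = None
    summary_md_path: Optional[str] = None
    imported_at: Optional[str] = None
    conversion_status: str = "pending"  # assumed status vocabulary
    summary_status: str = "pending"  # assumed status vocabulary

    def to_json(self) -> str:
        # Sorted keys keep the file stable across rewrites and easy to diff.
        return json.dumps(asdict(self), indent=2, sort_keys=True, ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "PaperMeta":
        return cls(**json.loads(text))
```

Note that the sketch has no field for AI-generated content, in line with the rule that such content belongs in `summary.json`.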

### `summary.json`

`summary.json` is optional enrichment and may be regenerated.

It should contain structured fields such as:

- one-sentence summary
- problem statement
- method overview
- main results
- claimed contributions
- assumptions
- limitations
- problem tags
- technique tags
- entities
- relevance-to-user fields
- recommended sections

`summary.json` must include a schema version.
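
A minimal sketch of what checking the schema version might look like when a summary is loaded. The key names `schema_version`, `problem_tags`, and `technique_tags`, and the supported version `1`, are assumptions; the document only states that a version must be present.

```python
from typing import Any

# Assumed set of versions this code knows how to read.
SUPPORTED_SCHEMA_VERSIONS = {1}


def validate_summary(data: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the summary is usable."""
    problems: list[str] = []

    version = data.get("schema_version")
    if version is None:
        problems.append("missing schema_version")
    elif not isinstance(version, int) or version not in SUPPORTED_SCHEMA_VERSIONS:
        problems.append(f"unsupported schema_version: {version!r}")

    # Tag fields are optional here, but must be lists of strings when present.
    for key in ("problem_tags", "technique_tags"):
        value = data.get(key, [])
        if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
            problems.append(f"{key} must be a list of strings")

    return problems
```

Returning a list of problems instead of raising on the first one lets a command such as `doctor` report everything wrong with a file in one pass.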

### SQLite

SQLite stores searchable/indexed state and job-independent status.

It should help with:

- listing papers
- filtering and search
- path lookup
- tag lookup
- status overview

But it should never be treated as the only durable source of paper metadata.

---

## CLI philosophy

The CLI should be easy for humans and predictable for scripts.

### Important CLI expectations

- human-readable by default
- machine-readable with `--json`
- clear exit codes
- no hidden background magic
- no required daemon
- stable command names
- idempotent operations when possible

### Expected command families

Core commands include:

- `init`
- `import`
- `import-dir`
- `watch`
- `convert`
- `reindex`
- `doctor`
- `status`
- `list`
- `show`
- `search`
- `open`
- `print-path`
- `summarize`
- `render-summary`
- `export`

When implementing commands, preserve a clear separation between:

- mutation commands
- read/query commands

---

## Architecture guidelines

The codebase should be organized around a few clear layers.

### 1. Core domain logic

Pure Python logic for:

- identifying papers
- computing paths
- importing PDFs
- updating metadata
- converting PDFs to Markdown
- rendering summaries
- rebuilding the index

This layer should be testable without the CLI.

### 2. CLI layer

Thin wrappers around the core domain logic.

The CLI should:

- parse arguments
- call core functions
- format output
- handle exit codes

The CLI should not contain deep business logic.
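
The four responsibilities above can be sketched with `argparse` as follows. Everything specific here is hypothetical: `show_paper` stands in for a real core function, the in-memory record replaces a read of `meta.json`, and exit code `1` for a missing paper is an assumed convention.

```python
import argparse
import json
import sys
from typing import Optional, Sequence


class PaperNotFound(Exception):
    """User-facing error: the requested paper is not in the library."""


def show_paper(paper_id: str) -> dict:
    """Stand-in for a core function; a real one would read meta.json."""
    library = {"demo-paper": {"paper_id": "demo-paper", "title": "Example"}}
    if paper_id not in library:
        raise PaperNotFound(paper_id)
    return library[paper_id]


def main(argv: Optional[Sequence[str]] = None) -> int:
    # 1. Parse arguments.
    parser = argparse.ArgumentParser(prog="paperlib")
    sub = parser.add_subparsers(dest="command", required=True)
    show = sub.add_parser("show")
    show.add_argument("paper_id")
    show.add_argument("--json", action="store_true", dest="as_json")
    args = parser.parse_args(argv)

    # 2. Call the core function; translate known errors into exit codes.
    try:
        record = show_paper(args.paper_id)
    except PaperNotFound as exc:
        print(f"error: paper not found: {exc}", file=sys.stderr)
        return 1

    # 3. Format output: JSON on stdout for scripts, plain text for humans.
    if args.as_json:
        print(json.dumps(record, sort_keys=True))
    else:
        print(f"{record['paper_id']}  {record['title']}")

    # 4. Report success through the exit code.
    return 0
```

Because `main` takes `argv` and returns an integer, it can be exercised in tests without spawning a subprocess; an entry point would call `sys.exit(main())`.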

### 3. Optional integrations

External systems should live in integration modules, for example:

- MinerU wrapper
- filesystem watch integration
- ripgrep integration
- LLM provider integration

Keep these adapters isolated.

### 4. Optional AI layer

The AI summarization layer should be behind a stable abstraction.

For example:

- load prompt template
- load paper markdown
- load optional profile / vocabulary
- call provider
- validate structured output
- write `summary.json`
- render `summary.md`
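
The steps listed above might be wired together roughly as in the sketch below. It is deliberately partial: loading a profile or vocabulary and rendering `summary.md` are left out, the provider is passed in as a plain callable, and the `{paper}` placeholder, the function name, and the `schema_version` check are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def summarize_paper(
    paper_dir: Path,
    prompt_template: str,
    call_provider: Callable[[str], str],
) -> dict:
    """Run a minimal summarize step for one paper directory."""
    # Load paper markdown and fill the (hypothetical) {paper} placeholder.
    paper_md = (paper_dir / "paper.md").read_text(encoding="utf-8")
    prompt = prompt_template.replace("{paper}", paper_md)

    # Call the provider; this function does not know which one it is.
    raw = call_provider(prompt)

    # Validate structured output before anything is written to disk.
    summary = json.loads(raw)  # raises ValueError on non-JSON output
    if not isinstance(summary, dict) or "schema_version" not in summary:
        raise ValueError("provider output is not a summary object with schema_version")

    # Write summary.json only after validation succeeded.
    (paper_dir / "summary.json").write_text(
        json.dumps(summary, indent=2, sort_keys=True), encoding="utf-8"
    )
    return summary
```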

Avoid leaking provider-specific behavior into unrelated modules.

---

## AI collaboration guidelines

When using AI to help develop this project, the AI should follow these rules.

### 1. Respect the project boundaries

Do not redesign `paperlib` into:

- a web app
- a required daemon
- a monolithic agent system
- a chat-first interface

Unless explicitly asked, keep the project aligned with:

- local-first
- CLI-first
- JSON/SQLite-based architecture
- AI-optional enrichment

### 2. Prefer incremental changes

Make small, reviewable changes.

When implementing a feature:

- first clarify which module owns it
- avoid broad refactors unless necessary
- preserve existing CLI semantics unless intentionally changing them

### 3. Keep file formats stable

Changes to `meta.json` or `summary.json` are important.

If changing schemas:

- update the schema version
- update documentation
- consider migration or backward compatibility
- do not silently break existing libraries

### 4. Avoid hidden coupling

Do not make unrelated modules depend on each other unnecessarily.

For example:

- `search` should not depend on LLM code
- `import` should not require summarization
- `reindex` should not assume a specific converter
- `render-summary` should not require calling AI again

### 5. Prefer explicit data flow

When adding features, keep data flow obvious.

For example:

- `import` creates or updates metadata
- `convert` creates `paper.md`
- `summarize` creates `summary.json`
- `render-summary` creates `summary.md`
- `reindex` rebuilds SQLite from files

### 6. Do not invent capabilities

If a feature is not implemented yet, do not pretend it exists.

Examples:

- do not write code that assumes a daemon exists
- do not assume remote sync exists
- do not assume vector search exists
- do not assume arXiv-specific logic belongs in the core library

### 7. Prefer durable outputs over polished prose

When designing AI summarization outputs, favor:

- structured JSON
- stable field names
- grep-friendly rendered Markdown
- concise, reusable information

over:

- highly polished review prose
- flashy but unstable output formats

---

## Coding guidelines

### General style

- Prefer straightforward Python.
- Use type hints.
- Keep functions small and focused.
- Add docstrings to public functions and classes.
- Avoid overengineering.
- Prefer composition over deep inheritance.

### Error handling

- Fail clearly.
- Provide helpful error messages.
- Distinguish user-facing CLI errors from internal exceptions.
- Avoid silently swallowing errors.

### Logging

- Use structured and informative logging where useful.
- Avoid noisy logs in normal CLI output.
- Keep machine-readable command output clean when `--json` is used.

### File operations

- Be careful with moves, copies, and overwrites.
- Prefer atomic writes for JSON files when possible.
- Never corrupt existing metadata due to partial writes.
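
One common way to get atomic JSON writes is to write to a temporary file in the same directory and then swap it into place with `os.replace`. The helper below is a sketch under that approach; the name `write_json_atomic` is hypothetical.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomic(path: Path, data: Any) -> None:
    """Write JSON so readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the target directory so that os.replace
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, sort_keys=True, ensure_ascii=False)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Serialization or I/O failed: drop the temp file, keep the original.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

If serialization fails halfway, the existing `meta.json` is untouched because only the temporary file was written to.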

### Idempotence

Where possible, commands should behave safely when run multiple times.

Examples:

- re-importing the same file should detect duplicates
- `render-summary` should be repeatable
- `reindex` should be safe to rerun
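
Duplicate detection on re-import could be based on a content hash. The sketch below assumes SHA-256 over the PDF bytes, which the `sha256-...` directory name in the layout hints at but the document does not specify; both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(pdf_path: Path, known_hashes: set[str]) -> bool:
    """Return True if a file with identical content was already imported."""
    return file_sha256(pdf_path) in known_hashes
```

A byte-level hash only catches exact copies; the same paper downloaded twice with different embedded metadata would need a different check.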

### Testing

Add tests for:

- path layout logic
- metadata read/write behavior
- duplicate detection
- reindex behavior
- summary rendering
- search behavior
- CLI output contracts for core commands

Prefer unit tests for core logic and targeted integration tests for CLI behavior.

---

## Search design guidelines

Search should support at least two useful modes:

### 1. Field-aware structured search

Examples:

- tags
- authors
- categories
- titles
- summary fields
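
A field-aware query against the SQLite index might look like the sketch below. The `papers` table, its text columns, and the allow-list are assumptions; a real index could normalize authors and tags into separate tables instead.

```python
import sqlite3

# Assumed searchable text columns on a hypothetical papers table.
SEARCHABLE_FIELDS = {"title", "authors", "categories", "tags"}


def search_papers(conn: sqlite3.Connection, field: str, term: str) -> list[str]:
    """Return paper_ids whose given field contains the term."""
    if field not in SEARCHABLE_FIELDS:
        raise ValueError(f"unsupported search field: {field}")
    # The column name comes from the allow-list above; the term is always
    # bound as a parameter, never interpolated into the SQL text.
    query = f"SELECT paper_id FROM papers WHERE {field} LIKE ? ORDER BY paper_id"
    rows = conn.execute(query, (f"%{term}%",)).fetchall()
    return [row[0] for row in rows]
```

With SQLite's default settings `LIKE` is case-insensitive for ASCII characters, and `%` or `_` inside the term act as wildcards, so a stricter implementation would escape them.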

### 2. Full-text-friendly search

Support grep-like workflows and integration with tools such as `ripgrep`.

Do not require semantic/vector search as a baseline feature.

If semantic search is ever added later, it should be optional and must not displace simple grep/database search.

---

## Summarization design guidelines

Summarization should produce reusable structured outputs.

### Summarization goals

A summary should be useful for:

- later human review
- grep-style reverse lookup
- building daily/weekly reports
- indexing by problem/method/result
- personal research triage

### Summarization output

Prefer generating:

- `summary.json` as the canonical structured output
- `summary.md` rendered from JSON

Do not make free-form Markdown the only output.
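
Rendering `summary.md` from `summary.json` can be a pure function with no AI call, which keeps it repeatable. In the sketch below the JSON keys and section headings are assumptions standing in for whatever the real schema defines.

```python
from typing import Any

# (json_key, heading) pairs in output order; both columns are assumed names.
SECTIONS = [
    ("one_sentence_summary", "Summary"),
    ("problem_statement", "Problem"),
    ("limitations", "Limitations"),
    ("problem_tags", "Problem tags"),
]


def render_summary_md(summary: dict[str, Any]) -> str:
    """Render a summary dict to Markdown; the same input gives the same output."""
    lines: list[str] = []
    for key, heading in SECTIONS:
        value = summary.get(key)
        if value is None or value == []:
            continue  # skip unclear or empty fields instead of inventing text
        lines.append(f"## {heading}")
        lines.append("")
        if isinstance(value, list):
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(str(value))
        lines.append("")
    return "\n".join(lines)
```

Fixed headings and one bullet per list item keep the rendered file easy to search with grep-style tools.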

### Prompting guidelines

Prompts should instruct the model to:

- extract factual information
- avoid unsupported claims
- use concise and stable language
- prefer controlled vocabulary when available
- return structured JSON only
- use `null` or empty lists for unclear fields rather than hallucinating

### Provider abstraction

The summarizer should not be tightly coupled to a single LLM provider.

Use a provider abstraction so the project can support:

- OpenAI-compatible APIs
- local models later if desired
- different prompt templates and vocabularies
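
One possible shape for such an abstraction is a structural interface with a single method. The names `SummaryProvider`, `complete`, `StaticProvider`, and `run_summary` below are hypothetical, and the sketch contains no real network client.

```python
from typing import Protocol


class SummaryProvider(Protocol):
    """Anything with a complete(prompt) -> str method counts as a provider."""

    def complete(self, prompt: str) -> str:
        ...


class StaticProvider:
    """Test double that returns a fixed response and makes no network calls."""

    def __init__(self, response: str) -> None:
        self.response = response

    def complete(self, prompt: str) -> str:
        return self.response


def run_summary(provider: SummaryProvider, prompt: str) -> str:
    # Callers depend only on the protocol, so an OpenAI-compatible client or
    # a local model can be swapped in without touching this function.
    return provider.complete(prompt)
```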

---

## What belongs in `paperlib` vs higher-level tools

`paperlib` is the base library engine.

It should own:

- PDF import
- local storage layout
- conversion to Markdown
- metadata files
- summary files
- index maintenance
- CLI access to those capabilities

It should not own high-level discovery workflows such as:

- arXiv daily fetching
- personalized new-paper ranking
- daily digest generation
- automated paper downloading from external feeds

Those belong in higher-level tools that consume `paperlib`.

---

## Expected development workflow

When implementing a new feature, the preferred order is:

1. identify the owning module
2. define or update the data contract
3. implement the core logic
4. add tests
5. expose it through the CLI if appropriate
6. update docs and examples

If a change affects on-disk formats or CLI behavior, document it clearly.

---

## Decision heuristics

When uncertain, prefer the option that is:

- more local-first
- more inspectable
- easier to test
- easier to recover from
- less coupled to AI
- more stable for scripts
- less magical

Examples:

- prefer JSON + Markdown over opaque internal blobs
- prefer explicit CLI commands over hidden automation
- prefer rebuildable indexes over fragile single-source databases
- prefer optional AI enrichment over mandatory AI workflows

---

## Documentation expectations

Important features should be documented in:

- `README.md` for user-facing overview
- `docs/cli.md` for command behavior
- `docs/storage-layout.md` for on-disk structure
- `docs/summary-schema.md` for `summary.json`
- `docs/integration-guide.md` for higher-level tool integration

Keep docs aligned with actual behavior.

---

## If you are an AI agent contributing code

Before making a change, ask:

1. Does this belong in `paperlib`, or in a higher-level workflow project?
2. Does this preserve local-first and CLI-first design?
3. Does this make AI optional, not mandatory?
4. Does this keep JSON files as the durable source of truth?
5. Does this keep the system understandable to a developer reading the code later?

If the answer to any of these is no, reconsider the approach.