# AGENTS.md

## Project overview

`paperlib` is a local-first paper library engine with a CLI.

It is designed to:

- import PDF papers into a structured local library
- convert PDFs into Markdown using external converters such as MinerU
- maintain stable per-paper metadata files and a searchable index database
- optionally generate AI-based structured summaries
- expose a clean CLI that is useful both for humans and for higher-level automation tools such as an arXiv daily digest workflow

`paperlib` is **not** primarily an AI app. AI summarization is an optional enrichment layer, not the core of the system.

The project should remain useful even when:

- no LLM API key is configured
- no summarization is enabled
- only import / convert / index / search features are used

---

## Core design principles

### 1. Local-first

User data lives locally in the paper library directory.

The library must remain usable without a server, web app, or remote database.

Prefer plain files plus SQLite over opaque internal state.

### 2. CLI-first

The CLI is the primary interface.

All important workflows should be accessible from the CLI.

The Python API is useful, but secondary.

### 3. JSON files are the source of truth

Per-paper JSON files in the library are the durable source of truth.

Examples:

- `meta.json`
- `summary.json`

SQLite is an index/cache layer, not the canonical data store.

This means:

- the index must be rebuildable from files
- `reindex` should be able to repair the database from on-disk records
- code must not assume the database alone is authoritative
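
As a rough illustration of the rebuildable-index rule, the sketch below walks every `meta.json` under `papers/` and recreates a small SQLite table from scratch. The function name `rebuild_index`, the `papers` table, and its three columns are assumptions made for this example, not the project's actual schema.

```python
import json
import sqlite3
from pathlib import Path


def rebuild_index(library_root: Path, db_path: Path) -> int:
    """Recreate a minimal index table from the meta.json files on disk.

    Hypothetical sketch: table name and columns are illustrative only.
    Returns the number of papers indexed.
    """
    count = 0
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back pending changes on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers ("
                "paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
            )
            # The files are authoritative; the table is derived from them.
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
                    (meta["paper_id"], meta.get("title"), str(meta_path)),
                )
                count += 1
    finally:
        conn.close()
    return count
```

Because the table is dropped and recreated on each run, calling the function twice yields the same index, which matches the expectation that `reindex` is safe to rerun.
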

### 4. AI is optional enrichment

Importing, converting, indexing, listing, showing, and searching papers must work without AI.

AI summarization should be isolated behind a clean interface.

Do not make core workflows depend on an LLM provider.

### 5. Stable machine-readable interfaces

Important commands should support `--json` output so that other tools can consume them.

Examples:

- `paperlib import ... --json`
- `paperlib summarize ... --json`
- `paperlib show ... --json`
- `paperlib export ... --format json`

### 6. Small, explicit, inspectable components

Prefer simple and explicit logic over large hidden frameworks.

Keep components understandable:

- importer
- converter
- renderer
- summarizer
- search
- reindex
- doctor

Avoid unnecessary abstraction until there is a real need.

---

## Non-goals

The following are currently out of scope unless explicitly planned later:

- mandatory daemon architecture
- web UI
- multi-user remote service
- cloud-first design
- vector database as a required dependency
- opaque agent framework controlling the core library
- “fully autonomous research assistant” behavior

---

## Library data layout

The paper library on disk should be human-browsable.

A typical layout looks like:

```text
library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md

  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md

  db/
    paperlib.sqlite3

  cache/
```

Conventions:

- `meta.json` contains stable metadata and processing status
- `summary.json` contains structured AI-generated enrichment
- `summary.md` is rendered from `summary.json`
- `paper.md` is generated from the PDF by an external converter such as MinerU
- the database is rebuildable from the files above
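
One way to keep this layout in a single place is a small pure function that computes per-paper paths. The sketch below is an assumption-laden illustration: the names `paper_dir` and `paper_files` are hypothetical, the year is passed in explicitly because the source does not say how it is derived, and the local `source_id` is taken as an already-formed directory name.

```python
from pathlib import Path
from typing import Dict, Optional

# File names taken from the layout above.
PAPER_FILE_NAMES = ("meta.json", "source.pdf", "paper.md", "summary.json", "summary.md")


def paper_dir(
    library_root: Path, source_type: str, source_id: str, year: Optional[int] = None
) -> Path:
    """Return the directory for one paper (hypothetical helper)."""
    papers = library_root / "papers"
    if source_type == "arxiv":
        if year is None:
            raise ValueError("arxiv papers need a year for the papers/arxiv/<year>/ level")
        return papers / "arxiv" / str(year) / source_id
    if source_type == "local":
        return papers / "local" / source_id
    raise ValueError(f"unknown source_type: {source_type!r}")


def paper_files(directory: Path) -> Dict[str, Path]:
    """Map each conventional file name to its full path inside a paper directory."""
    return {name: directory / name for name in PAPER_FILE_NAMES}
```

Keeping path logic free of I/O makes it straightforward to unit test without touching a real library.
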

---

## Data model boundaries

### `meta.json`

`meta.json` should contain deterministic or near-deterministic information, mostly from:

- import process
- file system state
- external paper metadata sources

Typical fields include:

- `paper_id`
- `source_type`
- `source_id`
- `title`
- `authors`
- `published_date`
- `updated_date`
- `categories`
- `pdf_path`
- `paper_md_path`
- `summary_json_path`
- `summary_md_path`
- `imported_at`
- `conversion_status`
- `summary_status`

Avoid putting speculative AI content into `meta.json`.
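
The field list above could be modeled as a plain dataclass. This is only a sketch: the field names come from the list, but the types, the defaults, and the `"pending"` status value are assumptions, as is the class name `PaperMeta`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PaperMeta:
    """Hypothetical in-memory form of meta.json; types and defaults are assumed."""

    paper_id: str
    source_type: str
    source_id: str
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    published_date: Optional[str] = None
    updated_date: Optional[str] = None
    categories: List[str] = field(default_factory=list)
    pdf_path: Optional[str] = None
    paper_md_path: Optional[str] = None
    summary_json_path: Optional[str] = None
    summary_md_path: Optional[str] = None
    imported_at: Optional[str] = None
    conversion_status: str = "pending"  # assumed status vocabulary
    summary_status: str = "pending"  # assumed status vocabulary

    def to_json(self) -> str:
        # Sorted keys keep the file diff-friendly across rewrites.
        return json.dumps(asdict(self), indent=2, sort_keys=True, ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "PaperMeta":
        return cls(**json.loads(text))
```

Note that nothing in this structure holds model-generated text, in line with keeping AI content out of `meta.json`.
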

### `summary.json`

`summary.json` is optional enrichment and may be regenerated.

It should contain structured fields such as:

- one-sentence summary
- problem statement
- method overview
- main results
- claimed contributions
- assumptions
- limitations
- problem tags
- technique tags
- entities
- relevance-to-user fields
- recommended sections

`summary.json` must include a schema version.
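
A loader can check the schema version before trusting the rest of the file. The sketch below assumes a top-level `schema_version` integer and the keys `problem_tags` and `technique_tags`; the source names these concepts but not their exact JSON keys, so treat every identifier here as hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical: the set of schema versions this code knows how to read.
SUPPORTED_SCHEMA_VERSIONS = {1}


def validate_summary(summary: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the summary looks usable."""
    problems: List[str] = []
    version = summary.get("schema_version")
    if version is None:
        problems.append("missing schema_version")
    elif not isinstance(version, int) or version not in SUPPORTED_SCHEMA_VERSIONS:
        problems.append(f"unsupported schema_version: {version!r}")
    # Tag fields, when present, should be lists of strings.
    for key in ("problem_tags", "technique_tags"):
        value = summary.get(key, [])
        if not isinstance(value, list) or not all(isinstance(t, str) for t in value):
            problems.append(f"{key} must be a list of strings")
    return problems
```

Returning a list of problems instead of raising on the first one gives a command such as `doctor` something concrete to report.
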

### SQLite

SQLite stores searchable/indexed state and job-independent status.

It should help with:

- listing papers
- filtering and search
- path lookup
- tag lookup
- status overview

But it should never be treated as the only durable source of paper metadata.

---

## CLI philosophy

The CLI should be easy for humans and predictable for scripts.

### Important CLI expectations

- human-readable by default
- machine-readable with `--json`
- clear exit codes
- no hidden background magic
- no required daemon
- stable command names
- idempotent operations when possible
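
To make the first three expectations concrete, here is a minimal `argparse` sketch of a single command with a human-readable default, a `--json` mode, and an explicit integer return code. The counters printed by `status` are placeholder data invented for the example; the real command's output fields are not specified here.

```python
import argparse
import json
from typing import List, Optional


def cmd_status(as_json: bool) -> int:
    # Placeholder data (assumption); a real command would call core functions.
    result = {"papers": 0, "converted": 0, "summarized": 0}
    if as_json:
        # Exactly one JSON document on stdout, nothing else.
        print(json.dumps(result, sort_keys=True))
    else:
        for key, value in result.items():
            print(f"{key}: {value}")
    return 0


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(prog="paperlib")
    sub = parser.add_subparsers(dest="command", required=True)
    status = sub.add_parser("status", help="show library status")
    status.add_argument("--json", action="store_true", dest="as_json")
    args = parser.parse_args(argv)
    if args.command == "status":
        return cmd_status(args.as_json)
    return 2  # unreachable with the single subcommand above
```

An entry point would typically call `sys.exit(main())` so that the returned integer becomes the process exit code.
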

### Expected command families

Core commands include:

- `init`
- `import`
- `import-dir`
- `watch`
- `convert`
- `reindex`
- `doctor`
- `status`
- `list`
- `show`
- `search`
- `open`
- `print-path`
- `summarize`
- `render-summary`
- `export`

When implementing commands, preserve a clear separation between:

- mutation commands
- read/query commands

---

## Architecture guidelines

The codebase should be organized around a few clear layers.

### 1. Core domain logic

Pure Python logic for:

- identifying papers
- computing paths
- importing PDFs
- updating metadata
- converting PDFs to Markdown
- rendering summaries
- rebuilding the index

This layer should be testable without the CLI.

### 2. CLI layer

Thin wrappers around the core domain logic.

The CLI should:

- parse arguments
- call core functions
- format output
- handle exit codes

The CLI should not contain deep business logic.

### 3. Optional integrations

External systems should live in integration modules, for example:

- MinerU wrapper
- filesystem watch integration
- ripgrep integration
- LLM provider integration

Keep these adapters isolated.

### 4. Optional AI layer

The AI summarization layer should be behind a stable abstraction.

For example:

- load prompt template
- load paper markdown
- load optional profile / vocabulary
- call provider
- validate structured output
- write `summary.json`
- render `summary.md`

Avoid leaking provider-specific behavior into unrelated modules.
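
The steps listed above might be wired together roughly as follows. This is a sketch under stated assumptions: `SummaryProvider`, its `complete` method, the `{paper}` placeholder, and the `schema_version` key are all hypothetical names, and the profile/vocabulary and rendering steps are left out for brevity.

```python
import json
from pathlib import Path
from typing import Any, Dict, Protocol


class SummaryProvider(Protocol):
    """Hypothetical provider interface: prompt text in, raw model text out."""

    def complete(self, prompt: str) -> str:
        ...


def summarize_paper(
    paper_dir: Path, prompt_template: str, provider: SummaryProvider
) -> Dict[str, Any]:
    """Load paper.md, call the provider, validate, and write summary.json."""
    paper_md = (paper_dir / "paper.md").read_text(encoding="utf-8")
    prompt = prompt_template.replace("{paper}", paper_md)
    raw = provider.complete(prompt)
    summary = json.loads(raw)  # raises ValueError if the output is not JSON
    if not isinstance(summary, dict) or "schema_version" not in summary:
        raise ValueError("provider output is not a versioned summary object")
    (paper_dir / "summary.json").write_text(
        json.dumps(summary, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return summary
```

Since the function only depends on the `complete` method, tests can pass a stub provider that returns canned JSON and never touches a network.
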

---

## AI collaboration guidelines

When using AI to help develop this project, the AI should follow these rules.

### 1. Respect the project boundaries

Do not redesign `paperlib` into:

- a web app
- a required daemon
- a monolithic agent system
- a chat-first interface

Unless explicitly asked, keep the project aligned with:

- local-first
- CLI-first
- JSON/SQLite-based architecture
- AI-optional enrichment

### 2. Prefer incremental changes

Make small, reviewable changes.

When implementing a feature:

- first clarify which module owns it
- avoid broad refactors unless necessary
- preserve existing CLI semantics unless intentionally changing them

### 3. Keep file formats stable

Changes to `meta.json` or `summary.json` are important.

If changing schemas:

- update the schema version
- update documentation
- consider migration or backward compatibility
- do not silently break existing libraries

### 4. Avoid hidden coupling

Do not make unrelated modules depend on each other unnecessarily.

For example:

- `search` should not depend on LLM code
- `import` should not require summarization
- `reindex` should not assume a specific converter
- `render-summary` should not require calling AI again

### 5. Prefer explicit data flow

When adding features, keep data flow obvious.

For example:

- `import` creates or updates metadata
- `convert` creates `paper.md`
- `summarize` creates `summary.json`
- `render-summary` creates `summary.md`
- `reindex` rebuilds SQLite from files

### 6. Do not invent capabilities

If a feature is not implemented yet, do not pretend it exists.

Examples:

- do not write code that assumes a daemon exists
- do not assume remote sync exists
- do not assume vector search exists
- do not assume arXiv-specific logic belongs in the core library

### 7. Prefer durable outputs over polished prose

When designing AI summarization outputs, favor:

- structured JSON
- stable field names
- grep-friendly rendered Markdown
- concise, reusable information

over:

- highly polished review prose
- flashy but unstable output formats

---

## Coding guidelines

### General style

- Prefer straightforward Python.
- Use type hints.
- Keep functions small and focused.
- Add docstrings to public functions and classes.
- Avoid overengineering.
- Prefer composition over deep inheritance.

### Error handling

- Fail clearly.
- Provide helpful error messages.
- Distinguish user-facing CLI errors from internal exceptions.
- Avoid silently swallowing errors.
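
One possible way to separate user-facing errors from internal ones is a dedicated exception type that the CLI wrapper turns into a short message and a nonzero exit code, while anything else propagates with its traceback. The names `UserError` and `run_cli`, and the exit code `1`, are assumptions for this sketch.

```python
import sys
from typing import Callable


class UserError(Exception):
    """An expected, user-facing failure such as a missing file or bad argument."""


def run_cli(command: Callable[[], None]) -> int:
    """Run a command; report UserError briefly, let internal errors propagate."""
    try:
        command()
    except UserError as exc:
        # Short message on stderr so stdout stays clean for --json consumers.
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0
```

Unexpected exceptions are deliberately not caught here, so bugs surface loudly instead of being swallowed.
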

### Logging

- Use structured and informative logging where useful.
- Avoid noisy logs in normal CLI output.
- Keep machine-readable command output clean when `--json` is used.

### File operations

- Be careful with moves, copies, and overwrites.
- Prefer atomic writes for JSON files when possible.
- Never corrupt existing metadata due to partial writes.
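
A common pattern for atomic JSON writes is to write to a temporary file in the same directory and then rename it over the target with `os.replace`. The helper name `write_json_atomic` below is hypothetical; the sketch assumes the temporary file and the target live on the same filesystem.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomic(path: Path, data: Any) -> None:
    """Write JSON to a temp file next to the target, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, ensure_ascii=False)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        # The rename replaces the old file in one step on the same filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file; the existing target is left untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If serialization fails halfway, only the temporary file is affected, so an existing `meta.json` keeps its previous contents.
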

### Idempotence

Where possible, commands should behave safely when run multiple times.

Examples:

- re-importing the same file should detect duplicates
- `render-summary` should be repeatable
- `reindex` should be safe to rerun
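
Duplicate detection on re-import could be based on a content hash. The sketch below compares SHA-256 digests of file bytes; the helper names are hypothetical, and rehashing every stored PDF on each import is a simplification (a real implementation would more likely look the digest up in metadata or the index).

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate(pdf: Path, library_root: Path) -> Optional[Path]:
    """Return an already-imported source.pdf with identical bytes, if any."""
    wanted = file_sha256(pdf)
    for existing in sorted((library_root / "papers").rglob("source.pdf")):
        if file_sha256(existing) == wanted:
            return existing
    return None
```

An importer can call `find_duplicate` first and skip the copy when it returns a path, which keeps repeated imports of the same file harmless.
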

### Testing

Add tests for:

- path layout logic
- metadata read/write behavior
- duplicate detection
- reindex behavior
- summary rendering
- search behavior
- CLI output contracts for core commands

Prefer unit tests for core logic and targeted integration tests for CLI behavior.

---

## Search design guidelines

Search should support at least two useful modes:

### 1. Field-aware structured search

Examples:

- tags
- authors
- categories
- titles
- summary fields

### 2. Full-text-friendly search

Support grep-like workflows and integration with tools such as `ripgrep`.

Do not require semantic/vector search as a baseline feature.

If semantic search is ever added later, it should be optional and must not displace simple grep/database search.

---

## Summarization design guidelines

Summarization should produce reusable structured outputs.

### Summarization goals

A summary should be useful for:

- later human review
- grep-style reverse lookup
- building daily/weekly reports
- indexing by problem/method/result
- personal research triage

### Summarization output

Prefer generating:

- `summary.json` as the canonical structured output
- `summary.md` rendered from JSON

Do not make free-form Markdown the only output.
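
Rendering `summary.md` from `summary.json` can be a pure, deterministic function with no AI call. The sketch below is one hypothetical rendering: it uses the JSON keys themselves as section headings and skips empty fields, which is an assumption about the output style and not a documented format.

```python
from typing import Any, Dict, List


def render_summary_md(summary: Dict[str, Any]) -> str:
    """Render a summary dict as grep-friendly Markdown (hypothetical format)."""
    lines: List[str] = ["# Summary", ""]
    for key in sorted(summary):  # sorted keys give stable, repeatable output
        value = summary[key]
        if value is None or value == []:
            continue  # unclear fields stay out of the rendered file
        lines.append(f"## {key}")
        lines.append("")
        if isinstance(value, list):
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(str(value))
        lines.append("")
    return "\n".join(lines)
```

Because the output depends only on the input dict, running the renderer repeatedly produces identical files, which fits the expectation that `render-summary` is repeatable.
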

### Prompting guidelines

Prompts should instruct the model to:

- extract factual information
- avoid unsupported claims
- use concise and stable language
- prefer controlled vocabulary when available
- return structured JSON only
- use `null` or empty lists for unclear fields rather than hallucinating

### Provider abstraction

The summarizer should not be tightly coupled to a single LLM provider.

Use a provider abstraction so the project can support:

- OpenAI-compatible APIs
- local models later if desired
- different prompt templates and vocabularies

---

## What belongs in `paperlib` vs higher-level tools

`paperlib` is the base library engine.

It should own:

- PDF import
- local storage layout
- conversion to Markdown
- metadata files
- summary files
- index maintenance
- CLI access to those capabilities

It should not own high-level discovery workflows such as:

- arXiv daily fetching
- personalized new-paper ranking
- daily digest generation
- automated paper downloading from external feeds

Those belong in higher-level tools that consume `paperlib`.

---

## Expected development workflow

When implementing a new feature, the preferred order is:

1. identify the owning module
2. define or update the data contract
3. implement the core logic
4. add tests
5. expose it through the CLI if appropriate
6. update docs and examples

If a change affects on-disk formats or CLI behavior, document it clearly.

---

## Decision heuristics

When uncertain, prefer the option that is:

- more local-first
- more inspectable
- easier to test
- easier to recover from
- less coupled to AI
- more stable for scripts
- less magical

Examples:

- prefer JSON + Markdown over opaque internal blobs
- prefer explicit CLI commands over hidden automation
- prefer rebuildable indexes over fragile single-source databases
- prefer optional AI enrichment over mandatory AI workflows

---

## Documentation expectations

Important features should be documented in:

- `README.md` for user-facing overview
- `docs/cli.md` for command behavior
- `docs/storage-layout.md` for on-disk structure
- `docs/summary-schema.md` for `summary.json`
- `docs/integration-guide.md` for higher-level tool integration

Keep docs aligned with actual behavior.

---

## If you are an AI agent contributing code

Before making a change, ask:

1. Does this belong in `paperlib`, or in a higher-level workflow project?
2. Does this preserve local-first and CLI-first design?
3. Does this make AI optional, not mandatory?
4. Does this keep JSON files as the durable source of truth?
5. Does this keep the system understandable to a developer reading the code later?

If the answer to any of these is no, reconsider the approach.