update: update the dev-docs for AI agent
@@ -4,652 +4,81 @@
|
|||||||
|
|
||||||
`paperlib` is a local-first paper library engine with a CLI.
|
`paperlib` is a local-first paper library engine with a CLI.
|
||||||
|
|
||||||
It is designed to:
|
**Key point**: `paperlib` is **not** primarily an AI app. AI summarization is optional enrichment. The project must remain useful without LLM configuration.
|
||||||
|
|
||||||
- import PDF papers into a structured local library
|
## Critical design principles
|
||||||
- convert PDFs into Markdown using external converters such as MinerU
|
|
||||||
- maintain stable per-paper metadata files and a searchable index database
|
|
||||||
- optionally generate AI-based structured summaries
|
|
||||||
- expose a clean CLI that is useful both for humans and for higher-level automation tools such as an arXiv daily digest workflow
|
|
||||||
|
|
||||||
`paperlib` is **not** primarily an AI app. AI summarization is an optional enrichment layer, not the core of the system.
|
1. **Local-first**: User data lives locally. Prefer plain files + SQLite over opaque state.
|
||||||
|
2. **CLI-first**: The CLI is the primary interface. Python API is secondary.
|
||||||
|
3. **JSON files are source of truth**: Per-paper JSON files are durable truth. SQLite is rebuildable index/cache.
|
||||||
|
4. **AI is optional**: Core workflows (import/convert/index/list/show/search) work without AI.
|
||||||
|
5. **Machine-readable**: Commands support `--json` output for automation.
|
||||||
|
|
||||||
The project should remain useful even when:
|
## Development commands
|
||||||
|
|
||||||
- no LLM API key is configured
|
- **Testing**: `uv run pytest` (specific: `uv run pytest tests/test_models.py`)
|
||||||
- no summarization is enabled
|
- **Linting**: `uv run ruff check src/`
|
||||||
- only import / convert / index / search features are used
|
- **Formatting**: `uv run ruff format`
|
||||||
|
- **CLI testing**: `uv run paperlib --help` or `uv run paperlib init .tmp/test-lib`
|
||||||
|
|
||||||
---
|
**Always use `uv run` for Python commands. Use `./.tmp` for test libraries (it's tmpfs).**
|
||||||
|
|
||||||
## Core design principles
|
## Current CLI commands
|
||||||
|
|
||||||
### 1. Local-first
|
**Implemented**:
|
||||||
|
- `init` - Initialize library
|
||||||
|
- `status` - Show library config
|
||||||
|
- `list` - List papers
|
||||||
|
- `show` - Show paper details
|
||||||
|
- `search` - Search papers
|
||||||
|
- `import` - Import papers (PDF/arXiv)
|
||||||
|
- `convert` - Convert PDFs to Markdown (MinerU)
|
||||||
|
- `reindex` - Rebuild search index
|
||||||
|
|
||||||
User data lives locally in the paper library directory.
|
**Planned**: `import-dir`, `watch`, `doctor`, `open`, `print-path`, `summarize`, `render-summary`, `export`
|
||||||
|
|
||||||
The library must remain usable without a server, web app, or remote database.
|
## Critical constraints
|
||||||
|
|
||||||
Prefer plain files plus SQLite over opaque internal state.
|
### What paperlib IS
|
||||||
|
- PDF import and local storage
|
||||||
|
- PDF → Markdown conversion
|
||||||
|
- Metadata files and search indexing
|
||||||
|
- CLI for all operations
|
||||||
|
- Optional AI summarization
|
||||||
|
|
||||||
### 2. CLI-first
|
### What paperlib is NOT
|
||||||
|
- Web UI or daemon
|
||||||
|
- Multi-user service
|
||||||
|
- Cloud-first design
|
||||||
|
- Vector database requirement
|
||||||
|
- Autonomous research assistant
|
||||||
|
|
||||||
The CLI is the primary interface.
|
### File format stability
|
||||||
|
Changes to `meta.json` or `summary.json` schemas are breaking changes. Must update schema version and consider migration.
|
||||||
All important workflows should be accessible from the CLI.
|
|
||||||
|
|
||||||
The Python API is useful, but secondary.
|
|
||||||
|
|
||||||
### 3. JSON files are the source of truth
|
|
||||||
|
|
||||||
Per-paper JSON files in the library are the durable source of truth.
|
|
||||||
|
|
||||||
Examples:
|
|
||||||
|
|
||||||
- `meta.json`
|
|
||||||
- `summary.json`
|
|
||||||
|
|
||||||
SQLite is an index/cache layer, not the canonical data store.
|
|
||||||
|
|
||||||
This means:
|
|
||||||
|
|
||||||
- the index must be rebuildable from files
|
|
||||||
- `reindex` should be able to repair the database from on-disk records
|
|
||||||
- code must not assume the database alone is authoritative
|
|
||||||
|
|
||||||
### 4. AI is optional enrichment
|
|
||||||
|
|
||||||
Importing, converting, indexing, listing, showing, and searching papers must work without AI.
|
|
||||||
|
|
||||||
AI summarization should be isolated behind a clean interface.
|
|
||||||
|
|
||||||
Do not make core workflows depend on an LLM provider.
|
|
||||||
|
|
||||||
### 5. Stable machine-readable interfaces
|
|
||||||
|
|
||||||
Important commands should support `--json` output so that other tools can consume them.
|
|
||||||
|
|
||||||
Examples:
|
|
||||||
|
|
||||||
- `paperlib import ... --json`
|
|
||||||
- `paperlib summarize ... --json`
|
|
||||||
- `paperlib show ... --json`
|
|
||||||
- `paperlib export ... --format json`
|
|
||||||
|
|
||||||
### 6. Small, explicit, inspectable components
|
|
||||||
|
|
||||||
Prefer simple and explicit logic over large hidden frameworks.
|
|
||||||
|
|
||||||
Keep components understandable:
|
|
||||||
|
|
||||||
- importer
|
|
||||||
- converter
|
|
||||||
- renderer
|
|
||||||
- summarizer
|
|
||||||
- search
|
|
||||||
- reindex
|
|
||||||
- doctor
|
|
||||||
|
|
||||||
Avoid unnecessary abstraction until there is a real need.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Non-goals
|
|
||||||
|
|
||||||
The following are currently out of scope unless explicitly planned later:
|
|
||||||
|
|
||||||
- mandatory daemon architecture
|
|
||||||
- web UI
|
|
||||||
- multi-user remote service
|
|
||||||
- cloud-first design
|
|
||||||
- vector database as a required dependency
|
|
||||||
- opaque agent framework controlling the core library
|
|
||||||
- “fully autonomous research assistant” behavior
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Library data layout
|
|
||||||
|
|
||||||
The paper library on disk should be human-browsable.
|
|
||||||
|
|
||||||
A typical layout looks like:
|
|
||||||
|
|
||||||
```text
|
|
||||||
library_root/
|
|
||||||
config/
|
|
||||||
config.toml
|
|
||||||
vocab.yaml
|
|
||||||
prompts/
|
|
||||||
summarize_paper.md
|
|
||||||
|
|
||||||
inbox/
|
|
||||||
papers/
|
|
||||||
arxiv/
|
|
||||||
2026/
|
|
||||||
2604.12345/
|
|
||||||
meta.json
|
|
||||||
source.pdf
|
|
||||||
paper.md
|
|
||||||
summary.json
|
|
||||||
summary.md
|
|
||||||
ref.bib
|
|
||||||
assets/
|
|
||||||
logs/
|
|
||||||
mineru.log
|
|
||||||
local/
|
|
||||||
sha256-.../
|
|
||||||
meta.json
|
|
||||||
source.pdf
|
|
||||||
paper.md
|
|
||||||
summary.json
|
|
||||||
summary.md
|
|
||||||
|
|
||||||
db/
|
|
||||||
paperlib.sqlite3
|
|
||||||
|
|
||||||
cache/
|
|
||||||
```
|
|
||||||
|
|
||||||
Conventions:
|
|
||||||
|
|
||||||
- `meta.json` contains stable metadata and processing status
|
|
||||||
- `summary.json` contains structured AI-generated enrichment
|
|
||||||
- `summary.md` is rendered from `summary.json`
|
|
||||||
- `paper.md` is generated from the PDF by an external converter such as MinerU
|
|
||||||
- the database is rebuildable from the files above
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Data model boundaries
|
|
||||||
|
|
||||||
### `meta.json`
|
|
||||||
|
|
||||||
`meta.json` should contain deterministic or near-deterministic information, mostly from:
|
|
||||||
|
|
||||||
- import process
|
|
||||||
- file system state
|
|
||||||
- external paper metadata sources
|
|
||||||
|
|
||||||
Typical fields include:
|
|
||||||
|
|
||||||
- `paper_id`
|
|
||||||
- `source_type`
|
|
||||||
- `source_id`
|
|
||||||
- `title`
|
|
||||||
- `authors`
|
|
||||||
- `published_date`
|
|
||||||
- `updated_date`
|
|
||||||
- `categories`
|
|
||||||
- `pdf_path`
|
|
||||||
- `paper_md_path`
|
|
||||||
- `summary_json_path`
|
|
||||||
- `summary_md_path`
|
|
||||||
- `imported_at`
|
|
||||||
- `conversion_status`
|
|
||||||
- `summary_status`
|
|
||||||
|
|
||||||
Avoid putting speculative AI content into `meta.json`.
|
|
||||||
|
|
||||||
### `summary.json`
|
|
||||||
|
|
||||||
`summary.json` is optional enrichment and may be regenerated.
|
|
||||||
|
|
||||||
It should contain structured fields such as:
|
|
||||||
|
|
||||||
- one-sentence summary
|
|
||||||
- problem statement
|
|
||||||
- method overview
|
|
||||||
- main results
|
|
||||||
- claimed contributions
|
|
||||||
- assumptions
|
|
||||||
- limitations
|
|
||||||
- problem tags
|
|
||||||
- technique tags
|
|
||||||
- entities
|
|
||||||
- relevance-to-user fields
|
|
||||||
- recommended sections
|
|
||||||
|
|
||||||
`summary.json` must include a schema version.
|
|
||||||
|
|
||||||
### SQLite
|
|
||||||
|
|
||||||
SQLite stores searchable/indexed state and job-independent status.
|
|
||||||
|
|
||||||
It should help with:
|
|
||||||
|
|
||||||
- listing papers
|
|
||||||
- filtering and search
|
|
||||||
- path lookup
|
|
||||||
- tag lookup
|
|
||||||
- status overview
|
|
||||||
|
|
||||||
But it should never be treated as the only durable source of paper metadata.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## CLI philosophy
|
|
||||||
|
|
||||||
The CLI should be easy for humans and predictable for scripts.
|
|
||||||
|
|
||||||
### Important CLI expectations
|
|
||||||
|
|
||||||
- human-readable by default
|
|
||||||
- machine-readable with `--json`
|
|
||||||
- clear exit codes
|
|
||||||
- no hidden background magic
|
|
||||||
- no required daemon
|
|
||||||
- stable command names
|
|
||||||
- idempotent operations when possible
|
|
||||||
|
|
||||||
### Expected command families
|
|
||||||
|
|
||||||
Core commands include:
|
|
||||||
|
|
||||||
- `init`
|
|
||||||
- `import`
|
|
||||||
- `import-dir`
|
|
||||||
- `watch`
|
|
||||||
- `convert`
|
|
||||||
- `reindex`
|
|
||||||
- `doctor`
|
|
||||||
- `status`
|
|
||||||
- `list`
|
|
||||||
- `show`
|
|
||||||
- `search`
|
|
||||||
- `open`
|
|
||||||
- `print-path`
|
|
||||||
- `summarize`
|
|
||||||
- `render-summary`
|
|
||||||
- `export`
|
|
||||||
|
|
||||||
When implementing commands, preserve a clear separation between:
|
|
||||||
|
|
||||||
- mutation commands
|
|
||||||
- read/query commands
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Architecture guidelines
|
|
||||||
|
|
||||||
The codebase should be organized around a few clear layers.
|
|
||||||
|
|
||||||
### 1. Core domain logic
|
|
||||||
|
|
||||||
Pure Python logic for:
|
|
||||||
|
|
||||||
- identifying papers
|
|
||||||
- computing paths
|
|
||||||
- importing PDFs
|
|
||||||
- updating metadata
|
|
||||||
- converting PDFs to Markdown
|
|
||||||
- rendering summaries
|
|
||||||
- rebuilding the index
|
|
||||||
|
|
||||||
This layer should be testable without the CLI.
|
|
||||||
|
|
||||||
### 2. CLI layer
|
|
||||||
|
|
||||||
Thin wrappers around the core domain logic.
|
|
||||||
|
|
||||||
The CLI should:
|
|
||||||
|
|
||||||
- parse arguments
|
|
||||||
- call core functions
|
|
||||||
- format output
|
|
||||||
- handle exit codes
|
|
||||||
|
|
||||||
The CLI should not contain deep business logic.
|
|
||||||
|
|
||||||
### 3. Optional integrations
|
|
||||||
|
|
||||||
External systems should live in integration modules, for example:
|
|
||||||
|
|
||||||
- MinerU wrapper
|
|
||||||
- filesystem watch integration
|
|
||||||
- ripgrep integration
|
|
||||||
- LLM provider integration
|
|
||||||
|
|
||||||
Keep these adapters isolated.
|
|
||||||
|
|
||||||
### 4. Optional AI layer
|
|
||||||
|
|
||||||
The AI summarization layer should be behind a stable abstraction.
|
|
||||||
|
|
||||||
For example:
|
|
||||||
|
|
||||||
- load prompt template
|
|
||||||
- load paper markdown
|
|
||||||
- load optional profile / vocabulary
|
|
||||||
- call provider
|
|
||||||
- validate structured output
|
|
||||||
- write `summary.json`
|
|
||||||
- render `summary.md`
|
|
||||||
|
|
||||||
Avoid leaking provider-specific behavior into unrelated modules.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## AI collaboration guidelines
|
|
||||||
|
|
||||||
When using AI to help develop this project, the AI should follow these rules.
|
|
||||||
|
|
||||||
### 1. Respect the project boundaries
|
|
||||||
|
|
||||||
Do not redesign `paperlib` into:
|
|
||||||
|
|
||||||
- a web app
|
|
||||||
- a required daemon
|
|
||||||
- a monolithic agent system
|
|
||||||
- a chat-first interface
|
|
||||||
|
|
||||||
Unless explicitly asked, keep the project aligned with:
|
|
||||||
|
|
||||||
- local-first
|
|
||||||
- CLI-first
|
|
||||||
- JSON/SQLite-based architecture
|
|
||||||
- AI-optional enrichment
|
|
||||||
|
|
||||||
### 2. Prefer incremental changes
|
|
||||||
|
|
||||||
Make small, reviewable changes.
|
|
||||||
|
|
||||||
When implementing a feature:
|
|
||||||
|
|
||||||
- first clarify which module owns it
|
|
||||||
- avoid broad refactors unless necessary
|
|
||||||
- preserve existing CLI semantics unless intentionally changing them
|
|
||||||
|
|
||||||
### 3. Keep file formats stable
|
|
||||||
|
|
||||||
Changes to `meta.json` or `summary.json` are important.
|
|
||||||
|
|
||||||
If changing schemas:
|
|
||||||
|
|
||||||
- update the schema version
|
|
||||||
- update documentation
|
|
||||||
- consider migration or backward compatibility
|
|
||||||
- do not silently break existing libraries
|
|
||||||
|
|
||||||
### 4. Avoid hidden coupling
|
|
||||||
|
|
||||||
Do not make unrelated modules depend on each other unnecessarily.
|
|
||||||
|
|
||||||
For example:
|
|
||||||
|
|
||||||
|
### Module boundaries
|
||||||
- `search` should not depend on LLM code
|
- `search` should not depend on LLM code
|
||||||
- `import` should not require summarization
|
- `import` should not require summarization
|
||||||
- `reindex` should not assume a specific converter
|
- `reindex` should work from files alone
|
||||||
- `render-summary` should not require calling AI again
|
- Keep AI behind clean interfaces
|
||||||
|
|
||||||
### 5. Prefer explicit data flow
|
## Git commits
|
||||||
|
Format: `"<scope>: <subject>"` where scope is `feat|fix|docs|style|refactor|test|perf|update`
|
||||||
|
First line ≤88 chars, second line empty.
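
A minimal sketch of checking a message against the format above. The function name `check_commit_message` is hypothetical and not part of `paperlib`; the scope list and the 88-character limit are taken from the rule as stated.

```python
import re

# Scopes allowed by the commit format described above.
ALLOWED_SCOPES = ("feat", "fix", "docs", "style", "refactor", "test", "perf", "update")
SUBJECT_RE = re.compile(r"^(?P<scope>[a-z]+): (?P<subject>\S.*)$")


def check_commit_message(message: str) -> list[str]:
    """Return a list of problems; an empty list means the message passes."""
    problems: list[str] = []
    lines = message.splitlines()
    if not lines:
        return ["empty commit message"]
    first = lines[0]
    match = SUBJECT_RE.match(first)
    if match is None:
        problems.append('first line must look like "<scope>: <subject>"')
    elif match.group("scope") not in ALLOWED_SCOPES:
        problems.append(f"unknown scope: {match.group('scope')}")
    if len(first) > 88:
        problems.append("first line is longer than 88 characters")
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line must be empty")
    return problems
```

A check like this could run from a commit hook, though whether the project uses one is not stated here.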
|
||||||
|
|
||||||
When adding features, keep data flow obvious.
|
## When you need details
|
||||||
|
|
||||||
For example:
|
- **Architecture**: See `dev-docs/architecture.md`
|
||||||
|
- **Data model**: See `dev-docs/data-model.md`
|
||||||
- `import` creates or updates metadata
|
- **AI integration**: See `dev-docs/ai-guidelines.md`
|
||||||
- `convert` creates `paper.md`
|
- **Code style**: See `dev-docs/coding-guidelines.md`
|
||||||
- `summarize` creates `summary.json`
|
|
||||||
- `render-summary` creates `summary.md`
|
|
||||||
- `reindex` rebuilds SQLite from files
|
|
||||||
|
|
||||||
### 6. Do not invent capabilities
|
|
||||||
|
|
||||||
If a feature is not implemented yet, do not pretend it exists.
|
|
||||||
|
|
||||||
Examples:
|
|
||||||
|
|
||||||
- do not write code that assumes a daemon exists
|
|
||||||
- do not assume remote sync exists
|
|
||||||
- do not assume vector search exists
|
|
||||||
- do not assume arXiv-specific logic belongs in the core library
|
|
||||||
|
|
||||||
### 7. Prefer durable outputs over polished prose
|
|
||||||
|
|
||||||
When designing AI summarization outputs, favor:
|
|
||||||
|
|
||||||
- structured JSON
|
|
||||||
- stable field names
|
|
||||||
- grep-friendly rendered Markdown
|
|
||||||
- concise, reusable information
|
|
||||||
|
|
||||||
over:
|
|
||||||
|
|
||||||
- highly polished review prose
|
|
||||||
- flashy but unstable output formats
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Coding guidelines
|
|
||||||
|
|
||||||
### General style
|
|
||||||
|
|
||||||
- Prefer straightforward Python.
|
|
||||||
- Use type hints.
|
|
||||||
- Keep functions small and focused.
|
|
||||||
- Add docstrings to public functions and classes.
|
|
||||||
- Avoid overengineering.
|
|
||||||
- Prefer composition over deep inheritance.
|
|
||||||
|
|
||||||
### Error handling
|
|
||||||
|
|
||||||
- Fail clearly.
|
|
||||||
- Provide helpful error messages.
|
|
||||||
- Distinguish user-facing CLI errors from internal exceptions.
|
|
||||||
- Avoid silently swallowing errors.
|
|
||||||
|
|
||||||
### Logging
|
|
||||||
|
|
||||||
- Use structured and informative logging where useful.
|
|
||||||
- Avoid noisy logs in normal CLI output.
|
|
||||||
- Keep machine-readable command output clean when `--json` is used.
|
|
||||||
|
|
||||||
### File operations
|
|
||||||
|
|
||||||
- Be careful with moves, copies, and overwrites.
|
|
||||||
- Prefer atomic writes for JSON files when possible.
|
|
||||||
- Never corrupt existing metadata due to partial writes.
|
|
||||||
|
|
||||||
### Idempotence
|
|
||||||
|
|
||||||
Where possible, commands should behave safely when run multiple times.
|
|
||||||
|
|
||||||
Examples:
|
|
||||||
|
|
||||||
- re-importing the same file should detect duplicates
|
|
||||||
- `render-summary` should be repeatable
|
|
||||||
- `reindex` should be safe to rerun
|
|
||||||
|
|
||||||
### Testing
|
|
||||||
|
|
||||||
Add tests for:
|
|
||||||
|
|
||||||
- path layout logic
|
|
||||||
- metadata read/write behavior
|
|
||||||
- duplicate detection
|
|
||||||
- reindex behavior
|
|
||||||
- summary rendering
|
|
||||||
- search behavior
|
|
||||||
- CLI output contracts for core commands
|
|
||||||
|
|
||||||
Prefer unit tests for core logic and targeted integration tests for CLI behavior.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Search design guidelines
|
|
||||||
|
|
||||||
Search should support at least two useful modes:
|
|
||||||
|
|
||||||
### 1. Field-aware structured search
|
|
||||||
|
|
||||||
Examples:
|
|
||||||
|
|
||||||
- tags
|
|
||||||
- authors
|
|
||||||
- categories
|
|
||||||
- titles
|
|
||||||
- summary fields
|
|
||||||
|
|
||||||
### 2. Full-text-friendly search
|
|
||||||
|
|
||||||
Support grep-like workflows and integration with tools such as `ripgrep`.
|
|
||||||
|
|
||||||
Do not require semantic/vector search as a baseline feature.
|
|
||||||
|
|
||||||
If semantic search is ever added later, it should be optional and must not displace simple grep/database search.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Summarization design guidelines
|
|
||||||
|
|
||||||
Summarization should produce reusable structured outputs.
|
|
||||||
|
|
||||||
### Summarization goals
|
|
||||||
|
|
||||||
A summary should be useful for:
|
|
||||||
|
|
||||||
- later human review
|
|
||||||
- grep-style reverse lookup
|
|
||||||
- building daily/weekly reports
|
|
||||||
- indexing by problem/method/result
|
|
||||||
- personal research triage
|
|
||||||
|
|
||||||
### Summarization output
|
|
||||||
|
|
||||||
Prefer generating:
|
|
||||||
|
|
||||||
- `summary.json` as the canonical structured output
|
|
||||||
- `summary.md` rendered from JSON
|
|
||||||
|
|
||||||
Do not make free-form Markdown the only output.
|
|
||||||
|
|
||||||
### Prompting guidelines
|
|
||||||
|
|
||||||
Prompts should instruct the model to:
|
|
||||||
|
|
||||||
- extract factual information
|
|
||||||
- avoid unsupported claims
|
|
||||||
- use concise and stable language
|
|
||||||
- prefer controlled vocabulary when available
|
|
||||||
- return structured JSON only
|
|
||||||
- use `null` or empty lists for unclear fields rather than hallucinating
|
|
||||||
|
|
||||||
### Provider abstraction
|
|
||||||
|
|
||||||
The summarizer should not be tightly coupled to a single LLM provider.
|
|
||||||
|
|
||||||
Use a provider abstraction so the project can support:
|
|
||||||
|
|
||||||
- OpenAI-compatible APIs
|
|
||||||
- local models later if desired
|
|
||||||
- different prompt templates and vocabularies
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## What belongs in `paperlib` vs higher-level tools
|
|
||||||
|
|
||||||
`paperlib` is the base library engine.
|
|
||||||
|
|
||||||
It should own:
|
|
||||||
|
|
||||||
- PDF import
|
|
||||||
- local storage layout
|
|
||||||
- conversion to Markdown
|
|
||||||
- metadata files
|
|
||||||
- summary files
|
|
||||||
- index maintenance
|
|
||||||
- CLI access to those capabilities
|
|
||||||
|
|
||||||
It should not own high-level discovery workflows such as:
|
|
||||||
|
|
||||||
- arXiv daily fetching
|
|
||||||
- personalized new-paper ranking
|
|
||||||
- daily digest generation
|
|
||||||
- automated paper downloading from external feeds
|
|
||||||
|
|
||||||
Those belong in higher-level tools that consume `paperlib`.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Expected development workflow
|
|
||||||
|
|
||||||
When implementing a new feature, the preferred order is:
|
|
||||||
|
|
||||||
1. identify the owning module
|
|
||||||
2. define or update the data contract
|
|
||||||
3. implement the core logic
|
|
||||||
4. add tests
|
|
||||||
5. expose it through the CLI if appropriate
|
|
||||||
6. update docs and examples
|
|
||||||
|
|
||||||
If a change affects on-disk formats or CLI behavior, document it clearly.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Development workflow guidelines
|
|
||||||
|
|
||||||
Follow these practices during development:
|
|
||||||
|
|
||||||
1. Use `uv run` whenever a Python call is needed (e.g., `uv run python script.py`, `uv run pytest`)
|
|
||||||
2. Establish testing library in `./.tmp` (which is a tmpfs) rather than in global `/tmp`
|
|
||||||
3. Run `uv run ruff check src/` to check format and style after each development step
|
|
||||||
4. Make git commits in format of `"<scope>: <subject>"`, where `<scope>` can only be one of:
|
|
||||||
- "feat" (feature)
|
|
||||||
- "fix"
|
|
||||||
- "docs"
|
|
||||||
- "style"
|
|
||||||
- "refactor"
|
|
||||||
- "test"
|
|
||||||
- "perf" (performance)
|
|
||||||
- "update" (for misc update)
|
|
||||||
|
|
||||||
The first line of commit message should be within 88 characters, the second line should be empty, and the message body starting from the third line is optional and should be short and clean if present.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Decision heuristics
|
## Decision heuristics
|
||||||
|
|
||||||
When uncertain, prefer the option that is:
|
When uncertain, prefer the option that is:
|
||||||
|
|
||||||
- more local-first
|
- more local-first
|
||||||
- more inspectable
|
- more inspectable
|
||||||
- easier to test
|
- easier to test
|
||||||
- easier to recover from
|
|
||||||
- less coupled to AI
|
- less coupled to AI
|
||||||
- more stable for scripts
|
- more stable for scripts
|
||||||
- less magical
|
- less magical
|
||||||
|
|
||||||
Examples:
|
|
||||||
|
|
||||||
- prefer JSON + Markdown over opaque internal blobs
|
|
||||||
- prefer explicit CLI commands over hidden automation
|
|
||||||
- prefer rebuildable indexes over fragile single-source databases
|
|
||||||
- prefer optional AI enrichment over mandatory AI workflows
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Documentation expectations
|
|
||||||
|
|
||||||
Important features should be documented in:
|
|
||||||
|
|
||||||
- `README.md` for user-facing overview
|
|
||||||
- `docs/cli.md` for command behavior
|
|
||||||
- `docs/storage-layout.md` for on-disk structure
|
|
||||||
- `docs/summary-schema.md` for `summary.json`
|
|
||||||
- `docs/integration-guide.md` for higher-level tool integration
|
|
||||||
|
|
||||||
Keep docs aligned with actual behavior.
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## If you are an AI agent contributing code
|
|
||||||
|
|
||||||
Before making a change, ask:
|
|
||||||
|
|
||||||
1. Does this belong in `paperlib`, or in a higher-level workflow project?
|
|
||||||
2. Does this preserve local-first and CLI-first design?
|
|
||||||
3. Does this make AI optional, not mandatory?
|
|
||||||
4. Does this keep JSON files as the durable source of truth?
|
|
||||||
5. Does this keep the system understandable to a developer reading the code later?
|
|
||||||
|
|
||||||
If the answer to any of these is no, reconsider the approach.
|
|
||||||
@@ -0,0 +1,55 @@
# AI Integration Guidelines

## Search design

Search should support at least two useful modes:

### 1. Field-aware structured search
Examples: tags, authors, categories, titles, summary fields

### 2. Full-text-friendly search
Support grep-like workflows and integration with tools such as `ripgrep`.

Do not require semantic/vector search as a baseline feature.

If semantic search is ever added later, it should be optional and must not displace simple grep/database search.

## Summarization design

Summarization should produce reusable structured outputs.

### Summarization goals

A summary should be useful for:
- later human review
- grep-style reverse lookup
- building daily/weekly reports
- indexing by problem/method/result
- personal research triage

### Summarization output

Prefer generating:
- `summary.json` as the canonical structured output
- `summary.md` rendered from JSON

Do not make free-form Markdown the only output.

### Prompting guidelines

Prompts should instruct the model to:
- extract factual information
- avoid unsupported claims
- use concise and stable language
- prefer controlled vocabulary when available
- return structured JSON only
- use `null` or empty lists for unclear fields rather than hallucinating
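
The last two points can also be enforced on the consuming side: parse the reply strictly and fill unclear fields with `null` or empty lists. The sketch below assumes a single-object JSON reply; the field names are hypothetical placeholders, not the project's actual `summary.json` schema.

```python
import json

# Hypothetical field names, for illustration only.
TEXT_FIELDS = ("one_sentence_summary", "problem_statement")
LIST_FIELDS = ("problem_tags", "technique_tags")


def parse_summary_reply(reply: str) -> dict:
    """Parse a model reply that is expected to be a single JSON object."""
    data = json.loads(reply)  # raises ValueError on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    summary: dict = {}
    for name in TEXT_FIELDS:
        value = data.get(name)
        # Keep unclear text fields as None (JSON null) instead of guessing.
        summary[name] = value if isinstance(value, str) and value.strip() else None
    for name in LIST_FIELDS:
        value = data.get(name)
        # Keep unclear list fields as empty lists; drop non-string items.
        if isinstance(value, list):
            summary[name] = [item for item in value if isinstance(item, str)]
        else:
            summary[name] = []
    return summary
```

Rejecting non-JSON replies outright, rather than trying to salvage them, keeps free-form prose out of `summary.json`.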

### Provider abstraction

The summarizer should not be tightly coupled to a single LLM provider.

Use a provider abstraction so the project can support:
- OpenAI-compatible APIs
- local models later if desired
- different prompt templates and vocabularies
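
One possible shape for such an abstraction is a small structural interface that the summarizer depends on. Everything named below (`SummaryProvider`, `EchoProvider`, `build_prompt`, `summarize`, the `{paper}` placeholder) is a hypothetical illustration, not the project's actual API.

```python
from typing import Protocol


class SummaryProvider(Protocol):
    """Anything that can turn a prompt into raw model text."""

    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in provider for tests; a real adapter would call an API here."""

    def complete(self, prompt: str) -> str:
        return '{"one_sentence_summary": null}'


def build_prompt(template: str, paper_markdown: str) -> str:
    """Fill a prompt template that contains a `{paper}` placeholder."""
    return template.replace("{paper}", paper_markdown)


def summarize(provider: SummaryProvider, template: str, paper_markdown: str) -> str:
    """Core code depends only on the protocol, never on a concrete provider."""
    return provider.complete(build_prompt(template, paper_markdown))
```

With this shape, an OpenAI-compatible adapter and a local-model adapter would each be one small class, and the template stays a plain file that can be swapped without code changes.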

@@ -0,0 +1,74 @@
# Architecture Guidelines

The codebase should be organized around a few clear layers.

## 1. Core domain logic

Pure Python logic for:

- identifying papers
- computing paths
- importing PDFs
- updating metadata
- converting PDFs to Markdown
- rendering summaries
- rebuilding the index

This layer should be testable without the CLI.

## 2. CLI layer

Thin wrappers around the core domain logic.

The CLI should:

- parse arguments
- call core functions
- format output
- handle exit codes

The CLI should not contain deep business logic.
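
A sketch of what a thin wrapper might look like. It uses the standard-library `argparse` as a stand-in, since the framework `paperlib` actually uses is not stated here; `show_paper`, `main`, and the `paperlib-sketch` program name are hypothetical.

```python
import argparse
import json


def show_paper(paper_id: str) -> dict:
    """Core function (hypothetical): returns a plain dict and prints nothing."""
    return {"paper_id": paper_id, "title": "Example title"}


def main(argv: list[str]) -> int:
    """Parse arguments, call the core function, format output, return an exit code."""
    parser = argparse.ArgumentParser(prog="paperlib-sketch")
    sub = parser.add_subparsers(dest="command", required=True)
    show = sub.add_parser("show")
    show.add_argument("paper_id")
    show.add_argument("--json", action="store_true", dest="as_json")
    args = parser.parse_args(argv)

    record = show_paper(args.paper_id)
    if args.as_json:
        # Machine-readable mode: exactly one JSON document on stdout.
        print(json.dumps(record))
    else:
        print(f"{record['paper_id']}: {record['title']}")
    return 0
```

Because `show_paper` returns data instead of printing, it can be unit-tested without going through the CLI at all.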

## 3. Optional integrations

External systems should live in integration modules, for example:

- MinerU wrapper
- filesystem watch integration
- ripgrep integration
- LLM provider integration

Keep these adapters isolated.

## 4. Optional AI layer

The AI summarization layer should be behind a stable abstraction.

For example:

- load prompt template
- load paper markdown
- load optional profile / vocabulary
- call provider
- validate structured output
- write `summary.json`
- render `summary.md`

Avoid leaking provider-specific behavior into unrelated modules.

## Component boundaries

Avoid hidden coupling:

- `search` should not depend on LLM code
- `import` should not require summarization
- `reindex` should not assume a specific converter
- `render-summary` should not require calling AI again

Prefer explicit data flow:

- `import` creates or updates metadata
- `convert` creates `paper.md`
- `summarize` creates `summary.json`
- `render-summary` creates `summary.md`
- `reindex` rebuilds SQLite from files
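
The last step can be sketched as a walk over the per-paper files. The table name, its columns, and the function signature below are assumptions made for illustration; the real index schema is not described here.

```python
import json
import sqlite3
from pathlib import Path


def reindex(library_root: Path, db_path: Path) -> int:
    """Rebuild a minimal index table from every meta.json under papers/."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commit on success, roll back pending inserts on error
            conn.execute("DROP TABLE IF EXISTS papers")
            conn.execute(
                "CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT, meta_path TEXT)"
            )
            count = 0
            for meta_path in sorted((library_root / "papers").rglob("meta.json")):
                meta = json.loads(meta_path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT OR REPLACE INTO papers VALUES (?, ?, ?)",
                    (meta["paper_id"], meta.get("title"), str(meta_path)),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Nothing in this function reads the previous database contents, which is what makes the index disposable: deleting the SQLite file and rerunning yields the same rows.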

@@ -0,0 +1,51 @@
# Coding Guidelines

## General style

- Prefer straightforward Python
- Use type hints
- Keep functions small and focused
- Add docstrings to public functions and classes
- Avoid overengineering
- Prefer composition over deep inheritance

## Error handling

- Fail clearly
- Provide helpful error messages
- Distinguish user-facing CLI errors from internal exceptions
- Avoid silently swallowing errors

## Logging

- Use structured and informative logging where useful
- Avoid noisy logs in normal CLI output
- Keep machine-readable command output clean when `--json` is used

## File operations

- Be careful with moves, copies, and overwrites
- Prefer atomic writes for JSON files when possible
- Never corrupt existing metadata due to partial writes
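
A common way to get atomic JSON writes is to write a temporary file in the same directory and rename it over the target. The helper name `write_json_atomic` below is hypothetical; the pattern itself relies only on the standard library.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2, sort_keys=True)
            handle.write("\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # replaces the target in one step on the same filesystem
    except BaseException:
        # Leave the previous file untouched and clean up the partial temp file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The temp file is created next to the target on purpose: a rename across filesystems is not atomic, so a temp file in a global temp directory would lose the guarantee.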
|
||||||
|
|
||||||
|
## Idempotence
|
||||||
|
|
||||||
|
Where possible, commands should behave safely when run multiple times.
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
- re-importing the same file should detect duplicates
|
||||||
|
- `render-summary` should be repeatable
|
||||||
|
- `reindex` should be safe to rerun
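
Duplicate detection on re-import can be sketched with a content hash, so that the same bytes always map to the same identifier regardless of file name. The `sha256-<hex>` form mirrors the `sha256-.../` directory in the library layout, but the exact identifier format and the function names are assumptions.

```python
import hashlib
from pathlib import Path


def content_id(pdf_path: Path) -> str:
    """Return a `sha256-<hex>` identifier derived from the file's bytes."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as handle:
        # Read in 1 MiB chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return f"sha256-{digest.hexdigest()}"


def is_duplicate(pdf_path: Path, known_ids: set[str]) -> bool:
    """True when a file with identical content has already been imported."""
    return content_id(pdf_path) in known_ids
```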

## Testing

Add tests for:
- path layout logic
- metadata read/write behavior
- duplicate detection
- reindex behavior
- summary rendering
- search behavior
- CLI output contracts for core commands

Prefer unit tests for core logic and targeted integration tests for CLI behavior.

@@ -0,0 +1,92 @@
# Data Model

## Library data layout

The paper library on disk should be human-browsable.

A typical layout looks like:

```text
library_root/
  config/
    config.toml
    vocab.yaml
    prompts/
      summarize_paper.md

  inbox/
  papers/
    arxiv/
      2026/
        2604.12345/
          meta.json
          source.pdf
          paper.md
          summary.json
          summary.md
          ref.bib
          assets/
          logs/
            mineru.log
    local/
      sha256-.../
        meta.json
        source.pdf
        paper.md
        summary.json
        summary.md

  db/
    paperlib.sqlite3

  cache/
```
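
The layout above suggests that a paper's directory can be computed from its source type and source ID. The sketch below is one guess at that mapping: it assumes the year directory is derived from the `YYMM` prefix of a new-style arXiv ID, which the layout is consistent with but does not confirm, and the function name `paper_dir` is hypothetical.

```python
from pathlib import Path


def paper_dir(library_root: Path, source_type: str, source_id: str) -> Path:
    """Map a paper to its directory under `papers/`, following the layout above."""
    if source_type == "arxiv":
        # Assumption: new-style arXiv ids start with YYMM, e.g. "2604.12345" -> 2026.
        year = 2000 + int(source_id[:2])
        return library_root / "papers" / "arxiv" / str(year) / source_id
    if source_type == "local":
        # `source_id` is expected to look like "sha256-<hex digest>".
        return library_root / "papers" / "local" / source_id
    raise ValueError(f"unknown source type: {source_type}")
```

Keeping this as a pure function of its inputs makes the path layout easy to unit-test without touching the filesystem.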

## Data boundaries

### `meta.json`

`meta.json` should contain deterministic or near-deterministic information, mostly from:

- import process
- file system state
- external paper metadata sources

Typical fields include:

- `paper_id`, `source_type`, `source_id`
- `title`, `authors`, `published_date`, `updated_date`, `categories`
- `pdf_path`, `paper_md_path`, `summary_json_path`, `summary_md_path`
- `imported_at`, `conversion_status`, `summary_status`

Avoid putting speculative AI content into `meta.json`.

### `summary.json`

`summary.json` is optional enrichment and may be regenerated.

It should contain structured fields such as:

- one-sentence summary, problem statement, method overview
- main results, claimed contributions, assumptions, limitations
- problem tags, technique tags, entities
- relevance-to-user fields, recommended sections

`summary.json` must include a schema version.
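
A loader can make the schema version requirement explicit by refusing files it does not understand. The key name `schema_version` and the version number `1` are assumptions for illustration; the document does not name the actual field.

```python
import json

SUPPORTED_SUMMARY_SCHEMA = 1  # hypothetical current version


def load_summary(text: str) -> dict:
    """Parse summary.json content and refuse versions this code does not know."""
    data = json.loads(text)
    version = data.get("schema_version")
    if version is None:
        raise ValueError("summary.json is missing schema_version")
    if version != SUPPORTED_SUMMARY_SCHEMA:
        raise ValueError(f"unsupported summary schema version: {version}")
    return data
```

Failing loudly on an unknown version gives a migration step a clear place to hook in, instead of silently misreading older libraries.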

### SQLite

SQLite stores searchable/indexed state and job-independent status.

It should help with:
- listing papers, filtering and search, path lookup, tag lookup, status overview

But it should never be treated as the only durable source of paper metadata.

## Key conventions

- `meta.json` contains stable metadata and processing status
- `summary.json` contains structured AI-generated enrichment
- `summary.md` is rendered from `summary.json`
- `paper.md` is generated from the PDF by an external converter such as MinerU
- the database is rebuildable from the files above
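
The convention that `summary.md` is rendered from `summary.json` implies a pure function from the structured data to Markdown. A minimal sketch, with hypothetical field names and headings:

```python
# Hypothetical field order; fixed so that repeated renders are byte-identical.
SECTIONS = (
    ("one_sentence_summary", "Summary"),
    ("problem_tags", "Problem tags"),
)


def render_summary(summary: dict) -> str:
    """Render summary.json data to Markdown without calling any AI provider."""
    lines: list[str] = []
    for key, heading in SECTIONS:
        value = summary.get(key)
        if not value:
            continue  # skip null or empty fields
        lines.append(f"## {heading}")
        lines.append("")
        if isinstance(value, list):
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(str(value))
        lines.append("")
    return "\n".join(lines)
```

Since the output depends only on the input dict, rerunning the renderer is safe and `summary.md` can be deleted and regenerated at any time.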