docs: add docs

2026-04-17 16:54:30 -04:00
parent 74d140e5f8
commit 432010f431
10 changed files with 1682 additions and 19 deletions
@@ -0,0 +1,289 @@
+# Summary Schema
+
+This document defines the structure and semantics of the `summary.json` files that contain AI-generated paper summaries.
+
+## Overview
+
+The `summary.json` file contains structured, AI-generated analysis of a paper. It is designed to:
+
+- Provide consistent, machine-readable summaries
+- Support research triage and discovery workflows
+- Enable automated categorization and search
+- Remain stable across different AI providers
+- Use controlled vocabulary when available
+
+## Schema Version 1.0
+
+### File Structure
+
+```json
+{
+  "schema_version": "1.0",
+  "one_sentence_summary": "This paper introduces a novel neural architecture for...",
+  "problem_statement": "Current approaches to X suffer from limitations...",
+  "method_overview": "The authors propose a hybrid approach combining...",
+  "main_results": "Experiments show 15% improvement over baselines...",
+  "claimed_contributions": [
+    "Novel attention mechanism design",
+    "State-of-the-art results on ImageNet",
+    "Theoretical analysis of convergence properties"
+  ],
+  "assumptions": [
+    "Data is independent and identically distributed (i.i.d.)",
+    "Computational budget allows for large models"
+  ],
+  "limitations": [
+    "Only evaluated on English text",
+    "Requires significant computational resources",
+    "Limited theoretical justification for design choices"
+  ],
+  "problem_tags": ["classification", "optimization"],
+  "technique_tags": ["neural-networks", "attention", "transformers", "computer-vision"],
+  "entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
+  "relevance_to_user": 0.75,
+  "recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
+}
+```
+
+## Field Definitions
+
+### Required Fields
+
+#### `schema_version` (string)
+- **Purpose**: Track format version for migration
+- **Format**: Version string in `MAJOR.MINOR` form (e.g., "1.0")
+- **Required**: Yes
+
+#### `one_sentence_summary` (string)
+- **Purpose**: Concise paper overview for quick scanning
+- **Guidelines**: 
+  - One complete sentence, under 200 characters
+  - Focus on the main contribution or finding
+  - Avoid technical jargon when possible
+- **Example**: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."
+
+### Core Content Fields
+
+#### `problem_statement` (string)
+- **Purpose**: What problem does this paper address?
+- **Guidelines**:
+  - 2-3 sentences maximum
+  - Focus on the gap or limitation being addressed
+  - Explain why this problem matters
+
+#### `method_overview` (string)
+- **Purpose**: High-level description of the approach
+- **Guidelines**:
+  - 3-4 sentences maximum
+  - Focus on the key innovation or insight
+  - Avoid detailed algorithmic descriptions
+
+#### `main_results` (string)
+- **Purpose**: Key empirical findings or theoretical results
+- **Guidelines**:
+  - Quantitative results when available
+  - Highlight significance of improvements
+  - Note any surprising or counterintuitive findings
+
+### Structured Lists
+
+#### `claimed_contributions` (array of strings)
+- **Purpose**: Authors' stated contributions
+- **Guidelines**:
+  - Extract from paper's contribution list
+  - Preserve authors' framing and claims
+  - 3-6 items typically
+
+#### `assumptions` (array of strings)
+- **Purpose**: Key assumptions underlying the work
+- **Guidelines**:
+  - Cover mathematical, methodological, and data assumptions
+  - Prioritize assumptions that determine where the work applies
+  - Include important assumptions even when the paper leaves them unstated
+
+#### `limitations` (array of strings)
+- **Purpose**: Acknowledged or apparent limitations
+- **Guidelines**:
+  - Include limitations from the authors' discussion or limitations section
+  - Also include apparent limitations that the authors do not acknowledge
+  - Prioritize limitations that bound the scope of the results
+
+### Categorization
+
+#### `problem_tags` (array of strings)
+- **Purpose**: Categorize the problem domain
+- **Controlled vocabulary** (preferred values):
+  - `classification`, `regression`, `clustering`
+  - `optimization`, `search`, `planning`
+  - `generation`, `translation`, `summarization`
+  - `detection`, `segmentation`, `tracking`
+  - `compression`, `encoding`, `decoding`
+  - `privacy`, `security`, `robustness`
+  - `interpretability`, `fairness`, `ethics`
+  - `efficiency`, `scalability`, `deployment`
+
+#### `technique_tags` (array of strings)  
+- **Purpose**: Categorize the technical approaches
+- **Controlled vocabulary** (preferred values):
+  - `neural-networks`, `deep-learning`, `transformers`
+  - `cnn`, `rnn`, `lstm`, `gru`, `attention`
+  - `reinforcement-learning`, `supervised-learning`, `unsupervised-learning`
+  - `bayesian`, `probabilistic`, `statistical`
+  - `graph-neural-networks`, `graph-algorithms`
+  - `computer-vision`, `natural-language-processing`
+  - `federated-learning`, `transfer-learning`, `meta-learning`
+  - `adversarial`, `generative-models`, `vae`, `gan`
+
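Because the vocabularies above are preferred values rather than a closed set, one way to apply them is to separate preferred tags from free-form ones instead of rejecting the latter. The sketch below is illustrative only: `split_tags` is a hypothetical helper, not part of paperlib, and the two sets simply copy the lists above.

```python
# Hypothetical helper; the vocabularies are copied from the lists above.
PROBLEM_VOCAB = {
    "classification", "regression", "clustering",
    "optimization", "search", "planning",
    "generation", "translation", "summarization",
    "detection", "segmentation", "tracking",
    "compression", "encoding", "decoding",
    "privacy", "security", "robustness",
    "interpretability", "fairness", "ethics",
    "efficiency", "scalability", "deployment",
}
TECHNIQUE_VOCAB = {
    "neural-networks", "deep-learning", "transformers",
    "cnn", "rnn", "lstm", "gru", "attention",
    "reinforcement-learning", "supervised-learning", "unsupervised-learning",
    "bayesian", "probabilistic", "statistical",
    "graph-neural-networks", "graph-algorithms",
    "computer-vision", "natural-language-processing",
    "federated-learning", "transfer-learning", "meta-learning",
    "adversarial", "generative-models", "vae", "gan",
}


def split_tags(tags, vocabulary):
    """Return (preferred, free_form) lists, preserving input order."""
    preferred = [t for t in tags if t in vocabulary]
    free_form = [t for t in tags if t not in vocabulary]
    return preferred, free_form
```

Applied to the EfficientNet example at the end of this document, `split_tags` would report `neural-architecture-search` and `model-scaling` as free-form technique tags, which a reviewer could then accept or map to a preferred value.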
+### Entities and References
+
+#### `entities` (array of strings)
+- **Purpose**: Important datasets, models, algorithms, or systems mentioned
+- **Guidelines**:
+  - Proper names: "ImageNet", "BERT", "ResNet"
+  - Algorithms: "SGD", "Adam", "RANSAC"  
+  - Benchmarks: "GLUE", "COCO", "WMT"
+  - Avoid generic terms like "neural network"
+
+### User Relevance
+
+#### `relevance_to_user` (number or null, optional)
+- **Purpose**: Estimated relevance score for the user
+- **Format**: Float between 0.0 and 1.0, or `null`
+- **Guidelines**:
+  - Base the score on the user's research interests (if known)
+  - Set to `null` when user preferences are unavailable
+  - Higher scores mean more relevant
+
+#### `recommended_sections` (array of strings, optional)
+- **Purpose**: Specific sections worth reading in detail
+- **Format**: Section references as they appear in paper
+- **Examples**: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]
+
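The field definitions above can be checked mechanically. The following is a minimal sketch, not paperlib's actual validator: `validate_summary` is a hypothetical function, and it assumes that only the two fields listed under "Required Fields" are mandatory, that other string fields may be `null`, and that list fields must be arrays of strings when present.

```python
STRING_FIELDS = ("schema_version", "one_sentence_summary", "problem_statement",
                 "method_overview", "main_results")
LIST_FIELDS = ("claimed_contributions", "assumptions", "limitations",
               "problem_tags", "technique_tags", "entities",
               "recommended_sections")
# Assumption: only the fields under "Required Fields" are mandatory.
REQUIRED_FIELDS = ("schema_version", "one_sentence_summary")


def validate_summary(summary):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name in STRING_FIELDS:
        value = summary.get(name)
        if name in REQUIRED_FIELDS:
            if not isinstance(value, str):
                problems.append(f"{name} is required and must be a string")
        elif value is not None and not isinstance(value, str):
            problems.append(f"{name} must be a string or null")
    for name in LIST_FIELDS:
        value = summary.get(name, [])
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{name} must be an array of strings")
    sentence = summary.get("one_sentence_summary")
    if isinstance(sentence, str) and len(sentence) >= 200:
        problems.append("one_sentence_summary must be under 200 characters")
    score = summary.get("relevance_to_user")
    if score is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append("relevance_to_user must be a number or null")
        elif not 0.0 <= score <= 1.0:
            problems.append("relevance_to_user must be between 0.0 and 1.0")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a generated summary at once, which is convenient when deciding whether to regenerate it.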
+## Generation Guidelines
+
+### AI Provider Instructions
+
+When generating summaries, AI models should:
+
+1. **Read for understanding**: Focus on the paper's core contributions
+2. **Use structured thinking**: Work through each field systematically  
+3. **Prefer facts over interpretation**: Report what the authors claim rather than your own opinions
+4. **Use controlled vocabulary**: Select from predefined tag lists when possible
+5. **Be concise**: Optimize for quick scanning and search
+6. **Handle uncertainty**: Use `null` for scalar fields and empty arrays for list fields when a value cannot be determined
+
+### Quality Criteria
+
+Good summaries exhibit:
+- **Accuracy**: Faithful to the paper's content
+- **Completeness**: Cover all major aspects  
+- **Consistency**: Similar papers get similar treatment
+- **Searchability**: Use terms that aid discovery
+- **Brevity**: Information density over verbosity
+
+### Common Issues to Avoid
+
+- **Hallucination**: Never invent facts not in the paper
+- **Editorializing**: Don't add opinions about paper quality
+- **Inconsistent terminology**: Use standard field names
+- **Over-abstraction**: Keep concrete details when useful
+- **Under-specification**: Provide enough detail for usefulness
+
+## Schema Evolution
+
+### Version History
+
+- **v1.0** (current): Initial schema with core fields
+
+### Migration Strategy
+
+When the schema evolves:
+1. New versions increment the `schema_version` field
+2. Migration tools handle format upgrades automatically  
+3. Backward compatibility is maintained when possible
+4. Deprecated fields are marked but preserved
+
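The migration steps above could be organized as a chain of per-version upgrade functions. This is a sketch under stated assumptions: schema version "1.1" and the `keywords` field do not exist and are invented here purely to show the mechanics, and `migrate` is a hypothetical helper rather than a paperlib API.

```python
import copy


def _upgrade_1_0_to_1_1(summary):
    # Hypothetical change: a made-up "keywords" field gets an empty default.
    summary.setdefault("keywords", [])
    summary["schema_version"] = "1.1"
    return summary


# Maps a source version to (target version, upgrade function).
MIGRATIONS = {
    "1.0": ("1.1", _upgrade_1_0_to_1_1),
}


def migrate(summary, target_version):
    """Upgrade a summary step by step; the input dict is left unmodified."""
    current = copy.deepcopy(summary)
    while current["schema_version"] != target_version:
        version = current["schema_version"]
        if version not in MIGRATIONS:
            raise ValueError(f"no migration path from {version} to {target_version}")
        _, upgrade = MIGRATIONS[version]
        current = upgrade(current)
    return current
```

Each upgrade function only adds or rewrites fields and never deletes them, which matches the stated policy that deprecated fields are marked but preserved.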
+### Extensibility
+
+Future versions may add:
+- Additional structured fields
+- Hierarchical tag taxonomies
+- Multi-lingual support
+- Citation relationship mapping
+- Experimental reproducibility metadata
+
+## Integration with paperlib
+
+### File Lifecycle
+
+1. **Generation**: AI provider creates `summary.json`
+2. **Validation**: paperlib validates against schema
+3. **Indexing**: Content indexed for search
+4. **Rendering**: Human-readable `summary.md` generated
+5. **Updates**: Summaries can be regenerated with new models
+
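The rendering step in the lifecycle above might look like the following. It is only a sketch: the layout of `summary.md` is not specified in this document, so the headings below are assumptions, and `render_markdown` is a hypothetical function.

```python
def render_markdown(summary):
    """Render a summary dict as Markdown text (the layout is an assumption)."""
    lines = ["# Summary", "", summary["one_sentence_summary"], ""]
    sections = [
        ("Problem", "problem_statement"),
        ("Method", "method_overview"),
        ("Results", "main_results"),
    ]
    for title, key in sections:
        # Skip text fields that are missing, null, or empty.
        if summary.get(key):
            lines += [f"## {title}", "", summary[key], ""]
    lists = [
        ("Claimed contributions", "claimed_contributions"),
        ("Assumptions", "assumptions"),
        ("Limitations", "limitations"),
    ]
    for title, key in lists:
        items = summary.get(key) or []
        if items:
            lines += [f"## {title}", ""]
            lines += [f"- {item}" for item in items]
            lines.append("")
    return "\n".join(lines)
```

Empty or `null` fields are omitted entirely so that a partially filled summary still renders as a clean page.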
+### Search Integration
+
+Summary fields are indexed for search:
+- Full-text search includes all text fields
+- Tag-based search uses `problem_tags` and `technique_tags`  
+- Entity search uses the `entities` field
+- Relevance ranking can use `relevance_to_user` scores
+
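Tag and entity search of the kind listed above can be served by a simple inverted index. A minimal in-memory sketch follows; paperlib's real index is not described in this document, `build_index` is a hypothetical helper, and lowercasing values for case-insensitive lookup is a design assumption.

```python
from collections import defaultdict

INDEXED_FIELDS = ("problem_tags", "technique_tags", "entities")


def build_index(summaries):
    """Map (field, lowercased value) to the set of paper IDs containing it.

    `summaries` is a dict of paper ID -> parsed summary.json content.
    """
    index = defaultdict(set)
    for paper_id, summary in summaries.items():
        for field in INDEXED_FIELDS:
            # Treat a missing or null field as an empty list.
            for value in summary.get(field) or []:
                index[(field, value.lower())].add(paper_id)
    return index
```

A lookup such as `index[("entities", "imagenet")]` then returns every paper whose `entities` list mentions ImageNet, regardless of capitalization.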
+### API Integration
+
+Higher-level tools can consume summaries programmatically:
+
+```python
+import json
+from pathlib import Path
+
+# Load the summary (read explicitly as UTF-8)
+summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
+with summary_path.open(encoding="utf-8") as f:
+    summary = json.load(f)
+
+# Extract key information
+tags = summary["problem_tags"] + summary["technique_tags"]
+entities = summary["entities"]
+
+# relevance_to_user may be absent or null; treat both cases as 0.0
+relevance = summary.get("relevance_to_user")
+if relevance is None:
+    relevance = 0.0
+```
+
+This enables automated workflows like:
+- Daily digest generation
+- Research recommendation systems  
+- Literature review automation
+- Cross-reference discovery
+
+## Examples
+
+### Machine Learning Paper
+```json
+{
+  "schema_version": "1.0",
+  "one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
+  "problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
+  "method_overview": "The paper proposes compound scaling, which uniformly scales network depth, width, and resolution using a single compound coefficient. The baseline network is found by neural architecture search, and the scaling constants by a small grid search.",
+  "main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
+  "claimed_contributions": [
+    "Novel compound scaling method for ConvNets",
+    "EfficientNet family with state-of-the-art accuracy/efficiency",
+    "Systematic study of scaling dimensions"
+  ],
+  "assumptions": [
+    "ImageNet classification transfers to other vision tasks",
+    "Compound scaling works across different architectures"
+  ],
+  "limitations": [
+    "Limited evaluation on tasks beyond image classification",
+    "Scaling coefficients may not generalize to all architectures"
+  ],
+  "problem_tags": ["classification", "efficiency"],
+  "technique_tags": ["cnn", "computer-vision", "neural-architecture-search", "model-scaling"],
+  "entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
+  "relevance_to_user": null,
+  "recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
+}
+```
+
+This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.