docs: add docs
@@ -0,0 +1,289 @@
# Summary Schema

This document defines the structure and semantics of the `summary.json` files that contain AI-generated paper summaries.

## Overview

The `summary.json` file contains structured, AI-generated analysis of a paper. It is designed to:

- Provide consistent, machine-readable summaries
- Support research triage and discovery workflows
- Enable automated categorization and search
- Remain stable across different AI providers
- Use controlled vocabulary when available

## Schema Version 1.0

### File Structure

```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces a novel neural architecture for...",
  "problem_statement": "Current approaches to X suffer from limitations...",
  "method_overview": "The authors propose a hybrid approach combining...",
  "main_results": "Experiments show 15% improvement over baselines...",
  "claimed_contributions": [
    "Novel attention mechanism design",
    "State-of-the-art results on ImageNet",
    "Theoretical analysis of convergence properties"
  ],
  "assumptions": [
    "Data is independently distributed",
    "Computational budget allows for large models"
  ],
  "limitations": [
    "Only evaluated on English text",
    "Requires significant computational resources",
    "Limited theoretical justification for design choices"
  ],
  "problem_tags": ["classification", "computer-vision", "optimization"],
  "technique_tags": ["neural-networks", "attention", "transformers"],
  "entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
  "relevance_to_user": 0.75,
  "recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
}
```

## Field Definitions

### Required Fields

#### `schema_version` (string)
- **Purpose**: Track format version for migration
- **Format**: Semantic version string (e.g., "1.0")
- **Required**: Yes

#### `one_sentence_summary` (string)
- **Purpose**: Concise paper overview for quick scanning
- **Guidelines**:
  - One complete sentence, under 200 characters
  - Focus on the main contribution or finding
  - Avoid technical jargon when possible
- **Example**: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."
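
The two required fields can be checked mechanically. The following is a minimal sketch, not paperlib's actual validator: the function name `check_required_fields` and the constant `MAX_SUMMARY_CHARS` are hypothetical, and the 200-character limit is taken from the guideline above.

```python
import json

# Assumption: the "under 200 characters" guideline is enforced as a strict limit.
MAX_SUMMARY_CHARS = 200


def check_required_fields(summary: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    version = summary.get("schema_version")
    if not isinstance(version, str) or not version:
        problems.append("schema_version must be a non-empty string")

    sentence = summary.get("one_sentence_summary")
    if not isinstance(sentence, str) or not sentence.strip():
        problems.append("one_sentence_summary must be a non-empty string")
    elif len(sentence) >= MAX_SUMMARY_CHARS:
        problems.append(
            f"one_sentence_summary has {len(sentence)} characters; "
            f"the guideline is under {MAX_SUMMARY_CHARS}"
        )

    return problems


example = json.loads(
    '{"schema_version": "1.0", '
    '"one_sentence_summary": "Introduces a faster attention variant."}'
)
print(check_required_fields(example))  # prints []
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a generated file at once.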

### Core Content Fields

#### `problem_statement` (string)
- **Purpose**: What problem does this paper address?
- **Guidelines**:
  - 2-3 sentences maximum
  - Focus on the gap or limitation being addressed
  - Explain why this problem matters

#### `method_overview` (string)
- **Purpose**: High-level description of the approach
- **Guidelines**:
  - 3-4 sentences maximum
  - Focus on the key innovation or insight
  - Avoid detailed algorithmic descriptions

#### `main_results` (string)
- **Purpose**: Key empirical findings or theoretical results
- **Guidelines**:
  - Quantitative results when available
  - Highlight significance of improvements
  - Note any surprising or counterintuitive findings

### Structured Lists

#### `claimed_contributions` (array of strings)
- **Purpose**: Authors' stated contributions
- **Guidelines**:
  - Extract from paper's contribution list
  - Preserve authors' framing and claims
  - 3-6 items typically

#### `assumptions` (array of strings)
- **Purpose**: Key assumptions underlying the work
- **Guidelines**:
  - Mathematical, methodological, or data assumptions
  - Critical for understanding applicability
  - Often unstated but important

#### `limitations` (array of strings)
- **Purpose**: Acknowledged or apparent limitations
- **Guidelines**:
  - From authors' discussion or limitations section
  - Obvious limitations not acknowledged by authors
  - Important for understanding scope

### Categorization

#### `problem_tags` (array of strings)
- **Purpose**: Categorize the problem domain
- **Controlled vocabulary** (preferred values):
  - `classification`, `regression`, `clustering`
  - `optimization`, `search`, `planning`
  - `generation`, `translation`, `summarization`
  - `detection`, `segmentation`, `tracking`
  - `compression`, `encoding`, `decoding`
  - `privacy`, `security`, `robustness`
  - `interpretability`, `fairness`, `ethics`
  - `efficiency`, `scalability`, `deployment`

#### `technique_tags` (array of strings)
- **Purpose**: Categorize the technical approaches
- **Controlled vocabulary** (preferred values):
  - `neural-networks`, `deep-learning`, `transformers`
  - `cnn`, `rnn`, `lstm`, `gru`, `attention`
  - `reinforcement-learning`, `supervised-learning`, `unsupervised-learning`
  - `bayesian`, `probabilistic`, `statistical`
  - `graph-neural-networks`, `graph-algorithms`
  - `computer-vision`, `natural-language-processing`
  - `federated-learning`, `transfer-learning`, `meta-learning`
  - `adversarial`, `generative-models`, `vae`, `gan`
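
Because the vocabularies are preferred values rather than a closed set, a tool would more plausibly report tags outside them than reject the file. The sketch below illustrates that idea; the names `PROBLEM_TAGS`, `TECHNIQUE_TAGS`, and `uncontrolled_tags` are hypothetical and not part of paperlib.

```python
# Preferred values copied from the two vocabulary lists above.
PROBLEM_TAGS = {
    "classification", "regression", "clustering",
    "optimization", "search", "planning",
    "generation", "translation", "summarization",
    "detection", "segmentation", "tracking",
    "compression", "encoding", "decoding",
    "privacy", "security", "robustness",
    "interpretability", "fairness", "ethics",
    "efficiency", "scalability", "deployment",
}

TECHNIQUE_TAGS = {
    "neural-networks", "deep-learning", "transformers",
    "cnn", "rnn", "lstm", "gru", "attention",
    "reinforcement-learning", "supervised-learning", "unsupervised-learning",
    "bayesian", "probabilistic", "statistical",
    "graph-neural-networks", "graph-algorithms",
    "computer-vision", "natural-language-processing",
    "federated-learning", "transfer-learning", "meta-learning",
    "adversarial", "generative-models", "vae", "gan",
}


def uncontrolled_tags(tags, vocabulary):
    """Return the tags that are not in the preferred vocabulary, in input order."""
    return [tag for tag in tags if tag not in vocabulary]


print(uncontrolled_tags(["cnn", "neural-architecture-search"], TECHNIQUE_TAGS))
# prints ['neural-architecture-search']
```

A reviewer could use the reported tags to decide whether a new term belongs in the vocabulary or should be mapped to an existing one.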

### Entities and References

#### `entities` (array of strings)
- **Purpose**: Important datasets, models, algorithms, or systems mentioned
- **Guidelines**:
  - Proper names: "ImageNet", "BERT", "ResNet"
  - Algorithms: "SGD", "Adam", "RANSAC"
  - Benchmarks: "GLUE", "COCO", "WMT"
  - Avoid generic terms like "neural network"

### User Relevance

#### `relevance_to_user` (number, optional)
- **Purpose**: Estimated relevance score for the user
- **Format**: Float between 0.0 and 1.0
- **Guidelines**:
  - Based on user's research interests (if known)
  - `null` if user preferences are unavailable
  - Higher scores = more relevant
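
Since `relevance_to_user` may be missing, `null`, or a number, a consumer has three cases to handle. The helper below is a sketch under the assumption that out-of-range or non-numeric values should be treated as errors; the name `read_relevance` is hypothetical.

```python
from typing import Optional


def read_relevance(summary: dict) -> Optional[float]:
    """Return the score as a float in [0.0, 1.0], or None when absent or null."""
    value = summary.get("relevance_to_user")
    if value is None:
        return None
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"relevance_to_user must be a number or null, got {value!r}")
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"relevance_to_user must be between 0.0 and 1.0, got {value!r}")
    return float(value)


print(read_relevance({"relevance_to_user": 0.75}))  # prints 0.75
print(read_relevance({"relevance_to_user": None}))  # prints None
```

Keeping `None` distinct from `0.0` preserves the difference between "no preferences known" and "known to be irrelevant".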

#### `recommended_sections` (array of strings, optional)
- **Purpose**: Specific sections worth reading in detail
- **Format**: Section references as they appear in paper
- **Examples**: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]

## Generation Guidelines

### AI Provider Instructions

When generating summaries, AI models should:

1. **Read for understanding**: Focus on the paper's core contributions
2. **Use structured thinking**: Work through each field systematically
3. **Prefer facts over interpretation**: Extract what authors claim, not opinions
4. **Use controlled vocabulary**: Select from predefined tag lists when possible
5. **Be concise**: Optimize for quick scanning and search
6. **Handle uncertainty**: Use `null` or empty arrays for unclear fields
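
The last guideline can be applied as a post-processing step on a provider's raw output. The sketch below assumes that unclear text fields become `null` and unclear list fields become empty arrays; the split into `TEXT_FIELDS` and `LIST_FIELDS` and the name `fill_unclear_fields` are illustrative choices, not something the schema prescribes.

```python
# Assumption: these groupings follow the field types given in the definitions above.
TEXT_FIELDS = ("problem_statement", "method_overview", "main_results")
LIST_FIELDS = (
    "claimed_contributions", "assumptions", "limitations",
    "problem_tags", "technique_tags", "entities", "recommended_sections",
)


def fill_unclear_fields(summary: dict) -> dict:
    """Return a copy where unclear text fields are None and unclear lists are []."""
    filled = dict(summary)
    for key in TEXT_FIELDS:
        value = filled.get(key)
        # Missing, non-string, or whitespace-only text is treated as unclear.
        filled[key] = value if isinstance(value, str) and value.strip() else None
    for key in LIST_FIELDS:
        value = filled.get(key)
        # Missing or non-list values are treated as unclear.
        filled[key] = list(value) if isinstance(value, list) else []
    return filled


raw = {"schema_version": "1.0", "problem_statement": "  ", "entities": ["BERT"]}
normalized = fill_unclear_fields(raw)
print(normalized["problem_statement"], normalized["entities"], normalized["assumptions"])
# prints: None ['BERT'] []
```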

### Quality Criteria

Good summaries exhibit:
- **Accuracy**: Faithful to the paper's content
- **Completeness**: Cover all major aspects
- **Consistency**: Similar papers get similar treatment
- **Searchability**: Use terms that aid discovery
- **Brevity**: Information density over verbosity

### Common Issues to Avoid

- **Hallucination**: Never invent facts not in the paper
- **Editorializing**: Don't add opinions about paper quality
- **Inconsistent terminology**: Use standard field names
- **Over-abstraction**: Keep concrete details when useful
- **Under-specification**: Provide enough detail for usefulness

## Schema Evolution

### Version History

- **v1.0** (current): Initial schema with core fields

### Migration Strategy

When the schema evolves:
1. New versions increment the `schema_version` field
2. Migration tools handle format upgrades automatically
3. Backward compatibility is maintained when possible
4. Deprecated fields are marked but preserved
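
One way such a migration tool could work is a table of per-version upgrade steps applied until the current version is reached. This is a sketch only: `MIGRATIONS`, `CURRENT_VERSION`, and `migrate` are hypothetical names, and the table is empty because v1.0 is the only version so far.

```python
from typing import Callable, Dict

CURRENT_VERSION = "1.0"

# Maps a schema_version to a function that upgrades a summary by one version.
# Empty today; the commented entry only shows the intended shape.
MIGRATIONS: Dict[str, Callable[[dict], dict]] = {
    # "1.0": migrate_1_0_to_1_1,  # hypothetical future step
}


def migrate(summary: dict) -> dict:
    """Apply upgrade steps in sequence until the summary is at CURRENT_VERSION."""
    result = dict(summary)  # shallow copy so the caller's dict is left as-is
    while result.get("schema_version") != CURRENT_VERSION:
        version = result.get("schema_version")
        step = MIGRATIONS.get(version)
        if step is None:
            raise ValueError(f"no migration path from schema_version {version!r}")
        upgraded = step(result)
        # Guard against a step that forgets to bump the version.
        if upgraded.get("schema_version") == version:
            raise ValueError(f"migration from {version!r} did not advance the version")
        result = upgraded
    return result


print(migrate({"schema_version": "1.0", "entities": []}))
# prints {'schema_version': '1.0', 'entities': []}
```

Chaining single-version steps means each new schema release adds only one function, and older files are upgraded by running the steps in sequence.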

### Extensibility

Future versions may add:
- Additional structured fields
- Hierarchical tag taxonomies
- Multi-lingual support
- Citation relationship mapping
- Experimental reproducibility metadata

## Integration with paperlib

### File Lifecycle

1. **Generation**: AI provider creates `summary.json`
2. **Validation**: paperlib validates against schema
3. **Indexing**: Content indexed for search
4. **Rendering**: Human-readable `summary.md` generated
5. **Updates**: Summaries can be regenerated with new models
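
The rendering step might look roughly like the following. The Markdown layout, the section titles, and the function name `render_markdown` are assumptions made for illustration; the actual `summary.md` format produced by paperlib is not specified here.

```python
def render_markdown(summary: dict) -> str:
    """Render a summary dict as Markdown, skipping fields that are empty or null."""
    lines = ["# Summary", "", summary.get("one_sentence_summary") or ""]

    text_sections = [
        ("Problem", "problem_statement"),
        ("Method", "method_overview"),
        ("Results", "main_results"),
    ]
    for title, key in text_sections:
        if summary.get(key):
            lines += ["", f"## {title}", "", summary[key]]

    list_sections = [
        ("Contributions", "claimed_contributions"),
        ("Assumptions", "assumptions"),
        ("Limitations", "limitations"),
    ]
    for title, key in list_sections:
        items = summary.get(key) or []
        if items:
            lines += ["", f"## {title}", ""]
            lines += [f"- {item}" for item in items]

    return "\n".join(lines) + "\n"


print(render_markdown({
    "one_sentence_summary": "Introduces a faster attention variant.",
    "limitations": ["Only evaluated on English text"],
}))
```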

### Search Integration

Summary fields are indexed for search:
- Full-text search includes all text fields
- Tag-based search uses `problem_tags` and `technique_tags`
- Entity search uses the `entities` field
- Relevance ranking can use `relevance_to_user` scores
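
Tag and entity search can be pictured as an inverted index from values to papers, with `relevance_to_user` used to order the hits. The sketch below is an in-memory illustration, not paperlib's indexer; the paper IDs and the names `build_tag_index` and `rank_by_relevance` are hypothetical.

```python
from collections import defaultdict


def build_tag_index(summaries: dict) -> dict:
    """Map each (field, value) pair to the set of paper IDs that carry it."""
    index = defaultdict(set)
    for paper_id, summary in summaries.items():
        for field in ("problem_tags", "technique_tags", "entities"):
            for value in summary.get(field) or []:
                index[(field, value)].add(paper_id)
    return dict(index)


def rank_by_relevance(paper_ids, summaries: dict) -> list:
    """Sort IDs by relevance_to_user, highest first; null or missing scores go last."""
    def sort_key(paper_id):
        score = summaries[paper_id].get("relevance_to_user")
        return (score is None, -(score or 0.0))
    return sorted(paper_ids, key=sort_key)


# Hypothetical paper IDs and summaries, trimmed to the fields used here.
summaries = {
    "paper-a": {"problem_tags": ["classification"], "relevance_to_user": 0.2},
    "paper-b": {"problem_tags": ["classification"], "relevance_to_user": None},
    "paper-c": {"problem_tags": ["classification"], "entities": ["ImageNet"],
                "relevance_to_user": 0.9},
}
index = build_tag_index(summaries)
hits = index[("problem_tags", "classification")]
print(rank_by_relevance(hits, summaries))  # prints ['paper-c', 'paper-a', 'paper-b']
```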

### API Integration

Higher-level tools can consume summaries programmatically:

```python
import json
from pathlib import Path

# Load summary
summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
with summary_path.open(encoding="utf-8") as f:
    summary = json.load(f)

# Extract key information
tags = summary["problem_tags"] + summary["technique_tags"]
# relevance_to_user is optional and may be null, so map both cases to 0.0
relevance = summary.get("relevance_to_user") or 0.0
entities = summary["entities"]
```

This enables automated workflows like:
- Daily digest generation
- Research recommendation systems
- Literature review automation
- Cross-reference discovery

## Examples

### Machine Learning Paper
```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
  "problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
  "method_overview": "The paper proposes compound scaling that uniformly scales network width, depth, and resolution with a fixed ratio, guided by neural architecture search to find optimal scaling coefficients.",
  "main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
  "claimed_contributions": [
    "Novel compound scaling method for ConvNets",
    "EfficientNet family with state-of-the-art accuracy/efficiency",
    "Systematic study of scaling dimensions"
  ],
  "assumptions": [
    "ImageNet classification transfers to other vision tasks",
    "Compound scaling works across different architectures"
  ],
  "limitations": [
    "Limited evaluation on tasks beyond image classification",
    "Scaling coefficients may not generalize to all architectures"
  ],
  "problem_tags": ["classification", "computer-vision", "efficiency"],
  "technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
  "entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
  "relevance_to_user": null,
  "recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
}
```

This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.