Summary Schema
This document defines the structure and semantics of the summary.json files that contain AI-generated paper summaries.
Overview
The summary.json file contains structured, AI-generated analysis of a paper. It is designed to:
- Provide consistent, machine-readable summaries
- Support research triage and discovery workflows
- Enable automated categorization and search
- Remain stable across different AI providers
- Use controlled vocabulary when available
Schema Version 1.0
File Structure
{
"schema_version": "1.0",
"one_sentence_summary": "This paper introduces a novel neural architecture for...",
"problem_statement": "Current approaches to X suffer from limitations...",
"method_overview": "The authors propose a hybrid approach combining...",
"main_results": "Experiments show 15% improvement over baselines...",
"claimed_contributions": [
"Novel attention mechanism design",
"State-of-the-art results on ImageNet",
"Theoretical analysis of convergence properties"
],
"assumptions": [
"Data is independently distributed",
"Computational budget allows for large models"
],
"limitations": [
"Only evaluated on English text",
"Requires significant computational resources",
"Limited theoretical justification for design choices"
],
"problem_tags": ["classification", "computer-vision", "optimization"],
"technique_tags": ["neural-networks", "attention", "transformers"],
"entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
"relevance_to_user": 0.75,
"recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
}
Field Definitions
Required Fields
schema_version (string)
- Purpose: Track format version for migration
- Format: Semantic version string (e.g., "1.0")
- Required: Yes
one_sentence_summary (string)
- Purpose: Concise paper overview for quick scanning
- Guidelines:
- One complete sentence, under 200 characters
- Focus on the main contribution or finding
- Avoid technical jargon when possible
- Example: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."
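The two required-field rules above can be checked mechanically. The following is a minimal sketch rather than paperlib's own validator: the function name `check_required_fields` is hypothetical, and the version pattern assumes the two-part form seen in the "1.0" example.

```python
import re

# Assumption: versions look like "1.0" (two numeric parts), as in the example.
VERSION_PATTERN = re.compile(r"^\d+\.\d+$")
MAX_SUMMARY_CHARS = 200  # from the "under 200 characters" guideline


def check_required_fields(summary: dict) -> list[str]:
    """Return human-readable problems; an empty list means both fields pass."""
    problems = []

    version = summary.get("schema_version")
    if not isinstance(version, str):
        problems.append("schema_version must be a string")
    elif not VERSION_PATTERN.match(version):
        problems.append(f"schema_version {version!r} is not in the expected form")

    sentence = summary.get("one_sentence_summary")
    if not isinstance(sentence, str) or not sentence.strip():
        problems.append("one_sentence_summary must be a non-empty string")
    elif len(sentence) >= MAX_SUMMARY_CHARS:
        problems.append(
            f"one_sentence_summary has {len(sentence)} characters; "
            f"the guideline is under {MAX_SUMMARY_CHARS}"
        )
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report everything wrong with a file in a single pass.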
Core Content Fields
problem_statement (string)
- Purpose: What problem does this paper address?
- Guidelines:
- 2-3 sentences maximum
- Focus on the gap or limitation being addressed
- Explain why this problem matters
method_overview (string)
- Purpose: High-level description of the approach
- Guidelines:
- 3-4 sentences maximum
- Focus on the key innovation or insight
- Avoid detailed algorithmic descriptions
main_results (string)
- Purpose: Key empirical findings or theoretical results
- Guidelines:
- Quantitative results when available
- Highlight significance of improvements
- Note any surprising or counterintuitive findings
Structured Lists
claimed_contributions (array of strings)
- Purpose: Authors' stated contributions
- Guidelines:
- Extract from paper's contribution list
- Preserve authors' framing and claims
- 3-6 items typically
assumptions (array of strings)
- Purpose: Key assumptions underlying the work
- Guidelines:
- Mathematical, methodological, or data assumptions
- Critical for understanding applicability
- Often unstated but important
limitations (array of strings)
- Purpose: Acknowledged or apparent limitations
- Guidelines:
- From authors' discussion or limitations section
- Obvious limitations not acknowledged by authors
- Important for understanding scope
Categorization
problem_tags (array of strings)
- Purpose: Categorize the problem domain
- Controlled vocabulary (preferred values):
- classification, regression, clustering
- optimization, search, planning
- generation, translation, summarization
- detection, segmentation, tracking
- compression, encoding, decoding
- privacy, security, robustness
- interpretability, fairness, ethics
- efficiency, scalability, deployment
technique_tags (array of strings)
- Purpose: Categorize the technical approaches
- Controlled vocabulary (preferred values):
- neural-networks, deep-learning, transformers
- cnn, rnn, lstm, gru, attention
- reinforcement-learning, supervised-learning, unsupervised-learning
- bayesian, probabilistic, statistical
- graph-neural-networks, graph-algorithms
- computer-vision, natural-language-processing
- federated-learning, transfer-learning, meta-learning
- adversarial, generative-models, vae, gan
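Since the vocabularies are preferred values rather than hard constraints, a tool might flag tags outside them instead of rejecting the file. The sketch below illustrates that idea; `find_unlisted_tags` is a hypothetical helper, and the two sets are copied from the lists above.

```python
# Controlled vocabularies copied from the preferred-value lists in this document.
PROBLEM_TAGS = {
    "classification", "regression", "clustering",
    "optimization", "search", "planning",
    "generation", "translation", "summarization",
    "detection", "segmentation", "tracking",
    "compression", "encoding", "decoding",
    "privacy", "security", "robustness",
    "interpretability", "fairness", "ethics",
    "efficiency", "scalability", "deployment",
}
TECHNIQUE_TAGS = {
    "neural-networks", "deep-learning", "transformers",
    "cnn", "rnn", "lstm", "gru", "attention",
    "reinforcement-learning", "supervised-learning", "unsupervised-learning",
    "bayesian", "probabilistic", "statistical",
    "graph-neural-networks", "graph-algorithms",
    "computer-vision", "natural-language-processing",
    "federated-learning", "transfer-learning", "meta-learning",
    "adversarial", "generative-models", "vae", "gan",
}


def find_unlisted_tags(summary: dict) -> dict[str, list[str]]:
    """Map each tag field to the tags that fall outside its preferred vocabulary."""
    vocabularies = {"problem_tags": PROBLEM_TAGS, "technique_tags": TECHNIQUE_TAGS}
    return {
        field: [tag for tag in (summary.get(field) or []) if tag not in vocab]
        for field, vocab in vocabularies.items()
    }
```

A caller would probably log the returned tags as warnings, since free-form tags remain valid when no preferred value fits.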
Entities and References
entities (array of strings)
- Purpose: Important datasets, models, algorithms, or systems mentioned
- Guidelines:
- Proper names: "ImageNet", "BERT", "ResNet"
- Algorithms: "SGD", "Adam", "RANSAC"
- Benchmarks: "GLUE", "COCO", "WMT"
- Avoid generic terms like "neural network"
User Relevance
relevance_to_user (number, optional)
- Purpose: Estimated relevance score for the user
- Format: Float between 0.0 and 1.0
- Guidelines:
- Based on user's research interests (if known)
- null if user preferences unavailable
- Higher scores = more relevant
recommended_sections (array of strings, optional)
- Purpose: Specific sections worth reading in detail
- Format: Section references as they appear in paper
- Examples: ["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]
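The field types listed in this section can be expressed as a small type check. This is a sketch under stated assumptions, not the validation paperlib performs: `check_field_types` is a hypothetical name, and missing or null fields are tolerated here so that presence checks can live elsewhere.

```python
from numbers import Real

STRING_FIELDS = (
    "schema_version", "one_sentence_summary",
    "problem_statement", "method_overview", "main_results",
)
LIST_FIELDS = (
    "claimed_contributions", "assumptions", "limitations",
    "problem_tags", "technique_tags", "entities", "recommended_sections",
)


def check_field_types(summary: dict) -> list[str]:
    """Return a list of type problems; missing or null fields are skipped."""
    problems = []
    for name in STRING_FIELDS:
        value = summary.get(name)
        if value is not None and not isinstance(value, str):
            problems.append(f"{name} should be a string")
    for name in LIST_FIELDS:
        value = summary.get(name)
        if value is None:
            continue
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{name} should be an array of strings")
    relevance = summary.get("relevance_to_user")
    if relevance is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(relevance, Real) and not isinstance(relevance, bool)
        if not is_number or not 0.0 <= relevance <= 1.0:
            problems.append(
                "relevance_to_user should be null or a number between 0.0 and 1.0"
            )
    return problems
```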
Generation Guidelines
AI Provider Instructions
When generating summaries, AI models should:
- Read for understanding: Focus on the paper's core contributions
- Use structured thinking: Work through each field systematically
- Prefer facts over interpretation: Extract what authors claim, not opinions
- Use controlled vocabulary: Select from predefined tag lists when possible
- Be concise: Optimize for quick scanning and search
- Handle uncertainty: Use null or empty arrays for unclear fields
Quality Criteria
Good summaries exhibit:
- Accuracy: Faithful to the paper's content
- Completeness: Cover all major aspects
- Consistency: Similar papers get similar treatment
- Searchability: Use terms that aid discovery
- Brevity: Information density over verbosity
Common Issues to Avoid
- Hallucination: Never invent facts not in the paper
- Editorializing: Don't add opinions about paper quality
- Inconsistent terminology: Use standard field names
- Over-abstraction: Keep concrete details when useful
- Under-specification: Provide enough detail for usefulness
Schema Evolution
Version History
- v1.0 (current): Initial schema with core fields
Migration Strategy
When the schema evolves:
- New versions increment the schema_version field
- Migration tools handle format upgrades automatically
- Backward compatibility maintained when possible
- Deprecated fields are marked but preserved
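One way such a migration tool could be organized is as a chain of single-step upgrades keyed on schema_version. The sketch below is purely illustrative: only v1.0 exists today, so the "1.1" version, the `example_new_field` key, and the function names are all hypothetical.

```python
from typing import Callable

Migration = Callable[[dict], dict]


def _migrate_1_0_to_1_1(summary: dict) -> dict:
    """HYPOTHETICAL step: v1.1 does not exist and the added field is made up."""
    upgraded = dict(summary)  # shallow copy so the caller's dict is not mutated
    upgraded.setdefault("example_new_field", None)
    upgraded["schema_version"] = "1.1"
    return upgraded


# Maps a source version to the step that upgrades it by exactly one version.
MIGRATIONS: dict[str, Migration] = {"1.0": _migrate_1_0_to_1_1}


def upgrade(summary: dict, target_version: str) -> dict:
    """Apply single-step migrations until the summary reaches target_version."""
    current = dict(summary)
    while current["schema_version"] != target_version:
        step = MIGRATIONS.get(current["schema_version"])
        if step is None:
            raise ValueError(
                f"no migration path from {current['schema_version']} "
                f"to {target_version}"
            )
        current = step(current)
    return current
```

Existing fields are copied through untouched, which matches the goal of preserving deprecated fields rather than dropping them.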
Extensibility
Future versions may add:
- Additional structured fields
- Hierarchical tag taxonomies
- Multi-lingual support
- Citation relationship mapping
- Experimental reproducibility metadata
Integration with paperlib
File Lifecycle
- Generation: AI provider creates summary.json
- Validation: paperlib validates against schema
- Indexing: Content indexed for search
- Rendering: Human-readable summary.md generated
- Updates: Summaries can be regenerated with new models
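The rendering step can be pictured as a straightforward mapping from fields to headed sections. The layout below is an assumption made for illustration, since this document does not specify the summary.md format, and `render_markdown` is a hypothetical name.

```python
# Assumed section titles; the real summary.md layout is not defined here.
TEXT_SECTIONS = [
    ("Problem", "problem_statement"),
    ("Method", "method_overview"),
    ("Results", "main_results"),
]
LIST_SECTIONS = [
    ("Claimed contributions", "claimed_contributions"),
    ("Assumptions", "assumptions"),
    ("Limitations", "limitations"),
    ("Recommended sections", "recommended_sections"),
]


def render_markdown(summary: dict) -> str:
    """Render a summary dict as Markdown, skipping empty or null fields."""
    headline = summary.get("one_sentence_summary") or "(no summary available)"
    lines = ["# Summary", "", headline]
    for title, field in TEXT_SECTIONS:
        text = summary.get(field)
        if text:
            lines += ["", f"## {title}", "", text]
    for title, field in LIST_SECTIONS:
        items = summary.get(field) or []
        if items:
            lines += ["", f"## {title}", ""]
            lines += [f"- {item}" for item in items]
    return "\n".join(lines) + "\n"
```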
Search Integration
Summary fields are indexed for search:
- Full-text search includes all text fields
- Tag-based search uses problem_tags and technique_tags
- Entity search uses the entities field
- Relevance ranking can use relevance_to_user scores
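As a rough illustration of tag and entity lookup with relevance ranking, the sketch below builds a tiny in-memory inverted index. It is not how paperlib indexes content; `build_index`, `rank_by_relevance`, and the "field:value" key format are hypothetical choices.

```python
from collections import defaultdict

INDEXED_FIELDS = ("problem_tags", "technique_tags", "entities")


def build_index(summaries: dict[str, dict]) -> dict[str, set[str]]:
    """Map "field:value" keys to the ids of papers whose summary contains them."""
    index = defaultdict(set)
    for paper_id, summary in summaries.items():
        for field in INDEXED_FIELDS:
            for value in summary.get(field) or []:
                # Lowercase values so that lookups are case-insensitive.
                index[f"{field}:{value.lower()}"].add(paper_id)
    return dict(index)


def rank_by_relevance(paper_ids, summaries: dict[str, dict]) -> list[str]:
    """Sort ids by descending relevance; null or missing scores count as 0.0."""
    def sort_key(paper_id: str):
        score = summaries[paper_id].get("relevance_to_user") or 0.0
        return (-score, paper_id)  # the id breaks ties deterministically
    return sorted(paper_ids, key=sort_key)
```

A lookup such as `index["entities:imagenet"]` then returns the matching paper ids, which `rank_by_relevance` can order for display.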
API Integration
Higher-level tools can consume summaries programmatically:
import json
from pathlib import Path
# Load summary
summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
with summary_path.open(encoding="utf-8") as f:
    summary = json.load(f)
# Extract key information
tags = summary["problem_tags"] + summary["technique_tags"]
# relevance_to_user is optional and may be null; treat both cases as 0.0
relevance = summary.get("relevance_to_user") or 0.0
entities = summary["entities"]
This enables automated workflows like:
- Daily digest generation
- Research recommendation systems
- Literature review automation
- Cross-reference discovery
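A daily digest, for instance, could be a filter and sort over loaded summaries. This is a minimal sketch: the function names, the 0.5 threshold, and the output line format are assumptions, and `load_summaries` simply walks a directory tree for files named summary.json.

```python
import json
from pathlib import Path


def load_summaries(root: Path) -> list[dict]:
    """Load every summary.json found under root (directory layout is assumed)."""
    summaries = []
    for path in sorted(root.rglob("summary.json")):
        with path.open(encoding="utf-8") as f:
            summaries.append(json.load(f))
    return summaries


def build_digest(summaries: list[dict], min_relevance: float = 0.5,
                 limit: int = 10) -> list[str]:
    """Return one line per paper at or above min_relevance, most relevant first."""
    scored = [(s.get("relevance_to_user"), s) for s in summaries]
    # Summaries with a null or missing score are left out of the digest.
    relevant = [(score, s) for score, s in scored
                if score is not None and score >= min_relevance]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [f"[{score:.2f}] {s['one_sentence_summary']}"
            for score, s in relevant[:limit]]
```

Calling `build_digest(load_summaries(Path("papers")))` would produce the digest lines for a local collection, assuming summaries live under a `papers` directory as in the path used earlier.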
Examples
Machine Learning Paper
{
"schema_version": "1.0",
"one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
"problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
"method_overview": "The paper proposes compound scaling that uniformly scales network width, depth, and resolution with a fixed ratio, guided by neural architecture search to find optimal scaling coefficients.",
"main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
"claimed_contributions": [
"Novel compound scaling method for ConvNets",
"EfficientNet family with state-of-the-art accuracy/efficiency",
"Systematic study of scaling dimensions"
],
"assumptions": [
"ImageNet classification transfers to other vision tasks",
"Compound scaling works across different architectures"
],
"limitations": [
"Limited evaluation on tasks beyond image classification",
"Scaling coefficients may not generalize to all architectures"
],
"problem_tags": ["classification", "computer-vision", "efficiency"],
"technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
"entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
"relevance_to_user": null,
"recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
}
This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.