# Summary Schema

This document defines the structure and semantics of the `summary.json` files that contain AI-generated paper summaries.

## Overview

The `summary.json` file contains structured, AI-generated analysis of a paper. It is designed to:

- Provide consistent, machine-readable summaries
- Support research triage and discovery workflows
- Enable automated categorization and search
- Remain stable across different AI providers
- Use controlled vocabulary when available

## Schema Version 1.0

### File Structure

```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "This paper introduces a novel neural architecture for...",
  "problem_statement": "Current approaches to X suffer from limitations...",
  "method_overview": "The authors propose a hybrid approach combining...",
  "main_results": "Experiments show 15% improvement over baselines...",
  "claimed_contributions": [
    "Novel attention mechanism design",
    "State-of-the-art results on ImageNet",
    "Theoretical analysis of convergence properties"
  ],
  "assumptions": [
    "Data is independently distributed",
    "Computational budget allows for large models"
  ],
  "limitations": [
    "Only evaluated on English text",
    "Requires significant computational resources",
    "Limited theoretical justification for design choices"
  ],
  "problem_tags": ["classification", "computer-vision", "optimization"],
  "technique_tags": ["neural-networks", "attention", "transformers"],
  "entities": ["ImageNet", "BERT", "ResNet", "CIFAR-10"],
  "relevance_to_user": 0.75,
  "recommended_sections": ["Section 3.2", "Algorithm 1", "Table 2"]
}
```

## Field Definitions

### Required Fields

#### `schema_version` (string)

- **Purpose**: Track format version for migration
- **Format**: Version string in `MAJOR.MINOR` form (e.g., "1.0")
- **Required**: Yes

#### `one_sentence_summary` (string)

- **Purpose**: Concise paper overview for quick scanning
- **Guidelines**:
  - One complete sentence, under 200 characters
  - Focus on the main contribution or finding
  - Avoid technical jargon when possible
- **Example**: "This paper introduces a new attention mechanism that improves transformer efficiency by 40% while maintaining accuracy."

### Core Content Fields

#### `problem_statement` (string)

- **Purpose**: What problem does this paper address?
- **Guidelines**:
  - 2-3 sentences maximum
  - Focus on the gap or limitation being addressed
  - Explain why this problem matters

#### `method_overview` (string)

- **Purpose**: High-level description of the approach
- **Guidelines**:
  - 3-4 sentences maximum
  - Focus on the key innovation or insight
  - Avoid detailed algorithmic descriptions

#### `main_results` (string)

- **Purpose**: Key empirical findings or theoretical results
- **Guidelines**:
  - Quantitative results when available
  - Highlight significance of improvements
  - Note any surprising or counterintuitive findings

### Structured Lists

#### `claimed_contributions` (array of strings)

- **Purpose**: Authors' stated contributions
- **Guidelines**:
  - Extract from paper's contribution list
  - Preserve authors' framing and claims
  - 3-6 items typically

#### `assumptions` (array of strings)

- **Purpose**: Key assumptions underlying the work
- **Guidelines**:
  - Mathematical, methodological, or data assumptions
  - Critical for understanding applicability
  - Often unstated but important

#### `limitations` (array of strings)

- **Purpose**: Acknowledged or apparent limitations
- **Guidelines**:
  - Drawn from the authors' discussion or limitations section
  - May include obvious limitations the authors do not acknowledge
  - Important for understanding scope

### Categorization

#### `problem_tags` (array of strings)

- **Purpose**: Categorize the problem domain
- **Controlled vocabulary** (preferred values):
  - `classification`, `regression`, `clustering`
  - `optimization`, `search`, `planning`
  - `generation`, `translation`, `summarization`
  - `detection`, `segmentation`, `tracking`
  - `compression`, `encoding`, `decoding`
  - `privacy`, `security`, `robustness`
  - `interpretability`, `fairness`, `ethics`
  - `efficiency`, `scalability`, `deployment`

#### `technique_tags` (array of strings)

- **Purpose**: Categorize the technical approaches
- **Controlled vocabulary** (preferred values):
  - `neural-networks`, `deep-learning`, `transformers`
  - `cnn`, `rnn`, `lstm`, `gru`, `attention`
  - `reinforcement-learning`, `supervised-learning`, `unsupervised-learning`
  - `bayesian`, `probabilistic`, `statistical`
  - `graph-neural-networks`, `graph-algorithms`
  - `computer-vision`, `natural-language-processing`
  - `federated-learning`, `transfer-learning`, `meta-learning`
  - `adversarial`, `generative-models`, `vae`, `gan`

### Entities and References

#### `entities` (array of strings)

- **Purpose**: Important datasets, models, algorithms, or systems mentioned
- **Guidelines**:
  - Proper names: "ImageNet", "BERT", "ResNet"
  - Algorithms: "SGD", "Adam", "RANSAC"
  - Benchmarks: "GLUE", "COCO", "WMT"
  - Avoid generic terms like "neural network"

### User Relevance

#### `relevance_to_user` (number or `null`, optional)

- **Purpose**: Estimated relevance score for the user
- **Format**: Float between 0.0 and 1.0
- **Guidelines**:
  - Based on user's research interests (if known)
  - `null` if user preferences unavailable
  - Higher scores = more relevant

#### `recommended_sections` (array of strings, optional)

- **Purpose**: Specific sections worth reading in detail
- **Format**: Section references as they appear in paper
- **Examples**: `["Section 3.2", "Algorithm 1", "Table 2", "Appendix A"]`

## Generation Guidelines

### AI Provider Instructions

When generating summaries, AI models should:

1. **Read for understanding**: Focus on the paper's core contributions
2. **Use structured thinking**: Work through each field systematically
3. **Prefer facts over interpretation**: Extract what authors claim, not opinions
4. **Use controlled vocabulary**: Select from predefined tag lists when possible
5. **Be concise**: Optimize for quick scanning and search
6. **Handle uncertainty**: Use `null` or empty arrays for unclear fields

### Quality Criteria

Good summaries exhibit:

- **Accuracy**: Faithful to the paper's content
- **Completeness**: Cover all major aspects
- **Consistency**: Similar papers get similar treatment
- **Searchability**: Use terms that aid discovery
- **Brevity**: Information density over verbosity

### Common Issues to Avoid

- **Hallucination**: Never invent facts not in the paper
- **Editorializing**: Don't add opinions about paper quality
- **Inconsistent terminology**: Use standard field names
- **Over-abstraction**: Keep concrete details when useful
- **Under-specification**: Provide enough detail for usefulness

## Schema Evolution

### Version History

- **v1.0** (current): Initial schema with core fields

### Migration Strategy

When the schema evolves:

1. New versions increment the `schema_version` field
2. Migration tools handle format upgrades automatically
3. Backward compatibility is maintained when possible
4. Deprecated fields are marked but preserved

### Extensibility

Future versions may add:

- Additional structured fields
- Hierarchical tag taxonomies
- Multi-lingual support
- Citation relationship mapping
- Experimental reproducibility metadata

## Integration with paperlib

### File Lifecycle

1. **Generation**: AI provider creates `summary.json`
2. **Validation**: paperlib validates against schema
3. **Indexing**: Content indexed for search
4. **Rendering**: Human-readable `summary.md` generated
5. **Updates**: Summaries can be regenerated with new models

### Search Integration

Summary fields are indexed for search:

- Full-text search includes all text fields
- Tag-based search uses `problem_tags` and `technique_tags`
- Entity search uses the `entities` field
- Relevance ranking can use `relevance_to_user` scores

### API Integration

Higher-level tools can consume summaries programmatically:

```python
import json
from pathlib import Path

# Load summary
summary_path = Path("papers/arxiv/2022/arxiv-2212_06340/summary.json")
with summary_path.open(encoding="utf-8") as f:
    summary = json.load(f)

# Extract key information
tags = summary["problem_tags"] + summary["technique_tags"]
# relevance_to_user may be missing or null; treat both cases as 0.0
relevance = summary.get("relevance_to_user") or 0.0
entities = summary["entities"]
```

This enables automated workflows like:

- Daily digest generation
- Research recommendation systems
- Literature review automation
- Cross-reference discovery

## Examples

### Machine Learning Paper

```json
{
  "schema_version": "1.0",
  "one_sentence_summary": "Introduces EfficientNet, a family of convolutional neural networks that achieve better accuracy and efficiency than previous models through compound scaling.",
  "problem_statement": "Existing ConvNet scaling methods arbitrarily scale network dimensions, leading to suboptimal accuracy and efficiency trade-offs.",
  "method_overview": "The paper proposes compound scaling, which uniformly scales network width, depth, and resolution using a fixed set of coefficients found by a small grid search, applied to a baseline network obtained through neural architecture search.",
  "main_results": "EfficientNet-B7 achieves 84.3% top-1 accuracy on ImageNet while being 8.4x smaller and 6.1x faster than the best existing ConvNet.",
  "claimed_contributions": [
    "Novel compound scaling method for ConvNets",
    "EfficientNet family with state-of-the-art accuracy/efficiency",
    "Systematic study of scaling dimensions"
  ],
  "assumptions": [
    "ImageNet classification transfers to other vision tasks",
    "Compound scaling works across different architectures"
  ],
  "limitations": [
    "Limited evaluation on tasks beyond image classification",
    "Scaling coefficients may not generalize to all architectures"
  ],
  "problem_tags": ["classification", "computer-vision", "efficiency"],
  "technique_tags": ["cnn", "neural-architecture-search", "model-scaling"],
  "entities": ["ImageNet", "MobileNet", "ResNet", "NASNet"],
  "relevance_to_user": null,
  "recommended_sections": ["Section 3.1", "Table 2", "Figure 2"]
}
```

This schema provides a foundation for consistent, structured paper analysis while remaining flexible enough to evolve with new research needs and AI capabilities.
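
### Validation Sketch

The field definitions in this document can be checked mechanically before a summary is indexed. The following is a minimal sketch of such a check, not paperlib's actual validator: `validate_summary`, `off_vocabulary`, and `PROBLEM_VOCAB` are hypothetical names introduced here for illustration. It assumes that only `schema_version` and `one_sentence_summary` are strictly required and that every other field may be absent or `null`, which this document implies but does not state outright.

```python
from __future__ import annotations

from typing import Any

# Hypothetical helper names; none of these are part of paperlib.
REQUIRED_STRINGS = ("schema_version", "one_sentence_summary")
OPTIONAL_STRINGS = ("problem_statement", "method_overview", "main_results")
STRING_LISTS = (
    "claimed_contributions", "assumptions", "limitations",
    "problem_tags", "technique_tags", "entities", "recommended_sections",
)

# Preferred values for problem_tags, copied from the controlled vocabulary.
PROBLEM_VOCAB = {
    "classification", "regression", "clustering", "optimization", "search",
    "planning", "generation", "translation", "summarization", "detection",
    "segmentation", "tracking", "compression", "encoding", "decoding",
    "privacy", "security", "robustness", "interpretability", "fairness",
    "ethics", "efficiency", "scalability", "deployment",
}


def validate_summary(summary: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []

    for name in REQUIRED_STRINGS:
        value = summary.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{name}: required non-empty string")

    # Assumption: fields outside the required set may be missing or null.
    for name in OPTIONAL_STRINGS:
        value = summary.get(name)
        if value is not None and not isinstance(value, str):
            problems.append(f"{name}: expected string or null")

    for name in STRING_LISTS:
        value = summary.get(name)
        if value is None:
            continue
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"{name}: expected array of strings")

    sentence = summary.get("one_sentence_summary")
    if isinstance(sentence, str) and len(sentence) >= 200:
        problems.append("one_sentence_summary: should be under 200 characters")

    score = summary.get("relevance_to_user")
    if score is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            problems.append("relevance_to_user: expected null or a number in [0.0, 1.0]")

    return problems


def off_vocabulary(tags: list[str], vocabulary: set[str]) -> list[str]:
    """Return tags outside the preferred vocabulary, in their original order."""
    return [tag for tag in tags if tag not in vocabulary]


example = {
    "schema_version": "1.0",
    "one_sentence_summary": "Introduces a compound scaling method for ConvNets.",
    "problem_tags": ["classification", "efficiency"],
    "relevance_to_user": 1.5,
}
print(validate_summary(example))
# prints ['relevance_to_user: expected null or a number in [0.0, 1.0]']
```

Because the tag vocabularies are described as preferred values rather than a closed set, `off_vocabulary` is kept separate from `validate_summary` so that unknown tags can be surfaced as warnings instead of errors. The same function could be applied to `technique_tags` by passing a set built from that field's vocabulary.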