分隔符与标记系统

摘要

分隔符与标记系统是组织LLM上下文的核心基础设施。本文深入讲解section分隔、item分隔、来源标注、质量标记、优先级标记的设计原理，以及如何与LLM的注意力机制结合，实现更精准的信息提取和更强的上下文理解。

关键词速览

术语	英文	说明
分隔符	Delimiter	用于分隔内容的符号
标记	Marker	标注信息属性的符号
Section	Section	内容区块/章节
Item	Item	独立条目
来源标注	Source Tag	信息来源标识
质量标记	Quality Tag	质量/可信度标识
优先级	Priority	内容重要程度
XML标签	XML Tag	可扩展标记语言
自定义标记	Custom Tag	用户定义的标记
注意力引导	Attention Guidance	引导模型注意特定内容

一、分隔符基础

1.1 分隔符的作用

分隔符在LLM上下文中的核心作用：

结构划分：明确标记不同内容区块
边界识别：帮助模型理解内容边界
层次表示：体现内容的从属关系
信息隔离：防止不同内容相互干扰
选择性忽略：让模型知道可以跳过的部分

1.2 分隔符类型对比

类型	示例	适用场景	优缺点
纯文本	`---` `***`	通用分隔	简单但语义弱
Markdown	`#` `##`	标题层级	语义明确
XML标签	`<section>`	精确语义	冗长但精确
自定义	`[CONTEXT]`	特定格式	灵活但需说明
Unicode	`─────`	视觉分隔	美观但可能乱码

二、Section分隔设计

2.1 标准Section结构

═══════════════════════════════════════════════════════
[第一部分：背景介绍]
═══════════════════════════════════════════════════════
 
这部分包含背景信息和前置知识...
 
═══════════════════════════════════════════════════════
[第二部分：核心内容]
═══════════════════════════════════════════════════════
 
这部分是文档的核心要点...
 
═══════════════════════════════════════════════════════
[第三部分：实践指南]
═══════════════════════════════════════════════════════
 
这部分提供具体的操作指南...

2.2 代码实现

class SectionDelimiters:
    """Section分隔符生成器"""
    
    DELIMITER_STYLES = {
        'heavy': {
            'top': '═' * 60,
            'middle': '─' * 60,
            'bottom': '═' * 60,
            'wrapper': lambda t, c: f"{t}\n{c}\n{t}"
        },
        'light': {
            'top': '─' * 60,
            'middle': '·' * 60,
            'bottom': '─' * 60,
            'wrapper': lambda t, c: f"{t}\n{c}\n{t}"
        },
        'markdown': {
            'top': None,
            'middle': None,
            'bottom': None,
            'wrapper': lambda t, c: f"## {c}" if t is None else f"{c}"
        },
        'bracket': {
            'top': '┌' + '─' * 58 + '┐',
            'middle': '│' + '─' * 58 + '│',
            'bottom': '└' + '─' * 58 + '┘',
            'wrapper': lambda t, c: f"{t}\n{c}\n{self.bottom}"
        }
    }
    
    @classmethod
    def create_section(
        cls,
        title: str,
        content: str,
        style: str = 'heavy'
    ) -> str:
        """创建带分隔符的section"""
        style_config = cls.DELIMITER_STYLES.get(style, cls.DELIMITER_STYLES['heavy'])
        
        if style == 'markdown':
            return f"{'# ' * 2}{title}\n\n{content}"
        
        return f"""{style_config['top']}
{title}
{style_config['bottom']}
 
{content}
"""
    
    @classmethod
    def create_sections(
        cls,
        sections: List[Dict[str, str]],
        style: str = 'heavy'
    ) -> str:
        """批量创建sections"""
        result = []
        for section in sections:
            result.append(cls.create_section(
                section['title'],
                section['content'],
                style
            ))
            result.append("")  # 段间距
        
        return "\n".join(result)

2.3 语义化的Section设计

SECTION_TEMPLATES = {
    'instruction': {
        'start': '[📋 任务说明]',
        'content_type': 'instruction',
        'priority': 'high'
    },
    'context': {
        'start': '[📚 背景信息]',
        'content_type': 'context',
        'priority': 'medium'
    },
    'example': {
        'start': '[💡 示例]',
        'content_type': 'example',
        'priority': 'medium'
    },
    'warning': {
        'start': '[⚠️ 注意]',
        'content_type': 'warning',
        'priority': 'high'
    },
    'reference': {
        'start': '[📖 参考资料]',
        'content_type': 'reference',
        'priority': 'low'
    }
}
 
def build_context_with_semantic_sections(
    sections: List[Dict]
) -> str:
    """构建语义化的分段上下文"""
    result = []
    
    # 按优先级排序
    priority_order = {'high': 0, 'medium': 1, 'low': 2}
    sorted_sections = sorted(
        sections,
        key=lambda x: priority_order.get(
            SECTION_TEMPLATES.get(x['type'], {}).get('priority', 'medium')
        )
    )
    
    for section in sorted_sections:
        template = SECTION_TEMPLATES.get(section['type'], SECTION_TEMPLATES['context'])
        result.append(template['start'])
        result.append(section['content'])
        result.append("")  # 分隔
    
    return "\n".join(result)

三、Item分隔设计

3.1 列表项分隔

## 可用功能列表
 
---
 
[1️⃣] **用户管理**
  - 创建用户
  - 编辑用户
  - 删除用户
 
---
 
[2️⃣] **权限控制**
  - 分配角色
  - 设置权限
  - 审计日志
 
---
 
[3️⃣] **系统配置**
  - 基本设置
  - 通知配置
  - 安全设置

3.2 卡片式Item

┌─────────────────────────────────────────────────┐
│ 📦 项目A                                          │
│ - 状态: 进行中                                    │
│ - 负责人: 张三                                     │
│ - 进度: 65%                                      │
└─────────────────────────────────────────────────┘
 
┌─────────────────────────────────────────────────┐
│ 📦 项目B                                          │
│ - 状态: 已完成                                    │
│ - 负责人: 李四                                     │
│ - 进度: 100%                                     │
└─────────────────────────────────────────────────┘

3.3 代码实现

class ItemSeparator:
    """Item分隔符生成器"""
    
    @staticmethod
    def numbered_list(items: List[str], start: int = 1) -> str:
        """数字编号列表"""
        return "\n".join([f"{i}. {item}" for i, item in enumerate(items, start)])
    
    @staticmethod
    def bullet_list(items: List[str], indent: int = 2) -> str:
        """项目符号列表"""
        prefix = " " * indent + "• "
        return "\n".join([f"{prefix}{item}" for item in items])
    
    @staticmethod
    def checkbox_list(items: List[str], checked: List[bool] = None) -> str:
        """复选框列表"""
        result = []
        for i, item in enumerate(items):
            state = "✓" if checked and checked[i] else "○"
            result.append(f"[{'x' if checked and checked[i] else ' '}] {item}")
        return "\n".join(result)
    
    @staticmethod
    def card_grid(items: List[Dict], width: int = 50) -> str:
        """卡片网格布局"""
        card_border = "┌" + "─" * (width - 2) + "┐"
        card_bottom = "└" + "─" * (width - 2) + "┘"
        
        lines = []
        for item in items:
            lines.append(card_border)
            lines.append(f"│ {item.get('title', 'Untitled'):<{width-4}} │")
            for key, value in item.get('details', {}).items():
                lines.append(f"│ {key}: {value:<{width-len(key)-4}} │")
            lines.append(card_bottom)
            lines.append("")
        
        return "\n".join(lines)

四、来源标注系统

4.1 来源标记类型

# 来源标注示例
 
## 直接引用
> 来源: 《人工智能导论》, 第3章, 作者: 张华, 2024年
 
## 数据来源
[数据1] 来自: 国家统计局2024年度报告
[数据2] 来自: OpenAI官方文档
[数据3] 来自: 公司内部数据库 (最后更新: 2024-01-15)
 
## 参考文档
📄 参考: 政策文件-2024-001号
📄 参考: 技术规范-v2.3

4.2 代码实现

from dataclasses import dataclass
from enum import Enum
from typing import Optional, List
import datetime
 
class SourceType(Enum):
    DOCUMENT = "document"
    WEB_PAGE = "web_page"
    DATABASE = "database"
    API = "api"
    BOOK = "book"
    PAPER = "paper"
    INTERNAL = "internal"
 
@dataclass
class Source:
    """信息来源"""
    source_type: SourceType
    identifier: str
    title: str
    author: Optional[str] = None
    url: Optional[str] = None
    date: Optional[str] = None
    reliability: float = 1.0  # 0-1可信度
    
    def to_tag(self) -> str:
        """转换为标注字符串"""
        parts = []
        
        icon = {
            SourceType.DOCUMENT: "📄",
            SourceType.WEB_PAGE: "🌐",
            SourceType.DATABASE: "💾",
            SourceType.API: "🔌",
            SourceType.BOOK: "📚",
            SourceType.PAPER: "📝",
            SourceType.INTERNAL: "🏢"
        }.get(self.source_type, "📌")
        
        parts.append(f"{icon} [{self.source_type.value}]")
        parts.append(self.title)
        
        if self.author:
            parts.append(f"作者: {self.author}")
        if self.date:
            parts.append(f"日期: {self.date}")
        
        reliability_icon = "🟢" if self.reliability >= 0.8 else "🟡" if self.reliability >= 0.5 else "🔴"
        parts.append(f"可信度: {reliability_icon}")
        
        return " | ".join(parts)
 
class SourceTracker:
    """来源追踪器"""
    
    def __init__(self):
        self.sources: List[Source] = []
        self.citation_format = "[{num}]"
    
    def add_source(self, source: Source) -> str:
        """添加来源并返回引用标记"""
        # 检查是否已存在
        for i, s in enumerate(self.sources):
            if s.identifier == source.identifier:
                return self.citation_format.format(num=i + 1)
        
        self.sources.append(source)
        return self.citation_format.format(num=len(self.sources))
    
    def format_context_with_sources(
        self,
        content: str,
        citations: Dict[int, str]  # 位置 -> 源索引
    ) -> str:
        """在上下文中插入来源标注"""
        # 实现来源插入逻辑
        return content
    
    def generate_reference_section(self) -> str:
        """生成参考资源部分"""
        if not self.sources:
            return ""
        
        lines = ["## 参考资源\n"]
        
        for i, source in enumerate(self.sources, 1):
            lines.append(f"{self.citation_format.format(num=i)} {source.to_tag()}")
        
        return "\n".join(lines)

五、质量标记系统

5.1 质量等级标记

# 内容质量标记
 
## 可靠性等级
 
🟢 **高可信度** - 官方来源、经过验证的信息
  - 官方文档
  - 学术论文
  - 权威机构发布
 
🟡 **中等可信度** - 一般来源、可能存在误差
  - 新闻报道
  - 行业分析
  - 用户反馈
 
🔴 **低可信度** - 需谨慎使用、可能不准确
  - 社交媒体
  - 匿名来源
  - 未经证实
 
## 信息时效性
 
⏰ **实时** - 刚刚更新的信息
📅 **近期** - 30天内更新
📆 **中期** - 6个月内更新
📜 **历史** - 超过6个月

5.2 实现代码

class QualityMarker:
    """质量标记器"""
    
    RELIABILITY_LEVELS = {
        'high': {'icon': '🟢', 'label': '高可信度', 'score': 1.0},
        'medium': {'icon': '🟡', 'label': '中等可信度', 'score': 0.6},
        'low': {'icon': '🔴', 'label': '低可信度', 'score': 0.3},
        'unknown': {'icon': '⚪', 'label': '可信度未知', 'score': 0.5}
    }
    
    TIMELINESS_LEVELS = {
        'realtime': {'icon': '⏰', 'label': '实时', 'days': 0},
        'recent': {'icon': '📅', 'label': '近期', 'days': 30},
        'medium': {'icon': '📆', 'label': '中期', 'days': 180},
        'historical': {'icon': '📜', 'label': '历史', 'days': 10000}
    }
    
    @classmethod
    def mark_content(
        cls,
        content: str,
        reliability: str = 'unknown',
        timeliness: str = 'recent',
        custom_notes: str = None
    ) -> str:
        """为内容添加质量标记"""
        rel_info = cls.RELIABILITY_LEVELS.get(reliability, cls.RELIABILITY_LEVELS['unknown'])
        time_info = cls.TIMELINESS_LEVELS.get(timeliness, cls.TIMELINESS_LEVELS['recent'])
        
        header = f"""{rel_info['icon']} {rel_info['label']} | {time_info['icon']} {time_info['label']}"""
        if custom_notes:
            header += f" | 📝 {custom_notes}"
        
        return f"""> {header}
>
> {content}"""
    
    @classmethod
    def filter_by_quality(
        cls,
        contents: List[Dict],
        min_reliability: str = 'unknown',
        max_age_days: int = None
    ) -> List[Dict]:
        """按质量过滤内容"""
        min_score = cls.RELIABILITY_LEVELS[min_reliability]['score']
        
        filtered = []
        for content in contents:
            reliability_score = cls.RELIABILITY_LEVELS.get(
                content.get('reliability', 'unknown'), {}
            ).get('score', 0.5)
            
            if reliability_score < min_score:
                continue
            
            if max_age_days and content.get('age_days', 0) > max_age_days:
                continue
            
            filtered.append(content)
        
        return filtered

六、优先级标记系统

6.1 优先级标记

# 优先级标记示例
 
## 用户问题处理流程
 
🔴 **[P0 - 紧急]** 系统宕机、数据丢失风险
   → 需要立即响应，5分钟内处理
 
🟠 **[P1 - 高]** 核心功能不可用
   → 24小时内处理
 
🟡 **[P2 - 中]** 功能异常、部分受影响
   → 1周内处理
 
🟢 **[P3 - 低]** 体验优化、非功能性问题
   → 按计划处理
 
## 任务优先级队列
 
> [!example]+ 优先级队列示例
> 
> 1. 🔴 备份当前数据
> 2. 🔴 验证恢复流程
> 3. 🟠 更新安全补丁
> 4. 🟠 优化数据库查询
> 5. 🟡 完善日志记录
> 6. 🟢 代码重构

6.2 代码实现

from enum import IntEnum
 
class Priority(IntEnum):
    P0_CRITICAL = 0  # 紧急
    P1_HIGH = 1      # 高
    P2_MEDIUM = 2    # 中
    P3_LOW = 3       # 低
 
class PriorityMarker:
    """优先级标记器"""
    
    PRIORITY_CONFIG = {
        Priority.P0_CRITICAL: {
            'icon': '🔴',
            'label': 'P0-紧急',
            'response_time': '5分钟',
            'sla': '立即处理'
        },
        Priority.P1_HIGH: {
            'icon': '🟠',
            'label': 'P1-高',
            'response_time': '24小时',
            'sla': '尽快处理'
        },
        Priority.P2_MEDIUM: {
            'icon': '🟡',
            'label': 'P2-中',
            'response_time': '1周',
            'sla': '按计划处理'
        },
        Priority.P3_LOW: {
            'icon': '🟢',
            'label': 'P3-低',
            'response_time': '待定',
            'sla': '可选处理'
        }
    }
    
    @classmethod
    def mark_task(cls, task: Dict, priority: Priority) -> str:
        """标记任务优先级"""
        config = cls.PRIORITY_CONFIG[priority]
        
        return f"""{config['icon']} **[{config['label']}]** {task.get('title', 'Untitled')}
- 描述: {task.get('description', '')}
- 响应时间: {config['response_time']}
- SLA: {config['sla']}"""
    
    @classmethod
    def sort_by_priority(cls, tasks: List[Dict]) -> List[Dict]:
        """按优先级排序"""
        return sorted(
            tasks,
            key=lambda x: x.get('priority', Priority.P3_LOW)
        )
    
    @classmethod
    def filter_by_priority(
        cls,
        items: List[Dict],
        min_priority: Priority = Priority.P3_LOW
    ) -> List[Dict]:
        """按最低优先级过滤"""
        return [
            item for item in items
            if item.get('priority', Priority.P3_LOW) <= min_priority
        ]

七、与注意力机制的结合

7.1 注意力引导原理

LLM的注意力机制对不同位置的内容有不同的权重。合理使用分隔符和标记可以：

强化目标内容：通过重复标记增强关注
弱化干扰内容：使用跳过标记
建立关联：使用一致的标记建立跨内容关联
引导检索：通过标记帮助定位关键信息

7.2 注意力增强技术

class AttentionEnhancer:
    """注意力增强器"""
    
    @staticmethod
    def repeat_key_terms(content: str, key_terms: List[str], repeat_count: int = 3) -> str:
        """重复关键词以增强注意力"""
        enhanced = content
        
        for term in key_terms:
            # 在开头、中间、结尾各放置一次
            if term in enhanced:
                # 简单实现：在开头添加
                parts = enhanced.split(term, 1)
                if len(parts) == 2:
                    enhanced = f"{term} {term} {parts[0]}{term}{parts[1]}"
        
        return enhanced
    
    @staticmethod
    def add_attention_markers(content: str, important_sections: List[str]) -> str:
        """为重要部分添加注意力标记"""
        marked = content
        
        for section in important_sections:
            if section in marked:
                marked = marked.replace(
                    section,
                    f">>> {section} <<< [重要]"
                )
        
        return marked
    
    @staticmethod
    def structure_for_attention(content: str, structure: str = "pyramid") -> str:
        """
        按注意力模型组织内容
        
        structure: 'pyramid' (最重要在前), 'inverted' (最重要在后)
        """
        lines = content.split('\n')
        
        if structure == "pyramid":
            # 最重要的结论/摘要放前面
            conclusion = [l for l in lines if '结论' in l or '总结' in l or '建议' in l]
            main_body = [l for l in lines if l not in conclusion]
            return '\n'.join(conclusion + main_body)
        
        elif structure == "inverted":
            # 细节在前，结论在后
            detail = [l for l in lines if '细节' not in l and '结论' not in l]
            summary = [l for l in lines if '结论' in l or '总结' in l]
            return '\n'.join(detail + summary)
        
        return content

7.3 位置优化策略

class PositionOptimizer:
    """位置优化器 - 解决Lost in Middle问题"""
    
    @staticmethod
    def reorder_for_llm(items: List[Dict], query: str) -> List[Dict]:
        """
        重排项目以优化LLM注意力
        
        策略：最重要的放开头和结尾，中间的可以适当牺牲
        """
        if len(items) <= 3:
            return items
        
        # 计算每个项目与查询的相关性（简化实现）
        scored = []
        for item in items:
            score = sum(1 for kw in item.get('keywords', []) if kw in query)
            scored.append((score, item))
        
        scored.sort(key=lambda x: x[0], reverse=True)
        
        # 重排：最高相关性放开头，次高放结尾
        reordered = [scored[0][1]]
        
        if len(scored) > 2:
            # 中间的项目（按原顺序）
            middle = [item for _, item in scored[1:-1]]
            reordered.extend(middle)
        
        reordered.append(scored[-1][1])
        
        return reordered
    
    @staticmethod
    def split_context_for_attention(
        context: str,
        max_head_size: int = 8000,
        max_tail_size: int = 8000
    ) -> str:
        """
        将上下文分为开头和结尾两部分，舍弃中间
        
        利用LLM对首尾注意力更强的特性
        """
        tokens = context.split()
        
        if len(tokens) <= max_head_size + max_tail_size:
            return context
        
        head = ' '.join(tokens[:max_head_size])
        tail = ' '.join(tokens[-max_tail_size:])
        
        return f"""{head}
 
[... 内容已截断以保持上下文完整性 ...]
 
{tail}"""

八、综合标记系统

8.1 完整标记模板

MARKER_SYSTEM_CONFIG = {
    'section_delimiter': '════════════════════════════════════',
    'item_delimiter': '─' * 40,
    'source_format': '[来源: {title} | {date}]',
    'quality_markers': ['🟢', '🟡', '🔴'],
    'priority_markers': ['🔴', '🟠', '🟡', '🟢'],
    'attention_markers': {
        'start': '【',
        'end': '】'
    }
}
 
class ComprehensiveMarkerSystem:
    """综合标记系统"""
    
    def __init__(self, config: Dict = None):
        self.config = config or MARKER_SYSTEM_CONFIG
    
    def build_context(
        self,
        sections: List[Dict],
        query: str,
        include_sources: bool = True,
        include_quality: bool = True
    ) -> str:
        """构建完整的带标记上下文"""
        result = []
        
        # 1. 查询信息
        result.append(f"## 用户查询\n{query}\n")
        
        # 2. 重排sections
        reordered = PositionOptimizer.reorder_for_llm(sections, query)
        
        # 3. 遍历sections
        for i, section in enumerate(reordered, 1):
            result.append(self.config['section_delimiter'])
            result.append(f"[第{i}部分] {section.get('title', 'Untitled')}")
            result.append(self.config['section_delimiter'])
            
            # 优先级标记
            if 'priority' in section:
                priority_marker = self.config['priority_markers'][section['priority']]
                result.append(f"{priority_marker} 优先级: {section['priority']}")
            
            # 内容
            result.append(section.get('content', ''))
            
            # 来源标注
            if include_sources and 'source' in section:
                source = section['source']
                source_tag = f"[来源: {source.get('title', 'Unknown')}"
                if source.get('date'):
                    source_tag += f" | {source['date']}"
                source_tag += "]"
                result.append(source_tag)
            
            # 质量标记
            if include_quality and 'reliability' in section:
                rel_idx = {'high': 0, 'medium': 1, 'low': 2}.get(section['reliability'], 0)
                result.append(f"{self.config['quality_markers'][rel_idx]} 可信度: {section['reliability']}")
            
            result.append("")
        
        return "\n".join(result)

8.2 使用示例

# 构建示例
system = ComprehensiveMarkerSystem()
 
sections = [
    {
        'title': '系统架构概述',
        'content': '系统采用微服务架构...',
        'priority': 1,
        'source': {'title': '技术文档v1.2', 'date': '2024-01'},
        'reliability': 'high'
    },
    {
        'title': '部署流程',
        'content': '详细部署步骤如下...',
        'priority': 2,
        'source': {'title': '运维手册', 'date': '2023-12'},
        'reliability': 'medium'
    },
    {
        'title': '故障排除',
        'content': '常见问题及解决方案...',
        'priority': 0,
        'source': {'title': '经验总结', 'date': '2024-02'},
        'reliability': 'medium'
    }
]
 
context = system.build_context(
    sections=sections,
    query="如何部署系统",
    include_sources=True,
    include_quality=True
)
 
print(context)

九、相关主题

十、参考文献

Liu, N. F., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts.
Clark, K., et al. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
Raffel, C., et al. (2020). Exploring the Limits of Transfer Learning with T5.

人工智能知识库

探索

分隔符与标记系统

关键词速览

一、分隔符基础

1.1 分隔符的作用

1.2 分隔符类型对比

二、Section分隔设计

2.1 标准Section结构

2.2 代码实现

2.3 语义化的Section设计

三、Item分隔设计

3.1 列表项分隔

3.2 卡片式Item

3.3 代码实现

四、来源标注系统

4.1 来源标记类型

4.2 代码实现

五、质量标记系统

5.1 质量等级标记

5.2 实现代码

六、优先级标记系统

6.1 优先级标记

6.2 代码实现

七、与注意力机制的结合

7.1 注意力引导原理

7.2 注意力增强技术

7.3 位置优化策略

八、综合标记系统

8.1 完整标记模板

8.2 使用示例

九、相关主题

十、参考文献

关系图谱

目录

反向链接