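Distributional Semantics in Depth

Overview

This document examines the theoretical foundations, mathematical principles, and computational implementation of distributional semantics. Starting from Firth's original hypothesis, it covers co-occurrence matrix construction, dimensionality reduction with singular value decomposition, and pointwise mutual information, then analyzes the mathematics behind the Skip-gram and CBOW models and the geometric properties of semantic spaces.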

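Key Terms

Term	Core definition
Firth (John Rupert Firth)	Linguist whose 1957 dictum popularized the distributional hypothesis
Co-occurrence matrix	Counts of how often each word occurs with each context word
SVD (Singular Value Decomposition)	Matrix factorization used for dimensionality reduction
PMI (Pointwise Mutual Information)	Association measure between a word and a context
Skip-gram	Predicts the context words from the center word
CBOW (Continuous Bag-of-Words)	Predicts the center word from its context
Semantic space	Geometric space spanned by the word vectors
Lexical representation	Mathematical representation of a word
Dimensionality reduction	Mapping from a high-dimensional to a low-dimensional space
Context vector	Representation of a word by its context statistics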

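1. Theoretical Origins of the Distributional Hypothesis

1.1 John Rupert Firth and Distributional Semantics

John Rupert Firth (1890-1960) was a British linguist and the central figure of the London School. His 1957 paper contains the best-known statement of the distributional hypothesis:

"You shall know a word by the company it keeps."

In other words, words that occur in similar contexts tend to have similar meanings.

The hypothesis marked a major shift in linguistic methodology: from introspective analysis of linguistic form to empirical study of how words are distributed in actual use.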

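1.2 Cognitive Basis of the Hypothesis

Several lines of argument support the distributional hypothesis:

Language acquisition. Children learning a language cannot directly "see" the internal semantic structure of a word, but they can infer its meaning by observing how it is used. This matches the logic of the distributional hypothesis:

$$\text{meaning} = f(\text{observed contexts}) \;\Rightarrow\; \text{similar contexts} \;\Rightarrow\; \text{similar meanings}$$

Information theory. From an information-theoretic point of view, the contexts of a word carry information about its meaning, and co-occurrence statistics encode semantic similarity:

$$I(w; c) = \log \frac{p(w, c)}{p(w)\, p(c)}$$

where $I(w; c)$ is the pointwise mutual information between word $w$ and context $c$ (written PMI in Section 4.1, where the logarithm is taken base 2).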

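1.3 A Formal Framework for Distributional Semantics

The core of distributional semantics can be formalized as a triple:

$$D = (W, C, M)$$

where:

$W$: the set of words (vocabulary)
$C$: the set of context elements
$M$: the word-context matrix (word-context co-occurrence statistics)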

Example

Corpus: "The cat sat on the mat. The dog sat on the log."

Vocabulary W = {the, cat, dog, sat, on, mat, log}
Contexts C = {the, cat, dog, sat, on, mat, log}

Co-occurrence matrix M (window = 1, lowercased, counted within each sentence):
      the  cat  dog  sat  on  mat  log
the     0    1    1    0    2    1    1
cat     1    0    0    1    0    0    0
dog     1    0    0    1    0    0    0
sat     0    1    1    0    2    0    0
on      2    0    0    2    0    0    0
mat     1    0    0    0    0    0    0
log     1    0    0    0    0    0    0
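The counting behind a matrix like this can be sketched in a few lines. The snippet below is a minimal illustration, assuming lowercased whitespace tokenization and windows that do not cross sentence boundaries; the function name `window_cooccurrence` is hypothetical and not part of any library.

```python
from collections import defaultdict

def window_cooccurrence(sentences, window=1):
    """Count symmetric co-occurrences within each sentence (hypothetical helper)."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

sentences = ["The cat sat on the mat", "The dog sat on the log"]
counts = window_cooccurrence(sentences, window=1)

# Print the matrix row by row in a fixed vocabulary order
vocab = ["the", "cat", "dog", "sat", "on", "mat", "log"]
for w in vocab:
    print(f"{w:<4}", " ".join(f"{counts[(w, c)]:>3}" for c in vocab))
```

With window = 1 only adjacent tokens are counted, so pairs such as (the, sat) that are two positions apart contribute nothing; widening the window changes the counts accordingly.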

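2. Co-occurrence Matrices and Statistical Analysis

2.1 Building the Co-occurrence Matrix

The co-occurrence matrix is the foundation of distributional semantics, and the way it is built directly affects the quality of the resulting semantic representations.

2.1.1 Defining the Context

Window contexts. The most basic approach is a fixed-size word window:

import scipy.sparse as sp

def build_cooccurrence_matrix(corpus, vocabulary, window_size=5):
    """
    Build a word-by-word co-occurrence matrix.
    
    Args:
        corpus: the tokenized corpus as a list of tokens
        vocabulary: dict mapping word -> index
        window_size: number of tokens on each side of the center word
    
    Returns:
        cooc_matrix: scipy sparse matrix (CSR) of co-occurrence counts
    """
    vocab_size = len(vocabulary)
    # LIL format supports efficient incremental updates
    cooc_matrix = sp.lil_matrix((vocab_size, vocab_size))
    
    # Walk over the corpus
    for i, word in enumerate(corpus):
        if word not in vocabulary:
            continue
        word_idx = vocabulary[word]
        
        # Context positions inside the window
        start = max(0, i - window_size)
        end = min(len(corpus), i + window_size + 1)
        
        for j in range(start, end):
            if i != j and corpus[j] in vocabulary:
                context_idx = vocabulary[corpus[j]]
                cooc_matrix[word_idx, context_idx] += 1
    
    # CSR is better suited to the arithmetic that follows
    return cooc_matrix.tocsr()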

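Dependency-based contexts. A more linguistically informed approach uses syntactic dependency relations:

Sentence: "The cat sat on the mat"
Dependency parse:
  det(cat, The)
  nsubj(sat, cat)
  prep(sat, on)
  pobj(on, mat)
  det(mat, the)

Context definition: (relation, word) pairs for every arc a word takes part in; the suffix -of marks arcs in which the word is the dependent
Contexts of cat: {det:the, nsubj-of:sat}
Contexts of sat: {nsubj:cat, prep:on}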
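As a rough sketch of how dependency contexts could be collected, the snippet below starts from hand-written arcs rather than a real parser. The arc list, the `rel-of` naming for the inverse direction, and the function name `dependency_contexts` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hand-written (relation, head, dependent) arcs for "The cat sat on the mat";
# a real pipeline would obtain these from a dependency parser.
arcs = [
    ("det", "cat", "the"),
    ("nsubj", "sat", "cat"),
    ("prep", "sat", "on"),
    ("pobj", "on", "mat"),
    ("det", "mat", "the"),
]

def dependency_contexts(arcs):
    """Map each word to the set of relation-labelled contexts it appears with."""
    contexts = defaultdict(set)
    for rel, head, dep in arcs:
        contexts[head].add(f"{rel}:{dep}")      # the head sees its dependent
        contexts[dep].add(f"{rel}-of:{head}")   # the dependent sees its head (inverse arc)
    return contexts

contexts = dependency_contexts(arcs)
print(sorted(contexts["cat"]))
print(sorted(contexts["sat"]))
```

Labelling the inverse direction separately keeps "cat as subject of sat" distinct from "sat taking cat as subject", which a plain window context cannot express.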

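2.1.2 Weighting Schemes

Raw counts. The simplest scheme uses co-occurrence frequencies directly. Its drawback is that very frequent words receive too much weight.

Log smoothing.

$$M_{\log}(i, j) = \log\bigl(1 + M(i, j)\bigr)$$

Hellinger weighting. Each row is normalized to a probability distribution and then square-rooted, so that Euclidean distance between rows corresponds to the Hellinger distance between the distributions. This dampens large counts and is used to cope with sparse, skewed data:

$$M_{H}(i, j) = \sqrt{\frac{M(i, j)}{\sum_{j'} M(i, j')}}$$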
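The two reweightings can be compared on a toy count matrix. This is a minimal NumPy sketch with made-up counts; the Hellinger variant here takes the square root of the row-normalized counts, which is an assumption about the intended transform.

```python
import numpy as np

# Toy co-occurrence counts (made up, not from a real corpus)
M = np.array([[4.0, 0.0, 12.0],
              [1.0, 1.0, 2.0]])

# Log smoothing: log(1 + count) compresses large counts and keeps zeros at zero
M_log = np.log1p(M)

# Hellinger-style weighting: square root of each row's relative frequencies
row_sums = M.sum(axis=1, keepdims=True)
M_hel = np.sqrt(M / row_sums)

print(M_log.round(3))
print(M_hel.round(3))
```

After the Hellinger transform every row has unit Euclidean length, so dot products between rows are already bounded similarity scores.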

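2.1.3 Matrix Types

Matrix type	Definition	Properties
Word-word matrix	$M_{ww}[i, j]$ = co-occurrence count of word $i$ and word $j$	Symmetric (for symmetric windows); reflects lexical collocation
Word-context matrix	$M_{wc}[i, j]$ = co-occurrence count of word $i$ and context $j$	Generally asymmetric; more flexible
Word-document matrix	$M_{wd}[i, j]$ = frequency of word $i$ in document $j$	Captures topical information
Word-position matrix	$M_{wp}[i, j]$ = frequency of word $i$ at position $j$	Captures positional information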

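2.2 Mathematical Properties of the Co-occurrence Matrix

Sparsity. Co-occurrence matrices built from natural language are usually extremely sparse:

$$\text{sparsity} = 1 - \frac{\text{number of nonzero entries}}{\text{total number of entries}}, \quad \text{often above } 99\%$$

This is one of the main motivations for dimensionality reduction.

Symmetry. A word-word co-occurrence matrix built with a symmetric window is symmetric, $M = M^{T}$, which means:

Symmetric matrix factorization methods can be used
Relations between words are bidirectional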
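Both properties are easy to check numerically. The following is a small sketch on a made-up 4 x 4 matrix, using SciPy's sparse format; the variable names are illustrative only.

```python
import numpy as np
import scipy.sparse as sp

# Made-up symmetric word-word counts; the last word never co-occurs with anything
dense = np.array([[0, 2, 0, 0],
                  [2, 0, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0]])
M = sp.csr_matrix(dense)

# Sparsity: share of entries that are zero (nnz counts the stored nonzeros)
sparsity = 1.0 - M.nnz / (M.shape[0] * M.shape[1])

# Symmetry: M equals its transpose when no entry differs
is_symmetric = (M != M.T).nnz == 0

print(f"sparsity = {sparsity:.2%}, symmetric = {is_symmetric}")
```

On realistic vocabularies the same two lines apply unchanged, and the sparsity figure is what justifies storing the matrix in a sparse format in the first place.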

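3. SVD Dimensionality Reduction and the Semantic Space

3.1 Singular Value Decomposition (SVD)

SVD is the core dimensionality-reduction technique for co-occurrence matrices: it turns a high-dimensional sparse matrix into low-dimensional dense representations.

3.1.1 Mathematical Principle

For an $m \times n$ co-occurrence matrix $M$ of rank $r$, the (compact) SVD is:

$$M = U \Sigma V^{T}$$

where:

$U \in \mathbb{R}^{m \times r}$: matrix of left singular vectors (word directions)
$\Sigma \in \mathbb{R}^{r \times r}$: diagonal matrix of singular values
$V^{T} \in \mathbb{R}^{r \times n}$: transposed matrix of right singular vectors (context directions)

Truncated SVD keeps only the $k$ largest singular values:

$$M_{k} = U_{k} \Sigma_{k} V_{k}^{T}$$

The $k$-dimensional vector of word $i$ is:

$$v_{i} = U_{k}[i, :] \, \Sigma_{k}$$

or, with the singular values down-weighted (the form used in the implementation in 3.1.2),

$$v_{i} = U_{k}[i, :] \, \Sigma_{k}^{1/2}$$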
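To make the truncation concrete, here is a minimal dense example with NumPy on a made-up 4 x 4 matrix. It also checks the Eckart-Young property: the Frobenius error of the rank-k approximation equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

# Made-up co-occurrence counts
M = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 2.],
              [0., 1., 2., 0.]])

# np.linalg.svd returns singular values in descending order
U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k approximation
word_vectors = U[:, :k] * np.sqrt(s[:k])         # v_i = U_k[i, :] * Sigma_k^(1/2)

# Approximation error versus the discarded singular values
err = np.linalg.norm(M - M_k)                    # Frobenius norm
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```

The dense routine is only practical for small matrices; for vocabulary-sized sparse matrices a sparse solver is the usual choice.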

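3.1.2 SVD Implementation

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
 
def compute_word_embeddings_svd(cooc_matrix, k=300):
    """
    Compute word embeddings with truncated SVD.
    
    Args:
        cooc_matrix: sparse co-occurrence matrix (scipy sparse)
        k: target dimensionality; svds requires k < min(cooc_matrix.shape)
    
    Returns:
        embeddings: word-vector matrix of shape (vocab_size, k)
    """
    # Make sure the matrix is in CSR format
    if not isinstance(cooc_matrix, csr_matrix):
        cooc_matrix = csr_matrix(cooc_matrix)
    
    # Sparse truncated SVD (more efficient than a dense decomposition)
    U, s, Vt = svds(cooc_matrix.astype(float), k=k)
    
    # svds does not return the factors in descending order of singular value,
    # so sort them explicitly
    order = np.argsort(s)[::-1]
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    
    # Weighted combination: word vectors = U * sqrt(S)
    embeddings = U * np.sqrt(s)
    
    return embeddings
 
def ppmi_transform(cooc_matrix, eps=1e-10):
    """
    Convert a co-occurrence matrix to PPMI (Positive Pointwise Mutual Information).
    
    PPMI is a common preprocessing step before SVD.
    """
    # Dense conversion for demonstration only; use sparse operations at scale
    M = cooc_matrix.toarray().astype(float)
    
    # Marginal counts
    total = M.sum()
    row_sums = M.sum(axis=1, keepdims=True)
    col_sums = M.sum(axis=0, keepdims=True)
    
    # PMI = log2(M * total / (row_sum * col_sum)); eps guards against 0/0 and log(0)
    pmi = np.log2((M * total) / (row_sums * col_sums + eps) + eps)
    
    # PPMI = max(PMI, 0)
    ppmi = np.maximum(pmi, 0)
    
    return ppmi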

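3.2 Properties of the Semantic Space

3.2.1 Meaning of the Dimensions

Each dimension after SVD corresponds to a latent factor. In idealized cases a dimension lines up with an interpretable semantic contrast, although real SVD dimensions are often mixtures that resist a single label:

Illustrative semantic dimensions (simplified 2-D view):

D1 (gender axis)
    ↑  male
    │    man   boy   king   he
    │
────┼──────────────────────────────→ D2
    │
    │    woman girl  queen  she
    ↓  female

Reading the axes:
D1: male vs. female contrast
D2: adult vs. child, or royalty vs. commoner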

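3.2.2 Distance and Similarity

Distance measures in the semantic space:

Measure	Formula	Semantic interpretation
Euclidean distance	$\lVert v_{i} - v_{j} \rVert_{2}$	Absolute semantic distance
Cosine similarity	$\frac{v_{i} \cdot v_{j}}{\lVert v_{i} \rVert \, \lVert v_{j} \rVert}$	Similarity of semantic direction
Manhattan distance	$\sum_{k} \lvert v_{i,k} - v_{j,k} \rvert$	Sum of per-dimension differences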
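A short worked example may help separate the three measures. The vectors below are made up so that one is exactly twice the other: they point in the same direction, so the cosine similarity is 1 even though both distances are nonzero.

```python
import numpy as np

v_i = np.array([1.0, 2.0, 2.0])
v_j = np.array([2.0, 4.0, 4.0])   # same direction as v_i, twice the length

euclidean = np.linalg.norm(v_i - v_j)
cosine = v_i @ v_j / (np.linalg.norm(v_i) * np.linalg.norm(v_j))
manhattan = np.abs(v_i - v_j).sum()

print(euclidean, cosine, manhattan)   # 3.0 1.0 5.0
```

This is why cosine similarity is the usual default for word vectors: vector length often reflects word frequency rather than meaning.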

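def semantic_similarity_analysis(embeddings, word_to_idx, vocab=None):
    """
    Semantic similarity analysis.
    
    Checks whether the semantic space captures semantic relations.
    The vocab argument is unused and kept only for interface compatibility.
    """
    from sklearn.metrics.pairwise import cosine_similarity
    
    # Sample word pairs to inspect
    test_pairs = [
        ('cat', 'dog'),      # similar: animals
        ('king', 'queen'),   # similar: royalty
        ('cat', 'book'),     # dissimilar
        ('happy', 'sad'),    # antonyms (distributional models often still score these as similar)
        ('run', 'walk'),     # similar: actions
    ]
    
    results = []
    for w1, w2 in test_pairs:
        # Skip pairs with out-of-vocabulary words
        if w1 in word_to_idx and w2 in word_to_idx:
            idx1, idx2 = word_to_idx[w1], word_to_idx[w2]
            # Slicing keeps the 2-D shape that cosine_similarity expects
            sim = cosine_similarity(
                embeddings[idx1:idx1+1], 
                embeddings[idx2:idx2+1]
            )[0, 0]
            results.append((w1, w2, sim))
            print(f"{w1} <-> {w2}: {sim:.4f}")
    
    return results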

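4. The PMI Family of Measures

4.1 Pointwise Mutual Information (PMI)

PMI is a classic measure of the strength of association between two events.

4.1.1 Definition and Formula

For a word $w$ and a context $c$, PMI is defined as:

$$\mathrm{PMI}(w, c) = \log_{2} \frac{p(w, c)}{p(w)\, p(c)} = \log_{2} \frac{p(c \mid w)}{p(c)}$$

Intuition

PMI > 0: the word and the context co-occur more often than expected under independence
PMI = 0: they co-occur exactly as often as expected under independence
PMI < 0: they co-occur less often than expected under independence

Equivalent form (using counts)

$$\mathrm{PMI}(w, c) = \log_{2} \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}$$

where $\#(\cdot)$ denotes a count and $|D|$ is the total number of observed word-context pairs, so that $p(w, c) = \#(w, c) / |D|$.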
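A worked example with made-up counts shows that the probability form and the count form give the same number. All four counts below are hypothetical.

```python
import math

# Hypothetical counts: the pair occurs 10 times, the word 100 times,
# the context 50 times, out of 10,000 word-context pairs in total
n_wc, n_w, n_c, n_total = 10, 100, 50, 10_000

# Count form: log2( #(w,c) * |D| / (#(w) * #(c)) )
pmi_from_counts = math.log2(n_wc * n_total / (n_w * n_c))

# Probability form: log2( p(w,c) / (p(w) * p(c)) )
p_wc, p_w, p_c = n_wc / n_total, n_w / n_total, n_c / n_total
pmi_from_probs = math.log2(p_wc / (p_w * p_c))

print(pmi_from_counts, pmi_from_probs)   # both equal log2(20), about 4.32
```

Under independence the pair would be expected 100 * 50 / 10,000 = 0.5 times; it was seen 10 times, 20 times more often, hence PMI = log2(20).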

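4.1.2 Problems with PMI and Improvements

The zero-count problem. When $\#(w, c) = 0$, PMI tends to negative infinity.

Solution: PPMI

$$\mathrm{PPMI}(w, c) = \max\bigl(0, \mathrm{PMI}(w, c)\bigr)$$

Implementation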

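import numpy as np
from scipy.sparse import csr_matrix, issparse
from scipy.sparse.linalg import svds
 
def compute_pmi(cooc_matrix):
    """
    Compute the PMI matrix.
    
    Args:
        cooc_matrix: raw co-occurrence counts (dense array or scipy sparse)
    
    Returns:
        pmi_matrix: dense matrix of PMI values; unobserved pairs are -inf
    """
    # Work on a dense float array (demonstration only; this does not scale)
    M = cooc_matrix.toarray() if issparse(cooc_matrix) else np.asarray(cooc_matrix)
    M = M.astype(float)
    
    # Marginal counts
    word_counts = M.sum(axis=1, keepdims=True)      # (V, 1)
    context_counts = M.sum(axis=0, keepdims=True)   # (1, C)
    total = M.sum()
    
    # PMI = log2( #(w,c) * |D| / (#(w) * #(c)) )
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(M * total / (word_counts * context_counts))
    
    # 0/0 cells (words or contexts that never occur) count as unobserved
    pmi[np.isnan(pmi)] = -np.inf
    
    return pmi
 
def compute_ppmi(cooc_matrix, k=10):
    """
    Compute the PPMI matrix.
    
    If k is set, additionally reduce to k dimensions with SVD and return the
    embeddings (svds requires k < min(matrix shape)); pass k=None to get the
    PPMI matrix itself.
    """
    # PPMI = max(PMI, 0); this also maps the -inf cells to 0
    pmi = compute_pmi(cooc_matrix)
    ppmi = np.maximum(pmi, 0)
    
    # Optional: SVD dimensionality reduction
    if k:
        U, s, Vt = svds(csr_matrix(ppmi), k=k)
        # Sort factors by descending singular value
        order = np.argsort(s)[::-1]
        embeddings = U[:, order] * np.sqrt(s[order])
        return embeddings
    
    return ppmi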

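4.2 PMI Variants

4.2.1 Normalized PMI (NPMI)

$$\mathrm{NPMI}(w, c) = \frac{\mathrm{PMI}(w, c)}{-\log_{2} p(w, c)}$$

The value lies in $[-1, 1]$: $1$ when the word and context only ever occur together, $0$ under independence, and $-1$ in the limit where they never co-occur. This fixed range makes scores easier to compare.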
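The two reference points of the NPMI range can be verified directly. This is a minimal sketch with made-up probabilities; the function name `npmi` is illustrative.

```python
import math

def npmi(p_wc, p_w, p_c):
    """Normalized PMI from a joint probability and the two marginals."""
    pmi = math.log2(p_wc / (p_w * p_c))
    return pmi / -math.log2(p_wc)

# Word and context always occur together: p(w,c) = p(w) = p(c)
always_together = npmi(0.1, 0.1, 0.1)

# Independent: p(w,c) = p(w) * p(c)
independent = npmi(0.02, 0.1, 0.2)

print(round(always_together, 6), round(independent, 6))
```

The normalizer must use the same logarithm base as the PMI in the numerator, otherwise the upper bound of 1 no longer holds.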

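4.2.2 Differential PMI (DPMI)

A PMI variant that takes position into account:

$$\mathrm{DPMI}(w, c) = \mathrm{PMI}(w, c_{\text{left}}) - \mathrm{PMI}(w, c_{\text{right}})$$

Here $c_{\text{left}}$ and $c_{\text{right}}$ denote context $c$ occurring to the left and to the right of $w$. The difference captures directional semantics (such as the contrast between "before" and "after").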

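4.2.3 PMI-IR

A PMI variant based on information retrieval:

$$\text{PMI-IR}(w, \mathit{query}) = \frac{\mathrm{hits}(w \land \mathit{query})}{\mathrm{hits}(w) \cdot \mathrm{hits}(\mathit{query})}$$

where $\mathrm{hits}(\cdot)$ is the number of results a search engine returns. The score is proportional to the ratio inside the PMI logarithm, so it ranks candidates the same way, and it is used to estimate word similarity from search-engine results.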

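5. The Skip-gram and CBOW Models

5.1 The Skip-gram Model

Skip-gram is one of the two core Word2Vec models, and its mathematical basis is closely tied to distributional semantics.

5.1.1 Model Architecture

Input: center word $w_t$ (one-hot vector $\mathbf{x}_{w_t}$)
     ↓ input embedding matrix $W_{in}$ (V × d)
Hidden layer: center-word embedding $\mathbf{v}_{w_t} = W_{in}^\top \mathbf{x}_{w_t}$
     ↓ output embedding matrix $W_{out}$ (d × V)
Output: softmax → probability distribution over context words

In formula form, with $\mathbf{u}_w$ the column of $W_{out}$ for word $w$:
$p(w_O \mid w_I) = \frac{\exp(\mathbf{u}_{w_O}^\top \mathbf{v}_{w_I})}{\sum_{w=1}^V \exp(\mathbf{u}_w^\top \mathbf{v}_{w_I})}$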
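The forward pass can be written out in a few lines of NumPy. The sketch below uses small random matrices in place of trained parameters, so the resulting probabilities are meaningless; only the shapes and the softmax computation are the point. The function name `context_distribution` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4                          # toy vocabulary size and embedding dimension
W_in = rng.normal(size=(V, d))       # input (center-word) embeddings, untrained
W_out = rng.normal(size=(d, V))      # output (context-word) embeddings, untrained

def context_distribution(center_id):
    """p(w_O | w_I) for every candidate context word w_O."""
    v = W_in[center_id]              # same as W_in^T x for a one-hot x
    scores = v @ W_out               # u_w . v for every word w
    scores = scores - scores.max()   # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

p = context_distribution(2)
print(p.round(3), p.sum())
```

The denominator sums over the whole vocabulary, which is what makes the full softmax expensive and motivates approximations such as negative sampling.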

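5.1.2 How Skip-gram Relates to the Distributional Hypothesis

The Skip-gram objective is a direct operationalization of the distributional hypothesis:

$$L = \sum_{t=1}^{T} \; \sum_{-c \leq j \leq c,\; j \neq 0} \log p(w_{t+j} \mid w_{t})$$

That is, given the center word $w_{t}$, the model maximizes the predicted probability of its context words. Words that occur in similar contexts receive similar updates, so the learned embeddings realize the hypothesis that "words occurring in similar contexts have similar representations".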

5.1.3 A Complete Skip-gram Implementation

import random
from collections import Counter

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
 
class SkipGram(nn.Module):
    """
    Skip-gram word-vector model with negative sampling.
    
    Training objective: maximize p(context | center), which directly
    implements the core idea of the distributional hypothesis.
    
    Skip-gram with negative sampling is known to implicitly factorize a
    shifted PMI matrix, so the link to PMI embeddings is close, though not
    an exact equivalence in practice.
    """
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        
        # Input embeddings (center words)
        self.in_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Output embeddings (context words)
        self.out_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Initialization: small uniform input vectors, zero output vectors
        self.in_embeddings.weight.data.uniform_(-0.5/embedding_dim, 0.5/embedding_dim)
        self.out_embeddings.weight.data.zero_()
    
    def forward(self, center, context, neg_context):
        """
        Skip-gram forward pass; returns the negative-sampling loss.
        
        Args:
            center: center-word IDs [batch_size]
            context: context-word IDs [batch_size]
            neg_context: negative-sample word IDs [batch_size, num_neg]
        """
        v_center = self.in_embeddings(center)        # [batch, dim]
        u_context = self.out_embeddings(context)     # [batch, dim]
        u_neg = self.out_embeddings(neg_context)     # [batch, num_neg, dim]
        
        # Positive term: -log σ(u_context · v_center), averaged over the batch
        pos_score = torch.sum(v_center * u_context, dim=1)  # [batch]
        pos_loss = F.binary_cross_entropy_with_logits(
            pos_score, torch.ones_like(pos_score)
        )
        
        # Negative term: -Σ log σ(-u_neg · v_center), summed over the negatives.
        # squeeze(2) removes only the last dimension, so batch size 1 still works.
        neg_score = torch.bmm(u_neg, v_center.unsqueeze(2)).squeeze(2)  # [batch, num_neg]
        neg_loss = F.binary_cross_entropy_with_logits(
            neg_score, torch.zeros_like(neg_score), reduction="none"
        ).sum(dim=1).mean()
        
        return pos_loss + neg_loss
 
class Word2VecTrainer:
    """Word2Vec trainer."""
    
    def __init__(self, corpus, vocab_size=10000, embedding_dim=100, 
                 window_size=5, min_count=5, num_neg=5):
        self.corpus = corpus
        self.max_vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.min_count = min_count
        self.num_neg = num_neg
        
        # Build the vocabulary
        self.word_to_idx, self.idx_to_word, self.vocab = self._build_vocab()
        self.vocab_size = len(self.word_to_idx)
        
        # Initialize the model
        self.model = SkipGram(self.vocab_size, embedding_dim)
        self.optimizer = optim.Adam(self.model.parameters(), lr=0.01)
    
    def _build_vocab(self):
        """Build the vocabulary from words that meet the frequency threshold."""
        word_counts = Counter(self.corpus)
        filtered_counts = {w: c for w, c in word_counts.items() 
                         if c >= self.min_count}
        
        # Keep the most frequent words, up to the vocabulary size limit
        vocab = [word for word, _ in sorted(filtered_counts.items(), 
                                           key=lambda x: -x[1])[:self.max_vocab_size]]
        word_to_idx = {w: i for i, w in enumerate(vocab)}
        idx_to_word = {i: w for w, i in word_to_idx.items()}
        
        return word_to_idx, idx_to_word, vocab
    
    def _generate_training_data(self):
        """Generate (center, context) training pairs."""
        data = []
        for i, word in enumerate(self.corpus):
            if word not in self.word_to_idx:
                continue
            
            center_idx = self.word_to_idx[word]
            
            # Context words inside the window
            start = max(0, i - self.window_size)
            end = min(len(self.corpus), i + self.window_size + 1)
            
            for j in range(start, end):
                if i != j and self.corpus[j] in self.word_to_idx:
                    context_idx = self.word_to_idx[self.corpus[j]]
                    data.append((center_idx, context_idx))
        
        return data
    
    def _get_neg_samples(self, batch_size):
        """
        Draw negative samples.
        
        Uniform sampling keeps the example short; the original word2vec
        samples from the unigram distribution raised to the 0.75 power.
        """
        return np.random.choice(
            self.vocab_size, 
            size=(batch_size, self.num_neg),
            replace=True
        )
    
    def train(self, epochs=5, batch_size=512):
        """Train the model."""
        training_data = self._generate_training_data()
        
        for epoch in range(epochs):
            total_loss, n_batches = 0.0, 0
            random.shuffle(training_data)
            
            for i in range(0, len(training_data), batch_size):
                batch = training_data[i:i+batch_size]
                if len(batch) < batch_size:
                    continue  # drop the last incomplete batch
                
                center = torch.tensor([x[0] for x in batch], dtype=torch.long)
                context = torch.tensor([x[1] for x in batch], dtype=torch.long)
                neg = torch.tensor(self._get_neg_samples(len(batch)), dtype=torch.long)
                
                self.optimizer.zero_grad()
                loss = self.model(center, context, neg)
                loss.backward()
                self.optimizer.step()
                
                total_loss += loss.item()
                n_batches += 1
            
            # max(..., 1) avoids division by zero when no full batch was formed
            print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/max(n_batches, 1):.4f}")
    
    def get_embeddings(self):
        """Return the trained word vectors (input embeddings)."""
        return self.model.in_embeddings.weight.detach().cpu().numpy()

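5.2 The CBOW Model

CBOW (Continuous Bag-of-Words) is the mirror image of Skip-gram.

5.2.1 Model Architecture

Input: context words (several one-hot vectors)
     ↓ embed each word, then average
Hidden layer: averaged context embedding $\bar{v} = \frac{1}{2c}\sum_{-c \leq j \leq c, j \neq 0} v_{w_{t+j}}$
     ↓ output embedding matrix
Output: softmax → probability distribution over the center word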
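As a sketch of this forward pass, the snippet below averages the embeddings of 2c context words and applies a softmax over the vocabulary. The matrices are random stand-ins for trained parameters and the context IDs are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, c = 8, 4, 2                      # toy vocabulary size, dimension, window radius
W_in = rng.normal(size=(V, d))         # context-word embeddings, untrained
W_out = rng.normal(size=(d, V))        # output embeddings, untrained

context_ids = np.array([1, 3, 4, 6])   # the 2c = 4 words around the center position

# Hidden layer: average of the context embeddings
v_bar = W_in[context_ids].mean(axis=0)

# Output layer: softmax over the vocabulary
scores = v_bar @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

predicted_center = int(probs.argmax())
print(predicted_center, probs.round(3))
```

Because the context words are averaged before prediction, their order inside the window has no effect on the output, which is the "bag-of-words" part of the name.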

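5.2.2 CBOW vs. Skip-gram

Property	Skip-gram	CBOW
Prediction direction	Center → context	Context → center
Training data needed	Less (often reported to work well on small corpora)	More
Rare words	Better (every pair updates the word's vector directly)	Worse (diluted by averaging)
Training speed	Slower (more output predictions)	Faster
Large corpora	Performs well	May underfit (averaging discards detail)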

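5.2.3 CBOW Implementation

class CBOW(nn.Module):
    """
    CBOW word-vector model.
    
    Predicts the center word from the average of the context-word embeddings.
    """
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Output layer: maps the context representation to vocabulary logits
        self.linear = nn.Linear(embedding_dim, vocab_size)
        
        nn.init.uniform_(self.embeddings.weight, -0.5/embedding_dim, 0.5/embedding_dim)
    
    def forward(self, context_words):
        """
        Args:
            context_words: context-word ID tensor [batch, window_size * 2]
        
        Returns:
            logits over the vocabulary [batch, vocab_size]
        """
        # Embed and average, matching the 1/(2c) factor in the architecture
        embedded = self.embeddings(context_words)  # [batch, window*2, dim]
        averaged = torch.mean(embedded, dim=1)     # [batch, dim]
        
        # Predict the center word
        out = self.linear(averaged)  # [batch, vocab_size]
        return out
    
    def get_embeddings(self):
        """Return the learned word vectors."""
        return self.embeddings.weight.detach().cpu().numpy()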

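6. Geometric Properties of the Semantic Space

6.1 Algebraic Structure of the Semantic Space

Distributional semantic spaces have rich algebraic structure, and this structure encodes linguistic regularities.

6.1.1 Linear Semantic Relations

Word-vector spaces exhibit systematic linear relations (Mikolov et al., 2013):

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} \approx v_{\text{queen}}$$

Mathematical interpretation

Suppose a semantic relation can be represented as a roughly constant offset vector between the two members of each pair $(a, b)$:

$$v_{b} - v_{a} \approx v_{\text{rel}}(a, b)$$

Then:

$$v_{\text{queen}} \approx v_{\text{king}} + (v_{\text{woman}} - v_{\text{man}}) = v_{\text{king}} + \Delta_{\text{gender}}$$

This shows that gender behaves like a "direction vector" in the space.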
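The offset arithmetic can be illustrated with hand-built vectors. The three coordinates below (royalty, gender, fruitiness) are invented so that the relation holds exactly; learned embeddings only satisfy it approximately.

```python
import numpy as np

# Hand-built toy vectors [royalty, gender, fruitiness]; not learned embeddings
vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# king - man + woman: remove the "male" offset, add the "female" one
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Nearest remaining word by cosine similarity, excluding the three query words
candidates = [w for w in vectors if w not in {"king", "man", "woman"}]
best = max(candidates, key=lambda w: cosine(target, vectors[w]))
print(best)   # queen
```

Excluding the query words from the candidates matters in practice: with real embeddings the nearest neighbor of the target vector is often one of the inputs.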

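6.1.2 Separating Semantic Categories with Hyperplanes

Different semantic categories form clusters in the vector space and can often be separated by hyperplanes. A PCA projection gives a quick view of how well the clusters separate:

import numpy as np

def analyze_semantic_axes(embeddings, word_to_idx, category_words):
    """
    Analyze the principal axes of the semantic space.
    
    Projects the category words onto the first two principal components
    to reveal the directions that separate the semantic categories.
    """
    from sklearn.decomposition import PCA
    
    vectors = []
    labels = []
    
    # Collect the vectors of all in-vocabulary category words
    for cat, words in category_words.items():
        for w in words:
            if w in word_to_idx:
                vectors.append(embeddings[word_to_idx[w]])
                labels.append(cat)
    
    vectors = np.array(vectors)
    
    # PCA projection to two dimensions
    pca = PCA(n_components=2)
    coords = pca.fit_transform(vectors)
    
    # Variance explained by each principal component
    print(f"PC1 explained variance: {pca.explained_variance_ratio_[0]:.2%}")
    print(f"PC2 explained variance: {pca.explained_variance_ratio_[1]:.2%}")
    
    return coords, labels, pca
 
# Example semantic categories
category_words = {
    'animals': ['dog', 'cat', 'bird', 'fish', 'horse'],
    'fruits': ['apple', 'orange', 'banana', 'grape', 'pear'],
    'colors': ['red', 'blue', 'green', 'yellow', 'purple']
}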
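6.2 Metric Properties of the Semantic Space

6.2.1 Distance Distribution

Distances between word vectors follow characteristic statistical patterns:

Histogram of pairwise distances (schematic):

Frequency
  │
  │        ████
  │      ████████
  │    ████████████
  │  ████████████████
  │████████████████████
  └────────────────────────→ Distance
   0    1    2    3    4
    
Similar word pairs: short-tailed distribution (concentrated at small distances)
Random word pairs: long-tailed distribution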

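6.2.2 Semantic Similarity and Relatedness

Similarity scores from distributional models correlate with human judgments of semantic relatedness. The standard check is the Spearman correlation between model similarities and human ratings:

def evaluate_semantic_similarity(embeddings, word_to_idx, test_pairs, human_ratings=None):
    """
    Evaluate the quality of the computed semantic similarities.
    
    If human_ratings is given (one rating per entry of test_pairs), report
    the Spearman correlation over the pairs whose words are both in the
    vocabulary.
    """
    from scipy.stats import spearmanr
    from sklearn.metrics.pairwise import cosine_similarity
    
    similarities = []
    kept_ratings = []
    for k, (w1, w2) in enumerate(test_pairs):
        if w1 in word_to_idx and w2 in word_to_idx:
            v1 = embeddings[word_to_idx[w1]]
            v2 = embeddings[word_to_idx[w2]]
            sim = cosine_similarity([v1], [v2])[0, 0]
            similarities.append(sim)
            # Keep ratings aligned with the pairs that were actually scored
            if human_ratings is not None:
                kept_ratings.append(human_ratings[k])
    
    if human_ratings is not None and len(similarities) > 1:
        # Rank correlation with the human ratings
        corr, p_value = spearmanr(similarities, kept_ratings)
        print(f"Spearman correlation: {corr:.4f} (p={p_value:.4e})")
    
    return similarities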

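6.3 Anomalies of the Semantic Space

6.3.1 The Curse of Dimensionality and Sparsity

High-dimensional spaces suffer from the "curse of dimensionality". For a hypercube of side length $L$ in $d$ dimensions, the fraction of its volume that lies in a concentric sub-cube of side $2r < L$ around the center is:

$$\left(\frac{2r}{L}\right)^{d} \xrightarrow{\; d \to \infty \;} 0$$

As a result:

Random points lie almost entirely near the "surface"
Distances between points become nearly equal
Clustering quality degrades

Remedy: reduce the dimensionality appropriately and add regularization.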
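The concentration of distances can be observed in a small simulation. The sketch below draws uniform random points in the unit hypercube and reports the relative contrast (farthest minus nearest, divided by nearest) of distances from one reference point; the function name, sample size, and seed are arbitrary choices for illustration.

```python
import numpy as np

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one reference point."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    # Distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

As the dimension grows the contrast shrinks toward zero, meaning the nearest and farthest neighbors become almost equally far away, which is the effect behind the degraded clustering noted above.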

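6.3.2 Nonlinear Structure of the Semantic Space

Simple linear models may fail to capture complex semantic relations. The analogy task measures how far the linear assumption holds:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def check_linear_relationships(embeddings, word_to_idx, analogy_pairs):
    """
    Check linear relations in the word-vector space (analogy task).
    
    Format: (a, b, c, expected_d), read as "a is to b as c is to d"
    Test: b - a + c ≈ d
    """
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    correct, total = 0, 0
    for a, b, c, expected in analogy_pairs:
        # Skip analogies that contain out-of-vocabulary words
        if not all(w in word_to_idx for w in (a, b, c, expected)):
            continue
        total += 1
        v_a, v_b, v_c = (embeddings[word_to_idx[w]] for w in (a, b, c))
        
        # Target vector: apply the a -> b offset to c
        v_target = v_b - v_a + v_c
        
        # Nearest word by cosine similarity, excluding the three query words
        similarities = cosine_similarity([v_target], embeddings)[0]
        for w in (a, b, c):
            similarities[word_to_idx[w]] = -np.inf
        predicted_word = idx_to_word[int(np.argmax(similarities))]
        
        if predicted_word == expected:
            correct += 1
    
    # Accuracy over the analogies that could be evaluated
    accuracy = correct / total if total else 0.0
    print(f"Analogy accuracy: {accuracy:.2%} ({correct}/{total})")
    return accuracy
 
# Example analogy tests
analogy_pairs = [
    # Country-capital relation
    ('France', 'Paris', 'Japan', 'Tokyo'),
    ('China', 'Beijing', 'Germany', 'Berlin'),
    # Singular-plural
    ('apple', 'apples', 'car', 'cars'),
    ('man', 'men', 'woman', 'women'),
    # Verb morphology
    ('run', 'running', 'swim', 'swimming'),
]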

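7. Applications and Frontiers

7.1 Application Scenarios for Distributional Semantics

Application area	Typical task	Role of distributional semantics
Information retrieval	Query-document matching	Computing semantic similarity
Text classification	Sentiment analysis, topic classification	Text representation
Machine translation	Cross-lingual alignment	Mapping between semantic spaces
Question answering	Answer retrieval	Semantic matching
Recommender systems	Item similarity	Semantic extension of collaborative filtering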

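7.2 Relationship to Deep Learning

Word vectors in modern deep learning extend the ideas of distributional semantics:

Line of development:

Traditional distributional semantics   →   Neural distributional semantics
   ↓                                          ↓
Co-occurrence matrix + SVD                 Word2Vec / GloVe
PMI / PPMI                                 Skip-gram / CBOW
   ↓                                          ↓
          Static (context-independent) word vectors
                              ↓
          Contextual (context-dependent) word vectors
                              ↓
                     ELMo / BERT / GPT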

References and Further Reading

Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis (pp. 1-32). Blackwell.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211-240.
Deerwester, S., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
Turney, P. D., & Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141-188.
Mikolov, T., et al. (2013). Efficient estimation of word representations in vector space. ICLR Workshop.

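Related Documents

Word Vectors and Distributional Semantics: a complete introduction to word vectors

Guide to Computational Semantics: the overall framework of semantics

Foundations of Formal Semantics: formal semantic theory

Pragmatics and AI: applications of pragmatic inference

Conceptual Semantics in Depth: theories of concept representation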
