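Graph Embedding and Knowledge Graph Construction in Practice

The previous article covered how graph neural networks (GNNs) learn from graph-structured data through message passing. There is another, equally important route: graph embedding, which maps the nodes and edges of a graph into a low-dimensional vector space so that standard machine learning algorithms can work with graph data directly. This article walks through the core graph embedding methods and then shows how to build a knowledge graph from scratch.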

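Why Graph Embedding Is Needed

Deep learning handles regular, grid-like data (images, sequences) well, but much real-world data is naturally a graph: social networks, knowledge graphs, molecular structures, transportation networks. Such data has an irregular structure, and the relationships between nodes are intricate.

Why not feed the adjacency matrix straight into a neural network? Its dimensionality grows with the number of nodes, and each node has a different number of neighbors, which makes it awkward to process. Graph embedding addresses exactly this: it learns a low-dimensional dense vector of fixed length for every node while preserving the structural information of the graph.

A good graph embedding should have these properties:

Structure: nodes that are structurally similar, such as members of the same community in a social network, are close in the embedding space
Relation: nodes joined by an edge are closer together in the embedding space
Computability: the embedding vectors can be fed into conventional machine learning models for downstream tasks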

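DeepWalk: Random Walks Open the Graph Embedding Era

DeepWalk, proposed by Perozzi et al. in 2014, is a pioneering work in graph embedding. Its core idea is elegant: treat random-walk sequences on a graph as sentences in a natural language, then learn node vectors with Word2Vec.

Core Idea

Word2Vec rests on the distributional hypothesis: words that appear in similar contexts have similar meanings. DeepWalk carries this idea over to graphs: nodes that frequently co-occur in random walks should also be close in the embedding space.

The procedure:

Choose starting nodes in the graph (the code below starts from every node, in shuffled order)
Run a random walk from each starting node to produce a walk sequence
Treat each walk as a sentence and each node as a word, and train a Skip-Gram model on them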
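
To make the third step concrete, here is a minimal sketch of how one walk turns into (center, context) training pairs for Skip-Gram. The function name `skipgram_pairs` and the sample walk are illustrative assumptions; real Word2Vec implementations differ in detail (for example, they may shrink the window at random), so treat this as the idea rather than what gensim does internally.

```python
def skipgram_pairs(walk, window_size):
    """Return (center, context) pairs for every position in one walk."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window_size)
        hi = min(len(walk), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs


walk = ["A", "B", "D", "E"]  # hypothetical walk on a toy graph
for center, context in skipgram_pairs(walk, window_size=1):
    print(center, "->", context)
```

Nodes that often share a window end up as pairs many times, which is what pulls their vectors together during training.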

import numpy as np
from collections import defaultdict
import random
 
class DeepWalk:
    def __init__(self, graph, walk_length=80, num_walks=10, window_size=5, embedding_dim=64):
        self.graph = graph  # graph as an adjacency list: {node: [neighbors]}
        self.walk_length = walk_length
        self.num_walks = num_walks
        self.window_size = window_size
        self.embedding_dim = embedding_dim
    
    def random_walk(self, start_node):
        """Uniform random walk starting from start_node."""
        walk = [start_node]
        current = start_node
        for _ in range(self.walk_length - 1):
            neighbors = self.graph.get(current, [])
            if not neighbors:
                break
            current = random.choice(neighbors)
            walk.append(current)
        return walk
    
    def generate_walks(self):
        """Generate all random-walk sequences (num_walks walks per node)."""
        walks = []
        nodes = list(self.graph.keys())
        for _ in range(self.num_walks):
            random.shuffle(nodes)
            for node in nodes:
                walks.append(self.random_walk(node))
        return walks
    
    def train(self):
        """Train node embeddings with Skip-Gram via gensim's Word2Vec."""
        walks = self.generate_walks()
        from gensim.models import Word2Vec
        model = Word2Vec(
            sentences=walks,
            vector_size=self.embedding_dim,
            window=self.window_size,
            min_count=1,  # keep every node, even rarely visited ones
            sg=1,  # Skip-Gram
            workers=4,
            epochs=10
        )
        return model
 
# Example usage
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'C', 'D'],
    'C': ['A', 'B', 'D'],
    'D': ['B', 'C', 'E'],
    'E': ['D']
}
 
deepwalk = DeepWalk(graph)
model = deepwalk.train()
 
# Inspect the embedding of one node
print(f"Embedding of node A: {model.wv['A']}")
print(f"Similarity between A and D: {model.wv.similarity('A', 'D')}")

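Limitations of DeepWalk

DeepWalk is simple and effective, but it has an obvious weakness: at every step the walk picks uniformly among the current node's neighbors, with no way to bias it. In real networks node degree (the number of neighbors) varies enormously: an influencer may have millions of followers while an ordinary user has a few dozen friends. A single uniform walk strategy often cannot explore the network structure adequately.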

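Node2Vec: A Controllable Walk Strategy

Node2Vec, proposed by Grover and Leskovec in 2016, improves on DeepWalk. It introduces two parameters, $p$ and $q$, that make the random walk tunable, so it can interpolate between breadth-first (BFS) and depth-first (DFS) exploration.

BFS vs DFS: Local Information vs Global Structure

BFS (breadth-first): the walk tends to stay within the local neighborhood of the start node and reflects local, microscopic structure. In the Node2Vec paper this behavior is associated with structural equivalence: nodes that play the same role at different places in the network (for example, nodes that are all "bridges") get similar embeddings.
DFS (depth-first): the walk tends to move away from the start node and reflects a global, macroscopic view. This behavior is associated with homophily: nodes in the same community or densely connected region get similar embeddings.

Node2Vec Transition Probabilities

When choosing the next node, Node2Vec does not pick uniformly among neighbors. It adjusts the weight of each candidate according to its distance from the node visited in the previous step: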
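
Written out in the notation of the Node2Vec paper: suppose the walk has just moved from node $t$ to node $v$, and $x$ is a candidate next node. With $d_{tx}$ the shortest-path distance between $t$ and $x$ and $w_{vx}$ the edge weight (1 for an unweighted graph), the unnormalized transition weight is:

```latex
\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx},
\qquad
\alpha_{pq}(t, x) =
\begin{cases}
  \dfrac{1}{p} & \text{if } d_{tx} = 0 \quad (x = t,\ \text{step back}) \\[6pt]
  1            & \text{if } d_{tx} = 1 \quad (x \text{ is also a neighbor of } t) \\[6pt]
  \dfrac{1}{q} & \text{if } d_{tx} = 2 \quad (x \text{ moves away from } t)
\end{cases}
```

The three cases are exhaustive because $x$ is a neighbor of $v$, and $v$ is a neighbor of $t$, so $d_{tx}$ can only be 0, 1, or 2.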

class Node2Vec(DeepWalk):
    def __init__(self, graph, walk_length=80, num_walks=10, window_size=5,
                 embedding_dim=64, p=1.0, q=1.0):
        super().__init__(graph, walk_length, num_walks, window_size, embedding_dim)
        self.p = p  # return parameter: smaller p makes stepping back more likely
        self.q = q  # in-out parameter: smaller q makes moving outward more likely

    def node2vec_walk(self, start_node):
        walk = [start_node]
        last_node = start_node
        second_last = None

        while len(walk) < self.walk_length:
            neighbors = self.graph.get(last_node, [])
            if not neighbors:
                break

            # Neighbors of the previous node (empty on the first step)
            prev_neighbors = set(self.graph.get(second_last, []))

            # Transition weight for each candidate neighbor
            weights = []
            for neighbor in neighbors:
                if neighbor == second_last:
                    # Step back to the previous node: weight 1/p
                    weight = 1.0 / self.p
                elif second_last is None or neighbor in prev_neighbors:
                    # Also adjacent to the previous node (or first step): weight 1
                    weight = 1.0
                else:
                    # Two hops from the previous node: weight 1/q
                    weight = 1.0 / self.q
                weights.append(weight)

            # Sample the next node in proportion to the weights
            next_node = random.choices(neighbors, weights=weights)[0]
            walk.append(next_node)
            second_last = last_node
            last_node = next_node

        return walk

    def random_walk(self, start_node):
        # Override so that generate_walks() and train() use the biased walk
        return self.node2vec_walk(start_node)

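Rules of Thumb for p and q

High p with low q (p > 1, q < 1): the walk rarely steps back and keeps moving outward, closer to DFS; suited to capturing homophily (community structure)
High q (q > 1): the walk stays near the previous node, closer to BFS; suited to capturing structural equivalence
p=1, q=1: equivalent to DeepWalk's uniform random walk

In practice, choose the parameters by grid search, scored on a downstream task such as node classification or link prediction.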
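
To see how p and q reshape a single step before running a full grid search, the sketch below computes the normalized transition probabilities for one position of the walk on the toy graph used earlier. The helper name `transition_probs` and the chosen (p, q) settings are illustrative assumptions, not part of any library.

```python
def transition_probs(graph, prev, cur, p, q):
    """Normalized Node2Vec step probabilities after moving prev -> cur."""
    prev_neighbors = set(graph[prev])
    weights = {}
    for x in graph[cur]:
        if x == prev:
            weights[x] = 1.0 / p      # step back
        elif x in prev_neighbors:
            weights[x] = 1.0          # stays one hop from prev
        else:
            weights[x] = 1.0 / q      # moves two hops from prev
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

# The walk has just moved A -> B. Candidates: A (back), C (shared neighbor), D (farther away).
for p, q in [(1, 1), (4, 0.25), (0.25, 4)]:
    probs = transition_probs(graph, "A", "B", p, q)
    print(f"p={p}, q={q}:", {k: round(v, 3) for k, v in probs.items()})
```

With p=4, q=0.25 most of the mass moves to D (outward, DFS-like); with p=0.25, q=4 it moves to A (stepping back, staying local).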

Node2Vec in Practice

# Using the node2vec library
# pip install node2vec
import numpy as np
import networkx as nx
from node2vec import Node2Vec

# Build a NetworkX graph
G = nx.barbell_graph(m1=5, m2=8)
# Attach node and edge attributes here, if you have any

# Generate Node2Vec walks
node2vec = Node2Vec(G, dimensions=64, walk_length=30, num_walks=10, p=1, q=1, workers=1)
model = node2vec.fit(window=10, min_count=1, batch_words=4)

# Node embedding (the library stores node ids as strings)
print(f"Embedding of node 0: {model.wv['0']}")

# Most similar nodes
similar = model.wv.most_similar('0', topn=5)
print(f"Top 5 nodes most similar to node 0: {similar}")

# Downstream task: node classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# No real labels here, so degree centrality serves as a pseudo-label
centrality = nx.degree_centrality(G)
node_ids = list(G.nodes())
X = np.array([model.wv[str(n)] for n in node_ids])
# >= keeps both classes populated (clique nodes vs path nodes)
threshold = np.median(list(centrality.values()))
y = np.array([int(centrality[n] >= threshold) for n in node_ids])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(f"Node classification accuracy: {accuracy_score(y_test, clf.predict(X_test)):.3f}")

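LINE: Embedding Large-Scale Information Networks

LINE (Large-scale Information Network Embedding), proposed by Tang et al. in 2015, targets embedding for very large networks. Unlike DeepWalk and Node2Vec, which rely on random walks, LINE defines its objective directly on edges, explicitly modeling first-order and second-order proximity, and is optimized for large-scale training.

First-Order and Second-Order Proximity

First-order proximity: two nodes joined by an edge are considered similar. It directly models the probability distribution over edges.
Second-order proximity: two nodes whose neighbor sets are similar are considered similar. It captures the idea that you can tell what someone is like from the friends around them.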
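
The difference between the two notions is easiest to see on a tiny graph. The sketch below uses Jaccard overlap of neighbor sets as a simple stand-in for second-order proximity; this is an assumption made for illustration, since LINE itself compares neighborhood distributions inside its training objective rather than computing Jaccard scores. The graph and function names are hypothetical.

```python
def first_order(graph, u, v):
    """1.0 if u and v share an edge, else 0.0."""
    return 1.0 if v in graph[u] else 0.0


def second_order(graph, u, v):
    """Jaccard overlap of the two neighbor sets (illustrative proxy only)."""
    nu, nv = set(graph[u]), set(graph[v])
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


# A and B are not connected, but they have exactly the same neighbors.
graph = {
    "A": ["C", "D"],
    "B": ["C", "D"],
    "C": ["A", "B"],
    "D": ["A", "B"],
}

print("A-B first order: ", first_order(graph, "A", "B"))
print("A-B second order:", second_order(graph, "A", "B"))
print("A-C first order: ", first_order(graph, "A", "C"))
print("A-C second order:", second_order(graph, "A", "C"))
```

A and B have no first-order proximity yet maximal second-order proximity, while A and C are the reverse, which is why the two signals complement each other.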

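LINE defines an objective for each proximity (in the original paper the two are trained separately and the resulting embeddings are concatenated):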

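# Simplified sketch of the LINE objective
# In practice, use a maintained open-source implementation rather than this sketch
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def line_loss(positive_score, negative_score):
    """
    LINE loss with negative sampling: push the score of a positive pair
    (connected nodes) up and the score of a negative pair down
    """
    return -np.log(sigmoid(positive_score)) - np.log(sigmoid(-negative_score))

# Negative sampling: draw nodes not connected to the source as negatives.
# Only a few negatives are scored per edge, so the cost stays low on large graphs.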

LINE's main strength is scalability: it can be trained efficiently on networks with hundreds of millions of edges.
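
Negative sampling is a large part of why this scales. The LINE paper draws negatives from a noise distribution proportional to node degree raised to the power 3/4. The sketch below builds that distribution on a toy graph; excluding the source's own neighbors follows the description in this article, and the function names are illustrative assumptions rather than any library's API.

```python
import random


def noise_distribution(graph, power=0.75):
    """P(v) proportional to degree(v) ** power."""
    weights = {node: len(neighbors) ** power for node, neighbors in graph.items()}
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}


def sample_negatives(graph, source, k, rng):
    """Draw k negatives for `source` from nodes it is not connected to.

    Assumes at least one such node exists.
    """
    dist = noise_distribution(graph)
    candidates = [n for n in dist if n != source and n not in graph[source]]
    weights = [dist[n] for n in candidates]
    return rng.choices(candidates, weights=weights, k=k)


graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

rng = random.Random(0)  # fixed seed so the run is repeatable
print(noise_distribution(graph))
print(sample_negatives(graph, "A", k=5, rng=rng))
```

Each positive edge is contrasted with only k sampled nodes instead of every node in the graph, so the cost per update does not grow with graph size.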

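TransE: The Cornerstone of Knowledge Graph Embedding

Everything so far concerns embeddings of generic graphs. A knowledge graph is a special kind of graph whose edges are typed (relations), as in "Zhang San is a student of Tsinghua University". TransE, proposed by Bordes et al. in 2013, is the classic knowledge graph embedding method.

The Core Idea of TransE

TransE makes one simple, elegant assumption: head entity h + relation r ≈ tail entity t.

If "Zhang San is a student of Tsinghua University" holds, the vectors should satisfy vec(Zhang San) + vec(student of) ≈ vec(Tsinghua University).

The addition is simple but surprisingly effective. The relation can be read as a "translation vector" from the head entity to the tail entity: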

import numpy as np
 
class TransE:
    def __init__(self, num_entities, num_relations, embedding_dim=100, margin=1.0, norm=1):
        self.num_entities = num_entities
        self.num_relations = num_relations
        self.embedding_dim = embedding_dim
        self.margin = margin
        self.norm = norm
        
        # Randomly initialize entity and relation embeddings
        self.entity_embeddings = self._init_embeddings(num_entities)
        self.relation_embeddings = self._init_embeddings(num_relations)
    
    def _init_embeddings(self, num):
        embeddings = np.random.randn(num, self.embedding_dim)
        embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        return embeddings
    
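    def _distance(self, h, r, t):
        """Distance of one triple: ||h + r - t|| under the chosen norm."""
        return np.linalg.norm(h + r - t, ord=self.norm)
    
    def loss(self, positive_triples, negative_triples):
        """
        Margin-based ranking loss: positive triples should have a small
        distance and negative triples a large one. negative_triples must hold
        k consecutive negatives per positive, as generate_negative produces.
        """
        pos_dist = np.array([self._distance(
            self.entity_embeddings[h], 
            self.relation_embeddings[r], 
            self.entity_embeddings[t]
        ) for h, r, t in positive_triples])
        
        neg_dist = np.array([self._distance(
            self.entity_embeddings[h], 
            self.relation_embeddings[r], 
            self.entity_embeddings[t]
        ) for h, r, t in negative_triples])
        
        # Repeat each positive distance so it lines up with its k negatives
        k = len(negative_triples) // len(positive_triples)
        pos_dist = np.repeat(pos_dist, k)
        
        # hinge loss
        return np.sum(np.maximum(0, self.margin + pos_dist - neg_dist))
    
    def predict(self, h, r, t):
        """Score a triple by distance (smaller means more plausible)."""
        return self._distance(self.entity_embeddings[h], 
                             self.relation_embeddings[r], 
                             self.entity_embeddings[t])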
 
# Example
num_entities = 1000
num_relations = 50
model = TransE(num_entities, num_relations, embedding_dim=100)

# Training data: (head, relation, tail) triples
train_triples = [
    (0, 1, 100),   # entity 0 links to entity 100 via relation 1
    (5, 1, 200),   # entity 5 links to entity 200 via relation 1
    (10, 2, 300),  # entity 10 links to entity 300 via relation 2
]

# Generate negatives by randomly replacing the head or the tail.
# Note: a corrupted triple may happen to be true; real pipelines filter those out.
import random
def generate_negative(triples, num_entities, num_negatives=5):
    negatives = []
    for h, r, t in triples:
        for _ in range(num_negatives):
            if random.random() < 0.5:
                # Corrupt the head entity
                h_neg = random.randint(0, num_entities - 1)
                negatives.append((h_neg, r, t))
            else:
                # Corrupt the tail entity
                t_neg = random.randint(0, num_entities - 1)
                negatives.append((h, r, t_neg))
    return negatives
 
negatives = generate_negative(train_triples, num_entities)
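
To see how the h + r ≈ t assumption is used at prediction time, here is a minimal sketch that ranks candidate tail entities by distance. The 2-D vectors are set by hand purely for illustration (a trained model would learn them), and the entity and relation names are hypothetical.

```python
def l1_distance(h, r, t):
    """L1 norm of h + r - t."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


# Hand-picked toy embeddings, not learned values
entities = {
    "zhang_san": [0.0, 0.0],
    "tsinghua": [1.0, 1.0],
    "peking": [1.0, -1.0],
}
relations = {"student_of": [1.0, 1.0]}


def rank_tails(head, relation):
    """Return (entity, distance) candidates sorted from most to least plausible."""
    h, r = entities[head], relations[relation]
    scored = [
        (name, l1_distance(h, r, vec))
        for name, vec in entities.items()
        if name != head
    ]
    return sorted(scored, key=lambda item: item[1])


print(rank_tails("zhang_san", "student_of"))
```

Link prediction with TransE is exactly this ranking, run over all entities: the candidate whose vector lies closest to h + r is the model's best guess for the missing tail.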

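Limitations of TransE and Its Successors

TransE is simple and effective, but it has a clear limitation: it handles one-to-many, many-to-one, and many-to-many relations poorly.

Take the many-to-one relation "is a student of": Zhang San and Li Si are both students of Tsinghua University. Under TransE, vec(Tsinghua) - vec(student of) must be close to both vec(Zhang San) and vec(Li Si), which forces the two vectors to be identical or nearly so, even though the two people may differ greatly in their other attributes.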
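
The collapse can be stated precisely. Writing $h_1$ and $h_2$ for the two head entities that share relation $r$ and tail $t$, the triangle inequality bounds their distance by the two training residuals:

```latex
h_1 - h_2 = (h_1 + r - t) - (h_2 + r - t)
\quad\Longrightarrow\quad
\lVert h_1 - h_2 \rVert \;\le\; \lVert h_1 + r - t \rVert + \lVert h_2 + r - t \rVert
```

So the better TransE fits both triples (both terms on the right go to zero), the closer it pushes $h_1$ and $h_2$ together, regardless of how the two entities differ under other relations.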

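Many follow-up models address this:

TransR: projects entities into a relation-specific space with a per-relation matrix before measuring the distance
TransD: builds the projection matrix dynamically for each entity-relation pair, which is more flexible
TransH: gives each relation its own hyperplane and performs the translation on that hyperplane
RotatE: models each relation as a rotation in complex space, which naturally handles symmetric, antisymmetric, inverse, and compositional relation patterns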

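# Simplified RotatE sketch: a relation is a rotation in complex space
import numpy as np

def rotate_score(h, r, t):
    # h, r, t: arrays of shape (batch, 2) holding the real and imaginary
    # parts of one complex coordinate per row
    h_complex = h[:, 0] + 1j * h[:, 1]
    r_complex = r[:, 0] + 1j * r[:, 1]
    t_complex = t[:, 0] + 1j * t[:, 1]
    
    # RotatE requires |r| = 1 so that multiplying by r is a pure rotation
    r_complex = r_complex / np.abs(r_complex)
    t_rotated = h_complex * r_complex  # complex multiplication = rotation
    # Higher score (closer to 0) means t is closer to the rotated h
    score = -np.abs(t_complex - t_rotated)
    return score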
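
Why rotations handle symmetric relations, which TransE cannot, can be checked numerically. A relation is symmetric when r(h, t) implies r(t, h); under RotatE that means rotating twice must return to the start, which holds when every phase is 0 or π. The sketch below is a standard-library illustration with hypothetical values, not the reference implementation.

```python
import cmath
import math


def rotate(vec, phases):
    """Rotate each complex coordinate of vec by the matching phase."""
    return [z * cmath.exp(1j * theta) for z, theta in zip(vec, phases)]


def distance(a, b):
    """Sum of per-coordinate complex moduli of a - b."""
    return sum(abs(x - y) for x, y in zip(a, b))


h = [1 + 2j, -0.5 + 0.5j]          # hypothetical head embedding
symmetric = [math.pi, math.pi]     # phase pi on every coordinate

t = rotate(h, symmetric)           # t = h rotated by r
back = rotate(t, symmetric)        # applying r again should return to h

print("h -> t moved by:   ", distance(h, t))
print("t -> back vs h gap:", distance(back, h))
```

The same construction gives inverse relations for free: rotating by the negated phases undoes the original rotation.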

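Graph Databases in Practice: Getting Started with Neo4j

Once built, a knowledge graph has to be stored and queried. A graph database is a natural fit for this. Neo4j is one of the most widely used graph databases, and it uses the Cypher query language.

Installing Neo4j and Basic Operations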

# Install on macOS
brew install neo4j
# Start the server
neo4j start

# Or run it with Docker
docker run -d --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  neo4j:latest

The Cypher Query Language

Cypher is Neo4j's query language. It resembles SQL but is designed for graphs:

// Create nodes
CREATE (p:Person {name: "张三", age: 25, occupation: "工程师"});
CREATE (u:University {name: "清华大学", location: "北京"});
CREATE (c:Company {name: "字节跳动", industry: "互联网"});

// Create relationships
MATCH (p:Person {name: "张三"}), (u:University {name: "清华大学"})
CREATE (p)-[:毕业于 {year: 2015}]->(u);

MATCH (p:Person {name: "张三"}), (c:Company {name: "字节跳动"})
CREATE (p)-[:就职于 {position: "高级工程师"}]->(c);

// Query: Zhang San's alma mater
MATCH (p:Person {name: "张三"})-[:毕业于]->(u:University)
RETURN u.name, u.location;

// Query: everyone who graduated from Tsinghua University and works at an internet company
MATCH (p:Person)-[:毕业于]->(:University {name: "清华大学"}),
      (p)-[:就职于]->(:Company {industry: "互联网"})
RETURN p.name, p.occupation;

// Query paths: how Zhang San connects to Tsinghua University within 1 to 3 hops
MATCH path = (p:Person {name: "张三"})-[*1..3]-(u:University {name: "清华大学"})
RETURN path;

// Aggregate: number of graduates per university
MATCH (p:Person)-[:毕业于]->(u:University)
RETURN u.name, count(p) as num_graduates
ORDER BY num_graduates DESC;

Neo4j + Python in Practice

# pip install neo4j
from neo4j import GraphDatabase
 
class KnowledgeGraph:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
    
    def close(self):
        self.driver.close()
    
    def add_entity(self, label, properties):
        """Add an entity node. label is interpolated into the query text, so pass only trusted values."""
        props_str = ", ".join([f"{k}: ${k}" for k in properties.keys()])
        query = f"CREATE (e:{label} {{{props_str}}})"
        with self.driver.session() as session:
            session.run(query, **properties)
    
    def add_relation(self, from_id, from_label, to_id, to_label, rel_type, rel_props=None):
        """Add a relationship between two existing nodes matched by their id property."""
        rel_props = rel_props or {}
        props_str = ", ".join([f"{k}: ${k}" for k in rel_props.keys()])
        if props_str:
            props_str = " {" + props_str + "}"
        query = f"""
        MATCH (a:{from_label}), (b:{to_label})
        WHERE a.id = $from_id AND b.id = $to_id
        CREATE (a)-[r:{rel_type}{props_str}]->(b)
        """
        with self.driver.session() as session:
            session.run(query, from_id=from_id, to_id=to_id, **rel_props)
    
    def query(self, cypher):
        """Run a Cypher query and return the records as dicts."""
        with self.driver.session() as session:
            result = session.run(cypher)
            return [dict(record) for record in result]
 
# Usage
kg = KnowledgeGraph("bolt://localhost:7687", "neo4j", "password")
 
# Add data
kg.add_entity("Person", {"id": "zhangsan", "name": "张三", "age": 30})
kg.add_entity("University", {"id": "tsinghua", "name": "清华大学"})
kg.add_relation("zhangsan", "Person", "tsinghua", "University", 
                "毕业于", {"year": 2015})
 
# Query
results = kg.query("""
    MATCH (p:Person)-[r:毕业于]->(u:University)
    WHERE p.name = "张三"
    RETURN p.name, r.year, u.name
""")
print(results)
 
kg.close()
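
One caveat about the class above: Cypher parameters can carry property values, but labels and relationship types cannot be passed as parameters, which is why add_entity and add_relation build them into the query string. If those names ever come from extracted text, it is worth validating them first. The sketch below is one possible guard; `safe_identifier` and `build_create_query` are hypothetical helpers, not part of the neo4j driver.

```python
import re

# A letter or underscore followed by word characters (Unicode letters allowed)
_IDENTIFIER = re.compile(r"[^\W\d]\w*")


def safe_identifier(name):
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe Cypher identifier: {name!r}")
    return name


def build_create_query(label, properties):
    """Build a CREATE statement; property values stay as $parameters."""
    props = ", ".join(f"{safe_identifier(key)}: ${key}" for key in properties)
    return f"CREATE (e:{safe_identifier(label)} {{{props}}})"


print(build_create_query("Person", {"id": "zhangsan", "name": "张三"}))

try:
    build_create_query("Person) DETACH DELETE (n", {"id": "x"})
except ValueError as err:
    print("rejected:", err)
```

The values themselves never touch the query text, so only the label and key names need this check.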

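Building a Knowledge Graph in Practice: From Text to Graph

Now for the main event: building a knowledge graph from unstructured text. The pipeline has three steps: named entity recognition (NER) → relation extraction → graph construction and storage.

Step 1: Named Entity Recognition

Identify entities such as person names, place names, and organization names in the text: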

# Use a pretrained NER model from Hugging Face
from transformers import pipeline

# aggregation_strategy="simple" groups tokens into entities and adds the 'entity_group' key
ner_pipeline = pipeline("ner", model="uer/roberta-base-finetuned-cluener2020-chinese",
                        aggregation_strategy="simple")
text = "马云创办了阿里巴巴集团，总部位于杭州，在2023年宣布退休。"

entities = ner_pipeline(text)
print(entities)
# Illustrative output shape (label names and scores depend on the model):
# [{'entity_group': 'PER', 'word': '马云', 'score': 0.99},
#  {'entity_group': 'ORG', 'word': '阿里巴巴集团', 'score': 0.97},
#  {'entity_group': 'LOC', 'word': '杭州', 'score': 0.98}]

# Merge leftover word pieces ('##'-prefixed) into the preceding entity
def merge_entities(entities):
    merged = []
    current_entity = None
    for ent in entities:
        if (current_entity is not None
                and ent['entity_group'] == current_entity['type']
                and ent['word'].startswith('##')):
            current_entity['word'] += ent['word'].replace('##', '')
        else:
            if current_entity:
                merged.append(current_entity)
            current_entity = {'type': ent['entity_group'], 'word': ent['word'], 'score': ent['score']}
    if current_entity:
        merged.append(current_entity)
    return merged

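Step 2: Relation Extraction

Identify the relations between entities. A pretrained classifier is enough for simple cases; complex scenarios call for a dedicated model: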

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
 
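# Chinese relation-extraction model. The checkpoint name is taken as given here;
# confirm it is available and that its label order matches relation_types below.
model_name = "shibing624/bert-base-chinese-relation-extraction"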
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
 
# Relation types (the order must match the classifier's label indices)
relation_types = [
    "无关系", "老板", "成立于", "位于", "毕业于",
    "就职于", "父母", "配偶", "兄弟姐妹", "合作"
]
 
def extract_relations(text, entities):
    """Extract the relation for every ordered pair of entities in the text."""
    relations = []
    # Enumerate every ordered entity pair
    for i, ent1 in enumerate(entities):
        for j, ent2 in enumerate(entities):
            if i == j:
                continue
            # Build the input: wrap the two entities in marker tags
            input_text = text.replace(ent1['word'], f"[E1]{ent1['word']}[/E1]") \
                             .replace(ent2['word'], f"[E2]{ent2['word']}[/E2]")
            
            inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=128)
            with torch.no_grad():
                outputs = model(**inputs)
            
            pred_idx = outputs.logits.argmax(dim=-1).item()
            pred_relation = relation_types[pred_idx]
            confidence = torch.softmax(outputs.logits, dim=-1)[0][pred_idx].item()
            
            if pred_relation != "无关系" and confidence > 0.7:
                relations.append({
                    'head': ent1,
                    'relation': pred_relation,
                    'tail': ent2,
                    'confidence': confidence
                })
    
    return relations
 
# Example
entities = [
    {'type': 'PER', 'word': '马云', 'score': 0.99},
    {'type': 'ORG', 'word': '阿里巴巴', 'score': 0.97},
    {'type': 'LOC', 'word': '杭州', 'score': 0.98}
]
text = "马云创办了阿里巴巴集团，总部位于杭州。"
 
relations = extract_relations(text, entities)
for rel in relations:
    print(f"{rel['head']['word']} --[{rel['relation']}]--> {rel['tail']['word']} "
          f"(confidence: {rel['confidence']:.2f})")
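
One fragile spot in extract_relations is the chained str.replace call: it wraps every occurrence of an entity string, and it can misfire when one entity is a substring of another. Aggregated Hugging Face NER results also carry 'start' and 'end' character offsets, so a sturdier option is to insert the markers by position. The helper below is a sketch under that assumption; `mark_entities` is a hypothetical name, and the spans are typed in by hand here.

```python
def mark_entities(text, head_span, tail_span):
    """Wrap two non-overlapping (start, end) character spans in [E1]/[E2] markers."""
    spans = [(head_span, "E1"), (tail_span, "E2")]
    # Insert from right to left so the earlier offsets stay valid
    for (start, end), tag in sorted(spans, key=lambda s: s[0][0], reverse=True):
        text = f"{text[:start]}[{tag}]{text[start:end]}[/{tag}]{text[end:]}"
    return text


text = "马云创办了阿里巴巴集团，总部位于杭州。"
head = (0, 2)    # 马云
tail = (5, 11)   # 阿里巴巴集团

print(mark_entities(text, head, tail))
print(mark_entities(text, tail, head))  # roles swapped, same text
```

Because the markers follow the spans rather than the strings, the same function works for either direction of an entity pair.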

Step 3: Graph Construction and Storage

Store the extracted entities and relations in Neo4j:

from neo4j import GraphDatabase
 
def build_knowledge_graph(triples, uri="bolt://localhost:7687", user="neo4j", password="password"):
    """
    triples: [(head, relation, tail), ...]
    each item is (head entity name, relation type, tail entity name)
    """
    driver = GraphDatabase.driver(uri, auth=(user, password))
    
    with driver.session() as session:
        for head, relation, tail in triples:
            # Create or match the head and tail nodes, then the relationship between them
            session.run("""
                MERGE (h:Entity {name: $head})
                MERGE (t:Entity {name: $tail})
                MERGE (h)-[r:RELATES {type: $relation}]->(t)
            """, head=head, tail=tail, relation=relation)
    
    driver.close()
    print(f"Imported {len(triples)} knowledge triples into the graph database")
 
# Sample knowledge extracted from text
knowledge_triples = [
    ("马云", "创办", "阿里巴巴"),
    ("阿里巴巴", "总部位于", "杭州"),
    ("马云", "就职于", "阿里巴巴"),
    ("阿里巴巴", "创立于", "1999年"),
    ("马云", "毕业于", "杭州师范学院"),
    ("张勇", "现任CEO", "阿里巴巴"),
    ("阿里巴巴", "竞争对手", "京东"),
    ("京东", "总部位于", "北京"),
]
 
build_knowledge_graph(knowledge_triples)
 
# Verify with a query
with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as driver:
    with driver.session() as session:
        result = session.run("""
            MATCH (h)-[r]->(t)
            RETURN h.name as head, r.type as relation, t.name as tail
            LIMIT 10
        """)
        print("\nRelations in the graph:")
        for record in result:
            print(f"  {record['head']} --[{record['relation']}]--> {record['tail']}")
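
Step 2 returns a list of dicts while build_knowledge_graph expects plain (head, relation, tail) tuples, so a small conversion step is needed between them. The sketch below also drops low-confidence predictions and keeps each triple once; `relations_to_triples` is a hypothetical helper, and the sample confidences are made up for illustration.

```python
def relations_to_triples(relations, min_confidence=0.7):
    """Convert Step 2 relation dicts into unique triples, best confidence first."""
    best = {}
    for rel in relations:
        if rel["confidence"] < min_confidence:
            continue
        key = (rel["head"]["word"], rel["relation"], rel["tail"]["word"])
        best[key] = max(best.get(key, 0.0), rel["confidence"])
    return sorted(best, key=lambda k: -best[k])


# Hypothetical Step 2 output
relations = [
    {"head": {"word": "阿里巴巴"}, "relation": "位于", "tail": {"word": "杭州"}, "confidence": 0.91},
    {"head": {"word": "马云"}, "relation": "就职于", "tail": {"word": "阿里巴巴"}, "confidence": 0.85},
    {"head": {"word": "马云"}, "relation": "就职于", "tail": {"word": "阿里巴巴"}, "confidence": 0.95},
    {"head": {"word": "杭州"}, "relation": "位于", "tail": {"word": "马云"}, "confidence": 0.40},
]

for triple in relations_to_triples(relations):
    print(triple)
```

The resulting list can be passed straight to build_knowledge_graph; because that function uses MERGE, re-importing the same triple later does not create duplicates.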

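Case Study: Building a Financial Knowledge Graph

To finish, one end-to-end example ties the pieces together.

Building a Knowledge Graph from Financial News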

from transformers import pipeline
from neo4j import GraphDatabase

class FinancialKGPipeline:
    def __init__(self, neo4j_uri="bolt://localhost:7687", neo4j_user="neo4j", neo4j_password="password",
                 company_labels=("ORG", "company", "organization")):
        # NER with the Chinese model from Step 1, since the articles are Chinese text.
        # company_labels is an assumption: set it to the organization labels your model emits.
        self.ner = pipeline("ner", model="uer/roberta-base-finetuned-cluener2020-chinese",
                            aggregation_strategy="simple")
        self.company_labels = set(company_labels)
        self.driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_password))
    
    def extract_companies(self, text):
        """Identify company entities."""
        entities = self.ner(text)
        companies = []
        for ent in entities:
            if ent['entity_group'] in self.company_labels:
                # Drop single-character hits, which are rarely company names
                if len(ent['word']) >= 2:
                    companies.append(ent['word'])
        return list(set(companies))
    
    def infer_relations(self, text, companies):
        """Infer relation types with keyword rules.

        Crude by design: a keyword anywhere in the text labels every ordered
        pair of companies found in it, so expect false positives.
        """
        relations = []
        company_pairs = [(c1, c2) for c1 in companies for c2 in companies if c1 != c2]
        
        # Simple keyword rules
        relation_keywords = {
            '竞争对手': ['竞争', '对抗', '挑战', 'rivalry'],
            '合作': ['合作', '携手', 'partnership'],
            '收购': ['收购', '并购'],
            '投资': ['投资', '入股', '持股'],
        }
        
        for c1, c2 in company_pairs:
            for rel_type, keywords in relation_keywords.items():
                if any(kw in text for kw in keywords):
                    relations.append((c1, rel_type, c2))
        
        return relations
    
    def build_graph(self, articles):
        """Build the knowledge graph from a list of articles."""
        all_triples = []
        
        for article in articles:
            text = article['content']
            companies = self.extract_companies(text)
            relations = self.infer_relations(text, companies)
            all_triples.extend(relations)
        
        # Deduplicate
        unique_triples = list(set(all_triples))
        
        # Write to Neo4j
        with self.driver.session() as session:
            for h, r, t in unique_triples:
                session.run("""
                    MERGE (e1:Company {name: $h})
                    MERGE (e2:Company {name: $t})
                    MERGE (e1)-[rel:RELATES {type: $r}]->(e2)
                """, h=h, t=t, r=r)
        
        print(f"Build complete: {len(unique_triples)} relations")
        return unique_triples
 
# Usage example
articles = [
    {"title": "阿里vs京东：电商大战持续", 
     "content": "阿里巴巴和京东作为中国电商两大巨头，长期处于竞争关系。"},
    {"title": "腾讯投资京东", 
     "content": "腾讯是京东的重要投资者，双方在多个领域展开合作。"},
    {"title": "美团进军电商", 
     "content": "美团宣布进军电商领域，将与阿里巴巴和京东形成竞争。"}
]
 
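# Not named `pipeline`, which would shadow the transformers import above
kg_pipeline = FinancialKGPipeline()
triples = kg_pipeline.build_graph(articles)
print("Extracted knowledge triples:")
for t in triples:
    print(f"  {t}")
kg_pipeline.driver.close()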
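
The keyword rule in infer_relations looks at the whole article at once, so one keyword labels every company pair in it. A cheap refinement, sketched below, is to require the two companies and the keyword to appear in the same clause. This is an illustrative variant with a shortened keyword table, not a replacement for a trained relation extractor; the function name and sample text are assumptions, and the pair order simply follows the order of the input list, so direction is not meaningful.

```python
import re
from itertools import combinations

RELATION_KEYWORDS = {
    "竞争对手": ["竞争"],
    "合作": ["合作"],
    "投资": ["投资"],
}


def infer_relations_by_clause(text, companies):
    """Emit a triple only when both companies and a keyword share one clause."""
    triples = set()
    for clause in re.split(r"[。！？；，]", text):
        present = [c for c in companies if c in clause]
        for c1, c2 in combinations(present, 2):
            for rel_type, keywords in RELATION_KEYWORDS.items():
                if any(kw in clause for kw in keywords):
                    triples.add((c1, rel_type, c2))
    return triples


text = "腾讯是京东的重要投资者，双方在多个领域展开合作。阿里巴巴和美团形成竞争。"
companies = ["腾讯", "京东", "阿里巴巴", "美团"]

for triple in sorted(infer_relations_by_clause(text, companies)):
    print(triple)
```

The whole-text rule would have linked all four companies with all three relation types here; the clause-level version keeps only the two pairs that actually co-occur with a keyword, at the cost of missing relations expressed through pronouns such as 双方.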

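Comparing and Choosing Graph Embedding Methods

To close, a summary of each method's characteristics and where it fits:

Method	Best suited to	Strengths	Weaknesses
DeepWalk	Small homogeneous networks	Simple and direct	No control over the walk strategy
Node2Vec	Tasks that need both local and global structure	Flexible and tunable (p/q parameters)	Parameter tuning takes effort
LINE	Very large networks	Highly scalable	Limited modeling of higher-order proximity
TransE	Knowledge graphs	Simple, effective, interpretable	Struggles with one-to-many, many-to-one, and many-to-many relations
TransR/TransD	Knowledge graphs	Handles complex relations well	More parameters, slower training

In real projects, several methods are often combined: for example, Node2Vec for node embeddings followed by link prediction on top of them, or TransE for knowledge graph reasoning alongside a GNN for node classification. On the tooling side:

Graph construction and storage: NetworkX (lightweight), Neo4j (graph database), JanusGraph (distributed)
Graph embedding: OpenKE (bundles TransE and related methods), PyKEEN, stellargraph
GNN frameworks: PyTorch Geometric, Deep Graph Library (DGL)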
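
As a small illustration of the "embed first, then predict links" combination mentioned above, the sketch below scores unconnected node pairs by the cosine similarity of their embeddings. The 2-D vectors are hand-written stand-ins for Node2Vec output, and cosine similarity is just one common scoring choice; both are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidate_links(embeddings, existing_edges, top_k=3):
    """Rank node pairs that are not yet connected, most similar first."""
    nodes = sorted(embeddings)
    scored = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if (u, v) in existing_edges or (v, u) in existing_edges:
                continue
            scored.append((cosine(embeddings[u], embeddings[v]), u, v))
    scored.sort(reverse=True)
    return [(u, v, round(score, 3)) for score, u, v in scored[:top_k]]


# Stand-in embeddings; in practice these would come from model.wv
embeddings = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.0, 1.0],
    "D": [0.2, 0.8],
}
existing_edges = {("A", "C")}

for u, v, score in rank_candidate_links(embeddings, existing_edges):
    print(u, v, score)
```

In a real evaluation you would hide a fraction of the true edges before training the embeddings and check how highly they rank among the candidates.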

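Closing Thoughts

Graph embedding and knowledge graphs are important infrastructure for AI. From DeepWalk to TransE, from random walks to knowledge reasoning, the field has accumulated a rich set of methods and practical experience over the past decade or so.

The hard part of building a knowledge graph is less the algorithms than data quality and knowledge acquisition. Accurately extracting entities and relations from unstructured text remains an active research area. Recent progress in large language models (LLMs) opens new possibilities: LLMs can perform relation extraction, and knowledge graphs can ground LLMs, so the two are becoming complementary.

Where to go next: use LLMs for more precise knowledge extraction, study how GNNs and graph embeddings combine (for example, GNNs for link prediction), or explore knowledge graphs in retrieval-augmented generation (RAG).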
