Ontotext GraphDB

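Ontotext GraphDB is a graph database and knowledge discovery tool compliant with the RDF and SPARQL standards.

This notebook shows how to use LLMs to provide natural language querying (NLQ to SPARQL, also called text2sparql) for Ontotext GraphDB.

GraphDB LLM Functionalities

GraphDB supports several LLM integration functionalities, described below: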

gpt-queries

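Magic predicates for asking an LLM for text, a list, or a table, using data from your knowledge graph (KG).
Query explanation
Result explanation, summarization, rephrasing, translation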

retrieval-graphdb-connector

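Indexes KG entities in a vector database.
Supports any text embedding algorithm and vector database.
Uses the same powerful connector (indexing) language that GraphDB uses for Elastic, Solr, and Lucene.
Automatically synchronizes changes in RDF data to the KG entity index.
Supports nested objects (no UI support in GraphDB version 10.5).
Serializes KG entities to text, for example (using a Wines dataset):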

Franvino:
- is a RedWine.
- made from grape Merlo.
- made from grape Cabernet Franc.
- has sugar dry.
- has year 2012.

talk-to-graph

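A simple chatbot that uses a defined KG entity index.

For this tutorial, we won't use the GraphDB LLM integration; instead, we generate SPARQL from natural language queries (NLQ). We'll use the Star Wars API (SWAPI) ontology and dataset.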

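Setting up

You need a running GraphDB instance. This tutorial shows how to run the database locally using the GraphDB Docker image. It provides a docker compose setup that includes the Star Wars dataset. All necessary files, including this notebook, can be downloaded from the GitHub repository langchain-graphdb-qa-chain-demo.

Install Docker. This tutorial uses Docker version 24.0.7, which bundles Docker Compose. For earlier Docker versions, you may need to install Docker Compose separately.
Clone the GitHub repository langchain-graphdb-qa-chain-demo into a local folder on your machine.
Start GraphDB by running the following script from the same folder: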

docker build --tag graphdb .
docker compose up -d graphdb

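Wait a few seconds for the database to start on http://localhost:7200/. The Star Wars dataset starwars-data.trig is loaded automatically into the langchain repository. You can run queries against the local SPARQL endpoint http://localhost:7200/repositories/langchain. You can also open the GraphDB Workbench at http://localhost:7200/sparql in your preferred web browser and run queries interactively.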
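To check from Python that the repository answers queries, something along the following lines may help. This is a minimal sketch using only the standard library; the helper names `build_select_request` and `run_select` and the triple-count query are illustrative assumptions, not part of the tutorial's repository. It assumes the endpoint follows the standard SPARQL 1.1 Protocol (form-encoded POST with a `query` parameter).

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the tutorial's docker compose setup.
SPARQL_ENDPOINT = "http://localhost:7200/repositories/langchain"

# Hypothetical smoke-test query: count all triples in the repository.
COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def build_select_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL 1.1 Protocol POST request that asks for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def run_select(endpoint: str, query: str, timeout: float = 10.0) -> list[dict]:
    """Send the query and return the result bindings (needs a running GraphDB)."""
    request = build_select_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Once the container is up, calling `run_select(SPARQL_ENDPOINT, COUNT_QUERY)` should return a single binding whose `triples` value is non-zero if the dataset was loaded.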

Set up the working environment

If you use conda, create and activate a new conda environment, for example:

conda create -n graph_ontotext_graphdb_qa python=3.12
conda activate graph_ontotext_graphdb_qa

Install the following libraries:

pip install jupyter==1.1.1
pip install rdflib==7.1.1
pip install langchain-community==0.3.4
pip install langchain-openai==0.2.4

Run Jupyter with:

jupyter notebook

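Specifying the ontology

For the LLM to be able to generate SPARQL, it needs to know the knowledge graph schema (the ontology). You can provide it through one of two parameters on the OntotextGraphDBGraph class:

query_ontology: a CONSTRUCT query that is executed on the SPARQL endpoint and returns the KG schema statements. We recommend storing the ontology in its own named graph, which makes it easier to fetch only the relevant statements (as in the example below). DESCRIBE queries are not supported, because DESCRIBE returns the Symmetric Concise Bounded Description (SCBD), which also includes the incoming class links. For large graphs with a million instances, this is inefficient. See https://github.com/eclipse-rdf4j/rdf4j/issues/4857
local_file: a local RDF ontology file. Supported RDF formats are Turtle, RDF/XML, JSON-LD, N-Triples, Notation-3, Trig, Trix, and N-Quads.

In either case, the ontology dump should:

Include enough information about classes, properties, the attachment of properties to classes (using rdfs:domain, schema:domainIncludes, or OWL restrictions), and taxonomies (important individuals).
Not include overly verbose definitions and examples that are irrelevant to SPARQL construction.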
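One way to act on the second requirement is to strip verbose annotation triples from the dump before handing it to the LLM. The sketch below is a hedged illustration that works on N-Triples text with the standard library only. The choice of `rdfs:comment` and `skos:example` as "verbose" predicates, the helper name `trim_ntriples`, and the two sample triples are assumptions made for this example, not something the tutorial prescribes.

```python
# Predicates assumed (for this sketch only) to carry verbose, human-oriented text.
VERBOSE_PREDICATES = {
    "<http://www.w3.org/2000/01/rdf-schema#comment>",
    "<http://www.w3.org/2004/02/skos/core#example>",
}


def trim_ntriples(lines):
    """Yield N-Triples statements whose predicate is not in VERBOSE_PREDICATES."""
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        # In N-Triples the subject and predicate never contain whitespace,
        # so the second whitespace-separated token is the predicate.
        parts = stripped.split(None, 2)
        # Lines that do not split into three parts are treated as malformed and dropped.
        if len(parts) == 3 and parts[1] not in VERBOSE_PREDICATES:
            yield stripped


# Made-up sample statements in the SWAPI vocabulary namespace.
sample = [
    '<https://swapi.co/vocabulary/Film> <http://www.w3.org/2000/01/rdf-schema#label> "Film" .',
    '<https://swapi.co/vocabulary/Film> <http://www.w3.org/2000/01/rdf-schema#comment> "A long description." .',
]
trimmed = list(trim_ntriples(sample))
print("\n".join(trimmed))
```

Only the `rdfs:label` statement survives here. For formats other than N-Triples, a proper RDF parser such as rdflib would be the safer tool.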

from langchain_community.graphs import OntotextGraphDBGraph

# feeding the schema using a user construct query

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    query_ontology="CONSTRUCT {?s ?p ?o} FROM <https://swapi.co/ontology/> WHERE {?s ?p ?o}",
)


# feeding the schema using a local RDF file

graph = OntotextGraphDBGraph(
    query_endpoint="http://localhost:7200/repositories/langchain",
    local_file="/path/to/langchain_graphdb_tutorial/starwars-ontology.nt",  # change the path here
)

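In either case, the ontology (schema) is fed to the LLM as Turtle, since Turtle with appropriate prefixes is the most compact format and the easiest for the LLM to remember.

The Star Wars ontology is a bit unusual in that it includes many specific triples about classes. For example, the species :Aleena lives on <planet/38>, it is a subclass of :Reptile, it has certain typical characteristics (average height, average lifespan, skin color), and specific individuals (characters) are representatives of that class: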

@prefix : <https://swapi.co/vocabulary/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:Aleena a owl:Class, :Species ;
    rdfs:label "Aleena" ;
    rdfs:isDefinedBy <https://swapi.co/ontology/> ;
    rdfs:subClassOf :Reptile, :Sentient ;
    :averageHeight 80.0 ;
    :averageLifespan "79" ;
    :character <https://swapi.co/resource/aleena/47> ;
    :film <https://swapi.co/resource/film/4> ;
    :language "Aleena" ;
    :planet <https://swapi.co/resource/planet/38> ;
    :skinColor "blue", "gray" .

    ...

To keep this tutorial simple, we use GraphDB without security enabled. If GraphDB is secured, set the environment variables 'GRAPHDB_USERNAME' and 'GRAPHDB_PASSWORD' before initializing OntotextGraphDBGraph.

import os

os.environ["GRAPHDB_USERNAME"] = "graphdb-user"
os.environ["GRAPHDB_PASSWORD"] = "graphdb-password"

graph = OntotextGraphDBGraph(
    query_endpoint=...,
    query_ontology=...
)

Question answering against the StarWars dataset

We can now use the OntotextGraphDBQAChain to ask some questions.

import os

from langchain.chains import OntotextGraphDBQAChain
from langchain_openai import ChatOpenAI

# We'll be using an OpenAI model which requires an OpenAI API Key.
# However, other models are available as well:
# https://python.langchain.com/docs/integrations/chat/

# Set the environment variable `OPENAI_API_KEY` to your OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-***"

# Any available OpenAI model can be used here.
# We use 'gpt-4-1106-preview' because of the bigger context window.
# The 'gpt-4-1106-preview' model_name will deprecate in the future and will change to 'gpt-4-turbo' or similar,
# so be sure to consult with the OpenAI API https://platform.openai.com/docs/models for the correct naming.

chain = OntotextGraphDBQAChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-4-1106-preview"),
    graph=graph,
    verbose=True,
    allow_dangerous_requests=True,
)


Let's ask a simple one.

chain.invoke({chain.input_key: "What is the climate on Tatooine?"})[chain.output_key]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?climate
WHERE {
  ?planet rdfs:label "Tatooine" ;
          :climate ?climate .
}

> Finished chain.

'The climate on Tatooine is arid.'

And a slightly more complicated one.

chain.invoke({chain.input_key: "What is the climate on Luke Skywalker's home planet?"})[
    chain.output_key
]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?climate
WHERE {
  ?character rdfs:label "Luke Skywalker" .
  ?character :homeworld ?planet .
  ?planet :climate ?climate .
}

> Finished chain.

"The climate on Luke Skywalker's home planet is arid."

We can also ask more complicated questions, such as

chain.invoke(
    {
        chain.input_key: "What is the average box office revenue for all the Star Wars movies?"
    }
)[chain.output_key]

> Entering new OntotextGraphDBQAChain chain...
Generated SPARQL:
PREFIX : <https://swapi.co/vocabulary/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT (AVG(?boxOffice) AS ?averageBoxOfficeRevenue)
WHERE {
  ?film a :Film .
  ?film :boxOffice ?boxOfficeValue .
  BIND(xsd:decimal(?boxOfficeValue) AS ?boxOffice)
}

> Finished chain.

'The average box office revenue for all the Star Wars movies is approximately 754.1 million dollars.'
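When several questions need to be asked in a row, the invoke pattern used above can be wrapped in a small helper. The following is a sketch under assumptions: `ask_all` is a hypothetical helper, and `EchoChain` is a stand-in object (with assumed key names) that mimics only the `invoke`, `input_key`, and `output_key` surface used in this tutorial, so the example runs without GraphDB or an LLM.

```python
def ask_all(chain, questions):
    """Return {question: answer} by invoking the chain once per question."""
    answers = {}
    for question in questions:
        result = chain.invoke({chain.input_key: question})
        answers[question] = result[chain.output_key]
    return answers


class EchoChain:
    """Stand-in for a real QA chain; it only echoes the question back."""

    # Key names are assumptions for this stub, not read from the library.
    input_key = "query"
    output_key = "result"

    def invoke(self, inputs):
        return {self.output_key: f"(answer to: {inputs[self.input_key]})"}


answers = ask_all(EchoChain(), ["What is the climate on Tatooine?"])
print(answers)
```

With the real chain built earlier, `ask_all(chain, [...])` would issue one SPARQL generation and one answering call per question.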

Chain modifiers

The Ontotext GraphDB QA chain lets you refine its prompts to further improve your QA chain and the overall user experience of your app.

"SPARQL Generation" prompt

This prompt is used to generate the SPARQL query from the user's question and the KG schema.

sparql_generation_prompt

Default value:

  GRAPHDB_SPARQL_GENERATION_TEMPLATE = """
  Write a SPARQL SELECT query for querying a graph database.
  The ontology schema delimited by triple backticks in Turtle format is:
  ```
  {schema}
  ```
  Use only the classes and properties provided in the schema to construct the SPARQL query.
  Do not use any classes or properties that are not explicitly provided in the SPARQL query.
  Include all necessary prefixes.
  Do not include any explanations or apologies in your responses.
  Do not wrap the query in backticks.
  Do not include any text except the SPARQL query generated.
  The question delimited by triple backticks is:
  ```
  {prompt}
  ```
  """
  GRAPHDB_SPARQL_GENERATION_PROMPT = PromptTemplate(
      input_variables=["schema", "prompt"],
      template=GRAPHDB_SPARQL_GENERATION_TEMPLATE,
  )
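Before swapping in a custom generation prompt, it can be useful to preview the text the LLM would receive. The sketch below renders a hypothetical custom template with plain `str.format`, on the assumption that the template uses simple `{schema}` and `{prompt}` placeholders like the default above. The template wording, the `preview_prompt` helper, and the tiny schema snippet are all illustrative.

```python
# Hypothetical custom template; it keeps the two placeholders of the default.
CUSTOM_SPARQL_GENERATION_TEMPLATE = (
    "Write a SPARQL SELECT query for querying a graph database.\n"
    "Use only the classes and properties provided in the schema.\n"
    "Include all necessary prefixes and return at most 10 rows.\n"
    "Schema (Turtle):\n{schema}\n"
    "Question:\n{prompt}\n"
)


def preview_prompt(template: str, **variables: str) -> str:
    """Fill the placeholders the way a simple format-style template would."""
    # Literal braces in the template itself would need doubling ({{ and }}).
    return template.format(**variables)


rendered = preview_prompt(
    CUSTOM_SPARQL_GENERATION_TEMPLATE,
    schema="@prefix : <https://swapi.co/vocabulary/> .\n:Film a owl:Class .",
    prompt="How many films are there?",
)
print(rendered)
```

A template like this would then presumably be wrapped in a PromptTemplate with the same two input variables and supplied as sparql_generation_prompt when the chain is built.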

"SPARQL Fix" prompt

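Sometimes the LLM may generate a SPARQL query with syntax errors or missing prefixes. The chain tries to repair this by prompting the LLM to correct the query, up to a limited number of times.

sparql_fix_prompt

Default value: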

  GRAPHDB_SPARQL_FIX_TEMPLATE = """
  This following SPARQL query delimited by triple backticks
  ```
  {generated_sparql}
  ```
  is not valid.
  The error delimited by triple backticks is
  ```
  {error_message}
  ```
  Give me a correct version of the SPARQL query.
  Do not change the logic of the query.
  Do not include any explanations or apologies in your responses.
  Do not wrap the query in backticks.
  Do not include any text except the SPARQL query generated.
  The ontology schema delimited by triple backticks in Turtle format is:
  ```
  {schema}
  ```
  """

  GRAPHDB_SPARQL_FIX_PROMPT = PromptTemplate(
      input_variables=["error_message", "generated_sparql", "schema"],
      template=GRAPHDB_SPARQL_FIX_TEMPLATE,
  )

max_fix_retries

Default value: 5
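The fix-and-retry behaviour can be pictured as a bounded loop. The sketch below is not the library's implementation; it is a minimal illustration that assumes `max_fix_retries` counts fix attempts made after the first generation. The callables `generate`, `validate`, and `fix` are hypothetical stand-ins for the LLM calls and the SPARQL validity check.

```python
def generate_valid_query(generate, validate, fix, max_fix_retries=5):
    """Generate a query, then request fixes until it validates or retries run out.

    generate() -> str              produces the first candidate query
    validate(query) -> str | None  returns an error message, or None if valid
    fix(query, error) -> str       produces a corrected candidate
    """
    query = generate()
    error = validate(query)
    retries = 0
    while error is not None and retries < max_fix_retries:
        query = fix(query, error)
        error = validate(query)
        retries += 1
    if error is not None:
        raise ValueError(f"Query still invalid after {retries} fix attempts: {error}")
    return query


# Stub callables standing in for the LLM and a SPARQL parser.
def fake_generate():
    return "SELECT ?climate WHERE { ?planet :climate ?climate }"  # missing PREFIX


def fake_validate(query):
    return None if query.startswith("PREFIX") else "undefined prefix ':'"


def fake_fix(query, error):
    return "PREFIX : <https://swapi.co/vocabulary/>\n" + query


fixed = generate_valid_query(fake_generate, fake_validate, fake_fix)
print(fixed)
```

Bounding the loop keeps a stubbornly invalid query from triggering an unlimited number of LLM calls.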

"Answering" prompt

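This prompt is used to answer the question based on the results returned from the database and the user's original question. By default, the LLM is instructed to use only the information in the returned results. If the result set is empty, the LLM should state that it cannot answer the question.

qa_prompt

Default value: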

  GRAPHDB_QA_TEMPLATE = """Task: Generate a natural language response from the results of a SPARQL query.
  You are an assistant that creates well-written and human understandable answers.
  The information part contains the information provided, which you can use to construct an answer.
  The information provided is authoritative, you must never doubt it or try to use your internal knowledge to correct it.
  Make your response sound like the information is coming from an AI assistant, but don't add any information.
  Don't use internal knowledge to answer the question, just say you don't know if no information is available.
  Information:
  {context}

  Question: {prompt}
  Helpful Answer:"""
  GRAPHDB_QA_PROMPT = PromptTemplate(
      input_variables=["context", "prompt"], template=GRAPHDB_QA_TEMPLATE
  )

Once you have finished experimenting with QA on GraphDB, you can shut down the Docker environment by running:

docker compose down -v --remove-orphans

Run this command from the directory that contains the Docker compose file.
