How to reorder retrieved results to mitigate the "lost in the middle" effect

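In retrieval-augmented generation (RAG) applications, substantial performance degradation has been documented as the number of retrieved documents grows (e.g., beyond ten). In brief: models are liable to miss relevant information in the middle of long contexts.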

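By contrast, queries against vector stores typically return documents in descending order of relevance (e.g., as measured by the cosine similarity of embeddings).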
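
As a rough illustration of relevance-ordered retrieval, the sketch below ranks a few toy vectors by cosine similarity to a query vector. The function names, the two-dimensional vectors, and the document labels are all hypothetical; real embeddings have many more dimensions, and a vector store may use a different similarity measure.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], doc_vectors: Dict[str, Sequence[float]]
) -> List[str]:
    """Return document labels sorted from most to least similar to the query."""
    return sorted(
        doc_vectors,
        key=lambda name: cosine_similarity(query, doc_vectors[name]),
        reverse=True,
    )


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
query_vec = [1.0, 0.0]
doc_vecs = {
    "celtics": [0.9, 0.1],
    "movies": [0.0, 1.0],
    "basketball": [0.5, 0.5],
}
ranking = rank_by_similarity(query_vec, doc_vecs)
print(ranking)  # prints ['celtics', 'basketball', 'movies']
```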

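To mitigate the "lost in the middle" effect, you can reorder documents after retrieval so that the most relevant documents sit at the extremes (e.g., the first and last pieces of context) and the least relevant documents sit in the middle. In some cases, this can help surface the most relevant information to LLMs.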
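
The reordering idea can be sketched in plain Python. This is a minimal illustration, not necessarily how any particular library implements it; the function name `reorder_extremes` is hypothetical, and the input is assumed to be sorted from most to least relevant.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_extremes(ranked: List[T]) -> List[T]:
    """Place the most relevant items at both ends and the least relevant in the middle.

    `ranked` is assumed to be sorted from most to least relevant.
    """
    result: List[T] = []
    # Walk from least to most relevant, alternately growing the list
    # at the front and at the back, so later (more relevant) items
    # end up further from the middle.
    for i, item in enumerate(reversed(ranked)):
        if i % 2 == 1:
            result.append(item)
        else:
            result.insert(0, item)
    return result


# Ranks 1 (most relevant) through 10 (least relevant).
ranks = list(range(1, 11))
reordered = reorder_extremes(ranks)
print(reordered)  # prints [2, 4, 6, 8, 10, 9, 7, 5, 3, 1]
```

With ten items, ranks 1 and 2 end up last and first respectively, while ranks 9 and 10 land in the middle.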

The LongContextReorder document transformer implements this reordering procedure. Below we demonstrate an example.

%pip install -qU langchain langchain-community langchain-openai

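First, we embed some artificial documents and index them in a basic in-memory vector store. We will use OpenAI embeddings, but any LangChain vector store or embeddings model will suffice.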

from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

# Get embeddings.
embeddings = OpenAIEmbeddings()

texts = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

# Create a retriever
retriever = InMemoryVectorStore.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)
query = "What can you tell me about the Celtics?"

# Get relevant documents ordered by relevance score
docs = retriever.invoke(query)
for doc in docs:
    print(f"- {doc.page_content}")


- The Celtics are my favourite team.
- This is a document about the Boston Celtics
- The Boston Celtics won the game by 20 points
- L. Kornet is one of the best Celtics players.
- Basquetball is a great sport.
- Larry Bird was an iconic NBA player.
- This is just a random text.
- I simply love going to the movies
- Fly me to the moon is one of my favourite songs.
- Elden Ring is one of the best games in the last 15 years.

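Note that the documents are returned in descending order of relevance to the query. The LongContextReorder document transformer implements the reordering described above: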

from langchain_community.document_transformers import LongContextReorder

# Reorder the documents:
# less relevant documents will be in the middle of the list and more
# relevant documents at the beginning / end.
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

# Confirm that the 4 relevant documents are at the beginning and end.
for doc in reordered_docs:
    print(f"- {doc.page_content}")


- This is a document about the Boston Celtics
- L. Kornet is one of the best Celtics players.
- Larry Bird was an iconic NBA player.
- I simply love going to the movies
- Elden Ring is one of the best games in the last 15 years.
- Fly me to the moon is one of my favourite songs.
- This is just a random text.
- Basquetball is a great sport.
- The Boston Celtics won the game by 20 points
- The Celtics are my favourite team.

Below, we show how to incorporate the reordered documents into a simple question-answering chain:

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

prompt_template = """
Given these texts:
-----
{context}
-----
Please answer the following question:
{query}
"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "query"],
)

# Create and invoke the chain:
chain = create_stuff_documents_chain(llm, prompt)
response = chain.invoke({"context": reordered_docs, "query": query})
print(response)


The Boston Celtics are a professional basketball team known for their rich history and success in the NBA. L. Kornet is recognized as one of the best players on the team, and the Celtics recently won a game by 20 points. The Celtics are favored by some fans, as indicated by the statement, "The Celtics are my favourite team." Overall, they have a strong following and are considered a significant part of basketball culture.
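
To see roughly what the model receives, the sketch below fills the same prompt template by hand. It assumes the document texts are simply joined with a blank line between them before being substituted for `{context}`; the helper `stuff_prompt` and the shortened three-document list are hypothetical and are not part of LangChain.

```python
from typing import List

PROMPT_TEMPLATE = """
Given these texts:
-----
{context}
-----
Please answer the following question:
{query}
"""


def stuff_prompt(texts: List[str], query: str, separator: str = "\n\n") -> str:
    """Join the document texts and substitute them into the template."""
    context = separator.join(texts)
    return PROMPT_TEMPLATE.format(context=context, query=query)


# A shortened, already reordered list: relevant texts at both ends,
# an irrelevant one in the middle.
reordered_texts = [
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Celtics are my favourite team.",
]
rendered = stuff_prompt(reordered_texts, "What can you tell me about the Celtics?")
print(rendered)
```

Printing the rendered prompt makes it easy to check that the most relevant texts sit next to the delimiters at the start and end of the context, where the model is less likely to overlook them.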