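Nebius Text Embeddings
Nebius AI Studio provides API access to high-quality embedding models through a unified interface. Nebius embedding models convert text into numerical vectors that capture semantic meaning, which makes them useful for applications such as semantic search, clustering, and recommendations.
Overview
The NebiusEmbeddings class provides access to Nebius AI Studio embedding models through LangChain. These embeddings can be used for semantic search, document similarity, and other NLP tasks that require vector representations of text.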
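Integration details
- Provider: Nebius AI Studio
- Model type: text embedding models
- Primary use case: generating vector representations of text for semantic similarity and retrieval
- Available models: various embedding models, including BAAI/bge-en-icl
- Dimensions: varies by model (typically 1024 to 4096)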
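The dimension range above has a direct effect on how much space a stored index needs. The following back-of-the-envelope sketch assumes each value is stored as a 4-byte float and uses a hypothetical collection of one million vectors; neither figure comes from Nebius, and real vector stores add their own overhead.

```python
def embedding_storage_bytes(
    num_vectors: int, dimension: int, bytes_per_value: int = 4
) -> int:
    """Raw size of the vectors alone, ignoring any index overhead."""
    return num_vectors * dimension * bytes_per_value


# Hypothetical collection size, chosen only for illustration.
NUM_VECTORS = 1_000_000

for dim in (1024, 4096):
    size = embedding_storage_bytes(NUM_VECTORS, dim)
    print(f"{dim} dimensions: {size / 1e9:.2f} GB")
```

Under these assumptions, moving from 1024 to 4096 dimensions quadruples the raw storage, which is worth weighing when choosing a model.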
Setup
Installation
The Nebius integration can be installed with pip:
%pip install --upgrade langchain-nebius
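Credentials
Nebius requires an API key, which can be passed via the api_key initialization parameter or set as the NEBIUS_API_KEY environment variable. You can obtain an API key by creating an account on Nebius AI Studio.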
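The two ways of supplying the key can be folded into a single lookup in application code. The sketch below is only an illustration: resolve_api_key is a hypothetical helper, not part of langchain-nebius, and the precedence it applies (explicit argument first, then the environment variable) is an assumption about what an application might want rather than a description of the library's behavior.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, else the NEBIUS_API_KEY variable.

    Hypothetical helper for illustration; not part of langchain-nebius.
    """
    key = explicit_key or os.environ.get("NEBIUS_API_KEY")
    if not key:
        raise RuntimeError(
            "No Nebius API key found: pass one explicitly or set NEBIUS_API_KEY."
        )
    return key
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later by the first embedding request.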
import getpass
import os

# Prompt for the API key if it is not already set as an environment variable
if "NEBIUS_API_KEY" not in os.environ:
    os.environ["NEBIUS_API_KEY"] = getpass.getpass("Enter your Nebius API key: ")
Instantiation
The NebiusEmbeddings class can be instantiated with optional API key and model name parameters:
from langchain_nebius import NebiusEmbeddings

# Initialize the embeddings model
embeddings = NebiusEmbeddings(
    # api_key="YOUR_API_KEY",  # You can pass the API key directly
    model="BAAI/bge-en-icl",  # The default embedding model
)
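Available models
The list of supported models is available at https://studio.nebius.com/?modality=embedding.
Indexing and retrieval
Embedding models are often used in retrieval-augmented generation (RAG) workflows, both for indexing data and for retrieving it later. The example below shows how to use NebiusEmbeddings with a vector store for document retrieval.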
# The FAISS vector store needs the langchain-community and faiss-cpu packages
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Prepare documents
docs = [
    Document(
        page_content="Machine learning algorithms build mathematical models based on sample data"
    ),
    Document(page_content="Deep learning uses neural networks with many layers"),
    Document(page_content="Climate change is a major global environmental challenge"),
    Document(
        page_content="Neural networks are inspired by the human brain's structure"
    ),
]

# Create vector store
vector_store = FAISS.from_documents(docs, embeddings)

# Perform similarity search
query = "How does the brain influence AI?"
results = vector_store.similarity_search(query, k=2)

print("Search results for query:", query)
for i, doc in enumerate(results):
    print(f"Result {i+1}: {doc.page_content}")
Search results for query: How does the brain influence AI?
Result 1: Neural networks are inspired by the human brain's structure
Result 2: Deep learning uses neural networks with many layers
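Conceptually, the search above embeds the query and returns the stored documents whose vectors lie closest to it. The sketch below illustrates that idea with cosine similarity over made-up three-dimensional vectors; the values are not real Nebius embeddings, top_k is a hypothetical helper, and the metric a given vector store actually uses depends on its configuration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (values are made up).
toy_doc_vectors = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.1, 0.9],
    [0.6, 0.6, 0.0],
]
toy_query_vector = [0.5, 0.7, 0.0]

for index, score in top_k(toy_query_vector, toy_doc_vectors, k=2):
    print(f"doc {index}: {score:.4f}")
```

With these toy values the vectors at index 3 and index 1 rank highest, because they point in nearly the same direction as the query vector.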
Using with InMemoryVectorStore
You can also use InMemoryVectorStore for lightweight applications:
from langchain_core.vectorstores import InMemoryVectorStore
# Create a sample text
text = "LangChain is a framework for developing applications powered by language models"
# Create a vector store
vectorstore = InMemoryVectorStore.from_texts(
    [text],
    embedding=embeddings,
)
# Use as a retriever
retriever = vectorstore.as_retriever()
# Retrieve similar documents
docs = retriever.invoke("What is LangChain?")
print(f"Retrieved document: {docs[0].page_content}")
Retrieved document: LangChain is a framework for developing applications powered by language models
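Direct usage
You can use the NebiusEmbeddings class directly to generate embeddings for text, without a vector store.
Embedding a single text
Use the embed_query method to embed a single text: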
query = "What is machine learning?"
query_embedding = embeddings.embed_query(query)
# Check the embedding dimension
print(f"Embedding dimension: {len(query_embedding)}")
print(f"First few values: {query_embedding[:5]}")
Embedding dimension: 4096
First few values: [0.007419586181640625, 0.002246856689453125, 0.00193023681640625, -0.0066070556640625, -0.0179901123046875]
Embedding multiple documents
Use the embed_documents method to embed multiple documents at once:
documents = [
    "Machine learning is a branch of artificial intelligence",
    "Deep learning is a subfield of machine learning",
    "Natural language processing deals with interactions between computers and human language",
]
document_embeddings = embeddings.embed_documents(documents)
# Check the results
print(f"Number of document embeddings: {len(document_embeddings)}")
print(f"Each embedding has {len(document_embeddings[0])} dimensions")
Number of document embeddings: 3
Each embedding has 4096 dimensions
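For larger collections, it can be useful to control how many texts go into each embed_documents call. The sketch below shows one way to do that. The source does not say whether NebiusEmbeddings batches requests internally, so treat this as an optional pattern: batched and embed_in_batches are hypothetical helpers, the batch size is arbitrary, and fake_embed_documents is a stand-in so the example runs without an API key.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 2,
) -> List[List[float]]:
    """Call embed_fn once per batch and concatenate results in input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed_documents(texts: List[str]) -> List[List[float]]:
    """Stand-in for embeddings.embed_documents: one value per text (its length)."""
    return [[float(len(t))] for t in texts]


sample_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(sample_texts, fake_embed_documents, batch_size=2)
print(len(vectors))
```

To use it with the real model, pass embeddings.embed_documents as embed_fn in place of the stand-in.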
Async support
NebiusEmbeddings supports asynchronous operations:
import asyncio


async def generate_embeddings_async():
    # Embed a single query
    query_result = await embeddings.aembed_query("What is the capital of France?")
    print(f"Async query embedding dimension: {len(query_result)}")

    # Embed multiple documents
    docs = [
        "Paris is the capital of France",
        "Berlin is the capital of Germany",
        "Rome is the capital of Italy",
    ]
    docs_result = await embeddings.aembed_documents(docs)
    print(f"Async document embeddings count: {len(docs_result)}")


# Top-level await works in a notebook; in a plain script, call
# asyncio.run(generate_embeddings_async()) instead.
await generate_embeddings_async()
Async query embedding dimension: 4096
Async document embeddings count: 3
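The example above awaits the two calls one after the other. When the requests are independent, they can be started together with asyncio.gather, as in the sketch below. FakeAsyncEmbeddings is a hypothetical stand-in that only mirrors the aembed_query and aembed_documents method names so the code runs without network access; embed_concurrently is likewise an illustrative helper.

```python
import asyncio
from typing import List, Tuple


class FakeAsyncEmbeddings:
    """Stand-in exposing the same async method names as the example above."""

    async def aembed_query(self, text: str) -> List[float]:
        await asyncio.sleep(0)  # yield control, as a network call would
        return [float(len(text))]

    async def aembed_documents(self, texts: List[str]) -> List[List[float]]:
        await asyncio.sleep(0)
        return [[float(len(t))] for t in texts]


async def embed_concurrently(
    model, query: str, docs: List[str]
) -> Tuple[List[float], List[List[float]]]:
    """Run the query and document embedding calls concurrently."""
    query_vec, doc_vecs = await asyncio.gather(
        model.aembed_query(query),
        model.aembed_documents(docs),
    )
    return query_vec, doc_vecs


def run_demo() -> Tuple[List[float], List[List[float]]]:
    # asyncio.run is the script-friendly entry point; in a notebook,
    # await embed_concurrently(...) directly instead.
    return asyncio.run(
        embed_concurrently(FakeAsyncEmbeddings(), "Paris?", ["Paris", "Berlin"])
    )


if __name__ == "__main__":
    print(run_demo())
```

asyncio.gather returns results in the order the awaitables were passed, so the query vector and document vectors can be unpacked positionally.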
Document similarity example
from scipy.spatial.distance import cosine

# Create some documents
documents = [
    "Machine learning algorithms build mathematical models based on sample data",
    "Deep learning uses neural networks with many layers",
    "Climate change is a major global environmental challenge",
    "Neural networks are inspired by the human brain's structure",
]

# Embed the documents
embeddings_list = embeddings.embed_documents(documents)


# Cosine similarity is 1 minus SciPy's cosine distance
def calculate_similarity(embedding1, embedding2):
    return 1 - cosine(embedding1, embedding2)


# Print similarity matrix
print("Document Similarity Matrix:")
for i, emb_i in enumerate(embeddings_list):
    similarities = []
    for j, emb_j in enumerate(embeddings_list):
        similarity = calculate_similarity(emb_i, emb_j)
        similarities.append(f"{similarity:.4f}")
    print(f"Document {i+1}: {similarities}")
Document Similarity Matrix:
Document 1: ['1.0000', '0.8282', '0.5811', '0.7985']
Document 2: ['0.8282', '1.0000', '0.5897', '0.8315']
Document 3: ['0.5811', '0.5897', '1.0000', '0.5918']
Document 4: ['0.7985', '0.8315', '0.5918', '1.0000']
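A matrix like this is easier to interpret once the diagonal, which is always 1.0 because each document is compared with itself, is set aside. The sketch below copies the printed values and picks out the most similar pair of distinct documents; most_similar_pair is a hypothetical helper written for this illustration.

```python
from typing import List, Tuple

# Values copied from the similarity matrix printed above.
similarity_matrix = [
    [1.0000, 0.8282, 0.5811, 0.7985],
    [0.8282, 1.0000, 0.5897, 0.8315],
    [0.5811, 0.5897, 1.0000, 0.5918],
    [0.7985, 0.8315, 0.5918, 1.0000],
]


def most_similar_pair(matrix: List[List[float]]) -> Tuple[int, int, float]:
    """Return (i, j, score) for the highest off-diagonal entry, with i < j."""
    best = None
    n = len(matrix)
    for i in range(n):
        # The matrix is symmetric, so only the upper triangle is scanned.
        for j in range(i + 1, n):
            if best is None or matrix[i][j] > best[2]:
                best = (i, j, matrix[i][j])
    if best is None:
        raise ValueError("need at least two documents")
    return best


i, j, score = most_similar_pair(similarity_matrix)
print(f"Most similar: Document {i + 1} and Document {j + 1} ({score:.4f})")
```

For the values shown, this selects Documents 2 and 4, the two sentences about neural networks, while Document 3 (climate change) scores below 0.6 against every other document.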
API reference
For more details about the Nebius AI Studio API, see the Nebius AI Studio documentation.
Related
- Embedding model conceptual guide
- Embedding model how-to guides