Tutorial: Search your data using a chat model (RAG in Azure AI Search)

Article
2024-12-16

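The defining characteristic of a RAG solution on Azure AI Search is that it sends queries to a large language model (LLM) to provide a conversational search experience over your indexed content. If you implement only the basics, it can be surprisingly easy.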

In this tutorial, you:

Set up clients
Write instructions for the LLM
Provide a query designed for LLM inputs
Review results and explore next steps

This tutorial builds on the previous tutorials. It assumes you have a search index created by the indexing pipeline.

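Prerequisites

Visual Studio Code with the Python extension and the Jupyter package. For more information, see Python in Visual Studio Code.
Azure AI Search, in a region shared with Azure OpenAI.
Azure OpenAI, with a deployment of gpt-4o. For more information, see Choose models for RAG in Azure AI Search.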

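Download the sample

Use the same notebook from the previous indexing pipeline tutorial. The scripts for querying the LLM follow the pipeline creation steps. If you don't already have the notebook, download it from GitHub.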

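Configure clients for sending queries

The RAG pattern in Azure AI Search is a synchronized series of connections: first to a search index to obtain the grounding data, and then to an LLM to formulate a response to the user's question. Both clients use the same query string.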
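The two-step flow described above can be sketched without any Azure dependencies. This is only an illustration of the pattern, not the tutorial's implementation: `answer_with_rag`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the stand-in functions take the place of the search client and the chat model.

```python
from typing import Callable, List


def answer_with_rag(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Run retrieval first, then generation, using the same query string for both."""
    sources = retrieve(query)        # Step 1: get grounding data from the search index
    return generate(query, sources)  # Step 2: ask the LLM to answer from those sources


# Hypothetical stand-ins for the search client and the chat model.
def fake_retrieve(query: str) -> List[str]:
    return [f"chunk about {query}"]


def fake_generate(query: str, sources: List[str]) -> str:
    return f"Answer to '{query}' from {len(sources)} source(s)"


result = answer_with_rag("earth", fake_retrieve, fake_generate)
print(result)
```

In the tutorial's script, the retrieval step corresponds to the search client query and the generation step to the chat completions call.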

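You're setting up two clients, so you need endpoints and permissions on both resources. This tutorial assumes you set up role assignments for authorized connections, but you should provide the endpoints in your sample notebook: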

# Set endpoints and API keys for Azure services
AZURE_SEARCH_SERVICE: str = "PUT YOUR SEARCH SERVICE ENDPOINT HERE"
# AZURE_SEARCH_KEY: str = "DELETE IF USING ROLES, OTHERWISE PUT YOUR SEARCH SERVICE ADMIN KEY HERE"
AZURE_OPENAI_ACCOUNT: str = "PUT YOUR AZURE OPENAI ENDPOINT HERE"
# AZURE_OPENAI_KEY: str = "DELETE IF USING ROLES, OTHERWISE PUT YOUR AZURE OPENAI KEY HERE"
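Before creating the clients, it can help to confirm that the placeholder strings were actually replaced. The following is a minimal sketch that uses only the standard library; `is_endpoint_set` is a hypothetical helper, and the example URL is illustrative rather than a real service.

```python
from urllib.parse import urlparse


def is_endpoint_set(value: str) -> bool:
    """Return True if the value looks like an https URL rather than placeholder text."""
    parsed = urlparse(value.strip())
    return parsed.scheme == "https" and bool(parsed.netloc)


# Illustrative values only: the first is a made-up endpoint, the second is the placeholder.
print(is_endpoint_set("https://example-service.search.windows.net"))
print(is_endpoint_set("PUT YOUR SEARCH SERVICE ENDPOINT HERE"))
```

This check only looks at the shape of the string. It doesn't verify that the service exists or that your role assignments are in place.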

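Example script for prompt and query

Here's the Python script that instantiates the clients, defines the prompt, and sets up the query. You can run this script in the notebook to generate a response from your chat model deployment.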

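# Import libraries
from azure.identity import get_bearer_token_provider
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
from openai import AzureOpenAI

# `credential` and `index_name` are assumed to be defined in earlier notebook cells
# from the indexing pipeline tutorial (for example, credential = DefaultAzureCredential()).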

token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")
openai_client = AzureOpenAI(
    api_version="2024-06-01",
    azure_endpoint=AZURE_OPENAI_ACCOUNT,
    azure_ad_token_provider=token_provider
)

deployment_name = "gpt-4o"

search_client = SearchClient(
    endpoint=AZURE_SEARCH_SERVICE,
    index_name=index_name,
    credential=credential
)

# Provide instructions to the model
GROUNDED_PROMPT="""
You are an AI assistant that helps users learn from the information found in the source material.
Answer the query using only the sources provided below.
Use bullets if the answer has multiple points.
If the answer is longer than 3 sentences, provide a summary.
Answer ONLY with the facts listed in the list of sources below. Cite your source when you answer the question
If there isn't enough information below, say you don't know.
Do not generate answers that don't use the sources below.
Query: {query}
Sources:\n{sources}
"""

# Provide the search query. 
# It's hybrid: a keyword search on "query", with text-to-vector conversion for "vector_query".
# The vector query finds 50 nearest neighbor matches in the search index
query="What's the NASA earth book about?"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

# Set up the search results and the chat thread.
# Retrieve the selected fields from the search index related to the question.
# Search results are limited to the top 5 matches. Limiting top can help you stay under LLM quotas.
search_results = search_client.search(
    search_text=query,
    vector_queries= [vector_query],
    select=["title", "chunk", "locations"],
    top=5,
)

# Newlines could be in the OCR'd content or in PDFs, as is the case for the sample PDFs used for this tutorial.
# Use a unique separator to make the sources distinct. 
# We chose repeated equal signs (=) followed by a newline because it's unlikely the source documents contain this sequence.
sources_formatted = "=================\n".join([f'TITLE: {document["title"]}, CONTENT: {document["chunk"]}, LOCATIONS: {document["locations"]}' for document in search_results])

response = openai_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": GROUNDED_PROMPT.format(query=query, sources=sources_formatted)
        }
    ],
    model=deployment_name
)

print(response.choices[0].message.content)

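Review results

In this response, the answer is based on five inputs (top=5), consisting of the chunks that the search engine determined to be the most relevant. Instructions in the prompt tell the LLM to use only the information in sources, that is, the formatted search results.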
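To see exactly what the LLM receives, you can assemble the prompt offline. The sketch below mirrors the tutorial's separator and field names (title, chunk, locations), but the documents, the shortened prompt template, and the `format_sources` helper are made up for illustration.

```python
SEPARATOR = "=================\n"

# Shortened stand-in for GROUNDED_PROMPT; the tutorial's version has more instructions.
PROMPT_TEMPLATE = "Query: {query}\nSources:\n{sources}"


def format_sources(documents):
    """Join search results into one string, with a separator between documents."""
    return SEPARATOR.join(
        f'TITLE: {doc["title"]}, CONTENT: {doc["chunk"]}, LOCATIONS: {doc["locations"]}'
        for doc in documents
    )


# Made-up documents shaped like the selected fields in the search results.
docs = [
    {"title": "page-1.pdf", "chunk": "First chunk.", "locations": ["Iceland"]},
    {"title": "page-2.pdf", "chunk": "Second chunk.", "locations": []},
]

prompt = PROMPT_TEMPLATE.format(query="example question", sources=format_sources(docs))
print(prompt)
```

Printing the assembled prompt is a quick way to check that the sources are distinct and that nothing outside the search results is being sent as grounding data.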

Results from the first query ("What's the NASA earth book about?") should look similar to the following example.

The NASA Earth book is about the intricate and captivating science of our planet, studied 
through NASA's unique perspective and tools. It presents Earth as a dynamic and complex 
system, observed through various cycles and processes such as the water cycle and ocean 
circulation. The book combines stunning satellite images with detailed scientific insights, 
portraying Earth's beauty and the continuous interaction of land, wind, water, ice, and 
air seen from above. It aims to inspire and demonstrate that the truth of our planet is 
as compelling as any fiction.

Source: page-8.pdf

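Even if the prompt and query don't change, the LLM can return a different answer. Your results might look very different from the example. For more information, see Learn how to use reproducible output.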
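One way to reduce run-to-run variation is to set `temperature` and `seed` on the chat completions request. Both are parameters of `chat.completions.create` in the openai Python package, but determinism is best effort, and whether your API version and deployment honor `seed` is an assumption to verify. The `build_chat_request` helper is hypothetical.

```python
def build_chat_request(prompt: str, deployment: str, seed: int = 42) -> dict:
    """Build keyword arguments for chat.completions.create with less randomness."""
    return {
        "model": deployment,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # Prefer the most likely tokens
        "seed": seed,      # Best-effort reproducibility across identical requests
    }


request = build_chat_request("example prompt", "gpt-4o")
print(request)

# Assumed usage with the client from this tutorial:
# response = openai_client.chat.completions.create(**request)
```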

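Note

In testing this tutorial, we saw a variety of responses, some more relevant than others. A few times, repeating the same request caused the response to deteriorate. The most likely cause is confusion in the chat history, possibly because the model treated the repeated requests as dissatisfaction with the generated answer. Managing chat history is out of scope for this tutorial, but including it in your application code should mitigate or even eliminate this behavior.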
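Chat history management is out of scope here, but a minimal sketch shows the general idea: keep prior turns in a list, cap how many are retained, and send them along with the new question. The `ChatHistory` class is hypothetical and not part of any SDK.

```python
class ChatHistory:
    """Keep the most recent user/assistant turns for a conversation."""

    def __init__(self, max_turns: int = 3):
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.max_turns = max_turns
        self._messages = []

    def add_turn(self, user_content: str, assistant_content: str) -> None:
        """Record one completed exchange and drop the oldest turns beyond the cap."""
        self._messages.append({"role": "user", "content": user_content})
        self._messages.append({"role": "assistant", "content": assistant_content})
        self._messages = self._messages[-2 * self.max_turns:]

    def build_messages(self, new_user_content: str) -> list:
        """Return retained history followed by the new user message."""
        return self._messages + [{"role": "user", "content": new_user_content}]


history = ChatHistory(max_turns=1)
history.add_turn("q1", "a1")
history.add_turn("q2", "a2")
messages = history.build_messages("q3")
print(messages)
```

The resulting list has the shape that the messages argument of a chat completions request expects. How many turns to keep is a design choice that depends on your token budget.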

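Add a filter

Recall that you used applied AI to create a locations field, populated with places recognized by the Entity Recognition skill. The field definition for locations includes the filterable attribute. Let's repeat the previous request, but this time add a filter that selects on the term ice in the locations field.

A filter introduces inclusion or exclusion criteria. The search engine still runs a vector search on "What's the NASA earth book about?", but it now excludes matches that don't include ice. For more information about filtering on string collections and on vector queries, see Text filter fundamentals, Understand collection filters, and Add filters to a vector query.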
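Filter expressions are plain OData strings, so you can build them with small helpers. The sketch below produces the same `search.ismatch` expression used in this tutorial, plus an exact-match collection filter as an alternative. The helper names are hypothetical, and "Iceland" is only an illustrative value.

```python
def odata_quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def ismatch_filter(term: str, field: str, query_type: str = "full", search_mode: str = "any") -> str:
    """Build a search.ismatch filter for a full-text match within a field."""
    args = ", ".join(odata_quote(v) for v in (term, field, query_type, search_mode))
    return f"search.ismatch({args})"


def collection_equals_filter(field: str, value: str) -> str:
    """Build a filter that matches when any item in a string collection equals the value."""
    return f"{field}/any(item: item eq {odata_quote(value)})"


print(ismatch_filter("ice*", "locations"))
print(collection_equals_filter("locations", "Iceland"))
```

The first form matches terms that start with ice, such as Iceland or icefield. The second form requires an exact, case-sensitive match on a whole collection item.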

Replace the search_results definition with the following example, which includes a filter:

query="what is the NASA earth book about?"
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=50, fields="text_vector")

# Add a filter that selects documents based on whether locations includes the term "ice".
search_results = search_client.search(
    search_text=query,
    vector_queries= [vector_query],
    filter="search.ismatch('ice*', 'locations', 'full', 'any')",
    select=["title", "chunk", "locations"],
    top=5,
)

sources_formatted = "=================\n".join([f'TITLE: {document["title"]}, CONTENT: {document["chunk"]}, LOCATIONS: {document["locations"]}' for document in search_results])

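# Send the filtered sources to the chat model, as in the first request.
response = openai_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": GROUNDED_PROMPT.format(query=query, sources=sources_formatted)
        }
    ],
    model=deployment_name
)

print(response.choices[0].message.content)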

Results from the filtered query should now look similar to the following response. Notice the emphasis on ice cover.

The NASA Earth book showcases various geographic and environmental features of Earth through 
satellite imagery, highlighting remarkable landscapes and natural phenomena. 

- It features extraordinary views like the Holuhraun Lava Field in Iceland, captured by 
Landsat 8 during an eruption in 2014, with false-color images illustrating different elements 
such as ice, steam, sulfur dioxide, and fresh lava ([source](page-43.pdf)).
- Other examples include the North Patagonian Icefield in South America, depicted through 
clear satellite images showing glaciers and their changes over time ([source](page-147.pdf)).
- It documents melt ponds in the Arctic, exploring their effects on ice melting and 
heat absorption ([source](page-153.pdf)).

Overall, the book uses satellite imagery to give insights into Earth's dynamic systems 
and natural changes.

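Change the inputs

Increasing or decreasing the number of inputs to the LLM can have a large effect on the response. Try running the same query again after setting top=8. When you increase the number of inputs, the model returns different results each time, even if the query doesn't change.

Here's an example of what the model returns after the number of inputs is increased to 8.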

The NASA Earth book features a range of satellite images capturing various natural phenomena 
across the globe. These include:

- The Holuhraun Lava Field in Iceland documented by Landsat 8 during a 2014 volcanic 
eruption (Source: page-43.pdf).
- The North Patagonian Icefield in South America, highlighting glacial landscapes 
captured in a rare cloud-free view in 2017 (Source: page-147.pdf).
- The impact of melt ponds on ice sheets and sea ice in the Arctic, with images from 
an airborne research campaign in Alaska during July 2014 (Source: page-153.pdf).
- Sea ice formations at Shikotan, Japan, and other notable geographic features in various 
locations recorded by different Landsat missions (Source: page-168.pdf).

Summary: The book showcases satellite images of diverse Earth phenomena, such as volcanic 
eruptions, icefields, and sea ice, to provide insights into natural processes and landscapes.

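Because the model is bound to the grounding data, the answer becomes more expansive as the size of the input increases. You can use relevance tuning to potentially generate more focused answers.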
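To get a rough sense of how much more text the model receives as top grows, you can compare input sizes offline. This sketch counts characters, not tokens, and uses made-up chunks of equal length; `input_size_by_top` is a hypothetical helper.

```python
def input_size_by_top(chunks, top_values):
    """Return total characters sent to the model for each candidate top value."""
    return {top: sum(len(chunk) for chunk in chunks[:top]) for top in top_values}


# Ten made-up chunks of 100 characters each, standing in for ranked search results.
chunks = ["x" * 100 for _ in range(10)]
sizes = input_size_by_top(chunks, [5, 8])
print(sizes)
```

Real chunks vary in length, and token counts depend on the model's tokenizer, so treat the numbers as a relative comparison only.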

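Change the prompt

You can also change the prompt to control the format of the output, the tone, and whether you want the model to supplement the answer with its own training data.

Here's another example of LLM output when the prompt is refocused on identifying locations for scientific study.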

# Provide instructions to the model
GROUNDED_PROMPT="""
You are an AI assistant that helps scientists identify locations for future study.
Answer the query concisely, using bulleted points.
Answer ONLY with the facts listed in the list of sources below.
If there isn't enough information below, say you don't know.
Do not generate answers that don't use the sources below.
Do not exceed 5 bullets.
Query: {query}
Sources:\n{sources}
"""

If you change only the prompt and keep every other aspect of the previous query, the output might look like the following example.

The NASA Earth book appears to showcase various locations on Earth captured through satellite imagery, 
highlighting natural phenomena and geographic features. For instance, the book includes:

- The Holuhraun Lava Field in Iceland, detailing volcanic activity and its observation via Landsat 8.
- The North Patagonian Icefield in South America, covering its glaciers and changes over time as seen by Landsat 8.
- Melt ponds in the Arctic and their impacts on the heat balance and ice melting.
- Iceberg A-56 in the South Atlantic Ocean and its interaction with cloud formations.

(Source: page-43.pdf, page-147.pdf, page-153.pdf, page-39.pdf)

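Tip

If you're continuing with the tutorial, remember to restore the prompt to its previous value (You are an AI assistant that helps users learn from the information found in the source material).

Parameters and prompts affect the response from the LLM. As you explore on your own, keep the following tips in mind:

Raising the top value can exhaust the available quota on the model. If there's no quota, an error message is returned, or the model might return "I don't know".
Raising the top value doesn't necessarily improve the outcome. In tests that used the maximum value, we sometimes found that the answers weren't noticeably better.
So what might help? Typically, the answer is relevance tuning. Improving the relevance of the search results from Azure AI Search is usually the most effective way to maximize the utility of your LLM.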
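If you run into quota errors while experimenting, one common mitigation is to retry with exponential backoff. The sketch below is generic and uses only the standard library. `call_with_backoff` and `FakeRateLimitError` are hypothetical; with the openai Python package, the exception to pass would typically be `openai.RateLimitError`, which is an assumption about your installed version.

```python
import time


def call_with_backoff(call, retry_on, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call a function, retrying on the given exception with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the original error
            sleep(base_delay * (2 ** attempt))


class FakeRateLimitError(Exception):
    """Stand-in for a quota or rate limit error, for illustration only."""


attempts = {"count": 0}


def flaky_call():
    """Fail twice, then succeed, to simulate a temporarily exhausted quota."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError()
    return "ok"


delays = []  # Record the delays instead of actually sleeping
result = call_with_backoff(flaky_call, FakeRateLimitError, sleep=delays.append)
print(result, delays)
```

Retrying helps with short-lived throttling. It doesn't help if the prompt itself exceeds the deployment's limits, in which case lowering top is the more direct fix.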

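In the next series of tutorials, the focus shifts to maximizing relevance and optimizing query performance for speed and concision. We revisit the schema definition and query logic to implement relevance features, but the rest of the pipeline and the models remain intact.

Next step

Maximize relevance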
