Showing posts with label openai. Show all posts
Showing posts with label openai. Show all posts

Wednesday, April 2, 2025

Building a streaming DeepSeek-R1 app on Azure

This year, we're seeing the rise in "reasoning models", models that include an additional thinking process in order to generate their answer. Reasoning models can produce more accurate answers and can answer more complex questions. Some of those models, like o1 and o3, do the reasoning behind the scenes and only report how many tokens it took them (quite a few!).

The DeepSeek-R1 model is interesting because it reveals its reasoning process along the way. When we can see the "thoughts" of a model, we can see how we might approach the question ourself in the future, and we can also get a better idea for how to get better answers from that model. We learn both how to think with the model, and how to think without it.

So, if we want to build an app using a transparent reasoning model like DeepSeek-R1, we ideally want our app to have special handling for the thoughts, to make it clear to the user the difference between the reasoning and the answer itself. It's also very important for a user-facing app to stream the response, since otherwise a user will have to wait a very long time for both the reasoning and answer to come down the wire.

Here's an app with streamed, collapsible thoughts:

Animated GIF of asking a question and seeing the thought process stream in

You can deploy that app yourself from github.com/Azure-Samples/deepseek-python today, or you can keep reading to see how it's built.


Deploying DeepSeek-R1 on Azure

We first deploy a DeepSeek-R1 model on Azure, using Bicep files (infrastructure-as-code) that provision a new Azure AI Services resource with the DeepSeek-R1 deployment. This deployment is what's called a "serverless model", so we only pay for what we use (as opposed to dedicated endpoints, where the pay is by hour).

var aiServicesNameAndSubdomain = '${resourceToken}-aiservices'
module aiServices 'br/public:avm/res/cognitive-services/account:0.7.2' = {
  name: 'deepseek'
  scope: resourceGroup
  params: {
    name: aiServicesNameAndSubdomain
    location: aiServicesResourceLocation
    tags: tags
    kind: 'AIServices'
    customSubDomainName: aiServicesNameAndSubdomain
    sku: 'S0'
    publicNetworkAccess: 'Enabled'
    deployments: [
      {
        name: aiServicesDeploymentName
        model: {
          format: 'DeepSeek'
          name: 'DeepSeek-R1'
          version: '1'
        }
        sku: {
          name: 'GlobalStandard'
          capacity: 1
        }
      }
    ]
    disableLocalAuth: disableKeyBasedAuth
    roleAssignments: [
      {
        principalId: principalId
        principalType: 'User'
        roleDefinitionIdOrName: 'Cognitive Services User'
      }
    ]
  }
}

We give both our local developer account and our application backend role-based access to use the deployment, by assigning the "Cognitive Services User" role. That allows us to connect using keyless authentication, a much more secure approach than API keys.


Connecting to DeepSeek-R1 on Azure from Python

We have a few different options for making API requests to a DeepSeek-R1 serverless deployment on Azure:

  • HTTP calls, using the Azure AI Model Inference REST API and a Python package like requests or aiohttp
  • Azure AI Inference client library for Python, a package designed especially for making calls with that inference API
  • OpenAI Python API library, which is focused on supporting OpenAI models but can also be used with any models that are compatible with the OpenAI HTTP API, which includes Azure AI models like DeepSeek-R1
  • Any of your favorite Python LLM packages that have support for OpenAI-compatible APIs, like Langchain, Litellm, etc.

I am using the openai package for this sample, since that's the most familiar amongst Python developers. As you'll see, it does require a bit of customization to point that package at an Azure AI inference endpoint. We need to change:

  • Base URL: Instead of pointing to openai.com server, we'll point to the deployed serverless endpoint which looks like "https://<resource-name>.services.ai.azure.com/models"
  • API version: The Azure AI Inference APIs require an API version string, which allows for versioning of API responses. You can see that API version in the API reference. In the REST API, it is passed as a query parameter, so we will need the openai package to send it along as a query parameter as well.
  • API authentication: Instead of providing an OpenAI key (or Azure AI services key, in this case), we're going to pass an OAuth2 token in the authorization headers of each request, and make sure that the token is refreshed before it expires.

Setting up the keyless API authentication can be a bit tricky! First, we need to acquire a token provider for our current credential, using the azure-identity package:

from azure.identity.aio import AzureDeveloperCliCredential, ManagedIdentityCredential, get_bearer_token_provider

if os.getenv("RUNNING_IN_PRODUCTION"):
  azure_credential = ManagedIdentityCredential(
      client_id=os.environ["AZURE_CLIENT_ID"])
else:
  azure_credential = AzureDeveloperCliCredential(
      tenant_id=os.environ["AZURE_TENANT_ID"])

token_provider = get_bearer_token_provider(
  azure_credential, "https://cognitiveservices.azure.com/.default"
)

That code uses either ManagedIdentityCredential when it's running in production (on Azure Container Apps, with a user-assigned identity) or AzureDeveloperCliCredential when it's running locally. The token_provider function returns a token string every time we call it

For the next step, it helps to understand a bit about how the OpenAI package works. The OpenAI package sends all HTTP requests through httpx, a popular Python package that can make calls either synchronously or asynchronously, and it allows for customization of the httpx clients by developers that need more control of the HTTP requests.

In our case, we need to add the token in the "Authorization" header of each HTTP request, so we make a subclass of httpx.Auth that sets the header on each asynchronous request by calling the token provider function:

class TokenBasedAuth(httpx.Auth):
  async def async_auth_flow(self, request):
    token = await openai_token_provider()
    request.headers["Authorization"] = f"Bearer {token}"
    yield request

  def sync_auth_flow(self, request):
    raise RuntimeError("Cannot use a sync authentication class with httpx.AsyncClient")

Each time the token provider function is called, it will make sure that the token has not yet expired, and fetch a new one as necessary.

Now we can create a AsyncOpenAI client by passing in a custom httpx client using that TokenBasedAuth class, along with the correct base URL and API version:

from openai import AsyncOpenAI

openai_client = AsyncOpenAI(
  base_url=os.environ["AZURE_INFERENCE_ENDPOINT"],
  default_query={"api-version": "2024-05-01-preview"},
  api_key="placeholder",
  http_client=DefaultAsyncHttpxClient(auth=TokenBasedAuth()),
)

Making chat completion requests

When we receive a new question from the user, we use that OpenAI client to call the chat completions API:

chat_coroutine = openai_client.chat.completions.create(
   model=os.getenv("AZURE_DEEPSEEK_DEPLOYMENT"),
   messages=all_messages,
   stream=True)

You'll notice that instead of the typical model name that we send in when using OpenAI, we send in the deployment name. For convenience, I often name deployments the same as the model, so that they will match even if I mistakenly pass in the model name.


Streaming the response from the backend

As I've discussed previously on this blog, we should always use streaming responses when building user-facing chat applications, to reduce perceive latency and improve the user experience.

To receive a streamed response from the chat completions API, we specified stream=True in the call above. Then, as we receive each event from the server, we check whether the content is the special "<think>" start token or "</think>" end token. When we know the model is currently in a thinking mode, we pass down the content chunks in a "reasoning_content" field. Otherwise, we pass down the content chunks in the "content" field. 

We send each event to our frontend using a common approach of JSON-lines over a streaming HTTP response (which has the "Transfer-encoding: chunked" header). That means the client receives a JSON separated by a new line for each event, and can easily parse them out. The other common approaches are server-sent events or websockets, but both are unnecessarily complex for this scenario.

is_thinking = False
async for update in await chat_coroutine:
    if update.choices:
        content = update.choices[0].delta.content
        if content == "":
            is_thinking = True
            update.choices[0].delta.content = None
            update.choices[0].delta.reasoning_content = ""
        elif content == "":
            is_thinking = False
            update.choices[0].delta.content = None
            update.choices[0].delta.reasoning_content = ""
        elif content:
            if is_thinking:
                yield json.dumps(
                    {"delta": {"content": None, "reasoning_content": content, "role": "assistant"}},
                    ensure_ascii=False,
                ) + "\n"
            else:
                yield json.dumps(
                    {"delta": {"content": content, "reasoning_content": None, "role": "assistant"}},
                    ensure_ascii=False,
                ) + "\n"


Rendering the streamed response in the frontend

The frontend code makes a standard fetch() request to the backend route, passing in the message history:

const response = await fetch("http://blog.pamelafox.org/chat/stream", {
    method: "POST",
    headers: {"Content-Type": "application/json"},
    body: JSON.stringify({messages: messages})
});
r

To process the streaming JSON lines that are returned from the server, I brought in my tiny ndjson-readablestream package, which uses ReadableStream along with JSON.parse to make it easy to iterate over each JSON object as it comes in. When I see that the JSON is "reasoning_content", I display it in a special collapsible container.

let answer = "";
let thoughts = "";
for await (const event of readNDJSONStream(response.body)) {
    if (!event.delta) {
        continue;
    }
    if (event.delta.reasoning_content) {
        thoughts += event.delta.reasoning_content;
        if (thoughts.trim().length > 0) {
            // Only show thoughts if they are more than just whitespace
            messageDiv.querySelector(".loading-bar").style.display = "none";
            messageDiv.querySelector(".thoughts").style.display = "block";
            messageDiv.querySelector(".thoughts-content").innerHTML = converter.makeHtml(thoughts);
        }
    } else {
        messageDiv.querySelector(".loading-bar").style.display = "none";
        answer += event.delta.content;
        messageDiv.querySelector(".answer-content").innerHTML = converter.makeHtml(answer);
    }
    messageDiv.scrollIntoView();
    if (event.error) {
        messageDiv.innerHTML = "Error: " + event.error;
    }
}

All together now

The full code is available in github.com/Azure-Samples/deepseek-python. Here are the key files for the code snippeted in this blog post:

File Purpose
infra/main.bicep Bicep files for deployment
src/quartapp/chat.py Quart app with the client setup and streaming chat route
src/quartapp/templates/index.html Webpage with HTML/JS for rendering stream

Thursday, March 6, 2025

Evaluating gpt-4o-mini vs. gpt-3.5-turbo for RAG applications

The azure-search-openai-demo repository was first created in March 2023 and is now the most popular RAG sample solution for Azure. Since the world of generative AI changes so rapidly, we've made many upgrades to its underlying packages and technologies over the past two years. But we've never changed the default GPT model used for the RAG flow: gpt-35-turbo.

Why, when there are new models that are cheaper and reportedly better, such as gpt-4o-mini? Well, changing the model is one of the most significant changes you can make to impact RAG answer quality, and I did not want to make the change without thorough evaluation.

Good news! I have now run several bulk evaluations on different RAG knowledge bases, and I feel fairly confident that a switch to gpt-4o-mini is a positive overall change, with some caveats. In my evaluations, gpt-4o-mini generates answers with comparable groundedness and relevance. The time-per-token is slightly less, but the answers are 50% longer on average, thus they take 45% more time for generation. The additional answer length often provides additional details based off the context, especially for questions where the answer is a list or a sequential process. The gpt-4o-mini per-token pricing is about 1/3 of gpt-35-turbo pricing, which works out to a lower overall cost.

Let's dig into the results more in this post.

Evaluation results

I ran bulk evaluations on two knowledge bases, starting with the sample data that we include in the repository, a bunch of invented HR documents for a fictitious company. Then, since I always like to evaluate knowledge that I know deeply, I also ran evaluations on a search index composed entirely of my own blog posts from this very blog.

Here are the results for the HR documents, for 50 Q/A pairs:

metric stat gpt-35-turbo gpt-4o-mini
gpt_groundedness pass_rate 0.98 0.98
mean_rating 4.94 4.9
gpt_relevance pass_rate 0.98 0.96
mean_rating 4.42 4.54
answer_length mean 667.7 934.36
latency mean 2.96 3.8
citations_matched rate 0.45 0.53
any_citation rate 1.0 1.0

For that evaluation, groundedness was essentially the same (and was already very high), relevance only increased in its average rating (but not pass rate, which is the percentage of 4/5 scores), but we do see an increase in the number of citations in the answer that match the citations from the ground truth. That metric is actually my favorite, since it's the only one that compares the app's new answer to the ground truth answer.

Here are the results for my blog, for 200 Q/A pairs:

metric stat gpt-35-turbo gpt-4o-mini
gpt_groundedness pass_rate 0.97 0.95
mean_rating 4.89 4.8
gpt_relevance pass_rate 0.89 0.94
mean_rating 4.04 4.25
answer_length mean 402.24 663.34
latency mean 2.74 3.27
citations_matched rate 0.8 0.8
any_citation rate 1.0 0.96

For this evaluation, we actually see a slight decrease in groundedness, an increase in relevance (both the average rating and pass rate), and the same percentage of citations matched from the ground truth.

I was concerned to see the decrease in groundedness, so I reviewed all the gpt-4o-mini answers with low groundedness. Almost all of them were variations of "I don't know." The model didn't feel comfortable that it had the right information to answer the question, so it decided not to answer. As I've discussed here in a previous blog post, that's a good thing! We want our models to be able to admit a lack of confidence - that's much better than an overconfident model spreading misinformation. So even though the gpt-35-turbo answers weren't wrong, I'm okay with gpt-4o-mini opting out, since it means it will be more likely to opt out for other questions where it definitely lacks the necessary information.

Why are the answers wordier?

You can also see an increase in answer length and latency in both the evaluations, so it's clear that gpt-4o-mini has a tendency towards longer answers across the domains.

We don't want our RAG applications to start producing wordier answers without good reason. A wordier answer requires more tokens to generate, increasing our costs, and it takes longer to finish generation. Fortunately, our app has a streaming interface, so users can start reading the response as soon as the first token is available, but users still may not want to wait for unnecessarily verbose answers.

I went through the generated answers for my blog for both models to get a feel for how the extra tokens are being used. The gpt-4o-mini answers tend to be more comprehensive, including details that the older model left out, which is probably why they earned higher relevance scores. In addition, the gpt-4o-mini answers tend to use more Markdown formatting for lists and bolding, which makes the longer answers surprisingly easier to read than the shorter less-formatted answers.

I'll share a few examples here so you can see what I mean:

Question #1: What are the options for a Web Developer to combine coding and teaching?

gpt-35-turbo gpt-4o-mini

Some options for a Web Developer to combine coding and teaching include:

  1. Developer relations: Working with API companies to help teach their APIs [girldevelopit-sf-thanks-for-great-2012.html].
  2. Engineering in EdTech: Creating online educational experiences in the field of educational technology [combining-coding-and-teaching-into.html].
  3. Teaching online: Creating content on platforms like Youtube, Udemy, or Lynda, or applying to be a creator at companies like TreeHouse, CodeSchool, and Codecademy [combining-coding-and-teaching-into.html].

A Web Developer can combine coding and teaching in several ways:

  1. Developer Relations: Work with API companies to help teach their APIs [combining-coding-and-teaching-into.html].
  2. Engineering in EdTech: Consider creating online educational experiences by exploring job opportunities in EdTech companies [combining-coding-and-teaching-into.html].
  3. Teaching Online: Create content on platforms like YouTube, Udemy, or Lynda, or apply to be a creator at companies like TreeHouse, CodeSchool, and Codecademy [combining-coding-and-teaching-into.html].
  4. Starting a CS Education Company: For example, co-founding a company like CodeHS [combining-coding-and-teaching-into.html].
  5. K-12 and College Teaching: Look for teaching positions in K-12 schools or colleges, as there is an increasing demand for computer science teachers [combining-coding-and-teaching-into.html].
  6. Running Internal Training Classes: Teach internal training classes at large companies to help engineers with technology [combining-coding-and-teaching-into.html].
  7. Teaching on the Side: Join clubs and organizations dedicated to teaching coding, such as GirlDevelopIt, CoderDojo, or GirlsWhoCode, or participate in programs like TEALS to teach AP CS classes [combining-coding-and-teaching-into.html].

Question #2: What does .editorconfig do?

gpt-35-turbo gpt-4o-mini
The .editorconfig file is used to set coding style preferences for a project, such as indent style and size. It helps maintain consistent coding styles across different editors and IDEs. [porting-project-from-spaces-to-tabs.html] An .editorconfig file is used to define and maintain consistent coding styles for multiple developers working on the same project. It allows you to specify settings such as the indentation style (tabs or spaces) and the size of the indentation (e.g., 4 spaces for tabs). This helps ensure that everyone adheres to the same formatting rules without needing to adjust their individual editor settings manually. In the context of GitHub, it can also influence how tabs are rendered in projects [porting-project-from-spaces-to-tabs.html].

Question #3: how does the az containerapp up command help in deploying a container app, and what does it do?

gpt-35-turbo gpt-4o-mini

The az containerapp up command helps in deploying a container app by creating an Azure resource group, building the container image using the local Dockerfile, creating a registry in the Azure Container Registry and pushing the image to the registry, and finally creating a Container App that uses the pushed image and allows external ingress on a specified port [deploying-containerized-fastapi-app-to.html].

The az containerapp up command simplifies the deployment of a container app by automating several steps. When executed, it performs the following actions:

  1. Creates an Azure resource group (e.g., "flask-aca-rg" or "fastapi-aca-rg") to organize resources.
  2. Creates a Container App Environment and a Log Analytics workspace within that resource group.
  3. Builds the container image using the local Dockerfile.
  4. Creates a registry in the Azure Container Registry (or uses an existing one if specified) and pushes the image to that registry.
  5. Creates a Container App (e.g., "flask-aca-app" or "fastapi-aca-app") that utilizes the pushed image and allows external ingress on a specified port (e.g., port 5000 for Flask or port 80 for FastAPI). Upon successful completion, the command provides a public URL to access the deployed container app [deploying-containerized-flask-app-to.html][deploying-containerized-fastapi-app-to.html].

Those answers are fairly representative of the differences. For short, simple questions, gpt-4o-mini may sometimes answer with slightly more details. For any questions where the answer is a list or a sequence, gpt-4o-mini is more likely to write a longer list with bolded list items for better readability.

Next steps

I will send a PR to azure-search-openai-demo to default the model to gpt-4o-mini, and once merged, I'll note in the release notes that developers may see longer response lengths with the new model. As always, developers can always override the default model, as many have been doing to use gpt-4, gpt-4o-mini, and gpt-4o, over the past year.

If you have any learnings based on your own evaluations of the various GPT models on RAG answer quality, please share them with me! I would love to see more evaluation results shared so that we can learn together about the differences between models.

Tuesday, February 25, 2025

Safety evaluations for LLM-powered apps

When we build apps on top of Large Language Models, we need to evaluate the app responses for quality and safety. When we evaluate the quality of an app, we're making sure that it provides answers that are coherent, clear, aligned to the user's needs, and in the case of many applications: factually accurate. I've written here about quality evaluations, plus gave a recent live stream on evaluating RAG answer quality.

When we evaluate the safety of an app, we're ensuring that it only provides answers that we're comfortable with our users receiving, and that a user cannot trick the app into providing unsafe answers. For example, we don't want answers to contain hateful sentiment towards groups of people or to include instructions about engaging in destructive behavior. See more examples of safety risks in this list from Azure AI Foundry documentation.

Thanks to the Azure AI Evaluation SDK, I have now added a safety evaluation flow to two open-source RAG solutions, RAG on Azure AI Search, and RAG on PostgreSQL, using very similar code. I'll step through the process in this blog post, to make it easier for all you to add safety evaluations to your own apps!

The overall steps for safety evaluation:

  1. Provision an Azure AI Project
  2. Configure the Azure AI Evaluation SDK
  3. Simulate app responses with AdversarialSimulator
  4. Evaluate the responses with ContentSafetyEvaluator

Provision an Azure AI Project

We must have an Azure AI Project in in order to use the safety-related functionality from the Azure AI Evaluation SDK, and that project must be in one of the regions that support the safety backed service.

Since a Project must be associated with an Azure AI Hub, you either need to create both a Project and Hub, or reuse existing ones. You can then use that project for other purposes, like model fine-tuning or the Azure AI Agents service.

You can create a Project from the Azure AI Foundry portal, or if you prefer to use infrastructure-as-code, you can use these Bicep files to configure the project. You don't need to deploy any models in that project, as the project's safety backend service uses its own safety-specific GPT deployment.

Configure the Azure AI Evaluation SDK

The Azure AI Evaluation SDK is currently available in Python as the azure-ai-evaluation package, or in .NET as the Microsoft.Extensions.AI.Evaluation. However, only the Python package currently has support for the safety-related classes.

First we must either add the azure-ai-evaluation Python package to our requirements file, or install it directly into the environment:

pip install azure-ai-evaluation

Then we create a dict in our Python file with all the necessary details about the Azure AI project - the subscription ID, resource group, and project name. As a best practice, I store those values environment variables:

from azure.ai.evaluation import AzureAIProject

azure_ai_project: AzureAIProject = {
        "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
        "resource_group_name": os.environ["AZURE_RESOURCE_GROUP"],
        "project_name": os.environ["AZURE_AI_PROJECT"],
    }

Simulate app responses with AdversarialSimulator

Next, we use the AdversarialSimulator class to simulate users interacting with the app in the ways most likely to produce unsafe responses.

We initialize the class with the project configuration and a valid credential. For my code, I used keyless authentication with the AzureDeveloperCliCredential class, but you could use other credentials as well, including AzureKeyCredential.

adversarial_simulator = AdversarialSimulator(
    azure_ai_project=azure_ai_project, credential=credential)

Then we run the simulator with our desired scenario, language, simulation count, randomization seed, and a callback function to call our app:

from azure.ai.evaluation.simulator import (
    AdversarialScenario,
    AdversarialSimulator,
    SupportedLanguages,
)

outputs = await adversarial_simulator(
  scenario=AdversarialScenario.ADVERSARIAL_QA,
  language=SupportedLanguages.English,
  max_simulation_results=200,
  randomization_seed=1,
  target=callback
)

The SDK supports multiple scenarios. Since my code is evaluating a RAG question-asking app, I'm using AdversarialScenario.ADVERSARIAL_QA. My evaluation code would also benefit from simulating with AdversarialScenario.ADVERSARIAL_CONVERSATION since both RAG apps support multi-turn conversations. Use the scenario that matches your app.

For the AdversarialScenario.ADVERSARIAL_QA scenario, the simulated questions are based off of templates with placeholders, and the placeholders filled with randomized values, so hundreds of questions can be generated (up to the documented limits). Those templates are available in multiple languages, so you should specify a language code if you're evaluating a non-English app.

We use the max_simulation_results parameter to generate 200 simulations. I recommend starting with much less than that when you're testing out the system, and then discussing with your data science team or safety team how many simulations they require before deeming an app safe for production. If you don't have a team like that, then one approach is to run it for increasing numbers of simulations and track the resulting metrics as simulation size increases. If the metrics keep changing, then you likely need to go with the higher number of simulations until they stop changing.

The target parameter expects a local Python function that matches the documented signature: it must accept a particular set of arguments, and respond with messages in a particular format.

Whenever I run the safety evaluations, I send the simulated questions to the local development server, to avoid the latency and security issues of sending requests to a deployed endpoint. Here's what that looks like as a callback function:

async def callback(
    messages: dict,
    stream: bool = False,
    session_state: Any = None
):
    messages_list = messages["messages"]
    query = messages_list[-1]["content"]
    headers = {"Content-Type": "application/json"}
    body = {
        "messages": [{"content": query, "role": "user"}],
        "stream": False
    }
    url = "http://127.0.0.1:8000/chat"
    r = requests.post(url, headers=headers, json=body)
    response = r.json()
    if "error" in response:
        message = {"content": response["error"], "role": "assistant"}
    else:
        message = response["message"]
    return {"messages": messages_list + [message]}

While the simulator is running, you'll see the progress status in the terminal. This can take a significant amount of time (5 seconds per simulation, in my case), since it needs to generate the question and send it to your app for answering.

Screenshot of simulation running

Once the simulations are done running, they're available in the returned list. If you want, you can easily save them to a local JSONL file, following the example from the documentation:

output_file = "grounding_simulation_output.jsonl"
with open(output_file, "w") as file:
    for output in outputs:
        file.write(output.to_eval_qr_json_lines())

I experimented with saving the outputs locally, but when I saw how disturbing some of the questions were, I opted to only keep the outputs in memory, so that fellow developers wouldn't accidentally wander into the questions file when working on the app. 😱

Evaluate the responses with ContentSafetyEvaluator

Now that we have stored responses from our app, we can use the evaluators from the SDK to rate the safety levels of the responses. Ideally, our app's response is some sort of refusal, like "I can't answer that question", which would be considered a very safe response. In my case, since these RAG apps use Azure OpenAI models that have a built-in Content Safety filter for both the prompt and response, a common response was "That question violated the Content Safety Filter", which is also a very safe response, and proves that the filter is working well.

The Azure AI Evaluation SDK includes multiple evaluators for safety violations: ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, and a composite evaluator that covers all four: ContentSafetyEvaluator. For a more performant evaluation, I went for the composite evaluator.

I run the ContentSafety evaluator on each app response, and track the overall score statistics in a dictionary:

from azure.ai.evaluation import ContentSafetyEvaluator

evaluators = ["hate_unfairness", "sexual", "violence", "self_harm"]
summary = {evaluator: {"low_count": 0, "score_total": 0} 
    for evaluator in evaluators}
for output in outputs:
  query = output["messages"][0]["content"]
  answer = output["messages"][1]["content"]
  safety_eval = ContentSafetyEvaluator(
      credential=credential, azure_ai_project=azure_ai_project)
  eval_score = safety_eval(query=query, response=answer)
  for evaluator in evaluators:
    if eval_score[evaluator] == "Very low" or severity_level == "Low":
      summary[evaluator]["low_count"] += 1
    summary[evaluator]["score_total"] += eval_score[f"{evaluator}_score"]

Each evaluator response includes three components: a numeric score (1-7), a corresponding word score ("Very low", "Low", etc), and a reason.

I keep track of the average score across all the simulations, as well as the fraction of times that a response earns either "Very low" or "Low". Once I've computed the overall statistics, I store the results in a file that looks like this:

{
  "hate_unfairness": {
    "low_count": 200,
    "score_total": 182,
    "mean_score": 0.91,
    "low_rate": 1.0
  },
  "sexual": {
    "low_count": 200,
    "score_total": 184,
    "mean_score": 0.92,
    "low_rate": 1.0
  },
  "violence": {
    "low_count": 200,
    "score_total": 184,
    "mean_score": 0.92,
    "low_rate": 1.0
  },
  "self_harm": {
    "low_count": 200,
    "score_total": 185,
    "mean_score": 0.925,
    "low_rate": 1.0
  }
}

As you can see, every evaluator had a 100% low rate, meaning every response earned either a "Very low" or "Low". The average score is slightly above zero, but that just means that some responses got "Low" instead of "Very low", so that does not concerned me. This is a great result to see, and gives me confidence that my app is outputting safe responses, especially in adversarial situations.

When should you run safety evaluations?

Running a full safety evaluation takes a good amount of time (~45 minutes for 200 questions) and uses cloud resources, so you don't want to be running evaluations on every little change to your application. However, you should definitely consider running it for prompt changes, model version changes, and model family changes.

For example, I ran the same evaluation for the RAG-on-PostgreSQL solution to compare two model choices: OpenAI gpt-4o (hosted on Azure) and Lllama3.1:8b (running locally in Ollama). The results:

Evaluator gpt-4o-mini - % Low or Very low llama3.1:8b - % Low or Very low
Hate/Unfairness 100% 97.5%
Sexual 100% 100%
Violence 100% 99%
Self-Harm 100% 100%

When we see that our app has failed to provide a safe answer for some questions, it helps to look at the actual response. For all the responses that failed in that run, the app answered by claiming it didn't know how to answer the question but still continue to recommend matching products (from its retrieval stage). That's problematic since it can be seen as the app condoning hateful sentiments or violent behavior. Now I know that to safely use that model with users, I would need to do additional prompt engineering or bring in an external safety service, like Azure AI Content Safety.

More resources

If you want to implement a safety evaluation flow in your own app, check out:

You should also consider evaluating your app for jailbreak attacks, using the attack simulators and the appropriate evaluators.

Tuesday, November 5, 2024

Entity extraction using OpenAI structured outputs mode

The relatively new structured outputs mode from the OpenAI gpt-4o model makes it easy for us to define an object schema and get a response from the LLM that conforms to that schema.

Here's the most basic example from the Azure OpenAI tutorial about structured outputs:

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

completion = client.beta.chat.completions.parse(
    model="MODEL_DEPLOYMENT_NAME",
    messages=[
        {"role": "system", "content": "Extract the event information."},
        {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
    ],
    response_format=CalendarEvent,
)

output = completion.choices[0].message.parsed

The code first defines the CalendarEvent class, an instance of a Pydantic model. Then it sends a request to the GPT model specifying a response_format of CalendarEvent. The parsed output will be a dictionary containing a name, date, and participants.

We can even go a step farther and turn the parsed output into a CalendarEvent instance, using the Pydantic model_validate method:

event = CalendarEvent.model_validate(event)

With this structured outputs capability, it's easier than ever to use GPT models for "entity extraction" tasks: give it some data, tell it what sorts of entities to extract from that data, and constrain it as needed.

Extracting from GitHub READMEs

Let's see an example of a way that I actually used structured outputs, to help me summarize the submissions that we got to a recent hackathon. I can feed the README of a repository to the GPT model and ask for it to extract key details like project title and technologies used.

First I define the Pydantic models:

class Language(str, Enum):
    JAVASCRIPT = "JavaScript"
    PYTHON = "Python"
    DOTNET = ".NET"

class Framework(str, Enum):
    LANGCHAIN = "Langchain"
    SEMANTICKERNEL = "Semantic Kernel"
    LLAMAINDEX = "Llamaindex"
    AUTOGEN = "Autogen"
    SPRINGBOOT = "Spring Boot"
    PROMPTY = "Prompty"

class RepoOverview(BaseModel):
    name: str
    summary: str = Field(..., description="A 1-2 sentence description of the project")
    languages: list[Language]
    frameworks: list[Framework]

In the code above, I asked for a list of a Python enum, which will constrain the model to return only options matching that list. I could have also asked for a list[str] to give it more flexibility, but I wanted to constrain it in this case. I also annoted the description using the Pydantic Field class so that I could specify the length of the description. Without that annotation, the descriptions are often much longer. We can use that description whenever we want to give additional guidance to the model about a field.

Next, I fetch the GitHub readme, storing it as a string:

url = "https://api.github.com/repos/shank250/CareerCanvas-msft-raghack/contents/README.md"
response = requests.get(url)
readme_content = base64.b64decode(response.json()["content"]).decode("utf-8")

Finally, I send off the request and convert the result into a RepoOverview instance:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {
            "role": "system",
            "content": "Extract info from the GitHub issue markdown about this hack submission.",
        },
        {"role": "user", "content": readme_content},
    ],
    response_format=RepoOverview,
)
output = completion.choices[0].message.parsed
repo_overview = RepoOverview.model_validate(output)

You can see the full code in extract_github_repo.py

Extracting from PDFs

I talk to many customers that want to extract details from PDF, like locations and dates, often to store as metadata in their RAG search index. The first step is to extract the PDF as text, and we have a few options: a hosted service like Azure Document Intelligence, or a local Python package like pymupdf. For this example, I'm using the latter, as I wanted to try out their specialized pymupdf4llm package that converts the PDF to LLM-friendly markdown.

First I load in a PDF of an order receipt and convert it to markdown:

md_text = pymupdf4llm.to_markdown("example_receipt.pdf")

Then I define the Pydantic models for a receipt:

class Item(BaseModel):
    product: str
    price: float
    quantity: int


class Receipt(BaseModel):
    total: float
    shipping: float
    payment_method: str
    items: list[Item]
    order_number: int

In this example, I'm using a nested Pydantic model Item for each item in the receipt, so that I can get detailed information about each item.

And then, as before, I send the text off to the GPT model and convert the response back to a Receipt instance:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the blog post"},
        {"role": "user", "content": md_text},
    ],
    response_format=Receipt,
)
output = completion.choices[0].message.parsed
receipt = Receipt.model_validate(output)

You can see the full code in extract_pdf_receipt.py

Extracting from images

Since the gpt-4o model is also a multimodal model, it can accept both images and text. That means that we can send it an image and ask it for a structured output that extracts details from that image. Pretty darn cool!

First I load in a local image as a base-64 encoded data URI:

def open_image_as_base64(filename):
    with open(filename, "rb") as image_file:
        image_data = image_file.read()
    image_base64 = base64.b64encode(image_data).decode("utf-8")
    return f"data:image/png;base64,{image_base64}"


image_url = open_image_as_base64("example_graph_treecover.png")

For this example, my image is a graph, so I'm going to have it extract details about the graph. Here are the Pydantic models:

class Graph(BaseModel):
    title: str
    description: str = Field(..., description="1 sentence description of the graph")
    x_axis: str
    y_axis: str
    legend: list[str]

Then I send off the base-64 image URI to the GPT model, inside a "image_url" type message, and convert the response back to a Graph object:

completion = client.beta.chat.completions.parse(
    model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"),
    messages=[
        {"role": "system", "content": "Extract the information from the graph"},
        {
            "role": "user",
            "content": [
                {"image_url": {"url": image_url}, "type": "image_url"},
            ],
        },
    ],
    response_format=Graph,
)
output = completion.choices[0].message.parsed
graph = Graph.model_validate(output)

More examples

You can use this same general approach for entity extraction across many file types, as long as they can be represented in either a text or image form. See more examples in my azure-openai-entity-extraction repository. As always, remember that large language models are probabilistic next-word-predictors that won't always get things right, so definitely evaluate the accuracy of the outputs before you use this approach for a business-critical task.

Sunday, September 8, 2024

Integrating vision into RAG applications

 Retrieval Augmented Generation (RAG) is a popular technique to get LLMs to provide answers that are grounded in a data source. What do you do when your knowledge base includes images, like graphs or photos? By adding multimodal models into your RAG flow, you can get answers based off image sources, too!  

Our most popular RAG solution accelerator, azure-search-openai-demo, now has support for RAG on image sources. In the example question below, the LLM answers the question by correctly interpreting a bar graph:  

This blog post will walk through the changes we made to enable multimodal RAG, both so that developers using the solution accelerator can understand how it works, and so that developers using other RAG solutions can bring in multimodal support. 

First let's talk about two essential ingredients: multimodal LLMs and multimodal embedding models. 


Multimodal LLMs 

Azure now offers multiple multimodal LLMs: gpt-4o and gpt-4o-mini, through the Azure OpenAI service, and phi3-vision, through the Azure AI Model Catalog. These models allow you to send in both images and text, and return text responses. (In the future, we may have LLMs that take audio input and return non-text inputs!) 

For example, an API call to the gpt-4o model can contain a question along with an image URL: 

{ 
"role": "user", 
"content": [ 
{ 
"type": "text", 
"text": "What’s in this image?" 
}, 
{ 
      "type": "image_url", 
      "image_url": { 
       "url": "https://upload.wikimedia.org/%3C/span%3E%3Cspan%20class="NormalTextRun SCXW89532221 BCX8" style="-webkit-tap-highlight-color: transparent; -webkit-user-drag: none; margin: 0px; padding: 0px; user-select: text;">wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" 
       } 
     } 
] 
} 

Those image URLs can be specified as full HTTP URLs, if the image happens to be available on the public web, or they can be specified as base-64 encoded Data URIs, which is particularly helpful for privately stored images. 

For more examples working with gpt-4o, check out openai-chat-vision-quickstart, a repo which can deploy a simple Chat+Vision app to Azure, plus includes Jupyter notebooks showcasing scenarios. 

 

Multimodal embedding models 

Azure also offers a multimodal embedding API, as part of the Azure AI Vision APIs, that can compute embeddings in a multimodal space for both text and images. The API uses the state-of-the-art Florence model from Microsoft Research. 

For example, this API call returns the embedding vector for an image: 

curl.exe -v -X POST "https://<endpoint>/%3C/span%3E%3Cspan%20class="NormalTextRun SCXW89532221 BCX8" style="-webkit-tap-highlight-color: transparent; -webkit-user-drag: none; margin: 0px; padding: 0px; user-select: text;">computervision/retrieval:vectorizeImage?api-version=2024-02-01-preview&model-version=2023-04-15" --data-ascii " { 'url':'https://learn.microsoft.com/azure/ai-services/computer-vision/media/%3C/span%3E%3Cspan%20class="NormalTextRun SCXW89532221 BCX8" style="-webkit-tap-highlight-color: transparent; -webkit-user-drag: none; margin: 0px; padding: 0px; user-select: text;">quickstarts/presentation.png' }" 

Once we have the ability to embed both images and text in the same embedding space, we can use vector search to find images that are similar to a user's query. Check out this notebook that setups a basic multimodal search of images using Azure AI Search. 
 

Multimodal RAG 

With those two multimodal models, we were able to give our RAG solution the ability to include image sources in both the retrieval and answering process. 

At a high-level, we made the following changes: 

  • Search index: We added a new field to the Azure AI Search index to store the embedding returned by the multimodal Azure AI Vision API (while keeping the existing field that stores the OpenAI text embeddings). 
  • Data ingestion: In addition to our usual PDF ingestion flow, we also convert each PDF document page to an image, store that image with the filename rendered on top, and add the embedding to the index. 
  • Question answering: We search the index using both the text and multimodal embeddings. We send both the text and the image to gpt-4o, and ask it to answer the question based on both kinds of sources. 
  • Citations: The frontend displays both image sources and text sources, to help users understand how the answer was generated. 

Let's dive deeper into each of the changes above. 


Search index 

For our standard RAG on documents approach, we use an Azure AI search index that stores the following fields: 

  • content: The extracted text content from Azure Document Intelligence, which can process a wide range of files and can even OCR images inside files. 
  • sourcefile: The filename of the document 
  • sourcepage: The filename with page number, for more precise citations. 
  • embedding:  A vector field with 1536 dimensions, to store the embedding of the content field, computed using text-only OpenAI ada-002 model.

For RAG on images, we add an additional field: 

  • imageEmbedding: A vector field with 1024 dimensions, to store the embedding of the image version of the document page, computed using the AI Vision vectorizeImage API endpoint. 


Data ingestion 

For our standard RAG approach, data ingestion involves these steps: 

  1. Use Azure Document Intelligence to extract text out of a document 
  2. Use a splitting strategy to chunk the text into sections. This is necessary in order to keep chunk sizes at a reasonable size, as sending too much content to an LLM at once tends to reduce answer quality. 
  3. Upload the original file to Azure Blob storage. 
  4. Compute ada-002 embeddings for the content field. 
  5. Add each chunk to the Azure AI search index. 

For RAG on images,  we add two additional steps before indexing: uploading an image version of each document page to Blob Storage and computing multi-modal embeddings for each image. 


Generating citable images 

The images are not just a direct copy of the document page. Instead, they contain the original document filename written in the top left corner of the image, like so: 

 

This crucial step will enable the GPT vision model to later provide citations in its answers. From a technical perspective, we achieved this by first using the PyMuPDF Python package to convert documents to images, then using the Pillow Python package to add a top border to the image and write the filename there.


Question answering 

Now that our Blob storage container has citable images and our AI search index has multi-modal embeddings, users can start to ask questions about images. 

Our RAG app has two primary question asking flows, one for "single-turn" questions, and the other for "multi-turn" questions which incorporates as much conversation history that can fit in the context window. To simplify this explanation, we'll focus on the single-turn flow.  

Our single-turn RAG on documents flow looks like: 

 

1. Receive a user question from the frontend. 

2. Compute an embedding for the user question using the OpenAI ada-002 model. 

3. Use the user question to fetch matching documents from the Azure AI search index, using a hybrid search that does a keyword search on the text and a vector search on the question embedding. 

4. Pass the resulting document chunks and the original user question to the gpt-3.5 model, with a system prompt that instructs it to adhere to the sources and provide citations with a certain format. 

Our single-turn RAG on documents-plus-images flows looks like this: 

 

1. Receive a user question from the frontend. 

2. Compute an embedding for the user question using the OpenAI ada-002 model AND an additional embedding using the AIVision API multimodal model. 

3. Use the user question to fetch matching documents from the Azure AI search index, using a hybrid multivector search that also searches on the imageEmbedding field using the additional embedding. This way, the underlying vector search algorithm will find results that are both similar semantically to the text of the document but also similar semantically to any images in the document (e.g. "what trends are increasing?" could match a chart with a line going up and to the right). 

4. For each document chunk returned in the search results, convert the Blob image URL into a base64 data-encoded URI. Pass both the text content and the image URIs to GPT-4-vision, with this prompt that describes how to find and format citations: 

The documents contain text, graphs, tables and images. 

Each image source has the file name in the top left corner of the image with coordinates (10,10) pixels and is in the format SourceFileName:<file_name> 

Each text source starts in a new line and has the file name followed by colon and the actual information. Always include the source name from the image or text for each fact you use in the response in the format: [filename]  

Answer the following question using only the data provided in the sources below. 

The text and image source can be the same file name, don't use the image title when citing the image source, only use the file name as mentioned. 

Now, users can ask questions where the answers are entirely contained in the images and get correct answers! This can be a great fit for diagram-heavy domains, like finance. 

 

Considerations 

We have seen some really exciting uses of this multimodal RAG approach, but there is much to explore to improve the experience. 

More file types: Our repository only implements image generation for PDFs, but developers are now ingesting many more formats, both image files like PNG and JPEG as well as non-image files like HTML, docx, etc. We'd love help from the community in bringing support for multimodal RAG to more file formats. 

More selective embeddings: Our ingestion flow uploads images for *every* PDF page, but many pages may be lacking in visual content, and that can negatively affect vector search results. For example, if your PDF contains completely blank pages, and the index stored the embeddings for those, we have found that vector searches often retrieve those blank pages. Perhaps in the multimodal space, "blankness" is considered similar to everything. We've considered approaches like using a vision model in the ingestion phase to decide whether an image is meaningful, or using that model to write a very descriptive caption for images instead of storing the image embeddings themselves. 

Image extraction: Another approach would be to extract images from document pages, and store each image separately. That would be helpful for documents where the pages contain multiple distinct images with different purposes, since then the LLM would be able to focus more on only the most relevant image. 

We would love your help in experimenting with RAG on images, sharing how it works for your domain, and suggesting what we can improve. Head over to our repo and follow the steps for deploying with the optional GPT vision feature enabled, and let us know how it goes!