feat: Add support for image function tools #654

vaydingul · 2025-05-06T16:36:54Z

This PR adds support for image function tools to the OpenAI Agents Python SDK.

It is inspired by #341.

The current function_tool implementation only allows the output to be a string, which is a problem when we want to pass an image as input in the request data. This PR tackles the problem by emitting a standard function_call_output item followed back-to-back by an additional image-related input item.

What's included

Added ImageFunctionTool class and image_function_tool decorator
Implemented necessary support in the run implementation, models, and item handling
Added example showing usage of image function tools (examples/tools/image_function_tool.py)

Usage

Use the @image_function_tool decorator to create tools that work with images:

import base64

@image_function_tool
def image_to_base64(path: str) -> str:
    """
    Read the image at `path` and return it as a base64-encoded data URL
    so that it can be sent to the LLM.
    """
    with open(path, "rb") as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
    # The MIME type is hard-coded, so this assumes the file is a JPEG.
    return f"data:image/jpeg;base64,{encoded_string}"

The tool can then be used to allow agents to process and analyze images.

Supporting example script is located at examples/tools/image_function_tool.py
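
The usage example above hard-codes `image/jpeg` in the data URL. As a minimal sketch of a MIME-aware variant, the helper below guesses the type from the file extension using only the standard library. The name `image_to_data_url` is hypothetical and not part of this PR, and it is written as a plain function so it runs without the SDK; it could presumably be wrapped with the decorator in the same way as `image_to_base64`.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Return the image at `path` as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {path!r}")

    # Read the raw bytes and encode them as ASCII-safe base64 text.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

Guessing from the extension is cheap but not robust against mislabeled files; inspecting the file header would be a stricter alternative.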

This commit introduces the ImageFunctionTool and ImageFunctionToolResult classes, enabling the creation and execution of image-generating tools. The necessary modifications include updates to the tool execution logic, new data classes for handling image function calls, and adjustments to the response processing to accommodate image outputs. Additionally, the input handling in the Runner class has been refined to support the new image function items.

Changes include:

- New classes: ImageFunctionTool, ImageFunctionToolResult, ToolRunImageFunction
- Updated tool execution methods to handle image functions
- Modifications to the ProcessedResponse class to include image function results
- Enhancements to ItemHelpers for image function output formatting
- Adjustments in the Runner class for input item processing

These changes enhance the SDK's capabilities for handling image generation tasks alongside existing function tools.

diwu-sf · 2025-05-15T19:36:52Z

Hi,
We have a very similar use case where we have PDF file analysis function calls that want to return:

file name
base64 file content
type = input_file
additional str information about the file

Can you expand this PR to also support the concept of a @file_function_tool?

stevemadere · 2025-05-18T01:14:41Z

I appreciate very much that you've done this work.
Perhaps I don't understand, but it kind of looks like all an image tool can return is one image. Is that right?
How about a tool that returns complex json output, some properties of which are images?
I have some ideas on how to implement that and am considering it.
LMK, if this method actually supports that already.

nileshtrivedi · 2025-05-18T04:05:33Z

@stevemadere At that point, instead of a tool, we just have an agent participating in the conversation by posting a message made of multiple parts.

diwu-sf · 2025-05-18T15:58:41Z

I thought agent responses have to be all strings? How does the agent response return file content types without a PR like this one?

nileshtrivedi · 2025-05-19T16:48:08Z

@diwu-sf OpenAI is now promoting the Responses API instead of the ChatCompletion API. This new spec allows agents or models to return output as multiple parts of different types.

diwu-sf · 2025-05-19T16:59:33Z

@nileshtrivedi nope, function call responses from the tool itself must still be strings:
https://platform.openai.com/docs/guides/function-calling?api-mode=responses#formatting-results

That's why this PR is generating a user message to embed the function call's image output:

    @classmethod
    def image_function_tool_call_output_item(
        cls, tool_call: ResponseFunctionToolCall, output: str
    ) -> list[TResponseInputItem]:
        """Creates the input items for an image tool call: the function call
        output, followed by a user message that carries the image."""
        return [
            {
                # The string result that the function call itself must return.
                "call_id": tool_call.call_id,
                "output": "Image generating tool is called.",
                "type": "function_call_output",
            },
            {
                # The image (URL or data URL) is passed as user input.
                "role": "user",
                "content": [
                    {
                        "type": "input_image",
                        "image_url": output,
                    }
                ],
            },
        ]

Something similar can be done for arbitrary PDF / file uploads
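
As a rough sketch of what "something similar" could look like for files, the function below builds the same two-item pattern with plain dicts: a string `function_call_output` followed by a user message carrying the file. The function name and its parameters are hypothetical and not part of this PR. The `input_file` part with `filename` and `file_data` fields reflects my understanding of the Responses API file input format and should be checked against the current API reference before relying on it.

```python
import base64
from typing import Any


def file_function_tool_call_output_items(
    call_id: str,
    filename: str,
    file_bytes: bytes,
    note: str = "",
    mime_type: str = "application/pdf",
) -> list[dict[str, Any]]:
    """Build a function_call_output plus a user message carrying a file.

    Hypothetical helper: mirrors the image pattern above for file content.
    """
    encoded = base64.b64encode(file_bytes).decode("utf-8")
    return [
        {
            # The function call output itself stays a plain string.
            "call_id": call_id,
            "output": note or f"File {filename} is attached in the next message.",
            "type": "function_call_output",
        },
        {
            # Assumed shape of a file input part (verify against the API docs).
            "role": "user",
            "content": [
                {
                    "type": "input_file",
                    "filename": filename,
                    "file_data": f"data:{mime_type};base64,{encoded}",
                }
            ],
        },
    ]
```

The `note` argument is where the additional string information about the file would go, so the model sees it as the tool's direct result.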

stevemadere · 2025-05-19T22:27:48Z

@nileshtrivedi:

Consider the following situation (which I suspect is about to become super common):

An MCP server that conducts web browsing operations such as navigateToUrl and takeAction (e.g. via stagehand).

Now, such an action would need to return all of these:

requested information from the DOM
the current location (url)
a screenshot of the browser's screen (typically a .png) after navigating or taking the desired action.

It could store the screenshot at a publicly accessible location on the web (e.g. in an S3 bucket served via CloudFront) so that the screenshot could be returned as an HTTP URL easily enough. That would be just right for providing in an input_image message hoisted from the MCP function call results.

The LLM at OpenAI can examine the screenshot and decide which action to take next and make a tool call to take that action.
It will need to see the new screenshot as well as the new location and perhaps some information from the DOM via stagehand's observe method.

Did I do a better job of describing the multi-modal results and why an MCP tool call would need to propagate them all simultaneously to the calling model?
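
To make the scenario concrete, here is a minimal sketch of how such a multi-part result might be split: the text fields go back as the JSON string of the `function_call_output`, and the screenshot URL is hoisted into a separate `input_image` user message, following the same item shapes as the snippet quoted earlier in this thread. The function name, the `screenshot_url` key, and the example field names are all hypothetical; neither this PR nor stagehand defines them.

```python
import json
from typing import Any


def browser_result_to_input_items(
    call_id: str,
    result: dict[str, Any],
    image_key: str = "screenshot_url",
) -> list[dict[str, Any]]:
    """Split a tool result into a string output and an optional image message."""
    # Everything except the image stays in the string function call output.
    text_fields = {key: value for key, value in result.items() if key != image_key}
    items: list[dict[str, Any]] = [
        {
            "call_id": call_id,
            "output": json.dumps(text_fields),
            "type": "function_call_output",
        }
    ]

    # Hoist the screenshot, if present, into a user message with an image part.
    image_url = result.get(image_key)
    if image_url:
        items.append(
            {
                "role": "user",
                "content": [{"type": "input_image", "image_url": image_url}],
            }
        )
    return items


if __name__ == "__main__":
    example = {
        "url": "https://example.com/checkout",
        "dom_info": "Button 'Pay now' is visible",
        "screenshot_url": "https://example.com/screenshots/step-3.png",
    }
    for item in browser_result_to_input_items("call_123", example):
        print(item)
```

This handles a single top-level image property; a result with several images, or images nested deeper in the JSON, would need a recursive walk that collects every image into the user message.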

zaddy6 · 2025-06-04T10:41:46Z

This is quite important

github-actions · 2025-06-15T02:12:39Z

This PR is stale because it has been open for 10 days with no activity.

github-actions · 2025-06-23T02:12:41Z

This PR was closed because it has been inactive for 7 days since being marked as stale.

chiehmin-wei mentioned this pull request May 29, 2025

Processing image/ multi modal responses in function tool results? #787

Open

github-actions bot added the stale label Jun 15, 2025

github-actions bot closed this Jun 23, 2025
