Skip to main content

How to call tools with multi-modal data

Prerequisites

This guide assumes familiarity with the following concepts:

Here we demonstrate how to call tools with multi-modal data, such as images.

Some multi-modal models, such as those that can reason over images or audio, support tool calling features as well.

To call tools using such models, simply bind tools to them in the usual way, and invoke the model using content blocks of the desired type (e.g., containing image data).

Below, we demonstrate examples using OpenAI and Anthropic. We will use the same image and tool in all cases. Let’s first select an image, and build a placeholder tool that expects as input the string “sunny”, “cloudy”, or “rainy”. We will ask the models to describe the weather in the image.

import { DynamicStructuredTool } from "@langchain/core/tools";
import { z } from "zod";

const imageUrl =
"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg";

const weatherTool = new DynamicStructuredTool({
name: "multiply",
description: "Describe the weather",
schema: z.object({
weather: z.enum(["sunny", "cloudy", "rainy"]),
}),
func: async ({ weather }) => {
console.log(weather);
return weather;
},
});

OpenAI

For OpenAI, we can feed the image URL directly in a content block of type “image_url”:

import { HumanMessage } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
model: "gpt-4o",
}).bindTools([weatherTool]);

const message = new HumanMessage({
content: [
{
type: "text",
text: "describe the weather in this image",
},
{
type: "image_url",
image_url: {
url: imageUrl,
},
},
],
});

const response = await model.invoke([message]);

console.log(response.tool_calls);
[
{
name: "multiply",
args: { weather: "sunny" },
id: "call_MbIAYS9ESBG1EWNM2sMlinjR"
}
]

Note that we recover tool calls with parsed arguments in LangChain’s standard format in the model response.

Anthropic

For Anthropic, we can format a base64-encoded image into a content block of type “image”, as below:

import * as fs from "node:fs/promises";

import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "@langchain/core/messages";

const imageData = await fs.readFile("../../data/sunny_day.jpeg");

const model = new ChatAnthropic({
model: "claude-3-sonnet-20240229",
}).bindTools([weatherTool]);

const message = new HumanMessage({
content: [
{
type: "text",
text: "describe the weather in this image",
},
{
type: "image_url",
image_url: {
url: `data:image/jpeg;base64,${imageData.toString("base64")}`,
},
},
],
});

const response = await model.invoke([message]);

console.log(response.tool_calls);
[
{
name: "multiply",
args: { weather: "sunny" },
id: "toolu_01KnRZWQkgWYSzL2x28crXFm"
}
]

Was this page helpful?


You can leave detailed feedback on GitHub.