Module @agentica/benchmark

@agentica/benchmark

(agentica conceptual diagram)

Benchmark program of Agentica.

Agentica is the simplest Agentic AI library, specialized in LLM Function Calling, and @agentica/benchmark is the benchmark tool for that library. It provides two quantitative benchmark tools, AgenticaSelectBenchmark and AgenticaCallBenchmark, which measure the function selecting and function calling qualities respectively.

Here is an example report generated by @agentica/benchmark, measuring the function calling quality of a "Shopping Mall" scenario. The benchmarked scenario below is exactly the same as the one in the recorded video, and you can see that every function call succeeded without any error.

https://github.com/user-attachments/assets/01604b53-aca4-41cb-91aa-3faf63549ea6

Benchmark of Shopping Mall Scenario

npm install @agentica/core @agentica/benchmark @samchon/openapi typia
npx typia setup

Install @agentica/benchmark along with its dependent libraries.

Note that you have to install not only the @agentica/core and @agentica/benchmark libraries, but also @samchon/openapi and typia.

@samchon/openapi is an OpenAPI specification library which can convert a Swagger/OpenAPI document into an LLM function calling schema, and typia is a transformer (compiler) library which can compose an LLM function calling schema from a TypeScript class type.

Since typia is a transformer library that analyzes TypeScript source code at the compilation level, it requires the additional setup command npx typia setup.
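For illustration, here is a minimal sketch of how typia composes an LLM function calling schema from a plain TypeScript class type. The CalculatorService class, its add() method, and the logged output are hypothetical placeholders invented for this sketch; the typia.llm.application<Class, Model>() call is the part that relies on the npx typia setup transformer, since the schema is composed at compile time.

import typia, { tags } from "typia";
import { ILlmApplication } from "@samchon/openapi";

// Hypothetical service class: its method signatures and JSDoc comments
// become the LLM function calling schema at compile time.
class CalculatorService {
  /**
   * Add two integers and return the sum.
   *
   * @param props Operands to add
   * @returns Sum of the two operands
   */
  public add(props: {
    x: number & tags.Type<"int32">;
    y: number & tags.Type<"int32">;
  }): number {
    return props.x + props.y;
  }
}

// typia analyzes CalculatorService during compilation and composes
// the "chatgpt" function calling schema from it.
const application: ILlmApplication<"chatgpt"> =
  typia.llm.application<CalculatorService, "chatgpt">();
console.log(application.functions.map((f) => f.name)); // e.g. [ "add" ]

The main example below uses the HTTP protocol instead, converting a Swagger document into the same kind of schema through HttpLlm.application().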

import { AgenticaSelectBenchmark } from "@agentica/benchmark";
import { Agentica, IAgenticaOperation } from "@agentica/core";
import { HttpLlm, IHttpConnection, OpenApi } from "@samchon/openapi";
import fs from "fs";
import OpenAI from "openai";
import path from "path";

const main = async (): Promise<void> => {
  // CREATE AI AGENT
  const agent: Agentica<"chatgpt"> = new Agentica({
    model: "chatgpt",
    vendor: {
      api: new OpenAI({
        apiKey: "YOUR_OPENAI_API_KEY",
      }),
      model: "gpt-4o-mini",
    },
    controllers: [
      {
        protocol: "http",
        name: "shopping",
        application: HttpLlm.application({
          model: "chatgpt",
          document: await fetch(
            "https://shopping-be.wrtn.ai/editor/swagger.json",
          ).then((res) => res.json()),
        }),
        connection: {
          host: "https://shopping-be.wrtn.ai",
        },
      },
    ],
  });

  // DO BENCHMARK
  // Helper: find an HTTP operation registered in the agent by method and path.
  const find = (method: OpenApi.Method, path: string): IAgenticaOperation => {
    const found = agent
      .getOperations()
      .find(
        (op) =>
          op.protocol === "http" &&
          op.function.method === method &&
          op.function.path === path,
      );
    if (!found) throw new Error(`Operation not found: ${method} ${path}`);
    return found;
  };
  const benchmark: AgenticaSelectBenchmark<"chatgpt"> =
    new AgenticaSelectBenchmark({
      agent,
      config: {
        repeat: 4,
      },
      scenarios: [
        {
          name: "order",
          text: [
            "I wanna see every sales in the shopping mall",
            "",
            "And then show me the detailed information about the Macbook.",
            "",
            "After that, select the most expensive stock",
            "from the Macbook, and put it into my shopping cart.",
            "And take the shopping cart to the order.",
            "",
            "At last, I'll publish it by cash payment, and my address is",
            "",
            " - country: South Korea",
            " - city/province: Seoul",
            " - department: Wrtn Apartment",
            " - Possession: 101-1411",
          ].join("\n"),
          expected: {
            type: "array",
            items: [
              {
                type: "standalone",
                operation: find("patch", "/shoppings/customers/sales"),
              },
              {
                type: "standalone",
                operation: find("get", "/shoppings/customers/sales/{id}"),
              },
              {
                type: "anyOf",
                anyOf: [
                  {
                    type: "standalone",
                    operation: find("post", "/shoppings/customers/orders"),
                  },
                  {
                    type: "standalone",
                    operation: find(
                      "post",
                      "/shoppings/customers/orders/direct",
                    ),
                  },
                ],
              },
              {
                type: "standalone",
                operation: find(
                  "post",
                  "/shoppings/customers/orders/{orderId}/publish",
                ),
              },
            ],
          },
        },
      ],
    });
  await benchmark.execute();

  // REPORT
  // Write every generated markdown report file under the root directory.
  const docs: Record<string, string> = benchmark.report();
  const root: string = `docs/benchmarks/select`;

  await fs.promises.rm(root, { recursive: true, force: true });
  for (const [key, value] of Object.entries(docs)) {
    await fs.promises.mkdir(
      path.join(root, key.split("/").slice(0, -1).join("/")),
      { recursive: true },
    );
    await fs.promises.writeFile(path.join(root, key), value, "utf8");
  }
};

main().catch(console.error);

Benchmark of Shopping Mall Scenario

Benchmark function selecting quality.

You can benchmark whether the AI agent selects the proper functions from the user's conversations through the LLM (Large Language Model) function calling feature. Create Agentica and AgenticaSelectBenchmark instances, and execute the benchmark with your specific scenarios as shown above.

If you have written sufficient and proper descriptions for the functions (or API operations) and DTO schema types, the success ratio of AgenticaSelectBenchmark will be higher. If the descriptions are insufficient or of poor quality, you may get a disappointing benchmark report. To see what an AgenticaSelectBenchmark report looks like, follow the benchmark report link above.
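While tuning those descriptions, it can help to iterate with a cheaper configuration before running the full scenario set. The sketch below is assumed to run inside the same main() function, reusing the agent and find() helper from the example above; the short single scenario and the top-level standalone expected node are assumptions made for this sketch, mirroring the nested nodes shown earlier.

// Assumed to run inside main(), after `agent` and `find()` are defined.
const quick: AgenticaSelectBenchmark<"chatgpt"> =
  new AgenticaSelectBenchmark({
    agent,
    config: {
      repeat: 1, // fewer trials: faster and cheaper while iterating
    },
    scenarios: [
      {
        name: "sale-detail",
        text: "Show me the detailed information about the Macbook.",
        expected: {
          type: "standalone",
          operation: find("get", "/shoppings/customers/sales/{id}"),
        },
      },
    ],
  });
await quick.execute();
console.log(Object.keys(quick.report())); // names of the report files to write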

Benchmark of Shopping Mall Scenario

import { AgenticaCallBenchmark } from "@agentica/benchmark";
import { Agentica, IAgenticaOperation } from "@agentica/core";
import { HttpLlm, IHttpConnection, OpenApi } from "@samchon/openapi";
import fs from "fs";
import OpenAI from "openai";
import path from "path";

const main = async (): Promise<void> => {
  // CREATE AI AGENT
  const agent: Agentica<"chatgpt"> = new Agentica({
    model: "chatgpt",
    vendor: {
      api: new OpenAI({
        apiKey: "YOUR_OPENAI_API_KEY",
      }),
      model: "gpt-4o-mini",
    },
    controllers: [
      {
        protocol: "http",
        name: "shopping",
        application: HttpLlm.application({
          model: "chatgpt",
          document: await fetch(
            "https://shopping-be.wrtn.ai/editor/swagger.json",
          ).then((res) => res.json()),
        }),
        connection: {
          host: "https://shopping-be.wrtn.ai",
        },
      },
    ],
  });

  // DO BENCHMARK
  // Helper: find an HTTP operation registered in the agent by method and path.
  const find = (method: OpenApi.Method, path: string): IAgenticaOperation => {
    const found = agent
      .getOperations()
      .find(
        (op) =>
          op.protocol === "http" &&
          op.function.method === method &&
          op.function.path === path,
      );
    if (!found) throw new Error(`Operation not found: ${method} ${path}`);
    return found;
  };
  const benchmark: AgenticaCallBenchmark<"chatgpt"> =
    new AgenticaCallBenchmark({
      agent,
      config: {
        repeat: 4,
      },
      scenarios: [
        {
          name: "order",
          text: [
            "I wanna see every sales in the shopping mall",
            "",
            "And then show me the detailed information about the Macbook.",
            "",
            "After that, select the most expensive stock",
            "from the Macbook, and put it into my shopping cart.",
            "And take the shopping cart to the order.",
            "",
            "At last, I'll publish it by cash payment, and my address is",
            "",
            " - country: South Korea",
            " - city/province: Seoul",
            " - department: Wrtn Apartment",
            " - Possession: 101-1411",
          ].join("\n"),
          expected: {
            type: "array",
            items: [
              {
                type: "standalone",
                operation: find("patch", "/shoppings/customers/sales"),
              },
              {
                type: "standalone",
                operation: find("get", "/shoppings/customers/sales/{id}"),
              },
              {
                type: "anyOf",
                anyOf: [
                  {
                    type: "standalone",
                    operation: find("post", "/shoppings/customers/orders"),
                  },
                  {
                    type: "standalone",
                    operation: find(
                      "post",
                      "/shoppings/customers/orders/direct",
                    ),
                  },
                ],
              },
              {
                type: "standalone",
                operation: find(
                  "post",
                  "/shoppings/customers/orders/{orderId}/publish",
                ),
              },
            ],
          },
        },
      ],
    });
  await benchmark.execute();

  // REPORT
  // Write every generated markdown report file under the root directory.
  const docs: Record<string, string> = benchmark.report();
  const root: string = `docs/benchmarks/call`;

  await fs.promises.rm(root, { recursive: true, force: true });
  for (const [key, value] of Object.entries(docs)) {
    await fs.promises.mkdir(
      path.join(root, key.split("/").slice(0, -1).join("/")),
      { recursive: true },
    );
    await fs.promises.writeFile(path.join(root, key), value, "utf8");
  }
};

main().catch(console.error);

Benchmark function calling quality.

You can benchmark whether the AI agent calls the proper functions from the user's conversations through the LLM (Large Language Model) function calling feature. Create Agentica and AgenticaCallBenchmark instances, and execute the benchmark with your specific scenarios as shown above.

If you have written sufficient and proper descriptions for the functions (or API operations) and DTO schema types, the success ratio of AgenticaCallBenchmark will be higher. If the descriptions are insufficient or of poor quality, you may get a disappointing benchmark report. To see what an AgenticaCallBenchmark report looks like, follow the benchmark report link above.

For reference, @agentica/core tends not to fail at the argument filling stage of LLM function calling, so it is usually fine to stop at the AgenticaSelectBenchmark stage; function calling with argument filling consumes much more time and many more LLM tokens.

Also, the current AgenticaCallBenchmark has been designed to perform multiple LLM function calls from a single conversation text. However, a proper multiple function calling benchmark actually requires the #Multi Turn Benchmark feature of the #Roadmap. Therefore, AgenticaSelectBenchmark is more economical than AgenticaCallBenchmark.

In the above "Shopping Mall" scenario, the function selecting benchmark finishes in 4 seconds, while the function calling benchmark takes about 3 minutes.

Multi-turn benchmarking will be supported for the #Function Calling Benchmark.

We will create some benchmark features that can analyze conversation context and issue summary reports or provide quantitative evaluations.

Namespaces

AgenticaCallBenchmark
AgenticaSelectBenchmark

Classes

AgenticaCallBenchmark
AgenticaSelectBenchmark