@agentica/benchmark
Benchmark program of Agentica.

Agentica is the simplest Agentic AI library, specialized in LLM function calling, and @agentica/benchmark is the benchmark tool for that library. It provides two quantitative benchmark classes, AgenticaSelectBenchmark and AgenticaCallBenchmark, which measure the quality of function selecting and function calling respectively.

Here is an example report generated by @agentica/benchmark, measuring the function calling quality of the "Shopping Mall" scenario. The benchmark scenario measured below is exactly the same as the one in the recorded video, and you can see that every function call succeeded without any error.

https://github.com/user-attachments/assets/01604b53-aca4-41cb-91aa-3faf63549ea6
Benchmark of Shopping Mall Scenario
- Benchmark Report
- Swagger Document: https://shopping-be.wrtn.ai/editor
- Repository: https://github.com/wrtnlabs/shopping-backend
npm install @agentica/core @agentica/benchmark @samchon/openapi typia
npx typia setup
Install @agentica/benchmark with its dependent libraries. Note that you have to install not only the @agentica/core and @agentica/benchmark libraries, but also @samchon/openapi and typia.

@samchon/openapi is an OpenAPI specification library that can convert a Swagger/OpenAPI document into an LLM function calling schema, and typia is a transformer (compiler) library that can compose an LLM function calling schema from a TypeScript class type. By the way, as typia is a transformer library that analyzes TypeScript source code at the compilation level, it needs the additional setup command npx typia setup.
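For reference, here is a minimal sketch of what typia's class-type composition looks like. The BbsArticleService class and its method are hypothetical placeholders, and the snippet assumes npx typia setup has already wired typia's transform plugin into your TypeScript build:

import typia from "typia";

class BbsArticleService {
  /**
   * Create a new article with the given title and body.
   */
  public create(props: { title: string; body: string }): void {
    console.log("created:", props.title, props.body);
  }
}

// typia analyzes the class type at compile time and composes
// an LLM function calling schema targeting the "chatgpt" model
const application = typia.llm.application<BbsArticleService, "chatgpt">();
console.log(application.functions.map((func) => func.name)); // [ "create" ]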
import { AgenticaSelectBenchmark } from "@agentica/benchmark";
import { Agentica, IAgenticaOperation } from "@agentica/core";
import { HttpLlm, IHttpConnection, OpenApi } from "@samchon/openapi";
import fs from "fs";
import OpenAI from "openai";
import path from "path";
const main = async (): Promise<void> => {
// CREATE AI AGENT
const agent: Agentica<"chatgpt"> = new Agentica({
model: "chatgpt",
vendor: {
api: new OpenAI({
apiKey: "YOUR_OPENAI_API_KEY",
}),
model: "gpt-4o-mini",
},
controllers: [
{
protocol: "http",
name: "shopping",
application: HttpLlm.application({
model: "chatgpt",
document: await fetch(
"https://shopping-be.wrtn.ai/editor/swagger.json",
).then((res) => res.json()),
}),
connection: {
host: "https://shopping-be.wrtn.ai",
},
},
],
});
// DO BENCHMARK
const find = (method: OpenApi.Method, path: string): IAgenticaOperation => {
const found = agent
.getOperations()
.find(
(op) =>
op.protocol === "http" &&
op.function.method === method &&
op.function.path === path,
);
if (!found) throw new Error(`Operation not found: ${method} ${path}`);
return found;
};
const benchmark: AgenticaSelectBenchmark<"chatgpt"> =
new AgenticaSelectBenchmark({
agent,
config: {
repeat: 4,
},
scenarios: [
{
name: "order",
text: [
"I wanna see every sales in the shopping mall",
"",
"And then show me the detailed information about the Macbook.",
"",
"After that, select the most expensive stock",
"from the Macbook, and put it into my shopping cart.",
"And take the shopping cart to the order.",
"",
"At last, I'll publish it by cash payment, and my address is",
"",
" - country: South Korea",
" - city/province: Seoul",
" - department: Wrtn Apartment",
" - Possession: 101-1411",
].join("\n"),
expected: {
type: "array",
items: [
{
type: "standalone",
operation: find("patch", "/shoppings/customers/sales"),
},
{
type: "standalone",
operation: find("get", "/shoppings/customers/sales/{id}"),
},
{
type: "anyOf",
anyOf: [
{
type: "standalone",
operation: find("post", "/shoppings/customers/orders"),
},
{
type: "standalone",
operation: find("post", "/shoppings/customers/orders/direct"),
},
],
},
{
type: "standalone",
operation: find(
"post",
"/shoppings/customers/orders/{orderId}/publish",
),
},
],
},
},
],
});
await benchmark.execute();
// REPORT
const docs: Record<string, string> = benchmark.report();
  const root: string = "docs/benchmarks/select";
  await fs.promises.rm(root, { recursive: true, force: true });
  for (const [key, value] of Object.entries(docs)) {
    // each key is a relative file path; create its directory before writing
    await fs.promises.mkdir(
      path.join(root, key.split("/").slice(0, -1).join("/")),
      { recursive: true },
    );
    await fs.promises.writeFile(path.join(root, key), value, "utf8");
  }
};

main().catch(console.error);
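To run the script, assuming it is saved as benchmark.ts and executed with ts-node (both the file name and the runner are placeholders for your own setup):

npx ts-node benchmark.ts

The generated markdown report files are then written under the docs/benchmarks/select directory.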
Benchmark of Shopping Mall Scenario
- Benchmark Report
- Swagger Document: https://shopping-be.wrtn.ai/editor
- Repository: https://github.com/wrtnlabs/shopping-backend
Benchmark function selecting quality.

You can measure how well the AI agent selects the proper functions from the user's conversation through the LLM (Large Language Model) function calling feature. Create Agentica and AgenticaSelectBenchmark typed instances, and execute the benchmark with your specific scenarios, as shown above.

If you have written sufficient and proper descriptions for the functions (or API operations) and DTO schema types, the success ratio of AgenticaSelectBenchmark will be higher. If the descriptions are insufficient or of poor quality, you may get a disappointing benchmark report. If you want to see what an AgenticaSelectBenchmark report looks like, click the benchmark report link above.
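For reference, the expected property of each scenario forms a small tree describing which operations should be selected. Below is a hedged sketch of its shape, inferred only from the variants used in the example above (the actual type shipped by agentica may define additional variants); IAgenticaOperation is the same type imported from @agentica/core in the example:

// sketch of the expected-tree shape used in the scenario above
type IExpectedSketch =
  // exactly this operation must be selected
  | { type: "standalone"; operation: IAgenticaOperation }
  // selecting any one of the candidates counts as success
  | { type: "anyOf"; anyOf: IExpectedSketch[] }
  // every item must be selected, in the given order
  | { type: "array"; items: IExpectedSketch[] };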
Benchmark of Shopping Mall Scenario
- Benchmark Report
- Swagger Document: https://shopping-be.wrtn.ai/editor
- Repository: https://github.com/wrtnlabs/shopping-backend
import { AgenticaCallBenchmark } from "@agentica/benchmark";
import { Agentica, IAgenticaOperation } from "@agentica/core";
import { HttpLlm, IHttpConnection, OpenApi } from "@samchon/openapi";
import fs from "fs";
import OpenAI from "openai";
import path from "path";
const main = async (): Promise<void> => {
// CREATE AI AGENT
const agent: Agentica<"chatgpt"> = new Agentica({
model: "chatgpt",
vendor: {
api: new OpenAI({
apiKey: "YOUR_OPENAI_API_KEY",
}),
model: "gpt-4o-mini",
},
controllers: [
{
protocol: "http",
name: "shopping",
application: HttpLlm.application({
model: "chatgpt",
document: await fetch(
"https://shopping-be.wrtn.ai/editor/swagger.json",
).then((res) => res.json()),
}),
connection: {
host: "https://shopping-be.wrtn.ai",
},
},
],
});
// DO BENCHMARK
const find = (method: OpenApi.Method, path: string): IAgenticaOperation => {
const found = agent
.getOperations()
.find(
(op) =>
op.protocol === "http" &&
op.function.method === method &&
op.function.path === path,
);
if (!found) throw new Error(`Operation not found: ${method} ${path}`);
return found;
};
  const benchmark: AgenticaCallBenchmark<"chatgpt"> =
    new AgenticaCallBenchmark({
agent,
config: {
repeat: 4,
},
scenarios: [
{
name: "order",
text: [
"I wanna see every sales in the shopping mall",
"",
"And then show me the detailed information about the Macbook.",
"",
"After that, select the most expensive stock",
"from the Macbook, and put it into my shopping cart.",
"And take the shopping cart to the order.",
"",
"At last, I'll publish it by cash payment, and my address is",
"",
" - country: South Korea",
" - city/province: Seoul",
" - department: Wrtn Apartment",
" - Possession: 101-1411",
].join("\n"),
expected: {
type: "array",
items: [
{
type: "standalone",
operation: find("patch", "/shoppings/customers/sales"),
},
{
type: "standalone",
operation: find("get", "/shoppings/customers/sales/{id}"),
},
{
type: "anyOf",
anyOf: [
{
type: "standalone",
operation: find("post", "/shoppings/customers/orders"),
},
{
type: "standalone",
operation: find("post", "/shoppings/customers/orders/direct"),
},
],
},
{
type: "standalone",
operation: find(
"post",
"/shoppings/customers/orders/{orderId}/publish",
),
},
],
},
},
],
});
await benchmark.execute();
// REPORT
const docs: Record<string, string> = benchmark.report();
const root: string = `docs/benchmarks/call`;
  await fs.promises.rm(root, { recursive: true, force: true });
  for (const [key, value] of Object.entries(docs)) {
    // each key is a relative file path; create its directory before writing
    await fs.promises.mkdir(
      path.join(root, key.split("/").slice(0, -1).join("/")),
      { recursive: true },
    );
    await fs.promises.writeFile(path.join(root, key), value, "utf8");
  }
};

main().catch(console.error);
Benchmark function calling quality.

You can measure how well the AI agent calls the proper functions from the user's conversation through the LLM (Large Language Model) function calling feature. Create Agentica and AgenticaCallBenchmark typed instances, and execute the benchmark with your specific scenarios, as shown above.

If you have written sufficient and proper descriptions for the functions (or API operations) and DTO schema types, the success ratio of AgenticaCallBenchmark will be higher. If the descriptions are insufficient or of poor quality, you may get a disappointing benchmark report. If you want to see what an AgenticaCallBenchmark report looks like, click the benchmark report link above.
For reference, @agentica/core tends not to fail at the argument-filling stage of LLM function calling, so it is okay to stop at the AgenticaSelectBenchmark stage; function calling with argument filling consumes much more time and many more LLM tokens.

Also, the current AgenticaCallBenchmark has been designed to perform multiple LLM function calls from just one conversation text. However, a proper benchmark of multiple LLM function calls actually requires the #Multi Turn Benchmark feature of the #Roadmap. Therefore, AgenticaSelectBenchmark is more economical than AgenticaCallBenchmark.

In the above "Shopping Mall" scenario, the function selecting benchmark finishes in about 4 seconds, while the function calling benchmark takes about 3 minutes.
Multi-turn benchmark support will be added to the #Function Calling Benchmark. We will also create benchmark features that can analyze conversation context and issue summary reports or provide quantitative evaluations.