HoneyHive

Monitor and evaluate your AI pipelines through conversational agents

Your ML team needs to log events, run evaluations, and check experiment metrics. Your AI agent interfaces with HoneyHive to start sessions, manage datasets, retrieve experiment results, and update metrics, making observability part of the conversation instead of a separate dashboard.

Chosen by 800+ global brands across industries

AI observability inside your workflow

Your agent logs model events, manages evaluation datasets, starts experiment runs, and retrieves metrics from HoneyHive, turning LLM observability into a conversational workflow.

HoneyHive

Use Cases

LLM quality assurance, made conversational

See how AI teams use agents to interact with HoneyHive for logging, evaluation, and experimentation, keeping model observability embedded in daily workflows.

Automated Experiment Reporting for Prompt Engineering

A prompt engineer finishes an A/B test between two system prompts. They ask the agent for results. Your AI Agent fetches the experiment run from HoneyHive, retrieves aggregated metrics including accuracy, latency, and cost per call, and presents a comparison. The engineer decides which prompt to promote to production in minutes. No dashboard navigation needed.

Continuous Dataset Enrichment from Production Feedback

Customer support flags a conversation where the AI gave an incorrect answer. The QA lead tells the agent to add it as a test case. The agent appends the input, expected output, and metadata to the HoneyHive evaluation dataset. The regression suite grows organically from real failures. Future prompt changes get tested against actual edge cases.

Batch Event Logging After Inference Runs

An ML engineer completes an overnight batch inference job and needs to log all results. They trigger the agent, which sends model events to HoneyHive in bulk with inputs, outputs, durations, and token counts. The observability dashboard immediately reflects the new data. The engineer reviews performance trends without writing a single logging script.

Try HoneyHive

HoneyHive

FAQs

Frequently Asked Questions

How does the AI agent log model events to HoneyHive?

The agent calls HoneyHive's batch model events endpoint with an array of event objects containing inputs, outputs, durations, token counts, and metadata. Events are associated with a session and project. HoneyHive processes them for dashboards, evaluators, and alerting. Single and batch logging are both supported.
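A minimal sketch of what such a batch call could look like over plain HTTP. The base URL, endpoint path, and field names below are illustrative assumptions, not the documented HoneyHive schema; consult HoneyHive's API reference for the exact shapes.

```python
import requests

HONEYHIVE_API = "https://api.honeyhive.ai"   # assumed base URL
API_KEY = "hh_live_..."                      # your HoneyHive API key

# Hypothetical payload: one entry per model call from the batch job.
events = [
    {
        "project": "support-bot",
        "session_id": "sess_abc123",
        "event_type": "model",
        "inputs": {"prompt": "Summarize the ticket"},
        "outputs": {"completion": "Customer reports a billing error..."},
        "duration_ms": 812,
        "metadata": {"tokens": 432, "model": "gpt-4o"},
    },
    # ...more events from the overnight run
]

resp = requests.post(
    f"{HONEYHIVE_API}/events/batch",          # assumed endpoint path
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"events": events},
)
resp.raise_for_status()
```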

Can the agent run evaluations against my datasets?

Yes. The agent can start evaluation runs by specifying event IDs, dataset IDs, and project context through HoneyHive's API. It can also end runs and retrieve results with aggregated metrics like average, median, p95, or custom functions. This covers both automated and human evaluation workflows.
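As a rough sketch of that flow, assuming a run endpoint and the field names shown (all illustrative, not HoneyHive's documented contract):

```python
import requests

HONEYHIVE_API = "https://api.honeyhive.ai"   # assumed base URL
HEADERS = {"Authorization": "Bearer hh_live_..."}

# Start an evaluation run over a dataset (IDs and fields are illustrative).
run = requests.post(
    f"{HONEYHIVE_API}/runs",                  # assumed endpoint
    headers=HEADERS,
    json={
        "project": "support-bot",
        "name": "prompt-v2-regression",
        "dataset_id": "ds_regression_suite",
        "event_ids": ["evt_1", "evt_2"],
    },
).json()

# Later: mark the run complete and pull aggregated results.
requests.put(f"{HONEYHIVE_API}/runs/{run['run_id']}",
             headers=HEADERS, json={"status": "completed"})
results = requests.get(f"{HONEYHIVE_API}/runs/{run['run_id']}/result",
                       headers=HEADERS,
                       params={"aggregate_function": "median"}).json()
```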

What authentication does Tars need for HoneyHive?

Tars uses your HoneyHive API key, which you generate from your account settings. This key authenticates all API calls including session management, event logging, dataset operations, and metric retrieval. You can rotate the key anytime from the HoneyHive dashboard.
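In practice this amounts to sending the key as a bearer token on every request. A minimal sketch, with the environment variable name and endpoint path as assumptions:

```python
import os
import requests

# Read the HoneyHive API key from the environment rather than hard-coding it,
# so it can be rotated without changing the agent configuration.
API_KEY = os.environ["HONEYHIVE_API_KEY"]     # illustrative variable name

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {API_KEY}"})

# Every call the agent makes (sessions, events, datasets, metrics) reuses
# the same authenticated session.
session.post("https://api.honeyhive.ai/session/start",   # assumed endpoint
             json={"project": "support-bot", "session_name": "qa-review"})
```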

Does Tars store my AI model outputs or evaluation data?

No. Tars sends data directly to HoneyHive's API and does not retain copies. Model inputs, outputs, evaluation datasets, and experiment metrics are stored exclusively in your HoneyHive account. Tars handles only the API request and response during the conversation.

Can the agent manage multiple projects in HoneyHive?

Yes. The agent can list all projects, filter by name, and create or update projects through HoneyHive's API. When logging events or managing datasets, the agent specifies the target project name, so your team can work across multiple AI applications from a single conversation.
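A hedged sketch of that list-then-create pattern, assuming a projects endpoint that returns a JSON list and accepts a name filter (both assumptions):

```python
import requests

HEADERS = {"Authorization": "Bearer hh_live_..."}
BASE = "https://api.honeyhive.ai"             # assumed base URL

# List projects filtered by name (query parameter is illustrative).
projects = requests.get(f"{BASE}/projects", headers=HEADERS,
                        params={"name": "support-bot"}).json()

# Create the project if it does not exist yet.
if not projects:
    requests.post(f"{BASE}/projects", headers=HEADERS,
                  json={"name": "support-bot",
                        "description": "Customer support assistant"})

# Subsequent event or dataset calls then reference the project by name.
```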

How is this different from using HoneyHive's web dashboard?

The dashboard requires manual navigation to view experiments, manage datasets, and inspect events. With Tars, your team asks questions like 'How did experiment run_123 perform?' or 'Add this failing case to the QA dataset' and gets results instantly. Observability becomes part of the engineering conversation.

Can the agent update metric definitions and thresholds?

Yes. The agent calls HoneyHive's update metric endpoint to modify names, descriptions, evaluator prompts, code snippets, thresholds, and production enablement flags. This lets your team adjust quality criteria conversationally as requirements evolve.
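For example, tightening a passing threshold and enabling a metric in production might look roughly like the following; the metric ID, endpoint path, and field names are illustrative assumptions:

```python
import requests

HEADERS = {"Authorization": "Bearer hh_live_..."}
BASE = "https://api.honeyhive.ai"             # assumed base URL

# Raise the passing threshold on an existing metric and turn it on in
# production (identifier and fields are hypothetical).
requests.put(
    f"{BASE}/metrics/metric_response_quality",
    headers=HEADERS,
    json={
        "name": "response_quality",
        "description": "LLM-judged answer quality on a 1-5 scale",
        "threshold": {"min": 4},
        "enabled_in_prod": True,
    },
).raise_for_status()
```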

What aggregation functions are available for experiment results?

When retrieving experiment results, the agent can specify aggregation functions including average, min, max, median, p90, p95, p99, sum, and count. This gives your team flexibility to analyze results from different statistical perspectives without running manual calculations.
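A short sketch of comparing typical versus tail behaviour by requesting the same run under several aggregations; the run ID, endpoint, and parameter name are assumptions for illustration:

```python
import requests

HEADERS = {"Authorization": "Bearer hh_live_..."}
BASE = "https://api.honeyhive.ai"             # assumed base URL

# Fetch the same experiment results under different aggregation functions.
for agg in ("average", "median", "p95", "p99"):
    result = requests.get(
        f"{BASE}/runs/run_123/result",        # hypothetical run ID
        headers=HEADERS,
        params={"aggregate_function": agg},
    ).json()
    print(agg, result)
```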

How to add Tools to your AI Agent

Supercharge your AI Agent with Tool Integrations

Don't limit your AI Agent to basic conversations. Watch how to configure and add powerful tools that make your agent smarter and more capable.

Privacy & Security

We’ll never let you lose sleep over privacy and security concerns

At Tars, we take privacy and security very seriously. We are compliant with GDPR, ISO, SOC 2, and HIPAA.

GDPR
ISO
SOC 2
HIPAA

Still scrolling? We both know you're interested.

Let's chat about AI Agents the old-fashioned way. Get a demo tailored to your requirements.

Schedule a Demo