How Databricks is using synthetic data to simplify evaluation of AI agents
Enterprises are going all in on compound AI agents. They want these systems to reason and handle different tasks in different domains, but are often stifled by the complex and time-consuming process of evaluating agent performance. Today, data ecosystem leader Databricks announced synthetic data capabilities to make this a tad easier for developers.
The move, according to the company, will allow developers to generate high-quality artificial datasets within their workflows to evaluate the performance of in-development agentic systems. This will save them unnecessary back-and-forth with subject matter experts and more quickly bring agents to production.
While it remains to be seen how exactly the synthetic data offering will work for enterprises using the Databricks Data Intelligence Platform, the Ali Ghodsi-led company claims its internal tests show the capability can significantly improve agent performance across various metrics.
Databricks’ play for evaluating AI agents
Databricks acquired MosaicML last year and has fully integrated the company’s technology and models across its Data Intelligence Platform to give enterprises everything they need to build, deploy and evaluate machine learning (ML) and generative AI solutions using their data hosted in the company’s lakehouse.
Part of this work has revolved around helping teams build compound AI systems that can not only reason and respond with accuracy but also take actions such as opening/closing support tickets, responding to emails and making reservations. To this end, the company unveiled a whole new suite of Mosaic AI capabilities this year, including support for fine-tuning foundation models, a catalog for AI tools and offerings for building and evaluating the AI agents — Mosaic AI Agent Framework and Agent Evaluation.
Today, the company is expanding Agent Evaluation with a new synthetic data generation API.
So far, Agent Evaluation has provided enterprises with two key capabilities. The first enables users and subject matter experts (SMEs) to manually define datasets with relevant questions and answers and create a yardstick of sorts to rate the quality of answers provided by AI agents. The second enables the SMEs to use this yardstick to assess the agent and provide feedback (labels). This is backed by AI judges that automatically log responses and feedback by humans in a table and rate the agent’s quality on metrics such as accuracy and harmfulness.
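For illustration, a hand-curated evaluation record in this style might look roughly like the sketch below. The field names (request, expected_retrieved_context, expected_response) and the example content are assumptions chosen to show the shape of such data, not a confirmed Agent Evaluation schema.

```python
# A minimal sketch of one hand-curated evaluation record. Field names and
# values are illustrative assumptions, not a confirmed Databricks schema.
evaluation_example = {
    "request": "What is the maximum retry count for the payments API?",
    "expected_retrieved_context": [
        {
            "doc_uri": "docs/payments/retries.md",
            "content": "The payments API allows up to 5 retries per request.",
        }
    ],
    "expected_response": "The payments API allows a maximum of 5 retries per request.",
}

# SMEs curate a list of such "golden" records; AI judges then compare the
# agent's actual answers against them on metrics such as correctness.
evaluation_set = [evaluation_example]
```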
This approach works, but building evaluation datasets takes a lot of time. The reasons are easy to imagine: domain experts are not always available, the process is manual, and users often struggle to identify the most relevant questions and answers to provide ‘golden’ examples of successful interactions.
This is exactly where the synthetic data generation API comes in, enabling developers to create high-quality evaluation datasets for preliminary assessment in a matter of minutes. It reduces the work of SMEs to final validation and fast-tracks iterative development, letting developers explore for themselves how changes to the system (tuning models, changing retrieval or adding tools) alter quality.
The company ran internal tests to see how datasets generated by the API can help evaluate and improve agents, and found that they led to significant improvements across various metrics.
“We asked a researcher to use the synthetic data to evaluate and improve an agent’s performance and then evaluated the resulting agent using the human-curated data,” Eric Peter, AI platform and product leader at Databricks, told VentureBeat. “The results showed that across various metrics, the agent’s performance improved significantly. For instance, we observed a nearly 2X increase in the agent’s ability to find relevant documents (as measured by recall@10). Additionally, we saw improvements in the overall correctness of the agent’s responses.”
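Recall@10 here refers to the fraction of the relevant documents that show up among an agent’s top 10 retrieved results. The snippet below is a generic illustration of how such a metric is typically computed, not Databricks’ internal implementation; the document IDs are made up.

```python
def recall_at_k(retrieved_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the relevant documents found in the top-k retrieved results."""
    if not relevant_doc_ids:
        return 0.0
    top_k = set(retrieved_doc_ids[:k])
    hits = sum(1 for doc_id in relevant_doc_ids if doc_id in top_k)
    return hits / len(relevant_doc_ids)

# Example: 2 of the 3 relevant documents appear in the top 10 results,
# so recall@10 is ~0.67. Roughly doubling this figure is the kind of gain
# Databricks says it observed after iterating against synthetic data.
print(recall_at_k(["d4", "d9", "d1", "d7"], ["d1", "d2", "d9"], k=10))
```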
How does it stand out?
While there are plenty of tools that can generate synthetic datasets for evaluation, Databricks’ offering stands out with its tight integration with Mosaic AI Agent Evaluation, meaning developers building on the company’s platform don’t have to leave their workflows.
Peter noted that creating a dataset with the new API is a four-step process: developers parse their documents (saving them as a Delta table in their lakehouse), pass the Delta table to the synthetic data API, run the evaluation with the generated data and view the quality results.
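Based on that description, the workflow might look roughly like the sketch below when run in a Databricks notebook. The function and parameter names (generate_evals_df, the "databricks-agent" model type, and the table and model names) are assumptions drawn from Databricks’ publicly documented Mosaic AI tooling, included only to convey the shape of the four steps.

```python
# A rough sketch of the four steps, assuming a Databricks notebook where
# `spark` is predefined. API names below are assumptions, not verbatim
# from the article.
import mlflow
from databricks.agents.evals import generate_evals_df

# 1. Parsed documents already stored as a Delta table in the lakehouse
#    (hypothetical table with `content` and `doc_uri` columns).
docs = spark.read.table("catalog.schema.parsed_docs")

# 2. Pass the documents to the synthetic data API to generate evaluation rows.
evals = generate_evals_df(
    docs,
    num_evals=100,
    agent_description="An assistant that answers questions about internal support docs.",
)

# 3. Run Agent Evaluation on the agent using the generated data.
results = mlflow.evaluate(
    model="models:/support_agent/1",   # hypothetical registered agent
    data=evals,
    model_type="databricks-agent",
)

# 4. View the quality results (also surfaced in the MLflow evaluation UI).
print(results.metrics)
```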
In contrast, using an external tool would mean several additional steps: running an extract, transform and load (ETL) job to move the parsed documents to an external environment that can run the synthetic data generation process; moving the generated data back to the Databricks platform; and transforming it into a format accepted by Agent Evaluation. Only after all that can the evaluation be run.
“We knew companies needed a turnkey API that was simple to use — one line of code to generate data,” Peter explained. “We also saw that many solutions on the market were offering simple open-source prompts that aren’t tuned for quality. With this in mind, we made a significant investment in the quality of the generated data while still allowing developers to tune the pipeline for their unique enterprise requirements via a prompt-like interface. Finally, we knew most existing offerings needed to be imported into existing workflows, adding unnecessary complexity to the process. Instead, we built an SDK that was tightly integrated with the Databricks Data Intelligence Platform and Mosaic AI Agent Evaluation capabilities.”
Multiple enterprises using Databricks are already taking advantage of the synthetic data API as part of a private preview, and report a significant reduction in the time taken to improve the quality of their agents and deploy them into production.
One of these customers, Chris Nishnick, director of artificial intelligence at Lippert, said Lippert’s teams were able to use the API’s data to improve model response quality by a relative 60%, even before involving experts.
More agent-centric capabilities in the pipeline
As the next step, the company plans to expand Mosaic AI Agent Evaluation with features to help domain experts modify the synthetic data for further accuracy as well as tools to manage its lifecycle.
“In our preview, we learned that customers want several additional capabilities,” said Peter. “First, they want a user interface for their domain experts to review and edit the synthetic evaluation data. Second, they want a way to govern and manage the lifecycle of their evaluation set in order to track changes and make updates from the domain expert review of the data instantly available to developers. To address these challenges, we are already testing several features with customers that we plan to launch early next year.”
Broadly, the developments are expected to boost adoption of Databricks’ Mosaic AI offering, further strengthening the company’s position as the go-to vendor for all things data and gen AI.
But Snowflake is also catching up in the category and has made a series of product announcements, including a model partnership with Anthropic, for its Cortex AI product that allows enterprises to build gen AI apps. Earlier this year, Snowflake also acquired observability startup TruEra to provide AI application monitoring capabilities within Cortex.