Enterprises are rapidly deploying large language models (LLMs), but scaling them reliably remains a major challenge. Traditional evaluation is often manual, fragmented and too slow to meet production demands. A leading U.S. telco faced exactly this problem: more than 200 models deployed across Triton servers with no common evaluation framework, resulting in bottlenecks, uneven quality standards and gaps between offline testing and real-world performance.
To address these challenges, CGI developed a comprehensive LLMOps evaluation framework built on the Databricks Data Intelligence Platform, using Mosaic AI and MLflow. Our solution applies LLM-as-a-Judge to automate and standardise evaluation across the entire model lifecycle, ensuring consistency, quality and governance at scale.
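A minimal sketch of the LLM-as-a-Judge pattern is shown below. In production this would run through MLflow's evaluation tooling with a real judge model scoring responses; here a keyword-overlap stub stands in for the judge LLM, and all case data, names and the 1–5 scale are illustrative assumptions, not CGI's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    score: int          # 1 (poor) to 5 (excellent), a common judge rubric
    rationale: str

def judge(question: str, answer: str, reference: str) -> Judgment:
    """Stand-in for an LLM judge: scores answer/reference token overlap.
    A production judge would instead send a grading prompt to a model endpoint."""
    ans, ref = set(answer.lower().split()), set(reference.lower().split())
    overlap = len(ans & ref) / max(len(ref), 1)
    return Judgment(1 + round(overlap * 4), f"{overlap:.0%} overlap with reference")

def evaluate(cases):
    """Judge every (question, answer, reference) case and aggregate a mean score."""
    results = [judge(*c) for c in cases]
    return results, sum(r.score for r in results) / len(results)

cases = [
    ("What plan includes 5G?", "The Unlimited plan includes 5G.",
     "Unlimited plan includes 5G access."),
    ("Is roaming free?", "Yes, roaming is free in Canada.",
     "Roaming is included in Canada and Mexico."),
]
results, mean_score = evaluate(cases)
```

The same loop structure applies whether the suite runs offline before deployment or against sampled production traffic, which is what makes a single judge rubric reusable across the lifecycle.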
Importantly, the framework supports models trained and served outside Databricks, including those hosted on platforms such as LiteLLM and JFrog, by using MLflow’s model URI integration and external artifact tracking.
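To illustrate the idea, the sketch below resolves an MLflow-style `models:/<name>/<version>` URI to an externally hosted serving endpoint. The `ModelIndex` class, the endpoint URL and the tag names are hypothetical stand-ins for what MLflow's tracking server and registry provide; only the URI shape follows MLflow's convention.

```python
from dataclasses import dataclass, field

@dataclass
class ExternalModelRecord:
    """Record for a model served outside Databricks; the serving URI points at
    the external platform (e.g. a LiteLLM route or a JFrog artifact path)."""
    name: str
    version: int
    serving_uri: str
    tags: dict = field(default_factory=dict)

class ModelIndex:
    """Minimal registry: resolves 'models:/<name>/<version>' to a serving URI."""
    def __init__(self):
        self._records = {}

    def register(self, rec: ExternalModelRecord):
        self._records[(rec.name, rec.version)] = rec

    def resolve(self, uri: str) -> str:
        _, _, rest = uri.partition("models:/")   # strip the scheme prefix
        name, _, version = rest.partition("/")
        return self._records[(name, int(version))].serving_uri

index = ModelIndex()
index.register(ExternalModelRecord(
    name="support-chat", version=3,
    serving_uri="https://litellm.internal/v1/chat/support-chat",  # hypothetical
    tags={"host": "LiteLLM", "artifact_store": "JFrog"},
))
endpoint = index.resolve("models:/support-chat/3")
```

Keeping a single URI scheme for internal and external models is what lets one evaluation harness score everything, regardless of where the model is hosted.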
Key innovations include:
- Multi-tier evaluation architecture: Offline pre-deployment testing, online production monitoring and human-in-the-loop oversight.
- Continuous feedback loops: Production insights flow back into offline evaluation, improving model accuracy and judge calibration over time.
- Asset Bundles for deployment: Ensuring atomic, versioned and reproducible evaluation environments across test, staging, and production.
- Unity Catalog integration: Centralised governance with full lineage of models, datasets, and evaluation artifacts, ensuring traceability, compliance and secure collaboration across teams.
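The continuous feedback loop above can be sketched as a simple promotion step: production responses that the judge scores below a quality bar are added to the offline regression suite, so the next pre-deployment run covers observed failures. Field names, the score threshold and the dedup key are illustrative assumptions.

```python
def feedback_loop(offline_suite, production_log, quality_bar=3):
    """Append hard production cases to the offline suite, deduplicated by question."""
    known = {case["question"] for case in offline_suite}
    for event in production_log:
        if event["judge_score"] < quality_bar and event["question"] not in known:
            offline_suite.append({"question": event["question"],
                                  "reference": event["corrected_answer"]})
            known.add(event["question"])
    return offline_suite

offline = [{"question": "What plans include 5G?", "reference": "Unlimited plans."}]
prod = [
    # New failure: promoted into the offline suite.
    {"question": "Is eSIM supported?", "judge_score": 2,
     "corrected_answer": "Yes, on all plans."},
    # Already covered offline: skipped by the dedup check.
    {"question": "What plans include 5G?", "judge_score": 1,
     "corrected_answer": "Unlimited plans."},
]
suite = feedback_loop(offline, prod)
```

Over time this narrows the gap between offline testing and real-world behaviour, and the accumulated human-corrected answers double as calibration data for the judge itself.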
This approach reduces manual effort while expanding coverage and improving accuracy, transforming evaluation into a self-improving, enterprise-ready system. By unifying model assessment on Databricks, this large U.S. telecommunications provider can now confidently deploy LLMs into production with consistent standards, real-time monitoring and proactive risk management.
CGI’s LLMOps accelerator demonstrates how Databricks can power trustworthy, scalable and future-ready GenAI adoption for enterprise clients.