Crafting synthetic data for text-to-SQL systems: Powering enterprise innovation with Databricks

Imagine being able to ask your organization’s data a simple question in plain language and receive an immediate, accurate answer. For many executives, this is the next frontier of data-driven decision-making.

Text-to-SQL systems are designed to make that vision a reality by translating everyday business questions into precise database queries. Yet despite their promise, many organizations struggle to scale these solutions. The challenge is rarely the underlying technology. More often, it comes down to a lack of high-quality training data: real-world questions paired with accurate, validated SQL queries that make systems reliable, secure and widely adopted.

At CGI, our teams explored how synthetic data generation can help overcome this challenge using Databricks, the unified platform for data, analytics and AI. What began as a technical exploration evolved into a practical, repeatable approach that organizations can apply across industries and geographies to accelerate responsible generative AI adoption.

Building innovation on Databricks

The initiative started with a straightforward but critical question: How can organizations generate reliable, realistic training data for text-to-SQL systems without relying on scarce or inconsistent business user input?

Using Databricks’ centralized catalog, our teams accessed governed schema metadata, including table names, columns, relationships and descriptions, within a single unified environment. With this foundation in place, we designed an end-to-end workflow that brought data engineering, analytics and generative AI together on one platform.
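As an illustration of this step, the sketch below turns column-level catalog metadata into a compact schema description that later steps can pass to a language model. It is a minimal sketch rather than the workflow our teams built: the `main` catalog, the `sales` schema, the sample rows and the `build_schema_context` helper are all hypothetical, and the query assumes the Unity Catalog information schema is available in the workspace.

```python
from collections import defaultdict

# Assumed query against the Unity Catalog information schema.
# The catalog ('main') and schema ('sales') names are hypothetical.
METADATA_QUERY = """
SELECT table_name, column_name, data_type, comment
FROM system.information_schema.columns
WHERE table_catalog = 'main' AND table_schema = 'sales'
ORDER BY table_name, ordinal_position
"""


def build_schema_context(rows):
    """Group column-level metadata rows into one text block per table."""
    tables = defaultdict(list)
    for row in rows:
        line = f"  {row['column_name']} {row['data_type']}"
        if row.get("comment"):
            line += f"  -- {row['comment']}"
        tables[row["table_name"]].append(line)
    # Tables appear in the order they were first seen in the input rows.
    return "\n".join(
        f"TABLE {name}\n" + "\n".join(columns) for name, columns in tables.items()
    )


# Hypothetical rows, shaped like the result of METADATA_QUERY.
sample_rows = [
    {"table_name": "customers", "column_name": "id", "data_type": "BIGINT", "comment": "Customer key"},
    {"table_name": "customers", "column_name": "name", "data_type": "STRING", "comment": None},
    {"table_name": "orders", "column_name": "customer_id", "data_type": "BIGINT", "comment": "References customers.id"},
]
schema_context = build_schema_context(sample_rows)
```

In a Databricks notebook, the rows could instead come from something like `[r.asDict() for r in spark.sql(METADATA_QUERY).collect()]`, assuming the notebook user has permission to read the information schema.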

Key elements of the workflow included:

Combining Python and SQL within Databricks notebooks to select a diverse set of tables and representative data samples

Mapping relationships between tables to create realistic joins and meaningful business context

Sampling data at scale using Spark SQL to ensure both speed and efficiency

Automatically generating and validating SQL queries directly within the Databricks environment

Using large language models to generate corresponding natural-language questions, completing high-quality (question, SQL) training pairs
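The generate, validate and pair steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation our teams ran: an in-memory SQLite database stands in for Spark SQL so the example runs anywhere, the `customers` and `orders` tables are invented, and `ask_llm` is a hypothetical callable that wraps whichever language model endpoint is in use.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory database as a stand-in for governed tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO orders VALUES (1, 1, 120.0), (2, 1, 80.0), (3, 2, 45.5);
    """)
    return conn


def build_join_query(rel, limit=5):
    """Build one aggregate query from a parent/child table relationship."""
    return (
        f"SELECT p.{rel['parent_label']}, COUNT(*) AS n_rows "
        f"FROM {rel['child_table']} c "
        f"JOIN {rel['parent_table']} p ON c.{rel['child_key']} = p.{rel['parent_key']} "
        f"GROUP BY p.{rel['parent_label']} ORDER BY n_rows DESC LIMIT {limit}"
    )


def validate_sql(conn, sql):
    """Accept a query only if it executes and returns at least one row."""
    try:
        return len(conn.execute(sql).fetchall()) > 0
    except sqlite3.Error:
        return False


def question_prompt(schema_context, sql):
    """Assemble the instruction sent to the language model."""
    return (
        "Given this schema:\n" + schema_context + "\n\n"
        "Write one natural-language business question answered by this SQL:\n" + sql
    )


def generate_pairs(conn, relationships, schema_context, ask_llm):
    """Return (question, SQL) pairs for every query that passes validation."""
    pairs = []
    for rel in relationships:
        sql = build_join_query(rel)
        if validate_sql(conn, sql):
            question = ask_llm(question_prompt(schema_context, sql))
            pairs.append({"question": question, "sql": sql})
    return pairs
```

Validating each query before asking the model for a question means that only SQL known to run against the schema ends up in the training set. On Databricks, the same check could run the query through Spark SQL instead of SQLite.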

By running the entire workflow within Databricks, teams collaborated, tested and refined their approach in a governed environment, accelerating innovation without sacrificing control, security or data privacy.

Why synthetic data matters

Synthetic data is more than a substitute for missing examples. It is a powerful enabler for scaling AI initiatives responsibly.

For text-to-SQL systems, synthetic data allows organizations to rapidly generate diverse, high-quality training examples while avoiding common challenges such as limited user availability, inconsistent inputs and sensitive data exposure. In our work, this approach enabled the creation of hundreds of validated question and SQL pairs in a short time frame, significantly improving model accuracy and reliability.

In practice, this led to greater trust in the system, driving increased adoption and confidence in data-driven decision-making across business users.

Databricks for enterprise synthetic data and AI

Throughout the process, Databricks served as more than just an execution environment. It provided the foundation for moving from experimentation to enterprise-ready delivery.

Key advantages included:

A unified workspace that brought data, code and AI workflows together

Collaboration features that enabled rapid prototyping and iteration across teams

Integration capabilities that simplified access to existing tools and services

Enterprise-grade governance and security through Databricks Unity Catalog and Secrets, ensuring controlled access to data, credentials and configurations

This work demonstrates how Databricks enables organizations to unify data engineering, analytics and generative AI on a single platform, transforming innovation from isolated experiments into scalable, repeatable solutions.

Scaling enterprise generative AI with synthetic data

As organizations around the world seek to unlock more value from their data, synthetic data generation offers a practical and scalable path forward. It helps bridge the gap between ambition and execution, enabling AI-powered solutions that are both effective and responsible.

We help clients across industries harness the power of platforms like Databricks to drive transformation, from modern data architectures to intelligent automation and advanced analytics.

Consult with our experts

Parisa Ghane

Gaby Martin
