Imagine being able to ask your organization’s data a simple question in plain language and receive an immediate, accurate answer. For many executives, this is the next frontier of data-driven decision-making.
Text-to-SQL systems are designed to make that vision a reality by translating everyday business questions into precise database queries. Yet despite their promise, many organizations struggle to scale these solutions. The challenge is rarely the underlying technology. More often, it comes down to a lack of high-quality training data: real-world questions paired with accurate, validated SQL queries that make systems reliable, secure and widely adopted.
At CGI, our teams explored how synthetic data generation can help overcome this challenge using Databricks, the unified platform for data, analytics and AI. What began as a technical exploration evolved into a practical, repeatable approach that organizations can apply across industries and geographies to accelerate responsible generative AI adoption.
Building innovation on Databricks
The initiative started with a straightforward but critical question: How can organizations generate reliable, realistic training data for text-to-SQL systems without relying on scarce or inconsistent business user input?
Using Databricks’ centralized catalog, our teams accessed governed schema metadata, including table names, columns, relationships and descriptions, within a single unified environment. With this foundation in place, we designed an end-to-end workflow that brought data engineering, analytics and generative AI together on one platform.
Key elements of the workflow included:
- Combining Python and SQL within Databricks notebooks to select a diverse set of tables and representative data samples
- Mapping relationships between tables to create realistic joins and meaningful business context
- Sampling data at scale using Spark SQL to ensure both speed and efficiency
- Automatically generating and validating SQL queries directly within the Databricks environment
- Using large language models to generate corresponding natural-language questions, completing high-quality (question, SQL) training pairs
By running the entire workflow within Databricks, teams collaborated, tested and refined their approach in a governed environment, accelerating innovation without sacrificing control, security or data privacy.
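The workflow steps listed above can be illustrated with a minimal sketch. It is not the implementation our teams ran on Databricks. It uses Python's built-in `sqlite3` module as a local stand-in for Spark SQL, and the tables (`customers`, `orders`), the relationship list, the query templates and the `draft_question` helper are all hypothetical names invented for illustration. In a real pipeline, schema metadata would come from the governed catalog, and a large language model would draft the question rather than a fixed template.

```python
import sqlite3

# Hypothetical relationship metadata: (child table, parent table, join key).
# In practice this would be read from a governed catalog, not hard-coded.
RELATIONSHIPS = [("orders", "customers", "customer_id")]

# Hypothetical aggregate templates: (function, measure column, grouping column).
# The last entry references a column that does not exist, so validation rejects it.
TEMPLATES = [
    ("SUM", "amount", "region"),
    ("COUNT", "order_id", "region"),
    ("AVG", "missing_col", "region"),
]


def build_sample_db():
    """Create a tiny in-memory database standing in for sampled tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'North'), (2, 'South');
        INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0);
        """
    )
    return conn


def generate_candidates():
    """Yield (sql, spec) candidates built from relationships and templates."""
    for child, parent, key in RELATIONSHIPS:
        for agg, measure, group in TEMPLATES:
            sql = (
                f"SELECT p.{group}, {agg}(c.{measure}) AS value "
                f"FROM {child} AS c JOIN {parent} AS p ON c.{key} = p.{key} "
                f"GROUP BY p.{group}"
            )
            yield sql, {"agg": agg, "measure": measure, "group": group}


def is_valid(conn, sql):
    """Keep a query only if it executes and returns at least one row."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False
    return len(rows) > 0


def draft_question(spec):
    """Placeholder for the LLM step: a fixed template, not a model call."""
    words = {"SUM": "total", "COUNT": "number of", "AVG": "average"}
    return f"What is the {words[spec['agg']]} {spec['measure']} per {spec['group']}?"


def build_pairs():
    """Generate, validate and pair SQL queries with natural-language questions."""
    conn = build_sample_db()
    pairs = []
    for sql, spec in generate_candidates():
        if is_valid(conn, sql):
            pairs.append({"question": draft_question(spec), "sql": sql})
    conn.close()
    return pairs


if __name__ == "__main__":
    for pair in build_pairs():
        print(pair["question"])
        print("  ", pair["sql"])
```

The key design choice in this sketch is validating each query by executing it before it is paired with a question, so that only SQL that actually runs against the sampled data enters the training set.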
Why synthetic data matters
Synthetic data is more than a substitute for missing examples. It is a powerful enabler for scaling AI initiatives responsibly.
For text-to-SQL systems, synthetic data allows organizations to rapidly generate diverse, high-quality training examples while avoiding common challenges such as limited user availability, inconsistent inputs and sensitive data exposure. In our work, this approach enabled the creation of hundreds of validated (question, SQL) pairs in a short time frame, significantly improving model accuracy and reliability.
In practice, this led to greater trust in the system, driving increased adoption and confidence in data-driven decision-making across business users.
Databricks for enterprise synthetic data and AI
Throughout the process, Databricks served as more than just an execution environment. It provided the foundation for moving from experimentation to enterprise-ready delivery.
Key advantages included:
- A unified workspace that brought data, code and AI workflows together
- Collaboration features that enabled rapid prototyping and iteration across teams
- Integration capabilities that simplified access to existing tools and services
- Enterprise-grade governance and security through Databricks Unity Catalog and Secrets, ensuring controlled access to data, credentials and configurations
This work demonstrates how Databricks enables organizations to unify data engineering, analytics and generative AI on a single platform, transforming innovation from isolated experiments into scalable, repeatable solutions.
Scaling enterprise generative AI with synthetic data
As organizations around the world seek to unlock more value from their data, synthetic data generation offers a practical and scalable path forward. It helps bridge the gap between ambition and execution, enabling AI-powered solutions that are both effective and responsible.
We help clients across industries harness the power of platforms like Databricks to drive transformation, from modern data architectures to intelligent automation and advanced analytics.