Sarah Benson, CGI Federal

Sarah Benson

Consultant

As Artificial Intelligence (AI) becomes ever more sophisticated and adept, the number of use cases in both the private and public sectors continues to grow. Powerful machine learning techniques, such as deep learning, allow organizations to process and analyze large volumes of diverse data—traditional structured data, but also unstructured text, images, video and recorded speech. 

A key challenge that organizations face, however, is that most traditional machine learning approaches demand large volumes of labeled data in order to train the system. Labeled data is information that has been identified and classified before training starts. As the system repeatedly encounters this labeled data, it becomes increasingly able to identify similar data without labels. However, the number of items that must be labeled to train the system is often overwhelming. And because the performance of a supervised machine learning model depends on the quality of its labeled training data, diligent data labeling processes also involve additional steps, such as label auditing and review.

In our work at CGI, we have discovered that a significant percentage of organizations cite data labeling and data quality issues as their main obstacle in adopting artificial intelligence. A 2020 O’Reilly Media report puts that number at 19%. That estimate might be low for government organizations, which face unique restrictions in privacy and policy compliance. 

Since many government AI applications are also domain-specific, agencies may need data experts in order to assign correct and meaningful labels. This limits the number of personnel available to perform data labeling, and often eliminates the possibility of using third-party data labeling services. Moreover, agencies with high secrecy requirements or rarefied expertise can’t crowd-source labeled data from public sources.

Self-supervised learning as a solution

Self-supervised learning, a relatively new method in machine learning, seeks to make the way machines learn closer to the way that humans learn. It promises to deliver the benefits of supervised learning while reducing the time and labor expense of manually labeling massive quantities of data. The method rests on the idea that if a model can glean a general understanding of the world from its training data, then it can learn to perform a specialized task faster and more effectively in the future.

The underlying mechanism of self-supervised learning is that the model teaches itself, in a designated pre-training phase, by using some part of its training data to predict another part. Although a variety of approaches exist for this pre-training phase, they often measure the difference between the model’s prediction and the real data. Through this process, the self-supervised learning model moves closer to a true understanding of its training data.
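To make the idea of using one part of the data to predict another more concrete, here is a minimal Python sketch. It assumes a next-word prediction setup, which is only one of several possible pre-training approaches, and the function name `make_training_pairs` and the sample sentence are hypothetical, chosen for illustration.

```python
def make_training_pairs(text, context_size=3):
    """Turn raw, unlabeled text into (context, target) training pairs.

    The "label" for each example is simply the next word in the text,
    so no human labeling is needed.
    """
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])  # the part the model sees
        target = words[i]                           # the part it must predict
        pairs.append((context, target))
    return pairs


# Hypothetical sample sentence, for illustration only.
pairs = make_training_pairs("the inspector found exposed wiring in the basement")
for context, target in pairs:
    print(context, "->", target)
```

Every pair comes directly from the raw text, which is why unlabeled data that is of little use for supervised learning can still feed a pre-training phase.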

During pre-training, the difference between what the model predicts and what the real data contains also serves as a metric for tracking improvement as the model learns. Using known data sets with some information removed allows for quick comparisons to see how much the AI got right. Imagine taking a few pages of text and removing some of the words, then comparing the original text to the same pages filled in with the AI's best guesses for the missing words. Like the answer key for a test, the comparison is easy to automate and quickly indicates whether the AI is improving its accuracy over time.
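The "answer key" comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production approach: the stand-in "model" is a simple word-pair frequency counter rather than a neural network, and all names (`mask_words`, `train_bigram`, `guess`, `accuracy`) and the sample text are hypothetical.

```python
import random
from collections import Counter, defaultdict

MASK = "[MASK]"


def mask_words(words, mask_rate=0.2, seed=0):
    """Hide a random subset of words and keep the originals as an answer key."""
    rng = random.Random(seed)
    masked, answers = [], {}
    for i, word in enumerate(words):
        if i > 0 and rng.random() < mask_rate:
            masked.append(MASK)
            answers[i] = word  # position -> hidden word
        else:
            masked.append(word)
    return masked, answers


def train_bigram(words):
    """Stand-in model: count which word most often follows each word."""
    following = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1
    return following


def guess(following, masked, i):
    """Best guess for the hidden word at position i, based on the word before it."""
    candidates = following.get(masked[i - 1])
    return candidates.most_common(1)[0][0] if candidates else None


def accuracy(following, masked, answers):
    """Share of hidden words guessed correctly, checked against the answer key."""
    if not answers:
        return 0.0
    correct = sum(guess(following, masked, i) == word for i, word in answers.items())
    return correct / len(answers)


# Hypothetical sample text, for illustration only.
corpus = "the cat sat on the mat and the cat ran to the door".split()
model = train_bigram(corpus)
masked, answers = mask_words(corpus, mask_rate=0.3)
print(" ".join(masked))
print("accuracy:", accuracy(model, masked, answers))
```

Because the hidden words come from the original text, the score needs no human reviewer. Recomputing it as training proceeds shows whether the model is improving.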

Easing the burden for federal agencies

In addition to saving time and reducing program risk, self-supervised learning allows government organizations to be more effective with the data and knowledge they already possess. Raw, unlabeled data, which provides limited utility in the supervised learning scenario, can provide an effective primer in the self-supervised learning pre-training phase to improve performance and reduce training effort in a downstream task. 
Consider a few scenarios: 

  • Housing programs: Self-supervised learning could enable AI systems to identify common objects and circumstances, such as electrical hazards, in property inspection photos.
  • Military: The technique can train systems to identify structural issues or anomalies in facilities using satellite imagery or drone footage.
  • Natural language processing (NLP): The method could teach systems to extract insights from social media, flagging instances of violent or threatening language that might indicate extremism.

Wherever a government organization currently stands on its journey to AI adoption, I believe self-supervised learning offers an exciting new area of machine learning that can help meet agency missions faster by making the most of data and resources already available. For more information on CGI Federal’s Intelligent Automation capabilities, contact me or visit our Intelligent Automation page.  
 

About this author

Sarah Benson is a data scientist in CGI Federal's Emerging Technology Practice.