Mayank Khandelwal

Data Scientist

One of the most popular Azure AI offerings from Microsoft is Azure Cognitive Services. In essence, it is a family of artificial intelligence services and cognitive APIs that enables you to build intelligent applications. As the name suggests, the idea is to embed the ability to see, hear, speak, search, understand and make decisions into your services.

During the course of my career, I have implemented many Azure solutions and proofs of concept using Azure Cognitive Services. The best part about this service is that it requires minimal setup and little or no training, and it is easily accessible via an API endpoint. Let’s take a look in a bit more detail at how this works.

Azure Cognitive Services is primarily divided into five domains: Decision, Language, Speech, Vision and Web Search.

Decision service

The Decision service has four options at the time of writing this article.

The Anomaly Detector is a very handy tool for quickly identifying problems in your use-case: it ingests temporal (time-series) data and automatically fits the best-fitting detection model. In reality, this does require some level of customization. It’s surprising to note that Azure is the only major cloud provider offering this as a dedicated AI service. It is possible on other providers (on AWS, for instance, through CloudWatch Anomaly Detection), but I do not find that as intuitive. In my experience, it only works well if you know your data really well; otherwise you are bound to drift away from solving your use-case. The next service is the Metrics Advisor (in preview), which is built on Anomaly Detector. This is a very new service aimed at monitoring the performance of different aspects of your business through near-real-time monitoring, model adaptation, granular analysis with diagnostics, and alerting.
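
To give a feel for what feeding temporal data to the Anomaly Detector involves, here is a minimal Python sketch that builds (but does not send) a batch detection request. The resource URL, the key and the helper name build_detect_request are hypothetical placeholders, and the route and field names reflect my understanding of the v1.0 REST API, so check them against the current API reference before relying on them.

```python
import json
import urllib.request
from datetime import datetime, timedelta

# Hypothetical placeholders: substitute your own resource endpoint and key.
ENDPOINT = "https://example-resource.cognitiveservices.azure.com"
API_KEY = "your-subscription-key"
# Assumed v1.0 batch-detection route; verify against the API reference.
DETECT_PATH = "/anomalydetector/v1.0/timeseries/entire/detect"


def build_detect_request(values, start: datetime, sensitivity=95):
    """Build (but do not send) a batch anomaly-detection request.

    Each value is paired with a timestamp one day after the previous one,
    starting at `start`, to match the "daily" granularity declared below.
    """
    series = [
        {
            "timestamp": (start + timedelta(days=i)).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "value": value,
        }
        for i, value in enumerate(values)
    ]
    body = {"series": series, "granularity": "daily", "sensitivity": sensitivity}
    return urllib.request.Request(
        ENDPOINT + DETECT_PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": API_KEY,
        },
        method="POST",
    )
```

Sending the request with urllib.request.urlopen should return a JSON document with, as far as I know, one anomaly flag per input point. Knowing your data matters here: the granularity you declare and the sensitivity you pick largely decide what gets flagged.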

Next comes the Content Moderator, which, as the name suggests, is aimed at removing profanity from data. The best part about this tool is that you can use it with images, videos and text, with support for a lot of languages. It also features a Human Review Tool, which is handy for improving and tuning the performance for your use-case. In my experience, it works well for textual data, although some improvement is definitely required with multimedia data.

The final service I talk about is the Personalizer, which allows you to deliver a personalized, relevant experience for every user; the best part is that it requires no machine learning expertise. This service is different from traditional recommendation engines that offer options from a large catalog. It works through a continuous feedback and optimization loop using reinforcement learning. Based on how users respond to the content sent to them, it trains a shared model that is updated with every new interaction, thus improving the results for all users of the service.
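
The feedback loop described above can be illustrated with a toy example. The sketch below is a plain epsilon-greedy bandit, which is my own stand-in and not the algorithm Personalizer actually runs (the real service also takes context features into account). The class name ToyPersonalizer and its rank and reward methods are hypothetical names chosen for this illustration.

```python
import random
from collections import defaultdict


class ToyPersonalizer:
    """Epsilon-greedy stand-in for a rank/reward loop (not Azure's algorithm)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon              # share of requests spent exploring
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)      # number of rewards seen per action
        self.values = defaultdict(float)    # running mean reward per action

    def rank(self):
        """Pick an action: usually the best known one, sometimes a random one."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda action: self.values[action])

    def reward(self, action, score):
        """Fold a reward score into the shared running mean for that action."""
        self.counts[action] += 1
        self.values[action] += (score - self.values[action]) / self.counts[action]
```

Every user's reward updates the same values table, which is the sense in which one shared model improves the results for everybody.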

Language service

The Language service has five options at the time of writing this article.

The Text Analytics service is responsible for extracting key phrases, detecting sentiment and recognizing named entities in text. In my experience, this service works best with English, even though multiple languages are supported. While dealing with other languages, such as Finnish, I have found that using the Translator service to convert the text to English first has given me better results. The translation works surprisingly well for the couple of languages I tried. While chaining translation with text analytics adds some latency to the response, it can deliver better results. I foresee a lot of improvement in the Text Analytics service, because something as trivial as lower- or upper-casing of words can affect results. Definitely preprocess your data before using this service.
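
The preprocess-then-translate-then-analyze flow described above can be sketched as follows. The translate and analyze arguments are hypothetical callables that you would back with the Translator and Text Analytics services; no real SDK call is shown here, and the exact normalization that helps will depend on your data.

```python
import re
import unicodedata


def preprocess(text):
    """Normalize Unicode and whitespace, and tone down all-caps text."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Casing can change the results, so fall back to sentence case
    # when the whole text is written in capitals.
    if text.isupper():
        text = text.capitalize()
    return text


def analyze_via_english(text, source_lang, translate, analyze):
    """Clean the text, translate it to English if needed, then analyze it.

    `translate(text, source, target)` and `analyze(text)` are placeholders
    for whatever client code wraps the actual services.
    """
    cleaned = preprocess(text)
    if source_lang != "en":
        cleaned = translate(cleaned, source_lang, "en")
    return analyze(cleaned)
```

Passing the two services in as callables keeps the pipeline easy to test with stubs, and makes it simple to measure how much latency the extra translation hop adds.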

Next is the Immersive Reader which, as I prefer to put it, allows you to build accessibility into your service. It provides functionality such as reading text aloud, translating between languages and focusing attention through highlighting. As with Anomaly Detector, Azure is the only major cloud provider offering this type of reading technology.

QnA Maker lets you create a conversational Q&A layer over your existing data. It builds a knowledge base by extracting Q&As from semi-structured content, which it then uses to respond to the user’s questions, and it continually learns from user input.
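
As a rough illustration of the knowledge-base idea, the toy sketch below extracts question-answer pairs from "Q:" / "A:" formatted text and answers a new question by word overlap. This is my own simplification and not how QnA Maker works internally; the function names and the Jaccard-overlap matching are assumptions made for the example, and the learning-from-user-input part is left out.

```python
import re


def extract_qa_pairs(text):
    """Collect (question, answer) pairs from lines starting with 'Q:' and 'A:'."""
    pairs, question = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs


def tokens(sentence):
    """Lower-case a sentence and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def answer(question, pairs, threshold=0.2):
    """Return the answer whose stored question overlaps most with the input.

    Overlap is the Jaccard similarity of the two word sets; None is returned
    when nothing reaches the threshold.
    """
    asked = tokens(question)
    best_score, best_answer = 0.0, None
    for known_question, known_answer in pairs:
        known = tokens(known_question)
        union = asked | known
        score = len(asked & known) / len(union) if union else 0.0
        if score > best_score:
            best_score, best_answer = score, known_answer
    return best_answer if best_score >= threshold else None
```

The threshold is what lets the bot admit it does not know instead of returning the least bad match.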

The Language Understanding service allows you to build custom language models which can interpret the goals of the user and extract key information from conversational phrases. In my experience, while you do not need machine learning expertise, you do need some coding expertise. The documentation is quite decent, but things can get complicated really quickly and you need a lot of patience to build this. The best way to leverage this service is to go through the documentation thoroughly. In my opinion, this is the least intuitive service from a development perspective.

Speech service

The Speech service has four options at the time of writing this article.

The Speech to Text service, as the name suggests, converts speech to text. The best part about this service is that you can build customized models for your domain use-case, for example by adding words to the base vocabulary. It was a pleasant surprise to see how well this works with many languages, though I recommend testing it with the language you wish to use it for.

The Text to Speech service, again as the name suggests, converts text to speech. While the output does not sound as natural for languages other than English as it does with Google's services, it does a pretty good job with many languages. The emotional expressiveness was slightly sub-par during narration, but overall it is a good pick given the minimal amount of effort needed to set it up.

The Speech Translation service converts speech to text in your preferred language. As with the Speech to Text service, you can build customized models for your domain use-case, and the limitations noted for the Text to Speech service apply here as well.

The Speaker Recognition service (in preview) is a new service from Microsoft Azure which can verify and identify speakers by their unique voice characteristics, even within a group.

Vision service

The Vision service has five options at the time of writing this article.

The Computer Vision service helps with image labelling, image description generation, content moderation, (real-time) video analysis, automatic text extraction (OCR) and spatial analysis, amongst other features. This service works more like a black box and does a decent job with its predictions. However, if you want to train a model on your own dataset, you can use the Custom Vision service. With Custom Vision, you can add your own images and annotations using a GUI, and use a specialized or generic domain as the baseline for training. One wonderful feature of this service is the ability to export the model in a variety of formats which can be used directly in many kinds of deployments. Keep in mind that the Custom Vision service is meant for images only.

The Facial Recognition service, as the name suggests, lets you add facial recognition to your app or service without machine learning expertise. It can extract faces and facial features from an image, match and group individuals in a private repository, and detect emotion.

The Form Recognizer service can extract text, key-value pairs, tables and structures from documents. Provide as many samples as you can to tune this service according to your needs. In my experience, this service has a lot of scope for improvement and there are better alternatives in the market at the time of writing.

The Video Indexer service extracts metadata such as speech, text, faces, speakers, emotions and popular content from video as well as audio files. Models can be customized, trained and tuned in order to improve accuracy.

Web search service

The Web Search service consists of the Bing Web Search APIs, which are being moved to a new surface area and are thus not covered in detail in this article. The basic idea of this service is to enable safe, ad-free, location-aware search results, surfacing relevant information from billions of web documents.


There’s a lot of potential in Azure Cognitive Services, and I believe Microsoft is on the right track in building these services with industrial use-cases in mind. That’s all for this article. I hope you enjoyed reading it and gained some insights into Azure Cognitive Services. If you have any questions or wish to have a discussion, contact me.