A Detailed Guide on Automatic Multilingual Transcription Services

Automatic multilingual transcription services are used when an audio or video file contains two or more languages. Here's how VIDIZMO does it.
by Muhammad Nabeel Ali Updated on February 22, 2022

If you have found your way to this page, you probably already know something about transcription. In simple terms, it is the process of converting speech into text. That description covers monolingual transcription well enough, but bilingual and multilingual transcription are more complex.

If one language is spoken throughout the video, transcribing it is called monolingual transcription. When multiple languages are spoken in a single video, it is called multilingual transcription. And here’s where things get complicated.

In this blog, we will look at why we need multilingual transcription services, what challenges they face, and how we can make their output more accurate. Finally, we will look at the leading providers of automatic transcription services today and introduce our platform, VIDIZMO, which offers multilingual transcription.

Why Do We Need Automatic Multilingual Transcription?


Let us suppose you are an English speaker employed by a law enforcement agency to handle digital audio and video evidence. You are provided with a call recording of a person speaking a foreign language; let us assume it is Spanish and the person is from Barcelona.

Having no knowledge of the language, you may not even be able to tell which language is being spoken. However, built-in transcription services can be deployed here: they detect the language automatically and provide a transcript, which can later be translated into a language of your preference, such as English.

But not all transcription services can work with multilingual media. If the speaker switches between Spanish and Catalan during the conversation, a monolingual transcription service would have trouble detecting the second language and providing an accurate transcript.

This is where multilingual transcription services come into play.

More than half of the world's population uses two or more languages (or dialects) in everyday life.

In Europe, a bit more than half of the population is at least bilingual.

In North America, some 35% of the population in Canada is bilingual and although the percentage is smaller in the United States - close to 23% in 2018 - this still corresponds to an estimated 70 million inhabitants.

Source: Francois Grosjean Ph.D., Author “Life as a Bilingual”

YouTube, a popular public video hosting service, provides autogenerated closed captions (CC) in several videos by transcribing their audio. Consider the video below:


As seen in the video above, and several other videos, when speakers switch back and forth between languages, the transcription service fails to keep up and produce good output. Often the language is not detected automatically, and even when it is, the transcription is not accurate.

Therefore, an alternative solution is required that can automatically transcribe videos containing multiple languages.


The Challenges Faced By Multilingual Transcription Services

Multilingual transcription may be the solution to the problem at hand, but it is no piece of cake. Several hurdles stand in the way of an accurate transcription. Let us look at these challenges so that we can better understand how to improve our output.

Audio Clarity

We must understand that the audio we work with will often not have been recorded in ideal settings. Lack of clarity in the audio leads transcription services to make errors, such as adding gibberish and garbage terms, which reduces the accuracy of the transcription.

We work with law enforcement agencies and often help automatically transcribe bodycam footage, 911 call recordings, interrogation recordings, and similar material. A common issue with these recordings is that their audio lacks clarity.

Background Noise

The audio input is often affected by background noise. This includes static, airflow and resistance noises, ambient noise such as traffic, birds, indecipherable conversations in the surrounding areas, etc.

Our solution is to remove background noise from the audio before we transcribe it. This is generally done by applying a high-pass and a low-pass filter (together, a band-pass filter) that attenuate any audio outside the human vocal frequency range.
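Filtering audio down to the vocal range can be sketched as a band-pass filter, meaning a high-pass stage followed by a low-pass stage. The following is a minimal illustration in pure Python, not VIDIZMO's actual pipeline. The function names are hypothetical, and the cutoffs of 300 Hz and 3400 Hz (the traditional telephone voice band) are an assumption chosen for the example.

```python
import math


def low_pass(samples, cutoff_hz, sample_rate):
    """First-order (RC) low-pass filter: attenuates content above cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out, prev = [], 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out


def high_pass(samples, cutoff_hz, sample_rate):
    """First-order (RC) high-pass filter: attenuates content below cutoff_hz."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in samples:
        prev_y = alpha * (prev_y + x - prev_x)
        prev_x = x
        out.append(prev_y)
    return out


def band_pass(samples, low_cut_hz=300.0, high_cut_hz=3400.0, sample_rate=16000):
    """Keep roughly the assumed voice band by chaining both filters."""
    return low_pass(high_pass(samples, low_cut_hz, sample_rate),
                    high_cut_hz, sample_rate)


def rms(samples):
    """Root-mean-square level, used here to compare signal strength."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tone(freq_hz, sample_rate=16000, seconds=0.5):
    """Generate a pure sine tone for testing the filter."""
    n = int(sample_rate * seconds)
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# A 50 Hz hum and a 7 kHz hiss both come out quieter than a 1 kHz voice-band tone.
hum, voice, hiss = (rms(band_pass(tone(f))) for f in (50, 1000, 7000))
```

First-order filters roll off gently, so out-of-band noise is reduced rather than removed; a production system would likely use steeper filters or dedicated noise suppression.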

Audio Quality

If transcription is done on an uncompressed, high-quality audio file (such as 90 kbps and above), the accuracy and quality of the transcription are excellent.

However, audio files are normally compressed to be transmitted over the internet. Compression of this kind is usually lossy: it discards part of the signal, and decoding the file for transcription cannot restore what was discarded. The result is a significant loss in quality.

As a result, the transcription is not as good as it would be on the original, high-quality audio. This is often the case when transcribing from sources such as MS Teams recordings.

Conversation Speed, Accent, Voice Pitch And Other Factors


Several elements create variation in audio. Two people could be saying the same thing, but their accent, pitch, and speaking speed, along with various other phonetic factors, can all affect the accuracy of the transcription.

Depending on the data used to train the AI (Artificial Intelligence) behind the transcription service, accuracy can differ for the same words spoken by people from different states, or by individuals of different ages, genders, racial backgrounds, etc.


In automatic transcription services, the AI is usually trained using NLP (Natural Language Processing) techniques, in which it uses a whole sentence to work out context and other factors.

However, in everyday conversations, a sentence is often left incomplete because someone else interrupts. This crosstalk is often difficult for transcription AI to decipher, as it may not have been trained to handle it.

Custom Terms

Many terms and pieces of jargon are not considered standard or in regular use in a language. These include slang, abbreviations, industry terms, scientific names, and chemical nomenclature: terminology on which most AI-based transcription algorithms are not trained.

Codec Issues

Each codec provides a different level of accuracy and quality for transcription, with some providing much higher-quality output than others. Certain organizations, however, use old or proprietary codecs that the transcription software does not recognize. The media must then be converted to a widely supported codec, which can reduce its quality and, with it, the accuracy of the transcription.

Using separate tools to convert the codec and then transmit the file for transcription can also be time-consuming. A better option is a complete solution that both converts codecs and transcribes audio files; we’ll get to that in a bit.
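As an illustration of the conversion step, here is a minimal sketch that builds an ffmpeg command to turn a recording into mono 16-bit PCM WAV. It assumes ffmpeg is installed and can decode the source codec; the function names are hypothetical, and the 16 kHz mono target is an assumption for the example rather than a requirement stated by any particular transcription service.

```python
import subprocess


def build_ffmpeg_command(source_path, target_path, sample_rate=16000):
    """Build an ffmpeg command that re-encodes audio as mono 16-bit PCM."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it already exists
        "-i", source_path,        # input file in its original codec
        "-vn",                    # ignore any video stream
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the target sample rate
        "-c:a", "pcm_s16le",      # encode as 16-bit little-endian linear PCM
        target_path,
    ]


def convert_for_transcription(source_path, target_path):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(source_path, target_path), check=True)


# Example file names are placeholders.
command = build_ffmpeg_command("call_recording.amr", "call_recording.wav")
```

Converting to linear PCM avoids a second round of lossy encoding, although it cannot restore detail that the original codec already discarded.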

How Does Multilingual Transcription Work?

Multilingual Transcription Infographic

Multilingual transcription works using an audio or video input, which is broken down into smaller elements. These small elements are compared to existing data to determine what language they are from, after which the audio is mapped into words and these small, transcribed elements are merged into one whole transcript. In other words, the transcription software will detect language from video or audio and convert it to its text equivalent.

If we dig deeper into the process, we will see that, like other neural network-based artificial intelligence programs, natural language processing works by statistical comparison with existing data. When audio is provided as input (or extracted from a video), the automatic transcriber breaks it down into segments of equal length. There is no standard segment length; it varies between services. Each segment is analyzed separately by comparing it to existing data sets.

Phonetics comes into play here as the audio is mapped into words and matched against existing data sets. The input audio is classified as whichever language the data shows the highest probability for. As this shows, the data sets play a key role in determining the accuracy of the output. Transcription of lesser-known languages is therefore often difficult, because too little sample data is available for machine learning.
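The segment, classify, and merge flow described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any specific service is implemented: the per-language probabilities and candidate texts are made-up inputs standing in for a real acoustic and language model, and all function names are hypothetical.

```python
def split_into_segments(samples, segment_length):
    """Break the input into fixed-length segments (the last one may be shorter)."""
    return [samples[i:i + segment_length]
            for i in range(0, len(samples), segment_length)]


def most_likely_language(scores):
    """Pick the language with the highest probability for one segment."""
    return max(scores, key=scores.get)


def merge_transcript(segments):
    """Classify each segment, then merge neighbours that share a language.

    Each segment is a dict with hypothetical model output:
      "scores": {language: probability}
      "text":   {language: transcribed text under that language}
    """
    transcript = []
    for segment in segments:
        language = most_likely_language(segment["scores"])
        text = segment["text"][language]
        if transcript and transcript[-1][0] == language:
            # Same language as the previous segment: extend that entry.
            transcript[-1] = (language, transcript[-1][1] + " " + text)
        else:
            transcript.append((language, text))
    return transcript


# Made-up model output for three consecutive segments of one recording.
example_segments = [
    {"scores": {"es": 0.90, "ca": 0.10}, "text": {"es": "hola", "ca": "hola"}},
    {"scores": {"es": 0.80, "ca": 0.20}, "text": {"es": "gracias", "ca": "gracies"}},
    {"scores": {"es": 0.30, "ca": 0.70}, "text": {"es": "buen dia", "ca": "bon dia"}},
]
result = merge_transcript(example_segments)
```

With this toy input, the first two segments are merged into one Spanish entry and the third becomes a separate Catalan entry, which mirrors how the segment-level results are combined into one transcript.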

Multilingual Transcription Infographic

Source: “A Benchmarking on Cloud based Speech-To-Text Services for French Speech and Background Noise Effect” by Xu et al.

The Best Automatic Multilingual Transcription Services

There are several human-generated multilingual transcription services to choose from, but only a select few provide automatic multilingual transcription services powered by AI.

But how do you decide which one is the best?

In this section, we compare the word error rate (WER) of certain leading providers of the service on both ideal and noisy audio, based on the evaluation by Xu et al. in the paper “A Benchmarking on Cloud based Speech-To-Text Services for French Speech and Background Noise Effect” (May 2021).
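For readers unfamiliar with the metric, WER is conventionally defined as (S + D + I) / N, where S, D, and I are the word substitutions, deletions, and insertions needed to turn the service's output into the reference transcript, and N is the number of words in the reference. Lower is better. The sketch below computes it with a standard word-level edit distance; it is an illustrative implementation, and the paper's exact text normalization (casing, punctuation) is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (substitutions + deletions + insertions) / reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# One substitution and one deletion against a four-word reference.
example_wer = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.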

We have summarized their results below:

| Service Name | WER for clean audio | WER for noisy audio |
| --- | --- | --- |
| Microsoft Azure Cognitive Services | | |
| Amazon Transcribe (AWS) | | |
| Google Cloud Speech-to-Text | | |
| IBM Watson Speech-to-Text | | |


Automatic Multilingual Transcription Using VIDIZMO

VIDIZMO offers a Gartner-recognized video content management system and digital evidence management system with multilingual transcription through the integration of the best option the market has to offer, Azure Cognitive Services.

Our multilingual transcription service will work with up to four languages in a single video, transcribing its audio into a single file for your ease.

Multilingual Speech-To-Text transcription is not the only service that VIDIZMO deploys through its Azure Cognitive Services AI Integration. These are some of the features that VIDIZMO offers in all its products.

  • Automatic Closed Captions (CC) can be generated in up to 39 languages to meet both compliance and accessibility needs for your video.
  • Translations are offered in up to 52 different languages for both closed captions and transcriptions in your video file.
  • Automatic Tagging of elements detected by AI, such as faces, brands, objects, and topics, to optimize the video for better search and filtering.
  • Optical Character Recognition (OCR) to allow text appearing in a video to be searchable as well.
  • With our Redaction Tool, redact faces, license plates and other critical information to meet various privacy and compliance requirements, especially for digital evidence management.
  • VIDIZMO also provides custom-designed AI models, catered to meet your specific needs.

If your organization is looking to implement a multilingual transcription solution for video or audio files, feel free to contact us or visit our site for more information on our products.



For Additional Reading

  1. Xu, B., Tao, C., Feng, Z., Raqui, Y., & Ranwez, S. (2021). A Benchmarking on Cloud based Speech-To-Text Services for French Speech and Background Noise Effect. arXiv preprint arXiv:2105.03409.

  2. Chakravarthi, B. R., Priyadharshini, R., Stearns, B., Jayapal, A. K., Sridevy, S., Arcan, M., ... & McCrae, J. P. (2019, August). Multilingual multimodal machine translation for Dravidian languages utilizing phonetic transcription. In Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages (pp. 56-63).

  3. Grosjean, F. (2021). Life as a Bilingual: Knowing and Using Two or More Languages. Cambridge: Cambridge University Press. doi:10.1017/9781108975490


Posted by Muhammad Nabeel Ali

Nabeel is an Associate Product Marketing Manager at VIDIZMO and an expert in evidence management technologies. He actively researches innovative trends in this domain, such as Artificial Intelligence. For any queries, feel free to reach out to websales@vidizmo.com
