NHR@FAU helps to make FAU’s video portal more accessible
Background
Since the beginning of this century, FAU has been publishing recordings of lectures and exercises, contributions to science slams, presentations from the Lange Nacht der Wissenschaft, and many more clips on its own video portal FAU.tv. To date, more than 40,000 clips have been recorded, amounting to more than 40,000 hours of audio/video; currently, more than 80 recordings are added per week. However, this large source of knowledge has not been very accessible so far, neither for hearing-impaired people and non-native speakers nor for full-text search, because automatic captions were of low quality.
Over the last decade, several attempts have already been made to add subtitles to selected recordings, but with limited success. Manual transcription is expensive, usually requires domain-specific knowledge, and takes 2–3 times the media length. Moreover, previous AI-based software produced transcripts with too many errors to be helpful.
In fall 2022, OpenAI released its new automatic speech recognition (ASR) system called “Whisper,” which was pre-trained on 680,000 hours of multilingual and multitask supervised data. The pre-trained models “medium” and “large” use 769 million and 1.55 billion parameters, respectively. The word error rate (WER) of this new and robust AI-based software is significantly lower than that of previous approaches, dropping below 5% for clean English input, which is comparable to the rate expected from manual transcription.
Analysis
Whisper uses a simple end-to-end approach, implemented as an encoder-decoder transformer. Input audio is decoded with FFmpeg, split into 30-second chunks, converted into a log-Mel spectrogram, and then passed to the encoder. The decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, and multilingual speech transcription.
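The individual steps of this pipeline are exposed in Whisper’s Python API. The following minimal sketch follows the usage example from the Whisper README; the file name “lecture.mp4” is only a placeholder, not one of the FAU.tv recordings:

```python
import whisper

# Sketch of Whisper's per-chunk pipeline, following the README example.
model = whisper.load_model("medium")

# decode the audio via FFmpeg and pad/trim it to a 30-second chunk
audio = whisper.load_audio("lecture.mp4")   # placeholder file name
audio = whisper.pad_or_trim(audio)

# compute the log-Mel spectrogram that is fed into the encoder
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# language identification via the special tokens mentioned above
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# let the decoder predict the caption for this chunk
result = whisper.decode(model, mel, whisper.DecodingOptions())
print(result.text)
```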
Whisper uses PyTorch and can be easily installed locally from GitHub using pip, or tested directly in the cloud via Hugging Face Transformers. The Whisper automatic speech recognition model can be executed on CPUs as well as GPUs.
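As a rough illustration (assuming Whisper was installed from GitHub as documented in its README and FFmpeg is available on the system; “lecture.mp4” is again a placeholder), transcribing a whole recording takes only a few lines:

```python
# Hedged example; install step as documented in the Whisper README:
#   pip install git+https://github.com/openai/whisper.git
import torch
import whisper

# run on a GPU if available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

# transcribe a whole recording; "lecture.mp4" is a placeholder file name
result = model.transcribe("lecture.mp4")
print(result["language"])   # automatically detected language
print(result["text"])       # full transcript
```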
Optimization
NHR@FAU conducted rigorous benchmarking before starting to work on the FAU.tv repository:
- For a single input file, execution on CPU-only nodes is not competitive with a GPU.
- If multiple input files have to be processed, 2–4 OpenMP threads are most efficient overall.
- On modern GPUs like NVIDIA A40 or A100, a single input file is not sufficient to keep all units busy.
- Naively starting multiple instances on one GPU is counterproductive.
- Using the NVIDIA Multi-Process Service (MPS) is much better (see the sketch after this list).
- A version of Whisper which supports “batching” (https://github.com/Blair-Johnson/batch-whisper) is the most efficient; however, it currently still suffers from stability issues in certain cases (https://github.com/Blair-Johnson/batch-whisper/issues/6 and https://github.com/Blair-Johnson/batch-whisper/issues/9).
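To give an idea of how the MPS approach can look in practice, here is a minimal sketch. It assumes the MPS daemon has already been started on the node; the worker count, thread count, and file names are illustrative rather than the exact production setup used for FAU.tv:

```python
# Minimal sketch of the MPS approach (not the exact production setup).
# Assumes the MPS daemon was started on the node beforehand, e.g. with:
#   nvidia-cuda-mps-control -d
import os
from concurrent.futures import ProcessPoolExecutor

os.environ.setdefault("OMP_NUM_THREADS", "4")  # 2-4 OpenMP threads per instance

_model = None

def init_worker():
    # Each worker process loads its own copy of the model; with MPS active,
    # the kernels of all workers share the same GPU without the overhead of
    # naively oversubscribing it.
    global _model
    import whisper
    _model = whisper.load_model("medium", device="cuda")

def transcribe(path):
    return path, _model.transcribe(path)["text"]

if __name__ == "__main__":
    files = ["lecture01.mp4", "lecture02.mp4", "lecture03.mp4"]  # placeholder names
    with ProcessPoolExecutor(max_workers=3, initializer=init_worker) as pool:
        for path, text in pool.map(transcribe, files):
            print(path, text[:80])
```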
Surprisingly, there is only a minor performance difference between NVIDIA A40 and A100 GPUs. Using old GPUs like the NVIDIA GTX 1080/1080 Ti is not energy-efficient, and current consumer-grade GPUs like the RTX 3080 are also less efficient because their smaller VRAM limits the number of files that can be processed concurrently in the MPS approach.
Using a small fraction of NHR@FAU’s GPGPU cluster Alex, all 40,000 media files with more than 40,000 hours of audio/video were processed in an energy-efficient way in about one day. A total of only about 2,500 GPU-hours was sufficient to transcribe all these recordings with the “medium” Whisper model. Thus, the overall execution was more than 15 times faster than real time (more than 40,000 hours of material in about 2,500 GPU-hours).
If transcription had to be done manually, each FAU student (about 38,000 students in total) would have to work for more than 3 hours in a “crowd-sourcing” effort to cover the roughly 120,000 hours of work (about three times the media length) needed to process all video and audio files currently available on FAU.tv. In the end, the quality of the results would probably be similar, because many students would have worked on recordings unrelated to their area of expertise, and the audio sometimes lacks sufficient quality even for human ears. Furthermore, common desktop PCs stand no chance against supercomputers: had the transcription been done on a desktop PC, it would have taken longer than half a year.
Overall, the energy consumption for transcribing all existing FAU.tv media files in nearly 2,500 GPU-hours on NHR@FAU’s energy-efficient Alex cluster sums up to around 1,000 kWh, including cooling of the hardware. In terms of electricity costs, this corresponds roughly to the salary of a student assistant working full time for less than a week, who would only manage to transcribe 10–15 media files in that time.
Summary
Some transcripts have been verified manually, and the overall quality is very promising. It remains unclear whether quality improvements can be gained from the “large-v2” model compared to the “medium” model for both German and English recordings. The recording language is automatically detected by Whisper: about 30,000 recordings were recognized as English, about 10,000 as German, and a few hundred as other languages, the latter probably being misdetections. If a lecturer switched from English to German within a recording, Whisper continued with English and automatically translated the German audio into English subtitles. Surprisingly, this automatic translation is remarkably good.
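To illustrate how subtitles can be derived from Whisper’s output (a sketch only, not the exact post-processing used for FAU.tv), the timestamped segments returned by model.transcribe() can be written out as a WebVTT file:

```python
import whisper

def to_timestamp(seconds: float) -> str:
    # format seconds as HH:MM:SS.mmm, as required by WebVTT
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

model = whisper.load_model("medium", device="cuda")
result = model.transcribe("lecture.mp4")   # placeholder file name
print(f"Detected language: {result['language']}")

# each segment carries start/end times and the recognized text
with open("lecture.vtt", "w", encoding="utf-8") as vtt:
    vtt.write("WEBVTT\n\n")
    for seg in result["segments"]:
        vtt.write(f"{to_timestamp(seg['start'])} --> {to_timestamp(seg['end'])}\n")
        vtt.write(seg["text"].strip() + "\n\n")
```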
The automatically generated transcripts will soon become available on FAU.tv. In the long term, a full-text search capability based on these transcripts is to be added to FAU’s video portal. This will take FAU a big step forward in making its video portal content more accessible.