HPC User Report from P. Uhrig (Chair of English Linguistics)
Multimodal Corpus Linguistics
While linguistic annotation of textual data is mainstream in linguistic research, creating such datasets is expensive. For small collections, manual annotation is still an option, but for large quantities of text (i.e. billions of words) only automatic annotation is feasible. Even traditional textual analysis is computationally expensive enough to necessitate HPC resources. When audio-visual data comes into the picture, the required CPU time multiplies.
Motivation and problem definition
The aim of this research is to make collections of texts and audiovisual data searchable and to analyze them with automatic methods. To achieve this, forced alignment of transcripts and audio tracks is performed, followed by image analysis (currently mainly hand movement, head movement and facial expressions). In addition, a full set of linguistic annotations is run on the data, e.g. PoS tagging, lemmatization and dependency parsing.
For monomodal text data, the linguistic annotation and an analysis of co-occurrence frequencies are run on the HPC systems.
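As an illustration of what an analysis of co-occurrence frequencies involves, the following minimal sketch counts adjacent word pairs in a toy corpus and scores one pair with pointwise mutual information. The toy sentences, the window size, the function names and the choice of PMI as the association measure are all assumptions made for this example. They do not describe the code actually run on the HPC systems.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=1):
    """Count word frequencies and unordered co-occurrences within `window` tokens."""
    word_freq = Counter()
    pair_freq = Counter()
    for tokens in sentences:
        word_freq.update(tokens)
        for i, left in enumerate(tokens):
            # Pair each token with the next `window` tokens to its right.
            for right in tokens[i + 1 : i + 1 + window]:
                pair_freq[tuple(sorted((left, right)))] += 1
    return word_freq, pair_freq


def pmi(pair, word_freq, pair_freq):
    """Pointwise mutual information (base 2), estimated from relative frequencies."""
    total_words = sum(word_freq.values())
    total_pairs = sum(pair_freq.values())
    p_pair = pair_freq[pair] / total_pairs
    p_first = word_freq[pair[0]] / total_words
    p_second = word_freq[pair[1]] / total_words
    return math.log2(p_pair / (p_first * p_second))


# Hypothetical toy corpus; real input would be billions of annotated words.
sentences = [
    ["strong", "tea", "is", "nice"],
    ["strong", "tea", "is", "hot"],
    ["the", "tea", "is", "cold"],
]
word_freq, pair_freq = cooccurrence_counts(sentences, window=1)
score = pmi(("strong", "tea"), word_freq, pair_freq)
```

Each sentence is processed independently of the others, so counts from separate chunks of a corpus can be computed in separate jobs and summed afterwards.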
Methods and codes
The software used for Natural Language Processing (Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/) is mostly off-the-shelf; some features depend on our own code.
For the forced alignment, gentle (https://lowerquality.com/gentle/) is currently used for English.
Gesture recognition is performed by a piece of software developed in the context of the Distributed Little Red Hen Lab (http://redhenlab.org) at Case Western Reserve University by Sergiy Turchyn, based on OpenCV (https://opencv.org).
Some of these tools can make use of multiple CPUs and/or GPUs, but the vast majority of the code is not parallel and thus relies only on throughput computing, i.e. on running many independent serial jobs side by side.
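Throughput computing of this kind amounts to splitting the input files among independent serial tasks. The sketch below shows one simple way to do this, a round-robin split. The file names, the number of tasks and the function name are hypothetical, and how a task learns its own number (for example from a scheduler's job-array index) is an assumption that the report does not specify.

```python
def files_for_task(files, task_id, n_tasks):
    """Return the share of `files` that one independent serial task processes."""
    if not 0 <= task_id < n_tasks:
        raise ValueError("task_id must be in range(n_tasks)")
    # Round-robin over the sorted list so that all tasks get a similar number of files.
    return sorted(files)[task_id::n_tasks]


# Hypothetical input: ten transcript files split among three serial tasks.
files = [f"recording_{i:03d}.txt" for i in range(10)]
shares = [files_for_task(files, task_id, 3) for task_id in range(3)]
```

Because no task depends on the output of another, the tasks can start in any order and on any free core.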
The automatically annotated data enables users of our databases to find relevant data for their research projects, e.g. find abstract grammatical structures, find the exact locations of certain expressions in the video recordings, etc. The annotations are made available via a customized web-based search interface.
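Finding the exact location of an expression in a recording can be pictured as a lookup in the word-level timestamps that forced alignment produces. The following sketch assumes a simplified record layout (one dictionary per word with `word`, `start` and `end` in seconds). This layout and the example values are made up for illustration and are not the actual output format of gentle or of our search interface.

```python
def find_expression(aligned_words, expression):
    """Return (start, end) times in seconds for each occurrence of `expression`."""
    target = [word.lower() for word in expression.split()]
    if not target:
        return []
    words = [entry["word"].lower() for entry in aligned_words]
    hits = []
    for i in range(len(words) - len(target) + 1):
        if words[i : i + len(target)] == target:
            last = aligned_words[i + len(target) - 1]
            # The expression spans from the start of its first word
            # to the end of its last word.
            hits.append((aligned_words[i]["start"], last["end"]))
    return hits


# Hypothetical alignment result for a short stretch of speech.
aligned = [
    {"word": "We", "start": 11.50, "end": 11.62},
    {"word": "watched", "start": 11.62, "end": 11.98},
    {"word": "it", "start": 11.98, "end": 12.10},
    {"word": "from", "start": 12.10, "end": 12.31},
    {"word": "beginning", "start": 12.31, "end": 12.80},
    {"word": "to", "start": 12.80, "end": 12.95},
    {"word": "end", "start": 12.95, "end": 13.42},
]
hits = find_expression(aligned, "from beginning to end")
```

The returned time spans can then be used to cut the matching video segments for gesture analysis.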
For instance, a study of clausal subjects in English, which are nearly impossible to find without syntactic annotation, was carried out on a much larger scale than would otherwise have been possible. In multimodal research, the data is used to find gestures associated with certain constructions in a semi-automatic approach. To what extent a fully automatic approach can be used for different research questions is currently being investigated in an ongoing research project.
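The reason syntactic annotation makes clausal subjects findable is that a dependency parse labels them explicitly, whereas a plain word search has nothing to match on. The sketch below uses a hand-written parse in the style of Universal Dependencies, where a clausal subject carries the relation `csubj`. The sentence, the tuple layout and the function names are assumptions made for this example, not the query mechanism of our databases.

```python
# Each token: (index, word form, index of head, dependency relation); head 0 marks the root.
sentence = [
    (1, "Reading", 4, "csubj"),
    (2, "the", 3, "det"),
    (3, "manual", 1, "obj"),
    (4, "helps", 0, "root"),
    (5, "everyone", 4, "obj"),
]


def subtree(sentence, root_index):
    """Collect the indices of `root_index` and of all tokens that depend on it."""
    indices = {root_index}
    changed = True
    while changed:
        changed = False
        for index, _, head, _ in sentence:
            if head in indices and index not in indices:
                indices.add(index)
                changed = True
    return sorted(indices)


def clausal_subjects(sentence):
    """Return (clause text, governing word) for every token labelled `csubj`."""
    forms = {index: form for index, form, _, _ in sentence}
    results = []
    for index, _, head, relation in sentence:
        if relation == "csubj":
            clause = " ".join(forms[i] for i in subtree(sentence, index))
            results.append((clause, forms[head]))
    return results


matches = clausal_subjects(sentence)
```

Here `matches` contains the clause "Reading the manual" together with its governing verb "helps".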
The following publications/talks report on the infrastructure or present results generated with RRZE’s HPC facilities:
- Peter Uhrig (2018): Subjects in English [revised PhD thesis; will be published in spring 2018 in the series Trends in Linguistics. Studies and Monographs with De Gruyter Mouton]
- Stefan Evert/Peter Uhrig/Sabine Bartsch/Thomas Proisl (2017): “E-VIEW-alation – a large-scale evaluation study of association measures for collocation identification.” In Electronic lexicography in the 21st century. Proceedings of the eLex 2017 conference, Leiden, The Netherlands.
- Peter Uhrig/Thomas Proisl (2012): “Less hay, more needles – using dependency-annotated corpora to provide lexicographers with more accurate lists of collocation candidates.” Lexicographica 28.
- Thomas Proisl/Peter Uhrig (2012): “Efficient Dependency Graph Matching with the IMS Open Corpus Workbench.” LREC 2012, Istanbul.
- Peter Uhrig (2017): Texts – Sounds – Images: Multimodal Corpus Linguistics. LMU München.
- Peter Uhrig (2017): Researching co-speech gesture in NewsScape – an integrated workflow for retrieval, annotation, and analysis. International Conference on Multimodal Communication: Developing New Theories and Methods, Osnabrück. [Plenary Workshop on Methods]
- Peter Uhrig (2017): Demo on multimodal data extraction and annotation. Time concepts and their expression: creativity, cognition, communication: CREATIME workshop, Pamplona, Spain.
- Peter Uhrig/Thomas Proisl (2012): Sprachstrukturen effizient speichern, verarbeiten und abfragen: Das Erlanger Treebank.info-Projekt. Vortragsreihe Digital Humanities Erlangen. [Repeated for the general public: Lange Nacht der Wissenschaften 2013, Erlangen.]
- Peter Uhrig (2012): A fast and user-friendly interface for large treebanks. Universität Trier.
- Peter Uhrig/Thomas Proisl (2011): Treebank.info – Ein System zur Abfrage syntaktisch annotierter Korpora. Otto-Friedrich-Universität Bamberg.
Further Talks and Conference Papers:
- Stefan Evert/Peter Uhrig/Sabine Bartsch/Thomas Proisl: E-VIEW-alation – a large-scale evaluation study of association measures for collocation identification. Electronic Lexicography in the 21st century: Lexicography from scratch. Leiden (Netherlands).
- Peter Uhrig (2017): NewsScape and the Distributed Little Red Hen Lab – A digital infrastructure for the large-scale analysis of TV broadcasts. Anglistentag 2017, Regensburg.
- Peter Uhrig (2017): Gesture and Argument Structure – gesture as evidence for item-specific and general knowledge. 14th International Cognitive Linguistics Conference, Tartu (Estonia).
- Peter Uhrig (2017): A corpus infrastructure for accessing multimodal data: NewsScape and the Distributed Little Red Hen Lab. ICAME 38, Prague.
- Sabine Bartsch/Stefan Evert/Thomas Proisl/Peter Uhrig (2015): (Association) measure for measure: Comparing collocation dictionaries with co-occurrence data for a better understanding of the notion of collocation. ICAME 36, Trier.
- Thomas Proisl/Peter Uhrig (2012): Using Dependency-Annotated Corpora to Improve Collocation Extraction. ICAME 33, Leuven.
- Peter Uhrig/Thomas Proisl (2012): Geparste Korpora für alle! Pre-Conference Workshop at the GAL-Kongress 2012, Erlangen.
- Peter Uhrig/Thomas Proisl (2011): A fast and user-friendly interface for large treebanks. Corpus Linguistics 2011, Birmingham.
- Thomas Proisl/Peter Uhrig (2011): Verbesserung der Kollokationsextraktion durch Verwendung dependenzannotierter Korpora. GAL Sektionentagung 2011, Bayreuth.
- Peter Uhrig (2011): Als die Sprachwissenschaft fast zu einer Naturwissenschaft wurde: Wie der Computer die Sprachforschung revolutioniert hat. Lange Nacht der Wissenschaften, Erlangen.
- Peter Uhrig/Thomas Proisl (2011): The Erlangen Treebank. Vortragsreihe Approaches to Corpus Linguistics, IZ LVK, Erlangen.
- Peter Uhrig/Thomas Proisl (2011): The Treebank.info project. Software Demonstration. ICAME 32, Oslo.
Researcher’s Bio and Affiliation
Dr. Peter Uhrig is a researcher at the Chair of English Linguistics. He is currently working on a post-doctoral project on large-scale multimodal corpus linguistics, for which HPC resources are essential.