The Leibniz ScienceCampus “Empirical Linguistics & Computational Language Modeling” aims to conduct innovative research to support the high-quality automatic annotation of large-scale corpus resources of the German language by inducing domain-, genre- and variety-adaptive natural language processing models. The goal is to enable advanced empirical research in linguistics as well as innovative applications in the humanities and the social sciences.
Central research themes are the corpus-based acquisition of linguistic models, especially distributional semantic models; the interfacing of corpora with linguistic and knowledge ontologies; and the corpus-linguistic and computational-linguistic analysis of genre effects in grammar and lexicon research.
As a distinguishing feature, the Leibniz ScienceCampus will focus on the German language, which is considerably understudied compared to English. The computational modeling will take advantage of weak supervision and unsupervised learning techniques.
Expected results include improved large-scale annotated corpus resources of contemporary German, enhanced with novel semantic annotation layers, and advanced NLP models and resources for the analysis of German corpora from different genres and domains.
The close cooperation between linguists and computational linguists in the ScienceCampus will foster novel research methods in empirical linguistics. Through improved genre- and domain-adaptive computational models it will be possible to address a wide range of applications in Digital Humanities and Language Technology. The Leibniz ScienceCampus will explore novel research questions in this area through collaborative interdisciplinary incubator projects in empirical linguistics and Digital Humanities.
The research activities in the Leibniz ScienceCampus “Empirical Linguistics & Computational Language Modeling” are organised in three research areas:
Area A: Natural Language Processing & Annotation Science
Area A aims to create high-level, multi-layered annotations for large volumes of text, focusing on (i) less-studied varieties of German, (ii) varieties that connect to Areas B and C and (iii) at least two levels from contrasting NLP areas (such as syntax and semantics) to validate the generality of our approach. Our chosen use cases are part-of-speech tagging for spoken language and sentiment analysis for several varieties of German. The sentiment analysis work will mainly aim at (a) the automatic extraction of genre-, community-, domain- and/or period-specific sentiment lexicons and (b) the automatic labeling of different varieties of German with context-sensitive sentiment annotations.
Further, Area A will collaborate with Area B on new unsupervised work for genre profiling in support of domain adaptation. Area A’s sentiment analysis capabilities will feed into Area C’s work on argumentation.
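To make the idea of automatically extracting a domain-specific sentiment lexicon more concrete, the following is a minimal sketch and not the method used in the ScienceCampus. It assumes a classic seed-based approach: each word is scored by its pointwise mutual information (PMI) with a few positive seed words minus its PMI with a few negative seed words, computed from sentence-level co-occurrence. The function name `induce_sentiment_lexicon`, the seed words and the toy corpus are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def induce_sentiment_lexicon(sentences, pos_seeds, neg_seeds, min_count=2):
    """Score words by PMI with positive seeds minus PMI with negative seeds.

    `sentences` is a list of token lists. Co-occurrence is counted once
    per sentence (sentence-level document frequency).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        types = set(tokens)
        word_counts.update(types)
        # Store each unordered pair under a sorted key.
        for a, b in combinations(sorted(types), 2):
            pair_counts[(a, b)] += 1
    n = len(sentences)

    def pmi(word, seed):
        joint = pair_counts[tuple(sorted((word, seed)))]
        if joint == 0:
            # Simplification (assumption): unseen pairs contribute 0
            # instead of minus infinity.
            return 0.0
        return math.log2(joint * n / (word_counts[word] * word_counts[seed]))

    seeds = set(pos_seeds) | set(neg_seeds)
    lexicon = {}
    for word, count in word_counts.items():
        if word in seeds or count < min_count:
            continue
        lexicon[word] = (sum(pmi(word, s) for s in pos_seeds)
                         - sum(pmi(word, s) for s in neg_seeds))
    return lexicon


# Hypothetical toy corpus of tokenized German sentences.
corpus = [
    ["der", "film", "war", "gut", "und", "spannend"],
    ["ein", "spannend", "und", "gut", "erzählt", "buch"],
    ["das", "essen", "war", "schlecht", "und", "fad"],
    ["fad", "und", "schlecht", "gewürzt"],
]
lexicon = induce_sentiment_lexicon(corpus, ["gut"], ["schlecht"])
for word, score in sorted(lexicon.items(), key=lambda item: -item[1]):
    print(f"{word}\t{score:+.2f}")
```

Running the same procedure separately on corpora from different genres, communities or periods would yield one lexicon per variety, which is one simple way to observe how the polarity of a word shifts with context. A realistic setup would need far larger corpora, lemmatization and a more careful treatment of unseen pairs.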
Area B: Induction of (variational) linguistic models & resources
The main aim of Area B is the induction of syntactic and semantic models for German. Research topics include dependency parsing and local and non-local semantic role labeling for German, as well as the induction of parsing models adapted to specific domains and varieties of German. Another field of interest is the creation of a semantic resource for describing and identifying causal language in text. The last two topics are closely linked to Areas A and C and will provide a rich basis for innovative empirical linguistic analysis in both areas.
Area C: Applications in empirical linguistics & Digital Humanities
Area C focuses on knowledge discovery. It uses the data already gathered and annotated for argumentation mining as scenarios in which implied knowledge and inferencing play a role, in order to learn to uncover these phenomena and to extend this work to other domains.
Two projects will investigate these areas: one focuses on reconstructing links between mentions in argumentative texts, with the ultimate goal of uncovering missing links and eliciting them through manual annotation; the other addresses the issue of non-essential information, with the aim of simplifying and thus generalizing knowledge found in textual form. Building on these projects and the previously annotated data, we will work on finding sentences that carry interesting knowledge and use them to bridge argument gaps and enrich knowledge repositories. Additional work will focus on applying similar methods to different types of text collections, in particular scientific articles and news or historical texts.