Search results

Metaphor detection for German poetry

Published: 2019

Publisher: München [et al.] : German Society for Computational Linguistics & Language Technology and Friedrich-Alexander-Universität Erlangen-Nürnberg

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9316 https://ids-pub.bsz-bw.de/files/9316/Reinig_Rehbein_Metaphor_detection_2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-93163

This paper presents first steps towards metaphor detection in German poetry, in particular in expressionist poems. We create a dataset with adjective-noun pairs extracted from expressionist poems, manually annotated for metaphoricity. We discuss the annotation process and present models and experiments for metaphor detection where we investigate the impact of context and the domain dependence of the models.

Source:	BASE subject section German Studies
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	German; Verse poetry; Expressionism; Metaphor; Automatic language analysis
License:	creativecommons.org/licenses/by-nc-sa/4.0/deed.de ; info:eu-repo/semantics/openAccess

Detecting the boundaries of sentence-like units on spoken German

Author(s): Ruppenhofer, Josef ; Rehbein, Ines

Published: 2019

Publisher: München [et al.] : German Society for Computational Linguistics & Language Technology and Friedrich-Alexander-Universität Erlangen-Nürnberg

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9317 https://ids-pub.bsz-bw.de/files/9317/Ruppenhofer_Rehbein_Detecting_the_boundaries_2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-93174

Automatic division of spoken language transcripts into sentence-like units is a challenging problem, caused by disfluencies, ungrammatical structures and the lack of punctuation. We present experiments on dividing up German spoken dialogues where we investigate the impact of task setup and data representation, encoding of context information as well as different model architectures for this task.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	German; Spoken language; Automatic language analysis; Segmentation; Sentence
License:	creativecommons.org/licenses/by-nc-sa/4.0/deed.de ; info:eu-repo/semantics/openAccess

Improving Sentence Boundary Detection for Spoken Language Transcripts

Author(s): Rehbein, Ines ; Ruppenhofer, Josef ; Schmidt, Thomas

Published: 2020

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9838 https://ids-pub.bsz-bw.de/files/9838/Rehbein_Ruppenhofer_Schmidt_Improving_Sentence_Boundary_Detection_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98382

This paper presents experiments on sentence boundary detection in transcripts of spoken dialogues. Segmenting spoken language into sentence-like units is a challenging task, due to disfluencies, ungrammatical or fragmented structures and the lack of punctuation. In addition, one of the main bottlenecks for many NLP applications for spoken language is the small size of the training data, as the transcription and annotation of spoken language is by far more time-consuming and labour-intensive than processing written language. We therefore investigate the benefits of data expansion and transfer learning and test different ML architectures for this task. Our results show that data expansion is not straightforward and even data from the same domain does not always improve results. They also highlight the importance of modelling, i.e. of finding the best architecture and data representation for the task at hand. For the detection of boundaries in spoken language transcripts, we achieve a substantial improvement when framing the boundary detection problem as a sentence pair classification task, as compared to a sequence tagging approach.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Automatic speech recognition; Spoken language; Corpus; Sentence end; Machine learning
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Fine-grained Named Entity Annotations for German Biographic Interviews

Author(s): Ruppenhofer, Josef ; Rehbein, Ines ; Flinz, Carolina

Published: 2020

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9865 https://ids-pub.bsz-bw.de/files/9865/Ruppenhofer_Rehbein_Flinz_Fine_grained_NEA_for_biographic_interviews_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98652

We present a fine-grained NER annotation scheme with 30 labels and apply it to German data. Building on the OntoNotes 5.0 NER inventory, our scheme is adapted for a corpus of transcripts of biographic interviews by adding categories for AGE and LAN(guage) and also adding label classes for various numeric and temporal expressions. Applying the scheme to the spoken data as well as a collection of teaser tweets from newspaper sites, we can confirm its generality for both domains, also achieving good inter-annotator agreement. We also show empirically how our inventory relates to the well-established 4-category NER inventory by re-annotating a subset of the GermEval 2014 NER coarse-grained dataset with our fine label inventory. Finally, we use a BERT-based system to establish some baselines for NER tagging on our two new datasets. Global results in in-domain testing are quite high on the two datasets, near what was achieved for the coarse inventory on the CoNLL 2003 data. Cross-domain testing produces much lower results due to the severe domain differences.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Spoken language; Name; Annotation; Automatic speech recognition; Oral history
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies

Author(s): Sanguinetti, Manuela ; Bosco, Cristina ; Cassidy, Lauren ; Çetinoğlu, Özlem ; Cignarella, Alessandra Teresa ; Lynn, Teresa ; Rehbein, Ines ; Ruppenhofer, Josef ; Seddah, Djamé ; Zeldes, Amir

Published: 2020

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9868 https://ids-pub.bsz-bw.de/files/9868/Sanguinetti_Bosco_Cassidy_et_al_Treebanking_user_generated_content_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98686

The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Syntax tree; Social media; Annotation; Natural language; User-generated content
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

A New Resource for German Causal Language

Author(s): Rehbein, Ines ; Ruppenhofer, Josef

Published: 2020

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9869 https://ids-pub.bsz-bw.de/files/9869/Rehbein_Ruppenhofer_A_new_resource_for_German_causal_language_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98692

We present a new resource for German causal language, with annotations in context for verbs, nouns and adpositions. Our dataset includes 4,390 annotated instances for more than 150 different triggers. The annotation scheme distinguishes three different types of causal events (CONSEQUENCE, MOTIVATION, PURPOSE). We also provide annotations for semantic roles, i.e. of the cause and effect for the causal event as well as the actor and affected party, if present. In the paper, we present inter-annotator agreement scores for our dataset and discuss problems for annotating causal language. Finally, we present experiments where we frame causal annotation as a sequence labelling problem and report baseline results for the prediction of causal arguments and for predicting different types of causation.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Causality; Corpus; German; Annotation; Natural language
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

STTS goes Kiez – Experiments on Annotating and Tagging Urban Youth Language

Author(s): Rehbein, Ines ; Schalowski, Sören

Published: 2016

Publisher: Regensburg : GSCL

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5439 https://ids-pub.bsz-bw.de/files/5439/Rehbein_Schalowski_STTS_goes_Kiez_Experiments_on_Annotating_and_Tagging_Urban_Youth_Language_2013.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-54390

Source:	BASE subject section German Studies
Language:	English
Media type:	Journal article
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Youth language; Automatic language processing; Annotation; Spoken language
License:	creativecommons.org/licenses/by-sa/4.0/deed.de ; info:eu-repo/semantics/openAccess

Der Einfluss der Dependenzgrammatik auf die Computerlinguistik

Author(s): Rehbein, Ines

Published: 2016

Publisher: Berlin/Boston : de Gruyter

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5595 https://ids-pub.bsz-bw.de/files/5595/Rehbein_Der_Einfluss_der_Dependenzgrammatik_auf_die_Computerlinguistik_2010.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-55952 https://doi.org/10.1515/ZGL.2010.015

In 1959, Lucien Tesnière wrote his main work Éléments de syntaxe structurale. While the impact on theoretical linguistics was not very strong at first, 50 years later there exist a variety of linguistic theories based on Tesnière's work. In computational linguistics, as in theoretical linguistics, dependency grammar was not very influential at first. The last 10–15 years, however, have brought a noticeable change and dependency grammar has found its way into computational linguistics. Syntactically annotated corpora based on dependency representations are available for a variety of languages, as well as statistical parsers which give a syntactic analysis of running text describing the underlying dependency relations between word tokens in the text. This article gives an overview of relevant areas of computational linguistics which have been influenced by dependency grammar. It discusses the pros and cons of different types of syntactic representation used in natural language processing and their suitability as representations of meaning. Finally, an attempt is made to give an outlook on the future impact of dependency grammar on computational linguistics.

Source:	BASE subject section German Studies
Language:	German
Media type:	Journal article
Format:	Online
DDC classification:	Language (400)
Keywords:	Dependency grammar; Computational linguistics
License:	creativecommons.org/licenses/by-nc-nd/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

Towards a syntactically motivated analysis of modifiers in German

Author(s): Rehbein, Ines ; Hirschmann, Hagen

Published: 2016

Publisher: Hildesheim : Universitätsverlag Hildesheim

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5597 https://ids-pub.bsz-bw.de/files/5597/Rehbein_Hirschmann_Towards_a_syntactically_motivated_analysis_of_modifiers_in_German_2014.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-55975

The Stuttgart-Tübingen Tagset (STTS) is a widely used POS annotation scheme for German which provides 54 different tags for the analysis on the part of speech level. The tagset, however, does not distinguish between adverbs and different types of particles used for expressing modality, intensity, graduation, or to mark the focus of the sentence. In the paper, we present an extension to the STTS which provides tags for a more fine-grained analysis of modification, based on a syntactic perspective on parts of speech. We argue that the new classification not only enables us to do corpus-based linguistic studies on modification, but also improves statistical parsing. We give proof of concept by training a data-driven dependency parser on data from the TiGer treebank, providing the parser a) with the original STTS tags and b) with the new tags. Results show an improved labelled accuracy for the new, syntactically motivated classification.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Annotation; Automatic language analysis; Corpus
License:	creativecommons.org/licenses/by/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

POS error detection in automatically annotated corpora

Author(s): Rehbein, Ines

Published: 2016

Publisher: Stroudsburg, PA : ACL

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5598 https://ids-pub.bsz-bw.de/files/5598/Rehbein_POS_error_detection_in_automatically_annotated_corpora_2014.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-55986

Recent work on error detection has shown that the quality of manually annotated corpora can be substantially improved by applying consistency checks to the data and automatically identifying incorrectly labelled instances. These methods, however, cannot be used for automatically annotated corpora where errors are systematic and cannot easily be identified by looking at the variance in the data. This paper targets the detection of POS errors in automatically annotated corpora, so-called silver standards, showing that by combining different measures sensitive to annotation quality we can identify a large part of the errors and obtain a substantial increase in accuracy.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Automatic language analysis; Annotation
License:	creativecommons.org/licenses/by/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

The KiezDeutsch Korpus (KiDKo) Release 1.0

Author(s): Rehbein, Ines ; Schalowski, Sören ; Wiese, Heike

Published: 2016

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5599 https://ids-pub.bsz-bw.de/files/5599/Rehbein_Schalowski_Wiese_The_KiezDeutsch_Korpus_KiDKo_Release_1_0_2014.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-55999

This paper presents the first release of the KiezDeutsch Korpus (KiDKo), a new language resource with multiparty spoken dialogues of Kiezdeutsch, a newly emerging language variety spoken by adolescents from multi-ethnic urban areas in Germany. The first release of the corpus includes the transcriptions of the data as well as a normalisation layer and part-of-speech annotations. In the paper, we describe the main features of the new resource and then focus on automatic POS tagging of informal spoken language. Our tagger achieves an accuracy of nearly 97% on KiDKo. While we did not succeed in further improving the tagger using ensemble tagging, we present our approach to using the tagger ensembles for identifying error patterns in the automatically tagged data.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Spoken language; Urban dialect; Youth language; Multicultural society; Corpus
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Discussing best practices for the annotation of Twitter microtext

Author(s): Rehbein, Ines ; Visser, Emiel ; Lestmann, Nadine

Published: 2016

Publisher: Sofia : Bulgarian Academy of Sciences

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5601 https://ids-pub.bsz-bw.de/files/5601/Rehbein_Visser_Lestmann_Discussing_best_practices_for_the_annotation_of_Twitter_microtext_2013.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-56013

This paper contributes to the discussion on best practices for the syntactic analysis of non-canonical language, focusing on Twitter microtext. We present an annotation experiment where we test an existing POS tagset, the Stuttgart-Tübingen Tagset (STTS), with respect to its applicability for annotating new text from social media, in particular from Twitter microblogs. We discuss different tagset extensions proposed in the literature and test our extended tagset on a set of 506 tweets (7,418 tokens) where we achieve an inter-annotator agreement for two human annotators in the range of 92.7 to 94.4 (κ). Our error analysis shows that especially the annotation of Twitter-specific phenomena such as hashtags and at-mentions causes disagreements between the human annotators. Following up on this, we provide a discussion of the different uses of the @- and #-marker in Twitter and argue against analysing both on the POS level by means of an at-mention or hashtag label. Instead, we sketch a syntactic analysis which describes these phenomena by means of syntactic categories and grammatical functions.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Syntactic analysis; Annotation; Twitter <software platform>
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Extending the STTS for the Annotation of Spoken Language

Author(s): Rehbein, Ines ; Schalowski, Sören

Published: 2016

Publisher: Wien : Eigenverlag ÖGAI

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5602 https://ids-pub.bsz-bw.de/files/5602/Rehbein_Schalowski_Extending_the_STTS_for_the_Annotation_of_Spoken_Language_2012.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-56026

This paper presents an extension to the Stuttgart-Tübingen TagSet, the standard part-of-speech tag set for German, for the annotation of spoken language. The additional tags deal with hesitations, backchannel signals, interruptions, onomatopoeia and uninterpretable material. They allow one to capture phenomena specific to spoken language while, at the same time, preserving inter-operability with already existing corpora of written language.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Spoken language; Annotation; Automatic language analysis; Interoperability
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Data point selection for self-training

Author(s): Rehbein, Ines

Published: 2016

Publisher: Stroudsburg, PA : Association for Computational Linguistics

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5604 https://ids-pub.bsz-bw.de/files/5604/Rehbein_Data_point_selection_for_self_training_2011.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-56043

Problems for parsing morphologically rich languages are, amongst others, caused by the higher variability in structure due to less rigid word order constraints and by the higher number of different lexical forms. Both properties can result in sparse data problems for statistical parsing. We present a simple approach for addressing these issues. Our approach makes use of self-training on instances selected with regard to their similarity to the annotated data. Our similarity measure is based on the perplexity of part-of-speech trigrams of new instances measured against the annotated training data. Preliminary results show that our method outperforms a self-training setting where instances are simply selected by order of occurrence in the corpus, and we argue that self-training is a cheap and effective method for improving parsing accuracy for morphologically rich languages.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Sentence analysis; Automatic language analysis
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

Hard constraints for grammatical function labelling

Author(s): Seeker, Wolfgang ; Rehbein, Ines ; Kuhn, Jonas ; van Genabith, Josef

Published: 2016

Publisher: Stroudsburg, PA : Association for Computational Linguistics

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5605 https://ids-pub.bsz-bw.de/files/5605/Seeker_Rehbein_Kuhn_Hard_Constraints_for_Grammatical_Function_Labelling_2010.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-56059

For languages with (semi-)free word order (such as German), labelling grammatical functions on top of phrase-structural constituent analyses is crucial for making them interpretable. Unfortunately, most statistical classifiers consider only local information for function labelling and fail to capture important restrictions on the distribution of core argument functions such as subject, object etc., namely that there is at most one subject (etc.) per clause. We augment a statistical classifier with an integer linear program imposing hard linguistic constraints on the solution space output by the classifier, capturing global distributional restrictions. We show that this improves labelling quality, in particular for argument grammatical functions, in an intrinsic evaluation, and, importantly, grammar coverage for treebank-based (Lexical-Functional) grammar acquisition and parsing, in an extrinsic evaluation.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Phrase structure; Automatic language analysis
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Annotating Discourse Relations in Spoken Language: A Comparison of the PDTB and CCR Frameworks

Author(s): Rehbein, Ines ; Scholman, Merel ; Demberg, Vera

Published: 2016

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5606 https://ids-pub.bsz-bw.de/files/5606/Rehbein_Scholman_Demberg_Annotating_Discourse_Relations_in_Spoken_Language_A_Comparison_of_the_PDTB_and_CCR_Frameworks_2016.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-56068

In discourse relation annotation, there is currently a variety of different frameworks being used, and most of them have been developed and employed mostly on written data. This raises a number of questions regarding interoperability of discourse relation annotation schemes, as well as regarding differences in discourse annotation for written vs. spoken domains. In this paper, we describe our experiments on annotating two spoken domains from the SPICE Ireland corpus (telephone conversations and broadcast interviews) according to two different discourse annotation schemes, PDTB 3.0 and CCR. We show that annotations in the two schemes can largely be mapped onto one another, and discuss differences in operationalisations of discourse relation schemes which present a challenge to automatic mapping. We also observe systematic differences in the prevalence of implicit discourse relations in spoken data compared to written texts, and find that there are also differences in the types of causal relations between the domains. Finally, we find that PDTB 3.0 addresses many shortcomings of PDTB 2.0 with respect to the annotation of spoken discourse, and suggest further extensions. The new corpus has roughly the size of the CoNLL 2015 Shared Task test set, and we hence hope that it will be a valuable resource for the evaluation of automatic discourse relation labellers.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Spoken language; Annotation; Irish
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Scalable Discriminative Parsing for German

Author(s): Versley, Yannick ; Rehbein, Ines

Published: 2016

Publisher: Stroudsburg, PA : Association for Computational Linguistics

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5608 https://ids-pub.bsz-bw.de/files/5608/Versley_Rehbein_Scalable_Discriminative_Parsing_for_German_2009.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-56080

Generative lexicalized parsing models, which are the mainstay for probabilistic parsing of English, do not perform as well when applied to languages with different language-specific properties such as free(r) word order or rich morphology. For German and other non-English languages, linguistically motivated complex treebank transformations have been shown to improve performance within the framework of PCFG parsing, while generative lexicalized models do not seem to be as easily adaptable to these languages. In this paper, we show a practical way to use grammatical functions as first-class citizens in a discriminative model that allows us to extend annotated treebank grammars with rich feature sets without having to suffer from sparse data problems. We demonstrate the flexibility of the approach by integrating unsupervised PP attachment and POS-based word clusters into the parser.

Source:	BASE subject section German Studies
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	German; Syntactic analysis; Automatic language analysis; Grammar
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

How to Compare Treebanks

Autor*in: Kübler, Sandra ; Maier, Wolfgang ; Rehbein, Ines ; Versley, Yannick

Erschienen: 2017

Verlag: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5752 https://ids-pub.bsz-bw.de/files/5752/Kuebler_Maier_Rehbein_Versley_How_to_Compare_Treebanks_2008.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-57520

Recent years have seen an increasing interest in developing standards for linguistic annotation, with a focus on the interoperability of the resources. This effort, however, requires a profound knowledge of the advantages and disadvantages of linguistic annotation schemes in order to avoid importing the flaws and weaknesses of existing encoding schemes into the new standards. This paper addresses the question of how to compare syntactically annotated corpora and gain insights into the usefulness of specific design decisions. We present an exhaustive evaluation of two German treebanks with crucially different encoding schemes. We evaluate three different parsers trained on the two treebanks and compare results using EVALB, the Leaf-Ancestor metric, and a dependency-based evaluation. Furthermore, we present TePaCoC, a new testsuite for the evaluation of parsers on complex German grammatical constructions. The testsuite provides a well thought-out error classification, which enables us to compare parser output for parsers trained on treebanks with different encoding schemes and provides interesting insights into the impact of treebank annotation schemes on specific constructions like PP attachment or non-constituent coordination.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Syntactic analysis
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Evaluating Evaluation Measures

Authors: Rehbein, Ines ; van Genabith, Josef

Published: 2017

Publisher: Tartu : University of Tartu

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5754 https://ids-pub.bsz-bw.de/files/5754/Rehbein_Van_Genabith_Evaluating_Evaluation_Measures_2007.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-57543

This paper presents a thorough examination of the validity of three evaluation measures on parser output. We assess parser performance of an unlexicalised probabilistic parser trained on two German treebanks with different annotation schemes and evaluate parsing results using the PARSEVAL metric, the Leaf-Ancestor metric and a dependency-based evaluation. We reject the claim that the TüBa-D/Z annotation scheme is more adequate than the TIGER scheme for PCFG parsing and show that PARSEVAL should not be used to compare parser performance for parsers trained on treebanks with different annotation schemes. An analysis of specific error types indicates that the dependency-based evaluation is most appropriate to reflect parse quality.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Syntactic analysis; German
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Why is it so difficult to compare treebanks? TIGER and TüBa-D/Z revisited

Authors: Rehbein, Ines ; van Genabith, Josef

Published: 2017

Publisher: Tartu : Northern European Association for Language Technology

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5782 https://ids-pub.bsz-bw.de/files/5782/Rehbein_Van_Genabith_Why_is_It_so_Difficult_to_Compare_Treebanks_TIGER_and_TueBa_DZ_Revisited_2007.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-57822

This paper is a contribution to the ongoing discussion on treebank annotation schemes and their impact on PCFG parsing results. We provide a thorough comparison of two German treebanks: the TIGER treebank and the TüBa-D/Z. We use simple statistics on sentence length and vocabulary size, and more refined methods such as perplexity and its correlation with PCFG parsing results, as well as a Principal Components Analysis. Finally, we present a qualitative evaluation of a set of 100 sentences from the TüBa-D/Z, manually annotated in the TIGER as well as in the TüBa-D/Z annotation scheme, and show that even the existence of a parallel subcorpus does not support a straightforward and easy comparison of both annotation schemes.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Syntactic analysis; Annotation
License:	creativecommons.org/licenses/by-nc-nd/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

Catching the common cause: extraction and annotation of causal relations and their participants

Authors: Rehbein, Ines ; Ruppenhofer, Josef

Published: 2017

Publisher: Stroudsburg, PA : EACL

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/6153 https://ids-pub.bsz-bw.de/files/6153/Rehbein_Ruppenhofer_Catching_the_common_cause_2017.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-61534

In this paper, we present a simple yet effective method for the automatic identification and extraction of causal relations from text, based on a large English-German parallel corpus. The goal of this effort is to create a lexical resource for German causal relations. The resource will consist of a lexicon that describes constructions that trigger causality as well as the participants of the causal event, and will be augmented by a corpus with annotated instances for each entry that can be used as training data to develop a system for automatic classification of causal relations. Focusing on verbs, our method harvested a set of 100 different lexical triggers of causality, including support verb constructions. At the moment, our corpus includes over 1,000 annotated instances. The lexicon and the annotated data will be made available to the research community.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Automatic language analysis; Causality; Corpus; Computational linguistics; Annotation
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Data point selection for genre-aware parsing

Authors: Rehbein, Ines ; Bildhauer, Felix

Published: 2018

Publisher: Prague : Charles University

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/7119 https://ids-pub.bsz-bw.de/files/7119/Rehbein_Bildhauer_Data_point_selection_for_genre_aware_parsing_2017.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-71193

In the NLP literature, adapting a parser to new text with properties different from the training data is commonly referred to as domain adaptation. In practice, however, the differences between texts from different sources often reflect a mixture of domain and genre properties, and it is by no means clear what impact each of those has on statistical parsing. In this paper, we investigate how differences between articles in a newspaper corpus relate to the concepts of genre and domain and how they influence parsing performance of a transition-based dependency parser. We do this by applying various similarity measures for data point selection and testing their adequacy for creating genre-aware parsing models.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Syntactic analysis; Automatic language analysis; Text type; Corpus; Language statistics
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

I’ve got a construction looks funny – representing and recovering non-standard constructions in UD

Authors: Ruppenhofer, Josef ; Rehbein, Ines

Published: 2020

Publisher: Stroudsburg, PA : Association for Computational Linguistics

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/10202 https://ids-pub.bsz-bw.de/files/10202/Ruppenhofer_Rehbein_Representing_and_recovering_nonstandard_constructions_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-102024

The UD framework defines guidelines for a crosslingual syntactic analysis in the framework of dependency grammar, with the aim of providing a consistent treatment across languages that not only supports multilingual NLP applications but also facilitates typological studies. Until now, the UD framework has mostly focussed on bilexical grammatical relations. In the paper, we propose to add a constructional perspective and discuss several examples of spoken-language constructions that occur in multiple languages and challenge the current use of basic and enhanced UD relations. The examples include cases where the surface relations are deceptive, and syntactic amalgams that either involve unconnected subtrees or structures with multiply-headed dependents. We argue that a unified treatment of constructions across languages will increase the consistency of the UD annotations and thus the quality of the treebanks for linguistic analysis.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Natural language; Dependency grammar; Contrastive syntax; Linguistic typology; Language analysis; Spoken language; Automatic language analysis
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

Authors: Sanguinetti, Manuela ; Bosco, Cristina ; Cassidy, Lauren ; Cetinoglu, Özlem ; Cignarella, Alessandra Teresa ; Lynn, Teresa ; Rehbein, Ines ; Ruppenhofer, Josef ; Seddah, Djamé ; Zeldes, Amir

Published: 2022

Publisher: Dordrecht [etc.] : Springer ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11000 https://ids-pub.bsz-bw.de/files/11000/Sanguinetti_Bosco_Cassidy_Treebanking_user_generated_22.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-110002 https://doi.org/10.1007/s10579-022-09581-9

This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework of syntactic analysis. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this article is twofold: (1) to provide a condensed, though comprehensive, overview of such treebanks—based on available literature—along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The overarching goal of this article is to provide a common framework for researchers interested in developing similar resources in UD, thus promoting cross-linguistic consistency, which is a principle that has always been central to the spirit of UD.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Journal article
Format:	Online
DDC classification:	Language (400)
Keywords:	World Wide Web; Annotation; Applied linguistics; Social media; Database system; Syntax tree
License:	creativecommons.org/licenses/by/4.0/deed.de ; info:eu-repo/semantics/openAccess

Who’s in, who’s out? Predicting the inclusiveness or exclusiveness of personal pronouns in parliamentary debates

Authors: Rehbein, Ines ; Ruppenhofer, Josef

Published: 2022

Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11115 https://ids-pub.bsz-bw.de/files/11115/Rehbein_Ruppenhofer_Whos_in_whos_out_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111151

This paper presents a compositional annotation scheme to capture the clusivity properties of personal pronouns in context, that is, their ability to construct and manage in-groups and out-groups by including/excluding the audience and/or non-speech act participants in reference to groups that also include the speaker. We apply and test our schema on pronoun instances in speeches taken from the German parliament. The speeches cover the period from 2017 to 2021 and comprise manual annotations for 3,126 sentences. We achieve high inter-annotator agreement for our new schema, with a Cohen’s κ in the range of 89.7-93.2 and a percentage agreement of > 96%. Our exploratory analysis of in/exclusive pronoun use in the parliamentary setting provides some face validity for our new schema. Finally, we present baseline experiments for automatically predicting clusivity in political debates, with promising results for many referential constellations, yielding an overall 84.9% micro F1 for all pronouns.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Personal pronoun; Parliamentary debate; In-group; Out-group; Germany. Deutscher Bundestag; German; Pronoun; Text linguistics; Politics
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess
