
Results for *

37 results were found.

Showing results 1 to 25 of 37.


  1. A harmonised testsuite for POS tagging of German social media data
    Published: 2018
    Publisher: Vienna, Austria : Austrian Academy of Sciences


    We present a testsuite for POS tagging German web data. Our testsuite provides the original raw text as well as the gold tokenisations and is annotated for parts-of-speech. The testsuite includes a new dataset for German tweets, with a current size of 3,940 tokens. To increase the size of the data, we harmonised the annotations in already existing web corpora, based on the Stuttgart-Tübingen Tag Set. The current version of the corpus has an overall size of 48,344 tokens of web data, around half of it from Twitter. We also present experiments, showing how different experimental setups (training set size, additional out-of-domain training data, self-training) influence the accuracy of the taggers. All resources and models will be made publicly available to the research community.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; German; social software
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  2. Data point selection for genre-aware parsing
    Published: 2018
    Publisher: Stroudsburg, PA, USA : The Association for Computational Linguistics


    In the NLP literature, adapting a parser to new text with properties different from the training data is commonly referred to as domain adaptation. In practice, however, the differences between texts from different sources often reflect a mixture of domain and genre properties, and it is by no means clear what impact each of those has on statistical parsing. In this paper, we investigate how differences between articles in a newspaper corpus relate to the concepts of genre and domain and how they influence parsing performance of a transition-based dependency parser. We do this by applying various similarity measures for data point selection and testing their adequacy for creating genre-aware parsing models.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Parsing; corpus; text type
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

  3. Evaluating LSTM models for grammatical function labelling
    Published: 2018
    Publisher: Stroudsburg, PA, USA : The Association for Computational Linguistics


    To improve grammatical function labelling for German, we augment the labelling component of a neural dependency parser with a decision history. We present different ways to encode the history, using different LSTM architectures, and show that our models yield significant improvements, resulting in a LAS for German that is close to the best result from the SPMRL 2014 shared task (without the reranker).

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Natural language processing; syntactic analysis; parsers; German
    License:

    creativecommons.org/licenses/by/4.0/deed.de ; info:eu-repo/semantics/openAccess

  4. Universal Dependencies are hard to parse – or are they?
    Published: 2018
    Publisher: Linköping, Sweden : Linköping University Electronic Press


    Universal Dependency (UD) annotations, despite their usefulness for cross-lingual tasks and semantic applications, are not optimised for statistical parsing. In the paper, we ask what exactly causes the decrease in parsing accuracy when training a parser on UD-style annotations and whether the effect is similarly strong for all languages. We conduct a series of experiments where we systematically modify individual annotation decisions taken in the UD scheme and show that this results in an increased accuracy for most, but not for all languages. We show that the encoding in the UD scheme, in particular the decision to encode content words as heads, causes an increase in dependency length for nearly all treebanks and an increase in arc direction entropy for many languages, and evaluate the effect this has on parsing accuracy.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Syntax; annotation; parsers; universal grammar
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

  5. What do we need to know about an unknown word when parsing German
    Published: 2018
    Publisher: Stroudsburg, PA, USA : The Association for Computational Linguistics


    We propose a new type of subword embedding designed to provide more information about unknown compounds, a major source for OOV words in German. We present an extrinsic evaluation where we use the compound embeddings as input to a neural dependency parser and compare the results to the ones obtained with other types of embeddings. Our evaluation shows that adding compound embeddings yields a significant improvement of 2% LAS over using word embeddings when no POS information is available. When adding POS embeddings to the input, however, the effect levels out. This suggests that it is not the missing information about the semantics of the unknown words that causes problems for parsing German, but the lack of morphological information for unknown words. To augment our evaluation, we also test the new embeddings in a language modelling task that requires both syntactic and semantic information.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: German; compounds; automatic speech recognition
    License:

    creativecommons.org/licenses/by/4.0/deed.de ; info:eu-repo/semantics/openAccess

  6. Introducing the International Comparable Corpus


    This presentation introduces a new collaborative project: the International Comparable Corpus (ICC) (https://korpus.cz/icc), to be compiled from European national, standard(ised) languages, using the protocols for text categories and their quantities of texts in the International Corpus of English (ICE).

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; contrastive linguistics; Europe
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  7. A corpus-based lexical resource of spoken German in interaction
    Published: 2019
    Publisher: Brno, Czech Republic : Lexical Computing CZ s.r.o.


    This paper presents the prototype of a lexicographic resource for spoken German in interaction, which was conceived within the framework of the LeGeDe-project (LeGeDe=Lexik des gesprochenen Deutsch). First of all, it summarizes the theoretical and methodological approaches that were used for the initial planning of the resource. The headword candidates were selected by analyzing corpus-based data. Therefore, the data of two corpora (written and spoken German) were compared with quantitative methods. The information that was gathered on the selected headword candidates can be assigned to two different sections: meanings and functions in interaction. Additionally, two studies on the expectations of future users towards the resource were carried out. The results of these two studies were also taken into account in the development of the prototype. Focusing on the presentation of the resource’s content, the paper shows both the different lexicographical information in selected dictionary entries, and the information offered by the provided hyperlinks and external texts. As a conclusion, it summarizes the most important innovative aspects that were specifically developed for the implementation of such a resource.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Computer-assisted lexicography; spoken language; corpus; German
    License:

    creativecommons.org/licenses/by-sa/4.0/ ; info:eu-repo/semantics/openAccess

  8. The microstructure of Online Linguistics Dictionaries: obligatory and facultative elements
    Published: 2019
    Publisher: Ljubljana : Trojina, Institute for Applied Slovene Studies


    The planning of a dictionary should consider both theoretical and empirical aspects, for both its macro- and its microstructure: this is also true for Online Specialized Dictionaries of Linguistics. In particular, the microstructure should be standardized and structured so as to fit the primary and secondary functions of a dictionary. Unfortunately, empirical studies that investigate Online Specialized Dictionaries of Linguistics are rare, making it unclear which microstructural elements are obligatory and which are facultative. This article presents and comments upon the results of an investigation into a corpus of Online Specialized Dictionaries of Linguistics, focusing attention on these aspects as well as on the most important theoretical issues. An example taken from DIL, a German-Italian Online Dictionary of Linguistics, concludes the article.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Computer-assisted lexicography; microstructure; methodology; technical language
    License:

    creativecommons.org/licenses/by-sa/4.0/ ; info:eu-repo/semantics/openAccess

  9. Detecting the boundaries of sentence-like units on spoken German
    Published: 2019
    Publisher: Munich [et al.] : German Society for Computational Linguistics & Language Technology and Friedrich-Alexander-Universität Erlangen-Nürnberg


    Automatic division of spoken language transcripts into sentence-like units is a challenging problem, caused by disfluencies, ungrammatical structures and the lack of punctuation. We present experiments on dividing up German spoken dialogues where we investigate the impact of task setup and data representation, encoding of context information as well as different model architectures for this task.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: German; spoken language; automatic language analysis; segmentation; sentence
    License:

    creativecommons.org/licenses/by-nc-sa/4.0/deed.de ; info:eu-repo/semantics/openAccess

  10. How much “tourism” is there in dictionary apps? An empirical study of lexicographical resources on mobile devices (German, Italian, Spanish)
    Published: 2019
    Publisher: Brno, Czech Republic : Lexical Computing CZ s.r.o.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Online dictionaries; computer-assisted lexicography; tourism; German; Italian; Spanish
    License:

    creativecommons.org/licenses/by-sa/4.0/ ; info:eu-repo/semantics/openAccess

  11. A corpus-based lexical resource of spoken German in interaction
    Published: 2019
    Publisher: Brno, Czech Republic : Lexical Computing CZ s.r.o.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Computer-assisted lexicography; spoken language; corpus
    License:

    creativecommons.org/licenses/by-sa/4.0/ ; info:eu-repo/semantics/openAccess

  12. CLARIN Web Services for TEI-annotated Transcripts of Spoken Language
    Published: 2020
    Publisher: Utrecht : CLARIN


    We present web services implementing a workflow for transcripts of spoken language following TEI guidelines, in particular ISO 24624:2016 "Language resource management - Transcription of spoken language". The web services are available at our website and will be available via the CLARIN infrastructure, including the Virtual Language Observatory and WebLicht.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Text Encoding Initiative; spoken language; transcription; computational linguistics; web services
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

  13. Addressing Cha(lle)nges in Long-Term Archiving of Large Corpora
    Published: 2020
    Publisher: Paris : European Language Resources Association


    This paper addresses long-term archival for large corpora. It focuses on three aspects specific to language resources, namely (1) the removal of resources for legal reasons, (2) versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) the conversion of data to new formats for digital preservation. It motivates why language resources may have to be changed and why formats may need to be converted. As a solution, the use of an intermediate proxy object called a signpost is suggested. The approach is exemplified with respect to the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; long-term archiving; usage rights; file formats
    License:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  14. Using Full Text Indices for Querying Spoken Language Data
    Published: 2020
    Publisher: Paris : European Language Resources Association


    As part of the ZuMult project, we are currently modelling a backend architecture that should provide query access to corpora from the Archive of Spoken German (AGD) at the Leibniz-Institute for the German Language (IDS). We are exploring how to reuse existing search engine frameworks that provide full text indices and allow corpora to be queried with one of the corpus query languages (QLs) established and actively used in the corpus research community. For this purpose, we tested MTAS, an open-source Lucene-based search engine for querying text with multilevel annotations. We applied MTAS to three oral corpora stored in the TEI-based ISO standard for transcriptions of spoken language (ISO 24624:2016). These corpora differ from the corpus data that MTAS was developed for, because they include interactions with two and more speakers and are enriched, inter alia, with timeline-based annotations. In this contribution, we report our test results and address issues that arise when search frameworks originally developed for querying written corpora are transferred into the field of spoken language.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; querying; spoken language; Text Encoding Initiative; computational linguistics
    License:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  15. Improving Sentence Boundary Detection for Spoken Language Transcripts
    Published: 2020
    Publisher: Paris : European Language Resources Association


    This paper presents experiments on sentence boundary detection in transcripts of spoken dialogues. Segmenting spoken language into sentence-like units is a challenging task, due to disfluencies, ungrammatical or fragmented structures and the lack of punctuation. In addition, one of the main bottlenecks for many NLP applications for spoken language is the small size of the training data, as the transcription and annotation of spoken language is far more time-consuming and labour-intensive than processing written language. We therefore investigate the benefits of data expansion and transfer learning and test different ML architectures for this task. Our results show that data expansion is not straightforward and even data from the same domain does not always improve results. They also highlight the importance of modelling, i.e. of finding the best architecture and data representation for the task at hand. For the detection of boundaries in spoken language transcripts, we achieve a substantial improvement when framing the boundary detection problem as a sentence pair classification task, as compared to a sequence tagging approach.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Automatic speech recognition; spoken language; corpus; sentence boundaries; machine learning
    License:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  16. Fine-grained Named Entity Annotations for German Biographic Interviews
    Published: 2020
    Publisher: Paris : European Language Resources Association


    We present a fine-grained NER annotation scheme with 30 labels and apply it to German data. Building on the OntoNotes 5.0 NER inventory, our scheme is adapted for a corpus of transcripts of biographic interviews by adding categories for AGE and LAN(guage) and also adding label classes for various numeric and temporal expressions. Applying the scheme to the spoken data as well as a collection of teaser tweets from newspaper sites, we can confirm its generality for both domains, also achieving good inter-annotator agreement. We also show empirically how our inventory relates to the well-established 4-category NER inventory by re-annotating a subset of the GermEval 2014 NER coarse-grained dataset with our fine label inventory. Finally, we use a BERT-based system to establish some baselines for NER tagging on our two new datasets. Global results in in-domain testing are quite high on the two datasets, near what was achieved for the coarse inventory on the CoNLL 2003 data. Cross-domain testing produces much lower results due to the severe domain differences.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; spoken language; names; annotation; automatic speech recognition; oral history
    License:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  17. Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies


    The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Parse trees; social media; annotation; natural language; user-generated content
    License:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  18. A New Resource for German Causal Language
    Published: 2020
    Publisher: Paris : European Language Resources Association


    We present a new resource for German causal language, with annotations in context for verbs, nouns and adpositions. Our dataset includes 4,390 annotated instances for more than 150 different triggers. The annotation scheme distinguishes three different types of causal events (CONSEQUENCE, MOTIVATION, PURPOSE). We also provide annotations for semantic roles, i.e. of the cause and effect for the causal event as well as the actor and affected party, if present. In the paper, we present inter-annotator agreement scores for our dataset and discuss problems for annotating causal language. Finally, we present experiments where we frame causal annotation as a sequence labelling problem and report baseline results for the prediction of causal arguments and for predicting different types of causation.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Causality; corpus; German; annotation; natural language
    License:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  19. FOLKER : an annotation tool for efficient transcription of natural, multi-party interaction
    Published: 2014
    Publisher: Valletta, Malta : European Language Resources Association (ELRA)


    This paper presents FOLKER, an annotation tool developed for the efficient transcription of natural, multi-party interaction in a conversation analysis framework. FOLKER is being developed at the Institute for German Language in and for the FOLK project, whose aim is the construction of a large corpus of spoken present-day German, to be used for research and teaching purposes. FOLKER builds on the experience gained with multi-purpose annotation tools like ELAN and EXMARaLDA, but attempts to improve transcription efficiency by restricting and optimizing both data model and tool functionality to a single, well-defined purpose. This paper starts with a description of the GAT transcription conventions and the data model underlying the tool. It then gives an overview of the tool functionality and compares this functionality to that of other widely used tools.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Spoken language; corpus; transcription; computational linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  20. EXMARaLDA : un système pour la constitution et l’exploitation de corpus oraux
    Published: 2014
    Publisher: Limoges : Lambert-Lucas

    Source: BASE Fachausschnitt Germanistik
    Language: French
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Spoken language; transcription; computational linguistics; standardisation; corpus
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  21. Creating and working with spoken language corpora in EXMARaLDA
    Published: 2014
    Publisher: Bozen : Europ. Akad.


    Spoken language corpora, as used in conversation analytic research, language acquisition studies and dialectology, pose a number of challenges that are rarely addressed by corpus linguistic methodology and technology. This paper starts by giving an overview of the most important methodological issues distinguishing spoken language corpus work from the work with written data. It then shows what technological challenges these methodological issues entail and demonstrates how they are dealt with in the architecture and tools of the EXMARaLDA system.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Spoken language; corpus; computational linguistics; written language
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  22. Towards a syntactically motivated analysis of modifiers in German
    Published: 2016
    Publisher: Hildesheim : Universitätsverlag Hildesheim


    The Stuttgart-Tübingen Tagset (STTS) is a widely used POS annotation scheme for German which provides 54 different tags for analysis at the part-of-speech level. The tagset, however, does not distinguish between adverbs and different types of particles used for expressing modality, intensity, graduation, or to mark the focus of the sentence. In the paper, we present an extension to the STTS which provides tags for a more fine-grained analysis of modification, based on a syntactic perspective on parts of speech. We argue that the new classification not only enables us to do corpus-based linguistic studies on modification, but also improves statistical parsing. We give proof of concept by training a data-driven dependency parser on data from the TiGer treebank, providing the parser a) with the original STTS tags and b) with the new tags. Results show an improved labelled accuracy for the new, syntactically motivated classification.

     

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Annotation; automatic language analysis; corpus
    License:

    creativecommons.org/licenses/by/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

  23. POS error detection in automatically annotated corpora
    Autor*in: Rehbein, Ines
    Erschienen: 2016
    Verlag:  Stroudsburg, PA : ACL

    Recent work on error detection has shown that the quality of manually annotated corpora can be substantially improved by applying consistency checks to the data and automatically identifying incorrectly labelled instances. These methods, however, cannot be used for automatically annotated corpora, where errors are systematic and cannot easily be identified by looking at the variance in the data. This paper targets the detection of POS errors in automatically annotated corpora, so-called silver standards, showing that by combining different measures sensitive to annotation quality we can identify a large part of the errors and obtain a substantial increase in accuracy.

     

    Export in Literaturverwaltung
    Quelle: BASE Fachausschnitt Germanistik
    Sprache: Englisch
    Medientyp: Konferenzveröffentlichung
    Format: Online
    DDC Klassifikation: Sprache (400)
    Schlagworte: Korpus; Automatische Sprachanalyse; Annotation
    Lizenz:

    creativecommons.org/licenses/by/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

  24. The KiezDeutsch Korpus (KiDKo) Release 1.0
    Erschienen: 2016
    Verlag:  Paris : European Language Resources Association

    This paper presents the first release of the KiezDeutsch Korpus (KiDKo), a new language resource with multiparty spoken dialogues of Kiezdeutsch, a newly emerging language variety spoken by adolescents from multi-ethnic urban areas in Germany. The first release of the corpus includes the transcriptions of the data as well as a normalisation layer and part-of-speech annotations. In the paper, we describe the main features of the new resource and then focus on automatic POS tagging of informal spoken language. Our tagger achieves an accuracy of nearly 97% on KiDKo. While we did not succeed in further improving the tagger using ensemble tagging, we present our approach to using tagger ensembles for identifying error patterns in the automatically tagged data.

     

    Export in Literaturverwaltung
    Quelle: BASE Fachausschnitt Germanistik
    Sprache: Englisch
    Medientyp: Konferenzveröffentlichung
    Format: Online
    DDC Klassifikation: Sprache (400)
    Schlagworte: Gesprochene Sprache; Stadtmundart; Jugendsprache; Multikulturelle Gesellschaft; Korpus
    Lizenz:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  25. Discussing best practices for the annotation of Twitter microtext
    Erschienen: 2016
    Verlag:  Sofia : Bulgarian Academy of Sciences

    This paper contributes to the discussion on best practices for the syntactic analysis of non-canonical language, focusing on Twitter microtext. We present an annotation experiment where we test an existing POS tagset, the Stuttgart-Tübingen Tagset (STTS), with respect to its applicability for annotating new text from social media, in particular from Twitter microblogs. We discuss different tagset extensions proposed in the literature and test our extended tagset on a set of 506 tweets (7,418 tokens), where we achieve an inter-annotator agreement for two human annotators in the range of 92.7 to 94.4 (kappa). Our error analysis shows that especially the annotation of Twitter-specific phenomena such as hashtags and at-mentions causes disagreements between the human annotators. Following up on this, we provide a discussion of the different uses of the @- and #-markers on Twitter and argue against analysing both on the POS level by means of an at-mention or hashtag label. Instead, we sketch a syntactic analysis which describes these phenomena by means of syntactic categories and grammatical functions.

     

    Export in Literaturverwaltung
    Quelle: BASE Fachausschnitt Germanistik
    Sprache: Englisch
    Medientyp: Konferenzveröffentlichung
    Format: Online
    DDC Klassifikation: Sprache (400)
    Schlagworte: Syntaktische Analyse; Annotation; Twitter <Softwareplattform>
    Lizenz:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess