Search results

Semiautomatic data generation for academic Named Entity Recognition in German text corpora

Author: Schwarz, Pia

Published: 2024

Publisher: Vienna : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12842 https://ids-pub.bsz-bw.de/files/12842/Schwarz_Semiautomatic_data_generation_2024.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-128423

An NER model is trained to recognize three types of entities in academic contexts: person, organization, and research area. Training data is generated semiautomatically from newspaper articles with the help of word lists for the individual entity types, an off-the-shelf NE recognizer, and an LLM. Experiments fine-tuning a BERT model with different strategies of post-processing the automatically generated data result in several NER models achieving overall F1 scores of up to 92.45%.
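
As a rough illustration of the word-list step mentioned in the abstract, the following minimal sketch tags tokens with BIO labels by exact matching against per-type word lists. The word lists, the label names (`PER`, `ORG`, `AREA`), the function `label_tokens`, and the example sentence are all assumptions made for illustration; they are not taken from the paper, which additionally uses an off-the-shelf NE recognizer and an LLM that are not modeled here.

```python
from typing import Dict, List, Tuple

# Hypothetical word lists, one per entity type (not the paper's actual lists).
WORD_LISTS: Dict[str, List[str]] = {
    "PER": ["Erika Mustermann"],
    "ORG": ["Universität Mannheim"],
    "AREA": ["Computerlinguistik"],
}


def label_tokens(tokens: List[str], word_lists: Dict[str, List[str]]) -> List[str]:
    """Assign BIO tags by exact match against the word lists, longest entry first."""
    # Flatten the lists to (token tuple, label) pairs and skip empty entries.
    entries: List[Tuple[Tuple[str, ...], str]] = sorted(
        (
            (tuple(entry.split()), label)
            for label, items in word_lists.items()
            for entry in items
            if entry.split()
        ),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for entry, label in entries:
            n = len(entry)
            if tuple(tokens[i:i + n]) == entry:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{label}"
                i += n  # continue after the matched entity
                break
        else:
            i += 1  # no entry starts at this token
    return tags


# Whitespace tokenization is a simplification; a real pipeline would use a tokenizer.
tokens = "Erika Mustermann forscht an der Universität Mannheim zur Computerlinguistik .".split()
tags = label_tokens(tokens, WORD_LISTS)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Data labeled this way is noisy (exact matching misses inflected forms and cannot resolve ambiguous words), which is presumably why the abstract mentions post-processing strategies before fine-tuning.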

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Named Entity Recognition; German; Corpus; Large language model; Computational linguistics
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

Unlocking the corpus: enriching metadata with state-of-the-art NLP methodology and linked data

Authors: Ecker, Jennifer ; Fischer, Stefan ; Schwarz, Pia ; Trippel, Thorsten ; Werthmann, Antonina ; Wilm, Rebecca

Published: 2024

Publisher: Utrecht : CLARIN ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12853 https://ids-pub.bsz-bw.de/files/12853/Ecker_Fischer_Schwarz_Unlocking_the_corpus_2024.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-128531

In research data management, descriptive metadata are indispensable to describing data and are a key element in preparing data according to the FAIR principles (Wilkinson et al., 2016). Extracting semantic metadata from textual research data is currently not part of most metadata workflows, even more so if a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. Our approach is to add semantic metadata at the text level to facilitate the search over data. We show how to enrich metadata with three NLP methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or described by certain keywords, or to identify people, places, and organisations mentioned in texts without actually having to read them, and at the same time to facilitate the creation of task-tailored subcorpora. To enhance the usability of the data, we explore options based on the German Reference Corpus DeReKo, the largest linguistically motivated collection of German language material (Kupietz & Keibel, 2009; Kupietz et al., 2010, 2018), which contains multiple newspapers, books, transcriptions, etc., and enrich its metadata on the level of subportions, i.e. newspaper articles. We received access to a number of data files in DeReKo’s native XML format, I5. To develop the methodology, we focus on a single XML file containing all issues of one newspaper for a whole year. The following sections give only an overview of our approach; we intend, however, to provide a detailed description of the experiments and the selection of data in a subsequent longer contribution.
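
To make the idea of article-level keyword metadata concrete, here is a minimal sketch that ranks terms per article with TF-IDF. The abstract does not say which keyword extraction method the authors use, so TF-IDF is a stand-in chosen for illustration; the function names, the article IDs, and the toy texts are likewise assumptions, and no I5 XML parsing is attempted.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(articles: Dict[str, str], top_k: int = 3) -> Dict[str, List[str]]:
    """Return the top_k TF-IDF terms per article (ties broken alphabetically)."""
    token_lists = {aid: tokenize(text) for aid, text in articles.items()}
    n_docs = len(token_lists)
    # Document frequency: in how many articles does each term occur?
    doc_freq: Counter = Counter()
    for toks in token_lists.values():
        doc_freq.update(set(toks))
    keywords: Dict[str, List[str]] = {}
    for aid, toks in token_lists.items():
        counts = Counter(toks)
        scores = {
            term: (count / len(toks)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[aid] = ranked[:top_k]
    return keywords


# Toy stand-ins for newspaper articles (hypothetical IDs and texts).
ARTICLES = {
    "a1": "Wasserstoff Wasserstoff Energie Politik",
    "a2": "Fußball Fußball Politik Verein",
}
keywords = tfidf_keywords(ARTICLES, top_k=2)
print(keywords)
```

The resulting term lists could then be stored as a metadata field per article, so that a search over keywords does not require reading the texts themselves.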

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Article from an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Metadata; Natural language; Computational linguistics; Data management; Named Entity Recognition; German; XML
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

Möglichkeiten der Verknüpfung qualitativer und quantitativer Zugänge – Narrative von Wasserstoff

Authors: Kern, Lesley-Ann ; Meer, Dorothee

Published: 2024

Publisher: Mannheim : IDS-Verlag ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12870 https://ids-pub.bsz-bw.de/files/12870/Kern_Meer_Moeglichkeiten_der_Verknuepfung_qualitativer_2024.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-128700 https://doi.org/10.21248/idsopen.8.2024.29

Narratives (defined as a discursive structure based on local and temporal indications and actors as part of a development of action helping to solve problems) are similar to linguistic phenomena such as metaphors or argumentative patterns usually analyzed in qualitative linguistic research. In this paper, our idea is to combine these qualitative linguistic methods with quantitative corpus linguistic approaches such as named entity recognition (NER). We present a pilot study in which we use current NER technology to semi-automatically detect the narrative of energy partnership in media discourse. Based on a manual pre-study of a part of our corpus, we defined detectable expressions with a semantic relation to the subject that we link to occurrences of the narrative. Our study shows that the automatic detection of narratives still proves to be difficult, yet using NER we could identify pre-defined narratives in our corpus. This shows that quantitative approaches can support qualitative text analysis and should therefore be considered when working with digital corpora to gain extensive insights into the material.

Source:	BASE Fachausschnitt Germanistik
Language:	German
Media type:	Article from an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Narrative <social sciences>; Named Entity Recognition; Hydrogen; Discourse; Corpus
License:	creativecommons.org/licenses/by-sa/3.0/de/deed.de ; info:eu-repo/semantics/openAccess
