Search results

Abfrage und Analyse von Korpusbelegen

Author: Lemnitzer, Lothar ; Diewald, Nils

Published: 2022

Publisher: Paderborn : Wilhelm Fink ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11069 https://ids-pub.bsz-bw.de/files/11069/Lemnitzer_Diewald_Abfrage_und_Analyse_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-110697

In this chapter, we first introduce basic concepts of query systems and query languages for searching corpora. These concepts are intended to help you better understand and compare the individual query languages. The common query languages differ in many details. In the second part, we present these details, along with the capabilities and limits of each query language, using many example tasks with matching solutions in three query languages each.

Source:	BASE Fachausschnitt Germanistik
Language:	German
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Language data; Corpus; Query language; Database; Research data
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Matrix and double-array representations for efficient finite state tokenization

Author: Diewald, Nils

Published: 2022

Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11109 https://ids-pub.bsz-bw.de/files/11109/Diewald_Matrix_and_double_array_representations_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111091

This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Algorithm; Finite state space; Data structure; German; Corpus
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level

Author: Diewald, Nils ; Kupietz, Marc ; Lüngen, Harald

Published: 2022

Publisher: Mannheim : IDS-Verlag

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11124 https://ids-pub.bsz-bw.de/files/11124/Diewald_Kupietz_Tokenizing_on_scale_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111245

When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	English, Old English (420)
Keywords:	Corpus; Software; Automatic language analysis; Data; German
License:	creativecommons.org/licenses/by-sa/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level

Author: Diewald, Nils ; Kupietz, Marc ; Lüngen, Harald

Published: 2022

Publisher: Mannheim : IDS-Verlag ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11146 https://ids-pub.bsz-bw.de/files/11146/Diewald_Kupietz_Luengen_Tokenizing_on_scale_22.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111464 https://doi.org/10.14618/ids-pub-11146

When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	English, Old English (420)
Keywords:	Corpus
License:	creativecommons.org/licenses/by-sa/4.0/deed.de ; info:eu-repo/semantics/openAccess

Building paths to corpus data. A multi-level least effort and maximum return approach

Author: Kupietz, Marc ; Diewald, Nils ; Margaretha, Eliza

Published: 2022

Publisher: Berlin/Boston : de Gruyter ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11286 https://ids-pub.bsz-bw.de/files/11286/Kupietz_Diewald_Margaretha_Building_paths_to_corpus_data_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-112864 https://doi.org/10.1515/9783110767377-007

Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Language data; German Reference Corpus (DeReKo); Corpus; Technical infrastructure; Sustainability
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess
