Search results

Lessons learned in quality management for online research software tools in linguistics

Author(s): Diewald, Nils ; Margaretha, Eliza ; Kupietz, Marc

Published: 2021

Publisher: Mannheim : Leibniz-Institut für Deutsche Sprache

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/10469 https://ids-pub.bsz-bw.de/files/10469/Diewald_Margaretha_Kupietz_Lessons_learned_in_quality_management_2021.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-104694 https://doi.org/10.14618/ids-pub-10469

In this paper, we present our experiences and decisions in dealing with challenges in developing, maintaining and operating online research software tools in the field of linguistics. In particular, we highlight reproducibility, dependability, and security as important aspects of quality management – taking into account the special circumstances in which research software is usually created.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Data quality; Software
License:	creativecommons.org/licenses/by/4.0/deed.de ; info:eu-repo/semantics/openAccess

Matrix and double-array representations for efficient finite state tokenization

Author(s): Diewald, Nils

Published: 2022

Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11109 https://ids-pub.bsz-bw.de/files/11109/Diewald_Matrix_and_double_array_representations_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111091

This paper presents an algorithm and an implementation for efficient tokenization of texts of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is compared with state-of-the-art approaches. The presented solution is faster than other tools while maintaining comparable quality.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Algorithm; Finite state space; Data structure; German; Corpus
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level

Author(s): Diewald, Nils ; Kupietz, Marc ; Lüngen, Harald

Published: 2022

Publisher: Mannheim : IDS-Verlag

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11124 https://ids-pub.bsz-bw.de/files/11124/Diewald_Kupietz_Tokenizing_on_scale_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111245

When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

Source:	BASE, German studies subset
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	English, Old English (420)
Keywords:	Corpus; Software; Automatic language analysis; Data; German
License:	creativecommons.org/licenses/by-sa/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

Tokenizing on scale. Preprocessing large text corpora on the lexical and sentence level

Author(s): Diewald, Nils ; Kupietz, Marc ; Lüngen, Harald

Published: 2022

Publisher: Mannheim : IDS-Verlag ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11146 https://ids-pub.bsz-bw.de/files/11146/Diewald_Kupietz_Luengen_Tokenizing_on_scale_22.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111464 https://doi.org/10.14618/ids-pub-11146

When comparing different tools in the field of natural language processing (NLP), the quality of their results usually has first priority. This is also true for tokenization. In the context of large and diverse corpora for linguistic research purposes, however, other criteria also play a role – not least sufficient speed to process the data in an acceptable amount of time. In this paper we evaluate several state-of-the-art tokenization tools for German – including our own – with regard to these criteria. We conclude that while not all tools are applicable in this setting, no compromises regarding quality need to be made.

Source:	BASE, German studies subset
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	English, Old English (420)
Keywords:	Corpus
License:	creativecommons.org/licenses/by-sa/4.0/deed.de ; info:eu-repo/semantics/openAccess

Building paths to corpus data. A multi-level least effort and maximum return approach

Author(s): Kupietz, Marc ; Diewald, Nils ; Margaretha, Eliza

Published: 2022

Publisher: Berlin/Boston : de Gruyter ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11286 https://ids-pub.bsz-bw.de/files/11286/Kupietz_Diewald_Margaretha_Building_paths_to_corpus_data_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-112864 https://doi.org/10.1515/9783110767377-007

Enabling appropriate access to linguistic research data, both for many researchers and for innovative research applications, is a challenging task. In this chapter, we describe how we address this challenge in the context of the German Reference Corpus DeReKo and the corpus analysis platform KorAP. The core of our approach, which is based on and tightly integrated into the CLARIN infrastructure, is to offer access at different levels. The graduated access levels make it possible to find a low-loss compromise between the possibilities opened up and the costs incurred by users and providers for each individual use case, so that, viewed over many applications, the ratio between effort and results achieved can be effectively optimized. We also report on experiences with the current state of this approach.

Source:	BASE, German studies subset
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Language data; German Reference Corpus (DeReKo); Corpus; Technical infrastructure; Sustainability
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

Access control by query rewriting: the case of KorAP

Author(s): Bański, Piotr ; Diewald, Nils ; Hanl, Michael ; Kupietz, Marc ; Witt, Andreas

Published: 2014

Publisher: Reykjavik : European Language Resources Association (ELRA)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/3136 https://ids-pub.bsz-bw.de/files/3136/Banski_Diewald_Hanl_Kupietz_Witt_Access%20control_2014.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-31366

We present an approach to an aspect of managing complex access scenarios to large and heterogeneous corpora that involves handling user queries that, intentionally or due to the complexity of the queried resource, target texts or annotations outside of the given user’s permissions. We first outline the overall architecture of the corpus analysis platform KorAP, devoting some attention to the way in which it handles multiple query languages, by implementing ISO CQLF (Corpus Query Lingua Franca), which in turn constitutes a component crucial for the functionality discussed here. Next, we look at query rewriting as it is used by KorAP and zoom in on one kind of this procedure, namely the rewriting of queries that is forced by data access restrictions.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Germanic languages; German (430)
Keywords:	Corpus
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

KorAP: the new corpus analysis platform at IDS Mannheim

Author(s): Bański, Piotr ; Bingel, Joachim ; Diewald, Nils ; Frick, Elena ; Hanl, Michael ; Kupietz, Marc ; Pȩzik, Piotr ; Schnober, Carsten ; Witt, Andreas

Published: 2014

Publisher: Poznań : Uniwersytet im. Adama Mickiewicza w Poznaniu

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/3261 https://ids-pub.bsz-bw.de/files/3261/Banski_KorAP_2013.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-32617

The KorAP project (“Korpusanalyseplattform der nächsten Generation”, “Corpus-analysis platform of the next generation”), carried out at the Institut für Deutsche Sprache (IDS) in Mannheim, Germany, has as its goal the development of a modern, state-of-the-art corpus-analysis platform, capable of handling very large corpora and opening the perspectives for innovative linguistic research. The platform will facilitate new linguistic findings by making it possible to manage and analyse extremely large amounts of primary data and annotations, while at the same time allowing an undistorted view of the primary un-annotated text, and thus fully satisfying expectations associated with a scientific tool. The project started in July 2011 and is funded till June 2014. The demo presentation in December will be the first version following a preliminary feature freeze, and will open the alpha testing phase of the project.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Germanic languages; German (430)
Keywords:	Corpus
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

KoralQuery - a General Corpus Query Protocol

Author(s): Bingel, Joachim ; Diewald, Nils

Published: 2015

Publisher: Linköping University Electronic Press, Linköpings universitet

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/4310 https://ids-pub.bsz-bw.de/files/4310/Bingel_Diewald_KoralQuery_2015.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-43102

The task-oriented and format-driven development of corpus query systems has led to the creation of numerous corpus query languages (QLs) that vary strongly in expressiveness and syntax. This is a severe impediment for the interoperability of corpus analysis systems, which lack a common protocol. In this paper, we present KoralQuery, a JSON-LD based general corpus query protocol, aiming to be independent of particular QLs, tasks and corpus formats. In addition to describing the system of types and operations that KoralQuery is built on, we exemplify the representation of corpus queries in the serialized format and illustrate use cases in the KorAP project.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Linguistics (410)
Keywords:	Corpus; Computational linguistics; Natural language processing
License:	creativecommons.org/licenses/by-nd/4.0/ ; info:eu-repo/semantics/openAccess

KorAP architecture – diving in the deep sea of corpus data

Author(s): Diewald, Nils ; Hanl, Michael ; Margaretha, Eliza ; Bingel, Joachim ; Kupietz, Marc ; Bański, Piotr ; Witt, Andreas

Published: 2016

Publisher: Paris : European Language Resources Association (ELRA)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/5036 https://ids-pub.bsz-bw.de/files/5036/Diewald_et_al_KorAP_architecture_2016.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-50361

KorAP is a corpus search and analysis platform, developed at the Institute for the German Language (IDS). It supports very large corpora with multiple annotation layers, multiple query languages, and complex licensing scenarios. KorAP’s design aims to be scalable, flexible, and sustainable to serve the German Reference Corpus DeReKo for at least the next decade. To meet these requirements, we have adopted a highly modular microservice-based architecture. This paper outlines our approach: an architecture consisting of small components that are easy to extend, replace, and maintain. The components include a search backend, a user and corpus license management system, and a web-based user frontend. We also describe a general corpus query protocol used by all microservices for internal communications. KorAP is open source, licensed under BSD-2, and available on GitHub.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Germanic languages; German (430)
Keywords:	Corpus
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Krill: KorAP search and analysis engine

Author(s): Diewald, Nils ; Margaretha, Eliza

Published: 2017

Publisher: Berlin : Gesellschaft für Sprachtechnologie und Computerlinguistik

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/6220 https://ids-pub.bsz-bw.de/files/6220/Diewald_Margaretha_Krill_KorAP_2016.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-62203

Source:	BASE, German studies subset
Language:	English
Media type:	Journal article
Format:	Online
DDC classification:	Germanic languages; German (430)
Keywords:	Corpus; Search engine; Automatic language analysis; Text technology
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

How to get the computation near the data: improving data accessibility to, and reusability of analysis functions in corpus query platforms

Author(s): Kupietz, Marc ; Diewald, Nils ; Fankhauser, Peter

Published: 2018

Publisher: Paris : European Language Resources Association (ELRA)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/7534 https://ids-pub.bsz-bw.de/files/7534/Kupietz_Diewald_Frankhauser_How_to_get_the_computation_2018.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-75346

The paper discusses use cases and proposals to increase the flexibility and reusability of components for analysis and further processing of analysis results in corpus query platforms by providing standardized interfaces to access data at multiple levels.

Source:	BASE, German studies subset
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Technology; Interoperability
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

What's New in EuReCo? Interoperability, Comparable Corpora, Licensing

Author(s): Kupietz, Marc ; Margaretha, Eliza ; Diewald, Nils ; Lüngen, Harald ; Fankhauser, Peter

Published: 2019

Publisher: Mannheim : Leibniz-Institut für Deutsche Sprache

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9026 https://ids-pub.bsz-bw.de/files/9026/Kupietz_Margaretha_Diewald_Luengen_Fankhauser_Whats_new_in_EuReCo_2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-90261 https://doi.org/10.14618/ids-pub-9026

This paper reports on the latest developments of the European Reference Corpus EuReCo and the German Reference Corpus in relation to three of the most important CMLC topics: interoperability, collaboration on corpus infrastructure building, and legal issues. Concerning interoperability, we present new ways to access DeReKo via KorAP on the API and on the plugin level. In addition, we report on advancements in the EuReCo and ICC initiatives with the provision of comparable corpora, and on recent problems with license acquisitions and our solution approaches using an indemnification clause and model licenses that include scientific exploitation.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus
License:	creativecommons.org/licenses/by/4.0/deed.de ; info:eu-repo/semantics/openAccess

Little strokes fell great oaks. Creating CoRoLa, the reference corpus of contemporary Romanian

Author(s): Tufiș, Dan ; Barbu Mititelu, Verginica ; Irimia, Elena ; Păiș, Vasile ; Ion, Radu ; Diewald, Nils ; Mitrofan, Maria ; Onofrei, Mihaela

Published: 2019

Publisher: Bucureşti : Editura Academiei Române

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9385 https://ids-pub.bsz-bw.de/files/9385/Tufis_Mititelu_Irimia_et_al_Little_strokes_fell_great_oaks_CoRoLa_2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-93851

The paper presents the quite long-standing tradition of Romanian corpus acquisition and processing, which reaches its peak with the reference corpus of contemporary Romanian language (CoRoLa). The paper describes decisions behind the kinds of texts collected, as well as processing and annotation steps, highlighting the structure and importance of metadata to the corpus. The reader is also introduced to the three ways in which (s)he can plunge into the rich linguistic data of the corpus, waiting to be discovered. Besides querying the corpus, word embeddings extracted from it are useful to various natural language processing applications and for linguists, when user-friendly interfaces offer them the possibility to exploit the data.

Source:	BASE, German studies subset
Language:	English
Media type:	Journal article
Format:	Online
DDC classification:	Language (400)
Keywords:	Romanian; Corpus; Annotation; Metadata
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

The KorAP user interface. Accessing CoRoLa via KorAP

Author(s): Diewald, Nils ; Barbu Mititelu, Verginica ; Kupietz, Marc

Published: 2019

Publisher: Bucureşti : Editura Academiei Române

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9386 https://ids-pub.bsz-bw.de/files/9386/Diewald_Mititelu_Kupietz_The_KorAP_user_interface_2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-93866

The user interfaces for corpus analysis platforms must provide a high degree of accessibility for ordinary users and at the same time provide the possibility to answer complex research questions. In this paper, we present the design concepts behind the user interface of KorAP, a corpus analysis platform that has evolved into the main gateway to CoRoLa, the Reference Corpus of Contemporary Romanian Language. Based on established principles of user interface design, we show how KorAP addresses the challenge of providing a user-friendly interface for heterogeneous corpus data to a wide range of users with different research questions.

Source:	BASE, German studies subset
Language:	English
Media type:	Journal article
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Software ergonomics; Romanian; User interface
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

How to find a shining needle in the haystack. Querying CoRoLa: solutions and perspectives

Author(s): Cristea, Dan ; Diewald, Nils ; Haja, Gabriela ; Mărănduc, Cătălina ; Barbu Mititelu, Verginica ; Onofrei, Mihaela

Published: 2019

Publisher: Bucureşti : Editura Academiei Române

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9387 https://ids-pub.bsz-bw.de/files/9387/Cristea_Diewald_Haja_et_al_How_to_find_a_shining_needle_Querying_CoRoLa_2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-93872

The present paper examines a variety of ways in which the Corpus of Contemporary Romanian Language (CoRoLa) can be used. A multitude of examples intends to highlight a wide range of interrogation possibilities that CoRoLa opens for different types of users. The querying of CoRoLa displayed here is supported by the KorAP frontend, through the querying language Poliqarp. Interrogations address annotation layers, such as the lexical, morphological and, in the near future, the syntactical layer, as well as the metadata. Other issues discussed are how to build a virtual corpus, how to deal with errors, how to find expressions and how to identify expressions.

Source:	BASE, German studies subset
Language:	English
Media type:	Journal article
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Query; Romanian; User interface; Search engine
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

RKorAPClient: An R Package for Accessing the German Reference Corpus DeReKo via KorAP

Author(s): Kupietz, Marc ; Diewald, Nils ; Margaretha, Eliza

Published: 2020

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9843 https://ids-pub.bsz-bw.de/files/9843/Kupietz_Diewald_Margaretha_RKorAPClient_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98430

Making corpora accessible and usable for linguistic research is a huge challenge in view of (too) big data, legal issues and a rapidly evolving methodology. This does not only affect the design of user-friendly graphical interfaces to corpus analysis tools, but also the availability of programming interfaces supporting access to the functionality of these tools from various analysis and development environments. RKorAPClient is a new research tool in the form of an R package that interacts with the Web API of the corpus analysis platform KorAP, which provides access to large annotated corpora, including the German reference corpus DeReKo with 45 billion tokens. In addition to optionally authenticated KorAP API access, RKorAPClient provides further processing and visualization features to simplify common corpus analysis tasks. This paper introduces the basic functionality of RKorAPClient and exemplifies various analysis tasks based on DeReKo, that are bundled within the R package and can serve as a basic framework for advanced analysis and visualization approaches.

Source:	BASE, German studies subset
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Visualization; Research data; R; Web services
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Recent developments in the European Reference Corpus EuReCo

Author(s): Kupietz, Marc ; Diewald, Nils ; Trawiński, Beata ; Cosma, Ruxandra ; Cristea, Dan ; Tufiş, Dan ; Váradi, Tamás ; Wöllstein, Angelika

Published: 2023

Publisher: Louvain-la-Neuve : Presses universitaires de Louvain ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11829 https://ids-pub.bsz-bw.de/files/11829/Kupietz_Diewald_Recent_developments_in_EuReCo_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-118291

This paper reports on recent developments within the European Reference Corpus EuReCo, an open initiative that aims at providing and using virtual and dynamically definable comparable corpora based on existing national, reference or other large corpora. Given the well-known shortcomings of other types of multilingual corpora such as parallel/translation corpora (shining-through effects, over-normalization, simplification, etc.) or web-based comparable corpora (covering only web material), EuReCo provides a unique linguistic resource offering new perspectives for fine-grained contrastive research on authentic cross-linguistic data, applications in translation studies and foreign language teaching and learning.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Research data; Language data; Contrastive linguistics; Translation studies; Foreign language teaching; Foreign language learning
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

News from the International Comparable Corpus. First launch of ICC written

Author(s): Kupietz, Marc ; Barbaresi, Adrien ; Čermáková, Anna ; Czachor, Małgorzata ; Diewald, Nils ; Ebeling, Jarle ; Górski, Rafał L. ; Margaretha, Eliza ; Kirk, John ; Křen, Michal ; Lüngen, Harald ; Oksefjell Ebeling, Signe ; Ó Meachair, Mícheál ; Pisetta, Ines ; Uí Dhonnchadha, Elaine ; Vogel, Friedemann ; Wilm, Rebecca ; Xu, Jiajin ; Yaddehige, Rameela

Published: 2023

Publisher: Mannheim : IDS-Verlag; Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12183 https://ids-pub.bsz-bw.de/files/12183/Kupietz_Barbaresi_Cermakova_ua_News_2023.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-121830 https://doi.org/10.14618/f8rt-m155

The International Comparable Corpus (ICC) (Kirk/Čermáková 2017; Čermáková et al. 2021) is an open initiative which aims to improve the empirical basis for contrastive linguistics by compiling comparable corpora for many languages and making them as freely available as possible as well as providing tools with which they can easily be queried and analysed. In this contribution we present the first release of written language parts of the ICC which includes corpora for Chinese, Czech, English, German, Irish (partly), and Norwegian. Each of the released corpora contains 400k words distributed over 14 different text categories according to the ICC specifications. Our poster covers the design basics of the ICC, its TEI encoding, a demonstration of using the ICC via different query tools, and an outlook on future plans. Similar to the European Reference Corpus EuReCo (Kupietz et al. 2020), ICC follows the approach of reusing existing linguistic resources wherever possible in order to cover as many languages as possible with realistic effort in as short a time as possible. In contrast to EuReCo, however, comparable corpus pairs are not defined dynamically in the usage phase, but the compositions of the corpora are fixed in the ICC design. The approaches are thus complementary in this respect. The design principles and composition of the ICC are based on those of the International Corpus of English (ICE) (Greenbaum (ed.) 1996), with the deviation that the ICC includes the additional text category blog post and excludes spoken legal texts (see Čermáková et al. 2021 for details). ICC’s fixed-design approach has the advantage that all single-language corpora in the ICC have the same composition with respect to the selected text types and that this guarantees that the selected broad spectrum of potential influencing variables for linguistic variation is always represented. 
The disadvantage, however, is that this can only be achieved for quite small corpora and that the generalisability of comparative findings based on the ...

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Contribution to an edited volume
Format:	Online
DDC classification:	Language (400)
License:	creativecommons.org/licenses/by-sa/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

Applying the newly extended European reference corpus EuReCo. Pilot studies of light-verb constructions in German, Romanian, Hungarian and Polish

Author(s): Bański, Piotr ; Diewald, Nils ; Kupietz, Marc ; Trawiński, Beata

Published: 2023

Publisher: Mannheim : IDS-Verlag

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12289 https://ids-pub.bsz-bw.de/files/12289/Banski_etal._Applying_the_newly_ext._europ._reference_2023.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-122898 https://doi.org/10.14618/f8rt-m155

It is well known that the distribution of lexical and grammatical patterns is size- and register-sensitive (Biber 1986, and later publications). This fact alone presents a challenge to many corpus-oriented linguistic studies focusing on a single language. When it comes to cross-linguistic studies using corpora, the challenge becomes even greater due to the lack of high-quality multilingual corpora (Kupietz et al. 2020; Kupietz/Trawiński 2022) that are comparable with respect to size and register. That was the motivation for the creation of the European Reference Corpus EuReCo, an initiative started in 2013 at the Leibniz Institute for the German Language (IDS) together with several European partners (Kupietz et al. 2020). EuReCo is an emerging federated corpus, with large virtual comparable corpora across various languages and with an infrastructure supporting contrastive research. The core of the infrastructure is KorAP (Diewald et al. 2016), a scalable open-source platform supporting the analysis and visualisation of properties of texts annotated by multiple and potentially conflicting information layers, and supporting several corpus query languages. Until recently, EuReCo consisted of three monolingual subparts: the German Reference Corpus DeReKo (Kupietz et al. 2018), the Reference Corpus of Contemporary Romanian Language (Barbu Mititelu/Tufiş/Irimia 2018), and the Hungarian National Corpus (Váradi 2002). The goal of the present submission is twofold. On the one hand, it reports on the new component of EuReCo: a sample of the National Corpus of Polish (Przepiórkowski et al. 2010). On the other hand, it presents the results of a new pilot study using the newly extended EuReCo. This pilot study investigates selected Polish collocations involving light verbs and their prepositional / nominal complements (Fig. 1) and extends the collocation analyses of German, Romanian and Hungarian (Fig. 2) discussed in Kupietz/Trawiński (2022).

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Contribution to an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Contrastive linguistics
License:	creativecommons.org/licenses/by-sa/3.0/de/deed.de ; info:eu-repo/semantics/openAccess

EuReCo: Not Building and Yet Using Federated Comparable Corpora for Cross-Linguistic Research

Author(s): Kupietz, Marc ; Bański, Piotr ; Diewald, Nils ; Trawiński, Beata ; Witt, Andreas

Published: 2024

Publisher: Paris : ELRA Language Resource Association ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12696 https://ids-pub.bsz-bw.de/files/12696/Kupietz_Banski_EuReCo_Not_Building_and_Yet_Using_Federated_Comparable_Corpora_2024.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-126961

This paper gives an overview of recent developments concerning the European Reference Corpus EuReCo, an open long-term initiative aimed at providing and using virtual and dynamically definable comparable corpora based on existing national, reference or other large corpora. Given the problems and shortcomings of other types of multilingual corpora – such as the shining-through effects in parallel corpora or the limitation to web material only in web-based comparable corpora – EuReCo constitutes a unique linguistic resource that offers new perspectives for fine-grained cross-linguistic research. The approach advocated here puts forward new solutions to notorious IPR and licensing issues, as well as to challenges of interoperability. It also addresses methodological questions concerning comparability and representativeness. While the focus of this paper is on EuReCo’s implementation-based approach to ensuring interoperability in a feasible and maintainable way, it also presents preliminary results of pilot comparative studies on light verb constructions in German, Romanian, Hungarian, Polish and Bulgarian, and reports on recent extensions and plans.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
Keywords:	Corpus; Multilingualism
License:	creativecommons.org/licenses/by-nc/4.0/deed.de ; info:eu-repo/semantics/openAccess

Managing access to language resources in a corpus analysis platform

Author(s): Illig, Eliza Margaretha ; Diewald, Nils ; Kamocki, Pawel ; Kupietz, Marc

Published: 2024

Publisher: Utrecht : CLARIN ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12889 https://ids-pub.bsz-bw.de/files/12889/Illig_Diewald_Kamocki_Kupietz_Managing_access_to_language_resources_2024.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-128890

Corpus query tools are crucial to CLARIN’s mission of facilitating the sharing and use of language data for research. It is a huge challenge for online corpus platforms to manage user access rights for large corpora with complex licenses and heterogeneous restrictions on access methods and purposes. This paper presents an approach to maximize user access to corpus data while protecting rights holders’ legitimate interests. Query rewriting techniques and authorization procedures allow for modelling license terms in detail, enabling broader applications. This offers an alternative to methods that only model a greatest common denominator of licenses, thereby limiting the possibilities for using the data. Our approach constitutes a flexible and extensible corpus license and user rights management component applicable to other language research environments.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Library and information sciences (020); Language (400)
Keywords:	Corpus
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

National library as corpus: introducing DeLiKo@DNB – a large synchronous German fiction corpus

Author(s): Kupietz, Marc ; Leinen, Peter ; Diewald, Nils ; Genêt, Philippe ; Wilm, Rebecca ; Witt, Andreas ; Yaddehige, Rameela

Published: 2025

Publisher: Geneva : Zenodo ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/13070 https://ids-pub.bsz-bw.de/files/13070/Kupietz_Leinen_Diewald_National_library_as_corpus_2025.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-130705 https://doi.org/10.5281/zenodo.14943116

This paper introduces DeLiKo@DNB, a large, linguistically annotated, and freely accessible contemporary corpus of German fiction. The corpus currently comprises 2 billion words from over 26,000 books published between 2005 and the present, spanning pulp and genre fiction as well as literary award-winning works. We provide a detailed account of the corpus composition, metadata, and key features. Additionally, we outline our approach to ensuring lawful and productive access by deploying an instance of the open-source corpus analysis platform KorAP within the German National Library.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Deutsche Nationalbibliothek; National library; Corpus; German; Annotation; Metadata
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess
