Results for *

16 results were found.

Showing results 1 to 16 of 16.

  1. Designing a verb guesser for part of speech tagging in Northern Sotho
    Published: 2024
    Publisher: London : Taylor & Francis ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication]

    The aim of this article is to describe the design and implementation of a verb guesser that will enhance the results of statistical part of speech (POS) tagging of verbs in Northern Sotho. It will be illustrated that verb stems in Northern Sotho can successfully be recognised by examining their suffixes and combinations of suffixes. Two approaches to verbal derivation analysis will be utilised, namely morphological analysis and corpus querying of suffixes and combinations of suffixes.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Journal article
    Format: Online
    DDC classification: Language (400)
    Keywords: Verb; Part of speech; Pedi language; Verb stem; Suffix; Computational linguistics; Data analysis
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  2. ‘Representativeness’, ‘Bad Data’, and legitimate expectations. What can an electronic historical corpus tell us that we didn’t actually know already (and how)?
    Published: 2024
    Publisher: Tübingen : Narr ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    The availability of electronic corpora of historical stages of languages has been welcomed as possibly attenuating the inherent problem of diachronic linguistics, i.e. that we only have access to what has chanced to come down to us, the problem which was memorably named by Labov (1992) as one of “Bad Data”. However, such corpora can only give us access to an increased amount of historical material, and this can essentially still only be a partial and possibly distorted picture of the actual language at a particular period of history. Corpora can be improved by taking a more representative sample of extant texts if these are available (as they are in significant numbers for periods after the invention of printing). But, as examples from the recently compiled GerManC corpus of seventeenth- and eighteenth-century German show, the evidence from such corpora can still fail to yield definitive answers to our questions about earlier stages of a language. The data still require expert interpretation, and it is important to be realistic about what can legitimately be expected from an electronic historical corpus.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Article in an edited volume
    Format: Online
    DDC classification: Language (400)
    Keywords: Historical linguistics; Corpus; Computational linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  3. Automatische Detektion von Neologismuskandidaten mit anschließender manueller Aufbereitung. Internal Progress Report IDS-KL-2007-01
    Published: 2024
    Publisher: Mannheim : Institut für Deutsche Sprache

    Source: BASE Fachausschnitt Germanistik
    Language: German
    Media type: Report
    Format: Online
    DDC classification: Language (400)
    Keywords: Neologism; Report; Vocabulary; Corpus; Computational linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  4. Effiziente halbautomatische Detektion von Neologismuskandidaten. Technical Report IDS-KL-2010-01
    Published: 2024
    Publisher: Mannheim : Institut für Deutsche Sprache

    The study presented here pursues a semi-automatic strategy: first, automatic procedures generate a candidate list in which, in the trade-off between recall and precision, recall is weighted more heavily. Recall is not maximised one-sidedly, however, since the list would otherwise be extremely long and nearly worthless. The automatically generated candidate list is then quickly screened manually (without any actual analysis), and clear non-neologisms are filtered out. This considerably increases precision, while recall remains largely unchanged at a high level. Only this filtered list serves as input for closer expert analyses. The semi-automatic approach consists of three phases in total, which are described in more detail below. The focus is on new lexemes; new meanings are not excluded, but for most of them it is unlikely that they can be detected with the method presented here.

    Source: BASE Fachausschnitt Germanistik
    Language: German
    Media type: Report
    Format: Online
    DDC classification: Language (400)
    Keywords: Neologism; Corpus; Word frequency; Computational linguistics; Vocabulary; Report
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  5. Building a historical corpus for Classical Portuguese: some technological aspects
    Published: 2024
    Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    This paper describes the restructuring process of a large corpus of historical documents and the system architecture that is used for accessing it. The initial challenge of this process was to get the most out of existing material, normalizing the legacy markup and harvesting the inherent information using widely available standards. This resulted in a conceptual and technical restructuring of the formerly existing corpus. The development of the standardized markup and techniques allowed the inclusion of important new materials, such as original 16th- and 17th-century prints and manuscripts, and enlarged the potential user groups. On the technological side, we were grounded on the premise that open standards are the best way of making sure that the resources will be accessible even after years in an archive. This is a welcome result in view of the additional consequence of the remodeled corpus concept: it serves as a repository for important historical documents, some of which had been preserved for 500 years in paper format. This very rich material can from now on be handled freely for linguistic research goals.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; Portuguese; Archiving; Annotation; Metadata; Language data; Computational linguistics
    License:

    creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

  6. CoGesT: a formal transcription system for conversational gesture
    Published: 2024
    Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    In order to create reusable and sustainable multimodal resources, a transcription model for hand and arm gestures in conversation is needed. We argue that transcription systems so far developed for sign language transcription and psychological analysis are not suitable for the linguistic analysis of conversational gesture. Such a model must adhere to a strict form-function distinction and be both computationally explicit and compatible with descriptive notations such as feature structures in other areas of computational and descriptive linguistics. We describe the development and evaluation of a suitable formal model using a feature-based transcription system, concentrating as a first step on arm gestures within the context of the development of an annotated video resource and gesture lexicon.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Transcription; Body language; Conversation; Gesture; Computational linguistics; Annotation; Conversation analysis
    License:

    creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

  7. Consistent storage of metadata in inference lexica: the MetaLex approach
    Published: 2024
    Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    With MetaLex we introduce a framework for metadata management where information can be inferred from different areas of metadata coding, such as metadata for catalogue descriptions, linguistic levels, or tiers. This is done for consistency and efficiency in metadata recording and applies the same inference techniques that are used for lexical inference. For this purpose we motivate the need for metadata descriptions on all document levels, describe the different structures of metadata, use existing metadata recommendations on different levels of annotations, and show a use case of metadata inference.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Metadata; Inference; Lexicon; Annotation; Computational linguistics
    License:

    creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

  8. A multi-view hyperlexicon resource for speech and language system development
    Published: 2024
    Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    New generations of integrated multimodal speech and language systems with dictation, readback or talking face facilities require multiple sources of lexical information for development and evaluation. Recent developments in hyperlexicon development offer new perspectives for the development of such resources, which are at the same time practically useful, computationally feasible, and theoretically well-founded. We describe the specification, three-level lexical document design principles, and implementation of a MARTIF document structure and several presentation structures for a terminological lexicon, including both on-demand access and full hypertext lexicon compilation. The underlying resource is a relational lexical database with SQL querying and access via a CGI internet interface. This resource is mapped onto the hypergraph structure which defines the macrostructure of the hyperlexicon.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: SGML; XML; Multimodality; Database; Computational linguistics
    License:

    creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

  9. The computational semantics of characters
    Published: 2024
    Publisher: Tilburg : Tilburg University ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    In this paper we present a new approach to the computational semantics of characters, which fills this gap: the orthographic projection of linguistic information, analogous to phonetic interpretation. We consider a number of use cases prior to discussion of three different perspectives. Adopting a holistic view of semantics, we discover that there are properties at this lower level which require similar specification to that at more well-studied levels, and which can coherently extend computational linguistic models to the domain of orthography.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Language data; Semantics; Model; Character; Computational linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  10. Automatic question answering for the linguistic domain – An evaluation of LLM knowledge base extension with RAG
    Published: 2024
    Publisher: Cham : Springer ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication]

    We investigate the extent to which Retrieval Augmented Generation improves the quality of Large Language Models’ answers to technical questions in the field of linguistics—a domain known for its broad terminological inventory and theory-dependent use of technical terms. Furthermore, this application is not only about terminological information on language, but also about information on its well-formedness. We present the results of an empirical evaluation of automatically generated answers based on authentic data from a language consulting service, with special emphasis on different question types.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Large language model; Terminology; Computational linguistics; Answer; Automatic language analysis
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  11. Semiautomatic data generation for academic Named Entity Recognition in German text corpora
    Author: Schwarz, Pia
    Published: 2024
    Publisher: Vienna : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    An NER model is trained to recognize three types of entities in academic contexts: person, organization, and research area. Training data is generated semiautomatically from newspaper articles with the help of word lists for the individual entity types, an off-the-shelf NE recognizer, and an LLM. Experiments fine-tuning a BERT model with different strategies of post-processing the automatically generated data result in several NER models achieving overall F1 scores of up to 92.45%.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Named entity recognition; German; Corpus; Large language model; Computational linguistics
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

  12. Labeling results of topic models: word sense disambiguation as key method for automatic topic labeling with GermaNet
    Published: 2024
    Publisher: Paris : European Language Resources Association ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    The combination of topic modeling and automatic topic labeling sheds light on understanding large corpora of text. It can be used to add semantic information to existing metadata. In addition, one can use the documents and the corresponding topic labels for topic classification. While there are existing algorithms for topic modeling readily accessible for processing texts, there is a need to postprocess the result to make the topics more interpretable and self-explanatory. The topic words from the topic model are ranked, and the first/top word could easily be considered as a label. However, it is imperative to use automatic topic labeling, because the highest-scored word is not necessarily the word that best sums up the topic. Using the lexical-semantic wordnet GermaNet, the first step is to disambiguate words that are represented in GermaNet with more than one sense. We show how to find the correct sense in the context of a topic with the method of word sense disambiguation. To enhance accuracy, we present a similarity measure based on vectors of topic words that considers semantic relations of the senses, demonstrating superior performance on the investigated cases compared to existing methods.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: GermaNet; Corpus; Semantic relation; Semantics; Computational linguistics
    License:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  13. Unlocking the corpus: enriching metadata with state-of-the-art NLP methodology and linked data
    Published: 2024
    Publisher: Utrecht : CLARIN ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    In research data management, descriptive metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles (Wilkinson et al., 2016). Extracting semantic metadata from textual research data is currently not part of most metadata workflows, even more so if a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. Our approach is to add semantic metadata at the text level to facilitate the search over data. We show how to enrich metadata with three NLP methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or described by certain keywords, or to identify people, places, and organisations mentioned in texts without actually having to read them, and at the same time to facilitate the creation of task-tailored subcorpora. To enhance the usability of the data, we explore options based on the German Reference Corpus DeReKo, the largest linguistically motivated collection of German language material (Kupietz & Keibel, 2009; Kupietz et al., 2010, 2018), which contains multiple newspapers, books, transcriptions, etc., and enrich its metadata on the level of subportions, i.e. newspaper articles. We received access to a number of data files in DeReKo’s native XML format, I5. To develop the methodology, we focus on a single XML file containing all issues of one newspaper of a whole year. The following sections only give an overview of our approach; we intend, however, to provide a detailed description of the experiments and the selection of data in a subsequent longer contribution.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Article in an edited volume
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; Metadata; Natural language; Computational linguistics; Data management; Named entity recognition; German; XML
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

  14. Tools for historical corpus research, and a corpus of Latin
    Published: 2024
    Publisher: Tübingen : Narr ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    We present LatinISE, a Latin corpus for the Sketch Engine. LatinISE consists of Latin works comprising a total of 13 million words, covering the time span from the 2nd century BC to the 21st century AD. LatinISE is provided with rich metadata mark-up, including author, title, genre, era, date and century, as well as book, section, paragraph and line of verse. We have automatically annotated LatinISE with lemma and part-of-speech information, enabling users to search the corpus with a number of criteria, ranging from lemma, part-of-speech and context to subcorpora defined chronologically or by genre. We also illustrate word sketches, one-page summaries of a word’s corpus-based collocational behaviour. Our future plan is to produce word sketches for Latin words by adding richer morphological and syntactic annotation to the corpus.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Article in an edited volume
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; Computational linguistics; Latin; Historical linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  15. New laws, new opportunities – the effect of the Digital Services Act and the Data Act on access to language data for research purposes
    Published: 2024
    Publisher: Utrecht : CLARIN ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    This abstract discusses the impact of the Digital Services Act (Regulation (EU) 2022/2065 of 19 October 2022) and of the Data Act (Regulation (EU) 2023/2854 of the European Parliament and of the Council of 13 December 2023) on access to language data for research purposes. These legal acts significantly contribute to the existing regulatory infrastructure for language research at the EU level.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Computational linguistics; Language data; Regulation
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  16. Still no evidence for an effect of the proportion of non-native speakers on natural language complexity
    Published: 2024
    Publisher: Basel : MDPI ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    In a recent study, I demonstrated that large numbers of L2 (second language) speakers do not appear to influence the morphological or information-theoretic complexity of natural languages. This paper has three primary aims: First, I address recent criticisms of my analyses, showing that the points raised by my critics were already explicitly considered and analysed in my original work. Furthermore, I show that the proposed alternative analyses fail to withstand detailed examination. Second, I introduce new data on the information-theoretic complexity of natural languages, with the estimates derived from various language models—ranging from simple statistical models to advanced neural networks—based on a database of 40 multilingual text collections that represent a wide range of text types. Third, I re-analyse the information-theoretic and morphological complexity data using novel methods that better account for model uncertainty in parameter estimation, as well as the genealogical relatedness and geographic proximity of languages. In line with my earlier findings, the results show no evidence that large numbers of L2 speakers have an effect on natural language complexity.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Journal article
    Format: Online
    DDC classification: Language (400)
    Keywords: Non-native speaker; Natural language; Language typology; Language statistics; Statistical analysis; Computational linguistics
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess