Results for *

16 results were found.

Showing results 1 to 16 of 16.

  1. Designing a verb guesser for part of speech tagging in Northern Sotho
    Published: 2024
    Publisher: London : Taylor & Francis ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication]

    The aim of this article is to describe the design and implementation of a verb guesser that will enhance the results of statistical part of speech (POS) tagging of verbs in Northern Sotho. It will be illustrated that verb stems in Northern Sotho can successfully be recognised by examining their suffixes and combinations of suffixes. Two approaches to verbal derivation analysis will be utilised, namely morphological analysis and corpus querying of suffixes and combinations of suffixes.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Journal article
    Format: Online
    DDC classification: Language (400)
    Keywords: Verb; Part of speech; Pedi language; Verb stem; Suffix; Computational linguistics; Data analysis
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  2. ‘Representativeness’, ‘Bad Data’, and legitimate expectations. What can an electronic historical corpus tell us that we didn’t actually know already (and how)?
    Published: 2024
    Publisher: Tübingen : Narr ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    The availability of electronic corpora of historical stages of languages has been welcomed as possibly attenuating the inherent problem of diachronic linguistics, i.e. that we only have access to what has chanced to come down to us, the problem which was memorably named by Labov (1992) as one of “Bad Data”. However, such corpora can only give us access to an increased amount of historical material, and this can essentially still only be a partial and possibly distorted picture of the actual language at a particular period of history. Corpora can be improved by taking a more representative sample of extant texts if these are available (as they are in significant numbers for periods after the invention of printing). But, as examples from the recently compiled GerManC corpus of seventeenth- and eighteenth-century German show, the evidence from such corpora can still fail to yield definitive answers to our questions about earlier stages of a language. The data still require expert interpretation, and it is important to be realistic about what can legitimately be expected from an electronic historical corpus.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Article in an edited volume
    Format: Online
    DDC classification: Language (400)
    Keywords: Historical linguistics; Corpus; Computational linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  3. Automatische Detektion von Neologismuskandidaten mit anschließender manueller Aufbereitung. Internal Progress Report IDS-KL-2007-01
    Published: 2024
    Publisher: Mannheim : Institut für Deutsche Sprache

    Source: BASE Fachausschnitt Germanistik
    Language: German
    Media type: Report
    Format: Online
    DDC classification: Language (400)
    Keywords: Neologism; Report; Vocabulary; Corpus; Computational linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  4. Effiziente halbautomatische Detektion von Neologismuskandidaten. Technical Report IDS-KL-2010-01
    Published: 2024
    Publisher: Mannheim : Institut für Deutsche Sprache

    The study presented here pursues a semi-automatic strategy: first, automatic procedures generate a candidate list in which, in the trade-off between recall and precision, recall is weighted more heavily. Recall is not maximised one-sidedly, however, since the list would otherwise be extremely long and nearly worthless. The automatically generated candidate list is then quickly screened manually (without any actual analysis), and clear non-neologisms are filtered out. This considerably increases precision, while recall remains largely unchanged at a high level. Only this filtered list serves as input for closer expert analyses. The semi-automatic approach consists of three phases in total, which are described in more detail below. The focus is on new lexemes; new meanings are not excluded, but for most of them it is unlikely that they can be detected with the method presented here.

    Source: BASE Fachausschnitt Germanistik
    Language: German
    Media type: Report
    Format: Online
    DDC classification: Language (400)
    Keywords: Neologism; Corpus; Word frequency; Computational linguistics; Vocabulary; Report
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  5. Building a historical corpus for Classical Portuguese: some technological aspects
    Published: 2024
    Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    This paper describes the restructuring process of a large corpus of historical documents and the system architecture that is used for accessing it. The initial challenge of this process was to get the most out of existing material, normalizing the legacy markup and harvesting the inherent information using widely available standards. This resulted in a conceptual and technical restructuring of the formerly existing corpus. The development of the standardized markup and techniques allowed the inclusion of important new materials, such as original 16th- and 17th-century prints and manuscripts, and enlarged the potential user groups. On the technological side, we were grounded on the premise that open standards are the best way of making sure that the resources will be accessible even after years in an archive. This is a welcome result in view of the additional consequence of the remodeled corpus concept: it serves as a repository for important historical documents, some of which had been preserved for 500 years in paper format. This very rich material can from now on be handled freely for linguistic research goals.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; Portuguese; Archiving; Annotation; Metadata; Language data; Computational linguistics
    License:

    creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

  6. CoGesT: a formal transcription system for conversational gesture
    Published: 2024
    Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    In order to create reusable and sustainable multimodal resources, a transcription model for hand and arm gestures in conversation is needed. We argue that transcription systems so far developed for sign language transcription and psychological analysis are not suitable for the linguistic analysis of conversational gesture. Such a model must adhere to a strict form-function distinction and be both computationally explicit and compatible with descriptive notations such as feature structures in other areas of computational and descriptive linguistics. We describe the development and evaluation of a suitable formal model using a feature-based transcription system, concentrating as a first step on arm gestures within the context of the development of an annotated video resource and gesture lexicon.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Transcription; Body language; Conversation; Gesture; Computational linguistics; Annotation; Conversation analysis
    License:

    creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

  7. Consistent storage of metadata in inference lexica: the MetaLex approach
    Published: 2024
    Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    With MetaLex we introduce a framework for metadata management where information can be inferred from different areas of metadata coding, such as metadata for catalogue descriptions, linguistic levels, or tiers. This is done for consistency and efficiency in metadata recording and applies the same inference techniques that are used for lexical inference. For this purpose we motivate the need for metadata descriptions on all document levels, describe the different structures of metadata, use existing metadata recommendations on different levels of annotations, and show a use case of metadata inference.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Metadata; Inference; Lexicon; Annotation; Computational linguistics
    License:

    creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

  8. A multi-view hyperlexicon resource for speech and language system development
    Published: 2024
    Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    New generations of integrated multimodal speech and language systems with dictation, readback or talking face facilities require multiple sources of lexical information for development and evaluation. Recent developments in hyperlexicon development offer new perspectives for the development of such resources, which are at the same time practically useful, computationally feasible, and theoretically well-founded. We describe the specification, three-level lexical document design principles, and implementation of a MARTIF document structure and several presentation structures for a terminological lexicon, including both on-demand access and full hypertext lexicon compilation. The underlying resource is a relational lexical database with SQL querying and access via a CGI internet interface. This resource is mapped onto the hypergraph structure which defines the macrostructure of the hyperlexicon.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: SGML; XML; Multimodality; Database; Computational linguistics
    License:

    creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

  9. The computational semantics of characters
    Published: 2024
    Publisher: Tilburg : Tilburg University ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    In this paper we present a new approach to the computational semantics of characters, which fills this gap: the orthographic projection of linguistic information, analogous to phonetic interpretation. We consider a number of use cases prior to discussion of three different perspectives. Adopting a holistic view of semantics, we discover that there are properties at this lower level which require similar specification to that at more well-studied levels, and which can coherently extend computational linguistic models to the domain of orthography.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Language data; Semantics; Model; Character; Computational linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  10. Automatic question answering for the linguistic domain – An evaluation of LLM knowledge base extension with RAG
    Published: 2024
    Publisher: Cham : Springer ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication]

    We investigate the extent to which Retrieval Augmented Generation improves the quality of Large Language Models’ answers to technical questions in the field of linguistics—a domain known for its broad terminological inventory and theory-dependent use of technical terms. Furthermore, this application is not only about terminological information on language, but also about information on its well-formedness. We present the results of an empirical evaluation of automatically generated answers based on authentic data from a language consulting service, with special emphasis on different question types.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Large language model; Terminology; Computational linguistics; Answer; Automatic language analysis
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  11. Semiautomatic data generation for academic Named Entity Recognition in German text corpora
    Author: Schwarz, Pia
    Published: 2024
    Publisher: Vienna : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    An NER model is trained to recognize three types of entities in academic contexts: person, organization, and research area. Training data is generated semiautomatically from newspaper articles with the help of word lists for the individual entity types, an off-the-shelf NE recognizer, and an LLM. Experiments fine-tuning a BERT model with different strategies of post-processing the automatically generated data result in several NER models achieving overall F1 scores of up to 92.45%.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Named entity recognition; German; Corpus; Large language model; Computational linguistics
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

  12. Labeling results of topic models: word sense disambiguation as key method for automatic topic labeling with GermaNet
    Published: 2024
    Publisher: Paris : European Language Resources Association ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    The combination of topic modeling and automatic topic labeling sheds light on understanding large corpora of text. It can be used to add semantic information to existing metadata. In addition, one can use the documents and the corresponding topic labels for topic classification. While there are existing algorithms for topic modeling readily accessible for processing texts, there is a need to postprocess the result to make the topics more interpretable and self-explanatory. The topic words from the topic model are ranked, and the first/top word could easily be considered as a label. However, it is imperative to use automatic topic labeling, because the highest-scored word is not necessarily the word that best sums up the topic. Using the lexical-semantic wordnet GermaNet, the first step is to disambiguate words that are represented in GermaNet with more than one sense. We show how to find the correct sense in the context of a topic with the method of word sense disambiguation. To enhance accuracy, we present a similarity measure based on vectors of topic words that considers semantic relations of the senses, demonstrating superior performance on the investigated cases compared to existing methods.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: GermaNet; Corpus; Semantic relation; Semantics; Computational linguistics
    License:

    creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

  13. Unlocking the corpus: enriching metadata with state-of-the-art NLP methodology and linked data
    Published: 2024
    Publisher: Utrecht : CLARIN ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    In research data management, descriptive metadata are indispensable for describing data and are a key element in preparing data according to the FAIR principles (Wilkinson et al., 2016). Extracting semantic metadata from textual research data is currently not part of most metadata workflows, even more so if a research data set can be subdivided into smaller parts, such as a newspaper corpus containing multiple newspaper articles. Our approach is to add semantic metadata at the text level to facilitate the search over data. We show how to enrich metadata with three NLP methods: named entity recognition, keyword extraction, and topic modeling. The goal is to make it possible to search for texts that are about certain topics or described by certain keywords, or to identify people, places, and organisations mentioned in texts without actually having to read them, and at the same time to facilitate the creation of task-tailored subcorpora. To enhance the usability of the data, we explore options based on the German Reference Corpus DeReKo, the largest linguistically motivated collection of German language material (Kupietz & Keibel, 2009; Kupietz et al., 2010, 2018), which contains multiple newspapers, books, transcriptions, etc., and enrich its metadata on the level of subportions, i.e. newspaper articles. We received access to a number of data files in DeReKo’s native XML format, I5. To develop the methodology, we focus on a single XML file containing all issues of one newspaper of a whole year. The following sections only give an overview of our approach; we intend, however, to provide a detailed description of the experiments and the selection of data in a subsequent longer contribution.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Article in an edited volume
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; Metadata; Natural language; Computational linguistics; Data management; Named entity recognition; German; XML
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

  14. Tools for historical corpus research, and a corpus of Latin
    Published: 2024
    Publisher: Tübingen : Narr ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    We present LatinISE, a Latin corpus for the Sketch Engine. LatinISE consists of Latin works comprising a total of 13 million words, covering the time span from the 2nd century BC to the 21st century AD. LatinISE is provided with rich metadata mark-up, including author, title, genre, era, date and century, as well as book, section, paragraph and line of verse. We have automatically annotated LatinISE with lemma and part-of-speech information, enabling users to search the corpus with a number of criteria, ranging from lemma, part-of-speech and context to subcorpora defined chronologically or by genre. We also illustrate word sketches, one-page summaries of a word’s corpus-based collocational behaviour. Our future plan is to produce word sketches for Latin words by adding richer morphological and syntactic annotation to the corpus.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Article in an edited volume
    Format: Online
    DDC classification: Language (400)
    Keywords: Corpus; Computational linguistics; Latin; Historical linguistics
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  15. New laws, new opportunities – the effect of the Digital Services Act and the Data Act on access to language data for research purposes
    Published: 2024
    Publisher: Utrecht : CLARIN ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    This abstract discusses the impact of the Digital Services Act (Regulation (EU) 2022/2065 of 19 October 2022) and of the Data Act (Regulation (EU) 2023/2854 of the European Parliament and of the Council of 13 December 2023) on access to language data for research purposes. These legal acts significantly contribute to the existing regulatory infrastructure for language research at the EU level.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Conference publication
    Format: Online
    DDC classification: Language (400)
    Keywords: Computational linguistics; Language data; Regulation
    License:

    rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

  16. Still no evidence for an effect of the proportion of non-native speakers on natural language complexity
    Published: 2024
    Publisher: Basel : MDPI ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

    In a recent study, I demonstrated that large numbers of L2 (second language) speakers do not appear to influence the morphological or information-theoretic complexity of natural languages. This paper has three primary aims: First, I address recent criticisms of my analyses, showing that the points raised by my critics were already explicitly considered and analysed in my original work. Furthermore, I show that the proposed alternative analyses fail to withstand detailed examination. Second, I introduce new data on the information-theoretic complexity of natural languages, with the estimates derived from various language models—ranging from simple statistical models to advanced neural networks—based on a database of 40 multilingual text collections that represent a wide range of text types. Third, I re-analyse the information-theoretic and morphological complexity data using novel methods that better account for model uncertainty in parameter estimation, as well as the genealogical relatedness and geographic proximity of languages. In line with my earlier findings, the results show no evidence that large numbers of L2 speakers have an effect on natural language complexity.

    Source: BASE Fachausschnitt Germanistik
    Language: English
    Media type: Journal article
    Format: Online
    DDC classification: Language (400)
    Keywords: Non-native speaker; Natural language; Language typology; Language statistics; Statistical analysis; Computational linguistics
    License:

    creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess