Search results

Introduction

Author(s): Cosma, Ruxandra ; Kupietz, Marc

Published: 2020

Publisher: Bucureşti : Editura Academiei Române

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9661 https://ids-pub.bsz-bw.de/files/9661/Cosma_Kupietz_Introduction_2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-96617

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Journal article
Format:	Online
DDC classification:	Language (400)
Keywords:	Romanian; Corpus; Scientific cooperation; Software
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

IBK- und Social Media-Korpora am Leibniz-Institut für Deutsche Sprache

Author(s): Lüngen, Harald ; Kupietz, Marc

Published: 2020

Publisher: Berlin [et al.] : de Gruyter

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9705 https://ids-pub.bsz-bw.de/files/9705/L%C3%BCngen_Kupietz_Deutsch_in_Sozialen_Medien_IBK-_und_Social_Media-Korpora_am_Leibniz-Institut_fr_Deutsche_Sprache_JB2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-97052 https://doi.org/10.1515/9783110679885-016

This contribution examines existing solutions and new possibilities for expanding the German Reference Corpus (DEREKO) with data from social media and internet-based communication (IBK). DEREKO is a collection of contemporary written-language corpora at the IDS that is offered to the linguistic community through the corpus interfaces COSMAS II and KorAP. Using definitions and examples, we first discuss the extensions and overlaps of the concepts social media, internet-based communication, and computer-mediated communication. We consider the legal prerequisites for expanding the corpus with social media data, which arise from German copyright law, recently reformed in relevant points, and from personality rights such as the European General Data Protection Regulation, and we present the consequences as well as possible and actual implementations. Building social media corpora from large volumes of text also faces corpus-technological challenges that were considered solved for traditional written corpora or did not arise at all. We report how questions of data preparation, corpus encoding, anonymization, and linguistic annotation of social media corpora were addressed for DEREKO and which challenges remain. We survey the landscape of available German-language IBK and social media corpora and give an overview of the IBK and social media corpora held in DEREKO and their characteristics (chat, wiki talk, and forum corpora), as well as of ongoing projects in this area. Using corpus-linguistic micro- and macro-analyses of Wikipedia discussions in comparison with the whole of DEREKO, we identify characteristic linguistic properties of Wikipedia discussions and assess their status as representatives of IBK corpora.

Source:	BASE Fachausschnitt Germanistik
Language:	German
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	German; Social media; Leibniz-Institut für Deutsche Sprache (IDS); Corpus; Internet communication
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Proceedings of the LREC 2020 Workshop, Language Resources and Evaluation Conference, 11–16 May 2020, 8th Workshop on Challenges in the Management of Large Corpora (CMLC-8)

Author(s): Bański, Piotr ; Barbaresi, Adrien ; Clematide, Simon ; Kupietz, Marc ; Lüngen, Harald ; Pisetta, Ines

Published: 2020

Publisher: Paris : European Language Resources Association (ELRA)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9811 https://ids-pub.bsz-bw.de/files/9811/Banski_Barbaresi_Clematide_Kupietz_Luengen_Pisetta_Proceedings_LREC_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98112

In order to satisfy the information needs of a wide range of researchers across a number of disciplines, large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation. This daunting set of tasks has coalesced into a number of key themes and questions that are of interest to the contributing research communities: (a) what sampling techniques can we apply? (b) what quality issues should we be aware of? (c) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (d) what affordances do visualisation techniques offer for the exploratory analysis approaches of corpora? (e) what legal paths can be followed in dealing with IPR and data protection issues governing both the data sources and the query results? (f) how to guarantee that corpus data remain available and usable in a sustainable way?

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Book (monograph)
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Computational linguistics; Research data; Data management
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Addressing Cha(lle)nges in Long-Term Archiving of Large Corpora

Author(s): Arnold, Denis ; Fisseni, Bernhard ; Kamocki, Paweł ; Schonefeld, Oliver ; Kupietz, Marc ; Schmidt, Thomas

Published: 2020

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9812 https://ids-pub.bsz-bw.de/files/9812/Arnold_Fisseni_Kamocki_et_al_Challenges_in_Long_Term_Archiving_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98129

This paper addresses long-term archival for large corpora. It focuses on three aspects specific to language resources: (1) the removal of resources for legal reasons, (2) versioning of (unchanged) objects in constantly growing resources, especially where objects can be part of multiple releases but also part of different collections, and (3) the conversion of data to new formats for digital preservation. The paper explains why language resources may have to be changed and why formats may need to be converted. As a solution, it suggests the use of an intermediate proxy object called a signpost. The approach is exemplified with respect to the corpora of the Leibniz Institute for the German Language in Mannheim, namely the German Reference Corpus (DeReKo) and the Archive for Spoken German (AGD).

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Long-term archiving; Right of use; File format
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Evaluating a Dependency Parser on DeReKo

Author(s): Fankhauser, Peter ; Do, Bich-Ngoc ; Kupietz, Marc

Published: 2020

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9813 https://ids-pub.bsz-bw.de/files/9813/Fankhauser_Do_Kupietz_Evaluating_a_dependency_parser_on_DeReKo_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98138

We evaluate a graph-based dependency parser on DeReKo, a large corpus of contemporary German. The dependency parser is trained on the German dataset from the SPMRL 2014 Shared Task which contains text from the news domain, whereas DeReKo also covers other domains including fiction, science, and technology. To avoid the need for costly manual annotation of the corpus, we use the parser’s probability estimates for unlabeled and labeled attachment as main evaluation criterion. We show that these probability estimates are highly correlated with the actual attachment scores on a manually annotated test set. On this basis, we compare estimated parsing scores for the individual domains in DeReKo, and show that the scores decrease with increasing distance of a domain to the training corpus.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Parser; Evaluation; Reliability; Computational linguistics
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

RKorAPClient: An R Package for Accessing the German Reference Corpus DeReKo via KorAP

Author(s): Kupietz, Marc ; Diewald, Nils ; Margaretha, Eliza

Published: 2020

Publisher: Paris : European Language Resources Association

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9843 https://ids-pub.bsz-bw.de/files/9843/Kupietz_Diewald_Margaretha_RKorAPClient_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98430

Making corpora accessible and usable for linguistic research is a huge challenge in view of (too) big data, legal issues and a rapidly evolving methodology. This does not only affect the design of user-friendly graphical interfaces to corpus analysis tools, but also the availability of programming interfaces supporting access to the functionality of these tools from various analysis and development environments. RKorAPClient is a new research tool in the form of an R package that interacts with the Web API of the corpus analysis platform KorAP, which provides access to large annotated corpora, including the German reference corpus DeReKo with 45 billion tokens. In addition to optionally authenticated KorAP API access, RKorAPClient provides further processing and visualization features to simplify common corpus analysis tasks. This paper introduces the basic functionality of RKorAPClient and exemplifies various analysis tasks based on DeReKo, that are bundled within the R package and can serve as a basic framework for advanced analysis and visualization approaches.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Language (400)
Keywords:	Corpus; Visualization; Research data; R; Web services
License:	creativecommons.org/licenses/by-nc/4.0/ ; info:eu-repo/semantics/openAccess

Recherche in Social-Media-Korpora mit KorAP

Author(s): Kupietz, Marc ; Diewald, Nils ; Margaretha, Eliza ; Bodmer Mory, Franck ; Stallkamp, Helge ; Harders, Peter

Published: 2020

Publisher: Berlin [et al.] : de Gruyter

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9894 https://ids-pub.bsz-bw.de/files/9894/Kupietz_u_a_Recherche_in_Social_Media_Korpora_mit_KorAP_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-98940 https://doi.org/10.1515/9783110679885-024

The corpus analysis platform KorAP is being developed at the Leibniz Institute for the German Language (IDS) as the successor to COSMAS II and provides comprehensive access to part of DeReKo (Kupietz et al. 2010). Although some functionality is still missing, KorAP is already ready for productive use. In the following, we use the study of social media corpora as an example to present some of its new possibilities and special features.

Source:	BASE Fachausschnitt Germanistik
Language:	German
Media type:	Article in an edited volume
Format:	Online
DDC classification:	Germanic languages; German (430)
Keywords:	Corpus; Social media; Online service; User interface
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess
