Word sense alignment and disambiguation for historical encyclopedias
This paper addresses the challenge of creating a knowledge graph from a corpus of historical encyclopedias, with a special focus on word sense alignment (WSA) and disambiguation (WSD). More precisely, we examine WSA and WSD approaches based on article similarity to link messy historical data, utilizing Wikipedia as a ground-truth component, since the lack of critical overlap in content, paired with the amount of variation between and within the encyclopedias, does not allow for choosing a "baseline" encyclopedia to align the others to. Additionally, we compare the disambiguation performance of classical methods like the Lesk algorithm to more recent approaches, i.e. using language models to disambiguate senses.
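As a rough illustration of the classical baseline named in the abstract, a minimal Lesk-style disambiguator picks the sense whose dictionary gloss shares the most words with the surrounding context. The sense inventory below is a made-up toy example, not data from the paper:

```python
def simplified_lesk(context_words, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Toy sense inventory for "bank" (illustrative glosses, not from a real lexicon)
senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land beside a body of water such as a river",
}
context = "the boat drifted toward the muddy land by the river".split()
print(simplified_lesk(context, senses))  # → bank/river
```

Language-model approaches replace the bag-of-words gloss overlap with similarity between contextual embeddings of the target word and sense descriptions.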

LRTwiki: enriching the likelihood ratio test with encyclopedic information for the extraction of relevant terms
This paper introduces LRTwiki, an improved variant of the Likelihood Ratio Test (LRT). The central idea of LRTwiki is to employ a comprehensive domain-specific knowledge source as additional "on-topic" data sets, and to modify the calculation of the LRT algorithm to take advantage of this new information. The knowledge source is created on the basis of Wikipedia articles. We evaluate on two related tasks, product feature extraction and keyphrase extraction, and find LRTwiki to yield a significant improvement over the original LRT in both tasks.
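For readers unfamiliar with the baseline, the LRT used for term extraction is commonly computed as a Dunning-style log-likelihood ratio over a term's counts in an on-topic versus a background corpus. The sketch below shows that generic baseline, not the paper's LRTwiki modification; the counts are invented:

```python
import math

def log_l(k, n, p):
    """Binomial log-likelihood, with a small epsilon guarding against log(0)."""
    eps = 1e-12
    return k * math.log(max(p, eps)) + (n - k) * math.log(max(1 - p, eps))

def llr(k1, n1, k2, n2):
    """Log-likelihood ratio: is the term over-represented in corpus 1 vs. corpus 2?"""
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2 * (log_l(k1, n1, p1) + log_l(k2, n2, p2)
                - log_l(k1, n1, p) - log_l(k2, n2, p))

# Invented counts: a term seen 50 times in 10,000 on-topic tokens
# vs. 60 times in 1,000,000 background tokens scores a high LLR.
print(llr(50, 10_000, 60, 1_000_000))
```

LRTwiki's contribution is to supply the "on-topic" counts from Wikipedia-derived domain data rather than from the (often small) target corpus alone.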
|
Export in Literaturverwaltung |
|
Knowledge sources for bridging resolution in multi-party dialog
In this paper we investigate the coverage of the two knowledge sources WordNet and Wikipedia for the task of bridging resolution. We report on an annotation experiment which yielded pairs of bridging anaphors and their antecedents in spoken multi-party dialog. Manual inspection of the two knowledge sources showed that, with some interesting exceptions, Wikipedia is superior to WordNet when it comes to the coverage of information necessary to resolve the bridging anaphors in our data set. We further describe a simple procedure for the automatic extraction of the required knowledge from Wikipedia by means of an API, and discuss some of the implications of the procedure’s performance.
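The abstract does not specify which API calls the extraction procedure used; as a hypothetical sketch, the public MediaWiki Action API can return, for example, the categories of an article, from which relatedness knowledge for anaphor-antecedent pairs could be mined. The query builder and page title below are illustrative assumptions, and the parser runs against a canned response in the documented shape so it works offline:

```python
import json
from urllib.parse import urlencode

def build_category_query(title):
    """Build a MediaWiki Action API URL asking for a page's categories."""
    params = {
        "action": "query",
        "prop": "categories",
        "titles": title,
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

def extract_categories(api_json):
    """Pull category names out of an Action API response."""
    cats = []
    for page in api_json["query"]["pages"].values():
        for c in page.get("categories", []):
            cats.append(c["title"])
    return cats

# Canned response (illustrative), shaped like a real Action API reply
canned = {"query": {"pages": {"123": {"title": "Espresso",
          "categories": [{"title": "Category:Coffee drinks"}]}}}}
print(build_category_query("Espresso"))
print(extract_categories(canned))  # → ['Category:Coffee drinks']
```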

Learning from students. On the design and usability of an e-dictionary of mathematical graph theory
We created a prototype of an electronic dictionary for the mathematical domain of graph theory. We evaluate our prototype and compare its effectiveness in task-based tests with that of Wikipedia. Our dictionary is based on a corpus; the terms and their definitions were automatically extracted and annotated by experts (cf. Kruse/Heid 2020). The dictionary is bilingual, covering German and English; it gives equivalents, definitions and semantically related terms. For the implementation of the dictionary, we used LexO (Bellandi et al. 2017). The target group of the dictionary is students of mathematics who attend lectures in German and work with English resources. We carried out tests to understand which items the students search for when they work on graph-theoretical tasks. We ran the same test twice, with comparable student groups, allowing either Wikipedia or our dictionary as an information source. The dictionary seems to be especially helpful for students who already have a vague idea of a term, because they can use the resource to check whether their idea is right.

A comparable Wikipedia corpus: from wiki syntax to POS tagged XML
To build a comparable Wikipedia corpus of German, French, Italian, Norwegian, Polish and Hungarian for contrastive grammar research, we used a set of XSLT stylesheets to transform the MediaWiki annotations to XML. Furthermore, the data has been annotated with word-class information using different taggers. The outcome is a corpus with rich metadata and linguistic annotation that can be used for multilingual research on various linguistic topics.
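The real conversion is done with XSLT stylesheets; purely to illustrate the kind of mapping involved, the toy converter below rewrites two pieces of wiki syntax (bold markup and internal links) as XML elements. The element names `hi` and `ref` are invented for this sketch and are not the corpus schema:

```python
import re

def wiki_to_xml(text):
    """Toy wiki-syntax-to-XML converter: bold text and internal links only."""
    # '''bold''' → <hi rend='bold'>…</hi>
    text = re.sub(r"'''(.+?)'''", r"<hi rend='bold'>\1</hi>", text)
    # [[Target|label]] → <ref target='Target'>label</ref>
    text = re.sub(r"\[\[([^\]|]+)\|([^\]]+)\]\]", r"<ref target='\1'>\2</ref>", text)
    # [[Target]] → <ref target='Target'>Target</ref>
    text = re.sub(r"\[\[([^\]]+)\]\]", r"<ref target='\1'>\1</ref>", text)
    return f"<p>{text}</p>"

print(wiki_to_xml("'''Oslo''' is the capital of [[Norway]]."))
```

A full pipeline must additionally handle templates, tables, headings and nested markup, which is why declarative XSLT over a parsed dump is the more robust choice.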

CMC Corpora in DeReKo
We introduce three types of corpora of computer-mediated communication that have recently been compiled at the Institute for the German Language or curated from an external project and included in DeReKo, the German Reference Corpus: Wikipedia (discussion) corpora, the Usenet news corpus, and the Dortmund Chat Corpus. The data and corpora have been converted to I5, the TEI customization used to represent texts in DeReKo, and are researchable via the web-based IDS corpus research interfaces; the Wikipedia and chat data are additionally downloadable from the IDS repository and download server, respectively.

Studying the distribution of reply relations in Wikipedia talk pages
This paper presents an extended annotation and analysis of interpretative reply relations, focusing on a comparison of reply relation types and targets between conflictual and neutral pages of German Wikipedia (WP) talk pages. We briefly present the different categories identified for interpretative reply relations to analyze the relationship between WP postings, as well as linguistic cues for each category. We investigate referencing strategies of WP authors in discussion page postings, illustrated by means of reply relation types and targets, taking into account the degree of disagreement displayed on a WP talk page. We provide richly annotated data that can be used for further analyses, such as the identification of interactional relations on higher levels, or as training data for machine learning algorithms.

Investigating reply relations on Wikipedia talk pages to reconstruct interactional strategies of Wikipedia authors
This chapter presents the annotation and analysis of interpretative reply relations on Wikipedia talk pages using data from the WikiDemoCorpus (WDC). Building on an approach of annotating interpretative reply relations to analyze these relations in Wikipedia talk page posts, the chapter presents nine reply relation categories found in the German WDC. Additionally, linguistic cues for each category and the Wikipedia discussion pages overall are explained in detail, illustrated through reply relation targets. The results of the linguistic annotation are threefold: First, we provide an annotation scheme that can be used by third parties to produce more data according to their needs. Second, we shed light on and quantify the numerous ways Wikipedia authors reply to each other’s posts on talk pages. Finally, we provide richly annotated data that can be used for further analyses, such as identifying interactional relations on higher levels or training tasks in machine learning algorithms.

Investigating interaction signs across genres, modes and languages: The example of OKAY
This paper presents results of a case study that compared the usage of OKAY across genre types (Wikipedia articles vs. talk pages), across modes (spoken vs. written language), and across languages (German vs. French CMC data from Wikipedia talk pages). The cross-genre study builds on the results of Herzberg (2016), who compared the usage of OKAY in German Wikipedia articles with its usage in Wikipedia talk pages. These results also form the basis for comparing the CMC genre of Wikipedia talk pages with occurrences of OKAY in the German spoken language corpus FOLK. Finally, we compared the results on the usage of OKAY in German Wikipedia talk pages with the usage of OKAY in French Wikipedia talk pages. With our case study, we want to demonstrate that it is worthwhile to investigate interaction signs across genres and languages, and to compare their usage in written CMC with their usage in spoken interaction.

Linguistische Wikipedistik
Wikipedia is not only the world's largest online encyclopedia but also one of the most successful Web 2.0 projects: in just 16 years, roughly 48 million entries have been created across 295 language versions (Wikimedia 2018). Ranked 5th in the Alexa ranking, Wikipedia is one of the most heavily used platforms on the internet (Alexa 2018). Owing to its relevance and reach, Wikipedia is also intensively researched. The page "Wikipedistik" (WP-Wikipedistik; Wikipedia 2018) in the meta area of the German-language Wikipedia provides an overview of national and international research activities and results. The disciplines involved, the research questions, and the methodological approaches of Wikipedia studies are diverse. Hammwöhner (2007) deals with methods and results of quality assessment of Wikipedia articles from an information-science perspective. Pscheida (2010) examines Wikipedia from the perspective of the sociology of knowledge and uses the example of Wikipedia to substantiate interesting theses on the "knowledge culture of the digital age" (Pscheida 2010: 458 ff.). Stegbauer (2009) examines the social role structure and the motivation of the actors in the German Wikipedia and offers an empirically very well-supported insight into the social processes within the project. In this contribution, we give an overview of current research on Wikipedia from the perspective of language and discourse analysis. First (Sections 2.1–2.4), we illustrate the potential of Wikipedia as a research object in four thematic fields: text and interaction, discourse linguistics, multimodality, and cross-linguistic and cross-cultural comparison. The subsequent Section 2.5, "Wikipedaktik", deals with Wikipedia as a worthwhile learning object in schools and universities.
Wikipedia is not only interesting as a resource that vividly illustrates the characteristics of digital discourses, multimodal hypertexts, and collaborative writing and negotiation processes. It is also a project of free knowledge, ...

Investigating OKAY across genres, modes and languages: A corpus-based study on German and French
In our study, we used the spoken language corpus FOLK and the Wikipedia corpus family, provided by the Institute for the German Language (IDS) in Mannheim, to examine the usage of OKAY in various spelling and pronunciation variants across genre types (Wikipedia articles vs. talk pages), across modes (transcribed spoken vs. written language), and across languages (German vs. French Wikipedia talk pages). Our comparison of German Wikipedia talk and article pages made evident that OKAY is used far more frequently in the CMC-like Wikipedia talk pages than in the text-like Wikipedia articles. The comparison of the CMC data with the FOLK corpus of transcribed spoken language revealed interesting differences in the distribution of functional and topological features. The results suggest the emergence of particular functions and usage patterns for OKAY in written CMC that differ from the patterns observed in spoken interaction. The comparison of German and French Wikipedia talk pages yielded common usage patterns in both languages, e.g. the preference for "speedy" spelling variants (ok, OK, Ok) and a similar distribution of topological features, but also differences in the distribution of functional features.
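A variant-frequency comparison like the one described above starts from counting surface spellings per corpus. The toy counter below illustrates the idea; the variant list is reduced to okay/ok for the sketch and the sample sentence is invented, not corpus data:

```python
import re
from collections import Counter

# Toy pattern for two surface variants; "okay" must precede "ok" in the
# alternation so the longer form is matched in full.
VARIANTS = re.compile(r"\b(okay|ok)\b", re.IGNORECASE)

def variant_counts(text):
    """Count spelling variants of OKAY, preserving capitalization (ok vs. OK vs. Ok)."""
    return Counter(m.group(0) for m in VARIANTS.finditer(text))

sample = "Ok, das passt. OK! okay, dann machen wir das so. ok ok"
print(variant_counts(sample))  # → Counter({'ok': 2, 'Ok': 1, 'OK': 1, 'okay': 1})
```

Per-corpus counts of this kind, normalized by corpus size, are what make the talk-page vs. article and German vs. French comparisons possible.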