Search results

Word-level alignment of paper documents with their electronic full-text counterparts

Author(s): Müller, Mark-Christoph ; Ghosh, Sucheta ; Wittig, Ulrike ; Rey, Maja

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11083 https://ids-pub.bsz-bw.de/files/11083/Mueller_Ghosh_Word_level_alignment_2021.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-110839 https://doi.org/10.18653/v1/2021.bionlp-1.19

We describe a simple procedure for the automatic creation of word-level alignments between printed documents and their respective full-text versions. The procedure is unsupervised, uses standard, off-the-shelf components only, and reaches an F-score of 85.01 in the basic setup and up to 86.63 when using pre- and post-processing. Potential areas of application are manual database curation (incl. document triage) and biomedical expression OCR.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Computational linguistics; Full text; Optical character recognition; XML; Alignment
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

pyMMAX2: Deep access to MMAX2 projects from Python

Author(s): Müller, Mark-Christoph

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11084 https://ids-pub.bsz-bw.de/files/11084/Mueller_pyMMAX2_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-110848

pyMMAX2 is an API for processing MMAX2 stand-off annotation data in Python. It provides a lightweight basis for the development of code which opens up the Java- and XML-based ecosystem of MMAX2 for more recent, Python-based NLP and data science methods. While pyMMAX2 is pure Python, and most functionality is implemented from scratch, the API re-uses the complex implementation of the essential business logic for MMAX2 annotation schemes by interfacing with the original MMAX2 Java libraries. pyMMAX2 is available for download at github.com/nlpAThits/pyMMAX2.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Computational linguistics; Python; API; XML; Neuro-linguistic programming; Data science
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

Reconstructing manual information extraction with DB-to-document backprojection: Experiments in the life science domain

Author(s): Müller, Mark-Christoph ; Ghosh, Sucheta ; Rey, Maja ; Wittig, Ulrike ; Müller, Wolfgang ; Strube, Michael

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11085 https://ids-pub.bsz-bw.de/files/11085/Mueller_Reconstructing_manual_information_extraction_2020.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-110854 https://doi.org/10.18653/v1/2020.sdp-1.9

We introduce a novel scientific document processing task for making previously inaccessible information in printed paper documents available to automatic processing. We describe our data set of scanned documents and data records from the biological database SABIO-RK, provide a definition of the task, and report findings from preliminary experiments. Rigorous evaluation proved challenging due to lack of gold-standard data and a difficult notion of correctness. Qualitative inspection of results, however, showed the feasibility and usefulness of the task.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Computational linguistics; Information extraction; Written document; Experiment; Data analysis; Qualitative content analysis
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

Transparent, efficient, and robust word embedding access with WOMBAT

Author(s): Müller, Mark-Christoph ; Strube, Michael

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11086 https://ids-pub.bsz-bw.de/files/11086/Mueller_Strube_Transparent_efficient_and_robust_2018.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-110862

We present WOMBAT, a Python tool which supports NLP practitioners in accessing word embeddings from code. WOMBAT addresses common research problems, including unified access, scaling, and robust and reproducible preprocessing. Code that uses WOMBAT for accessing word embeddings is not only cleaner, more readable, and easier to reuse, but also much more efficient than code using standard in-memory methods: a Python script using WOMBAT for evaluating seven large word embedding collections (8.7M embedding vectors in total) on a simple SemEval sentence similarity task involving 250 raw sentence pairs completes in under ten seconds end-to-end on a standard notebook computer.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Python; Automatic language analysis; Code; Computational linguistics
License:	creativecommons.org/licenses/by/4.0/ ; info:eu-repo/semantics/openAccess

LRTwiki: enriching the likelihood ratio test with encyclopedic information for the extraction of relevant terms

Author(s): Jakob, Niklas ; Müller, Mark-Christoph ; Gurevych, Iryna

Published: 2022

Publisher: Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11090 https://ids-pub.bsz-bw.de/files/11090/Jakob_Mueller_Gurevych_LRTwiki_2009.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-110906

This paper introduces LRTwiki, an improved variant of the Likelihood Ratio Test (LRT). The central idea of LRTwiki is to employ a comprehensive domain specific knowledge source as additional “on-topic” data sets, and to modify the calculation of the LRT algorithm to take advantage of this new information. The knowledge source is created on the basis of Wikipedia articles. We evaluate on the two related tasks product feature extraction and keyphrase extraction, and find LRTwiki to yield a significant improvement over the original LRT in both tasks.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Library and information sciences (020); Language (400)
Keywords:	Likelihood ratio test; Encyclopedia; Information extraction; Data set; Algorithm; Wikipedia; Error analysis
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Flexible UIMA components for information retrieval research

Author(s): Müller, Christof ; Zesch, Torsten ; Müller, Mark-Christoph ; Bernhard, Delphine ; Ignatova, Kateryna ; Gurevych, Iryna ; Mühlhäuser, Max

Published: 2022

Publisher: Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11096 https://ids-pub.bsz-bw.de/files/11096/Mueller_Zesch_Flexible_UIMA_components_2008.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-110969

In this paper, we present a suite of flexible UIMA-based components for information retrieval research which have been successfully used (and re-used) in several projects in different application domains. Implementing the whole system as UIMA components is beneficial for configuration management, component reuse, implementation costs, analysis and visualization.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400); Library and information sciences (020)
Keywords:	Information retrieval; Configuration management; Information extraction; Data set; Research; Algorithm; Automatic language analysis; Information management
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Knowledge sources for bridging resolution in multi-party dialog

Author(s): Müller, Mark-Christoph ; Mieskes, Margot ; Strube, Michael

Published: 2022

Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11102 https://ids-pub.bsz-bw.de/files/11102/Mueller_Mieskes_Knowledge_sources_2008.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111024

In this paper we investigate the coverage of the two knowledge sources WordNet and Wikipedia for the task of bridging resolution. We report on an annotation experiment which yielded pairs of bridging anaphors and their antecedents in spoken multi-party dialog. Manual inspection of the two knowledge sources showed that, with some interesting exceptions, Wikipedia is superior to WordNet when it comes to the coverage of information necessary to resolve the bridging anaphors in our data set. We further describe a simple procedure for the automatic extraction of the required knowledge from Wikipedia by means of an API, and discuss some of the implications of the procedure’s performance.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Dialog; WordNet; Wikipedia; Spoken language; Information; Data set; Knowledge extraction; API; Discourse; Semantic Web; Lexicon
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

Resolving it, this, and that in unrestricted multi-party dialog

Author(s): Müller, Mark-Christoph

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11103 https://ids-pub.bsz-bw.de/files/11103/Mueller_Resolving_it_this_and_that_2007.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111034

We present an implemented system for the resolution of it, this, and that in transcribed multi-party dialog. The system handles NP-anaphoric as well as discourse-deictic anaphors, i.e. pronouns with VP antecedents. Selectional preferences for NP or VP antecedents are determined on the basis of corpus counts. Our results show that the system performs significantly better than a recency-based baseline.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	it; that; Dialog; Anaphora <syntax>; Noun phrase; Deixis; Discourse; Pronoun; Verb phrase; Corpus; Data
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

Automatic detection of nonreferential it in spoken multi-party dialog

Author(s): Müller, Mark-Christoph

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11104 https://ids-pub.bsz-bw.de/files/11104/Mueller_Automatic_detection_of_nonreferential_it_2006.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111048

We present an implemented machine learning system for the automatic detection of nonreferential it in spoken dialog. The system builds on shallow features extracted from dialog transcripts. Our experiments indicate a level of performance that makes the system usable as a preprocessing filter for a coreference resolution system. We also report results of an annotation study dealing with the classification of it by naive subjects.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	it; Dialog; Spoken language; Machine learning; Transcript; Corpus; Automatic classification; Computational linguistics
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

Off-the-shelf semantic author name disambiguation for bibliographic data bases

Author(s): Müller, Mark-Christoph ; Bannister, Adam ; Reitz, Florian

Published: 2022

Publisher: Cham : Springer ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication]

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11128 https://ids-pub.bsz-bw.de/files/11128/Mueller_Bannister_Reitz_Off_the_shelf_2019.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111280 https://doi.org/10.1007/978-3-030-30760-8_42

The demo presents a minimalist, off-the-shelf AND tool which provides a fundamental AND operation, the comparison of two publications with ambiguous authors, as an easily accessible HTTP interface. The tool implements this operation using standard AND functionality, but puts particular emphasis on advanced methods from natural language processing (NLP) for comparing publication title semantics.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Bibliographic database; Database; Publication; Automatic language analysis; Semantics; Open source
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Semantic author name disambiguation with word embeddings

Author(s): Müller, Mark-Christoph

Published: 2022

Publisher: Cham : Springer ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication]

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11135 https://ids-pub.bsz-bw.de/files/11135/Mueller_Semantic_author_name_disambiguation_2017.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111355 https://doi.org/10.1007/978-3-319-67008-9_24

We present a supervised machine learning AND system which tackles semantic similarity between publication titles by means of word embeddings. Word embeddings are integrated as external components, which keeps the model small and efficient, while allowing for easy extensibility and domain adaptation. Initial experiments show that word embeddings can improve the Recall and F score of the binary classification sub-task of AND. Results for the clustering sub-task are less clear, but also promising and overall show the feasibility of the approach.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Machine learning; Publication; Deep learning; Semantics; Computational linguistics
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

On the contribution of word-level semantics to practical author name disambiguation

Author(s): Müller, Mark-Christoph

Published: 2022

Publisher: New York : Association for Computing Machinery ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication]

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11137 https://ids-pub.bsz-bw.de/files/11137/Mueller_On_the_contribution_2018.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111370 https://doi.org/10.1145/3197026.3203912

We demonstrate the utility of word embedding-based semantic similarity methods for Author Name Disambiguation.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Semantics; Author; Digital library; Machine learning; Computational linguistics
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations

Author(s): Jakob, Niklas ; Weber, Stefan Hagen ; Müller, Mark-Christoph ; Gurevych, Iryna

Published: 2022

Publisher: New York : Association for Computing Machinery ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS) [secondary publication]

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11139 https://ids-pub.bsz-bw.de/files/11139/Jakob_Weber_Mueller_Beyond_the_stars_2009.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111390 https://doi.org/10.1145/1651461.1651473

In this paper we show that the extraction of opinions from free-text reviews can improve the accuracy of movie recommendations. We present three approaches to extract movie aspects as opinion targets and use them as features for the collaborative filtering. Each of these approaches requires different amounts of manual interaction. We collected a data set of reviews with corresponding ordinal (star) ratings of several thousand movies to evaluate the different features for the collaborative filtering. We employ a state-of-the-art collaborative filtering engine for the recommendations during our evaluation and compare the performance with and without using the features representing user preferences mined from the free-text reviews provided by the users. The opinion mining based features perform significantly better than the baseline, which is based on star ratings and genre information only.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Review; Film; Recommendation; Collaborative filtering; Data set; User; Automatic language analysis; Text analysis; Database; Data mining; Algorithm; Recommender system
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

A flexible stand-off data model with query language for multi-level annotation

Author(s): Müller, Mark-Christoph

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11153 https://ids-pub.bsz-bw.de/files/11153/Mueller_A_flexible_stand_off_data_model_2005.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111537 https://doi.org/10.3115/1225753.1225781

We present an implemented XML data model and a new, simplified query language for multi-level annotated corpora. The new query language involves automatic conversion of queries into the underlying, more complicated MMAXQL query language. It supports queries for sequential and hierarchical, but also associative (e.g. coreferential) relations. The simplified query language has been designed with non-expert users in mind.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Data model; Query language; XML; Corpus; Computational linguistics
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

A machine learning approach to pronoun resolution in spoken dialogue

Author(s): Strube, Michael ; Müller, Mark-Christoph

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11156 https://ids-pub.bsz-bw.de/files/11156/Strube_Mueller_A_machine_learning_approach_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111560 https://doi.org/10.3115/1075096.1075118

We apply a decision tree based approach to pronoun resolution in spoken dialogue. Our system deals with pronouns with NP- and non-NP-antecedents. We present a set of features designed for pronoun resolution in spoken dialogue and determine the most promising features. We evaluate the system on twenty Switchboard dialogues and show that it compares well to Byron’s (2002) manually tuned system.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Machine learning; Pronoun; Dialog; Spoken language; Decision tree; Corpus; Noun phrase
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

Multi-level annotation in MMAX

Author(s): Müller, Mark-Christoph ; Strube, Michael

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11159 https://ids-pub.bsz-bw.de/files/11159/Mueller_Strube_Multi_level_annotation_2003.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111596

We present a light-weight tool for the annotation of linguistic data on multiple levels. It is based on the simplification of annotations to sets of markables having attributes and standing in certain relations to each other. We describe the main features of the tool, emphasizing its simplicity, customizability and versatility.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Computational linguistics; Data; Corpus; Language data; Annotation
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

An API for discourse-level access to XML-encoded corpora

Author(s): Müller, Mark-Christoph ; Strube, Michael

Published: 2022

Publisher: Paris : European Language Resources Association (ELRA) ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11160 https://ids-pub.bsz-bw.de/files/11160/Mueller_Strube_An_API_for_discourse_level_access_2002.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111602

We describe a simple and efficient Java object model and application programming interface (API) for (possibly multi-modal) annotated natural language corpora. Corpora are represented as elements like Sentences, Turns, Utterances, Words, Gestures and Markables. The API allows linguists to access corpora in terms of these discourse-level elements, i.e. at a conceptual level they are familiar with, with the flexibility offered by a general purpose programming language. It is also a contribution to corpus standardization efforts because it is based on a straightforward and easily extensible data model which can serve as a target for conversion of different corpus formats.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	API; XML; Corpus; Natural language; Standardization; Data model; Software reuse
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

Applying co-training to reference resolution

Author(s): Müller, Mark-Christoph ; Rapp, Stefan ; Strube, Michael

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11164 https://ids-pub.bsz-bw.de/files/11164/Mueller_Rapp_Strube_Applying_co_training_2002.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111649 https://doi.org/10.3115/1073083.1073142

In this paper, we investigate the practical applicability of Co-Training for the task of building a classifier for reference resolution. We are concerned with the question if Co-Training can significantly reduce the amount of manual labeling work and still produce a classifier with an acceptable performance.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Computational linguistics; Corpus
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

Annotating anaphoric and bridging relations with MMAX

Author(s): Müller, Mark-Christoph ; Strube, Michael

Published: 2022

Publisher: Stroudsburg, Pennsylvania : Association for Computational Linguistics ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11165 https://ids-pub.bsz-bw.de/files/11165/Mueller_Annotating_anaphoric_and_bridging_relations_2001.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111657 https://doi.org/10.3115/1118078.1118090

We present a tool for the annotation of anaphoric and bridging relations in a corpus of written texts. Based on differences as well as similarities between these phenomena, we define an annotation scheme. We then implement the scheme within an annotation tool and demonstrate its use.

Source:	BASE, German studies subset
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Annotation; Anaphora <syntax>; Corpus; Computational linguistics; Written language; Data model; XML
License:	creativecommons.org/licenses/by-nc-sa/3.0/ ; info:eu-repo/semantics/openAccess

Information extraction with the Darmstadt Knowledge Processing Software Repository (Extended Abstract)

Author(s): Gurevych, Iryna ; Müller, Mark-Christoph

Published: 2022

Publisher: Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/11166 https://ids-pub.bsz-bw.de/files/11166/Gurevych_Mueller_Information_extraction_2008.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-111666

Current Natural Language Processing (NLP) systems feature high-complexity processing pipelines that require the use of components at different levels of linguistic and application specific processing. These components often have to interface with external libraries, e.g. for machine learning and information retrieval, as well as tools for human annotation and visualization. At the UKP Lab, we are working on the Darmstadt Knowledge Processing Software Repository (DKPro) (Gurevych et al., 2007a; Müller et al., 2008) to create a highly flexible, scalable and easy-to-use toolkit that allows rapid creation of complex NLP pipelines for semantic information processing on demand. The DKPro repository consists of several main parts created to serve the purposes of different NLP application areas.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Information extraction; Automatic language analysis; Machine learning; Information retrieval; Annotation; Information processing; Text mining
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess

Robust extraction of marked-up text sections from scientific document printouts

Author: Müller, Mark-Christoph

Published: 2024

Publisher: La Rochelle : La Rochelle University ; Mannheim : Leibniz-Institut für Deutsche Sprache (IDS)

Full text:	https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12427 https://ids-pub.bsz-bw.de/files/12427/Mueller_Robust_extraction_2022.pdf
Citable link:	https://nbn-resolving.org/urn:nbn:de:bsz:mh39-124275

We present a simple tool for extracting text and markup information from printouts of (not only) scientific documents. While the heavy-lifting OCR is done by off-the-shelf tesseract, our focus is on detection, extraction, and basic categorization of color-highlighted text sections, as well as on providing a framework for downstream processing of extraction results. The tool can be useful for document analysis tasks that must, or benefit from being able to, use printed paper.

Source:	BASE Fachausschnitt Germanistik
Language:	English
Media type:	Conference publication
Format:	Online
DDC classification:	Language (400)
Keywords:	Written document; Document; Optical character recognition; Categorization; Text technology
License:	rightsstatements.org/page/InC/1.0/ ; info:eu-repo/semantics/openAccess
