Workshop on extracting structured knowledge from scientific publications (ESSP)

Workshop Description

Scientific knowledge is one of the greatest assets of humankind. This knowledge is recorded and disseminated in scientific publications, and the body of scientific literature is growing at an enormous rate. Automatic methods for processing and cataloguing that information are necessary to help scientists navigate this vast amount of information, and to facilitate automated reasoning, discovery and decision making on that data.

Structured information can be extracted at different levels of granularity. Previous and ongoing work has focused on bibliographic information (segmentation and linking of referenced literature; Wick et al., 2013), keyword extraction and categorization (e.g., identifying the tasks, materials and processes central to a publication; Augenstein et al., 2017), and cataloguing research findings. Scientific discoveries can often be represented as pairwise relationships, e.g., protein-protein (Mallory et al., 2016), drug-drug (Segura-Bedmar et al., 2013), and chemical-disease (Li et al., 2016) interactions, or as more complicated networks such as action graphs describing scientific procedures (e.g., synthesis recipes in materials science; Mysore et al., 2017). Information extracted with such methods can be enriched with time-stamps and other meta-information, such as indicators of uncertainty or limitations of the discovered facts (Zhou et al., 2015).

Structured representations, such as knowledge graphs, summarize information from a variety of sources in a convenient and machine-readable format. Graph representations that link the information from a large body of publications can reveal patterns and lead to the discovery of new information that would not be apparent from the analysis of a single publication. This kind of aggregation can lead to new scientific insights (Kim et al., 2017), and it can also help to detect trends (Prabhakaran et al., 2016) or find experts in a particular scientific area (Neshati et al., 2014).

While various workshops have focused separately on several aspects – extraction of information from scientific articles, building and using knowledge graphs, the analysis of bibliographical information, graph algorithms for text analysis – the proposed workshop focuses on processing scientific articles and creating structured repositories such as knowledge graphs for finding new information and making scientific discoveries. The aim of this workshop is to identify the necessary representations for facilitating automated reasoning over scientific information, and to bring together experts in natural language processing and information extraction with scientists from other domains (e.g. material sciences, biomedical research) who want to leverage the vast amount of information stored in scientific publications.

Potential topics:

We invite submissions on (but not limited to) the following topics:

  • Information extraction from scientific publications
    • identification of concepts in scientific articles (in various domains)
    • extraction of relationships in scientific articles (in various domains) – including n-ary relations with n>2, relation attributes, "negative relations"
    • large scale information extraction, clustering and detection of trends in scientific fields
    • targeted information extraction for completing knowledge graphs
    • updating knowledge graphs (e.g. adding new information, removing erroneous facts, having explicit links for incorrect statements)
  • Finding patterns and mining new information
    • automatic generation and ranking of scientific hypotheses
    • aggregation and extraction of human-understandable scientific rules and generalities
    • extraction of script-knowledge and scientific procedures
    • detection of (explicitly stated) causality
    • automated reasoning over repositories of extracted information
  • Using extracted structured knowledge
    • visualization of knowledge in particular domains
    • tools for interacting with users
    • querying knowledge graphs/knowledge repositories
    • evaluation of extracted knowledge

Organizing committee

Vivi Nastase, University of Heidelberg
Dr. Vivi Nastase is a research group leader at the University of Heidelberg, focusing on semantic relations, knowledge acquisition and knowledge graphs. She obtained her PhD at the University of Ottawa, Canada. Previous organizational experience includes one edition of the TextGraphs workshop (at EMNLP 2013), and two editions of the workshop on Collaboratively Built Knowledge Sources and Artificial Intelligence (IJCAI 2009, AAAI 2010). She served as area chair for EMNLP 2009, ACL 2011, EMNLP 2016, and WWW 2016.

Benjamin Roth, Ludwig Maximilian University of Munich
Dr. Benjamin Roth is a researcher and lecturer at the Center for Information and Language Processing, Munich University (LMU). He obtained his PhD at Saarland University, Germany, and he was a postdoc in the Information Extraction and Synthesis Lab (IESL) at the University of Massachusetts, Amherst. His research focus lies in machine learning for relation extraction and knowledge bases. His relation extraction system RelationFactory was top-ranked in the TAC KBP English Slot-Filling benchmark in 2013.

Laura Dietz, University of New Hampshire
Prof. Dr. Laura Dietz is an Assistant Professor at the University of New Hampshire, where she teaches Information Retrieval and Data Science. Before that, she worked in research labs at the University of Massachusetts and Mannheim University, and she obtained her Ph.D. from the Max Planck Institute for Informatics. Her research focuses on text processing and information retrieval with knowledge graphs. Her scientific contributions range from entity linking to the prediction of influences in citation graphs.

Andrew McCallum, University of Massachusetts Amherst
Andrew McCallum is a Professor and Director of the Information Extraction and Synthesis Laboratory, as well as Director of the Center for Data Science in the College of Information and Computer Science at the University of Massachusetts Amherst. He has published over 250 papers in many areas of AI, including natural language processing, machine learning and reinforcement learning; his work has received over 50,000 citations. He obtained his PhD from the University of Rochester in 1995 with Dana Ballard and held a postdoctoral fellowship at CMU with Tom Mitchell and Sebastian Thrun. In the early 2000s he was Vice President of Research and Development at WhizBang Labs, a 170-person start-up company that used machine learning for information extraction from the Web. He is an AAAI Fellow, the recipient of the UMass Chancellor’s Award for Research and Creative Activity, the UMass NSM Distinguished Research Award, the UMass Lilly Teaching Fellowship, and research awards from Google, IBM, Microsoft, and Yahoo. He was the General Chair for the International Conference on Machine Learning (ICML) 2012, and is the current President of the International Machine Learning Society, as well as a member of the editorial board of the Journal of Machine Learning Research. For the past ten years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, entity resolution, social network analysis, structured prediction, semi-supervised learning, and deep neural networks for knowledge representation.

Program Committee (confirmed members marked in bold)

Sergio Baranzini, UCSF
Chaitan Baru, UCSD
Chandra Bhagavatula, Allen Institute for AI
Volha Bryl, Springer Nature
Trevor Cohen, MBChB
Anette Frank, University of Heidelberg
Ingo Frommholz, University of Bedfordshire
Daniel Garijo, ISI
Lee Giles, Penn State
Przemyslaw Grabowicz, Max Planck Institute for Software Systems
Keith Hall, Google
Shizhu He, Chinese Academy of Sciences
Marcel Karnstedt, Springer Semantic Web
Bhushan Kotnis, NEC Labs
Anne Lauscher, Mannheim University
Jiao Li, Chinese Academy of Medical Sciences
Sebastian Martschat, BASF
Philipp Mayr-Schlegel, GESIS
Arunav Mishra, BASF
Justin Mower, Rice University
Mathias Niepert, NEC Labs
Adam Roegiest, Kira Systems
Isabel Segura-Bedmar, University Carlos III of Madrid
Andrew Su, Scripps Institute
Mihai Surdeanu, University of Arizona
Niket Tandon, Allen Institute for AI
Kristina Toutanova, Google
Gerhard Weikum, MPII Saarbruecken
Guido Zuccon, Queensland University

Invited speakers

Elsa Olivetti, MIT (confirmed)

Sponsorship pledges

BASF (Ludwigshafen, Germany): 1000 euros
Leibniz ScienceCampus (Leibniz-Institut für Deutsche Sprache & Univ. of Heidelberg): 3000 euros
DFG grant RO 5127/2-1 (Benjamin Roth): 3000 euros

Estimated number of attendees


Workshop date and place

June 6, 2019, co-located with NAACL 2019, Minneapolis, USA

Related Workshops and conferences

BIRNDL birndl-sigir2018/
The Ninth International Workshop on Health Text Mining and Information Analysis
Workshop on Bio-Medical Semantic Indexing and Question Answering
Workshop on Geo-Knowledge Graphs
Quality Engineering Meets Knowledge Graph


References

Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum. 2017. SemEval 2017 Task 10: ScienceIE – extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics, Vancouver, Canada, pages

Edward Kim, Kevin Huang, Adam Saunders, Andrew McCallum, Gerbrand Ceder, and Elsa Olivetti. 2017. Materials synthesis insights from scientific literature via text extraction and machine learning. Chemistry of Materials 29(21):9436–9444.

Jiao Li, Yueping Sun, Robin J Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. 2016. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database: The Journal of Biological Databases and Curation 2016, baw068.

Emily K. Mallory, Ce Zhang, Christopher Ré, and Russ B. Altman. 2016. Large-scale extraction of gene interactions from full-text literature using DeepDive. Bioinformatics 32(1):106–113.

Sheshera Mysore, Edward Kim, Emma Strubell, Ao Liu, Haw-Shiuan Chang, Srikrishna Kompella, Kevin Huang, Andrew McCallum, and Elsa Olivetti. 2017. Automatically extracting action graphs from materials science synthesis procedures. CoRR abs/1711.06872.

Mahmood Neshati, Djoerd Hiemstra, Ehsaneddin Asgari, and Hamid Beigy. 2014. Integration of scientific and social networks. World Wide Web 17(5):1051–1079.

Vinodkumar Prabhakaran, William L. Hamilton, Dan McFarland, and Dan Jurafsky. 2016. Predicting the rise and fall of scientific topics from trends in their rhetorical framing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1170–1180.

Isabel Segura-Bedmar, Paloma Martínez, and María Herrero Zazo. 2013. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, Atlanta, Georgia, USA, pages 341–350.

Michael L Wick, Ari Kobren, and Andrew McCallum. 2013. Large-scale author coreference via hierarchical entity representations. In ICML Workshop on Peer Reviewing and Publishing Models (PEER).

Huiwei Zhou, Huijie Deng, Degen Huang, and Minling Zhu. 2015. Hedge scope detection in biomedical texts: An effective dependency-based method. PLoS One 10(7). doi:10.1371/journal.pone.0133715.