Scholarly publishing via scientific events revolves around six kinds of entities: scholar, paper, institution, proceedings, event and event series. An example set (using Wikidata Q-identifiers) would be the scholar Tim Berners-Lee (Q80), the paper Tabulator Redux: Browsing and Writing Linked Data, the institution CERN (Q42944), the proceedings LDOW 2008, the event LDOW 2008 (Q11367282) and the event series LDOW (Q105491258). Scholia is a project that provides a portal at https://scholia.toolforge.org/ for searching, browsing and analysing scholarly publishing data that has been curated in the Wikidata knowledge graph. You can verify the example Q-identifiers by performing a plain text search. Note that the paper https://ceur-ws.org/Vol-369/paper11.pdf is missing since it has not been curated yet by the CEUR-WS Semantification project 1 and has no persistent identifier other than its URL. The LDOW 2008 proceedings are found at https://dblp.org/db/conf/www/ldow2008.html but were not in the list of events and were therefore added manually by us.
The frustration of incomplete search results stems from a combination of factors. We would hope that the SPARQL query event-series_events.sparql (https://github.com/WDscholia/scholia/blob/master/scholia/app/templates/event-series_events.sparql) would work as designed: given the identifier of an event series, it should answer with the year, ordinal, short_name, event ID and label, and proceedings ID and label.
The authors' favorite query, for which it would be nice to get a result, is: at which conferences did scholars affiliated with an institution q at the time of publishing successfully submit papers in the time frame start_date to end_date? It is quite straightforward to envision a (federated) knowledge graph that provides the data, and a SPARQL query that retrieves the results for any major institution. Unfortunately, the reality is much bleaker, and there are still obstacles and challenges to overcome. It is promising that natural language input is quite feasible these days: asking the question without any SPARQL knowledge and getting a result has recently been demonstrated with a good success rate (ORKG demonstration, ESWC 2024, Sören Auer).
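To make the shape of this query concrete, the following is a minimal sketch of how it might be phrased against Wikidata. It is not an existing Scholia query. The function name `build_institution_events_query` is hypothetical, and the property choices (P1416 affiliation, P50 author, P577 publication date, P1433 published in, P4745 is proceedings from) are our assumptions about how the data would be modelled. In particular, the sketch matches an author's affiliation as recorded today and does not capture the "at the time of publishing" condition, which is one of the obstacles mentioned above.

```python
import re
from datetime import date

# Hypothetical query template. The property choices are assumptions and the
# affiliation is NOT restricted to the time of publishing.
QUERY_TEMPLATE = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?event (COUNT(DISTINCT ?paper) AS ?papers) WHERE {{
  ?author wdt:P1416 wd:{q} .
  ?paper wdt:P50 ?author ;
         wdt:P577 ?date ;
         wdt:P1433 ?proceedings .
  ?proceedings wdt:P4745 ?event .
  FILTER(?date >= "{start_date}T00:00:00Z"^^xsd:dateTime &&
         ?date <= "{end_date}T23:59:59Z"^^xsd:dateTime)
}}
GROUP BY ?event
ORDER BY DESC(?papers)
"""


def build_institution_events_query(q: str, start_date: str, end_date: str) -> str:
    """Fill the template after validating the Q-identifier and both dates."""
    if not re.fullmatch(r"Q[1-9]\d*", q):
        raise ValueError(f"not a Wikidata Q-identifier: {q!r}")
    # fromisoformat raises ValueError on malformed dates; isoformat()
    # normalizes the result to YYYY-MM-DD before it enters the query.
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return QUERY_TEMPLATE.format(
        q=q, start_date=start.isoformat(), end_date=end.isoformat()
    )
```

For example, `build_institution_events_query("Q42944", "2008-01-01", "2008-12-31")` yields a query string for CERN and the year 2008. Whether it returns complete results depends on how thoroughly the affiliations and proceedings links have been curated.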
Scholia currently faces the challenge that the Blazegraph SPARQL engine backing the Wikidata knowledge graph can hold a graph of at most 4 terabytes. Currently, more than 110 million entities and more than 15 billion triples are claimed for the Wikidata KG. Over 20 million humans and as many scholarly papers are referenced. Unfortunately, this is only a small portion of the scholarly publishing data that could be collected from all the digital traces left by scholarly publishing over the past centuries. The list of academic databases and search engines in Wikipedia has entries with more than 300 million items. Scholarly publishing data alone would therefore break the 4 TB limit.
The Wikimedia Foundation, which is responsible for running the Wikidata infrastructure, has decided to go the route of a graph split: the scholarly data shall get its own KG, and the data will be migrated to it. Unfortunately, this potentially invalidates all current 373 (check #) Scholia queries.
Scholia queries are named parameterized queries. Scholia uses Python and JavaScript as programming languages, SPARQL as the query language and Jinja templates for query parameterization. See [[1]] for an analysis.
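The idea of a named parameterized query can be illustrated with a small sketch. To stay dependency-free it does not use Jinja itself. Instead, a few lines of standard-library Python substitute `{{ name }}` placeholders, which covers only the simplest subset of Jinja syntax. The template text, the name `EVENT_SERIES_EVENTS` and the function `render_query` are hypothetical and are not taken from the Scholia code base. The use of P179 (part of the series) is likewise an assumption.

```python
import re

# Matches simple Jinja-style variable placeholders such as {{ q }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Hypothetical template, written in the style of a named parameterized query.
EVENT_SERIES_EVENTS = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?event ?eventLabel WHERE {
  ?event wdt:P179 wd:{{ q }} .
  ?event rdfs:label ?eventLabel .
  FILTER(LANG(?eventLabel) = "en")
}
"""


def render_query(template: str, **params: str) -> str:
    """Replace every {{ name }} placeholder with the matching parameter."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing query parameter: {name}")
        return params[name]

    return PLACEHOLDER.sub(substitute, template)
```

Calling `render_query(EVENT_SERIES_EVENTS, q="Q105491258")` produces a query for the LDOW event series. Keeping the template separate from its parameters is what would allow the same query set to be replayed against different endpoints.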
The CEUR-WS semantification project is using Wikidata and Scholia as a target Knowledge Graph to introduce "Metadata-First Publishing" (⚠️ Ref to Papers and paper under review ...). As part of the project a set of SPARQL queries has been created that is based on the Semantic Publishing Challenges 2. Some of these queries have already been manually refactored to run successfully on both the Wikidata Query Service and QLever.
The refactoring activities of the CEUR-WS semantification project have shown challenges that are a strong motivation for this work.
QLever is a SPARQL engine developed at the Computer Science Department of the University of Freiburg by Hannah Bast and her team. Written in C++ and aiming for high performance, QLever is a candidate to replace Blazegraph as the main SPARQL engine for the Wikidata knowledge graph. QLever is not feature complete. It is developed as an open source project using the GitHub infrastructure (⚠️ Ref. https://github.com/ad-freiburg/qlever).
We have proposed that QLever should use queries from Scholia as a test suite (https://github.com/ad-freiburg/qlever/issues/859). In this work we go a step further and show how a test suite can be constructed from SPARQL queries that can be extracted from the GitHub issues. A typical issue such as https://github.com/ad-freiburg/qlever/issues/896 (CONCAT not implemented) starts with a standard situation/action/expected result setup in which the action is executing a given query and the result is an error or misbehavior. When the issue is closed, either the query runs or an alternative query is presented that shows that the issue has been fixed.
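As a rough sketch of the extraction step, the following assumes that the issue text is already available as a Markdown string and that queries are posted in fenced code blocks. Both are assumptions about how issues are typically written, not guarantees. The function name `extract_sparql_queries` is hypothetical, and the keyword check is a simple heuristic rather than a SPARQL parser.

```python
import re

# Three backticks, built programmatically so that this example can itself
# be embedded in a fenced code block.
TICKS = "`" * 3

# A fenced block: opening fence with optional info string, body, closing fence.
FENCE = re.compile(TICKS + r"[^\n]*\n(.*?)" + TICKS, re.DOTALL)

# Heuristic: a block counts as SPARQL if it contains one of the query forms.
QUERY_FORM = re.compile(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def extract_sparql_queries(issue_body: str) -> list[str]:
    """Return the fenced code blocks of an issue text that look like SPARQL."""
    queries = []
    for block in FENCE.findall(issue_body):
        if QUERY_FORM.search(block):
            queries.append(block.strip())
    return queries
```

Blocks containing log output or error messages are skipped as long as they do not contain one of the query keywords. A stricter variant could try to parse each candidate with a SPARQL parser before accepting it.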
The Wikidata SPARQL query service examples wiki page https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples was created in 2016 and has had more than 1500 edits since then. As of the writing of this work in July 2024, more than 300 queries have been added to the page to illustrate the usage of Wikidata. The assumption is that all these queries should actually work against the Wikidata service endpoint (hopefully at all times). This work shows that this is not the case and introduces tools and methods to analyze the reasons systematically.
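One way to check such a query set systematically is to run every query and record a coarse outcome for each. The sketch below illustrates the idea only and is not the tooling introduced in this work. The names `QueryOutcome` and `check_queries` are hypothetical, and the `execute` argument stands for any caller-supplied function that sends a query to an endpoint and returns the result rows, so no particular HTTP client or endpoint behavior is assumed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QueryOutcome:
    name: str
    status: str  # one of "ok", "empty" or "error"
    detail: str = ""


def check_queries(
    queries: dict[str, str], execute: Callable[[str], list]
) -> list[QueryOutcome]:
    """Run each named query and classify the result.

    `execute` is expected to return a list of result rows and to raise an
    exception on failure (syntax error, timeout, unsupported feature, ...).
    """
    outcomes = []
    for name, sparql in queries.items():
        try:
            rows = execute(sparql)
        except Exception as error:
            detail = f"{type(error).__name__}: {error}"
            outcomes.append(QueryOutcome(name, "error", detail))
            continue
        status = "ok" if rows else "empty"
        outcomes.append(QueryOutcome(name, status, f"{len(rows)} rows"))
    return outcomes
```

Grouping the recorded `detail` strings by error type then gives a first overview of why queries fail. Because `execute` is passed in, the same query set can be checked against more than one endpoint.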
The Wikidata Query Service at https://query.wikidata.org/ offers an option to create short URLs for queries that have been entered using the service. We have included a random set of 100 such queries as a query set for investigation.
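To turn such links back into query text, the sketch below assumes that a short URL has already been resolved to its long form and that the long form carries the percent-encoded query in the URL fragment (the part after `#`). Resolving the redirect itself is not shown. The function name `query_from_wdqs_url` is hypothetical.

```python
from urllib.parse import unquote, urlsplit


def query_from_wdqs_url(url: str) -> str:
    """Recover the SPARQL text from a long-form query service URL.

    Assumes the query is stored percent-encoded in the URL fragment.
    """
    parts = urlsplit(url)
    if parts.netloc != "query.wikidata.org":
        raise ValueError(f"not a Wikidata Query Service URL: {url}")
    if not parts.fragment:
        raise ValueError("URL does not carry a query in its fragment")
    return unquote(parts.fragment)
```

The decoded text can then be stored under a generated name and treated like any other query in the set.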
W3C test set - why did we not use that as an example
see https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing/Running_TFT
DOI: 10.5281/zenodo.4035223