Difference between revisions of "Snapquery EKAW 2024 paper"
Revision as of 05:59, 10 July 2024
The Use Cases
This work is motivated by three use cases:
- Scholarly publishing
- QLever SPARQL engine development
- Wikidata SPARQL examples, tutorials and usage
We will introduce the use cases and the challenges relevant for this work.
Scholarly publishing
Scholarly publishing via scientific events revolves around the entities scholar, paper, institution, proceedings, event and event series. An example set (using Wikidata Q-identifiers) would be the scholar Tim Berners-Lee (Q80), the paper "Tabulator Redux: Browsing and Writing Linked Data", the institution CERN (Q42944), the proceedings LDOW 2008, the event LDOW 2008 (Q11367282) and the event series LDOW (Q105491258). Scholia is a project that has created a portal at https://scholia.toolforge.org/ that allows users to search, browse and analyse scholarly publishing data that has been curated in the Wikidata knowledge graph. You can verify the example Q-identifiers by performing a plain text search. Note that the paper https://ceur-ws.org/Vol-369/paper11.pdf is missing since it has not yet been curated by the CEUR-WS Semantification project and has no persistent identifier other than its URL. The LDOW 2008 proceedings are found at https://dblp.org/db/conf/www/ldow2008.html but not in the list of events.
Incomplete search results such as these are frustrating, and they are the effect of a combination of factors. We would hope that the SPARQL query event-series_events.sparql (https://github.com/WDscholia/scholia/blob/master/scholia/app/templates/event-series_events.sparql) would work as designed and, when given the identifier of an event series, answer with the year, ordinal, short_name, event id and label, and proceedings id and level.
The authors' favorite query, for which it would be nice to get a result, is: At what conferences did scholars affiliated with an institution q at the time of publishing successfully submit papers in the time frame start_date to end_date? It is quite straightforward to envision a (federated) knowledge graph that provides the data, together with a SPARQL query that retrieves the results for any major institution. Unfortunately, the reality is much bleaker and there are still obstacles and challenges to overcome. It is promising that natural language input is quite feasible these days: asking the question without any SPARQL knowledge and getting a result has been demonstrated lately with a good success rate (ORKG demonstration, ESWC 2024, Sören Auer).
Scholia currently faces the challenge that the Blazegraph SPARQL engine backing the Wikidata knowledge graph can hold a knowledge graph of at most 4 terabytes. Currently there are claims of more than 110 million entities and more than 15 billion triples in the Wikidata KG. Over 20 million humans and as many scholarly papers are referenced. Unfortunately, this is only a small portion of the scholarly publishing data that could be collected from all the digital traces left by scholarly publishing over the past centuries. The list of academic databases and search engines in Wikipedia (https://en.wikipedia.org/wiki/List_of_academic_databases_and_search_engines) has entries with more than 300 million items. So scholarly publishing data alone would break the 4 TB limit.
The Wikimedia Foundation, which is responsible for running the Wikidata infrastructure, has decided to go the route of a graph split. The scholarly data shall get its own KG and the data will be migrated. Unfortunately, this potentially invalidates all 373 (check #) current Scholia queries.
Scholia queries are named parameterized queries. Scholia uses Python and JavaScript as programming languages, SPARQL as a query language and Jinja templates for query parameterization. See https://cr.bitplan.com/index.php/Scholia_Parameter_Handling for an analysis.
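The idea of a named parameterized query can be sketched in a few lines of Python. This is not Scholia's code: to stay within the standard library, the sketch swaps Jinja for a minimal regular-expression substitution of {{ name }} placeholders. The template text, the function name render and the Q-identifier check are assumptions made for illustration only.

```python
import re

# Matches Jinja-style placeholders such as {{ q }} (a stand-in for real Jinja rendering).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")
# Only plain Wikidata Q-identifiers are accepted as parameter values in this sketch.
QID = re.compile(r"Q[1-9][0-9]*")

# Hypothetical template in the style of a Scholia query: events that are part of series q.
TEMPLATE = """SELECT ?event ?eventLabel WHERE {
  ?event wdt:P179 wd:{{ q }} .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}"""


def render(template: str, **params: str) -> str:
    """Replace every {{ name }} placeholder by the validated parameter value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        value = params[name]
        if not QID.fullmatch(value):
            raise ValueError(f"not a Q-identifier: {value!r}")
        return value

    return PLACEHOLDER.sub(substitute, template)


# Example: the LDOW event series mentioned above.
query = render(TEMPLATE, q="Q105491258")
```

Validating the parameter before substitution keeps arbitrary text out of the query, which matters as soon as parameters come from a URL or a form.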
The CEUR-WS Semantification project is using Wikidata and Scholia as a target knowledge graph to introduce "Metadata-First Publishing" (⚠️ Ref to Papers and paper under review ...). As part of the project, a set of SPARQL queries (https://cr.bitplan.com/index.php/List_of_Queries) has been created that is based on the Semantic Publishing Challenges (⚠️ Refs ...). Some of these queries have already been manually refactored to run successfully on both the Wikidata Query Service and QLever.
The refactoring activities of the CEUR-WS semantification project have shown challenges that are a strong motivation for this work.
QLever SPARQL Engine development
QLever is a SPARQL engine developed at the Computer Science Department of the University of Freiburg by Hannah Bast and her team. Written in C++ and aiming for high performance, QLever is a candidate to replace Blazegraph as the main SPARQL engine for the Wikidata knowledge graph. QLever is not feature complete. It is developed as an open source project using the GitHub infrastructure (⚠️ Ref. https://github.com/ad-freiburg/qlever).
We have proposed that QLever should use queries from Scholia as a test suite (https://github.com/ad-freiburg/qlever/issues/859). In this work we go a step further and show how a test suite can be constructed from SPARQL queries that can be extracted from the GitHub issues. A typical issue such as https://github.com/ad-freiburg/qlever/issues/896 (CONCAT not implemented) starts with a standard situation/action/expected result setup in which the action is executing a given query and the result is an error or misbehavior. When the issue is closed, either the query runs or an alternative query is presented that shows that the issue has been fixed.
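How queries might be pulled out of an issue text can be sketched as follows. The sketch assumes that issue bodies are Markdown and that queries sit in fenced code blocks; the function name extract_queries, the keyword heuristic and the sample issue text are hypothetical and do not reproduce issue 896 or the snapquery implementation.

```python
import re

# Three backticks, built programmatically so this listing can itself live inside a fence.
TICKS = "`" * 3
# A fenced block: opening ticks, optional language tag, body, closing ticks.
FENCE = re.compile(TICKS + r"[ \t]*(\w*)[ \t]*\n(.*?)" + TICKS, re.DOTALL)
# Heuristic for untagged blocks: the body contains a SPARQL query form keyword.
QUERY_FORM = re.compile(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def extract_queries(issue_body: str) -> list[str]:
    """Return the SPARQL queries found in the fenced code blocks of an issue body."""
    queries = []
    for lang, code in FENCE.findall(issue_body):
        tagged = lang.lower() == "sparql"
        untagged_query = not lang and QUERY_FORM.search(code) is not None
        if tagged or untagged_query:
            queries.append(code.strip())
    return queries


# Hypothetical issue text following the situation/action/expected result pattern.
sample_issue = "\n".join([
    "Executing the following query fails:",
    TICKS + "sparql",
    'SELECT (CONCAT("a", "b") AS ?c) WHERE {}',
    TICKS,
    "Expected result: one row.",
    TICKS + "bash",
    "qlever start",
    TICKS,
])
queries = extract_queries(sample_issue)
```

Each extracted query could then become one test case, with the issue number serving as a natural part of the query name.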
Wikidata SPARQL examples, tutorials and usage
The Wikidata SPARQL query service examples wiki page https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples was created in 2016 and has had more than 1500 edits since then. As of the writing of this work in July 2024, more than 300 queries have been added to the page to illustrate the usage of Wikidata. The assumption is that all these queries should actually work against the Wikidata service endpoint (hopefully at all times). This work shows that this is not the case and introduces tools and methods to analyze the reasons systematically.
The Wikidata query service at https://query.wikidata.org/ offers an option to create Short-URLs for queries that have been entered using the service. We have included a random set of 100 such queries as a query set for investigation.
Structure
Introduction
- ★★★★★ Query rot versus link rot
- ★★★★☆ Transparency vs. complexity of SPARQL queries
- ★★★★☆ Use cases for named queries
- ★★★★☆ Persistent identifiers
- ★★★☆☆ Query hashes and short_urls
Mitigating Query Rot using snapquery
- ★★★★★ Parameterized queries
- ★★★☆☆ https://web.archive.org/web/20150512231123/http://answers.semanticweb.com:80/questions/12147/whats-the-best-way-to-parameterize-sparql-queries
- ★★★☆☆ https://jena.apache.org/documentation/query/parameterized-sparql-strings.html
- ★★★★☆ Scholia Jinja templates
- ★★★★☆ Technical debt and accidental complexity
- ★★★☆☆ How to deal with aspects that do not (usually) influence the execution of a SPARQL query, like whitespace, comments, capitalization and variable names?
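One possible answer for two of these aspects, comments and whitespace, is to normalize the query text before computing a query hash. The following is a minimal sketch under stated assumptions: the tokenizer is a simplification rather than a full SPARQL grammar, the function names normalize and query_hash are hypothetical, and keyword capitalization and variable renaming are deliberately left out. It is not claimed to be how snapquery computes its hashes.

```python
import hashlib
import re

# Simplified tokenizer: IRIs and string literals are kept verbatim because they
# may legitimately contain '#' or repeated spaces.
TOKEN = re.compile(
    r"(?P<iri><[^<>\s]*>)"
    r"|(?P<string>\"(?:[^\"\\\n]|\\.)*\"|'(?:[^'\\\n]|\\.)*')"
    r"|(?P<comment>#[^\n]*)"
    r"|(?P<space>\s+)"
    r"|(?P<other>[^<\"'#\s]+|.)"
)


def normalize(query: str) -> str:
    """Drop comments and collapse whitespace runs outside IRIs and string literals."""
    parts: list[str] = []
    for match in TOKEN.finditer(query):
        if match.lastgroup in ("comment", "space"):
            # A comment separates tokens just like whitespace does.
            if parts and parts[-1] != " ":
                parts.append(" ")
        else:
            parts.append(match.group())
    return "".join(parts).strip()


def query_hash(query: str) -> str:
    """Hash of the normalized query text (SHA-256 is an arbitrary choice here)."""
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


# Two spellings of the same query that differ only in layout and comments.
verbose = (
    "SELECT ?x  WHERE {\n"
    '  ?x <http://example.org/ns#p> "a  # b" . # trailing comment\n'
    "}"
)
compact = 'SELECT ?x WHERE { ?x <http://example.org/ns#p> "a  # b" . }'
same = query_hash(verbose) == query_hash(compact)
```

Treating IRIs and string literals as opaque tokens is the main design choice: a naive "strip everything after #" rule would corrupt any IRI with a fragment identifier.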
SnapQuery Implementation
- ★★★★☆ SPARQL standard changes
- ★★★☆☆ Natural Language input
- ★★★☆☆ Automatic syntax repairs
- ★★★☆☆ Automatic conversion of SQL input, SPARQL output
Evaluation
- ★★★★☆ Wikidata example queries
- ★★★★★ Scholia and Wikidata graph split
- ★★★☆☆ Other knowledge graphs, e.g., DBLP, OpenStreetMap
- ★★☆☆☆ Perhaps also some NFDI examples or some custom knowledge graphs like FAIRJupyter
- ★★★★★ Quality criteria https://github.com/WolfgangFahl/snapquery/issues/26
- ★★★★☆ List of standard refactoring activities and the support by this approach
- ★★★★☆ Getting your own copy of Wikidata; the infrastructure effort needs to be mentioned
- ★★★☆☆ Usability evaluation https://www.nngroup.com/articles/why-you-only-need-to-test-with-5-users/
- ★★★★☆ https://github.com/ad-freiburg/qlever/wiki/QLever-performance-evaluation-and-comparison-to-other-SPARQL-engines
- ★★★★☆ A closed issue should have at least one example that runs
Conclusion and Future Work
- ★★★★★ Hypothesis by Stefan Decker: Query rot is more prominent in KG environments than with relational databases
- ★★★☆☆ Ambiguity of names
Additional Resources
- ★★☆☆☆ https://stackoverflow.com/questions/tagged/sparql
- ★★★☆☆ https://www.semantic-web-journal.net/system/files/swj3076.pdf
- ★★☆☆☆ https://arxiv.org/pdf/cs/0605124
- ★★★☆☆ https://arxiv.org/pdf/1402.0576 optimizing queries
- ★★☆☆☆ https://www.w3.org/TR/REC-rdf-syntax/
- ★★☆☆☆ https://biblio.ugent.be/publication/8632551/file/8653456 Towards supporting multiple semantics of named graphs using N3 rules
- ★★★☆☆ ESWC 2019 proceedings (978-3-030-21348-0.pdf)
- ★★☆☆☆ Linked Data Fragments https://linkeddatafragments.org/ e.g. https://ldfclient.wmflabs.org/ (returns a 404 error)
Related Work
- ★★★☆☆ Link rot
- ★★★★☆ Information Hiding and Dependency Inversion Principles
- ★★★☆☆ Federated Queries
- ★★★☆☆ grlc
- ★★☆☆☆ querypulator
Misc
A Comparison of the Cognitive Difficulties Posed by SPARQL Query Constructs
Using SPARQL – The Practitioners’ Viewpoint
LSQ: The Linked SPARQL Queries Dataset
Detecting SPARQL Query Templates for Data Prefetching
An analytical study of large SPARQL query logs
Testsuites
★★★☆☆ W3C SPARQL 1.1 Test Suite
W3C test set - why did we not use that as an example
- Official test suite developed by the W3C SPARQL Working Group
- Designed to test conformance to the SPARQL 1.1 specification
- Covers a wide range of SPARQL features and edge cases
- Primarily focused on correctness rather than performance
see https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing/Running_TFT
Benchmarks
★★★★☆ An Ngoc Lam et al.'s ESWC 2023 paper "Evaluation of a Representative Selection of SPARQL Query Engines Using Wikidata"
- Evaluates performance of 5 RDF triplestores and 1 experimental SPARQL engine
- Uses complete version of Wikidata knowledge graph
- Compares importing time, loading time, exporting time, and query performance
- Evaluates 328 queries defined by Wikidata users
- Also uses SP2Bench synthetic benchmark for comparison
- Provides detailed analysis of query execution plans and profiling information
- Offers insights on triplestore performance with large-scale real-world data
Wikidata Graph Pattern Benchmark (WGPB) for RDF/SPARQL by Aidan Hogan et al., 2020
DOI: 10.5281/zenodo.4035223
- Focuses on evaluating performance of graph pattern matching in SPARQL engines
- Uses a subset of Wikidata as the dataset
- Provides a large set of SPARQL basic graph patterns
- Designed to test the benefits of worst-case optimal join algorithms
- Exhibits a variety of increasingly complex join patterns
- Allows for systematic testing of query optimization techniques
- Offers insights into the performance characteristics of different SPARQL engines on complex graph patterns
References
- Paul Warren; Paul Mulholland. (2020) "A Comparison of the Cognitive Difficulties Posed by SPARQL Query Constructs". pp. 3-19. doi: 10.1007/978-3-030-61244-3_1. At: EKAW 2020
- Paul Warren; Paul Mulholland. (2018) "Using SPARQL – The Practitioners’ Viewpoint". pp. 485-500. doi: 10.1007/978-3-030-03667-6_31
- Muhammad Saleem; Muhammad Intizar Ali; Aidan Hogan; Qaiser Mehmood; Axel-Cyrille Ngonga Ngomo. (2015) "LSQ: The Linked SPARQL Queries Dataset". pp. 261-269. doi: 10.1007/978-3-319-25010-6_15
- Johannes Lorey; Felix Naumann. (2013) "Detecting SPARQL Query Templates for Data Prefetching". pp. 124-139. doi: 10.1007/978-3-642-38288-8_9
- Angela Bonifati; Wim Martens; Thomas Timm. (2020) "An analytical study of large SPARQL query logs". pp. 655-679. doi: 10.1007/s00778-019-00558-9