Snapquery EKAW 2024 paper


Tim Holzheim, Wolfgang Fahl

The Use Cases

This work is motivated by three use cases:

  1. Scholarly publishing
  2. QLever SPARQL engine development
  3. Wikidata SPARQL examples, tutorials and usage

We will introduce the use cases and the challenges relevant for this work.

Scholarly publishing

Scholarly publishing via scientific events revolves around the entities scholar, paper, institution, proceedings, event and event series. An example set (using Wikidata Q-identifiers) would be the scholar Tim Berners-Lee (Q80), the paper "Tabulator Redux: Browsing and Writing Linked Data", the institution CERN (Q42944), the proceedings LDOW 2008, the event LDOW 2008 (Q11367282) and the event series LDOW (Q105491258). Scholia is a project that provides a portal at https://scholia.toolforge.org/ for searching, browsing and analysing scholarly publishing data that has been curated in the Wikidata knowledge graph. You can verify the example Q-identifiers by performing a plain text search there. Note that the paper https://ceur-ws.org/Vol-369/paper11.pdf is missing since it has not yet been curated by the CEUR-WS Semantification project [1] and has no persistent identifier other than its URL. The LDOW 2008 proceedings are found at https://dblp.org/db/conf/www/ldow2008.html but were not in the list of events and were therefore added manually by us.

The frustration of incomplete search results is the effect of a combination of factors. We would hope that the SPARQL query event-series_events.sparql (https://github.com/WDscholia/scholia/blob/master/scholia/app/templates/event-series_events.sparql) works as designed and, when given the identifier of an event series, answers with the year, ordinal, short name, event ID and label, and proceedings ID and level.

The authors' favorite query, for which it would be nice to get a result, is: at which conferences did scholars affiliated with an institution q at the time of publishing successfully submit papers in the time frame start_date to end_date? It is quite straightforward to envision a (federated) knowledge graph that provides the data, together with a SPARQL query that retrieves the results for any major institution. Unfortunately, the reality is much bleaker and there are still obstacles and challenges to overcome. It is promising that natural language input is quite feasible these days: asking the question without any SPARQL knowledge and getting a result has recently been demonstrated with a good success rate (ORKG demonstration, ESWC 2024, Sören Auer).

Scholia currently faces the challenge that the Blazegraph SPARQL engine backing the Wikidata knowledge graph can hold a graph of at most 4 terabytes. The Wikidata KG is currently reported to contain more than 110 million entities and more than 15 billion triples. Over 20 million humans and as many scholarly papers are referenced. Unfortunately, this is only a small portion of the scholarly publishing data that could be collected from the digital traces that scholarly publishing has left over the past centuries. The list of academic databases and search engines in Wikipedia has entries with more than 300 million items. So scholarly publishing data alone would break the 4 TB limit.

The Wikimedia Foundation, which is responsible for running the Wikidata infrastructure, has decided to go the route of a graph split: the scholarly data shall get its own KG and the data will be migrated. Unfortunately, this potentially invalidates all 373 (check #) current Scholia queries.

Scholia queries are named parameterized queries. Scholia uses Python and JavaScript as programming languages, SPARQL as the query language and Jinja templates for query parameterization. See [[1]] for an analysis.
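To make the idea of a named parameterized query concrete, here is a minimal sketch in Python. It is not Scholia's implementation: the NamedQuery class and the query text are hypothetical, and a small regular-expression substitution stands in for the Jinja template engine so that the example needs only the standard library. The placeholder notation {{ q }} mirrors Jinja's variable syntax.

```python
import re
from dataclasses import dataclass

# Wikidata item identifiers look like Q80 or Q105491258.
QID = re.compile(r"Q[1-9][0-9]*")
# Placeholder in Jinja-like notation, e.g. {{ q }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


@dataclass(frozen=True)
class NamedQuery:
    """A query that is addressed by name and filled in with parameters."""

    name: str
    template: str

    def render(self, **params: str) -> str:
        def substitute(match: re.Match) -> str:
            value = params[match.group(1)]  # KeyError if a parameter is missing
            if not QID.fullmatch(value):
                # Reject anything that is not a plain Q-identifier.
                raise ValueError(f"not a Q-identifier: {value!r}")
            return value

        return PLACEHOLDER.sub(substitute, self.template)


# Illustrative query text only (an assumption, not the Scholia template);
# P179 is the Wikidata property "part of the series".
events_of_series = NamedQuery(
    name="event-series_events",
    template="SELECT ?event WHERE { ?event wdt:P179 wd:{{ q }} . }",
)

print(events_of_series.render(q="Q105491258"))
```

Validating the parameter before substitution is a design choice of this sketch: it keeps a caller from injecting arbitrary SPARQL through the parameter value.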

The CEUR-WS semantification project uses Wikidata and Scholia as a target knowledge graph to introduce "Metadata-First Publishing" (⚠️ Ref to Papers and paper under review ...). As part of the project, a set of SPARQL queries based on the Semantic Publishing Challenges [2] has been created. Some of these queries have already been manually refactored to run successfully on both the Wikidata Query Service and QLever.

The refactoring activities of the CEUR-WS semantification project have shown challenges that are a strong motivation for this work.

QLever SPARQL engine development

QLever is a SPARQL engine developed at the Computer Science Department of the University of Freiburg by Hannah Bast and her team. Written in C++ and aiming for high performance, QLever is a candidate for replacing Blazegraph as the main SPARQL engine for the Wikidata knowledge graph. QLever is not feature complete. It is developed as an open source project on GitHub (⚠️ Ref. https://github.com/ad-freiburg/qlever).

We have proposed that QLever should use queries from Scholia as a test suite (https://github.com/ad-freiburg/qlever/issues/859). In this work we go a step further and show how a test suite can be constructed from SPARQL queries that can be extracted from the GitHub issues. A typical issue such as https://github.com/ad-freiburg/qlever/issues/896 (CONCAT not implemented) starts with a standard situation/action/expected result setup in which the action is executing a given query and the result is an error or misbehavior. When the issue is closed, either the query runs or an alternative query is presented that shows that the issue has been fixed.
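The extraction step could look roughly like the following Python sketch. It assumes that queries appear in Markdown fenced code blocks inside the issue text and uses a simple keyword heuristic to tell queries from log output. The function name, the heuristic and the sample issue text are invented for illustration and do not reproduce any real issue.

```python
import re
from typing import List

FENCE = "`" * 3  # Markdown code fence, built here instead of typed literally
CODE_BLOCK = re.compile(FENCE + r"[A-Za-z]*[ \t]*\n(.*?)\n?" + FENCE, re.DOTALL)
QUERY_FORM = re.compile(r"\b(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def extract_queries(issue_text: str) -> List[str]:
    """Return the fenced code blocks of an issue that look like SPARQL queries."""
    blocks = CODE_BLOCK.findall(issue_text)
    return [block.strip() for block in blocks if QUERY_FORM.search(block)]


# Invented issue text: one query block followed by one block of error output.
issue_text = "\n".join([
    "CONCAT does not seem to be implemented.",
    FENCE + "sparql",
    'SELECT (CONCAT("a", "b") AS ?c) WHERE {}',
    FENCE,
    "Result:",
    FENCE,
    "an error message would appear here",
    FENCE,
])

# Each extracted query becomes one test case of the suite.
test_cases = [
    {"issue": "hypothetical-1", "query": query}
    for query in extract_queries(issue_text)
]
print(test_cases)
```

The keyword heuristic is deliberately crude; a block of log output that happens to contain the word "select" would be misclassified, so a real pipeline would likely need a SPARQL parser or manual review.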

Wikidata SPARQL examples, tutorials and usage

The Wikidata SPARQL query service examples wiki page https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples was created in 2016 and has had more than 1500 edits since then. As of the creation of this work in July 2024, more than 300 queries have been added to the page to illustrate the usage of Wikidata. The assumption is that all these queries should actually work against the Wikidata service endpoint (hopefully at all times). This work shows that this is not the case and introduces tools and methods to analyze the reasons systematically.
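A systematic check of such a query set can be sketched as follows in Python. This is only an illustration of the idea, not the snapquery implementation: check_queries, fake_run and the example queries are assumptions, and the endpoint call is replaced by a stand-in function so that the sketch runs without network access.

```python
from collections import Counter
from typing import Callable, Dict


def check_queries(
    queries: Dict[str, str], run: Callable[[str], object]
) -> Dict[str, str]:
    """Run every named query and record 'ok' or the class of the failure."""
    status = {}
    for name, sparql in queries.items():
        try:
            run(sparql)
            status[name] = "ok"
        except Exception as exc:  # record the failure instead of aborting
            status[name] = type(exc).__name__
    return status


def fake_run(sparql: str) -> list:
    """Stand-in for an endpoint call; failures are simulated."""
    if "SERVICE" in sparql:
        raise TimeoutError("simulated timeout")
    if not sparql.strip().upper().startswith(("SELECT", "ASK")):
        raise ValueError("simulated syntax error")
    return []


# Hypothetical example queries keyed by name.
examples = {
    "cats": "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }",
    "federated": "SELECT ?s WHERE { SERVICE <https://example.org/sparql> { ?s ?p ?o } }",
    "broken": "SELEKT ?item WHERE { }",
}

status = check_queries(examples, fake_run)
print(Counter(status.values()))
```

Passing the runner as a parameter means the same loop can be pointed at different endpoints, which is what a comparison across SPARQL engines would need.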

The Wikidata query service at https://query.wikidata.org/ offers an option to create short URLs for queries that have been entered using the service. We have included a random set of 100 such queries as a query set for investigation.

Structure

Introduction

  1. ★★★★★ Query rot versus link rot
  2. ★★★★☆ Transparency vs. complexity of SPARQL queries
  3. ★★★★☆ Use cases for named queries
  4. ★★★★☆ Persistent identifiers
  5. ★★★☆☆ Query hashes and short_urls

Mitigating Query Rot using snapquery

  1. ★★★★★ Parameterized queries
  2. ★★★☆☆ https://web.archive.org/web/20150512231123/http://answers.semanticweb.com:80/questions/12147/whats-the-best-way-to-parameterize-sparql-queries
  3. ★★★☆☆ https://jena.apache.org/documentation/query/parameterized-sparql-strings.html
  4. ★★★★☆ Scholia Jinja templates
  5. ★★★★☆ Technical debt and accidental complexity
  6. ★★★☆☆ How to deal with aspects that do not (usually) influence the execution of a SPARQL query, like whitespace, comments, capitalization and variable names?
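One possible way to handle whitespace and comments is to normalize a query before hashing it, so that two queries differing only in those aspects get the same hash. The Python sketch below illustrates this under simplifying assumptions: the token patterns are approximations of the SPARQL grammar, long (triple-quoted) literals are not handled, and capitalization and variable names are deliberately left untouched. The function names are hypothetical.

```python
import hashlib
import re

# Simplified token patterns; long (triple-quoted) literals are not handled.
IRI = r"<[^<>\s]*>"
LITERAL = r'"(?:[^"\\\n]|\\.)*"' + "|" + r"'(?:[^'\\\n]|\\.)*'"
COMMENT = r"#[^\n]*"

# Pass 1 keeps IRIs and literals verbatim so that a '#' inside them survives.
KEEP_OR_COMMENT = re.compile(f"({IRI}|{LITERAL})|({COMMENT})")
# Pass 2 collapses whitespace everywhere except inside literals.
LITERAL_OR_SPACE = re.compile(f"({LITERAL})|(\\s+)")


def normalize(query: str) -> str:
    """Drop comments and collapse whitespace outside of IRIs and literals."""
    without_comments = KEEP_OR_COMMENT.sub(
        lambda m: m.group(1) if m.group(1) is not None else " ", query
    )
    collapsed = LITERAL_OR_SPACE.sub(
        lambda m: m.group(1) if m.group(1) is not None else " ", without_comments
    )
    return collapsed.strip()


def query_hash(query: str) -> str:
    """SHA-256 hex digest of the normalized query text."""
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


formatted = "SELECT ?item WHERE {\n  ?item wdt:P31 wd:Q5 . # humans\n}"
compact = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . }"
print(query_hash(formatted) == query_hash(compact))
```

Normalizing capitalization or variable names would require a real SPARQL parser, since keywords are case-insensitive while prefixed names, IRIs and literals are not.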

SnapQuery Implementation

  1. ★★★☆☆ Natural Language input
  2. ★★★☆☆ Automatic syntax repairs
  3. ★★★☆☆ Automatic conversion of SQL input, SPARQL output

Evaluation

  1. ★★★★☆ Wikidata example queries
  2. ★★★★★ Scholia and Wikidata graph split
  3. ★★★☆☆ Other knowledge graphs, e.g., DBLP, OpenStreetMap
  4. ★★☆☆☆ Perhaps also some NFDI examples or some custom knowledge graphs like FAIRJupyter
  5. ★★★★★ Quality criteria https://github.com/WolfgangFahl/snapquery/issues/26
  6. ★★★★☆ List of standard refactoring activities and the support by this approach
  7. ★★★★☆ Getting your own copy of Wikidata; the infrastructure effort needs to be mentioned
  8. ★★★☆☆ Usability evaluation https://www.nngroup.com/articles/why-you-only-need-to-test-with-5-users/
  9. ★★★★☆ https://github.com/ad-freiburg/qlever/wiki/QLever-performance-evaluation-and-comparison-to-other-SPARQL-engines
  10. ★★★★☆ A closed issue should have at least one example that runs

Conclusion and Future Work

  1. ★★★★☆ SPARQL standard changes
  2. ★★★★★ Hypothesis by Stefan Decker: Query rot is more prominent in KG environments than with relational databases
  3. ★★★☆☆ Ambiguity of names
  4. ★★★☆☆ Sensitivity Analysis

Additional Resources

  1. ★★☆☆☆ https://stackoverflow.com/questions/tagged/sparql
  2. ★★★☆☆ https://www.semantic-web-journal.net/system/files/swj3076.pdf
  3. ★★☆☆☆ https://arxiv.org/pdf/cs/0605124
  4. ★★★☆☆ https://arxiv.org/pdf/1402.0576 optimizing queries
  5. ★★☆☆☆ https://www.w3.org/TR/REC-rdf-syntax/
  6. ★★☆☆☆ https://biblio.ugent.be/publication/8632551/file/8653456 Towards supporting multiple semantics of named graphs using N3 rules
  7. ★★★☆☆ ESWC 2019 proceedings (978-3-030-21348-0.pdf)
  8. ★★☆☆☆ Linked Data Fragments https://linkeddatafragments.org/ e.g. https://ldfclient.wmflabs.org/ 404 error

Related Work

  1. ★★★☆☆ Link rot
  2. ★★★★☆ Information Hiding and Dependency Inversion Principles
  3. ★★★☆☆ Federated Queries
  4. ★★★☆☆ grlc
  5. ★★☆☆☆ querypulator

Use cases

Scholarly Publishing

Semantic Publishing Challenge – Assessing the Quality of Scientific Output

1

Semantification of CEUR-WS with Wikidata as a Target Knowledge Graph

2


Misc

A Comparison of the Cognitive Difficulties Posed by SPARQL Query Constructs

3


Using SPARQL – The Practitioners’ Viewpoint

4

LSQ: The Linked SPARQL Queries Dataset

5


Detecting SPARQL Query Templates for Data Prefetching

6

An analytical study of large SPARQL query logs

7


Test suites

★★★☆☆ W3C SPARQL 1.1 Test Suite

W3C test set: why did we not use it as an example?

  • Official test suite developed by the W3C SPARQL Working Group
  • Designed to test conformance to the SPARQL 1.1 specification
  • Covers a wide range of SPARQL features and edge cases
  • Primarily focused on correctness rather than performance

see https://wikitech.wikimedia.org/wiki/User:AndreaWest/WDQS_Testing/Running_TFT

Benchmarks

★★★★☆ An Ngoc Lam et al.'s ESWC 2023 paper "Evaluation of a Representative Selection of SPARQL Query Engines Using Wikidata"

  • Evaluates performance of 5 RDF triplestores and 1 experimental SPARQL engine
  • Uses complete version of Wikidata knowledge graph
  • Compares importing time, loading time, exporting time, and query performance
  • Evaluates 328 queries defined by Wikidata users
  • Also uses SP2Bench synthetic benchmark for comparison
  • Provides detailed analysis of query execution plans and profiling information
  • Offers insights on triplestore performance with large-scale real-world data

Wikidata Graph Pattern Benchmark (WGPB) for RDF/SPARQL by Aidan Hogan et al., 2020

DOI: 10.5281/zenodo.4035223

  • Focuses on evaluating performance of graph pattern matching in SPARQL engines
  • Uses a subset of Wikidata as the dataset
  • Provides a large set of SPARQL basic graph patterns
  • Designed to test the benefits of worst-case optimal join algorithms
  • Exhibits a variety of increasingly complex join patterns
  • Allows for systematic testing of query optimization techniques
  • Offers insights into the performance characteristics of different SPARQL engines on complex graph patterns