= Claude Sonnet 4.5 =

== Meta Queries: Eating Our Own Dog Food ==

=== Generation Prompt ===

<pre>
Create a MediaWiki table and section for the meta queries describing our "eating our own dog food" approach.

Context: We have 15 meta queries that analyze snapquery's own query execution data. Currently these use SQL on JSON/YAML data for development agility. The key point: the same query names and contracts will work when we migrate to SPARQL/RDF format (as proposed by SIB project via Jerven Bolleman's issue #59). This proves our technology-agnostic approach - snapquery will use its own query-set tool to convert from current JSON/YAML to SPARQL/RDF without breaking client code.

Input: YAML file with meta queries (query_count, query_success, query_failures_by_category, etc.)

Output needed: 
- MediaWiki section explaining the self-hosting validation concept
- Table listing query names and purposes (NOT the SQL code)
- Explain future migration path to SPARQL/RDF
</pre>

=== Overview ===

Snapquery demonstrates its core principle of technology-agnostic query management by applying it to itself - a practice known as "eating your own dog food." We maintain 15 meta queries that analyze snapquery's own query execution data, currently implemented using SQL queries over JSON/YAML storage for development agility.

The crucial validation: these same named queries with identical contracts will seamlessly migrate to SPARQL queries over RDF data when we transition snapquery's internal storage to a knowledge graph. We will use snapquery's own query-set tool to perform this migration, converting the current JSON/YAML format to SPARQL/RDF following the format proposed by the SIB project via Jerven Bolleman's issue #59.

This self-hosting approach proves that:

* Query contracts remain stable across technology changes
* Client code doesn't break when backend storage evolves
* The abstraction layer successfully isolates queries from execution technology
* Snapquery can manage its own infrastructure evolution

=== Meta Query Inventory ===

{| class="wikitable"
|+ Table 1: Meta Queries for Analyzing Snapquery Query Sets
! Query Name !! Purpose
|-
| query_count || Count total number of named queries in the system
|-
| query_success || Analyze successful query executions grouped by endpoint
|-
| query_failures_by_category || Break down query failures by error category, domain, and namespace
|-
| query_failures_by_category_grouped || Aggregate failure patterns showing which error categories affect which domains and namespaces
|-
| query_failures_by_category_grouped_counted || Advanced failure analysis with ranked counts across domains, namespaces, and endpoints per error category
|-
| query_failures_by_database_count || Count query failures grouped by underlying database technology
|-
| query_success_by_namespace || Analyze success counts organized by namespace and endpoint
|-
| query_namespace_endpoint_matrix_with_distinct || Comprehensive matrix showing total, successful, and failed query counts for each namespace-endpoint combination
|-
| query_stats || Calculate execution statistics including duration and result set sizes per query
|-
| params_stats || Analyze distribution of parameter usage patterns across queries
|-
| query_details_stats || Compute overall statistics on query complexity (parameter counts, line counts, sizes)
|-
| domain_namespace_stats || Count queries grouped by domain and namespace
|-
| all_endpoints || List all registered endpoints with metadata
|-
| error_histogram || Generate frequency distribution of errors per query to identify most problematic queries
|-
| scholia_jinja_for_loops || Analyze Jinja template usage patterns in Scholia queries
|}

=== Migration Path ===

The meta queries currently execute as SQL queries on snapquery's JSON/YAML data model. When we migrate to RDF storage with full SPARQL endpoint capabilities:

# The query names remain identical (query_count, query_failures_by_category_grouped_counted, etc.)
# The query contracts (inputs/outputs) remain unchanged
# The implementation evolves from SQL to SPARQL
# Client code using these queries requires zero modifications
# The format conversion follows SIB project specifications (Jerven Bolleman's issue #59)

For example, the query_failures_by_category_grouped_counted meta query currently uses SQL window functions and aggregations. After migration, it will use equivalent SPARQL aggregation patterns and GROUP_CONCAT operations - but applications calling this named query won't notice the difference.
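
To make the contract stability concrete, here is a minimal sketch of how such an entry could look once a SPARQL implementation sits next to the current SQL one. The <code>sparql</code> key, the <code>sq:</code> prefix, and the class <code>sq:NamedQuery</code> are illustrative assumptions, not the actual SIB mapping.

<syntaxhighlight lang="yaml">
# hypothetical future shape of a metaqueries.yaml entry: one name, one result
# contract (a single count), two interchangeable implementations
'query_count':
    sql: |
      SELECT COUNT(*)
      FROM NamedQuery
    sparql: |
      # sq: is an assumed prefix for a snapquery RDF vocabulary
      PREFIX sq: <https://example.org/snapquery#>
      SELECT (COUNT(?q) AS ?count)
      WHERE { ?q a sq:NamedQuery }
</syntaxhighlight>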

=== Self-Hosting Validation ===

This approach validates snapquery's architecture in the most practical way possible: we build snapquery using the same principles we advocate for SPARQL query management. By starting with SQL/JSON/YAML for rapid iteration while maintaining the architectural vision for full RDF/SPARQL support, we ensure that:

* Our technology-agnostic abstraction layer actually works
* Migration paths are real, not theoretical
* The query-set tool can handle complex transformations
* "Query rot" mitigation strategies apply across different backend technologies

When we complete the migration to SPARQL/RDF, these 15 meta queries will continue functioning without modification - the ultimate proof that query abstraction isolates applications from infrastructure evolution.

= GPT-5 =

== Meta Queries (WF 2025-05-06): Dogfooding and Future-Proofing ==

We “eat our own dog food” by managing operational meta queries with the very same query-set tool, registry, and pipelines used for end-user queries. We started with agile JSON/YAML declarations and a relational backend, while intentionally defining a stable query contract that can be lifted to a full SPARQL/RDF stack without changing how users or dashboards interact with results.

Key points:

* Same contract, multiple backends: the named queries below expose stable result shapes that will be preserved when we enable a SPARQL endpoint and RDF publishing.
* Future-proof via SIB format: our query-set tool converts the current JSON/YAML inputs to SPARQL/RDF in the format proposed by the SIB project (via Jerven Bolleman's issue #59), so the exact same query set can target RDF graphs.
* One pipeline for all: declarations, scheduling, execution, and monitoring are unified; meta queries ride the same CI, caching, and reporting paths as user-facing queries.
* Purpose only: the table intentionally lists purposes, not code, to document intent and contract without disclosing implementation details.

=== Meta Query Catalog (purpose-only) ===

{| class="wikitable sortable"
! Name !! Purpose
|-
| query_count || Total number of registered named queries.
|-
| query_success || Count of successful executions per endpoint.
|-
| query_failures_by_category || Failures grouped by domain, namespace, and error category.
|-
| query_failures_by_category_grouped || Aggregated failures per error category with grouped domains and namespaces.
|-
| query_failures_by_category_grouped_counted || Failure rollups per category with per-domain/namespace/endpoint tallies.
|-
| query_failures_by_database_count || Failure counts grouped by backend database type.
|-
| query_success_by_namespace || Successful runs per namespace and endpoint.
|-
| query_namespace_endpoint_matrix_with_distinct || Matrix of distinct successful/failed queries per domain–namespace–endpoint, including totals and sums.
|-
| all_queries || Inventory of all named queries for introspection and cataloging.
|-
| error_histogram || Per-query error frequency to identify hotspots.
|-
| query_stats || Per-query execution summaries (durations and records) for performance tracking.
|-
| params_stats || Frequency of parameter signatures to understand usage patterns.
|-
| query_details_stats || Distribution of query complexity metrics (parameters, lines, size).
|-
| domain_namespace_stats || Query counts per domain and namespace (filterable).
|-
| all_endpoints || Catalog of available endpoints and their characteristics.
|-
| scholia_jinja_for_loops || Analysis of template for-loop usage patterns in query text.
|}

=== Contract and Migration ===

* Contract stability: each meta query’s output schema is treated as a contract that remains stable across backends.
* SPARQL/RDF lift: contracts are mapped 1:1 to SPARQL/RDF shapes using the SIB-aligned conversion (Jerven Bolleman's issue #59), as sketched below.
* Seamless switch: dashboards and exports consume the contract, not the implementation, enabling a backend switch without user-facing changes.
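
As a minimal sketch of such a lift, assuming a placeholder <code>sq:</code> vocabulary rather than the actual SIB mapping, the <code>error_histogram</code> meta query from the full YAML below could be expressed in SPARQL while keeping its contract of one error count per query id:

<syntaxhighlight lang="yaml">
# hypothetical SPARQL lift of the error_histogram meta query; the sq: terms are
# assumed placeholders, the result shape (count per query_id) stays the same
'error_histogram':
    sparql: |
      PREFIX sq: <https://example.org/snapquery#>
      SELECT ?query_id (COUNT(?run) AS ?count)
      WHERE {
        # binding sq:error_msg mirrors the SQL "error_msg is not null" filter
        ?run a sq:QueryStats ;
             sq:query_id ?query_id ;
             sq:error_msg ?error_msg .
      }
      GROUP BY ?query_id
      ORDER BY DESC(?count)
</syntaxhighlight>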

== Prompt used to generate this section ==

<pre>
create a latex table and section for the meta queries and explain how we eat our own dog food
</pre>

= Grok-4 =

== Meta Queries ==

Snapquery demonstrates its future-proof design by "eating its own dog food"—applying its principles to manage and analyze query metadata while starting with agile, lightweight technologies. To enable rapid development and iteration, we initially implemented metadata storage using JSON/YAML files and an internal SQLite database, with SQL-based meta-queries for analysis. This choice prioritizes agility and ease of prototyping over immediate full RDF compliance. However, snapquery is designed to be technology-agnostic, allowing seamless migration to a complete SPARQL endpoint and RDF capabilities in the future. Using snapquery itself, we can refactor and migrate these meta-queries and data structures to RDF without disrupting existing functionality, ensuring long-term scalability and standards compliance. Specifically, our query-set tool supports converting current JSON and YAML inputs to SPARQL/RDF formats as proposed by the SIB project (via Jerven Bolleman's issue #59), maintaining the same query contract across technologies.

These meta-queries operate on query details, execution statistics, and endpoint information, enabling systematic analysis of query performance, failures, and characteristics across namespaces, domains, and endpoints. This self-referential approach showcases snapquery's extensibility: while currently using SQL for meta-analysis, it can uniformly handle SPARQL or other query languages, supporting proactive monitoring and optimization of query sets. The meta-queries are stored in ''metaqueries.yaml'' and can be executed via snapquery's CLI or API.

This page serves as the primary MediaWiki documentation for snapquery's meta-queries. It was initially generated using an LLM prompt emphasizing snapquery's future-proof migration strategy. A LaTeX adaptation of this content is available for paper submissions, but this wiki version includes the full meta-query YAML for collaborative editing (see below or the source repository).

The generation of this section was guided by the following prompt: “are you completly nuts? my input was [full YAML content of meta-queries] and i asked for a table and section describing our own eaty-our own dog food style. To document this we need a mediawiki version first. The table will of course not disclose the code for how the query works just the puppose. That shows off that we will later be able to use the same query contract for sparql / rdf. our query-set tool allows us to conver the current json and yaml input to SPARQL/RDF in the format proposed by the SIB project via Jeven Bollemans issue #59 You must add the prompt”

=== Selected Meta Queries in snapquery ===

{| class="wikitable" style="width: 100%;"
|-
! Query Name !! Purpose
|-
| query_count || Counts the total number of named queries in the system.
|-
| query_success || Aggregates successful query executions per endpoint, ordered by count.
|-
| query_failures_by_category || Groups query failures by error category, domain, and namespace for detailed failure analysis.
|-
| query_failures_by_category_grouped || Aggregates failures by category, collecting associated domains and namespaces.
|-
| query_failures_by_category_grouped_counted || Provides detailed aggregated counts of failures per category, including breakdowns by domain, namespace, and endpoint.
|-
| query_failures_by_database_count || Counts failures grouped by the underlying database type of the endpoints.
|-
| query_success_by_namespace || Aggregates successful queries per namespace and endpoint, useful for namespace performance comparison.
|-
| query_namespace_endpoint_matrix_with_distinct || Creates a matrix view of success and failure statistics per domain, namespace, and endpoint, including distinct counts.
|-
| all_queries || Retrieves all named queries for comprehensive listing.
|-
| error_histogram || Generates a histogram of error occurrences per query ID.
|-
| query_stats || Computes execution statistics (e.g., duration and record counts) per query ID.
|-
| params_stats || Analyzes the distribution of query parameters across details.
|-
| query_details_stats || Calculates aggregate statistics on query properties like parameter counts, line lengths, and sizes.
|-
| domain_namespace_stats || Counts queries per domain and namespace, with wildcard support for filtering.
|-
| all_endpoints || Lists all configured endpoints with their key attributes.
|-
| scholia_jinja_for_loops || Analyzes the usage of Jinja for-loops in Scholia-related queries for pattern identification.
|}

= Full Meta-Query YAML =

''For collaborative editing and reference, the full YAML content is provided here (note: this includes SQL details for internal use; public documentation focuses on purposes only):''

<syntaxhighlight lang="yaml">
# Meta Queries
# WF 2025-05-06
'query_count':
    sql: |
      SELECT COUNT(*)
      FROM NamedQuery
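# a run counts as successful when it returned at least one record (records > 0)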
'query_success':
    sql: |
        SELECT 
           COUNT(*) AS count,
           endpoint_name
        
        FROM "QueryStats"
        WHERE records>0
        GROUP BY endpoint_name
        ORDER BY 1 DESC
'query_failures_by_category':
    sql: |
        SELECT 
        count(*) as count,
        nq.domain,
        nq.namespace,
        error_category
        FROM QueryStats qs
        JOIN NamedQuery nq
        ON qs.query_id=nq.query_id
        WHERE error_category IS NOT NULL 
        
        GROUP BY error_category,nq.namespace,nq.domain
        ORDER BY 1 DESC
'query_failures_by_category_grouped':
    sql: |
        SELECT 
          count(*) AS count,
          GROUP_CONCAT(DISTINCT nq.domain) AS domains,
          GROUP_CONCAT(DISTINCT nq.namespace) AS namespaces,
          error_category
        FROM QueryStats qs
        JOIN NamedQuery nq ON qs.query_id = nq.query_id
        WHERE error_category IS NOT NULL
        GROUP BY error_category
        ORDER BY  count DESC;
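# roll up failures per error_category: the inner query counts failures per
# (category, domain, namespace, endpoint) and uses window functions to build
# "name (count)" labels; the outer query concatenates those labels per category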
'query_failures_by_category_grouped_counted':
    sql: |
        SELECT 
          error_category,
          SUM(entry_count) AS total_count,
          GROUP_CONCAT(DISTINCT domain_counts ORDER BY domain_count DESC) AS domain_counts,
          GROUP_CONCAT(DISTINCT namespace_counts ORDER BY namespace_count DESC) AS namespace_counts,
          GROUP_CONCAT(DISTINCT endpoint_counts ORDER BY endpoint_count DESC) AS endpoint_counts
        FROM (
          SELECT 
            error_category,
            domain,
            namespace,
            endpoint_name,
            COUNT(*) AS entry_count,
            domain || ' (' || SUM(COUNT(*)) OVER (PARTITION BY error_category, domain) || ')' AS domain_counts,
            namespace || ' (' || SUM(COUNT(*)) OVER (PARTITION BY error_category, namespace) || ')' AS namespace_counts,
            endpoint_name || ' (' || SUM(COUNT(*)) OVER (PARTITION BY error_category, endpoint_name) || ')' AS endpoint_counts,
            SUM(COUNT(*)) OVER (PARTITION BY error_category, domain) AS domain_count,
            SUM(COUNT(*)) OVER (PARTITION BY error_category, namespace) AS namespace_count,
            SUM(COUNT(*)) OVER (PARTITION BY error_category, endpoint_name) AS endpoint_count
          FROM QueryStats qs
          JOIN NamedQuery nq ON qs.query_id = nq.query_id
          WHERE error_category IS NOT NULL
          GROUP BY error_category, domain, namespace, endpoint_name
        ) sub
        GROUP BY error_category
        ORDER BY total_count DESC;
'query_failures_by_database_count':
    sql: |
        SELECT
           COUNT(*) AS count,
           ep.database 
        FROM QueryStats qs
        JOIN Endpoint ep
        ON qs.endpoint_name=ep.name
        WHERE qs.error_msg IS NOT NULL
        GROUP BY ep.database
        ORDER BY 1 DESC
'query_success_by_namespace':
    sql: |
        SELECT 
           COUNT(*) AS count,
           nq.namespace,
           qs.endpoint_name
           
        FROM QueryStats qs
        JOIN NamedQuery nq on qs.query_id=nq.query_id
        WHERE records>0
        GROUP BY endpoint_name,namespace
        ORDER BY 2 ASC,1 DESC
# This query calculates statistics for named queries across 
# different domains, namespaces and endpoints.
# It joins NamedQuery with a subquery from QueryStats 
# to get both success and failure counts.
# Another subquery is used to get the total count of 
# distinct queries per namespace.
# 
# The results show:
# - domain
# - namespace
# - endpoint name (or 'No Endpoint' if null)
# - total count of distinct queries in the namespace
# - count of distinct successful queries
# - count of distinct failed queries
# - sum of success counts
# - sum of failure counts
#
# Note: A query is considered failed if it has a non-null error_msg.
# The records count is used for successful queries.
'query_namespace_endpoint_matrix_with_distinct':
  sql: |
    SELECT
        nq.domain,
        nq.namespace,
        COALESCE(qs.endpoint_name, 'No Endpoint') AS endpoint_name,
        total.total_count AS total,
        COUNT(DISTINCT (CASE WHEN qs.success_count > 0 THEN qs.query_id END)) AS distinct_successful,
        COUNT(DISTINCT (CASE WHEN qs.failure_count > 0 THEN qs.query_id END)) AS distinct_failed,
        SUM(qs.success_count) AS success_count,
        SUM(qs.failure_count) AS failure_count
    FROM NamedQuery nq
        LEFT JOIN (
            SELECT 
                query_id,
                endpoint_name,
                context,
                SUM(CASE WHEN error_msg IS NULL AND records > 0 THEN 1 ELSE 0 END) AS success_count,
                SUM(CASE WHEN error_msg IS NOT NULL THEN 1 ELSE 0 END) AS failure_count
            FROM QueryStats
            GROUP BY query_id, endpoint_name, context
        ) qs ON nq.query_id = qs.query_id
        LEFT JOIN (
          SELECT domain, namespace, COUNT(DISTINCT(query_id)) AS total_count
          FROM NamedQuery
          GROUP BY domain, namespace
        ) total ON nq.domain = total.domain AND nq.namespace = total.namespace
    GROUP BY nq.domain, nq.namespace, qs.endpoint_name, total.total_count
'all_queries':
    sql: |
      SELECT * FROM NamedQuery
'error_histogram':
    sql: |
      SELECT COUNT(*), query_id
      FROM QueryStats
      WHERE error_msg is not null
      GROUP BY query_id
      ORDER BY 1 DESC
'query_stats':
    sql: |
      SELECT query_id,
        COUNT(duration) AS count,
        MIN(duration) AS min_time,
        MAX(duration) AS max_time,
        AVG(duration) AS avg_time,
        MIN(records) AS min,
        MAX(records) AS max,
        AVG(records) AS avg
      FROM QueryStats
        GROUP by query_id
        ORDER BY 1 DESC;
'params_stats':
    sql: |
        SELECT count(*),
            params 
        FROM "QueryDetails" 
        GROUP BY params 
        ORDER BY 1 desc
'query_details_stats':
    sql: |
      SELECT 
        MIN(param_count) AS min_param_count,
        MAX(param_count) AS max_param_count,
        AVG(param_count) AS avg_param_count,
        MIN(lines) AS min_lines,
        MAX(lines) AS max_lines,
        AVG(lines) AS avg_lines,
        MIN(size) AS min_size,
        MAX(size) AS max_size,
        AVG(size) AS avg_size
      FROM 
        QueryDetails;
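# the two ? placeholders are bound to SQL LIKE patterns for domain and namespace
# (e.g. '%' to match all) - this is the wildcard filtering mentioned in the tables above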
'domain_namespace_stats':
  sql: |
    SELECT 
      domain,
      namespace,
      COUNT(*) AS query_count
    FROM NamedQuery
    WHERE domain LIKE ? AND namespace LIKE ?
    GROUP BY domain, namespace
    ORDER BY 3 DESC
'all_endpoints':
    sql: |
        SELECT 
            name,
            endpoint,
            lang,
            website,
            database,
            method
        FROM Endpoint
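# extract the body of each Jinja '{% for ... %}' loop embedded in the stored
# query text and count how often each distinct loop body occurs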
'scholia_jinja_for_loops':
    sql: |
        SELECT
          count(*),
          substr(
            sparql,
            instr(sparql, '{% for') + length('{% for'),  -- Start position right after "{% for"
            instr(substr(sparql, instr(sparql, '{% for')), '%}') - length('{% for')  -- Length of substring
          ) as for_loop_content
        FROM
          NamedQuery
        WHERE
          sparql LIKE '%{% for%' ESCAPE '\' and for_loop_content like "%in%"
        group by for_loop_content
        order by 1 desc
</syntaxhighlight>