Latest revision as of 18:23, 29 November 2025

semantify³ - extract knowledge graph-ready triples from human-readable annotations wherever possible - Syntax matters!

What it is

Inspired by the Syntax_Matters thoughts, this work showcases a straightforward approach that is intended to be better than state-of-the-art metadata embedding formats such as RDFa and Microformats.

Goal

Computer-readable content is ubiquitous these days. Many different file formats have emerged and proliferated at high speed. Acronyms such as HTML, JSON, XML, CSV, PDF, BibTeX, DOC(X), XLS(X), and PPT(X) are well known to most computer users and in daily use. Still, a query such as "Give me all PowerPoint files mentioned in my Excel table courses.xlsx - find the keywords in the notes and then look up in my BibTeX files which authors are mentioned to look them up in DBLP, Wikidata, and Google Scholar" easily leads to a frustrating amount of effort.

The Semantic Web promise of a long time ago has not been fulfilled yet. Personally, I believe we are much closer now than in the past decade, when the adoption rate of proposed solutions was subpar. The critical success factor for metadata adoption is Human Readability. If the authors/curators cannot easily write, verify, and maintain the metadata, the system collapses.

Therefore, metadata embedding should not pollute the Document Object Model (DOM) like RDFa or Microformats. Instead, it should reside in distinct, clean blocks. We propose using standard Markdown backticks to indicate the syntax (e.g., ` ```yaml `) combined with a UTF-8 World Wide Web Marker (e.g., 🌐🕸) to explicitly signal to crawlers that the embedded markup is intended for ingestion into a Knowledge Graph - the World Wide Web of Semantic Data.
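
A crawler following this convention could be sketched roughly as below. This is a minimal illustration, not the semantify3 implementation: the function name `extract_marked_blocks`, the regular expression, and the rule that the marker may sit on any line inside the fence are assumptions made for the example.

```python
import re
from typing import List, Tuple

MARKER = "🌐🕸"   # the marker proposed in the text above
FENCE = "`" * 3   # Markdown code fence, built here so this example stays fence-safe

# Opening fence with a syntax name, a body, then the closing fence.
BLOCK_RE = re.compile(
    re.escape(FENCE) + r"(?P<lang>yaml|sidif)[ \t]*\n(?P<body>.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_marked_blocks(text: str) -> List[Tuple[str, str]]:
    """Return (lang, payload) for every fenced block that carries the marker."""
    blocks = []
    for match in BLOCK_RE.finditer(text):
        body = match.group("body")
        if MARKER not in body:
            continue  # an ordinary code block, not meant for the knowledge graph
        # Drop the marker line itself and keep the payload lines.
        payload = [line for line in body.splitlines() if MARKER not in line]
        blocks.append((match.group("lang"), "\n".join(payload).strip()))
    return blocks


# Hypothetical input: one marked block and one unmarked block.
sample = "\n".join([
    "Some prose before the annotation.",
    FENCE + "yaml",
    MARKER,
    "sem3_cmd:",
    "  isA: PythonModule",
    FENCE,
    FENCE + "yaml",
    "not_for_the_graph: true",
    FENCE,
])
print(extract_marked_blocks(sample))
```

Only the first block is returned; the second one has the right syntax tag but no marker, so a crawler would leave it alone.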

Syntax matters!

Below is a comparison of legacy embedding methods versus the proposed clean-code approach using backticks and markers.

Legacy: Microformats

Microformats attempt to use existing HTML `class` attributes to convey meaning. This couples data to design, leading to fragility; if you change the CSS class name for styling, you might accidentally break your data graph.

<!-- Microformats: Fragile and tied to CSS classes -->
<div class="h-product">
  <p>Kaufen Sie den
    <span class="p-name">Staubsauger XF704</span>
    <img class="u-photo" src="acmeXF704.jpg" alt="" />
  </p>
</div>

Legacy: RDFa

RDFa represents the absolute low point of the Semantic Web. It forces data into the HTML structure using attributes like `vocab` and `property`, creating a verbose, unreadable mess that is difficult for humans to parse visually and impossible to maintain.

<!-- RDFa: Visual clutter. The data is effectively hidden inside tags. -->
<div vocab="http://schema.org/" typeof="Product">
  <p>Kaufen Sie den
     <span property="name">Staubsauger XF704</span>
     jetzt im Sonderangebot!
     <img property="image" src="acmeXF704.jpg" />
  </p>
</div>

Proposal: Spider Marker + Backticks (YAML)

We propose using standard Markdown code fences. The Spider Marker (🕸) tells the machine "This is data," and the backticks tell the editor "This is YAML." The data remains clean and separated from the HTML.

```yaml
# 🕸 Knowledge Graph Block
Products:
  StaubsaugerXF704:
    name: Staubsauger XF704
    image: acmeXF704.jpg
```

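How such a block might turn into triples can be sketched as follows. The mapping rules are an assumption made for illustration: the top-level key is read as a collection, each key below it as a subject, and each leaf pair as predicate and object. The predicate name `memberOf` and the function `mapping_to_triples` are hypothetical and not taken from semantify3.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# The YAML block above, written as the nested dict a YAML parser would return.
parsed: Dict[str, Dict[str, Dict[str, str]]] = {
    "Products": {
        "StaubsaugerXF704": {
            "name": "Staubsauger XF704",
            "image": "acmeXF704.jpg",
        }
    }
}


def mapping_to_triples(data: Dict[str, Dict[str, Dict[str, str]]]) -> List[Triple]:
    """Flatten collection -> subject -> {predicate: object} into triples."""
    triples: List[Triple] = []
    for collection, subjects in data.items():
        for subject, properties in subjects.items():
            # Assumption: collection membership becomes its own triple.
            triples.append((subject, "memberOf", collection))
            for predicate, value in properties.items():
                triples.append((subject, predicate, str(value)))
    return triples


for triple in mapping_to_triples(parsed):
    print(triple)
```
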
Proposal: Spider Marker + Backticks (SiDIF)

SiDIF allows for sentence-like structure, making the logic immediately apparent to the human reader, encapsulated safely in a code block.

```sidif
# 🕸 Knowledge Graph Block
StaubsaugerXF704 isA Product
  "Staubsauger XF704" is name of it
  "acmeXF704.jpg" is image of it
```

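The sentence-like structure maps onto triples almost line by line. The sketch below handles only the two sentence forms shown in the example (`X isA Y` and `"value" is property of it`) and is not a full SiDIF parser. The function name `sidif_to_triples` and the choice to skip lines starting with `#` are assumptions.

```python
import re
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Form 1: Subject isA Class
IS_A_RE = re.compile(r'^(?P<subject>\S+)\s+isA\s+(?P<cls>\S+)$')
# Form 2: "literal" is property of it
PROPERTY_RE = re.compile(r'^"(?P<value>[^"]*)"\s+is\s+(?P<prop>\S+)\s+of\s+it$')


def sidif_to_triples(text: str) -> List[Triple]:
    """Turn the SiDIF subset shown above into (subject, predicate, object)."""
    triples: List[Triple] = []
    current: Optional[str] = None  # the subject that "it" refers to
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or marker/comment line (assumed convention)
        is_a = IS_A_RE.match(line)
        if is_a:
            current = is_a.group("subject")
            triples.append((current, "isA", is_a.group("cls")))
            continue
        prop = PROPERTY_RE.match(line)
        if prop and current is not None:
            triples.append((current, prop.group("prop"), prop.group("value")))
    return triples


snippet = """
# 🕸 Knowledge Graph Block
StaubsaugerXF704 isA Product
  "Staubsauger XF704" is name of it
  "acmeXF704.jpg" is image of it
"""
for triple in sidif_to_triples(snippet):
    print(triple)
```
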
Example

Let's eat our own dog food!

https://github.com/BITPlan/semantify3 has the proposed annotations. You can search for them with a `grep -R`.

Search for semantify annotations

Files that have the annotation marker 🌐🕸

grep -rl "🌐🕸" --include="*.py" semantify3
semantify3/tests/test_parser.py
semantify3/sem3/sem3_cmd.py
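
The same search can be done without grep. The following is a rough Python equivalent of the `grep -rl` call above; the function name `files_with_marker` and its default file pattern are assumptions for this sketch, not part of the semantify3 code base.

```python
from pathlib import Path
from typing import List

MARKER = "🌐🕸"


def files_with_marker(root: str, pattern: str = "*.py") -> List[Path]:
    """List files below root whose text contains the marker, like grep -rl."""
    hits: List[Path] = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="ignore" keeps a stray non-UTF-8 file from aborting the scan
        text = path.read_text(encoding="utf-8", errors="ignore")
        if MARKER in text:
            hits.append(path)
    return hits
```

Calling `files_with_marker("semantify3")` from the directory that contains the checkout should list the same files as the grep call, in sorted order.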

Extracted markup snippets

grep -E -rn -A5 '```(yaml|sidif)' --include="*.py" semantify3 | grep -A5 "🌐🕸"
semantify3/tests/test_parser.py-3-🌐🕸
semantify3/tests/test_parser.py-4-test_extractor isA PythonModule
semantify3/tests/test_parser.py-5-  "Wolfgang Fahl" is author of it
semantify3/tests/test_parser.py-6-  "2025-11-29" is createdAt of it
semantify3/tests/test_parser.py-7-  "Test main micro annotation snippet extraction"
--
--
semantify3/sem3/sem3_cmd.py-5-🌐🕸
semantify3/sem3/sem3_cmd.py-6-sem3_cmd:
semantify3/sem3/sem3_cmd.py-7-  isA: PythonModule
semantify3/sem3/sem3_cmd.py-8-  author: Wolfgang Fahl
semantify3/sem3/sem3_cmd.py-9-  created: 2025-11-29

YAML snippet (sem3_cmd.py)

sem3_cmd:
  isA: PythonModule
  author: Wolfgang Fahl
  created: 2025-11-29
  partOf: semantify3
  purpose: Command-line interface for semantify³

SiDIF snippet (test_parser.py)

test_extractor isA PythonModule
  "Wolfgang Fahl" is author of it
  "2025-11-29" is createdAt of it
  "Test main micro annotation snippet extraction" is purpose of it