Difference between revisions of "Semantify3"
Latest revision as of 18:23, 29 November 2025
semantify³ - extract knowledge graph-ready triples from human-readable annotations wherever possible - Syntax matters!
What it is
Inspired by the ideas in Syntax_Matters, this work showcases a straightforward approach intended to improve on established metadata embedding formats such as RDFa and Microformats.
Goal
Computer-readable content is ubiquitous these days. Many different file formats have emerged and proliferated rapidly. Acronyms such as HTML, JSON, XML, CSV, PDF, BibTeX, DOC(X), XLS(X), and PPT(X) are well known to most computer users and in daily use. Still, a query such as "Give me all PowerPoint files mentioned in my Excel table courses.xlsx - find the keywords in the notes and then look up in my BibTeX files which authors are mentioned to look them up in DBLP, Wikidata, and Google Scholar" easily leads to a frustrating amount of effort.
The long-standing promise of the Semantic Web has not yet been fulfilled. Personally, I believe we are much closer now than in the past decade, when the adoption rate of proposed solutions was subpar. The critical success factor for metadata adoption is Human Readability. If the authors/curators cannot easily write, verify, and maintain the metadata, the system collapses.
Therefore, metadata embedding should not pollute the Document Object Model (DOM) the way RDFa and Microformats do. Instead, it should reside in distinct, clean blocks. We propose using standard Markdown backticks to indicate the syntax (e.g., ` ```yaml `) combined with a UTF-8 World Wide Web Marker (e.g., 🌐🕸) to explicitly signal to crawlers that the embedded markup is intended for ingestion into a Knowledge Graph - the World Wide Web of Semantic Data.
Syntax matters!
Below is a comparison of legacy embedding methods versus the proposed clean-code approach using backticks and markers.
Legacy: Microformats
Microformats attempt to use existing HTML `class` attributes to convey meaning. This couples data to design, leading to fragility; if you change the CSS class name for styling, you might accidentally break your data graph.
<!-- Microformats: Fragile and tied to CSS classes -->
<div class="h-product">
  <p>Buy the
    <span class="p-name">Staubsauger XF704</span>
    <img class="u-photo" src="acmeXF704.jpg" alt="" />
  </p>
</div>
Legacy: RDFa
RDFa represents the absolute low point of the Semantic Web. It forces data into the HTML structure using attributes like `vocab` and `property`, creating a verbose, unreadable mess that is difficult for humans to parse visually and impossible to maintain.
<!-- RDFa: Visual clutter. The data is effectively hidden inside tags. -->
<div vocab="http://schema.org/" typeof="Product">
  <p>Buy the
    <span property="name">Staubsauger XF704</span>
    now on special offer!
    <img property="image" src="acmeXF704.jpg" />
  </p>
</div>
Proposal: Spider Marker + Backticks (YAML)
We propose using standard Markdown code fences. The Spider Marker (🕸) tells the machine "This is data," and the backticks tell the editor "This is YAML." The data remains clean and separated from the HTML.
```yaml
# 🕸 Knowledge Graph Block
Products:
  StaubsaugerXF704:
    name: Staubsauger XF704
    image: acmeXF704.jpg
```
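To illustrate how a crawler might pick up such blocks, here is a minimal Python sketch. It is not the semantify³ implementation: the regular expression, the name `extract_blocks`, and the rule that the marker must sit on the first line inside the fence are assumptions made for this example.

```python
import re

# Build the fence from single backticks so this example can itself live in a fenced block.
FENCE = "`" * 3

# Assumed convention: a fence line naming the language (yaml or sidif),
# then a line containing the 🕸 marker (so 🌐🕸 matches as well),
# then the body up to the next closing fence.
BLOCK_RE = re.compile(
    FENCE + r"(?P<lang>yaml|sidif)[ \t]*\n"
    r"(?P<marker>[^\n]*🕸[^\n]*)\n"
    r"(?P<body>.*?)" + FENCE,
    re.DOTALL,
)


def extract_blocks(text: str) -> list[tuple[str, str]]:
    """Return (language, body) pairs for every marked fenced block in text.

    The marker line itself is not part of the returned body.
    Fenced blocks without the marker are ignored.
    """
    return [
        (match.group("lang"), match.group("body").rstrip("\n"))
        for match in BLOCK_RE.finditer(text)
    ]


sample = "\n".join(
    [
        "Some prose before the data.",
        FENCE + "yaml",
        "# 🕸 Knowledge Graph Block",
        "Products:",
        "  StaubsaugerXF704:",
        "    name: Staubsauger XF704",
        FENCE,
        FENCE + "yaml",
        "unmarked: true",
        FENCE,
    ]
)

for lang, body in extract_blocks(sample):
    print(lang)
    print(body)
```

Requiring the marker means that ordinary code fences in a document are left alone; only blocks that the author explicitly flagged are treated as Knowledge Graph input.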
Proposal: Spider Marker + Backticks (SiDIF)
SiDIF allows a sentence-like structure that makes the logic immediately apparent to the human reader, while the code block keeps the data safely encapsulated.
```sidif
# 🕸 Knowledge Graph Block
StaubsaugerXF704 isA Product
"Staubsauger XF704" is name of it
"acmeXF704.jpg" is image of it
```
Example
Let's eat our own dog food!
https://github.com/BITPlan/semantify3 has the proposed annotations. You can search for them with a `grep -R`.
Search for semantify annotations
Files that have the annotation marker 🌐🕸
grep -rl "🌐🕸" --include="*.py" semantify3
semantify3/tests/test_parser.py
semantify3/sem3/sem3_cmd.py
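The same search can be done without `grep`, for example on systems where it is not available. The following Python sketch is only an approximation of the command above; the function name `files_with_marker` and its parameters are hypothetical and not part of semantify³.

```python
from pathlib import Path

MARKER = "🌐🕸"


def files_with_marker(root, pattern="*.py", marker=MARKER):
    """List files below root that match pattern and contain the marker.

    Roughly mirrors: grep -rl "🌐🕸" --include="*.py" <root>
    Files that cannot be read as UTF-8 are skipped.
    """
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue
        if marker in text:
            hits.append(path)
    return hits


if __name__ == "__main__":
    # Assumes the repository was cloned into ./semantify3
    for hit in files_with_marker("semantify3"):
        print(hit)
```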
Extracted markup snippets
grep -rnE -A5 '```(yaml|sidif)' --include="*.py" semantify3 | grep -A5 "🌐🕸"
semantify3/tests/test_parser.py-3-🌐🕸
semantify3/tests/test_parser.py-4-test_extractor isA PythonModule
semantify3/tests/test_parser.py-5- "Wolfgang Fahl" is author of it
semantify3/tests/test_parser.py-6- "2025-11-29" is createdAt of it
semantify3/tests/test_parser.py-7- "Test main micro annotation snippet extraction"
--
--
semantify3/sem3/sem3_cmd.py-5-🌐🕸
semantify3/sem3/sem3_cmd.py-6-sem3_cmd:
semantify3/sem3/sem3_cmd.py-7- isA: PythonModule
semantify3/sem3/sem3_cmd.py-8- author: Wolfgang Fahl
semantify3/sem3/sem3_cmd.py-9- created: 2025-11-29
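Once a SiDIF snippet such as the one from test_parser.py has been extracted, its sentences can be turned into triples. The Python sketch below handles only the two sentence forms that appear in this document (`X isA Y` and `"value" is property of it`). It is an illustration under that assumption, not the SiDIF reference grammar, and the name `sidif_to_triples` is hypothetical.

```python
import re

# Sentence form 1: <subject> isA <type>
ISA_RE = re.compile(r"^(?P<subject>\S+)\s+isA\s+(?P<type>\S+)$")
# Sentence form 2: "<value>" is <property> of <target>, where target is often "it"
VALUE_RE = re.compile(
    r'^"(?P<value>[^"]*)"\s+is\s+(?P<prop>\S+)\s+of\s+(?P<target>\S+)$'
)


def sidif_to_triples(text):
    """Convert a small SiDIF subset into (subject, predicate, object) triples.

    "it" is resolved to the subject of the most recent isA sentence.
    Blank lines and lines starting with # are skipped.
    """
    triples = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = ISA_RE.match(line)
        if match:
            current = match.group("subject")
            triples.append((current, "isA", match.group("type")))
            continue
        match = VALUE_RE.match(line)
        if match:
            target = match.group("target")
            subject = current if target == "it" else target
            if subject is None:
                raise ValueError(f"'it' used before any subject: {line!r}")
            triples.append((subject, match.group("prop"), match.group("value")))
            continue
        raise ValueError(f"unsupported SiDIF line: {line!r}")
    return triples


snippet = '''
test_extractor isA PythonModule
  "Wolfgang Fahl" is author of it
  "2025-11-29" is createdAt of it
'''

for triple in sidif_to_triples(snippet):
    print(triple)
```

Raising an error on unknown sentence forms is a deliberate choice for a sketch: silently dropping lines would hide annotations that a fuller parser would understand.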
YAML snippet (sem3_cmd.py)
sem3_cmd:
  isA: PythonModule
  author: Wolfgang Fahl
  created: 2025-11-29
  partOf: semantify3
  purpose: Command-line interface for semantify³
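One plausible reading of this YAML snippet is that the top-level key is the subject and each nested key/value pair is a predicate and object. The Python sketch below shows that reading on an already parsed mapping. Both the interpretation and the name `mapping_to_triples` are assumptions for illustration; a YAML parser would normally produce the dictionary, which is written out by hand here to keep the example free of dependencies.

```python
def mapping_to_triples(data: dict) -> list[tuple[str, str, str]]:
    """Flatten {subject: {predicate: value}} into (subject, predicate, object) triples."""
    triples = []
    for subject, properties in data.items():
        for predicate, value in properties.items():
            # Values are kept as strings; typing (dates, links) is out of scope here.
            triples.append((subject, predicate, str(value)))
    return triples


# Hand-written equivalent of the parsed YAML snippet above
annotation = {
    "sem3_cmd": {
        "isA": "PythonModule",
        "author": "Wolfgang Fahl",
        "created": "2025-11-29",
        "partOf": "semantify3",
        "purpose": "Command-line interface for semantify³",
    }
}

for triple in mapping_to_triples(annotation):
    print(triple)
```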
SiDIF snippet (test_parser.py)
test_extractor isA PythonModule
"Wolfgang Fahl" is author of it
"2025-11-29" is createdAt of it
"Test main micro annotation snippet extraction" is purpose of it