E.g. for Natural Language Processing for/using Knowledge Graphs (e.g. entity linking and resolution using target knowledge such as Wikidata and DBpedia, foundation models)
Use case: look up an event by acronym, e.g.
In the course of digitalizing scientific publishing, persistent identifiers (PIDs) have been introduced for quite a few entity types such as papers (DOI), authors (ORCID) and organizations (ROR), but unfortunately not for scientific events and event series, where the most common disambiguating identifier is still an acronym or short name such as ESWC 2023 or Semantics '23. (Only very few instances have PIDs: DOI (200), or pseudo PIDs: Wikidata Id (9000/1000).)
We estimate that some 5,000 (dblp) to 25,000 event series and some 50,000 (dblp) to 250,000 events that have public digital traces (as part of their lifecycle), such as homepages, entries in public CfPs and library indices for their proceedings, would still need PIDs.
To create PIDs and enter the metadata in public KGs such as Wikidata, acronyms look like a promising tool for disambiguation (as has been shown by the work of Simon Cobb using OpenRefine ...)
Given a piece of natural language text (and its context) and a semi-structured corpus of digital traces of scientific communication assembled from different sources, we'd like to perform a two-step process:
(under the assumption there is a common sense ..)
Resource bounded data cleaning, disambiguation and knowledge graph extraction.
Maximizing the "overall" effort/result ratio is the goal. Please note that the effort to maximize the effort/result ratio is part of the effort.
Assumption: separating the input data into standard cases, corner cases and exotic cases according to a Zipfian / long-tail / Pareto distribution makes it possible to simplify the necessary formalization, avoid accidental complexity and get to a better effort/result ratio.
See Approach ...
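The split into standard, corner and exotic cases assumed above could be approximated by ranking coarse string patterns by frequency. The following is a minimal sketch, not the project's actual method: the helper names `shape` and `split_cases` and the thresholds of 80% and 95% cumulative coverage are hypothetical choices made only for illustration.

```python
import re
from collections import Counter


def shape(text):
    """Reduce a string to a coarse pattern: letter runs become A/a, digit runs become 9."""
    s = re.sub(r"[A-Z]+", "A", text)
    s = re.sub(r"[a-z]+", "a", s)
    return re.sub(r"[0-9]+", "9", s)


def split_cases(values, standard_share=0.80, corner_share=0.95):
    """Assign each pattern to 'standard', 'corner' or 'exotic' by cumulative frequency.

    The two share thresholds are illustrative assumptions, not values from the text.
    """
    counts = Counter(shape(v) for v in values)
    total = sum(counts.values())
    buckets = {}
    covered = 0
    for pattern, n in counts.most_common():
        # The bucket depends on how much of the data the more frequent
        # patterns have already covered before this one is added.
        share_before = covered / total
        if share_before < standard_share:
            buckets[pattern] = "standard"
        elif share_before < corner_share:
            buckets[pattern] = "corner"
        else:
            buckets[pattern] = "exotic"
        covered += n
    return buckets
```

With such a split, the formalization effort can go first into the few patterns that cover most of the data, while the long tail is handled separately or left for manual curation.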
About 60% of all acronyms extracted from WikiCFP match the regular expression
[A-Z]+\s*[12][0-9]{3}
e.g. ISWC 2012
43990/73731 (59.7%) matches for [A-Z]+\s*[12][0-9]{3}
654/43989 (1.5%) year different
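Statistics of this kind could be reproduced with a short script along the following lines. This is only a sketch under several assumptions: the names `classify` and `match_stats`, the capture group added around the year, the use of a full-string match, and the reading of "year different" as "the year in the acronym differs from the year recorded for the event" are not taken from the original evaluation.

```python
import re
from collections import Counter

# Regular expression from the text, with a capture group added around the
# year so that it can be compared with the event's year (assumption).
STANDARD_ACRONYM = re.compile(r"[A-Z]+\s*([12][0-9]{3})")


def classify(acronym, event_year=None):
    """Return 'standard', 'year_differs' or 'other' for one extracted acronym."""
    match = STANDARD_ACRONYM.fullmatch(acronym.strip())
    if match is None:
        return "other"
    if event_year is not None and int(match.group(1)) != event_year:
        return "year_differs"
    return "standard"


def match_stats(records):
    """Count the classifications for an iterable of (acronym, event_year) pairs."""
    return Counter(classify(acronym, year) for acronym, year in records)


# Illustrative records only; the event years are made up for this example.
sample = [
    ("ISWC 2012", 2012),
    ("ESWC2023", 2023),
    ("ISWC 2012", 2013),
    ("12th EANN / 7th AIAI Joint Congress 2011", 2011),
]
print(match_stats(sample))
```

A full-string match is the stricter choice here. A prefix or substring match would also accept longer strings that merely contain an acronym followed by a year, so the percentages depend on which variant is used.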
Long acronyms tend to indicate that the extraction has not worked or that there is some other issue with the acronym, such as a joint / co-located event.
SELECT acronym
FROM "event_wikicfp"
WHERE length(acronym)=40
The acronym entries with a length of 40 are mostly not acronyms ...
...
Political Theology Agenda Symposium 2010
Knowledge Engineering Special Issue 2010
CFP MapReduce Special Issue of CCPE 2010
AOSD - Student Research Competition 2011
special session for Wireless VITAE 2011
Political Theology Agenda Symposium 2011
12th EANN / 7th AIAI Joint Congress 2011
...
There is only one entry in WikiCFP where the extracted acronym is longer than 50 characters.
SELECT acronym,url
FROM "event_wikicfp"
where length(acronym)>50
call for chapters - images of female aggression 2016 http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=52302
This is not a call for papers for a scientific event at all.
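Instead of probing single lengths, the two queries above can be generalized into a histogram of acronym lengths, which helps to pick a cut-off for suspicious entries. The sketch below assumes an SQLite database and uses Python's `sqlite3` module. The table and column names are taken from the queries above, while `build_demo_db`, `acronym_length_histogram` and the tiny in-memory table are hypothetical and serve only to make the example runnable.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the event_wikicfp table (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "event_wikicfp" (acronym TEXT, url TEXT)')
    rows = [
        ("ISWC 2012", None),
        ("Political Theology Agenda Symposium 2010", None),
        ("Knowledge Engineering Special Issue 2010", None),
        ("call for chapters - images of female aggression 2016",
         "http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=52302"),
    ]
    conn.executemany('INSERT INTO "event_wikicfp" VALUES (?, ?)', rows)
    return conn


def acronym_length_histogram(conn):
    """Return (length, number of entries) pairs, longest acronyms first."""
    query = (
        'SELECT length(acronym) AS acronym_length, count(*) AS entries '
        'FROM "event_wikicfp" '
        'GROUP BY length(acronym) '
        'ORDER BY acronym_length DESC'
    )
    return conn.execute(query).fetchall()


for acronym_length, entries in acronym_length_histogram(build_demo_db()):
    print(acronym_length, entries)
```

On the real table, the upper end of such a histogram would list the lengths at which extraction problems like the ones shown above start to dominate.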