Acronym paper
- Acronym definition see Acronym
Problem Statement
E.g. Natural Language Processing for/using Knowledge Graphs (e.g. entity linking and resolution against target knowledge graphs such as Wikidata and DBpedia, foundation models)
Use case: look up an event by acronym, e.g.
- ESWC -> https://www.openresearch.org/wiki/ESWC
- ESWC -> https://scholia.toolforge.org/event-series/Q17012957
In the course of digitizing scientific publishing, persistent identifiers (PIDs) have been introduced for quite a few entity types such as papers (DOI), authors (ORCID) and organizations (ROR), but unfortunately not for scientific events and event series, where the most common disambiguating identifiers are still acronyms/short names such as ESWC 2023 or Semantics '23. Only very few instances have PIDs such as DOIs (200) or pseudo PIDs such as Wikidata IDs (9000/1000).
We estimate that some 5,000 (dblp) to 25,000 event series and some 50,000 (dblp) to 250,000 events that have public digital traces (as part of their lifecycle), such as homepages, entries in public CFPs and library indices for their proceedings, would still need PIDs.
To create PIDs and enter the metadata in public KGs such as Wikidata, acronyms look like a promising tool for disambiguation (as has been shown by the work of Simon Cobb using OpenRefine ...)
Given a piece of natural language text (and its context) and a semi-structured corpus of digital traces of scientific communication assembled from different sources, we would like to perform a two-step process:
- Assert whether the text (char string) is an acronym for some (1 or more) event or event series (likelihood)
- Map the acronym to the knowledge graph of proceedings/events/eventseries
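The two steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy lookup table `TOY_KG`, the function names and the likelihood values 1.0 / 0.5 / 0.0 are hypothetical placeholders and not part of the corpus or of any existing tool. Only the two ESWC URLs are taken from the use case above.

```python
import re
from typing import List

# Hypothetical toy "knowledge graph": acronym -> candidate entries.
# The ESWC URLs come from the use case above; everything else is illustrative.
TOY_KG = {
    "ESWC": [
        "https://www.openresearch.org/wiki/ESWC",
        "https://scholia.toolforge.org/event-series/Q17012957",
    ],
}

# Assumption: an acronym candidate consists of at least two upper-case letters.
ACRONYM_LIKE = re.compile(r"[A-Z]{2,}")


def acronym_likelihood(text: str) -> float:
    """Step 1: crude score that `text` is an acronym of an event or event series."""
    token = text.strip()
    if token in TOY_KG:
        return 1.0  # known acronym
    if ACRONYM_LIKE.fullmatch(token):
        return 0.5  # looks like an acronym but is unknown (placeholder value)
    return 0.0


def map_acronym(text: str) -> List[str]:
    """Step 2: map the acronym to zero or more knowledge graph entries."""
    return TOY_KG.get(text.strip(), [])


if __name__ == "__main__":
    for candidate in ["ESWC", "XYZ", "hello"]:
        print(candidate, acronym_likelihood(candidate), map_acronym(candidate))
```

A real implementation would replace the dictionary lookup by queries against the assembled corpus and would have to deal with ambiguous acronyms that map to more than one entry.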
Common Sense Assumptions / Situation
(under the assumption there is a common sense ..)
- An acronym such as ISWC identifies one or more scientific event series (Semantic Web / Wearable Computing)
- Typically an acronym/year combination is used to identify installments of such event series, e.g. ISWC 2022 / ESWC 2022 / Semantics 2023
- The referencing of such events is done using these acronyms during the whole lifecycle:
  - announcements are made via e.g. http://iswc.2022.org
  - CFPs use e.g. ISWC 2023 / ISWC in the title/metadata of the CFP
  - indexing is done using the acronyms, e.g. in dblp / TIBKat / Wikidata
  - citations use the acronyms, e.g. "Proceedings ISWC 2022, pp. 153-159 ..."
- PIDs are not common yet
Ideal Idea of digitization of this realm
- In an ideal world there would be a KG that represents the entities Proceedings, Event and EventSeries and mostly allows interlinking them by acronyms, with some exceptions where acronyms are ambiguous and disambiguation via other metadata is necessary. Ideally a PID is available for each entity type to avoid the need for disambiguation.
Approach
Resource-bounded data cleaning, disambiguation and knowledge graph extraction.
Optimizing the "overall" effort/result ratio is the goal. Please note that the effort spent on optimizing the effort/result ratio is itself part of the effort.
Assumption: separating the input data into standard cases, corner cases and exotic cases according to a Zipfian / long-tail / Pareto distribution allows us to simplify the necessary formalization, avoid accidental complexity and arrive at a better effort/result ratio.
Research questions
- What do acronyms for scientific events and event series look like and how formally can they be described?
- How well do acronyms disambiguate scientific events and event series?
- How well is the acronym information curated in metadata sources for events and event series?
- How well are acronyms used in citations of scientific events and event series?
- Acronym checker - does the acronym fit the long version ...
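The acronym checker idea from the last question could start from a very simple heuristic. The sketch below is an assumption about what "fits" might mean, not an established definition: it accepts an acronym if its letters occur, in order, among the initials of the words of the long version, so that words such as "on" may be skipped. The function name `acronym_fits` is hypothetical.

```python
def acronym_fits(acronym: str, long_form: str) -> bool:
    """Return True if the acronym letters are, in order, initials of words in long_form."""
    if not acronym:
        return False
    # Initials of all words that start with a letter, upper-cased.
    initials = [word[0].upper() for word in long_form.split() if word[0].isalpha()]
    remaining = iter(initials)
    # Subsequence test: each acronym letter must be found in the not yet
    # consumed initials, so words in between may be skipped.
    return all(letter in remaining for letter in acronym.upper())


if __name__ == "__main__":
    print(acronym_fits("ISWC", "International Semantic Web Conference"))
    print(acronym_fits("ISWC", "International Symposium on Wearable Computers"))
    print(acronym_fits("ESWC", "International Semantic Web Conference"))
```

This heuristic fails for acronyms that use more than one letter of a word or letters from inside a word, which is one reason why the question is listed as open.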
Method
See Approach ...
What do acronyms for scientific events and event series look like and how formally can they be described?
- Try regular expressions see Acronym_-_Regular_Expressions
- Check length histograms see https://github.com/WolfgangFahl/ConferenceCorpus/blob/main/tests/testAcronymCategory.py
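A length histogram of this kind can be computed in a few lines. The following is a minimal sketch and not the code of the linked test: the function name `length_histogram` is hypothetical and the sample values are acronym strings quoted elsewhere on this page.

```python
from collections import Counter


def length_histogram(acronyms):
    """Count how many acronyms have each character length."""
    return Counter(len(acronym) for acronym in acronyms)


# Small sample taken from examples on this page (not the full data set).
sample = [
    "ISWC 2012",
    "ESWC 2022",
    "ISWC 2022",
    "Political Theology Agenda Symposium 2010",
]
hist = length_histogram(sample)

if __name__ == "__main__":
    # Print a simple text histogram, one row per observed length.
    for length in sorted(hist):
        print(f"{length:3d} {'#' * hist[length]}")
```

Unusually long entries stand out immediately in such a histogram, which is the motivation for treating length as an indicator of extraction problems.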
Results
What do acronyms look like
Length distribution
WikiCFP
Standard case
About 60% of all acronyms extracted from WikiCFP match the regular expression
[A-Z]+\s*[12][0-9]{3}
e.g. ISWC 2012
43990/73731 (59.7%) matches for [A-Z]+\s*[12][0-9]{3}
654/43989 (1.5%) year different
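Statistics of this form can be reproduced with a short script. The sketch below makes several assumptions that are not confirmed by the source: the whole acronym must match the pattern (`fullmatch`), "year different" means that the year in the acronym differs from the year recorded for the event, and the sample records (including the deliberately mismatching one) are made up for illustration. Capture groups were added to the pattern to extract the year.

```python
import re

# Pattern from the text, with capture groups added for the acronym and year parts.
STANDARD_CASE = re.compile(r"([A-Z]+)\s*([12][0-9]{3})")


def match_stats(records):
    """records: iterable of (acronym, event_year) pairs.

    Returns (matches, total, year_different).
    """
    total = matches = year_different = 0
    for acronym, event_year in records:
        total += 1
        m = STANDARD_CASE.fullmatch(acronym)  # assumption: whole string must match
        if m is None:
            continue
        matches += 1
        if int(m.group(2)) != event_year:
            year_different += 1
    return matches, total, year_different


# Illustrative records only; the third one has an intentionally different year.
records = [
    ("ISWC 2012", 2012),
    ("ESWC2022", 2022),
    ("ISWC 2022", 2023),
    ("Knowledge Engineering Special Issue 2010", 2010),
]
matches, total, year_different = match_stats(records)

if __name__ == "__main__":
    print(f"{matches}/{total} ({matches / total:.1%}) matches")
    print(f"{year_different}/{matches} ({year_different / matches:.1%}) year different")
```

Using a substring search instead of a full match would also count entries such as "CFP MapReduce Special Issue of CCPE 2010", so the choice between the two changes the reported percentages.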
Corner cases
Long acronyms tend to indicate that the extraction has not worked or that there is some other issue with the acronym, such as a joint / co-located event.
SELECT acronym
FROM "event_wikicfp"
WHERE length(acronym)=40
The acronym entries with a length of 40 are mostly not acronyms ...
...
Political Theology Agenda Symposium 2010
Knowledge Engineering Special Issue 2010
CFP MapReduce Special Issue of CCPE 2010
AOSD - Student Research Competition 2011
special session for Wireless VITAE 2011
Political Theology Agenda Symposium 2011
12th EANN / 7th AIAI Joint Congress 2011
...
Exotic cases / Outliers
There is only one entry in WikiCFP where the extracted acronym is longer than 50 characters.
SELECT acronym, url
FROM "event_wikicfp"
WHERE length(acronym)>50
call for chapters - images of female aggression 2016 http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=52302
This is not a call for papers for a scientific event at all.
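The two length queries above can be tried out without the full corpus. The sketch below assumes an in-memory SQLite database that reuses the table name event_wikicfp with only the two columns used in the queries; the three rows are taken from examples on this page and the real table certainly has more columns and rows.

```python
import sqlite3

# Assumption: a throwaway in-memory database with a reduced table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_wikicfp (acronym TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO event_wikicfp (acronym, url) VALUES (?, ?)",
    [
        ("ISWC 2012", None),
        ("Political Theology Agenda Symposium 2010", None),
        (
            "call for chapters - images of female aggression 2016",
            "http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=52302",
        ),
    ],
)

# Corner cases: entries with exactly 40 characters.
corner_cases = conn.execute(
    "SELECT acronym FROM event_wikicfp WHERE length(acronym) = 40"
).fetchall()

# Exotic cases / outliers: entries with more than 50 characters.
exotic_cases = conn.execute(
    "SELECT acronym, url FROM event_wikicfp WHERE length(acronym) > 50"
).fetchall()

if __name__ == "__main__":
    print(corner_cases)
    print(exotic_cases)
```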