Acronym paper

From BITPlan cr Wiki

Problem Statement

E.g. for Natural Language Processing for/using Knowledge Graphs (e.g. entity linking and resolution against target knowledge graphs such as Wikidata and DBpedia, foundation models)

Use case: look up an event by its acronym, e.g.

In the process of digitalizing scientific publishing, persistent identifiers (PIDs) have been introduced for quite a few entities such as papers (DOI), authors (ORCID) and organizations (ROR), but unfortunately not for scientific events and event series, where the most common disambiguating identifier is still an acronym or short name such as ESWC 2023 or Semantics '23. Only very few instances have PIDs (DOI: 200) or pseudo-PIDs (Wikidata ID: 9000/1000).

We estimate that some 5,000 (dblp) to 25,000 event series and some 50,000 (dblp) to 250,000 events that have public digital traces (as part of their lifecycle), such as homepages, entries in public CFPs and library indices for their proceedings, would still need PIDs.

To create PIDs and enter the metadata in public KGs such as Wikidata, acronyms look like a promising tool for disambiguation (as has been shown by the work of Simon Cobb using OpenRefine ...)

Given a piece of natural language text (and its context) and a semi-structured corpus of digital traces of scientific communication assembled from different sources, we'd like to perform a two-step process:

  1. Assess how likely it is that the text (character string) is an acronym for one or more events or event series
  2. Map the acronym to the knowledge graph of proceedings/events/event series

Common Sense Assumptions / Situation

(under the assumption there is a common sense ..)

  • An acronym such as ISWC identifies one or more scientific event series (Semantic Web / Wearable Computing)
  • Typically an acronym/year combination is used to identify installments of such events, e.g. ISWC 2022 / ESWC 2022 / Semantics 2023
  • Such events are referenced by these acronyms during their whole lifecycle:
  1. announcements are done via e.g. http://iswc.2022.org
  2. CFPs are done using e.g. ISWC 2023 / ISWC in the title/metadata of the CFP
  3. indexing is done using the acronyms, e.g. in dblp / TIBKat / Wikidata
  4. citations are done using e.g. "Proceedings ISWC 2022, pp. 153-159 ..."
  • PIDs are not common yet

Ideal Idea of digitization of this realm

  • In an ideal world there would be a KG that represents the entities Proceedings, Event and EventSeries and mostly allows interlinking them by acronyms, with some exceptions where acronyms are ambiguous and disambiguation via other metadata is necessary. Ideally a PID is available for each entity type so that no disambiguation is needed.

Approach

Resource bounded data cleaning, disambiguation and knowledge graph extraction.

The goal is to optimize the "overall" effort/result ratio, i.e. to get the best result for the effort spent. Please note that the effort spent on optimizing this ratio is itself part of the effort.

Assumption: separating the input data into standard cases, corner cases and exotic cases according to a Zipfian / long-tail / Pareto distribution makes it possible to simplify the necessary formalization, avoid accidental complexity and reach a better effort/result ratio.

Research questions

  1. What do acronyms for scientific events and event series look like and how formally can they be described?
  2. How well do acronyms disambiguate scientific events and event series?
  3. How well is the acronym information curated in metadata sources for events and event series?
  4. How well are acronyms used in citations of scientific events and event series?
  5. Acronym checker: does the acronym fit the long version ...
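One possible reading of the acronym checker in question 5 is sketched below. This is only a heuristic under stated assumptions: the function name `acronym_fits` is hypothetical, and the rule (every letter of the acronym must appear, in order, among the initial letters of the words of the long name) is one simple choice among many, not a method taken from this page.

```python
import re


def acronym_fits(acronym: str, long_name: str) -> bool:
    """Heuristic check (an assumption, not the page's method): the letters of
    the acronym must occur in order among the word initials of the long name."""
    # Keep only the letters of the acronym, so "ISWC 2022" is treated like "ISWC".
    letters = [c.lower() for c in acronym if c.isalpha()]
    # Initial letter of every alphabetic word in the long name.
    initials = iter(w[0].lower() for w in re.findall(r"[A-Za-z]+", long_name))
    # "letter in initials" consumes the iterator, which enforces the ordering.
    return bool(letters) and all(letter in initials for letter in letters)


if __name__ == "__main__":
    print(acronym_fits("ISWC", "International Semantic Web Conference"))
    print(acronym_fits("ISWC 2022", "International Symposium on Wearable Computers"))
    print(acronym_fits("ESWC", "International Semantic Web Conference"))
```

Note that both long names accepted for ISWC pass the check, which illustrates why such a checker can confirm plausibility but cannot disambiguate between event series on its own.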

Method

See Approach ...

What do acronyms for scientific events and event series look like and how formally can they be described?

Results

What do acronyms look like

Length distribution

AcronymHistograms

WikiCFP

Standard case

60% of all WikiCFP acronyms extracted are matching the regular expression

[A-Z]+\s*[12][0-9]{3}

e.g. ISWC 2012
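The standard-case pattern can be tried out with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and matching the whole string with `fullmatch` is an assumption, since the page does not say whether the original analysis anchored the pattern.

```python
import re

# Pattern quoted from the text: uppercase letters, optional whitespace, 4-digit year.
STANDARD_CASE = re.compile(r"[A-Z]+\s*[12][0-9]{3}")


def is_standard_case(acronym: str) -> bool:
    """True if the whole string matches the standard-case pattern (assumption:
    full-string match)."""
    return STANDARD_CASE.fullmatch(acronym) is not None


def standard_share(acronyms):
    """Return (matches, total, share) for an iterable of extracted acronyms."""
    acronyms = list(acronyms)
    hits = sum(is_standard_case(a) for a in acronyms)
    share = hits / len(acronyms) if acronyms else 0.0
    return hits, len(acronyms), share


if __name__ == "__main__":
    sample = [
        "ISWC 2012",
        "ESWC2023",
        "Semantics '23",
        "Political Theology Agenda Symposium 2010",
    ]
    hits, total, share = standard_share(sample)
    print(f"{hits}/{total} ({share:.1%}) matches")
```

The sample list is illustrative only and is not meant to reproduce the WikiCFP figures.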

43990/73731 ( 59.7%)  matches for [A-Z]+\s*[12][0-9]{3}
654/43989 (  1.5%)  year different

Corner cases

Long acronyms tend to indicate that the extraction has not worked or that there is some other issue with the acronym, such as a joint / co-located event.

SELECT acronym
FROM "event_wikicfp"
WHERE length(acronym) = 40

The acronym entries with a length of 40 are mostly not acronyms ...

...
Political Theology Agenda Symposium 2010
Knowledge Engineering Special Issue 2010
CFP MapReduce Special Issue of CCPE 2010
AOSD - Student Research Competition 2011
special session for Wireless VITAE 2011
Political Theology Agenda Symposium 2011
12th EANN / 7th AIAI Joint Congress 2011 
...

Exotic cases / Outliers

There is only one entry in WikiCFP where the extracted acronym is longer than 50 characters.

SELECT acronym, url
FROM "event_wikicfp"
WHERE length(acronym) > 50

call for chapters - images of female aggression 2016 http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=52302

This is not a call for papers for a scientific event at all.
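The two length-based queries can be replayed against a small in-memory SQLite table, for example to experiment with other length thresholds. The sketch below makes several assumptions: the table layout is reduced to the two columns used in the queries, only three rows quoted on this page are loaded, the URLs of the first two rows are unknown and left empty, and the helper function names are hypothetical.

```python
import sqlite3

# In-memory stand-in for the real table; only the columns used above are modeled.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "event_wikicfp" (acronym TEXT, url TEXT)')
conn.executemany(
    'INSERT INTO "event_wikicfp" (acronym, url) VALUES (?, ?)',
    [
        ("ISWC 2012", None),  # standard case; URL not given on the page
        ("Political Theology Agenda Symposium 2010", None),  # corner case
        (
            "call for chapters - images of female aggression 2016",
            "http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=52302",
        ),  # exotic case / outlier
    ],
)


def acronyms_with_length(connection, n):
    """Acronyms whose length is exactly n characters."""
    sql = 'SELECT acronym FROM "event_wikicfp" WHERE length(acronym) = ?'
    return [row[0] for row in connection.execute(sql, (n,))]


def acronyms_longer_than(connection, n):
    """(acronym, url) pairs whose acronym is longer than n characters."""
    sql = 'SELECT acronym, url FROM "event_wikicfp" WHERE length(acronym) > ?'
    return connection.execute(sql, (n,)).fetchall()


if __name__ == "__main__":
    print(acronyms_with_length(conn, 40))
    print(acronyms_longer_than(conn, 50))
```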