Acronym - Regular Expressions

From BITPlan cr Wiki
Revision as of 13:06, 31 October 2020 by Wf (talk | contribs) (→‎Discussion)

Experiments

WikiCFP Acronyms

Input: Each WikiCFP page has a description RDFa property that uses the syntax <acronym>:<title>. For example, http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=20605 has:

ISEM 2012 : 8th International Conference on Semantic Systems

Acronym: ISEM 2012
Title: 8th International Conference on Semantic Systems

We scraped 81966 WikiCFP pages and extracted the acronym and title from each description based on this syntax.
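The split of a description into acronym and title can be sketched as follows. This is a minimal illustration, not the scraper's actual code: the function name split_description is hypothetical, and splitting at the first colon is an assumption (it presumes that acronyms contain no colon, while titles may).

```python
from typing import Optional, Tuple


def split_description(description: str) -> Optional[Tuple[str, str]]:
    """Split '<acronym> : <title>' at the first colon.

    Returns None when the description contains no colon at all.
    Hypothetical helper; the first-colon rule is an assumption.
    """
    acronym, separator, title = description.partition(":")
    if not separator:
        return None
    # Surrounding whitespace around the colon is not part of either field
    return acronym.strip(), title.strip()


example = "ISEM 2012 : 8th International Conference on Semantic Systems"
print(split_description(example))
```

For the example page this yields the acronym "ISEM 2012" and the title "8th International Conference on Semantic Systems".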

Question: What patterns do the acronyms follow with what frequency?

Assumption: The regular expression '[A-Z]+\s*[0-9]+' might fit: a sequence of upper case letters, followed by optional whitespace, followed by a sequence of digits.
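One detail of the assumed pattern is worth illustrating: Python's re.match anchors the pattern at the start of the string only, so trailing text after the digits is tolerated, while re.fullmatch requires the whole string to fit. The sketch below compares both on a few acronyms. The first three strings are taken from the WikiCFP data; "ISEM2012" and "ISEM 2012b" are made-up strings added only to show the difference.

```python
import re

PATTERN = re.compile(r'[A-Z]+\s*[0-9]+')


def prefix_match(acronym: str) -> bool:
    """True if the pattern fits at the start of the string (re.match semantics)."""
    return PATTERN.match(acronym) is not None


def full_match(acronym: str) -> bool:
    """True only if the pattern covers the whole string (re.fullmatch semantics)."""
    return PATTERN.fullmatch(acronym) is not None


samples = [
    "COLING 2008",
    "Prosody and Language Processing 2008",
    "LabPhon 11 2008",
    "ISEM2012",    # made-up: no whitespace between letters and digits
    "ISEM 2012b",  # made-up: trailing character after the digits
]
for acronym in samples:
    print("%-40s match=%-5s fullmatch=%s" % (acronym, prefix_match(acronym), full_match(acronym)))
```

Mixed-case names such as "LabPhon 11 2008" fail both checks because the first lower case letter interrupts the run of upper case letters before any digit is reached.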

Test with Sample of 5

ProceedingsTitleParser/WolfgangFahl:adds test for Acronyms (Wolfgang Fahl/2020-10-31 12:18:54 +0100)

    def testAcronyms(self):
        '''
        test Acronyms
        '''
        # needs "import re" at module level
        wikiCFP=WikiCFP()
        em=wikiCFP.em
        # open the SQL cache that holds the scraped WikiCFP events
        sqlDB=em.getSQLDB(em.getCacheFile(em.config, StoreMode.SQL))
        acronymRecords=sqlDB.query("select acronym from event_wikicfp")
        print("total acronyms: %d" % len(acronymRecords))
        for regex in [r'[A-Z]+\s*[0-9]+']:
            # check only the first 5 records
            for acronymRecord in acronymRecords[:5]:
                acronym=acronymRecord['acronym']
                # re.match anchors at the start of the string only
                matches=re.match(regex,acronym)
                print("%s:%s" % ('✅' if matches else '❌', acronym))

✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008

That is an 80% fit for the first 5. This looks promising, but was it just by chance? Try a larger sample of 10:

        limit=10
        for regex in [r'[A-Z]+\s*[0-9]+']:
            # reset the counter per regex so counts do not accumulate across patterns
            count=0
            for acronymRecord in acronymRecords[:limit]:
                acronym=acronymRecord['acronym']
                matches=re.match(regex,acronym)
                if matches:
                    count+=1
                print("%s:%s" % ('✅' if matches else '❌', acronym))
            print("%d/%d (%5.1f%%) matches for %s" % (count,limit,count/limit*100,regex))

✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
❌:LabPhon 11 2008
✅:MALC 2007
❌:KCTOS workshop 2007
❌:Euralex 2008 2008
❌:SCL and SPCL Cayenne 2008

5/10 ( 50.0%) matches for [A-Z]+\s*[0-9]+

Poorer luck this time? Try a bigger sample of 10,000:

7077/10000 ( 70.8%) matches for [A-Z]+\s*[0-9]+

Will this converge? Try 50,000 acronyms now:

33707/50000 ( 67.4%) matches for [A-Z]+\s*[0-9]+

Try the whole sample set of 81,966 next:

49728/81966 ( 60.7%) matches for [A-Z]+\s*[0-9]+
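The repeated runs with growing limits can be folded into one helper. The following is a minimal sketch, assuming the acronyms are already available as a plain list of strings. The function name match_rate is hypothetical, and the ten acronyms are the first ten records shown above.

```python
import re


def match_rate(acronyms, regex, limit=None):
    """Return (count, total, percent) of acronyms whose start fits the regex.

    Only the first `limit` entries are checked; None means all entries.
    Hypothetical helper for illustration.
    """
    subset = acronyms if limit is None else acronyms[:limit]
    count = sum(1 for acronym in subset if re.match(regex, acronym))
    total = len(subset)
    percent = 100.0 * count / total if total else 0.0
    return count, total, percent


# The first ten acronyms from the WikiCFP data, in eventId order
sample = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]
regex = r'[A-Z]+\s*[0-9]+'
for limit in (5, 10):
    count, total, percent = match_rate(sample, regex, limit)
    print("%d/%d (%5.1f%%) matches for %s" % (count, total, percent, regex))
```

On these ten records the helper reproduces the 4/5 and 5/10 results from the first two experiments. The same call with limits of 10000, 50000 and None would cover the larger runs, given the full list.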

Discussion

Note that the samples were not random: the query returns the records in the order of the "eventId" that was used for scraping, so each sample is simply the first n records. This order might also be correlated with the year of the CFP, which we extracted from the "start_date" RDFa property.
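One way to avoid the ordering effect would be to draw the sample at random instead of taking the first n records. The sketch below is an assumption about how this could be done, not what was done in the experiments above. The function name random_match_rate and the fixed seed are hypothetical choices, and the seed is there only to make a run repeatable.

```python
import random
import re


def random_match_rate(acronyms, regex, sample_size, seed=42):
    """Return (count, total) of regex matches in a random sample without replacement.

    Hypothetical helper; the seed is fixed only so that runs are repeatable.
    """
    rng = random.Random(seed)
    # Never ask for more elements than the list holds
    size = min(sample_size, len(acronyms))
    sample = rng.sample(acronyms, size)
    count = sum(1 for acronym in sample if re.match(regex, acronym))
    return count, size


acronyms = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]
count, total = random_match_rate(acronyms, r'[A-Z]+\s*[0-9]+', 5)
print("%d/%d matches in a random sample" % (count, total))
```

With the full list of acronyms, repeating this for several seeds would show how much the match rate varies between samples of the same size.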

order of CFPs

select eventId,acronym,year,url from event_wikicfp limit 10

| eventId | acronym | year | url |
|---|---|---|---|
| wikiCFP#158 | COLING 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=158 |
| wikiCFP#159 | IJCNLP 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=159 |
| wikiCFP#160 | Prosody and Language Processing 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=160 |
| wikiCFP#161 | ICGL 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=161 |
| wikiCFP#162 | ICPLA 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=162 |
| wikiCFP#163 | LabPhon 11 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=163 |
| wikiCFP#164 | MALC 2007 | 2007 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=164 |
| wikiCFP#165 | KCTOS workshop 2007 | 2007 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=165 |
| wikiCFP#166 | Euralex 2008 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=166 |
| wikiCFP#167 | SCL and SPCL Cayenne 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=167 |

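CFPs per year

select count(*) as perYear,year 
from event_wikicfp 
group by year
order by 2

| perYear | year |
|---|---|
| 8235 | None |
| 1 | 35 |
| 1 | 1920 |
| 20 | 2000 |
| 5 | 2001 |
| 1 | 2002 |
| 1 | 2004 |
| 2 | 2005 |
| 462 | 2007 |
| 2177 | 2008 |
| 2541 | 2009 |
| 4069 | 2010 |
| 5363 | 2011 |
| 5298 | 2012 |
| 5153 | 2013 |
| 5742 | 2014 |
| 5721 | 2015 |
| 6346 | 2016 |
| 7355 | 2017 |
| 8207 | 2018 |
| 8006 | 2019 |
| 6237 | 2020 |
| 1014 | 2021 |
| 2 | 2022 |
| 1 | 2024 |
| 1 | 2025 |
| 3 | 2026 |
| 1 | 2081 |
| 1 | 2091 |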
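To check whether the match rate really depends on the year, the counts could be broken down per year. The following is a minimal sketch under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database instead of the project's SQL cache, fills a table named event_wikicfp with only the ten records listed above, and does the regex matching in Python. The function name match_rate_per_year is hypothetical.

```python
import re
import sqlite3
from collections import defaultdict

# Toy data: the ten records shown above (acronym, year)
rows = [
    ("COLING 2008", 2008), ("IJCNLP 2008", 2008),
    ("Prosody and Language Processing 2008", 2008),
    ("ICGL 2008", 2008), ("ICPLA 2008", 2008),
    ("LabPhon 11 2008", 2008), ("MALC 2007", 2007),
    ("KCTOS workshop 2007", 2007), ("Euralex 2008 2008", 2008),
    ("SCL and SPCL Cayenne 2008", 2008),
]

# In-memory stand-in for the real SQL cache (assumption)
con = sqlite3.connect(":memory:")
con.execute("create table event_wikicfp (acronym text, year integer)")
con.executemany("insert into event_wikicfp values (?, ?)", rows)


def match_rate_per_year(con, regex):
    """Return {year: (matches, total)} for all records in event_wikicfp."""
    pattern = re.compile(regex)
    stats = defaultdict(lambda: [0, 0])
    for acronym, year in con.execute("select acronym, year from event_wikicfp"):
        stats[year][1] += 1
        # Guard against missing acronyms before matching
        if acronym and pattern.match(acronym):
            stats[year][0] += 1
    return {year: tuple(counts) for year, counts in stats.items()}


rates = match_rate_per_year(con, r'[A-Z]+\s*[0-9]+')
# Years can be missing (None) in the real data, so sort those last
for year in sorted(rates, key=lambda y: (y is None, y or 0)):
    matches, total = rates[year]
    print("%s: %d/%d (%5.1f%%)" % (year, matches, total, 100.0 * matches / total))
```

With only ten records the per-year numbers carry no weight. The point of the sketch is the grouping, which could be applied to the full table to compare years with many CFPs.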