Acronym - Regular Expressions

From BITPlan cr Wiki

Experiments

Structure:

  1. Question
  2. Assumption
  3. Sample/Experiment
  4. Observation
  5. Interpretation

WikiCFP Acronyms

Input: Each WikiCFP page has a description RDFa property that uses the syntax <acronym> : <title>. E.g. http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=20605 has:

ISEM 2012 : 8th International Conference on Semantic Systems

Acronym: ISEM 2012
Title: 8th International Conference on Semantic Systems

We have scraped 81966 WikiCFP pages for the acronym and title based on this syntax.

Question: Which patterns do the acronyms follow, and with what frequency?

Assumption: The regular expression '[A-Z]+\s*[0-9]+' might fit: a sequence of upper case letters, followed by optional whitespace, followed by a sequence of digits.
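To get a feeling for what this pattern accepts, here is a minimal sketch that applies it to a few acronyms taken from this page. Note that re.match only anchors at the start of the string, so the tests below are effectively prefix tests. The re.fullmatch variant and the helper names prefix_matches and fully_matches are our own additions for comparison and are not part of the original test.

```python
import re

ACRONYM_REGEX = r'[A-Z]+\s*[0-9]+'

def prefix_matches(acronym: str) -> bool:
    """True if the acronym starts with the pattern (re.match semantics)."""
    return re.match(ACRONYM_REGEX, acronym) is not None

def fully_matches(acronym: str) -> bool:
    """True only if the whole acronym consists of the pattern (stricter, for comparison)."""
    return re.fullmatch(ACRONYM_REGEX, acronym) is not None

for acronym in ["ISEM 2012", "COLING 2008", "LabPhon 11 2008", "SE2020-SEC 2016"]:
    print("%s: prefix=%s full=%s" % (acronym, prefix_matches(acronym), fully_matches(acronym)))
```

"SE2020-SEC 2016" shows the difference: the prefix "SE2020" already satisfies re.match, although the acronym as a whole does not have the assumed shape.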

Test with samples of 5, 10, 10000, 50000 and all 81966 acronyms.

ProceedingsTitleParser/WolfgangFahl:adds test for Acronyms (Wolfgang Fahl/2020-10-31 12:18:54 +0100)

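import re

# excerpt from the test class; WikiCFP and StoreMode are imported elsewhere in the test module
def testAcronyms(self):
    '''
    test Acronyms
    '''
    wikiCFP=WikiCFP()
    em=wikiCFP.em
    # load the scraped events from the SQL cache
    sqlDB=em.getSQLDB(em.getCacheFile(em.config, StoreMode.SQL))
    acronymRecords=sqlDB.query("select acronym from event_wikicfp")
    print("total acronyms: %d" % len(acronymRecords))
    for regex in [r'[A-Z]+\s*[0-9]+']:
        # check only the first 5 records
        for acronymRecord in acronymRecords[:5]:
            acronym=acronymRecord['acronym']
            # re.match anchors at the start of the acronym only
            matches=re.match(regex,acronym)
            print("%s:%s" % ('✅' if matches else '❌' ,acronym))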

✅:COLING 2008 ✅:IJCNLP 2008 ❌:Prosody and Language Processing 2008 ✅:ICGL 2008 ✅:ICPLA 2008

That is an 80% fit for the first 5, which looks promising. Was this just by chance? Try a sample of 10:

limit=10
for regex in [r'[A-Z]+\s*[0-9]+']:
    # reset the counter for each regex so that results do not accumulate
    count=0
    for acronymRecord in acronymRecords[:limit]:
        acronym=acronymRecord['acronym']
        matches=re.match(regex,acronym)
        if matches:
            count+=1
        print("%s:%s" % ('✅' if matches else '❌' ,acronym))
    # summary line: matches / sample size (percentage)
    print("%d/%d (%5.1f%%) matches for %s" % (count,limit,count/limit*100,regex))

✅:COLING 2008 ✅:IJCNLP 2008 ❌:Prosody and Language Processing 2008 ✅:ICGL 2008 ✅:ICPLA 2008 ❌:LabPhon 11 2008 ✅:MALC 2007 ❌:KCTOS workshop 2007 ❌:Euralex 2008 2008 ❌:SCL and SPCL Cayenne 2008

5/10 ( 50.0%) matches for [A-Z]+\s*[0-9]+

Poorer luck this time? Try 10000 for a bigger sample:

7077/10000 ( 70.8%) matches for [A-Z]+\s*[0-9]+

Will this converge? Try 50000 acronyms now:

33707/50000 ( 67.4%) matches for [A-Z]+\s*[0-9]+

Try the whole sample set of 81966 next:

49728/81966 ( 60.7%) matches for [A-Z]+\s*[0-9]+
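The repeated runs with growing limits could be wrapped in a small helper. The following is only a sketch: the function name match_rate and the in-memory list sample_acronyms are assumptions for illustration (the list holds the first 10 acronyms shown above), whereas the original test reads the records from the SQL cache.

```python
import re

def match_rate(acronyms, regex, limit=None):
    """Count how many of the first `limit` acronyms start with `regex`."""
    sample = acronyms if limit is None else acronyms[:limit]
    count = sum(1 for acronym in sample if re.match(regex, acronym))
    total = len(sample)
    percent = count * 100 / total if total else 0.0
    return count, total, percent

# the first 10 acronyms as listed above (eventId order)
sample_acronyms = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]

regex = r'[A-Z]+\s*[0-9]+'
for limit in (5, 10):
    count, total, percent = match_rate(sample_acronyms, regex, limit)
    print("%d/%d (%5.1f%%) matches for %s" % (count, total, percent, regex))
```

On these 10 acronyms the sketch reproduces the 4 of 5 and 5 of 10 matches reported above.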

Discussion

Note that the samples were not random: the query returns the records in the order of the "eventId" that was used for scraping. This order may also correlate with the year of the CFP, which we extracted from the "start_date" RDFa property.
ProceedingsTitleParser/WolfgangFahl:adds queries (Wolfgang Fahl/2020-10-31 13:07:36 +0100)

order of CFPs

select eventId,acronym,year,url from event_wikicfp limit 10
eventId      acronym                               year  url
wikiCFP#158  COLING 2008                           2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=158
wikiCFP#159  IJCNLP 2008                           2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=159
wikiCFP#160  Prosody and Language Processing 2008  2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=160
wikiCFP#161  ICGL 2008                             2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=161
wikiCFP#162  ICPLA 2008                            2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=162
wikiCFP#163  LabPhon 11 2008                       2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=163
wikiCFP#164  MALC 2007                             2007  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=164
wikiCFP#165  KCTOS workshop 2007                   2007  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=165
wikiCFP#166  Euralex 2008 2008                     2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=166
wikiCFP#167  SCL and SPCL Cayenne 2008             2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=167
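Since the first n records come in eventId order, a random sample would avoid this ordering bias. The following sketch shows one possible way to do this; it was not part of the original experiment. The function name random_match_rate, the fixed seed and the in-memory list are assumptions for illustration.

```python
import random
import re

def random_match_rate(acronyms, regex, sample_size, seed=42):
    """Match rate on a random sample; a fixed seed keeps the run reproducible."""
    rng = random.Random(seed)
    sample = rng.sample(acronyms, min(sample_size, len(acronyms)))
    count = sum(1 for acronym in sample if re.match(regex, acronym))
    return count, len(sample)

# the 10 acronyms of the table above, used as a stand-in for the full record list
sample_acronyms = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]

count, total = random_match_rate(sample_acronyms, r'[A-Z]+\s*[0-9]+', 5)
print("%d/%d matches in a random sample" % (count, total))
```

If the sampling should happen in the database instead, SQLite also allows "order by random() limit n" in the select statement.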

cfps per year

select count(*) as perYear,year 
from event_wikicfp 
group by year
order by 2
perYear year
8235 None
1 35
1 1920
20 2000
5 2001
1 2002
1 2004
2 2005
462 2007
2177 2008
2541 2009
4069 2010
5363 2011
5298 2012
5153 2013
5742 2014
5721 2015
6346 2016
7355 2017
8207 2018
8006 2019
6237 2020
1014 2021
2 2022
1 2024
1 2025
3 2026
1 2081
1 2091
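Outliers such as 35, 1920 or 2081 and the records without a year can be excluded directly in the query. A minimal sketch with an in-memory SQLite database follows. The column types are assumptions, only a few rows are copied from the table above, and the rows with the eventId prefix "hypothetical" are made up to stand in for an outlier and a missing year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# assumed column types; the real table also has a url column
conn.execute("create table event_wikicfp (eventId text, acronym text, year integer)")
rows = [
    ("wikiCFP#158", "COLING 2008", 2008),
    ("wikiCFP#159", "IJCNLP 2008", 2008),
    ("wikiCFP#164", "MALC 2007", 2007),
    ("hypothetical#1", "OLD 1920", 1920),  # made-up outlier year
    ("hypothetical#2", "NOYEAR", None),    # made-up record without a year
]
conn.executemany("insert into event_wikicfp values (?,?,?)", rows)

query = """
select count(*) as perYear, year
from event_wikicfp
where year between 2007 and 2021
group by year
order by year
"""
per_year = [(row["perYear"], row["year"]) for row in conn.execute(query)]
print(per_year)
```

Records without a year drop out automatically, because a comparison with NULL is never true in SQL.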

acronym match by year 2007-2021

Only the years 2007-2021 seem to be valid years in our WikiCFP dataset. Therefore the selection is limited to this year range.
ProceedingsTitleParser/WolfgangFahl:adds test for years (Wolfgang Fahl/2020-10-31 13:19:17 +0100)

total acronyms for year 2007: 462

✅:MALC 2007 ❌:KCTOS workshop 2007 ✅:ALC 2007 ✅:IWSLT 2007 ❌:Speech and Body in Interaction 2007 ❌:Lingua Francae in the Nordic Countries 2007 ✅:MALC 2007 ✅:SIP 2007 ❌:FreeLing 2007 ✅:MESS 2007 ❌:Portsmouth Translation Conference 2007 ✅:MLS 2007 ❌:Argument Structure 2007 ❌:CoPaLaSt 2007 ❌:Conference on Language Planning 2007 ❌:OntoLex 2007 ❌:GDP III 2007 ❌:AFinLA 2007 ❌:Perspectives on Underspecification 2007 ❌:Temporalität 2007 ✅:UTASCILT 2007

298/462 ( 64.5%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2008: 2177

✅:COLING 2008 ✅:IJCNLP 2008 ❌:Prosody and Language Processing 2008 ✅:ICGL 2008 ✅:ICPLA 2008 ❌:LabPhon 11 2008 ❌:Euralex 2008 2008 ❌:SCL and SPCL Cayenne 2008 ✅:LATA 2008 ✅:GWC 2008 ❌:Syntactic Parameters 2008 ❌:Lexis-Grammar Interface 2008 ✅:CIL 2008 ❌:Silent Issues in Linguistic Theory 2008 ❌:Writing Systems 2008 ❌:Linguistic Interfaces 2008 ✅:LREC 2008 ❌:ConSOLE XVI 2008

1639/2177 ( 75.3%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2009: 2541

✅:EACL 2009 ✅:ICBB 2009 ✅:ICBB 2009 ✅:ICBB 2009 ✅:ICASSP 2009 ✅:PSCE 2009 ✅:GECCO 2009 ❌:ITiCSE 2009 ✅:POPL 2009

1843/2541 ( 72.5%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2010: 4069

✅:ICIP 2010 ❌:Micro 2010 ❌:NSS/MIC 2010 ✅:HRI 2010 ✅:IAV 2010 ✅:HRI 2010 ✅:ICIP 2010 ✅:ACC 2010 ❌:ITherm 2010 ✅:ECCV 2010 ✅:IROS 2010

2900/4069 ( 71.3%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2011: 5363

✅:SOSP 2011 ✅:TEI 2011 ❌:Sustainability Conference 2011 ❌:Book Conference 2011 ❌:Technology Conference 2011 ✅:ICASSP 2011 ✅:ICDL 2011 ❌:Mediterranean 2011 ❌:Sports 2011 ❌:Tourism 2011 ❌:Sociology 2011 ❌:Media 2011 ❌:Environment 2011 ❌:Education 2011 ❌:Psychology 2011 ❌:Fine and Performing Arts 2011 ❌:Mathematics 2011 ❌:Politics 2011 ❌:Computer 2011 ❌:Health 2011 ❌:Accounting 2011 ❌:Finance 2011 ❌:Management 2011 ❌:Marketing 2011 ❌:Literature 2011 ❌:Agriculture 2011 ❌:Law 2011 ❌:Economic 2011 ✅:SME 2011 ✅:ICDAR 2011 ✅:HRI 2011 ✅:FOGA 2011

3874/5363 ( 72.2%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2012: 5298

✅:ICPR 2012 ✅:CICC 2012 ❌:RadarCon 2012 ✅:HICSS 2012 ✅:BCFIC 2012 ✅:ACSEAC 2012 ✅:WWW 2012 ✅:INCOM 2012 ❌:PPoPP 2012 ✅:SICPRO 2012

3603/5298 ( 68.0%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2013: 5153

❌:WCM Conference 2013 ❌:AOM Annual Meeting 2013 ✅:SMS 2013 ❌:CINet 2013 ✅:POMS 2013 ❌:P&OM 2012 ❌:G-Forum 2013 ✅:RADMA 2013 ✅:IAMOT 2013 ✅:BCERC 2013 ✅:ISPIM 2013 ✅:EURAM 2013 ✅:ICSB 2013

3539/5153 ( 68.7%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2014: 5742

✅:ICASSP 2014 ✅:GLOBECOM 2014 ✅:ICDM 2014 ✅:MDA 2014 ✅:MLDM 2014 ✅:DMM 2014 ❌:CBR-MD 2014 ✅:DMA 2014 ✅:DSA 2014

3890/5742 ( 67.7%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2015: 5721

❌:Artworks 2016 ✅:FESCA 2015 ✅:GLOBECOM 2015 ✅:ACRL 2015 ✅:MES 2016 ✅:ICSE 2015 ❌:Demography and Population Studies 2015 ✅:INFOCOM 2015 ❌:Sociology 2015 ✅:ICNGCCT 2015 ❌:Philosophy 2015 ✅:ISMH 2016 ❌:Architecture 2015

3820/5721 ( 66.8%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2016: 6346

❌:EuSAR 2016 ❌:Summer School Radar/SAR 2016 ✅:ISCSO 2016 ✅:DSA 2016 ✅:SE2020-SEC 2016 ❌:CoSeRa 2016 ✅:INFOCOM 2016 ✅:CIIISI 2016 ✅:WCCI 2016 ✅:KST 2016 ❌:SMRLO'16 2016 ✅:ACC 2016

4076/6346 ( 64.2%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2017: 7355

✅:IFAC 2017 ❌:IJSRMS || August 2017 ✅:BBC 2017 ✅:ICBMS 2016 ✅:HSSEAP 2017 ✅:ECLLL 2015 ✅:ICESD 2017 ✅:ECEAP 2017 ✅:ECITNS 2017

4379/7355 ( 59.5%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2018: 8207

❌:Wind Integration Workshop 2018 ❌:Solar Integration Workshop 2018 ✅:ACECS 2018 ✅:GEEE 2018 ✅:ICASSP 2018 ✅:BEMM 2018 ✅:ICDAMT 2018 ❌:Euro Nursing 2018 2018 ✅:DS4IDS@FiCloud 2018 ✅:FIKM 2018 ✅:MATE 2018

4149/8207 ( 50.6%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2019: 8006

✅:SIPRO 2019 ❌:ModComp 2019 ✅:SEDUCE 2019 ❌:IEEE IoTCSCL 2019 ✅:WCICSS 2019 ✅:ACMME 2019 ❌:AIM@EPIA 2019 ❌:RiE 2019 ❌:EI-ICCAEE 2019 2019 ❌:EI-JCMME 2019 2019 ❌:ICESEE EI 2019 ✅:EIS 2019 ❌:FiCloud Workshops 2019 ✅:ICEEMR 2019 ✅:ICIIP 2019 ✅:VLSIA 2019

3890/8006 ( 48.6%) matches for [A-Z]+\s*[0-9]+

total acronyms for year 2020: 6237

❌:GreeNet Symposium - SGNC 2020 ✅:MEAP 2020 ✅:BDMIP 2020 ✅:G2ESD 2020 ❌:RiE 2020 ✅:EJHSS 2020 ✅:EJTNS 2020 ✅:NMOCT 2020 ✅:BIEN 2020 ✅:ICSD 2020

3049/6237 ( 48.9%) matches for [A-Z]+\s*[0-9]+
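A per-year breakdown like the one above could be computed along these lines. This is a sketch and not the code of the referenced commit: the function name match_rate_by_year and the (acronym, year) tuples are assumptions. The records are taken from the lists above, except for the last one, which is made up to show the year filter.

```python
import re
from collections import defaultdict

def match_rate_by_year(records, regex, first_year=2007, last_year=2021):
    """Return {year: (matches, total)} for records within the given year range."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for acronym, year in records:
        # skip records without a year or outside the range considered valid
        if year is None or not first_year <= year <= last_year:
            continue
        totals[year] += 1
        if re.match(regex, acronym):
            matches[year] += 1
    return {year: (matches[year], totals[year]) for year in sorted(totals)}

records = [
    ("MALC 2007", 2007),
    ("KCTOS workshop 2007", 2007),
    ("COLING 2008", 2008),
    ("LabPhon 11 2008", 2008),
    ("ICGL 2008", 2008),
    ("OLD 1920", 1920),  # hypothetical outlier, filtered out
]

for year, (count, total) in match_rate_by_year(records, r'[A-Z]+\s*[0-9]+').items():
    print("%d: %d/%d (%5.1f%%)" % (year, count, total, count * 100 / total))
```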

Year assumption

The tests so far indicate that the digit part mostly represents the year of the event's start date. Let's check this assumption by restricting the digit part to the regular expression [12][0-9]{3} and checking whether the year found in the acronym matches the year of the "start_date" RDFa property.
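Such a check might look like the following sketch. It is not the code of the test referenced below: the function name check_year and the capture group around the year are our assumptions. The sample pairs of acronym and start year are taken from the per-year lists above.

```python
import re

# a capture group around the year part lets us compare it with the start year
YEAR_REGEX = re.compile(r'[A-Z]+\s*([12][0-9]{3})')

def check_year(acronym, start_year):
    """None if the acronym does not match, else whether the years agree."""
    match = YEAR_REGEX.match(acronym)
    if match is None:
        return None
    return int(match.group(1)) == start_year

records = [
    ("COLING 2008", 2008),      # year agrees
    ("ICBMS 2016", 2017),       # acronym year differs from start year
    ("SE2020-SEC 2016", 2016),  # "2020" is picked up instead of the event year
    ("LabPhon 11 2008", 2008),  # no match at all
]

results = [check_year(acronym, year) for acronym, year in records]
matched = [result for result in results if result is not None]
different = matched.count(False)
print("%d/%d matches, %d/%d year different" % (len(matched), len(records), different, len(matched)))
```

The "SE2020-SEC 2016" case illustrates that a differing year can also be an artifact of the prefix match rather than a real mismatch in the data.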

Test

ProceedingsTitleParser/WolfgangFahl:adds test for year assumption (Wolfgang Fahl/2020-12-08 18:18:12 +0100)

Test result

43990/73731 ( 59.7%)  matches for [A-Z]+\s*[12][0-9]{3}
654/43989 (  1.5%)  year different