Acronym - Regular Expressions
Experiments
WikiCFP Acronyms
Input: Each WikiCFP page has a description RDFa property that uses the syntax <acronym>:<title>. For example, http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=20605 has:
ISEM 2012 : 8th International Conference on Semantic Systems
Acronym: ISEM 2012 Title: 8th International Conference on Semantic Systems
We scraped 81966 WikiCFP pages and extracted the acronym and title based on this syntax.
Question: What patterns do the acronyms follow with what frequency?
Assumption: The regular expression '[A-Z]+\s*[0-9]+' might fit: a sequence of upper case letters, followed by optional whitespace, followed by a sequence of digits.
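As a minimal sketch of what this assumption means in practice: Python's re.match only anchors the pattern at the start of the string, so an acronym counts as a hit as soon as a prefix fits. The helper names prefix_fits and whole_fits are hypothetical; the sample acronyms are taken from the WikiCFP records shown on this page. re.fullmatch is included only as a stricter alternative for comparison and is not what the experiments use.

```python
import re

# the assumed acronym pattern: upper case letters, optional whitespace, digits
ACRONYM_REGEX = r'[A-Z]+\s*[0-9]+'

def prefix_fits(acronym):
    """True if the start of the string fits the pattern (re.match semantics)."""
    return re.match(ACRONYM_REGEX, acronym) is not None

def whole_fits(acronym):
    """True only if the complete string fits the pattern (re.fullmatch semantics)."""
    return re.fullmatch(ACRONYM_REGEX, acronym) is not None

for acronym in ["COLING 2008", "LabPhon 11 2008", "GDP III 2007", "SE2020-SEC 2016"]:
    print("%s: prefix=%s whole=%s" % (acronym, prefix_fits(acronym), whole_fits(acronym)))
```

With re.match, "SE2020-SEC 2016" counts as a hit because the prefix "SE2020" fits, although the whole string does not. Match rates computed with re.match can therefore only be equal to or higher than rates for a strict whole-string match.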
Test with samples of 5, 10, 10.000 and 50.000 acronyms, then with all 81.966.
ProceedingsTitleParser/WolfgangFahl:adds test for Acronyms (Wolfgang Fahl/2020-10-31 12:18:54 +0100)
def testAcronyms(self):
    '''
    test Acronyms
    '''
    wikiCFP=WikiCFP()
    em=wikiCFP.em
    sqlDB=em.getSQLDB(em.getCacheFile(em.config, StoreMode.SQL))
    acronymRecords=sqlDB.query("select acronym from event_wikicfp")
    print ("total acronyms: %d" % len(acronymRecords))
    for regex in [r'[A-Z]+\s*[0-9]+']:
        for acronymRecord in acronymRecords[:5]:
            acronym=acronymRecord['acronym']
            matches=re.match(regex,acronym)
            print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
✅:COLING 2008 ✅:IJCNLP 2008 ❌:Prosody and Language Processing 2008 ✅:ICGL 2008 ✅:ICPLA 2008
That is an 80% fit for the first 5. This looks promising, but was it just by chance? Try a larger sample of 10:
limit=10
count=0
for regex in [r'[A-Z]+\s*[0-9]+']:
    for acronymRecord in acronymRecords[:limit]:
        acronym=acronymRecord['acronym']
        matches=re.match(regex,acronym)
        if matches:
            count+=1
        print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
    print("%d/%d (%5.1f%%) matches for %s" % (count,limit,count/limit*100,regex))
✅:COLING 2008 ✅:IJCNLP 2008 ❌:Prosody and Language Processing 2008 ✅:ICGL 2008 ✅:ICPLA 2008 ❌:LabPhon 11 2008 ✅:MALC 2007 ❌:KCTOS workshop 2007 ❌:Euralex 2008 2008 ❌:SCL and SPCL Cayenne 2008
5/10 ( 50.0%) matches for [A-Z]+\s*[0-9]+
Poorer luck this time? Try 10.000 for a bigger sample:
7077/10000 ( 70.8%) matches for [A-Z]+\s*[0-9]+
Will this converge? Try 50.000 acronyms now:
33707/50000 ( 67.4%) matches for [A-Z]+\s*[0-9]+
Try the whole sample set of 81.966 next:
49728/81966 ( 60.7%) matches for [A-Z]+\s*[0-9]+
Discussion
Note that the samples were not random: the query returns records in the order of the "eventId" that was used for scraping. This order might also be correlated with the year of the CFP, which we extracted from the "start_date" RDFa property.
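One way to avoid this ordering effect would be to draw the sample at random instead of taking the first n records. The following is only a sketch under assumptions: match_rate is a hypothetical helper, the fixed seed is there for reproducibility, and the short list stands in for the full acronym column (it holds the first ten records shown on this page).

```python
import random
import re

def match_rate(acronyms, regex, sample_size, seed=42):
    """Share of a random sample of acronyms whose start fits regex."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    # sample without replacement, capped at the number of available acronyms
    sample = rng.sample(acronyms, min(sample_size, len(acronyms)))
    if not sample:
        return 0.0
    hits = sum(1 for acronym in sample if re.match(regex, acronym))
    return hits / len(sample)

# stand-in for the result of "select acronym from event_wikicfp"
acronyms = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]
rate = match_rate(acronyms, r'[A-Z]+\s*[0-9]+', sample_size=10)
print("%5.1f%% matches in random sample" % (rate * 100))
```

If the underlying store is SQLite, the sampling could alternatively be pushed into the query with "order by random() limit n"; whether that fits the actual setup is an assumption.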
ProceedingsTitleParser/WolfgangFahl:adds queries (Wolfgang Fahl/2020-10-31 13:07:36 +0100)
order of CFPs
select eventId,acronym,year,url from event_wikicfp limit 10
eventId | acronym | year | url |
---|---|---|---|
wikiCFP#158 | COLING 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=158 |
wikiCFP#159 | IJCNLP 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=159 |
wikiCFP#160 | Prosody and Language Processing 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=160 |
wikiCFP#161 | ICGL 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=161 |
wikiCFP#162 | ICPLA 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=162 |
wikiCFP#163 | LabPhon 11 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=163 |
wikiCFP#164 | MALC 2007 | 2007 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=164 |
wikiCFP#165 | KCTOS workshop 2007 | 2007 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=165 |
wikiCFP#166 | Euralex 2008 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=166 |
wikiCFP#167 | SCL and SPCL Cayenne 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=167 |
cfps per year
select count(*) as perYear,year
from event_wikicfp
group by year
order by 2
perYear | year |
---|---|
8235 | None |
1 | 35 |
1 | 1920 |
20 | 2000 |
5 | 2001 |
1 | 2002 |
1 | 2004 |
2 | 2005 |
462 | 2007 |
2177 | 2008 |
2541 | 2009 |
4069 | 2010 |
5363 | 2011 |
5298 | 2012 |
5153 | 2013 |
5742 | 2014 |
5721 | 2015 |
6346 | 2016 |
7355 | 2017 |
8207 | 2018 |
8006 | 2019 |
6237 | 2020 |
1014 | 2021 |
2 | 2022 |
1 | 2024 |
1 | 2025 |
3 | 2026 |
1 | 2081 |
1 | 2091 |
acronym match by year 2007-2021
ProceedingsTitleParser/WolfgangFahl:adds test for years (Wolfgang Fahl/2020-10-31 13:19:17 +0100)
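A per-year breakdown of this kind could be computed roughly as follows. This is a sketch, not the code from the commit referenced above: it uses an in-memory sqlite3 table as a stand-in for event_wikicfp, filled with a handful of records from this page, and match_rate_by_year is a hypothetical helper.

```python
import re
import sqlite3

REGEX = r'[A-Z]+\s*[0-9]+'

# in-memory stand-in for the event_wikicfp table with a few records
conn = sqlite3.connect(":memory:")
conn.execute("create table event_wikicfp(acronym text, year integer)")
conn.executemany("insert into event_wikicfp values (?,?)", [
    ("MALC 2007", 2007), ("KCTOS workshop 2007", 2007),
    ("COLING 2008", 2008), ("IJCNLP 2008", 2008),
    ("Prosody and Language Processing 2008", 2008), ("ICGL 2008", 2008),
])

def match_rate_by_year(conn, regex, from_year, to_year):
    """Return {year: (matches, total)} for the inclusive year range."""
    result = {}
    for year in range(from_year, to_year + 1):
        rows = conn.execute(
            "select acronym from event_wikicfp where year=?", (year,)).fetchall()
        matches = sum(1 for (acronym,) in rows if re.match(regex, acronym))
        result[year] = (matches, len(rows))
    return result

for year, (matches, total) in match_rate_by_year(conn, REGEX, 2007, 2008).items():
    print("total acronyms for year %d: %d" % (year, total))
    # both sample years have records, so total is never zero here
    print("%d/%d (%5.1f%%) matches for %s" % (matches, total, matches / total * 100, REGEX))
```

Restricting the loop to a range of years keeps outliers such as the records without a year or with implausible years out of the comparison.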
total acronyms for year 2007: 462
✅:MALC 2007 ❌:KCTOS workshop 2007 ✅:ALC 2007 ✅:IWSLT 2007 ❌:Speech and Body in Interaction 2007 ❌:Lingua Francae in the Nordic Countries 2007 ✅:MALC 2007 ✅:SIP 2007 ❌:FreeLing 2007 ✅:MESS 2007 ❌:Portsmouth Translation Conference 2007 ✅:MLS 2007 ❌:Argument Structure 2007 ❌:CoPaLaSt 2007 ❌:Conference on Language Planning 2007 ❌:OntoLex 2007 ❌:GDP III 2007 ❌:AFinLA 2007 ❌:Perspectives on Underspecification 2007 ❌:Temporalität 2007 ✅:UTASCILT 2007
298/462 ( 64.5%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2008: 2177
✅:COLING 2008 ✅:IJCNLP 2008 ❌:Prosody and Language Processing 2008 ✅:ICGL 2008 ✅:ICPLA 2008 ❌:LabPhon 11 2008 ❌:Euralex 2008 2008 ❌:SCL and SPCL Cayenne 2008 ✅:LATA 2008 ✅:GWC 2008 ❌:Syntactic Parameters 2008 ❌:Lexis-Grammar Interface 2008 ✅:CIL 2008 ❌:Silent Issues in Linguistic Theory 2008 ❌:Writing Systems 2008 ❌:Linguistic Interfaces 2008 ✅:LREC 2008 ❌:ConSOLE XVI 2008
1639/2177 ( 75.3%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2009: 2541
✅:EACL 2009 ✅:ICBB 2009 ✅:ICBB 2009 ✅:ICBB 2009 ✅:ICASSP 2009 ✅:PSCE 2009 ✅:GECCO 2009 ❌:ITiCSE 2009 ✅:POPL 2009
1843/2541 ( 72.5%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2010: 4069
✅:ICIP 2010 ❌:Micro 2010 ❌:NSS/MIC 2010 ✅:HRI 2010 ✅:IAV 2010 ✅:HRI 2010 ✅:ICIP 2010 ✅:ACC 2010 ❌:ITherm 2010 ✅:ECCV 2010 ✅:IROS 2010
2900/4069 ( 71.3%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2011: 5363
✅:SOSP 2011 ✅:TEI 2011 ❌:Sustainability Conference 2011 ❌:Book Conference 2011 ❌:Technology Conference 2011 ✅:ICASSP 2011 ✅:ICDL 2011 ❌:Mediterranean 2011 ❌:Sports 2011 ❌:Tourism 2011 ❌:Sociology 2011 ❌:Media 2011 ❌:Environment 2011 ❌:Education 2011 ❌:Psychology 2011 ❌:Fine and Performing Arts 2011 ❌:Mathematics 2011 ❌:Politics 2011 ❌:Computer 2011 ❌:Health 2011 ❌:Accounting 2011 ❌:Finance 2011 ❌:Management 2011 ❌:Marketing 2011 ❌:Literature 2011 ❌:Agriculture 2011 ❌:Law 2011 ❌:Economic 2011 ✅:SME 2011 ✅:ICDAR 2011 ✅:HRI 2011 ✅:FOGA 2011
3874/5363 ( 72.2%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2012: 5298
✅:ICPR 2012 ✅:CICC 2012 ❌:RadarCon 2012 ✅:HICSS 2012 ✅:BCFIC 2012 ✅:ACSEAC 2012 ✅:WWW 2012 ✅:INCOM 2012 ❌:PPoPP 2012 ✅:SICPRO 2012
3603/5298 ( 68.0%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2013: 5153
❌:WCM Conference 2013 ❌:AOM Annual Meeting 2013 ✅:SMS 2013 ❌:CINet 2013 ✅:POMS 2013 ❌:P&OM 2012 ❌:G-Forum 2013 ✅:RADMA 2013 ✅:IAMOT 2013 ✅:BCERC 2013 ✅:ISPIM 2013 ✅:EURAM 2013 ✅:ICSB 2013
3539/5153 ( 68.7%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2014: 5742
✅:ICASSP 2014 ✅:GLOBECOM 2014 ✅:ICDM 2014 ✅:MDA 2014 ✅:MLDM 2014 ✅:DMM 2014 ❌:CBR-MD 2014 ✅:DMA 2014 ✅:DSA 2014
3890/5742 ( 67.7%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2015: 5721
❌:Artworks 2016 ✅:FESCA 2015 ✅:GLOBECOM 2015 ✅:ACRL 2015 ✅:MES 2016 ✅:ICSE 2015 ❌:Demography and Population Studies 2015 ✅:INFOCOM 2015 ❌:Sociology 2015 ✅:ICNGCCT 2015 ❌:Philosophy 2015 ✅:ISMH 2016 ❌:Architecture 2015
3820/5721 ( 66.8%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2016: 6346
❌:EuSAR 2016 ❌:Summer School Radar/SAR 2016 ✅:ISCSO 2016 ✅:DSA 2016 ✅:SE2020-SEC 2016 ❌:CoSeRa 2016 ✅:INFOCOM 2016 ✅:CIIISI 2016 ✅:WCCI 2016 ✅:KST 2016 ❌:SMRLO'16 2016 ✅:ACC 2016
4076/6346 ( 64.2%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2017: 7355
✅:IFAC 2017 ❌:IJSRMS || August 2017 ✅:BBC 2017 ✅:ICBMS 2016 ✅:HSSEAP 2017 ✅:ECLLL 2015 ✅:ICESD 2017 ✅:ECEAP 2017 ✅:ECITNS 2017
4379/7355 ( 59.5%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2018: 8207
❌:Wind Integration Workshop 2018 ❌:Solar Integration Workshop 2018 ✅:ACECS 2018 ✅:GEEE 2018 ✅:ICASSP 2018 ✅:BEMM 2018 ✅:ICDAMT 2018 ❌:Euro Nursing 2018 2018 ✅:DS4IDS@FiCloud 2018 ✅:FIKM 2018 ✅:MATE 2018
4149/8207 ( 50.6%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2019: 8006
✅:SIPRO 2019 ❌:ModComp 2019 ✅:SEDUCE 2019 ❌:IEEE IoTCSCL 2019 ✅:WCICSS 2019 ✅:ACMME 2019 ❌:AIM@EPIA 2019 ❌:RiE 2019 ❌:EI-ICCAEE 2019 2019 ❌:EI-JCMME 2019 2019 ❌:ICESEE EI 2019 ✅:EIS 2019 ❌:FiCloud Workshops 2019 ✅:ICEEMR 2019 ✅:ICIIP 2019 ✅:VLSIA 2019
3890/8006 ( 48.6%) matches for [A-Z]+\s*[0-9]+
total acronyms for year 2020: 6237
❌:GreeNet Symposium - SGNC 2020 ✅:MEAP 2020 ✅:BDMIP 2020 ✅:G2ESD 2020 ❌:RiE 2020 ✅:EJHSS 2020 ✅:EJTNS 2020 ✅:NMOCT 2020 ✅:BIEN 2020 ✅:ICSD 2020
3049/6237 ( 48.9%) matches for [A-Z]+\s*[0-9]+