Difference between revisions of "Acronym - Regular Expressions"
Revision as of 13:08, 31 October 2020
Experiments
WikiCFP Acronyms
Input: Each WikiCFP page has a description RDFa property that uses the syntax <acronym>:<title>. E.g. http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=20605 has:
ISEM 2012 : 8th International Conference on Semantic Systems
Acronym: ISEM 2012 Title: 8th International Conference on Semantic Systems
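The split of a description into acronym and title can be sketched as follows. This is a minimal illustration, not the scraper's actual code: the helper name split_description is hypothetical, and splitting at the first colon is an assumption that breaks if an acronym itself contains a colon.

```python
def split_description(description):
    """Split '<acronym> : <title>' at the first colon (assumed syntax)."""
    acronym, sep, title = description.partition(":")
    if not sep:
        # No colon found: the description does not follow the expected syntax.
        return None, description.strip()
    return acronym.strip(), title.strip()


acronym, title = split_description(
    "ISEM 2012 : 8th International Conference on Semantic Systems"
)
print("Acronym: %s Title: %s" % (acronym, title))
```

Because partition splits only at the first colon, any further colons stay in the title.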
We scraped 81966 WikiCFP pages for the acronym and title based on this syntax.
Question: What patterns do the acronyms follow with what frequency?
Assumption: The regular expression '[A-Z]+\s*[0-9]+' might fit: a sequence of upper case letters, followed by optional whitespace, followed by a sequence of digits.
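One detail of this assumption is worth checking before testing: Python's re.match only anchors at the start of the string, so anything after the digits is ignored, while re.fullmatch requires the whole acronym to fit. The sketch below contrasts the two. The function names are hypothetical, and "ABC 2020 Workshop" and "COLING2008" are made-up examples, not records from the data set.

```python
import re

ACRONYM_REGEX = r'[A-Z]+\s*[0-9]+'


def starts_with_pattern(acronym):
    """True if the acronym begins with the pattern (re.match semantics)."""
    return re.match(ACRONYM_REGEX, acronym) is not None


def is_exactly_pattern(acronym):
    """True only if the whole acronym fits the pattern (re.fullmatch)."""
    return re.fullmatch(ACRONYM_REGEX, acronym) is not None


# The last two strings are hypothetical examples for illustration only.
for acronym in ["COLING 2008", "LabPhon 11 2008", "ABC 2020 Workshop", "COLING2008"]:
    print("%-20s match=%s fullmatch=%s" % (
        acronym, starts_with_pattern(acronym), is_exactly_pattern(acronym)))
```

The experiments on this page use re.match, so their counts include acronyms with trailing text after the digits.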
Test with Sample of 5
ProceedingsTitleParser/WolfgangFahl:adds test for Acronyms (Wolfgang Fahl/2020-10-31 12:18:54 +0100)
def testAcronyms(self):
    '''
    test Acronyms
    '''
    wikiCFP=WikiCFP()
    em=wikiCFP.em
    sqlDB=em.getSQLDB(em.getCacheFile(em.config, StoreMode.SQL))
    acronymRecords=sqlDB.query("select acronym from event_wikicfp")
    print ("total acronyms: %d" % len(acronymRecords))
    for regex in [r'[A-Z]+\s*[0-9]+']:
        for acronymRecord in acronymRecords[:5]:
            acronym=acronymRecord['acronym']
            matches=re.match(regex,acronym)
            print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
That is an 80% fit for the first 5. Looks promising ... Was this just by chance? Try out more (10)
limit=10
count=0
for regex in [r'[A-Z]+\s*[0-9]+']:
    for acronymRecord in acronymRecords[:limit]:
        acronym=acronymRecord['acronym']
        matches=re.match(regex,acronym)
        if matches:
            count+=1
        print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
    print("%d/%d (%5.1f%%) matches for %s" % (count,limit,count/limit*100,regex))
✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
❌:LabPhon 11 2008
✅:MALC 2007
❌:KCTOS workshop 2007
❌:Euralex 2008 2008
❌:SCL and SPCL Cayenne 2008
5/10 ( 50.0%) matches for [A-Z]+\s*[0-9]+
Worse luck this time? Try a bigger sample of 10,000:
7077/10000 ( 70.8%) matches for [A-Z]+\s*[0-9]+
Will this converge? Try 50,000 acronyms now:
33707/50000 ( 67.4%) matches for [A-Z]+\s*[0-9]+
Try the whole set of 81,966 next:
49728/81966 ( 60.7%) matches for [A-Z]+\s*[0-9]+
Discussion
Note that the samples were not random: the query returns records in the order of the "eventId" that was initially used for scraping, so the rates for the partial samples need not be representative of the whole set. This order might also be correlated with the year of the CFP, which we extracted from the "start_date" RDFa property.
ProceedingsTitleParser/WolfgangFahl:adds queries (Wolfgang Fahl/2020-10-31 13:07:36 +0100)
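Since the samples follow eventId order, one way to reduce the ordering effect would be to draw a random sample before matching. The following is only a sketch under that assumption and is not part of the ProceedingsTitleParser code: the function match_rate and the fixed seed are hypothetical, and the short list merely reuses the first ten acronyms shown on this page in place of the full query result.

```python
import random
import re


def match_rate(acronyms, regex, sample_size, seed=42):
    """Return (matches, sample size) for a random sample of the acronyms."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    sample = rng.sample(acronyms, min(sample_size, len(acronyms)))
    count = sum(1 for acronym in sample if re.match(regex, acronym))
    return count, len(sample)


# Stand-in for the acronyms returned by the database query.
acronyms = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]
regex = r'[A-Z]+\s*[0-9]+'
count, size = match_rate(acronyms, regex, 5)
print("%d/%d (%5.1f%%) matches for %s" % (count, size, count / size * 100, regex))
```

Repeating the draw with different seeds would show how much the rate varies for a given sample size.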
order of CFPs
select eventId,acronym,year,url from event_wikicfp limit 10
eventId | acronym | year | url |
---|---|---|---|
wikiCFP#158 | COLING 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=158 |
wikiCFP#159 | IJCNLP 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=159 |
wikiCFP#160 | Prosody and Language Processing 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=160 |
wikiCFP#161 | ICGL 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=161 |
wikiCFP#162 | ICPLA 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=162 |
wikiCFP#163 | LabPhon 11 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=163 |
wikiCFP#164 | MALC 2007 | 2007 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=164 |
wikiCFP#165 | KCTOS workshop 2007 | 2007 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=165 |
wikiCFP#166 | Euralex 2008 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=166 |
wikiCFP#167 | SCL and SPCL Cayenne 2008 | 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=167 |
cfps per year
select count(*) as perYear,year
from event_wikicfp
group by year
order by 2
perYear | year |
---|---|
8235 | None |
1 | 35 |
1 | 1920 |
20 | 2000 |
5 | 2001 |
1 | 2002 |
1 | 2004 |
2 | 2005 |
462 | 2007 |
2177 | 2008 |
2541 | 2009 |
4069 | 2010 |
5363 | 2011 |
5298 | 2012 |
5153 | 2013 |
5742 | 2014 |
5721 | 2015 |
6346 | 2016 |
7355 | 2017 |
8207 | 2018 |
8006 | 2019 |
6237 | 2020 |
1014 | 2021 |
2 | 2022 |
1 | 2024 |
1 | 2025 |
3 | 2026 |
1 | 2081 |
1 | 2091 |
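To look into the suspected correlation with the year, the match rate could be grouped by year in the same way as the counts above. The sketch below is an assumption about how that might be done, not the project's code: it uses an in-memory SQLite table holding only the ten records listed under "order of CFPs" (with just the acronym and year columns), and "matches" is a hypothetical SQL function registered from Python, not a built-in of SQLite.

```python
import re
import sqlite3

REGEX = r'[A-Z]+\s*[0-9]+'

# The ten records shown in the "order of CFPs" table (acronym, year).
ROWS = [
    ("COLING 2008", 2008), ("IJCNLP 2008", 2008),
    ("Prosody and Language Processing 2008", 2008), ("ICGL 2008", 2008),
    ("ICPLA 2008", 2008), ("LabPhon 11 2008", 2008), ("MALC 2007", 2007),
    ("KCTOS workshop 2007", 2007), ("Euralex 2008 2008", 2008),
    ("SCL and SPCL Cayenne 2008", 2008),
]


def match_rate_per_year(rows, regex=REGEX):
    """Return a list of (year, matched, total) tuples ordered by year."""
    conn = sqlite3.connect(":memory:")
    # Register a hypothetical helper so that the regex can be used inside SQL.
    conn.create_function(
        "matches", 1,
        lambda acronym: 1 if acronym is not None and re.match(regex, acronym) else 0)
    conn.execute("create table event_wikicfp (acronym text, year integer)")
    conn.executemany("insert into event_wikicfp values (?, ?)", rows)
    query = """select year, sum(matches(acronym)) as matched, count(*) as total
               from event_wikicfp
               group by year
               order by year"""
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for year, matched, total in match_rate_per_year(ROWS):
    print("%s: %d/%d (%5.1f%%)" % (year, matched, total, matched / total * 100))
```

For these ten records the totals add up to the 5/10 reported above; a conclusion about the year would need the full table.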