Difference between revisions of "Acronym - Regular Expressions"
Revision as of 12:52, 31 October 2020
Experiments
WikiCFP Acronyms
Input: Each WikiCFP page has a description RDFa property that uses the syntax <acronym>:<title>. For example, http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=20605 has:
ISEM 2012 : 8th International Conference on Semantic Systems
Acronym: ISEM 2012
Title: 8th International Conference on Semantic Systems
We have scraped 81966 WikiCFP pages for the acronym and title based on this syntax.
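The split into acronym and title can be sketched as follows. This is a minimal illustration, not the scraper's actual code: the function name `split_description` is hypothetical, and it assumes the separator is a colon surrounded by spaces, as in the ISEM 2012 example.

```python
def split_description(description):
    """Split '<acronym> : <title>' at the first ' : ' separator.

    Hypothetical helper for illustration; returns (None, text)
    when no separator is present.
    """
    acronym, sep, title = description.partition(" : ")
    if not sep:
        # No separator found: treat the whole text as the title
        return None, description.strip()
    return acronym.strip(), title.strip()


acronym, title = split_description(
    "ISEM 2012 : 8th International Conference on Semantic Systems"
)
print("Acronym:", acronym)
print("Title:", title)
```

Splitting at the first separator only keeps any later colons inside the title intact.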
Question: Which patterns do the acronyms follow, and with what frequency?
Assumption: The regular expression '[A-Z]+\s*[0-9]+' might fit.
Test with Sample of 5
ProceedingsTitleParser/WolfgangFahl:adds test for Acronyms (Wolfgang Fahl/2020-10-31 12:18:54 +0100)
import re  # needed for re.match below

def testAcronyms(self):
    '''
    test Acronyms
    '''
    wikiCFP=WikiCFP()
    em=wikiCFP.em
    # read all acronyms from the SQL cache of the scraped events
    sqlDB=em.getSQLDB(em.getCacheFile(em.config, StoreMode.SQL))
    acronymRecords=sqlDB.query("select acronym from event_wikicfp")
    print ("total acronyms: %d" % len(acronymRecords))
    for regex in [r'[A-Z]+\s*[0-9]+']:
        # check only the first 5 records
        for acronymRecord in acronymRecords[:5]:
            acronym=acronymRecord['acronym']
            # re.match anchors at the start of the string only
            matches=re.match(regex,acronym)
            print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
That is an 80% fit (4 of 5) for the first 5 records. This looks promising, but was it just chance? Try a larger sample of 10:
limit=10
for regex in [r'[A-Z]+\s*[0-9]+']:
    # reset the counter per regex so that rates stay correct with several patterns
    count=0
    for acronymRecord in acronymRecords[:limit]:
        acronym=acronymRecord['acronym']
        matches=re.match(regex,acronym)
        if matches:
            count+=1
        print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
    # summary: matches / sample size (percentage) for this regex
    print("%d/%d (%5.1f%%) matches for %s" % (count,limit,count/limit*100,regex))
✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
❌:LabPhon 11 2008
✅:MALC 2007
❌:KCTOS workshop 2007
❌:Euralex 2008 2008
❌:SCL and SPCL Cayenne 2008
5/10 ( 50.0%) matches for [A-Z]+\s*[0-9]+
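To see why some acronyms fail, it helps to recall that `re.match` only anchors at the start of the string, so trailing text after the digits is ignored, while `re.fullmatch` requires the whole string to fit. The sketch below compares both for a few acronyms. The helper name `classify` and the sample "ICGL 2008 Workshop" are made up for illustration and do not come from the WikiCFP data.

```python
import re

ACRONYM_REGEX = r'[A-Z]+\s*[0-9]+'


def classify(acronym, regex=ACRONYM_REGEX):
    """Return (prefix_match, full_match) for the given acronym."""
    prefix_match = re.match(regex, acronym) is not None
    full_match = re.fullmatch(regex, acronym) is not None
    return prefix_match, full_match


samples = [
    "COLING 2008",          # upper-case letters followed by digits
    "LabPhon 11 2008",      # mixed case: fails at the second character
    "KCTOS workshop 2007",  # a lower-case word sits between acronym and year
    "ICGL 2008 Workshop",   # hypothetical: prefix fits, trailing text does not
]
for acronym in samples:
    prefix_match, full_match = classify(acronym)
    print("%-22s match=%s fullmatch=%s" % (acronym, prefix_match, full_match))
```

The failures in the sample of 10 come from mixed-case names and extra words, which suggests that additional patterns would be needed to cover them.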
Worse luck this time? Try 10,000 for a bigger sample:
7077/10000 ( 70.8%) matches for [A-Z]+\s*[0-9]+
Will this converge? Try 50,000 acronyms now:
33707/50000 ( 67.4%) matches for [A-Z]+\s*[0-9]+
Try the whole set of 81,966 acronyms next:
49728/81966 ( 60.7%) matches for [A-Z]+\s*[0-9]+
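Instead of rerunning the test once per sample size, the match rates for several prefix lengths can be collected in a single pass. The following is a sketch under the assumption that the acronyms are available as a plain list of strings; the function name `match_rates` is hypothetical, and the ten acronyms are the ones shown earlier.

```python
import re


def match_rates(acronyms, regex, checkpoints):
    """Return (n, count, percent) for the first n acronyms at each checkpoint."""
    pattern = re.compile(regex)
    wanted = set(checkpoints)
    results = []
    count = 0
    for n, acronym in enumerate(acronyms, start=1):
        if pattern.match(acronym):
            count += 1
        if n in wanted:
            results.append((n, count, count / n * 100))
    return results


first_ten = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]
regex = r'[A-Z]+\s*[0-9]+'
for n, count, percent in match_rates(first_ten, regex, [5, 10]):
    print("%d/%d (%5.1f%%) matches for %s" % (count, n, percent, regex))
```

For the ten acronyms listed here this reproduces the 80% and 50% figures; with the full record list, checkpoints such as 10000, 50000 and 81966 would give the larger samples.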
Discussion
Note that the samples were not random: the query returns the records in the order of the "eventId" that was originally used for scraping, so each sample consists of the first n records in that order.
select eventId,acronym,url from event_wikicfp limit 10
eventId | acronym | url |
---|---|---|
wikiCFP#158 | COLING 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=158 |
wikiCFP#159 | IJCNLP 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=159 |
wikiCFP#160 | Prosody and Language Processing 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=160 |
wikiCFP#161 | ICGL 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=161 |
wikiCFP#162 | ICPLA 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=162 |
wikiCFP#163 | LabPhon 11 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=163 |
wikiCFP#164 | MALC 2007 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=164 |
wikiCFP#165 | KCTOS workshop 2007 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=165 |
wikiCFP#166 | Euralex 2008 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=166 |
wikiCFP#167 | SCL and SPCL Cayenne 2008 | http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=167 |
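Since the samples above are always the first n records in "eventId" order, one way to check whether that ordering biases the match rate would be to draw a random sample instead. The sketch below assumes the acronyms are held in a Python list; the function name `random_sample_rate` and the seed value are illustrative choices, not part of the original test.

```python
import random
import re


def random_sample_rate(acronyms, regex, size, seed=42):
    """Return (count, size) of regex matches in a random sample of the acronyms.

    A fixed seed keeps the sample reproducible between runs.
    """
    rng = random.Random(seed)
    sample = rng.sample(acronyms, size)
    pattern = re.compile(regex)
    count = sum(1 for acronym in sample if pattern.match(acronym))
    return count, size


acronyms = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]
regex = r'[A-Z]+\s*[0-9]+'
count, size = random_sample_rate(acronyms, regex, 5)
print("%d/%d (%5.1f%%) matches for %s" % (count, size, count / size * 100, regex))
```

Repeating this with different seeds on the full record list would show how much the rate varies between random samples of the same size.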