Acronym - Regular Expressions
Revision as of 11:09, 31 October 2020
Experiments
WikiCFP Acronyms
Input: Each WikiCFP page has a description RDFa property that uses the syntax <acronym>:<title>. E.g. http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=20605 has:
ISEM 2012 : 8th International Conference on Semantic Systems
Acronym: ISEM 2012
Title: 8th International Conference on Semantic Systems
We have scraped 81966 WikiCFP pages for the acronym and title based on this syntax.
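The split into acronym and title can be sketched as follows. This is only a minimal illustration, assuming the description is split at the first colon as in the ISEM example; the function name `split_description` is hypothetical and not part of the scraper.

```python
def split_description(description):
    """Split '<acronym> : <title>' at the first colon (assumption)."""
    acronym, sep, title = description.partition(":")
    if not sep:
        # no colon found: treat the whole text as title, without an acronym
        return None, description.strip()
    return acronym.strip(), title.strip()


acronym, title = split_description(
    "ISEM 2012 : 8th International Conference on Semantic Systems"
)
print("Acronym: %s" % acronym)
print("Title: %s" % title)
```

Splitting at the first colon keeps any further colons inside the title; an acronym that itself contains a colon would be cut short under this assumption.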
Question: What patterns do the acronyms follow with what frequency?
Assumption: The regular expression '[A-Z]+\s*[0-9]+' might fit
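One detail worth keeping in mind when testing this assumption: Python's `re.match` only anchors at the start of the string, while `re.fullmatch` requires the whole string to fit. The short sketch below shows the difference; the input "ICGL 2008 extra" is a made-up string for illustration and is not taken from the scraped data.

```python
import re

regex = r'[A-Z]+\s*[0-9]+'

# made-up example input: a valid prefix followed by trailing text
text = "ICGL 2008 extra"

# re.match accepts the string because the pattern fits at the start
prefix_ok = bool(re.match(regex, text))
# re.fullmatch rejects it because the trailing text is not covered
full_ok = bool(re.fullmatch(regex, text))

print("match: %s fullmatch: %s" % (prefix_ok, full_ok))
```

With `re.match`, the reported fit is therefore an upper bound on how many acronyms consist of nothing but capital letters followed by a number.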
Test with Sample of 5
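import re
# WikiCFP and StoreMode come from the project's own modules (imports not shown here)

def testAcronyms(self):
    '''
    test Acronyms
    '''
    wikiCFP=WikiCFP()
    em=wikiCFP.em
    # open the SQL cache that holds the scraped WikiCFP events
    sqlDB=em.getSQLDB(em.getCacheFile(em.config, StoreMode.SQL))
    acronymRecords=sqlDB.query("select acronym from event_wikicfp")
    print ("total acronyms: %d" % len(acronymRecords))
    for regex in [r'[A-Z]+\s*[0-9]+']:
        # check only the first 5 records
        for acronymRecord in acronymRecords[:5]:
            acronym=acronymRecord['acronym']
            # re.match anchors at the start of the string only
            matches=re.match(regex,acronym)
            print ("%s:%s" % ('✅' if matches else '❌' ,acronym))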
✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
That is an 80% fit (4 of 5) for the first 5. Looks promising ... Was this just by chance? Try a larger sample of 10:
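# continues inside testAcronyms, reusing acronymRecords from above
limit=10
for regex in [r'[A-Z]+\s*[0-9]+']:
    # reset per regex so counts do not accumulate across patterns
    count=0
    for acronymRecord in acronymRecords[:limit]:
        acronym=acronymRecord['acronym']
        matches=re.match(regex,acronym)
        if matches:
            count+=1
        print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
    # summary: matches, sample size, and percentage for this regex
    print("%d/%d (%5.1f%%) matches for %s" % (count,limit,count/limit*100,regex))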
✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
❌:LabPhon 11 2008
✅:MALC 2007
❌:KCTOS workshop 2007
❌:Euralex 2008 2008
❌:SCL and SPCL Cayenne 2008
5/10 ( 50.0%) matches for [A-Z]+\s*[0-9]+
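One possible way to approach the frequency question without guessing regular expressions one by one is to reduce each acronym to a coarse "shape" and count the shapes. The sketch below is an assumption about how that could be done, not the method used in this experiment: the `shape` helper and its A/a/9 notation are hypothetical, and the input is just the ten sample acronyms listed above rather than the full database.

```python
import re
from collections import Counter

# the ten sample acronyms printed above
samples = [
    "COLING 2008",
    "IJCNLP 2008",
    "Prosody and Language Processing 2008",
    "ICGL 2008",
    "ICPLA 2008",
    "LabPhon 11 2008",
    "MALC 2007",
    "KCTOS workshop 2007",
    "Euralex 2008 2008",
    "SCL and SPCL Cayenne 2008",
]


def shape(acronym):
    """Abstract an acronym: uppercase runs -> 'A', lowercase runs -> 'a', digit runs -> '9'."""
    s = re.sub(r'[A-Z]+', 'A', acronym)
    s = re.sub(r'[a-z]+', 'a', s)
    s = re.sub(r'[0-9]+', '9', s)
    return s


# count how often each shape occurs and list the most frequent first
frequencies = Counter(shape(a) for a in samples)
for pattern, count in frequencies.most_common():
    print("%2d x %s" % (count, pattern))
```

On this sample the shape "A 9" covers the same five acronyms that the regular expression matched; the remaining shapes hint at which further patterns might be worth testing against the full set of records.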