Difference between revisions of "Acronym - Regular Expressions"

From BITPlan cr Wiki
Revision as of 12:09, 31 October 2020

Experiments

WikiCFP Acronyms

Input: Each WikiCFP page has a description RDFa property that uses the syntax <acronym>:<title>. For example, http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=20605 has:

ISEM 2012 : 8th International Conference on Semantic Systems

Acronym: ISEM 2012

Title: 8th International Conference on Semantic Systems
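The split shown above can be sketched in a few lines of Python. This is only a minimal illustration, not the scraper's actual code: the function name `split_description` is hypothetical, and it assumes that the first colon in the description separates acronym and title (titles may themselves contain further colons).

```python
def split_description(description):
    """Split a '<acronym>:<title>' description at the first colon.

    Hypothetical helper for illustration; returns (acronym, title),
    with acronym set to None when no colon is present.
    """
    acronym, separator, title = description.partition(":")
    if not separator:
        # no colon found: treat the whole text as the title
        return None, description.strip()
    return acronym.strip(), title.strip()


acronym, title = split_description(
    "ISEM 2012 : 8th International Conference on Semantic Systems"
)
print("Acronym: %s" % acronym)
print("Title: %s" % title)
```

Using `str.partition` rather than `str.split` keeps any later colons inside the title intact.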

We have scraped 81966 WikiCFP pages and extracted the acronym and title from each page based on this syntax.

Question: What patterns do the acronyms follow with what frequency?

Assumption: The regular expression '[A-Z]+\s*[0-9]+' might fit.
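To make the assumption concrete: the pattern asks for one or more uppercase letters, optional whitespace, and one or more digits. The sketch below is only an illustration. Apart from "ISEM 2012", the test strings are made up, and the helper names `prefix_matches` and `fully_matches` are hypothetical. It also shows that Python's `re.match` only anchors at the start of the string, so trailing text is accepted, whereas `re.fullmatch` requires the whole string to fit.

```python
import re

ACRONYM_REGEX = r'[A-Z]+\s*[0-9]+'


def prefix_matches(text):
    """True if the pattern matches at the start of text (re.match)."""
    return re.match(ACRONYM_REGEX, text) is not None


def fully_matches(text):
    """True if the pattern matches the entire text (re.fullmatch)."""
    return re.fullmatch(ACRONYM_REGEX, text) is not None


# "ISEM 2012" is from the WikiCFP example; the other strings are invented
for text in ["ISEM 2012", "ISEM2012", "ISEM 2012 Workshop", "Isem 2012"]:
    print("%-20s match=%-5s fullmatch=%s"
          % (text, prefix_matches(text), fully_matches(text)))
```

Which of the two is appropriate depends on whether acronyms with trailing text should count as a fit.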

Test with Sample of 5

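import re

# WikiCFP and StoreMode come from the project's own modules (imports not shown here)
def testAcronyms(self):
        '''
        test whether the WikiCFP acronyms match the candidate regular expression
        '''
        wikiCFP=WikiCFP()
        em=wikiCFP.em
        # get the SQL database for the event manager's cache file
        sqlDB=em.getSQLDB(em.getCacheFile(em.config, StoreMode.SQL))
        acronymRecords=sqlDB.query("select acronym from event_wikicfp")
        print ("total acronyms: %d" % len(acronymRecords))
        for regex in [r'[A-Z]+\s*[0-9]+']:
            # check only the first 5 records
            for acronymRecord in acronymRecords[:5]:
                acronym=acronymRecord['acronym']
                # re.match only anchors at the start of the string
                matches=re.match(regex,acronym)
                print ("%s:%s" % ('✅' if matches else '❌' ,acronym))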

✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008

That is an 80% fit for the first 5, which looks promising. Was this just by chance? Let's try a larger sample of 10:

        limit=10
        for regex in [r'[A-Z]+\s*[0-9]+']:
            # reset the counter for each regular expression
            count=0
            for acronymRecord in acronymRecords[:limit]:
                acronym=acronymRecord['acronym']
                matches=re.match(regex,acronym)
                if matches:
                    count+=1
                print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
            # report the match rate for this regular expression
            print("%d/%d (%5.1f%%) matches for %s" % (count,limit,count/limit*100,regex))

✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
❌:LabPhon 11 2008
✅:MALC 2007
❌:KCTOS workshop 2007
❌:Euralex 2008 2008
❌:SCL and SPCL Cayenne 2008
5/10 ( 50.0%) matches for [A-Z]+\s*[0-9]+
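The count above can be reproduced without the database. The following is a standalone sketch, assuming only the ten acronyms listed above as input; the names `SAMPLE_ACRONYMS` and `match_rate` are hypothetical and not part of the WikiCFP code.

```python
import re

# the ten acronyms from the sample output above
SAMPLE_ACRONYMS = [
    "COLING 2008",
    "IJCNLP 2008",
    "Prosody and Language Processing 2008",
    "ICGL 2008",
    "ICPLA 2008",
    "LabPhon 11 2008",
    "MALC 2007",
    "KCTOS workshop 2007",
    "Euralex 2008 2008",
    "SCL and SPCL Cayenne 2008",
]


def match_rate(regex, acronyms):
    """Return (number of acronyms matching regex at their start, total)."""
    pattern = re.compile(regex)
    count = sum(1 for acronym in acronyms if pattern.match(acronym))
    return count, len(acronyms)


regex = r'[A-Z]+\s*[0-9]+'
count, total = match_rate(regex, SAMPLE_ACRONYMS)
print("%d/%d (%5.1f%%) matches for %s" % (count, total, count / total * 100, regex))
```

The drop from 80% to 50% between the two samples suggests that samples this small are too noisy to judge the pattern; the same function could be applied to the full list of acronym records to obtain a more reliable rate.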