Difference between revisions of "Acronym - Regular Expressions"

From BITPlan cr Wiki

Revision as of 13:52, 31 October 2020

Experiments

WikiCFP Acronyms

Input: Each WikiCFP page has a description RDFa property that uses the syntax <acronym>:<title>. E.g. http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=20605 has:

ISEM 2012 : 8th International Conference on Semantic Systems

Acronym: ISEM 2012 Title: 8th International Conference on Semantic Systems
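The split shown above can be sketched as follows. This is a minimal illustration and not the scraper's actual code: the helper name split_description is hypothetical, and splitting on the first colon only is an assumption made so that colons inside a title survive.

```python
def split_description(description):
    """Split a description of the form '<acronym> : <title>'.

    Hypothetical helper for illustration only. It splits on the first
    colon, so any further colons stay part of the title.
    """
    acronym, separator, title = description.partition(":")
    if not separator:
        # No colon at all: the input does not follow the expected syntax.
        return None
    return acronym.strip(), title.strip()


result = split_description(
    "ISEM 2012 : 8th International Conference on Semantic Systems"
)
print("Acronym: %s Title: %s" % result)
```

A description without any colon yields None here, so such records can be counted separately instead of being silently mis-split.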

We have scraped 81966 WikiCFP pages for the acronym and title based on this syntax.

Question: What patterns do the acronyms follow with what frequency?

Assumption: The regular expression '[A-Z]+\s*[0-9]+' might fit.
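To make the assumption concrete, here is a small sketch of what this regular expression accepts when it is applied with re.match, which is how the test below uses it. The names ACRONYM_REGEX and fits are hypothetical and only serve the illustration; the string "COLING 2008 workshop" is a made-up example, not a scraped record.

```python
import re

# The pattern from the assumption: uppercase letters, optional whitespace, digits.
ACRONYM_REGEX = r'[A-Z]+\s*[0-9]+'


def fits(acronym, regex=ACRONYM_REGEX):
    """Return True if the regex matches at the start of the acronym.

    re.match anchors at the beginning of the string only, so trailing
    text after the digits is tolerated.
    """
    return re.match(regex, acronym) is not None


for candidate in ["COLING 2008", "LabPhon 11 2008", "COLING 2008 workshop"]:
    # Compare the prefix match with a stricter whole-string match.
    strict = re.fullmatch(ACRONYM_REGEX, candidate) is not None
    print("%s: match=%s fullmatch=%s" % (candidate, fits(candidate), strict))
```

Mixed-case names such as "LabPhon 11 2008" fail because [A-Z]+ stops at the first lowercase letter and no digit follows. Whether a prefix match or a whole-string match is the better criterion is a design choice; re.fullmatch would be the stricter variant.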

Test with Sample of 5

ProceedingsTitleParser/WolfgangFahl:adds test for Acronyms (Wolfgang Fahl/2020-10-31 12:18:54 +0100)

import re

def testAcronyms(self):
        '''
        test Acronyms
        '''
        wikiCFP=WikiCFP()
        em=wikiCFP.em
        # open the SQL cache of the scraped WikiCFP events
        sqlDB=em.getSQLDB(em.getCacheFile(em.config, StoreMode.SQL))
        acronymRecords=sqlDB.query("select acronym from event_wikicfp")
        print ("total acronyms: %d" % len(acronymRecords))
        for regex in [r'[A-Z]+\s*[0-9]+']:
            # check only the first 5 records
            for acronymRecord in acronymRecords[:5]:
                acronym=acronymRecord['acronym']
                # re.match anchors the pattern at the start of the acronym only
                matches=re.match(regex,acronym)
                print ("%s:%s" % ('✅' if matches else '❌' ,acronym))

✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008

That is an 80% fit for the first 5. Looks promising ... Was this just by chance? Try out more (10)

        limit=10
        for regex in [r'[A-Z]+\s*[0-9]+']:
            # reset the counter for each regex so the rates do not accumulate
            count=0
            for acronymRecord in acronymRecords[:limit]:
                acronym=acronymRecord['acronym']
                matches=re.match(regex,acronym)
                if matches:
                    count+=1
                print ("%s:%s" % ('✅' if matches else '❌' ,acronym))
            print("%d/%d (%5.1f%%) matches for %s" % (count,limit,count/limit*100,regex))

✅:COLING 2008
✅:IJCNLP 2008
❌:Prosody and Language Processing 2008
✅:ICGL 2008
✅:ICPLA 2008
❌:LabPhon 11 2008
✅:MALC 2007
❌:KCTOS workshop 2007
❌:Euralex 2008 2008
❌:SCL and SPCL Cayenne 2008

5/10 ( 50.0%) matches for [A-Z]+\s*[0-9]+

Poorer luck this time? Try 10,000 for a bigger sample:

7077/10000 ( 70.8%) matches for [A-Z]+\s*[0-9]+

Will this converge? Try 50,000 acronyms now:

33707/50000 ( 67.4%) matches for [A-Z]+\s*[0-9]+

Try the whole sample set of 81,966 next:

49728/81966 ( 60.7%) matches for [A-Z]+\s*[0-9]+
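The repeated runs with growing limits can be condensed into one helper. The following is a sketch under assumptions: the function name match_rate is hypothetical, and the list below contains only the ten acronyms printed earlier rather than the full database, so it reproduces just the first two runs.

```python
import re


def match_rate(acronyms, regex, limit=None):
    """Count how many of the first `limit` acronyms match the regex.

    Returns a (count, total) tuple; limit=None means the whole list.
    """
    sample = acronyms if limit is None else acronyms[:limit]
    count = sum(1 for acronym in sample if re.match(regex, acronym))
    return count, len(sample)


# The ten acronyms shown in the output above, in eventId order.
acronyms = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]
regex = r'[A-Z]+\s*[0-9]+'
for limit in (5, 10):
    count, total = match_rate(acronyms, regex, limit)
    print("%d/%d (%5.1f%%) matches for %s" % (count, total, count / total * 100, regex))
```

With the full list of acronym strings from the database in place of the ten samples, the same loop could be run over limits such as 10000, 50000 and None.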

Discussion

Note that the samples were not random: the query returns records in the order of the "eventId" that was originally used for scraping, so each smaller sample is a prefix of the larger ones.

select eventId,acronym,url from event_wikicfp limit 10
eventId      acronym                               url
wikiCFP#158  COLING 2008                           http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=158
wikiCFP#159  IJCNLP 2008                           http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=159
wikiCFP#160  Prosody and Language Processing 2008  http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=160
wikiCFP#161  ICGL 2008                             http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=161
wikiCFP#162  ICPLA 2008                            http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=162
wikiCFP#163  LabPhon 11 2008                       http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=163
wikiCFP#164  MALC 2007                             http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=164
wikiCFP#165  KCTOS workshop 2007                   http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=165
wikiCFP#166  Euralex 2008 2008                     http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=166
wikiCFP#167  SCL and SPCL Cayenne 2008             http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=167
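Since the prefixes above follow the eventId order, one way to check whether the match rate depends on that order would be to draw a random sample instead. This is only a sketch of that idea and not something the experiment did: the function name random_match_rate and the seed parameter are assumptions, and the list again holds just the ten acronyms from the table.

```python
import random
import re


def random_match_rate(acronyms, regex, sample_size, seed=None):
    """Match rate over a random sample drawn without replacement.

    Returns (count, total). A fixed seed makes the draw repeatable.
    """
    rng = random.Random(seed)
    # Never ask for more items than the list holds.
    size = min(sample_size, len(acronyms))
    sample = rng.sample(acronyms, size)
    count = sum(1 for acronym in sample if re.match(regex, acronym))
    return count, len(sample)


# The ten acronyms from the table above; the real input would be all records.
acronyms = [
    "COLING 2008", "IJCNLP 2008", "Prosody and Language Processing 2008",
    "ICGL 2008", "ICPLA 2008", "LabPhon 11 2008", "MALC 2007",
    "KCTOS workshop 2007", "Euralex 2008 2008", "SCL and SPCL Cayenne 2008",
]
count, total = random_match_rate(acronyms, r'[A-Z]+\s*[0-9]+', 5, seed=42)
print("%d/%d matches in a random sample" % (count, total))
```

Repeating the draw with different seeds over the full data would show how much the rate varies between samples of the same size.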