Workdocumentation 2021-05-03
Question
What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?
Assumption
Following the hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), leads to interesting observations.
Experiment
Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence
Input: Proceedings titles of dblp conference entries.
Letter
# module-level import needed by the test method below
from collections import Counter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect first letters (keyed by code point)
    counter=Counter()
    for eventId in dblp.em.events:
        # only consider conference entries
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            first=ord(event.title[0])
            counter[first]+=1
    bins=len(counter.keys())
    print(f" {bins} different first letters found")
    # show all first letters, most frequent first
    for o,count in counter.most_common(bins):
        c=chr(o)
        print(f"{c}: {count}")
 46 different first letters found
P: 12599
2: 3526
I: 3515
A: 3296
C: 2333
S: 2260
1: 2105
T: 1559
M: 1312
E: 1252
F: 1246
D: 1177
R: 624
H: 578
N: 566
3: 564
W: 522
L: 502
G: 501
B: 479
4: 354
V: 334
K: 257
O: 255
5: 252
U: 236
9: 215
6: 211
7: 199
8: 187
J: 150
X: 88
Q: 76
e: 19
Z: 13
i: 12
p: 7
«: 5
(: 3
": 2
d: 2
f: 1
t: 1
s: 1
': 1
Y: 1
----------------------------------------------------------------------
Ran 1 test in 0.577s
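The letter-level count can also be tried without the dblp event manager. The following is a minimal sketch, not the test code used for this experiment: the helper names first_char_counts, char_class and class_counts are hypothetical, and the sample titles are invented for illustration rather than taken from dblp. It repeats the first-character count on a plain list of titles and additionally groups the result into coarse character classes, which may help when reading a distribution that mixes uppercase letters, digits, lowercase letters and punctuation.

```python
from collections import Counter


def first_char_counts(titles):
    """Count the first character of every non-empty title."""
    counter = Counter()
    for title in titles:
        if title:  # skip empty or missing titles
            counter[title[0]] += 1
    return counter


def char_class(c):
    """Coarse class of a character: digit, upper, lower or other."""
    if c.isdigit():
        return "digit"
    if c.isupper():
        return "upper"
    if c.islower():
        return "lower"
    return "other"


def class_counts(counter):
    """Aggregate first-character counts by coarse character class."""
    classes = Counter()
    for c, count in counter.items():
        classes[char_class(c)] += count
    return classes


# hypothetical sample titles, NOT taken from dblp
sample_titles = [
    "Proceedings of the 1st Workshop on Examples",
    "Proceedings of the 2nd Workshop on Examples",
    "2020 International Conference on Samples",
    "Advances in Illustration",
    "eScience Sample Symposium",
    "«Quoted» Sample Title",
    "",
]

counts = first_char_counts(sample_titles)
classes = class_counts(counts)

# per-character counts, most frequent first
for c, count in counts.most_common():
    print(f"{c}: {count}")
# per-class totals
print(dict(classes))
```

Unlike the test method above, this sketch keys the counter by the character itself instead of its code point and skips empty titles; both are choices made for the sketch only.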