Workdocumentation 2021-05-03

From BITPlan cr Wiki

Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Following the hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), leads to interesting observations.

Experiment

Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence

Input: Proceedings titles of dblp conference entries.
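
To make the levels concrete, here is a minimal sketch of how a single title might be viewed at the letter, token and sentence level. It assumes that "depth first" means taking the first element at each level; the function name `hierarchy_levels` and the sample title are hypothetical and not part of the original experiment. The grammatical structure level is left out because the notes do not say how it is derived.

```python
def hierarchy_levels(title):
    """Return the first letter, first token and full sentence of a title.

    Assumption: the first element of each level is the one of interest.
    """
    tokens = title.split()
    return {
        "letter": title[0] if title else None,
        "token": tokens[0] if tokens else None,
        "sentence": title,
    }


# Hypothetical sample title, not taken from dblp
sample = "Proceedings of the 12th International Conference on Example Systems"
levels = hierarchy_levels(sample)
print(levels["letter"], levels["token"])
```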

Letter

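from collections import Counter

# method of the test class (uses self.getEvents and self.assertTrue)
def testMostCommonFirstLetter(self):
    '''
    get the most common first letters of the conference titles
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect the first letter of each conference event title
    counter=Counter()
    for eventId in dblp.em.events:
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            # skip events without a title to avoid an IndexError
            if event.title:
                counter[event.title[0]]+=1
    bins=len(counter)
    print(f" {bins} different first letters found")
    # most_common() without an argument lists all entries by descending count
    for c,count in counter.most_common():
        print(f"{c}: {count}")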

Output:

 46 different first letters found
P: 12599
2: 3526
I: 3515
A: 3296
C: 2333
S: 2260
1: 2105
T: 1559
M: 1312
E: 1252
F: 1246
D: 1177
R: 624
H: 578
N: 566
3: 564
W: 522
L: 502
G: 501
B: 479
4: 354
V: 334
K: 257
O: 255
5: 252
U: 236
9: 215
6: 211
7: 199
8: 187
J: 150
X: 88
Q: 76
e: 19
Z: 13
i: 12
p: 7
«: 5
(: 3
": 2
d: 2
f: 1
t: 1
s: 1
': 1
Y: 1
----------------------------------------------------------------------
Ran 1 test in 0.577s
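
The test above depends on the dblp event data. The following is a minimal, self-contained sketch of the same counting step that also groups the first letters into coarse character classes, which may help when reading a result where "P" (12599) dominates and digits, lowercase letters and punctuation also occur. The helper names `first_letter_counts` and `char_class` and the sample titles are hypothetical and not part of the original experiment.

```python
from collections import Counter


def first_letter_counts(titles):
    """Count the first character of each non-empty title."""
    return Counter(title[0] for title in titles if title)


def char_class(c):
    """Classify a character as digit, upper, lower or other."""
    if c.isdigit():
        return "digit"
    if c.isupper():
        return "upper"
    if c.islower():
        return "lower"
    return "other"


# Hypothetical sample titles, not taken from dblp
sample_titles = [
    "Proceedings of the Example Workshop",
    "2nd International Example Symposium",
    "Proceedings of the Sample Conference",
    "",  # empty titles are skipped
]

counts = first_letter_counts(sample_titles)
for c, count in counts.most_common():
    print(f"{c}: {count}")

# aggregate the per-letter counts by character class
classes = Counter()
for c, count in counts.items():
    classes[char_class(c)] += count
print(dict(classes))
```

Replacing `sample_titles` with the titles of the dblp conference events should give the same per-letter counts as the test, assuming titles are plain strings.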