Workdocumentation 2021-05-03

From BITPlan cr Wiki

Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Following the hierarchy of letter, token, grammatical structure and sentence through the relevance matrix column first (depth first) leads to interesting observations.

Experiment

Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence

Input: Proceedings titles of dblp conference entries.

Letter

ProceedingsTitleParser/WolfgangFahl:adds testMostCommonFirstLetter experiment (Wolfgang Fahl/2021-05-03 08:39:45 +0200)

from collections import Counter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect first letters, keyed by code point
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        # only consider conference entries
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            # note: assumes every title has at least one character
            first=ord(event.title[0])
            counter[first]+=1
            total+=1
    bins=len(counter.keys())
    print(f"found {bins} different first letters in {total} titles")
    # print the letters sorted by descending frequency
    for o,count in counter.most_common(bins):
        c=chr(o)
        print(f"{c}: {count:5} {count/total*100:4.1f} %")

read 43976 Events from dblp in   0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s

Most common first letters

    #  key  count      %
total  46   43398
    1  P    12599  29.03
    2  2     3526   8.12
    3  I     3515   8.10
    4  A     3296   7.59
    5  C     2333   5.38
    6  S     2260   5.21
    7  1     2105   4.85
    8  T     1559   3.59
    9  M     1312   3.02
   10  E     1252   2.88
   11  F     1246   2.87
   12  D     1177   2.71
   13  R      624   1.44
   14  H      578   1.33
   15  N      566   1.30
   16  3      564   1.30
   17  W      522   1.20
   18  L      502   1.16
   19  G      501   1.15
   20  B      479   1.10
   21  4      354   0.82
   22  V      334   0.77
   23  K      257   0.59
   24  O      255   0.59
   25  5      252   0.58
   26  U      236   0.54
   27  9      215   0.50
   28  6      211   0.49
   29  7      199   0.46
   30  8      187   0.43
   31  J      150   0.35
   32  X       88   0.20
   33  Q       76   0.18
   34  e       19   0.04
   35  Z       13   0.03
   36  i       12   0.03
   37  p        7   0.02
   38  «        5   0.01
   39  (        3   0.01
   40  "        2   0.00
   41  d        2   0.00
   42  f        1   0.00
   43  t        1   0.00
   44  s        1   0.00
   45  '        1   0.00
   46  Y        1   0.00

Observation for Letter

Relevance Matrix

                   top 10%  top 20%  top 30%
Letter             1: P     1: P     2: P, 2
Token
Grammar structure
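
The Letter row can be reproduced from the counts in the table of most common first letters. The following is a minimal sketch, assuming each cell reads as "number of keys: keys" for the most common keys whose cumulative share of the titles first reaches the column threshold. The function name keys_to_cover is made up for this illustration and is not part of ProceedingsTitleParser.

```python
from collections import Counter

def keys_to_cover(counter, total, share):
    """
    Return the most common keys whose cumulative share of `total`
    first reaches `share` (hypothetical helper for illustration).
    If the counter never reaches the share, all keys are returned.
    """
    covered = 0
    keys = []
    for key, count in counter.most_common():
        keys.append(key)
        covered += count
        if covered / total >= share:
            break
    return keys

# Counts of the five most common first letters and the total number
# of titles, copied from the table of most common first letters.
first_letters = Counter({"P": 12599, "2": 3526, "I": 3515, "A": 3296, "C": 2333})
total = 43398

for share in (0.10, 0.20, 0.30):
    keys = keys_to_cover(first_letters, total, share)
    print(f"top {share:.0%}: {len(keys)}: {', '.join(keys)}")
```

Under this reading, "P" alone covers 29.03 % of the titles and therefore fills the 10 % and 20 % columns, while the 30 % column needs "P" and "2" together (29.03 % + 8.12 %).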

Interpretation for Letter

"P" is probably the most common first letter because the word "Proceedings" starts with "P" and is likely one of the most common first words of a title.
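
One way to check this interpretation is to repeat the count one level up in the hierarchy, on the first token instead of the first letter. The sketch below assumes simple whitespace tokenization. The function name first_token_counts and the sample titles are made up for illustration and are not taken from dblp or from ProceedingsTitleParser.

```python
from collections import Counter

def first_token_counts(titles):
    """
    Count the first whitespace-separated token of each title
    (hypothetical helper for illustration). Empty titles are skipped.
    """
    counter = Counter()
    for title in titles:
        tokens = title.split()
        if tokens:
            counter[tokens[0]] += 1
    return counter

# Made-up sample titles, for illustration only (not dblp data).
sample_titles = [
    "Proceedings of the Example Workshop on Parsing",
    "Proceedings of the 2nd Example Conference",
    "2nd International Example Symposium",
    "",
]

counts = first_token_counts(sample_titles)
for token, count in counts.most_common():
    print(f"{token}: {count}")
```

If "Proceedings" dominates such a count on the real titles, that would support the interpretation above.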

Word