Workdocumentation 2021-05-03: Difference between revisions

From BITPlan cr Wiki

Revision as of 06:34, 3 May 2021

Question

What happens when the relevance matrix approach is applied to parsing proceedings titles (and later to parsing in general)?

Assumption

Traversing the hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), is expected to lead to interesting observations.

Experiment

Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence

Input: Proceedings titles of dblp conference entries.
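
The two lowest levels of the hierarchy (letter and token) can be illustrated with a minimal sketch. The sample titles and the helper names `first_letter` and `first_token` are hypothetical and not taken from dblp or from the test code; whitespace splitting is only an assumed stand-in for a real tokenizer.

```python
from collections import Counter

# hypothetical sample titles for illustration only, not real dblp data
titles = [
    "Proceedings of the 1st Example Workshop",
    "Proceedings of the Example Conference 2020",
    "2nd International Example Symposium",
]


def first_letter(title: str) -> str:
    """Letter level: the first character of the title."""
    return title[0]


def first_token(title: str) -> str:
    """Token level: the first whitespace-separated token of the title."""
    return title.split()[0]


# skip empty or whitespace-only titles so that indexing cannot fail
usable = [t for t in titles if t.strip()]
letter_counts = Counter(first_letter(t) for t in usable)
token_counts = Counter(first_token(t) for t in usable)

print(letter_counts.most_common())
print(token_counts.most_common())
```

Counting at the token level right after the letter level follows the depth-first idea: each level narrows down what the next level has to look at.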

Letter

    # requires at module level: from collections import Counter
    def testMostCommonFirstLetter(self):
        '''
        get the most common first letters of the dblp conference titles
        '''
        dblp,foundEvents=self.getEvents()
        self.assertTrue(foundEvents>43950)
        # collect first letters
        counter=Counter()
        total=0
        for eventId in dblp.em.events:
            # only conference entries are considered
            if eventId.startswith("conf"):
                event=dblp.em.events[eventId]
                # count by code point of the first character of the title
                first=ord(event.title[0])
                counter[first]+=1
                total+=1
        bins=len(counter)
        print(f"found {bins} different first letters in {total} titles")
        # show absolute count and share of all titles, most frequent first
        for o,count in counter.most_common(bins):
            c=chr(o)
            print(f"{c}: {count:5} {count/total*100:4.1f} %")
read 43976 Events from dblp in   0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s
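
As a possible follow-up, the counts above can be grouped by character class to see how many titles start with an uppercase letter, a digit, a lowercase letter or another character. The following is only a sketch: the counts are copied from the output above, while the class names and the helper `char_class` are assumptions made for illustration and are not part of the original test.

```python
from collections import Counter

# first-character counts copied from the test output above
first_letter_counts = {
    "P": 12599, "2": 3526, "I": 3515, "A": 3296, "C": 2333, "S": 2260,
    "1": 2105, "T": 1559, "M": 1312, "E": 1252, "F": 1246, "D": 1177,
    "R": 624, "H": 578, "N": 566, "3": 564, "W": 522, "L": 502,
    "G": 501, "B": 479, "4": 354, "V": 334, "K": 257, "O": 255,
    "5": 252, "U": 236, "9": 215, "6": 211, "7": 199, "8": 187,
    "J": 150, "X": 88, "Q": 76, "e": 19, "Z": 13, "i": 12,
    "p": 7, "«": 5, "(": 3, '"': 2, "d": 2, "f": 1,
    "t": 1, "s": 1, "'": 1, "Y": 1,
}


def char_class(c: str) -> str:
    """Hypothetical helper: map a character to a coarse class."""
    if c.isdigit():
        return "digit"
    if c.isupper():
        return "uppercase"
    if c.islower():
        return "lowercase"
    return "other"


# sum the per-character counts within each class
by_class = Counter()
for c, count in first_letter_counts.items():
    by_class[char_class(c)] += count

total = sum(first_letter_counts.values())
for name, count in by_class.most_common():
    print(f"{name:9}: {count:5} {count/total*100:4.1f} %")
```

Grouping this way keeps the letter level compact before moving on to the token level.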