Difference between revisions of "Workdocumentation 2021-05-03"

From BITPlan cr Wiki

Revision as of 07:34, 3 May 2021

Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Following the hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), leads to interesting observations.

Experiment

Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence

Input: Proceedings titles of dblp conference entries.
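The two lowest levels of the hierarchy can be sketched on a handful of titles. This is a minimal illustration only: the sample titles are invented (not taken from dblp), the helper names first_letter and tokens are hypothetical, and whitespace splitting is an assumed stand-in for whatever tokenization the later levels actually use.

```python
from collections import Counter

# Hypothetical sample titles, invented for illustration (not taken from dblp)
titles = [
    "Proceedings of the 3rd Workshop on Examples",
    "2nd International Conference on Samples",
    "Proceedings of the Annual Meeting on Demos",
]


def first_letter(title):
    """Letter level: the first character of the title."""
    return title[0]


def tokens(title):
    """Token level: whitespace-separated tokens (an assumed, simple tokenizer)."""
    return title.split()


# count first characters and first tokens, skipping empty or blank titles
letter_counter = Counter(first_letter(t) for t in titles if t)
token_counter = Counter(tokens(t)[0] for t in titles if t.strip())

print(letter_counter.most_common())
print(token_counter.most_common())
```

Moving from the letter level to the token level keeps the same counting pattern; only the unit that is extracted from the title changes.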

Letter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters of the dblp conference titles
    '''
    # needs: from collections import Counter
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect first letters
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        # only consider conference entries
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            # count by code point of the first character
            first=ord(event.title[0])
            counter[first]+=1
            total+=1
    bins=len(counter.keys())
    print(f"found {bins} different first letters in {total} titles")
    # show absolute count and percentage per first letter
    for o,count in counter.most_common(bins):
        c=chr(o)
        print(f"{c}: {count:5} {count/total*100:4.1f} %")
read 43976 Events from dblp in   0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s
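To make the distribution easier to read, the per-character counts can be grouped by character class. The following is a sketch, not part of the original test: the counts are copied from the output above, while the function char_class and its class names are assumptions introduced for illustration.

```python
from collections import Counter

# counts copied from the output above (first character -> number of titles)
counts = Counter({
    "P": 12599, "2": 3526, "I": 3515, "A": 3296, "C": 2333, "S": 2260,
    "1": 2105, "T": 1559, "M": 1312, "E": 1252, "F": 1246, "D": 1177,
    "R": 624, "H": 578, "N": 566, "3": 564, "W": 522, "L": 502,
    "G": 501, "B": 479, "4": 354, "V": 334, "K": 257, "O": 255,
    "5": 252, "U": 236, "9": 215, "6": 211, "7": 199, "8": 187,
    "J": 150, "X": 88, "Q": 76, "e": 19, "Z": 13, "i": 12,
    "p": 7, "«": 5, "(": 3, '"': 2, "d": 2, "f": 1,
    "t": 1, "s": 1, "'": 1, "Y": 1,
})


def char_class(c):
    """Hypothetical helper: map a character to a coarse class."""
    if c.isdigit():
        return "digit"
    if c.isupper():
        return "uppercase letter"
    if c.islower():
        return "lowercase letter"
    return "other"


# sum the title counts per character class
by_class = Counter()
for c, count in counts.items():
    by_class[char_class(c)] += count

total = sum(counts.values())
print(f"{len(counts)} different first letters in {total} titles")
for name, count in by_class.most_common():
    print(f"{name:17}: {count:5} {count/total*100:4.1f} %")
```

The copied counts add up to the 43398 titles and 46 bins reported in the output, which is a quick consistency check on the transcription.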