Difference between revisions of "Workdocumentation 2021-05-03"

From BITPlan cr Wiki
Line 88:
 
Ran 1 test in 0.557s
 
</pre>
 +
=== Most common first letters ===
 +
{| class="wikitable" style="text-align: left;"
 +
|+ <!-- caption -->
 +
|-
 +
! #    !! key  !! align="right"|  count !! align="right"|    %
 +
|-
 +
| total || 46    || align="right"|  43398 || align="right"|
 +
|-
 +
| 1    || P    || align="right"|  12599 || align="right"| 29.03
 +
|-
 +
| 2    || 2    || align="right"|    3526 || align="right"|  8.12
 +
|-
 +
| 3    || I    || align="right"|    3515 || align="right"|  8.10
 +
|-
 +
| 4    || A    || align="right"|    3296 || align="right"|  7.59
 +
|-
 +
| 5    || C    || align="right"|    2333 || align="right"|  5.38
 +
|-
 +
| 6    || S    || align="right"|    2260 || align="right"|  5.21
 +
|-
 +
| 7    || 1    || align="right"|    2105 || align="right"|  4.85
 +
|-
 +
| 8    || T    || align="right"|    1559 || align="right"|  3.59
 +
|-
 +
| 9    || M    || align="right"|    1312 || align="right"|  3.02
 +
|-
 +
| 10    || E    || align="right"|    1252 || align="right"|  2.88
 +
|-
 +
| 11    || F    || align="right"|    1246 || align="right"|  2.87
 +
|-
 +
| 12    || D    || align="right"|    1177 || align="right"|  2.71
 +
|-
 +
| 13    || R    || align="right"|    624 || align="right"|  1.44
 +
|-
 +
| 14    || H    || align="right"|    578 || align="right"|  1.33
 +
|-
 +
| 15    || N    || align="right"|    566 || align="right"|  1.30
 +
|-
 +
| 16    || 3    || align="right"|    564 || align="right"|  1.30
 +
|-
 +
| 17    || W    || align="right"|    522 || align="right"|  1.20
 +
|-
 +
| 18    || L    || align="right"|    502 || align="right"|  1.16
 +
|-
 +
| 19    || G    || align="right"|    501 || align="right"|  1.15
 +
|-
 +
| 20    || B    || align="right"|    479 || align="right"|  1.10
 +
|-
 +
| 21    || 4    || align="right"|    354 || align="right"|  0.82
 +
|-
 +
| 22    || V    || align="right"|    334 || align="right"|  0.77
 +
|-
 +
| 23    || K    || align="right"|    257 || align="right"|  0.59
 +
|-
 +
| 24    || O    || align="right"|    255 || align="right"|  0.59
 +
|-
 +
| 25    || 5    || align="right"|    252 || align="right"|  0.58
 +
|-
 +
| 26    || U    || align="right"|    236 || align="right"|  0.54
 +
|-
 +
| 27    || 9    || align="right"|    215 || align="right"|  0.50
 +
|-
 +
| 28    || 6    || align="right"|    211 || align="right"|  0.49
 +
|-
 +
| 29    || 7    || align="right"|    199 || align="right"|  0.46
 +
|-
 +
| 30    || 8    || align="right"|    187 || align="right"|  0.43
 +
|-
 +
| 31    || J    || align="right"|    150 || align="right"|  0.35
 +
|-
 +
| 32    || X    || align="right"|      88 || align="right"|  0.20
 +
|-
 +
| 33    || Q    || align="right"|      76 || align="right"|  0.18
 +
|-
 +
| 34    || e    || align="right"|      19 || align="right"|  0.04
 +
|-
 +
| 35    || Z    || align="right"|      13 || align="right"|  0.03
 +
|-
 +
| 36    || i    || align="right"|      12 || align="right"|  0.03
 +
|-
 +
| 37    || p    || align="right"|      7 || align="right"|  0.02
 +
|-
 +
| 38    || «    || align="right"|      5 || align="right"|  0.01
 +
|-
 +
| 39    || (    || align="right"|      3 || align="right"|  0.01
 +
|-
 +
| 40    || "    || align="right"|      2 || align="right"|  0.00
 +
|-
 +
| 41    || d    || align="right"|      2 || align="right"|  0.00
 +
|-
 +
| 42    || f    || align="right"|      1 || align="right"|  0.00
 +
|-
 +
| 43    || t    || align="right"|      1 || align="right"|  0.00
 +
|-
 +
| 44    || s    || align="right"|      1 || align="right"|  0.00
 +
|-
 +
| 45    || '    || align="right"|      1 || align="right"|  0.00
 +
|-
 +
| 46    || Y    || align="right"|      1 || align="right"|  0.00
 +
|}
  
 
== Observation for Letter ==

Revision as of 08:50, 3 May 2021

Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Following the hierarchy of letter, token, grammatical structure and sentence column-first (depth-first) along the relevance matrix path leads to interesting observations.

Experiment

Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence

Input: Proceedings titles of dblp conference entries.

Letter

ProceedingsTitleParser/WolfgangFahl:adds testMostCommonFirstLetter experiment (Wolfgang Fahl/2021-05-03 08:39:45 +0200)

from collections import Counter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # count the first character of every conference title
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        # only consider dblp conference entries
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            counter[event.title[0]]+=1
            total+=1
    bins=len(counter)
    print(f"found {bins} different first letters in {total} titles")
    # list the letters from most to least common with their share
    for c,count in counter.most_common():
        print (f"{c}: {count:5} {count/total*100:4.1f} %")
read 43976 Events from dblp in   0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s

Most common first letters

# key count %
total 46 43398
1 P 12599 29.03
2 2 3526 8.12
3 I 3515 8.10
4 A 3296 7.59
5 C 2333 5.38
6 S 2260 5.21
7 1 2105 4.85
8 T 1559 3.59
9 M 1312 3.02
10 E 1252 2.88
11 F 1246 2.87
12 D 1177 2.71
13 R 624 1.44
14 H 578 1.33
15 N 566 1.30
16 3 564 1.30
17 W 522 1.20
18 L 502 1.16
19 G 501 1.15
20 B 479 1.10
21 4 354 0.82
22 V 334 0.77
23 K 257 0.59
24 O 255 0.59
25 5 252 0.58
26 U 236 0.54
27 9 215 0.50
28 6 211 0.49
29 7 199 0.46
30 8 187 0.43
31 J 150 0.35
32 X 88 0.20
33 Q 76 0.18
34 e 19 0.04
35 Z 13 0.03
36 i 12 0.03
37 p 7 0.02
38 « 5 0.01
39 ( 3 0.01
40 " 2 0.00
41 d 2 0.00
42 f 1 0.00
43 t 1 0.00
44 s 1 0.00
45 ' 1 0.00
46 Y 1 0.00
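
A table of this shape (rank, key, count, percentage with two decimals) could be derived from a Counter roughly as follows. This is only a sketch: the function name first_letter_table and the sample titles are hypothetical and are not taken from the ProceedingsTitleParser code or from dblp.

```python
from collections import Counter

def first_letter_table(titles):
    """Return (rows, total); each row is (rank, key, count, percent)."""
    # count the first character of every non-empty title
    counter = Counter(title[0] for title in titles if title)
    total = sum(counter.values())
    rows = []
    for rank, (key, count) in enumerate(counter.most_common(), start=1):
        rows.append((rank, key, count, round(count / total * 100, 2)))
    return rows, total

# made-up titles for illustration only
sample = [
    "Proceedings of A",
    "Proceedings of B",
    "2nd Workshop on C",
    "International Conference on D",
]
rows, total = first_letter_table(sample)
print(f"total {len(rows)} {total}")
for rank, key, count, pct in rows:
    print(f"{rank} {key} {count} {pct:.2f}")
```

Keeping the rows as plain tuples leaves the choice of output format (console, wikitable, CSV) to the caller.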

Observation for Letter

Relevance Matrix

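{| class="wikitable" style="text-align: left;"
|-
! !! top 10% !! top 20% !! top 30%
|-
| Letter || 1: P || 1: P || 2: P, 2
|-
| Token || || ||
|-
| Grammar structure || || ||
|}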
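
The Letter row can be reproduced with a short sketch, assuming that "top N%" means the smallest set of most common keys whose cumulative count reaches N% of all titles. That reading is an assumption, although it matches the entries 1: P, 1: P and 2: P, 2. The function name keys_to_cover is hypothetical; the counts and the total are the ones reported in the first-letter table.

```python
def keys_to_cover(counts, total, share):
    """Return the most common keys whose cumulative count first reaches share * total."""
    covered = 0
    keys = []
    # walk the keys from most to least common
    for key, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        keys.append(key)
        covered += count
        if covered >= share * total:
            break
    return keys

# the three most common first letters and the total number of titles
counts = {"P": 12599, "2": 3526, "I": 3515}
total = 43398
for share in (0.10, 0.20, 0.30):
    keys = keys_to_cover(counts, total, share)
    print(f"top {share:.0%}: {len(keys)}: {', '.join(keys)}")
```

With these numbers P alone covers 29.03 % of the titles, so a second key is only needed for the 30 % column.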

Interpretation for Letter

P is probably the most common first letter because the word "Proceedings" starts with "P" and is likely to be one of the most common first words of a proceedings title.
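
One way to check this interpretation is to count the first whitespace-separated word of each title instead of the first letter. The sketch below assumes titles are plain strings; the function name most_common_first_words and the sample titles are hypothetical and only illustrate the idea.

```python
from collections import Counter

def most_common_first_words(titles, n=3):
    """Return the n most common first words as (word, count) pairs."""
    # split on whitespace and skip titles without any word
    counter = Counter(title.split()[0] for title in titles if title.split())
    return counter.most_common(n)

# made-up titles for illustration only
titles = [
    "Proceedings of the Workshop on A",
    "Proceedings of the Conference on B",
    "2nd International Workshop on C",
]
for word, count in most_common_first_words(titles):
    print(f"{word}: {count}")
```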

Word