Workdocumentation 2021-05-03


Wolfgang Fahl

Introduction

The Relevance Matrix approach has been discussed in https://rq.bitplan.com/index.php/Hackathon_2021-04-27#Relevance_Matrix. It might be feasible to derive a systematic approach to analysis, design and solution from this idea.

Walking the path from the 1st decile down the dependency tree, an observation is made at each cell:

  1. How many items fall in this cell?
  2. What is the category of this cell, if it is not known in advance? E.g. in the Country/Region/City hierarchy we assumed that the categories were known in advance. For the Proceedings Title Parsing Problem we'll try the approach as if we did not know the categories yet and start with general parsing categories only.

If we find an element in a cell we'll then categorize it.
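
The categorization step can be sketched as follows. This is only an illustration, assuming that the major classes of the Unicode general category (via Python's standard `unicodedata` module) are acceptable as "general parsing categories"; the function names `general_category` and `categorize_cells` and the toy counts are hypothetical and not taken from the ProceedingsTitleParser code.

```python
import unicodedata
from collections import Counter


def general_category(ch: str) -> str:
    """Map a single character to a coarse category (hypothetical naming)."""
    # unicodedata.category returns e.g. 'Lu', 'Nd', 'Pi'; keep the major class only
    major = unicodedata.category(ch)[0]
    # 'N' covers all Unicode number classes; it is labelled 'Digit' here for brevity
    return {"L": "Letter", "N": "Digit", "P": "Punctuation"}.get(major, "Other")


def categorize_cells(counts: Counter) -> Counter:
    """Aggregate per-character counts into per-category counts."""
    result = Counter()
    for ch, count in counts.items():
        result[general_category(ch)] += count
    return result


# toy counts for illustration only, not measured data
sample = Counter({"P": 3, "2": 2, "«": 1})
print(categorize_cells(sample))
```

With such a mapping each cell of the matrix gets a category as soon as an element is found in it, without assuming any domain knowledge about proceedings titles.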

Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Following a hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path column first (depth first) leads to interesting observations.

Experiment

Hierarchy of:

  - Letter
  - Token
  - Grammatical structure
  - Sentence

Input: Proceedings titles of dblp conference entries.

Letter

ProceedingsTitleParser/WolfgangFahl:adds testMostCommonFirstLetter experiment (Wolfgang Fahl/2021-05-03 08:39:45 +0200)

# requires at module level: from collections import Counter
def testMostCommonFirstLetter(self):
    '''
    get the most common first letters of the dblp conference titles
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect first letters (keyed by code point) of conference events only
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            first=ord(event.title[0])
            counter[first]+=1
            total+=1
    bins=len(counter.keys())
    print(f"found {bins} different first letters in {total} titles")
    # print each first letter with its count and share, most common first
    for o,count in counter.most_common(bins):
        c=chr(o)
        print(f"{c}: {count:5} {count/total*100:4.1f} %")

Output:

read 43976 Events from dblp in   0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s

Most common first letters

# key count %
total 46 43398
1 P 12599 29.03
2 2 3526 8.12
3 I 3515 8.10
4 A 3296 7.59
5 C 2333 5.38
6 S 2260 5.21
7 1 2105 4.85
8 T 1559 3.59
9 M 1312 3.02
10 E 1252 2.88
11 F 1246 2.87
12 D 1177 2.71
13 R 624 1.44
14 H 578 1.33
15 N 566 1.30
16 3 564 1.30
17 W 522 1.20
18 L 502 1.16
19 G 501 1.15
20 B 479 1.10
21 4 354 0.82
22 V 334 0.77
23 K 257 0.59
24 O 255 0.59
25 5 252 0.58
26 U 236 0.54
27 9 215 0.50
28 6 211 0.49
29 7 199 0.46
30 8 187 0.43
31 J 150 0.35
32 X 88 0.20
33 Q 76 0.18
34 e 19 0.04
35 Z 13 0.03
36 i 12 0.03
37 p 7 0.02
38 « 5 0.01
39 ( 3 0.01
40 " 2 0.00
41 d 2 0.00
42 f 1 0.00
43 t 1 0.00
44 s 1 0.00
45 ' 1 0.00
46 Y 1 0.00

Observation for Letter

Top categories: Letter and Digit.

Relevance Matrix

                   top 10%   top 20%   top 30%
Letter             1: P      1: P      2: P, 2
Token
Grammar structure
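
The Letter row can be reproduced from the counts with a small sketch. It assumes that a cell such as "2: P, 2" means "two keys, namely P and 2, are needed to cover that share of all titles"; this reading and the helper name `keys_covering` are assumptions, not part of the original experiment code. Only the three most common first letters and the total from the table above are used.

```python
from collections import Counter


def keys_covering(counts: Counter, share: float, total: int) -> list:
    """Return the most common keys needed to cover `share` of `total` items."""
    keys, covered = [], 0
    for key, count in counts.most_common():
        if covered >= share * total:
            break
        keys.append(key)
        covered += count
    return keys


# three most common first letters and the total number of titles from the table
first_letters = Counter({"P": 12599, "2": 3526, "I": 3515})
total = 43398
for share in (0.1, 0.2, 0.3):
    keys = keys_covering(first_letters, share, total)
    print(f"top {share:.0%}: {len(keys)}: {', '.join(keys)}")
# prints:
# top 10%: 1: P
# top 20%: 1: P
# top 30%: 2: P, 2
```

Since P alone accounts for 29.03 % of the titles, it covers the top 10 % and top 20 % cells, while the top 30 % cell needs the second key as well.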

Interpretation for Letter

P is probably the most common first letter because the word "Proceedings" starts with "P" and might be one of the most common words in the titles.

Word

First letter

# key count %
total 90 809260
1 2 96484 11.92
2 I 65214 8.06
3 C 62697 7.75
4 S 59363 7.34
5 P 50577 6.25
6 o 47073 5.82
7 A 40291 4.98
8 1 33935 4.19
9 a 26085 3.22
10 M 25474 3.15
11 T 19690 2.43
12 W 19391 2.40
13 t 18726 2.31
14 D 17672 2.18
15 E 16201 2.00
16 U 15969 1.97
17 J 15688 1.94
18 N 15309 1.89
19 - 14558 1.80
20 F 13717 1.70
21 R 13104 1.62
22 B 11044 1.36
23 L 10255 1.27
24 G 10170 1.26
25 O 9972 1.23
26 V 8218 1.02
27 H 8078 1.00
28 i 7781 0.96
29 3 5577 0.69
30 f 4769 0.59
31 ( 4738 0.59
32 K 4666 0.58
33 4 3501 0.43
34 5 3107 0.38
35 6 2910 0.36
36 8 2875 0.36
37 7 2832 0.35
38 9 2826 0.35
39 ' 2311 0.29
40 w 2124 0.26
41 d 2000 0.25
42 c 1614 0.20
43 Q 1210 0.15
44 e 975 0.12
45 & 929 0.11
46 X 802 0.10
47 u 695 0.09
48 Y 688 0.09
49 Z 686 0.08
50 0 632 0.08

Word

# key count %
total 30492 809260
1 International 26360 3.26
2 on 25486 3.15
3 and 24329 3.01
4 Proceedings 22995 2.84
5 of 21438 2.65
6 the 17733 2.19
7 Conference 14916 1.84
8 - 14527 1.80
9 USA, 9163 1.13
10 Conference, 8668 1.07
11 Workshop 7152 0.88
12 in 7106 0.88
13 September 6424 0.79
14 June 5651 0.70
15 October 4955 0.61
16 IEEE 4731 0.58
17 Symposium 4426 0.55
18 July 4349 0.54
19 Information 4170 0.52
20 November 3972 0.49
21 for 3798 0.47
22 August 3756 0.46
23 Computer 3685 0.46
24 Systems 3634 0.45
25 Papers 3411 0.42
26 Systems, 3373 0.42
27 May 3254 0.40
28 2018, 3248 0.40
29 2017, 3036 0.38
30 2019, 2983 0.37
31 2016, 2956 0.37
32 Revised 2948 0.36
33 Selected 2881 0.36
34 December 2879 0.36
35 Workshop, 2827 0.35
36 2015, 2794 0.35
37 Software 2704 0.33
38 ACM 2687 0.33
39 April 2652 0.33
40 Computing 2373 0.29
41 China, 2339 0.29
42 2014, 2321 0.29
43 Germany, 2319 0.29
44 Part 2240 0.28
45 2013, 2214 0.27
46 2011, 2207 0.27
47 2010, 2106 0.26
48 2015 2101 0.26
49 Italy, 2072 0.26
50 2009, 2056 0.25
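
Keys such as "USA,", "Conference," and "-" suggest that the titles were split on whitespace without stripping punctuation. Under that assumption, the word counts and the first letters of the words could be collected as in the following sketch; the function name `count_tokens` and the example title are made up for illustration.

```python
from collections import Counter


def count_tokens(titles):
    """Count whitespace-separated tokens and the first character of each token."""
    words, first_letters = Counter(), Counter()
    for title in titles:
        # str.split() without arguments never yields empty tokens
        for token in title.split():
            words[token] += 1
            first_letters[token[0]] += 1
    return words, first_letters


# made-up example title, not taken from dblp
titles = ["Proceedings of the Example Conference, Example Conference - Part I"]
words, first_letters = count_tokens(titles)
print(words.most_common(3))
print(first_letters.most_common(3))
```

Note that "Conference," and "Conference" end up as different keys with this approach, which matches the separate rows 7 and 10 in the table.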

Ordinal

# key count %
total 93 809260
1 (none) 781321 96.55
2 2 2337 0.29
3 3 2152 0.27
4 1 2099 0.26
5 4 1955 0.24
6 5 1865 0.23
7 6 1716 0.21
8 7 1622 0.20
9 8 1490 0.18
10 9 1451 0.18
11 10 1380 0.17
12 14 979 0.12
13 15 873 0.11
14 16 748 0.09
15 17 680 0.08
16 18 637 0.08
17 19 592 0.07
18 20 509 0.06
19 21 480 0.06
20 22 392 0.05
21 23 379 0.05
22 24 353 0.04
23 25 332 0.04
24 26 289 0.04
25 27 245 0.03
26 28 230 0.03
27 30 198 0.02
28 29 183 0.02
29 31 157 0.02
30 11 130 0.02
31 32 114 0.01
32 12 104 0.01
33 13 100 0.01
34 34 98 0.01
35 33 94 0.01
36 35 83 0.01
37 36 81 0.01
38 37 79 0.01
39 38 77 0.01
40 39 64 0.01
41 40 60 0.01
42 60 55 0.01
43 41 52 0.01
44 42 41 0.01
45 44 31 0.00
46 43 29 0.00
47 46 27 0.00
48 47 25 0.00
49 49 24 0.00
50 45 23 0.00
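
How the ordinals were detected is not shown here. A minimal sketch, assuming that a token counts as an ordinal when it consists of digits followed by "st", "nd", "rd" or "th", and that all other tokens are collected under an empty key (which would explain the first row of the table), might look like this; the names `ORDINAL`, `ordinal_of` and `count_ordinals` are hypothetical.

```python
import re
from collections import Counter

# assumed pattern: digits followed by an English ordinal suffix, e.g. "3rd" or "21st"
ORDINAL = re.compile(r"(\d+)(?:st|nd|rd|th)", re.IGNORECASE)


def ordinal_of(token):
    """Return the ordinal number of a token such as '3rd', or None."""
    match = ORDINAL.fullmatch(token)
    return int(match.group(1)) if match else None


def count_ordinals(titles):
    """Count ordinals per token; tokens without an ordinal get the empty key."""
    counter = Counter()
    for title in titles:
        for token in title.split():
            ordinal = ordinal_of(token)
            counter["" if ordinal is None else str(ordinal)] += 1
    return counter


# made-up example title, not taken from dblp
print(count_ordinals(["Proceedings of the 3rd Workshop"]))
```

Because the pattern must match the whole token, year tokens such as "2018," are not mistaken for ordinals.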