Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Working through the hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), should lead to interesting observations.

Experiment

Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence

Input: Proceedings titles of dblp conference entries.

Letter

ProceedingsTitleParser/WolfgangFahl:adds testMostCommonFirstLetter experiment (Wolfgang Fahl/2021-05-03 08:39:45 +0200)

# module-level import needed by the test
from collections import Counter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters of the dblp conference titles
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # count the first character of each conference title
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        # only conference entries are considered
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            counter[event.title[0]]+=1
            total+=1
    bins=len(counter)
    print(f"found {bins} different first letters in {total} titles")
    # list all first letters, most common first, with absolute count and share
    for c,count in counter.most_common(bins):
        print(f"{c}: {count:5} {count/total*100:4.1f} %")

read 43976 Events from dblp in   0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s

Most common first letters


#	key	count	%
total	46	43398
1	P	12599	29.03
2	2	3526	8.12
3	I	3515	8.10
4	A	3296	7.59
5	C	2333	5.38
6	S	2260	5.21
7	1	2105	4.85
8	T	1559	3.59
9	M	1312	3.02
10	E	1252	2.88
11	F	1246	2.87
12	D	1177	2.71
13	R	624	1.44
14	H	578	1.33
15	N	566	1.30
16	3	564	1.30
17	W	522	1.20
18	L	502	1.16
19	G	501	1.15
20	B	479	1.10
21	4	354	0.82
22	V	334	0.77
23	K	257	0.59
24	O	255	0.59
25	5	252	0.58
26	U	236	0.54
27	9	215	0.50
28	6	211	0.49
29	7	199	0.46
30	8	187	0.43
31	J	150	0.35
32	X	88	0.20
33	Q	76	0.18
34	e	19	0.04
35	Z	13	0.03
36	i	12	0.03
37	p	7	0.02
38	«	5	0.01
39	(	3	0.01
40	"	2	0.00
41	d	2	0.00
42	f	1	0.00
43	t	1	0.00
44	s	1	0.00
45	'	1	0.00
46	Y	1	0.00

Observation for Letter

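Relevance Matrix

Each cell gives the number of most common keys needed to cover the given share of titles, followed by those keys.

Level	top 10%	top 20%	top 30%
Letter	1: P	1: P	2: P, 2
Token
Grammatical structure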
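
The Letter row can be derived from the frequency table: for each threshold, take the most common first letters until their cumulative share reaches the threshold. The following is a minimal sketch under that assumed reading of "top N%". The function name relevance_row is hypothetical, and only the six most common counts from the table above are included.

```python
from typing import Dict, List, Tuple


def relevance_row(counts: Dict[str, int], total: int,
                  thresholds: Tuple[float, ...] = (10.0, 20.0, 30.0)) -> Dict[float, List[str]]:
    """For each threshold (in percent) return the shortest list of most
    common keys whose cumulative share of total reaches the threshold."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    row = {}
    for threshold in thresholds:
        keys, covered = [], 0
        for key, count in ranked:
            if covered / total * 100 >= threshold:
                break  # threshold already reached, no further key needed
            keys.append(key)
            covered += count
        row[threshold] = keys
    return row


# counts of the six most common first letters, taken from the table above
top_counts = {"P": 12599, "2": 3526, "I": 3515, "A": 3296, "C": 2333, "S": 2260}
row = relevance_row(top_counts, total=43398)
for threshold, keys in row.items():
    print(f"top {threshold:.0f}%: {len(keys)}: {', '.join(keys)}")
```

With these counts, "P" alone (29.03 %) covers the 10 % and 20 % columns, while the 30 % column needs "P" and "2" together (37.15 %).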

Interpretation for Letter

P is probably the most common first letter because many titles begin with the word "Proceedings", which is likely to be one of the most common words in proceedings titles.
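
One way to check this interpretation would be to count the first token of each title instead of the first letter. The sketch below assumes titles are plain strings and splits them on whitespace. The helper first_token_counts is hypothetical, and the three titles are made-up placeholders, not dblp data.

```python
from collections import Counter
from typing import Iterable


def first_token_counts(titles: Iterable[str]) -> Counter:
    """Count the first whitespace-separated token of each non-empty title."""
    counter = Counter()
    for title in titles:
        tokens = title.split()
        if tokens:  # skip empty or whitespace-only titles
            counter[tokens[0]] += 1
    return counter


# made-up placeholder titles for illustration only
sample_titles = [
    "Proceedings of the Example Workshop",
    "Proceedings of the Sample Conference",
    "2nd Example Symposium",
]
counts = first_token_counts(sample_titles)
total = sum(counts.values())
for token, count in counts.most_common():
    print(f"{token}: {count:5} {count / total * 100:4.1f} %")
```

Run against the real titles, a share for "Proceedings" close to the 29 % observed for "P" would support the interpretation.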

Word

Workdocumentation 2021-05-03
