Workdocumentation 2021-05-03

Wolfgang Fahl

Introduction

The Relevance Matrix approach has been discussed at https://rq.bitplan.com/index.php/Hackathon_2021-04-27#Relevance_Matrix (Hackathon 2021-04-27). It might be feasible to derive a systematic analysis, design and solution approach from this idea.

By walking the path from the 1st decile down the dependency tree, an observation is made at each cell:

- How many items fall into this cell?
- What is the category of this cell, if it is not known in advance? E.g. in the Country/Region/City hierarchy we assumed that the categories were known in advance. For the Proceedings Title Parsing Problem we'll try the approach out as if we didn't know the categories yet and start with general parsing categories only.

If we find an element in a cell we'll then categorize it.

Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Following a hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), leads to interesting observations.

Experiment

Hierarchy of:

- Letter
- Token
- Grammatical structure
- Sentence

Input: Proceedings titles of dblp conference entries.

Letter

ProceedingsTitleParser/WolfgangFahl:adds testMostCommonFirstLetter experiment (Wolfgang Fahl/2021-05-03 08:39:45 +0200)

from collections import Counter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect the first letters (as code points) of all conference titles
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            first=ord(event.title[0])
            counter[first]+=1
            total+=1
    bins=len(counter.keys())
    print(f"found {bins} different first letters in {total} titles")
    # print the letters in descending order of frequency
    for o,count in counter.most_common(bins):
        c=chr(o)
        print(f"{c}: {count:5} {count/total*100:4.1f} %")

read 43976 Events from dblp in   0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s

Most common first letters


#	key	count	%
total	46	43398
1	P	12599	29.03
2	2	3526	8.12
3	I	3515	8.10
4	A	3296	7.59
5	C	2333	5.38
6	S	2260	5.21
7	1	2105	4.85
8	T	1559	3.59
9	M	1312	3.02
10	E	1252	2.88
11	F	1246	2.87
12	D	1177	2.71
13	R	624	1.44
14	H	578	1.33
15	N	566	1.30
16	3	564	1.30
17	W	522	1.20
18	L	502	1.16
19	G	501	1.15
20	B	479	1.10
21	4	354	0.82
22	V	334	0.77
23	K	257	0.59
24	O	255	0.59
25	5	252	0.58
26	U	236	0.54
27	9	215	0.50
28	6	211	0.49
29	7	199	0.46
30	8	187	0.43
31	J	150	0.35
32	X	88	0.20
33	Q	76	0.18
34	e	19	0.04
35	Z	13	0.03
36	i	12	0.03
37	p	7	0.02
38	«	5	0.01
39	(	3	0.01
40	"	2	0.00
41	d	2	0.00
42	f	1	0.00
43	t	1	0.00
44	s	1	0.00
45	'	1	0.00
46	Y	1	0.00

Observation for Letter

Top categories: Letter and Digit.
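The observation can be reproduced by mapping each first character to a coarse category and summing the counts per category. The following is a minimal sketch; the function names char_category and category_counts and the category "Other" are assumptions made for illustration, and the sample holds only a few rows copied from the first-letter table.

```python
from collections import Counter


def char_category(ch: str) -> str:
    """Map a single character to a coarse category."""
    if ch.isalpha():
        return "Letter"
    if ch.isdigit():
        return "Digit"
    return "Other"


def category_counts(first_letter_counts: dict) -> Counter:
    """Aggregate per-character counts into per-category counts."""
    categories = Counter()
    for ch, count in first_letter_counts.items():
        categories[char_category(ch)] += count
    return categories


# a few rows copied from the first-letter table (not the full data set)
sample = {"P": 12599, "2": 3526, "I": 3515, "1": 2105, "«": 5, "(": 3}

if __name__ == "__main__":
    for category, count in category_counts(sample).most_common():
        print(f"{category}: {count}")
```

Run on the full counter, the same aggregation would show how much of the distribution each category covers.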

Relevance Matrix

	top 10%	top 20%	top 30%
Letter	1: P	1: P	2: P, 2
Token
Grammatical structure
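One possible reading of the Letter row is: for each share (10%, 20%, 30%), list the most common keys needed until their cumulative count reaches that share of all items. The sketch below implements this reading; the function name keys_to_cover and its optional total parameter are assumptions, and the sample contains only the top four rows of the first-letter table together with the table's total of 43398.

```python
from collections import Counter


def keys_to_cover(counter: Counter, share: float, total: int = None) -> list:
    """Return the most common keys whose cumulative count reaches
    at least `share` (between 0 and 1) of `total`."""
    if total is None:
        total = sum(counter.values())
    covered = 0
    keys = []
    for key, count in counter.most_common():
        if covered >= share * total:
            break
        keys.append(key)
        covered += count
    return keys


# top rows of the first-letter table; the total is taken from the table
top_rows = Counter({"P": 12599, "2": 3526, "I": 3515, "A": 3296})
TOTAL = 43398

if __name__ == "__main__":
    for share in (0.1, 0.2, 0.3):
        keys = keys_to_cover(top_rows, share, TOTAL)
        print(f"top {share:.0%}: {len(keys)}: {', '.join(keys)}")
```

Under this reading the three cells come out as one key (P), one key (P) and two keys (P, 2), which agrees with the Letter row.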

Interpretation for Letter

P may be the most common first letter because the word "Proceedings" starts with "P" and might be one of the most common words.
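This interpretation could be checked by counting the first word of each title instead of the first letter. The following is a minimal sketch; the function name first_word_counts is hypothetical and the titles are made-up examples, not dblp data.

```python
from collections import Counter


def first_word_counts(titles) -> Counter:
    """Count the first whitespace-separated word of each title."""
    counter = Counter()
    for title in titles:
        words = title.split()
        if words:  # skip empty titles
            counter[words[0]] += 1
    return counter


# made-up example titles for illustration only
example_titles = [
    "Proceedings of the 1st Workshop on Example Topics",
    "2nd International Conference on Sample Systems",
    "Proceedings of the Example Symposium 2020",
]

if __name__ == "__main__":
    for word, count in first_word_counts(example_titles).most_common():
        print(f"{word}: {count}")
```

If "Proceedings" accounted for most of the titles starting with "P" in the real data, the interpretation would be supported.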

Word

first Letter


#	key	count	%
total	90	809260
1	2	96484	11.92
2	I	65214	8.06
3	C	62697	7.75
4	S	59363	7.34
5	P	50577	6.25
6	o	47073	5.82
7	A	40291	4.98
8	1	33935	4.19
9	a	26085	3.22
10	M	25474	3.15
11	T	19690	2.43
12	W	19391	2.40
13	t	18726	2.31
14	D	17672	2.18
15	E	16201	2.00
16	U	15969	1.97
17	J	15688	1.94
18	N	15309	1.89
19	-	14558	1.80
20	F	13717	1.70
21	R	13104	1.62
22	B	11044	1.36
23	L	10255	1.27
24	G	10170	1.26
25	O	9972	1.23
26	V	8218	1.02
27	H	8078	1.00
28	i	7781	0.96
29	3	5577	0.69
30	f	4769	0.59
31	(	4738	0.59
32	K	4666	0.58
33	4	3501	0.43
34	5	3107	0.38
35	6	2910	0.36
36	8	2875	0.36
37	7	2832	0.35
38	9	2826	0.35
39	'	2311	0.29
40	w	2124	0.26
41	d	2000	0.25
42	c	1614	0.20
43	Q	1210	0.15
44	e	975	0.12
45	&	929	0.11
46	X	802	0.10
47	u	695	0.09
48	Y	688	0.09
49	Z	686	0.08
50	0	632	0.08
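A table of this kind can be produced by splitting every title into tokens and counting the first character of each token. The sketch below assumes plain whitespace splitting, which the source does not state explicitly; the function name token_first_letters is hypothetical and the titles are made-up examples.

```python
from collections import Counter


def token_first_letters(titles) -> Counter:
    """Count the first character of every whitespace-separated token."""
    counter = Counter()
    for title in titles:
        for token in title.split():
            counter[token[0]] += 1
    return counter


# made-up example titles for illustration only
example_titles = [
    "Proceedings of the 2nd International Workshop",
    "2nd Workshop - Revised Papers",
]

if __name__ == "__main__":
    counts = token_first_letters(example_titles)
    total = sum(counts.values())
    for rank, (key, count) in enumerate(counts.most_common(), start=1):
        print(f"{rank}\t{key}\t{count}\t{count / total * 100:.2f}")
```

With whitespace splitting, a standalone "-" or a token starting with "(" is counted under its punctuation character, which would explain keys such as "-" and "(" in the table.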

word


#	key	count	%
total	30492	809260
1	International	26360	3.26
2	on	25486	3.15
3	and	24329	3.01
4	Proceedings	22995	2.84
5	of	21438	2.65
6	the	17733	2.19
7	Conference	14916	1.84
8	-	14527	1.80
9	USA,	9163	1.13
10	Conference,	8668	1.07
11	Workshop	7152	0.88
12	in	7106	0.88
13	September	6424	0.79
14	June	5651	0.70
15	October	4955	0.61
16	IEEE	4731	0.58
17	Symposium	4426	0.55
18	July	4349	0.54
19	Information	4170	0.52
20	November	3972	0.49
21	for	3798	0.47
22	August	3756	0.46
23	Computer	3685	0.46
24	Systems	3634	0.45
25	Papers	3411	0.42
26	Systems,	3373	0.42
27	May	3254	0.40
28	2018,	3248	0.40
29	2017,	3036	0.38
30	2019,	2983	0.37
31	2016,	2956	0.37
32	Revised	2948	0.36
33	Selected	2881	0.36
34	December	2879	0.36
35	Workshop,	2827	0.35
36	2015,	2794	0.35
37	Software	2704	0.33
38	ACM	2687	0.33
39	April	2652	0.33
40	Computing	2373	0.29
41	China,	2339	0.29
42	2014,	2321	0.29
43	Germany,	2319	0.29
44	Part	2240	0.28
45	2013,	2214	0.27
46	2011,	2207	0.27
47	2010,	2106	0.26
48	2015	2101	0.26
49	Italy,	2072	0.26
50	2009,	2056	0.25
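Several keys in this table differ only by a trailing comma, e.g. "Conference" and "Conference,". As a possible follow-up, such variants could be merged by stripping surrounding punctuation before counting. The sketch below is not part of the original experiment; the function name merge_punctuated is hypothetical and the sample holds a few rows copied from the table.

```python
import string
from collections import Counter


def merge_punctuated(word_counts: dict) -> Counter:
    """Merge keys that differ only by leading or trailing punctuation.
    Tokens consisting solely of punctuation (e.g. "-") are kept as they are."""
    merged = Counter()
    for word, count in word_counts.items():
        key = word.strip(string.punctuation) or word
        merged[key] += count
    return merged


# a few rows copied from the word table (not the full data set)
sample = {
    "Conference": 14916,
    "Conference,": 8668,
    "Workshop": 7152,
    "Workshop,": 2827,
    "-": 14527,
}

if __name__ == "__main__":
    for word, count in merge_punctuated(sample).most_common():
        print(f"{word}\t{count}")
```

Merging changes the ranking of the affected words, so it should be a deliberate choice: the trailing comma may itself carry information for later parsing steps.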

Ordinal


#	key	count	%
total	93	809260
1	(empty)	781321	96.55
2	2	2337	0.29
3	3	2152	0.27
4	1	2099	0.26
5	4	1955	0.24
6	5	1865	0.23
7	6	1716	0.21
8	7	1622	0.20
9	8	1490	0.18
10	9	1451	0.18
11	10	1380	0.17
12	14	979	0.12
13	15	873	0.11
14	16	748	0.09
15	17	680	0.08
16	18	637	0.08
17	19	592	0.07
18	20	509	0.06
19	21	480	0.06
20	22	392	0.05
21	23	379	0.05
22	24	353	0.04
23	25	332	0.04
24	26	289	0.04
25	27	245	0.03
26	28	230	0.03
27	30	198	0.02
28	29	183	0.02
29	31	157	0.02
30	11	130	0.02
31	32	114	0.01
32	12	104	0.01
33	13	100	0.01
34	34	98	0.01
35	33	94	0.01
36	35	83	0.01
37	36	81	0.01
38	37	79	0.01
39	38	77	0.01
40	39	64	0.01
41	40	60	0.01
42	60	55	0.01
43	41	52	0.01
44	42	41	0.01
45	44	31	0.00
46	43	29	0.00
47	46	27	0.00
48	47	25	0.00
49	49	24	0.00
50	45	23	0.00
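The source does not state how the ordinal of a token was determined. One way to arrive at a table of this shape is to match tokens such as "2nd" or "14th" with a regular expression and to count all other tokens under an empty key. The sketch below follows that assumption; ORDINAL_PATTERN, ordinal_of and ordinal_counts are hypothetical names and the titles are made-up examples.

```python
import re
from collections import Counter

# assumption: ordinals are written as digits plus an English suffix
ORDINAL_PATTERN = re.compile(r"^(\d+)(st|nd|rd|th)\b", re.IGNORECASE)


def ordinal_of(token: str):
    """Return the ordinal value of a token such as '14th', or None."""
    match = ORDINAL_PATTERN.match(token)
    return int(match.group(1)) if match else None


def ordinal_counts(titles) -> Counter:
    """Count ordinal values over all tokens; None stands for 'no ordinal'."""
    counter = Counter()
    for title in titles:
        for token in title.split():
            counter[ordinal_of(token)] += 1
    return counter


# made-up example titles for illustration only
example_titles = [
    "Proceedings of the 2nd Workshop",
    "14th International Conference, 2015",
]

if __name__ == "__main__":
    for key, count in ordinal_counts(example_titles).most_common():
        print(f"{'' if key is None else key}\t{count}")
```

A plain year such as "2015" is not treated as an ordinal here because it lacks a suffix; spelled-out ordinals such as "Second" would need an additional lookup table.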
