Difference between revisions of "Workdocumentation 2021-05-03"

From BITPlan cr Wiki

Revision as of 09:59, 3 May 2021

Introduction

The Relevance Matrix approach has been discussed in https://rq.bitplan.com/index.php/Hackathon_2021-04-27#Relevance_Matrix. It might be feasible to create a systematic analysis approach/design and solution approach based on this idea.

By walking the path from the 1st decile down the dependency tree, an observation is made at each cell:

  1. How many items fall in this cell?
  2. What is the category of this cell, if it is not known in advance? E.g. in the Country/Region/City hierarchy we assumed that the categories were known in advance. For the Proceedings Title Parsing Problem we'll try the approach as if we didn't know the categories yet and start with general parsing categories only.

If we find an element in a cell we'll then categorize it.

Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Following a hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), leads to interesting observations.

Experiment

Hierarchy of:

  - Letter
  - Token
  - Grammatical structure
  - Sentence

Input: Proceedings titles of dblp conference entries.

Letter

ProceedingsTitleParser/WolfgangFahl:adds testMostCommonFirstLetter experiment (Wolfgang Fahl/2021-05-03 08:39:45 +0200)

from collections import Counter  # module-level import needed by the test

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters of the dblp conference titles
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # count the first character of each title
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        # only consider conference entries
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            first=ord(event.title[0])
            counter[first]+=1
            total+=1
    bins=len(counter.keys())
    print(f"found {bins} different first letters in {total} titles")
    # print the characters ordered by descending frequency
    for o,count in counter.most_common(bins):
        c=chr(o)
        print(f"{c}: {count:5} {count/total*100:4.1f} %")

read 43976 Events from dblp in   0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s

Most common first letters

# key count %
total 46 43398
1 P 12599 29.03
2 2 3526 8.12
3 I 3515 8.10
4 A 3296 7.59
5 C 2333 5.38
6 S 2260 5.21
7 1 2105 4.85
8 T 1559 3.59
9 M 1312 3.02
10 E 1252 2.88
11 F 1246 2.87
12 D 1177 2.71
13 R 624 1.44
14 H 578 1.33
15 N 566 1.30
16 3 564 1.30
17 W 522 1.20
18 L 502 1.16
19 G 501 1.15
20 B 479 1.10
21 4 354 0.82
22 V 334 0.77
23 K 257 0.59
24 O 255 0.59
25 5 252 0.58
26 U 236 0.54
27 9 215 0.50
28 6 211 0.49
29 7 199 0.46
30 8 187 0.43
31 J 150 0.35
32 X 88 0.20
33 Q 76 0.18
34 e 19 0.04
35 Z 13 0.03
36 i 12 0.03
37 p 7 0.02
38 « 5 0.01
39 ( 3 0.01
40 " 2 0.00
41 d 2 0.00
42 f 1 0.00
43 t 1 0.00
44 s 1 0.00
45 ' 1 0.00
46 Y 1 0.00

Observation for Letter

Top categories: Letter and Digit.
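
The observation can be checked by grouping the first-character counts from the table above into general categories. The following is a minimal sketch and not part of the ProceedingsTitleParser code: the function names `categorize` and `category_counts` and the category labels "Letter", "Digit" and "Other" are assumptions made for illustration; the counts are copied from the table above.

```python
from collections import Counter

# first-character counts copied from the "Most common first letters" table
FIRST_CHAR_COUNTS = {
    "P": 12599, "2": 3526, "I": 3515, "A": 3296, "C": 2333, "S": 2260,
    "1": 2105, "T": 1559, "M": 1312, "E": 1252, "F": 1246, "D": 1177,
    "R": 624, "H": 578, "N": 566, "3": 564, "W": 522, "L": 502,
    "G": 501, "B": 479, "4": 354, "V": 334, "K": 257, "O": 255,
    "5": 252, "U": 236, "9": 215, "6": 211, "7": 199, "8": 187,
    "J": 150, "X": 88, "Q": 76, "e": 19, "Z": 13, "i": 12,
    "p": 7, "«": 5, "(": 3, '"': 2, "d": 2, "f": 1,
    "t": 1, "s": 1, "'": 1, "Y": 1,
}


def categorize(ch: str) -> str:
    """Map a character to a general parsing category (labels are assumptions)."""
    if ch.isalpha():
        return "Letter"
    if ch.isdigit():
        return "Digit"
    return "Other"


def category_counts(char_counts: dict) -> Counter:
    """Sum the per-character counts for each category."""
    totals = Counter()
    for ch, count in char_counts.items():
        totals[categorize(ch)] += count
    return totals


if __name__ == "__main__":
    totals = category_counts(FIRST_CHAR_COUNTS)
    total = sum(totals.values())
    for category, count in totals.most_common():
        print(f"{category}: {count:5} {count / total * 100:4.1f} %")
```

Run against these counts, the sketch reports 35774 letters (82.4 %), 7613 digits (17.5 %) and 11 other characters, which is consistent with Letter and Digit being the top categories.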

Relevance Matrix

                      | top 10% | top 20% | top 30%
Letter                | 1: P    | 1: P    | 2: P, 2
Token                 |         |         |
Grammatical structure |         |         |
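
The Letter row of the matrix can be reproduced from the counts. The sketch below assumes that a "top N%" cell lists the smallest set of most frequent keys that together cover at least N% of all items; this reading, and the helper name `keys_for_coverage`, are assumptions for illustration. The counts and the total are copied from the table "Most common first letters".

```python
def keys_for_coverage(ranked_counts, total, share):
    """Return the top-ranked keys needed to cover at least `share` of `total`.

    `ranked_counts` is a list of (key, count) pairs sorted by descending count.
    If the list is exhausted before the share is reached, all keys are returned.
    """
    covered = 0
    keys = []
    for key, count in ranked_counts:
        if covered >= share * total:
            break
        keys.append(key)
        covered += count
    return keys


# top entries and total copied from the table "Most common first letters"
TOP_FIRST_LETTERS = [("P", 12599), ("2", 3526), ("I", 3515), ("A", 3296)]
TOTAL_TITLES = 43398

if __name__ == "__main__":
    for share in (0.1, 0.2, 0.3):
        keys = keys_for_coverage(TOP_FIRST_LETTERS, TOTAL_TITLES, share)
        print(f"top {share:.0%}: {len(keys)}: {', '.join(keys)}")
```

Since P alone accounts for 29.03 % of the titles, it fills the top 10% and top 20% cells on its own; the top 30% cell needs P and 2 together (37.15 %).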

Interpretation for Letter

A likely explanation for P being the most common first letter is that the word "Proceedings" starts with "P" and may be one of the most common first words of a title.
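
One way to check this interpretation would be to count the first word of each title in the same way as the first letter. This is only a sketch under assumptions: the function `most_common_first_words` is hypothetical, and the sample titles are made up for illustration and are not taken from dblp.

```python
from collections import Counter


def most_common_first_words(titles, n=3):
    """Count the first whitespace-separated word of each non-empty title."""
    counter = Counter()
    for title in titles:
        words = title.split()
        if words:  # skip empty or whitespace-only titles
            counter[words[0]] += 1
    return counter.most_common(n)


# made-up titles for illustration only, not taken from dblp
SAMPLE_TITLES = [
    "Proceedings of the Example Workshop",
    "Proceedings of the 2nd Example Conference",
    "2nd Example Symposium",
    "",
]

if __name__ == "__main__":
    for word, count in most_common_first_words(SAMPLE_TITLES):
        print(f"{word}: {count}")
```

Applied to the real titles, the share of "Proceedings" among the first words could then be compared with the 29.03 % share of the letter P.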

Word