Workdocumentation 2021-05-03
Introduction
The Relevance Matrix approach has been discussed in https://rq.bitplan.com/index.php/Hackathon_2021-04-27#Relevance_Matrix. It might be feasible to create a systematic analysis approach/design and solution approach based on this idea.
Walking the path from the 1st decile down the dependency tree, an observation is made at each cell:
- how many items fall in this cell?
- what is the category of this cell, if it is not known in advance? E.g. in the Country/Region/City hierarchy we assumed that the categories were known in advance. For the Proceedings Title Parsing Problem we'll try the approach as if we did not know the categories yet and start with general parsing categories only.
If we find an element in a cell we'll then categorize it.
Question
What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?
Assumption
Following a hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path column first (depth first) leads to interesting observations.
Experiment
Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence
Input: Proceedings titles of dblp conference entries.
Letter
# needed at module level of the test file
from collections import Counter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect first letters (only for conference entries)
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            # count by code point of the first character of the title
            first=ord(event.title[0])
            counter[first]+=1
            total+=1
    bins=len(counter.keys())
    print(f"found {bins} different first letters in {total} titles")
    # print all first letters, most common first, with count and percentage
    for o,count in counter.most_common(bins):
        c=chr(o)
        print (f"{c}: {count:5} {count/total*100:4.1f} %")
read 43976 Events from dblp in 0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2: 3526 8.1 %
I: 3515 8.1 %
A: 3296 7.6 %
C: 2333 5.4 %
S: 2260 5.2 %
1: 2105 4.9 %
T: 1559 3.6 %
M: 1312 3.0 %
E: 1252 2.9 %
F: 1246 2.9 %
D: 1177 2.7 %
R: 624 1.4 %
H: 578 1.3 %
N: 566 1.3 %
3: 564 1.3 %
W: 522 1.2 %
L: 502 1.2 %
G: 501 1.2 %
B: 479 1.1 %
4: 354 0.8 %
V: 334 0.8 %
K: 257 0.6 %
O: 255 0.6 %
5: 252 0.6 %
U: 236 0.5 %
9: 215 0.5 %
6: 211 0.5 %
7: 199 0.5 %
8: 187 0.4 %
J: 150 0.3 %
X: 88 0.2 %
Q: 76 0.2 %
e: 19 0.0 %
Z: 13 0.0 %
i: 12 0.0 %
p: 7 0.0 %
«: 5 0.0 %
(: 3 0.0 %
": 2 0.0 %
d: 2 0.0 %
f: 1 0.0 %
t: 1 0.0 %
s: 1 0.0 %
': 1 0.0 %
Y: 1 0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s
Most common first letters
# | key | count | % |
---|---|---|---|
total | 46 | 43398 | |
1 | P | 12599 | 29.03 |
2 | 2 | 3526 | 8.12 |
3 | I | 3515 | 8.10 |
4 | A | 3296 | 7.59 |
5 | C | 2333 | 5.38 |
6 | S | 2260 | 5.21 |
7 | 1 | 2105 | 4.85 |
8 | T | 1559 | 3.59 |
9 | M | 1312 | 3.02 |
10 | E | 1252 | 2.88 |
11 | F | 1246 | 2.87 |
12 | D | 1177 | 2.71 |
13 | R | 624 | 1.44 |
14 | H | 578 | 1.33 |
15 | N | 566 | 1.30 |
16 | 3 | 564 | 1.30 |
17 | W | 522 | 1.20 |
18 | L | 502 | 1.16 |
19 | G | 501 | 1.15 |
20 | B | 479 | 1.10 |
21 | 4 | 354 | 0.82 |
22 | V | 334 | 0.77 |
23 | K | 257 | 0.59 |
24 | O | 255 | 0.59 |
25 | 5 | 252 | 0.58 |
26 | U | 236 | 0.54 |
27 | 9 | 215 | 0.50 |
28 | 6 | 211 | 0.49 |
29 | 7 | 199 | 0.46 |
30 | 8 | 187 | 0.43 |
31 | J | 150 | 0.35 |
32 | X | 88 | 0.20 |
33 | Q | 76 | 0.18 |
34 | e | 19 | 0.04 |
35 | Z | 13 | 0.03 |
36 | i | 12 | 0.03 |
37 | p | 7 | 0.02 |
38 | « | 5 | 0.01 |
39 | ( | 3 | 0.01 |
40 | " | 2 | 0.00 |
41 | d | 2 | 0.00 |
42 | f | 1 | 0.00 |
43 | t | 1 | 0.00 |
44 | s | 1 | 0.00 |
45 | ' | 1 | 0.00 |
46 | Y | 1 | 0.00 |
Observation for Letter
Top categories: Letter and Digit.
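The split into Letter and Digit can be sketched as a small aggregation over the first-character counts. This is a minimal sketch, not part of the original test: the function names `char_category` and `category_counts` and the extra category "Other" are assumptions made for illustration, and the sample holds only a few rows copied from the table above.

```python
from collections import Counter

def char_category(ch: str) -> str:
    """Map a single character to a coarse category (assumed category names)."""
    if ch.isalpha():
        return "Letter"
    if ch.isdigit():
        return "Digit"
    return "Other"

def category_counts(first_char_counts: dict) -> Counter:
    """Aggregate per-character counts into per-category counts."""
    totals = Counter()
    for ch, count in first_char_counts.items():
        totals[char_category(ch)] += count
    return totals

# a few rows copied from the table above, not the full data set
sample = {"P": 12599, "2": 3526, "I": 3515, "1": 2105, "«": 5, "(": 3}
for category, count in category_counts(sample).most_common():
    print(f"{category}: {count}")
```

Characters such as « or ( fall into neither top category, which is why the sketch keeps a separate "Other" bucket instead of forcing them into Letter or Digit.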
Relevance Matrix
Level | top 10% | top 20% | top 30% |
---|---|---|---|
Letter | 1: P | 1: P | 2: P, 2 |
Token | | | |
Grammar structure | | | |
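One way to read the Letter row is: for each threshold, take the most common keys until their cumulative share reaches the threshold. The following is a minimal sketch under that assumption; the function name `keys_to_cover` is hypothetical, and only the four most common rows of the table plus the overall total of 43398 titles are used.

```python
def keys_to_cover(ranked, total, percent):
    """Return the top-ranked keys needed to cover at least `percent` % of `total`.

    ranked: list of (key, count) pairs sorted by descending count.
    """
    covered = 0
    keys = []
    for key, count in ranked:
        # integer comparison avoids floating point rounding at the threshold
        if covered * 100 >= percent * total:
            break
        keys.append(key)
        covered += count
    return keys

# four most common first letters and the total from the table above
ranked = [("P", 12599), ("2", 3526), ("I", 3515), ("A", 3296)]
total = 43398

for percent in (10, 20, 30):
    keys = keys_to_cover(ranked, total, percent)
    print(f"top {percent}%: {len(keys)}: {', '.join(keys)}")
```

With P alone at 29.03 %, the 10 % and 20 % cells need one key, while the 30 % cell needs the second key as well.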
Interpretation for Letter
P is probably the most common first letter because the word "Proceedings" starts with "P" and might be one of the most common first words of the titles.
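This interpretation could be checked at the next level of the hierarchy by counting the first token of each title instead of the first letter. The sketch below assumes plain whitespace tokenization; the function name `first_token_counts` and the sample titles are hypothetical and are not taken from dblp.

```python
from collections import Counter

def first_token_counts(titles):
    """Count the first whitespace-separated token of each non-empty title."""
    counter = Counter()
    for title in titles:
        tokens = title.split()
        if tokens:  # skip empty or whitespace-only titles
            counter[tokens[0]] += 1
    return counter

# hypothetical sample titles for illustration only
titles = [
    "Proceedings of the Example Workshop on Parsing",
    "Proceedings of the 2nd Example Conference",
    "2nd International Example Symposium",
    "",
]

for token, count in first_token_counts(titles).most_common():
    print(f"{token}: {count}")
```

Applied to the real conference titles, the same counting would show whether "Proceedings" accounts for most of the 12599 titles starting with "P".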