Difference between revisions of "Workdocumentation 2021-05-03"
Revision as of 08:50, 3 May 2021
Question
What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?
Assumption
Following a hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), leads to interesting observations.
Experiment
Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence
Input: Proceedings titles of dblp conference entries.
Letter
from collections import Counter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect first letters of the conference proceedings titles
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            # key by the code point of the title's first character
            first=ord(event.title[0])
            counter[first]+=1
            total+=1
    bins=len(counter.keys())
    print(f"found {bins} different first letters in {total} titles")
    # print the characters in descending order of frequency
    for o,count in counter.most_common(bins):
        c=chr(o)
        print (f"{c}: {count:5} {count/total*100:4.1f} %")
read 43976 Events from dblp in 0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2: 3526 8.1 %
I: 3515 8.1 %
A: 3296 7.6 %
C: 2333 5.4 %
S: 2260 5.2 %
1: 2105 4.9 %
T: 1559 3.6 %
M: 1312 3.0 %
E: 1252 2.9 %
F: 1246 2.9 %
D: 1177 2.7 %
R: 624 1.4 %
H: 578 1.3 %
N: 566 1.3 %
3: 564 1.3 %
W: 522 1.2 %
L: 502 1.2 %
G: 501 1.2 %
B: 479 1.1 %
4: 354 0.8 %
V: 334 0.8 %
K: 257 0.6 %
O: 255 0.6 %
5: 252 0.6 %
U: 236 0.5 %
9: 215 0.5 %
6: 211 0.5 %
7: 199 0.5 %
8: 187 0.4 %
J: 150 0.3 %
X: 88 0.2 %
Q: 76 0.2 %
e: 19 0.0 %
Z: 13 0.0 %
i: 12 0.0 %
p: 7 0.0 %
«: 5 0.0 %
(: 3 0.0 %
": 2 0.0 %
d: 2 0.0 %
f: 1 0.0 %
t: 1 0.0 %
s: 1 0.0 %
': 1 0.0 %
Y: 1 0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s
Most common first letters
# | key | count | % |
---|---|---|---|
total | 46 | 43398 | |
1 | P | 12599 | 29.03 |
2 | 2 | 3526 | 8.12 |
3 | I | 3515 | 8.10 |
4 | A | 3296 | 7.59 |
5 | C | 2333 | 5.38 |
6 | S | 2260 | 5.21 |
7 | 1 | 2105 | 4.85 |
8 | T | 1559 | 3.59 |
9 | M | 1312 | 3.02 |
10 | E | 1252 | 2.88 |
11 | F | 1246 | 2.87 |
12 | D | 1177 | 2.71 |
13 | R | 624 | 1.44 |
14 | H | 578 | 1.33 |
15 | N | 566 | 1.30 |
16 | 3 | 564 | 1.30 |
17 | W | 522 | 1.20 |
18 | L | 502 | 1.16 |
19 | G | 501 | 1.15 |
20 | B | 479 | 1.10 |
21 | 4 | 354 | 0.82 |
22 | V | 334 | 0.77 |
23 | K | 257 | 0.59 |
24 | O | 255 | 0.59 |
25 | 5 | 252 | 0.58 |
26 | U | 236 | 0.54 |
27 | 9 | 215 | 0.50 |
28 | 6 | 211 | 0.49 |
29 | 7 | 199 | 0.46 |
30 | 8 | 187 | 0.43 |
31 | J | 150 | 0.35 |
32 | X | 88 | 0.20 |
33 | Q | 76 | 0.18 |
34 | e | 19 | 0.04 |
35 | Z | 13 | 0.03 |
36 | i | 12 | 0.03 |
37 | p | 7 | 0.02 |
38 | « | 5 | 0.01 |
39 | ( | 3 | 0.01 |
40 | " | 2 | 0.00 |
41 | d | 2 | 0.00 |
42 | f | 1 | 0.00 |
43 | t | 1 | 0.00 |
44 | s | 1 | 0.00 |
45 | ' | 1 | 0.00 |
46 | Y | 1 | 0.00 |
Observation for Letter
Relevance Matrix
Level | top 10% | top 20% | top 30% |
---|---|---|---|
Letter | 1: P | 1: P | 2: P, 2 |
Token | | | |
Grammatical structure | | | |
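The Letter row can be reproduced from the counts in the first-letter table. The following is a minimal sketch, assuming that "top N%" means the smallest number of most common keys whose cumulative share reaches N% of all titles. That reading is an assumption, as is the helper name `keys_needed`; the counts are copied from the table.

```python
# Counts of the three most common first characters, copied from the table.
top_counts = [("P", 12599), ("2", 3526), ("I", 3515)]
total = 43398  # number of titles examined


def keys_needed(counts, total, threshold_percent):
    """Return the most common keys needed to reach threshold_percent of total.

    Hypothetical helper: assumes counts is sorted by descending frequency.
    """
    running = 0
    needed = []
    for key, count in counts:
        needed.append(key)
        running += count
        if running / total * 100 >= threshold_percent:
            return needed
    raise ValueError("threshold not reached with the given counts")


for threshold in (10, 20, 30):
    keys = keys_needed(top_counts, total, threshold)
    print(f"top {threshold}%: {len(keys)}: {', '.join(keys)}")
```

P alone covers 29.03% of the titles, which is enough for the 10% and 20% columns but just short of 30%, so the second key "2" is needed there.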
Interpretation for Letter
P is probably the most common first letter because many titles start with the word "Proceedings", which might be one of the most common words in the titles.
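One way to check this interpretation is to count the first token of each title, which is also the next level of the hierarchy. The following is only a sketch: the function name `most_common_first_tokens` and the sample titles are hypothetical and are not dblp data; the real experiment would pass the titles of the conference events instead.

```python
from collections import Counter


def most_common_first_tokens(titles, limit=3):
    """Count the first whitespace-separated token of each non-empty title."""
    counter = Counter()
    for title in titles:
        tokens = title.split()
        if tokens:  # skip empty or whitespace-only titles
            counter[tokens[0]] += 1
    return counter.most_common(limit)


# Hypothetical sample titles for illustration only, not taken from dblp.
sample_titles = [
    "Proceedings of the Example Workshop",
    "Proceedings of the Sample Conference",
    "2nd International Example Symposium",
]
print(most_common_first_tokens(sample_titles))
```

If "Proceedings" dominates the first tokens of the real titles in the same way that P dominates the first letters, the interpretation holds.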