Difference between revisions of "Workdocumentation 2021-05-03"
Revision as of 07:34, 3 May 2021
Question
What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?
Assumption
Following the hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path, column first (depth first), leads to interesting observations.
Experiment
Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence
Input: Proceedings titles of dblp conference entries.
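To make the two lowest levels of the hierarchy concrete, here is a minimal sketch of how a single title could be viewed at the letter and token level. The function name `hierarchy_view`, the whitespace tokenization and the sample title are assumptions made for illustration; they are not part of the dblp experiment code, and the grammatical structure and sentence levels are deliberately left out.

```python
def hierarchy_view(title):
    """Return a letter-level and token-level view of one title.

    Hypothetical helper: whitespace splitting is an assumed,
    very simple tokenization.
    """
    tokens = title.split()
    first_letter = title[0] if title else None
    return {"letter": first_letter, "token": tokens}


# hypothetical sample title, not taken from dblp
view = hierarchy_view("Proceedings of the 2nd Example Workshop")
print(view["letter"])
print(view["token"])
```

Walking column first (depth first) would then mean finishing the analysis of the letter level across all titles before moving on to the token level.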
Letter
# requires at module level: from collections import Counter
def testMostCommonFirstLetter(self):
    '''
    get the most common first letters
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect first letters
    counter=Counter()
    total=0
    for eventId in dblp.em.events:
        # only conference entries are considered
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            # count by code point of the first title character
            first=ord(event.title[0])
            counter[first]+=1
            total+=1
    bins=len(counter.keys())
    print(f"found {bins} different first letters in {total} titles")
    # list all first letters, most frequent first, with their share in percent
    for o,count in counter.most_common(bins):
        c=chr(o)
        print(f"{c}: {count:5} {count/total*100:4.1f} %")
read 43976 Events from dblp in 0.2 s
found 46 different first letters in 43398 titles
P: 12599 29.0 %
2:  3526  8.1 %
I:  3515  8.1 %
A:  3296  7.6 %
C:  2333  5.4 %
S:  2260  5.2 %
1:  2105  4.9 %
T:  1559  3.6 %
M:  1312  3.0 %
E:  1252  2.9 %
F:  1246  2.9 %
D:  1177  2.7 %
R:   624  1.4 %
H:   578  1.3 %
N:   566  1.3 %
3:   564  1.3 %
W:   522  1.2 %
L:   502  1.2 %
G:   501  1.2 %
B:   479  1.1 %
4:   354  0.8 %
V:   334  0.8 %
K:   257  0.6 %
O:   255  0.6 %
5:   252  0.6 %
U:   236  0.5 %
9:   215  0.5 %
6:   211  0.5 %
7:   199  0.5 %
8:   187  0.4 %
J:   150  0.3 %
X:    88  0.2 %
Q:    76  0.2 %
e:    19  0.0 %
Z:    13  0.0 %
i:    12  0.0 %
p:     7  0.0 %
«:     5  0.0 %
(:     3  0.0 %
":     2  0.0 %
d:     2  0.0 %
f:     1  0.0 %
t:     1  0.0 %
s:     1  0.0 %
':     1  0.0 %
Y:     1  0.0 %
----------------------------------------------------------------------
Ran 1 test in 0.557s
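The counting step can also be tried without a dblp instance. The following is a minimal standalone sketch under stated assumptions: `first_letter_stats` and `sample_titles` are hypothetical names, the titles are made up and not taken from dblp, and, unlike the test above, empty or missing titles are skipped rather than raising an `IndexError`.

```python
from collections import Counter


def first_letter_stats(titles):
    """Count the first characters of all non-empty titles.

    Returns a tuple (counter, total), where total is the number
    of titles that were actually counted.
    """
    counter = Counter()
    total = 0
    for title in titles:
        if not title:  # skip None or empty titles
            continue
        counter[title[0]] += 1
        total += 1
    return counter, total


# hypothetical sample titles, not taken from dblp
sample_titles = [
    "Proceedings of the Example Workshop",
    "Proceedings of the Sample Conference",
    "2nd International Example Symposium",
    "Advances in Examples",
    "",
]

counter, total = first_letter_stats(sample_titles)
print(f"found {len(counter)} different first letters in {total} titles")
# most_common() without an argument lists all entries, most frequent first
for letter, count in counter.most_common():
    print(f"{letter}: {count:5} {count / total * 100:4.1f} %")
```

Keying the counter by the character itself rather than by `ord()` gives the same counts and avoids the `chr()` conversion when printing.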