Workdocumentation 2021-05-03: Difference between revisions
Revision as of 06:27, 3 May 2021
Question
What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?
Assumption
Following a hierarchy of letter, token, grammatical structure, and sentence along the relevance matrix path, column first (depth first), leads to interesting observations.
Experiment
Hierarchy of:
- Letter
- Token
- Grammatical structure
- Sentence
Input: Proceedings titles of dblp conference entries.
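The first two levels of the hierarchy can be illustrated on a single title. This is a minimal sketch, assuming that tokens are simply whitespace-separated; the function name `levels` and the sample title are hypothetical, and the grammatical structure level is left out because it would need a parser.

```python
def levels(title):
    """
    split a title into sentence, token and letter views
    (hypothetical helper; whitespace tokenization is an assumption)
    """
    return {
        "sentence": title,
        "tokens": title.split(),
        "letters": list(title),
    }

# made-up title for illustration only
views=levels("Proceedings of the 1st Workshop")
print(views["tokens"])
print(views["letters"][0])
```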
Letter
from collections import Counter

def testMostCommonFirstLetter(self):
    '''
    get the most common first letters
    '''
    dblp,foundEvents=self.getEvents()
    self.assertTrue(foundEvents>43950)
    # collect first letters, keyed by their code point
    counter=Counter()
    for eventId in dblp.em.events:
        # only conference entries are considered
        if eventId.startswith("conf"):
            event=dblp.em.events[eventId]
            first=ord(event.title[0])
            counter[first]+=1
    bins=len(counter.keys())
    print(f" {bins} different first letters found")
    # show all bins, most frequent first
    for o,count in counter.most_common(bins):
        c=chr(o)
        print(f"{c}: {count}")
46 different first letters found
P: 12599
2: 3526
I: 3515
A: 3296
C: 2333
S: 2260
1: 2105
T: 1559
M: 1312
E: 1252
F: 1246
D: 1177
R: 624
H: 578
N: 566
3: 564
W: 522
L: 502
G: 501
B: 479
4: 354
V: 334
K: 257
O: 255
5: 252
U: 236
9: 215
6: 211
7: 199
8: 187
J: 150
X: 88
Q: 76
e: 19
Z: 13
i: 12
p: 7
«: 5
(: 3
": 2
d: 2
f: 1
t: 1
s: 1
': 1
Y: 1
----------------------------------------------------------------------
Ran 1 test in 0.577s
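The counting step can be tried without the dblp data. The following is a minimal sketch, assuming a plain list of title strings; the function name `countFirstLetters` and the sample titles are hypothetical and not part of the original test. Unlike the test above, it skips empty titles and keys the counter by the character itself instead of its code point.

```python
from collections import Counter

def countFirstLetters(titles):
    """
    count the first character of each non-empty title
    (hypothetical helper, not part of the original test)
    """
    counter=Counter()
    for title in titles:
        # an empty title has no first letter, so skip it
        if title:
            counter[title[0]]+=1
    return counter

# made-up sample titles for illustration only
sampleTitles=[
    "Proceedings of the 1st Workshop on Examples",
    "Proceedings of the 2nd Workshop on Examples",
    "2nd International Conference on Samples",
    "",
]
counter=countFirstLetters(sampleTitles)
for c,count in counter.most_common():
    print(f"{c}: {count}")
```

Keying by the character avoids the `ord`/`chr` round trip; the resulting counts are the same either way.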