Difference between revisions of "Workdocumentation 2021-05-03"

From BITPlan cr Wiki
Jump to navigation Jump to search
(Created page with "= Question = What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)? = Assumption = Following a hierarchy of letter,...")
 
Line 11: Line 11:
  
 
Input: Proceedings titles of dblp conference entries.
 
Input: Proceedings titles of dblp conference entries.
 +
 +
== Letter ==
 +
<source lang='python'>
 +
def testMostCommonFirstLetter(self):
 +
        '''
 +
        get the most common first letters
 +
        '''
 +
        dblp,foundEvents=self.getEvents()
 +
        self.assertTrue(foundEvents>43950)
 +
        # collect first letters
 +
        counter=Counter()
 +
        for eventId in dblp.em.events:
 +
            if eventId.startswith("conf"):
 +
                event=dblp.em.events[eventId]
 +
                first=ord(event.title[0])
 +
                counter[first]+=1
 +
        bins=len(counter.keys())
 +
        print(f" {bins} different first letters found")
 +
        for o,count in counter.most_common(bins):
 +
            c=chr(o)
 +
            print (f"{c}: {count}")
 +
</source>
 +
<pre>
 +
46 different first letters found
 +
P: 12599
 +
2: 3526
 +
I: 3515
 +
A: 3296
 +
C: 2333
 +
S: 2260
 +
1: 2105
 +
T: 1559
 +
M: 1312
 +
E: 1252
 +
F: 1246
 +
D: 1177
 +
R: 624
 +
H: 578
 +
N: 566
 +
3: 564
 +
W: 522
 +
L: 502
 +
G: 501
 +
B: 479
 +
4: 354
 +
V: 334
 +
K: 257
 +
O: 255
 +
5: 252
 +
U: 236
 +
9: 215
 +
6: 211
 +
7: 199
 +
8: 187
 +
J: 150
 +
X: 88
 +
Q: 76
 +
e: 19
 +
Z: 13
 +
i: 12
 +
p: 7
 +
«: 5
 +
(: 3
 +
": 2
 +
d: 2
 +
f: 1
 +
t: 1
 +
s: 1
 +
': 1
 +
Y: 1
 +
----------------------------------------------------------------------
 +
Ran 1 test in 0.577s
 +
</pre>

Revision as of 08:27, 3 May 2021

Question

What happens if the relevance matrix approach is applied to proceedings title parsing (later: parsing in general)?

Assumption

Following a hierarchy of letter, token, grammatical structure and sentence along the relevance matrix path column first (depth first) leads to interesting observations.

Experiment

Hierarchy of: - Letter - Token - Grammatical structure - Sentence

Input: Proceedings titles of dblp conference entries.

Letter

def testMostCommonFirstLetter(self):
        '''
        get the most common first letters
        '''
        dblp,foundEvents=self.getEvents()
        self.assertTrue(foundEvents>43950)
        # collect first letters
        counter=Counter()
        for eventId in dblp.em.events:
            if eventId.startswith("conf"):
                event=dblp.em.events[eventId]
                first=ord(event.title[0])
                counter[first]+=1
        bins=len(counter.keys())
        print(f" {bins} different first letters found")
        for o,count in counter.most_common(bins):
            c=chr(o)
            print (f"{c}: {count}")
 46 different first letters found
P: 12599
2: 3526
I: 3515
A: 3296
C: 2333
S: 2260
1: 2105
T: 1559
M: 1312
E: 1252
F: 1246
D: 1177
R: 624
H: 578
N: 566
3: 564
W: 522
L: 502
G: 501
B: 479
4: 354
V: 334
K: 257
O: 255
5: 252
U: 236
9: 215
6: 211
7: 199
8: 187
J: 150
X: 88
Q: 76
e: 19
Z: 13
i: 12
p: 7
«: 5
(: 3
": 2
d: 2
f: 1
t: 1
s: 1
': 1
Y: 1
----------------------------------------------------------------------
Ran 1 test in 0.577s