Friday, November 16, 2018

6% is pretty low, but not even near low enough!


Notes
1) This list was created using public/free subtitles, from opensubtitles in particular.
The order is based on the number of occurrences of each word in the subtitles.
2) The original source of this list can be found here:
https://invokeit.wordpress.com/frequency-word-lists/
3) It is licensed under the following Creative Commons license:
http://creativecommons.org/licenses/by-sa/3.0/
4) More Most Common word lists (other languages) can be found here:
http://www.101languages.net/common-words/


I took a list of 60k Czech words ordered by frequency. I went through the first 4k and made a very quick decision about whether or not I knew what each one meant. The words were contextless, so I am pretty sure I would have been able to deduce a fair number of them from context. I tried to lean towards the stricter side, being painfully honest with myself about whether or not I knew the word. The thing is, "knowing" is not a yes/no thing at all; it's more like a spectrum, and I chose to be very strict on it. Did I immediately, without any context, know exactly what a word meant, so that I would be 99% sure of my definition if asked for one (or, where the word is a clitic or something untranslatable, that I knew what it meant)? If not, then the word is in this list.
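The selection step described above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it assumes each line of the frequency file looks like `word count` separated by whitespace, and the function names `parse_frequency_list` and `unknown_words`, as well as the `known` set standing in for my snap judgments, are hypothetical.

```python
def parse_frequency_list(lines, limit=4000):
    """Return the first `limit` entries as (rank, word, count) tuples.

    Assumes each line is "word count"; anything else is skipped.
    """
    entries = []
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines instead of guessing what they mean.
        if len(parts) != 2 or not parts[1].isdecimal():
            continue
        word, count = parts[0], int(parts[1])
        # Rank is the 1-based position among the valid lines.
        entries.append((len(entries) + 1, word, count))
        if len(entries) == limit:
            break
    return entries


def unknown_words(entries, known):
    """Keep (rank, word) for every entry whose word is not in `known`."""
    return [(rank, word) for rank, word, _count in entries if word not in known]


# Tiny made-up sample; the counts are invented for illustration only.
sample = ["je 100", "to 90", "", "přece 5", "zrovna 4"]
entries = parse_frequency_list(sample, limit=3)
print(unknown_words(entries, known={"je", "to"}))
```

With the real file, `lines` would come from opening the downloaded list, and `known` would be filled in by hand while going through the words.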

228/4000 is 5.7%, which is pretty low.

But not really low enough at all.
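For anyone who wants to check the arithmetic, here is a minimal sketch. The function name `percent_unknown` is made up for illustration; the numbers are the ones from this post (228 unknown words out of 4000 reviewed, and 215 after filtering out some of the swears).

```python
def percent_unknown(unknown, reviewed):
    """Share of reviewed words that were marked unknown, in percent."""
    return 100 * unknown / reviewed


print(round(percent_unknown(228, 4000), 1))  # before filtering the swears
print(round(percent_unknown(215, 4000), 1))  # after filtering the swears
```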

I filtered out some of the swears, which left me with 215. (By the way, the first unknown word on the list was "kurva", which I did not know at all and found pretty funny. That word made me realize I would want to filter the swears, so I copied and pasted the 228 into a translator. It did not catch all of them, though, and sometimes I realized a word's meaning only after looking at it for more than a split second, which is why the list below still has some curse words. I have seen and heard some Czech swears, but my collaborators are super nice, polite people who don't use this kind of language when speaking with me, and I don't use it myself except sometimes in my head, so the contexts in which I would run into these words are pretty limited.)

Here they are. The number after each word is its position in the frequency-ordered list for the entire "corpus" (its rank, not a raw count).


přece  330
haló  382  is this a cognate?
odsud  414  is this a spelling error?
zrovna  420
vždyť  439
seš  510  too many options - is this a clitic? Is it a spelling error of "jseš"?
vzal  537
new  610  is this infiltrated English?
vem  716  Is this a clitic?
jacku  772  is this hovorové?
vede  790
prý  920
koukni  945  is this from koukat?
vole  1025  isn't this a curse word?
držet  1041
pusť  1055
věř  1080  is this a preposition?
rozkaz  1149  <-- first example of commonish words that I don't know that is not in HP. Why isn't it?
pomozte  1175
držte  1189
poslyš  1190
kdysi  1238
či  1240  this is a grammar word; is it the same as čí?
rande  1303  is this a curse word?
pusťte  1382
kus  1416
vzadu  1472
mozek  1473
přesto  1476
vemte  1482
zápas  1483
zcela  1493
vrah  1546  looks like a noun.
sehnat  1616
vezměte  1621
seržante  1638  looks like sergeant.
zticha  1717
sklapni  1723  looks like an imperative
kruci  1762  looks like an imperative
kecy  1773  looks like kecat, "to chat" - chattings?
bacha  1780  looks like a noun
stává  1817  looks like zůstávat
sotva  1849
přines  1851
vítězství  1870
doprdele  1878  hmm. I think this is another swear.
vezmeme  1881  some kind of preposition?
poručíku  1897  some zdrobnělina?
seru  1921  some grammar word?
řek  1930  is this supposed to be řekl?
nima  1934
výsosti  1943  is this like, highness? Specialized register, pohádky
nůž  1956  some kind of body part?
řečeno  1971
miku  1972  some kind of noun
zvedni  1973  some kind of imperative
pošlete  1988
posaď  1989  ?
zasraný  2008  first adjective on this list that I do not know.
vezme  2012
žádám  2055
řeč  2083  left off here
opustil  2091
rozkazy  2123
synku  2128
chránit  2137
fér  2154
raz  2163
pokus  2169
los  2170
uklidněte  2188
přede  2194
udržet  2198
zázrak  2204
veliteli  2211
požádat  2239
podstatě  2240
hnout  2249
odtamtud  2262
svině  2282
roli  2292
sednout  2313
rayi  2330
jamesi  2363
řešení  2393
poblíž  2449
obavy  2482
potkala  2486
palubě  2488
vraždu  2492
odcházím  2524
ctihodnosti  2528
období  2536
vedení  2551
stůjte  2574
přinesu  2591
stovky  2598
dočkat  2640
uhni  2646
nebi  2675
území  2679
zranění  2716
hře  2727
rozhodnout  2787
ochranu  2845
odpočinout  2879
opice  2927
vzhledem  2965
nabídnout  2966
plukovník  2969
přesvědčit  2973
zároveň  2974
nuže  2984
nejprve  3008
vydržte  3031
prostor  3033
naštvaná  3041
zaslouží  3054
parchante  3060
láhev  3078
přineste  3089
poldové  3169
polib  3171
spadl  3173
seženu  3205
cvok  3272
vůči  3282
polovinu  3286
řízení  3301
vraha  3319
soukromí  3323
chrisi  3333
představu  3343
ovládat  3357
poručík  3362
odveďte  3383
vyspat  3387
ctí  3397
zkušenosti  3404
zvláštního  3409
podporu  3412
chyťte  3414
varování  3416
kousky  3418
důvodů  3419
vydělat  3420
zabiješ  3421
stvoření  3422
vypni  3427
přežil  3430
seržant  3432
zkurvenej  3436
dohoda  3437
ostatními  3444
lehnout  3445
ujistit  3448
dostaňte  3451
okolností  3457
poplach  3458
zatracenej  3461
zvládnout  3464
vyprávět  3466
hůř  3467
bojoval  3468
prst  3490
všechen  3499
obrovské  3500
ubohý  3501
kámoše  3520
velení  3523
zmrde  3530
zabiják  3546
báječné  3557
spojit  3559
vnitřní  3569
sebevraždu  3588
předstírat  3600
ukrást  3601
uhněte  3623
vevnitř  3624
boku  3630
hale  3641
začneš  3645
soustředit  3646
lano  3648
ostrově  3654
misi  3662
parchant  3681
pokusím  3690
podepsat  3694
palubu  3695
soutěž  3696
koukejte  3702
lituji  3706
ničí  3707
varovat  3720
zvládneme  3730
sáro  3759
dýchej  3760
sežeň  3784
šílená  3812
veřejnosti  3817
podezření  3834
neber  3844
šampaňské  3849
střechu  3869
oběma  3871
sundejte  3873
nestřílejte  3906
díra  3907
přiveďte  3925
křídla  3936
zabere  3938
přinese  3939
okouzlující  3961

Přece and zrovna are words I have recently learned, but they are included in this list because I could not immediately remember what they mean or how to use them.

I then took this list and started to find the words in a PDF version of Harry Potter 1. I highlighted the first 68 or so words and then it was time to go to sleep. 
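The highlighting I did by hand could be roughly automated. The sketch below is not what I actually did; it assumes the book's text has already been extracted from the PDF into a plain string, and the function name `count_list_words` is hypothetical. It only matches exact word forms, so an inflected form in the book will not count towards a different form on the list.

```python
import re
import unicodedata
from collections import Counter


def count_list_words(words, text):
    """Count how often each listed word appears in `text` as a whole token."""
    # PDF extraction sometimes yields decomposed accents; NFC recombines them
    # so that "č" typed as c + combining caron still matches "č".
    normalized = unicodedata.normalize("NFC", text).casefold()
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscores).
    counts = Counter(re.findall(r"[^\W\d_]+", normalized))
    return {
        word: counts[unicodedata.normalize("NFC", word).casefold()]
        for word in words
    }


# Made-up sentence for illustration, not a quote from the book.
text = "Přece jen to musel zvládnout. Přece!"
print(count_list_words(["přece", "zvládnout", "rozkaz"], text))
```

Words that come back with a count of zero are the ones worth a second look, like rozkaz in my list.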

Despite the flaws in the "corpus", the method is pretty awesome! I think it's pretty obvious corpus linguistics is fantastic. Like, really - this is so amazing to me. 

How great would it be to have your L2 textbooks informed by actual, real data?! I love it. Pretty much daily, perhaps hourly, I think of new ways to use this methodology and the tools that come with it. There is totally a place for me in this field.
