Notes
1) This list was created using public/free subtitles, from opensubtitles in particular.
The order is based on the number of occurrences of each word in the subtitles.
2) The original source of this list can be found here:
https://invokeit.wordpress.com/frequency-word-lists/
3) It is licensed under the following Creative Commons license:
http://creativecommons.org/licenses/by-sa/3.0/
4) More most-common-word lists (other languages) can be found here:
http://www.101languages.net/common-words/
I took a list of 60k Czech words ordered by frequency, went through the first 4k, and made a very quick decision about whether or not I knew what each one meant. The words were contextless, so I am pretty sure I could have deduced a fair number of them from some context. I tried to lean towards the stricter side, being painfully honest with myself about whether or not I knew the word. The thing is, "knowing" is not a yes/no thing at all; it's more like a spectrum, and I chose to be very strict on it. Did I immediately, without any context, know exactly what a word meant, in such a way that I would be 99% sure of my definition if asked for one (or, in cases where the word is a clitic or something otherwise untranslatable, that I knew what it meant)? If not, then the word is in this list.
228/4000 is about 5.7%, which is pretty low.
But not really low enough at all.
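The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not the way I actually tallied things: the function name `unknown_rate` and the toy word list are made up for the example, and it assumes the ranked words are already loaded into a list, most frequent first.

```python
def unknown_rate(ranked_words, unknown, top_n=4000):
    """Count how many of the first `top_n` ranked words are in `unknown`.

    Returns (count, fraction of the words actually checked).
    """
    head = ranked_words[:top_n]
    if not head:
        raise ValueError("no words to check")
    misses = sum(1 for word in head if word in unknown)
    return misses, misses / len(head)


# Toy data (hypothetical): six ranked words, two of them marked unknown.
ranked = ["je", "to", "se", "přece", "na", "haló"]
count, fraction = unknown_rate(ranked, {"přece", "haló"}, top_n=6)
print(count, f"{fraction:.1%}")

# The figure from this post: 228 unknown words out of the first 4000.
post_rate = f"{228 / 4000:.1%}"
print(post_rate)  # 5.7%
```

Note that this counts word types, not how often each word actually turns up in the subtitles, so it says nothing about how much running text the unknown words cover.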
I filtered out some of the swear words. (By the way, the first unknown word on the list was "kurva", which I did not know at all and found pretty funny. It was this word that made me realize I would want to filter the swears, so I copied and pasted the 228 into a translator. It did not catch all of them, though, and sometimes I only realized a word's meaning after looking at it for more than a split second, which is why the list below still has some curse words. I have seen and heard some Czech swears, but obviously all my collaborators are super nice, polite people who don't use this kind of language when speaking with me, and I don't actually use it myself except sometimes in my head, so the contexts in which I would come across them are pretty few.) This left me with 215.
Here they are. The second column is each word's rank by frequency within the entire "corpus"; the third holds the notes I made at first glance.
přece | 330 | |
haló | 382 | is this a cognate? |
odsud | 414 | is this a spelling error? |
zrovna | 420 | |
vždyť | 439 | |
seš | 510 | too many options - is this a clitic? Is it a spelling error of "jseš"? |
vzal | 537 | |
new | 610 | is this infiltrated English? |
vem | 716 | Is this a clitic? |
jacku | 772 | is this hovorové? |
vede | 790 | |
prý | 920 | |
koukni | 945 | is this from koukat? |
vole | 1025 | isn't this a curse word? |
držet | 1041 | |
pusť | 1055 | |
věř | 1080 | is this a preposition? |
rozkaz | 1149 | <-- first example of a commonish word I don't know that is not in HP. Why isn't it? |
pomozte | 1175 | |
držte | 1189 | |
poslyš | 1190 | |
kdysi | 1238 | |
či | 1240 | this is a grammar word; is it the same as čí? |
rande | 1303 | is this a curse word? |
pusťte | 1382 | |
kus | 1416 | |
vzadu | 1472 | |
mozek | 1473 | |
přesto | 1476 | |
vemte | 1482 | |
zápas | 1483 | |
zcela | 1493 | |
vrah | 1546 | looks like a noun. |
sehnat | 1616 | |
vezměte | 1621 | |
seržante | 1638 | looks like sergeant. |
zticha | 1717 | |
sklapni | 1723 | looks like an imperative |
kruci | 1762 | looks like an imperative |
kecy | 1773 | looks like kecat, "to chat" - chattings? |
bacha | 1780 | looks like a noun |
stává | 1817 | looks like zůstávat |
sotva | 1849 | |
přines | 1851 | |
vítězství | 1870 | |
doprdele | 1878 | hmm. I think this is another swear. |
vezmeme | 1881 | some kind of preposition? |
poručíku | 1897 | some zdrobnělina? |
seru | 1921 | some grammar word? |
řek | 1930 | is this supposed to be řekl? |
nima | 1934 | |
výsosti | 1943 | is this like, highness? Specialized register, pohádky |
nůž | 1956 | some kind of body part? |
řečeno | 1971 | |
miku | 1972 | some kind of noun |
zvedni | 1973 | some kind of imperative |
pošlete | 1988 | |
posaď | 1989 | ? |
zasraný | 2008 | first adjective on this list that I do not know. |
vezme | 2012 | |
žádám | 2055 | |
řeč | 2083 | left off here |
opustil | 2091 | |
rozkazy | 2123 | |
synku | 2128 | |
chránit | 2137 | |
fér | 2154 | |
raz | 2163 | |
pokus | 2169 | |
los | 2170 | |
uklidněte | 2188 | |
přede | 2194 | |
udržet | 2198 | |
zázrak | 2204 | |
veliteli | 2211 | |
požádat | 2239 | |
podstatě | 2240 | |
hnout | 2249 | |
odtamtud | 2262 | |
svině | 2282 | |
roli | 2292 | |
sednout | 2313 | |
rayi | 2330 | |
jamesi | 2363 | |
řešení | 2393 | |
poblíž | 2449 | |
obavy | 2482 | |
potkala | 2486 | |
palubě | 2488 | |
vraždu | 2492 | |
odcházím | 2524 | |
ctihodnosti | 2528 | |
období | 2536 | |
vedení | 2551 | |
stůjte | 2574 | |
přinesu | 2591 | |
stovky | 2598 | |
dočkat | 2640 | |
uhni | 2646 | |
nebi | 2675 | |
území | 2679 | |
zranění | 2716 | |
hře | 2727 | |
rozhodnout | 2787 | |
ochranu | 2845 | |
odpočinout | 2879 | |
opice | 2927 | |
vzhledem | 2965 | |
nabídnout | 2966 | |
plukovník | 2969 | |
přesvědčit | 2973 | |
zároveň | 2974 | |
nuže | 2984 | |
nejprve | 3008 | |
vydržte | 3031 | |
prostor | 3033 | |
naštvaná | 3041 | |
zaslouží | 3054 | |
parchante | 3060 | |
láhev | 3078 | |
přineste | 3089 | |
poldové | 3169 | |
polib | 3171 | |
spadl | 3173 | |
seženu | 3205 | |
cvok | 3272 | |
vůči | 3282 | |
polovinu | 3286 | |
řízení | 3301 | |
vraha | 3319 | |
soukromí | 3323 | |
chrisi | 3333 | |
představu | 3343 | |
ovládat | 3357 | |
poručík | 3362 | |
odveďte | 3383 | |
vyspat | 3387 | |
ctí | 3397 | |
zkušenosti | 3404 | |
zvláštního | 3409 | |
podporu | 3412 | |
chyťte | 3414 | |
varování | 3416 | |
kousky | 3418 | |
důvodů | 3419 | |
vydělat | 3420 | |
zabiješ | 3421 | |
stvoření | 3422 | |
vypni | 3427 | |
přežil | 3430 | |
seržant | 3432 | |
zkurvenej | 3436 | |
dohoda | 3437 | |
ostatními | 3444 | |
lehnout | 3445 | |
ujistit | 3448 | |
dostaňte | 3451 | |
okolností | 3457 | |
poplach | 3458 | |
zatracenej | 3461 | |
zvládnout | 3464 | |
vyprávět | 3466 | |
hůř | 3467 | |
bojoval | 3468 | |
prst | 3490 | |
všechen | 3499 | |
obrovské | 3500 | |
ubohý | 3501 | |
kámoše | 3520 | |
velení | 3523 | |
zmrde | 3530 | |
zabiják | 3546 | |
báječné | 3557 | |
spojit | 3559 | |
vnitřní | 3569 | |
sebevraždu | 3588 | |
předstírat | 3600 | |
ukrást | 3601 | |
uhněte | 3623 | |
vevnitř | 3624 | |
boku | 3630 | |
hale | 3641 | |
začneš | 3645 | |
soustředit | 3646 | |
lano | 3648 | |
ostrově | 3654 | |
misi | 3662 | |
parchant | 3681 | |
pokusím | 3690 | |
podepsat | 3694 | |
palubu | 3695 | |
soutěž | 3696 | |
koukejte | 3702 | |
lituji | 3706 | |
ničí | 3707 | |
varovat | 3720 | |
zvládneme | 3730 | |
sáro | 3759 | |
dýchej | 3760 | |
sežeň | 3784 | |
šílená | 3812 | |
veřejnosti | 3817 | |
podezření | 3834 | |
neber | 3844 | |
šampaňské | 3849 | |
střechu | 3869 | |
oběma | 3871 | |
sundejte | 3873 | |
nestřílejte | 3906 | |
díra | 3907 | |
přiveďte | 3925 | |
křídla | 3936 | |
zabere | 3938 | |
přinese | 3939 | |
okouzlující | 3961 | |
Přece and zrovna are words I learned recently, but they are included in this list because I could not immediately remember what they mean or how to use them.
I then took this list and started to find the words in a PDF version of Harry Potter 1. I highlighted the first 68 or so words and then it was time to go to sleep.
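Looking the words up by hand could be roughly automated along these lines. This is a sketch under assumptions: it supposes the PDF has already been converted to plain text by some other tool, the helper name `split_by_presence` is hypothetical, and the sample sentence is invented for the demo (it is not from the book). It matches exact word forms only, so "rozkaz" and "rozkazy" count as different words, just as they do in the list above.

```python
import re
import unicodedata


def split_by_presence(words, text):
    """Split `words` into (found, missing) by exact token match in `text`."""
    # Text extracted from a PDF can contain decomposed accents; NFC recombines them.
    normalized = unicodedata.normalize("NFC", text).lower()
    # In Python 3, \w is Unicode-aware, so letters like "ř" and "č" stay inside tokens.
    tokens = set(re.findall(r"\w+", normalized))
    found, missing = [], []
    for word in words:
        key = unicodedata.normalize("NFC", word).lower()
        (found if key in tokens else missing).append(word)
    return found, missing


# Invented sample sentence standing in for the book text.
sample = "Přece jsem ti říkal, že zrovna teď nemám čas. Haló?"
found, missing = split_by_presence(["přece", "zrovna", "rozkaz", "haló"], sample)
print("found:", found)
print("missing:", missing)
```

The `missing` list is the interesting part: it is where a word like "rozkaz" would show up if the book never uses it in that exact form.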
Despite the flaws in the "corpus", the method is pretty awesome! I think it's pretty obvious corpus linguistics is fantastic. Like, really - this is so amazing to me.
How great would it be to have your L2 textbooks informed by actual, real data?! I love it. Pretty much daily, perhaps hourly, I think of new ways to use this methodology and the tools that come with it. There is totally a place for me in this field.