Super Obvious DUH application of corpus linguistics to Czech as a Second Language

This always seems to happen - I have an idea and then think, "Oh man. That was so obvious. Why didn't I think of it before?" 

I just, well, didn't.

Rather than picking words from my Harry Potter readings at random, I could do it in a much more focused, precise way because CL can inform me of which words are the most frequently used words. 

And if they're frequently used, obviously I can find them in Harry Potter.


So now I just downloaded some guy's list of the top 2k most common words in Czech. I'm going to go through it and mark the ones which I definitely do know. The ones I do not know I am going to search for in a PDF of Harry Potter (post chapter 11, where I am), and highlight them for when I come across them later.
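The plan above (take the frequency list, drop the words I already know, then see which of the rest show up in the book) could be sketched roughly like this. This is only a minimal sketch: the function name, the tiny word list, and the sample sentence are all made up for illustration, and it assumes the list and the book text are already available as plain Python strings. It also matches exact word forms only, so because Czech inflects heavily, a list entry like "hůlka" will not match "hůlku" without some kind of lemmatizer.

```python
import re

def unknown_words_in_text(frequency_list, known_words, text):
    """Return frequency-list words that are not marked as known
    and that appear (as exact forms) somewhere in the text."""
    # Lowercase the text and split it into word tokens;
    # in Python 3, \w also matches Czech letters such as ř and ů.
    tokens = set(re.findall(r"\w+", text.lower()))
    known = {word.lower() for word in known_words}
    # Keep the original frequency order of the list.
    return [word for word in frequency_list
            if word.lower() not in known and word.lower() in tokens]

# Hypothetical toy data, not the real 2k list or the real book text.
frequency_list = ["a", "být", "ten", "hůlka", "kouzlo"]
known_words = ["a", "být", "ten"]
text = "Harry zvedl hůlku. Hůlka se zachvěla."

print(unknown_words_in_text(frequency_list, known_words, text))
```

With the real list, `text` would be whatever comes after chapter 11, and the returned words are the ones worth highlighting.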

If I were more hardcore, I could probably somehow investigate the list and verify its authenticity but I am going to assume that any of the first 2k words made by almost anybody using almost any method are probably going to be pretty good, and the words I pick can't become less based on random factors (though, to be honest - I am not a complete moron! I pick words which I truly have no idea what they could mean from the context - and sometimes after I pick 'em, I figure it out. So yeah, it's not entirely random what I pick. I also try to get a variety of POS)

I seriously can't believe I didn't think of doing this earlier. It would have been much more logical.


More Early Thoughts on Corpus/Computational Linguistics

You know what is so great about Corpus Linguistics?

Finally, finally, I have found some people who are asking and wondering similarly weird questions as I do, and always have. Ever since I was a very little girl.

For example, as a young girl not more than 8 or 9, I used to wonder what it would be like if you could have a pen (like, a BIC ink pen) that could remember everything it had ever written. If you could set it with some paper and just let it "run", and you could see all of the things it had ever written, what would that look like? What could you learn about the person who wrote it? This is so similar to the idea of creating some kind of "Kate Corpus" - gathering all of my writing (or at least the typed part, which must be a full 3/4s) and putting it into a big pile for analysis.

I am co-authoring a book on Czech Land Records for English-Speaking Genealogists and I was very adamant that we include a section there about some of the key linguistic differences between Czech and English which would really help the reader to better orient themselves. Things I wish I had known five years ago, before I even understood that learning Czech was an actual option for me. I tried to compare a handful of sentences - about five - in English to sentences in Czech to "prove" how Czech uses fewer words than English because it is a synthetic language. I wanted to show it. But then I realized I couldn't really do that without a HUGE massive study, since some ideas in Czech take more words to say than in English, and I could not even begin to imagine how this could be done. Now that I see that there is an entire field that undertakes objective analysis by computers of language through using statistics and massive amounts of data comparing languages to one another, I can start to glimpse how it would be possible to find evidence for or against the idea that Czech typically takes fewer words to say the same thing than English. What a fascinating study that would be!
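To make the "fewer words" idea concrete, here is a minimal sketch of the crudest possible measurement: count the words on each side of a set of translated sentence pairs and compare the totals. The function name and the two sentence pairs are invented for illustration, and splitting on whitespace is an assumption about what counts as a "word". Two pairs prove nothing, of course; a real study would need a large parallel corpus.

```python
def word_count_ratio(pairs):
    """Return total Czech words divided by total English words
    for a list of (english, czech) sentence pairs."""
    english_total = sum(len(english.split()) for english, czech in pairs)
    czech_total = sum(len(czech.split()) for english, czech in pairs)
    return czech_total / english_total

# Hypothetical hand-picked pairs, only to show the calculation.
pairs = [
    ("I do not know.", "Nevím."),
    ("I will write to you.", "Napíšu ti."),
]

ratio = word_count_ratio(pairs)
print(f"Czech uses {ratio:.2f} words per English word in this sample")
```

A ratio below 1 would count as evidence for the idea and a ratio above 1 as evidence against it, but only for the texts that were actually sampled.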

About three years ago, I started wondering about whether or not you could deduce the religious piety of a person in the past through a thorough analysis of the language that was used in their genealogical records. Was it all just a bunch of boilerplate, or - if you squinted hard enough, and had enough data to compare, could you start to uncover the actual attitudes, thoughts, feelings, desires, beliefs of your ancestors who lived so long ago and whose lives intersected with these pieces of paper that remain?

It basically tickles me pink to see a histogram of frequency of the occurrences of the f-word in the BNC.

And then, on a lazy Sunday morning when my husband was sleeping and I just couldn't - I picked up my embarrassingly nerdy "light reading" and found myself waking the poor man up through a fit of hysterical, uncontrollable laughter. The author of the textbook had a friend who believed that academic writing was not as fun to read because it had fewer adjectives than fiction. So the author compared the data about these two kinds of texts in the BNC, wrote detailed notes in the margins so that a non-geek could read it and understand it, and proved empirically that she was totally wrong; academic texts have significantly more adjectives than fiction. She came back with, "That's nice, but I still think the adjectives in fiction are somehow richer." "That's an entirely different research question!" - and much later, he implies that they went on to study that, as well.

Somebody out there in this world really deeply appreciates and understands my desire for evidence, quantification of ideas - proof. I am so excited - so massively excited - about the possibilities in this field. Especially because I don't have to be a programmer; I can imagine learning some programming skills, but I can't imagine ever "catching up" to others, nor can I imagine enjoying it very much. But perhaps I'm wrong.

The other thing that really speaks to me is how I've noticed that in every text and every lecture I've read and watched that were produced by linguists (so far), there has been a keen attention to...I don't know what to call it. Non-dryness? Realness? Emotiveness? Like, you can still write and talk like a person, not like a robot. This appeals to me.

And the weird, quirky names of certain principles (eg Zipf's law) are just...they just make me smile. They are somehow super relatable to me. Because of the quirkiness? I'm not sure.

Also, I bet there is 100x more information written about whelks as they relate to linguistics than as they relate to marine biology. This is such a funny idea. It's almost self-defeating. Kind of like having a conversation about your favorite password mnemonic system. Or telling somebody that you wish they would spontaneously give you unprompted, genuine praise.

I am thrilled to discover that this field of computational linguistics exists and to strongly feel that there is a place for me in it.

Nedostatek samohlásek - a Lack of Vowels

Nedostatek samohlásek
A Lack of Vowels

For native English speakers,
one has but to take a glance
at a single phrase in Czech to know
they haven't got a chance.

It's not devoid of cognates,
but that's really not apparent
through the massive lack of vowels,
which is really quite aberrant.

For how am I to pronounce words
completely lacking vowels?
It either gives one headaches
or does something to one’s bowels.

Stick fingers into any throat*,
or give cold wolves their grain,**
no matter what you try to do
this language causes pain.

The words have all been chopped to bits
then tossed with k's and z's.
And trying to pronounce an ř
brings devils to their knees.

Why is it such a bother to
mark schwas in sheep and fathers***?
And there’s multiple good reasons
why baptism**** is a bother!

And English-speaking gadgets
to this rule are no exception.
“Take the next left to B-R-N-
O” - her cop-out directions.  

Hard though it be t’enunciate
still I am not a quitter -
but nothing in my world of Czech
had better ever glitter*****!

*Strč prst skrz krk.
**Vlk zmrzl, zhltl hrst zrn.
*** ovce, otcové
*****třpytit se - ughhhh!

Meta 7 - Should I automate the gathering of example sentences?

It's easier to remember a word when you learn it within a context. It's weird that it is this way, but that is the way things are.

I have long considered this question: how can I gather a sufficient amount of Czech "context" when I live in the middle of a cornfield in Iowa (i.e. immersion is infeasible)? I decided that one sentence in a novel is not enough context; it would be better to see the word (or words that come from its stem) in other contexts.

You can put this question in other words: should I use the internet and its infinitely amazing tools (such as corpora or this sweet little tool my friend just pointed me towards) to find example sentences for myself?

I am still not sure.

Here is why. I am not an idiot; if I wanted to just find a sentence, I could simply search the internet for whatever word or phrase is causing me to scratch my head. I could do that.

I could even narrow it down a bit smaller to searching just Wikipedie, for example. In fact, I am playing around with that idea for chapters 7-9, which are currently serving as some sort of "buffer" zone while I wait for my patient collaborators to give me some example sentences for chapter 10.

One of them, by the way, used the SkELL tool mentioned above. But here's the thing: even him using the tool will not produce the same results as me using the tool (you know, first of all, if it worked; the site consistently crashed for me). He can very quickly pick which sentences are "good," assigning a judgment to them, and skip the ones that are less useful. I can't do that.

The best sentences I have had so far are the ones that were designed specifically for me. If there were a tool somewhere which tried to explain terms in simplish Czech with shortish sentences using lots of cognates or at least drawing on familiarish situations and stereotypes, then sure. I could draw from that as a source.

(There is not a robot that can draw from context that we actually do share, which should be mentioned.)

But like...a sentence like this, taken straight from Wikipedie?

Čest (od slovesa ctíti) je vlastnost bytosti (entity, která používá přirozenou inteligenci), nejčastěji člověka, již lze charakterizovat jako morální kredit, vážnost, hodnověrnost nebo dobré jméno.

Is that really better than what my collaborators came up with?

Veškerá zbylá čest byla pryč.

Když nepolkneš ten prášek, tak se neuzdravíš.

No. It isn't.

My gut tells me this: if you ask a native speaker to use a word in a sentence, especially a list of words where there are a lot - if they have the patience to tolerate such an exercise, and if they care about me at all (which I think they all do), the sentence will be better for me.

Now that my mind has been opened to the prospect of the ability to measure words and phrases using statistics, I guess it would be possible to quantify what a "better" context is. For the sake of L2 learning, it means it helps me achieve my goal of mastering Czech better.

What is the secret sauce to learning a new language?

I've had long conversations about this within the classroom, during office hours, and more recently over the phone with one of my amazing professors (Kirk Belnap, director of the National Middle East Language Resource Center, PhD in linguistics, MA in Language Acquisition, etc. etc.):
- persistent practice
- targeted feedback
- small victories

When I automate the collection of sentences, does it help me with my persistent practice? Sure, I don't see why it wouldn't. That is more of an issue of doing the work. I am going to do the work no matter what.

(Tonight at the dinner table my dad was telling our kids the story about how I learned how to play all the piano music from "the Phantom of the Opera" by myself as a 9 year old because my dad simply told me, "Oh...I don't think you can do this. It's too hard." It was kind of interesting to me to hear him say this, since I've never heard him say it in my life: "Never tell your mother she can't do something, because if you say that, then she will always just do it to prove you wrong." After he said that, I immediately thought of about half a dozen other examples of times in my life when I've literally done just that. So why not Czech. It may be really difficult and my learning situation may not be ideal. But it wasn't for this guy either, and he literally created the system by which Japanese is transliterated into the Latin alphabet, which is amazing. I can do amazing things, too.)

When I automate the collection of sentences, is it targeted? Well, certainly it is faster to find words this way. But is it as targeted, for example, as what somebody who knows me, who has spent lots of time speaking with me in both English and Czech, might try to say in order to get me to succeed? I guess it probably isn't.

Where this model significantly fails is to give me small victories. I do not know if it is possible to quantify the motivating feeling that I get when I discover the meaning of a sentence that was meant for me. If somebody was working closely with me on a task, the feeling of elation that comes when they say, "dobrá práce!" or even just a, " to answer your question? I will have to think about that..." Or even "Hahaha that is completely wrong, and not only that, but you ended up insulting me/coming on to me in a hilarious way! That's so embarrassingly funny!" So I turn a bright shade of red while working on my sentences. How can that possibly be a "small victory" - yet it is.

I guess for me, the victory comes from succeeding in communicating at all with someone, even if it is failed communication.

Can that be automated? Can computers eventually be taught so well that they can take the place of my collaborators beyond a flashcard drilling type area? Will I be satisfied and motivated to continue if I can't imagine that there is a human being on the other side of the world who is reading my writing 7 hours ahead of me, especially if they are people who I have actually seen both via skype and with my own eyes? I have thought about these ideas a lot. I do not know.

My gut still tells me no, it can't be automated because I believe in communicative language learning. Basically, the TLDR of my TESOL minor is this: "To teach your students, you have to know your students. To know them, you have to listen to them and talk with them."

Since my situation with Czech is like, the ultimate flipped classroom approach, I could rephrase that to be, "To learn, I have to know my teachers. To know them, I have to listen to them and talk with them."

Reader, you can basically see me tearing my hair out in frustration with the fact that I do not have a really set schedule yet with this infant, so arranging skype calls has to wait. Though actually, I think I might start to attempt that again sooner rather than later, depending on how the Quest for Czech Female Collaborators goes. And now you can imagine that I am basically bald (kidding - the postpartum alopecia hasn't kicked in yet. Though I know it will.)

From this book called Corpus Linguistics for English Teachers, I read,

The Oregon Department of Education (2002), in a publication distributed to its teachers, suggests that it is necessary for teachers to ask themselves the following question: "Will the use of technology make this lesson better? Will it facilitate student understanding? Will students' capacity to demonstrate their understanding increase because of it?" This publication notes that, by asking these questions, teachers will be able to determine when these technologies are appropriate and when they are not. The answers to these questions can be useful to English teachers as they formulate goals in incorporating CL tools into the classroom. The recognition that CL tools will not work all the time, across language topics and lesson settings, is very important for teachers. Knowing how the tools work and being able to take control, in case they don't work, are necessary in the successful integration of CL in the classroom. By thinking about CL tools within the frame of instructional technology, English teachers will come to view these tools as everyday, nonthreatening classroom devices. The tools will not be a step ahead of the teachers in their instruction, and they can use the tools when they are needed for a collocation exercise, for example, but not when it gets too complicated or confusing for learners. [emphasis added]

So, I will play around with some of these interesting ways of automating the gathering of example sentences, especially with these "buffer" chapters. I will definitely start to explore the fascinating world of concordancers and corpora. I asked some 20 people to supply me with example sentences, and typically about 2-6 respond, so while I am waiting for them I can at least work on those three "buffer" chapters.

But will it supplant my annoying weekly emails begging for my collaborators to use the words in a sentence?


Hey, I can always ask, right. At least now I know what to do if all of them simultaneously decide to abandon me. But I am (mostly) sure that won't happen.

Chat History as a Corpus?

Well, I just did something pretty weird and nerdy.

I figured out a way to download both my skype chat history (that was easy) and my hangouts history (still easy but only because of this guy's parser) to make an objective pile of chat history. A corpus.

Except it is not a real corpus because I don't know enough about how to format the files, and also as of now there are two kinds of files: one is an .html, the other a .txt. I guess it will have to be in a .txt file in order to run it through a concordancer (that is the word I did not know yet) in order to look at the language quickly and much more objectively.
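For what it's worth, getting from .html to plain text and then looking at a word in context does not take much. Below is a minimal sketch using only Python's standard library. The class and function names and the sample HTML are made up for illustration; a real chat export will have its own layout (names, timestamps, maybe scripts and styles mixed in) that this does not account for, and a real concordancer does far more than this little keyword-in-context function.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the non-empty text found between HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

def concordance(text, keyword, width=2):
    """Return each hit of keyword with `width` words of context per side."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        # Ignore case and trailing punctuation when comparing.
        if token.lower().strip(".,!?") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical stand-in for an exported chat file.
sample_html = "<div><p>Ahoj, jak se máš?</p><p>Mám se dobře.</p></div>"
plain_text = html_to_text(sample_html)
for line in concordance(plain_text, "se", width=1):
    print(line)
```

Writing `plain_text` out to a .txt file would put both kinds of export into the same format.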

Actually, I decided to make several...uh...pre-corpora piles...divided by who I was texting with. By the way, the person who I chat most with on hangouts is, surprisingly, my brother.

As interesting as this is for me on a personal level, it's also very interesting to me on a much broader level. If I can do this on a small scale, it can certainly be done on a much bigger scale. One of my oldest online friends once told me that he thinks that online friends are exactly the same as in person friends.

I've wondered a lot about if that is true. But since it was impossible to measure, it remained an intuitive, gut feeling. Not something really answerable. Sure, good fodder for pontification (this blog was almost named pontifiKate for the record) and rantings on blogs that nobody reads. But that's about it - nothing you can really do about it.

But you can analyze and measure properties of written language by sticking 'em in a corpus.

I suppose that it would not be impossible to measure spoken text, but perhaps we are limited because there is no way to do so that is socially um...fluid? Acceptable? Legal? First, there are not very many spoken language corpora (it's possible - but it would be a problem to convert the language into a written form, and you'd lose data). It's not like you walk around with a tape recorder on your shoulder collecting spoken text - and even if you did, it would be altered because of the mere presence of the recorder (people would not say the same things because they see the recorder). Whereas with texting, you have a written record. It is there. It is downloadable. It is analyzable. It is...corporeal. Corporeable. Hahaha.

So you probably could not compare IRL communication with online communication after all, and perhaps then, there is no way to answer that question about whether or not friends are friends are friends, or whether the ones in real life are more or less valuable than the online ones, and if so, how.

But still, analyzing the written texting communication, you could find interesting answers to questions about how people interact - you can even measure hesitation and even like, emotions (!) because of time stamps and emojis and punctuation and...
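As a rough illustration of the time stamp idea, here is a minimal sketch that measures how long each person takes to reply. The message format (timestamp, sender, text) and the sample conversation are assumptions invented for this example; real exports store this differently. A long gap might mean hesitation, or it might just mean somebody went to make dinner, so the numbers need interpreting.

```python
from datetime import datetime

def reply_gaps(messages):
    """Given chronological (iso_timestamp, sender, text) tuples, return
    (sender, seconds_waited) for every message that answers someone else."""
    gaps = []
    for previous, current in zip(messages, messages[1:]):
        prev_time, prev_sender, _ = previous
        curr_time, curr_sender, _ = current
        # Only count a gap when the speaker changes, i.e. a reply.
        if curr_sender != prev_sender:
            delta = (datetime.fromisoformat(curr_time)
                     - datetime.fromisoformat(prev_time))
            gaps.append((curr_sender, delta.total_seconds()))
    return gaps

# Hypothetical conversation, not taken from any real chat log.
messages = [
    ("2017-01-01T10:00:00", "Kate", "Ahoj!"),
    ("2017-01-01T10:00:30", "Friend", "Ahoj, jak se máš?"),
    ("2017-01-01T10:02:30", "Kate", "Mám se dobře."),
]

for sender, seconds in reply_gaps(messages):
    print(f"{sender} replied after {seconds:.0f} seconds")
```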

You could also analyze this kind of communicative language and look for patterns in language learning. You could even potentially measure progress in L2 learning (perhaps based on hesitancy/reaction time?). This could then inform best practices in teaching/learning an L2 (or 3 or 4 or whatever). I mean, we live in a digital world. I obviously firmly believe that there is no reason I should be prevented from learning Czech just because I'm an Iowa housewife with no foreseeable access to language immersion. I already have long considered texting in my L4 to be a language learning tool, although I am a firm believer in communicative language teaching and learning - authenticity is king, meaning: I probably over-invest in the building-the-relationship-with-my-collaborators side of the equation, meaning, I definitely do not view texting as a means to an end, but as the basis of real, important relationships in my actual life. Meaning: I don't look at my collaborators as chatbotičky. Meaning there is a lot of banter in English that I don't consider wasted time. But maybe I should. :::shrug:::

Most of all, you could use such corpora to better understand how people and relationships work. I care about that.

In fact, perhaps that is one of the things I care about most of all.

Wow, I know I've been a fan of instant messaging since I was a little girl of about 9 years old and against my parents' wishes I downloaded AIM onto our dial-up computer so I could text with my friends down the street or across town. But I didn't realize my love of and interaction with this kind of communication could lead me to explore ways in which texting could be analyzed in order to inform language learning, psychology, and even forming friends as an adult.

This is exactly the kind of stuff I can see myself studying for an advanced degree. Cool! Exciting!