Wednesday, November 14, 2018

More Early Thoughts on Corpus/Computational Linguistics

You know what is so great about Corpus Linguistics?

Finally, finally, I have found some people who ask and wonder about the same kind of weird questions that I do and always have, ever since I was a very little girl.

For example, as a young girl not more than 8 or 9, I used to wonder what it would be like if you could have a pen (like, a BIC ink pen) that could remember everything it had ever written. If you could set it with some paper and just let it "run", and you could see all of the things it had ever written, what would that look like? What could you learn about the person who wrote with it? This is so similar to the idea of creating some kind of "Kate Corpus" - gathering all of my writing (or at least the typed part, which must be a full three quarters of it) and putting it into a big pile for analysis.

I am co-authoring a book on Czech Land Records for English-Speaking Genealogists, and I was adamant that we include a section about some of the key linguistic differences between Czech and English, which would really help readers orient themselves. Things I wish I had known five years ago, before I even understood that learning Czech was an actual option for me. I tried to compare a handful of sentences - about five - in English to sentences in Czech to "prove" how Czech uses fewer words than English because it is a synthetic language. I wanted to show it. But then I realized I couldn't really do that without a HUGE massive study, since some ideas take more words to say in Czech than in English, and I could not even begin to imagine how this could be done. Now that I see there is an entire field that uses computers, statistics, and massive amounts of data to analyze languages objectively and compare them to one another, I can start to glimpse how it would be possible to find evidence for or against the idea that Czech typically takes fewer words than English to say the same thing. What a fascinating study that would be!
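
A minimal sketch of how such a comparison might start, in Python. Everything here is an assumption made for illustration: the three sentence pairs are hand-picked toy examples (not corpus data, and far too few to prove anything), the function names `count_words` and `word_ratio` are hypothetical, and a "word" is crudely defined as any run of letters or digits.

```python
import re

def count_words(sentence: str) -> int:
    """Count word tokens as runs of letters/digits, ignoring punctuation."""
    return len(re.findall(r"\w+", sentence))

def word_ratio(pairs):
    """Total Czech words divided by total English words over aligned pairs."""
    czech_total = sum(count_words(cs) for cs, _ in pairs)
    english_total = sum(count_words(en) for _, en in pairs)
    return czech_total / english_total

# Hand-picked toy pairs (Czech, English); NOT real corpus data.
pairs = [
    ("Prší.", "It is raining."),
    ("Nemám čas.", "I do not have time."),
    ("Bydlím v Praze.", "I live in Prague."),
]

ratio = word_ratio(pairs)
print(f"Czech/English word ratio: {ratio:.2f}")
```

A ratio below 1 means the Czech side used fewer words in this sample. A real study would need a large, properly aligned parallel corpus and a much more careful definition of what counts as a word, precisely because some ideas take more words in Czech.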

About three years ago, I started wondering whether you could deduce the religious piety of a person in the past through a thorough analysis of the language used in their genealogical records. Was it all just a bunch of boilerplate, or, if you squinted hard enough and had enough data to compare, could you start to uncover the actual attitudes, thoughts, feelings, desires, and beliefs of your ancestors who lived so long ago and whose lives intersected with these pieces of paper that remain?

It basically tickles me pink to see a histogram of how often the f-word occurs in the BNC (the British National Corpus).

And then, on a lazy Sunday morning when my husband was sleeping and I just couldn't - I picked up my embarrassingly nerdy "light reading" and found myself waking the poor man up with a fit of hysterical, uncontrollable laughter. The author of the textbook had a friend who believed that academic writing was not as fun to read because it had fewer adjectives than fiction. So the author compared the data about these two kinds of texts in the BNC, wrote detailed notes in the margins so that a non-geek could read and understand it, and proved empirically that she was totally wrong: academic texts have significantly more adjectives than fiction. She came back with, "That's nice, but I still think the adjectives in fiction are somehow richer." His reply: "That's an entirely different research question!" And much later, he implies that they went on to study that as well.

Somebody out there in this world really deeply appreciates and understands my desire for evidence, quantification of ideas - proof. I am so excited - so massively excited - about the possibilities in this field. Especially because I don't have to be a programmer; I can imagine learning some programming skills, but I can't imagine ever "catching up" to others, nor can I imagine enjoying it very much. But perhaps I'm wrong.

The other thing that really speaks to me is that in every text I've read and every lecture I've watched by linguists (so far), there has been a keen attention to...to...this thing...I don't know what to call it. Non-dryness? Realness? Emotiveness? Like, you can still write and talk like a person, not like a robot. This appeals to me.

And the weird, quirky names of certain principles (e.g., Zipf's law) are just...they just make me smile. They are somehow super relatable to me. Because of the quirkiness? I'm not sure.

Also, I bet there is 100x more information written about whelks as they relate to linguistics than as they relate to marine biology. This is such a funny idea. It's almost self-defeating. Kind of like having a conversation about your favorite password mnemonic system. Or telling somebody that you wish they would spontaneously give you unprompted, genuine praise.

I am thrilled to discover that this field of computational linguistics exists and to strongly feel that there is a place for me in it.
