Wednesday, November 7, 2018

Middle of the night thoughts about Corpus Linguistics

Oh! I could create a Kate corpus.
But how?
Where is the stuff I've written?
Can I make a tool to export all my Google docs?
But how would it know it's something I've written?
Oh! You could probably use this tool to analyze authorship.
Oh! That is probably how that guy who studied the authorship of the Book of Mormon did it, the one who found it to be consistent with 4 main different authors.
But what about the stuff I've written that's on my computer? Like... How would I find it?
And what about a blog? Seems like it should be the most simple kind of thing to somehow upload into a corpus.
But would I even want a corpus of all my writing ever, or would I want to separate it by various writing projects, like one for my blog when I was in Jordan, one for that terrible high school poetry...
But why couldn't I have multiple corpora? There were sooooooooooo many.
And didn't that lecturer say you could create corpora that update daily?
But how would I create the one Kate corpus to rule them all?
And what about email? How could I filter the meaningful emails that I've written? "Could a program be written to download all 'sent' emails?"
Danny: "oh, so you could make a better ad predicting tool?"
Oh! This is how one might practically do this. I never actually thought about it. A tool that parses the text, labeling parts of speech and whatnot would definitely be necessary.
Danny: "What gets really interesting is when you can write short scripts to *do* something with the data. Like, that will open that tool, take the data, do something with it, and then maybe feed it back."
Oh! You could make a meta corpus. What?!
Danny: "Or you could use it to create lots of beautiful graphs. I wonder if instead of Python, I should teach you R..."
What. Would my brain be able to handle that?
Seeing the look on my face: "It's not really a programming language, it's a scripting language. You could do it."
Could I? I dunno.
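
A minimal sketch of the kind of short script Danny is describing, assuming the writing has already been gathered as plain text. The `kate_corpus` list and both function names are made up for illustration; it only counts words and does not label parts of speech, which would need a separate tagging tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count how often each word appears across a list of documents."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


# Hypothetical mini corpus; real input would be exported docs, blog posts, or emails.
kate_corpus = [
    "Oh! I could create a Kate corpus.",
    "But how would I create the one Kate corpus to rule them all?",
]

freqs = word_frequencies(kate_corpus)
print(freqs.most_common(3))
```

Keeping each project (the Jordan blog, the poetry) as its own list would give multiple corpora that can be counted separately or merged into one.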

Waking up in the middle of the night: Oh! I could learn all kinds of things about myself that I didn't know before by quickly and objectively analyzing my writing. Like, do I write differently for different audiences? How has my language changed over time? Has my vocabulary shrunk or grown since becoming a mom? Though what's provable would probably just be correlation, not causation.
- is there a way to determine objectively to whom I was actually writing when I wrote x? I have only a very few things - mostly handwritten - which were written with a vague "posterity's sake" audience. Could the language analysis be precise enough to sort writing into groups, and then could I tag them based on my memory of who they were for? Could a tool be trained to do this on its own?
- oh, how embarrassing to write like that. I should not have to have a muse. That's so pathetic.
- but it is a fact. So it may be measurable. What!!
- if for me, then for others, too?
- can you use this to predict if somebody is being honest?
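
One rough way to poke at the vocabulary question is a type-token ratio: distinct words divided by total words. This is only a sketch. The `writing_by_period` samples are invented placeholders, and the ratio drops as texts get longer, so it is only fair to compare samples of similar length.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Return distinct words divided by total words, or 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical samples standing in for writing from two periods of life.
writing_by_period = {
    "before": "the cat sat on the mat",
    "after": "a quick brown fox jumps high",
}

for period, text in writing_by_period.items():
    print(period, round(type_token_ratio(text), 2))
```

A difference between periods would still only be correlation, not causation.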


- Can a tool like this measure the probable emotions of the writer based on word frequency?

- There has to be a Bible corpus.
- Corpora. There are lots of different ways to, like, pile biblical writing. There must be various corpora of this. What could you do with it?
- You could create a Bible concordance that's intelligent.
- Oh! This is what the lecturer meant when he said all modern dictionaries use corpora.
- OH! THIS IS HOW I COULD SOMEDAY CREATE AN EXTENSIVE CZECH LAND RECORDS DICTIONARY.
- Oh! You could somehow make something that makes the handwritten text analysis tools export data into a corpus! Could you, like, automate that? Or make it intelligent?

- Oh! But I don't even have to wait to write a script to take this to some extra level. I can use this *right now* in my study of Czech. But how? I need some time to play with this.

Time is always the limiting factor.

Time and sleep.

How can I sleep now?! The possibilities! Sooooooooooo many possibilities!!!

1 comment:

  1. My middle of the night self was wrong. R is a statistical interface.
