24 April 2006

canes hope maurice gets shot in toronto

This apparently was a headline for a brief time on the Globe and Mail website this morning, and a friend emailed me to let me know. The story was about the likelihood of former Carolina Hurricanes coach Paul Maurice being hired on to coach the Toronto Maple Leafs. By the time I got there, it had been changed to "Canes hope Maurice gets a shot in Toronto." Seems they'd had to include the indefinite article to avoid a blundered-sounding headline.

Headlines as a rule drop articles, pronouns, and forms of be as either a copula or auxiliary verb. Usually relinquishing a definite or indefinite article has no detrimental effect, unless the result is an independently occurring construction or idiom. In this case, the presence of the article in the expression get a shot (as in, be granted an opportunity to try) is all that distinguishes the structure from the expression get shot (as in suffer a bullet wound). I wonder how many other pairs of expressions or idioms differ only by the presence or absence of an article.

22 April 2006

not afraid to back down

So said Brian Campbell of the Buffalo Sabres during a live TV interview (on, um, Outdoor Life Network) in the hallway just before the first overtime period of his team's opening NHL playoff game against the Philadelphia Flyers. He had been asked about how he and his teammates were trying to handle the physical play of the (generally bigger) Flyers players. In response, Campbell overnegated (or undernegated?) - he had two equivalent expressions from the arsenal of interview phrases to choose from, "not backing down" and "standing up for yourself", and wanted to embed whatever he said in the "not afraid to X" frame. He just grabbed the wrong embeddee, and said "I think we're not afraid to back down". No biggie, probably no one else but me noticed, and I'm not trying to be critical. I just think that it's worth noting that these things can happen when a network insists on interviewing a professional athlete, before the conclusion of a game, when the athlete just wants to play.

The same telecast (along with several others this weekend) has forced me to correct a point in my previous post regarding the structure of "infraction announcements" in the NHL. I had claimed that the referee, when announcing a penalty, always refers to the team of the transgressor by the colour of their sweater, as in "6 white, two minutes for hooking". I was correct about the structure of the announcement (number, team, punishment, transgression) but not about the team reference - all the penalty calls I saw this weekend referred to the team by its city. So we saw things like "forty-four Edmonton, two minutes interference" rather than "forty-four white". I have seen the team colour used, but only by Kerry Fraser.

18 April 2006

structure of infractions

Heidi Harley's recent post links to an old post of hers about the structure of the phrase grand theft auto, a left-headed legal term. Makes me think of the structure of penalty calls in football and hockey, where the referee announces to the crowd, team, and other game officials what the infraction is. Actually, the very first thing I thought of was the set of ice hockey infractions "obstruction-hooking", "obstruction-interference", and "obstruction-holding the stick". Each of these is parallel to grand theft auto, since they seem to be left-headed - each infraction is a type of obstruction call.

Football penalties are usually not left-headed (e.g. "pass interference", not "interference - pass"; but see below). Though, I suppose "[personal foul] - [roughing the passer]" could be analyzed as left-headed. Non-obstruction penalties in ice hockey are also usually not left-headed (e.g. "unsportsmanlike conduct", not "misconduct - unsportsmanlike").

But the next step is to look at the structure of the announcement overall - in both games there's an announcement of the transgressor, the transgression, and the punishment. These elements are also provided in standardized orders.

Football: [transgression] [transgressor] [punishment].
    e.g. "offsides, number 98, defense, 5 yards, remains second down."

Hockey: [transgressor] [punishment] [transgression].
    e.g. "number 6, white, 2 minutes, hooking".
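For the programmatically inclined, the two orderings can be sketched as slot templates. This is only an illustration of the analysis above: the names `Call`, `ORDERS`, and `announce` are my own inventions, and real announcements obviously vary more than this allows.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One penalty call, split into the three elements discussed above."""
    transgressor: str
    transgression: str
    punishment: str


# Hypothetical slot orders, taken from the two templates in the text.
ORDERS = {
    "football": ("transgression", "transgressor", "punishment"),
    "hockey": ("transgressor", "punishment", "transgression"),
}


def announce(sport: str, call: Call) -> str:
    """Linearize a call in the order the given sport uses."""
    return ", ".join(getattr(call, slot) for slot in ORDERS[sport])


offsides = Call("number 98, defense", "offsides", "5 yards, remains second down")
hooking = Call("number 6, white", "hooking", "2 minutes")

print(announce("football", offsides))
print(announce("hockey", hooking))
```

The same three elements go in; only the linear order differs by sport, which is the whole point of the comparison.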

Some aspects of the [transgressor] component deserve further comment. First, in both games, the [transgressor] element includes the team and (if an individual infraction) the player's number. Second, it also seems that the order within this element can vary - team first or player number first. Third, in neither game is the name of the team or its home mentioned: football infractions refer to "offense", "defense", "kicking team", and "receiving team", while hockey infractions refer to the colour of the transgressor's sweater. (Though, when the arena host officially announces the penalty over the PA, the team's name is used, and before the player's number).

Which leads to a special note about infractions in which a hockey player has been particularly naughty, racking up several penalties at once. In these cases, some of the infractions are called with [punishment] elements but no [transgression]s. In recent years the NHL has been miking referees to let them announce penalties like NFL referees, but in multiple-penalty situations the referee does not bother. Instead, only the arena PA announcer does so, and lists the infractions in PA structure, e.g. "Flyers penalty to number 10, Gord Smith, 2 minutes for roughing, 5 minutes for fighting, 10 minute misconduct, game misconduct" - the last two [punishment] elements have no corresponding [transgression] component (unless misconduct is the transgression).

A last quirk is that pass interference calls in football leave room to be analyzed as left-headed structures. The rules allow pass interference to be called on the defense or offense - on the defense if the defender hinders the intended receiver's movement prior to the catch, and on the offense if the intended receiver hinders a potential interception. Such infractions may be called as follows: "pass interference, offense, number N, X yards ..." or "pass interference, defense, number N, the ball will be spotted ...". Now according to my grouping above, "offense" and "defense" are grouped under the [transgressor] element, but in these examples they could plausibly be analyzed as part of the [transgression] element.

ocr

Last week while looking for some other function in my version of Adobe Acrobat, I stumbled across "OCR" in one of the pull-down menus. OCR stands for Optical Character Recognition, a technology that recognizes text characters in digital images. OCR has been around in some form for over a decade, with performance improving over that span. I presume some combination of scanning + OCR has already been implemented in corpus linguistics, but coupled with improvements in scanner technology, I think OCR could open up lots of new research angles.

First, about text searches in pdf files. As you probably know, much digitally-accessible academic literature is available in a particular type of portable document format (pdf) that has been directly generated from some sort of word processor or text editor. The text in these pdfs is searchable in Acrobat Reader, and depending on the author's settings, can be copied and pasted into other applications. The other type of pdf, generated from an image scan, is not inherently searchable, because each page in the file is a picture rather than an arrangement of characters. These are the pdfs we're used to seeing from course reserves posted online or articles obtained electronically via interlibrary loan. They're readable on screen and print fairly clearly, but they can have huge file sizes, and depending on your printer, they might print at a slow page-per-minute rate. They also are not searchable.

This is where OCR comes in - the pdf becomes searchable in Reader once OCR has been run, and the text can then be pasted into another application. Acrobat Pro might also let you export it to another file format as an alternative to copy-and-paste. The search and copy/paste functions aren't necessarily useful if you're just reading a scan of an old article, but suppose you have text data that's only in hard copy format - maybe a printout from an obsolete file format, or a bound grammar or dictionary. If you scan your hard copy data to pdf, OCR could enable you to make a searchable electronic copy of it. I'm thinking, whatever dictionary you have, you could even convert it to a sortable database, and better yet mark it up with xml.

OK, this has always been possible even without scanning and OCR, assuming you're willing to re-type your source from scratch. But the potential savings in time (and chiropractic care) make the task a whole lot more palatable.

The one major stumbling block is, how well does OCR work? To answer this, I've decided to put Adobe's OCR capabilities to a brief test, using the Woleaian-English dictionary (Sohn & Tawerilmang 1976). Using a new scanner/copier that sends a pdf to my email address, I had 8 pages scanned and OCRed in 20 minutes. As for the OCR results, there is some good and a little bad.

First, the good:

  1. Acrobat's OCR is sophisticated enough to know which end is up in your file. So if you scan something upside-down or sideways, it figures it out, so you don't have to rotate it yourself. In fact, I had some pages that could be OCRed only in a sideways orientation - but the output was always right-side up.

  2. It seems like character recognition is not context dependent. I was worried it would incorporate an English lexicon to help guess top-down-like at a fuzzy character image. This would be really useful if your document is unilingual, but if you're scanning a dictionary of a 1000-speaker language with a quirky orthography, fuzzy characters in non-English words would probably be rendered poorly if they were guessed at with an English lexicon.

  3. Italic text is generally recognizable (but may induce some misreads - see below).

  4. The error rate is surprisingly low (and regular - see below).

And the bad:


  1. Special characters can be problematic. In the dictionary I'm using, homonymic headwords are differentiated with subscript numbers, which appear in the post-OCR pdf as commas, l, z, and non-subscript 3.

  2. After a first OCR run-through, the program maintains a list of "suspects" - characters that it is not fully confident about and has thus not rendered into text yet. You have to go through these manually to accept the character that the application proposes for the suspect. Unfortunately, it gives you no choices.

  3. The available characters so far seem to be limited to the ASCII set, though I may test this by scanning a printout of an IPA-ful document.

  4. Special formats like sub/superscript are troublesome, and there are mixups among the characters 1, l, [square brackets], and (parentheses). And oddly, my entire OCR output (all 8 pages so far) is italic.

Overall, I feel like the good outweighs the bad at this point. I also believe a lot of misread characters can be cleaned out with an automated perl or java script once you've exported the pdf to some other text format. I say this only because a lot of the errors I found were regular.

I considered an error to be any substitution or deletion of a character, or coalescence of two into one. Format errors like inaccurate transfer of boldface, italicization, small-caps, and so on were not counted as errors. I checked one pdf page that comprised two dictionary pages and found 57 errors. This seems like a lot, but considering the file contained 5650 non-space characters, it makes for a character accuracy of about 99%. Waaay better than I can type.
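The arithmetic behind that figure, as a quick sketch (the function name is just a label of mine; the counts are the ones from the page I checked):

```python
def character_accuracy(errors: int, total_chars: int) -> float:
    """Fraction of non-space characters that were read correctly."""
    return 1 - errors / total_chars


# 57 errors among 5650 non-space characters on the checked page.
accuracy = character_accuracy(57, 5650)
print(f"{accuracy:.2%}")  # 98.99%
```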

Moreover, of those errors, 27 are the same mistake: the italic string i) is rendered z) (but in 4 cases, comes out as t)). So if I wanted, I could run a search-and-replace to remove these errors. Another 11 errors are misread subscript characters, half of which could be taken out with another search-and-replace. I should add that a lot of these errors are partially products of the structure of an entry: headword (possibly with subscript number), italic lexical representation in parentheses, followed by part-of-speech, glosses, and italic examples.
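A minimal sketch of what such a search-and-replace might look like, assuming the OCR text has already been exported to plain text. It rests on an assumption I have not verified across the whole dictionary: that a genuine "z)" never occurs, so every "z)" is a misread italic "i)". The sample string is a made-up placeholder, not a real entry, and the rarer "t)" misreads are left alone because a real word could easily end in "t)".

```python
import re

# Assumption: a legitimate "z)" never occurs in this dictionary, so every
# "z)" in the OCR output is a misread italic "i)".
MISREAD_I_PAREN = re.compile(r"z\)")


def fix_italic_i(text: str) -> tuple[str, int]:
    """Replace each misread 'z)' with 'i)'; return the text and the fix count."""
    return MISREAD_I_PAREN.subn("i)", text)


# Hypothetical placeholder entry, not taken from the dictionary.
sample = "headword (abcz) n. gloss; headword2 (defz) v. gloss"
fixed, count = fix_italic_i(sample)
print(fixed)
print(count)
```

Returning the count makes it easy to compare the number of automated fixes against the number of errors found by hand.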

The upshot is, OCR + your own automated error-correction leaves about 0.5% of all characters misread. I think this is great, but it also means the result still needs to be checked by hand. Of course, if you would rather type the whole thing in yourself, you still need to proofread (to look for errors that are far less predictable to boot). I should add that "checking by hand" in Acrobat was actually really easy - I just alt-tabbed between the pre-OCR and post-OCR documents, whereas proofreading a typed copy requires turning your head from book to screen and back every couple of words or so.
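Here is roughly how that residual figure falls out of the counts above. The sketch assumes that all 27 italic-i errors and half of the 11 subscript errors can be removed automatically; if the 4 "t)" cases have to be fixed by hand, the residue is a little higher, which is why I round up to about 0.5%.

```python
total_chars = 5650        # non-space characters on the checked page
total_errors = 57         # all errors found by hand
fixed_italic = 27         # assumed fully fixable by search-and-replace
fixed_subscript = 11 / 2  # "half" of the 11 subscript misreads

residual_errors = total_errors - fixed_italic - fixed_subscript
residual_rate = residual_errors / total_chars
print(f"{residual_errors} errors left, {residual_rate:.2%} of characters")
```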

For the curious, I've posted screenshots of the sample scan before and after applying OCR; I have highlighted errors in the post-OCR document. I guess now I'm going to extend the test to the rest of the dictionary.

12 April 2006

one year of piloklok

Today marks one year of piloklok, which is a fine excuse for taking stock and reflecting upon the paths it’s taken. Piloklok is one of a roll call of language blogs, many of which are linked from this page, and many of which have started to link back. So to begin, many thanks to all my mutual linkers!

The first post on piloklok is a little brief; basically it says “this is a new language blog; what should I call it?”. A minor change of mind that occurred after I set up a blogspot account resulted in the discrepancy between the blog’s name (piloklok) and its url (biloklok.blogspot.com). So far this discrepancy has not posed a problem, and probably never will.

I became involved in linguablogging, nearly a year before starting this site, as a contributor to Eric Bakovic’s phonoloblog, a forum whose mission is to enable linguistic discussion in the blog medium, so long as the topic is phonological in nature. Generally I followed these guidelines, but I occasionally found myself wanting to post about something linguistic but not-so-phonological (or in some cases, not phonological at all). It was the growing lack of fit between some of those posts and the “all things phonology” component of phonoloblog that led to the creation of piloklok.

Several aspects of phonoloblog are notable: its contributors use no pseudonyms, and many are at a career stage of trying to build a strong CV. The same applies to the solitary contributor to piloklok. As a result, I avoid putting truly serious scholarly content on either site, because I’d rather submit it for formal blind review. This leaves less potential content to write up for either phonoloblog or piloklok, and has led me to construct posts about the speech errors produced by reality TV contestants and nativization of Russian last names by English-speaking broadcasters. Basically, phenomena that might be worth bringing to the attention of other linguists, in some cases simply for amusement, but which otherwise probably would not be worth trying to develop into a scholarly reviewed publication. Still, I believe that since piloklok started, my subsequent posts on phonoloblog have returned to more serious phonological content.

In the meantime, a lot of what grabs space on piloklok has fallen within what you might call “the linguistics of sports”. So much so that this Yahoo directory of language and linguistics blogs describes piloklok as “a Blog investigating linguistics, modern language use, and lexical oddities in sport, from a researcher in Santa Barbara, California”. This post might be my most detailed discussion of sports writing.

A particular recurring theme has been what I’ve called lexical crossover – usages like home run in football and quarterback in hockey. One spooky result of this line of research is my realization that ace (as in skilled player of any sport) and point man (as in power-play defenseman in ice hockey) have non-sport origins that both refer to the head of a cavalry or column of troops.

Nevertheless, not everything on piloklok has been about the linguistics of sports. Some other favourites of mine include a discussion of the cran-morph kini, a taxonomy of nicknames for cities and states, my tongue-in-cheek defense of the Morissette song Ironic, and my one snowclone scoop.

I also have assembled a lengthy list of posts I either failed to finish or decided not to publish, including the following:
  • a laudation of Douglas Coupland’s recent coffee-table book “Souvenir of Canada”, which mentions the Inuit at least three times but never ever invokes anything regarding words for snow
  • a diatribe about baseless prescriptivism in The Vocabula Review
  • a brief writeup regarding the definition of planet
  • an unfavourable review of the parody usage guide Eats, shites, and leaves.
I think in the future I will probably keep up with the sports linguistics, like this recent treatment, and maybe I'll post another cartoon. I may also embark upon a lengthy discussion of the parallels of linguistic analysis in the sociological and behavioral structure of certain licensed establishments, with topics like “prescriptivism in wine selection”, “cocktail morphology”, “shooters and the lexicon”, and “kitchen pidgin”. We’ll see.