11 June 2005

coding for lexical neighbors

A few weeks ago on phonoloblog, I posted some thoughts on lexical neighborhoods for languages with contrastive segment length. The issue is that calculating the number of neighbors for a given item presumably would net different results based on how you conceive a geminate: is it a pair of segments, or is it a single segment?

The clarifying example I used is from Italian: if geminates are considered two segments for neighborhood purposes, then occo and osso are not neighbors, but inna [6/25/05 - I meant inno] and anno are. But if geminates are considered one segment, occo and osso are neighbors.

The basic intuition for the two-segment procedure is like this: your comparison procedure takes two words and decides if they're neighbors. A separate procedure keeps a tally of neighbors for each word in your list. If the words are not identical, and have a difference in length of no more than one segment, then the procedure knows they may be neighbors.

So then you need some contingencies. If the target word is shorter by one segment, then the program needs to watch for a segment addition in the compared item. If the target is longer by one, the procedure needs to watch for a deletion in the compared item . If they are the same length, there may be a substitutiuon. So the procedure checks equivalently positioned segments, looking for mismatches. If it finds more than one, the words are not neighbors.

In a basic measure of computational complexity, this requires 50 lines of code in Java, 9 ifs, 13 pairs of curly brackets, and three while clauses. I have placed the procedure into a larger program that asks for a source data file to read and writes to a different output file.

But adhering to the concept of a geminate as a single segment becomes much more computationally cumbersome. The identical-length comparison still just needs to check for substitutions, but must be updated to tolerate an instance of one geminate subbing for another (e.g., occo otto). It also needs to avoid being tricked by forms like a hypothetical pair occo ooco, which are orthographic neighbors, but not phonological ones (because they differ in the length of two segments).

When words differ in length, you need more contingencies of situations to watch for... a difference in overall length could foretell an addition, or a length contrast. Imagine your pair is (pak, paak). When you reach the third-segment comparison, there is a mismatch. The program needs to know that the a in the second item is the second half of a geminate, and that really the comparison should now be between the 3rd segment of pak and the 4th segment of paak - so tally up a score of one mismatch so far. Simple enough, but then what if you're comparing pak and pook? You already have a mismatch in second position; when you see another in third position, you need to know this time not to make the mismatch score 2.

Now my coding skills are not supreme, and I'm further thrown by the index of the first segment in a string being 0. The code I have for this type of neighbor detection is much more complicated ... 139 lines (including comments), 43 curly brackets, 40 ifs, and 3 whiles. There may be some ways of making it neater, but this works, which I've determined by using a small artificial lexicon. As I show on phonoloblog, my original intuition seems to be confirmed: most words have more neighbors when you allow geminate substitutions.

The last time I blogged about my coding exploits, it was about an alphabetizing script I'd written that was taking several hours to sort about 360,000 words. A commenter kindly pointed me toward a sort command that would automatically alphabetize elements in an array, changing the hours to milliseconds. I'm dreading finding out the same is true for neighbor detection, like the following imaginary script.

public String[] neighbors(String [a]) {
String b[] = new String[2];
b = a.getNeighbors(b);
return b

I bet it couldn't handle the geminates the way mine does, though.


Post a Comment

<< Home