[FX.php List] [OFF] levenshtein and related methods for user helper

Troy Meyers tcmeyers at troymeyers.com
Wed Apr 22 11:27:01 MDT 2009


Not necessarily an FX.php question, but perhaps you have experience with this.

On our site there's a series of pages where seed donors specify the species and cultivar name of the parent plants involved in pollinations. This is simple if the donor has never reported the particular parent before: they just enter the genus, specific epithet, and cultivar name in a form with text fields. If there are misspellings, we correct them on this end for the publicly displayed name, though we keep the misspelling as entered so that the user can find the original record again. Also, some species are reported under one name by the user, but botanists currently call the species something else. For example, many people still use the name Encyclia cochleata, or Anacheilium cochleatum, but the name preferred by most botanists these days is Prosthechea cochleata. Another example: Davejonesia prenticei and Dendrobium lichenastrum are the same species, which illustrates that there may be no similarity whatsoever between alternate names.

The complication occurs when the seed donor later registers another pollination with either the same parent plant, or a different one, but of the same species. The donor may not remember registering the parent before, and at this point we want to improve the entry page so that they can easily see a list of likely pre-registered plants.

The tough part is generating the list for display, for the donor to check for prior registrations. If it weren't for the issue of misspellings and name changes, this would be trivial. But both are very common, and some donors have over 1000 individual parent plants already registered, so just showing a list of all of a donor's plants is not a good solution.

What I have in mind is for the user, when reporting a new pollination, to fill out the usual genus and specific epithet information (regardless of whether they think it's a new registration or not, since they are so often wrong) and then have the page display a list of prior parent plant registrations of that donor that _roughly_ match either names as entered, names as displayed, or names that are known synonyms of species. These names are in 3 different tables but can be merged and all treated the same for display and choosing purposes. Some brief information about the prior pollinations would be displayed with each as a memory-jogger.

Then the user could choose the plant he recognizes, or the choice "Not previously registered".

......

SO, I've been trying to figure out how to generate the list of names (and associated clues).

It seems like this could be a two-step process. First, do FX.php finds of a subset of the donor's prior registrations (three finds, one for each of the three tables: 'prior as entered', 'display name', and 'synonyms') into a PHP array. Second, run levenshtein() on the genus and specific epithet of each name in that array, comparing it to the just-entered name provided by the user. The names in the array would then be displayed in Levenshtein-distance order (as radio buttons), and names whose Levenshtein distance is too large would not be displayed.
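The second step could be sketched roughly as follows. This is only an illustration under assumptions: rankCandidates() is a hypothetical helper name, the $candidates array stands in for whatever the three finds return once merged, and the cutoff of 4 is an arbitrary starting value that would need tuning against real data.

```php
<?php
// Hypothetical helper: rank candidate names by Levenshtein distance
// to the name the donor just typed, dropping anything too far away.
// $maxDistance = 4 is an assumed cutoff, not a tested value.
function rankCandidates($entered, array $candidates, $maxDistance = 4)
{
    $needle = strtolower(trim($entered));
    $ranked = [];
    foreach ($candidates as $name) {
        // Compare case-insensitively; levenshtein() itself is case-sensitive.
        $distance = levenshtein($needle, strtolower(trim($name)));
        if ($distance <= $maxDistance) {
            // Using the name as the key also collapses duplicates
            // that appear in more than one of the merged tables.
            $ranked[$name] = $distance;
        }
    }
    asort($ranked); // smallest distance first, keys preserved
    return $ranked;
}

// Stand-in for the merged results of the three finds.
$candidates = [
    'Prosthechea cochleata',
    'Encyclia cochleata',
    'Dendrobium lichenastrum',
];

$matches = rankCandidates('Encylia cochleata', $candidates);
foreach ($matches as $name => $distance) {
    echo $distance, '  ', $name, "\n";
}
```

With the misspelled entry 'Encylia cochleata', only 'Encyclia cochleata' survives the cutoff, at distance 1. 'Prosthechea cochleata' is dropped even though it is the same species, which is why the synonym names need to be in the candidate array in the first place. A real version would carry record IDs along with each name rather than bare strings.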

While the initial find could be as simple as finding all of the donor's registered plants, for donors with, say, 1000 registered plants, the array would have to hold the 1000 entered names, the 1000 displayed names, and perhaps 5000 synonym names, assuming that each species averages 5 synonyms, which is not unrealistic.

It seems better to limit this initial find to things likely to match. A thought that occurred to me is to just match the first letter (and donor ID) but then common misspellings where the first letter is wrong would be wiped out (happens with names written phonetically).

This led me to think that perhaps using some of the phonetic tools like soundex() and/or metaphone() might be helpful, but for that to work at this stage, I'd need to have the soundex and metaphone keys already in FileMaker as an indexed field... and those functions don't exist in FileMaker, so I suppose they'd have to be reproduced as custom functions ... but then, would they really work well? Is metaphone really smart enough to know that "pseudepidendrum" sounds like "soodepidendrum"?
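That last question could be checked directly in PHP before building anything in FileMaker. The sketch below just prints both keys for the two spellings side by side; phoneticKeys() is a hypothetical helper name, and I am not asserting here what metaphone() returns for either word. One thing that is certain is that soundex() keeps the first letter of the word as the first character of its key, so a "P..." spelling and an "S..." spelling can never share a soundex key.

```php
<?php
// Hypothetical helper: compute both built-in phonetic keys for one word.
function phoneticKeys($word)
{
    return [
        'soundex'   => soundex($word),
        'metaphone' => metaphone($word),
    ];
}

$correct  = phoneticKeys('pseudepidendrum');
$phonetic = phoneticKeys('soodepidendrum');

// Print each pair of keys and whether the two spellings collide.
foreach (['soundex', 'metaphone'] as $method) {
    $same = ($correct[$method] === $phonetic[$method])
        ? 'same key'
        : 'different keys';
    echo $method, ': ', $correct[$method], ' vs ', $phonetic[$method],
        ' (', $same, ")\n";
}
```

If the keys turn out to differ for cases like this one, a phonetic key stored in an indexed field would not help much as the prefilter, and some other way of narrowing the initial find would be needed.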

I haven't gotten this far, but I'm guessing once I can build a reasonable array of possibles, levenshtein() will easily allow me to order and limit the list displayed to the user.

.....

I need a sanity check on all this. Is there another way I should be thinking about this?

Thanks for any suggestions.

-Troy


