[FX.php List] [OFF] levenshtein and related methods for user helper

Dale Bengston dbengston at tds.net
Wed Apr 22 12:49:28 MDT 2009


Hi Troy,

This all seems like a reasonable process to me. I have some routines  
that rip through 30,000 or 40,000 records to aggregate information, so a  
couple of thousand should be fine. Your users should thank you for  
figuring out how to do such a clean display on such a messy search.

Dale

PS I don't seem to be getting my own posts any more. Am I banned? Did  
my presets change? Are you receiving me, list?

On Apr 22, 2009, at 12:27 PM, Troy Meyers wrote:

> Not necessarily an FX.php question, but perhaps you have experience  
> with this.
>
> On our site there's a series of pages where seed donors specify the  
> species and cultivar name of the parent plants involved in  
> pollinations. This is simple if the donor has never reported the  
> particular parent before: they just enter the genus, specific  
> epithet, and cultivar name in a form with text fields. If there are  
> misspellings, we correct them on this end for the publicly displayed  
> name, though we leave the misspelling in place so that the user can  
> find the original record again. Also, some species are reported with  
> one name by the user, but botanists currently call the species  
> something else... for example many people still use the name  
> Encyclia cochleata, or Anacheilium cochleatum, but the name  
> preferred by most botanists these days is Prosthechea cochleata.  
> Another example could be: Davejonesia prenticei and Dendrobium  
> lichenastrum, which are the same species, just to illustrate that  
> there may be no similarity whatsoever between alternate names.
>
> The complication occurs when the seed donor later registers another  
> pollination with either the same parent plant, or a different one,  
> but of the same species. The donor may not remember registering the  
> parent before, and at this point we want to improve the entry page  
> so that they can easily see a list of likely pre-registered plants.
>
> The tough part is generating the list for display, for the donor to  
> check for prior registrations. If there weren't the issue of  
> misspellings and name changes, this would be trivial. But, both are  
> very common, and some donors have over 1000 individual parent plants  
> already registered, so just showing a list of all of a donor's plants  
> is not a good solution.
>
> What I have in mind is for the user, when reporting a new  
> pollination, to fill out the usual genus and specific epithet  
> information (regardless of whether they think it's a new  
> registration or not, since they are so often wrong) and then have  
> the page display a list of prior parent plant registrations of that  
> donor that _roughly_ match either names as entered, names as  
> displayed, or names that are known synonyms of species. These names  
> are in 3 different tables but can be merged and all treated the same  
> for display and choosing purposes. Some brief information about the  
> prior pollinations would be displayed with each as a memory-jogger.
>
> Then the user could choose the plant he recognizes, or the choice  
> "Not previously registered".
>
> ......
>
> SO, I've been trying to figure out how to generate the list of names  
> (and associated clues).
>
> It seems like this could be a two step process, first do an FX.php  
> find of a subset of the donor's prior registrations (3 of them, from  
> the three tables, 'prior as entered' 'display name' and 'synonyms')  
> into a PHP array, then do a levenshtein() of the genus and specific  
> epithet in that array comparing them to the just-entered name  
> provided by the user. Then, the names array would be displayed in  
> Levenshtein-distance order (as radio buttons), and names with a  
> Levenshtein distance that is too far would not be displayed.
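
A rough sketch of that second step, in case it helps to see it written out. It assumes the three FX.php finds have already been merged into one PHP array. rankCandidates(), the 'name' and 'source' keys, the sample rows, and the cutoff of 4 are all made up for illustration, and the cutoff would need tuning against real data.

```php
<?php
// Hypothetical helper (not part of FX.php): rank candidate names by
// Levenshtein distance to the name the donor just typed, and drop
// anything farther away than $maxDistance.
function rankCandidates($entered, array $candidates, $maxDistance = 4)
{
    $needle = strtolower(trim($entered));
    $ranked = array();
    foreach ($candidates as $candidate) {
        // Compare case-insensitively so "encyclia" and "Encyclia" agree.
        $distance = levenshtein($needle, strtolower(trim($candidate['name'])));
        if ($distance <= $maxDistance) {
            $candidate['distance'] = $distance;
            $ranked[] = $candidate;
        }
    }
    // Closest names first.
    usort($ranked, function ($a, $b) {
        return $a['distance'] - $b['distance'];
    });
    return $ranked;
}

// Stand-in for the merged results of the three finds (invented sample rows).
$candidates = array(
    array('name' => 'Dendrobium lichenastrum', 'source' => 'display name'),
    array('name' => 'Prosthechia cochleta',    'source' => 'as entered'),
    array('name' => 'Encyclia cochleata',      'source' => 'synonym'),
    array('name' => 'Prosthechea cochleata',   'source' => 'display name'),
);

$matches = rankCandidates('Prosthechea cochleata', $candidates);
foreach ($matches as $m) {
    printf("%d  %s (%s)\n", $m['distance'], $m['name'], $m['source']);
}
```

With these sample rows only the exact name and the misspelled "Prosthechia cochleta" survive the cutoff. "Encyclia cochleata" is dropped because it is far from what was typed; a synonym row like that would only surface if the donor typed something close to the old name.
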
>
> While the initial find could be as simple as finding all of the  
> donor's registered plants, for donors with, say, 1000 registered  
> plants, the array would have to hold the 1000 entered names, the  
> 1000 displayed names, and perhaps 5000 synonym names, assuming that  
> each species averages 5 synonyms, which is not unrealistic.
>
> It seems better to limit this initial find to things likely to  
> match. A thought that occurred to me is to just match the first  
> letter (and donor ID) but then common misspellings where the first  
> letter is wrong would be wiped out (happens with names written  
> phonetically).
>
> This led me to think that perhaps using some of the phonetic tools  
> like soundex() and/or metaphone() might be helpful, but for that to  
> work at this stage, I'd need to have the soundex and metaphone keys  
> already in FileMaker as an indexed field... and those functions  
> don't exist in FileMaker, so I suppose they'd have to be reproduced  
> as custom functions ... but then, would they really work well? Is  
> metaphone really smart enough to know that "pseudepidendrum" sounds  
> like "soodepidendrum"?
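
One cheap way to answer that before building any FileMaker custom functions is to run the two spellings through PHP's own soundex(), metaphone(), and levenshtein() and compare the results. This is only a quick check, not a recommendation; the variable names are made up. One thing that is certain going in: soundex() always keeps the first letter of the word, so these two spellings cannot share a soundex key.

```php
<?php
// The two spellings from the question above.
$correct  = 'pseudepidendrum';
$phonetic = 'soodepidendrum';

// Compute each phonetic key for both spellings.
$keys = array(
    'soundex'   => array(soundex($correct), soundex($phonetic)),
    'metaphone' => array(metaphone($correct), metaphone($phonetic)),
);

foreach ($keys as $name => $pair) {
    $verdict = ($pair[0] === $pair[1]) ? 'same key' : 'different keys';
    printf("%-9s %s vs %s => %s\n", $name, $pair[0], $pair[1], $verdict);
}

// For comparison: the plain edit distance between the two spellings.
$distance = levenshtein($correct, $phonetic);
printf("levenshtein distance: %d\n", $distance);
```

The edit distance between the two spellings is 3 (drop the "p", then change "eu" to "oo"), so a wrong first letter is not fatal for levenshtein() itself. It only hurts if the initial find filters on the first letter or on a key that preserves it.
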
>
> I haven't gotten this far, but I'm guessing once I can build a  
> reasonable array of possibles, levenshtein() will easily allow me to  
> order and limit the list displayed to the user.
>
> .....
>
> I need a sanity check on all this. Is there another way I should  
> be thinking about this?
>
> Thanks for any suggestions.
>
> -Troy
