[FX.php List] Security Concerns with FileMaker Website

Thu Jan 25 12:04:23 MST 2007

In the case of the website I started this thread about, it's hosted  
by a major academic institution (see my email address for a clue), so  
bandwidth isn't really a problem -- in August 2006, the number of  
pages on our servers was between 400k and 500k (not an exaggeration!)  
-- I don't know the stats, but I can only imagine our monthly data  
transfer is on the order of tens of terabytes, if not more.

My main concern is about information confidentiality.  The search  
engines can be told where and where not to go, but the nasty bots  
aren't so kind.  Since the information provided on this site is valid  
contact info, it's sellable info that can easily be picked up by  
automated bots.

The purpose of my site is essentially to be a "Musician White Pages",  
so that these people may be contacted by anyone within or outside the  
University for music jobs.  Within this website, contact info such as  
phone numbers, email addresses, personal websites, and bios are  
provided, based on how much info people provide when they sign up to  
be listed.

Now, when someone signs up for the service, they must agree to our  
privacy policy, which includes a statement:
"By signing up for this service, you agree to release your contact  
information publicly through this website.  Steps have been taken to  
prevent your contact information from showing up in public search  
engines, but the information is still listed on a public website,  
accessible to anyone in the world.  By signing up for this service,  
you acknowledge the risks associated with listing contact information  
publicly."

I have a robots.txt file that indicates that information about the  
service should be indexed, but the actual pages with people's info  
are not to be indexed.

How nasty are these bots?  I got an eye opener this past week.  A new  
entry on these white pages had been public for only 6 hours before  
the email address received a spam message -- and this address was  
brand new, never ever published anywhere before.  This spam message  
was sent to about 20 people, the majority of which were addresses  
listed on this site, so I know this email was harvested from that site.

Phone numbers are a newer concern of mine.  TXT spam isn't big yet,  
but let's face it, it's coming -- and by similar harvesting methods,  
spammers can collect phone numbers and start sending spam TXTs to  
them.  It costs the spammers nothing, but it costs the users, and the  
cell companies have been slow so far to implement controls since it's  
more money in their pocket.

So while these folks have agreed to the "dangers" of listing info  
publicly online, and I could stop there and let things take their  
course, I feel it is my responsibility to do what I can to prevent  
abuse of the information these white pages provide.  I want to  
protect the individuals using our service while still keeping the  
system user friendly for people who have legitimate needs for the  
contact information listed.

I could go on about my position, but I think this is long enough :-)
--Ed
------------------------------------
http://www.edwardford.net

On Jan 25, 2007, at 1:26 PM, Joel Shapiro wrote:

> Can anyone briefly explain the risks of these bots and how  
> "nasty" (a term used in this thread) they can really be?
>
> I've looked at the link Gjermund posted, and it looks like the  
> biggest real problems are email harvesting and use of bandwidth.   
> Ed's original post was because he has many emails and websites on  
> his site, but if there are just one or two contact email addresses  
> on a site, what's the big risk?
>
> Thanks for enlightening me.
>
> -Joel
>
>
> On Jan 25, 2007, at 10:11 AM, Edward L. Ford wrote:
>
>> I didn't think about checking user agents.  I just did a cursory  
>> investigation of this, and I may personally implement this  
>> system.  For the bots that say they're a real browser, I'll set up  
>> other roadblocks to keep the info on my site (relatively)  
>> protected from harvesting.
>>
>> For others interested in checking user agents, I found the  
>> following to investigate:
>> o) Look in the $_SERVER['HTTP_USER_AGENT'] string.  More info @  
>> http://us3.php.net/reserved.variables   There's also a built-in  
>> get_browser() function in PHP, which I never knew about.
>>
>> o) PEAR also has some tools:  Check out
>> http://pear.php.net/package/Net_UserAgent_Detect
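
For anyone following along, the user-agent check described above could be sketched roughly as follows in PHP. This is only a minimal sketch under stated assumptions: the function name looksLikeUnwantedBot() and the list of blocked fragments are hypothetical, and treating an empty User-Agent as suspicious is my own assumption. As noted elsewhere in this thread, a bot that claims to be a normal browser will pass straight through a check like this.

```php
<?php
// Hypothetical helper (not part of PHP or FX.php): returns true when the
// User-Agent string is empty or contains one of the given fragments.
function looksLikeUnwantedBot($userAgent, array $blockedFragments)
{
    // Assumption: a request with no User-Agent at all is treated as unwanted.
    if (trim((string) $userAgent) === '') {
        return true;
    }
    foreach ($blockedFragments as $fragment) {
        // stripos() does a case-insensitive substring search.
        if (stripos($userAgent, $fragment) !== false) {
            return true;
        }
    }
    return false;
}

// Illustrative fragments only; this is an assumption, not a vetted blocklist.
$blocked = array('EmailCollector', 'EmailSiphon', 'libwww-perl');

// In a page script the header arrives via $_SERVER; it may be missing.
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$hideContactInfo = looksLikeUnwantedBot($userAgent, $blocked);
```

A page could then consult $hideContactInfo before printing any contact details, and show either nothing or placeholder text when it is true.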
>>
>> --Ed
>>
>> -----------------------------------
>> http://www.edwardford.net
>>
>> On Jan 25, 2007, at 11:41 AM, Gjermund Gusland Thorsen wrote:
>>
>>> Alternative 3 works in theory but still leaves some bots crawling,
>>> as the worst bots tell your web server that they are the most
>>> popular browser.
>>>
>>> ggt667
>>>
>>> On 1/24/07, Jason LEWIS <jasonlewis at weber.edu> wrote:
>>>> Ed,
>>>>
>>>> There are a few things you can do:
>>>>
>>>> 1. Sue to get them to stop (I don't like this option because it
>>>> makes enemies and takes a VERY long time.)
>>>> 2. Use a robots.txt (This option is good for bots that actually
>>>> honor this.  Google honors the robots.txt.)
>>>> 3. Detect the type of browser before you give them anything and
>>>> ship out bad information for unwanted browser types.  (This is
>>>> only good if the bot owner does not imitate browser variables.)
>>>>
>>>> As for my options, I prefer #3 as I can give the bots something
>>>> that they are looking for, bad information.  In perl, I use
>>>> $ENV{HTTP_USER_AGENT}, but I am not sure how to call this in php.
>>>> Anyone else know?  A quick search returned nothing on this.
>>>> Could this involve $HTTP_ENV_VARS?
>>>>
>>>> Jason
>>>>
>>>> >>> elford at cs.bu.edu 01/23/2007 10:18 PM >>>
>>>> Hello everyone,
>>>> In the past hour, I've done some analysis of various logs and
>>>> emails, and I've come to a chilling realization that I've never
>>>> had before about bots harvesting information from websites -- I
>>>> knew it happened, but I never knew the scope of the problem until
>>>> tonight -- and this is a low traffic website!
>>>>
>>>> So, I have a website which contains a public listing of email
>>>> addresses and websites from a FileMaker database.  I want to stop
>>>> unknown bots from crawling the site.  All of the data comes out of
>>>> FileMaker, nicely formatted as links for the end user's clicking
>>>> convenience.  I have a solution to keep email addresses from being
>>>> harvested, but I was wondering if anyone knows of a way to prevent
>>>> website addresses from being harvested while keeping them clickable
>>>> as hyperlinks.
>>>>
>>>> I thought maybe a PHP redirect link, like redirect.php?id=16 where
>>>> redirect puts a user at the website listed in record 16, but once
>>>> the PHP is all said and done, we're still at the linked website, so
>>>> that doesn't really prevent anything from being harvested.
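
To make the redirect.php?id=16 idea concrete, here is a minimal sketch of the lookup step in PHP. The function name resolveRedirectTarget() and the $exampleRecords array are hypothetical; the array only stands in for whatever would actually be fetched from FileMaker. As the paragraph above says, a bot that follows the redirect still ends up at the real address, so this keeps the URL out of the page source but does not stop harvesting on its own.

```php
<?php
// Hypothetical helper: maps a record id such as "16" to a stored website
// URL, or returns null when the id is invalid or unknown.
function resolveRedirectTarget($id, array $records)
{
    // Accept only strings or integers made up entirely of digits.
    if (!is_string($id) && !is_int($id)) {
        return null;
    }
    if (!ctype_digit((string) $id)) {
        return null;
    }
    $key = (int) $id;
    if (!isset($records[$key])) {
        return null;
    }
    // Only redirect to http or https addresses.
    if (!preg_match('#^https?://#i', $records[$key])) {
        return null;
    }
    return $records[$key];
}

// Stand-in data (an assumption); the real site would query FileMaker.
$exampleRecords = array(16 => 'http://www.example.com/');

// Possible use inside redirect.php (left commented out in this sketch):
// $target = resolveRedirectTarget(isset($_GET['id']) ? $_GET['id'] : null, $exampleRecords);
// if ($target === null) { header('HTTP/1.0 404 Not Found'); exit; }
// header('Location: ' . $target); exit;
```

Rejecting anything that is not a plain number or a known record keeps the script from being used as an open redirect to arbitrary addresses.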
>>>>
>>>> Is there a way to maybe detect if a link was actually clicked by a
>>>> person, and not just passed through by an automated bot?  PHP is
>>>> preferable for such a solution -- JavaScript is too easy to turn
>>>> off.  Or, is there a way to specify that only bots from places like
>>>> Google, Live, and Yahoo are allowed to crawl the site?
>>>>
>>>> Hopefully my predicament is clear.  I need to solve this ASAP...
>>>>
>>>> --Ed
>>>> ---------------------
>>>> http://www.edwardford.net
>>>>
>>>>
>>>> _______________________________________________
>>>> FX.php_List mailing list
>>>> FX.php_List at mail.iviking.org
>>>> http://www.iviking.org/mailman/listinfo/fx.php_list
