Instead of giving our private information to Google to use for 18 months, we are giving it to you to use indefinitely?
Where is your privacy policy?
And, what is your hosting provider's policy? I found out recently that my hosting provider refused to make any guarantees; that means that I cannot make any guarantees either, since my hosting provider has full access to my server.
A privacy policy is very high on my list of things to do, along with an explanation of how the site works so that people can feel confident in their privacy. Right now, the only thing I record is a SHA1 hash of everyone's IP address and a timestamp so that I can track how many unique people have used it. With the SHA1 hash, there's no easy way to trace that back to the user (it would be very difficult). I'm also talking to my host about the web server logs and how quickly those can be purged.
Can you use a SHA-2 hash instead? Those variants are considerably more difficult to break.
A problem I see with your service compared to an anonymizing proxy like Tor is that you are still a single point of failure (please correct me if I'm wrong though). If you were legally forced to turn over search records (as the government was attempting with Google a while back), then the requests could be traced directly back to the user.
You mention clearing the database daily, which is a good idea. But again, if it was compromised and a snapshot could be taken, then a brute-force crack of your SHA-1 hashes would be possible. Basically, everyone is trusting the security of your database. A misdirection service which telescopes the request through interconnected proxies will not have this single-point-of-failure issue.
Not criticizing your implementation, just making some observations. I think this is a great idea, mainly because your site is so easy for people to use: they don't need to install a client application.
Interesting points. So if someone was able to get my PHP code, they could find the salt, dump the database, generate the lookup tables by hashing every possible IP address plus the salt, and then they would be able to figure out every IP that has used the site since midnight. But they would still only have your IP, as I do not record any of the search results. This chain of events also shows that the hash algorithm I use really doesn't matter. It's protecting the salt and the database that matters most. Anyone have thoughts on that?
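To make the lookup-table scenario above concrete, here is a minimal Python sketch. It assumes the stored value is a hash of salt + IP string; the site's real PHP code, salt, and concatenation order are not known here, so `hash_ip`, `build_lookup_table`, the salt value, and the example addresses are all hypothetical. The search is limited to one /24 so it runs instantly, but the whole IPv4 space is only about 4.3 billion addresses, which is why swapping SHA-1 for SHA-2 would not change the outcome once the salt leaks.

```python
import hashlib
import ipaddress


def hash_ip(ip: str, salt: str, algorithm: str = "sha1") -> str:
    """Hash salt + IP. The concatenation order is an assumption."""
    data = (salt + ip).encode("utf-8")
    return hashlib.new(algorithm, data).hexdigest()


def build_lookup_table(network: str, salt: str, algorithm: str = "sha1") -> dict:
    """Map digest -> IP for every address in the given network."""
    return {
        hash_ip(str(addr), salt, algorithm): str(addr)
        for addr in ipaddress.ip_network(network)
    }


# Hypothetical leaked salt and database rows (documentation-range addresses).
salt = "example-salt"
stored = [hash_ip("192.0.2.17", salt), hash_ip("192.0.2.200", salt)]

# The attacker hashes every candidate address and looks the stored digests up.
table = build_lookup_table("192.0.2.0/24", salt)
recovered = [table.get(digest) for digest in stored]
print(recovered)

# The same attack works unchanged with a SHA-2 variant.
table256 = build_lookup_table("192.0.2.0/24", salt, "sha256")
recovered256 = [table256.get(hash_ip(ip, salt, "sha256")) for ip in recovered]
```

The work grows with the number of candidate addresses, not with the strength of the hash, so keeping the salt and the database out of reach is what actually protects users in this setup.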
Also, another feature I was thinking of adding was an SSL option so you could securely access the site. However, as I don't make any money from the site, it becomes more difficult to justify additional expenses.
You mentioned that you keep timestamp info for each inbound connection. Is that a requirement? I only ask because, in the event your database was compromised, this could be used to match up a request on the search engine's servers (with your server's IP as the source) with the connection on your server and pinpoint the user.
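A rough Python sketch of the matching described above, under invented assumptions: the two log formats, the `correlate` helper, and the two-second window are hypothetical and only illustrate why stored timestamps help an attacker.

```python
from datetime import datetime, timedelta


def correlate(proxy_log, engine_log, window_seconds=2.0):
    """For each search-engine request, list the proxy connections that
    arrived within `window_seconds` before it."""
    window = timedelta(seconds=window_seconds)
    matches = {}
    for query, seen_at in engine_log:
        matches[query] = [
            ip_hash
            for ip_hash, arrived_at in proxy_log
            if timedelta(0) <= seen_at - arrived_at <= window
        ]
    return matches


t0 = datetime(2008, 1, 1, 12, 0, 0)

# Hypothetical proxy database rows: (hashed IP, connection timestamp).
proxy_log = [("hashA", t0), ("hashB", t0 + timedelta(seconds=60))]

# Hypothetical search-engine log rows, all arriving from the proxy's IP.
engine_log = [
    ("query one", t0 + timedelta(seconds=0.5)),
    ("query two", t0 + timedelta(seconds=60.4)),
]

matches = correlate(proxy_log, engine_log)
print(matches)
```

Without the timestamp column, the proxy side of this join has nothing to match on, which is the argument for dropping it.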
One thing you could do, which should be easy, is send chaff. Randomly send out connection requests to some of the search engines from your server even though a user is not requesting the data. It makes tying connections back to the users more difficult because you don't know which request is real and which is fake.
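One way the chaff idea could be sketched in Python is below. Everything in it is an assumption for illustration: the decoy query list, the placeholder engine names, the `chaff_schedule` helper, and the choice of exponentially distributed gaps. It only plans when decoys would go out; actually sending the requests is left to whatever HTTP code the site already uses.

```python
import random

# Hypothetical decoy queries and placeholder engine names.
DECOY_QUERIES = ["weather forecast", "bus timetable", "pasta recipe", "movie reviews"]
ENGINES = ["engine-a", "engine-b", "engine-c"]


def chaff_schedule(duration_seconds, mean_gap_seconds, rng=None):
    """Plan decoy requests over a time span.

    Returns a list of (offset_seconds, engine, query) tuples whose gaps are
    drawn from an exponential distribution, so the decoys do not arrive on a
    fixed, easily filtered interval.
    """
    if rng is None:
        rng = random.Random()
    schedule = []
    offset = rng.expovariate(1.0 / mean_gap_seconds)
    while offset < duration_seconds:
        schedule.append((offset, rng.choice(ENGINES), rng.choice(DECOY_QUERIES)))
        offset += rng.expovariate(1.0 / mean_gap_seconds)
    return schedule


# Plan one hour of decoys, roughly one every 30 seconds on average.
schedule = chaff_schedule(3600, 30, random.Random(42))
for offset, engine, query in schedule[:3]:
    print(f"{offset:8.1f}s  {engine}  {query}")
```

Decoys are only useful if they look like real traffic, so a fixed four-item list like this one would be easy to filter out in practice; a larger and more varied pool of queries would be needed.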
SSL would eventually be important because it would protect against man-in-the-middle attacks. Without it, someone could hijack connections to your server, claim to be you, and then get all of the requests. Users could potentially be putting in very sensitive information, so this could be a big deal. It would also protect against someone sniffing inbound requests to your server, since the channel is encrypted.
I understand the expenses thing, so I wouldn't worry too much about that. I'd prefer your service be free and not use SSL than to charge for usage. Although I wouldn't mind some ads; you could monetize a bit on that if you wanted.
It's true that timestamps are probably no longer important. I initially had them in there to monitor usage while the test group was fairly limited. I will remove them in the next day or two as the site gets going (or dies...).
The idea of chaff is interesting, and it wouldn't be too difficult to implement. Thanks for the idea.
The thing about low-cost web hosting is that it is impossible to stop your web host--or their contractors--from getting the data. Even if you use TLS (SSL), the web host and their associates have full access to your private key. For most hosting solutions, the best that you can do is to colocate your own physically locked server and use TLS to encrypt everything.
Without using TLS, you cannot prevent the user's ISP from recording--and even reselling--your users' search histories. Similarly, your hosting providers could be doing the same thing. Keep in mind that hosting and bandwidth are a multi-level value chain--your host is probably renting space and bandwidth from somebody else, who is renting from somebody else, who is renting from somebody else. Any one of those companies and/or their rogue employees can collect, re-transmit, block, and/or redirect (e.g. man-in-the-middle) your users' queries without your knowledge.