friendly fire : semantic web crawler DDOS
February 3, 2007
Geonames yesterday became the victim of a distributed denial-of-service attack by a semantic web crawler. I had to block the IP addresses of the network segments used by the crawler to keep the site operational. The crawler was fetching the RDF representation of geonames features in bursts from up to 16 different IP addresses concurrently, and our servers were not able to deal with this additional load.
Frédérick Giasson blogged this week about why RDF dumps and dereferenceable URIs are preferable from a crawler's point of view. Danny Ayers continues the thread and asks whether it is really necessary to have a local copy of the full remote database to build a useful semantic web application.
Neither of them mentions the data provider's point of view, but the preferences are obviously similar. It simply does not make sense to download a huge database record by record if a full dump is available. If you do want to download it record by record, then at least follow the basic rules of politeness. From a provider's point of view the following is important:
- use only a single thread. Definitely don’t use 16 servers concurrently to grab data from a single data provider
- wait some seconds between requests
- make the length of the wait depend on the response time of the previous request. This way you will automatically slow down if the server is busy serving other users and speed up if the server has cycles to spare for you.
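The three rules above can be sketched in a few lines of Python. This is only an illustration, not the code of any particular crawler: the names `next_delay` and `polite_crawl`, the 2 second floor, the 60 second ceiling and the factor of 10 are all assumptions chosen for the example. The loop is single-threaded, always waits between requests, and scales the wait with the response time of the previous request.

```python
import time
import urllib.request

# Assumed tuning values for illustration; pick your own.
MIN_DELAY = 2.0    # never wait less than this many seconds
MAX_DELAY = 60.0   # never wait longer than this many seconds
FACTOR = 10.0      # wait this many times the last response time


def next_delay(response_time, min_delay=MIN_DELAY,
               max_delay=MAX_DELAY, factor=FACTOR):
    """Wait longer when the server answered slowly, shorter when it was fast."""
    return max(min_delay, min(max_delay, factor * response_time))


def http_get(url):
    """Fetch one URL and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def polite_crawl(urls, fetch=http_get, sleep=time.sleep, clock=time.monotonic):
    """Fetch the URLs one at a time, in a single thread, pausing in between."""
    results = {}
    delay = 0.0
    for url in urls:
        if delay > 0:
            sleep(delay)                  # pause before every request but the first
        start = clock()
        results[url] = fetch(url)
        delay = next_delay(clock() - start)  # adapt to the server's current load
    return results
```

With these assumed values, a server that answers in 0.1 seconds is asked again after 2 seconds, while a server that needs 3 seconds to answer gets 30 seconds of rest before the next request.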
The semantic web is suffering from a lack of available data, since most database owners are reluctant to open their databases. I don't think this attitude is going to change if they have to fear becoming the victim of a DDOS attack by a semantic web crawler gone berserk.
To protect the geonames website from future denial-of-service attacks on the web services, we will move the www part to another server. For data synchronization between geonames servers we will use a message architecture. The message cluster architecture has the nice side effect that anybody running a geonames mirror can subscribe to the message topic and synchronize their own mirror in near real time.