friendly fire : semantic web crawler DDOS

Geonames yesterday became the victim of a distributed denial-of-service attack by a semantic web crawler. I had to block the IP addresses of the network segments used by the crawler to keep the site operational. The crawler was fetching the RDF representations of geonames features in bursts from up to 16 different IP addresses concurrently, and our servers were not able to cope with the additional load.

Frédérick Giasson blogged this week about why RDF dumps are preferable to dereferenceable URIs from a crawler's point of view. Danny Ayers continues the thread, asking whether it is really necessary to keep a local copy of the full remote database in order to build a useful semantic web application.

Neither of them mentions the data provider's point of view, but the preferences are obviously similar. It simply does not make sense to download a huge database record by record if a full dump is available. If you do want to download it record by record, then at least follow a basic policy of politeness. From a provider's point of view, the following points matter:

  • use only a single thread. Definitely don't hit a single data provider from 16 servers concurrently
  • wait a few seconds between requests
  • make the length of the wait depend on the response time of the previous request; this way you automatically slow down when the server is busy serving other users and speed up when it has cycles to spare (see the sketch after this list)
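
As a rough illustration, here is a minimal single-threaded crawler loop in Python that follows these rules. The example URIs, the minimum delay, and the back-off factor are assumptions for illustration, not anything geonames prescribes:

```python
import time
import urllib.request

# Example feature URIs; a real crawler would discover these itself.
# (The sws.geonames.org pattern is illustrative here.)
URIS = [
    "http://sws.geonames.org/2657896/about.rdf",
    "http://sws.geonames.org/2661552/about.rdf",
]

MIN_DELAY = 2.0  # assumed floor: always wait at least two seconds
BACKOFF = 5.0    # assumed factor: wait a multiple of the last response time

def polite_crawl(uris):
    """Fetch URIs one by one, adapting the pause to the server's load."""
    for uri in uris:
        start = time.time()
        with urllib.request.urlopen(uri) as response:
            body = response.read()
        elapsed = time.time() - start
        yield uri, body
        # A slow response means a busy server: back off proportionally.
        time.sleep(max(MIN_DELAY, BACKOFF * elapsed))

if __name__ == "__main__":
    for uri, body in polite_crawl(URIS):
        print(uri, len(body))
```

Because everything runs in one thread and the sleep grows with the server's response time, the crawler automatically yields when the site is busy and speeds up again when it is idle.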

The semantic web suffers from a lack of available data because most database owners are reluctant to open their databases. I don't think that attitude will change if they have to fear becoming the victim of a DDOS attack by a semantic web crawler gone berserk.

In order to protect the geonames website from future denial-of-service attacks on the web services, we will move the www part to a separate server. For data synchronization between geonames servers we will use a messaging architecture. This has the nice side effect that anybody running a geonames mirror can subscribe to the message topic and synchronize their own mirror in near real time; a sketch of such a subscriber follows below.
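
A rough sketch of what such a mirror subscriber could look like, assuming a STOMP-capable broker and the third-party stomp.py client. The broker host, topic name, and payload handling are purely illustrative, not the actual geonames setup:

```python
import time
import stomp  # third-party STOMP client: pip install stomp.py

class MirrorListener(stomp.ConnectionListener):
    """Receives update messages and applies them to the local mirror."""

    def on_message(self, frame):
        # The payload format is an assumption; the real feed may differ.
        apply_update(frame.body)

def apply_update(payload):
    # Placeholder: parse the message and update the local database here.
    print("received update:", payload[:80])

# Hypothetical broker address and topic name, for illustration only.
conn = stomp.Connection([("broker.example.org", 61613)])
conn.set_listener("", MirrorListener())
conn.connect(wait=True)
conn.subscribe(destination="/topic/geonames.updates", id="mirror-1", ack="auto")

while True:  # keep the process alive so messages keep arriving
    time.sleep(60)
```

Since a topic delivers each message to every subscriber, any number of mirrors can follow the same update stream without adding load to the main servers.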

2 thoughts on “friendly fire : semantic web crawler DDOS”

  1. Hey, it wasn't swoogle, was it? We had some problems when we first started up in 2004, but I think we now have a reasonable set of procedures in place to limit our impact on sites with lots of RDF. In particular, we generally stop after fetching 50K documents from any one site.

  2. Hi Tim

    It wasn't swoogle, it was SWSE. They have now asked for the RDF dump.
    If you or anyone else running a semantic web crawler is interested in the RDF dump, just drop me an email. For the time being I start the dump manually on request. If demand increases, I will implement an automatic, periodic dump.

    Marc
