Wikipedia Web Services

Over the last couple of weeks, a new data extract for the Wikipedia web services was implemented and deployed. The major change is the dramatically increased number of geolocated Wikipedia articles.

A new attribute ‘rank’ has been added to the XML and JSON responses. It gives an indication of the popularity or relevance of an article. The rank is an integer from 1 for the least popular articles to 100 for the most popular articles. It is calculated from the number of links pointing to an article and the article length. The articles are more or less evenly distributed over the 100 ranks.
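As a rough sketch of how a client might use the new attribute, the snippet below filters a parsed JSON response by rank. Only the ‘rank’ attribute and its 1 to 100 range come from this post; the ‘geonames’ and ‘title’ keys, the sample entries and the function name `top_articles` are assumptions made for illustration.

```python
import json

# Made-up sample shaped like a JSON response. Only 'rank' is described in
# the post; the other keys and all values are illustrative assumptions.
SAMPLE_RESPONSE = json.loads("""
{"geonames": [
  {"title": "Example Village", "rank": 12},
  {"title": "Example City", "rank": 97},
  {"title": "Example Lake", "rank": 55}
]}
""")


def top_articles(response, min_rank=50):
    """Return articles whose rank is at least min_rank, most popular first."""
    articles = response.get("geonames", [])
    # Entries without a rank are treated as rank 0 and therefore dropped.
    ranked = [a for a in articles if int(a.get("rank", 0)) >= min_rank]
    return sorted(ranked, key=lambda a: int(a["rank"]), reverse=True)


if __name__ == "__main__":
    for article in top_articles(SAMPLE_RESPONSE):
        print(article["rank"], article["title"])
```

Since the articles are spread more or less evenly over the ranks, a threshold of 50 should keep roughly the more popular half of any result set.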

The ‘elevation’ field is now filled for nearly all articles. Where no elevation could be parsed from the article itself, the field was filled with a reverse-geocoded value from SRTM3 or ASTER. The ‘countryCode’ coverage has also been improved. The attributes ‘population’ and ‘elevation’ are no longer set to ‘0’ for unknown values; they are left empty instead.
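For clients, the practical consequence is that an empty ‘population’ or ‘elevation’ now means "unknown", while 0 is a real value. A minimal sketch of how a client might handle this, assuming the response has already been parsed into a dictionary (the helper names are hypothetical):

```python
def optional_int(value):
    """Convert a field to int, returning None when it is missing or empty."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "":
        return None
    return int(text)


def describe_elevation(article):
    """Build a short human-readable elevation string for one article."""
    elevation = optional_int(article.get("elevation"))
    if elevation is None:
        return "elevation unknown"
    # An elevation of 0 is now a genuine value, not a placeholder.
    return f"{elevation} m"


if __name__ == "__main__":
    print(describe_elevation({"elevation": "432"}))
    print(describe_elevation({"elevation": ""}))
    print(describe_elevation({}))
```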

Links for Toponyms

A new pseudo language code ‘link’ has been added to the alternate name edit function, and the links to the English Wikipedia have been inserted as alternate names. Links to the corresponding Wikipedia articles have often been requested. While they were available on the forum, linked in some threads, they were not included in the normal dump. With this simple change they can now be included in the dump as alternate names and can easily be maintained using the wiki interface. All other kinds of links, hotel websites for hotel entries for example, can be added in the same manner.

The language codes for the alternate names are normally the 2-character ISO 639 language codes. For more exotic languages that do not have a 2-character ISO code, the 3-character code is used instead.

Pseudo codes

  • ‘post’ for postal codes
  • ‘link’ for a link to a website
  • ‘iata’, ‘icao’ and ‘faac’ for the respective airport codes
  • ‘abbr’ for an abbreviation
  • ‘fr_1793’ for names used during the French Revolution
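The code scheme described above can be sketched as a small classifier. This is only an illustration: the function names, the return labels and the sample (code, name) pairs are assumptions, and the check looks at the shape of a code only, not at whether it is a registered ISO 639 code.

```python
# Pseudo codes listed in this post.
PSEUDO_CODES = {"post", "link", "iata", "icao", "faac", "abbr", "fr_1793"}


def classify_language_code(code):
    """Classify an alternate-name language code by its shape."""
    if code in PSEUDO_CODES:
        return "pseudo"
    if len(code) == 2 and code.isalpha():
        return "iso639-2char"
    if len(code) == 3 and code.isalpha():
        return "iso639-3char"
    return "unknown"


def extract_links(alternate_names):
    """Return the names stored under the 'link' pseudo code."""
    return [name for code, name in alternate_names if code == "link"]


if __name__ == "__main__":
    # Hypothetical (code, name) pairs for a single toponym.
    sample = [
        ("en", "Zurich"),
        ("gsw", "Züri"),
        ("post", "8001"),
        ("link", "http://en.wikipedia.org/wiki/Zurich"),
    ]
    for code, name in sample:
        print(code, classify_language_code(code), name)
    print(extract_links(sample))
```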

Geotagged Wikipedia Articles

We have updated our database of geotagged Wikipedia articles and increased the total number of articles to 1.2 million, up from 800’000. The most popular language is still English with 170’000 articles (up from 137’000), followed by Dutch with 107’000 (up from 67’000). Fifth is Volapük, a language I have to admit I had never heard of before. It is a constructed language derived from English and German. Most articles in Volapük, which literally translates to ‘world speak’, are stubs created by Wikipedia bots.

The number of entries for German would have decreased had it not been for our merging of the previous parse result with the newest parse. The decrease is mainly caused by Wikipedians who develop bots to convert established templates into new templates. The new templates are used for only a minuscule fraction of articles. This trend seems to show that while the Wikipedia approach works well for unstructured textual data, it does not work so well for structured data.

Wikinear

An application quite popular in the blogosphere these days is Wikinear. It is a very simple application for mobile phones that makes use of some interesting new technologies and web services: OAuth, Fire Eagle, GeoNames and the Google Static Maps API.

Wikipedia Thumbnail Images

The first Wikipedia load this year has brought the total number of georeferenced Wikipedia articles available on GeoNames to 611,758. English will soon cross the magic number of 100,000 (currently 99,333), up from 81,282 in November.

Thumbnail images for Wikipedia articles are a new experimental addition to the GeoNames web services, the full text search and the maps mashup. Around a third of all articles on GeoNames have thumbnail images. A simple algorithm determines which image to use as the thumbnail if more than one image could be parsed from the original article.

Wikipedia mashup with thumbnail images

Wikipedia Load

With the newest load we have added French to the languages for which GeoNames supports Wikipedia full text search and text blurbs. These features are now available in English, German, French, Spanish and Polish.

The total number of entries since July has increased from 230,000 to 500,000, an increase of 117%. The number of entries in English has increased from 53,000 to 81,000 (52%) and the number of entries in German from 38,000 to 51,000 (26%). Other languages are catching up: Dutch, French and Italian also have around 50,000 geolocated entries each.

You can find detailed numbers on the GeoNames Wikipedia page:

wikipedia-2006-nov.html

The numbers for previous months are available here:

wikipedia-2006-july-15.html

wikipedia-2006-april-23.html

wikipedia-2006-march-05.html

wikipedia-2006-feb-21.html

Wikipedia Load

With the newest load we have added Polish to the languages for which GeoNames supports Wikipedia full text search and text blurbs. These features are now available in English, German, Spanish and Polish.

The total number of entries has increased from 180,000 to 230,000 which means an increase of 27%. The number of entries in English has increased from 44,000 to 53,000 (20%) and the number of entries in German from 32,000 to 38,000 (18%).
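For readers who want to check the growth figures, here is a small sketch of the arithmetic. The counts quoted in the post are rounded to the nearest thousand, so the results only approximately match the stated percentages; the function name `percent_increase` is just an illustrative choice.

```python
def percent_increase(old, new):
    """Relative growth from old to new, expressed in percent."""
    return (new - old) / old * 100


if __name__ == "__main__":
    # Rounded counts quoted in the post (previous load, this load).
    loads = [
        ("total", 180_000, 230_000),
        ("English", 44_000, 53_000),
        ("German", 32_000, 38_000),
    ]
    for label, old, new in loads:
        print(f"{label}: {percent_increase(old, new):.1f}%")
```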

You can find detailed numbers on the GeoNames Wikipedia page:

wikipedia-2006-july-15.html

The numbers for previous months are available here:

wikipedia-2006-april-23.html

wikipedia-2006-march-05.html

wikipedia-2006-feb-21.html

Wikipedia Load

Today I loaded GeoNames with the newest Wikipedia dumps for English, German and Spanish.

Here are some numbers comparing today's load with that of 4 March 2006. (I didn’t have time to generate numbers for the load of 2 April 2006.)

The total number of entries has increased from 155,000 to 180,000 which means an increase of 17%. The number of entries in English has increased from 40,000 to 44,000 (10%) and the number of entries in German from 28,000 to 32,000 (17%). The number of languages is still around 190.

You can find detailed numbers on the GeoNames Wikipedia page:

wikipedia-2006-april-23.html

Numbers for previous months are backed up here:

wikipedia-2006-march-05.html

wikipedia-2006-feb-21.html

By the way, the GeoNames web service “findNearbyWikipedia” has recently become the service most often called per day among all our web services. A close second is the “full text search”.
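For anyone who has not tried the service yet, here is a hedged sketch of how a request URL for “findNearbyWikipedia” might be built. The host, the JSON variant of the endpoint and the parameter names (lat, lng, username, lang) are assumptions based on the present-day public GeoNames API and are not stated in this post; the helper name is hypothetical, and the snippet only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Assumed endpoint of the JSON variant of the service (not from this post).
BASE_URL = "http://api.geonames.org/findNearbyWikipediaJSON"


def nearby_wikipedia_url(lat, lng, username, lang=None):
    """Build a request URL for Wikipedia articles near a coordinate."""
    params = {"lat": lat, "lng": lng, "username": username}
    if lang is not None:
        # Assumed optional parameter selecting the Wikipedia language.
        params["lang"] = lang
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # "demo" stands in for a real account name.
    print(nearby_wikipedia_url(47, 9, "demo"))
    print(nearby_wikipedia_url(47, 9, "demo", lang="de"))
```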