Geotagged Wikipedia Articles

We have updated our database of geotagged Wikipedia articles and increased the total number of articles to 1.2 million, up from 800’000. The most popular language is still English with 170’000 articles (up from 137’000), followed by Dutch with 107’000 (up from 67’000). Fifth is “Volapük”, a language I have to admit I had never heard of before. It is a constructed language derived from English and German. Most articles in Volapük, which literally translates to ‘world speak’, are stubs created by Wikipedia bots.

The number of entries for German would have decreased had we not merged the previous parse result with the newest parse. The decrease is mainly caused by Wikipedians who develop bots that convert established templates into new ones. The new templates are used for only a minuscule fraction of articles. This trend seems to show that while the Wikipedia approach works well for unstructured textual data, it does not work so well for structured data.
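The merge described above can be sketched roughly as follows. This is only an illustration, not our actual loader: the function name `merge_parses`, the dictionary layout (article title mapped to a latitude/longitude pair) and the sample coordinates are all assumptions made for the example.

```python
from typing import Dict, Tuple

Coord = Tuple[float, float]  # (latitude, longitude), hypothetical layout


def merge_parses(previous: Dict[str, Coord], newest: Dict[str, Coord]) -> Dict[str, Coord]:
    """Keep every article from the previous parse, then let the newest parse override."""
    merged = dict(previous)  # start with everything we already knew
    merged.update(newest)    # fresher coordinates win where both parses agree on the title
    return merged


# Sample data (made up): "Hamburg" dropped out of the newest parse,
# for instance because its template was changed to one the parser does not know.
previous = {"Berlin": (52.52, 13.40), "Hamburg": (53.55, 9.99)}
newest = {"Berlin": (52.5167, 13.3833)}

merged = merge_parses(previous, newest)
```

With this approach an article that silently disappears from the newest parse is retained with its old coordinates, at the cost of also retaining articles that were deleted on purpose.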

Wikinear

An application quite popular in the blogosphere these days is Wikinear. It is a very simple application for mobile phones that makes use of some interesting new technologies and web services: OAuth, Fire Eagle, GeoNames and the Google Static Maps API.
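For readers who want to try something similar, here is a minimal sketch of how a request for nearby Wikipedia articles could be put together. It only builds the URL and does not send it. The endpoint and the parameter names (`lat`, `lng`, `radius`, `maxRows`, `lang`, `username`) follow the GeoNames `findNearbyWikipedia` web service as I understand it, so please check them against the current web service documentation. The helper name `nearby_wikipedia_url` and the defaults are made up for this example, and this is not how Wikinear itself is implemented.

```python
from urllib.parse import urlencode

# Assumed endpoint of the GeoNames "find nearby Wikipedia entries" service (JSON variant).
BASE_URL = "http://api.geonames.org/findNearbyWikipediaJSON"


def nearby_wikipedia_url(lat, lng, username, radius_km=1, max_rows=5, lang="en"):
    """Build the request URL for Wikipedia articles around a position (hypothetical helper)."""
    params = {
        "lat": lat,
        "lng": lng,
        "radius": radius_km,   # search radius in km
        "maxRows": max_rows,   # maximum number of articles to return
        "lang": lang,          # Wikipedia language code
        "username": username,  # your GeoNames account name
    }
    return BASE_URL + "?" + urlencode(params)


url = nearby_wikipedia_url(47.0, 9.0, username="demo")
```

The position would come from a location provider such as Fire Eagle, and the resulting URL can be fetched with any HTTP client.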

15 thoughts on “Geotagged Wikipedia Articles”

  1. I’m curious about which tag you scan for when you collect the geotagged Wikipedia articles. I know there are a few types out there. My guess is that you’re scanning for the coor template: {{coor}}.

    I’ve noticed that there are a number of articles with other types of geotags that don’t appear to be in the geonames database.

    Also, I’m curious about how frequently the database is updated. I’m in the process of converting the geotags on some articles from older styles to {{coor}} style in the hope that GeoNames will pick them up, but I don’t know how long I’ll have to wait to see them appear.

    I’m building a little app that is very similar to wikinear.com and that’s why I’m asking.

    Thanks!

  2. Hi Dave

    Yes, this is correct: we are parsing the ‘coor’ template together with some others. We would like to see consensus in the Wikipedia community on a common template. One problem we run into with every dump is that some articles disappear from our parse because their template has been changed to something new and fancy. It is a real pity that many Wikipedians don’t understand the value the data would gain from a common template.

    How often we load it depends on the problems we are facing. It sometimes happens that we are missing a lot of articles for a language. Then we stop and wait another month, hoping that other Wikipedia users will revert to a template our parser understands.

    Marc

    • It looks like the consensus is to use the “coord” template with the “display” parameter.

      Is there any way to find out how long ago the dump was processed and when the next Wikipedia dump parse is scheduled?

      I would like to know more about the Ukrainian Wiki dump: when it was last processed and which articles were detected.

      I would like to unify geographic templates to make them suitable for the GeoNames parser.

      Thanks,
      Victor

    • Hello Marc,

      Thanks for your reply.

      Are there any criteria a Wikipedia dump (Ukrainian in this case) has to meet to be parsed by the GeoNames parser?

      The http://www.GeoNames.org/wikipedia page places Ukrainian Wikipedia in 42nd place with 4066 geotagged articles. Since Ukrainian Wikipedia was never parsed, where do these articles come from? Are they detected based on geotagged English articles?

      I am looking forward to expanding the number of geotagged articles in Ukrainian Wikipedia known to external web services (such as GeoNames, Google Maps/Earth, etc.) and would be very grateful for your response.

      Victor

      • Hi Victor

        It is a pain to parse even one dump because there are too many different templates. The more Wikipedia dumps are added, the worse it gets because of local differences. It also adds the risk that the parser misinterprets a template because it applies logic that makes sense for another template and, for instance, switches the lat/lng.

        It only makes sense to parse an additional language when there is a large number of geocoded entries that follow the same templates and that are not already described in another language.

        Marc
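To make the template problem discussed in this thread concrete, here is a heavily simplified sketch of parsing a coordinate template. It is not the GeoNames parser. It only understands decimal degrees with hemisphere letters, as in `{{coor d|47.5|N|9.5|E}}` or `{{coord|47.5|N|9.5|E|display=title}}`, and the names `COOR_PATTERN` and `parse_coor` are made up for this example. The range check at the end is one cheap guard against switched lat/lng, although it can only catch swaps where the value taken as latitude exceeds 90 degrees.

```python
import re
from typing import Optional, Tuple

# Simplified pattern: decimal degrees plus hemisphere letters only.
# Real templates also allow degrees/minutes/seconds, signed decimals and more.
COOR_PATTERN = re.compile(
    r"\{\{\s*coor(?:d| d)?\s*\|"
    r"\s*(?P<lat>\d+(?:\.\d+)?)\s*\|\s*(?P<ns>[NS])\s*\|"
    r"\s*(?P<lng>\d+(?:\.\d+)?)\s*\|\s*(?P<ew>[EW])\s*"
    r"(?:\|[^{}]*)?\}\}",  # optional trailing parameters such as display=title
    re.IGNORECASE,
)


def parse_coor(wikitext: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lng) from the first recognised template, or None."""
    match = COOR_PATTERN.search(wikitext)
    if match is None:
        return None  # unknown template: the article drops out of the parse
    lat = float(match.group("lat"))
    lng = float(match.group("lng"))
    if match.group("ns").upper() == "S":
        lat = -lat
    if match.group("ew").upper() == "W":
        lng = -lng
    # Reject impossible values, e.g. when a template lists longitude first.
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        return None
    return lat, lng
```

Every template variant outside this pattern returns `None`, which is exactly how articles disappear from a parse when a template is renamed or restructured.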

  3. The easiest way to get past the template parsing issue would be for Wikipedia to define a standard tag for all Wikipedia pages and build it into Wikipedia as a feature, so that even non-authors can contribute by quickly geotagging the articles they know to be relevant to their location.

    An example would be something like what Flickr offers after a non-geotagged photo has been uploaded, or what Panoramio has. But in Wikipedia’s case we may want any contributor to be able to geotag a page.

    This would dramatically speed up geotagging and make the placement more accurate through crowdsourcing (sort of like Wikipedia intended in the first place).

    I was very frustrated when I used Layar the other day and knew there were places with Wikipedia pages right in front of me, but nothing appeared in Layar, and I am clueless as to how to go and tag those articles myself now.

    • Update to my comment above: it seems that most pages are in fact geotagged now, but many are just not appearing in the Layar Wikipedia layer. I have managed to update one incorrect coordinate on Wikipedia itself without too much difficulty, but it would still be great if that interactive map also allowed interactive updates (though it would still have to prompt for a reason for the change).

  4. We are now incorporating the wiki articles in our database, but in a slightly different way: we do not use an API yet; we will run a script once and then redirect traffic, if applicable, to wikipedia.org

  5. Hi there:

    I am interested in a detail of the “Wikipedia Articles in Bounding Box” service.

    One of the elements in the XML returned by this service is “”. Where can I find documentation on the possible values for this tag, that is, the full set of values it can take?

    I need them in order to classify the spatial objects I find in Wikipedia.

    Regards.

  6. Another question, please.

    Do the web services that GeoNames provides for accessing Wikipedia articles return only articles related to toponyms that are in GeoNames? That is, could an article returned by one of these services not be in GeoNames while being in Google Earth, for example?

    Regards.
