We have updated the geonames database with the alternate names in Breton (br) of about 1500 towns in Brittany, France. Breton (Brezhoneg) is an Insular Celtic language spoken by some of the inhabitants of Brittany (Breizh) and Loire-Atlantique (historically part of Brittany). Breton is related to Cornish and Welsh, which are spoken in neighboring Great Britain.
Thanks to Mikael Bodlore-Penlaez from geobreizh and eurominority.
We are glad to include minority languages in the geonames project and have previously added minority language data sets from the geonative project in Basque, Frisian, Asturian or Leonese, Aymara, Welsh and Macedo-Romanian.
The latest NGA load with about 300’000 modifications of over 114’000 records was primarily a Farsi (fa) update. Farsi or Persian is an Indo-European language spoken in Iran, Afghanistan, Tajikistan and other countries. It is normally written using a modified variant of the Arabic alphabet with different pronunciation of the letters.
We have updated nearly 60’000 populated places with their Cyrillic name variants, thanks to the help of Валерий Хронусов (Valery Hronusov, Ph.D.), who provided a Russian dataset of over 70’000 populated places. To match the existing geonames database with the Russian dataset we used the feature class, the location (geographical distance) and the similarity between the transcribed Cyrillic name and the existing international (English) name.
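The first two matching criteria can be sketched roughly as follows. This is only an illustration, not the code used for the import: the record layout (`feature_class`, `lat`, `lon`), the function names and the 10 km threshold are assumptions made for the example, and the distance is computed with the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidates(record, existing, max_km=10.0):
    """Existing places with the same feature class within max_km of record.

    max_km is a hypothetical threshold chosen for this example.
    """
    return [
        e for e in existing
        if e["feature_class"] == record["feature_class"]
        and haversine_km(record["lat"], record["lon"], e["lat"], e["lon"]) <= max_km
    ]
```

The surviving candidates would then be ranked by name similarity, so that the expensive string comparison only runs on places that are already plausible matches.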
As a name similarity measurement we used the Levenshtein distance (edit distance) and the letter pair similarity. The GOST system was used for the transcription of Cyrillic names into English. The Cyrillic place name Логиновка will become Loginovka.
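For readers who want to experiment, here is a minimal sketch of the two similarity measures and of a transcription step. It makes several assumptions: the letter pair similarity is implemented as a Dice coefficient over adjacent letter pairs (one common definition, not necessarily the exact one we used), and the `TRANSLIT` table is a deliberately partial illustration that only covers the letters of the example name; it is not the full GOST table.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def letter_pairs(s):
    """Adjacent letter pairs of a lower-cased string."""
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def letter_pair_similarity(a, b):
    """Dice coefficient over letter pairs, between 0.0 and 1.0."""
    pa, pb = letter_pairs(a), letter_pairs(b)
    if not pa and not pb:
        return 1.0 if a.lower() == b.lower() else 0.0
    remaining = list(pb)
    shared = 0
    for pair in pa:
        if pair in remaining:
            remaining.remove(pair)  # each pair may be matched only once
            shared += 1
    return 2.0 * shared / (len(pa) + len(pb))


# Partial, illustrative table: only the letters needed for the example.
TRANSLIT = {"л": "l", "о": "o", "г": "g", "и": "i",
            "н": "n", "в": "v", "к": "k", "а": "a"}


def transliterate(name):
    """Transcribe letter by letter; unknown characters pass through unchanged."""
    return "".join(TRANSLIT.get(ch, ch) for ch in name.lower()).capitalize()
```

With these definitions, `transliterate("Логиновка")` gives `Loginovka`, which can then be compared with the existing English name using either measure.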
The same transcription is also used for the geonames search engine. This means a search with a Cyrillic place name may also return the correct place name, even if the Cyrillic name is not yet included as an alternate name in the geonames database. A query for Логиновка not only returns the two cities in the Omskaya and Saratovskaya provinces but also the city in Bashkortostan, even though we don’t yet have the Cyrillic name for the latter.
One of the most interesting parts of the 'natural language geocoder' is Place Name Disambiguation. Depending on the context, the grammatical structure or the language a term may have one of several possible meanings. Examples :
- Hayden : the CIA director Michael Hayden or the city in Idaho.
- Java : the island or the programming language
- Brisbane : city in Australia or city in California, USA
- Como : city in Italy or a very frequent word in Spanish.
We are using several processing steps to tackle this problem. First we identify the language of the text and the contexts (Example : In an IT context the term java most likely stands for the programming language).
Then we try to find person names and do some simple grammatical analysis using co-occurrences of left and right neighbours (Example : If the term we are looking at is preceded by the expression 'south of', we can be nearly certain that the term has a geographical meaning).
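A toy version of the neighbour and context checks might look like this. It is only a sketch under simplifying assumptions: the cue phrases in `GEO_CUES`, the keyword set `IT_WORDS` and the function name `looks_geographic` are invented for the example, and the language identification and person name steps are left out entirely.

```python
import re

# Hypothetical cue lists, chosen for illustration only.
GEO_CUES = ("south of", "north of", "east of", "west of")
IT_WORDS = {"programming", "compiler", "code", "software", "jvm"}


def tokens(text):
    """Lower-cased word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def looks_geographic(text, term):
    """Return True, False, or None (undecided) for the first occurrence of term."""
    toks = tokens(text)
    t = term.lower()
    if t not in toks:
        return None
    i = toks.index(t)
    left = " ".join(toks[max(0, i - 2):i])  # the two left neighbours
    if left in GEO_CUES:
        return True       # e.g. "south of Como"
    if IT_WORDS & set(toks):
        return False      # IT context: probably not a place
    return None
```

For instance, `looks_geographic("The village lies south of Como.", "Como")` returns `True`, while `looks_geographic("Java is a programming language", "Java")` returns `False`.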
With the new availability of google maps for many countries across Europe and Oceania, I thought it would be interesting to see how the geonames place name labelling algorithm is doing compared to google/teleatlas.
To my big surprise geonames compares very well; it even seems to be doing better. The neighbouring cities of Vienna and Bratislava are a good example. Geonames shows Vienna whereas google/teleatlas shows Bratislava. The blue labels are delivered by geonames :
Some of the sites using geonames place name labels are ExploreOurPla.net (use the GHYB overlay switch to turn on google maps hybrid) or globefeed (IE only).
Edit : Noiv, from ExploreOurPla.net, has just sent me these two links :
Region around Vienna with Googlemaps and with MSN Maps.
In a previous posting I wrote about searching with country names in different languages and asked our readers to find problems with the algorithm. No one was able to spot the problem, but this does not stop us from fixing it.
The problem was with the country aliaser for composite country names in foreign languages. A search for 'Seoul, Süd Korea' would not have found anything as the aliaser was only looking at single tokens and was not considering composite country names.
The aliaser has been refactored and is now using a look ahead to see what other tokens are following and whether a sequence of tokens is matching a well known country name.
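The look ahead idea can be sketched in a few lines. This is not the actual aliaser: the `COUNTRY_ALIASES` table, the function name `split_country` and the greedy longest-match strategy are assumptions made for the illustration.

```python
# Hypothetical alias table: token sequences mapped to ISO country codes.
COUNTRY_ALIASES = {
    ("italien",): "IT",
    ("italia",): "IT",
    ("süd", "korea"): "KR",
    ("south", "korea"): "KR",
}
MAX_LEN = max(len(key) for key in COUNTRY_ALIASES)


def split_country(query):
    """Return (remaining tokens, ISO country code or None)."""
    toks = query.replace(",", " ").lower().split()
    rest, country = [], None
    i = 0
    while i < len(toks):
        matched = False
        # Look ahead: try the longest token sequence first, then shorter ones.
        for n in range(min(MAX_LEN, len(toks) - i), 0, -1):
            key = tuple(toks[i:i + n])
            if key in COUNTRY_ALIASES:
                country = COUNTRY_ALIASES[key]
                i += n
                matched = True
                break
        if not matched:
            rest.append(toks[i])
            i += 1
    return rest, country
```

Here `split_country("Seoul, Süd Korea")` returns `(["seoul"], "KR")`, whereas a lookup on single tokens would miss the two-token country name.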
It is working flawlessly now, as you can see for yourself :
The geonames search function now understands country names in different languages and has better support for non-English search queries.
Here are some examples :
Country name Italy :
Country name Italien :
Country name Italia :
Country name इटली :
Country name ايطاليا :
The geonames search engine is able to correctly identify the country name in different languages and returns the correct result.
(For most countries at least. There are some country names for which it does not yet work. Can you find some of these exceptions?)
Edit May 14 2006 : The problem I was referring to is fixed now, glad no one found it 😉