Hello all,

I really need help on this front! I've set up a Nominatim server on GCP's Compute Engine. It works well enough, but I now have 100 million unique addresses to forward-geocode through the service, and I'm trying to use multiprocessing to speed things up. Even 100 addresses processed simultaneously stalls the service. My VM has 128 GB of RAM and 24 CPUs, and I followed the configuration recommendations from the installation guide. Does anyone have best practices for setting up the service to handle HUGE bulk loads? Would switching from Apache to nginx help?

Reproducible code example in Python:

from joblib import Parallel, delayed
from multiprocessing import cpu_count
from geopy.geocoders import Nominatim
from collections import defaultdict

def geopy_parse(address_str):
    """
    Use custom Nominatim server to extract
    country, locality and region from an unstructured address.
    """
    # Note: a new geocoder (and connection) is created for every single
    # address, so nothing is reused between requests.
    osm = Nominatim(domain="<url>", scheme="http", timeout=10000)
    result = osm.geocode(address_str, language='en', addressdetails=True)
    if result is not None:
        # Return the address components, defaulting missing keys to None.
        return defaultdict(lambda: None, result.raw['address'])
    return None

addresses = ['Vancouver BC Canada'] * 100
# One worker per client CPU; each worker fires requests at the server.
Parallel(n_jobs=cpu_count())(delayed(geopy_parse)(a) for a in addresses)
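One pattern that may reduce the load on the server is to geocode in chunks, creating one geocoder per chunk instead of per address, and to cap client-side parallelism below the server's CPU count. This is only a sketch, not tested against a real Nominatim instance; `geocode_chunk`, the chunk size, and the `user_agent` string are my own assumptions:

```python
from itertools import islice

def chunked(seq, size):
    """Yield successive lists of at most `size` items from `seq`."""
    it = iter(seq)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def geocode_chunk(chunk):
    """Geocode a whole chunk with a single geocoder instance."""
    # Imported here so the import happens inside the worker process.
    from geopy.geocoders import Nominatim
    osm = Nominatim(domain="<url>", scheme="http", timeout=30,
                    user_agent="bulk-geocoder")  # user_agent is an assumption
    return [osm.geocode(a, language='en', addressdetails=True)
            for a in chunk]

# Cap n_jobs well below cpu_count(): the server, not the client, is the
# bottleneck here. Hypothetical numbers:
# from joblib import Parallel, delayed
# results = Parallel(n_jobs=8)(
#     delayed(geocode_chunk)(c) for c in chunked(addresses, 1000))
```

This way each worker amortizes the geocoder setup over a thousand requests instead of paying it per address.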

asked 20 Sep, 23:58

rirhun


For installations with a high load, you should at least switch your server to php-fpm. In my experience it is also worth switching to nginx, as it is much better at coping with many parallel requests. Your system setup should be able to manage 600 requests/s. (It depends on how the VM is set up, in particular how fast disk access is.)

On a general note: it is not really worth increasing the number of parallel requests indefinitely. Your server has a fixed number of CPUs, and that limits the amount of parallel work it can do. With too many parallel requests they get in each other's way, which actually slows you down. In your case I would expect that beyond 50 parallel requests you won't see much increase in throughput.
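To put that estimate in perspective, a quick back-of-the-envelope calculation (the 600 requests/s figure comes from the answer above; the rest is arithmetic):

```python
# Rough runtime estimate for the bulk job, assuming the server
# sustains the ~600 requests/s estimated above.
requests_total = 100_000_000      # addresses to geocode
throughput = 600                  # requests per second (estimate)

seconds = requests_total / throughput
hours = seconds / 3600
print(round(hours))               # roughly 46 hours, i.e. about two days
```

So even at full estimated throughput the job runs for about two days, which is why squeezing in more client-side parallelism past the saturation point buys nothing.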


answered 21 Sep, 09:09

lonvia

Thanks for the insightful response. I'm using an SSD, so disk access should be relatively fast. I tried looking into nginx but couldn't get it working: the server kept complaining "file not found" with regard to the php-fpm socket, even though the path to the socket was correct.
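On the "file not found" error: with nginx, that message usually comes from php-fpm itself when `SCRIPT_FILENAME` is not passed correctly, rather than from a wrong socket path (a wrong socket path would normally produce a 502 and a "connect() failed" entry in the nginx error log). A minimal sketch of the relevant nginx block; the paths and socket name are assumptions for illustration, adjust them to your installation:

```nginx
server {
    listen 80;
    root /srv/nominatim/website;        # assumed Nominatim website directory
    index search.php;

    location ~ \.php$ {
        include fastcgi_params;
        # "File not found" from php-fpm usually means this parameter
        # is missing or wrong, not that the socket path is bad:
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/run/php/php-fpm.sock;   # assumed socket path
    }
}
```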

(21 Sep, 17:55) rirhun


powered by OSQA