Hello all,

I really need help on this front! I've set up a Nominatim server on GCP's Compute Engine. It works well enough, but I now have 100 million unique addresses to forward-geocode through the service, and I'm trying to use multiprocessing to speed things up. Even 100 addresses processed simultaneously stalls the service. My VM has 128 GB of RAM and 24 CPUs, and I followed the configuration recommendations from the installation guide. Does anyone have best practices for setting up the service to handle HUGE bulk loads? Would switching from Apache to nginx help?

Reproducible code example in Python:

from joblib import Parallel, delayed
from multiprocessing import cpu_count
from geopy.geocoders import Nominatim
from collections import defaultdict

def geopy_parse(address_str):
    """
    Use custom Nominatim server to extract
    country, locality and region from an unstructured address.
    """
    # Note: a new geocoder (and connection) is created for every single
    # address, so nothing is reused between requests.
    osm = Nominatim(domain="<url>", scheme="http", timeout=10000)
    result = osm.geocode(address_str, language='en', addressdetails=True)
    if result is not None:
        # Return the address components, defaulting missing keys to None.
        return defaultdict(lambda: None, result.raw['address'])
    return None

addresses = ['Vancouver BC Canada'] * 100
# One worker per client CPU; each worker fires requests at the server.
Parallel(n_jobs=cpu_count())(delayed(geopy_parse)(a) for a in addresses)
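One pattern that may reduce the load on the server is to geocode in chunks, creating one geocoder per chunk instead of per address, and to cap client-side parallelism below the server's CPU count. This is only a sketch, not tested against a real Nominatim instance; `geocode_chunk`, the chunk size, and the `user_agent` string are my own assumptions:

```python
from itertools import islice

def chunked(seq, size):
    """Yield successive lists of at most `size` items from `seq`."""
    it = iter(seq)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def geocode_chunk(chunk):
    """Geocode a whole chunk with a single geocoder instance."""
    # Imported here so the import happens inside the worker process.
    from geopy.geocoders import Nominatim
    osm = Nominatim(domain="<url>", scheme="http", timeout=30,
                    user_agent="bulk-geocoder")  # user_agent is an assumption
    return [osm.geocode(a, language='en', addressdetails=True)
            for a in chunk]

# Cap n_jobs well below cpu_count(): the server, not the client, is the
# bottleneck here. Hypothetical numbers:
# from joblib import Parallel, delayed
# results = Parallel(n_jobs=8)(
#     delayed(geocode_chunk)(c) for c in chunked(addresses, 1000))
```

This way each worker amortizes the geocoder setup over a thousand requests instead of paying it per address.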

asked 20 Sep, 23:58

rirhun


For installations with a high load, you should at least switch your server to php-fpm. In my experience it is also worth switching to nginx, as it is much better at coping with many parallel requests. Your system setup should be able to manage 600 requests/s. (It depends on how the VM is set up, in particular how fast disk access is.)

On a general note: it is not really worth increasing the number of parallel requests indefinitely. Your server has a fixed number of CPUs, and that limits the amount of parallel work it can do. With too many parallel requests they get in each other's way, which actually slows you down. In your case I would expect that beyond 50 parallel requests you won't see much increase in throughput.
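To put that estimate in perspective, a quick back-of-the-envelope calculation (the 600 requests/s figure comes from the answer above; the rest is arithmetic):

```python
# Rough runtime estimate for the bulk job, assuming the server
# sustains the ~600 requests/s estimated above.
requests_total = 100_000_000      # addresses to geocode
throughput = 600                  # requests per second (estimate)

seconds = requests_total / throughput
hours = seconds / 3600
print(round(hours))               # roughly 46 hours, i.e. about two days
```

So even at full estimated throughput the job runs for about two days, which is why squeezing in more client-side parallelism past the saturation point buys nothing.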


answered 21 Sep, 09:09

lonvia

Thanks for the insightful response. I'm using an SSD, so disk access should be relatively fast. I tried looking into nginx but couldn't get it working: the server kept complaining "file not found" with regard to the php-fpm socket, even though the path to the socket was correct.
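On the "file not found" error: with nginx, that message usually comes from php-fpm itself when `SCRIPT_FILENAME` is not passed correctly, rather than from a wrong socket path (a wrong socket path would normally produce a 502 and a "connect() failed" entry in the nginx error log). A minimal sketch of the relevant nginx block; the paths and socket name are assumptions for illustration, adjust them to your installation:

```nginx
server {
    listen 80;
    root /srv/nominatim/website;        # assumed Nominatim website directory
    index search.php;

    location ~ \.php$ {
        include fastcgi_params;
        # "File not found" from php-fpm usually means this parameter
        # is missing or wrong, not that the socket path is bad:
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/run/php/php-fpm.sock;   # assumed socket path
    }
}
```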

(21 Sep, 17:55) rirhun


powered by OSQA