What happens when you have half a million addresses and need to find out where in the world they actually are? You geocode them! Which isn’t the easiest thing to do: the difficulty increases as your needs for performance, quality and quantity go up.
The first option, and probably the most obvious, is to simply use an external service. There are plenty to choose from, with many major companies offering up their geocoder APIs for your personal enjoyment (or dis-enjoyment). So pick your poison: Google, Yahoo and Bing all have nice APIs that will happily take an address and give you something that looks like a lat and long.
Of course, can you really make 500,000 web requests with no one the wiser? Of course not. There are limitations, both on the service and potentially on your process for making these requests. For one, with most APIs I looked at, a commercial license suitable for bulk geocoding is pretty expensive to purchase. But it’s also slow. And by slow I mean it takes time to make 500,000 web requests, wait for them to complete, and put the data some place. That’s not counting the fact that you’ll need to handle failures somehow, which probably means some sort of queue/worker system that will need supervision, management, and so on. Yeah, not fun. So even if you manage to find an awesome deal on 500k web requests to someone’s API, you still need to manage the carrier pigeon effect. At least until connections get faster and it suddenly becomes cheap to buy expensive online data and services.
It’s also worth mentioning that many of these providers rely on the same sources of data that are freely available. In my tests, there were only marginal differences in quality across the various solutions (commercial APIs, private data sources and home-brew systems). The largest factor in geocoding quality appears to be address quality first and your data source second (along with how well your parser can understand bad addresses). Soundex helps with address quality, but some address data is just bad, and I suspect Google and others use some form of dynamic analysis based upon past searches to obtain higher quality results and suggestions for incorrect cities, zip codes, etc. Here’s an interesting report comparing various geocoding options and the effectiveness of each, based upon testing with 100 addresses (pay close attention to shapefile-based data sources).
With these facts of life in mind, I decided to do what any crazy person would: set up my own! Google be damned. I can do it, and in one week. Well, maybe two. I know what you’re thinking: pretty foolish and maybe even a little masochistic, and you’d be right on both counts.
Fortunately, the latest version of PostGIS has, buried under extras, a quite comprehensive geocoder project based upon freely available US Census shapefile data. I’ll walk you through setting up your own geocoder, complete with soundex support and the ability to bulk geocode from a db table.
If you already have Postgres set up, you can skip or skim over these steps. Just make sure you have a working installation and know the paths to all the relevant bits where things are installed.
Download & Install Postgres
For maximum flexibility in the initial prototyping, I decided to compile Postgres from source, but most distributions also have a package available, and Postgres maintains binary and graphical installer packages for some. To compile:
You’ll need build-essential, gcc, etc. packages to compile
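A typical from-source build looks like the following; the version number and install prefix here are just examples, so substitute whatever you downloaded:

```sh
wget http://ftp.postgresql.org/pub/source/v9.0.4/postgresql-9.0.4.tar.gz
tar xzf postgresql-9.0.4.tar.gz
cd postgresql-9.0.4
./configure --prefix=/usr/local/pgsql
make
sudo make install
```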
You’ll now need to compile and install PostGIS. Before you do this, it’s a good idea to spend some time setting up accounts on Postgres. The authentication model for Postgres relies on system accounts, specifically allowing root access to the postgres user by default, where you can administer other users as necessary. For my setup, I made sure to have a system account for every postgres user.
Create the postgres user:
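The exact command varies by distribution (adduser on Debian/Ubuntu, useradd elsewhere); something like:

```sh
sudo useradd -m postgres
```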
You now need to create a directory to store the main Postgres DB files. I put mine under /usr/local, but you can put it generally anywhere as needed (the path is specified in configuration and/or during server startup).
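For example, assuming the /usr/local location mentioned above:

```sh
sudo mkdir -p /usr/local/pgsql/data
sudo chown postgres:postgres /usr/local/pgsql/data
```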
Now you’ll need to create a postgres cluster and start the postmaster process. First, change to the postgres user:
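A sketch, assuming the prefix and data directory used above:

```sh
sudo su - postgres
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
/usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /tmp/postgres.log start
```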
You can now login using the psql client (as the postgres system user):
If all is well, you should be at a postgres command prompt like the following:
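For example:

```sh
psql postgres
```

which should drop you at a prompt resembling:

```
postgres=#
```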
To quit, type “\q”. Now we’re ready to install and setup PostGIS!
You’ll need libxml2-dev, libgeos-dev and libproj-dev (xslt and convert are also needed for building the documentation). I had to manually compile and install GEOS (which you can get from osgeo).
Since we’re interested in the Geocoder from 2.0, we’ll need to grab the latest development snapshot:
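Something like the following; the svn URL and pg_config path are assumptions based on my setup, so adjust to where your checkout and Postgres install actually live:

```sh
svn checkout http://svn.osgeo.org/postgis/trunk/ postgis-2.0
cd postgis-2.0
./autogen.sh
./configure --with-pgconfig=/usr/local/pgsql/bin/pg_config
make
sudo make install
```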
Create a spatial database template
So we can easily create spatially enabled databases, we’ll create one as a template and apply the spatial data types to it:
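Something like the following; the paths to the postgis sql scripts depend on your install prefix and version, so check where make install actually put them:

```sh
createdb template_postgis
psql -d template_postgis -f /usr/local/pgsql/share/contrib/postgis-2.0/postgis.sql
psql -d template_postgis -f /usr/local/pgsql/share/contrib/postgis-2.0/spatial_ref_sys.sql
psql -d postgres -c "UPDATE pg_database SET datistemplate = true WHERE datname = 'template_postgis';"
```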
Now to create a new postgis enabled spatial database, we just do the following:
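Here my_spatial_db is just a placeholder name:

```sh
createdb -T template_postgis my_spatial_db
```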
Add soundex ability
To add soundex ability to the db (used in matching parsed address parts; the module includes soundex, metaphone and levenshtein functions), first build the fuzzystrmatch module under the contrib directory of your postgres source:
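The /path/to below is a placeholder for wherever you unpacked the Postgres source:

```sh
cd /path/to/postgresql-source/contrib/fuzzystrmatch
make
sudo make install
```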
This will add the fuzzystrmatch under the contrib folder of your installation. To add it to your database:
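For example, loading it into the template so that any database created from it gets the functions too (the path is an assumption based on the prefix used above):

```sh
psql -d template_postgis -f /usr/local/pgsql/share/contrib/fuzzystrmatch.sql
```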
Create a geocoder database
Now, we’ll need to create a new geocoder database:
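Using the spatial template from earlier:

```sh
createdb -T template_postgis geocoder
```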
Now, we can begin to setup our geocoder. We first need to populate some street abbreviation and state lookup tables, and the tiger_loader.sql, which is the actual code to load tiger data into the geocoder table structure:
(I had my PostGIS sources under /tmp in this case):
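From the extras directory of the PostGIS source tree; the file names here are from the snapshot I used, so check your own checkout:

```sh
cd /tmp/postgis-2.0/extras/tiger_geocoder/tiger_2010
psql -d geocoder -f tables/lookup_tables_2010.sql
psql -d geocoder -f tiger_loader.sql
```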
Next, create the actual geocode function in the db (note you need to be in the tiger_2010 directory):
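Again from the tiger_2010 directory of the source tree:

```sh
cd /tmp/postgis-2.0/extras/tiger_geocoder/tiger_2010
psql -d geocoder -f create_geocode.sql
```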
You’re now ready to load US Census shapefile data into the geocoder! There’s a slightly contrived way of generating loader scripts from the db, but I found this cumbersome for loading many states, so I wrote my own in perl. Someone from the PostGIS mailing list also wrote their own:
For the original loader script and instructions, check the original documentation.
I found it useful to get all the files into a single directory, and then just process them from there (instead of having a separate script to do each state):
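A rough shell sketch of that approach; the Census directory layout, target table names and shp2pgsql invocation here are assumptions (and the very first file for each table needs -c instead of -a to create it):

```sh
# fetch all TIGER 2010 edges zips into one directory (repeat for faces, addr, featnames)
wget -r -nd -np -A "*.zip" -P /data/tiger http://www2.census.gov/geo/tiger/TIGER2010/EDGES/
cd /data/tiger
for z in *.zip; do unzip -o "$z"; done

# append each shapefile into its table (use -c on the first file to create the table)
for shp in *edges.shp; do
  shp2pgsql -W latin1 -s 4269 -a "$shp" tiger_data.edges | psql -d geocoder
done
```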
The loading process takes anywhere from 30 minutes to over an hour, depending on how many points of interest are in the shapefile and the complexity of the generated SQL. Once it’s finished, it’s quite simple to geocode addresses with pretty good quality:
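A basic lookup; the address is made up, and geocode() returns a normalized address, a point geometry and a rating (lower is better):

```sql
SELECT g.rating,
       ST_X(g.geomout) AS lon,
       ST_Y(g.geomout) AS lat,
       pprint_addy(g.addy) AS address
FROM geocode('75 State St, Boston, MA 02109') AS g;
```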
What’s even more interesting is bulk geocoding by loading an entire table of addresses and selecting them back out (or into another table) geocoded. Here’s some code that does that (modified from the PostGIS Geocode documentation at http://postgis.refractions.net/documentation/manual-svn/Geocode.html):
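A sketch along the lines of that documentation example, assuming a hypothetical addresses table with addr_id, address, rating, lon and lat columns; it geocodes the not-yet-rated rows in batches of 1000:

```sql
UPDATE addresses
SET rating = (g.geo).rating,
    lon    = ST_X((g.geo).geomout),
    lat    = ST_Y((g.geo).geomout)
FROM (
    SELECT addr_id, geocode(address) AS geo
    FROM addresses
    WHERE rating IS NULL
    LIMIT 1000
) AS g
WHERE addresses.addr_id = g.addr_id;
```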
What’s it like to use it for real life stuff? Well, your mileage may vary. I loaded all US states, and performance ranged anywhere from 40 ms to over 3000 ms for some difficult addresses on a dual quad core machine with 32 GB of RAM and fast disks. Part of the complexity is in the address parsing and attempting to match address parts against many tables, resulting in table scans and a lot of time and disk I/O. If you have many millions of addresses to geocode, this could be a problem.
I was able to increase performance by “pre-normalizing” my addresses and then sorting them by zip code. This seemed to help the most, along with increasing the shared memory cache in the Postgres configuration. I’m also looking at optimizing away some unnecessary table scans, and trying to make more use of partitioned tables and different indexes. So far it’s worked for my purposes, and I haven’t tried to cluster it yet, which may be an option for some (e.g. partitioning multiple states or sets of data over many machines).
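Pre-normalizing just means running the parsing step ahead of time with the geocoder’s normalize_address() function, so the parsed parts can be cleaned up and sorted before the actual lookup; for example:

```sql
SELECT pprint_addy(normalize_address('202 East Fremont Street, Las Vegas, Nevada 89101'));
```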