Extracting addresses from millions of pages with AutoML and Ruby

12 February, 2020

Code can be found here

A while back I ran into a problem where I needed to extract Australian addresses from a few million pages. Google’s AutoML NLP (Natural Language Processing) service solves the extraction part. I’d then need to take those addresses and make them more useful by splitting them into address parts (street no, street name, postcode, state, etc.) and geocoding each address (longitude and latitude).

We could then store the data in Elasticsearch and perform geolocation-based searches, allowing users to get the distance from their current location to an address listed on a site, or to filter sites by suburb, state or postcode.

The diagram below shows how we go from scraping HTML that contains addresses to address records stored in Postgres and Elasticsearch. I’ll mainly focus on the green boxes.

[diagram: scraped HTML to address records in Postgres and Elasticsearch]

To give you a better idea of what we are expecting, we’ll be going from raw HTML that may contain an address like this:

<html>
  <body>
    <p>
      Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
    </p>
    <span>address:</span>
    <span>Unit 2, Marina Mirage, 74 Seaworld Drive, Main Beach QLD 4217</span>
    <span>phone:</span>
    <span>11112222</span>
    ....
  </body>
</html>

Then transforming each address into something like this:

[1] pry(main)> Address.first
=> #<Address:0x00007fc3ce9a4fb8
 id: 1,
 site_id: 104,
 street_no: nil,
 street_name: "Seaworld Drive",
 suburb: "Main Beach",
 state: "QLD",
 postcode: "4217",
 lat: -0.27968262e2,
 lng: 0.153426398e3,
 status: "verified",
 nlp_address: "unit 2, marina mirage, 74 seaworld drive, main beach qld 4217",
 nlp_confidence: 0.99984866e0,
 mappify_confidence: 1,
 created_at: Mon, 10 Feb 2020 12:48:35 UTC +00:00,
 updated_at: Mon, 10 Feb 2020 14:22:23 UTC +00:00>

Scraping sites

You will need links to scrape. I won’t provide these, but you can buy lists online from various places to at least get you started. While scraping, you can also collect links to other domains that appear on a page and scrape those as well.

I’ve used Scrapy and deployed it to Scrapy Cloud. Scrapy Cloud’s free plan provides a single process with a one-hour run time limit per run. You can run a process as many times as you’d like.

Scrapy Cloud doesn’t provide scheduling for free. If you want free scheduling, you can implement it yourself using their API along with Sidekiq and sidekiq-cron.
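As a rough sketch of that idea (assuming sidekiq-cron is set up, e.g. via config/schedule.yml, and that Scrapy Cloud’s run.json endpoint with a basic-auth API key still works the way their docs describe), a scheduled worker could kick off a run like this:

require "net/http"
require "uri"

# Scheduled via sidekiq-cron; kicks off a Scrapy Cloud run for a spider.
# SCRAPY_API_KEY and SCRAPY_PROJECT_ID are placeholder env var names.
class Scheduled::StartScrapyRunJob
  include Sidekiq::Worker

  RUN_URL = URI("https://app.scrapinghub.com/api/run.json")

  def perform(spider_name = "address_spider")
    request = Net::HTTP::Post.new(RUN_URL)
    request.basic_auth(ENV.fetch("SCRAPY_API_KEY"), "")
    request.set_form_data("project" => ENV.fetch("SCRAPY_PROJECT_ID"), "spider" => spider_name)

    Net::HTTP.start(RUN_URL.host, RUN_URL.port, use_ssl: true) do |http|
      http.request(request)
    end
  end
end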


Creating training data

This part is quite tedious but, of course, essential. Manually labelling thousands of addresses can take many hours (I know from experience). The prediction model used in the examples of this article was trained with two labels: po_box and address.

Labelling PO box addresses allows us to filter them out when predicting, as they are irrelevant to my requirements. If we don’t teach the model the difference, it will return a partial PO box address as an address, simply because the two look so similar.

Here are a couple of tools to help annotate your data.

At the time I used Doccano. It didn’t support exporting to a Google-supported format, so I wrote my own script to transform the Doccano export file into a format Google AutoML would accept.
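If it helps, the gist of such a script could look like the sketch below. It assumes the older Doccano JSONL export, where each line has a text field and a labels array of [start, end, label] triples, and it targets the JSONL layout from Google’s AutoML entity extraction docs; double-check both formats against your own export before using it.

require "json"

# Convert a Doccano JSONL export into AutoML entity-extraction JSONL.
# The input and output file names here are just placeholders.
File.open("google_annotations.jsonl", "w") do |out|
  File.foreach("doccano_export.jsonl") do |line|
    record = JSON.parse(line)

    annotations = record.fetch("labels", []).map do |start_offset, end_offset, label|
      {
        "text_extraction" => { "text_segment" => { "start_offset" => start_offset, "end_offset" => end_offset } },
        "display_name" => label # e.g. "address" or "po_box"
      }
    end

    out.puts JSON.generate(
      "annotations" => annotations,
      "text_snippet" => { "content" => record.fetch("text") }
    )
  end
end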

I’ve only just found Label Studio; that one looks promising.

There are also services that allow you to outsource this work if it’s within your budget. I can’t advise on this as I just did it myself.


Training a prediction model

We’ll go through the process of taking some training data and creating the prediction model on Google Cloud.

Setup Google account

  • Sign up for an account over at https://console.cloud.google.com (you’ll need to add debit/credit card details to activate the free trial)
  • Create a new project, e.g. automl-address-extractor

Copy training files to storage

  • From the storage section of the Google Cloud console, create a new bucket
  • Upload your annotations file to the bucket
  • Create a CSV file google_annotations.csv. The contents should contain the location of the annotations file, e.g. ,gs://bucket-name/google_annotations.jsonl (the leading comma is intentional: the first CSV column, which assigns the data split, is left empty)
  • Upload the CSV file to the bucket the annotations file was uploaded to

Create a google cloud service account

  • In the Cloud console, select “IAM & admin” and choose “Service accounts”.
  • Create a new service account
  • Select the Role “AutoML Predictor”
  • Create a key and download it using the JSON key type

The JSON file contains credentials, which you can use to set the environment variables required by the googleauth gem in your project. See .env.example for the env var keys.

Setup AutoML

  • Head over to the Natural Language area of the console, under Artificial Intelligence
  • Enable the natural language models API when prompted
  • Click on AutoML Entity Extraction

[screenshot: AutoML Entity Extraction card in the Natural Language console]

  • Click “New Dataset”
  • Select the CSV file that was uploaded earlier & import.
  • Your items screen should look something like this

[screenshot: imported dataset items]

  • Go to Train and click the Start Training button. This can take several hours.
  • Once training is finished, you can test out your model in Test & Use.

Here’s an example:

[screenshot: a prediction in the Test & Use tab]

Test the prediction API from the Rails console

Let’s confirm that we can now start predicting addresses from our code within the Rails console.

We’ve got an AutoMLClient class that takes a string and returns a prediction from AutoML’s API.
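I won’t paste the whole class here, but a minimal sketch of such a client could look like this, assuming the AutoML v1 REST predict endpoint and the googleauth gem picking up the service account credentials from environment variables; AUTOML_MODEL_PATH is a placeholder for your own projects/.../models/... path.

require "googleauth"
require "net/http"
require "json"

class AutoMLClient
  SCOPE = "https://www.googleapis.com/auth/cloud-platform".freeze

  # Send a block of text to the AutoML entity extraction model and
  # return the parsed prediction response.
  def predict(text)
    uri = URI("https://automl.googleapis.com/v1/#{ENV.fetch('AUTOML_MODEL_PATH')}:predict")

    request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
    request["Authorization"] = "Bearer #{access_token}"
    request.body = JSON.generate(payload: { textSnippet: { content: text, mime_type: "text/plain" } })

    response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
    JSON.parse(response.body)
  end

  private

  # googleauth builds service account credentials from env vars or a key file
  def access_token
    credentials = Google::Auth.get_application_default([SCOPE])
    credentials.fetch_access_token!["access_token"]
  end
end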

[1] pry(main)> client = AutoMLClient.new
[2] pry(main)> client.predict("1300 111 222 unit 2, marina mirage, 74 seaworld drive, main beach qld 4217 2019 copyright")
=> {"payload"=>
  [{"annotationSpecId"=>"7316645151454003200",
    "displayName"=>"address",
    "textExtraction"=>{"score"=>0.99951684, "textSegment"=>{"startOffset"=>"13", "endOffset"=>"74", "content"=>"unit 2, marina mirage, 74 seaworld drive, main beach qld 4217"}}}]}

As you can see, AutoML has managed to extract the correct address from the string.


Cost of predicting

So far so good, right? If you’re only running predictions in the thousands, that would be okay. But what if we need to process a million or more addresses? The first thing that comes to mind is cost.

We can immediately reduce costs substantially by not sending entire pages to AutoML. Most of the content on a page is irrelevant, and a page can contain hundreds to thousands of characters. Instead, we can extract snippets from a page where we think an address may exist. I’ll explain how we extract snippets later when we go through the code. But just so you know, a snippet may or may not contain an actual address; that’s why we need NLP.

Let’s take a look at some approximate numbers and determine whether this can be a viable solution. Because of the way snippet extraction works, the number of characters per snippet varies, but we can take the average snippet length and use that when doing the maths. Here’s Google’s pricing page if you’re interested.

At the time of writing the costs are:

[screenshot: AutoML Natural Language pricing table]

We’re most interested in the prediction pricing for text records: $5 per 1000 text records. A request of 1000 characters or less counts as 1 text record; 4500 characters counts as 5 text records.

Now let’s take a look at our current costs for predicting addresses from snippets with AutoML.

One snippet per request

Number of snippets: 1 000 000
The average length of a snippet: 110 characters

$5 per 1000 text records = 5 / 1000 = 0.005

Cost: 1 000 000 * 0.005 = $5000

Let’s explore whether it’s possible to optimise on cost. From data that has already been collected, we know a snippet has an average length of 110 characters. We also know that Google charges per text record of 1000 characters. So what would the costs look like if we wrote some code to group snippets into blocks of text up to 1000 characters?

Grouping snippets by 1000 characters

1 000 000 * 110 = 110 000 000 (total number of characters)

Text records: 110 000 000 / 1000 = 110 000

110 000 * 0.005 = $550

We’ll be saving approximately $4450 USD.

Looks much better (if you have free trial credits), but before we move forward, let’s check the limitations of Google’s API.

[screenshot: AutoML quota limits, including the maximum characters per prediction request]

Google allows up to 10 000 characters per request. When we group snippets into blocks of 1000 characters, in most cases we won’t land exactly on 1000 characters; we could end up with something like 920 or 970. Those unused characters add up and increase costs over time. So instead we’ll group snippets into blocks of at most 10 000 characters, which lets us save a bit more money.
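As a sanity check, here’s the same arithmetic in a few lines of Ruby (the numbers are the assumptions from above: a million snippets averaging 110 characters, at $5 per 1000 text records):

SNIPPETS                 = 1_000_000
AVG_SNIPPET_CHARS        = 110
DOLLARS_PER_1000_RECORDS = 5.0

# one request per snippet: every snippet is billed as a full text record
per_snippet_cost = SNIPPETS * DOLLARS_PER_1000_RECORDS / 1000 # => 5000.0

# grouped requests: total characters divided into 1000-character records
total_chars  = SNIPPETS * AVG_SNIPPET_CHARS                   # 110 000 000
text_records = (total_chars / 1000.0).ceil                    # 110 000
grouped_cost = text_records * DOLLARS_PER_1000_RECORDS / 1000 # => 550.0

puts per_snippet_cost - grouped_cost # => 4450.0, the approximate saving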

If you’re wondering whether AutoML is a good fit for you, it depends on how much data you need to process and how much value you can get from that data. If you’re looking to reduce costs further, you could implement the entity extraction yourself with spaCy and build your own prediction API. But that depends on your situation and whether the cost of rolling your own solution is worth it.

Google provides a free trial with $500 credit when setting up a new account. The first 5,000 text records (5 million characters) are free during the trial period. That may be enough for you to get a quick MVP out the door, then move on to something like spaCy if you need to in the future. There isn’t much stopping you from signing up with another account for more free credits either. If you’re starting a business, you can also apply for startup credits and get another $3000 or more on top.


Extracting addresses with ruby

When we save a page for a site, we store the HTML in our DB. We then strip the HTML and save the result to the content field of the page record, so we won’t need to worry about HTML when processing.
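That step is nothing fancy; a minimal sketch, assuming a Page model with html and content columns (the method name here is just illustrative):

class Page < ApplicationRecord
  # Strip tags from the stored HTML and save the plain text to the content column.
  def extract_content!
    plain_text = ActionController::Base.helpers.strip_tags(html)
    update!(content: plain_text.squish)
  end
end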

Extracting snippets

If you look at the snippet extractor code, you’ll see that it goes through each word looking for an Australian state, e.g. NSW. Once a state is found, it extracts the surrounding words, including the state. This is what we’ll call the address snippet.
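This isn’t the exact implementation from the repo, but as an illustration of the idea (the window sizes are guesses):

class SnippetExtractor
  STATES       = %w[nsw vic qld wa sa tas act nt].freeze
  WORDS_BEFORE = 12
  WORDS_AFTER  = 3

  # Return candidate address snippets: a window of words around each state token.
  def extract(content)
    words = content.split
    words.each_index
         .select { |i| STATES.include?(words[i].downcase.delete(",.")) }
         .map do |i|
           from = [i - WORDS_BEFORE, 0].max
           to   = i + WORDS_AFTER
           words[from..to].join(" ")
         end
  end
end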

Grouping snippets

Grouping is a bit more complicated. Google’s API expects a block of text of at most 10 000 characters. When a prediction is successful, the API returns the index positions of the labels it found. We need to take that index position and find the snippet associated with the prediction, because the snippet carries the extra information we need to associate an address with a site in our DB. That is the responsibility of the MLContentBuilder class (sketched below the list). It will:

  • join a collection of snippets, separating them with a full stop
  • keep track of start and end positions of each snippet
  • keep track of site id for a snippet
  • allow us to look up a snippet at an index position

Snippets are separated with a full stop to improve the accuracy of a prediction.
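Here’s a sketch of what an MLContentBuilder-style class could look like; the names and details are my own guesses rather than the original source, but it covers the responsibilities above:

class MLContentBuilder
  MAX_CHARS = 10_000
  SEPARATOR = ". ".freeze

  Entry = Struct.new(:site_id, :snippet, :range)

  def initialize
    @entries = []
    @content = +""
  end

  attr_reader :content

  # Append a snippet; returns false when it would push the block past 10 000 chars.
  def add(site_id:, snippet:)
    addition = @content.empty? ? snippet : SEPARATOR + snippet
    return false if @content.length + addition.length > MAX_CHARS

    start_offset = @content.length + (addition.length - snippet.length)
    @content << addition
    @entries << Entry.new(site_id, snippet, start_offset...(start_offset + snippet.length))
    true
  end

  # Look up the snippet covering an offset returned by an AutoML prediction.
  def entry_at(offset)
    @entries.find { |entry| entry.range.cover?(offset) }
  end
end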

Entity extraction with AutoML NLP (Natural Language Processing)

There is an ExtractAddress service that we’ll use to go from HTML to extracted addresses. When executed, it performs the following steps (a rough sketch of the flow follows the list):

  • Retrieve pages for multiple sites from the DB
  • Extract address snippets from pages
  • Group snippets to a maximum of 10 000 characters for each group
  • Send grouped snippets to AutoML NLP service to extract addresses
  • Bulk import into addresses table
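To give a feel for how the pieces hang together, here’s a rough outline of that flow. It reuses the sketches from earlier (which are themselves guesses at the real classes), so treat it as pseudocode for the service rather than the project’s actual implementation:

class ExtractAddresses
  def self.process(sites)
    extractor = SnippetExtractor.new
    client    = AutoMLClient.new
    builder   = MLContentBuilder.new
    rows      = []

    sites.each do |site|
      site.pages.each do |page|
        extractor.extract(page.content).each do |snippet|
          next if builder.add(site_id: site.id, snippet: snippet)

          # the current block is full: run a prediction, then start a new block
          rows.concat(rows_for(client, builder))
          builder = MLContentBuilder.new
          builder.add(site_id: site.id, snippet: snippet)
        end
      end
    end
    rows.concat(rows_for(client, builder)) unless builder.content.empty?

    now = Time.current
    Address.insert_all(rows.map { |row| row.merge(created_at: now, updated_at: now) }) if rows.any?
  end

  # Map an AutoML response back to address rows via the builder's offsets.
  def self.rows_for(client, builder)
    client.predict(builder.content).fetch("payload", []).map do |prediction|
      next unless prediction["displayName"] == "address"

      segment = prediction.dig("textExtraction", "textSegment")
      entry   = builder.entry_at(segment["startOffset"].to_i)
      next unless entry

      {
        site_id: entry.site_id,
        nlp_address: segment["content"].downcase,
        nlp_confidence: prediction.dig("textExtraction", "score")
      }
    end.compact
  end
end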

Retrieving useful information from an address string

So now we’ve got an address as a string; what next? You may already be familiar with the address autocompletion provided by Google and other APIs.

There are several APIs that handle Australian addresses only, and I’ve found they contain more complete data than Google’s. Here are a few Australian-address-only services from a quick search:

Looking at the pricing pages, Mappify is the cheapest, with 2500 free requests per day. If none of the services is within your budget, another option is to build your own address search API using the GNAF dataset. Setting that up is out of the scope of this article, but here are some tips if you’re interested:

  • Follow the instructions in this repo to load up the GNAF dataset into Postgres
  • If you decide to use Rails as your API, you could use Searchkick to sync to Elasticsearch
  • Update your model so Searchkick’s callbacks will sync address parts and geolocation data to Elasticsearch (see the sketch after this list)
  • Create a search endpoint and use Searchkick’s search method to perform searches.
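As a rough sketch of that Searchkick wiring (field and model names are just illustrative): index the address parts plus a lat/lon pair as a geo point, then filter with a near clause.

class Address < ApplicationRecord
  searchkick locations: [:location]

  # What gets synced to Elasticsearch by Searchkick's callbacks.
  def search_data
    {
      street_name: street_name,
      suburb: suburb,
      state: state,
      postcode: postcode,
      location: { lat: lat, lon: lng }
    }
  end
end

# e.g. in a search endpoint: everything within 10 km of the user's location
Address.search("*", where: { location: { near: { lat: -27.97, lon: 153.43 }, within: "10km" } })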

All of the Australian address APIs that I’ve found mention that they use the GNAF dataset.

We’ll keep things simple and use Mappify, as I don’t need to process millions of snippets for this article. I’ve modified the extracted code to use Mappify for address lookups.

Source for Mappify Client
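In case you just want the gist, here’s a minimal sketch of such a client, assuming Mappify’s address autocomplete RPC endpoint and a JSON body containing the street address and API key (check their docs for the current URL and parameters):

require "net/http"
require "json"

class MappifyClient
  ENDPOINT = URI("https://mappify.io/api/rpc/address/autocomplete/")

  def initialize(api_key)
    @api_key = api_key
  end

  # Look up an address string and return Mappify's parsed/geocoded candidates.
  def address_search(address)
    request = Net::HTTP::Post.new(ENDPOINT, "Content-Type" => "application/json")
    request.body = JSON.generate(streetAddress: address, formatCase: true, apiKey: @api_key)

    response = Net::HTTP.start(ENDPOINT.host, ENDPOINT.port, use_ssl: true) { |http| http.request(request) }
    JSON.parse(response.body)
  end
end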

[1] pry(main)> mc = MappifyClient.new("your-api-key-goes-here")
[2] pry(main)> mc.address_search("unit 2, marina mirage, 74 seaworld drive, main beach qld")
=> {"type"=>"completeAddressRecordArray",
 "result"=>
  [{"buildingName"=>"MARINA MIRAGE",
    "flatNumberPrefix"=>nil,
    "flatNumber"=>2,
    "flatNumberSuffix"=>nil,
    "levelNumber"=>nil,
    "numberFirst"=>74,
    "numberFirstPrefix"=>nil,
    "numberFirstSuffix"=>nil,
    "numberLast"=>nil,
    "numberLastPrefix"=>nil,
    "numberLastSuffix"=>nil,
    "streetName"=>"SEAWORLD",
    "streetType"=>"DRIVE",
    "streetSuffixCode"=>nil,
    "suburb"=>"MAIN BEACH",
    "state"=>"QLD",
    "postCode"=>"4217",
    "location"=>{"lat"=>-27.96826181, "lon"=>153.42639831},
    "streetAddress"=>"Unit 2, Marina Mirage, 74 Seaworld Drive, Main Beach QLD 4217"}],
 "confidence"=>0.9180327868852459}

The SearchAddress service accepts addresses that have been extracted with AutoML, looks up each one using Mappify’s autocomplete API, and then bulk updates the address records.


Sidekiq Jobs

I won’t cover the code for both services, just the jobs that handle address extraction.

There are two workers involved in the address extraction process.

First is the scheduled job. It finds sites where addresses haven’t been extracted yet, then, in batches, enqueues another worker, passing it a list of site ids.

The enqueued job retrieves the sites by id and executes the ExtractAddress service.

class Scheduled::ExtractAddressesBatchJob
  include Sidekiq::Worker
  sidekiq_options lock: :until_and_while_executing

  def perform
    # sites whose pages haven't been checked for addresses yet
    sites = Site
      .select(:id)
      .addresses_not_checked
      .limit(10_000)

    sites.find_in_batches do |batch|
      site_ids = batch.pluck(:id)

      # hand the work off in slices of 500 sites per job
      site_ids.each_slice(500) do |slice|
        ExtractAddressesJob.perform_later(slice)
      end

      # mark the batch as checked so the next scheduled run skips these sites
      timestamp = DateTime.now
      Site
        .where(id: site_ids)
        .update_all(last_address_check_at: timestamp, updated_at: timestamp)
    end
  end
end

class ExtractAddressesJob < ApplicationJob
  queue_as :default

  def perform(site_ids)
    sites = Site.where(id: site_ids).includes(:pages)
    ExtractAddresses.process(sites)
  end
end


That’s it, pretty straightforward stuff. I’ve found the most difficult parts are getting the data to train on and annotating it for training.

AutoML can be expensive if you’ve got a lot of data to process, but the free trial can go a long way. Based on the calculations above, I can cover the cost of processing a million snippets with the free trial.


Written by Glen Bray