I’ve been doing a lot of web and Google Spreadsheet scraping recently, and one situation I keep running into is that the schema of the data source doesn’t quite fit the schema I’m trying to dump the data into. The data source might expose someone’s full name, for instance, whereas I want to store the first and last names separately. I’ve developed a useful little coding pattern for this situation that I thought I’d share here.
Let’s say that scraping any data source produces a result result : Map[A,B], where A and B are almost always strings in real life. For example:
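A minimal sketch of such a result — the keys and values here (url, name) are invented for illustration, not taken from any real scrape:

```scala
// Hypothetical scrape result: both keys and values are plain strings.
val result: Map[String, String] = Map(
  "url"  -> "http://example.com/people/1",
  "name" -> "Ada Lovelace"
)
```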
Let’s define a “fixer” as a function that takes your result and outputs the missing key-value pairs that you would have liked to see.
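In Scala that might look like the following sketch (specializing A and B to String, since that’s the common case; the do-nothing fixer is just my illustration of the shape):

```scala
// A Fixer inspects the scraped result and returns the key-value
// pairs that should have been there but weren't.
type Fixer = Map[String, String] => Map[String, String]

// A do-nothing fixer: it finds nothing missing.
val noop: Fixer = _ => Map.empty
```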
And then we define a function fix, which simply applies a fixer to the result and folds its output back into the original map (possibly overwriting some keys).
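A minimal sketch of fix, again assuming the string-to-string specialization:

```scala
type Fixer = Map[String, String] => Map[String, String]

// Apply the fixer, then fold its output back into the original map;
// ++ lets the fixer's pairs overwrite any colliding keys.
def fix(result: Map[String, String], fixer: Fixer): Map[String, String] =
  result ++ fixer(result)
```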
Now we’ve got an incredibly useful little function that can help us tidy up any schema-misaligned data that we’re pulling in. To split the full name into first and last components, we might do the following (pardon the lack of error checking):
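One possible name-splitting fixer (the output keys first_name and last_name are my choice):

```scala
type Fixer = Map[String, String] => Map[String, String]

// Split the full name on the first space -- no error checking, so a
// single-word name would blow up here.
val splitName: Fixer = result => {
  val parts = result("name").split(" ", 2)
  Map("first_name" -> parts(0), "last_name" -> parts(1))
}
```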
And then, given our result object from above, we can fix it with a single call to fix, which produces a map containing both the original pairs and the newly split name fields.
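Putting the pieces together (same hypothetical keys and values as above), the call and its output look like this:

```scala
type Fixer = Map[String, String] => Map[String, String]

def fix(result: Map[String, String], fixer: Fixer): Map[String, String] =
  result ++ fixer(result)

val splitName: Fixer = r => {
  val parts = r("name").split(" ", 2)
  Map("first_name" -> parts(0), "last_name" -> parts(1))
}

val result = Map("url" -> "http://example.com/people/1", "name" -> "Ada Lovelace")

val fixed = fix(result, splitName)
// fixed == Map("url" -> "http://example.com/people/1", "name" -> "Ada Lovelace",
//              "first_name" -> "Ada", "last_name" -> "Lovelace")
```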
Why go through all this trouble to wrap a simple modification to a data object? Because if you’re scraping many different sites, you’ll need a pipeline to automate the work for you. By folding a formalized “fix-it” step into that pipeline, you can write your scraper bot in a domain-independent manner and then simply provide it with a chain of Fixer functions for each URL pattern you request of it.
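That chaining step can be sketched as a fold; the registry keyed by URL pattern is hypothetical (the pattern syntax here is invented), but it shows how the scraper stays domain-independent:

```scala
type Fixer = Map[String, String] => Map[String, String]

// Thread a result through a chain of fixers, folding each fixer's
// output back in before the next one runs.
def fixAll(result: Map[String, String], fixers: List[Fixer]): Map[String, String] =
  fixers.foldLeft(result)((acc, f) => acc ++ f(acc))

// Hypothetical registry: the scraper looks up the chain by URL pattern.
val fixersFor: Map[String, List[Fixer]] = Map(
  "example.com/people/*" -> List(r => Map("source" -> r("url")))
)
```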