We do a lot of bulk loads. A lot.
So many that we
have some standard ETL-like modules for generic "Validate", "Load",
"Load_Dimension", "Load_Fact" and those sorts of obvious patterns.
Our business processes amount to a "dimensional conformance and fact
load", followed by extracts, followed by a different "dimensional
conformance and fact load". We have multiple fact tables, with some
common dimensions. In short, we're building up collections of facts
about entities in one of the dimensions. [And no, we're not building up
data on individual consumers. Really.]
Of course, someone has a brain-fart.
An overall load
application is a simple loop. For each row in the source document,
conform the various dimensions, and then load the fact. Clearly, we
have a bunch of dimension conformance objects and a fact loading object.
Each object gets a crack at the input row and enriches it with some
little tidbit (like a foreign key).
This leads us to a pretty generic "Builder", "Dimension Builder" and "Fact
Builder" class hierarchy. Very tidy.
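Here's a minimal sketch of that loop and hierarchy. The class names follow the text, but every method name, constructor argument and the in-memory dict and list standing in for the dimension and fact tables are assumptions made for illustration. This is not our actual framework code.

```python
class Builder:
    """Hypothetical base class: a builder enriches one input row in place."""

    def build(self, row):
        raise NotImplementedError


class DimensionBuilder(Builder):
    """Conforms one dimension and adds its foreign key to the row."""

    def __init__(self, name, column, table):
        self.name = name      # e.g. "customer"
        self.column = column  # source column holding the natural key
        self.table = table    # dict stand-in for the dimension table

    def build(self, row):
        natural_key = row[self.column].strip().lower()
        # Look up the dimension row, creating it on first sight.
        if natural_key not in self.table:
            self.table[natural_key] = len(self.table) + 1
        row[self.name + "_id"] = self.table[natural_key]


class FactBuilder(Builder):
    """Loads the fact once every dimension key is present on the row."""

    def __init__(self, measure, dimensions, facts):
        self.measure = measure
        self.dimensions = dimensions
        self.facts = facts  # list stand-in for the fact table

    def build(self, row):
        fact = {d + "_id": row[d + "_id"] for d in self.dimensions}
        fact[self.measure] = float(row[self.measure])
        self.facts.append(fact)


def load(rows, builders):
    """The overall load application: one simple loop."""
    for source in rows:
        row = dict(source)  # copy so the source document is left untouched
        for builder in builders:
            builder.build(row)
```

The order of the builders matters: the dimension builders run first so the fact builder finds the foreign keys already on the row.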
Each new kind of feed (usually because no two customers are alike) is really just
a small module with builders that are specific to that customer. And
the builders devolve to two methods:
- Transform a row to a new-entity dict, suitable for a Django model.
  Really, just a simple dict( field=source['Column'],
  field=source['Column'], ... ) block.
- Transform a row to a dimension conformance query, suitable
  for a Django filter. Again, a simple dict( some_key__iexact=... ).
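To make the two methods concrete, here's a hedged sketch of one customer-specific builder. The class name, method names, field names and column names are all invented for the example. The `conform` helper assumes a Django model and relies only on the standard `filter(**kwargs)`, `first()` and `create(**kwargs)` calls on its manager.

```python
class CustomerBuilder:
    """Hypothetical builder for one customer's feed."""

    def new_entity(self, source):
        # Row -> keyword arguments for a new dimension row.
        return dict(
            name=source["Customer Name"],
            region=source["Region"],
        )

    def conformance_query(self, source):
        # Row -> keyword arguments for a filter; __iexact is Django's
        # case-insensitive exact-match lookup.
        return dict(
            name__iexact=source["Customer Name"],
        )


def conform(model, builder, source):
    """Framework side: find the dimension row, or create it if missing."""
    match = model.objects.filter(**builder.conformance_query(source)).first()
    if match is None:
        match = model.objects.create(**builder.new_entity(source))
    return match
```

The point of the split is that the customer-specific module holds only the two dicts; the query-then-create logic lives once, in the framework.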
The nice thing is that the
builders abstract out all the messy details. Except.
We're now getting data that's not --
narrowly -- based on things our customers tell us. We're getting data
that might be useful to our customer. Essentially, we're processing
their data as well as offering additional data.
But... We lack the obvious
customer-supplied keys required to do dimensional conformance. Instead,
we have to resort to a multi-step matching dance.
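As a rough illustration of what such a dance can look like, here's a sketch that tries a list of matching steps in order and accepts the first one that yields exactly one candidate. The specific steps (exact name, then name prefix plus city) and the field names are made up for the example; the real matching rules depend on the feed.

```python
def match_exact_name(row, candidates):
    """Step 1 (hypothetical): case-insensitive match on the full name."""
    name = row["name"].strip().lower()
    return [c for c in candidates if c["name"].lower() == name]


def match_name_prefix_and_city(row, candidates):
    """Step 2 (hypothetical): looser name match, narrowed by city."""
    prefix = row["name"].strip().lower()[:5]
    city = row["city"].strip().lower()
    return [
        c for c in candidates
        if c["name"].lower().startswith(prefix) and c["city"].lower() == city
    ]


def multi_step_match(row, candidates, steps):
    """Try each step in order; only an unambiguous result counts as a match."""
    for step in steps:
        found = step(row, candidates)
        if len(found) == 1:
            return found[0]
    return None  # no step produced a single candidate
```

Even in this toy form, the control flow (try, check for ambiguity, fall through) is no longer a single dict the way the simple builders are.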
The multi-step matching dance
pushed the "Builder" design one step beyond. It moved from tidy to
obscure. There's a line that seems to be drawn around "too much"
back-and-forth between framework code and our Builders.
Even something as bone-simple as a bulk loader has two candidate design patterns.
- A standard loader app with plug-in features for mappings. This is what I chose.
  The mappings have been (until now) simple. The app is standard. Plug a
  short list of classes into the standard framework. Done.
- Standard load support libraries that make a simple load app look simple. In
  this case, each load app really is a top-level app, not simply some
  classes that plug into an existing, standardized app. Write the
  standard outer loop? Please.
What's wrong with plug-ins?
It's hard to say. But it seems that a
plug-in passes some limit to OO understandability. It seems that if we
refactor too much up to the superclass then our plug-ins become hard to
understand because they lose any "conceptual unity".
The limiting factor seems to be a "conceptually complete" operation or
step. Not all code is so costly that a simple repeat is an accident
waiting to happen.
Hints from Map-Reduce
It seems like there are two conceptual units. The loop. The function
applied within the loop. And we should write all of the loop or all of
the mapped function.
If we're writing the
mapped function, we might call other functions, but it feels like we
should limit how much other functions call back to the customer-specific
code.
If we're writing the overall loop --
because some bit of logic is really convoluted -- we should simply write
the loop without shame. It's a for statement. It's not obscure
or confusing. And there's no reason to try and factor the for
statement into the superclass just because we can.
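A small sketch of the two choices, under invented names: `conform` is the mapped function, `simple_load` is the standard loop applied to it, and `convoluted_load` is a feed that owns its whole for statement. The column names and the "skip rows with no amount" rule are assumptions for the example only.

```python
def conform(row, customers):
    """The mapped function: complete in itself, no callbacks into a framework."""
    key = row["Customer"].strip().lower()
    # dict stand-in for the dimension table; new keys get the next id.
    customer_id = customers.setdefault(key, len(customers) + 1)
    return {"customer_id": customer_id, "amount": float(row["Amount"])}


def simple_load(rows, customers):
    """Choice 1: a standard loop; the feed supplies only the mapped function."""
    return [conform(row, customers) for row in rows]


def convoluted_load(rows, customers):
    """Choice 2: the feed needs odd logic, so it writes the whole loop."""
    facts, rejects = [], []
    for row in rows:
        if not row["Amount"].strip():
            rejects.append(row)  # hypothetical rule: skip rows with no amount
            continue
        facts.append(conform(row, customers))
    return facts, rejects
```

Either way, each piece reads top to bottom on its own: nobody has to chase a superclass to find out what the loop does.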