Decisions and Consequences

By Steven Lott · May 20, 2011


A single poorly-made decision can have profound ripple effects. Once you're stuck with it, you make accommodations, hacks, and workarounds. Eventually, things work, but the result is less than ideal.

Changing tack sometimes requires pervasive rework of the application. How can we reduce the risks and improve the value created?

A Recent Example

When dealing with bulk econometric data (Bloomberg, D&B, Moody's, etc.), you get BIG files with lots of fields. Depending on what you're paying for, the file layouts are frequently different even though the content is similar. I'm a big fan of plain-old CSV data. Even the tab-delimited variant of CSV is not bad to work with.

Further, most vendors will slap some heading rows on the file so that the column names are--more or less--identified. Surprisingly, this doesn't work out well in practice because there are often multiple columns with the same name. Sigh.

Using Python's csv library module lets us cope with CSV (and tab-delim) quite gracefully.

What's wrong with that decision? Nothing.

Variant Column Names

The question arises when you've purchased several files of econometric data and the column names are slightly different. This happens with a single vendor and across vendors. It's part of the game that can't easily be avoided. Column names vary.

What to do?

Here's the less-than-ideal decision: make the column names a parameter.

In Python, this is not terribly difficult. The csv module's DictReader provides us a dictionary for each row. Each column name becomes a key. We can access the fields with some_row['this_field'] and some_row['that_field']. How bad can it be?
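As a quick illustration, a DictReader-based loop looks roughly like the following sketch. The file name and column names are hypothetical, and the file is assumed to be tab-delimited with a heading row.

import csv

# A minimal sketch: read a tab-delimited vendor file through DictReader.
# The file name and column names here are invented for illustration.
with open("vendor_feed.txt", newline="") as source:
    reader = csv.DictReader(source, delimiter="\t")
    for some_row in reader:
        # Every field access goes through a string key taken from the heading row.
        print(some_row["this_field"], some_row["that_field"])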

The extra punctuation is fairly hideous.

More important, however, is the nature of the metadata.

Consequence One -- Dynamic Metadata

Dynamic metadata, in this case, means that any indexing of the data is done based on character string column names.

index[index_name][row[column_name]].append( row )

That's rather more complex than the alternative where the metadata has a fixed definition.

some_index[row.column].append( row )
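Here's a small, self-contained sketch of the two styles side by side; the row data and field names are invented for illustration.

from collections import defaultdict, namedtuple

# Hypothetical rows as DictReader would produce them: string-keyed dicts.
rows = [{"region": "EU", "value": "1.2"}, {"region": "US", "value": "3.4"}]

# Dynamic metadata: the index name and the column name are strings that
# are only known at run time, typically read from a parameter somewhere.
index = defaultdict(lambda: defaultdict(list))
index_name, column_name = "by_region", "region"
for row in rows:
    index[index_name][row[column_name]].append(row)

# Fixed metadata: the same rows as named tuples; the column is an
# attribute that is known when the code is written.
Row = namedtuple("Row", ["region", "value"])
some_index = defaultdict(list)
for row in (Row(**r) for r in rows):
    some_index[row.region].append(row)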

Consequence Two -- Murky ORM

Once we have dynamic metadata, we're largely frozen out of ordinary SQL database implementations. We don't know the column names; we don't know the indices. We can't do simple CREATE TABLE statements because we don't really have the column names until we open the working files.

We have to grub through all the code to find out where the dynamic mapping is reasoned out. Once we find that, we can then consider how to make the metadata fixed enough to tackle a SQL database.

We could, of course, generate the SQL CREATE INDEX statements on-the-fly. There's nothing wrong with it. But it slows down analysis and decision-making when we're not sure what indexes there are or what leads to a choice of index.
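A sketch of that on-the-fly generation, with invented table and column names, shows the problem: it runs fine, but anyone reading the code has to reconstruct which indexes actually exist.

import sqlite3

# Hypothetical: the columns (and therefore the indexes) are only
# discovered when the working file is opened.
discovered_columns = ["region", "company_id"]

conn = sqlite3.connect(":memory:")
columns_sql = ", ".join(f"{name} TEXT" for name in discovered_columns)
conn.execute(f"CREATE TABLE facts ({columns_sql})")
for name in discovered_columns:
    conn.execute(f"CREATE INDEX idx_{name} ON facts ({name})")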

What's important here is that we want to use SQLite because it ships with Python. We want our application to use an ORM (like SQLAlchemy or SQLObject). We don't want our application to become a kind of ORM because of the dynamic SQL and dynamic column names.
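For contrast, this is roughly where we want to end up: a fixed, declared model. The sketch below assumes a recent SQLAlchemy (1.4+) over SQLite, and the table and column names are hypothetical.

from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

# Once the variant inputs are mapped to one common structure, the columns
# can be declared once, statically. Names here are invented for illustration.
class Fact(Base):
    __tablename__ = "facts"
    id = Column(Integer, primary_key=True)
    region = Column(String)
    value = Column(Float)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Fact(region="EU", value=1.2))
    session.commit()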

Cleanup

The cleanup road is clear.

  1. Map all variant inputs to one common structure. Rather than work with raw dictionaries from csv, map each row to a standard set of names. For now, we can replace the dictionaries with named tuples to prepare for a migration to an ORM when that's possible.
  2. Replace the row['some field'] syntax with row.some_field syntax. Of course, there's a lot of this. This is a pervasive change.
  3. Find all the dynamic index creation and refactor that into a more static "database-like" place for now.

Item 1 is pretty easy to unit test. We're adding a function to map from dynamic names to fixed names. Nothing much to this testing-wise.
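A minimal sketch of that mapping function, with a hypothetical common structure and hypothetical vendor column names:

from collections import namedtuple

# Hypothetical common structure and one vendor's name mapping; the real
# application would carry one mapping per vendor file layout.
EconRow = namedtuple("EconRow", ["company_id", "region", "revenue"])

VENDOR_A_NAMES = {
    "company_id": "Company ID",
    "region": "Region Name",
    "revenue": "Revenue (USD)",
}

def to_econ_row(raw, names=VENDOR_A_NAMES):
    """Map one DictReader row with variant column names to the fixed structure."""
    return EconRow(**{field: raw[source] for field, source in names.items()})

# The unit test is simple: given a raw dict, we expect the fixed structure.
assert to_econ_row(
    {"Company ID": "123", "Region Name": "EU", "Revenue (USD)": "9.5"}
) == EconRow(company_id="123", region="EU", revenue="9.5")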

Item 2 requires unit tests with really good code coverage, or there's no earthly way we can be sure that each mapping-syntax name has been transformed into an attribute-syntax name.

Item 3 barely requires testing. Indexes and other features are performance enhancements that can be removed and added without altering functionality.

From http://slott-softwarearchitect.blogspot.com/2011/05/decisions-and-consequences.html

