
Properly Getting Into Jail: Data Processing


This series has talked a lot about data flow. This time, we'll get into the intricacies of data processing and data modeling.


In this series of posts, I have talked a lot about the way data flows, the notion of always talking to a local server, and the difference between each location’s own data and the shared data that comes from the other parts of the system.

Here is how it looks when modeling things:

[Image: snapshot of the data model, showing each location's own database and the shared database]

To simplify things, we have the notion of the shared database (which is how I typically get data from the rest of the system) and my database. Data gets into the shared database using replication, which uses a gossip protocol, is resilient to network errors, and can route around them. The application will only ever write data to its own database, never directly to the shared one. ETL processes will write the data from the application database to the local copy of the shared database, and from there, it will be sent across to the other parties.
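To make that flow concrete, here is a minimal sketch in plain Python (no RavenDB client involved; the databases are just dictionaries and all names are hypothetical) of the write-locally, publish-via-ETL pattern described above:

```python
# Toy model of the data flow: the app writes only to its own database;
# an ETL step copies selected documents into the local copy of the
# shared database, from which replication would disseminate them.

def etl(app_db, shared_db, should_publish):
    """Copy publishable documents from the app DB to the shared DB."""
    for doc_id, doc in app_db.items():
        if should_publish(doc):
            shared_db[doc_id] = doc  # replication takes it from here

app_db = {}      # this location's own database
shared_db = {}   # local copy of the shared database

# The application only ever writes to its own database:
app_db["inmates/1"] = {"Name": "J. Doe", "Internal": True}
app_db["workflows/1"] = {"Type": "Release", "Internal": False}

# Only documents meant for the rest of the system are published:
etl(app_db, shared_db, lambda d: not d["Internal"])
```

The point of the sketch is that publication is a one-way, filtered copy: the application never has to know who consumes the data downstream.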

In terms of input/output, the application only ever writes locally to its app DB; the ETL process copies that data to the local shared DB, and from there it is disseminated automatically to the rest of the world. Once the setup is done, this is quite simple. This means that you don't really have to think about how you publish information, yet you are not constrained in the development of the application (no shared-database blues here, thank you!). However, that only deals with the outgoing side of things. How are we going to handle incoming data?

We need to remember that a core part of the design is that we aren't just blindly copying data from the shared database. Even though the source is trusted, we still need to process the data and reconcile it with what we have in our own database.

A good example of that might be the release inmate workflow that we already discussed. This is initiated by the Registration Office, and it is sent to all the parties in the prison. Let’s see how a particular block is going to handle the processing of such a core scenario.

The actual workflow for releasing an inmate needs to be handled by many parties. From the block’s perspective, this means getting the inmate physically into the release party and handing over responsibility for that inmate. When the workflow document for the inmate release reaches the block’s shared database, we need to start the internal process inside the block to handle that. We can use RavenDB subscriptions for this purpose. A subscription is a persistent query, and any time a match is found on the subscription query, we’ll get the matching documents and can operate on that.

Here is how the subscription looks:

[Image: the subscription query definition]

Basically, it says, “Gimme all the release workflows for block B.” The idea of a persistent query is that whenever a new document arrives, if it matches the query, we’ll send it to the process that has this subscription opened. This means that we have a typical latency of a few milliseconds before we process the document in the worker process.
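Conceptually, the persistent query behaves like a standing filter over arriving documents. A toy model in plain Python (hypothetical names, no RavenDB client involved) makes the mechanism explicit:

```python
# Toy model of a subscription: a persistent query plus a worker callback.
# Every newly arriving document that matches the query is pushed to the
# worker that has the subscription open.

class Subscription:
    def __init__(self, query, worker):
        self.query = query    # predicate standing in for the RQL query
        self.worker = worker  # callback receiving matching documents

    def on_document_arrived(self, doc):
        if self.query(doc):
            self.worker(doc)

released = []
sub = Subscription(
    query=lambda d: d["Type"] == "ReleaseWorkflow" and d["BlockId"] == "B",
    worker=released.append,
)

# Only the Block B release workflow reaches the worker:
sub.on_document_arrived({"Type": "ReleaseWorkflow", "BlockId": "B", "InmateId": 1})
sub.on_document_arrived({"Type": "ReleaseWorkflow", "BlockId": "C", "InmateId": 2})
```

Unlike this toy, a real subscription is durable: the server remembers how far each worker got, so documents that arrive while the worker is down are delivered when it reconnects.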

Now, let’s consider what we’ll need to do whenever we get a new release workflow. This can look like this:

#!/usr/bin/env python3

# Note: the RavenDB Python client API is sketched here; exact module,
# class, and method names may differ between client versions.
from pyravendb.store.document_store import document_store
from pyravendb.subscriptions.data import subscription_connection_options

# Domain models and helpers from the (application-specific) macto package
from macto.models import Inmate, BlockReleaseInfo
from macto.alerts import raise_alert, notify_block_sergeant

def create_store(url, db):
  store = document_store(urls=[url], database=db)
  store.initialize()
  return store

shared_store = create_store("https://block-b.prison.macto", "shared.macto")
local_store = create_store("https://block-b.prison.macto", "local.macto")

def process_batch(batch):
  for wf in batch.items:
    with local_store.open_session() as local_session:
      inmates = list(local_session.query(Inmate).where_equals("InmateId", wf.InmateId))
      if len(inmates) == 0:
        # this is seen by the Command & Control Center and the block's sergeant
        raise_alert(local_session, f"Inmate {wf.InmateId}, {wf.Name} is not found on Block B, but got release warrant #{wf.Number}")
        continue
      # the inmate is in the block...
      # generate a local workflow document starting the block's internal
      # release inmate operation
      local_session.store(BlockReleaseInfo(wf))
      local_session.save_changes()
  notify_block_sergeant()  # send SMS

with shared_store.subscriptions.get_subscription_worker(
    subscription_connection_options(
      id="Block's inmates to be released"
    )) as subscription:
  subscription.run(process_batch).join()

I’m obviously skipping stuff here, but you should get the gist (pun intended) of what is going on.

There are a couple of interesting things in here. First, you can see that I’m writing the code here in Python. I could have also used Ruby, node.js, etc.

The idea is that this is an internal ingestion pipeline for a particular workflow, independent of anything else that happens in the system. Basically, the idea is to have an open-to-extension, closed-to-modification kind of system. Integration with the outside world is done through subscriptions that filter the data we care about and integration scripts that operate over the stream of data. I'm using a Python script here because it is easy to show how fluid this can be. I could have used a compiled application in C# or Java just as easily. But the idea in this architecture is that it is possible and easy to modify and manage things on the fly.

The subscription workers ingest the documents from the subscriptions, take the data from the shared database, process and validate it, and then decide what should be done next. For any batch of workflow documents for releasing inmates, we'll alert the sergeant: either we need to release the inmate, or we need to figure out why the warrant landed on us while the inmate is not in our hands.
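That validate-and-decide step can be sketched as a pure function (hypothetical names; the real worker in the script above also writes documents and raises alerts, rather than returning a decision):

```python
def decide(workflow, local_inmate_ids):
    """Reconcile a release workflow against the block's own roster."""
    if workflow["InmateId"] in local_inmate_ids:
        # Inmate is physically here: start the internal release process.
        return "start-block-release"
    # The warrant was routed to us, but the inmate is not in our hands:
    # escalate to the Command & Control Center and the sergeant.
    return "alert-command-and-control"

roster = {17, 23, 42}  # inmate ids currently held by Block B
```

Keeping the decision logic separate from the I/O makes it trivial to test, and easy to evolve per block without touching the ingestion plumbing.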

More complex scripts may check all the release workflows, in case the Registration Office's notion of which block the inmate is on is out of date, for example. We can also use these scripts to glue in additional players (preparing the release party to take the inmate, scheduling this in advance, etc.), but we might want to do that in our main app instead of in the scripts, to make what is going on more visible.

The underlying architecture and shape of the application are quite explicit about the notion of data transfer, though, so it might be a good idea to do it in the scripts. A lot of that depends on whether this is shared functionality, something that is customized per block, etc.



Published at DZone with permission of
