
Processing and analysing sensor data: a DIY approach



This post was written by Mario Koppen at the comSysto blog.

Motivated by a current customer project (and the interesting nature of Big Data projects from industry in general), we decided to get our hands on sensor data. We wanted to learn how to handle, store and analyze it and what specific challenges sensor data presents.

To get sensor data, we decided to generate our own by putting sensors into our office. We found Tinkerforge’s system of bricks and bricklets quite nice and easy to start with, so we went for that option.

We got the following four sensor bricklets:

  • Sound intensity (basically a small microphone)
  • Temperature
  • A multi-touch bricklet (12 self-made touch pads from aluminium foil can be connected)
  • A motion detector

The four bricklets are connected to a Master Brick, which in turn is connected to a Raspberry Pi.

We put the temperature sensor into a central place in the office. We set up the motion detector in the hallway leading to the kitchen and the bathrooms. We put the sound intensity sensor next to the kitchen door and placed touch sensors on the coffee machine, the fridge door and the door handle for the men’s bathroom.

Although this is clearly a toy setup (and you will have to wait a long time for the data to become big), we quickly ran into some key issues that also arise in real-world situations involving sensors.

As a storage solution we chose MongoDB, mainly because it was also used in the customer project that motivated the lab.

The data generated by the four sensors can be grouped into two categories: while the temperature and sound intensity sensors output a constant stream of data, the motion detector and multi-touch sensor are triggered by events that typically don’t occur at a fixed frequency.

This gave rise to two different document models in MongoDB. For the first category (streaming), we used the model that MongoDB suggests as best practice for this situation, which could be called the “Time Series Model” (see http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb). It consists of one collection of documents with nested subdocuments. The number of nesting levels and the number of subdocuments on each level depend on the time granularity of the data. In our case, the highest time resolution of the Tinkerforge sensors is 100 ms, which gives rise to the following document structure:

  • One document per hour
  • Fields: timestamp of the hour, sensor type, values
  • Values: nested subdocuments: 60 subdocuments (one per minute), each holding 60 subdocuments (one per second), each holding 10 values (one per tenth of a second)

An excerpt of such a document, showing a single second of data:

    {
        "_id" : ObjectId("53304fcd74fece149f175975"),
        "timestamp_hour" : ISODate("2014-03-24T16:00:00"),
        "type" : "SI",
        "values" : {
            "10" : {
                "05" : {
                    "00" : -500,
                    "01" : -500,
                    "02" : -500,
                    "03" : -500,
                    "04" : -500,
                    "05" : -500,
                    "06" : -500,
                    "07" : -500,
                    "08" : -500,
                    "09" : 0
                },
                ...
            },
            ...
        }
    }

The documents are pre-allocated in MongoDB, initializing all data fields to a value that is outside the range of the sensor data. This avoids constantly growing documents, which MongoDB would have to keep moving around on disk.
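The pre-allocation and the write into a single slot can be sketched in plain Python as follows. This is a minimal sketch and not the code used in the lab: the function names are hypothetical, the placeholder value -500 and the zero-padded two-digit keys are taken from the excerpt above, and the functions only build the dictionaries that a driver would send to MongoDB.

```python
from datetime import datetime

SENTINEL = -500  # placeholder outside the sensor range, as in the excerpt


def preallocate_hour_doc(hour, sensor_type, sentinel=SENTINEL):
    """Build one fully populated hourly document (60 x 60 x 10 slots)."""
    values = {
        f"{minute:02d}": {
            f"{second:02d}": {f"{tenth:02d}": sentinel for tenth in range(10)}
            for second in range(60)
        }
        for minute in range(60)
    }
    return {"timestamp_hour": hour, "type": sensor_type, "values": values}


def slot_update(ts, sensor_type, value):
    """Return (filter, update) that overwrites the slot matching `ts`."""
    # The target document follows directly from the timestamp.
    hour = ts.replace(minute=0, second=0, microsecond=0)
    tenth = ts.microsecond // 100_000  # 100 ms resolution
    path = f"values.{ts.minute:02d}.{ts.second:02d}.{tenth:02d}"
    return (
        {"timestamp_hour": hour, "type": sensor_type},
        {"$set": {path: value}},
    )


doc = preallocate_hour_doc(datetime(2014, 3, 24, 16), "SI")
query, update = slot_update(datetime(2014, 3, 24, 16, 10, 5, 900_000), "SI", 0)
print(update)  # {'$set': {'values.10.05.09': 0}}
```

With PyMongo, `doc` could be passed to `insert_one` and the `(query, update)` pair to `update_one`. Because the `$set` only overwrites an existing field, the document keeps its size.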

Data of the second type (event-driven/triggered) is stored in a “bucket”-like document model. For each sensor type, a number of documents with a fixed number of value entries (a bucket size of, e.g., 100) are pre-allocated. Events are then written into these documents as they occur. Each event corresponds to a subdocument in an array with 100 entries. The subdocument carries the start and end time of the event as well as its duration. When the first event is written into a document, the document gets a timestamp corresponding to the start date/time. On each write to the database, the application checks whether the current record is the last one that fits into the current document. If so, it sets the end date/time of the document and directs subsequent writes to the next document.

    {
        "_id" : ObjectId("532c1f9774fece0aa9325a13"),
        "end" : ISODate("2014-03-21T12:18:12.648Z"),
        "start" : ISODate("2014-03-21T12:16:39.047Z"),
        "type" : "MD",
        "values" : [
            {
                "start" : ISODate("2014-03-21T12:16:44.594Z"),
                "length" : 5,
                "end" : ISODate("2014-03-21T12:16:49.801Z")
            },
            {
                "start" : ISODate("2014-03-21T12:16:53.617Z"),
                "length" : 5,
                "end" : ISODate("2014-03-21T12:16:59.615Z")
            },
            {
                "start" : ISODate("2014-03-21T12:17:01.683Z"),
                "length" : 3,
                "end" : ISODate("2014-03-21T12:17:05.147Z")
            },
            {
                "start" : ISODate("2014-03-21T12:17:55.223Z"),
                "length" : 5,
                "end" : ISODate("2014-03-21T12:18:00.470Z")
            },
            {
                "start" : ISODate("2014-03-21T12:18:04.653Z"),
                "length" : 7,
                "end" : ISODate("2014-03-21T12:18:12.648Z")
            }
        ]
    }

These two document models sit at opposite ends of a trade-off that seems to be quite common with sensor data.
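In the bucket model, the application has to keep track of the document it is currently writing to. The sketch below shows that bookkeeping in memory, without a database. The class `BucketWriter`, the `None` placeholders for unused slots and the rounding of `length` down to whole seconds are assumptions made for illustration, not the lab's actual code.

```python
from datetime import datetime

BUCKET_SIZE = 100  # example bucket size mentioned in the text


class BucketWriter:
    """Tracks the current bucket document for one sensor type."""

    def __init__(self, sensor_type, bucket_size=BUCKET_SIZE):
        self.sensor_type = sensor_type
        self.bucket_size = bucket_size
        self.closed = []  # completed bucket documents
        self.filled = 0   # number of slots used in the current bucket
        self.current = self._new_bucket()

    def _new_bucket(self):
        # Pre-allocate every slot so the document does not grow on write.
        return {
            "type": self.sensor_type,
            "start": None,
            "end": None,
            "values": [None] * self.bucket_size,
        }

    def write(self, start, end):
        bucket = self.current
        if self.filled == 0:
            bucket["start"] = start  # first event stamps the document
        bucket["values"][self.filled] = {
            "start": start,
            "length": int((end - start).total_seconds()),
            "end": end,
        }
        self.filled += 1
        if self.filled == self.bucket_size:
            bucket["end"] = end  # last fitting event closes the document
            self.closed.append(bucket)
            self.current = self._new_bucket()
            self.filled = 0


writer = BucketWriter("MD", bucket_size=2)
writer.write(datetime(2014, 3, 21, 12, 16, 44, 594000),
             datetime(2014, 3, 21, 12, 16, 49, 801000))
writer.write(datetime(2014, 3, 21, 12, 16, 53, 617000),
             datetime(2014, 3, 21, 12, 16, 59, 615000))
writer.write(datetime(2014, 3, 21, 12, 17, 1, 683000),
             datetime(2014, 3, 21, 12, 17, 5, 147000))
print(len(writer.closed), writer.closed[0]["values"][0]["length"])  # 1 5
```

The demo uses a bucket size of 2 so that the third event visibly rolls over into a new document.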

The “Time Series” model suggested by MongoDB is great for efficient writing and has the advantage of having a nice, consistent schema: every document corresponds to a natural unit of time (in our case, one hour), which makes managing and retrieving data quite comfortable. Furthermore, the “current” document to write to can easily be inferred from the current time, so the application doesn’t have to keep track of it.

The nested structure allows for easy aggregation of data at different levels of granularity, although these aggregations have to be done “by hand” in your application. This is because the document model has no single fields named “minute”, “second” and “tenth of a second” to group by. Instead, every minute, second and tenth of a second has its own key.
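Such an aggregation “by hand” might look like the following sketch, which averages the readings of each minute from the nested `values` field of one hourly document. The function name `minute_means` is hypothetical, and the sketch assumes that -500 marks slots that were never written, as in the excerpt of the hourly document.

```python
SENTINEL = -500  # assumed placeholder for slots that were never written


def minute_means(values, sentinel=SENTINEL):
    """Average all written readings per minute, skipping untouched slots."""
    means = {}
    for minute, seconds in values.items():
        readings = [
            value
            for tenths in seconds.values()
            for value in tenths.values()
            if value != sentinel
        ]
        if readings:  # minutes without any reading are left out
            means[minute] = sum(readings) / len(readings)
    return means


values = {
    "10": {"05": {"00": -500, "01": 4, "02": 8}, "06": {"00": 6}},
    "11": {"00": {"00": -500}},
}
print(minute_means(values))  # {'10': 6.0}
```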

This model runs into problems as soon as the data can be sparse. This is the case for the data coming from the motion and multi-touch sensors: there is no natural frequency for this data, since events can happen at any time. In the Time Series document model, a certain fraction of the pre-allocated fields would never be written, which wastes disk space.

Sparse data can also arise where the sensor data does not seem to be event-driven at first. Many sensors measure at a fixed frequency but only output a value if it has changed since the last measurement. This is a challenge one has to deal with. To stick with the time series document model, one would have to constantly check whether the sensor omitted values and fill the corresponding slots in the database with the last value it sent. Of course, this would introduce a lot of redundancy in the database.
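The filling of omitted values can be illustrated with a short sketch. It assumes the slots of a time range are available as a flat, time-ordered list and that -500 marks a slot the sensor did not report. The function name `forward_fill` is hypothetical.

```python
def forward_fill(slots, sentinel=-500):
    """Replace untouched slots with the last value the sensor reported."""
    filled = []
    last = None
    for value in slots:
        if value == sentinel:
            # Nothing reported for this slot: repeat the last known value,
            # or keep the placeholder if no value has been seen yet.
            value = last if last is not None else sentinel
        else:
            last = value
        filled.append(value)
    return filled


print(forward_fill([-500, 21, -500, -500, 23, -500]))
# [-500, 21, 21, 21, 23, 23]
```

In this example, three of the six stored values are copies of an earlier reading, which is the redundancy described above.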






Published at DZone with permission of Comsysto Gmbh, DZone MVB. See the original article here.

