
Voron & Time Series: Working with Real Data


Dan Liebster has been kind enough to send me a real-world time series database. The data has been sanitized to remove identifying information, but it is actual production data, so we can learn a lot from it.

This is what the data looks like:

[image: a sample of the time series records]

The first thing that I did was take the code in this post, and try it out for size. I wrote the following:

    int i = 0;
    // `dts` is the time series storage built in the linked post.
    using (var parser = new TextFieldParser(@"C:\Users\Ayende\Downloads\TimeSeries.csv"))
    {
        parser.HasFieldsEnclosedInQuotes = true;
        parser.Delimiters = new[] { "," };
        parser.ReadLine(); // skip the header row

        var sw = Stopwatch.StartNew();
        while (parser.EndOfData == false)
        {
            var fields = parser.ReadFields();
            Debug.Assert(fields != null);

            dts.Add(fields[1],
                DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture),
                double.Parse(fields[3], CultureInfo.InvariantCulture));
            i++;
            if (i == 25 * 1000)
                break;
            if (i % 1000 == 0)
                Console.Write("\r{0,15:#,#}          ", i);
        }
        Console.WriteLine();
        Console.WriteLine(sw.Elapsed);
    }

Note that we are using a separate transaction per line, which means that we are doing a lot of extra work. But this simulates incoming events arriving one at a time very well. We were able to process 25,000 events in 8.3 seconds, a rate of just over 3 events per millisecond.

Now, note that we have here the notion of "channels." From my investigation, it seems clear that some form of separation like this is very common in time series data. We are usually talking about sensors or some such, and we want to track data from different sensors over time. There is little, if any, call for working across multiple sensors / channels at the same time.

Because of that, I made a relatively minor change in Voron that allows it to have an unbounded number of separate trees. That means we can use as many trees as we want, and we can model each channel as its own tree in Voron. I also changed things so that instead of doing a single transaction per line, we do a transaction per 1,000 lines. That dropped the time to insert 25,000 lines to 0.8 seconds, a full order of magnitude faster.
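The tree-per-channel layout can be approximated, purely for illustration, with ordinary sorted maps. Here is a Java sketch (Java is used only as a neutral illustration language; `ChannelStore` and its methods are made up for this example, not Voron's actual API) that keeps one timestamp-ordered tree per channel:

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class ChannelStore {
    // One sorted "tree" per channel, keyed by timestamp --
    // a stand-in for Voron's tree-per-channel layout.
    private final Map<String, TreeMap<Instant, Double>> channels = new HashMap<>();

    public void add(String channel, Instant time, double value) {
        channels.computeIfAbsent(channel, c -> new TreeMap<>())
                .put(time, value);
    }

    // Returns the channel's tree (empty if the channel is unknown).
    public TreeMap<Instant, Double> channel(String name) {
        return channels.getOrDefault(name, new TreeMap<>());
    }

    public static void main(String[] args) {
        ChannelStore store = new ChannelStore();
        store.add("sensor-1", Instant.parse("2014-01-01T00:00:00Z"), 21.5);
        store.add("sensor-2", Instant.parse("2014-01-01T00:00:00Z"), 3.2);
        store.add("sensor-1", Instant.parse("2014-01-01T00:01:00Z"), 21.7);
        System.out.println(store.channel("sensor-1").size()); // 2
    }
}
```

Because each channel lives in its own ordered structure, writes to one channel never touch another channel's tree, and reads within a channel are a simple in-order scan.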

That done, I inserted the full data set of 1,096,384 records. That took 36 seconds. The data set I have contains 35 channels.
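These bulk inserts rely on the transaction-per-1,000-lines batching described above; the pattern itself is simple. A toy Java sketch of it, where the `commit` method is a made-up stand-in for committing a Voron write transaction:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedWriter {
    static final int BATCH_SIZE = 1000;
    static int commits; // count commits to show the effect of batching

    // Stand-in for committing a Voron write transaction.
    static void commit(List<String> batch) {
        commits++;
        batch.clear();
    }

    public static void main(String[] args) {
        commits = 0;
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {
            batch.add("record-" + i);   // buffer the write
            if (batch.size() == BATCH_SIZE)
                commit(batch);          // one commit per 1,000 lines
        }
        if (!batch.isEmpty())
            commit(batch);              // flush any remainder
        System.out.println(commits);    // 25 commits instead of 25,000
    }
}
```

Amortizing the commit cost over 1,000 records is what turns 8.3 seconds into 0.8 seconds for the 25,000-line test.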

I just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. That makes it practical to compute averages over time, compare data between channels, etc.
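Once each channel is an ordered tree, a time-range average is just a scan over a key range. A rough Java sketch of the idea using `TreeMap.subMap` to mimic an in-order range read (again an illustration, not Voron's API):

```java
import java.time.Instant;
import java.util.TreeMap;

public class ChannelAverage {
    // Average of all values whose timestamps fall within [from, to],
    // scanning the key range in order -- like iterating a Voron tree.
    static double averageBetween(TreeMap<Instant, Double> channel,
                                 Instant from, Instant to) {
        double sum = 0;
        int count = 0;
        for (double v : channel.subMap(from, true, to, true).values()) {
            sum += v;
            count++;
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        TreeMap<Instant, Double> channel = new TreeMap<>();
        channel.put(Instant.parse("2014-01-01T00:00:00Z"), 10.0);
        channel.put(Instant.parse("2014-01-01T01:00:00Z"), 20.0);
        channel.put(Instant.parse("2014-01-01T02:00:00Z"), 30.0);
        channel.put(Instant.parse("2014-01-02T00:00:00Z"), 99.0); // outside the range

        System.out.println(averageBetween(channel,
                Instant.parse("2014-01-01T00:00:00Z"),
                Instant.parse("2014-01-01T23:59:59Z"))); // 20.0
    }
}
```

Comparing two channels over the same window is then just two such range scans over two trees.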

You can see the code implementing this in the following link.



Published at DZone with permission of
