Stream SmartThings Data to Cosmos DB and PowerBI (Part 2)
Stream SmartThings Data to Cosmos DB and PowerBI (Part 2)
Learn how to work on the SmartThings data we collected in Part 1 so that we can use it in our reports and do historical analysis.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
In Part 1 of this series, we looked at how we can collect data from SmartThings and send it over to an Azure Event Hub. We're now going to take the next step and work on storing that data so that we can use it in our reports and do historical analysis. For this, we are going to be making use of Azure Cosmos DB for storage and Azure Stream Analytics to move the data.
The first thing we need to do is create a Cosmos DB account to store the data. Head over to the Azure Portal and create a Cosmos DB account instance. When you do this, it is going to ask you to select the API to use. I chose SQL, but pick whatever you prefer. You also want to put this in the same region as your Event Hub to improve latency.
At this point, you have created an account that will contain your database. This account is in a single region using the default Session consistency. This was fine for my needs, but if you want to replicate your data to more regions or alter consistency, you can do so now.
We now need to create a database and a collection in Cosmos to store out data. The easiest way to do this is to scroll down to the Collections section, then click Browse. Click on the Add Collection option at the top. For the database, select Create New and then enter a name for the database and for the collection. You then need to pick a size, either fixed or unlimited. The fixed option uses a single partition, which imposes a limit of 10GB. This has the advantage of not having to deal with managing partition keys. To give you an idea of the data volume, I have had this running for around four months and have generated 80MB of data. I would recommend the fixed option to keep things simple. Finally, we need to select the amount of throughput we want the database to support in RUs. The lowest you can go is 400 RU, and that has been more than enough for my setup which has around 20 sensors reporting in. Once you have completed all that, create the DB and collection.
The great thing about Cosmos is that it is NoSQL-based, so we don't need to worry about creating schemas and tables. All we need to do now is send the data across.
At the moment, our sensor data has arrived at the Event Hub and is sitting there doing nothing. Event Hub is not intended as a long-term storage mechanism and will only keep this data for seven days. We need to create a process to move this data over to our new Cosmos DB.
In this example, I have chosen to use Azure Stream Analytics. This is Microsoft's tool for processing streams of data and performing analytics. I chose this option because it's very simple and easy to set up, it processes data quickly, and it can undertake data transformations if required. It also supports sending data to many different locations. This includes the option to stream data to PowerBI for real-time reporting. We will be looking at historical reporting here, but this option is possible.
There is a pretty significant downside to using Stream Analytics for a home project: the cost. Running a single streaming unit full-time will run at around $80 a month. In a future post, we will look at ways we can reduce this cost. We'll take a look at scheduling batch running Stream Analytics jobs and using other tools like Azure Data Factory and Azure Functions to process the data instead.
Stream Analytics Job
Using the Azure portal, create a new Stream Analytics job. Again, it should be in the same region as the Event Hub and Cosmos DB and you should select the Cloud hosting option. We then need to define how many streaming units we need. Streaming units are instances of the job, more units equal more parallel processing of data. As we are running this full-time, the processing of the data will be almost immediate. Given this, and the low amount of data, we should only need a single streaming unit. When we look at batch processing data, we may want to use more units to speed up throughput.
After creating the Stream Analytics job, we need to define the query that will tell it how to process the data. Before we can do that, we need to define our inputs and outputs.
In the Steam Analytics job, select the Inputs option under Job Topology and then click Add Stream Input. In the drop-down, select Event Hub. The window that opens should populate with the details of existing Event Hubs. Provide an alias for this input and then select the Namespace, Event Hub and Policy Name that we created in the previous article. Leave the rest of the options as default, including an empty consumer group.
We'll follow a similar process to create the output. Under the same Job Topology menu, select Outputs. Click Add, and then pick Cosmos DB from the drop-down. Again, this will fill in the Cosmos DB account, Key, and Database for you. The value to use for the Collection Pattern Name field will depend on what you chose for the Cosmos DB size. If you selected the fixed 10GB size, then you can enter the collection name here. If you selected the unlimited option, then you need to enter a pattern to locate the appropriate collection partitions. Finally, the Document ID field allows you to specify a field in the records to use as the document ID. For simplicities' sake, I chose to leave this blank and use the default.
Now, we have our input and outputs we need to create the query to transfer data from one to the other. Again, under the Job Topology menu, select Query. This will open the visual query editor and it should look like this:
In the editor, select [YourOutputAlias] and replace it with the alias of the Cosmos DB output we created. Then, select [YourInputAlias] and replace it with the alias of the Event Hub. There is a list of the aliases on the left if you need them. We will leave the select query to
* for now so that we send all data to Cosmos. If you want to, you can refine this query to reduce the amount of data sent.
Now that we have our query defined, all we need to do is start the job. Click back on the Overview tab and then click on the Start option. This will take a few minutes to start. When it does, you should start to see events appear in the metrics charts.
At this point, we can go back to our Cosmos DB instance and check data is arriving. Select your Cosmos DB account and then click on the Data Explorer option. Expanding the database and collection, we can then click on the Documents option and see a list of all the documents in our collection.
Click on a document and we can view its content and confirm that we are seeing the data from SmartThings. We now have a process in place to take the data from the Event Hub and store it in Cosmos DB ready for reporting on. As long as the Steam Analytics job is running, it will transfer the data once it arrives at the Event Hub. If you wish to stop the charges for the stream analytics job you can select the option to stop the job. Stream Analytics jobs only cost money whilst they are running.
In Part 3 of this series, we are going to look at taking the data in Cosmos DB and building some reports in PowerBI.
Published at DZone with permission of Sam Cogan , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.