Big data comprises datasets that are massive, varied, and complex, and that can't be handled with traditional data-processing techniques. Big data can include both structured and unstructured data, and it is often stored in data lakes or data warehouses. As organizations grow, big data becomes increasingly crucial for gathering business insights and analytics. The Big Data Zone contains the resources you need for understanding data storage, data modeling, ELT, ETL, and more.
Data Warehousing
Data warehousing has become an absolute must in today's fast-paced, data-driven business landscape. As the demand for informed business decisions and analytics continues to skyrocket, data warehouses are gaining in popularity, especially as more and more businesses adopt cloud-based data warehouses. DZone's 2020 Data Warehousing Trend Report explores data warehouse adoption across industries, including key challenges, cloud storage, and common data tools such as data lakes, data virtualization, and ETL/ELT. In this report, readers will find original research, an exclusive interview with "the father of data warehousing," and additional resources with helpful tips, best practices, and more.
In this article, I will talk about the different types of data. As some of you might be aware, data can be broken down into different types, and one categorization that is very useful when you are building a machine learning pipeline is the following: structured data, semi-structured data, and unstructured data.

So What Is the Difference Between These Types of Data?

Structured Data

This term refers to data that is organized in a tabular format, or in something like a relational database, which organizes data in multiple tables that can then be joined together. Structured data is the easiest type of data to work with. If your data is stored in an SQL database, for example, most data scientists will find it pretty easy to access the database and extract insights from the data. That being said, not all databases are created equal: some databases are organized very poorly, while others are organized in a very easy-to-use manner. But all things being equal, structured data is easy to work with. If you look deep down into how machine learning pipelines are created, you always need structured data. Even if your data arrives in a semi-structured or unstructured format, what algorithms do internally is digest this data and transform it into a structured format.

Semi-Structured Data

This term refers to data that is not completely organized, but not disorganized either. Good examples of this are HTML, JSON, and XML. If you are not familiar with HTML or JSON, it is very easy to Google JSON and see an example of what a JSON file looks like. You will very quickly see that JSON follows some kind of structure, and the same is true for HTML. You see something that looks like code, but JSON and HTML are not fully structured; they are not organized into tables. One HTML or JSON file can look very different from another. This means that the developers of those files take certain freedoms, and this can make working with them somewhat challenging.

How Do Data Scientists Collect Data From Different Sources?

A data scientist will have to extract information from the semi-structured data and then restructure it into a tabular format. The challenge here is that there are usually many ways to do that, and this step can be quite time-consuming depending on the kind of data and how it is organized. In general, I'm not a huge fan of semi-structured data; as a data scientist, I prefer structured data. Like most data scientists, however, I recognize that semi-structured data is very useful in domains like social media. Social media is full of text data, image data, and video data, and formats like JSON let us store this data alongside meta information. You can store a video, say, and then store who created the video, the comments around it, and so on. This is easier to do using JSON than using SQL, for example, which is why semi-structured formats have become so popular in the last ten years. Semi-structured data quite often goes hand in hand with NoSQL databases and big data.
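To make this concrete, here is a minimal, hypothetical JSON document (all field names invented for illustration) storing a video alongside its meta information:

JSON
{
  "video_id": "v-1029",
  "title": "Unboxing a new phone",
  "uploaded_by": "jane_doe",
  "uploaded_at": "2023-01-15T10:24:00Z",
  "duration_seconds": 312,
  "comments": [
    { "user": "sam42", "text": "Great video!", "likes": 3 },
    { "user": "mika", "text": "Which model is this?", "likes": 1 }
  ]
}

Notice how the nested comments array would not fit naturally into a single flat SQL table; a data scientist would typically flatten it into one or more tables before analysis.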
Unstructured Data

This term refers to data where there is clearly no structure. For example, a data set that consists only of images, videos, or audio is an unstructured data set; the information in it does not follow a preexisting data model. This makes it quite challenging to work with, because someone might have to go through all the data and understand whether some of it is noisy or has other issues that would prevent a machine-learning pipeline from being built successfully. In the real world, you are usually going to encounter unstructured data in two situations. The first is some sort of open data set or machine learning competition, where someone curates an unstructured data set and you must use this data to predict, say, whether a photo contains humans or animals as best as you can. The other case where you might encounter unstructured data is when a data strategy was not designed and a company somehow ended up with unstructured data instead of semi-structured data. In most scenarios, we really expect to see this data alongside some meta information, like when a video showed up and who posted it, if we're talking about social media.

How Does a Data Scientist Digest This Type of Data?

I would expect that, in most cases, most of the data should be semi-structured. There are still cases where data might just be unstructured because there is not much we can do about it. For example, in customer support, maybe a data set consists of questions and responses, and you want to build a bot based on those questions and responses so it can automatically produce answers to different queries. In this case, there is probably not much you can do to structure the data; one way or another, you will end up with an unstructured data set. But unstructured data, even though it is challenging, can quite often still be successfully analyzed. In most cases, we use deep learning algorithms to digest this kind of data, and deep learning has been very successful with data like audio, natural language, and images. In this regard, I've worked in sports analytics, creating predictive models for football injuries and recovery after injuries; I've worked in financial predictions; and I've studied the application of deep learning in manufacturing. The results are very encouraging.

Conclusion

This was a summary of the different types of data that you can encounter in a business. To recap, we talked about structured data, semi-structured data, and unstructured data. Structured data is usually the low-hanging fruit for a business, and ideally, as a business, you want a data strategy that ensures most of your data is stored in a structured format. The reason is that this makes the life of data scientists much easier, and they will be able to spend more time on valuable tasks instead of just data wrangling. Semi-structured and unstructured data have grown enormously in the last 10-15 years; it's the era of big data, after all. But in most cases, you should try to turn unstructured data into semi-structured or structured data. And once again, semi-structured data is a tricky topic because of the kind of database you need to choose, how you should organize the different fields, and for what purpose.
Apache Avro and Apache ORC

Apache Avro and Apache ORC (Optimized Row Columnar) are top-level projects under the Apache Software Foundation. Fundamentally, they are data serialization formats with different strengths.

Apache Avro is an efficient row-based binary file format for serializing data during transfer or at rest. It uses a schema to define the data structure to be serialized, and the schema is collocated and stored as part of Avro's data file. As frequently needed in the big data space, Avro was designed to support data evolution by allowing new fields to be added to the data structure without requiring a complete recompilation of the code that uses it.

Apache ORC, on the other hand, is a column-based storage format (primarily for the Hadoop Distributed File System) that is optimized for storing and querying large datasets with heavy compression. ORC is designed to improve the performance of query engines like Hive and Pig by providing a more efficient way to store and query data. ORC is more performant than Avro in processing speed and storage footprint, especially for large datasets, as it keeps data in a more compact, columnar layout. ORC also supports predicate pushdown, which allows it to filter data at the storage layer, reducing the amount of data that needs to be read from disk and processed.

Primarily, Avro is a general-purpose data serialization format that is well-suited for transmitting data, whereas ORC is a specialized data storage format optimized for storing, querying, and processing large datasets. Therefore, ORC may be a better choice than Avro if you are working with large datasets and need to perform efficient queries and data processing. However, like most data infrastructures, your data lake might also comprise data predominantly stored in Avro format. This is often the case because Avro was released in 2009 and cemented its footing in big data from its early days, whereas ORC was launched much later, in 2013.

Challenges in Converting Data From Apache Avro to Apache ORC

Converting data from Avro to ORC is particularly challenging, and you might face issues like:

Schema conversion: Avro and ORC each have their own schema model. Schema conversion from Avro to ORC is time-consuming, more so if the schema is complex. Unfortunately, that is often the case with most big data datasets.
Data type differences: As with the schema models, Avro and ORC support distinct data types that do not map one to one. This typecasting further complicates the schema conversion.
Performance: Transforming data from Avro to ORC is often resource-intensive for large datasets. It can take excruciatingly long if not carefully crafted and heavily optimized.
Loss of data: Even if appropriately coded, data loss is possible during conversion, primarily because of failures in intermediate tasks or incompatibility between Avro and ORC fields.

Using Apache Gobblin To Convert Apache Avro Data to Apache ORC

To overcome the challenges of Avro to ORC data conversion, Apache Gobblin can be put to use. Apache Gobblin is an open-source data integration framework that simplifies the process of extracting, transforming, and loading large datasets into a target system. While we will discuss Gobblin's usage in the context of Avro to ORC data conversion, Gobblin has a host of other capabilities and provides a range of built-in connectors and converters that can be used to transfer data between different formats and systems, including Avro and ORC.
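Before diving into Gobblin, it helps to see what the schema side of the problem looks like. The following is a minimal, hypothetical Avro schema (record and field names invented for illustration); in Avro's JSON schema syntax, a new field can be added with a default value so that records written before the field existed can still be read with the new schema:

JSON
{
  "type": "record",
  "name": "SaleEvent",
  "namespace": "com.example.sales",
  "fields": [
    { "name": "order_id", "type": "string" },
    { "name": "amount", "type": "double" },
    { "name": "currency", "type": "string", "default": "USD" }
  ]
}

ORC stores its own, different schema representation in file metadata rather than an embedded JSON schema like this one, which is part of why the conversion tooling described next has to translate schemas explicitly.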
Apache Gobblin effectively addresses the challenges in converting Avro data to ORC. It converts the schema, maps data types, and has specialized capabilities like flattening a nested column if needed. Gobblin also supports fault tolerance, data validation, and data quality checks, thereby ensuring the integrity of the data being translated. Furthermore, it is highly configurable and customizable to address the specific requirements of any specialized scenario. We will cover the out-of-box configurations and capabilities provided by Gobblin in this article.

To use Gobblin for converting Avro data to ORC:

1. Start by registering the Avro data with Apache Hive (partitioned or snapshot).
2. Download the Gobblin binaries from here.
3. Modify the configuration detailed below this section to configure your Gobblin job.
4. Launch and run Gobblin in standalone mode using the documentation here.

For Step 3 above, refer to this example job and modify it per your requirements:

Plain Text
# Source Avro hive tables to convert
hive.dataset.whitelist=avro_db_name.*

# Configurations that instruct Gobblin to discover and convert the data (Do Not Change)
source.class=org.apache.gobblin.data.management.conversion.hive.source.HiveAvroToOrcSource
writer.builder.class=org.apache.gobblin.data.management.conversion.hive.writer.HiveQueryWriterBuilder
converter.classes=org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToFlattenedOrcConverter,org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToNestedOrcConverter
data.publisher.type=org.apache.gobblin.data.management.conversion.hive.publisher.HiveConvertPublisher
hive.dataset.finder.class=org.apache.gobblin.data.management.conversion.hive.dataset.ConvertibleHiveDatasetFinder

# Destination format and location
hive.conversion.avro.destinationFormats=flattenedOrc
hive.conversion.avro.flattenedOrc.destination.dataPath=/output_orc/

# Destination Hive table name (optionally with a postfix _orc)
hive.conversion.avro.flattenedOrc.destination.tableName=$TABLE_orc

# Destination Hive database name
hive.conversion.avro.flattenedOrc.destination.dbName=$DB

# Enable or disable schema evolution
hive.conversion.avro.flattenedOrc.evolution.enabled=true

# No host and port required. Hive starts an embedded hiveserver2 (Do Not Change)
hiveserver.connection.string=jdbc:hive2://

# Maximum lookback
hive.source.maximum.lookbackDays=3

## Gobblin standard properties ##
task.maxretries=1
taskexecutor.threadpool.size=75
workunit.retry.enabled=true

# Gobblin framework locations
mr.job.root.dir=/app/gobblin/working
state.store.dir=/app/gobblin/state_store
writer.staging.dir=/app/gobblin/writer_staging
writer.output.dir=/app/gobblin/writer_output

# Gobblin mode
launcher.type=LOCAL
classpath=lib/*

To understand the conversion process better, let us break this down:

hive.dataset.whitelist=avro_db_name.avro_table_name
The Avro data registered as a Hive dataset, specified in the database.table format. You can use regex here.

source.class=org.apache.gobblin.data.management.conversion.hive.source.HiveAvroToOrcSource
The internal Java class that Gobblin uses to initiate and run the conversion. Do not change.

writer.builder.class=org.apache.gobblin.data.management.conversion.hive.writer.HiveQueryWriterBuilder
The internal Java class that Gobblin uses to create the Hive DDL and DML queries that convert data from Avro to ORC. Do not change.
converter.classes=org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToFlattenedOrcConverter,org.apache.gobblin.data.management.conversion.hive.converter.HiveAvroToNestedOrcConverter
The internal Java classes that Gobblin uses to augment the Hive DDL and DML queries with schema conversion to nested or flattened form, depending on requirements. Do not change.

data.publisher.type=org.apache.gobblin.data.management.conversion.hive.publisher.HiveConvertPublisher
The internal Java class that Gobblin uses to publish data from the intermediate location to the final location after successful conversion. Do not change.

hive.dataset.finder.class=org.apache.gobblin.data.management.conversion.hive.dataset.ConvertibleHiveDatasetFinder
The internal Java class that Gobblin uses to find all partitions and the metadata about the Avro dataset needed to convert it to ORC. Do not change.

hive.conversion.avro.destinationFormats=flattenedOrc
Whether to flatten nested Avro records or convert them as-is when converting to ORC. The alternative config option is nestedOrc.

hive.conversion.avro.flattenedOrc.destination.dataPath=/output_orc/
Output location to write the ORC data to.

hive.conversion.avro.flattenedOrc.destination.tableName=$TABLE_orc
Output table name for the ORC data. The $TABLE macro carries forward the Avro table name; you can prefix or postfix it with any string, such as the _orc postfix in the sample config.

hive.conversion.avro.flattenedOrc.destination.dbName=$DB
Output database name for the ORC data. The $DB macro carries forward the Avro database name; you can prefix or postfix it with any string.

hive.conversion.avro.flattenedOrc.evolution.enabled=true
Whether or not to evolve the destination ORC schema if the source schema has evolved with new or updated column definitions.

hiveserver.connection.string=jdbc:hive2://
Gobblin uses an embedded Hive engine for running its internally generated queries. Do not change.

hive.source.maximum.lookbackDays=3
If Gobblin finds multiple partitions in the dataset, this config limits how many partitions from the past it picks up for conversion.

task.maxretries=1
taskexecutor.threadpool.size=75
workunit.retry.enabled=true
Standard Gobblin configs that govern the maximum number of retries, the thread pool size, and whether failed tasks should be retried, respectively.

mr.job.root.dir=/app/gobblin/working
state.store.dir=/app/gobblin/state_store
writer.staging.dir=/app/gobblin/writer_staging
writer.output.dir=/app/gobblin/writer_output
Standard Gobblin configs that Gobblin uses to store intermediate data, the state between recurring runs, and staged data.

launcher.type=LOCAL
classpath=lib/*
The Gobblin launcher type that governs how Gobblin runs. Options include LOCAL (as in the example), MR, Yarn, and Cluster modes.

As the explanation above indicates, Apache Gobblin handles schema interconversion and compatibility through Apache Hive, simplifying the conversion process via SerDes in Hive behind the scenes. Gobblin further has special provisions to evolve the schema on the destination across recurring executions of partitioned data, supports flattening of nested Avro data if desired, and provides retry and staging of data before publishing, for data integrity.
In just a few short years, machine learning (ML) has become an essential technology that companies deploy in almost every aspect of their business. Previously the preserve of giant institutions with deep pockets, ML is rapidly opening up: every kind of business can now leverage it to minimize repetitive manual processes, automate decision-making, and predict future trends. At almost every stage of any business task, ML is making processes smarter, more streamlined, and speedier. In recent years, technological advances have helped democratize access and drive the adoption of ML by reducing the time, skill level, and number of steps required to gain ML-driven predictions. So rapid has growth been that the global ML market is expected to expand from $21 billion in 2022 to $209 billion by 2029. Tools such as declarative ML and AutoML are helping enterprises access powerful, business-critical predictive analytics. Taking these approaches one step further, open-source in-database ML is a new technique that's gaining ground. It allows businesses to easily put questions to their data and rapidly get answers back using standard SQL queries.

What Is In-Database ML?

Building ML models has traditionally been a highly skilled, lengthy, resource-intensive endeavor. Typical time frames for ML initiatives are measured in months; it's not unusual for projects to take longer than six months, with considerable time devoted to extracting, cleaning up, and preparing data from the database. By contrast, open-source in-database ML brings analytics into the database, enabling businesses to achieve the kind of insights you'd expect from traditional, fully customized ML models, but with some important differences. In-database ML achieves those results much faster (days or weeks, not months) because the data never needs to leave the database. Another difference is that in-database modeling is done using regular, existing database skills like SQL, making it far more accessible for the wider IT team to handle. Although a relatively new field, it is now the fastest-growing segment in ML as measured by GitHub stars. In fact, there are now in-database ML integrations for all the major database vendors, ML frameworks, BI tools, and notebooks.

How Are Businesses Using In-Database ML?

With use cases in every domain of business, from HR to marketing to sales to production, predictions derived in-database are helping companies hone their customer experience, improve product personalization, optimize customer lifetime value, increase employee retention, evaluate risk more accurately, and raise workplace productivity. Take one example from the productivity software space: Rize, a smart time tracker that makes users more productive and efficient at work, used in-database ML to develop a powerful feature in response to user feedback in a matter of weeks. The resulting capabilities, driven by ML-generated insights, increased customer retention and conversion rates. They have also helped differentiate Rize in a highly competitive market, cementing its position as a truly intelligent time tracker.

Speed and Scale of In-Database ML Reshaping Industries

While many of these use cases benefit businesses no matter the sector or location, specific industry applications are emerging that deliver future insights in real time, cost a fraction of traditional ML algorithms to set up, and are starting to disrupt existing value chains within these markets.
The financial sector, an industry that was quick to operationalize traditional ML modeling, is now turning to in-database modeling for improved agility. Financial services and fintech companies are using in-database ML to detect fraud, aid loan recovery, improve credit scoring, and approve loans. As a result, they're able to react faster to market conditions, adapt the services they offer, and even open new revenue streams. For example, Domuso, a next-generation multi-family rent payment processing platform, saved $500,000 annually using in-database ML. Domuso trained and deployed an in-database ML model to accurately predict whether rental payments are likely to be returned due to insufficient funds. "With in-database ML we implemented advanced models faster and with less complexity," said Sameer Nayyar, the then-EVP of product and operations at Domuso. "It positively impacted our business. We saw a reduction of chargebacks by $95,000 over two months and a saving of $500,000 over the first year." Furthermore, as new use cases arise, Domuso is now able to create and implement new ML models in a matter of weeks, not months.

Sectors such as retail, FMCG, and food production have been quick to realize the value of the real-time predictions of in-database modeling, helping them respond to market conditions as they happen with "just-in-time" and location-specific offers. Managing stock, predicting demand for specific items, optimizing staffing levels, and forecasting future pricing are just a few examples of how retail and other businesses are turning to in-database ML algorithms to address their day-to-day challenges. Take the example of Journey Foods, a supply chain and food science software platform for food development and innovation, which used in-database ML to address the challenge of constantly shifting ingredient prices. The company wanted to predict food costs for its customers one, three, six, and 12 months out, drawing on its database of 130,000 food ingredients across 22,000 suppliers. With ingredients and suppliers always changing, it was concerned that the predictive analytics required to map these complex, "many-to-many" relationships would be time-intensive to set up and would need continual maintenance and re-training. Journey Foods turned to in-database ML to develop its cost prediction model, resulting in high-accuracy predictions for food ingredients. It has also resulted in significantly lower operating costs than the homegrown ML model it originally considered.

Increasing Business Agility and Innovation

There are many more industry-specific examples, but the common factors driving this rapidly growing open-source movement are speed and scale. In-database ML makes sophisticated predictive analytics available to any business with a database. For example, at Hacktoberfest, a recent open-source, in-database ML hackathon, the growing community of in-database ML programmers aptly demonstrated the potential for innovation. Over the course of the event, teams submitted over 20 new database handlers, including connections to Apache Impala and Solr, PlanetScale, and Teradata, plus over 10 new machine learning handlers, including PyCaret, Ray Serve, and Salesforce. It's still early days for in-database analytics. Just like the wider AI industry, the ML segment is no stranger to hype. However, with quick answers to complex problems no longer just theoretically possible, but achievable in the near term by businesses of all sizes and budgets, in-database ML deserves serious consideration.
Cutting the time it takes to build models and enabling those without a data science background to run projects drastically reduce the costs associated with predictive analytics. In-database ML offers businesses a viable alternative to traditional ML techniques for data-based decision-making: fully customizable predictive capabilities at speed and scale.
Memory allocation is one of those things developers don't think too much about. After all, modern computers, tablets, and servers have so much memory that it often seems like an infinite resource. And if there is any trouble, a memory allocation failure or error is so unlikely that the system normally defaults to program exit. Things are very different, however, when it comes to the Internet of Things (IoT). In these embedded connected devices, memory is a limited resource, and multiple programs fight over how much they can consume. The system is smaller, and so is the memory; it is therefore best viewed as a limited resource and used conservatively. It's in this context that memory allocation — also known as malloc — takes on great importance in our sector. Malloc is the process of reserving a portion of computer memory for the execution of a program or process. Getting it right, especially for devices connected to the internet, can make or break performance. So, let's take a look at how developers can build resilience into their malloc approach and what it means for connected device performance going forward.

Malloc and Connected Devices: A Short History

Let's start from the beginning. Traditionally, malloc has not been used often in embedded systems. This is because older devices didn't typically connect to the internet and, therefore, had vastly different memory demands. These older devices did, however, create a pool of resources upon system start from which to allocate resources. A resource could be a connection, and a system could be configured to allow n connections from a statically allocated pool. In a non-internet-connected system, the state of the system is normally somewhat restricted, and therefore the upper boundaries of memory allocation are easier to estimate. But this can change drastically once an embedded system connects to the internet. For example, a device can have multiple connections, and each can have a different memory requirement based on what the connection is used for. Here, the buffer memory required for a data stream on a connection depends on the connection's latency and the target throughput, using some probability function for packet losses or other network-dependent behavior. This is normally not a problem on modern high-end systems. But remember that developers face restricted memory resources in an embedded environment, so you cannot simply assume there is enough memory. This is why it is very important in IoT embedded development to think about how to create software that is resilient to memory allocation errors (otherwise known as malloc fails).

Modern Embedded Connected Systems and Malloc

In modern connected embedded systems, malloc is used more frequently, and many embedded systems and platforms have decent malloc implementations. The reason for the shift is that modern connected embedded systems do more tasks, and it is often not feasible to statically allocate the maximum required resources for all possible executions of the program. This shift to using malloc actively in modern connected embedded systems requires more thorough and systematic software testing to uncover errors. Usually, allocation errors are not tested systematically, since they are thought to happen with such small probability that testing is not worth the effort. And since allocation errors are so rare, any bugs can live for years before being found.
Mallocfail: How to Test for Errors

The good news is that developers can leverage software to test allocation errors. A novel approach is to run a program and inject allocation errors in all unique execution paths where allocation happens. This is made possible with the tool mallocfail. Mallocfail, as the name suggests, helps test malloc failures in a deterministic manner. Rather than testing randomly, the tool automatically enumerates the different control paths to malloc failure. It was inspired by this Stack Overflow answer. In a nutshell, the tool overrides malloc, calloc, and realloc with custom versions. Each time a custom allocator runs, the function uses libbacktrace to generate a text representation of the current call stack and then generates a sha256 hash of that text. The tool then checks whether the new hash has already been seen. If it has never been seen, the memory allocation fails. The hash is stored in memory and written to disk. If the hash — that particular call stack — has been seen before, then the normal libc version of the allocator is called as usual. Each time the program starts, the hashes that have already been seen are loaded from disk. This is something that I've used first-hand and found very useful. For example, at my company, we successfully tested mallocfail on our embedded edge software development kit. I'm pleased to report that the tool managed to identify a few problems in the SDK and its third-party libraries. As a result, the problems in our SDK are now fixed, and the third-party libraries have received patches.

Handling Malloc Fails

Handling allocation errors can be a bit tricky in a complex system. For example, consider the need to allocate data to handle an event. Different patterns exist to circumvent this problem. The most important is to allocate the necessary memory such that an error can be communicated back to the program in case of an allocation failure, and such that no code path fails silently. The ability to handle malloc fails is something that my team thinks about often. Sure, it's not much of a problem on other devices, but it can cause big issues on embedded devices connected to the internet. For this reason, our SDK includes functionality to limit certain resources, including connections, streams, stream buffers, and more. A system can thus be configured to cap the amount of memory used so that malloc errors are less likely to happen (and become just a resource allocation error instead). Often, a system running out of memory results in a system struggling to perform, so it really makes sense to lower the probability of allocation errors. This is often handled by limiting which functionality/tasks can occur simultaneously. As someone who's been working in this field for two decades, I believe developers should embrace malloc best practices when it comes to modern embedded connected devices. My advice is to deeply consider how your embedded device resolves malloc issues and investigate the most efficient way of using your memory. This means designing with dynamic memory allocation in mind and testing as much as possible. The performance and usability of your device count on it.
Prelude

Self-driving cars can change everything in terms of road safety and mobility. Self-driving vehicles are capable of sensing their immediate environment and can move safely with little or no human input. With self-driving cars, real-time alerting systems act as communication between vehicle and driver, and real-time signaling and alerting have many tangible and intangible benefits. XYZ's "Autopilot and Full Self-driving capability" has been getting better every year since its introduction, and XYZ's patent to "Automate Turn Signals" is an advanced step in enhancing road safety, not only for self-driving cars but also for drivers who ignore or forget to use turn signals. There is always a question: how independent should a vehicle be in making smart decisions? Self-driving cars should be just as intelligent as the driver in making the right decisions. Autopilot consists of eight external cameras, radar, 12 ultrasonic sensors, and a powerful onboard computer to guide a safe journey.

What is the role of tires? In real-time signaling and alerting, tires are the only part of the vehicle that touches the ground, and their movement is key when changing lanes and making turns. Automatic turn signals depend on a steering angle data source combined with ultrasonic sensor data. Only a small percentage of car manufacturers can provide these additional safety measures for automatic signaling and alerting, as it would be very complex and expensive for every manufacturer to come forward with this kind of development. Smart tires can play a key role here, providing additional safety cost-effectively. Smart tires provide not only automatic signaling features but also help detect misconstructed roads and avoid fatal toppling.

In this article, we will introduce a smart tire and explain how it will help tire manufacturing companies design a set of sustainable solutions to ward off various road mishaps, output an alert identifying an overtaking vehicle, and signal a quick estimate of all forms of rough terrain (wrongly designed angles of banking, misconstructed roads) to keep heavy vehicles especially from fatal toppling. In this pursuit, we will affix the inside of a tire with well-calibrated, cost-effective, non-cumbersome semi-micro mechanical tools such as a magnetometer (a compass) and a gyroscope, which work together to feed an edge computing instrument that outputs a quick alert to the driver. Smart tires give the tire industry huge insights into driving analytics and much more real-time analysis. TPMS (Tire Pressure Monitoring Systems) are definitely an additional safety measure for vehicles and drivers. However, is this the only information that can be made use of from tire data? There is abundant information available from tires that can be used to generate more safety for both vehicle and driver. Safety through smart tires is a cost-effective solution that can help the majority of drivers, instead of focusing only on self-driving cars, which cover only a small percentage of cars in use. Smart tires' real-time alerting and signaling can be more effective than depending on vehicle dynamics: as tires are the only thing in contact with the ground, tire parameters can play a key role in drive analytics and help avoid major road mishaps. Automatic turn signals that depend on a steering angle data source might be inaccurate when a vehicle takes a turn at a lean angle.
This kind of alerting can be more accurate if we source data from tires instead of steering angles. A well-calibrated digital compass built from an accelerometer, gyroscope, and magnetometer using sensor fusion methodologies can give accurate data. The three major factors that have to be monitored and controlled to prevent major accidents are overtaking vehicles, the angle of banking, and fatal toppling due to misconstructed roads. MEMS (microelectromechanical systems), or micromachines, are made up of components between 0.001 and 0.1 mm in size. They consist of a central unit that processes data and multiple components that interact with microsensors. Using a MEMS accelerometer, gyroscope, and magnetometer, we can create a digital compass application that sources data from these microsensors. In this article, we will see how to model a device that can be affixed inside a tire, with good calibration, resulting in a digital compass based on tire movement. This device, ANEW (Angular Navigation Early Warning), can help control the three major road mishaps above.

ANEW Architecture

First, let us look at the designed architecture and process flow of the ANEW device. Data recorded through the microsensors is processed with an algorithmic model to reduce sensor noise and stochastic errors due to nonlinearity; this results in an accurate digital compass that can provide real-time alerting. The two main segments of the entire architecture are the sensors and the optimal estimation algorithm, and these two play the key roles in this product's development. We will first see how to calibrate the multiple microsensors. We are using a GY-80 multi-sensor board, which comprises an accelerometer, gyroscope, and magnetometer, as described below.

MEMS Accelerometer

Motion sensors like MEMS accelerometers are characterized by small size, light weight, high sensitivity, and low cost. An accelerometer measures acceleration by measuring a change in capacitance. The primary component of the GY-80 multi-sensor board is the ADXL345 digital accelerometer. Accelerometer operation is based on Newton's second law of motion(1), which says that the acceleration (m/s²) of a body is directly proportional to, and in the same direction as, the net force (newtons) acting on the body, and inversely proportional to its mass (grams). This sensing technique is known for its high accuracy, stability, low power dissipation, and simple structure. Bandwidth for a capacitive accelerometer is only a few hundred hertz because of its physical geometry (a spring), and the air trapped inside the IC acts as a damper.

MEMS Gyroscope

A microelectromechanical systems gyroscope measures the angular rate by using the Coriolis effect, and comes with low cost, small device size, low power consumption, and high reliability, leading to increasing applications in various inertial fields. The Coriolis effect(2), or Coriolis force, arises when an object moving in a direction with a certain velocity is subjected to an external angular rate: a force occurs that causes a perpendicular displacement of the mass. MEMS gyroscope measurements are affected by errors, as they are prone to drift. In the next sections, we will see how this drift in values is handled through sensor fusion techniques. For the ANEW device, as we are using a GY-80 multi-sensor board, it comes with an L3G4200D gyroscope by default.
In general, values from the accelerometer and gyroscope are combined in order to remove the extra noise, or drift, in the values from the gyroscope; this works because these sensors come with complementary filters. However, when we use these sensors on the tires of a traveling automobile, where rotation rates are very high, there will be more noise. These default complementary filters will not be helpful in the ANEW device, which is intended for tires, and the gyroscope readings are critical for us to predict the fatal toppling of vehicles. By default, the values from the accelerometer and gyroscope are integrated into our mathematical model.

MEMS Magnetometer

The third sensor on our GY-80 multi-sensor board is the HMC5883L, a MEMS magnetometer that works on the Hall effect(3). Hall effect sensors are used to measure the magnitude of a magnetic field: their output voltage is directly proportional to the strength of the magnetic field passing through them. In general, a basic magnetometer working on the Hall effect is quite sufficient to develop a digital compass using a processing development environment, and this can help automate turn signals with proper calibration. But because we are also addressing the toppling of vehicles caused by the angle of banking or misconstructed roads, we use the GY-80 board's accelerometer and gyroscope as well.

We have now seen the first part of the ANEW architecture: the ANEW device and the types of sensors we will use to develop a digital compass. Instead of moving directly to the digital compass, we first have to address how to manage the additional drift that will come from the gyroscope. To handle these stochastic errors, we will use an "Unscented Kalman Filter" in our algorithmic model; the final values are then displayed on a digital compass, which enables automatic alerting.

Kalman Filter

The Kalman filter is an optimal estimation algorithm: it is used to extract information about what you cannot measure from what you can, and to determine the best values from noisy measurements. Why do we say noisy measurements, and what is drift? For example, for a cup of hot coffee at 45°C, the thermometer reads 44.6°C the first time and then 45.5°C the second time; we will not get the same number each time. State estimation algorithms provide a way to combine all the noisy values and give a better estimate. All the data we receive from the GY-80 multi-sensor board is sensor data, and the gyroscope data specifically is prone to much drift, so we need a good estimation algorithm to handle these noisy measurements. The technique here is to fuse data from multiple sensors to produce the correct estimate; in our case, it's data fusion between the accelerometer and gyroscope. Kalman filters are basically defined for linear systems. The linear process model defines the evolution of the state from time k-1 to time k(4); in standard form, x_k = F·x_(k-1) + B·u_(k-1) + w_(k-1), where F is the state transition matrix, B the control input matrix, u the control vector, and w the process noise. The working principle of the Kalman filter for linear systems is commonly illustrated with probability density functions, for example when finding the position of a moving car(5). To make the idea concrete, a minimal one-dimensional sketch follows.
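The following is a minimal, illustrative one-dimensional Kalman filter in Java (invented for this article; it is not the ANEW implementation) that estimates the coffee temperature from the noisy thermometer readings above:

Java
// Minimal 1-D Kalman filter sketch: estimates a value assumed constant
// (e.g., coffee temperature) from a sequence of noisy measurements.
public class SimpleKalman1D {

    private double estimate; // current state estimate x
    private double errCov;   // estimate error covariance P
    private final double q;  // process noise covariance Q
    private final double r;  // measurement noise covariance R

    public SimpleKalman1D(double initialEstimate, double initialCovariance,
                          double processNoise, double measurementNoise) {
        this.estimate = initialEstimate;
        this.errCov = initialCovariance;
        this.q = processNoise;
        this.r = measurementNoise;
    }

    // One predict-update cycle for a state assumed constant between steps.
    public double update(double measurement) {
        errCov += q;                                 // predict: uncertainty grows
        double gain = errCov / (errCov + r);         // Kalman gain
        estimate += gain * (measurement - estimate); // blend prediction and measurement
        errCov *= (1 - gain);                        // uncertainty shrinks after update
        return estimate;
    }

    public static void main(String[] args) {
        // Start from the first thermometer reading with moderate uncertainty.
        SimpleKalman1D kf = new SimpleKalman1D(44.6, 1.0, 0.0001, 0.25);
        for (double z : new double[] {45.5, 44.9, 45.2, 45.1}) {
            System.out.printf("measurement=%.1f estimate=%.3f%n", z, kf.update(z));
        }
    }
}

Each update blends the prediction with the new measurement according to the Kalman gain, which shrinks as the estimate becomes more certain; this is the behavior the probability density picture above describes.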
However, in our ANEW device model, we fuse the accelerometer and gyroscope in order to handle drift, and due to the non-linear relationship between angular velocity and orientation, it is unclear whether the magnitude of the angular velocity and its distribution across the three gyroscope axes may alter the effect of the considered noise types(6). Now, taking this non-linear system into consideration, our set of system equations changes: the state transition function and the measurement function become non-linear, roughly x_k = f(x_(k-1), u_(k-1)) + w_(k-1) and z_k = h(x_k) + v_k, where v is the measurement noise. For such nonlinear transformations, the plain Kalman filter is not useful, whereas the Extended Kalman Filter comes in handy, as it linearizes the nonlinear functions. When a system is nonlinear but can be well approximated by linearization, the Extended Kalman Filter is a good option. However, it has a few drawbacks; the major one is that the Extended Kalman Filter is not a good option if the system is highly nonlinear. For proper system dynamics with the ANEW device, neither the Kalman filter nor the Extended Kalman Filter will help much, because linearization becomes invalid: our system is highly nonlinear and cannot be approximated this way. The solution for approximating our highly nonlinear system is the Unscented Kalman Filter.

The Unscented Kalman Filter (UKF) approximates the probability distribution. In this model, the UKF selects a minimal set of sample points, or sigma points, such that their mean and covariance are exact. Each sigma point is propagated through the nonlinear system model; the mean and covariance of the nonlinearly transformed points are then calculated to compute a Gaussian approximation of the distribution, which is used to calculate the new state estimate. The standard process model for implementing the Unscented Kalman filter describes the difference equation and observation model with additive noise(7): x_(k+1) = f(x_k) + w_k and z_k = h(x_k) + v_k.

Let's see this implementation in the ANEW device. We are trying to fuse three sensor streams, from the accelerometer, magnetometer, and gyroscope, of which the gyroscope has high noise. The process flow is that of a Kalman-filter-based position estimation algorithm; here, z_k is the observed output with added noise v_k. The first step is to apply an unscented transformation scheme to the augmented state. In the next step, we select the sigma points. Then, in the model forecast step, each sigma point is propagated through the nonlinear process model(7). Next, in the data assimilation step, we combine the information obtained from the forecast step with the newly observed measurement z_k. As per the standard model, we need to obtain the square root matrix of the covariance each time to compute a new set of sigma points, which gives us the measurement update summary.

Digital Compass

Having seen the two main segments of the ANEW device process flow, the GY-80 multi-sensor board and the Unscented Kalman filter algorithm, and their working principles, let us see how to set up a digital compass. A digital compass, or electronic compass, is basically a combination of multiple MEMS sensors that provides orientation and measurements in multiple applications. As highlighted earlier, a magnetometer alone is sufficient to set up a digital compass, but to avoid noisy measurements we fuse the sensor readings from the gyroscope, accelerometer, and magnetometer for the position estimate. We fuse all the sensor values to produce the final output on the digital compass. All sensors are connected to an Arduino board, which communicates over the I2C (Inter-Integrated Circuit) protocol. In the processing development environment, Arduino wire libraries are used to set up and start serial communication. Unique device addresses and their internal register addresses can be taken from the data sheets, such as the ADXL345 datasheet(8).
The loop section is similar for all sensors: we compute the raw data for every axis. The sensitivity of the sensors is defined here as required (+250 dps to +2000 dps). The angular rate from the gyroscope is calculated and fed as an input parameter into our algorithmic model. The observer state in the model, where we integrate measurements from the accelerometer and magnetometer, is passed to the error update step. The final estimated values are then brought to a serial monitor and can be displayed on the digital compass. Based on these values, the digital compass values are set for automatic signaling and alerting. Finally, this data is captured again for drive analytics. The reasons highlighted for major road mishaps are available in Section 1. The movement of the tire is tracked well enough to alert automatically when vehicles overtake (an auto turn alert based on tire movement). Fatal toppling due to the angle of banking or misconstructed roads is alerted on by monitoring the angular rate of tire movement.

Conclusion

Self-driving cars are getting better every year. Automatic turn signals have opened the door to the idea that this technology is not only for self-driving cars; there is a larger scope for introducing new technologies. When we look at smart tires, as tires are the first thing that comes in contact with the road, there is plenty of untouched tire data that can be used for greater insights to help increase road safety and mobility. Major road mishaps that occur while overtaking vehicles, or due to the angle of banking and misconstructed roads, are well handled and predictable through the ANEW device and can be avoided to a great extent. Smart tires with ANEW device features can help drivers make risk-free decisions on misconstructed roads, avoid fatal toppling, and help build autonomous vehicle control systems into future tires. The core idea of this concept can also help tire-manufacturing companies design a set of sustainable solutions to ward off various road mishaps.
Geospatial data analysis is an area that can have a huge impact on agriculture, but it often doesn't get the attention it deserves. Geospatial data analysis is the process of analyzing a geographic area for various spatial features, which can include elevation, topography, vegetation, water bodies, and land use. It is used in many different fields, such as geography and geology, and can be done using a variety of methods, including aerial photography, satellite imagery, and LiDAR scanning. It is often used to identify areas at risk of natural disasters or other environmental hazards, and it can also be used to identify potential building sites or find locations where it may be profitable to drill for oil or natural gas. On this podcast, we discuss more with Lina from EOS Data Analytics. EOS Data Analytics provides Earth observation solutions for smart decision-making in 22+ industries, with a main focus on agriculture and forestry. The company combines data retrieved from satellite imagery with AI technologies and proprietary algorithms to analyze the state of crops within farms and of trees growing in forest stands, driving businesses to implement sustainable practices globally. EOSDA's mission is to preserve the planet by equipping decision-makers with the tools for tackling today's most urgent challenges. To find out more, visit the website.
Imagine that you are running an e-commerce store for electronic devices. Going into the holiday season, your business forecast predicts a significant increase in the sales of other brands compared to Apple devices. Every sale transaction goes through a Kafka broker, and you want to ensure there are no resource issues with the data flow. Out of the three Kafka partitions for handling sales data, you want to dedicate two to non-Apple devices and only one to Apple devices. Check out the below illustration that describes the requirements. The reason behind custom partitioning is often a business requirement. Even though Kafka has a default partitioning mechanism, a business requirement can create the need for a custom partitioning strategy. Of course, the example requirement is a little contrived, but that does not matter. All that matters is that you need to perform custom partitioning, or the business might suffer due to excessive load. Thankfully, Kafka provides a ready-to-use mechanism for implementing custom partitioning.

Creating a Custom Partitioner Class

We need a place to keep our custom partitioning logic. For this purpose, Kafka provides a Partitioner interface. We need to implement this interface and override the necessary methods with our custom logic. Check out the below code for the BrandPartitioner class:

Java
package com.progressivecoder.kafkaexample;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.InvalidRecordException;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

import java.util.List;
import java.util.Map;

public class BrandPartitioner implements Partitioner {

    private String brand;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
        int chosenPartition;
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();

        if ((keyBytes == null) || (!(key instanceof String))) {
            throw new InvalidRecordException("All messages should have a valid key");
        }

        if (((String) key).equalsIgnoreCase(brand)) {
            chosenPartition = 0;
        } else {
            chosenPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1) + 1;
        }

        System.out.println("For " + value + " partition chosen: " + chosenPartition);
        return chosenPartition;
    }

    @Override
    public void close() {
    }

    @Override
    public void configure(Map<String, ?> map) {
        brand = (String) map.get("partition.brand");
    }
}

To implement the Partitioner interface successfully, we need to implement three specific methods:

partition() - This is where we keep the actual partitioning logic.
configure() - This is where we receive any custom properties that might be needed to determine the correct partition. If there is no such property, you can leave the implementation blank. In our case, we receive a property named partition.brand and use it in the partitioning algorithm.
close() - This is where we can clean up any resources if needed. If there are no such resources, we can keep the implementation blank.

The partition() method is where the magic happens. The Kafka producer calls this method for every record, with input parameters such as the topic name, the key (if any), and the cluster object. The method returns the partition number as an integer value. For our business requirement, the partitioning logic is pretty straightforward. First, we extract information about the topic's partitions using the cluster instance.
This is how we find the number of partitions in the topic. Next, we throw an exception if the key is missing or is not a string; the key tells us whether the device is from Apple or another brand, and without it we cannot determine the partition. Moving on, we check if the key of the current record is 'apple'. If yes, we set the value of chosenPartition to 0. Basically, we are saying that for the brand value 'apple', always use partition 0. If the key value is not 'apple', we determine chosenPartition by hashing the key with murmur2 and taking the hash modulo (numPartitions - 1). That modulo value will be 0 or 1, so we add 1 to shift it, since partition 0 is already assigned to 'apple'. Ultimately, we get a value of 1 or 2 for the other brands. Finally, we return the chosenPartition value.

Configuring the Kafka Producer

The custom partitioning class is ready. However, we still need to tell the Kafka producer to use this particular class when determining the partition. Check the below code:

Java
package com.progressivecoder.kafkaexample;

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import java.util.Properties;

@SpringBootApplication
public class KafkaExampleApplication implements CommandLineRunner {

    public static void main(String[] args) {
        SpringApplication.run(KafkaExampleApplication.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        Properties kafkaProps = new Properties();
        kafkaProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        kafkaProps.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, BrandPartitioner.class);
        kafkaProps.put("partition.brand", "apple");

        Producer<String, String> producer = new KafkaProducer<>(kafkaProps);

        try {
            for (int i = 0; i <= 20; i++) {
                if (i < 3) {
                    ProducerRecord<String, String> apple = new ProducerRecord<String, String>("topic-1", "apple", "Selling Apple Device");
                    producer.send(apple, new DemoProducerCallback());
                } else {
                    ProducerRecord<String, String> samsung = new ProducerRecord<String, String>("topic-1", "others_" + i, "Selling Other Device");
                    producer.send(samsung, new DemoProducerCallback());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

class DemoProducerCallback implements Callback {

    @Override
    public void onCompletion(RecordMetadata recordMetadata, Exception e) {
        if (e != null) {
            e.printStackTrace();
        }
    }
}

There are several steps here:

First, we create a Properties object and add the necessary properties to it. Apart from the mandatory properties such as the server details and the key and value serializers, we add the PARTITIONER_CLASS_CONFIG and partition.brand properties.
The PARTITIONER_CLASS_CONFIG holds the name of the custom partitioner class that we already created.
partition.brand is not a Kafka producer configuration property; it is a custom property. We are using it to supply the name of the brand that needs to receive special treatment, so that we can avoid hard-coding it in the custom partitioner. This is good practice, as it makes our custom partitioner class independent of brand-specific logic.
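One assumption worth calling out before running the application: the example expects a topic named topic-1 with three partitions to already exist. With a local broker, it could be created with the standard Kafka CLI (paths may vary by installation):

Plain Text
bin/kafka-topics.sh --create \
  --topic topic-1 \
  --partitions 3 \
  --replication-factor 1 \
  --bootstrap-server localhost:9092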
In the remaining code, we are simply sending a bunch of messages to the Kafka broker. Some messages are for 'apple' devices, while the rest belong to other brands. If we run our application now, we will see the below response:

For Selling Apple Device partition chosen: 0
For Selling Apple Device partition chosen: 0
For Selling Apple Device partition chosen: 0
For Selling Apple Device partition chosen: 0
For Selling Other Device partition chosen: 2
For Selling Other Device partition chosen: 2
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 2
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 2
For Selling Other Device partition chosen: 2
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 2
For Selling Other Device partition chosen: 2
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 1
For Selling Other Device partition chosen: 2

The data belonging to the 'apple' devices goes only to partition 0, while non-Apple messages go to partition 1 or 2, depending on our partitioning logic.

Concluding Thoughts

Custom partitioning in Kafka is an important tool in high-load scenarios. It provides a way to optimize and distribute traffic efficiently. The great part about custom partitioning is the flexibility with which we can implement the logic for determining partitions. That was all for this post. We will be covering more aspects of Kafka in upcoming posts. If you are new to Kafka, I would suggest you go through our helicopter view of Kafka.
Even digital natives, companies that started their business in the cloud without legacy applications in their own data centers, need to modernize their cloud-native enterprise architecture to improve business processes, reduce costs, and provide real-time information to their downstream applications. This blog post explores the benefits of an open and flexible data streaming platform compared to proprietary message queue and data ingestion cloud services. A concrete example shows how DoorDash replaced cloud-native AWS SQS and Kinesis with Apache Kafka and Flink.

Message Queue and ETL vs. Data Streaming With Apache Kafka

A message queue like IBM MQ, RabbitMQ, or Amazon SQS enables sending and receiving of messages. This works great for point-to-point communication. However, additional tools like Apache NiFi, Amazon Kinesis Data Firehose, or other ETL tools are required for data integration and data processing. A data streaming platform like Apache Kafka provides many capabilities:

Producing and consuming messages for real-time messaging at any scale
Data integration to avoid spaghetti architectures with plenty of middleware components in the end-to-end pipeline
Stream processing to continuously process and correlate data from different systems
Distributed storage for true decoupling, backpressure handling, and replayability of events, all in a single platform

Data streaming with Kafka provides a data hub from which different downstream applications can easily access events (no matter what technology, API, or communication paradigm they use). If you mix a messaging infrastructure with separate ETL platforms, a spaghetti architecture is the consequence, whether you are on-premise using ETL or ESB tools, or in the cloud leveraging iPaaS platforms. The more platforms you (have to) combine in a single data pipeline, the higher the costs and operational complexity, and the harder it becomes to meet SLA guarantees. That is one of the top arguments for why Apache Kafka became the de facto standard for data integration.

Cloud-Native Is the New Black for Infrastructure and Applications

Most modern enterprise architectures leverage cloud-native infrastructure, applications, and SaaS, no matter whether the deployment happens in the public or private cloud. A cloud-native infrastructure provides:

Automation via DevOps and continuous delivery
Elastic scale with containers and orchestration tools like Kubernetes
Agile development with domain-driven design and decoupled microservices

In the public cloud, fully managed SaaS offerings are the preferred way to deploy infrastructure and applications. This includes services like Amazon SQS, Amazon Kinesis, and Confluent Cloud for fully managed Apache Kafka. A scarce team of experts can then focus on solving business problems and building innovative applications instead of operating the infrastructure. However, not everything can run as SaaS. Cost, security, and latency are the key arguments for why applications are instead deployed in a company's own cloud VPC, an on-premise data center, or at the edge. Operators and CRDs for Kubernetes, or Ansible scripts, are common solutions for deploying and operating your own cloud-native infrastructure when using a serverless cloud product is not possible or feasible.

DoorDash: From Multiple Pipelines to Data Streaming for Snowflake Integration

DoorDash is an American company that operates an online food ordering and food delivery platform. With a 50+% market share, it is the largest food delivery company in the United States.
Obviously, such a service requires scalable real-time pipelines to be successful. Otherwise, the business model does not work. For similar reasons, all the other mobility services, like Uber and Lyft in the US, Free Now in Europe, or Grab in Asia, leverage data streaming as the foundation of their data pipelines.

Challenges With Multiple Integration Pipelines Using SQS and Kinesis Instead of Apache Kafka

Events are generated from many DoorDash services and user devices. They need to be processed and transported to different destinations, including:

OLAP data warehouse for business analysis
Machine Learning (ML) platform to generate real-time features like recent average wait times for restaurants
Time series metric backend for monitoring and alerting so that teams can quickly identify issues in the latest mobile application releases

The integration pipelines and downstream consumers leverage different technologies, APIs, and communication paradigms (real-time, near real-time, batch). Each pipeline is built differently and can only process one kind of event. It involves multiple hops before the data finally gets into the data warehouse. It is cost-inefficient to build multiple pipelines that try to achieve similar purposes. DoorDash used cloud-native AWS messaging and streaming systems like Amazon SQS and Amazon Kinesis for data ingestion into the Snowflake data warehouse. Mixing different kinds of data transport and going through multiple messaging/queueing systems without carefully designed observability around it leads to difficulties in operations.

From Amazon SQS and Kinesis to Apache Kafka and Flink

These issues resulted in high data latency, significant cost, and operational overhead at DoorDash. Therefore, DoorDash moved to a cloud-native streaming platform powered by Apache Kafka and Apache Flink for continuous stream processing before ingesting data into Snowflake. The move to a data streaming platform provides many benefits to DoorDash:

Heterogeneous data sources and destinations, including REST APIs using the Confluent REST Proxy
Easy access from any downstream application (no matter which technology, API, or communication paradigm)
End-to-end data governance with schema enforcement and schema evolution via Confluent Schema Registry
Scalable, fault-tolerant, and easy to operate for a small team

REST/HTTP Is Complementary to Data Streaming With Kafka

Not all communication is real-time and streaming. HTTP/REST APIs are crucial for many integrations. DoorDash leverages the Confluent REST Proxy to produce and consume to/from Kafka via HTTP; a brief sketch follows at the end of this article. All the details about this cloud-native infrastructure optimization are in DoorDash's engineering blog post: "Building Scalable Real-Time Event Processing with Kafka and Flink."

Don't Underestimate Vendor Lock-In and Cost of Proprietary SaaS Offerings

One of the key reasons I see customers migrating away from proprietary serverless cloud services like Kinesis is cost. While it looks fine initially, it can get crazy when the data workloads scale. Very limited retention time and missing data integration capabilities are other reasons. The DoorDash example shows how even cloud-native greenfield projects require modernization of the enterprise architecture to simplify the pipelines and reduce costs. A side benefit is independence from a specific cloud provider. With open-source-powered engines like Kafka or Flink, the whole integration pipeline can be deployed anywhere.
Possible deployments include:

Cluster linking across countries or even continents (including filtering, anonymization, and other data-privacy-relevant processing before data sharing and replication)
Multiple cloud providers (e.g., if GCP is cheaper than AWS, or because Mainland China only provides Alibaba)
Low-latency workloads or zero-trust security environments at the edge (e.g., in a factory, stadium, or train)

How do you see the trade-offs between open-source frameworks like Kafka and Flink versus proprietary cloud services like AWS SQS or Kinesis? What are your decision criteria for making the right choice for your project? Did you already migrate services from one to the other?
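To make the REST/HTTP point from the DoorDash section concrete, here is a small sketch of producing an event to Kafka over HTTP through the Confluent REST Proxy from plain Java. The proxy address, topic name, and payload are hypothetical; the request envelope follows the REST Proxy v2 API.

Java

package com.example.kafka;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyProducer {

    public static void main(String[] args) throws Exception {
        // One JSON record wrapped in the REST Proxy v2 envelope (hypothetical payload)
        String body = "{\"records\":[{\"key\":\"order-42\",\"value\":{\"status\":\"DELIVERED\"}}]}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8082/topics/order-events"))
                .header("Content-Type", "application/vnd.kafka.json.v2+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The proxy responds with the partition and offset assigned to each record
        System.out.println(response.body());
    }
}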
Vertical farming-based IoT solutions are one of the key emerging trends in the agriculture industry today. These solutions not only provide accurate information on plant growth statistics but also make operations more sustainable. With these solutions, farmers can track energy usage and soil composition, verify air quality, temperature, and moisture levels, and perform operations more efficiently. In addition to processing data at rest, the real-time processing of sensor data, i.e., the ability to process data collected by various sensors as it arrives, forms a major building block of these kinds of solutions. However, traditional data processing systems fall behind in handling real-time data, unstructured data, and scaling on demand. This is why the usage of big data on the cloud in IoT-based solutions is on the rise: such solutions must query continuous data streams and detect conditions quickly, within a small interval of receiving the data. Big data platforms support data storage and the processing of structured as well as unstructured data, whereas cloud services provide cost-effective, scalable infrastructure. Two prominent big data processing techniques useful for these kinds of scenarios are:

Lambda Architecture
Kappa Architecture

I have already discussed Lambda architecture in detail here. Though Lambda has a fault-tolerant and scalable architecture, along with a batch layer that manages historical data in a distributed fashion, the major drawbacks of this model are:

Batch and streaming layers each require a different codebase that must be maintained and kept in sync.
It can result in coding overhead due to the involvement of comprehensive processing.
Batch, speed, and serving layers all need to be processed (at least) twice.
Data modeled with Lambda architecture is difficult to migrate or reorganize.

The Kappa architecture is an event-driven software architecture model that can handle real-time data at scale for transactional and analytical systems. The major advantage of this model is that both real-time and batch processing can be performed with a single technology stack (a minimal stream-processing sketch appears at the end of this article). The heart of the infrastructure is the streaming architecture. First, the event streaming platform's log stores incoming data. Later, the stream processing engine processes or ingests the real-time data continuously into the knowledge store or analytics database. An IoT-based reference model using cloud data processing stream analytics services is shown below:

IoT Clients or Edge Devices that can have multiple sensors installed and can send records with measurements reported by all sensors. The devices can:
Extend cloud intelligence to edge devices.
Run artificial intelligence at the edge.
Perform edge analytics.
Deploy IoT solutions from the cloud to the edge.

IoT Hubs: Services that enable reliable and secure bi-directional communication between IoT devices and cloud services. IoT hubs can:
Manage devices centrally from the cloud.
Operate with offline and intermittent connectivity.
Enable real-time decisions.
Connect new and legacy devices.
Reduce bandwidth costs.

Stream Processors that consume the data, integrate it with business processes, and place the data into storage.

ML/Time Series Insights: Allows predictive algorithms to be executed over historical telemetry data, enabling scenarios such as predictive maintenance.

Knowledge Store for storing historical data and earlier predictions.
This store can also be enriched with real-time market prices, pesticides, and other agriculture product-related information.

Web App: A user interface that displays plant and crop statistics along with other telemetry data, which helps farmers with quick decision-making.

With cloud-based data processing techniques, intelligence can be built into these models by collecting data from IoT things equipped with sensors that continuously act on the data and transmit it to data processing locations. Farmers can remotely track energy usage, soil composition, or water levels, and remote actions like ML-based disease prediction and artificial lighting control can be performed.
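As referenced above, here is a minimal Kappa-style sketch using Kafka Streams: a single topology serves both live processing and the reprocessing of history (by replaying the log from the beginning), so no separate batch codebase is needed. The topic names, value format, and temperature threshold are assumptions for illustration.

Java

package com.example.iot;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class SensorStreamApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-analytics");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Readings are keyed by sensor ID with the temperature as a plain string value
        KStream<String, String> readings = builder.stream("sensor-readings");

        readings
                // Flag zones whose temperature exceeds the (assumed) threshold
                .filter((sensorId, temperature) -> Double.parseDouble(temperature) > 30.0)
                // Alerts feed the knowledge store / analytics database downstream
                .to("temperature-alerts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close the topology cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}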
If you are wondering about the implementation of Enterprise IoT (EIoT) solutions, you understand that this field is developing rapidly all over the globe. According to McKinsey Digital, 127 devices connected to the internet for the first time every second in 2021, and in North America alone the worth of smart factories is expected to reach $500 billion in 2022. This trend is not surprising, as EIoT implementation helps to achieve a level of worker safety that was unattainable before, as well as to enable new business models and, therefore, new revenue streams. Using IoT devices, you will be able to get more information about manufacturing processes and employee and client behavior, as well as data that will help predict breakdowns and prevent equipment downtime. Sounds tempting? All these benefits are achievable, but they depend on the company's ability to correctly assess the risks of EIoT implementation.

Ericsson predicts that by the end of 2022 there will be around 30 billion IoT devices worldwide. IoT technology is widespread, and its global “invasion” is just a matter of time. Sure, every company needs to figure out whether EIoT implementation is going to be an essential point in its future business strategy. But once a respective decision is made, it is important not to put off the implementation process for “better times.” What companies should do instead is assess the risks and think clearly about the Industrial IoT solutions they are going to implement as early as possible. There are some risks to consider when assessing the implementation of an IoT ecosystem, like potential hardware malfunctions or economic costs. However, almost all of these risks can be mitigated with a transparent business strategy, goal setting, and an accurate cost-benefit analysis of implementing Industrial IoT. It is not cheap to transform an ordinary factory into a smart one, but it is much better for your business to avoid common mistakes from the very beginning rather than needing to correct them after implementation. Here we consider the most common risks to avoid while implementing an IoT ecosystem.

Unmet Business Expectations After Implementing Industrial IoT Solutions

According to a Cisco report, a couple of years ago only 26% of companies considered their IoT initiatives to be successful. This means that their expectations were not met, or the costs of the technologies did not pay off. The complexity of an EIoT project is also often underestimated, which leads to implementation stalling at the proof-of-concept (POC) stage. As a result, the business is likely to suffer.

How to Avoid Unnecessary Costs

It is critical here to understand that the transition to an IoT ecosystem is a complicated process that will affect business models, employees, and the architecture of the enterprise. Therefore, it is important to entrust qualified managers with creating a plan and overseeing the IIoT implementation. Managers will help to define the desired results before the beginning of the project, as well as to employ a qualified team of engineers and optional experts such as data scientists, computer scientists, or statisticians for the IoT implementation. EIoT initiatives should always solve specific business issues, which have to be determined before the execution of enterprise IoT solutions.
When planning the implementation of IoT at an enterprise, we advise you to focus on:

Improving the quality of the product
Increasing equipment utilization
Accelerating production cycles
Increasing the safety and security levels of the overall enterprise

Identify priority areas, current challenges, and goals to reach in the future. Be sure to define the KPIs you are planning to attain. The company's business goals should be realistic and attainable so that your expectations match the actual results of implementing the desired solution. Although the implementation of an IoT ecosystem is meant to minimize human participation, people will still have to master new working conditions. Therefore, the next crucial point is to discuss all changes that might happen to the enterprise, assign roles between employees and machines, and train the team to work in a new business and technological model.

Compatibility and Programming Issues

Industrial IoT solutions are often implemented at enterprises with a high proportion of machine manufacturing. For a well-funded company, it is often easier to implement an IoT ecosystem using modern equipment. But for some, it would be too expensive to replace legacy manufacturing systems. Therefore, companies often choose to adapt existing equipment and enhance it with sensors, smart devices, and gateways. However, when choosing to implement IoT technology in an enterprise equipped with old machines, the company has to ensure protocols are understood by all devices, connect disparate data stores, and solve all the compatibility issues. According to McKinsey, a company moving to EIoT has to solve compatibility issues for about 50% of all devices. If compatibility issues are not solved appropriately, the solution may not function as intended, or even at all. A wrong algorithm or incorrect integration can lead to hardware malfunctions and equipment damage, overheating, explosion, or system failure. These issues can cause other problems for companies, such as disruption of the production process, reputation risks, lawsuits, and, in the worst case, work-related injuries. The problem of incorrect programming or lack of compatibility concerns not only the IoT ecosystem but also other interconnected devices.

How to Deal With Device Incompatibility Issues

As in the case of business expectations, it is better to create a detailed analytical plan of EIoT implementation to minimize the risks of incompatibility and incorrect programming of the solution. Since any electromechanical device can become a part of the EIoT ecosystem, you have to evaluate whether it is more financially beneficial to integrate existing equipment into the IoT system or upgrade with new replacements. A solution for equipment should also be selected based on specific details. You should consider information about the following in your implementation plan:

The Existing Equipment

Evaluate the state of the equipment, the date it was acquired, the degree of its efficiency, operating costs, and existing automation processes in the enterprise. This will help you understand in which aspects the manufacturing process should be improved.

Data Collecting

Consider how your business collects and processes data, and how you would ideally like it to work. Decide whether it would be more efficient to develop a new solution or implement an existing one.

The Partners

Choose the best development partners and suppliers.
They can help you create software products that will enable a secure and interoperable IoT ecosystem in your specific environment. Remember that various hardware malfunctions can happen not only at the enterprise that has implemented the enterprise IoT solutions, but elsewhere as well. So, you should pay extra attention to the quality of Industrial IoT solutions implemented from external partners, and don’t experiment with unreliable ones. Qualified EIoT managers can help pick out the most suitable solutions for the enterprise.

Practical Issues

Implementation of an IoT ecosystem might require additional machines, devices, controllers, transmitters, and computers. So, you will have to manage more units of equipment after EIoT implementation. To avoid problems with servicing new devices, we advise you to hire technical personnel who are experienced in working with IoT devices. Existing staff should also be specifically trained to work with the new equipment. This strategy will minimize problems with the maintenance of IoT devices, as well as with controlling the existing equipment.

Data Security Risks

The risk of data loss is one of the main reasons, along with cost, for companies to postpone the transition to an IoT ecosystem. This is natural, because simply installing a VPN is not enough for IIoT protection. In theory, by gaining access to one device in the EIoT network, an intruder can disable the functionality of the whole enterprise. For instance, an unhappy former employee might reprogram devices or steal confidential data. Therefore, security should be considered at every IoT level, from the individual device to edge computing and the cloud. To be sure of the reliability of the whole system, make sure that all the Industrial IoT solutions meet the highest safety standards. It would be better to contact EIoT security specialists for an assessment. Unfortunately, the developers of enterprise IoT solutions do not always make safety standards accessible, so you have to conduct a security risk assessment. When choosing a solution, consider the level of security of these access points:

Sensors and actuators that can be hacked on site
Communication systems enabling data exchange
The computer storage platform
The computer software that interprets the data

So, what can you do to prevent attackers from gaining access to your devices? First, we advise establishing secure connections to cloud services and secure remote access to on-premises resources, as well as encrypting the data. For the most confidential data, you can minimize the risks by using a local network. The internal network ensures that people from outside of the company do not gain access to the data. Don’t forget to provide modern IIoT devices and systems with unique identities and credentials, and apply authentication and access control mechanisms. Physical security measures may also be required to ensure that only authorized employees have access to IoT-equipped machines and the software used to program them. In addition, in order not to lose data, the company should provide the IoT ecosystem with a standby source of power, such as an uninterruptible power supply, built-in batteries, or solar panels. Moreover, in the case of network loss, the devices will stop transmitting data, and the whole manufacturing process will stop, too.

Risk of Losing Control: Can Humans Be Replaced by AI Using Industrial IoT Solutions?
The wide adoption of AI is likely to lead to the loss of human control over the manufacturing process. When implementing industrial IoT solutions based on AI, you should establish boundaries of responsibility between human and artificial intelligence. Technologies that can be used in enterprise IoT solutions, such as machine learning, deep learning, computer vision, natural language processing, and Big Data analytics, are all based on artificial intelligence. The problem of losing control is becoming more urgent as people entrust AI with more complex decisions. So, the question arises: which important management tasks can be trusted to AI without harming the decision-making process in the company? Regarding business strategy creation and building enterprise architecture, the interaction of a person with a machine is more efficient than the interaction of machines with each other. This is because machines are flawless when working with complete data on regular operations, but a human can make responsible decisions in unusual, critical situations. Also, where creativity is needed, only people can provide the solution with high-quality ideas. So, while considering the boundaries of AI functionality in your future IoT enterprise, you should estimate:

How mechanistic the operations in your manufacturing process are
What new operations you want to add to manufacturing after EIoT implementation
Whether the delivered IoT system will be capable of self-learning

In this regard, another question arises: what are the chances of machines replacing humans altogether? This is unlikely to happen in the near future. Can a critical error cause a person to completely lose control of the manufacturing process? It can happen, and the consequences are hard to predict. So, to keep ultimate human control as well as make the business more productive, you should balance machine functionality with human intelligence.

Conclusions

At least in the near future, machines will not replace humans in manufacturing, although they will take on many functions. It doesn't matter whether you want to transfer the entire factory to EIoT or make the manufacturing process partially smart. In any case, you should approach IoT ecosystem implementation strategically:

Analyze the assets of the enterprise to know their EIoT potential
Set business goals that can be reached through EIoT solutions
Employ highly qualified staff to deploy the project
Check the development partners and solutions before transferring to the IoT ecosystem

The investment in EIoT may not pay off instantly, but by managing the EIoT implementation process competently, you have every chance of reaping substantial benefits that were not available to the company before EIoT implementation.