Big Data Resources

Top 10 IoT Trends That Will Impact Various Industries in the Coming Years

IoT will have a significant impact on many different industries in the coming few years.

October 28, 2022

by Calvin Austins

· 9,739 Views · 3 Likes

Data Science vs. Software Engineering: A Fine Differentiation

The differences between data science and software engineering with career opportunities, salaries, and qualifications, along with a detailed comparison table.

October 27, 2022

by Richa Sareen

· 7,537 Views · 2 Likes

The Difference Between Predictive Analytics and Data Science

Predictive data helps with forecasts, but data science is unique. This post explains the difference between predictive analytics and data science.

October 26, 2022

by Smith Williams

· 7,281 Views · 1 Like

7 Alternatives to Using Segment

Review the pros and cons of Segment, including its core features Warehouses and Personas, and check out the best alternative CDP platforms.

Updated October 25, 2022

by Luke Kline

· 5,306 Views · 2 Likes

Why Data Analytics Is Central to Digital Adoption Optimization

When data analysts successfully bring digital adoption into data analysis, it has a major impact on an organization's transformation.

October 25, 2022

by Alon Gehlber

CORE

· 6,047 Views · 1 Like

What Are the Key Applications and Benefits of IoT Fleet Management?

IoT in fleet management can genuinely help bring more agility to your distribution channels.

October 25, 2022

by Kamal R

· 6,403 Views · 2 Likes

Here Is Why You Need a Message Broker

Hopefully, by the end of this article, you will be able to understand the importance of using a message-driven architecture for building your next project.

Updated October 24, 2022

by Yaniv Ben Hemo

· 7,658 Views · 4 Likes

iOS Meets IoT: Five Steps to Building Connected Device Apps for Apple

Apple’s flagship device runs a unique operating system, iOS, that should not be ignored in app rollouts for the Internet of Things.

October 21, 2022

by Carsten Rhod Gregersen

· 7,039 Views · 2 Likes

Data Streaming for Data Ingestion Into the Data Warehouse and Data Lake

Data Streaming with Apache Kafka for Data Ingestion into Data Warehouse, Data Lake and Lakehouse.

October 20, 2022

by Kai Wähner

CORE

· 6,632 Views · 3 Likes

Data Modeling in Cassandra and Astra DB

What does it take to build an efficient and sound data model for Apache Cassandra and DataStax Astra DB? Where would one start? Are there any data modeling rules to follow?

October 19, 2022

by Artem Chebotko

· 6,159 Views · 2 Likes

Can You Beat the AI? How to Quickly Deploy TinyML on MCUs Using TensorFlow Lite Micro

Do you want to know how to use TinyML on the microcontrollers you already work with? In this article, we provide an introduction to ML on microcontrollers.

October 19, 2022

by Nikolas Rieder

· 7,883 Views · 2 Likes

O11y Guide: Keeping Your Cloud-Native Observability Options Open

Take a look at the architecture-level choices being made and share the open standards with the open-source landscape.

October 19, 2022

by Eric D. Schabell

· 4,739 Views · 3 Likes

The Heart of the Data Mesh Beats Real-Time With Apache Kafka

Building a decentralized real-time data mesh with data streaming using Apache Kafka for truly decoupled, reliable, scalable microservices.

October 19, 2022

by Kai Wähner

CORE

· 7,628 Views · 3 Likes

Use Apache Kafka SASL OAUTHBEARER With Python

Learn how to use the Confluent Python client with SASL/OAUTHBEARER security protocol to produce and consume messages to topics in Apache Kafka.

Updated October 18, 2022

by Abhishek Koserwal

· 12,078 Views · 4 Likes

Case Studies: Cloud-Native Data Streaming for Data Warehouse Modernization

Let's explore a few case studies for cloud-native data streaming and data warehouse modernization.

October 15, 2022

by Kai Wähner

CORE

· 7,488 Views · 3 Likes

How to Read Graph Database Benchmark (Part II)

This is the second part of the How to Read Graph Database Benchmark series and is dedicated to graph query (algorithm, analytics) results validation.

October 13, 2022

by Ricky Sun

· 5,785 Views · 1 Like

Decision Guidance for Serverless Adoption

This article offers decision guidance on adopting serverless for various architectures and workloads. It also shares a list of antipatterns.

October 12, 2022

by Abhay Patra

· 6,911 Views · 5 Likes

AIOps: What, Why, and How?

A Guide To Everything About AIOps: Use cases, benefits, challenges, core elements, AIOps architecture, and future.

Updated October 11, 2022

by Mahipal Nehra

· 8,549 Views · 3 Likes

Handling Big Data with HBase Part 5: Data Modeling (or, Life Without SQL)

This is the fifth of a series of blogs introducing Apache HBase. In the fourth part, we saw the basics of using the Java API to interact with HBase to create tables, retrieve data by row key, and do table scans. This part will discuss how to design schemas in HBase.

HBase has nothing similar to a rich query capability like SQL from relational databases. Instead, it forgoes this capability and others like relationships, joins, etc. to focus on providing scalability with good performance and fault tolerance. So when working with HBase you need to design the row keys and table structure in terms of rows and column families to match the data access patterns of your application. This is the complete opposite of what you do with relational databases, where you start out with a normalized database schema and separate tables, and then use SQL to perform joins to combine data in the ways you need. With HBase you design your tables specific to how they will be accessed by applications, so you need to think much more up-front about how data is accessed. You are much closer to the bare metal with HBase than with relational databases, which abstract implementation details and storage mechanisms. However, for applications needing to store massive amounts of data and have inherent scalability, performance characteristics and tolerance to server failures, the potential benefits can far outweigh the costs.

In the last part on the Java API, I mentioned that when scanning data in HBase, the row key is critical since it is the primary means to restrict the rows scanned; there is nothing like a rich query like SQL as in relational databases. Typically you create a scan using start and stop row keys and optionally add filters to further restrict the rows and columns data returned. In order to have some flexibility when scanning, the row key should be designed to contain the information you need to find specific subsets of data.
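The scan pattern described above can be tried out without a cluster. The snippet below is a minimal simulation in plain Python, not the HBase API: it assumes the table is a list of rows kept sorted by row key, and `TABLE`, `scan`, and the sample rows are hypothetical names made up for illustration.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for an HBase table: rows sorted by row key.
# Keys are posting dates; values map "family:qualifier" to a cell value.
TABLE = sorted(
    [
        ("20130701", {"info:title": "Intro", "info:author": "alice"}),
        ("20130704", {"info:title": "Schemas", "info:author": "bob"}),
        ("20130707", {"info:title": "Row keys", "info:author": "alice"}),
        ("20130715", {"info:title": "Filters", "info:author": "carol"}),
    ],
    key=lambda row: row[0],
)


def scan(table, start_row, stop_row, row_filter=None):
    """Return rows with start_row <= key < stop_row, optionally filtered."""
    keys = [key for key, _ in table]
    lo = bisect_left(keys, start_row)  # start row is inclusive
    hi = bisect_left(keys, stop_row)   # stop row is exclusive
    rows = table[lo:hi]
    if row_filter is not None:
        rows = [(key, cells) for key, cells in rows if row_filter(key, cells)]
    return rows


# The row key range decides which rows are visited at all...
first_week = scan(TABLE, "20130701", "20130708")
# ...and the optional filter narrows the result further.
by_alice = scan(
    TABLE,
    "20130701",
    "20130708",
    lambda key, cells: cells["info:author"] == "alice",
)
```

The point of the sketch is the division of labor: the key range bounds the work, while the filter only trims what the range already selected, which is why the information you query by most often belongs in the row key.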
In the blog and people examples we've seen so far, the row keys were designed to allow scanning via the most common data access patterns. For the blogs, the row keys were simply the posting date. This would permit scans in ascending order of blog entries, which is probably not the most common way to view blogs; you'd rather see the most recent blogs first. So a better row key design would be to use a reverse order timestamp, which you can get using the formula (Long.MAX_VALUE - timestamp), so scans return the most recent blog posts first. This makes it easy to scan specific time ranges, for example to show all blogs in the past week or month, which is a typical way to navigate blog entries in web applications.

For the people table examples, we used a composite row key composed of last name, first name, middle initial, and a (unique) person identifier to distinguish people with the exact same name, separated by dashes. For example, Brian M. Smith with identifier 12345 would have row key smith-brian-m-12345. Scans for the people table can then be composed using start and end rows designed to retrieve people with specific last names, last names starting with specific letter combinations, or people with the same last name and first name initial. For example, if you wanted to find people whose first name begins with B and last name is Smith, you could use the start row key smith-b and stop row key smith-c (the start row key is inclusive while the stop row key is exclusive, so the stop key smith-c ensures all Smiths with first name starting with the letter "B" are included).

You can see that HBase supports the notion of partial keys, meaning you do not need to know the exact key, to provide more flexibility creating appropriate scans. You can combine partial key scans with filters to retrieve only the specific data needed, thus optimizing data retrieval for the data access patterns specific to your application.
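Both row key ideas above can be checked with a short sketch. It is plain Python rather than HBase client code, and it makes two assumptions not stated in the text: reverse timestamps are zero-padded strings (so string order matches numeric order), and keys are ASCII. The helper names `reverse_timestamp_key`, `prefix_stop_key`, and `scan` are hypothetical.

```python
LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def reverse_timestamp_key(timestamp_millis):
    """Build a key from (Long.MAX_VALUE - timestamp) so newer sorts first."""
    # Assumption: zero-pad to 19 digits so string order equals numeric order.
    return f"{LONG_MAX - timestamp_millis:019d}"


def prefix_stop_key(prefix):
    """Return the exclusive stop key for a partial-key scan on prefix."""
    # Assumption: non-empty ASCII prefix; bump the last character by one,
    # e.g. "smith-b" becomes "smith-c".
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def scan(sorted_keys, start_row, stop_row):
    """Keys with start_row <= key < stop_row (start inclusive, stop exclusive)."""
    return [key for key in sorted_keys if start_row <= key < stop_row]


# A later timestamp yields a smaller key, so it is returned earlier in a scan.
older = reverse_timestamp_key(1_372_946_710_000)
newer = reverse_timestamp_key(1_373_214_645_000)

# Hypothetical people row keys: last-first-middle-id, kept in sorted order.
people = sorted([
    "jones-bob-a-77",
    "smith-alice-q-31",
    "smith-bob-t-9",
    "smith-brian-m-12345",
    "smith-carl-j-8",
    "smithers-ben-x-4",
])
smiths_b = scan(people, "smith-b", prefix_stop_key("smith-b"))
```

Note that `smithers-ben-x-4` is not picked up by the scan even though the first name starts with "b": the dash separator sorts before letters, so the `smith-` prefix keeps other last names out of the range.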
So far the examples have involved only single tables containing one type of information and no related information. HBase does not have foreign key relationships like in relational databases, but because it supports rows having up to millions of columns, one way to design tables in HBase is to encapsulate related information in the same row - a "wide" table design. It is called a "wide" design since you are storing all information related to a row together in as many columns as there are data items.

In our blog example, you might want to store comments for each blog. The "wide" way to design this would be to include a column family named comments and then add columns to the comments family where the qualifiers are the comment timestamp; the comment columns would look like comments:20130704142510 and comments:20130707163045. Even better, when HBase retrieves columns it returns them in sorted order, just like row keys. So in order to display a blog entry and its comments, you can retrieve all the data from one row by asking for the content, info, and comments column families. You could also add a filter to retrieve only a specific number of comments, adding pagination to them.

The people table column families could also be redesigned to store contact information such as separate addresses, phone numbers, and email addresses in column families, allowing all of a person's information to be stored in one row. This kind of design can work well if the number of columns is relatively modest, as blog comments and a person's contact information would be. If instead you are modeling something like an email inbox, financial transactions, or massive amounts of automatically collected sensor data, you might choose instead to spread a user's emails, transactions, or sensor readings across multiple rows (a "tall" design) and design the row keys to allow efficient scanning and pagination.
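To make the "wide" layout concrete, here is a minimal sketch of a single blog row with a comments column family whose qualifiers are timestamps. It is an in-memory model in plain Python, not the HBase API; `blog_row`, `get_columns`, and the sample comments are hypothetical, and sorting the qualifiers stands in for the sorted order in which HBase returns columns.

```python
# Hypothetical stand-in for one "wide" row: family -> {qualifier: value}.
blog_row = {
    "content": {"post": "Designing HBase schemas..."},
    "info": {"title": "Data Modeling", "author": "scott"},
    "comments": {
        "20130707163045": "Great series!",
        "20130704142510": "Very helpful, thanks.",
        "20130710091500": "What about secondary indexes?",
    },
}


def get_columns(row, family, limit=None, after=None):
    """Return (column, value) pairs of one family in sorted qualifier order."""
    qualifiers = sorted(row[family])
    if after is not None:
        # Resume after the last qualifier seen on the previous page.
        qualifiers = [q for q in qualifiers if q > after]
    if limit is not None:
        qualifiers = qualifiers[:limit]
    return [(f"{family}:{q}", row[family][q]) for q in qualifiers]


# Page through the comments two at a time.
page_one = get_columns(blog_row, "comments", limit=2)
last_seen = page_one[-1][0].split(":", 1)[1]
page_two = get_columns(blog_row, "comments", limit=2, after=last_seen)
```

Because the qualifiers are timestamps, sorted column order is also chronological order, so pagination only needs the last qualifier already shown.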
For an inbox the row key might look like - which would permit easily scanning and paginating a user's inbox, while for financial transactions the row key might be -. This kind of design can be called "tall" since you are spreading information about the same thing (e.g. readings from the same sensor, transactions in an account) across multiple rows, and is something to consider if there will be an ever-expanding amount of information, as would be the case in a scenario involving data collection from a huge network of sensors.

Designing row keys and table structures in HBase is a key part of working with HBase, and will continue to be given the fundamental architecture of HBase. There are other things you can do to add alternative schemes for data access within HBase. For example, you could implement full-text searching via Apache Lucene either within rows or external to HBase (search Google for HBASE-3529). You can also create (and maintain) secondary indexes to permit alternate row key schemes for tables; for example in our people table the composite row key consists of the name and a unique identifier. But if we desire to access people by their birth date, telephone area code, email address, or any other number of ways, we could add secondary indexes to enable that form of interaction. Note, however, that adding secondary indexes is not something to be taken lightly; every time you write to the "main" table (e.g. people) you will need to also update all the secondary indexes! (Yes, this is something that relational databases do very well, but remember that HBase is designed to accommodate a lot more data than traditional RDBMSs were.)

Conclusion to Part 5

In this part of the series, we got an introduction to schema design in HBase (without relations or SQL).
Even though HBase is missing some of the features found in traditional RDBMS systems such as foreign keys and referential integrity, multi-row transactions, multiple indexes, and so on, many applications that need inherent HBase benefits like scaling can benefit from using HBase. As with anything complex, there are tradeoffs to be made. In the case of HBase, you are giving up some richness in schema design and query flexibility, but you gain the ability to scale to massive amounts of data by (more or less) simply adding additional servers to your cluster. In the next and last part of this series, we'll wrap up and mention a few (of the many) things we didn't cover in these introductory blogs.

References

HBase web site, http://hbase.apache.org/
HBase wiki, http://wiki.apache.org/hadoop/Hbase
HBase Reference Guide, http://hbase.apache.org/book/book.html
HBase: The Definitive Guide, http://bit.ly/hbase-definitive-guide
Google Bigtable Paper, http://labs.google.com/papers/bigtable.html
Hadoop web site, http://hadoop.apache.org/
Hadoop: The Definitive Guide, http://bit.ly/hadoop-definitive-guide
Fallacies of Distributed Computing, http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
HBase lightning talk slides, http://www.slideshare.net/scottleber/hbase-lightningtalk
Sample code, https://github.com/sleberknight/basic-hbase-examples

Updated October 11, 2022

by Scott Leberknight

· 19,766 Views · 3 Likes

Geek Reading for the Weekend

I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news and programming tools and techniques.

Why You Make Less Money (job tips for geeks)
Nate Silver Gets Real About Big Data (ReadWrite)
Java StringBuilder myth debunked (Java Code Geeks)
Dew Drop – March 29, 2013 (#1,517) (Alvin Ashcraft's Morning Dew)
Generation Mooch? Why 20-somethings have a hard time paying for content (GigaOM)
Double Shot #1096 (A Fresh Cup)
Connecting Talking with Doing (Conversation Agent)
Games Galore: Building Atari with CreateJS (noupe)
Putting People in Boxes (Architects Zone – Architectural Design Patterns & Best Practices)
Do Code Improvements Add Value? (Architects Zone – Architectural Design Patterns & Best Practices)
Cassandra 1.1 – Reading and Writing from SSTable Perspective (Architects Zone – Architectural Design Patterns & Best Practices)
Couchbase NoSQL at Tunewiki: A Billion Documents and Counting (Architects Zone – Architectural Design Patterns & Best Practices)
The Daily Six Pack: March 29, 2013 (Dirk Strauss)
Using Kanban for Scrum Backlog Grooming (Agile Zone – Software Methodologies for Development Managers)
Humming (xkcd.com)
Amazon Acquires Social Reading Site Goodreads, Which Gives The Company A Social Advantage Over Apple (TechCrunch)

I hope you enjoy today’s items, and please participate in the discussions on those sites.

Updated October 11, 2022

by Robert Diana

· 8,494 Views · 1 Like

The Latest Big Data Topics