The Latest Big Data Topics

Data Streaming for Data Ingestion Into the Data Warehouse and Data Lake
Data streaming with Apache Kafka for data ingestion into the data warehouse, data lake, and lakehouse.
October 20, 2022
by Kai Wähner CORE
· 5,947 Views · 2 Likes
Data Modeling in Cassandra and Astra DB
What does it take to build an efficient and sound data model for Apache Cassandra and DataStax Astra DB? Where would one start? Are there any data modeling rules to follow?
October 19, 2022
by Artem Chebotko
· 5,529 Views · 2 Likes
Can You Beat the AI? How to Quickly Deploy TinyML on MCUs Using TensorFlow Lite Micro
Do you want to know how to use TinyML on the microcontrollers you already work with? In this article, we provide an introduction to ML on microcontrollers.
October 19, 2022
by Nikolas Rieder
· 6,429 Views · 2 Likes
O11y Guide: Keeping Your Cloud-Native Observability Options Open
Take a look at the architecture-level choices being made and the open standards shared across the open-source landscape.
October 19, 2022
by Eric D. Schabell CORE
· 3,515 Views · 2 Likes
The Heart of the Data Mesh Beats Real-Time With Apache Kafka
Building a decentralized real-time data mesh with data streaming using Apache Kafka for truly decoupled, reliable, scalable microservices.
October 19, 2022
by Kai Wähner CORE
· 6,555 Views · 3 Likes
Use Apache Kafka SASL OAUTHBEARER With Python
Learn how to use the Confluent Python client with SASL/OAUTHBEARER security protocol to produce and consume messages to topics in Apache Kafka.
October 18, 2022
by Abhishek Koserwal
· 4,222 Views · 4 Likes
Case Studies: Cloud-Native Data Streaming for Data Warehouse Modernization
Let's explore a few case studies for cloud-native data streaming and data warehouse modernization.
October 15, 2022
by Kai Wähner CORE
· 6,673 Views · 3 Likes
How to Read Graph Database Benchmark (Part II)
This is the second part of the How to Read Graph Database Benchmark series and is dedicated to graph query (algorithm, analytics) results validation.
October 13, 2022
by Ricky Sun
· 5,205 Views · 1 Like
Decision Guidance for Serverless Adoption
This article offers decision guidance for adopting serverless across various architectures and workloads, and shares a list of antipatterns.
October 12, 2022
by Abhay Patra
· 5,617 Views · 5 Likes
AIOps: What, Why, and How?
A guide to everything about AIOps: use cases, benefits, challenges, core elements, architecture, and its future.
October 11, 2022
by Mahipal Nehra
· 6,554 Views · 3 Likes
Handling Big Data with HBase Part 5: Data Modeling (or, Life Without SQL)
[Editor's note: Be sure to check out parts 1 through 4 first.] This is the fifth in a series of blogs introducing Apache HBase. In the fourth part, we saw the basics of using the Java API to interact with HBase to create tables, retrieve data by row key, and do table scans. This part will discuss how to design schemas in HBase.

HBase has nothing similar to a rich query capability like SQL from relational databases. Instead, it forgoes this capability, and others like relationships, joins, etc., to focus on providing scalability with good performance and fault tolerance. So when working with HBase, you need to design the row keys and table structure in terms of rows and column families to match the data access patterns of your application. This is completely opposite of what you do with relational databases, where you start out with a normalized database schema and separate tables, and then use SQL to perform joins to combine data in the ways you need. With HBase you design your tables specific to how they will be accessed by applications, so you need to think much more up front about how data is accessed. You are much closer to the bare metal with HBase than with relational databases, which abstract implementation details and storage mechanisms. However, for applications that need to store massive amounts of data and have inherent scalability, performance characteristics, and tolerance to server failures, the potential benefits can far outweigh the costs.

In the last part on the Java API, I mentioned that when scanning data in HBase, the row key is critical since it is the primary means to restrict the rows scanned; there is nothing like a rich query language such as SQL in relational databases. Typically you create a scan using start and stop row keys and optionally add filters to further restrict the rows and columns returned. In order to have some flexibility when scanning, the row key should be designed to contain the information you need to find specific subsets of data.

In the blog and people examples we've seen so far, the row keys were designed to allow scanning via the most common data access patterns. For the blogs, the row keys were simply the posting date. This would permit scans in ascending order of blog entries, which is probably not the most common way to view blogs; you'd rather see the most recent blogs first. So a better row key design would be to use a reverse-order timestamp, which you can get using the formula (Long.MAX_VALUE - timestamp), so scans return the most recent blog posts first. This makes it easy to scan specific time ranges, for example to show all blogs in the past week or month, which is a typical way to navigate blog entries in web applications.

For the people table examples, we used a composite row key composed of last name, first name, middle initial, and a (unique) person identifier to distinguish people with the exact same name, separated by dashes. For example, Brian M. Smith with identifier 12345 would have row key smith-brian-m-12345. Scans for the people table can then be composed using start and end rows designed to retrieve people with specific last names, last names starting with specific letter combinations, or people with the same last name and first name initial.
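As a rough illustration of the two row key designs just described, here is a sketch built with the HBase Bytes utility. The class and helper names are mine, not from the series; only the key layouts come from the article.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeys {

    // Reverse-order timestamp: larger (newer) timestamps produce smaller values,
    // so the newest blog posts sort first in a scan.
    static byte[] blogRowKey(long postTimestampMillis) {
        return Bytes.toBytes(Long.MAX_VALUE - postTimestampMillis);
    }

    // Composite people key: lastName-firstName-middleInitial-personId,
    // e.g. "smith-brian-m-12345".
    static byte[] personRowKey(String last, String first, String middleInitial, String personId) {
        String key = String.join("-",
                last.toLowerCase(), first.toLowerCase(),
                middleInitial.toLowerCase(), personId);
        return Bytes.toBytes(key);
    }
}
```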
For example, if you wanted to find people whose first name begins with B and last name is Smith, you could use the start row key smith-b and stop row key smith-c (the start row key is inclusive while the stop row key is exclusive, so the stop key smith-c ensures all Smiths with a first name starting with the letter "B" are included). You can see that HBase supports the notion of partial keys, meaning you do not need to know the exact key, which provides more flexibility when creating appropriate scans. You can combine partial-key scans with filters to retrieve only the specific data needed, thus optimizing data retrieval for the data access patterns specific to your application.

So far the examples have involved only single tables containing one type of information and no related information. HBase does not have foreign key relationships as in relational databases, but because it supports rows having up to millions of columns, one way to design tables in HBase is to encapsulate related information in the same row: a "wide" table design. It is called a "wide" design since you are storing all information related to a row together, in as many columns as there are data items. In our blog example, you might want to store comments for each blog. The "wide" way to design this would be to include a column family named comments and then add columns to the comments family where the qualifiers are the comment timestamps; the comment columns would look like comments:20130704142510 and comments:20130707163045. Even better, when HBase retrieves columns it returns them in sorted order, just like row keys. So in order to display a blog entry and its comments, you can retrieve all the data from one row by asking for the content, info, and comments column families. You could also add a filter to retrieve only a specific number of comments, adding pagination to them.

The people table column families could also be redesigned to store contact information such as separate addresses, phone numbers, and email addresses in column families, allowing all of a person's information to be stored in one row. This kind of design can work well if the number of columns is relatively modest, as blog comments and a person's contact information would be. If instead you are modeling something like an email inbox, financial transactions, or massive amounts of automatically collected sensor data, you might choose instead to spread a user's emails, transactions, or sensor readings across multiple rows (a "tall" design) and design the row keys to allow efficient scanning and pagination. For an inbox the row key might look like - which would permit easily scanning and paginating a user's inbox, while for financial transactions the row key might be -. This kind of design is called "tall" since you are spreading information about the same thing (e.g., readings from the same sensor, transactions in an account) across multiple rows, and is something to consider if there will be an ever-expanding amount of information, as would be the case in a scenario involving data collection from a huge network of sensors.

Designing row keys and table structures is a key part of working with HBase, and will continue to be given its fundamental architecture. There are other things you can do to add alternative schemes for data access within HBase. For example, you could implement full-text searching via Apache Lucene, either within rows or external to HBase (search Google for HBASE-3529).
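Returning to the Smith example above, here is a minimal sketch of that partial-key scan. It uses the current HBase Java client (Connection/Table) rather than the older HTable-based API the original series was written against; the configuration is taken from the classpath, and error handling is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PartialKeyScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table people = connection.getTable(TableName.valueOf("people"))) {

            // Start row is inclusive, stop row is exclusive:
            // returns every Smith whose first name starts with "b".
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("smith-b"))
                    .withStopRow(Bytes.toBytes("smith-c"));

            try (ResultScanner scanner = people.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```

Because the stop row is exclusive, "smith-c" cleanly bounds every key that begins with "smith-b" without pulling in other last names.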
You can also create (and maintain) secondary indexes to permit alternate row key schemes for tables; for example, in our people table the composite row key consists of the name and a unique identifier. But if we want to access people by their birth date, telephone area code, email address, or any number of other ways, we could add secondary indexes to enable that form of interaction. Note, however, that adding secondary indexes is not something to be taken lightly; every time you write to the "main" table (e.g., people) you will need to also update all the secondary indexes! (Yes, this is something that relational databases do very well, but remember that HBase is designed to accommodate a lot more data than traditional RDBMSs were.)

Conclusion to Part 5

In this part of the series, we got an introduction to schema design in HBase (without relations or SQL). Even though HBase is missing some of the features found in traditional RDBMS systems, such as foreign keys and referential integrity, multi-row transactions, multiple indexes, and so on, many applications that need the inherent HBase benefits like scaling can benefit from using HBase. As with anything complex, there are tradeoffs to be made. In the case of HBase, you are giving up some richness in schema design and query flexibility, but you gain the ability to scale to massive amounts of data by (more or less) simply adding additional servers to your cluster. In the next and last part of this series, we'll wrap up and mention a few (of the many) things we didn't cover in these introductory blogs.

References

HBase web site: http://hbase.apache.org/
HBase wiki: http://wiki.apache.org/hadoop/Hbase
HBase Reference Guide: http://hbase.apache.org/book/book.html
HBase: The Definitive Guide: http://bit.ly/hbase-definitive-guide
Google Bigtable paper: http://labs.google.com/papers/bigtable.html
Hadoop web site: http://hadoop.apache.org/
Hadoop: The Definitive Guide: http://bit.ly/hadoop-definitive-guide
Fallacies of Distributed Computing: http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
HBase lightning talk slides: http://www.slideshare.net/scottleber/hbase-lightningtalk
Sample code: https://github.com/sleberknight/basic-hbase-examples
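To make the index-maintenance cost described above concrete, here is a hedged sketch of the dual write a client would perform against the people table and a birth-date index table. The index table name, column families, and qualifiers are hypothetical, not taken from the series.

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class SecondaryIndexWrite {

    // Writes the person row, then the matching index row.
    // The client is responsible for keeping both in step; HBase will not do it for you.
    static void savePerson(Connection connection, String personKey,
                           String birthDate, byte[] detailsValue) throws Exception {
        try (Table people = connection.getTable(TableName.valueOf("people"));
             Table byBirthDate = connection.getTable(TableName.valueOf("people_by_birthdate"))) {

            Put person = new Put(Bytes.toBytes(personKey));
            person.addColumn(Bytes.toBytes("info"), Bytes.toBytes("birthdate"),
                    Bytes.toBytes(birthDate));
            person.addColumn(Bytes.toBytes("info"), Bytes.toBytes("details"), detailsValue);
            people.put(person);

            // Index row key: birth date plus the main row key,
            // so lookups by date become a simple range scan.
            Put index = new Put(Bytes.toBytes(birthDate + "-" + personKey));
            index.addColumn(Bytes.toBytes("ref"), Bytes.toBytes("person"),
                    Bytes.toBytes(personKey));
            byBirthDate.put(index);
        }
    }
}
```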
October 11, 2022
by Scott Leberknight
· 18,993 Views · 3 Likes
Geek Reading for the Weekend
I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news, and programming tools and techniques.

  • Why You Make Less Money (job tips for geeks)
  • Nate Silver Gets Real About Big Data (ReadWrite)
  • Java StringBuilder myth debunked (Java Code Geeks)
  • Dew Drop – March 29, 2013 (#1,517) (Alvin Ashcraft's Morning Dew)
  • Generation Mooch? Why 20-somethings have a hard time paying for content (GigaOM)
  • Double Shot #1096 (A Fresh Cup)
  • Connecting Talking with Doing (Conversation Agent)
  • Games Galore: Building Atari with CreateJS (noupe)
  • Putting People in Boxes (Architects Zone – Architectural Design Patterns & Best Practices)
  • Do Code Improvements Add Value? (Architects Zone – Architectural Design Patterns & Best Practices)
  • Cassandra 1.1 – Reading and Writing from SSTable Perspective (Architects Zone – Architectural Design Patterns & Best Practices)
  • Couchbase NoSQL at Tunewiki: A Billion Documents and Counting (Architects Zone – Architectural Design Patterns & Best Practices)
  • The Daily Six Pack: March 29, 2013 (Dirk Strauss)
  • Using Kanban for Scrum Backlog Grooming (Agile Zone – Software Methodologies for Development Managers)
  • Humming (xkcd.com)
  • Amazon Acquires Social Reading Site Goodreads, Which Gives The Company A Social Advantage Over Apple (TechCrunch)

I hope you enjoy today's items, and please participate in the discussions on those sites.
October 11, 2022
by Robert Diana
· 7,729 Views · 1 Like
Geek Reading June 4, 2013
I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news, and programming tools and techniques.

  • Getting Visual: Your Secret Weapon For Storytelling & Persuasion (The Future Buzz)
  • My Clojure Workflow, Reloaded (Hacker News)
  • Replacing Clever Code with Unremarkable Code in Go (Hacker News)
  • Unit Test like a Secret Agent with Sinon.js (Web Dev .NET)
  • Bliki: EmbeddedDocument (Martin Fowler)
  • How we use ZFS to back up 5TB of MySQL data every day (Royal Pingdom)
  • IBM to acquire Softlayer for a rumored $2-2.5 billion (Hacker News)
  • Cloud SQL API: YOU get a database! And YOU get a database! And YOU get a database! (Cloud Platform Blog)
  • You Should Write Ugly Code (Hacker News)
  • How many lights can you turn on? (The Endeavour)
  • Python Big Picture — What's the "roadmap"? (S.Lott-Software Architect)
  • Salesforce announces deal to buy digital marketing firm ExactTarget for $2.5 billion (The Next Web)
  • Dew Drop – June 4, 2013 (#1,560) (Alvin Ashcraft's Morning Dew)
  • New Technologies Change the Way we Engage with Culture (Conversation Agent)
  • Free Python ebook: Bayesian Methods for Hackers (Hacker News)
  • How Go uses Go to build itself (Hacker News)
  • Sustainable Automated Testing (Javalobby – The heart of the Java developer community)
  • Breaking Down IBM's Definition of DevOps (Javalobby – The heart of the Java developer community)
  • Big Data is More than Correlation and Causality (Javalobby – The heart of the Java developer community)
  • So, What's in a Story? (Agile Zone – Software Methodologies for Development Managers)
  • The Real Lessons of Lego (for Software) (Agile Zone – Software Methodologies for Development Managers)
  • The Daily Six Pack: June 4, 2013 (Dirk Strauss)
  • Get your mobile application backed by the cloud with the Mobile Backend Starter (Cloud Platform Blog)
  • Open for Big Data: When Mule Meets the Elephant (Javalobby – The heart of the Java developer community)

I hope you enjoy today's items, and please participate in the discussions on those sites.
October 11, 2022
by Robert Diana
· 7,188 Views · 1 Like
Building a Data Warehouse, Part 5: Application Development Options
See also: Part I: When to Build Your Data Warehouse; Part II: Building a New Schema; Part III: Location of Your Data Warehouse; Part IV: Extraction, Transformation, and Load.

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse's table schema. In Part III, we looked at where to put the data warehouse tables. In Part IV, we looked at how to populate those tables and keep them in sync with your OLTP system. Today, in the last part of this series, we will take a quick look at the benefits of building the data warehouse before we need it for cubes and BI by exploring our reporting and other options.

As I said in Part I, you should plan on building your data warehouse when you architect your system up front. Doing so gives you a platform for building reports, or even applications such as web sites, off the aggregated data. As I mentioned in Part II, it is much easier to build a query and a report against the rolled-up table than the OLTP tables.

To demonstrate, I will make a quick pivot table using SQL Server 2008 R2 PowerPivot for Excel (or just PowerPivot for short!). I have shown how to use PowerPivot before on this blog; however, I was usually going against a SQL Server table, a SQL Azure table, or an OData feed. Today we will use a SQL Server table, but rather than build a PowerPivot against the OLTP data of Northwind, we will use our new rolled-up fact table.

To get started, I will open up PowerPivot and import data from the data warehouse I created in Part II. I will pull in the time, employee, and product dimension tables as well as the fact table. Once the data is loaded into PowerPivot, I am going to launch a new PivotTable. PowerPivot understands the relationships between the dimension and fact tables and places the tables in the design shown below.

I am going to drag some fields into the boxes on the PowerPivot designer to build a powerful and interactive pivot table. For rows I will choose the category and product hierarchy and sum on the total sales. I'll make the columns (or pivot on this field) the month from the time dimension to get a sum of sales by category/product by month. I will also drag year and quarter into my vertical and horizontal slicers for interactive filtering. Lastly, I will place the employee field in the report filter pane, giving the user the ability to filter by employee. The results look like this: I am dynamically filtering by 1997, third quarter, and employee name Janet Leverling.

This is a pretty powerful interactive report built in PowerPivot using the four data warehouse tables. If there were no data warehouse, this pivot table would have been very hard for an end user to build. Either they or a developer would have to perform joins to get the category and product hierarchy, as well as more joins to get the order details and sum of the sales. In addition, the breakout and dynamic filtering by year and quarter, and display by month, are only possible because of the DimTime table, so if there were no data warehouse tables, the user would have had to parse out those date parts. Just about the only thing the end user could have done without assistance from a developer or a sophisticated query is the employee filter (and even that would have taken some PowerPivot magic to display the employee name, unless the user did a join).

Of course, pivot tables are not the only thing you can create from the data warehouse tables: you can create reports, ad hoc query builders, web pages, and even an Amazon-style browse application. (Amazon uses its data warehouse to display inventory and OLTP to take your order.) I hope you have enjoyed this series. Enjoy your data warehousing.
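As a rough sketch of why reporting against the rolled-up tables is simpler, the query below joins one fact table to two dimensions and groups by month. All table and column names are assumptions about the Part II star schema, not taken from it, and the JDBC connection string is a placeholder.

```java
import java.sql.*;

public class SalesByMonth {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection string; adjust for your own SQL Server instance.
        String url = "jdbc:sqlserver://localhost;databaseName=NorthwindDW;integratedSecurity=true";

        // One join per dimension, no date-part parsing in the query itself.
        String sql =
            "SELECT d.CalendarYear, d.MonthNumber, p.CategoryName, SUM(f.TotalSales) AS Sales " +
            "FROM dwh.FactSales f " +
            "JOIN dwh.DimTime d    ON f.TimeKey = d.TimeKey " +
            "JOIN dwh.DimProduct p ON f.ProductKey = p.ProductKey " +
            "GROUP BY d.CalendarYear, d.MonthNumber, p.CategoryName " +
            "ORDER BY d.CalendarYear, d.MonthNumber, p.CategoryName";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%d-%02d %s: %.2f%n",
                        rs.getInt("CalendarYear"), rs.getInt("MonthNumber"),
                        rs.getString("CategoryName"), rs.getDouble("Sales"));
            }
        }
    }
}
```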
October 11, 2022
by John Cook
· 12,579 Views · 1 Like
Building a Data Warehouse, Part 3: Location of Your Data Warehouse
See also: Part I: When to Build Your Data Warehouse; Part II: Building a New Schema.

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse's table schema. Today we are going to look at where to put your data warehouse tables.

Let's look at the location of your data warehouse. Usually as your system matures, it follows this pattern:

  • Segmenting your data warehouse tables into their own isolated schema inside of the OLTP database
  • Moving the data warehouse tables to their own physical database
  • Moving the data warehouse database to its own hardware

When you bring a new system online, or start a new BI effort, you can keep things simple by putting your data warehouse tables inside of your OLTP database, just segregated from the other tables. You can do this in a variety of ways; the easiest is to use a database schema (i.e., dbo); I usually use dwh as the schema. This way it is easy for your application to access these tables as well as fill them and keep them in sync. The advantage is that your data warehouse and OLTP system are self-contained and it is easy to keep them in sync.

As your data warehouse grows, you may want to isolate it further and move it to its own database. This will add a small amount of complexity to the load and synchronization; however, moving the data warehouse tables to their own database brings benefits that make the move worth it. The benefits include implementing a separate security scheme. This is also very helpful if your OLTP database scheme locks down all of the tables and will not allow SELECT access, and you don't want to create new users and roles just for the data warehouse. In addition, you can implement a separate backup and maintenance plan, so your data warehouse tables, which tend to be larger, don't slow down your OLTP backup (and potential restore!). If you only load data at night, you can even make the data warehouse database read-only. Lastly, while minor, you will have less table clutter, making the database easier to work with.

Once your system grows even further, you can isolate the data warehouse onto its own hardware. The benefits of this are huge: you can have less I/O contention on the database server with the OLTP system. Depending on your network topology, you can reduce network traffic. You can also load up on more RAM and CPUs. In addition, you can consider different RAID array techniques for the OLTP and data warehouse servers (OLTP would be better with RAID 5, the data warehouse with RAID 1).

Once you move your data warehouse to its own database or its own database server, you can also start to replicate the data warehouse. For example, let's say that you have an OLTP system that works worldwide but you have management in offices in different parts of the world. You can reduce network traffic by having all reporting (and what else do managers do??) run on a local network against a local data warehouse. This only works if you don't have to update the data warehouse more than a few times a day.

Where you put your data warehouse is important. I suggest that you start small and work your way up as the needs dictate.
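A minimal sketch of the first stage described above (warehouse tables segregated into their own schema inside the OLTP database), assuming SQL Server with the DDL issued through JDBC. The dwh schema name comes from the article; the database name, role name, and grant are illustrative assumptions.

```java
import java.sql.*;

public class CreateWarehouseSchema {
    public static void main(String[] args) throws SQLException {
        // Hypothetical connection string; adjust for your own SQL Server instance.
        String url = "jdbc:sqlserver://localhost;databaseName=Northwind;integratedSecurity=true";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Keep warehouse tables segregated from the OLTP tables in the same database.
            stmt.executeUpdate("CREATE SCHEMA dwh");

            // A separate role lets reporting users read the warehouse
            // without opening up the OLTP tables.
            stmt.executeUpdate("CREATE ROLE warehouse_reader");
            stmt.executeUpdate("GRANT SELECT ON SCHEMA::dwh TO warehouse_reader");
        }
    }
}
```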
October 11, 2022
by Stephen Forte
· 9,692 Views · 1 Like
Model Cards and the Importance of Standardized Documentation for Explaining Models
Building on Google's work, here are some suggestions on how to create effective documentation to make models open, accessible, and understandable to all teams.
October 10, 2022
by Adam Lieberman
· 3,469 Views · 2 Likes
Golang vs. Python: Which Is Better?
Let's dive into a comparison between Go and Python.
October 8, 2022
by Apoorva Goel
· 5,146 Views · 2 Likes
Understanding Kafka-on-Pulsar (KoP): Yesterday, Today, and Tomorrow
Diving into KoP concepts, answering frequently asked questions, and covering the latest and planned improvements the KoP community has made and will make to the project.
October 7, 2022
by Yunze Xu
· 6,355 Views · 5 Likes
Data Warehouse and Data Lake Modernization: From Legacy On-Premise to Cloud-Native Infrastructure
Learn how to build a modern data stack with cloud-native technologies, such as data warehouse, data lake, and data streaming, to solve business problems.
October 7, 2022
by Kai Wähner CORE
· 4,868 Views · 3 Likes
Top 5 Cloud-Native Message Queues (MQs) With Node.js Support
The benefits of cloud native, why we need it for message queues, and the top five cloud-native MQs that can easily be run with Node.js.
October 6, 2022
by Rose Chege
· 3,265 Views · 3 Likes