The Latest Databases Topics

Intermodular Analysis of C and C++ Projects in Detail (Part 1)
This article describes how similar mechanisms are arranged in compilers and reveals details of how to implement intermodular analysis in our static analyzer.
October 14, 2022
by Oleg Lisiy
· 4,721 Views · 1 Like
How to Read Graph Database Benchmark (Part II)
This is the second part of the How to Read Graph Database Benchmark series and is dedicated to graph query (algorithm, analytics) results validation.
October 13, 2022
by Ricky Sun
· 5,752 Views · 1 Like
Autoscale Azure Pipeline Agents With KEDA
KEDA is an event-driven autoscaler that horizontally scales a container by adding additional pods based on the number of events needing to be processed.
October 13, 2022
by Basudeba Mandal
· 5,924 Views · 3 Likes
Serverless at Scale
The article discusses the architectures currently popular for serverless-at-scale use cases, and how and when we can use them.
October 12, 2022
by Joyce Thoppil
· 7,239 Views · 4 Likes
Managing AWS IAM With Terraform: Part 2
In this second part, learn how to centralize IAM for multiple AWS accounts, create and use EC2 instance profiles, and implement just-in-time access with Vault.
October 11, 2022
by Tiexin Guo
· 4,853 Views · 2 Likes
Managing AWS IAM With Terraform: Part 1
Cover the basics of managing AWS IAM with Terraform and learn how to delete the Root/User Access key, enforce MFA, customize password policy, and more.
October 11, 2022
by Tiexin Guo
· 4,815 Views · 1 Like
Geo-Distributed API Layer With Kong Gateway
Learn how to build a geo-distributed API layer with Kong Gateway.
October 11, 2022
by Denis Magda
· 9,851 Views · 3 Likes
High Availability with MySQL Fabric: Part II
Originally written by Fernando Ipar and Martin Arrieta

This is the third post in our MySQL Fabric series. If you missed the previous two, we started with an overall introduction, and then a discussion of MySQL Fabric’s high-availability (HA) features. MySQL Fabric was RC when we started this series, but it went GA recently. You can read the press release here, and see this blog post from Oracle’s Mats Kindahl for more details.

In our previous post, we showed a simple HA setup managed with MySQL Fabric, including some basic failure scenarios. Today, we’ll present a similar scenario from an application developer’s point of view, using the Python Connector for the examples.

If you’re following the examples on these posts, you’ll notice that the UUID for servers will be changing. That’s because we rebuild the environment between runs. Symbolic names stay the same though. That said, here’s our usual 3 node setup:

    [vagrant@store ~]$ mysqlfabric group lookup_servers mycluster
    Command :
    { success = True
      return = [{'status': 'SECONDARY', 'server_uuid': '3084fcf2-df86-11e3-b46c-0800274fb806', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': '192.168.70.101'},
                {'status': 'SECONDARY', 'server_uuid': '35cc3529-df86-11e3-b46c-0800274fb806', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': '192.168.70.102'},
                {'status': 'PRIMARY', 'server_uuid': '3d3f6cda-df86-11e3-

For our tests, we will be using this simple script:

    import mysql.connector
    from mysql.connector import fabric
    from mysql.connector import errors
    import time

    config = {
        'fabric': {
            'host': '192.168.70.100',
            'port': 8080,
            'username': 'admin',
            'password': 'admin',
            'report_errors': True
        },
        'user': 'fabric',
        'password': 'f4bric',
        'database': 'test',
        'autocommit': 'true'
    }

    fcnx = None

    print "starting loop"
    while 1:
        if fcnx == None:
            print "connecting"
            fcnx = mysql.connector.connect(**config)
            fcnx.set_property(group='mycluster', mode=fabric.MODE_READWRITE)
        try:
            print "will run query"
            cur = fcnx.cursor()
            cur.execute("select id, sleep(0.2) from test.test limit 1")
            for (id) in cur:
                print id
            print "will sleep 1 second"
            time.sleep(1)
        except errors.DatabaseError:
            print "sleeping 1 second and reconnecting"
            time.sleep(1)
            del fcnx
            fcnx = mysql.connector.connect(**config)
            fcnx.set_property(group='mycluster', mode=fabric.MODE_READWRITE)
            fcnx.reset_cache()
            try:
                cur = fcnx.cursor()
                cur.execute("select 1")
            except errors.InterfaceError:
                fcnx = mysql.connector.connect(**config)
                fcnx.set_property(group='mycluster', mode=fabric.MODE_READWRITE)
                fcnx.reset_cache()

This simple script requests a MODE_READWRITE connection and then issues selects in a loop. The reason it requests a RW connector is that it makes it easier for us to provoke a failure, as we have two SECONDARY nodes that could be used for queries if we requested a MODE_READONLY connection. The select includes a short sleep to make it easier to catch it in SHOW PROCESSLIST.

In order to work, this script needs the test.test table to exist in the mycluster group. Running the following statements in the PRIMARY node will do it:

    mysql> create database if not exists test;
    mysql> create table if not exists test.test (id int unsigned not null auto_increment primary key) engine = innodb;
    mysql> insert into test.test values (null);

Dealing with failure

With everything set up, we can start the script and then cause a PRIMARY failure.
In this case, we’ll simulate a failure by shutting down mysqld on it: mysql> select @@hostname; +-------------+ | @@hostname | +-------------+ | node3.local | +-------------+ 1 row in set (0.00 sec) mysql> show processlist; +----+--------+--------------------+------+------------------+------+-----------------------------------------------------------------------+----------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+--------+--------------------+------+------------------+------+-----------------------------------------------------------------------+----------------------------------------------+ | 5 | fabric | store:39929 | NULL | Sleep | 217 | | NULL | | 6 | fabric | node1:37999 | NULL | Binlog Dump GTID | 217 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL | | 7 | fabric | node2:49750 | NULL | Binlog Dump GTID | 216 | Master has sent all binlog to slave; waiting for binlog to be updated | NULL | | 16 | root | localhost | NULL | Query | 0 | init | show processlist | | 20 | fabric | 192.168.70.1:55889 | test | Query | 0 | User sleep | select id, sleep(0.2) from test.test limit 1 | +----+--------+--------------------+------+------------------+------+-----------------------------------------------------------------------+----------------------------------------------+ 5 rows in set (0.00 sec) [vagrant@node3 ~]$ sudo service mysqld stop Stopping mysqld: [ OK ] While this happens, here’s the output from the script: will sleep 1 second will run query (1, 0) will sleep 1 second will run query (1, 0) will sleep 1 second will run query (1, 0) will sleep 1 second will run query sleeping 1 second and reconnecting will run query (1, 0) will sleep 1 second will run query (1, 0) will sleep 1 second will run query (1, 0) The ‘sleeping 1 second and reconnecting’ line means the script got an exception while running a query (when the PRIMARY node was stopped, waited one second and then 
reconnected. The next lines confirm that everything went back to normal after the reconnection. The relevant piece of code that handles the reconnection is this:

    fcnx = mysql.connector.connect(**config)
    fcnx.set_property(group='mycluster', mode=fabric.MODE_READWRITE)
    fcnx.reset_cache()

If fcnx.reset_cache() is not invoked, then the driver won’t connect to the XML-RPC server again, but will use its local cache of the group’s status instead. As the PRIMARY node is offline, this will cause the reconnect attempt to fail. By resetting the cache, we’re forcing the driver to connect to the XML-RPC server and fetch up-to-date group status information.

If more failures happen and there is no PRIMARY (or candidate for promotion) node in the group, the following error is received:

    will run query
    (1, 0)
    will sleep 1 second
    will run query
    sleeping 1 second and reconnecting
    will run query
    Traceback (most recent call last):
      File "./reader_test.py", line 34, in <module>
        cur = fcnx.cursor()
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 1062, in cursor
        self._connect()
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 1012, in _connect
        exc))
    mysql.connector.errors.InterfaceError: Error getting connection: No MySQL server available for group 'mycluster'

Running without MySQL Fabric

As we have discussed in previous posts, the XML-RPC server can become a single point of failure under certain circumstances. Specifically, there are at least two problem scenarios when this server is down:

  • When a node goes down
  • When new connection attempts are made

The first case is obvious enough. If MySQL Fabric is not running and a node fails, there won’t be any action, and clients will get an error whenever they send a query to the failed node. This is worse if the PRIMARY fails, as failover won’t happen and the cluster will be unavailable for writes.
The second case means that while MySQL Fabric is not running, no new connections to the group can be established. This is because when connecting to a group, MySQL Fabric-aware clients first connect to the XML-RPC server to get a list of nodes and roles, and only then use their local cache for decisions. A way to mitigate this is to use connection pooling, which reduces the need for establishing new connections, and therefore minimises the chance of failure due to MySQL Fabric being down. This, of course, is assuming that something is monitoring MySQL Fabric and ensuring that some host provides the XML-RPC service. If that is not the case, failure will be delayed, but it will eventually happen anyway.

Here is an example of what happens when MySQL Fabric is down and the PRIMARY node goes down:

    Traceback (most recent call last):
      File "./reader_test.py", line 35, in <module>
        cur.execute("select id, sleep(0.2) from test.test limit 1")
      File "/Library/Python/2.7/site-packages/mysql/connector/cursor.py", line 491, in execute
        self._handle_result(self._connection.cmd_query(stmt))
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 1144, in cmd_query
        self.handle_mysql_error(exc)
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 1099, in handle_mysql_error
        self.reset_cache()
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 832, in reset_cache
        self._fabric.reset_cache(group=group)
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 369, in reset_cache
        self.get_group_servers(group, use_cache=False)
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 478, in get_group_servers
        inst = self.get_instance()
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 390, in get_instance
        if not inst.is_connected:
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 772, in is_connected
        self._proxy._some_nonexisting_method() # pylint: disable=W0212
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xmlrpclib.py", line 1224, in __call__
        return self.__send(self.__name, args)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xmlrpclib.py", line 1578, in __request
        verbose=self.__verbose
      File "/Library/Python/2.7/site-packages/mysql/connector/fabric/connection.py", line 272, in request
        raise InterfaceError("Connection with Fabric failed: " + msg)
    mysql.connector.errors.InterfaceError: Connection with Fabric failed:

This happens when a new connection attempt is made after resetting the local cache.

Making sure MySQL Fabric stays up

As of this writing, it is the user’s responsibility to make sure MySQL Fabric is up and running. This means you can use whatever you feel comfortable with in terms of HA, like Pacemaker. While it does add some complexity to the setup, the XML-RPC server is very simple to manage and so a simple resource manager should work. For the backend, MySQL Fabric is storage engine agnostic, so an easy way to resolve this could be to use a small MySQL Cluster set up to ensure the backend is available. MySQL’s team blogged about such a set up here.

We think the ndb approach is probably the simplest for providing HA at the MySQL Fabric store level, but believe that MySQL Fabric itself should provide or make it easy to achieve HA at the XML-RPC server level. If ndb is used as store, this means any node can take a write, which in turn means multiple XML-RPC instances should be able to write to the store concurrently. This means that in theory, improving this could be as easy as allowing Fabric-aware drivers to get a list of Fabric servers instead of a single IP and port to connect to.
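The "list of Fabric servers" idea can be sketched on the client side as follows. This is only an illustration of the suggestion, not something the Fabric-aware driver offers: `connect_to_any`, `fake_connect`, and the second address (192.168.70.201) are hypothetical, and the sketch is written in Python 3 rather than the Python 2 used by the script in this post.

```python
from typing import Callable, Iterable, List, Tuple

Endpoint = Tuple[str, int]


class NoFabricAvailable(Exception):
    """Raised when none of the candidate Fabric endpoints can be reached."""


def connect_to_any(endpoints: Iterable[Endpoint],
                   connect: Callable[[str, int], object]) -> object:
    """Try each (host, port) in order and return the first connection that works.

    `connect` is a hypothetical callable standing in for whatever a
    Fabric-aware driver would use to reach a single XML-RPC server.
    """
    failures: List[str] = []
    for host, port in endpoints:
        try:
            return connect(host, port)
        except ConnectionError as exc:
            # Remember why this endpoint failed and move on to the next one.
            failures.append(f"{host}:{port} -> {exc}")
    raise NoFabricAvailable("; ".join(failures) or "no endpoints configured")


def fake_connect(host: str, port: int) -> str:
    """Stand-in connector for the demo: only 192.168.70.201 is 'up'."""
    if host != "192.168.70.201":
        raise ConnectionError("connection refused")
    return f"connected to {host}:{port}"


if __name__ == "__main__":
    servers = [("192.168.70.100", 8080), ("192.168.70.201", 8080)]
    print(connect_to_any(servers, fake_connect))
```

Trying endpoints in a fixed order keeps the sketch simple; a real client would probably also want timeouts and some way to avoid retrying a known-dead server on every call.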
What’s next In the past two posts, we’ve presented MySQL Fabric’s HA features, seen how it handles failures at the node level, how to use MySQL databases with a MySQL Fabric-aware driver, and what remains unresolved for now. In our next post, we’ll review MySQL Fabric’s Sharding features.
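Looking back at the failure handling in this post, the "sleep, reconnect, reset the cache, try again" step can be wrapped in a small helper with a bounded number of attempts. The sketch below is an assumption-laden illustration in Python 3: `run_with_reconnect`, `FakeConnection`, and the two fake functions are hypothetical, and the built-in `ConnectionError` stands in for the connector's `errors.DatabaseError`. In real use, `connect` would do what the script does: call `mysql.connector.connect(**config)`, then `set_property(...)` and `reset_cache()`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(connect: Callable[[], object],
                       run_query: Callable[[object], T],
                       attempts: int = 3,
                       delay_seconds: float = 1.0) -> T:
    """Run `run_query(connection)`, reconnecting after each failure.

    Gives up and re-raises the last error once `attempts` tries have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    connection = connect()
    for attempt in range(1, attempts + 1):
        try:
            return run_query(connection)
        except ConnectionError:
            if attempt == attempts:
                raise
            # Wait, then build a fresh connection instead of reusing the old one.
            time.sleep(delay_seconds)
            connection = connect()
    raise AssertionError("unreachable")


class FakeConnection:
    """Demo stand-in for a database connection."""

    def __init__(self, healthy: bool) -> None:
        self.healthy = healthy


connect_calls = {"count": 0}


def fake_connect() -> FakeConnection:
    """The first connection points at a dead PRIMARY; later ones are healthy."""
    connect_calls["count"] += 1
    return FakeConnection(healthy=connect_calls["count"] > 1)


def fake_query(connection: FakeConnection) -> tuple:
    if not connection.healthy:
        raise ConnectionError("PRIMARY is gone")
    return (1, 0)
```

Calling `run_with_reconnect(fake_connect, fake_query, delay_seconds=0)` fails once, reconnects, and then returns the row, which mirrors the "sleeping 1 second and reconnecting" line in the script output.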
Updated October 11, 2022
by Peter Zaitsev
· 10,434 Views · 1 Like
Here’s how Bell was Hacked: SQL Injection Blow-by-Blow
OWASP’s number one risk in the Top 10 has featured prominently in a high-profile attack this time resulting in the leak of over 40,000 records from Bell in Canada. It was pretty self-evident from the original info leaked by the attackers that SQL injection had played a prominent role in the breach, but now we have some pretty conclusive evidence of it as well: The usual fanfare quickly followed – announcements by the attackers, silence by the impacted company (at least for the first day), outrage by affected customers and the new normal for public breaches: I got the data loaded into Have I been pwned? and searchable as soon as I’d verified it. Now you would think – quite reasonably – that SQLi would be becoming a thing of the past what with all the awareness and Top 10 stuff let alone the emergence of tools like object relational mappers that make it almost impossible to screw this up, but here we are. Clearly we need a bit of a refresher on the risk and what better way to do it than to reconstruct the Bell system that was breached then, well, breach it again. Let’s get to it. A long time ago in a language far, far away The first thing that should hit savvy readers in the image above is that this is an ASP web site. No, not ASP.NET, go back further – classic ASP. This is not classic like, say, a Ferrari 250 GTO which has grown increasingly desirable with age, rather it’s “classic” like Citroën 2CV; it was kinda cool at the time but you’d be damned if you want a mate seeing you in one today. But I digress. Classic ASP was replaced almost 12 years ago to the day with the platform that remains Microsoft’s framework of choice for building web sites today – ASP.NET. You could forgive someone for persevering with classic ASP a decade ago, perhaps even 5 years ago, but today? I don’t think so. If you’re running this platform today to host anything of any value whatsoever on the web, you’ve got rocks in your head. 
(Yes, I know it’s still supported but seriously folks, it was built for another era and just isn’t resilient to today’s web risks) Anyway, to reproduce this risk I’m going to create a very simple classic ASP site that looks just like the one above. That’s one thing that was great about classic ASP – it was dead easy to create a simple site! For added realism I’ll create a local host entry for protectionmanagement.bell.ca and even add a self-signed cert so we can hit it via HTTPS. The affected site has been well and truly pulled by now, but of course nothing is ever really gone on the internet, it just goes to Google cache heaven: That shouldn’t be too hard to reproduce, how’s this look? Forgive me if I don’t go so far as to recreate the broken images! Let’s move on to the other thing we know about the attack, and that’s what the back end database looks like. Implementing the back end What we need to make this whole thing resemble a real attack is a little bit of classic ASP wiring and a database. The latter is quite easy to reconstruct because the entire schema was dumped along with the breach. Yes, yes, the breach got pulled very early by the powers that be, but per the earlier point, cache is your friend. Here’s what we’ve been told about the credentials table: Columns: tblCredentials tblCredentials.CredentialID, tblCredentials.OrderID, tblCredentials.CustomerID, tblCredentials.ServiceType, tblCredentials.UserName, tblCredentials.Password, tblCredentials.Level, tblCredentials.CustomerName, tblCredentials.PersonName, tblCredentials.GroupID, tblCredentials.SecretQuestion, tblCredentials.SecretAnswer, tblCredentials.UserEmail tblCredentials.UserLanguage By prefixing the table with the letters “tbl” you know that it’s a table and not a magic unicorn or a Chinese dissident (let us not digress into the insanity that is the tbl prefix). Anyway, I’ve recreated that table and another one called tblTransaction2010 in my local SQL instance that looks just like this. 
I’ve then whipped up a little VB Script in the .asp file (“Tonight we’re gonna code like it’s 1999”) which connects to the database and runs a SQL statement constructed like this: SQL = "SELECT * FROM tblCredentials WHERE UserName='" + Request.Form("UserName") + "'" Yeah, that looks about right! Let’s see what happens now… Employing HackBar The extension you see in the first image of this post is HackBar, a simple little add-on for testing things like SQLi and XSS. The premise is that it can monitor requests the browser makes and then make it dead easy to reconstruct them with manipulated parameters, you know, the kind of stuff that can exploit SQLi risks. It looks like this: What I’ve done is tried to perform a reset for the username “troy” (which performs a post request to the server) then I’ve just hit the “Load URL” button and checked “Enable Post data”. That then gives us the resource that was hit in the top text box and the form data with name value pairs in the bottom. Dead simple, now let’s break some stuff. Mounting the attack What we see in the first image above is what’s known as an error-based SQLi attack or in other words, the attacks are using exceptions thrown by the server and sent back in the response to discover the internal implementation of the system. I talk about this and other SQLi attacks patterns in my post on Everything you wanted to know about SQL injection (but were afraid to ask). Let’s reproduce what the attackers have in that first image – disclosure of the internal database version. This is a useful first step as it helps attackers understand what they’re playing with. Different database environments and even versions are exploited in different ways so discovering this early is important, question is, how do you get the database to cough this information up? In the post I mention above, I show how attempting to cast non-integer values to an integer will throw an internal exception which discloses the data. 
The first thing we need to establish is how to generate the data which in this case is the DB version. That’s dead simple, we just ask for @@VERSION then if we try to convert that to an int and the exception bubbles up to the browser, we’ve got ourselves some useful info. Does this look about right?

And there we have it – the DB version data. All I’ve done with the post data is send it over like this:

    UserName=' or 1=convert(int, @@version)--

Clearly the version won’t convert to an int so we get the error above. The %2x values you’re seeing in the HackBar window are simply URL encoded characters which can be achieved by selecting the string and then choosing the correct encoding context from the menu (I’ll leave the unencoded values there in future grabs for the sake of legibility):

So this is a start, but where’s the good stuff? How about we move onto discovering the schema because until we know what tables and columns are in there, it’s going to be a tough job pulling the data. Let’s start with table names:

tblTransaction2010 is it? We know it’s a table because of the prefix… ok, I’ll let it go, point is we now know a table name and all it took was to select out of sysobjects. I go into detail about how this works in the aforementioned everything you want to know post so I won’t dwell on it here, let’s get another table name:

Ah, so there’s our tblCredentials table and all it took was to adjust one number in the query so that the inner select statement took the top 2 records instead of the top 1 thus allowing the outer select to grab the next table in sysobjects. Let’s get some columns and there’s no one “right way” of doing this as there are multiple ways of pulling column names from SQL Server (and for pulling table names too, for that matter). Let’s try this one:

The exception discloses the presence of a column called UserName on the table tblCredentials.
That’s handy, let’s move on and I’ll just keep incrementing the integer in the inner select statement: Ah, so there’s a password column as well, that’s handy, let’s see about pulling some data out of there: I’ve deliberately simplified this statement so it just pulls the first record in the default order but by now these nested, sorted selects should give you an idea of how easy it is to enumerate through the data. So there’s the username – “troy” – let’s grab the password too: This is unfortunate because clearly I’ve taken my personal security seriously and substituted not only the “a” for an “@”, but also the “o” for a “0”. But when you don’t have any cryptographic storage on the credentials which was the case with Bell, even my real passwords that are all randomly generated by 1Password have nowhere to hide when an SQLi attack hits pay dirt. In practice, you’re not going to go through and manually enumerate every single table, column and then row (column by column, I might add), instead you’re going to automate the process using a tool like Havij once you’ve discovered an at-risk target. If Havij is new to you, it’s child’s play – here’s my 3 year old learning how to use it, it really is that simple. There will be nuances between how I’ve replicated the attack here and how the guys behind it actually went about it. There might be other vectors through other pages or depending on how the original password recovery page responded, more streamlined ways of pulling the data. There may have even been SQL credential exposure at some point which would make the whole thing dead easy. Either way Bell (or whoever is copping the blame) will have more than enough data in their logs to reconstruct the attack and know exactly where it all went wrong. 
Hardening Bell’s environment

Firstly, yes, I know that Bell has laid the blame on a partner providing services to some of their customers but it’s Bell in the headlines, it’s Bell sending out the apology emails and it’s Bell who now has to clean up this mess. I say this not to berate Bell but to draw attention to the responsibility that organisations have to ensure that their partners are employing appropriate security measures. The risk above could have been discovered in minutes by an automated tool and almost as quickly by even the most junior penetration tester. Nobody tested this system for security vulnerabilities – including Bell – and now they have a very unfortunate blight on their record that will be referenced for years to come.

Anyway, let’s focus on the mitigations of this risk because as I said from the outset, this needs to be taken as an opportunity for others to learn some fundamentals that could save them from a similar fate. Let me summarise in point form:

  • Using out-dated frameworks: Classic ASP guys – get rid of it. It has nowhere near the defences that modern web platforms have in place not just for SQLi, but for a whole range of attacks. You cannot afford to keep running VB script on the server.
  • No white-listing of untrusted data: In the example above (and inevitably in the real system), SQLi attacks were thrown at the website and it… welcomed them with open arms. “Validate all untrusted data against a whitelist of allowable values” is the mantra I’ve repeated so many times and the username should only be allowing characters that it actually accepted when people signed up so that means no brackets, quotes, spaces, etc (none of these are in the breached data).
  • Non-parameterised SQL: My example earlier on about how the SQL statement was likely constructed shows just a concatenated string with the potential to mix the query with untrusted data. This is what got them and I talk extensively about the right way to do this in part one of my series on the Top 10.
  • Internal implementation leakage: This attack was made dead easy by the fact that internal exceptions bubbled up to the UI. Someone had to actually enable this – newer versions of IIS won’t allow this to happen by default. The extent of this risk goes well beyond SQLi as well as there are some very, very juicy things that web sites sharing their internals can disclose.
  • Plain text password storage: Shit happens. Sites get breached. We now all understand that, but what makes it a whole lot worse is when the data is usable by attackers, not just the ones who pulled it, but anyone in the general public who now has access to it. Passwords should always be stored with a strong cryptographic hashing algorithm designed for protecting credentials. Anything short of this leaves you naked in an attack.

They’re just a few easy ones – SQLi 101 – and they should be painfully obvious.

In conclusion…

SQLi attacks remain rampant. They’re still in the number one spot in OWASP’s Top 10 (even the latest 2013 version) and it’s still rated as easy to exploit and as having a severe impact. They’re favoured by attackers because they’re just so easy to crack open which was the point of showing my 3 year old doing it earlier on. In this case the attackers actually showed a decent understanding of the mechanics behind SQLi, but the point is that the barrier to entry for this attack can be very, very low.

Lastly, if you’re a dev or managing devs then get them into some in-depth security training whether that be via my Pluralsight courses or through any of the other excellent resources out there. You can’t wait until after things go wrong to do this.
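Three of the mitigations above (whitelisting, parameterised SQL and hashed password storage) can be shown together in a few lines. This is a minimal sketch under loud assumptions, not a reconstruction of Bell's system: it uses Python's built-in sqlite3 module instead of SQL Server and classic ASP, a cut-down tblCredentials table with hypothetical Salt and PasswordHash columns, a username pattern that is only a guess at what sign-up allowed, and PBKDF2 from the standard library as one example of a password hashing function.

```python
import hashlib
import hmac
import os
import re
import sqlite3
from typing import Optional

# Assumed whitelist: letters, digits, underscore and dot, 1 to 32 characters.
USERNAME_PATTERN = re.compile(r"[A-Za-z0-9_.]{1,32}")


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a slow, salted hash so a leaked table does not expose passwords."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 200_000)


def create_demo_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for the credentials table with one user."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tblCredentials ("
        "UserName TEXT PRIMARY KEY, Salt BLOB, PasswordHash BLOB)"
    )
    salt = os.urandom(16)
    conn.execute(
        "INSERT INTO tblCredentials VALUES (?, ?, ?)",
        ("troy", salt, hash_password("example-password", salt)),
    )
    return conn


def find_user(conn: sqlite3.Connection, username: str) -> Optional[str]:
    """Look a user up by name without ever concatenating input into SQL."""
    if not USERNAME_PATTERN.fullmatch(username):
        return None  # rejected by the whitelist before touching the database
    row = conn.execute(
        "SELECT UserName FROM tblCredentials WHERE UserName = ?",
        (username,),  # bound as data, never parsed as SQL
    ).fetchone()
    return row[0] if row else None


def check_password(conn: sqlite3.Connection, username: str, password: str) -> bool:
    """Compare a login attempt against the stored salted hash."""
    if not USERNAME_PATTERN.fullmatch(username):
        return False
    row = conn.execute(
        "SELECT Salt, PasswordHash FROM tblCredentials WHERE UserName = ?",
        (username,),
    ).fetchone()
    if row is None:
        return False
    salt, stored_hash = row
    return hmac.compare_digest(stored_hash, hash_password(password, salt))
```

With this in place, `find_user(conn, "' or 1=convert(int, @@version)--")` simply returns `None`: the whitelist rejects it, and even without the whitelist the placeholder would treat the whole string as a literal username.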
Updated October 11, 2022
by Troy Hunt
· 11,159 Views · 1 Like
Handling Big Data with HBase Part 5: Data Modeling (or, Life Without SQL)
This is the fifth of a series of blogs introducing Apache HBase. In the fourth part, we saw the basics of using the Java API to interact with HBase to create tables, retrieve data by row key, and do table scans. This part will discuss how to design schemas in HBase. HBase has nothing similar to a rich query capability like SQL from relational databases. Instead, it forgoes this capability and others like relationships, joins, etc. to instead focus on providing scalability with good performance and fault-tolerance. So when working with HBase you need to design the row keys and table structure in terms of rows and column families to match the data access patterns of your application. This is completely opposite what you do with relational databases where you start out with a normalized database schema, separate tables, and then you use SQL to perform joins to combine data in the ways you need. With HBase you design your tables specific to how they will be accessed by applications, so you need to think much more up-front about how data is accessed. You are much closer to the bare metal with HBase than with relational databases which abstract implementation details and storage mechanisms. However, for applications needing to store massive amounts of data and have inherent scalability, performance characteristics and tolerance to server failures, the potential benefits can far outweigh the costs. In the last part on the Java API, I mentioned that when scanning data in HBase, the row key is critical since it is the primary means to restrict the rows scanned; there is nothing like a rich query like SQL as in relational databases. Typically you create a scan using start and stop row keys and optionally add filters to further restrict the rows and columns data returned. In order to have some flexibility when scanning, the row key should be designed to contain the information you need to find specific subsets of data. 
In the blog and people examples we've seen so far, the row keys were designed to allow scanning via the most common data access patterns. For the blogs, the row keys were simply the posting date. This would permit scans in ascending order of blog entries, which is probably not the most common way to view blogs; you'd rather see the most recent blogs first. So a better row key design would be to use a reverse order timestamp, which you can get using the formula (Long.MAX_VALUE - timestamp), so scans return the most recent blog posts first. This makes it easy to scan specific time ranges, for example to show all blogs in the past week or month, which is a typical way to navigate blog entries in web applications. For the people table examples, we used a composite row key composed of last name, first name, middle initial, and a (unique) person identifier to distinguish people with the exact same name, separated by dashes. For example, Brian M. Smith with identifier 12345 would have row key smith-brian-m-12345. Scans for the people table can then be composed using start and end rows designed to retrieve people with specific last names, last names starting with specific letter combinations, or people with the same last name and first name initial. For example, if you wanted to find people whose first name begins with B and last name is Smith you could use the start row key smith-b and stop row key smith-c (the start row key is inclusive while the stop row key is exclusive, so the stop key smith-c ensures all Smiths with first name starting with the letter "B" are included). You can see that HBase supports the notion of partial keys, meaning you do not need to know the exact key, to provide more flexibility creating appropriate scans. You can combine partial key scans with filters to retrieve only the specific data needed, thus optimizing data retrieval for the data access patterns specific to your application. 
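The two row key ideas above can be tried out without an HBase cluster. The sketch below is a plain-Python simulation, not HBase client code: a sorted list stands in for a table's row keys, the hypothetical `scan` helper mimics a scan with an inclusive start row and an exclusive stop row, and every sample key except smith-brian-m-12345 is invented. Encoding the reversed timestamp as a zero-padded decimal string is also an assumption made so that string order matches numeric order.

```python
import bisect
from typing import List

LONG_MAX = 2**63 - 1  # the value of Java's Long.MAX_VALUE


def reverse_timestamp_key(timestamp_ms: int) -> str:
    """Build a key that sorts newest-first: Long.MAX_VALUE - timestamp.

    Zero-padded to 19 digits so lexicographic order equals numeric order.
    """
    return f"{LONG_MAX - timestamp_ms:019d}"


def scan(sorted_keys: List[str], start_row: str, stop_row: str) -> List[str]:
    """Return keys k with start_row <= k < stop_row from an already sorted list."""
    lo = bisect.bisect_left(sorted_keys, start_row)
    hi = bisect.bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


if __name__ == "__main__":
    # Partial-key scan: Smiths whose first name starts with "b".
    people = sorted([
        "jones-alice-k-00017",
        "smith-adam-j-20001",
        "smith-bob-t-54321",
        "smith-brian-m-12345",
        "smith-carol-a-33333",
        "smythe-bill-r-44444",
    ])
    print(scan(people, "smith-b", "smith-c"))

    # Reverse timestamps: the newest post gets the smallest key.
    posts = {reverse_timestamp_key(ts): ts for ts in (1000, 2000, 3000)}
    print([posts[key] for key in sorted(posts)])
```

Because the stop row is exclusive, "smith-c" cuts the scan off just before the first Smith whose first name starts with "c", which is the same reasoning the smith-b/smith-c example relies on.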
So far the examples have involved only single tables containing one type of information and no related information. HBase does not have foreign key relationships like in relational databases, but because it supports rows having up to millions of columns, one way to design tables in HBase is to encapsulate related information in the same row - a "wide" table design. It is called a "wide" design since you are storing all information related to a row together in as many columns as there are data items. In our blog example, you might want to store comments for each blog. The "wide" way to design this would be to include a column family named comments and then add columns to the comment family where the qualifiers are the comment timestamp; the comment columns would look like comments:20130704142510 and comments:20130707163045. Even better, when HBase retrieves columns it returns them in sorted order, just like row keys. So in order to display a blog entry and its comments, you can retrieve all the data from one row by asking for the content, info, and comments column families. You could also add a filter to retrieve only a specific number of comments, adding pagination to them. The people table column families could also be redesigned to store contact information such as separate addresses, phone numbers, and email addresses in column families allowing all of a person's information to be stored in one row. This kind of design can work well if the number of columns is relatively modest, as blog comments and a person's contact information would be. If instead you are modeling something like an email inbox, financial transactions, or massive amounts of automatically collected sensor data, you might choose instead to spread a user's emails, transactions, or sensor readings across multiple rows (a "tall" design) and design the row keys to allow efficient scanning and pagination. 
For an inbox the row key might look like - which would permit easily scanning and paginating a user's inbox, while for financial transactions the row key might be -. This kind of design can be called "tall" since you are spreading information about the same thing (e.g. readings from the same sensor, transactions in an account) across multiple rows, and is something to consider if there will be an ever-expanding amount of information, as would be the case in a scenario involving data collection from a huge network of sensors. Designing row keys and table structures is a key part of working with HBase, and will continue to be, given the fundamental architecture of HBase. There are other things you can do to add alternative schemes for data access within HBase. For example, you could implement full-text searching via Apache Lucene either within rows or external to HBase (search Google for HBASE-3529). You can also create (and maintain) secondary indexes to permit alternate row key schemes for tables; for example, in our people table the composite row key consists of the name and a unique identifier. But if we want to access people by their birth date, telephone area code, email address, or any number of other ways, we could add secondary indexes to enable that form of interaction. Note, however, that adding secondary indexes is not something to be taken lightly; every time you write to the "main" table (e.g. people) you will also need to update all the secondary indexes! (Yes, this is something that relational databases do very well, but remember that HBase is designed to accommodate a lot more data than traditional RDBMSs were.)
Conclusion to Part 5
In this part of the series, we got an introduction to schema design in HBase (without relations or SQL). 
Even though HBase is missing some of the features found in traditional RDBMS systems such as foreign keys and referential integrity, multi-row transactions, multiple indexes, and so on, many applications that need inherent HBase benefits like scaling can benefit from using HBase. As with anything complex, there are tradeoffs to be made. In the case of HBase, you are giving up some richness in schema design and query flexibility, but you gain the ability to scale to massive amounts of data by (more or less) simply adding additional servers to your cluster. In the next and last part of this series, we'll wrap up and mention a few (of the many) things we didn't cover in these introductory blogs. References HBase web site, http://hbase.apache.org/ HBase wiki, http://wiki.apache.org/hadoop/Hbase HBase Reference Guide http://hbase.apache.org/book/book.html HBase: The Definitive Guide, http://bit.ly/hbase-definitive-guide Google Bigtable Paper, http://labs.google.com/papers/bigtable.html Hadoop web site, http://hadoop.apache.org/ Hadoop: The Definitive Guide, http://bit.ly/hadoop-definitive-guide Fallacies of Distributed Computing, http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing HBase lightning talk slides, http://www.slideshare.net/scottleber/hbase-lightningtalk Sample code, https://github.com/sleberknight/basic-hbase-examples
Updated October 11, 2022
by Scott Leberknight
· 19,720 Views · 3 Likes
article thumbnail
Google Cloud Messaging with Payload
Google Cloud Messaging (or GCM) sends two types of messages: collapsible, “send-to-sync” messages, where new messages replace older ones in the sending queue (i.e. the older messages are “collapsed”), and non-collapsible messages with payload, where every single message is delivered. Each payload in a non-collapsible message is unique content that has to be delivered and can’t simply be replaced with a more recent message in the server sending queue. On the other hand, a collapsible message can be a simple ping from the server to ask its mobile clients to sync their data.
When to use messages with payload? Instant messaging (IM) applications come to mind. Another use case is when we need to include data in our push notifications to save our clients a round-trip to the server. An example would be sending daily/monthly online game rankings of top players. Instead of just notifying the Android clients to go to the server to get the information (send-to-sync), the data is included in the multicast messages themselves so that they can be directly consumed by the clients. We can easily see why collapsible messaging wouldn’t make much sense here, since we want our users to receive every single ranking we send them at the end of every week/month.
Let’s now jump into coding. As suggested in the two previous articles, these code examples are modifications of the GCM demo application available for download (client + server) and use the GCM helper library for Java.
Server code
Creating a message with payload in server code is similar to the process for the collapsible type, except that we simply omit the collapse_key parameter.
    // in imports
    import com.google.android.gcm.server.Message;

    // inside send method, construct a message with payload
    Message message = new Message.Builder()
        .delayWhileIdle(true) // wait for device to become active before sending
        .addData("rankings", "top 5 high game scorers")
        .addData("1", "1. yankeedoodle (15 trophies)")
        .addData("2", "2. billy_the_kid (13 trophies)")
        .addData("3", "3. viper (10 trophies)")
        .addData("4", "4. silversurfer (9 trophies)")
        .addData("5", "5. gypsy (8 trophies)")
        .build();
Client code
The client just retrieves the data by its keys:
    // inside GCMIntentService
    @Override
    protected void onMessage(Context context, Intent intent) {
        // message with payload
        String message = intent.getStringExtra("rankings") + "\n"
            + intent.getStringExtra("1") + "\n"
            + intent.getStringExtra("2") + "\n"
            + intent.getStringExtra("3") + "\n"
            + intent.getStringExtra("4") + "\n"
            + intent.getStringExtra("5");
        displayMessage(context, message);
        // notifies user
        generateNotification(context, message);
    }
This is what it looks like on an Android device:
These messages can carry a payload of at most 4 KB of data, so they are not suited to applications that need to send more than that, although it is theoretically possible to get around that limit with multiple messages and some assembly work on the receiving Android client application. Another aspect to consider is performance and the impact on the handset’s battery, since messages with payload are not as lightweight as their collapsible cousins. Regardless of the message type, all payload data must be string values. So if we need to send some other type, it is up to our application to properly encode/decode the content.
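Since all payload values must be strings, one way to send other types is to agree on a simple text encoding between server and client. The sketch below is only an illustration of that idea: `PayloadCodec` and its methods are hypothetical helpers, not part of the GCM library, and the comma-separated format is an assumption chosen for brevity.

```java
import java.util.ArrayList;
import java.util.List;

final class PayloadCodec {

    // Server side: turn a list of integer scores into one string value,
    // suitable for passing as the value of a payload key.
    static String encodeScores(List<Integer> scores) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < scores.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(scores.get(i));
        }
        return sb.toString();
    }

    // Client side: parse the string extra back into integers.
    static List<Integer> decodeScores(String encoded) {
        List<Integer> scores = new ArrayList<>();
        if (encoded.isEmpty()) {
            return scores;
        }
        for (String part : encoded.split(",")) {
            scores.add(Integer.parseInt(part));
        }
        return scores;
    }
}
```

The server would pass the encoded string as a payload value and the client would decode the corresponding string extra. Anything richer than flat values would probably call for a structured format such as JSON.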
Updated October 11, 2022
by Tony Siciliani
· 22,890 Views · 1 Like
article thumbnail
Geek Reading June 7, 2013
I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news and programming tools and techniques. Dew Drop – June 7, 2013 (#1,563) (Alvin Ashcraft's Morning Dew) On friction in software (Ayende @ Rahien) Caching, jQuery Ajax and Other IE Fun (HTML5 Zone) IndexedDB and Date Example (HTML5 Zone) DevOps Scares Me – Part 1 (Architects Zone – Architectural Design Patterns & Best Practices) Visualizing the News with Vivagraph.js (Architects Zone – Architectural Design Patterns & Best Practices) My First Clojure Workflow (Javalobby – The heart of the Java developer community) Helping an ISV Look at Their Cloud Options (Architects Zone – Architectural Design Patterns & Best Practices) Ignore Requirements to Gain Flexibility, Value, Insights! The Power of Why (Javalobby – The heart of the Java developer community) Estimating the Unknown: Dates or Budgets, Part 1 (Agile Zone – Software Methodologies for Development Managers) Team Decision Making Techniques – Fist to Five and others (Agile Zone – Software Methodologies for Development Managers) The Daily Six Pack: June 7, 2013 (Dirk Strauss) Pastime (xkcd.com) The Affect Heuristic (Mark Needham) Every great company has been built the same way: bit by bit (Hacker News) Under the Hood: The entities graph (Facebook Engineering's Facebook Notes) Entrepreneurship With a Family is for Crazy People (Stay N Alive) Thinking Together for Release Planning (Javalobby – The heart of the Java developer community) I hope you enjoy today’s items, and please participate in the discussions on those sites.
Updated October 11, 2022
by Robert Diana
· 6,565 Views · 1 Like
article thumbnail
Geek Reading - Cloud, SQL, NoSQL, HTML5
I have talked about human filters and my plan for digital curation. These items are the fruits of those ideas, the items I deemed worthy from my Google Reader feeds. These items are a combination of tech business news, development news and programming tools and techniques. Real-Time Ad Impression Bids Using DynamoDB (Amazon Web Services Blog) The mother of all M&A rumors: AT&T, Verizon to jointly buy Vodafone (GigaOM) Is this the future of memory? A Hybrid Memory Cube spec makes its debut. (GigaOM) Dew Drop – April 2, 2013 (#1,518) (Alvin Ashcraft's Morning Dew) Rosetta Stone acquires Livemocha for $8.5m to move its language learning platform into the cloud (The Next Web) Double Shot #1098 (A Fresh Cup) Extending git (Atlassian Blogs) A Thorough Introduction To Backbone.Marionette (Part 2) (Smashing Magazine Feed) 60 Problem Solving Strategies (Javalobby – The heart of the Java developer community) Why asm.js is a big deal for game developers (HTML5 Zone) Implementing DAL in Play 2.x (Scala), Slick, ScalaTest (Javalobby – The heart of the Java developer community) “It’s Open Source, So the Source is, You Know, Open.” (Javalobby – The heart of the Java developer community) How to Design a Good, Regular API (Javalobby – The heart of the Java developer community) Scalding: Finding K Nearest Neighbors for Fun and Profit (Javalobby – The heart of the Java developer community) The Daily Six Pack: April 2, 2013 (Dirk Strauss) Usually When Developers Are Mean, It Is About Power (Agile Zone – Software Methodologies for Development Managers) Do Predictive Modelers Need to Know Math? 
(Data Mining and Predictive Analytics) Heroku Forces Customer Upgrade To Fix Critical PostgreSQL Security Hole (TechCrunch) DYNAMO (Lambda the Ultimate – Programming Languages Weblog) FitNesse your ScalaTest with custom Scala DSL (Java Code Geeks) LinkBench: A database benchmark for the social graph (Facebook Engineering's Facebook Notes) Khan Academy Checkbook Scaling to 6 Million Users a Month on GAE (High Scalability) Famo.us, The Framework For Fast And Beautiful HTML5 Apps, Will Be Free Thanks To “Huge Hardware Vendor Interest” (TechCrunch) Why We Need Lambda Expressions in Java – Part 2 (Javalobby – The heart of the Java developer community) I hope you enjoy today’s items, and please participate in the discussions on those sites.
Updated October 11, 2022
by Robert Diana
· 8,158 Views · 1 Like
article thumbnail
Easily Find & Kill MongoDB Operations from MongoLab’s UI
A few months ago, we wrote a blog post on finding and terminating long-running operations in MongoDB. To help make it even easier for MongoLab users* to quickly identify the cause behind database unresponsiveness, we’ve integrated the currentOp() and killOp() methods into our management portal. * currentOp and killOp functionality is not available on our free Sandbox databases because they run on multi-tenanted mongod processes. Quick intro to db.currentOp() If you’re unfamiliar with MongoDB’s currentOp() method, it reports in-progress operations on your mongod process. In other words, it will return information on all active operations running on your instance. This allows you to quickly identify long-running and/or blocking operations and focus your attention on problematic areas. Current database operations – now in MongoLab’s UI To access this functionality in MongoLab’s management portal, navigate to the deployment with the current operations you want to view and click on the “Tools” tab. Here you’ll see a button that will launch a new window with all of your current operations. Once the window is loaded, you’ll see a list of your deployment’s in-progress operations. In this example, you see a long-running query and a replication operation. If you’d like to kill an operation, simply click on the blue X button. For your safety, we’ve disabled the “kill” button for some types of operations. As such, you’ll notice there are operations that do not have the blue X button next to them. Note that from this page, you can also choose to: View current operations from any secondary nodes Automatically reload the page to see current operations in real-time View a more verbose output including idle and system operations With great power… With great power comes great responsibility. This new feature is very powerful in helping users find and kill operations. However, if you’re ever in doubt about whether it’s safe to terminate an operation, reach out anytime. 
We’re here to help!
Updated October 11, 2022
by Chris Chang
· 10,365 Views · 1 Like
article thumbnail
The Difference Between TokuMX Partitioning and Sharding
In my last post, I described a new feature in TokuMX 1.5, partitioned collections, that’s aimed at making it easier and faster to work with time series data. Feedback from that post made me realize that some users may not immediately understand the differences between partitioning a collection and sharding a collection. In this post, I hope to clear that up. On the surface, partitioning a collection and sharding a collection seem similar. Both actions take a collection and break it into smaller pieces for some performance benefit. Also, the terms are sometimes used interchangeably when discussing other technologies. But for TokuMX, the two features are very different in purpose and implementation. In describing each feature’s purpose and implementation, I hope to clarify the differences between the two features. Let’s address sharding first. The purpose of sharding is to distribute a collection across several machines (i.e. “scale-out”) so that writes and queries on the collection will be distributed. The main idea is that for big data, a single machine can only do so much. No matter how powerful your one machine is, that machine will still be limited by some resource, be it IOPS, CPU, or disk space. So, to get better performance for a collection, one can use sharding to distribute the collection across several machines, and thereby improve performance by increasing the amount of hardware. To perform these tasks, a sharded collection ought to have a relatively even distribution across shards. Therefore, it should have the following properties: (1) Users’ writes ought to be distributed amongst machines (or shards). After all, if all writes are targeted at a single shard, then they are not distributed and we are not scaling. (2) To keep data distribution relatively even, background processes migrate data between shards if a shard is found to have too much or too little data. Because of these properties, each shard contains a random subset of the collection’s data. 
Now let’s address partitioning. The purpose of partitioning is to break the collection into smaller collections so that large chunks of data may be removed very efficiently. A typical example is keeping a rolling period of 6 months of log data for a website. Another example is keeping the last 14 days of oplog data, as we do via partitioning in TokuMX 1.4. In such examples, typically only one partition (the latest one) is getting new data. Periodically, but infrequently, we drop the oldest partition to reclaim space. For the log data example, once a month we may drop a month’s worth of data. For the oplog, once a day we drop a day’s worth of data. To perform these tasks, we are not concerned with load distribution, as nearly all writes are typically going to the last partition. We are not spreading partitions across machines. With partitioning, each partition holds a continuous range of the data (e.g. all data from the month of February), whereas with sharding, each shard holds small random chunks of data from across the key space. With all this being said, there are still similarities when thinking of schema design with a partitioned collection and a sharded collection. As I touched on in my last post, designing a partition key has similarities to designing a shard key as far as queries are concerned. Queries on a sharded collection perform better if they target single shards. Similarly, queries on a partitioned collection perform better if they target a single partition. Queries that don’t can be thought of as “scatter/gather” for both sharded and partitioned collections. Hopefully this illuminates the difference between a partitioned collection and a sharded collection.
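The contrast between the two features can be sketched in a few lines. This is a conceptual illustration only, not TokuMX code: `PartitionVsShard` and its methods are hypothetical, monthly partitions are an assumption taken from the log data example, and hash-based placement is used as a simple stand-in for the way sharding spreads data across machines.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionVsShard {

    // Partitioning: each partition holds a contiguous range of the data,
    // here one calendar month (assumed partition granularity).
    static YearMonth partitionFor(LocalDate date) {
        return YearMonth.from(date);
    }

    // Sharding: keys are spread across machines so that writes are distributed.
    // Hashing is a simplification used here to show the even spread.
    static int shardFor(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    // Dropping the oldest partition reclaims a whole range of data in one step.
    // Returns the dropped partition, or null if there were none.
    static YearMonth dropOldest(TreeMap<YearMonth, List<String>> partitions) {
        Map.Entry<YearMonth, List<String>> oldest = partitions.pollFirstEntry();
        return oldest == null ? null : oldest.getKey();
    }
}
```

In this sketch all writes for the current month land in the same partition, while `shardFor` scatters keys across machines. That matches the point above: partitioning is about cheap removal of old ranges, sharding is about load distribution.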
October 11, 2022
by Zardosht Kasheff
· 6,326 Views · 1 Like
article thumbnail
Designing Search (part 3): Keeping on track
Part 1 In the previous post we looked at techniques to help us create and articulate more effective queries. From auto-complete for lookup tasks to auto-suggest for exploratory search, these simple techniques can often make the difference between success and failure. But occasionally things do go wrong. Sometimes our information journey is more complex than we’d anticipated, and we find ourselves straying off the ideal course. Worse still, in our determination to pursue our original goal, we may overlook other, more productive directions, leaving us endlessly finessing a flawed strategy. Sometimes we are in too deep to turn around and start again. Conversely, there are times when we may consciously decide to take a detour and explore the path less trodden. As we saw earlier, what we find along the way can change what we seek. Sometimes we find the most valuable discoveries in the most unlikely places. However, there’s a fine line between these two outcomes: one person’s journey of serendipitous discovery can be another’s descent into confusion and disorientation. And there’s the challenge: how can we support the former, while unobtrusively repairing the latter? In this post, we’ll look at four techniques that help us keep to the right path on our information journey. 1. Did you mean As we saw in the previous post, auto-complete and auto-suggest are two of the most effective ways to prevent spelling mistakes and typographic errors (i.e. instances where we know how to spell something, but enter it incorrectly). By completing partial queries and suggesting meaningful alternatives, they avoid the problem at source. But, inevitably, some mistakes will slip through. Fortunately, there are a variety of coping strategies. One of the simplest is to use spell checking algorithms to compare queries against common spellings of each word. The figure below shows the results on Google for the query “expolsion”. 
This isn’t necessarily a ‘failed’ search as such (as it does return results), but the more common spelling “explosion” would return a more productive result set. Of course, without knowing our intent, Google can never know for sure whether this spelling was intentional, so it offers the alternative as a “Did you mean” suggestion at the top of the search results page. Interestingly, Google repeats the suggestion at the bottom of the page, but with a slightly longer wording: “Did you mean to search for”. This is a subtle clarification, but one that may reflect the user’s shift in attention at this point (from query to results).
Potential spelling mistakes are addressed by a “Did you mean” suggestion at Google
Likewise, most major online retailers apply a similar strategy for dealing with potential spelling mistakes and typographic errors. Amazon and eBay both conservatively apply Did You Mean to queries such as “guitr”, faithfully passing on the results for this query but offering the alternative as a highlighted suggestion immediately above the search results. And in Amazon’s case, the results for the corrected spelling are appended immediately below those of the original query:
Did You Mean at Amazon
Did You Mean at eBay
2. Auto-correct
Search engines may be capable of many things, but one thing they cannot do is read minds: they can never know the user’s intent. For that reason, when faced with queries like those above, it is wise to keep some distance. Offer a gentle nudge, but leave the choice with the user. However, there are times when it seems much more apparent that a spelling mistake has occurred. In these cases, we may not know for sure what the user’s intent was, but we can be fairly certain what it wasn’t. In these instances, auto-correction may be the most appropriate response. For example, consider a query for “exploson” on Google: this time, instead of applying a Did You Mean, it is auto-corrected to “explosion”. 
As before, a message appears above the results (“Showing results for”), but this time, the choice has been made for us:
Auto-correct at Google
It seems that this time Google is more confident that our query was unintended. Without knowing our intent, how can it determine this? (In case you’re wondering, it’s not simply by looking for relatively low numbers of results: “expolsion” returns ~135,000 results, and “exploson” returns ~222,000, yet the latter is auto-corrected while the former is not.) The answer lies in what Google researchers refer to as the “Unreasonable Effectiveness of Data”: in this instance, the collective behaviour of millions of users. By mining user data for patterns of query reformulation, Google can determine that “exploson” is more likely to be corrected than “expolsion”. Knowing this, it applies the correction for us. In fact, Google applies the same insight to the auto-suggest function we saw in our previous post: in addition to completions based on the prefix, it also returns potential spelling corrections. This is particularly important in a mobile context, when accurate typing on small, handheld keyboards is so much more difficult:
Query suggestions include spelling corrections on Google
These strategies make a significant difference to the experience of searching the web. However, for site search, such vast quantities of user data may not be so readily available. In this case, perhaps a simple numeric test could suffice: for zero results, apply an auto-correction; for greater than zero but less than some threshold (say 20 results), offer a Did You Mean.
3. Partial matches
The techniques of auto-correct and Did You Mean are ideal for detecting and repairing simple errors such as spelling mistakes in short queries. But the reality of keyword search is that many users over-constrain their search by entering too many keywords, rather than too few. 
This is particularly apparent when confronted with a zero results page: for many users, the natural reaction is to add further keywords to their query, thus compounding the problem. In these cases, it no longer makes sense to replace the entire query in the manner of an auto-correct or Did You Mean, particularly if certain sections of it might have actually returned productive results on their own. Instead, we need a more sophisticated strategy that considers the individual keywords and can determine which particular permutations are likely to produce useful results. Amazon provides a particularly effective implementation of this strategy. For example, a keyword search for “fender strat maple 1976 USA” finds no matching results. However, rather than returning a zero results page, Amazon returns a number of partial matches based on various keyword permutations. Moreover, by communicating the non-matching elements of the query (using strikethrough text), it gently guides us along the path to more informed query reformulation: Partial matches at Amazon Although conceptually simple, solving the partial match problem is non-trivial: a query with N keywords actually has N factorial permutations, of which only a fraction will return useful results. So for just the single query above, there are in principle 120 variations to consider. In addition, out of all those variations, there is only space to present results for a handful, so they need to be chosen to reflect the diversity of the matching products while avoiding duplicate results. A similar strategy can be seen at eBay, which also finds no results for the same query we tried on Amazon. Instead of a zero results page, we see a list of the partial matches with an invitation to select one of them (or to “try the search again with fewer keywords”). These are ordered using what’s known as quorum-level ranking (Salton, 1989), which sorts results according to the number of matching keywords. 
In other words, products matching four keywords (such as “fender strat maple USA”) are ranked above those containing three or fewer (such as “fender strat USA”).
Partial matches using quorum-level ranking at eBay
Partial matches are a very effective way to facilitate the process of query reformulation, providing us with a clear direction to take along our information journey. Together with auto-correct and Did You Mean, they act as signposts that help us decide which of the many paths to take. But sometimes we may see something that motivates us to take a deliberate detour. Like the auto-suggest function we saw in the previous post, related searches provide us with the inspiration to embrace new ideas that we might not otherwise have considered.
4. Related searches
All the major web search engines offer support for related searches. Bing, for example, shows them in a panel to the left of the main results:
Related searches at Bing
Google, by contrast, shows them on demand (via a link in the sidebar) as a panel above the main search results. Both designs differentiate between extensions to the query and reformulations: any keywords that are not part of the original query are rendered in bold:
Related searches at Google
Apart from providing inspiration, related searches can be used to help clarify an ambiguous query. For example, a query on Bing for “apple” returns results associated mainly with the computer manufacturer, but the related searches clearly indicate a number of other interpretations:
Query disambiguation via related searches at Bing
Related searches can also be used to articulate associated concepts in a taxonomy. At eBay, for example, a query for “acoustic guitar” returns a number of related searches at varying levels of specificity. These include subordinate (child) concepts, such as “yamaha acoustic guitar” and “fender acoustic guitar”, along with sibling concepts such as “electric guitar”, and superordinate (parent) concepts such as “guitar”. 
These taxonomic signposts offer a subtle form of guidance, helping us better understand the conceptual space to which our query belongs.
Taxonomic signposting via related searches at eBay
While related searches offer us a way to open our minds to new directions, they are not the only source of inspiration. Sometimes it is the results themselves that provide the stimulus. When we find a particularly good match for our information need, we try to find more of the same: a process that Peter Morville refers to as “pearl growing” (Morville, 2010). Sometimes the action to find more of the same is one we can directly initiate: Google’s image search, for example, offers us the opportunity to find images similar to a particular result:
Find similar images at Google
For image search, the results certainly appear impressive, with a single click returning a remarkably homogenous set of results. But that is perhaps also its biggest shortcoming: because the details of the similarity calculation are hidden, the user has no control over what it returns, and cannot see why certain items are deemed similar when others are not. For this type of information need, a faceted approach may be preferable, in which the user has control over exactly which dimensions are considered as part of the similarity calculation. While Google shows how we can actively seek similar results, sometimes we may prefer to have related content pushed to us. Recommender systems like Last.fm and Netflix rely heavily on attributes, ratings and collaborative filtering data to suggest content we’re likely to enjoy. And from just a single item in our music collection, iTunes Genius can recommend many more for us to listen to as part of a playlist:
Genius playlist creates “more like this” from a single item
Summary
Query reformulation is a key component of information seeking behaviour, and one where we benefit most from automated support. Did You Mean and auto-correct apply spell checking strategies to keep us on track. 
Partial matching strategies provide signposts toward more productive keyword combinations. And related searches can inspire us to consider new directions and grow our own pearls. Together, these four techniques keep us on track along our information journey. References Salton, G. Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, Reading, MA, 1989. Alon Halevy, Peter Norvig, Fernando Pereira, “The Unreasonable Effectiveness of Data,” IEEE Intelligent Systems, vol. 24, no. 2, pp. 8-12, Mar./Apr. 2009 Peter Morville, Search Patterns, O’Reilly Media, 2009.
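Two of the techniques discussed in this article lend themselves to a short sketch: the simple numeric test for site search (auto-correct on zero results, Did You Mean below a threshold) and quorum-level ranking of partial matches. The code below is a minimal illustration under stated assumptions: `SearchRepair` and its members are hypothetical names, the threshold is passed in by the caller, and products are modelled as plain sets of keywords rather than anything a real search engine would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class SearchRepair {

    enum Action { AUTO_CORRECT, DID_YOU_MEAN, NONE }

    // Numeric test: zero results triggers auto-correction, a small
    // non-zero count triggers a Did You Mean suggestion.
    static Action repairAction(int resultCount, int threshold) {
        if (resultCount == 0) {
            return Action.AUTO_CORRECT;
        }
        if (resultCount < threshold) {
            return Action.DID_YOU_MEAN;
        }
        return Action.NONE;
    }

    // Number of query keywords that appear in a product's keyword set.
    static int matchCount(List<String> queryTerms, Set<String> productTerms) {
        int count = 0;
        for (String term : queryTerms) {
            if (productTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Quorum-level ranking: products matching more query keywords rank higher.
    // Products with no matching keyword are omitted; ties are broken by id.
    static List<String> quorumRank(List<String> queryTerms, Map<String, Set<String>> products) {
        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : products.entrySet()) {
            if (matchCount(queryTerms, entry.getValue()) > 0) {
                ranked.add(entry.getKey());
            }
        }
        ranked.sort((a, b) -> {
            int byMatches = Integer.compare(
                matchCount(queryTerms, products.get(b)),
                matchCount(queryTerms, products.get(a)));
            return byMatches != 0 ? byMatches : a.compareTo(b);
        });
        return ranked;
    }
}
```

For the query “fender strat maple 1976 USA”, a product matching four of the keywords would be ranked above one matching three, as in the eBay example. Choosing which keyword combinations to show, and avoiding duplicates, is the harder part and is not attempted here.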
Updated October 11, 2022
by Tony Russell-rose
· 10,675 Views · 1 Like
article thumbnail
Android Cloud Apps with Azure
A recent study by Gartner predicts a very significant increase in cloud usage by consumers in a few years, due in great part to the ever-growing use of smartphone cameras by the average household. In this context, it could be useful to have a smartphone application that is able to upload / download digital content from a cloud provider. In this article, we will construct a basic Android prototype that will allow us to plug in the Windows Azure cloud provider, and use the Windows Azure Toolkit for Android (available at GitHub) to do all of the basic cloud operations: upload content to cloud storage, browse the storage, download or delete files in cloud storage. Once those operations are implemented, we will see how to enable our Android application to receive server push notifications. First things first, we need to set up a storage account in the Azure cloud: A storage account comes with several options for data management: we can keep data in blob, table or queue storage. In this article, we will use the blob storage to work with images. The storage account has a primary and a secondary access key; either one can be used to execute operations on the storage account, and either can be regenerated if compromised.
1. Preliminaries
First, the prerequisites: Eclipse IDE for Java; Android plugin for Eclipse (ADT); Windows Azure Toolkit for Android; Windows Azure subscription (you can get a 90-day free trial). A getting-started document on the Windows Azure Toolkit’s GitHub page covers the installation procedure of all the required software in detail. This whole project (cloid) is freely available at GitHub. So here we’ll limit ourselves to presenting the most relevant code sections along with the corresponding screens. 
The user interface is composed of a few basic activity screens, spawned from the main screen (top center).

Since we use a technology not for its own sake but according to our needs, let’s start by specifying what we want:

    public abstract class Storage {

        /** All providers will have access to the context. */
        protected Context context;

        /** All providers will have access to SharedPreferences. */
        protected CloudPreferences prefs;

        /** All downloads from providers will be saved on the SD card. */
        protected String download_path = "/sdcard/DCIM/Camera/";

        /**
         * @throws OperationException if the provider cannot be initialized
         */
        public Storage(Context ctx) throws OperationException {
            context = ctx;
            prefs = new CloudPreferences(ctx);
        }

        /**
         * @throws OperationException if the upload fails
         */
        public abstract void uploadToStorage(String file_path) throws OperationException;

        /**
         * @throws OperationException if the download fails
         */
        public abstract void downloadFromStorage(String file_name) throws OperationException;

        /**
         * @throws OperationException if the listing fails
         */
        public abstract void browseStorage() throws OperationException;

        /**
         * @throws OperationException if the deletion fails
         */
        public abstract void deleteInStorage(String file_name) throws OperationException;
    }

The above is the contract that our cloud storage provider will satisfy. We’ll provide a MockStorage implementation that pretends to carry out a command so that we can test our UI (i.e., our scrollable items list, progress bar, exception messages, etc.) and later just plug in Azure storage operations. Note from our activity screens above that we can switch at any time between Azure storage and mock storage by pressing the “Cloud On/Off” toggle button in the settings screen and saving the preferences afterward.

    public class MockStorage extends Storage {

        // code here...

        @Override
        public void uploadToStorage(String file_path) throws OperationException {
            doNothingButSleep();
            // Uncomment to test how the UI reports errors:
            // throw new OperationException("Test error message",
            //         new Throwable("Reason: upload test"));
        }

        // other methods will also do nothing but sleep...

        /** Simulates a slow cloud operation by blocking for five seconds. */
        private void doNothingButSleep() {
            try {
                Thread.sleep(5000L);
            } catch (InterruptedException iex) {
                return;
            }
        }
    }

2. The Azure Toolkit

The toolkit comes with a sample application called “Simple”, and two library JARs:

  • access control for android.jar in the wa-toolkit-android\library\accesscontrol\bin folder
  • azure storage for android.jar in the wa-toolkit-android\library\storage\bin folder

Here we will only use the latter, since we will access Azure’s blob storage directly. Needless to say, this is not the recommended way, since our credentials will be stored on the handset. A better approach security-wise would be to access Azure storage through web services hosted on either Azure or other public/private clouds.

Once the toolkit is ready for use, we need to think a bit about settings. Using Azure blob storage requires only three fields: an account name, an access key, and a container for our images. The access key is quite a long string (88 characters) and is a pain to type, so one way to do the setup is to configure the Android res/values/strings.xml file to set the default values:

    ...
    cloid
    insert-access-key-here
    pictures
    ...

However, because we may want to overwrite the default values above (e.g., to create another container), we will also save the values on the settings screen in Android’s SharedPreferences. And now, let’s implement the AzureStorage class.

3. Azure Blob Storage Operations

3.1. Storage Initialization

The AzureStorage constructor gets its data from Android preferences (through its superclass), then constructs a connection string used to access the storage account, creates a blob client, and retrieves a reference to the container of images. If the user changed the default container “pictures” in settings, then a new (empty) one will be created with that new name. A container is any grouping of blobs under a name. No blob exists outside of a container.
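The three settings and the connection string described above can be sanity-checked before any network call is made. The following is a minimal sketch in plain Java and is not part of the toolkit: the class `AzureSettings` and its method names are hypothetical. It assumes the access key is the 88-character Base64 string mentioned earlier, that the connection string has the `DefaultEndpointsProtocol`/`AccountName`/`AccountKey` form, and that container names follow the Azure Blob storage naming rules (3 to 63 characters; lowercase letters, digits, and single hyphens; starting and ending with a letter or digit). It also uses `java.util.Base64`, which requires Java 8, so on older Android versions a different Base64 helper would be needed.

```java
import java.util.Base64;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the Windows Azure Toolkit for Android. */
final class AzureSettings {

    // 3-63 characters: lowercase letters, digits, and single hyphens,
    // starting and ending with a letter or digit.
    private static final Pattern CONTAINER =
            Pattern.compile("[a-z0-9](?:[a-z0-9]|-(?=[a-z0-9])){2,62}");

    private AzureSettings() { }

    /** Returns true if the name satisfies the container naming rules above. */
    static boolean isValidContainer(String name) {
        return name != null && CONTAINER.matcher(name).matches();
    }

    /** Assumption: a usable key is exactly 88 characters of valid Base64. */
    static boolean isPlausibleAccessKey(String key) {
        if (key == null || key.length() != 88) {
            return false;
        }
        try {
            Base64.getDecoder().decode(key);
            return true;
        } catch (IllegalArgumentException notBase64) {
            return false;
        }
    }

    /** Builds the connection string, rejecting obviously broken settings. */
    static String connectionString(String account, String key) {
        if (account == null || account.isEmpty()) {
            throw new IllegalArgumentException("Account name is missing");
        }
        if (!isPlausibleAccessKey(key)) {
            throw new IllegalArgumentException("Access key does not look valid");
        }
        return "DefaultEndpointsProtocol=http;"
                + "AccountName=" + account
                + ";AccountKey=" + key;
    }
}
```

A settings screen could call `isValidContainer` and `isPlausibleAccessKey` when the user saves preferences, so that a mistyped key or the placeholder default is reported immediately rather than surfacing later as a failed cloud operation.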
    // package here
    // other imports
    import com.windowsazure.samples.android.storageclient.BlobProperties;
    import com.windowsazure.samples.android.storageclient.CloudBlob;
    import com.windowsazure.samples.android.storageclient.CloudBlobClient;
    import com.windowsazure.samples.android.storageclient.CloudBlobContainer;
    import com.windowsazure.samples.android.storageclient.CloudBlockBlob;
    import com.windowsazure.samples.android.storageclient.CloudStorageAccount;

    public class AzureStorage extends Storage {

        private CloudBlobContainer container;

        /**
         * @throws OperationException if the container cannot be reached or created
         */
        public AzureStorage(Context ctx) throws OperationException {
            super(ctx);

            // set from prefs
            String acct_name = prefs.getAccountName();
            String access_key = prefs.getAccessKey();

            // get connection string
            String storageConn = "DefaultEndpointsProtocol=http;"
                    + "AccountName=" + acct_name
                    + ";AccountKey=" + access_key;

            // get CloudBlobContainer
            try {
                // retrieve storage account from storageConn
                CloudStorageAccount storageAccount = CloudStorageAccount.parse(storageConn);

                // create the blob client
                // to get reference objects for containers and blobs
                CloudBlobClient blobClient = storageAccount.createCloudBlobClient();

                // retrieve a reference to the container, creating it if needed
                container = blobClient.getContainerReference(prefs.getContainer());
                container.createIfNotExist();
            } catch (Exception e) {
                throw new OperationException("Error from initBlob: " + e.getMessage(), e);
            }
        }

        // code...

We will use that CloudBlobContainer reference throughout our upcoming cloud operations.

3.2. Uploading Images

We will upload a file from Android’s gallery to the cloud, keeping the same filename. “Screener” is just a utilities class (see the GitHub repository) that does a number of useful things, e.g., extracting a file name from its path and setting the right MIME type (“image/jpeg”, “image/png”, etc.).

The two kinds of blobs are page blobs and block blobs. The (very) short story is that page blobs are optimized for random read and write operations, while block blobs let us upload large files efficiently. In particular, we can upload multiple blocks in parallel to decrease upload time. Here we are uploading a blob (gallery image) as a set of blocks.

    /**
     * @throws OperationException if the upload fails
     */
    @Override
    public void uploadToStorage(String file_path) throws OperationException {
        try {
            // create or overwrite blob with contents from a local file
            // use the same name as in local storage
            CloudBlockBlob blob = container.getBlockBlobReference(
                    Screener.getNameFromPath(file_path));
            File source = new File(file_path);
            FileInputStream in = new FileInputStream(source);
            try {
                blob.upload(in, source.length());
            } finally {
                in.close();
            }
            blob.getProperties().contentType = Screener.getImageMimeType(file_path);
            blob.uploadProperties();
        } catch (Exception e) {
            throw new OperationException("Error from uploadToStorage: " + e.getMessage(), e);
        }
    }

Bear in mind that we are not checking whether the file already exists in cloud storage, so we will overwrite any existing blob that has the same name as the one we are uploading. That is usually not desirable in production code.

Here’s the screen flow of the upload operation.

3.3. Browsing the Cloud

For browsing, we load all the blobs in our container into a list of items that we display in Android as a scrollable list of image names in a subclass of android.app.ListActivity. Once an item in the list is clicked (“touched”) by the user, we want to display some image properties such as the image size (important when deciding whether to download), its MIME type, and the date it was last modified.
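The list entries described above can be modeled by a small value class. The browse code in this article relies on an `Item` class whose source is not reproduced here, so the following is only a hypothetical sketch of what such a class might look like: the field names, the `details()` method, and the `toKilobytes` helper are all assumptions. The helper rounds up, which avoids showing "0 KB" for images smaller than 1024 bytes (plain integer division would truncate).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

/** Hypothetical sketch of one row in the browse list; not the project's actual class. */
final class Item {

    /** Shared list backing the scrollable view; cleared on every refresh. */
    static final List<Item> itemList = new ArrayList<Item>();

    final URI uri;
    final String name;
    final long sizeKb;
    final String contentType;
    final Date lastModified;

    Item(URI uri, String name, long sizeKb, String contentType, Date lastModified) {
        this.uri = uri;
        this.name = name;
        this.sizeKb = sizeKb;
        this.contentType = contentType;
        // Defensive copy, because java.util.Date is mutable.
        this.lastModified = lastModified == null ? null : new Date(lastModified.getTime());
    }

    /** Converts a byte count to kilobytes, rounding up. */
    static long toKilobytes(long bytes) {
        return (bytes + 1023) / 1024;
    }

    /** Short description for the details screen. */
    String details() {
        return name + " (" + sizeKb + " KB, " + contentType + ")";
    }

    /** The list shows only the image name. */
    @Override
    public String toString() {
        return name;
    }
}
```

Because a simple list adapter typically renders each entry with `toString()`, returning just the name keeps the scrollable list compact, while `details()` carries the properties shown after the user touches an entry.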
    /**
     * @throws OperationException if the listing fails
     */
    @Override
    public void browseStorage() throws OperationException {
        // reset URI list for refresh - no caching
        Item.itemList.clear();

        // loop over blobs within the container
        try {
            for (CloudBlob blob : container.listBlobs()) {
                blob.downloadAttributes();
                BlobProperties props = blob.getProperties();
                long kSize = props.length / 1024;
                String type = props.contentType;
                Date lastModified = props.lastModified;
                Item item = new Item(blob.getUri(), blob.getName(), kSize, type, lastModified);
                Item.itemList.add(item);
            } // end loop
        } catch (Exception e) {
            throw new OperationException("Error from browseStorage: " + e.getMessage(), e);
        }
    }

Here’s the screen flow of the browse operation. Pressing an item on the list displays its details and the operations available on the image, which we will look at next.

3.4. Downloading Images

Our download method is pretty straightforward. Note that we are downloading to the Android handset’s SD card by using download_path from the superclass.

    /**
     * @throws OperationException if the download fails
     */
    @Override
    public void downloadFromStorage(String file_name) throws OperationException {
        try {
            for (CloudBlob blob : container.listBlobs()) {
                // download the item and save it to a file with the same name as arg
                if (blob.getName().equals(file_name)) {
                    FileOutputStream out = new FileOutputStream(download_path + blob.getName());
                    try {
                        blob.download(out);
                    } finally {
                        out.close();
                    }
                    break;
                }
            }
        } catch (Exception e) {
            throw new OperationException("Error from downloadFromStorage: " + e.getMessage(), e);
        }
    }

And here is the corresponding UI flow. Instead of displaying the image right after the download, we chose to include a link to the gallery (bottom of the screen), where the freshly retrieved image appears on top of the gallery’s stack of pictures.

3.5. Deleting Images

The delete operation performed on a blob up in the cloud is also rather simple:

    /**
     * @throws OperationException if the deletion fails
     */
    @Override
    public void deleteInStorage(String file_name) throws OperationException {
        try {
            // retrieve reference to a blob named file_name
            CloudBlockBlob blob = container.getBlockBlobReference(file_name);
            // delete the blob
            blob.delete();
        } catch (Exception e) {
            throw new OperationException("Error from deleteInStorage: " + e.getMessage(), e);
        }
    }

And here is its associated series of UI screens. Note that after the operation is confirmed and the deletion completes, the browsing list of items is automatically refreshed, and we can see that the image is no longer on the list of blobs in our storage container.

3.6. Wrapping Up

The AzureStorage methods are called inside a basic worker thread, which takes care of all cloud operations:

    // called inside a thread
    try {
        // get storage instance from factory
        Storage store = StorageFactory.getStorageInstance(this,
                StorageFactory.Provider.AZURE_STORAGE);

        // for the progress bar
        incrementWorkCount();

        // do ops
        switch (operation) {
            case UPLOAD:
                store.uploadToStorage(path);
                break;
            case BROWSE:
                store.browseStorage();
                break;
            case DOWNLOAD:
                store.downloadFromStorage(path);
                // refresh gallery
                sendBroadcast(new Intent(Intent.ACTION_MEDIA_MOUNTED,
                        Uri.parse("file://" + Environment.getExternalStorageDirectory())));
                break;
            case DELETE:
                store.deleteInStorage(path);
                break;
        } // end switch
    } catch (OperationException e) {
        recordError(e);
    }

Notice how we tell the Android image gallery to refresh by issuing a broadcast once a new file is downloaded from the cloud to the SD card. There are different ways to do this, but without that call, the gallery won’t show the new image before the next system-scheduled media scan. Again, for the full code, refer to this project on GitHub.

We are done with the basic cloud operations. All we had to do was plug in our AzureStorage implementation class and get an instance of it through a factory, with minimal impact on preexisting code.

4. Push Notifications

Up to this point we have demonstrated device-initiated communication with the cloud. For cloud-initiated (push) communication, the Android platform uses Google Cloud Messaging (GCM). In a previous article, I wrote about how to integrate GCM into an existing Android application. Here we will add a second set of settings for server push. Our client code will connect to any GCM server and will set the status on our main activity (last screenshot on the right) once the information in the push preferences is correctly set.

5. Conclusions

The toolkit documentation is rather sparse (which is why the community needs more articles like this). Also, the sample application doesn’t cover much (maybe the reason why it’s called “Simple”), and it has room for improvement. However, the library itself is fully functional, and once we figure out the API, it all works quite nicely. Of course, this application is itself pretty basic and doesn’t cover lots of other features, like access control, permissions, metadata, and snapshots. But it is a start.
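To make the factory idea from section 3.6 more concrete, here is a minimal sketch of how a provider switch driven by the “Cloud On/Off” toggle could be structured. It is deliberately simplified and does not reproduce the project's `StorageFactory`: the types `CloudStore`, `MockStore`, `RemoteStore`, and `StoreFactory` are hypothetical stand-ins with no Android `Context`, and the `MOCK_STORAGE` constant and `fromToggle` method are assumptions (only `AZURE_STORAGE` appears in the article's code).

```java
/** Simplified stand-in for the article's abstract Storage contract. */
interface CloudStore {
    /** Uploads a file (or pretends to) and returns a short status label. */
    String upload(String filePath);
}

/** Pretends to upload, so the UI can be exercised without a cloud account. */
final class MockStore implements CloudStore {
    @Override
    public String upload(String filePath) {
        String name = filePath.substring(filePath.lastIndexOf('/') + 1);
        return "mock:" + name;
    }
}

/** Placeholder for a real provider; actual cloud calls are out of scope here. */
final class RemoteStore implements CloudStore {
    @Override
    public String upload(String filePath) {
        throw new UnsupportedOperationException("Real cloud calls are not part of this sketch");
    }
}

/** Hypothetical factory that hides the concrete provider from callers. */
final class StoreFactory {

    enum Provider { MOCK_STORAGE, AZURE_STORAGE }

    private StoreFactory() { }

    /** Maps the settings toggle to a provider. */
    static Provider fromToggle(boolean cloudOn) {
        return cloudOn ? Provider.AZURE_STORAGE : Provider.MOCK_STORAGE;
    }

    /** Returns a new store for the requested provider. */
    static CloudStore getInstance(Provider provider) {
        switch (provider) {
            case AZURE_STORAGE:
                return new RemoteStore();
            case MOCK_STORAGE:
                return new MockStore();
            default:
                throw new IllegalArgumentException("Unknown provider: " + provider);
        }
    }
}
```

With this shape, the worker thread only ever sees the `CloudStore` type, so adding another provider means adding one enum constant and one `case`, which is the "minimal impact on preexisting code" the article refers to.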
Updated October 11, 2022
by Tony Siciliani
· 15,954 Views · 1 Like
Building a Data Warehouse, Part 5: Application Development Options
See also:

  • Part I: When to Build Your Data Warehouse
  • Part II: Building a New Schema
  • Part III: Location of Your Data Warehouse
  • Part IV: Extraction, Transformation, and Load

In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse’s table schema. In Part III, we looked at where to put the data warehouse tables. In Part IV, we looked at how to populate those tables and keep them in sync with your OLTP system. Today, in the last part of this series, we will take a quick look at the benefits of building the data warehouse before we need it for cubes and BI by exploring our reporting and other options.

As I said in Part I, you should plan on building your data warehouse when you architect your system up front. Doing so gives you a platform for building reports, or even applications such as web sites, off the aggregated data. As I mentioned in Part II, it is much easier to build a query and a report against the rolled-up table than against the OLTP tables.

To demonstrate, I will make a quick pivot table using SQL Server 2008 R2 PowerPivot for Excel (or just PowerPivot for short!). I have shown how to use PowerPivot before on this blog; however, I was usually going against a SQL Server table, a SQL Azure table, or an OData feed. Today we will use a SQL Server table, but rather than build a PowerPivot against the OLTP data of Northwind, we will use our new rolled-up fact table.

To get started, I will open up PowerPivot and import data from the data warehouse I created in Part II. I will pull in the time, employee, and product dimension tables as well as the fact table. Once the data is loaded into PowerPivot, I am going to launch a new PivotTable. PowerPivot understands the relationships between the dimension and fact tables and places the tables in the designer shown below.
I am going to drag some fields into the boxes on the PowerPivot designer to build a powerful and interactive pivot table. For rows, I will choose the category and product hierarchy and sum on the total sales. I’ll make the columns (the field to pivot on) the month from the time dimension to get a sum of sales by category/product by month. I will also drag year and quarter into my vertical and horizontal slicers for interactive filtering. Lastly, I will place the employee field in the report filter pane, giving the user the ability to filter by employee.

The results look like this: I am dynamically filtering by 1997, third quarter, and employee name Janet Leverling.

This is a pretty powerful interactive report built in PowerPivot using the four data warehouse tables. If there were no data warehouse, this pivot table would have been very hard for an end user to build. Either they or a developer would have to perform joins to get the category and product hierarchy, as well as more joins to get the order details and the sum of the sales. In addition, the breakout and dynamic filtering by year and quarter, and the display by month, are only possible because of the DimTime table, so if there were no data warehouse tables, the user would have had to parse out those date parts. Just about the only thing the end user could have done without assistance from a developer or a sophisticated query is the employee filter (and even that would have taken some PowerPivot magic to display the employee name, unless the user did a join).

Of course, pivot tables are not the only thing you can create from the data warehouse tables: you can create reports, ad hoc query builders, web pages, and even an Amazon-style browse application. (Amazon uses its data warehouse to display inventory and OLTP to take your order.)

I hope you have enjoyed this series. Enjoy your data warehousing.
Updated October 11, 2022
by John Cook
· 14,174 Views · 1 Like
Building a Data Warehouse, Part 3: Location of Your Data Warehouse
In Part I we looked at the advantages of building a data warehouse independent of cubes/a BI system, and in Part II we looked at how to architect a data warehouse’s table schema. Today we are going to look at where to put your data warehouse tables.

Usually, as your system matures, the location of your data warehouse follows this pattern:

  • Segmenting your data warehouse tables into their own isolated schema inside of the OLTP database
  • Moving the data warehouse tables to their own physical database
  • Moving the data warehouse database to its own hardware

When you bring a new system online, or start a new BI effort, you can keep things simple by putting your data warehouse tables inside of your OLTP database, just segregated from the other tables. You can do this in a variety of ways; the easiest is to use a separate database schema (as opposed to the default, e.g., dbo). I usually use dwh as the schema. This way it is easy for your application to access these tables as well as fill them. The advantage is that your data warehouse and OLTP system are self-contained, and it is easy to keep them in sync.

As your data warehouse grows, you may want to isolate it further and move it to its own database. This will add a small amount of complexity to the load and synchronization; however, moving the data warehouse tables to their own database brings some benefits that make the move worth it. One benefit is that you can implement a separate security scheme. This is very helpful if your OLTP database security scheme locks down all of the tables and will not allow SELECT access, and you don’t want to create new users and roles just for the data warehouse. In addition, you can implement a separate backup and maintenance plan, so that your data warehouse tables, which tend to be larger, do not slow down your OLTP backup (and potential restore!). If you only load data at night, you can even make the data warehouse database read-only.
Lastly, while minor, you will have less table clutter, making the OLTP database easier to work with.

Once your system grows even further, you can isolate the data warehouse onto its own hardware. The benefits of this are huge: you will have less I/O contention on the database server that hosts the OLTP system. Depending on your network topology, you can reduce network traffic. You can also load up on more RAM and CPUs. In addition, you can consider different RAID array techniques for the OLTP and data warehouse servers (OLTP would be better with RAID 5, data warehouse RAID 1).

Once you move your data warehouse to its own database or its own database server, you can also start to replicate the data warehouse. For example, let’s say that you have an OLTP system that works worldwide, but you have management in offices in different parts of the world. You can reduce network traffic by having all reporting (and what else do managers do??) run on a local network against a local data warehouse. This only works if you don’t have to update the data warehouse more than a few times a day.

Where you put your data warehouse is important. I suggest that you start small and work your way up as your needs dictate.
October 11, 2022
by Stephen Forte
· 10,197 Views · 1 Like
Transit Gateway With Anypoint Platform
Here we will use the Mulesoft Anypoint platform to attach VPC to the AWS transit gateway to form a single network topology.
October 10, 2022
by Gaurav Dhimate
· 5,053 Views · 2 Likes