The Latest Data Engineering Topics

Query Autofiltering Revisited -- Let's Be More Precise!
In a previous blog post, I introduced the concept of “query autofiltering”, which is the process of using the meta information (information about information) that has been indexed by a search engine to infer what the user is attempting to find. A lot of the information used to do faceted search can also be used in this way, but by employing this knowledge up front or at “query time”, we can answer questions right away and much more precisely than we could without techniques like this. A word about “precision” here – precision means having fewer “false positives” – unintended responses that creep in to a result set because they share some words with the best answers. Search applications with well tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call “noise hits”, to come back as well. In the previous post, I explained why the search engine will often “do the wrong thing” when multiple terms are used and why this is frustrating to users – they add more information to their query to make it less ambiguous and the responses often do not reward that extra effort – in many cases, the response has more noise hits simply because the query has more words. The solution that I discussed involves adding some semantic awareness to the search process, because how words are used together in phrases is meaningful and we need ways to detect user intent from these patterns. The traditional way to do this is to use Natural Language Processing or NLP to parse the user query. This can work well if the queries are spoken or written as if the person were asking another person, as in “Where can I find restaurants in Cleveland that serve Sushi?” Of course, this scenario –which goes back to the early AI days – has become much more important now that we can talk to our cell phones. For search applications like Google with a “box and a button” paradigm, user queries are usually one word or short phrases like “Sushi Restaurants in Cleveland”. These are often what linguists call “noun phrases” consisting of a word that means a person, place or thing (what of who they want to find or where) – e.g. “restaurant” and “Cleveland” and some words that add precision to their query by constraining the properties of the thing they want to find – in this case “sushi”. In other words, it is clear from this query that the user is not interested in just any restaurant – they want to find those that serve raw fish on a ball of rice or vegetable and seafood thingies wrapped in seaweed. The search engine often does the wrong thing because it doesn’t know how to combine these terms – and typically will use the wrong logical or boolean operator – OR when the users intent should be interpreted as AND. It turns out that in many cases now, our search indexes know the difference between Mexican Restaurants (which typically don’t serve Sushi) and Japanese Restaurants (which usually do) because of the metadata that we put into them to do faceted search (Funny story: after posting this, I ran across a Mexican Restaurant in Toms River, New Jersey that does serve sushi – but still, most of them don’t!). The goal of query autofiltering is to use that built in knowledge to answer the question right away and not wait for the user to “drill in” using the facets. If users don’t give us a precise query (like simply “restaurants”), we can still use faceting, but if they do, it would be cool if we could cut to the chase. As you’ll see, it turns out that we can do this. 
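To make the idea concrete before getting into the implementation, here is a rough sketch of the difference between drill-in faceting and answering at query time. The field names (business_type, cuisine, city) are hypothetical and only for illustration; the post's own example fields appear further down.

    # Traditional faceted search: the user types a query, then clicks facet values to narrow it
    q=sushi restaurants cleveland&facet=true&facet.field=business_type&facet.field=cuisine

    # Query autofiltering: the same intent is inferred up front and applied as filters
    q=*:*&fq=business_type:restaurant&fq=cuisine:sushi&fq=city:Cleveland

The second request is what the rest of this post works toward producing automatically from the first request's plain-text query.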
The previous post contained a solution which I called a “Simple” Category Extraction component. It works by seeing if single tokens in the query matched field values in the search index (using a cool Lucene feature that enable us to mine the “uninverted” index for all of the values that were indexed in a field). For example, if it sees the token “red” and discovers that “red” is one of the values of a “color” field, it would infer that the user was looking for things that are “red” in “color” and will constrain the query this way. The solution works well in a limited set of cases, but there are a number of problems with it that make it less useful in a production setting. It does a nice job in cases where the term “red” is used to qualify or more precisely specify a thing – such as “red sofa”. It does not do so well in cases where the term “red” is not used as a qualifier – such as when it is part of a brand or product name such as “Red Baron Pizza” or “Johnny Walker Red Label” (great Scotch, but “Black Label” is even better, maybe I’ll be rich enough to afford “Blue Label” some day – but I digress …). It is interesting to note that the simple extractor’s main shortcomings are due to the fact that it looks at single tokens at a time in isolation from the tokens around it. This turns out to be the same problem that the core search engine algorithms have – i.e., it’s a “bag of words” approach that doesn’t consider – wait for it – semantic context. The solution is to look for patterns of words that match patterns of content attributes. This does a much better job of disambiguation. We can use the same coding trick as before (upgraded for API changes introduced in Solr 5.0), but we need to account for context and usage – as much as we can without having to introduce full-blown NLP which needs lots of text to crunch. In contrast, this approach can work when we just have structured metadata. Searching vs Navigating A little historical background here. With modern search applications, there are basically two types of user activities that are intermingled: searching and navigating. The former involves typing into a box and the latter, clicking on facet links. In the old days, there was a third user interface called an “advanced” search form where users could pick from a set of metadata fields, put in a value and select their logical combination operators– an interface that would be ideally suited for precise searching given rich metadata. The problem is that nobody wants to use it. Not that people ever liked this interface anyway (except those with Master of Library Science degrees), but Google has also done much to demote this interface to a historical reference. Google still has the same problem of noise hits but they have built a reputation for getting the best results to the top (and usually, they do) – and they also eschew facets (they kinda have them at the bottom of the search page now as related searches). Users can also “markup” their query with quotation marks or boolean expressions or ‘+/-’ signs but trust me – they won’t do that either (typically that is). What this means is that the little search box – love it or hate it – is our main entry point – i.e. we have to deal with it, because that is what users want – to just type stuff and then get the “right” answer back. (If poor ease-of-use or the simple joy of Google didn’t kill the advanced search form completely, the migration to mobile devices absolutely will). 
A Little Solr/Lucene Technology – String Fields, Text Fields, and "Free-Text" Searching

In Solr, when talking about textual data, these two user activities are normally handled by two different types of index field: string and text. String fields are not analyzed (tokenized), and searching them requires an exact match on a value indexed within a field. This value can be a word or a phrase. In other words, you need to use field:value syntax in the query (and quoted field:"value here" syntax if the query is multi-term) – something that power users will be OK with, but not something that we can expect of the average user. However, string fields are very good for faceted navigation. Text fields, on the other hand, are analyzed (tokenized and filtered) and can be searched with "free-text" queries – our little box, in other words. The problem here is that tokenization turns a stream of text into a stream of tokens (words), and while we do preserve positional information so we can search on phrases, we don't know a priori where those phrases are. Text fields can also be faceted (in fact, any field can be a facet field in Solr), but in this case the facets are based on individual tokens, which don't tend to be too useful. So we have two basic field types for text data, one good for searching and one for navigating. In the harder-to-search type, we know exactly where the phrases are; in the easier-to-search type, we typically don't. A classic trade-off scenario. Since string fields are harder to search (at least within the Google paradigm that users love), we make them searchable by copying their data (using the Solr "copyField" directive) into a catchall text field called "text" by default (a small schema sketch of this arrangement appears at the end of this section). This works, but in the process we throw away information about which values are meant to be phrases and which are not. Not only that, we've lost the context of what these values represent (the string fields that they came from). So although we've made these string fields more searchable, we've had to do that by putting them into a "bag of words" blender. But the information is still somewhere in the search index; we just need a way to get it back at "query time". Then we can both have our cake AND eat it!

Noun Phrases and the Hierarchy of Meta Information

When we talk about things, there are certain attributes that describe what the thing is (type attributes) and others that describe the properties or characteristics of the thing. In a structured database or search index, both of these kinds of attributes are stored the same way – as field/value pairs. There are, however, natural or semantic relationships between these fields that the database or search engine can't understand, but we do. That is, noun phrases that describe more specific sets of things are buried in the relationships between our metadata fields. All we have to do is dig them out. For example, if I have a database of local businesses, I can have a "what" field like business type that has values like "restaurant", "hardware store", "drug store", "filling station" and so forth. Within some of these business types there may be refining information, like restaurant type ("Mexican", "Chinese", "Italian", etc.) for restaurants, or brand/franchise ("Exxon", "Sunoco", "Hess", "Rite-Aid", "CVS", "Walgreens", etc.) for gas stations and drug stores. These fields form a natural hierarchy of metadata in which some attributes refine or narrow the set of things that are labeled by broader field types.
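As a concrete illustration of the string-field-plus-copyField arrangement described above, a schema.xml fragment might look something like the following. The field names are hypothetical and the field types depend on your own schema; this is a sketch, not the post's actual configuration.

    <field name="brand" type="string" indexed="true" stored="true"/>
    <field name="product_type" type="string" indexed="true" stored="true"/>
    <field name="color" type="string" indexed="true" stored="true"/>
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

    <!-- makes the string values free-text searchable, but the field context is lost in the catchall -->
    <copyField source="brand" dest="text"/>
    <copyField source="product_type" dest="text"/>
    <copyField source="color" dest="text"/>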
Rebuilding Context: Identifying field name patterns to find relevant phrase patterns So now its time to put Humpty Dumpty back together again. With Solr/Lucene – it is likely that the information that we need to give precise answers to precise questions is available in the search index. If we can identify sub-phrases within a query that refer or map to a metadata field in the index, we can then add the appropriate metadata mapping on behalf of the user. We are then able to answer questions like “Where is the nearest Tru Value hardware store?” because we can identify the phrase “Tru Value” as a business name and “hardware store” as a specific type of store. Assuming that this information is in the index in the form of metadata fields, parsing the query is a matter of detecting these metadata values and associating them with their source fields. Some additional NLP magic can be used to infer other aspects of the question such as “where is the nearest”, which should trigger the addition of a spatial proximity query filter for example. The Query AutoFiltering Search Component To implement the idea set out above, I developed a Solr Search Component called QueryAutoFilteringComponent. Search components are executed as part of the search request handling process. Besides executing a search, they can also do other things like spell checking or query suggestion, return the set of terms that are indexed in a field or the term vectors (term frequency statistics) among other things. The SearchComponent interface defines a number of methods one of which – prepare( ) – is executed by all of the components in a search handler chain before the request is processed. By specifying that a non-standard component is in the “first-components” list – it will be executed before the query is sent to the index by the downstream QueryComponent. This gives these early components a chance to modify the query before it is executed by the Lucene engine (or distributed to other shards in SolrCloud). The QueryAutoFilteringComponent works by creating a mapping of term values to the index fields that contain them. It uses the Lucene UnivertedIndex and the Solr TermsComponent (in SolrCloud mode) to build this map. This “inverse” map of term value -> index field is then used to discover if any sub-phrase within a query maps to a particular index field. If so, a filter query (fq) or boost query (bq) – depending on the configuration – is created from that field:value pair and if the result is to be a filter query, the value is removed from the original query. The result is a series of query expressions for the phrases that were identified in the original query. An example will help to make this clearer. Assuming that we have indexed the following records: This example is admittedly a bit contrived in that the term “red” is deliberately ambiguous – it can occur as a color value or as part of a brand or product_type phrase. So, with the OOTB Solr /select handler, a search for “red lion socks” brings back all 16 records. However, with the QueryAutoFilterComponent, only 2 results are returned (4 and 5) for this query. Furthermore, searching for “red wine” will only bring back one record (11) whereas searching for “red wine vinegar” brings back just record 12. What the filter does is to match terms with fields, trying to find the longest contiguous phrases that match mapped field values. 
So for the query “red lion socks” – it will first discover that “red” is a color, but then it will discover that “red lion” is a brand and this will supercede the shorter match that starts with “red”. Likewise, with “red wine vinegar”, it will first find “red” == color, then “red wine” == product_type then “red wine vinegar” == product_type and the final match will win because it is the longest contiguous match. It will work across fields too. If the query is “blue red lion socks” – it will discover that “blue” is a color, then that “blue red” is nothing so it will move on to the next unmatched token – “red”. It will then, as before, discover that “red lion” is a brand, reject “red lion socks” which doesn’t map to anything and finally find that “socks” is a product_type. From these three field matches it will construct a filter (or boost) query with the appropriate mapping of field name to field value. The result of all of this is a translation of the Solr query: q=blue red lion socks to a filter query: fq=color:blue&fq=brand:”red lion”&fq=product_type:socks This final query brings back just 1 result as opposed to 16 for the unfiltered case. In other words, we have increased precision from 6.25% to 100%! Adding case sensitivity and synonym support: One of the problems with using string fields as the source of metadata for noun phrases is that they are not analyzed (as discussed above). This limits the set of user inputs that can match – without any changes, the user must type in exactly what is indexed, including case and plurality. To address this problem, support for basic text analysis such as case insensitivity and stemming (singular/plural) as well as support for synonyms was added to the QueryAutoFilteringComponent. This adds to the code complexity somewhat but it makes it possible for the filter to detect synonymous phrases in the query like “couch” or “lounge chair” when “Sofa” or “Chaise Lounge” were indexed. Another thing that can help at an application level is to develop a suggester for typeahead or autocomplete interfaces that uses the Solr terms component and facet maps to build a multi-field suggester that will guide users towards precise and actionable queries. I hope to have a post on this in the near future. Source Code For those that are interested in how the autofiltering component works or would like to use it in your search application, source code and design documentation are available on github. The component has also been submitted to Solr (SOLR-7539 if you want to track it). The source code on github is in two versions, one that compiles and runs with Solr 4.x and the other that uses the new UninvertingReader API that must be used in Solr 5.0 and above. Conclusions The QueryAutoFilteringComponent does a lot more than the simple implementation introduced in the previous post. Like the previous example, it turns a free form queries into a set of Solr filter queries (fq) – if it can. This will eliminate results that do not match the metadata field values (or their synonyms) and is a way to achieve high precision. Another way to go is to use the “boost query” or bq rather than fq to push the precise hits to the top but allow other hits to persist in the result set. Once contextual phrases are identified, we can boost documents that contain these phrases in the identified fields (one of the chicken-and-egg problems with query-time boosting is knowing what field/value pairs to boost). 
The boosting approach may make more sense for traditional search applications viewed on laptop or workstation computers whereas the filter query approach probably makes more sense for mobile applications. The component contains a configurable parameter “boostFactor” which when set, will cause it to operate in boost mode so that records with exact matches in identified fields will be boosted over records with random or partial token hits.
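For reference, wiring a custom search component like this into a request handler follows the standard Solr pattern sketched below. The handler name, the component's package name, and the exact form of the boostFactor parameter are assumptions based on the description in this post; check the GitHub project for the real values. Omitting boostFactor keeps the component in filter-query mode, while setting it switches to the boost mode described above.

    <searchComponent name="queryAutoFiltering"
                     class="com.example.solr.QueryAutoFilteringComponent">
      <!-- optional: run in boost-query mode instead of filter-query mode -->
      <float name="boostFactor">100.0</float>
    </searchComponent>

    <requestHandler name="/autofilter" class="solr.SearchHandler">
      <arr name="first-components">
        <str>queryAutoFiltering</str>
      </arr>
    </requestHandler>

Because it is listed in "first-components", the component's prepare() runs before the QueryComponent, which is what lets it rewrite the query into fq or bq clauses before anything hits the index.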
June 22, 2015
by Lisa Warner
· 1,931 Views
Heroku PostgreSQL vs. Amazon RDS for PostgreSQL
Written by Barry Jones. PostgreSQL is becoming the relational database of choice for web development for a whole host of good reasons. That means that development teams have to decide whether to host their own or use a database-as-a-service provider. The two biggest players in the world of PostgreSQL are Heroku PostgreSQL and Amazon RDS for PostgreSQL. Today I'm going to compare both platforms. Heroku was the first big provider to make a push for PostgreSQL instead of MySQL for application development; they launched their Heroku PostgreSQL platform back in 2007. Amazon Web Services first announced their RDS for PostgreSQL service in November 2013 during the AWS re:Invent conference, to an overwhelming ovation from the programmers in attendance.

Pricing Comparison

Before I get too far into the features, let's cover the pricing differences up front. Of course, both services have areas with different value propositions for productivity and maintenance that go beyond these direct costs. However, it's worth understanding the basic costs so you can weigh those values against your needs later. Heroku PostgreSQL has the simplest pricing. The rates and what you get for them are very clearly set at a simple per-month rate that includes the database, storage, data transfer, I/O, backups, SLA, and any other features built into the pricing tier. With RDS for PostgreSQL, pricing is broken down into smaller units of individual resource usage. That means there are more factors involved in estimating the price, so it's a little tougher to draw an exact comparison to Heroku PostgreSQL. You have the price per hour for the instance type (higher if it's a multiple-availability-zone instance, cheaper if you pay an upfront cost to reserve the instance for one to three years); storage cost and storage class (both single and multi AZ); provisioned IOPS rate; backup storage; and data transfer... then there are a whole lot of special cases to consider. Also, keep in mind that you get one year free of the cheapest plan when you sign up. Here is a comparison of an RDS plan to the Heroku Premium 4 plan:

Heroku Premium 4: $1,200/month
  • 15 GB RAM
  • 512 GB storage
  • 500 connections
  • High Availability
  • Max 15 minutes downtime/month
  • 1 week rollback
  • Point-in-time recovery
  • Encryption at rest
  • Continuous protection (offsite Write-Ahead Log)

RDS for PostgreSQL: $1,156/month on demand, or $756/month with a 1-year reservation
  • db.m3.xlarge Multi-AZ at $0.780/hr ($580)
  • 4 vCPU, 15 GB RAM
  • Encryption at rest
  • 512 GB provisioned (SSD) at $0.250/GB ($128)
  • 2,000 provisioned IOPS at $0.20/IOPS ($400)
  • Estimated backup storage in excess of free for 1-week rollback, 512 GB at $0.095/GB ($48)
  • Data transfer estimated at $0 for most use cases
  • 22 minutes downtime/month (based on the AWS RDS SLA of 99.95% uptime)

Now, here are the caveats with such a comparison: Heroku isn't disclosing the number of CPUs associated with their plans. Heroku's High Availability is equivalent to AWS RDS Multi-AZ. In both setups, a read replica is maintained in a different geographic region specifically for the purpose of automatic failover in the event of an outage. With Heroku, your storage is fully allocated, and you do not pay for IOPS. As such, we don't know what the limits are for IOPS, but they are very high-performance databases. I allocated the minimum IOPS that AWS would allow for 512 GB, which was 2,000. We could go as high as 5,000 IOPS, which would increase the price by $600/month.
The AWS RDS backups may cost nothing depending on how much of the provisioned storage is actually being used. Backup storage is free up to the level of provisioned storage, and backups are generally smaller, incremental, and do not include the significant space used by indexes. This estimate was based on the seven days of storage needed to allow for one week rollback. AWS RDS storage can be scaled up on the fly, so your specific needs for RAM versus storage could create a wildly different pricing pattern. This comparison is aiming to draw an equivalent. AWS only charges for data transfers out of your availability zone (not including multi-AZ transfers), so transfer rates will not apply in most cases. Clear as mud. Setup Complexity Heroku PostgreSQL setup is dead simple. Whenever you create a PostgreSQL project, a free dev plan is already created with it with a connection waiting. Upgrading the database simply gives you a new connection string with a set username, password, hostname, and database identifier that are all randomly generated by their system. The database connection must be secure but is accessible anywhere on the internet, including directly from your home computer. You can also choose whether to deploy it in the US East region or in the European region. RDS for PostgreSQL setup is slightly more involved; you must select the various options outlined in the pricing section, including the instance type, whether or not it should be Multi-AZ, whether to enable encryption at rest, type of storage, how much to provision, IOPs to provision (if any), backup retention period, whether or not to enable automatic minor version upgrades, selection of backup and maintenance windows, database identifier, name, port, master user and password, which availability zone you want it to be created in and the selection of your VPC group and subnet group, and your database configuration. Obviously, RDS gives you significantly more control over the details. Depending on your point of view, that could be good or bad. The database configuration, for example, has a set of defaults for each database version for each instance type. You can take these defaults and make modifications to them with your own custom settings and then save those as your own parameter group to assign to this and any future databases that you may choose to create. The initial setup time can be slightly more involved because of the various factors like VPC, subnet groups, and public accessibility. However, once these have been defined the first time for your account, everything gets much closer to a point-and-click experience. Host Locations, Regional Restrictions Heroku operates with the AWS US East Region (us-east-1) and Europe (eu-west-1). This also means that your database will be restricted to these regions. Availability Zones are managed internally. If you choose to use Heroku PostgreSQL with something hosted in a different AWS region than those two, you should expect more latency between database requests and transfer rates may apply. Likewise, if you wish to use AWS RDS for PostgreSQL with a Heroku application, just ensure that it is set up in the appropriate region. Security and Access Considerations Within Heroku PostgreSQL, you’re given a randomized username with a randomized password and a randomized database name that must be connected to over SSL. Their network (as well as Amazon’s) have built-in protections against scanners that could potentially brute-force access such a database. That is fairly secure. 
The downside is that anybody who needs access to the database and has the connection information can do so from anywhere in the world. This is more of a Human Resources-level risk from departed programmers on a project than anything else, but it is something to be aware of nonetheless. Swapping out the database credentials after having a programmer leave the team will generally alleviate this concern. On the other hand, AWS RDS for PostgreSQL has a much more comprehensive security policy. The ability to set and define a VPC and private subnet groups will allow you to restrict database access to only the servers and people who need it. You have the ability to create as many database users with various permission levels as you like in order to more easily manage multiple users or applications accessing the database with different permission levels, while providing a log trail. Thanks to VPC, even if somebody did have the connection information, they still couldn’t access the database without being able to get inside the VPC. For stricter (although more complex) security, RDS wins hands down. Depending on complexity, team, and the development state of your application, this level of security paranoia may not yet make sense and could be more of a headache than you want to manage. You can also configure it with the same access rules used by Heroku PostgreSQL. Backup/Restore/Upgrade Both platforms offer very similar options for backup and restore. Both have scheduled backups, point-in-time recovery, restoration to a new copy, and the ability to create snapshots. Upgrades are more involved. On both platforms, major version upgrades will involve some downtime, which can’t be avoided. Heroku provided three options that all involve some manual steps to complete: copying data, promoting an upgraded follower, or using the pg:upgrade command for an in-place upgrade of larger databases. The pg:upgrade most closely resembles the upgrade process on RDS. With RDS, you select the Modify option for your instance and change the version. It will create pre- and post-snapshots around the in-place upgrade while maintaining the exact same connection string. RDS will allow you to schedule the database upgrade automatically within your set maintenance window. Heroku PostgreSQL will automatically apply minor upgrades and security patches, while RDS allows you to choose whether or not you want them to do that automatically within your maintenance window. Both are fairly straightforward processes, although the RDS process is a little more hands-off in this case. Feature/Extension Availability As of this writing, AWS RDS for PostgreSQL has version 9.3.1–9.3.6 and 9.4.1, while Heroku PostgreSQL has 9.1, 9.2, 9.3, and 9.4. Minor version upgrades are automatic with Heroku, so the point releases are unnecessary. Heroku PostgreSQL has been around longer and because of that has more legacy versions available for their existing users. RDS launched with 9.3 and does not appear to have any intention to support older versions. In addition to all of the functionality built into PostgreSQL, there’s a constantly growing set of extensions. 
Both platforms have these extensions in common: hstore citext ltree isn cube dict_int unaccent PostGIS dblink earthdistance fuzzystrmatch intarray pg_stat_statements pgcrypto pg_trgm tablefunc uuid-ossp pgrowlocks btree_gist PL/pgSQL PL/Tcl PL/Perl PL/V8 Available on Heroku PostgreSQL: pgstattuple Available on AWS RDS for PostgreSQL: postgres_fdw chkpass intagg tsearch2 sslinfo Here are the full lists for both Heroku PostgreSQL and AWS RDS for PostgreSQL. Scaling Options “Scaling” is a tricky word with databases because it means different things depending on the needs of your application. Scaling for writes vs. reads is based on low intensity and high volume (web traffic) compared to low volume and high intensity (analytics). The most common scaling case on the web is scaling for read traffic. Both Heroku and RDS address this need with the ability to create read replicas. RDS calls them read replicas and Heroku calls them followers, but they’re essentially the same thing: a copy of the database, receiving live updates via the write-ahead-log over the wire to allow you to spread read traffic over multiple servers. This is commonly referred to as horizontal scaling. To create read replicas on either platform is a point-and-click operation. Vertical scaling refers to increasing or decreasing the power of the hardware of your database in place. AWS and Heroku each handle this scenario differently. Heroku instructs users to create a follower of the newly desired database class and then promote it to the primary database once it’s caught up, destroying the original afterwards. Your application will need to update its database connection information to use the new database. If your RDS database is a multi-AZ database, then the failover database will be upgraded first. Once ready, the connection will automatically failover to that instance while the primary is then upgraded, switching back to the primary afterwards. Without a Multi-AZ, you can do the upgrade in place, but downtime will vary depending on the size of the database. Your other option is to create a read replica with the newly desired stats and then promote it to primary when it is ready, just as Heroku recommends. To scale beyond the standard vertical and horizontal options for something that can handle distributed write scaling, neither option is a particularly good fit. It will probably be necessary to either manage your own Postgres-XC installation or restructure your application to isolate the write-heavy traffic into a more use-case specific data source. Monitoring AWS RDS for PostgreSQL comes with all of the standard AWS monitoring options via Cloudwatch. Cloudwatch provides extensive metrics that you can track history with a granular ability to set up alerts via email or SNS notifications (basically webhooks). These are great for integrating with tools like PagerDuty. Heroku PostgreSQL monitoring relies more on logs and command line tools. Their pgextras command line tool will show current information about what’s going on in the database, including bloat, blocking queries, cache and index hit ratios, identification of unused indexes, and the ability to kill specific queries. These tools, while not involving the stat tracking over time that you get from Cloudwatch, provide extremely valuable insights into what’s going on with your database that you don’t come close to getting from RDS. You can see more examples of pg-extras on GitHub. 
These type of insights are invaluable in tuning your application and database to avoid the problems you’d need a monitor to catch in the first place. Other historical data is available in the logs, although Heroku recommends trying out Librato (which can work with any PostgreSQL database but has a Heroku plugin available for automatic configuration). Additionally, free New Relic plans will provide a wealth of insight into what’s going on with your application and database. While Cloudwatch provides more detailed insight as to what’s going on within the machine, Heroku uses the metrics seen within pg-extras to monitor and notify you of the various problems they see that require correction on your end. If data corruption happens, Heroku identifies and fixes it. Security problems, they’ll handle it. A DBA or a DevOps position will care significantly more about the Cloudwatch metrics. Heroku PostgreSQL tries to focus on making sure you don’t have to worry about it. Dataclips One bonus feature that you get from Heroku PostgreSQL is Dataclips. Dataclips are basically a method for storing and sharing read-only queries among your team for the sake of reporting without having to grant access to every person who may need to see them. Just type in a query and view the results right there on the page. The queries are version controlled; if your team is passing them around and tweaking them, you’ll be able to see the changes over time. In my personal experience, I’ve found dataclips to be a lifesaver, specifically for working with non-programmer teams. When business or support staff need information on sales, fraud, user behavior, account activity, or anything else we happen to have in there, I’ve always had the ability to write up a query to get at the information. Before dataclips, this meant that I needed to write up the query, save it somewhere, usually export the result set to a CSV or spreadsheet, and then email it to whomever was requesting it. Eventually, this becomes a routine activity that you’re having to handle at every request. Enter dataclips. Now I can take that query and just send the random hashed link over to whoever requested the information. If they want more up-to-date information the next day, week, or month, they need only refresh the page. I write the query, then never hear that request again. That is a developer time-saver right there. You can save them and name them, as well as manage more strict access if need be. Summary and Recommendation Overall, AWS RDS for PostgreSQL will usually be cheaper and more tightly tailorable to exactly what your application’s needs are. You’ll have much more granular control over access, security, monitoring, alerts, geographic location, and maintenance plans. With Heroku PostgreSQL, you’ll pay a little bit more on a simplified pricing structure, although all of your development databases will be free. You won’t be able to control a lot of the details that RDS gives you access to, but that’s partially by design so that you don’t have to deal with managing those details. With Heroku, you’ll get insights directly into how your database is performing and using the internal resources to help you catch, tune, and improve your setup before it becomes a problem. If I had to choose, I’d probably go with Heroku and Heroku PostgreSQL as a startup while I focused on actually getting my application developed and getting customers in the door. 
The value proposition of saving time to focus on business goals so we can build a revenue stream would be of the greatest importance. Then when things grew to a point that the database was no longer changing as much, it might make sense to start migrating things over to RDS as we focus on locking things down to focus on stability, long-term maintenance, and security. In the end, it really boils down to what costs you more: time or infrastructure. If time costs you more, go with Heroku PostgreSQL. If infrastructure costs you more, go with RDS. Having both platforms living within the AWS datacenters makes switching between the two a lot easier as your needs change.
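As a rough illustration of how different the two setup experiences described above feel from the command line, provisioning each of the plans compared earlier might look something like this. The add-on plan name, app and instance identifiers, and several of the flag values are placeholder assumptions for illustration, not exact recommendations.

    # Heroku: one add-on command; connection details are generated for you
    heroku addons:create heroku-postgresql:premium-4 --app my-app

    # AWS: most of the decisions from the pricing and setup sections are explicit flags
    aws rds create-db-instance \
      --db-instance-identifier my-app-db \
      --engine postgres \
      --db-instance-class db.m3.xlarge \
      --multi-az \
      --storage-type io1 \
      --allocated-storage 512 \
      --iops 2000 \
      --backup-retention-period 7 \
      --master-username myuser \
      --master-user-password '...'

The contrast mirrors the article's conclusion: Heroku trades away knobs for simplicity, while RDS puts every cost and infrastructure decision in your hands.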
June 22, 2015
by Moritz Plassnig
· 3,681 Views
R: dplyr - Removing Empty Rows
I'm still working my way through the exercises in Think Bayes, and in Chapter 6 I needed to do some cleaning of the data in a CSV file containing information about The Price is Right. I downloaded the file using wget:

    wget http://www.greenteapress.com/thinkbayes/showcases.2011.csv

And then loaded it into R and explored the first few rows using dplyr:

    library(dplyr)
    df2011 = read.csv("~/projects/rLearning/showcases.2011.csv")

    > df2011 %>% head(10)
                X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
    1             5631K   5632K   5633K   5634K   5635K   5641K   5642K   5643K   5644K   5645K   5681K
    2
    3  Showcase 1  50969   21901   32815   44432   24273   30554   20963   28941   25851   28800   37703
    4  Showcase 2  45429   34061   53186   31428   22320   24337   41373   45437   41125   36319   38752
    5
    ...

As you can see, we have some empty rows which we want to get rid of to ease future processing. I couldn't find an easy way to filter those out, but what we can do instead is have empty columns converted to NA and then filter those. First we need to tell read.csv to treat empty columns as NA:

    df2011 = read.csv("~/projects/rLearning/showcases.2011.csv", na.strings = c("", "NA"))

And now we can filter them out using na.omit:

    df2011 = df2011 %>% na.omit()

    > df2011 %>% head(5)
                   X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3
    3     Showcase 1  50969   21901   32815   44432   24273   30554   20963   28941   25851   28800   37703
    4     Showcase 2  45429   34061   53186   31428   22320   24337   41373   45437   41125   36319   38752
    6          Bid 1  42000   14000   32000   27000   18750   27222   25000   35000   22500   21300   21567
    7          Bid 2  34000   59900   45000   38000   23000   18525   32000   45000   32000   27500   23800
    9   Difference 1   8969    7901     815   17432    5523    3332   -4037   -6059    3351    7500   16136
    ...

Much better!
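A small aside: if you would rather drop the empty rows without converting them to NA first, filtering on the label column is another option. This is a sketch under the assumption that the blank rows contain an empty string in the X column, which is what the raw output above suggests:

    library(dplyr)

    df2011 = read.csv("~/projects/rLearning/showcases.2011.csv")

    # keep only rows whose label column is non-empty
    df2011 = df2011 %>% filter(X != "")

Both approaches end up with the same rows here; na.omit is just broader, since it also drops rows that are missing any value in any column.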
June 21, 2015
by Mark Needham
· 56,006 Views · 1 Like
Long-Term Log Analysis with AWS Redshift
You will aggregate a lot of logs over the lifetime of your product and codebase, so it’s important to be able to search through them. In the rare case of a security issue, not having that capability is incredibly painful. You might be able to use services that allow you to search through the logs of the last two weeks quickly. But what if you want to search through the last six months, a year, or even further? That availability can be rather expensive or not even an option at all with existing services. Many hosted log services provide S3 archival support which we can use to build a long-term log analysis infrastructure with AWS Redshift. Recently I’ve set up scripts to be able to create that infrastructure whenever we need it at Codeship. AWS Redshift AWS Redshift is a data warehousing solution by AWS. It has an easy clustering and ingestion mechanism ideal for loading large log files and then searching through them with SQL. As it automatically balances your log files across several machines, you can easily scale up if you need more speed. As I said earlier, looking through large amounts of log files is a relatively rare occasion; you don’t need this infrastructure to be around all the time, which makes it a perfect use case for AWS. Setting Up Your Log Analysis Let’s walk through the scripts that drive our long-term log analysis infrastructure. You can check them out in the flomotlik/redshift-logging GitHub repository. I’ll take you step by step through configuring the whole setup of the environment variables needed, as well as starting the creation of the cluster and searching the logs. But first, let’s get a high-level overview of what the setup script is doing before going into all the different options that you can set: Creates an AWS Redshift cluster. You can configure the number of servers and which server type should be used. Waits for the cluster to become ready. Creates a SQL table inside the Redshift cluster to load the log files into. Ingests all log files into the Redshift cluster from AWS S3. Cleans up the database and prints the psql access command to connect into the cluster. Be sure to check out the script on GitHub before we go into all the different options that you can set through the .env file. Options to set The following is a list of all the options available to you. You can simply copy the .env.template file to .env and then fill in all the options to get picked up. AWS_ACCESS_KEY_ID AWS key of the account that should run the Redshift cluster. AWS_SECRET_ACCESS_KEY AWS secret key of the account that should run the Redshift cluster. AWS_REGION=us-east-1 AWS region the cluster should run in, default us-east-1. Make sure to use the same region that is used for archiving your logs to S3 to have them close. REDSHIFT_USERNAME Username to connect with psql into the cluster. REDSHIFT_PASSWORD Password to connect with psql into the cluster. S3_AWS_ACCESS_KEY_ID AWS key that has access to the S3 bucket you want to pull your logs from. We run the log analysis cluster in our AWS Sandbox account but pull the logs from our production AWS account so the Redshift cluster doesn’t impact production in any way. S3_AWS_SECRET_ACCESS_KEY AWS secret key that has access to the S3 bucket you want to pull your logs from. PORT=5439 Port to connect to with psql. CLUSTER_TYPE=single-node The cluster type can be single-node or multi-node. Multi-node clusters get auto-balanced which gives you more speed at a higher cost. NODE_TYPE Instance type that’s used for the nodes of the cluster. 
Check out the Redshift Documentation for details on the instance types and their differences. NUMBER_OF_NODES=10 Number of nodes when running in multi-mode. CLUSTER_IDENTIFIER=log-analysis DB_NAME=log-analysis S3_PATH=s3://your_s3_bucket/papertrail/logs/862693/dt=2015 Database format and failed loads When ingesting log statements into the cluster, make sure to check the amount of failed loads that are happening. You might have to edit the database format to fit to your specific log output style. You can debug this easily by creating a single-node cluster first that only loads a small subset of your logs and is very fast as a result. Make sure to have none or nearly no failed loads before you extend to the whole cluster. In case there are issues, check out the documentation of the copy command which loads your logs into the database and the parameters in the setup script for that. Example and benchmarks It’s a quick thing to set up the whole cluster and run example queries against it. For example, I’ll load all of our logs of the last nine months into a Redshift cluster and run several queries against it. I haven’t spent any time on optimizing the table, but you could definitely gain some more speed out of the whole system if necessary. It’s just fast enough already for us out of the box. As you can see here, loading all logs of May — more than 600 million log lines — took only 12 minutes on a cluster of 10 machines. We could easily load more than one month into that 10-machine cluster since there’s more than enough storage available, but for this post, one month is enough. After that, we’re able to search through the history of all of our applications and past servers through SQL. We connect with our psql client and send of SQL queries against the “events’ database. For example, what if we want to know how many build servers reported logs in May: loganalysis=# select count(distinct(source_name)) from events where source_name LIKE 'i-%'; count ------- 801 (1 row) So in May, we had 801 EC2 build servers running for our customers. That query took ~3 seconds to finish. Or let’s say we want to know how many people accessed the configuration page of our main repository (the project ID is hidden with XXXX): loganalysis=# select count(*) from events where source_name = 'mothership' and program LIKE 'app/web%' and message LIKE 'method=GET path=/projects/XXXX/configure_tests%'; count ------- 15 (1 row) So now we know that there were 15 accesses on that configuration page throughout May. We can also get all the details, including who accessed it when through our logs. This could help in case of any security issues we’d need to look into. The query took about 40 seconds to go though all of our logs, but it could be optimized on Redshift even more. Those are just some of the queries you could use to look through your logs, gaining more insight into your customers’ use of your system. And you et all of that with a setup that costs $2.50 an hour, can be shut down immediately, and recreated any time you need access to that data again. Conclusions Being able to search through and learn from your history is incredibly important for building a large infrastructure. You need to be able to look into your history easily, especially when it comes to security issues. With AWS Redshift, you have a great tool in hand that allows you to start an ad hoc analytics infrastructure that’s fast and cheap for short-term reviews. Of course, Redshift can do a lot more as well. 
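For anyone curious what the ingestion step looks like in SQL terms, the table definition and COPY performed by the setup script follow the general shape below. The column list and S3 path are illustrative assumptions (the real definitions live in the script in the repository); the COPY options shown are standard Redshift ones for gzipped, tab-delimited log archives.

    -- a minimal events table for syslog-style lines
    CREATE TABLE events (
      received_at  TIMESTAMP,
      source_name  VARCHAR(256),
      program      VARCHAR(256),
      message      VARCHAR(MAX)
    );

    -- bulk-load the archived log files straight from S3
    COPY events
    FROM 's3://your_s3_bucket/papertrail/logs/862693/dt=2015'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    GZIP
    DELIMITER '\t'
    MAXERROR 1000;

MAXERROR is where the "failed loads" tolerance discussed above comes in: keep it low while debugging your log format on a small single-node cluster, so malformed lines surface instead of silently disappearing.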
Let us know what your processes and tools around logging, storage, and search are in the comments.
June 21, 2015
by Florian Motlik
· 1,306 Views
Spring XD 1.2 GA, Spring XD 1.1.3 and Flo for Spring XD Beta Released
Written by Mark Pollack. Today, we are pleased to announce the general availability of Spring XD 1.2, Spring XD 1.1.3 and the release of Flo for Spring XD Beta. 1.2.0.GA: zip 1.1.3.RELEASE: zip Flo for Spring XD Beta You can also install XD 1.2 using brew and rpm The 1.2 release includes a wide range of new features and improvements. The release journey was an eventful one, mainly due to Spring XD’s popularity with so many different groups, each with their respective request priorities. However the Spring XD team rose to the challenge and it is rewarding to look back and review the amount of innovation delivered to meet our commitments toward simplifying big data complexity. Here is a summary of what we have been busy with for the last 3 months and the value created for the community and our customers. Flo for Spring XD and UI improvements Flo for Spring XD is an HTML5 canvas application that runs on top of the Spring XD runtime, offering a graphical interface for creation, management and monitoring streaming data pipelines. Here is a short screencast showing you how to build an advanced stream definition. You can browse the documentation for additional information and links to additional screen casts of Flo in action. The XD admin screen also includes a new Analytics section that allows you to easily view gauges, counters, field-value counters and aggregate counters. Performance Improvements Anticipating increased high-throughput and low-latency IoT requirements, we’ve made several performance optimizations within the underlying message-bus implementation to deliver several million messages per second transported between Spring XD containers using Kafka as a transport. With these optimizations, we are now on par with the performance from Kafka’s own testing tools. However, we are using the more feature rich Spring Integration Kafka client instead of Kafka’s high level consumer library. For anyone who is interested in reproducing these numbers, please refer to the XD benchmarking blog, which describes the tests performed and infrastructure used in detail. Apache Ambari and Pivotal HD To help automate the deployment of Spring XD on an Apache HadoopⓇ cluster, we added an Apache AmbariⓇ plugin for Spring XD. The plugin is supported on both Pivotal HD 3.0 and Hortonworks HDP 2.2 distributions. We also added support in Spring XD for Pivotal HD 3.0, bringing the total number of Hadoop versions supported to five. New Sources, Processors, Sinks, and Batch Jobs One of Spring XD’s biggest value propositions is its complete set of out-of-the-box data connectivity adapters that can be used to create real-time and batch-based data pipelines, and these require little to no user-code for common use-cases. With the help of community contributions, we now have MongoDB, VideCap, and FTP as source modules, an XSLT-transformer processor, and FTP sink module. The XD team also developed a Cassandra sink and a language-detection processor. Recognizing the important role in the Pivotal Big Data portfolio, we have also added native integration with Pivotal Greenplum Database and Pivotal HAWQ through gpfdist sink for real-time streaming and also support for gpload based batch jobs. 
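To give a feel for how these out-of-the-box modules compose, a stream that pulls files from an FTP source, transforms the payload, and lands it in the new Cassandra sink could be defined from the XD shell roughly as follows. The stream name and the module option names here are indicative only; consult the module reference documentation for each module's exact options.

    xd:> stream create --name orders-ingest --definition "ftp --host=ftp.example.com --remoteDir=/orders | transform --expression=payload.toUpperCase() | cassandra --keyspace=orders" --deploy

The same pipe-style definition is what Flo renders and edits graphically, so teams can move between the DSL and the canvas without changing how streams are described.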
Adding to our developer productivity theme and the use of Spring XD in production for high-volume data ingest use-cases, we are delighted to recognize Simon Tao and Yu Cao (EMC² Office of The CTO & Labs China), who have been operationalizing Spring XD data pipelines in production since 2014 and also for the VideCap source module contribution. Their use-case and implementation specifics (in their own words) are below. “There are significant demands to extract insights from large magnitude of unstructured video streams for the video surveillance industry. Prior to being analyzed by data scientists, the video surveillance data needs to be ingested in the first place. To tackle this challenge, we built a highly scalable and extensible video-data ingestion platform using Spring XD. This platform is operationally ready to ingest different kinds of video sources into a centralized Big Data Lake. Given the out-of-the-box features within Spring XD, the platform is designed to allow rich video content processing capabilities such as video transcoding and object detection, etc. The platform also supports various types of video sources—data processors and data exporting destinations (e.g. HDFS, Gemfire XD and Spark)—which are built as custom modules in Spring XD and are highly reusable and composable. With a declarative DSL, a video ingestion stream will be handled by a video ingestion pipeline defined as Directed Acyclic Graph of modules. The pipeline is designed to be deployed in a clustered environment with upstream modules transferring data to downstream ones efficiently via the message bus. The Spring-XD distributed runtime allows each module in the pipeline to have multiple instances that run in parallel on different nodes. By scaling out horizontally, our system is capable of supporting large scale video surveillance deployment with high volume of video data and complex data processing workloads.” Custom Module Registry and HA Support Though we have had the flexibility to configure shared network location for distributed availability of custom modules (via: xd.customModule.home), we also recognized the importance of having the module-registry resilient under failure scenarios—hence, we have an HDFS backed module registry. Having this setup for production deployment provides consistent availability of custom module bits and the flexibility of choices, as needed by the business requirements. Pivotal Cloud Foundry Integration Furthering the Pivotal Cloud Foundry integration efforts, we have made several foundation-level changes to the Spring XD runtime, so we are able to run Spring XD modules as cloud-native Apps in Lattice and Diego. We have aggressive roadmap plans to launch Spring XD on Diego proper. While studying Diego’s Receptor API (written in Go!), we created a Java Receptor API, which is now proposed to Cloud Foundry for incubation. Next Steps We have some very interesting developments on the horizon. Perhaps the most important, we will be launching new projects that focus on message-driven and batch-oriented “data microservices”. These will be built directly on Spring Boot as well as Spring Integration and Spring Batch, respectively. Our main goal is to provide the simplest possible developer experience for creating cloud-native, data-centric microservice apps. 
In turn, Spring XD 2.0 will be refactored as a layer above those projects, to support the composition of those data microservices into streams and jobs as well as all of the “as a service” aspects that it provides today, but it will have a major focus on deployment to Cloud Foundry and Lattice. We will be posting more on these new projects soon, so stay tuned! Feedback is very important, so please get in touch with questions and comments via * StackOverflowspring-xd tag * Spring JIRA or GitHub Issues Editor’s Note: ©2015 Pivotal Software, Inc. All rights reserved. Pivotal, Pivotal HD, Pivotal Greenplum Database, Pivotal Gemfire and Pivotal Cloud Foundry are trademarks and/or registered trademarks of Pivotal Software, Inc. in the United States and/or other countries. Apache, Apache Hadoop, Hadoop and Apache Ambari are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. All Posts Engineering Releases News and Events
June 21, 2015
by Pieter Humphrey
· 3,285 Views
Data's Hierarchy of Needs
This post was originally published on the AppsFlyer blog. A couple of weeks ago, Nir Rubinshtein and I presented AppsFlyer's data architecture at a meetup of Big Data & Data Science Israel. One of the concepts that I presented there, and which is worth expanding upon, is "Data's Hierarchy of Needs":

  • Data should Exist
  • Data should be Accessible
  • Data should be Usable
  • Data should be Distilled
  • Data should be Presented

How can we make data "achieve its pinnacle of existence" and be acted upon? In other words, what are the areas that should be addressed when designing a data architecture if you want it to be complete and enable creating insights and value from the data you generate and collect? If done properly, your users might just act upon the data you provide. This list might seem a little simplistic, but it is not a prescription of what to do; rather, it is a set of reminders of areas we need to cover and questions we need answered to properly create a data architecture.

Data Should Exist

Well, of course data should exist, and it probably does. You should ask yourself, however, whether the data that exists is the right data. Does the retention policy you have serve the business needs? Does the availability fit your needs? Do you have all the needed links (foreign keys) to other data so you'd be able to connect it later for analysis? To make this more concrete, consider the following example: AppsFlyer accepts several types of events (launches, in-app events, etc.) which are tied to apps. Apps are connected to accounts (an account would have one or more applications, usually at least an iOS app and an Android one). If we saved the accounts as the latest snapshot and an app changed ownership, the historical data before that change would be skewed. If we treat the accounts as a slowly changing dimension of the events, then we'd be able to handle the transition correctly. Note that we may still choose to provide the new owner with the historic data, but now that is not the only option the system supports, and the decision can be based on the business needs.

Data Should Be Accessible

If data is written to disk, it is accessible programmatically at least; however, there can be many levels of accessibility, and we need to think about our end users' needs and the level of access they'd require. At AppsFlyer, the data existence (mentioned above) is handled by processing all the messages that go through our queues using Kafka, but that data is saved in sequence files and stored by event time. Most of our usage scenarios do have a time component, but they are primarily handled by app or account. Any processing that needs a specific account and would access the raw events would have to sift through tons of records (3.7+ billion a day at the time of this post) to find the few relevant ones. Thus, one basic move toward accessibility of data is to sort by apps so that queries will only need to access a small subset of the data and thus run much faster. Then we need to consider the "hotness" of the data, i.e., what response times we need and for which types of data. For instance, aggregations such as retention reports need to be accessed online (so-called "sub-second" response), latest counts need near real-time, explorations of data for new patterns can take hours, etc. To enable support of these varied usage scenarios, we need to create multiple projections of our data, most likely using several different technologies.
Data Should Be Usable

What's the difference between accessible data and usable data? For one, there's data cleansing. This is a no-brainer if you pull data from disparate systems, but it is also needed if your source is a single system. Data cleansing is what traditional ETL is all about, and the techniques still apply. Another aspect of making data usable is enriching it or connecting it to additional data. Enriching can happen from internal sources, like linking CRM data to the account info. It can also be facilitated by external sources, such as getting the app category from the app store or the device screen size from a device database. Last but not least, consider the legal and privacy aspects of the data. Before allowing access to the data you may need to mask sensitive information or remove privacy-related data (sometimes you shouldn't even save it in the first place). At AppsFlyer we take this issue very seriously and make major efforts to comply when working with partners and clients to make sure privacy-related data is handled correctly. In fact, we are also undergoing independent SOC auditing to make sure we are compliant with the highest standards. To summarize: to make the data usable you have to make sure it is correct, connect it to other data, and make sure it complies with legal and privacy requirements.

Data Should Be Distilled

Distilling insights is the reason we perform all the previous steps. Data in itself is of little use if it doesn't help us make better decisions. There are multiple types of insights you can generate here, beginning with the more traditional BI scenarios of slice-and-dice analytics, going through real-time aggregations and trend analysis, and ending with applying machine learning or “advanced analytics”. You can see one example of the type of insights that can be gleaned from our data by looking at the Gaming Advertising Performance Index we recently published.

Data Should Be Presented

This point ties in nicely with the Gaming Advertising Performance Index example provided above. Getting insights is an important step, but if you fail to present them in a coherent and cohesive manner, then the actual value users can derive from them is limited at best. Note that even if you use insights to make decisions automatically (e.g. recommending a product to a user), you'd still need to present how well those decisions are doing. There are many issues that need to be dealt with from a UX perspective, both in how users interact with the data and in how the data is presented. An example of the former is deciding on chart types for the data. A simple example of the latter is that when presenting projected or inaccurate data, it should be clear to users that they are looking at approximations, to prevent support calls about numbers not adding up.
Making sure all the areas discussed above are covered and handled properly is a lot of work, but providing a solution that actually helps your users make better decisions is well worth it. The data's hierarchy of needs is not a prescription for how to get there; it is merely a set of waypoints to help navigate toward this end goal. It helps me think holistically about AppsFlyer's data needs, and I hope it will help you as well. For more information about our architecture, check out the presentation from the meetup: Architecture for Real-Time and Batch Big Data Analytics – Distilling Insights @ AppsFlyer
June 21, 2015
by Arnon Rotem-gal-oz
· 999 Views
Enabling DataOps with Easy Log Analytics
DataOps is becoming an important consideration for organizations. Why? Well, DataOps is about making sure data is collected, analyzed, and available across the company – i.e. ops insight for your decision-making systems like HubSpot, Tableau, Salesforce, and more. Such systems are key to day-to-day operations and in many cases are as important as keeping your customer-facing systems up and running. If you think about it, today every online business is a data-driven business! Everyone is accountable for having up-to-the-minute answers on what is happening across their systems. You can't do this reliably without having DataOps in place. We have seen this trend across our own customer base at Logentries, where more and more customers are using log data to implement DataOps across their organizations. Using log data for DataOps allows you to do the following:

Troubleshoot the systems managing your data by identifying errors and correlating data sources
Get notified when one of these systems is experiencing issues, via real-time alerts or anomaly detection
Analyze how these systems are used by the organization

Logentries has always been great at 1 and 2 above, and this week we have enhanced Logentries to allow you to perform easier and more powerful analytics with our new, easy-to-use SQL-like query language – Logentries QL (LEQL). LEQL is designed to make analyzing your log data dead simple. There are too many log management tools that are built around complex query languages and require data scientists to operate. Logentries is all about making log data accessible to anyone. With LEQL you can use analytical functions like CountUnique, Min, Max, GroupBy, and Sort. A number of our users have already been testing these out via our beta program. One great example is how Pluralsight has been using Logentries to manage and understand the usage of their Tableau environment. For example:

Calculating the rate of errors over the past 24 hours, e.g. using the LEQL Count function
Understanding user usage patterns, e.g. using GroupBy to see queries performed grouped by different users
Sorting the data to find the most popular queries and how long they are taking

Being able to answer these types of questions enables DataOps teams to understand where they need to invest time going forward. For example, do I need to add capacity to improve query performance? Are internal teams having a good user experience, or are they getting a lot of errors when they try to access data? At Logentries we are all about making the power of log data accessible to everyone, and as we do this we are constantly seeing cool new use cases for logs. If you have some cool use cases, do let us know!
June 21, 2015
by Trevor Parsons
· 837 Views
MongoDB 3.0.4 Released
MongoDB 3.0.4 is out and is ready for production deployment. This release contains only fixes since 3.0.3, and is a recommended upgrade for all 3.0 users. Fixed in this release:

SERVER-17923 Creating/dropping multiple background indexes on the same collection can cause fatal error on secondaries
SERVER-18079 Large performance drop with documents > 16k on Windows
SERVER-18190 Secondary reads block replication
SERVER-18213 Lots of WriteConflict during multi-upsert with WiredTiger storage engine
SERVER-18316 Database with WT engine fails to recover after system crash
SERVER-18475 authSchemaUpgrade fails when the system.users contains non MONGODB-CR users
SERVER-18629 WiredTiger journal system syncs wrong directory
SERVER-18822 Sharded clusters with WiredTiger primaries may lose writes during chunk migration

3.0 Release Notes | Downloads | All Issues. As always, please let us know of any issues. – The MongoDB Team
June 20, 2015
by Francesca Krihely
· 2,196 Views
Diff'ing Software Architecture Diagrams
Robert Annett wrote a post titled Diagrams for System Evolution, in which he describes a simple approach to visually describing changes to a software architecture. In essence, in order to show how a system is to change, he'll draw different versions of the same diagram and use colour-coding to highlight the elements/relationships that will be added, removed, or modified. I've typically used a similar approach for describing as-is and to-be architectures in the past too. It's a technique that works well. Although you can version control diagrams, it's still tricky to diff them using a tool. One solution that addresses this problem is to not create diagrams, but instead create a textual description of your software architecture model that is then rendered with some tooling. You could do this with an architecture description language (such as Darwin), although I would much rather use my regular programming language instead. Creating a software architecture model as code: this is exactly what Structurizr is designed to do. I've recreated Robert's diagrams with Structurizr, and since the diagrams were created by a model described as Java code, that description can be diff'ed using your regular toolchain. Code provides opportunities: this perhaps isn't as obvious as Robert's visual approach, and I would likely still highlight the actual differences on diagrams using notation as Robert did, but creating a textual description of a software architecture model does provide some interesting opportunities.
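For readers who haven't seen the "architecture as code" style, here is a minimal, illustrative sketch using the open-source structurizr-java API. The element names are invented for the example; this is not a reproduction of Robert's actual diagrams.

import com.structurizr.Workspace;
import com.structurizr.model.Model;
import com.structurizr.model.Person;
import com.structurizr.model.SoftwareSystem;
import com.structurizr.view.SystemContextView;
import com.structurizr.view.ViewSet;

public class ArchitectureAsCode {
    public static void main(String[] args) {
        // The workspace holds both the model and the views rendered from it.
        Workspace workspace = new Workspace("Trading System", "System context model");
        Model model = workspace.getModel();

        // Elements and relationships are plain Java objects...
        Person trader = model.addPerson("Trader", "Places and monitors orders.");
        SoftwareSystem tradingSystem = model.addSoftwareSystem("Trading System", "Executes trades.");
        SoftwareSystem marketData = model.addSoftwareSystem("Market Data Feed", "External price source.");
        trader.uses(tradingSystem, "Places orders using");
        tradingSystem.uses(marketData, "Subscribes to prices from");

        // ...so adding or removing an element between versions shows up as an ordinary source-code diff.
        ViewSet views = workspace.getViews();
        SystemContextView contextView = views.createSystemContextView(
                tradingSystem, "context", "System context diagram.");
        contextView.addAllElements();
    }
}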
June 19, 2015
by Simon Brown
· 1,180 Views
Building Microservices: Using an API Gateway
Learn about using the microservice architecture pattern to build microservices and API gateways, compared with a monolithic application architecture.
June 16, 2015
by Patrick Nommensen
· 120,705 Views · 40 Likes
Foreign Key Relation Across Database
Foreign key references across a database can be found with a simple piece of Microsoft SQL Server code. The query works across tables to list the keys that relate them.
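The article itself uses a T-SQL script; as an alternative sketch, the same information can be pulled through standard JDBC metadata, which works against SQL Server and most other databases. The connection URL, schema ("dbo") and table name ("Orders") below are placeholders.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ForeignKeyLister {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string -- point it at your own SQL Server instance.
        String url = "jdbc:sqlserver://localhost:1433;databaseName=MyDb;user=sa;password=secret";
        try (Connection conn = DriverManager.getConnection(url)) {
            DatabaseMetaData meta = conn.getMetaData();
            // getImportedKeys lists the foreign keys declared on the given table
            // and the primary-key tables/columns they reference.
            try (ResultSet rs = meta.getImportedKeys(conn.getCatalog(), "dbo", "Orders")) {
                while (rs.next()) {
                    System.out.printf("%s.%s -> %s.%s (constraint %s)%n",
                            rs.getString("FKTABLE_NAME"), rs.getString("FKCOLUMN_NAME"),
                            rs.getString("PKTABLE_NAME"), rs.getString("PKCOLUMN_NAME"),
                            rs.getString("FK_NAME"));
                }
            }
        }
    }
}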
June 14, 2015
by Joydeep Das
· 22,467 Views · 1 Like
Why 12 Factor Application Patterns, Microservices and CloudFoundry Matter (Part 2)
Learn why 12 Factor Application Patterns, Microservices and CloudFoundry matter when trying to change the way your product is produced.
June 12, 2015
by Tim Spann DZone Core
· 15,385 Views · 4 Likes
Spring Integration Tests with MongoDB Rulez
Spring integration tests allow you to test functionality against a running application. This article shows proper database set-up and clean-up with MongoDB.
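As a rough illustration of the "rule" idea in the title (not the article's exact code), a JUnit ExternalResource rule can drop a MongoDB test database before and after each test. The database name and localhost connection details are assumptions for the sketch.

import com.mongodb.MongoClient;
import com.mongodb.client.MongoDatabase;
import org.junit.rules.ExternalResource;

/**
 * JUnit rule that gives each test a clean MongoDB database.
 * Usage in a test class:  @Rule public MongoCleanupRule mongo = new MongoCleanupRule("it_test");
 */
public class MongoCleanupRule extends ExternalResource {

    private final String databaseName;
    private MongoClient client;

    public MongoCleanupRule(String databaseName) {
        this.databaseName = databaseName;
    }

    @Override
    protected void before() {
        client = new MongoClient("localhost", 27017); // assumes a local test instance
        client.getDatabase(databaseName).drop();      // start from a known-empty state
    }

    @Override
    protected void after() {
        client.getDatabase(databaseName).drop();      // leave nothing behind for the next test
        client.close();
    }

    public MongoDatabase db() {
        return client.getDatabase(databaseName);
    }
}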
June 10, 2015
by Ralf Stuckert
· 21,155 Views · 2 Likes
Trimming Liquibase ChangeLogs
How do you clear out a changelog file? The best way to handle this is simply to break up your changelog file into multiple files.
June 8, 2015
by Nathan Voxland
· 12,150 Views
Regular Expression Denial of Service (ReDoS) Attacks: From Exploitation to Prevention
Authors: Michael Hidalgo, Dinis Cruz.

Introduction

When it comes to web application security, one of the recommendations for writing software that is resilient to attacks is to perform correct input data validation. However, as mobile applications and APIs (application programming interfaces) proliferate, the number of untrusted sources that data comes from goes up, and a potential attacker can take advantage of a lack of validation to compromise our applications. Regular expressions provide a versatile mechanism for performing input data validation. Developers use them to validate email addresses, zip codes, phone numbers and many other tasks that are easily implemented through them. Unfortunately, most of the time software engineers don't fully understand how regular expressions work in the background, and by choosing the wrong regular expression pattern they can introduce a risk into the application. In this article we discuss the so-called Regular Expression Denial of Service (ReDoS) vulnerability and how we can identify these problems early in the software development life cycle (SDLC) by enforcing a culture focused on unit testing.

Hardware used for this article: in order to provide information about execution time, performance, CPU utilisation and other facts, we are relying on a virtual machine running 32-bit Windows 7 with 5.22 GB RAM and an Intel Core i7-3820QM CPU @ 2.7 GHz, using 4 cores.

Understanding the Problem

The OWASP Foundation (2012) defines a Regular Expression Denial of Service attack as follows: "The Regular Expression Denial of Service (ReDoS) is a Denial of Service attack, that exploits the fact that most Regular Expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size). An attacker can then cause a program using a Regular Expression to enter these extreme situations and then hang for a very long time."

Although a broad explanation of regular expression engines is out of the scope of this article, it is important to understand that, according to Stubblebine, T. (Regular Expressions Pocket Reference), pattern matching consists of finding a section of text that is described (matched) by a regular expression. Two main rules are used to match results:

The earliest (leftmost) match wins: the regular expression is applied to the input starting at the first character and moving toward the last. As soon as the regular expression engine finds a match, it returns.
Standard quantifiers are greedy: according to Stubblebine, "quantifiers specify how many times something can be repeated. The standard quantifiers attempt to match as many times as possible. The process of giving up characters and trying less-greedy matches is called backtracking."

In this article we focus on a type of regular expression engine called a nondeterministic finite automaton (NFA). These engines compare each element of the regex to the input string, keeping track of positions where they chose between two options in the regex. If an option fails, the engine backtracks to the most recently saved position (Stubblebine, T., 2007). It is important to note that this type of engine is implemented in .NET, Java, Python, PHP and Ruby on Rails. This article is focused on C#, and therefore we rely on the Microsoft .NET Framework System.Text.RegularExpressions classes, which at heart use an NFA engine.
According to Bryan Sullivan: "one important side effect of backtracking is that while the regex engine can fairly quickly confirm a positive match (that is, an input string does match a given regex), confirming a negative match (the input string does not match the regex) can take quite a bit longer. In fact, the engine must confirm that none of the possible 'paths' through the input string match the regex, which means that all paths have to be tested. With a simple non-grouping regular expression, the time spent to confirm negative matches is not a huge problem."

In order to illustrate the problem, let's use the regular expression (\w+\d+)+c, which performs the following checks: between one and unlimited times, as many times as possible, giving back as needed, \w+ matches any word character (a-zA-Z0-9_), \d+ matches a digit (0-9), and then the character c is matched literally (case sensitive). Matching values are, for example, 12c, 1232323232c and !!!!cd4c, while non-matching values include !!!!!c, aaaaaac and abababababc. The following unit test was created to verify both cases:

const string RegexPattern = @"(\w+\d+)+c";

public void TestRegularExpression()
{
    var validInput = "1234567c";
    var invalidInput = "aaaaaaac";
    Regex.IsMatch(validInput, RegexPattern).assert_is_true();
    Regex.IsMatch(invalidInput, RegexPattern).assert_is_false();
}

Execution time: 6 milliseconds. Now that we've verified that our regular expression works well, let's write a new unit test to understand the backtracking problem and its performance effects. Note that the longer the string, the longer the regular expression engine will take to resolve it. We will generate 10 random strings, starting at a length of 15 characters and incrementing the length up to 25 characters, and then look at the execution times:

const string RegexPattern = @"(\w+\d+)+c";

[TestMethod]
public void IsValidInput()
{
    var sw = new Stopwatch();
    Int16 maxIterations = 25;
    for (var index = 15; index < maxIterations; index++)
    {
        sw.Start();
        // Generating x random numbers using the FluentSharp API
        var input = index.randomNumbers() + "!";
        Regex.IsMatch(input, RegexPattern).assert_false();
        sw.Stop();
        sw.Reset();
    }
}

Now let's take a look at the test results (random string – character length – elapsed time):

360817709111694! – 16 – 16 ms
2639383945572745! – 17 – 23 ms
57994905459869261! – 18 – 50 ms
327218096525942566! – 19 – 106 ms
4700367489525396856! – 20 – 207 ms
24889747040739379138! – 21 – 394 ms
156014309536784168029! – 22 – 795 ms
8797112169446577775348! – 23 – 1595 ms
41494510101927739218368! – 24 – 3200 ms
112649159593822679584363! – 25 – 6323 ms

Looking at these results, we can see that the execution time (the total time to resolve the input text against the regular expression) grows exponentially with the size of the input: appending a single character roughly doubles the execution time. This is an important finding because it shows how expensive this process is; if we do not have correct input data validation, we can introduce performance issues into our application.
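The same catastrophic backtracking can be reproduced in Java, whose java.util.regex package also uses a backtracking NFA-style engine. The following stand-alone sketch (an illustration added here, not part of the original article) times the same "evil" pattern against non-matching inputs of growing length; on a typical machine the time roughly doubles per extra character.

import java.util.regex.Pattern;

public class RedosDemo {
    public static void main(String[] args) {
        // The same "evil" pattern discussed above: nested quantifiers followed by a literal 'c'.
        Pattern evil = Pattern.compile("(\\w+\\d+)+c");

        // Lengths above ~25 already take many seconds, so stop at 24 to keep the demo short.
        for (int length = 16; length <= 24; length++) {
            // Build a digit string of the given length that ends with '!' so it can never match.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < length - 1; i++) {
                sb.append((char) ('0' + (i % 10)));
            }
            sb.append('!');
            String input = sb.toString();

            long start = System.nanoTime();
            boolean matches = evil.matcher(input).matches(); // forces the engine to try every split of the digit run
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.printf("length=%d matches=%b elapsed=%d ms%n", length, matches, elapsedMs);
        }
    }
}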
A Real-Life Use Case and an Appeal for a Unit Testing Approach

Now that we have seen the problems we can face by selecting a wrong (evil) regular expression, let's discuss a realistic scenario where we need to validate input data through regular expressions. We strongly believe that unit testing techniques can not only help us write quality code, but can also be used to find vulnerabilities in the code we write, by writing unit tests that perform security checks (like input data validation). A common task in web applications consists of requesting an email address from the user signing in to our application. From a UX (user experience) perspective, complying browsers show friendly error messages when an input that was supposed to be an email address does not match the required format, for example when an input textbox with the email type set receives a value that is not a valid email address. However, relying on UI validation alone is no longer enough: an eavesdropper can easily perform an HTTP request without using a browser (namely by using a proxy to capture data in transit) and then send a payload that can compromise our application. In the following use case, we use a backend validation for the email address with a regular expression. We are not only testing that the regular expression validates the input, but also how it behaves when it receives arbitrary input. We are using this evil regular expression to validate the email: ^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$. With the following test we verify that valid and invalid email formats are correctly processed by the regular expression, which is the functional aspect from a development point of view:

const string EmailRegex = @"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$";

[TestMethod]
public void ValidateEmailAddress()
{
    var validEmailAddress = "michael.hidalgo@owasp.org";
    var invalidEmailAddress = new string[] { "a", "abc.com", "1212", "aa.bb.cc", "aabcr@s" };
    Regex.IsMatch(validEmailAddress, EmailRegex).assert_is_true();
    // Looping through the invalid email addresses
    foreach (var email in invalidEmailAddress)
    {
        Regex.IsMatch(email, EmailRegex).assert_is_false();
    }
}

Elapsed time: 6 ms. Both cases are validated correctly. One could state that the two scenarios supported by this unit test are enough to select this regular expression for our input data validation. However, we can do more extensive testing, as you'll see.

The Exploit

So far the regular expression selected to validate an email address seems to work well; we have added unit tests that verify valid and invalid inputs. But how does it behave when we send arbitrary input of variable length? Do we face a denial of service attack? These kinds of questions can be answered with a unit testing technique like this one:

const string EmailRegex = @"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$";

[TestMethod]
public void ValidateEmailAddress()
{
    var validEmailAddress = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
    var watch = new Stopwatch();
    watch.Start();
    validEmailAddress.regex(EmailRegex).assert_is_false();
    watch.Stop();
    Console.WriteLine("Elapsed time {0}ms", watch.ElapsedMilliseconds);
    watch.Reset();
}

Elapsed time: ~23 minutes (1,423,127 milliseconds). The results are disturbing. We can clearly see the performance problem introduced by evaluating the given input: it takes roughly 23 minutes to validate it on the hardware described above. (The original post includes screenshots of CPU utilisation taken while this test was running.)
Fuzzing and Unit Testing: A Perfect Combination of Techniques

In the previous unit test we found that a given input string can lead to a denial of service issue in our application. Note that we didn't need an extremely large payload; in our scenario 34 characters (or even fewer) are enough to illustrate the problem. When using any regular expression, it is advisable to always exercise it with unit tests that cover as many of the inputs a user (who may be a potential attacker) can send as possible. This is where fuzzing comes in. Tobias Klein, in his book A Bug Hunter's Diary: A Guided Tour Through the Wilds of Software Security, defines fuzzing as follows: "A completely different approach to bug hunting is known as fuzzing. Fuzzing is a dynamic-analysis technique that consists of testing an application by providing it with malformed or unexpected input." Klein continues: "It isn't easy to identify the entry points of such complex applications, but complex software often tends to crash while processing malformed input data" (page 5). Mano Paul, in his book Official (ISC)2 Guide to the CSSLP, says of fuzzing: "Also known as fuzz testing or fault injection testing, fuzzing is a brute-force type of testing in which faults (random and pseudo-random input data) are injected into the software and its behaviour is observed. It is a test whose results are indicative of the extent and effectiveness of the input validation" (page 336).

Taking these definitions into consideration, we implement a new unit test that generates random input data and tests our regular expression. In this case we use the email regular expression ^[\w-\.]{1,}\@([\w]{1,}\.){1,}[a-z]{2,4}$, and by doing exhaustive testing we will see whether we are introducing a denial of service problem. We want to make sure that the elapsed time to decide whether a random string matches the regular expression is less than 3 seconds:

const string EmailRegex = @"^[\w-\.]{1,}\@([\w]{1,}\.){1,}[a-z]{2,4}$";
// Number of random strings to generate.
const int MaxIterations = 10000;

[TestMethod]
public void Fuzz_EmailAddress()
{
    // A valid email should return true
    "michael.hidalgo@owasp.org".regex(EmailRegex).assert_is_true();
    // An invalid email should return false
    "abce".regex(EmailRegex).assert_is_false();
    // Testing MaxIterations times
    for (int index = 0; index < MaxIterations; index++)
    {
        // Generating a random string
        var fuzzInput = (index * 5).randomString();
        var sw = new Stopwatch();
        sw.Start();
        fuzzInput.regex(EmailRegex).assert_is_false();
        // Elapsed time should be less than 3 seconds per input.
        sw.Elapsed.Seconds().assert_size_is_smaller_than(3);
    }
}

On the hardware described above, this test passes. Considering that we use the computation (index * 5), the largest string generated is 49,995 characters long (9999 * 5). In other words, we were able to test a very large string against the regular expression and confirm that, even for such a large input, the time needed to decide whether it was a valid email was less than 3 seconds. Adding a check on the length of the email up front would additionally guarantee that a malicious user can't inject a huge payload into our application.

Countermeasures Provided in Microsoft .NET 4.5 and Later

If you are developing applications on Microsoft .NET 4.5 or later, you can take advantage of a new overload of the IsMatch method of the Regex class.
Starting with .NET 4.5, the IsMatch method provides an overload that lets you specify a timeout (note that this overload is not available in .NET 4.0). The new parameter is called matchTimeout, and according to Microsoft: "The matchTimeout parameter specifies how long a pattern matching method should try to find a match before it times out. Setting a time-out interval prevents regular expressions that rely on excessive backtracking from appearing to stop responding when they process input that contains near matches. For more information, see Best Practices for Regular Expressions in the .NET Framework and Backtracking in Regular Expressions. If no match is found in that time interval, the method throws a RegexMatchTimeoutException exception. matchTimeout overrides any default time-out value defined for the application domain in which the method executes."

We've written a new unit test using a regular expression that we know can lead to denial of service, testing an email address that previously caused a significant performance hit, to see how we can reduce the impact by setting a timeout:

const string EmailRegexPattern = @"^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,9})$";

[TestMethod]
public void ValidateEmailAddress()
{
    var emailAddress = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!";
    var watch = new Stopwatch();
    watch.Start();
    // Timeout of 5 seconds
    try
    {
        Regex.IsMatch(emailAddress, EmailRegexPattern, RegexOptions.IgnoreCase, TimeSpan.FromSeconds(5));
    }
    catch (Exception ex)
    {
        ex.Message.assert_not_null();
        ex.GetType().assert_is(typeof(RegexMatchTimeoutException));
    }
    finally
    {
        watch.Stop();
        watch.Elapsed.Seconds().assert_size_is_smaller_than(5);
        watch.Reset();
    }
}

Running this test in Visual Studio, we can confirm it passes: because the backtracking takes longer than 5 seconds to resolve, a RegexMatchTimeoutException is thrown, indicating that evaluating the input would take longer than the configured timeout. Ideally one would expect this process to take less than a second, but various requirements might justify a timeout measured in seconds. Note how this model enables a much-needed defensive programming style in which software engineers make informed decisions in the code they write: in this case we can decide what happens next when our method times out, and in that way reduce the impact of a denial of service attack.
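Java's java.util.regex has no built-in equivalent of matchTimeout, so a common workaround (shown here as an illustrative sketch, not an established API) is to run the match on another thread and bound how long the caller waits. Note the caveat in the comments: cancelling the future does not forcibly stop the matching thread, so this bounds the caller's latency rather than the CPU spent.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

public class BoundedRegexMatch {

    // Daemon worker threads, so a still-backtracking match cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(runnable -> {
        Thread t = new Thread(runnable, "regex-worker");
        t.setDaemon(true);
        return t;
    });

    /** Returns the match result, or false if no answer arrives within timeoutMillis. */
    static boolean matchesWithin(Pattern pattern, String input, long timeoutMillis) {
        Future<Boolean> result = POOL.submit(() -> pattern.matcher(input).matches());
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Treat a timeout as "does not match" -- the safe default for input validation.
            // Caveat: the worker thread may keep backtracking in the background.
            result.cancel(true);
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Pattern email = Pattern.compile(
                "^([0-9a-zA-Z]([-.\\w]*[0-9a-zA-Z])*@([0-9a-zA-Z][-\\w]*[0-9a-zA-Z]\\.)+[a-zA-Z]{2,9})$");
        // The pathological input from the article: answered as 'false' after ~1 second instead of minutes.
        System.out.println(matchesWithin(email, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!", 1000));
    }
}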
Final Thoughts

"No one size fits all" is so clichéd that it has to be true. We are not sure whether the regular expressions you are currently using in your applications are vulnerable to this attack. What we can do is show you how to take advantage of unit testing to write secure code. When we write code we want to make sure that every single line is covered by a unit test, which at the end of the day guarantees early detection of errors. If we combine this practice with tests that also try to attack or compromise the application (nothing fancy: sending random strings, using fuzzing techniques, using unusual combinations of characters, exceeding the expected length), we will be helping to write software that is resilient to attacks. As a recommendation, always exercise your regular expressions with unit tests, make sure they are resilient to the attack covered in this article, and if you are able to identify problematic patterns out there, contribute and report them so we don't keep introducing them into the software we write.

References
1. Cruz, Dinis (2013). The email regex that (could have) DoSed a site.
2. Hollos, S., Hollos, R. (2013). Finite Automata and Regular Expressions: Problems and Solutions.
3. Kirrage, J., Rathnayake, A., Thielecke, H. Static Analysis for Regular Expression Denial-of-Service Attacks. University of Birmingham, UK.
4. Klein, T. (2011). A Bug Hunter's Diary: A Guided Tour Through the Wilds of Software Security.
5. The OWASP Foundation (2012). Regular Expression Denial of Service – ReDoS.
6. Stubblebine, T. (2007). Regular Expression Pocket Reference, Second Edition.
7. Sullivan, B. (2010). Regular Expression Denial of Service Attacks and Defenses.
June 7, 2015
by Michael Hidalgo
· 33,243 Views · 5 Likes
Easy SQLite on Android with RxJava
Whenever I consider using an ORM library on my Android projects, I always end up abandoning the idea and rolling my own layer instead, for a few reasons: my database models have never reached the level of complexity that ORMs help with, and every ounce of performance counts on Android, so I can't help but fear that the generated SQL will not be as optimized as it should be. Recently, I started using a pretty simple design pattern that uses Rx to offer what I think is a fairly simple way of managing your database access with RxJava.

Easy reads

One of the important design principles on Android is to never perform I/O on the main thread, and this obviously applies to database access. RxJava turns out to be a great fit for this problem. I usually create one Java class per table, and these tables are then managed by my SQLiteOpenHelper. With this new approach, I decided to extend my use of the helper and make it the only point of access to anything that needs to read or write to my SQL tables. Let's consider a simple example: a USERS table managed by the UserTable class. A plain, synchronous getUsers() method has a problem: if you're not careful, you will call it on the main thread, so it's up to the caller to make sure they always invoke it on a background thread (and then post their UI update back to the main thread, if they are updating the UI). Instead of relying on managing yet another thread pool or, worse, using AsyncTask, we rely on RxJava to take care of the threading model for us: the table class returns its query result lazily (as a Callable), and the database helper turns that lazy result into an Observable:

// MySqliteOpenHelper.java
Observable<List<User>> getUsers(String userId) {
    return makeObservable(mUserTable.getUsers(getReadableDatabase(), userId))
        .subscribeOn(Schedulers.io());
}

Notice that on top of turning the lazy result into an Observable, the helper forces the subscription to happen on a background thread (the IO scheduler here, since we're accessing the database). This guarantees that callers never have to worry about blocking the main thread. The makeObservable method itself is pretty straightforward (and completely generic):

// MySqliteOpenHelper.java
private static <T> Observable<T> makeObservable(final Callable<T> func) {
    return Observable.create(
        new Observable.OnSubscribe<T>() {
            @Override
            public void call(Subscriber<? super T> subscriber) {
                try {
                    subscriber.onNext(func.call());
                } catch (Exception ex) {
                    Log.e(TAG, "Error reading from the database", ex);
                }
            }
        });
}

At this point, all our database reads have become Observables that guarantee the queries run on a background thread. Accessing the database is now pretty standard Rx code:

// DisplayUsersFragment.java
@Inject
MySqliteOpenHelper mDbHelper;
// ...
mDbHelper.getUsers(userId)
    .observeOn(AndroidSchedulers.mainThread())
    .subscribe(new Action1<List<User>>() {
        @Override
        public void call(List<User> users) {
            // Update our UI with the users
        }
    });

And if you don't need to update your UI with the results, just observe on a background thread. Since your database layer now returns Observables, it's trivial to compose and transform these results as they come in. For example, you might decide that your ContactTable is a low-level class that should not know anything about your model (the User class) and that, instead, it should only return low-level objects (maybe a Cursor or ContentValues). Then you can use Rx to map these low-level values into your model classes for an even cleaner separation of layers. Two additional remarks:

Your Table Java classes should contain no public methods: only package-protected methods (which are accessed exclusively by your helper, located in the same package) and private methods. No other classes should ever access these Table classes directly.
This approach is extremely compatible with dependency injection: it's trivial to have both your database helper and your individual tables injected (additional bonus: with Dagger 2, your tables can have their own component, since the database helper is the only reference needed to instantiate them).

This is a very simple design pattern that has scaled remarkably well for our projects while fully enabling the power of RxJava. I also started extending this layer to provide a flexible update-notification mechanism for list view adapters (not unlike what SQLBrite offers), but that will be for a future post. This is still a work in progress, so feedback is welcome!
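A minimal sketch of the "map low-level values into your model" idea mentioned above, using RxJava 1.x's map operator and written as a fragment in the same style as the article's snippets. The queryUsersCursor method and the toUsers helper are hypothetical names invented for the illustration.

// MySqliteOpenHelper.java (illustrative sketch)
Observable<List<User>> getUsers(String userId) {
    // The table layer only knows about Cursors; the helper maps them to model objects.
    return makeObservable(mUserTable.queryUsersCursor(getReadableDatabase(), userId)) // hypothetical method
        .map(new Func1<Cursor, List<User>>() {
            @Override
            public List<User> call(Cursor cursor) {
                return toUsers(cursor); // hypothetical Cursor -> List<User> conversion
            }
        })
        .subscribeOn(Schedulers.io());
}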
June 4, 2015
by Cedric Beust
· 16,006 Views
How to Extract OMR Data from Scanned Images inside C# & VB.NET Apps
This technical tip shows how to extract OMR data from a scanned image inside .NET applications. Developers can use the Aspose.OMR namespace to extract data entered into OMR (Optical Mark Recognition) forms, such as consumer surveys, assessment sheets, or lottery tickets. This article describes how to extract data from a scanned page using an OMR template created with the OMR Template Editor tool. The Aspose.Omr namespace can be used to recognize and extract different elements from a scanned image. A template is required that contains a graphical mapping of the elements to be recognized on the scanned images. This template document is loaded using OmrTemplate. Pages with human-marked data, like surveys or tests, are scanned, and the images are loaded using the OmrImage class. A recognition engine (OmrEngine) is required for the template and can be instantiated using the current template as an argument. After the template has been set in the engine, the data is extracted from the scanned image using the OmrEngine class's ExtractData function. Below are the steps to extract OMR data from a scanned image:

1. Load the template using OmrTemplate.
2. Load the scanned image into OmrImage.
3. Instantiate the recognition engine (OmrEngine) for the template.
4. Extract the data and save it to OmrProcessingResult.

// The sample code below shows how to extract OMR data from a scanned image

// [C#]
// Load template file
OmrTemplate template = OmrTemplate.Load(MyDir + "Grids.amr");
// Load the image to be analyzed
OmrImage image = OmrImage.Load(MyDir + "Grids-filled-scan.jpg");
// Instantiate the recognition engine for the template
OmrEngine engine = new OmrEngine(template);
// Extract data. This template has only one page.
OmrProcessingResult result = engine.ExtractData(new OmrImage[] { image });
// Load actual result
Hashtable OmrResult = result.PageData[0];
// Get collection of keys
ICollection key = OmrResult.Keys;
foreach (string k in key)
{
    Console.WriteLine(k + ": " + OmrResult[k]);
}
Console.ReadKey();

// [VB.NET]
' Load template file
Dim template As OmrTemplate = OmrTemplate.Load(MyDir + "Grids.amr")
' Load the image to be analyzed
Dim image As OmrImage = OmrImage.Load(MyDir + "Grids-filled-scan.jpg")
' Instantiate the recognition engine for the template
Dim engine As OmrEngine = New OmrEngine(template)
' Extract data. This template has only one page.
Dim result As OmrProcessingResult = engine.ExtractData(New OmrImage() {image})
' Load actual result
Dim OmrResult As Hashtable = result.PageData(0)
' Get collection of keys
Dim Key As ICollection = OmrResult.Keys
For Each k As String In Key
    Console.WriteLine(k + ": " + OmrResult(k))
Next
Console.ReadKey()
June 3, 2015
by David Zondray
· 11,958 Views
Mounting an EBS Volume to Docker on AWS Elastic Beanstalk
Mounting an EBS volume to a Docker instance running on Amazon Elastic Beanstalk (EB) is surprisingly tricky. The good news is that it is possible. I will describe how to automatically create and mount a new EBS volume (optionally based on a snapshot). If you would prefer to mount a specific, existing EBS volume, you should check out leg100's docker-ebs-attach (which uses the AWS API to mount the volume); you can use it either in a multi-container setup or just include the relevant parts in your own Dockerfile. The problem with EBS volumes is that, if I am correct, a volume can only be mounted to a single EC2 instance – and thus doesn't play well with EB's autoscaling. That is why EB supports only creating and mounting a fresh volume for each instance.

Why would you want to use an auto-created EBS volume? You can already use a Docker VOLUME to mount a directory on the host system's ephemeral storage to make data persistent across Docker restarts/redeploys. The only advantage of EBS is that it survives restarts of the EC2 instance, but that is something that, I suppose, happens rarely. I suspect that in most cases EB actually creates a new EC2 instance and then destroys the old one. One possible benefit of an EBS volume is that you can take a snapshot of it and use that to launch future instances. I'm now inclined to believe that a better solution in most cases is to set up automatic backup to and restore from S3, f.ex. using duplicity with its S3 backend (as I do for my NAS). Anyway, here is how I got EBS volume mounting working. There are 4 parts to the solution:

1. Configure EB to create an EBS mount for your instances
2. Add custom EB commands to format and mount the volume upon first use
3. Restart the Docker daemon after the volume is mounted so that it will see it (see this discussion)
4. Configure Docker to mount the (mounted) volume inside the container

1-3.: .ebextensions/01-ebs.config:

# .ebextensions/01-ebs.config
commands:
  01format-volume:
    command: mkfs -t ext3 /dev/sdh
    test: file -sL /dev/sdh | grep -v 'ext3 filesystem'
    # ^ prints '/dev/sdh: data' if not formatted
  02attach-volume:
    ### Note: The volume may be renamed by the Kernel, e.g. sdh -> xvdh but
    # /dev/ will then contain a symlink from the old to the new name
    command: |
      mkdir /media/ebs_volume
      mount /dev/sdh /media/ebs_volume
      service docker restart # We must restart the Docker daemon or it won't see the new mount
    test: sh -c "! grep -qs '/media/ebs_volume' /proc/mounts"

option_settings:
  # Tell EB to create a 100GB volume and mount it to /dev/sdh
  - namespace: aws:autoscaling:launchconfiguration
    option_name: BlockDeviceMappings
    value: /dev/sdh=:100

4.: Dockerrun.aws.json and Dockerfile:

Dockerrun.aws.json: mount the host's /media/ebs_volume as /var/easydeploy/share inside the container:

{
  "AWSEBDockerrunVersion": "1",
  "Volumes": [
    {
      "HostDirectory": "/media/ebs_volume",
      "ContainerDirectory": "/var/easydeploy/share"
    }
  ]
}

Dockerfile: tell Docker to use a directory on the host system as /var/easydeploy/share – either a randomly generated one or the one given via the -m mount option to docker run:

...
VOLUME ["/var/easydeploy/share"]
...
June 3, 2015
by Jakub Holý
· 14,492 Views
Ecosystem of Hadoop Animal Zoo
Hadoop is best known for MapReduce and its distributed file system (HDFS). Recently, other productivity tools developed on top of these have come to form a complete Hadoop ecosystem. Most of the projects are hosted under the Apache Software Foundation. The Hadoop ecosystem projects are listed below.

Hadoop Common: a set of components and interfaces for distributed file systems and I/O (serialization, Java RPC, persistent data structures). http://hadoop.apache.org/

HDFS: a distributed file system that runs on large clusters of commodity hardware. The Hadoop Distributed File System (HDFS, renamed from NDFS) is a scalable data store for semi-structured, unstructured and structured data. http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfsuserguide.html http://wiki.apache.org/hadoop/hdfs

MapReduce: MapReduce is the distributed, parallel computing programming model for Hadoop, inspired by Google's MapReduce research paper; Hadoop includes an implementation of the model. In MapReduce there are two phases, not surprisingly map and reduce; to be precise, in between the map and reduce phases there is another phase called sort and shuffle. The Job Tracker on the name node machine manages the other cluster nodes. MapReduce programs can be written in Java (a minimal word-count sketch appears at the end of this overview); if you prefer SQL or other non-Java languages, you are still in luck: you can use the utility called Hadoop Streaming. http://wiki.apache.org/hadoop/hadoopmapreduce

Hadoop Streaming: a utility that enables MapReduce code in many languages like C, Perl, Python, C++, Bash, etc.; examples include a Python mapper and an AWK reducer. http://hadoop.apache.org/docs/r1.2.1/streaming.html

Avro: a serialization system for efficient, cross-language RPC and persistent data storage. Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig. It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro. http://avro.apache.org/

Apache Thrift: Apache Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business. http://thrift.apache.org/

Hive and Hue: if you like SQL, you will be delighted to hear that you can write SQL and Hive will convert it to a MapReduce job, though you don't get a full ANSI-SQL environment. Hue gives you a browser-based graphical interface to do your Hive work. Hue features a file browser for HDFS, a job browser for MapReduce/YARN, an HBase browser, query editors for Hive, Pig, Cloudera Impala and Sqoop2. It also ships with an Oozie application for creating and monitoring workflows, a ZooKeeper browser and an SDK.

Pig: a high-level data flow programming language and execution environment for MapReduce coding; the Pig language is called Pig Latin. You may find the naming conventions somewhat unconventional, but you get incredible price-performance and high availability. https://pig.apache.org/

Jaql: Jaql is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data.
As its name implies, a primary use of Jaql is to handle data stored as JSON documents, but Jaql can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within Jaql" capability lets programmers work with structured SQL data while employing a JSON data model that's less restrictive than its Structured Query Language counterparts. See: 1. Jaql on Google Code; 2. What is Jaql? by IBM.

Sqoop: Sqoop provides bi-directional data transfer between Hadoop HDFS and your favorite relational database. For example, you might be storing your app data in a relational store such as Oracle; now you want to scale your application with Hadoop, so you can migrate the Oracle data to HDFS using Sqoop. http://sqoop.apache.org/

Oozie: manages Hadoop workflows. This doesn't replace your scheduler or BPM tooling, but it does provide if-then-else branching and control within Hadoop jobs. https://oozie.apache.org/

ZooKeeper: a distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building highly scalable applications, and it is used to manage synchronization for the cluster. http://zookeeper.apache.org/

HBase: based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS – a super scalable key-value store. It works very much like a persistent hash map (for Python developers, think of a dictionary). It is not a conventional relational database; it is a distributed, column-oriented database that uses HDFS for its underlying storage and supports both batch-style computation using MapReduce and point queries for random reads. https://hbase.apache.org/

Cassandra: a column-oriented NoSQL data store which offers scalability and high availability without compromising on performance. It is a perfect platform for commodity hardware and cloud infrastructure. Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching. http://cassandra.apache.org/

Flume: a real-time loader for streaming your data into Hadoop; it stores data in HDFS and HBase. Flume "channels" data between "sources" and "sinks", and its data harvesting can either be scheduled or event-driven. Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase. http://flume.apache.org/

Mahout: machine learning for Hadoop, used for predictive analytics and other advanced analysis. There are currently four main groups of algorithms in Mahout: recommendations (a.k.a. collaborative filtering), classification (a.k.a. categorization), clustering, and frequent item set mining (a.k.a. parallel frequent pattern mining). Mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable, that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion. http://en.wikipedia.org/wiki/list_of_machine_learning_algorithms https://www.coursera.org/course/machlearning https://mahout.apache.org/

Fuse: makes HDFS look like a regular file system, so that you can use ls, rm, cd, etc. directly on HDFS data.

Whirr: Apache Whirr is a set of libraries for running cloud services. Whirr provides a cloud-neutral way to run services.
You don't have to worry about the idiosyncrasies of each provider: Whirr offers a common service API, while the details of provisioning remain particular to each service. It also provides smart defaults for services, so you can get a properly configured system running quickly while still being able to override settings as needed. You can also use Whirr as a command-line tool for deploying clusters. https://whirr.apache.org/

Giraph: an open source graph processing API like Pregel from Google. https://giraph.apache.org/

Chukwa: Chukwa, an incubator project at Apache, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing and storage in Hadoop. It is included in the Apache Hadoop distribution as an independent module. https://chukwa.apache.org/

Drill: Apache Drill, an incubator project at Apache, is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open-source version of Google's Dremel system, which is available as an IaaS service called Google BigQuery. One explicitly stated design goal is that Drill should be able to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds. http://incubator.apache.org/drill/

Impala (Cloudera): released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on a cluster in order for Impala to work. The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs" (source: Cloudera). http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html http://training.cloudera.com/elearning/impala/
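To make the map and reduce phases mentioned above concrete, here is the classic word-count example written against Hadoop's Java MapReduce API; a minimal, illustrative sketch rather than production code, with input and output paths taken from the command line.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in every input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: after sort-and-shuffle, all counts for a word arrive together; sum them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}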
June 3, 2015
by Umashankar Ankuri
· 23,568 Views · 3 Likes
The Myth of Asynchronous JDBC
I keep seeing people (especially in the Scala/Typesafe world) posting about async JDBC libraries. Stop it! Under the current APIs, async JDBC belongs in a realm with unicorns, tiger squirrels, and eight-foot spiders. While you might be able to move the blocking operations, queue requests, and keep your "main" worker threads from blocking, JDBC is synchronous: at some point, somewhere, there's going to be a thread blocked waiting for a response. It's frustrating to see so many folks hyping this and muddying the waters. Unless you write your own client for a DBMS and have a DBMS that can multiplex calls over a single connection (or use some other strategy to enable this capability), database access is going to block. It's not impossible to make the calls completely async, but nobody has built it yet. Yes, I know ajdbc is taking a stab at this capability, but even it uses a thread pool for the blocking calls (by default). Someday we'll have async database access (it's not impossible... well, it IS with the current JDBC specification), but no general-purpose RDBMS has this right now. The primary problems with the hype/misdirection are that (1) inexperienced programmers don't understand that they've just moved the problem, and will use the APIs and wonder why the system is so slow ("oh, I have 1000 DB calls queued up waiting for my single DB thread to process the work"), and (2) it betrays a serious misunderstanding of the difference between async JDBC (not possible per the current spec) and async DB access (totally possible/doable, but rare in the wild).
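As a small illustration of the point about moving the blocking rather than removing it, here is a sketch of a typical "async" JDBC wrapper: the caller gets a CompletableFuture immediately, but a pool thread still sits blocked inside the driver for the duration of the query. The connection URL and query are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FakeAsyncJdbc {

    // The "async" illusion: a small pool of threads that will do the blocking for us.
    private static final ExecutorService JDBC_POOL = Executors.newFixedThreadPool(4);

    static CompletableFuture<Integer> countUsers(String jdbcUrl) {
        return CompletableFuture.supplyAsync(() -> {
            // Everything inside this lambda is plain, synchronous JDBC:
            // the pool thread blocks on the socket until the database answers.
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM users")) {
                rs.next();
                return rs.getInt(1);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }, JDBC_POOL);
    }

    public static void main(String[] args) {
        // The caller is "non-blocking"...
        countUsers("jdbc:postgresql://localhost/app") // placeholder URL
                .thenAccept(count -> System.out.println("users = " + count));
        // ...but with only 4 pool threads, the 5th concurrent query simply waits in the queue:
        // the blocking has been relocated, not eliminated.
    }
}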
May 29, 2015
by Michael Mainguy
· 17,524 Views · 2 Likes