DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Data Topics

article thumbnail
Data's Hierarchy of Needs
This post originally published in the AppsFlyer blog. A couple of weeks ago Nir Rubinshtein and I presented AppsFlyer’s data architecture in a meetup ofBig Data & Data Science Israel. One of the concepts that I presented there, which is worth expanding upon is “Data’s Hierarchy of Needs:” Data should Exist Data should be Accessible Data should be Usable Data should be Distilled Data should be Presented How can we make data “achieve its pinnacle of existence” and be acted upon? In other words, what are the areas that should be addressed when designing a data architecture if you want it to be complete and enable creating insights and value from the data you generate and collect. If done properly, your users might just act upon the data you provide. This list might seem a little simplistic but it is not a prescription of what to do but rather a set of reminders of areas we need to cover and questions we need answered to properly create a data architecture. Data Should Exist Well, of course data should exist, and it probably does. You should ask yourself however, is if the data that exists is the right data? Does the retention policy you have service the business needs? Does the availability fit your needs? Do you have all the needed links (foreign keys) to other data so you’d be able to connect it later for analysis? To make this more concrete, consider the following example: AppsFlyer accepts several types of events (launches, in-app events, etc.) which are tied to apps. Apps are connected to accounts (an account would have one or more applications, usually at least, an iOS app and an Android one). If we would save the accounts as the latest snapshot and an app changes ownership, the historical data before that change would be skewed. If we treat the accounts as a slowly changing dimension of the events, then we’d be able to handle the transition correctly. Note that we may still choose to provide the new owner the historic data but now it not the only option the system support and the decision can be based on the business needs. Data Should Be Accessible If data is written to disk it is accessible programmatically at least, however, there can be many levels of accessibility and we need to think about our end users needs and the level of access they’d require. At AppsFlyer, the data existence (mentioned above) is handled by processing all the messages that go through our queues using Kafka but that data is saved in sequence files and stored by event time. Most of our usage scenarios do have a time component but they are primarily handled by the app or account. Any processing that needs a specific account and would access the raw events would have to sift through tons of records (3.7+ billion a day at the time of this post) to find the few relevant ones. Thus, one basic move toward accessibility of data is to sort by apps so that queries will only need to access a small subset of the data and thus run much faster. Then we need to consider the “hotness” of the data i.e. what response times we need and for which types of data. For instance, aggregations, such as retention reports need to be accessed online (so called “sub-second” response), latest counts need near real-time , explorations of data for new patterns can take hours etc. To enable support of these varied usage scenarios, we need to create multiple projections of our data, most likely using several different technologies. AppsFlyer stores raw data in sequence files, processed data in parquet files (accessible via Apache Spark), aggregations and recent data in columnar RDBMS and near real-time is stored in-memory. The three different storage mechanisms I mentioned above (Parquet, columnar RDBMS and In-Memory Data Grid) used in AppsFlyer all have SQL access; this is not by chance. While we (the industry) went through a short period of NoSQL, SQL or almost-SQL is getting back to be the norm, even for semi-structured and poly-structured data. Providing an SQL interface to your data is another important aspect of data accessibility as it allows expanding the user base for the data beyond R&D. Again, this is important not just for your relational data… Data Should Be Usable What’s the difference between accessible data and usable data? For one there’s data cleansing. This is a no-brainer if you pull data from disparate systems but it is also needed if your source is a single system. Data cleansing is what traditional ETL is all about and the techniques still apply. Another aspect of making data usable is enriching it or connecting it to additional data. Enriching can happen from internal sources like linking CRM data to the account info. This can also be facilitated by external sources as with getting the app category from the app store or getting device screen size from a device database. Last but not least, is to consider legal and privacy aspects of the data. Before allowing access to the data you may need to mask sensitive information or remove privacy-related data (sometimes you shouldn’t even save it in the first place). At AppsFlyer we take this issue very seriously and make major efforts to comply when working with partners and clients to make sure privacy-related data is handled correctly. In fact, we are also undergoing independent SOC auditing to make sure we are compliant with the highest standards. To summarize, to make the data usable you have to make sure it is correct, connect it to other data and you need to make sure that it is compliant with legal and privacy issues. Data Should Be Distilled Distilling insights is the reason we perform all the previous steps. Data in itself is of little use if it doesn’t help us make better decisions. There are multiple types of insights you can generate here beginning from the more traditional BI scenarios of slice and dice analytics going through real-time aggregations and trend analysis, ending in applying machine learning or “advanced analytics”. You can see one example of the type of insights that can be gleaned from our data by looking at theGaming Advertising Performance Index we recently published. Data Should Be Presented This point ties in nicely with the Gaming Advertising Performance Index example provided above. Getting insights is an important step, but if you fail to present them in a coherent and cohesive manner then the actual value users would be able to make of it is limited at best. Note that even if you use insights for making decisions (e.g. recommending a product to a user) you’d still need to present how well this decision is doing. There are many issues that need to be dealt with from UX perspective both in how users interact with the data and how the data is presented. An example of the former is deciding on chart types for the data. A simple example for the latter is when presenting projected or inaccurate data it should be clear to the users that they are looking at approximations to prevent support calls on numbers not adding up. Making sure all the areas discussed above are covered and handled properly is a lot of work but providing a solution that actually helps your users make better decisions is well worth it. The data’s hierarchy of needs is not a prescription of how to get there, it is merely a set of waypoints to help navigate toward this end goal. It helps me think holistically about AppsFlyer data needs and I hope following this post it would also help you. For more information about our architecture, check out the presentation from the meetup: Architecture for Real-Time and Batch Big Data Analytics Distilling insights @ AppsFlyer
June 21, 2015
by Arnon Rotem-gal-oz
· 1,118 Views
article thumbnail
Enabling DataOps with Easy Log Analytics
DataOps is becoming an important consideration for organizations. Why? Well, DataOps is about making sure data is collected, analyzed, and available across the company – i.e. Ops insight for your decision-making systems like Hubspot, Tableau, Salesforce and more. Such systems are key to day-to-day operations and in many cases are as important as keeping your customer facing systems up and running. If you think about it, today every online business is a data driven business! Everyone is accountable to have up to the minute answers on what is happening across their systems. You can’t do this reliably without having DataOps in place. We have seen this trend across our own customer base at Logentries where more and more customers using log data to implement DataOps across their organization. Using log data for DataOps allows you to perform the following: Troubleshoot your systems managing your data by identifying errors and correlating data sources Get notified when one of these systems is experiencing issues via real time alerts or anomaly detection Analyze how these systems are used by the organization Logentries has always been great at 1 and 2 above, and this week we have enhanced Logentries to now allow you to perform easier and more powerful analytics with our new easy-to-use SQL like query language – Logentries QL (LEQL). LEQL is designed to make analyzing your log data dead simple. There are too many log management tools that are built around complex query languages and require data scientists to operate. Logentries is all about making log data accessible to anyone. With LEQL you are going to be able to use analytical functions like CountUnique, Min, Max, GroupBy, Sort…A number of our users have already been testing these out via our beta program. One great example is how Pluralsight has been using Logentries to manage and understand the usage of their Tableau environment. For example: Calculating the rate of errors over the the past 24 hours e.g. using LEQL Count function Understanding user usage patterns e.g. using GroupBy to understand queries performed grouped by different users Sorting the data to find the most popular queries and how long they are taking Being able to answer these types of questions enables DataOps teams to understand where they need to invest time going forward. For example, do I need to add capacity to improve query performance? Are internal teams having a good user experience or are they getting a lot of errors when they try to access data? At Logentries we are all about making the power of log data accessible to everyone and as we do this we are constantly seeing cool new use cases when using logs. If you have some cool use cases do let us know!
June 21, 2015
by Trevor Parsons
· 953 Views
article thumbnail
Diff'ing Software Architecture Diagrams
robert annett wrote a post titled diagrams for system evolution where he describes a simple approach to showing how to visually describe changes to a software architecture. in essence, in order to show how a system is to change, he'll draw different versions of the same diagram and use colour-coding to highlight the elements/relationships that will be added, removed or modified. i've typically used a similar approach for describing as-is and to-be architectures in the past too. it's a technique that works well. although you can version control diagrams, it's still tricky to diff them using a tool. one solution that addresses this problem is to not create diagrams, but instead create a textual description of your software architecture model that is then subsequently rendered with some tooling. you could do this with an architecture description language (such as darwin ) although i would much rather use my regular programming language instead. creating a software architecture model as code this is exactly what structurizr is designed to do. i've recreated robert's diagrams with structurizr as follows. and since the diagrams were created by a model described as java code , that description can be diff'ed using your regular toolchain. code provides opportunities this perhaps isn't as obvious as robert's visual approach, and i would likely still highlight the actual differences on diagrams using notation as robert did too. creating a textual description of a software architecture model does provide some interesting opportunities though.
June 19, 2015
by Simon Brown
· 1,356 Views
article thumbnail
Building Microservices: Using an API Gateway
Learn about using the microservice architecture pattern to build microservices and API gateways--compared to the usage of monolithic application architecture.
June 16, 2015
by Patrick Nommensen
· 121,122 Views · 40 Likes
article thumbnail
Why 12 Factor Application Patterns, Microservices and CloudFoundry Matter (Part 2)
Learn why 12 Factor Application Patterns, Microservices and CloudFoundry matter when trying to change the way your product is produced.
June 12, 2015
by Tim Spann DZone Core CORE
· 15,648 Views · 4 Likes
article thumbnail
Regular Expressions Denial of the Service (ReDOS) Attacks: From the Exploitation to the Prevention
autors :michael hidalgo, dinis cruz introduction when it comes to web application security, one of the recommendations to write software that is resilient to attacks is to perform a correct input data validation. however, as mobile applications and apis (application programming interface) proliferates, the number of untrusted sources where data comes from goes up, and a potential attacker can take advantage of the lack of validations to compromise our applications. regular expressions provides a versatile mechanism to perform input data validation. developers use them to validate email addresses, zip codes, phone numbers and many other task that are easily implemented thought them. unfortunately most of the time software engineers don't fully understand how regular expressions works in the background and by choosing a wrong regular expression pattern they can introduce a risk in the application. in this article we are going to discuss about the so called regular expression denial of the service (redos) vulnerability and how we can identify this problems early in the software development life cycle (sdlc) stages by enforcing a culture focused on unit testing. hardware features for this article in order to provide information about execution time, performance, cpu utilisation and other facts, we are relying on virtual machine that uses windows 7 32-bit operating system, 5.22 gb ram. intel(r) core (tm) it-3820qm cpu @2.7 ghz. we are also using 4 cores. understanding the problem. the owasp foundation (2012) defines a regular regular expression denial of service attack as follows: "the regular expression denial of service (redos) is a denial of service attack, that exploits the fact that most regular expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size). an attacker can then cause a program using a regular expression to enter these extreme situations and then hang for a very long time." although a broad explanation about regular expression engines is out of the scope of this article,it is important to understand that, according to stubblebine,t (regular expressions pocket reference), a pattern matching consist of finding a section of text that is described (matched) by a regular expression. two main rules are used to match results: the earliest (leftmost) wins : the regular expression is applied to the input starting at the first character and moving toward the last. as soon as the regular expression engine finds a match,it returns. standard quantifiers are greedy : according to stubblebine, "quantifiers specify how many times something can be repeated. the standard quantifiers attempt to match as many times as possible. the process of giving up characters and trying less-greedy matches is called backtracking." for this article we are focused a regular expression engine called nondeterministic finite automaton (nfa).this engines usually compare each element of the regex to the input string, keeping track of positions where it chose between two options in the regex. if an option fails, the engine backtracks to the most recently saved position.(stubblebine,t 2007). it is important to note that this engine is also implemented in .net, java, python, php and ruby on rails. this article is focused on c# and therefore we are relying on the microsoft .net framework system.text.regularexpression classes which at the heart uses nfa engines. according to bryan sullivan "one important side effect of backtracking is that while the regex engine can fairly quickly confirm a positive match (that is, an input string does match a given regex), confirming a negative match (the input string does not match the regex) can take quite a bit longer. in fact, the engine must confirm that none of the possible “paths” through the input string match the regex, which means that all paths have to be tested. with a simple non-grouping regular expression, the time spent to confirm negative matches is not a huge problem." in order to illustrate the problem, let's use this regular expression (\w+\d+)+c which basically performs the following checks: between one and unlimited times, as many times as possible, giving back as needed. \w+ match any word character a-za-z0-9_ . \d+ match a digit 0-9 matches the character c literally (case sensitive) so matching values are 12c,1232323232c and !!!!cd4c and non matching values are for instance !!!!!c,aaaaaac and abababababc . the following unit test was created to verify both cases. const string regexpattern = @"(\w+\d+)+c"; public void testregularexpression() { var validinput = "1234567c"; var invalidinput = "aaaaaaac"; regex.ismatch(validinput, regexpattern).assert_is_true(); regex.ismatch(invalidinput, regexpattern).assert_is_false(); } execution time : 6 milliseconds now that we've verified that our regular expression works well, let's write a new unit test to understand the backtracking problem and the performance effects. note that the longer the string, the longer the time the regular expression engine will take to resolve it. we will generate 10 random strings, starting at the length of 15 characters, incrementing the length until get to 25 characters,and then we will see the execution times. const string regexpattern = @"(\w+\d+)+c"; [testmethod] public void isvalidinput() { var sw = new stopwatch(); int16 maxiterations = 25; for (var index = 15; index < maxiterations; index++) { sw.start(); //generating x random numbers using fluentsharp api var input = index.randomnumbers() + "!"; regex.ismatch(input, regexpattern).assert_false(); sw.stop(); sw.reset(); } } now let's take a look at the test results: random string character length elapsed time (ms) 360817709111694! 16 16ms 2639383945572745! 17 23ms 57994905459869261! 18 50ms 327218096525942566! 19 106ms 4700367489525396856! 20 207ms 24889747040739379138! 21 394ms 156014309536784168029! 22 795ms 8797112169446577775348! 23 1595ms 41494510101927739218368! 24 3200ms 112649159593822679584363! 25 6323ms by looking at this results we can understand that the execution time (total time to resolve the input text against the regular expression) goes up exponentially to the size of the input. we can also see that when we append a new character, the execution time almost duplicates. this is an important finding because shows how expensive this process is, if we do not have a correct input data validation we can introduce performance issues in our application. a real-life use-case and an appeal for a unit testing approach now that we have seen the problems we can face by selecting a wrong (evil) regular expression, let's discuss about a realistic scenario where we need to validate input data thought regular expressions. we strongly believe that unit testing techniques can not only help to write quality code but also we can use them to find vulnerabilities in the code we are writing. by writing unit test that performs security checks (like input data validation) a common task in web applications consist on request an email address to the user signing in our application. from a ux (user experience perspective) complaining browsers support friendly error messages when an input, that was supposed to be an email address, does not match with the requirements in terms of format. here is a ui validation when a input textbox (with the email type is set) and the value is not a valid email address. however relying on a ui validation is not longer enough. an eavesdropper can easily perform an http request without using a browser (namely by using a proxy to capture data in transit) and then send a payload that can compromise our application. in the following use case, we are using a backend validation for the email address by using a regular expression. we will show you the real power of regular expressions here, we are not only testing that the regular expression validates the input but also how it behaves when it receives any arbitrary input. we are using this evil regular expression to validate the email: ^( 0-9a-za-z @([0-9a-za-z][-\w][0-9a-za-z].)+[a-za-z]{2,9})$ . with the following test we are verifying that a valid email and invalid emails formats are correctly processed by the regular expression, which is the functional aspect from a development point of view. const string emailregex = @"^([0-9a-za-z]([-.\w]*[0-9a-za-z])*@([0-9a-za-z][-\w]*[0-9a-za-z]\.)+[a-za-z]{2,9})$"; [testmethod] public void validateemailaddress() { var validemailaddress = "[email protected]"; var invalidemailaddress = new string[] { "a", "abc.com", "1212", "aa.bb.cc", "aabcr@s" }; regex.ismatch(validemailaddress, emailregex).assert_is_true(); //looping throught invalid email address foreach (var email in invalidemailaddress) { regex.ismatch(email, emailregex).assert_is_false(); } } elapsed time: 6ms. so both cases are validate correctly. one could state that both scenarios supported by the unit test are enough to select this regular expression for our input data validations. however we can do a more extensive testing as you'll see. the exploit so far the previous regular expression selected to valid an email address seems to work well, we have added some unit test that verifies valid an invalid inputs. but how does it behaves when we send an arbitrary input?, from a variable length, do we face a denial of the service attack?. this kind of questions can be solved wit unit testing technique like this one: const string emailregex = @"^([0-9a-za-z]([-.\w]*[0-9a-za-z])*@([0-9a-za-z][-\w]*[0-9a-za-z]\.)+[a-za-z]{2,9})$"; [testmethod] public void validateemailaddress() { var validemailaddress = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!"; var watch = new stopwatch(); watch.start(); validemailaddress.regex(emailregex).assert_is_false(); watch.stop(); console.writeline("elapsed time {0}ms", watch.elapsedmilliseconds); watch.reset(); } **elapsed time : ~23 minutes (1423127 milliseconds).** results are disturbing. we can clearly see the performance problem introduced by evaluating the given input.it takes roughly 23 minutes to validate the input given the hardware characteristics described before. in the following images you will see the cpu behaviour when running this unit test. here is another cpu utlization: and this is another image from the cpu utilization while the test is running. fuzzing and unit testing: a perfect combination of techniques in the previous unit test we found that a given input string can lead to have denial of the service issue in our application. note that we didn't need an extreme large payload, in our scenario 34 characters can illustrate this problem or even less. when using any regular expression it is recomendable to always test it against unit testing to cover most of the possible ways a user (which can be a potential attacker) can send. here is where we can use fuzzing. tobias klein in his book a bug hunter's diary a guide tour throught the wilds of sofware security defines fuzzing as "a complete different approach to bug hunting is known as fuzzing. fuzzing is a dynamic-analysis technique that consist of testing an application by providing it with malformed or unexpected input. then klein continues adding that: "it isn't easy to identify the entry points of such complex applications, but complex software often tends to crash while processing malformed input data. page 05" mano paul in his book official (isc)2 guide to the csslp talking about fuzzing states that: "also known as fuzz testing or fault injection testing, fuzzing is a brute-force type of testing in which faults (random and pseudo-random input data) are injected into the software and it's behaviour is observed. it is a test whose results are indicative of the extended and effectiveness of the input validation.page 336". taking previous definitions into consideration, we are going to implement a new unit test that can allow us to generate random input data and test our regular expression. in this case, we are using this email regular expression "^[\w-.]{1,}\@([\w]{1,}.){1,}[a-z]{2,4}$"; and by doing an exhaustive testing we will see if we are not introducing a denial of the service problem. we want to make sure that the elapsed time to resolve if the random string matches the regular expression is evaluated in less than 3 seconds: const string emailregex = @"^[\w-\.]{1,}\@([\w]{1,}\.){1,}[a-z]{2,4}$"; //number of random strings to generte. const int maxiterations = 10000; [testmethod] public void fuzz_emailaddress() { //valid email should return true "[email protected]".regex(emailregex).assert_is_true(); //invalid email should return false "abce" .regex(emailregex).assert_is_false(); //testing maxiterations times for (int index = 0; index < maxiterations; index++) { //generating a random string var fuzzinput = (index * 5).randomstring(); var sw = new stopwatch(); sw.start(); fuzzinput.regex(emailregex).assert_is_false(); //elapsed time should be less than 3 seconds per input. sw.elapsed.seconds().assert_size_is_smaller_than(3); } } under the hardware features described before, this test passes. considering that we are using this computation (index * 5), the largest string generate is of 49995 character (which is 9999 *5). having said that we were able to test a large string against the regular expression and we confirmed that even thought it is quite large input value, the time involved to verify if it was or not a valid email, it was less than 3 seconds. now assuming that a check for the length of the email in the first place, it will guarantee that a malicious user can't inject a large payload in our application. countermeasures provided in microsoft .net 4.5 and upper if you are developing applications in microsoft .net 4.5 then you can take advantage of a new implementation on top of the ismatch method from the regex class . starting from .net 4.5 the ismatch method provides an overload that allows you to enter a timeout. note that this overload is not available in .net 4.0 . this new parameter is called matchtimeout and according to microsoft : "the matchtimeout parameter specifies how long a pattern matching method should try to find a match before it times out. setting a time-out interval prevents regular expressions that rely on excessive backtracking from appearing to stop responding when they process input that contains near matches. for more information, see best practices for regular expressions in the .net framework and backtracking in regular expressions . if no match is found in that time interval, the method throws a regexmatchtimeoutexception exception. matchtimeout overrides any default time-out value defined for the application domain in which the method executes." taken from here . we've written a new unit test where we're using a regular expression that we know can lead to denial of the service. in this case we'll test an email address that previously generated a significant side effect in the performance of the application. we'll see then how we can reduce the impact of this process by setting up a timeout. const string emailregexpattern = @"^([0-9a-za-z]([-.\w]*[0-9a-za-z])*@([0-9a-za-z][-\w]*[0-9a-za-z]\.)+[a-za-z]{2,9})$"; [testmethod] public void validateemailaddress() { var emailaddress = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!"; var watch = new stopwatch(); watch.start(); //timeout of 5 seconds try { regex.ismatch(emailaddress, emailregexpattern, regexoptions.ignorecase, timespan.fromseconds(5)); } catch (exception ex) { ex.message.assert_not_null(); ex.gettype().assert_is(typeof(regexmatchtimeoutexception)); } finally { watch.stop(); watch.elapsed.seconds().assert_size_is_smaller_than(5); watch.reset(); } } running this test in visual studio we can confirm it passes, which means that the backtracking mechanism is taking longer than 5 seconds to resolve. it will throw a regexmatchtimeoutexception exception indicating that it might take longer than 5 seconds to evaluate the input. ideally one would expect this process to take less than a second, however several conditions or requirements might lead to allow a timeout in seconds. note how this model provides a very needed defensive programming style where the software engineers make informed decisions on the code they write, in this case we can establish the next steps when our method times and that way we can decrease any denial of the service attack. final thoughts no one size fits all is so cliché that has to be true. we are not sure if the regular expressions you are currently using in your applications are vulnerable to this attack. what we can do for sure is to show you how you can take advantage of unit testing to write secure code. when we write code we want to make sure that each single line of code is covered by a unit testing, which at the end of the day will guarantee early detections of error. however if we can combine this exercise with the adoption and implementation of test that can also try to attack/compromise the application (and we are not talking about anything fancy) like sending random strings, using fuzzing techniques, using combination of characters, exceeding the expected length, we will be helping to write software that is resilient to attacks. as a recommendation always test your regular expressions agains uni test, make sure that they are resilient to the attack we have covered in this article and if you are able to identify those problematic patterns out there, do a contribution and report them so we are not introduce them in the software we write. references 1.cruz,dinis(2013) the email regex that (could had) dosed a site. 2.hollos,s. hollos,r (2013) finite automata and regular expressions problems and solutions. 3.kirrage,j. rathnayake , thielecke, h.: static analysis for regular expression denial-of-service attacks. university of birmingham, uk 4.klein, t. a bug hunter's diary a guided tour through the wilds of software security (2011). 5.the owasp foundation (2012) regular expression denial of service - redos. 6.stubblebine, t(2007) regular expression pocket reference, second edition. 7.sullivan, b (2010) regular expression denial of service attacks and defenses
June 7, 2015
by Michael Hidalgo
· 34,153 Views · 5 Likes
article thumbnail
How to Extract OMR Data from Scanned Images inside C# & VB.NET Apps
This technical tip shows how to extract OMR data from a scanned image inside .NET Applications. Developers can use the Aspose.OMR namespace to extract data entered into OMR (Optical Mark Recognition) forms, such as consumer surveys, assessment sheets, or lottery tickets. This article describes how to extract data from a scanned page using a OMR template created with OMR Template Editor tool. The Aspose.Omr namespace can be used to recognize and extract different elements from a scanned image. A template is required that contains graphical mapping of elements to be recognized from the scanned images. This template document is loaded using OmrTemplate. Pages with human marked data like surveys or tests are scanned, and the images loaded using the OmrImage class. A recognition engine (OmrEngine) is required for the template and can be initiated by using the current template as an argument. After the template has been set in the engine, the data is extracted from the scanned image using the OmrEngine class' ExtractData function. Below are the steps to extract OMR data from a scanned image: Load the template using OmrTemplate. Load the scanned image into OmrImage. Instantiate the recognition engine: OmrEngine for the template. Extract data and save to OmrProcessingResult. //The sample code below shows how to extract OMR data from a scanned image //[C#] //Load template file OmrTemplate template = OmrTemplate.Load(MyDir + "Grids.amr"); //Load the image to be analyzed OmrImage image = OmrImage.Load(MyDir + "Grids-filled-scan.jpg"); // Instantiate the recognition engine for the template OmrEngine engine = new OmrEngine(template); // Extract data. This template has only one page. OmrProcessingResult result = engine.ExtractData(new OmrImage[] { image }); //Load actual result from Hashtable OmrResult = result.PageData[0]; //Get Collection of Keys ICollection key = OmrResult.Keys; foreach (string k in key) { Console.WriteLine(k + ": " + OmrResult[k]); } Console.ReadKey(); //[VB.NET] 'Load template file Dim template As OmrTemplate = OmrTemplate.Load(myDir + "Grids.amr") 'Load the image to be analyzed Dim image As OmrImage = OmrImage.Load(MyDir + "Grids-filled-scan.jpg") 'Instantiate the recognition engine for the template Dim engine As OmrEngine = New OmrEngine(template) 'Extract data. This template has only one page. Dim result As OmrProcessingResult = engine.ExtractData(New OmrImage() {image}) 'Load actual result from Dim OmrResult As Hashtable = result.PageData(0) 'Get collection of keys Dim Key As ICollection = OmrResult.Keys For Each k As String In Key Console.WriteLine(k + ": " + OmrResult(Key)) Next Console.ReadKey()
June 3, 2015
by David Zondray
· 12,128 Views
article thumbnail
Ecosystem of Hadoop Animal Zoo
hadoop is best known for map reduce and it's distributed file system (hdfs). recently other productivity tools developed on top of these will form a complete ecosystem of hadoop. most of the projects are hosted under apache software foundation . hadoop ecosystem projects are listed below. hadoop common a set of components and interfaces for distributed file system and i/o (serialization, java rpc, persistent data structures) http://hadoop.apache.org/ hadoop ecosystem hdfs a distributed file system that runs on large clusters of commodity hardware. hadoop distributed file system, hdfs renamed form ndfs. scalable data store that stores semi-structured, un-structured and structured data. http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfsuserguide.html http://wiki.apache.org/hadoop/hdfs map reduce map reduce is the distributed, parallel computing programming model for hadoop. inspired from google map reduce research paper . hadoop includes implementation of map reduce programming model. in map reduce there are two phases, not surprisingly map and reduce. to be precise in between map and reduce phase, there is another phase called sort and shuffle. job tracker in name node machine manages other cluster nodes. map reduce programming can be written in java. if you like sql or other non- java languages, you are still in luck. you can use utility called hadoop streaming. http://wiki.apache.org/hadoop/hadoopmapreduce hadoop streaming a utility to enable map reduce code in many languages like c, perl, python, c++, bash etc., examples include a python mapper and awk reducer. http://hadoop.apache.org/docs/r1.2.1/streaming.html avro a serialization system for efficient, cross-language rpc and persistent data storage. avro is a framework for performing remote procedure calls and data serialization. in the context of hadoop, it can be used to pass data from one program or language to another, e.g. from c to pig. it is particularly suited for use with scripting languages such as pig, because data is always stored with its schema in avro. http://avro.apache.org/ apache thrift apache thrift allows you to define data types and service interfaces in a simple definition file. taking that file as input, the compiler generates code to be used to easily build rpc clients and servers that communicate seamlessly across programming languages. instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business. http://thrift.apache.org/ hive and hue if you like sql, you would be delighted to hear that you can write sql and hive convert it to a map reduce job. but, you don't get a full ansi-sql environment. hue gives you a browser based graphical interface to do your hive work. hue features a file browser for hdfs, a job browser for map reduce/yarn, an hbase browser, query editors for hive, pig, cloudera impala and sqoop2.it also ships with an oozie application for creating and monitoring workflows, a zookeeper browser and an sdk. pig a high-level programming data flow language and execution environment to do map reduce coding the pig language is called pig latin. you may find naming conventions some what un-conventional, but you get incredible price-performance and high availability. https://pig.apache.org/ jaql jaql is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. as its name implies, a primary use of jaql is to handle data stored as json documents, but jaql can work on various types of data. for example, it can support xml, comma-separated values (csv) data and flat files. a "sql within jaql" capability lets programmers work with structured sql data while employing a json data model that's less restrictive than its structured query language counterparts. 1. jaql in google code 2. what is jaql? by ibm sqoop sqoop provides a bi-directional data transfer between hadoop -hdfs and your favorite relational database. for example you might be storing your app data in relational store such as oracle, now you want to scale your application with hadoop so you can migrate oracle database data to hadoop hdfs using sqoop. http://sqoop.apache.org/ oozie manages hadoop workflow. this doesn't replace your scheduler or BPM tooling, but it will provide if-then-else branching and control with hadoop jobs. https://oozie.apache.org/ zookeeper a distributed, highly available coordination service. zookeeper provides primitives such as distributed locks that can be used for building the highly scalable applications. it is used to manage synchronization for cluster. http://zookeeper.apache.org/ hbase based on google's bigtable , hbase "is an open-source, distributed, version, column-oriented store" that sits on top of hdfs. a super scalable key-value store. it works very much like a persistent hash-map (for python developers think like a dictionary). it is not a conventional relational database. it is a distributed, column oriented database. hbase uses hdfs for it's underlying. supports both batch-style computations using map reduce and point queries for random reads. https://hbase.apache.org/ cassandra a column oriented nosql data store which offers scalability, high availability with out compromising on performance. it perfect platform for commodity hardware and cloud infrastructure.cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for de-normalization and materialized views , and powerful built-in caching. http://cassandra.apache.org/ flume a real time loader for streaming your data into hadoop. it stores data in hdfs and hbase.flume "channels" data between "sources" and "sinks" and its data harvesting can either be scheduled or event-driven. possible sources for flume include avro, files, and system logs, and possible sinks include hdfs and hbase. http://flume.apache.org/ mahout machine learning for hadoop, used for predictive analytics and other advanced analysis. there are currently four main groups of algorithms in mahout: recommendations, a.k.a. collective filtering classification, a.k.a categorization clustering frequent item set mining, a.k.a parallel frequent pattern mining mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable; that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. algorithms in the mahout library belong to the subset that can be executed in a distributed fashion. http://en.wikipedia.org/wiki/list_of_machine_learning_algorithms https://www.coursera.org/course/machlearning https://mahout.apache.org/ fuse makes the hdfs system to look like a regular file system so that you can use ls, rm, cd etc., directly on hdfs data. whirr apache whirr is a set of libraries for running cloud services. whirr provides a cloud-neutral way to run services. you don't have to worry about the idiosyncrasies of each provider.a common service api. the details of provisioning are particular to the service. smart defaults for services. you can get a properly configured system running quickly, while still being able to override settings as needed. you can also use whirr as a command line tool for deploying clusters. https://whirr.apache.org/ giraph an open source graph processing api like pregel from google https://giraph.apache.org/ chukwa chukwa, an incubator project on apache, is a data collection and analysis system built on top of hdfs and map reduce. tailored for collecting logs and other data from distributed monitoring systems, chukwa provides a workflow that allows for incremental data collection, processing and storage in hadoop. it is included in the apache hadoop distribution as an independent module. https://chukwa.apache.org/ drill apache drill, an incubator project on apache, is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. drill is the open source version of google's dremel system which is available as an iaas service called google big query. one explicitly stated design goal is that drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. http://incubator.apache.org/drill/ impala (cloudera) released by cloudera, impala is an open-source project which, like apache drill, was inspired by google's paper on dremel; the purpose of both is to facilitate real-time querying of data in hdfs or hbase. impala uses an sql-like language that, though similar to hiveql, is currently more limited than hiveql. because impala relies on the hive meta store, hive must be installed on a cluster in order for impala to work. the secret behind impala's speed is that it "circumvents map reduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel rdbmss." (source: cloudera) http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html http://training.cloudera.com/elearning/impala/
June 3, 2015
by Umashankar Ankuri
· 23,880 Views · 3 Likes
article thumbnail
Mechanical Sympathy: Understanding the Hardware Makes You a Better Developer
[This article appears in the DZone Guide to Performance & Monitoring – 2015 Edition. For additional information including insight from industry experts and luminaries, performance statistics and strategies, and an overview of how modern companies are handling application monitoring, download the guide below.] I have spent the last few years of my career working in the field of high-performance, low-latency systems. Two observations struck me when I returned to working on the metal: Many of the assumptions that programmers make outside this esoteric domain are simply wrong. Lessons learned from high-performance, low-latency systems are applicable to many other problem domains. Some of these assumptions are pretty basic. For example, which is faster: storing to disk or writing to a cluster pair on your network? If, like me, you survived programming in the late 20th century, you know that the network is slower than disk, but that’s actually wrong! It turns out that it’s much more efficient to use clustering as a backup mechanism than to save everything to disk. Other assumptions are more general and more damaging. A common meme in our industry is “avoid pre-optimization.” I used to preach this myself, and now I’m very sorry thatI did. These are just a few supposedly obvious principles that truly high-performance systems call into question. For many developers, these rules of thumb appear reasonably inviolable in everyday practice. But as performance demands grow increasingly strict, it becomes proportionally important for developers to understand exactly how systems work—at both the abstract, procedural level, and the level of the metal itself. MOTIVATION: THE REPERCUSSIONS OF INEFFICIENT SOFTWARE First, let’s think about why it’s important to transcend crude oversimplifications like “disk I/O is faster than network I/O.” One motivation is a bit negative. I can think of noother field of human endeavor that tolerates thelevels of inefficiency that are normal in our industry.My experience has been that for most systems, any performance specialist can improve its performance ten-fold quite easily. This is because most applications are hundreds, thousands, even tens of thousands of times less efficient than they could be. Modern hardware is phenomenally capable, but we software folks regularly underestimate the strides hardware manufacturers have made. As a result, we often fail to take advantage of new hardware capabilities and miss significant shifts in the locations of common performance bottlenecks. You may say this doesn’t matter—after all, what is all that hardware performance for, if not to make it easier to write software? And then I’ll reply: it’s for lots of things. Here’s one very concrete reason. The energy consumption from data centers constitutes a significant fraction of the CO2 being pumped into our atmosphere [1]. If most software is more than 10x less efficient than it could be, then that could mean 10x more CO2. It also means 10x the capital cost in hardware required to run all that inefficient software. “You don’t have to be an engineer to be be a racing driver, but you do have to have Mechanical Sympathy.” – Jackie Stewart, racing driver Perhaps more important is that performance is about more than just efficiency for its own sake. Performanceis also an important enabler of innovation. The ability to pack the data representing 1000 songs onto the tiny hard disk in the first generation iPod made it possible for Apple to revolutionize the music industry. For an even more basic example, graphical user interfaces only became feasible when the hardware was fast enough to “waste” all that time drawing pretty pictures. High-performance hardware enabled these innovations; the software simply had to follow. Then there’s the opportunity cost. What could we be doing with all the spare capacity of our astonishingly fast hardware and enormously vast storage if we weren’t wasting it on inefficient software? In the problem domains where I’ve worked, the idea of mechanical sympathy has helped enormously. Let’s examine this idea a little closer. KEY CONCEPT: MECHANICAL SYMPATHY The term Mechanical Sympathy was coined by racing driver Jackie Stewart and applied to software by Martin Thompson. Jackie Stewart said, “You don’t have to be an engineer to be be a racing driver, but you do have to have Mechanical Sympathy.” He meant that understanding how a car works makes you a better driver. This is just as true for writing code. You don’t need to be a hardware engineer, but you do need to understand how the hardware works and take that into consideration when you design software. Let’s take something simple like writing a file to disk. Disks are random access, right? Well, not really. Disks work by encoding data in sectors and addressing them in segments of the disk. When you read or write data to disk, youneed to wait for the heads to move to the correct physical location. If you do this randomly, then you will incur performance penalties as the heads are physically moved across the surface of the disk and you wait for the spin of the disk to place the sector you want beneath the read/ write heads. This averages out to about 3ms per seek for the heads, and about 3ms rotational latency. That makes for an average total of about 6ms per seek! This seek time is dramatically slow when compared tothe electronic. Electrons beat spinning rust every time! And there are even more specific reasons why “random access” is a poor rule of thumb. In fact, modern disks are optimized to stream data so that they can play movies and audio. If you understand that, and treat your file storage as a serial, block device rather than random access, you will get dramatically higher performance. In this example, the difference can be two orders of magnitude (200MB/s is a good working figure for the upper bound). REACHING HIGH PERFORMANCE: DO EXPERIMENTS, DO ARITHMETIC Abstraction is important. It is essential that we are, to some degree, insulated from the complexity of the devices and systems that form the platform upon which our software executes. However, many of the abstractions that we take for granted are very poor and leaky. The overall system complexity gets amplified when we build abstraction on top of abstraction to hide the problems caused by the leakiness, resulting in the dramatic performance differences we see between high-performance and “normal” systems. You don’t need to model or measure your entire systemto tackle its complexity. The starting point is to do some experimenting to understand your theoretical maximums. You should have a rough model of the following: How much data you can write to disk in one second How many messages your code can process in one second How much data you can send or receive across your network in one second A common meme in our industry is, “avoid pre- optimization.” I used to preach this myself. I am now very sorry that I did. The idea here is to measure your actual performance against the theoretically possible performance of your system within the limits imposed by the underlying hardware. Let’s consider a specific example. I recently talked witha developer who believed he was working on a high- performance system. He told me that his system was working at 10 transactions per second. As of mid-2015, modern processors on commodity hardware can easily perform 2-3 billion instructions per second. So at 2-3 billion instructions per second, the developer was limited to 200- 300 million instructions per transaction ([10 transactions/ sec] / [2-3G instructions/sec] = 200-300m instructions/ transaction) to achieve his goals. Of course, his application isn’t responsible for all of these—the operating system is chewing some cycles too—but 200 million instructions is an awful lot of work. You can write an effective chess player in 672 bytes, as David Horne has shown. Okay, so maybe the bottleneck was I/O. Perhaps this developer’s application was disk bound. That’s unlikely— modern hard disks found in commodity hardware can transfer data at phenomenal rates. A moderate transfer rate from disk to a buffer in memory is about 10MB/s. If this is the limit, he must be pushing more than 10MB per transaction between disk and memory. You can represent a lot of information in 10 million bytes! Well, perhaps the problem was the network. Probably not—10Gbit/s networks are now on the low end. A 10Gbit/s network can transmit roughly 1GB of data per second.This means that each of our misguided developer’s 10 transactions per second is occupying 1/10th of a GB, or approximately 100MB—more than 10 times the throughput of our disks! The hardware wasn’t the problem. HIGH PERFORMANCE SIMPLICITY I: MODEL THE PROBLEM DOMAIN One of the great myths about high-performance programming is that high-performance solutions are more complex than “normal” solutions. This is just not true.By definition, a high-performance solution must do the most amount of work in the fewest instructions. Complex solutions precisely don’t last long in high-performance systems because complexity is where performance bottlenecks hide. The best programmers I know achieve the simplicity demanded by high-performance systems by modeling the problem domain. A software simulation of the problem domain is the best way to come up with an elegant solution. HIGH PERFORMANCE SIMPLICITY II: SIMPLIFY THE CODE If you follow the lead of the problem domain, you tend to end up with smaller, simpler classes with clearer relationships. If you follow strong separation of concerns as a guiding principle, then your design will push you in the direction of cleaner, simpler code. The cardinal rule of object-oriented simplicity is “one class, one thing; one method, one thing.” Modern compilers are extremely effective at optimizing code, but they are best at optimizing simple code. If you write 300-line methods with multiple for-loops, each containing several nested if-conditions throwing exceptions all over the place and returning from multiple points, then the optimizer will simply give up. If you write small, simple, easy to read methods, they are easier to test and easier for the optimizer to understand, which results in significant improvements to performance. One of the great myths about high- performance programming isthat high-performance solutions are more complex than “normal” solutions. This is just not true. THE KEY TO HIGH PERFORMANCE: MINIMIZE INSTRUCTIONS, MINIMIZE DATAFundamentally, developing high-performance systemsis simple: minimize the number of instructions being processed and the amount of data being shunted around. You achieve this by modeling the problem domainand eliminating nonessential complexity. For software developers, “Mechanically Sympathizing” with modern high-performance hardware will help you understand where you’re doing things wrong. It can also point you in the direction of better measurement and profiling, which will help you understand why your code is not performing to its theoretical maximum. So why not try to save your company some money, work in a simplified codebase,and maybe help reduce the carbon footprint of our data centers—all at the same time? [1] https://energy.stanford.edu/news/data-centers-can-slash-co2-emissions-88- or-more DOWNLOAD YOUR FREE COPY TODAY
May 24, 2015
by Dave Farley
· 56,805 Views · 58 Likes
article thumbnail
Server and Storage I/O Benchmark Tools: Microsoft Diskspd (Part I)
A key to improving performance is benchmarking. Read about Microsoft Diskspd's tools for storage and server benchmarking, and boost your I/O performance.
May 22, 2015
by Greg Schulz
· 14,739 Views
article thumbnail
Data Locality w/ Cassandra : How to Scan the Local Token Range of a Table...
I'm working on a mechanism that will allow HPCC to access data stored in Cassandra with data locality, leveraging the Java streaming capabilities from HPCC.
May 18, 2015
by Brian O' Neill
· 14,329 Views · 1 Like
article thumbnail
Swim Lane Diagrams in JavaScript
Learn about swim-lane diagrams to connect business processes and departments and apply the concept to the object-oriented world of javascript.
May 17, 2015
by Daniel Jebaraj
· 7,130 Views
article thumbnail
Use RegEx to Test Password Strength in JavaScript
In this post, we learn how to combine JavaScript and RegEx to create scripts that can help us test our password strength.
May 16, 2015
by Nic Raboy
· 93,719 Views · 1 Like
article thumbnail
HashMap Custom implementation in java
Contents of page : Custom HashMap > Entry Putting 5 key-value pairs in HashMap (step-by-step)> Methods used in custom HashMap > What will happen if map already contains mapping for key? Complexity calculation of put and get methods in HashMap > put method - worst Case complexity > put method - best Case complexity > get method - worst Case complexity > get method - best Case complexity > Summary of complexity of methods in HashMap > Custom HashMap > This is very important and trending topic. In this post i will be explaining HashMap custom implementation in lots of detail with diagrams which will help you in visualizing the HashMap implementation. I will be explaining how we will put and get key-value pair in HashMap by overriding- >equals method - helps in checking equality of entry objects. >hashCode method - helps in finding bucket’s index on which data will be stored. We will maintain bucket (ArrayList) which will store Entry (LinkedList). Entry We store key-value pair by usingEntry Entry contains K key, V value and Entrynext (i.e. next entry on that location of bucket). static class Entry { K key; V value; Entry next; public Entry(K key, V value, Entry next){ this.key = key; this.value = value; this.next = next; } } Putting 5 key-value pairs in custom HashMap (step-by-step)> I will explain you the whole concept of HashMap by putting 5 key-value pairs in HashMap. Initially, we have bucket of capacity=4. (all indexes of bucket i.e. 0,1,2,3 are pointing to null) Let’s put first key-value pair in HashMap- Key=21, value=12 newEntry Object will be formed like this > We will calculate hash by using our hash(K key) method - in this case it returns key/capacity= 21%4= 1. So, 1 will be the index of bucket on which newEntry object will be stored. We will go to 1stindex as it is pointing to null we will put our newEntry object there. At completion of this step, our HashMap will look like this- Let’s put second key-value pair in HashMap- Key=25, value=121 newEntry Object will be formed like this > We will calculate hash by using our hash(K key) method - in this case it returns key/capacity= 25%4= 1. So, 1 will be the index of bucket on which newEntry object will be stored. We will go to 1st index, it contains entry with key=21, we will compare two keys(i.e. compare 21 with 25 by using equals method), as two keys are different we check whether entry with key=21’s next is null or not, if next is null we will put our newEntry objecton next. At completion of this step our HashMap will look like this- Let’s put third key-value pair in HashMap- Key=30, value=151 newEntry Object will be formed like this > We will calculate hash by using our hash(K key) method - in this case it returns key/capacity= 30%4= 2. So, 2 will be the index of bucket on which newEntry object will be stored. We will go to 2nd index as it is pointing to null we will put our newEntry object there. At completion of this step, our HashMap will look like this- Let’s put fourth key-value pair in HashMap- Key=33, value=15 Entry Object will be formed like this > We will calculate hash by using our hash(K key) method - in this case it returns key/capacity= 33%4= 1, So, 1 will be the index of bucket on whichnewEntry object will be stored. We will go to 1st index - >it contains entry with key=21, we will compare two keys (i.e. compare 21 with 33 by using equals method, as two keys are different, proceed to next of entry with key=21 (proceed only if next is not null). >now, next contains entry with key=25, we will compare two keys (i.e. compare 25 with 33 by using equals method, as two keys are different, now next of entry with key=25 is pointing to null so we won’t proceed further, we will put our newEntry object on next. At completion of this step our HashMap will look like this- Let’s put fifth key-value pair in HashMap- Key=35, value=89 Repeat above mentioned steps. At completion of this step our HashMap will look like this- Must read: LinkedHashMap Custom implementation Methods used in custom HashMap > public void put(K newKey, V data) -Method allows you put key-value pair in HashMap -If the map already contains a mapping for the key, the old value is replaced. -provide complete functionality how to override equals method. -provide complete functionality how to override hashCode method. public V get(K key) Method returns value corresponding to key. public boolean remove(K deleteKey) Method removes key-value pair from HashMapCustom. public void display() -Method displays all key-value pairs present in HashMapCustom., -insertion order is not guaranteed, for maintaining insertion order refer LinkedHashMapCustom. private int hash(K key) -Method implements hashing functionality, which helps in finding the appropriate bucket location to store our data. -This is very important method, as performance of HashMapCustom is very much dependent on this method's implementation. What will happen if map already contains mapping for key? If the map already contains a mapping for the key, the old value is replaced. Complexity calculation of put and get methods in HashMap > put method - worst Case complexity > O(n). But how complexity is O(n)? Initially, let's say map is like this - And we have to insert newEntry Object with Key=25, value=121 We will calculate hash by using our hash(K key) method - in this case it returns key/capacity= 25%4= 1. So, 1 will be the index of bucket on which newEntry object will be stored. We will go to 1st index, it contains entry with key=21, we will compare two keys(i.e. compare 21 with 25 by using equals method), as two keys are different we check whether entry with key=21’s next is null or not, if next is null we will put our newEntry objecton next. At completion of this step our HashMap will look like this- Now let’s do complexity calculation - Earlier there was 1 element in HashMap and for putting newEntry Object we iterated on it. Hence complexity was O(n). Note: We may calculate complexity by adding more elements in HashMap as well, but to keep explanation simple i kept less elements in HashMap. put method - best Case complexity > O(1). But how complexity is O(n)? Let's say map is like this - And we have to insert newEntry Object with Key=30, value=151 We will calculate hash by using our hash(K key) method - in this case it returns key/capacity= 30%4= 2. So, 2 will be the index of bucket on which newEntry object will be stored. We will go to 2nd index as it is pointing to null we will put our newEntry object there. At completion of this step our HashMap will look like this- Now let’s do complexity calculation - Earlier there 2 elements in HashMap but we were able to put newEntry Object in first go. Hence complexity was O(1). get method - worst Case complexity > O(n). But how complexity is O(n)? Initially, let's say map is like this - And we have to get Entry Object with Key=25, value=121 We will calculate hash by using our hash(K key) method - in this case it returns key/capacity= 25%4= 1. So, 1 will be the index of bucket on which Entry object is stored. We will go to 1st index, it contains entry with key=21, we will compare two keys(i.e. compare 21 with 25 by using equals method), as two keys are different we check whether entry with key=21’s next is null or not, next is not null so we will repeat same process and ultimately will be able to get Entry object. Now let’s do complexity calculation - There were 2 elements in HashMap and for getting Entry Object we iterated on both of them. Hence complexity was O(n). Note: We may calculate complexity by using HashMap of larger size, but to keep explanation simple i kept less elements in HashMap. get method - best Case complexity > O(1). But how complexity is O(n)? Initially, let's say map is like this - And we have to get Entry Object with Key=30, value=151 We will calculate hash by using our hash(K key) method - in this case it returns key/capacity= 30%4= 2. So, 2 will be the index of bucket on which Entry object is stored. We will go to 2nd index and get Entry object. Now let’s do complexity calculation - There were 3 elements in HashMap but we were able to get Entry Object in first go. Hence complexity was O(1). Summary of complexity of methods in HashMap > Operation/ method Worst case Best case put(K key, V value) O(n) O(1) get(Object key) O(n) O(1) REFER: http://javamadesoeasy.com/2015/02/hashmap-custom-implementation.html HashMap Custom implementation - put, get, remove Employee object.
May 13, 2015
by Ankit Mittal
· 74,462 Views · 10 Likes
article thumbnail
Python: Equivalent to flatMap for Flattening an Array of Arrays
I found myself wanting to flatten an array of arrays while writing some Python code earlier this afternoon and being lazy my first attempt involved building the flattened array manually: episodes = [ {"id": 1, "topics": [1,2,3]}, {"id": 2, "topics": [4,5,6]} ] flattened_episodes = [] for episode in episodes: for topic in episode["topics"]: flattened_episodes.append({"id": episode["id"], "topic": topic}) for episode in flattened_episodes: print episode If we run that we’ll see this output: $ python flatten.py {'topic': 1, 'id': 1} {'topic': 2, 'id': 1} {'topic': 3, 'id': 1} {'topic': 4, 'id': 2} {'topic': 5, 'id': 2} {'topic': 6, 'id': 2} What I was really looking for was the Python equivalent to the flatmap function which I learnt can be achieved in Python with a list comprehension like so: flattened_episodes = [{"id": episode["id"], "topic": topic} for episode in episodes for topic in episode["topics"]] for episode in flattened_episodes: print episode We could also choose to use itertools in which case we’d have the following code: from itertools import chain, imap flattened_episodes = chain.from_iterable( imap(lambda episode: [{"id": episode["id"], "topic": topic} for topic in episode["topics"]], episodes)) for episode in flattened_episodes: print episode We can then simplify this approach a little by wrapping it up in a ‘flatmap’ function: def flatmap(f, items): return chain.from_iterable(imap(f, items)) flattened_episodes = flatmap( lambda episode: [{"id": episode["id"], "topic": topic} for topic in episode["topics"]], episodes) for episode in flattened_episodes: print episode I think the list comprehensions approach still works but I need to look into itertools more – it looks like it could work well for other list operations.
May 9, 2015
by Mark Needham
· 36,348 Views · 2 Likes
article thumbnail
8 Questions You Need to Ask About Microservices, Containers & Docker in 2015
In containers and microservices, we’re facing the greatest potential change in how we deliver and run software services since the arrival of virtual machines.
May 9, 2015
by Andrew Phillips
· 15,009 Views · 1 Like
article thumbnail
Make Your IoT Gateway WiFi-Aware Using Camel and Kura
The common scenario for the mobile IoT Gateways is to cache collected data locally on the device storage and synchronizing the data with the data center.
May 9, 2015
by Henryk Konsek
· 8,076 Views
article thumbnail
Why Run Your Microservices on a PaaS
[This article by Chris Haddad comes to you from the DZone Guide to Cloud Development - 2015 Edition. For more information—including in-depth articles from industry experts, best solutions for PaaS, iPaaS, IaaS, and MBaaS, and more—click the link below to download your free copy of the guide.] Microservices can be understood from two angles. First, the differential: teams that take a microservice design approach divide business solutions into distinct, full-stack business services owned by autonomous teams. Second, the integral: microservice-based applications weave multiple atomic microservices into holistic user experiences. Unfortunately, traditional application delivery models and traditional middleware infrastructure do not address microservice-specific demands for on-demand provisioning, dynamic composition, and service level management. On the other hand, the Platform-as-a-Service (PaaS) model addresses these demands perfectly. Running microservices on a PaaS fabric decreases solution fragility, reduces operational burden, and enhances developer productivity. To understand why, we’ll first review how microservices separate concerns from both business and object-oriented design perspectives. Second, we’ll consider how microservice-based design can complicate deployment as applications scale dynamically. Third, we’ll focus on how a PaaS environment helps to solve many of the problems both addressed and introduced by microservices-based architectures — in other words, why PaaS and microservices are a match made in heaven. Microservices: Separating Concerns By Business Solution A microservice approach decomposes monolithic applications according to the single responsibility pattern. In a microservice solution, each microservice interface delivers discrete business capabilities (e.g. customer profile, product catalogue, inventory, order, billing, fulfillment) within a well-defined, bounded context. The atomic microservice interfaces reside on separate and distinct full-stack application platforms that contain separate database storage, integration flows, and web application hosting. By separating concerns onto separate full-stack platforms and not sharing database instances or web application hosts across services, every team is free to choose different runtime languages and frameworks for its own microservice. Also, every team is free to evolve its data schemas, application frameworks, and business logic without impacting other teams. Because microservices are a relatively new design approach, many development teams may have the misconception that creating a microservice-based solution requires simply deploying small web services in containers. But this doesn’t cut quite deep enough. The correct approach is to evolve your monolithic design by applying service-oriented principles (i.e. encapsulation, loose coupling, separation of concerns) in conjunction with domain-driven design techniques and dynamic runtime application composition. For example, in a typical ecommerce scenario, a development team applies the bounded context pattern and single responsibility pattern to refactor a monolithic application into units distinguished by business capability (see Figure 2). By creating a user experience from loosely coupled services instead of tightly coupled native-language business objects, teams have more independence to develop, evolve, and deploy each business capability separately. Obviously, the microservice design approach works best for (a) greenfield projects or (b) modernization efforts where teams focus on refactoring monolithic application assets. The Microservice Execution Trap Although a microservice approach decouples development dependencies and speeds up development iterations, microservices also create a challenging environment for high-performance scaling and reliable runtime execution. More complex, loosely coupled, and dynamic environments distribute business capabilities over the entire network. Even a task as simple as responding to a single web application page request may spread out across several microservice instances residing on a distributed network topology. Martin Fowler and Stefan Tilkov (both microservice proponents) warn teams that successfully implementing a microservice approach requires choosing platforms that decrease solution fragility and reduce operational burdens. What Platform-as-a-Service Offers Platform-as-a-Service environments reduce microservice operational burdens when infrastructure-as-code and declarative policies are used to eliminate all manual actions and increase runtime quality of service (i.e. reliability, availability, scalability, and performance). The appropriate PaaS environment will automatically deploy, provision, and link full-stack microservices. In a microservice architecture, teams want to rapidly release new versions and perform A/B testing across versions. When teams define instance dependencies, scaling properties, and security policies as PaaS metadata or code scripts, the runtime fabric can reduce manual effort and increase release confidence. With a DevOps- friendly PaaS, the team can experiment with new service versions and safely rollback to a prior stable release if a problem arises. Because microservices are full-stack silos *1* that can be composed of multiple server instances (e.g. web server, database, load balancer, integration server), a PaaS can reduce deployment complexity by automatically spinning up and linking all instances. Linking may require discovering instance locations, dynamically initializing network routes, and auto-configuring connection strings based on service version or tenant. A traditional application will compose business functions and user experience by statically linking class files and shared object libraries. In contrast, microservice- based applications use service composition to connect available microservices endpoints and realize a fully functional application. While many microservice proponents promote microservice-based interactions by “smart endpoints through dumb pipes, ‘ effective service composition requires smart infrastructure building blocks to bootstrap and maintain connections between services and consumers. The right PaaS solves these problems. Infrastructure building blocks will register service endpoint locations, associate metadata and policies, connect clients, circuit break around failures, correlate inter-service calls, and load balance traffic. A microservice-friendly PaaS will provide service registries, metadata services, discovery services, and service virtualization gateways. In the pipe, circuit breakers will automatically route traffic on failover or overload. Smart endpoint code will dynamically connect with microservices based on discovery service responses and negotiated quality of service parameters. Rather than being hard-coded to a specific service hostname and URI, endpoint code will query for microservice location based on security assurances, performance guarantees, traffic load, service version, client tenancy, or business domain. When services are unavailable or underperform, smart endpoints will follow the tolerant reader pattern and gracefully degrade experience or proactively recover. A few recovery options include reading from local caches or circuit tripping to backup service endpoints. In conjunction with smart endpoint actions, a smart PaaS will spin up new microservice endpoints and full-stack instances based on service level management metrics. By following microservice architecture best practices, teams create anti-fragile applications that not only withstand a shock, but also improve performance and quality of service when stressed or experiencing failures. To drive this non-intuitive behavior, the underlying platform environment must be ready to scale, repair, and reconnect services. PaaS service level management components will create more resilient and anti-fragile microservices by monitoring performance, elastically provisioning instances, and dynamically re-routing traffic. Scaling an anti-fragile microservice is more difficult than scaling a web application. The PaaS should distribute microservice instances across multiple availability zones and dynamically adjust traffic to reduce latency and response time. Because transient microservice instances will rapidly start, stop, and change location, the service management layer must be completely automated and integrated with routing services. A PaaS environment will deliver the service level management, dynamic service composition, circuit breakers, and on-demand provisioning functions required to overcome the complexity inherent within a distributed microservice-based application architecture. Running microservices on a PaaS fabric will decrease solution fragility, reduce operational burden, and enhance developer productivity. If you are pursuing a microservice design approach, make sure you choose a microservice- friendly PaaS. DOWNLOAD YOUR FREE COPY TODAY
May 5, 2015
by Chris Haddad
· 12,063 Views · 2 Likes
article thumbnail
How To: Neo4j Data Import - Minimal Example
The easiest way to import data from relational or legacy systems, like plain CSV files without headers, into Neo4j with Cypher and a graph model.
April 30, 2015
by Michael Hunger
· 7,756 Views
article thumbnail
Introduction to Probabilistic Data Structures
When processing large data sets, we often want to do some simple checks, such as number of unique items, most frequent items, and whether some items exist in the data set. The common approach is to use some kind of deterministic data structure like HashSet or Hashtable for such purposes. But when the data set we are dealing with becomes very large, such data structures are simply not feasible because the data is too big to fit in the memory. It becomes even more difficult for streaming applications which typically require data to be processed in one pass and perform incremental updates. Probabilistic data structures are a group of data structures that are extremely useful for big data and streaming applications. Generally speaking, these data structures use hash functions to randomize and compactly represent a set of items. Collisions are ignored but errors can be well-controlled under certain threshold. Comparing with error-free approaches, these algorithms use much less memory and have constant query time. They usually support union and intersection operations and therefore can be easily parallelized. This article will introduce three commonly used probabilistic data structures: Bloom filter, HyperLogLog, and Count-Min sketch. Membership Query - Bloom filter A Bloom filter is a bit array of m bits initialized to 0. To add an element, feed it to k hash functions to get k array position and set the bits at these positions to 1. To query an element, feed it to k hash functions to obtain k array positions. If any of the bits at these positions is 0, then the element is definitely not in the set. If the bits are all 1, then the element might be in the set. A Bloom filter with 1% false positive rate only requires 9.6 bits per element regardless of the size of the elements. For example, if we have inserted x, y, z into the bloom filter, with k=3 hash functions like the picture above. Each of these three elements has three bits each set to 1 in the bit array. When we look up for w in the set, because one of the bits is not set to 1, the bloom filter will tell us that it is not in the set. Bloom filter has the following properties: False positive is possible when the queried positions are already set to 1. But false negative is impossible. Query time is O(k). Union and intersection of bloom filters with same size and hash functions can be implemented with bitwise OR and AND operations. Cannot remove an element from the set. Bloom filter requires the following inputs: m: size of the bit array n: estimated insertion p: false positive probability The optimum number of hash functions k can be determined using the formula: Given false positive probability p and the estimated number of insertions n, the length of the bit array can be calculated as: The hash functions used for bloom filter should generally be faster than cryptographic hash algorithms with good distribution and collision resistance. Commonly used hash functions for bloom filter include Murmur hash, fnv series of hashes and Jenkins hashes. Murmur hash is the fastest among them. MurmurHash3 is used by Google Guava library's bloom filter implementation. Cardinality - HyperLogLog HyperLogLog is a streaming algorithm used for estimating the number of distinct elements (the cardinality) of very large data sets. HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the bit pattern observation that for a stream of randomly distributed numbers, if there is a number x with the maximum of leading 0 bits k, the cardinality of the stream is very likely equal to 2^k. For each element si in the stream, hash function h(si) transforms si into string of random bits (0 or 1 with probability of 1/2): The probability P of the bit patterns: 0xxxx... → P = 1/2 01xxx... → P = 1/4 001xx... → P = 1/8 The intuition is that when we are seeing prefix 0k 1..., it's likely there are n ≥ 2k+1 different strings. By keeping track of prefixes 0k 1... that have appeared in the data stream, we can estimate the cardinality to be 2p, where p is the length of the largest prefix. Because the variance is very high when using single counter, in order to get a better estimation, data is split into m sub-streams using the first few bits of the hash. The counters are maintained by m registers each has memory space of multiple of 4 bytes. If the standard deviation for each sub-stream is σ, then the standard deviation for the averaged value is only σ/√m. This is called stochastic averaging. For instance for m=4, The elements are split into m stream using the first 2 bits (00, 01, 10, 11) which are then discarded. Each of the register stores the rest of the hash bits that contains the largest 0k 1 prefix. The values in the m registers are then averaged to obtain the cardinality estimate. HyperLogLog algorithm uses harmonic mean to normalize result. The algorithm also makes adjustment for small and very large values. The resulting error is equal to 1.04/√m. Each of the m registers uses at most log2log2 n + O(1) bits when cardinalities ≤ n need to be estimated. Union of two HyperLogLog counters can be calculated by first taking the maximum value of the two counters for each of the m registers, and then calculate the estimated cardinality. Frequency - Count-Min Sketch Count-Min sketch is a probabilistic sub-linear space streaming algorithm. It is somewhat similar to bloom filter. The main difference is that bloom filter represents a set as a bitmap, while Count-Min sketch represents a multi-set which keeps a frequency distribution summary. The basic data structure is a two dimensional d x w array of counters with d pairwise independent hash functions h1 ... hd of range w. Given parameters (ε,δ), set w = [e/ε], and d = [ln1/δ]. ε is the accuracy we want to have and δ is the certainty with which we reach the accuracy. The two dimensional array consists of wd counts. To increment the counts, calculate the hash positions with the d hash functions and update the counts at those positions. The estimate of the counts for an item is the minimum value of the counts at the array positions determined by the d hash functions. The space used by Count-Min sketch is the array of w*d counters. By choosing appropriate values for d and w, very small error and high probability can be achieved. Example of Count-Min sketch sizes for different error and probability combination: ε 1 - δ w d wd 0.1 0.9 28 3 84 0.1 0.99 28 5 140 0.1 0.999 28 7 196 0.01 0.9 272 3 816 0.01 0.99 272 5 1360 0.01 0.999 272 7 1940 0.001 0.999 2719 7 19033 Count-Min sketch has the following properties: Union can be performed by cell-wise ADD operation O(k) query time Better accuracy for higher frequency items (heavy hitters) Can only cause over-counting but not under-counting Count-Min sketch can be used for querying single item count or "heavy hitters" which can be obtained by keeping a heap structure of all the counts. Summary Probabilistic data structures have many applications in modern web and data applications where the data arrives in a streaming fashion and needs to be processed on the fly using limited memory. Bloom filter, HyperLogLog, and Count-Min sketch are the most commonly used probabilistic data structures. There are a lot of research on various streaming algorithms, synopsis data structures and optimization techniques that are worth investigating and studying. If you haven't tried these data structures, you will be amazed how powerful they can be once you start using them. It may be a little bit intimidating to understand the concept initially, but the implementation is actually quite simple. Google Guava has Bloom filter implementation using murmur hash. Clearspring's Java library stream-lib and Twitter's Scala library Algebird have implementation for all three data structures and other useful data structures that you can play with. I have included the links below. Links http://bigsnarf.wordpress.com/2013/02/08/probabilistic-data-structures-for-data-analytics/ http://en.wikipedia.org/wiki/Bloom_filter http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf http://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/ http://dimacs.rutgers.edu/~graham/pubs/papers/cm-full.pdf http://www.moneyscience.com/pg/blog/ThePracticalQuant/read/438348/realtime-analytics-hokusai-adds-a-temporal-component-to-countmin-sketch http://people.cs.umass.edu/~mcgregor/711S12/sketches1.pdf https://github.com/addthis/stream-lib https://github.com/twitter/algebird
April 30, 2015
by Yin Niu
· 35,558 Views · 6 Likes
  • Previous
  • ...
  • 407
  • 408
  • 409
  • 410
  • 411
  • 412
  • 413
  • 414
  • 415
  • 416
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×