DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest AI/ML Topics

article thumbnail
Cornerstone OnDemand and TED join forces to spark innovation in professional learning and development
Curated TED Talk playlists integrated within Cornerstone Learning enable organisations to instantly access new, innovative ideas and share knowledge across their workforce June 30, 2015 - Cornerstone OnDemand, a global leader in cloud-based talent management software solutions, today announced the company is teaming with TED, the non-profit global community devoted to spreading ideas, to deliver curated TED Talks to Cornerstone clients for a new, innovative approach to professional learning and development. The first and only collaboration of its kind, Cornerstone clients now have the ability to provide their workforce with modern, mobile-enabled TED Talks from world-class leaders at the forefront of their fields from within Cornerstone Learning. Cornerstone's collaborative learning functionality also allows organisations to enable peer-to-peer knowledge capture and discussions that can extend the learning impact of TED Talks. Watched and listened to more than 1 billion times this year, TED Talks introduce ideas that can help companies transform how their people think and work. Cornerstone clients will have access to a series of curated TED Talk playlists designed to address key business challenges in an innovative format that is unique, powerful and inspiring. With curated TED Talk playlists through Cornerstone: Inspire your workforce. TED brings together the world's most inspiring and ingenious people whose ideas can strengthen how professionals understand and think about the world around them. TED's curation of talks on behalf of Cornerstone can help organisations generate excitement and engagement among employees, help management crystallise goals, start important conversations, and spark collaborations. Provide the best, most relevant content. Organisations will gain access to the very best collections of TED Talks across a wide range of topics that are central to innovation and talent development, including change management, culture building, leadership, technology, globalisation, diversity and design. Playlists have been curated to reflect talks from visionary leaders across the most influential industries, such as healthcare, education, technology, manufacturing, finance and more. Amplify the value of your learning and development strategy. Employees can view TED Talks from within Cornerstone Learning, the global learning management system (LMS) for over 1,800 leading organisations. Integrating TED Talks into professional development curriculum allows organisations to inspire each individual employee at any stage in their career. Organisations can easily target and deliver learning and development to groups or individuals with the support of TED Talks and measure impact on workforce development from within Cornerstone. Watch and Share Instantly on Mobile: As smartphones emerge as the leading platform for watching video and Web content among busy professionals, TED Talks allow employees to consume and share content on their mobile devices while on the go. Comments on the News "TED Talks are brilliantly crafted and make an emotional connection with viewers. Their ability to convey innovative and complex ideas through powerful, first-person stories is the type of talent management content that can inspire and drive real change in the workforce," said Kirsten Helvey, senior vice president, client success, Cornerstone OnDemand. "We are dedicated to helping people reach their potential by providing our clients with the most innovative talent management solutions that support their professional development and training initiatives." "With the growing demand from companies for TED Talks, Cornerstone provides TED with the expertise and efficiency in reaching millions of learners in organisations across the globe that can benefit from our content," said Deron Triff, TED's director of global distribution and licensing. "This collaboration also provides TED with an important opportunity to understand how the talks can be utilised for professional development to strengthen how we collaborate with the business community. Cornerstone will be a great alliance for bringing TED Talks to companies and sparking innovation among their employees." Additional Resources Learn more about curated TED Talk playlists for Cornerstone via the Cornerstone Marketplace: marketplace.csod.com/#/content/90 Read additional commentary by Cornerstone's director of talent management, Jeff Miller, on the value and influence of TED Talks for empowering today's workforce via the Cornerstone blog: www.cornerstoneondemand.com/blog/how-ted-gets-your-workforce-talking
June 30, 2015
by Fran Cator
· 953 Views
article thumbnail
Spark Grows Up and Scales Out
Written by Craig Wentworth. To understand the furor that’s greeted recent vendor announcements around open source analytics computing engine Spark, and some commentary seemingly setting up a Spark versus Hadoop battle, it’s worth taking a moment to recap on what each actually is (and is not). As I covered in last year’s MWD report on Hadoop and its family of tools, when people talk about Apache Hadoop they’re often referring to a whole framework of tools designed to facilitate distributed parallel processing of large datasets. That processing was traditionally confined to MapReduce batch jobs in Hadoop’s early days, though Hadoop 2 brought the YARN resource scheduler and opened up Hadoop to streaming, real-time querying and a wider array of analytical programming applications (beyond MapReduce). Spark has been designed to run on top of Hadoop’s Distributed File System (amongst other data platforms) as an alternative to MapReduce – tuned for real-time streaming data processing and fast interactive queries, and with multi-genre analytics applicability (machine learning, time series, graph, SQL, streaming out-of-the-box). It gets that speed advantage by caching in-memory (rather than writing interim results to disk, as MapReduce does), but with that approach comes a need for higher-spec physical machines (compared with MapReduce’s tolerance for commodity hardware). So, Spark isn’t about to replace Hadoop -- but it may well supplant MapReduce (especially in growing real-time use cases). Those “Spark vs Hadoop” headlines are about as meaningful as one proclaiming “mushrooms vs pizza." Yes, mushroom might be a more suitable topping than, say, pepperoni (especially in a vegetarian use case), but it’ll still be deployed on the same dough and tomato sauce pizza platform. Nobody’s about to suggest the mushroom should go it alone! But what’s behind the headlines and the hype is a story of enterprise adoption – or at least vendors anticipating that adoption and investing in ‘the weaponization of Spark’ as it faces the more exacting standards of security, scaling performance, consistency, etc. which come with mainstream enterprise deployment. Big names like IBM, Databricks (the company formed by the originators of Spark), and MapR made commitments in and around the Spark Summit earlier this month. MapR has announced three new Quick Start Solutions for its Hadoop distribution to help customers get started with Spark in real-time security log analytics, genome sequencing, and time series analytics; Databricks’ cloud-hosted Spark platform (formerly known as Databricks Cloud) has become generally available; and IBM announced a raft of measures designed to give Spark a significant shot in the arm – it’s open sourcing its SystemML technology to bolster Spark’s machine learning capabilities, integrating Spark into its own analytics platforms, investing in Spark training and education, committing 3,500 of its researchers and developers to work on Spark-related projects, and offering Spark as a service on its Bluemix developer cloud. Given the overlap with Databricks’ business model (of offering development, certification, and support for Spark), IBM’s intentions are likely to tread on some toes before long – but for now, at least, both companies are content to focus on the combined push benefiting the Spark community and its enterprise aspirations overall (though clearly IBM’s betting on all this investment buying it some influence over where Spark goes next). It’s worth bearing in mind that not all its supporters champion Spark wholesale and all the interested parties tend to be interested in particular bits of Spark (as wide-ranging as it is) because of overlaps with their own preferred toolsets. For instance, although Spark supports many analytics genres, Cloudera focuses on its machine learning capabilities (as it has its own SQL-on-Hadoop tool in Impala), and MapR and Hortonworks also promote Drill and Hive as their favoured source of SQL-on-Hadoop. IBM’s support is focused on Spark’s machine learning and in-memory capabilities (hence the SystemML open sourcing news). In the face of such strong vendor preferences, how long before some of Spark’s current features fall away (or at least start to show the effects of being starved of as much care and feeding as is bestowed upon vendors’ favourite Spark components)? The Spark community is at much the same place the Hadoop one was at a while back – it’s showing great promise and suitability in key growth workloads (in Spark’s case, such as real-time IoT applications). However, the product as it stands is too immature for many enterprise tastes. Cue enterprise software vendors stepping up to help grow Spark up fast. Their challenge though is to smooth out the edges without smothering what made it so interesting in the first place.
June 28, 2015
by Angela Ashenden
· 2,342 Views
article thumbnail
.NET Deployment Tips From New Relic Community Forum Users
[This article was written by Wyatt Lindsay] New Relic’s Community Forum is designed to be a place for our users to share their experiences, questions, problems, and fixes. The collective expertise and creativity of the New Relic community has generated some outstanding solutions to everyday issues, and we want to call out some of them in the area of .NET agent deployment excellence. Basic installation Installing New Relic’s .NET agent is designed to be simple: run the installer on the target host and choose the features you want to include. For Microsoft Azuredeployments, install one of our NuGet packages. Installation requires a reset of IIS for the software to load into your application. To upgrade, we recommend first stopping IIS, installing the newer agent version, and then starting up IIS again. It’s also possible to perform a “silent” (manual) install using msiexec.exe, for example: msiexec.exe /i C:\NewRelicAgent.msi /qb NR_LICENSE_KEY= INSTALLLEVEL=1 See the documentation for complete manual installation options. Leveraging PowerShell Scripting the silent installation provides convenience and flexibility. Here’s an example script provided by community member Jon Carl in this post: $msiName = $licenseKey = $arguments = "/i $msiName /L*v install.log /qn NR_LICENSE_KEY=$licenseKey" if ($msiName -ne $null) { $exitCode = (Start-Process -FilePath "msiexec" -ArgumentList $arguments -Wait -PassThru).ExitCode; if($exitCode -eq 0) { Write-Host "Installation successful!" -ForegroundColor Green } else { Write-Host "Installation unsuccessful. Exitcode: $exitCode" -ForegroundColor Red } } This script works great when run directly on the target machine. Another forum user (Kym McGain) noticed that the installation didn’t complete before the session ended when executing the script remotely. This caused the installer to quit partway through. Kym posted this script that uses a ‘while’ loop to ensure the installer completes. As a bonus, it stops IIS before and restarts it after the software installs. As mentioned above, these steps are usually needed when upgrading. $installNewRelic = { $runProcess = { param($process,$arguments) $res = Start-Process -FilePath $process -ArgumentList $arguments -Wait -PassThru while ($res.HasExited -eq $false) { Write-Host "Waiting for $process..." Start-Sleep -s 1 } $exitCode = $res.ExitCode if($exitCode -eq 0) { Write-Host "$process successful!" -ForegroundColor Green } else { Write-Host "$process unsuccessful. Exitcode: $exitCode" -ForegroundColor Red } } $msiName = $licenseKey = $arguments = "/i $msiName /L*v install.log /qn NR_LICENSE_KEY=$licenseKey" Invoke-Command $runProcess -ArgumentList "IISRESET","/STOP" Invoke-Command $runProcess -ArgumentList "msiexec.exe",$arguments Invoke-Command $runProcess -ArgumentList "IISRESET","/START" } Chef, Puppet, and Chocolatey Deployment options abound for modern Web developers. These solutions often require a known download path and installer name. New Relic offers a consistent filepath and MSI name for the agent in an effort to make automated deployment easier for .NET customers: http://download.newrelic.com/dot_net_agent/release/NewRelicDotNetAgent_x64.msi http://download.newrelic.com/dot_net_agent/release/NewRelicDotNetAgent_x86.msi Several Community members have created packages for these utilities. Chocolatey users are invited to use the following NuGet package created by kireevco: https://chocolatey.org/packages/newrelic-dotnet New Relic community member ePitty built a Puppet module to handle .NET agent deployment: https://github.com/epitty1023/puppet-newrelicappmon Chef users, meanwhile, can check out the following cookbook for .NET and many other platforms New Relic supports: http://community.opscode.com/cookbooks/newrelic New Relic Community member E_Bow wrote a Chef recipe that goes a step further by stopping IIS before the installation and starting it again after completion: #Stop IIS iis_site 'Website' do action [:stop] end # install latest Newrelic agent from web include_recipe 'newrelic::repository' include_recipe node['newrelic']['dotnet-agent']['dotnet_recipe'] license = node['newrelic']['application_monitoring']['license'] windows_package 'Install New Relic .NET Agent' do source node['newrelic']['dotnet-agent']['https_download'] options "/qb NR_LICENSE_KEY=#{license} INSTALLLEVEL=#{node['newrelic']['dotnet-agent']['install_level']}" installer_type :msi action :install end #Start IIS iis_site 'Website' do action [:start] end The author states that they were unable to pull the New Relic license key from the configuration JSON in Chef Overrides, requiring them to modify the config file on each machine and manually enter the key. We invite any Chef experts out there to extend and improve this recipe so that it correctly pulls the license key. We are continually impressed by the smarts and spirit of our New Relic Community Forum members, and jump at the chance to highlight their contributions. Look for more Forum projects in the New Relic blog in the future. Do you have your own approach, tips, or recipes? Please share them in the New Relic Community Forum.
June 26, 2015
by Fredric Paul
· 1,658 Views
article thumbnail
Query Autofiltering Revisited -- Let's Be More precise!
In a previous blog post, I introduced the concept of “query autofiltering”, which is the process of using the meta information (information about information) that has been indexed by a search engine to infer what the user is attempting to find. A lot of the information used to do faceted search can also be used in this way, but by employing this knowledge up front or at “query time”, we can answer questions right away and much more precisely than we could without techniques like this. A word about “precision” here – precision means having fewer “false positives” – unintended responses that creep in to a result set because they share some words with the best answers. Search applications with well tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call “noise hits”, to come back as well. In the previous post, I explained why the search engine will often “do the wrong thing” when multiple terms are used and why this is frustrating to users – they add more information to their query to make it less ambiguous and the responses often do not reward that extra effort – in many cases, the response has more noise hits simply because the query has more words. The solution that I discussed involves adding some semantic awareness to the search process, because how words are used together in phrases is meaningful and we need ways to detect user intent from these patterns. The traditional way to do this is to use Natural Language Processing or NLP to parse the user query. This can work well if the queries are spoken or written as if the person were asking another person, as in “Where can I find restaurants in Cleveland that serve Sushi?” Of course, this scenario –which goes back to the early AI days – has become much more important now that we can talk to our cell phones. For search applications like Google with a “box and a button” paradigm, user queries are usually one word or short phrases like “Sushi Restaurants in Cleveland”. These are often what linguists call “noun phrases” consisting of a word that means a person, place or thing (what of who they want to find or where) – e.g. “restaurant” and “Cleveland” and some words that add precision to their query by constraining the properties of the thing they want to find – in this case “sushi”. In other words, it is clear from this query that the user is not interested in just any restaurant – they want to find those that serve raw fish on a ball of rice or vegetable and seafood thingies wrapped in seaweed. The search engine often does the wrong thing because it doesn’t know how to combine these terms – and typically will use the wrong logical or boolean operator – OR when the users intent should be interpreted as AND. It turns out that in many cases now, our search indexes know the difference between Mexican Restaurants (which typically don’t serve Sushi) and Japanese Restaurants (which usually do) because of the metadata that we put into them to do faceted search (Funny story: after posting this, I ran across a Mexican Restaurant in Toms River, New Jersey that does serve sushi – but still, most of them don’t!). The goal of query autofiltering is to use that built in knowledge to answer the question right away and not wait for the user to “drill in” using the facets. If users don’t give us a precise query (like simply “restaurants”), we can still use faceting, but if they do, it would be cool if we could cut to the chase. As you’ll see, it turns out that we can do this. The previous post contained a solution which I called a “Simple” Category Extraction component. It works by seeing if single tokens in the query matched field values in the search index (using a cool Lucene feature that enable us to mine the “uninverted” index for all of the values that were indexed in a field). For example, if it sees the token “red” and discovers that “red” is one of the values of a “color” field, it would infer that the user was looking for things that are “red” in “color” and will constrain the query this way. The solution works well in a limited set of cases, but there are a number of problems with it that make it less useful in a production setting. It does a nice job in cases where the term “red” is used to qualify or more precisely specify a thing – such as “red sofa”. It does not do so well in cases where the term “red” is not used as a qualifier – such as when it is part of a brand or product name such as “Red Baron Pizza” or “Johnny Walker Red Label” (great Scotch, but “Black Label” is even better, maybe I’ll be rich enough to afford “Blue Label” some day – but I digress …). It is interesting to note that the simple extractor’s main shortcomings are due to the fact that it looks at single tokens at a time in isolation from the tokens around it. This turns out to be the same problem that the core search engine algorithms have – i.e., it’s a “bag of words” approach that doesn’t consider – wait for it – semantic context. The solution is to look for patterns of words that match patterns of content attributes. This does a much better job of disambiguation. We can use the same coding trick as before (upgraded for API changes introduced in Solr 5.0), but we need to account for context and usage – as much as we can without having to introduce full-blown NLP which needs lots of text to crunch. In contrast, this approach can work when we just have structured metadata. Searching vs Navigating A little historical background here. With modern search applications, there are basically two types of user activities that are intermingled: searching and navigating. The former involves typing into a box and the latter, clicking on facet links. In the old days, there was a third user interface called an “advanced” search form where users could pick from a set of metadata fields, put in a value and select their logical combination operators– an interface that would be ideally suited for precise searching given rich metadata. The problem is that nobody wants to use it. Not that people ever liked this interface anyway (except those with Master of Library Science degrees), but Google has also done much to demote this interface to a historical reference. Google still has the same problem of noise hits but they have built a reputation for getting the best results to the top (and usually, they do) – and they also eschew facets (they kinda have them at the bottom of the search page now as related searches). Users can also “markup” their query with quotation marks or boolean expressions or ‘+/-’ signs but trust me – they won’t do that either (typically that is). What this means is that the little search box – love it or hate it – is our main entry point – i.e. we have to deal with it, because that is what users want – to just type stuff and then get the “right” answer back. (If poor ease-of-use or the simple joy of Google didn’t kill the advanced search form completely, the migration to mobile devices absolutely will). A Little Solr/Lucene Technology – String fields, Text fields and “free-text” searching: In Solr, when talking about textual data, these two user activities are normally handled by two different types of index field: string and text. String fields are not analyzed (tokenized) and searching them requires an exact match on a value indexed within a field. This value can be a word or a phrase. In other words, you need to use : syntax in the query (and quoted “value here” syntax if the query is multi-term) – something that power users will be OK with but not something that we can expect of the average user. However, string fields are very good for faceted navigation. Text fields on the other hand are analyzed (tokenized and filtered) and can be searched with “freetext” queries – our little box in other words. The problem here is that tokenization turns a stream of text into a stream of tokens (words) and while we do preserve positional information so we can search on phrases, we don’t know a priori where those phrases are. Text fields can also be faceted (in fact, any field can be a facet field in Solr), but in this case, the facets are based on individual tokens which don’t tend to be too useful. So we have two basic field types for text data, one good for searching and one for navigating. In the harder-to-search type, we know exactly where the phrases are but we typically don’t in the easier-to-search type. A classic trade-off scenario. Since string fields are harder to search (at least within the Google paradigm that users love), we make them searchable by copying their data (using the Solr “copyField” directive) into a catchall text field called “text” by default. This works, but in the process we throw away information about which values are meant to be phrases and which are not. Not only that, we’ve lost the context of what these values represent (the string fields that they came from). So although we’ve made these string fields more searchable, we’ve had to do that by putting them into a “bag of words” blender. But the information is still somewhere in the search index, we just need a way to get it back at at “query time”. Then, we can both have our cake AND eat it! Noun Phrases and the Hierarchy of meta information When we talk about things, there are certain attributes that describe what the thing is (type attributes) and others that describe the properties or characteristics of the thing. In a structured database or search index, both of these kinds of attributes are stored the same way – as field/value pairs. There are however, natural or semantic relationships between these fields that the database or search engine can’t understand, but we do. That is, noun phrases that describe more specific sets of things are buried in the relationships between our metadata fields. All we have to do is dig them out. For example, if I have a database of local businesses, I can have a “what” field like business type that has values like “restaurant”, “hardware store”, “drug store”, “filling station” and so forth. Within some of these business types like restaurant, there may be refining information like restaurant type (“Mexican”, “Chinese”, “Italian”, etc) or brand/franchise (“Exxon”, “Sunoco”, “Hess”, “Rite-Aid”, “CVS”, “Walgreens”, etc.) for gas stations and drug stores. These fields form a natural hierarchy of metadata in which some attributes refine or narrow the set of things that are labeled by broader field types. Rebuilding Context: Identifying field name patterns to find relevant phrase patterns So now its time to put Humpty Dumpty back together again. With Solr/Lucene – it is likely that the information that we need to give precise answers to precise questions is available in the search index. If we can identify sub-phrases within a query that refer or map to a metadata field in the index, we can then add the appropriate metadata mapping on behalf of the user. We are then able to answer questions like “Where is the nearest Tru Value hardware store?” because we can identify the phrase “Tru Value” as a business name and “hardware store” as a specific type of store. Assuming that this information is in the index in the form of metadata fields, parsing the query is a matter of detecting these metadata values and associating them with their source fields. Some additional NLP magic can be used to infer other aspects of the question such as “where is the nearest”, which should trigger the addition of a spatial proximity query filter for example. The Query AutoFiltering Search Component To implement the idea set out above, I developed a Solr Search Component called QueryAutoFilteringComponent. Search components are executed as part of the search request handling process. Besides executing a search, they can also do other things like spell checking or query suggestion, return the set of terms that are indexed in a field or the term vectors (term frequency statistics) among other things. The SearchComponent interface defines a number of methods one of which – prepare( ) – is executed by all of the components in a search handler chain before the request is processed. By specifying that a non-standard component is in the “first-components” list – it will be executed before the query is sent to the index by the downstream QueryComponent. This gives these early components a chance to modify the query before it is executed by the Lucene engine (or distributed to other shards in SolrCloud). The QueryAutoFilteringComponent works by creating a mapping of term values to the index fields that contain them. It uses the Lucene UnivertedIndex and the Solr TermsComponent (in SolrCloud mode) to build this map. This “inverse” map of term value -> index field is then used to discover if any sub-phrase within a query maps to a particular index field. If so, a filter query (fq) or boost query (bq) – depending on the configuration – is created from that field:value pair and if the result is to be a filter query, the value is removed from the original query. The result is a series of query expressions for the phrases that were identified in the original query. An example will help to make this clearer. Assuming that we have indexed the following records: This example is admittedly a bit contrived in that the term “red” is deliberately ambiguous – it can occur as a color value or as part of a brand or product_type phrase. So, with the OOTB Solr /select handler, a search for “red lion socks” brings back all 16 records. However, with the QueryAutoFilterComponent, only 2 results are returned (4 and 5) for this query. Furthermore, searching for “red wine” will only bring back one record (11) whereas searching for “red wine vinegar” brings back just record 12. What the filter does is to match terms with fields, trying to find the longest contiguous phrases that match mapped field values. So for the query “red lion socks” – it will first discover that “red” is a color, but then it will discover that “red lion” is a brand and this will supercede the shorter match that starts with “red”. Likewise, with “red wine vinegar”, it will first find “red” == color, then “red wine” == product_type then “red wine vinegar” == product_type and the final match will win because it is the longest contiguous match. It will work across fields too. If the query is “blue red lion socks” – it will discover that “blue” is a color, then that “blue red” is nothing so it will move on to the next unmatched token – “red”. It will then, as before, discover that “red lion” is a brand, reject “red lion socks” which doesn’t map to anything and finally find that “socks” is a product_type. From these three field matches it will construct a filter (or boost) query with the appropriate mapping of field name to field value. The result of all of this is a translation of the Solr query: q=blue red lion socks to a filter query: fq=color:blue&fq=brand:”red lion”&fq=product_type:socks This final query brings back just 1 result as opposed to 16 for the unfiltered case. In other words, we have increased precision from 6.25% to 100%! Adding case sensitivity and synonym support: One of the problems with using string fields as the source of metadata for noun phrases is that they are not analyzed (as discussed above). This limits the set of user inputs that can match – without any changes, the user must type in exactly what is indexed, including case and plurality. To address this problem, support for basic text analysis such as case insensitivity and stemming (singular/plural) as well as support for synonyms was added to the QueryAutoFilteringComponent. This adds to the code complexity somewhat but it makes it possible for the filter to detect synonymous phrases in the query like “couch” or “lounge chair” when “Sofa” or “Chaise Lounge” were indexed. Another thing that can help at an application level is to develop a suggester for typeahead or autocomplete interfaces that uses the Solr terms component and facet maps to build a multi-field suggester that will guide users towards precise and actionable queries. I hope to have a post on this in the near future. Source Code For those that are interested in how the autofiltering component works or would like to use it in your search application, source code and design documentation are available on github. The component has also been submitted to Solr (SOLR-7539 if you want to track it). The source code on github is in two versions, one that compiles and runs with Solr 4.x and the other that uses the new UninvertingReader API that must be used in Solr 5.0 and above. Conclusions The QueryAutoFilteringComponent does a lot more than the simple implementation introduced in the previous post. Like the previous example, it turns a free form queries into a set of Solr filter queries (fq) – if it can. This will eliminate results that do not match the metadata field values (or their synonyms) and is a way to achieve high precision. Another way to go is to use the “boost query” or bq rather than fq to push the precise hits to the top but allow other hits to persist in the result set. Once contextual phrases are identified, we can boost documents that contain these phrases in the identified fields (one of the chicken-and-egg problems with query-time boosting is knowing what field/value pairs to boost). The boosting approach may make more sense for traditional search applications viewed on laptop or workstation computers whereas the filter query approach probably makes more sense for mobile applications. The component contains a configurable parameter “boostFactor” which when set, will cause it to operate in boost mode so that records with exact matches in identified fields will be boosted over records with random or partial token hits.
June 22, 2015
by Lisa Warner
· 2,136 Views
article thumbnail
Long-Term Log Analysis with AWS Redshift
You will aggregate a lot of logs over the lifetime of your product and codebase, so it’s important to be able to search through them. In the rare case of a security issue, not having that capability is incredibly painful. You might be able to use services that allow you to search through the logs of the last two weeks quickly. But what if you want to search through the last six months, a year, or even further? That availability can be rather expensive or not even an option at all with existing services. Many hosted log services provide S3 archival support which we can use to build a long-term log analysis infrastructure with AWS Redshift. Recently I’ve set up scripts to be able to create that infrastructure whenever we need it at Codeship. AWS Redshift AWS Redshift is a data warehousing solution by AWS. It has an easy clustering and ingestion mechanism ideal for loading large log files and then searching through them with SQL. As it automatically balances your log files across several machines, you can easily scale up if you need more speed. As I said earlier, looking through large amounts of log files is a relatively rare occasion; you don’t need this infrastructure to be around all the time, which makes it a perfect use case for AWS. Setting Up Your Log Analysis Let’s walk through the scripts that drive our long-term log analysis infrastructure. You can check them out in the flomotlik/redshift-logging GitHub repository. I’ll take you step by step through configuring the whole setup of the environment variables needed, as well as starting the creation of the cluster and searching the logs. But first, let’s get a high-level overview of what the setup script is doing before going into all the different options that you can set: Creates an AWS Redshift cluster. You can configure the number of servers and which server type should be used. Waits for the cluster to become ready. Creates a SQL table inside the Redshift cluster to load the log files into. Ingests all log files into the Redshift cluster from AWS S3. Cleans up the database and prints the psql access command to connect into the cluster. Be sure to check out the script on GitHub before we go into all the different options that you can set through the .env file. Options to set The following is a list of all the options available to you. You can simply copy the .env.template file to .env and then fill in all the options to get picked up. AWS_ACCESS_KEY_ID AWS key of the account that should run the Redshift cluster. AWS_SECRET_ACCESS_KEY AWS secret key of the account that should run the Redshift cluster. AWS_REGION=us-east-1 AWS region the cluster should run in, default us-east-1. Make sure to use the same region that is used for archiving your logs to S3 to have them close. REDSHIFT_USERNAME Username to connect with psql into the cluster. REDSHIFT_PASSWORD Password to connect with psql into the cluster. S3_AWS_ACCESS_KEY_ID AWS key that has access to the S3 bucket you want to pull your logs from. We run the log analysis cluster in our AWS Sandbox account but pull the logs from our production AWS account so the Redshift cluster doesn’t impact production in any way. S3_AWS_SECRET_ACCESS_KEY AWS secret key that has access to the S3 bucket you want to pull your logs from. PORT=5439 Port to connect to with psql. CLUSTER_TYPE=single-node The cluster type can be single-node or multi-node. Multi-node clusters get auto-balanced which gives you more speed at a higher cost. NODE_TYPE Instance type that’s used for the nodes of the cluster. Check out the Redshift Documentation for details on the instance types and their differences. NUMBER_OF_NODES=10 Number of nodes when running in multi-mode. CLUSTER_IDENTIFIER=log-analysis DB_NAME=log-analysis S3_PATH=s3://your_s3_bucket/papertrail/logs/862693/dt=2015 Database format and failed loads When ingesting log statements into the cluster, make sure to check the amount of failed loads that are happening. You might have to edit the database format to fit to your specific log output style. You can debug this easily by creating a single-node cluster first that only loads a small subset of your logs and is very fast as a result. Make sure to have none or nearly no failed loads before you extend to the whole cluster. In case there are issues, check out the documentation of the copy command which loads your logs into the database and the parameters in the setup script for that. Example and benchmarks It’s a quick thing to set up the whole cluster and run example queries against it. For example, I’ll load all of our logs of the last nine months into a Redshift cluster and run several queries against it. I haven’t spent any time on optimizing the table, but you could definitely gain some more speed out of the whole system if necessary. It’s just fast enough already for us out of the box. As you can see here, loading all logs of May — more than 600 million log lines — took only 12 minutes on a cluster of 10 machines. We could easily load more than one month into that 10-machine cluster since there’s more than enough storage available, but for this post, one month is enough. After that, we’re able to search through the history of all of our applications and past servers through SQL. We connect with our psql client and send of SQL queries against the “events’ database. For example, what if we want to know how many build servers reported logs in May: loganalysis=# select count(distinct(source_name)) from events where source_name LIKE 'i-%'; count ------- 801 (1 row) So in May, we had 801 EC2 build servers running for our customers. That query took ~3 seconds to finish. Or let’s say we want to know how many people accessed the configuration page of our main repository (the project ID is hidden with XXXX): loganalysis=# select count(*) from events where source_name = 'mothership' and program LIKE 'app/web%' and message LIKE 'method=GET path=/projects/XXXX/configure_tests%'; count ------- 15 (1 row) So now we know that there were 15 accesses on that configuration page throughout May. We can also get all the details, including who accessed it when through our logs. This could help in case of any security issues we’d need to look into. The query took about 40 seconds to go though all of our logs, but it could be optimized on Redshift even more. Those are just some of the queries you could use to look through your logs, gaining more insight into your customers’ use of your system. And you et all of that with a setup that costs $2.50 an hour, can be shut down immediately, and recreated any time you need access to that data again. Conclusions Being able to search through and learn from your history is incredibly important for building a large infrastructure. You need to be able to look into your history easily, especially when it comes to security issues. With AWS Redshift, you have a great tool in hand that allows you to start an ad hoc analytics infrastructure that’s fast and cheap for short-term reviews. Of course, Redshift can do a lot more as well. Let us know what your processes and tools around logging, storage, and search are in the comments.
June 21, 2015
by Florian Motlik
· 1,444 Views
article thumbnail
Ecosystem of Hadoop Animal Zoo
hadoop is best known for map reduce and it's distributed file system (hdfs). recently other productivity tools developed on top of these will form a complete ecosystem of hadoop. most of the projects are hosted under apache software foundation . hadoop ecosystem projects are listed below. hadoop common a set of components and interfaces for distributed file system and i/o (serialization, java rpc, persistent data structures) http://hadoop.apache.org/ hadoop ecosystem hdfs a distributed file system that runs on large clusters of commodity hardware. hadoop distributed file system, hdfs renamed form ndfs. scalable data store that stores semi-structured, un-structured and structured data. http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfsuserguide.html http://wiki.apache.org/hadoop/hdfs map reduce map reduce is the distributed, parallel computing programming model for hadoop. inspired from google map reduce research paper . hadoop includes implementation of map reduce programming model. in map reduce there are two phases, not surprisingly map and reduce. to be precise in between map and reduce phase, there is another phase called sort and shuffle. job tracker in name node machine manages other cluster nodes. map reduce programming can be written in java. if you like sql or other non- java languages, you are still in luck. you can use utility called hadoop streaming. http://wiki.apache.org/hadoop/hadoopmapreduce hadoop streaming a utility to enable map reduce code in many languages like c, perl, python, c++, bash etc., examples include a python mapper and awk reducer. http://hadoop.apache.org/docs/r1.2.1/streaming.html avro a serialization system for efficient, cross-language rpc and persistent data storage. avro is a framework for performing remote procedure calls and data serialization. in the context of hadoop, it can be used to pass data from one program or language to another, e.g. from c to pig. it is particularly suited for use with scripting languages such as pig, because data is always stored with its schema in avro. http://avro.apache.org/ apache thrift apache thrift allows you to define data types and service interfaces in a simple definition file. taking that file as input, the compiler generates code to be used to easily build rpc clients and servers that communicate seamlessly across programming languages. instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business. http://thrift.apache.org/ hive and hue if you like sql, you would be delighted to hear that you can write sql and hive convert it to a map reduce job. but, you don't get a full ansi-sql environment. hue gives you a browser based graphical interface to do your hive work. hue features a file browser for hdfs, a job browser for map reduce/yarn, an hbase browser, query editors for hive, pig, cloudera impala and sqoop2.it also ships with an oozie application for creating and monitoring workflows, a zookeeper browser and an sdk. pig a high-level programming data flow language and execution environment to do map reduce coding the pig language is called pig latin. you may find naming conventions some what un-conventional, but you get incredible price-performance and high availability. https://pig.apache.org/ jaql jaql is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. as its name implies, a primary use of jaql is to handle data stored as json documents, but jaql can work on various types of data. for example, it can support xml, comma-separated values (csv) data and flat files. a "sql within jaql" capability lets programmers work with structured sql data while employing a json data model that's less restrictive than its structured query language counterparts. 1. jaql in google code 2. what is jaql? by ibm sqoop sqoop provides a bi-directional data transfer between hadoop -hdfs and your favorite relational database. for example you might be storing your app data in relational store such as oracle, now you want to scale your application with hadoop so you can migrate oracle database data to hadoop hdfs using sqoop. http://sqoop.apache.org/ oozie manages hadoop workflow. this doesn't replace your scheduler or BPM tooling, but it will provide if-then-else branching and control with hadoop jobs. https://oozie.apache.org/ zookeeper a distributed, highly available coordination service. zookeeper provides primitives such as distributed locks that can be used for building the highly scalable applications. it is used to manage synchronization for cluster. http://zookeeper.apache.org/ hbase based on google's bigtable , hbase "is an open-source, distributed, version, column-oriented store" that sits on top of hdfs. a super scalable key-value store. it works very much like a persistent hash-map (for python developers think like a dictionary). it is not a conventional relational database. it is a distributed, column oriented database. hbase uses hdfs for it's underlying. supports both batch-style computations using map reduce and point queries for random reads. https://hbase.apache.org/ cassandra a column oriented nosql data store which offers scalability, high availability with out compromising on performance. it perfect platform for commodity hardware and cloud infrastructure.cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for de-normalization and materialized views , and powerful built-in caching. http://cassandra.apache.org/ flume a real time loader for streaming your data into hadoop. it stores data in hdfs and hbase.flume "channels" data between "sources" and "sinks" and its data harvesting can either be scheduled or event-driven. possible sources for flume include avro, files, and system logs, and possible sinks include hdfs and hbase. http://flume.apache.org/ mahout machine learning for hadoop, used for predictive analytics and other advanced analysis. there are currently four main groups of algorithms in mahout: recommendations, a.k.a. collective filtering classification, a.k.a categorization clustering frequent item set mining, a.k.a parallel frequent pattern mining mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable; that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. algorithms in the mahout library belong to the subset that can be executed in a distributed fashion. http://en.wikipedia.org/wiki/list_of_machine_learning_algorithms https://www.coursera.org/course/machlearning https://mahout.apache.org/ fuse makes the hdfs system to look like a regular file system so that you can use ls, rm, cd etc., directly on hdfs data. whirr apache whirr is a set of libraries for running cloud services. whirr provides a cloud-neutral way to run services. you don't have to worry about the idiosyncrasies of each provider.a common service api. the details of provisioning are particular to the service. smart defaults for services. you can get a properly configured system running quickly, while still being able to override settings as needed. you can also use whirr as a command line tool for deploying clusters. https://whirr.apache.org/ giraph an open source graph processing api like pregel from google https://giraph.apache.org/ chukwa chukwa, an incubator project on apache, is a data collection and analysis system built on top of hdfs and map reduce. tailored for collecting logs and other data from distributed monitoring systems, chukwa provides a workflow that allows for incremental data collection, processing and storage in hadoop. it is included in the apache hadoop distribution as an independent module. https://chukwa.apache.org/ drill apache drill, an incubator project on apache, is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. drill is the open source version of google's dremel system which is available as an iaas service called google big query. one explicitly stated design goal is that drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. http://incubator.apache.org/drill/ impala (cloudera) released by cloudera, impala is an open-source project which, like apache drill, was inspired by google's paper on dremel; the purpose of both is to facilitate real-time querying of data in hdfs or hbase. impala uses an sql-like language that, though similar to hiveql, is currently more limited than hiveql. because impala relies on the hive meta store, hive must be installed on a cluster in order for impala to work. the secret behind impala's speed is that it "circumvents map reduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel rdbmss." (source: cloudera) http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html http://training.cloudera.com/elearning/impala/
June 3, 2015
by Umashankar Ankuri
· 23,870 Views · 3 Likes
article thumbnail
Efficient Cassandra Write Pattern for Micro-Batching
The best way to write to a Cassandra cluster are concurrent asynchronous writes. In cases where data exhibits strong temporal locality, speed can be improved.
May 20, 2015
by John Georgiadis
· 34,981 Views · 1 Like
article thumbnail
How To Set Up a Tomcat, Apache and mod_jk Cluster
In this article I will go through a common set-up for a small production environment. A single tier, load balanced application server cluster. Overview A high level overview of what we will be doing. Downloading and installing Apache HTTP server and mod_jk Downloading Tomcat Downloading Java Configuring two local Tomcat servers Clustering the two Tomcat servers Configuring Apache to use mod_jk to forward request to Tomcat Deploying application to Tomcat server that tests our set-up Introduction What is Apache? Apache is an HTTP server. What is mod_jk? It is an Apache module that allows AJP communication between Apache and a back end application server like Tomcat.I am running this on Ubuntu 14.04LTS installed on a dual boot PC with Windows 7. Download Apache2 We are going to use Ubuntu's APT package maintenance system to obtain and install Apache2. sudo apt-get install apache2 This will install in /etc/apache2 Download and install mod_jk The mod_jk module is not included in the Apache2 download so must be obtained and installed separately. The installation requires that the mod_jk module is visible to Apache and configured to ensure that Apache knows where to look for it and what to do with the requests you want to proxy. sudo apt-get install libapache2-mod-jk This will install in /etc/libapache2-mod-jk also two files have been added to the /etc/apache2/mods-available folder. Downloading and installing Tomcat 8 At the time of writing this Tomcat 8 does not have a package in APT so you must download the binaries from the tomcat website.http://tomcat.apache.org/download-80.cgi select the appropriate binary distribution and extract it as follows. tar xvzf apache-tomcat-8.0.5.tar.gz We need two copies of the Tomcat server to be load balanced. I created two directories in the /opt/ location: /opt/tomcat-server1/ and /opt/tomcat-server2/ and copied tomcat into each one. Download and install Java Download Java from APT as follows: apt-get install openjdk-7-jdk and set JAVA_HOME in .bashrc vim ~/.bashrc export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 Configure two local Tomcat servers We will edit only the server.xml of the server2 installation of tomcat. We need to change port numbers to avoid conflicts.We change the following: and comment out the HTTP Connector as we only want the web application to be accessible through the load balancer.Here is my server2 Tomcat server.xml configuration. Configure mod_jk Load balancing is configured in the workers.properties file, located /etc/libapache2-mod-jk/ where workers represent actual or virtual workers.We will define two actual workers and two virtual workers which map to the Tomcat servers. In the worker.list property I have defined two virtual workers: status and loadbalancer, I will refer to these later in the Apache configuration.Workers for each server have been defined using values for the server.xml configuration files. I used the port values for the AJP connectors and I have included an lbfactor that sets the preference that the load balancer will show for that server.Finally we define the virtual workers. The loadbalancer worker is set to type lb and set the workers that represent the Tomcat servers in the balancer_workers properties. The status only needs to be set to type status. worker.list=loadbalancer,status worker.server1.port=8009worker.server1.host=localhostworker.server1.type=ajp13 worker.server2.port=9009worker.server2.host=localhostworker.server2.type=ajp13 worker.server1.lbfactor=1worker.server2.lbfactor=1 worker.loadbalancer.type=lbworker.loadbalancer.balance_workers=server1,server2 worker.status.type=status Ensure that you remove any other worker configuration that are not being used. Configure Apache Web Server to forward requests You will need to add the following to the Apache configurations located in etc/apache2/sites-enabled/000-default.conf JkMount /status status JkMount /* loadbalancer Verify the installation To test that all has been configured correctly we need to deploy an application. A sample application that has been used for years to test such configurations is called the ClusterJSP sample application. You can find it by googling in or from the JBoss site.Now deploy the war to the webapps folder on both servers and start each server using the start-up script /opt/tomcat-server1/bin/startup.sh.Go to http://localhost/clusterjsp/HaJsp.jsp and you should see the page show HttpSession information. Now lets look at the mod_jk status page: http://localhost/status. You will see that this page shows information about the load balancer workers and the workers it is balancing. If everything is working you will see the worker error state show OK or OK/IDLE if they are not currently balancing load. Things to try out Enable sticky sessions: Configure jvmRoute in the server.xml configuration. Further reading Loadbalancing with mod_jk and ApacheWorking with mod_jk Connecting Apache's Web Server to Multiple Instances of Tomcat
May 19, 2015
by Alex Theedom
· 10,789 Views · 1 Like
article thumbnail
Docker Machine on Windows - How To Setup You Hosts
I've been playing around with Docker a lot lately. Many reasons for that, one for sure is, that I love to play around with latest technology and even help out to build a demo or two or a lab. The main difference, between what everybody else of my coworkers is doing is, that I run my setup on Windows. Like most of the middleware developers out there. So, If you followed Arun's blog about "Docker Machine to Setup Docker Host" you might have tried to make this work on windows already. Here is the ultimate short how-to guide on using Docker Machine to administrate and spin up your Docker hosts. Docker Machine Machine lets you create Docker hosts on your computer, on cloud providers, and inside your own data center. It creates servers, installs Docker on them, then configures the Docker client to talk to them. You basically don't have to have anything installed on your machine prior to this. Which is a hell lot easier, than having to manually install boot2docker before. So, let's try this out. You want to have at least one thing in place before starting with anything Docker or Machine. Go and get Git for Windows (aka msysgit). It has all kinds of helpful unix tools in his belly, which you need anyway. Prerequisites - The One For All Solution The first is to install the windows boot2docker distribution which I showed in an earlier blog. It contains the following bits configured and ready for you to use: - VirtualBox - Docker Windows Client Prerequisites- The Bits And Pieces I dislike the boot2docker installer for a variety of reasons. Mostly, because I want to know what exactly is going on on my machine. So I played around a bit and here is the bits and pieces installer if you decide against the one-for-all solution. Start with the virtualization solution. We need something like that on Windows, because it just can't run Linux and this is what Docker is based on. At least for now. So, get VirtualBox and ensure that version 4.3.18 is correctly installed on your system (VirtualBox-4.3.18-96516-Win.exe, 105 MB). WARNING: There is a strange issue, when you run Windows itself in Virtualbox. You might run into an issue with starting the host. And while you're at it, go and get the Docker Windows Client. The other is to grab the final from the test servers as a direct download (docker-1.6.0.exe, x86_64, 7.5MB). Rename to "docker" and put it into a folder of your choice (I assume it will be c:\docker\. Now you also need to download Docker Machine, which is another single executable (docker-machine_windows-amd64.exe, 11.5MB). Rename to "docker-machine" and put it into the same folder. Now add this folder to your PATH: set PATH=%PATH%;C:\docker If you change your standard PATH environment variable, this might safe your from a lot of typing. That's it. Now you're ready to create your first Machine managed Docker Host. Create Your Docker Host With Machine All you need is a simple command: docker-machine create --driver virtualbox dev And the output should state: ←[34mINFO←[0m[0000] Creating SSH key... ←[34mINFO←[0m[0001] Creating VirtualBox VM... ←[34mINFO←[0m[0016] Starting VirtualBox VM... ←[34mINFO←[0m[0022] Waiting for VM to start... ←[34mINFO←[0m[0076] "dev" has been created and is now the active machine. ←[34mINFO←[0m[0076] To point your Docker client at it, run this in your shell: eval "$(docker-machine.exe env dev)" This means, you just created a Docker Host using the VirtualBox provider and the name “dev”. Now you need to find out on which IP address the host is running. docker-machine ip 192.168.99.102 If you want to configure your environment variables, needed by the client more easy, just use the following command: docker-machine env dev export DOCKER_TLS_VERIFY=1 export DOCKER_CERT_PATH="C:\\Users\\markus\\.docker\\machine\\machines\\dev" export DOCKER_HOST=tcp://192.168.99.102:2376 Which outputs the Linux version of environment variable definition. All you have to do is to change the "export" keyword to "set", remove the " and the double back-slashes and you are ready to go. C:\Users\markus\Downloads>set DOCKER_TLS_VERIFY=1 C:\Users\markus\Downloads>set DOCKER_CERT_PATH=C:\Users\markus\.docker\machine\machines\dev C:\Users\markus\Downloads>set DOCKER_HOST=tcp://192.168.99.102:2376 Time to test our Docker Client And here we go now run WildFly on your freshly created host: docker run -it -p 8080:8080 jboss/wildfly Watch the container being downloaded and check, that it is running by redirecting your browser to http://192.168.99.102:8080/. Congratulations on having setup your very first docker host with Maschine on Windows.
May 12, 2015
by Markus Eisele
· 20,142 Views
article thumbnail
Why Elasticsearch is Suitable for Application Log Analytics
Handling Application Logs Enterprise application development using Web technologies has been around for a long time. In recent years we have seen a sharp increase in the deployment of such applications. This is partly due to the proliferation of ecommerce sites, social media sites, mobile application supporting sites, as well as the desire of enterprises to have their applications available 24x7. In most cases, such applications cater to huge load and are deployed on cloud infrastructure. Monitoring deployed applications is increasingly becoming a crucial task, as deployed applications are bound to fail, irrespective of the robust techniques used during development. Whenever an application fails, the most common resolution method starts by examining the application log. If the application has implemented logging properly, the logs can reveal the cause of application failure. Examination of log files is usually done by viewing the file using tools like vi, less, more, tail or grep. Another method is to download the file to a Windows system and viewing it using an editor like Notepad++. Engineers usually scan the log information to look for clues that point to the reasons for failure. Once the cause of failure is identified, suitable action is taken for restoring the application and/or service. The Key to Application Log Analytics This process, of logging onto a remote system and viewing logs is tedious. Additionally, many of the tools do not provide support to make the task of issue identification any simpler. Even when using tools like grep (if we know the pattern), we still need to view the logs in order to go through other information that has been logged, such as the log information that precedes the failure point. While it has always been possible to develop applications to parse application logs, the recent renewed interest in application log analytics is due to the acceptance of NoSQL-like technologies and the availability of standard tools to parse application logs. Though relational databases (RDBMS) have for many years provided the facility to store structured data, they are not well-suited for handling log data, as in many cases, the structure of the logged information is not the same across the file. This does not fit well in the rigidly defined world of an RDBMS. In comparison, NoSQL allows document flexibility and documents with different schemas can be stored in the same database / index / store. The ability to convert log data into a well-defined structure, as well as the ability to search, are key to implement a modern log analytics solution. In this document, we cover how Elasticsearch. Elasticsearch can store documents, giving us the benefit of structured storage without the overheads of a database system. The Suitability of Elasticsearch In the following subsections, we share our views as to why Elasticsearch is a suitable data store for an application log analytics solution. Elasticsearch is part of a popular trio of tools, commonly known as ELK. Of these, L stands for Logstash, the log parser; E stands for Elasticsearch, the document store; and K stands for Kibana, the visualization tool. Storing Documents Logstash can be used to parse plain text data into structured text. Once data has some structure, it becomes easy to find information by enabling search on it. While parsing application logs is not a challenge, the challenge has been in storing the data and enabling search on it. Most prior solutions have used an RDBMS for storage, but the varying structure and textual nature of application logs makes it difficult to use an RDBMS table structure to store data. RDBMSs are not geared toward ‘search’. They are geared for maintaining a ‘single value of truth’ for the data, defining relations between the data, ensuring their consistency and so on. Search is also not a strong point for RDBMSs as they use exact matches for values, while Elasticsearch supports exact matches as well as partial matches. It also supports document scoring, which attaches a confidence factor to the documents located. Elasticsearch supports documents in JSON format and uses the NoSQL philosophy for document storage. This has the advantage of allowing a flexible schema for the data. Unlike an RDBMS, Elasticsearch is a search engine at heart and hence is built for the same. Though Elasticsearch uses NoSQL for storing documents, it does not provide robust methods to update stored data. Not supporting updates is a serious disadvantage in most cases. In the case of application logs, not supporting updates actually works in favour of Elasticsearch. In case of machine logs, updates are not really required. Application logs are generated from a debugging perspective – having data handy for debugging purposes in the event of application crash or incorrect execution. They usually record important events from application execution and provide additional information to allow application developers to identify the reasons for failure. Additionally, existing information in application logs is rarely, if ever, updated. New information is continually being written to the logs, with no need to refer to old information. This plays to Elasticsearch’s strength, which is able to ingest and index new information very quickly. Search One of the easiest ways of locating information from large volumes of logs is to perform a search. Elasticsearch is well suited not only to handle search, it also supports huge volume of data, using distributed computing (implemented using Shards). While Kibana is one of the commonly used tools to display and visualize information stored in Elasticsearch, it is more suited to display standard charts like bar chart, column chart and pie chart. If the features provided by Kibana are not enough, we can always use Elasticsearch’s REST API support and it’s Query DSL (Domain-Specific Language), to search for required information. The Query DSL and the result of the query are in JSON format. Though this format makes it easy for applications to parse and process, users would need a friendly user interface to interact with the data. Handling Voluminous Data Elasticsearch supports distributed search out of the box – using the concept of ‘shards’. A shard is a single Lucene instance and is managed by Elasticsearch. Two types of shards, namely ‘primary shard’ and ‘replica shard’ are supported. By default, a document is first indexed on the primary shard and then on the replica shards. The number of primary shards can be specified, to cater to the expected volume. By default, Elasticsearch creates five shards for an index. But, once the number of primary shards is decided, it cannot be changed. A replica shards are copies the primary shard. They are used to handle fail-over and the increase performance. While performance across voluminous data can be handled by sharding, it is important to note that shards, once created for an index, cannot be changed. Thus, the sharding strategy of the data has to be decided in advance, after an assessment of the data and an estimation of its growth. In the case of application logs, the sharding strategy can be based on the application name, the business unit ID, the application OD or the application’s geolocation, just to name a few. Analytics By storing data in a structure, analytics can be enabled on the data. Not only can application perform a simple search, it is also possible to restrict the search for specific terms or over a specified time period. Structured storage also makes it easier to develop reports with well-defined visualizations, which in turn makes it easy to understand the current state of applications. It is also possible to perform various analytics operations like time series analysis using the timestamp and identification of patterns from the data using machine learning techniques (assuming, we have the right kind of data in the logs). Though Elasticsearch does not provide built-in support for analytics, applications can benefit from its fast search capability and also from its ability to handle voluminous data sets. In Closing One of the main hurdles for application logs has been the ability to search for information from the huge volume of data. By parsing application log files using Logstash, we can convert a flat file into structured data. Structured data, once stored in Elasticsearch, is easier to search and locate. Visualizations and business logic for generating alerts and tickets is easier to develop on structured data. Elasticsearch, which stores and searches documents, along with its ability to scale over huge volume of data, is a good candidate for inclusion in an application log analytics solution.
April 22, 2015
by Bipin Patwardhan
· 11,684 Views · 2 Likes
article thumbnail
Meet Molly, the virtual nurse
Telemedicine is a topic that I’ve touched on numerous times on this blog over the past year or so, with a number of platforms emerging to offer patients the opportunity to consult with a medical professional from anywhere in the world. Most of these platforms are clinical based, such as the Babylon service, and therefore provide you with access to doctors when you have something wrong with you. There are however, also a number that are taking a more preventative approach and seek to keep you healthier in the first place. For instance, last autumn I wrote about the Vida platform that is providing coaching and support to ensure you stay out of the emergency room altogether. Of course, all of these services require a human healthcare professional at the other end of your video call to answer your queries for you. Sense.ly are taking another approach by offering an AI based nurse, called Molly, who aims to provide help and support in those periods between appointments with real life professionals. The service is aimed specifically at patients with common medical conditions such as diabetes or heart failure. The patient signs up to the site either direct or via their GP, and the platform then draws up a personalized care plan for them based upon both their medical records and the individual needs of the patient. The patient then follows this prescribed plan (hopefully), with regular check-ins with the virtual nurse via their smartphone or computer to help their progress. Doctors can also access information via the site to see how their patient is getting on, whilst the system will alert them automatically if worrying symptoms begin to emerge. I’ve written previously about the important role the appearance of avatars has on how we engage with them, so Molly has been designed with a friendly face and a softly spoken voice. She interacts with patients using voice recognition technology and can ask relatively simple questions of the patient whilst guiding them through exercise plans and collecting medical data from them. This data can then be analyzed by a doctor (although AI analysis seems inevitable), whilst the patient can also use Molly as a sort of health PA and book appointments with their doctor through her. With voice recognition technology growing at an incredible pace and the AI grunt of services such as Watson increasingly capable of making complex decisions, this seems the beginning of an inevitable trend towards more automated healthcare. Check out the video below that explains more about Molly and the service she offers. Original post
April 15, 2015
by Adi Gaskell
· 2,081 Views
article thumbnail
A cluster management framework, Apache Helix
What is Helix? It is used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Helix automates reassignment of resources in the face of node failure and recovery, cluster expansion, and reconfiguration. Modeling a distributed system as a state machine with constraints on states and transitions. Terminologies Node : A single machine Cluster: Set of Nodes Resource : A logical entry (e.g. database, index, task) Partition: Subset of the resource (Each subtask is referred to as a partition) Replica: Copy of a Partition State (e.g Master, Slave). It increase the availability of the system State: Describes the role of a replica (Each node in the cluster has its own Current State) State Machine and Transitions: An action that allows a replica to move from one state to another, thus changing its role. ( e.g Slave --> Master ) spectators: the external clients. Helix provides an External View that is an aggregated view of the current state across all nodes. Current State: represents resource's actual state at a participating node. - INSTANCE_NAME: Unique name representing the process - SESSION_ID: ID that is automatically assigned every time a process joins the cluster Rebalancer: The core component of Helix is the Controller which runs the Rebalance algorithm on every cluster event. Dynamic Ideal State: Helix powerful is that Ideal State can be changed dynamically. It is adjusting the ideal state. Whenever a cluster event occurs, Helix can operate in one of three modes FULL_AUTO SEMI_AUTO CUSTOMIZED Cluster events can be one of the following: Nodes start and/or stop Nodes experience soft and/or hard failures New nodes are added/removed [1] http://helix.apache.org/Concepts.html
April 13, 2015
by Madhuka Udantha
· 7,843 Views
article thumbnail
Introduction to Apache Cassandra's Architecture
Some key concepts for Apache's popular Cassandra Architecture include partitioning, replication, consistency, bootstrapping, and write paths.
April 6, 2015
by Akhil Mehra
· 118,039 Views · 38 Likes
article thumbnail
How to Configure a Simple JBoss Cluster in Domain Mode
Clustering is a very important thing to master for any serious user of an application server. Clustering allows for high availability by making your application available on secondary servers when the primary instance is down or it lets you scale up or out by increasing the server density on the host, or by adding servers on other hosts. It can even help to increase performance with effective load balancing between servers based on their respective hardware. Andy Overton has already covered how to set up a cluster of servers in standalone mode fronted by mod_cluster for load balancing, so in this post I'll cover clustering in domain mode. I won't rehash mod_cluster settings, so this will just cover the set up of a doman controller on one host, and the host controller and server instances on another host. To follow along with this blog, you'll need to download either JBoss EAP 6.x or WildFly. I'll be using WildFly 8.2 on Xubuntu 14.04. I'll be using $WF_HOME to refer to your WildFly home directory. Configuring the Domain Controller The domain controller needs both the domain.xml and host.xml configured. In the $WF_HOME/domain/configuration directory, you'll see that those two files are joined by a host-master.xml and a host-slave.xml. These are preconfigured host.xml files which you can use to give you a head start in making a host.xml for the domain controller (master) and host controller (slave) to use. You can either change the name of the file to be host.xml, so it will get picked up and used by default, or you can specify the host configuration you want to use on the command line by adding the --host-config argument: domain.sh --host-config=host-master.xml Whether you choose to modify the host.xml or the host-master.xml, you need to make sure that the empty element has been added to the section. This is so that when WildFly looks to see which server is the domain controller, it knows to become the domain controller itself. The other change is optional, but recommended. We need to tell the domain controller to bind its management interface to the correct IP address because, by default, it will bind to localhost, so the management communication it needs to do with the remote hosts won't be able to reach the domain controller at all! We can set this address permanently in the host.xml by making sure the inet-address value is set to the right IP, by changing the 127.0.0.1 in the example below to the correct IP: The result of that is that the default bind IP of the management interface is no longer localhost, although you can still override this value by starting JBoss with the variable left of the colon as a -D argument: domain.sh -Djboss.bind.address.management=10.0.0.1 Next, we need to modify the domain.xml file, where we need to define our server groups; essentially just defining the cluster. Each server group is named, so we can reference it later, and references a particular profile which needs to be one of the profiles named and defined in the same XML file. As I mentioned in my previous blog, domain mode has several profiles in the same file (domain.xml) rather than multiple files for each, like standalone mode (standalone.xml, standalone-ha.xml etc.). In the screenshot, there are two server groups defined - "main-server-group" which references the "full" profile, and "other-server-group" which references the "full-ha" profile. These are just the defaults which come with WildFly, so you're free to use them and modify the settings or create your own from scratch. Whichever you choose, it's a good idea to rename your server group to something meaningful, like a description of the workload, or the application name. Configuring the Host Controllers Every host server which you want to be part of the cluster must have the host.xml file configured. We've already configured the host.xml on the domain controller, so now we'll focus on the host controller. Remember, this process can be repeated on any number of hosts, depending on how many servers you want in your server group and their topology. First, we need to make sure that the domain controller and the host controller can communicate, and to do that we need a valid management user. On the domain controller, run the add-user.sh or add-user.bat script. You will need to make sure to: Choose a management user Make sure the user is different than the one you would use to log in to the web console Confirm that the new user will connect one AS process to another AS process Make a note of the secret value (this is very important!) You will find that you get prompts similar to the following: mike@mike-C2B2:~$ /opt/wildfly/wildfly-8.2.0.Final/bin/add-user.sh What type of user do you wish to add? a) Management User (mgmt-users.properties) b) Application User (application-users.properties) (a): a Enter the details of the new user to add. Using realm 'ManagementRealm' as discovered from the existing property files. Username : mgmt Password recommendations are listed below. To modify these restrictions edit the add-user.properties configuration file. - The password should not be one of the following restricted values {root, admin, administrator} - The password should contain at least 8 characters, 1 alphabetic character(s), 1 digit(s), 1 non-alphanumeric symbol(s) - The password should be different from the username Password : Re-enter Password : What groups do you want this user to belong to? (Please enter a comma separated list, or leave blank for none)[ ]: About to add user 'mgmt' for realm 'ManagementRealm' Is this correct yes/no? yes Added user 'mgmt' to file '/opt/wildfly/wildfly-8.2.0.Final/standalone/configuration/mgmt-users.properties' Added user 'mgmt' to file '/opt/wildfly/wildfly-8.2.0.Final/domain/configuration/mgmt-users.properties' Added user 'mgmt' with groups to file '/opt/wildfly/wildfly-8.2.0.Final/standalone/configuration/mgmt-groups.properties' Added user 'mgmt' with groups to file '/opt/wildfly/wildfly-8.2.0.Final/domain/configuration/mgmt-groups.properties' Is this new user going to be used for one AS process to connect to another AS process? e.g. for a slave host controller connecting to the master or for a Remoting connection for server to server EJB calls. yes/no? yes To represent the user add the following to the server-identities definition Once we have the secret value for our management user, we can add it to the host.xml file. I'm choosing to modify the host-slave.xml file, since much of the configuration I need is done for me: Next, we need to tell the host controller where to look for the domain controller. We set this to for the domain controller's host.xml file, but in the host-slave.xml we have an example tag filled out for us. All we need to do is add the domain controller's IP or hostname exactly as we did for the management bind address earlier. So our host-slave.xml should go from this: to this: This way, like with the management interface on the domain controller, the default address will be 10.0.0.1, but it can also be overridden on the command line if needed. Once we've sorted the communication out, we need to tell the host controller to actually start some server instances! At the bottom of the host-slave.xml file, there are two predefined servers to use: These are already configured to become members of the two server groups configured in the domain.xml. Note that the second server has to have a port offset. Despite it being in a different server group, it's still going to run on the same host and will attempt to bind to the same ports as the first server unless we tell it not to! We would also need to do the same thing if we added other server instances. Optionally, we can make things a little easier for ourselves when managing a lot of servers on a lot of hosts. We can give each server instance its own unique name, but we can also name the host by adding a name attribute to the parent tag, changing it from: to So both in the logs and in the admin console, you should see this host controller referred to as "host1". Now, if you wanted to name your server instances the same across hosts, you'll be able to tell which is which! If all you wanted was to configure a single domain controller and a single host controller, then that's all we need to do to get them speaking to each other. You can then carry on and configure mod_cluster and Apache to forward requests on to the right server, or just deploy your applications and connect to them directly.
April 3, 2015
by Mike Croft
· 23,542 Views
article thumbnail
Streaming Big Data: Storm, Spark and Samza
There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences. Apache Storm In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline. Apache Spark Spark Streaming (an extension of the core Spark API) doesn’t process streams one at a time like Storm. Instead, it slices them in small batches of time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a micro-batch of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations). Apache Samza Samza ’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a Dstream, but a message. Streams are divided into partitions and each partition is an ordered sequence of read-only messages with each message having a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza`s Execution & Streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka. Common Ground All three real-time computation systems are open-source, low-latency, distributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations. The three frameworks use different vocabularies for similar concepts: Comparison Matrix A few of the differences are summarized in the table below: There are three general categories of delivery patterns: At-most-once: messages may be lost. This is usually the least desirable outcome. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases. Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident. Use Cases All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines. If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching. A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel... Speaking of micro-batching, if you must have stateful computations, exactly-once delivery and don’t mind a higher latency, you could consider Spark Streaming…specially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQL, MLlib, GraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real-time. A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu… If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza ‘s fine-grained jobs would be particularly well-suited, since they can be added/removed with minimal ripple effects. A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale… Conclusion We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.
February 28, 2015
by Tony Siciliani
· 32,700 Views · 5 Likes
article thumbnail
Algorithms Explained: Minesweeper
This blog post explains the essential algorithms for the well-known Windows game "Minesweeper." Game Rules The board is a two-dimensional space, which has a predetermined number of mines. Cells have two states, opened and closed. If you left-click on a closed cell: Cell is empty and opened. If neighbor cell(s) have mine(s), this opened cell shows neighbor mine count. If neighbor cells have no mines, all neighbor cells are opened automatically. Cell has a mine, game ends with FAIL. If you right-click on a closed cell, you put a flag which shows that "I know this cell has a mine". If you multi-click (both right and left click) on a cell which is opened and has at least one mine on its neighbors: If neighbor cells' total flag count equals to this multi-clicked cell's count and predicted mine locations are true, all closed and unflagged neighbor cells are opened automatically. If neighbor cells' total flag count equals to this multi-clicked cell's count and at least one predicted mine location is wrong, game ends with FAIL. If all cells (without mines) are opened using left clicks and/or multi-clicks, game ends with SUCCESS. Data Structures We may think of each cell as a UI structure (e.g. button), which has the following attributes: colCoord = 0 to colCount rowCoord = 0 to rowCount isOpened = true / false (default false) hasFlag = true / false (default false) hasMine = true / false (default false) neighborMineCount 0 to 8 (default 0, total count of mines on neighbor cells) So, we have a two-dimensional "Button[][] cells" data structure to handle game actions. Algorithms Before Start: Assign mines to cells randomly (set hasMine=true) . Calculate neighborMineCount values for each cell, which have hasMine=false. (This step may be done for each clicked cell while game continues but it may be inefficient.) Note 1: Neighbor cells should be accessed with the coordinates: {(colCoord-1, rowCoord-1),(colCoord-1, rowCoord),(colCoord-1, rowCoord+1),(colCoord, rowCoord-1),(colCoord, rowCoord+1),(colCoord+1, rowCoord-1),(colCoord+1, rowCoord),(colCoord+1, rowCoord+1)} And don't forget that neighbor cell counts may be 3 (for corner cells), 5 (for edge cells) or 8 (for middle cells). Note 2: It's recommended to handle mouse clicks with "mouse release" actions instead of "mouse pressed/click" actions, otherwise a left or right click may be understood as a multi-click or vice versa. Right Click on a Cell: If cell isOpened=true, do nothing. If cell isOpened=false, set cell hasFlag=true and show a flag on the cell. Left Click on a Cell: If cell isOpened=true, do nothing. If cell isOpened=false: If cell hasMine=true, game over. If cell hasMine=false: If cell has neighborMineCount > 0, set isOpened=true, show neighborMineCount on the cell. If all cells which hasMine=false are opened, end game with SUCCESS. If cell has neighborMineCount == 0, set isOpened=true, call Left Click on a Cell for all neighbor cells, which hasFlag=false and isOpened=false. Note: The last step may be implemented with a recursive call or by using a stacked data structure. Multi Click (both Left and Right Click) on a Cell: If cell isOpened=false, do nothing. If cell isOpened=true: If cell neighborMineCount == 0, no nothing. If cell neighborMineCount > 0: If cell neighborMineCount != "neighbor hasFlag=true cell count", do nothing. If cell neighborMineCount == "neighbor hasFlag=true cell count": If all neighbor hasFlag=true cells are not hasMine=true, game over. If all neighbor hasFlag=true cells are hasMine=true (every flag is put on correct cell), call Left Click on a Cell for all neighbor cells, which hasFlag=false and isOpened=false. Note: The last step may be implemented with a recursive call or by using a stacked data structure.
February 9, 2015
by Cagdas Basaraner
· 33,338 Views
article thumbnail
Do it in Java 8: The State Monad
In a previous article (Do it in Java 8: Automatic memoization), I wrote about memoization and said that memoization is about handling state between function calls, although the value returned by the function does not change from one call to another as long as the argument is the same. I showed how this could be done automatically. There are however other cases when handling state between function calls is necessary but cannot be done this simple way. One is handling state between recursive calls. In such a situation, the function is called only once from the user point of view, but it is in fact called several times since it calls itself recursively. Most of these functions will not benefit internally from memoization. For example, the factorial function may be implemented recursively, and for one call to f(n), there will be n calls to the function, but not twice with the same value. So this function would not benefit from internal memoization. On the contrary, the Fibonacci function, if implemented according to its recursive definition, will call itself recursively a huge number of times will the same arguments. Here is the standard definition of the Fibonacci function: f(n) = f(n – 1) + f(n – 2) This definition has a major problem: calculating f(n) implies evaluating n² times the function with different values. The result is that, in practice, it is impossible to use this definition for n > 50 because the time of execution increases exponentially. But since we are calculating more than n values, it is obvious that some values are calculated several times. So this definition would be a good candidate for internal memoization. Warning: Java is not a true recursive language, so using any recursive method will eventually blow the stack, unless we use TCO (Tail Call Optimization) as described in my previous article (Do it in Java 8: recursive and corecursive Fibonacci). However, TCO can't be applied to the original Fibonacci definition because it is not tail recursive. So we will write an example working for n limited to a few thousands. The naive implementation Here is how we might implement the original definition: static BigInteger fib(BigInteger n) { return n.equals(BigInteger.ZERO) || n.equals(BigInteger.ONE) ? n : fib(n.subtract(BigInteger.ONE)).add(fib(n.subtract(BigInteger.ONE.add(BigInteger.ONE)))); } Do not try this implementation for values greater that 50. On my machine, fib(50) takes 44 minutes to return! Basic memoized version To avoid computing several times the same values, we can use a map were computed values are stored. This way, for each value, we first look into the map to see if it has already been computed. If it is present, we just retrieve it. Otherwise, we compute it, store it into the map and return it. For this, we will use a special class called Memo. This is a normal HashMap that mimic the interface of functional map, which means that insertion of a new (key, value) pair returns a map, and looking up a key returns an Optional: public class Memo extends HashMap { public Optional retrieve(BigInteger key) { return Optional.ofNullable(super.get(key)); } public Memo addEntry(BigInteger key, BigInteger value) { super.put(key, value); return this; } } Note that this is not a true functional (immutable and persistent) map, but this is not a problem since it will not be shared. We will also need a Tuple class that we define as: public class Tuple { public final A _1; public final B _2; public Tuple(A a, B b) { _1 = a; _2 = b; } } With this class, we can write the following basic implementation: static BigInteger fibMemo1(BigInteger n) { return fibMemo(n, new Memo().addEntry(BigInteger.ZERO, BigInteger.ZERO) .addEntry(BigInteger.ONE, BigInteger.ONE))._1; } static Tuple fibMemo(BigInteger n, Memo memo) { return memo.retrieve(n).map(x -> new Tuple<>(x, memo)).orElseGet(() -> { BigInteger x = fibMemo(n.subtract(BigInteger.ONE), memo)._1 .add(fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE), memo)._1); return new Tuple<>(x, memo.addEntry(n, x)); }); } This implementation works fine, provided values of n are not too big. For f(50), it returns in less that 1 millisecond, which should be compared to the 44 minutes of the naive version. Remark that we do not have to test for terminal values (f(0) and f(1). These values are simply inserted into the map at start. So, is there something we should make better? The main problem is that we have to handle the passing of the Memo map by hand. The signature of the fibMemo method is no longer fibMemo(BigInteger n), but fibMemo(BigInteger n, Memo memo). Could we simplify this? We might think about using automatic memoization as described in my previous article (Do it in Java 8: Automatic memoization ). However, this will not work: static Function fib = new Function() { @Override public BigInteger apply(BigInteger n) { return n.equals(BigInteger.ZERO) || n.equals(BigInteger.ONE) ? n : this.apply(n.subtract(BigInteger.ONE)).add(this.apply(n.subtract(BigInteger.ONE.add(BigInteger.ONE)))); } }; static Function fibm = Memoizer.memoize(fib); Beside the fact that we could not use lambdas because reference to this is not allowed, which may be worked around by using the original anonymous class syntax, the recursive call is made to the non memoized function, so it does not make things better. Using the State monad In this example, each computed value is strictly evaluated by the fibMemo method, and this is what makes the memo parameter necessary. Instead of a method returning a value, what we would need is a method returning a function that could be evaluated latter. This function would take a Memo as parameter, and this Memo instance would be necessary only at evaluation time. This is what the State monad will do. Java 8 does not provide the state monad, so we have to create it, but it is very simple. However, we first need an implementation of a list that is more functional that what Java offers. In a real case, we would use a true immutable and persistent List. The one I have written is about 1 000 lines, so I can't show it here. Instead, we will use a dummy functional list, backed by a java.util.ArrayList. Although this is less elegant, it does the same job: public class List { private java.util.List list = new ArrayList<>(); public static List empty() { return new List(); } @SafeVarargs public static List apply(T... ta) { List result = new List<>(); for (T t : ta) result.list.add(t); return result; } public List cons(T t) { List result = new List<>(); result.list.add(t); result.list.addAll(list); return result; } public U foldRight(U seed, Function> f) { U result = seed; for (int i = list.size() - 1; i >= 0; i--) { result = f.apply(list.get(i)).apply(result); } return result; } public List map(Function f) { List result = new List<>(); for (T t : list) { result.list.add(f.apply(t)); } return result; } public List filter(Function f) { List result = new List<>(); for (T t : list) { if (f.apply(t)) { result.list.add(t); } } return result; } public Optional findFirst() { return list.size() == 0 ? Optional.empty() : Optional.of(list.get(0)); } @Override public String toString() { StringBuilder s = new StringBuilder("["); for (T t : list) { s.append(t).append(", "); } return s.append("NIL]").toString(); } } The implementation is not functional, but the interface is! And although there are lots of missing capabilities, we have all we need. Now, we can write the state monad. It is often called simply State but I prefer to call it StateMonad in order to avoid confusion between the state and the monad: public class StateMonad { public final Function> runState; public StateMonad(Function> runState) { this.runState = runState; } public static StateMonad unit(A a) { return new StateMonad<>(s -> new StateTuple<>(a, s)); } public static StateMonad get() { return new StateMonad<>(s -> new StateTuple<>(s, s)); } public static StateMonad getState(Function f) { return new StateMonad<>(s -> new StateTuple<>(f.apply(s), s)); } public static StateMonad transition(Function f) { return new StateMonad<>(s -> new StateTuple<>(Nothing.instance, f.apply(s))); } public static StateMonad transition(Function f, A value) { return new StateMonad<>(s -> new StateTuple<>(value, f.apply(s))); } public static StateMonad> compose(List> fs) { return fs.foldRight(StateMonad.unit(List.empty()), f -> acc -> f.map2(acc, a -> b -> b.cons(a))); } public StateMonad flatMap(Function> f) { return new StateMonad<>(s -> { StateTuple temp = runState.apply(s); return f.apply(temp.value).runState.apply(temp.state); }); } public StateMonad map(Function f) { return flatMap(a -> StateMonad.unit(f.apply(a))); } public StateMonad map2(StateMonad sb, Function> f) { return flatMap(a -> sb.map(b -> f.apply(a).apply(b))); } public A eval(S s) { return runState.apply(s).value; } } This class is parameterized by two types: the value type A and the state type S. In our case, A will be BigInteger and S will be Memo. This class holds a function from a state to a tuple (value, state). This function is hold in the runState field. This is similar to the value hold in the Optional monad. To make it a monad, this class needs a unit method and a flatMap method. The unit method takes a value as parameter and returns a StateMonad. It could be implemented as a constructor. Here, it is a factory method. The flatMap method takes a function from A (a value) to StateMonad and return a new StateMonad. (In our case, A is the same as B.) The new type contains the new value and the new state that result from the application of the function. All other methods are convenience methods: map allows to bind a function from A to B instead of a function from A to StateMonad. It is implemented in terms of flatMap and unit. eval allows easy retrieval of the value hold by the StateMonad. getState allows creating a StateMonad from a function S -> A. transition takes a function from state to state and a value and returns a new StateMonad holding the value and the state resulting from the application of the function. In other words, it allows changing the state without changing the value. There is also another transition method taking only a function and returning a StateMonad. Nothing is a special class: public final class Nothing { public static final Nothing instance = new Nothing(); private Nothing() {} } This class could be replaced by Void, to mean that we do not care about the type. However, Void is not supposed to be instantiated, and the only reference of type Void is normally null. The problem is that null does not carry its type. We could instantiate a Void instance through introspection: Constructor constructor; constructor = Void.class.getDeclaredConstructor(); constructor.setAccessible(true); Void nothing = constructor.newInstance(); but this is really ugly, so we create a Nothing type with a single instance of it. This does the trick, although to be complete, Nothing should be able to replace any type (like null), which does not seem to be possible in Java. Using the StateMonad class, we can rewrite our program: static BigInteger fibMemo2(BigInteger n) { return fibMemo(n).eval(new Memo().addEntry(BigInteger.ZERO, BigInteger.ZERO).addEntry(BigInteger.ONE, BigInteger.ONE)); } static StateMonad fibMemo(BigInteger n) { return StateMonad.getState((Memo m) -> m.retrieve(n)) .flatMap(u -> u.map(StateMonad:: unit).orElse(fibMemo(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))))); } Now, the fibMemo method only takes a BigInteger as its parameter and returns a StateMonad, which means that when this method returns, nothing has been evaluated yet. The Memo doesn't even exist! To get the result, we may call the eval method, passing it the Memo instance. If you find this code difficult to understand, here is an exploded commented version using longer identifiers: static StateMonad fibMemo(BigInteger n) { /* * Create a function of type Memo -> Optional with a closure * over the n parameter. */ Function> retrieveValueFromMapIfPresent = (Memo memoizationMap) -> memoizationMap.retrieve(n); /* * Create a state from this function. */ StateMonad> initialState = StateMonad.getState(retrieveValueFromMapIfPresent); /* * Create a function for converting the value (BigInteger) into a State * Monad instance. This function will be bound to the Optional resulting * from the lookup into the map to give the result if the value was found. */ Function> createStateFromValue = StateMonad:: unit; /* * The value computation proper. This can't be easily decomposed because it * make heavy use of closures. It first calls recursively fibMemo(n - 1), * producing a StateMonad. It then flatMaps it to a new * recursive call to fibMemo(n - 2) (actually fibMemo(n - 1 - 1)) and get a * new StateMonad which is mapped to BigInteger addition * with the preceding value (x). Then it flatMaps it again with the function * y -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z) which adds * the two values and returns a new StateMonad with the computed value added * to the map. */ StateMonad computedValue = fibMemo(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))); /* * Create a function taking an Optional as its parameter and * returning a state. This is the main function that returns the value in * the Optional if it is present and compute it and put it into the map * before returning it otherwise. */ Function, StateMonad> computeFiboValueIfAbsentFromMap = u -> u.map(createStateFromValue).orElse(computedValue); /* * Bind the computeFiboValueIfAbsentFromMap function to the initial State * and return the result. */ return initialState.flatMap(computeFiboValueIfAbsentFromMap); } The most important part is the following: StateMonad computedValue = fibMemo_(n.subtract(BigInteger.ONE)) .flatMap(x -> fibMemo_(n.subtract(BigInteger.ONE).subtract(BigInteger.ONE)) .map(x::add) .flatMap(z -> StateMonad.transition((Memo m) -> m.addEntry(n, z), z))); This kind of code is essential to functional programming, although it is sometimes replaced in other languages with “for comprehensions”. As Java 8 does not have for comprehensions we have to use this form. At this point, we have seen that using the state monad allows abstracting the handling of state. This technique can be used every time you have to handle state. More uses of the state monad The state monad may be used for many other cases were state must be maintained in a functional way. Most programs based upon maintaining state use a concept known as a State Machine. A state machine is defined by an initial state and a series of inputs. Each input submitted to the state machine will produce a new state by applying one of several possible transitions based upon a list of conditions concerning both the input and the actual state. If we take the example of a bank account, the initial state would be the initial balance of the account. Possible transition would be deposit(amount) and withdraw(amount). The conditions would be true for deposit and balance >= amount for withdraw. Given the state monad that we have implemented above, we could write a parameterized state machine: public class StateMachine { Function> function; public StateMachine(List, Transition>> transitions) { function = i -> StateMonad.transition(m -> Optional.of(new StateTuple<>(i, m)).flatMap((StateTuple t) -> transitions.filter((Tuple, Transition> x) -> x._1.test(t)).findFirst().map((Tuple, Transition> y) -> y._2.apply(t))).get()); } public StateMonad process(List inputs) { List> a = inputs.map(function); StateMonad> b = StateMonad.compose(a); return b.flatMap(x -> StateMonad.get()); } } This machine uses a bunch of helper classes. First, the inputs are represented by an interface: public interface Input { boolean isDeposit(); boolean isWithdraw(); int getAmount(); } There are two instances of inputs: public class Deposit implements Input { private final int amount; public Deposit(int amount) { super(); this.amount = amount; } @Override public boolean isDeposit() { return true; } @Override public boolean isWithdraw() { return false; } @Override public int getAmount() { return this.amount; } } public class Withdraw implements Input { private final int amount; public Withdraw(int amount) { super(); this.amount = amount; } @Override public boolean isDeposit() { return false; } @Override public boolean isWithdraw() { return true; } @Override public int getAmount() { return this.amount; } } Then come two functional interfaces for conditions and transitions: public interface Condition extends Predicate> {} public interface Transition extends Function, S> {} These act as type aliases in order to simplify the code. We could have used the predicate and the function directly. In the same manner, we use a StateTuple class instead of a normal tuple: public class StateTuple { public final A value; public final S state; public StateTuple(A a, S s) { value = Objects.requireNonNull(a); state = Objects.requireNonNull(s); } } This is exactly the same as an ordinary tuple with named members instead of numbered ones. Numbered members allows using the same class everywhere, but a specific class like this one make the code easier to read as we will see. The last utility class is Outcome, which represent the result returned by the state machine: public class Outcome { public final Integer account; public final List> operations; public Outcome(Integer account, List> operations) { super(); this.account = account; this.operations = operations; } public String toString() { return "(" + account.toString() + "," + operations.toString() + ")"; } } This again could be replaced with a Tuple>>, but using named parameters make the code easier to read. (In some functional languages, we could use type aliases for this.) Here, we use an Either class, which is another kind of monad that Java does not offer. I will not show the complete class, but only the parts that are useful for this example: public interface Either { boolean isLeft(); boolean isRight(); A getLeft(); B getRight(); static Either right(B value) { return new Right<>(value); } static Either left(A value) { return new Left<>(value); } public class Left implements Either { private final A left; private Left(A left) { super(); this.left = left; } @Override public boolean isLeft() { return true; } @Override public boolean isRight() { return false; } @Override public A getLeft() { return this.left; } @Override public B getRight() { throw new IllegalStateException("getRight() called on Left value"); } @Override public String toString() { return left.toString(); } } public class Right implements Either { private final B right; private Right(B right) { super(); this.right = right; } @Override public boolean isLeft() { return false; } @Override public boolean isRight() { return true; } @Override public A getLeft() { throw new IllegalStateException("getLeft() called on Right value"); } @Override public B getRight() { return this.right; } @Override public String toString() { return right.toString(); } } } This implementation is missing a flatMap method, but we will not need it. The Either class is somewhat like the Optional Java class in that it may be used to represent the result of an evaluation that may return a value or something else like an exception, an error message or whatever. What is important is that it can hold one of two things of different types. We now have all we need to use our state machine: public class Account { public static StateMachine createMachine() { Condition predicate1 = t -> t.value.isDeposit(); Transition transition1 = t -> new Outcome(t.state.account + t.value.getAmount(), t.state.operations.cons(Either.right(t.value.getAmount()))); Condition predicate2 = t -> t.value.isWithdraw() && t.state.account >= t.value.getAmount(); Transition transition2 = t -> new Outcome(t.state.account - t.value.getAmount(), t.state.operations.cons(Either.right(- t.value.getAmount()))); Condition predicate3 = t -> true; Transition transition3 = t -> new Outcome(t.state.account, t.state.operations.cons(Either.left(new IllegalStateException(String.format("Can't withdraw %s because balance is only %s", t.value.getAmount(), t.state.account))))); List, Transition>> transitions = List.apply( new Tuple<>(predicate1, transition1), new Tuple<>(predicate2, transition2), new Tuple<>(predicate3, transition3)); return new StateMachine<>(transitions); } } This could not be simpler. We just define each possible condition and the corresponding transition, and then build a list of tuples (Condition, Transition) that is used to instantiate the state machine. There are however to rules that must be enforced: Conditions must be put in the right order, with the more specific first and the more general last. We must be careful to be sure to match all possible cases. Otherwise, we will get an exception. At this stage, nothing has been evaluated. We did not even use the initial state! To run the state machine, we must create a list of inputs and feed it in the machine, for example: List inputs = List.apply( new Deposit(100), new Withdraw(50), new Withdraw(150), new Deposit(200), new Withdraw(150)); StateMonad = Account.createMachine().process(inputs); Again, nothing has been evaluated yet. To get the result, we just evaluate the result, using an initial state: Outcome outcome = state.eval(new Outcome(0, List.empty())) If we run the program with the list above, and call toString() on the resulting outcome (we can't do more useful things since the Either class is so minimal!) we get the following result: // // (100,[-150, 200, java.lang.IllegalStateException: Can't withdraw 150 because balance is only 50, -50, 100, NIL]) This is a tuple of the resulting balance for the account (100) and the list of operations that have been carried on. We can see that successful operations are represented by a signed integer, and failed operations are represented by an error message. This of course is a very minimal example, and as usual, one may think it would be much easier to do it the imperative way. However, think of a more complex example, like a text parser. All there is to do to adapt the state machine is to define the state representation (the Outcome class), define the possible inputs and create the list of (Condition,Transition). Going the functional way does not make the whole thing simpler. However, it allows abstracting the implementation of the state machine from the requirements. The only thing we have to do to create a new state machine is to write the new requirements!
October 9, 2014
by Pierre-Yves Saumont
· 25,910 Views · 7 Likes
article thumbnail
JBoss Data Grid: Installation and Development
In this blog, we will discuss one particular data grid platform from Redhat namely JBoss Data Grid (JDG). We will firstly cover how to access and install this data grid platform and then we will demonstrate how to develop and deploy a simple remote client/server data grid application which utilises the HotRod protocol. We will be using the latest release JDG 6.2 from Redhat in this article. Installation Overview To start using JDG, firstly log on to the redhat site https://access.redhat.com/home and download the software from the Downloads section of the site. We wish to download JDG 6.2 server by clicking on the appropriate links in the Downloads section. For future reference, it is also useful to download the quickstart and maven repository zip files. To install JDG, we simply unzip the JDG server package into an appropriate directory in your environment. JDG Overview In this section, we will provide a brief overview of the contents of the JDG installation package and the most notable configuration options available to users. Out of the box, users are provided with two runtime options either to run JDG in standalone or clustered mode. We can start JDG in either mode by invoking the stanadalone or clustered start up scripts in the / bin directory. To configure the JDG in either mode we need to configure the files standalone.xml and clustered.xml. In our case we will creating a distributed cache which will run on 3 node JDG cluster so we will be utilizing the clustered startup script. In order to set up and add new cache instances to JDG, we modify the infinispan subsystems in the appropriate xml configuration file above. We should also note the principal difference between the standalone and clustered configuration file is that in the clustered configuration file there is a JGroups subsystem configured element which allows for communication and messaging between configured cache instances running in a JDG cluster. Development Environment Setup and Configuration In this section, we will detail how to develop and configure a simple datagrid application which will be deployed to a 3 node JDG cluster. We will demonstrate how to configure and deploy a distributed cache in JDG and also show how to develop a HotRod Java client application which will be used to insert, update and display entries in the distributed cache. We will firstly discuss setting a new distributed cache on a 3 node JDG cluster. In this example, we will run our JDG cluster on a single machine by running each JDG instance on different ports. Firstly, we will create 3 instances of JDG by creating 3 directories (server1, server2, server3) on our host machine and unzipping each JDG installation into each directory. We will now configure each node in our cluster by copying and renaming the clustered.xml configuration file in the \server1\jboss-datagrid-6.2.0-server\standalone\configuration directory. We will name each of the cluster configuration files as "clustered1.xml", "clustered2.xml" and "clustered3.xml" for the JDG instances denoted by "server1", "server2" and "server3" respectively. We will now set up a new distributed cache on our JDG cluster by modifying the infinispan subsystem element in each clustered.xml file. We will demonstrate this for the node denoted "server1" here by modifying the file "clustered1.xml". The cache configuration shown here will be the same across all 3 nodes. To setup a new distributed cache named "directory-dist-cache", we configure the following elements in the file named "clustered1.xml" ......... ...... .............. ...... ...... /socket-binding-group> We will discuss the key elements and attributes relating to the configuration above. In the infinispan endpoint subsystem, we will configure hotrod clients to connect to the JDG server instance on socket 11222. The name of the cache container to host each of the cache instances will be held in the container named "clusteredcache". We have configured the infinispan core subsystem to the default cache container named "clusteredcacahe" whereby we will allow for jmx statistics to be collected relating the configured cache entries i.e statistics="true" We have created a new distributed cache named "directory-dist-cache" whereby there will be two copies of each cache entry held on two of the 3 cluster nodes. We have also set up an eviction policy whereby should there be more than 20 entries in our cache then cache entries will be removed using the LRU algorithm We should have configured nodes "server2" and "server3" to start up with a port offset of 100 and 200 respectively by configuring the socketing binding group element appropriately. Please view the socket bindings noted below. To set the socket binding element with a port offset of 100 on "server2", we configure "clustered2.xml" with the following entry: ...... ...... /socket-binding-group> To set the socket binding element with a port offset of 200 on "server3", we configure "clustered3.xml" with the following entry: ...... ...... /socket-binding-group> Before discussing the setup and configuration of our Hotrod client which will be used to interact with our JDG clustered HotRod server, we will start up each server instance to ensure our newly configured JDG distributed cache starts up correctly. Open up 3 Windows or Linux consoles and execute the following start up commands: Console 1: 1) Navigate to \server1\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the first instance of our JDG cluster denoted "server1": clustered -c=clustered1.xml -Djboss.node.name=server1 Console 2: 1) Navigate to \server2\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the second instance of our JDG cluster denoted "server2": clustered -c=clustered2.xml -Djboss.node.name=server2 Console 3: 1) Navigate to \server3\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the third instance of our JDG cluster denoted "server3": clustered -c=clustered3.xml -Djboss.node.name=server3 Providing all 3 JDG instances have started up correctly, you should see output in the console window whereby we can see there are 3 JDG instances in the JGroups view: HotRod Client Development Setup Now that the Hotrod server is up and running, we need to develop a Hotrod Java client which will interact with the clustered server application. The development environment consists of the following tools. 1) JDK Hotspot 1.7.0_45 2) IDE - Eclipse Kepler Build id: 20130919-0819 The HotRod client application is a simple application consisting of two Java classes. The application allows users to retrieve a reference to the distributed cache from the JDG server and then perform these actions: a) add new cinema objects. b) add and remove shows to each cinema object. c) print the list of all cinemas and shows stored in our distributed cache. The source code can be downloaded from github @ https://github.com/davewinters/JDG. We could use maven here to build and execute our application by configuring the maven settings.xml to point to the maven repository files we downloaded earlier and set up a maven project file (pom.xml) to build and execute the client application. In this article we will build our application using the Eclipse IDE and run the client application on the command line. To create a HotRod client application and execute the sample application, one should complete the following steps: 1) Create a new Java Project in Eclipse 2) Create a new package named uk.co.c2b2.jdg.hotrod and import the source code that has been downloaded from Github mentioned previously. 3) Now we need to configure the build path in Eclipse to contain the appropriate JDG client jar files which are required to compile the application. You should include all the client jar files in the project build path. These jar files are contained in the JDG installation zip file. For example on my machine these jar files are located in the directory: \server1\jboss-datagrid-6.2.0-server\client\hotrod\java 4. Providing the Eclipse build path has been configured appropriately, the application source should compile without issue. 5. We will need to execute the Hotrod application by opening the console window and executing the following command. Note the path specified here will differ depending on where the JDG client jar files and application class files are located in your environment: java -classpath ".;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\commons-pool-1.6-redhat-4.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-client-hotrod-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-commons-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-query-dsl-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-remote-query-client-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-logging-3.1.2.GA-redhat-1.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-river-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protobuf-java-2.5.0.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protostream-1.0.0.CR1-redhat-1.jar" uk/co/c2b2/jdg/hotrod/CinemaDirectory 6. The Hotrod client at runtime provides the end user with a number of different options to interact with the distributed cache as we can view from the console window below. Client Application Principal API Details We will not provide a detailed overview of the Hotrod application code however we will describe the principal API and code details briefly. In order to interact with the distributed cache on the JDG cluster using the Hotrod protocol, we will use the RemoteCacheManager Object which will allow us to retrieve a remote reference to the distributed cache. We have initialised a Properties object with the list of JDG instances and the associated with HotRod server port on each instance. We can add Cinema objects into the distributed cache using the RemoteCache.put() method. private RemoteCacheManager cacheManager; private RemoteCache cache; ..... Properties properties = new Properties(); properties.setProperty(ConfigurationProperties.SERVER_LIST, "127.0.0.1:11222;127.0.0.1:11322;127.0.0.1:11422"); cacheManager = new RemoteCacheManager(properties); cache = cacheManager.getCache("directory-dist-cache"); ..... cache.put(cinemaKey, cinemalist); In the webinar below, I describe in further detail how to set up a JDG cluster and how to develop and run the JDG application discussed above. For further details on JDG please visit: http://www.redhat.com/products/jbossenterprisemiddleware/data-grid/ Webinar: Introduction to JBoss Data Grid -- Installation, Configuration and Development In this webinar we will look at the basics of setting up JBoss Data Grid covering installation, configuration and development. We will look at practical examples of storing data, viewing the data in the cache and removing it. We will also take a look at the different clustered modes and what effect these have on the storage of your data:
July 25, 2014
by David Winters
· 16,040 Views
article thumbnail
What is Scalable Machine Learning?
scalability has become one of those core concept slash buzzwords of big data. it’s all about scaling out, web scale, and so on. in principle, the idea is to be able to take one piece of code and then throw any number of computers at it to make it fast. the terms “scalable” and “large scale” have been used in machine learning circles long before there was big data. there had always been certain problems which lead to a large amount of data, for example in bioinformatics, or when dealing with large number of text documents. so finding learning algorithms, or more generally data analysis algorithms which can deal with a very large set of data was always a relevant question. interestingly, this issue of scalability were seldom solved using actual scaling in in machine learning, at least not in the big data kind of sense. part of the reason is certainly that multicore processors didn’t yet exist at the scale they do today and that the idea of “just scaling out” wasn’t as pervasive as it is today. instead, “scalable” machine learning is almost always based on finding more efficient algorithms, and most often, approximations to the original algorithm which can be computed much more efficiently. to illustrate this, let’s search for nips papers (the annual advances in neural information processing systems, short nips, conference is one of the big ml community meetings) for papers which have the term “scalable” in the title. here are some examples: scalable inference for logistic-normal topic models … this paper presents a partially collapsed gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation … partially collapsed gibbs sampling is a kind of estimation algorithm for certain graphical models. a scalable approach to probabilistic latent space inference of large-scale networks … with […] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices […] on a single machine in a matter of hours … stochastic variational inference algorithm is both an approximation and an estimation algorithm. scalable kernels for graphs with continuous attributes … in this paper, we present a class of path kernels with computational complexity $o(n^2(m + \delta^2 ))$ … and this algorithm has squared runtime in the number of data points, so wouldn’t even scale out well even if you could. usually, even if there is potential for scalability, it usually something that is “embarassingly parallel” (yep, that’s a technical term), meaning that it’s something like a summation which can be parallelized very easily. still, the actual “scalability” comes from the algorithmic side. so how do scalable ml algorithms look like? a typical example are the stochastic gradient descent (sgd) class of algorithms. these algorithms can be used, for example, to train classifiers like linear svms or logistic regression. one data point is considered at each iteration. the prediction error on that point is computed and then the gradient is taken with respect to the model parameters, giving information about how to adapt these parameters slightly to make the error smaller. vowpal wabbit is one program based on this approach and it has a nice definition of what it considers to mean scalable in machine learning: there are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. this project is about approach (b), and it’s reached a state where it may be useful to others as a platform for research and experimentation. so “scalable” means having a learning algorithm which can deal with any amount of data, without consuming ever growing amounts of resources like memory. for sgd type algorithms this is the case, because all you need to store are the model parameters, usually a few ten to hundred thousand double precision floating point value, so maybe a few megabytes in total. the main problem to speed this kind of computation up is how to stream the data by fast enough. to put it differently, not only does this kind of scalability not rely on scaling out, it’s actually not even necessary or possible to scale the computation out because the main state of the computation easily fits into main memory and computations on it cannot be distributed easily. i know that gradient descent is often taken as an example for map reduce and other approaches like in this paper on the architecture of spark , but that paper discusses a version of gradient descent where you are not taking one point at a time, but aggregate the gradient information for the whole data set before making the update to the model parameters. while this can be easily parallelized, it does not perform well in practice because the gradient information tends to average out when computed over the whole data set. if you want to know more, this large scale learning challenge sören sonnneburg organized in 2008 still has valuable information on how to deal with massive data sets. of course, there are things which can be easily scaled well using hadoop or spark, in particular any kind of data preprocessing or feature extraction where you need to apply the same operation to each data point in your data set. another area where parallelization is easy and useful is when you are using cross validation to do model selection where you usually have to train a large number of models for different parameter sets to find the combination which performs best. again, even here there is more potential for even speeding up such computations using better algorithms like in this paper of mine . i’ve just scratched the surface of this, but i hope you got the idea that scalability can mean quite different things. in big data (meaning the infrastructure side of it) what you want to compute is pretty well defined, for example some kind of aggregate over your data set, so you’re left with the question of how to parallelize that computation well. in machine learning, you have much more freedom because data is noisy and there’s always some freedom in how you model your data, so you can often get away with computing some variation of what you originally wanted to do and still perform well. often, this allows you to speed up your computations significantly by decoupling computations. parallelization is important, too, but alone it won’t get you very far. luckily, there are projects like spark and stratosphere/flink which work on providing more useful abstractions beyond map and reduce to make the last part easier for data scientists, but you won’t get rid of the algorithmic design part any time soon.
July 3, 2014
by Mikio Braun
· 18,504 Views · 1 Like
article thumbnail
Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel
Explore different message brokers, and discover how these important web technologies impact a customer's backlog of messages, and cluster/data performance.
June 3, 2014
by Yves Trudeau
· 460,451 Views · 86 Likes
  • Previous
  • ...
  • 195
  • 196
  • 197
  • 198
  • 199
  • 200
  • 201
  • 202
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×