The Latest Data Topics

yield(), sleep(0), wait(0,1) and parkNanos(1)
On the surface, these methods do the same thing in Java: Thread.yield(), Thread.sleep(0), Object.wait(0,1) and LockSupport.parkNanos(1). They all wait for a short period of time, but how long that is varies a surprising amount between methods and between platforms.

Timing a short delay

The following code times how long it takes to repeatedly call those methods.

    import java.util.concurrent.locks.LockSupport;

    public class Pausing {
        public static void main(String... args) throws InterruptedException {
            int repeat = 10000;
            for (int i = 0; i < 3; i++) {
                long time0 = System.nanoTime();
                for (int j = 0; j < repeat; j++)
                    Thread.yield();
                long time1 = System.nanoTime();
                for (int j = 0; j < repeat; j++)
                    Thread.sleep(0);
                long time2 = System.nanoTime();
                // wait() must be called while holding the object's monitor
                synchronized (Thread.class) {
                    for (int j = 0; j < repeat / 10; j++)
                        Thread.class.wait(0, 1);
                }
                long time3 = System.nanoTime();
                for (int j = 0; j < repeat / 10; j++)
                    LockSupport.parkNanos(1);
                long time4 = System.nanoTime();
                System.out.printf("The average time to yield %.1f μs, sleep(0) %.1f μs, " +
                        "wait(0,1) %.1f μs and LockSupport.parkNanos(1) %.1f μs%n",
                        (time1 - time0) / repeat / 1e3,
                        (time2 - time1) / repeat / 1e3,
                        (time3 - time2) / (repeat / 10) / 1e3,
                        (time4 - time3) / (repeat / 10) / 1e3);
            }
        }
    }

On Windows 7

    The average time to yield 0.3 μs, sleep(0) 0.6 μs, wait(0,1) 999.9 μs and LockSupport.parkNanos(1) 1000.0 μs
    The average time to yield 0.3 μs, sleep(0) 0.6 μs, wait(0,1) 999.5 μs and LockSupport.parkNanos(1) 1000.1 μs
    The average time to yield 0.2 μs, sleep(0) 0.5 μs, wait(0,1) 1000.0 μs and LockSupport.parkNanos(1) 1000.1 μs

On RHEL 5.x

    The average time to yield 1.1 μs, sleep(0) 1.1 μs, wait(0,1) 2003.8 μs and LockSupport.parkNanos(1) 3.8 μs
    The average time to yield 1.1 μs, sleep(0) 1.1 μs, wait(0,1) 2004.8 μs and LockSupport.parkNanos(1) 3.4 μs
    The average time to yield 1.1 μs, sleep(0) 1.1 μs, wait(0,1) 2005.6 μs and LockSupport.parkNanos(1) 3.1 μs

In summary

If you want to wait for a short period of time, you can't assume that all these methods do the same thing, or that any one of them behaves the same way on every platform.
April 27, 2012
by Peter Lawrey
· 9,769 Views
Managing and Monitoring Drupal Sites on Windows Azure
A few weeks ago, I co-authored an article (with my colleague Rama Ramani) about how the Screen Actors Guild Awards website migrated its Drupal deployment from LAMP to Windows Azure: Azure Real World: Migrating a Drupal Site from LAMP to Windows Azure. Since then, Rama and another colleague, Jason Roth, have been working on writing up how the SAG Awards website was managed and monitored in Windows Azure. The article below is the fruit of their work…a very interesting/educational read. Overview Drupal is an open source content management system that runs on PHP. Windows Azure offers a flexible platform for hosting, managing, and scaling Drupal deployments. This paper focuses on an approach to host Drupal sites on Windows Azure, based on learning from a BPD Customer Programs Design Win engagement with the Screen Actors Guild Awards Drupal website. This paper covers guidelines and best practices for managing an existing Drupal web site in Windows Azure. For more information on how to migrate Drupal applications to Windows Azure, see Azure Real World: Migrating a Drupal Site from LAMP to Windows Azure. The target audience for this paper is Drupal administrators who have some exposure to Windows Azure. More detailed pointers to Windows Azure content is provided throughout the paper as links. Drupal Application Architecture on Windows Azure Before reviewing the management and monitoring guidelines, it is important to understand the architecture of a typical Drupal deployment on Windows Azure. First, the following diagram displays the basic architecture of Drupal running on Windows and IIS7. In the Windows Server scenario, you could have one or more machines hosting the web site in a farm. Those machines would either persist the site content to the file system or point to other network shares. For Windows Azure, the basic architecture is the same, but there are some differences. In Windows Azure the site is hosted on a web role. 
A web role instance is hosted on a Windows Server 2008 virtual machine within the Windows Azure datacenter. Like the web farm, you can have multiple instances running the site. But there is no persistence guarantee for the data on the file system. Because of this, much of the shared site content should be stored in Windows Azure Blob storage. This allows them to be highly available and durable. Usually, a large portion of the site caters to static content which lends well to caching. And caching can be applied in a set of places – browser level caching, CDN to cache content in the edge closer to the browser clients, caching in Azure to reduce the load on backend, etc. Finally, the database can be located in SQL Azure. The following diagram shows these differences. For monitoring and management, we will look at Drupal on Windows Azure from three perspectives: Availability: Ensure the web site does not go down and that all tiers are setup correctly. Apply best practices to ensure that the site is deployed across data centers and perform backup operations regularly. Scalability: Correctly handle changes in user load. Understand the performance characteristics of the site. Manageability: Correctly handle updates. Make code and site changes with no downtime when possible. Although some management tasks span one or more of these categories, it is still helpful to discuss Drupal management on Windows Azure within these focus areas. Availability One main goal is that the Drupal site remains running and accessible to all end-users. This involves monitoring both the site and the SQL Azure database that the site depends on. In this section, we will briefly look at monitoring and backup tasks. Other crossover areas that affect availability will be discussed in the next section on scalability. Monitoring With any application, monitoring plays an important role with managing availability. 
Monitoring data can reveal whether users are successfully using the site or whether computing resources are meeting the demand. Other data reveals error counts and possibly points to issues in a specific tier of the deployment. There are several monitoring tools that can be used. The Windows Azure Management Portal. Windows Azure diagnostic data. Custom monitoring scripts. System Center Operations Manager. Third party tools such as Azure Diagnostics Manager and Azure Storage Explorer. The Windows Azure Management Portal can be used to ensure that your deployments are successful and running. You can also use the portal to manage features such as Remote Desktop so that you can directly connect to machines that are running the Drupal site. Windows Azure diagnostics allows you to collect performance counters and logs off of the web role instances that are running the Drupal site. Although there are many options for configuring diagnostics in Azure, the best solution with Drupal is to use a diagnostics configuration file. The following configuration file demonstrates some basic performance counters that can monitor resources such as memory, processor utilization, and network bandwidth. For more information about setting up diagnostic configuration files, see How to Use the Windows Azure Diagnostics Configuration File. This information is stored locally on each role instance and then transferred to Windows Azure storage per a defined schedule or on-demand. See Getting Started with Storing and Viewing Diagnostic Data in Windows Azure Storage. Various monitoring tools, such as Azure Diagnostics Manager, help you to more easily analyze diagnostic data. Monitoring the performance of the machines hosting the Drupal site is only part of the story. In order to plan properly for both availability and scalability, you should also monitor site traffic, including user load patterns and trends. 
Standard and custom diagnostic data could contribute to this, but there are also third-party tools that monitor web traffic. For example, if you know that spikes occur in your application during certain days of the week, you could make changes to the application to handle the additional load and increase the availability of the Drupal solution. Backup Tasks To remain highly available, it is important to backup your data as a defense-in-depth strategy for disaster recovery. This is true even though SQL Azure and Windows Azure Storage both implement redundancy to prevent data loss. One obvious reason is that these services cannot prevent administrator error if data is accidentally deleted or incorrectly changed. SQL Azure does not currently have a formal backup technology, although there are many third-party tools and solutions that provide this capability. Usually the database size for a Drupal site is relatively small. In the case of SAG Awards, it was only ~100-150 MB. So performing an entire backup using any strategy was relatively fast. If your database is much larger, you might have to test various backup strategies to find the one that works best. Apart from third-party SQL Azure backup solutions, there are several strategies for obtaining a backup of your data: · Use the Drush tool and the portabledb-export command. · Periodically copy the database using the CREATE DATABASE Transact-SQL command. · Use Data-tier applications (DAC) to assist with backup and restore of the database. SQL Azure backup and data security techniques are described in more detail in the topic, Business Continuity in SQL Azure. Note that bandwidth costs accrue with any backup operation that transfers information outside of the Windows Azure datacenter. To reduce costs, you can copy the database to a database within the same datacenter. Or you can export the data-tier applications to blob storage in the same datacenter. Another potential backup task involves the files in Blob storage. 
If you keep a master copy of all media files uploaded to Blob storage, then you already have an on-premises backup of those files. However, if multiple administrators are loading files into Blob storage for use on the Drupal site, it is a good idea to enumerate the storage account and to download any new files to a central location. The following PHP script demonstrates how this can be done by backing up all files in Blob storage that were modified within a specified date range.

    setProxy(true, 'YOUR_PROXY_IF_NEEDED', 80);
    $blobs = (array)$blobObj->listBlobs(AZURE_STORAGE_CONTAINER, '', '', 35000);
    backupBlobs($blobs, $blobObj);

    // Download every blob whose last-modified time falls inside the backup window.
    function backupBlobs($blobs, $blobObj) {
        foreach ($blobs as $blob) {
            if (strtotime($blob->lastmodified) >= DEFAULT_BACKUP_FROM_DATE
                && strtotime($blob->lastmodified) <= DEFAULT_BACKUP_TO_DATE) {
                $path = pathinfo($blob->name);
                if ($path['basename'] != '$$$.$$$') {
                    $dir = $path['dirname'];
                    $oldDir = getcwd();
                    if (handleDirectory($dir)) {
                        chdir($dir);
                        $blobObj->getBlob(
                            AZURE_STORAGE_CONTAINER,
                            $blob->name,
                            $path['basename']
                        );
                        chdir($oldDir);
                    }
                }
            }
        }
    }

    // Create the local directory (recursively) if it does not exist yet.
    function handleDirectory($dir) {
        if (!checkDirExists($dir)) {
            return mkdir($dir, 0755, true);
        }
        return true;
    }

    function checkDirExists($dir) {
        if (file_exists($dir) && is_dir($dir)) {
            return true;
        }
        return false;
    }
    ?>

This script has a dependency on the Windows Azure SDK for PHP. Also note there are several parameters that you must modify, such as the storage account, secret, and backup location. As with SQL Azure, bandwidth and transaction charges apply to a backup script like this.

Scalability

Drupal sites on Windows Azure can scale as load increases through the typical strategies of scale-up, scale-out, and caching. The following sections describe the specifics of how these strategies are implemented in Windows Azure. Typically you make scalability decisions based on monitoring and capacity planning. Monitoring can be done in staging during testing or in production with real-time load.
Capacity planning factors in projections for changes in user demand. Scale Up When you configure your web role prior to deployment, you have the option of specifying the Virtual Machine (VM) size, such as Small or ExtraLarge. Each size tier adds additional memory, processing power, and network bandwidth to each instance of your web role. For cost efficiency and smaller units of scale, you can test your application under expected load to find the smallest virtual machine size that meets your requirements. The workload usually in most popular Drupal websites can be separated out into a limited set of Drupal admins making content changes and a large user base who perform mostly read-only workload. End users can be allowed to make ‘writes’, such as uploading blogs or posting in forums, but those changes are not ‘content changes’. Drupal admins are setup to operate without caching so that the writes are made directly to SQL Azure or the corresponding backend database. This workload performs well with Large or ExtraLarge VM sizes. Also, note that the VM size is closely tied to all hardware resources, so if there are many content-rich pages that are streaming content, then the VM size requirements are higher. To make changes to the Virtual Machine size setting, you must change the vmsize attribute of the WebRole element in the service definition file, ServiceDefinition.csdef. A virtual machine size change requires existing applications to be redeployed. Scale Out In addition to the size of each web role instance, you can increase or decrease the number of instances that are running the Drupal site. This spreads the web requests across more servers, enabling the site to handle more users. To change the number of running instances of your web role, see How to Scale Applications by Increasing or Decreasing the Number of Role Instances. Note that some configuration changes can cause your existing web role instances to recycle. 
You can choose to handle this situation by applying the configuration change and continuing to run. This is done by handling the RoleEnvironment.Changing event. For more information, see How to Use the RoleEnvironment.Changing Event.

A common question for any Windows Azure solution is whether there is some type of built-in automatic scaling. Windows Azure does not provide an auto-scaling service. However, it is possible to create a custom solution that scales Azure services using the Service Management API. For an example of this approach, see An Auto-Scaling Module for PHP Applications in Windows Azure.

Caching

Caching is an important strategy for scaling Drupal applications on Windows Azure. One reason for this is that SQL Azure implements throttling mechanisms to regulate the load on any one database in the cloud. Code that uses SQL Azure should have robust error handling and retry logic to account for this. For more information, see Error Messages (SQL Azure Database). Because of the potential for load-related throttling as well as for general performance improvement, it is strongly recommended to use caching.

Although Windows Azure provides a Caching service, this service does not currently have interoperability with PHP. Because of this, the best solution for caching in Drupal is to use a module that uses an open-source caching technology, such as Memcached. Outside of a specific Drupal module, you can also configure Memcached to work in PHP for Windows Azure. For more information, see Running Memcached on Windows Azure for PHP. Here is also an example of how to get Memcached working in Windows Azure using a plugin: Windows Azure Memcached plugin.

In a future paper, we hope to cover this architecture in more detail. For now, here are several design and management considerations related to caching, listed by area.

· Design and Implementation: For a technology like Memcached, will the cache be collocated (spread across all web role instances)? Or will you attempt to set up a dedicated cache ring with worker roles that only run Memcached?
· Configuration: What memory is required and how will items in the cache be invalidated?
· Performance and Monitoring: What mechanisms will be used to detect the performance and overall health of the cache?

For ease of use and cost savings, collocation of the cache across the web role instances of the Drupal site works best. However, this assumes that there is available reserve memory on each instance to apply toward caching. It is possible to increase the virtual machine size setting to increase the amount of available memory on each machine. It is also possible to add web role instances to add to the overall memory of the cache while at the same time improving the ability of the web site to respond to load. It is possible to create a dedicated cache cluster in the cloud, but the steps for this are beyond the scope of this paper.

For Windows Azure Blob storage, there is also a caching feature built into the service called the Content Delivery Network (CDN). CDN provides high-bandwidth access to files in Blob storage by caching copies of the files in edge nodes around the world. Even within a single geographic region, you could see performance improvements as there are many more edge nodes than Windows Azure datacenters. For more information, see Delivering High-Bandwidth Content with the Windows Azure CDN.

Manageability

It is important to note that each hosted service has a Staging environment and a Production environment. This can be used to manage deployments, because you can load and test an application in staging before performing a VIP swap with production.

From a manageability standpoint, Drupal has an advantage on Windows Azure in the way that site content is stored. Because the data necessary to serve pages is stored in the database and blob storage, there is no need to redeploy the application to change the content of the site.
Another best practice is to use a different storage account for diagnostic data from the one that is used for the application itself. This can improve performance and also helps to separate the cost of diagnostic monitoring from the cost of the running application.

As mentioned previously, there are several tools that can assist with managing Windows Azure applications. The following list summarizes a few of these choices, giving each tool and its description.

· Windows Azure Management Portal: The web interface of the Windows Azure management portal shows deployments, instance counts and properties, and supports many common management and monitoring tasks.
· Azure Diagnostics Manager: A Red Gate Software product that provides advanced monitoring and management of diagnostic data. This tool can be very useful for analyzing the performance of the Drupal site to determine appropriate scaling decisions.
· Azure Storage Explorer: A tool created by Neudesic for viewing Windows Azure storage accounts. This can be useful for viewing both diagnostic data and the files in Blob storage.
April 25, 2012
by Brian Swan
· 8,749 Views
Face Detection using HTML5, Javascript, Webrtc, Websockets, Jetty and OpenCV
How to create a real-time face detection system using HTML5, JavaScript, and OpenCV, leveraging WebRTC for webcam access and WebSockets for client-server communication.
April 23, 2012
by Jos Dirksen
· 53,102 Views
How-to: Python Data into Graphite for Monitoring Bliss
This post shows code examples in Python (2.7) for sending data to Graphite.

Once you have a Graphite server set up, with Carbon running/collecting, you need to send it data for graphing. Basically, you write a program to collect numeric values and send them to Graphite's backend aggregator (Carbon).

To send data, you create a socket connection to the graphite/carbon server and send a message (string) in the format:

    "metric_path value timestamp\n"

  • `metric_path`: arbitrary namespace containing substrings delimited by dots. The most general name is at the left and the most specific is at the right.
  • `value`: numeric value to store.
  • `timestamp`: epoch time.
  • Messages must end with a trailing newline.
  • Multiple messages may be batched and sent in a single socket operation. Each message is delimited by a newline, with a trailing newline at the end of the message batch.

Example message:

    "foo.bar.baz 42 74857843\n"

Let's look at some (Python 2.7) code for sending data to graphite...

Here is a simple client that sends a single message to graphite.

Code:

    #!/usr/bin/env python

    import socket
    import time

    CARBON_SERVER = '0.0.0.0'
    CARBON_PORT = 2003

    message = 'foo.bar.baz 42 %d\n' % int(time.time())

    print 'sending message:\n%s' % message
    sock = socket.socket()
    sock.connect((CARBON_SERVER, CARBON_PORT))
    sock.sendall(message)
    sock.close()

Here is a command line client that sends a single message to graphite:

Usage:

    $ python client-cli.py metric_path value

Code:

    #!/usr/bin/env python

    import argparse
    import socket
    import time

    CARBON_SERVER = '0.0.0.0'
    CARBON_PORT = 2003

    parser = argparse.ArgumentParser()
    parser.add_argument('metric_path')
    parser.add_argument('value')
    args = parser.parse_args()

    if __name__ == '__main__':
        timestamp = int(time.time())
        message = '%s %s %d\n' % (args.metric_path, args.value, timestamp)

        print 'sending message:\n%s' % message
        sock = socket.socket()
        sock.connect((CARBON_SERVER, CARBON_PORT))
        sock.sendall(message)
        sock.close()

Here is a client that collects load average (Linux-only) and sends a batch of 3 messages (1min/5min/15min loadavg) to graphite. It will run continuously in a loop until killed. (Adjust the delay for a faster/slower collection interval.)

    #!/usr/bin/env python

    import platform
    import socket
    import time

    CARBON_SERVER = '0.0.0.0'
    CARBON_PORT = 2003
    DELAY = 15  # secs

    def get_loadavgs():
        # the first three fields of /proc/loadavg are the 1/5/15 minute averages
        with open('/proc/loadavg') as f:
            return f.read().strip().split()[:3]

    def send_msg(message):
        print 'sending message:\n%s' % message
        sock = socket.socket()
        sock.connect((CARBON_SERVER, CARBON_PORT))
        sock.sendall(message)
        sock.close()

    if __name__ == '__main__':
        # dots separate namespace levels, so replace them in the host name
        node = platform.node().replace('.', '-')
        while True:
            timestamp = int(time.time())
            loadavgs = get_loadavgs()
            lines = [
                'system.%s.loadavg_1min %s %d' % (node, loadavgs[0], timestamp),
                'system.%s.loadavg_5min %s %d' % (node, loadavgs[1], timestamp),
                'system.%s.loadavg_15min %s %d' % (node, loadavgs[2], timestamp)
            ]
            message = '\n'.join(lines) + '\n'
            send_msg(message)
            time.sleep(DELAY)

Resources:

  • Graphite Docs
  • Graphite Docs - Getting Your Data Into Graphite
  • Installing Graphite 0.9.9 on Ubuntu 12.04 LTS
  • Installing and configuring Graphite
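The message format described above can also be isolated into small helpers, which makes the formatting easy to check without a running Carbon server. The following is a minimal sketch in Python 3 (not the 2.7 used in the article); the function names `format_metric`, `build_batch` and `send_batch` are hypothetical, and the only details taken from the article are the "metric_path value timestamp\n" layout and port 2003.

```python
import socket
import time


def format_metric(metric_path, value, timestamp=None):
    """Build one message line: metric path, value, epoch time, trailing newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (metric_path, value, timestamp)


def build_batch(metrics, timestamp=None):
    """Join several (metric_path, value) pairs into one newline-delimited batch."""
    if timestamp is None:
        timestamp = int(time.time())
    return "".join(format_metric(path, value, timestamp) for path, value in metrics)


def send_batch(message, host, port=2003):
    """Send an already formatted batch over TCP (Python 3 sockets need bytes)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(message.encode("ascii"))
```

Because each formatted line already ends in a newline, concatenating lines yields a batch that also ends in a newline, as the format requires. A call such as `send_batch(build_batch([("system.web1.loadavg_1min", 0.42)]), "graphite.example.com")` would then deliver it, where the host name is a placeholder.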
April 20, 2012
by Corey Goldberg
· 25,278 Views
Caching With WCF Services
This is the first part of a two-part article about caching in WCF services. In this part I will explain the in-process memory cache available in .NET 4.0. In the second part I will describe the Windows AppFabric distributed memory cache.

The .NET Framework has provided a cache for ASP.NET applications since version 1.0. For other types of applications, like WPF applications or console applications, caching was never possible out of the box. Only WCF services were able to use the ASP.NET cache if they were configured to run in ASP.NET compatibility mode. But this mode has some performance drawbacks and only works when the WCF service is hosted inside IIS and uses an HTTP-based binding.

With the release of the .NET 4.0 framework this has luckily changed. Microsoft has now developed an in-process memory cache that does not rely on the ASP.NET framework. This cache can be found in the “System.Runtime.Caching.dll” assembly.

In order to explain the working of the cache, I have created a simple sample application. It consists of a very slow repository called “SlowRepository”.

    public class SlowRepository
    {
        public IEnumerable<string> GetPizzas()
        {
            // Simulate a slow data source
            Thread.Sleep(10000);
            return new List<string>() { "Hawaii", "Pepperoni", "Bolognaise" };
        }
    }

This repository is used by my sample WCF service to get its data.

    public class PizzaService : IPizzaService
    {
        private const string CacheKey = "availablePizzas";
        private SlowRepository repository;

        public PizzaService()
        {
            this.repository = new SlowRepository();
        }

        public IEnumerable<string> GetAvailablePizzas()
        {
            ObjectCache cache = MemoryCache.Default;

            if (cache.Contains(CacheKey))
                return (IEnumerable<string>)cache.Get(CacheKey);
            else
            {
                IEnumerable<string> availablePizzas = repository.GetPizzas();

                // Store data in the cache
                CacheItemPolicy cacheItemPolicy = new CacheItemPolicy();
                cacheItemPolicy.AbsoluteExpiration = DateTime.Now.AddHours(1.0);
                cache.Add(CacheKey, availablePizzas, cacheItemPolicy);

                return availablePizzas;
            }
        }
    }

When the WCF service method GetAvailablePizzas is called, the service first retrieves the default memory cache instance:

    ObjectCache cache = MemoryCache.Default;

Next, it checks if the data is already available in the cache. If so, the cached data is used. If not, the repository is called to get the data and afterwards the data is stored in the cache with an absolute expiration of one hour.

For my sample service, I also chose to restrict the maximum memory to 20% of the total physical memory. This can be done in the web.config.
April 13, 2012
by Pieter De Rycke
· 21,918 Views · 1 Like
How to Use Sigma.js with Neo4j
I’ve done a few posts recently using d3.js, and now I want to show you how to use two other great JavaScript libraries to visualize your graphs. We’ll start with sigma.js, and soon I’ll do another post with three.js.

We’re going to create our graph and group our nodes into five clusters. You’ll notice later on that we give our clustered nodes colors using RGB values, so we’ll be able to see them move around until they find their right place in our layout. We’ll be using two sigma.js plugins: the GEXF (Graph Exchange XML Format) parser and the ForceAtlas2 layout.

You can see what a GEXF file looks like below. Notice it comes from Gephi, which is an interactive visualization and exploration platform that runs on all major operating systems, is open source, and is free.

    ... ...

In order to build this file, we will need to get the nodes and edges from the graph and create an XML file.

    get '/graph.xml' do
      @nodes = nodes
      @edges = edges
      builder :graph
    end

We’ll use Cypher to get our nodes and edges:

    def nodes
      neo = Neography::Rest.new
      cypher_query =  " start node = node:nodes_index(type='user')"
      cypher_query << " return id(node), node"
      neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0]}.merge(n[1]["data"])}
    end

We need the node and relationship ids, so notice I’m using the id() function in both cases.

    def edges
      neo = Neography::Rest.new
      cypher_query =  " start source = node:nodes_index(type='user')"
      cypher_query << " match source -[rel]-> target"
      cypher_query << " return id(rel), id(source), id(target)"
      neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0], "source" => n[1], "target" => n[2]} }
    end

So far we have seen graphs represented as JSON, and we’ve built these manually. Today we’ll take advantage of the Builder Ruby gem to build our graph in XML.

    xml.instruct! :xml
    xml.gexf 'xmlns' => "http://www.gephi.org/gexf", 'xmlns:viz' => "http://www.gephi.org/gexf/viz" do
      xml.graph 'defaultedgetype' => "directed", 'idtype' => "string", 'type' => "static" do
        xml.nodes :count => @nodes.size do
          @nodes.each do |n|
            xml.node :id => n["id"], :label => n["name"] do
              xml.tag!("viz:size", :value => n["size"])
              xml.tag!("viz:color", :b => n["b"], :g => n["g"], :r => n["r"])
              xml.tag!("viz:position", :x => n["x"], :y => n["y"])
            end
          end
        end
        xml.edges :count => @edges.size do
          @edges.each do |e|
            xml.edge :id => e["id"], :source => e["source"], :target => e["target"]
          end
        end
      end
    end

You can get the code on GitHub as usual and see it running live on Heroku. You will want to see it live on Heroku so you can watch the nodes start in random positions and then move to form clusters. Use your mouse wheel to zoom in, and click and drag to move around. Credit goes out to Alexis Jacomy and Mathieu Jacomy.

You’ve seen me create numerous random graphs, but for completeness here is the code for this graph. Notice how I create 5 clusters and, for each node, assign half its relationships to other nodes in its cluster and half to random nodes? This is so the ForceAtlas2 layout plugin clusters our nodes neatly.

    def create_graph
      neo = Neography::Rest.new
      graph_exists = neo.get_node_properties(1)
      return if graph_exists && graph_exists['name']

      names = 500.times.collect{|x| generate_text}
      clusters = 5.times.collect{|x| {:r => rand(256), :g => rand(256), :b => rand(256)} }

      commands = []
      names.each_index do |n|
        cluster = clusters[n % clusters.size]
        commands << [:create_node, {:name => names[n],
                                    :size => 5.0 + rand(20.0),
                                    :r => cluster[:r],
                                    :g => cluster[:g],
                                    :b => cluster[:b],
                                    :x => rand(600) - 300,
                                    :y => rand(150) - 150 }]
      end

      names.each_index do |from|
        # "{n}" refers to the result of the n-th command in the batch
        commands << [:add_node_to_index, "nodes_index", "type", "user", "{#{from}}"]
        connected = []

        # create clustered relationships
        members = 20.times.collect{|x| x * 10 + (from % clusters.size)}
        members.delete(from)
        rels = 3
        rels.times do |x|
          to = members[x]
          connected << to
          commands << [:create_relationship, "follows", "{#{from}}", "{#{to}}"] unless to == from
        end

        # create random relationships
        rels = 3
        rels.times do |x|
          to = rand(names.size)
          commands << [:create_relationship, "follows", "{#{from}}", "{#{to}}"] unless (to == from) || connected.include?(to)
        end
      end

      batch_result = neo.batch *commands
    end
April 12, 2012
by Max De Marzi
· 15,380 Views
F1 Live Timing Map
This is a live timing map application for F1 championship races, made using JavaScript and Google Maps markers. The live timing data is supplied by formula1.com. It’s interactive: you can click on a driver to track him, or click on an empty area of the map to stop tracking and return to the general view. It also has a responsive design that adapts to mobile browsers using the jQuery Mobile framework.

How it works:

The client side: Until the race start date, a countdown and a demo race are shown. When the countdown finishes, the client connects to the server (using Ajax) to get the live timing data every five seconds, and the interface is updated with this data.

The server side: It uses a Django app for the web page, and the static race data (circuit, laps, drivers) is put into the HTML using the Django template system. For the dynamic data (live timing), I modified the source of a C program for the Linux terminal called live-f1 so that it generates JSON with the data the client requires instead of printing it on the terminal screen.

Enjoy the race!
April 12, 2012
by Luis Sobrecueva
· 15,784 Views
A Regular Expression HashMap Implementation in Java
Below is an implementation of a Regular Expression HashMap. It works with key-value pairs in which the key is a regular expression. It compiles the key (regular expression) while adding (i.e. putting), so there is no compile time while getting. When getting an element, you don't give a regular expression; you give any string that the regular expression could match. As a result, this behaviour maps the many strings matched by one regular expression to the same value. The class does not depend on any external libraries and uses only the standard java.util packages, so it can be dropped in wherever this behaviour is required.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.regex.Pattern;

    /**
     * This class is an extended version of Java HashMap
     * and includes pattern-value lists which are used to
     * evaluate regular expression values. If given item
     * is a regular expression, it is saved in regexp lists.
     * If requested item matches with a regular expression,
     * its value is get from regexp lists.
     *
     * @author cb
     *
     * @param <K> Key of the map item.
     * @param <V> Value of the map item.
     */
    public class RegExHashMap<K, V> extends HashMap<K, V> {
        // list of regular expression patterns
        private ArrayList<Pattern> regExPatterns = new ArrayList<Pattern>();
        // list of regular expression values which match patterns
        private ArrayList<V> regExValues = new ArrayList<V>();

        /**
         * Compile regular expression and add it to the regexp list as key.
         */
        @Override
        public V put(K key, V value) {
            regExPatterns.add(Pattern.compile(key.toString()));
            regExValues.add(value);
            return value;
        }

        /**
         * If requested value matches with a regular expression,
         * returns it from regexp lists.
         */
        @Override
        public V get(Object key) {
            CharSequence cs = key.toString();
            // patterns are tried in insertion order; the first full match wins
            for (int i = 0; i < regExPatterns.size(); i++) {
                if (regExPatterns.get(i).matcher(cs).matches()) {
                    return regExValues.get(i);
                }
            }
            return super.get(key);
        }
    }
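The article's class is written in Java; purely to illustrate the lookup behaviour it describes (compile on put, match the whole key on get, first matching pattern wins), here is a minimal sketch of the same idea in Python. The class name `RegexDict` and its methods are hypothetical and are not part of the original code.

```python
import re


class RegexDict:
    """Maps regular-expression keys to values; lookups take plain strings."""

    def __init__(self):
        # (compiled pattern, value) pairs kept in insertion order
        self._entries = []

    def put(self, pattern, value):
        # Compile once at insertion time so lookups pay no compile cost.
        self._entries.append((re.compile(pattern), value))

    def get(self, key, default=None):
        # Return the value of the first pattern that matches the whole key.
        for compiled, value in self._entries:
            if compiled.fullmatch(key):
                return value
        return default


codes = RegexDict()
codes.put(r"\d+", "number")
codes.put(r"[a-z]+", "word")
print(codes.get("2012"), codes.get("dzone"), codes.get("dzone2012"))
```

Note that a lookup is a linear scan over all stored patterns, so unlike a plain hash map its cost grows with the number of entries.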
April 11, 2012
by Cagdas Basaraner
· 24,693 Views
article thumbnail
Algorithm of the Week: Rabin-Karp String Searching
Brute force string matching is a very basic sub-string matching algorithm, but it has its merits. For example, it doesn’t require preprocessing of the text or the pattern. The problem is that it’s very slow, which is why in many cases brute force matching isn’t very useful. For pattern matching we need something faster, but to understand other sub-string matching algorithms let’s take another look at brute force matching.

In brute force sub-string matching we compare every single character of the text with the first character of the pattern. Once we have a match, we move on to compare the second character of the pattern with the next character of the text, as shown in the picture below.

This algorithm is slow for two main reasons. First, we have to check every single character of the text. Second, even if we find a match between a text character and the first character of the pattern, we continue to check step by step (character by character) every single symbol of the pattern in order to find out whether it is in the text.

So is there any other approach to find whether the text contains the pattern? In fact there is a “faster” approach. To avoid comparing the pattern and the text character by character, we’ll try to compare them all at once, so we need a good hash function. With its help we can hash the pattern and check it against hashed sub-strings of the text. We must be sure that the hash function returns “small” hash codes even for larger sub-strings. Another problem is that for larger patterns we can’t expect to have short hashes. Apart from this, the approach should be quite effective compared to brute force string matching. This approach is known as the Rabin-Karp algorithm.

Overview

Michael O. Rabin and Richard M. Karp came up with the idea of hashing the pattern and checking it against a hashed sub-string of the text in 1987.
In general the idea seems quite simple; the only catch is that we need a hash function that gives different hashes for different sub-strings. Such a hash function may, for instance, use the ASCII code of every character, but we must be careful about multi-lingual support. The hash function may vary depending on many things, so it may consist of ASCII char to number conversion, but it can also be anything else. The only thing we need is to convert a string (the pattern) into some hash that is faster to compare.

Let’s say we have the text “hello world” and the pattern “he”, and let’s assume that hash(‘he’) = 1. If some two-character sub-string of the text also hashes to 1, we can say that the pattern “he” is very probably contained in the text “hello world”. Because two different strings can share a hash, an equal hash should still be confirmed by comparing the characters themselves.

So in every step, we take from the text a sub-string of length m, where m is the pattern length. We hash this sub-string and compare it directly to the hashed pattern, as in the picture above.

Implementation

So far we saw some diagrams explaining the Rabin-Karp algorithm, but let’s take a look at its implementation here, in this very basic example where a simple hash table is used to convert the characters into integers. The code is PHP and it’s used only to illustrate the principles of this algorithm.

    function hash_string($str, $len)
    {
        $hash = '';

        $hash_table = array(
            'h' => 1,
            'e' => 2,
            'l' => 3,
            'o' => 4,
            'w' => 5,
            'r' => 6,
            'd' => 7,
        );

        for ($i = 0; $i < $len; $i++) {
            $hash .= $hash_table[$str[$i]];
        }

        return (int)$hash;
    }

    function rabin_karp($text, $pattern)
    {
        $n = strlen($text);
        $m = strlen($pattern);

        $pattern_hash = hash_string($pattern, $m);

        for ($i = 0; $i < $n-$m+1; $i++) {
            // hash the window of the text that starts at position $i
            $text_hash = hash_string(substr($text, $i, $m), $m);

            if ($text_hash == $pattern_hash) {
                return $i;
            }
        }

        return -1;
    }

    // 1 (zero-based position of the match)
    echo rabin_karp('hello world', 'ello');

Multiple Pattern Match

The Rabin-Karp algorithm is well suited to matching multiple patterns.
Indeed, its nature supports such functionality, which is its advantage over other string searching algorithms.

Complexity

The Rabin-Karp algorithm has a worst-case complexity of O(nm), where n is the length of the text and m is the length of the pattern. So how does it compare to brute-force matching? Well, brute force matching complexity is also O(nm), so it seems there’s not much of a gain in performance. However, Rabin-Karp’s expected complexity is O(n+m) when the hash of each window is derived from the previous one in constant time (a rolling hash), and that makes it a bit faster in practice, as shown on the chart below. Note that the Rabin-Karp algorithm also needs O(m) preprocessing time.

Application

As we saw, Rabin-Karp is not much faster than brute force matching. So where should we use it?

3 Reasons Why Rabin-Karp is Cool
1. Good for plagiarism detection, because it can deal with multiple pattern matching!
2. Not faster than brute force matching in theory, but in practice its complexity is O(n+m)!
3. With a good hashing function it can be quite effective and it’s easy to implement!

2 Reasons Why Rabin-Karp is Not Cool
1. There are lots of string matching algorithms that are faster than O(n+m)
2. It’s practically as slow as brute force matching and it requires additional space

Final Words

Rabin-Karp is a great algorithm for one simple reason: it can be used to match against multiple patterns. This makes it perfect for detecting plagiarism, even for larger phrases.
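The PHP example above recomputes the hash of every window from scratch, which is where the O(nm) cost comes from. As a rough illustration of how the O(n+m) behaviour is usually obtained, here is a minimal Java sketch that uses a polynomial rolling hash and confirms every hash hit with a character comparison. The class name, the base 256 and the modulus 1,000,003 are assumptions chosen for the example and do not come from the article.

```java
/** Hypothetical sketch of Rabin-Karp with a rolling hash; constants are illustrative assumptions. */
final class RollingHashSearch {
    private static final int BASE = 256;        // assumed multiplier of the polynomial hash
    private static final int MOD = 1_000_003;   // assumed prime modulus that keeps hashes small

    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) return 0;
        if (m > n) return -1;

        // Weight of the leading character of a window: BASE^(m-1) mod MOD.
        long highest = 1;
        for (int i = 1; i < m; i++) {
            highest = (highest * BASE) % MOD;
        }

        // O(m) preprocessing: hash the pattern and the first window of the text.
        long patternHash = 0;
        long windowHash = 0;
        for (int i = 0; i < m; i++) {
            patternHash = (patternHash * BASE + pattern.charAt(i)) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i)) % MOD;
        }

        for (int i = 0; ; i++) {
            // Equal hashes may be a collision, so the characters are compared as well.
            if (windowHash == patternHash && text.regionMatches(i, pattern, 0, m)) {
                return i;
            }
            if (i + m >= n) {
                return -1; // no further window fits into the text
            }
            // Roll the hash: drop text[i], shift by BASE, append text[i + m].
            windowHash = (windowHash - text.charAt(i) * highest % MOD + MOD) % MOD;
            windowHash = (windowHash * BASE + text.charAt(i + m)) % MOD;
        }
    }
}
```

Calling RollingHashSearch.indexOf("hello world", "ello") returns 1. The explicit regionMatches check is a design choice: it costs O(m) only when the hashes agree, and it removes the risk of reporting a false match caused by a collision.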
April 3, 2012
by Stoimen Popov
· 36,719 Views
article thumbnail
Converting a Value to String in JavaScript
In JavaScript, there are three main ways in which any value can be converted to a string. This blog post explains each way, along with its advantages and disadvantages.

Three approaches for converting to string

The three approaches for converting to string are:

1. value.toString()
2. "" + value
3. String(value)

The problem with approach #1 is that it doesn’t work if the value is null or undefined. That leaves us with approaches #2 and #3, which are basically equivalent.

""+value: The plus operator is fine for converting a value when it is surrounded by non-empty strings. As a way for converting a value to string, I find it less descriptive of one’s intentions. But that is a matter of taste, some people prefer this approach to String(value).

String(value): This approach is nicely explicit: Apply the function String() to value. The only problem is that this function call will confuse some people, especially those coming from Java, because String is also a constructor. However, function and constructor produce completely different results:

    > String("abc") === new String("abc")
    false
    > typeof String("abc")
    'string'
    > String("abc") instanceof String
    false
    > typeof new String("abc")
    'object'
    > new String("abc") instanceof String
    true

The function produces, as promised, a string (a primitive [1]). The constructor produces an instance of the type String (an object). The latter is hardly ever useful in JavaScript, which is why you can usually forget about String as a constructor and concentrate on its role as converting to string.

A minor difference between ""+value and String(value)

Until now you have heard that + and String() convert their “argument” to string. But how do they actually do that? It turns out that they do it in slightly different ways, but usually arrive at the same result.

Converting primitives to string

Both approaches use the internal ToString() operation to convert primitives to string.
“Internal” means: a function specified by ECMAScript 5.1 (§9.8) that isn’t accessible to the language itself. The following table explains how ToString() operates on primitives.

    Argument        Result
    undefined       "undefined"
    null            "null"
    boolean value   either "true" or "false"
    number value    the number as a string, e.g. "1.765"
    string value    no conversion necessary

Converting objects to string

Both approaches first convert an object to a primitive, before converting that primitive to string. However, + uses the internal ToPrimitive() operation with the hint Number (except for dates [2]), while String() uses ToPrimitive() with the hint String.

ToPrimitive(Number): To convert an object obj to a primitive, invoke obj.valueOf(). If the result is primitive, return that result. Otherwise, invoke obj.toString(). If the result is primitive, return that result. Otherwise, throw a TypeError.

ToPrimitive(String): Works the same, but invokes obj.toString() before obj.valueOf().

With the following object, you can observe the difference:

    var obj = {
        valueOf: function () {
            console.log("valueOf");
            return {}; // not a primitive, keep going
        },
        toString: function () {
            console.log("toString");
            return {}; // not a primitive, keep going
        }
    };

Interaction:

    > "" + obj
    valueOf
    toString
    TypeError: Cannot convert object to primitive value
    > String(obj)
    toString
    valueOf
    TypeError: Cannot convert object to primitive value

Most objects use the default implementation of valueOf(), which returns this for objects. Its result is therefore never a primitive, so ToPrimitive(Number) always falls through to toString().

    > var x = {}
    > x.valueOf() === x
    true

Instances of Boolean, Number, and String wrap primitives, and valueOf() returns the wrapped primitive. But that still means that the final result will be the same as for toString(), even though it will have been produced in a different manner.

    > var n = new Number(756)
    > n.valueOf() === n
    false
    > n.valueOf() === 756
    true

Conclusion

Which of the three approaches for converting to string should you choose?
value.toString() can be OK, if you are sure that value will never be null or undefined. Otherwise, ""+value and String(value) are mostly equivalent. Which one people prefer is a matter of taste. I find String(value) more explicit. Related posts JavaScript values: not everything is an object [primitives versus objects] What is {} + {} in JavaScript? [explains how the + operator works] String concatenation in JavaScript [how to best concatenate many strings]
March 30, 2012
by Axel Rauschmayer
· 31,842 Views · 2 Likes
article thumbnail
Using "Natural": A NLP Module for node.js
Like most node modules, "natural" is packaged as an npm module and can be installed from the command line with npm.
March 27, 2012
by Christopher Umbel
· 63,978 Views · 3 Likes
article thumbnail
Algorithm of the Week: Brute Force String Matching
String matching is crucial for database development and text processing software. Fortunately, every modern programming language and library is full of functions for string processing that help us in our everyday work. However, it's important to understand their principles.

String algorithms can typically be divided into several categories. One of these categories is string matching. When it comes to string matching, the most basic approach is what is known as brute force, which simply means checking every single character of the text against the pattern. In general we have a text and a pattern (most commonly shorter than the text). What we need to do is answer the question of whether this pattern appears in the text.

Overview

The principles of brute force string matching are quite simple. We check for a match between the first character of the pattern and the first character of the text, as in the picture below. If they don’t match, we move forward to the second character of the text. Now we compare the first character of the pattern with the second character of the text. If they don’t match again, we move forward until we get a match or until we reach the end of the text. If they match, we move forward to the second character of the pattern, comparing it with the “next” character of the text, as shown in the picture below.

Just because we have found a match between the first character of the pattern and some character of the text doesn’t mean that the pattern appears in the text. We must move forward to see whether the full pattern is contained in the text.

Implementation

Implementation of brute force string matching is easy, and here we can see a short PHP example. The bad news is that this algorithm is naturally quite slow.
    function sub_string($pattern, $subject)
    {
        $n = strlen($subject);
        $m = strlen($pattern);

        // the last position where the pattern still fits is $n - $m
        for ($i = 0; $i <= $n - $m; $i++) {
            $j = 0;
            while ($j < $m && $subject[$i+$j] == $pattern[$j]) {
                $j++;
            }
            if ($j == $m) return $i;
        }
        return -1;
    }

    echo sub_string('o wo', 'hello world!');

Complexity

As I said, this algorithm is slow. Actually, every algorithm that contains “brute force” in its name is slow, but to show how slow string matching is, I can say that its complexity is O(n·m). Here n is the length of the text, while m is the length of the pattern. If we fix the length of the text and test against a variable length of the pattern, again we get a rapidly growing function.

Application

Brute force string matching can be very ineffective, but it can also be very handy in some cases, just like sequential search.

It can be very useful…
  • Doesn’t require pre-processing of the text: Indeed, if we search the text only once we don’t need to pre-process it. Most of the algorithms for string matching need to build an index of the text in order to search quickly. This is great when you have to search a text more than once, but if you search only once, perhaps (for short texts) brute force matching is great!
  • Doesn’t require additional space: Because brute force matching doesn’t need pre-processing it also doesn’t require more space, which is one cool feature of this algorithm.
  • Can be quite effective for short texts and patterns.

It can be ineffective…
  • If we search the text more than once: As I said in the previous section, if you perform the search more than once it’s perhaps better to use another string matching algorithm that builds an index, and it’s faster.
  • It’s slow: In general brute force algorithms are slow and brute force matching isn’t an exception.

Final Words

String matching is something very special in software development and it is used in various cases, so every developer must be familiar with this topic.
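To make the O(n·m) claim from the Complexity section concrete, the following Java sketch counts the character comparisons that a brute force search performs before it stops. The class and method names are hypothetical; the search logic mirrors the PHP function above, with a counter added.

```java
/** Hypothetical helper that counts the character comparisons made by a brute force search. */
final class BruteForceCount {
    /** Returns how many character comparisons are made until the first match or the end of the text. */
    static long comparisons(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        long count = 0;
        // Try every position where the pattern still fits into the text.
        for (int i = 0; i + m <= n; i++) {
            int j = 0;
            while (j < m) {
                count++; // one comparison of a text character with a pattern character
                if (text.charAt(i + j) != pattern.charAt(j)) {
                    break;
                }
                j++;
            }
            if (j == m) {
                return count; // full match found, stop counting
            }
        }
        return count;
    }
}
```

For the article's example, comparisons("hello world!", "o wo") is 8: four single mismatches followed by four matching characters. A worst-case input such as the text "aaaaaaaaaa" with the pattern "aaab" needs (n-m+1)·m = 7·4 = 28 comparisons, because every window matches three characters before failing on the fourth.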
March 27, 2012
by Stoimen Popov
· 61,860 Views · 3 Likes
article thumbnail
GapList – a Lightning-Fast List Implementation
This article introduces GapList, an implementation which strives for combining the strengths of both ArrayList and LinkedList.
March 19, 2012
by Thomas Mauch
· 64,984 Views · 4 Likes
article thumbnail
Defensive Programming vs. Batshit Crazy Paranoid Programming
Hey, let’s be careful out there. --Sergeant Esterhaus, daily briefing to the force of Hill Street Blues When developers run into an unexpected bug and can’t fix it, they’ll “add some defensive code” to make the code safer and to make it easier to find the problem. Sometimes just doing this will make the problem go away. They’ll tighten up data validation – making sure to check input and output fields and return values. Review and improve error handling – maybe add some checking around “impossible” conditions. Add some helpful logging and diagnostics. In other words, the kind of code that should have been there in the first place. Expect the Unexpected The whole point of defensive programming is guarding against errors you don’t expect. ---Steve McConnell, Code Complete The few basic rules of defensive programming are explained in a short chapter in Steve McConnell’s classic book on programming, Code Complete: Protect your code from invalid data coming from “outside”, wherever you decide “outside” is. Data from an external system or the user or a file, or any data from outside of the module/component. Establish “barricades” or “safe zones” or “trust boundaries” – everything outside of the boundary is dangerous, everything inside of the boundary is safe. In the barricade code, validate all input data: check all input parameters for the correct type, length, and range of values. Double check for limits and bounds. After you have checked for bad data, decide how to handle it. Defensive Programming is NOT about swallowing errors or hiding bugs. It’s about deciding on the trade-off between robustness (keep running if there is a problem you can deal with) and correctness (never return inaccurate results). Choose a strategy to deal with bad data: return an error and stop right away (fast fail), return a neutral value, substitute data values, … Make sure that the strategy is clear and consistent. 
Don’t assume that a function call or method call outside of your code will work as advertised. Make sure that you understand and test error handling around external APIs and libraries. Use assertions to document assumptions and to highlight “impossible” conditions, at least in development and testing. This is especially important in large systems that have been maintained by different people over time, or in high-reliability code. Add diagnostic code, logging and tracing intelligently to help explain what’s going on at run-time, especially if you run into a problem. Standardize error handling. Decide how to handle “normal errors” or “expected errors” and warnings, and do all of this consistently. Use exception handling only when you need to, and make sure that you understand the language’s exception handler inside out. Programs that use exceptions as part of their normal processing suffer from all the readability and maintainability problems of classic spaghetti code. --The Pragmatic Programmer I would add a couple of other rules. From Michael Nygard’s Release It!: Never ever wait forever on an external call, especially a remote call. Forever can be a long time when something goes wrong. Use time-out/retry logic and his Circuit Breaker stability pattern to deal with remote failures. And for languages like C and C++, defensive programming also includes using safe function calls to avoid buffer overflows and common coding mistakes. Different Kinds of Paranoia The Pragmatic Programmer describes defensive programming as “Pragmatic Paranoia”. Protect your code from other people’s mistakes, and your own mistakes. If in doubt, validate. Check for data consistency and integrity. You can’t test for every error, so use assertions and exception handlers for things that “can’t happen”. Learn from failures in test and production – if this failed, look for what else can fail. Focus on critical sections of code – the core, the code that runs the business.
Healthy Paranoid Programming is the right kind of programming. But paranoia can be taken too far. In the Error Handling chapter of Clean Code, Michael Feathers cautions that “many code bases are dominated by error handling” --Michael Feathers, Clean Code Too much error handling code not only obscures the main path of the code (what the code is actually trying to do), but it also obscures the error handling logic itself – so that it is harder to get it right, harder to review and test, and harder to change without making mistakes. Instead of making the code more resilient and safer, it can actually make the code more error-prone and brittle. There’s healthy paranoia, then there’s over-the-top-error-checking, and then there’s bat shit crazy crippling paranoia – where defensive programming takes over and turns in on itself. The first real world system I worked on was a “Store and Forward” network control system for servers (they were called minicomputers back then) across the US and Canada. It shared data between distributed systems, scheduled jobs, and coordinated reporting across the network. It was designed to be resilient to network problems and automatically recover and restart from operational failures. This was ground breaking stuff at the time, and a hell of a technical challenge. The original programmer on this system didn’t trust the network, didn’t trust the O/S, didn’t trust Operations, didn’t trust other people’s code, and didn’t trust his own code – for good reason. He was a chemical engineer turned self-taught system programmer who drank a lot while coding late at night and wrote thousands of lines of unstructured FORTRAN and Assembler under the influence. 
The code was full of error checking and self diagnostics and error-correcting code, the files and data packets had their own checksums and file-level passwords and hidden control labels, and there was lots of code to handle sequence accounting exceptions and timing-related problems – code that mostly worked most of the time. If something went wrong that it couldn’t recover from, programs would crash and report a “label of exit” and dump the contents of variables – like today’s stack traces. You could theoretically use this information to walk back through the code to figure out what the hell happened. None of this looked anything like anything that I learned about in school. Reading and working with this code was like programming your way out of Arkham Asylum. If the programmer ran into bugs and couldn’t fix them, that wouldn’t stop him. He would find a way to work around the bugs and make the system keep running. Then later after he left the company, I would find and fix a bug and congratulate myself until it broke some “error-correcting” code somewhere else in the network that now depended on the bug being there. So after I finally figured out what was going on, I took out as much of this “protection” as I could safely remove, and cleaned up the error handling so that I could actually maintain the system without losing what was left of my mind. I setup trust boundaries for the code – although I didn’t know that’s what it was called then – deciding what data couldn’t be trusted and what could. Once this was done I was able to simplify the defensive code so that I could make changes without the system falling over itself, and still protect the core code from bad data, mistakes in the rest of the code, and operational problems. Making code safer is simple The point of defensive coding is to make the code safer and to help whoever is going to maintain and support the code – not make their job harder. 
Defensive code is code – all code has bugs, and, because defensive code is dealing with exceptions, it is especially hard to test and to be sure that it will work when it has to. Understanding what conditions to check for and how much defensive coding is needed takes experience, working with code in production and seeing what can go wrong in the real world. A lot of the work involved in designing and building secure, resilient systems is technically difficult or expensive. Defensive programming is neither – like defensive driving, it’s something that everyone can understand and do. It requires discipline and awareness and attention to detail, but it’s something that we all need to do if we want to make the world safe.
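Two of the rules discussed in this article, validating input once at the barricade and never waiting forever on an external call, might look roughly like this in Java. This is only a sketch: the class name, the method names and the quantity limit of 10,000 are hypothetical and are not taken from Code Complete or Release It!.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of two defensive programming rules. */
final class DefensiveSketch {
    private static final int MAX_QUANTITY = 10_000; // assumed business limit

    /** Barricade: validate untrusted input once, fail fast, and hand a trusted value to the core code. */
    static int parseQuantity(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            throw new IllegalArgumentException("quantity is required");
        }
        int value;
        try {
            value = Integer.parseInt(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("quantity must be a whole number: " + raw, e);
        }
        if (value < 1 || value > MAX_QUANTITY) {
            throw new IllegalArgumentException("quantity out of range: " + value);
        }
        return value;
    }

    /** Never wait forever: run an external call with an upper bound on the waiting time. */
    static <T> T callWithTimeout(Callable<T> externalCall, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> result = executor.submit(externalCall);
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            executor.shutdownNow(); // interrupt the call if it is still running
        }
    }
}
```

The validation method throws instead of substituting a neutral value, which is the fast-fail strategy; a system that favours robustness over correctness could choose differently, as long as the choice is applied consistently. The time-out wrapper leaves retry logic and a circuit breaker to the caller.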
March 19, 2012
by Jim Bird
· 24,351 Views
article thumbnail
Deploying an Artifact to the Local Cache in Gradle
One question that came up a couple of times this week is how to set gradle up to deploy jars locally. For the most part I was satisfied with just having people push snapshot releases to our Artifactory server, but some people did express a real desire to be able to publish a jar to the local resolution cache to test changes out locally. I’m still a fan of deploying snapshots from feature branches, but luckily you can do a local publish and resolve with gradle.

First off, ask yourself if the dependency is coupled enough to warrant being a submodule. Also, could just linking the project in your IDE be enough to get what you want done? If the answer to both questions is no, then your next recourse is to use gradle’s excellent maven compatibility (don’t run!).

For the project you want to publish locally you simply need to apply the maven plugin and make sure you have version and group set for the project (usually I put group and version in gradle.properties).

    apply plugin: 'java'
    apply plugin: 'maven'

    version = '0.5.1-SNAPSHOT'
    group = 'org.jamescarr.examples'

That’s all you need to install it locally; just run gradle install from the project root to install it to the local m2 cache. Now let’s update the project that will depend on it.

    ...
    repositories {
        mavenCentral()
        mavenLocal()
    }

    dependencies {
        compile('org.jamescarr.examples:example-api:0.5.1-SNAPSHOT') {
            changing = true
        }
    }

The magic sauce here is using mavenLocal() as one of your resolution repositories. This will resolve against the local m2 cache. mavenCentral() can be replaced by whatever repositories you might use; it is only included since it’s the most often used.

That’s it! I know some people dislike this approach due to ingrained disdain for maven, but the beauty of it is that maven is silently at work and you really don’t get bothered by it.
March 16, 2012
by James Carr
· 33,826 Views
article thumbnail
Circos: An Amazing Tool for Visualizing Big Data
Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in such a way that you can easily gain deeper insights is where things really start to get interesting. Lately, I've been exploring various options for visualizing (directed) graphs, including Circos. Circos is an amazing software package that visualizes your data through a circular layout. Although it was originally designed for displaying genomic data, it allows you to create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go. The figure below illustrates the core concept behind Circos. The table's columns and rows are represented by segments around the circle. Individual cells are shown as ribbons, which connect the corresponding row and column segments. The ribbons themselves are proportional in width to the value in the cell.

When visualizing a directed graph, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the key data points within your table. In my case, I want to better understand the flow of visitors to and within the Datablend site and blog: where do visitors come from (direct, referral, search, ...) and how do they navigate between pages. The rest of this article details how to 1) retrieve the raw visit information through the Google Analytics API, 2) persist this information as a graph in Neo4j and 3) query and preprocess this data for visualization through Circos. As always, the complete source code can be found on the Datablend public GitHub repository.

1. Retrieving your Google Analytics data

Let's start by retrieving the raw Google Analytics data.
The Google Analytics Data API provides access to all dimensions and metrics that can be queried through the web application. In my case, I'm interested in retrieving the previous page path property for each page view. If a visitor enters through a page outside of the Datablend website, the previous page path is marked as (entrance). Otherwise, it contains the internal path. We will use Google's Java Data API to connect and retrieve this information. We are particularly interested in the pagePath, pageTitle, previousPagePath and medium dimensions, while our metric of choice is the number of pageviews. After setting the date range, the feed of entries that satisfy these criteria can be retrieved. For ease of use, we transform this data to a domain entity and filter/clean the data accordingly. If a visit originates from outside the Datablend website, we store the specific medium (direct, referral, search, ...) as previous path.

    // authenticate
    analyticsService = new AnalyticsService(Configuration.SERVICE);
    analyticsService.setUserCredentials(Configuration.CLIENT_USERNAME, Configuration.CLIENT_PASS);

    // create query
    DataQuery query = new DataQuery(new URL(Configuration.DATA_URL));
    query.setIds(Configuration.TABLE_ID);
    query.setDimensions("ga:medium,ga:previousPagePath,ga:pagePath,ga:pageTitle");
    query.setMetrics("ga:pageviews");
    query.setStartDate(dateString);
    query.setEndDate(dateString);

    // execute
    DataFeed feed = analyticsService.getFeed(createQueryUrl(date), DataFeed.class);

    // iterate and clean
    for (DataEntry entry : feed.getEntries()) {
        String pagePath = entry.stringValueOf("ga:pagePath");
        String pageTitle = entry.stringValueOf("ga:pageTitle");
        String previousPagePath = entry.stringValueOf("ga:previousPagePath");
        String medium = entry.stringValueOf("ga:medium");
        long views = entry.longValueOf("ga:pageviews");

        // filter the data
        if (filter(pagePath) && filter(previousPagePath) && (!clean(previousPagePath).equals(clean(pagePath)))) {
            // check criteria are satisfied
            Navigation navigation = new Navigation(clean(previousPagePath), clean(pagePath), pageTitle, date, views);
            if (navigation.getSource().equals("(entrance)")) {
                // in case of an entrance, save its medium instead
                navigation.setSource(medium);
            }
            navigations.add(navigation);
        }
    }

2. Storing navigational data as a directed graph in Neo4j

The set of site navigations can easily be stored as a directed graph in the Neo4j graph database. Nodes are site paths (or mediums), while relationships are the navigations themselves. We start by retrieving the navigations for a particular date range and retrieve (or lazily create) the nodes representing the source and target paths (or mediums). Next we de-normalize the pageviews metric (for instance, 6 individual relationships will be created for 6 page views). Although this de-normalization step is not really required, I did so to make sure that the degree of my nodes is correct if I would perform other types of calculations. For each individual navigation relationship, we also store the date of visit.

    // retrieve navigations for a particular date
    List<Navigation> navigations = retrieval.getNavigations(date);

    // save them in the graph database
    Transaction tx = graphDb.beginTx();

    // iterate and create
    for (Navigation nav : navigations) {
        Node source = getPath(nav.getSource());
        Node target = getPath(nav.getTarget());
        if (!target.hasProperty("title")) {
            target.setProperty("title", nav.getTargetTitle());
        }
        for (long i = 0; i < nav.getAmount(); i++) {
            // duplicate relationships
            Relationship transition = source.createRelationshipTo(target, Relationships.NAVIGATION);
            transition.setProperty("date", date.getTime()); // save time as long
        }
    }

    // commit
    tx.success();
    tx.finish();

3. Creating the Circos tabular data format

The Circos tabular data format is quite easy to construct. It's basically a tab-delimited file with row and column headers. A cell is interpreted as a value that flows from the row entity to the column entity.
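To make the tabular format concrete, here is a minimal sketch that renders a small square matrix as a tab-delimited table with row and column headers. The class and method names are hypothetical and this is not the article's printCircosData method, whose source is not shown; the "data" label in the top-left corner is assumed from the sample output later in the article.

```java
import java.util.List;

/** Hypothetical sketch: renders a square matrix in a tab-delimited, Circos-style table layout. */
final class CircosTableSketch {
    /** Builds the table; cell [row][col] is the value flowing from the row entity to the column entity. */
    static String toTable(List<String> labels, int[][] matrix) {
        StringBuilder out = new StringBuilder("data"); // assumed corner label
        for (String label : labels) {
            out.append('\t').append(label);            // column headers
        }
        out.append('\n');
        for (int row = 0; row < matrix.length; row++) {
            out.append(labels.get(row));               // row header
            for (int col = 0; col < matrix[row].length; col++) {
                out.append('\t').append(matrix[row][col]);
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

For the labels l0 and l1 and the matrix {{0, 3}, {1, 0}}, the result has three lines: a header line "data, l0, l1", then "l0, 0, 3" and "l1, 1, 0", each separated by tabs. The 3 in the first data row means that three navigations flow from l0 to l1.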
We will use the Neo4j Cypher query language to retrieve the data of interest, namely all navigations that occurred within a certain time period. Doing so allows us to create historical visualizations of our navigations and observe how visit flow behaviors are changing over time.

    // access the graph database
    graphDb = new EmbeddedGraphDatabase("var/analytics");
    engine = new ExecutionEngine(graphDb);

    // execute the data range cypher query
    Map<String, Object> params = new HashMap<String, Object>();
    params.put("fromdate", from.getTime());
    params.put("todate", to.getTime());

    // execute the query
    ExecutionResult result = engine.execute("start sourcepath=node:index(\"path:*\") " +
                                            "match sourcepath-[r]->targetpath " +
                                            "where r.date >= {fromdate} and r.date <= {todate} " +
                                            "return sourcepath,targetpath", params);

Next, we create the tab-delimited file itself. We iterate through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we start building the two-dimensional array by normalizing (i.e. summing) the number of navigations between the source and target paths. At the end, we filter this occurrence matrix on the minimal number of required navigations. This ensures that we will only create segments for paths that are relevant in the total population. As a final step, we print the occurrence matrix as a tab-delimited file. For each path, we will use a shorthand, as the Circos renderer seems to have problems with long string identifiers.
    // Retrieve the results
    Iterator<Map<String, Object>> it = result.javaIterator();
    List<Navigation> navigations = new ArrayList<Navigation>();
    Map<String, String> titles = new HashMap<String, String>();
    Set<String> paths = new HashSet<String>();

    // Iterate the results
    while (it.hasNext()) {
        Map<String, Object> record = it.next();
        String source = (String) ((Node) record.get("sourcepath")).getProperty("path");
        String target = (String) ((Node) record.get("targetpath")).getProperty("path");
        String targetTitle = (String) ((Node) record.get("targetpath")).getProperty("title");
        // Reuse the Navigation object as a temporary holder
        navigations.add(new Navigation(source, target, targetTitle, new Date(), 1));
        paths.add(source);
        paths.add(target);
        if (!titles.containsKey(target)) {
            titles.put(target, targetTitle);
        }
    }

    // Retrieve the various paths
    List<String> pathIds = Arrays.asList(paths.toArray(new String[]{}));

    // Create the matrix that holds the info
    int[][] occurrences = new int[pathIds.size()][pathIds.size()];

    // Iterate through all the navigations and update accordingly
    for (Navigation navigation : navigations) {
        int sourceIndex = pathIds.indexOf(navigation.getSource());
        int targetIndex = pathIds.indexOf(navigation.getTarget());
        occurrences[sourceIndex][targetIndex] = occurrences[sourceIndex][targetIndex] + 1;
    }

    // Matrix built, filter on threshold
    for (int i = 0; i < occurrences.length; i++) {
        for (int j = 0; j < occurrences.length; j++) {
            if (occurrences[i][j] < threshold) {
                occurrences[i][j] = 0;
            }
        }
    }

    // Print
    printCircosData(pathIds, titles, occurrences);

The text below is a sample of the output generated by the printCircosData method. It first prints the legend (matching shorthands with actual paths). Next, it prints the tab-delimited Circos table.
    link0 - /?p=411/wp-admin - storing and querying rdf data in neo4j through sail - datablend
    link1 - /?p=1146 - visualizing rdf schema inferencing through neo4j, tinkerpop, sail and gephi - datablend
    link2 - /?p=164 - big data / concise articles - datablend
    link3 - referral - null
    link4 - /?p=1400 - the joy of algorithms and nosql revisited: the mongodb aggregation framework - datablend
    ...

    data  l0  l1  l2  l3  l4  ...
    l0    0   0   0   0   0
    l1    0   0   0   0   0
    l2    0   0   0   0   0
    l3    059400197
    l4    0   0   0   0   0

4. Use the Circos power

Although Circos can be installed on your local computer, we will use its online version to create the visualization of our data. Upload your tab-delimited file and wait a few seconds before enjoying the beautiful rendering of your site's navigation information.

At a glance we can already see that the l3-segment (i.e. the referrals) is significantly larger (almost 6000 navigations) than the other segments. The outer 3 rings visualize the total number of navigations leaving and entering this particular path. In the case of referrals, no navigations have this path as target (indicated by the empty middle ring). Its total segment count (inner ring) is built up entirely out of navigations that have a referral as source.

The l6-segment seems to be the path that attracts the most traffic (around 2500 navigations). This segment visualizes the navigation data related to my "The joy of algorithms and NoSQL: a MongoDB example" article. Most of its traffic is received through referrals, while a decent amount is also generated through direct (l17-segment) and search (l27-segment) traffic. The l15-segment (my blog's main page) is the only path that receives an almost equal amount of incoming and outgoing traffic.

With just a few tweaks to the Circos input data, we can easily focus on particular types of navigation data. In the figure below, I made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors.

5. Conclusions

In the era of big data, visualizations are becoming crucial, as they enable us to mine our large data sets for patterns of interest. Circos specializes in a very specific type of visualization, but does its job extremely well. I would be delighted to hear about other types of visualizations for directed graphs.
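The printcircosdata method called in section 3 is not shown in the article. Purely as an illustration, a helper that produces output in the shape of the sample above (a legend followed by a tab-delimited table with row and column headers) might look like the sketch below. The class name CircosTableSketch, the method name buildCircosData and the exact legend layout are assumptions inferred from the sample output, not the author's code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the article does not show its own printing code.
final class CircosTableSketch {

    /** Builds the legend and the tab-delimited table, using shorthands l0, l1, ... for long paths. */
    static String buildCircosData(List<String> pathIds, Map<String, String> titles, int[][] occurrences) {
        StringBuilder out = new StringBuilder();

        // Legend: one line per path, matching the shorthand index with the actual path and title.
        // A path without a known title is printed as "null", as in the sample output.
        for (int i = 0; i < pathIds.size(); i++) {
            String path = pathIds.get(i);
            out.append("link").append(i).append(" - ").append(path)
               .append(" - ").append(titles.get(path)).append('\n');
        }

        // Header row: corner cell followed by one column label per path
        out.append("data");
        for (int j = 0; j < pathIds.size(); j++) {
            out.append('\t').append('l').append(j);
        }
        out.append('\n');

        // One row per source path; each cell is the flow from the row entity to the column entity
        for (int i = 0; i < occurrences.length; i++) {
            out.append('l').append(i);
            for (int j = 0; j < occurrences[i].length; j++) {
                out.append('\t').append(occurrences[i][j]);
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up toy data: three referral navigations into a single page
        List<String> pathIds = Arrays.asList("/?p=1", "referral");
        Map<String, String> titles = new HashMap<String, String>();
        titles.put("/?p=1", "some page");
        int[][] occurrences = { { 0, 0 }, { 3, 0 } };
        System.out.print(buildCircosData(pathIds, titles, occurrences));
    }
}
```

Returning a String rather than printing directly keeps the helper easy to check before the file is uploaded to Circos.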
March 13, 2012
by Davy Suvee
· 36,357 Views · 2 Likes
All about JMS messages
JMS providers like ActiveMQ are based on the concept of passing one-directional messages between nodes and brokers asynchronously. A thorough knowledge of the types of messages that can be sent through JMS middleware can greatly simplify the work of mapping communication patterns to real code.

The basic Message interface

Some object members are shared by all messages:
  • Header fields, used to uniquely identify a message and to route it to the right brokers and consumers.
  • A dynamic map of properties, which can be read programmatically by JMS brokers in order to filter or route messages.
  • A body, which differs between the various implementations we'll see.

Header fields

The set of getJMS*() methods on the Message interface defines the available headers.

Two of them are oriented to message identification:
  • getJMSMessageID() contains a generated ID for identifying a message, unique at least for the current broker. All generated IDs start with the prefix 'ID:', but you can override it with the corresponding setter.
  • getJMSCorrelationID() (and getJMSCorrelationIDAsBytes()) can link a message with another, usually one that has been sent previously. For example, a reply can carry the ID of the original message when put in another queue.

Two deal with sender and recipient identification:
  • getJMSDestination() returns a Destination object (a Topic or a Queue, or their temporary version) describing where the message was directed.
  • getJMSReplyTo() is a Destination object where replies should be sent; it can of course be null.

Three tune the delivery mechanism:
  • getJMSDeliveryMode() can be DeliveryMode.NON_PERSISTENT or DeliveryMode.PERSISTENT; only persistent messages guarantee delivery if the brokers that transport them crash.
  • getJMSExpiration() returns a timestamp indicating the expiration time of the message; it is 0 for a message without a defined expiration.
  • getJMSPriority() returns an integer from 0 to 9 (higher means higher priority) defining the priority for delivery. It is only a best-effort value.

The remaining ones contain metadata:
  • getJMSRedelivered() returns a boolean indicating whether the message is being delivered again after a delivery that was not acknowledged.
  • getJMSTimestamp() returns a long indicating the time of sending.
  • getJMSType() defines a field for provider-specific or application-specific message types.

Of these headers, only JMSCorrelationID, JMSReplyTo and JMSType have to be set by the client when needed. The others are generated or managed by the send() and publish() methods if not specified (setters are available for each of these headers).

Properties

Generic properties with a String name can be added to messages and read with getBooleanProperty(), getStringProperty() and similar methods. The corresponding setters setBooleanProperty(), setStringProperty(), ... are used to add them. The reason for keeping some properties out of the content of the message (which a MapMessage can contain) is so that they can be read before the message reaches its destination, for example in a JMS broker. The use case for this access to message properties is routing and filtering: downstream brokers and consumers may define a filter such as "I am interested only in messages on this Topic that have property X = 'value' and Y = 'value2'".

Bodies

All the subinterfaces of javax.jms.Message defined by the API provide different types of message bodies (the actual classes are defined by the providers and are not part of the API). Instantiation is handled by the Session, which implements an Abstract Factory pattern. On the receiving side, a cast is necessary for any message type (at least in the Java JMS API), since only a generic Message is read.

BytesMessage is the most basic type: it contains a sequence of uninterpreted bytes. Hence, it can in theory contain anything, but the generation and interpretation of the content is the client's job.
    BytesMessage m = session.createBytesMessage();
    m.writeByte((byte) 65);
    m.writeBytes(new byte[] { 66, 68, 70 });

    // On the receiving side, in the consumer (cast shown only here)
    BytesMessage m = (BytesMessage) genericMessage;
    byte[] content = new byte[4];
    m.readBytes(content);

MapMessage defines a message containing an (unordered) set of key/value pairs, also called a map, dictionary or hash. The keys are String objects, while the values are primitives or Strings; since they are primitives, they shouldn't be null.

    MapMessage m = session.createMapMessage();
    m.setString("key", "value"); // or m.setObject("key", "value") to avoid specifying a type

    // On the receiving side
    m.getString("key"); // or m.getObject("key")

ObjectMessage wraps a generic Object for transmission. The Object should be Serializable.

    ObjectMessage m = session.createObjectMessage();
    m.setObject(new ValueObject("field1", 42));

    // On the receiving side
    ValueObject vo = (ValueObject) m.getObject();

StreamMessage wraps a stream of primitive values of indefinite length.

    StreamMessage m = session.createStreamMessage();
    m.writeBoolean(true);
    m.writeBoolean(false);
    m.writeBoolean(true);

    // On the receiving side
    System.out.println(m.readBoolean());
    System.out.println(m.readBoolean());
    System.out.println(m.readBoolean());
    // prints true, false, true

TextMessage wraps a String of any length.

    TextMessage m = session.createTextMessage("Contents"); // or use m.setText() afterwards

    // On the receiving side
    String text = m.getText();

Usually all messages are read-only after they have been received, so only getters can be called on them.
March 12, 2012
by Giorgio Sironi
· 61,865 Views · 1 Like
Best Practices for Variable and Method Naming
  • Use variable names that are short enough, yet long enough, for their scope. As a rule of thumb, use 1 character for loop counters, 1 word for condition/loop variables, 1-2 words for methods, 2-3 words for classes and 3-4 words for globals.
  • Use specific names for variables. Generic names such as "value", "equals" or "data" are not good names in any case.
  • Use meaningful names for variables. A variable name must describe its content exactly.
  • Don't start variables with o_, obj_, m_ etc. A variable does not need a tag stating that it is a variable.
  • Obey company naming standards and write variable names consistently throughout the application: e.g. txtUserName, lblUserName, cmbSchoolType, ... Otherwise readability suffers and find/replace tools become unusable.
  • Obey programming language standards and don't use lowercase/uppercase characters inconsistently: e.g. userName, UserName, USER_NAME, m_userName, username, ... For example, for Java:
      • use Camel Case (aka Upper Camel Case) for classes: VelocityResponseWriter
      • use Lower Case for packages: com.company.project.ui
      • use Mixed Case (aka Lower Camel Case) for variables: studentName
      • use Upper Case for constants: MAX_PARAMETER_COUNT = 100
      • use Camel Case for enum class names and Upper Case for enum values
      • don't use '_' anywhere except in constants and enum values (which are constants)
  • Don't reuse the same variable name in different contexts within the same class: e.g. in a method, a constructor and the class itself. This keeps the code simpler to understand and maintain.
  • Don't use the same variable for different purposes in a method, conditional etc. Create a new, differently named variable instead. This is also important for maintainability and readability.
  • Don't use non-ASCII characters in variable names. They may work on your platform but not on others.
  • Don't use variable names that are too long (e.g. 50 characters). Long names make code ugly and hard to read, and may not compile on some compilers because of character limits.
  • Decide on one natural language for naming and stick to it; mixing e.g. English and German names is inconsistent and unreadable.
  • Use meaningful names for methods. The name must specify the exact action of the method and in most cases must start with a verb (e.g. createPasswordHash).
  • Obey company naming standards and write method names consistently throughout the application: e.g. getTxtUserName(), getLblUserName(), isStudentApproved(), ... Otherwise readability suffers and find/replace tools become unusable.
  • Obey programming language standards and don't use lowercase/uppercase characters inconsistently: e.g. getUserName, GetUserName, getusername, ... For example, for Java:
      • use Mixed Case for method names: getStudentSchoolType
      • use Mixed Case for method parameters: setSchoolName(String schoolName)
  • Use meaningful names for method parameters, so that the method documents itself when there is no documentation.
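The Java conventions listed above can be seen together in one small class. This is only an illustrative sketch: the Student class, the SchoolType enum and the approveStudent and countApprovedStudents methods are invented for the example, while studentName, MAX_PARAMETER_COUNT, getStudentSchoolType and isStudentApproved are borrowed from the names used in the list.

```java
import java.util.Arrays;
import java.util.List;

// Enum: Camel Case for the type name, Upper Case for the values
enum SchoolType { PRIMARY_SCHOOL, HIGH_SCHOOL }

// Class: Upper Camel Case
final class Student {
    // Constant: Upper Case; besides enum values, the only place where '_' appears
    static final int MAX_PARAMETER_COUNT = 100;

    // Variables: Mixed Case (lower camel case), without o_, obj_ or m_ prefixes
    private String studentName;
    private SchoolType studentSchoolType;
    private boolean approved;

    // Methods: Mixed Case, starting with a verb; parameter names document themselves
    void setStudentName(String studentName) { this.studentName = studentName; }
    String getStudentName() { return studentName; }

    void setStudentSchoolType(SchoolType studentSchoolType) { this.studentSchoolType = studentSchoolType; }
    SchoolType getStudentSchoolType() { return studentSchoolType; }

    void approveStudent() { approved = true; }
    boolean isStudentApproved() { return approved; }

    // A one-character name is enough for a loop counter with a tiny scope
    static int countApprovedStudents(List<Student> students) {
        int approvedCount = 0;
        for (int i = 0; i < students.size(); i++) {
            if (students.get(i).isStudentApproved()) {
                approvedCount++;
            }
        }
        return approvedCount;
    }

    public static void main(String[] args) {
        Student firstStudent = new Student();
        firstStudent.setStudentName("Ada");
        firstStudent.setStudentSchoolType(SchoolType.HIGH_SCHOOL);
        firstStudent.approveStudent();

        Student secondStudent = new Student();
        secondStudent.setStudentName("Bob");

        System.out.println(countApprovedStudents(Arrays.asList(firstStudent, secondStudent)));
    }
}
```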
March 10, 2012
by Cagdas Basaraner
· 153,898 Views · 5 Likes
Resetting the Database Connection in Django
Django handles database connections transparently in almost all cases. It will start a new connection when your request starts up, and commit it at the end of the request lifetime. Other times you need to dive in further and do your own granular transaction management. But for the most part, it's fully automatic.

However, sometimes your use case may require that you close the current database connection and open a new one. While this is possible in Django, it's not well documented.

Why would you want to do this? In my case, I was writing an automation test framework. Some of the automation tests make database calls through the Django ORM to set up records, clean up after the test, etc. Each test is executed in the same process space, via a thread pool. We found that if one of the early tests threw an unrecoverable database error, such as an IntegrityError due to violating a unique constraint, the database connection would be aborted. Subsequent tests that tried to use the database would raise a DatabaseError:

    Traceback (most recent call last):
      File /home/user/project/app/test.py, line 73, in tearDown
        MyModel.objects.all()
      File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 444, in delete
        collector.collect(del_query)
      File /usr/local/lib/python2.6/dist-packages/django/db/models/deletion.py, line 146, in collect
        reverse_dependency=reverse_dependency)
      File /usr/local/lib/python2.6/dist-packages/django/db/models/deletion.py, line 91, in add
        if not objs:
      File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 113, in __nonzero__
        iter(self).next()
      File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 107, in _result_iter
        self._fill_cache()
      File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 772, in _fill_cache
        self._result_cache.append(self._iter.next())
      File /usr/local/lib/python2.6/dist-packages/django/db/models/query.py, line 273, in iterator
        for row in compiler.results_iter():
      File /usr/local/lib/python2.6/dist-packages/django/db/models/sql/compiler.py, line 680, in results_iter
        for rows in self.execute_sql(MULTI):
      File /usr/local/lib/python2.6/dist-packages/django/db/models/sql/compiler.py, line 735, in execute_sql
        cursor.execute(sql, params)
      File /usr/local/lib/python2.6/dist-packages/django/db/backends/postgresql_psycopg2/base.py, line 44, in execute
        return self.cursor.execute(query, args)
    DatabaseError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

It turns out that it's relatively easy to reset the database connection. We just called the following function at the start of every test. Django is smart enough to re-initialize the connection the next time it's used, assuming that it was disconnected properly.

    def reset_database_connection():
        from django import db
        db.close_connection()
March 9, 2012
by Chase Seibert
· 9,184 Views
The Dark Side of Big Data: Pseudo-Science & Fooled By Randomness
Over the last couple of months I have read volumes of Technical Analysis (“TA”) material and back tested probably hundreds of automated trading strategies against massive amounts of data: exchange intraday and tick data, as well as other sources. Some of these strategies have been massively profitable in back testing, others not so much. Some of the TA patterns I discarded before they even left the book, because they did not stand up to any sort of scientific scrutiny: they lacked a clear predictive thesis, were riddled with forward-looking bias (“Head and Shoulders patterns”), and in some cases were just plain bulls**t (“Elliott Wave Principle” comes to mind).

The outcomes of my testing have made me think about the implications of large scale data analysis in general: it is very easy to get fooled by randomness. In many cases my test results have been amazing, but I cannot come up with a plausible causal explanation as to why, and when I nudge the parameters ever so slightly, the outcomes can look entirely different.

Taking a step back from the data and looking at it from a larger perspective, I’m inclined to conclude that if data across multiple parameter variations looks like a random walk and lacks a plausible causal explanation, then it is a random walk. If I cannot say “X is caused by A and B”, I’m inclined to believe that the actual reason is “X is the result because A and B fit the historical data D, but may not do so in the future”.

And herein lies the crux of the matter: how many data scientists are inclined to take a step back, rather than just assume that there is a pattern there? How many are prepared to do so if their livelihood is largely based on finding patterns, rather than discarding them because they do not hold up to deeper scrutiny? I’d say very few.
My conclusion is that the age of Big Data will see a radical increase in pseudo-scientific “discoveries”, driven by an interest in announcing great new “patterns”. This pseudo-science will pervade academia, the public sector and the private sector alike. God knows I’ve already seen a fair number of academic research papers that simply do not hold up if you investigate their thesis more deeply.

I suspect we will arrive at a point, much like with any new technology, where people will tire of the claims made by “Big Data Scientists”, because at least half of what they say will have been proven to be hokey pseudo-science, produced in pursuit of ever more outlandish claims in a game of one-upping the competition. Some of this will be driven by malice and self-interest, but I suspect that in equal parts it will be driven by ignorance and by perverted incentives putting blinders on people in the business.
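The effect described above can be reproduced with a small experiment. The sketch below is not the author's test setup; it is a toy under stated assumptions: returns are drawn as independent Gaussian noise (a pure random walk), the "strategy" is a made-up momentum rule with a single lookback parameter, and the names RandomWalkBacktest, momentumPnl and bestLookback are invented for the example. It picks the lookback that performed best on the first half of the data and then applies that same lookback to the second half.

```java
import java.util.Random;

// Toy experiment: tuning a parameter on pure noise still yields a "best" strategy.
final class RandomWalkBacktest {

    /** Independent Gaussian returns, i.e. the increments of a random walk with no real pattern. */
    static double[] randomReturns(int length, long seed) {
        Random random = new Random(seed);
        double[] returns = new double[length];
        for (int i = 0; i < length; i++) {
            returns[i] = random.nextGaussian();
        }
        return returns;
    }

    /**
     * P&L of a toy momentum rule over returns[from, to): at each step, go long or short
     * according to the sign of the sum of the previous `lookback` returns.
     */
    static double momentumPnl(double[] returns, int from, int to, int lookback) {
        double pnl = 0.0;
        for (int t = Math.max(from, lookback); t < to; t++) {
            double trailing = 0.0;
            for (int k = t - lookback; k < t; k++) {
                trailing += returns[k];
            }
            pnl += Math.signum(trailing) * returns[t];
        }
        return pnl;
    }

    /** Returns the lookback in 1..maxLookback with the highest P&L on returns[from, to). */
    static int bestLookback(double[] returns, int from, int to, int maxLookback) {
        int best = 1;
        for (int lookback = 2; lookback <= maxLookback; lookback++) {
            if (momentumPnl(returns, from, to, lookback) > momentumPnl(returns, from, to, best)) {
                best = lookback;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] returns = randomReturns(4000, 42L);
        int half = returns.length / 2;

        // Fit the parameter on the first half only, then evaluate it on unseen data
        int best = bestLookback(returns, 0, half, 100);
        System.out.printf("best lookback in-sample: %d%n", best);
        System.out.printf("in-sample P&L:     %.1f%n", momentumPnl(returns, 0, half, best));
        System.out.printf("out-of-sample P&L: %.1f%n", momentumPnl(returns, half, returns.length, best));
    }
}
```

Because the best lookback is selected from a hundred candidates, its in-sample result is by construction the most flattering one available, even though the data contains nothing to predict. Comparing it with the out-of-sample figure, and with the results for neighbouring lookbacks, is one simple way to take the step back the article asks for.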
March 9, 2012
by Wille Faler
· 13,971 Views