The Latest Databases Topics

The Limited Usefulness of AsyncContext.start()
Some time ago I came across the question "What's the purpose of AsyncContext.start(...) in Servlet 3.0?". Quoting the Javadoc of the aforementioned method: "Causes the container to dispatch a thread, possibly from a managed thread pool, to run the specified Runnable."

As a reminder, AsyncContext is the standard way, defined in the Servlet 3.0 specification, to handle HTTP requests asynchronously. Basically, an HTTP request is no longer tied to an HTTP thread, allowing us to handle it later, possibly using fewer threads. It turns out that the specification provides an API to run asynchronous tasks in a different thread pool out of the box. First we will see how this feature is completely broken and useless in Tomcat and Jetty, and then we will discuss why its usefulness is questionable in general.

Our test servlet will simply sleep for a given amount of time. This is a scalability killer in normal circumstances: even though a sleeping servlet is not consuming CPU, the sleeping HTTP thread tied to that particular request consumes memory, and no other incoming request can use that thread. In our test setup I limited the number of HTTP worker threads to 10, which means that only 10 concurrent requests completely block the application (it is unresponsive from the outside) even though the application itself is almost completely idle. So clearly sleeping is an enemy of scalability.

    import java.util.concurrent.TimeUnit
    import javax.servlet.annotation.WebServlet
    import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

    // Logging is assumed to be a trait that provides `logger`
    @WebServlet(urlPatterns = Array("/*"))
    class SlowServlet extends HttpServlet with Logging {

      protected override def doGet(req: HttpServletRequest, resp: HttpServletResponse) {
        logger.info("Request received")
        // Sleep for the number of milliseconds given in the "sleep" parameter (default: 10)
        val sleepParam = Option(req.getParameter("sleep")) map {_.toLong}
        TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10)
        logger.info("Request done")
      }

    }

Benchmarking this code reveals that the average response times are close to the sleep parameter as long as the number of concurrent connections does not exceed the number of HTTP threads.
Unsurprisingly, the response times begin to grow the moment we exceed the HTTP thread count. The eleventh connection has to wait for another request to finish and release a worker thread. When the concurrency level exceeds 100, Tomcat begins to drop connections because too many clients are already queued.

So what about the fancy AsyncContext.start() method (not to be confused with ServletRequest.startAsync())? According to the Javadoc I can submit any Runnable and the container will use some managed thread pool to handle it. This will help partially, as I no longer block HTTP worker threads (but another thread somewhere in the servlet container is still used). Quickly switching to an asynchronous servlet:

    @WebServlet(urlPatterns = Array("/*"), asyncSupported = true)
    class SlowServlet extends HttpServlet with Logging {

      protected override def doGet(req: HttpServletRequest, resp: HttpServletResponse) {
        logger.info("Request received")
        val asyncContext = req.startAsync()
        asyncContext.setTimeout(TimeUnit.MINUTES.toMillis(10))
        // Hand the blocking work over to the container-managed pool
        asyncContext.start(new Runnable() {
          def run() {
            logger.info("Handling request")
            val sleepParam = Option(req.getParameter("sleep")) map {_.toLong}
            TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10)
            logger.info("Request done")
            asyncContext.complete()
          }
        })
      }

    }

We first enable asynchronous processing and then simply move sleep() into a Runnable and, hopefully, a different thread pool, releasing the HTTP thread pool. A quick stress test of response times vs. number of concurrent connections reveals slightly unexpected results.

Guess what, the response times are exactly the same as with no asynchronous support at all (!) After closer examination I discovered that when AsyncContext.start() is called, Tomcat submits the given task back to... the HTTP worker thread pool, the same one that is used for all HTTP requests! This basically means that we have released one HTTP thread just to occupy another one milliseconds later (maybe even the same one).
There is absolutely no benefit in calling AsyncContext.start() in Tomcat. I have no idea whether this is a bug or a feature. On one hand, this is clearly not what the API designers intended. The servlet container was supposed to manage a separate, independent thread pool so that the HTTP worker thread pool remains usable. I mean, the whole point of asynchronous processing is to escape the HTTP pool. Tomcat pretends to delegate our work to another thread, while it still uses the original worker thread pool.

So why do I consider this to be a feature? Because Jetty is "broken" in exactly the same way... No matter whether this works as designed or is only a poor API implementation, using AsyncContext.start() in Tomcat and Jetty is pointless and only complicates the code unnecessarily. It won't give you anything: under high load the application behaves exactly as if there were no asynchronous logic at all.

But what about using this API feature on correct implementations like IBM WAS? It is better, but the API as it stands still doesn't give us much in terms of scalability. To explain again: the whole point of asynchronous processing is the ability to decouple an HTTP request from an underlying thread, preferably by handling several connections using the same thread.

AsyncContext.start() will run the provided Runnable in a separate thread pool. Your application is still responsive and can handle ordinary requests, while the long-running requests that you decided to handle asynchronously are processed in a separate thread pool. It is better, but unfortunately the thread-pool and thread-per-connection idiom is still a bottleneck. For the JVM it doesn't matter what type of threads are started; they still occupy memory. So we are no longer blocking HTTP worker threads, but our application is no more scalable in terms of the number of concurrent long-running tasks it can support.
In this simple and unrealistic example with a sleeping servlet we can actually support thousands of concurrent (waiting) connections using Servlet 3.0 asynchronous support with only one extra thread, and without AsyncContext.start(). Do you know how? Hint: ScheduledExecutorService.

Postscriptum: Scala goodness

I almost forgot. Even though the examples were written in Scala, I haven't used any cool language features yet. Here is one: implicit conversions. Make this available in your scope:

    implicit def blockToRunnable[T](block: => T) = new Runnable {
      def run() {
        block
      }
    }

And suddenly you can use a code block instead of instantiating Runnable manually and explicitly:

    asyncContext start {
      logger.info("Handling request")
      val sleepParam = Option(req.getParameter("sleep")) map { _.toLong}
      TimeUnit.MILLISECONDS.sleep(sleepParam getOrElse 10)
      logger.info("Request done")
      asyncContext.complete()
    }

Sweet!
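One possible reading of the ScheduledExecutorService hint is sketched below in plain Java, outside any servlet container so that it can be run on its own. Everything here is an assumption made for illustration: the class `DelayedCompletion` and the method `handle` are hypothetical names, and each `CompletableFuture` merely stands in for an `AsyncContext` that is waiting for `complete()`. The idea is that nothing sleeps; a single scheduler thread fires a small completion task once the delay has elapsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the servlet example: each CompletableFuture plays
// the role of an AsyncContext that is waiting for complete() to be called.
class DelayedCompletion {

    // Parks `requests` simulated requests for `sleepMillis` each using ONE
    // scheduler thread, and returns how many of them were completed.
    static int handle(int requests, long sleepMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            List<CompletableFuture<String>> pending = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                CompletableFuture<String> response = new CompletableFuture<>();
                // No thread sleeps here: the task is queued and runs after the delay.
                // In a servlet, this is where asyncContext.complete() would go.
                scheduler.schedule(() -> { response.complete("done"); },
                        sleepMillis, TimeUnit.MILLISECONDS);
                pending.add(response);
            }
            int completed = 0;
            for (CompletableFuture<String> response : pending) {
                if ("done".equals(response.join())) {
                    completed++;
                }
            }
            return completed;
        } finally {
            scheduler.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(handle(1000, 100) + " requests completed by one scheduler thread");
    }
}
```

In the servlet itself the same shape would presumably be: call `req.startAsync()`, schedule a task that writes the response and calls `complete()`, and return from `doGet` immediately. The waiting requests then cost a queue entry each instead of a thread each.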
May 22, 2012
by Tomasz Nurkiewicz
· 17,524 Views · 1 Like
Lucene Setup on OracleDB in 5 Minutes
This tutorial is for people who want to run an Apache Lucene example with OracleDB in just five minutes.
May 19, 2012
by Mohammad Juma
· 31,304 Views · 4 Likes
EasyNetQ, a simple .NET API for RabbitMQ
After pondering the results of our message queue shootout, we decided to run with RabbitMQ. Rabbit ticks all of the boxes: it’s supported (by SpringSource and then VMware ultimately), it scales, and it has the features and performance we need. The RabbitMQ.Client provided by SpringSource is a thin wrapper that quite faithfully exposes the AMQP protocol, so it expects messages as byte arrays.

For the shootout tests spraying byte arrays around was fine, but in the real world we want our messages to be .NET types. I also wanted to provide developers with a very simple API that abstracts away the Exchange/Binding/Queue model of AMQP and instead provides a simple publish/subscribe and request/response model. My inspiration was the excellent work done by Dru Sellers and Chris Patterson with MassTransit (the new V2.0 beta is just out).

The code is on GitHub here: https://github.com/mikehadlow/EasyNetQ

The API centres around an IBus interface that looks like this:

    /// <summary>
    /// Provides a simple Publish/Subscribe and Request/Response API for a message bus.
    /// </summary>
    public interface IBus : IDisposable
    {
        /// <summary>
        /// Publishes a message.
        /// </summary>
        /// <typeparam name="T">The message type</typeparam>
        /// <param name="message">The message to publish</param>
        void Publish<T>(T message);

        /// <summary>
        /// Subscribes to a stream of messages that match a .NET type.
        /// </summary>
        /// <typeparam name="T">The type to subscribe to</typeparam>
        /// <param name="subscriptionId">
        /// A unique identifier for the subscription. Two subscriptions with the same subscriptionId
        /// and type will get messages delivered in turn. This is useful if you want multiple subscribers
        /// to load balance a subscription in a round-robin fashion.
        /// </param>
        /// <param name="onMessage">
        /// The action to run when a message arrives.
        /// </param>
        void Subscribe<T>(string subscriptionId, Action<T> onMessage);

        /// <summary>
        /// Makes an RPC style asynchronous request.
        /// </summary>
        /// <typeparam name="TRequest">The request type.</typeparam>
        /// <typeparam name="TResponse">The response type.</typeparam>
        /// <param name="request">The request message.</param>
        /// <param name="onResponse">The action to run when the response is received.</param>
        void Request<TRequest, TResponse>(TRequest request, Action<TResponse> onResponse);

        /// <summary>
        /// Responds to an RPC request.
        /// </summary>
        /// <typeparam name="TRequest">The request type.</typeparam>
        /// <typeparam name="TResponse">The response type.</typeparam>
        /// <param name="responder">
        /// A function to run when the request is received. It should return the response.
        /// </param>
        void Respond<TRequest, TResponse>(Func<TRequest, TResponse> responder);
    }

To create a bus, just use a RabbitHutch, sorry I couldn’t resist it :)

    var bus = RabbitHutch.CreateRabbitBus("localhost");

You can just pass in the name of the server to use the default Rabbit virtual host ‘/’, or you can specify a named virtual host like this:

    var bus = RabbitHutch.CreateRabbitBus("localhost/myVirtualHost");

The first messaging pattern I wanted to support was publish/subscribe. Once you’ve got a bus instance, you can publish a message like this:

    var message = new MyMessage {Text = "Hello!"};
    bus.Publish(message);

This publishes the message to an exchange named by the message type. You subscribe to a message like this:

    bus.Subscribe<MyMessage>("test", message => Console.WriteLine(message.Text));

This creates a queue named ‘test_’ and binds it to the message type’s exchange. When a message is received it is passed to the Action delegate. If there is more than one subscriber to the same message type named ‘test’, Rabbit will hand out the messages in a round-robin fashion, so you get simple load balancing out of the box. Subscribers to the same message type, but with different names, will each get a copy of the message, as you’d expect.

The second messaging pattern is an asynchronous RPC. You can call a remote service like this:

    var request = new TestRequestMessage {Text = "Hello from the client! "};
    bus.Request<TestRequestMessage, TestResponseMessage>(request, response =>
        Console.WriteLine("Got response: '{0}'", response.Text));

This first creates a new temporary queue for the TestResponseMessage. It then publishes the TestRequestMessage with a return address pointing to the temporary queue. When the TestResponseMessage is received, it is passed to the Action delegate. RabbitMQ happily creates temporary queues and provides a return address header, so this was very easy to implement.

To write an RPC server, simply use the Respond method like this:

    bus.Respond<TestRequestMessage, TestResponseMessage>(request =>
        new TestResponseMessage { Text = request.Text + " all done!" });

This creates a subscription for the TestRequestMessage. When a message is received, the Func delegate is passed the request and returns the response. The response message is then published to the temporary client queue. Once again, scaling RPC servers is simply a question of running up new instances. Rabbit will automatically distribute messages to them.

The features of AMQP (and Rabbit) make creating this kind of API a breeze. Check it out and let me know what you think.
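The delivery rule for subscriptions described above (the same subscription name shares messages in turn, different names each get their own copy) can be modelled in a few lines. The following is only an in-memory toy written in Java; it is not EasyNetQ or RabbitMQ code. `InMemoryBus` and its methods are hypothetical names, messages are plain strings, and the per-message-type exchange is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-memory model of the delivery rules; not EasyNetQ or RabbitMQ code.
class InMemoryBus {
    // One "queue" per subscriptionId, each holding its competing consumers.
    private final Map<String, List<Consumer<String>>> queues = new LinkedHashMap<>();
    // Index of the next consumer per queue, used for round-robin delivery.
    private final Map<String, Integer> cursor = new LinkedHashMap<>();

    void subscribe(String subscriptionId, Consumer<String> onMessage) {
        queues.computeIfAbsent(subscriptionId, id -> new ArrayList<>()).add(onMessage);
        cursor.putIfAbsent(subscriptionId, 0);
    }

    void publish(String message) {
        // Every queue gets a copy; within a queue only one consumer receives it.
        for (Map.Entry<String, List<Consumer<String>>> queue : queues.entrySet()) {
            List<Consumer<String>> consumers = queue.getValue();
            int next = cursor.get(queue.getKey());
            consumers.get(next).accept(message);
            cursor.put(queue.getKey(), (next + 1) % consumers.size());
        }
    }

    public static void main(String[] args) {
        InMemoryBus bus = new InMemoryBus();
        bus.subscribe("test", m -> System.out.println("test worker A: " + m));
        bus.subscribe("test", m -> System.out.println("test worker B: " + m));
        bus.subscribe("audit", m -> System.out.println("audit: " + m));
        bus.publish("first");
        bus.publish("second");
    }
}
```

Here the two "test" subscribers alternate between messages, while the "audit" subscriber sees every message. In the real system that bookkeeping is done by the broker's queues rather than by the client library.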
May 13, 2012
by Mike Hadlow
· 11,262 Views
Martin Fowler on ORM Hate
While I was at the QCon conference in London a couple of months ago, it seemed that every talk included some snarky remarks about Object/Relational mapping (ORM) tools. I guess I should read the conference emails sent to speakers more carefully; doubtless there was something in there telling us all to heap scorn upon ORMs at least once every 45 minutes. But as you can tell, I want to push back a bit against this ORM hate, because I think a lot of it is unwarranted.

The charges against them can be summarized in that they are complex, and provide only a leaky abstraction over a relational data store. Their complexity implies a grueling learning curve and often systems using an ORM perform badly, often due to naive interactions with the underlying database.

There is a lot of truth to these charges, but such charges miss a vital piece of context. The object/relational mapping problem is hard. Essentially what you are doing is synchronizing between two quite different representations of data, one in the relational database, and the other in-memory. Although this is usually referred to as object-relational mapping, there is really nothing to do with objects here. By rights it should be referred to as the in-memory/relational mapping problem, because it's true of mapping RDBMSs to any in-memory data structure. In-memory data structures offer much more flexibility than relational models, so to program effectively most people want to use the more varied in-memory structures and thus are faced with mapping that back to relations for the database.

The mapping is further complicated because you can make changes on either side that have to be mapped to the other. More complication arrives since you can have multiple people accessing and modifying the database simultaneously. The ORM has to handle this concurrency because you can't just rely on transactions: in most cases, you can't hold transactions open while you fiddle with the data in-memory.
I think that if you're going to dump on something in the way many people do about ORMs, you have to state the alternative. What do you do instead of an ORM? The cheap shots I usually hear ignore this, because this is where it gets messy. Basically it boils down to two strategies: solve the problem differently (and better), or avoid the problem. Both of these have significant flaws.

A better solution

Listening to some critics, you'd think that the best thing for a modern software developer to do is roll their own ORM. The implication is that tools like Hibernate and Active Record have just become bloatware, so you should come up with your own lightweight alternative. Now I've spent many an hour griping at bloatware, but ORMs really don't fit the bill, and I say this with bitter memory. For much of the 90's I saw project after project deal with the object/relational mapping problem by writing their own framework; it was always much tougher than people imagined. Usually you'd get enough early success to commit deeply to the framework and only after a while did you realize you were in a quagmire. This is where I sympathize greatly with Ted Neward's famous quote that object-relational mapping is the Vietnam of computer science [1].

The widely available open source ORMs (such as iBatis, Hibernate, and Active Record) did a great deal to remove this problem [2]. Certainly they are not trivial tools to use; as I said, the underlying problem is hard, but you don't have to deal with the full experience of writing that stuff (the horror, the horror). However much you may hate using an ORM, take my word for it: you're better off.

I've often felt that much of the frustration with ORMs is about inflated expectations. Many people treat the relational database "like a crazy aunt who's shut up in an attic and whom nobody wants to talk about" [3]. In this world-view they just want to deal with in-memory data structures and let the ORM deal with the database.
This way of thinking can work for small applications and loads, but it soon falls apart once the going gets tough. Essentially the ORM can handle about 80-90% of the mapping problems, but that last chunk always needs careful work by somebody who really understands how a relational database works.

This is where the criticism comes that ORM is a leaky abstraction. This is true, but isn't necessarily a reason to avoid them. Mapping to a relational database involves lots of repetitive, boilerplate code. A framework that allows me to avoid 80% of that is worthwhile even if it is only 80%. The problem is in me for pretending it's 100% when it isn't. David Heinemeier Hansson, of Active Record fame, has always argued that if you are writing an application backed by a relational database you should damn well know how a relational database works. Active Record is designed with that in mind: it takes care of boring stuff, but provides manholes so you can get down with the SQL when you have to. That's a far better approach to thinking about the role an ORM should play.

There's a consequence to this more limited expectation of what an ORM should do. I often hear people complain that they are forced to compromise their object model to make it more relational in order to please the ORM. Actually I think this is an inevitable consequence of using a relational database: you either have to make your in-memory model more relational, or you complicate your mapping code. I think it's perfectly reasonable to have a more relational domain model in order to simplify your object-relational mapping. That doesn't mean you should always follow the relational model exactly, but it does mean that you take into account the mapping complexity as part of your domain model design.

So am I saying that you should always use an existing ORM rather than doing something yourself? Well, I've learned to always avoid saying "always".
One exception that comes to mind is when you're only reading from the database. ORMs are complex because they have to handle a bi-directional mapping. A uni-directional problem is much easier to work with, particularly if your needs aren't too complex and you are comfortable with SQL. This is one of the arguments for CQRS.

So most of the time the mapping is a complicated problem, and you're better off using an admittedly complicated tool than starting a land war in Asia. But then there is the second alternative I mentioned earlier: can you avoid the problem?

Avoiding the problem

To avoid the mapping problem you have two alternatives. Either you use the relational model in memory, or you don't use it in the database.

To use a relational model in memory basically means programming in terms of relations, right the way through your application. In many ways this is what the 90's CRUD tools gave you. They work very well for applications where you're just pushing data to the screen and back, or for applications where your logic is well expressed in terms of SQL queries. Some problems are well suited for this approach, so if you can do this, you should. But its flaw is that often you can't.

When it comes to not using relational databases on the disk, there rises a whole bunch of new champions and old memories. In the 90's many of us (yes, including me) thought that object databases would solve the problem by eliminating relations on the disk. We all know how that worked out. But there is now the new crew of NoSQL databases. Will these allow us to finesse the ORM quagmire and allow us to shock-and-awe our data storage?

As you might have gathered, I think NoSQL is technology to be taken very seriously. If you have an application problem that maps well to a NoSQL data model, such as aggregates or graphs, then you can avoid the nastiness of mapping completely. Indeed this is often a reason I've heard teams go with a NoSQL solution.
This is, I think, a viable route to go, hence my interest in increasing our understanding of NoSQL systems. But even so it only works when the fit between the application model and the NoSQL data model is good. Not all problems are technically suitable for a NoSQL database. And of course there are many situations where you're stuck with a relational model anyway. Maybe it's a corporate standard that you can't jump over, maybe you can't persuade your colleagues to accept the risks of an immature technology. In this case you can't avoid the mapping problem.

So ORMs help us deal with a very real problem for most enterprise applications. It's true they are often misused, and sometimes the underlying problem can be avoided. They aren't pretty tools, but then the problem they tackle isn't exactly cuddly either. I think they deserve a little more respect and a lot more understanding.

1: I have to confess a deep sense of conflict with the Vietnam analogy. At one level it seems like a case of the pathetic overblowing of software development's problems to compare a tricky technology to war. Nasty the programming may be, but you're still in a relatively comfy chair, usually with air conditioning, and bug-hunting doesn't involve bullets coming at you. But on another level, the phrase certainly resonates with the feeling of being sucked into a quagmire.

2: There were also commercial ORMs, such as TopLink and Kodo. But the approachability of open source tools meant they became dominant.

3: I like this phrase so much I feel compelled to subject it to re-use.
May 9, 2012
by Martin Fowler
· 115,427 Views · 4 Likes
Lean tools: Options thinking
We have now finished exploring the Lean tools for amplifying learning, such as feedback, iterations and set-based development. We enter the realm of the third Lean principle: decide as late as possible. This principle is about postponing decisions for as long as the delay does not impact the product, in order to gain flexibility instead of becoming locked in by some initial design decisions.

Software is easy to rebuild from source code, but its architecture is not always as malleable as non-technical people would think. Moreover, some changes will always happen, like upgrades of libraries and operating systems, on top of changes in requirements or integration ports. The easiest decision to change is the one that has not been made yet.

Options Thinking

The first tool that helps in postponing decisions is Options Thinking: the introduction of mechanisms whose specific purpose is to enable delaying decisions. In the financial domain, an option is the right to buy a good at a certain price before a future date. It effectively defers the decision of buying shares or products to some time in the future, as options can expire without being exercised.

A simpler instance of Options Thinking cited by Mary Poppendieck is a hotel reservation: you invest a small sum of money (the reservation fee) to book a room; exercising the option means actually going to the hotel, a decision which is made only when the time comes. Trains and airlines often use the same pricing model for seats (even if we do not consider the rise of prices as a flight fills up). There are multiple types of tickets for each combination of flight and date: some basic and neither transferable nor refundable, some more costly that provide the option of changing the date or getting a partial or total refund.
Agile

Mary Poppendieck adds the insight that Agile software development is a process that creates many options by introducing a very flexible plan and only prescribing more detailed actions after several inspect-and-adapt loops. It's not bad to delay a commitment until you know more about a problem: forced early decisions are the mark of waterfall (actually of the mainstream version of waterfall).

But options do not come for free: for example, in order to simplify a technical decision, XP suggests creating throwaway code. These spikes are explorations of each potential solution, which in a certain sense are a waste of development time, as their final result is of low quality and usually thrown away. However, spikes produce knowledge about the solution that results in a better estimate for its full development, or in its abandonment. The decision about which technology or solution to adopt is delayed until the end of a spike, but this option pays for itself quickly as uncertainty is removed and decisions "get it right" with a higher probability.

Real world examples

Almost every application I have been involved with in the last two years has had the separation of a persistence layer as one of its goals: Active Record has been progressively abandoned in the PHP world in favor of Data Mappers like the Doctrine ORM and ODMs. As with all options that can be bought, this separation does not come for free: development is a little slower when Repositories are objects that have to be designed instead of just a bunch of static calls to the Entity class like User::find() (although there are benefits of the Data Mapper approach that go beyond keeping options open).

An isolated persistence layer, however, allows us to postpone fundamental decisions about which database to use: it's a rough time for many of them as licenses change (MySQL) or new NoSQL solutions come out and evolve.
Every month of development where you're not tied to a specific database is a month where the hype goes down and we move towards more mature solutions that we can choose with a greater knowledge of the requirements of our data. Do we need relational database consistency? Or a schema-less store? Moreover, the investment in persistence adapters separated from the core of the application lets us choose different databases for different bounded contexts of an application; for example, storing views in a relational database and the primary data as a set of aggregates in Couch or Mongo.

Conclusion

I will never advocate investing in an option just for the sake of the technical challenge, nor claim that options come for free; but once you recognize that postponing a decision is valuable for the project, there should really be no issue in going ahead and buying it.
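To make the persistence-layer option concrete, here is a minimal sketch in Java (the article's own context is PHP and Doctrine, so this is a translation of the idea, not of any real code). All names are hypothetical: `UserRepository`, `InMemoryUserRepository` and `Greeter` are invented for illustration. The core logic depends only on the interface, so the choice between a relational adapter and a document-store adapter can be made later.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical port: the only thing the core of the application knows about persistence.
interface UserRepository {
    void save(String id, String name);
    Optional<String> findName(String id);
}

// Stand-in adapter used while the database decision is still open. A relational
// or document-store adapter would implement the same interface later.
class InMemoryUserRepository implements UserRepository {
    private final Map<String, String> rows = new HashMap<>();

    public void save(String id, String name) {
        rows.put(id, name);
    }

    public Optional<String> findName(String id) {
        return Optional.ofNullable(rows.get(id));
    }
}

// Core logic: no static calls to an entity class, only the injected repository.
class Greeter {
    private final UserRepository users;

    Greeter(UserRepository users) {
        this.users = users;
    }

    String greet(String id) {
        return users.findName(id).map(name -> "Hello, " + name).orElse("Hello, stranger");
    }
}
```

The price of the option is visible even at this size: one extra interface and one extra object to wire up, compared with a single static finder call.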
May 2, 2012
by Giorgio Sironi
· 10,336 Views
Managing and Monitoring Drupal Sites on Windows Azure
A few weeks ago, I co-authored an article (with my colleague Rama Ramani) about how the Screen Actors Guild Awards website migrated its Drupal deployment from LAMP to Windows Azure: Azure Real World: Migrating a Drupal Site from LAMP to Windows Azure. Since then, Rama and another colleague, Jason Roth, have been working on writing up how the SAG Awards website was managed and monitored in Windows Azure. The article below is the fruit of their work…a very interesting/educational read.

Overview

Drupal is an open source content management system that runs on PHP. Windows Azure offers a flexible platform for hosting, managing, and scaling Drupal deployments. This paper focuses on an approach to hosting Drupal sites on Windows Azure, based on lessons learned from a BPD Customer Programs Design Win engagement with the Screen Actors Guild Awards Drupal website. This paper covers guidelines and best practices for managing an existing Drupal web site in Windows Azure. For more information on how to migrate Drupal applications to Windows Azure, see Azure Real World: Migrating a Drupal Site from LAMP to Windows Azure.

The target audience for this paper is Drupal administrators who have some exposure to Windows Azure. More detailed pointers to Windows Azure content are provided throughout the paper as links.

Drupal Application Architecture on Windows Azure

Before reviewing the management and monitoring guidelines, it is important to understand the architecture of a typical Drupal deployment on Windows Azure. First, the following diagram displays the basic architecture of Drupal running on Windows and IIS7.

In the Windows Server scenario, you could have one or more machines hosting the web site in a farm. Those machines would either persist the site content to the file system or point to other network shares. For Windows Azure, the basic architecture is the same, but there are some differences. In Windows Azure the site is hosted on a web role.
A web role instance is hosted on a Windows Server 2008 virtual machine within the Windows Azure datacenter. As with the web farm, you can have multiple instances running the site. But there is no persistence guarantee for the data on the file system. Because of this, much of the shared site content should be stored in Windows Azure Blob storage, which makes it highly available and durable. Usually, a large portion of the site serves static content, which lends itself well to caching. Caching can be applied in several places: browser-level caching, a CDN to cache content at the edge closer to the browser clients, caching in Azure to reduce the load on the backend, and so on. Finally, the database can be located in SQL Azure. The following diagram shows these differences.

For monitoring and management, we will look at Drupal on Windows Azure from three perspectives:

· Availability: Ensure the web site does not go down and that all tiers are set up correctly. Apply best practices to ensure that the site is deployed across data centers and perform backup operations regularly.
· Scalability: Correctly handle changes in user load. Understand the performance characteristics of the site.
· Manageability: Correctly handle updates. Make code and site changes with no downtime when possible.

Although some management tasks span one or more of these categories, it is still helpful to discuss Drupal management on Windows Azure within these focus areas.

Availability

One main goal is that the Drupal site remains running and accessible to all end-users. This involves monitoring both the site and the SQL Azure database that the site depends on. In this section, we will briefly look at monitoring and backup tasks. Other crossover areas that affect availability will be discussed in the next section on scalability.

Monitoring

With any application, monitoring plays an important role in managing availability.
Monitoring data can reveal whether users are successfully using the site or whether computing resources are meeting the demand. Other data reveals error counts and possibly points to issues in a specific tier of the deployment.

There are several monitoring tools that can be used:

· The Windows Azure Management Portal.
· Windows Azure diagnostic data.
· Custom monitoring scripts.
· System Center Operations Manager.
· Third-party tools such as Azure Diagnostics Manager and Azure Storage Explorer.

The Windows Azure Management Portal can be used to ensure that your deployments are successful and running. You can also use the portal to manage features such as Remote Desktop so that you can directly connect to machines that are running the Drupal site.

Windows Azure diagnostics allows you to collect performance counters and logs from the web role instances that are running the Drupal site. Although there are many options for configuring diagnostics in Azure, the best solution with Drupal is to use a diagnostics configuration file. The following configuration file demonstrates some basic performance counters that can monitor resources such as memory, processor utilization, and network bandwidth.

For more information about setting up diagnostic configuration files, see How to Use the Windows Azure Diagnostics Configuration File. This information is stored locally on each role instance and then transferred to Windows Azure storage on a defined schedule or on demand. See Getting Started with Storing and Viewing Diagnostic Data in Windows Azure Storage. Various monitoring tools, such as Azure Diagnostics Manager, help you to analyze diagnostic data more easily.

Monitoring the performance of the machines hosting the Drupal site is only part of the story. In order to plan properly for both availability and scalability, you should also monitor site traffic, including user load patterns and trends.
Standard and custom diagnostic data could contribute to this, but there are also third-party tools that monitor web traffic. For example, if you know that spikes occur in your application during certain days of the week, you could make changes to the application to handle the additional load and increase the availability of the Drupal solution. Backup Tasks To remain highly available, it is important to backup your data as a defense-in-depth strategy for disaster recovery. This is true even though SQL Azure and Windows Azure Storage both implement redundancy to prevent data loss. One obvious reason is that these services cannot prevent administrator error if data is accidentally deleted or incorrectly changed. SQL Azure does not currently have a formal backup technology, although there are many third-party tools and solutions that provide this capability. Usually the database size for a Drupal site is relatively small. In the case of SAG Awards, it was only ~100-150 MB. So performing an entire backup using any strategy was relatively fast. If your database is much larger, you might have to test various backup strategies to find the one that works best. Apart from third-party SQL Azure backup solutions, there are several strategies for obtaining a backup of your data: · Use the Drush tool and the portabledb-export command. · Periodically copy the database using the CREATE DATABASE Transact-SQL command. · Use Data-tier applications (DAC) to assist with backup and restore of the database. SQL Azure backup and data security techniques are described in more detail in the topic, Business Continuity in SQL Azure. Note that bandwidth costs accrue with any backup operation that transfers information outside of the Windows Azure datacenter. To reduce costs, you can copy the database to a database within the same datacenter. Or you can export the data-tier applications to blob storage in the same datacenter. Another potential backup task involves the files in Blob storage. 
If you keep a master copy of all media files uploaded to Blob storage, then you already have an on-premises backup of those files. However, if multiple administrators are loading files into Blob storage for use on the Drupal site, it is a good idea to enumerate the storage account and to download any new files to a central location. The following PHP script demonstrates how this can be done by backing up all files in Blob storage after a specified modification date.

    setProxy(true, 'YOUR_PROXY_IF_NEEDED', 80);
    $blobs = (array)$blobObj->listBlobs(AZURE_STORAGE_CONTAINER, '', '', 35000);
    backupBlobs($blobs, $blobObj);

    function backupBlobs($blobs, $blobObj) {
        foreach ($blobs as $blob) {
            if (strtotime($blob->lastmodified) >= DEFAULT_BACKUP_FROM_DATE
                && strtotime($blob->lastmodified) <= DEFAULT_BACKUP_TO_DATE) {
                $path = pathinfo($blob->name);
                if ($path['basename'] != '$$$.$$$') {
                    $dir = $path['dirname'];
                    $oldDir = getcwd();
                    if (handleDirectory($dir)) {
                        chdir($dir);
                        $blobObj->getBlob(
                            AZURE_STORAGE_CONTAINER,
                            $blob->name,
                            $path['basename']
                        );
                        chdir($oldDir);
                    }
                }
            }
        }
    }

    function handleDirectory($dir) {
        if (!checkDirExists($dir)) {
            return mkdir($dir, 0755, true);
        }
        return true;
    }

    function checkDirExists($dir) {
        if (file_exists($dir) && is_dir($dir)) {
            return true;
        }
        return false;
    }
    ?>

This script has a dependency on the Windows Azure SDK for PHP. Also note that there are several parameters that you must modify, such as the storage account, secret, and backup location. As with SQL Azure, bandwidth and transaction charges apply to a backup script like this.

Scalability

Drupal sites on Windows Azure can scale as load increases through the typical strategies of scale-up, scale-out, and caching. The following sections describe the specifics of how these strategies are implemented in Windows Azure. Typically you make scalability decisions based on monitoring and capacity planning. Monitoring can be done in staging during testing or in production with real-time load.
Capacity planning factors in projections for changes in user demand. Scale Up When you configure your web role prior to deployment, you have the option of specifying the Virtual Machine (VM) size, such as Small or ExtraLarge. Each size tier adds additional memory, processing power, and network bandwidth to each instance of your web role. For cost efficiency and smaller units of scale, you can test your application under expected load to find the smallest virtual machine size that meets your requirements. The workload usually in most popular Drupal websites can be separated out into a limited set of Drupal admins making content changes and a large user base who perform mostly read-only workload. End users can be allowed to make ‘writes’, such as uploading blogs or posting in forums, but those changes are not ‘content changes’. Drupal admins are setup to operate without caching so that the writes are made directly to SQL Azure or the corresponding backend database. This workload performs well with Large or ExtraLarge VM sizes. Also, note that the VM size is closely tied to all hardware resources, so if there are many content-rich pages that are streaming content, then the VM size requirements are higher. To make changes to the Virtual Machine size setting, you must change the vmsize attribute of the WebRole element in the service definition file, ServiceDefinition.csdef. A virtual machine size change requires existing applications to be redeployed. Scale Out In addition to the size of each web role instance, you can increase or decrease the number of instances that are running the Drupal site. This spreads the web requests across more servers, enabling the site to handle more users. To change the number of running instances of your web role, see How to Scale Applications by Increasing or Decreasing the Number of Role Instances. Note that some configuration changes can cause your existing web role instances to recycle. 
You can choose to handle this situation by applying the configuration change and continue running. This is done by handling the RoleEnvironment.Changing event. For more information see, How to Use the RoleEnvironment.Changing Event. A common question for any Windows Azure solution is whether there is some type of built-in automatic scaling. Windows Azure does not provide a service that provides auto-scaling. However, it is possible to create a custom solution that scales Azure services using the Service Management API. For an example of this approach, see An Auto-Scaling Module for PHP Applications in Windows Azure. Caching Caching is an important strategy for scaling Drupal applications on Windows Azure. One reason for this is that SQL Azure implements throttling mechanisms to regulate the load on any one database in the cloud. Code that uses SQL Azure should have robust error handling and retry logic to account for this. For more information, see Error Messages (SQL Azure Database). Because of the potential for load-related throttling as well as for general performance improvement, it is strongly recommended to use caching. Although Windows Azure provides a Caching service, this service does not currently have interoperability with PHP. Because of this, the best solution for caching in Drupal is to use a module that uses an open-source caching technology, such as Memcached. Outside of a specific Drupal module, you can also configure Memcached to work in PHP for Windows Azure. For more information, see Running Memcached on Windows Azure for PHP. Here is also an example of how to get Memcached working in Windows Azure using a plugin: Windows Azure Memcached plugin. In a future paper, we hope to cover this architecture in more detail. For now, here are several design and management considerations related to caching. Area Consideration Design and Implementation For a technology like Memcached, will the cache be collocated (spread across all web role instances)? 
Or will you attempt to set up a dedicated cache ring with worker roles that only run Memcached?
Configuration: What memory is required and how will items in the cache be invalidated?
Performance and Monitoring: What mechanisms will be used to detect the performance and overall health of the cache?

For ease of use and cost savings, collocation of the cache across the web role instances of the Drupal site works best. However, this assumes that there is available reserve memory on each instance to apply toward caching. It is possible to increase the virtual machine size setting to increase the amount of available memory on each machine. It is also possible to add web role instances to add to the overall memory of the cache while at the same time improving the ability of the web site to respond to load. It is possible to create a dedicated cache cluster in the cloud, but the steps for this are beyond the scope of this paper.

For Windows Azure Blob storage, there is also a caching feature built into the service called the Content Delivery Network (CDN). CDN provides high-bandwidth access to files in Blob storage by caching copies of the files in edge nodes around the world. Even within a single geographic region, you could see performance improvements as there are many more edge nodes than Windows Azure datacenters. For more information, see Delivering High-Bandwidth Content with the Windows Azure CDN.

Manageability

It is important to note that each hosted service has a Staging environment and a Production environment. This can be used to manage deployments, because you can load and test an application in staging before performing a VIP swap with production. From a manageability standpoint, Drupal has an advantage on Windows Azure in the way that site content is stored. Because the data necessary to serve pages is stored in the database and blob storage, there is no need to redeploy the application to change the content of the site.
Another best practice is to use a storage account for diagnostic data that is separate from the one used for the application itself. This can improve performance and also helps to separate the cost of diagnostic monitoring from the cost of the running application. As mentioned previously, there are several tools that can assist with managing Windows Azure applications. The following table summarizes a few of these choices.

Tool: Description
Windows Azure Management Portal: The web interface of the Windows Azure management portal shows deployments, instance counts and properties, and supports many common management and monitoring tasks.
Azure Diagnostics Manager: A Red Gate Software product that provides advanced monitoring and management of diagnostic data. This tool can be very useful for analyzing the performance of the Drupal site to determine appropriate scaling decisions.
Azure Storage Explorer: A tool created by Neudesic for viewing Windows Azure storage accounts. This can be useful for viewing both diagnostic data and the files in Blob storage.
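The caching section above notes that code using SQL Azure should have robust error handling and retry logic to cope with throttling. A minimal sketch of that idea in Python follows, under stated assumptions: `ThrottledError` and `with_retries` are hypothetical names invented for this illustration and are not part of any Azure SDK, and a real application would catch its own database driver's throttling errors instead.

```python
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a driver's 'server is throttling' error."""


def with_retries(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ThrottledError wait and retry with exponential backoff.

    The wait doubles after every failed attempt (base_delay, 2x, 4x, ...).
    The final failure is re-raised so the caller can decide what to do.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_query():
        # Simulated query that is throttled twice before it succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ThrottledError("throttled")
        return "rows"

    print(with_retries(flaky_query, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting; the backoff values themselves are arbitrary and would need tuning against the actual throttling behavior.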
April 25, 2012
by Brian Swan
· 8,741 Views
Algorithm of the Week: How to Determine the Day of the Week
Do you know what day of the week you were born on? Monday, or maybe Saturday? Perhaps you do. Most people know the day they were born on, but do you know what day the 31st of January, 1883 was? No? Well, there must be some method to determine the day for any date in any century.

We know that 2012 started on a Sunday. Knowing that, it’s easy to determine what day the 2nd of January is: it must be a Monday. But things get more complex if we try to work out a date far from the 1st of January. The 1st of January was indeed a Sunday, but what day is the 9th of May of the same year? That is far more difficult to say. Of course we can take a brute-force approach and count from 1 January to 9 May, but that is slow and error-prone.

So what would we do if we had to write a program that answers this question? The easiest way is to use a library. Almost every major library has built-in functions that can tell what day of the week a given date falls on, such as date() in PHP or getDay() in JavaScript. But the question remains: how do these library functions know the answer, and how can we write such functions ourselves if our library doesn’t provide them? There must be some algorithm to help us.

Overview

Because months have different numbers of days, and most of those numbers aren’t divisible by 7 without a remainder, months begin on different days of the week. Thus, if January begins on a Sunday, February of the same year will begin on a Wednesday. In common years February has 28 days, which is divisible by 7, so February and March both begin on the same day. That is convenient, but it isn’t true for leap years.

What Do We Know About the Calendar

The first thing to know is that each week has exactly 7 days. We also know that a common year has 365 days, while a leap year has one day more: 366. Most months have 30 or 31 days, but February has only 28 days in common years and 29 in leap years.
Because 365 mod 7 = 1 in a common year each year begins exactly on the next day of the preceding year. Thus if 2011 started on Saturday, 2012 starts on Sunday. And yet again, that is because 2011 is not a leap year. What else do we know? Because a week has exactly seven days only February (with its 28 days in a common year) is divisible by 7 (28 mod 7 = 0) and has exactly four weeks in it. Thus in a common year February and March start on a same day. Unfortunately that is not true about the other months. All these things we know about the calendar are great, so we can make some conclusions. Although eleven of the months have either 30 or 31 days they don’t start on a same day, but some of the months do appear to start on a same day just because the number of days between them is divisible by 7 without a remainder. Let’s take a look on some examples. For instance September has 30 days, as does November, while October, which is in between them has 31 days. Thus 30+30+31 makes 91. Fortunately 91 mod 7 = 0. So for each year September and December start on the same day (as they are after February they don’t depend on leap years). The same thing occurs to April and July and the good news is that in leap years even January starts on the same day as April and July. Now we know that there are some relations between months. Thus, if we know somehow that the 13th of April is Monday, we’ll be sure that 13th of July is also Monday. Let’s see now a summary of these observations. We can also refer to the following diagram. For leap years there are other corresponding months. Let’s take a look at the following image. Another way to get the same information is the following table. We also know that leap years happen to occur once every four years. However, if there is a common year like the year 2001, which will be the next year that is common and starts and corresponds exactly on 2001? 
Because of leap years, a year can start on any of the seven days of the week and be either a leap year or a common year. This means there are just 14 combinations. Following these observations we can refer to the following table. You can clearly see the pattern “6 4 2 0”. Here’s the month table. Columns 2 and 3 (common years and leap years) differ only for January and February. Clearly the day table is as follows:

Now let’s go back to the algorithm. Using these tables and applying a simple formula, we can calculate what day of the week a given date fell on. Here are the steps of the algorithm.

1. Get the number for the corresponding century from the centuries table;
2. Get the last two digits of the year;
3. Divide the number from step 2 by 4 and discard the remainder;
4. Get the month number from the month table (for January and February of a leap year, use the leap-year column);
5. Sum the numbers from steps 1 to 4 and add the day of the month;
6. Divide the sum by 7 and take the remainder;
7. Find the result of step 6 in the days table.

Implementation

First let’s take a look at a simple, practical example of the algorithm above, and then at the code. Let’s answer the question from the first paragraph of this post: what day was January 31st, 1883?

1. Take a look at the centuries table: for 1800 – 1899 this is 2.
2. Get the last two digits of the year: 83.
3. Divide 83 by 4 and discard the remainder: 83/4 = 20.
4. Get the month number from the month table: Jan = 0 (1883 is not a leap year).
5. Sum the numbers from steps 1 to 4 and add the day of the month: 2 + 83 + 20 + 0 + 31 = 136.
6. Divide it by 7 and take the remainder: 136 mod 7 = 3.
7. Find the result of step 6 in the days table: Wednesday = 3.

The following code in PHP implements the algorithm above.
    function get_century_code($year)
    {
        // XVIII
        if (1700 <= $year && $year <= 1799) return 4;
        // XIX
        if (1800 <= $year && $year <= 1899) return 2;
        // XX
        if (1900 <= $year && $year <= 1999) return 0;
        // XXI
        if (2000 <= $year && $year <= 2099) return 6;
        // XXII
        if (2100 <= $year && $year <= 2199) return 4;
        // XXIII
        if (2200 <= $year && $year <= 2299) return 2;
        // XXIV
        if (2300 <= $year && $year <= 2399) return 0;
        // XXV
        if (2400 <= $year && $year <= 2499) return 6;
        // XXVI
        if (2500 <= $year && $year <= 2599) return 4;
        // XXVII
        if (2600 <= $year && $year <= 2699) return 2;
    }

    /**
     * Get the day of the week for a given date
     *
     * @param $date date string in the form day-month-year, e.g. '31-1-1883'
     */
    function get_day_from_date($date)
    {
        $months = array(
            1 => 0,  // January
            2 => 3,  // February
            3 => 3,  // March
            4 => 6,  // April
            5 => 1,  // May
            6 => 4,  // June
            7 => 6,  // July
            8 => 2,  // August
            9 => 5,  // September
            10 => 0, // October
            11 => 3, // November
            12 => 5, // December
        );

        $days = array(
            0 => 'Sunday',
            1 => 'Monday',
            2 => 'Tuesday',
            3 => 'Wednesday',
            4 => 'Thursday',
            5 => 'Friday',
            6 => 'Saturday',
        );

        // split the date into day, month and full year
        $dateParts = explode('-', $date);
        $day = (int) $dateParts[0];
        $month = (int) $dateParts[1];
        $fullYear = (int) $dateParts[2];
        $year = $fullYear % 100;

        // 1. Get the number for the corresponding century from the centuries table
        $a = get_century_code($fullYear);

        // 2. Get the last two digits from the year
        $b = $year;

        // 3. Divide the number from step 2 by 4 and get it without the remainder
        $c = (int) floor($year / 4);

        // 4. Get the month number from the month table;
        //    January and February of a leap year are one less
        $d = $months[$month];
        $isLeap = ($fullYear % 4 == 0 && $fullYear % 100 != 0) || $fullYear % 400 == 0;
        if ($isLeap && $month <= 2) {
            $d -= 1;
        }

        // 5. Sum the numbers from steps 1 to 4 and add the day of the month
        $e = $a + $b + $c + $d + $day;

        // 6. Divide it by 7 and take the remainder
        $f = $e % 7;

        // 7. Find the result of step 6 in the days table
        return $days[$f];
    }

    // Wednesday
    echo get_day_from_date('31-1-1883');

Application

This algorithm can be applied in many different cases, although most libraries have built-in functions that can do the same.
The only other drawback is that there are more efficient algorithms that don’t need additional space for tables of data. However, this algorithm isn’t difficult to implement, and it gives good insight into some facts about the calendar.
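The steps of the algorithm can also be sketched in Python. This is a minimal sketch rather than the author's code: it assumes the Gregorian calendar, reuses the century and month codes from the PHP listing, adds the day of the month to the sum, and subtracts one for January and February of leap years. The names `CENTURY_CODES`, `MONTH_CODES`, `is_leap`, and `day_of_week` are illustrative only.

```python
from datetime import date

DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]

# Month codes for a common year, January first.
MONTH_CODES = [0, 3, 3, 6, 1, 4, 6, 2, 5, 0, 3, 5]

# Century codes repeat every 400 years: index is (year // 100) % 4,
# so the 2000s map to 6, the 2100s to 4, the 2200s to 2, the 2300s to 0.
CENTURY_CODES = [6, 4, 2, 0]


def is_leap(year):
    """Gregorian leap-year rule."""
    return (year % 4 == 0 and year % 100 != 0) or year % 400 == 0


def day_of_week(day, month, year):
    """Return the weekday name using the table-based method."""
    century_code = CENTURY_CODES[(year // 100) % 4]
    yy = year % 100
    month_code = MONTH_CODES[month - 1]
    if is_leap(year) and month <= 2:
        # January and February of a leap year are one less.
        month_code -= 1
    total = century_code + yy + yy // 4 + month_code + day
    return DAYS[total % 7]


def matches_stdlib(day, month, year):
    """Cross-check one date against the standard library."""
    expected = DAYS[date(year, month, day).isoweekday() % 7]
    return day_of_week(day, month, year) == expected


if __name__ == "__main__":
    print(day_of_week(31, 1, 1883))     # Wednesday
    print(matches_stdlib(29, 2, 2012))  # True
```

Indexing the century codes by `(year // 100) % 4` replaces the long chain of range checks and relies on the same 400-year repetition that the "6 4 2 0" pattern reflects.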
April 24, 2012
by Stoimen Popov
· 61,705 Views · 1 Like
Amazon EMR Tutorial: Running a Hadoop MapReduce Job Using Custom JAR
See original post at https://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html Introduction Amazon EMR is a web service which can be used to easily and efficiently process enormous amounts of data. It uses a hosted Hadoop framework running on the web-scale infrastructure of Amazon EC2 and Amazon S3. Amazon EMR removes most of the cumbersome details of Hadoop while taking care of provisioning of Hadoop, running the job flow, terminating the job flow, moving the data between Amazon EC2 and Amazon S3, and optimizing Hadoop. In this tutorial, we will use a developed WordCount Java example using Hadoop and thereafter, we execute our program on Amazon Elastic MapReduce. Prerequisites You must have valid AWS account credentials. You should also have a general familiarity with using the Eclipse IDE before you begin. The reader can also use any other IDE of their choice. Step 1 – Develop MapReduce WordCount Java Program In this section, we are first going to develop a WordCount application. A WordCount program will determine how many times different words appear in a set of files. In Eclipse (or whatever the IDE you are using), Create simple Java Project with the name "WordCount". 
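Create a Java class named Map and override the map method as follows:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

Create a Java class named Reduce and override the reduce method as shown below:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum all the counts emitted for this word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

Create a Java class named WordCount and define the main method as below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");
            job.setJarByClass(WordCount.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            // args[0] is the input path, args[1] is the output path
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }

Export the WordCount program as a JAR using Eclipse and save it to some location on disk. Make sure that you provide the main class (WordCount) when exporting the JAR file. Our JAR is ready!

Step 2 – Upload the WordCount JAR and Input Files to Amazon S3

Now we are going to upload the WordCount JAR to Amazon S3.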
First, go to the following URL: https://console.aws.amazon.com/s3/home. Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane. Upload the WordCount JAR and a sample input file for counting the words.

Step 3 – Running an Elastic MapReduce job

Now that the JAR is uploaded to S3, all we need to do is create a new job flow. Let's execute the steps below. (I encourage readers to check out the following link for details regarding each step: How to Create a Job Flow Using a Custom JAR.)

1. Sign in to the AWS Management Console and open the Amazon Elastic MapReduce console at https://console.aws.amazon.com/elasticmapreduce/
2. Click Create New Job Flow.
3. In the DEFINE JOB FLOW page, enter the following details:
   a) Job Flow Name = WordCountJob
   b) Select Run your own application
   c) Select Custom JAR in the drop-down list
   d) Click Continue
4. In the SPECIFY PARAMETERS page, enter values in the boxes using the following as a guide, and then click Continue.
   JAR Location = bucketName/jarFileLocation
   JAR Arguments =
   s3n://bucketName/inputFileLocation
   s3n://bucketName/outputpath

Please note that the output path must be unique each time we execute the job. Hadoop always creates a folder with the name specified here.

After starting the job, just wait and monitor it as it runs through the Hadoop flow. You can also look for errors by using the Debug button. The job should complete within 10 to 15 minutes (this can also depend on the size of the input). After the job completes, you can view the results in the S3 Browser panel. You can also download the files from S3 and analyze the outcome of the job.
Amazon Elastic MapReduce Resources Amazon Elastic MapReduce Documentation,http://aws.amazon.com/documentation/elasticmapreduce/ Amazon Elastic MapReduce Getting Started Guide,http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/ Amazon Elastic MapReduce Developer Guide,http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/ Apache Hadoop,http://hadoop.apache.org/ See more at https://muhammadkhojaye.blogspot.com/2012/04/how-to-run-amazon-elastic-mapreduce-job.html
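To see what the Map and Reduce classes from Step 1 compute without starting a cluster, the same logic can be sketched locally in Python. This is only an illustration under stated assumptions: `map_line`, `reduce_counts`, and `word_count` are hypothetical helper names, and the grouping that Hadoop performs between the map and reduce phases is imitated here with an in-memory dictionary.

```python
from collections import defaultdict


def map_line(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(token, 1) for token in line.split()]


def reduce_counts(word, counts):
    """Sum all the counts that were emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the map step, group the pairs by word, then run the reduce step."""
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_line(line):
            grouped[word].append(one)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "the fox"]
    for word, total in sorted(word_count(sample).items()):
        print(word, total)
```

Running a small sample like this before uploading the JAR is a cheap way to confirm what output to expect in the S3 output folder.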
April 23, 2012
by Muhammad Ali Khojaye
· 59,010 Views
How-to: Python Data into Graphite for Monitoring Bliss
This post shows code examples in Python (2.7) for sending data to Graphite. Once you have a Graphite server set up, with Carbon running and collecting, you need to send it data for graphing. Basically, you write a program to collect numeric values and send them to Graphite's backend aggregator (Carbon).

To send data, you create a socket connection to the graphite/carbon server and send a message (string) in the format: "metric_path value timestamp\n"

- `metric_path`: arbitrary namespace containing substrings delimited by dots. The most general name is at the left and the most specific is at the right.
- `value`: numeric value to store.
- `timestamp`: epoch time.
- Messages must end with a trailing newline.
- Multiple messages may be batched and sent in a single socket operation. Each message is delimited by a newline, with a trailing newline at the end of the message batch.

Example message: "foo.bar.baz 42 74857843\n"

Let's look at some (Python 2.7) code for sending data to graphite...

Here is a simple client that sends a single message to graphite.
Code:

    #!/usr/bin/env python

    import socket
    import time

    CARBON_SERVER = '0.0.0.0'
    CARBON_PORT = 2003

    message = 'foo.bar.baz 42 %d\n' % int(time.time())

    print 'sending message:\n%s' % message
    sock = socket.socket()
    sock.connect((CARBON_SERVER, CARBON_PORT))
    sock.sendall(message)
    sock.close()

Here is a command line client that sends a single message to graphite:

Usage:

    $ python client-cli.py metric_path value

Code:

    #!/usr/bin/env python

    import argparse
    import socket
    import time

    CARBON_SERVER = '0.0.0.0'
    CARBON_PORT = 2003

    parser = argparse.ArgumentParser()
    parser.add_argument('metric_path')
    parser.add_argument('value')
    args = parser.parse_args()

    if __name__ == '__main__':
        timestamp = int(time.time())
        message = '%s %s %d\n' % (args.metric_path, args.value, timestamp)

        print 'sending message:\n%s' % message
        sock = socket.socket()
        sock.connect((CARBON_SERVER, CARBON_PORT))
        sock.sendall(message)
        sock.close()

Here is a client that collects load average (Linux-only) and sends a batch of 3 messages (1min/5min/15min loadavg) to graphite. It will run continuously in a loop until killed.
(adjust the delay for faster/slower collection interval):

    #!/usr/bin/env python

    import platform
    import socket
    import time

    CARBON_SERVER = '0.0.0.0'
    CARBON_PORT = 2003
    DELAY = 15  # secs

    def get_loadavgs():
        with open('/proc/loadavg') as f:
            return f.read().strip().split()[:3]

    def send_msg(message):
        print 'sending message:\n%s' % message
        sock = socket.socket()
        sock.connect((CARBON_SERVER, CARBON_PORT))
        sock.sendall(message)
        sock.close()

    if __name__ == '__main__':
        node = platform.node().replace('.', '-')
        while True:
            timestamp = int(time.time())
            loadavgs = get_loadavgs()
            lines = [
                'system.%s.loadavg_1min %s %d' % (node, loadavgs[0], timestamp),
                'system.%s.loadavg_5min %s %d' % (node, loadavgs[1], timestamp),
                'system.%s.loadavg_15min %s %d' % (node, loadavgs[2], timestamp)
            ]
            message = '\n'.join(lines) + '\n'
            send_msg(message)
            time.sleep(DELAY)

Resources:

- Graphite Docs
- Graphite Docs - Getting Your Data Into Graphite
- Installing Graphite 0.9.9 on Ubuntu 12.04 LTS
- Installing and configuring Graphite
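The clients in this post target Python 2.7. As a hedged sketch of the same plaintext message format in Python 3, where sockets send bytes rather than str, the message can be built and encoded as follows. `format_metric` and `send_metrics` are hypothetical helper names introduced for this illustration, and the default host and port are placeholders for your own Carbon server.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one 'metric_path value timestamp\\n' line."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_metrics(lines, host="127.0.0.1", port=2003):
    """Send a batch of already formatted lines in one socket operation."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Only formats and prints; call send_metrics() once a Carbon server is reachable.
    print(format_metric("foo.bar.baz", 42, 74857843), end="")
```

Keeping formatting separate from sending makes it possible to check the message text without a running server, and batching is just a matter of passing several formatted lines to `send_metrics`.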
April 20, 2012
by Corey Goldberg
· 25,261 Views
Back To The Future with Datomic
At the beginning of March, Rich Hickey and his team released Datomic. Datomic is a novel distributed database system designed to enable scalable, flexible and intelligent applications, running on next-generation cloud architectures. Its launch was surrounded with quite some buzz and skepticism, mainly related to its rather disruptive architectural proposal. Instead of trying to recapitulate the various pros and cons of its architectural approach, I will try to focus on the other innovation it introduces, namely its powerful data model (based upon the concept of Datoms) and its expressive query language (based upon the concept of Datalog). The remainder of this article will describe how to store facts and query them through Datalog expressions and rules. Additionally, I will show how Datomic introduces an explicit notion of time, which allows for the execution of queries against both the previous and future states of the database. As an example, I will use a very simple data model that is able to describe genealogical information. As always, the complete source code can be found on the Datablend public GitHub repository. 1. The Datomic data model Datomic stores facts (i.e. your data points) as datoms. A datom represents the addition (or retraction) of a relation between an entity, an attribute, a value, and a transaction. The datom concept is closely related to the concept of a RDF triple, where each triple is a statement about a particular resource in the form of a subject-predicate-object expression. Datomic adds the notion of time by explicitly tagging a datom with a transaction identifier (i.e. the exact time-point at which the fact was persisted into the Datomic database). This allows Datomic to promote data immutability: updates are not changing your existing facts; they are merely creating new datoms that are tagged with a more recent transaction. Hence, the system keeps track of all the facts, forever. 
Datomic does not enforce an explicit entity schema; it’s up to the user to decide what type of attributes he/she want to store for a particular entity. Attributes are part of the Datomic meta model, which specifies the characteristics (i.e. attributes) of the attributes themselves. Our genealogical example data model stores information about persons and their ancestors. For this, we will require two attributes: name and parent. An attribute is basically an entity, expressed in terms of the built-in system attributes such as cardinality, value type and attribute description. // Open a connection to the database String uri = "datomic:mem://test"; Peer.createDatabase(uri); Connection conn = Peer.connect(uri); // Declare attribute schema List tx = new ArrayList(); tx.add(Util.map(":db/id", Peer.tempid(":db.part/db"), ":db/ident", ":person/name", ":db/valueType", ":db.type/string", ":db/cardinality", ":db.cardinality/one", ":db/doc", "A person's name", ":db.install/_attribute", ":db.part/db")); tx.add(Util.map(":db/id", Peer.tempid(":db.part/db"), ":db/ident", ":person/parent", ":db/valueType", ":db.type/ref", ":db/cardinality", ":db.cardinality/many", ":db/doc", "A person's parent", ":db.install/_attribute", ":db.part/db")); // Store it conn.transact(tx).get(); All entities in a Datomic database need to have an internal key, called the entity id. In our case, we generate a temporary id through the tempid utility method. All entities are stored within a specific database partition that groups together logically related entities. Attribute definitions need to reside in the :db.part/db partition, a dedicated system partition employed exclusively for storing system entities and schema definitions. :person/name is a single-valued attribute of value type string. :person/parent is a multi-valued attribute of value type ref. The value of a reference attribute points to (the id) of another entity stored within the Datomic database. 
Once our attribute schema is persisted, we can start populating our database with concrete person entities. // Define person entities List tx = new ArrayList(); Object edmond = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", edmond, ":person/name", "Edmond Suvee")); Object gilbert = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", gilbert, ":person/name", "Gilbert Suvee", ":person/parent", edmond)); Object davy = Peer.tempid(":db.part/user"); tx.add(Util.map(":db/id", davy, ":person/name", "Davy Suvee", ":person/parent", gilbert)); // Store them conn.transact(tx).get(); We will create three concrete persons: myself, my dad Gilbert Suvee and my grandfather Edmond Suvee. Similarly to the definition of attributes, we again employ the tempid utility method to retrieve temporary ids for our newly created entities. This time however, we store our persons within the :db.part/user database partition, which is the default partition for storing application entities. Each person is given a name (via the :person/name attribute) and parent (via the :person/parent attribute). When calling the transact method, each entity is translated into a set of individual datoms that together describe the entity. Once persisted, Datomic ensures that temporary ids are replaced with their final counterparts. 2. The Datomic query language Datomic’s query model is an extended form of Datalog. Datalog is a deductive query system which will feel quite familiar to people who have experience with SPARQL and/or Prolog. The declarative query language makes use of a pattern matching mechanism to find all combinations of values (i.e. facts) that satisfy a particular set of conditions expressed as clauses. 
Let’s have a look at a few example queries:

// Find all persons
System.out.println(Peer.q("[:find ?name " +
                          ":where [?person :person/name ?name] ]",
                          conn.db()));

// Find the parents of all persons
System.out.println(Peer.q("[:find ?name ?parentname " +
                          ":where [?person :person/name ?name] " +
                                 "[?person :person/parent ?parent] " +
                                 "[?parent :person/name ?parentname] ]",
                          conn.db()));

// Find the grandparent of all persons
System.out.println(Peer.q("[:find ?name ?grandparentname " +
                          ":where [?person :person/name ?name] " +
                                 "[?person :person/parent ?parent] " +
                                 "[?parent :person/parent ?grandparent] " +
                                 "[?grandparent :person/name ?grandparentname] ]",
                          conn.db()));

We consider entities to be of type person if they own a :person/name attribute. The :where-part of the first query, which aims at finding all persons stored in the Datomic database, specifies the following “conditional” clause: [?person :person/name ?name]. ?person and ?name are variables which act as placeholders. The Datalog query engine retrieves all facts (i.e. datoms) that match this clause. The :find-part of the query specifies the “values” that should be returned as the result of the query.

Result query 1: [["Davy Suvee"], ["Edmond Suvee"], ["Gilbert Suvee"]]

The second and the third query aim at retrieving the parents and grandparents of all persons stored in the Datomic database. These queries specify multiple clauses that are solved through the use of unification: when a variable name is used more than once, it must represent the same value in every clause in order to satisfy the total set of clauses. As expected, only Davy Suvee has been identified as having a grandparent, as the necessary facts to satisfy this query are not available for either Gilbert Suvee or Edmond Suvee.
Result query 2: [["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"]]

Result query 3: [["Davy Suvee" "Edmond Suvee"]]

If several queries require this “grandparent” notion, one can define a reusable rule that encapsulates the required clauses. Rules can be flexibly combined with clauses (and other rules) in the :where-part of a query. Our third query can be rewritten using the following rules and clauses:

String grandparentrule = "[ [ (grandparent ?person ?grandparent) [?person :person/parent ?parent] " +
                                "[?parent :person/parent ?grandparent] ] ]";

System.out.println(Peer.q("[:find ?name ?grandparentname " +
                          ":in $ % " +
                          ":where [?person :person/name ?name] " +
                                 "(grandparent ?person ?grandparent) " +
                                 "[?grandparent :person/name ?grandparentname] ]",
                          conn.db(), grandparentrule));

Rules can also be used to write recursive queries. Imagine the ancestor relationship. It’s impossible to predict the number of parent levels one needs to go up in order to retrieve the ancestors of a person. As Datomic rules support the notion of recursion, a rule can call itself within its definition. Similar to recursion in other languages, recursive rules are built up out of a simple base case and a set of clauses which reduce all other cases toward this base case.

String ancestorrule = "[ [ (ancestor ?person ?ancestor) [?person :person/parent ?ancestor] ] " +
                        "[ (ancestor ?person ?ancestor) [?person :person/parent ?parent] " +
                                                       "(ancestor ?parent ?ancestor) ] ]";

System.out.println(Peer.q("[:find ?name ?ancestorname " +
                          ":in $ % " +
                          ":where [?person :person/name ?name] " +
                                 "[ancestor ?person ?ancestor] " +
                                 "[?ancestor :person/name ?ancestorname] ]",
                          conn.db(), ancestorrule));

Result query 4: [["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"]]

3. Back To The Future I

As already mentioned in section 1, Datomic does not perform in-place updates.
Instead, all facts are stored and tagged with a transaction such that the most up-to-date value of a particular entity attribute can be retrieved. By doing so, Datomic allows you to travel back in time and perform queries against previous states of the database. Using the asOf method, one can retrieve a version of the database that only contains facts that were part of the database at that particular moment in time. Using a checkpoint that predates the storage of my own person entity yields parent-query results that no longer contain results related to myself.

System.out.println(Peer.q("[:find ?name ?parentname " +
                          ":where [?person :person/name ?name] " +
                                 "[?person :person/parent ?parent] " +
                                 "[?parent :person/name ?parentname] ]",
                          conn.db().asOf(getCheckPoint(checkpoint))));

Result query 2: [["Gilbert Suvee" "Edmond Suvee"]]

4. Back To The Future II

Datomic also allows you to predict the future. Well, sort of … Similar to the asOf method, one can use the with method to retrieve a version of the database that gets extended with a list of not-yet-transacted datoms. This makes it possible to run queries against future states of the database and to observe the implications if these new facts were to be added.

List tx = new ArrayList();
tx.add(Util.map(":db/id", Peer.tempid(":db.part/user"),
                ":person/name", "FutureChild Suvee",
                ":person/parent", Peer.q("[:find ?person :where [?person :person/name \"Davy Suvee\"] ]", conn.db()).iterator().next().get(0)));

System.out.println(Peer.q("[:find ?name ?ancestorname " +
                          ":in $ % " +
                          ":where [?person :person/name ?name] " +
                                 "[ancestor ?person ?ancestor] " +
                                 "[?ancestor :person/name ?ancestorname] ]",
                          conn.db().with(tx), ancestorrule));

Result query 4: [["FutureChild Suvee" "Edmond Suvee"], ["FutureChild Suvee" "Gilbert Suvee"], ["Gilbert Suvee" "Edmond Suvee"], ["Davy Suvee" "Edmond Suvee"], ["Davy Suvee" "Gilbert Suvee"], ["FutureChild Suvee" "Davy Suvee"]]

5. Conclusion

The use of datoms and Datalog allows you to express simple, yet powerful queries. This article introduces only a fraction of the features offered by Datomic. To get myself better acquainted with the various Datomic gotchas, I implemented the Tinkerpop Blueprints API on top of Datomic. By doing so, you basically get a distributed, temporal graph database, which is, as far as I know, unique within the graph database ecosystem. The source code of this Blueprints implementation can currently be found on the Datablend public GitHub repository and will soon be merged into the Tinkerpop project.
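To make the recursive ancestor rule from section 2 a little more concrete, here is a minimal plain-Python sketch of the same idea. It is not Datomic code: the PARENTS dictionary and the ancestors and ancestor_pairs functions are illustrative names, the data simply mirrors the three persons stored earlier, and the sketch assumes the parent relation contains no cycles.

```python
# Plain-Python illustration of the recursive `ancestor` rule; not the Datomic API.
# PARENTS mirrors the :person/parent facts transacted in the article.
PARENTS = {
    "Davy Suvee": ["Gilbert Suvee"],
    "Gilbert Suvee": ["Edmond Suvee"],
    "Edmond Suvee": [],
}


def ancestors(person, parents=PARENTS):
    """Return the set of all ancestors of `person` (assumes no cycles)."""
    found = set()
    for parent in parents.get(person, []):
        found.add(parent)                    # base case: a parent is an ancestor
        found |= ancestors(parent, parents)  # recursive case: so are its ancestors
    return found


def ancestor_pairs(parents=PARENTS):
    """All (name, ancestor name) pairs, comparable to the result of query 4."""
    return sorted((p, a) for p in parents for a in ancestors(p, parents))


if __name__ == "__main__":
    for pair in ancestor_pairs():
        print(pair)
```

The first branch of the loop plays the role of the base-case rule, and the recursive call plays the role of the second rule, which walks one parent level up before applying itself again.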
April 14, 2012
by Davy Suvee
· 18,144 Views
How to Use Sigma.js with Neo4j
I’ve done a few posts recently using D3.js and now I want to show you how to use two other great JavaScript libraries to visualize your graphs. We’ll start with Sigma.js, and soon I’ll do another post with Three.js. We’re going to create our graph and group our nodes into five clusters. You’ll notice later on that we give our clustered nodes colors using RGB values, so we’ll be able to see them move around until they find their right place in our layout. We’ll be using two Sigma.js plugins, the GEXF (Graph Exchange XML Format) parser and the ForceAtlas2 layout. You can see what a GEXF file looks like below. Notice it comes from Gephi, an interactive visualization and exploration platform that runs on all major operating systems, is open source, and is free.

... ...

In order to build this file, we will need to get the nodes and edges from the graph and create an XML file.

get '/graph.xml' do
  @nodes = nodes
  @edges = edges
  builder :graph
end

We’ll use Cypher to get our nodes and edges:

def nodes
  neo = Neography::Rest.new
  cypher_query =  " start node = node:nodes_index(type='user')"
  cypher_query << " return id(node), node"
  neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0]}.merge(n[1]["data"])}
end

We need the node and relationship ids, so notice I’m using the id() function in both cases.

def edges
  neo = Neography::Rest.new
  cypher_query =  " start source = node:nodes_index(type='user')"
  cypher_query << " match source -[rel]-> target"
  cypher_query << " return id(rel), id(source), id(target)"
  neo.execute_query(cypher_query)["data"].collect{|n| {"id" => n[0], "source" => n[1], "target" => n[2]} }
end

So far we have seen graphs represented as JSON, and we’ve built these manually. Today we’ll take advantage of the Builder Ruby gem to build our graph in XML.

xml.instruct! :xml
xml.gexf 'xmlns' => "http://www.gephi.org/gexf", 'xmlns:viz' => "http://www.gephi.org/gexf/viz" do
  xml.graph 'defaultedgetype' => "directed", 'idtype' => "string", 'type' => "static" do
    xml.nodes :count => @nodes.size do
      @nodes.each do |n|
        xml.node :id => n["id"], :label => n["name"] do
          xml.tag!("viz:size", :value => n["size"])
          xml.tag!("viz:color", :b => n["b"], :g => n["g"], :r => n["r"])
          xml.tag!("viz:position", :x => n["x"], :y => n["y"])
        end
      end
    end
    xml.edges :count => @edges.size do
      @edges.each do |e|
        xml.edge :id => e["id"], :source => e["source"], :target => e["target"]
      end
    end
  end
end

You can get the code on GitHub as usual and see it running live on Heroku. You will want to see it live on Heroku so you can watch the nodes start in random positions and then move to form clusters. Use your mouse wheel to zoom in, and click and drag to move around. Credit goes out to Alexis Jacomy and Mathieu Jacomy. You’ve seen me create numerous random graphs, but for completeness here is the code for this graph. Notice how I create 5 clusters, and for each node I assign half its relationships to other nodes in its cluster and half to random nodes. This is so the ForceAtlas2 layout plugin clusters our nodes neatly.
def create_graph
  neo = Neography::Rest.new
  graph_exists = neo.get_node_properties(1)
  return if graph_exists && graph_exists['name']

  names = 500.times.collect{|x| generate_text}
  clusters = 5.times.collect{|x| {:r => rand(256), :g => rand(256), :b => rand(256)} }
  commands = []
  names.each_index do |n|
    cluster = clusters[n % clusters.size]
    commands << [:create_node, {:name => names[n],
                                :size => 5.0 + rand(20.0),
                                :r => cluster[:r],
                                :g => cluster[:g],
                                :b => cluster[:b],
                                :x => rand(600) - 300,
                                :y => rand(150) - 150 }]
  end

  names.each_index do |from|
    commands << [:add_node_to_index, "nodes_index", "type", "user", "{#{from}}"]
    connected = []

    # create clustered relationships
    members = 20.times.collect{|x| x * 10 + (from % clusters.size)}
    members.delete(from)
    rels = 3
    rels.times do |x|
      to = members[x]
      connected << to
      commands << [:create_relationship, "follows", "{#{from}}", "{#{to}}"] unless to == from
    end

    # create random relationships
    rels = 3
    rels.times do |x|
      to = rand(names.size)
      commands << [:create_relationship, "follows", "{#{from}}", "{#{to}}"] unless (to == from) || connected.include?(to)
    end
  end

  batch_result = neo.batch *commands
end
April 12, 2012
by Max De Marzi
· 15,371 Views
F1 Live Timing Map
This is a live timing map application for F1 championship races, made using JavaScript and Google Maps markers. The live timing data is supplied by formula1.com. It’s interactive: you can click on a driver to track him, or click on an empty map zone to untrack and return to the general view. It has also been built with a responsive design, using the jQuery Mobile framework, to adapt it to mobile browsers.

How it works:

The client side: until the race start date, a countdown and a demo race are shown. When the countdown finishes, the client connects to the server (using Ajax) to fetch the live timing data every five seconds, and the interface is updated using this data.

The server side: it uses a Django app for the web page, and the static race data (circuit, laps, drivers) is put into the HTML using the Django template system. For the dynamic data (live timing), I have modified the source of a C program for the Linux terminal called live-f1 so that it generates JSON with the data the client requires instead of printing it on the terminal screen.

Enjoy the race!
April 12, 2012
by Luis Sobrecueva
· 15,771 Views
Configuring Quartz With JDBCJobStore in Spring
I am starting a little series about Quartz scheduler internals, tips and tricks. This is chapter 0: how to configure a persistent job store.
April 7, 2012
by Tomasz Nurkiewicz
· 37,682 Views
Wrapping Begin/End Async API Into C#5 Tasks
Microsoft has offered programmers several different ways of dealing with asynchronous programming since .NET 1.0. The first model was the Asynchronous Programming Model, or APM for short. The pattern is implemented with two methods named BeginOperation and EndOperation. .NET 4 introduced a new pattern, the Task-based Asynchronous Pattern, and with the introduction of .NET 4.5 Microsoft added language support for a language-integrated asynchronous coding style. You can check MSDN for more samples and information. I will assume that you are familiar with it and have written code using it.

You can wrap an existing APM pattern into the TPL pattern using the Task.Factory.FromAsync methods. For example:

public static Task<IEnumerable<T>> ExecuteAsync<T>(this DataServiceQuery<T> query, object state)
{
    return Task.Factory.FromAsync<IEnumerable<T>>(query.BeginExecute, query.EndExecute, state);
}

It is easy to wrap most asynchronous functions this way, but some cannot be wrapped, since the wrapper functions assume that the last two parameters of BeginOperation are an AsyncCallback and an object, and some asynchronous operations have different signatures.
Examples:

Extra parameters after the object state parameter:

IAsyncResult DataServiceContext.BeginExecuteBatch(AsyncCallback callback, object state, params DataServiceRequest[] queries);

Missing the expected object state parameter and a different return type:

ICancelableAsyncResult BeginQuery(AsyncCallback callBack);
WorkItemCollection EndQuery(ICancelableAsyncResult car);

Short solution for the first example

The short and elegant way of wrapping the first example is to provide the following wrapper:

public static Task<DataServiceResponse> ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries)
{
    if (context == null) throw new ArgumentNullException("context");
    return Task.Factory.FromAsync<DataServiceResponse>(
        context.BeginExecuteBatch(null, state, queries),
        context.EndExecuteBatch);
}

We simply call the Begin method ourselves and then wrap the returned IAsyncResult using another overload of the FromAsync function.

The longer way

However, we can fully wrap it ourselves by simulating what the FromAsync wrapper does. The complete code is listed below.
public static Task<DataServiceResponse> ExecuteBatchAsync(this DataServiceContext context, object state, params DataServiceRequest[] queries)
{
    // this will be our sentry that will know when our async operation is completed
    var tcs = new TaskCompletionSource<DataServiceResponse>();
    try
    {
        context.BeginExecuteBatch((iar) =>
        {
            try
            {
                var result = context.EndExecuteBatch(iar);
                tcs.TrySetResult(result);
            }
            catch (OperationCanceledException)
            {
                // if the inner operation was canceled, this task is cancelled too
                tcs.TrySetCanceled();
            }
            catch (Exception ex)
            {
                // any other exception faults the task; the framework's own FromAsync
                // additionally marks a ThreadAbortException as handled through internal
                // Task fields (m_contingentProperties), which user code cannot access
                tcs.TrySetException(ex);
            }
        }, state, queries);
    }
    catch
    {
        tcs.TrySetResult(default(DataServiceResponse));
        // propagate exceptions to the outside
        throw;
    }
    return tcs.Task;
}

Besides the educational benefits, writing the full wrapper code allows us to add cancellation, logging and diagnostic information. Once we understand how to wrap the APM pattern, we can tackle the second problem easily.

Handling the BeginQuery/EndQuery

We will first create our own wrapper function in the spirit of the above code, with the notable difference that we use the ICancelableAsyncResult interface instead of IAsyncResult.
public static class TaskEx
{
    public static Task<TResult> FromAsync<TResult>(
        Func<AsyncCallback, ICancelableAsyncResult> beginMethod,
        Func<ICancelableAsyncResult, TResult> endMethod)
    {
        if (beginMethod == null) throw new ArgumentNullException("beginMethod");
        if (endMethod == null) throw new ArgumentNullException("endMethod");
        var tcs = new TaskCompletionSource<TResult>();
        try
        {
            beginMethod((iar) =>
            {
                try
                {
                    var result = endMethod(iar as ICancelableAsyncResult);
                    tcs.TrySetResult(result);
                }
                catch (OperationCanceledException)
                {
                    tcs.TrySetCanceled();
                }
                catch (Exception ex)
                {
                    // any other exception faults the task
                    tcs.TrySetException(ex);
                }
            });
        }
        catch
        {
            tcs.TrySetResult(default(TResult));
            throw;
        }
        return tcs.Task;
    }
}

The code is pretty self-explanatory and we can go ahead with the wrapping. There are four different operations that are exposed in both synchronous and asynchronous versions: Query, LinkQuery, CountOnlyQuery and RegularQuery. The extension methods are short since we have already created our generic wrapper above:

public static Task RunQueryAsync(this Query query)
{
    return TaskEx.FromAsync(query.BeginQuery, query.EndQuery);
}

public static Task RunLinkQueryAsync(this Query query)
{
    return TaskEx.FromAsync(query.BeginLinkQuery, query.EndLinkQuery);
}

public static Task RunCountOnlyQueryAsync(this Query query)
{
    return TaskEx.FromAsync(query.BeginCountOnlyQuery, query.EndCountOnlyQuery);
}

public static Task RunRegularQueryAsync(this Query query)
{
    return TaskEx.FromAsync(query.BeginRegularQuery, query.EndRegularQuery);
}

That is it for today. You can easily write your own handy extensions for the APM functions out there.
April 2, 2012
by Toni Petrina
· 11,246 Views
Cassandra Indexing: The Good, the Bad and the Ugly
Within NoSQL, the operations of indexing, fetching and searching for information are intimately tied to the physical storage mechanisms. It is important to remember that rows are stored across hosts, but a single row is stored on a single host (with replicas). Column families are stored in sorted order, which makes querying a set of columns efficient (provided you are not spanning rows).

The Bad : Partitioning

One of the tough things to get used to at first is that, without any indexes, queries that span rows can be (very) bad. Thinking back to our storage model, however, that isn't surprising. The strategy that Cassandra uses to distribute the rows across hosts is called Partitioning. Partitioning is the act of carving up the range of rowkeys and assigning them to the "token ring", which also assigns responsibility for a segment (i.e. partition) of the rowkey range to each host. You've probably seen this when you initialized your cluster with a "token". The token gives the host a location along the token ring, which assigns responsibility for a section of the token range. Partitioning is the act of mapping the rowkey into the token range.

There are two primary partitioners: Random and Order Preserving. They are appropriately named. With the RandomPartitioner, the token is a hash of the rowkey. This does a good job of evenly distributing your data across a set of nodes, but makes querying a range of the rowkey space incredibly difficult. From only a "start rowkey" value and an "end rowkey" value, Cassandra can't determine what range of the token space you need. It essentially needs to perform a "table scan" to answer the query, and a "table scan" in Cassandra is bad because it needs to go to each machine (most likely ALL machines if you have a good hash function) to answer the query. Now, at the great cost of even data distribution, you can employ the OrderPreservingPartitioner (OPP). I am *not* down with OPP.
The OPP preserves order as it translates rowkeys into tokens. Now, given a start rowkey value and an end rowkey value, Cassandra *can* determine exactly which hosts have the data you are looking for. It maps the start value to a token and the end value to a token, and simply selects and returns everything in between. BUT, by preserving order, unless your rowkeys are evenly distributed across the space, your tokens won't be either, and you'll get a lopsided cluster, which greatly increases the cost of configuring and administering the cluster (not worth it).

The Good : Secondary Indexes

Cassandra does provide a native indexing mechanism in Secondary Indexes. Secondary Indexes work off of the column values. You declare a secondary index on a Column Family. Datastax has good documentation on the usage. Under the hood, Cassandra maintains a "hidden column family" as the index. (See Ed Anuff's presentation for specifics.) Since Cassandra doesn't maintain column value information in any one node, and secondary indexes are on column values (rather than rowkeys), a query still needs to be sent to all nodes. Additionally, secondary indexes are not recommended for high-cardinality sets. I haven't looked yet, but I'm assuming this is because of the data model used within the "hidden column family". If the hidden column family stores a row per unique value (with rowkeys as columns), then it would mean scanning the rows to determine if they are within the range in the query.

From Ed's presentation:
- Not recommended for high cardinality values (i.e. timestamps, birthdates, keywords, etc.)
- Requires at least one equality comparison in a query, so not great for less-than/greater-than/range queries
- Unsorted: results are in token order, not query value order
- Limited to search on datatypes Cassandra natively understands

With all that said, secondary indexes work out of the box and we've had good success using them on simple values.
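Going back to the partitioner trade-off from "The Bad", a small plain-Python sketch may help show why a hashed token makes rowkey range queries scatter across the ring. This is only an illustration, not Cassandra code: random_token, node_for_token, the four-node ring and the 2**128 token space are assumptions made for the example, and Cassandra's actual token range differs.

```python
import hashlib

NODES = 4
TOKEN_SPACE = 2 ** 128  # size of an MD5 digest; assumed here, the real token range differs


def random_token(rowkey):
    """Hash the rowkey to a token, in the spirit of the RandomPartitioner."""
    digest = hashlib.md5(rowkey.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def node_for_token(token, nodes=NODES):
    """Map a token to one of `nodes` equally sized segments of the ring."""
    return token * nodes // TOKEN_SPACE


def is_sorted(values):
    """True if the values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


rowkeys = ["user_%03d" % i for i in range(20)]  # a contiguous rowkey range
tokens = [random_token(key) for key in rowkeys]
nodes_hit = {node_for_token(token) for token in tokens}

if __name__ == "__main__":
    print("rowkeys sorted:", is_sorted(rowkeys))
    print("tokens sorted: ", is_sorted(tokens))
    print("nodes touched by the 20-key range:", sorted(nodes_hit))
```

Because the hashed tokens do not follow rowkey order, a contiguous rowkey range ends up spread over several ring segments, which is the "table scan" problem described above. An order-preserving mapping would keep the range contiguous, at the price of uneven segments when the keys themselves are unevenly distributed.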
The Ugly : Do-It-Yourself (DIY) / Wide-Rows

Now, beauty is in the eye of the beholder. One of the beautiful things about NoSQL is the simplicity. The constructs are simple: Keyspaces, Column Families, Rows and Columns. Keeping it simple, however, means that sometimes you need to take things into your own hands. This is the case with wide-row indexes. Utilizing Cassandra's storage model, it's easy to build your own indexes where each rowkey becomes a column in the index. This is sometimes hard to get your head around, but let's imagine we have a case whereby we want to select all users in a zip code. The main users column family is keyed on userid, and zip code is a column on each user row. We could use secondary indexes, but there are quite a few zip codes. Instead we could maintain a column family with a single row called "idx_zipcode". We could then write columns into this row of the form "zipcode_userid". Since the columns are stored in sorted order, it is fast to query for all columns that start with "18964" (e.g. we could use 18964_ and 18964_ZZZZZZ as start and end values).

One obvious downside of this approach is that rows are self-contained on a host (again, except for replicas). This means that all queries are going to hit a single node. I haven't yet found a good answer for this. Additionally, and IMHO, the ugliest part of DIY wide-row indexing is from a client perspective. In our implementation, we've done our best to be language agnostic on the client side, allowing people to pick the best tool for the job to interact with the data in Cassandra. With that mentality, the DIY indexes present some trouble. Wide rows often use composite keys (imagine if you had an idx_state_zip, which would allow you to query by state and then zip). Although there is "native" support for composite keys, all of the client libraries implement their own version of them (Hector, Astyanax, and Thrift).
This means that any client needing to query data needs the added logic to first query the index, and additionally all clients need to construct the composite key in the same manner.

Making It Better...

For this very reason, we've decided to release two open source projects that help push this logic to the server side. The first project is Cassandra-Triggers. This allows you to attach asynchronous activities to writes in Cassandra (one such activity could be indexing). We've also released Cassandra-Indexing. This is hot off the presses and is still in its infancy (e.g. it only supports UTF8Type in the index), but the intent is to provide a generic server-side mechanism that indexes data as it's written to Cassandra. Employing the same server-side technique we used in Cassandra-Indexing, you simply configure the columns you want indexed, and the AOP code does the rest as you write to the target CF. As always, questions, comments and thoughts are welcome (especially if I'm off-base somewhere).
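The wide-row "idx_zipcode" idea from the DIY section can be sketched in a few lines of plain Python. This is a toy stand-in, not Cassandra or any client library: WideRowIndex and its methods are hypothetical names, and a sorted in-memory list plays the role of the sorted columns of a single row.

```python
import bisect


class WideRowIndex:
    """Toy stand-in for one wide row whose column names are kept sorted."""

    def __init__(self):
        self._columns = []  # sorted column names such as "18964_alice"

    def add(self, zipcode, userid):
        """Write a "zipcode_userid" column, keeping the columns sorted."""
        name = "%s_%s" % (zipcode, userid)
        pos = bisect.bisect_left(self._columns, name)
        if pos == len(self._columns) or self._columns[pos] != name:
            self._columns.insert(pos, name)

    def column_slice(self, start, end):
        """Return all column names between start and end, inclusive."""
        lo = bisect.bisect_left(self._columns, start)
        hi = bisect.bisect_right(self._columns, end)
        return self._columns[lo:hi]

    def users_in_zip(self, zipcode):
        """Slice on the zip code prefix and strip it from the column names."""
        start = zipcode + "_"
        # The article's end value "18964_ZZZZZZ" only works when user ids sort
        # at or below "ZZZZZZ"; a high sentinel character is used here instead.
        end = zipcode + "_\uffff"
        return [name.split("_", 1)[1] for name in self.column_slice(start, end)]


index = WideRowIndex()
index.add("18964", "alice")
index.add("19000", "carol")
index.add("18964", "bob")
index.add("10001", "dave")

if __name__ == "__main__":
    print(index.users_in_zip("18964"))
```

The lookup is a pair of binary searches over the sorted column names, which is why a slice query on a wide row is cheap. It also makes the downside visible: everything lives in one row, so every such query lands on the node that owns it.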
March 23, 2012
by Brian O' Neill
· 35,489 Views
PHP objects in MongoDB with Doctrine
An Object-Document Mapper (ODM) is equivalent to an Object-Relational Mapper, but its targets are documents of a NoSQL database instead of table rows. No one said that a Data Mapper must always rely on a relational database as its back end. In the PHP world, the Doctrine ODM for MongoDB is probably the most successful. This follows from the popularity of Mongo, which is a transitional product between SQL and NoSQL, still based on some relational concepts like queries.

Lots of features

The Doctrine Mongo ODM supports mapping of objects via annotations placed in the class source code, or via external XML or YAML files. In this and in many other aspects it is based on the same concepts as the Doctrine ORM: it features a Facade DocumentManager object and a Unit Of Work that batches changes to the database when objects are added to it. Moreover, two different types of relationships between objects are supported: references and embedded documents. The first is the equivalent of the classical pointer to another row, which ORMs always transform object references into; the second actually stores an object inside another one, like you would do with a Value Object. Thus, at least in Doctrine's case, it is easier to map objects as documents than as rows. As said before, the ODM borrows some concepts and classes from the ORM, in particular from the Doctrine\Common package, which features a standard collection class. So if you have built objects mapped with the Doctrine ORM, nothing changes for persisting them in MongoDB, except for the mapping metadata itself.

Advantages

If an ORM is sometimes a leaky abstraction, an ODM probably becomes an issue less often. It has less overhead than an ORM, since there is no schema to define, and the ability to embed objects means there should be no compromises between the object model and the capabilities of the database. How many times have we given up on introducing a potential Value Object because of the difficulty in persisting it?
The case for an ODM over a plain Mongo connection object is easy to make: you will still be able to use objects with proper encapsulation (like private fields and associations) and behavior (many methods) instead of extracting just a JSON package from your database.

Installation

A prerequisite for the ODM is the presence of the mongo extension, which can be installed via pecl. After having verified the extension is present, grab Doctrine\Common as the 2.2.x package, and a zip of the doctrine-mongodb and doctrine-mongodb-odm projects from GitHub. Decompress everything into a Doctrine/ folder. After having set up autoloading for classes in Doctrine\, use this bootstrap to get a DocumentManager (the equivalent of EntityManager):

use Doctrine\Common\Annotations\AnnotationReader,
    Doctrine\ODM\MongoDB\DocumentManager,
    Doctrine\MongoDB\Connection,
    Doctrine\ODM\MongoDB\Configuration,
    Doctrine\ODM\MongoDB\Mapping\Driver\AnnotationDriver;

private function getADm()
{
    $config = new Configuration();
    $config->setProxyDir(__DIR__ . '/mongocache');
    $config->setProxyNamespace('MongoProxies');
    $config->setDefaultDB('test');
    $config->setHydratorDir(__DIR__ . '/mongocache');
    $config->setHydratorNamespace('MongoHydrators');
    $reader = new AnnotationReader();
    $config->setMetadataDriverImpl(new AnnotationDriver($reader, __DIR__ . '/Documents'));
    return DocumentManager::create(new Connection(), $config);
}

You will be able to call persist() and flush() on the DocumentManager, along with a set of other methods for querying, like find() and getRepository().

Integration with an ORM

We are researching a solution for versioning objects mapped with the Doctrine ORM. Doing this with a version column would be invasive, and also strange where multiple objects are involved (do you version just the root of an object graph? Duplicate the other ones when they change? How can you detect that?)
The idea is to take a snapshot and put it in a read-only MongoDB instance, where all previous versions can be retrieved later for auditing (business reasons). This has been verified to be technically possible: the DocumentManager and EntityManager are totally separate object graphs, so they won't clash with each other. The only point of conflict is the annotations of model classes, since the two use different versions of @Id, and each can see the other's annotations, like @Entity and @Document, while parsing. This can be solved by using aliases for all the annotations, using their parent namespace basename as a prefix:

model = $model; } public function __toString() { return "Car #$this->document_id: $this->id, $this->model"; } }

This makes us able to save a copy of an ORM object into Mongo:

$car = new Car('Ford');
$this->em->persist($car);
$this->em->flush();
$this->dm->persist($car);
$this->dm->flush();
var_dump($car->__toString());
$this->assertTrue(strlen($car->__toString()) > 20);

The output produced by this test is:

.string(38) "Car #4f61a8322f762f1121000000: 3, Ford"

When retrieving the object, one of the two ids will be null, as it is ignored by the ORM or ODM. I am not using the same field because I want to store multiple copies of a row, so its id alone won't be unique. If you're interested, check out my hack on GitHub. It contains the running example presented in this post. Remember to create the relational schema with:

$ php doctrine.php orm:schema-tool:create

before running the test with

phpunit --bootstrap bootstrap.php DoubleMappingTest.php

MongoDB won't need the schema setup, of course. There are still some use cases to test, like the behavior in the presence of proxies, but it seems that the non-invasive approach of Data Mappers like Doctrine 2 is paying off: try mapping an object to multiple databases with Active Record.
March 20, 2012
by Giorgio Sironi
· 22,356 Views
Adding a .first() method to Django's QuerySet
In my last Django project, we had a set of helper functions that we used a lot. The most used was helpers.first, which takes a query set and returns the first element, or None if the query set was empty. Instead of writing this:

try:
    object = MyModel.objects.get(key=value)
except MyModel.DoesNotExist:
    object = None

You can write this:

def first(query):
    try:
        return query.all()[0]
    except IndexError:
        return None

object = helpers.first(MyModel.objects.filter(key=value))

Note that this is not identical. The get method will ensure that there is exactly one row in the database that matches the query. The helpers.first() function will silently eat all but the first matching row. As long as you're aware of that, you might choose to use the second form in some cases, primarily for style reasons. But the syntax on the helper is a little verbose, plus you're constantly including helpers.py. Here is a version that makes this available as a method on the end of your query set chain. All you have to do is have your models inherit from this AbstractModel.

class FirstQuerySet(models.query.QuerySet):
    def first(self):
        try:
            return self[0]
        except IndexError:
            return None

class ManagerWithFirstQuery(models.Manager):
    def get_query_set(self):
        return FirstQuerySet(self.model)

class AbstractModel(models.Model):
    objects = ManagerWithFirstQuery()

    class Meta:
        abstract = True

class MyModel(AbstractModel):
    ...

Now, you can do the following.

object = MyModel.objects.filter(key=value).first()
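The difference between the two forms can be illustrated without a database. The sketch below is plain Python over lists, not Django code: first stands in for the helper above, get_or_none stands in for the try/except around get(), and MultipleObjectsReturned is a local stand-in exception defined only for this example. Note also that newer Django releases (1.6 and later) ship a built-in QuerySet.first(), so the custom manager is only needed on older versions.

```python
class MultipleObjectsReturned(Exception):
    """Local stand-in for the error get() raises when several rows match."""


def first(items):
    """Return the first element of a sequence, or None if it is empty."""
    try:
        return items[0]
    except IndexError:  # only the "empty" case is swallowed
        return None


def get_or_none(items):
    """Mimic get() wrapped in try/except: one match or None, never several."""
    if len(items) > 1:
        raise MultipleObjectsReturned("expected one match, got %d" % len(items))
    return first(items)


if __name__ == "__main__":
    matches = ["row-1", "row-2"]
    print(first(matches))      # silently ignores the second match
    try:
        get_or_none(matches)
    except MultipleObjectsReturned as exc:
        print("get-style lookup refused:", exc)
```

Catching only IndexError, rather than using a bare except, keeps unrelated failures such as a database error visible instead of turning them into None.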
March 19, 2012
by Chase Seibert
· 12,597 Views
Display an OLE Object from a Microsoft Access Database using OLE Stripper
In database programming, you often need to bind a picture box to a field of type photo or image. For example, if you want to show an employee’s picture from the Northwind.mdb database, you might want to try the following code:

picEmployees.DataBindings.Add("Image", bsEmployees, "Photo", true);

This code works if the images are stored in the database with no OLE header, i.e. in a raw image file format. As the pictures in the Northwind database are not stored in a raw image file format but as OLE image documents, you have to strip off the OLE header to work with the image properly.

Binding imageBinding = new Binding("Image", bsEmployees, "ImageBlob.ImageBlob", true);
imageBinding.Format += new ConvertEventHandler(this.PictureFormat);

private void PictureFormat(object sender, ConvertEventArgs e)
{
    Byte[] img = (Byte[])e.Value;
    MemoryStream ms = new MemoryStream();
    // skip the OLE header that precedes the image data
    int offset = 78;
    ms.Write(img, offset, img.Length - offset);
    Bitmap bmp = new Bitmap(ms);
    ms.Close();
    // Writes the new value back
    e.Value = bmp;
}

Fortunately, the .NET Framework has some overloads that take care of this mechanism, but there is no guarantee that you will not need to strip off the OLE header yourself. For example, you can use the following technique to access the images of the Northwind.mdb that ships with Microsoft Access, and they will be rendered properly.

picEmployees.DataBindings.Add("Image", bsEmployees, "Photo", true, DataSourceUpdateMode.Never, new Bitmap(typeof(Button), "Button.bmp"));

Unfortunately, there are some scenarios where you need a better solution. For example, the Xtreme.mdb database that ships with Crystal Reports has a photo field that cannot be handled by the preceding methods.
For these complex scenarios, you can download the OLEStripper classes from here and re-write the PictureFormat method as it is shown below: private void PictureFormat(object sender, ConvertEventArgs e) { // photoIndex is same as Employee ID int photoIndex = Convert.ToInt32(e.Value); // Read the original OLE object ReadOLE olePhoto = new ReadOLE(); string PhotoPath = olePhoto.GetOLEPhoto(photoIndex); // Strip the original OLE object StripOLE stripPhoto = new StripOLE(); string StripPhotoPath = stripPhoto.GetStripOLE(PhotoPath); FileStream PhotoStream = new FileStream(StripPhotoPath , FileMode.Open); Image EmployeePhoto = Image.FromStream(PhotoStream); e.Value = EmployeePhoto; PhotoStream.Close(); }
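The fixed-offset approach in the first PictureFormat handler boils down to dropping a known number of leading bytes. The sketch below shows that idea in Python on raw bytes. It assumes the 78-byte offset from the snippet above, which may not hold for other databases. The find_bitmap_start function is a hypothetical heuristic that looks for the "BM" signature a BMP file starts with; it is not how the OLEStripper classes work, and it can match stray bytes inside a header.

```python
def strip_ole_header(blob: bytes, offset: int = 78) -> bytes:
    """Return the bytes after a fixed-size header (offset is an assumption)."""
    if len(blob) <= offset:
        raise ValueError("blob is too short to contain a header and an image")
    return blob[offset:]


def find_bitmap_start(blob: bytes) -> int:
    """Return the index of the first b'BM' signature, or -1 if there is none.

    This is only a heuristic: the two bytes may also occur inside the header.
    """
    return blob.find(b"BM")


# Fake field value: 78 header bytes followed by a pretend bitmap payload.
fake_field = b"\x00" * 78 + b"BM" + b"\x01\x02\x03"
image_bytes = strip_ole_header(fake_field)
```

When the header length is not known in advance, comparing find_bitmap_start(fake_field) with the assumed offset is a cheap sanity check before handing the bytes to an image decoder.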
March 15, 2012
by Amir Ahani
· 11,081 Views
Circos: An Amazing Tool for Visualizing Big Data
Storing massive amounts of data in a NoSQL data store is just one side of the Big Data equation. Being able to visualize your data in such a way that you can easily gain deeper insights is where things really start to get interesting. Lately, I've been exploring various options for visualizing (directed) graphs, including Circos. Circos is an amazing software package that visualizes your data through a circular layout. Although it was originally designed for displaying genomic data, it can create good-looking figures from data in any field. Just transform your data set into a tabular format and you are ready to go.

The figure below illustrates the core concept behind Circos. The table's columns and rows are represented by segments around the circle. Individual cells are shown as ribbons, which connect the corresponding row and column segments. The ribbons themselves are proportional in width to the value in the cell. When visualizing a directed graph, nodes are displayed as segments on the circle and the size of the ribbons is proportional to the value of some property of the relationships. The proportional size of the segments and ribbons with respect to the full data set allows you to easily identify the key data points within your table.

In my case, I want to better understand the flow of visitors to and within the Datablend site and blog: where do visitors come from (direct, referral, search, ...) and how do they navigate between pages? The rest of this article details how to 1) retrieve the raw visit information through the Google Analytics API, 2) persist this information as a graph in Neo4j and 3) query and preprocess this data for visualization through Circos. As always, the complete source code can be found on the Datablend public GitHub repository.

1. Retrieving your Google Analytics data

Let's start by retrieving the raw Google Analytics data. The Google Analytics Data API provides access to all dimensions and metrics that can be queried through the web application. In my case, I'm interested in retrieving the previous page path property for each page view. If a visitor enters through a page outside of the Datablend website, the previous page path is marked as (entrance). Otherwise, it contains the internal path. We will use Google's Java Data API to connect and retrieve this information. We are particularly interested in the pagePath, pageTitle, previousPagePath and medium dimensions, while our metric of choice is the number of pageviews. After setting the date range, the feed of entries that satisfy these criteria can be retrieved. For ease of use, we transform this data to a domain entity and filter/clean the data accordingly. If a visit originates from outside the Datablend website, we store the specific medium (direct, referral, search, ...) as the previous path.

    // authenticate
    analyticsService = new AnalyticsService(Configuration.SERVICE);
    analyticsService.setUserCredentials(Configuration.CLIENT_USERNAME, Configuration.CLIENT_PASS);
    // create query
    DataQuery query = new DataQuery(new URL(Configuration.DATA_URL));
    query.setIds(Configuration.TABLE_ID);
    query.setDimensions("ga:medium,ga:previousPagePath,ga:pagePath,ga:pageTitle");
    query.setMetrics("ga:pageviews");
    query.setStartDate(dateString);
    query.setEndDate(dateString);
    // execute
    DataFeed feed = analyticsService.getFeed(createQueryUrl(date), DataFeed.class);
    // iterate and clean
    for (DataEntry entry : feed.getEntries()) {
        String pagePath = entry.stringValueOf("ga:pagePath");
        String pageTitle = entry.stringValueOf("ga:pageTitle");
        String previousPagePath = entry.stringValueOf("ga:previousPagePath");
        String medium = entry.stringValueOf("ga:medium");
        long views = entry.longValueOf("ga:pageviews");
        // filter the data: check that the criteria are satisfied
        if (filter(pagePath) && filter(previousPagePath) && (!clean(previousPagePath).equals(clean(pagePath)))) {
            Navigation navigation = new Navigation(clean(previousPagePath), clean(pagePath), pageTitle, date, views);
            if (navigation.getSource().equals("(entrance)")) {
                // in case of an entrance, save its medium instead
                navigation.setSource(medium);
            }
            navigations.add(navigation);
        }
    }

2. Storing navigational data as a directed graph in Neo4j

The set of site navigations can easily be stored as a directed graph in the Neo4j graph database. Nodes are site paths (or mediums), while relationships are the navigations themselves. We start by retrieving the navigations for a particular date range and retrieve (or lazily create) the nodes representing the source and target paths (or mediums). Next, we de-normalize the pageviews metric (for instance, 6 individual relationships will be created for 6 page views). Although this de-normalization step is not strictly required, I did so to make sure that the degree of my nodes is correct should I perform other types of calculations. For each individual navigation relationship, we also store the date of the visit.

    // retrieve navigations for a particular date
    List<Navigation> navigations = retrieval.getNavigations(date);
    // save them in the graph database
    Transaction tx = graphDb.beginTx();
    // iterate and create
    for (Navigation nav : navigations) {
        Node source = getPath(nav.getSource());
        Node target = getPath(nav.getTarget());
        if (!target.hasProperty("title")) {
            target.setProperty("title", nav.getTargetTitle());
        }
        for (long i = 0; i < nav.getAmount(); i++) {
            // duplicate relationships
            Relationship transition = source.createRelationshipTo(target, Relationships.NAVIGATION);
            transition.setProperty("date", date.getTime()); // save time as long
        }
    }
    // commit
    tx.success();
    tx.finish();

3. Creating the Circos tabular data format

The Circos tabular data format is quite easy to construct. It's basically a tab-delimited file with row and column headers. A cell is interpreted as a value that flows from the row entity to the column entity. We will use the Neo4j Cypher query language to retrieve the data of interest, namely all navigations that occurred within a certain time period. Doing so allows us to create historical visualizations of our navigations and observe how visit flows change over time.

    // access the graph database
    graphDb = new EmbeddedGraphDatabase("var/analytics");
    engine = new ExecutionEngine(graphDb);
    // set the date range parameters of the Cypher query
    Map<String, Object> params = new HashMap<String, Object>();
    params.put("fromdate", from.getTime());
    params.put("todate", to.getTime());
    // execute the query
    ExecutionResult result = engine.execute("start sourcepath=node:index(\"path:*\") " +
        "match sourcepath-[r]->targetpath " +
        "where r.date >= {fromdate} and r.date <= {todate} " +
        "return sourcepath,targetpath", params);

Next, we create the tab-delimited file itself. We iterate through all entries (i.e. navigations) that match our Cypher query and store them in a temporary list. Afterwards, we build the two-dimensional array by normalizing (i.e. summing) the number of navigations between the source and target paths. At the end, we filter this occurrence matrix on the minimal number of required navigations. This ensures that we only create segments for paths that are relevant in the total population. As a final step, we print the occurrence matrix as a tab-delimited file. For each path, we use a shorthand, as the Circos renderer seems to have problems with long string identifiers.

    // retrieve the results
    Iterator<Map<String, Object>> it = result.javaIterator();
    List<Navigation> navigations = new ArrayList<Navigation>();
    Map<String, String> titles = new HashMap<String, String>();
    Set<String> paths = new HashSet<String>();
    // iterate the results
    while (it.hasNext()) {
        Map<String, Object> record = it.next();
        String source = (String) ((Node) record.get("sourcepath")).getProperty("path");
        String target = (String) ((Node) record.get("targetpath")).getProperty("path");
        String targetTitle = (String) ((Node) record.get("targetpath")).getProperty("title");
        // reuse the navigation object as temporary holder
        navigations.add(new Navigation(source, target, targetTitle, new Date(), 1));
        paths.add(source);
        paths.add(target);
        if (!titles.containsKey(target)) {
            titles.put(target, targetTitle);
        }
    }
    // retrieve the various paths
    List<String> pathIds = Arrays.asList(paths.toArray(new String[]{}));
    // create the matrix that holds the info
    int[][] occurences = new int[pathIds.size()][pathIds.size()];
    // iterate through all the navigations and update accordingly
    for (Navigation navigation : navigations) {
        int sourceIndex = pathIds.indexOf(navigation.getSource());
        int targetIndex = pathIds.indexOf(navigation.getTarget());
        occurences[sourceIndex][targetIndex] = occurences[sourceIndex][targetIndex] + 1;
    }
    // matrix built, filter on threshold
    for (int i = 0; i < occurences.length; i++) {
        for (int j = 0; j < occurences.length; j++) {
            if (occurences[i][j] < threshold) {
                occurences[i][j] = 0;
            }
        }
    }
    // print
    printCircosData(pathIds, titles, occurences);

The text below is a sample of the output generated by the printCircosData method. It first prints the legend (matching shorthands with actual paths). Next, it prints the tab-delimited Circos table.

    link0 - /?p=411/wp-admin - Storing and querying RDF data in Neo4j through Sail - Datablend
    link1 - /?p=1146 - Visualizing RDF Schema inferencing through Neo4j, TinkerPop, Sail and Gephi - Datablend
    link2 - /?p=164 - Big Data / Concise Articles - Datablend
    link3 - referral - null
    link4 - /?p=1400 - The joy of algorithms and NoSQL revisited: the MongoDB Aggregation Framework - Datablend
    ...
    datal0l1l2l3l4...
    l000000
    l100000
    l200000
    l3059400197
    l400000

4. Use the Circos power

Although Circos can be installed on your local computer, we will use its online version to create the visualization of our data. Upload your tab-delimited file and wait a few seconds before enjoying the beautiful rendering of your site's navigation information.

At a glance we can already see that the l3-segment (i.e. the referrals) is significantly larger (almost 6000 navigations) than the other segments. The outer 3 rings visualize the total number of navigations that are leaving and entering this particular path. In the case of referrals, no navigations have this path as target (indicated by the empty middle ring). Its total segment count (inner ring) is made up entirely of navigations that have a referral as source. The l6-segment seems to be the path that attracts the most traffic (around 2500 navigations). This segment visualizes the navigation data related to my "The joy of algorithms and NoSQL: a MongoDB example" article. Most of its traffic is received through referrals, while a decent amount is also generated through direct (l17-segment) and search (l27-segment) traffic. The l15-segment (my blog's main page) is the only path that receives an almost equal amount of incoming and outgoing traffic.

With just a few tweaks to the Circos input data, we can easily focus on particular types of navigation data. In the figure below, I made sure that referral and search navigations are visualized more prominently through the use of 2 separate colors.

5. Conclusions

In the era of Big Data, visualizations are becoming crucial as they enable us to mine our large data sets for patterns of interest. Circos specializes in a very specific type of visualization, but does its job extremely well. I would be delighted to hear about other types of visualizations for directed graphs.
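The matrix-building step from section 3 can be sketched independently of Neo4j and the Google Analytics API. The Python below assumes the navigations are already available as (source, target) pairs. The function names build_matrix and to_circos_table are hypothetical, and the l0, l1, ... shorthands simply follow the sample output shown in the article.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def build_matrix(navigations: Sequence[Tuple[str, str]], threshold: int):
    """Count navigations per (source, target) pair and zero out rare ones."""
    # Sorted so that the path-to-index mapping is deterministic.
    paths = sorted({path for pair in navigations for path in pair})
    index = {path: i for i, path in enumerate(paths)}
    matrix = [[0] * len(paths) for _ in paths]
    for (source, target), count in Counter(navigations).items():
        if count >= threshold:  # keep only pairs that reach the threshold
            matrix[index[source]][index[target]] = count
    return paths, matrix


def to_circos_table(matrix: List[List[int]]) -> str:
    """Render the matrix as a tab-delimited table with shorthand labels."""
    labels = [f"l{i}" for i in range(len(matrix))]
    lines = ["\t".join(["data"] + labels)]
    for label, row in zip(labels, matrix):
        lines.append("\t".join([label] + [str(value) for value in row]))
    return "\n".join(lines)


# Three referral visits to /a and a single /a -> /b navigation.
sample = [("referral", "/a")] * 3 + [("/a", "/b")]
sample_paths, sample_matrix = build_matrix(sample, threshold=2)
sample_table = to_circos_table(sample_matrix)
```

With a threshold of 2, the single /a to /b navigation is filtered out, so only the referral row carries a non-zero cell. Rows are sources and columns are targets, matching the row-to-column flow that Circos expects.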
March 13, 2012
by Davy Suvee
· 36,340 Views · 2 Likes
Joins with MapReduce
I have been reading up on the join implementations available for Hadoop for the past few days. In this post I recap some techniques I learned along the way. Joins can be done on either the map side or the reduce side, depending on the nature of the data sets to be joined.

Reduce Side Join

Let's take the following tables containing employee and department data. Let's see how the join query below can be achieved using a reduce side join.

    SELECT employees.name, employees.age, department.name
    FROM employees INNER JOIN department
    ON employees.dept_id = department.dept_id

The map side is responsible for emitting the join predicate values along with the corresponding record from each table, so that records with the same department id in both tables end up at the same reducer, which then joins them. However, each record also has to be tagged with the table it originated from, so that the join happens only between records of the two different tables. The following diagram illustrates the reduce side join process. Here is the pseudo code for the map function in this scenario.

    map (K table, V rec) {
        dept_id = rec.dept_id
        tagged_rec.tag = table
        tagged_rec.rec = rec
        emit(dept_id, tagged_rec)
    }

On the reduce side, the join happens between records having different tags.

    reduce (K dept_id, list tagged_recs) {
        for (tagged_rec : tagged_recs) {
            for (tagged_rec1 : tagged_recs) {
                if (tagged_rec.tag != tagged_rec1.tag) {
                    joined_rec = join(tagged_rec, tagged_rec1)
                    emit(tagged_rec.rec.dept_id, joined_rec)
                }
            }
        }
    }

Map Side Join (Replicated Join)

Using Distributed Cache on Smaller Table

For this implementation to work, one relation has to fit into memory. The smaller table is replicated to each node and loaded into memory. The join happens on the map side without reducer involvement, which significantly speeds up the process, since it avoids shuffling all the data across the network even though most of the non-matching records would later be dropped. The smaller table can be loaded into a hash table so that look-ups by dept_id are possible. The pseudo code is outlined below.

    map (K table, V rec) {
        list recs = lookup(rec.dept_id) // get smaller table records having this dept_id
        for (small_table_rec : recs) {
            joined_rec = join(small_table_rec, rec)
            emit(rec.dept_id, joined_rec)
        }
    }

Using Distributed Cache on Filtered Table

If the smaller table doesn't fit in memory, it may be possible to prune its contents if a filtering expression has been specified in the query. Consider the following query.

    SELECT employees.name, employees.age, department.name
    FROM employees INNER JOIN department
    ON employees.dept_id = department.dept_id
    WHERE department.name = "eng"

Here a smaller data set can be derived from the department table by filtering out records having department names other than "eng". Now it may be possible to do a replicated map side join with this smaller data set.

Replicated Semi-Join

Reduce Side Join with Map Side Filtering

Even if the filtered data of the small table doesn't fit into memory, it may be possible to include just the dept_ids of the filtered records in the replicated data set. On the map side, this cache can then be used to filter out records that would otherwise be sent to the reduce side, reducing the amount of data moved between the mappers and reducers. The map side logic would look as follows.

    map (K table, V rec) {
        // check if this record needs to be sent to reducer
        boolean sendToReducer = check_cache(rec.dept_id)
        if (sendToReducer) {
            dept_id = rec.dept_id
            tagged_rec.tag = table
            tagged_rec.rec = rec
            emit(dept_id, tagged_rec)
        }
    }

The reducer side logic is the same as in the reduce side join case.

Using a Bloom Filter

A Bloom filter is a construct that can be used to test whether a given element is contained in a set. A smaller representation of the filtered dept_ids can be derived if the dept_id values are added to a Bloom filter. This Bloom filter can then be replicated to each node. On the map side, for each record read, the Bloom filter can be used to check whether the dept_id in the record is present in the Bloom filter, and only then is that record emitted to the reduce side. Since a Bloom filter is guaranteed not to produce false negatives, the result is accurate: a false positive only means that an extra record reaches the reduce side, where it finds no matching record to join with.

References

[1] Hadoop in Action
[2] Hadoop: The Definitive Guide
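The reduce side join and the replicated map side join described in this article can be simulated in memory to see that they produce the same rows. The sketch below is plain Python, not Hadoop code: the sample rows, the function names and the dictionary-based shuffle are all assumptions made for illustration. Unlike the nested loop in the pseudo code, the reduce step here splits records by tag first, so that each employee and department pair is emitted once.

```python
from collections import defaultdict

employees = [
    {"name": "Ann", "age": 31, "dept_id": 1},
    {"name": "Bob", "age": 45, "dept_id": 2},
    {"name": "Cid", "age": 28, "dept_id": 1},
]
departments = [
    {"dept_id": 1, "name": "eng"},
    {"dept_id": 2, "name": "ops"},
    {"dept_id": 3, "name": "hr"},
]


def map_phase(table, records):
    """Emit (join key, tagged record) pairs, as the map function would."""
    for rec in records:
        yield rec["dept_id"], (table, rec)


def reduce_side_join(emps, depts):
    # Simulated shuffle: group tagged records from both tables by dept_id.
    groups = defaultdict(list)
    for key, tagged in map_phase("employees", emps):
        groups[key].append(tagged)
    for key, tagged in map_phase("department", depts):
        groups[key].append(tagged)
    joined = []
    for tagged_recs in groups.values():
        # Reduce: pair every employee with every department in the group.
        left = [rec for tag, rec in tagged_recs if tag == "employees"]
        right = [rec for tag, rec in tagged_recs if tag == "department"]
        for emp in left:
            for dept in right:
                joined.append((emp["name"], emp["age"], dept["name"]))
    return joined


def replicated_join(emps, depts):
    # The smaller table is loaded into a hash table keyed by dept_id.
    lookup = defaultdict(list)
    for dept in depts:
        lookup[dept["dept_id"]].append(dept)
    joined = []
    for emp in emps:  # map-only: no shuffle, no reducer
        for dept in lookup.get(emp["dept_id"], []):
            joined.append((emp["name"], emp["age"], dept["name"]))
    return joined


reduce_rows = sorted(reduce_side_join(employees, departments))
replicated_rows = sorted(replicated_join(employees, departments))
```

The "hr" department has no employees, so it appears in neither result. In the reduce side version its record still travels through the simulated shuffle, which is exactly the cost the map side filtering variants try to avoid.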
March 12, 2012
by Buddhika Chamith
· 31,035 Views

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
