Data Resources

The Latest Data Topics

ActiveMQ Message Priorities: How it Works

There’s usually a steady drip of questions on the mailing list surrounding ActiveMQ’s message-priority support as well as good questions about observed behaviors and “what’s really supported”? I hope to help you understand what happens under the covers and what levels of priority can be supported. The details could get gory for some. If you’re not interested in the details, take a look at the ActiveMQ wiki for the high-level overview. First, since ActiveMQ supports JMS 1.1, let’s take a look at what the JMS Spec says about support for “JMSPriority”: JMS defines a ten-level priority value, with 0 as the lowest priority and 9 as the highest. In addition, clients should consider priorities 0-4 as gradations of normal priority and priorities 5-9 as gradations of expedited priority. JMS does not require that a provider strictly implement priority ordering of messages; however, it should do its best to deliver expedited messages ahead of normal messages. ActiveMQ observes three distinct levels of “Priority”: Default (JMSPriority == 4) High (JMSPriority > 4 && <= 9) Low (JMSPriority > 0 && < 4) If you don’t specify a priority for your MessageProducer or individual messages (see MessageProducer#send(message, deliveryMode, priority, timeToLive)), ActiveMQ’s client will default to using a JMSPriority == 4. As a JMS consumer, you can expect a FIFO ordering if the producers aren’t using priority or you’re not using some other form of selection criteria on the destination. ActiveMQ also “does its best” to deliver expedited messages ahead of “normal” messages, as the spec states. The message store that your broker uses greatly contributes to how that’s exactly done, but in general you can expect the broker to honor strict (0-9) priority support for only the JDBC backed messages stores. For KahaDB-backed message stores, only “category priority” is supported (Low, Default, High, where priorities in each category are not always differentiated, that is 5 and 9 are considered “High”). However, with the right settings and messaging profile, you can affect how [strict] prioritization happens even with KahaDB, so let’s take a quick look. Enabling Message Priority You can enable message priority on your Queues with the following setting in your activemq.xml configuration file: For queueName there is wildcard support, so you can enable priority support on a hierarchy of messages. When you enable priority support, the broker will use prioritized linked-list structures in its messages cursors as well as give KahaDB a hint to use priority categories when storing messages onto disk. There are varying levels of how strict the priority ordering can get, but at worst, you can assume priorities will be upheld by category. The following factors come into play which control how strict the priority ordering can get when using the KahaDB store: Caching enabled/disabled in the queue cursor MaxPageInSize for how many messages to page from the store in a batch Consumer prefetching Expired-message checking Broker Memory settings Persistent/Non-persistent messages The next section presents a little detail about what happens in KahaDB to support priority, while the following sections will go into how things happen in broker memory and are finally dispatched to a consumer and will point out along the way how the different factors from above come into play. KahaDB Prioritization Categories First we’ll start with how messages are stored on disk and loaded into a destination. KahaDB (the default message store) is a file-based message database that the broker uses to persist messages in a “log” or “journal”. The broker also keeps track of which messages are in the log by keeping a separate “index” that holds information about messages (like its location in the log, to which destination it’s associated, ordering, etc). The index also has a notion of message “priority”, which is implemented with three B+Tree structures, one for each priority level (see MessageOrderIndex in org.apache.activemq.store.kahadb.MessageDatabase). This implementation detail is the root of message prioritization and has implications for the rest of the broker as messages are removed from the store. When messages are retrieved from the store, they are done so in batches (maxPageInSize), and messages that are in the “highPriority” BTree are retrieved first. When the high-priority messages are exhausted, the store will then offer up the default priority and subsequently the low priority messages. You can set the maxPageInSize like so: The larger the page size, the larger the number of messages in a batch and the more messages you can see at a time per “snapshot”. For each batch that’s brought into memory, it’s messages are going to be strictly prioritized as described below by the store cursor. The downside is that if your messages are large, bringing in 500 at a time could exhaust your broker memory. The default setting is 200. Message Cursor Priority Lists When persistent messages come into the broker from a producer, they will be stored onto disk, but they will also be cached in memory waiting to be dispatched to a consumer. This is a default setting, so no need to explicitly set it. The idea behind this is to be able to dispatch to fast consumers without having to retrieve it directly from disk (if consumers become slow, the broker will auto-tune itself to not use the cache once it’s filled so as to not OOM). The good thing about this is that when prioritization support is used for a queue, the internal lists used for the cursors will support strict priority (0-9), so for all of the messages that are currently in memory (in the cache), they will be sorted properly from highest to lowest. The trick is what happens when all of the messages in the cache are “lower priority messages” and then a high-priority message comes in to the broker but won’t fit in the cache because it’s full… in that case the message will go directly to the store, be indexed in the “high-priority” index, but won’t be available for dispatch ahead of the lower priority messages until it’s paged into memory in the next batch. When NON persistent messages come into the broker, they will not go to the message store. They will be kept in memory for as long as possible and only pushed to disk (in a temporary store) when memory has passed a defined threshold (> 70% by default). So the same behaviors for cached messages above apply for non-persistent messages, namely, those that are in memory will be ordered strictly (0-9), but once they get pushed to disk, only categories are observed. If you disable the cursor’s cache (with the following setting) then you could help to eliminate the above scenario where the cache becomes full with lower priority messages right when a high-priority message comes in (and becomes stuck on disk because it cannot be paged into memory). However, doing this will slow down your throughput because messages must be paged in from disk before sending to consumers which will slow down the dispatch. But note, when doing this, you are more likely to see messages not following “strict” priority even with the priority lists in the cursor. They will, however, follow the priority categories (High, Default, Low) properly. So to recap, if you disable the cache, you can get higher priority messages delivered more timely than you can if the cache is enabled and it’s filled with lower priority messages. But disabling the cache, by itself, won’t get you to strict priority. Disabling the cache helps getting high priority messages to consumers ahead of lower priority messages, however for this to work as intended (and has bitten me), you’ll want to disable the asynchronous message expiry check. This expiry check pages messages into memory every 30 seconds regardless if they’re ready to be dispatched (by default) and performs a TTL check (time to live) on them and discards those messages that should be expired. This sort of checking effectively brings messages into memory and will stall the normal “page in for dispatch” just enough to miss higher priority messages. Turning off expiry checking, however, will keep expired messages in the store longer because the only expiry check would be done right before dispatch, so make an educated decision on this, and all ActiveMQ settings you tinker with. But to move in the direction of strict(er) order priorities, you’ll want to disable this. Lastly, consumer prefetch plays a role in achieving “strict ordering.” By default, prefetch is set to 1000 for queue consumers, which means they will be sent 1000 messages in a batch. This helps speed up the consumer when it’s consuming messages, but in terms of priority handling it in essence also acts like a cache of messages (discussed above) and could contribute to not seeing “strict ordering”. “category priority” could also be violated if your prefetch is filled with lower priority messages, and there is a new high-priority message that came in to the broker, you wouldn’t see it until the next message dispatch to the consumer. So the lower the prefetch, the better chance of seeing higher priority messages ahead of lower ones. With prefetch of 1, you’ll always get the highest priority message that the store cursor knows about. Client side message priority ActiveMQ also has priority support built right into the message client and it’s enabled by default. This means, when messages are being sent to your consumer (even before your consumer is receiving them, using prefetch), they will be cached on the consumer side and prioritized by default. This is regardless of whether you’re using priority support on the broker side. This could impact the ordering you see on the consumer so just keep this in mind. To disable it, set the following configuration option on your broker URL, e.g., tcp://0.0.0.0:61616?jms.messagePrioritySupported=false But as mentioned above, you’ll want to lower the prefetch to 1 to get the best chance of achieving strict ordering. Tradeoffs So ultimately, getting strictly ordered messages with KahaDB is possible but there are significant tradeoffs to consider and it won’t apply for every messaging situation. Do you want optimized, fast messaging? or do you want to slow down the messaging to achieve strict(er) ordering for priorities. Each situation is different and should be evaluated on a case-by-case basis. In general, however, you can rely on category level priorities. Reordering messages across large queues AND keeping high performance is problematic, and most Message Queue vendors do not do that very well. ActiveMQ’s priority support is strong, but another good alternative exists as discussed on the ActiveMQ wiki describing message priority and that is: using message selectors and balancing out the consumers in such a way that high priority messages end up getting consumed first. This approach tends to give more flexibility and control, but that’s for another post Leave me some comments if something wasn’t clear, or drop an email in the mailing list!

April 2, 2013

by Christian Posta

· 19,083 Views

ArrayList vs. LinkedList vs. Vector

1. list overview list, as its name indicates, is an ordered sequence of elements. when we talk about list, it is a good idea to compare it with set which is a set of elements which is unordered and every element is unique. the following is the class hierarchy diagram of collection. 2. arraylist vs. linkedlist vs. vector from the hierarchy diagram, they all implement list interface. they are very similar to use. their main difference is their implementation which causes different performance for different operations. arraylist is implemented as a resizable array. as more elements are added to arraylist, its size is increased dynamically. it's elements can be accessed directly by using the get and set methods, since arraylist is essentially an array. linkedlist is implemented as a double linked list. its performance on add and remove is better than arraylist, but worse on get and set methods. vector is similar with arraylist, but it is synchronized. arraylist is a better choice if your program is thread-safe. vector and arraylist require space as more elements are added. vector each time doubles its array size, while arraylist grow 50% of its size each time. linkedlist, however, also implements queue interface which adds more methods than arraylist and vector, such as offer(), peek(), poll(), etc. note: the default initial capacity of an arraylist is pretty small. it is a good habit to construct the arraylist with a higher initial capacity. this can avoid the resizing cost. 3. arraylist example arraylist al = new arraylist(); al.add(3); al.add(2); al.add(1); al.add(4); al.add(5); al.add(6); al.add(6); iterator iter1 = al.iterator(); while(iter1.hasnext()){ system.out.println(iter1.next()); } 4. linkedlist example linkedlist ll = new linkedlist(); ll.add(3); ll.add(2); ll.add(1); ll.add(4); ll.add(5); ll.add(6); ll.add(6); iterator iter2 = al.iterator(); while(iter2.hasnext()){ system.out.println(iter2.next()); } as shown in the examples above, they are similar to use. the real difference is their underlying implementation and their operation complexity. 5. vector vector is almost identical to arraylist, and the difference is that vector is synchronized. because of this, it has an overhead than arraylist. normally, most java programmers use arraylist instead of vector because they can synchronize explicitly by themselves. 6. performance of arraylist vs. linkedlist the time complexity comparison is as follows: i use the following code to test their performance: arraylist arraylist = new arraylist(); linkedlist linkedlist = new linkedlist(); // arraylist add long starttime = system.nanotime(); for (int i = 0; i < 100000; i++) { arraylist.add(i); } long endtime = system.nanotime(); long duration = endtime - starttime; system.out.println("arraylist add: " + duration); // linkedlist add starttime = system.nanotime(); for (int i = 0; i < 100000; i++) { linkedlist.add(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("linkedlist add: " + duration); // arraylist get starttime = system.nanotime(); for (int i = 0; i < 10000; i++) { arraylist.get(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("arraylist get: " + duration); // linkedlist get starttime = system.nanotime(); for (int i = 0; i < 10000; i++) { linkedlist.get(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("linkedlist get: " + duration); // arraylist remove starttime = system.nanotime(); for (int i = 9999; i >=0; i--) { arraylist.remove(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("arraylist remove: " + duration); // linkedlist remove starttime = system.nanotime(); for (int i = 9999; i >=0; i--) { linkedlist.remove(i); } endtime = system.nanotime(); duration = endtime - starttime; system.out.println("linkedlist remove: " + duration); and the output is: arraylist add: 13265642 linkedlist add: 9550057 arraylist get: 1543352 linkedlist get: 85085551 arraylist remove: 199961301 linkedlist remove: 85768810 the difference of their performance is obvious. linkedlist is faster in add and remove, but slower in get. based on the complexity table and testing results, we can figure out when to use arraylist or linkedlist. in brief, linkedlist should be preferred if: there are no large number of random access of element there are a large number of add/remove operations

March 28, 2013

by Ryan Wang

· 611,277 Views · 20 Likes

HashMap vs. TreeMap vs. HashTable vs. LinkedHashMap

Learn all about important data structures like HashMap, HashTable, and TreeMap.

March 28, 2013

by Ryan Wang

· 438,388 Views · 14 Likes

Connect Apache OFBiz with the Real World

What would you expect from someone who is OFBiz and Camel committer? To integrate them for fun? Fine, here it is. In addition to being fun, I believe this integration will be of real benefit for the OFBiz community, because despite the fact of being a complete ERP software, OFBiz lacks the ability to easily integrate with external systems. The goal of this project is instead of reinventing the wheel and trying to integrate OFBiz with each system separately, integrate it with Camel and let Camel do what it does best: connect your application with every possible protocol and system out there. Quick OFBiz introduction The Apache Open For Business Project is an open source, enterprise automation software. It consist mainly from two parts: A full-stack framework for rapid business application development. It has Entity Engine for the data layer (imagine something like iBATIS and Hibernate combined). It is the same entity engine that powers millions of Attlasian Jira instances. But don't get me wrong, it is not meant for usage outside of OFBiz framework, so use it only as OFBiz data layer. Service Engine - this might be hard to grasp for someone only with DDD background, but OFBiz doesn't have any domain objects. Instead for the service layer it uses SOA and has thousands of services that contains the business logic. A service is an atomic bit of isolated business logic, usually reading and updating the database. If you need you can make services triggering each other using ECAs(event-condition-action) which is kind of rule engine that allows define pre/post conditions for triggering other service calls when a service is executed. The service itself can be written written in java, groovy or simple language (an XML DSL for simple database manipulation) and usually requires authentication, authorisation and finally executed in a transaction. UI widgets - an XML DSL which allows you easily create complex pages with tables, forms and trees. And the really great thing about this framework is that 'The whole is greater than the sum of its parts' - all of the above layers works together amazingly: if you have an entity definition (a table) in your data layer, you can use it in your service layer during your service interface definition or its implementation. It takes one line of code(a long one) to create a service which has as input parameters the table columns and return the primary key as result of the service. Then if you are creating a screen with tables or forms, you can base it on your entity definitions or service definitions. It is again only few lines of code to create a form with fields mapping to a service or entity fields. Out of the box business applications. These are vertical applications for managing the full life cycle of a business domain like ordering, accounting, manufacturing and many more in a horizontally integrated manner. So creating an order from order or ecommerce application will interact with facility to check whether a product is available, and after the order is created will create accounting transaction in accounting application. Before the order is shipped from the facility, it will create invoices and once the invoice is paid, it will close the order. You get the idea. Camel in 30 seconds Apache Camel is an integration framework based on known Enterprise Integration Patterns(EIP). Camel can also be presented as consisting of two artifacts: The routing framework which can be defined using java, scala, xml DSL with all the EIPs like Pipe, Filter, Router, Splitter, Aggregator, Throttler, Normalizer and many more. Components and transformers ie all the different connectors to more than 100 different applications and protocols like: AMQP, AWS, web services, REST, MongoDB, Twitter, Websocket, you name it. If you can imagine a tool, that enables you to consume data from one system, then manipulate the data (transform, filter, split, aggregate) and send it to other systems, using a declarative, concise, English-like DSL without any boilerplate code - that's Apache Camel. Let OFBiz talk to all of the systems Camel do The main interaction point with OFBiz are either by using the Entity Engine for direct data manipulation or by calling services through Service Engine. The latter is preferred because it ensures that the user executing the service is authorised to do so, the operation is transactional to ensure data integrity, and also all the business rules are satisfied (there might be other services that have to be executed with ECA rules). So if we can create an OFBiz endpoint in Camel and execute OFBiz services from Camel messages, that would allow OFBiz to receive notifications from Camel endpoints. What about the other way around - making OFBiz notify Camel endpoints? The ideal way would be to have an OFBiz service that sends the IN parameters to Camel endpoints as message body and headers and return the reply message as OFBiz service response. If you are wondering: why is it so great, what is an endpoint, where is the real world, who is gonna win Euro2012... have a look at the complete list of available Camel components, and you will find out the answer. Running Camel in OFBiz container I've started an experimental ofbiz-camel project on github which allows you to do all of the above. It demonstrates how to poll files from a directory using Camel and create notes in OFBiz with the content of the file using createNote service. The project also has an OFBiz service, that enables sending messages from OFBiz to Camel. For example using that service it is possible to send a message to Camel file://data endpoint, and Camel will create a file in the data folder using the service parameters. The integration between OFBiz and Camel is achieved by running Camel in an OFBiz container as part of the OFBiz framework. This makes quite tight integration, but ensures that there will not be any http, rmi or any other overhead in between. It is still WIP and may change totally. Running Camel and OFBiz separately Another approach is KISS: run Camel and OFBiz as they are - separate applications, and let them interact with RMI, WS* or something else. This doesn't require any much coding, but only configuring both systems to talk to each other. I've created a simple demo camel-ofbiz-rmi which demonstrates how to listen for tweets with a specific keyword and store them in OFBiz as notes by calling createNote service using RMI. It uses Camel's twitter and rmi components and requires only configuration. Notice that this example demonstrates only one way interaction: from Camel to OFBiz. In order to invoke a Camel endpoint from OFBiz you can you have to write some RMI, WS* or other code. PS: I'm looking forward to hear your real world integration requirements for OFBiz.

March 25, 2013

by Bilgin Ibryam

· 18,259 Views · 1 Like

OpenJPA: Memory Leak Case Study

This article will provide the complete root cause analysis details and resolution of a Java heap memory leak (Apache OpenJPA leak) affecting an Oracle Weblogic server 10.0 production environment. This post will also demonstrate the importance to follow the Java Persistence API best practices when managing the javax.persistence.EntityManagerFactory lifecycle. Environment specifications Java EE server: Oracle Weblogic Portal 10.0 OS: Solaris 10 JDK: Oracle/Sun HotSpot JVM 1.5 32-bit @2 GB capacity Java Persistence API: Apache OpenJPA 1.0.x (JPA 1.0 specifications) RDBMS: Oracle 10g Platform type: Web Portal Troubleshooting tools Quest Foglight for Java (Java heap monitoring) MAT (Java heap dump analysis) Problem description & observations The problem was initially reported by our Weblogic production support team following production outages. An initial root cause analysis exercise did reveal the following facts and observations: Production outages were observed on regular basis after ~2 weeks of traffic. The failures were due to Java heap (OldGen) depletion e.g. OutOfMemoryError: Java heap space error found in the Weblogic logs. A Java heap memory leak was confirmed after reviewing the Java heap OldGen space utilization over time from Foglight monitoring tool along with the Java verbose GC historical data. Following the discovery of the above problems, the decision was taken to move to the next phase of the RCA and perform a JVM heap dump analysis of the affected Weblogic (JVM) instances. JVM heap dump analysis ** A video explaining the following JVM Heap Dump analysis is now available here. In order to generate a JVM heap dump, the supported team did use the HotSpot 1.5 jmap utility which generated a heap dump file (heap.bin) of about ~1.5 GB. The heap dump file was then analyzed using the Eclipse Memory Analyzer Tool. Now let’s review the heap dump analysis so we can understand the source of the OldGen memory leak. MAT provides an initial Leak Suspects report which can be very useful to highlight your high memory contributors. For our problem case, MAT was able to identify a leak suspect contributing to almost 600 MB or 40% of the total OldGen space capacity. At this point we found one instance of java.util.LinkedList using almost 600 MB of memory and loaded to one of our application parent class loader (@ 0x7e12b708). The next step was to understand the leaking objects along with the source of retention. MAT allows you to inspect any class loader instance of your application, providing you with capabilities to inspect the loaded classes & instances. Simply search for the desired object by providing the address e.g. 0x7e12b708 and then inspect the loaded classes & instances by selecting List Objects > with outgoing references. As you can see from the above snapshot, the analysis was quite revealing. What we found was one instance of org.apache.openjpa.enhance.PCRegistry at the source of the memory retention; more precisely the culprit was the _listeners field implemented as a LinkedList. For your reference, the Apache OpenJPA PCRegistry is used internally to track the registered persistence-capable classes. Find below a snippet of the PCRegistry source code from Apache OpenJPA version 1.0.4 exposing the _listeners field. /** * Tracks registered persistence-capable classes. * * @since 0.4.0 * @author Abe White */ publicclass PCRegistry { // DO NOT ADD ADDITIONAL DEPENDENCIES TO THIS CLASS privatestaticfinal Localizer _loc = Localizer.forPackage (PCRegistry.class); // map of pc classes to meta structs; weak so the VM can GC classes privatestaticfinal Map _metas = new ConcurrentReferenceHashMap (ReferenceMap.WEAK, ReferenceMap.HARD); // register class listeners privatestaticfinal Collection _listeners = new LinkedList(); …………………………………………………………………………………… Now the question is why is the memory footprint of this internal data structure so big and potentially leaking over time? The next step was to deep dive into the _listeners LinkedLink instance in order to review the leaking objects. We finally found that the leaking objects were actually the JDBC & SQL mapping definitions (metadata) used by our application in order to execute various queries against our Oracle database. A review of the JPA specifications, OpenJPA documentation and source did confirm that the root cause was associated with a wrong usage of the javax.persistence.EntityManagerFactory such of lack of closure of a newly created EntityManagerFactory instance. If you look closely at the above code snapshot, you will realize that the close() method is indeed responsible to cleanup any recently used metadata repository instance. It did also raise another concern, why are we creating such Factory instances over and over… The next step of the investigation was to perform a code walkthrough of our application code, especially around the life cycle management of the JPA EntityManagerFactory and EntityManager objects. Root cause and solution A code walkthrough of the application code did reveal that the application was creating a new instance of EntityManagerFactory on each single request and not closing it properly. public class Application { @Resource private UserTransaction utx = null; // Initialized on each application request and not closed! @PersistenceUnit(unitName = "UnitName") private EntityManagerFactory emf = Persistence.createEntityManagerFactory("PersistenceUnit"); public EntityManager getEntityManager() { return this.emf.createEntityManager(); } public void businessMethod() { // Create a new EntityManager instance via from the newly created EntityManagerFactory instance // Do something... // Close the EntityManager instance } } This code defect and improver use of JPA EntityManagerFactory was causing a leak or accumulation of metadata repository instances within the OpenJPA _listeners data structure demonstrated from the earlier JVM heap dump analysis. The solution of the problem was to centralize the management & life cycle of the thread safe javax.persistence.EntityManagerFactory via the Singleton pattern. The final solution was implemented as per below: Create and maintain only one static instance of javax.persistence.EntityManagerFactory per application class loader and implemented via the Singleton Pattern. Create and dispose new instances of EntityManager for each application request. Please review this discussion from Stackoverflow as the solution we implemented is quite similar. Following the implementation of the solution to our production environment, no more Java heap OldGen memory leak is observed. Please feel free to provide your comments and share your experience on the same.

March 21, 2013

by Pierre - Hugues Charbonneau

· 9,266 Views

MongoDB: Replication Lag and the Facts of Life

Curator's Note: The content of this article was originally written over at the MongoLab blog. So you’re checking in on your latest awesome application one day — it’s really getting traction! You’re proud of its uptime record, thanks in part to the MongoDB replica set underneath it. But now … something’s wrong. Users are complaining that some of their data has gone missing. Others are noticing stuff they deleted has suddenly reappeared. What’s going on?!? Don’t worry… we’ll get to the bottom of this! In doing so, we’ll examine a source of risk that’s easy to overlook in a MongoDB application: replication lag — what it means, why it happens, and what you can do about it. Here’s what we’re going to cover: What is replication lag? Why is lag problematic? What causes a secondary to fall behind? How do I measure lag? How do I monitor for lag? What can I do to minimize lag? Tip #1: Make sure your secondary has enough horsepower Tip #2: Consider adjusting your write concern Tip #3: Plan for index builds Tip #4: Take backups without blocking Tip #5: Be sure capped collections have an _id field & a unique index Tip #6: Check for replication errors Don’t let replication lag take you by surprise. Continuing this cautionary tale… Seriously, wtf?! You were doing everything right! Using MongoDB with a well-designed schema and lovingly-tuned indexes, your application back-end has been handling thousands of transactions per second without breaking a sweat. You’ve got multiple nodes arranged in a replica set with no single point of failure. Your application tier’s Mongo driver connections are aware of the replica set and can follow changes in the PRIMARY node during failover. All critical writes are “safe” writes. Your app has been up without interruption for almost six months now! How could this have happened? This unsettling situation has the hallmarks of an insidious foe in realm of high-availability data stewardship: unchecked replication lag. Closely monitoring a MongoDB replica set for replication lag is critical. What is replication lag? As you probably know, like many data stores MongoDB relies on replication — making redundant copies of data — to meet design goals around availability. In a perfect world, data replication would be instantaneous; but in reality, thanks to pesky laws of physics, some delay is inevitable — it’s a fact of life. We need to be able to reason about how it affects us so as to manage around the phenomenon appropriately. Let’s start with definitions… For a given secondary node, replication lag is the delay between the time an operation occurs on the primary and the time that same operation gets applied on the secondary. For the replica set as a whole, replication lag is (for most purposes) the smallest replication lag found among all its secondary nodes. In a smoothly running replica set, all secondaries closely follow changes on the primary, fetching each group of operations from its oplog and replaying them approximately as fast as they occur. That is, replication lag remains as close to zero as possible. Reads from any node are then reasonably consistent; and, should the current primary become unavailable, the secondary that assumes the PRIMARY role will be able to serve to clients a dataset that is almost identical to the original. For a variety of reasons, however, secondaries may fall behind. Sometimes elevated replication lag is transient and will remedy itself without intervention. Other times, replication lag remains high or continues to rise, indicating a systemic problem that needs to be addressed. In either case, the larger the replication lag grows and the longer it remains that way, the more exposure your database has to the associated risks. Why is lag problematic? Significant replication lag creates failure modes that can be problematic for a MongoDB database deployment that is meant to be highly available. Here’s why: If your replica set fails over to a secondary that is significantly behind the primary, a lot of un-replicated data may be on the original primary that will need to be manually reconciled. This will be painful or impossible if the original primary is unrecoverable. If the failed primary cannot be recovered quickly, you may be forced to run on a node whose data is not up-to-date, or forced to take down your database altogether until the primary can be recovered. If you have only one secondary, and it falls farther behind than the earliest history retained in the primary’s oplog, your secondary will require a full resynchronization from the primary. During the resync, your cluster will lack the redundancy of a valid secondary; the cluster will not return to high availability until the entire data set is copied. If you only take backups from your secondary (which we highly recommend), backups must be suspended for the duration of the resync. Replication lag makes it more likely that results of any read operations distributed across secondaries will be inconsistent. A “safe” write with ‘w’ > 1 — i.e., requiring multiple nodes acknowledge the write before it returns — will incur latency proportional to the current replication lag, and/or may time out. Strictly speaking, the problem of replication lag is distinct from the problem of data durability. But as the last point above regarding multi-node write concern illustrates, the two concepts are most certainly linked. Data that has not yet been replicated is not completely protected from single-node failure; and client writes specified to be safe from single-node failure must block until replication catches up to them. What causes a secondary to fall behind? In general, a secondary falls behind on replication any time it cannot keep up with the rate at which the primary is writing data. Some common causes: Secondary is weak To have the best chance of keeping up, a secondary host should match the primary host’s specs for CPU, disk IOPS, and network I/O. If it’s outmatched by the primary on any of these specs, a secondary may fall behind during periods of sustained write activity. Depending on load this will, at best, create brief excursions in replication lag and, at worst, cause the secondary to fall irretrievably behind. Bursty writes In the wake of a burst of write activity on the primary, a secondary may not be able to fetch and apply the ops quickly enough. If the secondary is underpowered, this effect can be quite dramatic. But even when the nodes have evenly matched specs, such a situation is possible. For example, a command like: db.coll.update({x: 7}, {$set: {y: 42}, {multi: true} can place an untold number of separate “update” ops in the primary’s oplog. To keep up, a secondary must fetch those ops (max 4MB at a time for each getMore command!), read into RAM any index and data pages necessary to satisfy each _id lookup (remember: each oplog entry references a single target document by _id; the original query about “x” is never directly reflected the oplog), and finally perform the update op, altering the document and placing the corresponding entry into its oplog; and it must do all this in the same amount of time that the primary does merely the last step. Multiplied by a large enough number of ops, that disparity can amount to a noticeable lag. Map/reduce output A specific type of the extreme write burst scenario might be a command like: db.coll.mapReduce( ... { out: other_coll ... }) Index build It may surprise you to learn that, even if you build an index in the background on the primary, it will be built in the foreground on each secondary. There is currently no way to build indexes in the background on secondary nodes (cf. SERVER-2771). Therefore, whenever a secondary builds an index, it will block all other operations, including replication, for the duration. If the index builds quickly, this may not be a problem; but long-running index builds can swiftly manifest as significant replication lag. Secondary is locked for backup One of the suggested methods for backing up data in a replica set involves explicitly locking a secondary against changes while the backup is taken. Assuming the primary is still conducting business as usual, of course replication lag will climb until the backup is complete and the lock is released. Secondary is offline Similarly, if the secondary is not running or cannot reach the primary for whatever reason, it cannot make progress against the replication backlog. When it rejoins the replica set, the replication lag will naturally reflect the time spent away. How do I measure lag? Run the db.printSlaveReplicationInfo() command To determine the current replication lag of your replica set, you can use the mongo shell and run the db.printSlaveReplicationInfo() command. rs-ds046297:PRIMARY db.printSlaveReplicationInfo() source: ds046297-a1.mongolab.com:46297 syncedTo: Tue Mar 05 2013 07:48:19 GMT-0800 (PST) = 7475 secs ago (2.08hrs) source: ds046297-a2.mongolab.com:46297 syncedTo: Tue Mar 05 2013 07:48:19 GMT-0800 (PST) = 7475 secs ago (2.08hrs) More than 2 hours — whoa, isn’t that a lot? Maybe! See, those “syncedTo” times don’t have much to do with the clock on the wall; they’re just the timestamp on the last operation that the replica has copied over from the PRIMARY. If the last write operation on the PRIMARY happened 5 minutes ago, then yes: 2 hours is a lot. On the other hand, if the last op was 2.08 hours ago, then this is golden! To fill in that missing piece of the story, we can use the db.printReplicationInfo() command. rs-ds046297:PRIMARY db.printReplicationInfo() configured oplog size: 1024MB log length start to end: 5589secs (1.55hrs) oplog first event time: Tue Mar 05 2013 06:15:19 GMT-0800 (PST) oplog last event time: Tue Mar 05 2013 07:48:19 GMT-0800 (PST) now: Tue Mar 05 2013 09:53:07 GMT-0800 (PST) Let’s see … PRIMARY’s “oplog last event time” – SECONDARY’s “syncedTo” = 0.0. Yay. As fun as that subtraction may be, it’s seldom called for. If there is a steady flow of write operations, the last op on the PRIMARY will usually have been quite recent. Thus, a figure like “2.08 hours” should probably raise eyebrows; you would expect to see a nice low number there instead — perhaps as high as a few seconds. And, having seen a low number, there would be no need to qualify its context with the second command. Examine the “repl lag” graph in MMS You can also view recent and historical replication lag using the MongoDB Monitoring Service (MMS) from 10gen. On the Status tab of each SECONDARY node, you’ll find the repl lag graph: How do I monitor for lag? It is critical that the replication lag of your replica set(s) be monitored continuously. Since you have to sleep occasionally, this is a job best done by robots. It is essential that these robots be reliable, and that they notify you promptly whenever a replica set is lagging too far behind. Here are a couple ways you can make sure this is taken care of: If MongoLab is hosting your replica set, relax! For any multi-node, highly-available replica set we host for you, you can monitor replication lag in our UI and by default you will receive automated alerts whenever the replication lag exceeds 10 minutes. You can also set up an alert using the MMS system. Its exciting new features allow you to configure a replication lag alert: What can I do to minimize lag? Out of courtesy (for them or for ourselves), we would like to make those lag-monitoring automata’s lives as boring as possible. Here are some tips: Tip #1: Make sure your secondary has enough horsepower It’s not uncommon for people to run under-powered secondaries to save money — this can be fine if the write load is light. But in scenarios where the write load is heavy, the secondary might not be able to keep up with the primary. To avoid this, you should beef up your secondary so that it’s as powerful as your primary. Specifically, a SECONDARY node should have enough network bandwidth that it can retrieve ops from the PRIMARY’s oplog at roughly the rate they’re created and also enough storage throughput that it can apply the ops — i.e., read any affected documents and their index entries into RAM, and commit the altered documents back to disk — at that same rate. CPU rarely becomes a bottleneck, but it may need to be considered if there are many index keys to compute and insert for the documents that are being added or changed. Tip #2: Consider adjusting your write concern Your secondary may be lagging simply because your primary’s oplog is filling up faster than it can be replicated. Even with an equally-brawny SECONDARY node, the PRIMARY will always be capable of depositing 4MB in its memory-mapped oplog in a fraction of the time those same 4MB will need to make it across a TCP/IP connection. One viable way to apply some back-pressure to the primary might be to adjust your write concern. If you are currently using a write concern that does not acknowledge writes (aka “fire-and-forget” mode), you can change your write concern to require an acknowledgement from the primary (w:1) and/or a write to the primary’s journal (j:true). Doing so will slow down the rate at which the concerned connection can generate new ops needing replication. Other times it may be appropriate to use a ‘w’ > 1 or a ‘w’ set to “majority” to ensure that each write to the cluster is replicated to more than one node before the command returns. Requiring confirmation that a write has replicated to secondaries will effectively guarantee that those secondaries have caught up (at least up to the timestamp of this write) before the next command on the same connection can produce more ops in the backlog. As previously alluded to, choosing the most appropriate write concern for the data durability requirements of your application — or for particular critical write operations within the application — is something you must give thought to irrespective of the replication lag issue we’re focusing on here. But you should be aware of the interrelationship: just as the durability guarantee of w>1 can be used as a means of forcing a periodic “checkpoint” on replication, excessive replication lag can show up as a surprisingly high latency (or timeout) for that very occasional critical write operation where you’ve used “w: majority” to make sure it’s truly committed. Adjust to taste Having servers acknowledge every write can be a big hit to system throughput. If it makes sense for your application, you can amortize that penalty by doing inserts in batches, requiring acknowledgement only at the end of each batch. The smaller the batch, the greater the back-pressure on PRIMARY data creation rate, and correspondingly greater potential adverse impact to overall throughput. Don’t overdo it Using a large value for ‘w’ can itself be problematic. It represents a demand that w nodes finish working through their existing backlog before the command returns. So, if replication lag is high (in the sense of there being a large volume of data waiting to copy over) when the write command is issued, the command execution time will suffer a proportionally high latency. Also, if enough nodes go offline such that ‘w’ cannot be satisfied, you have effectively locked up your database. This is basically the opposite of “high availability.” Tip #3: Plan for index builds As mentioned earlier, an index build on a secondary is a foreground, blocking operation. If you’re going to create an index that is sizeable, perhaps you can arrange to do it during a period of low write activity on the primary. Alternately, if you have more than one secondary, you can follow the steps here to minimize the impact of building large indexes. Tip #4: Take backups without blocking Earlier we discussed the technique of locking the secondary to do a backup. There are other alternatives to consider here, including filesystem snapshots and “point-in-time” backups using the “--oplog” option of mongodump without locking. These are preferable to locking the secondary during a period of active writes if there’s any chance you’ll use the secondary for anything other than backups. Tip #5: Be sure capped collections have an _id field & a unique index Reliable replication is not possible unless there is a unique index on the _id field. Before MongoDB version 2.2, capped collections did not have an _id field or index by default. If you have a collection like this, you should create an index on the _id field, specifying unique: true. Failing to do this can, in certain situations, cause replication to halt entirely. So … this should not be regarded as optional. Tip #6: Check for replication errors If you see that replication lag is only increasing (and never falling), your replica set could be experiencing replication errors. To check for errors, run rs.status() and look at the errmsg field in the result. Additionally, check the log file of your secondary and look for error messages there. One specific example: if you see “RS102 too stale to catch up” in the secondary’s mongodb.log or in the errmsg field when running rs.status(), it means that secondary has fallen so far behind that there is not enough history retained by the primary (its “oplog size”) to bring it up to date. In this case, your secondary will require a full resynchronization from the primary. In general, though, what you do in response to an error depends on the error. Sometimes you can simply restart the mongod process for your secondary; but the majority of the time you will need to understand the root cause of the error before you can fix the problem. Don’t let replication lag take you by surprise. At the end of the day, replication lag is just one more source of risk in any high-availability system that we need to understand and design around. Striking the right balance between performance and “safety” of write operations is an exercise in risk management — the “right” balance will be different in different situations. For an application on a tight budget with occasional spikes in write volume, for example, you might decide that a large replication lag in the wake of those spikes is acceptable given the goals of the application, and so an underpowered secondary makes sense. At the opposite extreme, for an application where every write is precious and sacred, the required “majority” write concern will mean you have essentially no tolerance for replication lag above the very minimum possible. The good news is that MongoDB makes this all very configurable, even on an operation by operation basis. We hope this article has given you some insight into the phenomenon of replication lag that will enable you to reason about the risk it poses for a high-availability MongoDB application, and armed you with some tools for managing it. As always, let us know if we can help!

March 19, 2013

by Todd O. Dampier

· 44,127 Views · 1 Like

Algorithm of the Week: Aho-Corasick String Matching Algorithm in Haskell

let’s say you have a large piece of text and a dictionary of keywords. how do you quickly locate all the keywords? aho-corasick algorithm diagram well, there are many ways really, you could even iterate through the whole thing and compare words to keywords. but it turns out that’s going to be very slow. at least o(n_keywords * n_words) complexity. essentially you’re making as many passes over the text as your dictionary is big. in 1975 a couple of ibm researchers – alfred aho and margaret corasick – discovered an algorithm that can do this in a single pass. the aho-corasick string matching algorithm . i implemented it in haskell and it takes 0.005s to find 8 different keywords in oscar wilde’s the nightingale and the rose – a 12kb text. a quick naive keyword search implemented in python takes 0.023s . not a big difference practically speaking, but imagine a situation with megabytes of text and thousands of words in the dictionary. the authors mention printing out the result as a major bottleneck in their assessment of the algorithm. yep, printing . the aho-corasick algorithm at the core of this algorithm are three functions: the three functions of aho-corasick algorithm a parser based on a state machine, which maps (state, char) pairs to states and occasionally emits an output. this is called the goto function a failure function, which tells the goto function which state to jump into when the character it just read doesn’t match anything an output function, which maps states to outputs – potentially more than one per state the algorithm works in two stages. it will first construct the goto, failure and output functions. the complexity of this operation hinges solely on the size of our dictionary. then it iterates over the input text to produce all the matches. using state machines for parsing text is a well known trick – the real genius of this algorithm rests in that failure function if you ask me. it makes lateral transitions between states when the algorithm climbs itself into a wall. say you have she and hers in the dictionary. the goto machine eats your input string one character at the time. let’s say it’s already read s h . the next input is an e so it outputs she and reaches a final state. next it reads an r , but the state didn’t expect any more inputs, so the failure function puts us on the path towards hers . this is a bit tricky to explain in text, i suggest you look at the picture from the original article and look at what’s happening. my haskell implementation the first implementation i tried, relied on manully mapping inputs to outputs for the goto, failure and output functions by using pattern recognition. not very pretty, extremely hardcoded, but it worked and was easy to make. building the functions dynamically proved a bit trickier. type goto = map (int, char) int type failure = map int int type output = map int [string] first off, we build the goto function. -- builds the goto function build_goto::goto -> string -> (goto, string) build_goto m s = (add_one 0 m s, s) -- adds one string to goto function add_one::int -> goto -> [char] -> goto add_one _ m [] = m add_one state m (c:rest) | member key m = add_one (frommaybe 0 $ map.lookup key m) m rest | otherwise = add_one max (map.insert key max m) rest where key = (state, c) max = (size m)+1 essentially this builds a flattened prefix tree in a hashmap of (state, char) pairs mapping to the next state. it makes sure to avoid adding new edges to the three as much as possible. the reason it’s not simply a prefix tree are those lateral transitions; doing them in a tree would require backtracking and repeating of steps, so we haven’t achieved anything. once we have the goto function, building the output is trivial. -- builds the output function build_output::(?m::goto) => [string] -> output build_output [] = empty build_output (s:rest) = map.insert (fin 0 s) (list.filter (\x -> elem x dictionary) $ list.tails s) $ build_output rest -- returns the state in which an input string ends without using failures fin::(?m::goto) => int -> [char] -> int fin state [] = state fin state (c:rest) = fin next rest where next = frommaybe 0 $ map.lookup (state, c) ?m we are essentially going over the dictionary, finding the final state for each word and building a hash table mapping final states to their outputs. building the failure function was trickiest, because we need a way to iterate over the depths at which nodes are position in the goto state machine. but we threw that info away by using a hashmap. -- tells us which nodes in the goto state machine are at which traversal depth nodes_at_depths::(?m::goto) => [[int]] nodes_at_depths = list.map (\i -> list.filter (>0) $ list.map (\l -> if i < length l then l!!i else -1) paths) [0..(maximum $ list.map length paths)-1] where paths = list.map (path 0) dictionary we now have a list of lists, that tells us at which depth certain nodes are. -- builds the failure function build_fail::(?m::goto) => [[int]] -> int -> failure build_fail nodes 0 = fst $ mapaccuml (\f state -> (map.insert state 0 f, state)) empty (nodes!!0) build_fail nodes d = fst $ mapaccuml (\f state -> (map.insert state (decide_fail state lower) f, state)) lower (nodes!!d) where lower = build_fail nodes (d-1) -- inner step of building the failure function decide_fail::(?m::goto) => int -> failure -> int decide_fail state lower = findwithdefault 0 (s, c) ?m where (s', c) = key' state $ assocs ?m s = findwithdefault 0 s' lower -- gives us the key associated with a certain state (how to get there) key'::int -> [((int, char), int)] -> (int, char) key' _ [] = (-1, '_') -- this is ugly, being of maybe type would be better key' state ((k, v):rest) | state == v = k | otherwise = key' state rest here we are going over the list of nodes at depths and deciding what the failure should be for each depth based on the failures of depth-1. at depth zero, all failures go to the zeroth state. an important part of this process was inverting the goto hashmap so values point to keys, which is essentially what the key’ function does. finally, we can use the whole algorithm like this: main = do let ?m = fst $ mapaccuml build_goto empty dictionary let ?f = build_fail nodes_at_depths $ (length $ nodes_at_depths)-1 ?out = build_output dictionary print $ ahocorasick text a bit more involved than the usual example of haskell found online, it’s still pretty cool you can see the whole code on github here .

March 19, 2013

by Swizec Teller

· 22,037 Views

Producers and Consumers - Part 3 Poison Pills

A couple of weeks ago I wrote part 2 of a short series of blogs on the Producer Consumer pattern. This blog focused upon the need to close down my Teletype’s worker thread, fixing a bug in the original code from part 1 of the series. The idea here is that the Teletype’s worker thread can be controlled by a command from the application’s main thread. This command tells the worker thread to shutdown thus allowing the app the gracefully shutdown as demonstrated by the code below: @Override public void run() { while (run) { try { Message message = queue.take(); printHead.print(message.toString()); messageCount++; } catch (InterruptedException e) { printHead.print("Teletype closing down..."); } } printHead.print("Teletype Off."); } public void destroy() { run = false; thread.interrupt(); } In this sample, the main thread calls the destroy() method, which sets the run variable to false and interrupts the worker’s blocking call to queue.take(). However, there’s a problem with this idea in certain circumstances. For example, will suddenly terminating the consumer’s worker thread cause problems in other parts of the system? Will there be data loss as important messages in the queue don’t get processed? If the answer to these questions is ‘yes’ then there’s another approach you can take: use a Poison Pill. Poison Pill is a rather melodramatic name for simply placing a certain, known, data item on the queue and when the consumer reads this item it closes down. Obviously, the poison pill has to be the last item placed on the queue or else the consumer will shut down prematurely. This idea is great in simple systems with only one producer and consumer as shown below: ...but takes a little more thought when there are multiple producers with a single consumer as in my football match updates scenario: ...and could fall apart completely in the case of multiple produces and consumers: ... as ensuring that each consumer receives a poison pill at the right time and all the data in the queue gets processed could be quite tricky. In this blog I’m updating my Teletype code to shut itself down once the two MatcherReporters have sent all their data. The first thing to do is to decide on the message that will act as a poison pill. In the snippet below you can see that I’ve inserted a message that contains the text “END OF FILE” at the end of the match update stream. 95:30 END OF FILE 95:00 Final Score Fulham 0 - 1 Man Utd 94:59 Full time The referee signals the end of the game. I’ve inserted one of these messages into each set of game data. The next thing to do is to modify the Teletype code adding a check for the poison pill message: public class Teletype implements Runnable { private static final String POISON_PILL_MESSAGE = "END OF FILE"; private final BlockingQueue queue; private final PrintHead printHead; private final int matchesPlayed; private volatile boolean run = true; private int pillsRecieved; public Teletype(PrintHead printHead, BlockingQueue queue, int matchesPlayed) { this.queue = queue; this.printHead = printHead; this.matchesPlayed = matchesPlayed; } public void start() { Thread thread = new Thread(this, "Studio Teletype"); thread.start(); printHead.print("Teletype Online."); } @Override public void run() { while (run) { try { Message message = queue.take(); handleMessage(message); } catch (InterruptedException e) { printHead.print("Teletype closing down..."); } } printHead.print("Teletype Off."); } private void handleMessage(Message message) { if (allGamesAreOver(message.getMessageText())) { run = false; } else { printHead.print(message.toString()); } } private boolean allGamesAreOver(String messageText) { if (POISON_PILL_MESSAGE.equals(messageText)) { pillsRecieved++; } return pillsRecieved == matchesPlayed ? true : false; } @VisibleForTesting boolean isRunning() { return run; } } One of the most significant changes here is the addition of the matchesPlayed instance variable. This variable tells the Teletype how many MatchReporters there are supplying it with data. Ultimately this breaks the Producer Consumer pattern in that the consumer now knows about the rest of the system; however, it’s necessary because we need to ensure that the Teletype shuts down at the end of all the data. In a single producer/consumer one to one system this isn’t necessary. The other big change in the Teletype code is to the run() loop. Once a message has been retrieved from the queue it’s passed to the new handleMessage(...) method. The handleMessage(...) method checks whether or not all the games it’s receiving data from are over by calling allGamesAreOver(...), which checks the message text against the poison pill string. If the message test is the poison pill string then the pillsRecieved counter is updated. If the pillsRecieved equals the matchesPlayed variable then all the all the games are over and allGamesAreOver(...) returns true. This sets the run instance variable to false and the worker thread’s run() method exits. So that’s about it, the melodramatic Poison Pill pattern in a nutshell, next time Murder in the Red Barn. The code for this sample is available on GitHub.

March 18, 2013

by Roger Hughes

· 30,532 Views · 2 Likes

In-Memory Data Grids

Introduction The IT buzzword of 2012 is without a doubt Big Data. It’s new and here to stay, and for a good reason. Big data is data that exceeds the processing capacity of conventional database systems. Great examples are CERN with the Large Hadron Collider, whose experiments generate 25 petabytes of data annually, or Walmart, which handles more than one million customer transaction every hour. Problems These vast amounts of data leave us with two problems. Problem 1: To gain value from this data, one must choose an alternative way to process it. The value of big data to an organization falls into two categories: analytical use, and enabling new products. Big data analytics can reveal insights hidden previously by data too costly to process, such as peer influence among customers, revealed by analyzing shoppers’ transactions, social and geographical data. Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data, in contrast to the somewhat static nature of running predetermined reports. Problem 2: The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. Remember the CERN case where the LHC produces over 25 Petabytes of data annually? No “classic” database architecture or setup is capable of holding these amounts of data. Solutions Fortunately, both problems can be solved by implementing the correct infrastructure and rethinking data storage. There are two critical factors in Big Data environments: size and speed. We already discussed the vast amounts of data and desire to be able to access and process the data fast. The latter is the main differentiator from more traditional data warehouses. Just imagine what you can do when you can access all your data real-time. Enter big data. A common Big Data implementation is an in-memory data grid that lives in a distributed cluster, ensuring both speed, by storing data in-memory, and capacity by using scalability features provided by a cluster. As a bonus, availability is ensured by using a distributed cluster. As for the data storage, there are typically two kinds: in-memory databases and in-memory data grids. But first some background. It is not a new attempt to use main memory as a storage area instead of a disk. In our daily lives there are numerous examples of main memory databases (MMDB), as they perform much faster than disk-based databases. An every day example is a mobile phone. When you SMS or call someone most mobile service providers use MMDB to get the information on your contact as soon as possible. The same applies to your phone. When someone calls you, the caller details are looked up in the contacts application, usually providing a name and sometimes a picture. In memory data grids In Memory Data Grid (IMDG) is the same as MMDB in that it stores data in main memory, but it has a totally different architecture. The features of IMDG can be summarized as follows: Data is distributed and stored on multiple servers. Each server operates in the active mode. A data model is usually object-oriented (serialized) and non-relational. According to the necessity, you often need to add or reduce servers. No traditional database features such as tables. In other words, IMDG is designed to store data in main memory, ensure scalability and store an object itself. These days, there are many IMDG products, both commercial and open source. Some of the most commonly used products are: Hazelcast (http://www.hazelcast.com) JBoss Infinispan (http://www.jboss.org/infinispan) GridGain DataGrid (http://www.gridgain.com/features/in-memory-data-grid/) VMware Gemfire (http://www.vmware.com/nl/products/application-platform/vfabric-gemfire/overview.html) Oracle Coherence (http://www.oracle.com/technetwork/middleware/coherence/overview/index.html) Gigaspaces XAP (http://www.gigaspaces.com/datagrid) Terracotta Enterprise Suite (http://terracotta.org/products/enterprise-suite) Why Memory? The main reasons for using main memory for data storage are once again the two main themes of Big Data: speed and capacity. The processing performance of main memory is 800 times faster than an HDD and up to 40 times faster than an. Moreover, the latest x86 server supports main memory of hundreds of GB per server. It is said that the limit of a traditional processing database’s (OLTP) data capacity is approximately 1 TB and that the OLTP processing data capacity would not increment well. If servers using main memory of 1 TB or larger become more commonly used, you will be able to conduct operations with the entire data placed in main memory, at least in the field of OLTP. IMDG Architecture To use main memory as a storage area, two weak points should be overcome: Limited capacity: involves data that exceeds the maximum capacity of the main memory of the server Reliability: involves data loss in case of a (system) failure. IMDG overcomes the limit of capacity by ensuring horizontal scalability using a distributed architecture, and resolves the issue of reliability through a replication system as part of the grid (or a distributed cluster). Now let’s discuss how an IMDG actually works. First of all, it is important to understand that an IMDG is not the same as an in-memory database, also referred to as MMDB (main memory databases). Typical examples of MMDBs are Oracle TimesTen or Sap Hana. MMDBs are full database products that simply reside in memory. As a result of being a full-blown database, they also carry the weight and overhead of database management features. IMDG is different. No tables, indexes, triggers, stored procedures, process managers etc. Just plain storage. The data model used in IMDG is key-value pairs. A key-value pair is a list with only two parts: a key and a value. The key can be used for storing and retrieving the values in the list. A key can be compared to the index or primary key of a table in a database. Note that IMDG are closely tied to development environments such as Java as the key-value pairs are represented by the structures provided by such a programming environment. Most IMDGs are written in Java, and can only be used within other Java applications. Therefore, the values of key-value pairs can be anything supported by Java, ranging from simple data types such as a string or number, to complex objects. This overcomes the two important hurdles: as you can store complex Java objects as value, there’s no need to translate these objects into a relational datamodel (which is the case in more traditional applications using a database for storage). Furthermore, the seeming limitation of being able to store only one value per key, is actually no limitation at all. Large memory sizes Most of the products introduced above use Java as an implementation language. Java reserves and uses a part of the RAM (internal memory) for dynamic memory allocation. This reserved memory space is called the Java heap. All runtime objects created by a Java application are stored in heap. Using large amounts of data causes two problems. Size limitation: By default, the heap size is 128 MB, but for current business applications, this limit is reached easily. Once the heap is “full”, no new objects can be created and the Java application will show some nasty errors. Performance: It is possible to increase the size of the heap, but this introduces some new problems. When a heap reaches a size of more than 4 gigabytes, Java will have serious issues with memory managements, causing your application to slow down or even freeze. Java has a feature called Garbage Collector, which periodically scans the heap and checks each object if it is still valid and being used. If not, the garbage collector removes the object and defragments the newly available space. The problem is, the larger the heap size, the more work to do for the garbage collector, resulting in performance degradation. Imagine a large bank has a Java application that manages customers, accounts and transactions. We have seen that an IMDG allows the application to store and access all data very quickly by caching it in memory, instead of storing the data in relatively slow databases. Let’s assume the combined data has a size of 40 gigabytes. Storing it in heap is simply not possible, considering the performance penalties of Java’s memory management capabilities. The graph below illustrates the garbage collection pause time when placing cached data in heap: Terracotta’s BigMemory product has a method to overcome these limitations. The method is to use an off-heap memory (direct buffer). Data will not be stored in Java’s heap, but directly in the available internal memory (RAM). Since, this is not subject to Java’s garbage collector, there are no performance penalties. The differences on performance are significant, as can be seen in the graph below: Using off-heap storage has some major benefits: You can use all the available memory on your machine, not just the memory that is allocated to the heap (usually less that 512 Mb). This allows you to store more data in a in-memory data grid, greatly speeding up your application. The heap can be relieved by storing data in native memory, speeding up Java applications as less heap space has to be garbage collected. Clustering, fail over and high availability So far, we have seen IMDG features that are applicable to a single server. However, the real power of IMDG lies in it’s networking and clustering capabilities, providing features as data replication, data synchronization between clients, fail over and high availability. To achieve this, a cluster of servers (or server array) acts a backbone of the infrastructure. Applications (that still can have their own IMDG or off-heap cache) that are connected to the cluster can share, replicate and backup their data with either the cluster or other applications. The graph below depicts a typical setup using Terracotta's BigMemory: The caches on the application servers are usually referred to as “level 1” cache, while the data cache on the server array is referred to as “level 2” cache. There are many different scenarios possible for storing, clustering, synchronizing and replicating data. Covering all these topics goes far beyond the scope of this article. For more information, consult the technical documentation of the product of your choice. Conclusion Big Data brings us some new challenges. First of all, storing and accessing vast amounts of data makes us rethink traditional methods and technologies. Next, there’s the question what to do with all the available data. The potential value for marketing, financial and other businesses is huge. In order to facilitate Big Data, in-memory data grids are considered the best option. IMDGs with off-heap storage are even more powerful, allowing data centric enterprise application to overcome certain limits of the Java platform, such as memory and performance constraints. As the amount of data that (large) companies produce and store, grows exponentially, databases will hit a limit. Accessing your data without a performance penalty simply will not be possible. The answer to this is using an IMDG.

March 13, 2013

by Roy Prins

· 32,669 Views · 5 Likes

Database Concepts for a java Dev: Database Normalization

In this part, I will be briefing about different types of Database Normalizations using a sample data model. What is Database Normalization? Normalization is the process of efficiently organizing data in the database. Primary Goal of Normalization? Eliminating redundant data & ensuring meaningful data dependencies. Types of Normalization The following are the three most common normal forms in the database normalization process First Normal Form (1NF) Second Normal Form (2NF) Third Normal Form (3NF) Sample Data Model for Demonstration The following data model will be used to demonstrate all the three normal forms First Normal Form (1NF) First Normal Form (1NF) sets the very basic rules for an organized database: Create separate set of tables for each group of related data and identify each row with a unique columns [primary key] or set of columns [composite key] Eliminate duplicate columns from the table The following data model depicts the tables after 1NF rules are applied - Second Normal Form (2NF) Second Normal Form (2NF) further addresses the concept of removing duplicate data: Meet all the requirements of the first normal form Remove subsets of data that apply to multiple rows of a table and place them in separate tables Create relationships between these new tables and their predecessors through the use of foreign keys So basically the objective of the Second Normal Form is to take that is only partly dependent on the primary key and enter that data into another table. The following data model depicts the tables after 2NF rules are applied. Data from EMPLOYEE_TABLE is split into 2 tables – EMPLOYEE_TABLE and EMPLOYEE_HR_TABLE. Similarly data from CUSTOMER_TABLE is moved to CUSTOMER_TABLE and CUSTOMER_ORDER table Third Normal Form (3NF) Third normal form (3NF) goes one large step further: Meet all the requirements of the second normal form. Remove columns that are not dependent upon the primary key. The following data model depicts the tables after 3NF rules are applied. Further state and country details are moved to their own tables because they are not dependent on the primary key. Advantages of Normalizing the Database There are several advantages of normalization - Data can be stored as small atomic pieces Saves space Increases speed Reduces data anomalies Easy maintenance Other parts of this series include: Part 1 – ACID Properties Part 2 – Keys Part 4 – Database Transactions [coming soon] Part 5 – Indexes [coming soon]

March 13, 2013

by Jagadeesh Motamarri

· 10,962 Views · 1 Like

Where is My Datastore in Hyper-V? Server Virtualization - Part 4

The term 'datastore’ is one that many of you who work with VMware are familiar, but which doesn’t really translate to the world of Microsoft’s Hyper-V. “Since Hyper-V does not require a different formatting of the underlying physical disk structure like VMFS(VMware’s proprietary disk format) we are able to browse the ‘datastore’ with File Explorer(In Windows 8/Server 2012…formerly known as Windows Explorer).” “Who said that?” That quote was from my friend Tommy Patterson, who writes about datastores and how they compare to the file system structures used in Hyper-V in Part 4 of our “20+ Days of Server Virtualization” series. In his article he describes the locations of the various components that define and make up a Hyper-V virtual machine, and even provides a script to help you quickly locate filesystem locations for your virtual machine bits. READ HIS ARTICLE HERE

March 9, 2013

by Kevin Remde

· 9,080 Views

SAP Integration with Talend Components / Connectors (BAPI, RFC, IDoc, BW, SOAP)

talend has several connectors to integrate sap systems. however, this guide is no introduction to talend’s sap components. instead, this guide helps to understand different alternatives to integrate sap systems with talend set up a local sap system configure talend studio for using sap components use talend’s sap wizard run a first talend job which connects to sap all further required information and example use cases for talend’s sap components should be available in the talend component guide at www.help.talend.com . if that’s not the case, please create a jira documentation ticket ( https://jira.talendforge.org/browse/doct )! now let’s take a look at different alternatives for integration of sap systems with talend. alternatives for sap integration three protocols exist for communication between sap and external programs: dynamic information and action gateway (diag): e.g. used by sap gui remote function call (rfc): a function call with input and output parameters (like a java interface) hypertext transfer protocol (http): internet standard the following alternatives are available for integrating sap systems using some of these protocols. file sap supports the direct import of files (call-transaction-program, batch-input, direct input). files have to be in a specific format to be imported. transformation and integration can be realized with talend’s various file components such as tfileinputdelimited. rfc remote function call is the proprietary sap ag interface for communication between a sap system and other sap or third-party compatible system over tcp/ip or cpi-c connections. remote function calls may be associated with sap software and abap programming, and provide a way for an external program (written in languages such as php, asp, java, or c, c++) to use data returned from the server. data transactions are not limited to getting data from the server, but can insert data into server records as well. sap can act as the client or server in an rfc call. a remote function call (rfc) is the call or remote execution of a remote function module in an external system. in the sap system, these functions are provided by the rfc interface system. the rfc interface system enables function calls between two sap systems, or between a sap system and an external system. tsapinput and tsapoutput are talend’s components to use rfcs. business application programming interface (bapi) a bapi is an object-oriented view on most data and transactions of a sap system (called “business objects”). object types of the business objects are stored in the business object repository (bor). bapis are always implemented as rfcs and therefore can be called the same way. additionally, they have the following characteristics (compared to rfcs): stable interface no view layer no exceptions, instead export parameter: “return” most business objects offer the following standard bapis: getlist getdetail change creationfromdata tsapinput and tsapoutput are talend’s components to use bapis. application link enabling (ale) application link enabling (ale) is used for asynchronous messaging between different systems via “intermediate documents (idoc)”. idoc is a sap document format for business transaction data transfers. it is used to realize distributed business processes. idoc is similar to xml in purpose, but differs in syntax. both serve the purpose of data exchange and automation in computer systems, but the idoc technology takes a different approach. while xml allows having some metadata about the document itself, an idoc is obligated to have information at its header like its creator, creation time, etc. while xml has a tag-like tree structure containing data and meta-data, idocs use a table with the data and meta-data. idocs also have a session that explains all the processes which the document passed or will pass, allowing one to debug and trace the status of the document. an idoc consists of control record (it contains the type of idoc, port of the partner, release of sap r/3 which produced the idoc, etc.) data records of different types. the number and type of segments is mostly fixed for each idoc type, but there is some flexibility (for example an sd order can have any number of items). status records containing messages such as 'idoc created', 'the recipient exists', 'idoc was successfully passed to the port', 'could not book the invoice because...' different idoc types are available to handle different types of messages. for example, the idoc format orders01 may be used for both purchase orders and order confirmations. tsapidocinput and tsapidocoutput are talend’s components to use ale / idoc. bapis can also be called asynchronously via ale. all new idocs are even based on bapis. soap web services sap supports soap web services. not just sap as java, but also sap as abap! integration can be realized with talend’s esb / web service components such as tesbrequest, tesbresponse, or tesbconsumer. installation of sap server and client installation can take about 6 to 8 hours, but it is an “all in one installation”, i.e. you can install it overnight. steps for installation: get yourself a windows 7 64 bit laptop or vm with 8+ gb ram and 50+gb free disc space get a sap community account (for free, just register): http://scn.sap.com/welcome download sap netweaver (software downloads --> sap netweaver main releases: http://www.sdn.sap.com/irj/scn/nw-downloads download current version of sap netweaver application server abap 64-bit trial install sap server: follow installation guide – a html website included in the download in root of extracted download folder (start.htm --> there click on “installation” link) install sap gui (rich client frontend): start.htm --> there click on “install sap gui” link and follow instructions download the sap jco for the operating system on which your connector is running. the sap jco is available for download from sap's website at http://service.sap.com/connectors . you must have an sapnet account to access the sap jco (if you do not already have one, contact your local sap basis administrator). usage of sap server hint: you have to use a windows user which has a password (as you need to enter windows credentials when stopping sap). if you have a windows user without a password (for instance if you use windows within a vm on your mac), sap cannot process these credentials (i.e. it cannot process an empty password field) --> change your windows password before starting sap start the management console (windows startmenu --> programs --> sap management console) start and stop the sap server (right click on “nsp” --> start / stop) default user: sap* (sap system super user) password: the one which you entered at installation of sap netweaver, e.g. admin123 usage of sap client a sap client should be used to get information about the sap system (functions, data, etc.) similarly to using e.g. mysql workbench to get information from a mysql database. sap gui (view layer) communicates with sap as abap (business logic layer). the application server communicates with the relational database (db layer). different clients are available for sap: sap gui windows sap gui java web browser external rfc-program for local development demos, sap gui windows is probably the best alternative. start sap gui windows by: clicking shortcut “windows start menu --> sap frontend --> sap logon” entering username and password clicking logon sap transactions in sap, you call sap programs via sap transaction codes. important transactions codes are for example: bapi: bapi explorer, view all sap bapi's se16: data browser, view/add table data se38: program editor here is a list of several other important transaction codes: http://www.sapdev.co.uk/tcodes/tcodes.htm installation of demo data the sap installation includes some demo data. as most people do not want to install “real” sap modules such as sap fi, sap crm or sap bi on their local system, this demo data is perfect for demos using talend’s sap connectors. to install the flight demo on a local sap system, you just have to open the abap editor (transaction: se38) and execute the program sapbc_data_generator. this program generates example data within the flight tables and does some further initializations. here is a good tutorial with more information and how to test the flight application: http://help.sap.com/saphelp_erp60_sp/helpdata/de/db/7c623cf568896be10000000a11405a/content.htm configuration of talend studio to use sap components talend’s sap components are already included in the studio. however, two further steps are required to be able to use them: copy sapjco3.dll to the directory c:/windows/system32 sap java connector jar must be added copy sapjco3.jar to the directory “talend/studio/lib/java” (re-) start talend studio check if sap library is added successfully open view “talend modules” (eclipse --> windows --> show view --> talend --> modules) sort by column “context” look for “tsap*” contexts and check if sapjco3.jar has status “installed” usage of sap components with talend studio this section describes how to use talend’s sap components and the sap wizard in general (using one specific example for calling a bapi). detailed descriptions of all sap components (for using bapis, rfcs, idocs, bw, etc.) are available in the documentation talend_components_rg_x.y.z.pdf at www.help.talend.com . connection to a sap system a connection to a sap system can be done “built-in” or via “metadata --> sap connections” (the latter only in enterprise version). using the latter has several advantages: reuse connection configuration quick check if connection to sap works wizards for retrieving functions from sap (instead of handwriting without wizard) quick test with test parameters if function works before finishing development lifecycle for a sap job development lifecycle for sap job: create connection (if not existing yet) right click on metadata --> sap connection create sap connection follow wizard sap jco version: 3 client: “001” userid: “sap*” password: “admin123” --> as you defined it while installation language: “en” hostname: “localhost” system number: “00” retrieve function (bapi / rfc) right click on created connection click on “retrieve sap function” enter search filter (e.g. bapi_fl*) click on “search” select and double click on your function (e.g bapi_flcust_getlist) you see all input, output and table parameters for this sap function click on “test in” --> here you see parameters in more detail: you now have to define which input and output parameters you want to use --> remove all other by selecting them and clicking “remove” button hint: if you do not remove an input parameter, you usually have to enter a value for it! select the output type - can be a single (single record), a table (list of records), or a structure output hint: difference between table and structure in sap: http://www.sapfans.com/forums/viewtopic.php?f=12&t=119794 if you want to do a quick test: enter values for input parameters (if there are any for your function call), then click “launch” button in this example, there is only an optional input parameter max_rows you should see data in the output fields in this example, you see the record with custname “sap ag” and street “neurottstr. 16” click “finish” button under “metadata --> sap connections --> “your connection” --> sap functions: there you can now see your function (in this example: bapi_flcust_getlist) create sap job drag&drop the created function into a job (without the wizard, you also can enter all data by hand) tsapinput component is proposed automatically. click ok to add it to your job go to “initialize input” and add parameter values in this example, there is just the parameter “max_rows” hint: the parameter value can be changed from a hardcoded value to a variable, of course (just click control space on your keyboard to get access to all available variables via code completion in your studio) go to the tsapinput component and add the desired output mapping (i.e. which values you want to process further with other components scroll to the bottom to “outputs” add the correct table / structure name (in this example: "customer_list") click on mapping (which is empty and has to be filled) click on “mapping”, then click on “…” add the wanted output columns of your sap function add the same names at the column “schema xpathqueries” (do not forget the double quotes here!) click “ok” button connect the tsapinput component to a tlogrowcomponent and synchronize the schema hint: always try out if this works before adding further logic to your job! run and test your job (you will see five rows logged (as you have configured max_rows = 5 that's it. now enjoy talend's sap components :-) best regards, kai wähner (twitter: @kaiwaehner) content from my blog: http://www.kai-waehner.de/blog/2013/03/03/sap-integration-with-talend-components-connectors-bapi-rfc-idoc-bw-soap/

March 4, 2013

by Kai Wähner

CORE

· 32,916 Views · 1 Like

Understanding TCP/IP Network Stack & Writing Network Apps

We cannot imagine Internet service without TCP/IP. All Internet services we have developed and used at NHN are based on a solid basis, TCP/IP. Understanding how data is transferred via the network will help you to improve performance through tuning, troubleshooting, or introduction to a new technology. This article will describe the overall operation scheme of the network stack based on data flow and control flow in Linux OS and the hardware layer. Key Characteristics of TCP/IP How should I design a network protocol to transmit data quickly while keeping the data order without any data loss? TCP/IP has been designed with this consideration. The following are the key characteristics of TCP/IP required to understand the concept of the stack. TCP and IP Technically, since TCP and IP have different layer structures, it would be correct to describe them separately. However, here we will describe them as one. 1. Connection-oriented First, a connection is made between two endpoints (local and remote) and then data is transferred. Here, the "TCP connection identifier" is a combination of addresses of the two endpoints, having type. 2. Bidirectional Byte Stream Bidirectional data communication is made by using byte stream. 3. In-order Delivery A receiver receives data in the order of sending data from a sender. For that, the order of data is required. To mark the order, 32-bit integer data type is used. 4. Reliability through ACK When a sender did not receive ACK (acknowledgement) from a receiver after sending data to the receiver, the sender TCP re-sends the data to the receiver. Therefore, the sender TCP buffers unacknowledged data from the receiver. 5. Flow Control A sender sends as much data as a receiver can afford. A receiver sends the maximum number of bytes that it can receive (unused buffer size, receive window) to the sender. The sender sends as much data as the size of bytes that the receiver's receive window allows. 6. Congestion Control The congestion window is used separately from the receive window to prevent network congestion by limiting the volume of data flowing in the network. Like the receive window, the sender sends as much data as the size of bytes that the receiver's congestion window allows by using a variety of algorithms such as TCP Vegas, Westwood, BIC, and CUBIC. Different from flow control, congestion control is implemented by the sender only. Data Transmission As indicated by its name, a network stack has many layers. The following Figure 1 shows the layer types. Figure 1: Operation Process by Each Layer of TCP/IP Network Stack for Data Transmission. There are several layers and the layers are briefly classified into three areas: User area Kernel area Device area Tasks at the user area and the kernel area are performed by the CPU. The user area and the kernel area are called "host" to distinguish them from the device area. Here, the device is the Network Interface Card (NIC) that sends and receives packets. It is a more accurate term than the commonly used "LAN card". Let's take a look at the user area. First, the application creates data to send (the "User data" box in Figure 1) and then calls the write() system call to send the data. Assume that the socket (fd in Figure 1) has been already created. When the system call is called, the area is switched to the kernel area. POSIX-series operating systems including Linux and Unix expose the socket to the application by using a file descriptor. In the POSIX-series operating system, the socket is a kind of a file. The file layer executes a simple examination and calls the socket function by using the socket structure connected to the file structure. The kernel socket has two buffers. One is the send socket buffer for sending; And the other is the receive socket buffer for receiving. When the write system call is called, the data in the user area is copied to the kernel memory and then added to the end of the send socket buffer. This is to send data in order. In the Figure 1, the light-gray box refers to the data in the socket buffer. Then, TCP is called. There is the TCP Control Block (TCB) structure connected to the socket. The TCB includes data required for processing the TCP connection. Data in the TCB are connection state (LISTEN, ESTABLISHED, TIME_WAIT),receive window, congestion window, sequence number, resending timer, etc. If the current TCP state allows for data transmission, a new TCP segment (in other words, a packet) is created. If data transmission is impossible due to flow control or such a reason, the system call is ended here and then the mode is returned to the user mode (in other words, the control is passed to the application). There are two TCP segments as shown in Figure 2: TCP header; And payload. Figure 2: TCP Frame Structure (source). The payload includes the data saved in the unacknowledged send socket buffer. The maximum length of the payload is the maximum value among the receive window, congestion window, and maximum segment size (MSS). Then, TCP checksum is computed. In this checksum computation, pseudo header information (IP addresses, segment length, and protocol number) is included. One or more packets can be transmitted according to the TCP state. In fact, since the current network stack uses the checksum offload, the TCP checksum is computed by NIC, not by the kernel. However, we assume that the kernel computes the TCP checksum for convenience. The created TCP segment goes down to the IP layer. The IP layer adds an IP header to the TCP segment and performs IP routing. IP routing is a procedure of searching the next hop IP in order to go to the destination IP. After the IP layer has computed and added the IP header checksum, it sends the data to the Ethernet layer. The Ethernet layer searches for the MAC address of the next hop IP by using the Address Resolution Protocol (ARP). It then adds the Ethernet header to the packet. The host packet is completed by adding the Ethernet header. After IP routing is performed, the transmit interface (NIC) is known as the result of IP routing. The interface is used for transmitting a packet to the next hop IP and the IP. Therefore, the transmit NIC driver is called. At this time, if a packet capture program such as tcpdump or Wireshark is running, the kernel copies the packet data onto the memory buffer that the program uses. In that way, the receiving packet is directly captured on the driver. Generally, the traffic shaper function is implemented to run on this layer. The driver requests packet transmission according to the driver-NIC communication protocol defined by the NIC manufacturer. After receiving the packet transmission request, the NIC copies the packets from the main memory to its memory and then sends it to the network line. At this time, by complying with the Ethernet standard, it adds the IFG (Inter-Frame Gap), preamble, and CRC to the packet. The IFG and preamble are used to distinguish the start of the packet (as a networking term, framing), and the CRC is used to protect the data (the same purpose as TCP and IP checksum). Packet transmission is started based on the physical speed of the Ethernet and the condition of Ethernet flow control. It is like getting the floor and speaking in a conference room. When an NIC sends a packet, the NIC generates interrupts on the host CPU. Every interrupt has its own interrupt number and the OS searches an adequate driver to handle the interrupt by using the number. The driver registers a function to handle the interrupt (an interrupt handler) when the driver is started. The OS calls the interrupt handler and then the interrupt handler returns the transmitted packet to the OS. So far we have discussed the procedure of data transmission through the kernel and the device when an application performs write. However, without a direct write request from the application, the kernel can transmit a packet by directly calling TCP. For example, when an ACK is received and the receive window is expanded, the kernel creates a TCP segment including the data left in the socket buffer and sends the TCP segment to the receiver. Data Receiving Now, let's take a look at how data is received. Data receiving is a procedure for how the network stack handles a packet coming in. Figure 3 shows how the network stack handles a packet received. Figure 3: Operation Process by Each Layer of TCP/IP Network Stack for Handling Data Received. First, the NIC writes the packet onto its memory. It checks whether the packet is valid by performing the CRC check and then sends the packet to the memory buffer of the host. This buffer is a memory that has already been requested by the driver to the kernel and allocated for receiving packets. After the buffer has been allocated, the driver tells the memory address and size to the NIC. When there is no host memory buffer allocated by the driver even though the NIC receives a packet, the NIC may drop the packet. After sending the packet to the host memory buffer, the NIC sends an interrupt to the host OS. Then, the driver checks whether it can handle the new packet or not. So far, the driver-NIC communication protocol defined by the manufacturer is used. When the driver should send a packet to the upper layer, the packet must be wrapped in a packet structure that the OS uses for the OS to understand the packet. For example, sk_buff of Linux, mbuf of BSD-series kernel, and NET_BUFFER_LIST of Microsoft Windows are the packet structures of the corresponding OS. The driver sends the wrapped packets to the upper layer. The Ethernet layer checks whether the packet is valid and then de-multiplexes the upper protocol (network protocol). At this time, it uses the ethertype value of the Ethernet header. The IPv4 ethertype value is 0x0800. It removes the Ethernet header and then sends the packet to the IP layer. The IP layer also checks whether the packet is valid. In other words, it checks the IP header checksum. It logically determines whether it should perform IP routing and make the local system handle the packet, or send the packet to the other system. If the packet must be handled by the local system, the IP layer de-multiplexes the upper protocol (transport protocol) by referring to the proto value of the IP header. The TCP proto value is 6. It removes the IP header and then sends the packet to the TCP layer. Like the lower layer, the TCP layer checks whether the packet is valid. It also checks the TCP checksum. As mentioned before, since the current network stack uses the checksum offload, the TCP checksum is computed by NIC, not by the kernel. Then it searches the TCP control block where the packet is connected. At this time, of the packet is used as an identifier. After searching the connection, it performs the protocol to handle the packet. If it has received new data, it adds the data to the receive socket buffer. According to the TCP state, it can send a new TCP packet (for example, an ACK packet). Now TCP/IP receiving packet handling has completed. The size of the receive socket buffer is the TCP receive window. To a certain point, the TCP throughput increases when the receive window is large. In the past, the socket buffer size had been adjusted on the application or the OS configuration. The latest network stack has a function to adjust the receive socket buffer size, i.e., the receive window, automatically. When the application calls the read system call, the area is changed to the kernel area and the data in the socket buffer is copied to the memory in the user area. The copied data is removed from the socket buffer. And then the TCP is called. The TCP increases the receive window because there is new space in the socket buffer. And it sends a packet according to the protocol status. If no packet is transferred, the system call is terminated. Network Stack Development Direction The functions of network stack layers described so far are the most basic functions. The network stack in the early 1990s had few more functions than the functions described above. However, the latest network stack has many more functions and complexity as the network stack implementation structure gets higher. The latest network stack is classified by purpose as follows. Packet Processing Procedure Manipulation It is a function like Netfilter (firewall, NAT) and traffic control. By inserting the user-controllable code to the basic processing flow, the function can work differently according to the user configuration. Protocol Performance It aims to improve the throughput, latency, and stability that the TCP protocol can achieve within the given network environment. Various congestion control algorithms and additional TCP functions such as SACK are the typical examples. The protocol improvement will not be discussed here since it is out of the scope. Packet Processing Efficiency The packet processing efficiency aims to improve the maximum number of packets that can be processed per second by reducing the CPU cycle, memory usage, and memory accesses that one system consumes to process packets. There have been several attempts to reduce the latency in the system. The attempts include stack parallel processing, header prediction, zero-copy, single-copy, checksum offload, TSO, LRO, RSS, etc. Control Flow in the Stack Now, we will take a more detailed look at the internal flow of the Linux network stack. Like a subsystem which is not a network stack, a network stack basically runs as the event-driven way that reacts when the event occurs. Therefore, there is no separated thread to execute the stack. Figure 1 and Figure 3 showed the simplified diagrams of control flow. Figure 4 below illustrates more exact control flow. Figure 4: Control Flow in the Stack. At Flow (1) in Figure 4, an application calls a system call to execute (use) the TCP. For example, calls the read system call and the write system call and then executes TCP. However, there is no packet transmission. Flow (2) is same as Flow (1) if it requires packet transmission after executing TCP. It creates a packet and sends down the packet to the driver. A queue is in front of the driver. The packet comes into the queue first, and then the queue implementation structure decides the time to send the packet to the driver. This is queue discipline (qdisc) of Linux. The function of Linux traffic control is to manipulate the qdisc. The default qdisc is a simple First-In-First-Out (FIFO) queue. By using another qdisc, operators can achieve various effects such as artificial packet loss, packet delay, transmission rate limit, etc. At Flow (1) and Flow (2), the process thread of the application also executes the driver. Flow (3) shows the case in which the timer used by the TCP has expired. For example, when the TIME_WAITtimer has expired, the TCP is called to delete the connection. Like Flow (3), Flow (4) is the case in which the timer used by the TCP has expired and the TCP execution result packet should be transmitted. For example, when the retransmit timer has expired, the packet of which ACK has not been received is transmitted. Flow (3) and Flow (4) show the procedure of executing the timer softirq that has processed the timer interrupt. When the NIC driver receives an interrupt, it frees the transmitted packet. In most cases, execution of the driver is terminated here. Flow (5) is the case of packet accumulation in the transmit queue. The driver requests softirq and the softirq handler executes the transmit queue to send the accumulated packet to the driver. When the NIC driver receives an interrupt and finds a newly received packet, it requests softirq. The softirq that processes the received packet calls the driver and transmits the received packet to the upper layer. In Linux, processing the received packet as shown above is called New API (NAPI). It is similar to polling because the driver does not directly transmit the packet to the upper layer, but the upper layer directly gets the packet. The actual code is called NAPI poll or poll. Flow (6) shows the case that completes execution of TCP, and Flow (7) shows the case that requires additional packet transmission. All of Flow (5), (6), and (7) are executed by the softirq which has processed the NIC interrupt. How to Process Interrupt and Received Packet Interrupt processing is complex; however, you need to understand the performance issue related to processing of packets received. Figure 5 shows the procedure of processing an interrupt. Figure 5: Processing Interrupt, softirq, and Received Packet. Assume that the CPU 0 is executing an application program (user program). At this time, the NIC receives a packet and generates an interrupt for the CPU 0. Then the CPU executes the kernel interrupt (called irq) handler. This handler refers to the interrupt number and then calls the driver interrupt handler. The driver frees the packet transmitted and then calls the napi_schedule() function to process the received packet. This function requests the softirq (software interrupt). After execution of the driver interrupt handler has been terminated, the control is passed to the kernel handler. The kernel handler executes the interrupt handler for the softirq. After the interrupt context has been executed, the softirq context will be executed. The interrupt context and the softirq context are executed by an identical thread. However, they use different stacks. And, the interrupt context blocks hardware interrupts; however, the softirq context allows for hardware interrupts. The softirq handler that processes the received packet is the net_rx_action() function. This function calls thepoll() function of the driver. The poll() function calls the netif_receive_skb() function and then sends the received packets one by one to the upper layer. After processing the softirq, the application restarts execution from the stopped point in order to request a system call. Therefore, the CPU that has received the interrupt processes the received packets from the first to the last. In Linux, BSD, and Microsoft Windows, the processing procedure is basically the same on this wise. When you check the server CPU utilization, sometimes you can check that only one CPU executes the softirq hard among the server CPUs. The phenomenon occurs due to the way of processing received packets explained so far. To solve the problem, multi-queue NIC, RSS, and RPS have been developed. Data Structure The followings are some key data structures. Take a look at them and review the code. sk_buff structure First, there is the sk_buff structure or skb structure that means a packet. Figure 6 shows some of the sk_buffstructure. As the functions have been advanced, they get more complicated. However, the basic functions are very common that anyone can think. Figure 6: Packet Structure sk_buff. Including Packet Data and meta data The structure directly includes the packet data or refers to it by using a pointer. In Figure 6, some of the packets (from Ethernet to buffer) refer to using the data pointer and the additional data (frags) refer to the actual page. The necessary information such as header and payload length is saved in the meta data area. For example, inFigure 6, the mac_header, the network_header, and the transport_header have the corresponding pointer data that points the starting position of the Ethernet header, IP header and TCP header, respectively. This way makes TCP protocol processing easy. How to Add or Delete a Header The header is added or deleted as up and down each layer of the network stack. Pointers are used for more efficient processing. For example, to remove the Ethernet header, just increase the head pointer. How to Combine and Divide Packet The linked list is used for efficient execution of tasks such as adding or deleting packet payload data to the socket buffer, or packet chain. The next pointer and the prev pointer are used for this purpose. Quick Allocation and Free As a structure is allocated whenever creating a packet, the quick allocator is used. For example, if data is transmitted at the speed of 10-Gigabit Ethernet, more than one million packets per second must be created and deleted. TCP Control Block Second, there is a structure that represents the TCP connection. Previously, it was abstractly called a TCP control block. Linux uses tcp_sock for the structure. In Figure 7, you can see the relationship among the file, the socket, and the tcp_sock. Figure 7: TCP Connection Structure. When a system call has occurred, it searches the file in the file descriptor used by the application that has called the system call. For the Unix-series OS, the socket, the file and the device for general file system for storage are abstracted to a file. Therefore, the file structure includes the least information. For a socket, a separate socket structure saves the socket-related information and the file refers to the socket as a pointer. The socket refers to the tcp_sock again. The tcp_sock is classified into sock, inet_sock, etc to support various protocols except TCP. It may be considered as a kind of polymorphism. All status information used by the TCP protocol is saved in the tcp_sock. For example, the sequence number, receive window, congestion control, and retransmit timer are saved in the tcp_sock. The send socket buffer and the receive socket buffer are the sk_buff lists and they include the tcp_sock. The dst_entry, the IP routing result, is referred to in order to avoid too frequent routing. The dst_entry allows for easy search of the ARP result, i.e., the destination MAC address. The dst_entry is part of the routing table. The structure of the routing table is very complex that it will not be discussed in this document. The NIC to be used for packet transmission is searched by using the dst_entry. The NIC is expressed as the net_device structure. Therefore, by searching just the file, it is very easy to find all structures (from the file to the driver) required to process the TCP connection with the pointer. The size of the structures is the memory size used by one TCP connection. The memory size is a few KBs (excluding the packet data). As more functions have been added, the memory usage has been gradually increased. Finally, let's see the TCP connection lookup table. It is a hash table used to search the TCP connection where the received packet belongs. The hash value is calculated by using the input data of of the packet and the Jenkins hash algorithm. It is told that the hash function has been selected by considering defense against attacks to the hash table. Following Code: How to Transmit Data We will check the key tasks performed by the stack by following the actual Linux kernel source code. Here, we will observe two paths which are frequently used. First, this is a path used to transmit data when an application calls the write system call. SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, ...) { struct file *file; [...] file = fget_light(fd, &fput_needed); [...] ===> ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos); struct file_operations { [...] ssize_t (*aio_read) (struct kiocb *, const struct iovec *, ...) ssize_t (*aio_write) (struct kiocb *, const struct iovec *, ...) [...] }; static const struct file_operations socket_file_ops = { [...] .aio_read = sock_aio_read, .aio_write = sock_aio_write, [...] }; When the application calls the write system call, the kernel performs the write() function of the file layer. First, the actual file structure of the file descriptor fd is fetched. And then the aio_write is called. This is the function pointer. In the file structure, you will see the file_operations structure pointer. The structure is generally called function table and includes the function pointers such as aio_read and aio_write. The actual table for the socket is socket_file_ops. The aio_write function used by the socket is sock_aio_write. The function table is used for the purpose that is similar to the Java interface. It is generally used for the kernel to perform code abstraction or refactoring. static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, ..) { [...] struct socket *sock = file->private_data; [...] ===> return sock->ops->sendmsg(iocb, sock, msg, size); struct socket { [...] struct file *file; struct sock *sk; const struct proto_ops *ops; }; const struct proto_ops inet_stream_ops = { .family = PF_INET, [...] .connect = inet_stream_connect, .accept = inet_accept, .listen = inet_listen, .sendmsg = tcp_sendmsg, .recvmsg = inet_recvmsg, [...] }; struct proto_ops { [...] int (*connect) (struct socket *sock, ...) int (*accept) (struct socket *sock, ...) int (*listen) (struct socket *sock, int len); int (*sendmsg) (struct kiocb *iocb, struct socket *sock, ...) int (*recvmsg) (struct kiocb *iocb, struct socket *sock, ...) [...] }; The sock_aio_write() function gets the socket structure from the file and then calls sendmsg. It is also the function pointer. The socket structure includes the proto_ops function table. The proto_ops implemented by the IPv4 TCP is inet_stream_ops and the sendmsg is implemented by tcp_sendmsg. int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg, size_t size) { struct sock *sk = sock->sk; struct iovec *iov; struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; [...] mss_now = tcp_send_mss(sk, &size_goal, flags); /* Ok commence sending. */ iovlen = msg->msg_iovlen; iov = msg->msg_iov; copied = 0; [...] while (--iovlen >= 0) { int seglen = iov->iov_len; unsigned char __user *from = iov->iov_base; iov++; while (seglen > 0) { int copy = 0; int max = size_goal; [...] skb = sk_stream_alloc_skb(sk, select_size(sk, sg), sk->sk_allocation); if (!skb) goto wait_for_memory; /* * Check whether we can use HW checksum. */ if (sk->sk_route_caps & NETIF_F_ALL_CSUM) skb->ip_summed = CHECKSUM_PARTIAL; [...] skb_entail(sk, skb); [...] /* Where to copy to? */ if (skb_tailroom(skb) > 0) { /* We have some space in skb head. Superb! */ if (copy > skb_tailroom(skb)) copy = skb_tailroom(skb); if ((err = skb_add_data(skb, from, copy)) != 0) goto do_fault; [...] if (copied) tcp_push(sk, flags, mss_now, tp->nonagle); [...] } tcp_sengmsg gets tcp_sock (i.e.,TCP control block) from the socket and copies the data that the application has requested to transmit to the send socket buffer. When copying data to sk_buff, how many bytes will one sk_buff include? One sk_buff copies and includes MSS (tcp_send_mss) bytes to help the code that actually creates packets. Maximum Segment Size (MSS) stands for the maximum payload size that one TCP packet includes. By using TSO and GSO, one sk_buff can save more data than MSS. This will be discussed later, not in this document. The sk_stream_alloc_skb function creates a new sk_buff, and skb_entail adds the new sk_buff to the tail of the send_socket_buffer. The skb_add_data function copies the actual application data to the data buffer of thesk_buff. All the data is copied by repeating the procedure (creating an sk_buff and adding it to the send socket buffer) several times. Therefore, sk_buffs at the size of the MSS are in the send socket buffer as a list. Finally, the tcp_push is called to make the data which can be transmitted now as a packet, and the packet is sent. static inline void tcp_push(struct sock *sk, int flags, int mss_now, ...) [...] ===> static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, ...) int nonagle, { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; [...] while ((skb = tcp_send_head(sk))) { [...] cwnd_quota = tcp_cwnd_test(tp, skb); if (!cwnd_quota) break; if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now))) break; [...] if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) break; /* Advance the send_head. This one is sent out. * This call will increment packets_out. */ tcp_event_new_data_sent(sk, skb); [...] The tcp_push function transmits as many of the sk_buffs in the send socket buffer as the TCP allows in sequence. First, the tcp_send_head is called to get the first sk_buff in the socket buffer and thetcp_cwnd_test and the tcp_snd_wnd_test are performed to check whether the congestion window and the receive window of the receiving TCP allow new packets to be transmitted. Then, the tcp_transmit_skb function is called to create a packet. static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, gfp_t gfp_mask) { const struct inet_connection_sock *icsk = inet_csk(sk); struct inet_sock *inet; struct tcp_sock *tp; [...] if (likely(clone_it)) { if (unlikely(skb_cloned(skb))) skb = pskb_copy(skb, gfp_mask); else skb = skb_clone(skb, gfp_mask); if (unlikely(!skb)) return -ENOBUFS; } [...] skb_push(skb, tcp_header_size); skb_reset_transport_header(skb); skb_set_owner_w(skb, sk); /* Build TCP header and checksum it. */ th = tcp_hdr(skb); th->source = inet->inet_sport; th->dest = inet->inet_dport; th->seq = htonl(tcb->seq); th->ack_seq = htonl(tp->rcv_nxt); [...] icsk->icsk_af_ops->send_check(sk, skb); [...] err = icsk->icsk_af_ops->queue_xmit(skb); if (likely(err <= 0)) return err; tcp_enter_cwr(sk, 1); return net_xmit_eval(err); } tcp_transmit_skb creates the copy of the given sk_buff (pskb_copy). At this time, it does not copy the entire data of the application but the metadata. And then it calls skb_push to secure the header area and records the header field value. Send_check computes the TCP checksum. With the checksum offload, the payload data is not computed. Finally, queue_xmit is called to send the packet to the IP layer. Queue_xmit for IPv4 is implemented by the ip_queue_xmit function. int ip_queue_xmit(struct sk_buff *skb) [...] rt = (struct rtable *)__sk_dst_check(sk, 0); [...] /* OK, we know where to send it, allocate and build IP header. */ skb_push(skb, sizeof(struct iphdr) + (opt ? opt->optlen : 0)); skb_reset_network_header(skb); iph = ip_hdr(skb); *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff)); if (ip_dont_fragment(sk, &rt->dst) && !skb->local_df) iph->frag_off = htons(IP_DF); else iph->frag_off = 0; iph->ttl = ip_select_ttl(inet, &rt->dst); iph->protocol = sk->sk_protocol; iph->saddr = rt->rt_src; iph->daddr = rt->rt_dst; [...] res = ip_local_out(skb); [...] ===> int __ip_local_out(struct sk_buff *skb) [...] ip_send_check(iph); return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, skb, NULL, skb_dst(skb)->dev, dst_output); [...] ===> int ip_output(struct sk_buff *skb) { struct net_device *dev = skb_dst(skb)->dev; [...] skb->dev = dev; skb->protocol = htons(ETH_P_IP); return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, skb, NULL, dev, ip_finish_output, [...] ===> static int ip_finish_output(struct sk_buff *skb) [...] if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb)) return ip_fragment(skb, ip_finish_output2); else return ip_finish_output2(skb); The ip_queue_xmit function executes tasks required by the IP layers. __sk_dst_check checks whether the cached route is valid. If there is no cached route or the cached route is invalid, it performs IP routing. And then it calls skb_push to secure the IP header area and records the IP header field value. After that, as following the function call, ip_send_check computes the IP header checksum and calls the netfilter function. IP fragment is created when ip_finish_output function needs IP fragmentation. No fragmentation is generated when TCP is used. Therefore, ip_finish_output2 is called and it adds the Ethernet header. Finally, a packet is completed. int dev_queue_xmit(struct sk_buff *skb) [...] ===> static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q, ...) [...] if (...) { .... } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) && qdisc_run_begin(q)) { [...] if (sch_direct_xmit(skb, q, dev, txq, root_lock)) { [...] ===> int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q, ...) [...] HARD_TX_LOCK(dev, txq, smp_processor_id()); if (!netif_tx_queue_frozen_or_stopped(txq)) ret = dev_hard_start_xmit(skb, dev, txq); HARD_TX_UNLOCK(dev, txq); [...] } int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, ...) [...] if (!list_empty(&ptype_all)) dev_queue_xmit_nit(skb, dev); [...] rc = ops->ndo_start_xmit(skb, dev); [...] } The completed packet is transmitted through the dev_queue_xmit function. First, the packet passes via the qdisc. If the default qdisc is used and the queue is empty, the sch_direct_xmit function is called to directly send down the packet to the driver, skipping the queue. Dev_hard_start_xmit function calls the actual driver. Before calling the driver, the device TX is locked first. This is to prevent several threads from accessing the device simultaneously. As the kernel locks the device TX, the driver transmission code does not need an additional lock. It is closely related to the parallel processing that will be discussed next time. Ndo_start_xmit function calls the driver code. Just before, you will see ptype_all and dev_queue_xmit_nit. The ptype_all is a list that includes the modules such as packet capture. If a capture program is running, the packet is copied by ptype_all to the separate program. Therefore, the packet that tcpdump shows is the packet transmitted to the driver. When checksum offload or TSO is used, the NIC manipulates the packet. So the tcpdump packet is different from the packet transmitted to the network line. After completing packet transmission, the driver interrupt handler returns the sk_buff. Following Code: How to Receive Data The general executed path is to receive a packet and then to add the data to the receive socket buffer. After executing the driver interrupt handler, follow the napi poll handle first. static void net_rx_action(struct softirq_action *h) { struct softnet_data *sd = &__get_cpu_var(softnet_data); unsigned long time_limit = jiffies + 2; int budget = netdev_budget; void *have; local_irq_disable(); while (!list_empty(&sd->poll_list)) { struct napi_struct *n; [...] n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list); if (test_bit(NAPI_STATE_SCHED, &n->state)) { work = n->poll(n, weight); trace_napi_poll(n); } [...] } int netif_receive_skb(struct sk_buff *skb) [...] ===> static int __netif_receive_skb(struct sk_buff *skb) { struct packet_type *ptype, *pt_prev; [...] __be16 type; [...] list_for_each_entry_rcu(ptype, &ptype_all, list) { if (!ptype->dev || ptype->dev == skb->dev) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } } [...] type = skb->protocol; list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) { if (ptype->type == type && (ptype->dev == null_or_dev || ptype->dev == skb->dev || ptype->dev == orig_dev)) { if (pt_prev) ret = deliver_skb(skb, pt_prev, orig_dev); pt_prev = ptype; } } if (pt_prev) { ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); static struct packet_type ip_packet_type __read_mostly = { .type = cpu_to_be16(ETH_P_IP), .func = ip_rcv, [...] }; As mentioned before, the net_rx_action function is the softirq handler that receives a packet. First, the driver that has requested the napi poll is retrieved from the poll_list and the poll handler of the driver is called. The driver wraps the received packet with sk_buff and then calls netif_receive_skb. When there is a module that requests all packets, the netif_receive_skb sends packets to the module. Like packet transmission, the packets are transmitted to the module registered to the ptype_all list. The packets are captured here. Then, the packets are transmitted to the upper layer based on the packet type. The Ethernet packet includes 2-byte ethertype field in the header. The value indicates the packet type. The driver records the value in sk_buff(skb->protocol). Each protocol has its own packet_type structure and registers the pointer of the structure to the ptype_base hash table. IPv4 uses ip_packet_type. The Type field value is the IPv4 ethertype (ETH_P_IP) value. Therefore, the IPv4 packet calls the ip_rcv function. int ip_rcv(struct sk_buff *skb, struct net_device *dev, ...) { struct iphdr *iph; u32 len; [...] iph = ip_hdr(skb); [...] if (iph->ihl < 5 || iph->version != 4) goto inhdr_error; if (!pskb_may_pull(skb, iph->ihl*4)) goto inhdr_error; iph = ip_hdr(skb); if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl))) goto inhdr_error; len = ntohs(iph->tot_len); if (skb->len < len) { IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS); goto drop; } else if (len < (iph->ihl*4)) goto inhdr_error; [...] return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL, ip_rcv_finish); [...] ===> int ip_local_deliver(struct sk_buff *skb) [...] if (ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET)) { if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER)) return 0; } return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL, ip_local_deliver_finish); [...] ===> static int ip_local_deliver_finish(struct sk_buff *skb) [...] __skb_pull(skb, ip_hdrlen(skb)); [...] int protocol = ip_hdr(skb)->protocol; int hash, raw; const struct net_protocol *ipprot; [...] hash = protocol & (MAX_INET_PROTOS - 1); ipprot = rcu_dereference(inet_protos[hash]); if (ipprot != NULL) { [...] ret = ipprot->handler(skb); [...] ===> static const struct net_protocol tcp_protocol = { .handler = tcp_v4_rcv, [...] }; The ip_rcv function executes tasks required by the IP layers. It examines packets such as the length and header checksum. After passing through the netfilter code, it performs the ip_local_deliver function. If required, it assembles IP fragments. Then, it calls ip_local_deliver_finish through the netfilter code. Theip_local_deliver_finish function removes the IP header by using the __skb_pull and then searches the upper protocol whose value is identical to the IP header protocol value. Similar to the Ptype_base, each transport protocol registers its own net_protocol structure in inet_protos. IPv4 TCP uses tcp_protocol and callstcp_v4_rcv that has been registered as a handler. When packets come into the TCP layer, the packet processing flow varies depending on the TCP status and the packet type. Here, we will see the packet processing procedure when the expected next data packet has been received in the ESTABLISHED status of the TCP connection. This path is frequently executed by the server receiving data when there is no packet loss or out-of-order delivery. int tcp_v4_rcv(struct sk_buff *skb) { const struct iphdr *iph; struct tcphdr *th; struct sock *sk; [...] th = tcp_hdr(skb); if (th->doff < sizeof(struct tcphdr) / 4) goto bad_packet; if (!pskb_may_pull(skb, th->doff * 4)) goto discard_it; [...] th = tcp_hdr(skb); iph = ip_hdr(skb); TCP_SKB_CB(skb)->seq = ntohl(th->seq); TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin + skb->len - th->doff * 4); TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq); TCP_SKB_CB(skb)->when = 0; TCP_SKB_CB(skb)->flags = iph->tos; TCP_SKB_CB(skb)->sacked = 0; sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest); [...] ret = tcp_v4_do_rcv(sk, skb); First, the tcp_v4_rcv function validates the received packets. When the header size is larger than the data offset (th->doff < sizeof(struct tcphdr) / 4), it is the header error. And then __inet_lookup_skb is called to look for the connection where the packet belongs from the TCP connection hash table. From the sock structure found, all required structures such as tcp_sock and socket can be got. int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) [...] if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */ sock_rps_save_rxhash(sk, skb->rxhash); if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) { [...] ===> int tcp_rcv_established(struct sock *sk, struct sk_buff *skb, [...] /* * Header prediction. */ if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags && TCP_SKB_CB(skb)->seq == tp->rcv_nxt && !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))) { [...] if ((int)skb->truesize > sk->sk_forward_alloc) goto step5; NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS); /* Bulk data transfer: receiver */ __skb_pull(skb, tcp_header_len); __skb_queue_tail(&sk->sk_receive_queue, skb); skb_set_owner_r(skb, sk); tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq; [...] if (!copied_early || tp->rcv_nxt != tp->rcv_wup) __tcp_ack_snd_check(sk, 0); [...] step5: if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0) goto discard; tcp_rcv_rtt_measure_ts(sk, skb); /* Process urgent data. */ tcp_urg(sk, skb, th); /* step 7: process the segment text */ tcp_data_queue(sk, skb); tcp_data_snd_check(sk); tcp_ack_snd_check(sk); return 0; [...] } The actual protocol is executed from the tcp_v4_do_rcv function. If the TCP is in the ESTABLISHED status,tcp_rcv_esablished is called. Processing of the ESTABLISHED status is separately handled and optimized since it is the most common status. The tcp_rcv_established first executes the header prediction code. The header prediction is also quickly processed to detect in the common state. The common case here is that there is no data to transmit and the received data packet is the packet that must be received next time, i.e., the sequence number is the sequence number that the receiving TCP expects. In this case, the procedure is completed by adding the data to the socket buffer and then transmitting ACK. Go forward and you will see the sentence comparing truesize with sk_forward_alloc. It is to check whether there is any free space in the receive socket buffer to add new packet data. If there is, header prediction is "hit" (prediction succeeded). Then __skb_pull is called to remove the TCP header. After that, __skb_queue_tail is called to add the packet to the receive socket buffer. Finally, __tcp_ack_snd_check is called for transmitting ACK if necessary. In this way, packet processing is completed. If there is not enough free space, a slow path is executed. The tcp_data_queue function newly allocates the buffer space and adds the data packet to the socket buffer. At this time, the receive socket buffer size is automatically increased if possible. Different from the quick path, tcp_data_snd_check is called to transmit a new data packet if possible. Finally, tcp_ack_snd_check is called to create and transmit the ACK packet if necessary. The amount of code executed by the two paths is not much. This is accomplished by optimizing the common case. In other words, it means that the uncommon case will be processed significantly more slowly. The out-of-order delivery is one of the uncommon cases. How to Communicate between Driver and NIC Communication between a driver and the NIC is the bottom of the stack and most people do not care about it. However, the NIC is executing more and more tasks to solve the performance issue. Understanding the basic operation scheme will help you understand the additional technology. A driver and the NIC asynchronously communicate. First, a driver requests packet transmission (call) and the CPU performs another task without waiting for the response. And then the NIC sends packets and notifies the CPU of that, the driver returns the received packets (returns the result). Like packet transmission, packet receiving is asynchronous. First, a driver requests packet receiving and the CPU performs another task (call). Then, the NIC receives packets and notifies the CPU of that, and the driver processes the received packets received (returns the result). Therefore, a space to save the request and the response is necessary. In most cases, the NIC uses the ring structure. The ring is similar to the common queue structure. With the fixed number of entries, one entry saves one request data or one response data. The entries are sequentially used in turn. The name "ring" is generally used since the fixed entries are reused in turn. As following the packet transmission procedure shown in the following Figure 8, you will see how the ring is used. Figure 8: Driver-NIC Communication: How to Transmit Packet. The driver receives packets from the upper layer and creates the send descriptor that the NIC can understand. The send descriptor includes the packet size and the memory address by default. As the NIC needs the physical address to access the memory, the driver should change the virtual address of the packets to the physical address. Then, it adds the send descriptor to the TX ring (1). The TX ring is the send descriptor ring. Next, it notifies the NIC of the new request (2). The driver directly writes the data to a specific NIC memory address. In this way, Programmed I/O (PIO) is the data transmission method in which the CPU directly sends data to the device. The notified NIC gets the send descriptor of the TX ring from the host memory (3). Since the device directly accesses the memory without intervention of the CPU, the access is called Direct Memory Access (DMA). After getting the send descriptor, the NIC determines the packet address and the size and then gets the actual packets from the host memory (4). With the checksum offload, the NIC computes the checksum when the NIC gets the packet data from the memory. Therefore, overhead rarely occurs. The NIC sends packets (5) and then writes the number of packets that are sent to the host memory (6). Then, it sends an interrupt (7). The driver reads the number of packets that are sent and then returns the packets that have been sent so far. In the following Figure 9, you will see the procedure of receiving packets. Figure 9: Driver-NIC Communication: How to Receive Packets. First, the driver allocates the host memory buffer for receiving packets and then creates the receive descriptor. The receive descriptor includes the buffer size and the memory address by default. Like the send descriptor, it saves the physical address that the DMA uses in the receive descriptor. Then, it adds the receive descriptor to the RX ring (1). It is the receive request and the RX ring is the receive request ring. Through the PIO, the driver notifies that there is a new descriptor in the NIC (2). The NIC gets the new descriptor of the RX ring. And then it saves the size and location of the buffer included in the descriptor to the NIC memory (3). After the packets have been received (4), the NIC sends the packets to the host memory buffer (5). If the checksum offload function is existing, the NIC computes the checksum at this time. The actual size of received packets, the checksum result, and any other information are saved in the separate ring (the receive return ring) (6). The receive return ring saves the result of processing the receive request, i.e., the response. And then the NIC sends an interrupt (7). The driver gets packet information from the receive return ring and processes the received packets. If necessary, it allocates new memory buffer and repeats Step (1) and Step (2). To tune the stack, most people say that the ring and interrupt setting should be adjusted. When the TX ring is large, a lot of send requests can be made at once. When the RX ring is large, a lot of packet receives can be done at once. A large ring is useful for the workload that has a huge burst of packet transmission/receiving. In most cases, the NIC uses a timer to reduce the number of interrupts since the CPU may suffer from large overhead to process interrupts. To avoid flooding the host system with too many interrupts, interrupts are collected and sent regularly(interrupt coalescing) while sending and receiving the packets. Stack Buffer and Flow Control Flow control is executed in several stages in the stack. Figure 10 shows buffers used to transmit data. First, an application creates data and adds it to the send socket buffer. If there is no free space in the buffer, the system call is failed or the blocking occurs in the application thread. Therefore, the application data rate flowing into the kernel must be controlled by using the socket buffer size limit. Figure 10: Buffers Related to Packet Transmission. The TCP creates and sends packets to the driver through the transmit queue (qdisc). It is a typical FIFO queue type and the maximum length of the queue is the value of txqueuelen which can be checked by executing the ifconfig command. Generally, it is thousands of packets. The TX ring is between the driver and the NIC. As mentioned before, it is considered as a transmission request queue. If there is no free space in the queue, no transmission request is made and the packets are accumulated in the transmit queue. If too many packets are accumulated, packets are dropped. The NIC saves the packets to transmit in the internal buffer. The packet rate from this buffer is affected by the physical rate (ex: 1 Gb/s NIC cannot offer performance of 10 Gb/s). And with the Ethernet flow control, packet transmission is stopped if there is no free space in the receive NIC buffer. When the packet rate from the kernel is faster than the packet rate from the NIC, packets are accumulated in the buffer of the NIC. If there is no free space in the buffer, processing of transmission request from the TX ring is stopped. More and more requests are accumulated in the TX ring and finally there is no free space in the queue. The driver cannot make any transmission request and the packets are accumulated in the transmit queue. Like this, backpressure is sent from the bottom to the top through many buffers. Figure 11 shows the buffers that the receive packets are passing. The packets are saved in the receive buffer of the NIC. From the view of flow control, the RX ring between the driver and the NIC is considered as a packet buffer. The driver gets packets coming into the RX ring and then sends them to the upper layer. There is no buffer between the driver and the upper layer since the NIC driver that is used by the server system uses NAPI by default. Therefore, it can be considered as the upper layer directly gets packets from the RX ring. The payload data of packets is saved in the receive socket buffer. The application gets the data from the socket buffer later. Figure 11: Buffers Related to Packet Receiving. The driver that does not support NAPI saves packets in the backlog queue. Later, the NAPI handler gets packets. Therefore, the backlog queue can be considered as a buffer between the upper layer and the driver. If the packet processing rate of the kernel is slower than the packet flow rate into the NIC, the RX ring space is full. And the space of the buffer in the NIC is full, too. When the Ethernet flow control is used, the NIC sends a request to stop transmission to the transmission NIC or makes the packet drop. There is no packet drop due to lack of space in the receive socket buffer because the TCP supports end-to-end flow control. However, packet drop occurs due to lack of space in the socket buffer when the application rate is slow because the UDP does not support flow control. The sizes of the TX ring and the RX ring used by the driver in Figure 10 and Figure 11 are the sizes of the rings shown by the ethtool. For most workloads which regard throughput as important, it will be helpful to increase the ring size and the socket buffer size. Increasing the sizes reduces the possibility of failures caused by lack of space in the buffer while receiving and transmitting a lot of packets at a fast rate. Conclusion Initially, I planned to explain only the things that would be helpful for you to develop network programs, execute performance tests, and perform troubleshooting. In spite of my initial plan, the amount of description included in this document is not small. I hope this document will help you to develop network applications and monitor their performance. The TCP/IP protocol itself is very complicated and has many exceptions. However, you don't need to understand every line of TCP/IP-related code of the OS to understand performance and analyze the phenomena. Just understanding its context will be very helpful for you. With continuous advancement of system performance and implementation of the OS network stack, the latest server can offer 10-20 Gb/s TCP throughput without any problem. These days, there are too many technology types related to performance, such as TSO, LRO, RSS, GSO, GRO, UFO, XPS, IOAT, DDIO, and TOE, just like alphabet soup, to make us confused. In the next article, I will explain about the network stack from the performance perspective and discuss the problems and effects of this technology. By Hyeongyeop Kim, Senior Engineer at Performance Engineering Lab, NHN Corporation.

February 27, 2013

by Esen Sagynov

· 13,807 Views · 1 Like

Text Processing, Part 2: Oh, Inverted Index

This is the second part of my text processing series. In this blog, we'll look into how text documents can be stored in a form that can be easily retrieved by a query. I'll used the popular open source Apache Lucene index for illustration. There are two main processing flow in the system ... Document indexing: Given a document, add it into the index Document retrieval: Given a query, retrieve the most relevant documents from the index. The following diagram illustrate how this is done in Lucene. Index Structure Both documents and query is represented as a bag of words. In Apache Lucene, "Document" is the basic unit for storage and retrieval. A "Document" contains multiple "Fields" (also call zones). Each "Field" contains multiple "Terms" (equivalent to words). To control how the document will be indexed across its containing fields, a Field can be declared in multiple ways to specified whether it should be analyzed (a pre-processing step during index), indexed (participate in the index) or stored (in case it needs to be returned in query result). Keyword (Not analyzed, Indexed, Stored) Unindexed (Not analyzed, Not indexed, Stored) Unstored (Analyzed, Indexed, Not stored) Text (Analyzed, Indexed, Stored) The inverted index is a core data structure of the storage. It is organized as an inverted manner from terms to the list of documents (which contain the term). The list (known as posting list) is ordered by a global ordering (typically by document id). To enable faster retrieval, the list is not just a single list but a hierarchy of skip lists. For simplicity, we ignore the skip list in subsequent discussion. This data structure is illustration below based on Lucene's implementation. It is stored on disk as segment files which will be brought to memory during the processing. The above diagram only shows the inverted index. The whole index contain an additional forward index as follows. Document indexing Document in its raw form is extracted from a data adaptor. (this can be making an Web API to retrieve some text output, or crawl a web page, or receiving an HTTP document upload). This can be done in a batch or online manner. When the index processing start, it parses each raw document and analyze its text content. The typical steps includes ... Tokenize the document (breakdown into words) Lowercase each word (to make it non-case-sensitive, but need to be careful with names or abbreviations) Remove stop words (take out high frequency words like "the", "a", but need to careful with phrases) Stemming (normalize different form of the same word, e.g. reduce "run", "running", "ran" into "run") Synonym handling. This can be done in two ways. Either expand the term to include its synonyms (ie: if the term is "huge", add "gigantic" and "big"), or reduce the term to a normalized synonym (ie: if the term is "gigantic" or "huge", change it to "big") At this point, the document is composed with multiple terms. doc = [term1, term2 ...]. Optionally, terms can be further combined into n-grams. After that we count the term frequency of this document. For example, in a bi-gram expansion, the document will become ... doc1 -> {term1: 5, term2: 8, term3: 4, term1_2: 3, term2_3:1} We may also compute a "static score" based on some measure of quality of the document. After that, we insert the document into the posting list (if it exist, otherwise create a new posting list) for each terms (all n-grams), this will create the inverted list structure as shown in previous diagram. There is a boost factor that can be set to the document or field. The boosting factor effectively multiply the term frequency which effectively affecting the importance of the document or field. Document can be added to the index in one of the following ways; inserted, modified and deleted. Typically the document will first added to the memory buffer, which is organized as an inverted index in RAM. When this is a document insertion, it goes through the normal indexing process (as I described above) to analyze the document and build an inverted list in RAM. When this is a document deletion (the client request only contains the doc id), it fetches the forward index to extract the document content, then goes through the normal indexing process to analyze the document and build the inverted list. But in this case the doc object in the inverted list is labeled as "deleted". When this is a document update (the client request contains the modified document), it is handled as a deletion followed by an insertion, which means the system first fetch the old document from the forward index to build an inverted list with nodes marked "deleted", and then build a new inverted list from the modified document. (e.g. If doc1 = "A B" is update to "A C", then the posting list will be {A:doc1(deleted) -> doc1, B:doc1(deleted), C:doc1}. After collapsing A, the posting list will be {A:doc1, B:doc1(deleted), C:doc1} As more and more document are inserted into the memory buffer, it will become full and will be flushed to a segment file on disk. In the background, when M segments files have been accumulated, Lucene merges them into bigger segment files. Notice that the size of segment files at each level is exponentially increased (M, M^2, M^3). This maintains the number of segment files that need to be search per query to be at the O(logN) complexity where N is the number of documents in the index. Lucene also provide an explicit "optimize" call that merges all the segment files into one. Here lets detail a bit on the merging process, since the posting list is already vertically ordered by terms and horizontally ordered by doc id, merging two segment files S1, S2 is basically as follows Walk the posting list from both S1 and S2 together in sorted term order. For those non-common terms (term that appears in one of S1 or S2 but not both), write out the posting list to a new segment S3. Until we find a common term T, we merge the corresponding posting list from these 2 segments. Since both list are sorted by doc id, we just walk down both posting list to write out the doc object to a new posting list. When both posting lists have the same doc (which is the case when the document is updated or deleted), we pick the latest doc based on time order. Finally, the doc frequency of each posting list (of the corresponding term) will be computed. Document retrieval Consider a document is a vector (each term as the separated dimension and the corresponding value is the tf-idf value) and the query is also a vector. The document retrieval problem can be defined as finding the top-k most similar document that match a query, where similarity is defined as the dot-product or cosine distance between the document vector and the query vector. tf-idf is a normalized frequency. TF (term frequency) represents how many time the term appears in the document (usually a compression function such as square root or logarithm is applied). IDF is the inverse of document frequency which is used to discount the significance if that term appears in many other documents. There are many variants of TF-IDF but generally it reflects the strength of association of the document (or query) with each term. Given a query Q containing terms [t1, t2], here is how we fetch the corresponding documents. A common approach is the "document at a time approach" where we traverse the posting list of t1, t2 concurrently (as opposed to the "term at a time" approach where we traverse the whole posting list of t1 before we start the posting list of t2). The traversal process is described as follows ... For each term t1, t2 in query, we identify all the corresponding posting lists. We walk each posting list concurrently to return a sequence of documents (ordered by doc id). Notice that each return document contains at least one term but can also also contain multiple terms. We compute the dynamic score which is dot product of the query to document vector. Notice that we typically don't concern the TF/IDF of the query (which is short and we don't care the frequency of each term). Therefore we can just compute the sum up all the TF score of the posting list that has a match term after dividing the IDF score (at the head of each posting list). Lucene also support query level boosting where a boost factor can be attached to the query terms. The boost factor will multiply the term frequency correspondingly. We also look up the static score which is purely based on the document (but not the query). The total score is a linear combination of static and dynamic score. Although the score we used in above calculation is based on computing the cosine distance between the query and document, we are not restricted to that. We can plug in any similarity function that make sense to the domain. (e.g. we can use machine learning to train a model to score the similarity between a query and a document). After we compute a total score, we insert the document into a heap data structure where the topK scored document is maintained. Here the whole posting list will be traversed. In case of the posting list is very long, the response time latency will be long. Is there a way that we don't have to traverse the whole list and still be able to find the approximate top K documents ? There are a couple strategies we can consider. Static Score Posting Order: Notice that the posting list is sorted based on a global order, this global ordering provide a monotonic increasing document id during the traversal that is important to support the "document at a time" traversal because it is impossible to visit the same document again. This global ordering, however, can be quite arbitrary and doesn't have to be the document id. So we can pick the order to be based on the static score (e.g. quality indicator of the document) which is global. The idea is that we traverse the posting list in decreasing magnitude of static score, so we are more likely to visit the document with the higher total score (static + dynamic score). Cut frequent terms: We do not traverse the posting list whose term has a low IDF value (ie: the term appears in many documents and therefore the posting list tends to be long). This way we avoid to traverse the long posting list. TopR list: For each posting list, we create an extra posting list which contains the top R documents who has the highest TF (term frequency) in the original list. When we perform the search, we perform our search in this topR list instead of the original posting list. Since we have multiple inverted index (in memory buffer as well as the segment files at different levels), we need to combine the result them. If termX appears in both segmentA and segmentB, then the fresher version will be picked. The fresher version is determine as follows; the segment with a lower level (smaller size) will be considered more fresh. If the two segment files are at the same level, then the one with a higher number is more fresh. On the other hand, the IDF value will be the sum of the corresponding IDF of each posting list in the segment file (the value will be slightly off if the same document has been updated, but such discrepancy is negligible). However, the processing of consolidating multiple segment files incur processing overhead in document retrieval. Lucene provide an explicit "optimize" call to merge all segment files into one single file so there is no need to look at multiple segment files during document retrieval. Distributed Index For large corpus (like the web documents), the index is typically distributed across multiple machines. There are two models of distribution: Term partitioning and Document partitioning. In document partitioning, documents are randomly spread across different partitions where the index is built. In term partitioning, the terms are spread across different partitions. We'll discuss document partitioning as it is more commonly used. Distributed index is provider by other technologies that is built on Lucene, such as ElasticSearch. A typical setting is as follows ... In this setting, machines are organized as columns and rows. Each column represent a partition of documents while each row represent a replica of the whole corpus. During the document indexing, first a row of the machines is randomly selected and will be allocated for building the index. When a new document crawled, a column machine from the selected row is randomly picked to host the document. The document will be sent to this machine where the index is build. The updated index will be later propagated to the other rows of replicas. During the document retrieval, first a row of replica machines is selected. The client query will then be broadcast to every column machine of the selected row. Each machine will perform the search in its local index and return the TopM elements to the query processor which will consolidate the results before sending back to client. Notice that K/P < M < K, where K is the TopK documents the client expects and P is the number of columns of machines. Notice that M is a parameter that need to be tuned. One caveat of this distributed index is that as the posting list is split horizontally across partitions, we lost the global view of the IDF value without which the machine is unable to calculate the TF-IDF score. There are two ways to mitigate that ... Do nothing: here we assume the document are evenly spread across different partitions so the local IDF represents a good ratio of the actual IDF. Extra round trip: In the first round, query is broadcasted to every column which returns its local IDF. The query processor will collected all IDF response and compute the sum of the IDF. In the second round, it broadcast the query along with the IDF sum to each column of machines, which will compute the local score based on the IDF sum.

February 26, 2013

by Ricky Ho

· 9,341 Views

XML->JSON->HashMap

Yes, it is long time since i posted… Was just trying to see how a XML can be converted to JSON and to HashMap. The situation is very imaginary. import java.io.File; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStream; import java.util.ArrayList; import java.util.List; import java.util.Map; import net.sf.json.JSON; import net.sf.json.xml.XMLSerializer; import org.apache.commons.io.IOUtils; import org.codehaus.jackson.JsonGenerationException; import org.codehaus.jackson.map.JsonMappingException; import org.codehaus.jackson.map.ObjectMapper; import org.codehaus.jackson.type.TypeReference; public class XML2JSONConvertor { public static void main(String[] args) throws Exception { InputStream is = new FileInputStream(new File( “e:\\jagannathan\\personal\\java-projects\\secondtest.xml”)); String xml = IOUtils.toString(is); XMLSerializer xmlSerializer = new XMLSerializer(); JSON json = xmlSerializer.read(xml); System.out.println(json.toString(2)); printJSON(json.toString(2)); } public static void printJSON(String jsonString) { ObjectMapper mapper = new ObjectMapper(); try { Map jsonInMap = mapper.readValue(jsonString, new TypeReference>() { }); List keys = new ArrayList(jsonInMap.keySet()); for (String key : keys) { System.out.println(key + “: ” + jsonInMap.get(key)); } } catch (JsonGenerationException e) { e.printStackTrace(); } catch (JsonMappingException e) { e.printStackTrace(); } catch (IOException e) { e.printStackTrace(); } } } Dependencies net.sf.json-lib json-lib 2.4 jdk15 commons-io commons-io 2.3 compile xom xom 1.2.5 org.codehaus.jackson jackson-mapper-asl 1.9.0 The Input XML Jags Inc Jagan Male 24-jul Satya Male 24-apr The output 7 Feb, 2013 7:20:50 PM net.sf.json.xml.XMLSerializer getType INFO: Using default type string { “name”: “Jags Inc”, “employees”: [ { "name": "Jagan", "sex": "Male", "dob": "24-jul" }, { "name": "Satya", "sex": "Male", "dob": "24-apr" } ] } name: Jags Inc employees: [{name=Jagan, sex=Male, dob=24-jul}, {name=Satya, sex=Male, dob=24-apr}]

February 18, 2013

by Jagannathan Asokan

· 33,553 Views

Better explaining the CAP Theorem

today, i thought a lot about how to examine different databases. choosing a database is often a daunting task. there's a lot of confusion, a 'theorem', and more than all, the immortal proverb 'not one size fits all'. as if it helps. one of the first things that you realize, when examining nosql distributed databases (and how could you not)is that these days databases are like cars: they're all good. old fashioned sql databases can scale in and out, horizontally sharded over several machines to achieve high availability. nosql systems claim to be consistent. what difference then does it make what database would you choose? the availability and consistency that i mentioned comes, of course, from the misunderstood cap theorem , that - so people say - states that you can only choose 2 out of the 3 consistency: every read would get you the most recent write availability: every node (if not failed) always executes queries partition-tolerance: even if the connections between nodes are down, the other two (a & c) promises, are kept. usually its depicted in a nicely equilaterl triangle, as this one from ofirm : there's a nice proof and explanation of it in this 4 minute video here . but if we think about it, and also see some of brewer's (the theorem author) later remarks , we'll see that the 2 out of 3 is really 1 out of 2: it's really just a vs c! and this is simply because: availability is achieved by replicating the data across different machines consistency is achieved by updating several nodes before allowing further reads total partitioning, meaning failure of part of the system is rare. however, we could look at a delay, a latency, of the update between nodes, as a temporary partitioning . it will then cause a temporary decision between a and c: on systems that allow reads before updating all the nodes, we will get high availability on systems that lock all the nodes before allowing reads, we will get consistency that's it! and since this decision is temporary, it exists only for the duration of the delay, some may say that we are really contrasting latency (another word for availability) against consistency. by the way, there's no distributed system that wants to live with "paritioning" - if it does, it's not distributed. that is why putting sql in this triangle may lead to confusion.

February 17, 2013

by Lior Messinger

· 139,400 Views · 18 Likes

CPU Cache Flushing Fallacy

Even from highly experienced technologists I often hear talk about how certain operations cause a CPU cache to "flush". This seems to be illustrating a very common fallacy about how CPU caches work, and how the cache sub-system interacts with the execution cores. In this article I will attempt to explain the function CPU caches fulfil, and how the cores, which execute our programs of instructions, interact with them. For a concrete example I will dive into one of the latest Intel x86 server CPUs. Other CPUs use similar techniques to achieve the same ends. Most modern systems that execute our programs are shared-memory multi-processor systems in design. A shared-memory system has a single memory resource that is accessed by 2 or more independent CPU cores. Latency to main memory is highly variable from 10s to 100s of nanoseconds. Within 100ns it is possible for a 3.0GHz CPU to process up to 1200 instructions. Each Sandy Bridge core is capable of retiring up to 4 instructions-per-cycle (IPC) in parallel. CPUs employ cache sub-systems to hide this latency and allow them to exercise their huge capacity to process instructions. Some of these caches are small, very fast, and local to each core; others are slower, larger, and shared across cores. Together with registers and main-memory, these caches make up our non-persistent memory hierarchy. Next time you are developing an important algorithm, try pondering that a cache-miss is a lost opportunity to have executed ~500 CPU instructions! This is for a single-socket system, on a multi-socket system you can effectively double the lost opportunity as memory requests cross socket interconnects. Memory Hierarchy Figure 1. For the circa 2012 Sandy Bridge E class servers our memory hierarchy can be decomposed as follows: Registers: Within each core are separate register files containing 160 entries for integers and 144 floating point numbers. These registers are accessible within a single cycle and constitute the fastest memory available to our execution cores. Compilers will allocate our local variables and function arguments to these registers. When hyperthreading is enabled these registers are shared between the co-located hyperthreads. Memory Ordering Buffers (MOB): The MOB is comprised of a 64-entry load and 36-entry store buffer. These buffers are used to track in-flight operations while waiting on the cache sub-system. The store buffer is a fully associative queue that can be searched for existing store operations, which have been queued when waiting on the L1 cache. These buffers enable our fast processors to run asynchronously while data is transferred to and from the cache sub-system. When the processor issues asynchronous reads and writes then the results can come back out-of-order. The MOB is used to disambiguate the ordering for compliance to the published memory model. Level 1 Cache: The L1 is a core-local cache split into separate 32K data and 32K instruction caches. Access time is 3 cycles and can be hidden as instructions are pipelined by the core for data already in the L1 cache. Level 2 Cache: The L2 cache is a core-local cache designed to buffer access between the L1 and the shared L3 cache. The L2 cache is 256K in size and acts as an effective queue of memory accesses between the L1 and L3. L2 contains both data and instructions. L2 access latency is 12 cycles. Level 3 Cache: The L3 cache is shared across all cores within a socket. The L3 is split into 2MB segments each connected to a ring-bus network on the socket. Each core is also connected to this ring-bus. Addresses are hashed to segments for greater throughput. Latency can be up to 38 cycles depending on cache size. Cache size can be up to 20MB depending on the number of segments, with each additional hop around the ring taking an additional cycle. The L3 cache is inclusive of all data in the L1 and L2 for each core on the same socket. This inclusiveness, at the cost of space, allows the L3 cache to intercept requests thus removing the burden from private core-local L1 & L2 caches. Main Memory: DRAM channels are connected to each socket with an average latency of ~65ns for socket local access on a full cache-miss. This is however extremely variable, being much less for subsequent accesses to columns in the same row buffer, through to significantly more when queuing effects and memory refresh cycles conflict. 4 memory channels are aggregated together on each socket for throughput, and to hide latency via pipelining on the independent memory channels. NUMA: In a multi-socket server we have non-uniform memory access. It is non-uniform because the required memory maybe on a remote socket having an additional 40ns hop across the QPI bus. Sandy Bridge is a major step forward for 2-socket systems over Westmere and Nehalem. With Sandy Bridge the QPI limit has been raised from 6.4GT/s to 8.0GT/s, and two lanes can be aggregated thus eliminating the bottleneck of the previous systems. For Nehalem and Westmere the QPI link is only capable of ~40% the bandwidth that could be delivered by the memory controller for an individual socket. This limitation made accessing remote memory a choke point. In addition, the QPI link can now forward pre-fetch requests which previous generations could not. Associativity Levels Caches are effectively hardware based hash tables. The hash function is usually a simple masking of some low-order bits for cache indexing. Hash tables need some means to handle a collision for the same slot. The associativity level is the number of slots, also known as ways or sets, which can be used to hold a hashed version of an address. Having more levels of associativity is a trade off between storing more data vs. power requirements and time to search each of the ways. For Sandy Bridge the L1 and L2 are 8-way and the L3 is 12-way associative. Cache Coherence With some caches being local to cores, we need a means of keeping them coherent so all cores can have a consistent view of memory. The cache sub-system is considered the "source of truth" for mainstream systems. If memory is fetched from the cache it is never stale; the cache is the master copy when data exists in both the cache and main-memory. This style of memory management is known as write-back whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place. An x86 cache works on blocks of data that are 64-bytes in size, known as a cache-line. Other processors can use a different size for the cache-line. A larger cache-line size reduces effective latency at the expense of increased bandwidth requirements. To keep the caches coherent the cache controller tracks the state of each cache-line as being in one of a finite number of states. The protocol Intel employs for this is MESIF, AMD employs a variant know as MOESI. Under the MESIF protocol each cache-line can be in 1 of the 5 following states: Modified: Indicates the cache-line is dirty and must be written back to memory at a later stage. When written back to main-memory the state transitions to Exclusive. Exclusive: Indicates the cache-line is held exclusively and that it matches main-memory. When written to, the state then transitions to Modified. To achieve this state a Request-For-Ownership (RFO) message is sent which involves a read plus an invalidate broadcast to all other copies. Shared: Indicates a clean copy of a cache-line that matches main-memory. Invalid: Indicates an unused cache-line. Forward: Indicates a specialised version of the shared state i.e. this is the designated cache which should respond to other caches in a NUMA system. To transition from one state to another, a series of messages are sent between the caches to effect state changes. Previous to Nehalem for Intel, and Opteron for AMD, this cache coherence traffic between sockets had to share the memory bus which greatly limited scalability. These days the memory controller traffic is on a separate bus. The Intel QPI, and AMD HyperTransport, buses are used for cache coherence between sockets. The cache controller exists as a module within each L3 cache segment that is connected to the on-socket ring-bus network. Each core, L3 cache segment, QPI controller, memory controller, and integrated graphics sub-system are connected to this ring-bus. The ring is made up of 4 independent lanes for: request, snoop, acknowledge, and 32-bytes data per cycle. The L3 cache is inclusive in that any cache-line held in the L1 or L2 caches is also held in the L3. This provides for rapid identification of the core containing a modified line when snooping for changes. The cache controller for the L3 segment keeps track of which core could have a modified version of a cache-line it owns. If a core wants to read some memory, and it does not have it in a Shared, Exclusive, or Modified state; then it must make a read on the ring bus. It will then either be read from main-memory if not in the cache sub-systems, or read from L3 if clean, or snooped from another core if Modified. In any case the read will never return a stale copy from the cache sub-system, it is guaranteed to be coherent. Concurrent Programming If our caches are always coherent then why do we worry about visibility when writing concurrent programs? This is because within our cores, in their quest for ever greater performance, data modifications can appear out-of-order to other threads. There are 2 major reasons for this. Firstly, our compilers can generate programs that store variables in registers for relatively long periods of time for performance reasons, e.g. variables used repeatedly within a loop. If we need these variables to be visible across cores then the updates must not be register allocated. This is achieved in C by qualifying a variable as "volatile". Beware that C/C++ volatile is inadequate for telling the compiler to order other instructions. For this you need fences/barriers. The second major issue with ordering we have to be aware of is a thread could write a variable and then, if it reads it shortly after, could see the value in its store buffer which may be older than the latest value in the cache sub-system. This is never an issue for algorithms following the Single Writer Principle but is an issue for the likes of the Dekker and Peterson lock algorithms. To overcome this issue, and ensure the latest value is observed, the thread must wait for the store buffer to drain on that core. This can be achieved by issuing a fence instruction. The write of a volatile variable in Java, in addition to never being register allocated, is accompanied by a full fence instruction. This fence instruction on x86 has a significant performance impact by preventing progress on the issuing thread until the store buffer is drained. Fences on other processors can have more efficient implementations that simply put a marker in the store buffer for the search boundary, e.g. the Azul Vega does this. If you want to ensure memory ordering across Java threads when following the Single Writer Principle, and avoid the store fence, it is possible by using the j.u.c.Atomic(Int|Long|Reference).lazySet() method, as opposed to setting a volatile variable. The Fallacy Returning to the fallacy of "flushing the cache" as part of a concurrent algorithm. I think we can safely say that we never "flush" the CPU cache within our user space programs. I believe the source of this fallacy is the need to flush, mark or drain to a point, the store buffer for some classes of concurrent algorithms so the latest value can be observed on a subsequent load operation. For this we require a memory ordering fence and not a cache flush. Another possible source of this fallacy is that L1 caches, or the TLB, may need to be flushed based on address indexing policy on a context switch. ARM, previous to ARMv6, did not use address space tags on TLB entries thus requiring the whole L1 cache to be flushed on a context switch. Many processors require the L1 instruction cache to be flushed for similar reasons, in many cases this is simply because instruction caches are not required to be kept coherent. The bottom line is, context switching is expensive and a bit off topic, so in addition to the cache pollution of the L2, a context switch can also cause the TLB and/or L1 caches to require a flush. Intel x86 processors require only a TLB flush on context switch.

February 15, 2013

by Martin Thompson

· 11,535 Views · 3 Likes

Introduction to JCache JSR 107

Resin has supported caching, session replication (another form of caching), and http proxy caching in cluster environments for over ten years. When you use Resin caching, you are using the same platform that has the speed and scalability of custom services written in C like NginX with the usability of Java, and the industry platform Java EE. JCache JSR 107 is a distributed cache that has a similar interface to the HashMap that you know and love. To be more specific, the Cache object in JCache looks like a java.util.ConncurrentHashMap. In addition, JCache JSR 107 defines integration with CDI (as well as Spring and Guice). You can decorate services with interceptors that apply caching to the services just by defining annotations. Resin 4 has support for JCache, and JCache support is required for Java EE 7. Let's look at a small example to see how easy is to get started with JCache. package hello.world; import javax.cache.Cache; import javax.cache.CacheBuilder; import javax.cache.CacheManager; import javax.cache.Caching; ... @WebServlet("/HelloServlet") public class HelloServlet extends HttpServlet { Cache cache; public Cache cache() { if (cache == null) { //building a cache CacheManager manager = Caching.getCacheManager("cacheManagerHello"); CacheBuilder builder = manager.createCacheBuilder("a"); cache = builder.build(); } return cache; } protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { response.setContentType("text/html"); response.getWriter().append(" "); String helloMessage = cache().get("hello message"); if (helloMessage == null) { helloMessage = new StringBuilder(20) .append("Hello World ! ") .append(System.currentTimeMillis()).toString(); cache().put("hello message", helloMessage); // <-------------- putting results in the cache } response.getWriter().append(helloMessage); response.getWriter().append(" "); } } The above works out fairly well, but what if we want to periodically change the helloMessage. Let's say we get 2,000 requests a second, but every 10 seconds or so we would like to regenerate the helloMessage. The message might be: Hello World ! 1358979745996 Later we would want it to change. If we wanted it to change every 10 seconds after it was last accessed, we would do this: cache = builder.setExpiry(ExpiryType.ACCESSED, new Duration(TimeUnit.SECONDS, 10)).build(); For this example, we want to change it every 10 seconds after is was last modified. We would set up the timeout on the creation as follows: cache = builder.setExpiry(ExpiryType.MODIFIED, new Duration(TimeUnit.SECONDS, 10)).build(); This would go right in the cache method we defined earlier. public Cache cache() { if (cache == null) { CacheManager manager = Caching.getCacheManager("cacheManagerHello"); CacheBuilder builder = manager.createCacheBuilder("b"); cache = builder.setExpiry(ExpiryType.MODIFIED, new Duration(TimeUnit.SECONDS, 10)).build(); } return cache; } Resin's JCache implementation is built on top Resin distributed cache architecture. You get replication, and data redundancy built in. Bill Digman is a Java EE / Servlet enthusiast and Open Source enthusiast who loves working with Caucho's Resin Servlet Container, a Java EE Web Profile Servlet Container. Caucho's Resin OpenSource Servlet Container Java EE Web Profile Servlet Container Caucho's Resin 4.0 JCache blog post

February 13, 2013

by Bill Digman

· 48,991 Views · 1 Like

Synchronising Multithreaded Integration Tests

Testing threads is hard, very hard and this makes writing good integration tests for multithreaded systems under test... hard. This is because in JUnit there's no built in synchronisation between the test code, the object under test and any threads. This means that problems usually arise when you have to write a test for a method that creates and runs a thread. One of the most common scenarios in this domain is in making a call to a method under test, which starts a new thread running before returning. At some point in the future when the thread's job is done you need assert that everything went well. Examples of this scenario could include asynchronously reading data from a socket or carrying out a long and complex set of operations on a database. For example, the ThreadWrapper class below contains a single public method: doWork(). Calling doWork() sets the ball rolling and at some point in the future, at the discretion of the JVM, a thread runs adding data to a database. public class ThreadWrapper { /** * Start the thread running so that it does some work. */ public void doWork() { Thread thread = new Thread() { /** * Run method adding data to a fictitious database */ @Override public void run() { System.out.println("Start of the thread"); addDataToDB(); System.out.println("End of the thread method"); } private void addDataToDB() { // Dummy Code... try { Thread.sleep(4000); } catch (InterruptedException e) { e.printStackTrace(); } } }; thread.start(); System.out.println("Off and running..."); } } A straightforward test for this code would be to call the doWork() method and then check the database for the result. The problem is that, owing to the use of a thread, there's no co-ordination between the object under test, the test and the thread. A common way of achieving some co-ordination when writing this kind of test is to put some kind of delay in between the call to the method under test and checking the results in the database as demonstrated below: public class ThreadWrapperTest { @Test public void testDoWork() throws InterruptedException { ThreadWrapper instance = new ThreadWrapper(); instance.doWork(); Thread.sleep(10000); boolean result = getResultFromDatabase(); assertTrue(result); } /** * Dummy database method - just return true */ private boolean getResultFromDatabase() { return true; } } In the code above there is a simple Thread.sleep(10000) between two method calls. This technique has the benefit of being incredabile simple; however it's also very risky. This is because it introduces a race condition between the test and the worker thread as the JVM makes no guarantees about when threads will run. Often it'll work on a developer's machine only to fail consistently on the build machine. Even if it does work on the build machine it atificially lengthens the duration of the test; remember that quick builds are important. The only sure way of getting this right is to synchronise the two different threads and one technique for doing this is to inject a simple CountDownLatch into the instance under test. In the example below I've modified the ThreadWrapper class's doWork() method adding the CountDownLatch as an argument. public class ThreadWrapper { /** * Start the thread running so that it does some work. */ public void doWork(final CountDownLatch latch) { Thread thread = new Thread() { /** * Run method adding data to a fictitious database */ @Override public void run() { System.out.println("Start of the thread"); addDataToDB(); System.out.println("End of the thread method"); countDown(); } private void addDataToDB() { try { Thread.sleep(4000); } catch (InterruptedException e) { e.printStackTrace(); } } private void countDown() { if (isNotNull(latch)) { latch.countDown(); } } private boolean isNotNull(Object obj) { return latch != null; } }; thread.start(); System.out.println("Off and running..."); } } he Javadoc API describes a count down latch as: A synchronization aid that allows one or more threads to wait until a set of operations being performed in other threads completes. A CountDownLatch is initialized with a given count. The await methods block until the current count reaches zero due to invocations of the countDown() method, after which all waiting threads are released and any subsequent invocations of await return immediately. This is a one-shot phenomenon -- the count cannot be reset. If you need a version that resets the count, consider using a CyclicBarrier. A CountDownLatch is a versatile synchronization tool and can be used for a number of purposes. A CountDownLatch initialized with a count of one serves as a simple on/off latch, or gate: all threads invoking await wait at the gate until it is opened by a thread invoking countDown(). A CountDownLatchinitialized to N can be used to make one thread wait until N threads have completed some action, or some action has been completed N times. A useful property of a CountDownLatch is that it doesn't require that threads calling countDown wait for the count to reach zero before proceeding, it simply prevents any thread from proceeding past an await until all threads could pass. The idea here is that the test code will never check the database for the results until the run() method of the worker thread has called latch.countdown(). This is because the test code thread is blocking at the call to latch.await(). latch.countdown() decrements latch's count and once this is zero the blocking call the latch.await() returns and the test code continues executing, safe in the knowledge that any results which should be in the database, are in the database. The test can then retrieve these results and make a valid assertion. Obviously, the code above merely fakes the database connection and operations. The thing is you may not want to, or need to, inject a CountDownLatch directly into your code; after all it's not used in production and it doesn't look particularly clean or elegant. One quick way around this is to simply make the doWork(CountDownLatch latch) method package private and expose it through a public doWork() method. public class ThreadWrapper { /** * Start the thread running so that it does some work. */ public void doWork() { doWork(null); } @VisibleForTesting void doWork(final CountDownLatch latch) { Thread thread = new Thread() { /** * Run method adding data to a fictitious database */ @Override public void run() { System.out.println("Start of the thread"); addDataToDB(); System.out.println("End of the thread method"); countDown(); } private void addDataToDB() { try { Thread.sleep(4000); } catch (InterruptedException e) { e.printStackTrace(); } } private void countDown() { if (isNotNull(latch)) { latch.countDown(); } } private boolean isNotNull(Object obj) { return latch != null; } }; thread.start(); System.out.println("Off and running..."); } } The code above uses Google's Guava @VisibleForTesting annotation to tell us that the doWork(CountDownLatch latch) method visibility has been relaxed slightly for testing purposes. Now I realise that making a method call package private for testing purposes in highly controversial; some people hate the idea, whilst others include it everywhere. I could write a whole blog on this subject (and may do one day), but for me it should be used judiciously, when there's no other choice, for example when you're writing characterisation tests for legacy code. If possible it should be avoided, but never ruled out. After all tested code is better than untested code. With this in mind the next iteration of ThreadWrapper designs out the need for a method marked as @VisibleForTesting together with the need to inject a CountDownLatch into your production code. The idea here is to use the Strategy Pattern and separate the Runnable implementation from the Thread. Hence, we have a very simple ThreadWrapper public class ThreadWrapper { /** * Start the thread running so that it does some work. */ public void doWork(Runnable job) { Thread thread = new Thread(job); thread.start(); System.out.println("Off and running..."); } } and a separate job: public class DatabaseJob implements Runnable { /** * Run method adding data to a fictitious database */ @Override public void run() { System.out.println("Start of the thread"); addDataToDB(); System.out.println("End of the thread method"); } private void addDataToDB() { try { Thread.sleep(4000); } catch (InterruptedException e) { e.printStackTrace(); } } } You'll notice that the DatabaseJob class doesn't use a CountDownLatch. How is it synchronised? The answer lies in the test code below... public class ThreadWrapperTest { @Test public void testDoWork() throws InterruptedException { ThreadWrapper instance = new ThreadWrapper(); CountDownLatch latch = new CountDownLatch(1); DatabaseJobTester tester = new DatabaseJobTester(latch); instance.doWork(tester); latch.await(); boolean result = getResultFromDatabase(); assertTrue(result); } /** * Dummy database method - just return true */ private boolean getResultFromDatabase() { return true; } private class DatabaseJobTester extends DatabaseJob { private final CountDownLatch latch; public DatabaseJobTester(CountDownLatch latch) { super(); this.latch = latch; } @Override public void run() { super.run(); latch.countDown(); } } } The test code above contains an inner class DatabaseJobTester, which extends DatabaseJob. In this class the run() method has been overridden to include a call to latch.countDown() after our fake database has been updated via the call to super.run(). This works because the test passes a DatabaseJobTester instance to the doWork(Runnable job) method adding in the required thread testing capability. The idea of sub-classing objects under test is something I've mentioned before in one of my blogs on testing techniques and is a really powerful technique. So, to conclude: Testing threads is hard. Testing anonymous inner classes is almost impossible. Using Thead.sleep(...) is a risky idea and should be avoided. You can refactor out these problems using the Strategy Pattern. Programming is the Art of Making the Right Decision ...and that relaxing a method's visibility for testing may or may not be a good idea, but more on that later... The code above is available on Github in the captain debug repository (git://github.com/roghughe/captaindebug.git) under the unit-testing-threads project.

February 13, 2013

by Roger Hughes

· 14,002 Views · 12 Likes

Java 8: From PermGen to Metaspace

As you may be aware, the JDK 8 Early Access is now available for download. This allows Java developers to experiment with some of the new language and runtime features of Java 8. One of these features is the complete removal of the Permanent Generation (PermGen) space which has been announced by Oracle since the release of JDK 7. Interned strings, for example, have already been removed from the PermGen space since JDK 7. The JDK 8 release finalizes its decommissioning. This article will share the information that we found so far on the PermGen successor: Metaspace. We will also compare the runtime behavior of the HotSpot 1.7 vs. HotSpot 1.8 (b75) when executing a Java program “leaking” class metadata objects. The final specifications, tuning flags and documentation around Metaspace should be available once Java 8 is officially released. Metaspace: A new memory space is born The JDK 8 HotSpot JVM is now using native memory for the representation of class metadata and is called Metaspace; similar to the Oracle JRockit and IBM JVM's. The good news is that it means no more java.lang.OutOfMemoryError: PermGen space problems and no need for you to tune and monitor this memory space anymore…not so fast. While this change is invisible by default, we will show you next that you will still need to worry about the class metadata memory footprint. Please also keep in mind that this new feature does not magically eliminate class and classloader memory leaks. You will need to track down these problems using a different approach and by learning the new naming convention. I recommend that you read the PermGen removal summary and comments from Jon on this subject. In summary: PermGen space situation This memory space is completely removed. The PermSize and MaxPermSize JVM arguments are ignored and a warning is issued if present at start-up. Metaspace memory allocation model Most allocations for the class metadata are now allocated out of native memory. The klasses that were used to describe class metadata have been removed. Metaspace capacity By default class metadata allocation is limited by the amount of available native memory (capacity will of course depend if you use a 32-bit JVM vs. 64-bit along with OS virtual memory availability). A new flag is available (MaxMetaspaceSize), allowing you to limit the amount of native memory used for class metadata. If you don’t specify this flag, the Metaspace will dynamically re-size depending of the application demand at runtime. Metaspace garbage collection Garbage collection of the dead classes and classloaders is triggered once the class metadata usage reaches the “MaxMetaspaceSize”. Proper monitoring & tuning of the Metaspace will obviously be required in order to limit the frequency or delay of such garbage collections. Excessive Metaspace garbage collections may be a symptom of classes, classloaders memory leak or inadequate sizing for your application. Java heap space impact Some miscellaneous data has been moved to the Java heap space. This means you may observe an increase of the Java heap space following a future JDK 8 upgrade. Metaspace monitoring Metaspace usage is available from the HotSpot 1.8 verbose GC log output. Jstat & JVisualVM have not been updated at this point based on our testing with b75 and the old PermGen space references are still present. Enough theory now, let’s see this new memory space in action via our leaking Java program… PermGen vs. Metaspace runtime comparison In order to better understand the runtime behavior of the new Metaspace memory space, we created a class metadata leaking Java program. You can download the source here. The following scenarios will be tested: Run the Java program using JDK 1.7 in order to monitor & deplete the PermGen memory space set at 128 MB. Run the Java program using JDK 1.8 (b75) in order to monitor the dynamic increase and garbage collection of the new Metaspace memory space. Run the Java program using JDK 1.8 (b75) in order to simulate the depletion of the Metaspace by setting the MaxMetaspaceSize value at 128 MB. JDK 1.7 @64-bit – PermGen depletion Java program with 50K configured iterations Java heap space of 1024 MB Java PermGen space of 128 MB (-XX:MaxPermSize=128m) As you can see form JVisualVM, the PermGen depletion was reached after loading about 30K+ classes. We can also see this depletion from the program and GC output. Class metadata leak simulator Author: Pierre-Hugues Charbonneau http://javaeesupportpatterns.blogspot.com ERROR: java.lang.OutOfMemoryError: PermGen space Now let’s execute the program using the HotSpot JDK 1.8 JRE. JDK 1.8 @64-bit – Metaspace dynamic re-size Java program with 50K configured iterations Java heap space of 1024 MB Java Metaspace space: unbounded (default) As you can see from the verbose GC output, the JVM Metaspace did expand dynamically from 20 MB up to 328 MB of reserved native memory in order to honor the increased class metadata memory footprint from our Java program. We could also observe garbage collection events in the attempt by the JVM to destroy any dead class or classloader object. Since our Java program is leaking, the JVM had no choice but to dynamically expand the Metaspace memory space. The program was able to run its 50K of iterations with no OOM event and loaded 50K+ Classes. Let's move to our last testing scenario. JDK 1.8 @64-bit – Metaspace depletion Java program with 50K configured iterations Java heap space of 1024 MB Java Metaspace space: 128 MB (-XX:MaxMetaspaceSize=128m) As you can see form JVisualVM, the Metaspace depletion was reached after loading about 30K+ classes; very similar to the run with the JDK 1.7. We can also see this from the program and GC output. Another interesting observation is that the native memory footprint reserved was twice as much as the maximum size specified. This may indicate some opportunities to fine tune the Metaspace re-size policy, if possible, in order to avoid native memory waste. Now find below the Exception we got from the Java program output. Class metadata leak simulator Author: Pierre-Hugues Charbonneau http://javaeesupportpatterns.blogspot.com ERROR: java.lang.OutOfMemoryError: Metadata space Done! As expected, capping the Metaspace at 128 MB like we did for the baseline run with JDK 1.7 did not allow us to complete the 50K iterations of our program. A new OOM error was thrown by the JVM. The above OOM event was thrown by the JVM from the Metaspace following a memory allocation failure. #metaspace.cpp Final words I hope you appreciated this early analysis and experiment with the new Java 8 Metaspace. The current observations definitely indicate that proper monitoring & tuning will be required in order to stay away from problems such as excessive Metaspace GC or OOM conditions triggered from our last testing scenario. Future articles may include performance comparisons in order to identify potential performance improvements associated with this new feature. Please feel free to provide any comment.

February 11, 2013

by Pierre - Hugues Charbonneau

· 598,668 Views · 33 Likes