Data Engineering Resources

The Latest Data Engineering Topics

Memory Barriers/Fences

In this article I'll discuss the most fundamental technique in concurrent programming known as memory barriers, or fences, that make the memory state within a processor visible to other processors. CPUs have employed many techniques to try and accommodate the fact that CPU execution unit performance has greatly outpaced main memory performance. In my “Write Combining” article I touched on just one of these techniques. The most common technique employed by CPUs to hide memory latency is to pipeline instructions and then spend significant effort, and resource, on trying to re-order these pipelines to minimise stalls related to cache misses. When a program is executed it does not matter if its instructions are re-ordered provided the same end result is achieved. For example, within a loop it does not matter when the loop counter is updated if no operation within the loop uses it. The compiler and CPU are free to re-order the instructions to best utilise the CPU provided it is updated by the time the next iteration is about to commence. Also over the execution of a loop this variable may be stored in a register and never pushed out to cache or main memory, thus it is never visible to another CPU. CPU cores contain multiple execution units. For example, a modern Intel CPU contains 6 execution units which can do a combination of arithmetic, conditional logic, and memory manipulation. Each execution unit can do some combination of these tasks. These execution units operate in parallel allowing instructions to be executed in parallel. This introduces another level of non-determinism to program order if it was observed from another CPU. Finally, when a cache-miss occurs, a modern CPU can make an assumption on the results of a memory load and continue executing based on this assumption until the load returns the actual data. Provided “program order” is preserved the CPU, and compiler, are free to do whatever they see fit to improve performance. Figure 1. Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers. These buffers are associative queues that allow fast lookup. This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache. Figure 1 above depicts a simplified view of a modern multi-core CPU. It shows how the execution units can use the local registers and buffers to manage memory while it is being transferred back and forth from the cache sub-system. In a multi-threaded environment techniques need to be employed for making program results visible in a timely manner. I will not cover cache coherence in this article. Just assume that once memory has been pushed to the cache then a protocol of messages will occur to ensure all caches are coherent for any shared data. The techniques for making memory visible from a processor core are known as memory barriers or fences. Memory barriers provide two properties. Firstly, they preserve externally visible program order by ensuring all instructions either side of the barrier appear in the correct program order if observed from another CPU and, secondly, they make the memory visible by ensuring the data is propagated to the cache sub-system. Memory barriers are a complex subject. They are implemented very differently across CPU architectures. At one end of the spectrum there is a relatively strong memory model on Intel CPUs that is more simple than say the weak and complex memory model on a DEC Alpha with its partitioned caches in addition to cache layers. Since x86 CPUs are the most common for multi-threaded programming I’ll try and simplify to this level. Store Barrier A store barrier, “sfence” instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued. This will make the program state visible to other CPUs so they can act on it if necessary. A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor. When the sequence is updated other consumers and producers know how far this consumer has progressed and thus can take appropriate action. All previous updates to memory that happened before the barrier are now visible. private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE; // from inside the run() method T event = null; long nextSequence = sequence + 1L; while (running) { try { final long availableSequence = dependencyBarrier.waitFor(nextSequence); while (nextSequence <= availableSequence) { event = dependencyBarrier.getEvent(nextSequence); eventHandler.onEvent(event, nextSequence == availableSequence); nextSequence++; } sequence = event.getSequence(); // store barrier inserted here !!! } catch (final Exception ex) { exceptionHandler.handle(ex, event); sequence = event.getSequence(); // store barrier inserted here !!! nextSequence = event.getSequence() + 1L; } } Load Barrier A load barrier, “lfence” instruction on x86, forces all load instructions after the barrier to happen after the barrier and then wait on the load buffer to drain for that CPU. This makes program state exposed from other CPUs visible to this CPU before making further progress. A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor. Full Barrier A full barrier, "mfence" instruction on x86, is a composite of both load and store barriers happening on a CPU. Java Memory Model In the Java Memory Model a volatile field has a store barrier inserted after a write to it and a load barrier inserted before a read of it. Qualified final fields of a class have a store barrier inserted after their initialisation to ensure these fields are visible once the constructor completes when a reference to the object is available. Atomic Instructions and Software Locks Atomic instructions, such as the “lock ...” instructions on x86, are effectively a full barrier as they lock the memory sub-system to perform an operation and have guaranteed total order, even across CPUs. Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order. Performance Impact of Memory Barriers Memory barriers prevent a CPU from performing a lot of techniques to hide memory latency therefore they have a significant performance cost which must be considered. To achieve maximum performance it is best to model the problem so the processor can do units of work, then have all the necessary memory barriers occur on the boundaries of these work units. Taking this approach allows the processor to optimise the units of work without restriction. There is an advantage to grouping necessary memory barriers in that buffers flushed after the first one will be less costly because no work will be under way to refill them. From http://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html

September 12, 2011

by Martin Thompson

· 26,088 Views · 8 Likes

On DTOs

DTOs, or data-transfer objects, are commonly used. What is not со commonly-known is that they originate from DDD (Domain-driven design). There it makes a lot of sense – domain objects have state, identity and business logic while DTOs have only state. But many projects today are using the anemic data model approach (my opinion) and still use DTOs. They are used whenever an object “leaves” the service layer or “leaves” the system (through web services, rmi, etc.). There are three approaches: every entity has at least one corresponding DTO. Usually more than one, for different scenarios in the view layer. When you display a user in a list you have one DTO, when you display it in a “user details” window you need a more extended DTO. I am not in favour of this approach because in too many cases the DTO and the domain structure have exactly the same structure and as a result there’s a lot of duplicated code + redundant mapping. Another thing is the variability of multiple DTOs. Even if they differ from the entity, they differ from one another with one or two fields. Why duplication is a bad thing? Because changes are to be made in two places, issues are traced harder when data passes through multiple objects, and because..it is duplication. Copy & paste within the same project is a sin. DTOs are only created when their structure significantly differs from the that of the entity. In all other cases the entity itself is used. The cases when you don’t want to show some fields (especially when exposing via web services to 3rd parties) exist, but are not that common. This can sometimes be handled via the serialization mechanism – mark them as @JsonIgnore or @XmlTransient for example – but in other cases the structures are just different. In these cases a DTO is due. For example you have a User and UserDetails, where UserDetails holds the details + the relations of the currently logged user to the given user. The latter has nothing to do with the entity, so you create a DTO. However in the case of a DirectMessage you have sender, recipient, text and datetime both in the DB and in the UI. No need to have a DTO. One caveat with this approach (as well as with the next one). Anemic entities usually come with an ORM (JPA in the case of Java). Whenever they exit the service layer they may be invalid, because of lazy collections that require an open session. You have two options here: use OpenSessionInView / OpenEntityManagerInView – thus your session stays open until you are finished preparing the response. This is easy to configure but is not my preferred option – it violates layer boundaries in a subtle way, and this sometimes leads to problems especially for novice developers Don’t use lazy collections. Lazy collections are unneeded. Either make them eager, if they are supposed to hold a small list of items (for example – the list of roles for a user), or if the data is likely to grow use queries. Yes, you are not going to show 1000 records at on go anyway, you will have to page it. Without lazy associations (@*ToOne are eager by default) you won’t have invalid objects when the session is closed Don’t use DTOs at all. Applicable a soon as there aren’t significantly varying structures. For smaller projects this is usually a good way to go. Everything mentioned in the above point applies here. So my preferred approach is the “middle way”. But it requires a lot of consideration in each case, which may not be applicable for bigger and/or less experienced teams. So one of the two “extremes” should be picked. Since the “no DTOs” approach also requires consideration – what to make @Transient, how does lazy collections affect the flow, etc, the “All DTOs” is usually chosen. But even though it is seemingly the safest approach, it has many pitfalls. First, how do you map from DTOs to entities and vice-versa? Three options: dedicated mapper classes constructors – the DTO constructor takes the entity and fills itself, and vice-versa (remember to also provide a default constructor) declarative mapping (e.g. Dozer). This is practically the same as the first option – it externalizes the mapping. It can even be used together with a dedicated mapper class map them in-line (whenever needed). This can generate unmaintainable code and is not preferred I prefer the constructor approach, at least because fewer classes are created. But they are essentially the same (DTOs are not famous for encapsulation, so all of your properties are exposed anyway). Here is a list of guidelines when using DTOs and either of the “mapping” approaches: Don’t generate too much redundant code. If two scenarios require slightly different DTOs, reuse. No need to create a new DTO for a difference of one or two fields Don’t put presentation logic in mappers/constructors. For example if (entity.isActive()) dto.setStatus("Active"); This should happen in the view layer Don’t sneak entities together with DTOs. DTOs should not have members which are entities. Generally, entities should not be used outside the service layer (this is a bit extreme, but if we use DTOs everywhere we should be consistent and stick to that practice) Don’t use the mappers/entity-to-dto constructors in controllers, use them in the service layer. The reason DTOs are used in the first place is that entities may be ORM-bound, and they may not valid outside a session (i.e. outside the service layer). If using mappers, prefer static mapper methods. Mappers don’t have state, so no need for them to be instantiated. (And they don’t have to be mocked, wrapped, etc). If using mappers, there’s no need for a separate mapper for each entity(+its multiple DTOs). Related entities can be grouped in one mapper. For example Company, CompanyProfile, CompanySubsidiary can use the same mapper class Just make sure you make all these decisions at the beginning of a project and figure out which is applicable in your scenario (team size and experience, project size, domain complexity). From http://techblog.bozho.net/?p=427

September 10, 2011

by Bozhidar Bozhanov

· 27,723 Views · 3 Likes

JSON data migration

JSON data format is simple and still powerful. Nowadays you can encounter more and more web applications communicating using JSON format then a couple of years ago. It is simple for a developer to read the format, it is effective for a web browser to parse the format and there are databases using it as its primary data format. But what happens when the data structure changes? You need to migrate. And that is where acris-json-migration might help you! Example situation Let's shed a light into it and assume we have a data like this: { "firstName":"John" "secondName":"Doe" "street":"Over the rainbow" "streetNr":21 } Such data can be represented by following Java domain object: public class Person { String firstName; String secondName; String street; Integer streetNr; // ... and getters and setters... } Well, this seems like data about a person named John Doe. We stored it in a database and you can clearly see, that secondName is probably not the field name we really like to have. But a developer made a mistake and in second version of our domain model we are going to fix it: { "firstName":"John" "surname":"Doe" "street":"Over the rainbow" "streetNr":21 } Now you can see the point - thousands of data stored in the format defined by Person class in its version #1 but our program communicating in version #2 with changed secondName to surname in Person class. Clients can wonder why the don't see surnames, can't they? ;) One thing to remember (for the following context) - the class Person changed and there is only Person class in version #2. Simple migration script In this situation I would like to write a script: public class PersonV1toV2Script extends JacksonTransformationScript { @Override public void process(ObjectNode node) { rename(node, "secondName", "surname"); } } From the above example it is clear that the script will do the job. And you can do pretty anything with the whole tree of JSON data - adding new nodes, removing existing ones, transforming here and there - all thanks to Jackson's tree model. How can I execute it? There is a Transformer abstract class representing a transofmer responsible for passing JSON data to a script and writing it back. Currently there are tow kinds of transformers: Jackson-based JSONT-based Jackson-based is the preferred one and is more developed then JSONT-based. So to execute a transformation on a data set you have to specify only two lines of code: JacksonTransformer t = new JacksonTransformer(input, output); t.transform(PersonV1toV2Script.class.getName()); ... where input and output represent directories. In the input directory all files are treated as files containing JSON data and are transformed and written to the output directory. For a detailed test you can look into TransformerTest in the project. Conclusion The script's helper API is evolving and provides you with nice methods like removeIfExists or addNonExistent methods. We would like to hear about your use-cases which are not handled by acris-json-migration yet so the project can generally serve the purpose of JSON data migration.

September 9, 2011

by Ladislav Gažo

· 12,548 Views

Testing Databases with JUnit and Hibernate Part 1: One to Rule them

There is little support for testing the database of your enterprise application. We will describe some problems and possible solutions based on Hibernate and JUnit.

September 6, 2011

by Jens Schauder

· 122,968 Views · 2 Likes

Memory is stored within the cache system in units know as cache lines. Cache lines are a power of 2 of contiguous bytes which are typically 32-256 in size. The most common cache line size is 64 bytes. False sharing is a term which applies when threads unwittingly impact the performance of each other while modifying independent variables sharing the same cache line. Write contention on cache lines is the single most limiting factor on achieving scalability for parallel threads of execution in an SMP system. I’ve heard false sharing described as the silent performance killer because it is far from obvious when looking at code. To achieve linear scalability with number of threads, we must ensure no two threads write to the same variable or cache line. Two threads writing to the same variable can be tracked down at a code level. To be able to know if independent variables share the same cache line we need to know the memory layout, or we can get a tool to tell us. Intel VTune is such a profiling tool. In this article I’ll explain how memory is laid out for Java objects and how we can pad out our cache lines to avoid false sharing. Figure 1. Figure 1. above illustrates the issue of false sharing. A thread running on core 1 wants to update variable X while a thread on core 2 wants to update variable Y. Unfortunately these two hot variables reside in the same cache line. Each thread will race for ownership of the cache line so they can update it. If core 1 gets ownership then the cache sub-system will need to invalidate the corresponding cache line for core 2. When Core 2 gets ownership and performs its update, then core 1 will be told to invalidate its copy of the cache line. This will ping pong back and forth via the L3 cache greatly impacting performance. The issue would be further exacerbated if competing cores are on different sockets and additionally have to cross the socket interconnect. Java Memory Layout For the Hotspot JVM, all objects have a 2-word header. First is the “mark” word which is made up of 24-bits for the hash code and 8-bits for flags such as the lock state, or it can be swapped for lock objects. The second is a reference to the class of the object. Arrays have an additional word for the size of the array. Every object is aligned to an 8-byte granularity boundary for performance. Therefore to be efficient when packing, the object fields are re-ordered from declaration order to the following order based on size in bytes: doubles (8) and longs (8) ints (4) and floats (4) shorts (2) and chars (2) booleans (1) and bytes (1) references (4/8) With this knowledge we can pad a cache line between any fields with 7 longs. Within the Disruptor we pad cache lines around the RingBuffer cursor and BatchEventProcessor sequences. To show the performance impact let’s take a few threads each updating their own independent counters. These counters will be volatile longs so the world can see their progress. public final class FalseSharing implements Runnable { public final static int NUM_THREADS = 4; // change public final static long ITERATIONS = 500L * 1000L * 1000L; private final int arrayIndex; private static VolatileLong[] longs = new VolatileLong[NUM_THREADS]; static { for (int i = 0; i < longs.length; i++) { longs[i] = new VolatileLong(); } } public FalseSharing(final int arrayIndex) { this.arrayIndex = arrayIndex; } public static void main(final String[] args) throws Exception { final long start = System.nanoTime(); runTest(); System.out.println("duration = " + (System.nanoTime() - start)); } private static void runTest() throws InterruptedException { Thread[] threads = new Thread[NUM_THREADS]; for (int i = 0; i < threads.length; i++) { threads[i] = new Thread(new FalseSharing(i)); } for (Thread t : threads) { t.start(); } for (Thread t : threads) { t.join(); } } public void run() { long i = ITERATIONS + 1; while (0 != --i) { longs[arrayIndex].value = i; } } public final static class VolatileLong { public volatile long value = 0L; public long p1, p2, p3, p4, p5, p6; // comment out } } Results Running the above code while ramping the number of threads and adding/removing the cache line padding, I get the results depicted in Figure 2. below. This is measuring the duration of test runs on my 4-core Nehalem at home. Figure 2. The impact of false sharing can clearly be seen by the increased execution time required to complete the test. Without the cache line contention we achieve near linear scale up with threads. This is not a perfect test because we cannot be sure where the VolatileLongs will be laid out in memory. They are independent objects. However experience shows that objects allocated at the same time tend to be co-located. So there you have it. False sharing can be a silent performance killer. From http://mechanical-sympathy.blogspot.com/2011/07/false-sharing.html

August 31, 2011

by Martin Thompson

· 39,026 Views · 10 Likes

Cloud Integration with Apache Camel and Amazon Web Services (AWS): S3, SQS and SNS

The integration framework Apache Camel already supports several important cloud services (see my overview article at http://www.kai-waehner.de/blog/2011/07/09/cloud-computing-heterogeneity-will-require-cloud-integration-apache-camel-is-already-prepared for more details). This article describes the combination of Apache Camel and the Amazon Web Services (AWS) interfaces of Simple Storage Service (S3), Simple Queue Service (SQS) and Simple Notification Service (SNS). Thus, The concept of Infrastructure as a Service (IaaS) is used to access messaging systems and data storage without any need for configuration. Registration to AWS and Setup of Camel First, you have to register to the Amazon Web Services (for free). Most AWS services include a free monthly quota, which is absolutely sufficient to play around and develop some simple applications. As its name states, AWS uses technology-independent web services. Besides, APIs for several different programming languages are available to ease development. By the way, Camel uses the AWS SDK for Java (http://aws.amazon.com/sdkforjava), of course. The documentation is detailed and easy to understand, including tutorials, screenshots and code examples . Hint 1: You should read the introductions to S3, SQS and SNS (go to http://aws.amazon.com and click on „products“) and play around with the AWS Management Console (http://aws.amazon.com/console) before you continue. This step is very easy and takes less than one hour. Then, you will have a much better understanding about AWS and where Camel can help you! Hint 2: It really helps to look at the source code of the camel-aws component, It helps you to understand how Camel uses the AWS Java API internally. If you want to write tests, you can do it the same way. In the past, I was afraid of looking at „complex“ source code of open source frameworks. But there is no need to be scared! The camel-aws component (and most other camel components) contain only of a few classes. Everything is easy to understand. It helps you to understand Camel internals, the AWS API, and to spot and solve errors due to exceptions in your code. In the meanwhile, the current Camel version 2.8 supports three AWS services: S3, SQS and SNS. All of them use similar concepts. Therefore, they are included in one single camel component: „camel-aws“. You have to add the libraries to your existing Camel project. As always, the simplest way is to use Maven and add the following dependency to the pom.xml: org.apache.camel camel-aws ${camel-version} Configuration of the Camel Endpoint The implementation and configuration of all three services is very similar. The URI looks like this (the code shows the SQS service): aws-sqs://queue-name[?options] There are two alternatives to configure your endpoint. Using Parameters The easy way is to use two paramters in the URI of your endpoint: „accessKey“ and „secretKey“ (you receive both after your AWS registration). “aws-sqs://unique-queue-name?accessKey=“INSERT_ME“&secretKey=INSERT_ME” Be aware of the following problem, which can result in a strange, non-speaking exception (thanks to Brendan Long): You’ll need to URL encode any +’s in your secret key (otherwise, they’ll be treated as spaces). + = %2B, so if your secretkey was “my+secret\key”, your Camel URL should have “secretKey=my%2Bsecret\key”. “Within the query string, the plus sign is reserved as shorthand notation for a space. Therefore, real plus signs must be encoded. This method was used to make query URIs easier to pass in systems which did not allow spaces.” Source: WC3 URI Recommendations Adding a configured AmazonClient to the Registry If you need to do more configuration (e.g. because your system is behind a firewall), you have to add an AmazonClient object to your registry. The following code shows an example using SQS, but SNS and S3 use exactly the same concept. @Override protected JndiRegistry createRegistry() throws Exception { JndiRegistry registry = super.createRegistry(); AWSCredentials awsCredentials = new BasicAWSCredentials(“INSERT_ME”, “INSERT_ME”); ClientConfiguration clientConfiguration = new ClientConfiguration(); clientConfiguration.setProxyHost(“http://myProxyHost”); clientConfiguration.setProxyPort(8080); AmazonSQSClient client = new AmazonSQSClient(awsCredentials, clientConfiguration); registry.bind(“amazonSQSClient”, client); return registry; } This example overwrites the createRegistry() method of a JUnit test (extending CamelTestSupport). You can also add this information to your runtime Camel application, of course. Apache Camel and the Simple Storage Service (S3) Simple Storage Service (S3) is a key-value-store. You can store small to very large data. The usage is very easy. You create buckets and put key-value data into these buckets. You can also create folders within buckets to organize your data. That’s it. You can monitor your buckets using the AWS Management Console – an intuitive GUI supporting most AWS services. The following example shows both alternatives for accessing the Amazon services (as described above): Paramenters and the AmazonClient. // Transfer data from your file inbox to the AWS S3 service from(“file:files/inbox”) // This is the key of your key-value data .setHeader(S3Constants.KEY, simple(“This is a static key”)) // Using parameters for accessing the AWS service .to(“aws-s3://camel-integration-bucket-mwea-kw?accessKey=INSERT_ME&secretKey=INSERT_ME&region=eu-west-1″); // Transfer data from the AWS S3 service to your file outbox from(“aws-s3://camel-integration-bucket-mwea-kw?amazonS3Client=#amazonS3Client&region=eu-wes”) .to(“file:files/outbox”); There are some additional parameters, for instance you can submit the desired AWS region or delete data after receiving it (see http://camel.apache.org/aws-s3.html and the corresponding SQS and SNS sites for more details about parameters and message headers). As you see in the code, you can use the AWS-S3 endpoint for producing and for consuming messages. Each bucket must be unique, thus you have to add some specific information such as your company to its name. Hint: If a bucket does not exist, Camel is creating it automatically (as the AWS API does). This concept is also used for SQS queues and SNS topics. Apache Camel and the Simple Queue Service (SQS) The Simple Queue Service (SQS) is similar to a JMS provider such as WebSphere MQ or ActiveMQ (but with some differences). You create queues and send messages to them. Consumers receive the messages. Contrary to most other AWS services, you cannot monitor queues by using the AWS management console directly. You have to use the service „Cloudwatch“ (http://aws.amazon.com/cloudwatch) and start an EC2 instance to monitor queues and its content. As you can see in the following code example, the syntax and concepts are almost the same as for the S3 service: from(“file:inbox”) .to(“aws-sqs://camel-integration-queue-mwea-kw?accessKey=INSERT_ME&secretKey=INSERT_ME”); from(“aws-sqs://camel-integration-queue-mwea-kw?amazonSQSClient=#amazonSQSClient”) .to(“file:outbox?fileName=sqs-${date:now:yyyy.MM.dd-hh:mm:ss:SS}”); Again, you can use the AWS-SQS endpoint for producing and for consuming messages. Each queue name must be unique. There exist two important differences to JMS (copy & paste from the AWS documentation): Q: How many times will I receive each message? Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies. Q: Why are there separate ReceiveMessage and DeleteMessage operations? When Amazon SQS returns a message to you, that message stays in the queue, whether or not you actually received the message. You are responsible for deleting the message; the delete request acknowledges that you’re done processing the message. If you don’t delete the message, Amazon SQS will deliver it again on another receive request. Apache Camel and the Simple Notification Service (SNS) The Simple Notification Service (SNS) acts like JMS topics. You create a topic, consumers subscribe to the topic and then receive notifications. Several transport protocols are supported: HTTP(S), Email and SQS. Further interfaces will be added in the future, e.g. the Short Message Service (SMS) for mobile phones. Contrary to S3 and SQS, Camel only offers a producer endpoint for this AWS service. You can only create topics and send messages via Camel. The reason is simple: Camel already offers endpoints for consuming these messages: HTTP, Email and SQS are already available. There is one tradeoff: A consumer cannot subscribe to topics using Camel – at the moment. The AWS Management Console has to be used. A very interesting discussion can be read on the Camel JIRA issue regarding the following questions: Should Camel be able to subscribe to topics? Should the producer contain this feature or should there be a consumer? In my opinion, there should be a consumer which is able to subscribe to topics, otherwise Camel is missing a key part of the AWS SNS service! Please read the discussion and contribute your opinion: https://issues.apache.org/jira/browse/CAMEL-3476. Apache Camel is already ready for the Cloud Computing Era AWS offers many more services for the cloud. Probably, it does not make sense to integrate everyone into Camel, but more AWS services will be supported in the future. For instance, SimpleDB and the Relational Database Service (RDS) are already planned and make sende, too: http://camel.apache.org/aws.html. The conclusion is easy: Apache Camel is already ready for the cloud computing era. Several important cloud services are already supported. Cloud integration will become very important in the future. Thus, Camel is on a very good way. Hopefully, we will see more cloud components, soon. I will continue to write articles about other Camel cloud components (and new AWS addons, ouf course). For instance, a component for the Platform as a Service (PaaS) product Google App Engine (GAE) is already available. If you have any additional important information, questions or other feedback, please write a comment. Thank you in advance… Best regards, Kai Wähner (Twitter: @KaiWaehner) [Content from my Blog: Cloud Integration with Apache Camel and Amazon Web Services (AWS): S3, SQS and SNS]

August 30, 2011

by Kai Wähner

CORE

· 26,164 Views

Concurrency Pattern: Producer and Consumer

In my career spanning 15 years, the problem of Producer and Consumer is one that I have come across only a few times. In most programming cases, what we are doing is performing functions in a synchronous fashion where the JVM or the web container handles the complexities of multi-threading on its own. However, when writing certain kinds of use cases where we need this. Last week, I came acros one such use case that sent me 3 years back when I last did it. However, the way it was done last time was very different. When I first heard the problem statement, I knew instantly what was needed. However, my approach to doing it this time was going to be different from last time. It had simply to do with how I am viewing technology in my life today. I will not go into any non-technical side and will jump straight into the problem and its solution. I started to look at what existed in the market and did come across a couple of posts that helped me in channelizing my thoughts in the right way. Problem Statement We need a solution for a batch migration. We are migrating data form System 1 to System 2 and in the process we need to do three tasks: Load data from Database based on groups Process the data Update the records loaded in step#1 with modifications We have to handle 100s of groups and each group will have around 40K records. You can imagine the amount of time it would take if we were to perform this exercise in a synchronous fashion. Image here explains this problem in an effective way. Producer Consumer: The Problem Producer and Consumer Pattern Let us take a look at the Producer Consumer pattern to begin with. If you refer to the problem statement above and look at the image, we see that there are so many entities who are ready with their part of data. However, there are not enough workers who can process all the data. Hence, as the producers continue to line-up in a queue it just continues to grow. We see that the systems start to hog up threads and take a lot of time. Intermediate Solution Producer Consumer: The Intermediate approch We do have an intermediate solution. Refer to the image and you will immediately notice that the producers are piling up their work in a filing cabinet and the worker continues to pick it up as they get done with the previous task. However, this approach does have some glaring shortcomings: There is still one worker who has to do all the work. The external systems may be happy, but the task will still continue to exist until the worker has completed all of the tasks The producers will pile up their data in a queue and it needs resources to hold the same. Just as in this example the cabinet can fill up, the same can happen with the JVM resources too. We need to be careful how much data we are going to place in memory and in some cases it may not be much. The Solution Producer Consumer: The Solution The solution is what we see everyday in many places – like the cinema hall queue, Petrol Pumps etc. There are so many people who come in to book a ticket and based on how many people come in, the more people are added to issue tickets. Essentially, refer to image here and you will notice that Producers will keep adding their jobs to the cabinet and we have more workers to handle the work load. Java provided concurrency package to solve this issue. Till now, I have always worked on threading at a much lower level and this was first time I was going to work with this package. As I started to explore the web and read fellow bloggers with what they have to say, I came across one very good article. It helped in understanding the use of BlockingQueue in a very effective manner. However, the solutions provided by Dhruba would not have helped me in achieving the high throughput which is needed. So, I started to explore the use of ArrayBlockingQueue for the same. The Controller This is the first class where the contract between the producers and consumers are managed. The controller will setup 1 thread for the Producer and 2 threads for the consumer. Based on the needs we can create as many threads as we need; and even can even read the data from a properties or do some dynamic magic. For now, we will keep this simple. package com.kapil.techieforever.producerconsumer; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; public class TestProducerConsumer { public static void main(String args[]) { try { Broker broker = new Broker(); ExecutorService threadPool = Executors.newFixedThreadPool(3); threadPool.execute(new Consumer("1", broker)); threadPool.execute(new Consumer("2", broker)); Future producerStatus = threadPool.submit(new Producer(broker)); // this will wait for the producer to finish its execution. producerStatus.get(); threadPool.shutdown(); } catch (Exception e) { e.printStackTrace(); } } } I am using ExecuteService to create a thread pool and manage it. Instead of using the basic Thread implementation, this is a more effective way as it will handle the exiting and restarting the threads as needed. You will also notice that I am using Future class to get the status of the producer thread. This class is very effective and will halt my program from further execution. This is a nice way of replacing the “.join” method on the threads. Note: I am not using Future very effectively in this example; so you may have to try a few things as you feel fit. Also, you should note the Broker class which is being used as filing cabinet between the producers and consumers. We will see its implementation in just a little while. The Producer This class is responsible for producing the data that needs to be worked upon. package com.kapil.techieforever.producerconsumer; public class Producer implements Runnable { private Broker broker; public Producer(Broker broker) { this.broker = broker; } @Override public void run() { try { for (Integer i = 1; i < 5 + 1; ++i) { System.out.println("Producer produced: " + i); Thread.sleep(100); broker.put(i); } this.broker.continueProducing = Boolean.FALSE; System.out.println("Producer finished its job; terminating."); } catch (InterruptedException ex) { ex.printStackTrace(); } } } This class is doing the most simplest of things that it can do – adding an integer to the broker. Some key areas to note are: 1. There is a property on Broker which is updated in the end by the producer when its done producing. This is also known as the “final” or “poison” entry. This is used by the consumers to know that there are no more data coming up 2. I have used Thread.sleep to simulate that some producers may take more time to produce the data. You can tweak this value and see the consumers act The Consumer This class is responsible for reading the data from the broker and doing its job package com.kapil.techieforever.producerconsumer; public class Consumer implements Runnable { private String name; private Broker broker; public Consumer(String name, Broker broker) { this.name = name; this.broker = broker; } @Override public void run() { try { Integer data = broker.get(); while (broker.continueProducing || data != null) { Thread.sleep(1000); System.out.println("Consumer " + this.name + " processed data from broker: " + data); data = broker.get(); } System.out.println("Comsumer " + this.name + " finished its job; terminating."); } catch (InterruptedException ex) { ex.printStackTrace(); } } } This is again a simple class that reads the Integer and prints it on the console. However, key points to note are: 1. The loop to process data is an endless loop, that runs on two conditions – until the producer is consuming and there is some data with the broker 2. Again, the Thread.sleep is used to create effective and different scenarios The Broker package com.kapil.techieforever.producerconsumer; import java.util.concurrent.ArrayBlockingQueue; import java.util.concurrent.TimeUnit; public class Broker { public ArrayBlockingQueue queue = new ArrayBlockingQueue(100); public Boolean continueProducing = Boolean.TRUE; public void put(Integer data) throws InterruptedException { this.queue.put(data); } public Integer get() throws InterruptedException { return this.queue.poll(1, TimeUnit.SECONDS); } } The very first thing to note is that we are using ArrayBlockingQueue as the data holder. I am not going to say what this does, but insist you to read it on the JavaDocs here. however, I will explain that the producers are going to place the data in the queue and the consumers will fetch from the queue in FIFO format. But, if the producers are slow, the consumers will wait for data to come in and if the array is full, the producers will wait for it to fill up. Also, note that I am using the ‘poll’ function instead of get in the queue. This is to ensure that the consumers will not keep waiting for ever and the waiting will time out after a few seconds. This helps us in inter-communication and kill the consumers when all the data is processed. (Note: try replacing poll with get and you will see some interesting outputs). Code I have the code sitting on Google project hosting. Feel free to go across and download it from there. It is essentially an eclipse (Spring STS) project. You may also get additional packages and classes when you download it based on when you are downloading it. Feel free to look into those too and share your comments - You can browse the source code on the SVN browser or; - You can download it from the project itself From http://scratchpad101.com/2011/08/22/concurrency-pattern-producer-consumer/

August 29, 2011

by Kapil Viren Ahuja

· 68,594 Views · 2 Likes

Java NIO vs. IO

when studying both the java nio and io api's, a question quickly pops into mind: when should i use io and when should i use nio? in this text i will try to shed some light on the differences between java nio and io, their use cases, and how they affect the design of your code. main differences of java nio and io the table below summarizes the main differences between java nio and io. i will get into more detail about each difference in the sections following the table. io nio stream oriented buffer oriented blocking io non blocking io selectors stream oriented vs. buffer oriented the first big difference between java nio and io is that io is stream oriented, where nio is buffer oriented. so, what does that mean? java io being stream oriented means that you read one or more bytes at a time, from a stream. what you do with the read bytes is up to you. they are not cached anywhere. furthermore, you cannot move forth and back in the data in a stream. if you need to move forth and back in the data read from a stream, you will need to cache it in a buffer first. java nio's buffer oriented approach is slightly different. data is read into a buffer from which it is later processed. you can move forth and back in the buffer as you need to. this gives you a bit more flexibility during processing. however, you also need to check if the buffer contains all the data you need in order to fully process it. and, you need to make sure that when reading more data into the buffer, you do not overwrite data in the buffer you have not yet processed. blocking vs. non-blocking io java io's various streams are blocking. that means, that when a thread invokes a read() or write(), that thread is blocked until there is some data to read, or the data is fully written. the thread can do nothing else in the meantime. java nio's non-blocking mode enables a thread to request reading data from a channel, and only get what is currently available, or nothing at all, if no data is currently available. rather than remain blocked until data becomes available for reading, the thread can go on with something else. the same is true for non-blocking writing. a thread can request that some data be written to a channel, but not wait for it to be fully written. the thread can then go on and do something else in the mean time. what threads spend their idle time on when not blocked in io calls, is usually performing io on other channels in the meantime. that is, a single thread can now manage multiple channels of input and output. selectors java nio's selectors allow a single thread to monitor multiple channels of input. you can register multiple channels with a selector, then use a single thread to "select" the channels that have input available for processing, or select the channels that are ready for writing. this selector mechanism makes it easy for a single thread to manage multiple channels. how nio and io influences application design whether you choose nio or io as your io toolkit may impact the following aspects of your application design: the api calls to the nio or io classes. the processing of data. the number of thread used to process the data. the api calls of course the api calls when using nio look different than when using io. this is no surprise. rather than just read the data byte for byte from e.g. an inputstream, the data must first be read into a buffer, and then be processed from there. the processing of data the processing of the data is also affected when using a pure nio design, vs. an io design. in an io design you read the data byte for byte from an inputstream or a reader. imagine you were processing a stream of line based textual data. for instance: name: anna age: 25 email: [email protected] phone: 1234567890 this stream of text lines could be processed like this: inputstream input = ... ; // get the inputstream from the client socket bufferedreader reader = new bufferedreader(new inputstreamreader(input)); string nameline = reader.readline(); string ageline = reader.readline(); string emailline = reader.readline(); string phoneline = reader.readline(); notice how the processing state is determined by how far the program has executed. in other words, once the first reader.readline() method returns, you know for sure that a full line of text has been read. the readline() blocks until a full line is read, that's why. you also know that this line contains the name. similarly, when the second readline() call returns, you know that this line contains the age etc. as you can see, the program progresses only when there is new data to read, and for each step you know what that data is. once the executing thread have progressed past reading a certain piece of data in the code, the thread is not going backwards in the data (mostly not). this principle is also illustrated in this diagram: java io: reading data from a blocking stream. a nio implementation would look different. here is a simplified example: bytebuffer buffer = bytebuffer.allocate(48); int bytesread = inchannel.read(buffer); notice the second line which reads bytes from the channel into the bytebuffer. when that method call returns you don't know if all the data you need is inside the buffer. all you know is that the buffer contains some bytes. this makes processing somewhat harder. imagine if, after the first read(buffer) call, that all what was read into the buffer was half a line. for instance, "name: an". can you process that data? not really. you need to wait until at leas a full line of data has been into the buffer, before it makes sense to process any of the data at all. so how do you know if the buffer contains enough data for it to make sense to be processed? well, you don't. the only way to find out, is to look at the data in the buffer. the result is, that you may have to inspect the data in the buffer several times before you know if all the data is inthere. this is both inefficient, and can become messy in terms of program design. for instance: bytebuffer buffer = bytebuffer.allocate(48); int bytesread = inchannel.read(buffer); while(! bufferfull(bytesread) ) { bytesread = inchannel.read(buffer); } the bufferfull() method has to keep track of how much data is read into the buffer, and return either true or false, depending on whether the buffer is full. in other words, if the buffer is ready for processing, it is considered full. the bufferfull() method scans through the buffer, but must leave the buffer in the same state as before the bufferfull() method was called. if not, the next data read into the buffer might not be read in at the correct location. this is not impossible, but it is yet another issue to watch out for. if the buffer is full, it can be processed. if it is not full, you might be able to partially process whatever data is there, if that makes sense in your particular case. in many cases it doesn't. the is-data-in-buffer-ready loop is illustrated in this diagram: java nio: reading data from a channel until all needed data is in buffer. summary nio allows you to manage multiple channels (network connections or files) using only a single (or few) threads, but the cost is that parsing the data might be somewhat more complicated than when reading data from a blocking stream. if you need to manage thousands of open connections simultanously, which each only send a little data, for instance a chat server, implementing the server in nio is probably an advantage. similarly, if you need to keep a lot of open connections to other computers, e.g. in a p2p network, using a single thread to manage all of your outbound connections might be an advantage. this one thread, multiple connections design is illustrated in this diagram: java nio: a single thread managing multiple connections. if you have fewer connections with very high bandwidth, sending a lot of data at a time, perhaps a classic io server implementation might be the best fit. this diagram illustrates a classic io server design: java io: a classic io server design - one connection handled by one thread. from http://tutorials.jenkov.com/java-nio/nio-vs-io.html

August 28, 2011

by Jakob Jenkov

· 133,904 Views · 19 Likes

Practical PHP Refactoring: Replace Array with Object

This refactoring is a specialization of Replace Data Value with Object: its goal is to replace a scalar or primitive structure (in this case, an ever-present array) with an object where we can host methods that act on those data. We have already seen a lightweight version of this refactoring in the code sample of that article: this time we go all the way to a real object, which has private fields representing the elements of the array. Usually the target of the refactoring is an associative array, but it may also be a numeric one, with a limited number of elements. When to introduce an object where a simple array already works? A clue that points to the need for this refactoring in numerical arrays is the fact that the elements are not homogeneous: they may be in type (all strings or integers) but not in meaning. For example, if two or them are flipped the array loses meaning or becomes very strange: array( 'FirstName LastName', '[email protected]' ) For associative arrays, the refactoring is viable everytime the number of elements is strictly fixed: array( 'name' => ... 'email' => ... ) Private fields are self-documenting, and they're easier to understand and maintain that the documentation of the keys of an array. Documentation on array structures always gets repeated in docblocks and doesn't have a real place to live in without a class; moreover, it's the death of encapsulation as nothing stops client code (even in the parts that should only pass the array to other methods) from accessing every single element of the array. And of course, a class is a place where to put methods, while an array cannot host them. Steps The technique described by Fowler for this refactoring is composed of many little steps: create a new class: it should contain only a public field encapsulating a little the array. Change the client code to use this new class in place of the primitive variable. In an iterative cycle, add a getter and a setter for each field and change client code. At each step, the relevant tests should be run. The methods should still use internally the elements of the array. When this phase has been completed, make the array private and see if the code still works. Add private fields to substitute the elements of the array, and change getters and setters accordingly. This change now ripples only into the source code of the new class. When you're finished, delete the field storing the array. Many little steps are often appropriate as the usage of the array spans over dozens of differente classes, and raises the risk of reaching an irreparably broken build. After you have reached the final state, an object with getters and setters, you can go on and remove methods accordingly for immutability or encapsulation; or move Foreign Methods to the new class now that it has become a first class citizen. Note that tests may encompass even end-to-end ones if the array was used on a large scale. For example, we replaced arrays with objects in the two upper layers of the application, forcing us to run tests at the end-to-end scale. Example In the initial state, a response is created by putting together an array. Client code is omitted for brevity, and only the creation part will be our target. true, 'content' => '{someJson:"ok"}' ); } } The array is moved onto a public field of a new class. true, 'content' => '{someJson:"ok"}' )); } } class HttpResponse { public $data; public function __construct(array $data) { $this->data = $data; } } We add setters (also getters in case we need them.) class HttpResponse { public $data; public function __construct(array $data) { $this->data = $data; } public function setSuccess($boolean) { $this->data['success'] = $boolean; } public function setContent($content) { $this->data['content'] = $content; } } The array becomes private, to check that only getters, setters and methods are really used externally. true, 'content' => '{someJson:"ok"}' )); $response->setSuccess(false); $response->setContent('{}'); $this->assertEquals(new HttpResponse(array( 'success' => false, 'content' => '{}' )), $response); } } class HttpResponse { private $data; public function __construct(array $data) { $this->setSuccess($data['success']); $this->setContent($data['content']); } public function setSuccess($boolean) { $this->data['success'] = $boolean; } public function setContent($content) { $this->data['content'] = $content; } } Private fields replace the array elements. We can start move logic into methods on the new class. class HttpResponse { private $success; private $content; public function __construct(array $data) { $this->setSuccess($data['success']); $this->setContent($data['content']); } public function setSuccess($boolean) { $this->success = $boolean; } public function setContent($content) { $this->content = $content; } }

August 24, 2011

by Giorgio Sironi

· 11,581 Views

Clojure: partition-by, split-with, group-by, and juxt

Today I ran into a common situation: I needed to split a list into 2 sublists - elements that passed a predicate and elements that failed a predicate. I'm sure I've run into this problem several times, but it's been awhile and I'd forgotten what options were available to me. A quick look at http://clojure.github.com/clojure/ reveals several potential functions: partition-by, split-with, and group-by. partition-by From the docs: Usage: (partition-by f coll) Applies f to each value in coll, splitting it each time f returns a new value. Returns a lazy seq of partitions. Let's assume we have a collection of ints and we want to split them into a list of evens and a list of odds. The following REPL session shows the result of calling partition-by with our list of ints. user=> (partition-by even? [1 2 4 3 5 6]) ((1) (2 4) (3 5) (6)) The partition-by function works as described; unfortunately, it's not exactly what I'm looking for. I need a function that returns ((1 3 5) (2 4 6)). split-with From the docs: Usage: (split-with pred coll) Returns a vector of [(take-while pred coll) (drop-while pred coll)] The split-with function sounds promising, but a quick REPL session shows it's not what we're looking for. user=> (split-with even? [1 2 4 3 5 6]) [() (1 2 4 3 5 6)] As the docs state, the collection is split on the first item that fails the predicate - (even? 1). group-by From the docs: Usage: (group-by f coll) Returns a map of the elements of coll keyed by the result of f on each element. The value at each key will be a vector of the corresponding elements, in the order they appeared in coll. The group-by function works, but it gives us a bit more than we're looking for. user=> (group-by even? [1 2 4 3 5 6]) {false [1 3 5], true [2 4 6]} The result as a map isn't exactly what we desire, but using a bit of destructuring allows us to grab the values we're looking for. user=> (let [{evens true odds false} (group-by even? [1 2 4 3 5 6])] [evens odds]) [[2 4 6] [1 3 5]] The group-by results mixed with destructuring do the trick, but there's another option. juxt From the docs: Usage: (juxt f) (juxt f g) (juxt f g h) (juxt f g h & fs) Alpha - name subject to change. Takes a set of functions and returns a fn that is the juxtaposition of those fns. The returned fn takes a variable number of args, and returns a vector containing the result of applying each fn to the args (left-to-right). ((juxt a b c) x) => [(a x) (b x) (c x)] The first time I ran into juxt I found it a bit intimidating. I couldn't tell you why, but if you feel the same way - don't feel bad. It turns out, juxt is exactly what we're looking for. The following REPL session shows how to combine juxt with filter and remove to produce the desired results. user=> ((juxt filter remove) even? [1 2 4 3 5 6]) [(2 4 6) (1 3 5)] There's one catch to using juxt in this way, the entire list is processed with filter and remove. In general this is acceptable; however, it's something worth considering when writing performance sensitive code. From http://blog.jayfields.com/2011/08/clojure-partition-by-split-with-group.html

August 24, 2011

by Jay Fields

· 13,203 Views

Edge Side Includes with Varnish in 10 minutes

Varnish is a tool built to be an intermediate server in the HTTP chain, not an origin one like Apache or IIS. You can outsource caching, logging, zipping and other filters to Varnish, since they are not the main feature of an HTTP server like Apache. What we'll see today is how to work with Edge Side Includes in Varnish, as a way to compose dynamic pages from independently generated and cached fragments; we won't encounter logging or other features. If you are familiar with PHP, ESI is an (almost) standard for executing include()-like statements on a front end server like Varnish; the proxy is able not only to assembly pages but also to cache them according to different policies: a certain time, for a single user, and so on. Thijs Feryn and Alessandro Nadalin introduced me to Varnish and ESI respectively, for the first time. I recommend you to consider their blogs and talks as additional sources on these topics. Installation The default version of Varnish in Ubuntu 11.04 is instead 2.1, and apparently does not support ESI very much. Installation via packages means adding a public key and a repository to your list of software sources, and install the varnish package via apt-get or an equivalent command. You can install version 3.0.0 via packages, but only in Ubuntu LTS (10.04). A way that always works in these cases is the installation from sources. The linked page will list the package dependencies and give you a sequence of 3-4 commands to seamlessly compile varnish. I used checkinstall instead of make install to get a binary package that I can reuse later: $ sudo checkinstall -D --install=no --fstrans=no [email protected] --reset-uids=yes --nodoc --pkgname=varnish --pkgversion=3.0.0 --pkgrelease=201108231000 --arch=i386 After installation with dpkg, check that varnishd is available and of the right version: [10:18:17][giorgio@Desmond:~]$ varnishd -V varnishd (varnish-3.0.0 revision 3bd5997) Copyright (c) 2006 Verdens Gang AS Copyright (c) 2006-2011 Varnish Software AS Varnish needs minimal configuration: a server to point at. For our tests you can edit /etc/varnish/default.vcl and check (or add) the following: backend default { .host = "127.0.0.1"; .port = "80"; } You can execute ps -A | grep varnishd at any time to see if varnish is already in execution. Execution [09:55:18][giorgio@Desmond:~]$ sudo varnishd -f /etc/varnish/default.vcl -s malloc,1G -T 127.0.0.1:2000 -a 0.0.0.0:8080 storage_malloc: max size 1024 MB. 1 gigabyte of memory is allocated for keeping fragments in RAM. An administrative interface will respond on port 2000, and only be accessible from localhost. http://localhost:8080/ is the exposed HTTP server, and will point to http://localhost:80 as defined in the configuration. Look at man varnishd for more switched and to man vcl for additional explanations on the configuration language. A bit of ESI ESI is a technique for leveraging HTTP cache and at the same time build dynamic pages. The problem with today's pages is that they are highly dynamic: some sections change very often or according to the current user (Welcome, John Doe or the current posts timeline); some sections do not change at all for days (the navigation bar and the layout structure); some sections change in response to external events (the list of incoming messages only when a new message arrives). It would be ideal to set different caching configurations for all the page's fragments. But implementing this strategy in the application code is error-prone and means reinventing the wheel. To use HTTP cache you will be forced to load with Ajax every single fragment of the page, even a single paragraph. With ESI, your application produces only the pieces, and lets an implementor of the Edge Side Include specification like Varnish assemble the whole thing. Example HTML page (very static): Varnish will work on this page: . PHP page (really dynamic, can change at any time): Varnish will work on this page: 2011-08-23. No sign of Varnish interventions, and totally transparent for the client. And sometimes you can also throw away Zend_Layout and similar components to assemble HTML on the PHP side.

August 23, 2011

by Giorgio Sironi

· 25,167 Views · 1 Like

Practical PHP Refactoring: Replace Data Value with Object

One of the rules of simple design is the necessity to minimize the number of moving parts, like classes and methods, as long as the tests are satisfied and we are not accepting duplication or feeling the lack of an explicit concept. Thus, a rule that aids simple design is to use primitive types unless a field has already some behavior attached: we don't create a class for the user's name or the user's password; we just use some strings. As we make progress, however, we must be able to revise our decisions via refactoring: if a field gains some logic, this behavior shouldn't be modelled by methods in the containing class, but by a new object. The code in this new class can be reused, while the containing object will change from case to case and you will end up duplicating the same methods. Transforming a scalar value into an object is the essence of the Replace Data Value with Object refactoring. In most of the cases, a Value Object or a Parameter Object come out as a result: while DDD pursue Value Objects as concepts in the domain layer, this refactoring is more general and can be applied anywhere. For instance, in a project we started introducing Data Transfer Objects to model the data sent by the controller to a Service Layer. Data values in PHP In PHP, all scalar values are by nature data values as they cannot host methods: string, integers, and booleans are proper scalar. arrays are not scalar in the Perl or mathematical sense, but they are still a primitive type. On the borderline, we find some simple objects used as data containers in PHP: ArrayObjects. SplHeap and other SPL data structures. The classes on the borderline may host methods, but the original class is out of reach for modification, and an indirection has to be introduced."Local Extension" Steps Create the new class: it should contain as a private field just the value you want to substitute. The methods you immediately need have to be chosen between a constructor, getters, and setters (where needed). Change the field in the containing class. Update the constructor to also create the new object and populate the field, or accept injection (a rarer case). Update the original getter to delegate to the new one. Update the original setter to delegate to the new one (where present) or to create a new object. Run tests at the functional level; the changes should be propagated to the construction phases, while the external usage should not change very much. Example In the initial state, magic arrays are passed around. It's very easy to build an array where a key is missing or is called incorrectly. newPassword(array( 'userId' => 42, 'oldPassword' => 'gismo', 'newPassword' => 'supersecret', 'repeatNewPassword' => 'supersecret' )); $this->markTestIncomplete('This refactoring is about the introduction of an object; it suffices that the test does not explode.'); } } class UserService { public function newPassword($changePasswordData) { /* it's not interesting to do something here */ } } After the introduction of an ArrayObject extension, a little type safety is ensure and we gained a place to put methods at a little cost. newPassword(new ChangePasswordCommand(array( 'userId' => 42, 'oldPassword' => 'gismo', 'newPassword' => 'supersecret', 'repeatNewPassword' => 'supersecret' ))); $this->markTestIncomplete('This refactoring is about the introduction of an object; it suffices that the test does not explode.'); } } class UserService { public function newPassword(ChangePasswordCommand $changePasswordData) { /* it's not interesting to do something here */ } } class ChangePasswordCommand extends ArrayObject { } We add methods to implement logic on this object; in this case, validation logic; in general cases, any kind of code that should not be duplicated by the different clients. For a stricter implementation, wrap an array or another data structure (scalars, SPL objects) instead of extending ArrayObject as you gain immutability and encapsulation (but this kind of objects need little encapsulation.) class ChangePasswordCommand extends ArrayObject { public function __construct($data) { if (!isset($data['userId'])) { throw new Exception('User id is missing.'); } parent::__construct($data); } public function getPassword() { if ($this['newPassword'] != $this['repeatNewPassword']) { throw new Exception('Password do not match.'); } return $this['newPassword']; } } Being this a refactoring however, this is the less invasive kind of introduction of objects you can make as the client code can still use the ArrayAccess interface and treat the object as a scalar array.

August 15, 2011

by Giorgio Sironi

· 9,941 Views

Serialize only specific class properties to JSON string using JavaScriptSerializer

About one year ago I wrote a blog post about JavaScriptSerializer and the Serialize and Deserialize methods it supports. Note: This blog post has been in draft for sometime now, so I decided to complete it and publish it. There might be situation when you want to serialize to JSON string only specific properties of a given class. You can do that using JavaScriptSerializer in combination with LINQ. Let’s say we have the following class definition public class Customer { public string Name { get; set; } public string Surname { get; set; } public string Email { get; set; } public int Age { get; set; } public bool Drinker { get; set; } public bool Smoker { get; set; } public bool Single { get; set; } } Next, lets create method that will create sample data for our demo private List GetListOfCustomers() { List customers = new List(); customers.Add(new Customer() { Name = "Hajan", Surname = "Selmani", Age = 25, Drinker = false, Smoker = false, Single = false, Email = "[email protected]" }); customers.Add(new Customer() { Name = "John", Surname = "Doe", Age = 29, Drinker = false, Smoker = true, Single = false, Email = "[email protected]" }); customers.Add(new Customer() { Name = "Mark", Surname = "Moris", Age = 34, Drinker = true, Smoker = true, Single = true, Email = "[email protected]" }); return customers; } So, we have three customers with some property values for each of them. Now, lets serialize some of their properties using JavaScriptSerializer. First, you must put the following directive: using System.Web.Script.Serialization; Next, we create list of customers that will get the returned value from GetListOfCustomers method and we create instance of JavaScriptSerializer class List customers = GetListOfCustomers(); JavaScriptSerializer serializer = new JavaScriptSerializer(); Now, lets say we want to serialize as JSON string and retrieve only the Age property data… We do that with only one simple line of code: //this will serialize only the 'Age' property string jsonString = serializer.Serialize(customers.Select(x => x.Age)); The result will be: Nice! Now, what if we want to serialize multiple properties at once, but not all class properties? string jsonStringMultiple = serializer.Serialize(customers.Select(x => new { x.Name, x.Surname, x.Age })); The result will be: You see, the result is an array of objects with the four properties and their corresponding values we have selected using the LINQ query above. You can see that integer and boolean values are without quotes, which is correct way of serialization. Now, you probably saw a difference somewhere? Namely, in the first example where we have selected only one property, there are only the values of the property (no property name), while in the second example we have the property name and it’s corresponding value… Why is it like that? It’s because in the second query, we use new { … } to specify multiple properties in the select statement. Therefore, the anonymous new { … } creates an object of each found item. So, if you are interested to make some more tests, run the following two lines of code: var customers1 = customers.Select(x => x.Name).ToList(); var customers2 = customers.Select(x=> new { x.Name } ).ToList(); and you will obviously see the difference. If we use the new { } way for single property selection, like in the following example string jsonString2 = serializer.Serialize(customers.Select(x => new { x.Age })); the result will be: The complete demo code used for this blog post: List customers = GetListOfCustomers(); JavaScriptSerializer serializer = new JavaScriptSerializer(); //this will serialize only the 'Age' property string jsonString = serializer.Serialize(customers.Select(x => x.Age )); string jsonStringMultiple = serializer.Serialize(customers.Select(x => new { x.Name, x.Surname, x.Age, x.Drinker })); var customers1 = customers.Select(x => x.Name).ToList(); var customers2 = customers.Select(x=> new { x.Name } ).ToList(); string jsonString2 = serializer.Serialize(customers.Select(x => new { x.Age })); You can download the demo project here.

August 10, 2011

by Hajan Selmani

· 32,223 Views

A collection with billions of entries

There are a number of problems with having a large number of records in memory. One way around this is to use direct memory, but this is too low level for most developers. Is there a way to make this more friendly? Limitations of large numbers of objects The overhead per object is between 12 and 16 bytes for 64-bit JVMs. If the object is relatively small, this is significant and could be more than the data itself. The GC pause time increases with the number of objects. Pause times can be around one second per GB of objects. Collections and arrays only support two billion elements Huge collections One way to store more data and still follow object orientated principles is have wrappers for direct ByteBuffers. This can be tedious to write, but very efficient. What would be ideal is to have these wrappers generated automatically. Small JavaBean Example This is an example of JavaBean which would have far more overhead than actual data contained. interface MutableByte { public void setByte(byte b); public byte getByte(); } It is also small enough that I can create billions of these on my machine. This example creates a List with 16 billion elements. final long length = 16_000_000_000L; HugeArrayList hugeList = new HugeArrayBuilder() {{ allocationSize = 4 * 1024 * 1024; capacity = length; }.create(); List list = hugeList; assertEquals(0, list.size()); hugeList.setSize(length); // add a GC to see what the GC times are like. System.gc(); assertEquals(Integer.MAX_VALUE, list.size()); assertEquals(length, hugeList.longSize()); byte b = 0; for (MutableByte mb : list) mb.setByte(b++); b = 0; for (MutableByte mb : list) { byte b2 = mb.getByte(); byte expected = b++; if (b2 != expected) assertEquals(expected, b2); } From start to finish, the heap memory used is as follows. with -verbosegc 0 sec - 3100 KB used [GC 9671K->1520K(370496K), 0.0020330 secs] [Full GC 1520K->1407K(370496K), 0.0063500 secs] 10 sec - 3885 KB used 20 sec - 4428 KB used 30 sec - 4428 KB used ... deleted ... 1380 sec - 4475 KB used 1390 sec - 4476 KB used 1400 sec - 4476 KB used 1410 sec - 4476 KB used The only GC is one triggered explicitly. Without the System.gc(); no GC logs appear. After 20 sec, the increase in memory used is from logging how much memory was used. Conclusion The library is relatively slow. Each get or set takes about 40 ns which really adds up when there are so many calls to make. I plan to work on it so it is much faster. ;) On the upside, it wouldn't be possible to create 16 billion objects with the memory I have, nor could it be put in an ArrayList, so having it a little slow is still better than not working at all. From http://vanillajava.blogspot.com/2011/08/collection-with-billions-of-entries.html

August 10, 2011

by Peter Lawrey

· 17,383 Views

Dissecting the Disruptor: Why it's so fast (part one) - Locks Are Bad

martin fowler has written a really good article describing not only the disruptor , but also how it fits into the architecture at lmax . this gives some of the context that has been missing so far, but the most frequently asked question is still "what is the disruptor?". i'm working up to answering that. i'm currently on question number two: "why is it so fast?". these questions do go hand in hand, however, because i can't talk about why it's fast without saying what it does, and i can't talk about what it is without saying why it is that way. so i'm trapped in a circular dependency. a circular dependency of blogging. to break the dependency, i'm going to answer question one with the simplest answer, and with any luck i'll come back to it in a later post if it still needs explanation: the disruptor is a way to pass information between threads. as a developer, already my alarm bells are going off because the word "thread" was just mentioned, which means this is about concurrency, and concurrency is hard. concurrency 101 imagine two threads are trying to change the same value. case one: thread 1 gets there first: the value changes to "blah" then the value changes to "blahy" when thread 2 gets there. case two: thread 2 gets there first: the value changes to "fluffy" then the value changes to "blah" when thread 1 gets there. case three: thread 1 interrupts thread 2: thread 2 gets the value "fluff" and stores it as myvalue thread 1 goes in and updates value to "blah" then thread 2 wakes up and sets the value to "fluffy". case three is probably the only one which is definitely wrong, unless you think the naive approach to wiki editing is ok ( google code wiki, i'm looking at you...). in the other two cases it's all about intentions and predictability. thread 2 might not care what's in value, the intention might be to append "y" to whatever is in there regardless. in this circumstance, cases one and two are both correct. but if thread 2 only wanted to change "fluff" to "fluffy", then both cases two and three are incorrect. assuming that thread 2 wants to set the value to "fluffy", there are some different approaches to solving the problem. approach one: pessimistic locking (does the "no entry" sign make sense to people who don't drive in britain?) the terms pessimistic and optimistic locking seem to be more commonly used when talking about database reads and writes, but the principal applies to getting a lock on an object. thread 2 grabs a lock on entry as soon as it knows it needs it and stops anything from setting it. then it does its thing, sets the value, and lets everything else carry on. you can imagine this gets quite expensive, with threads hanging around all over the place trying to get hold of objects and being blocked. the more threads you have, the more chance that things are going to grind to a halt. approach two: optimistic locking in this case thread 2 will only lock entry when it needs to write to it. in order to make this work, it needs to check if entry has changed since it first looked at it. if thread 1 came in and changed the value to "blah" after thread 2 had read the value, thread 2 couldn't write "fluffy" to the entry and trample all over the change from thread 1. thread 2 could either re-try (go back, read the value, and append "y" onto the end of the new value), which you would do if thread 2 didn't care what the value it was changing was; or it could throw an exception or return some sort of failed update flag if it was expecting to change "fluff" to "fluffy". an example of this latter case might be if you have two users trying to update a wiki page, and you tell the user on the other end of thread 2 they'll need to load the new changes from thread 1 and then reapply their changes. potential problem: deadlock locking can lead to all sorts of issues, for example deadlock. imagine two threads that need access to two resources to do whatever they need to do: if you've used an over-zealous locking technique, both threads are going to sit there forever waiting for the other one to release its lock on the resource. that's when you reboot windows your computer. definite problem: locks are sloooow... the thing about locks is that they need the operating system to arbitrate the argument. the threads are like siblings squabbling over a toy, and the os kernel is the parent that decides which one gets it. it's like when you run to your dad to tell him your sister has nicked the transformer when you wanted to play with it - he's got bigger things to worry about than you two fighting, and he might finish off loading the dishwasher and putting on the laundry before settling the argument. if you draw attention to yourself with a lock, not only does it take time to get the operating system to arbitrate, the os might decide the cpu has better things to do than servicing your thread. the disruptor paper talks about an experiment we did. the test calls a function incrementing a 64-bit counter in a loop 500 million times. for a single thread with no locking, the test takes 300ms. if you add a lock (and this is for a single thread, no contention, and no additional complexity other than the lock) the test takes 10,000ms. that's, like, two orders of magnitude slower. even more astounding, if you add a second thread (which logic suggests should take maybe half the time of the single thread with a lock) it takes 224,000ms. incrementing a counter 500 million times takes nearly a thousand times longer when you split it over two threads instead of running it on one with no lock. concurrency is hard and locks are bad i'm just touching the surface of the problem, and obviously i'm using very simple examples. but the point is, if your code is meant to work in a multi-threaded environment, your job as a developer just got a lot more difficult: naive code can have unintended consequences. case three above is an example of how things can go horribly wrong if you don't realise you have multiple threads accessing and writing to the same data. selfish code is going to slow your system down. using locks to protect your code from the problem in case three can lead to things like deadlock or simply poor performance. this is why many organisations have some sort of concurrency problems in their interview process (certainly for java interviews). unfortunately it's very easy to learn how to answer the questions without really understanding the problem, or possible solutions to it. how does the disruptor address these issues? for a start, it doesn't use locks. at all. instead, where we need to make sure that operations are thread-safe (specifically, updating the next available sequence number in the case of multiple producers ), we use a cas (compare and swap/set) operation. this is a cpu-level instruction, and in my mind it works a bit like optimistic locking - the cpu goes to update a value, but if the value it's changing it from is not the one it expects, the operation fails because clearly something else got in there first. note this could be two different cores rather than two separate cpus. cas operations are much cheaper than locks because they don't involve the operating system, they go straight to the cpu. but they're not cost-free - in the experiment i mentioned above, where a lock-free thread takes 300ms and a thread with a lock takes 10,000ms, a single thread using cas takes 5,700ms. so it takes less time than using a lock, but more time than a single thread that doesn't worry about contention at all. back to the disruptor - i talked about the claimstrategy when i went over the producers . in the code you'll see two strategies, a singlethreadedstrategy and a multithreadedstrategy. you could argue, why not just use the multi-threaded one with only a single producer? surely it can handle that case? and it can. but the multi-threaded one uses an atomiclong (java's way of providing cas operations), and the single-threaded one uses a simple long with no locks and no cas. this means the single-threaded claim strategy is as fast as possible, given that it knows there is only one producer and therefore no contention on the sequence number. i know what you're thinking: turning one single number into an atomiclong can't possibly have been the only thing that is the secret to the disruptor's speed. and of course, it's not - otherwise this wouldn't be called "why it's so fast (part one )". but this is an important point - there's only one place in the code where multiple threads might be trying to update the same value. only one place in the whole of this complicated data-structure-slash-framework. and that's the secret. remember everything has its own sequence number? if you only have one producer then every sequence number in the system is only ever written to by one thread. that means there is no contention. no need for locks. no need even for cas. the only sequence number that is ever written to by more than one thread is the one on the claimstrategy if there is more than one producer. this is also why each variable in the entry can only be written to by one consumer . it ensures there's no write contention, therefore no need for locks or cas. back to why queues aren't up to the job so you start to see why queues, which may implemented as a ring buffer under the covers, still can't match the performance of the disruptor. the queue, and the basic ring buffer , only has two pointers - one to the front of the queue and one to the end: if more than one producer wants to place something on the queue, the tail pointer will be a point of contention as more than one thing wants to write to it. if there's more than one consumer, then the head pointer is contended, because this is not just a read operation but a write, as the pointer is updated when the item is consumed from the queue. but wait, i hear you cry foul! because we already knew this, so queues are usually single producer and single consumer (or at least they are in all the queue comparisons in our performance tests). there's another thing to bear in mind with queues/buffers. the whole point is to provide a place for things to hang out between producers and consumers, to help buffer bursts of messages from one to the other. this means the buffer is usually full (the producer is out-pacing the consumer) or empty (the consumer is out-pacing the producer). it's rare that the producer and consumer will be so evenly-matched that the buffer has items in it but the producers and consumers are keeping pace with each other. so this is how things really look. an empty queue: ...and a full queue: the queue needs a size so that it can tell the difference between empty and full. or, if it doesn't, it might determine that based on the contents of that entry, in which case reading an entry will require a write to erase it or mark it as consumed. whichever implementation is chosen, there's quite a bit of contention around the tail, head and size variables, or the entry itself if a consume operation also includes a write to remove it. on top of this, these three variables are often in the same cache line , leading to false sharing . so, not only do you have to worry about the producer and the consumer both causing a write to the size variable (or the entry), updating the tail pointer could lead to a cache-miss when the head pointer is updated because they're sat in the same place. i'm going to duck out of going into that in detail because this post is quite long enough as it is. so this is what we mean when we talk about "teasing apart the concerns" or a queue's "conflated concerns". by giving everything its own sequence number and by allowing only one consumer to write to each variable in the entry, the only case the disruptor needs to manage contention is where more than one producer is writing to the ring buffer. in summary the disruptor a number of advantages over traditional approaches: no contention = no locks = it's very fast. having everything track its own sequence number allows multiple producers and multiple consumers to use the same data structure. tracking sequence numbers at each individual place (ring buffer, claim strategy, producers and consumers), plus the magic cache line padding , means no false sharing and no unexpected contention. from http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast.html

July 23, 2011

by Trisha Gee

· 12,687 Views · 1 Like

Testing Entity Validations with a Mock Entity - Roo in Action Corner

In Spring Roo in Action, Chapter 3, I discuss how Roo automatically executes the Bean Validators when persisting a live entity. However, when running unit tests, we don't have a live entity at all, nor do we have a Spring container - so how can we exercise the validation without actually hitting our Roo application and the database? The following post is ancillary material from the upcoming book Spring Roo in Action, by Ken Rimple and Srini Penchikala, with Gordon Dickens. You can purchase the MEAP edition of the book, and participate in the author forum, at www.manning.com/rimple. The answer is that we have to bootstrap the validation framework within the test ourselves. We can use the CourseDataOnDemand class's getNewTransientEntityName method to generate a valid, transient JPA entity. Then, we can: Mock static entity methods, such as findById, to bring back pre-fabricated class instances of your entity Initialize the validation engine, bootstrapping a JSR-303 bean validation framework engine, and perform validation on your entity Set any appropriate properties to apply to a particular test condition Initialize a test instance of the entity validator and assert the appropriate validation results are returned The concept in action... Given a Student entity with the following definition: @RooEntity @RooJavaBean @RooToString public class Student { @NotNull private String emergencyContactInfo; ... } The listing below shows a unit test method that ensures the NotNull validation fires against missing emergency contact information on the Student entity: @Test public void testStudentMissingEmergencyContactValidation() { // setup our test data StudentDataOnDemand dod = new StudentDataOnDemand(); // tell the mock to expect this call Student.findStudent(1L); // tell the mocking API to expect a return from the prior call in the form of // a new student from the test data generator, dod AnnotationDrivenStaticEntityMockingControl.expectReturn( dod.getNewTransientStudent(0)); // put our mock in playback mode AnnotationDrivenStaticEntityMockingControl.playback(); // Setup the validator API in our unit test LocalValidatorFactoryBean validator = new LocalValidatorFactoryBean(); validator.afterPropertiesSet(); // execute the call from the mock, set the emergency contact field // to an invalid value Student student = Student.findStudent(1L); student.setEmergencyContactInfo(null); // execute validation, check for violations Set> violations = validator.validate(student, Default.class); // do we have one? Assert.assertEquals(1, violations.size()); // now, check the constraint violations to check for our specific error ConstraintViolation violation = violations.iterator().next(); // contains the right message? Assert.assertEquals("{javax.validation.constraints.NotNull.message}", violation.getMessageTemplate()); // from the right field? Assert.assertEquals("emergencyContactInfo", violation.getPropertyPath().toString()); } Analysis The test starts with a declaration of a StudentOnDemand object, which we'll use to generate our test data. We'll get into the more advanced uses of the DataOnDemand Framework later in the chapter. For now, keep in mind that we can use this class to create an instance of an Entity, with randomly assigned, valid data. We then require that the test calls the Student.findStudent method, passing it a key of 1L. Next, we'll tell the entity mocking framework that the call should return a new transient Student instance. At this point, we've defined our static mocking behavior, so we'll put the mocking framework into playback mode. Next, we issue the actual Student.findById(1L) call, this time storing the result as the member variable student. This call will trip the mock, which will return a new transient instance. We then set the emergencyContactInfo field to null, so that it becomes invalid, as it is annotated with a @NotNull annotation. Now we are ready to set up our bean validation framework. We create a LocalValidatorFactoryBean instance, which will boot the Bean Validation Framework in the afterPropertiesSet() method, which is defined for any Spring Bean implementing InitializingBean. We must call this method ourselves, because Spring is not involved in our unit test. Now we're ready to run our validation and assert the proper behavior has occurred. We call our validator's validate method, passing it the student instance and the standard Default validation group, which will trigger validation. We'll then check that we only have one validation failure, and that the message template for the error is the same as the one for the @NotNull validation. We also check to ensure that the field that caused the validation was our emergencyContactInfo field. In our answer callback, we can launch the Bean Validation Framework, and execute the validate method against our entity. In this way, we can exercise our bean instance any way we want, and instead of persisting the entity, can perform the validation phase and exit gracefully. Caveats... There are a few things slightly wrong here. First of all, the Data on Demand classes actually use Spring to inject relationships to each other, which I've logged a bug against as ROO-2497. You can override the setup of the data on demand class and manually create the DoD of the referring one, which is fine. They have slated to work on this bug for Roo 1.2, so it should be fixed sometime in the next few months. Also, realize that this is NOT easy to do, compared to writing an integration test. However, this test runs markedly faster. If you have some sophisticated logic that you've attached to a @AssertTrue annotation, this is the way to test it in isolation. From http://www.rimple.com/tech/2011/7/17/testing-entity-validations-with-a-mock-entity-roo-in-action.html

July 20, 2011

by Ken Rimple

· 14,610 Views

Human Readable vs Machine Readable Formats

Most file/serialization formats can be broadly broking into two formats, Human Readable Text and Machine Readble Binary. The Human Readable formats have the advantage of being easily understood by a person reading them. Machine readable formats are easier/faster for a machine to encode/decode. There are formats which attempt to be a little of both. XML, JSon, CSV are examples of these. However these do not achieve close to the performance a binary format can achieve. Myth: Machine Readable Binary is always more compact than a Human Readable Binary can be more compact, however the obscurity of its format makes it difficult to ensure every byte counts. i.e. its usually hard enough getting something work. Making it compact as well is an added complication. However with Human Readable formats, determing how the format can be made more compact is more easily understood. As text: 38 bytes long, [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] As binary: 290 bytes long, ....sr..java.util.ArrayListx.....a....I..sizexp....w.....sr..java.lang.Long; .....#....J..valuexr..java.lang.Number...........xp........sq.~..........sq.~ ..........sq.~..........sq.~..........sq.~..........sq.~..........sq.~ ..........sq.~..........sq.~..........sq.~..........sq.~..........x Even though the first format is more compact, you can immedately see you could drop the [ ] and spaces after the ", " to make it more compact. With the binary formats, it is hard to know where to start. ComparingHumanReadableToBinaryMain.java List longs = new ArrayList(); for(long i=-1;i<=10;i++) longs.add(i); String asText = longs.toString(); byte[] bytes1 = asText.getBytes(); System.out.println("As text: "+ bytes1.length+" bytes long, "+asText); ByteArrayOutputStream baos = new ByteArrayOutputStream(); ObjectOutputStream oos = new ObjectOutputStream(baos); oos.writeObject(longs); oos.close(); byte[] bytes2 = baos.toByteArray(); System.out.println("As binary: "+bytes2.length+" bytes long, " +new String(bytes2, 0).replaceAll("[^\\p{Graph}]", ".")); Myth: Machine Readable Binary is always faster than a Human Readable Its assumed the cost of parsing data in a human readable format always makes it slower, however machine sreadbale formats have to deal with an issue human readbale formats takes for granted, that is byte endianness. For human readable formats the order of digits is fairly obvious, however for machine formats the byte endianess of the data might not match that the natrual byte order of the CPU, leading to a source of overhead (as it has to swap the bytes around) One example of this is using big-endian (e.g. TCP/Network byte order) on a little endian machine e.g. Windows/Linux Intel/AMD. A common class which has this issue is DataInputStream and DataOutputStream which re-arranges the byte order (even if the native byte order matches) For this reason, a fast human readable parse can be as fast or faster. In an earlier article I showed how a Human Readable format could be used to read/write integers 30% faster than using DataInput/DataOuput. Writing human readable data faster than binary. Myth: Using a Human Readable Format makes it easy to read Just using a human readable format doesn't mean it will be easier to read than a machine readable format. Reusing existing tools as much as possible makes human readable format preferrable. However, machine readable formats can come with tools which decode the data and make maintain it easier. If you have data which can only be managed with the use of specialist tools, being human readable is not much advantage. Images are a good example of where a machine readable format is the best option. It is hard to image editing or viewing an image without the need for a specialist tool. A practical human readable format would undoubtably lower the quality of the image. ;) ________/.- ,’_______`-. \ _________\ /`__________\’/ _________ /___’a___a`___\ _________|____,’(_)`.____ | _________\___( ._|_. )___ / __________\___ .__,’___ / __________.-`._______,’-.__ ________,’__,’___`-’___`.__`. _______/____/____V_____\___\_ _____,’____/_____o______\___`.__ ___,’_____|______o_______|_____`. __|_____,’|______o_______|`._____| ___`.__,’_.-\_____o______/-._`.__,’ __________/_`.___o____,’__\_ __.””-._,’_____`._:_,’_____`.,-””._ _/_,-._`_______)___(________’_,-.__\ (_(___`._____,’_____`.______,’___)_) _\_\____\__,’________`.____/.___/_/ On the other hand human readable formats can be almost as obscure. This is a piece of code written in a language I am not worthy of mentioning. ;) Its is descibed as "used to list all of the prime numbers between 1 and R" (!R)@&{&/x!/:2_!x}'!R Conclusion If you are designing a file format, start with a human readable one as its much easier to understand. If this is not compact enough, consider compressing it. If it is not fast enough concider making it a binary format, but make sure it really is faster to use such a format. If you are going to use a binary format make sure you have tools in place to supprot viewing (possibly editing) the data (which you would get for free with a text format) From http://vanillajava.blogspot.com/2011/07/human-readable-vs-machine-readble.html

July 13, 2011

by Peter Lawrey

· 21,601 Views · 5 Likes

Lucene's near-real-time search is fast!

Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time. In some cases, such as modern social/news sites (e.g., LinkedIn, Twitter, Facebook, Stack Overflow, Hacker News, DZone, etc.), fast turnaround time is a hard requirement. Fortunately, it's trivial to use. Just open your initial NRT reader, like this: // w is your IndexWriter IndexReader r = IndexReader.open(w, true); (That's the 3.1+ API; prior to that use w.getReader() instead). The returned reader behaves just like one opened with IndexReader.open: it exposes the point-in-time snapshot of the index as of when it was opened. Wrap it in an IndexSearcher and search away! Once you've made changes to the index, call r.reopen() and you'll get another NRT reader; just be sure to close the old one. What's special about the NRT reader is that it searches uncommitted changes from IndexWriter, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often commit is called), something not previously possible. Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes. The reopen time is in proportion to how many changes you made since last reopening that reader. Lucene's approach is a nice compromise between immediate consistency, where changes are visible after each index change, and eventual consistency, where changes are visible "later" but you don't usually know exactly when. With NRT, your application has controlled consistency: you decide exactly when changes must become visible. Recently there have been some good improvements related to NRT: New default merge policy, TieredMergePolicy, which is able to select more efficient non-contiguous merges, and favors segments with more deletions. NRTCachingDirectory takes load off the IO system by caching small segments in RAM (LUCENE-3092). When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (LUCENE-2900). Segments that are 100% deleted are now dropped instead of inefficiently merged (LUCENE-2010). How fast is NRT search? I created a simple performance test to answer this. I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents. Using this index, the test then reindexes all the documents again, this time at a fixed rate of 1 MB/second plain text. This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as Twitter's recent peak during this year's superbowl (4,064 tweets/second), assuming every tweet is 140 bytes, and assuming Twitter indexed all tweets on a single shard. The test uses updateDocument, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments. In addition, 8 search threads run a fixed TermQuery at the same time. Finally, the NRT reader is reopened once per second. I ran the test on modern hardware, a 24 core machine (dual x5680 Xeon CPUs) with an OCZ Vertex 3 240 GB SSD, using Oracle's 64 bit Java 1.6.0_21 and Linux Fedora 13. I gave Java a 2 GB max heap, and used MMapDirectory. The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec; here's the resulting QPS and NRT reopen delay (milliseconds) over that time: The search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; the graph is an interactive Dygraph, so if you click through above, you can then zoom in to any interesting region by clicking and dragging. You can also apply smoothing by entering the size of the window into the text box in the bottom left part of the graph. Search QPS dropped substantially with time. While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted and thus are still visited but then filtered out, during searching. They are only truly deleted when the segments are merged. TermQuery is a worst-case query; harder queries, such as BooleanQuery, should see less slowdown from deleted, but not reclaimed, documents. Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped. It looks like TieredMergePolicy should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off. Once it finished the QPS recovered somewhat. Note that a real NRT application with deletions would see a more stable QPS since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical. Reopen delay during merging The reopen delay is mostly around 55-60 milliseconds (mean is 57.0), which is very fast (i.e., only 5.7% "duty cycle" of the every 1.0 second reopen rate). There are random single spikes, which is caused by Java running a full GC cycle. However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40). Many small merges (up to a few 100s of MB) were done but don't seem to impact reopen delay. Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching. I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities. It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen. However, if that were the case, NRTCachingDirectory (used by the test) should have prevented it, but didn't. It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've set Linux's swappiness to 0. Yet another possibility is Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's pdflush tunables could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables, even swappiness. Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling Directory implementations to pass appropriate flags to the OS when opening files for merging (LUCENE-2793 and LUCENE-2795). From past testing I know that passing O_DIRECT can prevent merges from evicting hot pages, so it's possible this will fix our slow reopen time as well since it bypasses the write cache. Finally, it's always possible other OSs do a better job managing the buffer cache, and wouldn't see such reopen delays during large merges. This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it. It could be we should simply add our own IO throttling, so we can control net MB/sec read and written by merging activity. This would make a nice addition to Lucene! Except for the slowdown during merging, the performance of NRT is impressive. Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough. While there are exciting ideas to bring true real-time search to Lucene, by directly searching IndexWriter's RAM buffer as Michael Busch has implemented at Twitter with some cool custom extensions to Lucene, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT. NIOFSDirectory vs MMapDirectory Out of curiosity, I ran the exact same test as above, but this time with NIOFSDirectory instead of MMapDirectory: There are some interesting differences. The search QPS is substantially slower -- starting at 107 QPS vs 151, though part of this could easily be from getting different compilation out of hotspot. For some reason TermQuery, in particular, has high variance from one JVM instance to another. The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur). The takeaway message seems clear: on Linux, use MMapDirectory not NIOFSDirectory! Optimizing your NRT turnaround time My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text). You should test specifically for your use-case what reopen rate works best. Generally, the more frequently you reopen the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate. Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements. There are also some simple steps you can take to reduce the turnaround time: Store the index on a fast IO system, ideally a modern SSD. Install a merged segment warmer (see IndexWriter.setMergedSegmentWarmer). This warmer is invoked by IndexWriter to warm up a newly merged segment without blocking the reopen of a new NRT reader. If your application uses Lucene's FieldCache or has its own caches, this is important as otherwise that warming cost will be spent on the first query to hit the new reader. Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices. The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk). If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the Pulsing codec for your id field since this gives faster lookup performance which will make your reopen faster. Use the new NRTCachingDirectory, which buffers small segments in RAM to take load off the IO system (LUCENE-3092). Pass false for applyDeletes when opening an NRT reader, if your application can tolerate seeing deleted doccs from the returned reader. While it's not clear that thread priorities actually work correctly (see this Google Tech Talk), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally lowest should be all searching threads. If the machine becomes saturated, ideally only the search threads should take the hit. Happy near-real-time searching!

July 11, 2011

by Michael Mccandless

· 19,088 Views

SEVERE: Error in xpath:java.lang.RuntimeException: solrconfig.xml missing luceneMatchVersion

One of the things that changed from Solr 1.4.1 to 1.5+ was the introduction of a parameter to tell Solr / Lucene which kind of compability version its index files should be created and used in. Solr now refuses to start if you do not provide this setting (if you’re upgrading a previous installation from 1.4.1 or earlier). The fix isn’t really straight forward, and you’ll probably have to recreate your index files if you’re just arriving at the scene with Solr / Lucene 3.2 and 4.0. Solr 3.0 (1.5) might be able to upgrade the files from the 2.9 version, but if you’re jumping from Lucene 2.9 to 4.0, the easiest solution seems to be to delete the current index and reindex (set up replication, disable replication from the master, query the slave while reindexing the master, etc.. and you’ll have no downtime while doing this!). You’ll need to add a parameter to your solrconfig.xml file as well in the section. LUCENE_CURRENT Other valid values are LUCENE_30, LUCENE_31, LUCENE_32 and LUCENE_40. These values represent specific versions of the index structure, while LUCENE_CURRENT will use the version depending on which particular release of Lucene you’re using. The version format will be upgraded automagically between most releases, so you’ll probably be fine by using LUCENE_CURRENT. If you however are trying to load index files that are more than one version older, you may have to use one of the other values.

July 9, 2011

by Mats Lindh

· 10,287 Views

The era of Object-Document Mapping

The Data Mapper pattern is a mechanism for persistence where the application model and the data source have no dependencies between each other. For example, a group of PHP classes and a relational database may be used together without having the PHP classes extends a base Active Record class, thanks to a Data Mapper like Doctrine 2. But everytime we talk about the Data Mapper pattern, we assume there is a relational database on the other side of the persistence boundary. We always save objects; we always map them to MySQL or Postgres tables; but it's not mandatory. In fact, you can store objects also in NoSQL stores, with no more impedance mismatch that the one already existing with relational databases. Doctrine is expanding with two projects which have the goal of replicating the easy object-relational mapping of Doctrine 2 to other stores. These two projects are the MongoDB and CouchDB Object-Document mappers. Since the basic unit of persistence is the document in both cases, objects (or group of them) are stored as documents. The retrieval process depends on the features of the underlying database: it may consists of simple queries (MongoDB) or just of a unique identifier (CouchDB). How do an ODM work? If you use Doctrine 2, you'll find out that many of its concepts translates well to the other mappers. Instead of Entities, we have Documents in our model: id = "myuser-1234"; $user->username = "test"; $this->dm->persist($user); $this->dm->flush(); $this->dm->clear(); $userNew = $this->dm->find($this->type, $user->id); There is a lot of reuse between the various projects: Doctrine\Common is a dependency for them all, and you will be able to use the same ArrayCollection with any of the mappers. Features There's really all I would need to get started: storage of new documents, update, removal; pretty much what you expect; Unit Of Work pattern for tracking what has been modified and execute a single flush() to save changes; support for collections and both one-to-many or many-to-one associations; support for embedded documents (read Value Objects); both embedding of one or many objects as document fields is possible. This is a striking difference with respect to ORMs: Doctrine 2 does not support Value Objects mapping. cascade of persistence and removal of objects; specification of the mapping metadata via annotations, XML, YAML or PHP (like for Doctrine 2). Projects status The concept of Object-Document Mapper, at least in this name, is pretty unique to PHP and Ruby. Mandango is a PHP alternative and Mongoid a Ruby one, for the MongoDB case. Both projects are in the midst of releasing their first stable version: we are talking about bleeding edge technology here. The MongoDB mapper has been around for more than an year and is now in beta3. The CouchDB has an alpha release, but I found out that the current version shipped via PEAR is broken: if you want to play with it, checkout it from github and initialize its submodules (the standalone Symfony components and Doctrine\Common): git clone https://github.com/doctrine/couchdb-odm.git couchdb_odm cd couchdb_odm git submodule init git submodule update For the MongoDB mapper, the PEAR installation should suffice: pear install pear.doctrine-project.org/DoctrineMongoDBODM-1.0.0BETA3 Conclusions I think I'm going to explore more the CouchDB ODM, since it is a departure from the relational model. MongoDB is still similar to traditional databases in the way queries are satisfied: specifying a set of constraints like in SQL WHERE clauses. I'm more interested instead in approached that skip the Mapper for reporting (like Command-Query Responsibility Separation), and maybe CouchDB could be a good fit. By the way, object-document mapping is an alternative to explore: Data Mappers are all the rage today in PHP (they already were yesterday in other languages.) If we already accept that relational databases aren't always the solution, extending NoSQL with Data Mappers is a natural choice.

July 7, 2011

by Giorgio Sironi

· 20,908 Views