Building Multi-Core Ready Java Applications: Part I
This article series will discuss some of the special design, development and testing techniques that must be used to take full advantage of multiprocessor systems.
With the advent of multi-core processors, the existing subject of symmetric multiprocessing has been thrust to the forefront in the development community. As multi-core CPUs make parallel processing systems more prevalent, and more affordable, there is an increasing need for frameworks that help to handle threading, synchronization, deadlock detection, memory management, data pipelining, vertical/horizontal data partitioning, and so on.
This two-part article series will discuss some of the special design, development and testing techniques that must be used to take full advantage of multiprocessor systems. It examines some frameworks available to make Java-based parallel programming easier and even transparent in some cases.
Symmetric Multiprocessing Overview
Symmetric multiprocessing (SMP) is a multiprocessor computer architecture where two or more identical processors are connected to a single shared main memory. Most common multiprocessor systems today use an SMP architecture, where the system allows any processor to work on any task no matter where the data for that task is located in memory. In fact, all system resources are available to all processors, and hence all applications, equally (see Figure 1). With proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently.
Figure 1 - With symmetric multiprocessing, all CPUs have equal access to all system resources.
Although SMP discussions often center on the terms processor or CPU, the same rules apply to multiple cores within a single physical CPU. For the most part, each core within a multi-core CPU is an individual processor in its own right. Although some operating systems do identify and treat cores differently than separate physical CPUs, the differences are minor and generally don’t affect the way the system behaves. For instance, the Solaris 10 OS tries to assign threads to cores so they are evenly distributed across physical CPU dies. This is done mainly to help distribute heat more effectively.
SMP-Enabled Systems and Applications
The SMP architecture is very common in today’s typical computer systems, and hence is supported on a wide variety of operating systems. Today, you’ll find SMP support in all versions of Windows since Windows NT, Solaris, Linux, Mac OS X, OpenVMS, OS/2, VxWorks, and other less-common OS's. Because of its widespread use, you should be aware of its implications to application development.
Alternatives to SMP
SMP is not the only multiprocessor architecture available. For instance, there are non-uniform memory access (NUMA), asymmetric multi-processing (ASMP), and clustered computing. Let’s examine these architectures briefly here:
- Non-Uniform Memory Access (NUMA): This architecture gives memory-access priority to a subset of processors within a system. The result is better than usual memory access performance because access is limited. By creating banks of memory dedicated to specific processors, data access performance can be increased by a factor roughly equivalent to the total number of processors in the system with separate memory banks (see Figure 2). Special hardware supports the access of data by a single process across different memory banks. One downside to this technique is the added cost of dedicated hardware to make it work efficiently.
- Asymmetric Multiprocessing (ASMP): With ASMP, some tasks are assigned to certain processors. This architecture is useful in real-time systems, where processor availability must be guaranteed for some applications when critical events occur. This helps to lower the maximum latency for event-to-process dispatching.
ASMP also allows a single processor to be dedicated to low-level tasks such as interrupt handling. An artifact of this architecture can be seen in both Solaris and some versions of the Linux kernel, where processors can be grouped in sets, and even shielded from interrupts. Again, this technique is useful for real-time systems, and applications where low-latency response is desired. However, asymmetric processing is most prevalent in computer subsystems, such as with graphics cards, specialized encryption hardware, and other application specific integrated circuits (ASIC).
- Clustered Multiprocessing: This technique involves a group of interconnected computers work closely together to behave, in many respects, as a single computer system. An application started on one node in the cluster is able to use the processing power and resources of other nodes in the cluster, as available. One common use of this technique is in the distributed software compilation of large software systems. Combining clustering with high-availability means that application processes can continue to function even when subsets of clustered nodes fail entirely.
Figure 2 - NUMA architectures allow for dedicated memory access with varying degrees of improved performance, as indicated by the boldness of the interconnecting lines in the diagram above.
Parallel Computing Overview
Regardless of the underlying architecture, multi-processor systems require you to use special software development techniques such asconcurrency and parallel computing. Parallel computing is often confused with concurrency, where multiple individual tasks are executed simultaneously. An example of concurrency is the execution of more than one application (or multiple tasks) at the same time on a multi-processor system. Concurrency can also involve the use of multiple running threads within a single application. However, this approach only helps applications that perform many tasks simultaneously, such as Web and application servers.
For other applications, you need to figure out how to breakup the execution of a single task into pieces that can be executed in parallel, and then combined, where the result is an overall speed up in task processing. Parallel computing is more involved than simply creating multiple threads in your application. Parallel computing is the execution of one task on multiple processors (or multiple processor cores) at the same time, where both the processing and the results are highly coordinated. This allows you to more effectively take advantage of today’s multi-core systems to achieve the highest throughput and performance possible.
Additionally, with parallel programming, user requests and other tasks can be load-balanced across multiple processors, and other concepts such as partitioning and pipeline parallelism can be implemented. However, to be truly effective, you need to properly break down your application’s work into as many parallel tasks as possible. This requires careful design, coding, and testing, as issues related with parallel programming can be extremely difficult to debug. Let’s take a look at some of these challenges now.
Implicit versus Explicit Parallelism
There are different approaches to parallel programming. One approach is data parallelism, where the data to be processed is distributed by a controlling application across multiple data processing clients or node computers. With this approach, each client performs the same processing; it’s the data that’s distributed. Another, related, approach is called Map Reduce parallelism, where data and processing is distributed across large numbers of nodes within a cluster of potentially unreliable computing resources.
Another approach is explicit parallelism, where you must indicate how an application or function is to be partitioned. Many factors and techniques impact the performance of parallel programming, especially load balancing, which attempt to keep all processors busy by moving tasks from heavily loaded processors to less loaded ones.
Finally, we have implicit parallelism, where the system partitions the problem and allocates tasks to processors automatically. The system may depend on compilers or other software to perform this parallelism. An example is Open Multi-Processing (OpenMP), which is an API that supports multi-platform, shared memory multiprocessing. OpenMP consists of compilers, libraries, and other runtime features that enable scalable multiprocessing.
The first three approaches require you to be aware of the parallel nature of the application, and to write code accordingly. This can be come cumbersome and inefficient as this is done repeatedly for each multi-core aware application developed. Fortunately, tools and frameworks are available to help you. However, the process is still not without itschallenges.
Parallel Programming Challenges
To effectively take advantage of today’s multi-core systems, many parallel programming challenges must be overcome, beginning with the issue of determining how to break apart tasks to run in parallel. Simply creating application threads to do course-grained processing may not scale well enough, or be tunable to the varying numbers of cores that may be available to your application in production. Additionally, with multi-threaded programming, issues such as resource locking, deadlock detection, inter-thread messaging and data sharing, and debugging are very tricky to master.
Effective parallel programming requires tools to help breakdown even a single thread’s work into tasks that can be executed in parallel, while resolving thread deadlock issues and resource locking for you. It also requires an infrastructure that dynamically scales according to the number of processorsor cores available, leveraging various parallel processing algorithms such as pipeline parallelism, dynamic load balancing, and vertical and horizontal partitioning. This infrastructure should use standard techniques, such as JMX in Java, to accurately measure and tune the system for the best performance on multiprocessor machines.
Most of all, this infrastructure should be reusable across all of the parallel programming problems you tackle without requiring you to rewrite multi-threaded application control code each time. Before we get into parallel programming tools and frameworks available today, let’s look at some of the required features these frameworks should implement.
Typical von Neumann programming involves serial tasks; an event occurs, that event is processed, and some output is produced. The results may be applied as input to another task that performs a subsequent operation, or step, in an overall process. For example, let’s examine a system that needs to bulk-load data into a database. First, each input file needs to be read into memory and parsed. Next, the data is used as input to generate acomma-separated value (CSV) file. Finally, the CSV file is provided to the database bulk loader application. This serial process is shown in Figure 3.
Figure 3 - With serial processing, each task is completed before the next one is started. In many cases, the results from one or more tasks are used as input to subsequent tasks.
Here, you can see that as one task ends, another begins. None of these tasks are being performed in parallel; hence the time to completethe overall job is the sum of the individual tasks. However, since the first two tasks have no dependency on one another, they can be executed in parallel (as shown in Figure 4). Even on a system with one processor, some benefit will be observed, as the CPU will be available to parse portions of one file as portions of the other file are being read. This results in a reduction in time for the overall job to complete.
Figure 4 - Simple parallelism of tasks with no dependencies can reduce overall processing time.
Although this is an improvement over strict serial processing, this approach can be improved further. For example, as each file is read and parsed, that resulting data sits in memory until both files are completely processed. On a multiprocessor system, it’s much more efficient to begin creating the CSV file once the data begins to be read from each file. The processing from these three tasks will then overlap in time, at least partially. Further, why wait until the CSV file is finalized to call the bulk loader? This data can be broken up into smaller CSV files, or streamed to the bulk loader (if supported), to gain even more parallelism (see Figure 5). The resulting task graph is formally described as pipeline parallelism, sometimes called pipelining, and results in a further reduction of job processing time.
Figure 5 - Overlapping the tasks in a job,where even partial output from one task is made immediately available to othertasks, results in a large reduction in overall processing time. This is calledpipeline parallelism.
A pipeline is a series of related steps, where each step depends upon the others. As shown above, many of the steps in a typical pipeline can execute in parallel. With pipeline parallelism, even when one step in the pipeline depends on data from another step, both steps can execute in parallel, at least partially, if the data is streamed between them while the data is still being generated from the first step. The data dependency between subtasks is often called a dataflow, and a dataflow graph represents the interaction of all subtasks within a task.
Pipeline parallelism is an efficient way to ensure that all available processors are used as often as possible. With properly coordinated, overlapping, tasks and efficient inter-task communication, overall system throughput and performance is increased as tasks and processors are added to the system. However, the overall parallelism can be increased even more. In this example, only the individual tasks have been parallelized. Through a technique called partitioning, you can take each single task and break it down further.
Horizontal and Vertical Partitioning
A problem can be partitioned based on domain decomposition, functional decomposition, or a combination. Functional decomposition refers broadly to the process of resolving a functional relationship into its constituent parts in such a way that the original function can be reconstructed from those parts by function composition. In general, this process of decomposition is undertaken either for the purpose of gaining insight into the identity of the constituent components, or for the purpose of invoking those components in parallel to one another. The steps taken in the previous section to gain parallelism are an example of horizontal partitioning.
Vertical partitioning, also called data partitioning, allows data in multiple tables, multiple rows within a table, and even columns within those rows, to be read or written to simultaneously. This combination further increases system efficiency and performance. Of course, this power isn’t limited to only database processing. Horizontal and vertical partitioning can be applied to file processing, Web service calls, and even legacy systems integration. The resulting multi-dimensional parallelism helps to get the best performance from a distributed system made up of multiprocessor and multi-core systems.
Java-based Parallel Computing
Fortunately for developers, Java provides a first-class abstraction to threading that works regardless of the OS the application runs on. Unlike with C++, it’s up to the Java virtual machine (JVM), not the developer, to create low-level threads according to the host OS. Therefore, concurrency in Java comes naturally; simply create a class, extend Thread or implement Runnable, call the thread’s start method, and you achieve basic parallelism.
To be accurate this scenario describes Java concurrency. Again, don’t confuse parallel computing with concurrency. Some applications require you to break up the execution of a single task into pieces that can be executed in parallel, and then combined, where the result is an overall speed up in single-task processing. The result is, in effect, the execution of a single task by multiple processors.
The use of Java concurrency is a start; but to achieve the level of parallelism and pipelining described in the previous sections, you’ll need to do a lot more. Fortunately, there are frameworks and tools available to help you. Two examples are Groovy Parallel, and Pervasive DataRush. Groovy is a separate language that works well with Java, but since it lies on the outer fringes of typical Java development, we won’t go deep into Groovy Parallel extensions.
Part 2 of this article will focus on a framework from Pervasive Software, calledPervasive DataRush™,which reduces the complexity of parallelprogramming. The issues around parallel computing, as applied to theproblem of loading large amounts of text-based data into an RDBMS, willbe introduced, and Pervasive DataRush will be explored as a way tosolve them.