Hadoop and Ambari usually run over Linux, but please don’t fall into thinking of your cluster as a collection of Linux boxes; for stability and efficiency, you need to treat it like an appliance dedicated to Hadoop. Here’s why.
Hadoop was designed to run on Linux, but not because Linux is especially perfect for servers, let alone for special-purpose clusters. Hadoop runs on Linux because commodity servers on Linux can’t be beat in either cost per compute cycle or in the availability of technical expertise; two generations of top programmers have learned to think about computation in the language of Unix.
Linux is a General Purpose OS
Linux (and Unix in general) is the ultimate general-purpose operating system, built around people first, and designed for multiple users running many concurrent programs each its own unpredictable resource usage pattern. The original use case for Unix was practically the opposite of a cluster; it was designed for a shared minicomputer environment with many users running shell-based processing pipelines.
Don’t get me wrong here, Linux is a wonderful thing, but Hadoop computing has little in common with this kind of workload. Typical Linux processing (the same goes for any Unix) is geared towards an unpredictable and constantly changing number of processes with unknown memory and I/O demands, while properly configured Hadoop processing uses a bounded amount of resources very predictably.
Memory, Hardware, and Workload
The two keys to Linux’s graceful handling of the normal chaotic workload are (1) sophisticated virtual memory management and (2) the nimble way that it applies the CPU downtime of processes that are busy with I/O to service the CPU needs of other processes that are free to run.
The first of these, virtual memory management, is all about allowing each process to act as if it owns the entire 32-bit or 64-bit memory space of the processor. Even though there is a finite amount of RAM on a machine, any number of programs can simultaneously enjoy the illusion that they have the entire address space to themselves. It is left to the OS to worry about making sure that each process’s memory is magically and seamlessly present when the process uses it. The OS is able to do this because whenever there isn’t sufficient RAM memory for the current demands, it can swap 4K pages of memory out to disk to clear the way for pages from the current working sets of active processes.
The second critical capability is feasible because things that happen in the CPU are incomparably faster than anything involving I/O. During the five or ten milliseconds that a process is idled waiting for a paged-out disk block to be read, a 2.5 GHZ CPU can execute between 250,000 and 500,000 instructions for other processes. The interval between two successive keyboard interrupts might allow half a billion instructions to execute. Because of this discrepancy in time scales, the OS is able to give many concurrent programs the simultaneous illusion that they have unlimited CPU.
Hadoop Is Not a Typical Workload
The need for these OS capabilities does not vanish with Hadoop, but their centrality is diminished because Hadoop workloads are fundamentally much more predictable than a general Linux workload. To see why, consider core MapReduce processing (or Tez–the principle is the same) which consists of streaming standard size chunks of data from disk (64 or 128 MB is typical), applying relatively simple row-by-row processing to it, and then streaming some or all of the processed data back out, either directly to HDFS or over the network to the reducers.
The enforced regularity of the processing load allows the maximum number of worker processes and their resource allocations to be fixed in advance, as a function of the number of cores, the number of disks, and the amount of physical memory. This means that most of the need for spilling virtual memory to disk is eliminated (every memory access in Linux is virtual–it’s disk-swapping that kills you.) Moreover, with a balanced hardware configuration, the CPU’s and disks can both run at close to capacity continuously. With all cores and all drives running continuously, there is relatively little dead time to backfill by letting other processes run and few idle processes that need running.
Streamed data is also friendly to a number of low-level CPU optimizations. For instance, L1 and L2 caches and the TLB don’t miss much, because locality of reference assures that the CPU isn’t jumping all over the memory space incurring cache misses. Within Hadoop, a host of optimizations take advantage of this regularity. Vector processing is one obvious example.
Some of the most important cluster configurations that relate to these issues are:
- Limiting the maximum number of YARN containers to a fixed function of the amount of RAM, the number of disk drives, the number of hardware cores, and the heap allocation appropriate to the workload.
- Block size is chosen to take advantage of, but not overfill, the memory of the containers.
- Java containers (several kinds) are configured with a heap size that is adequate and stable, to minimize GC overhead.
- Swappiness, i.e., the willingness of the OS to swap a memory page out to disk, is set to the lowest value in order to ensure that pages are only swapped under dire necessity.
- Transparent Huge Pages (THP), which is used in Linux by default to reduce the page table size incurred for large memory spaces, works poorly with HDP processing style, in part because the large scale of the blocks breaks up smooth streaming.
Why Mixing It Up Is Bad
It’s pretty obvious that this happy scenario must fall apart if you allow the random use of your cluster for other activities, for instance as a landing zone for inbound data, for periodic report running, or other tasks. YARN assumes that it can count on the amount of physical memory that you tuned for being present, but random jobs mean that any amount of it can be committed elsewhere. Not only will this produce page faults, which have an inherent cost, but the mystery processes’ page fault delays and other I/O break up the smooth data pipeline behavior and add many disk-seek intervals to what was formerly pure streaming.
Your cluster won’t usually break if you are properly configured with enough swap space to allow for the unexpected, but at best, performance will take a nose dive and it is likely that your admins will have a tough time figuring it out because there’s nothing logically wrong at the user level — it’s just that the balance is off.
Bottom line: both cluster efficiency and even stability can take a big hit from even a modest increase in memory load over what you have configured for. Therefore, treat your cluster like a dedicated appliance — sneaking random processing tasks onto your cluster machines is a poor economy.