Data Engineering Resources

The Latest Data Engineering Topics

How-To: Setup Development Environment for Hadoop MapReduce

This post is intended for folks who are looking out for a quick start on developing a basic Hadoop MapReduce application. We will see how to set up a basic MR application for WordCount using Java, Maven and Eclipse and run a basic MR program in local mode , which is easy for debugging at an early stage. Assuming JDK 1.6+ is already installed and Eclipse has a setup for Maven plugin and download from default maven repository is not restriced. Problem Statement : To count the occurrence of each word appearing in an input file using MapReduce. Step 1 : Adding Dependency Create a maven project in eclipse and use following code in your pom.xml. 4.0.0 com.saurzcode.hadoop MapReduce 0.0.1-SNAPSHOT jar org.apache.hadoop hadoop-client 2.2.0 Upon saving it should download all required dependencies for running a basic Hadoop MapReduce program. Step 2 : Mapper Program Map step involves tokenizing the file, traversing the words, and emitting a count of one for each word that is found. Our mapper class should extend Mapper class and override it’s map method. When this method is called the value parameter of the method will contain a chunk of the lines of file to be processed and the output parameter is used to emit word instances. In real world clustered setup, this code will run on multiple nodes which will be consumed by set of reducers to process further. public class WordCountMapper extends Mapper { private final IntWritable ONE = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context) throws IOException, InterruptedException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while(tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); context.write(word, ONE); } } } Step 3 : Reducer Program Our reducer extends the Reducer class and implement logic to sum up each occurrence of word token received from mappers.Output from Reducers will go to the output folder as a text file ( default or as configured in Driver program for Output format) named as part-r-00000 along with a _SUCCESS file. public class WordCountReducer extends Reducer { public void reduce(Text text, Iterable values, Context context) throws IOException, InterruptedException { int sum = 0; for (IntWritable value : values) { sum += value.get(); } context.write(text, new IntWritable(sum)); } } Step 4 : Driver Program Our driver program will configure the job by supplying the map and reduce program we just wrote along with various input , output parameters. public class WordCount { public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException { Path inputPath = new Path(args[0]); Path outputDir = new Path(args[1]); // Create configuration Configuration conf = new Configuration(true); // Create job Job job = new Job(conf, "WordCount"); job.setJarByClass(WordCountMapper.class); // Setup MapReduce job.setMapperClass(WordCountMapper.class); job.setReducerClass(WordCountReducer.class); job.setNumReduceTasks(1); // Specify key / value job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); // Input FileInputFormat.addInputPath(job, inputPath); job.setInputFormatClass(TextInputFormat.class); // Output FileOutputFormat.setOutputPath(job, outputDir); job.setOutputFormatClass(TextOutputFormat.class); // Delete output if exists FileSystem hdfs = FileSystem.get(conf); if (hdfs.exists(outputDir)) hdfs.delete(outputDir, true); // Execute job int code = job.waitForCompletion(true) ? 0 : 1; System.exit(code); } } That’s It !! We are all set to execute our first MapReduce Program in eclipse in local mode. Let’s assume there is an input text file called input.txt in folder input which contains following text : foo bar is foo count count foo for saurzcode Expected output : foo 3 bar 1 is 1 count 2 for 1 saurzcode 1 Let’s run this program in eclipse as Java Application :- We need to give path to input and output folder/file to the program as argument.Also, note output folder shouldn’t exist before running this program else program will fail. java com.saurzcode.mapreduce.WordCount input/inputfile.txt output If this program runs successfully emitting set of lines while it is executing mappers and reducers, we should see a output folder and with following files : output/ _SUCCESS part-r-00000

January 30, 2015

by Saurabh Chhajed

· 13,076 Views · 1 Like

Why Customers Choose Datical DB to Automate Database Deployments

Over the past year, Datical has had amazing success with our flagship product, Datical DB. We’ve seen multiple visionary, sector-leading companies select Datical DB to drive their Application Schema changes. Now that the number has grown rapidly over the past year, we can begin to see patterns in why customers choose Datical DB. One of them turns out to be pretty emblematic of our other customers. So, let’s examine the reasons why they chose to adopt Datical DB. Customer Facing Applications are the Front Door When your competitor is only a mobile screen swipe away, this Datical customer focuses on brand reputation and customer satisfaction. They know that application uptime and fast delivery are key to customer retention and account expansion. All applications have a database backend. Though, when something goes wrong with the database, it is not apparent to the consumer. The consumer just knows that the app isn’t working. An unresponsive mobile app or website is often enough to make the customer take their business elsewhere. Customer stickiness due to the hassle of changing providers continues to lessen as the cost to change continues to drop. Therefore, immediate and fast access is a must have for today’s companies. Remember: Consumer facing applications don’t have business hours. They must ALWAYS be open. Cross Team Collaboration This Datical customer had difficulty in determining who made what change to the database. Moreover, answering which database and why was near impossible. All of the changes were stored in a single document. That solution is not multi-tenant, meaning that the team had to distribute the document to share information. Moreover, there was a single point of failure in the tracking process. Too often, people were required to visit the datacenter for days at a time during application changes. There was simply no method to notify team members of changes, when they occurred, and the change impact. This lack of communication led to almost 80% of all database change requests being rejected. There was clearly a need to increase communication of changes and increase the requested changes’ quality. Increase Staff Productivity With over 70% of the Database team’s time spent on managing change due to application requirements, the Database team was taxed in meeting other demands. Needs to improve scalability and reliability of the database servers became a lower priority as the DBAs struggled to keep up with change requests. Furthermore, development teams spent almost 10% of their time managing these change requests, including reworks of failed requests. By eliminating a manual, time-consuming process, this customer can now focus resources on addressing other needs such as managing continuity during server failure, better allocation of server resources, and certainly scalability concerns. Go Faster With the adoption of IBM UrbanCode Deploy, this customer quickly streamlined their deployments with all but the database automated. With all other components automated, the database changes required by application changes was clearly the weak link in the deployment chain. Truly, database automation is the last mile necessary to realize the promise of Agile, Continuous Delivery, and DevOps. Until the customer was able to apply Agile and Continuous Delivery to database changes, the entire application stack was, in effect, still using out-of-date development and deployment methods. To see the complete benefits of Agile and Continuous Delivery, automation throughout the entire stack, including the database, was absolutely necessary. Leverage Existing Infrastructure and Processes With out of the box integration with UrbanCode Deploy, Datical DB did not require an independent separate server. Furthermore, with Datical DB’s ability to utilize the customer’s existing source code control repository, the implementation cycle was shortened significantly. Too often, customers are asked by vendors to spend money on more server resources or make strange, unnatural changes to their network security to support potential solutions. By utilizing existing infrastructure and processes, Datical DB was able to deliver to this customer lightening quick ROI. We measure ROI in days and weeks, not months and years. Please join us for a webcast next Wednesday, February 4th, from 12:00 – 1:00 pm EST, as we discuss these customer benefits in detail and show how Datical DB integrates seamlessly with IBM UrbanCode Deploy.

January 30, 2015

by Robert Reeves

· 8,387 Views

Andrews Curves

Andrews curves are a method for visualizing multidimensional data by mapping each observation onto a function. This function is defined as It has been shown the Andrews curves are able to preserve means, distance (up to a constant) and variances. Which means that Andrews curves that are represented by functions close together suggest that the corresponding data points will also be close together. Now, we will demonstrate the effectiveness of the Andrew curves on the iris dataset (which we already used here). Let's create a function to compute the values of the functions give a single sample: import numpy as np def andrew_curve4(x,theta):# iris has 4 four dimensions base_functions =[lambda x : x[0]/np.sqrt(2.),lambda x : x[1]*np.sin(theta),lambda x : x[2]*np.cos(theta),lambda x : x[3]*np.sin(2.*theta)] curve = np.zeros(len(theta))for f in base_functions: curve = curve + f(x)return curve At this point we can load the dataset and plot the curves for a subset of samples: samples = np.loadtxt('iris.csv', usecols=[0,1,2,3], delimiter=',')#samples = samples - np.mean(samples)#samples = samples / np.std(samples) classes = np.loadtxt('iris.csv', usecols=[4], delimiter=',',dtype=np.str) theta = np.linspace(-np.pi,np.pi,100)import pylab as pl for s in samples[:20]:# setosa pl.plot(theta, andrew_curve4(s,theta),'r')for s in samples[50:70]:# versicolor pl.plot(theta, andrew_curve4(s,theta),'b')for s in samples[100:120]:# virginica pl.plot(theta, andrew_curve4(s,theta),'g') pl.xlim(-np.pi,np.pi) pl.show() In the plot above, the each color used represents a class and we can easily note that the lines that represent samples from the same class have similar curves.

January 29, 2015

by Giuseppe Vettigli

· 14,024 Views · 6 Likes

Bulk Data Insertion into Oracle Database in C#

Bulk insertion of data into database table is a big overhead in application development.

January 28, 2015

by Ayobami Adewole

· 44,772 Views

Code Coverage for Embedded Target with Eclipse, gcc and gcov

The great thing with open source tools like Eclipse and GNU (gcc, gdb) is that there is a wealth of excellent tools: one thing I had in mind to explore for a while is how to generate code coverage of my embedded application. Yes, GNU and Eclipse come with code profiling and code coverage tools, all for free! The only downside seems to be that these tools seem to be rarely used for embedded targets. Maybe that knowledge is not widely available? So here is my attempt to change this :-). Or: How cool is it to see in Eclipse how many times a line in my sources has been executed? Line Coverage in Eclipse And best of all, it does not stop here…. Coverage with Eclipse To see how much percentage of my files and functions are covered? gcov in Eclipse Or even to show the data with charts? Coverage Bar Graph View Outline In this tutorial I’m using a Freescale FRDM-K64F board: this board has ARM Cortex-M4F on it, with 1 MByte FLASH and 256 KByte of RAM. The approach used in this tutorial can be used with any embedded target, as long there is enough RAM to store the coverage data on the target. I’m using Eclipse Kepler with the ARM Launchpad GNU tools (q3 2014 release), but with small modifications any Eclipse version or GNU toolchain could be used. To generate the Code Coverage information, I’m using gcov. Freescale FRDM-K64F Board Generating Code Coverage Information with gcov gcov is an open source program which can generate code coverage information. It tells me how often each line of a program is executed. This is important for testing, as that way I can know which parts of my application actually has been executed by the testing procedures. Gcov can be used as well for profiling, but in this post I will use it to generate coverage information only. The general flow to generate code coverage is: Instrument code: Compile the application files with a special option. This will add (hidden) code and hooks which records how many times a piece of code is executed. Generate Instrumentation Information: as part of the previous steps, the compiler generates basic block and line information. This information is stored on the host as *.gcno (Gnu Coverage Notes Object?) files. Run the application: While the application is running on the target, the instrumented code will record how many the lines or blocks in the application are executed. This information is stored on the target (in RAM). Dump the recorded information: At application exit (or at any time), the recorded information needs to be stored and sent to the host. By default gcov stores information in files. As a file system might not be alway available, other methods can be used (serial connection, USB, ftp, …) to send and store the information. In this tutorial I show how the debugger can be used for this. The information is stored as *.gcda (Gnu Coverage Data Analysis?) files. Generate the reports and visualize them with gcov. General gcov Flow gcc does the instrumentation and provides the library for code coverage, while gcov is the utility to analyze the generated data. Coverage: Compiler and Linker Options To generate the *.gcno files, the following option has to be added for each file which should generate coverage information: -fprofile-arcs -ftest-coverage :idea: There is as well the ‘–coverage’ option (which is a shortcut option) which can be used both for the compiler and linker. But I prefer the ‘full’ options so I know what is behind the options. -fprofile-arcs Compiler Option The option -fprofile-arcs adds code to the program flow to so execution of source code lines are counted. It does with instrumenting the program flow arcs. From https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html: -fprofile-arcs Add code so that program flow arcs are instrumented. During execution the program records how many times each branch and call is executed and how many times it is taken or returns. When the compiled program exits it saves this data to a file called auxname.gcda for each source file. The data may be used for profile-directed optimizations (-fbranch-probabilities), or for test coverage analysis (-ftest-coverage). Each object file’s auxname is generated from the name of the output file, if explicitly specified and it is not the final executable, otherwise it is the basename of the source file. In both cases any suffix is removed (e.g. foo.gcda for input file dir/foo.c, or dir/foo.gcda for output file specified as -o dir/foo.o). See Cross-profiling. If you are not familiar with compiler technology or graph theory: An ‘Arc‘ (alternatively ‘edge’ or ‘branch’) is a directed link between a pair ‘Basic Blocks‘. A Basic is a sequence of code which has no branching in it (it is executed in a single sequence). For example if you have the following code: k = 0; if (i==10) { i += j; j++; } else { foo(); } bar(); Then this consists of the following four basic blocks: Basic Blocks The ‘Arcs’ are the directed edges (arrows) of the control flow. It is important to understand that not every line of the source gets instrumented, but only the arcs: This means that the instrumentation overhead (code size and data) depends how ‘complicated’ the program flow is, and not how many lines the source file has. However, there is an important aspect to know about gcov: it provides ‘condition coverage‘ if a full expression evaluates to TRUE or FALSE. Consider the following case: if (i==0 || j>=20) { In other words: I get coverage how many times the ‘if’ has been executed, but *not* how many times ‘i==0′ or ‘j>=20′ (which would be ‘decision coverage‘, which is not provided here). See http://www.bullseye.com/coverage.html for all the details. -ftest-coverage Compiler Option The second option for the compiler is -ftest-coverage (from https://gcc.gnu.org/onlinedocs/gcc-3.4.5/gcc/Debugging-Options.html): -ftest-coverage Produce a notes file that the gcov code-coverage utility (see gcov—a Test Coverage Program) can use to show program coverage. Each source file’s note file is called auxname.gcno. Refer to the -fprofile-arcs option above for a description of auxname and instructions on how to generate test coverage data. Coverage data will match the source files more closely, if you do not optimize. So this option generates the *.gcno file for each source file I decided to instrument: gcno file generated This file is needed later to visualize the data with gcov. More about this later. Adding Compiler Options So with this knowledge, I need to add -fprofile-arcs -ftest-coverage as compiler option to every file I want to profile. It is not necessary profile the full application: to save ROM and RAM and resources, I can add this option only to the files needed. Actually as a starter, I recommend to instrument a single source file only at the beginning. For this I select the properties (context menu) of my file Test.c I add the options in ‘other compiler flags': Coverage Added to Compilation File -fprofile-arcs Linker Option Profiling not only needs a compiler option: I need to tell the linker that it needs to link with the profiler library. For this I add -fprofile-arcs to the linker options: -fprofile-arcs Linker Option Coverage Stubs Depending on your library settings, you might now get a lot of unresolved symbol linker errors. This is because by default the profiling library assumes to write the profiling information to a file system. However, most file systems do *not* have a file system. To overcome this, I add a stubs for all the needed functions. I have them added with a file to my project (see latest version of that file on GitHub): /* * coverage_stubs.c * * These stubs are needed to generate coverage from an embedded target. */ #include #include #include #include #include #include #include "UTIL1.h" #include "coverage_stubs.h" /* prototype */ void gcov_exit(void); /* call the coverage initializers if not done by startup code */ void static_init(void) { void (**p)(void); extern uint32_t __init_array_start, __init_array_end; /* linker defined symbols, array of function pointers */ uint32_t beg = (uint32_t)&__init_array_start; uint32_t end = (uint32_t)&__init_array_end; while(begst_mode = S_IFCHR; return 0; } int _getpid(void) { return 1; } int _isatty(int file) { switch (file) { case STDOUT_FILENO: case STDERR_FILENO: case STDIN_FILENO: return 1; default: errno = EBADF; return 0; } } int _kill(int pid, int sig) { (void)pid; (void)sig; errno = EINVAL; return (-1); } int _lseek(int file, int ptr, int dir) { (void)file; (void)ptr; (void)dir; return 0; /* return offset in file */ } #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wreturn-type" __attribute__((naked)) static unsigned int get_stackpointer(void) { __asm volatile ( "mrs r0, msp \r\n" "bx lr \r\n" ); } #pragma GCC diagnostic pop void *_sbrk(int incr) { extern char __HeapLimit; /* Defined by the linker */ static char *heap_end = 0; char *prev_heap_end; char *stack; if (heap_end==0) { heap_end = &__HeapLimit; } prev_heap_end = heap_end; stack = (char*)get_stackpointer(); if (heap_end+incr > stack) { _write (STDERR_FILENO, "Heap and stack collision\n", 25); errno = ENOMEM; return (void *)-1; } heap_end += incr; return (void *)prev_heap_end; } int _read(int file, char *ptr, int len) { (void)file; (void)ptr; (void)len; return 0; /* zero means end of file */ } :idea: In this code I’m using the UTIL1 (Utility) Processor Expert component, available on SourceForge. If you do not want/need this, you can remove the lines with UTIL1. - Coverage Stubs File in Project Coverage Constructors There is one important thing to mention: the coverage data structures need to be initialized, similar to constructors for C++. Depending on your startup code, this might *not* be done automatically. Check your linker .map file for some _GLOBAL__ symbols: .text._GLOBAL__sub_I_65535_0_TEST_Test 0x0000395c 0x10 ./Sources/Test.o Such a symbol should exist for every source file which has been instrumented with coverage information. These are functions which need to be called as part of the startup code. Set a breakpoint in your code at the given address to check if it gets called. If not, you need to call it yourself. :!: Typically I use the linker option ‘-nostartfiles’), and I have my startup code. In that case, these constructors are not called by default, so I need to do myself. See http://stackoverflow.com/questions/6343348/global-constructor-call-not-in-init-array-section In my linker file I have this: .init_array : { PROVIDE_HIDDEN (__init_array_start = .); KEEP (*(SORT(.init_array.*))) KEEP (*(.init_array*)) PROVIDE_HIDDEN (__init_array_end = .); } > m_text This means that there is a list of constructor function pointers put together between __init_array_start and __init_array_end. So all what I need is to iterate through this array and call the function pointers: /* call the coverage initializers if not done by startup code */ void static_init(void) { void (**p)(void); extern uint32_t __init_array_start, __init_array_end; /* linker defined symbols, array of function pointers */ uint32_t beg = (uint32_t)&__init_array_start; uint32_t end = (uint32_t)&__init_array_end; while(beg stack) { _write (STDERR_FILENO, "Heap and stack collision\n", 25); errno = ENOMEM; return (void *)-1; } heap_end += incr; return (void *)prev_heap_end; } :!: It might be that several kBytes of heap are needed. So if you are running in a memory constraint system, be sure that you have enough RAM available. The above implementation assumes that I have space between my heap end and the stack area. :!: If your memory mapping/linker file is different, of course you will need to change that _sbrk() implementation. Compiling and Building Now the application should compile and link without errors.Check that the .gcno files are generated: :idea: You might need to refresh the folder in Eclipse. - .gcno files generated In the next steps I’m showing how to get the coverage data as *.gcda files to the host using gdb. Using Debugger to get the Coverage Data The coverage data gets dumped when _exit() gets called by the application. Alternatively I could call gcov_exit() or __gcov_flush() any time. What it then does is Open the *.gcda file with _open() for every instrumented source file. Write the data to the file with _write(). So I can set a breakpoint in the debugger to both _open() and _write() and have all the data I need :-) With _open() I get the file name, and I store it in a global pointer so I can reference it in _write(): static const unsigned char *fileName; /* file name used for _open() */ int _open (const char *ptr, int mode) { (void)mode; fileName = (const unsigned char*)ptr; /* store file name for _write() */ return 0; } In _write() I get a pointer to the data and the length of the data. Here I can dump the data to a file using the gdb command: dump binary memory I could use a calculator to calculate the memory dump range, but it is much easier if I let the program generate the command line for gdb :-): int _write(int file, char *ptr, int len) { static unsigned char gdb_cmd[128]; /* command line which can be used for gdb */ (void)file; /* construct gdb command string */ UTIL1_strcpy(gdb_cmd, sizeof(gdb_cmd), (unsigned char*)"dump binary memory "); UTIL1_strcat(gdb_cmd, sizeof(gdb_cmd), fileName); UTIL1_strcat(gdb_cmd, sizeof(gdb_cmd), (unsigned char*)" 0x"); UTIL1_strcatNum32Hex(gdb_cmd, sizeof(gdb_cmd), (uint32_t)ptr); UTIL1_strcat(gdb_cmd, sizeof(gdb_cmd), (unsigned char*)" 0x"); UTIL1_strcatNum32Hex(gdb_cmd, sizeof(gdb_cmd), (uint32_t)(ptr+len)); return 0; } That way I can copy the string in the gdb debugger: Generated GDB Memory Dump Command That command gets pasted and executed in the gdb console: gdb command line After execution of the program, the *.gcda file gets created (refresh might be necessary to show it up): gcda file created Repeat this for all instrumented files as necessary. Showing Coverage Information To show the coverage information, I need the *.gcda, the *.gcno plus the .elf file. :idea: Use Refresh if not all files are shown in the Project Explorer view Files Ready to Show Coverage Information Then double-click on the gcda file to show coverage results: Double Click on gcda File Press OK, and it opens the gcov view. Double click on file in that view to show the details: gcov Views Use the chart icon to create a chart view: Chart view Bar Graph View Video of Steps to Create and Use Coverage The following video summarizes the steps needed: Data and Code Overhead Instrumenting code to generate coverage information means that it is an intrusive method: it impacts the application execution speed, and needs extra RAM and ROM. How much heavily depends on the complexity of the control flow and on the number of arcs. Higher compiler optimizations would reduce the code size footprint, however optimizations are not recommended for coverage sessions, as this might make the job of the coverage much harder. I made a quick comparison using my test application. I used the ‘size’ GNU command (see “Printing Code Size Information in Eclipse”). Without coverage enabled, the application footprint is: arm-none-eabi-size --format=berkeley "FRDM-K64F_Coverage.elf" text data bss dec hex filename 6360 1112 5248 12720 31b0 FRDM-K64F_Coverage.elf With coverage enabled only for Test.c gave: arm-none-eabi-size --format=berkeley "FRDM-K64F_Coverage.elf" text data bss dec hex filename 39564 2376 9640 51580 c97c FRDM-K64F_Coverage.elf Adding main.c to generate coverage gives: arm-none-eabi-size --format=berkeley "FRDM-K64F_Coverage.elf" text data bss dec hex filename 39772 2468 9700 51940 cae4 FRDM-K64F_Coverage.elf So indeed there is some initial add-up because of the coverage library, but afterwards adding more source files does not add up much. Summary It took me a while and reading many articles and papers to get code coverage implemented for an embedded target. Clearly, code coverage is easier if I have a file system and plenty of resources available. But I’m now able to retrieve coverage information from a rather small embedded system using the debugger to dump the data to the host. It is not practical for large sets of files, but at least a starting point :-). I have committed my Eclipse Kepler/Launchpad project I used in this tutorial on GitHub. Ideas I have in my mind: Instead using the debugger/gdb, use FatFS and SD card to store the data Exploring how to use profiling Combining multiple coverage runs Happy Covering :-) Links: Blog article who helped me to explore gcov for embedded targets: http://simply-embedded.blogspot.ch/2013/08/code-coverage-introduction.html Paper about using gcov for Embedded Systems: http://sysrun.haifa.il.ibm.com/hrl/greps2007/papers/gcov-on-an-embedded-system.pdf Article about coverage options for GNU compiler and linker: http://bobah.net/d4d/tools/code-coverage-with-gcov How to call static constructor methods manually: http://stackoverflow.com/questions/6343348/global-constructor-call-not-in-init-array-section Article about using gcov with lcov: https://qiaomuf.wordpress.com/2011/05/26/use-gcov-and-lcov-to-know-your-test-coverage/ Explanation of different coverage methods and terminology: http://www.bullseye.com/coverage.html

January 28, 2015

by Erich Styger

· 15,748 Views

Importing Big Tables With Large Indexes With Myloader MySQL Tool

originally written by david ducos mydumper is known as the faster (much faster) mysqldump alternative. so, if you take a logical backup you will choose mydumper instead of mysqldump. but what about the restore? well, who needs to restore a logical backup? it takes ages! even with myloader. but this could change just a bit if we are able to take advantage of fast index creation. as you probably know, mydumper and mysqldump export the struct of a table, with all the indexes and the constraints, and of course, the data. then, myloader and mysql import the struct of the table and import the data. the most important difference is that you can configure myloader to import the data using a certain amount of threads. the import steps are: create the complete struct of the table import the data when you execute myloader, internally it first creates the tables executing the “-schema.sql” files and then takes all the filenames without “schema.sql” and puts them in a task queue. every thread takes a filename from the queue, which actually is a chunk of the table, and executes it. when finished it takes another chunk from the queue, but if the queue is empty it just ends. this import procedure works fast for small tables, but with big tables with large indexes the inserts are getting slower caused by the overhead of insert the new values in secondary indexes. another way to import the data is: split the table structure into table creation with primary key, indexes creation and constraint creation create tables with primary key per table do: load the data create index create constraints this import procedure is implemented in a branch of myloader that can be downloaded from here or directly executing bzr with the repository: bzr branch lp:~david-ducos/mydumper/mydumper the tool reads the schema files and splits them into three separate statements which create the tables with the primary key, the indexes and the constraints. the primary key is kept in the table creation in order to avoid the recreation of the table when a primary key is added and the “key” and “constraint” lines are removed. these lines are added to the index and constraint statements, respectively. it processes tables according to their size starting with the largest because creating the indexes of a big table could take hours and is single-threaded. while we cannot process other indexes at the time, we are potentially able to create other tables with the remaining threads. it has a new thread (monitor_process) that decides which chunk of data will be put in the task queue and a communication queue which is used by the task processes to tell the monitor_process which chunk has been completed. i run multiple imports on an aws m1.xlarge machine with one table comparing myloader and this branch and i found that with large indexes the times were: as you can see, when you have less than 150m rows, import the data and then create the indexes is higher than import the table with the indexes all at once. but everything changes after 150m rows, import 200m takes 64 minutes more for myloader but just 24 minutes for the new branch. on a table of 200m rows with a integer primary key and 9 integer columns, you will see how the time increases as the index gets larger: where: 2-2-0: two 1-column and two 2-column index 2-2-1: two 1-column, two 2-column and one 3-column index 2-3-1: two 1-column, three 2-column and one 3-column index 2-3-2: two 1-column, three 2-column and two 3-column index conclusion this branch can only import all the tables with this same strategy, but with this new logic in myloader, in a future version it could be able to import each table with the best strategy reducing the time of the restore considerably.

January 27, 2015

by Peter Zaitsev

· 5,265 Views

A very quick guide to deadlock diagnosis in SQL Server

Recently I was asked about diagnosing deadlocks in SQL Server – I’ve done a lot of work in this area way back in 2008, so I figure it’s time for a refresher. If there’s a lot of interest in exploring SQL Server and deadlocks further, I’m happy to write an extended article going into far more detail. Just let me know. Before we get into diagnosis and investigation, it’s a good time to pose the question: “what is a deadlock?”: From TechNet: A deadlock occurs when two or more tasks permanently block each other by each task having a lock on a resource which the other tasks are trying to lock. The following graph presents a high level view of a deadlock state where: Task T1 has a lock on resource R1 (indicated by the arrow from R1 to T1) and has requested a lock on resource R2 (indicated by the arrow from T1 to R2). Task T2 has a lock on resource R2 (indicated by the arrow from R2 to T2) and has requested a lock on resource R1 (indicated by the arrow from T2 to R1). Because neither task can continue until a resource is available and neither resource can be released until a task continues, a deadlock state exists. The SQL Server Database Engine automatically detects deadlock cycles within SQL Server. The Database Engine chooses one of the sessions as a deadlock victim and the current transaction is terminated with an error to break the deadlock. Basically, it’s a resource contention issue which blocks one process or transaction from performing actions on resources within SQL Server. This can be a serious condition, not just for SQL Server as processes become suspended, but for the applications which rely on SQL Server as well. The T-SQL Approach A fast way to respond is to execute a bit of T-SQL on SQL Server, making use of System Views. The following T-SQL will show you the “victim” processes, much like activity monitor does: select * from sys.sysprocesses where blocked > 0 Which is not particularly useful (but good to know, so you can see the blocked count). To get to the heart of the deadlock, this is what you want (courtesy of this SO question/answer): SELECT Blocker.text –, Blocker.*, * FROM sys.dm_exec_connections AS Conns INNER JOIN sys.dm_exec_requests AS BlockedReqs ON Conns.session_id = BlockedReqs.blocking_session_id INNER JOIN sys.dm_os_waiting_tasks AS w ON BlockedReqs.session_id = w.session_id CROSS APPLY sys.dm_exec_sql_text(Conns.most_recent_sql_handle) AS Blocker This will show you line and verse (the actual statement causing the resource block) – see the attached screenshot for an example. However, the generally accepted way to determine and diagnose deadlocks is through the use of SQL Server trace flags. SQL Trace Flags They are (usually) set temporarily, and they cause deadlocking information to be dumped to the SQL management logs. The flags that are useful are flags 1204 and 1222. From TechNet: https://technet.microsoft.com/en-us/library/ms178104%28v=sql.105%29.aspx Trace flags are set on or off by using either of the following methods: · Using the DBCC TRACEON and DBCC TRACEOFF commands. For example, DBCC TRACEON 2528: To enable the trace flag globally, use DBCC TRACEON with the -1 argument: DBCC TRACEON (2528, -1). To turn off a global trace flag, use DBCC TRACEOFF with the -1 argument. · Using the -T startup option to specify that the trace flag be set on during startup. The -T startup option enables a trace flag globally. You cannot enable a session-level trace flag by using a startup option. So to enable or disable deadlock trace flags globally, you’d use the following T-SQL: DBCC TRACEON (1204, -1) DBCC TRACEON (1222, -1) DBCC TRACEOFF (1204, -1) DBCC TRACEOFF (1222, -1) Due to the overhead, it’s best to enable the flag at runtime rather than on start up. Note that the scope of a non-startup trace flag can be global or session-level. Basic Deadlock Simulation By way of a very simple scenario, you can make use of SQL Management Studio (and breakpoints) to roughly simulate a deadlock scenario. Given the following basic table schema: CREATE TABLE [dbo].[UploadedFile]( [Id] [int] NOT NULL, [Filename] [nvarchar](50) NOT NULL, [DateCreated] [datetime] NOT NULL, [DateModified] [datetime] NULL, CONSTRAINT [PK_UploadedFile] PRIMARY KEY CLUSTERED ( [Id] ASC )WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ) With some basic test data in it: If you create two separate queries in SQL Management Studio, use the following transaction (Query #1) to lock rows in the table: SET TRANSACTION ISOLATION LEVEL SERIALIZABLE BEGIN TRANSACTION SELECT [Id],[Filename],[DateCreated],[DateModified] FROM [dbo].[UploadedFile] WHERE DateCreated > ‘2015-01-01′ ROLLBACK TRANSACTION Now add a “victim” script (Query #2) in a separate query session: UPDATE [dbo].[UploadedFile] SET [DateModified] = ‘2014-12-31′ WHERE DateCreated > ‘2015-01-01′ As long as you set a breakpoint on the ROLLBACK TRANSACTION statement, you’ll block the second query due to the isolation level of the transaction which wraps query #1. Now you can use the diagnostic T-SQL to examine the victim and the blocking transaction. Enjoy!

January 27, 2015

by Rob Sanders

· 165,242 Views

Improving Lock Performance in Java

After we introduced locked thread detection to Plumbr couple of months ago, we have started to receive queries similar to “hey, great, now I understand what is causing my performance issues, but what I am supposed to do now?” We are working hard to build the solution instructions into our own product, but in this post I am going to share several common techniques you can apply independent of the tool used for detecting the lock. The methods include lock splitting, concurrent data structures, protecting the data instead of the code and lock scope reduction. Locking is not evil, lock contention is Whenever you face a performance problem with the threaded code there is a chance that you will start blaming locks. After all, common “knowledge” is that locks are slow and limit scalability. So if you are equipped with this “knowledge” and start to optimize the code and getting rid of locks there is a chance that you end up introducing nasty concurrency bugs that will surface later on. So it is important to understand the difference between contended and uncontended locks. Lock contention occurs when a thread is trying to enter the synchronized block/method currently executed by another thread. This second thread is now forced to wait until the first thread has completed executing the synchronized block and releases the monitor. When only one thread at a time is trying to execute the synchronized code, the lock stays uncontended. As a matter of fact, synchronization in JVM is optimized for the uncontended case and for the vast majority of the applications, uncontended locks pose next to no overhead during execution. So, it is not locks you should blame for performance, but contended locks. Equipped with this knowledge, lets see what we can do to reduce either the likelihood of contention or the length of the contention. Protect the data not the code A quick way to achieve thread-safety is to lock access to the whole method. For example, take look at the following example, illustrating a naive attempt to build an online poker server: class GameServer { public Map<> tables = new HashMap>(); public synchronized void join(Player player, Table table) { if (player.getAccountBalance() > table.getLimit()) { List tablePlayers = tables.get(table.getId()); if (tablePlayers.size() < 9) { tablePlayers.add(player); } } } public synchronized void leave(Player player, Table table) {/*body skipped for brevity*/} public synchronized void createTable() {/*body skipped for brevity*/} public synchronized void destroyTable(Table table) {/*body skipped for brevity*/} } The intentions of the author have been good - when new players join() the table, there must be a guarantee that the number of players seated at the table would not exceed the table capacity of nine. But whenever such a solution would actually be responsible for seating players to tables - even on a poker site with moderate traffic, the system would be doomed to constantly trigger contention events by threads waiting for the lock to be released. Locked block contains account balance and table limit checks which potentially can involve expensive operations both increasing the likelihood and length of the contention. First step towards solution would be making sure we are protecting the data, not the code by moving the synchronization from the method declaration to the method body. In the minimalistic example above, it might not change much at the first place. But lets consider the whole GameServerinterface, not just the single join() method: class GameServer { public Map> tables = new HashMap>(); public void join(Player player, Table table) { synchronized (tables) { if (player.getAccountBalance() > table.getLimit()) { List tablePlayers = tables.get(table.getId()); if (tablePlayers.size() < 9) { tablePlayers.add(player); } } } } public void leave(Player player, Table table) {/* body skipped for brevity */} public void createTable() {/* body skipped for brevity */} public void destroyTable(Table table) {/* body skipped for brevity */} } What originally seemed to be a minor change, now affects the behaviour of the whole class. Whenever players were joining tables, the previously synchronized methods locked on theGameServer instance (this) and introduced contention events to players trying to simultaneouslyleave() tables. Moving the lock from the method signature to the method body postpones the locking and reduces the contention likelihood. Reduce the lock scope Now, after making sure it is the data we actually protect, not the code, we should make sure our solution is locking only what is necessary - for example when the code above is rewritten as follows: public class GameServer { public Map> tables = new HashMap>(); public void join(Player player, Table table) { if (player.getAccountBalance() > table.getLimit()) { synchronized (tables) { List tablePlayers = tables.get(table.getId()); if (tablePlayers.size() < 9) { tablePlayers.add(player); } } } } //other methods skipped for brevity } then the potentially time-consuming operation of checking player account balance (which potentially can involve IO operations) is now outside the lock scope. Notice that the lock was introduced only to protect against exceeding the table capacity and the account balance check is not anyhow part of this protective measure. Split your locks When we look at the last code example, you can clearly notice that the whole data structure is protected by the same lock. Considering that we might hold thousands of poker tables in this structure, it still poses a high risk for contention events as we have to protect each table separately from overflowing in capacity. For this there is an easy way to introduce individual locks per table, such as in the following example: public class GameServer { public Map> tables = new HashMap>(); public void join(Player player, Table table) { if (player.getAccountBalance() > table.getLimit()) { List tablePlayers = tables.get(table.getId()); synchronized (tablePlayers) { if (tablePlayers.size() < 9) { tablePlayers.add(player); } } } } //other methods skipped for brevity } Now, if we synchronize the access only to the same table instead of all the tables, we have significantly reduced the likelihood of locks becoming contended. Having for example 100 tables in our data structure, the likelihood of the contention is now 100x smaller than before. Use concurrent data structures Another improvement is to drop the traditional single-threaded data structures and use data structures designed explicitly for concurrent usage. For example, when picking ConcurrentHashMapto store all your poker tables would result in code similar to following: public class GameServer { public Map> tables = new ConcurrentHashMap>(); public synchronized void join(Player player, Table table) {/*Method body skipped for brevity*/} public synchronized void leave(Player player, Table table) {/*Method body skipped for brevity*/} public synchronized void createTable() { Table table = new Table(); tables.put(table.getId(), table); } public synchronized void destroyTable(Table table) { tables.remove(table.getId()); } } The synchronization in join() and leave() methods is still behaving as in our previous example, as we need to protect the integrity of individual tables. So no help from ConcurrentHashMap in this regards. But as we are also creating new tables and destroying tables in createTable() and destroyTable()methods, all these operations to the ConcurrentHashMap are fully concurrent, permitting to increase or reduce the number of tables in parallel. Other tips and tricks Reduce the visibility of the lock. In the example above, the locks are declared public and are thus visible to the world, so there is there is a chance that someone else will ruin your work by also locking on your carefully picked monitors. Check out java.util.concurrent.locks to see whether any of the locking strategies implemented there will improve the solution. Use atomic operations. The simple counter increase we are actually conducting in example above does not actually require a lock. Replacing the Integer in count tracking withAtomicInteger would most suit this example just fine. Hope the article helped you to solve the lock contention issues, independent of whether you are using Plumbr automatic lock detection solution or manually extracting the information from thread dumps.

January 22, 2015

by Nikita Salnikov-Tarnovski

· 11,548 Views

Implementing Ehcache using Spring context and Annotations

Part 1 – Configuring Ehcache with Spring Ehcache is most widely used open source cache for boosting performance, offloading your database, and simplifying scalability. It is very easy to configure Ehcache with Spring context xml. Assuming you have a working Spring configured application, here we discuss how to integrate Ehcache with Spring. In order to configure it, we need to add a CacheManager into the Spring context file. Code snippet added below In ‘ehcache’ bean, we are referring to ‘Ehcache.xml’ file which will contain all ehcache configurations. Sample ehcache file is added below. You can configure this file according to your project needs. Here you can two caches, defaultCache and one defined with name ‘TestCache’. DefaultCache configuration is applied to any cache that is not explicitly configured. Part 2 - Enabling Caching Annotations In order to used Spring provided java annotations , we need to declarative enable cache annotations in Spring config file. Code snippet added below. In order to use above annotation declaration, we need to add cache namespace into the spring file. Code snippet added below. Spring Annotations for caching @Cacheable triggers cache population @CacheEvict triggers cache eviction @CachePut updates the cache without interfering with the method execution I will give brief description and practical implementation of these annotations. For detailed explanations please refer spring documentation for caching . @Cacheable - This annotation means that the return value of the corresponding method can be cached and multiple invocation of this method with same arguments must populate the return value from cache without executing the method. Code snippet added below. @Cacheable("customer") public Customer findCustomer(String custId) {...} Each time the method is called, the cache with name ‘customer’ is checked to see whether the invocation has been already executed for the same ‘custId’. @CachePut - This annotation is used to update the cache without interfering the method execution. Method will always be executed and its result will be replaced in the cache. This annoatation is used for cache population. @CachePut("customer") public Customer updateBook(String custId){...} @CacheEvict – This annotation is used for cache eviction, used to remove stale or unused data from the cache. @CacheEvict(value="customer", allEntries=true) public void unloadCustomer(String newBatch) With above configurations, all entries in the ‘customer’ cache will be removed. Above description will give head start to configure Ehcachewith spring and use some of the caching annotations.

January 21, 2015

by Roshan Thomas

· 12,244 Views

Avoiding MySQL ALTER Table Downtime

Originally Written byAndrew Moore MySQL table alterations can interrupt production traffic causing bad customer experience or in worst cases, loss of revenue. Not all DBAs, developers, syadmins know MySQL well enough to avoid this pitfall. DBAs usually encounter these kinds of production interruptions when working with upgrade scripts that touch both application and database or if an inexperienced admin/dev engineer perform the schema change without knowing how MySQL operates internally. Truths * Direct MySQL ALTER table locks for duration of change (pre-5.6) * Online DDL in MySQL 5.6 is not always online and may incurr locks * Even with Percona Toolkit‘s pt-online-schema-change there are several workloads that can experience blocking Here on the Percona MySQL Managed Services team we encourage our clients to work with us when planning and performing schema migrations. We aim to ensure that we are using the best method available in their given circumstance. Our intentions to avoid blocking when performing DDL on large tables ensures that business can continue as usual whilst we strive to improve response time or add application functionality. The bottom line is that a business relying on access to its data cannot afford to be down during core trading hours. Many of the installations we manage are still below MySQL 5.6, which requires us to seek workarounds to minimize the amount of disruption a migration can cause. This may entail slave promotion or changing the schema with an ‘online schema change’ tool. MySQL version 5.6 looks to address this issue by reducing the number of scenarios where a table is rebuilt and locked but it doesn’t yet cover all eventualities, for example when changing the data type of a column a full table rebuild is necessary. The topic of 5.6 Online Schema Change was discussed in great detail last year in the post, “Schema changes – what’s new in MySQL 5.6?” by Przemysław Malkowski With new functionality arriving in MySQL 5.7, we look forward to non-blocking DDL operations such as; OPTIMIZE TABLE and RENAME INDEX. (More info) The best advice for MySQL 5.6 users is to review the matrix to familiarize with situations where it might be best to look outside of MySQL to perform schema changes, the good news is that we’re on the right path to solving this natively. Truth be told, a blocking alter is usually going to go unnoticed on a 30MB table and we tend to use a direct alter in this situation, but on a 30GB or 300GB table we have some planning to do. If there is a period of time where activity is low and the this is permissive of locking the table then sometimes it is better execute within this window. Frequently though we are reactive to new SQL statements or a new performance issue and an emergency index is required to reduce load on the master in order to improve the response time. To pt-osc or not to pt-osc? As mentioned, pt-online-schema-change is a fixture in our workflow. It’s usually the right way to go but we still have occasions where pt-online-schema-change cannot be used, for example; when a table already uses triggers. It’s an important to remind ourselves of the the steps that pt-online-schema-change traverses to complete it’s job. Lets look at the source code to identify these; [moore@localhost]$ egrep 'Step' pt-online-schema-change # Step 1: Create the new table. # Step 2: Alter the new, empty table. This should be very quick, # Step 3: Create the triggers to capture changes on the original table and <--(metadata lock) # Step 4: Copy rows. # Step 5: Rename tables: orig -> old, new -> orig <--(metadata lock) # Step 6: Update foreign key constraints if there are child tables. # Step 7: Drop the old table. [moore@localhost]$egrep'Step'pt-online-schema-change # Step 1: Create the new table. # Step 2: Alter the new, empty table. This should be very quick, # Step 3: Create the triggers to capture changes on the original table and <--(metadata lock) # Step 4: Copy rows. # Step 5: Rename tables: orig -> old, new -> orig <--(metadata lock) # Step 6: Update foreign key constraints if there are child tables. # Step 7: Drop the old table. I pick out steps 3 and 5 from above to highlight a source of a source of potential downtime due to locks, but step 6 is also an area for concern since foreign keys can have nested actions and should be considered when planning these actions to avoid related tables from being rebuilt with a direct alter implicitly. There are several ways to approach a table with referential integrity constraints and they are detailed within the pt-osc documentation a good preparation step is to review the structure of your table including the constraints and how the ripples of the change can affect the tables around it. Recently we were alerted to an incident after a client with a highly concurrent and highly transactional workload ran a standard pt-online-schema-change script over a large table. This appeared normal to them and a few hours later our pager man was notified that this client was experiencing max_connections limit reached. So what was going on? When pt-online-schema-change reached step 5 it tried to acquire a metadata lock to rename the the original and the shadow table, however this wasn’t immediately granted due to open transactions and thus threads began to queue behind the RENAME command. The actual effect this had on the client’s application was downtime. No new connections could be made and all existing threads were waiting behind the RENAME command. Metadata locks Introduced in 5.5.3 at server level. When a transaction starts it will acquire a metadata lock (independent of storage engine) on all tables it uses and then releases them when it’s finished it’s work. This ensures that nothing can alter the table definition whilst a transaction is open. With some foresight and planning we can avoid these situations with non-default pt-osc options, namely –nodrop-new-table and –no-swap-tables. This combination leaves both the shadow table and the triggers inplace so that we can instigate an atomic RENAME when load permits. EDIT: as of percona-toolkit version 2.2 we have a new variable –tries which in conjunction with –set-vars has been deployed to cover this scenario where various pt-osc operations could block waiting for a metadata lock. The default behaviour of pt-osc (–set-vars) is to set the following session variables when it connects to the server; wait_timeout=10000 innodb_lock_wait_timeout=1 lock_wait_timeout=60 when using –tries we can granularly identify the operation, try count and the wait interval between tries. This combination will ensure that pt-osc will kill it’s own waiting session in good time to avoid the thread pileup and provide us with a loop to attempt to acquire our metadata lock for triggers|rename|fk management; –tries swap_tables:5:0.5,drop_triggers:5:0.5 The documentation is here http://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html#cmdoption-pt-online-schema-change–tries This illustrates that even with a tool like pt-online-schema-change it is important to understand the caveats presented with the solution you think is most adequate. To help decide the direction to take use the flow chart to ensure you’re taking into account some of the caveats of the MySQL schema change. Be sure to read up on the recommended outcome though as there are uncharted areas such as disk space, IO load that are not featured on the diagram. Choosing the right DDL option Ensure you know what effect ALTER TABLE will have on your platform and pick the right method to suit your uptime. Sometimes that means delaying the change until a period of lighter use or utilising a tool that will avoid holding a table locked for the duration of the operation. A direct ALTER is sometimes the answer like when you have triggers installed on a table. – In most cases pt-osc is exactly what we need – In many cases pt-osc is needed but the way in which it’s used needs tweaking – In few cases pt-osc isn’t the right tool/method and we need to consider native blocking ALTER or using failovers to juggle the change into place on all hosts in the replica cluster. If you want to learn more about avoiding avoidable downtime please tune into my webinar Wednesday, November 19 at 10 a.m. PST. It’s titled “Tips from the Trenches: A Guide to Preventing Downtime for the Over-Extended DBA.” Register now!(If you miss it, don’t worry: Catch the recording and download the slides from that same registration page.)

January 20, 2015

by Peter Zaitsev

· 15,244 Views · 1 Like

Lambda Architecture for Big Data

An increasing number of systems are being built to handle the Volume, Velocity and Variety of Big Data, and hopefully help gain new insights and make better business decisions. Here, we will look at ways to deal with Big Data’s Volume and Velocity simultaneously, within a single architecture solution. Volume + Velocity Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data sets across clusters of computers. MapReduce is a batch query processor that is targeted at long-running background processes. Hadoop can handle Volume. But to handle Velocity, we need real-time processing tools that can compensate for the high-latency of batch systems, and serve the most recent data continuously, as new data arrives and older data is progressively integrated into the batch framework. Therefore we need both batch and real-time to run in parallel, and add a real-time computational system (e.g. Apache Storm) to our batch framework. This architectural combination of batch and real-time computation is referred to as a Lambda Architecture (λ). Generic Lambda λ has three layers: The Batch Layer manages the master data and precomputes the batch views The Speed Layer serves recent data only and increments the real-time views The Serving Layer is responsible for indexing and exposing the views so that they can be queried. The three layers are outlined in the below diagram along with a sample choice of technology stacks: Incoming data is dispatched to both Batch and Speed layers for processing. At the other end, queries are answered by merging both batch and real-time views. Note that real-time views are transient by nature and their data is discarded (making room for newer data) once propagated through the Batch and Serving layers. Most of the complexity is pushed onto the much smaller Speed layer where the results are only temporary, a process known as “complexity isolation“. We are indeed isolating the complexity of concurrent data updates in a layer that is regularly purged and kept small in size. λ is technology agnostic. The data pipeline is broken down into layers with clear demarcation of responsibilities, and at each layer, we can choose from a number of technologies. The Speed layer for instance could use either Apache Storm, or Apache Spark Streaming, or Spring “XD” ( eXtreme Data) etc. How do we recover from mistakes in λ ? Basically, we recompute the views. If that takes too long, we just revert to the previous, non-corrupted versions of our data. We can do that because of data immutability in the master dataset: data is never updated, only appended to (time-based ordering). The system is therefore Human Fault-Tolerant: if we write bad data, we can just remove that data altogether and recompute. Unified Lambda The downside of λ is its inherent complexity. Keeping in sync two already complex distributed systems is quite an implementation and maintenance challenge. People have started to look for simpler alternatives that would bring just about the same benefits and handle the full problem set. There are basically three approaches: 1) Adopt a pure streaming approach, and use a flexible framework such as Apache Samza to provide some type of batch processing. Although its distributed streaming layer is pluggable, Samza typically relies on Apache Kafka. Samza’s streams are replayable, ordered partitions. Samza can be configured for batching, i.e. consume several messages from the same stream partition in sequence. 2) Take the opposite approach, and choose a flexible Batch framework that would also allow micro-batches, small enough to be close to real-time, with Apache Spark/Spark Streaming or Storm’s Trident. Spark streaming is essentially a sequence of small batch processes that can reach latency as low as one second.Trident is a high-level abstraction on top of Storm that can process streams as small batches as well as do batch aggregation. 3) Use a technology stack already combining batch and real-time, such as Spring “XD”, Summingbird or Lambdoop. Summingbird (“Streaming MapReduce”) is a hybrid system where both batch/real-time workflows can be run at the same time and the results merged automatically.The Speed layer runs on Storm and the Batch layer on Hadoop, Lambdoop (Lambda-Hadoop, with HBase, Storm and Redis) also combines batch/real-time by offering a single API for both processing paradigms: The integrated approach (unified λ) seeks to handle Big Data’s Volume and Velocity by featuring a hybrid computation model, where both batch and real-time data processing are combined transparently. And with a unified framework, there would be only one system to learn, and one system to maintain.

January 17, 2015

by Tony Siciliani

· 40,468 Views · 6 Likes

ORM and Angular -- Make Your App Smarter

Posted by Gilad F on Back& Blog. Current approaches to web development rely upon having two kinds of intelligence built into your application – business intelligence in the server, and presentation intelligence on the client side. This institutes a clear delineation in responsibilities, which is often desirable from an architectural standpoint. However, this approach does have some drawbacks. Processing time for business logic, for example, is centralized on the server. This can introduce bottlenecks in the application’s performance, or add complexity when it comes to cross-server communication. For smaller applications that nonetheless have a large user base, this can often be the single greatest performance concern – the time spent computing solutions by the server. One way this can be offset is through the use of Object-Relational Mapping, or ORM. Below we’ll look at the concept of ORM, and how creating an ORM system in Angular can help make your application smarter. What is an ORM? Simply put, Object-Relational Mapping is the concept of creating representations of your underlying data that know how to manage themselves. Most web applications boil down to four basic actions, known as the “CRUD” approach – Create a record, Retrieve records, Update a record, or Delete a record. With an ORM, you simply encapsulate each of these functions within a class that represents a given record in the database. In essence, the objects you create to represent your data on the front end also know how to manipulate that data on the back end. Why Use an ORM? The primary benefit of an ORM is that it hides a lot of the functional complexity of database integration behind an established API. Communication with the database to implement each of the CRUD methods can be complex, but once it’s been accomplished for one model it can be easily ported to all of the other models in your system. An ORM focuses on hiding as much of this code as possible, allowing your models to care only about how they are represented – and how they interact with other elements in the system. A series of calls to establish a connection to the database, for example, becomes a single call to a method named “Save” on the model instance. This also allows you to centralize your database code, giving you only one location where you need to look for database-related bugs instead of having to search a complex code base for different custom data communication handlers. Why Use an ORM in Angular? While the JavaScript stack is particularly performant when compared to more heavyweight offerings such as Rails and Django, it still faces the issues common to the standard web application architecture – the server has the potential to be a bottleneck, handling the incoming traffic from a number of locations. By focusing your development efforts to create a pure CRUD API in your server, and developing a rudimentary ORM in Angular, you can offload a lot of that processing load to the client machines – in essence parallelizing the process at the expense of increased network communication. This allows you to reduce the overall dependence of your application on the server, making the server a “thin” client that simply updates the database based upon the API calls issued by the client. After a certain point, your back-end can be outsourced completely to an external provider that specializes in providing this type of access – such as Backand – allowing you to completely offload scalability and security concerns. In essence, it allows you to focus on your application as opposed to focusing on the attendant resources. Conclusion Object-Relational Mapping is a powerful paradigm that eases communication with a database for the basic CRUD activities associated with web applications. As most existing web development environments focus on implementing ORM on the server side, this can result in performance and communication bottlenecks – not to mention increased infrastructure costs. By offloading some of these ORM tasks to AngularJS, you can parallelize many of these tasks and reduce overall server load, in some cases obviating the need for the server entirely. If your application is facing a bloated back-end communication pattern, it might be worth your time to look at working towards implementation of a client-side ORM system. Build your Angular app and connect it to any database with Backand today. – Get started now.translate in hindi

January 16, 2015

by Itay Herskovits

· 8,848 Views

Structurizr: System Context Diagram as Code

as i said in resolving the conflict between software architecture and code , my focus for this year is representing a software architecture model as code. in simple sketches for diagramming your software architecture , i showed an example system context diagram for my techtribes.je website. it's a simple diagram that shows techtribes.je in the middle, surrounded by the key types of users and system dependencies. it's your typical "big picture" view. this diagram was created using omnigraffle (think microsoft visio for mac os x) and it's exactly that - a static diagram that needs to be manually kept up to date. instead, wouldn't it be great if this diagram was based upon a model that we could better version control, collaborate on and visualize? if you're not sure what i mean by a "model", take a look at models, sketches and everything in between . this is basically what the aim of structurizr is. it's a way to describe a software architecture model as code, and then visualize it in a simple way. the structurizr java library is available on github and you can download a prebuilt binary . just as a warning, this is very much a work in progress and so don't be surprised if things change! here's some java code to recreate the techtribes.je system context diagram. package com.structurizr.example; import com.structurizr.io.json.jsonwriter; import com.structurizr.model.location; import com.structurizr.model.model; import com.structurizr.model.person; import com.structurizr.model.softwaresystem; import com.structurizr.view.systemcontextview; import com.structurizr.view.viewset; import java.io.stringwriter; /** * this is a model of the system context for the techtribes.je system, * the code for which can be found at https://github.com/techtribesje/techtribesje */ public class techtribessystemcontext { public static void main(string[] args) throws exception { // create a model and the software system we want to describe model model = new model("techtribes.je", "this is a model of the system context for the techtribes.je system, the code for which can be found at https://github.com/techtribesje/techtribesje"); softwaresystem techtribes = model.addsoftwaresystem(location.internal, "techtribes.je", "techtribes.je is the only way to keep up to date with the it, tech and digital sector in jersey and guernsey, channel islands"); // create the various types of people (roles) that use the software system person anonymoususer = model.addperson(location.external, "anonymous user", "anybody on the web."); anonymoususer.uses(techtribes, "view people, tribes (businesses, communities and interest groups), content, events, jobs, etc from the local tech, digital and it sector."); person authenticateduser = model.addperson(location.external, "aggregated user", "a user or business with content that is aggregated into the website."); authenticateduser.uses(techtribes, "manage user profile and tribe membership."); person adminuser = model.addperson(location.external, "administration user", "a system administration user."); adminuser.uses(techtribes, "add people, add tribes and manage tribe membership."); // create the various software systems that techtribes.je has a dependency on softwaresystem twitter = model.addsoftwaresystem(location.external, "twitter", "twitter.com"); techtribes.uses(twitter, "gets profile information and tweets from."); softwaresystem github = model.addsoftwaresystem(location.external, "github", "github.com"); techtribes.uses(github, "gets information about public code repositories from."); softwaresystem blogs = model.addsoftwaresystem(location.external, "blogs", "rss and atom feeds"); techtribes.uses(blogs, "gets content using rss and atom feeds from."); // now create the system context view based upon the model viewset viewset = new viewset(model); systemcontextview contextview = viewset.createcontextview(techtribes); contextview.addallsoftwaresystems(); contextview.addallpeople(); // and output the model and view to json (so that we can render it using structurizr.com) jsonwriter jsonwriter = new jsonwriter(true); stringwriter stringwriter = new stringwriter(); jsonwriter.write(viewset, stringwriter); system.out.println(stringwriter.tostring()); } } executing this code creates this json , which you can then copy and paste into the try it page of structurizr. the result (if you move the boxes around) is something like this. don't worry, there will eventually be an api for uploading software architecture models and the diagrams will get some styling, but it proves the concept. what we have then is an api that implements the various levels in my c4 software architecture model, with a simple browser-based rendering tool. hopefully that's a nice simple introduction of how to represent a software architecture model as code, and gives you a flavour for the sort of direction i'm taking it. having the software architecture as code provides some interesting opportunities that you don't get with static diagrams from visio, etc and the ability to keep the models up to date automatically by scanning the codebase is what i find particularly exciting. if you have any thoughts on this, please do drop me a note.

January 16, 2015

by Simon Brown

· 6,615 Views · 4 Likes

Store UUID in an Optimized Way

written by karthik appigatla for the mysql performance blog . a few years ago peter zaitsev, in a post titled “ to uuid or not to uuid ,” wrote: “ there is a timestamp based part in uuid which has similar properties to auto_increment and which could be used to have values generated at the same point in time physically local in btree index.” for this post i’ve rearranged the timestamp part of uuid (universal unique identifier) and did some benchmarks. many people store uuid as char (36) and use as row identity value (primary key) because it is unique across every table, every database and every server and allows easy merging of records from different databases. but here comes the problem, using it as a primary key causes the problems described below. problems with uuid uuid has 36 characters which makes it bulky. innodb stores data in the primary key order and all the secondary keys also contain primary key. so having uuid as primary key makes the index bigger which can not be fit into the memory inserts are random and the data is scattered. despite the problems with uuid, people still prefer it because it is unique across every table, can be generated anywhere. in this blog, i will explain how to store uuid in an efficient way by re-arranging timestamp part of uuid. structure of uuid mysql uses uuid version 1 which is a 128-bit number represented by a utf8 string of five hexadecimal numbers the first three numbers are generated from a timestamp. the fourth number preserves temporal uniqueness in case the timestamp value loses monotonicity (for example, due to daylight saving time). the fifth number is an ieee 802 node number that provides spatial uniqueness. a random number is substituted if the latter is not available (for example, because the host computer has no ethernet card, or we do not know how to find the hardware address of an interface on your operating system). in this case, spatial uniqueness cannot be guaranteed. nevertheless, a collision should have very low probability. the timestamp is mapped as follows: when the timestamp has the (60 bit) hexadecimal value: 1d8eebc58e0a7d7. the following parts of the uuid are set:: 58e0a7d7-eebc-11d8 -9669-0800200c9a66. the 1 before the most significant digits (in 11d8) of the timestamp indicates the uuid version, for time-based uuids this is 1. fourth and fifth parts would be mostly constant if it is generated from a single server. first three numbers are based on timestamp, so they will be monotonically increasing. lets rearrange the total sequence making the uuid closer to sequential. this makes the inserts and recent data look up faster. dashes (‘-‘) make no sense, so lets remove them. 58e0a7d7-eebc-11d8-9669-0800200c9a66 => 11d8eebc58e0a7d796690800200c9a66 benchmarking i created three tables: events_uuid – uuid binary(16) primary key events_int – additional bigint auto increment column and made it as primary key and index on uuid column events_uuid_ordered – rearranged uuid binary(16) as primary key i created three stored procedures which insert 25k random rows at a time into the respective tables. there are three more stored procedures which call the random insert-stored procedures in a loop and also calculate the time taken to insert 25k rows and data and index size after each loop. totally i have inserted 25m records. data size horizontal axis – number of inserts x 25,000 vertical axis – data size in mb the data size for uuid table is more than other two tables. index size horizontal axis – number of inserts x 25,000 vertical axis – index size in mb total size horizontal axis – number of inserts x 25,000 vertical axis – total size in mb time taken horizontal axis – number of inserts x 25,000 vertical axis – time taken in seconds for the table with uuid as primary key, you can notice that as the table grows big, the time taken to insert rows is increasing almost linearly. whereas for other tables, the time taken is almost constant. the size of uuid table is almost 50% bigger than ordered uuid table and 30% bigger than table with bigint as primary key. comparing the ordered uuid table bigint table, the time taken to insert rows and the size are almost same. but they may vary slightly based on the index structure. root@localhost:~# ls -lhtr /media/data/test/ | grep ibd -rw-rw---- 1 mysql mysql 13g jul 24 15:53 events_uuid_ordered.ibd -rw-rw---- 1 mysql mysql 20g jul 25 02:27 events_uuid.ibd -rw-rw---- 1 mysql mysql 15g jul 25 07:59 events_int.ibd table structure #1 events_int create table `events_int` ( `count` bigint(20) not null auto_increment, `id` binary(16) not null, `unit_id` binary(16) default null, `event` int(11) default null, `ref_url` varchar(255) collate utf8_unicode_ci default null, `campaign_id` binary(16) collate utf8_unicode_ci default '', `unique_id` binary(16) collate utf8_unicode_ci default null, `user_agent` varchar(100) collate utf8_unicode_ci default null, `city` varchar(80) collate utf8_unicode_ci default null, `country` varchar(80) collate utf8_unicode_ci default null, `demand_partner_id` binary(16) default null, `publisher_id` binary(16) default null, `site_id` binary(16) default null, `page_id` binary(16) default null, `action_at` datetime default null, `impression` smallint(6) default null, `click` smallint(6) default null, `sold_impression` smallint(6) default null, `price` decimal(15,7) default '0.0000000', `actioned_at` timestamp not null default '0000-00-00 00:00:00', `unique_ads` varchar(255) collate utf8_unicode_ci default null, `notification_url` text collate utf8_unicode_ci, primary key (`count`), key `id` (`id`), key `index_events_on_actioned_at` (`actioned_at`), key `index_events_unit_demand_partner` (`unit_id`,`demand_partner_id`) ) engine=innodb default charset=utf8 collate=utf8_unicode_ci; #2 events_uuid create table `events_uuid` ( `id` binary(16) not null, `unit_id` binary(16) default null, ~ ~ primary key (`id`), key `index_events_on_actioned_at` (`actioned_at`), key `index_events_unit_demand_partner` (`unit_id`,`demand_partner_id`) ) engine=innodb default charset=utf8 collate=utf8_unicode_ci; #3 events_uuid_ordered create table `events_uuid_ordered` ( `id` binary(16) not null, `unit_id` binary(16) default null, ~ ~ primary key (`id`), key `index_events_on_actioned_at` (`actioned_at`), key `index_events_unit_demand_partner` (`unit_id`,`demand_partner_id`) ) engine=innodb default charset=utf8 collate=utf8_unicode_ci; conclusions create function to rearrange uuid fields and use it delimiter // create definer=`root`@`localhost` function `ordered_uuid`(uuid binary(36)) returns binary(16) deterministic return unhex(concat(substr(uuid, 15, 4),substr(uuid, 10, 4),substr(uuid, 1, 8),substr(uuid, 20, 4),substr(uuid, 25))); // delimiter ; inserts insert into events_uuid_ordered values (ordered_uuid(uuid()),'1','m',....); selects select hex(uuid),is_active,... from events_uuid_ordered ; define uuid as binary(16) as binary does not have any character set references http://dev.mysql.com/doc/refman/5.1/en/miscellaneous-functions.html#function_uuid http://www.famkruithof.net/guid-uuid-timebased.html http://www.percona.com/blog/2007/03/13/to-uuid-or-not-to-uuid/ http://blog.codinghorror.com/primary-keys-ids-versus-guids/

January 13, 2015

by Peter Zaitsev

· 17,862 Views · 2 Likes

AngularJS Two-Way Data Binding

Traditional web development builds a bridge between the front end, where the user performs their manipulations of the application’s data, and the back end, where that data is stored. In traditional web development, this process is driven by successive networking calls, communicating changes between the server and the client via re-rendering the involved pages. AngularJS enhances this with two-way data binding. Below we’ll look at what two-way data binding is, and how it differs from the traditional data processing approach. The Traditional Approach Most web frameworks focus on one-way data binding. This involves reading the input from the DOM, serializing the data, sending it to the server, waiting for the process to finish, then modifying the DOM to indicate any errors, or reloading the DOM if the call is successful. While this provides a traditional web application all the time it needs to perform data processing, this benefit is only really applicable to web apps with highly complex data structures. If your application has a simpler data format, with relatively flat models, then the extra work can needlessly complicate the process. Furthermore, all models need to wait for server confirmation before their data can be updated, meaning that related data depending upon those models won’t have the latest information. Tying Together the UI and the Model AngularJS addresses this with two-way data binding. With two-way data binding, the user interface changes are immediately reflected in the underlying data model, and vice-versa. This allows the data model to serve as an atomic unit that the view of the application can always depend upon to be accurate. Many web frameworks implement this type of data binding with a complex series of event listeners and event handlers – an approach that can quickly become fragile. AngularJS, on the other hand, makes this approach to data a primary part of its architecture. Instead of creating a series of callbacks to handle the changing data, AngularJS does this automatically without any needed intervention by the programmer Benefits and Considerations The primary benefit of two-way data binding is that updates to (and retrievals from) the underlying data store happen more or less automatically. When the data store updates, the UI updates as well. This allows you to remove a lot of logic from the front-end display code, particularly when making effective use of AngularJS’s declarative approach to UI presentation. In essence, it allows for true data encapsulation on the front-end, reducing the need to do complex and destructive manipulation of the DOM. While this solves a lot of problems with a website’s presentation architecture, there are some disadvantages to take into consideration. First, AngularJS uses a dirty-checking approach that can be slow in some browsers – not a problem for thin presentation pages, but any page with heavy processing may run into problems in older browsers. Additionally, two-way binding is only truly beneficial for relatively simple objects. Any data that requires heavy parsing work, or extensive manipulation and processing, will simply not work well with two-way binding. Additionally, some uses of Angular – such as using the same binding directive more than once – can break the data binding process. Conclusion While the traditional approach to data binding has a lot of benefits when it comes to performing complex data manipulations and calculations, it can introduce some problems with respect to the design of the web application’s front-end architecture. With AngularJS’s use of two-way data binding, your application can greatly simplify its presentation layer, allowing the UI to be built off of a cleaner, less-destructive approach to DOM presentation. While it isn’t useful in every situation, the two-way data binding AngularJS provides can greatly ease web application development, and reduce the pain faced by your front-end developers. Build your Angular app and connect it to any database with Backand today. – Get started now.

January 13, 2015

by Itay Herskovits

· 6,712 Views

Byte Buddy, an alternative to cglib and Javassist

Byte Buddy is a code generation library for creating Java classes during the runtime of a Java application and without the help of a compiler. Other than the code generation utilities that ship with the Java Class Library, Byte Buddy allows the creation of arbitrary classes and is not limited to implementing interfaces for the creation of runtime proxies. In order to use Byte Buddy, one does not require an understanding of Java byte code or the class file format. In contrast, Byte Buddy’s API aims for code that is concise and easy to understand for everybody. Nevertheless, Byte Buddy remains fully customizable down to the possibility of defining custom byte code. Furthermore, the API was designed to be as non-intrusive as possible and as a result, Byte Buddy does not leave any trace in the classes that were created by it. For this reason, the generated classes can exist without requiring Byte Buddy on the class path. Because of this feature, Byte Buddy’s mascot was chosen to be a ghost. Byte Buddy is written in Java 6 but supports the generation of classes for any Java version. Byte Buddy is a light-weight library and only depends on the visitor API of the Java byte code parser library ASM which does itself not require any further dependencies. At first sight, runtime code generation can appear to be some sort of black magic that should be avoided and only few developers write applications that explicitly generate code during their runtime. However, this picture changes when creating libraries that need to interact with arbitrary code and types that are unknown at compile time. In this context, a library implementer must often choose between either requiring a user to implement library-proprietary interfaces or to generate code at runtime when the user’s types becomes first known to the library. Many known libraries such as for example Spring or Hibernate choose the latter approach which is popular among their users under the term of using Plain Old Java Objects. As a result, code generation has become an ubiquitous concept in the Java space. Byte Buddy is an attempt to innovate the runtime creation of Java types in order to provide a better tool set to those relying on code generation. Hello World Saying Hello World with Byte Buddy is as easy as it can get. Any creation of a Java class starts with an instance of the ByteBuddy class which represents a configuration for creating new types: Class dynamicType = new ByteBuddy() .subclass(Object.class) .method(named("toString")).intercept(FixedValue.value("Hello World!")) .make() .load(getClass().getClassLoader(), ClassLoadingStrategy.Default.WRAPPER) .getLoaded(); assertThat(dynamicType.newInstance().toString(), is("Hello World!")); The default ByteBuddy configuration which is used in the above example creatse a Java class in the newest version of the class file format that is understood by the processing Java virtual machine. As hopefully obvious from the example code, the created type will extend the Object class and intercept its toString method which should return a fixed value of Hello World!. The method to be intercepted is identified by a so-called method matcher. In the above example, a predefined method matcher named(String) is used which identifies a method by its exact name. Byte Buddy comes with numerous predefined and well-tested method matchers which are collected in the MethodMatchers class. The creation of custom matchers is however as simple as implementing the (functional) MethodMatcher interface. For implementing the toString method, the FixedValue class defines a constant return value for the intercepted method. Defining a constant value is only one example of many method interceptors that ship with Byte Buddy. By implementing the Instrumentation interface, a method could however even be defined by custom byte code. Finally, the described Java class is created and then loaded into the Java virtual machine. For this purpose, a target class loader is required as well as a class loading strategy where we choose a wrapper strategy. The latter creates a new child class loader which wraps the given class loader and only knows about the newly created dynamic type. Eventually, we can convince ourselves of the result by calling the toString method on an instance of the created class and finding the return value to represent the constant value we expected. Where to go from here? Byte Buddy is a comprehensive library and we only scratched the surface of Byte Buddy's capabilities. However, Byte Buddy aims for being easy to use by providing a domain-specific language for creating classes. Most runtime code generation can be done by writing readable code and without any knowledge of Java's class file format. If you want to learn more about Byte Buddy, you can find such a tutorial on Byte Buddy's web page. Furthermore, Byte Buddy comes with a detailed in-code documentation and extensive test case coverage which can also serve as example code. you can also find the source code of Byte Buddy on GitHub .

January 11, 2015

by Rafael Winterhalter

· 10,693 Views · 6 Likes

An Impatient New User's Introduction to API Management with JBoss apiman 1.0

API Management? Did you say “API Management?” Software application development models are evolutionary things. New technologies are always being created and require new approaches. It’s frequently the case today, that a service oriented architecture (SOA) model is used and that the end product is a software service that can be used by applications. The explosion in growth of mobile devices has only accelerated this trend. Every new mobile phone sold is another platform onto which applications are deployed. These applications are often built from services provided from multiple sources. The applications often consume these services through their APIs. OK, that’s all interesting, but why does this matter? Here’s why: If you are providing a service, you’d probably like to receive payment when it’s used by an application. For example, let’s say that you’ve spent months creating a new service that provides incredibly accurate and timely driving directions. You can imagine every mobile phone GPS app making use of your service someday. That is, however, assuming that you can find a way to enforce a contract on consumers of the API and provide them with a service level agreement (SLA). Also, you have to find a way to actually track consumers’ use of the API so that you can actually enforce that SLA. Finally, you have to have the means to update a service and publish new versions of services. Likewise, if you are consuming a service, for example, if you want to build the killer app that will use that cool new mapping service, you have to have the means to find the API, identify the API’s endpoint, and register your usage of the API with its provider. The approach that is followed to fulfill both service providers’ and consumers’ needs is...API Management. JBoss apiman 1.0 apiman is JBoss’ open source API Management system. apiman fulfills service API providers’ and consumers’ needs by implementing: API Manager - The API Manager provides an easy way for API/service providers to use a web UI to define service contracts for their APIs, apply these contracts across multiple APIs, and control role-based user access and API versioning. These contracts can govern access to services and limits on the rate at which consumers can access services. The same UI enables API consumers to easily locate and access APIs. API Gateway - The gateway applies the service contract policies of API Management by enforcing at runtime the rules defined in the contracts and tracking the service API consumers’ use of the APIs for every request made to the services. The way that the API Gateway works is that the consumer of the service accesses the service through a URL that designates the API Gateway as a proxy for the service. If the policies defined to govern access to the service (see a later section in this post for a discussion of apiman polices), the API Gateway then proxies requests to the service’s backend API implementation. The best way to understand API Management with apiman is to see it in action. In this post, we’ll install apiman 1.0, configure an API with contracts through the API Manager, and watch the API Gateway control access to the API and track its use. Prerequisites We don’t need very much to run apiman out of the box. Before we install apiman, you’ll have to have Java (version 1.7 or newer) installed on your system. You’ll also need to git and maven installed to be able to build the example service that we’ll use. A note on software versions: In this post we’ll use the latest available version of apiman as of December 2014. As if this writing, version 1.0 of apiman was just released (December 2014). Depending on the versions of software that you use, some screen displays may look a bit different. Getting apiman Like all JBoss software, installation of apiman is simple. First, you will need an application server on which to install and run apiman. We’ll use the open source JBoss WildFly server release 8.2 (http://www.wildfly.org/). To make things easier, apiman includes a pointer to JBoss WildFly on its download page here: http://www.apiman.io/latest/download.html To install WildFly, simply download http://download.jboss.org/wildfly/8.2.0.Final/wildfly-8.2.0.Final.zip and unzip the file into the directory in which you want to run the sever. Then, download the apiman 1.0 WildFly overlay zip file inside the directory that was created when you un-zipped the WildFly download. The apiman 1.0 WildFly overlay zip file is available here: http://downloads.jboss.org/overlord/apiman/1.0.0.Final/apiman-distro-wildfly8-1.0.0.Final-overlay.zip The commands that you will execute will look something like this: mkdir apiman cd apiman unzip wildfly-8.2.0.Final.zip unzip -o apiman-distro-wildfly8-1.0.0.Final-overlay.zip -d wildfly-8.2.0.Final Then, to start the server, execute these commands: cd wildfly-8.2.0.Final ./bin/standalone.sh -c standalone-apiman.xml The server will write logging messages to the screen. When you see some messages that look like this, you’ll know that the server is up and running with apiman installed: 13:57:03,229 INFO [org.jboss.as.server] (ServerService Thread Pool -- 29) JBAS018559: Deployed "apiman-ds.xml" (runtime-name : "apiman-ds.xml") 13:57:03,261 INFO [org.jboss.as] (Controller Boot Thread) JBAS015961: Http management interface listening on http://127.0.0.1:9990/management 13:57:03,262 INFO [org.jboss.as] (Controller Boot Thread) JBAS015951: Admin console listening on http://127.0.0.1:9990 13:57:03,262 INFO [org.jboss.as] (Controller Boot Thread) JBAS015874: WildFly 8.2.0.Final "Tweek" started in 5518ms - Started 754 of 858 services (171 services are lazy, passive or on-demand) If this were a production server, the first thing that we’d do is to change the OOTB default admin username and/or password. apiman is configured by default to use JBoss KeyCloak (http://keycloak.jboss.org/) for password security. Also, the default database used by apiman to store contract and service information is the H2 database. For a production server, you’d want to reconfigure this to use a production database. Note: apiman includes DDLs for both MySQL and PostgreSQL. For the purposes of our demo, we’ll keep things simple and use the default configuration. To access apiman’s API Manager UI, go to: http://localhost:8080/apiman-manager, and log in. The admin user account that we’ll use has a username of “admin” and a password of “admin123!” You should see a screen that looks like this: Before we start using apiman, let’s take a look at how apiman defines how services and the meta data on which they depend are organized. Policies, Plans, and Organizations apiman uses a hierarchical data model that consists of these elements: Polices, Plans, and Organizations: Policies Policies are at the lowest level of the data model, and they are the basis on which the higher level elements of the data model are built. A policy defines an action that is performed by the API Gateway at runtime. Everything defined in the API Manager UI is there to enable apiman to apply policies to requests made to services. When a request to a service is made, apiman creates a chain of policies to be applied to that request. apiman policy chains define a specific sequence order in which the policies defined in the API Manager UI are applied to service requests. The sequence in which incoming service requests have policies applied is: First, at the application level. In apiman, an application is contracted to use one or more services. Second, at the plan level. In apiman, policies are organized into groups called plans. (We’ll discuss plans in the next section of this post.) Third, at the individual service level. What happens is that when a service request is received by the API Gateway at runtime, the policy chain is applied in the order of application, plan, and service. If no failures, such as a rate counter being exceeded, occur, the API Gateway sends the request to the service’s backend API implementation. As we mentioned earlier in this post, the API Gateway acts as a proxy for the service: Next, when the API Gateway receives a response from the service’s backend implementation, the policy chain is applied again, but this time in the reverse order. The service policies are applied first, then the plan policies, and finally the application policies. If no failures occur, then the service response is sent back to the consumer of the service. By applying the policy chain twice, both for the originating incoming request and the resulting response, apiman allows policy implementations two opportunities to provide management functionality during the lifecycle. The following diagram illustrates this two-way approach to applying policies: Plans In apiman, a “plan” is a set policies that together define the level of service that apiman provides for service. Plans enable apiman users to define multiple different levels of service for their APIs, based on policies. It’s common to define different plans for the same service, where the differences depend on configuration options. For example, a group or company may offer both a “gold” and “silver” plan for the same service. The gold plan may be more expensive than the silver plan, but it may offer a higher level of service requests in a given (and configurable) time period. Organizations The “organization” is at top level of the apiman data model. An organization contains and manages all elements used by a company, university, group inside a company, etc. for API management with apiman. All plans, services, applications, and users for a group are defined in an apiman organization. In this way, an organization acts as a container of other elements. Users must be associated with an organization before they can use apiman to create or consume services. apiman implements role-based access controls for users. The role assigned to a user defines the actions that a user can perform and the elements that a user can manage. Before we can define a service, the policies that govern how it is accessed, the users who will be able to access, and the organizations that will create and consume it, we need a service and a client to access that service. Luckily, creating the service and deploying it to our WildFly server, and accessing it through a client are easy. Getting and Building and Deploying the Example Service The source code for the example service is contained in a git repo (http://git-scm.com) hosted at github (https://github.com/apiman). To download a copy of the example service, navigate to the directory in which you want to build the service and execute this git command: git clone [email protected]:apiman/apiman-quickstarts.git As the source code is downloading, you'll see output that looks like this: git clone [email protected]:apiman/apiman-quickstarts.git Initialized empty Git repository in /tmp/tmp/apiman-quickstarts/.git/ remote: Counting objects: 104, done. remote: Total 104 (delta 0), reused 0 (delta 0) Receiving objects: 100% (104/104), 18.16 KiB, done. Resolving deltas: 100% (40/40), done. And, after the download is complete, you'll see a populated directory tree that looks like this: └── apiman-quickstarts ├── echo-service │ ├── pom.xml │ ├── README.md │ └── src │ └── main │ ├── java │ │ └── io │ │ └── apiman │ │ └── quickstarts │ │ └── echo │ │ ├── EchoResponse.java │ │ └── EchoServlet.java │ └── webapp │ └── WEB-INF │ ├── jboss-web.xml │ └── web.xml ├── LICENSE ├── pom.xml ├── README.md ├── release.sh └── src └── main └── assembly └── dist.xml As we mentioned earlier in the post, the example service is very simple. The only action that the service performs is to echo back in responses the meta data in the REST (http://en.wikipedia.org/wiki/Representational_state_transfer) requests that it receives. Maven is used to build the service. To build the service into a deployable .war file, navigate to the directory into which you downloaded the service example: cd apiman-quickstarts/echo-service And then execute this maven command: mvn package As the service is being built into a .war file, you'll see output that looks like this: [INFO] Scanning for projects... [INFO] [INFO] Using the builder org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder with a thread count of 1 [INFO] [INFO] ------------------------------------------------------------------------ [INFO] Building apiman-quickstarts-echo-service 1.0.1-SNAPSHOT [INFO] ------------------------------------------------------------------------ [INFO] [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ apiman-quickstarts-echo-service --- [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] skip non existing resourceDirectory /jboss/local/redhat_git/apiman-quickstarts/echo-service/src/main/resources [INFO] [INFO] --- maven-compiler-plugin:2.5.1:compile (default-compile) @ apiman-quickstarts-echo-service --- [INFO] Compiling 2 source files to /jboss/local/redhat_git/apiman-quickstarts/echo-service/target/classes [INFO] [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ apiman-quickstarts-echo-service --- [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] skip non existing resourceDirectory /jboss/local/redhat_git/apiman-quickstarts/echo-service/src/test/resources [INFO] [INFO] --- maven-compiler-plugin:2.5.1:testCompile (default-testCompile) @ apiman-quickstarts-echo-service --- [INFO] No sources to compile [INFO] [INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ apiman-quickstarts-echo-service --- [INFO] No tests to run. [INFO] [INFO] --- maven-war-plugin:2.2:war (default-war) @ apiman-quickstarts-echo-service --- [INFO] Packaging webapp [INFO] Assembling webapp in [/jboss/local/redhat_git/apiman-quickstarts/echo-service/target/apiman-quickstarts-echo-service-1.0.1-SNAPSHOT] [INFO] Processing war project [INFO] Copying webapp resources [/jboss/local/redhat_git/apiman-quickstarts/echo-service/src/main/webapp] [INFO] Webapp assembled in [23 msecs] [INFO] Building war: /jboss/local/redhat_git/apiman-quickstarts/echo-service/target/apiman-quickstarts-echo-service-1.0.1-SNAPSHOT.war [INFO] WEB-INF/web.xml already added, skipping [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 1.184 s [INFO] Finished at: 2014-12-26T16:11:19-05:00 [INFO] Final Memory: 14M/295M [INFO] ------------------------------------------------------------------------ If you look closely, near the end of the output, you'll see the location of the .war file: /jboss/local/redhat_git/apiman-quickstarts/echo-service/target/apiman-quickstarts-echo-service-1.0.1-SNAPSHOT.war To deploy the service, we can copy the .war file to our WildFly server's "deployments" directory. After you copy the service's .war file to the deployments directory, you'll see output like this generated by the WildFly server: 16:54:44,313 INFO [org.jboss.as.server.deployment] (MSC service thread 1-7) JBAS015876: Starting deployment of "apiman-quickstarts-echo-service-1.0.1-SNAPSHOT.war" (runtime-name: "apiman-quickstarts-echo-service-1.0.1-SNAPSHOT.war") 16:54:44,397 INFO [org.wildfly.extension.undertow] (MSC service thread 1-16) JBAS017534: Registered web context: /apiman-echo 16:54:44,455 INFO [org.jboss.as.server] (DeploymentScanner-threads - 1) JBAS018559: Deployed "apiman-quickstarts-echo-service-1.0.1-SNAPSHOT.war" (runtime-name : "apiman-quickstarts-echo-service-1.0.1-SNAPSHOT.war") Make special note of this line of output: 16:54:44,397 INFO [org.wildfly.extension.undertow] (MSC service thread 1-16) JBAS017534: Registered web context: /apiman-echo This output indicates that the URL of the deployed example service is: [a href="http://localhost:8080/apiman-echo" style="text-decoration: none;"]http://localhost:8080/apiman-echo Remember, however, that this is the URL of the deployed example service if we access it directly. We'll refer to this as the "unmanaged service" as we are able to connect to the service directly, without going through the API Gateway. The URL to access the service through the API Gateway ("the managed service") at runtime will be different. Now that our example service is installed, it’s time to install and configure our client to access the server. Accessing the Example Service Through a Client There are a lot of options available when it comes to what we can use for a client to access our service. We’ll keep the client simple so that we can keep our focus on apiman and simply install a REST client into the FireFox browser. The REST Client FireFox add-on (http://restclient.net/) is available here: https://addons.mozilla.org/en-US/firefox/addon/restclient/versions/2.0.3 After you install the client into FireFox, you can access the deployed service using the URL that we just defined. If you execute a GET command, you’ll see output that looks like this: Now that our example service is built, deployed and running, it’s time to create the organizations for the service provider and the service consumer. The differences between the requirements of the two organizations will be evident in their apiman configuration properties. Creating Users for the Service Provider and Consumer Before we create the organizations, we have to create a user for each organization. We'll start by creating the service provider user. To do this, logout from the admin account in the API Manager UI. The login dialog will then be displayed. Select the "New user" Option and register the service provider user: Then, logout and repeat the process to register a new application developer user too: Now that the new users are registered we can create the organizations. Creating the Service Producer Organization To create the service producer organization, log back into the API Manager UI as the servprov user and select “Create a new Organization”: Select a name and description for the organization, and press “Create Organization”: And, here’s our organization: Note that in a production environment, users would request membership in an organization. The approval process for accepting new members into an organization would follow the organization's workflow, but this would be handled outside of the API Manager. For the purposes of our demonstration, we'll keep things simple. Configuring the Service, its Policies, and Plans To configure the service, we’ll first create a plan to contain the policies that we want applied by the API Gateway at runtime when requests to the service are made. To create a new plan, select the “Plans” tab. We’ll create a “gold” plan: Once the plan is created, we will add policies to it: apiman provides several OOTB policies. Since we want to be able to demonstrate a policy being applied, we’ll select a Rate Limiting Policy, and set its limit to a very low level. If our service receives more than 10 requests in a day, the policy should block all subsequent requests. So much for a “gold” level of service! After we create the policy and add it to the plan, we have to lock the plan: And, here is the finished, and locked plan: At this point, additional plans can be defined for the service. We’ll also create a “silver” plan, that will offer a lower level of service (i.e., a request rate limit lower than 10 per day) than the gold plan. Since the process to create this silver plan is identical to that of the gold plan, we’ll skip the screenshots. Now that the two plans are complete and locked, it’s time to define the service. We’ll give the service an appropriate name, so that providers and consumers alike will be able to run a query in the API Manager to find it. After the service is defined, we have to define its implementation. In the context of the API Manager, the API Endpoint is the service’s direct URL. Remember that the API Gateway will act as a proxy for the service, so it must know the service’s actual URL. In the case of our example service, the URL is: http://localhost:8080/apiman-echo The plans tab shows which plans are available to be applied to the service: Let’s make our service more secure by adding an authentication policy that will require users to login before they can access the service. Select the Policies tab, and then define a simple authentication policy. Remember the user name and password that you define here as we’ll need them later on when send requests to the service. After the authentication policy is added, we can publish the service to the API Gateway: And, here it is, the published service: OK, that finishes the definition of the service provider organization and the publication of the service. Next, we'll switch over to the service consumer side and create the service consumer organization and register an application to connect to the managed service through the proxy of the API Gateway. The Service Consumer Organization We'll repeat the process that we used to create the application development organization. Log in to the API Manager UI as the “appdev” user and create the organization: Unlike the process we used when we created the elements used by the service provider, the first step that we’ll take is to create a new application and then search for the service to be used by the application: Searching for the service is easy, as we were careful to set the service name to something memorable: Select the service name, and then specify the plan to be used. We’ll splurge and use the gold plan: Next, select “create contract” for the plan: Then, agree to the contract terms (which seem to be written in a strange form of Latin in the apiman 1.0 release): The last step is to register the application with the API Gateway so that the gateway can act as a proxy for the service: Congratulations! All the steps necessary to provide and consume the service are complete! There’s just one more step that we have to take in order for clients to be able access the service through the API Gateway. Remember the URL that we used to access the unmanaged service directly? Well, forget it. In order to access the managed service through the API Gateway acting as a proxy for other service we have to obtain the managed service's URL. In the API Manager UI, head on over to the "APIs" tab for the application, click on the the “>” character to the left of the service name. This will expose the API Key and the service’s HTTP endpoint in the API Gateway: In order to be able access the service through the API Gateway, we have to provide the API Key with each request. The API Key can be provided either through an HTTP Header (X-API-Key) or a URL query parameter. Luckily, the API Manager UI does the latter for us. Select the icon to the right of the HTTP Endpoint and this dialog is displayed: Copy the URL into the clipboard. We’ll need to enter this into the client in a bit. The combined API Key and HTTP endpoint should look something like this: http://localhost:8080/apiman-gateway/ACMEServices/echo/1.0?apikey=c374c202-d4b3-4442-b9e4-c6654f406e3d Accessing the Managed Service Through the apiman API Gateway, Watching the Policies at Runtime Thanks for hanging in there! The set up is done. Now, we can fire up the client and watch the policies in action as they are applied at runtime by the API Gateway, for example: Open the client, and enter the URL for the managed service (http://localhost:8080/apiman-gateway/ACMEServices/echo/1.0?apikey=c374c202-d4b3-4442-b9e4-c6654f406e3d) What happens first is that the authentication policy is applied and a login dialog is then displayed: Enter the username and password (user1/password) that we defined when we created the authentication policy to access the service. The fact that you are seeing this dialog confirms that you are accessing the managed service and are not accessing the service directly. When you send a GET request to the service, you should see a successful response: So far so good. Now, send 10 more requests and you will see a response that looks like this as the gold plan rate limit is exceeded: And there it is. Your gold plan has been exceeded. Maybe next time you’ll spend a little more and get the platinum plan! ;-) Wrap-up Let’s recap what we just accomplished in this demo: We installed apiman 1.0 onto a WildFly server instance. We used git to download and maven to build a sample REST client. As a service provider, we created an organization, defined policies based on service use limit rates and user authentication, and a plan, and assigned them to a service. As a service consumer, we searched for and found that service, and assigned it to an application. As a client, we accessed the service and observed how the API Gateway managed the service. And, if you note, in the process of doing all this, the only code that we had to write or build was for the client. We were able to fully configure the service, policies, plans, and the application in the API Manager UI. What’s Next? In this post, we’ve only scratched the surface of API Management with apiman. To learn more about apiman, you can explore its website here: http://www.apiman.io/ Join the project mailing list here: https://lists.jboss.org/mailman/listinfo/apiman-user And, better still, get involved! Contribute bug reports or feature requests. Write about your own experiences with apiman. Download the apiman source code, take a look around, and contribute your own additions. apiman 1.0 was just released, there’s no better time to join in and contribute! Acknowledgements The author would like to acknowledge Eric Wittmann for his (never impatient) review comments and suggestions on writing this post! Downloads Used in this Article REST Client (http://restclient.net/) FireFox Add-On - https://addons.mozilla.org/en-US/firefox/addon/restclient/versions/2.0.3 Echo service source code - https://github.com/EricWittmann/apiman-quickstarts apiman 1.0 - http://downloads.jboss.org/overlord/apiman/1.0.0.Final/apiman-distro-wildfly8-1.0.0.Final-overlay.zip WildFly 8.2.0 - http://download.jboss.org/wildfly/8.2.0.Final/wildfly-8.2.0.Final.zip Git - http://git-scm.com Maven - http://maven.apache.org References http://www.apiman.io/ apiman tutorial videos - https://vimeo.com/user34396826 http://www.softwareag.com/blog/reality_check/index.php/soa-what/what-is-api-management/ http://keycloak.jboss.org/

January 9, 2015

by Len DiMaggio

· 13,354 Views

How to Configure MySQL Metastore for Hive?

This is a step by step guide on How to Configure MySQL Metastore for Hive in place of Derby Metastore (Default).

January 8, 2015

by Saurabh Chhajed

· 81,646 Views · 4 Likes

Need Micro Caching? Memoization to the Rescue

Caching solves wide sort of performance problems. There are many ways to integrate caching into our applications. For example when we use Spring there is easy to use @Cacheable support. Quite easy but we still have to configure cache manager, cache regions, etc. Sometimes it's unfortunately like taking a sledgehammer to crack a nut. So what can we do to "go lighter"? There is a technique called memoization. Technically it's as easy as pie but true genius lies in simplicity. Model solution looks as follows: public Foo getValue() { if (storedValue == null) { storedValue = retrieveFoo(); } return storedValue; } As you can see there is no problem in implementing it manually, but as long as we remember about DRY rule we can use already implemented solutions which additionally provides thread safety. Pretty good idea is to use Guava library. // create Supplier memoizer = Suppliers.memoize(this::retrieveFoo); // and use Foo variable = memoizer.get(); Sometimes it's enough but what can we do if we need to specify TTL for our value? We have to store (cache) retrieved value only for few seconds and after exceeding defined duration get this value one more time? One more time we can use functionality provided by Guava. Supplier memoizer = Suppliers.memoizeWithExpiration(this::retrieveFoo, 5, TimeUnit.SECONDS); The above line builds memoizer with TTL = 5 seconds. As you can see - simple... but powerful :)

January 6, 2015

by Jakub Kubrynski

· 17,774 Views

On Heap vs Off Heap Memory Usage

I was recently asked about the benefits and wisdom of using off heap memory in Java. The answers may be of interest to others facing the same choices. Off heap memory is nothing special. The thread stacks, application code, NIO buffers are all off heap. In fact in C and C++, you only have unmanaged memory as it does not have a managed heap by default. The use of managed memory or "heap" in Java is a special feature of the language. Note: Java is not the only language to do this. new Object() vs Object pool vs Off Heap memory. new Object() Before Java 5.0, using object pools was very popular. Creating objects was still very expensive. However, from Java 5.0, object allocation and garbage cleanup was made much cheaper, and developers found they got a performance speed up and a simplification of their code by removing object pools and just creating new objects whenever needed. Before Java 5.0, almost any object pool, even an object pool which used objects provided an improvement, from Java 5.0 pooling only expensive objects obviously made sense e.g. threads, socket and database connections. Object pools In the low latency space it was still apparent that recycling mutable objects improved performance by reduced pressure on your CPU caches. These objects have to have simple life cycles and have a simple structure, but you could see significant improvements in performance and jitter by using them. Another area where it made sense to use object pools is when loading large amounts of data with many duplicate objects. With a significant reduction in memory usage and a reduction in the number of objects the GC had to manage, you saw a reduction in GC times and an increase in throughput. These object pools were designed to be more light weight than say using a synchronized HashMap, and so they still helped. Take this StringInterner class as an example. You pass it a recycled mutable StringBuilder of the text you want as a String and it will provide a String which matches. Passing a String would be inefficient as you would have already created the object. The StringBuilder can be recycled. Note: this structure has an interesting property that requires no additional thread safety features, like volatile or synchronized, other than is provided by the minimum Java guarantees. i.e. you can see the final fields in a String correctly and only read consistent references. public class StringInterner { private final String[] interner; private final int mask; public StringInterner(int capacity) { int n = Maths.nextPower2(capacity, 128); interner = new String[n]; mask = n - 1; } private static boolean isEqual(@Nullable CharSequence s, @NotNull CharSequence cs) { if (s == null) return false; if (s.length() != cs.length()) return false; for (int i = 0; i < cs.length(); i++) if (s.charAt(i) != cs.charAt(i)) return false; return true; } @NotNull public String intern(@NotNull CharSequence cs) { long hash = 0; for (int i = 0; i < cs.length(); i++) hash = 57 * hash + cs.charAt(i); int h = (int) Maths.hash(hash) & mask; String s = interner[h]; if (isEqual(s, cs)) return s; String s2 = cs.toString(); return interner[h] = s2; } } Off heap memory usage Using off heap memory and using object pools both help reduce GC pauses, this is their only similarity. Object pools are good for short lived mutable objects, expensive to create objects and long live immutable objects where there is a lot of duplication. Medium lived mutable objects, or complex objects are more likely to be better left to the GC to handle. However, medium to long lived mutable objects suffer in a number of ways which off heap memory solves. Off heap memory provides; Scalability to large memory sizes e.g. over 1 TB and larger than main memory. Notional impact on GC pause times. Sharing between processes, reducing duplication between JVMs, and making it easier to split JVMs. Persistence for faster restarts or replying of production data in test. The use of off heap memory gives you more options in terms of how you design your system. The most important improvement is not performance, but determinism. Off heap and testing One of the biggest challenges in high performance computing is reproducing obscure bugs and being able to prove you have fixed them. By storing all your input events and data off heap in a persisted way you can turn your critical systems into a series of complex state machines. (Or in simple cases, just one state machine) In this way you get reproducible behaviour and performance between test and production.A number of investment banks use this technique to replay a system reliably to any event in the day and work out exactly why that event was processed the way it was. More importantly, once you have a fix you can show that you have fixed the issue which occurred in production, instead of finding an issue and hoping this was the issue.Along with deterministic behaviour comes deterministic performance. In test environments, you can replay the events with realistic timings and show the latency distribution you expect to get in production. Some system jitter can't be reproduce esp if the hardware is not the same, but you can get pretty close when you take a statistical view. To avoid taking a day to replay a day of data you can add a threshold. e.g. if the time between events is more than 10 ms you might only wait 10 ms. This can allow you to replay a day of events with realistic timing in under an hour and see whether your changes have improved your latency distribution or not. By going more low level don't you lose some of "compile once, run anywhere"? To some degree this is true, but it is far less than you might think. When you are working closer the processor and so you are more dependant on how the processor, or OS behaves. Fortunately, most systems use AMD/Intel processors and even ARM processors are becoming more compatible in terms of the low level guarantees they provide. There is also differences in the OSes, and these techniques tend to work better on Linux than Windows. However, if you develop on MacOSX or Windows and use Linux for production, you shouldn't have any issues. This is what we do at Higher Frequency Trading. What new problems are we creating by using off heap? Nothing comes for free, and this is the case with off heap. The biggest issue with off heap is your data structures become less natural. You either need a simple data structure which can be mapped directly to off heap, or you have a complex data structure which serializes and deserializes to put it off heap. Obvious using serialization has its own headaches and performance hit. Using serialization thus much slower than on heap objects.In the financial world, most high ticking data structure are flat and simple, full of primitives which maps nicely off heap with little overhead. However, this doesn't apply in all applications and you can get complex nested data structures e.g. graphs, which you can end up having to cache some objects on-heap as well.Another problem is that the JVM limits how much of the system you can use. You don't have to worry about the JVM overloading the system so much. With off heap, some limitations are lifted and you can use data structures much larger than main memory, and you start having to worry about what kind of disk sub-system you have if you do this. For example, you don't want to be paging to a HDD which has 80 IOPS, instead you are likely to want an SSD with 80,000 IOPS (Input/Ouput Operations per Second) or better i.e. 1000x faster. How does OpenHFT help? OpenHFT has a number of libraries to hide the fact you are really using native memory to store your data. These data structures are persisted and can be used with little or no garbage. These are used in applications which run all day without a minor collection Chronicle Queue - Persisted queue of events. Supports concurrent writers across JVMs on the same machine and concurrent readers across machines. Micro-second latencies and sustained throughputs in the millions of messages per second. Chronicle Map - Native or Persisted storage of a key-value Map. Can be shared between JVMs on the same machine, replicated via UDP or TCP and/or accessed remotely via TCP. Micro-second latencies and sustained read/write rates in the millions of operations per second per machine. Thread Affinity - Binding of critical threads to isolated cores or logical cpus to minimise jitter. Can reduce jitter by a factor of 1000. Which API to use? If you need to record every event -> Chronicle Queue If you only need the latest result for a unique key -> Chronicle Map If you care about 20 micro-second jitter -> Thread Affinity Conclusion Off heap memory can have challenges but also come with a lot of benefits. Where you see the biggest gain and compares with other solutions introduced to achieve scalability. Off heap is likely to be simpler and much faster than using partitioned/sharded on heap caches, messaging solutions, or out of process databases. By being faster, you may find that some of the tricks you need to do to give you the performance you need are no longer required. e.g. off heap solutions can support synchronous writes to the OS, instead of having to perform them asynchronously with the risk of data loss.The biggest gain however, can be your startup time, giving you a production system which restarts much faster. e.g. mapping in a 1 TB data set can take 10 milli-seconds, and ease of reproducibility in test by replaying every event in order you get the same behaviour every time. This allows you to produce quality systems you can rely on.

January 2, 2015

by Peter Lawrey

· 109,457 Views · 8 Likes