Data Engineering Resources

Mounting an EBS Volume to Docker on AWS Elastic Beanstalk

Mounting an EBS volume to a Docker instance running on Amazon Elastic Beanstalk (EB) is surprisingly tricky. The good news is that it is possible. I will describe how to automatically create and mount a new EBS volume (optionally based on a snapshot). If you would prefer to mount a specific, existing EBS volume, check out leg100’s docker-ebs-attach (which uses the AWS API to mount the volume); you can use it either in a multi-container setup or by including the relevant parts in your own Dockerfile.

The problem with EBS volumes is that, if I am correct, a volume can only be mounted to a single EC2 instance, and thus doesn’t play well with EB’s autoscaling. That is why EB supports only creating and mounting a fresh volume for each instance.

Why would you want to use an auto-created EBS volume? You can already use a Docker VOLUME to mount a directory on the host system’s ephemeral storage to make data persistent across Docker restarts/redeploys. The only advantage of EBS is that it survives restarts of the EC2 instance, but that is something that, I suppose, happens rarely. I suspect that in most cases EB actually creates a new EC2 instance and then destroys the old one. One possible benefit of an EBS volume is that you can take a snapshot of it and use that to launch future instances. I’m now inclined to believe that a better solution in most cases is to set up automatic backup to and restore from S3, e.g. using duplicity with its S3 backend (as I do for my NAS).

Anyway, here is how I got EBS volume mounting working.
There are 4 parts to the solution:

1. Configure EB to create an EBS mount for your instances
2. Add custom EB commands to format and mount the volume upon first use
3. Restart the Docker daemon after the volume is mounted so that it will see it (see this discussion)
4. Configure Docker to mount the (mounted) volume inside the container

1-3.: .ebextensions/01-ebs.config:

    # .ebextensions/01-ebs.config
    commands:
      01format-volume:
        command: mkfs -t ext3 /dev/sdh
        test: file -sL /dev/sdh | grep -v 'ext3 filesystem'
        # ^ prints '/dev/sdh: data' if not formatted
      02attach-volume:
        ### Note: The volume may be renamed by the Kernel, e.g. sdh -> xvdh but
        # /dev/ will then contain a symlink from the old to the new name
        command: |
          mkdir /media/ebs_volume
          mount /dev/sdh /media/ebs_volume
          service docker restart # We must restart Docker daemon or it won't see the new mount
        test: sh -c "! grep -qs '/media/ebs_volume' /proc/mounts"
    option_settings:
      # Tell EB to create a 100GB volume and mount it to /dev/sdh
      - namespace: aws:autoscaling:launchconfiguration
        option_name: BlockDeviceMappings
        value: /dev/sdh=:100

4.: Dockerrun.aws.json and Dockerfile:

Dockerrun.aws.json: mount the host’s /media/ebs_volume as /var/easydeploy/share inside the container:

    {
      "AWSEBDockerrunVersion": "1",
      "Volumes": [
        {
          "HostDirectory": "/media/ebs_volume",
          "ContainerDirectory": "/var/easydeploy/share"
        }
      ]
    }

Dockerfile: Tell Docker to use a directory on the host system as /var/easydeploy/share, either a randomly generated one or the one given via the -v volume option to docker run:

    ...
    VOLUME ["/var/easydeploy/share"]
    ...
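The `test:` guard on the attach command is what keeps the mount step from running twice, so it may help to see the same check spelled out. The following is only a sketch in Python of that guard logic; the function names are hypothetical, and unlike the `grep` one-liner (which matches the path anywhere on a line) it compares the mount-point field exactly.

```python
def is_mounted(mount_point: str, mounts_text: str) -> bool:
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def should_mount(mount_point: str, mounts_path: str = "/proc/mounts") -> bool:
    """Mirror the `test:` guard: run the mount command only when not yet mounted."""
    try:
        with open(mounts_path) as f:
            return not is_mounted(mount_point, f.read())
    except OSError:
        # Like `grep -s`, treat an unreadable file as "no match", so mounting proceeds.
        return True
```

Calling `should_mount("/media/ebs_volume")` on the instance would return False once the volume is mounted, which corresponds to EB skipping the `02attach-volume` command on later deploys.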

June 3, 2015

by Jakub Holý

· 14,785 Views

Ecosystem of Hadoop Animal Zoo

Hadoop is best known for MapReduce and its distributed file system (HDFS). More recently, other productivity tools developed on top of these have formed a complete Hadoop ecosystem. Most of the projects are hosted under the Apache Software Foundation. The Hadoop ecosystem projects are listed below.

Hadoop Common
A set of components and interfaces for distributed file systems and I/O (serialization, Java RPC, persistent data structures). http://hadoop.apache.org/

[Figure: Hadoop ecosystem]

HDFS
A distributed file system that runs on large clusters of commodity hardware. The Hadoop Distributed File System (HDFS) was renamed from NDFS. It is a scalable data store that holds semi-structured, unstructured and structured data. http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfsuserguide.html http://wiki.apache.org/hadoop/hdfs

MapReduce
MapReduce is the distributed, parallel computing programming model for Hadoop, inspired by Google's MapReduce research paper. Hadoop includes an implementation of the MapReduce programming model. In MapReduce there are two phases, not surprisingly map and reduce. To be precise, in between the map and reduce phases there is another phase called sort and shuffle. The JobTracker, typically running on the master (NameNode) machine, manages the work on the other cluster nodes. MapReduce programs can be written in Java. If you like SQL or other non-Java languages, you are still in luck: you can use a utility called Hadoop Streaming (and, for SQL, Hive, described below). http://wiki.apache.org/hadoop/hadoopmapreduce

Hadoop Streaming
A utility that lets you write MapReduce code in many languages, such as C, Perl, Python, C++ and Bash. Examples include a Python mapper and an AWK reducer. http://hadoop.apache.org/docs/r1.2.1/streaming.html

Avro
A serialization system for efficient, cross-language RPC and persistent data storage. Avro is a framework for performing remote procedure calls and data serialization. In the context of Hadoop, it can be used to pass data from one program or language to another, e.g. from C to Pig.
It is particularly suited for use with scripting languages such as Pig, because data is always stored with its schema in Avro. http://avro.apache.org/

Apache Thrift
Apache Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code that can be used to easily build RPC clients and servers that communicate seamlessly across programming languages. Instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business. http://thrift.apache.org/

Hive and Hue
If you like SQL, you will be delighted to hear that you can write SQL and Hive will convert it to a MapReduce job. But you don't get a full ANSI SQL environment. Hue gives you a browser-based graphical interface to do your Hive work. Hue features a file browser for HDFS, a job browser for MapReduce/YARN, an HBase browser, and query editors for Hive, Pig, Cloudera Impala and Sqoop2. It also ships with an Oozie application for creating and monitoring workflows, a ZooKeeper browser and an SDK.

Pig
A high-level data flow programming language and execution environment for MapReduce coding. The Pig language is called Pig Latin. You may find the naming conventions somewhat unconventional, but you get incredible price-performance and high availability. https://pig.apache.org/

Jaql
Jaql is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. As its name implies, a primary use of Jaql is to handle data stored as JSON documents, but Jaql can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within Jaql" capability lets programmers work with structured SQL data while employing a JSON data model that's less restrictive than its Structured Query Language counterparts. 1. Jaql in Google Code 2. What is Jaql?
by IBM

Sqoop
Sqoop provides bi-directional data transfer between Hadoop (HDFS) and your favorite relational database. For example, you might be storing your app data in a relational store such as Oracle; when you want to scale your application with Hadoop, you can migrate the Oracle database data to Hadoop HDFS using Sqoop. http://sqoop.apache.org/

Oozie
Manages Hadoop workflows. This doesn't replace your scheduler or BPM tooling, but it will provide if-then-else branching and control with Hadoop jobs. https://oozie.apache.org/

ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building highly scalable applications. It is used to manage synchronization for the cluster. http://zookeeper.apache.org/

HBase
Based on Google's Bigtable, HBase "is an open-source, distributed, versioned, column-oriented store" that sits on top of HDFS. It is a super-scalable key-value store that works very much like a persistent hash map (Python developers can think of a dictionary). It is not a conventional relational database; it is a distributed, column-oriented database. HBase uses HDFS for its underlying storage. It supports both batch-style computations using MapReduce and point queries for random reads. https://hbase.apache.org/

Cassandra
A column-oriented NoSQL data store that offers scalability and high availability without compromising on performance. It is a perfect platform for commodity hardware and cloud infrastructure. Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching. http://cassandra.apache.org/

Flume
A real-time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. Flume "channels" data between "sources" and "sinks", and its data harvesting can be either scheduled or event-driven.
Possible sources for Flume include Avro, files, and system logs, and possible sinks include HDFS and HBase. http://flume.apache.org/

Mahout
Machine learning for Hadoop, used for predictive analytics and other advanced analysis. There are currently four main groups of algorithms in Mahout:

- recommendations, a.k.a. collaborative filtering
- classification, a.k.a. categorization
- clustering
- frequent item set mining, a.k.a. parallel frequent pattern mining

Mahout is not simply a collection of pre-existing algorithms; many machine learning algorithms are intrinsically non-scalable; that is, given the types of operations they perform, they cannot be executed as a set of parallel processes. Algorithms in the Mahout library belong to the subset that can be executed in a distributed fashion. http://en.wikipedia.org/wiki/list_of_machine_learning_algorithms https://www.coursera.org/course/machlearning https://mahout.apache.org/

FUSE
Makes HDFS look like a regular file system, so that you can use ls, rm, cd, etc. directly on HDFS data.

Whirr
Apache Whirr is a set of libraries for running cloud services. Whirr provides:

- A cloud-neutral way to run services. You don't have to worry about the idiosyncrasies of each provider.
- A common service API. The details of provisioning are particular to the service.
- Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.

You can also use Whirr as a command-line tool for deploying clusters. https://whirr.apache.org/

Giraph
An open source graph processing API, like Pregel from Google. https://giraph.apache.org/

Chukwa
Chukwa, an incubator project on Apache, is a data collection and analysis system built on top of HDFS and MapReduce. Tailored for collecting logs and other data from distributed monitoring systems, Chukwa provides a workflow that allows for incremental data collection, processing and storage in Hadoop.
It is included in the Apache Hadoop distribution as an independent module. https://chukwa.apache.org/

Drill
Apache Drill, an incubator project on Apache, is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google's Dremel system, which is available as an IaaS service called Google BigQuery. One explicitly stated design goal is for Drill to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds. http://incubator.apache.org/drill/

Impala (Cloudera)
Released by Cloudera, Impala is an open-source project which, like Apache Drill, was inspired by Google's paper on Dremel; the purpose of both is to facilitate real-time querying of data in HDFS or HBase. Impala uses an SQL-like language that, though similar to HiveQL, is currently more limited than HiveQL. Because Impala relies on the Hive metastore, Hive must be installed on a cluster in order for Impala to work. The secret behind Impala's speed is that it "circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs." (source: Cloudera) http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html http://training.cloudera.com/elearning/impala/
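To make the map, sort-and-shuffle, and reduce phases from the MapReduce and Hadoop Streaming entries more concrete, here is a minimal word-count sketch in Python. It is not taken from the Hadoop documentation: the function names are hypothetical, and `run_locally` only imitates the shuffle with an in-memory sort so the flow can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Map phase: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Reduce phase: sum the counts for each key.

    Expects records already sorted by key, which is what the
    sort-and-shuffle phase provides to each reducer.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate map -> sort/shuffle -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))
```

In a real streaming job the mapper and reducer would be separate scripts reading standard input and writing standard output, passed to the streaming utility through its `-mapper` and `-reducer` options; Hadoop performs the sort between them.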

June 3, 2015

by Umashankar Ankuri

· 23,891 Views · 3 Likes

The Myth of Asynchronous JDBC

I keep seeing people (especially in the Scala/Typesafe world) posting about async JDBC libraries. STOP IT! Under the current APIs, async JDBC belongs in a realm with Unicorns, Tiger Squirrels, and 8' spiders. While you might be able to move the blocking operations and queue requests and keep your "main" worker threads from blocking, JDBC is synchronous. At some point, somewhere, there's going to be a thread blocking waiting for a response.

It's frustrating to see so many folks hyping this and muddying the waters. Unless you write your own client for a DBMS and have a DBMS that can multiplex calls over a single connection (or use some other strategy to enable this capability), DB access is going to block. It's not impossible to make the calls completely async, but nobody's built it yet. Yes, I know ajdbc is taking a stab at this capability, but even IT uses a thread pool for the blocking calls (by default). Someday we'll have async database access (it's not impossible... well, it IS with the current JDBC specification), but no general-purpose RDBMS has this right now.

The primary problems with the hype/misdirection are that #1 inexperienced programmers don't understand that they've just moved the problem, and will use the APIs and wonder why the system is so slow (oh, I have 1000 DB calls queued up waiting for my single DB thread to process the work), and #2 it betrays a serious misunderstanding of the difference between async JDBC (not possible per current spec) and async DB access (totally possible/doable, but rare in the wild).
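The "you've just moved the problem" point can be shown without any database at all. The sketch below uses Python rather than Java purely for brevity; `fake_blocking_query` is a hypothetical stand-in for a synchronous driver call, and the 20ms latency is an assumed figure. Submitting work to the pool returns immediately, but the calls still run one after another on the pool's thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor

QUERY_SECONDS = 0.02  # assumed latency of one blocking database call


def fake_blocking_query(n: int) -> int:
    """Hypothetical stand-in for a synchronous driver call; it blocks its thread."""
    time.sleep(QUERY_SECONDS)
    return n


def run_queued(calls: int, db_threads: int) -> float:
    """Queue `calls` queries on a pool and return the seconds needed to drain it."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=db_threads) as pool:
        # submit() does not block the caller...
        futures = [pool.submit(fake_blocking_query, i) for i in range(calls)]
        # ...but every result still waits for a pool thread to sit through the call.
        results = [f.result() for f in futures]
    assert results == list(range(calls))
    return time.perf_counter() - start
```

With a single DB thread, `run_queued(1000, 1)` needs at least 1000 times the per-call latency, which is the queue of 1000 calls described above. Adding threads shortens the wait only by blocking more threads.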

May 29, 2015

by Michael Mainguy

· 17,813 Views · 2 Likes

What is API First?

API First is a design strategy for a company’s entire product line, where APIs are the basis of every product instead of being a separate side product. To understand why API First is a good idea, you first have to understand how existing APIs are generally created. There are various models for creating APIs, but generally the API accesses the backend directly, in parallel with the main product. This means that if you want to make new applications, you have to either write more systems that access the backend or extend the API so that it supports the alternative products as well. Additionally, an API is frequently considered to be an “extra, nice to have” product rather than an important member of the product ecosystem. This creates serious problems with resource contention, as the API competes against revenue-producing products for engineering resources.

APIs as a Side Product

Think first about an example of the “usual” setup for APIs. As you can imagine, the API is separate from the main product, and even if all secondary products run off of the API, there will be a mismatch between the two experiences. Keeping everything entirely consistent is theoretically doable, but it is a lot more work than simply restructuring the infrastructure to treat the API as a piece of core technology for the system. There is some thought that separating out the API from the main product can protect that product from attacks via the API, but truthfully the backend system is the critical piece, and separating out the clients in this way simply makes it harder to triage and fix problems that might occur in the product or the API. Keeping everything in the same pipeline helps assure consistency and reliability, and as your product grows, it makes scaling much easier.

Imagine that a company has a product for creating and updating contact information for users. The backend returns lists of users based on a database call from the client.
In this case, the main product communicates directly with the backend system, and is likely to retrieve data in a different way as a result. When the backend system team adds a new “Location” field feature, it doesn't show up in the API until an engineer has time to add it there. This means that there will be a lag between the addition of this field to the main product and its availability within the API, until and unless someone takes the time to write essentially duplicate code to retrieve the new field. There’s a lot of technical debt incurred when you have multiple systems trying to reproduce a single interface. There’s not likely to be a good process for comparing this API change to other APIs from the company, resulting in inconsistent results. In this case, the mobile app wouldn’t allow or see locations, resulting in developer dissatisfaction and customer confusion and irritation. Once you’ve established how you want your users to interact with your system, it’s best to support that everywhere.

When you have multiple teams creating products without a shared vision, you also tend to have poor communication between those teams. This can lead to bug fixes in one code base but not in the other, or inconsistencies between the items available from the system depending on which interface is being used.

API First Model

The other possibility is an API First model. All of the products run off of the same interface. This ensures that each device, application or integration has the same resources available to use. These resources will be consistent across the entire product line. Note that just because an API resource is available within the system, you don’t have to expose it to all the world: you can decide which of the API resources are internal, partner only, or open to anyone. It’s still a great idea to have your API ready, because when a major partner asks for access to some specific resources to support a use case, you have it ready to go.
API First also encourages communication between your backend team and each of the client engineering teams. Understanding use cases at a high level then helps create APIs that are usable for the use cases you understand up front, and more likely to support future use cases that come up. Once you’re creating the API as a larger team, you’ll find many places where different teams offer complementary resources, adding up to a much better structured system.

API First makes a lot of sense for any company: as soon as you have more than one product (for most companies that’s going to be a website and a mobile application) you need to have a layer to protect the clients from changes on the server. A well-documented interface into the system, crafted with specific use cases in mind, allows you the freedom to change things around on the backend, as long as the interface doesn’t change. Integrated testing is easier, and the products running on the API will by their nature test the integrity of the system on a regular basis.

You can learn much more about APIs in my book Irresistible APIs, available from Manning Publications, Inc.
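The contact-information example can be sketched in a few lines to show what "all of the products run off of the same interface" buys you. This is an illustrative Python sketch only: `Contact`, `ContactsApi`, and the two render functions are hypothetical names, not part of any real product. Because both clients read the same resource, the new "Location" field is added once and is visible to both.

```python
from dataclasses import dataclass, asdict


@dataclass
class Contact:
    name: str
    email: str
    location: str = ""  # the new "Location" field, added once on the API side


class ContactsApi:
    """Hypothetical single interface that every product goes through."""

    def __init__(self, contacts):
        self._contacts = list(contacts)

    def list_contacts(self):
        # Every client receives the same resource representation.
        return [asdict(c) for c in self._contacts]


def render_web(api):
    """Main product: consumes the API instead of talking to the backend directly."""
    return [f"{c['name']} <{c['email']}> ({c['location']})" for c in api.list_contacts()]


def render_mobile(api):
    """Mobile app: same resources, different presentation."""
    return [f"{c['name']}, {c['location']}" for c in api.list_contacts()]
```

In the side-product setup, by contrast, `render_web` would query the backend itself, and the mobile client would lag behind until someone duplicated the change in the API.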

May 29, 2015

by Kirsten Hunter

· 6,654 Views · 1 Like

Mechanical Sympathy: Understanding the Hardware Makes You a Better Developer

[This article appears in the DZone Guide to Performance & Monitoring – 2015 Edition. For additional information including insight from industry experts and luminaries, performance statistics and strategies, and an overview of how modern companies are handling application monitoring, download the guide below.]

I have spent the last few years of my career working in the field of high-performance, low-latency systems. Two observations struck me when I returned to working on the metal:

- Many of the assumptions that programmers make outside this esoteric domain are simply wrong.
- Lessons learned from high-performance, low-latency systems are applicable to many other problem domains.

Some of these assumptions are pretty basic. For example, which is faster: storing to disk or writing to a cluster pair on your network? If, like me, you survived programming in the late 20th century, you know that the network is slower than disk, but that’s actually wrong! It turns out that it’s much more efficient to use clustering as a backup mechanism than to save everything to disk.

Other assumptions are more general and more damaging. A common meme in our industry is “avoid pre-optimization.” I used to preach this myself, and now I’m very sorry that I did.

These are just a few supposedly obvious principles that truly high-performance systems call into question. For many developers, these rules of thumb appear reasonably inviolable in everyday practice. But as performance demands grow increasingly strict, it becomes proportionally important for developers to understand exactly how systems work, at both the abstract, procedural level and the level of the metal itself.

MOTIVATION: THE REPERCUSSIONS OF INEFFICIENT SOFTWARE

First, let’s think about why it’s important to transcend crude oversimplifications like “disk I/O is faster than network I/O.” One motivation is a bit negative.
I can think of no other field of human endeavor that tolerates the levels of inefficiency that are normal in our industry. My experience has been that, for most systems, any performance specialist can improve performance ten-fold quite easily. This is because most applications are hundreds, thousands, even tens of thousands of times less efficient than they could be.

Modern hardware is phenomenally capable, but we software folks regularly underestimate the strides hardware manufacturers have made. As a result, we often fail to take advantage of new hardware capabilities and miss significant shifts in the locations of common performance bottlenecks.

You may say this doesn’t matter; after all, what is all that hardware performance for, if not to make it easier to write software? And then I’ll reply: it’s for lots of things.

Here’s one very concrete reason. The energy consumption from data centers constitutes a significant fraction of the CO2 being pumped into our atmosphere [1]. If most software is more than 10x less efficient than it could be, then that could mean 10x more CO2. It also means 10x the capital cost in hardware required to run all that inefficient software.

Perhaps more important is that performance is about more than just efficiency for its own sake. Performance is also an important enabler of innovation. The ability to pack the data representing 1000 songs onto the tiny hard disk in the first generation iPod made it possible for Apple to revolutionize the music industry. For an even more basic example, graphical user interfaces only became feasible when the hardware was fast enough to “waste” all that time drawing pretty pictures. High-performance hardware enabled these innovations; the software simply had to follow.

Then there’s the opportunity cost.
What could we be doing with all the spare capacity of our astonishingly fast hardware and enormously vast storage if we weren’t wasting it on inefficient software? In the problem domains where I’ve worked, the idea of mechanical sympathy has helped enormously. Let’s examine this idea a little closer.

KEY CONCEPT: MECHANICAL SYMPATHY

The term Mechanical Sympathy was coined by racing driver Jackie Stewart and applied to software by Martin Thompson. Jackie Stewart said, “You don’t have to be an engineer to be a racing driver, but you do have to have Mechanical Sympathy.” He meant that understanding how a car works makes you a better driver. This is just as true for writing code. You don’t need to be a hardware engineer, but you do need to understand how the hardware works and take that into consideration when you design software.

Let’s take something simple like writing a file to disk. Disks are random access, right? Well, not really. Disks work by encoding data in sectors and addressing them in segments of the disk. When you read or write data to disk, you need to wait for the heads to move to the correct physical location. If you do this randomly, then you will incur performance penalties as the heads are physically moved across the surface of the disk and you wait for the spin of the disk to place the sector you want beneath the read/write heads. This averages out to about 3ms per seek for the heads, and about 3ms rotational latency. That makes for an average total of about 6ms per seek! This seek time is dramatically slow when compared to the electronics. Electrons beat spinning rust every time!

And there are even more specific reasons why “random access” is a poor rule of thumb. In fact, modern disks are optimized to stream data so that they can play movies and audio. If you understand that, and treat your file storage as a serial, block device rather than random access, you will get dramatically higher performance.
In this example, the difference can be two orders of magnitude (200MB/s is a good working figure for the upper bound).

REACHING HIGH PERFORMANCE: DO EXPERIMENTS, DO ARITHMETIC

Abstraction is important. It is essential that we are, to some degree, insulated from the complexity of the devices and systems that form the platform upon which our software executes. However, many of the abstractions that we take for granted are very poor and leaky. The overall system complexity gets amplified when we build abstraction on top of abstraction to hide the problems caused by the leakiness, resulting in the dramatic performance differences we see between high-performance and “normal” systems.

You don’t need to model or measure your entire system to tackle its complexity. The starting point is to do some experimenting to understand your theoretical maximums. You should have a rough model of the following:

- How much data you can write to disk in one second
- How many messages your code can process in one second
- How much data you can send or receive across your network in one second

The idea here is to measure your actual performance against the theoretically possible performance of your system within the limits imposed by the underlying hardware.

Let’s consider a specific example. I recently talked with a developer who believed he was working on a high-performance system. He told me that his system was working at 10 transactions per second. As of mid-2015, modern processors on commodity hardware can easily perform 2-3 billion instructions per second. So at 2-3 billion instructions per second, the developer was limited to 200-300 million instructions per transaction ([2-3G instructions/sec] / [10 transactions/sec] = 200-300M instructions/transaction) to achieve his goals.
Of course, his application isn’t responsible for all of these (the operating system is chewing some cycles too), but 200 million instructions is an awful lot of work. You can write an effective chess player in 672 bytes, as David Horne has shown.

Okay, so maybe the bottleneck was I/O. Perhaps this developer’s application was disk bound. That’s unlikely: modern hard disks found in commodity hardware can transfer data at phenomenal rates. A moderate transfer rate from disk to a buffer in memory is about 100MB/s. If this is the limit, he must be pushing more than 10MB per transaction between disk and memory. You can represent a lot of information in 10 million bytes!

Well, perhaps the problem was the network. Probably not: 10Gbit/s networks are now on the low end. A 10Gbit/s network can transmit roughly 1GB of data per second. This means that each of our misguided developer’s 10 transactions per second is occupying 1/10th of a GB, or approximately 100MB, more than 10 times the throughput of our disks! The hardware wasn’t the problem.

HIGH PERFORMANCE SIMPLICITY I: MODEL THE PROBLEM DOMAIN

One of the great myths about high-performance programming is that high-performance solutions are more complex than “normal” solutions. This is just not true. By definition, a high-performance solution must do the greatest amount of work in the fewest instructions. Complex solutions simply don’t last long in high-performance systems, because complexity is where performance bottlenecks hide.

The best programmers I know achieve the simplicity demanded by high-performance systems by modeling the problem domain. A software simulation of the problem domain is the best way to come up with an elegant solution.

HIGH PERFORMANCE SIMPLICITY II: SIMPLIFY THE CODE

If you follow the lead of the problem domain, you tend to end up with smaller, simpler classes with clearer relationships.
If you follow strong separation of concerns as a guiding principle, then your design will push you in the direction of cleaner, simpler code. The cardinal rule of object-oriented simplicity is “one class, one thing; one method, one thing.”

Modern compilers are extremely effective at optimizing code, but they are best at optimizing simple code. If you write 300-line methods with multiple for-loops, each containing several nested if-conditions throwing exceptions all over the place and returning from multiple points, then the optimizer will simply give up. If you write small, simple, easy to read methods, they are easier to test and easier for the optimizer to understand, which results in significant improvements to performance.

THE KEY TO HIGH PERFORMANCE: MINIMIZE INSTRUCTIONS, MINIMIZE DATA

Fundamentally, developing high-performance systems is simple: minimize the number of instructions being processed and the amount of data being shunted around. You achieve this by modeling the problem domain and eliminating nonessential complexity.

For software developers, “Mechanically Sympathizing” with modern high-performance hardware will help you understand where you’re doing things wrong. It can also point you in the direction of better measurement and profiling, which will help you understand why your code is not performing to its theoretical maximum.

So why not try to save your company some money, work in a simplified codebase, and maybe help reduce the carbon footprint of our data centers, all at the same time?

[1] https://energy.stanford.edu/news/data-centers-can-slash-co2-emissions-88-or-more
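The "do arithmetic" advice from the section on experiments is easy to try. The sketch below reworks the article's own figures for the 10-transactions-per-second system and for random versus streaming disk access. The 4KB block size is an assumption of mine, not a figure from the article, and the helper name is hypothetical.

```python
def per_transaction_budget(capacity_per_second: float, transactions_per_second: float) -> float:
    """How much of a per-second hardware capacity each transaction could consume."""
    return capacity_per_second / transactions_per_second


TPS = 10  # the developer's reported throughput

# Figures quoted in the article
instructions = per_transaction_budget(2e9, TPS)   # low end of 2-3 billion instructions/s
network_bytes = per_transaction_budget(1e9, TPS)  # roughly 1GB/s over a 10Gbit/s network

# Disk: head seek plus rotational latency for every random access
SEEK_MS, ROTATION_MS = 3.0, 3.0
random_accesses_per_second = 1000.0 / (SEEK_MS + ROTATION_MS)

BLOCK_BYTES = 4096  # assumption (not from the article): one 4KB block per random access
random_bytes_per_second = random_accesses_per_second * BLOCK_BYTES
sequential_bytes_per_second = 200e6  # the article's 200MB/s working figure
streaming_advantage = sequential_bytes_per_second / random_bytes_per_second

print(f"{instructions:,.0f} instructions available per transaction")
print(f"{network_bytes:,.0f} network bytes available per transaction")
print(f"streaming is about {streaming_advantage:.0f}x faster than random 4KB access")
```

Under these assumptions the streaming advantage comes out in the hundreds, which is in line with the "two orders of magnitude" quoted earlier. A different block size shifts the ratio, so treat the number as a rough model rather than a measurement.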

May 24, 2015

by Dave Farley

· 56,940 Views · 58 Likes

Server and Storage I/O Benchmark Tools: Microsoft Diskspd (Part I)

A key to improving performance is benchmarking. Read about Microsoft Diskspd's tools for storage and server benchmarking, and boost your I/O performance.

May 22, 2015

by Greg Schulz

· 14,764 Views

Efficient Cassandra Write Pattern for Micro-Batching

The best way to write to a Cassandra cluster is with concurrent asynchronous writes. In cases where data exhibits strong temporal locality, speed can be improved.

May 20, 2015

by John Georgiadis

· 35,046 Views · 1 Like

How To Set Up a Tomcat, Apache and mod_jk Cluster

In this article I will go through a common set-up for a small production environment: a single-tier, load-balanced application server cluster.

Overview

A high-level overview of what we will be doing:

- Downloading and installing Apache HTTP server and mod_jk
- Downloading Tomcat
- Downloading Java
- Configuring two local Tomcat servers
- Clustering the two Tomcat servers
- Configuring Apache to use mod_jk to forward requests to Tomcat
- Deploying an application to the Tomcat servers that tests our set-up

Introduction

What is Apache? Apache is an HTTP server. What is mod_jk? It is an Apache module that allows AJP communication between Apache and a back-end application server like Tomcat. I am running this on Ubuntu 14.04 LTS installed on a dual-boot PC with Windows 7.

Download Apache2

We are going to use Ubuntu's APT package maintenance system to obtain and install Apache2.

    sudo apt-get install apache2

This will install in /etc/apache2.

Download and install mod_jk

The mod_jk module is not included in the Apache2 download, so it must be obtained and installed separately. The installation requires that the mod_jk module is visible to Apache and configured so that Apache knows where to look for it and what to do with the requests you want to proxy.

    sudo apt-get install libapache2-mod-jk

This will install in /etc/libapache2-mod-jk; two files will also have been added to the /etc/apache2/mods-available folder.

Downloading and installing Tomcat 8

At the time of writing, Tomcat 8 does not have a package in APT, so you must download the binaries from the Tomcat website: http://tomcat.apache.org/download-80.cgi. Select the appropriate binary distribution and extract it as follows.

    tar xvzf apache-tomcat-8.0.5.tar.gz

We need two copies of the Tomcat server to be load balanced. I created two directories in the /opt/ location, /opt/tomcat-server1/ and /opt/tomcat-server2/, and copied Tomcat into each one.
Download and install Java

Download Java from APT as follows:

    apt-get install openjdk-7-jdk

and set JAVA_HOME in .bashrc:

    vim ~/.bashrc
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Configure two local Tomcat servers

We will edit only the server.xml of the server2 installation of Tomcat. We need to change its port numbers to avoid conflicts with server1, and we comment out the HTTP Connector, as we only want the web application to be accessible through the load balancer. Here is my server2 Tomcat server.xml configuration.

Configure mod_jk

Load balancing is configured in the workers.properties file, located in /etc/libapache2-mod-jk/, where workers represent actual or virtual workers. We will define two actual workers, which map to the Tomcat servers, and two virtual workers.

In the worker.list property I have defined two virtual workers, status and loadbalancer; I will refer to these later in the Apache configuration. The workers for each server are defined using values from the server.xml configuration files. I used the port values of the AJP connectors, and I have included an lbfactor that sets the preference the load balancer will show for that server.

Finally we define the virtual workers. The loadbalancer worker is set to type lb, and the workers that represent the Tomcat servers are listed in its balance_workers property. The status worker only needs to be set to type status.

    worker.list=loadbalancer,status

    worker.server1.port=8009
    worker.server1.host=localhost
    worker.server1.type=ajp13

    worker.server2.port=9009
    worker.server2.host=localhost
    worker.server2.type=ajp13

    worker.server1.lbfactor=1
    worker.server2.lbfactor=1

    worker.loadbalancer.type=lb
    worker.loadbalancer.balance_workers=server1,server2

    worker.status.type=status

Ensure that you remove any other worker configurations that are not being used.
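With more than two Tomcat instances, writing these properties by hand gets repetitive. The following is only a sketch in Python of how the same file could be generated; the helper name render_workers and its (name, host, port) tuple format are made up for illustration and are not part of mod_jk. It emits the same property names as the configuration above, grouping lbfactor with each server.

```python
def render_workers(servers, lbfactor=1):
    """Render a workers.properties body for (name, host, ajp_port) tuples.

    Hypothetical helper: it only formats text, it does not talk to mod_jk.
    """
    lines = ["worker.list=loadbalancer,status", ""]
    for name, host, port in servers:
        lines.append("worker.{}.port={}".format(name, port))
        lines.append("worker.{}.host={}".format(name, host))
        lines.append("worker.{}.type=ajp13".format(name))
        lines.append("worker.{}.lbfactor={}".format(name, lbfactor))
        lines.append("")
    # The virtual load balancer worker lists every real worker by name.
    names = ",".join(name for name, _host, _port in servers)
    lines.append("worker.loadbalancer.type=lb")
    lines.append("worker.loadbalancer.balance_workers=" + names)
    lines.append("")
    lines.append("worker.status.type=status")
    return "\n".join(lines) + "\n"


config = render_workers([
    ("server1", "localhost", 8009),
    ("server2", "localhost", 9009),
])
print(config)
```

The ports passed in must match the AJP connector ports in each server.xml, exactly as with the hand-written file.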
Configure Apache Web Server to forward requests

You will need to add the following to the Apache configuration located in /etc/apache2/sites-enabled/000-default.conf:

    JkMount /status status
    JkMount /* loadbalancer

Verify the installation

To test that everything has been configured correctly we need to deploy an application. A sample application that has been used for years to test such configurations is the ClusterJSP sample application. You can find it with a web search or on the JBoss site. Deploy the war to the webapps folder on both servers and start each server using its start-up script, e.g. /opt/tomcat-server1/bin/startup.sh. Go to http://localhost/clusterjsp/HaJsp.jsp and you should see the page show HttpSession information.

Now let's look at the mod_jk status page: http://localhost/status. This page shows information about the load balancer worker and the workers it is balancing. If everything is working, the worker error state shows OK, or OK/IDLE if the workers are not currently balancing load.

Things to try out

Enable sticky sessions: configure jvmRoute in the server.xml configuration.

Further reading

- Loadbalancing with mod_jk and Apache
- Working with mod_jk
- Connecting Apache's Web Server to Multiple Instances of Tomcat

May 19, 2015

by Alex Theedom

· 10,812 Views · 1 Like

Data Locality w/ Cassandra : How to Scan the Local Token Range of a Table...

I'm working on a mechanism that will allow HPCC to access data stored in Cassandra with data locality, leveraging the Java streaming capabilities from HPCC.

May 18, 2015

by Brian O' Neill

· 14,350 Views · 1 Like

Swim Lane Diagrams in JavaScript

Learn about swim-lane diagrams to connect business processes and departments and apply the concept to the object-oriented world of javascript.

May 17, 2015

by Daniel Jebaraj

· 7,148 Views

Use RegEx to Test Password Strength in JavaScript

In this post, we learn how to combine JavaScript and RegEx to create scripts that can help us test our password strength.

May 16, 2015

by Nic Raboy

· 93,727 Views · 1 Like

Integrating External APIs into your Meteor.js Application

Meteor itself does not rely on REST APIs, but it can easily access data from other services. This article is an excerpt from the book Meteor in Action and explains how you can integrate third-party data into your applications by accessing RESTful URLs from the server-side. Many applications rely on external APIs to retrieve data. Getting information regarding your friends from Facebook, looking up the current weather in your area, or simply retrieving an avatar image from another website – there are endless uses for integrating additional data. They all share a common challenge: APIs must be called from the server, but an API usually takes longer than executing the method itself. You need to ensure that the result gets back to the client – even if it takes a couple of seconds. Let’s talk about how to integrate an external API via HTTP. Based on the IP address of a visitor, you can tell various information about their current location, e.g., coordinates, city or timezone. There is a simple API that takes an IPv4 address and returns all these tidbits as a JSON object. The API is called Telize. Making RESTful calls with the http package In order to communicate with RESTful external APIs such as Telize, you need to add the http package: meteor add http While the http package allows you to make HTTP calls from both client and server, the API call in this example will be performed from the server only. Many APIs require you to provide an ID as well as a secret key to identify the application that makes an API request. In those cases you should always run your requests from the server. That way you never have to share secret keys with clients. Let's look at a graphic to explain the basic concept. A user requests location information for an IP address (step 1). The client application calls a server method called geoJsonforIp (step 2) that makes an (asynchronous) call to the external API using the HTTP.get() method (step 3). 
The response (step 4) is a JSON object with information regarding the geographic location associated with an IP address, which gets sent back to the client via a callback (step 5). Using a synchronous method to query an API Let’s add a method that queries telize.com for a given IP address as shown in the following listing. This includes only the bare essentials for querying an API for now. Remember: This code belongs in a server-side only file or inside a if (Meteor.isServer) {} block. Meteor.methods({ // The method expects a valid IPv4 address 'geoJsonForIp': function (ip) { console.log('Method.geoJsonForIp for', ip); // Construct the API URL var apiUrl = 'http://www.telize.com/geoip/' + ip; // query the API var response = HTTP.get(apiUrl).data; return response; } }); Once the method is available on the server, querying the location of an IP works simply by calling the method with a callback from the client: Meteor.call('geoJsonForIp', '8.8.8.8', function(err,res){ console.log(res); }); While this solution appears to be working fine there are two major flaws to this approach: If the API is slow to respond requests will start queuing up. Should the API return an error there is no way to return it back to the UI. To address the issue of queuing, you can add an unblock() statement to the method: this.unblock(); Calling an external API should always be done asynchronously. That way you can also return possible error values back to the browser, which will solve the second issue. Let’s create a dedicated function for calling the API asynchronously to keep the method itself clean. Using an asynchronous method to call an API The listing below shows how to issue an HTTP.get call and return the result via a callback. It also includes error handling that can be shown on the client. 
var apiCall = function (apiUrl, callback) { // try…catch allows you to handle errors try { var response = HTTP.get(apiUrl).data; // A successful API call returns no error // but the contents from the JSON response callback(null, response); } catch (error) { // If the API responded with an error message and a payload if (error.response) { var errorCode = error.response.data.code; var errorMessage = error.response.data.message; // Otherwise use a generic error message } else { var errorCode = 500; var errorMessage = 'Cannot access the API'; } // Create an Error object and return it via callback var myError = new Meteor.Error(errorCode, errorMessage); callback(myError, null); } } Inside a try…catch block, you can differentiate between a successful API call (the try block) and an error case (the catch block). A successful call may return null for the error object of the callback, an error will return only an error object and null for the actual response. There are different types of errors and you want to differentiate between a problem with accessing the API and an API call that got an error inside the returned response. This is what the if statement checks for – in case the error object has a response property both code and message for the error should be taken from it; otherwise you can display a generic error 500 that the API could not be accessed. Each case, success and failure, returns a callback that can be passed back to the UI. In order to make the API call asynchronous you need to update the method as shown in the next code snippet. The improved code unblocks the method and wraps the API call in a wrapAsync function. 
    Meteor.methods({
      'geoJsonForIp': function (ip) {
        // avoid blocking other method calls from the same client
        this.unblock();
        var apiUrl = 'http://www.telize.com/geoip/' + ip;
        // asynchronous call to the dedicated API calling function
        var response = Meteor.wrapAsync(apiCall)(apiUrl);
        return response;
      }
    });

Finally, to allow requests from the browser and show error messages, you should add a template similar to the following code.

    Query the location data for an IP
    Look up location
    {{#with location}}
      {{#if error}}
        There was an error: {{error.errorType}} {{error.message}}!
      {{else}}
        The IP address {{location.ip}} is in {{location.city}} ({{location.country}}).
      {{/if}}
    {{/with}}

A Session variable called location is used to store the results from the API call. Clicking the button takes the content of the input box and sends it as a parameter to the geoJsonForIp method. The Session variable is set to whatever the callback receives: the error if there is one, otherwise the result. This is the required JavaScript code for connecting the template with the method call:

    Template.telize.helpers({
      location: function () {
        return Session.get('location');
      }
    });

    Template.telize.events({
      'click button': function (evt, tpl) {
        var ip = tpl.find('input#ipv4').value;
        Meteor.call('geoJsonForIp', ip, function (err, res) {
          // The method call sets the Session variable to the callback value
          if (err) {
            Session.set('location', {error: err});
          } else {
            Session.set('location', res);
            return res;
          }
        });
      }
    });

As a result you will be able to make API calls from the browser just like in this figure:

And that’s how to integrate an external API via HTTP!

May 15, 2015

by Stephan Hochhaus

· 40,165 Views

Log Collection With Graylog on AWS

Log collection is essential to properly analyzing issues in production. An interface to search and be notified about exceptions on all your servers is a must. Well, if you have one server, you can easily ssh to it and check the logs, of course, but for larger deployments, collecting logs centrally is way more preferable than logging to 10 machines in order to find “what happened”. There are many options to do that, roughly separated in two groups – 3rd party services and software to be installed by you. 3rd party (or “cloud-based” if you want) log collection services include Splunk,Loggly, Papertrail, Sumologic. They are very easy to setup and you pay for what you use. Basically, you send each message (e.g. via a custom logback appender) to a provider’s endpoint, and then use the dashboard to analyze the data. In many cases that would be the preferred way to go. In other cases, however, company policy may frown upon using 3rd party services to store company-specific data, or additional costs may be undesired. In these cases extra effort needs to be put into installing and managing an internal log collection software. They work in a similar way, but implementation details may differ (e.g. instead of sending messages with an appender to a target endpoint, the software, using some sort of an agent, collects local logs and aggregates them). Open-source options include Graylog, FluentD, Flume, Logstash. After a very quick research, I considered graylog to fit our needs best, so below is a description of the installation procedure on AWS (though the first part applies regardless of the infrastructure). The first thing to look at are the ready-to-use images provided by graylog, including docker, openstack, vagrant and AWS. Unfortunately, the AWS version has two drawbacks – it’s using Ubuntu, rather than the Amazon AMI. That’s not a huge issue, although some generic scripts you use in your stack may have to be rewritten. 
The other was the dealbreaker – when you start it, it doesn’t run a web interface, although it claims it should. Only mongodb, elasticsearch and graylog-server are started. Having 2 instances – one web, and one for the rest would complicate things, so I opted for manual installation. Graylog has two components – the server, which handles the input, indexing and searching, and the web interface, which is a nice UI that communicates with the server. The web interface uses mongodb for metadata, and the server uses elasticsearch to store the incoming logs. Below is a bash script (CentOS) that handles the installation. Note that there is no “sudo”, because initialization scripts are executed as root on AWS. #!/bin/bash # install pwgen for password-generation yum upgrade ca-certificates --enablerepo=epel yum --enablerepo=epel -y install pwgen # mongodb cat >/etc/yum.repos.d/mongodb-org.repo <<'EOT' [mongodb-org] name=MongoDB Repository baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/ gpgcheck=0 enabled=1 EOT yum -y install mongodb-org chkconfig mongod on service mongod start # elasticsearch rpm --import https://packages.elasticsearch.org/GPG-KEY-elasticsearch cat >/etc/yum.repos.d/elasticsearch.repo <<'EOT' [elasticsearch-1.4] name=Elasticsearch repository for 1.4.x packages baseurl=http://packages.elasticsearch.org/elasticsearch/1.4/centos gpgcheck=1 gpgkey=http://packages.elasticsearch.org/GPG-KEY-elasticsearch enabled=1 EOT yum -y install elasticsearch chkconfig --add elasticsearch # configure elasticsearch sed -i -- 's/#cluster.name: elasticsearch/cluster.name: graylog2/g' /etc/elasticsearch/elasticsearch.yml sed -i -- 's/#network.bind_host: localhost/network.bind_host: localhost/g' /etc/elasticsearch/elasticsearch.yml service elasticsearch stop service elasticsearch start # java yum -y update yum -y install java-1.7.0-openjdk update-alternatives --set java /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java # graylog wget 
https://packages.graylog2.org/releases/graylog2-server/graylog-1.0.1.tgz tar xvzf graylog-1.0.1.tgz -C /opt/ mv /opt/graylog-1.0.1/ /opt/graylog/ cp /opt/graylog/bin/graylogctl /etc/init.d/graylog sed -i -e 's/GRAYLOG2_SERVER_JAR=\${GRAYLOG2_SERVER_JAR:=graylog.jar}/GRAYLOG2_SERVER_JAR=\${GRAYLOG2_SERVER_JAR:=\/opt\/graylog\/graylog.jar}/' /etc/init.d/graylog sed -i -e 's/LOG_FILE=\${LOG_FILE:=log\/graylog-server.log}/LOG_FILE=\${LOG_FILE:=\/var\/log\/graylog-server.log}/' /etc/init.d/graylog cat >/etc/init.d/graylog <<'EOT' #!/bin/bash # chkconfig: 345 90 60 # description: graylog control sh /opt/graylog/bin/graylogctl $1 EOT chkconfig --add graylog chkconfig graylog on chmod +x /etc/init.d/graylog # graylog web wget https://packages.graylog2.org/releases/graylog2-web-interface/graylog-web-interface-1.0.1.tgz tar xvzf graylog-web-interface-1.0.1.tgz -C /opt/ mv /opt/graylog-web-interface-1.0.1/ /opt/graylog-web/ cat >/etc/init.d/graylog-web <<'EOT' #!/bin/bash # chkconfig: 345 91 61 # description: graylog web interface sh /opt/graylog-web/bin/graylog-web-interface > /dev/null 2>&1 & EOT chkconfig --add graylog-web chkconfig graylog-web on chmod +x /etc/init.d/graylog-web #configure mkdir --parents /etc/graylog/server/ cp /opt/graylog/graylog.conf.example /etc/graylog/server/server.conf sed -i -e 's/password_secret =.*/password_secret = '$(pwgen -s 96 1)'/' /etc/graylog/server/server.conf sed -i -e 's/root_password_sha2 =.*/root_password_sha2 = '$(echo -n password | shasum -a 256 | awk '{print $1}')'/' /etc/graylog/server/server.conf sed -i -e 's/application.secret=""/application.secret="'$(pwgen -s 96 1)'"/g' /opt/graylog-web/conf/graylog-web-interface.conf sed -i -e 's/graylog2-server.uris=""/graylog2-server.uris="http:\/\/127.0.0.1:12900\/"/g' /opt/graylog-web/conf/graylog-web-interface.conf service graylog start sleep 30 service graylog-web start You may also want to set a TTL (auto-expiration) for messages, so that you don’t store old logs forever. 
Here’s how:

    # wait for the index to be created
    INDEXES=$(curl --silent "http://localhost:9200/_cat/indices")
    until [[ "$INDEXES" =~ "graylog2_0" ]]; do
        sleep 5
        echo "Index not yet created. Indexes: $INDEXES"
        INDEXES=$(curl --silent "http://localhost:9200/_cat/indices")
    done
    # set each indexed message auto-expiration (ttl)
    curl -XPUT "http://localhost:9200/graylog2_0/message/_mapping" -d '{"message": {"_ttl": {"enabled": true, "default": "15d"}}}'

Now you have everything running on the instance. Then you have to do some AWS-specific things (if using CloudFormation, that would include a pile of JSON). Here’s the list:

- You can either have an auto-scaling group (ASG) with one instance, or a single instance. I prefer the ASG, though the other one is a bit simpler. The ASG gives you auto-respawn if the instance dies.
- Set the above script to be invoked in the UserData of the launch configuration of the instance/ASG (e.g. by getting it from S3 first).
- Allow UDP port 12201 (the default logging port). That should happen for the instance/ASG security group (inbound), for the application nodes' security group (outbound), and also as a network ACL of your VPC. Test the UDP connection to make sure it really goes through. Keep the access restricted for all sources, except for your instances.
- You need to pass the private IP address of your graylog server instance to all the application nodes. That’s tricky on AWS, as private IP addresses change. That’s why you need something stable. You can’t use an ELB (load balancer), because it doesn’t support UDP. There are two options:
  - Associate an Elastic IP with the node on startup and pass that IP to the application nodes. But there’s a catch: if they connect to the Elastic IP, that would go via NAT (if you have such), and you may have to open your instance “to the world”. So you must turn the Elastic IP into its corresponding public DNS name, which will then be resolved to the private IP. You can do that manually, with a hack:

        GRAYLOG_ADDRESS="ec2-${GRAYLOG_ADDRESS//./-}.us-west-1.compute.amazonaws.com"

    or you can use the AWS EC2 CLI to obtain the details of the instance that the Elastic IP is associated with, and then with another call obtain its Public DNS.
  - Instead of using an Elastic IP, which limits you to a single instance, you can use Route53 (the AWS DNS manager). When a graylog server instance starts, it can append itself to a Route53 record, which allows for a round-robin DNS of multiple graylog instances that are in a cluster. Manipulating the Route53 records is again done via the AWS CLI. Then you just pass the domain name to the application nodes, so that they can send messages.
- Alternatively, you can install graylog-server on all the nodes (as an agent) and point them to an elasticsearch cluster. But that’s more complicated and probably not the intended way to do it.
- Configure your logging framework to send messages to graylog. There are standard GELF (the Graylog format) appenders, e.g. this one, and the only thing you have to do is use the Public DNS environment variable in the logback.xml (which supports environment variable resolution).
- You should make the web interface accessible outside the network, so you can use an ELB for that, or the round-robin DNS mentioned above. Just make sure the security rules are tight and do not allow external tampering with your log data.
- If you are not running a graylog cluster (which I won’t cover), then the single instance can potentially fail. That isn’t a great loss, as log messages can be obtained from the instances, and they are short-lived anyway. But the metadata of the web interface is important: dashboards, alerts, etc. So it’s good to do regular backups (e.g. with mongodump). Using an EBS volume is also an option.
Even though you send your log messages to the centralized log collector, it’s a good idea to also keep local logs, with the proper log rotation and cleanup. It’s not a trivial process, but it’s essential to have log collection, so I hope the guide has been helpful.
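The Elastic IP to public DNS conversion mentioned in the list above is just a string substitution: dots become dashes and the region suffix is appended. Here is a minimal sketch of the same idea in Python, for cases where the address is prepared outside of bash. The function name public_dns_for is hypothetical, and the naming pattern is copied from the us-west-1 example in this post; it is an assumption that your region follows the same pattern, so verify it against the Public DNS shown for your instance.

```python
import ipaddress


def public_dns_for(elastic_ip, region="us-west-1"):
    """Build the EC2 public DNS name for an Elastic IP (hypothetical helper).

    Assumes the "ec2-<ip-with-dashes>.<region>.compute.amazonaws.com"
    pattern used in the bash one-liner; other regions may differ.
    """
    # IPv4Address raises ValueError on malformed input, so typos fail early.
    address = ipaddress.IPv4Address(elastic_ip)
    dashed = str(address).replace(".", "-")
    return "ec2-{}.{}.compute.amazonaws.com".format(dashed, region)


print(public_dns_for("54.1.2.3"))
```

Asking the AWS CLI for the instance's Public DNS, as described above, avoids relying on the naming pattern at all and is the safer choice when in doubt.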

May 14, 2015

by Bozhidar Bozhanov

· 19,993 Views

HashMap Custom implementation in java

Contents of page:

- Custom HashMap
- Entry
- Putting 5 key-value pairs in HashMap (step-by-step)
- Methods used in custom HashMap
- What will happen if map already contains mapping for key?
- Complexity calculation of put and get methods in HashMap
  - put method - worst case complexity
  - put method - best case complexity
  - get method - worst case complexity
  - get method - best case complexity
- Summary of complexity of methods in HashMap

Custom HashMap

This is a very important and popular topic. In this post I will explain a custom HashMap implementation in detail, with diagrams that help in visualizing it. I will explain how we put and get key-value pairs in a HashMap by overriding:

- the equals method, which helps in checking the equality of entry objects;
- the hashCode method, which helps in finding the index of the bucket in which the data will be stored.

We will maintain a bucket (ArrayList) which will store Entry objects (LinkedList).

Entry

We store each key-value pair in an Entry. An Entry contains K key, V value and Entry<K, V> next (i.e. the next entry at that location of the bucket).

    static class Entry<K, V> {
        K key;
        V value;
        Entry<K, V> next;

        public Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

Putting 5 key-value pairs in custom HashMap (step-by-step)

I will explain the whole concept of HashMap by putting 5 key-value pairs in it. Initially, we have a bucket of capacity = 4, and all indexes of the bucket (0, 1, 2, 3) point to null.

Let’s put the first key-value pair in the HashMap: key=21, value=12. A newEntry object is created for the pair. We calculate the hash with our hash(K key) method; in this case it returns key % capacity = 21 % 4 = 1, so 1 is the bucket index at which the newEntry object will be stored. We go to index 1; as it points to null, we put our newEntry object there. At completion of this step, index 1 holds the entry with key=21.

Let’s put the second key-value pair: key=25, value=121. The hash is 25 % 4 = 1, so index 1 is again the target. Index 1 already contains the entry with key=21, so we compare the two keys (21 with 25, using the equals method). As the keys are different, we check whether the next of the entry with key=21 is null; it is, so we put our newEntry object on its next. At completion of this step, index 1 holds 21 followed by 25.

Let’s put the third key-value pair: key=30, value=151. The hash is 30 % 4 = 2, so 2 is the bucket index. Index 2 points to null, so we put our newEntry object there. At completion of this step, index 1 holds 21 followed by 25, and index 2 holds 30.

Let’s put the fourth key-value pair: key=33, value=15. The hash is 33 % 4 = 1, so index 1 is the target. We go to index 1:

- It contains the entry with key=21. We compare the two keys (21 with 33, using the equals method); as they are different, we proceed to the next of the entry with key=21 (we proceed only if next is not null).
- Now next contains the entry with key=25. We compare the two keys (25 with 33, using the equals method); as they are different and the next of the entry with key=25 points to null, we do not proceed further and put our newEntry object on that next.

At completion of this step, index 1 holds 21, 25 and 33, and index 2 holds 30.

Let’s put the fifth key-value pair: key=35, value=89. Repeat the steps above: the hash is 35 % 4 = 3 and index 3 points to null, so the entry is stored there.

Must read: LinkedHashMap Custom implementation

Methods used in custom HashMap

public void put(K newKey, V data)
- Puts a key-value pair in the HashMap.
- If the map already contains a mapping for the key, the old value is replaced.
- Relies on the key's equals method to detect an existing key.
- Relies on the key's hashCode method to find the bucket.

public V get(K key)
- Returns the value corresponding to the key.

public boolean remove(K deleteKey)
- Removes the key-value pair from HashMapCustom.

public void display()
- Displays all key-value pairs present in HashMapCustom.
- Insertion order is not guaranteed; for maintaining insertion order refer to LinkedHashMapCustom.

private int hash(K key)
- Implements the hashing functionality, which finds the appropriate bucket location to store our data.
- This is a very important method, as the performance of HashMapCustom depends heavily on its implementation.

What will happen if map already contains mapping for key?

If the map already contains a mapping for the key, the old value is replaced.

Complexity calculation of put and get methods in HashMap

put method - worst case complexity: O(n)

But why is the complexity O(n)? Let's say the map initially contains a single entry with key=21 at index 1, and we have to insert a newEntry object with key=25, value=121. We calculate the hash with our hash(K key) method; in this case it returns key % capacity = 25 % 4 = 1, so 1 is the bucket index at which the newEntry object will be stored. We go to index 1, which contains the entry with key=21, and compare the two keys (21 with 25, using the equals method). As the keys are different, we check whether the next of the entry with key=21 is null; it is, so we put our newEntry object on its next.
At completion of this step, index 1 holds the entry with key=21 followed by the entry with key=25.

Now let’s do the complexity calculation. Earlier there was 1 element in the HashMap, and to put the newEntry object we had to iterate over it. Hence the complexity was O(n).

Note: We could calculate the complexity by adding more elements to the HashMap as well, but to keep the explanation simple I kept only a few elements in it.

put method - best case complexity: O(1)

But why is the complexity O(1)? Let's say the map contains the entries with key=21 and key=25, both at index 1, and we have to insert a newEntry object with key=30, value=151. We calculate the hash with our hash(K key) method; in this case it returns key % capacity = 30 % 4 = 2, so 2 is the bucket index at which the newEntry object will be stored. We go to index 2; as it points to null, we put our newEntry object there.

Now let’s do the complexity calculation. Earlier there were 2 elements in the HashMap, but we were able to put the newEntry object in one go, without iterating over any of them. Hence the complexity was O(1).

get method - worst case complexity: O(n)

But why is the complexity O(n)? Let's say the map contains the entries with key=21 and key=25, both at index 1, and we have to get the Entry object with key=25, value=121. We calculate the hash with our hash(K key) method; in this case it returns key % capacity = 25 % 4 = 1, so 1 is the bucket index at which the Entry object is stored. We go to index 1, which contains the entry with key=21, and compare the two keys (21 with 25, using the equals method). As the keys are different, we check whether the next of the entry with key=21 is null; it is not, so we repeat the same process on the next entry and ultimately get the Entry object.

Now let’s do the complexity calculation. There were 2 elements in the HashMap, and to get the Entry object we iterated over both of them. Hence the complexity was O(n).

Note: We could calculate the complexity using a larger HashMap, but to keep the explanation simple I kept only a few elements in it.

get method - best case complexity: O(1)

But why is the complexity O(1)? Let's say the map contains the entries with key=21 and key=25 at index 1 and the entry with key=30 at index 2, and we have to get the Entry object with key=30, value=151. We calculate the hash with our hash(K key) method; in this case it returns key % capacity = 30 % 4 = 2, so 2 is the bucket index at which the Entry object is stored. We go to index 2 and get the Entry object.

Now let’s do the complexity calculation. There were 3 elements in the HashMap, but we were able to get the Entry object in one go. Hence the complexity was O(1).

Summary of complexity of methods in HashMap

    Operation/method       Worst case   Best case
    put(K key, V value)    O(n)         O(1)
    get(Object key)        O(n)         O(1)

REFER: http://javamadesoeasy.com/2015/02/hashmap-custom-implementation.html HashMap Custom implementation - put, get, remove Employee object.
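To make the walkthrough above concrete, here is a compact sketch of the same idea: a fixed number of buckets, each holding a singly linked chain of entries. It is written in Python rather than Java and is not the referenced HashMapCustom class; the names Entry, HashMapCustom and keys_in_bucket are illustrative, and resizing and remove are left out.

```python
class Entry:
    """One key-value pair plus a link to the next entry in the same bucket."""

    def __init__(self, key, value, next_entry=None):
        self.key = key
        self.value = value
        self.next = next_entry


class HashMapCustom:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.buckets = [None] * capacity  # every index starts out as null

    def _hash(self, key):
        # For small non-negative ints hash(key) == key, so 21 -> 21 % 4 = 1.
        return hash(key) % self.capacity

    def put(self, key, value):
        index = self._hash(key)
        current = self.buckets[index]
        if current is None:  # best case: empty bucket, O(1)
            self.buckets[index] = Entry(key, value)
            return
        while True:  # worst case: walk the whole chain, O(n)
            if current.key == key:  # existing key: replace the old value
                current.value = value
                return
            if current.next is None:  # end of chain: append the new entry
                current.next = Entry(key, value)
                return
            current = current.next

    def get(self, key):
        current = self.buckets[self._hash(key)]
        while current is not None:
            if current.key == key:
                return current.value
            current = current.next
        return None

    def keys_in_bucket(self, index):
        """Return the keys chained at one bucket index, in chain order."""
        keys, current = [], self.buckets[index]
        while current is not None:
            keys.append(current.key)
            current = current.next
        return keys


hash_map = HashMapCustom(capacity=4)
for key, value in [(21, 12), (25, 121), (30, 151), (33, 15), (35, 89)]:
    hash_map.put(key, value)

for index in range(hash_map.capacity):
    print(index, hash_map.keys_in_bucket(index))
```

Because keys 21, 25 and 33 all land in bucket 1, looking up 33 walks three entries, while looking up 30 or 35 touches only one. That is the worst-case versus best-case difference described above.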

May 13, 2015

by Ankit Mittal

· 74,468 Views · 10 Likes

Docker Machine on Windows - How To Set Up Your Hosts

I've been playing around with Docker a lot lately. Many reasons for that, one for sure is, that I love to play around with latest technology and even help out to build a demo or two or a lab. The main difference, between what everybody else of my coworkers is doing is, that I run my setup on Windows. Like most of the middleware developers out there. So, If you followed Arun's blog about "Docker Machine to Setup Docker Host" you might have tried to make this work on windows already. Here is the ultimate short how-to guide on using Docker Machine to administrate and spin up your Docker hosts. Docker Machine Machine lets you create Docker hosts on your computer, on cloud providers, and inside your own data center. It creates servers, installs Docker on them, then configures the Docker client to talk to them. You basically don't have to have anything installed on your machine prior to this. Which is a hell lot easier, than having to manually install boot2docker before. So, let's try this out. You want to have at least one thing in place before starting with anything Docker or Machine. Go and get Git for Windows (aka msysgit). It has all kinds of helpful unix tools in his belly, which you need anyway. Prerequisites - The One For All Solution The first is to install the windows boot2docker distribution which I showed in an earlier blog. It contains the following bits configured and ready for you to use: - VirtualBox - Docker Windows Client Prerequisites- The Bits And Pieces I dislike the boot2docker installer for a variety of reasons. Mostly, because I want to know what exactly is going on on my machine. So I played around a bit and here is the bits and pieces installer if you decide against the one-for-all solution. Start with the virtualization solution. We need something like that on Windows, because it just can't run Linux and this is what Docker is based on. At least for now. 
So, get VirtualBox and ensure that version 4.3.18 is correctly installed on your system (VirtualBox-4.3.18-96516-Win.exe, 105 MB). WARNING: There is a strange issue when you run Windows itself in VirtualBox. You might run into a problem with starting the host.

While you're at it, go and get the Docker Windows Client. You can grab the final release from the test servers as a direct download (docker-1.6.0.exe, x86_64, 7.5MB). Rename it to "docker" and put it into a folder of your choice (I assume it will be c:\docker\). Now you also need to download Docker Machine, which is another single executable (docker-machine_windows-amd64.exe, 11.5MB). Rename it to "docker-machine" and put it into the same folder. Now add this folder to your PATH:

    set PATH=%PATH%;C:\docker

If you change your standard PATH environment variable instead, this might save you a lot of typing. That's it. Now you're ready to create your first Machine-managed Docker host.

Create Your Docker Host With Machine

All you need is a simple command:

    docker-machine create --driver virtualbox dev

And the output should state:

    ←[34mINFO←[0m[0000] Creating SSH key...
    ←[34mINFO←[0m[0001] Creating VirtualBox VM...
    ←[34mINFO←[0m[0016] Starting VirtualBox VM...
    ←[34mINFO←[0m[0022] Waiting for VM to start...
    ←[34mINFO←[0m[0076] "dev" has been created and is now the active machine.
    ←[34mINFO←[0m[0076] To point your Docker client at it, run this in your shell: eval "$(docker-machine.exe env dev)"

This means you just created a Docker host using the VirtualBox provider and the name “dev”. Now you need to find out on which IP address the host is running.
    docker-machine ip
    192.168.99.102

If you want to configure the environment variables needed by the client more easily, just use the following command:

    docker-machine env dev
    export DOCKER_TLS_VERIFY=1
    export DOCKER_CERT_PATH="C:\\Users\\markus\\.docker\\machine\\machines\\dev"
    export DOCKER_HOST=tcp://192.168.99.102:2376

This outputs the Linux version of the environment variable definitions. All you have to do is change the "export" keyword to "set", remove the double quotes and the double backslashes, and you are ready to go.

    C:\Users\markus\Downloads>set DOCKER_TLS_VERIFY=1
    C:\Users\markus\Downloads>set DOCKER_CERT_PATH=C:\Users\markus\.docker\machine\machines\dev
    C:\Users\markus\Downloads>set DOCKER_HOST=tcp://192.168.99.102:2376

Time to test our Docker Client
And here we go. Now run WildFly on your freshly created host:

    docker run -it -p 8080:8080 jboss/wildfly

Watch the container being downloaded and check that it is running by pointing your browser to http://192.168.99.102:8080/. Congratulations on having set up your very first Docker host with Machine on Windows.
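
The manual conversion from "export" lines to "set" lines described above could also be scripted. The following is only a minimal sketch in Python, not part of Docker Machine: the function name exports_to_set and the sample variable are hypothetical, and it assumes the env output has exactly the shape shown above (one "export NAME=value" per line, optional double quotes, doubled backslashes).

```python
def exports_to_set(env_output):
    """Translate `export NAME=value` lines into Windows `set NAME=value` lines.

    Hypothetical helper for illustration; assumes the output format shown above.
    """
    commands = []
    for line in env_output.splitlines():
        line = line.strip()
        if not line.startswith("export "):
            continue  # ignore blank lines, comments and anything unexpected
        name, _, value = line[len("export "):].partition("=")
        value = value.strip('"')             # remove the surrounding double quotes
        value = value.replace("\\\\", "\\")  # collapse doubled backslashes
        commands.append("set {}={}".format(name, value))
    return commands


# Sample input copied from the `docker-machine env dev` output above.
sample = r'''export DOCKER_TLS_VERIFY=1
export DOCKER_CERT_PATH="C:\\Users\\markus\\.docker\\machine\\machines\\dev"
export DOCKER_HOST=tcp://192.168.99.102:2376'''

for command in exports_to_set(sample):
    print(command)
```

The printed lines can be pasted into a cmd.exe session, the same way as the hand-edited "set" commands.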

May 12, 2015

by Markus Eisele

· 20,165 Views

Collecting Transaction Per Minute from SQL Server and HammerDB

A SQL Server script file can be created to run in a loop, collecting samples for a given amount of time at a specified interval.

May 11, 2015

by Greg Schulz

· 10,306 Views

Python: Equivalent to flatMap for Flattening an Array of Arrays

I found myself wanting to flatten an array of arrays while writing some Python code earlier this afternoon and, being lazy, my first attempt involved building the flattened array manually:

    episodes = [
        {"id": 1, "topics": [1,2,3]},
        {"id": 2, "topics": [4,5,6]}
    ]

    flattened_episodes = []
    for episode in episodes:
        for topic in episode["topics"]:
            flattened_episodes.append({"id": episode["id"], "topic": topic})

    for episode in flattened_episodes:
        print episode

If we run that we’ll see this output:

    $ python flatten.py
    {'topic': 1, 'id': 1}
    {'topic': 2, 'id': 1}
    {'topic': 3, 'id': 1}
    {'topic': 4, 'id': 2}
    {'topic': 5, 'id': 2}
    {'topic': 6, 'id': 2}

What I was really looking for was the Python equivalent to the flatmap function, which I learnt can be achieved in Python with a list comprehension like so:

    flattened_episodes = [{"id": episode["id"], "topic": topic}
                          for episode in episodes
                          for topic in episode["topics"]]

    for episode in flattened_episodes:
        print episode

We could also choose to use itertools, in which case we’d have the following code:

    from itertools import chain, imap

    flattened_episodes = chain.from_iterable(
        imap(lambda episode: [{"id": episode["id"], "topic": topic}
                              for topic in episode["topics"]],
             episodes))

    for episode in flattened_episodes:
        print episode

We can then simplify this approach a little by wrapping it up in a ‘flatmap’ function:

    def flatmap(f, items):
        return chain.from_iterable(imap(f, items))

    flattened_episodes = flatmap(
        lambda episode: [{"id": episode["id"], "topic": topic}
                         for topic in episode["topics"]],
        episodes)

    for episode in flattened_episodes:
        print episode

I think the list comprehensions approach still works but I need to look into itertools more; it looks like it could work well for other list operations.
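
The code in this post targets Python 2: it uses the print statement and itertools.imap, neither of which exists in Python 3. As a rough sketch of how the same ‘flatmap’ helper might look on Python 3 (an adaptation for illustration, not the original author's code), the built-in map, which is lazy in Python 3, can stand in for imap:

```python
from itertools import chain


def flatmap(f, items):
    # map() returns a lazy iterator in Python 3, so it takes the place
    # of Python 2's itertools.imap here.
    return chain.from_iterable(map(f, items))


episodes = [
    {"id": 1, "topics": [1, 2, 3]},
    {"id": 2, "topics": [4, 5, 6]},
]

# chain.from_iterable is lazy too; list() materialises the result so it
# can be iterated more than once.
flattened_episodes = list(flatmap(
    lambda episode: [{"id": episode["id"], "topic": topic}
                     for topic in episode["topics"]],
    episodes))

for episode in flattened_episodes:
    print(episode)
```

Note that the printed key order may differ from the Python 2 output shown earlier, because dictionaries preserve insertion order from Python 3.7 onwards.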

May 9, 2015

by Mark Needham

· 36,363 Views · 2 Likes

8 Questions You Need to Ask About Microservices, Containers & Docker in 2015

In containers and microservices, we’re facing the greatest potential change in how we deliver and run software services since the arrival of virtual machines.

May 9, 2015

by Andrew Phillips

· 15,028 Views · 1 Like

Make Your IoT Gateway WiFi-Aware Using Camel and Kura

A common scenario for mobile IoT gateways is to cache collected data locally on the device's storage and then synchronize it with the data center.

May 9, 2015

by Henryk Konsek

· 8,089 Views

Quick Notes: What is CAP Theorem?

The CAP theorem states that a distributed database system can guarantee at most two of the following three properties: Consistency, Availability, and Partition Tolerance.

May 5, 2015

by Ajitesh Kumar

· 26,333 Views · 3 Likes

The Latest Data Engineering Topics