Data Engineering Resources

Remove boilerplate code in your JUnit tests with parameterized tests.

July 23, 2015

by John Thompson

· 40,709 Views · 2 Likes

Interested in using Liquibase? Here's how to run it automatically at startup, manually as needed, or "just give me the SQL and I'll do it myself."

July 22, 2015

by Nathan Voxland

· 8,781 Views

The Periodic Table of DevOps Tools

A very cool, intuitive guide to the massive landscape of DevOps tools and their use cases.

July 22, 2015

by Necco Ceresani

· 36,022 Views · 7 Likes

Coding a .PSD File to HTML – A Simple and Basic Guide for Beginners

Learn to transform your PhotoShop .PSD file into HTML/CSS for your website to preserve the graphic design.

July 17, 2015

by Jack Calder

· 2,402 Views · 1 Like

MongoDB Aggregation Queries for "Counts Per Day" (Part 1)

To feed a heatmap display with MongoDB data, explore grouping documents to return a daily aggregation query.

July 17, 2015

by Kevin Hooke

· 49,057 Views · 3 Likes

Microservices with Spring

How to put Spring, Spring Boot, and Spring Cloud together to create a microservice.

July 15, 2015

by Pieter Humphrey

· 20,512 Views · 6 Likes

Using the H2 Database Console in Spring Boot with Spring Security

H2 as a memory database for Spring-based applications is lightweight, easy to use, and emulates other RDBMS with the help of JPA and Hibernate.

July 13, 2015

by John Thompson

· 102,564 Views · 6 Likes

Design Patterns in Automated Testing

Learn how to make your test automation framework better through Page Objects, Facades, and Singletons.

July 13, 2015

by Anton Angelov

· 81,105 Views · 7 Likes

Where Am I? Collecting GPS Data With Apache Camel

This article shows how Apache Camel can turn a full-stack Linux microcomputer (such as a Raspberry Pi) into a device that collects GPS coordinates.

July 8, 2015

by Henryk Konsek

· 5,742 Views · 1 Like

Java 8: Master Permutations

Using Permutations, you can try all combinations of an input set.

July 7, 2015

by Per-Åke Minborg

· 39,838 Views · 11 Likes

Modern Database Design by Example

Database design, once a monotonous task, has become an exciting one that requires a lot of creativity.

July 6, 2015

by Anh Tuan Nguyen

· 13,269 Views · 1 Like

60 Most Commonly Used R Packages in R Programming Language

A comprehensive list of the 60 most commonly used R packages for data science and analytics.

July 6, 2015

by Ajitesh Kumar

· 10,430 Views · 2 Likes

Microservices Design Principles

Get a crash course in understanding microservices and the difficulties in implementing them.

July 5, 2015

by Saravanan Subramanian

· 62,373 Views · 10 Likes

The UUID Discussion

Where UUIDs really start coming in handy is when you start synchronizing data across servers.

July 3, 2015

by Lieven Doclo

· 26,821 Views

Too Big Data: Coping with Overplotting

Written by Tim Brock.

Scatter plots are a wonderful way of showing (apparent) relationships in bivariate data. Patterns and clusters that you wouldn't see in a huge block of data in a table can become instantly visible on a page or screen. With all the hype around big data in recent years it's easy to assume that having more data is always an advantage. But as we add more and more data points to a scatter plot we can start to lose these patterns and clusters. This problem, a result of overplotting, is demonstrated in the animation below.

The data in the animation above is randomly generated from a pair of simple bivariate distributions. The distinction between the two distributions becomes less and less clear as we add more and more data. So what can we do about overplotting? One simple option is to make the data points smaller. (Note this is a poor "solution" if many data points share exactly the same values.) We can also make them semi-transparent. And we can combine these two options:

These refinements certainly help when we have ten thousand data points. However, by the time we've reached a million points the two distributions have seemingly merged into one again. Making points smaller and more transparent might help things; nevertheless, at some point we may have to consider a change of visualization. We'll get on to that later.

But first let's try to supplement our visualization with some extra information. Specifically, let's visualize the marginal distributions. We have several options. There's far too much data for a rug plot, but we can bin the data and show histograms. Or we can use a smoother option - a kernel density plot. Finally, we could use the empirical cumulative distribution. This last option avoids any binning or smoothing but the results are probably less intuitive. I'll go with the kernel density option here, but you might prefer a histogram.

The animated GIF below is the same as the GIF above but with the smoothed marginal distributions added. I've left scales off to avoid clutter and because we're only really interested in rough judgements of relative height. Adding marginal distributions, particularly the distribution of variable 2, helps clarify that two different distributions are present in the bivariate data. The twin-peaked nature of variable 2 is evident whether there are a thousand data points or a million. The relative sizes of the two components are also clear. By contrast, the marginal distribution of variable 1 only has a single peak, despite coming from two distinct distributions. This should make it clear that adding marginal distributions is by no means a universal solution to overplotting in scatter plots.

To reinforce this point, the animation below shows a completely different set of (generated) data points in a scatter plot with marginal distributions. The data again comes from a random sample of two different 2D distributions, but both marginal distributions of the complete dataset fail to highlight this separation. As previously, when the number of data points is large the distinction between the two clusters can't be seen from the scatter plot either.

Returning to point size and opacity, what do we get if we make the data points very small and almost completely transparent? We can now clearly distinguish two clusters in each dataset. It's difficult to make out any fine detail though.

Since we've lost that fine detail anyway, it seems apt to question whether we really want to draw a million data points. It can be tediously slow and impossible in certain contexts. 2D histograms are an alternative. By binning data we can reduce the number of points to plot and, if we pick an appropriate color scale, pick out some of the features that were lost in the clutter of the scatter plot.

After some experimenting I picked a color scale that ran from black through green to white at the high end. Note, this is (almost) the reverse of the effect created by overplotting in the scatter plots above. In both 2D histograms we can clearly see the two different clusters representing the two distributions from which the data is drawn. In the first case we can also see that there are more counts from the upper-left cluster than the bottom-right cluster, a detail that is lost in the scatter plot with a million data points (but more obvious from the marginal distributions). Conversely, in the case of the second dataset we can see that the "heights" of the two clusters are roughly comparable.

3D charts are overused, but here (see below) I think they actually work quite well in terms of providing a broad picture of where the data is and isn't concentrated. Feature occlusion is a problem with 3D charts, so if you're going to go down this route when exploring your own data I highly recommend using software that allows for user interaction through rotation and zooming.

In summary, scatter plots are a simple and often effective way of visualizing bivariate data. If, however, your chart suffers from overplotting, try reducing point size and opacity. Failing that, a 2D histogram or even a 3D surface plot may be helpful. In the latter case be wary of occlusion.
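The binning step behind a 2D histogram can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the author's code: the two Gaussian clusters, their sizes, the plotting range, and the 40 x 40 grid are all hypothetical choices made for the example, and `bin_2d` is a helper name invented here.

```python
import random


def bin_2d(points, x_range, y_range, bins):
    """Count points per cell of a bins x bins grid; out-of-range points are skipped."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = [[0] * bins for _ in range(bins)]
    for x, y in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        # Map each coordinate to a cell index, clamped to guard against rounding.
        col = min(int((x - x_min) / (x_max - x_min) * bins), bins - 1)
        row = min(int((y - y_min) / (y_max - y_min) * bins), bins - 1)
        counts[row][col] += 1
    return counts


# Hypothetical data: two overlapping clusters of unequal size (assumed values).
rng = random.Random(0)
points = [(rng.gauss(-1.0, 0.7), rng.gauss(1.0, 0.7)) for _ in range(60_000)]
points += [(rng.gauss(1.0, 0.7), rng.gauss(-1.0, 0.7)) for _ in range(40_000)]

counts = bin_2d(points, (-4.0, 4.0), (-4.0, 4.0), bins=40)
cells = sum(len(row) for row in counts)
binned = sum(sum(row) for row in counts)
print(f"{len(points)} points reduced to {cells} cells ({binned} points in range)")
```

However many points go in, only the grid of counts needs to be drawn, and each count can then be mapped to a color scale. The cell near the centre of the larger cluster ends up with a visibly higher count than the cell near the centre of the smaller one, which is the kind of difference in "height" that a crowded scatter plot hides.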

July 3, 2015

by Josh Anderson

· 13,592 Views

Finding Dependency in Stored Procedure

Introduction

This article discusses how to find the objects referenced within a stored procedure, and how to find the procedures that call a given stored procedure. We hope you find it informative.

What We Want

Developers write stored procedures almost every day. Sometimes a developer needs to know which objects are used within a stored procedure, or from which stored procedures a given stored procedure is called. This is vital information to have before working on a particular stored procedure. Here is a pictorial diagram to illustrate the scenario.

Now we have to answer some questions:

1. What are the DB objects used in Stored Procedure1, and what are their types?
2. Which procedures call Stored Procedure3?

We are not going to read through the stored procedures to find the answers. Suppose each procedure has more than 3,000 lines.

How We Solve It

To find the answers, we first create an example scenario.

-- Base Table
CREATE TABLE T1 (EMPID INT, EMPNAME VARCHAR(50));
GO
CREATE TABLE T2 (EMPID INT, EMPNAME VARCHAR(50));
GO

--1
CREATE PROCEDURE [dbo].[Procedure1]
AS
BEGIN
    SELECT * FROM T1;
    SELECT * FROM T2;
    EXEC [dbo].[Procedure3];
END
GO

--2
CREATE PROCEDURE [dbo].[Procedure2]
AS
BEGIN
    EXEC [dbo].[Procedure3];
END
GO

--3
CREATE PROCEDURE [dbo].[Procedure3]
AS
BEGIN
    SELECT * FROM T1;
END
GO

Now we can answer the questions.

What are the DB objects used in Stored Procedure1, and what are their types?

sp_depends Procedure1

Which procedures call Stored Procedure3?

SELECT OBJECT_NAME(id) AS [Calling SP]
FROM syscomments
WHERE [text] LIKE '%Procedure3%'
GROUP BY OBJECT_NAME(id);

Hope you like it.
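One caveat of the text search above is worth illustrating: a plain substring match flags any procedure whose text mentions the name, not only those that execute it. The Python sketch below mimics the idea outside the database. It is an assumption-laden illustration, not part of the original article: the `definitions` dictionary is a hypothetical stand-in for the procedure text held in syscomments, and `mentions` and `find_callers` are helper names invented here.

```python
import re

# Hypothetical stand-in for the stored definition text of each procedure.
definitions = {
    "Procedure1": "CREATE PROCEDURE [dbo].[Procedure1] AS BEGIN "
                  "SELECT * FROM T1; SELECT * FROM T2; EXEC [dbo].[Procedure3]; END",
    "Procedure2": "CREATE PROCEDURE [dbo].[Procedure2] AS BEGIN "
                  "EXEC [dbo].[Procedure3]; END",
    "Procedure3": "CREATE PROCEDURE [dbo].[Procedure3] AS BEGIN "
                  "SELECT * FROM T1; END",
}


def mentions(target, definitions):
    """Substring search, comparable to LIKE '%target%' on the definition text."""
    return sorted(name for name, text in definitions.items() if target in text)


def find_callers(target, definitions):
    """Return the procedures whose text contains an EXEC/EXECUTE of the target."""
    pattern = re.compile(
        r"\bEXEC(?:UTE)?\s+(?:\[?dbo\]?\.)?\[?" + re.escape(target) + r"\]?(?![\w\]])",
        re.IGNORECASE,
    )
    return sorted(name for name, text in definitions.items() if pattern.search(text))


print(mentions("Procedure3", definitions))      # includes Procedure3 itself
print(find_callers("Procedure3", definitions))  # only procedures that EXEC it
```

In this sketch the substring search also returns Procedure3, because its own definition contains its name, while the EXEC-based match returns only Procedure1 and Procedure2. The regular expression is deliberately simple and only handles the `dbo` schema and bracketed names used in the example.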

July 3, 2015

by Joydeep Das

· 12,185 Views

Ramesh Shivakumaran Gulftainer records 8% growth in container volume to achieve 6.4 million TEUs in 2014

16 Apr 2015 In a year defined by international expansion and investments in new infrastructure to enhance operational efficiency, Gulftainer recorded robust growth across its entire terminal portfolio. Iain Rawlinson, Group Commercial Director of Gulftainer said: “The positive growth recorded by Gulftainer across its terminals globally underlines the confidence of our partners in our ability to meet their requirements efficiently. Our extensive network and technological expertise are the strengths that have enabled us to expand our footprint to new locations. We continuously invest in enhancing our infrastructure, thus boosting reliability, operational efficiency and productivity.” He added: “The growth in volume achieved throughout our terminals is strong testament to the expertise and dedication of our employees and the strong productivity levels we are able to achieve on a consistent basis. In the dynamic global trade routes linking Asia and Europe, our terminals today play an increasingly significant role. Even as we expand and grow our business, we also remain committed to the communities we serve in by creating new jobs and supporting the domestic economy.” In global markets, Gulftainer’s Saudi terminals recorded impressive growth with Northern Container Terminal accounting for 1.9 million TEUs, sustaining previous-year trends, while Jubail Container Terminal (JCT) noted a growth of 22 per cent to over 396,000 TEUs. The total volume at the Saudi terminals was over 2.29 million TEUs. Gulftainer’s Umm Qasr terminal also accomplished a significant growth of 46 per cent in 2014, while the Recife terminal in Brazil marked a growth in volume of 7 per cent. Gulftainer’s UAE terminals recorded a total volume of 3.8 million TEUs in line with the all-round growth in business. The company marked another significant milestone, with the Sharjah Container Terminal (SCT) surpassing 400,000 TEUs in annual throughput for the very first time. 
Operations at SCT were energised by the positive growth in global trade and the arrival of new services, such as UASC’s Gulf India Service (GIS1), which now connects Sharjah with Sohar in Oman, Mundra in India and Karachi in Pakistan. The addition of this service represented a significant development for Sharjah and boosted the national carrier’s volumes through SCT last year. The only fully fledged operational container terminal in the UAE located outside the Strait of Hormuz, Khorfakkan Container Terminal (KCT) has today emerged as one of the most important transshipment hubs for the Arabian Gulf, the Indian Sub-continent, the Gulf of Oman and the East African markets. Further strengthening the operations at KCT, Gulftainer has received and commissioned new state-of-the-art Ship to Shore (STS) and Rubber Tyred Gantry (RTG) cranes that will further increase overall performance and productivity. This enhanced infrastructure marks an investment of over US$60 million. Gulftainer has set an ambitious target to triple the volume over the next decade through organic growth across existing businesses, exploring green field opportunities and potential M&A activities.

July 2, 2015

by Androcles Buckley

· 689 Views

Using HA-JDBC with Spring Boot

This is a really simple way to provide high availability with failover and load balancing to any Java backend using JDBC and Spring Boot.

July 2, 2015

by Lieven Doclo

· 23,232 Views · 1 Like

GULFTAINER SURPASSES 400,000 TEU MILESTONE AT SHARJAH CONTAINER TERMINAL IN 2014

Gulftainer, a privately owned, independent terminal operating and logistics company, marked another significant milestone with the Sharjah Container Terminal (SCT) surpassing 400,000 TEUs (Twenty Foot Equivalent Units) in annual throughput during 2014. SCT has again recorded double-digit growth compared to last year’s volumes. The achievement was reached with an impressive safety record under challenging conditions including space constraints. Iain Rawlinson, Group Commercial Director of Gulftainer said that the professional approach of Gulftainer’s management, along with consistently high productivity levels, was a driving force behind the Terminal’s success. “SCT has always marketed itself as ‘The Flexible Alternative’ and the individual attention we extend to our customers offers us an advantage over competitors.” The 400,000th unit was discharged from Mag Container Lines’ vessel, ‘Mag Success’, one of the Terminal’s regular callers, which considers Sharjah as her base port. Speaking on behalf of Mag Line’s CEO, BDM Jamal Saleh congratulated the Terminal for its achievement. He said: “The announcement today reflects how Gulftainer and MCL have grown together over the years and, in partnership, managed to reach this target. The continuous support, flexibility and excellent operational performance MCL receives from Gulftainer, both operationally and logistically, has contributed greatly to this achievement.” The milestone was achieved on the shift of Duty Superintendent Mehmood Malik, the longest serving employee at over 38 years at the Terminal and part of the team when the first TEU crossed the quay. 
Mehmood has witnessed several records and milestones and recalls handling 2,500 TEUs in 1976: “At that time we could not imagine reaching the levels of throughput we have today, so this is a very special moment for me.” SCT, which is managed and operated by Gulftainer on behalf of the Sharjah Port Authority, has the honour of being the site of the first container terminal in the Gulf, having commenced operations in 1976. SCT is located in the heart of Sharjah and is an ideal gateway for import and export cargo with direct links throughout the Gulf, Asia, Europe, the Americas and Africa. The strong performance of the Sharjah economy has supported the growth of many of SCT’s customers, enabling them to increase their throughput and contribute to a record year for the Terminal. The relationships built with our customers have been strengthened by the joint efforts of Gulftainer’s sales and marketing team and the high levels of service and operational efficiency at the terminal. “When looking at the Sharjah market, the dedicated team at SCT listen to and address the many requirements of our diverse and interesting customer base,” said Iain Rawlinson. SCT’s figures have been further boosted with the arrival of new services throughout the year, including UASC’s Gulf India Service (GIS1), which now connects Sharjah with Sohar in Oman, Mundra in India and Karachi in Pakistan, and which boosted the national carrier’s volumes through SCT in November and December. Gulftainer’s current portfolio covers UAE operations in Khorfakkan Port and Port Khalid in Sharjah as well as activities at Umm Qasr in Iraq, Recife in Brazil, Jeddah and Jubail in Saudi Arabia and in Tripoli Port in Lebanon, which will be operational in April 2016. It also marked another milestone in 2014 with its expansion to the US by signing a long-term agreement to operate the container and multi-cargo terminal at Port Canaveral in Florida.
With a current handling activity of over 6 million TEUs, the company has set an ambitious target to triple the volume over the next decade through organic growth across existing businesses, exploring green field opportunities and potential M&A activities.

July 2, 2015

by Tirill Malmin

· 733 Views

Crowdsourcing our way to better food hygiene

The last few years have seen a tremendous boom in the number of sources online relaying information about restaurant quality. Whether it’s review sites or more general social media, there is no shortage of feedback on how people have found a particular restaurant.

I wrote a few years ago about a project from the University of Rochester that aimed to mine Twitter for mentions of eating out, with the hope of producing a detailed and comprehensive map of food hygiene standards throughout restaurants in New York. The system, called nEmesis, analyzed millions of tweets, and was on the hunt for people sharing an attack of food poisoning after visiting a restaurant. You might think, or hope at least, that this would be a relatively small number, but over a four-month period they found 480 such mentions in New York City alone from a total of 23,000 restaurant visitors. What’s more, the data collected correlated well with public health data on those diners.

Crowdsourcing food hygiene

A recent Harvard-led project is hoping to provide similar assistance to the Boston food hygiene authorities by providing more intelligent information for the authorities to base their inspection checks on. Rather than using Twitter for data, however, the Harvard project is turning to the review website Yelp. They have launched a Netflix-style competition to create an algorithm that can search through the ratings of restaurants in Boston and produce recommendations for which restaurants warrant a visit from the hygiene police.

The competition, organized by the data company DrivenData, will see the raw data posted online and then an army of data scientists charged with solving the puzzle. The founders observed that whilst the collection of machine-readable data was now mandated by the government, there was a literacy problem that left much of that data dormant and unused.

Bringing data science to the masses

And so the competition was born to try and make data science affordable for organizations with a clear social need but no budget to afford what are still very expensive skill sets. Of course, the food hygiene challenge is but one of the challenges on the DrivenData website, with the venture coming a long way from their first challenge to make a better algorithm for improving spending in schools.

The organization tries to ensure that whatever winning entries emerge from the competitions receive support and help to grow and improve. The winner of that initial competition, for instance, eventually turned their algorithm into a software tool for schools to use. The eventual aim is to establish a community of data scientists that are happy to deploy their talents for socially worthwhile endeavors.

“Our mindset has grown; we want to solve the big-picture data literacy and data capacity problems in the social and public sectors,” the creators say. “We think competitions are a great mechanism to do that right now, but our goal is to do more, to serve that community in other ways.”

Suffice to say, challenges have come a long way from their beginnings in the 18th century when the UK government launched such a competition to help find longitude more easily. The likes of the X Prize have taken them to newfound heights, and it’s great to see organizations like DrivenData apply the concept to more manageable challenges.

Of course, they aren’t the only organization seeking to make algorithms more accessible. I wrote last year about the Algorithmia social network, which aims to connect organizations that have lots of data with algorithms that are being under-utilized. The aim is that this match-up will create not just new insights but extra profits. Data science is undoubtedly a burgeoning field, and it’s one with a great many exciting developments in it.

July 2, 2015

by Adi Gaskell

· 869 Views · 1 Like

The Latest Data Engineering Topics