DZone

The Latest Data Engineering Topics

MapReduce and Yarn: Hadoop Processing Unit Part 1
In this article, we begin a two-part series on the basic implementation of MapReduce and Yarn in the Hadoop ecosystem.
December 26, 2019
by Dheeraj Gupta CORE
· 99,237 Views · 5 Likes
MapReduce Algorithms: Understanding Data Joins, Part II
It's been a while since I last posted, and, like the last time I took a big break, I was taking some classes on Coursera. This time it was Functional Programming Principles in Scala and Principles of Reactive Programming. I found both of them to be great courses and would recommend taking either one if you have the time.

In this post we resume our series on implementing the algorithms found in Data-Intensive Text Processing with MapReduce, this time covering map-side joins. As we can guess from the name, map-side joins join data exclusively during the mapping phase and completely skip the reducing phase. In the last post on data joins we covered reduce-side joins. Reduce-side joins are easy to implement, but have the drawback that all data is sent across the network to the reducers. Map-side joins offer substantial gains in performance since we avoid the cost of sending data across the network. However, unlike reduce-side joins, map-side joins require that very specific criteria be met. Today we will discuss the requirements for map-side joins and how we can implement them.

Map-Side Join Conditions

To take advantage of map-side joins, our data must meet one of the following criteria:

  • The datasets to be joined are already sorted by the same key and have the same number of partitions
  • Of the two datasets to be joined, one is small enough to fit into memory

We are going to consider the first scenario, where we have two (or more) datasets that need to be joined but are too large to fit into memory. We will also assume the worst-case scenario: the files aren't sorted or partitioned the same way.

Data Format

Before we start, let's take a look at the data we are working with. We will have two datasets:

  • The first dataset consists of a GUID, first name, last name, address, city, and state
  • The second dataset consists of a GUID and employer information

Both datasets are comma delimited, and the join key (GUID) is in the first position. After the join we want the employer information from dataset two appended to the end of dataset one. Additionally, we want to keep the GUID in the first position of dataset one, but remove the GUID from dataset two.

Dataset 1:

aef9422c-d08c-4457-9760-f2d564d673bc,Linda,Narvaez,3253 Davis Street,Atlanta,GA
08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC
de68186a-1004-4211-a866-736f414eac61,Charles,Arnold,1764 Public Works Drive,Johnson City,TN
6df1882d-4c81-4155-9d8b-0c35b2d34284,John,Schofield,65 Summit Park Avenue,Detroit,MI

Dataset 2:

de68186a-1004-4211-a866-736f414eac61,Jacobs
6df1882d-4c81-4155-9d8b-0c35b2d34284,Chief Auto Parts
aef9422c-d08c-4457-9760-f2d564d673bc,Earthworks Yard Maintenance
08db7c55-22ae-4199-8826-c67a5689f838,Ellman's Catalog Showrooms

Joined results:

08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC,Ellman's Catalog Showrooms
6df1882d-4c81-4155-9d8b-0c35b2d34284,John,Schofield,65 Summit Park Avenue,Detroit,MI,Chief Auto Parts
aef9422c-d08c-4457-9760-f2d564d673bc,Linda,Narvaez,3253 Davis Street,Atlanta,GA,Earthworks Yard Maintenance
de68186a-1004-4211-a866-736f414eac61,Charles,Arnold,1764 Public Works Drive,Johnson City,TN,Jacobs
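To make the record-level semantics concrete before we get to the jobs themselves, here is a tiny plain-Java illustration (not part of the MapReduce code in this post) of what the join does to a single pair of records, using the same Guava Splitter and Joiner classes the jobs below rely on; the variable names are only for illustration:

// Illustration only: the join applied to one record from each dataset, outside of MapReduce.
String person   = "08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC";
String employer = "08db7c55-22ae-4199-8826-c67a5689f838,Ellman's Catalog Showrooms";

List<String> employerFields = Lists.newArrayList(Splitter.on(',').split(employer));
employerFields.remove(0);  // drop the duplicate GUID from dataset two
String joined = person + "," + Joiner.on(',').join(employerFields);
// joined -> 08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC,Ellman's Catalog Showrooms

The MapReduce jobs below produce exactly this result, but for datasets far too large to process on a single machine.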
Now we move on to how we go about joining our two datasets.

Map-Side Joins with Large Datasets

To be able to perform map-side joins we need to have our data sorted by the same key and split into the same number of partitions, which implies that all records for a given key end up in the same partition. While this seems to be a tough requirement, it is easily fixed.

Hadoop sorts all keys and guarantees that all records with the same key are sent to the same reducer. So by simply running a MapReduce job that does nothing more than output the data keyed by the field you want to join on, and specifying the exact same number of reducers for all datasets, we get our data into the correct form. Considering the gains in efficiency from being able to do a map-side join, it may well be worth the cost of running the additional MapReduce jobs. It bears repeating that it is crucial that all datasets specify the exact same number of reducers during this "preparation" phase, when the data is sorted and partitioned.

In this post we will take two datasets, run an initial MapReduce job on both to do the sorting and partitioning, and then run a final job to perform the map-side join. First, let's cover the MapReduce job that sorts and partitions our data in the same way.

Step One: Sorting and Partitioning

First we need to create a Mapper that simply chooses the key for sorting by a given index:

public class SortByKeyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private int keyIndex;
    private Splitter splitter;
    private Joiner joiner;
    private Text joinKey = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        String separator = context.getConfiguration().get("separator");
        keyIndex = Integer.parseInt(context.getConfiguration().get("keyIndex"));
        splitter = Splitter.on(separator);
        joiner = Joiner.on(separator);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Iterable<String> values = splitter.split(value.toString());
        joinKey.set(Iterables.get(values, keyIndex));
        if (keyIndex != 0) {
            value.set(reorderValue(values, keyIndex));
        }
        context.write(joinKey, value);
    }

    private String reorderValue(Iterable<String> value, int index) {
        List<String> temp = Lists.newArrayList(value);
        String originalFirst = temp.get(0);
        String newFirst = temp.get(index);
        temp.set(0, newFirst);
        temp.set(index, originalFirst);
        return joiner.join(temp);
    }
}

The SortByKeyMapper simply sets the value of joinKey by extracting the field found at the position given by the configuration parameter keyIndex. Also, if keyIndex is not equal to zero, we swap the order of the values found in the first position and the keyIndex position. Although this may look like a questionable step, we'll discuss why we are doing it later.

Next we need a Reducer:

public class SortByKeyReducer extends Reducer<Text, Text, NullWritable, Text> {

    private static final NullWritable nullKey = NullWritable.get();

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(nullKey, value);
        }
    }
}

The SortByKeyReducer writes out all values for the given key, but throws out the key and writes a NullWritable instead. In the next section we will explain why we are not using the key.
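Before moving on to the join itself, here is a minimal, self-contained sketch of how one of these preparation jobs could be wired up on its own. This is illustrative only - the driver shown later configures these jobs through its own helper methods - and the class name SortJobSketch, the job name, and the path handling are assumptions rather than code from the accompanying source:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver for a single "preparation" (sort/partition) job.
public class SortJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("separator", ",");   // field delimiter read by SortByKeyMapper.setup()
        conf.set("keyIndex", "0");    // position of the join key within each record

        Job sortJob = Job.getInstance(conf, "sort-by-join-key");
        sortJob.setJarByClass(SortJobSketch.class);
        sortJob.setMapperClass(SortByKeyMapper.class);
        sortJob.setReducerClass(SortByKeyReducer.class);
        sortJob.setNumReduceTasks(10);                    // must be identical for every dataset being prepared
        sortJob.setMapOutputKeyClass(Text.class);
        sortJob.setMapOutputValueClass(Text.class);
        sortJob.setOutputKeyClass(NullWritable.class);
        sortJob.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(sortJob, new Path(args[0]));
        FileOutputFormat.setOutputPath(sortJob, new Path(args[0] + "_sorted"));
        System.exit(sortJob.waitForCompletion(true) ? 0 : 1);
    }
}

The essential point, mirrored from the driver shown later, is that every dataset is prepared with the same mapper, the same reducer, and the same number of reducers.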
Step Two: The Map-Side Join

When performing a map-side join, the records are merged before they reach the mapper. To achieve this, we use the CompositeInputFormat. We will also need to set some configuration properties. Let's look at how we will configure our map-side join:

private static Configuration getMapJoinConfiguration(String separator, String... paths) {
    Configuration config = new Configuration();
    config.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", separator);
    String joinExpression = CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, paths);
    config.set("mapred.join.expr", joinExpression);
    config.set("separator", separator);
    return config;
}

First, we specify the character that separates the key and values by setting the mapreduce.input.keyvaluelinerecordreader.key.value.separator property. Next we use the CompositeInputFormat.compose method to create a "join expression": we specify an inner join by using the word "inner", then the input format to use, KeyValueTextInputFormat, and finally a String varargs representing the paths of the files to join (which are the output paths of the MapReduce jobs run to sort and partition the data). The KeyValueTextInputFormat class will use the separator character to set the first value as the key and the rest as the value.

Mapper for the Join

Once the values from the source files have been joined, the Mapper.map method is called. It receives a Text object for the key (the same key across the joined records) and a TupleWritable that is composed of the values joined from our input files for a given key. Remember we want our final output to have the join key in the first position, followed by all of the joined values in one delimited String. To achieve this we have a custom mapper to put our data in the correct format:

public class CombineValuesMapper extends Mapper<Text, TupleWritable, NullWritable, Text> {

    private static final NullWritable nullKey = NullWritable.get();
    private Text outValue = new Text();
    private StringBuilder valueBuilder = new StringBuilder();
    private String separator;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        separator = context.getConfiguration().get("separator");
    }

    @Override
    protected void map(Text key, TupleWritable value, Context context) throws IOException, InterruptedException {
        valueBuilder.append(key).append(separator);
        for (Writable writable : value) {
            valueBuilder.append(writable.toString()).append(separator);
        }
        valueBuilder.setLength(valueBuilder.length() - 1);
        outValue.set(valueBuilder.toString());
        context.write(nullKey, outValue);
        valueBuilder.setLength(0);
    }
}

In the CombineValuesMapper we append the key and all the joined values into one delimited String. Here we can finally see the reason why we threw the join key away in the previous MapReduce jobs: since the key is in the first position in the values for all the datasets to be joined, our mapper naturally eliminates the duplicate keys from the joined datasets. All we need to do is insert the given key into a StringBuilder, then append the values contained in the TupleWritable.

Putting It All Together

Now we have all the code in place to run a map-side join on large datasets. Let's take a look at how we run all the jobs together. As stated before, we are assuming that our data is not sorted and partitioned the same way, so we need to run N (2 in this case) MapReduce jobs to get the data into the correct format. After the initial sorting/partitioning jobs run, the final job performing the actual join will run.

public class MapSideJoinDriver {

    public static void main(String[] args) throws Exception {
        String separator = ",";
        String keyIndex = "0";
        int numReducers = 10;
        String jobOneInputPath = args[0];
        String jobTwoInputPath = args[1];
        String joinJobOutPath = args[2];

        String jobOneSortedPath = jobOneInputPath + "_sorted";
        String jobTwoSortedPath = jobTwoInputPath + "_sorted";

        Job firstSort = Job.getInstance(getConfiguration(keyIndex, separator));
        configureJob(firstSort, "firstSort", numReducers, jobOneInputPath, jobOneSortedPath, SortByKeyMapper.class, SortByKeyReducer.class);

        Job secondSort = Job.getInstance(getConfiguration(keyIndex, separator));
        configureJob(secondSort, "secondSort", numReducers, jobTwoInputPath, jobTwoSortedPath, SortByKeyMapper.class, SortByKeyReducer.class);

        Job mapJoin = Job.getInstance(getMapJoinConfiguration(separator, jobOneSortedPath, jobTwoSortedPath));
        configureJob(mapJoin, "mapJoin", 0, jobOneSortedPath + "," + jobTwoSortedPath, joinJobOutPath, CombineValuesMapper.class, Reducer.class);
        mapJoin.setInputFormatClass(CompositeInputFormat.class);

        List<Job> jobs = Lists.newArrayList(firstSort, secondSort, mapJoin);
        int exitStatus = 0;
        for (Job job : jobs) {
            boolean jobSuccessful = job.waitForCompletion(true);
            if (!jobSuccessful) {
                System.out.println("Error with job " + job.getJobName() + " " + job.getStatus().getFailureInfo());
                exitStatus = 1;
                break;
            }
        }
        System.exit(exitStatus);
    }

The MapSideJoinDriver does the basic configuration for running the MapReduce jobs. One interesting point is that the sorting/partitioning jobs specify 10 reducers each, while the final job explicitly sets the number of reducers to 0, since we are joining on the map side and don't need a reduce phase. Since we don't have any complicated dependencies, we put the jobs in an ArrayList and run them in linear order.
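The getConfiguration and configureJob helpers referenced by the driver are not reproduced in the listing; they are part of the source code linked under Resources. Purely as an illustration of what they might look like - every detail here is an assumption, not the actual implementation - they could be sketched along these lines inside MapSideJoinDriver:

// Hypothetical sketch of the driver's helper methods; the real implementations
// are in the source code linked under Resources.
private static Configuration getConfiguration(String keyIndex, String separator) {
    Configuration config = new Configuration();
    config.set("keyIndex", keyIndex);     // read by SortByKeyMapper.setup()
    config.set("separator", separator);   // delimiter used by Splitter/Joiner
    return config;
}

private static void configureJob(Job job, String name, int numReducers, String inputPaths, String outputPath,
                                 Class<? extends Mapper> mapper, Class<? extends Reducer> reducer) throws IOException {
    job.setJobName(name);
    job.setJarByClass(MapSideJoinDriver.class);
    job.setMapperClass(mapper);
    job.setReducerClass(reducer);
    job.setNumReduceTasks(numReducers);           // 10 for the sort jobs, 0 for the map-side join
    if (numReducers > 0) {
        job.setMapOutputKeyClass(Text.class);     // the sort mappers emit (Text, Text)
        job.setMapOutputValueClass(Text.class);
    }
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPaths(job, inputPaths);   // also accepts the comma-separated list passed for the join job
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
}

Whatever the real helpers do, the settings that matter are the matching reducer counts for the preparation jobs and the zero reducers for the join job.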
Results

Initially we had two files: name and address information in the first file and employment information in the second. Both files had a unique ID in the first column.

File one:

...
08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC
...

File two:

...
08db7c55-22ae-4199-8826-c67a5689f838,Ellman's Catalog Showrooms
...

Results:

08db7c55-22ae-4199-8826-c67a5689f838,John,Gregory,258 Khale Street,Florence,SC,Ellman's Catalog Showrooms

As we can see, we've successfully joined the records together and maintained the format of the files without duplicate keys in the results.

Conclusion

In this post we've demonstrated how to perform a map-side join when both datasets are large and can't fit into memory. If you get the feeling this takes a lot of work to pull off, you are correct. While in most cases we would want to use higher-level tools like Pig or Hive, it's helpful to know the mechanics of performing map-side joins with large datasets. This is especially true on those occasions when you need to write a solution from scratch. Thanks for your time.

Resources

  • Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer
  • Hadoop: The Definitive Guide by Tom White
  • Source code and tests from the blog
  • Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen
  • Programming Pig by Alan Gates
  • Hadoop API
  • MRUnit for unit testing Apache Hadoop MapReduce jobs
May 23, 2023
by Bill Bejeck
· 9,151 Views · 1 Like
Managing Amazon WorkDocs With AWS Managed Microsoft Active Directory (AD)
Learn how Amazon WorkDocs delivers multiple business benefits as a document management system by providing secure, limited access.
January 27, 2020
by Raja Sekhar Mandava
· 5,221 Views · 2 Likes
Devs and Data, Part 3: Managing a Large Volume of Data
We take a look at what respondents to our 2019 Big Data Survey told us about data management and coping with enormous volumes of data.
February 25, 2019
by Jordan Baker
· 6,860 Views · 2 Likes
Managed MQTT Broker Comparison — Product Packages and Pricing
In this article, we compare several managed MQTT brokers in terms of product packages and pricing.
October 25, 2022
by David Li
· 5,015 Views · 1 Like
Managed MQTT Broker Comparison — Console/Dashboard Features
In this article, I compare several managed MQTT brokers in terms of console/dashboard features.
November 9, 2022
by David Li
· 5,206 Views · 1 Like
Making Your SSR Sites 42x Faster With Redis Cache
Let's look at how you can leverage Redis Cache with Node.js and Express.
May 5, 2022
by Johnny Simpson CORE
· 6,460 Views · 8 Likes
Making the County List Dynamic
A Zone Leader continues his case study of building a new application for a family member. In this article, read how tax rates that differ by county and change over time introduced a challenge.
December 11, 2018
by John Vester CORE
· 10,281 Views · 3 Likes
Making Machine Learning Accessible for Enterprises: Part 2
Let's discuss critical areas of machine learning-based solutions, such as model explainability and model governance.
August 8, 2018
by Ramesh Balakrishnan
· 4,423 Views · 3 Likes
Making Enterprise Developers Lives Easier With Cloud Tools: An Interview With Andi Grabner
This interview with Dynatrace DevOps activist Andi Grabner talks about some of the ways using Dynatrace helps developers spend more time coding.
February 12, 2020
by Blake Ethridge
· 8,869 Views · 6 Likes
Making Data Scientists Productive in Azure
In this article, we take a look at Azure's Machine Learning Studio and what services and tools are available for your big data needs.
December 18, 2019
by Valdas Maksimavičius
· 16,947 Views · 9 Likes
Make Windows Green Again (Part 4)
The quest to bring Linux to Windows continues. Let's tackle an error caused by mismatched tech, namely getting the distro to run on something other than VMware.
February 24, 2017
by Hannes Kuhnemund
· 4,101 Views · 2 Likes
Make Database Queries With Real-Time Chat
Integrate with any database and enable users to query it through real-time chat.
November 14, 2019
by Tom Smith CORE
· 4,860 Views · 2 Likes
4 Major Announcements from Google Cloud Next 2019
Here are a few of the bigger announcements we saw yesterday and today.
April 10, 2019
by Kara Phelps
· 8,681 Views · 2 Likes
The Magic Testing Challenge: Part 2
My last article raised an interesting discussion about whether you should see tests more as documentation or more as specification. I agree that they can contribute to both, but I still think tests are just that - tests. There were also complaints about my statement that testing often becomes tedious work which nobody likes. Here too I agree that techniques like TDD can help you structure your code and make sure you code exactly what is needed by writing the tests first, but the result of the process will still be a class which needs to be tested somehow. So I have set up another small challenge to show how the visual approach featured by MagicTest helps to make testing a breeze.

As you know, traditional assertion-based test frameworks like TestNG or JUnit force us to include the expected results in the test code. While this may be more or less suitable for simple tests (like in the previous article), it quickly becomes cumbersome if the test handles complex objects or voluminous data.

The Task

We must test the method createEvenOddTable() (see the appendix) with the following functionality:

  • Create an HTML table (elements table, tr, td) with the specified number of data rows and columns.
  • An additional row will be added to store header information (element th).
  • An additional column will be added which contains the row number (element th).
  • The rows will have the attribute class set to "head", "even", or "odd" for easy styling.

Both the specification (the four lines above) and the source code itself (25 lines) are short and simple to understand, so any experienced developer will write this method in a few minutes. So what's the problem with testing this method? We will see if we look at how MagicTest handles this case.

The Magic Test

The MagicTest for this method looks like this:

public class HtmlTableTest {

    @Trace
    public void testCreateEvenOddTable() {
        HtmlTable.createEvenOddTable(4, 3);
    }

    @Formatter(outputType=OutputType.TEXT)
    public static String formatElement(Element elem) {
        XMLOutputter serializer = new XMLOutputter();
        serializer.setFormat(Format.getPrettyFormat());
        return serializer.outputString(elem);
    }
}

Some details:

  • We use the @Trace annotation to automatically capture information about calls to the method under test.
  • We rely on naming conventions, so the method HtmlTable.createEvenOddTable() is tested by HtmlTableTest.testCreateEvenOddTable().
  • By default, MagicTest uses the toString() method to report the parameter and return values. As the Element's toString() method returns only its name, we have to define a custom @Formatter to get the full XML tree.

If we run the test, MagicTest generates a report showing the traced call together with its pretty-printed XML output. If we look at the XML element tree in the report, we can see all the details which a complete test should cover: correct nesting of elements (table, tr, td), correct header line, correct row numbers, correct number of rows, correct number of cells for each row, correct class attribute for each row, and so on.

But even if you end up with a bunch of lengthy assert statements like

assert("head".equals(((Element) elem.getChildren("tr").get(0)).getAttributeValue("class")));

which tests for the correct class attribute, this will not be enough: you should also test the absence of the class attribute for all cells except the first ones in each row. So yes, for a sound test you must actually verify the whole XML tree - and this is exactly the information which MagicTest shows you for confirmation.
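For comparison, here is a rough sketch of how a conventional assertion-based JUnit test for createEvenOddTable(4, 3) might start out. This is only an illustration - it is not taken from the article or its attached project, the class name is made up, and imports are omitted just like in the listings above - and it still spot-checks only a fraction of what the XML tree in the MagicTest report shows at a glance:

// Hypothetical JUnit 4 test; it pins down only a few of the documented properties.
public class HtmlTableAssertionTest {

    @Test
    public void testCreateEvenOddTable() {
        Element table = HtmlTable.createEvenOddTable(4, 3);
        List rows = table.getChildren("tr");

        assertEquals(5, rows.size());  // 4 data rows + 1 header row
        assertEquals("head", ((Element) rows.get(0)).getAttributeValue("class"));

        for (int r = 1; r < rows.size(); r++) {
            Element row = (Element) rows.get(r);
            assertEquals((r % 2 == 0) ? "even" : "odd", row.getAttributeValue("class"));
            assertEquals(4, row.getChildren().size());  // row-number column + 3 data columns
            Element first = (Element) row.getChildren().get(0);
            assertEquals("th", first.getName());
            assertEquals(Integer.toString(r), first.getText());
        }
        // ...and so on for element nesting, the header cells, absent class attributes, etc.
    }
}

Even this partial version is longer than the MagicTest above, and it keeps growing with every additional property you want pinned down.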
Let the Challenge Begin

To run the test yourself, you will need to download the MagicTest Eclipse plug-in. Copy it into the Eclipse dropins folder and restart Eclipse. Then download the attached Eclipse project and import it into your workspace. Run the test class TagsTest by executing Run As / MagicTest.

After the first run, the test report will show up and all test steps will be red. This is the MagicTest way of telling you that a step has failed. In our case, the steps fail simply because MagicTest does not yet know anything about the expected result. So we carefully check the output and confirm its correctness by clicking on the save button. Now all steps are green - and the test is successful.

You have now seen how efficiently this test can be realized using MagicTest - it even looked like fun. Does your test tool accept the challenge? How many minutes and lines does it take you to write the test? I'm looking forward to your contributions!

Appendix: Listing HtmlTable

/**
 * Create HTML table (elements table, tr, td) with specified number of data rows and columns.
 * An additional row will be added to store header information (element th).
 * An additional column will be added which contains the row number (element th).
 * The rows will have attribute class set to "head", "even", or "odd" for easy styling.
 *
 * @param rows number of rows
 * @param cols number of columns
 * @return XML element containing the HTML table
 */
public static Element createEvenOddTable(int rows, int cols) {
    Element table = new Element("table");
    for (int r = 0; r < rows + 1; r++) {
        Element tr = new Element("tr");
        table.addContent(tr);
        tr.setAttribute("class", (r == 0) ? "head" : ((r % 2 == 0) ? "even" : "odd"));
        for (int c = 0; c < cols + 1; c++) {
            Element td = new Element((r == 0 || c == 0) ? "th" : "td");
            tr.addContent(td);
            if (c == 0 && r > 0) {
                td.setText(Integer.toString(r));
            }
        }
    }
    return table;
}
May 23, 2023
by Thomas Mauch
· 4,359 Views · 1 Like
Magic Terminal over Web Sockets and SignalR
What would a web-based IDE be without a terminal component? Well, we no longer need to ponder that question since we've now got the Magic Terminal.
June 1, 2021
by Thomas Hansen CORE
· 7,230 Views · 2 Likes
MachineX: Two Parts of Association Rule Learning
Decouple the support and confidence requirements for Association Rule Learning in this article.
May 30, 2018
by Akshansh Jain
· 4,565 Views · 1 Like
MachineX: Artificial Neural Networks (Part 2)
This article takes a closer look at artificial neural networks, including forward propagation and backpropagation.
May 22, 2019
by Shubham Goyal
· 5,514 Views · 3 Likes
Machine Learning in Cybersecurity
Machine learning accelerates threat detection. It enables computers to learn as humans do: by trial and error.
June 28, 2022
by Navdeep Singh Gill
· 6,238 Views · 1 Like
Machine Learning for .Net Developers Using Visual Studio
See how to install a Visual Studio extension and a NuGet package.
May 21, 2019
by Ajay Kumar Singh
· 14,811 Views · 5 Likes