Bucketing, Multiplexing and Combining in Hadoop - Part 1
This is the first blog post in a series which looks at some data organization patterns in MapReduce. We’ll look at how to bucket output across multiple files in a single task, how to multiplex data across multiple files, and also how to coalesce data. These are all common patterns that are useful to have in your MapReduce toolkit.

We’ll kick things off with a look at bucketing data outputs in your map or reduce tasks. By default, when using a FileOutputFormat-derived OutputFormat (such as TextOutputFormat), all the outputs for a reduce task (or a map task in a map-only job) are written to a single file in HDFS.

Imagine a situation where you have user activity logs being streamed into HDFS, and you want to write a MapReduce job to better organize the incoming data. As an example, a large organization with multiple products may want to bucket the logs based on the product. To do this you’ll need the ability to write to multiple output files in a single task. Let’s take a look at how we can make that happen.

MultipleOutputFormat

There are a few ways you can achieve your goal, and the first option we’ll look at is the MultipleOutputFormat class in Hadoop. This is an abstract class that lets you do the following:

  • Define the output path for each and every key/value output record being emitted by a task.
  • Incorporate the input paths into the output directory for map-only jobs.
  • Redefine the key and value that are used to write to the underlying RecordWriter. This is useful in situations where you want to remove data from the outputs because it duplicates data in the filename.
  • For each output path, define the RecordWriter that should be used to write the outputs.

OK, enough with the words - let’s look at some data and code. First up is the simple data we’ll use in our example - imagine you work at a fruit market with locations in multiple cities, and you have a purchase transaction stream which contains the store location along with the fruit that was purchased.
    cupertino   apple
    sunnyvale   banana
    cupertino   pear

To help bucket your data for future analysis, you want to bin each record into city-specific files. For the simple data set above you don’t want to filter, project or transform your data, just bucket it out, so a simple identity map-only job will do. To force more than one mapper, we’ll write the data to two separate files.

    $ TAB="$(printf '\t')"
    $ hadoop fs -put - file1.txt << EOF
    cupertino${TAB}apple
    sunnyvale${TAB}banana
    EOF
    $ hadoop fs -put - file2.txt << EOF
    cupertino${TAB}pear
    EOF

Here’s the code which will let you write city-specific output files.

    import org.apache.commons.lang.StringUtils;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
    import org.apache.hadoop.util.Progressable;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    import java.io.IOException;
    import java.util.Arrays;

    /**
     * An example of how to use {@link org.apache.hadoop.mapred.lib.MultipleOutputFormat}.
     */
    public class MOFExample extends Configured implements Tool {

      /**
       * Create output files based on the output record's key name.
       */
      static class KeyBasedMultipleTextOutputFormat
          extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
          // key becomes the directory, name is the default "part-nnnnn" file
          return key.toString() + "/" + name;
        }
      }

      /**
       * The main job driver.
       */
      public int run(final String[] args) throws Exception {
        // every argument except the last is an input path
        String csvInputs = StringUtils.join(
            Arrays.copyOfRange(args, 0, args.length - 1), ",");
        Path outputDir = new Path(args[args.length - 1]);

        JobConf jobConf = new JobConf(super.getConf());
        jobConf.setJarByClass(MOFExample.class);
        jobConf.setNumReduceTasks(0);   // map-only job
        jobConf.setMapperClass(IdentityMapper.class);
        jobConf.setInputFormat(KeyValueTextInputFormat.class);
        jobConf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);

        FileInputFormat.setInputPaths(jobConf, csvInputs);
        FileOutputFormat.setOutputPath(jobConf, outputDir);

        return JobClient.runJob(jobConf).isSuccessful() ? 0 : 1;
      }

      /**
       * Main entry point for the utility.
       *
       * @param args arguments
       * @throws Exception when something goes wrong
       */
      public static void main(final String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MOFExample(), args);
        System.exit(res);
      }
    }

Run this code and you’ll see the following files in HDFS, where /output is the job output directory:

    $ hadoop fs -lsr /output
    /output/cupertino/part-00000
    /output/cupertino/part-00001
    /output/sunnyvale/part-00000

If you look at the output files you’ll see that the files contain the correct buckets.

    $ hadoop fs -cat /output/cupertino/*
    cupertino   apple
    cupertino   pear
    $ hadoop fs -cat /output/sunnyvale/*
    sunnyvale   banana

Awesome, you have your data bucketed by store. Now that we have everything working, let’s look at what we did to get there. We had to do two things to get this working:

Extend MultipleTextOutputFormat

This is where the magic happened - let’s look at that class again.

    static class KeyBasedMultipleTextOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
      }
    }

You are working with Text, which is why you extended MultipleTextOutputFormat, a class that in turn extends MultipleOutputFormat.
MultipleTextOutputFormat is a simple class which instructs MultipleOutputFormat to use TextOutputFormat as the underlying output format for writing out the records. If you were to use MultipleTextOutputFormat as-is, it would behave as if you were using the regular TextOutputFormat, which is to say that it would only write to a single output file. To write data to multiple files you had to extend it, as in the example above.

The generateFileNameForKeyValue method allows you to return the output path for each record. The third argument, name, is the original FileOutputFormat-created filename, which is in the form “part-nnnnn”, where “nnnnn” is the task index, to ensure uniqueness. To avoid file collisions, it’s a good idea to make sure your generated output paths are unique, and leveraging the original output filename is certainly a good way of doing this. In our example we’re using the key as the directory name, and then writing to the original FileOutputFormat filename within that directory.

Specify the OutputFormat

The next step was easy - specify that this output format should be used for your job:

    jobConf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class);

Earlier we also mentioned that you can use the input path as part of the output path, which we will look at next.

Using the input filename as part of the output filename in map-only jobs

What if we wanted to keep the input filename as part of the output filename? This only works for map-only jobs, and can be accomplished by overriding the getInputFileBasedOutputFileName method.
Let’s look at the following code to understand how this method fits into the overall sequence of actions that the MultipleOutputFormat class performs:

    public void write(K key, V value) throws IOException {
      // get the file name based on the key
      String keyBasedPath = generateFileNameForKeyValue(key, value, myName);

      // get the file name based on the input file name
      String finalPath = getInputFileBasedOutputFileName(myJob, keyBasedPath);

      // get the actual key
      K actualKey = generateActualKey(key, value);
      V actualValue = generateActualValue(key, value);

      RecordWriter<K, V> rw = this.recordWriters.get(finalPath);
      if (rw == null) {
        // if we don't have the record writer yet for the final path, create
        // one and add it to the cache
        rw = getBaseRecordWriter(myFS, myJob, finalPath, myProgressable);
        this.recordWriters.put(finalPath, rw);
      }
      rw.write(actualKey, actualValue);
    };

The getInputFileBasedOutputFileName method is called with the output of generateFileNameForKeyValue, which contains our already-customized output file. Our KeyBasedMultipleTextOutputFormat can now be updated to override getInputFileBasedOutputFileName and append the original input filename to the output filename:

    static class KeyBasedMultipleTextOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
      }

      @Override
      protected String getInputFileBasedOutputFileName(JobConf job, String name) {
        // "map.input.file" holds the path of the file the current map task is reading
        String inFileName = new Path(job.get("map.input.file")).getName();
        return name + "-" + inFileName;
      }
    }

If you run with your modified OutputFormat class you’ll see the following files in HDFS, confirming that the input filenames are now appended to the end of each output filename.
    $ hadoop fs -lsr /output
    /output/cupertino/part-00000-file1.txt
    /output/cupertino/part-00001-file2.txt
    /output/sunnyvale/part-00000-file1.txt

The implementation of getInputFileBasedOutputFileName in MultipleOutputFormat doesn’t do anything interesting by default, but if you set the value of the mapred.outputformat.numOfTrailingLegs configurable to an integer greater than 0, then getInputFileBasedOutputFileName will use part of the input path as the output path. Let’s see what happens when we set the value to 1:

    jobConf.setInt("mapred.outputformat.numOfTrailingLegs", 1);

The output files in HDFS now exactly mirror the input files used for the job:

    $ hadoop fs -lsr /output
    /output/file1.txt
    /output/file2.txt

If we set mapred.outputformat.numOfTrailingLegs to 2, and our input files exist in the /input directory, then our output directory looks like this:

    $ hadoop fs -lsr /output
    /output/input/file1.txt
    /output/input/file2.txt

Basically, as you keep incrementing mapred.outputformat.numOfTrailingLegs, MultipleOutputFormat will continue to go up the parent directories of the input file and use them in the output path.

Modifying the output key and value

It’s very possible that the actual key and value you want to emit are different from those that were used to determine the output file. In our example, we took the output key and wrote to a directory using the key name. If you do that, keeping the key in the output file may be redundant. How would we modify the output record so that the key isn’t written? MultipleOutputFormat has your back with the generateActualKey method.
    static class KeyBasedMultipleTextOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
      }

      @Override
      protected Text generateActualKey(Text key, Text value) {
        // a null key means only the value is written
        return null;
      }
    }

The returned value from this method replaces the key that’s supplied to the underlying RecordWriter, so if you return null as in the above example, no key will be written to the file.

    $ hadoop fs -cat /output/cupertino/*
    apple
    pear
    $ hadoop fs -cat /output/sunnyvale/*
    banana

You can achieve the same result for the output value by overriding the generateActualValue method.

Changing the RecordWriter

In our final step we’ll look at how you can leverage multiple RecordWriter classes for different output files. This is accomplished by overriding the getBaseRecordWriter method, which the write method shown earlier calls once for each new output path. In the example below we’re leveraging the same TextOutputFormat for all the files, but it gives you a sense of what can be accomplished.

    static class KeyBasedMultipleTextOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {
      @Override
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        return key.toString() + "/" + name;
      }

      @Override
      protected RecordWriter<Text, Text> getBaseRecordWriter(
          FileSystem fs, JobConf job, String name, Progressable prog)
          throws IOException {
        // name is the final output path, which starts with the key (the city)
        if (name.startsWith("cupertino")) {
          return new TextOutputFormat<Text, Text>().getRecordWriter(fs, job, name, prog);
        } else if (name.startsWith("sunnyvale")) {
          return new TextOutputFormat<Text, Text>().getRecordWriter(fs, job, name, prog);
        }
        return super.getBaseRecordWriter(fs, job, name, prog);
      }
    }

Conclusion

When using MultipleOutputFormat, give some thought to the number of distinct files that each task will create. It would be prudent to plan your bucketing so that you have a relatively small number of files. In this post we extended MultipleTextOutputFormat, which is a simple extension of MultipleOutputFormat that supports text outputs.
MultipleSequenceFileOutputFormat also exists to support SequenceFiles in a similar fashion.

So what are the shortcomings of the MultipleOutputFormat class?

  • If you have a job that uses both map and reduce phases, then MultipleOutputFormat can’t be used on the map side to write outputs. Of course, MultipleOutputFormat works fine in map-only jobs.
  • All RecordWriter classes must support exactly the same output record types. For example, you wouldn’t be able to have one RecordWriter that emitted one pair of key/value types for one output file, and another RecordWriter that emitted a different pair of types.
  • MultipleOutputFormat exists in the mapred package, so it won’t work with a job that requires use of the mapreduce package.

All is not lost if you bump into any of these issues, as you’ll discover in the next blog post.
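To make the path rules from this post easier to experiment with, here is a minimal plain-Java sketch that models them without a Hadoop cluster. It is not Hadoop code: the class BucketPathSketch and its method names are hypothetical, input paths are treated as simple slash-separated strings, and the trailing-legs model only covers values of 1 or more.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the output-path rules discussed in this post; not Hadoop code. */
final class BucketPathSketch {

    /** Models the generateFileNameForKeyValue override: the key becomes a directory. */
    static String keyBasedPath(String key, String name) {
        return key + "/" + name;
    }

    /** Models the getInputFileBasedOutputFileName override: append the input file's name. */
    static String withInputFileName(String name, String inputFile) {
        String inFileName = inputFile.substring(inputFile.lastIndexOf('/') + 1);
        return name + "-" + inFileName;
    }

    /** Models the trailing-legs setting: keep the last `legs` components of the input path. */
    static String trailingLegs(String inputFile, int legs) {
        List<String> parts = new ArrayList<>();
        for (String part : inputFile.split("/")) {
            if (!part.isEmpty()) {
                parts.add(part);
            }
        }
        int from = Math.max(0, parts.size() - legs);
        return String.join("/", parts.subList(from, parts.size()));
    }

    public static void main(String[] args) {
        String keyed = keyBasedPath("cupertino", "part-00000");
        System.out.println(keyed);
        System.out.println(withInputFileName(keyed, "/input/file1.txt"));
        System.out.println(trailingLegs("/input/file1.txt", 1));
        System.out.println(trailingLegs("/input/file1.txt", 2));
    }
}
```

Each method corresponds to one of the overrides or settings walked through above, so you can check what a given key, part name and input path would produce before running a real job.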
May 20, 2013
by Alex Holmes
· 6,270 Views
JavaFX Accordion Slide Out Menu for the NetBeans Platform
Let's say you have a NetBeans Platform application that puts a premium on vertical space. Maybe a Heads Up Display on a Touch Screen? Wouldn't it be great to have the menu slide out from the edge of the screen only when you need it? Well, the NetBeans Platform provides slide-in TopComponents, of course, but a JMenu just isn't going to work out so well inside one.

We can use JavaFX as part of the solution, as it provides some capabilities that the base Swing components available in the NetBeans Platform do not. Let's say we take all of our root MenuBar items and place them within an Accordion type pane. Each collapsible TitledPane of the Accordion control could then contain the sub-menu items, maybe represented by a JavaFX MenuButton. This would allow for a recursive Menu-like effect, but the overall container could be placed anywhere.

The result is the described effect sliding out and overlaid on top of the Favorites tab. I sprinkled in some transparency for good measure. Notice how we are able to completely eliminate the Menu Bar and Tool Bar, gaining potentially valuable real estate? The rest of this tutorial will explain the steps necessary to achieve something like this.

That article was written by Geertjan Wielenga and it will become clear that much of the base code for this article was extended from Geertjan's example. Thanks again Geertjan! Similar articles that may be helpful are below:

  • https://dzone.com/articles/javafx-fxml-meets-netbeans
  • https://dzone.com/articles/how-embed-javafx-chart-visual

All these articles are loosely coupled in a tutorial arc towards explaining and demonstrating the advantages of integrating JavaFX into the NetBeans Platform. The following two steps are borrowed exactly as found from Geertjan's tutorial:

Step 1: Remove the default menubar and replace with your own

    import org.openide.modules.ModuleInstall;

    public class Installer extends ModuleInstall {
        @Override
        public void restored() {
            System.setProperty("netbeans.winsys.menu_bar.path", "LookAndFeel/MenuBar.instance");
        }
    }

Step 2: In the layer.xml file define your Swing replacement menubar

I have also taken the liberty of hiding the Toolbars. Now why are we replacing the old MenuBar with a new MenuBar if we intend to hide it? Well, if you hide the MenuBar via the layer.xml as I did the Toolbars, the filesystem folder tree will not be instantiated. That means we won't be able to dynamically determine the Menu folder tree to rebuild our custom AccordionMenu. The solution? Make an empty MenuBar.

    package polaris.javafxwizard.jfxmenu;

    import javax.swing.JMenuBar;

    /**
     *
     * @author SPhillips (King of Australia)
     */
    public class HiddenMenuBar extends JMenuBar {
        public HiddenMenuBar() {
            super();
        }
    }

Step 3: Build an "AccordionMenu" using JavaFX

This is where the tutorials diverge and the process gets a bit more complicated. Our task is to use the JavaFX/Swing Interop pattern to create a component that extends JFXPanel yet can give the user access to all the items that were once in the Menu Bar. The basic algorithm is as follows:

  • Create a component that extends JFXPanel
  • Implement the standard Platform.runLater() pattern for creating a JavaFX scene
  • Loop through each top level file object in the Menu folder of the application file system:
      • Create a JavaFX FlowPane for each file object
      • Recursively create JavaFX MenuButton items for submenus
      • Add MenuButton items to FlowPanes
      • Add FlowPane to JavaFX TitledPane
      • Add TitledPane to JavaFX Accordion component
  • Add Accordion to scene

So instead of Menus and SubMenus, we are using MenuButtons, which can recursively hold further menus and MenuItems. The Accordion control gives us a space saving collapsible view with some nice animation.
The FlowPane makes it easy to lay out the MenuButtons horizontally in a way that maximizes space. Below is the code for my AccordionMenu class. You will see where I borrowed heavily from Geertjan's example:

    package polaris.javafxwizard.jfxmenu;

    import java.lang.reflect.InvocationTargetException;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import javafx.application.Platform;
    import javafx.embed.swing.JFXPanel;
    import javafx.event.ActionEvent;
    import javafx.event.EventHandler;
    import javafx.geometry.Orientation;
    import javafx.scene.Group;
    import javafx.scene.Scene;
    import javafx.scene.control.Accordion;
    import javafx.scene.control.Button;
    import javafx.scene.control.Menu;
    import javafx.scene.control.MenuButton;
    import javafx.scene.control.MenuItem;
    import javafx.scene.control.TitledPane;
    import javafx.scene.effect.DropShadow;
    import javafx.scene.layout.FlowPane;
    import javafx.scene.paint.Color;
    import javax.swing.Action;
    import javax.swing.SwingUtilities;
    import org.openide.awt.Actions;
    import org.openide.filesystems.FileObject;
    import org.openide.filesystems.FileUtil;
    import org.openide.loaders.DataFolder;
    import org.openide.loaders.DataObject;
    import org.openide.util.Exceptions;

    /**
     *
     * @author SPhillips (King of Australia)
     */
    public class AccordionMenu extends JFXPanel {

        public Accordion accordionPane;
        public String transparentCSS = "-fx-background-color: rgba(0,100,100,0.1);";

        public AccordionMenu() {
            super();
            // create JavaFX scene
            Platform.setImplicitExit(false);
            Platform.runLater(new Runnable() {
                @Override
                public void run() {
                    createScene(); // Standard Swing Interop Pattern
                }
            });
        }

        private void createScene() {
            FileObject menuFolder = FileUtil.getConfigFile("Menu");
            FileObject[] menuKids = menuFolder.getChildren();
            // for each Menu folder need to create a TitledPane and add it to an Accordion
            List<TitledPane> titledPaneList = new ArrayList<>();
            for (FileObject menuKid : FileUtil.getOrder(Arrays.asList(menuKids), true)) {
                // Build a FlowPane based on menu children
                // TOP level menu items should all be flow panes
                FlowPane flowPane = buildFlowPane(menuKid);
                flowPane.setStyle(transparentCSS);
                TitledPane newTitledPaneFromFileObject = new TitledPane(menuKid.getName(), flowPane);
                newTitledPaneFromFileObject.setAnimated(true);
                newTitledPaneFromFileObject.autosize();
                newTitledPaneFromFileObject.setStyle(transparentCSS);
                titledPaneList.add(newTitledPaneFromFileObject);
            }
            Group g = new Group();
            Scene scene = new Scene(g, 400, 400, new Color(0.0, 0.0, 0.0, 0.0));
            scene.setFill(null);
            g.setStyle(transparentCSS);
            accordionPane = new Accordion();
            accordionPane.setStyle(transparentCSS);
            accordionPane.getPanes().addAll(titledPaneList);
            g.getChildren().add(accordionPane);
            setScene(scene);
            validate();
            this.setOpaque(true);
            this.setBackground(new java.awt.Color(0.0f, 0.0f, 0.0f, 0.0f));
        }

        private FlowPane buildFlowPane(FileObject fo) {
            // FlowPanes are made up of Buttons and MenuButtons built from actions and sub menus
            FlowPane flowPane = new FlowPane(Orientation.HORIZONTAL, 5, 5);
            flowPane.setStyle(transparentCSS);
            // If anything at the FlowPane level is an action we need to add it as a button,
            // otherwise we can recursively build it as a MenuButton
            DataFolder df = DataFolder.findFolder(fo);
            DataObject[] childs = df.getChildren();
            for (DataObject oneChild : childs) {
                if (oneChild.getPrimaryFile().isFolder()) {
                    // If child is folder we need to build recursively
                    FileObject childFo = oneChild.getPrimaryFile();
                    MenuButton newMenuButton = new MenuButton(childFo.getName());
                    buildMenuItems(childFo, newMenuButton.getItems());
                    flowPane.getChildren().add(newMenuButton);
                } else {
                    Object instanceObj = FileUtil.getConfigObject(
                            oneChild.getPrimaryFile().getPath(), Object.class);
                    if (instanceObj instanceof Action) {
                        // If it is an Action we have reached an endpoint
                        final Action a = (Action) instanceObj;
                        String name = (String) a.getValue(Action.NAME);
                        String cutAmpersand = Actions.cutAmpersand(name);
                        Button buttonItem = new Button(cutAmpersand);
                        MenuEventHandler meh = new MenuEventHandler(a);
                        buttonItem.setOnAction(meh);
                        buttonItem.setEffect(new DropShadow());
                        flowPane.getChildren().add(buttonItem);
                    }
                }
            }
            return flowPane;
        }

        // Fills the given item list (from a MenuButton or a nested Menu) from a folder.
        private void buildMenuItems(FileObject fo, List<MenuItem> items) {
            DataFolder df = DataFolder.findFolder(fo);
            DataObject[] childs = df.getChildren();
            for (DataObject oneChild : childs) {
                if (oneChild.getPrimaryFile().isFolder()) {
                    // If child is folder we need to build recursively; a nested Menu is
                    // used here so that the sub-menu is actually attached to its parent
                    FileObject childFo = oneChild.getPrimaryFile();
                    Menu newMenu = new Menu(childFo.getName());
                    buildMenuItems(childFo, newMenu.getItems());
                    items.add(newMenu);
                } else {
                    Object instanceObj = FileUtil.getConfigObject(
                            oneChild.getPrimaryFile().getPath(), Object.class);
                    if (instanceObj instanceof Action) {
                        // If it is an Action we have reached an endpoint
                        final Action a = (Action) instanceObj;
                        String name = (String) a.getValue(Action.NAME);
                        String cutAmpersand = Actions.cutAmpersand(name);
                        MenuItem menuItem = new MenuItem(cutAmpersand);
                        MenuEventHandler meh = new MenuEventHandler(a);
                        menuItem.setOnAction(meh);
                        items.add(menuItem);
                    }
                }
            }
        }

        private class MenuEventHandler implements EventHandler<ActionEvent> {

            public Action theAction;

            public MenuEventHandler(Action action) {
                super();
                theAction = action;
            }

            @Override
            public void handle(final ActionEvent t) {
                try {
                    // Swing actions must run on the Swing event dispatch thread
                    SwingUtilities.invokeAndWait(new Runnable() {
                        @Override
                        public void run() {
                            java.awt.event.ActionEvent event = new java.awt.event.ActionEvent(
                                    t.getSource(), t.hashCode(), t.toString());
                            theAction.actionPerformed(event);
                        }
                    });
                } catch (InterruptedException | InvocationTargetException ex) {
                    Exceptions.printStackTrace(ex);
                }
            }
        }
    }

I took the liberty of placing a few CSS stylings here and there, trying to play with the transparency. Also, I found that it looked better if a JavaFX Button was used for any Actions found at the very top level, instead of a MenuButton with a single item.
Step 4: Build a Slide in TopComponent for the new AccordionMenu

Now that you have a JFXPanel Swing Interop component, your NetBeans Platform TopComponent doesn't need to know about JavaFX. However, in this scenario the Platform is also contributing via its wonderful docking framework. Use the Window wizard and select Left Sliding In as a mode. I would also advise making this component not closable, otherwise the user could lose the ability to use the menu. Here are the annotations and constructor code in my TopComponent:

    @ConvertAsProperties(
            dtd = "-//polaris.javafxwizard.jfxmenu//SlidingAccordion//EN",
            autostore = false)
    @TopComponent.Description(
            preferredID = "SlidingAccordionTopComponent",
            iconBase = "polaris/javafxwizard/jfxmenu/categories.png",
            persistenceType = TopComponent.PERSISTENCE_ALWAYS)
    @TopComponent.Registration(mode = "leftSlidingSide", openAtStartup = true)
    @ActionID(category = "Window", id = "polaris.javafxwizard.jfxmenu.SlidingAccordionTopComponent")
    @ActionReference(path = "Menu/JavaFX" /*, position = 333 */)
    @TopComponent.OpenActionRegistration(
            displayName = "#CTL_SlidingAccordionAction",
            preferredID = "SlidingAccordionTopComponent")
    @Messages({
        "CTL_SlidingAccordionAction=SlidingAccordion",
        "CTL_SlidingAccordionTopComponent=SlidingAccordion Window",
        "HINT_SlidingAccordionTopComponent=This is a SlidingAccordion window"
    })
    public final class SlidingAccordionTopComponent extends TopComponent {

        public AccordionMenu accordionMenu;

        public SlidingAccordionTopComponent() {
            initComponents();
            setName(Bundle.CTL_SlidingAccordionTopComponent());
            setToolTipText(Bundle.HINT_SlidingAccordionTopComponent());
            putClientProperty(TopComponent.PROP_CLOSING_DISABLED, Boolean.TRUE);
            putClientProperty(TopComponent.PROP_DRAGGING_DISABLED, Boolean.TRUE);
            putClientProperty(TopComponent.PROP_MAXIMIZATION_DISABLED, Boolean.TRUE);
            putClientProperty(TopComponent.PROP_UNDOCKING_DISABLED, Boolean.TRUE);
            putClientProperty(TopComponent.PROP_KEEP_PREFERRED_SIZE_WHEN_SLIDED_IN, Boolean.TRUE);
            setLayout(new BorderLayout());
            // Standard JFXPanel Swing Interop Pattern
            accordionMenu = new AccordionMenu();
            // transparency
            Color transparent = new Color(0.0f, 0.0f, 0.0f, 0.0f);
            accordionMenu.setOpaque(true);
            accordionMenu.setBackground(transparent);
            this.add(accordionMenu);
            this.setOpaque(true);
            this.setBackground(transparent);
        }

Step 5: See how great it looks

We now have a slide out collapsible application menu provided by JavaFX components. These components can be "skinned" using CSS stylings, and as such the menu can be crafted differently for different applications. (By the way, if anyone reading this has some ideas please contact me, because I am not a CSS guy at all.) Best of all, we have adapted our application to work nicely with a Heads Up Display or Kiosk view of the kind that typically runs on touchscreen computers. This is because we have saved real estate and implemented an interface that is more conducive to single touches than to mouse drag events. Hey, let's see how it might look with an application that needs all the space it can get.
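The recursion in Step 3 can be hard to follow inside the JavaFX code, so here is a small plain-Java model of it. This is only a sketch: the MenuTreeSketch class, its Node type and the text labels are hypothetical stand-ins for the NetBeans file objects and JavaFX controls, and it assumes nested folders below a MenuButton are shown as nested menus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java model of the menu-folder walk; no NetBeans or JavaFX required. */
final class MenuTreeSketch {

    /** A folder (has children) or an action (leaf) in the hypothetical Menu tree. */
    static final class Node {
        final String name;
        final boolean folder;
        final List<Node> children;

        private Node(String name, boolean folder, List<Node> children) {
            this.name = name;
            this.folder = folder;
            this.children = children;
        }

        static Node folder(String name, Node... children) {
            return new Node(name, true, Arrays.asList(children));
        }

        static Node action(String name) {
            return new Node(name, false, new ArrayList<Node>());
        }
    }

    /** Top level folders become TitledPanes; their children become Buttons or MenuButtons. */
    static List<String> describe(Node menuRoot) {
        List<String> out = new ArrayList<>();
        for (Node top : menuRoot.children) {
            out.add("TitledPane " + top.name);
            for (Node child : top.children) {
                if (child.folder) {
                    out.add("  MenuButton " + child.name);
                    describeItems(child, "    ", out);
                } else {
                    out.add("  Button " + child.name);
                }
            }
        }
        return out;
    }

    /** Below a MenuButton, folders recurse as nested menus and actions end the recursion. */
    private static void describeItems(Node folder, String indent, List<String> out) {
        for (Node child : folder.children) {
            if (child.folder) {
                out.add(indent + "Menu " + child.name);
                describeItems(child, indent + "  ", out);
            } else {
                out.add(indent + "MenuItem " + child.name);
            }
        }
    }

    public static void main(String[] args) {
        Node root = Node.folder("Menu",
                Node.folder("File",
                        Node.action("Open"),
                        Node.folder("Recent", Node.action("a.txt"))));
        for (String line : describe(root)) {
            System.out.println(line);
        }
    }
}
```

The two methods mirror the split between building the top level FlowPane contents and recursively filling a MenuButton, which is the part of the real class that is easiest to get wrong.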
May 17, 2013
by Sean Phillips
· 20,374 Views
Deploy a File Server in the Cloud (WebDav on Windows Azure)
This month, my fellow IT Pro Technical Evangelists and I are authoring a new series of articles on 20 Key Scenarios with Windows Azure Infrastructure Services. Check out the list of articles here: http://mythoughtsonit.com/2013/05/20-key-scenarios-with-windows-azure-infrastructure-services/ .

Web-based Distributed Authoring and Versioning, or WebDAV, is a set of protocols based on HTTP that allows end users to map a network drive over HTTP and edit content and files stored on the web server. When WebDAV was first offered on Microsoft Server I evaluated it and decided it did not perform well enough for me. The WebDAV extension to IIS was completely rewritten back in the Server 2008 timeframe and is worth taking a look at again. In this article I will guide you step by step through the process of setting up WebDAV on Server 2012 in a Windows Azure IaaS environment. This will give you a solid performing file share on the internet over port 80 and the HTTP protocol.

First, you need an Azure account. You can set up a free trial of Azure. Details can be found here: http://mythoughtsonit.com/2013/04/step-by-step-guide-to-setting-up-a-windows-azure-free-trial/

Second, provision a Server 2012 machine.

Third, open port 80 to this new server:

  • In the Azure portal select your 2012 server and choose the “Endpoints” tab on the top.
  • Click “Add Endpoint” at the bottom of the screen.
  • Enter the endpoint information for port 80 to port 80.
  • Done.

Next we need to install the IIS web server and WebDAV.

Installing WebDAV on IIS 8.0

  • Start Server Manager and go to “Add Roles and Features”.
  • Under Server Roles, add the Web Server (IIS) role.
  • Click through the wizard until you come to the Role Services section. Then find and select “WebDAV Publishing” and “Windows Authentication”.
  • Click Next and then Install.

When the install is finished you are ready to move on to the next section.
Configuring IIS 8 for WebDAV

After the installation finishes you need to configure the box for access.

  • Start the IIS Manager tool.
  • Choose the “Default Web Site” on the left side.
  • Click on “Authentication”, open the Windows Authentication option and enable it.
  • Open the “WebDAV Authoring Rules”.
  • Create a WebDAV rule. I chose to allow all users access to all content. A better security practice is to limit which users can use the service. It’s your data, so you decide.
  • Make sure WebDAV is enabled and that your access rule is set.

That is it… now you’re ready to access your WebDAV file share!

Test and ensure you can hit the web server by using your browser. Because you opened port 80 and installed IIS 8, you should see the default web page when you browse to your server’s internet DNS name. Example: http://yourdomainname.cloudapp.net/

How to map a drive to your WebDAV server

There are two ways I use to connect to the WebDAV server.

How to map a drive to your WebDAV server from the Win 8 GUI:

  • From Windows Explorer, right click on “Computer” and select “Map a network drive”.
  • Map your network drive by entering the address of your server. Example: http://yourdomainname.cloudapp.net/
  • I selected “Connect using different credentials” because my workstation was not joined to the server in any way and I needed to use an account in the server’s local SAM database.
  • Hit “Finish” and enter your credentials.

Now you will have a connected drive that you can access from Windows Explorer or any tool via the drive mapping.

How to map a drive to your WebDAV server from a CMD box:

  1. Hit Windows Start and type: cmd
  2. Enter the command: net use [drive letter] [url]

Example: net use E: http://yourdomainname.cloudapp.net/
May 15, 2013
by Brian Lewis
· 15,918 Views
Synchronising Multithreaded Integration Tests revisited
I recently stumbled upon the article Synchronising Multithreaded Integration Tests on Captain Debug's Blog. That post emphasizes the problem of designing integration tests for a class under test that runs its business logic asynchronously. This contrived example was given (I stripped some comments):

    public class ThreadWrapper {

        public void doWork() {
            Thread thread = new Thread() {

                @Override
                public void run() {
                    System.out.println("Start of the thread");
                    addDataToDB();
                    System.out.println("End of the thread method");
                }

                private void addDataToDB() {
                    // Dummy Code...
                    try {
                        Thread.sleep(4000);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                }
            };
            thread.start();
            System.out.println("Off and running...");
        }
    }

This is only an example of a common pattern where business logic is delegated to some asynchronous job pool we have no control over. Roger Hughes (the author) enumerates a few techniques for testing such code, including:

  • an arbitrary ("long enough") sleep() in the test method to make sure the background logic finishes
  • refactoring doWork() so that it accepts a CountDownLatch and agrees to notify it when the job is done
  • making the method above package private and @VisibleForTesting only
  • "The" solution - refactoring doWork() so that it accepts an arbitrary Runnable. In the test we can wrap this Runnable (decorator pattern) and wait for the inner Runnable to complete

The last solution is not bad, but it changes the responsibilities of ThreadWrapper significantly. Now it's up to the caller to decide what kind of job should be executed asynchronously, while previously ThreadWrapper was encapsulating the business logic completely. I am not saying it's a bad design, but it's drastically different from the original method.

Awaitility

Can we write a test without such a massive refactoring? The first solution involves a handy library called Awaitility. This library is not a silver bullet; it simply evaluates a given condition periodically and makes sure it's fulfilled within a given time.
It's the kind of code you probably wrote once or twice - wrapped in a library with well designed API. So here is our initial approach: import static com.jayway.awaitility.Awaitility.await; import static java.util.concurrent.TimeUnit.SECONDS; //... await().atMost(10, SECONDS).until(recordInserted()); //... private Callable<Boolean> recordInserted() { return new Callable<Boolean>() { @Override public Boolean call() throws Exception { return dataExists(); } }; } I think there is nothing to explain here. dataExists() is simply a boolean method that initially returns false but will eventually return true once the background task (addDataToDB()) is done. In other words we assume that background task introduces some side effect and dataExists() can detect that side effect. BTW I happened to have JDK 8 with Lambda support installed and IntelliJ IDEA gives me this nice tooltip: Suddenly I get this Java 8-compatible alternative suggested: private Callable<Boolean> recordInserted() { return () -> dataExists(); } But there's more: Which transforms my code to: private Callable<Boolean> recordInserted() { return this::dataExists; } this:: prefix means that dataExists is a method of the current object. Just as well we can say someDao::dataExists. Simply put this syntax turns method into a function object we can pass around (this process is called eta expansion in Scala). By now the recordInserted() method is no longer really needed so I can inline it and remove it completely: await().atMost(10, SECONDS).until(this::dataExists); I am not sure what I love more - the new lambda syntax or how IntelliJ IDEA takes pre-Java 8 code and retrofits it for me automatically (well, it's still a bit experimental, just reported IDEA-106670). I can run this intention in IntelliJ project-wide, Lambda-enabling my whole code base in seconds. Sweet! But back to original problem. Awaitility helps a lot by providing decent API and some handy features. I use it extensively in combination with FluentLenium. 
But periodically polling for state changes feels a bit like a workaround and still introduces minimal latency. But notice that running and synchronizing on asynchronous tasks is quite common and JDK already provides necessary facilities: Future abstraction! java.util.concurrent.Future To limit the scope of refactoring I will leave the original new Thread() approach for now and use SettableFuture from Guava. It is a Future implementation that allows triggering completion or failure at any time, from any thread (see DeferredResult - asynchronous processing in Spring MVC for more advanced usage). As you can see the changes are quite small: public class ThreadWrapper { public ListenableFuture<Void> doWork() { final SettableFuture<Void> future = SettableFuture.create(); Thread thread = new Thread() { @Override public void run() { addDataToDB(); //... //last instruction future.set(null); } private void addDataToDB() { // Dummy Code... // ... } }; thread.start(); return future; } } doWork() now returns ListenableFuture<Void> with lifecycle controlled inside asynchronous task. We use Void but in reality you might want to return some asynchronous result instead. future.set(null) invocation in the end is crucial. It signals that future is fulfilled and all threads waiting for that future will be notified. Once again, in practice you would use e.g. Future<Integer> and then instead of null we would say future.set(someInteger). Here null is just a placeholder for Void type. How does this help us? Test code can now rely on future completion: final ListenableFuture<Void> future = wrapper.doWork(); future.get(10, SECONDS); future.get() blocks until future is done (with timeout), i.e. until we call future.set(...). BTW I use ListenableFuture from Guava but Java 8 introduces equivalent and standard CompletableFuture - I will write about it soon. So, we are getting somewhere. Future is a useful abstraction for waiting and signalling completion of background jobs. 
But there is also one immense advantage of Future which we are not taking, ekhm, advantage of - exception handling and propagation. Future.get() will block until future is complete and return asynchronous result or throw an exception initially thrown from our job. This is really useful for asynchronous tests. Currently if Thread.run() throws an exception it may or may not be logged or visible to us and future will never be completed. With Awaitility it's slightly better - it will timeout without any meaningful reason, which has to be tracked down manually in console/logs. But with minor modification our test is much more verbose: public void run() { try { addDataToDB(); //... future.set(null); } catch (Exception e) { future.setException(e); } } If some exception occurs in asynchronous job, it will pop-up and be shown as JUnit/TestNG failure reason. (Listening)ExecutorService That's it. If addDataToDB() throws an exception it will not be lost. Instead our future.get() in test will re-throw that exception for us. Our test won't simply timeout leaving us with no clue what went wrong. Great, but do we really have to create this special SettableFuture instance, can't we just use existing libraries that already give us Future with correct underlying implementation? Of course! But this requires further refactoring: import com.google.common.util.concurrent.ListenableFuture; import com.google.common.util.concurrent.ListeningExecutorService; import com.google.common.util.concurrent.MoreExecutors; import java.util.concurrent.Executors; public class ThreadWrapper { private final ListeningExecutorService executorService = MoreExecutors.listeningDecorator( Executors.newSingleThreadExecutor() ); public ListenableFuture<?> doWork() { Runnable job = new Runnable() { @Override public void run() { //... } }; return executorService.submit(job); } } This is what you've all been waiting for. Don't start new Thread all the time, use thread pool! 
I actually went one step further by using ListeningExecutorService - an extension to ExecutorService that returns ListenableFuture instances (see why you want that). But the solution doesn't require this, I just spread good practices. As you can see Future instance is now created and managed for us. The test is exactly the same but production code is cleaner and more robust. MoreExecutors.sameThreadExecutor() The final trick I want to show you involves dependency injection. First let's externalize the creation of a thread pool from ThreadWrapper class: private final ListeningExecutorService executorService; public ThreadWrapper() { this(Executors.newSingleThreadExecutor()); } public ThreadWrapper(ExecutorService executorService) { this.executorService = MoreExecutors.listeningDecorator(executorService); } We can now optionally supply custom ExecutorService. This is good for various other reasons, but for us it opens brand new testing opportunity: MoreExecutors.sameThreadExecutor(). This time we modify our test slightly: final ThreadWrapper wrapper = new ThreadWrapper(MoreExecutors.sameThreadExecutor()); wrapper.doWork().get(); See how we pass custom ExecutorService? It's a very special implementation that doesn't really maintain thread pool of any kind. Every time you submit() some task to that "pool" it will be executed in the same thread in a blocking manner. This means that we no longer have asynchronous test, even though the production code wasn't changed that much! wrapper.doWork() will block until "background" job finishes. The extra call to get() is still needed to make sure exceptions are propagated, but is guaranteed to never block (because the job is already done). Using the same thread to execute asynchronous task instead of a thread pool might have an unexpected results if you somehow depend on thread-based properties, e.g. transactions, security, ThreadLocal. 
However if you use standard ThreadPoolExecutor with CallerRunsPolicy, JDK already behaves this way if thread pool is overflowed. So it's not that unusual. Summary Testing asynchronous code is hard, but you have options. Several options. But one conclusion that strikes me is the side effect of our efforts. We refactored original code in order to make it testable. But the final production code is not only testable, but also much better structured and robust. Surprisingly it's even source-code compatible with previous version as we barely changed return type from void to Future. It seems to be a rule - testable code is often better designed and implemented. Unit test is the first client code using our library. It naturally forces us to think more about consumers, not the implementation.
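The exception propagation discussed in this article can also be seen with nothing but the JDK. The sketch below uses a plain ExecutorService rather than Guava, so it only illustrates the general Future contract; the class name `FuturePropagationSketch` and the method `runAndReportFailure` are made up for this example.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FuturePropagationSketch {

    // Hypothetical helper: runs the job on a background thread, waits up to 10 s,
    // and returns the message of the exception the job threw (null on success).
    public static String runAndReportFailure(Runnable job) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = pool.submit(job);
            future.get(10, TimeUnit.SECONDS); // blocks until the job is done
            return null;
        } catch (ExecutionException e) {
            // The original exception from the background job is the cause.
            return e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } catch (TimeoutException e) {
            return "timed out";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndReportFailure(() -> { }));
        System.out.println(runAndReportFailure(() -> {
            throw new IllegalStateException("insert failed");
        }));
    }
}
```

In a real test you would most likely let the ExecutionException escape so that JUnit/TestNG reports it as the failure reason; returning the message here just keeps the sketch easy to check.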
May 7, 2013
by Tomasz Nurkiewicz
· 8,936 Views · 1 Like
article thumbnail
How to Integrate JavaFX into a NetBeans Platform Wizard (Part 1)
When working within the NetBeans Platform, Swing is King. JavaFX is the crown prince. However, some developers avoid developing GUI controls with JavaFX in the NetBeans Platform because Swing is available by default. Well, it is possible to develop your JavaFX forms and simply replace the default NetBeans panels. The following tutorial explains how a developer can take a JavaFX GUI form and FXML developed using Scene Builder and replace a NetBeans Platform Wizard visual panel with minimal effort. Now, why would this concept be useful? Well, consider a development team where new Java applications are being written in JavaFX. Why rewrite the useful Panel classes to Swing just to use them within a NetBeans Platform Wizard? Why force new form development to be in Swing just to be compatible with a NetBeans Platform application? NetBeans Platform applications are perfectly capable of rendering JavaFX interop'd with Swing. Here's how: First you will need to do a little prep work to setup an application for this tutorial. Do the following: Create the JavaFX GUI. Create a new JavaFX FXML GUI using SceneBuilder. Add the controls you want and generate your FXML file and controller class. Update your Controller Class by Extending JFXPanel. This is part of the Swing Interop pattern that we all know and love. You will also need to @Override the getName() method so that the wizard framework can update the current step title. Encapsulate fields/values. Create public methods that will provide Wizard framework with the fields it needs to pass from panel to panel. This is the same thing you would need to do with a standard Swing Wizard JPanel class. The code for your controller class still runs without a problem within your JavaFX application but is now Swing Interop compatible. 
The code might look like this: package jfxwizpanel.jfxwiz; import java.io.File; import java.net.URL; import java.util.ResourceBundle; import javafx.embed.swing.JFXPanel; import javafx.event.ActionEvent; import javafx.fxml.FXML; import javafx.fxml.Initializable; import javafx.scene.control.Button; import javafx.scene.control.TextField; import javafx.stage.FileChooser; /** * * @author SPhillips (King of Australia) */ public class WizPanelController extends JFXPanel implements Initializable { @FXML // fx:id="browseButton" private Button browseButton; // Value injected by FXMLLoader @FXML // fx:id="pathText" private TextField pathText; //Field that Path is stored in private String filePath = ""; //some value to pass to the next Wizard panel // Handler for Button[fx:id="browseButton"] onAction public void handleButtonAction(ActionEvent event) { FileChooser fileChooser = new FileChooser(); fileChooser.setTitle("Select File"); //Show open file dialog File file = fileChooser.showOpenDialog(null); if(file!=null) { setFilePath(file.getPath()); pathText.setText(filePath); } } @Override // This method is called by the FXMLLoader when initialization is complete public void initialize(URL fxmlFileLocation, ResourceBundle resources) { assert browseButton != null : "fx:id=\"browseButton\" was not injected: check your FXML file 'WizPanel.fxml'."; assert pathText != null : "fx:id=\"pathText\" was not injected: check your FXML file 'WizPanel.fxml'."; // initialize your logic here: all @FXML variables will have been injected } @Override //This method is used by Wizard Framework to generate list of steps public String getName() { return "FXML JFXPanel"; } /** * @return the filePath */ public String getFilePath() { return filePath; } /** * @param filePath the filePath to set */ public void setFilePath(String filePath) { this.filePath = filePath; } } And when you run this Code within the JavaFX FXML application you get something like the following screenshot: Create the NetBeans Platform 
Application. Create a new NetBeans Platform application and add a new module. Add a Wizard using the "Wizard" Wizard. Include the JavaFX Runtime. Create a NetBeans library wrapper module to include "jfxrt.jar" and set a dependency on it in the module described above. Copy Controller class and FXML file. As of NetBeans 7.3 you cannot refactor copy these files from your JavaFX FXML project to your NetBeans Platform application package. After manually copying these two files you will need to do a manual replace of the package path in both the Controller class and the fx:controller string in the FXML file. Your FXML code might now look something like this: Replace Swing Panel with FXML Controller. At this point you can replace the autogenerated Swing JPanel class that would normally be loaded by the Wizard control class with your JavaFX FXML controller. Remember we extended JFXPanel and it pays off here. All we have to do now is follow our standard Swing Interop technique. However this time we have to use our Platform.runLater() pattern in the getComponent() method of the Wizard controller class. Below is the relevant code after the update. Notice how little we had to change: public class JfxwizWizardPanel1 implements WizardDescriptor.Panel { /** * The visual component that displays this panel. If you need to access the * component from this class, just use getComponent(). */ //private JfxwizVisualPanel1 component; public WizPanelController component; //Replaces original autogenerated JPanel class // Get the visual component for the panel. In this template, the component // is kept separate. This can be more efficient: if the wizard is created // but never displayed, or not all panels are displayed, it is better to // create only those which really need to be visible. 
@Override public WizPanelController getComponent() { if (component == null) { component = new WizPanelController(); //return new JFXPanel controller Platform.setImplicitExit(false); Platform.runLater(new Runnable() { @Override public void run() { createScene(); //standard Swing Interop Pattern } }); } return component; } private void createScene() { try { URL location = getClass().getResource("WizPanel.fxml"); //same FXML copied from JavaFX app FXMLLoader fxmlLoader = new FXMLLoader(); fxmlLoader.setLocation(location); fxmlLoader.setBuilderFactory(new JavaFXBuilderFactory()); Parent root = (Parent) fxmlLoader.load(location.openStream()); Scene scene = new Scene(root); component.setScene(scene); component = (WizPanelController) fxmlLoader.getController(); } catch (IOException ex) { Exceptions.printStackTrace(ex); } } At this point, you should be able to resolve any import issues, compile and run. You should see your JavaFX GUI nicely loaded within the Wizard Dialog frame like the screenshot below: Wow that's awesome that you can load JavaFX GUIs into your wizards. But you didn't do anything with the information, so you didn't actually leverage the WizardDescriptor framework.
April 26, 2013
by Sean Phillips
· 22,694 Views
article thumbnail
Multipart Upload on S3 with jclouds
1. Goal In the previous article, we looked at how we can use the generic Blob APIs from jclouds to upload content to S3. In this article we will use the S3 specific asynchronous API from jclouds to upload content and leverage the multipart upload functionality provided by S3. 2. Preparation 2.1. Set up the custom API The first part of the upload process is creating the jclouds API – this is a custom API for Amazon S3: public AWSS3AsyncClient s3AsyncClient() { String identity = ... String credentials = ... BlobStoreContext context = ContextBuilder.newBuilder("aws-s3"). credentials(identity, credentials).buildView(BlobStoreContext.class); RestContext<AWSS3Client, AWSS3AsyncClient> providerContext = context.unwrap(); return providerContext.getAsyncApi(); } 2.2. Determining the number of parts for the content Amazon S3 requires every part except the last one to be at least 5 MB. As such, the first thing we need to do is determine the right number of parts that we can split our content into so that we don’t have parts below this 5 MB minimum: public static int getMaximumNumberOfParts(byte[] byteArray) { int numberOfParts= byteArray.length / fiveMB; // 5*1024*1024 if (numberOfParts== 0) { return 1; } return numberOfParts; } 2.3. Breaking the content into parts We’re going to break the byte array into a set number of parts: public static List<byte[]> breakByteArrayIntoParts(byte[] byteArray, int maxNumberOfParts) { List<byte[]> parts = Lists. 
newArrayListWithCapacity(maxNumberOfParts); int fullSize = byteArray.length; long dimensionOfPart = fullSize / maxNumberOfParts; for (int i = 0; i < maxNumberOfParts; i++) { int previousSplitPoint = (int) (dimensionOfPart * i); int splitPoint = (int) (dimensionOfPart * (i + 1)); if (i == (maxNumberOfParts - 1)) { splitPoint = fullSize; } byte[] partBytes = Arrays.copyOfRange(byteArray, previousSplitPoint, splitPoint); parts.add(partBytes); } return parts; } We’re going to test the logic of breaking the byte array into parts – we’re going to generate some bytes, split the byte array, recompose it back together using Guava and verify that we get back the original: @Test public void given16MByteArray_whenFileBytesAreSplitInto3_thenTheSplitIsCorrect() { byte[] byteArray = randomByteData(16); int maximumNumberOfParts = S3Util.getMaximumNumberOfParts(byteArray); List<byte[]> fileParts = S3Util.breakByteArrayIntoParts(byteArray, maximumNumberOfParts); assertThat(fileParts.get(0).length + fileParts.get(1).length + fileParts.get(2).length, equalTo(byteArray.length)); byte[] unmultiplexed = Bytes.concat(fileParts.get(0), fileParts.get(1), fileParts.get(2)); assertThat(byteArray, equalTo(unmultiplexed)); } To generate the data, we simply use the support from Random: byte[] randomByteData(int mb) { byte[] randomBytes = new byte[mb * 1024 * 1024]; new Random().nextBytes(randomBytes); return randomBytes; } 2.4. 
Creating the Payloads Now that we have determined the correct number of parts for our content and we managed to break the content into parts, we need to generate the Payload objects for the jclouds API: public static List<Payload> createPayloadsOutOfParts(Iterable<byte[]> fileParts) { List<Payload> payloads = Lists.newArrayList(); for (byte[] filePart : fileParts) { byte[] partMd5Bytes = Hashing.md5().hashBytes(filePart).asBytes(); Payload partPayload = Payloads.newByteArrayPayload(filePart); partPayload.getContentMetadata().setContentLength((long) filePart.length); partPayload.getContentMetadata().setContentMD5(partMd5Bytes); payloads.add(partPayload); } return payloads; } 3. Upload The upload process is a flexible multi-step process – this means: the upload can be started before having all the data – data can be uploaded as it’s coming in data is uploaded in chunks – if one of these operations fails, it can simply be retried chunks can be uploaded in parallel – this can greatly increase the upload speed, especially in the case of large files 3.1. Initiating the Upload operation The first step in the Upload operation is to initiate the process. This request to S3 must contain the standard HTTP headers – the Content-MD5 header in particular needs to be computed. We’re going to use the Guava hash function support here: Hashing.md5().hashBytes(byteArray).asBytes(); This is the md5 hash of the entire byte array, not of the parts yet. To initiate the upload, and for all further interactions with S3, we’re going to use the AWSS3AsyncClient – the asynchronous API we created earlier: ObjectMetadata metadata = ObjectMetadataBuilder.create().key(key).contentMD5(md5Bytes).build(); String uploadId = s3AsyncApi.initiateMultipartUpload(container, metadata).get(); The key is the handle assigned to the object – this needs to be a unique identifier specified by the client. 
Also notice that, even though we’re using the async version of the API, we’re blocking for the result of this operation – this is because we will need the result of the initiate operation to be able to move forward. The result of the operation is an upload id returned by S3 – this will identify the upload throughout its lifecycle and will be present in all subsequent upload operations. 3.2. Uploading the Parts The next step is uploading the parts. Our goal here is to send these requests in parallel, as the upload part operations represent the bulk of the upload process: List<ListenableFuture<String>> ongoingOperations = Lists.newArrayList(); for (int partNumber = 0; partNumber < filePartsAsByteArrays.size(); partNumber++) { ListenableFuture<String> future = s3AsyncApi.uploadPart( container, key, partNumber + 1, uploadId, payloads.get(partNumber)); ongoingOperations.add(future); } The part numbers need to be continuous but the order in which the requests are sent is not relevant. After all of the upload part requests have been submitted, we need to wait for their responses so that we can collect the individual ETag value of each part: Function<ListenableFuture<String>, String> getEtagFromOp = new Function<ListenableFuture<String>, String>() { public String apply(ListenableFuture<String> ongoingOperation) { try { return ongoingOperation.get(); } catch (InterruptedException | ExecutionException e) { throw new IllegalStateException(e); } } }; List<String> etagsOfParts = Lists.transform(ongoingOperations, getEtagFromOp); If, for whatever reason, one of the upload part operations fails, the operation can be retried until it succeeds. The logic above does not contain the retry mechanism, but building it in should be straightforward enough. 3.3. Completing the Upload operation The final step of the upload process is completing the multipart operation. 
The S3 API requires the responses from the previous parts upload as a Map, which we can now easily create from the list of ETags that we obtained above: Map<Integer, String> parts = Maps.newHashMap(); for (int i = 0; i < etagsOfParts.size(); i++) { parts.put(i + 1, etagsOfParts.get(i)); } And finally, send the complete request: s3AsyncApi.completeMultipartUpload(container, key, uploadId, parts).get(); This will return the final ETag of the finished object and will complete the entire upload process. 4. Conclusion In this article we built a multipart enabled, fully parallel upload operation to S3, using the custom S3 jclouds API. This operation is ready to be used as is, but it can be improved in a few ways. First, retry logic should be added around the upload operations to better deal with failures. Next, for really large files, even though the mechanism is sending all upload multipart requests in parallel, a throttling mechanism should still limit the number of parallel requests being sent. This is both to avoid bandwidth becoming a bottleneck as well as to make sure Amazon itself doesn’t flag the upload process as exceeding an allowed limit of requests per second – the Guava RateLimiter can potentially be very well suited for this. P.S. You might dig following me on Twitter.
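The throttling idea from the conclusion can be sketched with plain JDK classes. Note that this swaps in a java.util.concurrent.Semaphore to cap the number of requests in flight, which is a different technique from the Guava RateLimiter mentioned above (that one limits requests per second). The class name `ThrottledUploadSketch`, the method `uploadAll` and the sleeping stand-in for the upload call are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class ThrottledUploadSketch {

    // Hypothetical helper: runs one simulated upload per part, never more than
    // maxInFlight at once, and returns the peak concurrency that was observed.
    public static int uploadAll(int partCount, int maxInFlight) {
        ExecutorService pool = Executors.newCachedThreadPool();
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> ongoing = new ArrayList<>();
        try {
            for (int part = 1; part <= partCount; part++) {
                permits.acquire(); // blocks while maxInFlight uploads are running
                ongoing.add(pool.submit(() -> {
                    try {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // stand-in for the real upload call (assumption)
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                }));
            }
            for (Future<?> operation : ongoing) {
                operation.get(); // wait for every part to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak parallel uploads: " + uploadAll(10, 3));
    }
}
```

With a real client the body of the task would be the upload part request, and the permit would be released when its future completes.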
April 21, 2013
by Eugen Paraschiv
· 6,594 Views · 1 Like
article thumbnail
Upload on S3 with the jclouds Library
There are several good ways to upload content to an S3 bucket in the Java world – in this article we’ll look at what the jclouds library provides for this purpose. To use jclouds – specifically the APIs discussed in this article, this simple Maven dependency should be added to the pom of the project: <dependency> <groupId>org.jclouds</groupId> <artifactId>jclouds-allblobstore</artifactId> <version>1.5.9</version> </dependency> 1. Uploading to Amazon S3 The first step, in order to access any of these APIs, is to create a BlobStoreContext: BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(BlobStoreContext.class); This represents the entry-point to a general key-value storage service, such as Amazon S3 – but not limited to it. For the more specific S3 only implementation, the context can be created similarly: BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(S3BlobStoreContext.class); And even more specifically: BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(AWSS3BlobStoreContext.class); When the authenticated context is no longer needed, closing it is required to release all resources – threads and connections – associated to it. 2. The four S3 APIs of jclouds The jclouds library provides four different APIs to upload content to an S3 bucket, ranging from simple but inflexible to complex and powerful, all obtained via the BlobStoreContext. Let’s start with the simplest. 2.1. Upload via the Map API The easiest way jclouds can be used to interact with an S3 bucket is by representing that bucket as a Map. The API is obtained from the context: InputStreamMap bucket = context.createInputStreamMap("bucketName"); Then, to upload a simple HTML file: bucket.putString("index1.html", "hello world1"); The InputStreamMap API exposes several other types of PUT operations – files, raw bytes – both for single and bulk. 
A simple integration test can be used as an example: @Test public void whenFileIsUploadedToS3WithMapApi_thenNoExceptions() { BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(AWSS3BlobStoreContext.class); InputStreamMap bucket = context.createInputStreamMap("bucketName"); bucket.putString("index1.html", "hello world1"); context.close(); } 2.2. Upload via BlobMap Using the simple Map API is straightforward but ultimately limited – for example, there is no way to pass in metadata about the content being uploaded. When more flexibility and customization is necessary, this simplified approach to uploading data to S3 via a Map is no longer enough. The next API we’ll look at is the Blob Map API – this is obtained from the context: BlobMap bucket = context.createBlobMap("bucketName"); The API allows the client to access more lower level details, such as Content-Length, Content-Type, Content-Encoding, eTag hash and others; to upload new content in the bucket: Blob blob = bucket.blobBuilder().name("index2.html"). payload("hello world2"). contentType("text/html").calculateMD5().build(); The API also allows setting a variety of payloads on the create request. A simple integration test for uploading a basic HTML file to S3 via the Blob Map API: @Test public void whenFileIsUploadedToS3WithBlobMap_thenNoExceptions() throws IOException { BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(AWSS3BlobStoreContext.class); BlobMap bucket = context.createBlobMap("bucketName"); Blob blob = bucket.blobBuilder().name("index2.html"). payload("hello world2"). contentType("text/html").calculateMD5().build(); bucket.put(blob.getMetadata().getName(), blob); context.close(); } 2.3. Upload via BlobStore The previous APIs had no way to upload content using multipart upload – this makes them ill suited when working with large files. 
This limitation is addressed by the next API we’re going to look at – the synchronous BlobStore API. This is obtained from the context: BlobStore blobStore = context.getBlobStore(); To use the multipart support and upload a file to S3: Blob blob = blobStore.blobBuilder("index3.html"). payload("hello world3").contentType("text/html").build(); blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart()); The payload builder is the same one that was being used by the BlobMap API, so the same flexibility in specifying lower level metadata information about the blob is available here. The difference is the PutOptions supported by the PUT operation of the API – namely the multipart support. The previous integration test now has multipart enabled: @Test public void whenFileIsUploadedToS3WithBlobStore_thenNoExceptions() { BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(AWSS3BlobStoreContext.class); BlobStore blobStore = context.getBlobStore(); Blob blob = blobStore.blobBuilder("index3.html"). payload("hello world3").contentType("text/html").build(); blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart()); context.close(); } 2.4. Upload via AsyncBlobStore While the previous BlobStore API was synchronous, there is also an asynchronous API for BlobStore – AsyncBlobStore. The API is similarly obtained from the context: AsyncBlobStore blobStore = context.getAsyncBlobStore(); The only difference between the two is that the async API is returning ListenableFuture for the PUT asynchronous operation: Blob blob = blobStore.blobBuilder("index4.html"). 
payload("hello world4").build(); blobStore.putBlob("bucketName", blob).get(); The integration test displaying this operation is similar to the synchronous one: @Test public void whenFileIsUploadedToS3WithBlobStore_thenNoExceptions() throws InterruptedException, ExecutionException { BlobStoreContext context = ContextBuilder.newBuilder("aws-s3").credentials(identity, credentials) .buildView(AWSS3BlobStoreContext.class); AsyncBlobStore blobStore = context.getAsyncBlobStore(); Blob blob = blobStore.blobBuilder("index4.html"). payload("hello world4").contentType("text/html").build(); Future<String> putOp = blobStore.putBlob("bucketName", blob, PutOptions.Builder.multipart()); putOp.get(); context.close(); } 3. Conclusion In this article, we analysed the four APIs that the jclouds library provides to upload content to Amazon S3. These four APIs are generic and they work with other key-value storage services as well – such as Microsoft Azure Storage for example. In the next article we’ll look at the Amazon specific S3 API available in jclouds – the AWSS3Client. We’ll implement the operation of uploading a large file, dynamically calculate the optimal number of parts for any given file, and perform the upload of all parts in parallel. P.S. You might dig following me on Twitter.
April 18, 2013
by Eugen Paraschiv
· 8,862 Views · 1 Like
article thumbnail
HotSpot GC Thread CPU footprint on Linux
The following question will test your knowledge on garbage collection and high CPU troubleshooting for Java applications running on Linux OS. This troubleshooting technique is especially crucial when investigating excessive GC and / or CPU utilization. It will assume that you do not have access to advanced monitoring tools such as Compuware dynaTrace or even JVisualVM. Tutorials using such tools will be presented in the future but please ensure that you first master the base troubleshooting principles. Question: How can you monitor and calculate how much CPU % each of the Oracle HotSpot or JRockit JVM garbage collection (GC) threads is using at runtime on Linux OS? Answer: On the Linux OS, Java threads are implemented as native threads, which results in each thread being visible as a separate lightweight process (LWP). This means that you are able to monitor the CPU % of any Java thread created by the HotSpot JVM using the top -H command (Threads toggle view). That said, depending on the GC policy that you are using and your server specifications, the HotSpot & JRockit JVM will create a certain number of GC threads that will be performing young and old space collections. Such threads can be easily identified by generating a JVM thread dump. As you can see below in our example, the Oracle JRockit JVM did create 4 GC threads identified as "(GC Worker Thread X)". 
===== FULL THREAD DUMP =============== Fri Nov 16 19:58:36 2012 BEA JRockit(R) R27.5.0-110-94909-1.5.0_14-20080204-1558-linux-ia32 "Main Thread" id=1 idx=0x4 tid=14911 prio=5 alive, in native, waiting -- Waiting for notification on: weblogic/t3/srvr/T3Srvr@0xfd0a4b0[fat lock] at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method) at java/lang/Object.wait(J)V(Native Method) at java/lang/Object.wait(Object.java:474) at weblogic/t3/srvr/T3Srvr.waitForDeath(T3Srvr.java:730) ^-- Lock released while waiting: weblogic/t3/srvr/T3Srvr@0xfd0a4b0[fat lock] at weblogic/t3/srvr/T3Srvr.run(T3Srvr.java:380) at weblogic/Server.main(Server.java:67) at jrockit/vm/RNI.c2java(IIIII)V(Native Method) -- end of trace "(Signal Handler)" id=2 idx=0x8 tid=14920 prio=5 alive, in native, daemon "(GC Main Thread)" id=3 idx=0xc tid=14921 prio=5 alive, in native, native_waiting, daemon "(GC Worker Thread 1)" id=? idx=0x10 tid=14922 prio=5 alive, in native, daemon "(GC Worker Thread 2)" id=? idx=0x14 tid=14923 prio=5 alive, in native, daemon "(GC Worker Thread 3)" id=? idx=0x18 tid=14924 prio=5 alive, in native, daemon "(GC Worker Thread 4)" id=? idx=0x1c tid=14925 prio=5 alive, in native, daemon ……………………… Now let’s put all of these principles together via a simple example. Step #1 - Monitor the GC thread CPU utilization The first step of the investigation is to monitor the GC threads: Identify the native Thread ID for each GC worker thread shown via the Linux top -H command. Identify the CPU % for each GC worker thread. Step #2 – Generate and analyze JVM Thread Dumps While top -H is running, generate 2 or 3 JVM Thread Dump snapshots via kill -3 <PID>. Open the JVM Thread Dump and locate the JVM GC worker threads. Now correlate the "top -H" output data with the JVM Thread Dump data by looking at the native thread id (tid attribute). As you can see in our example, such analysis did allow us to determine that all our GC worker threads were using around 20% CPU each. 
This was due to major collections happening at that time. Please note that it is also very useful to enable -verbose:gc, as it will allow you to correlate such CPU spikes with minor and major collections and determine how much your JVM GC process is contributing to the overall server CPU utilization.
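The correlation step can be sketched in a few lines of Java. The JRockit dump above prints tid in decimal, so it matches the PID column of top -H directly. HotSpot thread dumps instead print the native thread ID in hexadecimal (nid=0x...), so a conversion is needed first. The class and method names below are hypothetical helpers, not part of any JVM tooling.

```java
// Hypothetical helper (illustrative names): converts between the decimal
// thread ID shown by `top -H` and the hexadecimal nid of a HotSpot dump.
class NativeThreadId {

    // Decimal ID from `top -H` -> hex form as printed in HotSpot dumps.
    static String toHotSpotNid(int decimalTid) {
        return "0x" + Integer.toHexString(decimalTid);
    }

    // Hex nid from a HotSpot dump (e.g. "0x3a4a") -> decimal ID for `top -H`.
    static int fromHotSpotNid(String nid) {
        return Integer.decode(nid);
    }

    public static void main(String[] args) {
        // 14922 is the tid of "(GC Worker Thread 1)" in the dump above.
        System.out.println(toHotSpotNid(14922));
        System.out.println(fromHotSpotNid("0x3a4a"));
    }
}
```

With a JRockit dump like the one in this article no conversion is required; the helper only matters when the dump format uses hexadecimal IDs.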
April 17, 2013
by Pierre - Hugues Charbonneau
· 14,516 Views
article thumbnail
Capture a Signature on iOS
Originally authored by Jason Harwig

The Square Engineering Blog has a great article on Smoother Signatures for Android, but I didn't find anything specifically about iOS. So, what is the best way to capture a user's signature on an iOS device? Although I didn't find any articles on signature capture, there are good implementations on the App Store. My target user experience was the iPad application Paper by 53, a drawing application with beautiful and responsive brushes. All code is available in the GitHub repository: SignatureDemo.

Connecting the Dots

The simplest approach is to capture the touches and connect them with straight lines. In the initializer of a UIView subclass, create the path and gesture recognizer to capture touch events.

    // Create a path to connect lines
    path = [UIBezierPath bezierPath];

    // Capture touches
    UIPanGestureRecognizer *pan = [[UIPanGestureRecognizer alloc] initWithTarget:self action:@selector(pan:)];
    pan.maximumNumberOfTouches = pan.minimumNumberOfTouches = 1;
    [self addGestureRecognizer:pan];

Capture the pan events into a bézier path by connecting the points with lines.

    - (void)pan:(UIPanGestureRecognizer *)pan {
        CGPoint currentPoint = [pan locationInView:self];

        if (pan.state == UIGestureRecognizerStateBegan) {
            [path moveToPoint:currentPoint];
        } else if (pan.state == UIGestureRecognizerStateChanged) {
            [path addLineToPoint:currentPoint];
        }

        [self setNeedsDisplay];
    }

Stroke the path.

    - (void)drawRect:(CGRect)rect {
        [[UIColor blackColor] setStroke];
        [path stroke];
    }

An example "J" character rendered using this technique reveals some issues. At slow velocities iOS captures enough touch resolution that the lines aren't noticeable, but faster movement shows large gaps between touches that accentuate the lines. Apple's 2012 Worldwide Developers Conference (WWDC) included a session, Building Advanced Gesture Recognizers, that addresses this issue using math.
Quadratic Bézier Curves

Instead of connecting the touch points with lines, quadratic bézier curves connect the points using the technique discussed in the aforementioned WWDC session (seek to 42:15). Connect the touch points with a quadratic curve, using the touch points as the control points and the midpoints as start and end. Adding quadratic curves to the previous code requires storing the previous touch point, so add an instance variable for that.

    CGPoint previousPoint;

Create a function to calculate the midpoint of two points.

    static CGPoint midpoint(CGPoint p0, CGPoint p1) {
        return (CGPoint) {
            (p0.x + p1.x) / 2.0,
            (p0.y + p1.y) / 2.0
        };
    }

Update the pan gesture handler to add quadratic curves instead of straight lines.

    - (void)pan:(UIPanGestureRecognizer *)pan {
        CGPoint currentPoint = [pan locationInView:self];
        CGPoint midPoint = midpoint(previousPoint, currentPoint);

        if (pan.state == UIGestureRecognizerStateBegan) {
            [path moveToPoint:currentPoint];
        } else if (pan.state == UIGestureRecognizerStateChanged) {
            [path addQuadCurveToPoint:midPoint controlPoint:previousPoint];
        }

        previousPoint = currentPoint;
        [self setNeedsDisplay];
    }

Not much code, and already we see a big difference. The touch points are no longer visible, but it looks a little bland when drawing a signature. Every curve is the same width, which doesn't match the physics of a real pen.

Variable Stroke Width

The width can be varied based on the touch velocity to create a more natural stroke. The UIPanGestureRecognizer already includes a method called velocityInView: that returns the current touch velocity as a CGPoint. To render a stroke of varying width, I switched to OpenGL ES and a technique called tessellation to convert the stroke into triangles, specifically triangle strips. (OpenGL has support for drawing lines, but iOS doesn't support variable line widths with smoothing.) The quadratic points along a curve also need to be calculated, but that is beyond the scope of this article.
Check the source on GitHub for details. Given two points, a perpendicular vector is calculated and its magnitude set to half the current thickness. Given the nature of GL_TRIANGLE_STRIP, only two points are needed to create the next rectangle segment with two triangles. Here is an example of the final output using quadratic bézier curves and velocity-based stroke thickness, creating a visually appealing and natural signature.
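The perpendicular-vector step can be sketched outside of OpenGL. The following is a minimal Java illustration of the geometry only, not the code from the SignatureDemo repository; the class and method names are hypothetical. Given the previous and current points and a stroke thickness, it returns the two vertices that would be appended to the triangle strip.

```java
// Illustrative geometry only (hypothetical names); the article's actual
// implementation is Objective-C with OpenGL ES.
class StrokeGeometry {

    // Returns {leftX, leftY, rightX, rightY}: two vertices offset
    // perpendicular to the segment (x0,y0)->(x1,y1), centred on (x1,y1),
    // each at a distance of thickness / 2 from it.
    static double[] edgeVertices(double x0, double y0,
                                 double x1, double y1,
                                 double thickness) {
        double dx = x1 - x0;
        double dy = y1 - y0;
        double length = Math.hypot(dx, dy);
        if (length == 0) {
            // Degenerate segment: no direction, so no offset can be derived.
            return new double[] {x1, y1, x1, y1};
        }
        double half = thickness / 2.0;
        // Rotate the direction by 90 degrees, (dx, dy) -> (-dy, dx),
        // then scale it to half the thickness.
        double px = -dy / length * half;
        double py = dx / length * half;
        return new double[] {x1 + px, y1 + py, x1 - px, y1 - py};
    }

    public static void main(String[] args) {
        double[] v = edgeVertices(0, 0, 10, 0, 4);
        System.out.println(v[0] + "," + v[1] + " and " + v[2] + "," + v[3]);
    }
}
```

Because each new segment contributes only these two vertices, the strip grows by one quad (two triangles) per touch sample, which is why a triangle strip suits this kind of stroke.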
April 8, 2013
by Scott Leberknight
· 20,813 Views
article thumbnail
Configuring Apache SolrCloud on Amazon VPC
In this post we are going to construct an Apache SolrCloud (4.1) cluster of 12 EC2 instances inside Amazon VPC. Since the search data stored inside SolrCloud is critical, we are going to build high availability at the Solr node level as well as the AZ level. This setup will be done inside private subnets of Amazon VPC and will leverage 3 Availability Zones of the Amazon EC2 Region. Deployment architecture of the setup is given below.

A brief overview of the setup:
  • 3 ZooKeepers will be deployed across 3 Availability Zones. The ZK EC2 instances will be deployed on the private subnets of the Amazon VPC.
  • 3 Solr Shard EC2 instances will be deployed on the private subnet of Availability Zone 1 inside Amazon VPC.
  • 3 Solr Replica EC2 instances will be deployed on the private subnet of Availability Zone 2 inside Amazon VPC.
  • 3 Solr Replica EC2 instances will be deployed on the private subnet of Availability Zone 3 inside Amazon VPC.
  • EBS-optimized + PIOPS EC2 instances can be used for the Solr EC2 nodes.

To know more about SolrCloud deployment best practices on Amazon VPC, refer to the article: http://harish11g.blogspot.in/2013/03/Apache-Solr-cloud-on-Amazon-EC2-AWS-VPC-implementation-deployment.html

Step 1: Creating Virtual Private Cloud on AWS

Create a VPC with public and private subnets. Assume the load balancer and web/app servers reside on the public subnet and Apache SolrCloud resides on the private subnet of the VPC.

Step 2: Assigning the IP for the Subnets

Create the subnet with its IP range. Choose the Availability Zone for this subnet.

Step 3: Multiple Subnets on Multiple AZs

Create multiple subnets in multiple AZs to build a highly available setup for SolrCloud.

Step 4: Install Java for ZooKeeper & Solr

Amazon Linux is chosen as the EC2 OS variant. Execute the following instructions on the respective EC2 nodes after their launch. EC2 instances should be launched in multiple AZs, in multiple VPC private subnets. Solr uses ZooKeeper as the cluster configuration store and coordinator.
ZooKeeper is a distributed coordination service that exposes a file-system-like hierarchy of nodes and holds information about all the Solr nodes. solrconfig.xml, schema.xml, etc. are stored in this repository. We have used Oracle (Sun) Java rather than OpenJDK. Note that the file names and paths below mention several JDK 7 update numbers (7u13, 7u10, 1.7.0_09); substitute the update you actually downloaded so that they all agree.

    sudo -s
    cd /opt
    wget --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2Ftechnetwork%2Fjava%2Fjavase%2Fdownloads%2Fjdk-7u3-download-1501626.html;" http://download.oracle.com/otn-pub/java/jdk/7u13-b20/jdk-7u13-linux-x64.rpm
    mv jdk-7u10-linux-x64.rpm?AuthParam=1357217677_76ec3d8d9a3644f4b9ec1ea79e1fcf33 jdk-7u10-linux-x64.rpm
    sudo rpm -ivh jdk-7u10-linux-x64.rpm
    alternatives --install /usr/bin/java java /usr/java/jdk1.7.0_10/jre/bin/java 20000
    alternatives --install /usr/bin/javaws javaws /usr/java/jdk1.7.0_10/jre/bin/javaws 20000
    alternatives --install /usr/bin/javac javac /usr/java/jdk1.7.0_10/bin/javac 20000
    alternatives --install /usr/bin/jar jar /usr/java/jdk1.7.0_10/bin/jar 20000
    alternatives --install /usr/bin/java java /usr/java/jre1.7.0_10/bin/java 20000
    alternatives --install /usr/bin/javaws javaws /usr/java/jre1.7.0_10/bin/javaws 20000
    alternatives --configure java

Add JAVA_HOME in .bash_profile:

    vim ~/.bash_profile

    export JAVA_HOME="/usr/java/jdk1.7.0_09"
    export PATH=$PATH:$JAVA_HOME/bin

Restart the instance:

    init 6

Check the version of Java installed using the "java -version" command.

Step 5: Configure the ZooKeeper (v3.4.5) Ensemble

Since a single ZooKeeper is not ideal for a large Solr cluster (it is a single point of failure), it is recommended to configure multiple ZooKeepers working in concert as an ensemble. In this step we will install and configure 3 ZooKeeper EC2 nodes spanning 3 different Availability Zones, in their respective private subnets inside a VPC. ZooKeeper will be configured on Amazon Linux.
    sudo yum update
    sudo -s
    cd /opt
    wget http://apache.techartifact.com/mirror/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
    tar -xzvf zookeeper-3.4.5.tar.gz
    rm zookeeper-3.4.5.tar.gz
    cd zookeeper-3.4.5
    cp conf/zoo_sample.cfg conf/zoo.cfg

Add the following lines in zoo.cfg:

    vim conf/zoo.cfg

    dataDir=/data
    server.1=[zk-server01-ip]:2888:3888
    server.2=[zk-server02-ip]:2888:3888
    server.3=[zk-server03-ip]:2888:3888

Create the myid file containing 1, 2 or 3 respectively on each ZooKeeper EC2 instance. ZooKeeper reads myid from the directory configured as dataDir, so the directory used here must match that setting:

    cd /opt/zookeeper/data
    vim myid

Start ZooKeeper:

    bin/zkServer.sh start

Follow the above steps on all the ZooKeeper servers. Refer to Clustered (Multi-Server) Setup and Configuration Parameters for an explanation of quorum_port, leader_election_port and the myid file. Every ZooKeeper node needs to know about every other ZK EC2 node in the ensemble, and a majority of the nodes (called a quorum) is needed to provide the service. Make sure the VPC IP of every ZooKeeper node is listed on every ZK node, as in the server.1 to server.3 lines shown above.

Step 6: Configuring Solr 4.1 EC2 node

In this step we will install and configure 3 Apache Solr 4.1 Shard EC2 instances in a single Amazon AZ and 2 Solr Replicas in another AZ, in their respective private subnets. Please note that we have to specify all the ZooKeeper (ZK) hosts on every Solr instance, as below. Note: Solr comes with Jetty by default; it is suggested to use Tomcat for production nodes. Perform the following after launching the EC2 instances in multiple AZs, in multiple VPC private subnets.

    sudo -s
    yum update
    cd /opt
    wget http://apache.techartifact.com/mirror/lucene/solr/4.1.0/apache-solr-4.1.0.tgz
    tar -xzvf apache-solr-4.1.0.tgz
    rm -f apache-solr-4.1.0.tgz

On the Solr Shard/Replica instances:

    cd /opt/apache-solr-4.1.0/example/
    vim /opt/apache-solr-4.1.0/example/solr/collection1/conf/solrconfig.xml

Change /var/data/solr to /data.

Start the Solr 4.1 Shard/Replica Java program:
    java -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=SolrCloud4.1-Conf -DnumShards=3 -DzkHost=[zk-server01-ip]:2181,[zk-server02-ip]:2181,[zk-server03-ip]:2181 -jar start.jar

Subsequent Solr instances only need to point to the ZooKeeper ensemble:

    java -DzkHost=[zk-server01-ip]:2181,[zk-server02-ip]:2181,[zk-server03-ip]:2181 -jar start.jar

  • -DnumShards: the number of shards that will be present. Note that once set, this number cannot be increased or decreased without re-indexing the entire data set. (Dynamically changing the number of shards is part of the Solr roadmap!)
  • -DzkHost: a comma-separated list of ZooKeeper servers.
  • -Dbootstrap_confdir, -Dcollection.configName: these parameters are specified only when starting up the first Solr instance. This enables the transfer of configuration files to ZooKeeper. Subsequent Solr instances just need to point to the ZooKeeper ensemble.

The above command with -DnumShards=3 specifies that it is a 3-shard cluster. The first Solr EC2 node automatically becomes shard1, the second Solr EC2 node automatically becomes shard2, and so on. What happens when we launch a fourth Solr instance in this cluster? Since it's a 3-shard cluster, the fourth Solr EC2 node automatically becomes a replica of shard1 and the fifth Solr EC2 node becomes a replica of shard2.

Step 7: AWS Security Group TCP Ports to be enabled

Configure the following TCP ports on the AWS security group to allow access between the Solr and ZK nodes deployed in multiple AZs:
  • Solr Shards/Replicas connect to ZK through TCP port 2181.
  • The Solr web interface with the Jetty container uses TCP port 8983.
  • The Solr web interface with the Tomcat container uses TCP port 8080.

Every instance that is part of the ZooKeeper ensemble should know about every other machine in the ensemble. We can accomplish this with a series of lines of the form server.id=host:port:port. For example:

    server.1=[vpc-ip]:2888:3888
    server.2=[vpc-ip]:2888:3888
    server.3=[vpc-ip]:2888:3888

TCP ports 2888 and 3888 should be opened for the ZK ensemble.
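Two pieces of arithmetic from the steps above can be sketched in Java: how many ZooKeeper nodes make a quorum, and which role a Solr node takes based on its start order. This is only a model of the behaviour the article describes (a strict majority for the quorum, round-robin assignment for shards); the real assignment is done by SolrCloud through ZooKeeper, and the class and method names here are hypothetical.

```java
// Hypothetical model of the behaviour described in the article;
// not SolrCloud or ZooKeeper code.
class ClusterMath {

    // Smallest number of ZooKeeper nodes that forms a strict majority.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Role of the n-th Solr node to start (1-based) when the cluster
    // was bootstrapped with -DnumShards=numShards.
    static String roleOf(int startOrder, int numShards) {
        int shard = (startOrder - 1) % numShards + 1;
        return startOrder <= numShards
                ? "shard" + shard
                : "replica of shard" + shard;
    }

    public static void main(String[] args) {
        System.out.println("Quorum of 3 ZK nodes: " + quorumSize(3));
        for (int node = 1; node <= 5; node++) {
            System.out.println("Solr node " + node + " -> " + roleOf(node, 3));
        }
    }
}
```

With a 3-node ensemble the quorum is 2, so the ensemble keeps serving after the loss of one Availability Zone, which is the reason for spreading the ZooKeeper nodes across three zones.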
April 5, 2013
by Harish Ganesan
· 7,777 Views
article thumbnail
AWS VPC NAT Instance Failover and High Availability
Amazon Virtual Private Cloud (VPC) is a great way to set up an isolated portion of AWS and control the network topology. It is a great way to extend your data center and use AWS for burst requirements. With the latest VPC for Everyone announcement, what was earlier "Classic" and "VPC" in AWS will soon be only VPC. That is, every deployment in AWS will be on a VPC, even though one might not need all the additional features that VPC provides. One might eventually start looking at utilizing VPC features such as multiple Subnets, network isolation, Network ACLs, etc. Those who have already worked with VPCs understand the role of a NAT Instance in a VPC.

When you create a VPC, you create it with multiple Subnets (Public and Private). Instances launched in the Public Subnet have direct internet connectivity to send and receive internet traffic through the internet gateway of the VPC. Typically, internet-facing servers such as web servers are kept in the Public Subnet. A Private Subnet can be used to launch Instances that do not require direct access from the internet. Instances in a Private Subnet can access the Internet without exposing their private IP address by routing their traffic through a Network Address Translation (NAT) instance in the Public Subnet. AWS provides an AMI that can be launched as a NAT Instance. The following diagram represents a standard VPC that gets provisioned through the AWS Management Console wizard.

Standard Private and Public Subnets in a VPC

The above architecture has:
  • A Public Subnet that has direct internet connectivity through the Internet Gateway.
      • Web Instances can be placed within the Public Subnet.
      • The custom Route Table associated with the Public Subnet will have the necessary routing information to route traffic to the Internet Gateway.
      • A NAT Instance is also provisioned in the Public Subnet.
  • A Private Subnet that has outbound internet connectivity through the NAT Instance in the Public Subnet.
      • The Main Route Table is by default associated with the Private Subnet. It will have the necessary routing information to route internet traffic to the NAT Instance.
      • Instances in the Private Subnet will use the NAT Instance for outbound internet connectivity. Examples are DB backups from a standby that need to be stored in S3, and background programs that make external web service calls.

Of course, the above architecture has limited High Availability, since all the Subnets are created within the same Availability Zone. We can avoid this by creating multiple Subnets in multiple Availability Zones.

Public and Private Subnets with multiple Availability Zones

  • Additional Subnets (Public and Private) are created in another Availability Zone.
  • Both Private Subnets are attached to the Main Routing Table.
  • Both Public Subnets are attached to the same Custom Routing Table.
  • Instances in the Private Subnets still continue to use the NAT Instance for outbound internet connectivity.

Though we increased the High Availability by utilizing multiple Availability Zones, the NAT Instance is still a Single Point of Failure. The NAT Instance is just another EC2 Instance that can become unavailable at any time.
The updated architecture below uses two NAT Instances to provide failover and High Availability for the NAT Instances.

NAT Instance High Availability

  • Each Subnet is associated with its own Route Table.
  • NAT1 is provisioned in Public Subnet 1.
  • NAT2 is provisioned in Public Subnet 2.
  • Private Subnet 1's Route Table (RT) has a routing entry to NAT1 for internet traffic.
  • Private Subnet 2's Route Table (RT) has a routing entry to NAT2 for internet traffic.

NAT Instance HA Illustration

A script can be installed on both NAT Instances to monitor each other and swap the routing table entry if one of them fails. For example, if NAT1 detects that NAT2 is not responding to its ping requests, it can change the Route Table of Private Subnet 2 to use NAT1 for internet traffic. Once NAT2 becomes operational again, a reverse swap can happen. AWS has pretty good documentation on this and a sample script for the swapping.

Apart from HA, the above architecture also provides better overall throughput, since during normal conditions both NAT Instances can be used to drive the outbound internet requirements of the VPC. If there are workloads that require a lot of outbound internet connectivity, having more than one NAT Instance makes sense. Of course, you are still limited to one NAT Instance per Subnet.
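The monitoring script's decision logic can be sketched as follows. This is a minimal in-memory model, not the AWS sample script: the health check result is passed in as a boolean, and updating a map entry stands in for the EC2 call that would replace the default route of a route table. All names (NatFailover, rt-private-1 and so on) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of the swap logic (hypothetical names). A real script
// would probe the peer over the network and call the EC2 API instead of
// updating a map.
class NatFailover {

    // Route table ID -> NAT instance it uses under normal conditions.
    private final Map<String, String> homeNat = new HashMap<>();
    // Route table ID -> NAT instance its default route currently targets.
    private final Map<String, String> defaultRoute = new HashMap<>();

    void register(String routeTable, String nat) {
        homeNat.put(routeTable, nat);
        defaultRoute.put(routeTable, nat);
    }

    // Called by the monitor running on `self` after probing `peer`.
    void onHealthCheck(String self, String peer, boolean peerAlive) {
        for (Map.Entry<String, String> entry : homeNat.entrySet()) {
            if (!entry.getValue().equals(peer)) {
                continue; // only the peer's route tables are affected
            }
            // Take over on failure, hand the route back on recovery.
            defaultRoute.put(entry.getKey(), peerAlive ? peer : self);
        }
    }

    String natFor(String routeTable) {
        return defaultRoute.get(routeTable);
    }

    public static void main(String[] args) {
        NatFailover vpc = new NatFailover();
        vpc.register("rt-private-1", "NAT1");
        vpc.register("rt-private-2", "NAT2");

        vpc.onHealthCheck("NAT1", "NAT2", false); // NAT2 stops answering pings
        System.out.println("rt-private-2 -> " + vpc.natFor("rt-private-2"));

        vpc.onHealthCheck("NAT1", "NAT2", true);  // NAT2 is back
        System.out.println("rt-private-2 -> " + vpc.natFor("rt-private-2"));
    }
}
```

Keeping the "home" mapping separate from the current route is what makes the reverse swap possible once the failed instance recovers.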
March 28, 2013
by Raghuraman Balachandran
· 18,772 Views
article thumbnail
Debugging “Wrong FS expected: file:///” exception from HDFS
I just spent some time putting together some basic Java code to read some data from HDFS. Pretty basic stuff. No map reduce involved. Pretty boilerplate code, like the stuff from this popular tutorial on the topic. No matter what, I kept hitting my head on this error:

    Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/hadoop/DOUG_SVD/out.txt, expected: file:///

If you check out the tutorial above, what's supposed to be happening is that an instance of Hadoop's Configuration should encounter a fs.default.name property in one of the config files it's given. The Configuration should realize that this property has a value of hdfs://localhost:9000. When you use the Configuration to create a Hadoop FileSystem instance, it should happily read this property from Configuration and process paths from HDFS. That's a long way of saying these three lines of Java code:

    // pick up config files off the classpath
    Configuration conf = new Configuration();
    // explicitly add other config files
    conf.addResource("/home/hadoop/conf/core-site.xml");
    // create a FileSystem object needed to load file resources
    FileSystem fs = FileSystem.get(conf);
    // load files and stuff below!

Well… My Hadoop config file (core-site.xml) appears to be set up correctly. It appears to be in my CLASSPATH. I'm even trying to explicitly add the resource. Basically I've followed all the troubleshooting tips you're supposed to follow when you encounter this exception. But I'm STILL getting this exception. Head, meet wall. This has to be something stupid.

Troubleshooting Hadoop's Configuration & FileSystem Objects

Well, before I reveal my dumb mistake in the above code, it turns out there are some helpful functions to help debug these kinds of problems. As Configuration is just a bunch of key/value pairs from a set of resources, it's useful to know what resources it thinks it loaded and what properties it thinks it loaded from those files.
  • getRaw(): returns the raw value for a configuration item (like conf.getRaw("fs.default.name")).
  • toString(): Configuration's toString shows the resources loaded.

You can similarly check out FileSystem's helpful toString method. It nicely lays out where it thinks it's pointing (native vs. HDFS vs. S3, etc.). So if you are similarly looking for a stupid mistake like I was, pepper your code with printouts of these bits of info. They will at least point you in a new direction to search for your dumb mistake.

Drumroll Please

Turns out I missed the crucial step of passing a Path object, not a String, to addResource. They do slightly different things. Adding a String adds a resource relative to the classpath. Adding a Path adds a resource at an absolute location and does not consider the classpath. So to explicitly load the correct config file, the code above gets turned into (drumroll please):

    // pick up config files off the classpath
    Configuration conf = new Configuration();
    // explicitly add other config files
    // PASS A PATH, NOT A STRING!
    conf.addResource(new Path("/home/hadoop/conf/core-site.xml"));
    FileSystem fs = FileSystem.get(conf);
    // load files and stuff below!

Then tada! Everything magically works! Hopefully these tips can save you the next time you encounter these kinds of problems.
March 27, 2013
by Doug Turnbull
· 17,944 Views
article thumbnail
Accessing AWS Without Key and Secret
If you are using Amazon Web Services (AWS), you are probably aware of how to access and use resources like SNS, SQS and S3 using a key and secret. With the aws-java-sdk that is straightforward:

    AmazonSNSClient snsClient = new AmazonSNSClient(
        new BasicAWSCredentials("your key", "your secret"));

One of the difficulties with this approach is storing the key/secret securely, especially when there are different sets of these for different environments. Using Java property files, combined with Maven or Spring profiles, might help a little to externalize the key/secret out of your source code, but it still doesn't solve the issue of securely accessing these resources. Amazon has another service to help you here. No, this is not one more service to pay for in order to use the previous services. It is a free service; actually, it is a feature of the Amazon account.

AWS Identity and Access Management (IAM) lets you securely control access to AWS services and resources for your users: you can manage users and groups and define permissions for AWS resources. One interesting functionality of IAM is the ability to assign roles to EC2 instances. The idea is that you create roles with sets of permissions and launch an EC2 instance with the role assigned to it. When you deploy an application on that instance, the application doesn't need an access key and secret in order to access other Amazon resources. The application will use the role credentials to sign the requests. This has a number of benefits, such as a centralized place to control all the instance credentials and reduced risk thanks to automatically refreshed credentials.
Once you have role-based security enabled for an instance, to access other resources from that instance you have to create an AWS client using the chained credentials provider:

    AmazonSNSClient snsClient = new AmazonSNSClient(
        new DefaultAWSCredentialsProviderChain());

The provider will search your environment variables and Java system properties, and finally call the instance metadata API to retrieve the role credentials, in chain-of-responsibility fashion. It will also refresh the credentials in the background periodically, depending on their expiration period.

And finally, if you want to use role-based security from Camel applications running on Amazon, all you have to do is create an instance of the client with the configured chained credentials object and not specify any key or secret:

    from("direct:start")
        .to("aws-sns://MyTopic?amazonSNSClient=#snsClient");
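The chain-of-responsibility lookup can be illustrated with plain Java. The sketch below is not the AWS SDK's implementation: CredentialChain is a hypothetical class, each provider is modelled as a Supplier that may or may not yield a value, and the last provider is a stand-in for the instance metadata lookup. The environment variable and system property names in main are shown for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical illustration of the chain-of-responsibility pattern;
// not the AWS SDK's DefaultAWSCredentialsProviderChain.
class CredentialChain {

    private final List<Supplier<Optional<String>>> providers;

    @SafeVarargs
    CredentialChain(Supplier<Optional<String>>... providers) {
        this.providers = Arrays.asList(providers);
    }

    // Ask each provider in order and return the first answer found.
    String resolve() {
        for (Supplier<Optional<String>> provider : providers) {
            Optional<String> credentials = provider.get();
            if (credentials.isPresent()) {
                return credentials.get();
            }
        }
        throw new IllegalStateException("no provider in the chain returned credentials");
    }

    public static void main(String[] args) {
        CredentialChain chain = new CredentialChain(
            () -> Optional.ofNullable(System.getenv("AWS_ACCESS_KEY_ID"))
                          .map(v -> "environment variable"),
            () -> Optional.ofNullable(System.getProperty("aws.accessKeyId"))
                          .map(v -> "system property"),
            // Stand-in for the instance metadata call made on EC2.
            () -> Optional.of("instance role"));
        System.out.println("Credentials resolved from: " + chain.resolve());
    }
}
```

The useful property of this design is that application code stays identical across environments: a developer machine can supply credentials through the earlier links in the chain, while an EC2 instance with a role falls through to the last one.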
March 26, 2013
by Bilgin Ibryam
· 14,390 Views
article thumbnail
Using Lambda Expression to Sort a List in Java 8 using NetBeans Lambda Support
As part of JSR 335, lambda expressions are being introduced to the Java language from Java 8 onwards, and this is a major change in the Java language.
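As a small taste of the topic, here is a minimal sketch of sorting a list with a lambda expression in Java 8. It is an illustration written for this summary, not code from the article; the class name and sample data are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative example (hypothetical names and data).
class LambdaSort {

    // Returns a copy of the input sorted by string length. The lambda
    // replaces the anonymous Comparator class needed before Java 8.
    static List<String> sortByLength(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort((a, b) -> Integer.compare(a.length(), b.length()));
        return copy;
    }

    public static void main(String[] args) {
        List<String> fruit = Arrays.asList("banana", "kiwi", "apple");
        System.out.println(sortByLength(fruit)); // prints [kiwi, apple, banana]
    }
}
```

The sort is performed on a copy so that the caller's list is left untouched.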
March 21, 2013
by Mohamed Sanaulla
· 346,096 Views · 6 Likes
article thumbnail
How to Configure diff and Merge Tool in Visual Studio Git Tools
If you are using the Visual Studio plugin for Git but have also configured Git with msysgit, you may be surprised by some of Visual Studio's behavior.
March 20, 2013
by Ricci Gian Maria
· 75,522 Views · 1 Like
article thumbnail
Where is My Datastore in Hyper-V? Server Virtualization - Part 4
The term 'datastore' is one that many of you who work with VMware are familiar with, but it doesn't really translate to the world of Microsoft's Hyper-V. "Since Hyper-V does not require a different formatting of the underlying physical disk structure like VMFS (VMware's proprietary disk format), we are able to browse the 'datastore' with File Explorer (in Windows 8/Server 2012; formerly known as Windows Explorer)." "Who said that?" That quote is from my friend Tommy Patterson, who writes about datastores and how they compare to the file system structures used in Hyper-V in Part 4 of our "20+ Days of Server Virtualization" series. In his article he describes the locations of the various components that define and make up a Hyper-V virtual machine, and even provides a script to help you quickly locate the file system locations of your virtual machine bits. READ HIS ARTICLE HERE
March 9, 2013
by Kevin Remde
· 9,034 Views
article thumbnail
How to build a dictionary for the NetBeans spellchecker
I'll try to explain how I developed the French, German and Spanish dictionaries for the NetBeans online spellchecker: we will build a French dictionary for NetBeans 7.3 (it works with 7.2 and 7.1 too).

Summary:
  • #1 expand GNU Aspell dictionary files.
  • #2 check out the NetBeans 7.3.0 FCS sources from the Mercurial repository.
  • #3 open the English dictionary project and make a copy.
  • #4 modify the copied project.
  • #5 make the NBM file and test it.
  • #6 (optional) sign the NBM file and submit it to the community for validation.

Step 1 : expand GNU Aspell dictionary files

Download Aspell. You can get the latest Win32 installer version from ftp://ftp.gnu.org(...)Aspell-0-50-3-3-Setup.exe. Run it to install Aspell. Download the latest French dictionary file. You can get Win32 installers from ftp://ftp.gnu.org/gnu/aspell/w32/. The latest French dictionary file is Aspell-fr-0.50-3-3.exe. Run it to install the French dictionaries into Aspell.

You can now expand the dictionary files you want to include in the future NetBeans plugin. We'll expand the "fr_FR" and "fr_CH" dictionaries. Go to the "dict" directory of Aspell and run the following commands (if necessary, add the Aspell "bin" directory to your PATH variable, in the operating system or in a batch script):

    aspell --lang=fr_FR --master=fr_FR dump master | sort > aspell_dump_fr_FR.txt
    aspell --lang=fr_CH --master=fr_CH dump master | sort > aspell_dump_fr_CH.txt

The first line expands the "fr_FR" dictionary and sorts it. The > redirection saves the output to a file (aspell_dump_fr_FR.txt) in the current directory instead of printing it. The second line does the same job for the "fr_CH" language.

Note that the expanded dictionary files may not be UTF-8 encoded. If necessary, re-encode them to UTF-8. You can do it with Notepad2: open a file, go to File, Encoding, and select the UTF-8 encoding. It will re-encode the opened file.
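If you prefer to script the re-encoding rather than use an editor, it can be sketched in Java. This is an optional alternative, not part of the original procedure. The source encoding is an assumption: the example in main uses ISO-8859-1, so check what your Aspell dump actually uses before running it. The class name is hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: re-encodes a text file to UTF-8.
class ReencodeToUtf8 {

    // Decodes `source` using `sourceCharset` and writes it to `target`
    // as UTF-8. The whole file is read into memory, which is acceptable
    // for word lists of a few megabytes.
    static void reencode(Path source, Charset sourceCharset, Path target)
            throws IOException {
        byte[] raw = Files.readAllBytes(source);
        String text = new String(raw, sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ReencodeToUtf8 <source> <target>");
            return;
        }
        // ASSUMPTION: the dump is ISO-8859-1 encoded; adjust if yours differs.
        reencode(Paths.get(args[0]), StandardCharsets.ISO_8859_1, Paths.get(args[1]));
    }
}
```

Writing to a separate target file keeps the original dump intact, so the conversion can be repeated with a different source charset if accented characters come out wrong.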
To finish, pack the two files into a ZIP file (you can do it with any ZIP archiver, like 7-Zip). We will call this archive "aspell-frwl.zip".

Note 1 : What does "expand a dictionary file" mean? Aspell dictionary files are a set of word lists and affix lists. Word lists contain the basic forms of common words. Affixes are used to compute the different variations of words. By expanding a dictionary, we ask Aspell to compute a list of all words with all their variations. The result is a huge file. We need to expand the dictionary files because the NetBeans spellchecker (seems to) use only expanded dictionary files: it doesn't support affix lists ;)

Note 2 : These are the MS Windows instructions. Linux and MacOS ones should be easy to find.

Step 2 : check out the NetBeans 7.3.0 FCS sources from the Mercurial repository

Refer to the given tutorial: How to build NetBeans from sources, step 1 only. After that, you need at least two directories: "releases/nbbuild/" and "releases/spellchecker.dictionary_en/". If you want to free some space, you can delete the other directories. You can now start your NetBeans IDE and load the "spellchecker.dictionary_en" project.

Step 3 : open the English dictionary project and make a copy

NetBeans will ask you for a new project name. Choose something like "spellchecker.dictionary_fr". The project name you have chosen is actually the project's folder name. NetBeans will show you the "SpellChecker English Dictionaries (0)" title for your project. Rename it (you can use the F2 key on the project's name) to "SpellChecker French Dictionaries". You now have the two plugin projects, English and French dictionaries.

Step 4 : modify the copied project

The French project is correctly named, so we can now modify its content to target French dictionaries. Switch to the "File" project tab to show more files.
Step 4.1 : empty the "external" directory's content and copy your French dictionaries archive file This directory contains the English dictionaries files. We will provide our own files. Delete the files located into the "external" directory and copy the "aspell-frwl.zip" created during step 1. Step 4.2 : edit the "nbproject/project.properties" file Replace : release.external/ispell-enwl-3.1.20.zip=modules/dict/ispell-enwl-3.1.20.zip jnlp.indirect.files=modules/dict/dictionary_en_US.description,modules/dict/dictionary_en_GB.description,modules/dict/ispell-enwl-3.1.20.zip,modules/dict/dictionary_en.description by : release.external/aspell-frwl.zip=modules/dict/aspell-frwl.zip jnlp.indirect.files=modules/dict/dictionary_fr_FR.description,modules/dict/dictionary_fr_CH.description,modules/dict/aspell-frwl.zip,modules/dict/dictionary_fr.description We have replaced references to English dictionary and locales by French ones. Step 4.3 : edit the "nbproject/project.xml" file Replace : org.netbeans.modules.spellchecker.dictionary_en by : org.netbeans.modules.spellchecker.dictionary_fr Step 4.4 : rename the "src/org/netbeans/modules/spellchecker/dictionary_en/" folder Rename the "dictionary_en" folder to "dictionary_fr". Step 4.5 : edit the "src/org/netbeans/modules/spellchecker/dictionary_fr/Bundle.properties" file Replace : OpenIDE-Module-Display-Category=Base IDE OpenIDE-Module-Long-Description=\ Provides Ispell's (ver. 3.1.20) word list for use in the online spellchecker. OpenIDE-Module-Name=Spellchecker English Dictionaries OpenIDE-Module-Short-Description=English Dictionaries for Spellchecker by something like : OpenIDE-Module-Display-Category=Base IDE OpenIDE-Module-Long-Description=\ Provides Aspell's French word list for use in the online spellchecker. OpenIDE-Module-Name=Spellchecker French Dictionaries OpenIDE-Module-Short-Description=French Dictionaries for Spellchecker This file describes your plugin. Do not hesitate to change the description fields. 
Step 4.6 : edit the "build.xml" file Replace : by : Step 4.7 : edit the "manifest.mf" file Replace : Manifest-Version: 1.0 OpenIDE-Module: org.netbeans.modules.spellchecker.dictionary_en XOpenIDE-Module-Layer: org/netbeans/modules/spellchecker/dictionary_en/layer.xml OpenIDE-Module-Localizing-Bundle: org/netbeans/modules/spellchecker/dictionary_en/Bundle.properties OpenIDE-Module-Specification-Version: 1.8.1 by : Manifest-Version: 1.0 OpenIDE-Module: org.netbeans.modules.spellchecker.dictionary_fr XOpenIDE-Module-Layer: org/netbeans/modules/spellchecker/dictionary_fr/layer.xml OpenIDE-Module-Localizing-Bundle: org/netbeans/modules/spellchecker/dictionary_fr/Bundle.properties OpenIDE-Module-Specification-Version: 1.0 Please note we have changed the plugin version, from 1.8.1 to 1.0. Step 4.8 : rename and edit the "release/module/dict/dictionary_en.description" file Rename the "dictionary_en.description" file to "dictionary_fr.description". After that, replace : jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.0 jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.1 jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.2 jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.3 jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/american.0 jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/american.1 jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/american.2 by : jar:nbinst:///modules/dict/aspell-frwl.zip!/aspell_dump_fr_FR.txt jar:nbinst:///modules/dict/aspell-frwl.zip!/aspell_dump_fr_CH.txt Some explanations : our plugin will contain dictionaries for the "fr", "fr_FR" and "fr_CH" locales. We have a ".description" file for each of them. In these files, we locate the word list(s) to use. So, the "fr" locale is the combination of the "fr_FR" and "fr_CH" word lists. The "fr_FR" locale uses the "fr_FR" word list, and the "fr_CH" locale uses the "fr_CH" word list. Here, we have edited the description file of the "fr" locale. 
Let's do the same for the other ones :)

Note: The "aspell-frwl.zip!" syntax means that we enter into a ZIP file.

Step 4.9: rename and edit the "release/module/dict/dictionary_en_GB.description" file

Rename the "dictionary_en_GB.description" file to "dictionary_fr_FR.description". After that, replace:

    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.0
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.1
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.2
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.3
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/british.0
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/british.1
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/british.2

by:

    jar:nbinst:///modules/dict/aspell-frwl.zip!/aspell_dump_fr_FR.txt

Step 4.10: rename and edit the "release/module/dict/dictionary_en_US.description" file

Rename the "dictionary_en_US.description" file to "dictionary_fr_CH.description". After that, replace:

    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.0
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.1
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.2
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/english.3
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/american.0
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/american.1
    jar:nbinst:///modules/dict/ispell-enwl-3.1.20.zip!/american.2

by:

    jar:nbinst:///modules/dict/aspell-frwl.zip!/aspell_dump_fr_CH.txt

Step 5: make the NBM file and test it

You can now launch the Build action to compile the project, and the Create NBM action to package your plugin into an NBM file ("build/org-netbeans-modules-spellchecker-dictionary_fr.nbm").

To test it:

Go to the Plugins Manager ("Tools" / "Plugins"), open the "Downloaded" tab, click the "Add Plugins..." button, load your NBM file and confirm: your plugin is now installed!
You can now change the spellchecker default locale, open a text file and type a letter (only one!), wait a few seconds to let NetBeans generate a cache file for the selected locale, and then try the spellchecker correction. This is explained on my French Dictionary Plugin page.

Note: If you don't wait for the cache creation to finish, it will fail and the spellchecker won't work for the selected locale. In that case, go to your ".netbeans/7.x.y/var/cache/dict/" directory, delete the corresponding ".trie1" (or ".trie2") file, and restart NetBeans to let it recreate the cache file.

Step 6: (optional) sign the NBM file and submit it to the community for validation

You can now publish your plugin on the NetBeans Plugins Portal (don't forget to create an account). To make it available in the NetBeans Plugins Manager (Tools / Plugins), you have to submit it for validation. First, you have to sign your plugin (generate a certificate file and sign your plugin with it): you'll find a tutorial at http://wiki.netbeans.org/DevFaqSignNbm. Example:

    keytool -genkey -storepass PASSWORD -alias ALIAS -keystore FOO.cert -validity 3651

This creates the "FOO.cert" certificate file with a validity of 3651 days, the "ALIAS" alias and the "PASSWORD" password. The tool will ask you for additional information.

You can then upload your plugin to the NetBeans Plugins Portal and ask for validation. Once validated, your plugin will be available in the NetBeans Plugins Manager.
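The renames in steps 4.4 and 4.8 to 4.10 are mechanical, so they can be scripted. The following is only a sketch under stated assumptions: the function name rename_dictionary_module is hypothetical, the directory layout is taken literally from the step titles above (including "release/module/dict"), and the word list file names are the ones used in this tutorial. Adjust the paths if your checkout differs.

```shell
# Hypothetical helper (not part of NetBeans): applies the folder and file
# renames from steps 4.4 and 4.8-4.10 to a module checkout rooted at "$1",
# then rewrites the three .description files with the French word lists.
rename_dictionary_module() {
  root="$1"
  pkg="$root/src/org/netbeans/modules/spellchecker"
  dict="$root/release/module/dict"   # path as written in step 4.8

  # Step 4.4: rename the source package folder.
  mv "$pkg/dictionary_en" "$pkg/dictionary_fr"

  # Steps 4.8-4.10: rename the description files.
  mv "$dict/dictionary_en.description"    "$dict/dictionary_fr.description"
  mv "$dict/dictionary_en_GB.description" "$dict/dictionary_fr_FR.description"
  mv "$dict/dictionary_en_US.description" "$dict/dictionary_fr_CH.description"

  # The trailing "!" means "look inside this ZIP file".
  zip_url='jar:nbinst:///modules/dict/aspell-frwl.zip!'

  # "fr" combines both word lists; "fr_FR" and "fr_CH" use one each.
  printf '%s\n' "$zip_url/aspell_dump_fr_FR.txt" "$zip_url/aspell_dump_fr_CH.txt" \
    > "$dict/dictionary_fr.description"
  printf '%s\n' "$zip_url/aspell_dump_fr_FR.txt" > "$dict/dictionary_fr_FR.description"
  printf '%s\n' "$zip_url/aspell_dump_fr_CH.txt" > "$dict/dictionary_fr_CH.description"
}
```

Calling rename_dictionary_module with the project root as its only argument performs the renames in place. The edits to "project.properties", "project.xml", "Bundle.properties" and "manifest.mf" are deliberately left manual, since they are easier to review by hand.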
March 4, 2013
by Jonathan Lermitage
· 10,717 Views
Using the Libjars Option with Hadoop
When working with MapReduce, one of the challenges encountered early on is determining how to make your third-party JARs available to the map and reduce tasks. One common approach is to create a fat JAR, which is a JAR that contains your classes as well as your third-party classes (see this Cloudera blog post for more details). A more elegant solution is to take advantage of the libjars option in the hadoop jar command, also mentioned in the Cloudera post at a high level. Here I'll go into detail on the three steps required to make this work.

Add libjars to the options

It can be confusing to know exactly where to put libjars when running the hadoop jar command. The following example shows the correct position of this option:

    $ export LIBJARS=/path/jar1,/path/jar2
    $ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} -mytoolopt value

It's worth noting in the above example that the JARs supplied as the value of the libjars option are comma-separated, and not separated by your O.S. path delimiter (which is how a Java classpath is delimited). You may think that you're done, but often this step alone is not enough - read on for more details!

Make sure your code is using GenericOptionsParser

The Java class that's being supplied to the hadoop jar command should use the GenericOptionsParser class to parse the options being supplied on the CLI. The easiest way to do that is demonstrated with the following code, which leverages the ToolRunner class to parse out the options:

    public static void main(final String[] args) throws Exception {
      Configuration conf = new Configuration();
      int res = ToolRunner.run(conf, new com.example.MyTool(), args);
      System.exit(res);
    }

It is crucial that the configuration object being passed into the ToolRunner.run method is the same one that you're using when setting up your job.
To guarantee this, your class should use the getConf() method defined in Configurable (and implemented in Configured) to access the configuration:

    public class SmallFilesMapReduce extends Configured implements Tool {
      public final int run(final String[] args) throws Exception {
        Job job = new Job(super.getConf());
        ...
        job.waitForCompletion(true);
        return ...;
      }
    }

If you don't leverage the Configuration object supplied to the ToolRunner.run method in your MapReduce driver code, then your job won't be correctly configured and your third-party JARs won't be copied to the Distributed Cache or loaded in the remote task JVMs. It is the ToolRunner.run method (which delegates the command parsing to GenericOptionsParser) that parses out the libjars argument and adds a value for the tmpjars property to the Configuration object. So a quick way to make sure that this step is working is to look at the job file for your MapReduce job (there's a link when viewing the job details from the JobTracker), and make sure that the tmpjars configuration name exists with a value identical to the path that you specified in your command. You can also use the command line to search for the tmpjars configuration in HDFS:

    $ hadoop fs -cat /_logs/history/*.xml | grep tmpjars

Use HADOOP_CLASSPATH to make your third-party JARs available on the client side

So far the first two steps tackled what you need to do to make your third-party JARs available to the remote map and reduce task JVMs. What hasn't been covered so far is making these same JARs available to the client JVM, which is the JVM that's created when you run the hadoop jar command. For this to happen, you should set the HADOOP_CLASSPATH environment variable to contain the O.S. path-delimited list of third-party JARs.
Let's extend the commands in the first step above with the addition of setting the HADOOP_CLASSPATH environment variable:

    $ export LIBJARS=/path/jar1,/path/jar2
    $ export HADOOP_CLASSPATH=/path/jar1:/path/jar2
    $ hadoop jar my-example.jar com.example.MyTool -libjars ${LIBJARS} -mytoolopt value

Note that the value for HADOOP_CLASSPATH uses the Unix path delimiter ":", so modify accordingly for your platform. And if you don't like the copy-paste above, you could modify that line to replace the commas with colons:

    $ export HADOOP_CLASSPATH=`echo ${LIBJARS} | sed s/,/:/g`
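When the third-party JARs all live in one directory, both variables can be derived from that directory instead of being typed out. This is a minimal sketch, not part of the original commands: the helper name join_jars and the LIB_DIR variable are hypothetical, and it assumes no JAR path contains a comma or a colon.

```shell
# Hypothetical helper: print every .jar file in directory "$1",
# joined by the separator given as "$2".
join_jars() {
  dir="$1"
  sep="$2"
  out=""
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue          # skip the unexpanded pattern when nothing matches
    out="${out:+$out$sep}$jar"         # prepend the separator only after the first entry
  done
  printf '%s\n' "$out"
}

# LIB_DIR is an assumed location for your third-party JARs; override it as needed.
LIB_DIR="${LIB_DIR:-/path/to/lib}"

# Comma-separated for -libjars, colon-separated for the client-side classpath.
LIBJARS=$(join_jars "$LIB_DIR" ,)
HADOOP_CLASSPATH=$(join_jars "$LIB_DIR" :)
export LIBJARS HADOOP_CLASSPATH
```

With both variables exported, the hadoop jar command from the first step can be run unchanged.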
February 26, 2013
by Alex Holmes
· 22,498 Views
Building an Online-Recommendation Engine with MongoDB
Once upon a time there was a Munich pizza baker who developed a technique to beam pizza out of bright sunshine. He can produce more than a thousand pizzas per second, needs a channel to sell this amount of pizza, and decides to build an online shop. Mario's initial idea is to sell pizzas, but now he is thinking about introducing new product lines like beverages, salads and pasta. Before we look at the validation of Mario's idea, let's take a short look at the existing online shop.

Mario's online shop is based on MongoDB, Apache Wicket and Spring. MongoDB is a document-oriented NoSQL database. MongoDB stores records not in tables, as a relational database does, but in BSON documents. BSON is a binary version of JSON (JavaScript Object Notation) and is very similar to the object structure in Mario's application. Using MongoDB makes his development easier and deployment faster. The figure shows a JSON document which is very similar to a Java object: a JSON document property and its value correspond to the Java object property with the same value. You can add or remove properties in your Java object and this will automatically change your database schema, so there is no need to map your Java object model to a relational schema via Hibernate.

Mario also decided to build his online shop only with open-source technologies like Apache Wicket and Spring. Wicket is a very common lightweight component-based web application framework and it is closely patterned after stateful GUI frameworks such as JavaFX. The Spring Framework is an open-source application framework and inversion of control container for the Java platform and does not impose any specific programming model. Spring has become popular in the Java community as an alternative to, replacement for, or even addition to the Enterprise JavaBeans (EJB) model. Because of this architecture Mario is able to deploy his application in a lightweight application server like Tomcat or Jetty.
This figure shows Mario's system landscape. Mario has two major systems: on the left-hand side is his online shop, and on the right-hand side is 'pas', a famous billing system. In the middle is Hadoop, which connects the two systems. In the business world an application normally does not stand alone. In most cases an application must communicate with others. The lean architecture of Mario's online shop enables him to connect the billing system 'pas' to his online shop. Spring for Apache Hadoop provides this integration between the online shop and 'pas'. Hadoop supports data-intensive distributed applications and implements a computational paradigm named MapReduce, where the computation is divided into many small fragments, each of which may be executed or re-executed on any node in a cluster of commodity hardware. Mario uses Hadoop as an ETL layer that enables him to transfer gigabytes of order information into the billing system. In this case Hadoop makes it possible for a financial controller to verify whether all orders were billed correctly.

In addition to the online shop, Mario has a real-time sales dashboard that enables him to track his sales as they happen. The dashboard displays daily and monthly sales statistics for each pizza and contains a map with a geographical overview of customer activity and competitor locations. Here is a walkthrough of the shop:

Now let's talk about Mario's incredible new idea: Mario wants to sell even more pizza! And other products as well. Mario decides to use lean startup methods to test the possible introduction of new product lines, and plans an experiment to validate his new idea using a scientific approach and pure facts instead of hunches. Mario's core assumption is that customers want to buy products other than pizza: drinks, salads and pasta. Furthermore, he is worried about pricing.
Mario asks all customers to complete a survey and provides an incentive for participation: a free pizza for every customer who responds. The result of the survey validated Mario's assumption: customers want to buy beverages, salads and pasta. He also found out that his customers are willing to pay higher prices for high-quality products and that they simply love his easy shopping flow. Currently a pizza order can be completed with only three clicks, so there is a new riskiest assumption to validate: will a more complex shopping flow affect his sales?

The figure shows a validation board. A validation board is a deceptively simple tool for testing out product ideas. It also tracks the pivots that follow from customer feedback. Mario decides to introduce beverage, salad and pasta product lines and thinks about how he can extend the product range without destroying the easy shopping flow. That's why Mario thinks a recommendation engine is the right way for him. Panels for recommendations can be integrated in the online shop without changing the shopping flow. Mario hired a statistician to help him implement a recommender system for his online shop for better cross-selling. He also defined new measurement points to validate his new idea: he tracks the conversion rate of orders as well as cross-selling rates, and every event in the online shop is already tracked in real time. So Mario can very easily perform further experiments to verify more assumptions.

Follow the blog to see how the story continues, or come to the MongoDB User Group meetup in Munich, February 20, 2013, or MongoDB Days in Berlin, February 26, 2013, to get a live presentation. Our talk sheds light on how to build an online recommendation engine based on MongoDB and Apache Mahout. We'll show which recommenders must be built to reach Mario's goal and how these can be integrated in Mario's shop infrastructure.
February 17, 2013
by Comsysto Gmbh
· 8,436 Views
Using awk and Friends with Hadoop
Imagine you have a CSV file that you want to manipulate. Here's a sample file we can play with:

    lopez,charlie,2002,11,21
    parker,ward,1995,04,08
    henderson,russell,2007,10,01

Our goal is to transform this into the following form by combining the last three columns:

    lopez,charlie,20021121
    parker,ward,19950408
    henderson,russell,20071001

In Linux this would take all of two seconds (excuse the awkward awk command):

    shell$ awk -F"," '{ print $1","$2","$3$4$5 }' people.txt

What if you wanted to quickly do the same in HDFS, and let's assume you want to write the results back to HDFS? One approach would be to use the HDFS CLI to stream the inputs into awk, and stream the awk output back into HDFS. You could do this with the HDFS cat and put - options (note that adding a hyphen after put instructs the put command to stream data from standard input to HDFS):

    shell$ hadoop fs -cat people.txt | awk -F"," '{ print $1","$2","$3$4$5 }' | hadoop fs -put - people-coalesed.txt

BTW, if your input and output files are LZOP-compressed then this command would work:

    shell$ hadoop fs -cat people.txt.lzo | lzop -dc | awk -F"," '{ print $1","$2","$3$4$5 }' | \
      lzop -c | hadoop fs -put - people-coalesed.txt.lzo

This is great if your file isn't too large, but if it's multiple gigabytes in length then you probably want to harness the power of MapReduce to get this done in a jiffy! The words "in a jiffy" and "MapReduce" aren't commonly used together, so what do we do? Well, you could crack open Pig or Hive and write some custom user-defined functions, but this means you end up in Java, which we want to avoid. Hadoop streaming comes to the rescue in these situations.
Let's first create our awk script which will be executed:

    shell$ cat people.awk
    #!/bin/awk -f
    BEGIN { FS = "," }
    { print $1","$2","$3$4$5 }

In Linux, if you make this awk script executable, you could execute it as follows:

    shell$ ./people.awk people.txt

In MapReduce-land we don't need to join data in this particular example, so we don't need to run any reducers. Call your awk script from mappers via Hadoop streaming with this command:

    shell$ HADOOP_HOME=/usr/lib/hadoop
    shell$ ${HADOOP_HOME}/bin/hadoop \
      jar ${HADOOP_HOME}/contrib/streaming/*.jar \
      -D mapreduce.job.reduces=0 \
      -D mapred.reduce.tasks=0 \
      -input people.txt \
      -output people-coalesed \
      -mapper people.awk \
      -file people.awk

A few options in the Hadoop streaming command are worth examining: both -D settings request zero reducers (the property was renamed between Hadoop versions, so setting both covers either), and -file ships people.awk to the cluster so that the mappers can run it.

Finally, to get LZO into the picture you need to add -inputformat, -D mapred.output.compress and -D mapred.output.compression.codec arguments:

    shell$ HADOOP_HOME=/usr/lib/hadoop
    shell$ ${HADOOP_HOME}/bin/hadoop \
      jar ${HADOOP_HOME}/contrib/streaming/*.jar \
      -D mapreduce.job.reduces=0 \
      -D mapred.reduce.tasks=0 \
      -D mapred.output.compress=true \
      -D stream.map.input.ignoreKey=true \
      -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec \
      -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
      -input people.txt.lzo \
      -output people-coalesed \
      -mapper people.awk \
      -file people.awk
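Before submitting the streaming job, it can help to dry-run the mapper locally. Hadoop streaming passes input lines to the mapper on standard input and collects what the mapper prints, so piping the sample file through the script approximates a single map task. The sketch below assumes only a local awk; the scratch directory and the result variable are illustrative choices, and the script is invoked through "awk -f" so that it does not depend on where awk is installed.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample rows copied from the article.
cat > people.txt <<'EOF'
lopez,charlie,2002,11,21
parker,ward,1995,04,08
henderson,russell,2007,10,01
EOF

# The mapper script from the article. awk keywords are case-sensitive,
# so BEGIN and FS must be upper case.
cat > people.awk <<'EOF'
#!/bin/awk -f
BEGIN { FS = "," }
{ print $1","$2","$3$4$5 }
EOF

# Feed the input on standard input, as a streaming mapper would receive it.
result=$(awk -f people.awk < people.txt)
printf '%s\n' "$result"
```

If the printed rows match the target form shown at the start of the article, the same script can be passed to -mapper and -file.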
February 14, 2013
by Alex Holmes
· 13,126 Views · 1 Like