Data Engineering Resources

The Latest Data Engineering Topics

Running Hadoop MapReduce Application from Eclipse Kepler

It's very important to learn Hadoop by practice. One of the first hurdles is writing your first MapReduce application and debugging it in your favorite IDE, Eclipse. Do we need any Eclipse plugins? No, we do not. We can do Hadoop development without MapReduce plugins.

This tutorial will show you how to set up Eclipse and run your MapReduce project and job right from your IDE. Before you read further, you should have set up a Hadoop single-node cluster on your machine. You can download the Eclipse project from GitHub.

Use case: we will explore the weather data to find the maximum temperature, following Chapter 2 of Tom White's book Hadoop: The Definitive Guide (3rd edition), and run the job using ToolRunner.

I am using Linux Mint 15 on a VirtualBox VM instance. In addition, you should have a Hadoop single-node cluster (MRv1; I am using 1.2.1) installed and running. If you have not done so, I strongly recommend you do that first.

Download the Eclipse IDE. As of this writing, the latest version of Eclipse is Kepler.

1. Create a new Java project

2. Add dependency JARs

Right-click the project, open Properties, and select Java Build Path. Add all JARs from $HADOOP_HOME/lib and $HADOOP_HOME (where the Hadoop core and tools JARs live).

3. Create the Mapper

    package com.letsdobigdata;

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

4. Create the Reducer

    package com.letsdobigdata;

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

5. Create the driver for the MapReduce job

The MapReduce job is executed by the useful Hadoop utility class ToolRunner.

    package com.letsdobigdata;

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    /* This class is responsible for running the MapReduce job. */
    public class MaxTemperatureDriver extends Configured implements Tool {

        public int run(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: MaxTemperatureDriver <input path> <output path>");
                return -1;
            }

            // Pass on the configuration that ToolRunner prepared for us.
            Job job = new Job(getConf());
            job.setJarByClass(MaxTemperatureDriver.class);
            job.setJobName("Max Temperature");

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(MaxTemperatureMapper.class);
            job.setReducerClass(MaxTemperatureReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Run the job exactly once and hand the result back to ToolRunner.
            boolean success = job.waitForCompletion(true);
            return success ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            MaxTemperatureDriver driver = new MaxTemperatureDriver();
            int exitCode = ToolRunner.run(driver, args);
            System.exit(exitCode);
        }
    }

6. Supply input and output

We need to supply the input file that will be used during the map phase; the final output will be generated in the output directory by the reduce task. Edit the run configuration and supply the command-line arguments. sample.txt resides in the project root. Your Project Explorer should contain the following:

7. MapReduce job execution

8. Final output

If you managed to come this far: once the job is complete, it will create the output directory with a _SUCCESS file and a part-r-nnnnn file. Double-click the part file to view it in the Eclipse editor. We supplied 5 rows of weather data (downloaded from NCDC) and wanted to find the maximum temperature for each year in the input file, so the output contains 2 rows with the maximum temperature (in tenths of a degree Celsius) for each supplied year:

    1949    111    (11.1 C)
    1950    22     (2.2 C)

Make sure you delete the output directory before running your application the next time; otherwise you will get an error from Hadoop saying the directory already exists.

Happy Hadooping!
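If you want to check the map and reduce logic before involving Hadoop at all, the same steps can be sketched in plain Java. This is only an illustration under stated assumptions: the class MaxTemperatureSketch and its record() helper are hypothetical names, and the input lines are synthetic fixed-width records built to match the column positions the mapper reads, not real NCDC data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of the max-temperature logic; no Hadoop classes are involved. */
final class MaxTemperatureSketch {
    private static final int MISSING = 9999;

    /** Hypothetical helper: builds a synthetic 93-character record (not real NCDC data). */
    static String record(String year, String signedTemp, char quality) {
        char[] filler = new char[93];
        Arrays.fill(filler, '0');
        StringBuilder sb = new StringBuilder(new String(filler));
        sb.replace(15, 19, year);        // year occupies columns 15..18
        sb.replace(87, 92, signedTemp);  // sign plus four digits, columns 87..91
        sb.setCharAt(92, quality);       // quality code, column 92
        return sb.toString();
    }

    /** "Map" step: temperature in tenths of a degree, or null for an unusable reading. */
    static Integer parseTemperature(String line) {
        int temperature;
        if (line.charAt(87) == '+') {
            temperature = Integer.parseInt(line.substring(88, 92));
        } else {
            temperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]")) {
            return temperature;
        }
        return null;
    }

    /** "Reduce" step: keeps the maximum temperature seen for each year. */
    static Map<String, Integer> maxByYear(List<String> lines) {
        Map<String, Integer> max = new TreeMap<>();
        for (String line : lines) {
            Integer temperature = parseTemperature(line);
            if (temperature != null) {
                max.merge(line.substring(15, 19), temperature, Integer::max);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                record("1950", "+0000", '1'),
                record("1950", "+0022", '1'),
                record("1950", "-0011", '1'),
                record("1949", "+0111", '1'),
                record("1949", "+0078", '1'));
        System.out.println(maxByYear(lines)); // prints {1949=111, 1950=22}
    }
}
```

Keeping the parsing in a small static method like this makes it easy to try odd records (missing readings, bad quality codes) without starting a job.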

February 21, 2014

by Hardik Pandya

· 144,695 Views · 2 Likes

Voron and the FreeDB Dataset

I got tired of doing arbitrary performance testing, so I decided to take the FreeDB dataset and start working with that. FreeDB is a dataset used to look up CD information based on a nearly unique disk ID. This is a good dataset because it contains a lot of data (over three million albums and over 40 million songs), and it is production data. That means that it is dirty. This makes it perfect for running all sorts of interesting scenarios.

The purpose of this post (and maybe the next few) is to show off a few things. First, we want to see how Voron behaves with a realistic dataset. Second, we want to show off the way Voron works, its API, etc.

To start with, I ran my FreeDB parser, pointing it at /dev/null. The idea is to measure the cost of just going through the data. We are using freedb-complete-20130901.tar.bz2 from Sep 2013. After 1 minute, we went through 342,224 albums, and after 6 minutes we were at 2,066,871 albums. Reading the whole 3,328,488 albums took a bit over ten minutes. So just parsing and reading the FreeDB dataset is pretty expensive.

The end result is a list of objects that looks like this:

Now, let us see how we want to actually use this. We want to be able to:

- Look up an album by its disk IDs.
- Look up all the albums by an artist*.
- Look up albums by album title*.

* This gets interesting, because we need to deal with questions such as: "Given Pearl Jam, if I search for Pearl, do I get them? Do I get them for Jam?" For now, we are going to go with case-insensitive matching, and we won't be doing full-text search. We will, however, allow prefix searches.

We are using the following abstraction for the destination:

    public abstract class Destination
    {
        public abstract void Accept(Disk d);
        public abstract void Done();
    }

Basically, we read data as fast as we can and shove it to the destination until we are done.
Here is the Voron implementation:

    public class VoronDestination : Destination
    {
        private readonly StorageEnvironment _storageEnvironment;
        private WriteBatch _currentBatch;
        private readonly JsonSerializer _serializer = new JsonSerializer();
        private int counter = 1;

        public VoronDestination()
        {
            _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("freedb"));
            using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
            {
                _storageEnvironment.CreateTree(tx, "albums");
                _storageEnvironment.CreateTree(tx, "ix_artists");
                _storageEnvironment.CreateTree(tx, "ix_titles");
                tx.Commit();
            }
            _currentBatch = new WriteBatch();
        }

        public override void Accept(Disk d)
        {
            var ms = new MemoryStream();
            var writer = new JsonTextWriter(new StreamWriter(ms));
            _serializer.Serialize(writer, d);
            writer.Flush(); // push the buffered JSON into the stream before rewinding it
            ms.Position = 0;
            var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
            _currentBatch.Add(key, ms, "albums");

            if (d.Artist != null)
                _currentBatch.MultiAdd(d.Artist.ToLower(), key, "ix_artists");
            if (d.Title != null)
                _currentBatch.MultiAdd(d.Title.ToLower(), key, "ix_titles");

            if (counter % 1000 == 0)
            {
                _storageEnvironment.Writer.Write(_currentBatch);
                _currentBatch = new WriteBatch();
            }
        }

        public override void Done()
        {
            _storageEnvironment.Writer.Write(_currentBatch);
        }
    }

Let us go over this in detail, shall we? In line 10 we create a new storage environment. In this case, we just want to import the data, so we can create the storage inline. On lines 13-15, we create the relevant trees.

You can think about Voron trees in a very similar manner to the way you think about tables. They are a way to separate data into different parts of the storage. Note that all of this still resides in a single file, so there isn't a physical separation.

Note that we created an albums tree, which will contain the actual data, and the ix_artists and ix_titles trees. Those are indexes into the albums tree. You can see them being used just a little lower.
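To make the "tree plus index trees" idea concrete, here is a rough Java analogy using sorted maps. It is not Voron's API and says nothing about how Voron stores data internally; AlbumIndexSketch and its methods are hypothetical names. One map plays the role of the albums tree, a second one plays the role of ix_artists (lower-cased artist name to album keys), and a prefix search is a range scan over the sorted index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Sorted-map analogy for a data tree with a secondary index (not Voron's API). */
final class AlbumIndexSketch {
    // Stands in for the "albums" tree: incrementing key -> serialized album.
    private final Map<Integer, String> albums = new TreeMap<>();
    // Stands in for "ix_artists": lower-cased artist -> keys into the albums map.
    private final NavigableMap<String, List<Integer>> byArtist = new TreeMap<>();
    private int counter = 1;

    /** Stores the album and, when an artist is present, records it in the index. */
    int add(String artist, String serializedAlbum) {
        int key = counter++;
        albums.put(key, serializedAlbum);
        if (artist != null) {
            byArtist.computeIfAbsent(artist.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
                    .add(key);
        }
        return key;
    }

    /** Case-insensitive prefix search: scan the sorted index from the prefix onward. */
    List<String> findByArtistPrefix(String prefix) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> entry : byArtist.tailMap(wanted, true).entrySet()) {
            if (!entry.getKey().startsWith(wanted)) {
                break; // sorted order: once the prefix stops matching, nothing later matches
            }
            for (int key : entry.getValue()) {
                result.add(albums.get(key));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AlbumIndexSketch index = new AlbumIndexSketch();
        index.add("Pearl Jam", "album-1");
        System.out.println(index.findByArtistPrefix("pearl")); // prints [album-1]
        System.out.println(index.findByArtistPrefix("jam"));   // prints []
    }
}
```

With this kind of index, searching for "pearl" finds Pearl Jam but searching for "jam" does not, which is exactly the trade-off of allowing prefix searches without full-text search.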
In the Accept method, you can see that we use a WriteBatch, a native Voron notion that allows us to batch multiple operations into a single transaction. In this case, for every album, we are making 3 writes. First, we write all of the data, as a JSON string, into a stream and put it in the albums tree, using a simple incrementing integer as the actual album key. Then we add the artist and title entries (lower-cased, so we don't have to worry about case sensitivity in searches) into the relevant indexes.

At 60 seconds, we had written 267,998 values to Voron. In fact, I explicitly designed it so we can see the relevant metrics. At 495 seconds we had read 1,995,385 entries from the FreeDB file, parsed 1,995,346 of them, and written 1,610,998 to Voron. As you can imagine, each step runs in a dedicated thread, so we can see how they behave on an individual basis. The good thing about this is that I can physically see the various costs. It is actually pretty cool.

Here is the Voron directory at 60 seconds:

You can see that we have two journal files active (they haven't been applied to the data file yet) and the db.voron file is at 512 MB. The compression buffer is at 32 MB (this is usually twice as big as the biggest transaction, uncompressed). The scratch buffer is used to hold in-flight transaction information (until we send it to the data file), and you can see it is sitting at 256 MB in size.

At 15 minutes, we have the following numbers: 3,035,452 entries read from the file, 3,035,426 parsed, and 2,331,998 written to Voron. Note that we are reading the file and writing to Voron on the same disk, so that might impact the read performance.

At that time, we can see the following on the disk:

Note that we increase the size of most of our files by a factor of 2, so some of the space in the db.voron file is probably not used. Note also that we needed more scratch space to handle the in-flight information.

The entire process took 22 minutes, start to finish.
I have to note, though, that this hasn't been optimized at all, and I know we are doing a lot of stupid stuff throughout.

You might have noticed something else: we actually "crash" closed the Voron DB. This was done to see what would happen when we open a relatively large DB after an unclean shutdown.

We'll actually get to play with the data in my next post. So far this has been pretty much just to see how things are behaving. And… I just realized something: I forgot to actually add an index on disk ID, which means that I have to import the data again.

But before that, I also wrote the following:

    public class JsonFileDestination : Destination
    {
        private readonly GZipStream _stream;
        private readonly StreamWriter _writer;
        private readonly JsonSerializer _serializer = new JsonSerializer();

        public JsonFileDestination()
        {
            _stream = new GZipStream(
                new FileStream("freedb.json.gzip", FileMode.CreateNew, FileAccess.ReadWrite),
                CompressionLevel.Optimal);
            _writer = new StreamWriter(_stream);
        }

        public override void Accept(Disk d)
        {
            _serializer.Serialize(new JsonTextWriter(_writer), d);
            _writer.WriteLine();
        }

        public override void Done()
        {
            _writer.Flush();
            _stream.Dispose();
        }
    }

This completed in ten minutes for 3,328,488 entries, a rate of about 5,538 per second. The result is an 845 MB gzip file.

I had two reasons for wanting to do this. First, it gave me something to compare ourselves to. Second, and more to the point, I can reuse this gzip file for my next tests without having to go through the expensive parsing of the FreeDB file.
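The same "one JSON document per line, gzip-compressed" format is easy to produce and re-read on other platforms too. Here is a minimal Java sketch of the round trip, using only java.util.zip. It works on an in-memory byte array instead of a file so that it stays self-contained; GzipLinesSketch and its method names are hypothetical, and the records are placeholder strings rather than real FreeDB entries.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Round trip for newline-delimited records inside a gzip stream. */
final class GzipLinesSketch {

    /** Writes one record per line and returns the compressed bytes. */
    static byte[] writeLines(List<String> records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            for (String record : records) {
                writer.write(record);
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the writer above also finishes the gzip stream.
        return bytes.toByteArray();
    }

    /** Reads the records back, one per line, without re-parsing the original source. */
    static List<String> readLines(byte[] compressed) {
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("{\"id\":1}", "{\"id\":2}");
        System.out.println(readLines(writeLines(records)).equals(records)); // prints true
    }
}
```

Because every record sits on its own line, a consumer can stream the file and feed each line to the next stage as a string, which is what the next destination does.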
I did just that and ended up with the following:

    public class VoronEntriesDestination : EntryDestination
    {
        private readonly StorageEnvironment _storageEnvironment;
        private WriteBatch _currentBatch;
        private int counter = 1;

        public VoronEntriesDestination()
        {
            _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("freedb"));
            using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
            {
                _storageEnvironment.CreateTree(tx, "albums");
                _storageEnvironment.CreateTree(tx, "ix_diskids");
                _storageEnvironment.CreateTree(tx, "ix_artists");
                _storageEnvironment.CreateTree(tx, "ix_titles");
                tx.Commit();
            }
            _currentBatch = new WriteBatch();
        }

        public override int Accept(string d)
        {
            var disk = JObject.Parse(d);

            var ms = new MemoryStream();
            var writer = new StreamWriter(ms);
            writer.Write(d);
            writer.Flush();
            ms.Position = 0;
            var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
            _currentBatch.Add(key, ms, "albums");
            int count = 1; // the album itself counts as one entry

            foreach (var diskId in disk.Value<JArray>("diskids"))
            {
                count++;
                _currentBatch.MultiAdd(diskId.Value<string>(), key, "ix_diskids");
            }

            var artist = disk.Value<string>("artist");
            if (artist != null)
            {
                count++;
                _currentBatch.MultiAdd(artist.ToLower(), key, "ix_artists");
            }

            var title = disk.Value<string>("title");
            if (title != null)
            {
                count++;
                _currentBatch.MultiAdd(title.ToLower(), key, "ix_titles");
            }

            if (counter % 100 == 0)
            {
                _storageEnvironment.Writer.Write(_currentBatch);
                _currentBatch = new WriteBatch();
            }
            return count;
        }

        public override void Done()
        {
            _storageEnvironment.Writer.Write(_currentBatch);
            _storageEnvironment.Dispose();
        }
    }

Now we are actually properly disposing of things, and I also decreased the size of the batch to see how it would respond. Note that it is now being fed directly from the gzip file, at a greatly reduced cost.

I also added tracking not only for how many albums we write, but also for how many entries. By entries I mean Voron entries (which include the values we add to the indexes).
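A side note on the album key: both destinations encode the counter with a big-endian converter. The post does not say why, but a likely benefit, assuming keys are compared byte by byte, is that big-endian bytes sort in the same order as the numbers they encode. The Java sketch below illustrates only that property; BigEndianKeySketch and its methods are hypothetical names and have nothing to do with Voron's own comparison code.

```java
import java.nio.ByteBuffer;

/** Shows that big-endian int keys keep numeric order under byte-wise comparison. */
final class BigEndianKeySketch {

    /** Encodes a non-negative int as 4 bytes, most significant byte first. */
    static byte[] toKey(int value) {
        // ByteBuffer uses big-endian byte order by default.
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    /** Compares two keys byte by byte, treating every byte as unsigned. */
    static int compareKeys(byte[] a, byte[] b) {
        int shared = Math.min(a.length, b.length);
        for (int i = 0; i < shared; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // 255 encodes as 00 00 00 FF and 256 as 00 00 01 00,
        // so the byte-wise order agrees with the numeric order.
        System.out.println(compareKeys(toKey(255), toKey(256)) < 0); // prints true
    }
}
```

With a little-endian encoding the same two values would be FF 00 00 00 and 00 01 00 00, and 255 would sort after 256, so sequentially generated keys would no longer land next to each other.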
I did find a bug where we would just double the file size without due consideration to its size, so now we are doing smaller file size increases.

Word of warning: I didn't realize until after I was done with all the benchmarks that I had actually run all of them in the debug configuration, which basically means that they are utterly useless as a performance metric. That is especially true because we have a lot of verifier code that runs in debug mode. So please don't take these numbers as actual performance metrics; they aren't valid.

    Time          # of albums    # of entries
    4 minutes         773,398       3,091,146
    6 minutes       1,126,998       4,504,550
    8 minutes       1,532,858       6,126,413
    18 minutes      2,781,698      11,122,799
    24 minutes      3,328,488      13,301,496

Looking at the status of the file system midway through the run, you can see that we now increase the file in smaller increments, and that we are using more scratch space, probably because we are under very heavy write load.

After the run: scratch and compression files are only used while the database is running, and are deleted on close. The database is 7 GB in size, which is quite respectable.

Now, on to working with it, but I'll save that for my next post. This one is long enough already.

February 20, 2014

by Oren Eini

· 3,792 Views

The Risks Of Big-Bang Deployments And Techniques For Step-wise Deployment

If you ever need to persuade management why it might be better to deploy a larger change in multiple stages and push it to customers gradually, read on.

A deployment of many changes is risky. We therefore want to deploy them in a way that minimizes the risk of harm to our customers and our companies. The deployment can be done either all at once (also known as big-bang) or gradually. We will argue here for the more gradual ("stepwise") approach.

Big-bang or stepwise deployment?

A big-bang deployment seems to be the natural thing to do: the full solution is developed and tested and then replaces the current system at once. However, it has two crucial flaws.

First, it assumes that most defects can be discovered by testing. However, due to differences between test and production environments, unknown dependencies, and the sheer scale of a typical larger system, there will always be problems that are not discovered until production deployment, or even until the application has run for a while in production (which applies even to airplanes). The more parts have been changed, the more of these production defects will happen at the same time. A gradual deployment makes it possible to discover and handle them one by one.

Second, the more complex the deployment, the higher the chance of human error, i.e. the deployment itself is a likely source of serious defects.

Some of the drawbacks of a big-bang deployment in more detail:

- Complexity: A big-bang deployment requires coordination of many people and "moving parts" that depend on each other, providing a huge opportunity for human mistakes (i.e. there will be mistakes).
- Lots of time: Such a deployment requires a lot of time (typically more than planned or expected) and thus a lot of downtime during which users cannot use the system.
- Hard troubleshooting: With a network of interdependent parts that all changed at the same time, while perhaps also changing the infrastructure (i.e. the connections between them), it is extremely hard to pinpoint the source of defects. This considerably increases the time to detect and correct defects, while also increasing the risk of people stepping on each other's toes and of "panic fixes" that either cause more problems than they remove or are not good enough (such as the rollback that sped up Knight's downfall). Rollback is likely either impossible or as time-consuming and risky as the deployment itself, thus increasing the impact of defects and inviting even more human errors.
- Impact: Deploying everything to all users at the same time means that everybody will be impacted by a potential defect, error, or mistake.
- Long freeze: Everything needs to be tested together after all development is finished, which requires a lot of time during which the code is frozen and no fixes or changes can get into production for weeks.

Risk mitigation

The goal of a good deployment plan is to mitigate the risk of the deployment and get it to an acceptable level. There are two aspects to risk: the probability of a defect and the impact of the defect. The following table shows how the possible measures affect them:

    Defect probability reduction    Defect impact reduction
    ----------------------------    --------------------------------------------
    testing                         gradual migration of users to the new
    stepwise deployment             version (e.g. 1 in 1000 or particular
                                    subsets)
                                    rollback mechanism
                                    => these also lead to much lower time to
                                    detect and fix defects

Practices for stepwise deployment

- Enable stepwise deployment: Use parallel change and other Continuous Delivery techniques to make it possible to deploy updated components independently of each other, to switch new features on and off, and to switch which versions of the components they depend on are currently used. (Parallel change, i.e. keeping the old and new code and being able to use one or the other, is crucial here. Also notice that parallel change applies to data as well: you will need to evolve your data schema gradually and keep both the old and the new one side by side for a period of time.)
- Enable rollback: The previous measure, stepwise deployment, also makes it easy (or easier) to roll back the changes by switching to a previous version of a dependency or by switching back to the old code.
- Migrate users gradually to the new version, i.e. expose the new version only to a small subset of the users initially and increase that subset until everybody uses it. This can be done, for example, by deploying to only a subset of servers and sending a random or particular subset of users to the new servers, but there are also ways to do it if you have only a single machine. (See, for example, my post Webapp Blue-Green Deployment Without Breaking Sessions/With Fallback With HAProxy.)
- Monitoring: Make sure you are able to monitor the flow of users through the system and detect any anomalies and errors early, long before angry calls from the business. Tools such as Logstash, Google Analytics (with custom events from JavaScript), and client-side error logging via one of the existing services or a custom solution are invaluable.
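As an illustration of the gradual user migration described above, here is one common way to pick "1 in 1000" users: hash a stable user identifier into one of 1000 buckets and route the user to the new version when the bucket falls below the current rollout threshold. This is only a sketch under assumptions; the class RolloutSketch, the bucket count, and the use of CRC32 as the hash are choices made for the example, not something the techniques above prescribe.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch of a percentage-style rollout based on a stable hash of the user id. */
final class RolloutSketch {
    private static final int BUCKETS = 1000;

    /** Maps a user id to a bucket in [0, 1000); the same id always gets the same bucket. */
    static int bucket(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % BUCKETS);
    }

    /**
     * Decides whether this user sees the new version.
     * permille = 1 exposes roughly 1 user in 1000; permille = 1000 exposes everybody.
     */
    static boolean useNewVersion(String userId, int permille) {
        return bucket(userId) < permille;
    }

    public static void main(String[] args) {
        String userId = "user-42"; // hypothetical identifier
        System.out.println(useNewVersion(userId, 0));    // prints false
        System.out.println(useNewVersion(userId, 1000)); // prints true
    }
}
```

Because the bucket depends only on the user id, raising the threshold only ever adds users to the new version, and lowering it back to 0 acts as a simple rollback switch.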

February 20, 2014

by Jakub Holý

· 22,142 Views

Customize the Appearance of Pivot Table Reports inside Android Apps

This technical tip shows how developers can customize the appearance of pivot table reports inside their Android applications using Aspose.Cells for Android. Previously we have shown how to create a simple pivot table. This article goes further and discusses how to customize the appearance of a pivot table by setting its properties: setting the AutoFormat and PivotTableStyle types, setting format options, setting row, column, and page field formats, modifying a pivot table quick style, and clearing PivotFields.

    //Setting the AutoFormat and PivotTableStyle Type

    //Setting the PivotTable report to be automatically formatted for Excel 2003 formats
    pivotTable.setAutoFormat(true);
    //Setting the PivotTable autoformat type.
    pivotTable.setAutoFormatType(PivotTableAutoFormatType.CLASSIC);
    //Setting the PivotTable's styles for Excel 2007/2010 formats, e.g. XLSX.
    pivotTable.setPivotTableStyleType(PivotTableStyleType.PIVOT_TABLE_STYLE_LIGHT_1);

    //Setting Format Options
    //The code sample that follows illustrates how to set a number of pivot table
    //formatting options, including adding grand totals for rows and columns.

    //Dragging the third field to the data area.
    pivotTable.addFieldToArea(PivotFieldType.DATA, 2);
    //Show grand totals for rows.
    pivotTable.setRowGrand(true);
    //Show grand totals for columns.
    pivotTable.setColumnGrand(true);
    //Display a custom string in cells that contain null values.
    pivotTable.setDisplayNullString(true);
    pivotTable.setNullString("null");
    //Setting the layout
    pivotTable.setPageFieldOrder(PrintOrderType.DOWN_THEN_OVER);

    //Setting Row, Column, and Page Fields Format
    //The code example that follows shows how to access row fields, access a
    //particular row, set subtotals, apply automatic sorting, and use the
    //autoShow option.

    //Accessing the row fields.
    PivotFieldCollection pivotFields = pivotTable.getRowFields();
    //Accessing the first row field in the row fields.
    PivotField pivotField = pivotFields.get(0);
    //Setting Subtotals.
    pivotField.setSubtotals(PivotFieldSubtotalType.SUM, true);
    pivotField.setSubtotals(PivotFieldSubtotalType.COUNT, true);
    //Setting autosort options.
    //Setting the field auto sort.
    pivotField.setAutoSort(true);
    //Setting the field auto sort ascend.
    pivotField.setAscendSort(true);
    //Setting the field auto sort using the field itself.
    pivotField.setAutoSortField(-1);
    //Setting autoShow options.
    //Setting the field auto show.
    pivotField.setAutoShow(true);
    //Setting the field auto show ascend.
    pivotField.setAscendShow(false);
    //Setting the auto show using field (data field).
    pivotField.setAutoShowField(0);

    //The following lines of code illustrate how to format data fields.

    //Accessing the data fields.
    PivotFieldCollection pivotFields = pivotTable.getDataFields();
    //Accessing the first data field in the data fields.
    PivotField pivotField = pivotFields.get(0);
    //Setting data display format
    pivotField.setDataDisplayFormat(PivotFieldDataDisplayFormat.PERCENTAGE_OF);
    //Setting the base field.
    pivotField.setBaseField(1);
    //Setting the base item.
    pivotField.setBaseItem(PivotItemPosition.NEXT);
    //Setting number format
    pivotField.setNumber(10);

    //Modify a Pivot Table Quick Style
    //The code examples that follow show how to modify the quick style applied
    //to a pivot table.

    File sdDir = Environment.getExternalStorageDirectory();
    String sdPath = sdDir.getCanonicalPath();
    //Open the template file containing the pivot table.
    Workbook wb = new Workbook(sdPath + "/Template.xlsx");
    //Add pivot table style
    Style style1 = wb.createStyle();
    com.aspose.cells.Font font1 = style1.getFont();
    font1.setColor(Color.getRed());
    Style style2 = wb.createStyle();
    com.aspose.cells.Font font2 = style2.getFont();
    font2.setColor(Color.getBlue());
    int i = wb.getWorksheets().getTableStyles().addPivotTableStyle("tt");
    //Get and set the table style for different categories
    TableStyle ts = wb.getWorksheets().getTableStyles().get(i);
    int index = ts.getTableStyleElements().add(TableStyleElementType.FIRST_COLUMN);
    TableStyleElement e = ts.getTableStyleElements().get(index);
    e.setElementStyle(style1);
    index = ts.getTableStyleElements().add(TableStyleElementType.GRAND_TOTAL_ROW);
    e = ts.getTableStyleElements().get(index);
    e.setElementStyle(style2);
    //Set pivot table style name
    PivotTable pt = wb.getWorksheets().get(0).getPivotTables().get(0);
    pt.setPivotTableStyleName("tt");
    //Save the file.
    wb.save(sdPath + "/OutputFile.xlsx");

    //Clearing PivotFields
    //PivotFieldCollection has a method named clear() for the task. When you want
    //to clear all the PivotFields in an area, e.g. page, column, row or data,
    //you can use it. The code sample below shows how to clear all the
    //PivotFields in the data area.

    File sdDir = Environment.getExternalStorageDirectory();
    String sdPath = sdDir.getCanonicalPath();
    //Open the template file containing the pivot table.
    Workbook workbook = new Workbook(sdPath + "/PivotTable.xlsx");
    //Get the first worksheet
    Worksheet sheet = workbook.getWorksheets().get(0);
    //Get the pivot tables in the sheet
    PivotTableCollection pivotTables = sheet.getPivotTables();
    //Get the first PivotTable
    PivotTable pivotTable = pivotTables.get(0);
    //Clear all the data fields
    pivotTable.getDataFields().clear();
    //Add new data field
    pivotTable.addFieldToArea(PivotFieldType.DATA, "Betrag Netto FW");
    //Set the refresh data flag
    pivotTable.setRefreshDataFlag(false);
    //Refresh and calculate the pivot table data
    pivotTable.refreshData();
    pivotTable.calculateData();
    //Save the Excel file
    workbook.save(sdPath + "/out1.xlsx");

February 19, 2014

by David Zondray

· 2,710 Views

Eclipse's BIRT: Scripted Data Set

This article presents the usage of scripted data sets in Eclipse's BIRT.

February 18, 2014

by Kosta Stojanovski

· 38,794 Views · 1 Like

How to Build an iOS and Android App in 24 hours with HTML5 and Cordova

What can one create during the New Year and Christmas holidays? As it turned out, quite a lot, even if you have two kids and a bunch of family members whom you want to visit. The only thing you cannot accomplish in time is to finish an article for DZone. It takes a lot of time, nearly the entire January.

By the 5th of January I had a laptop and a couple of days to spend on some development. Having estimated what I could do in that time, I decided to create a mobile app that would work faster than the original. For this, I needed to find communicative creators of a popular app. Hence, I found the "Spender" app in the App Store. It is a simple app for tracking your budget. With it, you can estimate how effectively you spend your money at the end of each month. By the 5th of January, this app was in the top 10 in the Russian App Store.

I also found their dev story on iphones.ru. In their dev story, the developers wrote that after completing their previous project, they had three or four free days. So, they decided to create a new app during this free time. Their product manager and programmers helped them with positioning the app and its key features. This encouraged me, and I began to think about how to create nearly the same app in 2 days.

Note: the original app was updated in the middle of January, and now it looks a little different from my app. Anyway, you can find its screenshots in the dev story.

I already had experience of mobile app development using C# and Cocoa. Since this was my personal free time, I wanted to use it with maximum effectiveness. Even if I didn't succeed, I was eager to learn a new framework or programming language. I worked for DevExpress from 2006 until 2011 and have been reading their announcements since I left the company. So, I knew that they had created a mobile JS framework based on Cordova/PhoneGap. They made it after I left the company, so I was curious to try it.
The Gartner research company reports that by August 2013 most enterprise mobile software was created using PhoneGap or PhoneGap-based products (like Kony). From my experience as a consumer, that's far from true. Maybe I was wrong?

I'm not so good at HTML and JavaScript. I can create markup with the help of stackoverflow.com and I can write simple selectors with jQuery. I can also find the required information in their documentation. In other words, HTML+JS was a gap in my knowledge, and I was ready to fill it or at least gain some experience.

Thus, I planned to create a cross-platform application, which could become an advantage over the original iOS-only Spender app. Moreover, I wanted to spend my time in the most effective way. On the one hand, I had a potentially effective JS framework; on the other, a lack of JS experience. I hoped that the advantages of the JS framework could balance my poor experience.

Since I like to use a VCS during development, I'll try to recover my progress from it. You can download the complete apps here: iOS, Android.

I'm not sure I can provide public access to my repo, because it contains images I bought from Fotolia and third-party libraries, each with a different license. I'm not a lawyer, so I'd prefer not to take the risk. The most curious of you can take a look into the app bundle itself; the JS wasn't minified.

Place: Tula, Russia. Date: January 5, 2014

+20 minutes
Spent on installing Node.js and the Cordova CLI.

+10 minutes
Downloaded a template app from Cordova. Added a template from PhoneJS. Created a Git repo and registered it in WebStorm. Added a new record to httpd.conf so that I could debug my future app in the browser.

+38 minutes
Changed the app namespace to "io.nikitin.thriftbox". Added navigation. PhoneJS is an MVC framework. Each app screen is represented as a combination of HTML markup (the view) and a factory function (the ViewModel).
Here is how it looks at its simplest:

    // view content

and

    ThriftBox.home = function (params) { // request parameters taken from the URI
        return {}; // ViewModel instance
    };

Then the view and the ViewModel are bound via Knockout bindings. To finish in time, I created only two screens: expense input and a monthly expense report.

+4 hours 20 minutes
Here I got stuck for the first time. I couldn't create the markup for the digit buttons. The original app had a huge keyboard that looked like a calculator or dialer. I found out that it was not that easy to create such a keyboard, even using a table tag. On the iPhone Retina screen, the 1px borders between buttons changed their colors after clicking on the buttons. On my iPhone, the difference in colors was very noticeable. I had to figure out how to tackle this. I tried to implement the buttons using divs, but I couldn't achieve a border width of 1px and make all buttons look equal on different screens. Three hours later I gave up the idea of using divs and moved forward.

+28 minutes
Removing the clicked-button indicator on iOS. iOS displays a gray indicator around tapped links and objects with an onclick event handler. Since I had my own indicator of a tapped object (the tapped button became darker), I didn't need the default indicator. I solved this problem using the dxAction event:

    Was: 1
    Became: 1

This event is an extended variation of the "click" event: its handler supports URI navigation between views and works correctly in a scrollable area.

+14 minutes
The buttonPress event handler shown in the previous example now validates numbers from user input.

    var number = ko.observable(null);
    var isValidNumber = ko.computed(function () {
        return number() && parseFloat(number()) > 0;
    });
    ......
    function buttonPress(button) {
        if (button) {
            // A digit was pressed: append it to the current input.
            if (number())
                number(number() + button);
            else
                number(button);
        }
        else if (number())
            // No button value means backspace: drop the last character.
            number(number().substr(0, number().length - 1));
    }
    var viewModel = {
        number: number,
        isValidNumber: isValidNumber,
        viewShowing: viewShowing,
        buttonPress: buttonPress
    };
    .....

+8 minutes
Added fastclick.js, which removes the delay between tapping the screen and raising the 'click' event on phones. The mobile browser delays raising the click event by default to be sure the end user will not perform a double tap. To the end user, this looks as if the app is sluggish: you click buttons much faster than the app responds. fastclick.js handles the touchstart event and then implements all of the click event processing logic itself. By the way, adding this library was a mistake; later I'll tell you why.

+4 minutes
Added a limit on the length of the numbers the user can enter. Corrected the font size for a better look and feel.

+58 minutes
Added a choice of expense category: a scrollable pane with the available categories below the input field. Video. It took less time than it could have. In the PhoneJS component collection, I found dxTileView. It provides kinetic scrolling with the required appearance out of the box. It's not easy to implement kinetic scrolling yourself, so it's great to have it built in; this scrolling is enabled for iOS only, as Android doesn't have it.

It was 7:40 pm, so I decided to continue the next day.

Place: Tula, Russia. Date: January 5, 2014

+3 hours 9 minutes
Storing data in local storage. PhoneJS contains classes for working with data: selection, filtering, sorting, and grouping. There are several approaches to storing data: OData and localStorage. I didn't want to implement a server side for a free app, so I decided to use localStorage. Later I found out that this was not an ideal decision.
for example, user data is erased when updating to ios 5.1, and other people complained that localstorage is cleared regularly or even when the device is shut down. i didn't want to take the risk, so i used the file api of phonegap. the documentation says that this api is based on the w3c file api. in practice, this means that the api differs between safari for mac os, chrome for mac os, cordova for ios and cordova for android. the file api implementation is different for ios and android. e.g. the android implementation doesn't contain the 'blob' class and the 'window.permanent' constant. it does, however, implement 'localfilesystem' and 'localfilesystem.persistent'. the laptop browser provides an extra api for requesting additional storage space, which mobile browsers don't provide. the available documentation for this api adds more problems. i found several articles by searching for "html5 file api", but i couldn't find one that covered all my questions. finally i created a new class for working with the file api. this class supports cordova 3.3 on ios and android, and chrome 32 for mac os and windows 8. you can find it here: https://github.com/chebum/filestorage-for-phone.js/blob/master/filestorage.js you can use it as follows: /* in this example i create the data/records file in the documents folder of the app */ fs.initfileapi(1000000, true) .then(function () { var records = new fs.filearraystore({ key: "id", filename: "records" }); return records.insert({ customer: "peter" }) }) .then(function () { alert("record saved!"); }); /* or use the low-level api: */ fs.initfileapi(100000, true) .then(function() { return fs.writefile("file1", "file content") }) .then(function() { alert("file saved!"); }); +33 minutes saving the added records to the storage. the category list is stored in an arraystore to simplify selection operations. +26 minutes creating a layout for the app's views. phonejs provides several layouts that serve as placeholders for the views.
my app's start page didn't fit into any of the available layouts, so i chose the emptylayout. but it doesn't provide animation effects when navigating between views. i copied the emptylayout code and added an attribute that enabled animation effects. +1 h. 51 min. the template's about screen was redesigned into a report screen, which was still empty at that point. created a viewmodel that selects data for the current month. added localized date formatting for the screen caption. +59 minutes added the display of expenses grouped by category for the current month. +28 minutes added the selection of the month for which the report should be generated. end-users can tap the screen header to select the required month. +1 h. 20 min. added the cordova statusbar plugin, which didn't work out-of-the-box. i found that the reference to cordova.js was commented out in the phonejs app template. as a result, the native part of my app didn't work. +39 minutes in the report screen, the upper part was changed to dxtoolbar. +22 minutes i discovered why the dxbutton click event handler didn't work. removing fastclick.js solved my problem, but brought back the delay between tapping and the event being raised. i changed the dxaction event subscription to 'touchstart'. +25 minutes formatting output strings when generating a report. at night i dreamed of crappy buttons on the application’s main screen. places: tula, vnukovo airport, date: january 7-8, 2014 i had an early flight to budapest from vnukovo, and because i had no time in the afternoon, i gradually finished the work at the airport at night. as you know, it’s not very comfortable to sleep or sit in a café chair for a long time, but it turned out that programming was ok. +2 h. 5 min. in the morning, i decided to split the buttons apart in order to remove the borders between them. i took the ios dialer keyboard as a sample. i created three keyboards. the button size changes depending on screen size: for 3.5'', 4'' and 5'' phones.
each table cell contained a div with configured alignment. because html lacks proper vertical text alignment, the final css style for the buttons ended up being quite complex: .home-view .buttons td div { color: #4a5360; border: 1px solid #4a5360; text-align: center; position: absolute; left: 50%; /* small buttons - default */ font-size: 26px; padding: 13px 0 13px 0; width: 52px; line-height: 26px; border-radius: 26px; margin-left: -27px; margin-top: -27px; } +1 h. 50 minutes i bought several vector icon sets on fotolia. i cut out the required icons and converted them to png. it took me quite a long time, maybe because it was 1:30 am :) +1 hour 10 minutes added a splash screen for the app. +36 minutes created three sizes of the app icon. localized the app name for ios. +20 minutes hiding the splash screen after the app is completely loaded. +2 hours fixing multiple bugs. +2 hours creating screenshots for play store +30 minutes creating screenshots for app store +30 minutes writing an app description for the two app stores. +1 h. 30 minutes submitting my app to the app store. here i faced an issue with the app certification. my accounting let's summarize the time i spent and divide it into categories. development: 21 hours 37 minutes graphics and texts: 8 hours 26 minutes total: 30 hours 3 minutes as a result, i got a working app with a minimal feature set, though it is not as cool as the latest version of "spender". i couldn't implement splitting expenses by day or income input. my app's ui could be more elegant as well. here is what i learned about the effort behind the original 'spender'. its developers say that four developers worked on it for three to four days. that is about 96-128 man-hours. i spent only 30 man-hours and got an app for three mobile platforms. the ios and android versions are already in stores. the version for windows phone 8 requires a ui redesign. i can be proud of myself :). you can download the complete apps here: ios, android
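The totals in the summary can be double-checked with a few lines of code. This is only an illustrative sketch: the class and method names are made up, and the 8-hour working day used to turn four developers over three or four days into man-hours is an assumption that the figures imply but the text does not state.

```java
import java.time.Duration;

/** Illustrative check of the time accounting; all names here are hypothetical. */
final class EffortSummary {
    /** Builds a duration from hours and minutes. */
    static Duration hm(int hours, int minutes) {
        return Duration.ofHours(hours).plusMinutes(minutes);
    }

    /** Man-hours for a team, assuming every developer works hoursPerDay each day. */
    static long manHours(int developers, int days, int hoursPerDay) {
        return (long) developers * days * hoursPerDay;
    }

    public static void main(String[] args) {
        // development + graphics and texts
        Duration total = hm(21, 37).plus(hm(8, 26));
        System.out.println(total.toHours() + " h " + (total.toMinutes() % 60) + " min");
        // four developers, three to four days, assumed 8-hour days
        System.out.println(manHours(4, 3, 8) + "-" + manHours(4, 4, 8) + " man-hours");
    }
}
```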

February 12, 2014

by Ivan Nikitin

· 210,702 Views

Build Your Own Custom Lucene Query and Scorer

Every now and then we’ll come across a search problem that can’t simply be solved with plain Solr relevancy. This usually means a customer knows exactly how documents should be scored. They may have little tolerance for close approximations of this scoring through Solr boosts, function queries, etc. They want a Lucene-based technology for text analysis and performant data structures, but they need to be extremely specific in how documents should be scored relative to each other. Well for those extremely specialized cases we can prescribe a little out-patient surgery to your Solr install – building your own Lucene Query. This is the Nuclear Option Before we dive in, a word of caution. Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths: You’ve utilized Solr’s extensive set of query parsers & features including function queries, joins, etc. None of this solved your problem You’ve exhausted the ecosystem of plugins that extend on the capabilities in (1). That didn’t work. You’ve implemented your own query parser plugin that takes user input and generates existing Lucene queries to do this work. This still didn’t solve your problem. You’ve thought carefully about your analyzers – massaging your data so that at index time and query time, text lines up exactly as it should to optimize the behavior of existing search scoring. This still didn’t get what you wanted. You’ve implemented your own custom Similarity that modifies how Lucene calculates the traditional relevancy statistics – query norms, term frequency, etc. You’ve tried to use Lucene’s CustomScoreQuery to wrap an existing Query and alter each documents score via a callback. This still wasn’t low-level enough for you, you needed even more control. 
If you’re still reading you either think this is going to be fun/educational (good for you!) or you’re one of the minority that must control exactly what happens with search. If you don’t know, you can of course contact us for professional services. Ok back to the action… Refresher – Lucene Searching 101 Recall that to search in Lucene, we need to get a hold of an IndexSearcher. This IndexSearcher performs search over an IndexReader. Assuming we’ve created an index, with these classes we can perform searches like in this simplified code (the real search() overloads also take a hit count or a Collector): Directory dir = new RAMDirectory(); IndexReader idxReader = DirectoryReader.open(dir); IndexSearcher idxSearcher = new IndexSearcher(idxReader); Query q = new TermQuery(new Term("field", "value")); idxSearcher.search(q); Let’s summarize the objects we’ve created: Directory – Lucene’s interface to a file system. This is pretty straight-forward. We won’t be diving in here. IndexReader – Access to data structures in Lucene’s inverted index. If we want to look up a term, and visit every document it exists in, this is where we’d start. If we wanted to play with term vectors, offsets, or anything else stored in the index, we’d look here for that stuff as well. IndexSearcher – wraps an IndexReader for the purpose of taking search queries and executing them. Query – How we expect the searcher to perform the search, encompassing both scoring and which documents are returned. In this case, we’re searching for “value” in field “field”. This is the bit we want to toy with. In addition to these classes, we’ll mention a support class that exists behind the scenes: Similarity – Defines rules/formulas for calculating norms at index time and query normalization. Now with this outline, let’s think about a custom Lucene Query we can implement to help us learn. How about a query that searches for terms backwards? If the document matches a term backwards (like ananab for banana), we’ll return a score of 5.0.
If the document matches the forwards version, let’s still return the document, with a score of 1.0 instead. We’ll call this Query “BackwardsTermQuery”. This example is hosted here on github. A tale of 3 classes – A Query, A Weight, and a Scorer Before we sling code, let’s talk about general architecture. A Lucene Query follows this general structure: A custom Query class, inheriting from Query A custom Weight class, inheriting from Weight A custom Scorer class inheriting from Scorer These three objects wrap each other. A Query creates a Weight, and a Weight in turn creates a Scorer. A Query is itself a very straight-forward class. One of its main responsibilities when passed to the IndexSearcher is to create a Weight instance. Other than that, there are additional responsibilities to Lucene and users of your Query to consider, that we’ll discuss in the “Query” section below. A Query creates a Weight. Why? Lucene needs a way to track IndexSearcher level statistics specific to each query while retaining the ability to reuse the query across multiple IndexSearchers. This is the role of the Weight class. When performing a search, IndexSearcher asks the Query to create a Weight instance. This instance becomes the container for holding high-level statistics for the Query scoped to this IndexSearcher (we’ll go over these steps more in the “Weight” section below). The IndexSearcher safely owns the Weight, and can abuse and dispose of it as needed. If later the Query gets reused by another IndexSearcher, a new Weight simply gets created. Once an IndexSearcher has a Weight, and has calculated any IndexSearcher level statistics, the IndexSearcher’s next task is to find matching documents and score them. To do this, the Weight in turn creates a Scorer. Just as the Weight is tied closely to an IndexSearcher, a Scorer is tied to an individual IndexReader. Now this may seem a little odd – in our code above the IndexSearcher always has exactly one IndexReader right? Not quite. 
See, a little hidden implementation detail is that IndexReaders may actually wrap other smaller IndexReaders – each tied to a different segment of the index. Therefore, an IndexSearcher needs to have the ability to score documents across multiple, independent IndexReaders. How your scorer should iterate over matches and score documents is outlined in the “Scorer” section below. So to summarize, we can expand the last line from our example above… idxSearcher.search(q); … into this pseudocode: Weight w = q.createWeight(idxSearcher); // IndexSearcher level calculations for weight Foreach IndexReader idxReader: Scorer s = w.scorer(idxReader); // collect matches and score them Now that we have the basic flow down, let’s pick apart the three classes in a little more detail for our custom implementation. Our Custom Query What should our custom Query implementation look like? Query implementations always have two audiences: (1) Lucene and (2) users of your Query implementation. For your users, expose whatever methods you require to modify how a searcher matches and scores with your query. Want to only return as a match 1/3 of the documents that match the query? Want to punish the score because the document length is longer than the query length? Add the appropriate modifier on the query that impacts the scorer’s behavior. For our BackwardsTermQuery, we don’t expose accessors to modify the behavior of the search. The user simply uses the constructor to specify the term and field to search. In our constructor, we will simply be reusing Lucene’s existing TermQuery for searching individual terms in a document.
private TermQuery backwardsQuery; private TermQuery forwardsQuery; public BackwardsTermQuery(String field, String term) { /* A wrapped TermQuery for the reverse string */ Term backwardsTerm = new Term(field, new StringBuilder(term).reverse().toString()); backwardsQuery = new TermQuery(backwardsTerm); /* A wrapped TermQuery for the forward string */ Term forwardsTerm = new Term(field, term); forwardsQuery = new TermQuery(forwardsTerm); } Just as importantly, be sure your Query meets the expectations of Lucene. Most importantly, you MUST override the following: createWeight(), hashCode() and equals(). The method createWeight() we’ve discussed. This is where you’ll create a Weight instance for an IndexSearcher. Pass any parameters that will influence the scoring algorithm, as the Weight will in turn be creating a Scorer. Even though they are not abstract methods, overriding the hashCode()/equals() methods is very important. These methods are used by Lucene/Solr to cache queries/results. If two queries are equal, there’s no reason to rerun the query. If you don’t override them correctly, running another instance of your query could result in seeing the results of your first query multiple times. You’ll see your search for “peas” work great, then you’ll search for “bananas” and see “peas” search results. Override equals() and hashCode() so that “peas” != “bananas”. Our BackwardsTermQuery implements createWeight() by creating a custom BackwardsWeight that we’ll cover below: @Override public Weight createWeight(IndexSearcher searcher) throws IOException { return new BackwardsWeight(searcher); } BackwardsTermQuery has a fairly boilerplate equals() and hashCode() that pass through to the wrapped TermQuery instances. Be sure equals() includes all the boilerplate stuff such as the check for self-comparison, the call to the superclass equals(), the class comparison, and so on. By using Lucene’s unit test suite, we can get a lot of good checks that our implementation of these is correct.
@Override public boolean equals(Object other) { if (this == other) { return true; } if (!super.equals(other)) { return false; } if (getClass() != other .getClass()) { return false; } BackwardsTermQuery otherQ = (BackwardsTermQuery)(other); if (otherQ.getBoost() != getBoost()) { return false; } return otherQ.backwardsQuery.equals(backwardsQuery) && otherQ.forwardsQuery.equals(forwardsQuery); } @Override public int hashCode() { return super.hashCode() + backwardsQuery.hashCode() + forwardsQuery.hashCode(); } Our Custom Weight You may choose to use Weight simply as a mechanism to create Scorers (where the real meat of search scoring lives). However, your Custom Weight class must at least provide boilerplate implementations of the query normalization methods even if you largely ignore what is passed in: getValueForNormalization normalize These methods participate in a little ritual that IndexSearcher puts your Weight through with the Similarity for query normalization. To summarize the query normalization code in the IndexSearcher: float v = weight.getValueForNormalization(); float norm = getSimilarity().queryNorm(v); weight.normalize(norm, 1.0f); Great, what does this code do? Well a value is extracted from Weight. This value is then passed to a Similarity instance that “normalizes” that value. Weight then receives this normalized value back. In short, this is allowing IndexSearcher to give weight some information about how its “value for normalization” compares to the rest of the stuff being searched by this searcher. This is extremely high level, “value for normalization” could mean anything, but here it generally means “what I think is my weight” and what Weight receives back is what the searcher says “no really here is your weight”. The details of what that means depend on the Similarity and Weight implementation. It’s expected that the Weight’s generated Scorer will use this normalized weight in scoring. 
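The ritual is easier to follow with concrete numbers. The sketch below does not use Lucene at all: `SketchWeight`, `LeafWeight` and `PairWeight` are hypothetical stand-ins, and `queryNorm` is assumed here to be one over the square root of the value it receives, which is only one possible choice since a Similarity is free to compute something else.

```java
/** Stand-alone sketch of the query normalization handshake; every name is hypothetical. */
final class NormalizationSketch {
    /** Stand-in for the two normalization methods of a Weight. */
    interface SketchWeight {
        float getValueForNormalization();
        void normalize(float norm, float topLevelBoost);
    }

    /** A leaf weight: reports its squared raw weight, then stores what it is handed back. */
    static final class LeafWeight implements SketchWeight {
        final float raw;
        float normalized;

        LeafWeight(float raw) { this.raw = raw; }

        public float getValueForNormalization() { return raw * raw; }

        public void normalize(float norm, float topLevelBoost) {
            normalized = raw * norm * topLevelBoost;
        }
    }

    /** A composite weight that only passes through to two wrapped weights. */
    static final class PairWeight implements SketchWeight {
        private final SketchWeight first;
        private final SketchWeight second;

        PairWeight(SketchWeight first, SketchWeight second) {
            this.first = first;
            this.second = second;
        }

        public float getValueForNormalization() {
            return first.getValueForNormalization() + second.getValueForNormalization();
        }

        public void normalize(float norm, float topLevelBoost) {
            first.normalize(norm, topLevelBoost);
            second.normalize(norm, topLevelBoost);
        }
    }

    /** Assumed normalization rule: 1 / sqrt(value). */
    static float queryNorm(float value) {
        return (float) (1.0 / Math.sqrt(value));
    }

    /** The three steps the searcher performs on a top-level weight. */
    static void runRitual(SketchWeight weight) {
        float v = weight.getValueForNormalization();
        float norm = queryNorm(v);
        weight.normalize(norm, 1.0f);
    }

    public static void main(String[] args) {
        LeafWeight a = new LeafWeight(3f);
        LeafWeight b = new LeafWeight(4f);
        runRitual(new PairWeight(a, b));
        System.out.println(a.normalized + " " + b.normalized);
    }
}
```

With raw weights 3 and 4 the composite reports 9 + 16 = 25, the assumed norm is 1/5, and each leaf ends up scaled by the same factor. That shared factor is the whole point of the round trip: each weight learns how it compares to everything else in the query.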
You can choose to do whatever you want in your own Scorer including completely ignoring what’s passed to normalize(). While our Weight isn’t factoring into the scoring calculation, for consistency’s sake, we’ll participate in the little ritual by overriding these methods: @Override public float getValueForNormalization() throws IOException { return backwardsWeight.getValueForNormalization() + forwardsWeight.getValueForNormalization(); } @Override public void normalize(float norm, float topLevelBoost) { backwardsWeight.normalize(norm, topLevelBoost); forwardsWeight.normalize(norm, topLevelBoost); } Outside of these query normalization details, and implementing “scorer”, little else happens in the Weight. However, you may perform whatever else requires an IndexSearcher in the Weight constructor. In our implementation, we don’t perform any additional steps with IndexSearcher. The final and most important requirement of Weight is to create a Scorer. For BackwardsWeight we construct our custom BackwardsScorer, passing scorers created from each of the wrapped queries to work with. @Override public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder, boolean topScorer, Bits acceptDocs) throws IOException { Scorer backwardsScorer = backwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs); Scorer forwardsScorer = forwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs); return new BackwardsScorer(this, context, backwardsScorer, forwardsScorer); } Our Custom Scorer The Scorer is the real meat of the search work. Responsible for identifying matches and providing scores for those matches, this is where the lion’s share of our customization will occur. It’s important to note that a Scorer is also a Lucene DocIdSetIterator. A DocIdSetIterator is a cursor into a set of documents in the index. It provides three important methods: docID() – what is the id of the current document?
(this is an internal Lucene ID, not the Solr “id” field you might have in your index) nextDoc() – advance to the next document advance(target) – advance (seek) to the target One uses a DocIdSetIterator by first calling nextDoc() or advance() and then reading the docID to get the iterator’s current location. The docID values only increase as they are iterated over. By implementing this interface a Scorer acts as an iterator over matches in the index. A Scorer for the query “field1:cat” can be iterated over in this manner to return all the documents that match the cat query. In fact, if you recall from my article, this is exactly how the terms are stored in the search index. You can choose to either figure out how to correctly iterate through the documents in a search index, or you can use the other Lucene queries as building blocks. The latter is often the simplest. For example, if you wish to iterate over the set of documents containing two terms, simply use the scorer corresponding to a BooleanQuery for iteration purposes. The first method of our scorer to look at is docID(). It works by reporting the lowest docID() of our underlying scorers. This scorer can be thought of as being “before” the other in the index, and as we want to report numerically increasing docIDs, we always want to choose this value: @Override public int docID() { int backwordsDocId = backwardsScorer.docID(); int forwardsDocId = forwardsScorer.docID(); if (backwordsDocId <= forwardsDocId && backwordsDocId != NO_MORE_DOCS) { currScore = BACKWARDS_SCORE; return backwordsDocId; } else if (forwardsDocId != NO_MORE_DOCS) { currScore = FORWARDS_SCORE; return forwardsDocId; } return NO_MORE_DOCS; } Similarly, we always want to advance the scorer with the lowest docID, moving it ahead. Then, we report our current position by returning docID() which as we’ve just seen will report the docID of the scorer that advanced the least in the nextDoc() operation.
@Override public int nextDoc() throws IOException { int currDocId = docID(); /* increment one or both */ if (currDocId == backwardsScorer.docID()) { backwardsScorer.nextDoc(); } if (currDocId == forwardsScorer.docID()) { forwardsScorer.nextDoc(); } return docID(); } In our advance() implementation, we allow each Scorer to advance. An advance() implementation promises to land docID() either exactly on or past target. After advancing, our call to docID() returns target if one or both scorers landed on it; otherwise it returns the lowest docID past target. @Override public int advance(int target) throws IOException { backwardsScorer.advance(target); forwardsScorer.advance(target); return docID(); } What a Scorer adds on top of DocIdSetIterator is the “score” method. When score() is called, a score for the current document (the doc at docID) is expected to be returned. Using the full capabilities of the IndexReader, any amount of information stored in the index can be consulted to arrive at a score either in score() or while iterating documents in nextDoc()/advance(). Given the docId, you’ll be able to access the term vector for that document (if available) to perform more sophisticated calculations. In our query, we’ll simply keep track of whether the current docID is from the wrapped backwards term scorer, indicating a match on the backwards term, or the forwards scorer, indicating a match on the normal, unreversed term. Recall docID() is always called on advance/nextDoc. You’ll notice we update currScore in docID, updating it every time the document advances. @Override public float score() throws IOException { return currScore; } A Note on Unit Testing Now that we have an implementation of a search query, we’ll want to test it! I highly recommend using Lucene’s test framework. Lucene will randomly inject different implementations of various support classes and index implementations to throw your code off balance.
Additionally, Lucene creates test implementations of classes such as IndexReader that work to check whether your Query correctly fulfills its contract. In my work, I’ve had numerous cases where tests would fail intermittently, pointing to places where my use of Lucene’s data structures subtly violated the expected contract. An example unit test is included in the github project associated with this blog post. Wrapping Up That’s a lot of stuff! And I didn’t even cover everything there is to know! As an exercise for the reader, you can explore the Scorer methods cost() and freq(), as well as the rewrite() method of Query used optionally for optimization. Additionally, I haven’t explored how most of the traditional search queries end up using a framework of Scorers/Weights that don’t actually inherit from Scorer or Weight known as “SimScorer” and “SimWeight”. These support classes consult a Similarity instance to customize the calculation of certain search statistics such as tf, convert a payload to a boost, etc. In short there’s a lot here! So tread carefully, there’s plenty of fiddly bits out there! But have fun! Creating a custom Lucene query is a great way to really understand how search works, and the last resort in solving relevancy problems short of creating your own search engine. And if you have relevancy issues, contact us! If you don’t know whether you do, our search relevancy product, Quepid, might be able to tell you!
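The iteration rule at the heart of BackwardsScorer can be tried out without Lucene. The sketch below is a hypothetical stand-in, not the code from the github project: `SortedDocs` plays the role of a wrapped scorer iterating over a sorted array of doc ids, and `UnionDocs` applies the same lowest-docID rule, with a backwards match winning when both cursors sit on the same document. The class names and the sample doc ids are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Cursor over a sorted array of doc ids; a stand-in for one wrapped scorer. */
final class SortedDocs {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs;
    private int pos = -1; // -1: not positioned on any document yet

    SortedDocs(int... docs) { this.docs = docs; }

    int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    int nextDoc() {
        if (pos < docs.length) pos++;
        return docID();
    }

    /** Moves forward until the cursor is on or past target. */
    int advance(int target) {
        int doc = docID();
        while (doc < target) doc = nextDoc();
        return doc;
    }
}

/** Union of two cursors that always reports the lowest current doc id. */
final class UnionDocs {
    static final float BACKWARDS_SCORE = 5.0f;
    static final float FORWARDS_SCORE = 1.0f;
    private final SortedDocs backwards;
    private final SortedDocs forwards;

    UnionDocs(SortedDocs backwards, SortedDocs forwards) {
        this.backwards = backwards;
        this.forwards = forwards;
    }

    /** Lowest doc id wins; NO_MORE_DOCS is the largest int, so min() handles exhaustion. */
    int docID() { return Math.min(backwards.docID(), forwards.docID()); }

    /** Steps every cursor that sits on the current doc, then reports the new position. */
    int nextDoc() {
        int current = docID();
        if (backwards.docID() == current) backwards.nextDoc();
        if (forwards.docID() == current) forwards.nextDoc();
        return docID();
    }

    int advance(int target) {
        backwards.advance(target);
        forwards.advance(target);
        return docID();
    }

    /** A backwards match wins when both cursors are on the same document. */
    float score() {
        return backwards.docID() == docID() ? BACKWARDS_SCORE : FORWARDS_SCORE;
    }

    /** Collects every remaining hit as "docID:score". */
    static List<String> drain(UnionDocs union) {
        List<String> hits = new ArrayList<>();
        while (union.nextDoc() != SortedDocs.NO_MORE_DOCS) {
            hits.add(union.docID() + ":" + union.score());
        }
        return hits;
    }

    public static void main(String[] args) {
        UnionDocs union = new UnionDocs(new SortedDocs(1, 4, 7), new SortedDocs(2, 4, 9));
        System.out.println(drain(union));
    }
}
```

One deliberate difference from the article's scorer: here score() derives its answer from the cursor positions instead of caching a value as a side effect of docID(). Either approach gives the same scores; the stateless version is just easier to reason about in a small example.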

February 10, 2014

by Doug Turnbull

· 14,462 Views

Voron & Time Series: Working with Real Data

dan liebster has been kind enough to send me a real world time series database. the data has been sanitized to remove identifying details, but this is actual real world data, so we can learn a lot more from it. the first thing that i did was take the code in this post and try it out for size. i wrote the following: int i = 0; using (var parser = new TextFieldParser(@"c:\users\ayende\downloads\timeseries.csv")) { parser.HasFieldsEnclosedInQuotes = true; parser.Delimiters = new[] {","}; parser.ReadLine(); /* ignore headers */ var startNew = Stopwatch.StartNew(); while (parser.EndOfData == false) { var fields = parser.ReadFields(); Debug.Assert(fields != null); dts.Add(fields[1], DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture), double.Parse(fields[3])); i++; if (i == 25*1000) { break; } if (i%1000 == 0) Console.Write("\r{0,15:#,#} ", i); } Console.WriteLine(); Console.WriteLine(startNew.Elapsed); } note that we are using a separate transaction per line, which means that we are really doing a lot of extra work. but this simulates incoming events arriving one at a time very well. we were able to process 25,000 events in 8.3 seconds, a rate of just over 3 events per millisecond. now, note that we have in here the notion of “channels”. from my investigation, it seems clear that some form of separation is actually very common in time series data. we are usually talking about sensors or some such, and we want to track data across different sensors over time. and there is little if any call for working over multiple sensors / channels at the same time. because of that, i made a relatively minor change in voron that allows it to have an unlimited number of separate trees. that means that i can use as many trees as i want, and we can model a channel as a tree in voron. i also changed things so that instead of doing a single transaction per line, we do a transaction per 1000 lines.
that dropped the time to insert 25,000 lines to 0.8 seconds, a full order of magnitude faster. that done, i inserted the full data set, which is 1,096,384 records. that took 36 seconds. in the data set i have, there are 35 channels. i just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. that makes things like computing averages over time and comparing data practical. you can see the code implementing this in the following link.
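The effect of batching is easy to see by counting commits. The sketch below is written in Java purely for illustration and has nothing to do with Voron's actual API: `CountingStore` is a hypothetical in-memory stand-in that only counts writes and commits, which is enough to show how many transactions each strategy pays for and to redo the throughput arithmetic from the post.

```java
/** Illustration of per-event versus batched commits; every name here is hypothetical. */
final class BatchedWrites {
    /** Stand-in store: counts writes and commits, persists nothing. */
    static final class CountingStore {
        int pending;
        int stored;
        int commits;

        void add() { pending++; }

        void commit() {
            stored += pending;
            pending = 0;
            commits++;
        }
    }

    /** Writes the given number of events, committing once per batchSize events. */
    static int load(CountingStore store, int events, int batchSize) {
        for (int i = 1; i <= events; i++) {
            store.add();
            if (i % batchSize == 0) store.commit();
        }
        if (store.pending > 0) store.commit(); // flush the final partial batch
        return store.commits;
    }

    static double eventsPerMillisecond(int events, double seconds) {
        return events / (seconds * 1000.0);
    }

    public static void main(String[] args) {
        System.out.println("commits, one per event: " + load(new CountingStore(), 25_000, 1));
        System.out.println("commits, one per 1000:  " + load(new CountingStore(), 25_000, 1_000));
        System.out.printf("rate at 8.3 s: %.2f events/ms%n", eventsPerMillisecond(25_000, 8.3));
        System.out.printf("rate at 0.8 s: %.2f events/ms%n", eventsPerMillisecond(25_000, 0.8));
    }
}
```

For 25,000 events the batched run pays for 25 commits instead of 25,000. How much of the measured speedup comes from that, and how much from the move to one tree per channel, is not something the numbers in the post separate.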

February 7, 2014

by Oren Eini

· 4,023 Views

Managing Disk Space in MongoDB

In our previous post on MongoDB storage structure and dbStats metrics, we covered how MongoDB stores data and the differences between the dataSize, storageSize and fileSize metrics. We can now apply this knowledge to evaluate strategies for re-using MongoDB disk space. When documents or collections are deleted, empty record blocks within data files arise. MongoDB attempts to reuse this space when possible, but it will never return this space to the file system. This behavior explains why fileSize never decreases despite deletes on a database. If your app frequently deletes or if your fileSize is significantly larger than the size of your data plus indexes, you can use one of the methods below to reclaim free space. Getting your free space back Compacting individual collections You can compact individual collections using the compact command. This command rewrites and defragments all data in a collection, as well as all of the indexes on that collection. Important notes on compacting: This operation blocks all other database activity when running and should be used only when downtime for your database is acceptable. If you are running a replica set, you can perform compaction on secondaries in order to avoid blocking the primary and use failover to make the primary a secondary before compacting it. Compacting individual collections will not reduce your storage footprint on disk (i.e., your fileSize) but it will defragment the collections you compact. Compacting one or more databases For a single-node MongoDB deployment, you can use the db.repairDatabase() command to compact all the collections in the database. This operation rewrites all the data and indexes for each collection in the database from scratch and thereby compacts and defragments the entire database. To compact all the databases on your server process, you can stop your mongod process and run it with the “--repair” option.
Important notes on running a repair: This operation blocks all other database activity when running and should be used only when downtime for your database is acceptable. Running a repair requires free disk space equal to the size of your current data set plus 2 GB. You can use space in a different volume than the one that your mongod is running in by specifying the “--repairpath” option. Compacting all databases on a server by re-syncing replica set nodes For a multi-node MongoDB deployment, you can resync a secondary from scratch to reclaim space. By resyncing each node in your replica set you effectively rewrite the data files from scratch and thereby defragment your database. Please note that if your cluster is comprised of only two electable nodes, you will sacrifice high availability during the resync because the secondary is completely wiped before syncing. If your app is sensitive to downtime, we recommend a process similar to the one we use here at MongoLab which we call a “rolling node replacement.” This process replaces each node in your cluster in turn by bringing a new node into the cluster, replicating the data to that new node and removing the old node. In this way, your cluster can maintain the same level of redundancy during the compaction as during normal operations. A tip about efficiently using space usePowerOf2Sizes Setting the usePowerOf2Sizes option is a proactive approach to reusing space in collections that experience frequent document moves or deletions. This option supersedes the default padding factor mechanism and reduces the impact of fragmentation within the collection by allocating additional space for each document in intervals that follow the powers of 2.
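To make the idea concrete, here is a minimal sketch of rounding a document's size up to the next power of 2. It is an illustration only, not MongoDB's allocator: the class and method names are hypothetical, and any minimum record size or special handling of very large documents that the server applies is deliberately left out.

```java
/** Simplified illustration of power-of-2 record sizing; not MongoDB's actual allocator. */
final class PowerOf2Allocation {
    static final int MAX_DOCUMENT_BYTES = 16 * 1024 * 1024; // MongoDB's 16 MB document limit

    /** Rounds a document size up to the nearest power of 2. */
    static int allocationFor(int documentBytes) {
        if (documentBytes <= 0 || documentBytes > MAX_DOCUMENT_BYTES) {
            throw new IllegalArgumentException("unsupported size: " + documentBytes);
        }
        int highest = Integer.highestOneBit(documentBytes);
        return highest == documentBytes ? documentBytes : highest << 1;
    }

    /** True when a document can grow to newBytes without leaving its original slot. */
    static boolean growsInPlace(int oldBytes, int newBytes) {
        return newBytes <= allocationFor(oldBytes);
    }

    public static void main(String[] args) {
        System.out.println(allocationFor(100));     // slot size for a 100-byte document
        System.out.println(growsInPlace(100, 120)); // fits in the same slot
        System.out.println(growsInPlace(100, 130)); // would need a larger slot
    }
}
```

Under this scheme a 100-byte document gets a 128-byte slot, so it can grow to 120 bytes where it sits, and a slot freed by a moved document is a standard size that other documents can fill.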
Setting this option for a specific collection makes it less likely that documents in that collection need to be moved when they grow in size, less likely that a document will need to be moved more than once in its lifetime, and more likely that space left by moving documents can be reused by new or other moved documents. Thanks for reading! We hope the above strategies help guide you in evaluating options for reusing empty space in your MongoDB.
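
To make the powers-of-2 idea from the usePowerOf2Sizes tip concrete, here is a minimal Java sketch of rounding a record size up to the next power of 2. It is an illustration only and not MongoDB's allocator: the class name, the method name and the 32-byte minimum slot are assumptions made for the example.

```java
final class PowerOf2Sizes {
    // Hypothetical smallest slot, chosen for this example only.
    static final int MIN_SLOT_BYTES = 32;
    // Largest power of 2 that fits in a positive int.
    static final int MAX_SLOT_BYTES = 1 << 30;

    /** Rounds a document size up to the next power-of-2 slot size. */
    static int slotFor(int documentBytes) {
        if (documentBytes <= 0 || documentBytes > MAX_SLOT_BYTES) {
            throw new IllegalArgumentException("unsupported size: " + documentBytes);
        }
        int slot = MIN_SLOT_BYTES;
        while (slot < documentBytes) {
            slot <<= 1; // 32, 64, 128, ...
        }
        return slot;
    }

    public static void main(String[] args) {
        System.out.println(slotFor(100)); // a 100-byte document gets a 128-byte slot
        System.out.println(slotFor(120)); // growing to 120 bytes still fits, so no move
        System.out.println(slotFor(130)); // growing past 128 bytes needs a 256-byte slot
    }
}
```

Because every slot has one of a small set of sizes, a slot freed by a moved or deleted document is more likely to fit a later document exactly, which is the reuse effect described above.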

February 6, 2014

by Chris Chang

· 23,622 Views

Java: Handling a RuntimeException in a Runnable

At the end of last year I was playing around with running scheduled tasks to monitor a Neo4j cluster and one of the problems I ran into was that the monitoring would sometimes exit. I eventually realised that this was because a RuntimeException was being thrown inside the Runnable method and I wasn’t handling it. The following code demonstrates the problem:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class RunnableBlog {
        public static void main(String[] args) throws ExecutionException, InterruptedException {
            ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
            executor.scheduleAtFixedRate(new Runnable() {
                @Override
                public void run() {
                    System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis());
                    throw new RuntimeException("game over");
                }
            }, 0, 1000, TimeUnit.MILLISECONDS).get();

            System.out.println("exit");
            executor.shutdown();
        }
    }

If we run that code we’ll see the RuntimeException, wrapped in an ExecutionException thrown by get(), but the program won’t exit: main fails before it reaches executor.shutdown(), so the executor’s thread stays alive, and the executor suppresses all further runs of the task:

    Exception in thread "main" pool-1-thread-1 -> 1391212558074
    java.util.concurrent.ExecutionException: java.lang.RuntimeException: game over
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252)
        at java.util.concurrent.FutureTask.get(FutureTask.java:111)
        at RunnableBlog.main(RunnableBlog.java:11)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)
    Caused by: java.lang.RuntimeException: game over
        at RunnableBlog$1.run(RunnableBlog.java:16)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

At the time I ended up adding a try catch block and printing the exception like so:

    public class RunnableBlog {
        public static void main(String[] args) throws ExecutionException, InterruptedException {
            ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
            executor.scheduleAtFixedRate(new Runnable() {
                @Override
                public void run() {
                    try {
                        System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis());
                        throw new RuntimeException("game over");
                    } catch (RuntimeException e) {
                        e.printStackTrace();
                    }
                }
            }, 0, 1000, TimeUnit.MILLISECONDS).get();

            System.out.println("exit");
            executor.shutdown();
        }
    }

This allows the exception to be recognised and, as far as I can tell, means that the scheduled task keeps running:

    java.lang.RuntimeException: game over
    pool-1-thread-1 -> 1391212651955
        at RunnableBlog$1.run(RunnableBlog.java:16)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
    pool-1-thread-1 -> 1391212652956
    java.lang.RuntimeException: game over
        at RunnableBlog$1.run(RunnableBlog.java:16)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
    pool-1-thread-1 -> 1391212653955
    java.lang.RuntimeException: game over
        at RunnableBlog$1.run(RunnableBlog.java:16)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

This worked well and allowed me to keep monitoring the cluster. However, I recently started reading ‘Java Concurrency in Practice’ (only 6 years after I bought it!) and realised that this might not be the proper way of handling the RuntimeException. Instead, the exception can be handed to the thread’s uncaught exception handler:

    public class RunnableBlog {
        public static void main(String[] args) throws ExecutionException, InterruptedException {
            ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
            executor.scheduleAtFixedRate(new Runnable() {
                @Override
                public void run() {
                    try {
                        System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis());
                        throw new RuntimeException("game over");
                    } catch (RuntimeException e) {
                        Thread t = Thread.currentThread();
                        t.getUncaughtExceptionHandler().uncaughtException(t, e);
                    }
                }
            }, 0, 1000, TimeUnit.MILLISECONDS).get();

            System.out.println("exit");
            executor.shutdown();
        }
    }

I don’t see much difference between the two approaches so it’d be great if someone could explain to me why this approach is better than my previous one of catching the exception and printing the stack trace.
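
The behaviour discussed above can be checked with a small sketch. It is not the monitoring code from this post: the class and method names are made up for the example, the period is shortened to 10 ms, and it uses lambdas (Java 8+) where the post uses an anonymous Runnable. It counts how often a task runs when the exception escapes and when it is caught, and it shuts the executor down in a finally block so that the JVM can exit either way.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ScheduledFailureDemo {

    /** The task throws on every run; returns how many times it actually ran. */
    static int runsWhenExceptionEscapes() {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        try {
            ScheduledFuture<?> future = executor.scheduleAtFixedRate(() -> {
                runs.incrementAndGet();
                throw new RuntimeException("game over");
            }, 0, 10, TimeUnit.MILLISECONDS);
            try {
                future.get(); // blocks until the task fails
            } catch (ExecutionException e) {
                // The RuntimeException arrives here as the cause.
            }
            Thread.sleep(100); // leave time for further runs, which never happen
            return runs.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow(); // without this the pool thread keeps the JVM alive
        }
    }

    /** The task catches its own exception; returns the run count once `wanted` runs completed. */
    static int runsWhenExceptionIsCaught(int wanted) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(wanted);
        try {
            executor.scheduleAtFixedRate(() -> {
                try {
                    runs.incrementAndGet();
                    throw new RuntimeException("game over");
                } catch (RuntimeException e) {
                    // Handled inside the task, so the executor never sees a failure.
                } finally {
                    done.countDown();
                }
            }, 0, 10, TimeUnit.MILLISECONDS);
            done.await(5, TimeUnit.SECONDS);
            return runs.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("escaping exception: " + runsWhenExceptionEscapes() + " run(s)");
        System.out.println("caught exception:   " + runsWhenExceptionIsCaught(3) + " run(s)");
    }
}
```

The first count stays at 1 because scheduleAtFixedRate is documented to suppress subsequent executions once an execution throws; in both variants the worker thread itself survives.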

February 6, 2014

by Mark Needham

· 19,623 Views

Big Data Search, Part 6: Sorting Randomness

As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4 GB of random numbers, let us find how many times we have the number 1. Because I wanted to ensure a consistent answer, I wrote:

    public static IEnumerable<uint> RandomNumbers()
    {
        const long count = 1024 * 1024 * 1024L * 1;
        var random = new MyRand();
        for (long i = 0; i < count; i++)
        {
            if (i % 1024 == 0)
            {
                yield return 1;
                continue;
            }
            var result = random.NextUInt();
            while (result == 1)
            {
                result = random.NextUInt();
            }
            yield return result;
        }
    }

    /// <summary>
    /// Based on Marsaglia, George. (2003). Xorshift RNGs.
    /// http://www.jstatsoft.org/v08/i14/paper
    /// </summary>
    public class MyRand
    {
        const uint Y = 842502087, Z = 3579807591, W = 273326509;
        uint _x, _y, _z, _w;

        public MyRand()
        {
            _y = Y;
            _z = Z;
            _w = W;
            _x = 1337;
        }

        public uint NextUInt()
        {
            uint t = _x ^ (_x << 11);
            _x = _y;
            _y = _z;
            _z = _w;
            return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
        }
    }

I am using a custom Rand function because it is significantly faster than System.Random. This generates 4 GB of random numbers, and also ensures that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine.

For fun, I ran the external sort routine in 32-bit mode, with a buffer of 256 MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 MB in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random numbers, sorting and then compressing the values resulted in about a 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate. And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for a real world scenario it would be better, but that is something that we’ll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn’t that fast. In this scenario, we are sorting arrays of about 26 million items. And that takes time. Implementing parallel sort cut this down to less than a minute per batch of 26 million.

That let us complete the entire process, but then it stalls at the merge. The reason for that is that we push all the values into a heap, and there are 1 billion of them. Now, the heap never exceeds 40 items, but those are still 1 billion * O(log 40) or about 5.4 billion comparisons that we have to do, and we do this sequentially, which takes time.

I tried thinking about ways to parallelize this, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them. Obviously we can merge each set of 10 files in parallel, then merge the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this. For what it is worth, you can find the code for this here.
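
For readers who want to see the shape of the heap-based merge described above, here is a minimal Java sketch. It is not the code linked from this post (which is C#): in-memory arrays stand in for the 40 sorted files, and the class and method names are invented for the example. The heap holds one cursor per run, so it never grows beyond the number of runs, and every output value costs one O(log k) heap operation.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMerge {

    /** Read position inside one sorted run (a stand-in for a sorted file). */
    private static final class Cursor {
        final long[] run;
        int pos;

        Cursor(long[] run) {
            this.run = run;
        }

        long head() {
            return run[pos];
        }

        /** Moves to the next value; returns false once the run is exhausted. */
        boolean advance() {
            return ++pos < run.length;
        }
    }

    /** Merges already-sorted runs into one sorted array. */
    static long[] merge(List<long[]> sortedRuns) {
        Comparator<Cursor> byHead = Comparator.comparingLong(Cursor::head);
        PriorityQueue<Cursor> heap = new PriorityQueue<>(byHead);
        int total = 0;
        for (long[] run : sortedRuns) {
            total += run.length;
            if (run.length > 0) {
                heap.add(new Cursor(run));
            }
        }
        long[] merged = new long[total];
        int out = 0;
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();   // run with the smallest current value
            merged[out++] = smallest.head();
            if (smallest.advance()) {
                heap.add(smallest);          // re-insert with its next value
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        long[] merged = merge(Arrays.asList(
                new long[] {1, 4, 7},
                new long[] {2, 5},
                new long[] {3, 6, 8, 9}));
        System.out.println(Arrays.toString(merged)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

The loop is inherently sequential, which matches the bottleneck described above; the sketch does not attempt to answer the open question of how to parallelize it.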

February 5, 2014

by Oren Eini

· 9,054 Views

AES-256 Encryption with Java and JCEKS

Security has become a major topic of discussion in the last few years due to the release of documents by Edward Snowden and the explosion of hacking against online commerce stores like JC Penney, Sony and Target. While this post will not give you all of the tools to help prevent the use of illegally sourced data, it will provide a starting point for building a set of tools and tactics that will help prevent the use of data by other parties.

This post will show how to adopt AES encryption for strings in a Java environment. It will talk about creating AES keys and storing AES keys in a JCEKS keystore format. A working example of the code in this blog is located at https://github.com/mike-ensor/aes-256-encryption-utility

It is recommended to read each section in order because each section builds off of the previous section; however, you might want to jump straight to a particular section.

- Setup: Set up and create keys with keytool
- Encrypt: Encrypt messages using byte[] keys
- Decrypt: Decrypt messages using the same IV and key from encryption
- Obtain Keys from Keystore: Obtain keys from keystore via an alias

What is JCEKS?

JCEKS stands for Java Cryptography Extension KeyStore and it is an alternative keystore format for the Java platform. Storing keys in a KeyStore can be a measure to prevent your encryption keys from being exposed. Java KeyStores securely contain individual certificates and keys that can be referenced by an alias for use in a Java program. Java KeyStores are often created using the "keytool" provided with the Java JDK.

NOTE: It is strongly recommended to create a complex passcode for KeyStores to keep the contents secure. The KeyStore is a file that is considered to be public, but it is advisable to not give easy access to the file.

Setup

All encryption is governed by the laws of each country, which often place restrictions on the strength of the encryption.
One example is that in the United States, all encryption over 128-bit is restricted if the data is traveling outside of the border. By default, the Java JCE implements a strength policy to comply with these rules. If stronger encryption is preferred, and adheres to the laws of the country, then the JCE needs to have access to the stronger encryption policy. Very plainly put, if you are planning on using AES 256-bit encryption, you must install the Unlimited Strength Jurisdiction Policy Files. Without the policies in place, 256-bit encryption is not possible.

Installation of JCE Unlimited Strength Policy

This post is focusing on the keys rather than the installation and setup of the JCE. The installation is rather simple with explicit instructions found here (NOTE: this is for JDK7, if using a different JDK, search for the appropriate JCE policy files).

Keystore Setup

When using the keytool, manipulating a keystore is simple. Keystores must be created with a link to a new key or during an import of an existing keystore. In order to create a new key and keystore simply type:

    keytool -genseckey -keystore aes-keystore.jck -storetype jceks -storepass mystorepass -keyalg AES -keysize 256 -alias jceksaes -keypass mykeypass

Important Flags

In the example above here are the explanations for the keytool's parameters:

Keystore Parameters

- genseckey: Generate SecretKey. This is the flag indicating the creation of a symmetric key which will become our AES key.
- keystore: Location of the keystore. If the keystore does not exist, the tool will create a new store. Paths can be relative or absolute but must be local.
- storetype: The type of store (JKS, PKCS12, JCEKS, etc). JCEKS is used to store symmetric keys (AES) not contained within a certificate.
- storepass: Password related to the keystore.
Highly recommended to create a strong passphrase for the keystore.

Key Parameters

- keyalg: Algorithm used to create the key (AES/DES/etc)
- keysize: Size of the key (128, 192, 256, etc)
- alias: Alias given to the newly created key, used to reference the key later
- keypass: Password protecting the use of the key

Encrypt

As it pertains to data in Java and at the most basic level, encryption is an algorithmic process used to programmatically obfuscate data through a reversible process where both parties have information pertaining to the data and how the algorithm is used. In Java encryption, this involves the use of a Cipher. A Cipher object in the JCE is a generic entry point into the encryption provider typically selected by the algorithm. This example uses the default Java provider but would also work with Bouncy Castle.

Generating a Cipher object

Obtaining an instance of Cipher is rather easy and the same process is required for both encryption and decryption. (NOTE: Encryption and Decryption require the same algorithm but do not require the same object instance)

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");

Once we have an instance of the Cipher, we can encrypt and decrypt data according to the algorithm. Often the algorithm will require additional pieces of information in order to encrypt/decrypt data. In this example, we will need to pass the algorithm the bytes containing the key and an initialization vector (explained below).

Initialization

In order to use the Cipher, we must first initialize the cipher. This step is necessary so we can provide additional information to the algorithm like the AES key and the Initialization Vector (aka IV).

    cipher.init(Cipher.ENCRYPT_MODE, secretKeySpecification, initialVector);

Parameters

The SecretKeySpecification is an object containing a reference to the bytes forming the AES key. The AES key is nothing more than a specific sized byte array (256 bits, or 32 bytes, for AES-256) that is generated by the keytool (see above).

Alternative Parameters

There are multiple methods to create keys, such as a hash including a salt, username and password (or similar). This method would utilize a SHA1 hash of the concatenated strings, convert it to bytes and then truncate the result to the desired size. This post will not show the generation of a key using this method or the use of a PBE key method using a password and salt. The password and/or salt usage for the keys is handled by the keytool using the inputs during the creation of new keys.

Initialization Vector

The AES algorithm also requires a second parameter called the Initialization Vector. The IV is used in the process to randomize the encrypted message and prevent the key from easy guessing. The IV is considered a publicly shared piece of information, but again, it is not recommended to openly share the information (for example, it wouldn't be wise to post it on your company's website). When encrypting a message, it is not uncommon to prepend the message with the IV since the IV will be a set/known size based on the algorithm.

NOTE: the AES algorithm will output the same result if using the same IV, key and message. It is recommended that the IV be randomly created each time an encryption takes place.

With the newly initialized Cipher, encrypting a message is simple. Simply call:

    byte[] encryptedMessageInBytes = cipher.doFinal(message.getBytes("UTF-8"));
    String base64EncodedEncryptedMsg = BaseEncoding.base64().encode(encryptedMessageInBytes);
    String base32EncodedEncryptedMsg = BaseEncoding.base32().encode(encryptedMessageInBytes);

Encoding Results

Byte arrays are difficult to visualize since they often do not form characters in any charset. The best recommendation to solve this is to represent the bytes in HEX (base-16), Base32 or Base64 format. If the message will be passed via a URL or POST parameter, be sure to use a web-safe Base64 encoding. The Google Guava library provides an excellent BaseEncoding utility.

NOTE: Remember to decode the encoded message before decrypting.

Decrypt

Decrypting a message is almost a reverse of the encryption process with a few exceptions. Decryption requires a known initialization vector as a parameter, unlike the encryption process, which generates a random IV.

Decryption

When decrypting, obtain a cipher object with the same process as the encryption method. The Cipher object will need to utilize the exact same algorithm including the method and padding selections. Once the code has obtained a reference to a Cipher object, the next step is to initialize the cipher for decryption and pass in a reference to a key and the initialization vector.

    // key is the same byte[] key used in encryption
    SecretKeySpec secretKeySpecification = new SecretKeySpec(key, "AES");
    cipher.init(Cipher.DECRYPT_MODE, secretKeySpecification, initialVector);

NOTE: The key is stored in the keystore and obtained by the use of an alias. See below for details on obtaining keys from a keystore.

Once the cipher has been provided with the key and IV and initialized for decryption, the cipher is ready to perform the decryption.

    byte[] encryptedTextBytes = BaseEncoding.base64().decode(message);
    byte[] decryptedTextBytes = cipher.doFinal(encryptedTextBytes);
    String origMessage = new String(decryptedTextBytes, "UTF-8");

Strategies to keep IV

The IV used to encrypt the message is needed to decrypt the message, which raises the question of how the two stay together. One solution is to Base Encode (see above) the IV and prepend it to the encrypted and encoded message: Base64UrlSafe(myIv) + delimiter + Base64UrlSafe(encryptedMessage). Other possible solutions might be contextual, such as including an attribute in an XML file with the IV and one for the alias to the key used.

Obtain Key from Keystore

The beginning of this post has shown how easy it is to create new AES-256 keys that reference an alias inside of a keystore database.
The post then continued with how to encrypt and decrypt a message given a key, but has not yet shown how to obtain a reference to the key in a keystore.

Solution

    // for clarity, ignoring exceptions and failures
    InputStream keystoreStream = new FileInputStream(keystoreLocation);
    KeyStore keystore = KeyStore.getInstance("JCEKS");
    keystore.load(keystoreStream, keystorePass.toCharArray());
    if (!keystore.containsAlias(alias)) {
        throw new RuntimeException("Alias for key not found");
    }
    Key key = keystore.getKey(alias, keyPass.toCharArray());

Parameters

- keystoreLocation: String - Location of the local keystore file
- keystorePass: String - Password used when creating or modifying the keystore file with keytool (see above)
- alias: String - Alias used when creating new key with keytool (see above)
- keyPass: String - Password protecting the key, given as -keypass when creating the key with keytool (see above)

Conclusion

This post has shown how to encrypt and decrypt string based messages using the AES-256 encryption algorithm. The keys to encrypt and decrypt these messages are held inside of a JCEKS formatted KeyStore database created using the JDK provided "keytool" utility. The examples in this post should be considered a solid start to encrypting/decrypting with symmetric keys such as AES.

Encryption should not be considered the only line of defense for messages; key rotation, for example, is another. Key rotation is a method to mitigate risks in the event of a data breach. If the data in multiple files has been encrypted with several different keys, an intruder who obtains the data and manages to break a single key exposes only part of it, which lowers the risk of a total exposure.

All of the examples in this blog post have been condensed into a simple tool allowing for the viewing of keys inside of a keystore, an operation that is not supported out of the box by the JDK keytool. Each aspect of the steps and topics outlined in this post is available at: https://github.com/mike-ensor/aes-256-encryption-utility.

NOTE: The examples, sample code and any references are to be used at the implementer's sole risk; there is no implied warranty or liability, and you assume all risks.
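
The pieces described in this post can be strung together in one small round trip. The following is a sketch under stated assumptions and is not the code from the linked repository: it uses java.util.Base64 (Java 8+) instead of Guava's BaseEncoding, generates the key in memory with KeyGenerator instead of keytool, keeps the JCEKS keystore in memory rather than in a file, and assumes a JDK where 256-bit keys are permitted. The class name, method names and the ":" delimiter are invented for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.KeyStore;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

final class AesRoundTrip {
    private static final String TRANSFORMATION = "AES/CBC/PKCS5Padding";
    private static final int IV_BYTES = 16;       // AES block size
    private static final String DELIMITER = ":"; // not in the URL-safe Base64 alphabet

    /** Generates a random 256-bit AES key in memory (the post uses keytool instead). */
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stores the key in an in-memory JCEKS keystore and reads it back by alias. */
    static Key viaKeystore(SecretKey key, String alias, char[] keyPass) {
        try {
            KeyStore keystore = KeyStore.getInstance("JCEKS");
            keystore.load(null, null); // null stream: start from an empty keystore
            keystore.setKeyEntry(alias, key, keyPass, null);
            return keystore.getKey(alias, keyPass);
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts with a fresh random IV and returns Base64Url(iv) + ":" + Base64Url(ciphertext). */
    static String encrypt(Key key, String message) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] encrypted = cipher.doFinal(message.getBytes(StandardCharsets.UTF_8));
            Base64.Encoder encoder = Base64.getUrlEncoder();
            return encoder.encodeToString(iv) + DELIMITER + encoder.encodeToString(encrypted);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Splits the token, decodes both parts and decrypts with the recovered IV. */
    static String decrypt(Key key, String token) {
        try {
            String[] parts = token.split(DELIMITER, 2);
            Base64.Decoder decoder = Base64.getUrlDecoder();
            Cipher cipher = Cipher.getInstance(TRANSFORMATION);
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(decoder.decode(parts[0])));
            byte[] decrypted = cipher.doFinal(decoder.decode(parts[1]));
            return new String(decrypted, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Key key = viaKeystore(newKey(), "jceksaes", "mykeypass".toCharArray());
        String token = encrypt(key, "attack at dawn");
        System.out.println(decrypt(key, token)); // attack at dawn
    }
}
```

Because the IV is random, encrypting the same message twice yields different tokens, which is the property the NOTE in the Initialization Vector section asks for.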

February 4, 2014

by Mike Ensor

· 102,361 Views · 2 Likes

Export MS Visio Diagram to XML (VDX, VTX, VSX) Formats in C# & VB.NET

This technical tip shows how .NET developers can export a Microsoft Visio diagram to XML inside their own applications using Aspose.Diagram for .NET. Aspose.Diagram for .NET lets you export diagrams to a variety of formats: image formats, HTML, SVG, SWF and these XML formats:

- VDX defines an XML diagram.
- VTX defines an XML template.
- VSX defines an XML stencil.

The Diagram class' constructors read a diagram and the Save method is used to save, or export, a diagram in a different file format. The code snippets in this article show how to use the Save method to save a Visio file to VDX, VTX and VSX.

Exporting VSD to VDX

VDX is a schema-based XML file format that lets you save diagrams in a format that products other than Microsoft Visio can read. It's a useful format for transferring diagrams between software applications and retaining editable data. To export a VSD diagram to VDX, first create an instance of the Diagram class and then call the Diagram class' Save method to write the Visio drawing file to VDX.

Exporting from VSD to VSX

VSX is an XML format for defining stencils, the basic objects from which a diagram is built up. When a Visio file is converted to VSX, only the stencils are exported. To export a VSD diagram to VSX, first create an instance of the Diagram class and then call the Diagram class' Save method to write the Visio drawing file to VSX.

    //The sample code shows how to export VSD to VDX
    //[C# Code Sample]
    //Call the diagram constructor to load diagram from a VSD file
    Diagram diagram = new Diagram("D:\\Drawing1.vsd");
    this.Response.Clear();
    this.Response.ClearHeaders();
    this.Response.ContentType = "application/vnd.ms-visio";
    this.Response.AppendHeader("Content-Disposition", "attachment; filename=Diagram.vdx");
    this.Response.Flush();
    System.IO.Stream vdxStream = this.Response.OutputStream;
    //Save input VSD as VDX
    diagram.Save(vdxStream, SaveFileFormat.VDX);
    this.Response.End();

    '[VB.NET Code Sample]
    'Call the diagram constructor to load diagram from a VSD file
    Dim diagram As New Diagram("D:\Drawing1.vsd")
    Me.Response.Clear()
    Me.Response.ClearHeaders()
    Me.Response.ContentType = "application/vnd.ms-visio"
    Me.Response.AppendHeader("Content-Disposition", "attachment; filename=Diagram.vdx")
    Me.Response.Flush()
    Dim vdxStream As System.IO.Stream = Me.Response.OutputStream
    'Save input VSD as VDX
    diagram.Save(vdxStream, SaveFileFormat.VDX)
    Me.Response.End()

    //The sample code shows how to export VSD to VSX format.
    //[C# Code Sample]
    //Call the diagram constructor to load diagram from a VSD file
    Diagram diagram = new Diagram("D:\\Drawing1.vsd");
    this.Response.Clear();
    this.Response.ClearHeaders();
    this.Response.ContentType = "application/vnd.ms-visio";
    this.Response.AppendHeader("Content-Disposition", "attachment; filename=Diagram.vsx");
    this.Response.Flush();
    System.IO.Stream vsxStream = this.Response.OutputStream;
    //Save input VSD as VSX
    diagram.Save(vsxStream, SaveFileFormat.VSX);
    this.Response.End();

    '[VB.NET Code Sample]
    'Call the diagram constructor to load diagram from a VSD file
    Dim diagram As New Diagram("D:\Drawing1.vsd")
    Me.Response.Clear()
    Me.Response.ClearHeaders()
    Me.Response.ContentType = "application/vnd.ms-visio"
    Me.Response.AppendHeader("Content-Disposition", "attachment; filename=Diagram.vsx")
    Me.Response.Flush()
    Dim vsxStream As System.IO.Stream = Me.Response.OutputStream
    'Save input VSD as VSX
    diagram.Save(vsxStream, SaveFileFormat.VSX)
    Me.Response.End()

January 29, 2014

by David Zondray

· 14,445 Views

Geek, dork, nerd and dweeb — the difference in a Venn diagram

Working in and around Silicon Valley and technology, I hear people throwing around the terms “geek”, “dork”, “nerd” and “dweeb” constantly. They’re thrown around interchangeably, in fact, which is where the problem lies. They’re not the same and knowing that matters a great deal. In fact, using the wrong term gives credit where credit isn’t due or unfairly labels someone’s better qualities.

Venn diagram

Finding a Venn diagram to explain the difference was a big moment. I suddenly see how I should interview for different roles and what to look for in partners and employees. I know what to seek and what to avoid. It was an epiphany. Starting from that point, I could see career paths for each and every one (OK, except one). It breaks out like this.

Geek, Nerd, Dweeb and Dork in order of value to the organization:

- Geek: Both smart and driven but able to talk about fun things. They’re your leaders and sales people.
- Nerd: Centered in smarts and drive, tempered by some awkwardness. They sustain your company.
- Dweeb: Smart and awkward but probably uncommitted. They won’t stay up all night to solve a problem.
- Dork: Least fun of the bunch. Avoid these people because they waste your time and sap your will to live.

These may not be everyone’s definitions, but maybe it’s time to standardize our labels lest we use the terms insensitively. For an interesting take on these categories, I found this on Democratic Underground.

Geek: Someone who spends a lot of time and energy in a certain special but conventional area, like computer programming or trouble-shooting, but not necessarily computers or technology. You can apparently have chess geeks, guitar geeks, or cooking geeks. A geek is an outwardly normal person who can relate to others in general but who has taken the time to learn specific technical skills and would rather talk about their special obsession than anything else. They are generally not athletic and enjoy sedentary pursuits like video games, comic books, being on the internet, etc. They usually dress to suit their special interest, which can be flamboyant, such as wearing a tee-shirt describing their special obsession or a hat bearing a logo of their special pursuit. Geeks can be self-confident and proud of their traits.

Nerd: Someone with a great interest in academic subjects like math and science and who is socially awkward and has trouble relating to others outside of their fields of academia. Their IQ often exceeds their weight. Science fiction such as The Matrix and Star Wars or LOTR are often their cup of tea, as are hobbies like astronomy or chemistry sets. Nerds usually dress conservatively and are more interested in the mind than their outward appearance, although as both men and women they tend to be tidy, clean-cut, and hygienic. Nerds generally are self-confident in the academic setting and take pride in their intellect and band together with other nerds although their social skills outside of their academic obsession are diminished.

Dork: Someone who has special interests like a geek but whose interests and obsessions are less common and odd, such as having an oddball collection of some sort like old Three Stooges bubblegum cards or an uncommon skill like yodeling. Walking talking Star Trek encyclopedic knowledge and convention dress up obsessions can be considered dorky. They can act silly at times and not care what anyone thinks. Dorks are typically more noted for their quirky personality and tend to be loners. Hygiene can sometimes be an issue. Dorks can nonetheless be self-confident and proud of the way they are because they simply don’t care what others think.

Dweeb: A person who tends to be regarded as physically wimpish, intellectually challenged, and socially awkward, with little self-confidence. Dweebs tend to be obsessed with unusual pursuits like dorks (tap dancing or ant farms) but are lacking in skill, knowledge, or ability. Dweebs tend to be loners like dorks but understand their shortcomings and lack pride. Hygiene can also be an issue.

I’m not a dork, nor dweeb and I don’t see myself as a nerd… I’m clearly a geek :-). Thank you to the Great White Snark, where I found the Venn diagram.

January 27, 2014

by Christopher Taylor

· 21,403 Views

Big Data Search, Part 5: Sorting Optimizations

I mentioned several times that the entire point of the exercise was to just see how this works, not to actually do anything production worthy. But it is interesting to see how we could do better here. In no particular order, I think that there are at least several things that we could do to significantly improve the time it takes to sort.

Right now we defined 2 indexes on top of a 1 GB file, and it took under 1 minute to complete. That gives us a runtime of about 10 days over a 15 TB file.

Well, one of the reasons for this performance is that we execute this in a serial fashion, that is, one after another. But we have two completely isolated indexes, so there is no reason why we can’t parallelize the work between them.

For that matter, we are buffering in memory up to a certain point, then we sort, then we buffer some more, etc. That is pretty inefficient. We can push the actual sorting to a different thread, and continue parsing and adding to a buffer while the previous buffer is being sorted.

We wrote to intermediary files, but we wrote to those using plain file I/O. But it is usually a lot more costly to write to disk than to compress and then write to disk. We are writing sorted data, so it is probably going to compress pretty well.

Those are the things that pop to mind. Can you think of additional options?
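
The idea of sorting on another thread while the parser keeps filling a buffer might look roughly like the Java sketch below. It is only an illustration and not the code from this series (which is C#): copying a slice of an input array stands in for parsing, the class and method names are invented, and the queue of pending buffers is unbounded, which a real implementation working against a memory limit would need to cap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BackgroundSort {

    /** Splits the input into buffers and sorts each one on a separate thread. */
    static List<long[]> sortedRuns(long[] input, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        ExecutorService sorter = Executors.newSingleThreadExecutor();
        try {
            List<Future<long[]>> pending = new ArrayList<>();
            for (int from = 0; from < input.length; from += bufferSize) {
                int to = Math.min(input.length, from + bufferSize);
                // "Parsing": fill a fresh buffer on the calling thread.
                long[] buffer = Arrays.copyOfRange(input, from, to);
                // Hand the full buffer to the sorter thread and move on to the next one.
                pending.add(sorter.submit(() -> {
                    Arrays.sort(buffer);
                    return buffer;
                }));
            }
            List<long[]> runs = new ArrayList<>();
            for (Future<long[]> run : pending) {
                runs.add(run.get()); // wait for each sorted run, in submission order
            }
            return runs;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            sorter.shutdown();
        }
    }

    public static void main(String[] args) {
        for (long[] run : sortedRuns(new long[] {5, 3, 9, 1, 8, 2, 7}, 3)) {
            System.out.println(Arrays.toString(run));
        }
        // prints [3, 5, 9], then [1, 2, 8], then [7]
    }
}
```

Each sorted run would then be written to an intermediary file, ready for the merge step.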

January 27, 2014

by Oren Eini

· 7,865 Views

Spring and Caching JMS Connections

As a follow-up to previous posts covering JMS, this post delves deeper into Spring's CachingConnectionFactory. Spring provides two implementations of the javax.jms.ConnectionFactory interface, namely the SingleConnectionFactory and the CachingConnectionFactory. The SingleConnectionFactory returns, as you might expect, the same single connection on every call to the createConnection() method. This is fine for certain scenarios and applications, but the CachingConnectionFactory provides a more performant and scalable solution.

By default, a single session is cached, so for a multi-threaded application you would set the sessionCacheSize to a more suitable number. This number doesn't reflect the true number of sessions cached, however, because it refers to the size of the cache per session acknowledgement type, e.g. AUTO_ACKNOWLEDGE, CLIENT_ACKNOWLEDGE, DUPS_OK_ACKNOWLEDGE and SESSION_TRANSACTED.

By default, the CachingConnectionFactory will cache the Message Producers and Message Consumers for every session. As an aside, the Message Consumers are cached using keys which include the JMS selector, so the more fine-grained the message filter, the more Message Consumers there will be, and Message Consumers aren't closed until the session is closed and removed from the pool. An alternative is to use a Listener Container for consuming messages.

Also to be noted is that on creating a CachingConnectionFactory instance, the reconnect on exception flag is set to true. This should mean that the onException method on the default ExceptionListener class gets called, which will reset the connections. You can also override the default exception listener with your own implementation. The below snippet of XML shows a simple configuration of a CachingConnectionFactory:

January 27, 2014

by Geraint Jones

· 52,621 Views · 1 Like

Using Database Views in Grails

This post is a quick explanation of how to use database views in Grails. For an introduction I tried to summarize what database views are. However, I noticed I cannot describe it better than it is already done on Wikipedia. Therefore I will just quote the Wikipedia summary of View (SQL) here: In database theory, a view is the result set of a stored query on the data, which the database users can query just as they would in a persistent database collection object. This pre-established query command is kept in the database dictionary. Unlike ordinary base tables in a relational database, a view does not form part of the physical schema: as a result set, it is a virtual table computed or collated from data in the database, dynamically when access to that view is requested. Changes applied to the data in a relevant underlying table are reflected in the data shown in subsequent invocations of the view. (Wikipedia) Example Let's assume we have a Grails application with the following domain classes: class User { String name Address address ... } class Address { String country ... } For whatever reason we want a domain class that contains direct references to the name and the country of a user. However, we do not want to duplicate these two values in another database table. A view can help us here. Creating the view At this point I assume you are already using the Grails database-migration plugin. If you don't, you should definitely check it out. The plugin is automatically included with newer Grails versions and provides a convenient way to manage databases using change sets. To create a view we just have to create a new change set: changeSet(author: '..', id: '..') { createView(""" SELECT u.id, u.name, a.country FROM user u JOIN address a on u.address_id = a.id """, viewName: 'user_with_country') } Here we create a view named user_with_country which contains three values: user id, user name and country.
Creating the domain class Like normal tables, views can be mapped to domain classes. The domain class for our view looks very simple: class UserWithCountry { String name String country static mapping = { table 'user_with_country' version false } } Note that we disable versioning by setting version to false (we don't have a version column in our view). At this point we just have to be sure that our database change set is executed before Hibernate tries to create/update tables on application start. This is typically done by disabling Hibernate's table creation in DataSource.groovy and enabling the automatic migration on application start by setting grails.plugin.databasemigration.updateOnStart to true. Alternatively this can be achieved by manually executing all new changesets by running the dbm-update command. Usage Now we can use our UserWithCountry class to access the view: Address johnsAddress = new Address(country: 'england') User john = new User(name: 'john', address: johnsAddress) john.save(failOnError: true) assert UserWithCountry.count() == 1 UserWithCountry johnFromEngland = UserWithCountry.get(john.id) assert johnFromEngland.name == 'john' assert johnFromEngland.country == 'england' Advantages of views I know the example I am using here is not the best. The relationship between User and Address is already very simple and a view isn't required here. However, if you have more sophisticated data structures, views can be a nice way to hide complex relationships that would require joining a lot of tables. Views can also be used as a security measure if you don't want to expose all columns of your tables to the application.

January 25, 2014

by Michael Scharhag

· 16,650 Views

Big Data Search, Part 4: The Index Format is Horrible

I have completed my own exercise, and while I wanted to try it with the “few allocations” rule, it is interesting to see just how far out there the code is. This isn’t something that you can really use for anything except as a basis to see how badly you are doing. Let us start with the index format. It is just a CSV file with the value and the position in the original file. That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance. Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced a roughly 80MB index. Assuming the same is true for the larger file, that gives us a 1.2 TB file. In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are on the order of 10ms or so. So the cost of actually searching the index is close to a quarter of a second. Now, to be fair, there are going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with here. Of course, the fun thing about this is that even with a 1.2 TB file, we are still talking about less than 40 seeks (the beauty of O(logN) in action), but that is still pretty expensive. Even worse, this is what happens when we are running a single query at a time. What do you think will happen if we are actually running this with multiple threads generating queries? Now we will have a lot of seeks (effectively random) that would generate a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDD disks. Sure, we get multiple spindles, so we get better performance overall, but still… Obviously, there are multiple solutions for this issue.
B+Trees solve the problem by packing multiple keys into a single page, so instead of doing O(log₂N) seeks, you are usually doing O(log₃₆N) or O(log₁₀₀N). Considering those fan-outs, we will have 6 – 8 seeks to do to get to our data. Much better than the 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the trees are likely to reside in memory (and probably in L1, if we are speaking about that). However, given that we are storing sorted strings here, one must give some attention to Sorted Strings Tables. The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important factor is that it also contains the positions of (by default) every 16th entry in the file. This means that in our 15 TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2 TB). Note that the actual numbers are probably different. Because SSTs (note that when talking about SST I am talking specifically about the leveldb implementation) utilize many forms of compression, it is actually likely that the file size will be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here. However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is relying on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking.
Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a binary tree is a better friend here. This leads to an interesting question. SST is probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format. That makes it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably be several orders of magnitude more complex if it had to try to do the same compression tricks that SSTs do. Another factor that is probably as important is that, as I understand it, SSTs are usually used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb. It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.
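
The seek counts quoted in this post can be checked with a short calculation. The sketch below is not the author's code. It assumes a simplified model in which every level of the search costs exactly one disk seek of about 10ms and nothing is cached, and the class and method names are made up for illustration.

```java
public class SeekEstimate {

    // Assumed cost of one disk seek, taken from the "10ms or so" figure in the text.
    static final int SEEK_MILLIS = 10;

    // Number of steps needed to narrow `entries` candidates down to one when each
    // step divides the search space by `fanOut` (2 = plain binary search).
    // This is the smallest k such that fanOut^k >= entries.
    static int seeks(long entries, int fanOut) {
        int levels = 0;
        long remaining = entries;
        while (remaining > 1) {
            remaining = (remaining + fanOut - 1) / fanOut; // ceiling division
            levels++;
        }
        return levels;
    }

    public static void main(String[] args) {
        long smallIndex = 4_200_000L;       // rows in the 1GB test file
        long largeIndex = 64_500_000_000L;  // entries estimated for the 15TB file

        int[] fanOuts = {2, 36, 100};
        System.out.println("small index, binary search: " + seeks(smallIndex, 2) + " seeks");
        for (int fanOut : fanOuts) {
            int n = seeks(largeIndex, fanOut);
            System.out.println("large index, fan-out " + fanOut + ": " + n
                    + " seeks, about " + (n * SEEK_MILLIS) + " ms uncached");
        }
    }
}
```

Under this model the 4.2 million row index needs 23 steps, close to the 24 seeks reported above. Plain binary search over 64.5 billion entries needs 36, and fan-outs of 36 and 100 need 7 and 6 respectively, which is in line with the "less than 40" and "6 – 8" figures in the text.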

January 24, 2014

by Oren Eini

· 12,103 Views

Running Multiple ActiveMQ Instances on One Machine

A few weeks ago I started making use of Apache ActiveMQ again as the JMS provider with my Mule ESB solution. Since it had been a few years since I last used ActiveMQ, I thought it would be nice to check out some of the (new) features like the failover transport and other clustering features. To be able to test these last things I needed multiple installations of ActiveMQ on my machine. Luckily this isn’t very hard to accomplish, although the documentation on this on the ActiveMQ site is quite minimal. The first step is to download and unzip the ActiveMQ package, which I did at ~/develop/apache-activemq-5.8.0. To create the instances I go to the ActiveMQ home directory and use the ‘create’ command like this: cd develop/apache-activemq-5.8.0/ ./bin/activemq create instancea ./bin/activemq create instanceb Now if you do a ‘ls -l’ you will see that there are two subdirectories created, ‘instancea’ and ‘instanceb’. Since both instances will make use of the default ports, we have to modify the config for the second instance. Go to the directory ‘develop/apache-activemq-5.8.0/instanceb/conf’ and open the file ‘jetty.xml’ to make the web console available at port ‘8162’ by modifying the following line: Next open the file ‘activemq.xml’ in the same directory and modify the following part: That’s it! Make sure both files are saved and start the first instance with: cd ~/develop/apache-activemq-5.8.0/instancea/bin ./instancea console Open up a new console and run the commands: cd /users/pascal/develop/apache-activemq-5.8.0/instanceb/bin ./instanceb console Now you have two instances running next to each other and can start testing the ‘advanced’ functions of ActiveMQ.

January 24, 2014

by Anonymous

· 17,805 Views · 1 Like

How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1

Learn how to set up a four-node Hadoop cluster using AWS EC2, PuTTY(gen), and WinSCP.

January 23, 2014

by Hardik Pandya

· 135,928 Views · 3 Likes