Data Resources

Reading data from Google spreadsheet using JAVA

System Requirements: Eclipse Kepler Service Release 2 JDK 1.5 or above Installed Google App Engine SDK on eclipse – this is required for second version of this example. Create a google spreadsheet – login to your google account and create a new spreadsheet, if you want to read existing one then put that url in SPREADSHEET_URL. Once you create a new spreadsheet our url will be like – https://docs.google.com/spreadsheets/d/1L8xtAJfOObsXL-XemliUV10wkDHQNxjn6jKS4XwzYZ8/ but don’t put this url in SPREADSHEET_URL, Use the below url simply change the bold part. public static final String SPREADSHEET_URL = “https://spreadsheets.google.com/feeds/spreadsheets/1L8xtAJfOObsXL-XemliUV10wkDHQNxjn6jKS4XwzYZ8“; // Fill in google spreadsheet URI package org.gopaldas.readsps; import java.io.IOException; import java.net.URL; import com.google.gdata.client.spreadsheet.SpreadsheetService; import com.google.gdata.data.spreadsheet.ListEntry; import com.google.gdata.data.spreadsheet.ListFeed; import com.google.gdata.data.spreadsheet.SpreadsheetEntry; import com.google.gdata.data.spreadsheet.WorksheetEntry; import com.google.gdata.util.ServiceException; public class ReadSpreadsheet { public static final String GOOGLE_ACCOUNT_USERNAME = "[email protected]"; // Fill in google account username public static final String GOOGLE_ACCOUNT_PASSWORD = "xxxx"; // Fill in google account password public static final String SPREADSHEET_URL = "https://spreadsheets.google.com/feeds/spreadsheets/1L8xtAJfOObsXL-XemliUV10wkDHQNxjn6jKS4XwzYZ8"; //Fill in google spreadsheet URI public static void main(String[] args) throws IOException, ServiceException { /** Our view of Google Spreadsheets as an authenticated Google user. */ SpreadsheetService service = new SpreadsheetService("Print Google Spreadsheet Demo"); // Login and prompt the user to pick a sheet to use. service.setUserCredentials(GOOGLE_ACCOUNT_USERNAME, GOOGLE_ACCOUNT_PASSWORD); // Load sheet URL metafeedUrl = new URL(SPREADSHEET_URL); SpreadsheetEntry spreadsheet = service.getEntry(metafeedUrl, SpreadsheetEntry.class); URL listFeedUrl = ((WorksheetEntry) spreadsheet.getWorksheets().get(0)).getListFeedUrl(); // Print entries ListFeed feed = (ListFeed) service.getFeed(listFeedUrl, ListFeed.class); for(ListEntry entry : feed.getEntries()) { System.out.println("new row"); for(String tag : entry.getCustomElements().getTags()) { System.out.println(" "+tag + ": " + entry.getCustomElements().getValue(tag)); } } } }

April 29, 2014

by Gopal Das

· 39,933 Views

The 7 Log Management Tools Java Developers Should Know

splunk vs. sumo logic vs. logstash vs. graylog vs. loggly vs. papertrails vs. splunk>storm splunk, sumo logic, logstash, graylog, loggly, papertrails - did i miss someone? i’m pretty sure i did. logs are like fossil fuels - we’ve been wanting to get rid of them for the past 20 years, but we’re not quite there yet. well, if that's the case i want a bmw! to deal with the growth of log data a host of log management & analysis tools have been built over the last few years to help developers and operations make sense of the growing data. i thought it’d be interesting to look at our options and what are each tools’ selling point, from a developer’s standpoint . splunk as the biggest tool in this space, i decided to put splunk in a category of its own. that’s not to say it’s the best tool for what you need, but more to give credit to a product who essentially created a new category. pros splunk is probably the most feature rich solution in the space. it’s got hundreds of apps (i counted 537 ) to make sense of almost every format of log data, from security to business analytics to infrastructure monitoring. splunk’s search and charting tools are feature rich to the point that there’s probably no set of data you can’t get to through its ui or apis. cons splunk has two major cons. the first, that is more subjective, is that it’s an on-premise solution which means that setup costs in terms of money and complexity are high. to deploy in a high-scale environment you will need to install and configure a dedicated cluster. as a developer, it’s usually something you can't or don’t want to do as your first choice. splunk’s second con is that it’s expensive. to support a real-world application you’re looking at tens of thousands of dollars, which most likely means you’ll need sign offs from high-ups in your organization, and the process is going to be slow. if you’ve got a new app and you want something fast that you can quickly spin up and ramp as things progress - keep reading. some more enterprise log analyzers can be found here . saas log analyzers sumo logic sumo was founded as a saas version of splunk, going so far as to imitate some of splunk’s features and visuals early on. having said that, sl has developed to a full fledged enterprise class log management solution. pros sl is chock-full of features to reduce, search and chart mass amounts of data. out of all the saas log analyzers, it’s probably the most feature rich. also, being a saas offering it inherently means setup and ongoing operation are easier. one of sumo logic’s main points of attraction is the ability to establish baselines and to actively notify you when key metrics change after an event such as a new version rollout or a breach attempt. cons this one is shared across all saas log analyzers, which is you need to get the data to the service to actually do something with it. this means that you’ll be looking at possible gbs (or more) uploaded from your servers. this can create issues on multiple fronts - as a developer, if you're logging sensitive or pii you need to make sure it’s redacted. there may be a lag between the time data is logged and the time it’s visible to to the service. there’s additional overhead on your machines transmitting gbs of data, which really depends on your logging throughput. sumo’s pricing is also not transparent , which means you might be looking at a buying process which is more complex than swiping your team’s credit card to get going. loggly loggly is also a robust log analyzer, focusing on simplicity and ease of use for a devops audience. pros whereas sumo logic has a strong enterprise and security focus, loggly is geared more towards helping devops find and fix operational problems. this makes it very developer-friendly. things like creating custom performance and devops dashboards are super-easy to do. pricing is also transparent, which makes start of use easier. cons don't expect loggly to scale into a full blown infrastructure, security or analytics solution. if you need forensics or infrastructure monitoring you’re in the wrong place. this is a tools mainly for devops to parse data coming from your app servers. anything beyond that you’ll have to build yourself. papertrails papertrails is a simple way to look and search through logs from multiple machines, in one consolidated easy-to-use interface. think of it like tailing your log in the cloud, and you won't be too far off. pros pt is what it is. a simple way to look at log files from multiple machines in a singular view in the cloud. the ux itself is very similar to looking at a log on your machine, and so are the search commands. it aims to do something simple and useful, and does it elegantly. it’s also very affordable . cons pt is mostly text based. looking for any advanced integrations, predictive or reporting capabilities? you're barking up the wrong tree. splunk>storm this is splunk’s little (some may say step) saas brother. it’s a pretty similar offering that’s hosted on splunk’s servers. pros storm lets you experiment with splunk without having to install the actual software on-premise, and contains much of the features available in the full version. cons this isn't really a commercial offering, and you're limited in the amount of data you can send. it seems to be more of an online limited version of splunk meant to help people test out the product without having to deploy first. a new service called splunk cloud is aimed at providing a full-blown splunk saas experience. open source analyzers logstash logstash is an open source tool for collecting and managing log files. it’s part of an open-source stack which includes elasticsearch for indexing and searching through data and kibana for charting and visualizing data. together they form a powerful log management solution. pros being an open-source solution means you're inherently getting a lot of a control and a very good price. logstash uses three mature and powerful components, all heavily maintained, to create a very robust and extensible package. for an open-source solution it’s also very easy to install and start using. we use logstash and love it. cons as logstash is essentially a stack, it means you're dealing with three different products. that means that extensibility also becomes complex. logstash filters are written in ruby, kibana is pure javascript and elasticsearch has its own rest api as well as json templates. when you move to production, you’ll also need to separate the three into different machines, which adds to the complexity. graylog2 a fairly new player in the space, gl2 is an open-source log analyzer backed by mongodb as well as elasticsearch (similar to logstash) for storing and searching through log errors. it’s mainly focused on helping developers detect and fix errors in their apps. also in this category you can find fluentd and kafka whose one of its main use-cases is also storing log data. phew, so many choices! takipi for logs while this post is not about takipi, i thought there’s one feature it has which you might find relevant to all of this. the biggest disadvantage in all log analyzers and log files in general, is that the right data has to be put there by you first. from a dev perspective, it means that if an exception isn’t logged, or the variable data you need to understand why it happened isn't there, no log file or analyzer in the world can help you. production debugging sucks. one of the things we’ve added to takipi is the ability to jump into a recorded debugging session straight from a log file error. this means that for every log error you can see the actual source code and variable values at the moment of error. you can learn more about it here . this is one post where i would love to hear from you guys about your experiences with some of the tools mentioned (and some that i didn’t). i’m sure there are things you would disagree with or would like to correct me on - so go ahead, the comment section is below and i would love to hear from you. originally posted on takipi blog

April 29, 2014

by Chen Harel

· 37,812 Views

HashMap Performance Improvements in Java 8

See how HashMap performance has been improved with new features of Java 8.

April 23, 2014

by Tomasz Nurkiewicz

· 136,796 Views · 60 Likes

Java String length confusion

Facts and Terminology As you probably know, Java uses UTF-16 to represent Strings. In order to understand the confusion about String.length(), you need to be familiar with some Encoding/Unicode terms. Code Point: A unique integer value which represents a character in the code space. Code Unit: A bit sequence used to encode characters (Code Points). One or more Code Units may be required to represent a Code Point. UTF-16 Unicode Code Points are logically divided into 17 planes. The first plane, the Basic Multilingual Plane (BMP) contains the “classic” characters (from U+0000 to U+FFFF). The other planes contain the supplementary characters (from U+10000 to U+10FFFF). Characters (Code Points) from the first plane are encoded in one 16-bit Code Unit with the same value. Supplementary characters (Code Points) are encoded in two Code Units (encoding-specific, see Wiki for the explanation). Example Character: A Unicode Code Point: U+0041 UTF-16 Code Unit(s): 0041 Character: Mathematical double-struck capital A Unicode Code Point: U+1D538 UTF-16 Code Unit(s): D835 DD38 As you can see here, there are characters which are encoded in two Code Units. String.length() Let’s take a look at the Javadoc of the length() method: public int length() Returns the length of this string. The length is equal to the number of Unicode code units in the string. So if you have one supplementary character which consists of two code units, the length of that single character is two. // Mathematical double-struck capital A String str = "\uD835\uDD38"; System.out.println(str); System.out.println(str.length()); //prints 2 Which is correct according to the documentation, but maybe it’s not expected. ~Solution You need to count the code points not the code units: String str = "\uD835\uDD38"; System.out.println(str); System.out.println(str.codePointCount(0, str.length())); See: codePointCount(int beginIndex, int endIndex) References/Sources The Java Language Specification Unicode Glossary: Code Point Wiki: Code Point Unicode Glossary: Code Unit Wiki: Code Unit Wiki: Unicode Wiki: UTF-16 Supplementary Characters in the Java Platform Wiki: Unicode Planes

April 21, 2014

by Jonatan Ivanov

· 18,068 Views · 7 Likes

Be Careful with Java Path.endsWith(String) Usage

If you need to compare the java.io.file.Path object, be aware that Path.endsWith(String) will ONLY match another sub-element of Path object in your original path, not the path name string portion! If you want to match the string name portion, you would need to call the Path.toString() first. For example // Match all jar files. Files.walk(dir).forEach(path -> { if (path.toString().endsWith(".jar")) System.out.println(path); }); With out the "toString()" you will spend many fruitless hours wonder why your program didn't work.

April 19, 2014

by Zemian Deng

· 10,651 Views · 1 Like

How to Convert C# Object Into JSON String with JSON.NET

Before some time I have written a blog post – Converting a C# object into JSON string in that post one of reader, Thomas Levesque commented that mostly people are using JSON.NET a popular high performance JSON for creating for .NET Created by James Newton- King. I agree with him if we are using .NET Framework 4.0 or higher version for earlier version still JavaScriptSerializer is good. So in this post we are going to learn How we can convert C# object into JSON string with JSON.NET framework. What is JSON.NET: JSON.NET is a very high performance framework compared to other serializer for converting C# object into JSON string. It is created by James Newton-Kind. You can find more information about this framework from following link. http://james.newtonking.com/json How to convert C# object into JSON string with JSON.NET framework: For this I am going to use old application that I have used in previous post. Following is a employee class with two properties first name and last name. public class Employee { public string FirstName { get; set; } public string LastName { get; set; } } I have created same object of “Employee” class as I have created in previous post like below. Employee employee=new Employee {FirstName = "Jalpesh", LastName = "Vadgama"}; Now it’s time to add JSON.NET Nuget package. You install Nuget package via following command. I have installed like below. Now we are done with adding NuGet package. Following is code I have written to convert C# object into JSON string. string jsonString = Newtonsoft.Json.JsonConvert.SerializeObject(employee); Console.WriteLine(jsonString); Let's run application and following is a output as expected. That’s it. It’s very easy. Hope you like it. Stay tuned for more.

April 14, 2014

by Jalpesh Vadgama

· 193,323 Views

How to Migrate from MySQL to MongoDB

In the last week I was working on a key project to migrate a BI platform from MySQL to MongoDB. We chose that database due to its support and scalability.

April 14, 2014

by Moshe Kaplan

· 115,916 Views · 6 Likes

Groovy Goodness: Converting Byte Array to Hex String

To convert a byte[] array to a String we can simply use the new String(byte[]) constructor. But if the array contains non-printable bytes we don't get a good representation. In Groovy we can use the method encodeHex() to transform a byte[] array to a hex String value. The byteelements are converted to their hexadecimal equivalents. final byte[] printable = [109, 114, 104, 97, 107, 105] // array with non-printable bytes 6, 27 (ACK, ESC) final byte[] nonprintable = [109, 114, 6, 27, 104, 97, 107, 105] assert new String(printable) == 'mrhaki' assert new String(nonprintable) != 'mr haki' // encodeHex() returns a Writable final Writable printableHex = printable.encodeHex() assert printableHex.toString() == '6d7268616b69' final nonprintableHex = nonprintable.encodeHex().toString() assert nonprintableHex == '6d72061b68616b69' // Convert back assert nonprintableHex.decodeHex() == nonprintable Code written with Groovy 2.2.1

April 6, 2014

by Hubert Klein Ikkink

· 14,498 Views · 5 Likes

Distributed Counters Feature Design

this is another experiment with longer posts. previously, i used the time series example as the bed on which to test some ideas regarding feature design, to explain how we work and in general work out the rough patches along the way. i should probably note that these posts are purely fiction at this point. we have no plans to include a time series feature in ravendb at this time. i am trying to work out some thoughts in the open and get your feedback. at any rate, yesterday we had a request for cassandra style counters at the mailing list. and as long as i am doing feature design series, i thought that i could talk about how i would go about implementing this. again, consider this fiction, i have no plans of implementing this at this time. the essence of what we want is to be able to… count stuff. efficiently, in a distributed manner, with optional support for cross data center replication. very roughly, the idea is to have “sub counters”, unique for every node in the system. whenever you increment the value, we log this to our own sub counter, and then replicate it out. whenever you read it, we just sum all the data we have from all the sub counters. let us outline the various parts of the solution in the same order as the one i used for time series. storage a counter is just a named 64 bits signed integer. a counter name can be any string up to 128 printable characters. the external interface of the storage would look like this: 1: public struct counterincrement 2: { 3: public string name; 4: public long change; 5: } 6: 7: public struct counter 8: { 9: public string name; 10: public string source; 11: public long value; 12: } 13: 14: public interface icounterstorage 15: { 16: void localincrementbatch(counterincrement[] batch); 17: 18: counter[] read(string name); 19: 20: void replicatedupdates(counter[] updates); 21: } as you can see, this gives us very simple interface for the storage. we can either change the data locally (which modify our own storage) or we can get an update from a replica about its changes. there really isn’t much more to it, to be fair. the localincrementbatch() increment a local value, and read() will return all the values for a counter. there is a little bit of trickery involved in how exactly one would store the counter values. for now, i think we’ll store each counter as two step values. we’ll have a tree of multi tree values that will carry each value from each source. that means that a counter will take roughly 4kb or so. this is easy to work with and nicely fit the model voron uses internally. note that we’ll outline additional requirement for storage (searching for counter by prefix, iterating over counters, addresses of other servers, stats, etc) below. i’m not showing them here because they aren’t the major issue yet. over the wire skipping out on any optimizations that might be required, we will expose the following endpoints: get /counters/read?id=users/1/visits&users/1/posts <—will return json response with all the relevant values (already summed up). { “users/1/visits”: 43, “users/1/posts”: 3 } get /counters/read?id=users/1/visits&users/1/1/posts&raw=true <—will return json response with all the relevant values, per source. { “users/1/visits”: {“rvn1”: 21, “rvn2”: 22 } , “users/1/posts”: { “rvn1”: 2, “rvn3”: 1 } } post /counters/increment <– allows to increment counters. the request is a json array of the counter name and the change. for a real system, you’ll probably need a lot more stuff, metrics, stats, etc. but this is the high level design, so this would be enough. note that we are skipping the high performance stream based writes we outlined for time series. we’ll probably won’t need them, so that doesn’t matter, but they are an option if we need them. system behavior this is where it is really not interesting, there is very little behavior here, actually. we only have to read the data from the storage, sum it up, and send it to the user. hardly what i’ll call business logic. client api the client api will probably look something like this: 1: counters.increment("users/1/posts"); 2: counters.increment("users/1/visits", 4); 3: 4: using(var batch = counters.batch()) 5: { 6: batch.increment("users/1/posts"); 7: batch.increment("users/1/visits",5); 8: batch.submit(); 9: } note that we’re offering both batch and single api. we’ll likely also want to offer a fire & forget style, which will be able to offer even better performance (because they could do batching across more than a single thread), but that is out of scope for now. for simplicity sake, we are going to have the client just a container for all of endpoints that it knows about. the container would be responsible for… updating the client visible topology, selecting the best server to use at any given point, etc. user interface there isn’t much to it. just show a list of counter values in a list. allow to search by prefix, allow to dive into a particular counter and read its raw values, but that is about it. oh, and allow to delete a counter. deleting data honestly, i really hate deletes. they are very expensive to handle properly the moment you have more than a single node. in this case, there is an inherent race condition between a delete going out and another node getting an increment. and then there is the issue of what happens if you had a node down when you did the delete, etc. this just sucks. deletion are handled normally, (with the race condition caveat, obviously), and i’ll discuss how we replicate them in a bit. high availability / scale out by definition, we actually don’t want to have storage replication here. either log shipping or consensus based. we actually do want to have different values, because we are going to be modifying things independently on many servers. that means that we need to do replication at the database level. and that leads to some interesting questions. again, the hard part here is the deletes. actually, the really hard part is what we are going to do with the new server problem. the new server problem dictates how we are going to bring a new server into the cluster. if we could fix the size of the cluster, that would make things a lot easier. however, we are actually interested in being able to dynamically grow the cluster size. therefor, there are only two real ways to do it: add a new empty node to the cluster, and have it be filled from all the other servers. add a new node by backing up an existing node, and restoring as a new node. ravendb, for example, follows the first option. but it means that in needs to track a lot more information. the second option is actually a lot simpler, because we don’t need to care about keeping around old data. however, this means that the process of bringing up a new server would now be: update all nodes in the cluster with the new node address (node isn’t up yet, replication to it will fail and be queued). backup an existing node and restore at the new node. start the new node. the order of steps is quite important. and it would be easy to get it wrong. also, on large systems, backup & restore can take a long time. operationally speaking, i would much rather just be able to do something like, bring a new node into the cluster in “silent” mode. that is, it would get information from all the other nodes, and i can “flip the switch” and make it visible to clients at any point in time. that is how you do it with ravendb, and it is an incredibly powerful system, when used properly. that means that for all intents and purposes, we don’t do real deletes. what we’ll actually do is replace the counter value with delete marker. this turns deletes into a much simple “just another write”. it has the sad implication of not free disk space on deletes, but deletes tend to be rare, and it is usually fine to add a “purge” admin option that can be run on as needed basis. but that brings us to an interesting issue, how do we actually handle replication. the topology map to simplify things, we are going to go with one way replication from a node to another. that allows complex topologies like master-master, cluster-cluster, replication chain, etc. but in the end, this is all about a single node replication to another. the first question to ask is, are we going to replicate just our local changes, or are we going to have to replicate external changes as well? the problem with replicating external changes is that you may have the following topology: now, server a got a value and sent it to server b. server b then forwarded it to server c. however, at that point, we also have a the value from server a replicated directly to server c. which value is it supposed to pick? and what about a scenario where you have more complex topology? in general, because in this type of system, we can have any node accept writes, and we actually desire this to be the case , we don’t want this behavior. we want to only replicate local data, not all the data. of course, that leads to an annoying question, what happens if we have a 3 node cluster, and one node fails catastrophically. we can bring a new node in, and the other two nodes will be able to fill in their values via replication, but what about the node that is down? the data isn’t gone, it is still right there in the other two nodes, but we need a way to pull it out. therefor, i think that the best option would be to say that nodes only replicate their local state, except in the case of a new node. a new node will be told the address of an existing node in the cluster, at which point it will: register itself in all the nodes in the cluster (discoverable from the existing node). this assumes a standard two way replication link between all servers, if this isn’t the case, the operators would have the responsibility to setup the actual replication semantics on their own. new node now starts getting updates from all the nodes in the cluster. it keeps them in a log for now, not doing anything yet. ask that node for a complete update of all of its current state. when it has all the complete state of the existing node, it replays all of the remembered logs that it didn’t have a chance to apply yet. then it announces that it is in a valid state to start accepting client connections. note that this process is likely to be very sensitive to high data volumes. that is why you’ll usually want to select a backup node to read from, and that decision is an ops decision. you’ll also want to be able to report extensively on the current status of the node, since this can take a while, and ops will be watching this very closely. server name a node requires a unique name. we can use guids, but those aren’t readable, so we can use machine name + port, but those can change. ideally, we can require the user to set us up with a unique name. that is important for readability and for being able to alter see all the values we have in all the nodes. it is important that names are never repeated, so we’ll probably have a guid there anyway, just to be on the safe side. actual replication semantics since we have the new server problem down to an automated process, we can choose the drastically simpler model of just having an internal queue per each replication destination. whenever we make a change, we also make a note of that in the queue for that destination, then we start an async replication process to that server, sending all of our updates there. it is always safe to overwrite data using replication, because we are overwriting our own data, never anyone else. and… that is about it, actually. there are probably a lot of details that i am missing / would discover if we were to actually implement this. but i think that this is a pretty good idea about what this feature is about.

March 25, 2014

by Oren Eini

· 12,642 Views · 1 Like

Estimating Statistics via Bootstrapping and Monte Carlo Simulation

We want to estimate some "statistics" (e.g. average income, 95 percentile height, variance of weight ... etc.) from a population. It will be too tedious to enumerate all members of the whole population. For efficiency reason, we randomly pick a number samples from the population, compute the statistics of the sample set to estimate the corresponding statistics of the population. We understand the estimation done this way (via random sampling) can deviate from the population. Therefore, in additional to our estimated statistics, we also include a "standard error" (how big our estimation may be deviated from the actual population statistics) or a "confidence interval" (a lower and upper bound of the statistics which we are confident about containing the true statistics). The challenge is how do we estimate the "standard error" or the "confidence interval". A straightforward way is to repeat the sampling exercise many times, each time we create a different sample set from which we compute one estimation. Then we look across all estimations from different sample sets to estimate the standard error and confidence interval of the estimation. But what if collecting data from a different sample set is expensive, or for any reason the population is no longer assessable after we collected our first sample set. Bootstrapping provides a way to address this ... Bootstrapping Instead of creating additional sample sets from the population, we create additional sample sets by re-sampling data (with replacement) from the original sample set. Each of the created sample set will follow the same data distribution of the original sample set, which in turns, follow the population. R provides a nice "bootstrap" library to do this. > library(boot) > # Generate a population > population.weight <- rnorm(100000, 160, 60) > # Lets say we care about the ninety percentile > quantile(population.weight, 0.9) 90% 236.8105 > # We create our first sample set of 500 samples > sample_set1 <- sample(population.weight, 500) > # Here is our sample statistic of ninety percentile > quantile(sample_set1, 0.9) 90% 232.3641 > # Notice that the sample statistics deviates from the population statistics > # We want to estimate how big is this deviation by using bootstrapping > # I need to define my function to compute the statistics > ninety_percentile <- function(x, idx) {return(quantile(x[idx], 0.9))} > # Bootstrapping will call this function many times with different idx > boot_result <- boot(data=sample_set1, statistic=ninety_percentile, R=1000) > boot_result ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = sample_set1, statistic = ninety_percentile, R = 1000) Bootstrap Statistics : original bias std. error t1* 232.3641 2.379859 5.43342 > plot(boot_result) > boot.ci(boot_result, type="bca") BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 1000 bootstrap replicates CALL : boot.ci(boot.out = boot_result, type = "bca") Intervals : Level BCa 95% (227.2, 248.1 ) Calculations and Intervals on Original Scale Here is the visual output of the bootstrap plot Bootstrapping is a powerful simulation technique for estimate any statistics in an empirical way. It is also non-parametric because it doesn't assume any model as well as parameters and just use the original sample set to estimate the statistics. If we assume certain distribution model want to see the distribution of certain statistics. Monte Carlo simulation provides a powerful way for this. Monte Carlo Simulation The idea is pretty simple, based on a particular distribution function (defined by a specific model parameters), we generate many sets of samples. We compute the statistics of each sample set and see how the statistics distributed across different sample sets. For example, given a normal distribution population, what is the probability distribution of the max value of 5 randomly chosen samples. > sample_stats <- rep(0, 1000) > for (i in 1:1000) { + sample_stats[i] <- max(rnorm(5)) + } > mean(sample_stats) [1] 1.153008 > sd(sample_stats) [1] 0.6584022 > par(mfrow=c(1,2)) > hist(sample_stats, breaks=30) > qqnorm(sample_stats) > qqline(sample_stats) Here is the distribution of the "max(5)" statistics, which shows some right skewness Bootstrapping and Monte Carlo simulation are powerful tools to estimate statistics in an empirical manner, especially when we don't have an analytic form of solution.

March 21, 2014

by Ricky Ho

· 5,213 Views

WSO2 DSS: Batch Insert Sample (End to End)

WSO2 DSS wraps Data Services Layer and provides us with a simple GUI to define a Data Service with zero Java code. With this, a change to the data source is just a simple click away and no other party needs to be aware of this. With this sample demonstration, we will see how to do a batch insert to a table. Batch insert is useful when you want to insert data in sequential manner. This also means that if at least one of the insertion query fails all the other queries ran so far in the batch will be rolled back as well. If one insertion in the batch fails means whole batch is failed. This can be used if you are running the same query to insert data many times. With batch insert all the data will be sent in one call. So this reduce the number calls you have to call, to get the data inserted. This comes with one condition that, The query should not be producing results back. (We will only be notified whether the query was successful or not.) Prerequisites: WSO2 Data Services Server - http://wso2.com/products/data-services-server/ (current latest 3.1.1) Mysql connector (JDBC) - https://www.mysql.com/products/connector/ If we already have a data service running which is not sending back a result set , then it's just a matters of adding following property in service declaration. enableBatchRequests="true" Anyway I will be demonstrating the creation of the service from the scratch. 1. Create a service as follows going through the wizard, 2. Create the data source 3. Create the query - (This is an insert query. Also note the input mapping we have add as relevant to the query. To know more about input mapping and using validation refer the documentation.) 4. Create the operation - Select the query to be executed once the operation is called. By enabling return request status, we will be notified whether the operation was a success or not. 5. Try it! - When we list the services we will see this new service now. In the right we will have an option to try it. Here we can see the option to try the service giving the input parameters. Here I have tried it two insertions in a batch. Now if we go to XML view of the service it will be similar to following, which is saved in server as a .dbs file. com.mysql.jdbc.Driver jdbc:mysql://localhost:3306/json_array root root 1 10 SELECT 1 insert into flights (flight_no, number_of_cases, created_by, description, trips) values (:flight_no,:number_of_cases,:created_by,:description,:trips) If we hit on the service name in the list of services, we will be directed to Service Dashboard where we can see several other options for the service. It provides the option to generate an Axis2 client for the service. Once we get the client then it's a matter of calling the methods in the stub as follows. private static BatchRequestSampleOldStub.AddFlight_type0 createFlight(int cases, String creator, String description, int trips) { BatchRequestSampleOldStub.AddFlight_type0 val = new BatchRequestSampleOldStub.AddFlight_type0(); val.setNumber_of_cases(cases); val.setCreated_by(creator); val.setDescription(description); val.setTrips(trips); printFlightInfo(cases, creator, description, trips); return val; } public static void main(String[] args) throws Exception { String epr = "http://localhost:9763" + "/services/BatchInsertSample"; BatchRequestSampleOldStub stub = new BatchRequestSampleOldStub(epr); BatchRequestSampleOldStub.AddFlight_batch_req vals1 = new BatchRequestSampleOldStub.AddFlight_batch_req(); vals1.addAddFlight(createFlight(1, "Pushpalanka", "test", 2)); vals1.addAddFlight(createFlight(2, "Jayawardhana", "test", 2)); vals1.addAddFlight(createFlight(3, "[email protected]", "test", 2)); try { System.out.println("Executing Add Flights.."); stub.addFlight_batch_req(vals1); } catch (Exception e) { System.out.println("Error in Add Flights!"); } Complete client code can be found here. Cheers! Ref: http://docs.wso2.org/display/DSS311/Batch+Processing+Sample

March 21, 2014

by Pushpalanka Jayawardhana

· 10,116 Views

Exporting Spring Data JPA Repositories as REST Services using Spring Data REST

Spring Data modules provides various modules to work with various types of datasources like RDBMS, NOSQL stores etc in unified way. In my previous article SpringMVC4 + Spring Data JPA + SpringSecurity configuration using JavaConfig I have explained how to configure Spring Data JPA using JavaConfig. Now in this post let us see how we can use Spring Data JPA repositories and export JPA entities as REST endpoints using Spring Data REST. First let us configure spring-data-jpa and spring-data-rest-webmvc dependencies in our pom.xml. org.springframework.data spring-data-jpa 1.5.0.RELEASE org.springframework.data spring-data-rest-webmvc 2.0.0.RELEASE Make sure you have latest released versions configured correctly, otherwise you will encounter the following error: java.lang.ClassNotFoundException: org.springframework.data.mapping.SimplePropertyHandler Create JPA entities. @Entity @Table(name = "USERS") public class User implements Serializable { private static final long serialVersionUID = 1L; @Id @GeneratedValue(strategy = GenerationType.IDENTITY) @Column(name = "user_id") private Integer id; @Column(name = "username", nullable = false, unique = true, length = 50) private String userName; @Column(name = "password", nullable = false, length = 50) private String password; @Column(name = "firstname", nullable = false, length = 50) private String firstName; @Column(name = "lastname", length = 50) private String lastName; @Column(name = "email", nullable = false, unique = true, length = 50) private String email; @Temporal(TemporalType.DATE) private Date dob; private boolean enabled=true; @OneToMany(fetch=FetchType.EAGER, cascade=CascadeType.ALL) @JoinColumn(name="user_id") private Set roles = new HashSet<>(); @OneToMany(mappedBy = "user") private List contacts = new ArrayList<>(); //setters and getters } @Entity @Table(name = "ROLES") public class Role implements Serializable { private static final long serialVersionUID = 1L; @Id @GeneratedValue(strategy = GenerationType.IDENTITY) @Column(name = "role_id") private Integer id; @Column(name="role_name",nullable=false) private String roleName; //setters and getters } @Entity @Table(name = "CONTACTS") public class Contact implements Serializable { private static final long serialVersionUID = 1L; @Id @GeneratedValue(strategy = GenerationType.IDENTITY) @Column(name = "contact_id") private Integer id; @Column(name = "firstname", nullable = false, length = 50) private String firstName; @Column(name = "lastname", length = 50) private String lastName; @Column(name = "email", nullable = false, unique = true, length = 50) private String email; @Temporal(TemporalType.DATE) private Date dob; @ManyToOne @JoinColumn(name = "user_id") private User user; //setters and getters } Configure DispatcherServlet using AbstractAnnotationConfigDispatcherServletInitializer. Observe that we have added RepositoryRestMvcConfiguration.class to getServletConfigClasses() method. RepositoryRestMvcConfiguration is the one which does the heavy lifting of looking for Spring Data Repositories and exporting them as REST endpoints. package com.sivalabs.springdatarest.web.config; import javax.servlet.Filter; import org.springframework.data.rest.webmvc.config.RepositoryRestMvcConfiguration; import org.springframework.orm.jpa.support.OpenEntityManagerInViewFilter; import org.springframework.web.servlet.support.AbstractAnnotationConfigDispatcherServletInitializer; import com.sivalabs.springdatarest.config.AppConfig; public class SpringWebAppInitializer extends AbstractAnnotationConfigDispatcherServletInitializer { @Override protected Class[] getRootConfigClasses() { return new Class[] { AppConfig.class}; } @Override protected Class[] getServletConfigClasses() { return new Class[] { WebMvcConfig.class, RepositoryRestMvcConfiguration.class }; } @Override protected String[] getServletMappings() { return new String[] { "/rest/*" }; } @Override protected Filter[] getServletFilters() { return new Filter[]{ new OpenEntityManagerInViewFilter() }; } } Create Spring Data JPA repositories for JPA entities. public interface UserRepository extends JpaRepository { } public interface RoleRepository extends JpaRepository { } public interface ContactRepository extends JpaRepository { } That's it. Spring Data REST will take care of rest of the things. You can use spring Rest Shell https://github.com/spring-projects/rest-shell or Chrome's Postman Addon to test the exported REST services. D:\rest-shell-1.2.1.RELEASE\bin>rest-shell http://localhost:8080:> Now we can change the baseUri using baseUri command as follows: http://localhost:8080:>baseUri http://localhost:8080/spring-data-rest-demo/rest/ http://localhost:8080/spring-data-rest-demo/rest/> http://localhost:8080/spring-data-rest-demo/rest/>list rel href ====================================================================================== users http://localhost:8080/spring-data-rest-demo/rest/users{?page,size,sort} roles http://localhost:8080/spring-data-rest-demo/rest/roles{?page,size,sort} contacts http://localhost:8080/spring-data-rest-demo/rest/contacts{?page,size,sort} Note: It seems there is an issue with rest-shell when the DispatcherServlet url mapped to "/" and issue list command it responds with "No resources found". http://localhost:8080/spring-data-rest-demo/rest/>get users/ { "_links": { "self": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/{?page,size,sort}", "templated": true }, "search": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/search" } }, "_embedded": { "users": [ { "userName": "admin", "password": "admin", "firstName": "Administrator", "lastName": null, "email": "[email protected]", "dob": null, "enabled": true, "_links": { "self": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/1" }, "roles": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/1/roles" }, "contacts": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/1/contacts" } } }, { "userName": "siva", "password": "siva", "firstName": "Siva", "lastName": null, "email": "[email protected]", "dob": null, "enabled": true, "_links": { "self": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/2" }, "roles": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/2/roles" }, "contacts": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/2/contacts" } } } ] }, "page": { "size": 20, "totalElements": 2, "totalPages": 1, "number": 0 } } You can find the source code at https://github.com/sivaprasadreddy/sivalabs-blog-samples-code/tree/master/spring-data-rest-demo For more Info on Spring Rest Shell: https://github.com/spring-projects/rest-shell

March 7, 2014

by Siva Prasad Reddy Katamreddy

· 30,019 Views

Convert CSV Data to Avro Data

In one of my previous posts I explained how we can convert json data to avro data and vice versa using avro tools command line option. Today I was trying to see what options we have for converting csv data to avro format, as of now we don't have any avro tool option to accomplish this . Now, we can either write our own java program (MapReduce program or a simple java program) or we can use various SerDe's available with Hive to do this quickly and without writing any code :) To convert csv data to Avro data using Hive we need to follow the steps below: Create a Hive table stored as textfile and specify your csv delimiter also. Load csv file to above table using "load data" command. Create another Hive table using AvroSerDe. Insert data from former table to new Avro Hive table using "insert overwrite" command. To demonstrate this I will use use the data below (student.csv): 0,38,91 0,65,28 0,78,16 1,34,96 1,78,14 1,11,43 Now execute below queries in Hive: --1. Create a Hive table stored as textfile USE test; CREATE TABLE csv_table ( student_id INT, subject_id INT, marks INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; --2. Load csv_table with student.csv data LOAD DATA LOCAL INPATH "/path/to/student.csv" OVERWRITE INTO TABLE test.csv_table; --3. Create another Hive table using AvroSerDe CREATE TABLE avro_table ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' TBLPROPERTIES ( 'avro.schema.literal'='{ "namespace": "com.rishav.avro", "name": "student_marks", "type": "record", "fields": [ { "name":"student_id","type":"int"}, { "name":"subject_id","type":"int"}, { "name":"marks","type":"int"}] }'); --4. Load avro_table with data from csv_table INSERT OVERWRITE TABLE avro_table SELECT student_id, subject_id, marks FROM csv_table; Now you can get data in Avro format from Hive warehouse folder. To dump this file to local file system use below command: hadoop fs -cat /path/to/warehouse/test.db/avro_table/* > student.avro If you want to get json data from this avro file you can use avro tools command: java -jar avro-tools-1.7.5.jar tojson student.avro > student.json So we can easily convert csv to avro and csv to json also by just writing 4 HQLs.

March 5, 2014

by Rishav Rohit

· 39,705 Views · 1 Like

When to Use MongoDB Rather than MySQL (or Other RDBMS): The Billing Example

NoSQL has been a hot topic a pretty long time (well, it's not only a buzz anymore). However, when should we really use it instead of an RDBMS?

March 3, 2014

by Moshe Kaplan

· 378,912 Views · 12 Likes

Python CSV Files: Reading and Writing

Learn to parse CSV (Comma Separated Values) files with Python examples using the csv module's reader function and DictReader class.

March 3, 2014

by Mike Driscoll

· 375,662 Views · 6 Likes

Getting Started with Mocking in Java using Mockito

We all write unit tests but the challenge we face at times is that the unit under test might be dependent on other components. And configuring other components for unit testing is definitely an overkill. Instead we can make use of Mocks in place of the other components and continue with the unit testing. To show how one can use mocks, I have a Data access layer(DAL), basically a class which provides an API for the application to access and modify the data in the data repository. I then unit test the DAL without actually the need to connect to the data repository. The data repository can be a local database or remote database or a file system or any place where we can store and retrieve the data. The use of a DAL class helps us in keeping the data mappers separate from the application code. Lets create a Java project using maven. mvn archetype:generate -DgroupId=info.sanaulla -DartifactId=MockitoDemo -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false The above creates a folder MockitoDemo and then creates the entire directory structure for source and test files. Consider the below model class for this example: package info.sanaulla.models; import java.util.List; /** * Model class for the book details. */ public class Book { private String isbn; private String title; private List authors; private String publication; private Integer yearOfPublication; private Integer numberOfPages; private String image; public Book(String isbn, String title, List authors, String publication, Integer yearOfPublication, Integer numberOfPages, String image){ this.isbn = isbn; this.title = title; this.authors = authors; this.publication = publication; this.yearOfPublication = yearOfPublication; this.numberOfPages = numberOfPages; this.image = image; } public String getIsbn() { return isbn; } public String getTitle() { return title; } public List getAuthors() { return authors; } public String getPublication() { return publication; } public Integer getYearOfPublication() { return yearOfPublication; } public Integer getNumberOfPages() { return numberOfPages; } public String getImage() { return image; } } The DAL class for operating on the Book model class is: package info.sanaulla.dal; import info.sanaulla.models.Book; import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; import java.util.List; /** * API layer for persisting and retrieving the Book objects. */ public class BookDAL { private static BookDAL bookDAL = new BookDAL(); public List getAllBooks(){ return Collections.EMPTY_LIST; } public Book getBook(String isbn){ return null; } public String addBook(Book book){ return book.getIsbn(); } public String updateBook(Book book){ return book.getIsbn(); } public static BookDAL getInstance(){ return bookDAL; } } The DAL layer above currently has no functionality and we are going to unit test that piece of code (TDD). The DAL layer might communicate with a ORM Mapper or Database API which we are not concerned while designing the API. Test Driving the DAL Layer There are lot of frameworks for Unit testing and mocking in Java but for this example I would be picking JUnit for unit testing and Mockito for mocking. We would have to update the dependency in Maven’s pom.xml 4.0.0 info.sanaulla MockitoDemo jar 1.0-SNAPSHOT MockitoDemo http://maven.apache.org junit junit 4.10 test org.mockito mockito-all 1.9.5 test Now lets unit test the BookDAL. During the unit testing we will inject mock data into the BookDAL so that we can complete the testing of the API without depending on the data source. Initially we will have an empty test class: public class BookDALTest { public void setUp() throws Exception { } public void testGetAllBooks() throws Exception { } public void testGetBook() throws Exception { } public void testAddBook() throws Exception { } public void testUpdateBook() throws Exception { } } We will inject the mock BookDAL and mock data in the setUp() as shown below: public class BookDALTest { private static BookDAL mockedBookDAL; private static Book book1; private static Book book2; @BeforeClass public static void setUp(){ //Create mock object of BookDAL mockedBookDAL = mock(BookDAL.class); //Create few instances of Book class. book1 = new Book("8131721019","Compilers Principles", Arrays.asList("D. Jeffrey Ulman","Ravi Sethi", "Alfred V. Aho", "Monica S. Lam"), "Pearson Education Singapore Pte Ltd", 2008,1009,"BOOK_IMAGE"); book2 = new Book("9788183331630","Let Us C 13th Edition", Arrays.asList("Yashavant Kanetkar"),"BPB PUBLICATIONS", 2012,675,"BOOK_IMAGE"); //Stubbing the methods of mocked BookDAL with mocked data. when(mockedBookDAL.getAllBooks()).thenReturn(Arrays.asList(book1, book2)); when(mockedBookDAL.getBook("8131721019")).thenReturn(book1); when(mockedBookDAL.addBook(book1)).thenReturn(book1.getIsbn()); when(mockedBookDAL.updateBook(book1)).thenReturn(book1.getIsbn()); } public void testGetAllBooks() throws Exception {} public void testGetBook() throws Exception {} public void testAddBook() throws Exception {} public void testUpdateBook() throws Exception {} } In the above setUp() method I have: Created a mock object of BookDAL BookDAL mockedBookDAL = mock(BookDAL.class); Stubbed the API of BookDAL with mock data, such that when ever the API is invoked the mocked data is returned. //When getAllBooks() is invoked then return the given data and so on for the other methods. when(mockedBookDAL.getAllBooks()).thenReturn(Arrays.asList(book1, book2)); when(mockedBookDAL.getBook("8131721019")).thenReturn(book1); when(mockedBookDAL.addBook(book1)).thenReturn(book1.getIsbn()); when(mockedBookDAL.updateBook(book1)).thenReturn(book1.getIsbn()); Populating the rest of the tests we get: package info.sanaulla.dal; import info.sanaulla.models.Book; import org.junit.BeforeClass; import org.junit.Test; import static org.junit.Assert.*; import static org.mockito.Mockito.mock; import static org.mockito.Mockito.when; import java.util.Arrays; import java.util.List; public class BookDALTest { private static BookDAL mockedBookDAL; private static Book book1; private static Book book2; @BeforeClass public static void setUp(){ mockedBookDAL = mock(BookDAL.class); book1 = new Book("8131721019","Compilers Principles", Arrays.asList("D. Jeffrey Ulman","Ravi Sethi", "Alfred V. Aho", "Monica S. Lam"), "Pearson Education Singapore Pte Ltd", 2008,1009,"BOOK_IMAGE"); book2 = new Book("9788183331630","Let Us C 13th Edition", Arrays.asList("Yashavant Kanetkar"),"BPB PUBLICATIONS", 2012,675,"BOOK_IMAGE"); when(mockedBookDAL.getAllBooks()).thenReturn(Arrays.asList(book1, book2)); when(mockedBookDAL.getBook("8131721019")).thenReturn(book1); when(mockedBookDAL.addBook(book1)).thenReturn(book1.getIsbn()); when(mockedBookDAL.updateBook(book1)).thenReturn(book1.getIsbn()); } @Test public void testGetAllBooks() throws Exception { List allBooks = mockedBookDAL.getAllBooks(); assertEquals(2, allBooks.size()); Book myBook = allBooks.get(0); assertEquals("8131721019", myBook.getIsbn()); assertEquals("Compilers Principles", myBook.getTitle()); assertEquals(4, myBook.getAuthors().size()); assertEquals((Integer)2008, myBook.getYearOfPublication()); assertEquals((Integer) 1009, myBook.getNumberOfPages()); assertEquals("Pearson Education Singapore Pte Ltd", myBook.getPublication()); assertEquals("BOOK_IMAGE", myBook.getImage()); } @Test public void testGetBook(){ String isbn = "8131721019"; Book myBook = mockedBookDAL.getBook(isbn); assertNotNull(myBook); assertEquals(isbn, myBook.getIsbn()); assertEquals("Compilers Principles", myBook.getTitle()); assertEquals(4, myBook.getAuthors().size()); assertEquals("Pearson Education Singapore Pte Ltd", myBook.getPublication()); assertEquals((Integer)2008, myBook.getYearOfPublication()); assertEquals((Integer)1009, myBook.getNumberOfPages()); } @Test public void testAddBook(){ String isbn = mockedBookDAL.addBook(book1); assertNotNull(isbn); assertEquals(book1.getIsbn(), isbn); } @Test public void testUpdateBook(){ String isbn = mockedBookDAL.updateBook(book1); assertNotNull(isbn); assertEquals(book1.getIsbn(), isbn); } } One can run the test by using maven command: mvn test. The output is: ------------------------------------------------------- T E S T S ------------------------------------------------------- Running info.sanaulla.AppTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.029 sec Running info.sanaulla.dal.BookDALTest Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.209 sec Results : Tests run: 5, Failures: 0, Errors: 0, Skipped: 0 So we have been able to test the DAL class without actually configuring the data source by using mocks.

February 26, 2014

by Mohamed Sanaulla

· 233,463 Views · 18 Likes

How to Estimate Memory Consumption

This story goes back at least a decade, when I was first approached by a PHB with a question “How big servers are we going to need to buy for our production deployment”. The new and shiny system we have been building was nine months from production rollout and apparently the company had promised to deliver the whole solution, including hardware. Oh boy, was I in trouble. With just a few years of experience down my belt, I could have pretty much just tossed a dice. Even though I am sure my complete lack of confidence was clearly visible, I still had to come up with the answer. Four hours of googling later I recall sitting there with the same question still hovering in front of my bedazzled face: “How to estimate the need for computing power?” In this post I start to open up the subject by giving you rough guidelines on how to estimate memory requirements for your brand new Java application. For the impatient ones – the answer will be to start with the memory equal to approximately 5 x [amount of memory consumed by Live Data] and start the fine-tuning from there. For the ones more curious about the logic behind, stay with me and I will walk you through the reasoning. First and foremost, I can only recommend to avoid answering a question phrased like this without detailed information being available. Your answer has to be based upon the performance requirements, so do not even start without clarifying those first. And I do not mean way-too-ambiguous “The system needs to support 700 concurrent users”, but a lot more specific ones about latency and throughput, taking into account the amount of data, usage patterns. One should also not forget about the budget also – we all can all dream about sub-millisecond latencies, but those without HFT banking backbone budgets – unfortunately it will only remain a dream. For now, lets assume you have those requirements in place. Next stop would be to create the load test scripts emulating user behaviour. If you are now able to launch those scripts concurrently you have built a foundation to the answer. As you might also have guessed, the next step involves our usual advice of measuring not guessing. But with a caveat. Live Data Size Namely, our quest for the optimal memory configuration requires capturing the Live Data Size. Having captured this, we have the baseline configuration in place for the fine-tuning. How does one define live data size? Charlie Hunt and Binu John in their “Java Performance” book have given it the following definition: Live data size is the heap size consumed by the set of long-lived objects required to run the application in its steady state. Equipped with the definition, we are ready to run your load tests against the application with the GC logging turned on (-XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log -XX:+PrintGCDetails) and visualize the logs (with the help of gcviewer for example) to determine the moment when the application has reached to the steady state. What you are after looks similar to the following: We can see the GC doing its job both with minor and Full GC runs in a familiar double-saw-toothed graphic. This particular application seems to have achieved a steady state already after the first full GC run on 21st second. In most cases however, it takes 10-20 Full GC runs to spot the change in trends. After four full GC runs we can estimate that the Live Data Size is equal to approximately 100MB. The aforementioned Java Performance book is now indicating that there is a strong correlation between the Live Data Size and the optimal memory configuration parameters in a typical Java EE application. The evidence from the field is also backing up their recommendations: Set the maximum heap size to 3-4 x [Live Data Size] So, for our application at hand, we should set the -Xmx to be in between 300m and 400m for the initial performance tests and take it from there. We have mixed feelings about other recommendations given in the book, recommending to set the maximum permanent generation size to 1.2-1.5 x [Live Data Size of the Permanent Generation] and the -XX:NewRatio being set to 1-1.5 x of the [Live Data Size]. We are currently gathering more data to determine whether the positive correlation exists, but until then I recommend to base your survival and eden configuration decisions on monitoring your allocation rate instead. Why should one bother you might now ask. Indeed, two reasons for not caring immediately surface: 8G memory chip is in sub $100 territory at the time of writing this article Virtualization, especially when using large providers such as Amazon AWS make adjusting the capacity easy Both of the reasons are partially valid and have definitely reduced the need for provisioning to be precise. But both of them are still putting you in the danger zone When tossing in huge amounts of memory “just in case” you are most likely going to significantly affect the latency – going into heaps above 8G it is darn easy to introduce Full GC pauses spanning over tens of seconds. When over-provisioning with the mindset of “lets tune it later”, the “later” part has a tendency of never arriving. I have faced numerous applications running on vastly over provisioned environments just because of this. For example the aforementioned application I discovered running on Amazon EC2 m1.xlarge instance was costing the company $4,200 per instance / year. Converting it to m1.small reduced the bill to just $520 for the instance. 8-fold cost reduction will be visible from your operations budget if your deployments are large, trust me on this. Summary Unfortunately I still see way too many decisions made exactly like I was forced to do a decade ago. This leads to the under- and over planning of capacity, both of which can be equally poor choices, especially if you cannot enjoy the benefits of virtualization. I got lucky with mine, but you might not get away with your guestimate, so I can only recommend to actually plan ahead using the simple framework described in this post. If you enjoyed the content, I can only recommend to follow our performance tuning advice in Twitter.

February 25, 2014

by Nikita Salnikov-Tarnovski

· 11,804 Views

Voron & Time Series Data: Getting Real Data Outputs

So far, we have just put the data in and out. And we have had a pretty good track record doing so. However, what do we do with the data now that we have it? As you can expect, we need to read it out. Usually by specific date ranges. The interesting thing is that we usually are not interested in just a single channel, we care about multiple channels. And for fun, those channel might be synchronized or not. An example of the first might be the current speed and the current engine temperature in a car. They are generally share the exact same timestamps. An example of out of sync is when you have a sensor on a rooftop measuring rainfall, and another sensor in the sewer measuring water flow rates. (Again, thanks to Dan for helping me with the domain). This is interesting, because it present quite a few interesting problems: We need to merge different streams into a unified view. We need to handle both matching and non matching sequences. We need to handle erroneous data, what happens when we have two reading for the same time for the same sensor? Yes, that shouldn’t happen, but it does. I solved this with the following API: public class RangeEntry { public DateTime Timestamp; public double?[] Values; } IEnumerable results = dts.ScanRanges(DateTime.MinValue, DateTime.MaxValue, new[] { "6febe146-e893-4f64-89f8-527f2dbaae9b", "707dcb42-c551-4f1a-9203-e4b0852516cf", "74d5bee8-9a7b-4d4e-bd85-5f92dfc22edb", "7ae29feb-6178-4930-bc38-a90adf99cfd3", }); This API gives me the results in the time order, with the same positions as the ids requested for the values. With nulls if there isn’t a value matching the value from that time in that particular sensor channel. The actual implementation relies on this method: IEnumerable ScanRange(DateTime start, DateTime end, string id) All this does it provide the entries all the entries in a particular date range, for a particular channel. Let us see how we implement multi channel scanning on top of this: private class PendingEnumerator { public IEnumerator Enumerator; public int Index; } private class PendingEnumerators { private readonly SortedDictionary> _values = new SortedDictionary>(); public void Enqueue(PendingEnumerator entry) { List list; var dateTime = entry.Enumerator.Current.Timestamp; if (_values.TryGetValue(dateTime, out list) == false) { _values.Add(dateTime, list = new List()); } list.Add(entry); } public bool IsEmpty { get { return _values.Count == 0; } } public List Dequeue() { if (_values.Count == 0) return new List(); var kvp = _values.First(); _values.Remove(kvp.Key); return kvp.Value; } } public IEnumerable ScanRanges(DateTime start, DateTime end, string[] ids) { if (ids == null || ids.Length == 0) yield break; var pending = new PendingEnumerators(); for (int i = 0; i < ids.Length; i++) { var enumerator = ScanRange(start, end, ids[i]).GetEnumerator(); if(enumerator.MoveNext() == false) continue; pending.Enqueue(new PendingEnumerator { Enumerator = enumerator, Index = i }); } var result = new RangeEntry { Values = new double?[ids.Length] }; while (pending.IsEmpty == false) { Array.Clear(result.Values,0,result.Values.Length); var entries = pending.Dequeue(); if (entries.Count == 0) break; foreach (var entry in entries) { var current = entry.Enumerator.Current; result.Timestamp = current.Timestamp; result.Values[entry.Index] = current.Value; if(entry.Enumerator.MoveNext()) pending.Enqueue(entry); } yield return result; } } We are getting a single entry from each channel into the pending enumerators. Then, we collate all the entries that share the same time into a single entry. We use the Index property to track the actual expected index of the entry in the output. And we handle duplicate times in the same channel by outputting multiple entries. Testing this on my 1.1 million records data set, we can get 185 thousands records back in 0.15 seconds.

February 25, 2014

by Oren Eini

· 5,403 Views

The Risks Of Big-Bang Deployments And Techniques For Step-wise Deployment

If you ever need to persuade management why it might be better to deploy a larger change in multiple stages and push it to customers gradually, read on. A deployment of many changes is risky. We want therefore to deploy them in a way which minimizes the risk of harm to our customers and our companies. The deployment can be done either in an all-at-once (also known as big-bang) way or a gradual way. We will argue here for the more gradual (“stepwise”) approach. Big-bang or stepwise deployment? A big-bang deployment seems to be the natural thing to do: the full solution is developed and tested and then replaces the current system at once. However, it has two crucial flaws. First, it assumes that most defects can be discovered by testing. However, due to differences in test/prod environments, unknown dependencies, and the sheer scale of a typical larger system there always will be problems that are not discovered until production deployment or even until the application runs for a while in production (whichapplies even to airplanes). The more parts have been changed, the more of these production defects will happen at the same time. A gradual deployment makes it possible to discover and handle them one by one. Second, the more complex the deployment, the higher chance of human error(s), i.e. the deployment itself is a likely source of serious defects. Some of the drawbacks of a big-bang deployment in more detail: Complexity: A big-bang deployment requires coordination of many people and “moving parts” that depend on each other, providing a huge opportunity for human mistake (i.e. there will be mistakes). Lot of time: Such a deployment requires lot of time (typically also more than planed/expected) and thus lot of downtime when users cannot use the system. Hard troubleshooting: With a network of inter-dependent parts that changed all at the same time, while perhaps also changing the infrastructure (i.e. connections between them), it is extremely hard to pinpoint the source of defects, thus considerably increasing the time to detect and correct defects while also increasing the risk of people stepping on the toes of each other and “panic fixes” that either cause more problems than they remove or are not good enough (as the rollback that sped upKnight’s downfall). Rollback is likely either impossible or equally time-consuming and risky as the deployment itself, thus increasing the impact of defects and inviting even more human errors. Impact: Deploying everything to all users at the same time means that everybody will be impacted by a potential defect/error/mistake. Long freeze: All needs to be tested together after all development is finished, which requires a lot of time while the code is frozen and no more fixes and changes can get into production for weeks. Risk mitigation The goal of a good deployment plan is to mitigate the risk of the deployment and get it to an acceptable level. There are two aspects to risk: the probability of a defect and the impact of the defect. The following table shows how the possible measures affect them: Defect probability reduction Defect impact reduction testing stepwise deployment gradual migration of users to the new version (f.ex. 1 in 1000 or particular subsets) rollback mechanism => these also lead to much lower time to detect and fix defects Practices for stepwise deployment Enable stepwise deployment: Use parallel change and other Continuous Delivery techniques to make it possible to deploy updated components independently from each other and to switch on/off new features and to switch what versions of the components they depend on are currently used. (Parallel change – keeping the old and new code and being able to use one or the other – is crucial here. Also notice that parallel change applies also to data – you will need to evolve your data schema gradually and keep both old and new one at the same time in a period of time.) Enable rollback. The previous measure – stepwise deployment – makes it also easy(ier) to roll-back the changes by switching to a previous version of a dependency or by switching back to the old code. Migrate users gradually to the new version, i.e. expose the new version only to a small subset of the users initially and increase that subset until everybody uses it. This can be done f.ex. by deploying to only a subset of servers and sending a random/particular subset of users to the new servers but there are also ways if you have only a single machine. (See f.ex. my post Webapp Blue-Green Deployment Without Breaking Sessions/With Fallback With HAProxy.) Monitoring – make sure you are able to monitor flow of users through the system and detect any anomalies and errors early, long before angry calls from the business. Tools such as Logstash, Google Analytics (with custom events from JavaScript), client-side error logging via one of existing services or a custom solution are invaluable. About these ads

February 20, 2014

by Jakub Holý

· 22,182 Views

Eclipse's BIRT: Scripted Data Set

This article presents the usage of sripted data set in the eclipse's BIRT.

February 18, 2014

by Kosta Stojanovski

· 38,826 Views · 1 Like

The Latest Data Topics