The Latest Data Topics

Java String length confusion
Facts and Terminology

As you probably know, Java uses UTF-16 to represent Strings. In order to understand the confusion about String.length(), you need to be familiar with some Encoding/Unicode terms.

Code Point: A unique integer value which represents a character in the code space.
Code Unit: A bit sequence used to encode characters (Code Points). One or more Code Units may be required to represent a Code Point.

UTF-16

Unicode Code Points are logically divided into 17 planes. The first plane, the Basic Multilingual Plane (BMP), contains the "classic" characters (from U+0000 to U+FFFF). The other planes contain the supplementary characters (from U+10000 to U+10FFFF). Characters (Code Points) from the first plane are encoded in one 16-bit Code Unit with the same value. Supplementary characters (Code Points) are encoded in two Code Units (encoding-specific, see Wiki for the explanation).

Example

Character: A
Unicode Code Point: U+0041
UTF-16 Code Unit(s): 0041

Character: Mathematical double-struck capital A
Unicode Code Point: U+1D538
UTF-16 Code Unit(s): D835 DD38

As you can see here, there are characters which are encoded in two Code Units.

String.length()

Let's take a look at the Javadoc of the length() method:

    public int length()
    Returns the length of this string. The length is equal to the number of Unicode code units in the string.

So if you have one supplementary character which consists of two code units, the length of that single character is two.

    // Mathematical double-struck capital A
    String str = "\uD835\uDD38";
    System.out.println(str);
    System.out.println(str.length()); // prints 2

Which is correct according to the documentation, but maybe it's not expected.
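The code point to code unit mapping from the example can be checked with the standard Character class. The following is only a minimal sketch: the class name CodeUnits and the helper utf16Hex are made up for illustration, while Character.toChars and Character.charCount are part of the JDK.

```java
// Hypothetical helper class for illustration; not part of the original article.
final class CodeUnits {

    /** Returns the UTF-16 code units of one code point as upper-case hex, separated by spaces. */
    static String utf16Hex(int codePoint) {
        // Character.toChars yields one char for BMP code points and a surrogate pair otherwise.
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex(0x0041));             // LATIN CAPITAL LETTER A
        System.out.println(utf16Hex(0x1D538));            // MATHEMATICAL DOUBLE-STRUCK CAPITAL A
        System.out.println(Character.charCount(0x1D538)); // number of code units needed
    }
}
```

For U+0041 this gives a single code unit, and for U+1D538 the surrogate pair D835 DD38 listed in the example.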
Solution

You need to count the code points, not the code units:

    String str = "\uD835\uDD38";
    System.out.println(str);
    System.out.println(str.codePointCount(0, str.length()));

See: codePointCount(int beginIndex, int endIndex)

References/Sources

The Java Language Specification
Unicode Glossary: Code Point
Wiki: Code Point
Unicode Glossary: Code Unit
Wiki: Code Unit
Wiki: Unicode
Wiki: UTF-16
Supplementary Characters in the Java Platform
Wiki: Unicode Planes
April 21, 2014
by Jonatan Ivanov
· 18,038 Views · 7 Likes
Be Careful with Java Path.endsWith(String) Usage
If you need to compare a java.nio.file.Path object, be aware that Path.endsWith(String) will ONLY match whole trailing name elements of your original path, not the trailing characters of the path name string! If you want to match against the string form of the name, you need to call Path.toString() first. For example:

    // Match all jar files.
    Files.walk(dir).forEach(path -> {
        if (path.toString().endsWith(".jar"))
            System.out.println(path);
    });

Without the toString() call you will spend many fruitless hours wondering why your program doesn't work.
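The difference is easy to see on a single path. This is a small sketch rather than the author's code: the class name PathSuffix and the helper hasJarExtension are invented here, and checking getFileName() instead of the full path string is my own variation on the toString() approach above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class for illustration; not part of the original article.
final class PathSuffix {

    /** String-based check on the last name element, similar in spirit to path.toString().endsWith(".jar"). */
    static boolean hasJarExtension(Path path) {
        Path name = path.getFileName(); // null for a root or empty path
        return name != null && name.toString().endsWith(".jar");
    }

    public static void main(String[] args) {
        Path p = Paths.get("lib", "app.jar");
        // Path.endsWith(String) compares whole name elements, so only the full file name matches.
        System.out.println(p.endsWith("app.jar"));
        System.out.println(p.endsWith(".jar"));
        System.out.println(hasJarExtension(p));
    }
}
```

Here p.endsWith("app.jar") is true because "app.jar" is a complete name element, while p.endsWith(".jar") is false because ".jar" is treated as a path of its own, not as a string suffix.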
April 19, 2014
by Zemian Deng
· 10,612 Views · 1 Like
How to Convert C# Object Into JSON String with JSON.NET
Some time ago I wrote a blog post, Converting a C# object into JSON string. On that post one reader, Thomas Levesque, commented that most people now use JSON.NET, a popular high-performance JSON framework for .NET created by James Newton-King. I agree with him if we are using .NET Framework 4.0 or higher; for earlier versions JavaScriptSerializer is still good. So in this post we are going to learn how to convert a C# object into a JSON string with the JSON.NET framework.

What is JSON.NET:

JSON.NET is a very high-performance framework, compared to other serializers, for converting C# objects into JSON strings. It was created by James Newton-King. You can find more information about the framework at the following link: http://james.newtonking.com/json

How to convert a C# object into a JSON string with the JSON.NET framework:

For this I am going to use the application from my previous post. Following is an Employee class with two properties, first name and last name.

    public class Employee
    {
        public string FirstName { get; set; }
        public string LastName { get; set; }
    }

I have created the same "Employee" object as in the previous post:

    Employee employee = new Employee { FirstName = "Jalpesh", LastName = "Vadgama" };

Now it's time to add the JSON.NET NuGet package to the project. Once the NuGet package is added, the following code converts the C# object into a JSON string.

    string jsonString = Newtonsoft.Json.JsonConvert.SerializeObject(employee);
    Console.WriteLine(jsonString);

Let's run the application; the output is as expected. That's it. It's very easy. Hope you like it. Stay tuned for more.
April 14, 2014
by Jalpesh Vadgama
· 193,290 Views
How to Migrate from MySQL to MongoDB
In the last week I was working on a key project to migrate a BI platform from MySQL to MongoDB. We chose that database due to its support and scalability.
April 14, 2014
by Moshe Kaplan
· 115,877 Views · 6 Likes
Groovy Goodness: Converting Byte Array to Hex String
To convert a byte[] array to a String we can simply use the new String(byte[]) constructor. But if the array contains non-printable bytes we don't get a good representation. In Groovy we can use the method encodeHex() to transform a byte[] array to a hex String value. The byte elements are converted to their hexadecimal equivalents.

    final byte[] printable = [109, 114, 104, 97, 107, 105]

    // array with non-printable bytes 6, 27 (ACK, ESC)
    final byte[] nonprintable = [109, 114, 6, 27, 104, 97, 107, 105]

    assert new String(printable) == 'mrhaki'
    assert new String(nonprintable) != 'mr haki'

    // encodeHex() returns a Writable
    final Writable printableHex = printable.encodeHex()
    assert printableHex.toString() == '6d7268616b69'

    final nonprintableHex = nonprintable.encodeHex().toString()
    assert nonprintableHex == '6d72061b68616b69'

    // Convert back
    assert nonprintableHex.decodeHex() == nonprintable

Code written with Groovy 2.2.1
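For readers without Groovy at hand, the conversion itself is small enough to sketch in plain Java. This is an illustration of the idea only, not how Groovy implements encodeHex() or decodeHex(); the class HexSketch and both method names are assumptions chosen to mirror the Groovy calls.

```java
// Hypothetical plain-Java counterpart to the Groovy example; names are illustrative only.
final class HexSketch {

    /** Encodes each byte as two lower-case hexadecimal digits. */
    static String encodeHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16)); // high nibble
            sb.append(Character.forDigit(b & 0xF, 16));        // low nibble
        }
        return sb.toString();
    }

    /** Decodes a hex string (two digits per byte) back into bytes. */
    static byte[] decodeHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] nonprintable = {109, 114, 6, 27, 104, 97, 107, 105};
        System.out.println(encodeHex(nonprintable));
    }
}
```

Applied to the non-printable array from the Groovy example, encodeHex yields the same string the assertions above expect, 6d72061b68616b69.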
April 6, 2014
by Hubert Klein Ikkink
· 14,456 Views · 5 Likes
Distributed Counters Feature Design
This is another experiment with longer posts. Previously, I used the time series example as the bed on which to test some ideas regarding feature design, to explain how we work and, in general, to work out the rough patches along the way. I should probably note that these posts are purely fiction at this point. We have no plans to include a time series feature in RavenDB at this time. I am trying to work out some thoughts in the open and get your feedback.

At any rate, yesterday we had a request for Cassandra-style counters on the mailing list. And as long as I am doing a feature design series, I thought that I could talk about how I would go about implementing this. Again, consider this fiction; I have no plans of implementing this at this time.

The essence of what we want is to be able to… count stuff. Efficiently, in a distributed manner, with optional support for cross data center replication.

Very roughly, the idea is to have "sub counters", unique for every node in the system. Whenever you increment the value, we log this to our own sub counter, and then replicate it out. Whenever you read it, we just sum all the data we have from all the sub counters.

Let us outline the various parts of the solution in the same order as the one I used for time series.

Storage

A counter is just a named 64-bit signed integer. A counter name can be any string of up to 128 printable characters. The external interface of the storage would look like this:

    public struct CounterIncrement
    {
        public string Name;
        public long Change;
    }

    public struct Counter
    {
        public string Name;
        public string Source;
        public long Value;
    }

    public interface ICounterStorage
    {
        void LocalIncrementBatch(CounterIncrement[] batch);

        Counter[] Read(string name);

        void ReplicatedUpdates(Counter[] updates);
    }

As you can see, this gives us a very simple interface for the storage.
We can either change the data locally (which modifies our own storage) or we can get an update from a replica about its changes. There really isn't much more to it, to be fair. LocalIncrementBatch() increments a local value, and Read() returns all the values for a counter. There is a little bit of trickery involved in how exactly one would store the counter values. For now, I think we'll store each counter as two-step values. We'll have a tree of multi-tree values that will carry each value from each source. That means that a counter will take roughly 4KB or so. This is easy to work with and nicely fits the model Voron uses internally.

Note that we'll outline additional requirements for storage (searching for counters by prefix, iterating over counters, addresses of other servers, stats, etc.) below. I'm not showing them here because they aren't the major issue yet.

Over the wire

Skipping any optimizations that might be required, we will expose the following endpoints:

GET /counters/read?id=users/1/visits&users/1/posts <-- will return a JSON response with all the relevant values (already summed up).

    { "users/1/visits": 43, "users/1/posts": 3 }

GET /counters/read?id=users/1/visits&users/1/posts&raw=true <-- will return a JSON response with all the relevant values, per source.

    { "users/1/visits": { "rvn1": 21, "rvn2": 22 }, "users/1/posts": { "rvn1": 2, "rvn3": 1 } }

POST /counters/increment <-- allows incrementing counters. The request is a JSON array of the counter name and the change.

For a real system, you'll probably need a lot more stuff: metrics, stats, etc. But this is the high-level design, so this is enough. Note that we are skipping the high-performance stream-based writes we outlined for time series. We probably won't need them, so that doesn't matter, but they are an option if we do.

System behavior

This is where it is really not interesting; there is very little behavior here, actually.
We only have to read the data from the storage, sum it up, and send it to the user. Hardly what I'd call business logic.

Client API

The client API will probably look something like this:

    counters.Increment("users/1/posts");
    counters.Increment("users/1/visits", 4);

    using (var batch = counters.Batch())
    {
        batch.Increment("users/1/posts");
        batch.Increment("users/1/visits", 5);
        batch.Submit();
    }

Note that we're offering both a batch and a single-call API. We'll likely also want to offer a fire & forget style, which would be able to offer even better performance (because it could do batching across more than a single thread), but that is out of scope for now.

For simplicity's sake, we are going to have the client be just a container for all of the endpoints that it knows about. The container would be responsible for… updating the client-visible topology, selecting the best server to use at any given point, etc.

User interface

There isn't much to it. Just show counter values in a list. Allow searching by prefix, allow diving into a particular counter and reading its raw values, but that is about it. Oh, and allow deleting a counter.

Deleting data

Honestly, I really hate deletes. They are very expensive to handle properly the moment you have more than a single node. In this case, there is an inherent race condition between a delete going out and another node getting an increment. And then there is the issue of what happens if you had a node down when you did the delete, etc. This just sucks. Deletions are handled normally (with the race condition caveat, obviously), and I'll discuss how we replicate them in a bit.

High availability / scale out

By definition, we actually don't want to have storage replication here, either log shipping or consensus based. We actually do want to have different values, because we are going to be modifying things independently on many servers. That means that we need to do replication at the database level.
And that leads to some interesting questions. Again, the hard part here is the deletes. Actually, the really hard part is what we are going to do with the new server problem.

The new server problem dictates how we are going to bring a new server into the cluster. If we could fix the size of the cluster, that would make things a lot easier. However, we are actually interested in being able to dynamically grow the cluster size. Therefore, there are only two real ways to do it:

1. Add a new empty node to the cluster, and have it be filled from all the other servers.
2. Add a new node by backing up an existing node, and restoring it as a new node.

RavenDB, for example, follows the first option. But it means that it needs to track a lot more information. The second option is actually a lot simpler, because we don't need to care about keeping around old data. However, this means that the process of bringing up a new server would now be:

1. Update all nodes in the cluster with the new node address (the node isn't up yet, so replication to it will fail and be queued).
2. Back up an existing node and restore it at the new node.
3. Start the new node.

The order of steps is quite important, and it would be easy to get it wrong. Also, on large systems, backup & restore can take a long time. Operationally speaking, I would much rather just be able to bring a new node into the cluster in "silent" mode. That is, it would get information from all the other nodes, and I can "flip the switch" and make it visible to clients at any point in time. That is how you do it with RavenDB, and it is an incredibly powerful system when used properly.

That means that for all intents and purposes, we don't do real deletes. What we'll actually do is replace the counter value with a delete marker. This turns deletes into a much simpler "just another write".
It has the sad implication of not freeing disk space on deletes, but deletes tend to be rare, and it is usually fine to add a "purge" admin option that can be run on an as-needed basis. But that brings us to an interesting issue: how do we actually handle replication?

The topology map

To simplify things, we are going to go with one-way replication from one node to another. That allows complex topologies like master-master, cluster-cluster, replication chain, etc. But in the end, this is all about a single node replicating to another.

The first question to ask is: are we going to replicate just our local changes, or are we going to have to replicate external changes as well? The problem with replicating external changes is that you may have the following topology:

Now, server A got a value and sent it to server B. Server B then forwarded it to server C. However, at that point, we also have the value from server A replicated directly to server C. Which value is it supposed to pick? And what about a scenario where you have a more complex topology? In general, because in this type of system any node can accept writes, and we actually desire this to be the case, we don't want this behavior. We want to replicate only local data, not all the data.

Of course, that leads to an annoying question: what happens if we have a 3-node cluster, and one node fails catastrophically? We can bring a new node in, and the other two nodes will be able to fill in their values via replication, but what about the node that is down? The data isn't gone, it is still right there in the other two nodes, but we need a way to pull it out.

Therefore, I think that the best option would be to say that nodes only replicate their local state, except in the case of a new node. A new node will be told the address of an existing node in the cluster, at which point it will:

1. Register itself in all the nodes in the cluster (discoverable from the existing node). This assumes a standard two-way replication link between all servers; if this isn't the case, the operators would have the responsibility to set up the actual replication semantics on their own.
2. The new node now starts getting updates from all the nodes in the cluster. It keeps them in a log for now, not doing anything yet.
3. Ask that node for a complete update of all of its current state.
4. When it has the complete state of the existing node, it replays all of the remembered logs that it didn't have a chance to apply yet.
5. Then it announces that it is in a valid state to start accepting client connections.

Note that this process is likely to be very sensitive to high data volumes. That is why you'll usually want to select a backup node to read from, and that decision is an ops decision. You'll also want to be able to report extensively on the current status of the node, since this can take a while, and ops will be watching this very closely.

Server name

A node requires a unique name. We can use GUIDs, but those aren't readable, so we can use machine name + port, but those can change. Ideally, we can require the user to set us up with a unique name. That is important for readability and for being able to later see all the values we have in all the nodes. It is important that names are never repeated, so we'll probably have a GUID there anyway, just to be on the safe side.

Actual replication semantics

Since we have the new server problem down to an automated process, we can choose the drastically simpler model of just having an internal queue for each replication destination. Whenever we make a change, we also make a note of that in the queue for that destination, then we start an async replication process to that server, sending all of our updates there. It is always safe to overwrite data using replication, because we are overwriting our own data, never anyone else's. And… that is about it, actually.
There are probably a lot of details that I am missing or would discover if we were to actually implement this. But I think that this gives a pretty good idea of what this feature is about.
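The sub-counter idea and the "replicate only local state" rule can be sketched in a few lines. This is a toy in-memory model written in Java for illustration, not RavenDB code and not the storage interface above: the class CounterNode and its methods are hypothetical, and replication is a direct method call instead of a queued, asynchronous process.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of one node; all names are illustrative assumptions.
final class CounterNode {
    private final String nodeName;
    // counter name -> (source node name -> that node's sub-counter value)
    private final Map<String, Map<String, Long>> values = new HashMap<>();

    CounterNode(String nodeName) {
        this.nodeName = nodeName;
    }

    /** A local increment only ever touches this node's own sub-counter. */
    void increment(String counter, long change) {
        values.computeIfAbsent(counter, k -> new HashMap<>())
              .merge(nodeName, change, Long::sum);
    }

    /** Reading sums the sub-counters from every source we know about. */
    long read(String counter) {
        long total = 0;
        Map<String, Long> perSource =
                values.getOrDefault(counter, Collections.<String, Long>emptyMap());
        for (long value : perSource.values()) {
            total += value;
        }
        return total;
    }

    /** A replicated update overwrites the sender's sub-counter, never anyone else's. */
    void applyReplicated(String source, String counter, long value) {
        values.computeIfAbsent(counter, k -> new HashMap<>()).put(source, value);
    }

    /** Sends only our local sub-counter values to another node. */
    void replicateTo(CounterNode other) {
        for (Map.Entry<String, Map<String, Long>> entry : values.entrySet()) {
            Long own = entry.getValue().get(nodeName);
            if (own != null) {
                other.applyReplicated(nodeName, entry.getKey(), own);
            }
        }
    }

    public static void main(String[] args) {
        CounterNode rvn1 = new CounterNode("rvn1");
        CounterNode rvn2 = new CounterNode("rvn2");
        rvn1.increment("users/1/visits", 21);
        rvn2.increment("users/1/visits", 22);
        rvn1.replicateTo(rvn2);
        rvn2.replicateTo(rvn1);
        System.out.println(rvn1.read("users/1/visits"));
    }
}
```

Because each node sends the current value of its own sub-counter rather than a delta, replaying the same replication message is harmless in this model, which is one reason overwriting on replication is safe.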
March 25, 2014
by Oren Eini
· 12,600 Views · 1 Like
Estimating Statistics via Bootstrapping and Monte Carlo Simulation
We want to estimate some "statistics" (e.g. average income, 95th percentile height, variance of weight, etc.) of a population. It would be too tedious to enumerate all members of the whole population. For efficiency reasons, we randomly pick a number of samples from the population and compute the statistics of the sample set to estimate the corresponding statistics of the population. We understand that an estimate obtained this way (via random sampling) can deviate from the population. Therefore, in addition to our estimated statistics, we also include a "standard error" (how far our estimate may deviate from the actual population statistics) or a "confidence interval" (a lower and upper bound that we are confident contains the true statistic).

The challenge is how to estimate the "standard error" or the "confidence interval". A straightforward way is to repeat the sampling exercise many times, each time creating a different sample set from which we compute one estimate. Then we look across all estimates from the different sample sets to estimate the standard error and confidence interval of the estimate. But what if collecting data for a different sample set is expensive, or for any reason the population is no longer accessible after we collected our first sample set? Bootstrapping provides a way to address this.

Bootstrapping

Instead of creating additional sample sets from the population, we create additional sample sets by re-sampling data (with replacement) from the original sample set. Each of the created sample sets will follow the same data distribution as the original sample set, which in turn follows the population. R provides a nice "boot" library to do this.

> library(boot)
> # Generate a population
> population.weight <- rnorm(100000, 160, 60)
> # Let's say we care about the ninetieth percentile
> quantile(population.weight, 0.9)
     90%
236.8105
> # We create our first sample set of 500 samples
> sample_set1 <- sample(population.weight, 500)
> # Here is our sample statistic of the ninetieth percentile
> quantile(sample_set1, 0.9)
     90%
232.3641
> # Notice that the sample statistic deviates from the population statistic
> # We want to estimate how big this deviation is by using bootstrapping
> # I need to define my function to compute the statistic
> ninety_percentile <- function(x, idx) {return(quantile(x[idx], 0.9))}
> # Bootstrapping will call this function many times with different idx
> boot_result <- boot(data=sample_set1, statistic=ninety_percentile, R=1000)
> boot_result

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = sample_set1, statistic = ninety_percentile, R = 1000)

Bootstrap Statistics :
    original     bias    std. error
t1* 232.3641   2.379859     5.43342
> plot(boot_result)
> boot.ci(boot_result, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = boot_result, type = "bca")

Intervals :
Level       BCa
95%   (227.2, 248.1 )
Calculations and Intervals on Original Scale

Here is the visual output of the bootstrap plot.

Bootstrapping is a powerful simulation technique for estimating any statistic in an empirical way. It is also non-parametric because it doesn't assume any model or parameters and just uses the original sample set to estimate the statistics. If we assume a certain distribution model and want to see the distribution of certain statistics, Monte Carlo simulation provides a powerful way to do this.

Monte Carlo Simulation

The idea is pretty simple: based on a particular distribution function (defined by specific model parameters), we generate many sets of samples. We compute the statistic of each sample set and see how the statistic is distributed across the different sample sets. For example, given a normally distributed population, what is the probability distribution of the max value of 5 randomly chosen samples?

> sample_stats <- rep(0, 1000)
> for (i in 1:1000) {
+     sample_stats[i] <- max(rnorm(5))
+ }
> mean(sample_stats)
[1] 1.153008
> sd(sample_stats)
[1] 0.6584022
> par(mfrow=c(1,2))
> hist(sample_stats, breaks=30)
> qqnorm(sample_stats)
> qqline(sample_stats)

Here is the distribution of the "max(5)" statistic, which shows some right skewness.

Bootstrapping and Monte Carlo simulation are powerful tools to estimate statistics in an empirical manner, especially when we don't have an analytic form of solution.
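To make the resampling step concrete outside of R, here is a minimal Java sketch of a bootstrap standard error for a quantile. It is not a port of the boot package: the class BootstrapSketch and its methods are hypothetical, the quantile uses linear interpolation between order statistics (which I believe matches R's default, but treat that as an assumption), and the input sample is assumed to be non-empty with p between 0 and 1.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of bootstrap resampling; names and structure are illustrative only.
final class BootstrapSketch {

    /** Quantile by linear interpolation between order statistics; data must be non-empty. */
    static double quantile(double[] data, double p) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        double h = (sorted.length - 1) * p;
        int lo = (int) Math.floor(h);
        int hi = Math.min(lo + 1, sorted.length - 1);
        return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
    }

    /** Draws resamples with replacement from the sample and returns the quantile of each. */
    static double[] bootstrap(double[] sample, int replicates, double p, Random rng) {
        double[] stats = new double[replicates];
        double[] resample = new double[sample.length];
        for (int r = 0; r < replicates; r++) {
            for (int i = 0; i < resample.length; i++) {
                resample[i] = sample[rng.nextInt(sample.length)];
            }
            stats[r] = quantile(resample, p);
        }
        return stats;
    }

    /** Standard deviation of the replicate statistics (n - 1 denominator). */
    static double standardError(double[] stats) {
        double mean = 0;
        for (double s : stats) {
            mean += s;
        }
        mean /= stats.length;
        double sumSquares = 0;
        for (double s : stats) {
            sumSquares += (s - mean) * (s - mean);
        }
        return Math.sqrt(sumSquares / (stats.length - 1));
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] sample = new double[500];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = 160 + 60 * rng.nextGaussian(); // same mean and sd as the R example
        }
        double[] stats = bootstrap(sample, 1000, 0.9, rng);
        System.out.println("sample 90th percentile: " + quantile(sample, 0.9));
        System.out.println("bootstrap std. error:   " + standardError(stats));
    }
}
```

The numbers printed will differ from the R session above because the random samples differ; the point is only the shape of the procedure: resample with replacement, recompute the statistic, then look at its spread.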
March 21, 2014
by Ricky Ho
· 5,173 Views
WSO2 DSS: Batch Insert Sample (End to End)
WSO2 DSS wraps the Data Services Layer and provides us with a simple GUI to define a Data Service with zero Java code. With this, a change to the data source is just a simple click away and no other party needs to be aware of it.

In this sample demonstration, we will see how to do a batch insert into a table. Batch insert is useful when you want to insert data in a sequential manner. It also means that if at least one of the insertion queries fails, all the other queries run so far in the batch will be rolled back as well: if one insertion in the batch fails, the whole batch fails. This can be used if you are running the same query to insert data many times. With batch insert, all the data is sent in one call, which reduces the number of calls you have to make to get the data inserted. This comes with one condition: the query must not produce results. (We will only be notified whether the query was successful or not.)

Prerequisites:

WSO2 Data Services Server - http://wso2.com/products/data-services-server/ (current latest 3.1.1)
MySQL connector (JDBC) - https://www.mysql.com/products/connector/

If we already have a data service running which does not send back a result set, then it's just a matter of adding the following property to the service declaration.

    enableBatchRequests="true"

Anyway, I will demonstrate creating the service from scratch.

1. Create a service as follows, going through the wizard.
2. Create the data source.
3. Create the query. (This is an insert query. Also note the input mappings we have added as relevant to the query. To learn more about input mapping and validation, refer to the documentation.)
4. Create the operation. Select the query to be executed once the operation is called. By enabling "return request status", we will be notified whether the operation was a success or not.
5. Try it! When we list the services we will now see this new service. On the right we will have an option to try it.
Here we can see the option to try the service by giving the input parameters. Here I have tried two insertions in a batch.

Now if we go to the XML view of the service, it will be similar to the following, which is saved on the server as a .dbs file.

    com.mysql.jdbc.Driver
    jdbc:mysql://localhost:3306/json_array
    root
    root
    1
    10
    SELECT 1
    insert into flights (flight_no, number_of_cases, created_by, description, trips) values (:flight_no,:number_of_cases,:created_by,:description,:trips)

If we click on the service name in the list of services, we will be directed to the Service Dashboard, where we can see several other options for the service. It provides the option to generate an Axis2 client for the service. Once we have the client, it's a matter of calling the methods in the stub as follows.

    private static BatchRequestSampleOldStub.AddFlight_type0 createFlight(int cases, String creator, String description, int trips) {
        BatchRequestSampleOldStub.AddFlight_type0 val = new BatchRequestSampleOldStub.AddFlight_type0();
        val.setNumber_of_cases(cases);
        val.setCreated_by(creator);
        val.setDescription(description);
        val.setTrips(trips);
        printFlightInfo(cases, creator, description, trips);
        return val;
    }

    public static void main(String[] args) throws Exception {
        String epr = "http://localhost:9763" + "/services/BatchInsertSample";
        BatchRequestSampleOldStub stub = new BatchRequestSampleOldStub(epr);
        BatchRequestSampleOldStub.AddFlight_batch_req vals1 = new BatchRequestSampleOldStub.AddFlight_batch_req();
        vals1.addAddFlight(createFlight(1, "Pushpalanka", "test", 2));
        vals1.addAddFlight(createFlight(2, "Jayawardhana", "test", 2));
        vals1.addAddFlight(createFlight(3, "[email protected]", "test", 2));
        try {
            System.out.println("Executing Add Flights..");
            stub.addFlight_batch_req(vals1);
        } catch (Exception e) {
            System.out.println("Error in Add Flights!");
        }
    }

Complete client code can be found here. Cheers!

Ref: http://docs.wso2.org/display/DSS311/Batch+Processing+Sample
March 21, 2014
by Pushpalanka Jayawardhana
· 10,088 Views
Exporting Spring Data JPA Repositories as REST Services using Spring Data REST
Spring Data provides various modules to work with various types of data sources, such as RDBMS and NoSQL stores, in a unified way. In my previous article, SpringMVC4 + Spring Data JPA + SpringSecurity configuration using JavaConfig, I explained how to configure Spring Data JPA using JavaConfig. In this post let us see how we can use Spring Data JPA repositories and export JPA entities as REST endpoints using Spring Data REST.

First let us configure the spring-data-jpa and spring-data-rest-webmvc dependencies in our pom.xml.

    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-jpa</artifactId>
        <version>1.5.0.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.data</groupId>
        <artifactId>spring-data-rest-webmvc</artifactId>
        <version>2.0.0.RELEASE</version>
    </dependency>

Make sure you have the latest released versions configured correctly, otherwise you will encounter the following error:

    java.lang.ClassNotFoundException: org.springframework.data.mapping.SimplePropertyHandler

Create JPA entities.

    @Entity
    @Table(name = "USERS")
    public class User implements Serializable {
        private static final long serialVersionUID = 1L;

        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        @Column(name = "user_id")
        private Integer id;

        @Column(name = "username", nullable = false, unique = true, length = 50)
        private String userName;

        @Column(name = "password", nullable = false, length = 50)
        private String password;

        @Column(name = "firstname", nullable = false, length = 50)
        private String firstName;

        @Column(name = "lastname", length = 50)
        private String lastName;

        @Column(name = "email", nullable = false, unique = true, length = 50)
        private String email;

        @Temporal(TemporalType.DATE)
        private Date dob;

        private boolean enabled = true;

        @OneToMany(fetch = FetchType.EAGER, cascade = CascadeType.ALL)
        @JoinColumn(name = "user_id")
        private Set<Role> roles = new HashSet<>();

        @OneToMany(mappedBy = "user")
        private List<Contact> contacts = new ArrayList<>();

        // setters and getters
    }

    @Entity
    @Table(name = "ROLES")
    public class Role implements Serializable {
        private static final long serialVersionUID = 1L;

        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        @Column(name = "role_id")
        private Integer id;

        @Column(name = "role_name", nullable = false)
        private String roleName;

        // setters and getters
    }

    @Entity
    @Table(name = "CONTACTS")
    public class Contact implements Serializable {
        private static final long serialVersionUID = 1L;

        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        @Column(name = "contact_id")
        private Integer id;

        @Column(name = "firstname", nullable = false, length = 50)
        private String firstName;

        @Column(name = "lastname", length = 50)
        private String lastName;

        @Column(name = "email", nullable = false, unique = true, length = 50)
        private String email;

        @Temporal(TemporalType.DATE)
        private Date dob;

        @ManyToOne
        @JoinColumn(name = "user_id")
        private User user;

        // setters and getters
    }

Configure the DispatcherServlet using AbstractAnnotationConfigDispatcherServletInitializer. Observe that we have added RepositoryRestMvcConfiguration.class to the getServletConfigClasses() method. RepositoryRestMvcConfiguration is the one which does the heavy lifting of looking for Spring Data repositories and exporting them as REST endpoints.
    package com.sivalabs.springdatarest.web.config;

    import javax.servlet.Filter;
    import org.springframework.data.rest.webmvc.config.RepositoryRestMvcConfiguration;
    import org.springframework.orm.jpa.support.OpenEntityManagerInViewFilter;
    import org.springframework.web.servlet.support.AbstractAnnotationConfigDispatcherServletInitializer;
    import com.sivalabs.springdatarest.config.AppConfig;

    public class SpringWebAppInitializer extends AbstractAnnotationConfigDispatcherServletInitializer {

        @Override
        protected Class<?>[] getRootConfigClasses() {
            return new Class<?>[] { AppConfig.class };
        }

        @Override
        protected Class<?>[] getServletConfigClasses() {
            return new Class<?>[] { WebMvcConfig.class, RepositoryRestMvcConfiguration.class };
        }

        @Override
        protected String[] getServletMappings() {
            return new String[] { "/rest/*" };
        }

        @Override
        protected Filter[] getServletFilters() {
            return new Filter[] { new OpenEntityManagerInViewFilter() };
        }
    }

Create Spring Data JPA repositories for the JPA entities.

    public interface UserRepository extends JpaRepository<User, Integer> {
    }

    public interface RoleRepository extends JpaRepository<Role, Integer> {
    }

    public interface ContactRepository extends JpaRepository<Contact, Integer> {
    }

That's it. Spring Data REST will take care of the rest. You can use Spring Rest Shell (https://github.com/spring-projects/rest-shell) or Chrome's Postman add-on to test the exported REST services.
    D:\rest-shell-1.2.1.RELEASE\bin>rest-shell
    http://localhost:8080:>

Now we can change the baseUri using the baseUri command as follows:

    http://localhost:8080:>baseUri http://localhost:8080/spring-data-rest-demo/rest/
    http://localhost:8080/spring-data-rest-demo/rest/>

    http://localhost:8080/spring-data-rest-demo/rest/>list
    rel         href
    ======================================================================================
    users       http://localhost:8080/spring-data-rest-demo/rest/users{?page,size,sort}
    roles       http://localhost:8080/spring-data-rest-demo/rest/roles{?page,size,sort}
    contacts    http://localhost:8080/spring-data-rest-demo/rest/contacts{?page,size,sort}

Note: It seems there is an issue with rest-shell when the DispatcherServlet URL is mapped to "/": issuing the list command then responds with "No resources found".

    http://localhost:8080/spring-data-rest-demo/rest/>get users/
    {
      "_links": {
        "self": {
          "href": "http://localhost:8080/spring-data-rest-demo/rest/users/{?page,size,sort}",
          "templated": true
        },
        "search": {
          "href": "http://localhost:8080/spring-data-rest-demo/rest/users/search"
        }
      },
      "_embedded": {
        "users": [
          {
            "userName": "admin",
            "password": "admin",
            "firstName": "Administrator",
            "lastName": null,
            "email": "[email protected]",
            "dob": null,
            "enabled": true,
            "_links": {
              "self": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/1" },
              "roles": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/1/roles" },
              "contacts": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/1/contacts" }
            }
          },
          {
            "userName": "siva",
            "password": "siva",
            "firstName": "Siva",
            "lastName": null,
            "email": "[email protected]",
            "dob": null,
            "enabled": true,
            "_links": {
              "self": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/2" },
              "roles": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/2/roles" },
              "contacts": { "href": "http://localhost:8080/spring-data-rest-demo/rest/users/2/contacts" }
            }
          }
        ]
      },
      "page": {
        "size": 20,
        "totalElements": 2,
        "totalPages": 1,
        "number": 0
      }
    }

You can find the source code at https://github.com/sivaprasadreddy/sivalabs-blog-samples-code/tree/master/spring-data-rest-demo

For more info on Spring Rest Shell: https://github.com/spring-projects/rest-shell
March 7, 2014
by Siva Prasad Reddy Katamreddy
· 29,982 Views
Convert CSV Data to Avro Data
In one of my previous posts I explained how to convert JSON data to Avro data and vice versa using the avro-tools command-line option. Today I looked at the options for converting CSV data to Avro format; as of now, avro-tools has no option to accomplish this. We can either write our own Java program (a MapReduce program or a simple Java program), or we can use the various SerDes available with Hive to do this quickly and without writing any code :)

To convert CSV data to Avro data using Hive we need to follow the steps below:

1. Create a Hive table stored as textfile and specify your CSV delimiter.
2. Load the CSV file into the above table using the "load data" command.
3. Create another Hive table using AvroSerDe.
4. Insert data from the former table into the new Avro Hive table using the "insert overwrite" command.

To demonstrate this I will use the data below (student.csv):

0,38,91
0,65,28
0,78,16
1,34,96
1,78,14
1,11,43

Now execute the queries below in Hive:

--1. Create a Hive table stored as textfile
USE test;
CREATE TABLE csv_table (
  student_id INT,
  subject_id INT,
  marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

--2. Load csv_table with student.csv data
LOAD DATA LOCAL INPATH "/path/to/student.csv" OVERWRITE INTO TABLE test.csv_table;

--3. Create another Hive table using AvroSerDe
CREATE TABLE avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "namespace": "com.rishav.avro",
    "name": "student_marks",
    "type": "record",
    "fields": [
      { "name":"student_id","type":"int"},
      { "name":"subject_id","type":"int"},
      { "name":"marks","type":"int"}]
  }');

--4. Load avro_table with data from csv_table
INSERT OVERWRITE TABLE avro_table SELECT student_id, subject_id, marks FROM csv_table;

Now you can get the data in Avro format from the Hive warehouse folder. To dump this file to the local file system, use the command below:

hadoop fs -cat /path/to/warehouse/test.db/avro_table/* > student.avro

If you want to get JSON data from this Avro file, you can use the avro-tools command:

java -jar avro-tools-1.7.5.jar tojson student.avro > student.json

So we can easily convert CSV to Avro, and CSV to JSON as well, by writing just four HiveQL statements.
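For readers who want to see the column-to-field mapping spelled out, here is a minimal plain-Java sketch of the "write our own program" route mentioned above. It does not use Hive or the Avro library and does not produce Avro binary; it only maps each CSV row onto the three int fields named in the schema and prints one JSON object per row, which is roughly the record shape one would expect from the tojson step. The class and method names (`StudentCsvToJson`, `toJson`, `convert`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive or Avro: it only mirrors the column-to-field mapping.
final class StudentCsvToJson {
    // Field names copied from the avro.schema.literal used for avro_table.
    private static final String[] FIELDS = {"student_id", "subject_id", "marks"};

    /** Converts one CSV row such as "0,38,91" into a single-line JSON object. */
    static String toJson(String csvLine) {
        String[] parts = csvLine.trim().split(",");
        if (parts.length != FIELDS.length) {
            throw new IllegalArgumentException(
                "Expected " + FIELDS.length + " columns: " + csvLine);
        }
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < FIELDS.length; i++) {
            if (i > 0) {
                json.append(',');
            }
            // Integer.parseInt rejects non-numeric input, matching the schema's "int" type.
            int value = Integer.parseInt(parts[i].trim());
            json.append('"').append(FIELDS[i]).append("\":").append(value);
        }
        return json.append('}').toString();
    }

    /** Converts every row, preserving the input order. */
    static List<String> convert(List<String> csvLines) {
        List<String> result = new ArrayList<>();
        for (String line : csvLines) {
            result.add(toJson(line));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String json : convert(Arrays.asList("0,38,91", "0,65,28"))) {
            System.out.println(json);
        }
    }
}
```

For anything beyond a quick check, the Hive route above remains the simpler option, since it writes real Avro container files.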
March 5, 2014
by Rishav Rohit
· 39,678 Views · 1 Like
When to Use MongoDB Rather than MySQL (or Other RDBMS): The Billing Example
NoSQL has been a hot topic for quite a long time (well, it's no longer just a buzzword). However, when should we really use it instead of an RDBMS?
March 3, 2014
by Moshe Kaplan
· 378,832 Views · 12 Likes
Python CSV Files: Reading and Writing
Learn to parse CSV (Comma Separated Values) files with Python examples using the csv module's reader function and DictReader class.
March 3, 2014
by Mike Driscoll
· 375,586 Views · 6 Likes
Getting Started with Mocking in Java using Mockito
We all write unit tests, but the challenge we face at times is that the unit under test might depend on other components. Configuring those components just for unit testing is overkill. Instead, we can use mocks in place of the other components and continue with the unit testing.

To show how one can use mocks, I have a data access layer (DAL), basically a class which provides an API for the application to access and modify the data in the data repository. I then unit test the DAL without actually needing to connect to the data repository. The data repository can be a local database, a remote database, a file system, or any place where we can store and retrieve the data. The use of a DAL class helps us keep the data mappers separate from the application code.

Let's create a Java project using Maven.

mvn archetype:generate -DgroupId=info.sanaulla -DartifactId=MockitoDemo -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false

The above creates a folder MockitoDemo and then creates the entire directory structure for source and test files. Consider the model class below for this example:

package info.sanaulla.models;
import java.util.List;
/**
 * Model class for the book details.
*/ public class Book { private String isbn; private String title; private List authors; private String publication; private Integer yearOfPublication; private Integer numberOfPages; private String image; public Book(String isbn, String title, List authors, String publication, Integer yearOfPublication, Integer numberOfPages, String image){ this.isbn = isbn; this.title = title; this.authors = authors; this.publication = publication; this.yearOfPublication = yearOfPublication; this.numberOfPages = numberOfPages; this.image = image; } public String getIsbn() { return isbn; } public String getTitle() { return title; } public List getAuthors() { return authors; } public String getPublication() { return publication; } public Integer getYearOfPublication() { return yearOfPublication; } public Integer getNumberOfPages() { return numberOfPages; } public String getImage() { return image; } } The DAL class for operating on the Book model class is: package info.sanaulla.dal; import info.sanaulla.models.Book; import java.util.ArrayList; import java.util.Arrays; import java.util.Collections; import java.util.List; /** * API layer for persisting and retrieving the Book objects. */ public class BookDAL { private static BookDAL bookDAL = new BookDAL(); public List getAllBooks(){ return Collections.EMPTY_LIST; } public Book getBook(String isbn){ return null; } public String addBook(Book book){ return book.getIsbn(); } public String updateBook(Book book){ return book.getIsbn(); } public static BookDAL getInstance(){ return bookDAL; } } The DAL layer above currently has no functionality and we are going to unit test that piece of code (TDD). The DAL layer might communicate with a ORM Mapper or Database API which we are not concerned while designing the API. Test Driving the DAL Layer There are lot of frameworks for Unit testing and mocking in Java but for this example I would be picking JUnit for unit testing and Mockito for mocking. 
We would have to update the dependencies in Maven's pom.xml:

<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>info.sanaulla</groupId>
  <artifactId>MockitoDemo</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>MockitoDemo</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.10</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.mockito</groupId>
      <artifactId>mockito-all</artifactId>
      <version>1.9.5</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
</project>

Now let's unit test the BookDAL. During the unit testing we will inject mock data into the BookDAL so that we can complete the testing of the API without depending on the data source. Initially we will have an empty test class:

public class BookDALTest {
  public void setUp() throws Exception { }
  public void testGetAllBooks() throws Exception { }
  public void testGetBook() throws Exception { }
  public void testAddBook() throws Exception { }
  public void testUpdateBook() throws Exception { }
}

We will inject the mock BookDAL and mock data in setUp() as shown below:

public class BookDALTest {
  private static BookDAL mockedBookDAL;
  private static Book book1;
  private static Book book2;

  @BeforeClass
  public static void setUp(){
    //Create mock object of BookDAL
    mockedBookDAL = mock(BookDAL.class);
    //Create few instances of Book class.
    book1 = new Book("8131721019","Compilers Principles", Arrays.asList("D. Jeffrey Ulman","Ravi Sethi", "Alfred V. Aho", "Monica S. Lam"), "Pearson Education Singapore Pte Ltd", 2008,1009,"BOOK_IMAGE");
    book2 = new Book("9788183331630","Let Us C 13th Edition", Arrays.asList("Yashavant Kanetkar"),"BPB PUBLICATIONS", 2012,675,"BOOK_IMAGE");
    //Stubbing the methods of mocked BookDAL with mocked data.
when(mockedBookDAL.getAllBooks()).thenReturn(Arrays.asList(book1, book2)); when(mockedBookDAL.getBook("8131721019")).thenReturn(book1); when(mockedBookDAL.addBook(book1)).thenReturn(book1.getIsbn()); when(mockedBookDAL.updateBook(book1)).thenReturn(book1.getIsbn()); } public void testGetAllBooks() throws Exception {} public void testGetBook() throws Exception {} public void testAddBook() throws Exception {} public void testUpdateBook() throws Exception {} } In the above setUp() method I have: Created a mock object of BookDAL BookDAL mockedBookDAL = mock(BookDAL.class); Stubbed the API of BookDAL with mock data, such that when ever the API is invoked the mocked data is returned. //When getAllBooks() is invoked then return the given data and so on for the other methods. when(mockedBookDAL.getAllBooks()).thenReturn(Arrays.asList(book1, book2)); when(mockedBookDAL.getBook("8131721019")).thenReturn(book1); when(mockedBookDAL.addBook(book1)).thenReturn(book1.getIsbn()); when(mockedBookDAL.updateBook(book1)).thenReturn(book1.getIsbn()); Populating the rest of the tests we get: package info.sanaulla.dal; import info.sanaulla.models.Book; import org.junit.BeforeClass; import org.junit.Test; import static org.junit.Assert.*; import static org.mockito.Mockito.mock; import static org.mockito.Mockito.when; import java.util.Arrays; import java.util.List; public class BookDALTest { private static BookDAL mockedBookDAL; private static Book book1; private static Book book2; @BeforeClass public static void setUp(){ mockedBookDAL = mock(BookDAL.class); book1 = new Book("8131721019","Compilers Principles", Arrays.asList("D. Jeffrey Ulman","Ravi Sethi", "Alfred V. Aho", "Monica S. 
Lam"), "Pearson Education Singapore Pte Ltd", 2008,1009,"BOOK_IMAGE"); book2 = new Book("9788183331630","Let Us C 13th Edition", Arrays.asList("Yashavant Kanetkar"),"BPB PUBLICATIONS", 2012,675,"BOOK_IMAGE"); when(mockedBookDAL.getAllBooks()).thenReturn(Arrays.asList(book1, book2)); when(mockedBookDAL.getBook("8131721019")).thenReturn(book1); when(mockedBookDAL.addBook(book1)).thenReturn(book1.getIsbn()); when(mockedBookDAL.updateBook(book1)).thenReturn(book1.getIsbn()); } @Test public void testGetAllBooks() throws Exception { List allBooks = mockedBookDAL.getAllBooks(); assertEquals(2, allBooks.size()); Book myBook = allBooks.get(0); assertEquals("8131721019", myBook.getIsbn()); assertEquals("Compilers Principles", myBook.getTitle()); assertEquals(4, myBook.getAuthors().size()); assertEquals((Integer)2008, myBook.getYearOfPublication()); assertEquals((Integer) 1009, myBook.getNumberOfPages()); assertEquals("Pearson Education Singapore Pte Ltd", myBook.getPublication()); assertEquals("BOOK_IMAGE", myBook.getImage()); } @Test public void testGetBook(){ String isbn = "8131721019"; Book myBook = mockedBookDAL.getBook(isbn); assertNotNull(myBook); assertEquals(isbn, myBook.getIsbn()); assertEquals("Compilers Principles", myBook.getTitle()); assertEquals(4, myBook.getAuthors().size()); assertEquals("Pearson Education Singapore Pte Ltd", myBook.getPublication()); assertEquals((Integer)2008, myBook.getYearOfPublication()); assertEquals((Integer)1009, myBook.getNumberOfPages()); } @Test public void testAddBook(){ String isbn = mockedBookDAL.addBook(book1); assertNotNull(isbn); assertEquals(book1.getIsbn(), isbn); } @Test public void testUpdateBook(){ String isbn = mockedBookDAL.updateBook(book1); assertNotNull(isbn); assertEquals(book1.getIsbn(), isbn); } } One can run the test by using maven command: mvn test. 
The output is:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running info.sanaulla.AppTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.029 sec
Running info.sanaulla.dal.BookDALTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.209 sec

Results :

Tests run: 5, Failures: 0, Errors: 0, Skipped: 0

So, by using mocks, we have been able to test the DAL class without actually configuring the data source.
February 26, 2014
by Mohamed Sanaulla
· 233,424 Views · 18 Likes
How to Estimate Memory Consumption
This story goes back at least a decade, to when I was first approached by a PHB with the question "How big will the servers be that we need to buy for our production deployment?" The new and shiny system we had been building was nine months from production rollout, and apparently the company had promised to deliver the whole solution, including hardware. Oh boy, was I in trouble. With just a few years of experience under my belt, I might as well have rolled a die. Even though I am sure my complete lack of confidence was clearly visible, I still had to come up with an answer. Four hours of googling later, I recall sitting there with the same question still hovering in front of my bedazzled face: "How to estimate the need for computing power?"

In this post I start to open up the subject by giving you rough guidelines on how to estimate memory requirements for your brand new Java application. For the impatient ones: the answer will be to start with memory equal to approximately 5 x [amount of memory consumed by Live Data] and start the fine-tuning from there. For those more curious about the logic behind it, stay with me and I will walk you through the reasoning.

First and foremost, I recommend not answering a question phrased like this without detailed information being available. Your answer has to be based upon the performance requirements, so do not even start without clarifying those first. And I do not mean the way-too-ambiguous "The system needs to support 700 concurrent users", but a lot more specific requirements about latency and throughput, taking into account the amount of data and the usage patterns. One should not forget about the budget either: we can all dream about sub-millisecond latencies, but for those without HFT banking backbone budgets it will unfortunately remain a dream.

For now, let's assume you have those requirements in place. The next stop would be to create the load test scripts emulating user behaviour.
If you are now able to launch those scripts concurrently, you have built a foundation for the answer. As you might also have guessed, the next step involves our usual advice of measuring, not guessing. But with a caveat.

Live Data Size

Namely, our quest for the optimal memory configuration requires capturing the Live Data Size. Having captured this, we have the baseline configuration in place for the fine-tuning. How does one define live data size? Charlie Hunt and Binu John in their "Java Performance" book have given it the following definition:

Live data size is the heap size consumed by the set of long-lived objects required to run the application in its steady state.

Equipped with the definition, we are ready to run the load tests against the application with GC logging turned on (-XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log -XX:+PrintGCDetails) and visualize the logs (with the help of gcviewer, for example) to determine the moment when the application has reached its steady state. What you are after is a chart in which we can see the GC doing its job with both minor and Full GC runs, in the familiar double-sawtoothed pattern. This particular application seems to have achieved a steady state already after the first Full GC run, at the 21st second. In most cases, however, it takes 10-20 Full GC runs to spot the change in trends. After four Full GC runs we can estimate that the Live Data Size is approximately 100MB.

The aforementioned Java Performance book indicates that there is a strong correlation between the Live Data Size and the optimal memory configuration parameters in a typical Java EE application. The evidence from the field also backs up their recommendation:

Set the maximum heap size to 3-4 x [Live Data Size]

So, for our application at hand, we should set -Xmx to between 300m and 400m for the initial performance tests and take it from there.
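The arithmetic behind that starting point is small enough to write down. The sketch below only applies the 3-4 x rule of thumb quoted above to a measured Live Data Size; the class and method names (`HeapSizing`, `suggestedMaxHeapMb`) are hypothetical, and the result is a first guess for performance tests, not a tuned value.

```java
// Hypothetical helper: applies the "3-4 x Live Data Size" rule of thumb for -Xmx.
final class HeapSizing {
    /**
     * Returns {lower, upper} bounds in MB for the initial maximum heap size,
     * computed as 3x and 4x the measured live data size.
     */
    static long[] suggestedMaxHeapMb(long liveDataMb) {
        if (liveDataMb <= 0) {
            throw new IllegalArgumentException("Live data size must be positive");
        }
        return new long[] {3 * liveDataMb, 4 * liveDataMb};
    }

    public static void main(String[] args) {
        // 100 MB is the live data size estimated from the GC log in the text.
        long[] range = suggestedMaxHeapMb(100);
        System.out.println("Start with -Xmx" + range[0] + "m to -Xmx" + range[1] + "m");
    }
}
```

With the 100MB estimate from the text, this gives the same 300m to 400m range.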
We have mixed feelings about the other recommendations given in the book, which suggest setting the maximum permanent generation size to 1.2-1.5 x [Live Data Size of the Permanent Generation] and -XX:NewRatio to 1-1.5 x [Live Data Size]. We are currently gathering more data to determine whether the positive correlation exists, but until then I recommend basing your survivor and eden configuration decisions on monitoring your allocation rate instead.

Why should one bother, you might now ask. Indeed, two reasons for not caring immediately surface:

• An 8G memory chip is in sub-$100 territory at the time of writing this article
• Virtualization, especially when using large providers such as Amazon AWS, makes adjusting the capacity easy

Both of the reasons are partially valid and have definitely reduced the need for provisioning to be precise. But both of them still put you in the danger zone:

• When tossing in huge amounts of memory "just in case", you are most likely going to significantly affect the latency: with heaps above 8G it is darn easy to introduce Full GC pauses spanning tens of seconds.
• When over-provisioning with the mindset of "let's tune it later", the "later" part has a tendency of never arriving. I have faced numerous applications running on vastly over-provisioned environments just because of this. For example, the aforementioned application, which I discovered running on an Amazon EC2 m1.xlarge instance, was costing the company $4,200 per instance per year. Converting it to m1.small reduced the bill to just $520 for the instance. An 8-fold cost reduction will be visible in your operations budget if your deployments are large, trust me on this.

Summary

Unfortunately, I still see way too many decisions made exactly like the one I was forced to make a decade ago. This leads to under- and over-planning of capacity, both of which can be equally poor choices, especially if you cannot enjoy the benefits of virtualization.
I got lucky with mine, but you might not get away with your guesstimate, so I recommend actually planning ahead using the simple framework described in this post. If you enjoyed the content, I recommend following our performance tuning advice on Twitter.
February 25, 2014
by Nikita Salnikov-Tarnovski
· 11,713 Views
Voron & Time Series Data: Getting Real Data Outputs
So far, we have just put the data in and out. And we have had a pretty good track record doing so. However, what do we do with the data now that we have it? As you can expect, we need to read it out, usually by specific date ranges. The interesting thing is that we are usually not interested in just a single channel; we care about multiple channels. And for fun, those channels might be synchronized or not. An example of the first might be the current speed and the current engine temperature in a car. They generally share the exact same timestamps. An example of out-of-sync channels is a sensor on a rooftop measuring rainfall and another sensor in the sewer measuring water flow rates. (Again, thanks to Dan for helping me with the domain.)

This is interesting, because it presents quite a few problems:

• We need to merge different streams into a unified view.
• We need to handle both matching and non-matching sequences.
• We need to handle erroneous data: what happens when we have two readings at the same time for the same sensor? Yes, that shouldn't happen, but it does.

I solved this with the following API:

public class RangeEntry
{
    public DateTime Timestamp;
    public double?[] Values;
}

IEnumerable<RangeEntry> results = dts.ScanRanges(DateTime.MinValue, DateTime.MaxValue, new[]
{
    "6febe146-e893-4f64-89f8-527f2dbaae9b",
    "707dcb42-c551-4f1a-9203-e4b0852516cf",
    "74d5bee8-9a7b-4d4e-bd85-5f92dfc22edb",
    "7ae29feb-6178-4930-bc38-a90adf99cfd3",
});

This API gives me the results in time order, with the values in the same positions as the ids requested, and with nulls where a particular sensor channel has no value for that time. The actual implementation relies on this method:

IEnumerable ScanRange(DateTime start, DateTime end, string id)

All this does is provide all the entries in a particular date range, for a particular channel.
Let us see how we implement multi channel scanning on top of this:

private class PendingEnumerator
{
    public IEnumerator Enumerator;
    public int Index;
}

private class PendingEnumerators
{
    private readonly SortedDictionary<DateTime, List<PendingEnumerator>> _values =
        new SortedDictionary<DateTime, List<PendingEnumerator>>();

    public void Enqueue(PendingEnumerator entry)
    {
        List<PendingEnumerator> list;
        var dateTime = entry.Enumerator.Current.Timestamp;
        if (_values.TryGetValue(dateTime, out list) == false)
        {
            _values.Add(dateTime, list = new List<PendingEnumerator>());
        }
        list.Add(entry);
    }

    public bool IsEmpty { get { return _values.Count == 0; } }

    public List<PendingEnumerator> Dequeue()
    {
        if (_values.Count == 0)
            return new List<PendingEnumerator>();
        var kvp = _values.First();
        _values.Remove(kvp.Key);
        return kvp.Value;
    }
}

public IEnumerable<RangeEntry> ScanRanges(DateTime start, DateTime end, string[] ids)
{
    if (ids == null || ids.Length == 0)
        yield break;

    var pending = new PendingEnumerators();
    for (int i = 0; i < ids.Length; i++)
    {
        var enumerator = ScanRange(start, end, ids[i]).GetEnumerator();
        if (enumerator.MoveNext() == false)
            continue;
        pending.Enqueue(new PendingEnumerator { Enumerator = enumerator, Index = i });
    }

    var result = new RangeEntry { Values = new double?[ids.Length] };
    while (pending.IsEmpty == false)
    {
        Array.Clear(result.Values, 0, result.Values.Length);
        var entries = pending.Dequeue();
        if (entries.Count == 0)
            break;
        foreach (var entry in entries)
        {
            var current = entry.Enumerator.Current;
            result.Timestamp = current.Timestamp;
            result.Values[entry.Index] = current.Value;
            if (entry.Enumerator.MoveNext())
                pending.Enqueue(entry);
        }
        yield return result;
    }
}

We are getting a single entry from each channel into the pending enumerators. Then, we collate all the entries that share the same time into a single entry. We use the Index property to track the actual expected index of the entry in the output. And we handle duplicate times in the same channel by outputting multiple entries. Testing this on my data set of 1.1 million records, we can get 185 thousand records back in 0.15 seconds.
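To make the collation step easier to try out in isolation, here is a rough Java sketch of the same idea using in-memory lists instead of Voron. It is not the author's implementation: `ChannelMerge`, `Reading`, `Row` and `Pending` are hypothetical names, timestamps are plain long values standing in for DateTime, each channel is assumed to be already sorted by timestamp, and a fresh row is allocated per output instead of reusing one result object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ChannelMerge {
    /** One reading in one channel (a long stands in for DateTime). */
    static final class Reading {
        final long timestamp;
        final double value;
        Reading(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** One merged row; values[i] is null when channel i has no reading at this time. */
    static final class Row {
        final long timestamp;
        final Double[] values;
        Row(long timestamp, Double[] values) {
            this.timestamp = timestamp;
            this.values = values;
        }
    }

    /** A cursor over one channel plus the output slot it writes to. */
    private static final class Pending {
        final Iterator<Reading> iterator;
        final int index;
        Reading current;
        Pending(Iterator<Reading> iterator, int index) {
            this.iterator = iterator;
            this.index = index;
            this.current = iterator.next();
        }
    }

    private static void enqueue(TreeMap<Long, List<Pending>> pending, Pending p) {
        pending.computeIfAbsent(p.current.timestamp, k -> new ArrayList<>()).add(p);
    }

    /** Merges channels (each sorted by timestamp) into rows ordered by time. */
    static List<Row> scanRanges(List<List<Reading>> channels) {
        TreeMap<Long, List<Pending>> pending = new TreeMap<>();
        for (int i = 0; i < channels.size(); i++) {
            Iterator<Reading> it = channels.get(i).iterator();
            if (it.hasNext()) {
                enqueue(pending, new Pending(it, i));
            }
        }
        List<Row> rows = new ArrayList<>();
        while (!pending.isEmpty()) {
            // Take every cursor currently positioned at the earliest timestamp.
            Map.Entry<Long, List<Pending>> earliest = pending.pollFirstEntry();
            Double[] values = new Double[channels.size()];
            for (Pending p : earliest.getValue()) {
                values[p.index] = p.current.value;
                if (p.iterator.hasNext()) {
                    p.current = p.iterator.next();
                    // A duplicate timestamp in the same channel lands in a new
                    // bucket and is emitted as a separate row on the next pass.
                    enqueue(pending, p);
                }
            }
            rows.add(new Row(earliest.getKey(), values));
        }
        return rows;
    }
}
```

Merging a channel with readings at times 1 and 2 with another that has readings at times 2 and 3 yields three rows, with a null in the slot of whichever channel has no reading at times 1 and 3.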
February 25, 2014
by Oren Eini
· 5,369 Views
The Risks Of Big-Bang Deployments And Techniques For Step-wise Deployment
If you ever need to persuade management why it might be better to deploy a larger change in multiple stages and push it to customers gradually, read on. A deployment of many changes is risky. We therefore want to deploy them in a way that minimizes the risk of harm to our customers and our companies. The deployment can be done either in an all-at-once (also known as big-bang) way or in a gradual way. We will argue here for the more gradual ("stepwise") approach.

Big-bang or stepwise deployment?

A big-bang deployment seems to be the natural thing to do: the full solution is developed and tested and then replaces the current system at once. However, it has two crucial flaws.

First, it assumes that most defects can be discovered by testing. However, due to differences between test and production environments, unknown dependencies, and the sheer scale of a typical larger system, there will always be problems that are not discovered until production deployment or even until the application has run for a while in production (which applies even to airplanes). The more parts have been changed, the more of these production defects will happen at the same time. A gradual deployment makes it possible to discover and handle them one by one.

Second, the more complex the deployment, the higher the chance of human error(s), i.e. the deployment itself is a likely source of serious defects.

Some of the drawbacks of a big-bang deployment in more detail:

• Complexity: A big-bang deployment requires coordination of many people and "moving parts" that depend on each other, providing a huge opportunity for human mistakes (i.e. there will be mistakes).
• A lot of time: Such a deployment requires a lot of time (typically also more than planned/expected) and thus a lot of downtime when users cannot use the system.
• Hard troubleshooting: With a network of inter-dependent parts that all changed at the same time, while perhaps also changing the infrastructure (i.e. connections between them), it is extremely hard to pinpoint the source of defects, thus considerably increasing the time to detect and correct defects while also increasing the risk of people stepping on each other's toes and of "panic fixes" that either cause more problems than they remove or are not good enough (such as the rollback that sped up Knight's downfall).
• Rollback is likely either impossible or as time-consuming and risky as the deployment itself, thus increasing the impact of defects and inviting even more human errors.
• Impact: Deploying everything to all users at the same time means that everybody will be impacted by a potential defect/error/mistake.
• Long freeze: Everything needs to be tested together after all development is finished, which requires a lot of time while the code is frozen and no more fixes and changes can get into production for weeks.

Risk mitigation

The goal of a good deployment plan is to mitigate the risk of the deployment and get it to an acceptable level. There are two aspects to risk: the probability of a defect and the impact of the defect. The following table shows how the possible measures affect them:

Defect probability reduction: testing; stepwise deployment
Defect impact reduction: gradual migration of users to the new version (e.g. 1 in 1000 or particular subsets); rollback mechanism
=> these also lead to a much lower time to detect and fix defects

Practices for stepwise deployment

• Enable stepwise deployment: Use parallel change and other Continuous Delivery techniques to make it possible to deploy updated components independently of each other, to switch new features on and off, and to switch which versions of the components they depend on are currently used. (Parallel change, i.e. keeping the old and new code and being able to use one or the other, is crucial here.
Also notice that parallel change applies to data as well: you will need to evolve your data schema gradually and keep both the old and the new one for a period of time.)
• Enable rollback. The previous measure, stepwise deployment, also makes it easy(ier) to roll back the changes by switching to a previous version of a dependency or by switching back to the old code.
• Migrate users gradually to the new version, i.e. expose the new version only to a small subset of the users initially and increase that subset until everybody uses it. This can be done e.g. by deploying to only a subset of servers and sending a random/particular subset of users to the new servers, but there are also ways to do it if you have only a single machine. (See e.g. my post Webapp Blue-Green Deployment Without Breaking Sessions/With Fallback With HAProxy.)
• Monitoring: make sure you are able to monitor the flow of users through the system and detect any anomalies and errors early, long before angry calls from the business. Tools such as Logstash, Google Analytics (with custom events from JavaScript), and client-side error logging via one of the existing services or a custom solution are invaluable.
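One common way to pick "1 in 1000" users consistently is to hash a stable user identifier into a fixed number of buckets and expose the new version to the lowest buckets first. The sketch below illustrates that idea only; it is not taken from the post or from HAProxy. `GradualRollout`, `bucket` and `useNewVersion` are hypothetical names, and `String.hashCode` is used for brevity even though it is not a perfectly uniform hash.

```java
// Hypothetical sketch of hash-based gradual user migration.
final class GradualRollout {
    private static final int BUCKETS = 1000;

    /** Maps a user id to a stable bucket in the range [0, 1000). */
    static int bucket(String userId) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(userId.hashCode(), BUCKETS);
    }

    /**
     * Returns true if the user should be routed to the new version.
     * perMille = 1 exposes roughly 1 in 1000 users; 1000 exposes everybody.
     */
    static boolean useNewVersion(String userId, int perMille) {
        if (perMille < 0 || perMille > BUCKETS) {
            throw new IllegalArgumentException("perMille must be between 0 and 1000");
        }
        return bucket(userId) < perMille;
    }

    public static void main(String[] args) {
        String user = "user-42";
        System.out.println("bucket=" + bucket(user)
            + ", new version at 10 per mille: " + useNewVersion(user, 10));
    }
}
```

Because the bucket depends only on the user id, a user who is moved to the new version stays there as the exposure grows, and setting the exposure back to 0 acts as a simple rollback switch.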
February 20, 2014
by Jakub Holý
· 22,117 Views
Eclipse's BIRT: Scripted Data Set
This article presents the usage of a scripted data set in Eclipse's BIRT.
February 18, 2014
by Kosta Stojanovski
· 38,772 Views · 1 Like
Voron & Time Series: Working with Real Data
Dan Liebster has been kind enough to send me a real world time series database. The data has been sanitized to remove identifying details, but this is actually real world data, so we can learn a lot more from it.

The first thing that I did was take the code in this post and try it out for size. I wrote the following:

int i = 0;
using (var parser = new TextFieldParser(@"c:\users\ayende\downloads\timeseries.csv"))
{
    parser.HasFieldsEnclosedInQuotes = true;
    parser.Delimiters = new[] {","};
    parser.ReadLine(); // ignore headers
    var startNew = Stopwatch.StartNew();
    while (parser.EndOfData == false)
    {
        var fields = parser.ReadFields();
        Debug.Assert(fields != null);
        dts.Add(fields[1], DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture), double.Parse(fields[3]));
        i++;
        if (i == 25*1000)
        {
            break;
        }
        if (i%1000 == 0)
            Console.Write("\r{0,15:#,#} ", i);
    }
    Console.WriteLine();
    Console.WriteLine(startNew.Elapsed);
}

Note that we are using a separate transaction per line, which means that we are really doing a lot of extra work. But this simulates very well incoming events arriving one at a time. We were able to process 25,000 events in 8.3 seconds, at a rate of just over 3 events per millisecond.

Now, note that we have in here the notion of "channels". From my investigation, it seems clear that some form of separation is actually very common in time series data. We are usually talking about sensors or some such, and we want to track data across different sensors over time. And there is little if any call for working over multiple sensors / channels at the same time.

Because of that, I made a relatively minor change in Voron that allows it to have an infinite number of separate trees. That means that I can use as many trees as I want, and we can model a channel as a tree in Voron. I also changed things so that, instead of doing a single transaction per line, we do a transaction per 1,000 lines. That dropped the time to insert 25,000 lines to 0.8 seconds, a full order of magnitude faster.

That done, I inserted the full data set, which is just over 1,096,384 records. That took 36 seconds. In the data set I have, there are 35 channels. I just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. That allows doing things like computing averages over time, comparing data, etc.

You can see the code implementing this in the following link.
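The effect of moving from one transaction per line to one per 1,000 lines can be shown without a storage engine by counting commits. The Java sketch below does only that: `BatchedWriter` and `CountingStore` are hypothetical stand-ins, not Voron's API, and nothing is persisted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BatchedWriter {
    /** Hypothetical stand-in for a storage engine: counts commits instead of persisting. */
    static final class CountingStore {
        int commits;
        int records;

        void commit(List<String> batch) {
            commits++;
            records += batch.size();
        }
    }

    /** Writes all lines, committing once per batchSize lines plus once for any remainder. */
    static void writeAll(Iterable<String> lines, int batchSize, CountingStore store) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> batch = new ArrayList<>(batchSize);
        for (String line : lines) {
            batch.add(line);
            if (batch.size() == batchSize) {
                store.commit(batch);
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {
            store.commit(batch); // flush the final partial batch
        }
    }

    public static void main(String[] args) {
        List<String> lines = Collections.nCopies(25_000, "row");
        CountingStore perLine = new CountingStore();
        writeAll(lines, 1, perLine);
        CountingStore batched = new CountingStore();
        writeAll(lines, 1_000, batched);
        System.out.println(perLine.commits + " commits vs " + batched.commits + " commits");
    }
}
```

For the 25,000 lines used in the test run, a batch size of 1,000 means 25 commits instead of 25,000, which is where a saving like the one reported above would come from if per-transaction overhead dominates.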
February 7, 2014
by Oren Eini
· 4,008 Views
Big Data Search, Part 6: Sorting Randomness
As it turns out, doing work on big data sets is quite hard. To start with, you need to get the data, and it is… well, big. So that takes a while. Instead, I decided to test my theory on the following scenario. Given 4 GB of random numbers, let us find how many times we have the number 1. Because I wanted to ensure a consistent answer, I wrote:

public static IEnumerable<uint> RandomNumbers()
{
    const long count = 1024 * 1024 * 1024L * 1;
    var random = new MyRand();
    for (long i = 0; i < count; i++)
    {
        if (i % 1024 == 0)
        {
            yield return 1;
            continue;
        }
        var result = random.NextUInt();
        while (result == 1)
        {
            result = random.NextUInt();
        }
        yield return result;
    }
}

/// <summary>
/// Based on Marsaglia, George. (2003). Xorshift RNGs.
/// http://www.jstatsoft.org/v08/i14/paper
/// </summary>
public class MyRand
{
    const uint Y = 842502087, Z = 3579807591, W = 273326509;
    uint _x, _y, _z, _w;

    public MyRand()
    {
        _y = Y;
        _z = Z;
        _w = W;
        _x = 1337;
    }

    public uint NextUInt()
    {
        uint t = _x ^ (_x << 11);
        _x = _y;
        _y = _z;
        _z = _w;
        return _w = (_w ^ (_w >> 19)) ^ (t ^ (t >> 8));
    }
}

I am using a custom Rand function because it is significantly faster than System.Random. This generates 4 GB of random numbers and also ensures that we get exactly 1,048,576 instances of 1. Generating this in an empty loop takes about 30 seconds on my machine.

For fun, I ran the external sort routine in 32-bit mode, with a buffer of 256 MB. It is currently processing things, but I expect it to take a while. Because the buffer is 256 MB in size, we flush it every 128 MB (while we still have half the buffer free to do more work). The interesting thing is that even though we generate random numbers, sorting and then compressing the values resulted in about a 60% compression rate.

The problem is that for this particular case, I am not sure if that is a good thing. Because the values are random, we need to select a pretty high degree of compression just to get a good compression rate.
And because of that, a significant amount of time is spent just compressing the data. I am pretty sure that for a real-world scenario it would be better, but that is something that we'll probably need to test. Not compressing the data in the random test is a huge help.

Next, external sort is pretty dependent on the performance of… sort, of course. And sort isn't that fast. In this scenario, we are sorting arrays of about 26 million items, and that takes time. Implementing parallel sort cut this down to less than a minute per batch of 26 million. That lets us complete the entire sorting process, but then it stalls at the merge.

The reason is that we push all the values through a heap, and there are 1 billion of them. The heap never exceeds 40 items, but that is still 1 billion * O(log 40), or about 5.4 billion comparisons that we have to do, and we do them sequentially, which takes time.

I tried thinking about ways to parallelize this, but I am not sure how that can be done. We have 40 sorted files, and we want to merge all of them. Obviously we can merge each set of 10 files in parallel, then merge the resulting 4, but the cost we have now is the actual sorting cost, not I/O. I am not sure how to approach this.

For what it is worth, you can find the code for this here.
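The merge step described above can be sketched roughly as follows. This is a minimal illustration in Java, not the author's C# code: the class and method names are hypothetical, the sorted runs are small in-memory arrays standing in for the 40 sorted files, and `long` is used to hold unsigned 32-bit values. The heap holds at most one entry per run, which is where the O(log k) cost per value comes from.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Hypothetical sketch of a k-way merge over sorted runs using a min-heap. */
final class KWayMergeSketch {

    /** Heap entry: one sorted run plus the position of its next unread value. */
    private static final class Cursor {
        final long[] run;
        int index;

        Cursor(long[] run) {
            this.run = run;
        }

        long current() {
            return run[index];
        }
    }

    /**
     * Merges already-sorted runs into one sorted array.
     * The heap never holds more than sortedRuns.length cursors, so each of the
     * n output values costs O(log k) comparisons, O(n log k) overall.
     */
    static long[] merge(long[][] sortedRuns) {
        int total = 0;
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.current(), b.current()));
        for (long[] run : sortedRuns) {
            total += run.length;
            if (run.length > 0) {
                heap.add(new Cursor(run));
            }
        }

        long[] merged = new long[total];
        int out = 0;
        while (!heap.isEmpty()) {
            // Take the run whose next value is smallest, emit it, then advance that run.
            Cursor smallest = heap.poll();
            merged[out++] = smallest.current();
            smallest.index++;
            if (smallest.index < smallest.run.length) {
                heap.add(smallest); // re-insert only while the run still has values
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        long[][] runs = { {1, 4, 9}, {1, 2, 8}, {}, {3} };
        System.out.println(Arrays.toString(merge(runs))); // prints [1, 1, 2, 3, 4, 8, 9]
    }
}
```

The loop is inherently sequential because each output value depends on the current heads of all runs, which matches the bottleneck the post describes.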
February 5, 2014
by Oren Eini
· 9,051 Views
AES-256 Encryption with Java and JCEKS
Security has become a major topic of discussion in the last few years, due to the release of documents by Edward Snowden and the explosion of hacking against online commerce stores like JC Penney, Sony and Target. While this post will not give you all of the tools to help prevent the use of illegally sourced data, it provides a starting point for building a set of tools and tactics that will help prevent the use of data by other parties.

This post will show how to adopt AES encryption for strings in a Java environment. It will talk about creating AES keys and storing AES keys in a JCEKS keystore format. A working example of the code in this blog is located at https://github.com/mike-ensor/aes-256-encryption-utility

It is recommended to read each section in order because each section builds on the previous one; however, you might want to jump straight to a particular section.

  • Setup - Set up and create keys with keytool
  • Encrypt - Encrypt messages using byte[] keys
  • Decrypt - Decrypt messages using the same IV and key from encryption
  • Obtain Keys from Keystore - Obtain keys from a keystore via an alias

What is JCEKS?

JCEKS stands for Java Cryptography Extension KeyStore, and it is an alternative keystore format for the Java platform. Storing keys in a KeyStore can be a measure to prevent your encryption keys from being exposed. Java KeyStores securely contain individual certificates and keys that can be referenced by an alias for use in a Java program. Java KeyStores are often created using the "keytool" provided with the Java JDK.

NOTE: It is strongly recommended to create a complex passcode for KeyStores to keep the contents secure. The KeyStore is a file that is considered to be public, but it is advisable not to give easy access to the file.

Setup

Encryption is governed by the laws of each country, which often place restrictions on the strength of the encryption.
One example is that in the United States, all encryption over 128-bit is restricted if the data is traveling outside of the border. By default, the Java JCE implements a strength policy to comply with these rules. If stronger encryption is preferred, and adheres to the laws of the country, then the JCE needs to have access to the stronger encryption policy. Very plainly put, if you are planning on using AES 256-bit encryption, you must install the Unlimited Strength Jurisdiction Policy Files. Without the policies in place, 256-bit encryption is not possible.

Installation of JCE Unlimited Strength Policy

This post focuses on the keys rather than the installation and setup of the JCE. The installation is rather simple, with explicit instructions found here (NOTE: this is for JDK7; if using a different JDK, search for the appropriate JCE policy files).

Keystore Setup

With keytool, manipulating a keystore is simple. Keystores must be created with a link to a new key or during an import of an existing keystore. In order to create a new key and keystore, simply type:

    keytool -genseckey -keystore aes-keystore.jck -storetype jceks -storepass mystorepass -keyalg AES -keysize 256 -alias jceksaes -keypass mykeypass

Important Flags

Here are the explanations for the keytool parameters used in the example above.

Keystore Parameters

  • genseckey - Generate SecretKey. This is the flag indicating the creation of a symmetric key, which will become our AES key.
  • keystore - Location of the keystore. If the keystore does not exist, the tool will create a new store. Paths can be relative or absolute but must be local.
  • storetype - The type of store (JKS, PKCS12, JCEKS, etc.). JCEKS is used to store symmetric keys (AES) not contained within a certificate.
  • storepass - Password related to the keystore.
A strong passphrase for the keystore is highly recommended.

Key Parameters

  • keyalg - Algorithm used to create the key (AES/DES/etc.)
  • keysize - Size of the key (128, 192, 256, etc.)
  • alias - Alias given to the newly created key, used to reference the key later
  • keypass - Password protecting the use of the key

Encrypt

At the most basic level, encryption is a reversible algorithmic process that obfuscates data, where both parties hold the information needed to reverse it and know how the algorithm is used. In Java, this involves the use of a Cipher. A Cipher object in the JCE is a generic entry point into the encryption provider, typically selected by the algorithm. This example uses the default Java provider but would also work with Bouncy Castle.

Generating a Cipher object

Obtaining an instance of Cipher is rather easy, and the same process is required for both encryption and decryption. (NOTE: Encryption and decryption require the same algorithm but do not require the same object instance.)

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");

Once we have an instance of the Cipher, we can encrypt and decrypt data according to the algorithm. Often the algorithm will require additional pieces of information in order to encrypt/decrypt data. In this example, we will need to pass the algorithm the bytes containing the key and an initialization vector (explained below).

Initialization

In order to use the Cipher, we must first initialize it. This step is necessary so we can provide additional information to the algorithm, like the AES key and the initialization vector (aka IV).

    cipher.init(Cipher.ENCRYPT_MODE, secretKeySpecification, initialVector);

Parameters

The SecretKeySpecification is an object containing a reference to the bytes forming the AES key. The AES key is nothing more than a byte array of a specific size (256 bits, or 32 bytes, for AES-256) that is generated by the keytool (see above).
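To make the initialization step concrete, here is a minimal sketch of how the two arguments to cipher.init might be built with the standard javax.crypto classes. The class and helper names are hypothetical, and the key bytes are random stand-ins for a key that would normally come from the keystore. The sketch assumes a 16-byte IV, which matches the AES block size.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch: building the key and IV objects passed to cipher.init. */
final class CipherInitSketch {

    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns n random bytes; a stand-in for real key material or a fresh IV. */
    static byte[] randomBytes(int n) {
        byte[] bytes = new byte[n];
        RANDOM.nextBytes(bytes);
        return bytes;
    }

    /** Builds an AES/CBC cipher in encrypt mode from raw key and IV bytes. */
    static Cipher newEncryptCipher(byte[] keyBytes, byte[] ivBytes) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            SecretKeySpec secretKeySpecification = new SecretKeySpec(keyBytes, "AES");
            IvParameterSpec initialVector = new IvParameterSpec(ivBytes);
            cipher.init(Cipher.ENCRYPT_MODE, secretKeySpecification, initialVector);
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not initialize AES cipher", e);
        }
    }

    public static void main(String[] args) throws GeneralSecurityException {
        // 256-bit keys need the unlimited-strength policy on older JDKs,
        // so ask the runtime what it allows and fall back if necessary.
        int keyBits = Math.min(256, Cipher.getMaxAllowedKeyLength("AES"));
        Cipher cipher = newEncryptCipher(randomBytes(keyBits / 8), randomBytes(16));
        System.out.println("key bits: " + keyBits + ", block size: " + cipher.getBlockSize());
    }
}
```

Checking Cipher.getMaxAllowedKeyLength("AES") first is one way to find out whether the policy files discussed in the Setup section are in effect before attempting a 256-bit key.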
Alternative Parameters

There are multiple methods to create keys, such as a hash including a salt, username and password (or similar). This method would utilize a SHA1 hash of the concatenated strings, convert it to bytes and then truncate the result to the desired size. This post will not show the generation of a key using this method, or the use of a PBE key method using a password and salt. The password and/or salt usage for the keys is handled by the keytool using the inputs given during the creation of new keys.

Initialization Vector

The AES algorithm also requires a second parameter called the Initialization Vector. The IV is used to randomize the encrypted message and make the key harder to guess. The IV is considered a publicly shared piece of information, but again, it is not recommended to openly share it (for example, it wouldn't be wise to post it on your company's website). When encrypting a message, it is not uncommon to prepend the IV to the message, since the IV has a known, fixed size based on the algorithm.

NOTE: The AES algorithm will output the same result when given the same IV, key and message. It is recommended that the IV be randomly created each time an encryption takes place.

With the newly initialized Cipher, encrypting a message is simple. Simply call:

    byte[] encryptedMessageInBytes = cipher.doFinal(message.getBytes("UTF-8"));
    String base64EncodedEncryptedMsg = BaseEncoding.base64().encode(encryptedMessageInBytes);
    String base32EncodedEncryptedMsg = BaseEncoding.base32().encode(encryptedMessageInBytes);

Encoding Results

Byte arrays are difficult to visualize since they often do not form characters in any charset. The best recommendation to solve this is to represent the bytes in HEX (base-16), Base32 or Base64 format. If the message will be passed via a URL or POST parameter, be sure to use a web-safe Base64 encoding. The Google Guava library provides an excellent BaseEncoding utility.
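The note about identical output for an identical IV, key and message can be checked with a short sketch. To stay dependency-free it uses the JDK's java.util.Base64 (URL-safe variant) instead of Guava's BaseEncoding; the class and method names are hypothetical, and the all-zero key is for demonstration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch: encrypt, encode, and compare results across IVs. */
final class EncryptEncodeSketch {

    /** Encrypts message with AES/CBC and returns the ciphertext as URL-safe Base64 text. */
    static String encryptToBase64Url(byte[] key, byte[] iv, String message) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            byte[] encrypted = cipher.doFinal(message.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().encodeToString(encrypted);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Encryption failed", e);
        }
    }

    /** Creates a fresh random IV; AES has a 16-byte block, so the IV is 16 bytes. */
    static byte[] newRandomIv() {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    public static void main(String[] args) {
        byte[] key = new byte[16]; // all-zero demo key; never use such a key in practice
        byte[] iv = newRandomIv();

        String first = encryptToBase64Url(key, iv, "hello");
        String second = encryptToBase64Url(key, iv, "hello");
        String third = encryptToBase64Url(key, newRandomIv(), "hello");

        System.out.println("same IV gives same output:     " + first.equals(second));
        System.out.println("fresh IV gives another output: " + !first.equals(third));
    }
}
```

Because a fresh IV changes the ciphertext, the IV has to travel with the message so that the receiver can decrypt it.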
NOTE: Remember to decode the encoded message before decrypting.

Decrypt

Decrypting a message is almost the reverse of the encryption process, with a few exceptions. Unlike encryption, which generates a random IV, decryption requires the known initialization vector as a parameter.

Decryption

When decrypting, obtain a Cipher object with the same process as in the encryption method. The Cipher object will need to use exactly the same algorithm, including the mode and padding selections. Once the code has obtained a reference to a Cipher object, the next step is to initialize the cipher for decryption and pass in a reference to the key and the initialization vector.

    // key is the same byte[] key used in encryption
    SecretKeySpec secretKeySpecification = new SecretKeySpec(key, "AES");
    cipher.init(Cipher.DECRYPT_MODE, secretKeySpecification, initialVector);

NOTE: The key is stored in the keystore and obtained by the use of an alias. See below for details on obtaining keys from a keystore.

Once the cipher has been given the key and IV and initialized for decryption, it is ready to perform the decryption.

    byte[] encryptedTextBytes = BaseEncoding.base64().decode(message);
    byte[] decryptedTextBytes = cipher.doFinal(encryptedTextBytes);
    // decode with the same charset that was used when encrypting
    String origMessage = new String(decryptedTextBytes, "UTF-8");

Strategies to keep IV

The IV used to encrypt the message is needed to decrypt it, which raises the question: how do they stay together? One solution is to Base Encode (see above) the IV and prepend it to the encrypted and encoded message: Base64UrlSafe(myIv) + delimiter + Base64UrlSafe(encryptedMessage). Other possible solutions might be contextual, such as including one attribute in an XML file for the IV and one for the alias of the key used.

Obtain Key from Keystore

The beginning of this post has shown how easy it is to create new AES-256 keys that reference an alias inside of a keystore database.
The post then continues on how to encrypt and decrypt a message given a key, but has not yet shown how to obtain a reference to the key in a keystore.

Solution

    // for clarity, ignoring exceptions and failures
    InputStream keystoreStream = new FileInputStream(keystoreLocation);
    KeyStore keystore = KeyStore.getInstance("JCEKS");
    keystore.load(keystoreStream, keystorePass.toCharArray());
    if (!keystore.containsAlias(alias)) {
        throw new RuntimeException("Alias for key not found");
    }
    Key key = keystore.getKey(alias, keyPass.toCharArray());

Parameters

  • keystoreLocation String - Location of the local keystore file
  • keystorePass String - Password used when creating or modifying the keystore file with keytool (see above)
  • keyPass String - Password protecting the key, set with keypass when creating the key with keytool (see above)
  • alias String - Alias used when creating a new key with keytool (see above)

Conclusion

This post has shown how to encrypt and decrypt string-based messages using the AES-256 encryption algorithm. The keys to encrypt and decrypt these messages are held inside of a JCEKS-formatted KeyStore database created using the JDK-provided "keytool" utility.

The examples in this post should be considered a solid start to encrypting and decrypting with symmetric keys such as AES. Encryption alone should not be considered the only line of defense; key rotation, for example, is a method to mitigate risks in the event of a data breach. If the data in multiple files has been encrypted with several keys, an intruder who obtains the data and manages to compromise a single key can read only part of it, which brings down the risk of a total exposure.

All of the examples in this blog post have been condensed into a simple tool allowing for the viewing of keys inside of a keystore, an operation that is not supported out of the box by the JDK keytool. Each of the steps and topics outlined in this post is available at: https://github.com/mike-ensor/aes-256-encryption-utility.
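The pieces covered in this post can be tied together in one round trip. The following is a minimal sketch under several assumptions: the keystore is built in memory with the standard KeyStore API rather than with keytool, the ":" delimiter and all class and method names are hypothetical, and java.util.Base64 stands in for Guava's BaseEncoding. The alias and passwords reuse the values from the keytool example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.Key;
import java.security.KeyStore;
import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch: store a key in JCEKS, load it by alias, encrypt and decrypt. */
final class KeystoreRoundTripSketch {

    // Hypothetical delimiter; ':' never appears in URL-safe Base64 output.
    private static final String DELIMITER = ":";

    /** Creates an in-memory JCEKS keystore holding rawKey as an AES key under alias. */
    static byte[] createKeystore(String alias, char[] storePass, char[] keyPass, byte[] rawKey) {
        try {
            KeyStore keystore = KeyStore.getInstance("JCEKS");
            keystore.load(null, storePass); // a null stream creates an empty keystore
            keystore.setEntry(alias,
                    new KeyStore.SecretKeyEntry(new SecretKeySpec(rawKey, "AES")),
                    new KeyStore.PasswordProtection(keyPass));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            keystore.store(out, storePass);
            return out.toByteArray();
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException("Could not create keystore", e);
        }
    }

    /** Loads the key stored under alias, following the lookup steps shown in the post. */
    static Key loadKey(byte[] keystoreBytes, String alias, char[] storePass, char[] keyPass) {
        try {
            KeyStore keystore = KeyStore.getInstance("JCEKS");
            keystore.load(new ByteArrayInputStream(keystoreBytes), storePass);
            if (!keystore.containsAlias(alias)) {
                throw new IllegalArgumentException("Alias for key not found");
            }
            return keystore.getKey(alias, keyPass);
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException("Could not load key", e);
        }
    }

    /** Encrypts with a fresh random IV; returns Base64Url(iv) + ":" + Base64Url(ciphertext). */
    static String encrypt(Key key, String message) {
        try {
            byte[] iv = new byte[16]; // AES block size
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] encrypted = cipher.doFinal(message.getBytes(StandardCharsets.UTF_8));
            Base64.Encoder encoder = Base64.getUrlEncoder();
            return encoder.encodeToString(iv) + DELIMITER + encoder.encodeToString(encrypted);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Encryption failed", e);
        }
    }

    /** Splits the payload, decodes both parts, and decrypts with the recovered IV. */
    static String decrypt(Key key, String payload) {
        try {
            String[] parts = payload.split(DELIMITER, 2);
            Base64.Decoder decoder = Base64.getUrlDecoder();
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(decoder.decode(parts[0])));
            byte[] decrypted = cipher.doFinal(decoder.decode(parts[1]));
            return new String(decrypted, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Decryption failed", e);
        }
    }

    public static void main(String[] args) {
        char[] storePass = "mystorepass".toCharArray();
        char[] keyPass = "mykeypass".toCharArray();
        byte[] rawKey = new byte[16]; // 128-bit so it runs under any policy; use 32 bytes for AES-256
        new SecureRandom().nextBytes(rawKey);

        byte[] keystoreBytes = createKeystore("jceksaes", storePass, keyPass, rawKey);
        Key key = loadKey(keystoreBytes, "jceksaes", storePass, keyPass);
        String payload = encrypt(key, "attack at dawn");
        System.out.println(decrypt(key, payload)); // prints "attack at dawn"
    }
}
```

Keeping the IV next to the ciphertext in a single string means the receiver needs only the payload and access to the keystore alias to decrypt.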
NOTE: The examples, sample code and any references are to be used at the implementer's sole risk. There is no implied warranty or liability; you assume all risks.
February 4, 2014
by Mike Ensor
· 102,331 Views · 2 Likes