DZone

The Latest Databases Topics

"Schemas" in CouchDB
schema, noun (pl. schemata or schemas): 1. (technical) a representation of a plan or theory in the form of an outline or model: a schema of scientific reasoning. 2. (Logic) a syllogistic figure. 3. (in Kantian philosophy) a conception of what is common to all members of a class; a general or essential type or form.

CouchDB is a schema-less document store, but there are times when a schema is a good thing to have around, one way or another. So can you have your cake and eat it too? Below I'll take a high-level look at adding a kind of schema to an application, and at the benefits and drawbacks associated with this way of working. What I describe below isn't for everyone. It goes against some of the core principles of CouchDB and makes your data much less human readable, but there are cases where that trade-off is worth making.

Schemas: WTF?!

It might seem a bit weird to add a schema to a schema-less database, but sometimes it is a very useful thing indeed. When you're dealing with large datasets, verbose object key names can be a problem (e.g. cost you money), so you end up stuck between a rock and a hard place: either make your data terse and hard to use, or be explicit and spend more on storage and network.

    { "shape": "triangle", "colour_label": "red", "opposite_length_in_mm": 767.12254256805875, "angle_in_radians": 1.5514293603308698, "adjacent_length_in_mm": 73.59881843627835 }

What usually happens is some middle ground where a nice descriptive name like "angle_in_radians" gets reduced to "angle" or "rads". That's fine in that it reduces the storage and network required to deal with all that data.

    { "adj": 73.59881843627835, "shape": "triangle", "angle": 1.5514293603308698, "opp": 767.12254256805875, "colour": "red" }

However, by making this small change you move the description of the data out of your database and into some undefined place: higher level code, documentation, shared knowledge, a whiteboard, a notebook, someone's head.
As your data becomes more terse you might rely on duck typing (deriving from the data itself what the data describes) to get data that quacks right in your application. That's fine so long as your data is sufficiently distinguishable from the other ducks on the pond; if I rely on pulling a triangle object from the database because it has an angle member, I might accidentally pull out a rhombus or an icosahedron.

To make sure you get the data you expect you might add an explicit type field to each document (e.g. "type=goose" or "shape=triangle"), something which I've always felt was rather odd. This starts to add up on storage (remember you have a large dataset/flock of ducks) and, more importantly, it doesn't help with where the description of the data is held: you know that you have a goose but don't know what a goose is. This last point is important, especially if you're working in a team of developers. Knowing what describing a shape as a triangle means is vital in producing consistent code that many people can work on. The straitjacket of a SQL schema looks pretty comfy sometimes.

Okay, I'll buy that a schema might be useful...

So how do you add a schema into a CouchDB database, something that is inherently schema-less? Can I get the best of both worlds? Here's a little trick that might help. First you define a document that is the schema for a particular type of data:

    { "_id": "datatype/triangle/v1", "fields": [ "opposite_length_in_mm", "adjacent_length_in_mm", "angle_in_radians", "colour_label" ] }

Then you change your document structure to reference that "schema":

    { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ] }

Note that the schema is versioned and that ordering in the data list is important here! I now know precisely what the data represents without having to store that description in the data itself.
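As a minimal sketch of how a client might pair the two documents back together, assuming both have already been fetched and parsed into Python dicts (the `decode` helper is a hypothetical name for this illustration, not part of CouchDB):

```python
def decode(schema_doc, data_doc):
    """Pair each positional value with the field name from its schema."""
    fields = schema_doc["fields"]
    values = data_doc["data"]
    # Ordering matters, so a length mismatch suggests the wrong schema.
    if len(fields) != len(values):
        raise ValueError("data does not match schema %s" % schema_doc["_id"])
    return dict(zip(fields, values))


schema = {
    "_id": "datatype/triangle/v1",
    "fields": ["opposite_length_in_mm", "adjacent_length_in_mm",
               "angle_in_radians", "colour_label"],
}
doc = {
    "datatype": "triangle/v1",
    "data": [879.07395066446952, 84.607510245708468,
             1.4444230241122715, "red"],
}

triangle = decode(schema, doc)
print(triangle["colour_label"])  # red
```

The verbose names only exist once, in the schema document, yet the client still ends up with a self-describing object.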
This way of working has benefits beyond disk storage: you reduce wire traffic, and there is less for a client to parse before rendering it. This is especially useful if you're rendering into a browser-based visualisation; you don't need a complex set of objects to make a bar chart, just a list of x and y values. I can also share the data structure with colleagues and be reasonably confident that when I'm talking about a "v1 triangle" they'll know that the lengths are in millimeters and refer to the opposite and adjacent sides, and that the angle is in radians, hopefully reducing the chance of costly mistakes.

Isn't that error prone?

Yes and no. If you make a mistake in the ordering of your fields then, yes, you are going to have issues. This is reasonably easy to manage with some form of client verification (e.g. validation on a web form) and by generating the interface from the data (e.g. using the schema definition to build the GUI). If you're adding these data into the database by hand (e.g. via curl or Futon) then you aren't going to be in the regime where this trick is useful; your dataset needs to be large for this to make sense.

Things still quack

What's particularly nice about this way of working is that I can still duck type the data, add additional fields to annotate it etc., since the schema isn't strictly enforced. Nothing stops me from having a triangle document like:

    { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location": "space" }

My views that deal with the data with a schema will still work (by ignoring these additional fields), my MVC framework will still render my pages, and I'll still have all the data I want in my database.
Nesting

You could have a nested object structure like:

    {
      "datatype": "pattern/v1",
      "data": [
        { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location": "space" },
        { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "blue" ], "owner": "Fred", "location": "space" },
        { "datatype": "square/v1", "data": [ 10, "green" ] }
      ]
    }

But if you're going to have a schema you may as well reflect the nesting inside it, e.g. say that you have a list of triangles and a list of squares:

    { "_id": "datatype/pattern/v1", "fields": [ ["triangle/v1"], ["square/v1"] ] }

    {
      "datatype": "pattern/v1",
      "data": [
        [
          { "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location": "space" },
          { "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "blue" ], "owner": "Fred", "location": "space" }
        ],
        [
          { "data": [ 10, "green" ] }
        ]
      ]
    }

Schema evolution

A nice feature of this way of working is that you can deal with schema evolution, i.e. changing the format of your data.

    { "_id": "datatype/triangle/v2", "fields": [ "opposite_length_in_cm", "hypotenuse_length_in_cm", "angle_in_degrees", "colour_label" ] }

There are only so many ways you can represent the data. While sometimes you may have a major schema evolution, one where old data is completely unusable, often changes are just tweaks for consistency (say changing the units of a quantity) or extend the schema by adding optional data. In either case you should be able to use data from multiple schema versions together by applying appropriate manipulations to the data. For example, you could instantiate shape objects via a factory which knows how to create the right object for different schema versions.

Validation

The above does no validation of the data; the colour field in the input data could be set to a number instead of a string, the angle to something non-physical, etc.
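The kind of check that is missing can be sketched on the client side. This is only an illustration: the schema documents above list field names and nothing else, so the `FIELD_CHECKS` table, its type and range rules, and the `validate` helper are all assumptions of this sketch.

```python
import math


def _positive_number(value):
    return isinstance(value, (int, float)) and value > 0


# Hypothetical per-field rules for triangle/v1, in schema order.
FIELD_CHECKS = [
    ("opposite_length_in_mm", _positive_number),
    ("adjacent_length_in_mm", _positive_number),
    # For a right-angled triangle the angle lies strictly between 0 and pi/2.
    ("angle_in_radians",
     lambda v: isinstance(v, (int, float)) and 0 < v < math.pi / 2),
    ("colour_label", lambda v: isinstance(v, str)),
]


def validate(data):
    """Return the names of fields whose positional value fails its check."""
    if len(data) != len(FIELD_CHECKS):
        return ["<wrong number of values>"]
    return [name for (name, ok), value in zip(FIELD_CHECKS, data)
            if not ok(value)]


good = validate([879.07395066446952, 84.607510245708468,
                 1.4444230241122715, "red"])
bad = validate([879.07395066446952, 84.607510245708468, 7.0, 3])
print(good)  # []
print(bad)   # ['angle_in_radians', 'colour_label']
```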
If you really needed validation you could do it with CouchDB's validation functions. If you go the fully validated route you'd want to define the schema in the design document (instead of as a normal doc) and use a CommonJS include to make sure that the validator in the app was doing the same thing as the schema. This ties you to a version of the design document (which is where the validators live), which may or may not be an issue. It will also considerably slow down the insertion rate, as CouchDB has to do more work to add your data. Personally I prefer to put validation logic in the client making writes.

Views

If I were using this way of working I would want to have a view which returned all the schemas defined on the database. This then allows me to build objects appropriately. A view to return schema documents would look like:

    function(doc) {
      if (doc._id.slice(0, 'datatype'.length) == 'datatype') {
        emit(doc._id.slice('datatype/'.length, doc._id.length), doc.fields);
      }
    }

You can pull out documents that have a schema with a simple view like:

    function(doc) {
      if (doc.datatype) {
        emit(doc.datatype, doc.data);
      }
    }

This can be queried to find objects of a given shape using CouchDB's view slicing (e.g. ?startkey="square/v1"&endkey="square/v2") which returns data like:

    {"id":"datatype/square/v1","key":["square/v1",0],"value":["side_length_in_mm","colour_label"]},
    {"id":"f98ffe7e4cd91cbb0d904f9098499ca8","key":["square/v1",1],"value":[872.4342711412228,"green"]},
    {"id":"f98ffe7e4cd91cbb0d904f909849a218","key":["square/v1",1],"value":[370.29971491443905,"yellow"]},
    {"id":"f98ffe7e4cd91cbb0d904f909849acd0","key":["square/v1",1],"value":[8.799279300193753,"yellow"]}

You'll notice the name of the "schema" is the key and the values are held in value. This means I can parse the data into a set of appropriate objects with something like:

    var objects = [];
    function build(schema, data) {
      // Build the appropriate object for the schema...
    }
    // "data" is the array of rows returned by the view; a plain for...in
    // would iterate over the indices rather than the rows themselves
    for (var i = 0; i < data.length; i++) {
      // build up the objects in a factory
      var row = data[i];
      var obj = build(row.key, row.value);
      objects.push(obj);
    }

If I wanted all versions of a shape, and used a vNUMERIC_COUNTER notation for versioning, the query would be ?startkey="square/v1"&endkey="square/vXXX", since digits sort before letters.

Taking it to the extreme

If you are really worried about data size you can take this technique to the extreme by encoding the data arrays as a byte string and using the schema documents to describe that byte array. This effectively turns your JSON structure into something not dissimilar to a protocol buffer, at the expense of human readability and view complexity. If you are particularly concerned with data size over the wire (for example, you are writing an MMORPG) then this may be an acceptable trade-off.

Reminder

This trick isn't suitable for every dataset. If you modify the data by hand it is prone to error. If you have a small dataset, or only ever send a small subset of the data to the client, it's massive overkill. But if you have a large dataset of machine-generated data that needs to be frequently accessed over the WAN (think a monitoring app or game) then this is a nice way to reduce storage, network IO and browser render time. It's also worth reiterating that the schema is not enforced (you could have a square with 3 sides), and that adding strict schema enforcement with a validation function will considerably slow down the insert rate.
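To give a feel for the "extreme" variant, here is a rough sketch of packing a triangle/v1 data array into a byte string. The binary layout (three little-endian doubles plus a one-byte colour code), the `COLOURS` lookup and the base64 wrapping are all assumptions made for this illustration; the article does not prescribe a format.

```python
import base64
import struct

# Hypothetical layout for triangle/v1: three little-endian doubles followed
# by a one-byte colour code. "<" disables padding, so the size is 25 bytes.
TRIANGLE_V1 = struct.Struct("<dddB")
COLOURS = ["red", "green", "blue"]  # assumed colour table


def pack_triangle(opposite, adjacent, angle, colour):
    raw = TRIANGLE_V1.pack(opposite, adjacent, angle, COLOURS.index(colour))
    # JSON cannot carry raw bytes, so the payload travels as base64 text.
    return base64.b64encode(raw).decode("ascii")


def unpack_triangle(text):
    opposite, adjacent, angle, code = TRIANGLE_V1.unpack(base64.b64decode(text))
    return [opposite, adjacent, angle, COLOURS[code]]


original = [879.07395066446952, 84.607510245708468, 1.4444230241122715, "red"]
packed = pack_triangle(*original)
restored = unpack_triangle(packed)
print(len(packed), restored == original)
```

In a real deployment the schema document would have to describe this layout, which is exactly the readability cost mentioned above.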
September 8, 2012
by Simon Metson
· 10,362 Views
Manual Test-Driven Development
Test-Driven Development is a code-level practice, based on running automated tests that are written before the production code they exercise. But practices can be applied only in the context where they were developed: when some premises are not present, it is difficult to apply TDD as-is.

Automated specification

For example, consider the premise of assertion automation: it is possible to write a (hopefully) small algorithm that is able to check the result of running production code and return true or false. In the case where the problem is:

    Draw an antialiased circle on this blank canvas. -- Carlo Pescio

it is not immediately clear how to define automated tests for this behavior. We could check that some pixels are still blank inside or outside the circle, or that there is a bounded number of black pixels, or even that they are contiguous. An opinion I've heard (that I try not to misrepresent) is that we only need to write some looser tests in these cases, checking only a few pixels of the circle. This process will give us a little feedback on the API of our Canvas or Circle object, but not much on the algorithm we are implementing inside it. Are we going in the right direction? Have new test cases been correctly satisfied without a large intervention on the existing code? Are we painting some unrelated pixels due to a hidden bug?

What I argue here is instead that we should change the nature of the feedback mechanism. Speaking in control theory terms, we should change the block that acquires the output and influences the input to our design process.

Develop in the browser

When I was developing a Couchapp, a kind of web application served directly from a CouchDB database, I was appalled by the difficulty of testing it.
While the production code was composed of ~100 lines, it was a complex mix of technologies: HTML and CSS code, client-side JavaScript for managing user events and some server-side JavaScript for the "queries" (actually, the server side only consists of the database in Couchapps). Some of this logic could be tested in automation, like the result of queries over views. Yet much of it was related to a user interface, and as such required a large time investment to automate.

Instead of waking up my Selenium server and starting to manipulate a browser with code, I noticed that this UI was almost read-only; there were a few cases where a new document would have to be inserted, but a manual test of them was short and did not even require reloading the page. The whole application state was observable. Summing it up, I performed a frequent manual test that took a few seconds instead of trying to define complex and brittle automation logic for testing the UI.

Now that I've been introduced to a simple qualitative ROI model by Carlo Pescio's article, I would do the same, as the only logical conclusion, for every context where:
  • a large time investment is needed for automating tests.
  • it is possible to perform manual tests quickly.

A word of caution

TDD has many benefits (including catching regressions early), so I'm not prepared to give it up just because something is difficult to test. These are technical scenarios where I have successfully followed TDD by the book:
  • multithreaded and multiprocess code
  • applications distributed over multiple machines
  • computer vision (object recognition and tracking)
  • image manipulation code (via comparison testing)
  • development of browser bindings for Selenium

And even in the case where the big picture is not easy to test-first (like in the case of image manipulation), we can benefit from applying TDD to the pieces of the solution. For example, in the computer vision case I wasn't able to write a test beforehand for tracking a car inside a movie.
But I was able to TDD the objects that the algorithmic solution to the problem called for: Patch, Area, Cluster, Movement, and so on. End-to-end TDD is not always cheap, but unit-level TDD often can be, if it considers testability as a relevant property (while regression testing even at the end-to-end level is always possible, in the worst case with record and replay).

End-to-end specifications

If we can't define automated assertions for our "big picture" problem, it doesn't mean that we cannot apply the TDD approach: we can substitute a manual step. Going back to the circle problem, I would define manual test cases on an inspection page seen by a human. I've seen this done with layouts and multiple browsers to catch CSS rendering bugs, for example. It would be very difficult to check these screenshots automatically, as each browser renders pages a bit differently from the others.

The iterative process becomes:
  • Define a cheap manual test, automating the arrange and act phases but not the assertion.
  • Write only the code necessary to make it pass.
  • Refactor.

As long as the number of tests does not increase without limit and the manual check can be performed quickly, this approach does not slow you down with respect to TDD by-the-book. You'll have to take care of regression by other means, but at least you define a set of manual test cases.

Feedback!

TDD is an instrument of feedback: if feedback cannot be gathered in an automated way, we have to resort to manual checking of the specifications. Here are other examples of manual tools for generating feedback:
  • Read-Eval-Print Loops: you can experiment with existing classes and functions, and easily repeat steps thanks to history.
  • the browser refresh button: the fastest way to transform a PSD into an HTML and CSS template.
  • the MongoDB console for learning the database API; other kinds of consoles like Firebug and Chrome's, or Clojure's.
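The manual step described for the circle problem could look something like the sketch below: the arrange and act phases are scripted, while the assertion is left to a human looking at the output. Everything here is an assumption made for illustration (a filled disc rather than an outline, supersampling as the anti-aliasing method, an ASCII dump instead of an inspection page), not the approach of the article's author.

```python
def render_circle(size, radius, samples=4):
    """Rasterise a filled disc centred on a size x size canvas.

    Returns rows of coverage values in [0, 1]; partial values are the
    anti-aliased edge pixels.
    """
    centre = size / 2.0
    rows = []
    for y in range(size):
        row = []
        for x in range(size):
            hits = 0
            # Supersample each pixel on a samples x samples grid.
            for sy in range(samples):
                for sx in range(samples):
                    px = x + (sx + 0.5) / samples
                    py = y + (sy + 0.5) / samples
                    if (px - centre) ** 2 + (py - centre) ** 2 <= radius ** 2:
                        hits += 1
            row.append(hits / float(samples * samples))
        rows.append(row)
    return rows


def to_ascii(rows, ramp=" .:*#"):
    """Map coverage values onto characters so a human can eyeball the shape."""
    last = len(ramp) - 1
    return "\n".join(
        "".join(ramp[min(int(v * len(ramp)), last)] for v in row)
        for row in rows
    )


# Arrange and act are automated; the "assert" is you looking at the result.
canvas = render_circle(16, 6)
print(to_ascii(canvas))
```

Re-running the script after each change plays the role of the browser refresh button: the check takes a couple of seconds and needs no brittle pixel assertions.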
September 3, 2012
by Giorgio Sironi
· 10,259 Views
Idempotent DB Update Scripts
An idempotent function gives the same result even if it is applied several times. That is exactly how a database update script should behave. It shouldn't matter if it is run once or multiple times. The result should be the same. A database update script should first check the state of the database and then apply the changes needed. If the script is done this way, several operations can be combined into one script that works on several databases, despite the databases being in different (possibly unknown) states to start with.

For the database schema itself I usually use Visual Studio 2010 database projects, which handle updates automatically (in VS2012 the functionality has been changed significantly). Even with the schema updates handled automatically, there are always things that need manual handling. One common case is lookup tables that need initialization.

Lookup Table Init Script

I use a combination of a temp table and a MERGE clause to init lookup tables.

    CREATE TABLE #Colours
    (
      ColourId INT NOT NULL,
      Name NVARCHAR(10) NOT NULL
    )

    INSERT #Colours VALUES
    (1, N'Red'),
    (2, N'Green'),
    (3, N'Blue')

    MERGE Colours dst
    USING #Colours src
    ON (src.ColourId = dst.ColourId)
    WHEN MATCHED THEN
      UPDATE SET dst.Name = src.Name
    WHEN NOT MATCHED THEN
      INSERT VALUES (src.ColourId, src.Name)
    WHEN NOT MATCHED BY SOURCE THEN
      DELETE;

    DROP TABLE #Colours

I think that the temp table approach is great because it gives a clear overview in the script of what the final values will be. It also works regardless of what the current values are. Sometimes it is relevant to keep old values, which can be done by removing the last two lines of the MERGE clause. It is also possible to flag records as inactive instead of deleting them.

    MERGE...
    ...
    WHEN NOT MATCHED BY SOURCE THEN
      UPDATE SET dst.Active = 0;

Checking Current State

An idempotent script has to be able to check the current state and adapt its behaviour.
The lookup table init script uses the MERGE clause for that, checking the actual values. In most cases it is possible to check the current state by inspecting the values of the table or through the sys metadata views. If that's not possible, a separate table can be used to log the scripts that have been run. This method has the advantage of providing an easy way to check which scripts have been run. The disadvantage is that it violates the DRY Principle by keeping a separate log, which can get out of sync with the actual database schema. What happens when a script is partially run and then fails before writing the log entry? What will happen the next time the script is run?

This is where a truly idempotent script shines. Whenever there is doubt about the current state of the database, the entire script can be run again, bringing the database to a known state.
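The "run it again whenever in doubt" property can be demonstrated with a small, self-contained sketch. It swaps in SQLite (via Python's sqlite3 module) for SQL Server, and since SQLite has no MERGE statement, the same effect is approximated with INSERT OR REPLACE plus a DELETE; the function name and table layout are assumptions modelled on the script above.

```python
import sqlite3

WANTED = [(1, "Red"), (2, "Green"), (3, "Blue")]


def init_colours(conn):
    """Bring the Colours lookup table to the wanted state, whatever it held."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Colours ("
        "ColourId INTEGER PRIMARY KEY, Name TEXT NOT NULL)"
    )
    # Insert missing rows and overwrite rows whose name has drifted.
    conn.executemany(
        "INSERT OR REPLACE INTO Colours (ColourId, Name) VALUES (?, ?)", WANTED
    )
    # Remove rows that are not part of the wanted set.
    ids = [colour_id for colour_id, _ in WANTED]
    placeholders = ",".join("?" for _ in ids)
    conn.execute(
        "DELETE FROM Colours WHERE ColourId NOT IN (%s)" % placeholders, ids
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
init_colours(conn)

# Simulate drift: a stale row and a modified name.
conn.execute("INSERT INTO Colours VALUES (9, 'Stale')")
conn.execute("UPDATE Colours SET Name = 'Crimson' WHERE ColourId = 1")

init_colours(conn)  # running the script again repairs the drift
rows = conn.execute(
    "SELECT ColourId, Name FROM Colours ORDER BY ColourId"
).fetchall()
print(rows)
```

No log table is consulted: the script inspects and corrects the actual data, so it cannot get out of sync with it.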
September 3, 2012
by Anders Abel
· 11,158 Views
Building A Simple API Proxy Server with PHP
These days I'm playing with Backbone and using a public API as a source. The web browser has one horrible feature: it doesn't allow you to fetch any resource external to our host, due to the cross-origin restriction. For example, if we have a server at localhost we cannot perform an AJAX request to any host other than localhost. Nowadays there is a header to allow it: Access-Control-Allow-Origin. The problem is that the remote server must set this header. For example, I was playing with GitHub's API and GitHub doesn't have this header. If the server is my server, it is pretty straightforward to add this header, but obviously I'm not the sysadmin of GitHub, so I cannot do it.

What's the solution? One possible solution is, for example, to create a proxy server at localhost with PHP. With PHP we can use any remote API with cURL (I wrote about it here and here, for example). It's not difficult, but I asked myself: can we create a dummy proxy server with PHP that handles any request to localhost and redirects it to the real server, instead of creating one proxy for each request?

Let's start. Probably there is an open source solution (tell me if you know it), but I'm on holidays and I want to code a little bit (I know, it looks insane, but that's me). The idea is:

    ...
    $proxy->register('github', 'https://api.github.com');
    ...

and when I type:

    http://localhost/github/users/gonzalo123

it creates a proxy to:

    https://api.github.com/users/gonzalo123

The request method is also important. If we create a POST request to localhost we want a POST request to GitHub too. This time we're not going to reinvent the wheel, so we will use Symfony components, and we will use Composer to start our project. We create a composer.json file with the dependencies:

    { "require": { "symfony/class-loader": "dev-master", "symfony/http-foundation": "dev-master" } }

Now we run:

    php composer.phar install

and we can start coding.
The script will look like this:

    ...
    $proxy->register('github', 'https://api.github.com');
    $proxy->run();

    foreach ($proxy->getHeaders() as $header) {
        header($header);
    }
    echo $proxy->getContent();

As we can see, we can register as many servers as we want. In this example we only register GitHub. The application only has two classes: RestProxy, which extracts the information from the request object, and CurlWrapper, through which it calls the real server.

    ...
            $this->request = $request;
            $this->curl = $curl;
        }

        public function register($name, $url)
        {
            $this->map[$name] = $url;
        }

        public function run()
        {
            // try every registered server; dispatch() ignores non-matching names
            foreach ($this->map as $name => $mapurl) {
                $this->dispatch($name, $mapurl);
            }
        }

        private function dispatch($name, $mapurl)
        {
            $url = $this->request->getPathInfo();
            if (strpos($url, $name) == 1) {
                $url = $mapurl . str_replace("/{$name}", '', $url);
                $querystring = $this->request->getQueryString();

                // Symfony's Request::getMethod() returns the method in upper case
                switch ($this->request->getMethod()) {
                    case 'GET':
                        $this->content = $this->curl->doGet($url, $querystring);
                        break;
                    case 'POST':
                        $this->content = $this->curl->doPost($url, $querystring);
                        break;
                    case 'DELETE':
                        $this->content = $this->curl->doDelete($url, $querystring);
                        break;
                    case 'PUT':
                        $this->content = $this->curl->doPut($url, $querystring);
                        break;
                }
                $this->headers = $this->curl->getHeaders();
            }
        }

        public function getHeaders()
        {
            return $this->headers;
        }

        public function getContent()
        {
            return $this->content;
        }
    }

RestProxy receives two instances in the constructor via dependency injection (CurlWrapper and Request). This architecture helps a lot in the tests, because we can mock both instances, which is very helpful when building RestProxy.

RestProxy is registered on Packagist, so we can install it using the Composer installer. First install Composer:

    curl -s https://getcomposer.org/installer | php

and create a new project:

    php composer.phar create-project gonzalo123/rest-proxy proxy

If we are using PHP 5.4 (if not, what are you waiting for?)
we can run the built-in server:

    cd proxy
    php -S localhost:8888 -t www/

Now we only need to open a web browser and type:

    http://localhost:8888/github/users/gonzalo123

The library is very minimal (it's enough for my experiment) and it doesn't allow authorization. Of course, the full code is available on GitHub.
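The core of the proxy is the mapping from a local path to a remote URL. Independently of PHP, that logic can be sketched as below; the `ProxyMap` class and its `resolve` method are hypothetical names for this illustration and are not part of the rest-proxy library.

```python
class ProxyMap(object):
    """Maps a local path prefix (e.g. /github) to a remote base URL."""

    def __init__(self):
        self._map = {}

    def register(self, name, base_url):
        self._map[name] = base_url.rstrip("/")

    def resolve(self, path, query_string=""):
        """Translate /<name>/rest/of/path into <base_url>/rest/of/path."""
        for name, base_url in self._map.items():
            prefix = "/" + name
            # Match the whole first path segment, not just a string prefix.
            if path == prefix or path.startswith(prefix + "/"):
                url = base_url + path[len(prefix):]
                if query_string:
                    url += "?" + query_string
                return url
        return None  # no registered server matches this path


proxy = ProxyMap()
proxy.register("github", "https://api.github.com")
print(proxy.resolve("/github/users/gonzalo123"))
# https://api.github.com/users/gonzalo123
```

A real proxy would then replay the incoming request method, headers and body against the resolved URL, which is what the cURL wrapper does in the PHP version.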
September 2, 2012
by Gonzalo Ayuso
· 20,302 Views
Password Encryption -- Short Answer: Don't.
First, read this: "Why passwords have never been weaker - and crackers have never been stronger."

There are numerous important lessons in this article. One of the small lessons is that changing your password every sixty or ninety days is farcical. The rainbow table algorithms can crack a badly-done password in minutes. Every 60 days, the cracker has to spend a few minutes breaking your new password. Why bother changing it? It only annoys the haxorz; they'll be using your account within a few minutes. However, that practice is now so ingrained that it's difficult to dislodge from the heads of security consultants. The big lesson, however, is profound.

Work Experience

Recently, I got a request from a developer on how to encrypt a password. We have a Python back-end and the developer was asking which crypto package to download and how to install it.

"Crypto?" I asked. "Why do we need crypto?"

"To encrypt passwords," they replied.

I spat coffee on my monitor. I felt like hitting Caps Lock in the chat window so I could respond like this: "NEVER ENCRYPT A PASSWORD, YOU DOLT." I didn't, but I felt like it.

Much Confusion

The conversation took hours. Chat can be slow that way. Also, I can be slow because I need to understand what's going on before I reply. I'm a slow thinker. But the developer also needed to try stuff and provide concrete code examples, which takes time.

At the time, I knew that passwords must be hashed with salt. I hadn't read the Ars Technica article cited above, so I didn't know why computationally intensive hash algorithms are best for this. We had to discuss hash algorithms. We had to discuss algorithms for generating unique salt. We had to discuss random number generators and how to use an entropy source for a seed. We had to discuss http://www.ietf.org/rfc/rfc2617.txt in some depth, since the algorithms in section 3.2.2 show some best practices in creating hash summaries of usernames, passwords, and realms.
All of these were, of course, side topics before we got to the heart of the matter.

What's Been Going On

After several hours, my "why" questions started revealing things. The specific user story, for example, was slow to surface. Why? Partly because I didn't demand it early enough. But also, many technology folks will conceive of a "solution" and pursue that technical concept no matter how difficult or bizarre. In some cases, the concept doesn't really solve the problem.

I call this the "Rat Holes of Lost Time" phenomenon: we chase some concept through numerous little rat-holes before we realize there's a lot of activity but no tangible progress. There's a perceptual narrowing that occurs when we focus on the technology. Often, we're not actually solving the problem. IT people leap past the problem into the solution as naturally as they breathe. It's a hard habit to break.

It turned out that they were creating some additional RESTful web services. They knew that the RESTful requests needed proper authentication. But they were vague on the details of how to secure the new RESTful services. So they were chasing down their concept: encrypt a password and provide this encrypted password with each request. They were half right here. A secure "token" is required. But an encrypted password is a terrible token.

Use The Framework, Luke

What's most disturbing about this is the developer's blind spot. For some reason, the existence of other web services didn't enter into this developer's head. Why didn't they read the code for the services created on earlier sprints? We're using Django. We already have a RESTful web services framework with a complete (and high quality) security implementation. Nothing more is required. Use the RESTful authentication already part of Django.

In most cases, HTTPS is used to encrypt at the socket layer. This means that Basic Authentication is all that's required. This is a huge simplification, since all the RESTful frameworks already offer this.
The Django Rest Framework has a nice authentication module. When using Piston, it's easy to work with their Authentication handler. It's possible to make RESTful requests with Digest Authentication, if SSL is not being used. For example, Akoha handles this. It's easy to extend a framework to add Digest in addition to Basic authentication. For other customers, I created an authentication handler between Piston and ForgeRock OpenAM so that OpenAM tokens were used with each RESTful request. (This requires some care to create a solution that is testable.)

Bottom Lines
  • Don't encrypt passwords. Ever.
  • Don't write your own hash and salt algorithm. Use a framework that offers this to you.
  • Read the Ars Technica article before doing anything password-related.
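Purely to illustrate what "hashed with salt" using a computationally intensive algorithm means, here is a sketch built on Python's standard library (`hashlib.pbkdf2_hmac`). It is not a substitute for the framework's own password handling recommended above; the function names and the iteration count are assumptions of this sketch.

```python
import hashlib
import hmac
import os

ITERATIONS = 200000  # assumed work factor for this sketch; tune for your hardware


def hash_password(password, salt=None):
    """Return (salt, digest) using salted PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # unique random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected):
    """Re-derive the digest with the stored salt and compare."""
    _, digest = hash_password(password, salt)
    # Constant-time comparison avoids leaking where the digests differ.
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("guess", salt, stored))
```

Note that nothing here can be decrypted: only the salt and the digest are stored, which is the difference between hashing and encrypting a password.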
August 28, 2012
by Steven Lott
· 21,823 Views
Adding Hibernate Entity Level Filtering feature to Spring Data JPA Repository
Original Article: http://borislam.blogspot.hk/2012/07/adding-hibernate-entity-level-filter.html

Those who have used the data filtering features of Hibernate will know that they are very powerful. You can define a set of filtering criteria on an entity class or a collection. Spring Data JPA is a very handy library, but it does not have filtering features. In this post, I will demonstrate how to add the Hibernate filter feature at entity level. You can use this feature when you are using Hibernate Entity Manager. We just define annotations in the repository interface to enable it.

Step 1. Define the filter at entity level as usual. Just use the Hibernate @FilterDef annotation.

    @Entity
    @Table(name = "STUDENT")
    @FilterDef(name="filterBySchoolAndClass", parameters={@ParamDef(name="school", type="string"),@ParamDef(name="class", type="integer")})
    public class Student extends GenericEntity implements Serializable {
      // add your properties ...
    }

Step 2. Define two custom annotations. These two annotations are to be used in your repository interfaces. You can apply the Hibernate filter defined in step 1 to a specific query through these annotations.

    @Target(ElementType.TYPE)
    @Retention(RetentionPolicy.RUNTIME)
    public @interface EntityFilter {
      FilterQuery[] filterQueries() default {};
    }

    @Retention(RetentionPolicy.RUNTIME)
    public @interface FilterQuery {
      String name() default "";
      String jpql() default "";
    }

Step 3. Add a method to your Spring Data JPA base repository. This method reads the annotation you defined (i.e. @FilterQuery) and applies the Hibernate filter to the query by simply unwrapping the EntityManager. You can specify both the parameters of your Hibernate filter and the parameters of your query in this method. If you do not know how to add a custom method to your Spring Data JPA base repository, please see my previous article on how to customize your Spring Data JPA base repository for details.
You can see in the previous article that I intentionally expose the repository interface (i.e. the springDataRepositoryInterface property) in GenericRepositoryImpl. This small trick enables me to access the annotation on the repository interface easily.

    public List doQueryWithFilter(String filterName, String filterQueryName, Map inFilterParams, Map inQueryParams) {
      if (GenericRepository.class.isAssignableFrom(getSpringDataRepositoryInterface())) {
        Annotation entityFilterAnn = getSpringDataRepositoryInterface().getAnnotation(EntityFilter.class);
        if (entityFilterAnn != null) {
          EntityFilter entityFilter = (EntityFilter) entityFilterAnn;
          FilterQuery[] filterQuerys = entityFilter.filterQueries();
          for (FilterQuery fQuery : filterQuerys) {
            if (StringUtils.equals(filterQueryName, fQuery.name())) {
              String jpql = fQuery.jpql();
              Filter filter = em.unwrap(Session.class).enableFilter(filterName);

              // set filter parameters
              for (Object key : inFilterParams.keySet()) {
                String filterParamName = key.toString();
                Object filterParamValue = inFilterParams.get(key);
                filter.setParameter(filterParamName, filterParamValue);
              }

              // set query parameters
              Query query = em.createQuery(jpql);
              for (Object key : inQueryParams.keySet()) {
                String queryParamName = key.toString();
                Object queryParamValue = inQueryParams.get(key);
                query.setParameter(queryParamName, queryParamValue);
              }
              return query.getResultList();
            }
          }
        }
      }
      // no matching @FilterQuery was found
      return null;
    }

Last step: example usage

In your repository, define which queries should have the Hibernate filter applied, through your @EntityFilter and @FilterQuery annotations.
@EntityFilter (
    filterQueries = {
        @FilterQuery(name="query1", jpql="SELECT s FROM Student s LEFT JOIN FETCH s.Subject where s.subject = :subject"),
        @FilterQuery(name="query2", jpql="SELECT s FROM Student s LEFT JOIN s.TeacherSubject where s.teacher = :teacher")
    }
)
public interface StudentRepository extends GenericRepository {
}

In the service or business class that injects your repository, you can simply call the doQueryWithFilter() method to enable filtering.

@Service
public class StudentService {

    @Inject
    private StudentRepository studentRepository;

    // "class" is a reserved word in Java, so the parameter is named className
    public List searchStudent(String subject, String school, String className) {
        List studentList;

        // Prepare parameters for the Hibernate filter
        HashMap inFilterParams = new HashMap();
        inFilterParams.put("school", "Hong Kong Secondary School");
        inFilterParams.put("class", "S5");

        // Prepare parameters for the query
        HashMap inParams = new HashMap();
        inParams.put("subject", "Physics");

        studentList = studentRepository.doQueryWithFilter("filterBySchoolAndClass", "query1", inFilterParams, inParams);
        return studentList;
    }
}
August 24, 2012
by Boris Lam
· 56,834 Views · 1 Like
Spring Data, Spring Security and Envers integration
Learn about the pros, cons, and basics of Spring Security and Spring Data, plus Envers integration.
August 20, 2012
by Nicolas Fränkel
· 25,050 Views · 1 Like
EF Migrations Command Reference
Entity Framework Migrations are handled from the package manager console in Visual Studio. The usage is shown in various tutorials, but I haven’t found a complete list of the commands available and their usage, so I created my own. There are four available commands.

Enable-Migrations: Enables Code First Migrations in a project.
Add-Migration: Scaffolds a migration script for any pending model changes.
Update-Database: Applies any pending migrations to the database.
Get-Migrations: Displays the migrations that have been applied to the target database.

The information here is the output of running get-help command-name -detailed for each of the commands in the package manager console (running EF 4.3.1). I’ve also added some comments of my own where I think some information is missing. My own comments are placed under the Additional Information heading. Please note that all commands should be entered on the same line. I’ve added line breaks to avoid vertical scrollbars.

Enable-Migrations
Enables Code First Migrations in a project.

Syntax
Enable-Migrations [-EnableAutomaticMigrations] [[-ProjectName] <String>] [-Force] [<CommonParameters>]

Description
Enables Migrations by scaffolding a migrations configuration class in the project. If the target database was created by an initializer, an initial migration will be created (unless automatic migrations are enabled via the EnableAutomaticMigrations parameter).

Parameters
-EnableAutomaticMigrations Specifies whether automatic migrations will be enabled in the scaffolded migrations configuration. If omitted, automatic migrations will be disabled.
-ProjectName Specifies the project that the scaffolded migrations configuration class will be added to. If omitted, the default project selected in package manager console is used.
-Force Specifies that the migrations configuration be overwritten when running more than once for given project.
This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters.

Remarks
To see the examples, type: get-help Enable-Migrations -examples. For more information, type: get-help Enable-Migrations -detailed. For technical information, type: get-help Enable-Migrations -full.

Additional Information
The flag for enabling automatic migrations is saved in the Migrations\Configuration.cs file, in the constructor. To later change the option, just change the assignment in the file.

public Configuration()
{
    AutomaticMigrationsEnabled = false;
}

Add-Migration
Scaffolds a migration script for any pending model changes.

Syntax
Add-Migration [-Name] <String> [-Force] [-ProjectName <String>] [-StartUpProjectName <String>] [-ConfigurationTypeName <String>] [-ConnectionStringName <String>] [-IgnoreChanges] [<CommonParameters>]
Add-Migration [-Name] <String> [-Force] [-ProjectName <String>] [-StartUpProjectName <String>] [-ConfigurationTypeName <String>] -ConnectionString <String> -ConnectionProviderName <String> [-IgnoreChanges] [<CommonParameters>]

Description
Scaffolds a new migration script and adds it to the project.

Parameters
-Name Specifies the name of the custom script.
-Force Specifies that the migration user code be overwritten when re-scaffolding an existing migration.
-ProjectName Specifies the project that contains the migration configuration type to be used. If omitted, the default project selected in package manager console is used.
-StartUpProjectName Specifies the configuration file to use for named connection strings. If omitted, the specified project’s configuration file is used.
-ConfigurationTypeName Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project.
-ConnectionStringName Specifies the name of a connection string to use from the application’s configuration file.
-ConnectionString Specifies the connection string to use.
If omitted, the context’s default connection will be used.
-ConnectionProviderName Specifies the provider invariant name of the connection string.
-IgnoreChanges Scaffolds an empty migration ignoring any pending changes detected in the current model. This can be used to create an initial, empty migration to enable Migrations for an existing database. N.B. Doing this assumes that the target database schema is compatible with the current model.

This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters.

Remarks
To see the examples, type: get-help Add-Migration -examples. For more information, type: get-help Add-Migration -detailed. For technical information, type: get-help Add-Migration -full.

Update-Database
Applies any pending migrations to the database.

Syntax
Update-Database [-SourceMigration <String>] [-TargetMigration <String>] [-Script] [-Force] [-ProjectName <String>] [-StartUpProjectName <String>] [-ConfigurationTypeName <String>] [-ConnectionStringName <String>] [<CommonParameters>]
Update-Database [-SourceMigration <String>] [-TargetMigration <String>] [-Script] [-Force] [-ProjectName <String>] [-StartUpProjectName <String>] [-ConfigurationTypeName <String>] -ConnectionString <String> -ConnectionProviderName <String> [<CommonParameters>]

Description
Updates the database to the current model by applying pending migrations.

Parameters
-SourceMigration Only valid with -Script. Specifies the name of a particular migration to use as the update’s starting point. If omitted, the last applied migration in the database will be used.
-TargetMigration Specifies the name of a particular migration to update the database to. If omitted, the current model will be used.
-Script Generate a SQL script rather than executing the pending changes directly.
-Force Specifies that data loss is acceptable during automatic migration of the database.
-ProjectName Specifies the project that contains the migration configuration type to be used.
If omitted, the default project selected in package manager console is used.
-StartUpProjectName Specifies the configuration file to use for named connection strings. If omitted, the specified project’s configuration file is used.
-ConfigurationTypeName Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project.
-ConnectionStringName Specifies the name of a connection string to use from the application’s configuration file.
-ConnectionString Specifies the connection string to use. If omitted, the context’s default connection will be used.
-ConnectionProviderName Specifies the provider invariant name of the connection string.

This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters.

Remarks
To see the examples, type: get-help Update-Database -examples. For more information, type: get-help Update-Database -detailed. For technical information, type: get-help Update-Database -full.

Additional Information
The command always runs any pending code-based migrations first. If the database is still incompatible with the model, the additional changes required are applied as a separate automatic migration step if automatic migrations are enabled. If automatic migrations are disabled, an error message is shown.

Get-Migrations
Displays the migrations that have been applied to the target database.

Syntax
Get-Migrations [-ProjectName <String>] [-StartUpProjectName <String>] [-ConfigurationTypeName <String>] [-ConnectionStringName <String>] [<CommonParameters>]
Get-Migrations [-ProjectName <String>] [-StartUpProjectName <String>] [-ConfigurationTypeName <String>] -ConnectionString <String> -ConnectionProviderName <String> [<CommonParameters>]

Description
Displays the migrations that have been applied to the target database.

Parameters
-ProjectName Specifies the project that contains the migration configuration type to be used.
If omitted, the default project selected in package manager console is used.
-StartUpProjectName Specifies the configuration file to use for named connection strings. If omitted, the specified project’s configuration file is used.
-ConfigurationTypeName Specifies the migrations configuration to use. If omitted, migrations will attempt to locate a single migrations configuration type in the target project.
-ConnectionStringName Specifies the name of a connection string to use from the application’s configuration file.
-ConnectionString Specifies the connection string to use. If omitted, the context’s default connection will be used.
-ConnectionProviderName Specifies the provider invariant name of the connection string.

This cmdlet supports the common parameters: Verbose, Debug, ErrorAction, ErrorVariable, WarningAction, WarningVariable, OutBuffer and OutVariable. For more information, type: get-help about_commonparameters.

Remarks
To see the examples, type: get-help Get-Migrations -examples. For more information, type: get-help Get-Migrations -detailed. For technical information, type: get-help Get-Migrations -full.

Additional Information
The PowerShell commands are complex PowerShell functions, located in the tools\EntityFramework.psm1 file of the Entity Framework installation. The PowerShell code is mostly a wrapper around the System.Data.Entity.Migrations.MigrationsCommands class found in the tools\EntityFramework\EntityFramework.PowerShell.dll file. First a MigrationsCommands object is instantiated with all configuration parameters. Then there is a public method on the MigrationsCommands object for each of the available commands.
August 20, 2012
by Anders Abel
· 31,380 Views · 1 Like
How to Migrate Drupal to Azure Web Sites
DrupalCon Munich is next week, and I am lucky enough to be going. As part of preparing for the conference, I thought it would be worthwhile to see just how easy (or difficult) it would be to migrate an existing Drupal site to Windows Azure Web Sites. So, in this post, I’ll do just that. Fortunately, because Windows Azure Web Sites supports both PHP and MySQL, the migration process is relatively straightforward. And, because Drupal and PHP run on any platform, the process I’ll describe should work for moving Drupal to Windows Azure Web Sites regardless of what platform you are moving from. Of course, Drupal installations can vary widely, so YMMV. I tested the instructions below on a relatively small (and simple) Drupal installation running on CentOS 5. (Unfortunately, I won’t be using Drush since it isn’t supported on Windows Azure Web Sites.) If you are considering moving a large and complex Drupal application, you may want to consider moving to Windows Azure Cloud Services (more information about that here: Migrating a Drupal Site from LAMP to Windows Azure).

Before getting started, it’s worth noting that Windows Azure Web Sites lets you run up to 10 Web Sites for free in a multitenant environment. And, you can seamlessly upgrade to private, reserved VM instances as your traffic grows. To sign up, try the Windows Azure 90-day free trial.

1. Create a Windows Azure Web Site and MySQL database

There is a step-by-step tutorial on http://www.windowsazure.com that walks you through creating a new website and a MySQL database, so I’ll refer you there to get started: Create a PHP-MySQL Windows Azure web site and deploy using Git. If you intend to use Git to publish your Drupal site, then go ahead and follow the instructions for setting up a Git repository. Make sure to follow the instructions in the Get remote MySQL connection information section as you will need that information later.
You can ignore the remainder of the tutorial for the purposes of deploying your Drupal site, but if you are new to Windows Azure Web Sites (and to Git), you might find the additional reading informative. OK, now you have a new website with a MySQL database, you have your MySQL database connection information, and you have (optionally) created a remote Git repository and made note of the Git deployment instructions. Now you are ready to copy your database to MySQL in Windows Azure Web Sites.

2. Copy database to MySQL in Windows Azure Web Sites

I’m sure there is more than one way to copy your Drupal database, but I found the mysqldump tool to be effective and easy to use. To copy from a local machine to Windows Azure Web Sites, here’s the command I used:

mysqldump -u local_username --password=local_password drupal | mysql -h remote_host -u remote_username --password=remote_password remote_db_name

You will, of course, have to provide the username and password for your existing Drupal database, and you will have to provide the hostname, username, password, and database name for the MySQL database you created in step 1. This information is available in the connection string that you should have noted in step 1, i.e. you should have a connection string that looks something like this:

Database=remote_db_name;Data Source=remote_host;User Id=remote_username;Password=remote_password

Depending on the size of your database, the copying process could take several minutes. Now your Drupal database is live in Windows Azure Web Sites. Before you deploy your Drupal code, you need to modify it so it can connect to the new database.

3. Modify database connection info in settings.php

Here, you will again need your new database connection information. Open the /drupal/sites/default/settings.php file in your favorite text editor, and replace the values of ‘database’, ‘username’, ‘password’, and ‘host’ in the $databases array with the correct values for your new database.
When you are finished, you should have something similar to this:

$databases = array (
  'default' => array (
    'default' => array (
      'database' => 'remote_db_name',
      'username' => 'remote_username',
      'password' => 'remote_password',
      'host' => 'remote_host',
      'port' => '',
      'driver' => 'mysql',
      'prefix' => '',
    ),
  ),
);

Be sure to save the settings.php file, then you are ready to deploy.

4. Deploy Drupal code using Git or FTP

The last step is to deploy your code to Windows Azure Web Sites using Git or FTP. If you are using FTP, you can get the FTP hostname and username from your website’s dashboard. Then, use your favorite FTP client to upload your Drupal files to the /site/wwwroot folder of the remote site. If you are using Git, you need to set up a Git repository in Windows Azure Web Sites (steps for this are in the tutorial mentioned earlier). And, you will need Git installed on your local machine. Then, just follow the instructions provided after you created the repository.

One note about using Git here: depending on your Git settings and your .gitignore file (a hidden file and a sibling to the .git folder created in your local root directory after you executed git commit), some files in your Drupal application may be ignored. In my case, all the files in the sites directory were ignored. If this happens, you will want to edit the .gitignore file so that these files aren’t ignored and redeploy. After you have deployed Drupal to Windows Azure Web Sites, you can continue to deploy updates via Git or FTP.

Related information

If you are looking for more information about Windows Azure Web Sites, these posts might be helpful:
  • Windows Azure Websites: A PHP Perspective
  • Windows Azure Websites, Web Roles, and VMs: When to use which
  • Configuring PHP in Windows Azure Websites with .user.ini Files

One last thing you might consider, depending on your site, is using the Windows Azure Integration Module to store and serve your site’s media files.
August 19, 2012
by Brian Swan
· 10,248 Views
tcpdump: Learning how to read UDP packets
Use tcpdump to capture any UDP packets on port 8125.
August 7, 2012
by Mark Needham
· 305,732 Views
Spring Data With Cassandra Using JPA
We recently adopted the use of Spring Data. Spring Data provides a nice pattern/API that you can layer on top of JPA to eliminate boiler-plate code. With that adoption, we started looking at the DAO layer we use against Cassandra for some of our operations. Some of the data we store in Cassandra is simple. It does *not* leverage the flexible nature of NoSQL. In other words, we know all the table names, the column names ahead of time, and we don't anticipate them changing all that often. We could have stored this data in an RDBMs, using hibernate to access it, but standing up another persistence mechanism seemed like overkill. For simplicity's sake, we preferred storing this data in Cassandra. That said, we want the flexibility to move this to an RDBMs if we need to. Enter JPA. JPA would provide us a nice layer of abstraction away from the underlying storage mechanism. Wouldn't it be great if we could annotate the objects with JPA annotations, and persist them to Cassandra? Enter Kundera. Kundera is a JPA implementation that supports Cassandra (among other storage mechanisms). OK -- so JPA is great, and would get us what we want, but we had just adopted the use of Spring Data. Could we use both? The answer is "sort of". I forked off SpringSource's spring-data-cassandra: https://github.com/boneill42/spring-data-cassandra And I started hacking on it. I managed to get an implementation of the PagingAndSortingRepository for which I wrote unit tests that worked, but I was duplicating a lot of what should have come for free in the SimpleJpaRepository. When I tried to substitute my CassandraJpaRepository for the SimpleJpaRepository, I ran into some trouble w/ Kundera. Specifically, the MetaModel implementation appeared to be incomplete. MetaModelImpl was returning null for all managedTypes(). SimpleJpa wasn't too happy with this. Instead of wrangling with Kundera, we punted. We can achieve enough of the value leveraging JPA directly. 
Perhaps more importantly, there is still an impedance mismatch between JPA and NoSQL. In our case, it would have been nice to get at Cassandra through Spring Data using JPA for a few cases in our app, but for the vast majority of the application, a straight up ORM layer whereby we know the tables, rows and column names ahead of time is insufficient. For those cases where we don't know the schema ahead of time, we're going to need to leverage the converters pattern in Spring Data. So, I started hacking on a proper Spring Data layer using Astyanax as the client. Follow along here: https://github.com/boneill42/spring-data-cassandra More to come on that....
July 31, 2012
by Brian O' Neill
· 30,259 Views
11 OPEN NoSQL Document-Oriented Databases
A document-oriented database is designed for storing, retrieving, and managing document-oriented, or semi-structured, data. Document-oriented databases are one of the main categories of NoSQL databases. The central concept of a document-oriented database is the notion of a Document. While each document-oriented database implementation differs on the details of this definition, in general, they all assume documents encapsulate and encode data (or information) in some standard format(s) (or encoding(s)). Encodings in use include XML, YAML, JSON and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on).

MongoDB: MongoDB is a collection-oriented, schema-free document database. Data is grouped into sets that are called ‘collections’. Each collection has a unique name in the database, and can contain an unlimited number of documents. Collections are analogous to tables in an RDBMS, except that they don’t have any defined schema. It stores data in BSON (“Binary JSON”) format: a structured collection of key-value pairs, where keys are strings and values are any of a rich set of data types, including arrays and documents.
Home: http://www.mongodb.org/
Quick Start: http://www.mongodb.org/display/DOCS/Quickstart
Download: http://www.mongodb.org/downloads

CouchDB: CouchDB is a document database server, accessible via a RESTful JSON API. It is ad hoc and schema-free with a flat address space. It is queryable and indexable, featuring a table-oriented reporting engine that uses JavaScript as a query language. A CouchDB document is an object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps.
Home: http://couchdb.apache.org/
Quick Start: http://couchdb.apache.org/docs/intro.html
Download: http://couchdb.apache.org/downloads.html

Terrastore: Terrastore is a modern document store which provides advanced scalability and elasticity features without sacrificing consistency. It is based on Terracotta, so it relies on an industry-proven, fast clustering technology.
Home: http://code.google.com/p/terrastore/
Quick Start: http://code.google.com/p/terrastore/wiki/Documentation
Download: http://code.google.com/p/terrastore/downloads/list

RavenDB: Raven is a .NET LINQ-enabled document database, focused on providing a high-performance, schema-less, flexible and scalable NoSQL data store for the .NET and Windows platforms. Raven stores any JSON document inside the database. It is a schema-less database where you can define indexes using C#’s LINQ syntax.
Home: http://ravendb.net/
Quick Start: http://ravendb.net/tutorials
Download: http://ravendb.net/download

OrientDB: OrientDB is an open source NoSQL database management system written in Java. Even though it is a document-based database, relationships are managed as in graph databases, with direct connections between records. It supports schema-less, schema-full and schema-mixed modes. It has a strong security profiling system based on users and roles and supports SQL as a query language.
Home: http://www.orientechnologies.com/
Quick Start: http://code.google.com/p/orient/wiki/Tutorials
Download: http://code.google.com/p/orient/wiki/Download

ThruDB: Thrudb is a set of simple services built on top of the Apache Thrift framework that provides indexing and document storage services for building and scaling websites. Its purpose is to offer web developers flexible, fast and easy-to-use services that can enhance or replace traditional data storage and access layers. It supports multiple storage backends such as BerkeleyDB, Disk and MySQL, and also has Memcache and Spread integration.
Home: http://code.google.com/p/thrudb/
Quick Start: http://thrudb.googlecode.com/svn/trunk/doc/Thrudb.pdf
Download: http://code.google.com/p/thrudb/source/checkout

SisoDB: SisoDb is a document-oriented db-provider for SQL Server written in C#. It lets you store object graphs of POCOs (plain old CLR objects) without having to configure any mappings. Each entity is treated as an aggregate root and will get separate tables created on the fly.
Home: http://www.sisodb.com
Quick Start: http://www.sisodb.com/Wiki
Download: https://github.com/danielwertheim/SisoDb-Provider/

RaptorDB: RaptorDB is an extremely small and fast embedded, NoSQL, persisted dictionary database using b+tree or MurMur hash indexing. It was primarily designed to store JSON data (see my fastJSON implementation), but can store any type of data that you give it.
Home: http://www.codeproject.com/KB/database/RaptorDB.aspx
Quick Start: http://www.codeproject.com/KB/database/RaptorDB.aspx
Download: http://www.codeproject.com/KB/database/RaptorDB.aspx

CloudKit: CloudKit provides schema-free, auto-versioned, RESTful JSON storage with optional OpenID and OAuth support, including OAuth Discovery.
Home: http://getcloudkit.com/
Quick Start: http://getcloudkit.com/api/
Download: https://github.com/jcrosby/cloudkit

Persevere: Persevere is an open source set of tools for persistence and distributed computing using intuitive standards-based JSON interfaces: HTTP REST, JSON-RPC, JSONPath, and REST Channels. The core of the Persevere project is the Persevere Server. The Persevere server includes a Persevere JavaScript client, but the standards-based interface is intended to be used with any framework or client.
Home: http://code.google.com/p/persevere-framework/
Quick Start: http://code.google.com/p/persevere-framework/w/list
Download: http://code.google.com/p/persevere-framework/downloads/list

Jackrabbit: The Apache Jackrabbit™ content repository is a fully conforming implementation of the Content Repository for Java Technology API (JCR, specified in JSR 170 and 283). A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
Home: http://jackrabbit.apache.org
Quick Start: http://jackrabbit.apache.org/getting-started-with-apache-jackrabbit.html
Download: http://jackrabbit.apache.org/downloads.html

Conclusion: Document databases store and retrieve documents, and the basic atomic unit of storage is a document. As always, your requirements drive the decision. You need to think about your data-access patterns and use cases to create a smart document model. When your domain model can be split and partitioned across documents, a document database will be a suitable choice. For example, a document database works extremely well for blog software, a CMS or wiki software. At the same time, a non-relational database is not better than a relational one in cases where your data has a lot of relations and normalization. The following Stack Overflow question covers the pros and cons of relational vs. document-based databases: http://stackoverflow.com/questions/337344/pros-cons-of-document-based-databases-vs-relational-databases
July 23, 2012
by Lijin Joseji
· 69,247 Views · 2 Likes
How Does SQL Server Scheduling Work? There's a Flowchart For That
srgolla - SQL Server Scheduling Flowchart This is a basic flowchart explaining SQL Server Scheduling at a very high level. This will appeal to a limited audience, but is still something I thought very informative and not something I see flowcharted every day (week/month/year).
July 21, 2012
by Greg Duncan
· 7,056 Views
Replacing Query String Elements in C# .NET and JavaScript
When writing list navigation and search features for websites, there is a constant need to find, replace and otherwise play with query string elements, so that you can easily manipulate these mystical items while you’re carrying them around in your website’s URLs. I have a few little methods I’ve used over the years and carry with me from project to project, and this post puts them on the record for easy access later.

I have a secret. This post is actually aimed more at an audience of 'myself', so that I have an easy bit of source code to call upon when I’m on the go looking for a quick solution to cut and paste, as most of my blog posts are. But you, dear reader, get to share in this benefit with me by pulling from the awesomeness within this post as well.

Solution: .NET C#

When doing this with C# you have a few pretty cool features up your sleeve. One of these is the HttpUtility.ParseQueryString(urlPath) framework method. This static method allows you to extract an editable NameValueCollection from a given query string. Why is this cool? Because it allows you to very easily play with the query string collection like any other NameValueCollection, with Add() and Remove() methods. This makes it incredibly powerful.

Quick & dirty code, beware! The code I’m pasting below is far from being the most elegant solution. I seem to have misplaced my nicer piece of code and am in too much of a rush to find it right now (sorry). Until I find my nicer solution, the method below will get you by, whether you have a hatred for ternaries or not.

// requires: using System.Web; (HttpUtility lives in System.Web)
public static string ReplaceQueryStringParam(string currentPageUrl, string paramToReplace, string newValue)
{
    // everything before the '?' (or the whole URL if there is no query string)
    string urlWithoutQuery = currentPageUrl.IndexOf('?') >= 0 ? currentPageUrl.Substring(0, currentPageUrl.IndexOf('?')) : currentPageUrl;
    // the query string including the leading '?', or null if there is none
    string queryString = currentPageUrl.IndexOf('?') >= 0 ? currentPageUrl.Substring(currentPageUrl.IndexOf('?')) : null;
    var queryParamList = queryString != null ?
        HttpUtility.ParseQueryString(queryString) : HttpUtility.ParseQueryString(string.Empty);
    if (queryParamList[paramToReplace] != null)
    {
        queryParamList[paramToReplace] = newValue;
    }
    else
    {
        queryParamList.Add(paramToReplace, newValue);
    }
    return String.Format("{0}?{1}", urlWithoutQuery, queryParamList);
}

To call this, you can do the following:

// var currentUrl = HttpContext.Current.Request.Url.ToString();
var currentUrl = "http://www.mysite.com/mypage?category=cool-products&sort=price&page=3";
// change my sort-by param named "sort" to "name"
var newUrlWithChangedSort = ReplaceQueryStringParam(currentUrl, "sort", "name");

Solution: JavaScript

The second part of this post is a JavaScript solution, as you never know when you will have to do this on the client side.

function replaceQueryString(url, param, value) {
    // make sure there is a '?' to append parameters to
    if (url.indexOf('?') < 0) url = url + "?";
    // match "param=..." directly after '?' or '&', up to the next '&' or the end
    var re = new RegExp("([?&])" + param + "=.*?(&|$)", "i");
    if (url.match(re))
        return url.replace(re, '$1' + param + "=" + value + '$2');
    else
        return url.substring(url.length - 1) == '?' ? url + param + "=" + value : url + '&' + param + "=" + value;
}

And to use the above code in your client-side JavaScript, simply write something along the lines of:

// var currentUrl = self.location.href;
var currentUrl = "http://www.mysite.com/mypage?category=cool-products&sort=price&page=3";
// change my sort-by param named "sort" to "name"
var newUrlWithChangedSort = replaceQueryString(currentUrl, "sort", "name");

Easy! Next time you need to knock something together, instead of writing it yourself, you can simply cut & paste mine.
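As a hedged alternative to the regular-expression approach above, current browsers and Node.js ship the standard URL and URLSearchParams APIs, which take care of parsing and percent-encoding. The sketch below is not from the original post: the function name setQueryParam is made up for illustration, and it assumes the input is an absolute URL (new URL() throws on a relative URL unless a base is supplied).

```javascript
// Hypothetical helper (the name is illustrative, not from the original post):
// set or add a single query-string parameter using the standard URL API.
// Assumes `url` is an absolute URL string.
function setQueryParam(url, param, value) {
  const parsed = new URL(url);
  // set() overwrites the existing value in place, or appends the pair if the
  // parameter is absent; values are percent-encoded on serialization.
  parsed.searchParams.set(param, String(value));
  return parsed.toString();
}

const currentUrl = "http://www.mysite.com/mypage?category=cool-products&sort=price&page=3";
console.log(setQueryParam(currentUrl, "sort", "name"));
// prints "http://www.mysite.com/mypage?category=cool-products&sort=name&page=3"
```

Unlike plain string concatenation, this encodes reserved characters in the value (for example, "a&b" is written as a%26b), at the cost of requiring a runtime that supports the URL API.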
July 20, 2012
by Douglas Rathbone
· 16,519 Views
How to Autoscale MySQL on Amazon EC2
Autoscaling your webserver tier is typically straightforward. Image your Apache server with source code or without, then sync down files from S3 upon spinup. Roll that image into the autoscale configuration and you’re all set. With the database tier though, things can be a bit tricky. The typical configuration we see is to have a single master database where your application writes. But scaling out or horizontally on Amazon EC2 should be as easy as adding more slaves, right? Why not automate that process? Below we’ve set out to answer some of the questions you’re likely to face when setting up slaves against your master. We’ve included instructions on building an AMI that automatically spins up as a slave. Fancy!

How can I autoscale my database tier?

Build an auto-starting MySQL slave against your master. Configure those to spin up. Amazon’s autoscaling loadbalancer is one option; another is to use a roll-your-own solution, monitoring thresholds on servers, and spinning up or dropping off slaves as necessary.

Does an AWS snapshot capture subvolume data or just the SIZE of the attached volume?

If you have an attached EBS volume and you create a new AMI off of that, you will capture the entire root volume, plus your attached volume data. In fact we find this a great way to create an auto-building slave in the cloud.

How do I freeze MySQL during AWS snapshot?

mysql> flush tables with read lock;
mysql> system xfs_freeze -f /data

At this point you can use the Amazon web console, ylastic, or the ec2-create-image API call from the command line to create the image. When the server you are imaging off of above restarts, as it will do by default, it will start with the /data partition unfrozen and MySQL’s tables unlocked again. Voila! If you’re not using xfs for your /data filesystem, you should be. It’s fast! The xfsprogs docs seem to indicate this may also work with foreign filesystems. Check the docs for details.

How do I build an AMI mysql slave that autoconnects to master?
1. Install the mysql_serverid script below.
2. Configure mysql to use your /data EBS mount.
3. Set all your my.cnf settings, including server_id.
4. Configure the instance as a slave in the normal way. When using GRANT to create the 'rep' user on master, specify the host with a subnet wildcard, for example '10.20.%'. That will subsequently allow any 10.20.x.y server to connect and replicate.
5. Point the slave at the master.
6. When all is running properly, edit the my.cnf file and remove server_id. Don’t restart mysql.
7. Freeze the filesystem as described above.
8. Use the Amazon console, ylastic or API call to create your new image.
9. Test it of course, to make sure it spins up, sets server_id and connects to master.
10. Make a change in the test schema, and verify that it propagates to all slaves.

How do I set server_id uniquely?

As you hopefully already know, in a MySQL replication environment each node requires a unique server_id setting. In my Amazon Machine Images, I want the server to start up and, if it doesn’t find server_id in the /etc/my.cnf file, to add it there, correctly! Is that so much to ask?

Here’s what I did. Fire up your editor of choice and drop in this bit of code:

#!/bin/sh
if grep -q "server_id" /etc/my.cnf
then
    : # do nothing, it's already set
else
    # extract the numeric component from the hostname;
    # this should be the internal IP in the Amazon environment
    server_id=`hostname | sed 's/[^0-9]*//g'`
    echo "server_id=$server_id" >> /etc/my.cnf
    # restart mysql so the new setting takes effect
    /etc/init.d/mysql restart
fi

Save that snippet at /root/mysql_serverid. Also be sure to make it executable:

$ chmod +x /root/mysql_serverid

Then just append it to your /etc/rc.local file with an editor or echo:

$ echo "/root/mysql_serverid" >> /etc/rc.local

Assuming your my.cnf file does *NOT* contain the server_id setting when you re-image, it’ll set this automagically each time you spin up a new server off of that AMI. Nice!

Can you easily slave off of a slave? How?
It’s not terribly different from slaving off of a normal master.

A. First enable slave updates. The setting is not dynamic, so if you don’t already have it set, you’ll have to restart your slave.

log_slave_updates=true

B. Get an initial snapshot of your slave data. You can do that the locking way:

mysql> flush tables with read lock;
mysql> show master status\G
mysql> system mysqldump -A > full_slave_dump.mysql
mysql> unlock tables;

You may also choose to use Percona’s excellent xtrabackup utility to create hot backups without locking any tables. We are very lucky to have an open-source tool like this at our disposal. MySQL Enterprise Backup from Oracle Corp can also do this.

C. On the new slave, seed the database with the dump created above.

$ mysql < full_slave_dump.mysql

D. Now point your new slave to the original slave, using the log file and position reported by show master status in step B.

mysql> change master to master_user='rep', master_password='rep', master_host='192.168.0.1', master_log_file='server-bin-log.000004', master_log_pos=399;
mysql> start slave;
mysql> show slave status\G

Slave master is set as an IP address. Is there another way?

It’s possible to use hostnames in MySQL replication, however it’s not recommended. Why? Because of the wacky world of DNS. Suffice it to say MySQL has to do a lot of work to resolve those names into IP addresses. A hiccup in DNS can potentially interrupt all MySQL services, as sessions will fail to authenticate. To avoid this problem do two things:

A. Set this parameter in my.cnf:

skip_name_resolve = true

B. Remove entries in the mysql.user table where the hostname is not an IP address. Those entries will be invalid for authentication after setting the above parameter.

Doesn’t RDS take care of all of this for me?

RDS is Amazon’s Relational Database Service, which is built on MySQL. Amazon’s RDS solution presents MySQL as a service, which brings certain benefits to administrators and startups:

Simpler administration. Nuts and bolts are handled for you.

Push-button replication.
No more struggling with the nuances and issues of MySQL’s replication management.

Simplicity of administration of course has its downsides. Depending on your environment, these may or may not be dealbreakers.

No access to the slow query log. This is huge. The single best tool for troubleshooting slow database response is this log file. Queries are a large part of keeping a relational database server healthy and happy, and without this facility, you are severely limited.

Locked-in downtime window. When you sign up for RDS, you must define a thirty minute maintenance window. This is a weekly window during which your instance *COULD* be unavailable. When you host yourself, you may not require as much downtime at all, especially if you’re using master-master mysql and a zero-downtime configuration.

Can’t use Percona Server to host your MySQL data. You won’t be able to do this in RDS. Percona Server is a high performance distribution of MySQL which typically rolls in serious performance tweaks and updates before they make it to the community edition. It is well worth considering.

No access to filesystem, server metrics & command line. Again for troubleshooting problems, these are crucial. Gathering data about what’s really happening on the server is how you begin to diagnose and troubleshoot a server stall or pileup.

You are beholden to Amazon’s support services if things go awry. That’s because you won’t have access to the raw iron to diagnose and troubleshoot things yourself. Want to call in an outside consultant to help you debug or troubleshoot? You’ll have your hands tied without access to the underlying server.

You can’t replicate to a non-RDS database. Have your own datacenter connected to Amazon via VPC? Want to replicate to a cloud server? RDS won’t fit the bill. You’ll have to roll your own, as we’ve described above. And if you want to replicate to an alternate cloud provider, again RDS won’t work for you.
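A note on the server_id script from earlier in this article: concatenating the digits of the hostname can map two different addresses to the same number (10.2.3.40 and 10.23.4.0 both give 102340), and longer addresses can exceed MySQL's 32-bit server_id limit of 4294967295. One possible alternative, which is not part of the original setup, is to pack the four octets of the private IPv4 address into a single 32-bit integer. The sketch below assumes you can obtain the instance's dotted-quad private IP as a string; the function name ip_to_server_id is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a dotted-quad IPv4 address into a
# unique integer in the range 0..4294967295.
ip_to_server_id() {
    # Split the argument on dots into the four positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    # Pack the octets: a*2^24 + b*2^16 + c*2^8 + d
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Example: ip_to_server_id 10.20.30.40   prints 169090600
```

Because every distinct IPv4 address maps to a distinct integer, two slaves in the same VPC cannot end up with the same value, and the result always fits the server_id range.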
July 20, 2012
by Sean Hull
· 18,503 Views
Spring Data - Apache Hadoop
Spring for Apache Hadoop is a Spring project to support writing applications that can benefit from the integration of the Spring Framework and Hadoop. This post describes how to use Spring Data Apache Hadoop in an Amazon EC2 environment using the “Hello World” equivalent of Hadoop programming: a Wordcount application.

1./ Launch an Amazon Web Services EC2 instance.

- Navigate to the AWS EC2 Console (“https://console.aws.amazon.com/ec2/home”).
- Select Launch Instance, then Classic Wizard, and click on Continue.

My test environment was a “Basic Amazon Linux AMI 2011.09” 32-bit, instance type Micro (t1.micro, 613 MB), with security group quick-start-1, which enables ssh to be used for login. Select your existing key pair (or create a new one). Obviously you can select another AMI and instance type depending on your favourite flavour. (Should you opt for a Windows 2008 based instance, you also need to have cygwin installed as an additional Hadoop prerequisite besides the Java JDK and ssh; see the “Install Apache Hadoop” section.)

2./ Download Apache Hadoop

As of writing this article, 1.0.0 is the latest stable version of Apache Hadoop, and that is what was used for testing purposes. I downloaded hadoop-1.0.0.tar.gz and copied it into the /home/ec2-user directory using the pscp command from my PC running Windows:

c:\downloads>pscp -i mykey.ppk hadoop-1.0.0.tar.gz [email protected]:/home/ec2-user

(The computer name above, ec2-ipaddress-region-compute.amazonaws.com, can be found on the AWS EC2 console, Instance Description, public DNS field.)

3./ Install Apache Hadoop

As prerequisites, you need to have Java JDK 1.6 and ssh installed; see the Apache Single-Node Setup Guide. (ssh is automatically installed with the Basic Amazon AMI.)
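Before going further, it may help to confirm that both prerequisites are actually on the PATH of the instance. The following is a minimal sketch, not part of the original walkthrough; the function name missing_commands is hypothetical, and it only checks that the commands exist, not which JDK version is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed command that is
# not found on PATH; return non-zero if at least one is missing.
missing_commands() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            status=1
        fi
    done
    return $status
}

# Example: missing_commands java ssh
```

An empty result means both tools were found; any name printed needs to be installed before Hadoop will start.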
Then install hadoop itself:

$ cd ~ # change directory to ec2-user home (/home/ec2-user)
$ tar xvzf hadoop-1.0.0.tar.gz
$ ln -s hadoop-1.0.0 hadoop
$ cd hadoop/conf
$ vi hadoop-env.sh # edit as below
export JAVA_HOME=/opt/jdk1.6.0_29

$ vi core-site.xml # edit as below; this defines the namenode to be running on localhost and listening on port 9000
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

$ vi hdfs-site.xml # edit as below; this sets the file system replication factor to 1 (in a production environment it is 3 by default)
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

$ vi mapred-site.xml # edit as below; this defines the jobtracker to be running on localhost and listening on port 9001
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

$ cd ~/hadoop
$ bin/hadoop namenode -format
$ bin/start-all.sh

At this stage all hadoop daemons are running in pseudo-distributed mode. You can verify it by running:

$ ps -ef | grep java

You should see 5 java processes: namenode, secondarynamenode, datanode, jobtracker and tasktracker.

4./ Install Spring Data Hadoop

Download the Spring Data Hadoop package from the SpringSource community download site. As of writing this article, the latest stable version is spring-data-hadoop-1.0.0.M1.zip.

$ cd ~
$ unzip spring-data-hadoop-1.0.0.M1.zip
$ ln -s spring-data-hadoop-1.0.0.M1 spring-data-hadoop

5./ Build and Run Spring Data Hadoop Wordcount example

$ cd spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount

Spring Data Hadoop uses gradle as its build tool. Check the build.gradle build file. The original version packaged in the archive does not compile; it complains about thrift, version 0.2.0, and jdo2-api, version 2.3-ec. Add the datanucleus.org maven repository to the build.gradle file to support jdo2-api (http://www.datanucleus.org/downloads/maven2/).

Unfortunately, there seems to be no maven repo for thrift 0.2.0. You should download the thrift-0.2.0.jar and thrift-0.2.0.pom files, e.g. from this repo: “http://people.apache.org/~rawson/repo”, and then add them to the local maven repo:
$ mvn install:install-file -DgroupId=org.apache.thrift -DartifactId=thrift -Dversion=0.2.0 -Dfile=thrift-0.2.0.jar -Dpackaging=jar

$ vi build.gradle # modify the build file to refer to the datanucleus maven repo for jdo2-api and the local repo for thrift
repositories {
    // Public Spring artefacts
    mavenCentral()
    maven { url "http://repo.springsource.org/libs-release" }
    maven { url "http://repo.springsource.org/libs-milestone" }
    maven { url "http://repo.springsource.org/libs-snapshot" }
    maven { url "http://www.datanucleus.org/downloads/maven2/" }
    maven { url "file:///home/ec2-user/.m2/repository" }
}

I also modified the META-INF/spring/context.xml file in order to run hadoop file system commands manually:

$ cd /home/ec2-user/spring-data-hadoop/spring-data-hadoop-1.0.0.M1/samples/wordcount/src/main/resources
$ vi META-INF/spring/context.xml # remove clean-script and also the dependency on it for JobRunner
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:context="http://www.springframework.org/schema/context"
    xmlns:hdp="http://www.springframework.org/schema/hadoop"
    xmlns:p="http://www.springframework.org/schema/p"
    xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
        http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
        http://www.springframework.org/schema/hadoop http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">
  <hdp:configuration>
    fs.default.name=${hd.fs}
  </hdp:configuration>
  <!-- job and JobRunner definitions as shipped with the sample, minus clean-script -->
</beans>

Copy the sample file, nietzsche-chapter-1.txt, to the Hadoop file system (/user/ec2-user/input directory):

$ cd src/main/resources/data
$ hadoop fs -mkdir /user/ec2-user/input
$ hadoop fs -put nietzsche-chapter-1.txt /user/ec2-user/input/data
$ cd ../../../..
# go back to samples/wordcount directory
$ ../gradlew

Verify the result:

$ hadoop fs -cat /user/ec2-user/output/part-r-00000 | more
“AWAY 1
“BY 1
“Beyond 1
“By 2
“Cheers 1
“DE 1
“Everywhere 1
“FROM” 1
“Flatterers 1
“Freedom 1
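The reducer output above is ordered by word, not by frequency. To see the most frequent words first, the output can be piped through sort. This is a minimal sketch, assuming the part file holds one tab-separated word and count per line (the default for Hadoop's text output); the function name top_words is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "word<TAB>count" lines on stdin and print
# the N lines with the highest counts (N defaults to 10).
top_words() {
    tab=$(printf '\t')
    # Sort numerically on the second (count) field, largest first.
    sort -t "$tab" -k2,2nr | head -n "${1:-10}"
}

# Example:
#   hadoop fs -cat /user/ec2-user/output/part-r-00000 | top_words 20
```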
July 19, 2012
by Istvan Szegedi
· 11,915 Views
My Experience Moving Data from MySQL to Cassandra
I had a relational database that I wanted to migrate to cassandra. Cassandra's sstableloader provides an option to load existing data from flat files into a cassandra ring. Hence it can be used as a way to migrate data from relational databases to cassandra, as most relational databases let us export the data into flat files.

sqoop gives the option to do this effectively. Interestingly, DataStax Enterprise provides everything we want in the big data space as a package. This includes cassandra, hadoop, hive, pig, sqoop, and mahout, which comes in handy in this case.

Under the resources directory, you may find the cassandra, dse, hadoop, hive, log4j-appender, mahout, pig, solr, sqoop, and tomcat specific configurations. For example, from resources/hadoop/bin, you may format the hadoop name node using ./hadoop namenode -format as usual.

* Download and extract the DataStax Enterprise binary archive (dse-2.1-bin.tar.gz).
* Follow the documentation, which is also available as a PDF.
* Migrating a relational database to cassandra is documented and is also blogged.
* Before starting DataStax, make sure that JAVA_HOME is set. This can also be set directly in conf/hadoop-env.sh.
* Include the connector to the relational database in a location reachable by sqoop. I put mysql-connector-java-5.1.12-bin.jar under resources/sqoop.
* Set the environment.

$ bin/dse-env.sh

* Start DataStax Enterprise as an Analytics node.

$ sudo bin/dse cassandra -t

Here cassandra starts the Cassandra process plus CassandraFS, and the -t option starts the Hadoop JobTracker and TaskTracker processes. If you start without the -t flag, the exception below will be thrown during the operations discussed later.

No jobtracker found
Unable to run : jobtracker not found

Hence do not miss the -t flag.

* Start the cassandra cli to view the cassandra keyspaces. You will be able to view the data in cassandra once you migrate it using sqoop as given below.
$ bin/cassandra-cli -host localhost -port 9160

Confirm from the CLI that it is connected to the test cluster on port 9160:

[default@unknown] describe cluster;
Cluster Information:
   Snitch: com.datastax.bdp.snitch.DseDelegateSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
      f5a19a50-b616-11e1-0000-45b29245ddff: [127.0.1.1]

If you omitted the host/port (starting the cli with just bin/cassandra-cli) or gave them incorrectly, you will get the response "Not connected to a cassandra instance."

$ bin/dse sqoop import --connect jdbc:mysql://127.0.0.1:3306/shopping_cart_db --username root --password root --table Category --split-by categoryName --cassandra-keyspace shopping_cart_db --cassandra-column-family Category_cf --cassandra-row-key categoryName --cassandra-thrift-host localhost --cassandra-create-schema

The above command migrates the table "Category" in shopping_cart_db, with the primary key categoryName, into a cassandra keyspace named shopping_cart_db and a column family named Category_cf, with the cassandra row key categoryName. You may use the mysql-specific --direct option, which is faster. In the above command, everything runs on localhost. The Category table has the following structure:

+--------------+-------------+------+-----+---------+-------+
| Field        | Type        | Null | Key | Default | Extra |
+--------------+-------------+------+-----+---------+-------+
| categoryName | varchar(50) | NO   | PRI | NULL    |       |
| description  | text        | YES  |     | NULL    |       |
| image        | blob        | YES  |     | NULL    |       |
+--------------+-------------+------+-----+---------+-------+

This also creates the respective java class (Category.java) inside the working directory.

To import all the tables in the database instead of a single table:

$ bin/dse sqoop import-all-tables -m 1 --connect jdbc:mysql://127.0.0.1:3306/shopping_cart_db --username root --password root --cassandra-thrift-host localhost --cassandra-create-schema --direct

Here the "-m 1" option ensures a sequential import.
If not specified, the below exception will be thrown.

ERROR tool.ImportAllTablesTool: Error during import: No primary key could be found for table Category. Please specify one with --split-by or perform a sequential import with '-m 1'.

To check whether the keyspace is created:

[default@unknown] show keyspaces;
................
Keyspace: shopping_cart_db:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
    Options: [replication_factor:1]
  Column Families:
    ColumnFamily: Category_cf
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds / keys to save : 0.0/0/all
      Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider
      Key cache size / save period in seconds: 200000.0/14400
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: true
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
.............
[default@unknown] describe shopping_cart_db;
Keyspace: shopping_cart_db:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
    Options: [replication_factor:1]
  Column Families:
    ColumnFamily: Category_cf
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Row cache size / save period in seconds / keys to save : 0.0/0/all
      Row Cache Provider: org.apache.cassandra.cache.SerializingCacheProvider
      Key cache size / save period in seconds: 200000.0/14400
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      Replicate on write: true
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy

You may also use hive to view the databases created in cassandra, in an sql-like manner.

* Start Hive

$ bin/dse hive
hive> show databases;
OK
default
shopping_cart_db

When the entire database is imported as above, separate java classes will be created for each of the tables.

$ bin/dse sqoop import-all-tables -m 1 --connect jdbc:mysql://127.0.0.1:3306/shopping_cart_db --username root --password root --cassandra-thrift-host localhost --cassandra-create-schema --direct
12/06/15 15:42:11 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
12/06/15 15:42:11 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/06/15 15:42:11 INFO tool.CodeGenTool: Beginning code generation
12/06/15 15:42:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `Category` AS t LIMIT 1
12/06/15 15:42:11 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/..
Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Category.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:42:13 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Category.jar 12/06/15 15:42:13 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:42:13 INFO mapreduce.ImportJobBase: Beginning import of Category 12/06/15 15:42:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 12/06/15 15:42:15 INFO mapred.JobClient: Running job: job_201206151241_0007 12/06/15 15:42:16 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:42:25 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:42:25 INFO mapred.JobClient: Job complete: job_201206151241_0007 12/06/15 15:42:25 INFO mapred.JobClient: Counters: 18 12/06/15 15:42:25 INFO mapred.JobClient: Job Counters 12/06/15 15:42:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6480 12/06/15 15:42:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:42:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:42:25 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:42:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:42:25 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:42:25 INFO mapred.JobClient: Bytes Written=2848 12/06/15 15:42:25 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:42:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21419 12/06/15 15:42:25 INFO mapred.JobClient: CFS_BYTES_WRITTEN=2848 12/06/15 15:42:25 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:42:25 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:42:25 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:42:25 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:42:25 INFO mapred.JobClient: Map input records=1 12/06/15 15:42:25 INFO mapred.JobClient: Physical memory (bytes) snapshot=119435264 
12/06/15 15:42:25 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:42:25 INFO mapred.JobClient: CPU time spent (ms)=630 12/06/15 15:42:25 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:42:25 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2085318656 12/06/15 15:42:25 INFO mapred.JobClient: Map output records=36 12/06/15 15:42:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:42:25 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 11.4492 seconds (0 bytes/sec) 12/06/15 15:42:25 INFO mapreduce.ImportJobBase: Retrieved 36 records. 12/06/15 15:42:25 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:42:25 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `Customer` AS t LIMIT 1 12/06/15 15:42:25 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Customer.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 
12/06/15 15:42:25 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Customer.jar 12/06/15 15:42:26 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:42:26 INFO mapreduce.ImportJobBase: Beginning import of Customer 12/06/15 15:42:26 INFO mapred.JobClient: Running job: job_201206151241_0008 12/06/15 15:42:27 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:42:35 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:42:35 INFO mapred.JobClient: Job complete: job_201206151241_0008 12/06/15 15:42:35 INFO mapred.JobClient: Counters: 17 12/06/15 15:42:35 INFO mapred.JobClient: Job Counters 12/06/15 15:42:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6009 12/06/15 15:42:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:42:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:42:35 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:42:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:42:35 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:42:35 INFO mapred.JobClient: Bytes Written=0 12/06/15 15:42:35 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:42:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21489 12/06/15 15:42:35 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:42:35 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:42:35 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:42:35 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:42:35 INFO mapred.JobClient: Map input records=1 12/06/15 15:42:35 INFO mapred.JobClient: Physical memory (bytes) snapshot=164855808 12/06/15 15:42:35 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:42:35 INFO mapred.JobClient: CPU time spent (ms)=510 12/06/15 15:42:35 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:42:35 INFO mapred.JobClient: Virtual memory 
(bytes) snapshot=2082869248 12/06/15 15:42:35 INFO mapred.JobClient: Map output records=0 12/06/15 15:42:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:42:35 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.3143 seconds (0 bytes/sec) 12/06/15 15:42:35 INFO mapreduce.ImportJobBase: Retrieved 0 records. 12/06/15 15:42:35 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:42:35 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `OrderEntry` AS t LIMIT 1 12/06/15 15:42:35 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/OrderEntry.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:42:35 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/OrderEntry.jar 12/06/15 15:42:36 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import 12/06/15 15:42:36 INFO mapreduce.ImportJobBase: Beginning import of OrderEntry 12/06/15 15:42:36 INFO mapred.JobClient: Running job: job_201206151241_0009 12/06/15 15:42:37 INFO mapred.JobClient: map 0% reduce 0% 12/06/15 15:42:45 INFO mapred.JobClient: map 100% reduce 0% 12/06/15 15:42:45 INFO mapred.JobClient: Job complete: job_201206151241_0009 12/06/15 15:42:45 INFO mapred.JobClient: Counters: 17 12/06/15 15:42:45 INFO mapred.JobClient: Job Counters 12/06/15 15:42:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6381 12/06/15 15:42:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/15 15:42:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/15 15:42:45 INFO mapred.JobClient: Launched map tasks=1 12/06/15 15:42:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 12/06/15 15:42:45 INFO mapred.JobClient: File Output Format Counters 12/06/15 15:42:45 INFO mapred.JobClient: Bytes 
Written=0 12/06/15 15:42:45 INFO mapred.JobClient: FileSystemCounters 12/06/15 15:42:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21569 12/06/15 15:42:45 INFO mapred.JobClient: CFS_BYTES_READ=87 12/06/15 15:42:45 INFO mapred.JobClient: File Input Format Counters 12/06/15 15:42:45 INFO mapred.JobClient: Bytes Read=0 12/06/15 15:42:45 INFO mapred.JobClient: Map-Reduce Framework 12/06/15 15:42:45 INFO mapred.JobClient: Map input records=1 12/06/15 15:42:45 INFO mapred.JobClient: Physical memory (bytes) snapshot=137252864 12/06/15 15:42:45 INFO mapred.JobClient: Spilled Records=0 12/06/15 15:42:45 INFO mapred.JobClient: CPU time spent (ms)=520 12/06/15 15:42:45 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600 12/06/15 15:42:45 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2014703616 12/06/15 15:42:45 INFO mapred.JobClient: Map output records=0 12/06/15 15:42:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=87 12/06/15 15:42:45 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.2859 seconds (0 bytes/sec) 12/06/15 15:42:45 INFO mapreduce.ImportJobBase: Retrieved 0 records. 12/06/15 15:42:45 INFO tool.CodeGenTool: Beginning code generation 12/06/15 15:42:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `OrderItem` AS t LIMIT 1 12/06/15 15:42:45 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/.. Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/OrderItem.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details. 12/06/15 15:42:45 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/OrderItem.jar 12/06/15 15:42:46 WARN manager.CatalogQueryManager: The table OrderItem contains a multi-column primary key. Sqoop will default to the column orderNumber only for this job. 
12/06/15 15:42:46 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import
12/06/15 15:42:46 INFO mapreduce.ImportJobBase: Beginning import of OrderItem
12/06/15 15:42:46 INFO mapred.JobClient: Running job: job_201206151241_0010
12/06/15 15:42:47 INFO mapred.JobClient: map 0% reduce 0%
12/06/15 15:42:55 INFO mapred.JobClient: map 100% reduce 0%
12/06/15 15:42:55 INFO mapred.JobClient: Job complete: job_201206151241_0010
12/06/15 15:42:55 INFO mapred.JobClient: Counters: 17
12/06/15 15:42:55 INFO mapred.JobClient: Job Counters
12/06/15 15:42:55 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5949
12/06/15 15:42:55 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 15:42:55 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 15:42:55 INFO mapred.JobClient: Launched map tasks=1
12/06/15 15:42:55 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/06/15 15:42:55 INFO mapred.JobClient: File Output Format Counters
12/06/15 15:42:55 INFO mapred.JobClient: Bytes Written=0
12/06/15 15:42:55 INFO mapred.JobClient: FileSystemCounters
12/06/15 15:42:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21524
12/06/15 15:42:55 INFO mapred.JobClient: CFS_BYTES_READ=87
12/06/15 15:42:55 INFO mapred.JobClient: File Input Format Counters
12/06/15 15:42:55 INFO mapred.JobClient: Bytes Read=0
12/06/15 15:42:55 INFO mapred.JobClient: Map-Reduce Framework
12/06/15 15:42:55 INFO mapred.JobClient: Map input records=1
12/06/15 15:42:55 INFO mapred.JobClient: Physical memory (bytes) snapshot=116674560
12/06/15 15:42:55 INFO mapred.JobClient: Spilled Records=0
12/06/15 15:42:55 INFO mapred.JobClient: CPU time spent (ms)=590
12/06/15 15:42:55 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600
12/06/15 15:42:55 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2014703616
12/06/15 15:42:55 INFO mapred.JobClient: Map output records=0
12/06/15 15:42:55 INFO mapred.JobClient: SPLIT_RAW_BYTES=87
12/06/15 15:42:55 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.2539 seconds (0 bytes/sec)
12/06/15 15:42:55 INFO mapreduce.ImportJobBase: Retrieved 0 records.
12/06/15 15:42:55 INFO tool.CodeGenTool: Beginning code generation
12/06/15 15:42:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `Payment` AS t LIMIT 1
12/06/15 15:42:55 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/..
Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Payment.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
12/06/15 15:42:55 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Payment.jar
12/06/15 15:42:56 WARN manager.CatalogQueryManager: The table Payment contains a multi-column primary key. Sqoop will default to the column orderNumber only for this job.
12/06/15 15:42:56 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import
12/06/15 15:42:56 INFO mapreduce.ImportJobBase: Beginning import of Payment
12/06/15 15:42:56 INFO mapred.JobClient: Running job: job_201206151241_0011
12/06/15 15:42:57 INFO mapred.JobClient: map 0% reduce 0%
12/06/15 15:43:05 INFO mapred.JobClient: map 100% reduce 0%
12/06/15 15:43:05 INFO mapred.JobClient: Job complete: job_201206151241_0011
12/06/15 15:43:05 INFO mapred.JobClient: Counters: 17
12/06/15 15:43:05 INFO mapred.JobClient: Job Counters
12/06/15 15:43:05 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5914
12/06/15 15:43:05 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 15:43:05 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 15:43:05 INFO mapred.JobClient: Launched map tasks=1
12/06/15 15:43:05 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/06/15 15:43:05 INFO mapred.JobClient: File Output Format Counters
12/06/15 15:43:05 INFO mapred.JobClient: Bytes Written=0
12/06/15 15:43:05 INFO mapred.JobClient: FileSystemCounters
12/06/15 15:43:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21518
12/06/15 15:43:05 INFO mapred.JobClient: CFS_BYTES_READ=87
12/06/15 15:43:05 INFO mapred.JobClient: File Input Format Counters
12/06/15 15:43:05 INFO mapred.JobClient: Bytes Read=0
12/06/15 15:43:05 INFO mapred.JobClient: Map-Reduce Framework
12/06/15 15:43:05 INFO mapred.JobClient: Map input records=1
12/06/15 15:43:05 INFO mapred.JobClient: Physical memory (bytes) snapshot=137998336
12/06/15 15:43:05 INFO mapred.JobClient: Spilled Records=0
12/06/15 15:43:05 INFO mapred.JobClient: CPU time spent (ms)=520
12/06/15 15:43:05 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600
12/06/15 15:43:05 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2082865152
12/06/15 15:43:05 INFO mapred.JobClient: Map output records=0
12/06/15 15:43:05 INFO mapred.JobClient: SPLIT_RAW_BYTES=87
12/06/15 15:43:05 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.2642 seconds (0 bytes/sec)
12/06/15 15:43:05 INFO mapreduce.ImportJobBase: Retrieved 0 records.
12/06/15 15:43:05 INFO tool.CodeGenTool: Beginning code generation
12/06/15 15:43:05 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `Product` AS t LIMIT 1
12/06/15 15:43:06 INFO orm.CompilationManager: HADOOP_HOME is /home/pradeeban/programs/dse-2.1/resources/hadoop/bin/..
Note: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Product.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
12/06/15 15:43:06 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-pradeeban/compile/926ddf787c73be06c4e2ad1f8fc522f1/Product.jar
12/06/15 15:43:06 INFO manager.DirectMySQLManager: Beginning mysqldump fast path import
12/06/15 15:43:06 INFO mapreduce.ImportJobBase: Beginning import of Product
12/06/15 15:43:07 INFO mapred.JobClient: Running job: job_201206151241_0012
12/06/15 15:43:08 INFO mapred.JobClient: map 0% reduce 0%
12/06/15 15:43:16 INFO mapred.JobClient: map 100% reduce 0%
12/06/15 15:43:16 INFO mapred.JobClient: Job complete: job_201206151241_0012
12/06/15 15:43:16 INFO mapred.JobClient: Counters: 18
12/06/15 15:43:16 INFO mapred.JobClient: Job Counters
12/06/15 15:43:16 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=5961
12/06/15 15:43:16 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/15 15:43:16 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/15 15:43:16 INFO mapred.JobClient: Launched map tasks=1
12/06/15 15:43:16 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/06/15 15:43:16 INFO mapred.JobClient: File Output Format Counters
12/06/15 15:43:16 INFO mapred.JobClient: Bytes Written=248262
12/06/15 15:43:16 INFO mapred.JobClient: FileSystemCounters
12/06/15 15:43:16 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21527
12/06/15 15:43:16 INFO mapred.JobClient: CFS_BYTES_WRITTEN=248262
12/06/15 15:43:16 INFO mapred.JobClient: CFS_BYTES_READ=87
12/06/15 15:43:16 INFO mapred.JobClient: File Input Format Counters
12/06/15 15:43:16 INFO mapred.JobClient: Bytes Read=0
12/06/15 15:43:16 INFO mapred.JobClient: Map-Reduce Framework
12/06/15 15:43:16 INFO mapred.JobClient: Map input records=1
12/06/15 15:43:16 INFO mapred.JobClient: Physical memory (bytes) snapshot=144871424
12/06/15 15:43:16 INFO mapred.JobClient: Spilled Records=0
12/06/15 15:43:16 INFO mapred.JobClient: CPU time spent (ms)=1030
12/06/15 15:43:16 INFO mapred.JobClient: Total committed heap usage (bytes)=121241600
12/06/15 15:43:16 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2085318656
12/06/15 15:43:16 INFO mapred.JobClient: Map output records=300
12/06/15 15:43:16 INFO mapred.JobClient: SPLIT_RAW_BYTES=87
12/06/15 15:43:16 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 9.2613 seconds (0 bytes/sec)
12/06/15 15:43:16 INFO mapreduce.ImportJobBase: Retrieved 300 records.

I found DataStax an interesting project to explore further. I have blogged about the issues I faced as a learner, and how easily they can be fixed: Issues that you may encounter during the migration to Cassandra using DataStax/Sqoop and the fixes.
July 16, 2012
by Pradeeban Kathiravelu
· 20,430 Views · 2 Likes
Working with MongoDB MultiMaster
Learn all about working with MongoDB multimaster.
July 11, 2012
by Rick Copeland
· 28,209 Views · 2 Likes
5 Things You Should Check Now to Improve PHP Web Performance
We all know how financially important it is for your app’s server architecture to handle peaks of load. This article discusses 5 tips for improving PHP Web performance.
July 11, 2012
by Gonzalo Ayuso
· 263,883 Views · 2 Likes
The Activiti Performance Showdown
The question everybody always asks when they learn about Activiti is as old as software development itself: “How does it perform?” Up till now, when you asked me that question, I would tell you about how Activiti minimizes database access in every way possible, how we break down the process structure into an ‘execution tree’ which allows for fast queries, or how we leverage ten years of workflow framework development knowledge. You know, trying to get around the question without answering it. We knew it was fast because of the theoretical foundation upon which we built it. But now we have proof: real numbers. Yes, it’s going to be a lengthy post. But trust me, it’ll be worth your time!

Disclaimer: performance benchmarks are hard. Really hard. Different machines, slightly different test setups … very small things can change the results seriously. The numbers here are only meant to show that the Activiti engine has a very minimal overhead, while also integrating very easily into the Java ecosystem and offering BPMN 2.0 process execution.

The Activiti Benchmark Project

To test the process execution overhead of the Activiti engine, I created a little side project on GitHub: https://github.com/jbarrez/activiti-benchmark

The project currently contains 9 test processes, which we’ll analyse below. The logic in the project is pretty straightforward:

  • A process engine is created for each test run.
  • Each of the processes is executed sequentially on this process engine, using a thread pool of 1 up to 10 threads.
  • All the processes are thrown into a bag, from which a number of random executions are drawn.
  • All the results are collected, and an HTML report with some nice charts is generated.

To run the benchmark, simply follow the instructions on the GitHub page to build and execute the jar.

Benchmark Results

The test machine I used for the results is my (fairly old) desktop machine: AMD Phenom II X4 940 3.0 GHz, 8 GB 800 MHz RAM and an old-skool 7200 rpm HD, running Ubuntu 11.10.
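The measurement loop described above can be sketched roughly as follows. This is a minimal illustration, not code from the activiti-benchmark project: the class `BenchmarkSketch`, its `run` method and the `Result` type are hypothetical names, and a no-op `Runnable` stands in for "start one process instance on the engine". It reports the two figures used throughout this post: the average time of one execution and the throughput in executions per second.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical harness; an assumption for illustration, not the real benchmark code. */
public class BenchmarkSketch {

    /** Outcome of one run with a fixed pool size. */
    public static final class Result {
        public final int threads;
        public final int executions;
        public final double avgMillis;  // mean time of a single execution
        public final double perSecond;  // executions completed per wall-clock second

        Result(int threads, int executions, double avgMillis, double perSecond) {
            this.threads = threads;
            this.executions = executions;
            this.avgMillis = avgMillis;
            this.perSecond = perSecond;
        }
    }

    /** Runs {@code work} {@code executions} times on a fixed pool of {@code threads} threads. */
    public static Result run(Runnable work, int executions, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong busyNanos = new AtomicLong();
        long wallStart = System.nanoTime();
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < executions; i++) {
                pending.add(pool.submit(() -> {
                    long start = System.nanoTime();
                    work.run();
                    busyNanos.addAndGet(System.nanoTime() - start);
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark run failed", e);
        } finally {
            pool.shutdown();
        }
        long wallNanos = Math.max(1, System.nanoTime() - wallStart);
        double avgMillis = busyNanos.get() / 1e6 / executions;
        double perSecond = executions * 1e9 / wallNanos;
        return new Result(threads, executions, avgMillis, perSecond);
    }

    public static void main(String[] args) {
        Runnable noOp = () -> { }; // placeholder for starting one process instance
        for (int threads = 1; threads <= 10; threads++) {
            Result r = run(noOp, 2500, threads);
            System.out.printf("%2d threads: avg %.4f ms, %.0f executions/s%n",
                    r.threads, r.avgMillis, r.perSecond);
        }
    }
}
```

Measuring each execution separately from the wall-clock time is what makes it possible for the average time to get worse while the throughput still improves as threads are added.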
The database used for the test runs on the same machine as the tests themselves, so keep in mind that in a ‘real’ server environment the results could be even better!

  • The benchmark project I mentioned above was executed on a default Ubuntu MySQL 5 database. I just switched to the ‘large.cnf’ setting (which throws more RAM at the DB and stuff like that) instead of the default config.
  • Each of the test processes ran 2500 times, using a thread pool going from one to ten threads. In simpleton language: 2500 process executions using just one thread, 2500 process executions using two threads, 2500 process executions using three … yeah, you get it.
  • Each benchmark run was done using a ‘default’ Activiti process engine. This basically means a ‘regular’ standalone Activiti engine, created in plain Java.
  • Each benchmark run was also done in a ‘Spring’ config. Here, the process engine was constructed by wrapping it in the factory bean, the datasource is a Spring datasource, and the transactions and connection pool are also managed by Spring (I’m actually using a tweaked BoneCP connection pool).
  • Each benchmark run was executed with history on the default history level (i.e. ‘audit’) and without history enabled (i.e. history level ‘none’).

The processes are analyzed in detail in the sections below, but here are the full results of the test runs already:

  • Activiti 5.9 – MySQL – default – history enabled
  • Activiti 5.9 – MySQL – default – history disabled
  • Activiti 5.9 – MySQL – Spring – history enabled
  • Activiti 5.9 – MySQL – Spring – history disabled

I ran all the tests using the latest public release of Activiti, being Activiti 5.9. However, my test runs brought some potential performance fixes to the surface (I also ran the benchmark project through a profiler). It quickly became clear that most of the process execution time was actually spent cleaning up when a process ended.
Basically, queries were often fired which would not be necessary if we saved some more state in our execution tree. I sat together with Daniel Meyer from Camunda and my colleague Frederik Heremans, and they’ve managed to commit fixes for this! As such, the current trunk of Activiti, being Activiti 5.10-SNAPSHOT at the moment, is significantly faster than 5.9.

  • Activiti 5.10 – MySQL – default – history enabled
  • Activiti 5.10 – MySQL – default – history disabled
  • Activiti 5.10 – MySQL – Spring – history enabled
  • Activiti 5.10 – MySQL – Spring – history disabled

From a high-level perspective (scroll down for detailed analysis), there are a few things to note:

  • I had expected some difference between the default and Spring config, due to the more ‘professional’ connection pool being used. However, the results for both environments are quite alike. Sometimes the default is faster, sometimes Spring. It’s hard to really find a pattern. As such, I omitted the Spring results in the detailed analyses below.
  • The best average timings are most of the time found when using four threads to execute the processes. This is probably due to having a quad-core machine.
  • The best throughput numbers are most of the time found when using eight threads to execute the processes. I can only assume that this also has something to do with having a quad-core machine.
  • When the number of threads in the thread pool goes up, the throughput (processes executed / second) goes up, but it has a negative effect on the average time. With more than six or seven threads in particular, you see this effect very clearly. This basically means that while each process on its own takes a little longer to execute, the multiple threads let you execute more of these ‘slower’ processes in the same amount of time.
  • Enabling history does have an impact. Often, enabling history will double execution time. This is logical, given that many extra records are inserted when history is on the default level (i.e. ‘audit’).

There was one last test I ran, just out of curiosity: running the best performing setting on an Oracle XE 11.2 database. Oracle XE is a free version of the ‘real’ Oracle database. No matter how hard I tried, I couldn’t get it running decently on Ubuntu. As such, I used an old Windows XP install on that same machine. However, the OS is 32-bit, which means the system only has 3.2 GB of the 8 GB of RAM available. Here are the results:

  • Activiti 5.10 – Oracle on Windows – default – history disabled

The results speak for themselves. Oracle blows away any of the (single-threaded) results on MySQL (and they are already very fast!). However, when going multi-threaded it is far worse than any of the MySQL results. My guess is that this is due to the limitations of the XE version: only one CPU is used, only 1 GB of RAM, etc. I would really like to run these tests on a real Oracle-managed-by-a-real-DBA … feel free to contact me if you are interested!

In the next sections, we will take a detailed look at the performance numbers of each of the test processes. An Excel sheet containing all the numbers and charts below can be downloaded for yourself.

Process 1: The Bare Minimum (One Transaction)

The first process is not a very interesting one, business-wise at least. After starting the process, the end is immediately reached. Not very useful on its own, but its numbers teach us one essential thing: the bare overhead of the Activiti engine. Here are the average timings:

This process runs in a single transaction, which means that nothing is saved to the database when history is disabled, due to Activiti’s optimizations. With history enabled, you’ll basically get the cost of inserting one row into the historical process instance table, which is around 4.44 ms here. It is also clear that our fix for Activiti 5.10 has an enormous impact here. In the previous version, 99% of the time was spent in the cleanup check of the process.
Take a look at the best result here: 0.47 ms when using 4 threads to execute 2500 runs of this process. That’s only half a millisecond! It’s fair to say that the Activiti engine overhead is extremely small.

The throughput numbers are equally impressive:

In the best case here, 8741 processes are executed. Per second. By the time you arrive here reading the post, you could have executed a few million instances of this process. You can also see that there is little difference between 4 or 8 threads here. Most of the execution time here is CPU time, and no potential collisions such as waiting for a database lock happen here.

In these numbers, you can also easily see that Oracle XE doesn’t scale well with multiple threads (which is explained above). You will see the same behavior in the following results.

Process 2: The Same, but a Bit Longer (One Transaction)

This process is pretty similar to the previous one. We again have only one transaction. After the process is started, we pass through seven no-op passthrough activities before reaching the end.

Some things to note here:

  • The best result (again 4 threads, with history disabled) is actually better than that of the simpler previous process. But also note that the single-threaded execution is a tad slower. This means that the process on its own is a bit slower, which is logical as it has more activities. But using more threads and having more activities in the process does allow for more potential interleaving. In the previous case, the thread was barely born before it was killed again.
  • The difference between history enabled/disabled is bigger than for the previous process. This is logical, as more history is written here (one record in the database for each activity).
  • Again, Activiti 5.10 is far superior to Activiti 5.9.

The throughput numbers follow these observations: there is more opportunity to use threading here. The best result lingers around 12000 process executions per second.
Again, it demonstrates the very lightweight execution of the Activiti engine.

Process 3: Parallelism in One Transaction

This process executes a parallel gateway that forks and one that joins in the same transaction. You would expect something along the lines of the previous results, but you’d be surprised:

Comparing these numbers with the previous process, you see that execution is slower. So why is this process slower, even if it has fewer activities? The reason lies in how the parallel gateway is implemented, especially the join behavior. The hard part, implementation-wise, is that you need to cope with the situation where multiple executions arrive at the join. To make sure that the behavior is atomic, we internally do some locking and fetch all child executions in the execution tree to find out whether the join activates or not. So it is quite a ‘costly’ operation compared to the ‘regular’ activities.

Do mind, we’re talking here about only 5 ms single-threaded and 3.59 ms in the best case for MySQL. Given the functionality that is required for implementing the parallel gateway, this is peanuts if you ask me.

The throughput numbers:

This is the first process which actually contains some ‘logic’. In the best case above, it means 1112 processes can be executed in a second. Pretty impressive, if you ask me!

Process 4: Now We’re Getting Somewhere (One Transaction)

This process already looks like something you’d see when modeling real business processes. We’re still running it in one database transaction though, as all the activities are automatic passthroughs. Here we also have two forks and two joins.

Take a look at the lowest number: 6.88 ms on Oracle when running with one thread. That’s freaking fast, taking into account all that is happening here. The history numbers are at least doubled here (Activiti 5.10), which makes sense because there is quite a bit of activity audit logging going on here.
You can also see that this causes a higher average time for four threads here, which is probably due to the implementation of the joining. If you know a bit about Activiti internals, you’ll understand this means there are quite a few executions in the execution tree. We have one big concurrent root, but also multiple children which are sometimes also concurrent roots.

But while the average time rises, the throughput definitely benefits:

Running this process with eight threads allows you to do 411 runs of this process in a single second.

There is also something peculiar here: the Oracle database performs better with more thread concurrency. This is completely contrary to all other measurements, where Oracle is always slower in that environment (see above for explanation). I assume it has something to do with the internal locking and forced update we are applying when forking/joining, which is better handled by Oracle, it seems.

Process 5: Adding Some Java Logic (Single Transaction)

I added this process to see the influence of adding a Java service task in a process. In this process, the first activity generates a random value, stores it as a process variable and then goes up or down in the process depending on the random value. The chance is about 50/50 to go up or down.

The average timings are very, very good. Actually, the results are in the same range as those of process 1 and 2 above (which had no activities or only automatic passthroughs). This means that the overhead of integrating Java logic into your process is nearly non-existent (nothing is free, of course). Of course, you can still write slow code in that logic, but you can’t blame the Activiti engine for that.

Throughput numbers are comparable to those of process 1 and 2: very, very high. In the best case here, more than 9000 processes are executed per second. That also means 9000 invocations of your own Java logic.
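To make the logic of process 5 concrete, here is a minimal, engine-free sketch of the service task and the up/down decision. In Activiti itself this code would typically live in a class implementing the engine's `JavaDelegate` interface and store the value through the execution object; to stay self-contained, the sketch replaces the engine with a plain `Map`. The class name, the variable name `randomValue` and the 0.5 threshold are assumptions made for illustration and are not taken from the benchmark project.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical stand-in for process 5; a Map plays the role of the process variables. */
public class RandomBranchSketch {

    /** Assumed variable name; the real benchmark may use a different one. */
    public static final String VAR = "randomValue";

    /** Service task stand-in: generate a random value and store it as a "process variable". */
    public static void serviceTask(Map<String, Object> variables, Random random) {
        variables.put(VAR, random.nextDouble());
    }

    /** Gateway stand-in: go "up" or "down" depending on the stored value (about 50/50). */
    public static String chooseBranch(Map<String, Object> variables) {
        double value = (Double) variables.get(VAR);
        return value < 0.5 ? "up" : "down";
    }

    public static void main(String[] args) {
        Random random = new Random();
        int runs = 2500;
        int up = 0;
        for (int i = 0; i < runs; i++) {
            Map<String, Object> variables = new HashMap<>(); // fresh variables per process instance
            serviceTask(variables, random);
            if ("up".equals(chooseBranch(variables))) {
                up++;
            }
        }
        System.out.println("up: " + up + ", down: " + (runs - up));
    }
}
```

The work done per invocation is a single random draw and a map write, which is consistent with the observation that the timings stay in the range of the processes without any custom logic.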
Process 6, 7 and 8: Adding Wait States and Transactions

The previous processes showed us the bare overhead of the Activiti engine. Here, we’ll take a look at how wait states and multiple transactions influence performance. For this, I added three test processes which contain user tasks. For each user task, the engine commits the current transaction and returns the thread to the client. Since the results are pretty much comparable for these processes, we’re grouping them here. These are the processes:

Here are the average timing results, in order of the processes above. For the first process, containing just one user task:

It is clear that having wait states and multiple transactions does influence performance. This is also logical: before, the engine could optimize by not inserting the runtime state into the database, because the process was finished in one transaction. Now the whole state, meaning the pointers to where you currently are, needs to be saved into the database. The process could be ‘sleeping’ like this for many days, months or years. The Activiti engine no longer holds it in memory, and is free to give its full attention to other processes.

If you check the results of the process with only one user task, you can see that in the best case (Oracle, single thread; the 4 threads on MySQL is pretty close) this is done in 6.27 ms. This is really fast, if you take into account that we have a few inserts (the execution tree, the task), a few updates (the execution tree) and deletes (cleaning up) going on here.

The second process here, with 7 user tasks:

The second chart teaches us that, logically, more transactions means more time. In the best case here the process is done in 32.12 ms. That is for seven transactions, which gives 4.6 ms for each transaction. So it is clear that the average time scales linearly when adding wait states. This of course makes sense, because transactions aren’t free.
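The per-transaction figure quoted above is plain division, under the simplifying assumption that each of the seven transactions costs about the same. The symbol $t_{tx}$ is introduced here only for illustration:

```latex
t_{tx} \approx \frac{32.12\ \text{ms}}{7} \approx 4.59\ \text{ms}
```

Rounded to one decimal, that is the 4.6 ms per transaction mentioned in the text.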
Also note that enabling history does add quite some overhead here. This is due to having the history level set to ‘audit’, which stores all the user task information in the history tables. This is also noticeable from the difference between Activiti 5.9 with history disabled and Activiti 5.10 with history enabled: this is a rare case where Activiti 5.10 with history enabled is slower than 5.9 with history disabled. But it is logical, given the volume of history stored here.

And the third process teaches us how user tasks and parallel gateways interact:

The third chart doesn’t teach us much that is new. We have two user tasks now, and the more ‘expensive’ fork/join (see above). The average timings are as we expected.

The throughput charts are as you would expect given the average timings: between 70 and 250 processes per second. Aw yeah!

To save some space, you’ll need to click them to enlarge:

Process 9: So What About Scopes?

For the last process, we’ll take a look at ‘scopes’. A ‘scope’ is what we call it internally in the engine, and it has to do with variable visibility, relationships between the pointers indicating process state, event catching, etc. BPMN 2.0 has quite a few cases for those scopes, for example with embedded subprocesses as shown in the process here. Basically, every subprocess can have boundary events (catching an error, a message, etc.) that are only applied to its internal activities when its scope is active. Without going into too much technical detail: to get scopes implemented correctly, you need some not-so-trivial logic.

The example process here has 4 subprocesses, nested in each other. The inner process uses concurrency, which is again a scope of its own for the Activiti engine. There are also two user tasks here, so that means two transactions. So let’s see how it performs:

You can clearly see the big difference between Activiti 5.9 and 5.10.
Scopes are indeed an area where the fixes around the ‘process cleanup’ at the end have a huge benefit, as many execution objects are created and persisted to represent the many different scopes. Single-threaded performance is not so good on Activiti 5.9. Luckily, as you can see from the gap between the blue and the red bars, those scopes do allow for high concurrency. The numbers for Oracle, combined with the multi-threaded results of the 5.10 tests, do prove that scopes are now efficiently handled by the engine.

The throughput charts prove that the process scales nicely with more threads, as you can see from the big gap between the red and green lines in the second-to-last block. In the best case, 64 processes of this more complex process are handled by the engine per second.

Random Execution

If you have already clicked on the full reports at the beginning of the post, you have probably noticed that random execution is also tested for each environment. In this setting, 2500 process executions were done, but the process was chosen randomly. As shown in those reports, this meant that over 2500 executions, each process was executed almost the same number of times (uniform distribution).

This last chart shows the best setting (Activiti 5.10, history disabled) and how the throughput of those random process executions changes when adding more threads:

As we’ve seen in many of the tests above, once past four threads things don’t change that much anymore. The numbers (167 processes/second) prove that in a realistic situation (i.e. multiple processes executing at the same time), the Activiti engine scales up nicely.

Conclusion

The average timing charts show two things clearly:

The Activiti engine is fast and overhead is minimal!

The difference between history enabled or disabled is definitely noticeable. Sometimes it even comes down to half the time needed.
All history tests were done using the ‘audit’ level, but there is a simpler history level (‘activity’) which might be good enough for the use case. Activiti is very flexible in history configuration, and you can tweak the history level for each process specifically. So do think about the level your process needs to have, if it needs to have history at all!

The throughput charts prove that the engine scales very well when more threads are available (i.e. any modern application server). Activiti is well designed to be used in high-throughput and high-availability (clustered) architectures.

As I said in the introduction, the numbers are what they are: just numbers. The main point I want to conclude with is that the Activiti engine is extremely lightweight. The overhead of using Activiti for automating your business processes is small. In general, if you need to automate your business processes or workflows, you want top-notch integration with any Java system and you want all of that fast and scalable … look no further!
July 10, 2012
by
· 11,121 Views