DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Databases Topics

article thumbnail
The Dangers of SQL JOINs & Aggregate Functions: How to Avoid Fanouts
Every SQL developer uses JOINs and aggregate functions, but some may not be aware that the two interacting can return incorrect results if not handled carefully. Brett Sauve at Looker has covered the issue in this recent post - specifically, he focuses on the issue of "fanouts" - and it's an easy problem to overlook. A fanout occurs when the primary table in a query contains fewer rows than the combined table resulting from a JOIN. Because the new table contains more rows, aggregate functions like COUNT or SUM may include duplicates, throwing off the results. According to Sauve, the answer is careful ordering of JOIN tables: Herein lies a key point: to help avoid fanouts, begin your joins with the most granular table. Sauve also provides some tips to help SQL developers understand how to avoid fanouts and ensure the validity of their data. He points to three types of JOINs: 1-to-1 Many-to-1 1-to-Many Knowing which JOIN you are using can rule out the possibility of a fanout, for example, because it is not possible in a 1-to-1 or Many-to-1 JOIN. If you are using a 1-to-Many JOIN, though, Sauve has some tips there, too. For example, a messy JOIN can be cleaned up by grouping data to create a 1-to-1 relationship. Check out the full article for the rest of Sauve's tips to make sure your aggregate functions are working correctly, and if you're looking for a few more tips to keep your SQL skills sharp, you might try one of these: Yet Another 10 Common Mistakes Java Developers Make When Writing SQL 6 Simple Performance Tips for SQL SELECT Statements
June 30, 2014
by Alec Noller
· 16,591 Views
article thumbnail
Option.fold() Considered Unreadable
We had a lengthy discussion recently during code review whether scala.Option.fold() is idiomatic and clever or maybe unreadable and tricky? Let's first describe what the problem is. Option.fold does two things: maps a function f overOption's value (if any) or returns an alternative alt if it's absent. Using simple pattern matching we can implement it as follows: val option: Option[T] = //... def alt: R = //... def f(in: T): R = //... val x: R = option match { case Some(v) => f(v) case None => alt } If you prefer one-liner, fold is actually a combination of map and getOrElse val x: R = option map f getOrElse alt Or, if you are a C programmer that still wants to write in C, but using Scala compiler: val x: R = if (option.isDefined) f(option.get) else alt Interestingly this is similar to how fold() is actually implemented, but that's an implementation detail. OK, all of the above can be replaced with single Option.fold(): val x: R = option.fold(alt)(f) Technically you can even use /: and \: operators (alt /: option) - but that would be simply masochistic. I have three problems with option.fold() idiom. First of all - it's anything but readable. We are folding (reducing) over Option - which doesn't really make much sense. Secondly it reverses the ordinary positive-then-negative-case flow by starting with failure (absence, alt) condition followed by presence block (f function; see also: Refactoring map-getOrElse to fold). Interestingly this method would work great for me if it was named mapOrElse: ** * Hypothetical in Option */ def mapOrElse[B](f: A => B, alt: => B): B = this map f getOrElse alt Actually there is already such method in Scalaz, called OptionW.cata. cata. Here is whatMartin Odersky has to say about it: "I personally find methods like cata that take two closures as arguments are often overdoing it. Do you really gain in readability over map + getOrElse? Think of a newcomer to your code[...]"While cata has some theoretical background, Option.fold just sounds like a random name collision that doesn't bring anything to the table, apart from confusion. I know what you'll say, that TraversableOnce has fold and we are sort-of doing the same thing. Why it's a random collision rather than extending the contract described inTraversableOnce? fold() method in Scala collections typically just delegates to one offoldLeft()/foldRight() (the one that works better for given data structure), thus it doesn't guarantee order and folding function has to be associative. But inOption.fold() the contract is different: folding function takes just one parameter rather than two. If you read my previous article about folds you know that reducing function always takes two parameters: current element and accumulated value (initial value during first iteration). But Option.fold() takes just one parameter: current Option value! This breaks the consistency, especially when realizing Option.foldLeft() andOption.foldRight() have correct contract (but it doesn't mean they are more readable). The only way to understand folding over option is to imagine Option as a sequence with0 or 1 elements. Then it sort of makes sense, right? No. def double(x: Int) = x * 2 Some(21).fold(-1)(double) //OK: 42 None.fold(-1)(double) //OK: -1 but: Some(21).toList.fold(-1)(double) : error: type mismatch; found : Int => Int required: (Int, Int) => Int Some(21).toList.fold(-1)(double) ^ If we treat Option[T] as a List[T], awkward Option.fold() breaks because it has different type than TraversableOnce.fold(). This is my biggest concern. I can't understand why folding wasn't defined in terms of the type system (trait?) and implemented strictly. As an example take a look at: Data.Foldable in Haskell (advanced) Data.Foldable typeclass describes various flavours of folding in Haskell. There are familiar foldl/foldr/foldl1/foldr1, in Scala namedfoldLeft/foldRight/reduceLeft/reduceRight accordingly. They have the same type as Scala and behave unsurprisingly with all types that you can fold over, including Maybe, lists, arrays, etc. There is also a function named fold, but it has a completely different meaning: class Foldable t where fold :: Monoid m => t m -> m While other folds are quite complex, this one barely takes a foldable container of ms (which have to be Monoids) and returns the same Monoid type. A quick recap: a type can be aMonoid if there exists a neutral value of that type and an operation that takes two values and produces just one. Applying that function with one of the arguments being neutral value yields the other argument. String ([Char]) is a good example with empty string being neutral value (mempty) and string concatenation being such operation (mappend). Notice that there are two different ways you can construct monoids for numbers: under addition with neutral value being 0 (x + 0 == 0 + x == x for any x) and under multiplication with neutral 1 (x * 1 == 1 * x == x for any x). Let's stick to strings. If I fold empty list of strings, I'll get an empty string. But when a list contains many elements, they are being concatenated: > fold ([] :: [String]) "" > fold [] :: String "" > fold ["foo", "bar"] "foobar" In the first example we have to explicitly say what is the type of empty list []. Otherwise Haskell compiler can't figure out what is the type of elements in a list, thus which monoid instance to choose. In second example we declare that whatever is returned from fold [], it should be a String. From that the compiler infers that [] actually must have a type of [String]. Last fold is the simplest: the program folds over elements in list and concatenates them because concatenation is the operation defined in Monoid Stringtypeclass instance. Back to options (or more precisely Maybe). Folding over Maybe monad having type parameter being Monoid (I can't believe I just said it) has an interesting interpretation: it either returns value inside Maybe or a default Monoid value: > fold (Just "abc") "abc" > fold Nothing :: String "" Just "abc" is same as Some("abc") in Scala. You can see here that if Maybe Stringis Nothing, neutral String monoid value is returned, that is an empty string. Summary Haskell shows that folding (also over Maybe) can be at least consistent. In ScalaOption.fold is unrelated to List.fold, confusing and unreadable. I advise avoiding it and staying with slightly more verbose map/getOrElse transformations or pattern matching. PS: Did I mention there is also Either.fold() (with even different contract) but noTry.fold()?
June 26, 2014
by Tomasz Nurkiewicz
· 9,599 Views
article thumbnail
How to Install Mono on a Raspberry Pi
This post exists to help with an MSDN Magazine article that I am authoring It provides some of the low-level details for the article How to install Mono and root certificates on a raspberry pi How to create an Azure mobile service How to create a Custom API inside Azure mobile services that the raspberry pi can call into How to create an Azure storage account MONO - HOW TO INSTALL ON A RASPBERRY PI Why Mono? How to install Mono on a raspberry pi Installing trusted root certificates on to the raspberry pi http://www.mono-project.com/Main_Page An open source, cross-platform, implementation of C# and the CLR that is binary compatible with Microsoft.NET Mono is a free and open source project led by Xamarin (formerly by Novell) that provides a .NET Framework-compatible set of tools including, among others, a C# compiler and a Common Language Runtime WHY MONO? Because it lets us write .net code compiled on Windows We can simply copy the binary files from Windows to Linux and run it as is From a raspberry pi device, it is possible to use a .net application to take a photo and upload it to Windows Azure storage HOW TO INSTALL ON A RASPBERRY PI RUNNING LINUX You will issue the following commands: pi@raspberrypi ~ $ sudo apt-get update pi@raspberrypi ~ $ sudo apt-get install mono-complete The first command makes sure all the local package index are up to date with the changes made in repositories. Second command installs the complete Mono tooling and runtime. MAKING SURE THAT YOUR MONO APPLICATIONS CAN MAKE A HTTPS REST-BASED CALLS This command downloads the trusted root certificates from the Mozilla LXR web site into the Mono certificate store. Once complete, the Raspberry PI will be capable of making web requests using HTTPS requests within Mono. pi@raspberrypi ~ $ mozroots --import --ask-remove --machine CREATING A NEW AZURE MOBILE SERVICES ACCOUNT The mobile services account is needed to host a Node.js application that provides shared access signatures to raspberry pi devices The shared access signature is needed by the raspberry pi, so that it can directly and securely upload photos to Azure storage STEPS TO CREATE AN AZURE MOBILE SERVICE The steps below will create an Azure mobile service The service will be used to host a Node.js application interacting with a raspberry pi devices We will provision a SQL database, although it will not be used initially FOLLOW THESE STEPS TO CREATE THE MOBILE SERVICE Login into the Azure Portal Select MOBILE SERVICES from the left menu pane at the Azure Portal. In the lower left corner select "+NEW" to create a new Azure Mobile Service. Make sure you've selected, "COMPUTE / MOBILE SERVICE / CREATE." You will now enter a url. We will call this service raspberrymobileservice. For the DATABASE, we will choose "Create a new SQL database instance." The REGION we chose is "West US." The BACKEND is "JavaScript." Click the "->" arrow to proceed to the next screen. In this screen you will "Specify database settings." The NAME of your database will based on the URL you entered previously. In this case, the database is called "raspberrymobileservice_db." You will need to choose a SERVER. We will choose "New SQL database server" from the drop-down list. You will need to provide a SERVER LOGIN NAME and a SERVER LOGIN PASSWORD. Take note of the login you provided as it will be needed later CREATING A CUSTOM API Azure mobile services allows you to create a custom API written in JavaScript that can be called from a raspberry pi device using REST This custom API is really just a Node.js application running in the server CREATING THE API TO RESPOND TO THE DEVICE TRYING TO UPLOAD PHOTOS Now that the service is established, we will turn our attention to creating an API that the device can call into to upload a photo. Login into the Azure Portal Your mobile service will take a few minutes to complete, and you should see the "Ready" flag as the "Status" for your service. Once it is ready you can drill into your service to customize its behavior. Just to the right of the service name, click the right arrow key "->" to drill into the service details. The top menu bar will offer many options, but we are interested in the one titled "API." The API allows you to create a series of node.JS API calls that a device can call into using rest-based approaches. Click on "API." from there, select "CREATE A CUSTOM API." You will be asked to provide an API name. Type in "photos" for the API name. Below you will see a series of drop-down combo boxes that relate to permission. We will keep the default value of "Anybody with the application key." This might not be the best option for all scenarios. You can read more about this here. http://msdn.microsoft.com/en-us/library/azure/jj193161.aspx. Click the checkmark to complete the process. The name of the AP you just created, "Photos," should be visible on the portal interface. To drill into the photos API click on the right arrow key "->". The right arrow key will be just to the right of the name of the API "Photos". At this point you should see a basic script that has been provided by default. We will overwrite this default script with our own script as described in the MSDN Magazine article. CREATING A STORAGE ACCOUNT TO STORE THE PHOTOS Navigate to the portal and create a storage account Create a container for the photos Obtain the: Storage Account Name (you will provide a name) Storage Account Access key (generated for you) Container Name (you will create) CREATING A STORAGE ACCOUNT We will need a storage account so that we can upload photos to it. The steps are well documented here: http://azure.microsoft.com/en-us/documentation/articles/storage-create-storage-account/ In our case we call the storage account raspberrystorage. This means that the URL that the device will use to upload photos is https://raspberrystorage.blob.core.windows.net/. As you complete these steps make sure that you choose the storage account location to be the same location as was used for your mobile services account. This avoids any unnecessary latency or bandwidth costs between data centers. Once the storage account is created, we will need to create a container within it. Photos or any blob for that matter, are always stored within a container. To create a container drill into your newly created storage account and select CONTAINERS from the top menu. From there, select CREATE A CONTAINER. The new container dialog box will ask for a name for your container. Take note of the name you provide. We are calling our container ?photocontainer.? When the raspberry pi device uploads photos to the storage account, it will target a specific container, such as the one we just created. You will next be asked to indicate ACCESS rights. To keep things simple we will select access rights of Public Blob. ENTERING APP SETTINGS Rather than hard-code storage account information inside your JavaScript/Node.js applications, you should consider using apps settings inside of the Azure mobile services portal This post also discusses it well: http://blogs.msdn.com/b/carlosfigueira/archive/2013/12/09/application-settings-in-azure-mobile-services.aspx ?The idea of application settings is a set of key-value pairs which can be set for the mobile service (either via the portal or via the command-line interface), and those values could be then read in the service runtime.? NAVIGATING TO APP SETTINGS Navigate to the Azure Mobile Services section of the portal. Drill into the specific service by hitting the arrow below Select from the Configure Menu at the top Scroll down to the very bottom to see app settings Note that we need to enter: - We need to get this from Azure Storage - PhotoContainerName - AccountName - AccountKey We get this information from the Azure Storage Section of the Portal. Note that you need to have provisioned a Storage Account to have this information. How to get the AccountKey with Azure Storage Services Now you can get the access keys HOW NODE.JS WILL ACCESS THE APP SETTINGS You will create a Node.js application inside of Azure Mobile Services See previous steps THE NODE.JS APPLICATION READING APP SETTINGS You will starting by going back to Azure Mobile Services and drill down into your newly minted service We called ours raspberrymobileservice Once you click API, you should see: Notice the app settings are being read on lines 12 to 14.
June 19, 2014
by Bruno Terkaly
· 16,767 Views
article thumbnail
Synchronizing Data Across Mule Applications with MongoDB
One of the many things that Mule does extremely well is transparently store state between messages. Take the idempotent-message-filter for example; The idempotent-message-filter keeps track of what messages have been processed previously to ensure no duplicate messages are processed. This is all achieved with one single xml element. Or take OAuth enabled cloud connectors that need to keep track of access tokens so you're not redirected to some authorize page each time. All this is handled behind the scenes without you having to worry about it. Mule’s ability to store this state is implemented via object-stores. Object-stores provide a simple generic way to store data in Mule. Many message processors such as the idempotent-meesage-filter, OAuth message processors, until-successful routers and many more all transparently use object-stores behind the scenes to store state between messages. By default, Mule uses in-memory object-stores behind the scenes. Things get more interesting however, when your Mule application is distributed across multiple Mule nodes. At some point you are going to need to synchronise object-stores across multiple applications. If you are fortunate to be using the Enterprise Edition of Mule you can take advantage of its out of the box clustering support for synchronizing state across an entire cluster. But even then as mentioned in this blog post by John Demic: Synchronizing Mule Applications Across Data Centers with Apache Cassandra, this may not be ideal if your Mule nodes are geographically distributed across multiple data centres. In the aforementioned post, John gives a great example of how to use Apache Cassandra as a shared object-store. But many other traditonal and NoSql data stores can be used as well. In this blog post I'm going to show a simple example of how you can achieve the same using MongoDB. If it's good enough for the Hipster Hacker then it's good enough for me! OMG! Does this officially make @MuleSoft Mule part of the Hipster Stack? pic.twitter.com/OwM4dzhGvi — Ryan Carter (@rydcarter) October 9, 2013 Configuring the MongoDB object-store MongoDB is an open-source document database, and the leading NoSQL database. The MongoDB module can be used to read/write/query and perform a whole load of other operation a MongoDB database as part of your Mule flows. It also exposes an additional submodule for using your MondogDB database as an object-store implementation in Mule. Here is a sample configuration from one of the tests: As you can see, it's classed as it's own module with it's own namespace and is configured simply using the mos:config name="mos" database="mongo-connector-test" element. However, this doesn't work for the current version of the connector. Due to the following bug. You can simply apply the fix in the aforementioned pull request, or you can instantiate the object store yourself using Spring - similar to that in the Apache Cassandra configuration. The object-store class can be found here: MongoObjectStore.java. As you can see, it has various annotations for default values and initialising the class. However, as we are not using this as a fully fledged Mule Devkit module, those annotations have no effect. So we have to pass in the default values ourselves and call the initialise method when instantiating our object-store. This is quite simple with Spring. Here is a full working example using the Spring instantiated object-store as the store for the idempotent-message-filter: If you run this configuration, Mule will poll every 10 seconds with the exact same message: "1". The first time it polls you should see "Passed Filter" in the log output. Subsequent polls should not log anything as the idempotent message filter will filter them out. This is usually performed with the default in-memory object-store. To prove this is persisted you should be able to check the value in the Mongo database collection: db.getCollection('mule.objectstore._default').find(). Also you can stop your Mule app and restart it to prove that it works between restarts and failovers etc.
June 17, 2014
by Ryan Carter
· 10,019 Views
article thumbnail
Unit Testing Checklist: Keep Your Tests Useful and Avoid Big Mistakes
A unit test is a set of methods launched frequently to validate your code. It is usually a good idea to write code in this order: Write a class API Write a method to test the API Implement the API Launch the unit tests Why write unit tests? They validate current and future implementations. They measure code quality. They force you to write testable, loosely coupled code. They’re cheaper than manual regression testing. They build confidence in your code. They help teamwork. Why use a checklist? Unit testing can be harder than actual implementation. Unit testing forces you to really think things through. But unit tests should be simple, direct, and easy to read and maintain. You also need to know when to stop writing tests and start writing the implementation. Use this checklist to be sure your tests are really useful and to the point. Remember: the checklist helps you avoid big mistakes, but you need to make sure of the following: □ My test class is testing only one class. o You are testing a class API to be sure the public contract is respected. □ My methods are testing only one method at a time. o Be sure not to test private methods! They are hidden implementation details, not API. □ My variables and method names are explicit. o For example, store an expected value in an expectedFoo variable instead of just foo. If you test many combinations, use composed variable names like inputValue_NotNull, inputValue_ZeroData, inputValue_PastDate, etc. (according to your application’s coding convention). □ My test cases are easy to read by humans. o Future maintainers should be able to read your tests before reading the implementation. This will help them understand a class API before tweaking or debugging it. □ My tests respect the usual clean code standards. o There should be no flow control in a test method (switch, if, etc.). A good test is just a very straightforward sequence of setup/validate instructions. If necessary, use sub-methods to factorize and make your tests easier to read. In case of multiple scenarios, use multiple test methods (one for each case). o For example, a test method should fit on screen without scrolling – 1 to 20 lines long. If the method is longer, consider writing multiple test methods for each case instead of jamming them together. □ My tests are also testing expected exceptions. o In java, use @Test(expected=MyException.class). □ My tests don’t need access to a database. o Or if a test does need database access, then it must be a mocked, “fire and forget” temporary database that you fill with test cases for every new test method (use the Setup/Teardown methods to prepare it). □ My tests don’t need access to network resources. o You can’t rely on third parties like network or device presence to validate a method (use mocks). □ My tests control side effects, limit values (max, min) and null variables (even when they throw exceptions). o You want to make sure these problem cases never occur, even when the test won’t be used during maintenance. □ My tests can be run at any time, at any place without needing configuration or human intervention. □ My tests pass for the current implementation and are easy to evolve. o Tests really exist to support code evolution. If they are too hard to maintain or too light to refine the code, then they are a useless burden. (Many developers avoid unit testing for this reason.) □ My tests are concrete. o In Java: don’t use Date() as input for a method you are testing, but build a concrete date out of Calendar (don’t forget to force the timezone). Other example: use name = “Smith”; instead of name = “name”; or name = “test”; □ My tests use a mock to simulate/stub complex class structure or methods. o Remember to test only one class API at a time. o Never test third-party libraries through your own classes. Libraries should come with their own tests (this is actually a good way to choose a library). □ My tests are never @ignored or commented out. Never. Ever. □ My tests help me validate my architecture. o If you can’t test a method or a class, your design is not agile enough. □ My tests can run on any supported platform, not just the targeted platform. o Don’t expect a particular device or hardware configuration. Otherwise, your tests will make migration tougher and you will be incentivized to disable them. □ My tests are lightning fast! o Slow tests shouldn’t drag you down. Speed encourages you to launch your tests often. It also helps to reduce building time on Continuous Integration systems. o Use a test runner that allows you to launch one test at a time while you are writing it. Use “delay” or “sleep” with caution – i.e., only in some edge cases, like waiting for notifications or clock-based methods.
June 16, 2014
by Jean-baptiste Rieu
· 24,987 Views
article thumbnail
Java 8 Friday: 10 Subtle Mistakes When Using the Streams API
at data geekery , we love java. and as we’re really into jooq’s fluent api and query dsl , we’re absolutely thrilled about what java 8 will bring to our ecosystem. java 8 friday every friday, we’re showing you a couple of nice new tutorial-style java 8 features, which take advantage of lambda expressions, extension methods, and other great stuff. you’ll find the source code on github . 10 subtle mistakes when using the streams api we’ve done all the sql mistakes lists: 10 common mistakes java developers make when writing sql 10 more common mistakes java developers make when writing sql yet another 10 common mistakes java developers make when writing sql (you won’t believe the last one) but we haven’t done a top 10 mistakes list with java 8 yet! for today’s occasion ( it’s friday the 13th ), we’ll catch up with what will go wrong in your application when you’re working with java 8. (it won’t happen to us, as we’re stuck with java 6 for another while) 1. accidentally reusing streams wanna bet, this will happen to everyone at least once. like the existing “streams” (e.g. inputstream ), you can consume streams only once. the following code won’t work: intstream stream = intstream.of(1, 2); stream.foreach(system.out::println); // that was fun! let's do it again! stream.foreach(system.out::println); you’ll get a java.lang.illegalstateexception: stream has already been operated upon or closed so be careful when consuming your stream. it can be done only once 2. accidentally creating “infinite” streams you can create infinite streams quite easily without noticing. take the following example: // will run indefinitely intstream.iterate(0, i -> i + 1) .foreach(system.out::println); the whole point of streams is the fact that they can be infinite, if you design them to be. the only problem is, that you might not have wanted that. so, be sure to always put proper limits: // that's better intstream.iterate(0, i -> i + 1) .limit(10) .foreach(system.out::println); 3. accidentally creating “subtle” infinite streams we can’t say this enough. you will eventually create an infinite stream, accidentally. take the following stream, for instance: intstream.iterate(0, i -> ( i + 1) % 2) .distinct() .limit(10) .foreach(system.out::println); so… we generate alternating 0′s and 1′s then we keep only distinct values, i.e. a single 0 and a single 1 then we limit the stream to a size of 10 then we consume it well… the distinct() operation doesn’t know that the function supplied to the iterate() method will produce only two distinct values. it might expect more than that. so it’ll forever consume new values from the stream, and the limit(10) will never be reached. tough luck, your application stalls. 4. accidentally creating “subtle” parallel infinite streams we really need to insist that you might accidentally try to consume an infinite stream. let’s assume you believe that the distinct() operation should be performed in parallel. you might be writing this: intstream.iterate(0, i -> ( i + 1) % 2) .parallel() .distinct() .limit(10) .foreach(system.out::println); now, we’ve already seen that this will turn forever. but previously, at least, you only consumed one cpu on your machine. now, you’ll probably consume four of them, potentially occupying pretty much all of your system with an accidental infinite stream consumption. that’s pretty bad. you can probably hard-reboot your server / development machine after that. have a last look at what my laptop looked like prior to exploding: if i were a laptop, this is how i’d like to go. 5. mixing up the order of operations so, why did we insist on your definitely accidentally creating infinite streams? it’s simple. because you may just accidentally do it. the above stream can be perfectly consumed if you switch the order of limit() and distinct() : intstream.iterate(0, i -> ( i + 1) % 2) .limit(10) .distinct() .foreach(system.out::println); this now yields: 0 1 why? because we first limit the infinite stream to 10 values (0 1 0 1 0 1 0 1 0 1), before we reduce the limited stream to the distinct values contained in it (0 1). of course, this may no longer be semantically correct, because you really wanted the first 10 distinct values from a set of data (you just happened to have “forgotten” that the data is infinite). no one really wants 10 random values, and only then reduce them to be distinct. if you’re coming from a sql background, you might not expect such differences. take sql server 2012, for instance. the following two sql statements are the same: -- using top selectdistincttop10 * fromi orderby.. -- using fetch select* fromi orderby.. offset 0 rows fetchnext10 rowsonly so, as a sql person, you might not be as aware of the importance of the order of streams operations. 6. mixing up the order of operations (again) speaking of sql, if you’re a mysql or postgresql person, you might be used to the limit .. offset clause. sql is full of subtle quirks, and this is one of them. the offset clause is applied first , as suggested in sql server 2012′s (i.e. the sql:2008 standard’s) syntax. if you translate mysql / postgresql’s dialect directly to streams, you’ll probably get it wrong: intstream.iterate(0, i -> i + 1) .limit(10) // limit .skip(5) // offset .foreach(system.out::println); the above yields 5 6 7 8 9 yes. it doesn’t continue after 9 , because the limit() is now applied first , producing (0 1 2 3 4 5 6 7 8 9). skip() is applied after, reducing the stream to (5 6 7 8 9). not what you may have intended. beware of the limit .. offset vs. "offset .. limit" trap! 7. walking the file system with filters we’ve blogged about this before . what appears to be a good idea is to walk the file system using filters: files.walk(paths.get(".")) .filter(p -> !p.tofile().getname().startswith(".")) .foreach(system.out::println); the above stream appears to be walking only through non-hidden directories, i.e. directories that do not start with a dot. unfortunately, you’ve again made mistake #5 and #6. walk() has already produced the whole stream of subdirectories of the current directory. lazily, though, but logically containing all sub-paths. now, the filter will correctly filter out paths whose names start with a dot “.”. e.g. .git or .idea will not be part of the resulting stream. but these paths will be: .\.git\refs , or .\.idea\libraries . not what you intended. now, don’t fix this by writing the following: files.walk(paths.get(".")) .filter(p -> !p.tostring().contains(file.separator + ".")) .foreach(system.out::println); while that will produce the correct output, it will still do so by traversing the complete directory subtree, recursing into all subdirectories of “hidden” directories. i guess you’ll have to resort to good old jdk 1.0 file.list() again. the good news is, filenamefilter and filefilter are both functional interfaces. 8. modifying the backing collection of a stream while you’re iterating a list , you must not modify that same list in the iteration body. that was true before java 8, but it might become more tricky with java 8 streams. consider the following list from 0..9: // of course, we create this list using streams: list list = intstream.range(0, 10) .boxed() .collect(tocollection(arraylist::new)); now, let’s assume that we want to remove each element while consuming it: list.stream() // remove(object), not remove(int)! .peek(list::remove) .foreach(system.out::println); interestingly enough, this will work for some of the elements! the output you might get is this one: 0 2 4 6 8 null null null null null java.util.concurrentmodificationexception if we introspect the list after catching that exception, there’s a funny finding. we’ll get: [1, 3, 5, 7, 9] heh, it “worked” for all the odd numbers. is this a bug? no, it looks like a feature. if you’re delving into the jdk code, you’ll find this comment in arraylist.arralistspliterator : /* * if arraylists were immutable, or structurally immutable (no * adds, removes, etc), we could implement their spliterators * with arrays.spliterator. instead we detect as much * interference during traversal as practical without * sacrificing much performance. we rely primarily on * modcounts. these are not guaranteed to detect concurrency * violations, and are sometimes overly conservative about * within-thread interference, but detect enough problems to * be worthwhile in practice. to carry this out, we (1) lazily * initialize fence and expectedmodcount until the latest * point that we need to commit to the state we are checking * against; thus improving precision. (this doesn't apply to * sublists, that create spliterators with current non-lazy * values). (2) we perform only a single * concurrentmodificationexception check at the end of foreach * (the most performance-sensitive method). when using foreach * (as opposed to iterators), we can normally only detect * interference after actions, not before. further * cme-triggering checks apply to all other possible * violations of assumptions for example null or too-small * elementdata array given its size(), that could only have * occurred due to interference. this allows the inner loop * of foreach to run without any further checks, and * simplifies lambda-resolution. while this does entail a * number of checks, note that in the common case of * list.stream().foreach(a), no checks or other computation * occur anywhere other than inside foreach itself. the other * less-often-used methods cannot take advantage of most of * these streamlinings. */ now, check out what happens when we tell the stream to produce sorted() results: list.stream() .sorted() .peek(list::remove) .foreach(system.out::println); this will now produce the following, “expected” output 0 1 2 3 4 5 6 7 8 9 and the list after stream consumption? it is empty: [] so, all elements are consumed, and removed correctly. the sorted() operation is a “stateful intermediate operation” , which means that subsequent operations no longer operate on the backing collection, but on an internal state. it is now “safe” to remove elements from the list! well… can we really? let’s proceed with parallel() , sorted() removal: list.stream() .sorted() .parallel() .peek(list::remove) .foreach(system.out::println); this now yields: 7 6 2 5 8 4 1 0 9 3 and the list contains [8] eek. we didn’t remove all elements!? free beers ( and jooq stickers ) go to anyone who solves this streams puzzler! this all appears quite random and subtle, we can only suggest that you never actually do modify a backing collection while consuming a stream. it just doesn’t work. 9. forgetting to actually consume the stream what do you think the following stream does? intstream.range(1, 5) .peek(system.out::println) .peek(i -> { if(i == 5) thrownewruntimeexception("bang"); }); when you read this, you might think that it will print (1 2 3 4 5) and then throw an exception. but that’s not correct. it won’t do anything. the stream just sits there, never having been consumed. as with any fluent api or dsl, you might actually forget to call the “terminal” operation. this might be particularly true when you use peek() , as peek() is an aweful lot similar to foreach() . this can happen with jooq just the same, when you forget to call execute() or fetch() : dsl.using(configuration) .update(table) .set(table.col1, 1) .set(table.col2, "abc") .where(table.id.eq(3)); oops. no execute() yes, the “best” way – with 1-2 caveats ;-) 10. parallel stream deadlock this is now a real goodie for the end! all concurrent systems can run into deadlocks, if you don’t properly synchronise things. while finding a real-world example isn’t obvious, finding a forced example is. the following parallel() stream is guaranteed to run into a deadlock: object[] locks = { newobject(), newobject() }; intstream .range(1, 5) .parallel() .peek(unchecked.intconsumer(i -> { synchronized(locks[i % locks.length]) { thread.sleep(100); synchronized(locks[(i + 1) % locks.length]) { thread.sleep(50); } } })) .foreach(system.out::println); note the use of unchecked.intconsumer() , which transforms the functional intconsumer interface into a org.jooq.lambda.fi.util.function.checkedintconsumer , which is allowed to throw checked exceptions. well. tough luck for your machine. those threads will be blocked forever :-) the good news is, it has never been easier to produce a schoolbook example of a deadlock in java! for more details, see also brian goetz’s answer to this question on stack overflow . conclusion with streams and functional thinking, we’ll run into a massive amount of new, subtle bugs. few of these bugs can be prevented, except through practice and staying focused. you have to think about how to order your operations. you have to think about whether your streams may be infinite. streams (and lambdas) are a very powerful tool. but a tool which we need to get a hang of, first.
June 16, 2014
by Lukas Eder
· 10,347 Views · 2 Likes
article thumbnail
The Performance Price of Foreign Keys in PostgreSQL
If you want to optimize the performance of your PostgreSQL tables, you might benefit from looking in places you wouldn't ordinarily expect. For example, your foreign keys. According to Shaun Thomas in this recent post, foreign keys can have a considerable impact on performance. Ordinarily they're a good thing - Thomas acknowledges that - but there are certain design choices in PostgreSQL that can complicate them: In PostgreSQL, every foreign key is maintained with an invisible system-level trigger added to the source table in the reference. At least one trigger must go here, as operations that modify the source data must be checked that they do not violate the constraint. Each trigger, Thomas says, requires some overhead, and as the number of triggers increases, so does the cost. Thomas offers an example query to demonstrate where this kind of thing becomes an issue - not really a practical or real-world example, though - but the performance hit is substantial: . . . after merely five foreign keys, performance of our updates drops by 28.5%. By the time we have 20 foreign keys, the updates are 95% slower! This is not to say that you shouldn't use foreign keys, Thomas says, but that you should be careful with them and avoid abusing them. In other words, if you don't need a foreign key, you should probably just leave it out. Check out the Thomas' post for all the details.
June 12, 2014
by Alec Noller
· 17,659 Views
article thumbnail
Querying XML CLOB Data Directly in Oracle
Oracle 9i introduced a new datatype - XMLType - specifically designed for handling XML naively in the database. However, we may find we have a database that stores XML directly in a CLOB/NCLOB datatype (either because of legacy data, or because we are supporting multiple database platforms). It can be useful to query this data directly using the XMLType functions available. This post just shows a very simple example of a query to do this. (Note: there may be better ways to do this - but this worked for me when I needed an ad-hoc query quickly!) Our Table Assume we have a table called USER, which has the following columns Our embedded XML (example for Ringo) Our Query Say we want to look at account details in the XML - e.g. find which users all have the same sort code. A query like the following: SELECT ID, NAME XMLTYPE(u.accountxml).EXTRACT('/person/account/sortCode/text()') as SORTCODE FROM USER u; Will produce output like this. Obviously from here you can use all the standard SQL operators to filter, join etc further to get what you need:
June 11, 2014
by Adrian Milne
· 45,583 Views · 1 Like
article thumbnail
Building a Simple RESTful API with Java Spark
Disclaimer: This post is about the Java micro web framework named Spark and not about the data processing engine Apache Spark. In this blog post we will see how Spark can be used to build a simple web service. As mentioned in the disclaimer, Spark is a micro web framework for Java inspired by the Ruby framework Sinatra. Spark aims for simplicity and provides only a minimal set of features. However, it provides everything needed to build a web application in a few lines of Java code. Getting Started Let's assume we have a simple domain class with a few properties and a service that provides some basic CRUDfunctionality: public class User { private String id; private String name; private String email; // getter/setter } public class UserService { // returns a list of all users public List getAllUsers() { .. } // returns a single user by id public User getUser(String id) { .. } // creates a new user public User createUser(String name, String email) { .. } // updates an existing user public User updateUser(String id, String name, String email) { .. } } We now want to expose the functionality of UserService as a RESTful API (For simplicity we will skip the hypermedia part of REST ;-)). For accessing, creating and updating user objects we want to use following URL patterns: GET /users Get a list of all users GET /users/ Get a specific user POST /users Create a new user PUT /users/ Update a user The returned data should be in JSON format. To get started with Spark we need the following Maven dependencies: com.sparkjava spark-core 2.0.0 org.slf4j slf4j-simple 1.7.7 Spark uses SLF4J for logging, so we need to a SLF4J binder to see log and error messages. In this example we use the slf4j-simple dependency for this purpose. However, you can also use Log4j or any other binder you like. Having slf4j-simple in the classpath is enough to see log output in the console. We will also use GSON for generating JSON output and JUnit to write a simple integration tests. You can find these dependencies in the complete pom.xml. Returning All Users Now it is time to create a class that is responsible for handling incoming requests. We start by implementing the GET /users request that should return a list of all users. import static spark.Spark.*; public class UserController { public UserController(final UserService userService) { get("/users", new Route() { @Override public Object handle(Request request, Response response) { // process request return userService.getAllUsers(); } }); // more routes } } Note the static import of spark.Spark.* in the first line. This gives us access to various static methods including get(), post(), put() and more. Within the constructor the get() method is used to register aRoute that listens for GET requests on /users. A Route is responsible for processing requests. Whenever aGET /users request is made, the handle() method will be called. Inside handle() we return an object that should be sent to the client (in this case a list of all users). Spark highly benefits from Java 8 Lambda expressions. Route is a functional interface (it contains only one method), so we can implement it using a Java 8 Lambda expression. Using a Lambda expression the Routedefinition from above looks like this: get("/users", (req, res) -> userService.getAllUsers()); To start the application we have to create a simple main() method. Inside main() we create an instance of our service and pass it to our newly created UserController: public class Main { public static void main(String[] args) { new UserController(new UserService()); } } If we now run main(), Spark will start an embedded Jetty server that listens on Port 4567. We can test our first route by initiating a GET http://localhost:4567/users request. In case the service returns a list with two user objects the response body might look like this: [com.mscharhag.sparkdemo.User@449c23fd, com.mscharhag.sparkdemo.User@437b26fe] Obviously this is not the response we want. Spark uses an interface called ResponseTransformer to convert objects returned by routes to an actual HTTP response. ReponseTransformer looks like this: public interface ResponseTransformer { String render(Object model) throws Exception; } ResponseTransformer has a single method that takes an object and returns a String representation of this object. The default implementation of ResponseTransformer simply calls toString() on the passed object (which creates output like shown above). Since we want to return JSON we have to create a ResponseTransformer that converts the passed objects to JSON. We use a small JsonUtil class with two static methods for this: public class JsonUtil { public static String toJson(Object object) { return new Gson().toJson(object); } public static ResponseTransformer json() { return JsonUtil::toJson; } } toJson() is an universal method that converts an object to JSON using GSON. The second method makes use of Java 8 method references to return a ResponseTransformer instance. ResponseTransformer is again a functional interface, so it can be satisfied by providing an appropriate method implementation (toJson()). So whenever we call json() we get a new ResponseTransformer that makes use of our toJson()method. In our UserController we can pass a ResponseTransformer as a third argument to Spark's get()method: import static com.mscharhag.sparkdemo.JsonUtil.*; public class UserController { public UserController(final UserService userService) { get("/users", (req, res) -> userService.getAllUsers(), json()); ... } } Note again the static import of JsonUtil.* in the first line. This gives us the option to create a newResponseTransformer by simply calling json(). Our response looks now like this: [{ "id": "1866d959-4a52-4409-afc8-4f09896f38b2", "name": "john", "email": "[email protected]" },{ "id": "90d965ad-5bdf-455d-9808-c38b72a5181a", "name": "anna", "email": "[email protected]" }] We still have a small problem. The response is returned with the wrong Content-Type. To fix this, we can register a Filter that sets the JSON Content-Type: after((req, res) -> { res.type("application/json"); }); Filter is again a functional interface and can therefore be implemented by a short Lambda expression. After a request is handled by our Route, the filter changes the Content-Type of every response toapplication/json. We can also use before() instead of after() to register a filter. Then, the Filterwould be called before the request is processed by the Route. The GET /users request should be working now :-) Returning a Specific User To return a specific user we simply create a new route in our UserController: get("/users/:id", (req, res) -> { String id = req.params(":id"); User user = userService.getUser(id); if (user != null) { return user; } res.status(400); return new ResponseError("No user with id '%s' found", id); }, json()); With req.params(":id") we can obtain the :id path parameter from the URL. We pass this parameter to our service to get the corresponding user object. We assume the service returns null if no user with the passed id is found. In this case, we change the HTTP status code to 400 (Bad Request) and return an error object. ResponseError is a small helper class we use to convert error messages and exceptions to JSON. It looks like this: public class ResponseError { private String message; public ResponseError(String message, String... args) { this.message = String.format(message, args); } public ResponseError(Exception e) { this.message = e.getMessage(); } public String getMessage() { return this.message; } } We are now able to query for a single user with a request like this: GET /users/5f45a4ff-35a7-47e8-b731-4339c84962be If an user with this id exists we will get a response that looks somehow like this: { "id": "5f45a4ff-35a7-47e8-b731-4339c84962be", "name": "john", "email": "[email protected]" } If we use an invalid user id, a ResponseError object will be created and converted to JSON. In this case the response looks like this: { "message": "No user with id 'foo' found" } Creating and Updating Users Creating and updating users is again very easy. Like returning the list of all users it is done using a single service call: post("/users", (req, res) -> userService.createUser( req.queryParams("name"), req.queryParams("email") ), json()); put("/users/:id", (req, res) -> userService.updateUser( req.params(":id"), req.queryParams("name"), req.queryParams("email") ), json()); To register a route for HTTP POST or PUT requests we simply use the static post() and put() methods of Spark. Inside a Route we can access HTTP POST parameters using req.queryParams(). For simplicity reasons (and to show another Spark feature) we do not do any validation inside the routes. Instead we assume that the service will throw an IllegalArgumentException if we pass in invalid values. Spark gives us the option to register ExceptionHandlers. An ExceptionHandler will be called if anException is thrown while processing a route. ExceptionHandler is another single method interface we can implement using a Java 8 Lambda expression: exception(IllegalArgumentException.class, (e, req, res) -> { res.status(400); res.body(toJson(new ResponseError(e))); }); Here we create an ExceptionHandler that is called if an IllegalArgumentException is thrown. The caught Exception object is passed as the first parameter. We set the response code to 400 and add an error message to the response body. If the service throws an IllegalArgumentException when the email parameter is empty, we might get a response like this: { "message": "Parameter 'email' cannot be empty" } The complete source the controller can be found here. Testing Because of Spark's simple nature it is very easy to write integration tests for our sample application. Let's start with this basic JUnit test setup: public class UserControllerIntegrationTest { @BeforeClass public static void beforeClass() { Main.main(null); } @AfterClass public static void afterClass() { Spark.stop(); } ... } In beforeClass() we start our application by simply running the main() method. After all tests finished we call Spark.stop(). This stops the embedded server that runs our application. After that we can send HTTP requests within test methods and validate that our application returns the correct response. A simple test that sends a request to create a new user can look like this: @Test public void aNewUserShouldBeCreated() { TestResponse res = request("POST", "/users?name=john&[email protected]"); Map json = res.json(); assertEquals(200, res.status); assertEquals("john", json.get("name")); assertEquals("[email protected]", json.get("email")); assertNotNull(json.get("id")); } request() and TestResponse are two small self made test utilities. request() sends a HTTP request to the passed URL and returns a TestResponse instance. TestResponse is just a small wrapper around some HTTP response data. The source of request() and TestResponse is included in the complete test classfound on GitHub. Conclusion Compared to other web frameworks Spark provides only a small amount of features. However, it is so simple you can build small web applications within a few minutes (even if you have not used Spark before). If you want to look into Spark you should clearly use Java 8, which reduces the amount of code you have to write a lot. You can find the complete source of the sample project on GitHub.
June 9, 2014
by Michael Scharhag
· 111,610 Views · 3 Likes
article thumbnail
Neo4j & Cypher: UNWIND vs FOREACH
I’ve written a couple of posts about the new UNWIND clause in Neo4j’s cypher query language, but I forgot about my favorite use of UNWIND, which is to get rid of some uses of FOREACH from our queries. Let’s say we’ve created a timetree up front and now have a series of events coming in that we want to create in the database and attach to the appropriate part of the timetree. Before UNWIND existed we might try to write the following query using FOREACH: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events FOREACH (event IN events | CREATE (e:Event {name: event.name}) MATCH (year:Year {year: event.timetree.year }), (year)-[:HAS_MONTH]->(month {month: event.timetree.month }), (month)-[:HAS_DAY]->(day {day: event.timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day)) Unfortunately we can’t use MATCH inside a FOREACH statement so we’ll get the following error: Invalid use of MATCH inside FOREACH (line 5, column 3) " MATCH (year:Year {year: event.timetree.year }), " ^ Neo.ClientError.Statement.InvalidSyntax We can work around this by using MERGE instead in the knowledge that it’s never going to create anything because the timetree already exists: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events FOREACH (event IN events | CREATE (e:Event {name: event.name}) MERGE (year:Year {year: event.timetree.year }) MERGE (year)-[:HAS_MONTH]->(month {month: event.timetree.month }) MERGE (month)-[:HAS_DAY]->(day {day: event.timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day)) If we replace the FOREACH with UNWIND we’d get the following: WITH [{name: "Event 1", timetree: {day: 1, month: 1, year: 2014}, {name: "Event 2", timetree: {day: 2, month: 1, year: 2014}] AS events UNWIND events AS event CREATE (e:Event {name: event.name}) WITH e, event.timetree AS timetree MATCH (year:Year {year: timetree.year }), (year)-[:HAS_MONTH]->(month {month: timetree.month }), (month)-[:HAS_DAY]->(day {day: timetree.day }) CREATE (e)-[:HAPPENED_ON]->(day) Although the lines of code has slightly increased the query is now correct and we won’t accidentally correct new parts of our time tree. We could also pass on the event that we created to the next part of the query which wouldn’t be the case when using FOREACH.
June 9, 2014
by Mark Needham
· 12,725 Views
article thumbnail
Checking Media Queries With jQuery
With the web being used on so many different devices now it's very important that you can change your design to fit on different screen sizes. The best way of changing your display depending on screen size is to use media queries to find out the size viewport of the screen and allowing you to change the design depending on what screen size is on. You will mainly make these changes in the CSS file as you can define the media query break points to change the design on different devices like this. /* Smartphones (portrait) ----------- */ @media only screen and (max-width : 320px) { } /* Desktops and laptops ----------- */ @media only screen and (min-width : 1224px) { } /* Large screens ----------- */ @media only screen and (min-width : 1824px) { } The above code will allow you to make styling completely different on different devices, but what if you wanted to change the functionality of the site depending on the screen size? What if you needed to use some Javascript code on different screen sizes, for example to create a slide down navigation button. How Do You Use Media Queries With jQuery Media queries will be checking the width of the window to see what the size of the device is so you would think that you can use a method like .width() on the window object like this. if($(window).width() < 767) { // change functionality for smaller screens } else { // change functionality for larger screens } But this will not return the true value of the window as it takes into effect things like body padding and scroll bars on the window. The other option you have when checking the media size is to use a Javascript method of .matchMedia() on the window object. var window_size = window.matchMedia('(max-width: 768px)')); This works the same way as media queries and is supported on many browsers apart from IE9 and lower. Can I Use window.matchMedia To use matchMedia you need to pass in the min or max values you want to check (like media queries) and see if the viewport matches this. if (window.matchMedia('(max-width: 768px)').matches) { // do functionality on screens smaller than 768px } Now you can use this to add a click event on to a sub-menu for screens smaller than 768px. The below code is an example of how you can add some Javascript code which will only be run on screens smaller than 768px. if (window.matchMedia('(max-width: 768px)').matches) { $('.sub-menu-button').on('click', function(e) { var subMenu = $(this).next('.sub-navigation'); if(subMenu.is(':visible')) { subMenu.slideUp(); } else { subMenu.slideDown(); } return false; }); }
June 6, 2014
by Paul Underwood
· 135,725 Views · 3 Likes
article thumbnail
How Does Spring @Transactional Really Work?
Wondering how Spring @Transactional works? Learn about what's really going on.
June 5, 2014
by Vasco Cavalheiro
· 882,589 Views · 47 Likes
article thumbnail
MapDB: The Agile Java Data Engine
MapDB is a pure Java database, specifically designed for the Java developer. The fundamental concept for MapDB is very clever yet natural to use: provide a reliable, full-featured and “tune-able” database engine using the Java Collections API. MapDB 1.0 has just been released, this is the culmination of years of research and development to get the project to this point. Jan Kotek, the primary developer for MapDB, also worked on predecessor projects (JDBM), starting MapDB as an entire from-scratch rewrite. Jan’s expertise and dedication to low-level debugging has yielded excellent results, producting an easy-to-use database for Java with comparable performance to many C-based engines. What sets MapDB apart is the “map” concept. The idea is to leverage the totally natural Java Collections API – so familiar to Java developers that most of them literally use it daily in their work. For most database interactions with a Java application, some sort of translator is required. There are many Object-Relational Mapping (ORM) tools to name just one category of such components. The goal has always been in the direction of making it natural to code in objects in the Java language, and translate them to a specific database syntax (such as SQL). However, such efforts have always come up short, adding complexity for both the application developer and the data architect. When using MapDB there is no object “translation layer” – developers just access data in familiar structures like Maps, Sets, Queues, etc. There is no change in syntax from typical Java coding, other than a brief initialization syntax and transaction management. A developer can literally transform memory-limited maps into a high-speed persistent store in seconds (typically changing just one line of code). A MapDB Example Here is a simple MapDB example, showing how easy and intuitive it is to use in a Java application: // Initialize a MapDB database DB db = DBMaker.newFileDB(new File("testdb")) .closeOnJvmShutdown() .make(); // Create a Map: Map myMap = db.getTreeMap(“testmap”); // Work with the Map using the normal Map API. myMap.put(“key1”, “value1”); myMap.put(“key2”, “value2”); String value = myMap.get(“key1”); ... That’s all you need to do, now you have a file-backed Map of virtually any size. Note the “builder-style” initialization syntax, enabling MapDB as the agile database choice for Java. There are many builder options that let you tune your database for the specific requirements at hand. Just a small subset of options include: In-memory implementation Enable transactions Configurable caching This means that you can configure your database just for what you need, effectively making MapDB serve the job of many other databases. MapDB comes with a set of powerful configuration options, and you can even extend the product to make your own data implementations if necessary. Another very powerful feature is that MapDB utilizes some of the advanced Java Collections variants, such as ConcurrentNavigableMap. With this type of Map you can go beyond simple key-value semantics, as it is also a sorted Map allowing you to access data in order, and find values near a key. Not many people are aware of this extension to the Collections API, but it is extremely powerful and allows you to do a lot with your MapDB database (I will cover more of these capabilities in a future article). The Agile Aspect of MapDB When I first met Jan and started talking with him about MapDB he said something that made a very important impression: If you know what data structure you want, MapDB allows you to tailor the structure and database characteristics to your exact application needs. In other words, the schema and ways you can structure your data is very flexible. The configuration of the physical data store is just as flexible, making a perfect combination for meeting almost any database need. They key to this capability is inherent in MapDB’s architecture, and how it translates to the MapDB API itself. Here is a simple diagram of the MapDB architecture: As you can see from the diagram, there are 3 tiers in MapDB: Collections API: This is the familiar Java Collections API that every Java developer uses for maintaining application state. It has a simple builder-style extension to allow you to control the exact characteristics of a given database (including its internal format or record structure). Engine: The Engine is the real key to MapDB, this is where the records for a database – including their internal structure, concurrency control, transactional semantics – are controlled. MapDB ships with several engines already, and it is straightforward to add your own Engine if needed for specialized data handling. Volume: This is the physical storage layer (e.g., on-disk or in-memory). MapDB has a few standard Volume implementations, and they should suffice for most projects. The main point is that the development API is completely distinct from the Engine implementation (the heart of MapDB), and both are separate from the actual physical storage layer. This offers a very agile approach, allowing developers to exactly control what type of internal structure is needed for a given database, and what the actual data structure looks like from the top-level Collections API. To make things even more extensible and agile, MapDB uses a concept of Engine Wrappers. An Engine Wrapper allows adding additional features and options on top of a specific engine layer. For example, if the standard Map engine is utilized for creating a B-Tree backed Map, it is feasible to enable (or disable) caching support. This caching feature is done through an Engine Wrapper, and that is what shows up in the builder-style API used to configure a given database. While a whole article could be written just about this, the point here is that this adds to MapDB’s inherent agile nature. By way of example, here is how you configure a pure in-memory database, without transactional capabilities: // Initialize an in-memory MapDB database // without transactions DB db = DBMaker.newMemoryDB() .transactionDisable() .closeOnJvmShutdown() .make(); // Create a Map: Map myMap = db.getTreeMap(“testmap”); // Work with the Map using the normal Map API. myMap.put(“key1”, “value1”); myMap.put(“key2”, “value2”); String value = myMap.get(“key1”); ... That’s it! All that was needed was to change the DBMaker call to add the new options, everything else works exactly the same as in the example shown earlier. Agile Data Model In addition to customizing the features and performance characteristics of a given database instance, MapDB allows you to create an agile data model, with a schema exactly matching your application requirements. This is probably similar to how you write your code when creating standard Java in-memory structures. For example, let’s say you need to lookup a Person object by username, or by personID. Simply create a Person object and two Maps to meet your needs: public class Person { private Integer personID; private String username; ... // Setters and getters go here ... } // Create a Map of Person by username. Map personByUsernameMap = ... // Create a Map of Person by personID. Map personByPersonIDMap = ... This is a very trivial example, but now you can easily write to both maps for each new Person instance, and subsequently retrieve a Person by either key. Another interesting concept with MapDB data structures are some key extensions to the normal Java Collections API. A common requirement in applications is to have a Map with a key/value, and in addition to finding the value for a key to be able to perform the inverse: lookup the key for a given value. We can easily do this using the MapDB extension for bi-directional maps: // Create a primary map HTreeMap map = DBMaker.newTempHashMap(); // Create the inverse mapping for primary map NavigableSet> inverseMapping = new TreeSet>(); // Bind the inverse mapping to primary map, so it is auto-updated each time the primary map gets a new key/value Bind.mapInverse(map, inverseMapping); map.put(10L,"value2"); map.put(1111L,"value"); map.put(1112L,"value"); map.put(11L,"val"); // Now find a key by a given value. Long keyValue = Fun.filter(inverseMapping.get(“value2”); MapDB supports many constructs for the interaction of Maps or other collections, allowing you to create a schema of related structures that can automatically be kept in sync. This avoids a lot of scanning of structures, makes coding fast and convenient, and can keep things very fast. Wrapping it up I have shown a very brief introduction on MapDB and how the product works. As you can see its strengths are its use of the natural Java Collections API, the agile nature of the engine itself, and the support for virtually any type of data model or schema that your application needs. MapDB is freely available for any use under the Apache 2.0 license. To learn more, check out: www.mapdb.org.
June 5, 2014
by Cory Isaacson
· 28,543 Views · 3 Likes
article thumbnail
Are There Any Differences Between Table Scan and Index Scan
Introduction As we all know, developers prefer Index Seek in the case of performance of a query, and we all know what Index Seek is and how it improves performance. In this article we are not going to discuss Index Seek. A table scan is performed on a table which does not have an Index upon it (a heap) – it looks at the rows in the table and an Index Scan is performed on an indexed table – the index itself. Here we are trying discuss a specified scenario related to Table Scan and Index Scan. Scenario Description Suppose we have a table object without any Index on it. The name of the Table Object is tbl_WithoutIndex and another Table Object called tbl_WuthIndex -- Heap Table Defination IF OBJECT_ID(N'dbo.tbl_WithoutIndex', N'U') IS NOT NULL BEGIN DROP TABLE [dbo].[tbl_WithoutIndex]; END GO CREATE TABLE [dbo].[tbl_WithoutIndex] ( EMPID INT NOT NULL, EMPNAME VARCHAR(50) NOT NULL ); GO -- Insert Some Records INSERT INTO [dbo].[tbl_WithoutIndex] (EMPID, EMPNAME) VALUES (101, 'Joydeep Das'), (102, 'Sukamal Jana'), (103, 'Ranajit Shinah'); GO -- Table with Index (Clustered Index for primary Key) IF OBJECT_ID(N'dbo.tbl_WithIndex', N'U') IS NOT NULL BEGIN DROP TABLE [dbo].[tbl_WithIndex]; END GO CREATE TABLE [dbo].[tbl_WithIndex] ( EMPID INT NOT NULL PRIMARY KEY, EMPNAME VARCHAR(50) NOT NULL ); GO -- Finding Index Name sp_helpindex tbl_WithIndex; index_name index_description index_keys PK__tbl_With__14CCD97D3AD6B8E2 clustered, unique, primary key located on PRIMARY EMPID -- Insert Some Records INSERT INTO [dbo].[tbl_WithIndex] (EMPID, EMPNAME) VALUES (101, 'Joydeep Das'), (102, 'Sukamal Jana'), (103, 'Ranajit Shinah'); GO Now we are going to compare the execution plan output SELECT * FROM [dbo].[tbl_WithoutIndex]; has no Index (Here I mean the clustered Index) the table is a Heap. So when we put the SELECT statement the entire table scanned. SELECT * FROM [dbo].[tbl_WithIndex]; Here he Table has a PRIMARY KEY, so it has a Clustered Index on it. But here we are not putting the Index columns on the WHERE clause, so the Clustered Index Scan Occurs. Close Observation of Execution Plan Please remember that the table has small number of records. Table Name Estimated IO Cost tbl_WithoutIndex 0.0032035 Tbl_WithIndex 0.003125 If we see the Estimated Operation Cost, it would be same for both the Query (0.0032853). Question in Mind Here we can see the Estimated Operation cost for both the Query is same. So Question is in the mind that, if a Index Scan occur we can drop the index and use the heap (in our case). So is there any other difference between them. How they are Differences Here we understand what the internal difference between Table Scan and Index Scan. When the table scan occurs MS SQL server reads all the Rows and Columns into memory. When the Index Scan occurs, it's going to read all the Rows and only the Columns in the index. Effects in Performance In case of performance the Table Scan and Index Scan both have the same output, if we have use the single table SELECT statement. But it differs in the case of Index Scan when we are going to JOIN table. Hope you like it.
June 5, 2014
by Joydeep Das
· 24,241 Views
article thumbnail
Introducing Partitioned Collections for MongoDB Applications
TokuMX 1.5 is around the corner. The big feature will be something we discussed briefly when talking about replication changes in 1.4: partitioned collections. Before introducing the feature, I wanted to mention the following. Although TokuMX 1.5 is not available as of this writing, we would love to hear feedback on partitioned collections, which we think are wonderful for time-series data, as I describe below. If you are interested in trying out the feature, email [email protected] for a pre-release version of TokuMX 1.5. What is a partitioned collection? A partitioned collection is analogous to a partitioned table in relational databases. Oracle, MySQL, SQL Server, and Postgres all support partitioned tables. We are happy to bring this functionality to TokuMX. So, if the remainder of this blog is unclear, and you have friends in the office who are familiar with relational databases, you may want to ask them for more information :). Nevertheless, a partitioned collection is a collection that underneath the covers is broken into (or partitioned into) several individual collections, based on ranges of a “partition key”. From the application developer’s point of view, the collection is just another collection. Queries, inserts, updates, and deletes just work with no syntactical changes. Secondary indexes and replication work as well. But underneath the covers, the data will be broken into several collections, with each collection responsible for all data for a range of the partition key. If you are running TokuMX 1.4, a simple example is the oplog, which is a partitioned collection. Any normal query works just fine on the oplog. However, if you look in your data directory, you will see several .tokumx files named “local_oplog_rs_p…”. These files are the individual partitions that break up the data. Each partition stores a range of _id fields in the oplog. Why should I bother using a partitioned collection? This will be its own post with longer examples, but here is a summary. Partitioned collections have two big advantages: Large chunks of data can be deleted very efficiently by dropping partitions. The cost is that of performing an “rm” on some files in the filesystem. This is really fast and efficient. Queries that include the partition key may be isolated to individual partitions, and therefore run faster. This is similar to “query isolation” for shard keys. So, one scenario you may want a partitioned collection for is where the oldest data gets dropped periodically, and many queries benefit from a time based key. That will be a good fit. In short: time series data. If you have a time-series application where you want to keep a rolling period of data (e.g. the last 6 months worth), then using a partitioned collection will be great, and is preferable to using a TTL index or a capped collection. In a future blog post, I will expand on this. How do I use a partitioned collection in TokuMX 1.5? Basically, just like a normal collection, except with some commands added to create a partitioned collection, add partitions, and drop partitions. Below, I explain the shell commands added for this functionality. Our documentation contains the full commands so that they may be called by any driver’s runCommand method. Ok, so how do I create a partitioned collection? The first thing to consider is what your partition key should be. That is, what key do you want to use ranges of to partition your data? This key has similarities with a shard key. It should be a key that can be used to isolate partitions, the way a shard key is used to isolate shards (as explained here). Also, it should be a key that contains a range of data you would like to delete all at once. With time series data, that key will likely be a timestamp. In TokuMX, the partition key is always the primary key. To create a partitioned collection, “foo”, with a timestamp field, “ts”, used for the partitioning run the following: > db.createCollection("foo", { partitioned : 1 , primaryKey : { ts : 1 , _id : 1 } }) { "ok" : 1 } Note that in TokuMX, the primary key must have the _id field appended to it to ensure uniqueness. As a side note, we do not support hash based partitioning, only range based partitioning. Adding partitions? In TokuMX, partitions can only be appended to the end. Individual partitions cannot be split. So, say we have a collection that partitions on the _id field, where all _id’s happen to be integers. Suppose we have three partitions with the following ranges: _id <= 0 0 < _id <= 1000 _id > 1000 With this collection we cannot create a partition with the range 500 < _id <= 1000, because that would split the second partition. All we can do is add a new partition to the end, and “cap” the current last partition with a new maximum value. This new maximum value must be greater than or equal to the primary key (or in this case, _id) of the last partition’s last document. So, if the last partition’s last document has an _id of 2500, we can only partitions that create a range whose maximum is at least 2500. There are two ways to add a partition. The first method peeks at the last document in the current last partition, caps the partition with the primary key of that last document, and creates a new partition. To do so, one does: > db.foo.addPartition() { "ok" : 1 } In the above example, the partitioned collection would now have partitions with the following ranges: _id <= 0 0 < _id <= 1000 1000 < _id <= 2500 _id > 2500 Alternatively, we can specify what the new maximum of the existing last partition may be, provided it is greater than the last document in last partition (which in this example is 2500). To do so, we simply pass in the new maximum as a parameter: > db.foo.addPartition({ _id : 3000 }); { "ok" : 1 } This would make the collection have partitions with the following ranges: _id <= 0 0 < _id <= 1000 1000 < _id <= 3000 _id > 3000 Dropping partitions? Dropping partitions is simple. First, see what the partitions are with the following shell command: > db.foo.getPartitionInfo() { "numPartitions" : NumberLong(4), "partitions" : [ { "_id" : NumberLong(0), "max" : { "_id" : 0 }, "createTime" : ISODate("2014-05-29T01:50:15.839Z") }, { "_id" : NumberLong(1), "max" : { "_id" : 1000 }, "createTime" : ISODate("2014-05-29T01:50:27.049Z") }, { "_id" : NumberLong(2), "max" : { "_id" : 2500 }, "createTime" : ISODate("2014-05-29T01:50:30.549Z") }, { "_id" : NumberLong(3), "max" : { "_id" : { "$maxKey" : 1 } }, "createTime" : ISODate("2014-05-29T01:50:35.903Z") } ], "ok" : 1 } This lists each partition, what the maximum value that each partition may hold (thus defining the range of the partition), and the id of the partition (in the _id field). So, in the example we used for adding partitions, we have four partitions with _ids 0 through 3. To drop a partition, we run the following command and pass the _id of the partition we want to drop. To drop partition 0, we run: > db.foo.dropPartition(0) { "ok" : 1 } Looking at the list of partitions after this operation, we see the partition is dropped: > db.foo.getPartitionInfo() { "numPartitions" : NumberLong(3), "partitions" : [ { "_id" : NumberLong(1), "max" : { "_id" : 1000 }, "createTime" : ISODate("2014-05-29T01:50:27.049Z") }, { "_id" : NumberLong(2), "max" : { "_id" : 2500 }, "createTime" : ISODate("2014-05-29T01:50:30.549Z") }, { "_id" : NumberLong(3), "max" : { "_id" : { "$maxKey" : 1 } }, "createTime" : ISODate("2014-05-29T01:50:35.903Z") } ], "ok" : 1 } This covers how to use partitioned collections. We hope users in the MongoDB ecosystem find this feature as useful as relational database users do. In the comments section below, feel free to leave questions and/or feedback.
June 4, 2014
by Zardosht Kasheff
· 12,473 Views
article thumbnail
Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel
Explore different message brokers, and discover how these important web technologies impact a customer's backlog of messages, and cluster/data performance.
June 3, 2014
by Yves Trudeau
· 460,610 Views · 86 Likes
article thumbnail
MySQL Transaction Isolation Levels and Locks
Originally written by Lim Han Recently, an application that my team was working on encountered problems with a MySQL deadlock situation and it took us some time to figure out the reasons behind it. This application that we deployed was running on a 2-node cluster and they both are connected to an AWS MySQL database. The MySQL db tables are mostly based on InnoDB which supports transaction (meaning all the usual commit and rollback semantics) as well as row-level locking that MyISAM engine does not provide. So the problem arose when our users, due to some poorly designed user interface, was able to execute the same long running operation twice on the database. As it turned out, due to the fact that we have a dual node cluster, each of the user operation originated from a different web application (which in turn meant 2 different transaction running the same queries). The deadlock query happened to be a “INSERT INTO T… SELECT FROM S WHERE” query that introduced shared locks on the records that were used in the SELECT query. It didn’t help that both T and S in this case happened to be the same table. In effect, both the shared locks and exclusive locks were applied on the same table. An attempt to explain the possible cause of the deadlock on the queries could be explained by the following table. This is based on the assumption that we are using a default REPEATABLE_READ transaction isolation level (I will explain the concept of transaction isolation later) Assuming that we have a table as such RowId Value 1 Collection 1 2 Collection 2 … Collection N 450000 Collection 450000 The following is a sample sequence that could possibly cause a deadlock based on the 2 transactions running an SQL query like “INSERT INTO T SELECT FROM T WHERE … “ : Time Transaction 1 Transaction 2 Comment T1 Statement executed Statement executed. A shared lock is applied to records that are read by selection T2 Read lock s1 on Row 10-20 The lock on the index across a range. InnoDB has a concept of gap locks. T3 Statement executed Transaction 2 statement executed. Similar shared lock to s1 applied by selection T4 Read lock s2 on Row 10-20 Shared read locks allow both transaction to read the records only T5 Insert lock x1 into Row 13 in index wanted Transaction 1 attempts to get exclusive lock on Row 13 for insertion but Transaction 2 is holding a shared lock T6 Insert lock x2 into Row 13 in index wanted Transaction 2 attempts to get exclusive lock on Row 13 for insertion but Transaction 1 is holding a shared lock T7 Deadlock! The above scenario occurs only when we use REPEATABLE_READ (which introduces shared read locks). If we were to lower the transaction isolation level to READ_COMMITTED, we would reduce the chances of a deadlock happening. Of course, this would mean relaxing the consistency of the database records. In the case of our data requirements, we do not have such strict requirements for strong consistency. Thus, it is acceptable for one transaction to read records that are committed by other transactions. So, to delve deeper into the idea of Transaction Isolation, this concept has been defined by ANSI/ISO SQL as the following from highest isolation levels to lowest Serializable This is the highest isolation level and usually requires the use of shared read locks and exclusive write locks (as in the case of MySQL). What this means in essence that any query made will require access to a shared read lock on the records which prevents another transaction’s query to modify these records. Every update statement will require access to an exclusive write lock Also, range-locks must be acquired when a select statement with a WHERE condition is used. This is implemented as a gap lock in MySQL. Repeatable Reads This is the default level used in MySQL. This is mainly similar to Serializable beside the fact that a range lock is not used. However, the way that MySQL implements this level seemed to me a little different. Based on Wikipedia’s article on Transaction Isolation, a range lock is not implemented and so phantom reads can still occur. Phantom reads refer to a possibility that select queries will have additional records when the same query is made within a transaction. However, what I understand from MySQL’s document is that range locks are still used and the same select queries made in the same transaction will always return the same records. Maybe I’m mistaken in my understanding and if there’s any mistakes in my intepretations, I stand ready to be corrected. Read Committed This is an isolation level that will maintain a write lock until the end of the transaction but read locks will be released at the end of the SELECT statement. It does not promise that a SELECT statement will find the same data if it is re-run again in the same transaction. It will, however, guarantee that the data that is read are not “dirty” and has been committed. Read Uncommitted This is an isolation level that I doubt would be useful for most use cases. Basically, it allows a transaction to see all data that has been modified, including “dirty” or uncommitted data. This is the lowest isolation level Having gone through the different transaction isolation levels, we could see how the selection of the Transaction Isolation level determines the kind of database locking mechanism. From a practical standpoint, the default MySQL isolation level (REPEATABLE_READ) might not always be a good choice when you are dealing with a scenario like ours where there is really no need for such strong consistency in the data reads. I believe that by lowering the isolation level, it is likely to reduce chances that your database queries meet with a deadlock. Also, it might even allow a higher concurrent access to your database which improve the performance level of your queries. Of course, this comes with the caveat that you need to understand how important consistent reads are for your application. If you are dealing with data where precision is paramount (e.g. your bank accounts), then it is definitely necessary to impose as much isolation as possible so that you would not read inconsistent information within your transaction.
June 2, 2014
by Anh Tuan Nguyen
· 18,220 Views · 1 Like
article thumbnail
Neo4j & Cypher: Rounding a Float Value to Decimal Places
About 6 months ago, Jacqui Read created a github issue explaining how she wanted to round a float value to a number of decimal places but was unable to do so due to the round function not taking the appropriate parameter. I found myself wanting to do the same thing last week where I initially had the following value: RETURN toFloat("12.336666") AS value I wanted to round that to 2 decimal places and Wes suggested multiplying the value before using ROUND and then dividing afterwards to achieve that. For two decimal places we need to multiply and divide by 100: WITH toFloat("12.336666") AS value RETURN round(100 * value) / 100 AS value 12.34 Michael suggested abstracting the number of decimal places like so: WITH 2 as precision WITH toFloat("12.336666") AS value, 10^precision AS factor RETURN round(factor * value)/factor AS value If we want to round to 4 decimal places we can easily change that: WITH 4 as precision WITH toFloat("12.336666") AS value, 10^precision AS factor RETURN round(factor * value)/factor AS value 12.3367 The code is available as a graph gist too if you want to play around with it. And you may as well check out the other gists while you’re at it – enjoy!
June 2, 2014
by Mark Needham
· 6,991 Views
article thumbnail
Connecting to Cassandra from Java
In this post, I look at the basics of connecting to a Cassandra database from a Java client. I will use the DataStax Java Client JAR in order to do so.
May 30, 2014
by Dustin Marx
· 121,297 Views · 3 Likes
article thumbnail
Performance Tuning of Spring/Hibernate Applications
For most typical Spring/Hibernate enterprise applications, the application performance depends almost entirely on the performance of it's persistence layer. This post will go over how to confirm that we are in presence of a 'database-bound' application, and then walk through 7 frequently used 'quick-win' tips that can help improve application performance. How to confirm that an application is 'database-bound' To confirm that an application is 'database-bound', start by doing a typical run in some development environment, using VisualVM for monitoring. VisualVM is a Java profiler shipped with the JDK and launchable via the command line by calling jvisualvm. After launching Visual VM, try the following steps: double click on your running application Select Sampler click on Settings checkbox Choose Profile only packages, and type in the following packages: your.application.packages.* org.hibernate.* org.springframework.* your.database.driver.package, for example oracle.* Click Sample CPU The CPU profiling of a typical 'database-bound' application should look something like this: We can see that the client Java process spends 56% of it's time waiting for the database to return results over the network. This is a good sign that the queries on the database are what's keeping the application slow. The 32.7% in Hibernate reflection calls is normal and nothing much can be done about it. First step for tuning - obtaining a baseline run The first step to do tuning is to define a baseline run for the program. We need to identify a set of functionally valid input data that makes the program go through a typical execution similar to the production run. The main difference is that the baseline run should run in a much shorter period of time, as a guideline an execution time of around 5 to 10 minutes is a good target. What makes a good baseline? A good baseline should have the following characteristics: it's functionally correct the input data is similar to production in it's variety it completes in a short amount of time optimizations in the baseline run can be extrapolated to a full run Getting a good baseline is solving half of the problem. What makes a bad baseline? For example, in a batch run for processing call data records in a telecommunications system, taking the first 10 000 records could be the wrong approach. The reason being, the first 10 000 might be mostly voice calls, but the unknown performance problem is in the processing of SMS traffic. Taking the first records of a large run would lead us to a bad baseline, from which wrong conclusions would be taken. Collecting SQL logs and query timings The SQL queries executed with their execution time can be collected using for example log4jdbc. See this blog post for how to collect SQL queries using log4jdbc - Spring/Hibernate improved SQL logging with log4jdbc. The query execution time is measured from the Java client side, and it includes the network round-trip to the database. The SQL query logs look like this: 16 avr. 2014 11:13:48 | SQL_QUERY /* insert your.package.YourEntity */ insert into YOUR_TABLE (...) values (...) {executed in 13 msec} The prepared statements themselves are also a good source of information - they allow to easily identify frequent query types. They can be logged by following this blog post - Why and where is Hibernate doing this SQL query? What metrics can be extracted from SQL logs The SQL logs can give the answer these questions: What are slowest queries being executed? What are the most frequent queries? What is the amount of time spent generating primary keys? Is there some data that could benefit from caching ? How to parse the SQL logs Probably the only viable option for large log volumes is to use command line tools. This approach has the advantage of being very flexible. At the expense of writing a small script or command, we can extract mostly any metric needed. Any command line tool will work as long as you are comfortable with it. If you are used to the Unix command line, bash might be a good option. Bash can be used also in Windows workstations, using for example Cygwin, or Git that includes a bash command line. Frequently applied Quick-Wins The quick-wins bellow identify common performance problems in Spring/Hibernate applications, and their corresponding solutions. Quick-win Tip 1 - Reduce primary key generation overhead In processes that are 'insert-intensive', the choice of a primary key generation strategy can matter a lot. One common way to generate id's is to use database sequences, usually one per table to avoid contention between inserts on different tables. The problem is that if 50 records are inserted, we want to avoid that 50 network round-trips are made to the database in order to obtain 50 id's, leaving the Java process hanging most of the time. How does Hibernate usually handle this? Hibernate provides new optimized ID generators that avoid this problem. Namely for sequences, a HiLo id generator is used by default. This is how the HiLo sequence generator it works: call a sequence once and get 1000 (the High value) calculate 50 id's like this: 1000 * 50 + 0 = 50000 1000 * 50 + 1 = 50001 ... 1000 * 50 + 49 = 50049, Low value (50) reached call sequence for new High value 1001 ... etc ... So from a single sequence call, 50 keys where generated, reducing the overhead caused my inumerous network round-trips. These new optimized key generators are on by default in Hibernate 4, and can even be turned off if needed by setting hibernate.id.new_generator_mappings to false. Why can primary key generation still be a problem? The problem is, if you declared the key generation strategy as AUTO, the optimized generators are still off, and your application will end up with a huge amount of sequence calls. In order to make sure the new optimized generators are on, make sure to use the SEQUENCE strategy instead of AUTO: @Id @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "your_key_generator") private Long id; With this simple change, an improvement in the range of 10%-20% can be measured in 'insert-intensive' applications, with basically no code changes. Quick-win Tip 2 - Use JDBC batch inserts/updates For batch programs, JDBC drivers usually provide an optimization for reducing network round-trips named 'JDBC batch inserts/updates'. When these are used, inserts/updates are queued at the driver level before being sent to the database. When a threshold is reached, then the whole batch of queued statements is sent to the database in one go. This prevents the driver from sending the statements one by one, which would waist multiple network round-trips. This is the entity manager factory configuration needed to active batch inserts/updates: 100 true true Setting only the JDBC batch size won't work. This is because the JDBC driver will batch the inserts only when receiving insert/updates for the exact same table. If an insert to a new table is received, then the JDBC driver will first flush the batched statements on the previous table, before starting to batch statements on the new table. A similar functionality is implicitly used if using Spring Batch. This optimization can easily buy you 30% to 40% to 'insert intensive' programs, without changing a single line of code. Quick-win Tip 3 - Periodically flush and clear the Hibernate session When adding/modifying data in the database, Hibernate keeps in the session a version of the entities already persisted, just in case they are modified again before the session is closed. But many times we can safely discard entities once the corresponding inserts where done in the database. This releases memory in the Java client process, preventing performance problems caused by long running Hibernate sessions. Such long-running sessions should be avoided as much as possible, but if by some reason they are needed, this is how to contain memory consumption: entityManager.flush(); entityManager.clear(); The flush will trigger the inserts from new entities to be sent to the database. The clear releases the new entities from the session. Quick-win Tip 4 - Reduce Hibernate dirty-checking overhead Hibernate uses internally a mechanism to keep track of modified entities called dirty-checking. This mechanism is not based on the equals and hashcode methods of the entity classes. Hibernate does it's most to keep the performance cost of dirty-checking to a minimum, and to dirty-check only when it needs to, but the mechanism does have a cost, which is more noticeable in tables with a large number of columns. Before applying any optimization, the most important is to measure the cost of dirty-checking using VisualVM. How to avoid dirty-checking? In Spring business methods that we know are read-only, dirty-checking can be turned off like this: @Transactional(readOnly=true) public void someBusinessMethod() { .... } An alternative to avoid dirty-checking is to use the Hibernate Stateless Session, which is detailed in the documentation. Quick-win Tip 5 - Search for 'bad' query plans Check the queries in the slowest queries list to see if they have good query plans. The most usual 'bad' query plans are: Full table scans: they happen when the table is being fully scanned due to usually a missing index or outdated table statistics. Full cartesian joins: This means that the full cartesian product of several tables is being computed. Check for missing join conditions, or if this can be avoided by splitting a step into several. Quick-win Tip 6 - check for wrong commit intervals If you are doing batch processing, the commit interval can make a large difference in the performance results, as in 10 to 100 times faster. Confirm that the commit interval is the one expected (usually around 100-1000 for Spring Batch jobs). It happens often that this parameter is not correctly configured. Quick-win Tip 7 - Use the second-level and query caches If some data is identified as being eligible for caching, then have a look at this blog post for how to setup the Hibernate caching: Pitfalls of the Hibernate Second-Level / Query Caches Conclusions To solve application performance problems, the most important action to take is to collect some metrics that allow to find what the current bottleneck is. Without some metrics it is often not possible to guess in useful time what the correct problem cause is. Also, many but not all of the typical performance pitfalls of a 'database-driven' application can be avoided in the first place by using the Spring Batch framework.
May 29, 2014
by Vasco Cavalheiro
· 38,325 Views · 3 Likes
  • Previous
  • ...
  • 498
  • 499
  • 500
  • 501
  • 502
  • 503
  • 504
  • 505
  • 506
  • 507
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×