DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Databases Topics

article thumbnail
Are There Any Differences Between Table Scan and Index Scan
Introduction As we all know, developers prefer Index Seek in the case of performance of a query, and we all know what Index Seek is and how it improves performance. In this article we are not going to discuss Index Seek. A table scan is performed on a table which does not have an Index upon it (a heap) – it looks at the rows in the table and an Index Scan is performed on an indexed table – the index itself. Here we are trying discuss a specified scenario related to Table Scan and Index Scan. Scenario Description Suppose we have a table object without any Index on it. The name of the Table Object is tbl_WithoutIndex and another Table Object called tbl_WuthIndex -- Heap Table Defination IF OBJECT_ID(N'dbo.tbl_WithoutIndex', N'U') IS NOT NULL BEGIN DROP TABLE [dbo].[tbl_WithoutIndex]; END GO CREATE TABLE [dbo].[tbl_WithoutIndex] ( EMPID INT NOT NULL, EMPNAME VARCHAR(50) NOT NULL ); GO -- Insert Some Records INSERT INTO [dbo].[tbl_WithoutIndex] (EMPID, EMPNAME) VALUES (101, 'Joydeep Das'), (102, 'Sukamal Jana'), (103, 'Ranajit Shinah'); GO -- Table with Index (Clustered Index for primary Key) IF OBJECT_ID(N'dbo.tbl_WithIndex', N'U') IS NOT NULL BEGIN DROP TABLE [dbo].[tbl_WithIndex]; END GO CREATE TABLE [dbo].[tbl_WithIndex] ( EMPID INT NOT NULL PRIMARY KEY, EMPNAME VARCHAR(50) NOT NULL ); GO -- Finding Index Name sp_helpindex tbl_WithIndex; index_name index_description index_keys PK__tbl_With__14CCD97D3AD6B8E2 clustered, unique, primary key located on PRIMARY EMPID -- Insert Some Records INSERT INTO [dbo].[tbl_WithIndex] (EMPID, EMPNAME) VALUES (101, 'Joydeep Das'), (102, 'Sukamal Jana'), (103, 'Ranajit Shinah'); GO Now we are going to compare the execution plan output SELECT * FROM [dbo].[tbl_WithoutIndex]; has no Index (Here I mean the clustered Index) the table is a Heap. So when we put the SELECT statement the entire table scanned. SELECT * FROM [dbo].[tbl_WithIndex]; Here he Table has a PRIMARY KEY, so it has a Clustered Index on it. But here we are not putting the Index columns on the WHERE clause, so the Clustered Index Scan Occurs. Close Observation of Execution Plan Please remember that the table has small number of records. Table Name Estimated IO Cost tbl_WithoutIndex 0.0032035 Tbl_WithIndex 0.003125 If we see the Estimated Operation Cost, it would be same for both the Query (0.0032853). Question in Mind Here we can see the Estimated Operation cost for both the Query is same. So Question is in the mind that, if a Index Scan occur we can drop the index and use the heap (in our case). So is there any other difference between them. How they are Differences Here we understand what the internal difference between Table Scan and Index Scan. When the table scan occurs MS SQL server reads all the Rows and Columns into memory. When the Index Scan occurs, it's going to read all the Rows and only the Columns in the index. Effects in Performance In case of performance the Table Scan and Index Scan both have the same output, if we have use the single table SELECT statement. But it differs in the case of Index Scan when we are going to JOIN table. Hope you like it.
June 5, 2014
by Joydeep Das
· 24,211 Views
article thumbnail
Introducing Partitioned Collections for MongoDB Applications
TokuMX 1.5 is around the corner. The big feature will be something we discussed briefly when talking about replication changes in 1.4: partitioned collections. Before introducing the feature, I wanted to mention the following. Although TokuMX 1.5 is not available as of this writing, we would love to hear feedback on partitioned collections, which we think are wonderful for time-series data, as I describe below. If you are interested in trying out the feature, email [email protected] for a pre-release version of TokuMX 1.5. What is a partitioned collection? A partitioned collection is analogous to a partitioned table in relational databases. Oracle, MySQL, SQL Server, and Postgres all support partitioned tables. We are happy to bring this functionality to TokuMX. So, if the remainder of this blog is unclear, and you have friends in the office who are familiar with relational databases, you may want to ask them for more information :). Nevertheless, a partitioned collection is a collection that underneath the covers is broken into (or partitioned into) several individual collections, based on ranges of a “partition key”. From the application developer’s point of view, the collection is just another collection. Queries, inserts, updates, and deletes just work with no syntactical changes. Secondary indexes and replication work as well. But underneath the covers, the data will be broken into several collections, with each collection responsible for all data for a range of the partition key. If you are running TokuMX 1.4, a simple example is the oplog, which is a partitioned collection. Any normal query works just fine on the oplog. However, if you look in your data directory, you will see several .tokumx files named “local_oplog_rs_p…”. These files are the individual partitions that break up the data. Each partition stores a range of _id fields in the oplog. Why should I bother using a partitioned collection? This will be its own post with longer examples, but here is a summary. Partitioned collections have two big advantages: Large chunks of data can be deleted very efficiently by dropping partitions. The cost is that of performing an “rm” on some files in the filesystem. This is really fast and efficient. Queries that include the partition key may be isolated to individual partitions, and therefore run faster. This is similar to “query isolation” for shard keys. So, one scenario you may want a partitioned collection for is where the oldest data gets dropped periodically, and many queries benefit from a time based key. That will be a good fit. In short: time series data. If you have a time-series application where you want to keep a rolling period of data (e.g. the last 6 months worth), then using a partitioned collection will be great, and is preferable to using a TTL index or a capped collection. In a future blog post, I will expand on this. How do I use a partitioned collection in TokuMX 1.5? Basically, just like a normal collection, except with some commands added to create a partitioned collection, add partitions, and drop partitions. Below, I explain the shell commands added for this functionality. Our documentation contains the full commands so that they may be called by any driver’s runCommand method. Ok, so how do I create a partitioned collection? The first thing to consider is what your partition key should be. That is, what key do you want to use ranges of to partition your data? This key has similarities with a shard key. It should be a key that can be used to isolate partitions, the way a shard key is used to isolate shards (as explained here). Also, it should be a key that contains a range of data you would like to delete all at once. With time series data, that key will likely be a timestamp. In TokuMX, the partition key is always the primary key. To create a partitioned collection, “foo”, with a timestamp field, “ts”, used for the partitioning run the following: > db.createCollection("foo", { partitioned : 1 , primaryKey : { ts : 1 , _id : 1 } }) { "ok" : 1 } Note that in TokuMX, the primary key must have the _id field appended to it to ensure uniqueness. As a side note, we do not support hash based partitioning, only range based partitioning. Adding partitions? In TokuMX, partitions can only be appended to the end. Individual partitions cannot be split. So, say we have a collection that partitions on the _id field, where all _id’s happen to be integers. Suppose we have three partitions with the following ranges: _id <= 0 0 < _id <= 1000 _id > 1000 With this collection we cannot create a partition with the range 500 < _id <= 1000, because that would split the second partition. All we can do is add a new partition to the end, and “cap” the current last partition with a new maximum value. This new maximum value must be greater than or equal to the primary key (or in this case, _id) of the last partition’s last document. So, if the last partition’s last document has an _id of 2500, we can only partitions that create a range whose maximum is at least 2500. There are two ways to add a partition. The first method peeks at the last document in the current last partition, caps the partition with the primary key of that last document, and creates a new partition. To do so, one does: > db.foo.addPartition() { "ok" : 1 } In the above example, the partitioned collection would now have partitions with the following ranges: _id <= 0 0 < _id <= 1000 1000 < _id <= 2500 _id > 2500 Alternatively, we can specify what the new maximum of the existing last partition may be, provided it is greater than the last document in last partition (which in this example is 2500). To do so, we simply pass in the new maximum as a parameter: > db.foo.addPartition({ _id : 3000 }); { "ok" : 1 } This would make the collection have partitions with the following ranges: _id <= 0 0 < _id <= 1000 1000 < _id <= 3000 _id > 3000 Dropping partitions? Dropping partitions is simple. First, see what the partitions are with the following shell command: > db.foo.getPartitionInfo() { "numPartitions" : NumberLong(4), "partitions" : [ { "_id" : NumberLong(0), "max" : { "_id" : 0 }, "createTime" : ISODate("2014-05-29T01:50:15.839Z") }, { "_id" : NumberLong(1), "max" : { "_id" : 1000 }, "createTime" : ISODate("2014-05-29T01:50:27.049Z") }, { "_id" : NumberLong(2), "max" : { "_id" : 2500 }, "createTime" : ISODate("2014-05-29T01:50:30.549Z") }, { "_id" : NumberLong(3), "max" : { "_id" : { "$maxKey" : 1 } }, "createTime" : ISODate("2014-05-29T01:50:35.903Z") } ], "ok" : 1 } This lists each partition, what the maximum value that each partition may hold (thus defining the range of the partition), and the id of the partition (in the _id field). So, in the example we used for adding partitions, we have four partitions with _ids 0 through 3. To drop a partition, we run the following command and pass the _id of the partition we want to drop. To drop partition 0, we run: > db.foo.dropPartition(0) { "ok" : 1 } Looking at the list of partitions after this operation, we see the partition is dropped: > db.foo.getPartitionInfo() { "numPartitions" : NumberLong(3), "partitions" : [ { "_id" : NumberLong(1), "max" : { "_id" : 1000 }, "createTime" : ISODate("2014-05-29T01:50:27.049Z") }, { "_id" : NumberLong(2), "max" : { "_id" : 2500 }, "createTime" : ISODate("2014-05-29T01:50:30.549Z") }, { "_id" : NumberLong(3), "max" : { "_id" : { "$maxKey" : 1 } }, "createTime" : ISODate("2014-05-29T01:50:35.903Z") } ], "ok" : 1 } This covers how to use partitioned collections. We hope users in the MongoDB ecosystem find this feature as useful as relational database users do. In the comments section below, feel free to leave questions and/or feedback.
June 4, 2014
by Zardosht Kasheff
· 12,443 Views
article thumbnail
Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel
Explore different message brokers, and discover how these important web technologies impact a customer's backlog of messages, and cluster/data performance.
June 3, 2014
by Yves Trudeau
· 460,455 Views · 86 Likes
article thumbnail
MySQL Transaction Isolation Levels and Locks
Originally written by Lim Han Recently, an application that my team was working on encountered problems with a MySQL deadlock situation and it took us some time to figure out the reasons behind it. This application that we deployed was running on a 2-node cluster and they both are connected to an AWS MySQL database. The MySQL db tables are mostly based on InnoDB which supports transaction (meaning all the usual commit and rollback semantics) as well as row-level locking that MyISAM engine does not provide. So the problem arose when our users, due to some poorly designed user interface, was able to execute the same long running operation twice on the database. As it turned out, due to the fact that we have a dual node cluster, each of the user operation originated from a different web application (which in turn meant 2 different transaction running the same queries). The deadlock query happened to be a “INSERT INTO T… SELECT FROM S WHERE” query that introduced shared locks on the records that were used in the SELECT query. It didn’t help that both T and S in this case happened to be the same table. In effect, both the shared locks and exclusive locks were applied on the same table. An attempt to explain the possible cause of the deadlock on the queries could be explained by the following table. This is based on the assumption that we are using a default REPEATABLE_READ transaction isolation level (I will explain the concept of transaction isolation later) Assuming that we have a table as such RowId Value 1 Collection 1 2 Collection 2 … Collection N 450000 Collection 450000 The following is a sample sequence that could possibly cause a deadlock based on the 2 transactions running an SQL query like “INSERT INTO T SELECT FROM T WHERE … “ : Time Transaction 1 Transaction 2 Comment T1 Statement executed Statement executed. A shared lock is applied to records that are read by selection T2 Read lock s1 on Row 10-20 The lock on the index across a range. InnoDB has a concept of gap locks. T3 Statement executed Transaction 2 statement executed. Similar shared lock to s1 applied by selection T4 Read lock s2 on Row 10-20 Shared read locks allow both transaction to read the records only T5 Insert lock x1 into Row 13 in index wanted Transaction 1 attempts to get exclusive lock on Row 13 for insertion but Transaction 2 is holding a shared lock T6 Insert lock x2 into Row 13 in index wanted Transaction 2 attempts to get exclusive lock on Row 13 for insertion but Transaction 1 is holding a shared lock T7 Deadlock! The above scenario occurs only when we use REPEATABLE_READ (which introduces shared read locks). If we were to lower the transaction isolation level to READ_COMMITTED, we would reduce the chances of a deadlock happening. Of course, this would mean relaxing the consistency of the database records. In the case of our data requirements, we do not have such strict requirements for strong consistency. Thus, it is acceptable for one transaction to read records that are committed by other transactions. So, to delve deeper into the idea of Transaction Isolation, this concept has been defined by ANSI/ISO SQL as the following from highest isolation levels to lowest Serializable This is the highest isolation level and usually requires the use of shared read locks and exclusive write locks (as in the case of MySQL). What this means in essence that any query made will require access to a shared read lock on the records which prevents another transaction’s query to modify these records. Every update statement will require access to an exclusive write lock Also, range-locks must be acquired when a select statement with a WHERE condition is used. This is implemented as a gap lock in MySQL. Repeatable Reads This is the default level used in MySQL. This is mainly similar to Serializable beside the fact that a range lock is not used. However, the way that MySQL implements this level seemed to me a little different. Based on Wikipedia’s article on Transaction Isolation, a range lock is not implemented and so phantom reads can still occur. Phantom reads refer to a possibility that select queries will have additional records when the same query is made within a transaction. However, what I understand from MySQL’s document is that range locks are still used and the same select queries made in the same transaction will always return the same records. Maybe I’m mistaken in my understanding and if there’s any mistakes in my intepretations, I stand ready to be corrected. Read Committed This is an isolation level that will maintain a write lock until the end of the transaction but read locks will be released at the end of the SELECT statement. It does not promise that a SELECT statement will find the same data if it is re-run again in the same transaction. It will, however, guarantee that the data that is read are not “dirty” and has been committed. Read Uncommitted This is an isolation level that I doubt would be useful for most use cases. Basically, it allows a transaction to see all data that has been modified, including “dirty” or uncommitted data. This is the lowest isolation level Having gone through the different transaction isolation levels, we could see how the selection of the Transaction Isolation level determines the kind of database locking mechanism. From a practical standpoint, the default MySQL isolation level (REPEATABLE_READ) might not always be a good choice when you are dealing with a scenario like ours where there is really no need for such strong consistency in the data reads. I believe that by lowering the isolation level, it is likely to reduce chances that your database queries meet with a deadlock. Also, it might even allow a higher concurrent access to your database which improve the performance level of your queries. Of course, this comes with the caveat that you need to understand how important consistent reads are for your application. If you are dealing with data where precision is paramount (e.g. your bank accounts), then it is definitely necessary to impose as much isolation as possible so that you would not read inconsistent information within your transaction.
June 2, 2014
by Anh Tuan Nguyen
· 18,186 Views · 1 Like
article thumbnail
Neo4j & Cypher: Rounding a Float Value to Decimal Places
About 6 months ago, Jacqui Read created a github issue explaining how she wanted to round a float value to a number of decimal places but was unable to do so due to the round function not taking the appropriate parameter. I found myself wanting to do the same thing last week where I initially had the following value: RETURN toFloat("12.336666") AS value I wanted to round that to 2 decimal places and Wes suggested multiplying the value before using ROUND and then dividing afterwards to achieve that. For two decimal places we need to multiply and divide by 100: WITH toFloat("12.336666") AS value RETURN round(100 * value) / 100 AS value 12.34 Michael suggested abstracting the number of decimal places like so: WITH 2 as precision WITH toFloat("12.336666") AS value, 10^precision AS factor RETURN round(factor * value)/factor AS value If we want to round to 4 decimal places we can easily change that: WITH 4 as precision WITH toFloat("12.336666") AS value, 10^precision AS factor RETURN round(factor * value)/factor AS value 12.3367 The code is available as a graph gist too if you want to play around with it. And you may as well check out the other gists while you’re at it – enjoy!
June 2, 2014
by Mark Needham
· 6,958 Views
article thumbnail
Connecting to Cassandra from Java
In this post, I look at the basics of connecting to a Cassandra database from a Java client. I will use the DataStax Java Client JAR in order to do so.
May 30, 2014
by Dustin Marx
· 121,263 Views · 3 Likes
article thumbnail
Performance Tuning of Spring/Hibernate Applications
For most typical Spring/Hibernate enterprise applications, the application performance depends almost entirely on the performance of it's persistence layer. This post will go over how to confirm that we are in presence of a 'database-bound' application, and then walk through 7 frequently used 'quick-win' tips that can help improve application performance. How to confirm that an application is 'database-bound' To confirm that an application is 'database-bound', start by doing a typical run in some development environment, using VisualVM for monitoring. VisualVM is a Java profiler shipped with the JDK and launchable via the command line by calling jvisualvm. After launching Visual VM, try the following steps: double click on your running application Select Sampler click on Settings checkbox Choose Profile only packages, and type in the following packages: your.application.packages.* org.hibernate.* org.springframework.* your.database.driver.package, for example oracle.* Click Sample CPU The CPU profiling of a typical 'database-bound' application should look something like this: We can see that the client Java process spends 56% of it's time waiting for the database to return results over the network. This is a good sign that the queries on the database are what's keeping the application slow. The 32.7% in Hibernate reflection calls is normal and nothing much can be done about it. First step for tuning - obtaining a baseline run The first step to do tuning is to define a baseline run for the program. We need to identify a set of functionally valid input data that makes the program go through a typical execution similar to the production run. The main difference is that the baseline run should run in a much shorter period of time, as a guideline an execution time of around 5 to 10 minutes is a good target. What makes a good baseline? A good baseline should have the following characteristics: it's functionally correct the input data is similar to production in it's variety it completes in a short amount of time optimizations in the baseline run can be extrapolated to a full run Getting a good baseline is solving half of the problem. What makes a bad baseline? For example, in a batch run for processing call data records in a telecommunications system, taking the first 10 000 records could be the wrong approach. The reason being, the first 10 000 might be mostly voice calls, but the unknown performance problem is in the processing of SMS traffic. Taking the first records of a large run would lead us to a bad baseline, from which wrong conclusions would be taken. Collecting SQL logs and query timings The SQL queries executed with their execution time can be collected using for example log4jdbc. See this blog post for how to collect SQL queries using log4jdbc - Spring/Hibernate improved SQL logging with log4jdbc. The query execution time is measured from the Java client side, and it includes the network round-trip to the database. The SQL query logs look like this: 16 avr. 2014 11:13:48 | SQL_QUERY /* insert your.package.YourEntity */ insert into YOUR_TABLE (...) values (...) {executed in 13 msec} The prepared statements themselves are also a good source of information - they allow to easily identify frequent query types. They can be logged by following this blog post - Why and where is Hibernate doing this SQL query? What metrics can be extracted from SQL logs The SQL logs can give the answer these questions: What are slowest queries being executed? What are the most frequent queries? What is the amount of time spent generating primary keys? Is there some data that could benefit from caching ? How to parse the SQL logs Probably the only viable option for large log volumes is to use command line tools. This approach has the advantage of being very flexible. At the expense of writing a small script or command, we can extract mostly any metric needed. Any command line tool will work as long as you are comfortable with it. If you are used to the Unix command line, bash might be a good option. Bash can be used also in Windows workstations, using for example Cygwin, or Git that includes a bash command line. Frequently applied Quick-Wins The quick-wins bellow identify common performance problems in Spring/Hibernate applications, and their corresponding solutions. Quick-win Tip 1 - Reduce primary key generation overhead In processes that are 'insert-intensive', the choice of a primary key generation strategy can matter a lot. One common way to generate id's is to use database sequences, usually one per table to avoid contention between inserts on different tables. The problem is that if 50 records are inserted, we want to avoid that 50 network round-trips are made to the database in order to obtain 50 id's, leaving the Java process hanging most of the time. How does Hibernate usually handle this? Hibernate provides new optimized ID generators that avoid this problem. Namely for sequences, a HiLo id generator is used by default. This is how the HiLo sequence generator it works: call a sequence once and get 1000 (the High value) calculate 50 id's like this: 1000 * 50 + 0 = 50000 1000 * 50 + 1 = 50001 ... 1000 * 50 + 49 = 50049, Low value (50) reached call sequence for new High value 1001 ... etc ... So from a single sequence call, 50 keys where generated, reducing the overhead caused my inumerous network round-trips. These new optimized key generators are on by default in Hibernate 4, and can even be turned off if needed by setting hibernate.id.new_generator_mappings to false. Why can primary key generation still be a problem? The problem is, if you declared the key generation strategy as AUTO, the optimized generators are still off, and your application will end up with a huge amount of sequence calls. In order to make sure the new optimized generators are on, make sure to use the SEQUENCE strategy instead of AUTO: @Id @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "your_key_generator") private Long id; With this simple change, an improvement in the range of 10%-20% can be measured in 'insert-intensive' applications, with basically no code changes. Quick-win Tip 2 - Use JDBC batch inserts/updates For batch programs, JDBC drivers usually provide an optimization for reducing network round-trips named 'JDBC batch inserts/updates'. When these are used, inserts/updates are queued at the driver level before being sent to the database. When a threshold is reached, then the whole batch of queued statements is sent to the database in one go. This prevents the driver from sending the statements one by one, which would waist multiple network round-trips. This is the entity manager factory configuration needed to active batch inserts/updates: 100 true true Setting only the JDBC batch size won't work. This is because the JDBC driver will batch the inserts only when receiving insert/updates for the exact same table. If an insert to a new table is received, then the JDBC driver will first flush the batched statements on the previous table, before starting to batch statements on the new table. A similar functionality is implicitly used if using Spring Batch. This optimization can easily buy you 30% to 40% to 'insert intensive' programs, without changing a single line of code. Quick-win Tip 3 - Periodically flush and clear the Hibernate session When adding/modifying data in the database, Hibernate keeps in the session a version of the entities already persisted, just in case they are modified again before the session is closed. But many times we can safely discard entities once the corresponding inserts where done in the database. This releases memory in the Java client process, preventing performance problems caused by long running Hibernate sessions. Such long-running sessions should be avoided as much as possible, but if by some reason they are needed, this is how to contain memory consumption: entityManager.flush(); entityManager.clear(); The flush will trigger the inserts from new entities to be sent to the database. The clear releases the new entities from the session. Quick-win Tip 4 - Reduce Hibernate dirty-checking overhead Hibernate uses internally a mechanism to keep track of modified entities called dirty-checking. This mechanism is not based on the equals and hashcode methods of the entity classes. Hibernate does it's most to keep the performance cost of dirty-checking to a minimum, and to dirty-check only when it needs to, but the mechanism does have a cost, which is more noticeable in tables with a large number of columns. Before applying any optimization, the most important is to measure the cost of dirty-checking using VisualVM. How to avoid dirty-checking? In Spring business methods that we know are read-only, dirty-checking can be turned off like this: @Transactional(readOnly=true) public void someBusinessMethod() { .... } An alternative to avoid dirty-checking is to use the Hibernate Stateless Session, which is detailed in the documentation. Quick-win Tip 5 - Search for 'bad' query plans Check the queries in the slowest queries list to see if they have good query plans. The most usual 'bad' query plans are: Full table scans: they happen when the table is being fully scanned due to usually a missing index or outdated table statistics. Full cartesian joins: This means that the full cartesian product of several tables is being computed. Check for missing join conditions, or if this can be avoided by splitting a step into several. Quick-win Tip 6 - check for wrong commit intervals If you are doing batch processing, the commit interval can make a large difference in the performance results, as in 10 to 100 times faster. Confirm that the commit interval is the one expected (usually around 100-1000 for Spring Batch jobs). It happens often that this parameter is not correctly configured. Quick-win Tip 7 - Use the second-level and query caches If some data is identified as being eligible for caching, then have a look at this blog post for how to setup the Hibernate caching: Pitfalls of the Hibernate Second-Level / Query Caches Conclusions To solve application performance problems, the most important action to take is to collect some metrics that allow to find what the current bottleneck is. Without some metrics it is often not possible to guess in useful time what the correct problem cause is. Also, many but not all of the typical performance pitfalls of a 'database-driven' application can be avoided in the first place by using the Spring Batch framework.
May 29, 2014
by Vasco Cavalheiro
· 38,299 Views · 3 Likes
article thumbnail
Implementing Correlation ids in Spring Boot (for Distributed Tracing in SOA/Microservices)
After attending Sam Newman’s microservice talks at Geecon last week I started to think more about what is most likely an essential feature of service-oriented/microservice platforms for monitoring, reporting and diagnostics: correlation ids. Correlation ids allow distributed tracing within complex service oriented platforms, where a single request into the application can often be dealt with by multiple downstream service. Without the ability to correlate downstream service requests it can be very difficult to understand how requests are being handled within your platform. I’ve seen the benefit of correlation ids in several recent SOA projects I have worked on, but as Sam mentioned in his talks, it’s often very easy to think this type of tracing won’t be needed when building the initial version of the application, but then very difficult to retrofit into the application when you do realise the benefits (and the need for!). I’ve not yet found the perfect way to implement correlation ids within a Java/Spring-based application, but after chatting to Sam via email he made several suggestions which I have now turned into a simple project using Spring Boot to demonstrate how this could be implemented. Why? During both of Sam’s Geecon talks he mentioned that in his experience correlation ids were very useful for diagnostic purposes. Correlation ids are essentially an id that is generated and associated with a single (typically user-driven) request into the application that is passed down through the stack and onto dependent services. In SOA or microservice platforms this type of id is very useful, as requests into the application typically are ‘fanned out’ or handled by multiple downstream services, and a correlation id allows all of the downstream requests (from the initial point of request) to be correlated or grouped based on the id. So called ‘distributed tracing’ can then be performed using the correlation ids by combining all the downstream service logs and matching the required id to see the trace of the request throughout your entire application stack (which is very easy if you are using a centralised logging framework such as logstash) The big players in the service-oriented field have been talking about the need for distributed tracing and correlating requests for quite some time, and as such Twitter have created their open source Zipkin framework (which often plugs into their RPC framework Finagle), and Netflix has open-sourced their Karyon web/microservice framework, both of which provide distributed tracing. There are of course commercial offering in this area, one such product being AppDynamics, which is very cool, but has a rather hefty price tag. Creating a proof-of-concept in Spring Boot As great as Zipkin and Karyon are, they are both relatively invasive, in that you have to build your services on top of the (often opinionated) frameworks. This might be fine for some use cases, but no so much for others, especially when you are building microservices. I’ve been enjoying experimenting with Spring Boot of late, and this framework builds on the much known and loved (at least by me :-) ) Spring framework by providing lots of preconfigured sensible defaults. This allows you to build microservices (especially ones that communicate via RESTful interfaces) very rapidly. The remainder of this blog pos explains how I implemented a (hopefully) non-invasive way of implementing correlation ids. Goals Allow a correlation id to be generated for a initial request into the application Enable the correlation id to be passed to downstream services, using as method that is as non-invasive into the code as possible Implementation I have created two projects on GitHub, one containing an implementation where all requests are being handled in a synchronous style (i.e. the traditional Spring approach of handling all request processing on a single thread), and also one for when an asynchronous (non-blocking) style of communication is being used (i.e., using the Servlet 3 asynchronous support combined with Spring’s DeferredResult and Java’s Futures/Callables). The majority of this article describes the asynchronous implementation, as this is more interesting: Spring Boot asynchronous (DeferredResult + Futures) communication correlation id Github repo The main work in both code bases is undertaken by the CorrelationHeaderFilter, which is a standard Java EE Filter that inspects the HttpServletRequest header for the presence of a correlationId. If one is found then we set a ThreadLocal variable in the RequestCorrelation Class (discussed later). If a correlation id is not found then one is generated and added to the RequestCorrelation Class: public class CorrelationHeaderFilter implements Filter { //... @Override public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException { final HttpServletRequest httpServletRequest = (HttpServletRequest) servletRequest; String currentCorrId = httpServletRequest.getHeader(RequestCorrelation.CORRELATION_ID_HEADER); if (!currentRequestIsAsyncDispatcher(httpServletRequest)) { if (currentCorrId == null) { currentCorrId = UUID.randomUUID().toString(); LOGGER.info("No correlationId found in Header. Generated : " + currentCorrId); } else { LOGGER.info("Found correlationId in Header : " + currentCorrId); } RequestCorrelation.setId(currentCorrId); } filterChain.doFilter(httpServletRequest, servletResponse); } //... private boolean currentRequestIsAsyncDispatcher(HttpServletRequest httpServletRequest) { return httpServletRequest.getDispatcherType().equals(DispatcherType.ASYNC); } The only thing is this code that may not instantly be obvious is the conditional checkcurrentRequestIsAsyncDispatcher(httpServletRequest), but this is here to guard against the correlation id code being executed when the Async Dispatcher thread is running to return the results (this is interesting to note, as I initially didn’t expect the Async Dispatcher to trigger the execution of the filter again?) Here is the RequestCorrelation Class, which contains a simple ThreadLocal static variable to hold the correlation id for the current Thread of execution (set via the CorrelationHeaderFilter above) public class RequestCorrelation { public static final String CORRELATION_ID = "correlationId"; private static final ThreadLocal id = new ThreadLocal(); public static String getId() { return id.get(); } public static void setId(String correlationId) { id.set(correlationId); } } Once the correlation id is stored in the RequestCorrelation Class it can be retrieved and added to downstream service requests (or data store access etc) as required by calling the static getId() method within RequestCorrelation. It is probably a good idea to encapsulate this behaviour away from your application services, and you can see an example of how to do this in a RestClient Class I have created, which composes Spring’s RestTemplate and handles the setting of the correlation id within the header transparently from the calling Class. @Component public class CorrelatingRestClient implements RestClient { private RestTemplate restTemplate = new RestTemplate(); @Override public String getForString(String uri) { String correlationId = RequestCorrelation.getId(); HttpHeaders httpHeaders = new HttpHeaders(); httpHeaders.set(RequestCorrelation.CORRELATION_ID, correlationId); LOGGER.info("start REST request to {} with correlationId {}", uri, correlationId); //TODO: error-handling and fault-tolerance in production ResponseEntity response = restTemplate.exchange(uri, HttpMethod.GET, new HttpEntity(httpHeaders), String.class); LOGGER.info("completed REST request to {} with correlationId {}", uri, correlationId); return response.getBody(); } } //... calling Class public String exampleMethod() { RestClient restClient = new CorrelatingRestClient(); return restClient.getForString(URI_LOCATION); //correlation id handling completely abstracted to RestClient impl } Making this work for asynchronous requests… The code included above works fine when you are handling all of your requests synchronously, but it is often a good idea in a SOA/microservice platform to handle requests in a non-blocking asynchronous manner. In Spring this can be achieved by using the DeferredResult Class in combination with the Servlet 3 asynchronous support. The problem with using ThreadLocal variables within the asynchronous approach is that the Thread that initially handles the request (and creates the DeferredResult/Future) will not be the Thread doing the actual processing. Accordingly, a bit of glue code is needed to ensure that the correlation id is propagated across the Threads. This can be achieved by extending Callable with the required functionality: (don’t worry if example Calling Class code doesn’t look intuitive – this adaption between DeferredResults and Futures is a necessary evil within Spring, and the full code including the boilerplate ListenableFutureAdapter is in my GitHub repo): public class CorrelationCallable implements Callable { private String correlationId; private Callable callable; public CorrelationCallable(Callable targetCallable) { correlationId = RequestCorrelation.getId(); callable = targetCallable; } @Override public V call() throws Exception { RequestCorrelation.setId(correlationId); return callable.call(); } } //... Calling Class @RequestMapping("externalNews") public DeferredResult externalNews() { return new ListenableFutureAdapter<>(service.submit(new CorrelationCallable<>(externalNewsService::getNews))); } And there we have it – the propagation of correlation id regardless of the synchronous/asynchronous nature of processing! You can clone the Github report containing my asynchronous example, and execute the application by running mvn spring-boot:run at the command line. If you access http://localhost:8080/externalNewsin your browser (or via curl) you will see something similar to the following in your Spring Boot console, which clearly demonstrates a correlation id being generated on the initial request, and then this being propagated through to a simulated external call (have a look in the ExternalNewsServiceRest Class to see how this has been implemented): [nio-8080-exec-1] u.c.t.e.c.w.f.CorrelationHeaderFilter : No correlationId found in Header. Generated : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : start REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 [nio-8080-exec-2] u.c.t.e.c.w.f.CorrelationHeaderFilter : Found correlationId in Header : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : completed REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 Conclusion I’m quite happy with this simple prototype, and it does meet the two goals I listed above. Future work will include writing some tests for this code (shame on me for not TDDing!), and also extend this functionality to a more realistic example. I would like to say a massive thanks to Sam, not only for sharing his knowledge at the great talks at Geecon, but also for taking time to respond to my emails. If you’re interested in microservices and related work I can highly recommend Sam’s Microservice book which is available in Early Access at O’Reilly. I’ve enjoyed reading the currently available chapters, and having implemented quite a few SOA projects recently I can relate to a lot of the good advice contained within. I’ll be following the development of this book with keen interest! If you have any comments or thoughts then please do share them via the comment below, or feel free to get in touch via the usual mechanisms! References I used Tomasz Nurkiewicz’s excellent blog several times for learning how best to wire up all of the DeferredResult/Future code in Spring: http://www.nurkiewicz.com/2013/03/deferredresult-asynchronous-processing.html
May 29, 2014
by Daniel Bryant
· 24,455 Views · 1 Like
article thumbnail
Visualizing XML as a Graph Using Neo4j
neo4j is quite awesome for pretty much everything . here is an extract i found from this blog whenever someone gives you a problem, think graphs . they are the most fundamental and flexible way of representing any kind of a relationship, so it’s about a 50-50 shot that any interesting design problem has a graph involved in it. make absolutely sure you can’t think of a way to solve it using graphs before moving on to other solution types. as a baby step, let us try to visualize an xml as a graph. we can start with a simple xml here, belgian waffles $5.95 two of our famous belgian waffles with plenty of real maple syrup 650 it is a simple ‘breakfast menu’ with just one item, ‘ belgian waffles ‘. neo4j has a neat interface which lets us visualize the graph we created. the graph for the above xml looks slick.. the above xml has 6 tags and we have 6 nodes in our graph. the nested tags/nodes has a ‘child-of’ relationship with the parent tag/node. in the graph above, node numbered 0 is the root node – corresponds to the tag node numbered 1 is the immediate child of the root – corresponds to the tag nodes numbered from 2-4 are the tags which come under tag, viz , , and transforming xml.. let us parse the xml into a data structure which can be easily persisted using neo4j . personally, i would prefer sax parser as there are serious memory constraints when creating a dom object (really painful if you talking big data as xml). additionally sax parsing gives you unlimited freedom to do whatever you want to do with the xml. this is how i represent a node, using the object xmlelement . each xmlelement is identified by an id. this is basically an integer or long. only thing to make sure is that the number has to be unique across all xmlelements. publicclassxmlelement { privatestring tagname; privatestring tagvalue; privatemap attributes = newhashmap(); privatehierarchyidentifier hierarchyidentifier; privateintparentid; privatebooleanpersisted; //setters and getters for the members publicstring getatrributestring(){ objectmapper jsonmapper = newobjectmapper(); try{ returnjsonmapper.writevalueasstring(this.attributes); } catch(jsonprocessingexception e) { logger.severe(e.getmessage()); e.printstacktrace(); } returnnull; } } the xml tag name and value are stored as string and the attributes are stored as a map. also, there is another member object, hierarchyidentifier publicclasshierarchyidentifier { privateintdepth; privateintwidth; privateintid; //getters and setters for member element } the hierarchyidentifier class contains the id class which is used to identify an xmlelement , which translates to identifying an xml tag. each xmlelement has the id of it’s parent stored. and after the parse, the xml should be represented as a map of and each xmlelement identified by the corresponding integer id. you can see the sax parser here .. so we pass the map to the graphwriter object. we create a node with the following properties value – the value of the xml tag id – the id of the current node parent – the id of the parent tag attributes – the string representation of the tag attributes also, the node has the following labels node – for all nodes parent – for the seed ( highest parent ) node xml tag name – say, the node for the tag will have label food currently, there is only a single type of relationship, ‘child_of’ , between immediate nested tags. see the below the example for a bigger xml belgian waffles $5.95 two of our famous belgian waffles with plenty of real maple syrup 650 masala dosa $10.95 south india's famous slim pancake with mashed potatoes 650 25 i have added a second item to the ‘breakfast_menu’.. the legendary masala dosa . also, the second food item has a new child tag, . and the resultant graph looks prettier the sets of tags,relationships and properties can be seen in the neo4j local server interface. the entire codebase can be found here in github .. visualizing as a graphgist an easy representation can be done in graphgist. i have created the graph of this xml. here is the link to the gist . also, the gist script can be found in the github gist here ...
May 29, 2014
by Nikhil Kuriakose
· 25,832 Views
article thumbnail
Debunking Myths About the VoltDB In-Memory Database
Originally written by John Hugg I’d like to take a quick moment to address some myths and misconceptions about VoltDB. Many people selling products who view VoltDB as competition seem to be repeating them. As you’ll read, much of what’s said is just plain FUD. VoltDB is an in-memory database that has benchmarked at over 3 million transactions a second on bare metal, and recently crushed previous performance records in the cloud, posting eye-popping YCSB (Yahoo! Cloud Service Benchmark) numbers on AWS, Amazon’s cloud platform. It’s fully transactional, supports full disk durability, and has very low latency and fully native high availability. It’s not surprising competing interests look for FUD. Note the comments below apply to the current shipping version of VoltDB at the time of this posting (4.2), unless otherwise noted. Myth #1: “VoltDB requires stored procedures.” This was true for 1.0, but no one seems to notice it’s been false since we shipped 1.1 in 2010. VoltDB supports unforeseen SQL without any stored procedure use. We have users in production who have never used a single stored procedure. SQL queries can be run through: JDBC Our command line SQL console A web portal built into every server The HTTP/JSON interface Native client drivers written in C++, C#, PHP, Python, Node.js and even Java. Community-provided language drivers ODBC drivers (in development to ship in Q2) There is one restriction: VoltDB doesn’t support external transaction control; VoltDB is permanently in auto-commit mode. Note this offers more transactionality than most NoSQL systems, and is a shared restriction of several NewSQL or “In-Memory” SQL systems. For those systems that do offer external transaction control, I’m not aware of any that publish benchmarks where transactions require multiple round trips to the client. This is because it’s generally understood to be a bad thing in a performance-critical system. In addition to ad-hoc SQL, VoltDB auto-generates optimized procedures for various CRUD operations (create, read, update, delete). For example, to fetch a row by primary key from table FOO, I can call a procedure named “FOO.select” with the key value as a parameter. Myth #2: “VoltDB doesn’t support ad-hoc SQL.” This is just a rephrasing of Myth #1 and is still false. Myth #3: “VoltDB is slow unless I use stored procedures.” Well, no. VoltDB can run faster with stored procedures, but it’s still fast if they are not used. In our internal benchmarks on pretty cheap single-socket hardware, we can run about 50k write statements per second, per host with full durability. That’s a bit lower than we can achieve with stored procedures, so we’re doing some engineering work to close the gap. Still, on a 3-node cluster costing about $3,000 for hardware, we can run well over 100k SQL updates per second with durability and double redundancy. If your current system can run that fast, let us know. Also, the auto-generated CRUD procedures mentioned above run at the full speed of pre-optimized procedures. Using just CRUD, VoltDB is the fastest durable Key-Value store I’m aware of. Myth #4: “I have to know Java to use VoltDB.” As of VoltDB 3.0, released over a year ago, (we’re on V4.2 today), a user can build VoltDB apps and run the server without ever directly interacting with the Java CLI tools or any Java code. Client apps can also be written without Java. As mentioned previously, VoltDB supports client drivers in many languages, with community-contributed drivers in others. If you do want to leverage VoltDB’s ability to perform multi-statement, transactional logic, only the most basic understanding of Java is needed to build stored procedures. We provide example applications in our kit that can be copied from and/or modified. We even support declaring procedures in Groovy right in your DDL. Myth #5: “VoltDB has garbage collection problems because it is written in Java.” The short answer is this is just not true. First, VoltDB uses native C++ code for all of its data storage and SQL execution paths. This is done to avoid putting long-lived data on the Java heap, but also to allow very fine control of memory use. VoltDB’s special native data structures use less memory than competing systems and actively fight allocation fragmentation. What’s left on the Java heap is either near-immortal configuration and setup data, or very short-lived per-transaction data. This is almost the perfect workload for the Java garbage collector, so there is minimal impact to the user’s workload. For more detail, read Ariel Weisberg’s excellent blog post on the topic. Myth #6: “VoltDB’s SQL is very limited.” This misconception is based on historical truth, but is way out of date in 2014. The original H-Store project, upon which VoltDB is based, started with very limited, OLTP-focused SQL, but much has changed in the past years. VoltDB is rapidly approaching SQL-92 functionality for DML and DQL SQL. One of the final holdouts, subselects, shipped in part in 4.2; more functionality will ship in upcoming versions. In addition to this ANSI SQL baseline, we’ve added materialized view support, function-based index support, native JSON functionality and more. We have a lot more coming in 2014. We’re proud to announce all of this is fully Unicode-compliant and is in production in several uses in countries with multi-byte character sets. Myth #7: “Yes, VoltDB supports cross-partition transactions, but they are too slow to use.” Yes we do, and no they aren’t too slow to use. Keep in mind that NoSQL systems often don’t offer cross-partition transactions and many other NewSQL systems have functionality or performance limitations. Few benchmarks stress this functionality, so I’ll be as clear as possible here. Traditionally, VoltDB’s cross-partition transactions have been slower than our partitioned operations. Additional coordination needs to happen to make these transactions ACID at the “serializable” isolation level at which all VoltDB transactions run. Furthermore, these operations don’t scale with cluster size and sometimes can vary in performance, depending on your networking performance. But are they too slow to use? I’ll let you be the judge. Cross-partition writes run on the order of 1,000 per second,regardless of cluster size. This number can vary by an order of magnitude depending on hardware. Cross partition reads run on the order of 50,000 per second, regardless of cluster size. Again, there is some variability, but less than with writes. It’s possible to send writes to each partition individually using the “Send Everywhere” pattern. These writes are transactional at each partition, but individual partitions might fail and rollback independently of others. A common pattern we see is a mix of single-partition writes and cross-partition reads. Cross-partition writes are often used to update infrequently changing lookup tables. Of course at 1000 per second, they can still change fairly frequently. Note that if your read operation is scanning millions of tuples, you won’t see the same throughput, but if it’s populating a schema supported leaderboard or computing a time window statistic using a materialized view, you will. The bottom line is that most of our customers not only use cross-partition transactions, but also these kinds of operations are one of the key benefits VoltDB offers. We are always working to improve the performance of cross-partition operations, and in recent releases we have made a lot of progress. In 4.0 we introduced read optimizations that got us to the 50k/second number. We also have write optimizations in the pipeline. In Conclusion Many of these myths and misconceptions are firmly rooted in the world of traditional RDBMSs. VoltDB is a NewSQL database. When evaluating VoltDB, consider the non-traditional features that set us head and shoulders above traditional RDBMSs, NoSQL and other NewSQL options: Intelligent, asynchronous clients allow tremendous throughput per-client, as well as terrific management. Native export connectors push tuples into downstream systems like message queues, analytic data-warehouses and Hadoop Active-Active intra-cluster replication with sub-millisecond latency ensures perfectly consistent high availability and redundant durability. Seamless node failure and replacement without impacting serializable ACID semantics. Transactional cluster expansion with low performance impact and without impacting serializable ACID semantics provides near-limitless scalability. 1500 active, connected clients per server node with minimal latency impact meets the needs of most high-velocity, high transaction volume businesses. VoltDB is open source (https://github.com/VoltDB/voltdb) If you want to hear more, or if you have a use case you want to talk to us about, contact us. We’d love to hear from you. If you’d like to try VoltDB out yourself, you can download a free trial here.
May 28, 2014
by John Piekos
· 9,820 Views · 1 Like
article thumbnail
Implementing Correlation IDs in Spring Boot (for Distributed Tracing in SOA/Microservices)
After attending Sam Newman’s microservice talks at Geecon last week I started to think more about what is most likely an essential feature of service-oriented/microservice platforms for monitoring, reporting and diagnostics: correlation ids. Correlation ids allow distributed tracing within complex service oriented platforms, where a single request into the application can often be dealt with by multiple downstream service. Without the ability to correlate downstream service requests it can be very difficult to understand how requests are being handled within your platform. I’ve seen the benefit of correlation ids in several recent SOA projects I have worked on, but as Sam mentioned in his talks, it’s often very easy to think this type of tracing won’t be needed when building the initial version of the application, but then very difficult to retrofit into the application when you do realise the benefits (and the need for!). I’ve not yet found the perfect way to implement correlation ids within a Java/Spring-based application, but after chatting to Sam via email he made several suggestions which I have now turned into a simple project using Spring Boot to demonstrate how this could be implemented. Why? During both of Sam’s Geecon talks he mentioned that in his experience correlation ids were very useful for diagnostic purposes. Correlation ids are essentially an id that is generated and associated with a single (typically user-driven) request into the application that is passed down through the stack and onto dependent services. In SOA or microservice platforms this type of id is very useful, as requests into the application typically are ‘fanned out’ or handled by multiple downstream services, and a correlation id allows all of the downstream requests (from the initial point of request) to be correlated or grouped based on the id. So called ‘distributed tracing’ can then be performed using the correlation ids by combining all the downstream service logs and matching the required id to see the trace of the request throughout your entire application stack (which is very easy if you are using a centralised logging framework such as logstash) The big players in the service-oriented field have been talking about the need for distributed tracing and correlating requests for quite some time, and as such Twitter have created their open source Zipkin framework (which often plugs into their RPC framework Finagle), and Netflix has open-sourced their Karyon web/microservice framework, both of which provide distributed tracing. There are of course commercial offering in this area, one such product being AppDynamics, which is very cool, but has a rather hefty price tag. Creating a proof-of-concept in Spring Boot As great as Zipkin and Karyon are, they are both relatively invasive, in that you have to build your services on top of the (often opinionated) frameworks. This might be fine for some use cases, but no so much for others, especially when you are building microservices. I’ve been enjoying experimenting with Spring Boot of late, and this framework builds on the much known and loved (at least by me :-) ) Spring framework by providing lots of preconfigured sensible defaults. This allows you to build microservices (especially ones that communicate via RESTful interfaces) very rapidly. The remainder of this blog pos explains how I implemented a (hopefully) non-invasive way of implementing correlation ids. Goals Allow a correlation id to be generated for a initial request into the application Enable the correlation id to be passed to downstream services, using as method that is as non-invasive into the code as possible Implementation I have created two projects on GitHub, one containing an implementation where all requests are being handled in a synchronous style (i.e. the traditional Spring approach of handling all request processing on a single thread), and also one for when an asynchronous (non-blocking) style of communication is being used (i.e., using the Servlet 3 asynchronous support combined with Spring’s DeferredResult and Java’s Futures/Callables). The majority of this article describes the asynchronous implementation, as this is more interesting: Spring Boot asynchronous (DeferredResult + Futures) communication correlation id Github repo The main work in both code bases is undertaken by the CorrelationHeaderFilter, which is a standard Java EE Filter that inspects the HttpServletRequest header for the presence of a correlationId. If one is found then we set a ThreadLocal variable in the RequestCorrelation Class (discussed later). If a correlation id is not found then one is generated and added to the RequestCorrelation Class: public class CorrelationHeaderFilter implements Filter { //... @Override public void doFilter(ServletRequest servletRequest, ServletResponse servletResponse, FilterChain filterChain) throws IOException, ServletException { final HttpServletRequest httpServletRequest = (HttpServletRequest) servletRequest; String currentCorrId = httpServletRequest.getHeader(RequestCorrelation.CORRELATION_ID_HEADER); if (!currentRequestIsAsyncDispatcher(httpServletRequest)) { if (currentCorrId == null) { currentCorrId = UUID.randomUUID().toString(); LOGGER.info("No correlationId found in Header. Generated : " + currentCorrId); } else { LOGGER.info("Found correlationId in Header : " + currentCorrId); } RequestCorrelation.setId(currentCorrId); } filterChain.doFilter(httpServletRequest, servletResponse); } //... private boolean currentRequestIsAsyncDispatcher(HttpServletRequest httpServletRequest) { return httpServletRequest.getDispatcherType().equals(DispatcherType.ASYNC); } The only thing is this code that may not instantly be obvious is the conditional check currentRequestIsAsyncDispatcher(httpServletRequest), but this is here to guard against the correlation id code being executed when the Async Dispatcher thread is running to return the results (this is interesting to note, as I initially didn’t expect the Async Dispatcher to trigger the execution of the filter again?) Here is the RequestCorrelation Class, which contains a simple ThreadLocal static variable to hold the correlation id for the current Thread of execution (set via the CorrelationHeaderFilter above) public class RequestCorrelation { public static final String CORRELATION_ID = "correlationId"; private static final ThreadLocal id = new ThreadLocal(); public static String getId() { return id.get(); } public static void setId(String correlationId) { id.set(correlationId); } } Once the correlation id is stored in the RequestCorrelation Class it can be retrieved and added to downstream service requests (or data store access etc) as required by calling the static getId() method within RequestCorrelation. It is probably a good idea to encapsulate this behaviour away from your application services, and you can see an example of how to do this in a RestClient Class I have created, which composes Spring’s RestTemplate and handles the setting of the correlation id within the header transparently from the calling Class. @Component public class CorrelatingRestClient implements RestClient { private RestTemplate restTemplate = new RestTemplate(); @Override public String getForString(String uri) { String correlationId = RequestCorrelation.getId(); HttpHeaders httpHeaders = new HttpHeaders(); httpHeaders.set(RequestCorrelation.CORRELATION_ID, correlationId); LOGGER.info("start REST request to {} with correlationId {}", uri, correlationId); //TODO: error-handling and fault-tolerance in production ResponseEntity response = restTemplate.exchange(uri, HttpMethod.GET, new HttpEntity(httpHeaders), String.class); LOGGER.info("completed REST request to {} with correlationId {}", uri, correlationId); return response.getBody(); } } //... calling Class public String exampleMethod() { RestClient restClient = new CorrelatingRestClient(); return restClient.getForString(URI_LOCATION); //correlation id handling completely abstracted to RestClient impl } Making this work for asynchronous requests… The code included above works fine when you are handling all of your requests synchronously, but it is often a good idea in a SOA/microservice platform to handle requests in a non-blocking asynchronous manner. In Spring this can be achieved by using the DeferredResult Class in combination with the Servlet 3 asynchronous support. The problem with using ThreadLocal variables within the asynchronous approach is that the Thread that initially handles the request (and creates the DeferredResult/Future) will not be the Thread doing the actual processing. Accordingly, a bit of glue code is needed to ensure that the correlation id is propagated across the Threads. This can be achieved by extending Callable with the required functionality: (don’t worry if example Calling Class code doesn’t look intuitive – this adaption between DeferredResults and Futures is a necessary evil within Spring, and the full code including the boilerplate ListenableFutureAdapter is in my GitHub repo): public class CorrelationCallable implements Callable { private String correlationId; private Callable callable; public CorrelationCallable(Callable targetCallable) { correlationId = RequestCorrelation.getId(); callable = targetCallable; } @Override public V call() throws Exception { RequestCorrelation.setId(correlationId); return callable.call(); } } //... Calling Class @RequestMapping("externalNews") public DeferredResult externalNews() { return new ListenableFutureAdapter<>(service.submit(new CorrelationCallable<>(externalNewsService::getNews))); } And there we have it – the propagation of correlation id regardless of the synchronous/asynchronous nature of processing! You can clone the Github report containing my asynchronous example, and execute the application by running mvn spring-boot:run at the command line. If you access http://localhost:8080/externalNews in your browser (or via curl) you will see something similar to the following in your Spring Boot console, which clearly demonstrates a correlation id being generated on the initial request, and then this being propagated through to a simulated external call (have a look in the ExternalNewsServiceRest Class to see how this has been implemented): [nio-8080-exec-1] u.c.t.e.c.w.f.CorrelationHeaderFilter : No correlationId found in Header. Generated : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : start REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 [nio-8080-exec-2] u.c.t.e.c.w.f.CorrelationHeaderFilter : Found correlationId in Header : d205991b-c613-4acd-97b8-97112b2b2ad0 [pool-1-thread-1] u.c.t.e.c.w.c.CorrelatingRestClient : completed REST request to http://localhost:8080/news with correlationId d205991b-c613-4acd-97b8-97112b2b2ad0 Conclusion I’m quite happy with this simple prototype, and it does meet the two goals I listed above. Future work will include writing some tests for this code (shame on me for not TDDing!), and also extend this functionality to a more realistic example. I would like to say a massive thanks to Sam, not only for sharing his knowledge at the great talks at Geecon, but also for taking time to respond to my emails. If you’re interested in microservices and related work I can highly recommend Sam’s Microservice book which is available in Early Access at O’Reilly. I’ve enjoyed reading the currently available chapters, and having implemented quite a few SOA projects recently I can relate to a lot of the good advice contained within. I’ll be following the development of this book with keen interest! If you have any comments or thoughts then please do share them via the comment below, or feel free to get in touch via the usual mechanisms! References I used Tomasz Nurkiewicz’s excellent blog several times for learning how best to wire up all of the DeferredResult/Future code in Spring: http://www.nurkiewicz.com/2013/03/deferredresult-asynchronous-processing.html
May 28, 2014
by Daniel Bryant
· 73,784 Views · 2 Likes
article thumbnail
Spring/Hibernate Improved SQL Logging with log4jdbc
Hibernate provides SQL logging out of the box, but such logging only shows prepared statements, and not the actual SQL queries sent to the database. It also does not log the execution time of each query, which is useful for performance troubleshooting. This blog post will go over how to setup Hibernate query logging, and then compare it to the logging that can be obtained with log4jdbc . The Hibernate query logging functionality Hibernate does not log the real SQL queries sent to the database. This is because Hibernate interacts with the database via the JDBC driver, to which it sends prepared statements but not the actual queries. So Hibernate can only log the prepared statements and the values of their binding parameters, but not the actual SQL queries themselves. This is how a query looks like when logged by Hibernate: select /* load your.package.Employee */ this_.code, ... from employee this_ where this_.employee_id=? TRACE 12-04-2014@16:06:02 BasicBinder - binding parameter [1] as [NUMBER] - 1000 See this post Hibernate Debugging - Finding the origin of a Query for how to setup this type of logging. Using log4jdbc For a developer it's useful to be able to copy paste a query from the log and be able to execute the query directly in an SQL client, but the variable placeholders ? make that unfeasible. Log4jdbc in an open source tool that allows to do just that, and more. Log4jdbc is a spy driver that will wrap itself around the real JDBC driver, logging queries as they go through it. The version linked from this post provides Spring integration, unlike several other log4jdbc forks. Setting up log4jdbc First include the log4jdbc-remix library in your pom.xml. This library is a fork of the original log4jdbc: org.lazyluke log4jdbc-remix0.2.7 Next, find in the Spring configuration the definition of the data source. As an example, when using the JNDI lookup element this is how the data source looks like: After finding the data source definition, rename it to the following name: Then define a new log4jdbc data source that wraps the real data source, and give it the original name: With this configuration, the query logging should already be working. It's possible to customize the logging level of the several log4jdbc loggers available. The original log4jdbc documentation provides more information on the available loggers: jdbc.sqlonly: Logs only SQL jdbc.sqltiming: Logs the SQL, post-execution, including timing execution statistics jdbc.audit: Logs ALL JDBC calls except for ResultSets jdbc.resultset: all calls to ResultSet objects are logged jdbc.connection: Logs connection open and close events The jdbc.audit logger is especially useful to validate the scope of transactions, as it logs the begin/commit/rollback events of a database transaction. This is the proposed log4j configuration that will print only the SQL queries together with their execution time: Conclusion Using log4jdbc does imply some initial setup, but once it's in place it's really convenient to have. Having a true query log is also useful for performance troubleshooting (to be described in a future post).
May 27, 2014
by Vasco Cavalheiro
· 20,227 Views · 1 Like
article thumbnail
DO... WHILE and REPEAT... UNTIL Loops in MS SQL
Introduction When I am looking for a forum post related to SQL Server, one of the junior professional is asking how to use a DO…WHILE loop is MS SQL Server. Several people wrote their opinion related to it. Everyone is saying to use WHILE loop and some of them suggesting with T-SQL structure of CURSOR with WHILE LOOP. Obviously, when a junior professional is learning MS SQL server, the question in mind arises: is there DO… WHILE, REPEAT … UNTIL loop present in MS SQL Server as there is in C or C++ etc? No one is answering directly on the forum whether we can use DO… WHILE or REPEAT … UNTIL in MS SQL Server or NOT. If yes, how can we implement them? DO… WHILE in MS SQL Sever First we look at the algorithm of DO… WHILE. SET X = 1 DO PRINT X SET X = X + 1 WHILE X <= 10 Now we try to implement it in MS SQL Server. DECLARE @X INT=1; WAY: --> Here the DO statement PRINT @X; SET @X += 1; IF @X<=10 GOTO WAY; --> Here the WHILE @X<=1 REPEAT… UNTIL First we look at the algorithm of REPEAT... UNTIL SET X = 1 REPEAT PRINT X SET X = X + 1 UNTIL X > 10 Now we try to implement it in MS SQL Server DECLARE @X INT = 1; WAY: -- Here the REPEAT statement PRINT @X; SET @X += 1; IFNOT(@X >1 0) GOTO WAY; -- Here the UNTIL @X>10 So we see that it is possible, but a little complicated… So most developers prefer the WHILE loop in MS SQL Server.
May 26, 2014
by Joydeep Das
· 104,692 Views
article thumbnail
Understanding the Cloud Foundry Java Buildpack Code with Tomcat Example
Cloudfoundry's java buildpack is supporting some popular jvm based applications. This article is oriented to the audiences already with experience of cloudfoundry/heroku buildpack who want to have more understanding of how buildpack and cloudfoundry works internally. cf push app -p app.war -b build-pack-url The above command demonstrates the usage of pushing a war file to cloudfoundry by using a custom buildpack (E.g. https://github.com/cloudfoundry/java-buildpack). However, what exactly happens inside, or how cloudfoundry bootstrap the war file with tomcat? There are three contracts phase that bridge communication between buildpack and cloudfoundry. The three phases are detect, compile and release, which are three ruby shell scripts: Java buildpack has multiple sub components, while each of them has all of these three phases (E.g. tomcat is one of the sub components, while it contained another layer of sub components). Detect Phase: detect phase is to check whether a particular buildpack/component applies to the deployed application. Take the war file example, tomcat applies only when https://github.com/cloudfoundry/java-buildpack/blob/master/lib/java_buildpack/container/tomcat.rb is true: def supports? web_inf? && !JavaBuildpack::Util::JavaMainUtils.main_class(@application) end The above code means, the tomcat applies when the application has a WEB-INF folder andthisisnot a main class bootstrapped application. Compile Phase: Compile phase would be the major/comprehensive work for a customized buildpack, while it is trying to build a file system on a lxc container. Take the example of our war application and tomcat example. In https://github.com/cloudfoundry/java-buildpack/blob/master/lib/java_buildpack/container/tomcat/tomcat_instance.rb def compile download(@version, @uri) { |file| expand file } link_to(@application.root.children, root) @droplet.additional_libraries << tomcat_datasource_jar if tomcat_datasource_jar.exist? @droplet.additional_libraries.link_to web_inf_lib end def expand(file) with_timing "Expanding Tomcat to #{@droplet.sandbox.relative_path_from(@droplet.root)}" do FileUtils.mkdir_p @droplet.sandbox shell "tar xzf #{file.path} -C #{@droplet.sandbox} --strip 1 --exclude webapps 2>&1" @droplet.copy_resources end The above code is all about preparing the tomcat and link the application files, so the application files will be available for the tomcat classpath. Before going to the code, we have to understand the working directory when the above code executes: . => working directory .app => @application, contains the extracted war archive .buildpack/tomcat => @droplet.sandbox .buildpack/jdk .buildpack/other needed components Inside compile method: download method will download tomcat binary file (specified here: https://github.com/cloudfoundry/java-buildpack/blob/master/config/tomcat.yml), and then extract the archive file to @droplet.sandbox directory. Then copy the resources folder's files to https://github.com/cloudfoundry/java-buildpack/tree/master/resources/tomcat/conf to @droplet.sandbox/conf Symlink the @droplet.sandbox/webapps/ROOT to .app/ Symlink additional libraries (comes from other component rather than application) to the WEB-INF/lib Note: All the symlinks use relative path, since when the container deployed to DEA, the absolute paths would be different. RELEASE PHASE: Release phase is to setup instructions of how to start tomcat. Look at the code in :https://github.com/cloudfoundry/java-buildpack/blob/master/lib/java_buildpack/container/tomcat.rb def command @droplet.java_opts.add_system_property 'http.port', '$PORT' [ @droplet.java_home.as_env_var, @droplet.java_opts.as_env_var, "$PWD/#{(@droplet.sandbox + 'bin/catalina.sh').relative_path_from(@droplet.root)}", 'run' ].flatten.compact.join(' ') end The above code does: Add java system properties http.port (referenced in tomcat server.xml) with environment properties ($PORT), this is the port on the DEA bridging to the lxc container already setup when the container was provisioned. instruction of how to run the tomcat Eg. "./bin/catalina.sh run"
May 9, 2014
by Shaozhen Ding
· 23,171 Views · 1 Like
article thumbnail
Cyclop: A Web Based Editor for Cassandra Query Language
Cyclop is a web-based tool for querying Cassandra databases with features like syntax highlighting and query completion.
May 9, 2014
by Comsysto Gmbh
· 10,463 Views
article thumbnail
Hibernate Debugging - Finding the origin of a Query
It's not always immediate why and in which part of the program is Hibernate generating a given SQL query, especially if we are dealing with code that we did not write ourselves. This post will go over how to configure Hibernate query logging, and use that together with other tricks to find out why and where in the program a given query is being executed. What does the Hibernate query log look like Hibernate has built-in query logging that looks like this: select /* load your.package.Employee */ this_.code, ... from employee this_ where this_.employee_id=? TRACE 12-04-2014@16:06:02 BasicBinder - binding parameter [1] as [NUMBER] - 1000 Why can't Hibernate log the actual query ? Notice that what is logged by Hibernate is the prepared statement sent by Hibernate to the JDBC driver plus it's parameters. The prepared statement has ? in the place of the query parameters, and the parameter values themselves are logged just bellow the prepared statement. This is not the same as the actual query sent to the database, as there is no way for Hibernate to log the actual query. The reason for this is that Hibernate only knows about the prepared statements and the parameters that it sends to the JDBC driver, and it's the driver that will build the actual queries and then send them to the database. In order to produce a log with the real queries, a tool like log4jdbc is needed, which will be the subject of another post. How to find out the origin of the query The logged query above contains a comment that allows to identify in most cases the origin of the query: if the query is due to a load by ID the comment is /* load your.entity.Name */, if it's a named query then the comment will contain the name of the query. If it's a one to many lazy initialization the comment will contain the name of the class and the property that triggered it, etc. In many cases the query comment created by is enough to identify the origin of the query. Setting up the Hibernate query log In order to obtain a query log, the following flags need to be set in the configuration of the session factory: ... true true true The example above is for Spring configuration of an entity manager factory. This is the meaning of the flags: show_sql enables query logging format_sql pretty prints the SQL use_sql_comments adds an explanatory comment In order to log the query parameters, the following log4j or equivalent configuration is needed: If everything else fails If the query comment added by the option use_sql_comments is not sufficient, then we can start by identifying the entity returned by the query based on the table names involved, and put a breakpoint in the constructor of the returned entity. If the entity does not have a constructor, then we can create one and put the breakpoint in the call to super(): @Entity public class Employee { public Employee() { super(); // put the breakpoint here } ... } When the breakpoint is hit, go to the IDE debug view containing the stack call of the program and go through it from top to bottom. The place where the query was made in the program will be there in the call stack.
May 8, 2014
by Vasco Cavalheiro
· 43,933 Views · 3 Likes
article thumbnail
@OneToOne With Shared Primary Key, Revisited
Take another look at @OneToOne with a shared primary key.
May 6, 2014
by Michal Jastak
· 32,311 Views · 3 Likes
article thumbnail
How to Identify and Cure MySQL Replication Slave Lag
this post was originally written by muhammad irfan here on the percona mysql support team, we often see issues where a customer is complaining about replication delays – and many times the problem ends up being tied to mysql replication slave lag. this of course is nothing new for mysql users and we’ve had a few posts here on the mysql performance blog on this topic over the years (two particularly popular post in the past were: “ reasons for mysql replication lag ” and “ managing slave lag with mysql replication ,” both by percona ceo peter zaitsev) . in today’s post, however, i will share some new ways of identifying delays in replication – including possible causes of lagging slaves – and how to cure this problem. how to identify replication delay mysql replication works with two threads, io_thread & sql_thread. io_thread connects to a master, reads binary log events from the master as they come in and just copies them over to a local log file called relaylog . on the other hand, sql_thread reads events from a relay log stored locally on the replication slave (the file that was written by io thread) and then applies them as fast as possible. whenever replication delays, it’s important to discover first whether it’s delaying on slave io_thread or slave sql_thread. normally, i/o thread would not cause a huge replication delay as it is just reading the binary logs from the master. however, it depends on the network connectivity, network latency… how fast is that between the servers. the slave i/o thread could be slow because of high bandwidth usage. usually, when the slave io_thread is able to read binary logs quickly enough it copies and piles up the relay logs on the slave – which is one indication that the slave io_thread is not the culprit of slave lag. on the other hand, when the slave sql_thread is the source of replication delays it is probably because of queries coming from the replication stream are taking too long to execute on the slave. this is sometimes because of different hardware between master/slave, different schema indexes, workload. moreover, the slave oltp workload sometimes causes replication delays because of locking. for instance, if a long-running read against a myisam table blocks the sql thread, or any transaction against an innodb table creates an ix lock and blocks ddl in the sql thread. also, take into account that slave is single threaded prior to mysql 5.6, which would be another reason for delays on the slave sql_thread. let me show you via master status/slave status example to identify either slave is lagging on slave io_thread or slave sql_thread. mysql-master> show master status; +------------------+--------------+------------------+------------------------------------------------------------------+ | file | position | binlog_do_db | binlog_ignore_db | executed_gtid_set | +------------------+--------------+------------------+------------------------------------------------------------------+ | mysql-bin.018196 | 15818564 | | bb11b389-d2a7-11e3-b82b-5cf3fcfc8f58:1-2331947 | +------------------+--------------+------------------+------------------------------------------------------------------+ mysql-slave> show slave status\g *************************** 1. row *************************** slave_io_state: queueing master event to the relay log master_host: master.example.com master_user: repl master_port: 3306 connect_retry: 60 master_log_file: mysql-bin.018192 read_master_log_pos: 10050480 relay_log_file: mysql-relay-bin.001796 relay_log_pos: 157090 relay_master_log_file: mysql-bin.018192 slave_io_running: yes slave_sql_running: yes replicate_do_db: replicate_ignore_db: replicate_do_table: replicate_ignore_table: replicate_wild_do_table: replicate_wild_ignore_table: last_errno: 0 last_error: skip_counter: 0 exec_master_log_pos: 5395871 relay_log_space: 10056139 until_condition: none until_log_file: until_log_pos: 0 master_ssl_allowed: no master_ssl_ca_file: master_ssl_ca_path: master_ssl_cert: master_ssl_cipher: master_ssl_key: seconds_behind_master: 230775 master_ssl_verify_server_cert: no last_io_errno: 0 last_io_error: last_sql_errno: 0 last_sql_error: replicate_ignore_server_ids: master_server_id: 2 master_uuid: bb11b389-d2a7-11e3-b82b-5cf3fcfc8f58:2-973166 master_info_file: /var/lib/mysql/i1/data/master.info sql_delay: 0 sql_remaining_delay: null slave_sql_running_state: reading event from the relay log master_retry_count: 86400 master_bind: last_io_error_timestamp: last_sql_error_timestamp: master_ssl_crl: master_ssl_crlpath: retrieved_gtid_set: bb11b389-d2a7-11e3-b82b-5cf3fcfc8f58:2-973166 executed_gtid_set: bb11b389-d2a7-11e3-b82b-5cf3fcfc8f58:2-973166, ea75c885-c2c5-11e3-b8ee-5cf3fcfc9640:1-1370 auto_position: 1 this clearly suggests that the slave io_thread is lagging and obviously because of that the slave sql_thread is lagging, too, and it yields replication delays. as you can see the master log file is mysql-bin.018196 (file parameter from master status) and slave io_thread is on mysql-bin.018192 ( master_log_file from slave status ) which indicates slave io_thread is reading from that file, while on master it’s writing on mysql-bin.018196 , so the slave io_thread is behind by 4 binlogs. meanwhile, the slave sql_thread is reading from same file i.e. mysql-bin.01819 2 (relay_master_log_file from slave status) this indicates that the slave sql_thread is applying events fast enough, but it’s lagging too, which can be observed from the difference between read_master_log_pos & exec_master_log_pos from show slave status output. you can calculate slave sql_thread lag from read_master_log_pos – exec_master_log_pos in general as long as master_log_file parameter output from show slave status and relay_master_log_file parameter from show slave status output are the same. this will give you rough idea how fast slave sql_thread is applying events. as i mentioned above, the slave io_thread is lagging as in this example then off course slave sql_thread is behind too. you can read detailed description of show slave status output fields here. also, the seconds_behind_master parameter shows a huge delay in seconds. however, this can be misleading, because it only measures the difference between the timestamps of the relay log most recently executed, versus the relay log entry most recently downloaded by the io_thread. if there are more binlogs on the master, the slave doesn’t figure them into the calculation of seconds_behind_master. you can get a more accurate measure of slave lag using pt-heartbeat from percona toolkit. so, we learned how to check replication delays – either it’s slave io_thread or slave sql_thread. now, let me provide some tips and suggestions for what exactly causing this delay. tips and suggestions what causing replication delay & possible fixes usually, the slave io_thread is behind because of slow network between master/slave. most of the time, enabling slave_compressed_protocol helps to mitigate slave io_thread lag. one other suggestion is to disable binary logging on slave as it’s io intensive too unless you required it for point in time recovery. to minimize slave sql_thread lag, focus on query optimization. my recommendation is to enable the configuration option log_slow_slave_statements so that the queries executed by slave that take more than long_query_time will be logged to the slow log. to gather more information about query performance, i would also recommend setting the configuration option log_slow_verbosity to “full”. this way we can see if there are queries executed by slave sql_thread that are taking long time to complete. you can follow my previous post about how to enable slow query log for specific time period with mentioned options here . and as a reminder, log_slow_slave_statements as variable were first introduced in percona server 5.1 which is now part of vanilla mysql from version 5.6.11 in upstream version of mysql server log_slow_slave_statements were introduced as command line option. details can be found here while log_slow_verbosity is percona server specific feature. one another reason for delay on slave sql_thread if you use row based binlog format is that if your any database table missing primary key or unique key then it will scan all rows of the table for dml on slave and causes replication delays so make sure all your tables should have primary key or unique key. check this bug report for details http://bugs.mysql.com/bug.php?id=53375 you can use below query on slave to identify which of database tables missing primary or unique key. mysql> select t.table_schema,t.table_name,engine from information_schema.tables t inner join information_schema .columns c on t.table_schema=c.table_schema and t.table_name=c.table_name group by t.table_schema,t.table_name having sum(if(column_key in ('pri','uni'), 1,0)) =0; one improvement is made for this case in mysql 5.6, where in memory hash is used slave_rows_search_algorithms comes to the rescue. note that seconds_behind_master is not updated while we read huge rbr event, so, “lagging” may be related to just that – we had not completed reading of the event. for example, in row based replication huge transactions may cause delay on slave side e.g. if you have 10 million rows table and you do “delete from table where id < 5000000″ 5m rows will be sent to slave, each row separately which will be painfully slow. so, if you have to delete oldest rows time to time from huge table using partitioning might be good alternative for this for some kind of workloads where instead using delete use drop old partition may be good and only statement is replicated because it will be ddl operation. to explain it better, let suppose you have partition1 holding rows of id’s from 1 to 1000000 , partition2 – id’s from 1000001 to 2000000 and so on so instead of deleting via statement “delete from table where id<=1000000;” you can do “alter table drop partition1;” instead. for alter partitioning operations check manual – check this wonderful post too from my colleague roman explaining possible grounds for replication delays here pt-stalk is one of finest tool from percona toolkit which collects diagnostics data when problems occur. you can setup pt-stalk as follows so whenever there is a slave lag it can log diagnostic information which we can be later analyze to check to see what exactly causing the lag. here is how you can setup pt-stalk so that it captures diagnostic data when there is slave lag: ------- pt-plug.sh contents #!/bin/bash trg_plugin() { mysqladmin $ext_argv ping &> /dev/null mysqld_alive=$? if [[ $mysqld_alive == 0 ]] then seconds_behind_master=$(mysql $ext_argv -e "show slave status" --vertical | grep seconds_behind_master | awk '{print $2}') echo $seconds_behind_master else echo 1 fi } # uncomment below to test that trg_plugin function works as expected #trg_plugin ------- -- that's the pt-plug.sh file you would need to create and then use it as below with pt-stalk: $ /usr/bin/pt-stalk --function=/root/pt-plug.sh --variable=seconds_behind_master --threshold=300 --cycles=60 [email protected] --log=/root/pt-stalk.log --pid=/root/pt-stalk.pid --daemonize you can adjust the threshold, currently its 300 seconds, combining that with –cycles, it means that if seconds_behind_master value is >= 300 for 60 seconds or more then pt-stalk will start capturing data. adding –notify-by-email option will notify via email when pt-stalk captures data. you can adjust the pt-stalk thresholds accordingly so that’s how it triggers to collect diagnostic data during problem. conclusion a lagging slave is a tricky problem but a common issue in mysql replication. i’ve tried to cover most aspects of replication delays in this post. please share in the comments section if you know of any other reasons for replication delay.
May 6, 2014
by Peter Zaitsev
· 24,094 Views
article thumbnail
Pitfalls of the Hibernate Second-Level / Query Caches
This post will go through how to setup the Hibernate Second-Level and Query caches, how they work and what are their most common pitfalls.
May 6, 2014
by Vasco Cavalheiro
· 78,248 Views · 9 Likes
article thumbnail
Open Session In View Design Tradeoffs
The Open Session in View (OSIV) pattern gives rise to different opinions in the Java development community. Let's go over OSIV and some of the pros and cons of this pattern. The problem The problem that OSIV solves is a mismatch between the Hibernate concept of session and it's lifecycle and the way that many server-side view technologies work. In a typical Java frontend application the service layer starts by querying some of the data needed to build the view. The remaining data needed can be lazy-loaded later, with the condition that the Hibernate session remains open - and there lies the problem. Between the moment that the service layer method finishes it's execution and the moment that the view is rendered, Hibernate has already committed the transaction and closed the session. When the view tries to lazy load the extra data that it needs, if finds the Hibernate session closed, causing a LazyInitializationException. The OSIV solution OSIV tackles this problem by ensuring that the Hibernate session is kept open all the way up to the rendering of the view - hence the name of the pattern. Because the session is kept open, no more LazyInitializationExceptions occur. The session or entity manager is kept open by means of a filter that is added to the request processing chain. In the case of JPA the OpenEntityManagerInViewFilter will create an entity manager at the beginning of the request, and then bind it to the request thread. The service layer will then be executed and the business transaction committed or rolled back, but the transaction manager will not remove the entity manager from the thread after the commit. When the view rendering starts, the transaction manager will then check if there is already an entity manager binded to the thread, and if so use it instead of creating a new one. After the request is processed, the filter will then unbind the entity manager from the thread. The end result is that the same entity manager used to commit the business transaction was kept around in the request thread, allowing the view rendering code to lazy load the needed data. Going back to the original problem Let's step back a moment and go back to the initial problem: the LazyInitializationException. Is this exception really a problem? This exception can also be seen as a warning sign of a wrongly written query in the service layer. When building a view and it's backing services, the developer knows upfront what data is needed, and can make sure that the needed data is loaded before the rendering starts. Several relation types such as one-to-many use lazy-loading by default, but that default setting can be overridden if needed at query time using the following syntax: select p FROM Person p left join fetch p.invoices This means that the lazy loading can be turned off on a case by case basis depending on the data needed by the view. OSIV in projects I've worked In projects I have worked that used OSIV, we could see via query logging that the database was getting hit with a high number of SQL queries, sometimes to the point that developers had to turn off the Hibernate SQL logging. The performance of these application was impacted, but it was kept manageable using second-level caches, and due to the fact that these where intranet-based applications with a limited number of users. Pros of OSIV The main advantage of OSIV is that it makes working with ORM and the database more transparent: Less queries need to be manually written Less awareness is required about the Hibernate session and how to solve LazyInitializationExceptions. Cons of OSIV OSIV seems to be easy to misuse and can accidentally introduce N+1 performance problems in the application. On projects I've worked OSIV did not work out well in the long-term. The alternative of writing custom queries that eager fetch data depending on the use case is manageable and turned out well in other projects I've worked. Alternatives to OSIV Besides the application-level solution of writing custom queries to pre-fetch the needed data, there are other framework-level aproaches to OSIV. The Seam Framework was built by some of the same developers as Hibernate , and solves the problem by introducing the notion of conversation. Can you let me know in the comments bellow your thoughts and experiences with OSIV, thanks for reading.
April 30, 2014
by Vasco Cavalheiro
· 19,106 Views · 3 Likes
  • Previous
  • ...
  • 497
  • 498
  • 499
  • 500
  • 501
  • 502
  • 503
  • 504
  • 505
  • 506
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×