Data Engineering Resources

The Latest Data Engineering Topics

When you should and should NOT use ENUM data type

ENUM is a new enumerated data type introduced in CUBRID 9.0. As in most programming languages, the ENUM type is a data type composed of a set of static, ordered values. Users can define numeric and string values for ENUM columns.

Working with ENUM types

Creating an ENUM column is done by specifying a static list of possible values:

    CREATE TABLE person(
        name VARCHAR(255),
        gender ENUM('Male', 'Female')
    );

CUBRID understands the ENUM type as an ordered set of constants which, in the above example, is the set {NULL: NULL, 1: 'Male', 2: 'Female'}. To assign a value to the gender column, users may use either the index of the value (NULL, 1, 2) or the actual constant literal (NULL, 'Male', 'Female'). CUBRID restricts the values that can be assigned to this column to the values from this set plus NULL. Moreover, an ENUM column is case-sensitive, i.e. it will raise an error if you try to enter 'female' in lower case. An empty string is allowed only if it is defined as one of the elements of the ENUM column; in our example it is not.

    csql> INSERT INTO person(name, gender) VALUES('Eugene', 'Male');
    1 row affected.
    1 command(s) successfully processed.

    csql> INSERT INTO person(name, gender) VALUES('Anne', 2);
    1 row affected.
    1 command(s) successfully processed.

    csql> SELECT * FROM person;
      name                  gender
    ============================================
      'Anne'                'Female'
      'Eugene'              'Male'
    2 rows selected.

Any attempt to insert a value outside of the defined set will result in a coercion error. In the cases below, a literal that is not in the set, an out-of-range index, and an empty string all raise an error because none of them is in the set of allowed values defined in the person table.

    csql> INSERT INTO person(name, gender) VALUES('John', 'N/A');
    IN line 1, COLUMN 44,
    ERROR: before ' ); '
    Cannot coerce 'N/A' TO type enum.
    0 command(s) successfully processed.

    csql> INSERT INTO person(name, gender) VALUES('John', 4);
    IN line 1, COLUMN 45,
    ERROR: before ' ); '
    Cannot coerce 4 TO type enum.
    0 command(s) successfully processed.

    csql> INSERT INTO person(name, gender) VALUES('John', '');
    IN line 1, COLUMN 44,
    ERROR: before ' ); '
    Cannot coerce '' TO type enum.
    0 command(s) successfully processed.

Why you should use the ENUM type

There are three important reasons for which you should consider using the ENUM type:

- Reduce storage space.
- Reduce join complexity.
- Create cheap value constraints.

Storage Space

CUBRID uses only 1 byte per tuple when 255 or fewer ENUM elements are defined, or 2 bytes for 256 to 65535 elements. This is because, rather than storing the constant literal of the value, CUBRID stores the index of that value in the ordered set. For very large tables, this can be a significant saving in storage space.

Take, for example, a table with 1,000,000,000 records which has an ENUM column defined as ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'). If you use a VARCHAR type instead of the ENUM type to store these values, the column would require anywhere between 6GB and 9GB of storage space, since the day names are 6 to 9 characters long. Using the ENUM type, which needs only 1 byte per tuple for a seven-element set, you can reduce the required space to a total of 1GB.

Reduce join complexity

JOIN way

The same effect of the ENUM type can be achieved by creating a one-to-many relationship on two or more tables.
Considering the example above, you can store values for days of the week like this:

    CREATE TABLE days_of_week(
        id SHORT PRIMARY KEY,
        name VARCHAR(9)
    );

    CREATE TABLE opening_hours(
        week_day SHORT,
        opening_time TIME,
        closing_time TIME,
        FOREIGN KEY fk_dow (week_day) REFERENCES days_of_week(id)
    );

Then, when you wish to display the name of the week day, you would execute a query like:

    SELECT d.name day_name, o.opening_time, o.closing_time
    FROM days_of_week d, opening_hours o
    WHERE d.id = o.week_day
    ORDER BY d.id;

      day_name      opening_time  closing_time
    ==================================================
      'Monday'      09:00:00 AM   06:00:00 PM
      'Tuesday'     09:00:00 AM   06:00:00 PM
      'Wednesday'   09:00:00 AM   06:00:00 PM
      ...

ENUM way

You can achieve the same effect using an ENUM column:

    CREATE TABLE opening_hours(
        week_day ENUM ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'),
        opening_time TIME,
        closing_time TIME
    );

And there's no JOIN required to select opening hours:

    SELECT week_day, opening_time, closing_time
    FROM opening_hours
    ORDER BY week_day;

      week_day      opening_time  closing_time
    ==================================================
      'Monday'      09:00:00 AM   06:00:00 PM
      'Tuesday'     09:00:00 AM   06:00:00 PM
      'Wednesday'   09:00:00 AM   06:00:00 PM
      ...

This can prove to be very useful, especially if your queries join several tables.

Value constraints

ENUM columns behave like foreign key relationships in the sense that values from an ENUM column are restricted to the values specified in the column definition. For a short list of values, this is more efficient than creating foreign key relationships. While foreign key relationships use index scans to enforce the restriction, ENUM columns just have to go through a list of predefined values, which is faster even for small indexes.

Why/When you should NOT use the ENUM type

Even though ENUM is a great feature, there are cases when you'd better not use it.
For example:

- When the ENUM type is not fixed
- When the ENUM type has a long list of values
- When your application does not know the list of ENUM values
- When the ENUM type needs to be reused
- When portability is a concern

When ENUM type is not fixed

If you're not sure whether the ENUM type holds all possible values for that column, you should consider using a one-to-many relationship instead. The only way in which an ENUM column can be changed to handle more values is by using an ALTER statement. This is a very expensive operation in any RDBMS and requires administrator rights. Also, ALTER statements are maintenance operations and should, as much as possible, be performed offline.

When ENUM type has a long list of values

ENUM types should not be used if you cannot limit the set of possible values to a few elements.

When your application does not know the list of ENUM values

There are only two ways of getting the list of values you have defined for an ENUM type:

- parsing the output of the SHOW CREATE TABLE statement:

    csql> SHOW CREATE TABLE opening_hours;
      TABLE                 CREATE TABLE
    ============================================
      'opening_hours'       'CREATE TABLE [opening_hours] ([week_day] ENUM('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'), [opening_time] TIME, [closing_time] TIME)'

- selecting information from CUBRID system tables:

    csql> SELECT d.enumeration
          FROM _db_domain d, _db_attribute a
          WHERE a.attr_name = 'week_day'
            AND a.class_of.class_name = 'opening_hours'
            AND d IN a.domains;
      enumeration
    ======================
      {'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'}

Both might require complex coding, and selecting from system tables requires administrator privileges.

ENUM type is not reusable

If you have several tables which require the names of week days, you will have to define the ENUM type in each of them. If you instead create a table to hold the week day names, you can join this table with any other table that requires this information.
Portability is a concern

The ENUM type is only supported by a few RDBMSs, and each one has its own idea as to how the ENUM type is supposed to work. Below is a list of a few notable differences between CUBRID, MySQL and PostgreSQL:

                                                    CUBRID              PostgreSQL                MySQL
    Inserting out of range value                    Throws error        Throws error              Inserts special value index 0
    Comparing to char literals                      Compare as strings  Compare as ENUM elements  Compare as strings
    Comparing to values outside of the ENUM domain  Compare as strings  Throws error              Compare as strings

These subtle differences will most probably break your application in interesting and hard to understand ways. If you're migrating from PostgreSQL to CUBRID, for example, and you expect comparisons with char literals to be performed as ENUM comparisons, you'll have a hard time understanding why your query returns weird results.
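The assignment rules described in this article (a 1-based index or an exact, case-sensitive literal, with NULL always allowed) and the index width rule from the Storage Space section can be modeled in a few lines of Python. This is only an illustrative sketch, not CUBRID code: the class name `EnumColumn`, the function `enum_index_width`, and the use of `ValueError` for coercion errors are all assumptions made for this example.

```python
class EnumColumn:
    """Toy model of an ENUM column: ordered constants with 1-based indexes."""

    def __init__(self, *elements):
        self.elements = list(elements)

    def coerce(self, value):
        """Return the index that would be stored, or None for NULL."""
        if value is None:                       # NULL is always allowed
            return None
        if isinstance(value, int):              # assignment by index
            if 1 <= value <= len(self.elements):
                return value
        elif value in self.elements:            # assignment by literal (case-sensitive)
            return self.elements.index(value) + 1
        raise ValueError(f"Cannot coerce {value!r} to type enum")

    def literal(self, index):
        """Map a stored index back to its constant literal."""
        return None if index is None else self.elements[index - 1]


def enum_index_width(element_count):
    """Bytes per tuple: 1 for up to 255 elements, 2 for 256 to 65535."""
    return 1 if element_count <= 255 else 2


gender = EnumColumn('Male', 'Female')
stored = [gender.coerce('Male'), gender.coerce(2), gender.coerce(None)]
print(stored)                                   # indexes, not strings, are stored
print([gender.literal(i) for i in stored])
print(enum_index_width(7), enum_index_width(1000))
```

Calling `gender.coerce('female')`, `gender.coerce('N/A')`, `gender.coerce(4)` or `gender.coerce('')` raises `ValueError` in this model, mirroring the coercion errors shown in the csql session.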

December 19, 2012

by Esen Sagynov


Simulate Network Latency, Packet Loss, and Low Bandwidth on Mac OSX

Sometimes while testing you may want to be able to simulate network latency, packet loss, or low bandwidth. I have done this with Linux and tc/netem as well as with Shunra on Windows, but I had never done it on Mac OSX. It turns out that Mac OSX includes 'dummynet' from FreeBSD, which has the capability to do this WAN simulation.

Here is a quick example:

- Inject 250ms latency and 10% packet loss on connections between my workstation and my development web server (10.0.0.1)
- Simulate maximum bandwidth of 1Mbps

    # Create 2 pipes and assign traffic to and from our webserver to each:
    $ sudo ipfw add pipe 1 ip from any to 10.0.0.1
    $ sudo ipfw add pipe 2 ip from 10.0.0.1 to any

    # Configure the pipes we just created:
    $ sudo ipfw pipe 1 config delay 250ms bw 1Mbit/s plr 0.1
    $ sudo ipfw pipe 2 config delay 250ms bw 1Mbit/s plr 0.1

A quick test:

    $ ping 10.0.0.1
    PING 10.0.0.1 (10.0.0.1): 56 data bytes
    64 bytes from 10.0.0.1: icmp_seq=0 ttl=63 time=515.939 ms
    64 bytes from 10.0.0.1: icmp_seq=1 ttl=63 time=519.864 ms
    64 bytes from 10.0.0.1: icmp_seq=2 ttl=63 time=521.785 ms
    Request timeout for icmp_seq 3
    64 bytes from 10.0.0.1: icmp_seq=4 ttl=63 time=524.461 ms

Disable:

    $ sudo ipfw list | grep pipe
    01900 pipe 1 ip from any to 10.13.1.133 out
    02000 pipe 2 ip from 10.13.1.133 to any in
    $ sudo ipfw delete 01900
    $ sudo ipfw delete 02000

    # or, flush all ipfw rules, not just our pipes
    $ sudo ipfw -q flush

Notice that the round-trip time on the ping is ~500ms. That is because we applied a 250ms latency to both pipes, one for outgoing and one for incoming traffic. Our example was very simple, but you can get quite complex, since "pipes" are applied to traffic using standard ipfw firewall rules. For example, you could specify different latency based on port, host, network, etc. Packet loss is configured with the "plr" option. Valid values range from 0 to 1. In our example above we used 0.1, which equals 10% packet loss.
This is a very handy way for developers on Macs to test their applications in a variety of network environments. And you get it for FREE. On Windows you need to buy a commercial tool to achieve this (at least that was true the last time I looked, in 2008).
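The ping output can be sanity-checked with a little arithmetic. The sketch below, written in Python rather than shell, takes the delay and plr values of the two pipes and derives the added round-trip time and the chance that a single request/reply exchange is lost. The function name `round_trip` is made up for this example, and the model assumes the two pipes drop packets independently and ignores server processing time and bandwidth queuing.

```python
def round_trip(delay_out_ms, delay_in_ms, plr_out, plr_in):
    """Return (added round-trip delay in ms, probability the exchange is lost)."""
    added_rtt = delay_out_ms + delay_in_ms
    # The exchange succeeds only if the packet survives both pipes.
    loss = 1 - (1 - plr_out) * (1 - plr_in)
    return added_rtt, loss


# Values taken from the two "ipfw pipe ... config" commands in the article.
rtt, loss = round_trip(250, 250, 0.1, 0.1)
print(f"added RTT: {rtt} ms, round-trip loss: {loss:.0%}")
```

Under these assumptions, a 10% loss rate on each pipe means roughly one ping in five goes unanswered, because a request can be dropped on the way out or its reply on the way back.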

December 15, 2012

by Joe Miller


All about Two-Phase Locking and a little bit MVCC

In this post I will describe the concurrency control methods implemented in database management systems, and the differences between them. I will also explain which locking technique is used in CUBRID RDBMS, the lock modes and their compatibility, and finally deadlocks and how they are resolved.

Overview

When multiple transactions that change data are executed simultaneously, the order of processing these transactions must be controlled to satisfy the ACID (Atomicity, Consistency, Isolation, Durability) properties of the database. Executing multiple transactions simultaneously should lead to the same result as executing each transaction independently; in other words, one transaction should not be affected by another transaction. If each transaction changes different data, the transactions do not interfere with each other, so there is no issue. However, if the same data is simultaneously changed by multiple transactions, the order of processing each transaction should be controlled.

Types of Concurrency Control

For example, suppose the T1 transaction changes the A record from 1 to 2 and then changes the B record, while the T2 transaction simultaneously changes the A record, too. Let's assume that the T2 transaction changes the A record from 2 to 4 by adding 2. If both transactions terminate successfully, there is no issue. But it is important that any transaction can be rolled back. If the T1 transaction is rolled back, the value of the A record should be returned to 1, i.e. the value before the T1 transaction was executed. This is required to satisfy the ACID properties of the database. However, the T2 transaction has already changed the A record value to 4, so it is impossible to simply return the A record to 1. In this case, there are two options.
Two-phase locking (2PL)

The first option is for the T2 transaction, when it tries to change the A record, to recognize that the T1 transaction has already changed it and to wait until the T1 transaction is completed, because T2 cannot know whether T1 will be committed or rolled back. This method is called Two-phase locking (2PL).

Multi-version concurrency control (MVCC)

The other option is to allow each of the T1 and T2 transactions to have its own changed version. Even when the T1 transaction has changed the A record from 1 to 2, it leaves the original value 1 as it is and records that the T1 version of the A record is 2. Then, the following T2 transaction changes the A record from 1 to 3, not from 2 to 4, and records that the T2 version of the A record is 3. When the T1 transaction is rolled back, its version, 2, is simply never applied to the A record. After that, if the T2 transaction is committed, its version, 3, will be applied to the A record. If the T1 transaction is committed prior to the T2 transaction, the A record is changed to 2, and then to 3 at the time of committing the T2 transaction. The final database status is identical to the status of executing each transaction independently, without any impact on other transactions. Therefore, it satisfies the ACID properties. This method is called Multi-version concurrency control (MVCC).

CUBRID, like DB2 and SQL Server, implements 2PL, while Oracle, InnoDB and PostgreSQL implement MVCC.

Two-phase locking in CUBRID

The 2PL adopted by CUBRID uses locks to ensure consistency between transactions that change the same data.
As the name "two-phase" implies, locking is carried out in two phases:

- expanding phase (acquiring)
- shrinking phase (releasing)

More precisely, every transaction must acquire a lock on the data it is going to access, and the acquired locks are released only when the transaction terminates. After a transaction has acquired a lock on certain data (regardless of the lock type: S_LOCK, a Shared Lock for reads, or X_LOCK, an Exclusive Lock for writes), when another transaction tries to acquire a new lock on the same data, the new lock is granted or left pending depending on the lock compatibility rules. Therefore, success or failure of the earlier transaction has no impact on the following transactions, so data consistency is maintained.

Lock Manager in CUBRID

Thus, the key point of 2PL, as adopted by CUBRID, is that locks must be processed in two phases: the expanding phase and the shrinking phase. As Figure 1 shows, all locks acquired while executing a transaction are released only after the transaction ends (commit or rollback).

Figure 1: Two-Phase Locking.

The 2PL concurrency control method naturally controls access to the same data from multiple transactions by making all transactions observe the 2PL protocol. Figure 2 below shows an example of three transactions using 2PL: Transaction 1 executes the B=B+A operation, Transaction 2 executes the C=A+B operation, and Transaction 3 executes the Print C operation. Since all three transactions access the data A, B and C, concurrency control is required. In this case, each transaction is executed according to the 2PL protocol so that there is no data conflict.

Figure 2: Concurrency Control by using 2PL.

Lock modes

To understand the concurrency control of multiple transactions more deeply, let's discuss lock modes, lock conversion and transaction isolation levels.
In the above figure, you can see that the S-lock (Shared Lock) on A was first acquired by Transaction 1, and then also acquired by Transaction 2. In contrast, a transaction which requests an X-lock is blocked until the S-lock is released. In this way, a variety of lock modes are used to minimize conflicts between lock requesters. The major types of locks used in DBMSs are:

- Shared (S) Lock: Used for read operations. It is generally set on the target record when a SELECT statement is executed. It blocks a transaction from changing data which has already been read by other transactions.
- Exclusive (X) Lock: Used for write operations such as INSERT, UPDATE and DELETE. It blocks the same data from being changed by multiple transactions.
- Update (U) Lock: Used to declare that the target resource will be changed. It is used to minimize deadlocks which may occur when multiple transactions are executing both reads and writes.
- Intent Shared (IS) Lock: Set on the upper resource (e.g. a table) in order to set an S-lock on some lower resources (e.g. records or pages). It prevents other transactions from setting an X-lock on the upper resource. Intent locks are described in more detail below.
- Intent Exclusive (IX) Lock: Set on the upper resource in order to set an X-lock on some lower resources.
- Shared with Intent Exclusive (SIX) Lock: Set on the upper resource in order to set S-locks and X-locks on some lower resources.

Lock mode compatibility

Among the lock modes above, intent locks are used to improve transaction concurrency and to prevent deadlocks between upper resources and lower resources. For example, when Transaction A tries to read Record R in Table T, it sets IS_LOCK on Table T before setting S_LOCK on Record R. Transaction B is then prevented from setting X_LOCK on Table T to change the structure of Table T. If Transaction A had not set IS_LOCK on Table T, Transaction B could change the structure of Table T, and Transaction A would then perform a wrong read operation.
This way, Transaction B does not need to check every record in Table T for locks set by other transactions before setting X_LOCK on Table T. The following lock mode compatibility table clearly shows the effect of intent locks:

Table 1: The lock mode compatibility table of CUBRID.
(Rows: newly requested lock mode. Columns: current lock mode.)

             NULL   IS     NS     S      IX     SIX    U      NX     X
    NULL     True   True   True   True   True   True   True   True   True
    IS       True   True   N/A    True   True   True   N/A    N/A    False
    NS       True   N/A    True   True   N/A    N/A    False  True   False
    S        True   True   True   True   False  False  False  False  False
    IX       True   True   N/A    False  True   False  N/A    N/A    False
    SIX      True   True   N/A    False  False  False  N/A    N/A    False
    U        True   N/A    True   True   N/A    N/A    False  False  False
    NX       True   N/A    True   False  N/A    N/A    False  False  False
    X        True   False  False  False  False  False  False  False  False

From the lock mode compatibility table, you can see that X_LOCK cannot be set on a table if IS_LOCK is set on the table. Also, only IS_LOCK is compatible with SIX_LOCK. This is because SIX_LOCK signals the intent to set S_LOCK and X_LOCK on records, so it allows no other lock except IS_LOCK, which is used to set S_LOCK on other, non-conflicting records.

From the table, you can also see that IX_LOCK is compatible with IX_LOCK. IX_LOCK signals the intent to set X_LOCK on some records, so two of them can coexist on the same table. Even if two transactions try to change the same record, both are granted IX_LOCK on the table. This causes no problem for concurrency control, since only the transaction that first acquires X_LOCK on the record can change it (X_LOCK and X_LOCK are not compatible).

The lock mode compatibility table is expressed as the global variable lock_Comp[][] in the lock_table.c file in the CUBRID source code. Most of the code related to lock modes is implemented in the lock_manager.c file.
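The compatibility table lends itself to a simple lookup structure. The following Python sketch encodes Table 1 as a dictionary and answers whether a newly requested mode can be granted against the modes currently held on a resource. It is not CUBRID's lock_Comp[][] array: the names `COMPAT` and `can_grant` are invented here, and N/A cells are stored as None and treated as not grantable, which is an assumption of this sketch.

```python
MODES = ["NULL", "IS", "NS", "S", "IX", "SIX", "U", "NX", "X"]
T, F, NA = True, False, None

# Rows: newly requested mode. Columns: current mode, in MODES order.
ROWS = {
    "NULL": [T, T,  T,  T, T,  T,  T,  T,  T],
    "IS":   [T, T,  NA, T, T,  T,  NA, NA, F],
    "NS":   [T, NA, T,  T, NA, NA, F,  T,  F],
    "S":    [T, T,  T,  T, F,  F,  F,  F,  F],
    "IX":   [T, T,  NA, F, T,  F,  NA, NA, F],
    "SIX":  [T, T,  NA, F, F,  F,  NA, NA, F],
    "U":    [T, NA, T,  T, NA, NA, F,  F,  F],
    "NX":   [T, NA, T,  F, NA, NA, F,  F,  F],
    "X":    [T, F,  F,  F, F,  F,  F,  F,  F],
}
COMPAT = {req: dict(zip(MODES, row)) for req, row in ROWS.items()}


def can_grant(requested, held_modes):
    """True only if the requested mode is compatible with every held mode."""
    return all(COMPAT[requested][held] is True for held in held_modes)


print(can_grant("X", ["IS"]))        # table X_LOCK while a reader holds IS_LOCK
print(can_grant("IX", ["IX"]))       # two writers on different records
print(can_grant("IS", ["SIX"]))      # the only mode SIX_LOCK tolerates
print(can_grant("S", ["S", "S"]))    # several readers share a record
```

A request is granted only when it is compatible with every lock already held, which is why a single conflicting holder is enough to make the requester wait.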
To set a lock on a data object, the lock_object() function is used. It receives three parameters: the OID of the object on which the lock will be set, the OID of the class to which the object belongs, and the desired lock mode. In the source code of the function, you can see that it follows different paths depending on whether the lock target is an instance object or a class object. Take note of this: in CUBRID, a class object is also an object. Keep in mind that a class object has an OID and that all class objects are instances of a root class, so a class object uses ROOTOID, the OID of the root object, as its class OID. From the code, you can see that the required intent lock is set on the class object when a lock is requested on an instance object.

A lock request also carries the concept of a lock waiting time. To retrieve the lock timeout value set on the current transaction, the logtb_find_wait_secs() function is called. CUBRID supports the SET TRANSACTION LOCK TIMEOUT SQL command and the setLockTimeout() method in JDBC, both of which specify the lock timeout of the current transaction. The lock waiting time is the time a requesting transaction waits when another transaction already holds a lock on the object and the requested lock is not compatible with the lock already set. As you have seen before, the 2PL concurrency control method does not grant an incompatible lock to other transactions until the existing lock is released. A transaction should set a lock timeout for the following two reasons:

- The user does not want to wait too long for a lock.
- It lowers the frequency of deadlocks.

Deadlocks

A deadlock occurs when two or more transactions each request resources locked by the others, so that none of the transactions can make progress. Figure 3 below shows an example of a deadlock.

Figure 3: Transaction Deadlock.
First, Transaction 1 executes the statement UPDATE participant SET gold=10 WHERE host_year=2004 AND nation_code='KOR' and sets X_LOCK on the 'KOR' record. Transaction 2 sets X_LOCK on the 'JPN' record. Transaction 3 sets X_LOCK on the 'CHN' record. After that, Transaction 1 requests X_LOCK on the 'JPN' record in order to execute an UPDATE on that record. However, the 'JPN' record is already locked with X_LOCK by Transaction 2, so Transaction 1 has to wait until Transaction 2 ends. Under the 2PL protocol, an X_LOCK is released only when the transaction ends. Transaction 2 requests X_LOCK on the 'CHN' record and waits for Transaction 3. Finally, Transaction 3, which holds X_LOCK on the 'CHN' record, requests the 'KOR' record held by Transaction 1 and waits for Transaction 1. As a result, Transaction 1 waits for Transaction 2 to end, Transaction 2 waits for Transaction 3 to end, and Transaction 3 waits for Transaction 1 to end, so no transaction can make progress. This is called a deadlock.

Most DBMSs which use the 2PL method, including CUBRID, use deadlock detection to solve the deadlock problem. The DBMS periodically checks whether a cycle like the one illustrated in the above figure has occurred by drawing a Lock Wait Graph for the transactions being executed. In CUBRID, the deadlock detection thread checks the Lock Wait Graph every second. When a deadlock is detected, one of the transactions involved is selected as a victim and aborted by force. This is called a unilateral abort.

When a transaction is selected as the victim to be sacrificed to resolve the deadlock and is unilaterally aborted, the corresponding SQL statement returns an error code. The error message is "The transaction has timed out due to deadlock while waiting for X_LOCK for an object. It waited until User 2 ended." When the error is returned and the application aborts the transaction, the locks of the transaction are released and the other transactions can continue processing.
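The detection step can be illustrated with a small wait-for graph. In this Python sketch each transaction maps to the transactions it is waiting for, and a depth-first search looks for a cycle such as the T1, T2, T3 loop described above. The function `find_cycle` and the dictionary layout are assumptions made for illustration and are not taken from CUBRID's source code.

```python
def find_cycle(waits_for):
    """Return the transactions on a wait cycle, or None if there is no cycle."""
    WHITE, GREY, BLACK = 0, 1, 2              # unvisited, on current path, finished
    color = {t: WHITE for t in waits_for}
    path = []

    def visit(tran):
        color[tran] = GREY
        path.append(tran)
        for blocker in waits_for.get(tran, ()):
            state = color.get(blocker, WHITE)
            if state == GREY:                 # reached a node already on the path
                return path[path.index(blocker):]
            if state == WHITE:
                found = visit(blocker)
                if found:
                    return found
        color[tran] = BLACK
        path.pop()
        return None

    for tran in list(waits_for):
        if color[tran] == WHITE:
            cycle = visit(tran)
            if cycle:
                return cycle
    return None


# T1 waits for T2 ('JPN'), T2 waits for T3 ('CHN'), T3 waits for T1 ('KOR').
deadlocked = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}
healthy = {"T1": ["T2"], "T2": ["T3"], "T3": []}
print(find_cycle(deadlocked))
print(find_cycle(healthy))
```

Once a cycle is found, aborting any one transaction on it breaks the loop, because that transaction's locks are released and the transaction waiting on it can proceed.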
To see how a deadlock is detected, see the lock_detect_local_deadlock() function in the source code. This function is called at the interval (in seconds) specified by the PRM_LK_RUN_DEADLOCK_INTERVAL variable (the deadlock_detection_interval_in_secs parameter in the cubrid.conf file) on the background thread which executes thread_deadlock_detect_thread().

Even if no deadlock occurs, when a transaction takes too long to execute, other transactions have to wait too long as well. For some applications, it is wiser to give up than to wait. In particular, when a web server has called DB tasks and the wait time is too long, all threads of the web server end up tied to DB processing, so they can no longer process external HTTP requests, causing service failures. Therefore, for a web application, threads should be returned without waiting an unlimited amount of time for DB processing, even at the cost of an error. Two methods are used for that: one is the lock timeout supported by CUBRID, and the other is query cancel. JDBC defines an API which can cancel an SQL statement that is being executed.

The key data structures of the lock manager are defined in the lock_manager.c file.

    typedef struct lk_entry LK_ENTRY;
    struct lk_entry
    {
    #if defined(SERVER_MODE)
      struct lk_res *res_head;          /* back to resource entry */
      THREAD_ENTRY *thrd_entry;         /* thread entry pointer */
      int tran_index;                   /* transaction table index */
      LOCK granted_mode;                /* granted lock mode */
      LOCK blocked_mode;                /* blocked lock mode */
      int count;                        /* number of lock requests */
      struct lk_entry *next;            /* next entry */
      struct lk_entry *tran_next;       /* list of locks that trans. holds */
      struct lk_entry *class_entry;     /* ptr. to class lk_entry */
      LK_ACQUISITION_HISTORY *history;  /* lock acquisition history */
      LK_ACQUISITION_HISTORY *recent;   /* last node of history list */
      int ngranules;                    /* number of finer granules */
      int mlk_count;                    /* number of instant lock requests */
      unsigned char scanid_bitset[1];   /* PRM_LK_MAX_SCANID_BIT/8]; */
    #else /* not SERVER_MODE */
      int dummy;
    #endif /* not SERVER_MODE */
    };

    typedef struct lk_res LK_RES;
    struct lk_res
    {
      MUTEX_T res_mutex;                /* resource mutex */
      LOCK_RESOURCE_TYPE type;          /* type of resource: class,instance */
      OID oid;
      OID class_oid;
      LOCK total_holders_mode;          /* total mode of the holders */
      LOCK total_waiters_mode;          /* total mode of the waiters */
      LK_ENTRY *holder;                 /* lock holder list */
      LK_ENTRY *waiter;                 /* lock waiter list */
      LK_ENTRY *non2pl;                 /* non2pl list */
      LK_RES *hash_next;                /* for hash chain */
    };

In this file, the lk_Gl global variable of type LK_GLOBAL_DATA is the core. The LK_ENTRY structure represents the lock itself. For example, when Transaction T1 requests a lock, one LK_ENTRY is created. LK_RES is a structure that shows which resource the lock belongs to. In CUBRID, all resources are objects (instance objects and class objects), so they are identified by OIDs. In the LK_RES structure, you can see the list of holders and the list of waiters, both of LK_ENTRY type. The list of holders is the list of transactions that currently hold a lock on the resource. For example, when Transaction T1 and Transaction T2 have acquired S_LOCK on the data record with OID1, the LK_ENTRY structures that correspond to the S_LOCKs of T1 and T2 will be registered in the list of holders. When Transaction T3 requests X_LOCK on the OID1 record, T3 has to wait because of the existing S_LOCKs. So, the LK_ENTRY corresponding to the X_LOCK of T3 will be registered in the list of waiters. Which lock is held by which transaction is tracked in the tran_lock_table variable, which is a table of LK_TRAN_LOCK structures.
The Wait For Graph used for detecting a deadlock is expressed as TWFG_node and TWFG_edge, of the LK_WFG_NODE and LK_WFG_EDGE structure types respectively. The lock_detect_local_deadlock() function creates a Wait For Graph and detects whether there is a cycle in the graph. When a cycle is detected, the lock_select_deadlock_victim() function selects a victim transaction to be sacrificed to resolve the deadlock. Note that transactions keep executing while the Wait For Graph is being built and checked, and the information of any transaction that has ended is removed from the graph. The victim transaction is selected based on the following criteria:

- If a transaction is not a holder, it cannot be a victim.
- When a transaction is in the commit phase or the rollback phase, it cannot be selected as a victim.
- A transaction whose lock timeout is not set to -1 (unlimited waiting) is selected first.
- The latest transaction is selected rather than an older one. (The transaction ID is an incremental number, so a transaction with a smaller transaction number is the older one.)

Conclusion

This concludes the talk about Two-Phase Locking in CUBRID. I briefly covered the types of concurrency control, the difference between 2PL and MVCC, the locking technique used in CUBRID RDBMS, the lock modes and their compatibility, and finally deadlocks and how they are resolved. In this article I have mentioned OIDs (Object Identifiers), which are used to identify instance objects as well as class objects. In the next article I will continue this talk and explain what Object, Class, and OID are.
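As a closing illustration, the victim selection criteria listed above can be sketched in Python. This is not the logic of lock_select_deadlock_victim() itself: the `Tran` record, its field names, and in particular the way the last two criteria are combined (finite lock timeout first, then the largest transaction ID) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Tran:
    tran_id: int        # incremental, so a larger ID means a newer transaction
    is_holder: bool     # holds at least one lock in the cycle
    finishing: bool     # already in the commit or rollback phase
    lock_timeout: int   # seconds; -1 means unlimited waiting


def select_victim(cycle):
    """Pick the transaction to abort, or None if nothing is eligible."""
    candidates = [t for t in cycle if t.is_holder and not t.finishing]
    if not candidates:
        return None
    # Prefer a finite lock timeout, then the newest transaction.
    return max(candidates, key=lambda t: (t.lock_timeout != -1, t.tran_id))


cycle = [
    Tran(tran_id=1, is_holder=True, finishing=False, lock_timeout=-1),
    Tran(tran_id=2, is_holder=True, finishing=False, lock_timeout=10),
    Tran(tran_id=3, is_holder=True, finishing=False, lock_timeout=-1),
    Tran(tran_id=4, is_holder=False, finishing=False, lock_timeout=5),
    Tran(tran_id=5, is_holder=True, finishing=True, lock_timeout=5),
]
victim = select_victim(cycle)
print(victim.tran_id)
```

In this sample, transactions 4 and 5 are excluded outright, and transaction 2 is chosen over the newer transaction 3 because it is the only remaining candidate with a finite lock timeout.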

December 14, 2012

by Esen Sagynov


Checking DB Connection Using Groovy

Here is a simple Groovy script to verify an Oracle database connection using JDBC.

    @GrabConfig(systemClassLoader=true)
    @Grab('com.oracle:ojdbc6:11g')
    url = "jdbc:oracle:thin:@localhost:1521:XE"
    username = "system"
    password = "mypassword123"
    driver = "oracle.jdbc.driver.OracleDriver"

    // Groovy Sql connection test
    import groovy.sql.*
    sql = Sql.newInstance(url, username, password, driver)
    try {
        sql.eachRow('select sysdate from dual') { row ->
            println row
        }
    } finally {
        sql.close()
    }

This script should let you test the connection and perform quick ad hoc queries programmatically. However, when you first run it, it will likely fail because it cannot find the Maven dependency for the JDBC driver jar. In this case, you need to first install the Oracle JDBC jar into your local Maven repository. This is because Oracle has not published their JDBC jar to any public Maven repository, so we are left with installing it manually.

Here are the one-time setup steps:

1. Download the Oracle JDBC jar from their site: http://www.oracle.com/technetwork/database/features/jdbc/index-091264.html.
2. Unzip the file into the C:/ojdbc directory.
3. Now you can install the jar file into the local Maven repository using Cygwin.

    bash> cd /cygdrive/c/ojdbc
    bash> mvn install:install-file -DgroupId=com.oracle -DartifactId=ojdbc6 -Dversion=11g -Dpackaging=jar -Dfile=ojdbc6-11g.jar

That should make your script run successfully. The Groovy Sql class has many sugarcoated methods that let you quickly query and see data on screen. You can learn about more Groovy features by studying their API doc.

Note that you need systemClassLoader=true to make Groovy load the JDBC jar into the classpath and use it properly.

Oh, BTW, if you are using Oracle DB in production, you will likely be using a RAC configuration.
The JDBC URL connection string for that should look something like this:

    jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=localhost)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=MY_DB)))

Update: 12/07/2012

It appears that the groovy.sql.Sql class has a static withInstance method. This lets you run one-time DB work without writing a try/finally block. See this example:

    @GrabConfig(systemClassLoader=true)
    @Grab('com.oracle:ojdbc6:11g')
    url = "jdbc:oracle:thin:@localhost:1521:XE"
    username = "system"
    password = "mypassword123"
    driver = "oracle.jdbc.driver.OracleDriver"

    import groovy.sql.*
    Sql.withInstance(url, username, password, driver) { sql ->
        sql.eachRow('select sysdate from dual') { row ->
            println row
        }
    }

It's much more compact. But be aware of performance if you run it multiple times, because you will open and close a java.sql.Connection on each call!

I have also collected a couple of connection test examples for other popular databases. These should have their driver jars already in Maven Central, so Groovy Grab should be able to grab them just fine.

    // MySQL database test
    @GrabConfig(systemClassLoader=true)
    @Grab('mysql:mysql-connector-java:5.1.22')
    import groovy.sql.*
    Sql.withInstance("jdbc:mysql://localhost:3306/mysql", "root", "mypassword123", "com.mysql.jdbc.Driver") { sql ->
        sql.eachRow('SELECT * FROM USER') { row ->
            println row
        }
    }

    // H2Database
    @GrabConfig(systemClassLoader=true)
    @Grab('com.h2database:h2:1.3.170')
    import groovy.sql.*
    Sql.withInstance("jdbc:h2:~/test", "sa", "", "org.h2.Driver") { sql ->
        sql.eachRow('SELECT * FROM INFORMATION_SCHEMA.TABLES') { row ->
            println row
        }
    }
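For comparison, the open, query, close pattern that Sql.withInstance wraps can be sketched in Python. This example deliberately swaps in Python's built-in sqlite3 module and an in-memory database so that it runs without any driver download; it is not a Groovy or Oracle example, and the helper name `with_connection` is made up here.

```python
import sqlite3
from contextlib import closing


def with_connection(database, work):
    """Open a connection, run work(conn), and always close the connection."""
    with closing(sqlite3.connect(database)) as conn:
        return work(conn)


# A trivial query is enough to prove the connection works.
rows = with_connection(":memory:", lambda conn: conn.execute("SELECT 1 + 1").fetchall())
print(rows)
```

As with withInstance, every call opens and closes its own connection, so when many queries are needed it is cheaper to run them all inside a single call.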

December 12, 2012

by Zemian Deng


Configuring IIS methods for ASP.NET Web API on Windows Azure Websites

That’s a pretty long title, I agree. When working on my implementation of RFC2324, also known as the HyperText Coffee Pot Control Protocol, I struggled with something that you will struggle with as well in your ASP.NET Web APIs: supporting additional HTTP methods like HEAD, PATCH or PROPFIND. ASP.NET Web API has no issue with those, but when hosting them on IIS you’ll find yourself in yellow-screen-of-death heaven.

The reason IIS blocks these methods (or fails to route them to ASP.NET) is that your IIS installation may have some configuration leftovers from another API: WebDAV. WebDAV allows you to work with a virtual filesystem (and others) using an HTTP API. IIS of course supports this (because flagship product “SharePoint” uses it, probably) and gets in the way of your API.

Bottom line of the story: if you need those methods or want to provide your own HTTP methods, here’s the bit of configuration to add to your Web.config file:

Here’s what each part does:

Under modules, the WebDAVModule is being removed. Just to make sure that it’s not going to get in our way ever again.

The security/requestFiltering element I’ve added only applies if you want to define your own HTTP methods. So unless you need the XYZ method I’ve defined here, don’t add it to your config.

Under handlers, I’m removing the default handlers that route into ASP.NET. Then, I’m adding them again. The important part? The verb attribute. You can provide a list of comma-separated methods that you want to route into ASP.NET. Again, I’ve added my XYZ method, but you probably don’t need it.

This will work on any IIS server as well as on Windows Azure Websites. It will make your API… happy.
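A sketch of the kind of Web.config fragment described above. It assumes the default extensionless-URL handler name that ASP.NET 4 registers for the integrated pipeline and uses XYZ as a placeholder custom verb; the handler names and verb list on a given server may differ, so treat every value as an assumption to check against your own configuration. Only the integrated-pipeline handler is shown; any other default handlers would follow the same remove-then-add pattern.

```xml
<system.webServer>
  <modules>
    <!-- Keep the WebDAV module from intercepting requests -->
    <remove name="WebDAVModule" />
  </modules>
  <security>
    <requestFiltering>
      <verbs>
        <!-- Only needed for a custom method; XYZ is a placeholder -->
        <add verb="XYZ" allowed="true" />
      </verbs>
    </requestFiltering>
  </security>
  <handlers>
    <!-- Remove the default handler, then re-add it with an explicit verb list -->
    <remove name="ExtensionlessUrlHandler-Integrated-4.0" />
    <add name="ExtensionlessUrlHandler-Integrated-4.0"
         path="*."
         verb="GET,HEAD,POST,DEBUG,PUT,DELETE,PATCH,OPTIONS,XYZ"
         type="System.Web.Handlers.TransferRequestHandler"
         preCondition="integratedMode,runtimeVersionv4.0" />
  </handlers>
</system.webServer>
```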

December 11, 2012

by Maarten Balliauw


Using YAML for Java Application Configuration

YAML is a well-known format within the Ruby community and has been widely used there for a long time. But we as Java developers mostly deal with property files and XML when we need some configuration for our apps. How many times have we needed to express complicated configuration by inventing our own XML schema or imposing a property naming convention? Though JSON is becoming a popular format for web applications, using JSON files to describe configuration is a bit cumbersome and, in my opinion, not as expressive as YAML. Let's see what YAML can do to make our life easier.

Let's start with the problem. In order for our application to function properly, we need to feed it the following data somehow:

version and release date
database connection parameters
list of supported protocols
list of users with their passwords

This list of parameters sounds a bit weird, but the purpose is to demonstrate different data types at work: strings, numbers, dates, lists and maps. The Java model consists of two simple classes: Connection

    package com.example.yaml;

    public final class Connection {
        private String url;
        private int poolSize;

        public String getUrl() {
            return url;
        }

        public void setUrl(String url) {
            this.url = url;
        }

        public int getPoolSize() {
            return poolSize;
        }

        public void setPoolSize(int poolSize) {
            this.poolSize = poolSize;
        }

        @Override
        public String toString() {
            return String.format( "'%s' with pool of %d", getUrl(), getPoolSize() );
        }
    }

and Configuration. Both are typical Java POJOs, verbose because of property setters and getters (we are used to it, right?).
    package com.example.yaml;

    import static java.lang.String.format;

    import java.util.Date;
    import java.util.List;
    import java.util.Map;

    public final class Configuration {
        private Date released;
        private String version;
        private Connection connection;
        private List< String > protocols;
        private Map< String, String > users;

        public Date getReleased() {
            return released;
        }

        public String getVersion() {
            return version;
        }

        public void setReleased(Date released) {
            this.released = released;
        }

        public void setVersion(String version) {
            this.version = version;
        }

        public Connection getConnection() {
            return connection;
        }

        public void setConnection(Connection connection) {
            this.connection = connection;
        }

        public List< String > getProtocols() {
            return protocols;
        }

        public void setProtocols(List< String > protocols) {
            this.protocols = protocols;
        }

        public Map< String, String > getUsers() {
            return users;
        }

        public void setUsers(Map< String, String > users) {
            this.users = users;
        }

        @Override
        public String toString() {
            return new StringBuilder()
                .append( format( "Version: %s\n", version ) )
                .append( format( "Released: %s\n", released ) )
                .append( format( "Connecting to database: %s\n", connection ) )
                .append( format( "Supported protocols: %s\n", protocols ) )
                .append( format( "Users: %s\n", users ) )
                .toString();
        }
    }

Now that the model is clear, let us try to express it the way a human being normally would. Looking back at our list of required configuration, let's write it down item by item.

1. version and release date

    version: 1.0
    released: 2012-11-30

2. database connection parameters

    connection:
        url: jdbc:mysql://localhost:3306/db
        poolSize: 5

3. list of supported protocols

    protocols:
        - http
        - https

4. list of users with their passwords

    users:
        tom: passwd
        bob: passwd

And that is it: our configuration expressed in YAML syntax is complete!
The whole file sample.yml looks like this:

    version: 1.0
    released: 2012-11-30

    # Connection parameters
    connection:
        url: jdbc:mysql://localhost:3306/db
        poolSize: 5

    # Protocols
    protocols:
        - http
        - https

    # Users
    users:
        tom: passwd
        bob: passwd

To make it work in Java, we just need the awesome library called SnakeYAML. The Maven POM file is correspondingly simple:

    <project>
        <modelVersion>4.0.0</modelVersion>

        <groupId>com.example</groupId>
        <artifactId>yaml</artifactId>
        <version>0.0.1-SNAPSHOT</version>
        <packaging>jar</packaging>

        <properties>
            <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        </properties>

        <dependencies>
            <dependency>
                <groupId>org.yaml</groupId>
                <artifactId>snakeyaml</artifactId>
                <version>1.11</version>
            </dependency>
        </dependencies>

        <build>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>2.3.1</version>
                    <configuration>
                        <source>1.7</source>
                        <target>1.7</target>
                    </configuration>
                </plugin>
            </plugins>
        </build>
    </project>

Please notice the usage of Java 1.7: the language extensions and additional libraries simplify a lot of regular tasks, as we can see by looking at YamlConfigRunner:

    package com.example.yaml;

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.yaml.snakeyaml.Yaml;

    public class YamlConfigRunner {
        public static void main(String[] args) throws IOException {
            if( args.length != 1 ) {
                System.out.println( "Usage: " );
                return;
            }

            Yaml yaml = new Yaml();
            try( InputStream in = Files.newInputStream( Paths.get( args[ 0 ] ) ) ) {
                Configuration config = yaml.loadAs( in, Configuration.class );
                System.out.println( config.toString() );
            }
        }
    }

The code snippet loads the configuration from the file (args[ 0 ]), parses it and fills the Configuration class with data using JavaBeans conventions, converting to the declared types where possible. Running this class with sample.yml as an argument generates the following output:

    Version: 1.0
    Released: Thu Nov 29 19:00:00 EST 2012
    Connecting to database: 'jdbc:mysql://localhost:3306/db' with pool of 5
    Supported protocols: [http, https]
    Users: {tom=passwd, bob=passwd}

These match the values we configured. The release date only looks different because the bare date is read as midnight UTC and then printed in the JVM's local time zone (EST here).
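One detail in the output is worth a second look: the file says 2012-11-30, yet the program prints Nov 29 19:00:00 EST. The date is the same instant, midnight UTC, rendered in a zone five hours behind. The sketch below reproduces that shift with plain java.time, which assumes Java 8 or later (the article itself targets Java 1.7); ReleaseDateDemo and render are hypothetical names, and SnakeYAML is not involved.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical demo: shows how a date pinned to midnight UTC prints in another time zone.
final class ReleaseDateDemo {

    static String render(String isoDate, String zoneId) {
        // Interpret the bare date as the start of that day in UTC.
        ZonedDateTime utcMidnight = LocalDate.parse(isoDate).atStartOfDay(ZoneOffset.UTC);
        DateTimeFormatter format =
                DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss yyyy", Locale.US);
        // Same instant, displayed on the wall clock of the requested zone.
        return utcMidnight.withZoneSameInstant(ZoneId.of(zoneId)).format(format);
    }

    public static void main(String[] args) {
        System.out.println(render("2012-11-30", "UTC"));              // Fri Nov 30 00:00:00 2012
        System.out.println(render("2012-11-30", "America/New_York")); // Thu Nov 29 19:00:00 2012
    }
}
```

If the calendar date matters more than the instant, one option is to keep the value as a string in the configuration class and parse it explicitly.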

December 10, 2012

by Andriy Redko


Writing Acceptance Tests for Openshift and MongoDb Applications

Acceptance testing is used to determine whether the requirements of a specification are met. It should be run in an environment as similar as possible to the production one. So if your application is deployed into Openshift, you will need a second account, parallel to the one used in production, for running the tests. In this post we are going to write an acceptance test for an application deployed into Openshift that uses MongoDb as its database backend.

The deployed application is a simple library which returns all the books available for lending. It uses MongoDb for storing all information related to books.

So let's start by describing the goal, feature, user story and acceptance criteria for this application.

Goal: Bring reading to as many people as possible.
Feature: Display available books.
User Story: Browse Catalog -> In order to find books I would like to borrow, As a User, I want to be able to browse through all books.
Acceptance Criteria: Should see all available books.
Scenario: Given I want to borrow a book When I am at catalog page Then I should see information about available books: The Lord Of The Rings - 1299 - LOTRCoverUrl, The Hobbit - 293 - HobbitCoverUrl

Notice that this is a very simple application, so the acceptance criteria are simple too.

For this example we need two test frameworks: one for writing and running acceptance tests, and one for managing the NoSQL backend. In this post we are going to use Thucydides for ATDD and NoSQLUnit for dealing with MongoDb.

The application is already deployed in Openshift, and you can take a look at https://books-lordofthejars.rhcloud.com/GetAllBooks

Thucydides is a tool designed to make writing automated acceptance and regression tests easier. It uses the WebDriver API to access HTML page elements. It also helps you organise your tests and user stories through a concrete programming model, creates reports of executed tests, and measures functional coverage.
To write acceptance tests with Thucydides, the following steps should be followed.

First of all, choose a user story from one of your features.

Then implement the PageObject class. PageObject is a pattern which models a web application's user interface elements as objects, so tests can interact with them programmatically. Note that here we are coding "how" we access the HTML page.

The next step is implementing the steps library. This class contains all the steps required to execute an action. For example, creating a new book requires opening the addnewbook page, inserting the new data and clicking the submit button. Here we are coding "what" we need in order to implement the acceptance criteria.

Finally, code the chosen user story, following the defined acceptance criteria and using the step classes written before.

NoSQLUnit is a JUnit extension that helps us manage the lifecycle of the required NoSQL engine, keep the database in a known state and standardize the way we write tests for NoSQL applications. NoSQLUnit is composed of two groups of JUnit rules and two annotations. In the current case we don't need to manage the lifecycle of the NoSQL engine, because it is managed by an external entity (Openshift).

So let's get down to work. The first thing we are going to do is create a feature class which contains no test code; it is used as a way of representing the structure of the requirements.

    public class Application {
        @Feature
        public class Books {
            public class ListAllBooks {}
        }
    }

Note that each implemented feature should be contained within a class annotated with @Feature. Every nested class of the feature class represents a user story.

The next step is creating the PageObject class. Remember that the PageObject pattern models a web application's user interface as an object. So let's look at the HTML file to see which elements must be mapped.

[HTML listing: a page headed "List of Available Books" containing a table with the columns Title, Number Of Pages and Cover .....]
The most important thing here is that the table tag has an id named listBooks, which will be used in the PageObject class to get a reference to its parameters and data. Let's write the page object:

    @DefaultUrl("http://books-lordofthejars.rhcloud.com/GetAllBooks")
    public class FindAllBooksPage extends PageObject {

        @FindBy(id = "listBooks")
        private WebElement tableBooks;

        public FindAllBooksPage(WebDriver driver) {
            super(driver);
        }

        public TableWebElement getBooksTable() {
            Map<String, List<String>> tableValues = new HashMap<String, List<String>>();
            tableValues.put("titles", titles());
            tableValues.put("numberOfPages", numberOfPages());
            tableValues.put("covers", coversUrl());
            return new TableWebElement(tableValues);
        }

        private List<String> titles() {
            List<WebElement> namesWebElement = tableBooks.findElements(By.className("title"));
            return with(namesWebElement).convert(toStringValue());
        }

        private List<String> numberOfPages() {
            List<WebElement> numberOfPagesWebElement = tableBooks.findElements(By.className("numberOfPages"));
            return with(numberOfPagesWebElement).convert(toStringValue());
        }

        private List<String> coversUrl() {
            List<WebElement> coverUrlWebElement = tableBooks.findElements(By.className("cover"));
            return with(coverUrlWebElement).convert(toImageUrl());
        }

        // Extracts the src attribute of the img tag inside a cell
        private Converter<WebElement, String> toImageUrl() {
            return new Converter<WebElement, String>() {
                @Override
                public String convert(WebElement from) {
                    WebElement imgTag = from.findElement(By.tagName("img"));
                    return imgTag.getAttribute("src");
                }
            };
        }

        // Extracts the visible text of a cell
        private Converter<WebElement, String> toStringValue() {
            return new Converter<WebElement, String>() {
                @Override
                public String convert(WebElement from) {
                    return from.getText();
                }
            };
        }
    }

With @DefaultUrl we set which URL is being mapped, with @FindBy we map the web element with id listBooks, and finally the getBooksTable() method returns the content of the generated HTML table.

The next thing to do is implement the steps class. In this simple case we only need two steps: one that opens the GetAllBooks page, and one that asserts that the table contains the expected elements.
    public class EndUserSteps extends ScenarioSteps {

        public EndUserSteps(Pages pages) {
            super(pages);
        }

        private static final long serialVersionUID = 1L;

        @Step
        public void should_obtain_all_inserted_books() {
            TableWebElement booksTable = onFindAllBooksPage().getBooksTable();

            List<String> titles = booksTable.getColumn("titles");
            assertThat(titles, hasItems("The Lord Of The Rings", "The Hobbit"));

            List<String> numberOfPages = booksTable.getColumn("numberOfPages");
            assertThat(numberOfPages, hasItems("1299", "293"));

            List<String> covers = booksTable.getColumn("covers");
            assertThat(covers, hasItems("http://upload.wikimedia.org/wikipedia/en/6/62/Jrrt_lotr_cover_design.jpg", "http://upload.wikimedia.org/wikipedia/en/4/4a/TheHobbit_FirstEdition.jpg"));
        }

        @Step
        public void open_find_all_page() {
            onFindAllBooksPage().open();
        }

        private FindAllBooksPage onFindAllBooksPage() {
            return getPages().currentPageAt(FindAllBooksPage.class);
        }
    }

And finally the class for validating the acceptance criteria:

    @Story(Application.Books.ListAllBooks.class)
    @RunWith(ThucydidesRunner.class)
    public class FindBooksStory {

        private final MongoDbConfiguration mongoDbConfiguration = mongoDb()
            .host("127.0.0.1").databaseName("books")
            .username(MongoDbConstants.USERNAME)
            .password(MongoDbConstants.PASSWORD).build();

        @Rule
        public final MongoDbRule mongoDbRule = newMongoDbRule().configure(mongoDbConfiguration).build();

        @Managed(uniqueSession = true)
        public WebDriver webdriver;

        @ManagedPages(defaultUrl = "http://books-lordofthejars.rhcloud.com")
        public Pages pages;

        @Steps
        public EndUserSteps endUserSteps;

        @Test
        @UsingDataSet(locations = "books.json", loadStrategy = LoadStrategyEnum.CLEAN_INSERT)
        public void finding_all_books_should_return_all_available_books() {
            endUserSteps.open_find_all_page();
            endUserSteps.should_obtain_all_inserted_books();
        }
    }

There are some things to consider in the previous class:

@Story should receive a class defined within a @Feature-annotated class, so that Thucydides can create the report correctly.
We use MongoDbRule to establish a connection to the remote MongoDb instance. Note that we can use the localhost address because of Openshift's port forwarding capability, so although localhost is used, we are really managing a remote MongoDb instance.

With @Steps, Thucydides will create an instance of the step library shown earlier.

And finally the @UsingDataSet annotation populates the MongoDb database before running the test.

    {
        "book":[
            {
                "title": "The Lord Of The Rings",
                "numberOfPages": "1299",
                "cover": "http:\/\/upload.wikimedia.org\/wikipedia\/en\/6\/62\/Jrrt_lotr_cover_design.jpg"
            },
            {
                "title": "The Hobbit",
                "numberOfPages": "293",
                "cover": "http:\/\/upload.wikimedia.org\/wikipedia\/en\/4\/4a\/TheHobbit_FirstEdition.jpg"
            }
        ]
    }

Note that NoSQLUnit keeps the database in a known state by cleaning it before each test execution and populating it with known data defined in a JSON file.

Also keep in mind that this example is very simple, so only a small subset of the capabilities of Thucydides and NoSQLUnit has been shown. Keep watching both sites: http://thucydides.info and https://github.com/lordofthejars/nosql-unit

We keep learning, Alex.

Love Is A Burning Thing, And It Makes A Fiery Ring, Bound By Wild Desire, I Fell Into A Ring Of Fire (Ring of Fire - Johnny Cash)
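The page object and the steps class both rely on a TableWebElement type that the article never shows. A minimal sketch of what such a holder could look like follows, assuming it is nothing more than a read-only wrapper around the column map built in getBooksTable(); the implementation is hypothetical and only mirrors the constructor and getColumn(String) calls used above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a read-only view of a scraped HTML table, keyed by column name.
final class TableWebElement {

    private final Map<String, List<String>> columns;

    TableWebElement(Map<String, List<String>> columns) {
        // Defensive copy so later changes to the caller's map do not leak in.
        Map<String, List<String>> copy = new HashMap<String, List<String>>();
        for (Map.Entry<String, List<String>> entry : columns.entrySet()) {
            copy.put(entry.getKey(),
                    Collections.unmodifiableList(new ArrayList<String>(entry.getValue())));
        }
        this.columns = copy;
    }

    // Returns the values of the named column, or an empty list for an unknown column.
    List<String> getColumn(String name) {
        List<String> values = columns.get(name);
        return values == null ? Collections.<String>emptyList() : values;
    }

    public static void main(String[] args) {
        Map<String, List<String>> values = new HashMap<String, List<String>>();
        values.put("titles", Arrays.asList("The Lord Of The Rings", "The Hobbit"));
        System.out.println(new TableWebElement(values).getColumn("titles"));
    }
}
```

Returning an empty list for a missing column keeps the Hamcrest assertions in the steps class failing with a readable message instead of a NullPointerException.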

December 9, 2012

by Alex Soto


A C# .NET Client Proxy for RabbitMQ Management API

RabbitMQ comes with a very nice Management UI and an HTTP JSON API that lets you configure and monitor your RabbitMQ broker. From the website:

“The rabbitmq-management plugin provides an HTTP-based API for management and monitoring of your RabbitMQ server, along with a browser-based UI and a command line tool, rabbitmqadmin. Features include:

Declare, list and delete exchanges, queues, bindings, users, virtual hosts and permissions.
Monitor queue length, message rates globally and per channel, data rates per connection, etc.
Send and receive messages.
Monitor Erlang processes, file descriptors, memory use.
Export / import object definitions to JSON.
Force close connections, purge queues.”

Wouldn’t it be cool if you could do all these management tasks from your .NET code? Well now you can. I’ve just added a new project to EasyNetQ called EasyNetQ.Management.Client. This is a .NET client-side proxy for the HTTP-based API. It’s on NuGet, so to install it, you simply run:

    PM> Install-Package EasyNetQ.Management.Client

To give an overview of the sort of things you can do with EasyNetQ.Management.Client, have a look at this code. It first creates a new Virtual Host and a User, and gives the User permissions on the Virtual Host. Then it re-connects as the new user, creates an exchange and a queue, binds them, and publishes a message to the exchange. Finally it gets the first message from the queue and outputs it to the console.
    var initial = new ManagementClient("http://localhost", "guest", "guest");

    // first create a new virtual host
    var vhost = initial.CreateVirtualHost("my_virtual_host");

    // next create a user for that virtual host
    var user = initial.CreateUser(new UserInfo("mike", "topSecret"));

    // give the new user all permissions on the virtual host
    initial.CreatePermission(new PermissionInfo(user, vhost));

    // now log in again as the new user
    var management = new ManagementClient("http://localhost", user.name, "topSecret");

    // test that everything's OK
    management.IsAlive(vhost);

    // create an exchange
    var exchange = management.CreateExchange(new ExchangeInfo("my_exchange", "direct"), vhost);

    // create a queue
    var queue = management.CreateQueue(new QueueInfo("my_queue"), vhost);

    // bind the exchange to the queue
    management.CreateBinding(exchange, queue, new BindingInfo("my_routing_key"));

    // publish a test message
    management.Publish(exchange, new PublishInfo("my_routing_key", "Hello World!"));

    // get any messages on the queue
    var messages = management.GetMessagesFromQueue(queue, new GetMessagesCriteria(1, false));
    foreach (var message in messages)
    {
        Console.Out.WriteLine("message.payload = {0}", message.payload);
    }

This library is also ideal for monitoring queue levels, channels and connections on your RabbitMQ broker.
For example, this code prints out details of all the current connections to the RabbitMQ broker:

    var connections = managementClient.GetConnections();
    foreach (var connection in connections)
    {
        Console.Out.WriteLine("connection.name = {0}", connection.name);
        Console.WriteLine("user:\t{0}", connection.client_properties.user);
        Console.WriteLine("application:\t{0}", connection.client_properties.application);
        Console.WriteLine("client_api:\t{0}", connection.client_properties.client_api);
        Console.WriteLine("application_location:\t{0}", connection.client_properties.application_location);
        Console.WriteLine("connected:\t{0}", connection.client_properties.connected);
        Console.WriteLine("easynetq_version:\t{0}", connection.client_properties.easynetq_version);
        Console.WriteLine("machine_name:\t{0}", connection.client_properties.machine_name);
    }

On my machine, with one consumer running, it outputs this:

    connection.name = [::1]:64754 -> [::1]:5672
    user: guest
    application: EasyNetQ.Tests.Performance.Consumer.exe
    client_api: EasyNetQ
    application_location: D:\Source\EasyNetQ\Source\EasyNetQ.Tests.Performance.Consumer\bin\Debug
    connected: 14/11/2012 15:06:19
    easynetq_version: 0.9.0.0
    machine_name: THOMAS

You can see the name of the application that’s making the connection, the machine it’s running on and even its location on disk. That’s rather nice. From this information it wouldn’t be too hard to auto-generate a complete system diagram of your distributed messaging application. Now there’s an idea :)
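For readers curious what a proxy like this wraps, the sketch below shows how a request to the management HTTP API is typically put together: HTTP Basic authentication plus URL paths in which the virtual host is percent-encoded (the default vhost "/" becomes %2F). It is written in Java purely for illustration and is not how EasyNetQ.Management.Client is implemented; the class name ManagementApiRequest, its methods and the base URL with port 15672 are assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: builds the pieces of a RabbitMQ management API request.
final class ManagementApiRequest {

    // Value for the Authorization header using HTTP Basic authentication.
    static String basicAuth(String user, String password) {
        byte[] raw = (user + ":" + password).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(raw);
    }

    // URL of a single queue; the vhost and queue name are each encoded as one path segment.
    static String queueUrl(String baseUrl, String vhost, String queue) {
        return baseUrl + "/api/queues/" + encodeSegment(vhost) + "/" + encodeSegment(queue);
    }

    static String encodeSegment(String segment) {
        try {
            // URLEncoder targets form data, so turn its '+' for spaces into %20 for use in a path.
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(queueUrl("http://localhost:15672", "/", "my_queue"));
        System.out.println(basicAuth("guest", "guest"));
    }
}
```

A GET on the resulting URL with that Authorization header returns a JSON description of the queue, which is the kind of payload the .NET proxy deserializes into objects.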

December 7, 2012

by Mike Hadlow


C++ benchmark – std::vector VS std::list

An updated version of this article is available: C++ benchmark – std::vector VS std::list VS std::deque

In C++, the two most used data structures are std::vector and std::list. In this article, we compare the practical performance of these two data structures on several different workloads. Throughout, "list" means the std::list implementation and "vector" means the std::vector implementation.

It is generally said that a list should be used when random inserts and removes will be performed (O(1) for a list versus O(n) for a vector). If we look only at complexity, search in both data structures should be roughly equivalent, being O(n). When random insert/remove operations are performed on a vector, all the subsequent data needs to be moved, so each of those elements is copied. That is why the size of the data type is an important factor when comparing the two data structures.

However, in practice, there is a huge difference in the usage of the memory caches. All the data in a vector is contiguous, whereas std::list allocates memory separately for each element. How does that change the results in practice?

Keep in mind that all the tests are run on vector and list even if other data structures could be better suited to the given workload. In the graphs and in the text, n refers to the number of elements in the collection.

All the tests were run on an Intel Core i7 Q 820 @ 1.73GHz. The code was compiled in 64 bits with GCC 4.7.2 with -O2 and -march=native, and with C++11 support (-std=c++11).

Fill

The first test fills the data structures by adding elements to the back of the container. Two variations of vector are used, vector_pre being a std::vector with the size passed to the constructor, resulting in only one allocation of memory.
[Figure: fill (8 bytes). Series: vector_pre, vector, list. X axis: number of elements, 1000 to 1000000. Y axis: milliseconds.]

[Figure: fill (1024 bytes). Series: vector_pre, vector, list. X axis: number of elements, 1000 to 1000000. Y axis: milliseconds.]

All data structures are impacted the same way when the data size increases, because there is more memory to allocate. vector_pre is clearly the winner of this test, being one order of magnitude faster than a list and about twice as fast as a vector without pre-allocation. The results are directly linked to the allocations that have to be performed, allocation being slow. Whatever the data size, push_back to a vector will always be faster than to a list. This is logical because vector allocates more memory than necessary and so does not need to allocate memory for each element.

But this test is not very interesting; generally, building the data structure is not critical. What is critical is the operations performed on the data structure. Those are tested in the coming sections.

Random Find

The first operation tested is search. The container is filled with all the numbers in [0, n] and shuffled. Then each number in [0, n] is searched for in the container with std::find, which performs a simple linear search.

Yes, vector is represented in the graph; its line is the same as the x line!

Performing a linear search in a vector is several orders of magnitude faster than in a list.

The only reason is the usage of the cache line. When a piece of data is accessed, it is fetched from main memory into the cache. Not only the accessed data is fetched, but a whole cache line. As the elements in a vector are contiguous, when you access an element, the next element is automatically in the cache. As main memory is orders of magnitude slower than the cache, this makes a huge difference. In the list case, the processor spends its whole time waiting for data to be fetched from memory into the cache.
If we increase the size of the data type to 1KB, the results remain the same, but slower:

Random Insert

In the case of random insert, in theory, the list should be much faster, its insert operation being O(1) versus O(n) for a vector.

The container is filled with all the numbers in [0, n] and shuffled. Then 1000 random values are inserted at random positions in the container. The random position is found by linear search. In both cases the complexity of the search is O(n); the only difference comes from the insert that follows the search.

Whereas the vector should be slower than the list, it is in fact almost an order of magnitude faster. Again, this is because finding the position in a list is much slower than copying a lot of small elements.

If we increase the size:

The two lines are getting closer, but vector is still faster.

Increase it to 1KB:

This time, list outperforms vector by an order of magnitude! The performance of random insert in a list is not impacted much by the size of the data type, whereas vector suffers a lot when big sizes are used. We can also see that list doesn't seem to care about the size of the collection. That is because the size of the collection only impacts the search and not the insertion, and as few searches are performed, it does not change the results a lot.

If the iterator were already known (no need for a linear search), it would be faster to insert into a list than into a vector.

Random Remove

In theory, random remove is the same case as random insert. Now that we've seen the results for random insert, we can expect the same behavior for random remove.

The container is filled with all the numbers in [0, n] and shuffled. Then 1000 random values are removed from random positions in the container.

Again, vector is several times faster and looks to scale better. Again, this is because it is very cheap to copy small elements.

Let's increase it directly to 1KB elements.

The two lines have been reversed!
The behavior of random remove is the same as the behavior of random insert, for the same reasons.

Push Front

The next operation we compare is inserting elements at the front of the collection. This is the worst case for vector, because after each insertion all the previously inserted elements must be moved and copied. For a list, it makes no difference compared to pushing to the back.

The results are crystal-clear and as expected: vector is very bad at inserting elements at the front. This needs no further explanation. There is no need to change the data size; it would only make vector much slower.

Sort

The next operation tested is sorting a vector or a list. For a vector, std::sort is used, and for a list, the member function sort is used.

We can see that sorting a list is several times slower. This comes from the poor usage of the cache.

If we increase the size of the element to 1KB:

This time the list is faster than the vector. It is not very clear on the graph, but the values for the list are almost the same as for the previous results. That is because std::list::sort() does not perform any copies; only the pointers to the elements are changed. On the other hand, swapping two elements in a vector involves at least three copies, so the cost of sorting increases as the cost of copying increases.

Number Crunching

Finally, we can also test a number crunching operation. Here, random elements are inserted into a container that is kept sorted. This means that the position where the element has to be inserted is first found by iterating through the elements, and then the element is inserted. As we are talking about number crunching, only 8-byte elements are tested.

We can clearly see that vector is more than an order of magnitude faster than list, and the gap will only grow as the size of the collection increases. This is because traversing the list is much more expensive than copying the elements of the vector.
Conclusion

To conclude, we can state some facts about each data structure:

std::vector is insanely faster than std::list at finding an element
std::vector always performs faster than std::list with very small data
std::vector is always faster than std::list at pushing elements at the back
std::list handles large elements very well, especially for sorting or inserting at the front

These are the simple conclusions on the usage of each data structure:

For number crunching: use std::vector
For linear search: use std::vector
For random insert/remove: use std::list (if the data size is very small (< 64B on my computer), use std::vector)
For big data sizes: use std::list (not if intended for searching)

If you have the time, in practice, the best way to decide is always to benchmark both versions, or even to try other data structures.

I hope that you found this article interesting. If you have any comments or an idea for another workload that you would like to see tested, don’t hesitate to post a comment. If you have a question about the results, don’t hesitate either.

The source code of the benchmark is available online: https://github.com/wichtounet/articles/blob/master/src/vector_list/bench.cpp

December 6, 2012

by Baptiste Wicht


Easy Integration Testing with Spring+Hibernate

I am guilty of not writing integration tests (at least for database-related transactions) up until now. So, to eradicate the guilt, I read up over the weekend on how to achieve this with minimal effort, and came up with a small example showing how to do it with ease using Spring and Hibernate. With integration testing, you can test your DAO (data access object) layer without ever having to deploy the application. For me this is a huge plus, since now I can test my criteria queries, named queries and the like without having to run the application.

There is a property in Hibernate that allows you to specify an SQL script to run when the session factory is initialized. With this, I can populate tables with the data required by my DAO layer. The property is hibernate.hbm2ddl.import_files, set here to import.sql.

According to the Hibernate documentation, you can list several comma-separated SQL scripts. One gotcha here is that you cannot create tables using the script, because the schema needs to be created first in order for the script to run. Even if you issue a create table statement within the script, it is ignored when the script executes, as far as I saw.
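In a Spring XML context, Hibernate settings like this are commonly passed as prop entries of the session factory's hibernateProperties. The fragment below is a sketch under that assumption and is not the article's original listing; only the two property names and the values create and import.sql are taken as given, and the surrounding bean definition is omitted.

```xml
<property name="hibernateProperties">
  <props>
    <!-- Let Hibernate create the schema first, so the script has tables to insert into -->
    <prop key="hibernate.hbm2ddl.auto">create</prop>
    <!-- Script(s) on the classpath to run once the schema exists; comma-separate several -->
    <prop key="hibernate.hbm2ddl.import_files">import.sql</prop>
  </props>
</property>
```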
Let me first show you the DAO class I am going to test:

    package com.unittest.session.example1.dao;

    import org.springframework.transaction.annotation.Propagation;
    import org.springframework.transaction.annotation.Transactional;

    import com.unittest.session.example1.domain.Employee;

    @Transactional(propagation = Propagation.REQUIRED)
    public interface EmployeeDAO {

        public Long createEmployee(Employee emp);

        public Employee getEmployeeById(Long id);
    }

    package com.unittest.session.example1.dao.hibernate;

    import org.springframework.orm.hibernate3.support.HibernateDaoSupport;

    import com.unittest.session.example1.dao.EmployeeDAO;
    import com.unittest.session.example1.domain.Employee;

    public class EmployeeHibernateDAOImpl extends HibernateDaoSupport implements EmployeeDAO {

        @Override
        public Long createEmployee(Employee emp) {
            getHibernateTemplate().persist(emp);
            return emp.getEmpId();
        }

        public Employee getEmployeeById(Long id) {
            return getHibernateTemplate().get(Employee.class, id);
        }
    }

Nothing major, just a simple DAO with two methods, one to persist and one to retrieve. To test the retrieval method I need to populate the Employee table with some data. This is where the import SQL script explained earlier comes into play. The import.sql file is as follows:

    insert into Employee (empId,emp_name) values (1,'Emp test');

This is just a basic script in which I insert one record into the employee table. Note again that the employee table should be created through the Hibernate auto-create DDL option in order for the SQL script to run. More info can be found here. Also, the import.sql script in my case is on the classpath. This is required in order for it to be picked up and executed when the session factory is created.

Next up, let us see how easy it is to run integration tests with Spring.
    package com.unittest.session.example1.dao.hibernate;

    import static org.junit.Assert.*;

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.test.context.ContextConfiguration;
    import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
    import org.springframework.test.context.transaction.TransactionConfiguration;

    import com.unittest.session.example1.dao.EmployeeDAO;
    import com.unittest.session.example1.domain.Employee;

    @RunWith(SpringJUnit4ClassRunner.class)
    @ContextConfiguration(locations="classpath:spring-context.xml")
    @TransactionConfiguration(defaultRollback=true,transactionManager="transactionManager")
    public class EmployeeHibernateDAOImplTest {

        @Autowired
        private EmployeeDAO employeeDAO;

        @Test
        public void testGetEmployeeById() {
            // Employee 1 is the row inserted by import.sql
            Employee emp = employeeDAO.getEmployeeById(1L);
            assertNotNull(emp);
        }

        @Test
        public void testCreateEmployee() {
            Employee emp = new Employee();
            emp.setName("Emp123");
            Long key = employeeDAO.createEmployee(emp);
            assertEquals(2L, key.longValue());
        }
    }

A few things to note here. You need to instruct JUnit to run the test within a Spring context; we use the SpringJUnit4ClassRunner for this. Also, the transaction attribute is set to defaultRollback=true. Note that with MySQL, for this to work, your tables must use the InnoDB engine, as the MyISAM engine does not support transactions.

And finally, here are the values set in the Spring configuration that wires everything up:

    com.unittest.session.example1.**.*
    org.hibernate.dialect.MySQLDialect
    com.mysql.jdbc.Driver
    jdbc:mysql://localhost:3306/hbmex1
    root
    password
    true
    org.hibernate.dialect.MySQLDialect
    create
    import.sql

That is about it. Personally, I would much rather use a more lightweight in-memory database such as HSQLDB to run my integration tests. Here is the Eclipse project for anyone who would like to run the program and try it out.

November 27, 2012

by Dinuka Arseculeratne

· 56,207 Views · 2 Likes

Enterprise-ready Tool Support for Apache Camel

Apache Camel is my favorite integration framework on the Java platform due to its great DSLs, a huge community, and so many different components. Camel is used by many developers from different companies all over the world. However, most people are not aware that some really cool and, more importantly, enterprise-ready tooling is available for Camel, too. Many people ask me about Camel tooling when I give talks at conferences, which is the reason for this short blog post. [FYI: I work for Talend (one of the vendors).]

IDE Support

Camel consists of a set of normal Java libraries and is therefore usable with any Java IDE (such as Eclipse, NetBeans or IntelliJ IDEA) or even a classic text editor. Programming DSLs are available for Java, Groovy, and Scala. Even a Kotlin DSL is in the works, thanks to Camel's founder James Strachan. All familiar IDE features such as code completion or Javadoc view are available for these DSLs. For the Spring XML DSL, the Eclipse-based SpringSource Tool Suite (STS) deserves special mention, as it provides the best support for the Spring Framework and XML configurations.

Camel-Specific Tooling

Besides classical IDE support, further products are available to provide additional functionality. Integration problems can be modeled with the help of Enterprise Integration Patterns (EIPs, http://www.eaipatterns.com/), which Camel implements. Visual designers are available to help model integration problems with these EIPs. These tools even generate the corresponding source code automatically. Ideally, developers do not have to write any source code by hand. Camel tooling is offered by Talend with Talend ESB (http://de.talend.com/products/esb) and by JBoss, formerly FuseSource, with Fuse IDE (http://fusesource.com/products/fuse-ide). Both companies also provide full-time committers for the Apache Camel project. Let's take a short look at these two products.
Open Studio for Talend ESB

Talend ESB is an Eclipse-based integration platform within the Talend Unified Platform. The familiar "look and feel" and the intuitive use of Eclipse remain. The ESB is open source and freely available. The paid enterprise version offers additional features and support. The ESB can be used independently or in combination with other parts of the Talend Unified Platform, such as BPM, big data, or master data management. The great benefit is that everything can be done within one suite using the same GUI and concepts, based on Eclipse. The entire Talend Unified Platform is based on the "zero-coding" approach. This way, integration problems can be implemented very efficiently using the EIPs and components. Routes are modeled and configured with intuitive tool support, and all source code is generated. Of course, custom integration logic can still be written and included, for example POJOs, Spring beans, scripts in different languages, or your own Camel components. Plenty of components besides Camel's own are available for Talend ESB, for example connectors to Alfresco, Jasper, SAP, Salesforce, or host systems.

Figure 1: Visual designer of Talend's ESB

Fuse IDE

The Fuse IDE is an Eclipse plugin, which is installed from the Eclipse update site. The visual designer (see figure 2) generates Camel routes as XML code using the Spring XML DSL. Editing works in both directions: the developer can change the generated source code, and the graphical model picks up the changes automatically. Fuse IDE is intuitive to use for creating Camel routes. FuseSource offers some other products which can be used in combination with Fuse IDE, such as a management console or Fuse MQ for messaging. Under FuseSource, Fuse IDE was a proprietary product. However, FuseSource was recently taken over by Red Hat (http://www.redhat.com/about/news/press-archive/2012/6/red-hat-to-acquire-fusesource) and now belongs to the JBoss division.
In the new roadmap, the Fuse IDE is still included. It will probably be integrated into the JBoss Enterprise SOA Platform and be open sourced. The integration of FuseSource will take at least a few more months to complete (http://www.redhat.com/promo/jboss_integration_week/). JBoss now "owns" three ESB products (JBoss ESB, SwitchYard and Fuse ESB). Probably, these will be merged into one product in the end (SwitchYard is also based on Camel). Nevertheless, the FuseSource products will also be supported for some time, primarily in order to satisfy existing customers (my guess).

Figure 2: Visual designer of Fuse IDE (JBoss, formerly FuseSource)

Enterprise-Ready Tooling Is Already Available for Apache Camel!

The bottom line is that enterprise-ready tooling is already available for Apache Camel. It is great to see different companies working on tooling for Apache Camel. The winner definitely is Apache Camel... and there is no loser! Talend ESB and Fuse IDE are two different approaches for different kinds of projects. If you like the "zero-coding" approach, then take a closer look at Talend's ESB. It is really easy and efficient to realize integration projects without writing source code; nevertheless, there is enough flexibility for customization and for adding your own source code. The combination with BPM, MDM or big data (based on Hadoop) is also supported within the Unified Platform using the same open source and "zero-coding" concepts. If you "insist" on writing and refactoring all source code yourself within the text editor of an IDE, then take a look at Fuse IDE. Your best bet would be to try out both and see which one fits best into your next enterprise integration project.

If you know any other cool Camel tooling (no matter if it is enterprise-ready or not), or if you have any other feedback, please write a comment. Thank you.
Best regards,
Kai Wähner (Twitter: @kaiwaehner)

Content from my blog: http://www.kai-waehner.de/blog/2012/11/23/enterprise-ready-tool-support-for-apache-camel/

November 26, 2012

by Kai Wähner


· 15,556 Views

CentOS Minimal Installation Network Configuration

By default, a CentOS minimal install does not come with a pre-configured network. Here's how to make it work:

    $ ping google.com
    ping: unknown host google.com

To fix this, we'll need to edit the setup for the Ethernet interface. Let's start by editing this file:

    $ vim /etc/sysconfig/network-scripts/ifcfg-eth0

    IPADDR=x.x.x.x
    BOOTPROTO=none
    NETMASK=255.255.255.0
    GATEWAY=y.y.y.y
    DNS1=y.y.y.y
    DNS2=y.y.y.y
    USERCTL=yes
    HWADDR='your mac address'

where x.x.x.x is your static IP and y.y.y.y is your router's IP. If you're not sure what your MAC address is, run this command:

    $ ifconfig eth0 | grep -o -E '([[:xdigit:]]{1,2}:){5}[[:xdigit:]]{1,2}'

Now edit the network config and make sure it contains the line below:

    $ vi /etc/sysconfig/network

Add the line (without spaces around the equals sign, since the file is read as shell variable assignments):

    GATEWAY=y.y.y.y

Now restart the network service (on CentOS the init script is called network, not networking):

    $ /etc/init.d/network restart

Now ping the router:

    $ ping y.y.y.y
    Request timeout for icmp_seq 0
    64 bytes from y.y.y.y: icmp_seq=1 ttl=56 time=1.792 ms
    Request timeout for icmp_seq 1
    64 bytes from y.y.y.y: icmp_seq=3 ttl=56 time=1.790 ms
    64 bytes from y.y.y.y: icmp_seq=4 ttl=56 time=1.762 ms

Looks good. Now let's see if we can reach anything outside:

    $ ping google.com
    PING google.com (173.194.67.138) 56(84) bytes of data.
    64 bytes from wi-in-f138.1e100.net (173.194.67.138): icmp_seq=1 ttl=49 time=7.88 ms
    64 bytes from wi-in-f138.1e100.net (173.194.67.138): icmp_seq=2 ttl=49 time=7.35 ms
    64 bytes from wi-in-f138.1e100.net (173.194.67.138): icmp_seq=3 ttl=49 time=7.13 ms

Now you can connect to the internet and get all the packages you need.

November 26, 2012

by Kasia Gogolek

· 89,965 Views · 1 Like

IndexedDBViewer: Take a Look Inside your IndexedDB

A few days ago I released a new version of the IndexedDBViewer, 1.1.0. The IndexedDBViewer is intended for web developers who want to peek into their IndexedDB database. It allows you to inspect the structure of your database as well as the data stored inside it.

The difference from the previous version is that it no longer needs the jQuery UI library, which eliminates one reference. The following references are still necessary:

- jQuery (version 1.8.2 or higher)
- linq2indexedDB (version 1.0.19 or higher)
- indexedDBViewer (1.1.0, JavaScript + CSS file)

If you are using NuGet, you can get all the resources by searching for indexedDBViewer.

The second major change is that the viewer can easily be added to an existing page. The only thing you need to do is add a div with "indexedDBViewer" as its id and a data-dbName attribute naming the database you want to inspect. The rest is handled by the script in the viewer. Once this is done and you navigate to the page with the viewer, you will see the viewer appear at the bottom of the page.

In the left pane you get an overview of the database structure. This is a list with the name of the database at the top. Under that you will find child nodes that represent the object stores present in the database. If we descend another level, we can see the indexes present on each object store. If you click on the "+" or "-" next to the names, you can expand or hide the structure beneath.

If you click on the database name in the navigation pane, you will get information about the database and its structure. In the general block you will see the name of the database and its current version. The object stores block gives you an overview of all the object stores present and how they are configured. The indexes block shows all the indexes present and how they are configured.
When you click on one of the object store names in the navigation pane, you will get all the data present in the object store. Because the data is saved as key/value pairs, you will see each key with its corresponding value. If the value is an object or contains objects, you can inspect them by clicking on the "+" to expand and the "-" to hide the details.

If you click on one of the index names in the navigation pane, you will get, as for object stores, all the data present in the object store. But in this case you will see a little more: besides the key of the index and the value, you will see the key the value has in the object store. This can be found under the "primary" key column.

Finally, there are some small extra features:

- If you click on the top border of the viewer and drag it up or down, you can change the height of the viewer.
- If you click on the "-" in the top right of the viewer, you can hide the viewer. If you want it to appear again, click on the "+" at the bottom right of the page.

Conclusion

With this Chrome-like IndexedDBViewer you can inspect the structure of your database, including all the data stored within it, with the advantage that it runs inside the browser, so you can use it cross-browser.

November 26, 2012

by Kristof Degrave

· 5,494 Views

Moving the Decimal Place in a Double

This is taken from a popular answer to the question "Moving decimal places over in a double".

Question

So I have a double set to equal 1234, and I want to move the decimal place over to make it 12.34. To do this I multiply 1234 by .1 two times, kind of like this:

    double x = 1234;
    for (int i = 1; i <= 2; i++) {
        x = x * .1;
    }
    System.out.println(x);

This prints the result "12.340000000000002". Is there a way, without simply formatting it to two decimal places, to have the double store 12.34 correctly?

Answer

If you use double or float, you should use rounding or expect to see some rounding errors. If you can't do this, use BigDecimal. The problem you have is that 0.1 cannot be represented exactly, and by performing the calculation twice, you are compounding that error. However, 100 can be represented accurately, so try:

    double x = 1234;
    x /= 100;
    System.out.println(x);

which prints:

    12.34

This works because Double.toString(d) performs a small amount of rounding on your behalf, but it is not much. If you are wondering what it might look like without rounding:

    System.out.println(new BigDecimal(0.1));
    System.out.println(new BigDecimal(x));

prints

    0.1000000000000000055511151231257827021181583404541015625
    12.339999999999999857891452847979962825775146484375

In short, rounding is unavoidable for sensible answers in floating point, whether you are doing this explicitly or not.

Note: x / 100 and x * 0.01 are not exactly the same when it comes to rounding error. This is because the rounding error of the first expression depends on the value of x, whereas the 0.01 in the second carries a fixed representation error of its own.
    for (int i = 0; i < 200; i++) {
        double d1 = (double) i / 100;
        double d2 = i * 0.01;
        if (d1 != d2)
            System.out.println(d1 + " != " + d2);
    }

prints

    0.35 != 0.35000000000000003
    0.41 != 0.41000000000000003
    0.47 != 0.47000000000000003
    0.57 != 0.5700000000000001
    0.69 != 0.6900000000000001
    0.7 != 0.7000000000000001
    0.82 != 0.8200000000000001
    0.83 != 0.8300000000000001
    0.94 != 0.9400000000000001
    0.95 != 0.9500000000000001
    1.13 != 1.1300000000000001
    1.14 != 1.1400000000000001
    1.15 != 1.1500000000000001
    1.38 != 1.3800000000000001
    1.39 != 1.3900000000000001
    1.4 != 1.4000000000000001
    1.63 != 1.6300000000000001
    1.64 != 1.6400000000000001
    1.65 != 1.6500000000000001
    1.66 != 1.6600000000000001
    1.88 != 1.8800000000000001
    1.89 != 1.8900000000000001
    1.9 != 1.9000000000000001
    1.91 != 1.9100000000000001
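The answer recommends either rounding or BigDecimal without showing either. The sketch below is one possible way to do both; the class name DecimalShift and its two helper methods are made up for this illustration and are not part of the original answer.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class DecimalShift {

    // Option 1: stay in decimal arithmetic, so the shifted value is exact.
    static BigDecimal shiftLeft(long value, int places) {
        return BigDecimal.valueOf(value).movePointLeft(places);
    }

    // Option 2: keep using double, but round the result to a fixed number of places.
    // BigDecimal.valueOf(double) starts from the same text Double.toString would print.
    static double round(double value, int places) {
        return BigDecimal.valueOf(value)
                .setScale(places, RoundingMode.HALF_UP)
                .doubleValue();
    }

    public static void main(String[] args) {
        // Reproduce the question: multiplying by .1 twice accumulates error.
        double x = 1234;
        for (int i = 1; i <= 2; i++) {
            x = x * .1;
        }
        System.out.println(round(x, 2));        // prints 12.34
        System.out.println(shiftLeft(1234, 2)); // prints 12.34
    }
}
```

The BigDecimal variant is the one to reach for when the value must be exact (money, for example); the rounding variant still stores the nearest double to 12.34, which is the best a double can do.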

November 25, 2012

by Peter Lawrey

· 15,395 Views

Migrating from BIND DNS servers to Amazon Route 53 by using cli53

Amazon Route 53 offers DNS as a hosted service. Amazon maintains a robust DNS service for you, with an easy-to-use web interface and API.

Name servers are part of any serious hosting infrastructure. They have traditionally been run on BIND or similar UNIX daemons. The usual hosting tasks include mapping domain names and subdomains (A, CNAME, and other records) to servers and delivering email (MX and SPF records). A basic hosting plan is usually only good for setting A records for a few subdomains like www, but the instant you need something more advanced, like routing MX records to Google Apps email, you need to start considering a real DNS service. GoDaddy and other big-name registrars provide tools for this, but they are usually limited to the top-level domains (TLDs) the registrar supports. If you come from a cold country with a TLD like .fi, you cannot centralize your DNS management at the big, popular domain registrars.

The bad thing about running your own name servers is that when the service goes down, your whole infrastructure becomes virtually inaccessible: web sites, email, anything. Thus you usually run at least two BIND daemons on two physically separate servers, so that if one blows up the other can still keep serving DNS records. Also DNS, being a core part of internet infrastructure, is subject to many kinds of attacks, and keeping your skills and knowledge of BIND up to date may be time consuming. As with email servers, hosting name servers is a thing you really don't want to do.

Enter Amazon Route 53. (The name Route 53 probably comes from the famous U.S. road Route 66 and DNS port 53?) Route 53 is ridiculously easy to use. Just go to the site and register for an account. However, if you have legacy BIND servers running around, there are ways to import the data automatically instead of manually reading through zone files and typing them into Route 53 by hand.

1. Setting up cli53

cli53 is a Python-based command-line interface for Route 53.
It uses buildout for installation (of Python package dependencies). These instructions have been tested with Ubuntu 8.04 server. The prerequisite for using them is knowing basic Ubuntu server management from the command line. Run everything as root.

Install wget and unzip:

    apt-get install wget unzip

Download a ZIP copy of cli53 from GitHub (we don't do git clone here, so there is no need to install git on the server):

    wget --no-check-certificate https://github.com/barnybug/cli53/zipball/master

Extract:

    unzip master

Install locally under the extracted folder (note that the folder name varies across GitHub exports):

    cd barnybug-cli53-3b468b7
    python bootstrap.py
    bin/buildout

This will generate the local bin/cli53 command.

Now you need to set up AWS credentials. Route 53 and other AWS services are operated using access keys, which you can get from the Security Credentials page in the AWS web console. Set the access keys for your shell/SSH session using environment variables:

    export AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXX
    export AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Side note: at this point you need to check that your server clock is set to the correct time, or your AWS remote operations will fail (300-second threshold). Use the date command to see the server clock time, and correct it with the same date command using this obscure date command syntax.

Now you can run a test by listing your Amazon Route 53 hosted zones (hosted zone = zone file = usually one per domain) using the list command:

    bin/cli53 list

The proper output for one domain which has been created through the web in the Route 53 control panel is something like:

    HostedZones:
    - CallerReference: 9265CCC3-9C41-98CE-8820-E26C0356C478
      Config:
        Comment: test
      Id: /hostedzone/Z2L3NT0WCS8OWA
      Name: xxx.fi.

2. Importing zone files

Now that cli53 is running and working, it is time to rock'n'roll. In our server configuration the BIND zone files lived in /srv/dns/zones, though this is not the default location for Ubuntu BIND.
There is one file per zone, and the filename is the domain name.

    [root@foobar][23:02][/srv/dns/zones]# ls -l
    total 176
    -rw-r--r-- 1 root root  425 Dec 30 2008 abc.com
    -rw-r--r-- 1 root root 1713 Dec 30 2008 abcabc.com
    -rw-r--r-- 1 root root 1719 Dec 30 2008 abcabcabc.com

We can simply create a one-liner shell script which loops through all the files and imports them into Amazon Route 53. However, there is a trick. You'll get an unknown origin exception if your zone files lack a certain line:

    Traceback (most recent call last):
      File "/root/barnybug-cli53-3b468b7/bin/cli53", line 22, in <module>
        cli53.cli53.main()
      File "/root/barnybug-cli53-3b468b7/src/cli53/cli53.py", line 495, in main
        args.func(args)
      File "/root/barnybug-cli53-3b468b7/src/cli53/cli53.py", line 268, in cmd_import
        raise Exception, 'Could not find origin'
    Exception: Could not find origin

This is because the zone file doesn't have an $ORIGIN directive. It's optional with BIND, as BIND falls back to the zone name from its configuration (which here matches the filename) if the directive is not defined, but the directive is mandatory for cli53 (cli53 doesn't yet synthesize $ORIGIN). You can simply add $ORIGIN to the beginning of each file and copy the files to a working directory with the following shell one-liner (it's always safer to make copies of the processed files than to try to fix them in place):

    mkdir ../transfer
    for i in * ; do echo "\$ORIGIN $i." > ../transfer/$i ; cat $i >> ../transfer/$i ; done

Then do a test run for one import:

    ~/barnybug-cli53-3b468b7/bin/cli53 create xxx.fi
    ~/barnybug-cli53-3b468b7/bin/cli53 import xxx.fi --file xxx.fi --replace --wait

Note: deleting zones in the Route 53 control panel is painfully difficult, so make sure you only import zones you really need. If everything looks good, we are OK to upload everything to Route 53.
Again, a shell one-liner does the trick for us:

    for i in * ; do ~/barnybug-cli53-3b468b7/bin/cli53 create $i ; ~/barnybug-cli53-3b468b7/bin/cli53 import $i --file $i --wait ; done

This will take about 15-20 seconds per domain. Note: if you need to re-import, add the --replace flag.

3. The ugly part

Each zone has a Start of Authority (SOA) record and authoritative name server (NS) records. These will change from your BIND servers to Amazon ones: Route 53 assigns its own name servers to each imported hosted zone. However, you cannot know beforehand which name servers the dice have chosen for your hosted zone, so automating this process is a little bit difficult. The NS records are overridden when the zone file is uploaded to Route 53. You can see the new SOA and NS records when you choose the domain in the Route 53 control panel and press Go to Record Sets. To make things easier later on, here is a script which dumps the name servers of all the domains you have in Route 53 and collects the info in the /tmp/domain-info.txt file:

    for i in * ; do ~/barnybug-cli53-3b468b7/bin/cli53 info $i >> /tmp/domain-info.txt ; done

4. Testing Route 53 DNS

Browse through the imported domain records in the Route 53 control panel and see if they look OK. You can also test the actual DNS functionality. When you use the ping command, your web browser, or any other "normal" means of querying DNS records, the authoritative name servers set in your domain registrar's settings are used. In the middle of a migration to Route 53, these still point to the old name servers. The dig command can be used to query a specific name server for DNS records.
- In the Route 53 control panel, pick any domain and any of its name servers
- Use the dig command to query the records of this domain

Example:

    dig -t ANY @ns-1340.awsdns-39.org yourdomain.com

Output should be something like this:

    ; <<>> DiG 9.4.2-P1 <<>> -t ANY @ns-1340.awsdns-39.org yourdomain.com
    ; (1 server found)
    ;; global options: printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9352
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 10, AUTHORITY: 0, ADDITIONAL: 0
    ;; WARNING: recursion requested but not available

    ;; QUESTION SECTION:
    ;yourdomain.com. IN ANY

    ;; ANSWER SECTION:
    yourdomain.com. 172800 IN NS ns-470.awsdns-58.com.
    yourdomain.com. 172800 IN NS ns-553.awsdns-05.net.
    yourdomain.com. 172800 IN NS ns-1340.awsdns-39.org.
    yourdomain.com. 172800 IN NS ns-1706.awsdns-21.co.uk.
    yourdomain.com. 900 IN SOA ns-1340.awsdns-39.org. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400
    yourdomain.com. 0 IN MX 10 aspmx.l.google.com.
    yourdomain.com. 0 IN MX 20 alt1.aspmx.l.google.com.
    yourdomain.com. 0 IN MX 20 alt2.aspmx.l.google.com.
    yourdomain.com. 0 IN TXT "v=spf1 a mx a:smtp.something.fi a:auth-smtp.something.fi include:aspmx.googlemail.com ~all"
    yourdomain.com. 0 IN A 84.20.128.49

    ;; Query time: 32 msec
    ;; SERVER: 205.251.197.60#53(205.251.197.60)
    ;; WHEN: Wed Nov 23 00:04:53 2011
    ;; MSG SIZE rcvd: 413

(subdomains are not listed)

5. Finalizing it

Go to your domain name registrar and update the domain's name servers to point to your Route 53 servers. For each domain in domain-info.txt:

- Go to the registrar from whom you obtained the domain (e.g. domain.ficora.fi)
- Update the authoritative name servers to be as stated in the file
- Wait 24-48 hours, or whatever time-to-live you have for your name servers (usually this is a longish period)
- See that your services still run
- Take down your old name servers

6. More info

http://hathology.com/2011/02/how-to-import-bind-zone-files-into-amazon-route-53/
http://docs.amazonwebservices.com/Route53/latest/DeveloperGuide/index.html?MigratingDNS.html#Step_UpdateRegistrar
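For readers who would rather not rely on shell quoting, the $ORIGIN preprocessing step from section 2 can also be sketched in Java. This is only an illustration under a few assumptions: the class name OriginPrefixer and its methods are invented here, each file name is taken to be the zone's domain name (as in the article), and, unlike the shell one-liner, files that already contain an $ORIGIN line are copied unchanged.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class OriginPrefixer {

    // True if any line of the zone text starts with an $ORIGIN directive.
    static boolean hasOrigin(String zoneText) {
        for (String line : zoneText.split("\n")) {
            if (line.trim().startsWith("$ORIGIN")) {
                return true;
            }
        }
        return false;
    }

    // Prepend "$ORIGIN <domain>." unless the zone text already declares an origin.
    static String withOrigin(String domain, String zoneText) {
        if (hasOrigin(zoneText)) {
            return zoneText;
        }
        return "$ORIGIN " + domain + ".\n" + zoneText;
    }

    // Usage: java OriginPrefixer.java <zone-dir> <transfer-dir>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OriginPrefixer <zone-dir> <transfer-dir>");
            return;
        }
        Path source = Paths.get(args[0]);
        Path target = Paths.get(args[1]);
        Files.createDirectories(target);
        try (DirectoryStream<Path> zones = Files.newDirectoryStream(source)) {
            for (Path zone : zones) {
                if (!Files.isRegularFile(zone)) {
                    continue;
                }
                // The file name doubles as the domain name, as in /srv/dns/zones.
                String domain = zone.getFileName().toString();
                String text = new String(Files.readAllBytes(zone), StandardCharsets.UTF_8);
                // Write a processed copy; the original zone file is left untouched.
                Files.write(target.resolve(domain),
                        withOrigin(domain, text).getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```

As with the shell version, the originals stay in place and only the copies in the transfer directory are meant to be fed to cli53 import.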

November 23, 2012

by Mikko Ohtamaa

· 13,592 Views

A Node.js speed dilemma: AJAX or Socket.IO?

Originally posted by Daniel Chirca

One of the first things I stumbled upon when I started my first Node.js project was how to handle the communication between the browser (the client) and my middleware (the middleware being a Node.js application using the CUBRID Node.js driver (node-cubrid) to exchange information with a CUBRID 8.4.1 database). I was already familiar with AJAX (by the way, thank God for jQuery!), but while studying Node.js I found out about the Socket.IO module and even found some pretty nice code examples on the internet, examples which were very easy to (re)use. So this quickly became a dilemma: what to choose, AJAX or Socket.IO?

Obviously, as my experience was quite limited, I first needed more information. In other words, it was time to do some quality Google searching :) There's a lot of information available and, obviously, one needs to filter out all the "noise" and keep what is really useful. Let me share with you some of the good links I found on the topic:

- http://stackoverflow.com/questions/7193033/nodejs-ajax-vs-socket-io-pros-and-cons
- http://podefr.tumblr.com/post/22553968711/an-innovative-way-to-replace-ajax-and-jsonp-using
- http://stackoverflow.com/questions/4848642/what-is-the-disadvantage-of-using-websocket-socket-io-where-ajax-will-do?rq=1
- http://howtonode.org/websockets-socketio

To summarize, here's what I quickly found:

- Socket.IO (usually) uses a persistent connection between the client and the server (the middleware), so you can reach a maximum limit of concurrent connections depending on the resources you have on the server side (while more asynchronous AJAX requests can be served with the same resources).
- With AJAX you can do RESTful requests. This means that you can take advantage of existing HTTP infrastructure, e.g. proxies that cache requests, and use conditional GET requests.
- There is more (communication) data overhead in AJAX when compared to Socket.IO (HTTP headers, cookies, etc.).
- AJAX is usually faster than Socket.IO to "code".
- When using Socket.IO, two-way communication is possible, where each side (client or server) can initiate a request. With AJAX, only the client can initiate a request!
- Socket.IO has more transport options, including Adobe Flash.

Now, for my own application, what I was most interested in was the speed of making requests and getting data from the (Node.js) server! Regarding the middleware's data communication with the CUBRID database, as ~90% of my data access was read-only, a good data caching mechanism is obviously a great way to go. But I'll talk about that next time.

So I decided to put their speed (AJAX and Socket.IO) to the test, to see which one is faster (at least in my hardware and software environment). My middleware was set up to run on an i5 processor, 8 GB of RAM and an Intel X25 SSD drive. But seriously, every speed test and, generally speaking, any performance test depends so much on your hardware and software configuration that it is always a great idea to try things in your own environment, and to rely less on the various information you find on the internet and more on your own findings!
The tests I decided to do had to meet the following requirements:

- Test: AJAX, Socket.IO persistent connection, Socket.IO non-persistent connections
- Test 10, 100, 250 and 500 data exchanges between the client and the server
- Each data exchange between the middleware server (a Node.js web server) and the client (a browser) is a 4 KB random data string
- Run the server in release (not debug) mode
- Use Firefox as the client
- Minimize the console message output, for both server and client
- Do each test after a full client page reload
- Repeat each test at least 3 times, to make sure the results are consistent

Testing Socket.IO, using a persistent connection

I created a small Node.js server which handled the client requests:

    io.sockets.on('connection', function (client) {
        client.on('send_me_data', function (idx) {
            client.emit('you_have_data', idx, random_string(4096));
        });
    });

And this is the JS client script I used for the test:

    var socket = io.connect(document.location.href);

    socket.on('you_have_data', function (idx, data) {
        var end_time = new Date();
        total_time += end_time - start_time;
        logMsg(total_time + '(ms.) [' + idx + '] - Received ' + data.length + ' bytes.');
        if (idx++ < countMax) {
            setTimeout(function () {
                start_time = new Date();
                socket.emit('send_me_data', idx);
            }, 500);
        }
    });

Testing Socket.IO, using a NON-persistent connection

This time, for each data exchange, I opened a new Socket.IO connection.
The Node.js server code was similar to the previous one, but I decided to send the data back to the client immediately after connect, as a new connection was initiated for each data exchange:

    io.sockets.on('connection', function (client) {
        client.emit('you_have_data', random_string(4096));
    });

The client test code was:

    function exchange(idx) {
        var start_time = new Date();
        var socket = io.connect(document.location.href, {'force new connection' : true});

        socket.on('you_have_data', function (data) {
            var end_time = new Date();
            total_time += end_time - start_time;
            socket.removeAllListeners();
            socket.disconnect();
            logMsg(total_time + '(ms.) [' + idx + '] - Received ' + data.length + ' bytes.');
            if (idx++ < countMax) {
                setTimeout(function () {
                    exchange(idx);
                }, 500);
            }
        });
    }

Testing AJAX

Finally, I put AJAX to the test. The Node.js server code was, again, not that different from the previous ones:

    res.writeHead(200, {'Content-Type' : 'text/plain'});
    res.end('_testcb(\'{"message": "' + random_string(4096) + '"}\')');

As for the client code, this is what I used for the test:

    function exchange(idx) {
        var start_time = new Date();
        $.ajax({
            url : 'http://localhost:8080/',
            dataType : "jsonp",
            jsonpCallback : "_testcb",
            timeout : 300,
            success : function (data) {
                var end_time = new Date();
                total_time += end_time - start_time;
                logMsg(total_time + '(ms.) [' + idx + '] - Received ' + data.length + ' bytes.');
                if (idx++ < countMax) {
                    setTimeout(function () {
                        exchange(idx);
                    }, 500);
                }
            },
            error : function (jqXHR, textStatus, errorThrown) {
                alert('Error: ' + textStatus + " " + errorThrown);
            }
        });
    }

Remember, when combining AJAX and Node.js, you need to take into account that you might be making cross-domain requests and violating the same-origin policy, which is why you should use the JSONP-based format!

By the way, as you can see, I quoted only the most significant parts of the test code, to save space.
If anyone needs the full code, server and client, please let me know and I'll be happy to share it.

OK, it's time now to see what we got after all this work! I ran each test for 10, 100, 250 and 500 data exchanges, and this is what I got in the end:

 Data exchanges | Socket.IO non-persistent (ms.) | AJAX (ms.) | Socket.IO persistent (ms.)
----------------+--------------------------------+------------+----------------------------
             10 |                             90 |         40 |                         32
            100 |                            900 |        320 |                        340
            250 |                          2,400 |        800 |                        830
            500 |                          4,900 |      1,500 |                      1,600

Looking at the results, we can notice a few things right away:

- For each type of test, the results behave quite linearly; this is good, as it shows that the results are consistent.
- The results clearly show that with Socket.IO non-persistent connections, the performance numbers are significantly worse than with the other two approaches.
- There doesn't seem to be a big difference between AJAX and Socket.IO persistent connections; we are talking about differences of only a few milliseconds. This means that if you can live with fewer than 10,000 data exchanges per day, for example, chances are high that the user won't notice a speed difference...

The graph below illustrates the numbers I obtained in the tests:

[Graph of the results in the table above]

...So what's next...?

...Well, I have to figure out what kind of traffic I need to support, and then I will re-run the tests for those numbers, but this time excluding Socket.IO non-persistent connections, because it is obvious that I need to choose between AJAX and persistent Socket.IO connections. I also learned that, most probably, the difference in speed will not be as large as one would expect, at least not for a "small-traffic" web site, so I need to start looking into the other advantages and disadvantages of each approach/technology when choosing my solution!

That's pretty much it for this post. See you next time with a post about Node.js and caching!

P.S. Here are a few more nice resources for finding interesting material about Node.js, Socket.IO and AJAX:

http://socket.io/#how-to-use
http://www.hacksparrow.com/jquery-with-node-js.html
http://www.slideshare.net/toddeichel/nodejs-talk-at-jquery-pittsburgh
http://tech.burningbird.net/article/node-references-and-resources
http://davidwalsh.name/websocket
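To put the totals from the results table in perspective, it can help to divide each total by the number of exchanges. The sketch below does only that arithmetic; the class and method names are made up for illustration, and the numbers are copied from the table, not re-measured.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ExchangeAverages {
    // Number of data exchanges in each test run, as listed in the results table.
    static final int[] EXCHANGES = {10, 100, 250, 500};

    // Total times in milliseconds, copied from the results table.
    static final Map<String, int[]> TOTAL_MS = new LinkedHashMap<>();
    static {
        TOTAL_MS.put("Socket.IO non-persistent", new int[] {90, 900, 2400, 4900});
        TOTAL_MS.put("AJAX", new int[] {40, 320, 800, 1500});
        TOTAL_MS.put("Socket.IO persistent", new int[] {32, 340, 830, 1600});
    }

    // Average time per exchange for one approach and one run (index into EXCHANGES).
    static double averageMs(String approach, int run) {
        return (double) TOTAL_MS.get(approach)[run] / EXCHANGES[run];
    }

    public static void main(String[] args) {
        for (Map.Entry<String, int[]> entry : TOTAL_MS.entrySet()) {
            for (int run = 0; run < EXCHANGES.length; run++) {
                System.out.printf("%-26s %4d exchanges: %.2f ms each%n",
                        entry.getKey(), EXCHANGES[run], averageMs(entry.getKey(), run));
            }
        }
    }
}
```

For the 500-exchange run this works out to roughly 9.8 ms per exchange for non-persistent Socket.IO, against 3.0 ms for AJAX and 3.2 ms for persistent Socket.IO, which is the "few milliseconds" gap described above.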

November 22, 2012

by Esen Sagynov

· 17,400 Views

API Server Design - Making De-Normalization the Norm

In database design classes in Computer Science, we learn that normalization is a good thing. And it certainly is a good thing, for databases. In the case of APIs, it is a different story. If a client must do multiple GETs to obtain the data it needs, or multiple PUTs or POSTs to send up data, just because your database happens to be normalized, then something is wrong. One of the functions of an API Server is to de-normalize your data so that clients are spared from making extra REST API calls, with all of the overhead that goes with them.

Mugunth Kumar explains this very well in this excellent presentation, using Twitter as an example. When you do a GET on a tweet, it returns not only the Tweet itself but also other information (e.g. the description of the Twitter user who sent the tweet). This saves the API client (often a mobile app) from making another request for that data. Effectively, the API Server has gathered up that data, which may come from different database tables, and de-normalized it for the response. You can try it out yourself here, by looking at the JSON which comes back from this Twitter API GET for the most recent Tweet from my timeline.

Many Vordel customers are using the API Server to gather together the data which is returned to the API clients, often taking this data from multiple sources (not only databases, but also message queues and even other APIs). This data is then amalgamated into single JSON or XML structures. It is often then cached at the API Server, in this structure. In this way, clients are spared from doing multiple calls, and instead (like the Twitter API example above) get the data they need in one request, or can PUT or POST up data in one action, rather than piecemeal. De-normalization is key to this process, and is one of the great benefits of an API Server.
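As a rough illustration of the idea, the sketch below joins two normalized "tables" into the single document a client would receive from one GET. Everything in it is hypothetical: the maps stand in for database tables, and the field names and sample data are invented rather than taken from the Twitter API or any Vordel product.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DenormalizedTweet {
    // Stand-ins for two normalized tables keyed by primary key (hypothetical data).
    static final Map<Integer, Map<String, Object>> USERS = new LinkedHashMap<>();
    static final Map<Integer, Map<String, Object>> TWEETS = new LinkedHashMap<>();
    static {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("screen_name", "example_user");
        user.put("description", "Writes about APIs");
        USERS.put(7, user);

        Map<String, Object> tweet = new LinkedHashMap<>();
        tweet.put("text", "De-normalize at the API layer");
        tweet.put("user_id", 7); // foreign key into USERS
        TWEETS.put(100, tweet);
    }

    // Builds the one response document a client gets from a single GET.
    static Map<String, Object> getTweet(int tweetId) {
        Map<String, Object> row = TWEETS.get(tweetId);
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("id", tweetId);
        response.put("text", row.get("text"));
        // Embed the author record instead of returning only the foreign key.
        response.put("user", USERS.get(row.get("user_id")));
        return response;
    }

    public static void main(String[] args) {
        System.out.println(getTweet(100));
    }
}
```

The client never sees user_id; it receives the embedded user object directly, which is what saves the second round trip.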

November 21, 2012

by Oren Eini

· 9,857 Views

Overflow And Underflow of Data Types in Java

Overflow and underflow of values of various data types is a very common occurrence in Java programs. This usually happens because beginners don't pay proper attention to the ranges of the various data types. For example, when we assign a value to a byte variable, we should be aware that integer literals and the results of integer arithmetic are treated as int, so narrowing them to byte is a potential overflow condition.

In Java, overflow and underflow are more serious because the JVM raises no warning or exception when such a condition occurs in ordinary integer arithmetic. Some developers argue that the program should either crash or raise an exception in such cases, but that decision is in the hands of the creators of the programming language. By looking at a problem in your program, you can't tell straight away that an overflow or underflow has occurred. It is only after debugging that we come to know the real cause.

Overflow in int

As the int data type is 32 bits wide in Java, any value that exceeds the 32-bit range rolls over. In numerical terms, adding 1 to Integer.MAX_VALUE (2147483647) returns -2147483648. You don't need to remember these values; the constants Integer.MIN_VALUE and Integer.MAX_VALUE can be used instead.

Underflow of int

Underflow is the opposite of overflow. While we reach the upper limit in the case of overflow, we reach the lower limit in the case of underflow. Thus, after subtracting 1 from Integer.MIN_VALUE, we reach Integer.MAX_VALUE. Here we have rolled over from the lowest value of int to the maximum value.

For floating-point data types, overflow and underflow result in Infinity and zero values, respectively. You may try the following lines to verify this:

    float f = 3.4028235E38f * 20f;
    System.out.println(f);

Note: As with the int data type, there are wrapper classes for all primitive data types, so we can easily see the limits of each data type by looking at the MAX_VALUE and MIN_VALUE constants in these wrapper classes. For Float and Double, however, MIN_VALUE is the smallest positive non-zero value rather than the most negative one.
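The behaviour described above can be checked with a few lines. This is a minimal sketch with invented class and method names; the last method uses Math.addExact, which was added in Java 8 (after this post was written) and throws instead of wrapping silently.

```java
final class OverflowDemo {
    // Plain int arithmetic wraps around silently.
    static int incrementWrapping(int value) {
        return value + 1;
    }

    static int decrementWrapping(int value) {
        return value - 1;
    }

    // Math.addExact (Java 8+) throws ArithmeticException on overflow.
    static int incrementChecked(int value) {
        return Math.addExact(value, 1);
    }

    public static void main(String[] args) {
        System.out.println(incrementWrapping(Integer.MAX_VALUE)); // -2147483648
        System.out.println(decrementWrapping(Integer.MIN_VALUE)); // 2147483647
        System.out.println((byte) 200);                           // -56
        System.out.println(Float.MAX_VALUE * 20f);                // Infinity
        System.out.println(Float.MIN_VALUE / 2f);                 // 0.0
        try {
            incrementChecked(Integer.MAX_VALUE);
        } catch (ArithmeticException e) {
            System.out.println("overflow detected");
        }
    }
}
```

Where silent wrap-around would be a bug, the checked variant at least turns it into an exception that points at the failing operation.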
Read more: http://extreme-java.blogspot.com/2012/11/overflow-and-underflow-of-data-types-in.html#ixzz2BvqFu7fk

November 15, 2012

by Sandeep Bhandari

· 69,002 Views · 1 Like

Spock and testing RESTful API services

Spock is a BDD testing framework that makes BDD tests easy to write. The framework is an extension of JUnit, which allows for easy IDE integration and reuse of existing JUnit functionality. Spock tests are written in Groovy and can be used for a wide range of tests, from small unit tests to full application integration tests. Without going into too much detail on how to write Spock-based tests (see below for a few excellent links), let's go through how we can use the framework to build integration tests for a RESTful API.

Our first RESTful API Test

    package com.wolfware.integration

    import groovyx.net.http.RESTClient
    import spock.lang.*
    import spock.lang.Specification
    import com.movideo.spock.extension.APIVersion
    import com.movideo.spock.extension.EnvironmentEndPoint

    @APIVersion(minimimApiVersion="1.0.0.0")
    class GetAuthenticationToken extends Specification {

        @EnvironmentEndPoint
        protected def environmentHost

        def "Get authentication token XML from API for valid account"() {
            given: "a valid account"
            def authenticationTokenRequestParams = ['key':"AAABBBCCC123", 'user':"[email protected]"]

            and: "a client to get the authentication token XML"
            def client = new RESTClient(environmentHost)

            when: "we attempt to retrieve authentication token XML"
            def resp = client.get(path : "/authenticate", query : authenticationTokenRequestParams)

            then: "we should get a valid authentication token XML response"
            assert resp.data.token.isEmpty() == false
            // lots more asserts
        }
    }

As you can see, apart from the @APIVersion and @EnvironmentEndPoint annotations (these are Spock extensions, explained later), the spec is a fairly simple Spock test. This specification has a feature that, as the name suggests, gets an authentication token in XML format and validates it.

Let's look at each step:

- Given: the URL parameters required to get an authentication token from the RESTful service.
- When: using the Groovy RestClient to call the RESTful service for the authentication token details.
- Then: we can assert all the details of the response.

The thing I really like about Spock is the readability of the tests: from the name being a descriptive sentence, rather than some shorthand with _ throughout to make a valid method name, to being able to easily see where the test is set up and where the expectations and assertions are.

Trying to test any environment RESTful service

I've found that when trying to write integration tests, there has been either:

- Hard-coded environment details, with the code branched for each environment, making it nearly impossible to keep the code in sync as merge hell becomes the norm; or
- Config files that define the environment details, again checked into each branch for each environment.

Trying to follow the principles of continuous delivery, it would be great to be able to use the same code base to test against any environment. This is where Spock Extensions come into play to help us out.

Spock Extensions

In short, Spock allows us to extend it to perform other functionality during the test life-cycle (a great write-up on extensions can be found in this excellent blog post). I've developed two extensions which make it easier to run the same test suite across different environments.

The @EnvironmentEndPoint Extension

The aim of this Spock extension is to have a placeholder variable in code that, at run-time, can be set to the environment host of the RESTful services that we want to test.
    package com.movideo.runtime.extension.custom

    import org.apache.commons.logging.Log
    import org.apache.commons.logging.LogFactory
    import org.spockframework.runtime.extension.AbstractAnnotationDrivenExtension
    import org.spockframework.runtime.extension.AbstractMethodInterceptor
    import org.spockframework.runtime.extension.IMethodInvocation
    import org.spockframework.runtime.model.FieldInfo
    import org.spockframework.runtime.model.SpecInfo

    /**
     * Spock Environment Annotation Extension
     */
    class EnvironmentEndPointExtension extends AbstractAnnotationDrivenExtension<EnvironmentEndPoint> {

        // Name the logger after this class; getClass() in a static context resolves to java.lang.Class
        private static final Log LOG = LogFactory.getLog(EnvironmentEndPointExtension);

        private static def config = new ConfigSlurper().parse(new File('src/test/resources/SpockConfig.groovy').toURL())

        /**
         * env environment variable
         *
         * Defaults to {@code LOCAL_END_POINT}
         */
        private static final String envString = System.getProperties().getProperty("env", config.envHost);

        static {
            LOG.info("Environment End Point [" + envString + "]")
        }

        /**
         * {@inheritDoc}
         */
        @Override
        void visitFieldAnnotation(EnvironmentEndPoint annotation, FieldInfo field) {
            def interceptor = new EnvironmentInterceptor(field, envString)
            interceptor.install(field.parent.getTopSpec())
        }
    }

    /**
     * Environment Interceptor
     */
    class EnvironmentInterceptor extends AbstractMethodInterceptor {

        private final FieldInfo field
        private final String envString

        EnvironmentInterceptor(FieldInfo field, String envString) {
            this.field = field
            this.envString = envString
        }

        // Writes the environment host into the annotated placeholder field
        private void injectEnvironmentHost(target) {
            field.writeValue(target, envString)
        }

        @Override
        void interceptSetupMethod(IMethodInvocation invocation) {
            injectEnvironmentHost(invocation.target)
            invocation.proceed()
        }

        @Override
        void install(SpecInfo spec) {
            spec.setupMethod.addInterceptor this
        }
    }

The EnvironmentEndPointExtension class defines the following:

- config: a ConfigSlurper that parses the config file 'SpockConfig.groovy', which defines the default environment host (envHost).
- envString: gets the value of 'env' from the System Properties (these include run-time properties) and defaults to config.envHost.

With the environment host accessible to Spock, we now need to inject it into the placeholder variable for Spock tests to access. An interceptor is created which injects (via the field.writeValue method) the value of the environment host into the placeholder variable. This placeholder is the one that @EnvironmentEndPoint annotates. When the test is run, the interceptor sets the placeholder variable and the test can then use this value as the host for the RestClient object. When running the Spock tests, either the default value from the config file is used or the JVM argument -Denv=? can be supplied. This makes running the same test code base against any environment so much easier.

A note on Gradle builds: by default, Gradle will not pass JVM arguments through to forked processes such as running tests. The code snippet below shows how to achieve this:

    /*
     * Required to pass all system properties to Test tasks.
     * Not default for Gradle to pass system properties through to forked processes.
     */
    tasks.withType(Test) {
        def config = new ConfigSlurper().parse(new File('src/test/resources/SpockConfig.groovy').toURL())
        systemProperty 'env', System.getProperty('env', config.envHost)
    }

This allows all tasks of type 'Test' to have some custom code run. In this case, we parse the 'SpockConfig.groovy' config file and then set the 'env' 'systemProperty' on Gradle Test tasks, taking the value either from the passed-in JVM argument or from the config file. With this code in build.gradle, we're able to run all tests via a Gradle test build, which will produce lovely test reports (in Gradle HTML and JUnit XML).
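The fallback used for the environment host (take -Denv if present, otherwise the configured default) comes down to a single standard-library call. Here is a minimal Java sketch of that lookup; the class name and DEFAULT_HOST value are hypothetical stand-ins for config.envHost from SpockConfig.groovy.

```java
final class EnvHost {
    // Hypothetical default, standing in for config.envHost from SpockConfig.groovy.
    static final String DEFAULT_HOST = "http://localhost:8080";

    // Returns the value of -Denv=... if it was supplied, otherwise the default.
    static String resolve() {
        return System.getProperty("env", DEFAULT_HOST);
    }

    public static void main(String[] args) {
        System.out.println("Environment End Point [" + resolve() + "]");
    }
}
```

Running with `java -Denv=http://staging.example.com EnvHost` (an invented host) would print that host; running without the flag prints the default.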
The @APIVersion Extension

Another integration testing problem I've found comes up when we develop our tests first (or at least while developing a feature or bug fix). Because we use the same test code base everywhere, running those tests against an environment that doesn't yet have the new code gives us failing tests that aren't really failures: the new code simply isn't there yet. To help solve this problem, I've developed the @APIVersion extension. As newly developed code should be deployed with a new version, we can compare the deployed version to the minimum version that a test can be run against.

    package com.movideo.runtime.extension.custom

    import groovyx.net.http.RESTClient
    import java.lang.annotation.Annotation
    import java.util.regex.Pattern
    import org.apache.commons.logging.Log
    import org.apache.commons.logging.LogFactory
    import org.spockframework.runtime.extension.AbstractAnnotationDrivenExtension
    import org.spockframework.runtime.model.FeatureInfo
    import org.spockframework.runtime.model.SpecInfo

    /**
     * API Version Extension
     */
    class APIVersionExtension extends AbstractAnnotationDrivenExtension<APIVersion> {

        /**
         * Logger (named after this class; getClass() in a static context resolves to java.lang.Class)
         */
        private static final Log LOG = LogFactory.getLog(APIVersionExtension);

        /**
         * Parsed SpockConfig.groovy
         */
        private static def config = new ConfigSlurper().parse(new File('src/test/resources/SpockConfig.groovy').toURL())

        /**
         * env environment variable
         *
         * Defaults to {@code LOCAL_END_POINT}
         */
        private static final String envString = System.getProperties().getProperty("env", config.envHost);

        /**
         * Version REGX pattern
         */
        private static final def VERSION_PATTERN = Pattern.compile(".", Pattern.LITERAL);

        /**
         * Max version length
         */
        private static final def MAX_VERSION_LENGTH = 4;

        /**
         * Current API Version
         */
        private static final def CURRENT_API_VERSION = getDeployedAPIVersion();

        /**
         * {@inheritDoc}
         */
        @Override
        void visitFeatureAnnotation(APIVersion annotation, FeatureInfo feature) {
            if(!isApiVersionGreaterThanMinApiVersion(annotation, feature.name)) {
                feature.setSkipped(true)
            }
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public void visitSpecAnnotation(APIVersion annotation, SpecInfo spec) {
            if(!isApiVersionGreaterThanMinApiVersion(annotation, spec.name)) {
                spec.setSkipped(true)
            }
        }

        /**
         * Get the current deployed API version
         *
         * Performs a HTTP request to the current deployed API version. Parses the returned data and get the {@code version} node data.
         * @return current deployed API version
         */
        private static String getDeployedAPIVersion() {
            def apiVersion = null
            try {
                def client = new RESTClient(envString)
                def resp = client.get(path : config.versionServiceUri)
                apiVersion = resp.data.version
                LOG.info("Current deployed API version [" + apiVersion + "]");
            } catch (ex) {
                APIVersionError apiVersionError = new APIVersionError("Error occurred attempting to get current deployed API version from %s", envString + config.versionServiceUri);
                apiVersionError.setStackTrace(ex.stackTrace);
                throw apiVersionError;
            }
            return apiVersion
        }

        /**
         * @param annotation
         * @param infoName
         * @return
         */
        private boolean isApiVersionGreaterThanMinApiVersion(APIVersion annotation, String infoName) {
            def isApiVersionGreaterThanMinApiVersion = true
            def minApiVersionRequired = annotation.minimimApiVersion();

            // normalise both version id's
            def apiVersionNormalised = normaliseVersion(CURRENT_API_VERSION);
            def minApiVersionRequiredNormalised = normaliseVersion(minApiVersionRequired);

            // compare version id's
            int cmp = apiVersionNormalised.compareTo(minApiVersionRequiredNormalised);

            // if the comparison is less than 0, min API version is greater than the deployed API version
            if(cmp < 0) {
                LOG.info("min api version [" + minApiVersionRequired + "] greater than api version [" + CURRENT_API_VERSION + "], skipping [" + infoName + "]")
                isApiVersionGreaterThanMinApiVersion = false
            }
            return isApiVersionGreaterThanMinApiVersion
        }

        /**
         * @param version
         * @return
         */
        private String normaliseVersion(String version) {
            String[] split = VERSION_PATTERN.split(version);
            StringBuilder sb = new StringBuilder();
            for (String s : split) {
                // right-align every segment to a fixed width so plain string comparison orders versions
                sb.append(String.format("%" + MAX_VERSION_LENGTH + 's', s));
            }
            return sb.toString();
        }
    }

The @APIVersion extension defines the same environment config as the @EnvironmentEndPoint extension, so that the environment can be injected and used purely for accessing the API version endpoint without the need for @EnvironmentEndPoint. The RESTful API version endpoint must be set up and publicly available. The @APIVersion extension will call this service to get details about the version of the RESTful API. The version response data should be as follows:

    Media API 1.51.1

The @APIVersion extension will look for the version data to determine the currently deployed version of the RESTful API. Once the version of the RESTful API is known, the extension checks the minimum API version required. Example:

    @APIVersion(minimimApiVersion="1.0.0.0")

The extension then compares this value against the version in the response data, and if the required version is greater than that of the deployed RESTful API services, the test is skipped. This extension annotation can be placed on Specifications or Features, allowing whole Specs to have a minimum version and/or Features to have their own minimum version. This extension has made writing integration tests with Spock even more portable and allows for a 'build once' set of tests that can be run against any environment, with some small changes to allow getting the API version.

The SpockConfig.groovy file

Here is an example of the SpockConfig.groovy config file used to configure defaults for both the @EnvironmentEndPoint and @APIVersion extensions.
    versionServiceUri="/public/serviceInformation"
    envHost="http://api.preview.movideo.com"

- 'versionServiceUri' is required by the @APIVersion extension as the URI of the RESTful API version service.
- 'envHost' is required by both the @APIVersion and @EnvironmentEndPoint extensions as the host of the RESTful API.

Go and start testing

Hopefully these Spock extensions will help your Spock integration tests. The framework is really easy and fun to use for building essential tests across the whole test stack. Check out my GitHub projects for the code for both extensions. I hope this post has been helpful, and hopefully I'll post something sooner for my next post.

References and really helpful links

- Spock Homepage
- Annotation Driven Extensions With Spock
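The version check in the @APIVersion extension relies on padding each dot-separated segment to a fixed width and then comparing the padded strings. The Java sketch below isolates that technique so it can be tried on its own; the class and method names are invented for illustration, and the width of 4 mirrors MAX_VERSION_LENGTH in the extension.

```java
import java.util.regex.Pattern;

final class VersionCompare {
    // Split on a literal dot, as the extension's VERSION_PATTERN does.
    private static final Pattern DOT = Pattern.compile(".", Pattern.LITERAL);
    // Segment width; mirrors MAX_VERSION_LENGTH in the extension.
    private static final int MAX_SEGMENT_LENGTH = 4;

    // Right-aligns each segment with spaces: "1.51.1" becomes "   1  51   1".
    static String normalise(String version) {
        StringBuilder sb = new StringBuilder();
        for (String segment : DOT.split(version)) {
            sb.append(String.format("%" + MAX_SEGMENT_LENGTH + "s", segment));
        }
        return sb.toString();
    }

    // True when the deployed version is at least the required minimum.
    static boolean meetsMinimum(String deployed, String minimum) {
        return normalise(deployed).compareTo(normalise(minimum)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(meetsMinimum("1.51.1", "1.0.0.0")); // true
        System.out.println(meetsMinimum("1.9.0", "1.10.0"));   // false
    }
}
```

Padding works because a space sorts before any digit, so "   9" compares lower than "  10" even though "9" sorts after "10" as a bare string. One caveat of this approach is that a version with fewer segments sorts before a longer one sharing the same prefix, so "1.0" counts as lower than "1.0.0.0"; segments longer than four characters would also break the alignment.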

November 14, 2012

by Christian Strzadala

· 39,904 Views · 1 Like

Composite Keys in Cassandra

Introduction

A composite key consists of more than one primary key field. Each field must be of a data type supported by the underlying data-store. In JPA (Java Persistence API), there are two ways of specifying composite keys:

1. Composite Primary Key:

    @Entity
    @IdClass(TimelineId.class)
    public class Timeline {
        @Id int userId;
        @Id long tweetId;
        // Other non-primary key fields
    }

    class TimelineId {
        int userId;
        long tweetId;
    }

2. Embedded Primary Key:

    @Entity
    public class Timeline {
        @EmbeddedId TimelineId id;
        // Other non-primary key fields
    }

    @Embeddable
    class TimelineId {
        int userId;
        long tweetId;
    }

The Timeline entity above is inspired by the well-known twissandra example. Starting with the 1.1 release, Cassandra supports composite keys.

Cassandra Composite Keys in Action

Visit this page in order to understand Cassandra Schema in general. In this section I will give you a feel for how composite keys are stored in Cassandra. Let's start a Cassandra 1.1.x server and run the following commands from the Cassandra/bin directory:

CQL:

    ./cqlsh -3 localhost 9160

    CREATE KEYSPACE twissandra with strategy_class = 'SimpleStrategy' and strategy_options:replication_factor=1;
    use twissandra;
    CREATE TABLE timeline(
        user_id varchar,
        tweet_id varchar,
        tweet_device varchar,
        author varchar,
        body varchar,
        PRIMARY KEY(user_id,tweet_id,tweet_device));

    INSERT INTO timeline (user_id, tweet_id, tweet_device, author, body) VALUES ('xamry', 't1', 'web', 'Amresh', 'Here is my first tweet');
    INSERT INTO timeline (user_id, tweet_id, tweet_device, author, body) VALUES ('xamry', 't2', 'sms', 'Saurabh', 'Howz life Xamry');
    INSERT INTO timeline (user_id, tweet_id, tweet_device, author, body) VALUES ('mevivs', 't1', 'iPad', 'Kuldeep', 'You der?');
    INSERT INTO timeline (user_id, tweet_id, tweet_device, author, body) VALUES ('mevivs', 't2', 'mobile', 'Vivek', 'Yep, I suppose');

    cqlsh:twissandra> select * from timeline;
     user_id | tweet_id | author  | body
    ---------+----------+---------+------------------------
       xamry |       t1 |  Amresh | Here is my first tweet
       xamry |       t2 | Saurabh | Howz life Xamry
      mevivs |       t1 | Kuldeep | You der?
      mevivs |       t2 |   Vivek | Yep, I suppose

    cqlsh:twissandra> SELECT * FROM timeline WHERE user_id='xamry';
     user_id | tweet_id | tweet_device | author  | body
    ---------+----------+--------------+---------+------------------------
       xamry |       t1 |          web |  Amresh | Here is my first tweet
       xamry |       t2 |          sms | Saurabh | Howz life Xamry

    cqlsh:twissandra> select * from timeline where tweet_id = 't1';
     user_id | tweet_id | tweet_device | author  | body
    ---------+----------+--------------+---------+------------------------
       xamry |       t1 |          web |  Amresh | Here is my first tweet
      mevivs |       t1 |         iPad | Kuldeep | You der?

    cqlsh:twissandra> select * from timeline where user_id = 'xamry' and tweet_id='t1';
     user_id | tweet_id | tweet_device | author | body
    ---------+----------+--------------+--------+------------------------
       xamry |       t1 |          web | Amresh | Here is my first tweet

    cqlsh:twissandra> select * from timeline where user_id = 'xamry' and author='Amresh';
    Bad Request: No indexed columns present in by-columns clause with Equal operator

    cqlsh:twissandra> select * from timeline where user_id = 'xamry' and tweet_device='web';
    Bad Request: PRIMARY KEY part tweet_device cannot be restricted (preceding part tweet_id is either not restricted or by a non-EQ relation)

    cqlsh:twissandra> select * from timeline where user_id = 'xamry' and tweet_id = 't1' and tweet_device='web';
     user_id | tweet_id | tweet_device | author | body
    ---------+----------+--------------+--------+------------------------
       xamry |       t1 |          web | Amresh | Here is my first tweet

Cassandra-cli:

    impadmin@impetus-ubuntu:/usr/local/apache-cassandra-1.1.2/bin$ ./cassandra-cli -h localhost -p 9160
    Connected to: "Test Cluster" on localhost/9160
    Welcome to Cassandra CLI version 1.1.2

    Type 'help;' or '?' for help.
    Type 'quit;' or 'exit;' to quit.

    [default@unknown] use twissandra;
    Authenticated to keyspace: twissandra
    [default@twissandra] list timeline;
    Using default limit of 100
    Using default column limit of 100
    -------------------
    RowKey: xamry
    => (column=t1:web:author, value=Amresh, timestamp=1343729388951000)
    => (column=t1:web:body, value=Here is my first tweet, timestamp=1343729388951001)
    => (column=t2:sms:author, value=Saurabh, timestamp=1343729388973000)
    => (column=t2:sms:body, value=Howz life Xamry, timestamp=1343729388973001)
    -------------------
    RowKey: mevivs
    => (column=t1:iPad:author, value=Kuldeep, timestamp=1343729388991000)
    => (column=t1:iPad:body, value=You der?, timestamp=1343729388991001)
    => (column=t2:mobile:author, value=Vivek, timestamp=1343729389941000)
    => (column=t2:mobile:body, value=Yep, I suppose, timestamp=1343729389941001)

Observations

- The first part of the composite key (user_id) is called the "Partition Key"; the rest (tweet_id, tweet_device) are the remaining keys.
- Cassandra stores columns differently when composite keys are used. The partition key becomes the row key. The remaining keys are concatenated with each column name (with ":" as separator) to form the stored column names. Column values remain unchanged.
- The remaining keys (other than the partition key) are ordered, and searching on an arbitrary one of them is not allowed: you have to start with the first one, then you can move to the second one, and so on. This is evident from the "Bad Request" error.
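The observations above can be modelled in a few lines of Java. This is only a toy model of the layout shown by cassandra-cli, not Cassandra's actual storage code: the class and method names are invented, and the restriction check covers just the "start with the first remaining key" rule for equality conditions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class CompositeKeyLayout {
    // The remaining (non-partition) key parts, in declaration order.
    static final List<String> REMAINING_KEYS = Arrays.asList("tweet_id", "tweet_device");

    // Row key -> (composite column name -> value); columns stay sorted by name.
    final Map<String, TreeMap<String, String>> rows = new LinkedHashMap<>();

    // Stores one CQL row as two physical columns under the partition key.
    void insert(String userId, String tweetId, String tweetDevice, String author, String body) {
        TreeMap<String, String> columns = rows.computeIfAbsent(userId, k -> new TreeMap<>());
        String prefix = tweetId + ":" + tweetDevice + ":";
        columns.put(prefix + "author", author);
        columns.put(prefix + "body", body);
    }

    // A remaining key may be restricted only if every key before it is restricted too.
    static boolean canRestrict(Set<String> equalityRestricted) {
        boolean gapSeen = false;
        for (String key : REMAINING_KEYS) {
            if (equalityRestricted.contains(key)) {
                if (gapSeen) {
                    return false; // an earlier remaining key was skipped
                }
            } else {
                gapSeen = true;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        CompositeKeyLayout layout = new CompositeKeyLayout();
        layout.insert("xamry", "t1", "web", "Amresh", "Here is my first tweet");
        layout.insert("xamry", "t2", "sms", "Saurabh", "Howz life Xamry");
        System.out.println(layout.rows);
    }
}
```

Because the stored column names begin with tweet_id, then tweet_device, columns sharing a tweet_id sit next to each other within a row, which suggests why tweet_device can only be restricted once tweet_id has been pinned down.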

November 14, 2012

by Amresh Singh

· 20,362 Views · 1 Like