Data Engineering Resources

The Latest Data Engineering Topics

Difference Between MySQL REPLACE and INSERT ... ON DUPLICATE KEY UPDATE

My friend Roshan and I were recently working as support developers for a well-known Australian e-commerce website. Roshan was assigned a bug related to the product synchronization process between the warehouse product table and the e-commerce site. His main task was to compare the site's product table with the warehouse product table and then either insert new data into the site database or update an existing record there. Of course, doing a lookup to see whether the record already exists and then either updating or inserting would be an expensive process (existing items are identified by either a unique key or a primary key). Luckily, MySQL offers two statements for this, each with a very different approach:

1. REPLACE = DELETE + INSERT
2. INSERT ... ON DUPLICATE KEY UPDATE = UPDATE if the key exists, otherwise INSERT

1. REPLACE

The syntax is the same as for the INSERT statement. When dealing with a record that has a unique or primary key, REPLACE will either do a DELETE followed by an INSERT, or just an INSERT if no matching record exists. When a match exists, the record is removed and a new one is inserted at the end. This can fragment the index and decrease the efficiency of the table.

    REPLACE INTO ds_product SET pID = 3112, catID = 231, uniCost = 232.50, salePrice = 250.23;

2. ON DUPLICATE KEY UPDATE

This is a clause added to the INSERT statement. It actively hunts down an existing record in the table which has the same UNIQUE or PRIMARY KEY as the one we’re trying to insert. If it finds one, it updates the column(s) you list in the clause. Otherwise, it does a normal INSERT.

    INSERT INTO ds_product SET pID = 3112, catID = 231, uniCost = 232.50, salePrice = 250.23
    ON DUPLICATE KEY UPDATE uniCost = 232.50, salePrice = 250.23;

This should be helpful when writing database queries that add or update information, without having to go through the extra lookup step. Thanks, and have a nice day.
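To make the practical difference between the two approaches concrete, here is a minimal sketch. It is an assumption-laden stand-in rather than MySQL itself: it uses Python's built-in sqlite3 module, where REPLACE INTO behaves as delete-then-insert and the upsert is spelled ON CONFLICT ... DO UPDATE instead of ON DUPLICATE KEY UPDATE (SQLite 3.24 or newer is assumed). The function name is made up for illustration; the table and column names follow the article.

```python
import sqlite3


def compare_replace_and_upsert():
    """Show what survives an upsert versus a REPLACE on an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ds_product ("
        " pID INTEGER PRIMARY KEY, catID INTEGER, uniCost REAL, salePrice REAL)"
    )
    conn.execute("INSERT INTO ds_product VALUES (3112, 231, 232.50, 250.23)")

    # Upsert: the existing row is updated in place, so columns that are
    # not mentioned in the UPDATE part keep their old values.
    conn.execute(
        "INSERT INTO ds_product (pID, salePrice) VALUES (3112, 260.00) "
        "ON CONFLICT(pID) DO UPDATE SET salePrice = excluded.salePrice"
    )
    query = "SELECT catID, uniCost, salePrice FROM ds_product WHERE pID = 3112"
    after_upsert = conn.execute(query).fetchone()

    # REPLACE: the old row is deleted and a new one inserted, so columns
    # that are not listed fall back to their defaults (NULL here).
    conn.execute("REPLACE INTO ds_product (pID, salePrice) VALUES (3112, 270.00)")
    after_replace = conn.execute(query).fetchone()

    conn.close()
    return after_upsert, after_replace


if __name__ == "__main__":
    upserted, replaced = compare_replace_and_upsert()
    print("after upsert: ", upserted)
    print("after REPLACE:", replaced)
```

MySQL documents the same effect for REPLACE: columns missing from the statement are set to their default values, which is one more reason to prefer the update form when only a few columns change.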

October 3, 2012

by Prathap Givantha Kalansuriya

· 14,400 Views

Parsing a Connection String With 'Sprache' C# Parser

Sprache is a very cool lightweight parser library for C#. Today I was experimenting with parsing EasyNetQ connection strings, so I thought I’d have a go at getting Sprache to do it. An EasyNetQ connection string is a list of key-value pairs like this:

    key1=value1;key2=value2;key3=value3

The motivation for looking at something more sophisticated than simply chopping strings based on delimiters is that I’m thinking of having more complex values that would themselves need parsing. But that’s for the future; today I’m just going to parse a simple connection string where the values can be strings or numbers (ushort, to be exact). So, I want to parse a connection string that looks like this:

    virtualHost=Copa;username=Copa;host=192.168.1.1;password=abc_xyz;port=12345;requestedHeartbeat=3

… into a strongly typed structure like this:

    public class ConnectionConfiguration : IConnectionConfiguration
    {
        public string Host { get; set; }
        public ushort Port { get; set; }
        public string VirtualHost { get; set; }
        public string UserName { get; set; }
        public string Password { get; set; }
        public ushort RequestedHeartbeat { get; set; }
    }

I want it to be as easy as possible to add new connection string items. First let’s define a name for a function that updates a ConnectionConfiguration. An uncommonly used version of the ‘using’ statement allows us to give a short name to a complex type:

    using UpdateConfiguration = Func<ConnectionConfiguration, ConnectionConfiguration>;

Now let’s define a little function that creates a Sprache parser for a key-value pair. We supply the key and a parser for the value and get back a parser that can update the ConnectionConfiguration.
    public static Parser<UpdateConfiguration> BuildKeyValueParser<T>(
        string keyName,
        Parser<T> valueParser,
        Expression<Func<ConnectionConfiguration, T>> getter)
    {
        return
            from key in Parse.String(keyName).Token()
            from separator in Parse.Char('=')
            from value in valueParser
            select (Func<ConnectionConfiguration, ConnectionConfiguration>)(c =>
            {
                CreateSetter(getter)(c, value);
                return c;
            });
    }

CreateSetter is a little function that turns a property expression (like x => x.Name) into an Action that sets that property. Next let’s define parsers for string and number values:

    public static Parser<string> Text = Parse.CharExcept(';').Many().Text();
    public static Parser<ushort> Number = Parse.Number.Select(ushort.Parse);

Now we can chain a series of BuildKeyValueParser invocations and Or them together so that we can parse any of our expected key-values:

    public static Parser<UpdateConfiguration> Part = new List<Parser<UpdateConfiguration>>
    {
        BuildKeyValueParser("host", Text, c => c.Host),
        BuildKeyValueParser("port", Number, c => c.Port),
        BuildKeyValueParser("virtualHost", Text, c => c.VirtualHost),
        BuildKeyValueParser("requestedHeartbeat", Number, c => c.RequestedHeartbeat),
        BuildKeyValueParser("username", Text, c => c.UserName),
        BuildKeyValueParser("password", Text, c => c.Password),
    }.Aggregate((a, b) => a.Or(b));

Each invocation of BuildKeyValueParser defines an expected key-value pair of our connection string. We just give the key name, the parser that understands the value, and the property on ConnectionConfiguration that we want to update. In effect we’ve defined a little DSL for connection strings. If I want to add a new connection string value, I simply add a new property to ConnectionConfiguration and a single line to the above code.
Now let’s define a parser for the entire string, by saying that we’ll parse one key-value part followed by any number of further parts separated by semicolons:

    public static Parser<IEnumerable<UpdateConfiguration>> ConnectionStringBuilder =
        from first in Part
        from rest in Parse.Char(';').Then(_ => Part).Many()
        select Cons(first, rest);

All we have to do now is parse the connection string and apply the chain of update functions to a ConnectionConfiguration instance:

    public IConnectionConfiguration Parse(string connectionString)
    {
        var updater = ConnectionStringGrammar.ConnectionStringBuilder.Parse(connectionString);
        return updater.Aggregate(new ConnectionConfiguration(), (current, updateFunction) => updateFunction(current));
    }

We get lots of nice things out of the box with Sprache. One of the best is the excellent error messages:

    Parsing failure: unexpected 'x'; expected host or port or virtualHost or requestedHeartbeat or username or password (Line 1, Column 1).

Sprache is really nice for this kind of task. I’d recommend checking it out.
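The "one line per key" idea is independent of Sprache. The following is only a rough sketch of that table-driven design in Python, not a port of the parser above: it deliberately uses plain string splitting instead of parser combinators, and every name in it (PARTS, parse_number, parse_connection_string, the snake_case attributes) is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConnectionConfiguration:
    host: str = ""
    port: int = 0
    virtual_host: str = ""
    user_name: str = ""
    password: str = ""
    requested_heartbeat: int = 0


def parse_number(text):
    """Parse an unsigned 16-bit number, mirroring the ushort values above."""
    value = int(text)
    if not 0 <= value <= 0xFFFF:
        raise ValueError(f"{text!r} does not fit in an unsigned 16-bit value")
    return value


# One entry per supported key: (value parser, attribute to update).
# Supporting a new key means adding one field and one line here.
PARTS = {
    "host": (str, "host"),
    "port": (parse_number, "port"),
    "virtualHost": (str, "virtual_host"),
    "requestedHeartbeat": (parse_number, "requested_heartbeat"),
    "username": (str, "user_name"),
    "password": (str, "password"),
}


def parse_connection_string(connection_string):
    """Apply one update per key=value part to a fresh configuration."""
    config = ConnectionConfiguration()
    for part in connection_string.split(";"):
        key, separator, raw_value = part.partition("=")
        key = key.strip()
        if not separator or key not in PARTS:
            expected = " or ".join(PARTS)
            raise ValueError(f"unexpected {key!r}; expected {expected}")
        value_parser, attribute = PARTS[key]
        setattr(config, attribute, value_parser(raw_value))
    return config


if __name__ == "__main__":
    print(parse_connection_string("host=192.168.1.1;port=12345"))
```

The table plays the role of the Part parser: each entry pairs a key with a value parser and a target property. What this sketch cannot do is parse nested or structured values, which is exactly the case that motivates a real parser library.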

October 3, 2012

by Mike Hadlow

· 7,524 Views

Customizing Spring Data JPA Repository

Spring Data is a very convenient library. However, as the project is quite new, it is not yet feature-rich. By default, Spring Data JPA provides an implementation of the DAO based on SimpleJpaRepository. In a recent project, I developed a customized repository base class so that I could add more features to it. You could add vendor-specific features to this repository base class as you like.

Configuration

You have to add the following configuration to your Spring beans configuration file. You have to specify a new repository factory class. We will develop the class later.

    public class GenericRepositoryImpl<T, ID extends Serializable>
            extends SimpleJpaRepository<T, ID>
            implements GenericRepository<T, ID>, Serializable {

        private static final long serialVersionUID = 1L;

        static Logger logger = Logger.getLogger(GenericRepositoryImpl.class);

        private final JpaEntityInformation<T, ?> entityInformation;
        private final EntityManager em;
        private final DefaultPersistenceProvider provider;

        private Class<?> springDataRepositoryInterface;

        public Class<?> getSpringDataRepositoryInterface() {
            return springDataRepositoryInterface;
        }

        public void setSpringDataRepositoryInterface(Class<?> springDataRepositoryInterface) {
            this.springDataRepositoryInterface = springDataRepositoryInterface;
        }

        /**
         * Creates a new {@link SimpleJpaRepository} to manage objects of the given
         * {@link JpaEntityInformation}.
         *
         * @param entityInformation
         * @param entityManager
         */
        public GenericRepositoryImpl(JpaEntityInformation<T, ?> entityInformation,
                EntityManager entityManager, Class<?> springDataRepositoryInterface) {
            super(entityInformation, entityManager);
            this.entityInformation = entityInformation;
            this.em = entityManager;
            this.provider = DefaultPersistenceProvider.fromEntityManager(entityManager);
            this.springDataRepositoryInterface = springDataRepositoryInterface;
        }

        /**
         * Creates a new {@link SimpleJpaRepository} to manage objects of the given
         * domain type.
         *
         * @param domainClass
         * @param em
         */
        public GenericRepositoryImpl(Class<T> domainClass, EntityManager em) {
            this(JpaEntityInformationSupport.getMetadata(domainClass, em), em, null);
        }

        // Same as the default save(), but flushes after persist/merge.
        public <S extends T> S save(S entity) {
            if (this.entityInformation.isNew(entity)) {
                this.em.persist(entity);
                flush();
                return entity;
            }
            entity = this.em.merge(entity);
            flush();
            return entity;
        }

        // Delegates to the default save(), which does not flush.
        public T saveWithoutFlush(T entity) {
            return super.save(entity);
        }

        public List<T> saveWithoutFlush(Iterable<? extends T> entities) {
            List<T> result = new ArrayList<T>();
            if (entities == null) {
                return result;
            }
            for (T entity : entities) {
                result.add(saveWithoutFlush(entity));
            }
            return result;
        }
    }

As a simple example here, I just override the default save method of SimpleJpaRepository. By default, the save method does not flush after persisting. I modified it to flush after persist. In addition, I added another method called saveWithoutFlush() to allow developers to save an entity without flushing.

Define Custom repository factory bean

The last step is to create a factory bean class and a factory class to produce repositories based on your customized base repository class.

    public class DefaultRepositoryFactoryBean<T extends JpaRepository<S, ID>, S, ID extends Serializable>
            extends JpaRepositoryFactoryBean<T, S, ID> {

        /**
         * Returns a {@link RepositoryFactorySupport}.
         *
         * @param entityManager
         * @return
         */
        protected RepositoryFactorySupport createRepositoryFactory(EntityManager entityManager) {
            return new DefaultRepositoryFactory(entityManager);
        }
    }

    /**
     * The purpose of this class is to override the default behaviour of the Spring JpaRepositoryFactory class.
     * It will produce a GenericRepositoryImpl object instead of SimpleJpaRepository.
     */
    public class DefaultRepositoryFactory extends JpaRepositoryFactory {

        private final EntityManager entityManager;
        private final QueryExtractor extractor;

        public DefaultRepositoryFactory(EntityManager entityManager) {
            super(entityManager);
            Assert.notNull(entityManager);
            this.entityManager = entityManager;
            this.extractor = DefaultPersistenceProvider.fromEntityManager(entityManager);
        }

        @SuppressWarnings({ "unchecked", "rawtypes" })
        protected <T, ID extends Serializable> JpaRepository<?, ?> getTargetRepository(
                RepositoryMetadata metadata, EntityManager entityManager) {
            Class<?> repositoryInterface = metadata.getRepositoryInterface();
            JpaEntityInformation<?, Serializable> entityInformation =
                    getEntityInformation(metadata.getDomainType());
            if (isQueryDslExecutor(repositoryInterface)) {
                return new QueryDslJpaRepository(entityInformation, entityManager);
            } else {
                // custom implementation
                return new GenericRepositoryImpl(entityInformation, entityManager, repositoryInterface);
            }
        }

        @Override
        protected Class<?> getRepositoryBaseClass(RepositoryMetadata metadata) {
            if (isQueryDslExecutor(metadata.getRepositoryInterface())) {
                return QueryDslJpaRepository.class;
            } else {
                return GenericRepositoryImpl.class;
            }
        }

        /**
         * Returns whether the given repository interface requires a QueryDsl
         * specific implementation to be chosen.
         *
         * @param repositoryInterface
         * @return
         */
        private boolean isQueryDslExecutor(Class<?> repositoryInterface) {
            return QUERY_DSL_PRESENT
                    && QueryDslPredicateExecutor.class.isAssignableFrom(repositoryInterface);
        }
    }

Conclusion

You can now add more features to the base repository class. In your program, you can create your own repository interface extending GenericRepository instead of JpaRepository.

    public interface MyRepository<T, ID extends Serializable> extends GenericRepository<T, ID> {
        void someCustomMethod(ID id);
    }

In the next post, I will show you how to add Hibernate filter features to this GenericRepository.

September 27, 2012

by Boris Lam

· 98,092 Views · 4 Likes

Enabling JMX Monitoring for Hadoop & Hive

Hadoop’s NameNode and JobTracker expose interesting metrics and statistics over JMX. Hive does not seem to expose anything interesting, but it still might be useful to monitor its JVM or do simple profiling/sampling on it. Let’s see how to enable JMX and how to access it securely, over SSH.

Background: We run NameNode, JobTracker and Hive on the same server. Monitoring of TaskTrackers and DataNodes isn’t that interesting but still might be useful to have.

Configuration

/etc/hadoop/hadoop-env.sh

    diff --git a/etc/hadoop/hadoop-env.sh b/etc/hadoop/hadoop-env.sh
    index 69a13b1..e8ca596 100644
    --- a/etc/hadoop/hadoop-env.sh
    +++ b/etc/hadoop/hadoop-env.sh
    @@ -14,7 +14,8 @@ export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-"/etc/hadoop"}
     #export HADOOP_NAMENODE_INIT_HEAPSIZE=""
     
     # Extra Java runtime options. Empty by default.
    -export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_CLIENT_OPTS"
    +# Added $HIVE_OPTS that is set by hive-env.sh when starting hiveserver
    +export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_CLIENT_OPTS $HIVE_OPTS"
     
     # Command specific options appended to HADOOP_OPTS when specified
     export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT $HADOOP_NAMENODE_OPTS"
    @@ -43,3 +44,16 @@ export HADOOP_SECURE_DN_PID_DIR=/var/run/hadoop
     
     # A string representing this instance of hadoop. $USER by default.
     export HADOOP_IDENT_STRING=$USER
    +
    +### JMX settings
    +export JMX_OPTS=" -Dcom.sun.management.jmxremote.authenticate=false \
    +    -Dcom.sun.management.jmxremote.ssl=false \
    +    -Dcom.sun.management.jmxremote.port"
    +#    -Dcom.sun.management.jmxremote.password.file=$HADOOP_HOME/conf/jmxremote.password \
    +#    -Dcom.sun.management.jmxremote.access.file=$HADOOP_HOME/conf/jmxremote.access"
    +export HADOOP_NAMENODE_OPTS="$JMX_OPTS=8006 $HADOOP_NAMENODE_OPTS"
    +export HADOOP_SECONDARYNAMENODE_OPTS="$HADOOP_SECONDARYNAMENODE_OPTS"
    +export HADOOP_DATANODE_OPTS="$JMX_OPTS=8006 $HADOOP_DATANODE_OPTS"
    +export HADOOP_BALANCER_OPTS="$HADOOP_BALANCER_OPTS"
    +export HADOOP_JOBTRACKER_OPTS="$JMX_OPTS=8007 $HADOOP_JOBTRACKER_OPTS"
    +export HADOOP_TASKTRACKER_OPTS="$JMX_OPTS=8007 $HADOOP_TASKTRACKER_OPTS"

The JMX setting is used for Hadoop’s daemons, while HIVE_OPTS was added for Hive. Note that JMX_OPTS deliberately ends with the bare port property, so that each daemon can append its own "=port" value.

/conf/hive-env.sh

Enable JMX when running the Hive thrift server (we don’t want it when running the command-line client etc., since it’s pointless and we would otherwise need to make sure that each of them has a unique port):

    if [ "$SERVICE" = "hiveserver" ]; then
      JMX_OPTS="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=8008"
      export HIVE_OPTS="$HIVE_OPTS $JMX_OPTS"
    fi

Pitfalls

When you start the Hive server via hive --service hiveserver, it actually executes “hadoop jar …”, so to be able to pass options from hive-env.sh to the JVM we had to add $HIVE_OPTS in hadoop-env.sh. (I haven’t found a cleaner way to do it.)

Effects

When we now start Hive or any of the Hadoop daemons, they will expose their metrics at their respective ports (NameNode: 8006, JobTracker: 8007, Hive: 8008). (If you are running DataNode and/or TaskTracker on the same machine, then you’ll need to change their ports to be unique.)

Secure Connection Over SSH

Read the post VisualVM: Monitoring Remote JVM Over SSH (JMX Or Not) to find out how to connect securely to the JMX ports over SSH, e.g. with VisualVM (spoiler: ssh -D 9696 hostname; use proxy at localhost:9696).

September 25, 2012

by Jakub Holý

· 15,145 Views

Choosing Static vs. Dynamic Languages for Your Startup

Everyone is wondering: why in the world would anyone pick static when you can be dynamic? Usually the thought process is, "what language am I most proficient in that can do the job?" Totally not a bad way to go about it. But does this choice affect anything else? Testing? Speed of development? Robustness?

Dynamic vs. Static

Dynamic languages are languages that don’t necessarily need variables to be declared before they are used. Examples of dynamic languages are Python, Ruby, and PHP. So in dynamic languages the following is possible:

    num = 10

We have successfully assigned a value to a variable without declaring it beforehand. Simple enough; try doing this in Java (you can’t). This can *increase* development speed, since you don’t have to write boilerplate code. It can also be something of a double-edged sword: since types in dynamic languages are checked at runtime, there is no way to tell if there is a bug in the code until it is run. I know you can test, but you can’t test for everything. Here is an example, albeit a trivial one.

    def get_first_problem(problems):
        for problem in problems:
            problam = problem + 1
            return problam

Now if you are raging to some serious dubstep, it’s easy enough to miss that small typo; you go "screw it, do it live" and deploy to production. Python will simply create the new variable and not a single thing will be said. Only you can stop bugs in production!

Static languages are languages in which variables need to be declared before use and type checking is done at compile time. Examples of static languages include Java, C, and C++. So in static languages the following is enforced:

    static int awesomeNumber;
    awesomeNumber = 10;

Many argue this increases robustness and decreases the chance of runtime errors, since the compiler will catch those horrible, horrible mistakes you made throughout your code. Your method contracts are tighter; the downside is a crap ton of boilerplate code.
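To see how quietly a misspelled name can slip through in a dynamic language, here is a small sketch. The functions total_price and describe are made up for illustration and do not come from any real code base.

```python
def total_price(prices):
    """Intended to sum the prices, but contains a misspelled variable."""
    total = 0
    for price in prices:
        # Typo: this binds a brand-new name instead of updating `total`.
        # Python accepts it without complaint.
        totl = total + price
    return total


def describe(count):
    """Works for strings; fails for integers, but only when it is called."""
    return "items: " + count


if __name__ == "__main__":
    print(total_price([10, 20, 30]))  # wrong answer, no error raised
    print(describe("3"))
    try:
        describe(3)
    except TypeError as error:
        print("caught only at runtime:", error)
```

Neither problem is reported when the module is loaded. The first never raises at all and just returns the wrong total, which is why tests (or a linter that flags unused variables) matter so much in this style of language.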
Weak and strong typing can often be confused with dynamic and static languages. Weakly typed languages can lead to philosophical questions like: what does the number 2 added to the word ‘two’ give you? Things like this are possible with a weakly typed language.

    a = 2
    b = "2"
    concatenate(a, b) // Returns "22"
    add(a, b) // Returns 4

Strongly typed languages, by contrast, place restrictions on which operations may occur. For example, in a strongly typed language adding a string and an integer will result in a type error, as shown below.

    >>> a = 10
    >>> b = 'ten'
    >>> a + b
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unsupported operand type(s) for +: 'int' and 'str'
    >>>

Conclusion

Regardless of where you land in this discussion, claiming one is better than the other would lead to a flame war, but there are places where each is strong. Dynamic languages are good for fast development cycles and prototyping, while static languages are better suited to longer development cycles where trivial bugs could be extremely costly (telecommunication systems, air traffic control). For example, if some giant company called Moo Corp. spent millions of dollars on QA and testing and a bug somehow got into the field, fixing it would mean another round of testing. When sitting in that chair the choice is clear: static languages FTW. It’s a hard job, but someone has to milk the cows. Test, test, and test.

Just a little food for thought for when you are starting your next project. You never know what limitations you may be placing on yourself and your team. What do you consider when selecting a programming language for a project?

September 25, 2012

by Mahdi Yusuf

· 24,881 Views

Introducing the New Date and Time API for JDK 8

Date and time handling in Java is a somewhat tricky part when you are new to the language. Time can be accessed via the static method System.currentTimeMillis(), which returns the current time in milliseconds from January 1st 1970. If you prefer to work with objects instead, you can use java.util.Date, a class whose methods are mostly deprecated in recent versions of Java. To work with time offsets, say to add one month to a date, there is java.util.GregorianCalendar. All in all, the methods described here are not very convenient to work with. Java 7 and below lack a good date and time API. The Joda Time library is a common drop-in if you need to work with date/time.

With JSR 310 (Java Specification Request) this is about to change. JSR 310 adds a new date, time and calendar API to Java 8. The ThreeTen project provides a reference implementation of this new API and can already be used in current Java projects (I however recommend not doing this for production). As the README states:

    The API is currently considered usable and accurate, yet incomplete and subject to change. If you use this API you must be able to handle incompatible changes in later versions.

Building ThreeTen

Building the ThreeTen project is relatively easy. It requires both Git and Ant to be installed on your system.

    git clone git://github.com/ThreeTen/threeten.git
    cd threeten
    ant

This will first fetch the most recent version of ThreeTen and then start the build process using ant. Note that building the library also requires either OpenJDK 1.6 or Oracle JDK 1.6.

JSR 310

The new API specifies a number of new classes which are divided into the categories of continuous and human time. Continuous time is based on Unix time and is represented as a single incrementing number.
    Class      Description
    Instant    A point in time in nanoseconds from January 1st 1970
    Duration   An amount of time measured in nanoseconds

Human time is based on fields that we use in our daily lives such as day, hour, minute and second. It is represented by a group of classes, some of which we will discuss in this article.

    Class                              Description
    LocalDate                          a date, without time of day, offset or zone
    LocalTime                          the time of day, without date, offset or zone
    LocalDateTime                      the date and time, without offset or zone
    OffsetDate                         a date with an offset such as +02:00, without time of day or zone
    OffsetTime                         the time of day with an offset such as +02:00, without date or zone
    OffsetDateTime                     the date and time with an offset such as +02:00, without a zone
    ZonedDateTime                      the date and time with a time zone and offset
    YearMonth                          a year and month
    MonthDay                           month and day
    Year/MonthOfDay/DayOfWeek/...      classes for the important fields
    DateTimeFields                     stores a map of field-value pairs which may be invalid
    Calendrical                        access to the low-level API
    Period                             a descriptive amount of time, such as "2 months and 3 days"

In addition to the above classes, three support classes have been implemented. The Clock class wraps the current time and date, ZoneOffset is a time offset from UTC, and ZoneId defines a time zone such as 'Australia/Brisbane'.

Using the API

Getting the current time

The current time is represented by the Clock class. The class is abstract, so you cannot create instances of it. The systemUTC() static method will return the current time based on your system clock and set to UTC.

    import javax.time.Clock;

    Clock clock = Clock.systemUTC();

To use the default time zone on your system there also is systemDefaultZone().

    Clock clock = Clock.systemDefaultZone();

The millis() method can then be used to access the current time in milliseconds from January 1st, 1970. This shows that the Clock class and all subclasses are wrapped around System.currentTimeMillis().
    Clock clock = Clock.systemDefaultZone();
    long time = clock.millis();

Working with time zones

To work with time zones you need to import the ZoneId class. The class provides a method to get the default system time zone:

    import javax.time.ZoneId;
    import javax.time.Clock;

    ZoneId zone = ZoneId.systemDefault();
    Clock clock = Clock.system(zone);

As seen above, the ZoneId can then be used to get an instance of a Clock with that time zone. Other time zones can be accessed by their name, e.g.:

    ZoneId zone = ZoneId.of("Europe/Berlin");
    Clock clock = Clock.system(zone);

Getting human date and time

Working with a time represented in a single long variable is not what we wanted. We want to work with objects that represent human-readable time. The LocalDate, LocalTime and LocalDateTime classes do just that.

    import javax.time.LocalDate;

    // The now() method returns the current DateTime
    LocalDate date = LocalDate.now();
    System.out.printf("%s-%s-%s",
        date.getYear(), date.getMonthValue(), date.getDayOfMonth());

    Using LocalDate to print the current date

Doing calculations with times and dates

One of the most important features of JSR 310 is that you can do calculations with dates and times. The API makes it very easy to do that.

    import javax.time.LocalTime;
    import javax.time.Period;
    import static javax.time.calendrical.LocalPeriodUnit.HOURS;

    Period p = Period.of(5, HOURS);
    LocalTime time = LocalTime.now();
    LocalTime newTime;
    newTime = time.plus(5, HOURS);
    // or
    newTime = time.plusHours(5);
    // or
    newTime = time.plus(p);

    Three ways of adding 5 hours to the current time

Each class that represents human time implements the AdjustableDateTime interface. The interface requires plus and minus methods that take a value and a PeriodUnit as arguments.

Conclusion

This article gave a (very) brief introduction to the new date and time API that will ship with Java 8. The API seems to be very consistent and well thought through, and provides many ways to interact with dates and times.
Upon release of Java 8 the API will be moved from the javax.time package over to java.time, so there will be no conflict if you start using the current implementation.
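The split between continuous and human time is not specific to Java. As a rough analogy only, the sketch below uses Python's standard datetime module rather than the JSR 310 API: an epoch timestamp stands in for continuous time, a datetime with fields for human time, and a timedelta for a fixed amount of time. The helper name add_hours and the sample date are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def add_hours(moment, hours):
    """Return a new datetime shifted by the given number of hours."""
    return moment + timedelta(hours=hours)


# Human time: a date and time with fields, pinned to UTC.
start = datetime(2012, 9, 25, 22, 30, tzinfo=timezone.utc)
later = add_hours(start, 5)

# Continuous time: the same instants as seconds since January 1st 1970.
elapsed_seconds = later.timestamp() - start.timestamp()

# The same instant viewed with a fixed +02:00 offset.
plus_two = timezone(timedelta(hours=2))
local_view = start.astimezone(plus_two)

if __name__ == "__main__":
    print(later.isoformat())
    print(elapsed_seconds)
    print(local_view.isoformat())
```

Adding five hours crosses midnight, so the date fields change while the underlying timestamp simply grows by 18000 seconds. That is the distinction the two class families in the article are built around.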

September 25, 2012

by Fabian Becker

· 78,539 Views

Nested Data Structures, and non-1NF design in PostgreSQL

This has been adapted from an ongoing series currently running on my blog. It has been adapted to be more self-contained, and rely less on other blog entries. For more see http://ledgersmbdev.blogspot.com PostgreSQL provides a very advanced set of tools for doing data modelling in ways which drift back and forth across a relational and non-relational divide. While it is generally a good idea to make the database relational first, and add objects later, the principles of object-relational database design allow you to do a lot more with PostgreSQL than you can on many other database platforms. This article will discuss the use of non-first-normal-form designs, in particular the storage of arrays of tuples in columns to simulate a nested table. The possible uses and problems of such a design will be discussed in detail. One of the promises of object-relational modelling is the ability to address information modelling on complex and nested data structures. Nested data structures bring considerable richness to the database, which is lost in a pure, flat, relational model. Nested data structures can be used to model tuple constraints in ways that are impossible to do when looking at flat data structures, at least as long as those constraints are limited to the information in a single tuple. At the same time there are cases where they simplify things and cases where they complicate things. This is true both in the case of using these for storage and for interfacing with stored procedures. PostgreSQL allows for nested tuples to be stored in a database, and for arrays of tuples. Other ORDBMS's allow something similar (Informix, DB2, and Oracle all support nested tables). Nested tables in PostgreSQL provide a number of gotchas, and additionally exposing the data in them to relational queries takes some extra work. In this post we will look at modelling general ledger transactions using a nested table approach, and both the benefits and limitations of this approach. 
In general this trades one set of problems for another, and it is important to recognize the problems going in. The storage example came out of a brainstorming session I had with Marc Balmer of Micro Systems, though it is worth noting that this is not the solution they use in their products, nor is it the approach currently used by LedgerSMB.

Basic Table Structure

The basic data schema will end up looking like this:

    CREATE TABLE journal_type (
        id serial not null unique,
        label text primary key
    );

    CREATE TABLE account (
        id serial not null unique,
        control_code text primary key, -- account number
        description text
    );

    CREATE TYPE journal_line_type AS (
        account_id int,
        amount numeric
    );

    CREATE TABLE journal_entry (
        id serial not null unique,
        journal_type int references journal_type(id),
        source_document_id text, -- for example invoice number
        date_posted date not null,
        description text,
        line_items journal_line_type[],
        PRIMARY KEY (journal_type, source_document_id)
    );

This schema has a number of obvious gotchas and cannot, by itself, guarantee the sorts of things we want to do. However, using object-relational modelling we can fix these in ways that we cannot in a purely relational schema. The main problems are:

First, since this is a double-entry model, we need a constraint that says that the sum of the amounts of the lines must always equal zero. However, if we just add a sum() aggregate, we will end up with it summing every record in the db every time we do an insert, which is not what we want. We also want to make sure that no account_ids are null and no amounts are null.

Additionally, it is not possible in the schema above to easily expose the journal line information to purely relational tools. However, we can use a VIEW to do this, though this produces yet more problems.

Finally, referential integrity enforcement between the account lines and accounts cannot be done declaratively. We will have to create TRIGGERs to enforce this manually.
These problems are traded off against the fact that the relational model does not allow the first problem to be solved at all: we accept some solutions that are a bit of a pain in exchange for having solutions at all.

Nested Table Constraints

If we simply had a tuple as a column, we could look inside the tuple with check constraints, something like check((column).subcolumn is not null). However, in this case we cannot do that because we need to aggregate over a set of tuples attached to the row. Instead, we create a set of table methods for managing the constraints:

    CREATE OR REPLACE FUNCTION is_balanced(journal_entry)
    RETURNS BOOL LANGUAGE SQL AS $$
        SELECT sum(amount) = 0 FROM unnest($1.line_items);
    $$;

    CREATE OR REPLACE FUNCTION has_no_null_account_ids(journal_entry)
    RETURNS BOOL LANGUAGE SQL AS $$
        SELECT bool_and(account_id is not null) FROM unnest($1.line_items);
    $$;

    CREATE OR REPLACE FUNCTION has_no_null_amounts(journal_entry)
    RETURNS BOOL LANGUAGE SQL AS $$
        SELECT bool_and(amount is not null) FROM unnest($1.line_items);
    $$;

We can then create our constraints. Note that because the functions take a journal_entry argument, the table has to exist before the functions can be defined, and the constraints can only be added after the functions are defined. I have gone ahead and given these friendly names so that errors are easier for people (and machines) to process and handle.

    ALTER TABLE journal_entry
        ADD CONSTRAINT is_balanced CHECK ((journal_entry).is_balanced);

    ALTER TABLE journal_entry
        ADD CONSTRAINT has_no_null_account_ids CHECK ((journal_entry).has_no_null_account_ids);

    ALTER TABLE journal_entry
        ADD CONSTRAINT has_no_null_amounts CHECK ((journal_entry).has_no_null_amounts);

Now we have integrity constraints reaching into our nested data. So let's test this out.
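The logic of the three checks is easy to lose inside the SQL, so here is a hedged sketch of the same rules in plain Python. It only models the rules and is not how PostgreSQL evaluates them: line items are assumed to be (account_id, amount) tuples with None standing in for NULL, and the helper violated_constraints is invented for illustration.

```python
def is_balanced(line_items):
    """Amounts must sum to zero; like SQL sum(), NULL amounts are skipped."""
    return sum(amount for _, amount in line_items if amount is not None) == 0


def has_no_null_account_ids(line_items):
    return all(account_id is not None for account_id, _ in line_items)


def has_no_null_amounts(line_items):
    return all(amount is not None for _, amount in line_items)


def violated_constraints(line_items):
    """Return the names of all checks that the given line items fail."""
    checks = [
        ("is_balanced", is_balanced),
        ("has_no_null_account_ids", has_no_null_account_ids),
        ("has_no_null_amounts", has_no_null_amounts),
    ]
    return [name for name, check in checks if not check(line_items)]


if __name__ == "__main__":
    print(violated_constraints([(1, 100)]))
    print(violated_constraints([(1, 100), (None, -100)]))
    print(violated_constraints([(1, 100), (2, -100), (3, None)]))
    print(violated_constraints([(1, 100), (2, -100)]))
```

One design point carries over directly: because sum() ignores NULL amounts, a line with a missing amount can still look balanced, which is why the separate null checks are needed rather than relying on the balance check alone.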
    insert into journal_type (label) values ('General');

We will re-use the account data from the previous post:

    or_examples=# select * from account;
     id | control_code | description
    ----+--------------+-------------
      1 | 1500         | Inventory
      2 | 4500         | Sales
      3 | 5500         | Purchase
    (3 rows)

Let's try inserting a few meaningless transactions, some of which violate our constraints:

    insert into journal_entry
        (journal_type, source_document_id, date_posted, description, line_items)
    values
        (1, 'ref-10001', now()::date, 'This is a test',
         ARRAY[row(1, 100)::journal_line_type]);

    ERROR: new row for relation "journal_entry" violates check constraint "is_balanced"

So far so good.

    insert into journal_entry
        (journal_type, source_document_id, date_posted, description, line_items)
    values
        (1, 'ref-10001', now()::date, 'This is a test',
         ARRAY[row(1, 100)::journal_line_type,
               row(null, -100)::journal_line_type]);

    ERROR: new row for relation "journal_entry" violates check constraint "has_no_null_account_ids"

Still good.

    insert into journal_entry
        (journal_type, source_document_id, date_posted, description, line_items)
    values
        (1, 'ref-10001', now()::date, 'This is a test',
         ARRAY[row(1, 100)::journal_line_type,
               row(2, -100)::journal_line_type,
               row(3, NULL)::journal_line_type]);

    ERROR: new row for relation "journal_entry" violates check constraint "has_no_null_amounts"

Great. All constraints are working properly. Let's try inserting a valid row:

    insert into journal_entry
        (journal_type, source_document_id, date_posted, description, line_items)
    values
        (1, 'ref-10001', now()::date, 'This is a test',
         ARRAY[row(1, 100)::journal_line_type,
               row(2, -100)::journal_line_type]);

And it works!
    or_examples=# select * from journal_entry;
     id | journal_type | source_document_id | date_posted | description    | line_items
    ----+--------------+--------------------+-------------+----------------+------------------------
      5 |            1 | ref-10001          | 2012-08-23  | This is a test | {"(1,100)","(2,-100)"}
    (1 row)

Break-Out Views

A second major problem that we will be facing with this schema is that if someone wants to create a report using a reporting tool that only really supports relational data, then the financial data will be opaque and not available. This scenario is one of the reasons why I think it is generally important to push the relational model to its breaking point before looking at object-relational functions. Consequently I think when doing nested tables it is important to ensure that the data in them is available through a relational interface, in this case a view.

Here we may want to model debits and credits in a way which is re-usable, so we will start by creating two type methods:

    CREATE OR REPLACE FUNCTION debits(journal_line_type)
    RETURNS NUMERIC LANGUAGE SQL AS $$
        SELECT CASE WHEN $1.amount < 0 THEN $1.amount * -1 ELSE NULL END
    $$;

    CREATE OR REPLACE FUNCTION credits(journal_line_type)
    RETURNS NUMERIC LANGUAGE SQL AS $$
        SELECT CASE WHEN $1.amount > 0 THEN $1.amount ELSE NULL END
    $$;

Now we can use these as virtual columns anywhere a journal_line_type is used. The view definition itself is rather convoluted and this may impact performance. I am waiting for the LATERAL construct to become available, which will make this easier.

    CREATE VIEW journal_line_items AS
    SELECT id AS journal_entry_id, (li).*, (li).debits, (li).credits
      FROM (SELECT je.*, unnest(line_items) li FROM journal_entry je) j;

Remember that (li).debits and (li).credits get turned by the parser into debits(li) and credits(li), allowing for class.method notation here.
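The shape of that view can be sketched in a few lines of Python: each journal entry fans out into one row per line item, with debits and credits computed per row. This is an illustration only. The Row tuple and the plain (id, line_items) input format are assumptions of mine, and None stands in for NULL.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    """One row of the break-out view (hypothetical Python stand-in)."""
    journal_entry_id: int
    account_id: int
    amount: int
    debits: Optional[int]
    credits: Optional[int]


def debits(amount: int) -> Optional[int]:
    # Mirrors debits(): negative amounts with the sign flipped, otherwise NULL.
    return -amount if amount < 0 else None


def credits(amount: int) -> Optional[int]:
    # Mirrors credits(): positive amounts unchanged, otherwise NULL.
    return amount if amount > 0 else None


def journal_line_items(
    entries: Iterable[Tuple[int, List[Tuple[int, int]]]]
) -> Iterator[Row]:
    """Unnest each entry's line items into flat rows, like the view does."""
    for entry_id, line_items in entries:
        for account_id, amount in line_items:
            yield Row(entry_id, account_id, amount, debits(amount), credits(amount))


rows = list(journal_line_items([(5, [(1, 100), (2, -100)])]))
```

A zero amount yields NULL in both columns, which follows directly from the two CASE expressions.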
Testing this out:

    SELECT * FROM journal_line_items;

gives us

     journal_entry_id | account_id | amount | debits | credits
    ------------------+------------+--------+--------+---------
                    5 |          1 |    100 |        |     100
                    5 |          2 |   -100 |    100 |
                    6 |          1 |    200 |        |     200
                    6 |          3 |   -200 |    200 |

As you can see, this works. Now people with purely relational tools can access the information in the nested table. In general it is almost always worth creating break-out views of this sort where nested data is stored. However, it is important to note that with larger data sets this is insufficient, because indexing considerations make it hard to look up specific information on a row level. This may or may not be the end of the world depending on data set size.

Referential Integrity Controls

The final problem is that referential integrity is not a well-defined concept for nested data. For this reason, if we value referential integrity and foreign keys are involved, we must find ways of enforcing these. The simplest solution is a trigger which runs on insert, update, or delete, and manages another relation which can be used as a proxy for referential integrity checks. For example, we could create:

    CREATE TABLE je_account (
        je_id int references journal_entry (id),
        account_id int references account(id),
        primary key (je_id, account_id)
    );

This will be a very narrow table and so should be quick to search. It may also be useful in determining which accounts to look at for transactions if we need to do that. This table could then be used to optimize queries.

To maintain the table we need to recognize that a journal entry's line items will never, ever be updated or deleted. This is due to the need to maintain clear audit controls and trails.
We may add other flags to the table to indicate transactions, but we can handle insert, update, and delete conditions with a trigger, namely:

    CREATE FUNCTION je_ri_management()
    RETURNS TRIGGER LANGUAGE PLPGSQL AS $$
    DECLARE accounts int[];
    BEGIN
        IF TG_OP ILIKE 'INSERT' THEN
            INSERT INTO je_account (je_id, account_id)
            SELECT NEW.id, account_id
              FROM unnest(NEW.line_items)
             GROUP BY account_id;
            RETURN NEW;
        ELSIF TG_OP ILIKE 'UPDATE' THEN
            IF NEW.line_items <> OLD.line_items THEN
                RAISE EXCEPTION 'Cannot change journal entry line items!';
            ELSE
                RETURN NEW;
            END IF;
        ELSIF TG_OP ILIKE 'DELETE' THEN
            RAISE EXCEPTION 'Cannot delete journal entries!';
        ELSE
            RAISE EXCEPTION 'Invalid TG_OP in trigger';
        END IF;
    END;
    $$;

Then we add the trigger with:

    CREATE TRIGGER je_breakout_for_ri
    AFTER INSERT OR UPDATE OR DELETE ON journal_entry
    FOR EACH ROW EXECUTE PROCEDURE je_ri_management();

The final invalid TG_OP branch could be omitted, but this is not a bad check to have. Let's try this out:

    insert into journal_entry
        (journal_type, source_document_id, date_posted, description, line_items)
    values (1, 'ref-10003', now()::date, 'This is a test',
        ARRAY[row(1, 200)::journal_line_type,
              row(3, -200)::journal_line_type]);

    or_examples=# select * from je_account;
     je_id | account_id
    -------+------------
        10 |          3
        10 |          1
    (2 rows)

In this way referential integrity can be enforced.

Solution 2.0: Refactoring the above to eliminate the view.

The above solution will work great for small businesses, but for larger businesses querying this data will become slow for certain kinds of reports. Storage here is tied to a specific criterion, and indexing is somewhat problematic. There are ways we can address this, but they are not always optimal. At the same time our work is simplified because the actual accounting details are append-only. One solution to this is to refactor the above solution.
Instead of:

    - Main table
    - Relational view
    - Materialized view for referential integrity checking

we can have:

    - Main table, with tweaked storage for line items
    - Materialized view for RI checking and relational access

Unfortunately this sort of refactoring after the fact isn't simple. Typically you want to convert the journal_line_type type to a journal_line_type table, and inherit this in your materialized view table. You cannot simply drop and recreate since the column you are storing the data in is dependent on the structure. The solution is to rename the type and create a new one in its place. This must be done manually, and there is no current capability to copy a composite type's structure into a table. You will then need to create a cast and a cast function. Then, when you can afford the downtime, you will want to convert the table to the new type. It is quite possible that the downtime will be delayed and you will have an extended time period where you are half-way through migrating the structure of your database. You can, however, decide to create a cast between the table and the type, perhaps an implicit one (though this is not inherited), and use this to centralize your logic. Unfortunately this leads to duplication-related complexity and in an ideal world would be avoided.

However, assuming that the downtime ends up being tolerable, the resulting structures will end up such that they can be more readily optimized for a variety of workloads. In this regard you would have a main table, most likely with line_items moved to extended storage, whose function is to model journal entries as journal entries and apply relevant constraints, and a second table which models journal entry lines as independent lines. This also simplifies some of the constraint issues on the first table, and makes the modelling easier because we only have to look into the nested storage where we are looking at subset constraints.
This section then provides a warning regarding the use of advanced ORDBMS functionality, namely that it is easy to get tunnel vision and create problems for the future. The complexity cost here is so high that the primary model should generally remain relational, with things like nested storage primarily used to create constraints that cannot be effectively modelled otherwise. However, this becomes a great deal more complicated where values may be updated or deleted. Here, however, we have a relatively simple case regarding data writes, combined with complex constraints that cannot be effectively expressed in normalized, relational SQL. Therefore the standard maintenance concerns that counsel against duplicating information may give way to the fact that such duplication allows for richer constraints.

Now, if we had been aware of the problems going in, we would have chosen this structure all along. Our design would have been:

    CREATE TABLE journal_line (
        entry_id bigserial primary key, --only possible key
        je_id int not null,
        account_id int,
        amount numeric
    );

After creating the journal entry table we'd run:

    ALTER TABLE journal_line ADD FOREIGN KEY (je_id) REFERENCES journal_entry(id);

If we have to handle purging old data we can make that key ON DELETE CASCADE.

And the lines would have been of this type instead. We can then get rid of all constraints and their supporting functions other than the is_balanced one. Our debit and credit functions then also reference this type.
Our trigger then looks like:

    CREATE FUNCTION je_ri_management()
    RETURNS TRIGGER LANGUAGE PLPGSQL AS $$
    DECLARE accounts int[];
    BEGIN
        IF TG_OP ILIKE 'INSERT' THEN
            INSERT INTO journal_line (je_id, account_id, amount)
            SELECT NEW.id, account_id, amount
              FROM unnest(NEW.line_items);
            RETURN NEW;
        ELSIF TG_OP ILIKE 'UPDATE' THEN
            RAISE EXCEPTION 'Cannot change journal entry line items!';
        ELSIF TG_OP ILIKE 'DELETE' THEN
            RAISE EXCEPTION 'Cannot delete journal entries!';
        ELSE
            RAISE EXCEPTION 'Invalid TG_OP in trigger';
        END IF;
    END;
    $$;

Approval workflows can be handled with a separate status table with its own constraints. Deletions of old information (up to a specific snapshot) can be handled by a stored procedure which is unit tested and disables this trigger before purging data. This system has the advantage of having several small components which are all complete and easily understood, and it is made possible because the data is exclusively append-only.

As you can see from the above examples, nested data structures greatly complicate the data model and create problems with relational math that must be addressed if data logic is to remain meaningful. This is a complex field, and it adds a lot of complexity to storage. In general, these are best avoided in actual data storage except where this approach makes formerly insurmountable problems manageable. Moreover, they add complexity to optimization once data gets large. Thus while non-atomic fields make sense as an initial point of entry in some narrow cases, as a point of actual query they are very rarely the right approach. It is possible that, at some point, nested storage will be able to have its own indexes, foreign keys, etc., but I cannot imagine this being a high priority and so it isn't clear that this will ever happen. In general, it usually makes the most sense to simply store the data in a pseudo-normalized way, with any non-1NF designs being the initial point of entry in a linear write model.
Nested Data Structures as Interfaces

Nested data structures as interfaces to stored procedures are a little more manageable. The main difficulties are in application-side data construction and output parsing. Some languages handle this more easily than others. Upper-level construction and handling of these structures is relatively straightforward on the database side and poses none of these problems. However, they do cause additional complexity and this must be managed carefully.

The biggest issue when interfacing with an application is that ROW types are not usually automatically constructed by application-level frameworks, even if they have arrays. This leaves the programmer to choose between unstructured text arrays, which are fundamentally non-discoverable (and thus brittle), and arrays of tuples, which are discoverable but require a lot of additional application code to handle. At the same time, as a chicken-and-egg problem, frameworks will not add handling for this sort of problem unless people are already trying to do it.

So my general recommendation is to use nested data types sparingly everywhere in the database, only where the benefits clearly outweigh the complexity costs. Complexity costs are certainly lower at the interface level and there are many more cases where these techniques are net wins there, but that does not mean that they should be routinely used even there.

September 25, 2012

by Chris Travers

· 20,822 Views

IndexedDB: MultiEntry Explained

For a long time I was not sure what the purpose of the multiEntry attribute was, since none of the browsers supported it yet. Now that Firefox and even the latest builds of Chrome support it, it has all become clear to me. The multiEntry attribute enables you to filter on the individual values of an array. For this reason, the multiEntry attribute is only useful when the index is put on a property that contains an array as its value.

When the multiEntry attribute is set to true, a record is added to the index for every value in the array. The key of this record will be the value from the array and the value will be the object holding the array. Because the values in the array are used as keys, they need to be valid keys. This means they can only be of the following types:

    - Array
    - DOMString
    - float
    - Date

So much for the theory; an example will make everything clear. In the example below I will use a Blog object. A blog consists of the following properties:

    var blog = {
        Id: 1,
        Title: "Blog post",
        content: "content",
        tags: ["html5", "indexeddb", "linq2indexeddb"]
    };

In the indexeddb we have an object store called blog which has an index on the tags property. The index has the multiEntry attribute turned on. If we inserted the object above, we would see the following records in the index:

    key                 value
    "html5"             { Id: 1, Title: "Blog post", content: "content", tags: ["html5", "indexeddb", "linq2indexeddb"] }
    "indexeddb"         { Id: 1, Title: "Blog post", content: "content", tags: ["html5", "indexeddb", "linq2indexeddb"] }
    "linq2indexeddb"    { Id: 1, Title: "Blog post", content: "content", tags: ["html5", "indexeddb", "linq2indexeddb"] }

So for every value in the array of the tags attribute, a record is added to the index. This means that when you start filtering, the same object can be added to the result multiple times. For example, if you filtered on all tags greater than "i", the result would contain the blog object used in this example two times.
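To make the duplicate-result behaviour concrete, here is a small model of a multiEntry index written in Python. It is a sketch of the idea, not browser code: build_multi_entry_index and query_greater_than are hypothetical helpers of mine, the index is just a sorted list of (key, primary key) pairs, and I am assuming that repeated values inside one array are indexed only once.

```python
def build_multi_entry_index(objects, key_path, primary_key="Id"):
    """Return sorted (index key, primary key) pairs, one per array value."""
    entries = []
    for obj in objects:
        # Assumption: a value repeated within one array is indexed once.
        for value in set(obj[key_path]):
            entries.append((value, obj[primary_key]))
    return sorted(entries)


def query_greater_than(index, store, lower):
    """Return one stored object per index record whose key is > lower."""
    return [store[pk] for key, pk in index if key > lower]


blog = {
    "Id": 1,
    "Title": "Blog post",
    "content": "content",
    "tags": ["html5", "indexeddb", "linq2indexeddb"],
}
store = {blog["Id"]: blog}

index = build_multi_entry_index(store.values(), "tags")
matches = query_greater_than(index, store, "i")
```

Here matches holds the same blog object twice, once for "indexeddb" and once for "linq2indexeddb", while "html5" sorts before "i" and is skipped. If you need each object once, de-duplicate on the primary key after the query.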

September 24, 2012

by Kristof Degrave

· 6,711 Views

Asynchronous WMI Queries: Stay Away From Them

So, it turns out that I have a WMI category on my blog. During the last couple of years I almost forgot about it, but WMI got a chance to wrap its poisonous tentacles around me again yesterday. Here's another story.

WMI is known for requiring lots of attention to security. To establish a WMI connection to a remote machine, you need to muck around with registry settings, DCOM configuration, group policy details, and other infernal things which we developers like to defer to someone else. But at least you know that once a machine has been configured properly to give you access through WMI, you can then access it from any other machine. Right? Right?

Not so much. WMI has a concept of asynchronous queries, which are notably used for receiving event notifications. For example, the following code registers for an event notification whenever a process is created on my desktop machine:

    ManagementScope scope = new ManagementScope(@"\\sasha-desktop\root\cimv2");
    WqlEventQuery query = new WqlEventQuery(
        "SELECT * FROM Win32_ProcessStartTrace");
    ManagementEventWatcher watcher = new ManagementEventWatcher(scope, query);
    watcher.EventArrived += (o, e) => ...; //TODO: process the event
    watcher.Start();

Indeed, this thing works just fine if you point it to the local machine, but when you point it to a remote machine it fails on the call to the Start method. You could now strip the remote machine bare and have it expose its very innate networking guts to the entire Internet, and it still wouldn't help you establish the connection. Interesting.

When troubleshooting this nasty bug, I looked up a VBScript sample that receives new process creation events on another machine. Here it is:

    Set wmi = GetObject("winmgmts:\\sasha-desktop\root\cimv2")
    Set query = wmi.ExecNotificationQuery _
        ("SELECT * FROM Win32_ProcessStartTrace")
    Set process = query.NextEvent

VBScript and all, it worked just fine.
I started to suspect something smelly in the kingdom of .NET, so I rewrote the VBScript sample in C#, using the long-forgotten Microsoft.VisualBasic.Interaction class:

    dynamic wmi = Microsoft.VisualBasic.Interaction.GetObject(
        @"winmgmts:\\sasha-desktop\root\cimv2");
    dynamic query = wmi.ExecNotificationQuery(
        "SELECT * FROM Win32_ProcessStartTrace");
    dynamic evt = query.NextEvent;

This, too, worked just fine, although that's not much of a surprise, as it's pretty much equivalent to the VBScript code at this point. Still interesting.

This is when it hit me: the asynchronous nature of the ManagementEventWatcher.EventArrived event relies on an asynchronous WMI query, which requires a reverse connection to the client machine! This is configuration inferno, x2, on the client machine now, what with the DCOM security settings and sacrifices to the gods of group policy.

Unless, of course, we give up the asynchrony and rely on the ManagementEventWatcher.WaitForNextEvent method. It's synchronous. It burns a thread that has to sit idly by and wait while its siblings execute useful work. But it doesn't establish a reverse DCOM connection to the caller. At least that.

September 22, 2012

by Sasha Goldshtein

· 11,025 Views

Spring 3.1 Caching and @CacheEvict

My last blog demonstrated the application of Spring 3.1’s @Cacheable annotation that’s used to mark methods whose return values will be stored in a cache. However, @Cacheable is only one of a pair of annotations that the Guys at Spring have devised for caching, the other being @CacheEvict.

Like @Cacheable, @CacheEvict has value, key and condition attributes. These work in exactly the same way as those supported by @Cacheable, so for more information on them see my previous blog: Spring 3.1 Caching and @Cacheable.

@CacheEvict supports two additional attributes: allEntries and beforeInvocation. If I were a gambling man I'd put money on the most popular of these being allEntries. allEntries is used to completely clear the contents of a cache defined by @CacheEvict's mandatory value argument. The method below demonstrates how to apply allEntries:

    @CacheEvict(value = "employee", allEntries = true)
    public void resetAllEntries() {
        // Intentionally blank
    }

resetAllEntries() sets @CacheEvict’s allEntries attribute to “true” and, assuming that the findEmployee(...) method looks like this:

    @Cacheable(value = "employee")
    public Person findEmployee(String firstName, String surname, int age) {
        return new Person(firstName, surname, age);
    }

...then in the following code, resetAllEntries() will clear the “employee” cache. This means that in the JUnit test below employee1 will not reference the same object as employee2:

    @Test
    public void testCacheResetOfAllEntries() {
        Person employee1 = instance.findEmployee("John", "Smith", 22);
        instance.resetAllEntries();
        Person employee2 = instance.findEmployee("John", "Smith", 22);
        assertNotSame(employee1, employee2);
    }

The second attribute is beforeInvocation. This determines whether data items are cleared from the cache before or after your method is invoked.

The code below is pretty nonsensical; however, it does demonstrate that you can apply both @CacheEvict and @Cacheable simultaneously to a method.
    @CacheEvict(value = "employee", beforeInvocation = true)
    @Cacheable(value = "employee")
    public Person evictAndFindEmployee(String firstName, String surname, int age) {
        return new Person(firstName, surname, age);
    }

In the code above, @CacheEvict deletes any entries in the cache with a matching key before @Cacheable searches the cache. As @Cacheable won’t find any entries it’ll call my code, storing the result in the cache. The subsequent call to my method will invoke @CacheEvict, which will delete any appropriate entries, with the result that in the JUnit test below the variable employee1 will never reference the same object as employee2:

    @Test
    public void testBeforeInvocation() {
        Person employee1 = instance.evictAndFindEmployee("John", "Smith", 22);
        Person employee2 = instance.evictAndFindEmployee("John", "Smith", 22);
        assertNotSame(employee1, employee2);
    }

As I said above, evictAndFindEmployee(...) seems somewhat nonsensical as I’m applying both @Cacheable and @CacheEvict to the same method. But it’s more than that: it makes the code unclear and breaks the Single Responsibility Principle; hence, I’d recommend creating separate cacheable and cache-evict methods. For example, if you have a caching method such as:

    @Cacheable(value = "employee", key = "#surname")
    public Person findEmployeeBySurname(String firstName, String surname, int age) {
        return new Person(firstName, surname, age);
    }

then, assuming you need finer cache control than a simple ‘clear-all’, you can easily define its counterpart:

    @CacheEvict(value = "employee", key = "#surname")
    public void resetOnSurname(String surname) {
        // Intentionally blank
    }

This is a simple blank marker method that uses the same SpEL expression that’s been applied to @Cacheable to evict all Person instances from the cache where the key matches the ‘surname’ argument.
    @Test
    public void testCacheResetOnSurname() {
        Person employee1 = instance.findEmployeeBySurname("John", "Smith", 22);
        instance.resetOnSurname("Smith");
        Person employee2 = instance.findEmployeeBySurname("John", "Smith", 22);
        assertNotSame(employee1, employee2);
    }

In the above code the first call to findEmployeeBySurname(...) creates a Person object, which Spring stores in the “employee” cache with a key defined as “Smith”. The call to resetOnSurname(...) clears all entries from the “employee” cache with a surname of “Smith”, and finally the second call to findEmployeeBySurname(...) creates a new Person object, which Spring again stores in the “employee” cache with the key of “Smith”. Hence, the variables employee1 and employee2 do not reference the same object.

Having covered Spring’s caching annotations, the next piece of the puzzle is to look into setting up a practical cache: just how do you enable Spring caching and which caching implementation should you use? More on that later...
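If it helps to see the behaviour stripped of annotations, the sequence above can be modelled with a plain dictionary. The Python sketch below is only an illustration of the caching semantics described in this post, not of how Spring implements them; SimpleCache and its method names are hypothetical.

```python
class SimpleCache:
    """Toy model of one named cache, e.g. the "employee" cache."""

    def __init__(self):
        self._entries = {}

    def get_or_compute(self, key, compute):
        # Roughly what @Cacheable does: reuse a stored value, else store a new one.
        if key not in self._entries:
            self._entries[key] = compute()
        return self._entries[key]

    def evict(self, key):
        # Roughly what @CacheEvict with a key does.
        self._entries.pop(key, None)

    def clear(self):
        # Roughly what @CacheEvict with allEntries = true does.
        self._entries.clear()


class Person:
    def __init__(self, first_name, surname, age):
        self.first_name = first_name
        self.surname = surname
        self.age = age


employee_cache = SimpleCache()


def find_employee_by_surname(first_name, surname, age):
    # The surname is the cache key, mirroring key = "#surname".
    return employee_cache.get_or_compute(
        surname, lambda: Person(first_name, surname, age)
    )


def reset_on_surname(surname):
    employee_cache.evict(surname)


employee1 = find_employee_by_surname("John", "Smith", 22)
cached_again = find_employee_by_surname("John", "Smith", 22)
reset_on_surname("Smith")
employee2 = find_employee_by_surname("John", "Smith", 22)
```

Before the eviction the second lookup returns the very same object as the first; after it, a fresh Person is created, which is the same outcome the JUnit test asserts with assertNotSame.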

September 21, 2012

by Roger Hughes

· 123,660 Views · 7 Likes

Spring 3.1 Caching and Config

I’ve recently been blogging about Spring 3.1 and its new caching annotations @Cacheable and @CacheEvict. As with all Spring features you need to do a certain amount of setup and, as usual, this is done with Spring’s XML configuration file. In the case of caching, turning on @Cacheable and @CacheEvict couldn’t be simpler as all you need to do is to add the following to your Spring config file: ...together with the appropriate schema definition in your beans XML element declaration: ...with the salient lines being:

    xmlns:cache="http://www.springframework.org/schema/cache"

...and:

    http://www.springframework.org/schema/cache
    http://www.springframework.org/schema/cache/spring-cache.xsd

However, that’s not the end of the story, as you also need to specify a caching manager and a caching implementation. The good news is that if you’re familiar with the set up of other Spring components, such as the database transaction manager, then there are no surprises in how this is done.

A cache manager class seems to be any class that implements Spring’s org.springframework.cache.CacheManager interface. It’s responsible for managing one or more cache implementations, where the cache implementation instance(s) are responsible for actually caching your data.

The XML sample below is taken from the example code used in my last two blogs. In the above configuration, I’m using Spring’s SimpleCacheManager to manage an instance of their ConcurrentMapCacheFactoryBean with a cache implementation named “employee”.

One important point to note is that your cache manager MUST have a bean id of cacheManager.
If you get this wrong then you’ll get the following exception:

    org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'org.springframework.cache.interceptor.CacheInterceptor#0': Cannot resolve reference to bean 'cacheManager' while setting bean property 'cacheManager'; nested exception is org.springframework.beans.factory.NoSuchBeanDefinitionException: No bean named 'cacheManager' is defined
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:328)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:106)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyPropertyValues(AbstractAutowireCapableBeanFactory.java:1360)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.populateBean(AbstractAutowireCapableBeanFactory.java:1118)
        at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:517)
        :
        : trace details removed for clarity
        :
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
        at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
    Caused by: org.springframework.beans.factory.NoSuchBeanDefinitionException: No bean named 'cacheManager' is defined
        at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBeanDefinition(DefaultListableBeanFactory.java:553)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getMergedLocalBeanDefinition(AbstractBeanFactory.java:1095)
        at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:277)
        at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:193)
        at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:322)

As I said above, in my simple configuration the whole affair is orchestrated by the SimpleCacheManager. This, according to the documentation, is normally “Useful for testing or simple caching declarations”. Although you could write your own CacheManager implementation, the Guys at Spring have provided other cache managers for different situations:

    SimpleCacheManager - see above.
    NoOpCacheManager - used for testing, in that it doesn’t actually cache anything, although be careful here as testing your code without caching may trip you up when you turn caching on.
    CompositeCacheManager - allows the use of multiple cache managers in a single application.
    EhCacheCacheManager - a cache manager that wraps an ehCache instance. See http://ehcache.org

Selecting which cache manager to use in any given environment seems like a really good use for Spring Profiles. See:

    Using Spring Profiles in XML Config
    Using Spring Profiles and Java Configuration

And that just about wraps things up, although just for completeness, below is the complete configuration file used in my previous two blogs:

As Lieutenant Columbo is fond of saying, “And just one more thing, you know what bothers me about this case...”; well, there are several things that bother me about cache managers, for example:

    What do the Guys at Spring mean by “Useful for testing or simple caching declarations” when talking about the SimpleCacheManager? Just exactly when should you use it in anger rather than for testing?
    Would it ever be advisable to write your own CacheManager implementation or even a Cache implementation?
    What exactly are the advantages of using the EhCacheCacheManager?
    How often would you really need CompositeCacheManager?

All of which I may be looking into in the future...

September 19, 2012

by Roger Hughes

· 27,501 Views · 2 Likes

How To Create A Theme Options Page For WordPress

If you have ever used a WordPress premium theme then you will have seen the custom theme options page that comes with it. The theme options page is found under the Appearance menu and allows the admin of the WordPress site to change some of the settings of the theme. Most premium themes come with options to change the colors of fonts and backgrounds, change images or font types... anything that allows you to style the WordPress theme. Some of the most common fields to change are:

    Theme Options - To edit the theme logo, change the stylesheet, upload a new favicon, add Google Analytics code, enter your FeedBurner URL and add custom CSS.
    Styling Options - Change the background colour or change the background image.
    Fonts - Change the font on all your header tags or the main content text.
    Social - Providing your theme with your social media profiles will make it easier to link to them in parts of your theme or display your latest tweets.

Option pages can also be used by plugins to change settings and to customize the plugin.

Examples Of Theme Options Pages

Here is what some of the theme options pages from premium themes look like.

How To Build A Theme Option Page

When creating an options page there are a few things you need to set up.

    Add Menu - Decide whether to display the page under the Appearance menu or to give the options page its own menu.
    Add Sections - These are sections of settings you are adding to the options page.
    Register Settings - Settings are the different fields you are adding to the options page; they need to be registered with the Settings API.
    Display Settings - The Settings API will be used to call a function to display the setting.
    Validate Setting - When the user saves the settings field the input will need to be validated before it is stored in the options table.
    Feedback Messages - When the settings are saved you need to be able to tell the user whether the settings were saved successfully or whether there was an error during validation.
To help us perform all these tasks there is a WordPress API called the Settings API. This API allows admin pages to handle settings forms semi-automatically. With the API you can define pages for the settings, sections for the settings and fields for the settings. This works by registering settings fields to be displayed within sections; the page then displays these sections.

WordPress uses the Settings API by default on existing admin pages, which means that by using the Settings API you can add to existing pages by registering new settings. All validation must be performed by the developer of the settings pages, but the Settings API will control the creation of the form and the storing of the form values in the options table.

Add Menu To WordPress Admin

When adding a menu to the WordPress admin screen you have loads of flexibility: you have the option of adding brand new menu items or adding the menu as a sub-menu. To add a top level menu just use the function add_menu_page(). Its parameters are:

    $page_title - The title used on the settings page.
    $menu_title - The title used on the menu.
    $capability - Only displays the menu if the user has this capability.
    $menu_slug - The unique name of the menu slug.
    $function - This is the callback function to run to display the page.
    $icon_url - Display an icon just for the menu.
    $position - This allows you to choose where the menu item appears in the list.

If you prefer to have the menu under the Appearance parent menu you can use the following code snippet. Or you can use the function add_theme_page(), which will add a sub-menu under the Appearance menu.

    add_theme_page( $page_title, $menu_title, $capability, $menu_slug, $function);

Registering The Settings

To start off we need to register the settings group in which we are going to store the settings page values. This will use the Settings API to define the group of settings; we will then add the settings to the group.
When you store the settings in this group they are saved in the wp_options database table, so you can get the values out at a later date. The wp_options table is a key-value store, and it is what you should use for long-term data on your WordPress site. If you are storing a lot of data, it's best practice to turn the data into an array and store it under one key instead of spreading the values over multiple keys. This means that if you have a settings page to change the site logo, background color, font, font size and so on, you won't have an option for each of these; you will group them into one option. The reason you do this is to improve database efficiency by not adding too many rows to the options table.

To register settings with the Settings API you need to use the function register_setting(). The parameters you pass into it are:

Option Group - The name of the group of settings you are going to store. This must match the group name used in the settings_fields() function.
Option Name - The name of the option which will be saved. This is the key that is used in the options table.
Sanitize Callback - The function that is used to validate the settings for this option group.

Add Sections To Settings

Once the settings are registered we can add sections to the Settings API. This allows us to organise the settings on the page, so that you can add styles to display them differently. The benefit of adding sections is that we can call the function do_settings_sections(), which displays every section registered for a page together with its fields. To create your own section all you have to do is use the function add_settings_section(). The parameters you need to pass to this function are:

Id - String to use for the ID of the section.
Title - The title to use on the section.
Callback - The function that outputs any content to show at the top of the section, above its fields.
Page - The page that displays the section. This should match the menu slug of the page.

Add Fields To The Sections

The last important function we need to add settings to the page is add_settings_field(), which is used as part of the Settings API to add fields to a section. The function needs to know the page slug and the section ID before you can define the settings to use. All the settings you set up here will be stored in the options table under the key used in the register_setting() function. To use this function you need to pass the following parameters:

ID - ID of the field.
Title - Title of the field.
Callback - Function used to display the setting. This is very important, as it outputs the input field you want.
Page - The page which is going to display the field. This should be the same as the menu slug used on the section.
Section - ID of the section the field will be added to.
$args - Additional arguments which are passed to the callback function.

Example Of Using The Settings API

There is a lot of information to take in above, so creating a settings page can seem a bit complicated, but once you get your head around the structure the Settings API uses it's actually quite easy to understand. The best way to see how this all works is with an example.

Create A Theme Option Page With A Textbox Field

In this example we will create a theme options page and add a textbox to it that supplies additional text for index.php. Just add the following to your functions.php file to create a theme options page. First we create the menu item under the Appearance menu by calling the add_theme_page() function on the admin_menu action.
/** * Theme Option Page Example */ function pu_theme_menu() { add_theme_page( 'Theme Option', 'Theme Options', 'manage_options', 'pu_theme_options.php', 'pu_theme_page'); } add_action('admin_menu', 'pu_theme_menu'); As you can see above we set the callback function to the theme options page to be pu_theme_page so we need to create this function to display our page. Here we create a form to submit to the options.php so that we can save in the options table, we call settings_fields() to the get the settings in register_settings() and use the do_settings_sections() function to display our settings. /** * Callback function to the add_theme_page * Will display the theme options page */ function pu_theme_page() { ?> Custom Theme Options Created by Paulund. 'text', 'id' => 'pu_textbox', 'name' => 'pu_textbox', 'desc' => 'Example of textbox description', 'std' => '', 'label_for' => 'pu_textbox', 'class' => 'css_class' ); add_settings_field( 'example_textbox', 'Example Textbox', 'pu_display_setting', 'pu_theme_options.php', 'pu_text_section', $field_args ); } The callback function on creating sections can be used to add addition information that will appear above every section, on this example we are just leaving it blank. /** * Function to add extra text to display on each section */ function pu_display_section($section){ } The callback function on the add_settings_field() function is pu_display_setting, this is the function that is going to echo the display of any input's on the page. The parameter to this function is the $args value on the add_settings_field() we can use this to add things like id, name, default value etc. We want to get any existing values from the wp_option table to display any values which previously typed in by the user, do to this we get the values from the table by using the get_option() function. /** * Function to display the settings on the page * This is setup to be expandable by using a switch on the type variable. 
* In future you can add multiple types to be display from this function, * Such as checkboxes, select boxes, file upload boxes etc. */ function pu_display_setting($args) { extract( $args ); $option_name = 'pu_theme_options'; $options = get_option( $option_name ); switch ( $type ) { case 'text': $options[$id] = stripslashes($options[$id]); $options[$id] = esc_attr( $options[$id]); echo ""; echo ($desc != '') ? "$desc" : ""; break; } } Finally we can validate the values added to the form by creating the validation callback function pu_validate_settings. All this does at the moment is loop through the inputs passed to it and checks if it's a letter or a number. The return of this function is what will be added to the database. /** * Callback function to the register_settings function will pass through an input variable * You can then validate the values and the return variable will be the values stored in the database. */ function pu_validate_settings($input) { foreach($input as $k => $v) { $newinput[$k] = trim($v); // Check the input is a letter or a number if(!preg_match('/^[A-Z0-9 _]*$/i', $v)) { $newinput[$k] = ''; } } return $newinput; } If you copy all the snippets above into your functions.php file you will see this options form under the appearance menu. Using Theme Options Within Your Theme Now that you understand how to create a theme options page you need to be able to use this value in your theme so you can change the settings. All the settings are stored in the wp_options table with WordPress it's very easy to get these values out all you have to do is use the get_option() function. The option name is the name you put on the register_settings() function. So in our example above you will use this code. The $options variable will now store an array of the values from the theme options, which you can display the value of the textbox we put on the page by using this snippet. 
Conclusion

Those are the basics you need to understand to use the Settings API. Now you can take this information and create your own theme options page. Experiment with the different input types you can add to the form and with different validation methods. In future tutorials I will show how you can use some of the third-party libraries bundled with WordPress to create a better user experience on your theme options panel. This will include things like color pickers, date pickers and jQuery UI features.

As you can see, we have created a settings options page in just over 100 lines of code, so it's not a hard thing to do, but there are a few steps to it and the features can be expanded on. For this reason people have created theme option frameworks that let you easily create a theme options page with a much higher level of complexity. But as with many other frameworks, I always recommend you learn the basics first, which is why it's important to understand how the Settings API works before using or creating a settings page framework.

September 18, 2012

by Paul Underwood

· 23,705 Views

8 Common Code Violations in Java

At work, I recently did a code cleanup of an existing Java project. After that exercise, I could see a common set of code violations that occur again and again in the code. So I came up with a list of these common violations and shared it with my peers, in the hope that awareness would help to improve code quality and maintainability. I’m sharing the list here with a bigger audience. The list is not in any particular order, and all items are derived from the rules enforced by code quality tools such as CheckStyle, FindBugs and PMD. Here we go!

Format source code and organize imports in Eclipse: Eclipse provides the option to auto-format the source code and organize the imports (thereby removing unused ones). You can use the following shortcut keys to invoke these functions.

Ctrl + Shift + F – Formats the source code.
Ctrl + Shift + O – Organizes the imports and removes the unused ones.

Instead of invoking these two functions manually, you can tell Eclipse to auto-format and auto-organize whenever you save a file. To do this, go to Window -> Preferences -> Java -> Editor -> Save Actions, enable Perform the selected actions on save, and check Format source code and Organize imports.

Avoid multiple returns (exit points) in methods: Make sure that your methods have only one exit point. Do not use return in more than one place in a method body. For example, the code below is NOT RECOMMENDED because it has more than one exit point (return statement).

    private boolean isEligible(int age){
      if(age > 18){
        return true;
      }else{
        return false;
      }
    }

The above code can be rewritten like this (of course, the code below can still be improved, but that comes later).

    private boolean isEligible(int age){
      boolean result;
      if(age > 18){
        result = true;
      }else{
        result = false;
      }
      return result;
    }

Simplify if-else methods: We write several utility methods that take a parameter, check some condition and return a value based on that condition. For example, consider the isEligible method that you just saw in the previous point.

    private boolean isEligible(int age){
      boolean result;
      if(age > 18){
        result = true;
      }else{
        result = false;
      }
      return result;
    }

The entire method can be rewritten as a single return statement, as below.

    private boolean isEligible(int age){
      return age > 18;
    }

Do not create new instances of Boolean, Integer or String: Avoid creating new instances of Boolean, Integer, String etc. For example, instead of using new Boolean(true), use Boolean.valueOf(true). The latter has the same effect as the former but performs better, because it returns a shared instance instead of allocating a new object.

Use curly braces around block statements: Never forget to use curly braces around block-level statements such as if, for and while. This reduces the ambiguity of your code and lowers the chance of introducing a new bug when you modify the block-level statement.

NOT RECOMMENDED

    if(age > 18)
      return true;
    else
      return false;

RECOMMENDED

    if(age > 18){
      return true;
    }else{
      return false;
    }

Mark method parameters as final, wherever applicable: Always mark method parameters as final wherever applicable. If you do so, accidentally reassigning the parameter produces a compile-time error. It also makes it clear to readers that the parameter is never reassigned.

RECOMMENDED

    private boolean isEligible(final int age){ ... }

Name public static final fields in UPPERCASE: Always name public static final fields (also known as constants) in UPPERCASE. This lets you easily differentiate constant fields from local variables.

NOT RECOMMENDED

    public static final String testAccountNo = "12345678";

RECOMMENDED

    public static final String TEST_ACCOUNT_NO = "12345678";

Combine multiple if statements into one: Wherever possible, try to combine multiple if statements into a single one. For example, the code below

    if(age > 18){
      if( voted == false){
        // eligible to vote.
      }
    }

can be combined into a single if statement:

    if(age > 18 && !voted){
      // eligible to vote
    }

switch should have default: Always add a default case to switch statements.

Avoid duplicate string literals, instead create a constant: If you have to use a string in several places, avoid using it as a literal. Instead, create a String constant and use it. For example, in the code below

    private void someMethod(){
      logger.log("My Application" + e);
      ....
      ....
      logger.log("My Application" + f);
    }

the string literal "My Application" can be made a constant and used in the code.

    public static final String MY_APP = "My Application";

    private void someMethod(){
      logger.log(MY_APP + e);
      ....
      ....
      logger.log(MY_APP + f);
    }

Additional Resources:

A collection of Java best practices.
List of available Checkstyle checks.
List of PMD Rule sets
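Several of the recommendations above can be seen working together in one small class. This is only a sketch: the class name CodeStyleSketch, the canVote and describe helpers and the status codes are made up for illustration and do not come from the project the article describes.

```java
// Illustrative only: names and status codes here are hypothetical.
final class CodeStyleSketch {

    // Constant named in UPPERCASE, replacing a repeated string literal.
    static final String MY_APP = "My Application";

    // Final parameter, a single exit point, and the condition returned directly.
    static boolean isEligible(final int age) {
        return age > 18;
    }

    // Two nested ifs combined into one boolean expression.
    static boolean canVote(final int age, final boolean voted) {
        return age > 18 && !voted;
    }

    // A switch with a default case, so unexpected values are handled explicitly.
    static String describe(final int statusCode) {
        String label;
        switch (statusCode) {
            case 0:
                label = "new";
                break;
            case 1:
                label = "active";
                break;
            default:
                label = "unknown";
                break;
        }
        return MY_APP + ": " + label;
    }

    public static void main(final String[] args) {
        // valueOf hands back the shared Boolean.TRUE instead of allocating an object.
        System.out.println(Boolean.valueOf(true) == Boolean.TRUE);
        System.out.println(isEligible(19));
        System.out.println(canVote(30, true));
        System.out.println(describe(7));
    }
}
```

The identity comparison in main is there only to show that Boolean.valueOf reuses an existing instance; in normal code you would compare boxed values with equals.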

September 14, 2012

by Veera Sundar

· 45,973 Views · 1 Like

The Difference Between 'Hadoop DFS' and 'Hadoop FS'

While exploring HDFS, I came across these two syntaxes for querying HDFS:

    > hadoop dfs
    > hadoop fs

Initially I couldn't differentiate between the two, and kept wondering why we have two different syntaxes for a common purpose. I found a number of people online with the same question. Their thoughts are below.

Per Chris's explanation, it seems like there's no difference between the two syntaxes. Look at the definitions of the two commands (hadoop fs and hadoop dfs) in $HADOOP_HOME/bin/hadoop:

    ...
    elif [ "$COMMAND" = "datanode" ] ; then
      CLASS='org.apache.hadoop.hdfs.server.datanode.DataNode'
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_DATANODE_OPTS"
    elif [ "$COMMAND" = "fs" ] ; then
      CLASS=org.apache.hadoop.fs.FsShell
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "dfs" ] ; then
      CLASS=org.apache.hadoop.fs.FsShell
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    elif [ "$COMMAND" = "dfsadmin" ] ; then
      CLASS=org.apache.hadoop.hdfs.tools.DFSAdmin
      HADOOP_OPTS="$HADOOP_OPTS $HADOOP_CLIENT_OPTS"
    ...

That was his reasoning: both commands run the same FsShell class. Unconvinced, I kept looking for a more persuasive answer, and these excerpts made more sense to me:

FS relates to a generic file system, which can point to any file system such as the local one or HDFS, whereas dfs is specific to HDFS. So when we use fs, the operation can run from or to either the local file system or the Hadoop distributed file system, while a dfs operation always relates to HDFS.

Below are two excerpts from the Hadoop documentation that describe these two as different shells.

FS Shell

The FileSystem (FS) shell is invoked by bin/hadoop fs. All the FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost). Most of the commands in FS shell behave like corresponding Unix commands.

DFShell

The HDFS shell is invoked by bin/hadoop dfs. All the HDFS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenode:namenodeport/parent/child or simply as /parent/child (given that your configuration is set to point to namenode:namenodeport). Most of the commands in HDFS shell behave like corresponding Unix commands.

So, based on the above, we can conclude that it all depends on the scheme configuration. When the two commands are given an absolute URI (i.e. scheme://a/b), their behavior is identical. The only cause of any difference in behavior is the default scheme assumed when none is given: file for fs and hdfs for dfs.
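To make the "scheme and authority are optional" rule concrete, here is a minimal sketch of how a path with no scheme could be completed from a configured default. It uses only java.net.URI from the JDK and is not Hadoop's own code; the PathUriSketch class, the qualify method and the namenodehost name are assumptions for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only: this mimics the rule described in the excerpts,
// it does not call or reproduce any Hadoop API.
final class PathUriSketch {

    // Returns the path unchanged if it already names a scheme; otherwise
    // borrows the scheme and authority from the configured default.
    static URI qualify(final String path, final URI defaultFs) {
        final URI given = URI.create(path);
        if (given.getScheme() != null) {
            return given;
        }
        try {
            return new URI(defaultFs.getScheme(), defaultFs.getAuthority(),
                    given.getPath(), null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot qualify " + path, e);
        }
    }

    public static void main(final String[] args) {
        final URI defaultFs = URI.create("hdfs://namenodehost");
        // No scheme given, so the configured default is applied.
        System.out.println(qualify("/parent/child", defaultFs));
        // An explicit scheme is kept regardless of the default.
        System.out.println(qualify("file:///tmp/data", defaultFs));
    }
}
```

Seen this way, two commands can only disagree on a bare path such as /parent/child if they start from different defaults; a fully qualified URI leaves no room for a difference.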

September 14, 2012

by Abhishek Jain

· 45,360 Views

Spring 3.1 Caching and @Cacheable

Caches have been around in the software world for a long time. They’re one of those really useful things that, once you start using them, you wonder how on earth you got along without them, so it seems a little strange that the guys at Spring only got around to adding a caching implementation to Spring core in version 3.1. I’m guessing that previously it wasn’t seen as a priority and besides, before the introduction of Java annotations, one of the difficulties of caching was the coupling of caching code with your business code, which could often become pretty messy.

However, the guys at Spring have now devised a simple-to-use caching system based around a couple of annotations: @Cacheable and @CacheEvict.

The idea of the @Cacheable annotation is that you use it to mark the methods whose return values will be stored in the cache. It can be applied at either method or type level. When applied at method level, the annotated method’s return value is cached. When applied at type level, the return value of every method is cached.

    @Cacheable(value = "employee")
    public class EmployeeDAO {

      public Person findEmployee(String firstName, String surname, int age) {
        return new Person(firstName, surname, age);
      }

      public Person findAnotherEmployee(String firstName, String surname, int age) {
        return new Person(firstName, surname, age);
      }
    }

The @Cacheable annotation takes three arguments: value, which is mandatory, together with key and condition. The first of these, value, is used to specify the name of the cache (or caches) in which a method’s return value is stored.

    @Cacheable(value = "employee")
    public Person findEmployee(String firstName, String surname, int age) {
      return new Person(firstName, surname, age);
    }

The code above ensures that the new Person object is stored in the “employee” cache.

Any data stored in a cache requires a key for its speedy retrieval. By default, Spring creates caching keys from the arguments passed to the annotated method, as in the code above. You can override this using @Cacheable’s second parameter: key. To define a custom key you use a SpEL expression.

    @Cacheable(value = "employee", key = "#surname")
    public Person findEmployeeBySurname(String firstName, String surname, int age) {
      return new Person(firstName, surname, age);
    }

In the findEmployeeBySurname(...) code, the ‘#surname’ string is a SpEL expression that means ‘go and create a key using the surname argument of the findEmployeeBySurname(...) method’.

The final @Cacheable argument is the optional condition argument. Again, this references a SpEL expression, but this time it specifies a condition that’s used to determine whether or not your method’s return value is added to the cache.

    @Cacheable(value = "employee", condition = "#age < 25")
    public Person findEmployeeByAge(String firstName, String surname, int age) {
      return new Person(firstName, surname, age);
    }

In the code above, I’ve applied the ludicrous business rule of only caching Person objects if the employee is less than 25 years old.

Having quickly demonstrated how to apply some caching, the next thing to do is to take a look at what it all means.

    @Test
    public void testCache() {
      Person employee1 = instance.findEmployee("John", "Smith", 22);
      Person employee2 = instance.findEmployee("John", "Smith", 22);
      assertEquals(employee1, employee2);
    }

The above test demonstrates caching at its simplest. On the first call to findEmployee(...), the result isn’t yet cached, so my code will be called and Spring will store its return value in the cache. On the second call to findEmployee(...) my code isn’t called and Spring returns the cached value; hence the local variable employee1 refers to the same object as employee2, which means that the following is true:

    assertEquals(employee1, employee2);

But things aren’t always so clear cut. Remember that in findEmployeeBySurname I’ve modified the caching key so that only the surname argument is used to create the key. The thing to watch out for when creating your own keying algorithm is to ensure that any key refers to a unique object.

    @Test
    public void testCacheOnSurnameAsKey() {
      Person employee1 = instance.findEmployeeBySurname("John", "Smith", 22);
      Person employee2 = instance.findEmployeeBySurname("Jack", "Smith", 55);
      assertEquals(employee1, employee2);
    }

The code above looks up two Person instances which clearly refer to different employees; however, because I’m caching on surname only, Spring will return a reference to the object that was created during my first call to findEmployeeBySurname(...). This isn’t a problem with Spring, but with my poor cache key definition.

Similar care has to be taken when referring to objects created by methods that have a condition applied to the @Cacheable annotation. In my sample code I’ve applied the arbitrary condition of only caching Person instances where the employee is under 25 years old.

    @Test
    public void testCacheWithAgeAsCondition() {
      Person employee1 = instance.findEmployeeByAge("John", "Smith", 22);
      Person employee2 = instance.findEmployeeByAge("John", "Smith", 22);
      assertEquals(employee1, employee2);
    }

In the above code, the references to employee1 and employee2 are equal because on the second call to findEmployeeByAge(...) Spring returns its cached instance.

    @Test
    public void testCacheWithAgeAsCondition2() {
      Person employee1 = instance.findEmployeeByAge("John", "Smith", 30);
      Person employee2 = instance.findEmployeeByAge("John", "Smith", 30);
      assertFalse(employee1 == employee2);
    }

In the unit test code above, by contrast, employee1 and employee2 refer to different objects because, in this case, John Smith is over 25.

That just about covers @Cacheable, but what about @CacheEvict and clearing items from the cache? Also, there’s the question of adding caching to your Spring config and choosing a suitable caching implementation. However, more on that later....
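The surname-key pitfall described above is easy to reproduce without Spring at all. The sketch below is a hand-rolled stand-in built on a HashMap, not Spring's cache implementation; the SurnameKeySketch class and its two maps are assumptions made purely to illustrate how the choice of key decides which object comes back.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a plain-Java imitation of keyed caching, not Spring code.
final class SurnameKeySketch {

    static final class Person {
        final String firstName;
        final String surname;
        final int age;

        Person(final String firstName, final String surname, final int age) {
            this.firstName = firstName;
            this.surname = surname;
            this.age = age;
        }
    }

    private final Map<String, Person> bySurname = new HashMap<>();
    private final Map<String, Person> byAllArguments = new HashMap<>();

    // Imitates key = "#surname": the surname alone identifies a cache entry.
    Person findEmployeeBySurname(final String firstName, final String surname, final int age) {
        return bySurname.computeIfAbsent(surname,
                k -> new Person(firstName, surname, age));
    }

    // Builds the key from every argument, so different employees never collide.
    Person findEmployee(final String firstName, final String surname, final int age) {
        final String key = firstName + "|" + surname + "|" + age;
        return byAllArguments.computeIfAbsent(key,
                k -> new Person(firstName, surname, age));
    }

    public static void main(final String[] args) {
        final SurnameKeySketch dao = new SurnameKeySketch();
        final Person john = dao.findEmployeeBySurname("John", "Smith", 22);
        final Person jack = dao.findEmployeeBySurname("Jack", "Smith", 55);
        // Asking for Jack returns John's cached entry, because only "Smith" was used as the key.
        System.out.println(john == jack);
        System.out.println(jack.firstName);
    }
}
```

A key that covers everything that distinguishes one result from another avoids the collision, at the cost of more cache entries.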

September 14, 2012

by Roger Hughes

· 197,035 Views · 8 Likes

Your First Hadoop MapReduce Job

Hadoop MapReduce is a YARN-based system for parallel processing of large data sets. In this article, learn to quickly start writing the simplest MapReduce job.

September 12, 2012

by Amresh Singh

· 19,670 Views

Caching and @Cacheable

Caches have been around in the software world for a long time. They’re one of those really useful things that, once you start using them, you wonder how on earth you got along without them, so it seems a little strange that the guys at Spring only got around to adding a caching implementation to Spring core in version 3.1. I’m guessing that previously it wasn’t seen as a priority and besides, before the introduction of Java annotations, one of the difficulties of caching was the coupling of caching code with your business code, which could often become pretty messy.

However, the guys at Spring have now devised a simple-to-use caching system based around a couple of annotations: @Cacheable and @CacheEvict.

The idea of the @Cacheable annotation is that you use it to mark the methods whose return values will be stored in the cache. It can be applied at either method or type level. When applied at method level, the annotated method’s return value is cached. When applied at type level, the return value of every method is cached. The code below demonstrates how to apply @Cacheable at type level:

    @Cacheable(value = "employee")
    public class EmployeeDAO {

      public Person findEmployee(String firstName, String surname, int age) {
        return new Person(firstName, surname, age);
      }

      public Person findAnotherEmployee(String firstName, String surname, int age) {
        return new Person(firstName, surname, age);
      }
    }

The @Cacheable annotation takes three arguments: value, which is mandatory, together with key and condition. The first of these, value, is used to specify the name of the cache (or caches) in which a method’s return value is stored.

    @Cacheable(value = "employee")
    public Person findEmployee(String firstName, String surname, int age) {
      return new Person(firstName, surname, age);
    }

The code above ensures that the new Person object is stored in the “employee” cache.

Any data stored in a cache requires a key for its speedy retrieval. By default, Spring creates caching keys from the arguments passed to the annotated method, as in the code above. You can override this using @Cacheable’s second parameter: key. To define a custom key you use a SpEL expression.

    @Cacheable(value = "employee", key = "#surname")
    public Person findEmployeeBySurname(String firstName, String surname, int age) {
      return new Person(firstName, surname, age);
    }

In the findEmployeeBySurname(...) code, the ‘#surname’ string is a SpEL expression that means ‘go and create a key using the surname argument of the findEmployeeBySurname(...) method’.

The final @Cacheable argument is the optional condition argument. Again, this references a SpEL expression, but this time it specifies a condition that’s used to determine whether or not your method’s return value is added to the cache.

    @Cacheable(value = "employee", condition = "#age < 25")
    public Person findEmployeeByAge(String firstName, String surname, int age) {
      return new Person(firstName, surname, age);
    }

In the code above, I’ve applied the ludicrous business rule of only caching Person objects if the employee is less than 25 years old.

Having quickly demonstrated how to apply some caching, the next thing to do is to take a look at what it all means.

    @Test
    public void testCache() {
      Person employee1 = instance.findEmployee("John", "Smith", 22);
      Person employee2 = instance.findEmployee("John", "Smith", 22);
      assertEquals(employee1, employee2);
    }

The above test demonstrates caching at its simplest. On the first call to findEmployee(...), the result isn’t yet cached, so my code will be called and Spring will store its return value in the cache. On the second call to findEmployee(...) my code isn’t called and Spring returns the cached value; hence the local variable employee1 refers to the same object as employee2, which means that the following is true:

    assertEquals(employee1, employee2);

But things aren’t always so clear cut. Remember that in findEmployeeBySurname I’ve modified the caching key so that only the surname argument is used to create the key. The thing to watch out for when creating your own keying algorithm is to ensure that any key refers to a unique object.

    @Test
    public void testCacheOnSurnameAsKey() {
      Person employee1 = instance.findEmployeeBySurname("John", "Smith", 22);
      Person employee2 = instance.findEmployeeBySurname("Jack", "Smith", 55);
      assertEquals(employee1, employee2);
    }

The code above looks up two Person instances which clearly refer to different employees; however, because I’m caching on surname only, Spring will return a reference to the object that was created during my first call to findEmployeeBySurname(...). This isn’t a problem with Spring, but with my poor cache key definition.

Similar care has to be taken when referring to objects created by methods that have a condition applied to the @Cacheable annotation. In my sample code I’ve applied the arbitrary condition of only caching Person instances where the employee is under 25 years old.

    @Test
    public void testCacheWithAgeAsCondition() {
      Person employee1 = instance.findEmployeeByAge("John", "Smith", 22);
      Person employee2 = instance.findEmployeeByAge("John", "Smith", 22);
      assertEquals(employee1, employee2);
    }

In the above code, the references to employee1 and employee2 are equal because on the second call to findEmployeeByAge(...) Spring returns its cached instance.

    @Test
    public void testCacheWithAgeAsCondition2() {
      Person employee1 = instance.findEmployeeByAge("John", "Smith", 30);
      Person employee2 = instance.findEmployeeByAge("John", "Smith", 30);
      assertFalse(employee1 == employee2);
    }

In the unit test code above, by contrast, employee1 and employee2 refer to different objects because, in this case, John Smith is over 25.

That just about covers @Cacheable, but what about @CacheEvict and clearing items from the cache? Also, there’s the question of adding caching to your Spring config and choosing a suitable caching implementation. However, more on that later...
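The effect of the condition argument can also be imitated in plain Java. This is a sketch under stated assumptions rather than Spring's implementation: the ConditionalCacheSketch class, its HashMap store and the IntPredicate standing in for the "#age < 25" expression are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntPredicate;

// Illustrative only: a plain-Java imitation of conditional caching, not Spring code.
final class ConditionalCacheSketch {

    static final class Person {
        final String firstName;
        final String surname;
        final int age;

        Person(final String firstName, final String surname, final int age) {
            this.firstName = firstName;
            this.surname = surname;
            this.age = age;
        }
    }

    private final Map<String, Person> cache = new HashMap<>();

    // Stands in for condition = "#age < 25".
    private final IntPredicate condition = age -> age < 25;

    Person findEmployeeByAge(final String firstName, final String surname, final int age) {
        if (!condition.test(age)) {
            // Condition not met: skip the cache and build a fresh object every time.
            return new Person(firstName, surname, age);
        }
        final String key = firstName + "|" + surname + "|" + age;
        return cache.computeIfAbsent(key, k -> new Person(firstName, surname, age));
    }

    public static void main(final String[] args) {
        final ConditionalCacheSketch dao = new ConditionalCacheSketch();
        // Under 25: the second call is served from the cache.
        System.out.println(dao.findEmployeeByAge("John", "Smith", 22)
                == dao.findEmployeeByAge("John", "Smith", 22));
        // 25 or over: each call produces a new object.
        System.out.println(dao.findEmployeeByAge("John", "Smith", 30)
                == dao.findEmployeeByAge("John", "Smith", 30));
    }
}
```

This is why a test that compares references behaves differently for the 22-year-old and the 30-year-old: whether the two calls return the same object depends entirely on whether the condition let the first result into the cache.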

September 12, 2012

by Roger Hughes

· 15,764 Views

New ActiveMQ failover and Clustering Goodies

For the last two weeks I’ve been working on some interesting use cases for the good ol’ failover transport. I finally have some time on my hands, so here’s a brief recap of what’s coming in the 5.6 release in this area.

First there’s a new feature, called Priority Backup. It’s described in detail here, but in a nutshell it gives you a mechanism for prioritizing your failover URLs and moving your clients to the preferred ones as soon as they become available. The most obvious use case for this is to keep your clients connected to the broker in the local data center whenever you can. By doing this, you get better performance and stability for your clients and also save on your bandwidth bills.

Another improvement is coming to the automatic broker cluster feature. Although this feature is not new, I spent some time hardening it and thought I’d share some more insight into how (and when) to use it in your projects.

In search of high availability, people often default to a master-slave architecture. This makes sense in most use cases, but if your flow is purely non-persistent you can probably come up with a more optimal architecture. Instead of having one broker at a time handling all your load, and another one just waiting for it to fail, you’ll get a more efficient system with some kind of active-active configuration where (possibly multiple) brokers share the load all the time. Ideally, clients would be evenly distributed and would rebalance if anything changes. Brokers don’t need to share any messages, because clients are distributed and the messages are non-persistent, so they will be lost anyway if a broker fails.

So can you achieve this kind of architecture with ActiveMQ? Sure you can. That’s where automatic rebalancing and clustering shine. First of all, brokers should be networked, but only so they can exchange information on their availability. They shouldn’t exchange messages (but of course they can if your use case needs it). In 5.6 you do that with pure static networks.

So now imagine three brokers, A, B and C, forming a full mesh. In addition, every broker uses the rebalance options on its transport connectors. All that is left for the client to do is connect to one of the brokers it knows, for example failover:(brokerA), and the broker will supply it with all the information on the other brokers in the cluster and whether it should reconnect to one of them or not. With a large number of clients connecting like this, they’ll very soon rebalance over the available brokers. You can stop one of the brokers in the cluster for updates and clients will rebalance over the remaining ones. You can even add a new broker to the cluster and everything will get rebalanced without any need for you to touch your clients.

So, basically, in this way you have both load balancing and high availability for your non-persistent messages. Additionally, your clients are automatically updated with all the information they need, and no manual intervention is required.

Although basic support for clustering has been there since 5.4, I did some more hardening and improved the rebalancing, so it’s coming in the Apache ActiveMQ 5.6 (and the next Fuse 5.5.1) release. Also, there is some more great stuff regarding broker clustering coming soon, so stay tuned and happy messaging.

September 10, 2012

by Dejan Bosanac

· 15,426 Views

Fixing Bugs - If You Can't Reproduce a Bug, You Can't Fix It

Fixing a problem usually starts with reproducing it – what Steve McConnell calls “stabilizing the error.” Technically speaking, you can’t be sure you are fixing the problem unless you can run through the same steps, see the problem happen yourself, fix it, and then run through the same steps and make sure that the problem went away. If you can’t reproduce it, then you are only guessing at what’s wrong, and that means you are only guessing that your fix is going to work. But let’s face it – it’s not always practical or even possible to reproduce a problem. Lots of bug reports don’t include enough information for you to understand what the hell the problem actually was, never mind what was going on when the problem occurred – especially bug reports from the field. Rahul Premraj and Thomas Zimmermann found in The Art of Collecting Bug Reports (from the book Making Software), that the two most important factors in determining whether a bug report will get fixed or not are: Is the description well-written, can the programmer understand what was wrong or why the customer thought something was wrong? Does it include steps to reproduce the problem, even basic information about what they were doing when the problem happened? It’s not a lot to ask – from a good tester at least. But you can’t reasonably expect this from customers. There are other cases where you have enough information, but don’t have the tools or expertise to reproduce a problem – for example, when a pen tester has found a security bug using specialist tools that you don’t have or don’t understand how to use. Sometimes you can fix a problem without being able to see it happen in front of you, come up with a theory on your own, trusting your gut – especially if this is code that you recently worked on. But reproducing the problem first gives you the confidence that you aren’t wasting your time and that you actually fixed the right issue. Trying to reproduce the problem should almost always be your first step. 
What’s involved in reproducing a bug? What you want to do is to find, as quickly as possible, a simple test that consistently shows the problem, so that you can then run a set of experiments, trace through the code, isolate what’s wrong, and prove that it went away after you fixed the code. The best explanation that I’ve found of how to reproduce a bug is in Debug It! where Paul Butcher patiently explains the pre-conditions (identifying the differences between your test environment and the customer’s environment, and trying to control as many of them as possible), and then how to walk backwards from the error to recreate the conditions required to make the problem happen again. Butcher is confident that if you take a methodical approach, you will (almost) always be able to reproduce the problem successfully. In Why Programs Fail: A guide to Systematic Debugging, Andreas Zeller, a German Comp Sci professor, explains that it’s not enough just to make the problem happen again. Your goal is to come up with the simplest set of circumstances that will trigger the problem – the smallest set of data and dependencies, the simplest and most efficient test(s) with the fewest variables, the shortest path to making the problem happen. You need to understand what is not relevant to the problem, what’s just noise that adds to the cost and time of debugging and testing – and get rid of it. You do this using binary techniques to slice up the input data set, narrowing in on the data and other variables that you actually need, repeating this until the problem starts to become clear. 
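The "binary techniques" idea can be sketched in a few lines of JavaScript. This is a deliberately simplified halving loop, not Zeller's full delta debugging algorithm, and the names shrinkFailingInput, fails, records and crashesOnNegative are made up for the illustration. It assumes you already have a test that reliably reports whether a given slice of the input still triggers the problem.

```javascript
// Hypothetical helper: shrink a failing input by repeatedly cutting it in half.
// `fails` is an assumed predicate that returns true when the bug still shows up.
function shrinkFailingInput(input, fails) {
  let current = input.slice();
  while (current.length > 1) {
    const mid = Math.floor(current.length / 2);
    const firstHalf = current.slice(0, mid);
    const secondHalf = current.slice(mid);
    if (fails(firstHalf)) {
      current = firstHalf; // the bug is reproducible with the first half alone
    } else if (fails(secondHalf)) {
      current = secondHalf; // the bug is reproducible with the second half alone
    } else {
      break; // the failure needs records from both halves; stop halving here
    }
  }
  return current;
}

// Made-up example: the "bug" is triggered by any negative record.
const records = [3, 8, 15, -1, 22, 42, 7];
const crashesOnNegative = (rows) => rows.some((n) => n < 0);
const minimal = shrinkFailingInput(records, crashesOnNegative);
console.log(minimal);
```

When neither half fails on its own, the loop simply stops. A more thorough approach would try removing smaller chunks instead, at the cost of many more test runs.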
Code Complete’s chapter on Debugging is another good guide on how to reproduce a problem following a set of iterative steps, and how to narrow in on the simplest and most useful set of test conditions required to make the problem happen; as well as common places to look for bugs: checking for code that has been changed recently, code that has a history of other bugs, code that is difficult to understand (if you find it hard to understand, there’s a good chance that the programmers who worked on it before you did too). Replay Tools One of the most efficient ways to reproduce a problem, especially in server code, is by automatically replaying the events that led up to the problem. To do this you’ll need to capture a time-sequenced record of what happened, usually from an audit log, and a driver to read and play the events against the system. And for this to work properly, the behavior of the system needs to be deterministic – given the same set of inputs in the same sequence, the same results will occur each time. Otherwise you’ll have to replay the logs over and over and hope for the right set of circumstances to occur again. On one system that I worked on, the back-end engine was a deterministic state machine designed specifically to support replay. All of the data and events, including configuration and control data and timer events, were recorded in an inbound event log that we could replay. There were no random factors or unpredictable external events – the behavior of the system could always be recreated exactly by replaying the log, making it easy to reproduce bugs from the field. It was a beautiful thing, but most code isn’t designed to support replay in this way. Recent research in virtual machine technology has led to the development of replay tools to snapshot and replay events in a virtual machine. 
VMware Workstation, for example, included a cool replay debugging facility for C/C++ programmers which was “guaranteed to have instruction-by-instruction identical behavior each time.” Unfortunately, this was an expensive thing to make work, and it was dropped in version 8, at the end of last year. Replay Solutions provides replay for Java programs, creating a virtual machine to record the complete stream of events (including database I/O, network I/O, system calls, interrupts) as the application is running, and then later letting you simulate and replay the same events against a copy of the running system, so that you can debug the application and observe its behavior. They also offer similar application record and replay technology for mobile HTML5 and JavaScript applications. This is exciting stuff, especially for complex systems where it is difficult to set up and reproduce problems in different environments. Fuzzing and Randomness If the problem is non-deterministic, or you can't come up with the right set of inputs, one approach to try is to simulate random data inputs and watch to see what happens, hoping to hit on a set of input variables that will trigger the problem. This is called fuzzing. Fuzzing is a brute force testing technique that is used to uncover data validation weaknesses that can cause reliability and security problems. It's effective at finding bugs, but it’s a terribly inefficient way to reproduce a specific problem. First you need to set up something to fuzz the inputs (this is easy if a program is reading from a file, or a web form – there are fuzzing tools to help with this – but a hassle if you need to write your own smart protocol fuzzer to test against internal APIs).
Then you need time to run through all of the tests (with mutation fuzzing, you may need to run tens of thousands or hundreds of thousands of tests to get enough interesting combinations) and more time to sift through and review all of the test results and understand any problems that are found. Through fuzzing you will get new information about the system to help you identify problem areas in the code, and maybe find new bugs, but you may not end up any closer to fixing the problem that you started on. Reproducing problems, especially when you are working from a bad bug report (“the system was running fine all day, then it crashed… the error said something about a null pointer I think?”) can be a serious time sink. But what if you can’t reproduce the problem at all? Let’s look at that next…

September 9, 2012

by Jim Bird

· 45,524 Views

"Schemas" in CouchDB

schema noun ( pl. schemata or schemas ) 1 technical a representation of a plan or theory in the form of an outline or model: a schema of scientific reasoning. 2 Logic a syllogistic figure. 3 (in Kantian philosophy) a conception of what is common to all members of a class; a general or essential type or form. CouchDB is a schema-less document store, but there are times when a schema is a good thing to have around, one way or another. So can you have your cake and eat it too? Below I'll take a high-level look at adding a kind of schema to an application and the benefits and drawbacks associated with this way of working. What I describe below isn't for everyone. It goes against some of the core principles of CouchDB and makes your data much less human readable, but there are cases where that trade-off is worth making. Schemas: WTF?! It might seem a bit weird to add a schema to a schema-less database, but sometimes it is a very useful thing indeed. When you're dealing with large datasets, verbose object key names can be a problem (e.g. cost you money), so you end up stuck between a rock and a hard place: either make your data terse and hard to use, or be explicit and spend more on storage and network. { "shape": "triangle", "colour_label": "red", "opposite_length_in_mm": 767.12254256805875, "angle_in_radians": 1.5514293603308698, "adjacent_length_in_mm": 73.59881843627835 } What usually happens is some middle ground where a nice descriptive name like "angle_in_radians" gets reduced to "angle" or "rads". That's fine in that it reduces the storage and network required to deal with all that data. { "adj": 73.59881843627835, "shape": "triangle", "angle": 1.5514293603308698, "opp": 767.12254256805875, "colour": "red" } However, by making this small change you move the description of the data out of your database and into some undefined place: higher level code, documentation, shared knowledge, a whiteboard, a notebook, someone's head. 
As your data becomes more terse you might rely on duck typing (deriving from the data itself what the data describes) to get data that quacks right in your application. That's fine so long as you have data that is sufficiently distinguishable from the other ducks on the pond; if I rely on pulling a triangle object from the database because it has an angle member I might accidentally pull out a rhombus or an icosahedron. To make sure you get the data you expect you might add an explicit type field to each document (e.g. "type=goose" or "shape=triangle"), something which I've always felt was rather odd. This starts to add up on storage (remember you have a large dataset/flock of ducks) and, more importantly, it doesn't help with where the description of the data is held: you know that you have a goose but don't know what a goose is. This last point is important, especially if you're working in a team of developers. Knowing what describing a shape as a triangle means is vital in producing consistent code that many people can work on. The straitjacket of a SQL schema looks pretty comfy sometimes. Okay, I'll buy that a schema might be useful... So how do you add a schema into a CouchDB database, something that is inherently schema-less? Can I get the best of both worlds? Here's a little trick that might help. First you define a document that is the schema for a particular type of data: { "_id": "datatype/triangle/v1", "fields": [ "opposite_length_in_mm", "adjacent_length_in_mm", "angle_in_radians", "colour_label" ] } Then you change your document structure to reference that "schema": { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ] } Note that the schema is versioned and that ordering in the data list is important here! I now know precisely what the data represents without having to store that description in the data itself. 
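To make the pairing of schema and data concrete, here is a minimal JavaScript sketch that turns a compact document back into a descriptive object by zipping the schema's field names with the data list. The function name expandDocument and its error handling are assumptions made for this example; only the two documents come from the text above.

```javascript
// The schema document and the compact data document from the text above.
const triangleSchema = {
  _id: "datatype/triangle/v1",
  fields: [
    "opposite_length_in_mm",
    "adjacent_length_in_mm",
    "angle_in_radians",
    "colour_label"
  ]
};

const compactDoc = {
  datatype: "triangle/v1",
  data: [879.07395066446952, 84.607510245708468, 1.4444230241122715, "red"]
};

// Hypothetical helper: pair each field name with the value at the same position.
function expandDocument(schema, doc) {
  if (schema._id !== "datatype/" + doc.datatype) {
    throw new Error("schema " + schema._id + " does not describe " + doc.datatype);
  }
  if (schema.fields.length !== doc.data.length) {
    throw new Error("expected " + schema.fields.length + " values, got " + doc.data.length);
  }
  const expanded = {};
  schema.fields.forEach(function (name, index) {
    expanded[name] = doc.data[index];
  });
  return expanded;
}

const expanded = expandDocument(triangleSchema, compactDoc);
console.log(expanded.colour_label, expanded.angle_in_radians);
```

The length check catches a missing or extra value, but nothing here can detect two values that were written in the wrong order, which is why the ordering matters so much.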
This way of working has benefits beyond disk storage; you reduce wire traffic, and there is less for a client to parse before rendering it. This is especially useful if you're rendering into a browser-based visualisation: you don't need a complex set of objects to make a bar chart, just a list of x and y values. I can also share the data structure with colleagues and be reasonably confident that when I'm talking about a "v1 triangle" they'll know that the lengths are in millimeters and refer to the opposite and adjacent sides, and that the angle is in radians, hopefully reducing the chance of costly mistakes. Isn't that error prone? Yes and no. If you make a mistake in the ordering of your fields then, yes, you are going to have issues. This is reasonably easy to manage with some form of client verification (e.g. validation on a web form) and by generating the interface from the data (e.g. use the schema definition to build the GUI). If you're adding these data into the database by hand (e.g. via curl or Futon) then you aren't going to be in the regime where this trick is useful; your dataset needs to be large for this to make sense. Things still quack What's particularly nice about this way of working is that I can still duck type the data, add additional fields to annotate it etc., since the schema isn't strictly enforced. Nothing stops me from having a triangle document like: { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location": "space" } My views that deal with the data with a schema will still work (by ignoring these additional fields), my MVC framework will still render my pages, and I'll still have all the data I want in my database. 
Nesting You could have a nested object structure like: { "datatype": "pattern/v1", "data": [ { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location": "space" }, { "datatype": "triangle/v1", "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "blue" ], "owner": "Fred", "location": "space" }, { "datatype": "square/v1", "data": [ 10, "green" ] } ] } But if you're going to have a schema you may as well reflect the nesting inside it, e.g. say that you have a list of triangles and a list of squares: { "_id": "datatype/pattern/v1", "fields": [ ["triangle/v1"], ["square/v1"] ] } { "datatype": "pattern/v1", "data": [ [ { "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "red" ], "owner": "Simon", "location": "space" }, { "data": [ 879.07395066446952, 84.607510245708468, 1.4444230241122715, "blue" ], "owner": "Fred", "location": "space" } ], [ { "data": [ 10, "green" ] } ] ] } Schema evolution A nice feature of this way of working is that you can deal with schema evolutions, i.e. changes to the format of your data. { "_id": "datatype/triangle/v2", "fields": [ "opposite_length_in_cm", "hypotenuse_length_in_cm", "angle_in_degrees", "colour_label" ] } There are only so many ways you can represent the data. While sometimes you may have a major schema evolution, one where old data is completely unusable, often changes are just tweaks for consistency (say changing the units of a quantity) or extend the schema by adding optional data. In either case you should be able to use data from multiple schema versions together by applying appropriate manipulations to the data. For example you could instantiate shape objects via a factory which knows how to create the right object for different schema versions. Validation The above does no validation of the data; the colour field in the input data could be set to a number instead of a string, the angle to something non-physical etc. 
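As a rough illustration of the checks that are missing, here is a minimal client-side sketch in JavaScript. The function name validateTriangleV1 and the specific rules (positive lengths, an angle strictly between 0 and π/2, a string colour) are assumptions made for this example; they are not part of CouchDB or of the schema documents above.

```javascript
// Hypothetical client-side check for a "triangle/v1" document before it is written.
// Returns a list of problems; an empty list means the document passed every check.
function validateTriangleV1(doc) {
  const errors = [];
  if (doc.datatype !== "triangle/v1") {
    errors.push("unexpected datatype: " + doc.datatype);
  }
  if (!Array.isArray(doc.data) || doc.data.length !== 4) {
    errors.push("data must be a list of 4 values");
    return errors; // nothing else can be checked without the values
  }
  // Order follows the triangle/v1 schema: opposite, adjacent, angle, colour.
  const [opposite, adjacent, angle, colour] = doc.data;
  if (typeof opposite !== "number" || !(opposite > 0)) {
    errors.push("opposite length must be a positive number");
  }
  if (typeof adjacent !== "number" || !(adjacent > 0)) {
    errors.push("adjacent length must be a positive number");
  }
  // Assumed rule: the angle of a right-angled triangle lies between 0 and pi/2.
  if (typeof angle !== "number" || !(angle > 0 && angle < Math.PI / 2)) {
    errors.push("angle must be between 0 and pi/2 radians");
  }
  if (typeof colour !== "string") {
    errors.push("colour must be a string");
  }
  return errors;
}

const goodDoc = {
  datatype: "triangle/v1",
  data: [879.07395066446952, 84.607510245708468, 1.4444230241122715, "red"]
};
const badDoc = {
  datatype: "triangle/v1",
  data: [879.07395066446952, 84.607510245708468, 4, 7]
};
const goodErrors = validateTriangleV1(goodDoc);
const badErrors = validateTriangleV1(badDoc);
console.log(goodErrors, badErrors);
```

Returning a list of messages rather than throwing on the first problem lets a web form show every issue at once.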
If you really needed validation you could do it with CouchDB's validation functions. If you go the fully validated route you'd want to define the schema in the design document (instead of as a normal doc) and use a CommonJS include to make sure that the validator in the app was doing the same thing as the schema. This ties you to a version of the design document (which is where the validators live), which may or may not be an issue. It will also considerably slow down the insertion rate, as CouchDB has to do more work to add your data. Personally I prefer to put validation logic in the client making writes. Views If I were using this way of working I would want to have a view which returned all the schemas defined on the database. This then allows me to build objects appropriately. A view to return schema documents would look like: function(doc) { if (doc._id.slice(0, 'datatype'.length) == 'datatype') { emit (doc._id.slice('datatype/'.length, doc._id.length), doc.fields) } } You can pull out documents that have a schema with a simple view like: function(doc) { if (doc.datatype){ emit(doc.datatype, doc.data); } } This can be queried to find objects of a given shape using CouchDB's view slicing (e.g. ?startkey="square/v1"&endkey="square/v2") which returns data like: {"id":"datatype/square/v1","key":["square/v1",0],"value":["side_length_in_mm","colour_label"]}, {"id":"f98ffe7e4cd91cbb0d904f9098499ca8","key":["square/v1",1],"value":[872.4342711412228,"green"]}, {"id":"f98ffe7e4cd91cbb0d904f909849a218","key":["square/v1",1],"value":[370.29971491443905,"yellow"]}, {"id":"f98ffe7e4cd91cbb0d904f909849acd0","key":["square/v1",1],"value":[8.799279300193753,"yellow"]} You'll notice the name of the "schema" is the key and the values are held in value. This means I can parse the data into a set of appropriate objects with something like: var objects = []; function build(schema, data){ // Build the appropriate object for the schema... 
} for (var i = 0; i < data.length; i++) { // build up the objects in a factory var row = data[i]; var obj = build(row.key, row.value); objects.push(obj); } If I wanted all versions of a shape, and used a vNUMERIC_COUNTER notation for versioning, the query would be ?startkey="square/v1"&endkey="square/vXXX", as digits sort lower than letters. Taking it to the extreme If you are really worried about data size you can take this technique to the extreme by encoding the data arrays as a byte string and using the schema documents to describe that byte array. This effectively turns your JSON structure into something not dissimilar to a protocol buffer, at the expense of human readability and view complexity. If you are particularly concerned with data size over the wire (for example, you are writing an MMORPG) then this may be an acceptable trade-off. Reminder This trick isn't suitable for every dataset. If you modify the data by hand it is prone to error. If you have a small dataset, or only ever send a small subset of the data to the client, it's massive overkill. But if you have a large dataset of machine-generated data that needs to be frequently accessed over the WAN (think a monitoring app or game) then this is a nice way to reduce storage, network IO and browser render time. It's also worth reiterating that the schema is not enforced (you could have a square with 3 sides) and that adding strict schema enforcement with a validation function will considerably slow down the insert rate.

September 8, 2012

by Simon Metson

· 10,362 Views