Data Engineering Resources

The Latest Data Engineering Topics

Introduction

Now that I have described the basics of JPA and Hibernate flush strategies, I can continue unraveling the surprising behavior of Hibernate’s AUTO flush mode.

Not all queries trigger a Session flush

Many would assume that Hibernate always flushes the Session before executing any query. While this might have been a more intuitive approach, and probably closer to JPA’s AUTO FlushModeType, Hibernate tries to optimize that. If the currently executing query is not going to hit the tables affected by the pending SQL INSERT/UPDATE/DELETE statements, then the flush is not strictly required.

As stated in the reference documentation, the AUTO flush strategy may sometimes synchronize the current persistence context prior to a query execution. It would have been more intuitive if the framework authors had chosen to name it FlushMode.SOMETIMES.

JPQL/HQL and SQL

Like many other ORM solutions, Hibernate offers a limited Entity querying language (JPQL/HQL) that’s very much based on SQL-92 syntax. The entity query language is translated to SQL by the current database dialect, and so it must offer the same functionality across different database products. Since most database systems are SQL-92 compliant, the Entity Query Language is an abstraction of the most common database querying syntax.

While you can use the Entity Query Language in many use cases (selecting Entities and even projections), there are times when its limited capabilities are no match for an advanced querying request. Whenever we want to make use of some specific querying techniques, such as:

- Window functions
- Pivot tables
- Common Table Expressions

we have no other option but to run native SQL queries.

Hibernate is a persistence framework. Hibernate was never meant to replace SQL. If some query is better expressed in a native query, then it’s not worth sacrificing application performance on the altar of database portability.
AUTO flush and HQL/JPQL

First we are going to test how the AUTO flush mode behaves when an HQL query is about to be executed. For this we define two unrelated entities, Product and User.

The test will execute the following actions:

- A Product is going to be persisted.
- Selecting User(s) should not trigger the flush.
- Querying for Product, the AUTO flush should trigger the entity state transition synchronization (a Product INSERT should be executed prior to executing the select query).

    Product product = new Product();
    session.persist(product);
    assertEquals(0L, session.createQuery("select count(id) from User").uniqueResult());
    assertEquals(product.getId(), session.createQuery("select p.id from Product p").uniqueResult());

Giving the following SQL output:

    [main]: o.h.e.i.AbstractSaveEventListener - Generated identifier: f76f61e2-f3e3-4ea4-8f44-82e9804ceed0, using strategy: org.hibernate.id.UUIDGenerator
    Query:{[select count(user0_.id) as col_0_0_ from user user0_][]}
    Query:{[insert into product (color, id) values (?, ?)][12,f76f61e2-f3e3-4ea4-8f44-82e9804ceed0]}
    Query:{[select product0_.id as col_0_0_ from product product0_][]}

As you can see, the User select hasn’t triggered the Session flush. This is because Hibernate inspects the current query space against the pending table statements. If the currently executing query doesn’t overlap with the unflushed table statements, the flush can be safely ignored.
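The query-space check described above can be pictured with a small stand-alone model. This is only a sketch of the idea, not Hibernate's implementation: the class name AutoFlushSketch and the method needsFlush are hypothetical, and table names are modelled as plain strings.

```java
import java.util.Collections;
import java.util.Set;

/** Illustrative model only; these names are hypothetical and not part of Hibernate. */
final class AutoFlushSketch {

    /**
     * Returns true when the tables a query reads (its "query space")
     * overlap with the tables that still have unflushed DML statements.
     */
    static boolean needsFlush(Set<String> querySpaces, Set<String> pendingTables) {
        return !Collections.disjoint(querySpaces, pendingTables);
    }

    public static void main(String[] args) {
        // session.persist(product) queued an INSERT against the product table.
        Set<String> pending = Set.of("product");

        // "select count(id) from User" only reads the user table: no overlap.
        System.out.println(needsFlush(Set.of("user"), pending));

        // "select p.id from Product p" reads the product table: overlap.
        System.out.println(needsFlush(Set.of("product"), pending));

        // A theta-style join over user and product overlaps as well.
        System.out.println(needsFlush(Set.of("user", "product"), pending));
    }
}
```

In this model only the first query can safely skip the flush, which mirrors the order of statements in the SQL log above.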
HQL can detect the Product flush even for:

Sub-selects

    session.persist(product);
    assertEquals(0L, session.createQuery(
        "select count(*) " +
        "from User u " +
        "where u.favoriteColor in (select distinct(p.color) from Product p)").uniqueResult());

Resulting in a proper flush call:

    Query:{[insert into product (color, id) values (?, ?)][Blue,2d9d1b4f-eaee-45f1-a480-120eb66da9e8]}
    Query:{[select count(*) as col_0_0_ from user user0_ where user0_.favoriteColor in (select distinct product1_.color from product product1_)][]}

Or theta-style joins

    session.persist(product);
    assertEquals(0L, session.createQuery(
        "select count(*) " +
        "from User u, Product p " +
        "where u.favoriteColor = p.color").uniqueResult());

Triggering the expected flush:

    Query:{[insert into product (color, id) values (?, ?)][Blue,4af0b843-da3f-4b38-aa42-1e590db186a9]}
    Query:{[select count(*) as col_0_0_ from user user0_ cross join product product1_ where user0_.favoriteColor=product1_.color][]}

This works because Entity Queries are parsed and translated to SQL queries. Hibernate cannot reference a non-existing table, therefore it always knows the database tables an HQL/JPQL query will hit.

So Hibernate is only aware of those tables we explicitly referenced in our HQL query. If the current pending DML statements imply database triggers or database-level cascading, Hibernate won’t be aware of those. So even for HQL, the AUTO flush mode can cause consistency issues.

AUTO flush and native SQL queries

When it comes to native SQL queries, things get much more complicated. Hibernate cannot parse SQL queries, because it only supports a limited database query syntax. Many database systems offer proprietary features that are beyond Hibernate Entity Query capabilities.
Querying the Product table with a native SQL query is not going to trigger the flush, causing an inconsistency issue:

    Product product = new Product();
    session.persist(product);
    assertNull(session.createSQLQuery("select id from product").uniqueResult());

    DEBUG [main]: o.h.e.i.AbstractSaveEventListener - Generated identifier: 718b84d8-9270-48f3-86ff-0b8da7f9af7c, using strategy: org.hibernate.id.UUIDGenerator
    Query:{[select id from product][]}
    Query:{[insert into product (color, id) values (?, ?)][12,718b84d8-9270-48f3-86ff-0b8da7f9af7c]}

The newly persisted Product was only inserted during transaction commit, because the native SQL query didn’t trigger the flush. This is a major consistency problem, one that’s hard to debug and that many developers don’t even foresee. That’s one more reason for always inspecting auto-generated SQL statements.

The same behaviour is observed even for named native queries:

    @NamedNativeQueries(
        @NamedNativeQuery(name = "product_ids", query = "select id from product")
    )

    assertNull(session.getNamedQuery("product_ids").uniqueResult());

So even if the SQL query is pre-loaded, Hibernate won’t extract the associated query space for matching it against the pending DML statements.

Overruling the current flush strategy

Even if the current Session defines a default flush strategy, you can always override it on a query basis.

Query flush mode

The ALWAYS mode is going to flush the persistence context before any query execution (HQL or SQL). This time, Hibernate applies no optimization and all pending entity state transitions are going to be synchronized with the current database transaction.

    assertEquals(product.getId(), session.createSQLQuery("select id from product").setFlushMode(FlushMode.ALWAYS).uniqueResult());

Instructing Hibernate which tables should be synchronized

You could also add a synchronization rule on your currently executing SQL query.
Hibernate will then know what database tables need to be synchronized prior to executing the query. This is also useful for second-level caching.

    assertEquals(product.getId(), session.createSQLQuery("select id from product").addSynchronizedEntityClass(Product.class).uniqueResult());

Conclusion

The AUTO flush mode is tricky, and fixing consistency issues on a query basis is a maintainer’s nightmare. If you decide to add a database trigger, you’ll have to check all Hibernate queries to make sure they won’t end up running against stale data.

My suggestion is to use the ALWAYS flush mode, even if Hibernate authors warned us that: this strategy is almost always unnecessary and inefficient.

Inconsistency is much more of an issue than some occasional premature flushes. While mixing DML operations and queries may cause unnecessary flushing, this situation is not that difficult to mitigate. During a session transaction, it’s best to execute queries at the beginning (when no pending entity state transitions are to be synchronized) and towards the end of the transaction (when the current persistence context is going to be flushed anyway).

The entity state transition operations should be pushed towards the end of the transaction, trying to avoid interleaving them with query operations (therefore preventing a premature flush trigger).
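The three situations covered in this article (a native query under AUTO, the ALWAYS mode, and an explicit synchronization hint) can be summarised as a small decision table. The sketch below is a simplified model under stated assumptions, not Hibernate code: FlushDecisionSketch, Mode and flushBeforeQuery are hypothetical names, and a native query without a hint is modelled as having an empty set of known tables.

```java
import java.util.Collections;
import java.util.Set;

/** Illustrative decision table; hypothetical names, not Hibernate's internals. */
final class FlushDecisionSketch {

    enum Mode { AUTO, ALWAYS }

    /**
     * @param mode          flush mode in effect for the query
     * @param knownSpaces   tables the framework knows the query reads; empty for
     *                      a native SQL query without a synchronization hint
     * @param pendingTables tables with unflushed INSERT/UPDATE/DELETE statements
     */
    static boolean flushBeforeQuery(Mode mode, Set<String> knownSpaces, Set<String> pendingTables) {
        if (mode == Mode.ALWAYS) {
            return true; // no optimization: synchronize before every query
        }
        // AUTO: flush only when a known table overlaps the pending statements.
        return !Collections.disjoint(knownSpaces, pendingTables);
    }

    public static void main(String[] args) {
        Set<String> pending = Set.of("product");

        // Native "select id from product" under AUTO, no hint: nothing is known, so no flush.
        System.out.println(flushBeforeQuery(Mode.AUTO, Set.of(), pending));

        // Same query with the ALWAYS mode: flushed regardless.
        System.out.println(flushBeforeQuery(Mode.ALWAYS, Set.of(), pending));

        // Same query under AUTO with the product table declared as synchronized.
        System.out.println(flushBeforeQuery(Mode.AUTO, Set.of("product"), pending));
    }
}
```

Only the first call skips the flush, which is the stale-read case the native query test demonstrates.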

August 15, 2014

by Vlad Mihalcea


The Programming Challenges of IoT

Pragmatic developers can look at the Internet of Things in two ways:

- This is amazing. I can only begin to imagine how I can directly improve the world outside the set of networked computer boxes.
- This is terrifying. If something goes wrong, then it’s on me, and this time the system affected extends outside the set of networked computer boxes.

IoT is amazing in the way it bridges physical and virtual environments, but even the phrase “Internet of Things” should give a developer pause. Computers are pretty smart. Things are stupid. IoT tries to put Things online and tries to make them into inter-networked computers.

That’s pop-philosophy, but you want to develop in the real world. So what real-world challenges will you face when you shoot for the IoT moon?

Two Types of Challenges

It seems there are two types of programming challenges for the Internet of Things:

- Data and control (the comp-sci and networking stuff)
- Information and business logic (the info-sci and human-computer interaction stuff)

For this article, we’re going to talk about the programming problems we can solve around IoT. We’ll start at the bottom (data and control) and work our way up to the big picture (information and business logic).

Type 1: Data and Control

Challenge 1.1: Power

This one is pretty obvious. Many IoT devices are wireless, and no one has invented thumbnail fusion reactors yet.

One solution is equally obvious: pick your algorithms carefully. If you can save cycles to perform a given task, then do it. Libraries for implementing power-optimized algorithms will presumably spring up in greater numbers, but even so, you may need to inject some heavy-duty comp-sci know-how into IoT app development.

The second solution is more complex than the first. Higher-level developers will have to think more about Dynamic Power Management (DPM), which just means: shutting down devices when they don’t need to be on and starting them up when they do.
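One very simple form of DPM is an idle timeout. The sketch below is a hypothetical policy object, not a real device or OS API: IdleTimeoutPolicy and its methods are invented for illustration, and timestamps are passed in explicitly so the logic stays independent of any clock.

```java
/** Hypothetical idle-timeout policy; a sketch of the DPM idea, not a real device API. */
final class IdleTimeoutPolicy {

    private final long idleTimeoutMillis;
    private long lastActivityMillis;

    IdleTimeoutPolicy(long idleTimeoutMillis, long nowMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
        this.lastActivityMillis = nowMillis;
    }

    /** Call whenever the device does useful work (a reading, a transmission). */
    void recordActivity(long nowMillis) {
        lastActivityMillis = nowMillis;
    }

    /** True once the device has been idle for at least the configured timeout. */
    boolean shouldSleep(long nowMillis) {
        return nowMillis - lastActivityMillis >= idleTimeoutMillis;
    }

    public static void main(String[] args) {
        IdleTimeoutPolicy policy = new IdleTimeoutPolicy(5_000, 0);

        System.out.println(policy.shouldSleep(3_000)); // idle for 3 s
        policy.recordActivity(4_000);                  // activity resets the timer
        System.out.println(policy.shouldSleep(8_000)); // idle for 4 s
        System.out.println(policy.shouldSleep(9_000)); // idle for 5 s
    }
}
```

A real policy would also weigh the cost of waking the device back up, but the core of DPM is still a decision about when a device can be powered down.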
Normally the operating system worries about this, but an IoT application that orchestrates wearables and phones, for example, will know things that each device’s OS won’t, and will therefore be able to switch things on and off more intelligently than each device’s individual OS. Another option is to write or customize an embedded OS.

Challenge 1.2: Latency

Latency on IoT sits in two places: at the source and in the pipes.

The basic problem is a physical one. Thing-chips often have to be small, which means that the chip can only be as powerful as current transistor technology allows.

Another problem is power. Many small devices transmit and receive data in discrete active/sleep cycles (think TDMA) in order to save bandwidth and power, but the more time a device spends asleep to save power, the higher its latency.

Another tradeoff is that network topologies optimized for IoT can involve more hops over slower devices. Mesh networks, for example, tolerate the failure of a few nodes. Similarly, “fog” and “edge” computing paradigms relieve Internet infrastructure by doing as much as possible without hub-nodes. The downside is that each node (a) can’t do very much on its own and (b) can only talk to neighboring nodes.

The problem in the pipes is a matter of network infrastructure. Simply: the more Things, the less available bandwidth. Infrastructure technology will get faster, but cell networks won’t catch up overnight. And Things, unlike fancier computers, are often supposed to transmit blindly, that is, without anyone necessarily asking them to. This means there’s a massive potential for wasted bandwidth.

Challenge 1.3: Unreliability

The third challenge flows from the first two. Devices are unreliable; “Things” even more so.

The distributed and decentralized virtues of IoT bring their own reliability problems. Here are just a few:

- Ubiquitous devices are cheap, so they fail more often.
- Truly ad-hoc connectivity implies ephemeral SLA, so uptime and recovery time may be unclear.
- Loosely controlled devices may have better things to do than give you their data (or computing resources), so concurrency may grow very complex.
- Less-reliable hardware generates less-reliable information (‘does my outlying datapoint just signify device failure?’), so you may need to chew your data more thoroughly at the application level.

In a sense, IoT decouples low-level (the sub-session layer) from high-level channel capacity, because the distribution of error-sources on IoT is more heavily weighted toward originating or remote nodes. This means more error-correcting for application developers.

Type 2: Information and Business Logic

Challenge 2.1: Vast & Thin Data

Sensors on smartphones are already generating oceans of raw data. These sensors are pretty sophisticated. Every major mobile OS provides a unified, simple API to access clean sensor and geo data.

But start grabbing this data and it’s not immediately clear what to do with it. Try to think of killer applications for barometric data, besides weather and elevation (with GPS), off the top of your head. Raw sensor data is extremely thin. It doesn’t explain itself, and we haven’t yet produced a complete mapping from physical measurements to business logic, let alone software design.

Even if you know what to do with sensor/geo data eventually, you may have to learn new algorithms and data structures to process it immediately. Geo-graphs aren’t CS101 graph data structures (for one thing, edge length is a first-class citizen of geo-graphs).

The size of data over IoT is itself a problem. Wireless sensors beget tons of data. All the problems (and opportunities) of Big Data cascade naturally from IoT. Massively distributed computing on IoT devices is an exciting thought, but the toolchain for splitting calculations over a thousand idle Fitbits just isn’t here yet.

Challenge 2.2: Context-Sensitivity

Consider the term “ubiquitous computing,” defined as: what happens when wirelessly connected sensors and actuators, placed more or less everywhere, allow software to interact with much larger swaths of the physical world than just hardware or bare metal. Put ubiquitous computing on the Internet, and IoT makes the software context much larger. This has implications at two basic levels.

At a high computer-architectural level: IoT extends the concept of computing environment well outside the von Neumann machine and weakens the concept of peripheral I/O. In an IoT-world interface, sensors are input and actuators are output. As IoT devices process increasingly at the edge (within individual nodes), the devices that appear peripheral to other nodes are actually doing an awful lot of computation.

At a high business-logic level: the more stuff outside the computer-box affects the program, the less predictable the program behavior becomes at runtime. The same bizarrely-birthed memory leak might slow down the UI in a smartphone context but contribute to a cascading electrical grid failure in an IoT context. This means that IoT demands more self-monitoring and self-repairing code.

Two Types of Solutions

Plenty of researchers are working on ambitious solutions to the programming challenges presented by IoT. Two of the more exciting examples include:

- Abstract Task Graph: a data-driven model that maps the network graph to an application graph [1]
- Computational REST: replaces content resources with computation resources [2]

There are also a few more strategies you can use right now to solve some of the IoT programming challenges mentioned above.

Reactive Programming

This general purpose paradigm responds to all major application-level challenges and embraces opportunities presented by IoT. The four definitive attributes of a reactive application are: event-driven, scalable, resilient, and responsive [3].
These four are excellent guiding principles for IoT applications at a high, cross-stack level.

Flow-based Programming and the Actor Model

Both present strongly independent components where only messages can affect processes. Both are deeply amenable to concurrency (for example, shared state is discouraged), nondeterminism, and scaling without exponential complexity growth, because components are black boxes. FBP is a bit more pragmatic and restrictive, while the actor model is less restrictive and a bit harder to implement. FBP has already been implemented in JavaScript (NoFlo), and the actor model has been implemented in Java (Akka) [4][5][6][7].

What’s important to remember is that there are already tools and techniques that can help you build IoT applications. FBP, actors, and reactive programming all have key attributes for creating applications that leverage the strengths of IoT to overcome its challenges.

[1] https://www.usenix.org/legacy/event/mobisys05/eesr05/tech/full_papers/bakshi/bakshi.pdf
[2] http://isr.uci.edu/tech_reports/UCI-ISR-10-3.pdf
[3] http://www.reactivemanifesto.org/
[4] http://jpaulmorrison.com/fbp/
[5] http://arxiv.org/ftp/arxiv/papers/1008/1008.1459.pdf
[6] http://noflojs.org/
[7] http://akka.io/

August 14, 2014

by John Esposito


DB Queries in Spring's application-context.xml

While working on one of my assignments, I realized that the age-old practice of writing DB queries inside Java class files is a real pain. The queries are difficult to read and understand, and modifying them later takes a lot of work. To get rid of that pain, I put the DB queries in Spring’s context XML. Later I realized that injecting queries through XML is not only easier to maintain and understand; it also lets us segregate DB queries by module into different "xyzDAO-queries.xml" files. It also makes it easier to add more customers to the existing application.

The context XML shown here is called "userDAO-queries.xml". In it I created a Map with an id for each query. Each Map holds an entry whose value contains the query; the Java code reads that entry through the key QueryConstants.QUERY_NODE. The Java classes DBQueriesImpl.java and UserDAOImpl.java fetch the exact query from the "xyzDAO-queries.xml". Please see the comments in the Java classes.

Context XML (userDAO-queries.xml) part, showing the query text of each Map:

    SELECT user_id, user_name, user_password, user_fname, user_mname, user_lname, user_create_time FROM ty_users

    SELECT user_id, user_name, user_password FROM ty_users WHERE user_name = ? and user_password = ?

    SELECT user_id, user_name, user_password FROM ty_users WHERE user_id=?

    INSERT INTO ty_users (user_name, user_password) VALUES (?, ?)

Java Code Part:

UserDAOImpl.java

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Generic type parameters below are assumed; adjust them to match UserDAO and DBConnection.
    public class UserDAOImpl implements UserDAO {

        private static final Logger logger = LoggerFactory.getLogger(UserDAOImpl.class);

        // Loads the query definitions for this DAO.
        DBQueries dbQuery = new DBQueriesImpl("userDAO-queries.xml");
        DBConnection dbConn = ConnectionFactory.getInstance().getConnectionMySQL();

        @Override
        public List<User> authenitcUserDetails(String userName, String userPassword) {
            List<User> listUser = new ArrayList<User>();
            try {
                // Pass the query name (the Map id) to get the query map.
                Map queryMap = dbQuery.getQueryBeanFromFactory("VALIDATE_USER_CREDENTIAL");
                // Pass the query map to get the query.
                String query = dbQuery.getQuery(queryMap);
                // The rest is common fetching code.
                List<Map<String, Object>> mappedUserList =
                        dbConn.dbQueryRead(query, new Object[]{userName, userPassword});
                if (mappedUserList.size() > 0) {
                    listUser = new QueryResultSetMapperImpl().mapRersultSetToObject(mappedUserList, User.class);
                }
            } catch (Exception ex) {
                logger.error("authenitcUserDetails Error:: ", ex);
            }
            return listUser;
        }
    }

DBQueriesImpl.java

    import java.util.Map;

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.context.ApplicationContext;
    import org.springframework.context.support.ClassPathXmlApplicationContext;

    public class DBQueriesImpl implements DBQueries {

        private static final Logger logger = LoggerFactory.getLogger(DBQueriesImpl.class);
        private ApplicationContext queriesCtx;

        /**
         * Loads the query context XML.
         * @param queryContext classpath location of the query context XML
         */
        public DBQueriesImpl(String queryContext) {
            queriesCtx = new ClassPathXmlApplicationContext(queryContext);
        }

        /**
         * Reads the query map from the loaded application context.
         */
        @Override
        public Map getQueryBeanFromFactory(String beanId) {
            Map queryMap = null;
            if (queriesCtx != null && beanId != null) {
                queryMap = (Map) queriesCtx.getBean(beanId);
            }
            return queryMap;
        }

        /**
         * Gets the exact query from the query map.
         */
        @Override
        public String getQuery(Map queryMap) {
            if (queryMap == null || !queryMap.containsKey(QueryConstants.QUERY_NODE)) {
                throw new IllegalArgumentException("No " + QueryConstants.QUERY_NODE + " entry in the query map");
            }
            // Read without removing: Spring beans are singletons by default,
            // so removing the entry would break later lookups of the same query.
            return (String) queryMap.get(QueryConstants.QUERY_NODE);
        }
    }

With this kind of setup I can also solve a few other problems. I used this trick in one of my assignments: in that application, various clients have different data requirements and therefore different sets of SQL queries. Instead of writing Java classes to meet each client’s requirements, we just write a set of Spring context XML files containing the SQL, load the appropriate context at runtime, and fetch the data.
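The lookup idea can be tried without Spring by letting plain maps stand in for the XML-defined beans. This is a minimal sketch under stated assumptions: QueryRegistrySketch is a hypothetical class, the key name "QUERY" is an assumed value for the query-node constant, and pairing the id VALIDATE_USER_CREDENTIAL with the credential query is a guess based on the names used in this article.

```java
import java.util.Map;

/** Hypothetical stand-in for the XML-backed lookup; plain maps replace the Spring context. */
final class QueryRegistrySketch {

    static final String QUERY_NODE = "QUERY"; // assumed key name

    private final Map<String, Map<String, String>> queriesById;

    QueryRegistrySketch(Map<String, Map<String, String>> queriesById) {
        this.queriesById = queriesById;
    }

    /** Returns the SQL registered under the given id without modifying the registry. */
    String getQuery(String queryId) {
        Map<String, String> queryMap = queriesById.get(queryId);
        if (queryMap == null || !queryMap.containsKey(QUERY_NODE)) {
            throw new IllegalArgumentException("No query registered for id: " + queryId);
        }
        return queryMap.get(QUERY_NODE);
    }

    public static void main(String[] args) {
        QueryRegistrySketch registry = new QueryRegistrySketch(Map.of(
                "VALIDATE_USER_CREDENTIAL",
                Map.of(QUERY_NODE,
                        "SELECT user_id, user_name, user_password FROM ty_users "
                                + "WHERE user_name = ? and user_password = ?")));

        // Repeated lookups return the same query because nothing is removed.
        System.out.println(registry.getQuery("VALIDATE_USER_CREDENTIAL"));
        System.out.println(registry.getQuery("VALIDATE_USER_CREDENTIAL"));
    }
}
```

Supporting a second client with different data requirements would then amount to constructing the registry from a different set of maps, which is the role the per-client context XML files play in the approach described above.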

August 13, 2014

by Shantanu Sikdar


6 Rules of Thumb for MongoDB Schema Design: Part 3

By William Zola, Lead Technical Support Engineer at MongoDB

This is our final stop in this tour of modeling One-to-N relationships in MongoDB. In the first post, I covered the three basic ways to model a One-to-N relationship. Last time, I covered some extensions to those basics: two-way referencing and denormalization.

Denormalization allows you to avoid some application-level joins, at the expense of having more complex and expensive updates. Denormalizing one or more fields makes sense if those fields are read much more often than they are updated. Read part two if you’ve missed it.

Whoa! Look at All These Choices!

So, to recap:

- You can embed, reference from the “one” side, or reference from the “N” side, or combine a pair of these techniques
- You can denormalize as many fields as you like into the “one” side or the “N” side

Denormalization, in particular, gives you a lot of choices: if there are 10 candidates for denormalization in a relationship, there are 2^10 (1,024) different ways to denormalize (including not denormalizing at all). Multiply that by the three different ways to do referencing, and you have over 3,000 different ways to model the relationship.

Guess what? You now are stuck in the “paradox of choice”: because you have so many potential ways to model a “one-to-N” relationship, your choice on how to model it just got harder. Lots harder.

Rules of Thumb: Your Guide Through the Rainbow

Here are some “rules of thumb” to guide you through these innumerable (but not infinite) choices:

One: favor embedding unless there is a compelling reason not to.

Two: needing to access an object on its own is a compelling reason not to embed it.

Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
Four: Don’t be afraid of application-level joins: if you index correctly and use the projection specifier (as shown in part 2) then application-level joins are barely more expensive than server-side joins in a relational database.

Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.

Six: As always with MongoDB, how you model your data depends entirely on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.

Your Guide To The Rainbow

When modeling “One-to-N” relationships in MongoDB, you have a variety of choices, so you have to carefully think through the structure of your data. The main criteria you need to consider are:

- What is the cardinality of the relationship: is it “one-to-few”, “one-to-many”, or “one-to-squillions”?
- Do you need to access the object on the “N” side separately, or only in the context of the parent object?
- What is the ratio of updates to reads for a particular field?

Your main choices for structuring the data are:

- For “one-to-few”, you can use an array of embedded documents.
- For “one-to-many”, or on occasions when the “N” side must stand alone, you should use an array of references. You can also use a “parent-reference” on the “N” side if it optimizes your data access pattern.
- For “one-to-squillions”, you should use a “parent-reference” in the document storing the “N” side.

Once you’ve decided on the overall structure of the data, then you can, if you choose, denormalize data across multiple documents, by either denormalizing data from the “One” side into the “N” side, or from the “N” side into the “One” side.
You’d do this only for fields that are frequently read, get read much more often than they get updated, and where you don’t require strong consistency, since updating a denormalized value is slower, more expensive, and is not atomic.

Productivity and Flexibility

The upshot of all of this is that MongoDB gives you the ability to design your database schema to match the needs of your application. You can structure your data in MongoDB so that it adapts easily to change, and supports the queries and updates that you need to get the most out of your application.
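The structural rules of thumb above can be encoded as a small decision helper. This is a hypothetical sketch rather than anything MongoDB provides: the class OneToNModelingSketch is invented for illustration, and the numeric cut-offs (200 for "a couple of hundred", 2,000 for "a few thousand") are assumptions chosen only to make the rules executable.

```java
/** Hypothetical helper encoding the rules of thumb; the numeric cut-offs are assumptions. */
final class OneToNModelingSketch {

    enum Strategy { EMBED, ARRAY_OF_REFERENCES, PARENT_REFERENCE }

    static final long FEW_LIMIT = 200;     // assumed reading of "a couple of hundred"
    static final long MANY_LIMIT = 2_000;  // assumed reading of "a few thousand"

    static Strategy recommend(long expectedChildren, boolean childAccessedOnItsOwn) {
        if (expectedChildren > MANY_LIMIT) {
            return Strategy.PARENT_REFERENCE;     // "one-to-squillions"
        }
        if (expectedChildren > FEW_LIMIT || childAccessedOnItsOwn) {
            return Strategy.ARRAY_OF_REFERENCES;  // "one-to-many", or the N side stands alone
        }
        return Strategy.EMBED;                    // "one-to-few": favor embedding
    }

    public static void main(String[] args) {
        System.out.println(recommend(5, false));
        System.out.println(recommend(5, true));
        System.out.println(recommend(500, false));
        System.out.println(recommend(1_000_000, false));
    }
}
```

The helper deliberately ignores denormalization, which depends on the read/write ratio of individual fields rather than on cardinality, and real decisions should still follow the application's actual access patterns.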

August 13, 2014

by Francesca Krihely


An Early Mover's Guide to the Internet of Things

[This article was written by Andreea Borca, developer of patient-empowering solutions for the healthcare industry, co-host of Farstuff: The IoT Podcast, and featured author in DZone's 2014 Guide to Internet of Things].

The creation of the Internet was a significant shift in the way people acquire information, interact with each other, and make decisions. Now, the Internet is expanding its reach to a range of devices that can gather and analyze physical data and react to that data in a variety of applications that we’ve never seen before. This “Internet of Things” marks another dynamic shift in the history of technology.

This new stage in the Internet’s evolution is changing it from a tool that we actively need to engage with, deliberately using a browser to access it, to one that passively endows the world around us with a “mind” of its own. We are developing a world where things interact intelligently and cooperate to achieve goals without explicit guidance from human operators.

Defining the Internet of Things

First, we need to define the Internet of Things (also called “The Internet of Everything” by Cisco). A system falls under the Internet of Things definition if it meets the following criteria, known as the “3 Cs”:

1. It must Connect – to the physical world around itself collecting information, to other things in order to interact with them effectively, to the internet or a network, etc.
2. It must Compute – by processing the inputs it receives in some way and making them meaningful to other systems.
3. It must Communicate – with the network, with other things, and with the user if necessary (more often than not, as you’ll see, communicating to the user may be an unnecessary burden).

Challenges for the Internet of Things

Efficiency

Devices within the Internet of Things (IoT) only need to do the bare minimum necessary to effectively work within the existing ecosystem.
Many of the newest products rely heavily on the power of your smartphone to connect to the Internet and orchestrate devices, but there is also extensive pressure to reduce the size, energy consumption, and cost of the processing entities within IoT devices. In order to reduce power consumption and manage node outages, there is a concept of daisy-chaining across a network of devices into a more powerful central hub. This is known as mesh networking, and it’s becoming quite popular for IoT systems.

Security, Privacy & the need to Share

A core requirement of a well-functioning IoT device is to collect, transfer, and store data from a wide variety of sources. As more sensors arrive in cities and healthcare institutions, that increasingly connected information will unavoidably lead to more concern about security and privacy. The debate is still raging over balancing the clear benefits of new discoveries from processing Big Data with the strong personal fear of losing privacy.

With IoT now in the picture, there is concern about devices that continuously and passively collect information on users. One recent clash over always-on sensors came with the release of Microsoft’s Xbox One Kinect console, which has a camera that is constantly pointed at your living room. Although the camera itself is not always on, the backlash over that possibility was fierce [1]. Finding this balance will quickly become a requirement for continued progress.

Furthermore, the very nature of IoT and the connectivity network necessary for its success does make it particularly vulnerable in certain instances. Devices are especially vulnerable when connected over WiFi, because low-tech sensor nodes with minimal computing power tend to be less secure, making them the ideal point of entry for infiltrators.

Standards

As with all new technologies, the battle over standards is always a struggle.
Nest, the company that developed the most popular smart home thermostat, and its new owner, Google, are now making significant strides trying to establish the Nest platform as the foundation for all consumer-based IoT devices and their software counterparts [2]. Cisco, Qualcomm, IBM, Microsoft, and most other major players have a similar strategy for creating standard models for approaching the Internet of Things.

The pressure to standardize is especially clear when new devices are appearing weekly. ZigBee already has extensive reach as an established standard for many household IoT devices. However, as a preferred codebase has yet to emerge as the standard of choice, it is recommended to connect with major standardization organizations like the IEEE, IETF, and the ZigBee Alliance [3].

Currently, the most common sensor networks use protocols such as Bluetooth Low Energy (BLE), RFID tags, ZigBee, and Wi-Fi. There are also iBeacons, which allow devices like smartphones to better identify their location and potential needs with BLE-powered micro-location and GPS technology.

Opportunities for the Internet of Things

There are numerous prospects to consider when looking to develop IoT products. Given the multi-trillion dollar projections for the future IoT economy, we should take a look at these emerging markets for IoT tech [4].

Consumers

The consumer IoT space has bred a small but growing segment of followers that have invested early into “smart” tech. At this year’s CES, we saw everything from the Babolat Tennis Racket that becomes your personal tennis coach to the Kolibree Toothbrush that monitors your gum health while you brush. The fastest growing consumer IoT segment seems to be in smart home technology, with products such as self-managing refrigerators and resident-sensing door locks.

Commercial

Retailers have already proven adept at collecting a consumer’s shopping history.
With the functionality of BLE-powered beacons, these retailers are eager to personalize your shopping experience in a whole new way. Essentially, each physical shopping trip can now be as littered with targeted ads as any typical online search, much like a scene from the 2002 sci-fi film Minority Report. Walk into a store and instantly the advertising screens on the wall change to address your particular demographic, income level, and shopping preferences. If you’ve connected your Google calendar to certain applications, these screens would show outfits targeting your next big event. Signs on clothing racks sense you coming near and change prices, fully leveraging a custom pricing model that would have economists drooling. And as you try on outfits, the smart mirror in the dressing room recommends accessories or comments on alternatives that might be a better fit for your body type. After all of these IoT events have helped you with your purchase, there’s no need to check out. You’ve registered with the store and there’s a beacon at the exit that registers what you picked up and charges your card automatically as you leave. Healthcare With the recent U.S. mandate that all health records must be digital, there has been an explosion in the marketplace of new, patient-centered, smart health devices. The excitement of a healthcare revolution among top innovative companies, incubators, and startups predicts that this trend is not likely to taper off anytime soon. The key areas of focus so far have been: monitoring technologies like wearables (especially passive monitoring), function improving technologies, education, and notification technologies. Wearables are generally the first consumer touch point in the IoT health sphere. With the popularity of Fitbit pedometers and Withings scales, the market is starting to experiment with internal monitoring and potentially replacing some organs completely in the near future. 
A study at Boston University has had incredibly positive results creating an artificial pancreas for Type 1 diabetics by inserting an insulin and glucagon pump that responds when an attached glucometer goes below a certain level, just like an actual pancreas. Proteus, a promising startup out of San Francisco, has created an all-natural microchip in a pill that the patient swallows in order to monitor whether they are remembering to take their medication. The pill sends data to an armband that the user is wearing, which then can send notifications to family members regarding the patient’s status. The most impressive feature is the fact that these chips are powered by the energy in the patient’s digestive system. Cities, Infrastructure, and Industry The long-term vision of the future includes technology such as self-driving cars and city lights that alert police when there’s been an accident. In this stage of development, the majority of value is coming from technologies that monitor and collect data in urban settings. From an evolutionary perspective, the IoT city as a whole is still in what many would consider a learning phase. The main objective is to collect as much data as possible, make it available via open APIs, and encourage motivated data analysts to find opportunities for improvement in utility usage, environmental impacts, and service management for larger populations. This is one area where being an industrial country like the U.S. may actually impede the ability to progress as quickly as our less established counterparts. Third world countries that haven’t yet built a solid infrastructure allow for the creativity and flexibility to implement sophisticated solutions unfettered by generations of previous development. Silicon Valley powerhouses like Facebook and Google are actively engaged in projects to create a free global Wi-Fi network, and key locations in Africa have allowed them to experiment with these projects. 
Being an Early Mover In the very near future, as more and more things connect to the internet, internet connectivity from IoT devices will dwarf the amount of traditional web browsing. The core standards and assumptions that will drive this next revolution in computing technology are still being established and, as a result, building anything that can add value to this exploding industry (software, hardware, devices, sensors, beacons, etc.) is a remarkable opportunity for the right developer. Right now is the time to start contributing to the development of these technologies if you want to be an early mover in IoT. 2014 Guide to Internet of Things The 2014 Guide to Internet of Things covers 39 different IoT SDKs, developer programs, and hardware options, plus: key findings from our survey of over 2,000 developers; "How to IoT Your Life: The Complete Shopping List"; "The Scale of IoT" infographic; a glossary of common IoT terms; and four in-depth articles from industry experts.

August 12, 2014

by Benjamin Ball

· 15,073 Views

6 Rules of Thumb for MongoDB Schema Design: Part 2

By William Zola, Lead Technical Support Engineer at MongoDB

This is the second stop on our tour of modeling One-to-N relationships in MongoDB. Last time I covered the three basic schema designs: embedding, child-referencing, and parent-referencing. I also covered the two factors to consider when picking one of these designs:

- Will the entities on the “N” side of the One-to-N ever need to stand alone?
- What is the cardinality of the relationship: is it one-to-few, one-to-many, or one-to-squillions?

With these basic techniques under our belt, I can move on to covering more sophisticated schema designs, involving two-way referencing and denormalization.

Intermediate: Two-Way Referencing

If you want to get a little bit fancier, you can combine two techniques and include both styles of reference in your schema, having both references from the “one” side to the “many” side and references from the “many” side to the “one” side.

For an example, let’s go back to that task-tracking system. There’s a “people” collection holding Person documents, a “tasks” collection holding Task documents, and a One-to-N relationship from Person -> Task. The application will need to track all of the Tasks owned by a Person, so we will need to reference Person -> Task.

With the array of references to Task documents, a single Person document might look like this:

    db.people.findOne()
    {
        _id: ObjectID("AAF1"),
        name: "Kate Monster",
        tasks: [     // array of references to Task documents
            ObjectID("ADF9"),
            ObjectID("AE02"),
            ObjectID("AE73")
            // etc
        ]
    }

On the other hand, in some other contexts this application will display a list of Tasks (for example, all of the Tasks in a multi-person Project) and it will need to quickly find which Person is responsible for each Task. You can optimize this by putting an additional reference to the Person in the Task document.
    db.tasks.findOne()
    {
        _id: ObjectID("ADF9"),
        description: "Write lesson plan",
        due_date: ISODate("2014-04-01"),
        owner: ObjectID("AAF1")     // Reference to Person document
    }

This design has all of the advantages and disadvantages of the “One-to-Many” schema, but with some additions. Putting the extra ‘owner’ reference into the Task document means that it’s quick and easy to find the Task’s owner, but it also means that if you need to reassign the task to another person, you need to perform two updates instead of just one. Specifically, you’ll have to update both the reference from the Person to the Task document, and the reference from the Task to the Person. (And to the relational gurus who are reading this: you’re right, using this schema design means that it is no longer possible to reassign a Task to a new Person with a single atomic update. This is OK for our task-tracking system; you need to consider whether it works for your particular use case.)

Intermediate: Denormalizing With “One-To-Many” Relationships

Beyond just modeling the various flavors of relationships, you can also add denormalization into your schema. This can eliminate the need to perform the application-level join for certain cases, at the price of some additional complexity when performing updates. An example will help make this clear.

Denormalizing from Many -> One

For the parts example, you could denormalize the name of the part into the ‘parts[]’ array. For reference, here’s the version of the Product document without denormalization.
    > db.products.findOne()
    {
        name : 'left-handed smoke shifter',
        manufacturer : 'Acme Corp',
        catalog_number: 1234,
        parts : [     // array of references to Part documents
            ObjectID('AAAA'),    // reference to the #4 grommet above
            ObjectID('F17C'),    // reference to a different Part
            ObjectID('D2AA'),
            // etc
        ]
    }

Denormalizing would mean that you don’t have to perform the application-level join when displaying all of the part names for the product, but you would have to perform that join if you needed any other information about a part.

    > db.products.findOne()
    {
        name : 'left-handed smoke shifter',
        manufacturer : 'Acme Corp',
        catalog_number: 1234,
        parts : [
            { id : ObjectID('AAAA'), name : '#4 grommet' },         // Part name is denormalized
            { id : ObjectID('F17C'), name : 'fan blade assembly' },
            { id : ObjectID('D2AA'), name : 'power switch' },
            // etc
        ]
    }

While making it easier to get the part names, this would add just a bit of client-side work to the application-level join:

    // Fetch the product document
    > product = db.products.findOne({catalog_number: 1234});
    // Create an array of ObjectID()s containing *just* the part numbers
    > part_ids = product.parts.map( function(doc) { return doc.id } );
    // Fetch all the Parts that are linked to this Product
    > product_parts = db.parts.find({_id: { $in : part_ids } } ).toArray() ;

Denormalizing saves you a lookup of the denormalized data at the cost of a more expensive update: if you’ve denormalized the Part name into the Product document, then when you update the Part name you must also update every place it occurs in the ‘products’ collection.

Denormalizing only makes sense when there’s a high ratio of reads to updates. If you’ll be reading the denormalized data frequently, but updating it only rarely, it often makes sense to pay the price of slower, more complex updates in order to get more efficient queries. As updates become more frequent relative to queries, the savings from denormalization decrease.
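The read/update trade-off described above can be sketched in plain JavaScript, with in-memory arrays standing in for the ‘parts’ and ‘products’ collections. Nothing here talks to MongoDB: the data, the string ids, and the helper names `renamePart` and `partNames` are all assumptions made for illustration.

```javascript
// In-memory stand-ins for the 'parts' and 'products' collections (hypothetical data).
const parts = [
  { _id: 'AAAA', name: '#4 grommet' },
  { _id: 'F17C', name: 'fan blade assembly' },
];
const products = [
  {
    catalog_number: 1234,
    parts: [
      { id: 'AAAA', name: '#4 grommet' },
      { id: 'F17C', name: 'fan blade assembly' },
    ],
  },
  { catalog_number: 5678, parts: [{ id: 'AAAA', name: '#4 grommet' }] },
];

// Renaming a part must touch the Part document plus every product that
// carries a denormalized copy. Returns the number of documents written.
function renamePart(partId, newName) {
  let writes = 0;
  for (const part of parts) {
    if (part._id === partId) {
      part.name = newName;
      writes += 1;
    }
  }
  for (const product of products) {
    let touched = false;
    for (const ref of product.parts) {
      if (ref.id === partId) {
        ref.name = newName;
        touched = true;
      }
    }
    if (touched) writes += 1;
  }
  return writes;
}

// Reading part names needs no join: the names already live in the product.
function partNames(catalogNumber) {
  const product = products.find((p) => p.catalog_number === catalogNumber);
  return product.parts.map((ref) => ref.name);
}
```

In this toy data set a single rename costs three document writes (one Part, two Products), while listing a product's part names reads only one document. That asymmetry is why the read-to-update ratio matters.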
For example: assume the part name changes infrequently, but the quantity on hand changes frequently. This means that while it makes sense to denormalize the part name into the Product document, it does not make sense to denormalize the quantity on hand.

Also note that if you denormalize a field, you lose the ability to perform atomic and isolated updates on that field. Just like with the two-way referencing example above, if you update the part name in the Part document, and then in the Product document, there will be a sub-second interval where the denormalized ‘name’ in the Product document will not reflect the new, updated value in the Part document.

Denormalizing from One -> Many

You can also denormalize fields from the “One” side into the “Many” side:

    > db.parts.findOne()
    {
        _id : ObjectID('AAAA'),
        partno : '123-aff-456',
        name : '#4 grommet',
        product_name : 'left-handed smoke shifter',   // Denormalized from the ‘Product’ document
        product_catalog_number: 1234,                 // Ditto
        qty: 94,
        cost: 0.94,
        price: 3.99
    }

However, if you’ve denormalized the Product name into the Part document, then when you update the Product name you must also update every place it occurs in the ‘parts’ collection. This is likely to be a more expensive update, since you’re updating multiple Parts instead of a single Product. As such, it’s significantly more important to consider the read-to-write ratio when denormalizing in this way.

Intermediate: Denormalizing With “One-To-Squillions” Relationships

You can also denormalize the “one-to-squillions” example. This works in one of two ways: you can either put information about the “one” side (from the ‘hosts’ document) into the “squillions” side (the log entries), or you can put summary information from the “squillions” side into the “one” side.

Here’s an example of denormalizing into the “squillions” side.
I’m going to add the IP address of the host (from the ‘one’ side) into the individual log message:

    > db.logmsg.findOne()
    {
        time : ISODate("2014-03-28T09:42:41.382Z"),
        message : 'cpu is on fire!',
        ipaddr : '127.66.66.66',
        host: ObjectID('AAAB')
    }

Your query for the most recent messages from a particular IP address just got easier: it’s now just one query instead of two.

    > last_5k_msg = db.logmsg.find({ipaddr : '127.66.66.66'}).sort({time : -1}).limit(5000).toArray()

In fact, if there’s only a limited amount of information you want to store at the “one” side, you can denormalize it ALL into the “squillions” side and get rid of the “one” collection altogether:

    > db.logmsg.findOne()
    {
        time : ISODate("2014-03-28T09:42:41.382Z"),
        message : 'cpu is on fire!',
        ipaddr : '127.66.66.66',
        hostname : 'goofy.example.com'
    }

On the other hand, you can also denormalize into the “one” side. Let’s say you want to keep the last 1000 messages from a host in the ‘hosts’ document. You could use the $each / $slice functionality introduced in MongoDB 2.4 to keep that list sorted, and only retain the last 1000 messages. The log messages get saved in the ‘logmsg’ collection as well as in the denormalized list in the ‘hosts’ document: that way the message isn’t lost when it ages out of the ‘hosts.logmsgs’ array.
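As a rough illustration of what a push with $each, $sort and $slice does to the array, here is a plain JavaScript imitation that runs without MongoDB. The helper name `pushCapped` is hypothetical (it is not a MongoDB API), timestamps are plain numbers for simplicity, and the cap is set to 3 instead of 1000 so the effect is easy to see.

```javascript
// Plain-JavaScript imitation of
//   {$push: {field: {$each: newItems, $sort: {time: 1}, $slice: -keepLast}}}
// pushCapped is a hypothetical helper; it assumes keepLast > 0.
function pushCapped(list, newItems, keepLast) {
  const merged = list.concat(newItems);     // $each: append every new item
  merged.sort((a, b) => a.time - b.time);   // $sort: { time: 1 } -> oldest first
  return merged.slice(-keepLast);           // $slice: -N -> keep the N newest
}

let logmsgs = [];
for (let t = 1; t <= 5; t++) {
  logmsgs = pushCapped(logmsgs, [{ time: t, message: 'msg ' + t }], 3);
}
console.log(logmsgs.map((m) => m.time)); // [ 3, 4, 5 ]
```

The ascending sort followed by a negative slice is what makes the array hold the newest entries: the oldest ones sit at the front and fall off first.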
    // Get log message from monitoring system
    logmsg = get_log_msg();
    log_message_here = logmsg.msg;
    log_ip = logmsg.ipaddr;
    // Get current timestamp
    now = new Date();
    // Find the _id for the host I’m updating
    host_doc = db.hosts.findOne({ipaddr : log_ip }, {_id:1});  // Don’t return the whole document
    host_id = host_doc._id;
    // Insert the log message, the parent reference, and the denormalized data into the ‘many’ side
    db.logmsg.save({time : now, message : log_message_here, ipaddr : log_ip, host : host_id });
    // Push the denormalized log message onto the ‘one’ side
    db.hosts.update( {_id: host_id },
        {$push : {logmsgs : { $each: [ { time : now, message : log_message_here } ],
                              $sort: { time : 1 },   // Sort by time, oldest first
                              $slice: -1000 }        // Only keep the latest 1000
                 }
        } );

Note the use of the projection specification ( {_id:1} ) to prevent MongoDB from having to ship the entire ‘hosts’ document over the network. By telling MongoDB to only return the _id field, I reduce the network overhead down to just the few bytes that it takes to store that field (plus just a little bit more for the wire protocol overhead).

Just as with denormalizing in the “One-to-Many” case, you’ll want to consider the ratio of reads to updates. Denormalizing the log messages into the Host document makes sense only if log messages are infrequent relative to the number of times the application needs to look at all of the messages for a single host. This particular denormalization is a bad idea if you want to look at the data less frequently than you update it.

Recap

In this post, I’ve covered the additional choices that you have past the basics of embed, child-reference, or parent-reference.
- You can use bi-directional referencing if it optimizes your schema, and if you are willing to pay the price of not having atomic updates.
- If you are referencing, you can denormalize data either from the “One” side into the “N” side, or from the “N” side into the “One” side.

When deciding whether or not to denormalize, consider the following factors:

- You cannot perform an atomic update on denormalized data.
- Denormalization only makes sense when you have a high read-to-write ratio.

Next time, I’ll give you some guidelines to pick and choose among all of these options.

To learn more Schema Design tips, see our recent Back to Basics webinar on “Thinking in Documents”.

August 11, 2014

by Francesca Krihely

· 12,398 Views

Introducing BIRT iHub F-Type

Actuate recently released a new, free BIRT server called the BIRT iHub F-Type. It incorporates all the functionality of BIRT iHub and is limited only by the capacity of output it can deliver on a daily basis. It is ideal for departmental and smaller scale applications. When BIRT F-Type reaches its maximum output capacity, additional capacity can be purchased on a subscription based model. Some of the key features of BIRT iHub F-Type that will help improve your BIRT content applications are: Interactivity – Allow end-users to modify and personalize reports, and answer questions themselves. Scheduling – Automate report generation based on rules and calendar, and then notify users. Sharing – Secure document management and distribution that allows users to only access content/data they are entitled to. Excel Emitter – Export as native Excel (not CSV) with formulas/pivot tables/worksheets/charts. Integration – JavaScript API to embed dynamic reports and visualizations in your web app. Downloading BIRT iHub F-Type Before we get started with the installation process, we need to download BIRT iHub F-Type. There are three downloads available: Windows, Linux, and a VMware image. This blog will cover the Windows installation. If you’re installing either of the other types, you’ll find links to guides for them at the bottom of this blog post. Once you click on your chosen download, you’ll be asked to register. If you’ve already registered, click the “Click to Login” button. If not, fill out the short registration form to get started. Next, read and accept the license agreement. Once you’ve done that, click the checkbox, and a link for the download will appear. Click that to start your download. At this point, you should also receive an email with an activation code. Be sure to check your spam folder if you don’t see it in your inbox. Installing BIRT iHub F-Type After the download is complete, launch the executable file named ActuateBIRTiHubFType.exe. 
A welcome message will appear. Press Next to continue. You must read and accept the license agreement on the next screen. Choose a destination folder for the installation. The default is C:\Actuate\BIRTiHub. If you have existing BIRT designs that depend on a JDBC database driver, you can optionally specify the folder where these drivers are located. Press Next to continue. Once the installation has finished, press Finish to launch the BIRT iHub F-Type. A desktop shortcut is also created that points to the iHub F-Type URL at http://localhost:8700/iportal. The first time you launch the BIRT iHub F-Type, you will need to activate it. Enter the activation code that you should have received in an e-mail. After entering a valid activation code, you should receive a message that the code was accepted and the BIRT iHub F-Type should start initializing services. Once that has completed, you will be presented with the login screen. The default user name is “administrator” and the password is blank for your first log in. You’ll be able to change this after you have logged in. Press “Log In” to continue. The first time you launch the BIRT iHub F-Type, you will be in tutorial mode which will help you get started loading your BIRT content and required resources. You can bypass the tutorial mode at any time by pressing the “Exit Tutorial” button at the top right. Select a BIRT design (*.rptdesign) file and press the Upload button. If you don’t have a BIRT design, you can download a sample from the link on the same page. The BIRT design file is automatically inspected and if there are any dependent files needed, like images, data files, BIRT report libraries, CSS styles, or other linked BIRT designs, you will be asked to upload those files as well. Once your BIRT design and dependent files are uploaded, your BIRT report will be displayed in the BIRT iHub F-Type and is now ready to explore. Thanks for reading. Now, it’s time to unleash the full power of BIRT into your application. 
If you have any questions or comments, please feel free to use the comments section below or visit the BIRT iHub F-Type forum. -Virgil For more blogs in the “Introducing BIRT iHub F-Type” series, see the list below: Installing iHub F-Type: Linux | VMWare Image - See more at: http://blogs.actuate.com/introducing-birt-ihub-f-type-installing-on-windows/

August 6, 2014

by Michael Singer

· 1,934 Views

Distributed Big Balls of Mud

If you want evidence that the software development industry is susceptible to fashion, just go and take a look at all of the hype around microservices. It's everywhere! For some people microservices is "the next big thing", whereas for others it's simply a lightweight evolution of the big SOAP service-oriented architectures that we saw 10 years ago, "done right". I do like a lot of what the current microservice architectures are doing, but it's by no means a silver bullet. Okay, I know that sounds obvious, but I think many people are jumping on them for the wrong reason.

I often show this slide in my conference talks, and I've blogged about this before, but basically there are different ways to build software systems. On the one side we have traditional monolithic systems, where everything is bundled up inside a single deployable unit. This is probably where most of the industry is. Caveats apply: monoliths can be built quickly and are easy to deploy, but they provide limited agility because even tiny changes require a full redeployment. We also know that monoliths often end up looking like a big ball of mud because of the way that software often evolves over time. For example, many monolithic systems are built using a layered architecture, and it's relatively easy for layered architectures to be abused (e.g. skipping "around" a service to call the repository/data access layer directly).

On the other side we have service-based architectures, where a software system is made up of many separately deployable services. Again, caveats apply but, if done well, service-based architectures buy you a lot of flexibility and agility because each service can be developed, tested, deployed, scaled, upgraded and rewritten separately, especially if the services are decoupled via asynchronous messaging. The downside is increased complexity because your software system now has many more moving parts than a monolith.
As Robert says, the complexity is still there, you're just moving it somewhere else.

There is, of course, a mid-ground here. We can build monolithic systems that are made up of in-process components, each of which has an explicit well-defined interface and set of responsibilities. This is old-school component-based design that talks about high cohesion and low coupling, but I usually sense some hesitation when I talk about it. And this seems odd to me. Before I explain why, let me quote something from a blog post that I read earlier this morning about the rationale behind a team adopting a microservices approach.

"When we started building Karma, we decided to split the project into two main parts: the backend API, and the frontend application. The backend is responsible for handling orders from the store, usage accounting, user management, device management and so forth, while the frontend offers a dashboard for users which accesses this API. Along the way we noticed that if the whole backend API is monolithic it doesn't work very well because everything gets entangled."

The blog post also mentions scaling, versioning and multiple languages/frameworks as other reasons to choose microservices. Again, there are no silver bullets here, everything is a trade-off. Anyway, "everything getting entangled" is not a reason to switch from monoliths to microservices. If you're building a monolithic system and it's turning into a big ball of mud, perhaps you should consider whether you're taking enough care of your software architecture. Do you really understand what the core structural abstractions are in your software? Are their interfaces and responsibilities clear too? If not, why do you think moving to a microservices architecture will help? Sure, the physical separation of services will force you to not take some shortcuts, but you can achieve the same separation between components in a monolith.
A little design thinking and an architecturally-evident coding style will help to achieve this without the baggage of going distributed.

Many of the teams I've spoken to are building monolithic systems and don't want to look at component-based design. The mid-ground seems to be a hard sell. I ran a software architecture sketching workshop with a team earlier this year where we diagrammed one of their software systems. The diagram started as a strictly layered architecture (presentation, business services, data access) with all arrows pointing downwards and each layer only ever calling the layer directly beneath it. The code told a different story though, and the eventual diagram didn't look so neat anymore. We discussed how adopting a package by component approach could fix some of these problems, but the response was, "meh, we like building software using layers".

It seems as if teams are jumping on microservices because they're sexy, but the design thinking and decomposition strategy required to create a good microservices architecture are the same as those needed to create a well structured monolith. If teams find it hard to create a well structured monolith, I don't rate their chances of creating a well structured microservices architecture. As Michael Feathers recently said, "There's a bit of overhead involved in implementing each microservice. If they ever become as easy to create as classes, people will have a freer hand to create trouble - hulking monoliths at a different scale." I agree. A world of distributed big balls of mud worries me.

August 4, 2014

by Simon Brown

· 9,237 Views

Glassfish 4 - Performance Tuning, Monitoring and Troubleshooting

This is the third blog in the C2B2 series looking at Glassfish 4. The previous two are available here:

- Part 1 - Getting started with Glassfish 4
- Part 2 - Glassfish 4 - Features For High Availability

In this blog I will look at 3 areas:

- Performance Tuning, where I will look at some of the areas to look at when setting up a system for production.
- Monitoring, where I will look at some of the tools we use for monitoring a system both during performance testing and tuning and once a system is up and running.
- Troubleshooting, where I will look at some of the tools you can use to help diagnose and detect performance issues.

Performance Tuning

Glassfish out of the box (as with most app servers) is optimised for development purposes. Developers want the ability to deploy and undeploy continuously, create and remove resources, debug, etc. However, this configuration is not suitable for a production system. When configuring any application server you have to take into account what you are trying to achieve and what is best suited for the applications you intend to run. One size does not fit all! It can be a long and complex process and I'm afraid I can't give you a one-stop solution. However, I can give you some pointers to some of the things you can do to prepare your system for production.

So, what kind of things do we look at when we are looking to performance tune a Glassfish system? Some of the most common things are:

- JVM Settings
- Garbage Collection
- Glassfish Settings
- Logging

JVM Settings

The standard JVM defaults are not suitable for a production system. One of the simplest changes that can be made is to use the -server flag, rather than the default -client. Although the Server and Client VMs are similar, the Server VM has been specially tuned to maximise peak operating speed. It is intended for executing long-running server applications, which need the fastest possible operating speed more than a fast start-up time or smaller runtime memory footprint.
Allocate more memory to the JVM by modifying the value of the -Xmx flag. How much depends on the size and complexity of your enterprise application and how much memory you have available. In addition we also want to make sure we allocate all of the memory on startup. This is done with the -Xms flag. We set the minimum and maximum perm gen to the same value in order to avoid allocation failures and subsequent full garbage collections.

Garbage Collection

There are a number of settings that can be tweaked regarding Garbage Collection. I'm not going to cover GC tuning as that is a whole topic all of its own, but here are some of the settings we would always recommend regarding GC in a production environment.

Firstly we want to ensure we log all Garbage Collection information as this can prove extremely useful in diagnosing issues.

    -verbose:gc

Next we want to make sure we log GC information to a file. This will make it easier to separate the GC from other details in the log files.

    -Xloggc:/path_to_log_file/gc.log

We also want to ensure we have as much detail as possible:

    -XX:+PrintGCDetails

and that the information is timestamped for easier diagnosis of long running errors and to be able to ascertain what normal levels are over time.

    -XX:+PrintGCDateStamps

Finally, we want to ensure that developers aren't making explicit calls to System.gc(). Hopefully they don’t anyway, and if they are you need to look into why (doing so is a bad idea since this forces major collections), but this will disable it just in case.

    -XX:+DisableExplicitGC

Heap Dumps

Heap dumps can be extremely useful for diagnosing memory issues. There are two settings we would definitely recommend. These tell the JVM to generate a heap dump when an allocation from the Java heap or the permanent generation cannot be satisfied. There is no overhead in running with these options but they can be useful for production systems where OutOfMemoryErrors can take a long time to surface.
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/opt/dumps/glassfish.hprof

Configuring Glassfish

There are three ways to configure Glassfish:

- Through the admin console
- By directly editing the config files
- Using the asadmin tool

Although making changes through the admin console can often be the easiest way, we’d recommend scripting all changes where possible so you have a repeatable production server build. Also you should ensure copies of all config files are kept in Config Control so you know you have a working copy and can roll back to a previous version when needed.

Turn off development features

Turn off auto-deploy and dynamic application reloading. Both of these features are great for development, but can affect performance.

Configure the JSP servlet not to check JSP files for changes on every request. Also, set the parameter genStrAsCharArray to true. This will ensure all String values are declared as static char arrays. One reason for this is that the array has less memory overhead than String.

These changes will mean you cannot change JSP pages on your production server without redeploying the application, but on a production system this is generally what you want.

Acceptor Threads and Request Threads

There are two main thread values we would recommend setting, acceptor threads and request threads.

Acceptor threads are used to accept new connections to the server and to schedule existing connections when a new request comes in. Set this value equal to the number of CPU cores in your server. So, if you have two quad core CPUs, this value should be set to eight.

Request threads run HTTP requests. You want enough of these to keep the machine busy, but not so many that they compete for CPU resources, which would cause your throughput to suffer greatly.

Static resources

By default, GlassFish does not tell the client to cache static resources.
It is recommended to cache static resources, like CSS files and images, particularly if you have a lot of them.

Thread pools

Max thread pool and min pool size should be set to the same value. Specifying the same value will allow GlassFish to use a slightly more optimised thread pool. This configuration should be considered unless the load on the server varies significantly. Increasing this value will reduce HTTP response latency times. What to set these values to depends heavily on what your application is doing. In order to get this value right you should incrementally increase the thread count and monitor performance after each increase. When performance stops improving, stop increasing the thread count.

Logging

You should look to turn off as much logging as possible. In a production environment we would generally recommend logging at WARNING and above. This includes the logging done by Glassfish as well as your own applications.

Monitoring

The fewer monitoring options that are enabled, the better the server's performance. All Glassfish monitoring is turned off by default. Switching monitoring on can be very useful when diagnosing issues, and during initial system testing and performance tuning to see what changes.

What to monitor

- Used Heap Size - Compare this number with the maximum allowed heap size to see what portion of the heap is in use. If the used heap size nears the max heap size, the garbage collector urgently attempts to free memory and this is something that should be avoided where possible.
- Number of loaded classes - Useful for detecting performance and application development trends.
- JVM Threads - Important for performance tuning and for troubleshooting JVM crashes. Some of the most essential indicators are the current active JVM thread count and the peak values.
- Thread pools - You should compare a pool's current usage with the maximum number allowed.
Problems can start to occur when the current count nears the max threads number.

JVM Tools for Monitoring

The following is a list of a few of the tools that come with the JDK that are useful for monitoring information from the JVM.

jstat - This tool displays performance statistics regarding usage of the perm gen, new gen and old gen. It also provides class loading and compilation statistics.

jmap - Gives you visibility of memory usage, can produce a class histogram and can dump the memory to a file.

jconsole/jvisualvm - These tools can display all the previously mentioned monitoring indicators and graph them over time. This allows you to spot trends and to get a better overall picture of your normal performance levels and changes over time. Note - These should NOT be left running permanently on a production system!

Troubleshooting

Unfortunately, no matter how much tuning and testing you do, all systems WILL go wrong from time to time. So, what should you do when your production server bursts into flames? Well, in that situation you should call the fire service, but for more general problems:

- Gather data - get as much data as you can; there is no such thing as too much!
- Analyse that data - data is worthless when you don't know what it means.
- Visualise where possible - graphs and charts reveal trends and patterns over time.
- Make educated decisions - only make decisions based on data. If you go with your "gut instinct" and what "feels right" you will probably make things worse.

Gathering data

First up, for most of the JVM tools you will need the process ID of the server. You can get this information in various ways. Two of the simplest are:

jps -v

This will list all currently running Java processes. The -v flag also prints the arguments passed to each JVM.

ps aux | grep glassfish

The ps command with the options aux will show all processes from all users.
This will display a LOT of information so pipe it through grep to filter for the glassfish process As mentioned earlier the jstat tool can be used for gathering info on JVM performance. Other useful tools include: jstack This will produce thread stack dumps for all threads running in the JVM. This can be very useful for discovering stuck threads or long running threads. jmap This tool can be used to create a heap dump. It outputs to a file in .hprof format which can be read by a number of analysis tools jrcmd and jrmc These tools are only available with the jRockit JDK. I won't go into any detail here as I have previously blogged about jrcmd here: http://blog.c2b2.co.uk/2012/11/troubleshooting-jrockit-using-jrcmd.html and my colleague has blogged about jrmc here: http://blog.c2b2.co.uk/2012/10/weblogic-troubleshooting-with-jrockit.html Glassfish asadmin The Glassfish asadmin tool has a built in command which will provide similar functionality to the above tools but without the need for the PID. asadmin generate-jvm-report --type=[type] Analysing the data There are various tools available for analysing performance data. The following are some of the most useful: IBM Support Assistant is a free troubleshooting application that helps you research, analyze, and resolve problems using various support features and tools. It contains a Garbage Collection and Memory Visualiser as well as a Heap Analyser. It will also provide a report telling you where issues might exist, and listing red flags with advice on what to change in your applications jRockit Mission Control is a very powerful tool which can be used to monitor live systems or analyse historical data in the form of flight recordings. JVisualVM GCViewer is an optional plugin for jVisualVM which can transform a tool which is already great for live monitoring into a powerful analysis tool jhat is a Java Heap Analysis Tool. It processes heap dump files and produces HTML reports. 
There are better analysis tools, but it’s always freely available if you’re running a JDK. Others There are many open source and freely available tools and projects to help you, here we’ve covered some very common and widely used ones, but our list is by no means exhaustive! Conclusion Remember, Glassfish out of the box (or out of the zip file!) is not designed to be run 'as is'. You should also note that there is no ideal configuration that will work for all systems. It will take time and effort to get the best configuration for what you require. Hopefully in this blog I have given you some useful guidelines and pointers. You should take time to work out what you want in terms of services, then strip back your config to match that. You should test, test and test again to ensure that your configuration matches the requirements with regards to the applications you will be running on your server. You should tune your JVM to ensure you have the best settings for your particular configuration. You should ensure you have monitoring in place to keep a check on everything and ensure that if your server does crash you have as much information as possible at hand to diagnose what caused it. The next blog in this series looks at Migrating to Glassfish 4: http://blog.c2b2.co.uk/2013/07/glassfish-4-migrating-to-glassfish.html
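As a small complement to the "What to monitor" list above, the indicators it names (used heap versus maximum heap, thread counts, loaded classes) can also be read from inside a JVM through the standard java.lang.management API. The following is only a minimal sketch: the class name JvmSnapshot is made up for illustration, it is not part of Glassfish, and it reports on the JVM it runs in rather than on a remote server.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

// Hypothetical helper (not a Glassfish class): takes a one-off snapshot of the current JVM.
class JvmSnapshot {

    /** Fraction of the maximum heap currently in use, or -1 if the JVM reports no maximum. */
    static double heapUsedRatio() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        return max > 0 ? (double) heap.getUsed() / max : -1.0;
    }

    /** One-line summary of heap usage, thread counts and loaded classes. */
    static String describe() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return String.format("heapUsed=%.1f%% threads=%d peakThreads=%d loadedClasses=%d",
                heapUsedRatio() * 100,
                threads.getThreadCount(),
                threads.getPeakThreadCount(),
                classes.getLoadedClassCount());
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The same MXBeans are what jconsole and jvisualvm display over JMX, so a snapshot like this is mainly useful for logging a baseline from within an application.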

July 30, 2014

by Andy Overton

· 24,820 Views

ACID Compliance: What It Means and Why You Should Care

Originally written by Dave Anselmi The presence of four components — atomicity, consistency, isolation and durability — can ensure that a database transaction is completed in a timely manner. When databases possess these components, they are said to be ACID-compliant. So just what is ACID compliance, and why should you care? Let’s take a look: Atomicity: Database transactions, like atoms, can be broken down into smaller parts. When it comes to your database, atomicity refers to the integrity of the entire database transaction, not just a component of it. In other words, if one part of a transaction doesn’t work like it’s supposed to, the other will fail as a result—and vice versa. For example, if you’re shopping on an e-commerce site, you must have an item in your cart in order to pay for it. What you can’t do is pay for something that’s not in your cart. (You can add something into your cart and not pay for it, but that database transaction won’t be complete, and thus not ‘atomic’, until you pay for it.) Consistency: For any database to operate as it’s intended to operate, it must follow the appropriate data validation rules. Thus, consistency means that only data which follows those rules is permitted to be written to the database. If a transaction occurs and results in data that does not follow the rules of the database, it will be ‘rolled back’ to a previous iteration of itself (or ‘state’) which complies with the rules. On the other hand, following a successful transaction, new data will be added to the database and the resulting state will be consistent with existing rules. Isolation: It’s safe to say that at any given time on Amazon, there is far more than one transaction occurring on the platform… In fact, an incredibly huge amount of database transactions are occurring simultaneously! For a database, isolation refers to the ability to concurrently process multiple transactions in a way that one does not affect another. 
So, imagine you and your neighbor are both trying to buy something from the same e-commerce platform at the same time. There are 10 items for sale: your neighbor wants five and you want six. Isolation means that one of those transactions would be completed ahead of the other one. In other words, if your neighbor clicked first, they will get five items, and only five items will be remaining in stock, so you will only get to buy five items. If you clicked first, you will get the six items you want, and they will only get four. Thus, isolation ensures that eleven items aren't sold when only ten exist.

Durability: All technology fails from time to time; the goal is to make those failures invisible to the end user. In databases that possess durability, data is saved once a transaction is completed, even if a power outage or system failure occurs. Imagine you're buying in-demand concert tickets on a site similar to Ticketmaster.com. Right when tickets go on sale, you're ready to make a purchase. After being stuck in the digital waiting room for some time, you're finally able to add those tickets to your cart. You then make the purchase and get your confirmation. However, if that database lacks durability, your transaction could still be lost when the database suffers a failure, even after your ticket purchase was confirmed! As you might expect, this is a really bad thing to happen to an e-commerce site, so transaction durability is a must-have.

ClustrixDB, a NewSQL cloud database, comes with the added benefit of guaranteed ACID transactions, critical for e-commerce success!
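The ten-items example used for isolation can be mimicked in a few lines of Java. This is only an analogy, assuming a hypothetical in-memory Inventory class whose buy method is synchronized so that the two purchases cannot interleave; a real database achieves the same effect with transactions and locking or versioning, not with Java monitors.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory stand-in for a stock table.
class Inventory {
    private int inStock;

    Inventory(int inStock) {
        this.inStock = inStock;
    }

    /** Sells up to `wanted` items as one indivisible step and returns how many were sold. */
    synchronized int buy(int wanted) {
        int sold = Math.min(wanted, inStock);
        inStock -= sold;
        return sold;
    }

    synchronized int remaining() {
        return inStock;
    }
}

class IsolationDemo {

    /** Runs the two purchases from the text concurrently and returns the total number sold. */
    static int sellConcurrently(Inventory inventory) {
        int[] sold = new int[2];
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> { sold[0] = inventory.buy(5); }); // the neighbor wants five
        pool.submit(() -> { sold[1] = inventory.buy(6); }); // you want six
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for purchases", e);
        }
        return sold[0] + sold[1];
    }

    public static void main(String[] args) {
        Inventory inventory = new Inventory(10);
        int total = sellConcurrently(inventory);
        System.out.println("sold=" + total + " remaining=" + inventory.remaining());
    }
}
```

Whichever purchase runs first, the buyers end up with five and five or with four and six, so the total sold is always ten and never eleven.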

July 30, 2014

by Lisa Schultz

· 11,341 Views

CRUD Operation with ASP.NET MVC and Fluent Nhibernate

Some time ago I wrote a post about Getting Started with Nhibernate and ASP.NET MVC – CRUD operations. It's one of the most popular posts on my blog. I get lots of questions via email and other channels asking why I have not written a post about Fluent Nhibernate and ASP.NET MVC, so I thought it would be a good idea to write one.

What is Fluent Nhibernate:

Convention over configuration: that is the mantra of Fluent Nhibernate. If you have seen my blog post about Nhibernate, you will have found that we need to create XML mappings to map tables. Fluent Nhibernate uses POCO mapping classes instead of XML mappings, and I firmly believe in convention over configuration; that is why I like Fluent Nhibernate a lot. But it's a matter of personal taste, and there is no strong argument for why you should use Fluent Nhibernate instead of plain Nhibernate.

The Fluent Nhibernate team comprises James Gregory, Paul Batum, Andrew Stewart and Hudson Akridge. There are lots of committers and it's an open source project. You can find more information at the following site.

http://www.fluentnhibernate.org/

On this site you can find the definition of Fluent Nhibernate as below.

Fluent, XML-less, compile safe, automated, convention-based mappings for NHibernate. Get your fluent on.

They have an excellent getting started guide at the following URL. You can easily walk through it and learn from it.

https://github.com/jagregory/fluent-nhibernate/wiki/Getting-started

ASP.NET MVC and Fluent Nhibernate:

So, all set; it's time to write a sample application. From Visual Studio 2013 go to File – New Project and add a new web application project with ASP.NET MVC. Once you are done creating the web project, it's time to add Fluent Nhibernate to the application. You can add Fluent Nhibernate to the application via the following NuGet package. And you can install it with the Package Manager Console like the following. Now that we are done adding Fluent Nhibernate, it's time to add a new database.
Now I'm going to create a simple table called "Employee" with four columns: Id, FirstName, LastName and Designation. Following is the script for that.

    CREATE TABLE [dbo].[Employee]
    (
        [Id] INT NOT NULL PRIMARY KEY,
        [FirstName] NVARCHAR(50) NULL,
        [LastName] NVARCHAR(50) NULL,
        [Designation] NVARCHAR(50) NULL
    )

Now that the table is ready, it's time to create a model class for Employee. Following is the code for that.

    namespace FluentNhibernateMVC.Models
    {
        public class Employee
        {
            // Members are virtual so that NHibernate can create proxies.
            public virtual int EmployeeId { get; set; }
            public virtual string FirstName { get; set; }
            public virtual string LastName { get; set; }
            public virtual string Designation { get; set; }
        }
    }

Now that we are done with our model class, it's time to create a map class which maps our model class to our database table, since Fluent Nhibernate uses POCO classes for mapping instead of XML mappings. Following is the map class.

    using FluentNHibernate.Mapping;

    namespace FluentNhibernateMVC.Models
    {
        public class EmployeeMap : ClassMap<Employee>
        {
            public EmployeeMap()
            {
                // The property is EmployeeId but the table column is Id.
                Id(x => x.EmployeeId).Column("Id");
                Map(x => x.FirstName);
                Map(x => x.LastName);
                Map(x => x.Designation);
                Table("Employee");
            }
        }
    }

If you look at the EmployeeMap class carefully, you will see that I have mapped the Id column in the table to the EmployeeId property, and that the Table("Employee") call maps the Employee class to the Employee table. Now that we are done with our mapping class, it's time to add an Nhibernate helper which will connect to the SQL Server database.
    using FluentNHibernate.Cfg;
    using FluentNHibernate.Cfg.Db;
    using NHibernate;
    using NHibernate.Tool.hbm2ddl;

    namespace FluentNhibernateMVC.Models
    {
        public class NHibernateHelper
        {
            public static ISession OpenSession()
            {
                // Note: this builds a new session factory on every call,
                // which is fine for a demo but expensive in a real application.
                ISessionFactory sessionFactory = Fluently.Configure()
                    .Database(MsSqlConfiguration.MsSql2008
                        .ConnectionString(@"Data Source=(LocalDB)\v11.0;AttachDbFilename=C:\data\Blog\Samples\FluentNhibernateMVC\FluentNhibernateMVC\FluentNhibernateMVC\App_Data\FNhibernateDemo.mdf;Integrated Security=True")
                        .ShowSql()
                    )
                    .Mappings(m => m.FluentMappings.AddFromAssemblyOf<Employee>())
                    .ExposeConfiguration(cfg => new SchemaExport(cfg).Create(false, false))
                    .BuildSessionFactory();
                return sessionFactory.OpenSession();
            }
        }
    }

Now that we are done with the classes, it's time to scaffold our controller for the application. Right click the Controller folder and click on add new controller. It will ask for scaffolding items like the following. I have selected MVC5 controller with read/write actions. Once you click Add, it will ask for the controller name. Now our controller is ready. It's time to write code for our actions.

Listing:

Following is the code for our listing action.

    // Query<T>() is an extension method from the NHibernate.Linq namespace
    // in NHibernate 3.x and 4.x; ToList() needs System.Linq.
    public ActionResult Index()
    {
        using (ISession session = NHibernateHelper.OpenSession())
        {
            var employees = session.Query<Employee>().ToList();
            return View(employees);
        }
    }

You can add a view for the listing via right click on View and Add View. That's it, you are done with listing your database records. I have added a few records to the database table to see whether it's working or not. Following is the output, as expected.

Create:

Now it's time to create the code and views for creating an Employee. Following is the code for the action results.
    public ActionResult Create()
    {
        return View();
    }

    // POST: Employee/Create
    [HttpPost]
    public ActionResult Create(Employee employee)
    {
        try
        {
            using (ISession session = NHibernateHelper.OpenSession())
            {
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Save(employee);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch (Exception exception)
        {
            return View();
        }
    }

Here you can see that one action is for the blank form and the other, marked with the HttpPost attribute, is for posting data and saving it to SQL Server. You can add a view for create via right click on View in the action result like the following. Now when you run this you will get the output as expected. Once you click on create, it will return to the listing screen like this.

Edit:

Now that we are done with listing and adding employees, it's time to write code for editing/updating an employee. Following is the code for the action methods.

    public ActionResult Edit(int id)
    {
        using (ISession session = NHibernateHelper.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

    // POST: Employee/Edit/5
    [HttpPost]
    public ActionResult Edit(int id, Employee employee)
    {
        try
        {
            using (ISession session = NHibernateHelper.OpenSession())
            {
                // Load the stored employee and copy the posted values onto it.
                var employeetoUpdate = session.Get<Employee>(id);
                employeetoUpdate.Designation = employee.Designation;
                employeetoUpdate.FirstName = employee.FirstName;
                employeetoUpdate.LastName = employee.LastName;
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Save(employeetoUpdate);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch
        {
            return View();
        }
    }

You can create the edit view via right click on View like the following. Now when you run the application it will work as expected.

Details:

You can write the details code like the following.

    public ActionResult Details(int id)
    {
        using (ISession session = NHibernateHelper.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

You can add the view via right clicking on View as in the following.
Now when you run the application, following is the output as expected.

Delete:

Following is the code for deleting an Employee: one action method for confirming the delete and another one for deleting the data from the Employee table.

    public ActionResult Delete(int id)
    {
        using (ISession session = NHibernateHelper.OpenSession())
        {
            var employee = session.Get<Employee>(id);
            return View(employee);
        }
    }

    [HttpPost]
    public ActionResult Delete(int id, Employee employee)
    {
        try
        {
            using (ISession session = NHibernateHelper.OpenSession())
            {
                using (ITransaction transaction = session.BeginTransaction())
                {
                    session.Delete(employee);
                    transaction.Commit();
                }
            }
            return RedirectToAction("Index");
        }
        catch (Exception exception)
        {
            return View();
        }
    }

You can add a view for delete via right click on View like the following. When you run it, it runs as expected. Once you click on delete, it will delete the employee.

That's it. You can see it's very easy. Fluent Nhibernate supports POCO mapping without writing any complex XML mappings. You can find the whole source code on GitHub at the following location.

https://github.com/dotnetjalps/FluentNhiberNateMVC

Hope you like it. Stay tuned for more!!

July 28, 2014

by Jalpesh Vadgama

· 24,325 Views · 1 Like

Data-driven Unit Testing in Java

Data-driven testing is a powerful way of testing a given scenario with different combinations of values. In this article, we look at several ways to do data-driven unit testing in JUnit. Suppose, for example, you are implementing a Frequent Flyer application that awards status levels (Bronze, Silver, Gold, Platinum) based on the number of status points you earn. The number of points needed for each level is shown here:

    level     minimum status points    result level
    Bronze    0                        Bronze
    Bronze    300                      Silver
    Bronze    700                      Gold
    Bronze    1500                     Platinum

Our unit tests need to check that we can correctly calculate the status level achieved when a frequent flyer earns a certain number of points. This is a classic problem where data-driven tests would provide an elegant, efficient solution. Data-driven testing is well supported in modern JVM unit testing libraries such as Spock and Specs2. However, some teams don't have the option of using a language other than Java, or are limited to using JUnit. In this article, we look at a few options for data-driven testing in plain old JUnit.

Parameterized Tests in JUnit

JUnit provides some support for data-driven tests, via the Parameterized test runner.
A simple data-driven test in JUnit using this approach might look like this:

    import java.util.Arrays;

    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.junit.runners.Parameterized;
    import org.junit.runners.Parameterized.Parameters;

    // Bronze, Silver, Gold and Platinum are assumed to be statically imported Status constants;
    // assertThat(...).isEqualTo(...) comes from a fluent assertion library such as AssertJ.
    @RunWith(Parameterized.class)
    public class WhenEarningStatus {

        @Parameters(name = "{index}: {0} initially had {1} points, earns {2} points, should become {3} ")
        public static Iterable<Object[]> data() {
            return Arrays.asList(new Object[][]{
                    {Bronze, 0, 100, Bronze},
                    {Bronze, 0, 300, Silver},
                    {Bronze, 100, 200, Silver},
                    {Bronze, 0, 700, Gold},
                    {Bronze, 0, 1500, Platinum},
            });
        }

        private Status initialStatus;
        private int initialPoints;
        private int earnedPoints;
        private Status finalStatus;

        public WhenEarningStatus(Status initialStatus, int initialPoints, int earnedPoints, Status finalStatus) {
            this.initialStatus = initialStatus;
            this.initialPoints = initialPoints;
            this.earnedPoints = earnedPoints;
            this.finalStatus = finalStatus;
        }

        @Test
        public void shouldUpgradeStatusBasedOnPointsEarned() {
            FrequentFlyer member = FrequentFlyer.withFrequentFlyerNumber("12345678")
                    .named("Joe", "Jones")
                    .withStatusPoints(initialPoints)
                    .withStatus(initialStatus);

            member.earns(earnedPoints).statusPoints();

            assertThat(member.getStatus()).isEqualTo(finalStatus);
        }
    }

You provide the test data in the form of a list of Object arrays, returned by the method marked with the @Parameters annotation. These object arrays contain the rows of test data that you use for your data-driven test. Each row is used to instantiate member variables of the class, via the constructor. When you run the test, JUnit will instantiate and run a test for each row of data. You can use the name attribute of the @Parameters annotation to provide a more meaningful title for each test.

There are a few limitations to the JUnit parameterized tests. The most important is that, since the test data is defined at a class level and not at a test level, you can only have one set of test data per test class. On top of that, the code is somewhat cluttered: you need to define member variables, a constructor, and so forth. Fortunately, there is a better option.
Using JUnitParams

A more elegant way to do data-driven testing in JUnit is to use JUnitParams (https://code.google.com/p/junitparams/). JUnitParams (see Maven Central, http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22JUnitParams%22, to find the latest version) is an open source library that makes data-driven testing in JUnit easier and more explicit. A simple data-driven test using JUnitParams looks like this:

    import junitparams.JUnitParamsRunner;
    import junitparams.Parameters;
    import org.junit.Test;
    import org.junit.runner.RunWith;

    @RunWith(JUnitParamsRunner.class)
    public class WhenEarningStatusWithJUnitParams {

        @Test
        @Parameters({
                "Bronze, 0, 100, Bronze",
                "Bronze, 0, 300, Silver",
                "Bronze, 100, 200, Silver",
                "Bronze, 0, 700, Gold",
                "Bronze, 0, 1500, Platinum"
        })
        public void shouldUpgradeStatusBasedOnPointsEarned(Status initialStatus, int initialPoints,
                                                           int earnedPoints, Status finalStatus) {
            FrequentFlyer member = FrequentFlyer.withFrequentFlyerNumber("12345678")
                    .named("Joe", "Jones")
                    .withStatusPoints(initialPoints)
                    .withStatus(initialStatus);

            member.earns(earnedPoints).statusPoints();

            assertThat(member.getStatus()).isEqualTo(finalStatus);
        }
    }

Test data is defined in the @Parameters annotation, which is associated with the test itself, not the class, and is passed to the test via method parameters. This makes it possible to have different sets of test data for different tests in the same class, or to mix data-driven tests with normal tests in the same class, which is a much more logical way of organizing your classes.
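To make the idea of string rows bound to typed parameters concrete, here is a rough, framework-free sketch of a table-driven check. It does not use JUnit or JUnitParams, and it is not how either library is implemented. The Status enum, the statusFor rule and the class name StatusTableCheck are assumptions for illustration: the thresholds come from the table at the start of the article, and the rule assumes status depends only on the total of initial and earned points.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone check; none of these names come from the article's code base.
class StatusTableCheck {

    enum Status { Bronze, Silver, Gold, Platinum }

    /** Assumed rule: the status depends only on the total number of status points. */
    static Status statusFor(int totalPoints) {
        if (totalPoints >= 1500) return Status.Platinum;
        if (totalPoints >= 700) return Status.Gold;
        if (totalPoints >= 300) return Status.Silver;
        return Status.Bronze;
    }

    /** Parses one "initialStatus, initialPoints, earnedPoints, finalStatus" row and checks it. */
    static boolean rowPasses(String row) {
        String[] cells = row.trim().split("\\s*,\\s*");
        // cells[0] (the initial status) is parsed by the real tests but unused by this simple rule.
        int initialPoints = Integer.parseInt(cells[1]);
        int earnedPoints = Integer.parseInt(cells[2]);
        Status expected = Status.valueOf(cells[3]);
        return statusFor(initialPoints + earnedPoints) == expected;
    }

    // The same rows as in the @Parameters annotation above.
    static final List<String> ROWS = Arrays.asList(
            "Bronze, 0, 100, Bronze",
            "Bronze, 0, 300, Silver",
            "Bronze, 100, 200, Silver",
            "Bronze, 0, 700, Gold",
            "Bronze, 0, 1500, Platinum");

    public static void main(String[] args) {
        for (String row : ROWS) {
            System.out.println(row + " -> " + (rowPasses(row) ? "ok" : "FAILED"));
        }
    }
}
```

Keeping the rows as plain strings is what makes it easy to move them into an external file later; the price is that typos in a row only surface when the row is parsed at run time.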
JUnitParams also lets you get test data from other methods, as illustrated here:

    @Test
    @Parameters(method = "sampleData")
    public void shouldUpgradeStatusFromEarnedPoints(Status initialStatus, int initialPoints,
                                                    int earnedPoints, Status finalStatus) {
        FrequentFlyer member = FrequentFlyer.withFrequentFlyerNumber("12345678")
                .named("Joe", "Jones")
                .withStatusPoints(initialPoints)
                .withStatus(initialStatus);

        member.earns(earnedPoints).statusPoints();

        assertThat(member.getStatus()).isEqualTo(finalStatus);
    }

    // $ is a static helper on JUnitParamsRunner (import static junitparams.JUnitParamsRunner.$).
    private Object[] sampleData() {
        return $(
                $(Bronze, 0, 100, Bronze),
                $(Bronze, 0, 300, Silver),
                $(Bronze, 100, 200, Silver)
        );
    }

The $ method provides a convenient shorthand to convert test data to the Object arrays that need to be returned. You can also externalize the test data into a separate class:

    @Test
    @Parameters(source = StatusTestData.class)
    public void shouldUpgradeStatusFromEarnedPoints(Status initialStatus, int initialPoints,
                                                    int earnedPoints, Status finalStatus) {
        ...
    }

The test data here comes from a method in the StatusTestData class:

    public class StatusTestData {
        public static Object[] provideEarnedPointsTable() {
            return $(
                    $(Bronze, 0, 100, Bronze),
                    $(Bronze, 0, 300, Silver),
                    $(Bronze, 100, 200, Silver)
            );
        }
    }

This method needs to be static, return an object array, and have a name that starts with the word "provide". Getting test data from external methods or classes in this way opens the way to retrieving test data from external sources such as CSV or Excel files. JUnitParams provides a simple and clean way to implement data-driven tests in JUnit, without the overhead and limitations of the traditional JUnit parameterized tests.

Testing with non-Java languages

If you are not constrained to Java and/or JUnit, more modern tools such as Spock (https://code.google.com/p/spock/) and Specs2 provide great ways of writing clean, expressive unit tests in Groovy and Scala respectively.
In Groovy, for example, you could write a test like the following:

    class WhenEarningStatus extends Specification {

        def "should earn status based on the number of points earned"() {
            given:
            def member = FrequentFlyer.withFrequentFlyerNumber("12345678")
                    .named("Joe", "Jones")
                    .withStatusPoints(initialPoints)
                    .withStatus(initialStatus)

            when:
            member.earns(earnedPoints).statusPoints()

            then:
            member.status == finalStatus

            where:
            initialStatus | initialPoints | earnedPoints | finalStatus
            Bronze        | 0             | 100          | Bronze
            Bronze        | 0             | 300          | Silver
            Bronze        | 100           | 200          | Silver
            Silver        | 0             | 700          | Gold
            Gold          | 0             | 1500         | Platinum
        }
    }

John Ferguson Smart is a specialist in BDD, automated testing, and software life cycle development optimization, and author of BDD in Action and other books. John runs regular courses in Australia, London and Europe on related topics such as Agile Requirements Gathering, Behaviour Driven Development, Test Driven Development, and Automated Acceptance Testing.

July 27, 2014

by John Ferguson Smart

· 24,678 Views · 1 Like

JBoss Data Grid: Installation and Development

In this blog, we will discuss one particular data grid platform from Red Hat, namely JBoss Data Grid (JDG). We will first cover how to access and install this data grid platform, and then we will demonstrate how to develop and deploy a simple remote client/server data grid application which utilises the HotRod protocol. We will be using the latest release, JDG 6.2 from Red Hat, in this article.

Installation Overview

To start using JDG, first log on to the Red Hat site https://access.redhat.com/home and download the software from the Downloads section of the site. We wish to download the JDG 6.2 server by clicking on the appropriate links in the Downloads section. For future reference, it is also useful to download the quickstart and maven repository zip files. To install JDG, we simply unzip the JDG server package into an appropriate directory in our environment.

JDG Overview

In this section, we will provide a brief overview of the contents of the JDG installation package and the most notable configuration options available to users. Out of the box, users are provided with two runtime options: running JDG in standalone mode or in clustered mode. We can start JDG in either mode by invoking the standalone or clustered start-up scripts in the /bin directory. To configure JDG in either mode, we need to edit the files standalone.xml and clustered.xml respectively. In our case we will be creating a distributed cache which will run on a 3 node JDG cluster, so we will be utilizing the clustered start-up script. In order to set up and add new cache instances to JDG, we modify the infinispan subsystems in the appropriate XML configuration file above. We should also note that the principal difference between the standalone and clustered configuration files is that the clustered configuration file contains a JGroups subsystem element, which allows for communication and messaging between configured cache instances running in a JDG cluster.
Development Environment Setup and Configuration In this section, we will detail how to develop and configure a simple datagrid application which will be deployed to a 3 node JDG cluster. We will demonstrate how to configure and deploy a distributed cache in JDG and also show how to develop a HotRod Java client application which will be used to insert, update and display entries in the distributed cache. We will firstly discuss setting a new distributed cache on a 3 node JDG cluster. In this example, we will run our JDG cluster on a single machine by running each JDG instance on different ports. Firstly, we will create 3 instances of JDG by creating 3 directories (server1, server2, server3) on our host machine and unzipping each JDG installation into each directory. We will now configure each node in our cluster by copying and renaming the clustered.xml configuration file in the \server1\jboss-datagrid-6.2.0-server\standalone\configuration directory. We will name each of the cluster configuration files as "clustered1.xml", "clustered2.xml" and "clustered3.xml" for the JDG instances denoted by "server1", "server2" and "server3" respectively. We will now set up a new distributed cache on our JDG cluster by modifying the infinispan subsystem element in each clustered.xml file. We will demonstrate this for the node denoted "server1" here by modifying the file "clustered1.xml". The cache configuration shown here will be the same across all 3 nodes. To setup a new distributed cache named "directory-dist-cache", we configure the following elements in the file named "clustered1.xml" ......... ...... .............. ...... ...... /socket-binding-group> We will discuss the key elements and attributes relating to the configuration above. In the infinispan endpoint subsystem, we will configure hotrod clients to connect to the JDG server instance on socket 11222. 
The cache container that hosts each of the cache instances is named "clusteredcache". We have configured the infinispan core subsystem with "clusteredcache" as the default cache container, and we allow JMX statistics to be collected for the configured cache entries, i.e. statistics="true". We have created a new distributed cache named "directory-dist-cache", in which two copies of each cache entry are held on two of the 3 cluster nodes. We have also set up an eviction policy: should there be more than 20 entries in our cache, entries will be removed using the LRU algorithm.

We also need to configure nodes "server2" and "server3" to start up with port offsets of 100 and 200 respectively, by configuring the socket-binding-group element appropriately. Please view the socket bindings noted below.

To set the socket binding element with a port offset of 100 on "server2", we configure "clustered2.xml" with the following entry:

...... ...... </socket-binding-group>

To set the socket binding element with a port offset of 200 on "server3", we configure "clustered3.xml" with the following entry:

...... ...... </socket-binding-group>

Before discussing the setup and configuration of our HotRod client, which will be used to interact with our JDG clustered HotRod server, we will start up each server instance to ensure our newly configured JDG distributed cache starts up correctly.
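As an aside on the eviction setting described above, the behaviour of a 20-entry LRU policy can be illustrated with plain Java. This is only a sketch built on java.util.LinkedHashMap, with a made-up class name LruCache; it is not how JDG or Infinispan implements eviction, and it is neither distributed nor thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical single-JVM illustration of least-recently-used eviction.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }
}

class LruDemo {
    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(20);
        for (int i = 0; i < 20; i++) {
            cache.put("cinema" + i, "shows for cinema " + i);
        }
        cache.get("cinema0");                 // touch cinema0 so it becomes most recently used
        cache.put("cinema20", "new cinema");  // the 21st entry forces one eviction
        System.out.println("size=" + cache.size()
                + " hasCinema0=" + cache.containsKey("cinema0")
                + " hasCinema1=" + cache.containsKey("cinema1"));
    }
}
```

Here cinema1 is evicted rather than cinema0, because cinema0 was read just before the 21st entry was added.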
Open up 3 Windows or Linux consoles and execute the following start up commands: Console 1: 1) Navigate to \server1\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the first instance of our JDG cluster denoted "server1": clustered -c=clustered1.xml -Djboss.node.name=server1 Console 2: 1) Navigate to \server2\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the second instance of our JDG cluster denoted "server2": clustered -c=clustered2.xml -Djboss.node.name=server2 Console 3: 1) Navigate to \server3\jboss-datagrid-6.2.0-server\bin 2) Execute this command to start the third instance of our JDG cluster denoted "server3": clustered -c=clustered3.xml -Djboss.node.name=server3 Providing all 3 JDG instances have started up correctly, you should see output in the console window whereby we can see there are 3 JDG instances in the JGroups view: HotRod Client Development Setup Now that the Hotrod server is up and running, we need to develop a Hotrod Java client which will interact with the clustered server application. The development environment consists of the following tools. 1) JDK Hotspot 1.7.0_45 2) IDE - Eclipse Kepler Build id: 20130919-0819 The HotRod client application is a simple application consisting of two Java classes. The application allows users to retrieve a reference to the distributed cache from the JDG server and then perform these actions: a) add new cinema objects. b) add and remove shows to each cinema object. c) print the list of all cinemas and shows stored in our distributed cache. The source code can be downloaded from github @ https://github.com/davewinters/JDG. We could use maven here to build and execute our application by configuring the maven settings.xml to point to the maven repository files we downloaded earlier and set up a maven project file (pom.xml) to build and execute the client application. 
In this article we will build our application using the Eclipse IDE and run the client application on the command line. To create a HotRod client application and execute the sample application, one should complete the following steps: 1) Create a new Java Project in Eclipse 2) Create a new package named uk.co.c2b2.jdg.hotrod and import the source code that has been downloaded from Github mentioned previously. 3) Now we need to configure the build path in Eclipse to contain the appropriate JDG client jar files which are required to compile the application. You should include all the client jar files in the project build path. These jar files are contained in the JDG installation zip file. For example on my machine these jar files are located in the directory: \server1\jboss-datagrid-6.2.0-server\client\hotrod\java 4. Providing the Eclipse build path has been configured appropriately, the application source should compile without issue. 5. We will need to execute the Hotrod application by opening the console window and executing the following command. 
Note the path specified here will differ depending on where the JDG client jar files and application class files are located in your environment: java -classpath ".;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\commons-pool-1.6-redhat-4.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-client-hotrod-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-commons-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-query-dsl-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\infinispan-remote-query-client-6.0.1.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-logging-3.1.2.GA-redhat-1.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\jboss-marshalling-river-1.4.2.Final-redhat-2.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protobuf-java-2.5.0.jar;C:\Users\David\Installs\jbossdatagrids62\server1\jboss-datagrid-6.2.0-server\client\hotrod\java\protostream-1.0.0.CR1-redhat-1.jar" uk/co/c2b2/jdg/hotrod/CinemaDirectory 6. The Hotrod client at runtime provides the end user with a number of different options to interact with the distributed cache as we can view from the console window below. Client Application Principal API Details We will not provide a detailed overview of the Hotrod application code however we will describe the principal API and code details briefly. 
In order to interact with the distributed cache on the JDG cluster using the Hotrod protocol, we use the RemoteCacheManager object, which allows us to retrieve a remote reference to the distributed cache. We initialise a Properties object with the list of JDG instances and the HotRod server port associated with each instance. We can then add Cinema objects to the distributed cache using the RemoteCache.put() method. private RemoteCacheManager cacheManager; private RemoteCache cache; ..... Properties properties = new Properties(); properties.setProperty(ConfigurationProperties.SERVER_LIST, "127.0.0.1:11222;127.0.0.1:11322;127.0.0.1:11422"); cacheManager = new RemoteCacheManager(properties); cache = cacheManager.getCache("directory-dist-cache"); ..... cache.put(cinemaKey, cinemalist); In the accompanying webinar, I describe in further detail how to set up a JDG cluster and how to develop and run the JDG application discussed above. For further details on JDG please visit: http://www.redhat.com/products/jbossenterprisemiddleware/data-grid/ Webinar: Introduction to JBoss Data Grid -- Installation, Configuration and Development In this webinar we will look at the basics of setting up JBoss Data Grid, covering installation, configuration and development. We will look at practical examples of storing data, viewing the data in the cache and removing it. We will also take a look at the different clustered modes and what effect these have on the storage of your data.
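Returning to the server list passed to the client above: its three ports line up with the port offsets of 0, 100 and 200 configured for the three server instances. A minimal sketch of that arithmetic follows; the class and method names are hypothetical, and the base port of 11222 is simply taken from the first entry of the list in the article.

```java
// Hypothetical helper: derives the Hot Rod server list from a base port plus per-node offsets.
class HotRodServerList {
    // Base port assumed from the first entry of the article's server list
    static final int BASE_PORT = 11222;

    static String build(String host, int... portOffsets) {
        StringBuilder list = new StringBuilder();
        for (int offset : portOffsets) {
            if (list.length() > 0) {
                list.append(';'); // entries are separated by semicolons, as in the article
            }
            list.append(host).append(':').append(BASE_PORT + offset);
        }
        return list.toString();
    }
}
```

Calling HotRodServerList.build("127.0.0.1", 0, 100, 200) yields the same string that the article sets as the server list.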

July 25, 2014

by David Winters

· 16,070 Views

DocFlex/XML - XML Schema Documentation Generator and Toolkit

A powerful multi-format XML Schema (XSD) documentation generator and a tool for rapid development of custom XSD documentation generators according to user needs. About DocFlex/XML "XSDDoc" Template Set Template Processor Template Designer Integrations Generation of XSD Diagrams Apache Ant & Maven Links About DocFlex/XML DocFlex/XML is a Java-based software system for the development and execution of high-performance template-driven documentation generators from any data stored in XML files. The actual doc/report generators are programmed in the form of special templates using a graphic Template Designer, which represents the templates visually in a form resembling the output they generate. The templates are then interpreted by a Template Processor, which takes the XML files as input and produces the resulting documentation from them. This article describes an application of DocFlex/XML to the task of generating high-quality XML Schema documentation. That includes the following features of the DocFlex/XML system: The "XSDDoc" template set, which implements the ready-to-use XML Schema documentation generator itself. The Template Processor, which makes the templates work. Currently, it provides three interchangeable output generators for the HTML, RTF and TXT (plain text) formats. The Template Designer, which provides a high-quality GUI to design and modify templates. If you need a special XML Schema doc generator, the simplest way to create it is to modify the standard XSDDoc templates. The Template Designer enables you to do that. Integrations with Altova XMLSpy and Oxygen XML Editor. If you are a user of one of those popular XML editors, you can also turn it into a dynamically linked diagramming engine for DocFlex, so that the XSD diagrams generated by XMLSpy/OxygenXML are automatically included in the XML Schema documentation generated by DocFlex (with full support for hyperlinks).
"XSDDoc" Template Set This is the implementation of the XML Schema documentation generator itself, which provides the following functionality: Generation of a single documentation from any number of XML Schema (XSD) files together, in particular: highly navigable framed (Javadoc-like) HTML documentation; single-file HTML documentation; RTF documentation (further convertible to PDF). Processing of any referenced XML schemas, in particular: correct processing of all <import>, <include> and <redefine> elements found across all involved XSD files; automatic loading and processing (i.e. inclusion in the documentation scope) of all directly or indirectly referenced XSD files. Sophisticated documenting of XSD components, including: component diagrams (with hyperlinks to everything depicted on them; see also Integrations); XML representation summary (a textual alternative to diagrams); lists of related components (for elements this also includes the list of possible containing elements; such a list is never present in the output generated by XSLT-based doc generators); list of usage locations. Support of any XML Schema design patterns. This comes down mainly to the following: special treatment of local elements (see below); support and documenting of substitution groups; support of importing, inclusion and redefinition of schema files. Special documenting of local elements. Local elements are those components that are declared locally within other XSD components. The W3C XML Schema spec allows you to declare any number of local elements that may share the same name but have different content. That's because their meaning is local and there will be no collisions with other declarations. That, however, creates a problem for documenting, because in a documentation both global and local elements may appear simultaneously in various lists according to their common properties. If each element component is identified only by its name, you will get lists with multiple repeating names but little clue what they mean.
Moreover, some XML schemas may contain lots of identical local element declarations (that is, they have both the same name and the same content). So, in those lists you'll get a mess of repeating names, some of which refer to effectively the same entities, whereas others refer to completely different ones. In XSDDoc, those problems are solved in two ways: Adding extensions to local element names. The extension provides more information about the element (e.g. where it can be inserted, its global type, or where it is defined). That makes the whole string identifying the element unique. Unifying local elements by type. Compared with documenting all local elements straight as they are, we believe documentation generated with such unification is easier to understand and use. Processing of XHTML markup. You can format your XML Schema annotations with XHTML tags (including insertion of images), which will be recognized and rendered with the appropriate formatting in both HTML and RTF output. Possibility of unlimited customization: XSDDoc is controlled by more than 400 parameters, which allow you to adjust the generated documentation within a huge range of included details. Template parameters serve the same role as options in traditional doc generators. The difference is that the DocFlex template architecture makes the support and implementation of template parameters very cheap (typically, most of the effort goes into writing their descriptions).
So, there may be hundreds of parameters controlling a large template application. If parameters are not enough, you can modify the templates themselves using the Template Designer. In the case of HTML output, you can also apply your own CSS styles to change how the generated documentation looks. Template Processor The Template Processor (also called simply the "generator") makes everything work. It consists of two logical parts: 1. template interpreter 2. output generator The output generator actually has three different implementations, one for each currently supported output format: HTML, RTF, TXT (plain text). The plain-text output can be used to generate documentation in formats not supported directly by DocFlex. The Template Processor is started directly from the Java command line with the following arguments: ● main template ● template parameters ● initial XSD files to be processed (documented) ● XML catalogs (to redirect the physical location of input files) ● destination directory/file ● output format (this selects which output generator will be used) ● output format options (settings that control the selected output generator) Actually, the number of settings may be so large that the Template Processor provides a special GUI to specify everything interactively. Template Designer Although DocFlex templates are stored as plain-text files (with an XML-like format), they are not meant to be edited manually. Rather, a special graphic Template Designer must be used, which visualizes the templates in the form of the template components they are made of. Those components are the actual constructs of the template language (not some textual statements, operators, blocks etc.)
That approach has a number of advantages, among them: The processing structures represented by template components may be displayed in a way that visually expresses what a component does (for instance, it may resemble the output it generates). That representation may be both expressive and compact (after all, it is not just text), which allows you to easily navigate a template, understand what it does and modify anything you need. As template components are visual and interactive, they may have a very complex internal structure, for instance, contain lots of properties and nested components. You don't need to scroll and navigate some kind of enormous text that encodes all of this (as would be the case with a script). Rather, you just need to invoke some property dialogs and expand or collapse some component sections. A template component may be easily copied, pasted and deleted as a whole, and you don't need to worry about restoring the template syntax afterwards. The Template Designer will also ensure that each component is created, copied or moved only to an allowed place. The highly structured nature of templates eliminates the need for most named identifiers. Many connections between different template components are also maintained by the Template Designer (i.e. modified automatically when necessary). As template files are stored and read only programmatically, there is no need to know and understand their syntax. There will be no syntax errors either. The actual syntax of template files may be optimized not for human programmers, but for faster loading and processing of templates by the Template Processor. There is no need for a compilation phase. The separation of template semantics from the particular structure of template files helps the template language evolve faster and more easily.
The obsolete constructs of older template versions can be automatically converted into new structures. Both old and new templates will look and work up-to-date. Integrations Generation of XSD Diagrams DocFlex/XML is able to work with any kind of diagrams (i.e. inserting them automatically into the generated output). That is supported at the level of templates, along with the generation of hypertext imagemaps. DocFlex/XML provides no diagramming engine of its own. Instead, it includes integrations with the two most popular XML editors that do generate XSD diagrams: ● Altova XMLSpy ● Oxygen XML Editor Effectively, the third-party software is used as a dynamically linked diagramming engine. The advantage of such integrations is that when you are a user of one of those XML editors, you will get in the documentation generated by DocFlex the same diagrams as you see in your XML editor. Apache Ant & Maven As a pure Java application, DocFlex/XML can be run in any environment that runs Java itself. The Template Processor can be easily integrated with Ant (this can be specified directly in the Ant build file). In the case of Maven, DocFlex/XML includes a simple Maven plugin. It is also possible to use all diagramming integrations with both Ant and Maven. Links DocFlex/XML (home page): http://www.filigris.com/docflex-xml/ DocFlex/XML XSDDoc: http://www.filigris.com/docflex-xml/xsddoc/ XSDDoc examples: http://www.filigris.com/docflex-xml/xsddoc/examples/ XMLSpy integration: http://www.filigris.com/docflex-xml/xmlspy/ OxygenXML integration: http://www.filigris.com/docflex-xml/oxygenxml/ Free downloads: http://www.filigris.com/downloads/ This original article: http://www.filigris.com/ann/docflex-xsd/

July 23, 2014

by Leonid Rudy

· 7,631 Views

Building Extremely Large In-Memory InputStream for Testing Purposes

For some reason I needed an extremely large, possibly even infinite InputStream that would simply return the same byte[] over and over. This way I could produce an insanely big stream of data by repeating a small sample. Similar functionality can be found in Guava: Iterable<T> Iterables.cycle(Iterable<T>) and Iterator<T> Iterators.cycle(Iterable<T>). For example if you need an infinite source of 0 and 1, simply say Iterables.cycle(0, 1) and get 0, 1, 0, 1, 0, 1... infinitely. Unfortunately I haven't found such a utility for InputStream, so I jumped into writing my own. This article documents many mistakes I made during that process, mostly due to overcomplicating and overengineering a straightforward solution. We don't really need an infinite InputStream; being able to create a very large one (say, 32 GiB) is enough. So we are after the following method: public static InputStream repeat(byte[] sample, int times) It basically takes a sample array of bytes and returns an InputStream returning these bytes. However, when sample runs out, it rolls over, returning the same bytes again. This process is repeated the given number of times, until the InputStream signals end. One solution that I haven't really tried but which seems most obvious: public static InputStream repeat(byte[] sample, int times) { final byte[] allBytes = new byte[sample.length * times]; for (int i = 0; i < times; i++) { System.arraycopy(sample, 0, allBytes, i * sample.length, sample.length); } return new ByteArrayInputStream(allBytes); } I see you laughing there! If sample is 100 bytes and we need 32 GiB of input repeating these 100 bytes, the generated InputStream shouldn't really allocate 32 GiB of memory; we must be more clever here. As a matter of fact, repeat() above has another subtle bug. Arrays in Java are limited to 2^31-1 entries (int), and 32 GiB is way above that. The reason this program compiles is a silent integer overflow here: sample.length * times. This multiplication doesn't fit in int.
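The silent overflow is easy to see with the article's own numbers. In the sketch below (the class and method names are hypothetical), a 100-byte sample repeated 343,597,384 times is just over 32 GiB, yet the int multiplication wraps around to a tiny value, while widening one operand to long first gives the real size.

```java
// Hypothetical demo of int overflow in sample.length * times
class OverflowDemo {
    // int * int is computed in 32 bits and silently wraps around
    static int intSize(int sampleLength, int times) {
        return sampleLength * times;
    }

    // Widening one operand to long makes the whole multiplication 64-bit
    static long longSize(int sampleLength, int times) {
        return (long) sampleLength * times;
    }

    public static void main(String[] args) {
        System.out.println(intSize(100, 343_597_384));  // prints 32
        System.out.println(longSize(100, 343_597_384)); // prints 34359738400
    }
}
```

The wrapped result is even positive here, so new byte[sample.length * times] would quietly allocate a 32-byte array instead of failing.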
OK, let's try something that at least theoretically can work. My first idea was as follows: what if I create many ByteArrayInputStreams sharing the same byte[] sample (they don't do an eager copy) and somehow join them together? Thus I needed some InputStream adapter that could take an arbitrary number of underlying InputStreams and chain them together: when the first stream is exhausted, switch to the next one. This awkward moment when you look for something in Apache Commons or Guava and apparently it was in the JDK forever... java.io.SequenceInputStream is almost ideal. However, its simplest constructor can only chain precisely two underlying InputStreams. Of course, since SequenceInputStream is an InputStream itself, we can use it recursively as an argument to an outer SequenceInputStream. Repeating this process we can chain an arbitrary number of ByteArrayInputStreams together: public static InputStream repeat(byte[] sample, int times) { if (times <= 1) { return new ByteArrayInputStream(sample); } else { return new SequenceInputStream( new ByteArrayInputStream(sample), repeat(sample, times - 1) ); } } If times is 1, just wrap sample in a ByteArrayInputStream. Otherwise use SequenceInputStream recursively. I think you can immediately spot what's wrong with this code: too deep recursion. The nesting level is the same as the times argument, which will reach millions or even billions. There must be a better way. Luckily a minor improvement changes the recursion depth from O(n) to O(log n): public static InputStream repeat(byte[] sample, int times) { if (times <= 1) { return new ByteArrayInputStream(sample); } else { return new SequenceInputStream( repeat(sample, times / 2), repeat(sample, times - times / 2) ); } } Honestly this was the first implementation I tried. It's a simple application of the divide and conquer principle, where we produce the result by evenly splitting the problem into two smaller sub-problems.
Looks clever, but there is one issue: it's easy to prove we create t (t = times) ByteArrayInputStreams and O(t) SequenceInputStreams. While the sample byte array is shared, millions of InputStream instances are wasting memory. This leads us to an alternative implementation, creating just one InputStream, regardless of the value of times: import java.io.IOException; import java.io.InputStream; import java.util.Iterator; import com.google.common.collect.Iterators; import org.apache.commons.lang3.ArrayUtils; public static InputStream repeat(byte[] sample, int times) { final Byte[] objArray = ArrayUtils.toObject(sample); final Iterator<Byte> infinite = Iterators.cycle(objArray); final Iterator<Byte> limited = Iterators.limit(infinite, sample.length * times); return new InputStream() { @Override public int read() throws IOException { return limited.hasNext() ? limited.next() & 0xFF : -1; } }; } We will use Iterators.cycle() after all. But first we have to translate byte[] into Byte[], since iterators can only work with objects, not primitives. There is no idiomatic way to turn an array of primitives into an array of boxed types, so I use ArrayUtils.toObject(byte[]) from Apache Commons Lang. Having an array of objects, we can create an infinite iterator that cycles through the values of sample. Since we don't want an infinite stream, we cut off the infinite iterator using Iterators.limit(Iterator, int), again from Guava. Now we just have to bridge from Iterator to InputStream; after all, semantically they represent the same thing. This solution suffers from two problems. First of all, it relies on boxing and unboxing for every byte it returns. Garbage collection is not that much concerned about dead, short-living objects, but it still seems wasteful. The second issue we already faced previously: the sample.length * times multiplication can cause integer overflow. It can't be fixed because Iterators.limit() takes int, not long, for no good reason. BTW we avoided a third problem by doing a bitwise and with 0xFF. Otherwise a byte with value -1 would signal end of stream, which is not the case.
With x = -1, x & 0xFF is correctly translated to unsigned 255 (int). So even though the implementation above is short and sweet, declarative rather than imperative, it's too slow and limited. If you have a C background, I can imagine how uncomfortable you were seeing me struggle. After all, the most straightforward, painfully simple and low-level implementation was the one I came up with last (with the same & 0xFF mask as before, so that negative bytes are not mistaken for end of stream): public static InputStream repeat(final byte[] sample, final int times) { return new InputStream() { private long pos = 0; private final long total = (long)sample.length * times; public int read() throws IOException { return pos < total ? sample[(int)(pos++ % sample.length)] & 0xFF : -1; } }; } GC free, pure JDK, fast and simple to understand. Let this be a lesson for you: start with the simplest solution that jumps to your mind, don't overengineer and don't be too smart. My previous solutions, declarative, functional, immutable, etc., maybe looked clever, but they were neither fast nor easy to understand. The utility we just developed was not just a toy project; it will be used later in a subsequent article.
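As a quick check of the last approach, here is a self-contained sketch of a position-based repeating stream together with a small helper that drains it. The class name RepeatedStream and the drain() helper are hypothetical additions for illustration; the sample deliberately contains the byte 0xFF to show why read() has to mask its result.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class RepeatedStream {
    static InputStream repeat(final byte[] sample, final int times) {
        return new InputStream() {
            private long pos = 0;
            // computed in long to avoid int overflow for very large streams
            private final long total = (long) sample.length * times;

            @Override
            public int read() {
                // mask with 0xFF so that (byte) -1 is returned as 255, not as end of stream
                return pos < total ? sample[(int) (pos++ % sample.length)] & 0xFF : -1;
            }
        };
    }

    // Illustration helper: reads the stream to the end and returns every value read
    static int[] drain(InputStream in) {
        int[] out = new int[16];
        int n = 0;
        try {
            for (int b = in.read(); b != -1; b = in.read()) {
                if (n == out.length) {
                    out = Arrays.copyOf(out, n * 2);
                }
                out[n++] = b;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(out, n);
    }
}
```

Draining repeat(new byte[] {1, (byte) 0xFF}, 2) gives 1, 255, 1, 255; without the mask, the stream would appear to end after the first byte.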

July 23, 2014

by Tomasz Nurkiewicz

· 7,546 Views

5 Reasons to Use a Java Data Grid in Your Application

In this post we explore 5 reasons to use a Java Data Grid for caching Java objects in-memory in your applications. In a later post we will explore some of the other data grid capabilities, beyond data storage, that can revolutionize your Java architectures, like on-grid computation and events. Memory is Fast Java Data Grids store Java objects in memory. Memory access is fast, with low latency. So if access to data storage, either disk or database, is the primary bottleneck in your application, then using a data grid as an in-memory cache in front of your storage tier will give you a performance boost. Scale out your Application Shared State If you need to share state across JVMs to scale out your application, then using a Java Data Grid rather than a database will increase your scalability. In a typical shared state architecture, the application server tier stores shared Java objects in the data grid and these objects are available to all application server nodes in your architecture. Separating the data grid tier from the application server tier has a number of advantages: applications can be redeployed and restarted without losing the shared state; Data Grid JVMs and Application JVMs can be tuned separately; state can be shared across multiple different applications; each tier can be scaled horizontally and separately depending on workload. Typical use cases for shared state include PCI compliant storage of card security codes, in-game state in online games, web session data, and prices and catalogues in ecommerce. Anything that needs low latency access can be stored in the shared data grid. High Availability for In-Memory Data As well as providing low latency access and scaling out shared state, Java Data Grids also provide high availability for your in-memory data.
When storing Java objects in a data grid, a primary object is stored in one of the Data Grid JVMs and secondary backup copies of the object are stored in different Data Grid JVM nodes, ensuring that if you lose a node then you don't lose any data. Clients of the data grid do not need to know where data is in order to access it, so high availability is transparent to your application. Scale Out In-Memory Data Volumes Java objects in data grids aren't fully replicated across all Data Grid JVMs but are stored as a primary object and a secondary object. This means the more Data Grid JVM nodes we add, the more JVM heap we have for storing Java objects in-memory (and remember, memory is fast). For example, if we build a Data Grid with 20 JVMs, each with 4 GB of free heap (after per-JVM overhead), we could theoretically store 80 GB (4 times 20) of shared Java objects. If we assume we have 1 duplicate for high availability, this cuts our storage in half, so we can store 40 GB (0.5 times 4 times 20) of Java objects in memory. Native Integration with JPA Java Data Grids have native integration with JPA frameworks like TopLink and Hibernate, whereby the Data Grid can act as a second level cache between JPA and the database. This can give a large performance boost to your database driven application if latency associated with database access is a key performance bottleneck.
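The capacity arithmetic from the "Scale Out In-Memory Data Volumes" section can be written down as a one-line formula: usable capacity = nodes × free heap per node ÷ (1 + backup copies). The sketch below is only that back-of-the-envelope calculation; the class and method names are hypothetical, and real grids lose some additional heap to overhead.

```java
// Hypothetical back-of-the-envelope calculator for data grid capacity
class GridCapacity {
    // Each object is stored once as a primary plus backupCopies times as a backup
    static double usableGb(int nodes, double freeHeapPerNodeGb, int backupCopies) {
        return nodes * freeHeapPerNodeGb / (1 + backupCopies);
    }

    public static void main(String[] args) {
        System.out.println(usableGb(20, 4, 0)); // prints 80.0
        System.out.println(usableGb(20, 4, 1)); // prints 40.0
    }
}
```

Each additional backup copy shrinks usable capacity further, for example two backups leave a third of the raw heap.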

July 22, 2014

by Steve Millidge

· 7,397 Views

From JPA to Hibernate's Legacy and Enhanced Identifier Generators

Read about enhanced identifier generators, like JPA and Hibernate.

July 16, 2014

by Vlad Mihalcea

· 19,452 Views

R/plyr: ddply – Error in vector(type, length) : vector: cannot make a vector of mode ‘closure’.

In my continued playing around with plyr’s ddply function I was trying to group a data frame by one of its columns and return a count of the number of rows with specific values, and ran into a strange (to me) error message. I had a data frame: n = c(2, 3, 5) s = c("aa", "bb", "cc") b = c(TRUE, FALSE, TRUE) df = data.frame(n, s, b) and wanted to group and count on column ‘b’ so I’d get back a count of 2 for TRUE and 1 for FALSE. I wrote this code: ddply(df, "b", function(x) { countr <- length(x$n); data.frame(count = count) }) which when evaluated gave the following error: Error in vector(type, length) : vector: cannot make a vector of mode 'closure'. It took me quite a while to realise that I’d just made a typo and assigned the count to a variable called ‘countr’ instead of ‘count’. As a result of that typo, R looked for something called ‘count’ elsewhere in the lexical scope. I think what it found was a function of that name (plyr exports one, and a function is a ‘closure’), which can’t be turned into a data frame column. If I’d defined the variable ‘count’ outside the call to the ddply function then my typo wouldn’t have resulted in an error but rather an unexpected result, e.g. > count = 10 > ddply(df, "b", function(x) { + countr <- length(x$n) + data.frame(count = count) + }) b count 1 FALSE 10 2 TRUE 10 Once I spotted the typo and fixed it things worked as expected: > ddply(df, "b", function(x) { + count <- length(x$n) + data.frame(count = count) + }) b count 1 FALSE 1 2 TRUE 2

July 10, 2014

by Mark Needham

· 8,800 Views

Hibernate Identity, Sequence and Table (Sequence) Generator

Learn about Identity, Sequence, and Table in Hibernate.

July 9, 2014

by Vlad Mihalcea

· 178,027 Views · 2 Likes

Designing a Data Architecture to Support both Fast and Big Data

Originally written by Scott Jarr for VoltDB. In post one of this series, we introduced the ideas that a Corporate Data Architecture was taking shape and that working with Fast Data is different from working with Big Data. In the second post we looked at examples of Fast Data and what is required of applications that interact with Fast Data. In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big. The following diagram depicts a basic view of how the “Big” side of the picture is starting to fill out. At the center is a Data Lake, or pool or reservoir or…. there is no shortage of clever names and debate over what to call it. What is clear is this is the spot in which the enterprise will dump ALL of its data. This component is not necessarily unique because of its design or functionality, but because it is an enormously cost effective system to store everything. Essentially, it is a distributed file system on cheap commodity machines. There may or may not be a single winning technology here. It may be HDFS or some other store (maybe S3 if you’re on Amazon), but the point is, this is where all data will go. This platform will: 1. Store data that will be sent to other data management products, and 2. Support frameworks for executing jobs directly against the data in the file system. Moving around the outside of our Data Lake are the complementary pieces of technology that allow people to gain insight and value from the data stored in the Data Lake. Starting at 12 o’clock in the diagram above and moving clockwise: BI – Reporting: Data warehouses do an excellent job of reporting, and will continue to offer this capability. Some data will be exported to those systems and temporarily stored there, while other data will be accessed directly from the Data Lake in a hybrid fashion. 
These data warehouse systems were specifically designed to run complex report analytics, and do this well. SQL on Hadoop: There is a lot of innovation here. The goal of many of these products is to displace the data warehouse. Advances have been made with the likes of Hawq and Impala. But make no mistake, there is a long way to go for these systems to get near the speed and efficiency of the data warehouses, especially those with columnar designs. SQL-on-Hadoop systems exist for a couple of important reasons: 1) SQL is still the best way to get at data, and 2) Processing can occur without moving big chunks of data around. Exploratory Analytics: This is the realm of the data scientist. These tools offer the ability to “find” things in data – patterns, obscure relationships, statistical rules, etc. Mahout and R are popular tools in this category. MapReduce: This is a lazily-named group of all the job scheduling and management tasks that often occur on Hadoop (I really should come up with something more accurate). Many Hadoop use cases today involve pre-processing or cleaning data prior to the use of the analytics tools described above. These are the tools and interfaces that allow that to happen. ETL of Enterprise Apps: Last at 6 o’clock is the ETL process that will help get all the legacy data from our trusty enterprise applications into our data lake that stores everything. These applications will slowly migrate to full-fledged Fast+Big Data apps in time, which I will discuss in a future post. But suffice it to say: once I add sensors to a manufacturing line, I have a Fast+Big Data problem. OK, we now have analytics … so what? Why do we do analytics in the first place? Simple. We want: Better decisions Better personalization Better detection Better …. Interaction. Interaction is what the application is responsible for, and the most valuable improvements come when you can do these interactions accurately and in real-time. 
This brings us to the second half of the architecture, where we deal with Fast Data to make better, faster real-time applications. The first thing to notice is that there is a tight coupling of Fast and Big, although they are separate systems. They have to be, at least at scale. The database system designed to work with millions of event decisions per second is wholly different from the system designed to hold petabytes of data and generate extensive reports. The nature of Fast Data produces a number of critical requirements to get the most out of it. These include the ability to: ingest and interact with the data feed; make decisions on each event in the feed; provide visibility into fast-moving data with real-time analytics; integrate seamlessly into the systems designed to store Big Data; and serve analytic results and knowledge from the Big Data systems quickly to users and applications, closing the data loop. There is no better technology to meet these requirements than an operational database. The challenge we have faced is that there hasn’t been an operational database that can manage this kind of throughput. As a result, there have been a number of Band-Aids people have used to attempt to meet their needs, often giving up capabilities and always adding complexity. In the next post, I will detail the capabilities I see customers looking for to support their Fast Data applications. Then we will take a look at the results of attempting this solution with a popular alternative, stream processing.

July 9, 2014

by John Piekos

· 14,133 Views