DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Databases Topics

article thumbnail
Product Catalog with MongoDB, Part 1: Schema Design
This post is part of the Product Catalog MongoDB Series, in which we will cover many aspects of building a Product Catalog with MongoDB. This approach has been tested with a varied product catalog of 130 million items running on a single server (EC2 i2.2xlarge). MongoDB seems to be the perfect fit to implement a product catalog since products maps so well to documents. But as we shall see it is not as easy as it seems! The data is fairly complex with many relationships involved. Also almost every other system will want to make use of the catalog instead of making its own copy, so typically a low latency, scalable and geo distributed catalog service is the ideal solution. A product has at least the following information: Item: the overall product info (e.g. Levi’s 501) Variant: a specific variant of an item (e.g. in black size 6) which typically has a specific SKU / UPC Price: price information may vary based on the store, the variant, etc Hierarchy: the item taxonomy Facet: facets to search products by Vendors: a given sku may be available through different vendors if the site is a marketplace A classic pitfall is to try to fit everything into a single document. As a result you end up with something very complex with many nested lists, which makes it difficult to navigate and index. Additionally APIs find themselves sending back massive documents even if only partial info is needed. In certain cases, we’ve seen items with 1000s of variant (e.g. Automotive part) which go beyond 16MB of pure JSON (in which case compression becomes mandatory)! Instead here we are going to model the data in a way that is natural, maps well to the API, at the sweet spot between normalization and denorm. Item Model The item collection has document representing the high level data of a product. Here is a sample item document for a shoe: { "_id": "054VA72303012P", "desc": [ { "lang": "en", "val": "Give your dressy look a lift with these women's Kate high-heel shoes by Metaphor. These playful peep-toe pumps feature satin-wrapped stiletto heels and chiffon pompoms at the toes. Rhinestones on each of the silvertone buckles add just a touch of sparkle to these shoes for a flirty footwear look that's made for your next night out." } ], "name": "Women's Kate Ivory Peep-Toe Stiletto Heel", "lname": "women's kate ivory peep-toe stiletto heel", "category": "/84700/80009/1282094266/1200003270", "brand": { "id": "2483510", "img": { "src": "http://i.sears.com/s/i/bl/image/spin_prod_metadata_168138610" }, "name": "Metaphor" }, "assets": { "imgs": [ { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_967112812", "width": "1900" } }, { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945877912", "width": "1900" } }, { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945878012", "width": "1900" } } ] }, "shipping": { "dimensions": { "height": "13.0", "length": "1.8", "width": "26.8" }, "weight": "1.75" }, "specs": [ { "name": "Heel Height (in.)", "val": "3.75" } ], "attrs": [ { "name": "Heel Height", "value": "High (2-1/2 to 4 in.)" }, { "name": "Upper Material", "value": "Synthetic" }, { "name": "Toe", "value": "Open toe" } { "name": "Brand", "value": "Metaphor" } ], "variants": { "cnt": 9, "attrs": [ { "dispType": "COMBOBOX", "name": "Width", }, { "dispType": "DROPDOWN", "name": "Color", }, { "dispType": "DROPDOWN", "name": "Shoe Size", } ] }, "lastUpdated": 1400877254787 } Fields of interest: _id: the product id lastUpdated: useful timestamp to see recently updated category: the category path made up of hierarchy nodes name: the product name lname: a lower-case version of the name. This can be useful for doing case-insensitive matching with an index brand: the brand desc: list of descriptions (website, retail box, etc) assets: list of assets (images, etc) attrs: list of attributes as name-value pairs. Will be used to implement facetting. Note that the brand is also included as one attribute. variants: some information on variants, but not the variants themselves Common queries (indexed): find by id: { _id: "the product id" } find by category prefix: { product.cat: { $regex: "^category prefix" } } find by case-insensitive name prefix: { product.lname: { $regex: "^name prefix" } } Variant Model The Variant documents represent specific variations of a product. Certain products only exist in a unique variant (e.g. XBox, no options to pick) whereas other products may have thousands. Here is a sample variant document for the same shoe: { "_id": "05458452563", "name": "Width:Medium,Color:Ivory,Shoe Size:6.5", "lname": "width:medium,color:ivory,shoe size:6.5", "itemId": "054VA72303012P", "altIds": { "upc": "632576103580" }, "assets": { "imgs": [ { "width": "1900", "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348512" }, { "width": "1900", "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348612" } ] }, "attrs": [ { "name": "Width", "value": "Medium" }, { "name": "Color", "family": "White", "value": "Ivory" }, { "name": "6.5", "value": "6.5" } ] } Fields of interest: _id: the SKU itemId: the parent item id. attrs: a list of attributes specific to the variant. Note that some of the attributes may have both a specific value (e.g. ivory) and a family value (e.g. white). assets: assets specific to the variant (e.g. image with a specific color). Common query parameters (indexed): find by SKU: { _id: "the sku" } find by item Id: { itemId: "item id" } Hierarchy Model The Hierarchy document represents a node of the hierarchical tree representing product taxonomy. The top level nodes represent departments, while further nodes represent specific categories. { "_id": "1200003270", "name": "Women's Heels & Pumps", "count": 223, "parents": [ "1282094266" ], "facets": [ "Heel Height", "Toe", "Upper Material", "Width", "Shoe Size", "Color" ] } Fields of interest: _id: the category id name: the category name count: the number of items in this category. It can be a useful statistic to display. parents: list of parent nodes. Simpler implementations could make use of a single value. facets: list of facets that exist for this category (e.g. color, size). This info will be used when displaying the facets available in the searching page. Common queries (indexed): find by parent id: { p: "parent id" } find top level departments: { p: null } Facet Model The facet document represent a name/value pair representing a product attribute. { "_id": "accessory type=hosiery", "name": "Accessory Type", "value": "Hosiery", "count": 14 } Fields of interest: _id: the id, which is a concatenation of lower-cased facet name and value. name: the facet name with original casing, e.g. “Accessory Type” value: the facet value with original casing, e.g. “Hosiery”. Important note: here the value should be the family value is possible, e.g. “White” rather than “Ivory”. Those facets will be used for searching items, and the family value is better for that purpose. count: the number of items that have this facet. This count will be important in defining the order of attributes in a query when doing faceted search. Common query parameters (indexed): find a specific facet: { _id: "name_value" } find facets for a name: { _id: { $regex: "^name_" } } Price Model The Price document obviously represents the price of an item, but there is quite a bit more to it. We want to be able to vary the price per variant (e.g. gold color is more expensive) or per store (e.g. online store is cheaper). While we will not touch on the store model in this post, let’s just imagine we have a few thousand stores which are grouped into a dozen store groups (e.g. online, west coast, etc). If we implement this naively, we would end up with 1000 stores x 200m variants = 2 billion price documents! Instead let’s be a bit smarter and make good use of MongoDB’s querying capability, by allowing to price products at different levels as needed, thus keeping the number of documents in the millions. A document looks like: { "_id": "SPM8824542513_1234", "price": "69.99", "sale": { "salePrice": "42.72", "saleEndDate": "2050-12-31 23:59:59" }, "lastUpdated": 1374647707394 } Fields of interest: _id: the id is built in a specific way. It is the concatenation of the item information and store information. The item information is either the item id or the variant id (SKU). The store information is either the store group id or the store id. price: the regular price sale: sales information, optional Common queries (indexed): find all prices by item id: { _id: { "$regex": "^itemId_" } } find all prices by SKU (price could be at item level): { _id: { "$in: [ { "$regex": "^itemId_" }, { "$regex": "^sku_" } ] } find price for a given SKU and store (4 combinations are possible): { _id: { "$in: [ "itemId_storeGroupId", "itemId_storeId", "sku_storeGroupId", "sku_storeId" ] } find items on sale, starting with ones ending soonest: { "sale.saleEndDate": { $ne: null } } with sort by { "sale.saleEndDate": 1 } (sparse index on "sale.saleEndDate") Summary Model The previous documents are now properly modeled, easy to maintain, and can efficiently power an API to serve product details pages. The last, and most difficult issue left to tackle is how to do faceted searching and other kind of browsing. We could leave it off to a full text search system or similar software, but it’s actually doable with MongoDB. For this purpose we face some tough challenges: whatever the search is, need a response within milliseconds returning hundreds of items the search can be a combination of many facets: category, brand, etc facets can be both at the item and variant levels: color, size, etc. If matching a specific variant, we should display that specific image (e.g. red shoes). hundreds of variants of the same item could match, in which case only the parent item should be returned as result efficient sorting on several attributes: price, popularity pagination feature which requires deterministic ordering For this purpose we create a separate collection called Summary in which each document represent the summary information of an item and all its variants. The data is stripped out to the minimum needed to power a browse & search feature. Such a document looks like: { "_id": "3ZZVA46759401P", "name": "Women's Chic - Black Velvet Suede", "lname": "women's chic - black velvet suede", "dep": "84700", "cat": "/84700/80009/1282094266/1200003270", "desc": [ { "lang": "en", "val": "This pointy toe slingback features a high quality upper and a classy, simple silhouette. This heel has a classic shape, an adjustable ankle strap for a vintage feel and a secure fit. The Chic is the perfect combination between dressy and professional." } ], "img": [ { "height": "330", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726201", "title": "spin_prod_591726201", "width": "450" } ], "attrs": [ "heel height=mid (1-3/4 to 2-1/4 in.)", "brand=metaphor" ], "sattrs": [ "upper material=synthetic", "toe=open toe" ], "vars": [ { "id": "05497884001", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=6" ] }, { "id": "05497884002", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=6.5" ] }, { "id": "05497884004", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=7.5" ] } ] } Fields of interest: _id: the item id name: the item name lname: the item name, lower-case img: list of images, ideally just the thumbnail dep: the department (top level of category). Needs to be separate for proper indexing cat: the category path attrs: the item attributes, to be indexed sattrs: the item secondary attributes, not to be indexed vars: list of variants vars.id: the variant sku vars.attrs: the variant attributes, to be indexed vars.sattrs: the secondary variant attributes, not to be indexed Indices: department + attr + category + _id department + vars.attrs + category + _id department + category + _id department + price + _id department + rating + _id Common queries (indexed): find by department: { dep: "department" } find by category prefix: { dep: "department", cat: { $regex: "^category prefix"} } find by item attribute: { dep: "department", attrs: "name=value" } find by several item attributes: { dep: "department", attrs: { $all: [ "name=value", ... ] } find by variant attribute: { dep: "department", vars.attrs: "name=value" } find by several variant attributes: { dep: "department", vars.attrs: { $all: [ "name=value", ... ] } find by item attributes, variant attributes, category: { dep: "department", attrs: { $all: [ "name=value", ... ], vars.attrs: { $all: [ "name=value", ... ], cat: { $regex: "^category prefix"} } A few interesting notes on indexing / querying: each index starts with the department, which is a convenient way to subdivide our product catalog. It is an acceptable restriction to force the user to pick a department before displaying any kind of search facet (unless we’re displaying a pre-computed list like “most popular”). Hence having the department there will ensure that there is always a large amount of filtering done for cheap by the index :) each index ends with “_id” which is useful for pagination. It will give sorting on _id for free for some common queries. It’s always better to avoid resorting the skip/limit for pagination, which is only fine for a low number of pages. for queries using “$all“ the most restrictive attribute should be specified first (e.g. “color=red”). This information can be inferred from the “facet“ collection described earlier. This piece is critical to make facetted searches efficient and keep them in the few milliseconds. Conclusion In conclusion, we’ve seen here how to model and index a product catalog in MongoDB which will allow high performance, flexibility, and easy maintenance. More details on this topic can be see in the MongoDB World video Product Catalog Stay tuned for more information on our Product Catalog MongoDB solution, including: How to implement full text search within MongoDB or with a connector to an external FTS system More statistics and benchmarking on the faceted search capability Operational considerations: geo distributed for low latency queries, stringent read latency SLA, and spiky catalog updates Also check out another interesting topic: how to log all user activities around the site and run useful analytics on them in the MongoDB World video covering the Insight Component
September 25, 2014
by Antoine Girbal
· 65,714 Views · 8 Likes
article thumbnail
The No Fluff Introduction to Big Data
big data traditionally has referred to a collection of data too massive to be handled efficiently by traditional database tools and methods. this original definition has expanded over the years to identify tools (big data tools) that tackle extremely large datasets (nosql databases, mapreduce, hadoop, newsql, etc.), and to describe the industry challenge posed by having data harvesting abilities that far outstrip the ability to process, interpret, and act on that data. technologists knew that those huge batches of user data and other data types were full of insights that could be extracted by analyzing the data in large aggregates. they just didn’t have any cheap, simple technology for organizing and querying these large batches of raw, unstructured data. the term quickly became a buzzword for every sort of data processing product’s marketing team. big data became a catchall term for anything that handled non-trivial sizes of data. sean owen, a data scientist at cloudera, has suggested that big data is a stage where individual data points are irrelevant and only aggregate analysis matters [1]. but this is true for a 400 person survey as well, and most people wouldn’t consider that very big. the key part missing from that definition is the transformation of unstructured data batches into structured datasets. it doesn’t matter if the database is relational or non-relational. big data is not defined by a number of terabytes, it’s rooted in the push to discoverhidden insights in data that companies used to disregard or throw away. due to the obstacles presented by large scale data management, the goal for developers and data scientists is two-fold: first, systems must be created to handle large scale data, and two, business intelligence and insights should be acquired from analysis of the data. acquiring the tools and methods to meet these goals is a major focus in the data science industry, but it’s a landscape where needs and goals are still shifting. what are the characteristics of big data? tech companies are constantly amassing data from a variety of digital sources that is almost without end—everything from email addresses to digital images, mp3s, social media communication, server traffic logs, purchase history, and demographics. and it’s not just the data itself, but data about the data (metadata). it is a barrage of information on every level. what is it that makes this mountain of data big data? one of the most helpful models for understanding the nature of big data is “the three vs:” volume, velocity, and variety. data volume volumeis the sheer size of the data being collected. there was a point in not-so-distant history where managing gigabytes of data was considered a serious task—now we have web giants like google and facebook handling petabytes of information about users’ digital activities. the size of the data is often seen as the first challenge of characterizing big data storage, but even beyond that is the capability of programs to provide architecture that can not only store but query these massive datasets. one of the most popular models for big data architecture comes from google’s mapreduce concept, which was the basis for apache hadoop, a popular data management solution. data velocity velocityis a problem that flows naturally from the volume characteristics of big data. data velocity is the speed at which data is flowing into a business’ infrastructure and the ability of software solutions to receive and ingest that data quickly. certain types of high-velocity data, such as streaming data, needs to be moved into storage and processed on the fly. this is often referred to as complex event processing (cep). the ability to intercept and analyze data that has a lifespan of milliseconds is a widely sought after. this kind of quick-fire data processing has long been the cornerstone of digital financial transactions, but it is also being used to track live consumer behavior or to bring instant updates to social media feeds. data variety variety refers to the source and type of data that is being collected. this data could be anything from raw image data to sensor readings, audio recordings, social media communication, and metadata. the challenge of data variety is being able to take raw, unstructured data and organize it so that an application can use it. this kind of structure can be achieved through architectural models that traditionally favor relational databases—but there is often a need to tidy up this data before it will even be useful to store in a raw form. sometimes a better option is to use a schema-less, non-relational database. how do you manage big data? the three vs is a great model for getting an initial understanding of what makes big data a challenge for businesses. however, big data is not just about the data itself, but the way that it is handled. a popular way of thinking about these challenges is to look at how a business stores, processes, and accesses their data. · store: can you store the vast amounts of data being collected? · process: can you organize, clean, and analyze the data collected? · access: can you search and query this data in an organized manner? the store, process, and access model is useful for two reasons: it reminds businesses that big data is largely about managing data, and it demonstrates the problem of scale within big data management. “big” is relative. the data batches that challenge some companies could be moved through a single google datacenter in under a minute. the only question a company needs to ask itself is how it will store and access increasingly massive amounts of data for its particular use case. there are several high level approaches that companies have turned to in the last few years. the traditional approach the traditional method for handling most data is to use relational databases. data warehouses are then used to integrate and analyze data from many sources. these databases are structured according to the concept of “early structure binding”—essentially, the database has predetermined “questions” that can be asked based on a schema. relational databases are highly functional, and the goal with this type of data processing is for the database to be fully transactional. although relational databases are the most common persistence type by a large margin (see key findings pg. 4-5), a growing number of use cases are not well-suited for relational schema. relational architectures tend to have difficulty when dealing with the velocity and variety of big data, since their structure is very rigid. when you perform functions such as join on many large data sets, the volume can be a problem as well. instead, businesses are looking to non-relational databases, or a mixture of both types, to meet data demand. the newer approach - mapreduce, hadoop, and nosql databases in the early 2000s, web giant google released two helpful web technologies: google file system (gfs) and mapreduce. both were new and unique approaches to the growing problem of big data, but mapreduce was chief among them, especially when it comes to its role as a major influencer of later solution models. mapreduce is a programming paradigm that allows for low cost data analysis and clustered scale-out processing. mapreduce became the primary architectural influence for the next big thing in big data: the creation of the big data management infrastructure known as hadoop. hadoop’s open source ecosystem and ease of use for handling large-scale data processing operations have secured a large part of the big data marketplace. besides hadoop, there was a host of non-relational (nosql) databases that emerged around 2009 to meet a different set of demands for processing big data. whereas hadoop is used for its massive scalability and parallel processing, nosql databases are especially useful for handling data stored within large multi-structured datasets. this kind of discrete data handling is not traditionally seen as a strong point of relational databases, but it’s also not the same kind of data operations that hadoop is running. the solution for many businesses ends up being a combination of these approaches to data management. finding hidden data insights once you get beyond storage and management, you still have the enormous task of creating actionable business intelligence (bi) from the datasets you’ve collected. this problem of processing and analyzing data is maybe one of the trickiest in the data management lifecycle. the best options for data analytics will favor an approach that is predictive and adaptable to changing data streams. the thing is, there’s so many types of analytic models and different ways of providing infrastructure for this process. your analytics solution should scale, but to what degree? scalability can be an enormous pain in your analytical neck, due to the problem of decreasing performance returns when scaling out an algorithm. ultimately, analytics tools rely on a great deal of reasoning and analysis to extract data patterns and data insights, but this capacity means nothing for a business if they can’t then create actionable intelligence. part of this problem is that many businesses have the infrastructure to accommodate big data, but they aren’t asking questions about what problems they’re going to solve with the data. implementing a big data-ready infrastructure before knowing what questions you want to ask is like putting the cart before the horse. but even if we do know the questions we want to ask, data analysis can always reveal many correlations with no clear causes. as organizations get better at processing and analyzing big data, the next major hurdle will be pinpointing the causes behind the trends by asking the right questions and embracing the complexity of our answers. [1] http://www.quora.com/what-is-big-data 2014 guide to big data this guide explores the meaning of big data, how businesses use it, and uncovers new tools and techniques for the future of big data. this guide includes: detailed profiles on 43 big data vendor solutions in-depth articles written by industry experts results from our survey of 850 it professionals "finding the database for your use case" download now
September 25, 2014
by Benjamin Ball
· 10,660 Views · 1 Like
article thumbnail
JPA tutorial: Mapping Entities – Part 1
In this article I will discuss about the entity mapping procedure in JPA. As for my examples I will use the same schemathat I used in one of my previous articles. In my two previous articles I explained how to set up JPA in a Java SE environment. I do not intend to write the setup procedure for a web application because most of the tutorials on the web do exactly that. So let’s skip over directly to object relational mapping, or entity mapping. Wikipedia defines Object Relational Mapping as follows - Object-relational mapping (ORM, O/RM, and O/R mapping) in computer science is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a “virtual object database” that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to create their own ORM tools. Typically, mapping is the process through which you provide necessary information about your database to your ORM tool. The tool then uses this information to read/write objects into the database. Usually you tell your ORM tool the table name to which an object of a certain type will be saved. You also provide column names to which an object’s properties will be mapped to. Relation between different object types also need to be specified. All of these seem to be a lot of tasks, but fortunately JPA follows what is known as “Convention over Configuration” approach, which means if you adopt to use the default values provided by JPA, you will have to configure very little parts of your applications. In order to properly map a type in JPA, you will at a minimum need to do the following - Mark your class with the @Entity annotation. These classes are called entities. Mark one of the properties/getter methods of the class with the @Id annotation. And that’s it. Your entities are ready to be saved into the database because JPA configures all other aspects of the mapping automatically. This also shows the productivity gain that you can enjoy by using JPA. You do not need to manually populate your objects each time you query the database, saving you from writing lots of boilerplate code. Let’s see an example. Consider the following Address entity which I have mapped according to the above two rules - import javax.persistence.Entity; import javax.persistence.Id; @Entity public class Address { @Id private Integer id; private String street; private String city; private String province; private String country; private String postcode; /** * @return the id */ public Integer getId() { return id; } /** * @param id the id to set */ public Address setId(Integer id) { this.id = id; return this; } /** * @return the street */ public String getStreet() { return street; } /** * @param street the street to set */ public Address setStreet(String street) { this.street = street; return this; } /** * @return the city */ public String getCity() { return city; } /** * @param city the city to set */ public Address setCity(String city) { this.city = city; return this; } /** * @return the province */ public String getProvince() { return province; } /** * @param province the province to set */ public Address setProvince(String province) { this.province = province; return this; } /** * @return the country */ public String getCountry() { return country; } /** * @param country the country to set */ public Address setCountry(String country) { this.country = country; return this; } /** * @return the postcode */ public String getPostcode() { return postcode; } /** * @param postcode the postcode to set */ public Address setPostcode(String postcode) { this.postcode = postcode; return this; } } Now based on your environment, you may or may not add this entity declaration in your persistence.xml file, which I have explained in my previous article. Ok then, let’s save some object! The following code snippet does exactly that - import com.keertimaan.javasamples.jpaexample.entity.Address; import javax.persistence.EntityManager; import com.keertimaan.javasamples.jpaexample.persistenceutil.PersistenceManager; public class Main { public static void main(String[] args) { EntityManager em = PersistenceManager.INSTANCE.getEntityManager(); Address address = new Address().setId(1) .setCity("Dhaka") .setCountry("Bangladesh") .setPostcode("1000") .setStreet("Poribagh"); em.getTransaction() .begin(); em.persist(address); em.getTransaction() .commit(); System.out.println("addess is saved! It has id: " + address.getId()); Address anotherAddress = new Address().setId(2) .setCity("Shinagawa-ku, Tokyo") .setCountry("Japan") .setPostcode("140-0002") .setStreet("Shinagawa Seaside Area"); em.getTransaction() .begin(); em.persist(anotherAddress); em.getTransaction() .commit(); em.close(); System.out.println("anotherAddress is saved! It has id: " + anotherAddress.getId()); PersistenceManager.INSTANCE.close(); } } Let’s take a step back at this point and think what we needed to do if we had used plain JDBC for persistence. We had to manually write the insert queries and map each of the attributes to the corresponding columns for both cases, which would have required a lot of code. An important point to note about the example is the way I am setting the id of the entities. This approach will only work for short examples like this, but for real applications this is not good. You’d typically want to use, say, auto-incremented id columns or database sequences to generate the id values for your entities. For my example, I am using a MySQL database, and all of my id columns are set to auto increment. To reflect this in my entity model, I can use an additional annotation called @GeneratedValue in the id property. This tells JPA that the id value for this entity will be automatically generated by the database during the insert, and it should fetch that id after the insert using a select command. With the above modifications, my entity class becomes something like this - import javax.persistence.Entity; import javax.persistence.Id; import javax.persistence.GeneratedValue; @Entity public class Address { @Id @GeneratedValue private Integer id; // Rest of the class code........ And the insert procedure becomes this - Address anotherAddress = new Address() .setCity("Shinagawa-ku, Tokyo") .setCountry("Japan") .setPostcode("140-0002") .setStreet("Shinagawa Seaside Area"); em.getTransaction() .begin(); em.persist(anotherAddress); em.getTransaction() .commit(); How did JPA figure out which table to use to save Address entities? Turns out, it’s pretty straight-forward - When no explicit table information is provided with the mapping then JPA tries to find a table whose name matches with the entity name. The name of an entity can be explicitly specified by using the “name” attribute of the @Entity annotation. If no name attribute is found, then JPA assumes a default name for an entity. The default name of an entity is the simple name (not fully qualified name) of the entity class, which in our case is Address. So our entity name is then determined to be “Address”. Since our entity name is “Address”, JPA tries to find if there is a table in the database whose name is “Address” (remember, most of the cases database table names are case-insensitive). From our schema, we can see that this is indeed the case. So how did JPA figure our which columns to use to save property values for address entities? At this point I think you will be able to easily guess that. If you cannot, stay tuned for my next post! Until next time. [ Full working code can be found at github.]
September 22, 2014
by MD Sayem Ahmed
· 9,743 Views · 2 Likes
article thumbnail
How to Setup Custom Remote Deployment Repositories for JBoss BPM Suite
In this article we wanted to share another configuration property that can provide surprising help when setting up your JBoss BPM Suite. Previously we outlined a basic set of configuration properties to provide you with a few tricks when installing your own JBoss BRMS or JBoss BPM Suite products. As the JBoss BPM Suite is a super set, including full JBoss BRMS functionality, the rest of this article will refer only to JBoss BPM Suite but apply to both products. In this article we will show you how to modify your JBoss EAP container configuration to point the products at a custom deployment repository by adjusting a single configuration property. Maven repository The default setup is that the products will look for your maven setting in the default settings.xml as found set in theM2_HOME variable or in the users home directory at .m2/settings.xml. The following system property can be added to JBoss EAP standalone.xml configuration file to point to any file containing your custom settings. kie.maven.settings.custom Location of the maven configuration file where it can find it's settings. Default: the M2_HOME/conf/settings.xml or users home directory .m2/settings.xml Example usage in JBoss EAP When initially setting up the product for use on JBoss EAP containers, one can adjust configuration with the help of system properties. Below we show how to configure an installation to point to our custom maven deployment repository by using a custom settings file we will call bpmsuite-settings.xml We hope this helps you with configuring your own custom deployment repositories and enables you to tie into existing continuous integration infrastructures that might exist in your organization.
September 19, 2014
by Eric D. Schabell DZone Core CORE
· 6,201 Views · 1 Like
article thumbnail
MySQL 101: Monitor Disk I/O with pt-diskstats
Originally Written by Muhammad Irfan Here on the Percona Support team we often ask customers to retrieve disk stats to monitor disk IO and to measure block devices iops and latency. There are a number of tools available to monitor IO on Linux. iostat is one of the popular tools and Percona Toolkit, which is free, contains the pt-diskstats tool for this purpose. The pt-diskstats tool is similar to iostat but it’s more interactive and contains extended information. pt-diskstats reports current disk activity and shows the statistics for the last second (which by default is 1 second) and will continue until interrupted. The pt-diskstats tool collects samples of /proc/diskstats. In this post, I will share some examples about how to monitor and check to see if the IO subsystem is performing properly or if any disks are a limiting factor – all this by using the pt-diskstats tool. pt-diskstats output consists on number of columns and in order to interpret pt-diskstats output we need to know what each column represents. rd_s tells about number of reads per second while wr_s represents number of writes per second. rd_rt and wr_rt shows average response time in milliseconds for reads & writes respectively, which is similar to iostat tool output await column but pt-diskstats shows individual response time for reads and writes at disk level. Just a note, modern iostat splits read and write latency out, but most distros don’t have the latest iostat in their systat (or equivalent) package. rd_mrg and wr_mrg are other two important columns in pt-diskstats output. *_mrg is telling us how many of the original operations the IO elevator (disk scheduler) was able to merge to reduce IOPS, so *_mrg is telling us a quite important thing by letting us know that the IO scheduler was able to consolidate many or few operations. If rd_mrg/wr_mrg is high% then the IO workload is sequential on the other hand, If rd_mrg/wr_mrg is a low% then IO workload is all random. Binary logs, redo logs (aka ib_logfile*), undo log and doublewrite buffer all need sequential writes. qtime and stime are last two columns in pt-diskstats output where qtime reflects to time spent in disk scheduler queue i.e. average queue time before sending it to physical device and on the other hand stime is average service time which is time accumulated to process the physical device request. Note, that qtime is not discriminated between reads and writes and you can check if response time is higher for qtime than it signal towards disk scheduler. Also note that service time (stime field and svctm field in in pt-diskstats & iostat output respectively) is not reliable on Linux. If you read the iostat manual you will see it is deprecated. Along with that, there are many other parameters for pt-diskstats – you can found full documentation here. Below is an example of pt-disktats in action. I used the –devices-regex option which prints only device information that matches this Perl regex. $ pt-diskstats --devices-regex=sd --interval 5 #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.1 sda 21.6 22.8 0.5 45% 1.2 29.4 275.5 4.0 1.1 0% 40.0 145.1 65% 158 297.1 155.0 2.1 1.1 sdb 15.0 21.0 0.3 33% 0.1 5.2 0.0 0.0 0.0 0% 0.0 0.0 11% 1 15.0 0.5 4.7 1.1 sdc 5.6 10.0 0.1 0% 0.0 5.2 1.9 6.0 0.0 33% 0.0 2.0 3% 0 7.5 0.4 3.6 1.1 sdd 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 5.0 sda 17.0 14.8 0.2 64% 3.1 66.7 404.9 4.6 1.8 14% 140.9 298.5 100% 111 421.9 277.6 1.9 5.0 sdb 14.0 19.9 0.3 48% 0.1 5.5 0.4 174.0 0.1 98% 0.0 0.0 11% 0 14.4 0.9 2.4 5.0 sdc 3.6 27.1 0.1 61% 0.0 3.5 2.8 5.7 0.0 30% 0.0 2.0 3% 0 6.4 0.7 2.4 5.0 sdd 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 These are the stats from 7200 RPM SATA disks. As you can see, the write-response time is very high and most of that is made up of IO queue time. This shows the problem exactly. The problem is that the IO subsystem is not able to handle the write workload because the amount of writes that are being performed are way beyond what it can handle. It means the disks cannot service every request concurrently. The workload would actually depend a lot on where the hot data is stored and as we can see in this particular case the workload only hits a single disk out of the 4 disks. A single 7.2K RPM disk can only do about 100 random writes per second which is not a lot considering heavy workload. It’s not particularly a hardware issue but a hardware capacity issue. The kind of workload that is present and the amount of writes that are performed per second are not something that the IO subsystem is able to handle in an efficient manner. Mostly writes are generated on this server as can be seen by the disk stats. Let me show you a second example. Here you can see read latency. rd_rt is consistently between 10ms-30ms. It depends on how fast the disks are spinning and the number of disks. To deal with it possible solutions would be to optimize queries to avoid table scans, use memcached where possible, use SSD’s as it can provide good I/O performance with high concurrency. You will find this post useful on SSD’s from our CEO, Peter Zaitsev. #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.0 sdb 33.0 29.1 0.9 0% 1.1 34.7 7.0 10.3 0.1 61% 0.0 0.4 99% 1 40.0 2.2 19.5 1.0 sdb1 0.0 0.0 0.0 0% 0.0 0.0 7.0 10.3 0.1 61% 0.0 0.4 1% 0 7.0 0.0 0.4 1.0 sdb2 33.0 29.1 0.9 0% 1.1 34.7 0.0 0.0 0.0 0% 0.0 0.0 99% 1 33.0 3.5 30.2 1.0 sdb 81.9 28.5 2.3 0% 1.1 14.0 0.0 0.0 0.0 0% 0.0 0.0 99% 1 81.9 2.0 12.0 1.0 sdb1 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 1.0 sdb2 81.9 28.5 2.3 0% 1.1 14.0 0.0 0.0 0.0 0% 0.0 0.0 99% 1 81.9 2.0 12.0 1.0 sdb 50.0 25.7 1.3 0% 1.3 25.1 13.0 11.7 0.1 66% 0.0 0.7 99% 1 63.0 3.4 11.3 1.0 sdb1 25.0 21.3 0.5 0% 0.6 25.2 13.0 11.7 0.1 66% 0.0 0.7 46% 1 38.0 3.2 7.3 1.0 sdb2 25.0 30.1 0.7 0% 0.6 25.0 0.0 0.0 0.0 0% 0.0 0.0 56% 0 25.0 3.6 22.2 From the below diskstats output it seems that IO is saturated between both reads and writes. This can be noticed with high value for columns rd_s and wr_s. In this particular case, consider having disks in either RAID 5 (better for read only workload) or RAID 10 array is good option along with battery-backed write cache (BBWC) as single disk can really be bad for performance when you are IO bound. device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime sdb1 362.0 27.4 9.7 0% 2.7 7.5 525.2 20.2 10.3 35% 6.4 8.0 100% 0 887.2 7.0 0.9 sdb1 439.9 26.5 11.4 0% 3.4 7.7 545.7 20.8 11.1 34% 9.8 11.9 100% 0 985.6 9.6 0.8 sdb1 576.6 26.5 14.9 0% 4.5 7.8 400.2 19.9 7.8 34% 6.7 10.9 100% 0 976.8 8.6 0.8 sdb1 410.8 24.2 9.7 0% 2.9 7.1 403.1 18.3 7.2 34% 10.8 17.7 100% 0 813.9 12.5 1.0 sdb1 378.4 24.6 9.1 0% 2.7 7.3 506.1 16.5 8.2 33% 5.7 7.6 100% 0 884.4 6.6 0.9 sdb1 572.8 26.1 14.6 0% 4.8 8.4 422.6 17.2 7.1 30% 1.7 2.8 100% 0 995.4 4.7 0.8 sdb1 429.2 23.0 9.6 0% 3.2 7.4 511.9 14.5 7.2 31% 1.2 1.7 100% 0 941.2 3.6 0.9 The following example reflects write heavy activity but write-response time is very good, under 1ms, which shows disks are healthy and capable of handling high number of IOPS. #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 1.0 dm-0 530.8 16.0 8.3 0% 0.3 0.5 6124.0 5.1 30.7 0% 1.7 0.3 86% 2 6654.8 0.2 0.1 2.0 dm-0 633.1 16.1 10.0 0% 0.3 0.5 6173.0 6.1 36.6 0% 1.7 0.3 88% 1 6806.1 0.2 0.1 3.0 dm-0 731.8 16.0 11.5 0% 0.4 0.5 6064.2 5.8 34.1 0% 1.9 0.3 90% 2 6795.9 0.2 0.1 4.0 dm-0 711.1 16.0 11.1 0% 0.3 0.5 6448.5 5.4 34.3 0% 1.8 0.3 92% 2 7159.6 0.2 0.1 5.0 dm-0 700.1 16.0 10.9 0% 0.4 0.5 5689.4 5.8 32.2 0% 1.9 0.3 88% 0 6389.5 0.2 0.1 6.0 dm-0 774.1 16.0 12.1 0% 0.3 0.4 6409.5 5.5 34.2 0% 1.7 0.3 86% 0 7183.5 0.2 0.1 7.0 dm-0 849.6 16.0 13.3 0% 0.4 0.5 6151.2 5.4 32.3 0% 1.9 0.3 88% 3 7000.8 0.2 0.1 8.0 dm-0 664.2 16.0 10.4 0% 0.3 0.5 6349.2 5.7 35.1 0% 2.0 0.3 90% 2 7013.4 0.2 0.1 9.0 dm-0 951.0 16.0 14.9 0% 0.4 0.4 5807.0 5.3 29.9 0% 1.8 0.3 90% 3 6758.0 0.2 0.1 10.0 dm-0 742.0 16.0 11.6 0% 0.3 0.5 6461.1 5.1 32.2 0% 1.7 0.3 87% 1 7203.2 0.2 0.1 Let me show you a final example. I used –interval and –iterations parameters for pt-diskstats which tells us to wait for a number of seconds before printing the next disk stats and to limit the number of samples respectively. If you notice, you will see in 3rd iteration high latency (rd_rt, wr_rt) mostly for reads. Also, you can notice a high value for queue time (qtime) and service time (stime) where qtime is related to disk IO scheduler settings. For MySQL database servers we usually recommends noop/deadline instead of default cfq. $ pt-diskstats --interval=20 --iterations=3 #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime 10.4 hda 11.7 4.0 0.0 0% 0.0 1.1 40.7 11.7 0.5 26% 0.1 2.1 10% 0 52.5 0.4 1.5 10.4 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.4 7.0 0.0 43% 0.0 0.1 0% 0 0.4 0.0 0.1 10.4 hda3 0.0 0.0 0.0 0% 0.0 0.0 0.4 107.0 0.0 96% 0.0 0.2 0% 0 0.4 0.0 0.2 10.4 hda5 0.0 0.0 0.0 0% 0.0 0.0 0.7 20.0 0.0 80% 0.0 0.3 0% 0 0.7 0.1 0.2 10.4 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.1 4.0 0.0 0% 0.0 4.0 0% 0 0.1 0.0 4.0 10.4 hda9 11.7 4.0 0.0 0% 0.0 1.1 39.2 10.7 0.4 3% 0.1 2.7 9% 0 50.9 0.5 1.8 10.4 drbd1 11.7 4.0 0.0 0% 0.0 1.1 39.1 10.7 0.4 0% 0.1 2.8 9% 0 50.8 0.5 1.7 20.0 hda 14.6 4.0 0.1 0% 0.0 1.4 39.5 12.3 0.5 26% 0.3 6.4 18% 0 54.1 2.6 2.7 20.0 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.4 9.1 0.0 56% 0.0 42.0 3% 0 0.4 0.0 42.0 20.0 hda3 0.0 0.0 0.0 0% 0.0 0.0 1.5 22.3 0.0 82% 0.0 1.5 0% 0 1.5 1.2 0.3 20.0 hda5 0.0 0.0 0.0 0% 0.0 0.0 1.1 18.9 0.0 79% 0.1 21.4 11% 0 1.1 0.1 21.3 20.0 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.8 10.4 0.0 62% 0.0 1.5 0% 0 0.8 1.3 0.2 20.0 hda9 14.6 4.0 0.1 0% 0.0 1.4 35.8 11.7 0.4 3% 0.2 4.9 18% 0 50.4 0.5 3.5 20.0 drbd1 14.6 4.0 0.1 0% 0.0 1.4 36.4 11.6 0.4 0% 0.2 5.1 17% 0 51.0 0.5 3.4 20.0 hda 0.9 4.0 0.0 0% 0.2 251.9 28.8 61.8 1.7 92% 4.5 13.1 31% 2 29.6 12.8 0.9 20.0 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.6 8.3 0.0 52% 0.1 98.2 6% 0 0.6 48.9 49.3 20.0 hda3 0.0 0.0 0.0 0% 0.0 0.0 2.0 23.2 0.0 83% 0.0 1.4 0% 0 2.0 1.2 0.3 20.0 hda5 0.0 0.0 0.0 0% 0.0 0.0 4.9 249.4 1.2 98% 4.0 13.2 9% 0 4.9 12.9 0.3 20.0 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0 20.0 hda9 0.9 4.0 0.0 0% 0.2 251.9 21.3 24.2 0.5 32% 0.4 12.9 31% 2 22.2 10.2 9.7 20.0 drbd1 0.9 4.0 0.0 0% 0.2 251.9 30.6 17.0 0.5 0% 0.7 24.1 30% 5 31.4 21.0 9.5 You can see the busy column in pt-diskstats output which is the same as the util column in iostat – which points to utilization. Actually, pt-diskstats is quite similar to the iostat tool but pt-diskstats is more interactive and has more information. The busy percentage is only telling us for how long the IO subsystem was busy, but is not indicating capacity. So the only time you care about %busy is when it’s 100% and at the same time latency (await in iostat and rd_rt/wr_rt in diskstats output) increases over -say- 5ms. You can estimate capacity of your IO subsystem and then look at the IOPS being consumed (r/s + w/s columns). Also, the system can process more than one request in parallel (in case of RAID) so %busy can go beyond 100% in pt-diskstats output. If you need to check disk throughput, block device IOPS run the following to capture metrics from your IO subsystem and see if utilization matches other worrisome symptoms. I would suggest capturing disk stats during peak load. Output can be grouped by sample or by disk using the –group-by option. You can use the sysbench benchmark tool for this purpose to measure database server performance. You will find this link useful for sysbench tool details. $ pt-diskstats --group-by=all --iterations=7200 > /tmp/pt-diskstats.out; Conclusion: pt-diskstats is one of the finest tools from Percona Toolkit. By using this tool you can easily spot disk bottlenecks, measure the IO subsystem and identify how much IOPS your drive can handle (i.e. disk capacity).
September 19, 2014
by Peter Zaitsev
· 5,268 Views
article thumbnail
Java - Four Security Vulnerabilities Related Coding Practices to Avoid
This article represents top 4 security vulnerabilities related coding practice to avoid while you are programming with Java language. Recently, I came across few Java projects where these instances were found. Please feel free to comment/suggest if I missed to mention one or more important points. Also, sorry for the typos. Following are the key points described later in this article: Executing a dynamically generated SQL statement Directly writing an Http Parameter to Servlet output Creating an SQL PreparedStatement from dynamic string Array is stored directly Executing a Dynamically Generated SQL Statement This is most common of all. One can find mention of this vulenrability at several places. As a matter of fact, many developers are also aware of this vulnerability, although this is a different thing they end up making mistakes once in a while. In several DAO classes, the instances such as following code were found which could lead to SQL injection attacks. StringBuilder query = new StringBuilder(); query.append( "select * from user u where u.name in (" + namesString + ")" ); try { Connection connection = getConnection(); Statement statement = connection.createStatement(); resultSet = statement.executeQuery(query.toString()); } Instead of above query, one could as well make use of prepared statement such as that demonstrated in the code below. It not only makes code less vulnerable to SQL injection attacks but also makes it more efficient. StringBuilder query = new StringBuilder(); query.append( "select * from user u where u.name in (?)" ); try { Connection connection = getConnection(); PreparedStatement statement = connection.prepareCall(query.toString()); statement.setString( 1, namesString ); resultSet = statement.execute(); } Directly writing an Http Parameter to Servlet Output In Servlet classes, I found instances where the Http request parameter was written as it is, to the output stream, without any validation checks. Following code demonstrate the same: public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { String content = request.getParameter("some_param"); // // .... some code goes here // response.getWriter().print(content); } Note that above code does not persist anything. Code like above may lead to what is called reflected (or non-persistent) cross site scripting (XSS) vulnerability. Reflected XSS occur when an attacker injects browser executable code within a single HTTP response. As it goes by definition (being non-persistent), the injected attack does not get stored within the application; it manifests only users who open a maliciously crafted link or third-party web page. The attack string is included as part of the crafted URI or HTTP parameters, improperly processed by the application, and returned to the victim. You could read greater details on following OWASP page on reflect XSS Creating an SQL PreparedStatement from Dynamic Query String What it essentially means is the fact that although PreparedStatement was used, but the query was generated as a string buffer and not in the way recommended for prepared statement (parametrized). If unchecked, tainted data from a user would create a String where SQL injection could make it behave in unexpected and undesirable manner. One should rather make the query statement parametrized and, use the PreparedStatement appropriately. Take a look at following code to identify the vulnerable code. StringBuilder query = new StringBuilder(); query.append( "select * from user u where u.name in (" + namesString + ")" ); try { Connection connection = getConnection(); PreparedStatement statement = connection.prepareStatement(query.toString()); resultSet = statement.executeQuery(); } Array is Stored Directly Instances of this vulnerability, Array is stored directly, could help the attacker change the objects stored in array outside of program, and the program behave in inconsistent manner as the reference to the array passed to method is held by the caller/invoker. The solution is to make a copy within the object when it gets passed. In this manner, a subsequent modification of the collection won’t affect the array stored within the object. You could read the details on following stackoverflow page. Following code represents the vulnerability: // Note that values is a String array in the code below. // public void setValues(String[] somevalues) { this.values = somevalues; }
September 19, 2014
by Ajitesh Kumar
· 19,439 Views · 1 Like
article thumbnail
A Closer Look at the MySQL ibdata1 Disk Space Issue and Big Tables
A recurring customer issue seen by the Percona Support team involves how to make the ibdata1 file “shrink” within MySQL. I'll show you how to handle big tables.
September 16, 2014
by Peter Zaitsev
· 7,857 Views
article thumbnail
How to Quickly Get Started with Sonar
Jump into Sonar with this tutorial that provides installation instructions for SonarQube and the Code Analyzer, followed by a Java example.
September 15, 2014
by Ajitesh Kumar
· 159,365 Views · 3 Likes
article thumbnail
DB2 CONCAT (Concatenate) Function
The DB2 CONCAT function will combine two separate expressions to form a single string expression.
September 15, 2014
by Drew Harvey
· 140,567 Views
article thumbnail
Python 101: An Intro to Pony ORM
The Pony ORM project is another object relational mapper package for Python. They allow you to query a database using generators. They also have an online ER Diagram Editor that is supposed to help you create a model. They are also one of the only Python packages I’ve seen with a multi-licensing scheme where you can develop using a GNU license or purchase a license for non-open source work. See their website for additional details. In this article, we will spend some time learning the basics of this package. Getting Started Since this project is not included with Python, you will need to download and install it. If you have pip, then you can just do this: pip install pony Otherwise you’ll have to download the source and install it via its setup.py script. Creating the Database We will start out by creating a database to hold some music. We will need two tables: Artist and Album. Let’s get started! import datetime import pony.orm as pny database = pny.Database("sqlite", "music.sqlite", create_db=True) ######################################################################## class Artist(database.Entity): """ Pony ORM model of the Artist table """ name = pny.Required(unicode) albums = pny.Set("Album") ######################################################################## class Album(database.Entity): """ Pony ORM model of album table """ artist = pny.Required(Artist) title = pny.Required(unicode) release_date = pny.Required(datetime.date) publisher = pny.Required(unicode) media_type = pny.Required(unicode) # turn on debug mode pny.sql_debug(True) # map the models to the database # and create the tables, if they don't exist database.generate_mapping(create_tables=True) Pony ORM will create our primary key for us automatically if we don’t specify one. To create a foreign key, all you need to do is pass the model class into a different table, as we did in the Album class. Each Required field takes a Python type. Most of our fields are unicode, with one being a datatime object. Next we turn on debug mode, which will output the SQL that Pony generates when it creates the tables in the last statement. Note that if you run this code multiple times, you won’t recreate the table. Pony will check to see if the tables exist before creating them. If you run the code above, you should see something like this get generated as output: GET CONNECTION FROM THE LOCAL POOL PRAGMA foreign_keys = false BEGIN IMMEDIATE TRANSACTION CREATE TABLE "Artist" ( "id" INTEGER PRIMARY KEY AUTOINCREMENT, "name" TEXT NOT NULL ) CREATE TABLE "Album" ( "id" INTEGER PRIMARY KEY AUTOINCREMENT, "artist" INTEGER NOT NULL REFERENCES "Artist" ("id"), "title" TEXT NOT NULL, "release_date" DATE NOT NULL, "publisher" TEXT NOT NULL, "media_type" TEXT NOT NULL ) CREATE INDEX "idx_album__artist" ON "Album" ("artist") SELECT "Album"."id", "Album"."artist", "Album"."title", "Album"."release_date", "Album"."publisher", "Album"."media_type" FROM "Album" "Album" WHERE 0 = 1 SELECT "Artist"."id", "Artist"."name" FROM "Artist" "Artist" WHERE 0 = 1 COMMIT PRAGMA foreign_keys = true CLOSE CONNECTION Wasn’t that neat? Now we’re ready to learn how to add data to our database. How to Insert / Add Data to Your Tables Pony makes adding data to your tables pretty painless. Let’s take a look at how easy it is: import datetime import pony.orm as pny from models import Album, Artist #---------------------------------------------------------------------- @pny.db_session def add_data(): """""" new_artist = Artist(name=u"Newsboys") bands = [u"MXPX", u"Kutless", u"Thousand Foot Krutch"] for band in bands: artist = Artist(name=band) album = Album(artist=new_artist, title=u"Read All About It", release_date=datetime.date(1988,12,01), publisher=u"Refuge", media_type=u"CD") albums = [{"artist": new_artist, "title": "Hell is for Wimps", "release_date": datetime.date(1990,07,31), "publisher": "Sparrow", "media_type": "CD" }, {"artist": new_artist, "title": "Love Liberty Disco", "release_date": datetime.date(1999,11,16), "publisher": "Sparrow", "media_type": "CD" }, {"artist": new_artist, "title": "Thrive", "release_date": datetime.date(2002,03,26), "publisher": "Sparrow", "media_type": "CD"} ] for album in albums: a = Album(**album) if __name__ == "__main__": add_data() # use db_session as a context manager with pny.db_session: a = Artist(name="Skillet") You will note that we need to use a decorator caled db_session to work with the database. It takes care of opening a connection, committing the data and closing the connection. You can also use it as a context manager, which is demonstrated at the very end of this piece of code. Using Basic Queries to Modify Records with Pony ORM In this section, we will learn how to make some basic queries and modify a few entries in our database. ] import pony.orm as pny from models import Artist, Album with pny.db_session: band = Artist.get(name="Newsboys") print band.name for record in band.albums: print record.title # update a record band_name = Artist.get(name="Kutless") band_name.name = "Beach Boys" Here we use the db_session as a context manager. We make a query to get an artist object from the database and print its name. Then we loop over the artist’s albums that are also contained in the returned object. Finally, we change one of the artist’s names. Let’s try querying the database using a generator: result = pny.select(i.name for i in Artist) result.show() If you run this code, you should see something like the following: i.name -------------------- Newsboys MXPX Beach Boys Thousand Foot Krutch The documentation has several other examples that are worth checking out. Note that Pony also supports using SQL itself via its select_by_sql and get_by_sql methods. How to Delete Records in Pony ORM Deleting records with Pony is also pretty easy. Let’s remove one of the bands from the database: import pony.orm as pny from models import Artist with pny.db_session: band = Artist.get(name="MXPX") band.delete() Once more we use db_session to access the database and commit our changes. We use the band object’s delete method to remove the record. You will need to dig to find out if Pony supports cascading deletes where if you delete the Artist, it will also delete all the Albums that are connected to it. According to the docs, if the field is Required, then cascade is enabled. Wrapping Up Now you know the basics of using the Pony ORM package. I personally think the documentation needs a little work as you have to dig a lot to find some of the functionality that I felt should have been in the tutorials. Overall though, the documentation is still a lot better than most projects. Give it a go and see what you think! Additional Resources Pony ORM’s website Pony documentation SQLAlchemy Tutorial An Intro to peewee
September 12, 2014
by Mike Driscoll
· 8,748 Views
article thumbnail
Introducing BIRT iHub F-Type: Installing on Windows
Originally written by Virgil Dodson Actuate recently released a new, free BIRT server called the BIRT iHub F-Type. It incorporates all the functionality of BIRT iHub and is limited only by the capacity of output it can deliver on a daily basis. It is ideal for departmental and smaller scale applications. When BIRT F-Type reaches its maximum output capacity, additional capacity can be purchased on a subscription based model. Some of the key features of BIRT iHub F-Type that will help improve your BIRT content applications are: Interactivity – Allow end-users to modify and personalize reports, and answer questions themselves. Scheduling – Automate report generation based on rules and calendar, and then notify users. Sharing – Secure document management and distribution that allows users to only access content/data they are entitled to. Excel Emitter – Export as native Excel (not CSV) with formulas/pivot tables/worksheets/charts. Integration – JavaScript API to embed dynamic reports and visualizations in your web app. Downloading BIRT iHub F-Type Before we get started with the installation process, we need to download BIRT iHub F-Type. There are three downloads available: Windows, Linux, and a VMware image. This blog will cover the Windows installation. If you’re installing either of the other types, you’ll find links to guides for them at the bottom of this blog post. Once you click on your chosen download, you’ll be asked to register. If you’ve already registered, click the “Click to Login” button. If not, fill out the short registration form to get started. Next, read and accept the license agreement. Once you’ve done that, click the checkbox, and a link for the download will appear. Click that to start your download. At this point, you should also receive an email with an activation code. Be sure to check your spam folder if you don’t see it in your inbox. Installing BIRT iHub F-Type After the download is complete, launch the executable file named ActuateBIRTiHubFType.exe. A welcome message will appear. Press Next to continue. You must read and accept the license agreement on the next screen. Choose a destination folder for the installation. The default is C:\Actuate\BIRTiHub. If you have existing BIRT designs that depend on a JDBC database driver, you can optionally specify the folder where these drivers are located. Press Next to continue. Once the installation has finished, press Finish to launch the BIRT iHub F-Type. A desktop shortcut is also created that points to the iHub F-Type URL at http://localhost:8700/iportal. The first time you launch the BIRT iHub F-Type, you will need to activate it. Enter the activation code that you should have received in an e-mail. After entering a valid activation code, you should receive a message that the code was accepted and the BIRT iHub F-Type should start initializing services. Once that has completed, you will be presented with the login screen. The default user name is “administrator” and the password is blank for your first log in. You’ll be able to change this after you have logged in. Press “Log In” to continue. The first time you launch the BIRT iHub F-Type, you will be in tutorial mode which will help you get started loading your BIRT content and required resources. You can bypass the tutorial mode at any time by pressing the “Exit Tutorial” button at the top right. Select a BIRT design (*.rptdesign) file and press the Upload button. If you don’t have a BIRT design, you can download a sample from the link on the same page. The BIRT design file is automatically inspected and if there are any dependent files needed, like images, data files, BIRT report libraries, CSS styles, or other linked BIRT designs, you will be asked to upload those files as well. Once your BIRT design and dependent files are uploaded, your BIRT report will be displayed in the BIRT iHub F-Type and is now ready to explore. Thanks for reading. Now, it’s time to unleash the full power of BIRT into your application. If you have any questions or comments, please feel free to use the comments section below or visit the BIRT iHub F-Type forum. -Virgil For more blogs in the “Introducing BIRT iHub F-Type” series, see the list below: Installing iHub F-Type: Linux | VMWare Image
September 12, 2014
by Michael Singer
· 7,010 Views
article thumbnail
Creating a Custom SQL Server VM Image in Azure
Recently I had the opportunity to work on a project were I needed to create a custom SQL Server image for use with Azure VMs. The process was a little more challenging than I initially anticipated. I think this is mostly because I was not familiar with the process of preparing a SQL Server image. Perhaps this isn’t much of a challenge for an experienced SQL Server DBA or IT Pro. For me, it was a great learning experience. Why a Custom SQL Server Image? The Azure VM image gallery already contains a SQL Server image. It’s very easy to create a new SQL Server VM using this image. However, doing so has a few important trade-offs to consider: Unable to fully customize the base install of SQL Server. This is a template/image after all – you get a VM configured the way the image was configured. Unable to use your own SQL Server license. If your company has an Enterprise Agreement (EA) with Microsoft, it’s likely there is already some SQL Server licenses built into that agreement. Depending on the details, it may be significantly cheaper to use the licenses from the EA instead of paying the SQL Server VM image upcharge from Azure. The Basic Steps There are 6 basic steps to creating a custom SQL Server VM image for use in Azure. Provision a new base Windows Server VM Download the SQL Server installation media Run SQL Server setup to prepare an image Configure Windows to complete the installation of SQL Server Capture the image and add it to the Azure VM image gallery Create a new VM instance using the custom SQL Server image The basic idea here is to create a base VM, customize it with a SQL Server image, capture the VM to create an image, and then provision new VMs using that captured VM image. Let’s dive into each of these in a little more detail. Note: the terminology here can be a little confusing. When referring to the VM used to create the template/image, I’ll use the term “base VM”. When referring to the VM created from the base VM, I’ll use the term “VM instance”. 1. Provision a new base Windows Server VM There are multiple ways to create a Windows Server VM in Azure. Creating a VM via the Azure management portal and PowerShell are probably the two most popular options. Be sure to check out this tutorial to learn how to do so via the portal. For the purposes of this post, I’ll do so via PowerShell. $img = Get-AzureVMImage ` | where { ( $_.PublisherName -ilike "Microsoft*" -and $_.ImageFamily -ilike "Windows Server 2012 Datacenter" ) } ` | Sort-Object -Unique -Descending -Property ImageFamily ` | sort -Descending -Property PublishDate ` | select -First(1) $vmConfig = New-AzureVMConfig -Name "sql-1" -InstanceSize Small -ImageName $img.ImageName | Add-AzureProvisioningConfig -Windows -AdminUsername "[admin-username-here]" -Password "[admin-password-here]" New-AzureVM -ServiceName "SQLServerVMTemplate" -VMs $vmConfig -Location "East US" -WaitForBoot 2. Download the SQL Server installation media With the base Windows Server 2012 VM created, we can now get ready to prepare (sysprep) the SQL Server installation. To do that, we need to get the SQL Server installation media onto the machine. The easiest way I found to do this was to leverage Azure blob storage. Upload the SQL Server ISO file to Azure blob storage Remote Desktop (RDP) into the base VM From the VM, download the SQL Server ISO file to the local disk Mount the SQL Server ISO file to the VM Copy the ISO contents (not the ISO file itself) to the VM’s C:\ drive. For example, use C:\sql The SQL Server installation media files need to be copied to the local C: drive so it can be used later to complete the SQL Server installation (when provisioning the actual SQL Server VM instance). 3. Run SQL Server setup to prepare an image In order to prepare the (sysprep’d) SQL Server VM image (which we can use as a template for future VMs), we need to run the SQL Server installation and instruct it topreparean image – not run the full installation. An easy way to do this is with a SQL Server configuration file, an example of which I’ve included below. ConfigurationFile.ini ;SQL Server 2012 Configuration File [OPTIONS] ; Specifies a Setup workflow, like INSTALL, UNINSTALL, or UPGRADE. This is a required parameter. ACTION="PrepareImage" ; Detailed help for command line argument ENU has not been defined yet. ENU="True" ; Parameter that controls the user interface behavior. Valid values are Normal for the full UI, AutoAdvance for a simplified UI, and EnableUIOnServerCore for bypassing Server Core setup GUI block. ;UIMODE="Normal" ; Specifies setup not display any user interface. ;QUIET="False" ; Specifies setup to display progress only, without any user interaction. QUIETSIMPLE="True" ; Specifies whether SQL Server Setup should discover and include product updates. The valid values are True and False or 1 and 0. By default SQL Server Setup will include updates that are found. UpdateEnabled="True" ; Specifies features to install, uninstall, or upgrade. The list of top-level features include SQL, AS, RS, IS, MDS, and Tools. The SQL feature will install the Database Engine, Replication, Full-Text, and Data Quality Services (DQS) server. The Tools feature will install Management Tools, Books online components, SQL Server Data Tools, and other shared components. FEATURES=SQLENGINE ; Specifies the location where SQL Server Setup will obtain product updates. The valid values are "MU" to search Microsoft Update, a valid folder path, a relative path such as .\MyUpdates or a UNC share. By default SQL Server Setup will search Microsoft Update or a Windows Update service through the Window Server Update Services. UpdateSource="MU" ; Displays the command line parameters usage HELP="False" ; Specifies that the detailed Setup log should be piped to the console. INDICATEPROGRESS="False" ; Specifies that Setup should install into WOW64. This command line argument is not supported on an IA64 or a 32-bit system. X86="False" ; Specifies the root installation directory for shared components. This directory remains unchanged after shared components are already installed. INSTALLSHAREDDIR="C:\Program Files\Microsoft SQL Server" ; Specifies the root installation directory for the WOW64 shared components. This directory remains unchanged after WOW64 shared components are already installed. INSTALLSHAREDWOWDIR="C:\Program Files (x86)\Microsoft SQL Server" ; Specifies the Instance ID for the SQL Server features you have specified. SQL Server directory structure, registry structure, and service names will incorporate the instance ID of the SQL Server instance. INSTANCEID="MSSQLSERVER" ; Specifies the installation directory. INSTANCEDIR="C:\Program Files\Microsoft SQL Server" There are two steps in this process: Copy the ConfigurationFile.ini file (from your local PC) to the same location as the SQL Server installation media (i.e.c:\sql) on the base VM. Run SQL Server setup to prepare an image. From a command prompt (on the base VM), navigate to theC:\sqlfolder and then execute the following command: Setup.exe /ConfigurationFile=ConfigurationFile.ini /IAcceptSQLServerLicenseTerms=true 4. Configure Windows to complete the installation of SQL Server At this point the base VM should have an “installation” of SQL Server that is not fully completed. The SQL Server bits are in place, but they’re not configured for a full server install . . . at least not yet. The final configuration of SQL Server will take place when the VM instance (of which this template/image is the base) is provisioned and boots up for the first time. This is accomplished by using a CMD file with the following content: @ECHO OFF && SETLOCAL && SETLOCAL ENABLEDELAYEDEXPANSION && SETLOCAL ENABLEEXTENSIONS REM All commands will be executed during first Virtual Machine boot "C:\Program Files\Microsoft SQL Server\110\Setup Bootstrap\SQLServer2012\setup.exe" /QS /ACTION=CompleteImage /INSTANCEID=MSSQLSERVER /INSTANCENAME=MSSQLSERVER /IACCEPTSQLSERVERLICENSETERMS=1 /SQLSYSADMINACCOUNTS=%COMPUTERNAME%\Administrators /BROWSERSVCSTARTUPTYPE=AUTOMATIC /INDICATEPROGRESS /TCPENABLED=1 /PID="[YOUR-SQL-SERVER-PRODUCT-ID-HERE]" On your local PC, save the file as SetupComplete2.cmd RDP / log into the base VM Copy the SetupComplete2.cmd from your local PC file to the c:\Windows\OEM folder on the base VM Change the value for the SQLSYSADMINACCOUNTS value to be that of the administrative account created on the VM (or better yet – the local Administrators group account) If needed, supply the SQL Server product ID (PID) value. When Windows starts on the new VM instance for the first time, the SetupComplete2.cmd file should automatically run. It is invoked by the SetupComplete.cmd file already on the machine. 5. Capture the image and add it to the Azure VM image gallery At this point a base SQL Server VM has been created and the groundwork laid to complete the install. Now it is time to create the VM image from the base VM, and do to that you sysprep and capture the base VM. Please follow the guide on How to Capture a Windows Virtual Machine to Use as a Template. 6. Create a new VM using the custom SQL Server image With a new custom VM image template available in the VM image gallery, you can provision a new VM instance using that custom template. Upon first boot, the newly provisioned VM should complete the full SQL Server installation as laid out in your SetupComplete2.cmd file. Please follow the guide on How to Create a Custom Virtual Machine for more information on creating the VM from the template. Closing Thoughts One of the quirks I noticed when preparing the base SQL Server image is that it was not possible to prepare the image with SQL Server Management Studio (SSMS). I would have to do the install after the newly provisioned VM instance is created. Not hard, but time consuming (an annoying if doing this on multiple VM instances). I later learned that SQL Server 2012 Cumulative Update 1 does allow for preparing a SQL Server image with SSMS installed. I’ve included a link below that describes the process for creating a SQL Server image with CU1. In the end, this process really is not all that hard. Time consuming? Yes! The worst part (at least for me) was really just understanding how the SQL Server installation and sysprep process works. Once I wrapped my head around that, the process was a lot smoother. Helpful Resources While I was learning how to create a custom SQL Server VM image, the following resources were very helpful: How to: Create a Windows Azure Virtual Machine Operating System Image for Microsoft Dynamics NAV. This MSDN article provided the jumping off point on learning how to install SQL Server by using a sysprep image. Install SQL Server 2012 from the Command Prompt Install SQL Server 2012 Using a Configuration File Install SQL Server 2012 Using SysPrep How to create a slipstream SQL Server 2012 and Cumulative Update 1 image –http://sqlperformance.com/2012/12/system-configuration/sql-2012-slipstream I would like to thank Scott Klein for his assistance in verifying these steps. His help was extremely valuable to ensure I was doing this the right way.
September 10, 2014
by Michael Collier
· 6,474 Views
article thumbnail
How JSF Works and how to Debug it - is Polyglot an Alternative?
JSF is not what we often think it is. It's also a framework that can be somewhat tricky to debug, especially when first encountered. In this post let's go over on why that is and provide some JSF debugging techniques. We will go through the following topics: JSF is not what we often think The difficulties of JSF debugging How to debug JSF systematically How JSF Works - The JSF lifecycle Debugging an Ajax request from browser to server and back Debugging the JSF frontend Javascript code Final thoughts - alternatives? (questions to the reader) JSF is not what we often think JSF looks on first look like an enterprise Java/XML frontend framework, but under the hood it really isn't. It's really a polyglot Java/Javascript framework, where the client Javascript part is non-neglectable and also important to understand it. It also has good support for direct HTML/CSS use. JSF developers are on ocasion already polyglot developers, whose primary language is Java but still need to use ocasionally Javascript. The difficulties of JSF debugging When comparing JSF to GWT and AngularJS in a previous post, I found that the (most often used) approach that the framework takes of abstracting HTML and CSS from the developer behind XML adds to the difficulty of debugging, because it creates an extra level of indirection. A more direct approach of using HTML/CSS directly is also possible, but it seems enterprise Java developers tend to stick to XML in most cases, because it's a more familiar technology. Also another problem is that the client side Javascript part of the framework/libraries is not very well documented, and it's often important to understand what is going on. The only way to debug JSF systematically When first encountering JSF, I first tried to approach it from a Java, XML and documentation only. While I could do a part of the work that way, there where frequent situations where that approach was really not sufficient. The conclusion that I got to is that in order to be able to debug JSF applications effectively, an understanding of the following is needed: HTML CSS Javascript HTTP Chrome Dev Tools, Firebug or equivalent The JSF Lifecycle This might sound surprising to developers that work mostly in Java/XML, but this web-centric approach to debugging JSF is the only way that I managed to tackle many requirements that needed some significant component customization, or to be able to fix certain bugs. Let’s start by understanding the inner workings of JSF, so that we can debug it better. The JSF take on MVC The way JSF approaches MVC is that the whole 3 components reside on the server side: The Model is a tree of plain Java objects The View is a server side template defined in XML that is read to build an in-memory view definition The Controller is a Java servlet, that receives each request and processes them through a series of steps The browser is assumed to be simply a rendering engine for the HTML generated at server side. Ajax is achieved by submitting parts of the page for server processing, and requesting a server to ‘repaint’ only portions of the screen, without navigating away from the page. The JSF Lifecycle Once an HTTP request reaches the backend, it gets caught by the JSF Controller that will then process it. The request goes through a series of phases known as the JSF lifecycle, which is essential to understand how JSF works: Design Goals of the JSF Lifecycle The whole point of the lifecycle is to manage MVC 100% on the server side, using the browser as a rendering platform only. The initial idea was to decouple the rendering platform from the server-side UI component model, in order to allow to replace HTML with alternative markup languages by swapping the Render Response phase. This was in the early 2000's when HTML could be soon replaced by XML-based alternatives (that never came to be), and then HTML5 came along. Also browsers where much more qwirkier than what they are today, and the idea of cross-browser Javascript libraries was not widespread. So let’s go through each phase and see how to debug it if needed, starting in the browser. Let's base ourselves in a simple example that uses an Ajax request. A JSF 2 Hello World Example The following is a minimal JSF 2 page, that receives an input text from the user, sends the text via an Ajax request to the backend and refreshes only an output label: JSF 2.2 Hello World Example The page looks like this: Following one Ajax request - to the server and back Let’s click submit in order to trigger the Ajax request, and use the Chrome Dev Tools Network tab (right click and inspect any element on the page).What goes over the wire? This is what we see in the Form Data section of the request: j_idt8:input: Hello World javax.faces.ViewState: -2798727343674530263:954565149304692491 javax.faces.source: j_idt8:j_idt9 javax.faces.partial.event: click javax.faces.partial.execute: j_idt8:j_idt9 j_idt8:input javax.faces.partial.render: j_idt8:output javax.faces.behavior.event: action javax.faces.partial.ajax:true This request says: The new value of the input field is "Hello World", send me a new value for the output field only, and don't navigate away from this page. Let's see how this can be read from the request. As we can see, the new values of the form are submitted to the server, namely the “Hello World” value. This is the meaning of the several entries: javax.faces.ViewState identifies the view from which the request was made. The request is an Ajax request, as indicated by the flag javax.faces.partial.ajax, The request was triggered by a click as defined in javax.faces.partial.event. But what are those j_ strings ? Those are space separated generated identifiers of HTML elements. For example this is how we can see what is the page element corresponding to j_idt8:input, using the Chrome Dev Tools: There are also 3 extra form parameters that use these identifiers, that are linked to UI components: javax.faces.source: The identifier of the HTML element that originated this request, in this case the Id of the submit button. javax.faces.execute: The list of identifiers of the elements whose values are sent to the server for processing, in this case the input text field. javax.faces.render: The list of identifiers of the sections of the page that are to be ‘repainted', in this case the output field only. But what happens when the request hits the server ? JSF lifecycle - Restore View Phase Once the request reaches the server, the JSF controller will inspect the javax.faces.ViewState and identify to which view it refers. It will then build or restore a Java representation of the view, that is somehow similar to the document definition in the browser side. The view will be attached to the request and used throughout. There is usually little need to debug this phase during application development. JSF Lifecycle - Apply Request Values The JSF Controller will then apply to the view widgets the new values received via the request. The values might be invalid at this point. Each JSF component gets a call to it’s decode method in this phase. This method will retrieve the submitted value for the widget in question from the HTTP request and store it on the widget itself. To debug this, let’s put a breakpoint in the decode method of the HtmlInputText class, to see the value “Hello World”: Notice the conditional breakpoint using the HTML clientId of the field we want. This would allow to quickly debug only the decoding of the component we want, even in a large page with many other similar widgets. Next after decoding is the validation phase. JSF Lifecycle - Process Validations In this phase, validations are applied and if the value is found to be in error (for example a date is invalid), then the request bypasses Invoke Application and goes directly to Render Response phase. To debug this phase, a similar breakpoint can be put on method processValidators, or in the validators themselves if you happen to know which ones or if they are custom. JSF Lifecycle - Update Model In this phase, we know all the submitted values where correct. JSF can now update the view model by applying the new values received in the requests to the plain Java objects in the view model. This phase can be debugged by putting a breakpoint in the processUpdates method of the component in question, eventually using a similar conditional breakpoint to break only on the component needed. JSF Lifecycle - Invoke Application This is the simplest phase to debug. The application now has an updated view model, and some logic can be applied on it. This is where the action listeners defined in the XML view definition (the 'action' properties and the listener tags) are executed. JSF Lifecycle - Render Response This is the phase that I end up debugging the most: why is the value not being displayed as we expect it, etc, it all can be found here. In this phase the view and the new model values will be transformed from Java objects into HTML, CSS and eventually Javascript and sent back over the wire to the browser. This phase can be debugged using breakpoints in the encodeBegin, encodeChildren and encodeEnd methods of the component in question. The components will either render themselves or delegate rendering to aRenderer class. Back in the browser It was a long trip, but we are back where we started! This is how the response generated by JSF looks once received in the browser: -8188482707773604502:6956126859616189525> What the Javascript part of the framework will do is to take the contents of the partial response, update by update. Using the Id of the update, the client side JSF callback will search for a component with that Id, delete it from the document and replace it with the new updated version. In this case, "Hello World" will show up on the label next to the Input text field! And so thats how JSF works under the hood. But what about if we need to debug the Javascript part of the framework? Debugging the JSF Javascript Code The Chrome Dev Tools can help debug the client part. For example let’s say that we want to halt the client when an Ajax request is triggered. We need to go to the sources tab, add an XHR (Ajax) breakpoint and trigger the browser action. The debugger will stop and the call stack can be examined: For some frameworks like Primefaces, the Javascript sources might be minified (non human-readable) because they are optimized for size. To solve this, download the source code of the library and do a non minified build of the jar. There are usually instructions for this, otherwise check the project poms. This will install in your Maven repository a jar with non minified sources for debugging. The UI Debug tag: The ui:debug tag allows to view a lot of debugging information using a keyboard shortcut, see here for further details. Final Thoughts JSF is very popular in the enterprise Java world, and it handles a lot of problems well, specially if the UI designers take into account the possibilities of the widget library being used. The problem is that there are usually feature requests that force us to dig deeper into the widgets internal implementation in order to customize them, and this requires HTML, CSS, Javascript and HTTP plus JSF lifecycle knowledge. Is polyglot an alternative? We can wonder that if developers have to know a fair amount about web technologies in order to be able to debug JSF effectively, then it would be simpler to build enterprise front ends (just the client part) using those technologies directly instead. It's possible that a polyglot approach of a Java backend plus a Javascript-only frontend could be proved effective in a nearby future, specially using some sort of a client side MVC framework like Angular. This would require learning more Javascript, (have a look at Javascript for Java developers post if curious), but this is already often necessary to do custom widget development in JSF anyway. Conclusions and some questions to the reader Thanks for reading, please take a moment to share your thoughts on these matters on the comments bellow: do you believe polyglot development (Java/Javascript) is a viable alternative in general, and in your workplace in particular? Did you find one of the GWT-based frameworks (plain GWT, Vaadin, Errai), or the Play Framework to be easier to use and of better productivity?
September 10, 2014
by Vasco Cavalheiro
· 44,526 Views · 5 Likes
article thumbnail
Spring Batch Tutorial with Spring Boot and Java Configuration
I’ve been working on migrating some batch jobs for Podcastpedia.org to Spring Batch. Before, these jobs were developed in my own kind of way, and I thought it was high time to use a more “standardized” approach. Because I had never used Spring with java configuration before, I thought this were a good opportunity to learn about it, by configuring the Spring Batch jobs in java. And since I am all into trying new things with Spring, why not also throw Spring Boot into the boat… Before you begin with this tutorial I recommend you read first Spring’s Getting started – Creating a Batch Service, because the structure and the code presented here builds on that original. 1. What I’ll build So, as mentioned, in this post I will present Spring Batch in the context of configuring it and developing with it some batch jobs for Podcastpedia.org. Here’s a short description of the two jobs that are currently part of the Podcastpedia-batch project: addNewPodcastJob reads podcast metadata (feed url, identifier, categories etc.) from a flat file transforms (parses and prepares episodes to be inserted with Http Apache Client) the data and in the last step, insert it to the Podcastpedia database and inform the submitter via emailabout it notifyEmailSubscribersJob – people can subscribe to their favorite podcasts on Podcastpedia.orgvia email. For those who did it is checked on a regular basis (DAILY, WEEKLY, MONTHLY) if new episodes are available, and if they are the subscribers are informed via email about those; read from database, expand read data via JPA, re-group it and notify subscriber via email Source code: The source code for this tutorial is available on GitHub – Podcastpedia-batch. Note: Before you start I also highly recommend you read the Domain Language of Batch, so that terms like “Jobs”, “Steps” or “ItemReaders” don’t sound strange to you. 2. What you’ll need A favorite text editor or IDE JDK 1.7 or later Maven 3.0+ 3. Set up the project The project is built with Maven. It uses Spring Boot, which makes it easy to create stand-alone Spring based Applications that you can “just run”. You can learn more about the Spring Boot by visiting theproject’s website. 3.1. Maven build file Because it uses Spring Boot it will have the spring-boot-starter-parent as its parent, and a couple of other spring-boot-starters that will get for us some libraries required in the project: pom.xml of the podcastpedia-batch project 4.0.0 org.podcastpedia.batch podcastpedia-batch 0.1.0 1.1.6.RELEASE 1.7 org.springframework.boot spring-boot-starter-parent 1.1.6.RELEASE org.springframework.boot spring-boot-starter-batch org.springframework.boot spring-boot-starter-data-jpa org.apache.httpcomponents httpclient 4.3.5 org.apache.httpcomponents httpcore 4.3.2 org.apache.velocity velocity 1.7 org.apache.velocity velocity-tools 2.0 org.apache.struts struts-core rome rome 1.0 rome rome-fetcher 1.0 org.jdom jdom 1.1 xerces xercesImpl 2.9.1 mysql mysql-connector-java 5.1.31 org.springframework.boot spring-boot-starter-freemarker org.springframework.boot spring-boot-starter-remote-shell javax.mail mail javax.mail mail 1.4.7 javax.inject javax.inject 1 org.twitter4j twitter4j-core [4.0,) org.springframework.boot spring-boot-starter-test maven-compiler-plugin org.springframework.boot spring-boot-maven-plugin Note: One big advantage of using the spring-boot-starter-parent as the project’s parent is that you only have to upgrade the version of the parent and it will get the “latest” libraries for you. When I started the project spring boot was in version 1.1.3.RELEASE and by the time of finishing to write this post is already at 1.1.6.RELEASE. 3.2. Project directory structure I structured the project in the following way: └── src └── main └── java └── org └── podcastpedia └── batch └── common └── jobs └── addpodcast └── notifysubscribers Note: the org.podcastpedia.batch.jobs package contains sub-packages having specific classes to particular jobs. the org.podcastpedia.batch.jobs.common package contains classes used by all the jobs, like for example the JPA entities that both the current jobs require. 4. Create a batch Job configuration I will start by presenting the Java configuration class for the first batch job: package org.podcastpedia.batch.jobs.addpodcast; import org.podcastpedia.batch.common.configuration.DatabaseAccessConfiguration; import org.podcastpedia.batch.common.listeners.LogProcessListener; import org.podcastpedia.batch.common.listeners.ProtocolListener; import org.podcastpedia.batch.jobs.addpodcast.model.SuggestedPodcast; import org.springframework.batch.core.Job; import org.springframework.batch.core.Step; import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing; import org.springframework.batch.core.configuration.annotation.JobBuilderFactory; import org.springframework.batch.core.configuration.annotation.StepBuilderFactory; import org.springframework.batch.item.ItemProcessor; import org.springframework.batch.item.ItemReader; import org.springframework.batch.item.ItemWriter; import org.springframework.batch.item.file.FlatFileItemReader; import org.springframework.batch.item.file.LineMapper; import org.springframework.batch.item.file.mapping.BeanWrapperFieldSetMapper; import org.springframework.batch.item.file.mapping.DefaultLineMapper; import org.springframework.batch.item.file.transform.DelimitedLineTokenizer; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.context.annotation.Bean; import org.springframework.context.annotation.Configuration; import org.springframework.context.annotation.Import; import org.springframework.core.io.ClassPathResource; import com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException; @Configuration @EnableBatchProcessing @Import({DatabaseAccessConfiguration.class, ServicesConfiguration.class}) public class AddPodcastJobConfiguration { @Autowired private JobBuilderFactory jobs; @Autowired private StepBuilderFactory stepBuilderFactory; // tag::jobstep[] @Bean public Job addNewPodcastJob(){ return jobs.get("addNewPodcastJob") .listener(protocolListener()) .start(step()) .build(); } @Bean public Step step(){ return stepBuilderFactory.get("step") .chunk(1) //important to be one in this case to commit after every line read .reader(reader()) .processor(processor()) .writer(writer()) .listener(logProcessListener()) .faultTolerant() .skipLimit(10) //default is set to 0 .skip(MySQLIntegrityConstraintViolationException.class) .build(); } // end::jobstep[] // tag::readerwriterprocessor[] @Bean public ItemReader reader(){ FlatFileItemReader reader = new FlatFileItemReader(); reader.setLinesToSkip(1);//first line is title definition reader.setResource(new ClassPathResource("suggested-podcasts.txt")); reader.setLineMapper(lineMapper()); return reader; } @Bean public LineMapper lineMapper() { DefaultLineMapper lineMapper = new DefaultLineMapper(); DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer(); lineTokenizer.setDelimiter(";"); lineTokenizer.setStrict(false); lineTokenizer.setNames(new String[]{"FEED_URL", "IDENTIFIER_ON_PODCASTPEDIA", "CATEGORIES", "LANGUAGE", "MEDIA_TYPE", "UPDATE_FREQUENCY", "KEYWORDS", "FB_PAGE", "TWITTER_PAGE", "GPLUS_PAGE", "NAME_SUBMITTER", "EMAIL_SUBMITTER"}); BeanWrapperFieldSetMapper fieldSetMapper = new BeanWrapperFieldSetMapper(); fieldSetMapper.setTargetType(SuggestedPodcast.class); lineMapper.setLineTokenizer(lineTokenizer); lineMapper.setFieldSetMapper(suggestedPodcastFieldSetMapper()); return lineMapper; } @Bean public SuggestedPodcastFieldSetMapper suggestedPodcastFieldSetMapper() { return new SuggestedPodcastFieldSetMapper(); } /** configure the processor related stuff */ @Bean public ItemProcessor processor() { return new SuggestedPodcastItemProcessor(); } @Bean public ItemWriter writer() { return new Writer(); } // end::readerwriterprocessor[] @Bean public ProtocolListener protocolListener(){ return new ProtocolListener(); } @Bean public LogProcessListener logProcessListener(){ return new LogProcessListener(); } } The @EnableBatchProcessing annotation adds many critical beans that support jobs and saves us configuration work. For example you will also be able to @Autowired some useful stuff into your context: a JobRepository (bean name “jobRepository”) a JobLauncher (bean name “jobLauncher”) a JobRegistry (bean name “jobRegistry”) a PlatformTransactionManager (bean name “transactionManager”) a JobBuilderFactory (bean name “jobBuilders”) as a convenience to prevent you from having to inject the job repository into every job, as in the examples above a StepBuilderFactory (bean name “stepBuilders”) as a convenience to prevent you from having to inject the job repository and transaction manager into every step The first part focuses on the actual job configuration: @Bean public Job addNewPodcastJob(){ return jobs.get("addNewPodcastJob") .listener(protocolListener()) .start(step()) .build(); } @Bean public Step step(){ return stepBuilderFactory.get("step") .chunk(1) //important to be one in this case to commit after every line read .reader(reader()) .processor(processor()) .writer(writer()) .listener(logProcessListener()) .faultTolerant() .skipLimit(10) //default is set to 0 .skip(MySQLIntegrityConstraintViolationException.class) .build(); } The first method defines a job and the second one defines a single step. As you’ve read in The Domain Language of Batch, jobs are built from steps, where each step can involve a reader, a processor, and a writer. In the step definition, you define how much data to write at a time (in our case 1 record at a time). Next you specify the reader, processor and writer. 5. Spring Batch processing units Most of the batch processing can be described as reading data, doing some transformation on it and then writing the result out. This mirrors somehow the Extract, Transform, Load (ETL) process, in case you know more about that. Spring Batch provides three key interfaces to help perform bulk reading and writing: ItemReader, ItemProcessor and ItemWriter. 5.1. Readers ItemReader is an abstraction providing the mean to retrieve data from many different types of input: flat files, xml files, database, jms etc., one item at a time. See the Appendix A. List of ItemReaders and ItemWriters for a complete list of available item readers. In the Podcastpedia batch jobs I use the following specialized ItemReaders: 5.1.1. FlatFileItemReader which, as the name implies, reads lines of data from a flat file that typically describe records with fields of data defined by fixed positions in the file or delimited by some special character (e.g. Comma). This type of ItemReader is being used in the first batch job, addNewPodcastJob. The input file used is named suggested-podcasts.in, resides in the classpath (src/main/resources) and looks something like the following: FEED_URL; IDENTIFIER_ON_PODCASTPEDIA; CATEGORIES; LANGUAGE; MEDIA_TYPE; UPDATE_FREQUENCY; KEYWORDS; FB_PAGE; TWITTER_PAGE; GPLUS_PAGE; NAME_SUBMITTER; EMAIL_SUBMITTER http://www.5minutebiographies.com/feed/; 5minutebiographies; people_society, history; en; Audio; WEEKLY; biography, biographies, short biography, short biographies, 5 minute biographies, five minute biographies, 5 minute biography, five minute biography; https://www.facebook.com/5minutebiographies;https://twitter.com/5MinuteBios; ; Adrian Matei; [email protected] http://notanotherpodcast.libsyn.com/rss; NotAnotherPodcast; entertainment; en; Audio; WEEKLY; Comedy, Sports, Cinema, Movies, Pop Culture, Food, Games; https://www.facebook.com/notanotherpodcastusa;https://twitter.com/NAPodcastUSA;https://plus.google.com/u/0/103089891373760354121/posts; Adrian Matei; [email protected] As you can see the first line defines the names of the “columns”, and the following lines contain the actual data (delimited by “;”), that needs translating to domain objects relevant in the context. Let’s see now how to configure the FlatFileItemReader: @Bean public ItemReader reader(){ FlatFileItemReader reader = new FlatFileItemReader(); reader.setLinesToSkip(1);//first line is title definition reader.setResource(new ClassPathResource("suggested-podcasts.in")); reader.setLineMapper(lineMapper()); return reader; } You can specify, among other things, the input resource, the number of lines to skip, and a line mapper. 5.1.1.1. LineMapper The LineMapper is an interface for mapping lines (strings) to domain objects, typically used to map lines read from a file to domain objects on a per line basis. For the Podcastpedia job I used the DefaultLineMapper, which is two-phase implementation consisting of tokenization of the line into a FieldSet followed by mapping to item: @Bean public LineMapper lineMapper() { DefaultLineMapper lineMapper = new DefaultLineMapper(); DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer(); lineTokenizer.setDelimiter(";"); lineTokenizer.setStrict(false); lineTokenizer.setNames(new String[]{"FEED_URL", "IDENTIFIER_ON_PODCASTPEDIA", "CATEGORIES", "LANGUAGE", "MEDIA_TYPE", "UPDATE_FREQUENCY", "KEYWORDS", "FB_PAGE", "TWITTER_PAGE", "GPLUS_PAGE", "NAME_SUBMITTER", "EMAIL_SUBMITTER"}); BeanWrapperFieldSetMapper fieldSetMapper = new BeanWrapperFieldSetMapper(); fieldSetMapper.setTargetType(SuggestedPodcast.class); lineMapper.setLineTokenizer(lineTokenizer); lineMapper.setFieldSetMapper(suggestedPodcastFieldSetMapper()); return lineMapper; } the DelimitedLineTokenizer splits the input String via the “;” delimiter. if you set the strict flag to false then lines with less tokens will be tolerated and padded with empty columns, and lines with more tokens will simply be truncated. the columns names from the first line are set lineTokenizer.setNames(...); and the fieldMapper is set (line 14) Note: The FieldSet is an “interface used by flat file input sources to encapsulate concerns of converting an array of Strings to Java native types. A bit like the role played by ResultSet in JDBC, clients will know the name or position of strongly typed fields that they want to extract.“ 5.1.1.2. FieldSetMapper The FieldSetMapper is an interface that is used to map data obtained from a FieldSet into an object. Here’s my implementation which maps the fieldSet to the SuggestedPodcast domain object that will be further passed to the processor: public class SuggestedPodcastFieldSetMapper implements FieldSetMapper { @Override public SuggestedPodcast mapFieldSet(FieldSet fieldSet) throws BindException { SuggestedPodcast suggestedPodcast = new SuggestedPodcast(); suggestedPodcast.setCategories(fieldSet.readString("CATEGORIES")); suggestedPodcast.setEmail(fieldSet.readString("EMAIL_SUBMITTER")); suggestedPodcast.setName(fieldSet.readString("NAME_SUBMITTER")); suggestedPodcast.setTags(fieldSet.readString("KEYWORDS")); //some of the attributes we can map directly into the Podcast entity that we'll insert later into the database Podcast podcast = new Podcast(); podcast.setUrl(fieldSet.readString("FEED_URL")); podcast.setIdentifier(fieldSet.readString("IDENTIFIER_ON_PODCASTPEDIA")); podcast.setLanguageCode(LanguageCode.valueOf(fieldSet.readString("LANGUAGE"))); podcast.setMediaType(MediaType.valueOf(fieldSet.readString("MEDIA_TYPE"))); podcast.setUpdateFrequency(UpdateFrequency.valueOf(fieldSet.readString("UPDATE_FREQUENCY"))); podcast.setFbPage(fieldSet.readString("FB_PAGE")); podcast.setTwitterPage(fieldSet.readString("TWITTER_PAGE")); podcast.setGplusPage(fieldSet.readString("GPLUS_PAGE")); suggestedPodcast.setPodcast(podcast); return suggestedPodcast; } } 5.2. JdbcCursorItemReader In the second job, notifyEmailSubscribersJob, in the reader, I only read email subscribers from a single database table, but further in the processor a more detailed read(via JPA) is executed to retrieve all the new episodes of the podcasts the user subscribed to. This is a common pattern employed in the batch world. Follow this link for more Common Batch Patterns. For the initial read, I chose the JdbcCursorItemReader, which is a simple reader implementation that opens a JDBC cursor and continually retrieves the next row in the ResultSet: @Bean public ItemReader notifySubscribersReader(){ JdbcCursorItemReader reader = new JdbcCursorItemReader(); String sql = "select * from users where is_email_subscriber is not null"; reader.setSql(sql); reader.setDataSource(dataSource); reader.setRowMapper(rowMapper()); return reader; } Note I had to set the sql, the datasource to read from and a RowMapper. 5.2.1. RowMapper The RowMapper is an interface used by JdbcTemplate for mapping rows of a Result’set on a per-row basis. My implementation of this interface, , performs the actual work of mapping each row to a result object, but I don’t need to worry about exception handling: public class UserRowMapper implements RowMapper { @Override public User mapRow(ResultSet rs, int rowNum) throws SQLException { User user = new User(); user.setEmail(rs.getString("email")); return user; } } 5.2. Writers ItemWriter is an abstraction that represents the output of a Step, one batch or chunk of items at a time. Generally, an item writer has no knowledge of the input it will receive next, only the item that was passed in its current invocation. The writers for the two jobs presented are quite simple. They just use external services to send email notifications and post tweets on Podcastpedia’s account. Here is the implementation of the ItemWriterfor the first job – addNewPodcast: package org.podcastpedia.batch.jobs.addpodcast; import java.util.Date; import java.util.List; import javax.inject.Inject; import javax.persistence.EntityManager; import org.podcastpedia.batch.common.entities.Podcast; import org.podcastpedia.batch.jobs.addpodcast.model.SuggestedPodcast; import org.podcastpedia.batch.jobs.addpodcast.service.EmailNotificationService; import org.podcastpedia.batch.jobs.addpodcast.service.SocialMediaService; import org.springframework.batch.item.ItemWriter; import org.springframework.beans.factory.annotation.Autowired; public class Writer implements ItemWriter{ @Autowired private EntityManager entityManager; @Inject private EmailNotificationService emailNotificationService; @Inject private SocialMediaService socialMediaService; @Override public void write(List items) throws Exception { if(items.get(0) != null){ SuggestedPodcast suggestedPodcast = items.get(0); //first insert the data in the database Podcast podcast = suggestedPodcast.getPodcast(); podcast.setInsertionDate(new Date()); entityManager.persist(podcast); entityManager.flush(); //notify submitter about the insertion and post a twitt about it String url = buildUrlOnPodcastpedia(podcast); emailNotificationService.sendPodcastAdditionConfirmation( suggestedPodcast.getName(), suggestedPodcast.getEmail(), url); if(podcast.getTwitterPage() != null){ socialMediaService.postOnTwitterAboutNewPodcast(podcast, url); } } } private String buildUrlOnPodcastpedia(Podcast podcast) { StringBuffer urlOnPodcastpedia = new StringBuffer( "http://www.podcastpedia.org"); if (podcast.getIdentifier() != null) { urlOnPodcastpedia.append("/" + podcast.getIdentifier()); } else { urlOnPodcastpedia.append("/podcasts/"); urlOnPodcastpedia.append(String.valueOf(podcast.getPodcastId())); urlOnPodcastpedia.append("/" + podcast.getTitleInUrl()); } String url = urlOnPodcastpedia.toString(); return url; } } As you can see there’s nothing special here, except that the write method has to be overriden and this is where the injected external services EmailNotificationService and SocialMediaService are used to inform via email the podcast submitter about the addition to the podcast directory, and if a Twitter page was submitted a tweet will be posted on the Podcastpedia’s wall. You can find detailed explanation on how to send email via Velocity and how to post on Twitter from Java in the following posts: How to compose html emails in Java with Spring and Velocity How to post to Twittter from Java with Twitter4J in 10 minutes 5.3. Processors ItemProcessor is an abstraction that represents the business processing of an item. While theItemReader reads one item, and the ItemWriter writes them, the ItemProcessor provides access to transform or apply other business processing. When using your own Processors you have to implement the ItemProcessor interface, with its only method O process(I item) throws Exception, returning a potentially modified or a new item for continued processing. If the returned result is null, it is assumed that processing of the item should not continue. While the processor of the first job requires a little bit of more logic, because I have to set the etag andlast-modified header attributes, the feed attributes, episodes, categories and keywords of the podcast: public class SuggestedPodcastItemProcessor implements ItemProcessor { private static final int TIMEOUT = 10; @Autowired ReadDao readDao; @Autowired PodcastAndEpisodeAttributesService podcastAndEpisodeAttributesService; @Autowired private PoolingHttpClientConnectionManager poolingHttpClientConnectionManager; @Autowired private SyndFeedService syndFeedService; /** * Method used to build the categories, tags and episodes of the podcast */ @Override public SuggestedPodcast process(SuggestedPodcast item) throws Exception { if(isPodcastAlreadyInTheDirectory(item.getPodcast().getUrl())) { return null; } String[] categories = item.getCategories().trim().split("\\s*,\\s*"); item.getPodcast().setAvailability(org.apache.http.HttpStatus.SC_OK); //set etag and last modified attributes for the podcast setHeaderFieldAttributes(item.getPodcast()); //set the other attributes of the podcast from the feed podcastAndEpisodeAttributesService.setPodcastFeedAttributes(item.getPodcast()); //set the categories List categoriesByNames = readDao.findCategoriesByNames(categories); item.getPodcast().setCategories(categoriesByNames); //set the tags setTagsForPodcast(item); //build the episodes setEpisodesForPodcast(item.getPodcast()); return item; } ...... } the processor from the second job uses the ‘Driving Query’ approach, where I expand the data retrieved from the Reader with another “JPA-read” and I group the items on podcasts with episodes so that it looks nice in the emails that I am sending out to subscribers: @Scope("step") public class NotifySubscribersItemProcessor implements ItemProcessor { @Autowired EntityManager em; @Value("#{jobParameters[updateFrequency]}") String updateFrequency; @Override public User process(User item) throws Exception { String sqlInnerJoinEpisodes = "select e from User u JOIN u.podcasts p JOIN p.episodes e WHERE u.email=?1 AND p.updateFrequency=?2 AND" + " e.isNew IS NOT NULL AND e.availability=200 ORDER BY e.podcast.podcastId ASC, e.publicationDate ASC"; TypedQuery queryInnerJoinepisodes = em.createQuery(sqlInnerJoinEpisodes, Episode.class); queryInnerJoinepisodes.setParameter(1, item.getEmail()); queryInnerJoinepisodes.setParameter(2, UpdateFrequency.valueOf(updateFrequency)); List newEpisodes = queryInnerJoinepisodes.getResultList(); return regroupPodcastsWithEpisodes(item, newEpisodes); } ....... } Note: If you’d like to find out more how to use the Apache Http Client, to get the etag and last-modifiedheaders, you can have a look at my post – How to use the new Apache Http Client to make a HEAD request 6. Execute the batch application Batch processing can be embedded in web applications and WAR files, but I chose in the beginning the simpler approach that creates a standalone application, that can be started by the Java main() method: package org.podcastpedia.batch; //imports ...; @ComponentScan @EnableAutoConfiguration public class Application { private static final String NEW_EPISODES_NOTIFICATION_JOB = "newEpisodesNotificationJob"; private static final String ADD_NEW_PODCAST_JOB = "addNewPodcastJob"; public static void main(String[] args) throws BeansException, JobExecutionAlreadyRunningException, JobRestartException, JobInstanceAlreadyCompleteException, JobParametersInvalidException, InterruptedException { Log log = LogFactory.getLog(Application.class); SpringApplication app = new SpringApplication(Application.class); app.setWebEnvironment(false); ConfigurableApplicationContext ctx= app.run(args); JobLauncher jobLauncher = ctx.getBean(JobLauncher.class); if(ADD_NEW_PODCAST_JOB.equals(args[0])){ //addNewPodcastJob Job addNewPodcastJob = ctx.getBean(ADD_NEW_PODCAST_JOB, Job.class); JobParameters jobParameters = new JobParametersBuilder() .addDate("date", new Date()) .toJobParameters(); JobExecution jobExecution = jobLauncher.run(addNewPodcastJob, jobParameters); BatchStatus batchStatus = jobExecution.getStatus(); while(batchStatus.isRunning()){ log.info("*********** Still running.... **************"); Thread.sleep(1000); } log.info(String.format("*********** Exit status: %s", jobExecution.getExitStatus().getExitCode())); JobInstance jobInstance = jobExecution.getJobInstance(); log.info(String.format("********* Name of the job %s", jobInstance.getJobName())); log.info(String.format("*********** job instance Id: %d", jobInstance.getId())); System.exit(0); } else if(NEW_EPISODES_NOTIFICATION_JOB.equals(args[0])){ JobParameters jobParameters = new JobParametersBuilder() .addDate("date", new Date()) .addString("updateFrequency", args[1]) .toJobParameters(); jobLauncher.run(ctx.getBean(NEW_EPISODES_NOTIFICATION_JOB, Job.class), jobParameters); } else { throw new IllegalArgumentException("Please provide a valid Job name as first application parameter"); } System.exit(0); } } The best explanation for SpringApplication-, @ComponentScan- and @EnableAutoConfiguration-magic you get from the source – Getting Started – Creating a Batch Service: “The main() method defers to the SpringApplication helper class, providing Application.class as an argument to its run() method. This tells Spring to read the annotation metadata from Application and to manage it as a component in the Spring application context. The @ComponentScan annotation tells Spring to search recursively through theorg.podcastpedia.batchpackage and its children for classes marked directly or indirectly with Spring’s @Component annotation. This directive ensures that Spring finds and registers BatchConfiguration, because it is marked with @Configuration, which in turn is a kind of @Component annotation. The @EnableAutoConfiguration annotation switches on reasonable default behaviors based on the content of your classpath. For example, it looks for any class that implements the CommandLineRunner interface and invokes its run() method.” Execution construction steps: the JobLauncher, which is a simple interface for controlling jobs, is retrieved from the ApplicationContext. Remember this is automatically made available via the@EnableBatchProcessing annotation. now based on the first parameter of the application (args[0]), I will retrieve the correspondingJob from the ApplicationContext then the JobParameters are prepared, where I use the current date - .addDate("date", new Date()), so that the job executions are always unique. once everything is in place, the job can be executed: JobExecution jobExecution = jobLauncher.run(addNewPodcastJob, jobParameters); you can use the returned jobExecution to gain access to BatchStatus, exit code, or job name and id. Note: I highly recommend you read and understand the Meta-Data Schema for Spring Batch. It will also help you better understand the Spring Batch Domain objects. 6.1. Running the application on dev and prod environments To be able to run the Spring Batch / Spring Boot application on different environments I make use of the Spring Profiles capability. By default the application runs with development data (database). But if I want the job to use the production database I have to do the following: provide the following environment argument -Dspring.profiles.active=prod have the production database properties configured in the application-prod.properties file in the classpath, right besides the default application.properties file Summary In this tutorial we’ve learned how to configure a Spring Batch project with Spring Boot and Java configuration, how to use some of the most common readers in batch processing, how to configure some simple jobs, and how to start Spring Batch jobs from a main method. Note: As I mentioned, I am fairly new to Spring Batch, and especially to Spring Boot and Spring Configuration with Java, so if you see any potential for improvement (code, job design etc.) please make a pull request or leave a comment below. Thanks a lot.
September 9, 2014
by Adrian Matei
· 146,310 Views · 7 Likes
article thumbnail
mysqld_multi: How to Run Multiple Instances of MySQL
Originally written by Fernando Laudares The need to have multiple instances of MySQL (the well-known mysqld process) running in the same server concurrently in a transparent way, instead of having them executed in separate containers/virtual machines, is not very common. Yet from time to time the Percona Support team receives a request from a customer to assist in the configuration of such an environment. MySQL provides a tool to facilitate the execution of multiple instances called mysqld_multi: “mysqld_multi is designed to manage several mysqld processes that listen for connections on different Unix socket files and TCP/IP ports. It can start or stop servers, or report their current status.” For tests and development purposes, MySQL Sandbox might be more practical and I personally prefer to use it for my own tests. Both tools work around launching and managing multiple mysqld processes but Sandbox has, as the name suggests, a “sandbox” approach, making it easy to both create and dispose a new instance (including all data inside it). It is more usual to see mysqld_multi being used in production servers: It’s provided with the server package and uses the same single configuration file that people are used to look for when setting up MySQL. So, how does it work? How do we configure and manage the instances? And as importantly, how do we backup all the instances we create? Understanding the concept of groups in my.cnf You may have noticed already that MySQL’s main configuration file (or “option file“), my.cnf, is arranged under what is called group structures: Sections defining configuration options specific to a given program or purpose. Usually, the program itself gives name to the group, which appears enclosed by brackets. Here’s a basic my.cnf showing three such groups: [client] port= 3306 socket= /var/run/mysqld/mysqld.sock user = john password = p455w0rd [mysqld] user= mysql pid-file= /var/run/mysqld/mysqld.pid socket= /var/run/mysqld/mysqld.sock port= 3306 datadir= /var/lib/mysql [xtrabackup] target_dir = /backups/mysql/ The options defined in the group [client] above are used by the mysql command-line tool. As such, if you don’t specify any other option when executing mysql it will attempt to connect to the local MySQL server through the socket in /var/run/mysqld/mysqld.sock and using the credentials stated in that group. Similarly, mysqld will look for the options defined under its section at startup, and the same happens with Percona XtraBackup when you run a backup with that tool. However, the operating parameters defined by the above groups may also be stated as command-line options during the execution of the program, in which case they they replace the ones defined in my.cnf. Getting started with multiple instances To have multiple instances of MySQL running we must replace the [mysqld] group in the my.cnf configuration file by as many [mysqlN] groups as we want instances running, with “N” being a positive integer, also called option group number. This number is used by mysqld_multi to identify each instance, so it must be unique across the server. Apart from the distinct group name, the same options that are valid for [mysqld] applies on [mysqldN] groups, the difference being that while stating them is optional for [mysqld] (it’s possible to start MySQL with an empty my.cnf as default values are used if not explicitly provided) some of them (like socket, port, pid-file, and datadir) are mandatory when defining multiple instances – so they don’t step on each other’s feet. Here’s a simple modified my.cnf showing the original [mysqld] group plus two other instances: [mysqld] user= mysql pid-file= /var/run/mysqld/mysqld.pid socket= /var/run/mysqld/mysqld.sock port= 3306 datadir= /var/lib/mysql [mysqld1] user= mysql pid-file= /var/run/mysqld/mysqld1.pid socket= /var/run/mysqld/mysqld1.sock port= 3307 datadir= /data/mysql/mysql1 [mysqld7] user= mysql pid-file= /var/run/mysqld/mysqld7.pid socket= /var/run/mysqld/mysqld7.sock port= 3308 datadir= /data/mysql/mysql7 Besides using different pid files, ports and sockets for the new instances I’ve also defined a different datadir for each – it’s very important that the instances do not share the same datadir. Chances are you’re importing the data from a backup but if that’s not the case you can simply use mysql_install_db to create each additional datadir (but make sure the parent directory exists and that the mysql user has write access on it): mysql_install_db --user=mysql --datadir=/data/mysql/mysql7 Note that if /data/mysql/mysql7 doesn’t exist and you start this instance anyway then myqld_multi will call mysqld_install_db itself to have the datadir created and the system tables installed inside it. Alternatively from restoring a backup or having a new datadir created you can make a physical copy of the existing one from the main instance – just make sure to stop it first with a clean shutdown, so any pending changes are flushed to disk first. Now, you may have noted I wrote above that you need to replace your original MySQL instance group ([mysqld]) by one with an option group number ([mysqlN]). That’s not entirely true, as they can co-exist in harmony. However, the usual start/stop script used to manage MySQL won’t work with the additional instances, nor mysqld_multi really manages [mysqld]. The simple solution here is to have the group [mysqld] renamed with a suffix integer, say [mysqld0] (you don’t need to make any changes to it’s current options though), and let mysqld_multi manage all instances. Two commands you might find useful when configuring multiple instances are: $ mysqld_multi --example …which provides an example of a my.cnf file configured with multiple instances and showing the use of different options, and: $ my_print_defaults --defaults-file=/etc/my.cnf mysqld7 …which shows how a given group (“mysqld7″ in the example above) was defined within my.cnf. Managing multiple instances mysqld_multi allows you to start, stop, reload (which is effectively a restart) and report the current status of a given instance, all instances or a subset of them. The most important observation here is that the “stop” action is managed through mysqladmin – and internally that happens on an individual basis, with one “mysqladmin … stop” call per instance, even if you have mysqld_multi stop all of them. For this to work properly you need to setup a MySQL account with the SHUTDOWN privilege and defined with the same user name and password in all instances. Yes, it will work out of the box if you run mysqld_multi as root in a freshly installed server where the root user can access MySQL passwordless in all instances. But as the manual suggests, it’s better to have an specific account created for this purpose: mysql> GRANT SHUTDOWN ON *.* TO 'multi_admin'@'localhost' IDENTIFIED BY 'multipass'; mysql> FLUSH PRIVILEGES; If you plan on replicating the datadir of the main server across your other instances you can have that account created before you make copies of it, otherwise you just need to connect to each instance and create a similar account (remember, the privileged account is only needed by mysqld_multi to stop the instances, not to start them). There’s a special group that can be used on my.cnf to define options for mysqld_multi, which should be used to store these credentials. You might also indicate in there the path for the mysqladmin and mysqld (or mysqld_safe) binaries to use, though you might have a specific mysqld binary defined for each instance inside it’s respective group. Here’s one example: [mysqld_multi] mysqld = /usr/bin/mysqld_safe mysqladmin = /usr/bin/mysqladmin user = multi_admin password = multipass You can use mysqld_multi to start, stop, restart or report the status of a particular instance, all instances or a subset of them. Here’s a few examples that speak for themselves: $ mysqld_multi report Reporting MySQL (Percona Server) servers MySQL (Percona Server) from group: mysqld0 is not running MySQL (Percona Server) from group: mysqld1 is not running MySQL (Percona Server) from group: mysqld7 is not running $ mysqld_multi start $ mysqld_multi report Reporting MySQL (Percona Server) servers MySQL (Percona Server) from group: mysqld0 is running MySQL (Percona Server) from group: mysqld1 is running MySQL (Percona Server) from group: mysqld7 is running $ mysqld_multi stop 7,0 $ mysqld_multi report 7 Reporting MySQL (Percona Server) servers MySQL (Percona Server) from group: mysqld7 is not running $ mysqld_multi report Reporting MySQL (Percona Server) servers MySQL (Percona Server) from group: mysqld0 is not running MySQL (Percona Server) from group: mysqld1 is running MySQL (Percona Server) from group: mysqld7 is not running Managing the MySQL daemon What is missing here is an init script to automate the start/stop of all instances upon server initialization/shutdown; now that we use mysqld_multi to control the instances, the usual /etc/init.d/mysql won’t work anymore. But a similar startup script (though much simpler and less robust) relying on mysqld_multi is provided alongside MySQL/Percona Server, which can be found in /usr/share//mysqld_multi.server. You can simply copy it over as /etc/init.d/mysql, effectively replacing the original script while maintaining it’s name. Please note: You may need to edit it first and modify the first two lines defining “basedir” and “bindir” as this script was not designed to find out the good working values for these variables itself, which the original single-instance /etc/init.d/mysql does. Considering you probably have mysqld_multi installed in /usr/bin, setting these variables as follows is enough: basedir=/usr bindir=/usr/bin Configuring an instance with a different version of MySQL If you’re planning to have multiple instances of MySQL running concurrently chances are you want to use a mix of different versions for each of them, such as during a development cycle to test an application compatibility. This is a common use for mysqld_multi, and simple enough to achieve. To showcase its use I downloaded the latest version of MySQL 5.6 available and extracted the TAR file in /opt: $ tar -zxvf mysql-5.6.20-linux-glibc2.5-x86_64.tar.gz -C /opt Then I made a cold copy of the datadir from one of the existing instances to /data/mysql/mysqld574: $ mysqld_multi stop 0 $ cp -r /data/mysql/mysql1 /data/mysql/mysql5620 $ chown mysql:mysql -R /data/mysql/mysql5620 and added a new group to my.cnf as follows: [mysqld5620] user = mysql pid-file = /var/run/mysqld/mysqld5620.pid socket = /var/run/mysqld/mysqld5620.sock port = 3309 datadir = /data/mysql/mysql5620 basedir = /opt/mysql-5.6.20-linux-glibc2.5-x86_64 mysqld = /opt/mysql-5.6.20-linux-glibc2.5-x86_64/bin/mysqld_safe Note the use of basedir, pointing to the path were the binaries for MySQL 5.6.20 were extracted, as well as an specific mysqld to be used with this instance. If you have made a copy of the datadir from an instance running a previous version of MySQL/Percona Server you will need to consider the same approach use when upgrading and run mysql_upgrade. * I did try to use the latest experimental release of MySQL 5.7 (mysql-5.7.4-m14-linux-glibc2.5-x86_64.tar.gz) but it crashed with: *** glibc detected *** bin/mysqld: double free or corruption (!prev): 0x0000000003627650 *** Using the conventional tools to start and stop an instance Even though mysqld_multi makes things easier to control in general let’s not forget it is a wrapper; you can still rely (though not always, as shown below) on the conventional tools directly to start and stop an instance: mysqld* and mysqladmin. Just make sure to use the parameter –defaults-group-suffix to identify which instance you want to start: mysqld --defaults-group-suffix=5620 and –socket to indicate the one you want to stop: $mysqladmin -S /var/run/mysqld/mysqld5620.sock shutdown * However, mysqld won’t work to start an instance if you have redefined the option ‘mysqld’ on the configuration group, as I did for [mysqld5620] above, stating: [ERROR] mysqld: unknown variable 'mysqld=/opt/mysql-5.6.20-linux-glibc2.5-x86_64/bin/mysqld_safe' I’ve tested using “ledir” to indicate the path to the directory containing the binaries for MySQL 5.6.20 instead of “mysqld” but it also failed with a similar error. If nothing else, that shows you need to stick with mysqld_multi when starting instances in a mixed-version environment. Backups The backup of multiple instances must be done in an individual basis, like you would if each instance was located in a different server. You just need to provide the appropriate parameters to identify the instance you’re targeting. For example, we can simply use socket with mysqldump when running it locally: $ mysqldump --socket=/var/run/mysqld/mysqld7.sock --all-databases > mysqld7.sql In Percona XtraBackup there’s an option named –defaults-group that should be used in environments running multiple instances to indicate which one you want to backup : $ innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld7 --socket=/var/run/mysqld/mysqld7.sock /root/Backup/ Yes, you also need to provide a path to the socket (when running the command locally), even though that information is already available in “–defaults-group=mysqld7″; as it turns out, only the Percona XtraBackup tool (which is called by innobackupex during the backup process) makes use of the information available in the group option. You may need to provide credentials as well (“–user” & “–password”), and don’t forget you’ll need to prepare the backup afterwards. The option “defaults-group” is not available in all versions of Percona XtraBackup so make sure to use the latest one. Summary Running multiple instances of MySQL concurrently in the same server transparently and without any contextualization or a virtualization layer is possible with both mysqld_multi and MySQL Sandbox. We have been using the later at Percona Support to quickly spin on new disposable instances (though you might as easily keep them running indefinitely). In this post though I’ve looked at mysqld_multi, which is provided with MySQL server and remains the official solution for providing an environment with multiple instances. The key aspect when configuring multiple instances in my.cnf is the notion of group name option, as you replace a single [mysqld] section by as many [mysqldN] sections as you want instances running. It’s important though to pay attention to certain details when defining the options for each one of these groups, specially when mixing instances from different MySQL/Percona Server versions. Differently from MySQL Sandbox, where each instance relies on it’s own configuration file, you should be careful each time you edit the shared my.cnf file as a syntax error when configuring a single group option will prevent all instances from starting upon the server’s (re)initialization. I hope to have covered the major points about mysqld_multi here but feel free to leave us a note below if you have something else to add or any comment to contribute.
September 8, 2014
by Peter Zaitsev
· 25,735 Views · 1 Like
article thumbnail
JPA Tutorial: Setting up Persistence Configuration for Java SE Environment
Here's how to create a persistence configuration in the Java SE environment.
September 4, 2014
by MD Sayem Ahmed
· 103,208 Views · 4 Likes
article thumbnail
"How Thin is Thin?" An Example of Effective Story Slicing
Graphene is pure carbon in the form of a very thin, nearly transparent sheet, one atom thick. It is remarkably strong for its very low weight and it conducts heat and electricity with great efficiency. Wikipedia If you have spent any time at all working in an Agile software development environment, you've heard the mantra to split your Stories as thin as you possibly can while still delivering value. This is indeed great advice, but the term "thin" is relative - our notion of what thin means is anchored by our previous experience! To help communicate what I mean when I say that Stories should be thinly sliced, I'm going to provide examples from a recent client who was building a relatively standard system for entering orders from their wholesale customers. To start, here's a very simplified model of the entities we'll be working with (using GML - the Galactic Modeling Language). In plain language, the system creates Orders, which have a Customer and from 1 to some number of Products. For initial planning, the team (consisting of developers, QA person, business analyst and product owner) used Jeff Patton's Story Mapping technique to determine what work the salespeople did at a relatively high level. Not surprisingly, the group determined that creating an Order was the most fundamental part of the system! So we started with a simple sticky note with the title Create Order. Note that there wasn't any other descriptive text, no acceptance criteria, no indication of performance requirements, or even mention of how someone using the system would actually create the order. At this point, we simply didn't need all that - we just needed a placeholder for the conversations that were required for the action of creating orders. Everyone in the group knew that an order would need a customer and some number of products. When I asked what the first Story should be, reminding them that it should be very, very thin, the first suggestion was to have the user select a customer from a list, select products from a list, and save the order. I asked if it was possible to make that story even thinner, and received puzzled looks. I followed with another question - why did we need to select a customer at this point, or even products for that matter? More puzzled looks. Couldn't we just hard-code a customer ID and a single product SKU? More puzzled looks. Another question - if we presented a simple page with a button that when pressed created the order in our database with a hard-coded customer ID and a single, hard-coded product SKU, what percentage of the overall system infrastructure would be fleshed out. That was the point at which people started to understand. The response I received was, "Almost all of it... we need to render a page, capture the submit action, write the order to an Orders table, the product to the Order Lines table and the customer & product to the Customers and Products tables respectively". My response was, "Great!! But it's still a little too thick." More puzzled looks... how could that story be made any thinner? I asked, "Why do we need tables for Customers and Products right now?" More puzzled looks, followed by a sheepish, "Um, because...". The need to create those tables was an assumption at that point. Everyone in the room, myself included, knew that the database would have Customers and Products tables at some point, but we just didn't need them right now. All that was required from a persistence perspective was the Orders and Order Lines tables. Someone asked, "How do we know that the Customer ID and Product SKU are valid?" My response was, "We don't! And at this point, we don't care!" This is another way to split stories - assume the happy path where everything is valid, and build validation and error-handling later. So, this very thin story became, The acceptance criteria for the story were very simple as well: When you navigate to the page index.html, it should be displayed with a Submit button When the Submit button is clicked, the application creates one row in the Order table with a Customer ID of XXX, and one row in the Order Line table with a Product ID of YYY. Note that I could have gone even further and argued that we didn't even need the database yet. We could have simply used a flat file and broken all the rules of relational data modeling by putting all the Order and Order Line data in a single line in the file. Why? BECAUSE WE DIDN'T NEED IT YET! However, in the interest of establishing the overall architecture early, we decided to proceed with the database. It did add some complexity to the story, but it was on the order of hours rather than days or weeks. We also "gold-plated" the story to an extent by providing a rudimentary confirmation page that was displayed after submitting the Order. After the group created this one, simple, thin story with hard-coded values, the next steps were to gradually replace the hard-coding with the selections and searches that you would expect in this type of application. Subsequent stories were, You can now see the progression where the creation of an Order is gradually built out using very thin slices of functionality. Also note that the last two stories, which "wire up" the selection of Customers and Products to an external web service, could be implemented in either order - there is nothing in the story that indicates a dependency on anything else. Splitting Techniques In these stories we used several common splitting techniques: Hard coding values; Simple interface; Defer validation; Zero, then One, then Many (with the Products); Defer complexity. This is only a small handful of the available techniques, but it gave the group a powerful way to work in very, very small increments while still delivering some value with each one. We Can't Ship That!! The moment I suggested the first story with all the hard-coding, I was challenged by the group. Why would you want to build something that isn't production-worthy? The answer is that some aspects of the feature aren't production worthy when you look at the story in isolation. However, the whole concept of a Minimal Viable Product or Minimal Viable Feature is each consists of 1 to many smaller slices (stories in our case) that have been built together to provide enough value to warrant the cost of being put into production. The Product Owner or Product Manager is the person who decides when enough value has accrued to meet that threshold. The Benefits of Thin Slicing I've provide a number of examples from a real-world project but haven't really dug into the reasons why this is a good approach. There are a number of them: You are deferring design decisions as long as possible which gives you more options when it does come time to implement a Story; Having options also allows the Product Owner/Manager to defer business decisions as long as possible, providing the ability to make changes to the product that better suit the actual needs of the business; Thin slices (in conjunction with acceptance criteria) help prevent assumptions which can lead to bloated Stories that are seemingly never done, Agile's version of scope creep; Thin slices are much easier to verify since they are only providing a small increment of functionality beyond what has already been implemented; Thin slices allow a product to be reviewed sooner by both the Product Owner/Manager and users/stakeholders, such that their feedback can be incorporated sooner in the process; Thin slices allow the Product Owner/Manager decides to change or drop a Story without the burden of the Sunk Cost Fallacy. While these are all great reasons to split Stories into thin slices, here's the one that my experience suggests is by far the most important: With Stories sliced as thinly as possible, they all become nearly the same size in terms of effort. Therefore, no estimation of individual Stories is required and a team's capacity can simply be derived by counting the number of Stories completed for some unit of time. That means no haggling over whether a Story is a 3 or 5 on the modified Fibonacci series many teams use for sizing Stories. It means that the need for spending any time estimating individual Stories is no longer required, an activity that is downright painful for some groups. The group from which these examples came was in the situation where there was a hard release date to be met. This stories gave the Product Owner/Manager the ability to easily tweak the what was and wasn't going to be in that initial release and, in several cases which features could be released partially completed. If a group is in the situation where they are being asked for an estimate of when some work could be completed, my experience has also been that Story Mapping coupled with a technique called Blink Estimation provides a high-level estimate of the overall work that's much more grounded in reality. Once a group begins working on thinly-sliced Stories, you begin to have data points with respect to how many they can complete per some unit of time. Those real numbers can then be used to make better business decisions, and deliver a product without the pressure of Crunch Time. Conclusion The group with whom I worked for this example system never spent any time estimating individual Stories. Ever. There were some instances where a Story was larger than it first appeared, but those were either balanced out by Stories that were very small, or by further splitting. Like the graphene example at the beginning of the post, thin stories have remarkable properties far beyond the fact that they are just "thin". The value gained by learning how to split Stories effectively is enormous owing to the flexibility it provides by deferring decisions as long as possible and the removal of the need to estimate at a granular level. Splitting Stories this thin isn't really a specialized skill - it's something that can be learned by anyone in any domain. If you need help learning this technique, let's have a chat.
August 27, 2014
by Dave Rooney
· 35,776 Views
article thumbnail
Setting up Java Applications to Communicate with MongoDB, Kerberos and SSL
By Alex Komyagin, Technical Services Engineer at MongoDB Setting up Kerberos authentication and SSL encryption in a MongoDB Java application is not as simple as other languages. In this post, I’m going to show you how to create a Kerberos and SSL enabled Java application that communicates with MongoDB. My original setup consists of the following: 1) KDC server: kdc.mongotest.com kerberos config file (/etc/krb5.conf): [logging] default = FILE:/var/log/krb5libs.log kdc = FILE:/var/log/krb5kdc.log admin_server = FILE:/var/log/kadmind.log [libdefaults] default_realm = MONGOTEST.COM dns_lookup_realm = false dns_lookup_kdc = false ticket_lifetime = 24h renew_lifetime = 7d forwardable = true [realms] MONGOTEST.COM = { kdc = kdc.mongotest.com admin_server = kdc.mongotest.com } [domain_realm] .mongotest.com = MONGOTEST.COM mongotest.com = MONGOTEST.COM KDC has the following principals: [email protected] - user principle (for java app) mongodb/[email protected] - service principle (for mongodb server) 2) MongoDB server: rhel64.mongotest.com MongoDB version: 2.6.0 MongoDB config file: dbpath= logpath= fork=true auth = true setParameter = authenticationMechanisms=GSSAPI sslOnNormalPorts = true sslPEMKeyFile = /etc/ssl/mongodb.pem This server also has the global environment variable $KRB5_KTNAME set to the keytab file exported from KDC. Application user is configured in the admin database like this: { "_id" : "[email protected]", "user" : "[email protected]", "db" : "$external", "credentials" : { "external" : true }, "roles" : [ { "role" : "readWrite", "db" : "test" } ] } Download the Java driver: wget http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.12.1/mongo-java-driver-2.12.1.jar Install java and jdk: sudo yum install java-1.7.0 sudo yum install java-1.7.0-devel Create a certificate store for Java and store the server certificate there, so that Java knows who it should trust: keytool -importcert -file mongodb.crt -alias mongoCert -keystore firstTrustStore (mongodb.crt is just a public certificate part of mongodb.pem) Copy kerberos config file to the application server: /etc/krb5.conf or ““C:\WINDOWS\krb5.ini“` (otherwise you’ll have to specify kdc and realm as Java runtime options) Use kinit to store the principal password on the application server: kinit [email protected] As an alternative to kinit, you can use JAAS to cache kerberos credentials. Compile and run the Java program javac -cp ../mongo-java-driver-2.12.1.jar SSLApp.java java -cp .:../mongo-java-driver-2.12.1.jar -Djavax.net.ssl.trustStore=firstTrustStore -Djavax.net.ssl.trustStorePassword=changeme -Djavax.security.auth.useSubjectCredsOnly=false SSLApp It is important to specify useSubjectCredsOnly=false, otherwise you’ll get the “No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)” exception from Java. As we discovered, this is not strictly necessary in all cases, but it is if you are relying on kinit to get the service ticket. The Java driver needs to construct MongoDB service principal name in order to request the Kerberos ticket. The service principal is constructed based on the server name you provide (unless you explicitly asked to canonicalize server name). For example, if I change rhel64.mongotest.com to the host IP address in the connection URI, I would be getting Kerberos exceptions No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - UNKNOWN_SERVER)]. So be sure you specify the same server host name as you used in the Kerberos principal (). Adding -Dsun.security.krb5.debug=true to Java runtime options helps a lot in debugging kerberos auth issues. These steps should help simplify the process of connecting Java applications with SSL. Before deploying any application with MongoDB, be sure to read through our Security Checklist which outlines recommended security measures to protect your MongoDB installation. More information on configuring MongoDB Security can be found in the MongoDB Manual. For further questions, feel free to reach out to the MongoDB team through google-groups.
August 26, 2014
by Francesca Krihely
· 8,271 Views
article thumbnail
Microservices and PaaS (Part II)
[This article was written by John Wetherill.] This is a continuation of the Microservices and PaaS - Part I blog post I wrote last week, which was an attempt to distil the wealth of information presented at the microservices meetup hosted by Cisco, with Adrian Cockcroft and others presenting. Part I provided a brief background on microservices, with a summary of some lessons learned by microservices pioneers. In this installment I will cover a number of practices related to microservices that were discussed during the meetup. A followup article will dive into the advantages that Platform as a Service brings to microservice development. Microservices Practices I'm calling these "Microservice Practices," not "Microservices Best Practices" because microservices-based architectures are still evolving, with new practices, techniques, tools, and patterns emerging constantly. At the meetup a number of practices were highlighted that Netflix and other microservices pioneers have spearheaded in their efforts to adopt a microservices mentality across their organizations. Break Things Deliberately According to Netflix: "We have found that the best defense against major unexpected failures is to fail often." Netflix has brought us "Chaos Monkey" which is a powerful tool the sole purpose of which is to break things, often and randomly. They use this tool continuously on their production systems to bring down essential services, to ensure that doing so doesn't disrupt the user experience or their overall service. It's much better to deliberately break the system in the middle of the morning when all teams are assembled and sufficient caffeine has been consumed, than to be informed of a breakage by a page at 3am. No Manual "Anything" In a world where microservices come and go, grow and shrink, and migrate around racks and data centers in seconds - there's absolutely no room for manual intervention. All aspects of deployment, monitoring, testing, and recovery must be fully automated. For example, monitoring a service should occur instantly and automatically by virtue of it being deployed, not requiring a separate manual step. Similarly failure discovery and rerouting to old code, as described in Part I of this blog, must be fully automated, no human intervention required. Respect Human Attention Span Speaking of humans, a typical human's attention span, say when filling out a shopping cart, is around 10 seconds. If a failure occurs when deploying an updated shopping cart microservice, it's important that the time between the failure, reporting, and rerouting to existing, working code is kept under around this 10 second range. Obviously this shouldn't happen too often, but the occasional 10 second gap in response will probably not lose the customer. A five minute, or 5 hour lag, resulting from manual intervention and rollback, will. Denormalize like Crazy Refactor database schemas, and de-normalize everything, to allow complete separation and partitioning of data. That is, do not use underlying tables that serve multiple microservices. There should be no sharing of underlying tables that span multiple microservices, and no sharing of data. Instead, if several services need access to the same data, it should be shared via a service API (such as a published REST or a message service interface). Polyglot Persistence Each microservice can have its own persistence layer. Gone are the days of a single monolithic database instance that's shared across all parts of an application. Databases are getting cheaper and easier. As an example, Neo4J allows you to embed an industry-strength self-contained graph database in your microservice at the cost of a few megabytes in a jarfile, with startup time on the order of milliseconds. That's essentially free. Even better, any PaaS worth its salt will provide multiple database services that can be spawned and accessed at the drop of a hat. With technology like this at our disposal, it makes sense to use the persistence layer that fits, both to the problem being solved, and to the expertise - and passions - of the team that's solving the problem. Avoid Trunk Conflicts The old mindset had all code for a large project contained in a single source repository. This can be slightly easier to setup and manage, but it ties the microservices together and makes it much more difficult to evolve them independently. Instead each microservice should have its own scm repository so it can truly be updated and enhanced independent of other services. One Service, One Manifest Each microservice must have its own manifest and dependencies, instead of maintaining a global dependency list for all services. This allows, for example, one microservice to depend on Spring v3.2, while another can require Spring 4.1. The dependencies for one microservice can change over time with no effect on the dependencies of other microservices. Contain Everything All microservices should run in a container, such as Tomcat, Docker, or in whatever container system is provided by the PaaS (you are running a PaaS aren't you?). Do not run microservices on bare metal, or directly on a VM. Containerization brings countless advantages, particularly a consistent, isolated runtime environment that can easily migrate around the datacenter or around the globe. With Docker and other modern containerization approaches, there is very little overhead in running in a container, and considerable upside. No State Do not build stateful services. Instead, maintain state in a dedicated persistence service, or elsewhere. This is a well-known practice brought to us by the cloud. When an application instance maintains state, it can't easily be moved, scaling is more complex, and it's more likely to cause problems when it fails. This practice applies even more to microservices which in general should be light-weight, instantly replaceable on failure, and should be able to hop around data-centers. Don't Name your Chickens People who raise chickens soon learn that naming chickens is a bad idea: after naming a chicken you get attached to it, at least the kids do, and it can be uncomfortable to have to explain at the dinner table that the chicken pot pie is really "Molly." Instead, number your chickens, so you can say "that was chicken #38" or even better, "that was chicken 586ec9bd." Makes for a much more enjoyable meal. The same can be said of computer systems. Do not name systems after planets, or animals, or philosophers, or prisons, as was common practice in the UNIX world for decades. Instead, assign them guid's, and don't attach any sort of significance to them, like assigning them specific roles or purposes. Systems should be commodities, like McDonalds Franchises. Each McDonalds is eerily similar, with the advantage that if one shuts down you can just walk an extra few blocks and be served the exact same burger at the same price in the same amount of time. Create and Curate Access Libraries Microservices are accessed by externally published APIs or protocols. This allows the microservice implementation to completely change with no effect on its consumers, as long as the API remains constant. But just publishing an API is not enough. The microservice provider should also be responsible for building and stewarding client libraries used to access the service. If this is not done, the construction of these libraries will be left to third parties, and will likely result in fragmentation where various implementations might have slight differences, or implementors may incorrectly interpret the spec and introduce inconsistencies which then stick. Optimize the Interaction One downside of a microservices architecture is the "fanout" problem where a single request to the overall application results in 10 or 20 requests bubbling throughout the various microservices the application relies on. This dramatic increase in network traffic calls for more optimal communication between microservices. Instead of transmitting the standard text/html REST content type, consider using something like Google Protocol Buffers, Simple Binary Encoding, or Apache Thrift, to decrease the size of the payload and optimize the inter-microservice communications. Release the Monkeys Netflix has released what they call the "Simian Army," a suite of tools including Chaos Monkey, mentioned above, whose purpose is to help an organization build resilient, scalable, fault-tolerant software. The suite includes such tools as Janitor Monkey, to reclaim unused resources, Security Monkey which looks for security vulnerabilities, Latency Monkey, which induces artificial delays in the REST layer to scare out latency issues, and many more. As Phil described last week in his blog Devops: Tools vs. Culture, most organizations don't have the resources or luxury of being able to build their own toolsets when evolving to a microservices and devops culture. Instead they must leverage existing tools, and fortunately lots of tools are constantly appearing. It's worth spending the effort searching and researching these tools, and incorporating them into your overall development process when they make sense. To be continued... Again I originally intended to cover last week's microservices meetup in a single blog post, which then expanded to two. I have yet to address the power of PaaS in microservices architectures, and I'm out of space already. So I will continue this Microservices and PaaS theme next week, finally getting into PaaS, and discuss how Platform as a Service can significantly streamline the microservices development process.
August 26, 2014
by John Wetherill
· 9,652 Views
article thumbnail
How to Setup Realtime Analytics over Logs with ELK Stack
Once we know something, we find it hard to imagine what it was like not to know it. - Chip & Dan Heath, Authors of Made to Stick, Switch Update: I have recently published a book on ELK stack titled - Learning ELK Stack , more details can be found here. What is the ELK stack ? The ELK stack is ElasticSearch, Logstash and Kibana. These three provide a fully working real-time data analytics tool for getting wonderful information sitting on your data. ElasticSearch ElasticSearch,built on top of Apache Lucene, is a search engine with focus on real-time analysis of the data, and is based on the RESTful architecture. It provides standard full text search functionality and powerful search based on query. ElasticSearch is document-oriented/based and you can store everything you want as JSON. This makes it powerful, simple and flexible. Logstash Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use.In ELK Stack logstash plays an important role in shipping the log and indexing them later which can be supplied to Elastic Search. Kibana Kibana is a user friendly way to view, search and visualize your log data, which will present the data stored from Logstash into ElasticSearch, in a very customizable interface with histogram and other panels which provides real-time analysis and search of data you have parsed into ElasticSearch. How Do I Get It ? http://www.elasticsearch.org/overview/elkdownloads/ How Do They Work Together ? Logstash is essentially a pipelining tool. In a basic, centralized installation a logstash agent, known as the shipper, will read input from one to many input sources and output that text wrapped in a JSON message to a broker. Typically Redis, the broker, caches the messages until another logstash agent, known as the collector, picks them up, and sends them to another output. In the common example this output is Elasticsearch, where the messages will be indexed and stored for searching. The Elasticsearch store is accessed via the Kibana web application which allows you to visualize and search through the logs. The entire system is scalable. Many different shippers may be running on many different hosts, watching log files and shipping the messages off to a cluster of brokers. Then many collectors can be reading those messages and writing them to an Elasticsearch cluster. (E)lasticSearch (L)ogstash (K)ibana (The ELK Stack) How Do I Fetch Useful Information Out of Logs? Fetching useful information from logs is one of the most important part of this stack and is being done in logstash using its grok filters and a set of input , filter and output plugins which helps to scale this functionality for taking various kinds of inputs ( file,tcp, udp, gemfire, stdin, unix, web sockets and even IRC and twitter and many more) , filter them using (groks,grep,date filters etc.) and finally write ouput to ElasticSearch,redis,email,HTTP,MongoDB,Gemfire , Jira , Google Cloud Storage etc. A Bit More About Log Stash Filters Transforming the logs as they go through the pipeline is possible as well using filters. Either on the shipper or collector, whichever suits your needs better. As an example, an Apache HTTP log entry can have each element (request, response code, response size, etc) parsed out into individual fields so they can be searched on more seamlessly. Information can be dropped if it isn’t important. Sensitive data can be masked. Messages can be tagged. The list goes on. e.g. input { file { path => ["var/log/apache.log"] type => "saurzcode_apache_logs" } } filter { grok { match => ["message","%{COMBINEDAPACHELOG}"] } } output{ stdout{} } Above example takes input from an apache log file applies a grok filter with %{COMBINEDAPACHELOG}, which will index apache logs information on fields and finally output to Standard Output Console. Writing Grok Filters Writing grok filters and fetching information is the only task that requires some serious efforts and if done properly will give you great insights in to your data like Number of Transations performed over time, Which type of products have most hits etc. Below links will help you a lot in writing grok filters and test them with ease - Grok Debugger http://grokdebug.herokuapp.com/ Grok Patterns Lookup https://github.com/elasticsearch/logstash/tree/v1.4.2/patterns References http://www.elasticsearch.org/overview/ http://logstash.net/ http://rashidkpc.github.io/Kibana/about.html
August 26, 2014
by Saurabh Chhajed
· 47,847 Views · 4 Likes
  • Previous
  • ...
  • 496
  • 497
  • 498
  • 499
  • 500
  • 501
  • 502
  • 503
  • 504
  • 505
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×