Data Engineering Resources

The Latest Data Engineering Topics

This year I’ve been demonstrating how easy it is to create modern web apps using AngularJS, Java and MongoDB. I also use Groovy during this demo to do the sorts of things Groovy is really good at: writing descriptive tests and creating scripts. Due to the time pressures in the demo, I never really get a chance to go into the details of the script I use, so the aim of this long-overdue blog post is to go over this Groovy script in a bit more detail.

Firstly I want to clarify that this is not my original work - I stole (or rather, borrowed) most of the ideas for the demo from my colleague Ross Lawley. In his blog post he goes into detail of how he built up an application that finds the most popular pub names in the UK. There’s a section in there where he talks about downloading the open street map data and using Python to convert the XML into something more MongoDB-friendly - it’s this process that I basically stole, re-worked for coffee shops, and re-wrote for the JVM.

I’m assuming if you’ve worked with Java for any period of time, there has come a moment where you needed to use it to parse XML. Since my demo is supposed to be all about how easy it is to work with Java, I did not want to do this. When I wrote the demo I wasn’t really all that familiar with Groovy, but what I did know was that it has built-in support for parsing and manipulating XML, which is exactly what I wanted to do. In addition, creating Maps (the data structures, not the geographical ones) with Groovy is really easy, and this is effectively what we need to insert into MongoDB.

Goal Of The Script

Parse an XML file containing open street map data of all coffee shops.
Extract latitude and longitude XML attributes and transform into MongoDB GeoJSON.
Perform some basic validation on the coffee shop data from the XML.
Insert into MongoDB.
Make sure MongoDB knows this contains query-able geolocation data.

The script is PopulateDatabase.groovy; that link will take you to the version I presented at JavaOne.

Firstly, We Need Data

I used the same service Ross used in his blog post to obtain the XML file containing “all” coffee shops around the world. Now, the open street map data is somewhat… raw and unstructured (which is why MongoDB is such a great tool for storing it), so I’m not sure I really have all the coffee shops, but I obtained enough data for an interesting demo using http://www.overpass-api.de/api/xapi?*[amenity=cafe][cuisine=coffee_shop]

The resulting XML file is in the github project, but if you try this yourself you might (in fact, probably will) get different results.

Each XML record describes one coffee shop, which has a unique identifier and a latitude and longitude as attributes of a node element. Within this node is a series of tag elements, all with k and v attributes. Each coffee shop has a varying number of these tags, and they are not consistent from shop to shop (other than amenity and cuisine, which we used to select this data).

Initialisation

Before doing anything else we want to prepare the database. The assumption of this script is that the collection we want to store the coffee shops in is either empty or full of stale data. So we’re going to use the MongoDB Java Driver to get the collection that we’re interested in, and then drop it.

There are two interesting things to note here:

This Groovy script is simply using the basic Java driver. Groovy can talk quite happily to vanilla Java; it doesn’t need to use a Groovy library. There are Groovy-specific libraries for talking to MongoDB (e.g. the MongoDB GORM Plugin), but the Java driver works perfectly well.
You don’t need to create databases or collections (collections are a bit like tables, but less structured) explicitly in MongoDB. You simply use the database and collection you’re interested in, and if they don’t already exist, the server will create them for you.

In this example, we’re just using the default constructor for the MongoClient, the class that represents the connection to the database server(s). This default is localhost:27017, which is where I happen to be running the database. However, you can specify your own address and port - for more details on this see Getting Started With MongoDB and Java.

Turn The XML Into Something MongoDB-Shaped

So next we’re going to use Groovy’s XmlSlurper to read the open street map XML data that we talked about earlier. To iterate over every node we use: xmlSlurper.node.each. For those of you who are new to Groovy or new to Java 8, you might notice this is using a closure to define the behaviour to apply for every “node” element in the XML.

Create GeoJSON

Since MongoDB documents are effectively just maps of key-value pairs, we’re going to create a Map coffeeShop that contains the document structure that represents the coffee shop that we want to save into the database. Firstly, we initialise this map with the attributes of the node. Remember these attributes are the node’s id, lat and lon. We’re going to save the ID as a value for a new field called openStreetMapId. We need to do something a bit more complicated with the latitude and longitude, since we need to store them as GeoJSON, which looks something like:

{ 'location' : { 'coordinates': [<longitude>, <latitude>], 'type' : 'Point' } }

In the script we create a Map that looks like the GeoJSON, pulling the lat and lon attributes into the appropriate places. Note that GeoJSON lists the longitude first.

Insert Remaining Fields

Now for every tag element in the XML, we get the k attribute and check if it’s a valid field name for MongoDB (it won’t let us insert fields with a dot in, and we don’t want to override our carefully constructed location field). If so, we simply add this key as the field and the matching v attribute as the value into the map.
This effectively copies the OpenStreetMap key/value data into key/value pairs in the MongoDB document so we don’t lose any data, but we also don’t do anything particularly interesting to transform it.

Save Into MongoDB

Finally, once we’ve created a simple coffeeShop Map representing the document we want to save into MongoDB, we insert it into MongoDB if the map has a field called name. We could have checked this when we were reading the XML and putting it into the map, but it’s actually much easier just to use the pretty Groovy syntax to check for a key called name in coffeeShop.

When we want to insert the Map we need to turn this into a BasicDBObject, the Java Driver’s document type, but this is easily done by calling the constructor that takes a Map. Alternatively, there’s a Groovy syntax which would effectively do the same thing, which you might prefer:

collection.insert(coffeeShop as BasicDBObject)

Tell MongoDB That We Want To Perform Geo Queries On This Data

Because we’re going to do a nearSphere query on this data, we need to add a “2dsphere” index on our location field. We created the location field as GeoJSON, so all we need to do is call createIndex for this field.

Conclusion

So that’s it! Groovy is a nice tool for this sort of script-y thing - not only is it a scripting language, but its built-in support for XML, really nice Map syntax and support for closures make it the perfect tool for iterating over XML data and transforming it into something that can be inserted into a MongoDB collection.
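The transformation this post describes (node attributes into an openStreetMapId and a GeoJSON location, tags copied across when the key is a usable field name, and only named shops kept) can be sketched in plain Java. This is a minimal sketch and not the PopulateDatabase.groovy script itself: it assumes the JDK's DOM parser in place of XmlSlurper, leaves out the MongoDB insert and index creation, and the class name, method names and sample XML are all made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: not part of the original demo project.
final class CoffeeShopDocuments {

    /** A tag key is usable if it has no dot and does not clash with our location field. */
    static boolean isValidFieldName(String key) {
        return !key.contains(".") && !key.equals("location");
    }

    /** Builds the map-shaped document for one <node> element. */
    static Map<String, Object> toDocument(Element node) {
        Map<String, Object> coffeeShop = new LinkedHashMap<>();
        coffeeShop.put("openStreetMapId", node.getAttribute("id"));

        // GeoJSON point: coordinates are [longitude, latitude], in that order.
        Map<String, Object> location = new LinkedHashMap<>();
        location.put("coordinates", Arrays.asList(
                Double.valueOf(node.getAttribute("lon")),
                Double.valueOf(node.getAttribute("lat"))));
        location.put("type", "Point");
        coffeeShop.put("location", location);

        // Copy every <tag k="..." v="..."/> whose key is a usable field name.
        NodeList tags = node.getElementsByTagName("tag");
        for (int i = 0; i < tags.getLength(); i++) {
            Element tag = (Element) tags.item(i);
            String key = tag.getAttribute("k");
            if (isValidFieldName(key)) {
                coffeeShop.put(key, tag.getAttribute("v"));
            }
        }
        return coffeeShop;
    }

    /** Parses the XML and returns only the shops that have a name. */
    static List<Map<String, Object>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("node");
            List<Map<String, Object>> shops = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Map<String, Object> shop = toDocument((Element) nodes.item(i));
                if (shop.containsKey("name")) {
                    shops.add(shop);
                }
            }
            return shops;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data in the shape the post describes.
        String xml = "<osm>"
                + "<node id=\"1\" lat=\"51.5\" lon=\"-0.1\">"
                + "<tag k=\"amenity\" v=\"cafe\"/>"
                + "<tag k=\"name\" v=\"Example Coffee\"/>"
                + "</node>"
                + "<node id=\"2\" lat=\"48.8\" lon=\"2.3\"><tag k=\"amenity\" v=\"cafe\"/></node>"
                + "</osm>";
        System.out.println(parse(xml));
    }
}
```

Each resulting map could then be wrapped in a BasicDBObject and inserted as the post describes. The amount of DOM boilerplate here is a fair illustration of why the post reaches for Groovy instead.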

October 8, 2014

by Trisha Gee

· 10,287 Views

How to Allow Only HTTPS on an S3 Bucket

It is possible to disable HTTP access on an S3 bucket, limiting S3 traffic to only HTTPS requests. The relevant information is scattered around the AWS documentation, but the solution is actually straightforward. All you need to do to block HTTP traffic on an S3 bucket is add a Condition in your bucket's policy. AWS supports a global condition key, aws:SecureTransport, for checking whether a request was sent over SSL. So you can add a condition like this:

"Condition": { "Bool": { "aws:SecureTransport": "true" } }

Here's a complete example:

{
  "Version": "2008-10-17",
  "Id": "some_policy",
  "Statement": [
    {
      "Sid": "AddPerm",
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my_bucket/*",
      "Condition": { "Bool": { "aws:SecureTransport": "true" } }
    }
  ]
}

Now accessing the contents of my_bucket over HTTP will produce a 403 error, while using HTTPS will work fine.

October 8, 2014

by Matt Butcher

· 17,750 Views

What is Write Concern in MongoDB?

In MongoDB there are multiple guarantee levels available for reporting the success of a write operation, called write concerns. The strength of the write concern determines the level of guarantee. A weak write concern offers better performance at the cost of a weaker guarantee, while a strong write concern offers a stronger guarantee because clients wait for the write operation to be confirmed.

MongoDB provides different levels of write concern to better address the specific needs of applications. Clients may adjust write concern to ensure that the most important operations persist successfully to an entire MongoDB deployment. For other, less critical operations, clients can adjust the write concern to favour faster performance over persistence to the entire deployment.

Write Concern Levels

MongoDB has the following levels of conceptual write concern, listed from weakest to strongest:

Unacknowledged

With an unacknowledged write concern, MongoDB does not acknowledge the receipt of write operations. Unacknowledged is similar to errors ignored; however, drivers will attempt to receive and handle network errors when possible. The driver’s ability to detect network errors depends on the system’s networking configuration.

Acknowledged

With a receipt acknowledged write concern, the mongod confirms the receipt of the write operation. Acknowledged write concern allows clients to catch network, duplicate key, and other errors. This is the default write concern.

Journaled

With a journaled write concern, MongoDB acknowledges the write operation only after committing the data to the journal. This write concern ensures that MongoDB can recover the data following a shutdown or power interruption. You must have journaling enabled to use this write concern.

Replica Acknowledged

Replica sets present additional considerations with regard to write concern. The default write concern only requires acknowledgement from the primary.
With replica acknowledged write concern, you can guarantee that the write operation propagates to additional members of the replica set. For example, a write to a replica set with a write concern of w:2 must be acknowledged by the primary and at least one secondary.

October 7, 2014

by Rishav Rohit

· 26,344 Views · 2 Likes

PostgreSQL: ERROR: Column Does Not Exist

I’ve been playing around with PostgreSQL recently, in particular with the Northwind dataset that is typically used as an introductory data set for relational databases. Having imported the data, I wanted to take a quick look at the employees table:

postgres=# SELECT * FROM employees LIMIT 1;
 EmployeeID | LastName | FirstName | Title | TitleOfCourtesy | BirthDate | HireDate | Address | City | Region | PostalCode | Country | HomePhone | Extension | Photo | Notes | ReportsTo | PhotoPath
------------+----------+-----------+----------------------+-----------------+------------+------------+-----------------------------+---------+--------+------------+---------+----------------+-----------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------------------------------
 1 | Davolio | Nancy | Sales Representative | Ms. | 1948-12-08 | 1992-05-01 | 507 - 20th Ave. E.\nApt. 2A | Seattle | WA | 98122 | USA | (206) 555-9857 | 5467 | \x | Education includes a BA in psychology from Colorado State University in 1970. She also completed "The Art of the Cold Call." Nancy is a member of Toastmasters International. | 2 | http://accweb/emmployees/davolio.bmp
(1 row)

That works fine, but what if I only want to return the ‘EmployeeID’ field?

postgres=# SELECT EmployeeID FROM employees LIMIT 1;
ERROR:  column "employeeid" does not exist
LINE 1: SELECT EmployeeID FROM employees LIMIT 1;

I hadn’t realised (or had forgotten) that unquoted field names get folded to lower case, so we need to quote the name if it’s been stored in mixed case:

postgres=# SELECT "EmployeeID" FROM employees LIMIT 1;
 EmployeeID
------------
 1
(1 row)

From my reading, the suggestion seems to be to keep your field names lower case to avoid this problem, but since it’s just a dummy data set I guess I’ll just put up with the quoting overhead for now.
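The folding rule behind that error can be modelled in a few lines. The sketch below is a toy model in Java, not PostgreSQL's actual parser: the class and method names are made up, and it only covers the two cases from the post (an unquoted identifier is lower-cased before lookup, a double-quoted one is taken literally).

```java
import java.util.Locale;

// Hypothetical illustration of identifier case folding; not PostgreSQL code.
final class IdentifierFolding {

    /** Returns the column name a query identifier resolves to under the folding rule. */
    static String fold(String identifier) {
        boolean quoted = identifier.length() >= 2
                && identifier.startsWith("\"")
                && identifier.endsWith("\"");
        if (quoted) {
            // Quoted identifiers keep their case; a doubled quote stands for one quote.
            return identifier.substring(1, identifier.length() - 1).replace("\"\"", "\"");
        }
        // Unquoted identifiers are folded to lower case before lookup.
        return identifier.toLowerCase(Locale.ROOT);
    }

    /** True if the identifier as written in a query matches the stored column name. */
    static boolean matchesStoredColumn(String written, String storedName) {
        return fold(written).equals(storedName);
    }

    public static void main(String[] args) {
        System.out.println(fold("EmployeeID"));
        System.out.println(fold("\"EmployeeID\""));
    }
}
```

Under this model, EmployeeID written without quotes is looked up as employeeid and so never matches a column stored as EmployeeID, which is exactly the error shown above.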

October 7, 2014

by Mark Needham

· 17,587 Views

Simple SecurePasswordVault in Java

There are some instances when you want to store your passwords in files to be used by programs or scripts, but storing passwords in plain text is not a good idea. Use the SecurePasswordVault to encrypt your passwords before storing them and to decrypt them when you want to use them.

You can use the SecurePasswordVault described here to store any number of encrypted passwords. Passwords are stored as key-value pairs:

Key - any name given by the user for the password
Value - the encrypted password

SecurePasswordVault will create a file with the given name in the working directory if it doesn't exist. If the file exists, the information in that file will be read.

Passwords are encrypted with AES, using a key derived from the MAC address of the network card. SecurePasswordVault uses the MAC address of the first network card that is not the loopback interface, so the encrypted file can only be decrypted on a machine with that particular MAC address.

If you want to reset the password details, just delete the password file and run the SecurePasswordVault again.

You can download the sample code from the following GitHub repository: https://github.com/jsdjayanga/secure_password

package com.wso2.devgov;

import org.bouncycastle.util.encoders.Base64;

import javax.crypto.*;
import javax.crypto.spec.SecretKeySpec;
import java.io.*;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import java.security.Security;
import java.util.*;

/**
 * Created by jayanga on 3/31/14.
 */
public class SecurePasswordVault {
    private static final int AES_KEY_LEN = 32;
    private static final int PASSWORD_LEN = 256;
    private static boolean initialized;
    private final String secureFile;
    private final byte[] networkHardwareHaddress;
    private Map<String, String> secureDataMap;
    private List<String> secureDataList;
    SecretKeySpec secretKey;

    public SecurePasswordVault(String filename, String[] secureData) throws IOException {
        Security.addProvider(new org.bouncycastle.jce.provider.BouncyCastleProvider());
        initialized = false;
        secureFile = filename;
        networkHardwareHaddress = SecurePasswordVault.readNetworkHardwareAddress();
        secureDataMap = new HashMap<String, String>();
        this.secureDataList = new ArrayList<String>(secureData.length);
        Collections.addAll(secureDataList, secureData);

        // Build the AES key from the MAC address, zero-padded to AES_KEY_LEN bytes.
        byte[] key = new byte[AES_KEY_LEN];
        Arrays.fill(key, (byte) 0);
        for (int index = 0; index < networkHardwareHaddress.length; index++) {
            key[index] = networkHardwareHaddress[index];
        }
        secretKey = new SecretKeySpec(key, "AES");

        // On first use, prompt for the passwords and write the encrypted file.
        if (!isInitialized()) {
            readSecureData(secureDataList);
            persistSecureData();
        }
        readSecureDataFromFile();
    }

    private boolean isInitialized() {
        if (initialized) {
            return true;
        }
        File file = new File(secureFile);
        if (file.exists()) {
            initialized = true;
        }
        return initialized;
    }

    private static byte[] readNetworkHardwareAddress() throws SocketException {
        Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
        // Use the first non-loopback interface that reports a hardware address.
        while (interfaces != null && interfaces.hasMoreElements()) {
            NetworkInterface networkInterface = interfaces.nextElement();
            byte[] hwaddr = networkInterface.isLoopback() ? null : networkInterface.getHardwareAddress();
            if (hwaddr != null) {
                return hwaddr;
            }
        }
        throw new RuntimeException("Cannot initialize. Failed to generate unique id.");
    }

    private byte[] encrypt(String word) {
        try {
            // Zero-pad the password to a fixed PASSWORD_LEN bytes (a multiple of the AES block size).
            byte[] password = new byte[PASSWORD_LEN];
            byte[] pw = word.getBytes("UTF-8");
            for (int index = 0; index < pw.length; index++) {
                password[index] = pw[index];
            }
            Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, secretKey);
            return cipher.doFinal(password);
        } catch (java.security.GeneralSecurityException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return null;
    }

    private String decrypt(byte[] cipherText) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, secretKey);
            byte[] plainText = cipher.doFinal(cipherText, 0, PASSWORD_LEN);
            // trim() strips the zero bytes that were added as padding.
            return new String(plainText, "UTF-8").trim();
        } catch (java.security.GeneralSecurityException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return null;
    }

    public void readSecureData(List<String> secureDataList) throws IOException {
        BufferedReader bufferRead = new BufferedReader(new InputStreamReader(System.in));
        for (int index = 0; index < secureDataList.size(); index++) {
            System.out.println("Please enter the value for :" + secureDataList.get(index));
            String value = new String(Base64.encode(encrypt(bufferRead.readLine())));
            secureDataMap.put(secureDataList.get(index), value);
        }
    }

    public String getSecureData(String key) {
        String value = secureDataMap.get(key);
        if (value != null) {
            return decrypt(Base64.decode(value.getBytes()));
        }
        throw new RuntimeException("Given key is unknown. [key=" + key + "]");
    }

    private void readSecureDataFromFile() throws IOException {
        BufferedReader br = new BufferedReader(new FileReader(secureFile));
        try {
            String line;
            while ((line = br.readLine()) != null) {
                // Each line is stored as key=base64(ciphertext).
                int dividerPoint = line.indexOf("=");
                if (dividerPoint > 0) {
                    secureDataMap.put(line.substring(0, dividerPoint), line.substring(dividerPoint + 1));
                }
            }
        } finally {
            br.close();
        }
    }

    private void persistSecureData() throws IOException {
        FileWriter fileWriter = new FileWriter(secureFile);
        for (String key : secureDataMap.keySet()) {
            fileWriter.append(key + "=" + secureDataMap.get(key) + "\n");
        }
        fileWriter.close();
    }
}
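The core of the vault (a key built from a zero-padded hardware address, a password zero-padded to a fixed length, and an AES round trip) can be tried in isolation. The following is a minimal sketch under a few assumptions: it uses the JDK's built-in AES provider instead of Bouncy Castle, takes the hardware address as a parameter rather than reading it from a network card, and the class and method names are invented for the example. It mirrors the AES/ECB/NoPadding choice of the listing above to show how that code behaves and is not a recommendation of that mode.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical, stripped-down model of the vault's encrypt/decrypt steps.
final class VaultCipherSketch {
    static final int AES_KEY_LEN = 32;
    static final int PASSWORD_LEN = 256;

    /** Builds an AES key from a hardware address, zero-padded to AES_KEY_LEN bytes. */
    static SecretKeySpec keyFrom(byte[] hardwareAddress) {
        return new SecretKeySpec(Arrays.copyOf(hardwareAddress, AES_KEY_LEN), "AES");
    }

    /** Zero-pads the password to PASSWORD_LEN bytes and encrypts it. */
    static byte[] encrypt(String word, SecretKeySpec key) {
        byte[] pw = word.getBytes(StandardCharsets.UTF_8);
        if (pw.length > PASSWORD_LEN) {
            throw new IllegalArgumentException("Password longer than " + PASSWORD_LEN + " bytes");
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key);
            return cipher.doFinal(Arrays.copyOf(pw, PASSWORD_LEN));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Decrypts and strips the zero padding (trim() also removes surrounding whitespace). */
    static String decrypt(byte[] cipherText, SecretKeySpec key) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key);
            return new String(cipher.doFinal(cipherText), StandardCharsets.UTF_8).trim();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SecretKeySpec key = keyFrom(new byte[] {0, 1, 2, 3, 4, 5}); // made-up address
        byte[] cipherText = encrypt("s3cret", key);
        System.out.println(decrypt(cipherText, key));
    }
}
```

Two properties are worth noting. A 6-byte MAC address leaves most of the 32 key bytes as zeros, and MAC addresses are not secret, so this scheme ties the file to a machine more than it protects against a determined attacker. Also, because padding is removed with trim(), passwords that begin or end with whitespace do not survive the round trip.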

October 5, 2014

by Jayanga Dissanayake

· 15,275 Views · 1 Like

Comparison of SQL Server Compact, SQLite, SQL Server Express and LocalDB

Now that SQL Server 2014 and SQL Server Compact 4 have been released, some developers are curious about the differences between SQL Server Compact 4.0 and SQL Server Express 2014 (including LocalDB). I have updated the comparison table from the excellent discussion of the differences between Compact 3.5 and Express 2005 here to reflect the changes in the newer versions of each product. Information about LocalDB comes from here and SQL Server 2014 Books Online. LocalDB is the full SQL Server Express engine, but invoked directly from the client provider. It is a replacement of the current “User Instance” feature in SQL Server Express.

Feature | SQL Server Compact 3.5 SP2 | SQL Server Compact 4.0 | SQLite, incl SQLite ADO.NET Provider | SQL Server Express 2012 | SQL Server 2012 LocalDB

Deployment/Installation Features
Installation size | 2.5 MB download size, 12 MB expanded on disk | 2.5 MB download size, 18 MB expanded on disk | 10 MB download, 14 MB expanded on disk | 120 MB download size, > 300 MB expanded on disk | 32 MB download size, > 160 MB on disk
ClickOnce deployment | Yes | Yes | Yes | Yes | Yes
Privately installed, embedded, with the application | Yes | Yes | Yes | No | No
Non-admin installation option | Yes | Yes | Yes | No | No
Runs under ASP.NET | No | Yes | Yes | Yes | Yes
Runs on Windows Mobile / Windows Phone platform | Yes | No | Yes | No | No
Runs on WinRT (Phone/Store Apps) | No | No | Yes | No | No
Runs on non-Microsoft platforms | No | No | Yes | No | No
Installed centrally with an MSI | Yes | Yes | Yes | Yes | Yes
Runs in-process with application | Yes | Yes | Yes | No | No (as process started by app)
64-bit support | Yes | Yes | Yes | Yes | Yes
Runs as a service | No - in process with application | No - in process with application | No - in process with application | Yes | No - as launched process

Data file features
File format | Single file | Single file | Single file | Multiple files | Multiple files
Data file storage on a network share | No | No | No | No | No
Support for different file extensions | Yes | Yes | Yes | No | No
Database size support | 4 GB | 4 GB | 140 TB | 10 GB | 10 GB
XML storage | Yes - stored as ntext | Yes - stored as ntext | Yes, stored as text | Yes, native | Yes, native
Binary (BLOB) storage | Yes - stored as image | Yes - stored as image | Yes | Yes | Yes
FILESTREAM support | No | No | No | Yes | No
Code free, document safe, file format | Yes | Yes | Yes | No | No

Programmability
Transact-SQL - Common Query Features | Yes | Yes | No | Yes | Yes
Procedural T-SQL - Select Case, If, features | No | No | Limited | Yes | Yes
Remote Data Access (RDA) | Yes | No (not supported) | No | No | No
ADO.NET Sync Framework | Yes | No | No | Yes | Yes
LINQ to SQL | Yes | No (not supported) | No | Yes | Yes
ADO.NET Entity Framework 4.1 | Yes (no Code First) | Yes | Yes | Yes | Yes
ADO.NET Entity Framework 6 | Yes (fully) | Yes (fully) | Yes (limited) | Yes | Yes
Subscriber for merge replication | Yes | No | No | Yes | No
Simple transactions | Yes | Yes | Yes | Yes | Yes
Distributed transactions | No | No | No | Yes | Yes
Native XML, XQuery/XPath | No | No | No | Yes | Yes
Stored procedures, views, triggers | No | No | Views and triggers | Yes | Yes
Role-based security | No | No | No | Yes | Yes
Number of concurrent connections | 256 (100) | 256 | Unlimited | Unlimited | Unlimited (but only local)

There is also a table here that allows you to determine which Transact-SQL commands, features, and data types are supported by SQL Server Compact 3.5 (which are the same as 4.0 with very few exceptions), compared with SQL Server 2005 and 2008.

October 4, 2014

by Erik Ejlskov Jensen

· 24,875 Views

Datacenter Resource Fragmentation

The concept of resource fragmentation is common in the IT world. In the simplest of contexts, resource fragmentation occurs when blocks of capacity (compute, storage, whatever) are allocated, freed, and ultimately re-allocated to create noncontiguous blocks. While the most familiar setting for fragmentation is memory allocation, the phenomenon plays itself out within the datacenter as well. But what does resource fragmentation look like in the datacenter? And more importantly, what is the remediation? The impacts of virtualization Server virtualization does for applications and compute what fragmentation and noncontiguous memory blocks did for storage. By creating virtual machines on servers, each with a customizable resource footprint, the once large contiguous blocks of compute capacity (each server) can be divided into much smaller subdivisions. And as applications take advantage of this architectural compute model, they become more distributed. The result of this is an application environment where individual components are distributed across multiple devices, effectively occupying a noncontiguous set of compute resources that must be unified via the network. It is not a stretch to say that for server virtualization to deliver against its promise of higher utilization, the network must act as the Great Uniter. Not just a virtual phenomenon While fragmentation is easily explained in a virtualized context, the phenomenon is certainly not only a virtual one. After creation, datacenters grow organically. In the best of times, they grow at very predictable, steady rates. More frequently, they grow in fits and spurts as business requirements heap new application demands on top of existing infrastructure. This growth model is made even more chaotic because of physical constraints. Rows are finite and have an end. If you want to rack up additional compute next to existing compute for a particular application, you might have to move a row over. 
But what about when that row itself is taken? Then maybe you move a couple rows over. Or a room over. Or maybe into another datacenter entirely. Physical locations are also constrained by how much space they have. Even if you have the will to expand, there might simply be no additional real estate to consume. So you build up, in which case the resources you need are now separated by a floor. Or maybe you build out and separate resources by a short distance across the campus. Or across the city. Or maybe even across the country. Sometimes it’s not even the physical space. With very large footprints, trying to pull enough power from the grid might be impossible. And then there are all the business continuity requirements that frequently lead to datacenter resource sprawl across physical locations. The point is that growth is rarely linear, and this means that physical resources cannot normally be guaranteed to be in close proximity. What started as a nicely groomed cluster of compute and storage turns into a set of noncontiguous resources spread out across whatever physical footprint your datacenter (or datacenters) occupies. Unifying contiguous resources There are, of course, ways to unify resources that suffer from this type of sprawl. In the best of cases, if all of your servers are equivalent, you can migrate VMs over time to achieve continuity. The orchestration of such a feat is nightmarish enough, forgetting for a moment the impact of all that activity and the risk it incurs. So if there is no datacenter equivalent for defragmentation, what do you do? The network ends up playing a unifying role. So long as resources are connected, they can work in concert to deliver some application workload. But not all networks are the same, and depending on the spread of resources, the type of network needed varies. 
Not all networks are the same If resources are contained now and forever in a fairly tight geographical space, then providing rack-to-rack or row-to-row connectivity is fairly straightforward. But what if the applications across those resources are more bandwidth hungry? You might need to consider cross-connect and offload solutions. How about if those applications are particularly latency-sensitive? You might favor completely flat architectures over more traditional two- and three-tier networks. If resources are not so easily contained, the network choices expand. If application workloads are distributed across different rooms in a datacenter, you have to consider the impact of room-to-room connectivity. Is that done through a WAN connection, in which case you take on yet another networking layer? Or do you use optical equipment to stretch an L2 domain across some physical distance, in which case you have to consider laying or leasing fiber? And even then, as distances grow from a few hundred meters to a few thousand kilometers, the considerations change again. Conditions will change Finally, the complexity only increases as you consider that all of this is a moving target. When your business is smaller, perhaps you can keep everything in one location. A few years down the road, maybe you outgrow your site or leasing terms change. Your company acquires another company, and you now have resource sprawl with a datacenter consolidation project on the horizon. Accounting for all of the potential outcomes is challenging. The best that you can do is create solid architectural building blocks that provide the most optionality for whatever outcomes exist. In that regard, planning for growth is about considering how that growth might materialize and including flexibility as one of the primary requirements around the underlying infrastructure. The bottom line As datacenters grow, application resources will become fragmented. 
The question is not whether you will have to deal with this but rather how quickly your infrastructure can adapt. Architecting with this explicitly in mind could mean the difference between natural evolution or the types of transformation initiatives that stop companies dead in their tracks every 3-5 years. [Today’s fun fact: Chewing gum while peeling onions will prevent you from crying. It doesn’t work as well in romantic comedies.]

October 3, 2014

by Mike Bushong

· 8,466 Views

Product Catalog with MongoDB, Part 2: Product Search

Continue learning about product catalogs in MongoDB as we look at product searches.

September 26, 2014

by Antoine Girbal

· 19,521 Views · 4 Likes

Product Catalog with MongoDB, Part 1: Schema Design

This post is part of the Product Catalog MongoDB Series, in which we will cover many aspects of building a Product Catalog with MongoDB. This approach has been tested with a varied product catalog of 130 million items running on a single server (EC2 i2.2xlarge).

MongoDB seems to be the perfect fit to implement a product catalog since products map so well to documents. But as we shall see it is not as easy as it seems! The data is fairly complex with many relationships involved. Also, almost every other system will want to make use of the catalog instead of making its own copy, so typically a low-latency, scalable and geo-distributed catalog service is the ideal solution.

A product has at least the following information:

Item: the overall product info (e.g. Levi’s 501)
Variant: a specific variant of an item (e.g. in black size 6) which typically has a specific SKU / UPC
Price: price information may vary based on the store, the variant, etc
Hierarchy: the item taxonomy
Facet: facets to search products by
Vendors: a given SKU may be available through different vendors if the site is a marketplace

A classic pitfall is to try to fit everything into a single document. As a result you end up with something very complex with many nested lists, which makes it difficult to navigate and index. Additionally, APIs find themselves sending back massive documents even if only partial info is needed. In certain cases, we’ve seen items with thousands of variants (e.g. automotive parts) which go beyond 16MB of pure JSON (in which case compression becomes mandatory)! Instead, here we are going to model the data in a way that is natural and maps well to the API, at the sweet spot between normalization and denormalization.

Item Model

The item collection has documents representing the high-level data of a product. Here is a sample item document for a shoe:

{ "_id": "054VA72303012P", "desc": [ { "lang": "en", "val": "Give your dressy look a lift with these women's Kate high-heel shoes by Metaphor. 
These playful peep-toe pumps feature satin-wrapped stiletto heels and chiffon pompoms at the toes. Rhinestones on each of the silvertone buckles add just a touch of sparkle to these shoes for a flirty footwear look that's made for your next night out." } ], "name": "Women's Kate Ivory Peep-Toe Stiletto Heel", "lname": "women's kate ivory peep-toe stiletto heel", "category": "/84700/80009/1282094266/1200003270", "brand": { "id": "2483510", "img": { "src": "http://i.sears.com/s/i/bl/image/spin_prod_metadata_168138610" }, "name": "Metaphor" }, "assets": { "imgs": [ { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_967112812", "width": "1900" } }, { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945877912", "width": "1900" } }, { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945878012", "width": "1900" } } ] }, "shipping": { "dimensions": { "height": "13.0", "length": "1.8", "width": "26.8" }, "weight": "1.75" }, "specs": [ { "name": "Heel Height (in.)", "val": "3.75" } ], "attrs": [ { "name": "Heel Height", "value": "High (2-1/2 to 4 in.)" }, { "name": "Upper Material", "value": "Synthetic" }, { "name": "Toe", "value": "Open toe" }, { "name": "Brand", "value": "Metaphor" } ], "variants": { "cnt": 9, "attrs": [ { "dispType": "COMBOBOX", "name": "Width" }, { "dispType": "DROPDOWN", "name": "Color" }, { "dispType": "DROPDOWN", "name": "Shoe Size" } ] }, "lastUpdated": 1400877254787 }

Fields of interest:

_id: the product id
lastUpdated: useful timestamp to see what was recently updated
category: the category path made up of hierarchy nodes
name: the product name
lname: a lower-case version of the name. This can be useful for doing case-insensitive matching with an index
brand: the brand
desc: list of descriptions (website, retail box, etc)
assets: list of assets (images, etc)
attrs: list of attributes as name-value pairs. Will be used to implement faceting.
Note that the brand is also included as one attribute.
· variants: some information on variants, but not the variants themselves

Common queries (indexed):

· find by id: { _id: "the product id" }
· find by category prefix: { category: { $regex: "^category prefix" } }
· find by case-insensitive name prefix: { lname: { $regex: "^name prefix" } }

Variant Model

The Variant documents represent specific variations of a product. Certain products exist only as a single variant (e.g. an Xbox, with no options to pick) whereas other products may have thousands. Here is a sample variant document for the same shoe:

{
  "_id": "05458452563",
  "name": "Width:Medium,Color:Ivory,Shoe Size:6.5",
  "lname": "width:medium,color:ivory,shoe size:6.5",
  "itemId": "054VA72303012P",
  "altIds": { "upc": "632576103580" },
  "assets": { "imgs": [
    { "width": "1900", "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348512" },
    { "width": "1900", "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348612" }
  ] },
  "attrs": [
    { "name": "Width", "value": "Medium" },
    { "name": "Color", "family": "White", "value": "Ivory" },
    { "name": "Shoe Size", "value": "6.5" }
  ]
}

Fields of interest:

· _id: the SKU
· itemId: the parent item id
· attrs: a list of attributes specific to the variant. Note that some of the attributes may have both a specific value (e.g. ivory) and a family value (e.g. white).
· assets: assets specific to the variant (e.g. image with a specific color)

Common queries (indexed):

· find by SKU: { _id: "the sku" }
· find by item id: { itemId: "item id" }

Hierarchy Model

The Hierarchy document represents a node of the hierarchical tree representing the product taxonomy. The top level nodes represent departments, while further nodes represent specific categories.
{ "_id": "1200003270", "name": "Women's Heels & Pumps", "count": 223, "parents": [ "1282094266" ], "facets": [ "Heel Height", "Toe", "Upper Material", "Width", "Shoe Size", "Color" ] }

Fields of interest:

· _id: the category id
· name: the category name
· count: the number of items in this category. It can be a useful statistic to display.
· parents: list of parent nodes. Simpler implementations could make use of a single value.
· facets: list of facets that exist for this category (e.g. color, size). This info will be used when displaying the facets available on the search page.

Common queries (indexed):

· find by parent id: { parents: "parent id" }
· find top level departments: { parents: null }

Facet Model

The facet document represents a name/value pair for a product attribute.

{ "_id": "accessory type=hosiery", "name": "Accessory Type", "value": "Hosiery", "count": 14 }

Fields of interest:

· _id: the id, which is a concatenation of the lower-cased facet name and value
· name: the facet name with original casing, e.g. “Accessory Type”
· value: the facet value with original casing, e.g. “Hosiery”. Important note: here the value should be the family value if possible, e.g. “White” rather than “Ivory”. Those facets will be used for searching items, and the family value is better for that purpose.
· count: the number of items that have this facet. This count will be important in defining the order of attributes in a query when doing faceted search.

Common queries (indexed):

· find a specific facet: { _id: "name=value" }
· find facets for a name: { _id: { $regex: "^name=" } }

Price Model

The Price document obviously represents the price of an item, but there is quite a bit more to it. We want to be able to vary the price per variant (e.g. gold color is more expensive) or per store (e.g. online store is cheaper). While we will not touch on the store model in this post, let’s just imagine we have a few thousand stores which are grouped into a dozen store groups (e.g. online, west coast, etc.). If we implemented this naively, we would end up with 1,000 stores x 200m variants = 200 billion price documents! Instead, let’s be a bit smarter and make good use of MongoDB’s querying capability by allowing products to be priced at different levels as needed, thus keeping the number of documents in the millions. A document looks like:

{ "_id": "SPM8824542513_1234", "price": "69.99", "sale": { "salePrice": "42.72", "saleEndDate": "2050-12-31 23:59:59" }, "lastUpdated": 1374647707394 }

Fields of interest:

· _id: the id is built in a specific way. It is the concatenation of the item information and the store information, separated by an underscore. The item information is either the item id or the variant id (SKU). The store information is either the store group id or the store id.
· price: the regular price
· sale: sales information, optional

Common queries (indexed):

· find all prices by item id: { _id: { $regex: "^itemId_" } }
· find all prices by SKU (price could be at item level): { _id: { $in: [ /^itemId_/, /^sku_/ ] } }
· find price for a given SKU and store (4 combinations are possible): { _id: { $in: [ "itemId_storeGroupId", "itemId_storeId", "sku_storeGroupId", "sku_storeId" ] } }
· find items on sale, starting with ones ending soonest: { "sale.saleEndDate": { $ne: null } } with sort by { "sale.saleEndDate": 1 } (sparse index on "sale.saleEndDate")

Summary Model

The previous documents are now properly modeled, easy to maintain, and can efficiently power an API to serve product details pages. The last and most difficult issue left to tackle is how to do faceted searching and other kinds of browsing. We could hand it off to a full text search system or similar software, but it’s actually doable with MongoDB.
For this purpose we face some tough challenges:

· whatever the search is, we need a response within milliseconds, returning hundreds of items
· the search can be a combination of many facets: category, brand, etc.
· facets can be both at the item and variant levels: color, size, etc. If matching a specific variant, we should display that specific image (e.g. red shoes).
· hundreds of variants of the same item could match, in which case only the parent item should be returned as a result
· efficient sorting on several attributes: price, popularity
· pagination, which requires deterministic ordering

To address these, we create a separate collection called Summary in which each document represents the summary information of an item and all its variants. The data is stripped down to the minimum needed to power a browse & search feature. Such a document looks like:

{ "_id": "3ZZVA46759401P", "name": "Women's Chic - Black Velvet Suede", "lname": "women's chic - black velvet suede", "dep": "84700", "cat": "/84700/80009/1282094266/1200003270", "desc": [ { "lang": "en", "val": "This pointy toe slingback features a high quality upper and a classy, simple silhouette. This heel has a classic shape, an adjustable ankle strap for a vintage feel and a secure fit. The Chic is the perfect combination between dressy and professional." 
} ], "img": [ { "height": "330", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726201", "title": "spin_prod_591726201", "width": "450" } ], "attrs": [ "heel height=mid (1-3/4 to 2-1/4 in.)", "brand=metaphor" ], "sattrs": [ "upper material=synthetic", "toe=open toe" ], "vars": [ { "id": "05497884001", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=6" ] }, { "id": "05497884002", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=6.5" ] }, { "id": "05497884004", "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ], "attrs": [ "width=medium", "color=black", "shoe size=7.5" ] } ] } Fields of interest: _id: the item id name: the item name lname: the item name, lower-case img: list of images, ideally just the thumbnail dep: the department (top level of category). Needs to be separate for proper indexing cat: the category path attrs: the item attributes, to be indexed sattrs: the item secondary attributes, not to be indexed vars: list of variants vars.id: the variant sku vars.attrs: the variant attributes, to be indexed vars.sattrs: the secondary variant attributes, not to be indexed Indices: department + attr + category + _id department + vars.attrs + category + _id department + category + _id department + price + _id department + rating + _id Common queries (indexed): find by department: { dep: "department" } find by category prefix: { dep: "department", cat: { $regex: "^category prefix"} } find by item attribute: { dep: "department", attrs: "name=value" } find by several item attributes: { dep: "department", attrs: { $all: [ "name=value", ... 
] } }
· find by variant attribute: { dep: "department", "vars.attrs": "name=value" }
· find by several variant attributes: { dep: "department", "vars.attrs": { $all: [ "name=value", ... ] } }
· find by item attributes, variant attributes, category: { dep: "department", attrs: { $all: [ "name=value", ... ] }, "vars.attrs": { $all: [ "name=value", ... ] }, cat: { $regex: "^category prefix" } }

A few interesting notes on indexing / querying:

· each index starts with the department, which is a convenient way to subdivide our product catalog. It is an acceptable restriction to force the user to pick a department before displaying any kind of search facet (unless we’re displaying a pre-computed list like “most popular”). Hence having the department there will ensure that there is always a large amount of filtering done cheaply by the index :)
· each index ends with “_id”, which is useful for pagination. It will give sorting on _id for free for some common queries. It’s always better to avoid resorting to skip/limit for pagination, which is only fine for a low number of pages.
· for queries using “$all”, the most restrictive attribute should be specified first (e.g. “color=red”). This information can be inferred from the “facet” collection described earlier. This piece is critical to make faceted searches efficient and keep them within a few milliseconds.

Conclusion

In conclusion, we’ve seen here how to model and index a product catalog in MongoDB in a way that allows high performance, flexibility, and easy maintenance.
More details on this topic can be seen in the MongoDB World video Product Catalog. Stay tuned for more information on our Product Catalog MongoDB solution, including:

· How to implement full text search within MongoDB or with a connector to an external FTS system
· More statistics and benchmarking on the faceted search capability
· Operational considerations: geo distribution for low latency queries, stringent read latency SLAs, and spiky catalog updates

Also check out another interesting topic: how to log all user activities around the site and run useful analytics on them, in the MongoDB World video covering the Insight Component.
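
As a footnote to the note on “$all” queries: one way to put the most restrictive attribute first is to sort the requested facets by their count from the facet collection, lowest count first. The sketch below is plain Java, with an in-memory map standing in for the facet collection. The class name FacetOrdering, the method orderForAll, and the sample counts are assumptions made for illustration; they are not part of the solution described in this post.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class FacetOrdering {

    // Returns the facets sorted by ascending item count, so that the most
    // restrictive "name=value" facet comes first in the $all array.
    // Facets missing from the counts map are treated as least restrictive.
    static List<String> orderForAll(List<String> facets, Map<String, Integer> counts) {
        List<String> ordered = new ArrayList<>(facets);
        ordered.sort(Comparator.comparingInt(
                (String f) -> counts.getOrDefault(f, Integer.MAX_VALUE)));
        return ordered;
    }

    public static void main(String[] args) {
        // Hypothetical counts, as they might be read from the facet collection.
        Map<String, Integer> counts = Map.of(
                "brand=metaphor", 5200,
                "color=red", 310,
                "width=medium", 48000);
        List<String> ordered = orderForAll(
                List.of("width=medium", "brand=metaphor", "color=red"), counts);
        // The ordered list would then be used as the value of $all.
        System.out.println(ordered);
    }
}
```

Sorting on the application side keeps the query itself unchanged; only the order of the values passed to $all differs.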

September 25, 2014

by Antoine Girbal

· 65,707 Views · 8 Likes

The No Fluff Introduction to Big Data

Big data traditionally has referred to a collection of data too massive to be handled efficiently by traditional database tools and methods. This original definition has expanded over the years to identify tools (big data tools) that tackle extremely large datasets (NoSQL databases, MapReduce, Hadoop, NewSQL, etc.), and to describe the industry challenge posed by having data harvesting abilities that far outstrip the ability to process, interpret, and act on that data. Technologists knew that those huge batches of user data and other data types were full of insights that could be extracted by analyzing the data in large aggregates. They just didn’t have any cheap, simple technology for organizing and querying these large batches of raw, unstructured data.

The term quickly became a buzzword for every sort of data processing product’s marketing team. Big data became a catchall term for anything that handled non-trivial sizes of data. Sean Owen, a data scientist at Cloudera, has suggested that big data is a stage where individual data points are irrelevant and only aggregate analysis matters [1]. But this is true for a 400 person survey as well, and most people wouldn’t consider that very big. The key part missing from that definition is the transformation of unstructured data batches into structured datasets. It doesn’t matter if the database is relational or non-relational. Big data is not defined by a number of terabytes; it’s rooted in the push to discover hidden insights in data that companies used to disregard or throw away.

Due to the obstacles presented by large scale data management, the goal for developers and data scientists is two-fold: first, systems must be created to handle large scale data, and second, business intelligence and insights should be acquired from analysis of the data. Acquiring the tools and methods to meet these goals is a major focus in the data science industry, but it’s a landscape where needs and goals are still shifting.
What Are the Characteristics of Big Data?

Tech companies are constantly amassing data from a variety of digital sources that is almost without end: everything from email addresses to digital images, MP3s, social media communication, server traffic logs, purchase history, and demographics. And it’s not just the data itself, but data about the data (metadata). It is a barrage of information on every level. What is it that makes this mountain of data big data? One of the most helpful models for understanding the nature of big data is “the Three Vs”: volume, velocity, and variety.

Data Volume

Volume is the sheer size of the data being collected. There was a point in not-so-distant history where managing gigabytes of data was considered a serious task; now we have web giants like Google and Facebook handling petabytes of information about users’ digital activities. The size of the data is often seen as the first challenge of characterizing big data storage, but even beyond that is the capability of programs to provide architecture that can not only store but query these massive datasets. One of the most popular models for big data architecture comes from Google’s MapReduce concept, which was the basis for Apache Hadoop, a popular data management solution.

Data Velocity

Velocity is a problem that flows naturally from the volume characteristics of big data. Data velocity is the speed at which data is flowing into a business’ infrastructure and the ability of software solutions to receive and ingest that data quickly. Certain types of high-velocity data, such as streaming data, need to be moved into storage and processed on the fly. This is often referred to as complex event processing (CEP). The ability to intercept and analyze data that has a lifespan of milliseconds is widely sought after.
This kind of quick-fire data processing has long been the cornerstone of digital financial transactions, but it is also being used to track live consumer behavior or to bring instant updates to social media feeds.

Data Variety

Variety refers to the source and type of data that is being collected. This data could be anything from raw image data to sensor readings, audio recordings, social media communication, and metadata. The challenge of data variety is being able to take raw, unstructured data and organize it so that an application can use it. This kind of structure can be achieved through architectural models that traditionally favor relational databases, but there is often a need to tidy up this data before it will even be useful to store in a raw form. Sometimes a better option is to use a schema-less, non-relational database.

How Do You Manage Big Data?

The Three Vs is a great model for getting an initial understanding of what makes big data a challenge for businesses. However, big data is not just about the data itself, but the way that it is handled. A popular way of thinking about these challenges is to look at how a business stores, processes, and accesses its data.

· Store: can you store the vast amounts of data being collected?
· Process: can you organize, clean, and analyze the data collected?
· Access: can you search and query this data in an organized manner?

The store, process, and access model is useful for two reasons: it reminds businesses that big data is largely about managing data, and it demonstrates the problem of scale within big data management. “Big” is relative. The data batches that challenge some companies could be moved through a single Google datacenter in under a minute. The only question a company needs to ask itself is how it will store and access increasingly massive amounts of data for its particular use case. There are several high level approaches that companies have turned to in the last few years.
The Traditional Approach

The traditional method for handling most data is to use relational databases. Data warehouses are then used to integrate and analyze data from many sources. These databases are structured according to the concept of “early structure binding”: essentially, the database has predetermined “questions” that can be asked based on a schema. Relational databases are highly functional, and the goal with this type of data processing is for the database to be fully transactional. Although relational databases are the most common persistence type by a large margin (see key findings pg. 4-5), a growing number of use cases are not well-suited for relational schema. Relational architectures tend to have difficulty when dealing with the velocity and variety of big data, since their structure is very rigid. When you perform functions such as JOIN on many large data sets, the volume can be a problem as well. Instead, businesses are looking to non-relational databases, or a mixture of both types, to meet data demand.

The Newer Approach - MapReduce, Hadoop, and NoSQL Databases

In the early 2000s, web giant Google released two helpful web technologies: Google File System (GFS) and MapReduce. Both were new and unique approaches to the growing problem of big data, but MapReduce was chief among them, especially when it comes to its role as a major influencer of later solution models. MapReduce is a programming paradigm that allows for low cost data analysis and clustered scale-out processing. MapReduce became the primary architectural influence for the next big thing in big data: the creation of the big data management infrastructure known as Hadoop. Hadoop’s open source ecosystem and ease of use for handling large-scale data processing operations have secured a large part of the big data marketplace. Besides Hadoop, there was a host of non-relational (NoSQL) databases that emerged around 2009 to meet a different set of demands for processing big data.
Whereas Hadoop is used for its massive scalability and parallel processing, NoSQL databases are especially useful for handling data stored within large multi-structured datasets. This kind of discrete data handling is not traditionally seen as a strong point of relational databases, but it’s also not the same kind of data operations that Hadoop is running. The solution for many businesses ends up being a combination of these approaches to data management.

Finding Hidden Data Insights

Once you get beyond storage and management, you still have the enormous task of creating actionable business intelligence (BI) from the datasets you’ve collected. This problem of processing and analyzing data is maybe one of the trickiest in the data management lifecycle. The best options for data analytics will favor an approach that is predictive and adaptable to changing data streams. The thing is, there are so many types of analytic models and different ways of providing infrastructure for this process. Your analytics solution should scale, but to what degree? Scalability can be an enormous pain in your analytical neck, due to the problem of decreasing performance returns when scaling out an algorithm. Ultimately, analytics tools rely on a great deal of reasoning and analysis to extract data patterns and data insights, but this capacity means nothing for a business if it can’t then create actionable intelligence. Part of this problem is that many businesses have the infrastructure to accommodate big data, but they aren’t asking questions about what problems they’re going to solve with the data. Implementing a big data-ready infrastructure before knowing what questions you want to ask is like putting the cart before the horse. But even if we do know the questions we want to ask, data analysis can always reveal many correlations with no clear causes.
As organizations get better at processing and analyzing big data, the next major hurdle will be pinpointing the causes behind the trends by asking the right questions and embracing the complexity of our answers.

[1] http://www.quora.com/what-is-big-data

2014 Guide to Big Data

This guide explores the meaning of big data, how businesses use it, and uncovers new tools and techniques for the future of big data. This guide includes:

· Detailed profiles on 43 big data vendor solutions
· In-depth articles written by industry experts
· Results from our survey of 850 IT professionals
· "Finding the Database for Your Use Case"

September 25, 2014

by Benjamin Ball

· 10,659 Views · 1 Like

How to Trace Transactions Across Every Layer of Your Distributed Software Stack

APM solutions give you great visibility into any code you have control over; however, today’s systems are largely a combination of code you write along with off-the-shelf components, sitting on top of VMs/containers and cloud-based services. Thus, full system-wide visibility requires the ability to look into your APM tool as well as the log data produced by the components that you may not be able to instrument. This post offers an outline of how APM solutions work and how you can combine them with your system logs to finally get an end-to-end and top-to-bottom view of your system behavior and performance.

How APM Tools Work – Hello APM

APM tools give you insight deep into your code and often work using cool techniques like dynamic instrumentation. Dynamic instrumentation essentially allows you to instrument your apps on the fly without any need to modify your application source code. Such techniques have become widely supported by mainstream programming languages, making it possible for even mere mortals to build their own APM tools. For example, since Java version 5, any Java application can be instrumented using java.lang.instrument, which allows for the instrumentation of any program running on the JVM through modification of the byte code of methods. It works by letting you alter the byte code of a class when it is being loaded, so that you can introduce monitoring capabilities such as execution profiling or event tracing. There’s a great beginner tutorial by Julien Paoletti on how to write your first APM tool in Java. It essentially shows you how you can intercept classes at class load time and then inject code into methods of your choice to record how long given methods take to execute. While building a full APM solution is not for the faint-hearted, you can easily begin to build your first ‘Hello APM’ tool and play around with JVM internals by following Julien’s post.
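
To make the class-load interception described above a little more concrete, here is a minimal sketch of a java.lang.instrument agent. It only records which classes pass through the transformer and returns null, which tells the JVM to leave the byte code unchanged; actually injecting timing code would additionally need a byte code manipulation library, which is not shown here. The class name HelloApmAgent and the com/example/ package filter are assumptions made for illustration.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class HelloApmAgent implements ClassFileTransformer {

    // Names of the classes seen at load time (internal form, e.g. "com/example/Foo").
    static final List<String> SEEN = new CopyOnWriteArrayList<>();

    // Called by the JVM before main() when started with -javaagent:<agent jar>
    // (the jar manifest must name this class in its Premain-Class entry).
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new HelloApmAgent());
    }

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        // className can be null for some dynamically defined classes.
        if (className != null && className.startsWith("com/example/")) {
            SEEN.add(className);
            // A real tool would rewrite classfileBuffer here to add timing code.
        }
        return null; // null means "no change to the byte code"
    }

    public static void main(String[] args) {
        // Direct call for demonstration only; normally the JVM invokes transform().
        new HelloApmAgent().transform(null, "com/example/Foo", null, null, new byte[0]);
        System.out.println(SEEN);
    }
}
```

Filtering by package prefix keeps the agent away from JDK classes, which is usually what you want in a first experiment.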
Transaction Tracing

For those interested in moving beyond simply recording method execution time, you can begin to trace full transactions using some simple techniques. To do so, you essentially need a unique identifier to be passed along to any methods executed in that transaction. Continuing on from our hello world profiler above, you could do this by injecting a unique ID into the thread at any entry point in the system (e.g. new incoming requests). Java provides ThreadLocal storage that allows you to do just this. Using ThreadLocal, you can embed a unique ID that gets recorded as each method executes.

Reconstructing a Transaction

On every invocation of a method along the transaction, data is logged. An example of what might be logged by an APM tool is as follows:

· unique transaction id
· sequence number
· call depth
· method details
· performance data

You can then easily piece together full transaction traces by ordering all method calls by sequence number. Further analysis can be applied to this information for a number of purposes. For example, by analysing the transactions, developers can easily construct design diagrams that help to quickly deduce the overall system structure. Understanding the relationships between system components reveals interdependencies, enabling developers to anticipate potential conflicts, to debug problems, and to reason about their system design (which in turn can have a major impact on system performance).

Tracing Transactions Across the Network

The real challenge with transaction tracing can come when you are dealing with distributed components. In such scenarios you need to be able to trace transactions across the network. One approach here is to piggy-back the necessary correlation data (unique transaction ID) onto the request from a client to the remote server.
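
Before moving on to the network case, the in-process technique from the two previous sections (a ThreadLocal transaction ID, plus ordering by sequence number to rebuild the trace) can be sketched in plain Java. This is an illustration under stated assumptions, not the API of any APM product: the names TraceContext, TraceEvent, begin, enter, exit, and reconstruct are invented here, the calls to enter and exit are written by hand rather than injected by instrumentation, and performance data is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;

public class TraceContext {

    // One logged method invocation.
    static final class TraceEvent {
        final String txId;
        final int seq;
        final int depth;
        final String method;

        TraceEvent(String txId, int seq, int depth, String method) {
            this.txId = txId;
            this.seq = seq;
            this.depth = depth;
            this.method = method;
        }
    }

    // Per-thread transaction id, set at the entry point of a transaction.
    private static final ThreadLocal<String> TX_ID = new ThreadLocal<>();
    // Per-thread counters: [0] = next sequence number, [1] = current call depth.
    private static final ThreadLocal<int[]> COUNTERS =
            ThreadLocal.withInitial(() -> new int[2]);
    // Shared event log; a real tool would write to a file or a collector instead.
    static final List<TraceEvent> LOG = Collections.synchronizedList(new ArrayList<>());

    // Entry point (e.g. a new incoming request): attach a fresh id to this thread.
    static String begin() {
        String id = UUID.randomUUID().toString();
        TX_ID.set(id);
        COUNTERS.set(new int[2]);
        return id;
    }

    // Record a method entry, then advance the sequence number and call depth.
    static void enter(String method) {
        int[] c = COUNTERS.get();
        LOG.add(new TraceEvent(TX_ID.get(), c[0]++, c[1]++, method));
    }

    // Record a method exit by stepping back one level of call depth.
    static void exit() {
        COUNTERS.get()[1]--;
    }

    // Rebuild the trace: keep one transaction's events, order them by sequence
    // number, and indent each method name by its call depth.
    static List<String> reconstruct(String txId) {
        List<TraceEvent> events = new ArrayList<>();
        synchronized (LOG) {
            for (TraceEvent e : LOG) {
                if (txId.equals(e.txId)) {
                    events.add(e);
                }
            }
        }
        events.sort(Comparator.comparingInt((TraceEvent e) -> e.seq));
        List<String> lines = new ArrayList<>();
        for (TraceEvent e : events) {
            lines.add("  ".repeat(e.depth) + e.method);
        }
        return lines;
    }

    public static void main(String[] args) {
        String tx = begin();
        enter("handleRequest");
        enter("loadUser");
        exit();
        enter("renderPage");
        exit();
        exit();
        reconstruct(tx).forEach(System.out::println);
    }
}
```

The same transaction ID is what would be piggy-backed onto a remote call, so that events logged on the server side can be merged into the same trace.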
RPC (remote procedure call) systems generally employ a standard mechanism, known as ‘stubs and skeletons’, to hide the complexities of the network from the client making any remote calls. Stubs and skeletons work as follows:

· The stub masks the low level networking issues from the client and forwards the request on to a server side proxy object (the skeleton).
· The skeleton masks the low level networking issues from the distributed component. It also delegates the remote request to the distributed component.
· The distributed component then handles the request and returns control to the skeleton, which in turn returns control to the stub.
· The stub, in turn, hands back to the client.

One way to trace transactions across the network is to take advantage of the stubs and skeletons model. Essentially, the stubs and skeletons can be modified such that the unique transaction ID piggy-backs on the communication and is sent as part of the request from the stub and the response from the skeleton. The implementation may differ from platform to platform, but the principles can generally be applied. For example, Remote Method Invocation (RMI) is used for distributed communication on Java platforms, and details on how this can be achieved for RMI can be found in one of my older research papers.

[Figure: RMI with Custom Stub Wrapper and Server Side Interception Point]

Going Beyond APM

The above transaction tracing will give you visibility at a method call level across your distributed application. However, sometimes external factors outside your application code (server resources, SaaS components your app communicates with, network speed, etc.) will have an impact on your overall system performance. One way of enhancing the information provided by your APM solution is to collect and analyze your log data.
Logs provide a very flexible way of gathering information on your system behavior without any requirement for deep instrumentation or any of the complex techniques described above. Furthermore, you may not be able to instrument every software component or cloud service that makes up your overall system, yet almost all of these will produce valuable log data containing system usage and performance information. In such scenarios, combining APM and log data will give you the complete picture. Below are some tips that will allow you to map logs to APM transactions or to enhance them with data from additional components such as OS, middleware, or network level components:

· Logging the Transaction ID: Any log data produced by your apps can be easily mapped to transactions produced by your APM tool by logging the transaction ID used to trace the transaction.
· Client Side Logging & Logging User/Session/Account ID: Logging other unique identifiers, such as session ID, user ID, or account ID, can also help you trace transactions across log events where the transaction ID used by the APM tool is unavailable. This can be particularly useful if you are logging events from the client side as well as from back end components and want to be able to view the sequence of log events related to a given user action or session, for example.
· Same Time Frame For System Logs: Where unique identifiers have not been logged as part of your log events, viewing logs within the same time frame window as your APM tool will help you narrow down related log events and will give you a view into system behavior during the transaction time frame.
· Correlating with Other KPIs: Logs will contain key performance and resource usage metrics that can be rolled up into trend lines and charts.
Correlating APM transaction traces with performance metrics and server resource usage information can lead to a quicker root cause diagnosis than investigating transaction traces in isolation.

Build It Yourself?

Naturally, nobody in their right mind would actually go about building their own APM solution; it’s almost as hare-brained as rolling your own logging solution. The good news is you don’t have to do either: simply take advantage of the new Logentries and New Relic integration so that you can trace transactions from end to end and from top to bottom of your entire distributed software stack.

September 24, 2014

by Trevor Parsons

· 6,846 Views · 1 Like

How to Resolve Maven's ''Failure to Transfer'' Error

Learn how to resolve the ''failure to transfer'' error encountered in Maven in this quick tutorial.

September 24, 2014

by Jose Roy Javelosa

· 129,936 Views · 3 Likes

JPA tutorial: Mapping Entities – Part 1

In this article I will discuss the entity mapping procedure in JPA. For my examples I will use the same schema that I used in one of my previous articles. In my two previous articles I explained how to set up JPA in a Java SE environment. I do not intend to write the setup procedure for a web application because most of the tutorials on the web do exactly that. So let’s skip directly to object relational mapping, or entity mapping.

Wikipedia defines Object Relational Mapping as follows -

Object-relational mapping (ORM, O/RM, and O/R mapping) in computer science is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a “virtual object database” that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to create their own ORM tools.

Typically, mapping is the process through which you provide the necessary information about your database to your ORM tool. The tool then uses this information to read/write objects into the database. Usually you tell your ORM tool the table name to which an object of a certain type will be saved. You also provide the column names to which an object’s properties will be mapped. Relations between different object types also need to be specified. All of this seems like a lot of work, but fortunately JPA follows what is known as the “Convention over Configuration” approach, which means that if you adopt the default values provided by JPA, you will have to configure very little of your application.

In order to properly map a type in JPA, you will at a minimum need to do the following -

· Mark your class with the @Entity annotation. These classes are called entities.
· Mark one of the properties/getter methods of the class with the @Id annotation.

And that’s it.
Your entities are ready to be saved into the database because JPA configures all other aspects of the mapping automatically. This also shows the productivity gain that you can enjoy by using JPA. You do not need to manually populate your objects each time you query the database, saving you from writing lots of boilerplate code.

Let’s see an example. Consider the following Address entity which I have mapped according to the above two rules -

    import javax.persistence.Entity;
    import javax.persistence.Id;

    // Getters return the field; setters return this so that calls can be chained.
    @Entity
    public class Address {
        @Id
        private Integer id;
        private String street;
        private String city;
        private String province;
        private String country;
        private String postcode;

        public Integer getId() { return id; }
        public Address setId(Integer id) { this.id = id; return this; }

        public String getStreet() { return street; }
        public Address setStreet(String street) { this.street = street; return this; }

        public String getCity() { return city; }
        public Address setCity(String city) { this.city = city; return this; }

        public String getProvince() { return province; }
        public Address setProvince(String province) { this.province = province; return this; }

        public String getCountry() { return country; }
        public Address setCountry(String country) { this.country = country; return this; }

        public String getPostcode() { return postcode; }
        public Address setPostcode(String postcode) { this.postcode = postcode; return this; }
    }

Now, depending on your environment, you may or may not need to add this entity declaration to your persistence.xml file, which I have explained in my previous article.
OK then, let’s save some objects! The following code snippet does exactly that:

    import com.keertimaan.javasamples.jpaexample.entity.Address;
    import javax.persistence.EntityManager;
    import com.keertimaan.javasamples.jpaexample.persistenceutil.PersistenceManager;

    public class Main {
        public static void main(String[] args) {
            EntityManager em = PersistenceManager.INSTANCE.getEntityManager();

            Address address = new Address().setId(1)
                    .setCity("Dhaka")
                    .setCountry("Bangladesh")
                    .setPostcode("1000")
                    .setStreet("Poribagh");
            em.getTransaction().begin();
            em.persist(address);
            em.getTransaction().commit();
            System.out.println("address is saved! It has id: " + address.getId());

            Address anotherAddress = new Address().setId(2)
                    .setCity("Shinagawa-ku, Tokyo")
                    .setCountry("Japan")
                    .setPostcode("140-0002")
                    .setStreet("Shinagawa Seaside Area");
            em.getTransaction().begin();
            em.persist(anotherAddress);
            em.getTransaction().commit();
            em.close();
            System.out.println("anotherAddress is saved! It has id: " + anotherAddress.getId());

            PersistenceManager.INSTANCE.close();
        }
    }

Let’s take a step back at this point and think about what we would have needed to do had we used plain JDBC for persistence. We would have had to write the insert queries by hand and map each attribute to its corresponding column for both objects, which would have required a lot of code.

An important point to note about the example is the way I am setting the id of the entities. This approach only works for short examples like this one; it is not suitable for real applications. You’d typically want to use, say, auto-incremented id columns or database sequences to generate the id values for your entities. For my example I am using a MySQL database, and all of my id columns are set to auto increment. To reflect this in my entity model, I can put an additional annotation called @GeneratedValue on the id property.
This tells JPA that the id value for this entity will be generated automatically by the database during the insert, and that JPA should fetch that id after the insert, typically by reading it back from the database. With the above modification, my entity class becomes something like this:

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.GeneratedValue;

    @Entity
    public class Address {
        @Id
        @GeneratedValue
        private Integer id;

        // Rest of the class code........

And the insert procedure becomes this:

    Address anotherAddress = new Address()
            .setCity("Shinagawa-ku, Tokyo")
            .setCountry("Japan")
            .setPostcode("140-0002")
            .setStreet("Shinagawa Seaside Area");
    em.getTransaction().begin();
    em.persist(anotherAddress);
    em.getTransaction().commit();

How did JPA figure out which table to use to save Address entities? It turns out to be pretty straightforward:

1. When no explicit table information is provided with the mapping, JPA tries to find a table whose name matches the entity name.
2. The name of an entity can be specified explicitly with the “name” attribute of the @Entity annotation. If no name attribute is found, JPA assumes a default name for the entity.
3. The default name of an entity is the simple name (not the fully qualified name) of the entity class, which in our case is Address. So our entity name is determined to be “Address”.
4. Since our entity name is “Address”, JPA looks for a table in the database whose name is “Address” (remember, in many setups database table names are case-insensitive, although this depends on the database and platform). From our schema, we can see that this is indeed the case.

So how did JPA figure out which columns to use to save property values for Address entities? At this point I think you will be able to guess that easily. If you cannot, stay tuned for my next post!

Until next time.

[ Full working code can be found at github.]

September 22, 2014

by MD Sayem Ahmed

· 9,740 Views · 2 Likes

How to Setup Custom Remote Deployment Repositories for JBoss BPM Suite

In this article we want to share another configuration property that can be surprisingly helpful when setting up your JBoss BPM Suite. Previously we outlined a basic set of configuration properties to give you a few tricks for installing your own JBoss BRMS or JBoss BPM Suite products. As JBoss BPM Suite is a superset that includes the full JBoss BRMS functionality, the rest of this article refers only to JBoss BPM Suite but applies to both products.

We will show you how to modify your JBoss EAP container configuration to point the products at a custom deployment repository by adjusting a single configuration property.

Maven repository

By default the products look for your Maven settings in the default settings.xml, found either under the directory set in the M2_HOME variable or in the user's home directory at .m2/settings.xml. The following system property can be added to the JBoss EAP standalone.xml configuration file to point to any file containing your custom settings.

kie.maven.settings.custom
    Location of the Maven configuration file where it can find its settings.
    Default: M2_HOME/conf/settings.xml or .m2/settings.xml in the user's home directory

Example usage in JBoss EAP

When initially setting up the product for use on JBoss EAP containers, one can adjust the configuration with the help of system properties. Below we show how to configure an installation to point to our custom Maven deployment repository by using a custom settings file we will call bpmsuite-settings.xml.

We hope this helps you with configuring your own custom deployment repositories and enables you to tie into existing continuous integration infrastructure that might exist in your organization.

September 19, 2014

by Eric D. Schabell

CORE

· 6,199 Views · 1 Like

MySQL 101: Monitor Disk I/O with pt-diskstats

Originally Written by Muhammad Irfan

Here on the Percona Support team we often ask customers to retrieve disk stats to monitor disk IO and to measure block device IOPS and latency. There are a number of tools available to monitor IO on Linux. iostat is one of the popular ones, and Percona Toolkit, which is free, contains the pt-diskstats tool for this purpose. The pt-diskstats tool is similar to iostat, but it is more interactive and provides extended information. pt-diskstats reports current disk activity, showing statistics for the most recent interval (1 second by default), and continues until interrupted. It collects samples from /proc/diskstats.

In this post I will share some examples of how to monitor the IO subsystem and check whether it is performing properly or whether any disks are a limiting factor, all by using the pt-diskstats tool.

pt-diskstats output consists of a number of columns, and in order to interpret it we need to know what each column represents.

rd_s is the number of reads per second, while wr_s is the number of writes per second.

rd_rt and wr_rt show the average response time in milliseconds for reads and writes respectively. This is similar to the await column in iostat output, but pt-diskstats shows separate response times for reads and writes at the disk level. Just a note: modern iostat splits read and write latency out, but most distros don’t have the latest iostat in their sysstat (or equivalent) package.

rd_mrg and wr_mrg are two other important columns in pt-diskstats output. *_mrg tells us how many of the original operations the IO elevator (disk scheduler) was able to merge to reduce IOPS, so it lets us know whether the IO scheduler was able to consolidate many or few operations. A high rd_mrg/wr_mrg percentage indicates a largely sequential IO workload; a low percentage indicates a mostly random one.
Binary logs, redo logs (aka ib_logfile*), the undo log and the doublewrite buffer all need sequential writes.

qtime and stime are the last two columns in pt-diskstats output. qtime reflects the time spent in the disk scheduler queue, i.e. the average queue time before a request is sent to the physical device. stime is the average service time, which is the time the physical device takes to process the request. Note that qtime is not broken down by reads and writes; if qtime accounts for most of the response time, that points toward the disk scheduler. Also note that service time (the stime field in pt-diskstats and the svctm field in iostat output) is not reliable on Linux. If you read the iostat manual you will see it is deprecated.

Along with these, there are many other parameters for pt-diskstats; you can find the full documentation here. Below is an example of pt-diskstats in action. I used the --devices-regex option, which prints only information for devices that match this Perl regex.

    $ pt-diskstats --devices-regex=sd --interval 5
    #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
    1.1 sda 21.6 22.8 0.5 45% 1.2 29.4 275.5 4.0 1.1 0% 40.0 145.1 65% 158 297.1 155.0 2.1
    1.1 sdb 15.0 21.0 0.3 33% 0.1 5.2 0.0 0.0 0.0 0% 0.0 0.0 11% 1 15.0 0.5 4.7
    1.1 sdc 5.6 10.0 0.1 0% 0.0 5.2 1.9 6.0 0.0 33% 0.0 2.0 3% 0 7.5 0.4 3.6
    1.1 sdd 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0
    5.0 sda 17.0 14.8 0.2 64% 3.1 66.7 404.9 4.6 1.8 14% 140.9 298.5 100% 111 421.9 277.6 1.9
    5.0 sdb 14.0 19.9 0.3 48% 0.1 5.5 0.4 174.0 0.1 98% 0.0 0.0 11% 0 14.4 0.9 2.4
    5.0 sdc 3.6 27.1 0.1 61% 0.0 3.5 2.8 5.7 0.0 30% 0.0 2.0 3% 0 6.4 0.7 2.4
    5.0 sdd 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0

These are the stats from 7200 RPM SATA disks. As you can see, the write response time is very high, and most of it is made up of IO queue time. This shows the problem exactly.
The problem is that the IO subsystem cannot handle the write workload, because the number of writes being performed is far beyond what it can service. The disks cannot keep up with all the requests arriving concurrently. How the workload behaves depends a lot on where the hot data is stored, and in this particular case the workload hits only one of the four disks. A single 7.2K RPM disk can only do about 100 random writes per second, which is not a lot for a heavy workload.

It is not a hardware fault as such but a hardware capacity issue. The kind of workload present, and the number of writes performed per second, are not something this IO subsystem can handle efficiently. As the disk stats show, this server generates mostly writes.

Let me show you a second example. Here you can see read latency: rd_rt is consistently between 10ms and 30ms. How high it goes depends on how fast the disks spin and on the number of disks. Possible solutions are to optimize queries to avoid table scans, to use memcached where possible, and to use SSDs, which can provide good I/O performance under high concurrency. You will find this post on SSDs from our CEO, Peter Zaitsev, useful.
    #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
    1.0 sdb 33.0 29.1 0.9 0% 1.1 34.7 7.0 10.3 0.1 61% 0.0 0.4 99% 1 40.0 2.2 19.5
    1.0 sdb1 0.0 0.0 0.0 0% 0.0 0.0 7.0 10.3 0.1 61% 0.0 0.4 1% 0 7.0 0.0 0.4
    1.0 sdb2 33.0 29.1 0.9 0% 1.1 34.7 0.0 0.0 0.0 0% 0.0 0.0 99% 1 33.0 3.5 30.2
    1.0 sdb 81.9 28.5 2.3 0% 1.1 14.0 0.0 0.0 0.0 0% 0.0 0.0 99% 1 81.9 2.0 12.0
    1.0 sdb1 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0
    1.0 sdb2 81.9 28.5 2.3 0% 1.1 14.0 0.0 0.0 0.0 0% 0.0 0.0 99% 1 81.9 2.0 12.0
    1.0 sdb 50.0 25.7 1.3 0% 1.3 25.1 13.0 11.7 0.1 66% 0.0 0.7 99% 1 63.0 3.4 11.3
    1.0 sdb1 25.0 21.3 0.5 0% 0.6 25.2 13.0 11.7 0.1 66% 0.0 0.7 46% 1 38.0 3.2 7.3
    1.0 sdb2 25.0 30.1 0.7 0% 0.6 25.0 0.0 0.0 0.0 0% 0.0 0.0 56% 0 25.0 3.6 22.2

From the diskstats output below, it seems that IO is saturated by both reads and writes. You can see this in the high values in the rd_s and wr_s columns. In this particular case, putting the disks in either a RAID 5 array (better for read-only workloads) or a RAID 10 array is a good option, along with a battery-backed write cache (BBWC), as a single disk can be really bad for performance when you are IO bound.
    device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
    sdb1 362.0 27.4 9.7 0% 2.7 7.5 525.2 20.2 10.3 35% 6.4 8.0 100% 0 887.2 7.0 0.9
    sdb1 439.9 26.5 11.4 0% 3.4 7.7 545.7 20.8 11.1 34% 9.8 11.9 100% 0 985.6 9.6 0.8
    sdb1 576.6 26.5 14.9 0% 4.5 7.8 400.2 19.9 7.8 34% 6.7 10.9 100% 0 976.8 8.6 0.8
    sdb1 410.8 24.2 9.7 0% 2.9 7.1 403.1 18.3 7.2 34% 10.8 17.7 100% 0 813.9 12.5 1.0
    sdb1 378.4 24.6 9.1 0% 2.7 7.3 506.1 16.5 8.2 33% 5.7 7.6 100% 0 884.4 6.6 0.9
    sdb1 572.8 26.1 14.6 0% 4.8 8.4 422.6 17.2 7.1 30% 1.7 2.8 100% 0 995.4 4.7 0.8
    sdb1 429.2 23.0 9.6 0% 3.2 7.4 511.9 14.5 7.2 31% 1.2 1.7 100% 0 941.2 3.6 0.9

The following example reflects write-heavy activity, but the write response time is very good, under 1ms, which shows that the disks are healthy and capable of handling a high number of IOPS.

    #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
    1.0 dm-0 530.8 16.0 8.3 0% 0.3 0.5 6124.0 5.1 30.7 0% 1.7 0.3 86% 2 6654.8 0.2 0.1
    2.0 dm-0 633.1 16.1 10.0 0% 0.3 0.5 6173.0 6.1 36.6 0% 1.7 0.3 88% 1 6806.1 0.2 0.1
    3.0 dm-0 731.8 16.0 11.5 0% 0.4 0.5 6064.2 5.8 34.1 0% 1.9 0.3 90% 2 6795.9 0.2 0.1
    4.0 dm-0 711.1 16.0 11.1 0% 0.3 0.5 6448.5 5.4 34.3 0% 1.8 0.3 92% 2 7159.6 0.2 0.1
    5.0 dm-0 700.1 16.0 10.9 0% 0.4 0.5 5689.4 5.8 32.2 0% 1.9 0.3 88% 0 6389.5 0.2 0.1
    6.0 dm-0 774.1 16.0 12.1 0% 0.3 0.4 6409.5 5.5 34.2 0% 1.7 0.3 86% 0 7183.5 0.2 0.1
    7.0 dm-0 849.6 16.0 13.3 0% 0.4 0.5 6151.2 5.4 32.3 0% 1.9 0.3 88% 3 7000.8 0.2 0.1
    8.0 dm-0 664.2 16.0 10.4 0% 0.3 0.5 6349.2 5.7 35.1 0% 2.0 0.3 90% 2 7013.4 0.2 0.1
    9.0 dm-0 951.0 16.0 14.9 0% 0.4 0.4 5807.0 5.3 29.9 0% 1.8 0.3 90% 3 6758.0 0.2 0.1
    10.0 dm-0 742.0 16.0 11.6 0% 0.3 0.5 6461.1 5.1 32.2 0% 1.7 0.3 87% 1 7203.2 0.2 0.1

Let me show you a final example.
I used the --interval and --iterations parameters for pt-diskstats, which tell it to wait a number of seconds before printing the next disk stats and to limit the number of samples, respectively. If you look at the third iteration, you will see high latency (rd_rt, wr_rt), mostly for reads. You can also see high values for queue time (qtime) and service time (stime), where qtime is related to the disk IO scheduler settings. For MySQL database servers we usually recommend noop or deadline instead of the default cfq.

    $ pt-diskstats --interval=20 --iterations=3
    #ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
    10.4 hda 11.7 4.0 0.0 0% 0.0 1.1 40.7 11.7 0.5 26% 0.1 2.1 10% 0 52.5 0.4 1.5
    10.4 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.4 7.0 0.0 43% 0.0 0.1 0% 0 0.4 0.0 0.1
    10.4 hda3 0.0 0.0 0.0 0% 0.0 0.0 0.4 107.0 0.0 96% 0.0 0.2 0% 0 0.4 0.0 0.2
    10.4 hda5 0.0 0.0 0.0 0% 0.0 0.0 0.7 20.0 0.0 80% 0.0 0.3 0% 0 0.7 0.1 0.2
    10.4 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.1 4.0 0.0 0% 0.0 4.0 0% 0 0.1 0.0 4.0
    10.4 hda9 11.7 4.0 0.0 0% 0.0 1.1 39.2 10.7 0.4 3% 0.1 2.7 9% 0 50.9 0.5 1.8
    10.4 drbd1 11.7 4.0 0.0 0% 0.0 1.1 39.1 10.7 0.4 0% 0.1 2.8 9% 0 50.8 0.5 1.7
    20.0 hda 14.6 4.0 0.1 0% 0.0 1.4 39.5 12.3 0.5 26% 0.3 6.4 18% 0 54.1 2.6 2.7
    20.0 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.4 9.1 0.0 56% 0.0 42.0 3% 0 0.4 0.0 42.0
    20.0 hda3 0.0 0.0 0.0 0% 0.0 0.0 1.5 22.3 0.0 82% 0.0 1.5 0% 0 1.5 1.2 0.3
    20.0 hda5 0.0 0.0 0.0 0% 0.0 0.0 1.1 18.9 0.0 79% 0.1 21.4 11% 0 1.1 0.1 21.3
    20.0 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.8 10.4 0.0 62% 0.0 1.5 0% 0 0.8 1.3 0.2
    20.0 hda9 14.6 4.0 0.1 0% 0.0 1.4 35.8 11.7 0.4 3% 0.2 4.9 18% 0 50.4 0.5 3.5
    20.0 drbd1 14.6 4.0 0.1 0% 0.0 1.4 36.4 11.6 0.4 0% 0.2 5.1 17% 0 51.0 0.5 3.4
    20.0 hda 0.9 4.0 0.0 0% 0.2 251.9 28.8 61.8 1.7 92% 4.5 13.1 31% 2 29.6 12.8 0.9
    20.0 hda2 0.0 0.0 0.0 0% 0.0 0.0 0.6 8.3 0.0 52% 0.1 98.2 6% 0 0.6 48.9 49.3
    20.0 hda3 0.0 0.0 0.0 0% 0.0 0.0 2.0 23.2 0.0 83% 0.0 1.4 0% 0 2.0 1.2 0.3
    20.0 hda5 0.0 0.0 0.0 0% 0.0 0.0 4.9 249.4 1.2 98% 4.0 13.2 9% 0 4.9 12.9 0.3
    20.0 hda6 0.0 0.0 0.0 0% 0.0 0.0 0.0 0.0 0.0 0% 0.0 0.0 0% 0 0.0 0.0 0.0
    20.0 hda9 0.9 4.0 0.0 0% 0.2 251.9 21.3 24.2 0.5 32% 0.4 12.9 31% 2 22.2 10.2 9.7
    20.0 drbd1 0.9 4.0 0.0 0% 0.2 251.9 30.6 17.0 0.5 0% 0.7 24.1 30% 5 31.4 21.0 9.5

You can see the busy column in pt-diskstats output, which is the same as the util column in iostat and indicates utilization. pt-diskstats is quite similar to iostat, but it is more interactive and provides more information. The busy percentage only tells us how long the IO subsystem was busy; it does not indicate capacity. So the only time you care about %busy is when it is 100% and, at the same time, latency (await in iostat, rd_rt/wr_rt in pt-diskstats output) rises above, say, 5ms. You can estimate the capacity of your IO subsystem and then look at the IOPS being consumed (the r/s + w/s columns in iostat, or rd_s + wr_s in pt-diskstats). Also, the system can process more than one request in parallel (in the case of RAID), so %busy can go beyond 100% in pt-diskstats output.

If you need to check disk throughput and block device IOPS, run the command below to capture metrics from your IO subsystem and see whether utilization matches other worrisome symptoms. I would suggest capturing disk stats during peak load. Output can be grouped by sample or by disk using the --group-by option. You can use the sysbench benchmark tool to measure database server performance for this purpose. You will find this link useful for sysbench tool details.

    $ pt-diskstats --group-by=all --iterations=7200 > /tmp/pt-diskstats.out;

Conclusion: pt-diskstats is one of the finest tools in Percona Toolkit. By using it you can easily spot disk bottlenecks, measure the IO subsystem and identify how many IOPS your drive can handle (i.e. disk capacity).
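The rule of thumb above (care about busy only when it reaches 100% while read or write latency climbs past roughly 5ms) can be applied to a captured file with a few lines of scripting. The following is a minimal sketch, not part of Percona Toolkit: the function names parse_diskstats and saturated are made up for illustration, and it assumes the capture uses the same whitespace-separated column layout as the samples in this post.

```python
def parse_diskstats(text):
    """Parse pt-diskstats text output into a list of dicts keyed by column name.

    Assumes a header line containing the word 'device' followed by
    whitespace-separated data rows, as in the samples in this post.
    """
    rows = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "device" in fields:
            # Header line; drop the leading '#' from '#ts' if present.
            columns = [name.lstrip("#") for name in fields]
            continue
        if columns is None or len(fields) != len(columns):
            continue  # skip anything that does not look like a data row
        row = {}
        for name, value in zip(columns, fields):
            # Percentages such as '45%' are stored as plain numbers.
            row[name] = value if name == "device" else float(value.rstrip("%"))
        rows.append(row)
    return rows


def saturated(rows, busy_pct=100.0, latency_ms=5.0):
    """Return rows where the device is fully busy and latency is above the threshold."""
    return [
        row for row in rows
        if row["busy"] >= busy_pct and max(row["rd_rt"], row["wr_rt"]) > latency_ms
    ]


# Three rows copied from the first example in this post.
SAMPLE = """\
#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
1.1 sda 21.6 22.8 0.5 45% 1.2 29.4 275.5 4.0 1.1 0% 40.0 145.1 65% 158 297.1 155.0 2.1
5.0 sda 17.0 14.8 0.2 64% 3.1 66.7 404.9 4.6 1.8 14% 140.9 298.5 100% 111 421.9 277.6 1.9
5.0 sdb 14.0 19.9 0.3 48% 0.1 5.5 0.4 174.0 0.1 98% 0.0 0.0 11% 0 14.4 0.9 2.4
"""

rows = parse_diskstats(SAMPLE)
for row in saturated(rows):
    # prints: sda 66.7 298.5 277.6
    print(row["device"], row["rd_rt"], row["wr_rt"], row["qtime"])
```

To use it on a real capture, the contents of a file such as /tmp/pt-diskstats.out would be read and passed to parse_diskstats in place of SAMPLE. The 100% and 5ms thresholds are the ones suggested in the text and are exposed as parameters so they can be tuned.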

September 19, 2014

by Peter Zaitsev

· 5,266 Views

Java - Four Security Vulnerabilities Related Coding Practices to Avoid

This article presents the top four security-vulnerability-related coding practices to avoid when programming in Java. Recently, I came across a few Java projects where these instances were found. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

Following are the key points described later in this article:

- Executing a dynamically generated SQL statement
- Directly writing an HTTP parameter to servlet output
- Creating an SQL PreparedStatement from a dynamic string
- Array is stored directly

Executing a Dynamically Generated SQL Statement

This is the most common of all. One can find mention of this vulnerability in several places. As a matter of fact, many developers are well aware of it, yet they still end up making the mistake once in a while. In several DAO classes, code such as the following was found, which could lead to SQL injection attacks.

    StringBuilder query = new StringBuilder();
    query.append("select * from user u where u.name in (" + namesString + ")");
    try {
        Connection connection = getConnection();
        Statement statement = connection.createStatement();
        resultSet = statement.executeQuery(query.toString());
    }
    // catch/finally blocks omitted

Instead of the above query, one could make use of a prepared statement, as demonstrated in the code below. It not only makes the code less vulnerable to SQL injection attacks but can also make it more efficient. Note that a single ? placeholder binds exactly one value, so an IN list needs one placeholder per element. The snippet assumes the individual names are available as a List<String> called names (a name introduced here for illustration).

    // One "?" per name, e.g. "?, ?, ?" for three names.
    String placeholders = String.join(", ", Collections.nCopies(names.size(), "?"));
    StringBuilder query = new StringBuilder();
    query.append("select * from user u where u.name in (" + placeholders + ")");
    try {
        Connection connection = getConnection();
        PreparedStatement statement = connection.prepareStatement(query.toString());
        // JDBC parameter indexes start at 1.
        for (int i = 0; i < names.size(); i++) {
            statement.setString(i + 1, names.get(i));
        }
        resultSet = statement.executeQuery();
    }
    // catch/finally blocks omitted

Directly Writing an HTTP Parameter to Servlet Output

In servlet classes, I found instances where an HTTP request parameter was written as is to the output stream, without any validation checks.
The following code demonstrates this:

    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String content = request.getParameter("some_param");
        //
        // .... some code goes here
        //
        response.getWriter().print(content);
    }

Note that the above code does not persist anything. Code like this may lead to what is called a reflected (or non-persistent) cross-site scripting (XSS) vulnerability. Reflected XSS occurs when an attacker injects browser-executable code within a single HTTP response. By definition (being non-persistent), the injected attack is not stored within the application; it affects only users who open a maliciously crafted link or third-party web page. The attack string is included as part of the crafted URI or HTTP parameters, improperly processed by the application, and returned to the victim. You can read more details on the OWASP page on reflected XSS.

Creating an SQL PreparedStatement from Dynamic Query String

What this essentially means is that although a PreparedStatement was used, the query was still assembled as a string and not in the parametrized way recommended for prepared statements. If unchecked, tainted data from a user ends up in a String where SQL injection can make the query behave in unexpected and undesirable ways. One should instead parametrize the query and use the PreparedStatement appropriately. Take a look at the following code to identify the vulnerability.
    StringBuilder query = new StringBuilder();
    query.append("select * from user u where u.name in (" + namesString + ")");
    try {
        Connection connection = getConnection();
        PreparedStatement statement = connection.prepareStatement(query.toString());
        resultSet = statement.executeQuery();
    }
    // catch/finally blocks omitted

Array is Stored Directly

When an object stores a reference to an array passed in by a caller, code outside the object can change the array's contents afterwards, because the caller still holds a reference to the same array. An attacker could use this to alter the stored data, and the program may then behave inconsistently. The solution is to make a copy within the object when the array is passed in. That way, a subsequent modification of the caller's array won’t affect the array stored within the object. You can read the details on the following stackoverflow page. The following code shows the vulnerability:

    // Note that values is a String array in the code below.
    //
    public void setValues(String[] somevalues) {
        this.values = somevalues;
    }
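The copy-on-store fix described above is not shown in code, so here is a minimal sketch of the idea. It is written in Python rather than Java purely for illustration, and the class and method names (ValueHolder, set_values_unsafe, set_values, get_values) are hypothetical. In Java the corresponding step would be copying the array, for example with Arrays.copyOf, before assigning it to the field.

```python
class ValueHolder:
    """Holds a list of values; contrasts storing a reference with storing a copy."""

    def __init__(self):
        self._values = []

    def set_values_unsafe(self, values):
        # Stores the caller's list object itself: later changes made by the
        # caller are visible inside this object.
        self._values = values

    def set_values(self, values):
        # Stores a private copy: later changes made by the caller have no effect.
        self._values = list(values)

    def get_values(self):
        # Hand out a copy as well, so callers cannot mutate internal state.
        return list(self._values)


caller_list = ["alice", "bob"]

unsafe = ValueHolder()
unsafe.set_values_unsafe(caller_list)

safe = ValueHolder()
safe.set_values(caller_list)

# The caller modifies its list after handing it over.
caller_list.append("mallory")

print(unsafe.get_values())  # ['alice', 'bob', 'mallory']
print(safe.get_values())    # ['alice', 'bob']
```

The copy here is shallow, which is enough for immutable elements such as strings; for mutable elements a deeper copy may be needed.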

September 19, 2014

by Ajitesh Kumar

· 19,436 Views · 1 Like

A Closer Look at the MySQL ibdata1 Disk Space Issue and Big Tables

A recurring customer issue seen by the Percona Support team involves how to make the ibdata1 file “shrink” within MySQL. I'll show you how to handle big tables.

September 16, 2014

by Peter Zaitsev

· 7,855 Views

How to Quickly Get Started with Sonar

Jump into Sonar with this tutorial that provides installation instructions for SonarQube and the Code Analyzer, followed by a Java example.

September 15, 2014

by Ajitesh Kumar

· 159,364 Views · 3 Likes

DB2 CONCAT (Concatenate) Function

The DB2 CONCAT function will combine two separate expressions to form a single string expression.

September 15, 2014

by Drew Harvey

· 140,560 Views

Python 101: An Intro to Pony ORM

The Pony ORM project is another object relational mapper package for Python. They allow you to query a database using generators. They also have an online ER Diagram Editor that is supposed to help you create a model. They are also one of the only Python packages I’ve seen with a multi-licensing scheme where you can develop using a GNU license or purchase a license for non-open source work. See their website for additional details. In this article, we will spend some time learning the basics of this package.

Getting Started

Since this project is not included with Python, you will need to download and install it. If you have pip, then you can just do this:

    pip install pony

Otherwise you’ll have to download the source and install it via its setup.py script.

Creating the Database

We will start out by creating a database to hold some music. We will need two tables: Artist and Album. Let’s get started! (The later examples import from a module called models, so save this code as models.py.)

    import datetime
    import pony.orm as pny

    database = pny.Database("sqlite", "music.sqlite", create_db=True)


    class Artist(database.Entity):
        """Pony ORM model of the Artist table"""
        name = pny.Required(str)
        albums = pny.Set("Album")


    class Album(database.Entity):
        """Pony ORM model of the Album table"""
        artist = pny.Required(Artist)
        title = pny.Required(str)
        release_date = pny.Required(datetime.date)
        publisher = pny.Required(str)
        media_type = pny.Required(str)


    # turn on debug mode
    pny.sql_debug(True)

    # map the models to the database
    # and create the tables, if they don't exist
    database.generate_mapping(create_tables=True)

Pony ORM will create our primary key for us automatically if we don’t specify one. To create a foreign key, all you need to do is pass the model class into a different table, as we did in the Album class. Each Required field takes a Python type.
Most of our fields are strings, with one being a datetime.date object. Next we turn on debug mode, which will output the SQL that Pony generates when it creates the tables in the last statement. Note that if you run this code multiple times, you won’t recreate the tables. Pony will check to see if the tables exist before creating them. If you run the code above, you should see something like this get generated as output:

    GET CONNECTION FROM THE LOCAL POOL
    PRAGMA foreign_keys = false
    BEGIN IMMEDIATE TRANSACTION
    CREATE TABLE "Artist" (
      "id" INTEGER PRIMARY KEY AUTOINCREMENT,
      "name" TEXT NOT NULL
    )

    CREATE TABLE "Album" (
      "id" INTEGER PRIMARY KEY AUTOINCREMENT,
      "artist" INTEGER NOT NULL REFERENCES "Artist" ("id"),
      "title" TEXT NOT NULL,
      "release_date" DATE NOT NULL,
      "publisher" TEXT NOT NULL,
      "media_type" TEXT NOT NULL
    )

    CREATE INDEX "idx_album__artist" ON "Album" ("artist")

    SELECT "Album"."id", "Album"."artist", "Album"."title", "Album"."release_date", "Album"."publisher", "Album"."media_type"
    FROM "Album" "Album"
    WHERE 0 = 1

    SELECT "Artist"."id", "Artist"."name"
    FROM "Artist" "Artist"
    WHERE 0 = 1

    COMMIT
    PRAGMA foreign_keys = true
    CLOSE CONNECTION

Wasn’t that neat? Now we’re ready to learn how to add data to our database.

How to Insert / Add Data to Your Tables

Pony makes adding data to your tables pretty painless.
Let’s take a look at how easy it is:

    import datetime
    import pony.orm as pny

    from models import Album, Artist


    @pny.db_session
    def add_data():
        """Add a few artists and albums to the database"""
        new_artist = Artist(name="Newsboys")
        bands = ["MXPX", "Kutless", "Thousand Foot Krutch"]
        for band in bands:
            artist = Artist(name=band)

        album = Album(artist=new_artist,
                      title="Read All About It",
                      release_date=datetime.date(1988, 12, 1),
                      publisher="Refuge",
                      media_type="CD")

        albums = [{"artist": new_artist,
                   "title": "Hell is for Wimps",
                   "release_date": datetime.date(1990, 7, 31),
                   "publisher": "Sparrow",
                   "media_type": "CD"},
                  {"artist": new_artist,
                   "title": "Love Liberty Disco",
                   "release_date": datetime.date(1999, 11, 16),
                   "publisher": "Sparrow",
                   "media_type": "CD"},
                  {"artist": new_artist,
                   "title": "Thrive",
                   "release_date": datetime.date(2002, 3, 26),
                   "publisher": "Sparrow",
                   "media_type": "CD"}]

        for album in albums:
            a = Album(**album)


    if __name__ == "__main__":
        add_data()

        # use db_session as a context manager
        with pny.db_session:
            a = Artist(name="Skillet")

You will note that we need to use a decorator called db_session to work with the database. It takes care of opening a connection, committing the data and closing the connection. You can also use it as a context manager, which is demonstrated at the very end of this piece of code.

Using Basic Queries to Modify Records with Pony ORM

In this section, we will learn how to make some basic queries and modify a few entries in our database.

    import pony.orm as pny

    from models import Artist, Album

    with pny.db_session:
        band = Artist.get(name="Newsboys")
        print(band.name)

        for record in band.albums:
            print(record.title)

        # update a record
        band_name = Artist.get(name="Kutless")
        band_name.name = "Beach Boys"

Here we use the db_session as a context manager. We make a query to get an artist object from the database and print its name.
Then we loop over the artist’s albums that are also contained in the returned object. Finally, we change one of the artist’s names.

Let’s try querying the database using a generator:

    result = pny.select(i.name for i in Artist)
    result.show()

If you run this code, you should see something like the following:

    i.name
    --------------------
    Newsboys
    MXPX
    Beach Boys
    Thousand Foot Krutch

The documentation has several other examples that are worth checking out. Note that Pony also supports using SQL itself via its select_by_sql and get_by_sql methods.

How to Delete Records in Pony ORM

Deleting records with Pony is also pretty easy. Let’s remove one of the bands from the database:

    import pony.orm as pny

    from models import Artist

    with pny.db_session:
        band = Artist.get(name="MXPX")
        band.delete()

Once more we use db_session to access the database and commit our changes. We use the band object’s delete method to remove the record. You will need to dig to find out if Pony supports cascading deletes where if you delete the Artist, it will also delete all the Albums that are connected to it. According to the docs, if the field is Required, then cascade is enabled.

Wrapping Up

Now you know the basics of using the Pony ORM package. I personally think the documentation needs a little work as you have to dig a lot to find some of the functionality that I felt should have been in the tutorials. Overall though, the documentation is still a lot better than most projects. Give it a go and see what you think!

Additional Resources

- Pony ORM’s website
- Pony documentation
- SQLAlchemy Tutorial
- An Intro to peewee
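On the cascading-delete question raised in the deletion section: for readers unsure what "cascade" means in practice, the sketch below shows the behavior at the plain SQL level using Python's built-in sqlite3 module. It does not use Pony and says nothing about how Pony implements cascading internally; the lowercase artist/album tables are a simplified stand-in for the models above, and the ON DELETE CASCADE clause is an assumption added for the demonstration (it does not appear in the SQL that Pony printed earlier).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite only enforces foreign keys (and therefore cascades) when this is on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE album (
        id INTEGER PRIMARY KEY,
        artist INTEGER NOT NULL REFERENCES artist (id) ON DELETE CASCADE,
        title TEXT NOT NULL
    )
    """
)

conn.execute("INSERT INTO artist (id, name) VALUES (1, 'Newsboys')")
conn.execute("INSERT INTO album (artist, title) VALUES (1, 'Thrive')")
conn.execute("INSERT INTO album (artist, title) VALUES (1, 'Love Liberty Disco')")

albums_before = conn.execute("SELECT COUNT(*) FROM album").fetchone()[0]

# Deleting the artist also removes every album row that references it.
conn.execute("DELETE FROM artist WHERE id = 1")

albums_after = conn.execute("SELECT COUNT(*) FROM album").fetchone()[0]
print(albums_before, albums_after)  # 2 0
```

A quick way to check what Pony itself does would be to delete an Artist inside a db_session and then count the remaining Album rows in the same way.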

September 12, 2014

by Mike Driscoll

· 8,748 Views