Databases Resources

The Latest Databases Topics

Sharding Pitfalls Part III: Chunk Balancing and Collection Limits

In Parts 1 and 2 we covered a number of common issues people run into when managing a sharded MongoDB cluster. In this final post of the series we will cover a subtle but important distinction in how a sharded cluster is balanced, as well as an interesting limitation that can be worked around relatively easily but is nonetheless surprising when it comes up.

6. Chunk balancing != data balancing != traffic balancing

The balancer in a sharded cluster cares about just one thing: are the chunks for a given collection evenly balanced across all shards? If they are not, it will take steps to rectify that imbalance. This all sounds perfectly logical, and even with extra complexity like tagging involved the logic is pretty straightforward. If we assume that all chunks are equal, then we can rest assured that our data is being evenly balanced across all the shards in our cluster and rest easy at night.

Although that is sometimes, perhaps even frequently, the case, it is not always true: chunks are not always equal. There can be massive “jumbo” chunks that exceed the maximum chunk size (64MiB), completely empty chunks, and everything in between.

Let’s use an example from our first pitfall, the monotonically increasing shard key. For our example, we have picked just such a key to shard on (date), and up until this point we have had just one shard and had not sharded the collection. We are about to add a second shard to our cluster, so we enable sharding on the collection and do the necessary admin work to add the new shard into the cluster. Once the collection is enabled for sharding, the first shard contains all the newly minted chunks. Let’s represent them in a simplified table of 10 chunks. This is not representative of a real data set, but it will do for illustrative purposes:

Table 1 - Initial Chunk Layout

Now we add our second shard. The balancer will kick in and attempt to distribute the chunks evenly. It will do this by moving the lowest range chunks to the new shard until the counts are identical. Once it is finished balancing, our table looks like this:

Table 2 - Balanced Chunk Layout

That looks pretty good at the moment, but let’s imagine that more recent chunks are more likely to have more activity (updates, say) than older chunks. Adding the traffic share estimates for each chunk shows that shard1 is taking far more traffic (72%) than shard2 (28%), despite the chunks seeming balanced overall based on approximate size. Hence, chunk balancing is not equal to traffic balancing.

Using that same example, let’s add another wrinkle: periodic deletion of old data. Every 3 months we run a job to delete any data older than 12 months. Let’s look at the impact of that on our table after we run it for the first time (assuming the first run happens on July 1st 2015).

Table 3 - Post-Delete Chunk Layout

The distribution of data is now completely skewed toward shard1; shard2 is in fact empty! However, the balancer is completely unaware of this imbalance. The chunk count has remained the same the entire time, and as far as the balancer is concerned the system is in a steady state. With no data on shard2, the traffic imbalance seen above will be even worse, and we have essentially negated the benefit of having a second shard for this collection.

Possible Mitigation Strategies

- If data and traffic balance are important, select an appropriate shard key
- Move chunks manually to address the imbalances: swap “hot” chunks for “cool” chunks, empty chunks for larger chunks

7. Waiting too long to shard a collection (collection too large)

This is not very common, but when it falls on your shoulders, it can be quite challenging to solve. There is a maximum data size for a collection when it is initially split, which is a function of the chunk size and data size as noted on the limits page.

If your collection contains less than 256GiB of data, then there will be no issue. If the collection size exceeds 256GiB but is less than 400GiB, then MongoDB may be able to do an initial split without any special measures being taken. Otherwise, with larger initial data sizes and the default settings, the initial split will fail.

It is worth noting that once split, the collection may grow as needed and without any real limitations, as long as you can continue to add shards as data size grows.

Possible Mitigation Strategies

Since the limit is dictated by the chunk size and the data size, and assuming there is not much to be done about the data size, the remaining variable is the chunk size. This is adjustable (the default is 64MiB) and can be raised in order to let a large collection split initially, and then reduced once that has been completed.

The required chunk size increase will depend on the actual data size. However, this is relatively easy to work out: simply divide your data size by 256GiB and then multiply that figure by 64MiB (rounding up if it is not a nice even number).

As an example, let’s consider a 4TiB collection:

- 4TiB divided by 256GiB = 16
- 64MiB x 16 = 1024MiB

Hence, set the max chunk size to 1024MiB, then perform the initial sharding of the collection, and finally reduce the chunk size back to 64MiB using the same procedure.

Thanks for reading through the Sharding Pitfall series! If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations.

Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.
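
The chunk size arithmetic described under pitfall 7 can be sketched in a few lines of Python. This is only an illustration of the rule of thumb from the article (data size divided by 256GiB, rounded up, times the 64MiB default); the function and constant names are made up for this sketch and are not part of any MongoDB API. It also ignores the 256GiB to 400GiB grey zone the article mentions and simply rounds up.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

DEFAULT_CHUNK_MIB = 64                 # default max chunk size quoted in the article
INITIAL_SPLIT_LIMIT_BYTES = 256 * GIB  # data size the article says always splits cleanly


def required_chunk_size_mib(data_size_bytes):
    """Return the temporary max chunk size (in MiB) suggested by the rule of thumb."""
    # Integer ceiling division: how many 256GiB "units" the collection spans.
    factor = -(-data_size_bytes // INITIAL_SPLIT_LIMIT_BYTES)
    # Never go below the default chunk size, even for tiny collections.
    return max(1, factor) * DEFAULT_CHUNK_MIB


if __name__ == "__main__":
    print(required_chunk_size_mib(4 * TIB))    # the article's 4TiB example
    print(required_chunk_size_mib(300 * GIB))  # not an even multiple, so rounded up
```

For the 4TiB example this gives 1024, matching the figure worked out by hand above. Once the initial split has completed, the chunk size would be set back to the default.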

October 27, 2014

by Francesca Krihely


How to Avoid Hash Collisions When Using MySQL’s CRC32 Function

Originally Written by Arunjith Aravindan

Percona Toolkit’s pt-table-checksum performs an online replication consistency check by executing checksum queries on the master, which produce different results on replicas that are inconsistent with the master, and the tool pt-table-sync synchronizes data efficiently between MySQL tables.

By default the tools use CRC32. Other good choices include MD5 and SHA1. If you have installed the FNV_64 user-defined function, pt-table-sync will detect it and prefer to use it, because it is much faster than the built-ins. You can also use MURMUR_HASH if you’ve installed that user-defined function. Both of these are distributed with Maatkit. For details please see the tool’s documentation.

Below are test cases similar to what you might have encountered. By using the table checksum we can confirm that two tables are identical, which is useful for verifying that a slave server is in sync with its master. The following test cases with pt-table-checksum and pt-table-sync will help you use the tools more accurately.

For example, in a master-slave setup we have a table with a primary key on column “a” and a unique key on column “b”. Here the master and slave tables are not in sync: two rows are identical and two rows differ. The pt-table-checksum tool should be able to identify the difference between master and slave, and pt-table-sync in this case should sync the tables with two REPLACE queries.

    Master             Slave
    +-----+-----+      +-----+-----+
    |  a  |  b  |      |  a  |  b  |
    +-----+-----+      +-----+-----+
    |  2  |  1  |      |  2  |  1  |
    |  1  |  2  |      |  1  |  2  |
    |  4  |  3  |      |  3  |  3  |
    |  3  |  4  |      |  4  |  4  |
    +-----+-----+      +-----+-----+

Case 1: Non-cryptographic hash function (CRC32) and the hash collision

The tables on the source and target have two differing rows, and intuitively the tools should identify the difference. The scenarios below explain how the tools can be used wrongly and how to avoid that, so that they behave more consistently and reliably in production.

By default the tools use CRC32 checksums, which are prone to hash collisions. In the case below the non-cryptographic function (CRC32) is not able to identify the two distinct rows, because the aggregated checksum comes out the same even though the tables hold different values.

    CREATE TABLE `t1` (
      `a` int(11) NOT NULL,
      `b` int(11) NOT NULL,
      PRIMARY KEY (`a`),
      UNIQUE KEY `b` (`b`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

    Master             Slave
    +-----+-----+      +-----+-----+
    |  a  |  b  |      |  a  |  b  |
    +-----+-----+      +-----+-----+
    |  2  |  1  |      |  2  |  1  |
    |  1  |  2  |      |  1  |  2  |
    |  4  |  3  |      |  3  |  3  |
    |  3  |  4  |      |  4  |  4  |
    +-----+-----+      +-----+-----+

Master:

    [root@localhost mysql]# pt-table-checksum --replicate=percona.checksum --create-replicate-table --databases=db1 --tables=t1 localhost --user=root --password=*** --no-check-binlog-format
                TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
    09-17T00:59:45      0      0        4       1       0   1.081 db1.t1

Slave:

    [root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=*** --verbose --sync-to-master 192.**.**.**
    # Syncing via replication h=192.**.**.**,p=...,u=root
    # DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE

Narrowed down to BIT_XOR:

Master:

    mysql> SELECT BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) FROM `db1`.`t1`;
    +------------------------------------------------------------+
    | BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) |
    +------------------------------------------------------------+
    |                                                    6581445 |
    +------------------------------------------------------------+
    1 row in set (0.00 sec)

Slave:

    mysql> SELECT BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) FROM `db1`.`t1`;
    +------------------------------------------------------------+
    | BIT_XOR(CAST(CRC32(CONCAT_WS('#', `a`, `b`)) AS UNSIGNED)) |
    +------------------------------------------------------------+
    |                                                    6581445 |
    +------------------------------------------------------------+
    1 row in set (0.16 sec)

Case 2: Adding a row to the slave

As the tools were not able to identify the difference, let us add a new row to the slave and check whether they can now identify the distinct values. So I am adding a new row (5,5) to the slave.

    mysql> insert into db1.t1 values(5,5);
    Query OK, 1 row affected (0.05 sec)

    Master             Slave
    +-----+-----+      +-----+-----+
    |  a  |  b  |      |  a  |  b  |
    +-----+-----+      +-----+-----+
    |  2  |  1  |      |  2  |  1  |
    |  1  |  2  |      |  1  |  2  |
    |  4  |  3  |      |  3  |  3  |
    |  3  |  4  |      |  4  |  4  |
    +-----+-----+      |  5  |  5  |
                       +-----+-----+

    [root@localhost mysql]# pt-table-checksum --replicate=percona.checksum --create-replicate-table --databases=db1 --tables=t1 localhost --user=root --password=*** --no-check-binlog-format
                TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
    09-17T01:01:13      0      1        4       1       0   1.054 db1.t1

    [root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=*** --verbose --sync-to-master 192.**.**.**
    # Syncing via replication h=192.**.**.**,p=...,u=root
    # DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
    DELETE FROM `db1`.`t1` WHERE `a`='5' LIMIT 1 /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**,p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/;
    REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('3', '4') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**,p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/;
    REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('4', '3') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.**.**.**,p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.**.**.**,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5205 user:root host:localhost.localdomain*/;
    #      1       2      0      0 Chunk     01:01:43 01:01:43 2    db1.t1

Well, apparently the tools are now able to identify the newly added row on the slave as well as the two other rows that differ.

Case 3: Advantage of cryptographic hash functions (e.g. MD5)

Let us put the tables back as they were in Case 1 and ask the tools to use a cryptographic hash function (MD5) instead of the usual non-cryptographic one. The default CRC32 function provides no security because of its simple mathematical structure and is prone to hash collisions, whereas MD5 provides a better level of integrity. So let us try with --function=md5 and see the result.

    Master             Slave
    +-----+-----+      +-----+-----+
    |  a  |  b  |      |  a  |  b  |
    +-----+-----+      +-----+-----+
    |  2  |  1  |      |  2  |  1  |
    |  1  |  2  |      |  1  |  2  |
    |  4  |  3  |      |  3  |  3  |
    |  3  |  4  |      |  4  |  4  |
    +-----+-----+      +-----+-----+

Narrowed down to BIT_XOR:

Master:

    mysql> SELECT 'test', 't2', '1', NULL, NULL, NULL, COUNT(*) AS cnt,
        COALESCE(LOWER(CONCAT(
          LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING(@crc, 1, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'),
          LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING(@crc := md5(CONCAT_WS('#', `a`, `b`)), 17, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'))), 0) AS crc
        FROM `db1`.`t1`;
    +------+----+---+------+------+------+-----+----------------------------------+
    | test | t2 | 1 | NULL | NULL | NULL | cnt | crc                              |
    +------+----+---+------+------+------+-----+----------------------------------+
    | test | t2 | 1 | NULL | NULL | NULL |   4 | 000000000000000063f65b71e539df48 |
    +------+----+---+------+------+------+-----+----------------------------------+
    1 row in set (0.00 sec)

Slave:

    mysql> SELECT 'test', 't2', '1', NULL, NULL, NULL, COUNT(*) AS cnt,
        COALESCE(LOWER(CONCAT(
          LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING(@crc, 1, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'),
          LPAD(CONV(BIT_XOR(CAST(CONV(SUBSTRING(@crc := md5(CONCAT_WS('#', `a`, `b`)), 17, 16), 16, 10) AS UNSIGNED)), 10, 16), 16, '0'))), 0) AS crc
        FROM `db1`.`t1`;
    +------+----+---+------+------+------+-----+----------------------------------+
    | test | t2 | 1 | NULL | NULL | NULL | cnt | crc                              |
    +------+----+---+------+------+------+-----+----------------------------------+
    | test | t2 | 1 | NULL | NULL | NULL |   4 | 0000000000000000df024e1a4a32c31f |
    +------+----+---+------+------+------+-----+----------------------------------+
    1 row in set (0.00 sec)

    [root@localhost mysql]# pt-table-checksum --replicate=percona.checksum --create-replicate-table --function=md5 --databases=db1 --tables=t1 localhost --user=root --password=*** --no-check-binlog-format
                TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
    09-23T23:57:52      0      1       12       1       0   0.292 db1.t1

    [root@localhost bin]# ./pt-table-sync --print --execute --replicate=percona.checksum --tables db1.t1 --user=root --password=*** --verbose --function=md5 --sync-to-master 192.***.***.***
    # Syncing via replication h=192.168.56.102,p=...,u=root
    # DELETE REPLACE INSERT UPDATE ALGORITHM START    END      EXIT DATABASE.TABLE
    REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('3', '4') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.168.56.101,p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.***.***.***,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5608 user:root host:localhost.localdomain*/;
    REPLACE INTO `db1`.`t1`(`a`, `b`) VALUES ('4', '3') /*percona-toolkit src_db:db1 src_tbl:t1 src_dsn:P=3306,h=192.168.56.101,p=...,u=root dst_db:db1 dst_tbl:t1 dst_dsn:h=192.***.**.***,p=...,u=root lock:1 transaction:1 changing_src:percona.checksum replicate:percona.checksum bidirectional:0 pid:5608 user:root host:localhost.localdomain*/;
    #      0       2      0      0 Chunk     04:46:04 04:46:04 2    db1.t1

    Master             Slave
    +-----+-----+      +-----+-----+
    |  a  |  b  |      |  a  |  b  |
    +-----+-----+      +-----+-----+
    |  2  |  1  |      |  2  |  1  |
    |  1  |  2  |      |  1  |  2  |
    |  4  |  3  |      |  4  |  3  |
    |  3  |  4  |      |  3  |  4  |
    +-----+-----+      +-----+-----+

MD5 did the trick and solved the problem. Compare the BIT_XOR results for MD5 given above: the function is able to tell the distinct values in the two tables apart and produces different crc values.

MD5 (Message-Digest algorithm 5) is a well-known cryptographic hash function with a 128-bit hash value. MD5 is widely used in security-related applications and is also frequently used to check data integrity, but MD5() and SHA1() are very CPU-intensive, which makes checksumming slower if chunk-time is included.
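
The collision in Case 1 can be reproduced outside MySQL. The sketch below is an illustration only: it assumes that MySQL's CRC32() computes the same standard CRC-32 as Python's zlib.crc32, it rebuilds the CONCAT_WS('#', a, b) strings by hand, and the helper names are invented for this example. It does not call pt-table-checksum or reproduce its exact query.

```python
import hashlib
import zlib
from functools import reduce

# Row contents from the Case 1 tables.
MASTER = [(2, 1), (1, 2), (4, 3), (3, 4)]
SLAVE = [(2, 1), (1, 2), (3, 3), (4, 4)]


def row_bytes(row):
    """Mimic CONCAT_WS('#', a, b): join the column values with '#'."""
    return "#".join(str(value) for value in row).encode("ascii")


def xor_crc32(rows):
    """XOR together the CRC-32 of every row, like BIT_XOR(CRC32(...))."""
    return reduce(lambda acc, row: acc ^ zlib.crc32(row_bytes(row)), rows, 0)


def xor_md5(rows):
    """XOR together the full 128-bit MD5 digest of every row."""
    return reduce(
        lambda acc, row: acc ^ int(hashlib.md5(row_bytes(row)).hexdigest(), 16),
        rows,
        0,
    )


if __name__ == "__main__":
    print("CRC32 XOR equal:", xor_crc32(MASTER) == xor_crc32(SLAVE))
    print("MD5 XOR equal:  ", xor_md5(MASTER) == xor_md5(SLAVE))
```

The CRC32 aggregates match for a structural reason rather than by bad luck. For inputs of equal length, CRC-32 is affine with respect to XOR, so the XOR of two CRCs depends only on the XOR of the two inputs. The byte-wise XOR of "4#3" and "3#4" is the same as that of "3#3" and "4#4", so the two differing pairs contribute the same value to BIT_XOR. MD5 has no such structure, which is why the MD5 aggregates differ.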

October 24, 2014

by Peter Zaitsev


Understanding Information Retrieval by Using Apache Lucene and Tika - Part 1

Introduction

In this tutorial, the Apache Lucene and Apache Tika frameworks will be explained through their core concepts (e.g. parsing, MIME detection, content analysis, indexing, scoring, boosting) via illustrative examples that should be accessible not only to seasoned software developers but also to beginners in content analysis and programming. We assume you have a working knowledge of the Java™ programming language and plenty of content to analyze.

Throughout this tutorial, you will learn:

- how to use Apache Tika's API and its most relevant functions
- how to develop code with the Apache Lucene API and its most important modules
- how to integrate Apache Lucene and Apache Tika in order to build your own piece of software that stores and retrieves information efficiently (project code is available for download)

What are Lucene and Tika?

According to Apache Lucene's site, Apache Lucene is an open source Java library for indexing and searching within large collections of documents. The index size is roughly 20-30% of the size of the text indexed, and the search algorithms provide features like:

- ranked searching: best results returned first
- many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more (in this tutorial we will demonstrate only phrase queries)
- fielded search (e.g. title, author, contents)
- sorting by any field
- flexible faceting, highlighting, joins and result grouping
- pluggable ranking models, including the Vector Space Model and Okapi BM25

But Lucene's main purpose is to deal directly with text, and we want to manipulate documents, which come in various formats and encodings. For parsing document content and properties, the Apache Tika library is necessary.

Apache Tika is a library that provides a flexible and robust set of interfaces that can be used in any context where metadata analysis and structured text extraction are needed. The key component of Apache Tika is the Parser (org.apache.tika.parser.Parser) interface, because it hides the complexity of different file formats while providing a simple and powerful mechanism to extract structured text content and metadata from all sorts of documents.

Criteria for Tika parsing design

- Streamed parsing: the interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.
- Structured content: a parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.
- Input metadata: a client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.
- Output metadata: a parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata, like the name of the author, that may be useful to client applications.
- Context sensitivity: while the default settings and behaviour of Tika parsers should work well for most use cases, there are still situations where more fine-grained control over the parsing process is desirable. It should be easy to inject such context-specific information into the parsing process without breaking the layers of abstraction.

Requirements

- Maven 2.0 or higher
- Java 1.6 SE or higher

Lesson 1: Automate metadata extraction from any file type

Our premises are the following: we have a collection of documents stored on disk or in a database and we would like to index them; these documents can be Word documents, PDFs, HTML files, plain text files, etc. As we are developers, we would like to write reusable code that extracts file properties regarding format (metadata) and file content.

Apache Tika has a MIME type repository and a set of schemes (any combination of MIME magic, URL patterns, XML root characters, or file extensions) to determine whether a particular file, URL, or piece of content matches one of its known types. If the content does match, Tika has detected its MIME type and can proceed to select the appropriate parser.

In the sample code, file type detection and parsing are covered inside the class com.retriever.lucene.index.IndexCreator, method indexFile.

Listing 1.1 Analyzing a file with Tika

    // Tika classes used: Metadata, BodyContentHandler, ParseContext, Parser, AutoDetectParser
    public static DocumentWithAbstract indexFile(Analyzer analyzer, File file) throws IOException {
        Metadata metadata = new Metadata();
        // Raise the write limit to 10 * 1024 * 1024 characters of body text
        ContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        ParseContext context = new ParseContext();
        // AutoDetectParser picks a concrete parser based on the detected type
        Parser parser = new AutoDetectParser();
        InputStream stream = new FileInputStream(file); // open stream
        try {
            parser.parse(stream, handler, metadata, context); // parse the stream
        } catch (TikaException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        } finally {
            stream.close(); // close the stream
        }
        // more code here
    }

The code above shows how a file is parsed using org.apache.tika.parser.AutoDetectParser; this implementation was chosen because we would like to parse documents regardless of their format. Also, for handling the content, the org.apache.tika.sax.BodyContentHandler was constructed with a writeLimit given as parameter (10*1024*1024). This constructor creates a content handler that writes XHTML body character events to an internal string buffer and, for documents with large content, is less likely to throw a SAXException (thrown when the default write limit is reached).

As a result of our parsing we have obtained a Metadata object that we can now use to detect file properties (title or any other header specific to a document format). Metadata processing can be done as described below (com.retriever.lucene.index.IndexCreator, method indexFileDescriptors):

Listing 1.2 Processing metadata

    private static Document indexFileDescriptors(String fileName, Metadata metadata) {
        Document doc = new Document();
        // store file name in a separate TextField
        doc.add(new TextField(ISearchConstants.FIELD_FILE, fileName, Store.YES));
        for (String key : metadata.names()) {
            String name = key.toLowerCase();
            String value = metadata.get(key);
            if (StringUtils.isBlank(value)) {
                continue; // skip metadata entries without a value
            }
            if ("keywords".equalsIgnoreCase(key)) {
                // one field per keyword; keywords are separated by whitespace,
                // optionally preceded by a comma
                for (String keyword : value.split(",?(\\s+)")) {
                    doc.add(new TextField(name, keyword, Store.YES));
                }
            } else if (ISearchConstants.FIELD_TITLE.equalsIgnoreCase(key)) {
                doc.add(new TextField(name, value, Store.YES));
            } else {
                doc.add(new TextField(name, fileName, Store.NO));
            }
        }
        return doc;
    }

In the method presented above we store the file name in a separate field, and also the document's title (a document can have a title different from its file name); we are not interested in storing other information.
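
For readers who want to experiment with the branching logic of Listing 1.2 without setting up Lucene and Tika, here is a minimal Python sketch of the same decision flow. It is an assumption-laden stand-in, not the tutorial's code: metadata is modelled as a plain dict, each Lucene field as a (name, value, stored) tuple, and the constants FIELD_FILE and FIELD_TITLE are hypothetical values chosen for the example. Unlike the Java listing, it strips the keyword string first so that leading whitespace does not produce an empty keyword.

```python
import re

# Hypothetical stand-ins for the tutorial's ISearchConstants values.
FIELD_FILE = "file"
FIELD_TITLE = "title"

# Same pattern as the Java listing: whitespace, optionally preceded by a comma.
KEYWORD_SEPARATOR = re.compile(r",?\s+")


def index_file_descriptors(file_name, metadata):
    """Return (name, value, stored) tuples mirroring the fields of Listing 1.2."""
    fields = [(FIELD_FILE, file_name, True)]
    for key, value in metadata.items():
        if value is None or not value.strip():
            continue  # skip blank metadata values
        name = key.lower()
        if name == "keywords":
            # One stored field per keyword.
            for keyword in KEYWORD_SEPARATOR.split(value.strip()):
                fields.append((name, keyword, True))
        elif name == FIELD_TITLE:
            fields.append((name, value, True))
        else:
            # Mirrors the listing: other keys are added but not stored.
            fields.append((name, file_name, False))
    return fields


if __name__ == "__main__":
    sample = {"Keywords": "java, lucene tika", "title": "Annual Report", "Author": "A. Writer"}
    for field in index_file_descriptors("report.pdf", sample):
        print(field)
```

Note that the separator pattern requires at least one whitespace character, so a keyword string such as "java,lucene" with no space after the comma stays as a single keyword, in the Java listing as well as in this sketch.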

October 22, 2014

by Ana-Maria Mihalceanu


Sharding Pitfalls Part II: Running a Sharded Cluster

By Adam Comerford, Senior Solutions Engineer

In Part I we discussed important considerations when picking a shard key. In this post we will go through some recommendations for running a sharded cluster at scale. Scalability is one of the core benefits of sharding in MongoDB, but this can give you a false sense of security; even with that flexibility, you still have to make smart decisions about how and when you deploy resources. In this post, we will cover a couple of common mistakes that people tend to make when it comes to running a sharded cluster.

3. Waiting too long to add a new shard (overloaded)

You sharded your database and scaled horizontally for a reason; perhaps it was to add more memory or disk capacity. Whatever the reason, if your application usage grows over time, so (generally) does your database utilization. Eventually, your current sharded cluster will pass a certain point, let’s call it 80% utilized (as a nice round estimate), such that it becomes problematic to add another shard. Why?

Well, adding a new shard to a cluster is not free, and it is not instantaneous. It consumes resources and (initially) accepts very little traffic. Essentially, at the start of its existence, a newly added shard costs you capacity instead of adding capacity. The length of time it will stay in this state depends on the balancer and on how long it takes for a significant portion of “busy/active” chunks to move onto the new shard.

It can often be easier to visualize this process, so let’s make up some hypothetical numbers and set the bar relatively low. Our imaginary existing cluster will be a set of 2 shards, with 2000 chunks (500 considered “active”), and to that we need to add a 3rd shard. This 3rd shard will eventually store one third of the active chunks (and total chunks). The question is, when does this shard stop adding overhead overall and instead become an asset?

In reality, this will vary from cluster to cluster and have a lot of dependencies and variables. In other words, you need to have good metrics about your cluster, particularly your load bottleneck. Therefore we will once again use our imaginations and go with a relatively low bar: when 5% of active chunks (that is, those chunks seeing most traffic) have migrated to the new shard, you should expect a net gain in performance.

In our imaginary system we have evaluated our load levels and the expected impact of migrations, and have determined that once that 5% threshold of active chunks has been migrated to the new shard, it can be considered a net gain for the overall system. Once all chunks have been balanced, the migration overhead disappears, but initially this will be an expected trade-off.

This chart shows how long it would take for new shards to reach net positive contribution in your cluster (the dotted line implies net gain):

In this fabricated example, it takes almost 2 hours for the new shard to attain a viable level of active chunks and be considered a net gain for the overall system. Although these numbers are fictional, they are based on setups we have seen in real systems with moderate load. From there it is relatively easy to imagine this set of migrations taking even longer on an overloaded set of shards, and our newly added shard taking far longer to cross the threshold and become a net gain. As such, it is best to be proactive and add capacity before it becomes a necessity.

Possible Mitigation Strategies

- Manually balance targeted “hot” chunks (chunks that are being accessed more than others) to move activity to the new shard more quickly
- Add the shard at a low traffic time so that there is less competition for resources
- Disable balancing on some collections, and prioritise balancing busy collections first

4. Under-provisioning Config Servers

Provisioning enough resources without being wasteful is always tricky, and all the more so in a complicated distributed system like a MongoDB sharded cluster. Everyone wants to use their hardware, virtual instances, virtual machines, containers and the like in the most efficient way possible, and get the best bang for their buck. Hence it is only natural to take a look at the various pieces of a distributed cluster and look for less utilized pieces that could be put on less expensive resources.

The most common pitfall here with MongoDB is the config servers, which are often neglected when stress testing a cluster. In testing environments and smaller deployments (unless specific measures are taken to stress them) they are relatively lightly loaded and usually identified as candidates for lesser instances/hardware.

The problem is that these are critical pieces of infrastructure. They may not be heavily loaded all the time, but when they do see load and struggle to service requests, that can impact all queries (reads, writes, authentication) and add latency to all requests made of the cluster in question.

In particular, the first config server in the list supplied to your mongos processes is vital. This is the config server that all mongos processes will read from by default when fetching or refreshing their view of the data distribution in your cluster. Similarly, this is the server that will be hit when attempting to authenticate a user. If it is under-provisioned and cannot service queries, or if it has problems with networking (packet loss, congestion), then the effects will be significant.

Possible Mitigation Strategies

- Ensure the config servers are load tested and slightly over-provisioned (the first config server in particular)
- If using virtual machines or cloud based instances, investigate increasing available resources
- Turning off the balancer and disabling chunk splitting will reduce the chances of high read traffic to the config servers (no migrations, no metadata refresh), but this is only a temporary fix unless you have a perfect write distribution, and it may not eliminate issues completely

5. Using the count() command on sharded collections

This pitfall is very common, and it seems to hit somewhat randomly in terms of how long someone has been running a sharded environment. At some point, a question will arise along the lines of: “How are we tracking/verifying/checking how many documents we have in each collection on each shard, how balanced are they and do they agree with ?”

Hopefully no one is actually constructing questions this way in your organization, but you get the basic idea. The most obvious way to do a quick check on this type of thing is to count the documents and see if the numbers make sense and/or agree with counts elsewhere. That thinking naturally leads people to the count command, and they proceed to use it to gather figures for their documents and collections.

Unfortunately, on a busy, mature sharded cluster, the results will very rarely be what is expected. The reason is that the count command as implemented today has several optimizations in place to make it faster to run in general, and those speed optimizations essentially bypass a key piece of the sharding functionality needed to return accurate results in this case. This is a known bug and is being tracked in SERVER-3645, but that does not stop people from consistently hitting this issue.

The nature of the issue means that count will report documents in the results that it should not, for example:

- Documents that are being deleted as part of a chunk migration
- Documents that have been left behind from previous chunk migrations (also known as orphans)
- Documents currently being copied as part of an in-flight chunk migration

A regular query (rather than a count) will have its results filtered by the respective primary and will not suffer from the same problem. Hence, if you were to manually count the results from a query client-side, you would get an accurate result. This quirk of sharded environments will eventually be fixed, but for now it will inevitably crop up from time to time in all active sharded clusters used by a large team.

Possible Mitigation Strategies

- Do counts on the client side, or use targeted, range based queries (with a primary read preference) to count instead
- Use cleanUpOrphaned and disable the balancer (making sure it has finished the current round) when performing counts across the cluster

If you want to learn more about managing MongoDB deployments at scale, sign up for my online education course, MongoDB Advanced Deployment and Operations.

Planning for scale? No problem: MongoDB is here to help. Get a preview of what it’s like to work with MongoDB’s Technical Services Team. Give us some details on your deployment and we can set you up with an expert who can provide detailed guidance on all aspects of scaling with MongoDB, based on our experience with hundreds of deployments.
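
To make the count() pitfall more concrete, the toy model below simulates two shards where one still holds an orphaned document. It is a deliberately simplified, hypothetical model written in plain Python (no MongoDB driver involved): shards are dicts of shard key values, chunk ownership is a list of key ranges, and all names are invented for the illustration. The "naive" count adds up whatever each shard physically holds, while the "filtered" count keeps only documents on the shard that owns their key range, which is the kind of filtering the article says a regular query gets.

```python
def owner_of(key, chunk_ranges):
    """Return the shard that owns `key` according to the chunk metadata."""
    for low, high, shard in chunk_ranges:
        if low <= key < high:
            return shard
    raise KeyError(f"no chunk covers key {key}")


def naive_count(shards):
    """Count every document physically present on any shard."""
    return sum(len(keys) for keys in shards.values())


def filtered_count(shards, chunk_ranges):
    """Count only documents stored on the shard that owns their key range."""
    return sum(
        1
        for shard, keys in shards.items()
        for key in keys
        if owner_of(key, chunk_ranges) == shard
    )


# Chunk metadata: keys [0, 100) live on shard1, keys [100, 200) on shard2.
CHUNK_RANGES = [(0, 100, "shard1"), (100, 200, "shard2")]

# Key 150 was migrated to shard2, but an orphaned copy is still on shard1.
SHARDS = {
    "shard1": [5, 42, 150],
    "shard2": [120, 150, 180],
}

if __name__ == "__main__":
    print("naive count:   ", naive_count(SHARDS))
    print("filtered count:", filtered_count(SHARDS, CHUNK_RANGES))
```

In this model the naive count sees six documents while only five logically exist, which is the kind of discrepancy the mitigation strategies above are meant to avoid.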

October 21, 2014

by Francesca Krihely


AppDynamics VS New Relic – Which Tool is Right For You? The Complete Guide

New Relic VS AppDynamics: All the performance features, integrations, installation procedures and pricing plans side by side to help you decide which tool to use

When thinking about performance, AppDynamics and New Relic are the main modern tools that come to mind. Both spawned from the same company, Wily Technology, which also dealt with performance monitoring and was acquired by CA back in 2006, making way for new technology. New Relic is an anagram of Lew Cirne, its founder and CEO. AppDynamics was founded by Jyoti Bansal, who was a Lead Software Architect at the same Wily Technology, which was also founded by Lew. The main goal of this guide is to help you understand the similarities and differences between the two, so you can decide which one fits your company’s needs.

Table of Contents

- What is APM anyhow?
- Supported Languages and Environments
- Features
  - Backend Monitoring
  - Frontend & Mobile Monitoring
- How to Solve the Errors You Find
- Installation
- Dashboard and Usage
- Integrations and Plugins
- Pricing
- Conclusion

1. What is APM anyhow?

It’s the only buzzword you’ll read in this article, promise. Well, maybe also DevOps, but that’s it. Application Performance Management has been around for a while, though it seems like many developers are not comfortable with it yet. APM provides us with analytics around our application’s performance. At the core this means timing how long it takes to execute different areas in the code and complete transactions. This is done either by instrumenting the code, monitoring logs, or including network / hardware metrics. On top of this basic concept, many different implementations exist, but there are a few basic truths we can agree on: a modern solution should monitor production environments, so its overhead (in terms of CPU and throughput) becomes very important. Also, it should display what the web/mobile end users are experiencing, which was not part of traditional APMs.
What was once considered a luxury is becoming commonplace: rapid new deployments in production mean more chances to introduce errors to your system’s architecture, slow it down, and maybe even crash it. Let’s see what AppDynamics and New Relic have in store for us.

2. Supported Environments

AppDynamics: Java, Scala, .NET, PHP, Node.js, iOS and Android; including your favorite flavour of database and cloud platform.

New Relic: Java, Scala, .NET, PHP, Node.js, Ruby and Python. Supported databases, cloud platforms and other plug-ins are available here. We’ll dig in deeper with extensions later on.

On the user monitoring front, iOS, Android, and JavaScript support is included with both tools.

Bottom line: The main difference here is New Relic’s Ruby and Python support, and different levels of support for various platforms.

3. Features

Both New Relic and AppDynamics can be broken down into 6 different products, all reporting to a main dashboard interface. Let’s split these into backend, mobile and frontend to do a quick run-through of the main offerings.

Backend Monitoring

The bread and butter of performance management: reporting stats, graphs and insights of your application’s performance under the hood. AppDynamics and New Relic each offer 4 approaches here:

Application Performance Management

High level metrics with drill downs to code level data about how your application is performing. Must-have metrics include transaction response time, error rate, and throughput (Requests per Minute) on New Relic or load (calls/min) on AppDynamics.

AppDynamics dashboard on the left, New Relic on the right

You’ve probably noticed that the main screen at AppDynamics includes a map of the services the application is using with their call loads and health index, while New Relic displays a response time graph. This might be a way to signal each tool’s monitoring priorities and AppDynamics’ inclination to larger enterprises.
Anyhow, enough with this dev tool psychology, but it’s worth noting that a similar map is also available on New Relic:

New Relic’s application map

One of the thorny issues here is alerting and reporting: with so many metrics and moving parts, it’s hard to identify which matters most. Is it a low error rate? Responsiveness? Throughput? AppDynamics and New Relic each took a different approach to distill these metrics into performance indicators.

New Relic uses the Apdex score index, which uses a user-defined response time threshold T to imply end-user satisfaction. Simply put, they require you to manually set the threshold. Here’s an example of the way this score is calculated:

Calculating Apdex: now sum this over all requests for a given time and you’ll get the score

AppDynamics, on the other hand, doesn’t believe in Apdex (as they explained in an article called ”Apdex is Fatally Flawed”). They’ve come up with a solution of their own that automatically creates a dynamic baseline for the app’s performance, which varies by time. For example, the definition of a slow transaction might vary under low and high loads on the system.

Bottom line: We’re seeing that AppDynamics puts its priority on visualizing the stack from end to end, while New Relic is focused on bottom line response times. We’ve also seen the difference in alerting between Apdex and dynamic baselines.

Server Monitoring

Another monitoring capability offered by both tools focuses on the hardware your servers run on: specs, CPU usage, memory utilization, disk I/O and network I/O.

AppDynamics on the left, New Relic on the right with some sample data

In this category, AppDynamics offers a few more features than New Relic, mostly around memory: heap size & utilization, garbage collection stats divided by generation, and memory leak detection.

AppDynamics Server Monitoring - Memory features

Bottom line: AppDynamics provides deeper insights into garbage collection and memory leak detection beyond the standard metrics.
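Returning briefly to the Apdex score discussed earlier in this section: the calculation is simple enough to sketch in a few lines of Java. The sketch below assumes the standard Apdex definition, in which a request is "satisfied" if its response time is at most T, "tolerating" if it is at most 4T, and "frustrated" otherwise, and the score is (satisfied + tolerating / 2) / total. The class and method names (`ApdexSketch`, `apdex`) and the sample numbers are made up for illustration and are not taken from New Relic.

```java
final class ApdexSketch {

    /**
     * Computes an Apdex score in [0, 1] for the given response times.
     * thresholdT is the user-defined "satisfied" threshold, in the same unit as the samples.
     */
    static double apdex(double[] responseTimes, double thresholdT) {
        if (responseTimes.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        int satisfied = 0;
        int tolerating = 0;
        for (double t : responseTimes) {
            if (t <= thresholdT) {
                satisfied++;            // fast enough: counts fully
            } else if (t <= 4 * thresholdT) {
                tolerating++;           // slow but bearable: counts half
            }                           // anything slower is "frustrated": counts zero
        }
        return (satisfied + tolerating / 2.0) / responseTimes.length;
    }

    public static void main(String[] args) {
        // Hypothetical samples in seconds, with T = 0.5s:
        // 2 satisfied, 2 tolerating, 1 frustrated.
        double[] samples = { 0.2, 0.4, 0.9, 1.5, 3.0 };
        System.out.println("Apdex = " + apdex(samples, 0.5));
    }
}
```

The sketch also shows why the choice of T matters so much: moving the threshold reshuffles requests between the three buckets, which is the manual tuning step that a dynamic baseline tries to avoid.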
Database Monitoring

Moving on to other components in your stack, the first thing that comes to mind is the database. Here we have a greater distinction, with a richer AppDynamics dashboard looking into things like resource consumption, wait states, user sessions, specific query calls and more. On New Relic’s end the situation is a bit different, with the Database dashboard being part of the basic APM product. Both tools have specific database monitoring metrics available through plugins to view data from external services (we’ll talk a bit more about integrations later). Either way, both native and external feature sets here might be different depending on the database you use.

AppDynamics on the left with an Oracle DB, New Relic on the right with MySQL plugin

Bottom line: Beyond the shared database metrics that go a bit deeper with AppDynamics, it’s worth looking into the features available for your specific database within each tool.

Insights and Analytics

This one is a wildcard, going beyond traditional APM and opening up to business intelligence metrics. Since both New Relic and AppDynamics already have access to the messages that go through your application, they’ve built this opt-in additional database to store your stats and enable you to query them.

AppDynamics on the left, New Relic on the right with some sample data

Bottom line: If you don’t have a solution that already allows you to process such queries, it might be about time to get one.

Frontend & Mobile Monitoring

Switching seats from the backend, let’s take a quick look at what we’re getting on the Real-User Monitoring front. Both AppDynamics and New Relic have a product targeting browsers and a product targeting mobile with iOS & Android support.
On mobile, the flagship features include insights on slowdowns and crashes that are filtered through geographic regions, devices, operating systems and operator networks:

AppDynamics on the left, New Relic on the right - Mobile Real-User Monitoring

With end-user browser analytics, it feels like having the visibility you have on your own browser’s load times through Chrome dev tools, but for the actual users of your app:

AppDynamics on the left, New Relic on the right - Browser Real-User Monitoring

Bottom line: We’re seeing again how New Relic’s focus is on response time bottom lines while AppDynamics emphasizes the global picture.

4. How to Solve the Errors You Find

To go beyond the reporting and alerting of errors by AppDynamics and New Relic, many of our users add Takipi to their toolbox. This allows them not only to monitor server slowdowns and errors via New Relic or AppDynamics, but also to solve them using Takipi. Whenever a new exception is thrown or a log error occurs, Takipi captures it and shows you the variable state which caused it, across methods and machines. Takipi will overlay this over the actual code which executed at the moment of error, so you can analyze the exception as if you were there when it happened.

The dashboard links each error to a recorded instance of all involved code when the bug happened, and includes the variable values that caused it:

Takipi plays well with AppDynamics, and it also has a New Relic plug-in that displays an exception and log error dashboard:

Takipi for New Relic

Bottom line: It’s one thing to identify what’s slowing you down, but solving it is a whole different issue. Java or Scala developers? Whether you're using an APM tool or not, try Takipi.

5. Dashboard and Usage

To get a better feel for each tool’s user experience and way of solving problems, I think it’s probably best to browse through a video.
But before that, it’s worth noting that AppDynamics uses a dashboard based on Flash. Yes… Flash; it turns out it’s still out there. This felt like a drawback, but it’s probably the best Flash application I’ve seen out there:

New Relic at TypeSafe, a webinar that gives an overview of New Relic for Play (it’s a bit long but gives a nice overview if you browse through):

Bottom line: I still can’t believe the Flash didn’t scare me off completely. And it was OK, actually. Both tools provide a nice experience, but still feel a bit cluttered.

6. Installation

SaaS/On-Premise: AppDynamics offers a few modes of operation - SaaS, on-premises and a hybrid approach, each with its installation instructions. New Relic is only available through SaaS.

Agents: Monitoring your application becomes available through attaching language-specific agents to your server. For example, with Java there are 2 possible ways to instrument your code with agents, either by using a Java agent or a native agent. New Relic and AppDynamics use a Java agent to collect the performance data they’re reporting. To gather the low-level data required not only to point to an error but to help solve it, Takipi uses a native agent.

Code and configuration changes: On the Real-User Monitoring front, project and configuration changes, including introducing a few dependencies, would be needed if you’d like to add monitoring capabilities to your web or mobile app. This includes adding JS agents to your website and native mobile agents to your mobile application.

Alerting: AppDynamics computes your response time thresholds by itself and might take some time to learn your system. New Relic relies on custom thresholds defined by you for its Apdex index.

Bottom line: If you require an on-premise version, the answer is clearly AppDynamics. Otherwise, ease of installation is pretty much the same - mind the alerting though.

7. Integrations and Plugins

Branching out, AppDynamics and New Relic offer integrations and plug-ins to hundreds of services. Let’s start with New Relic. We’ve already mentioned the Platform program earlier: a plug-in platform with 116 (last time I checked) plugins to services like Hadoop, RabbitMQ and Redis that stream metrics of their data so you can view it on New Relic. On the integrations side of the table, there’s Connect, with 53 integrations with tools like Jira, HipChat, Takipi and PagerDuty. AppDynamics Exchange offers 100 plugins and is also an open platform for developers to build plugins.

Bottom line: New Relic has richer integrations that feel friendlier, but any way you go it’s an individual decision to see where and how your tools of choice integrate better.

8. Pricing

Both tools have a free lite version with limited features across all products, including 24-hour data retention, with pro trials of 14-30 days. Pricing for AppDynamics pro programs is more individual: you’ll have to contact sales to get a customized plan based on the number of agents you need. Mobile monitoring is a bit different, as each agent is priced per 5000 Monthly Active Users.

With New Relic, pro account pricing starts at $199 per month per host ($149 on a yearly plan); this includes APM, Servers, Platform and Browser basics. Mobile monitoring costs $49 per month ($29 on a yearly plan) with 1 week of data retention. The Insights product starts from $250 per month for up to 75 million events.

Bottom line: New Relic’s pricing caters more to startups and small-medium businesses, while AppDynamics’ focus is on customizing solutions for enterprises. With that said, each tool ventures off to the other’s natural playground, and this distinction today is not as clear as it was.

Conclusion

AppDynamics and New Relic are top of the line APM tools, each traditionally targeted at a different type of developer, from enterprises to startups.
But as both are stepping forward to their IPOs after experiencing huge growth, the lines are getting blurred. The choice is not clear, but you cannot go wrong: on-premise means AppDynamics; otherwise, it’s an individual call that depends on which better fits your stack (and which of all these features you actually think you’re going to use).

Originally posted on Takipi's blog

October 16, 2014

by Chen Harel


MySQL Replication: 'Got fatal error 1236' Causes and Cures

Originally Written by Muhammad Irfan

MySQL replication is a core process for maintaining multiple copies of data, and replication is a very important aspect of database administration. In order to synchronize data between master and slaves you need to make sure that data transfers smoothly, and to do so you need to act promptly on replication errors so that data synchronization can continue. Here on the Percona Support team, we often help customers with broken-replication issues. In this post I’ll highlight the most critical replication error, code 1236, along with its causes and cures. The MySQL replication error “Got fatal error 1236” can be triggered by multiple causes and I will try to cover all of them.

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event 'binlog.000201' at 5480571

This is a typical error on the slave server(s). It reflects a problem with the max_allowed_packet size. max_allowed_packet refers to a single SQL statement sent to the MySQL server as a binary log event from master to slave. This error usually occurs when you have a different size of max_allowed_packet on the master and the slave (i.e. the master’s max_allowed_packet size is greater than the slave server’s). When the MySQL master server tries to send a bigger packet than the one defined on the slave server, the slave server fails to accept it, hence the error. In order to alleviate this issue, please make sure to have the same value for max_allowed_packet on both slave and master. You can read more about max_allowed_packet here.

This error usually occurs when updating a huge number of rows on the master, where the resulting event doesn’t fit into the slave’s max_allowed_packet size because the slave’s max_allowed_packet size is lower than the master’s. This usually happens with “LOAD DATA INFILE” or “INSERT .. SELECT” queries.
In my experience, this can also be caused by application logic that generates a huge INSERT with junk data. Take into account that MySQL 5.6.6 and later introduced a new variable, slave_max_allowed_packet, which controls the maximum packet size for the replication threads. It overrides the max_allowed_packet variable on the slave and its default value is 1 GB. In the post “max_allowed_packet and binary log corruption in MySQL,” my colleague Miguel Angel Nieto explains this error in detail.

Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

This error occurs when the binary log that the slave server requires for replication no longer exists on the master database server. In one scenario, your slave server is stopped for some reason for a few hours or days, and when you resume replication on the slave it fails with the above error. When you investigate, you will find that the master server no longer has the binary logs which the slave server needs to pull in order to synchronize data. Possible reasons for this include the master server expiring binary logs via the system variable expire_logs_days, someone manually deleting binary logs from the master via the PURGE BINARY LOGS command or via ‘rm -f’, or maybe some cronjob which archives older binary logs to reclaim disk space, etc. So, make sure the required binary logs always exist on the master server; you can update your procedures to keep the binary logs that the slave server requires by monitoring the “Relay_Master_Log_File” variable from the SHOW SLAVE STATUS output. Moreover, if you have set expire_logs_days in my.cnf, old binlogs expire automatically and are removed. This means that when MySQL opens a new binlog file, it checks the older binlogs and purges any that are older than the value of expire_logs_days (in days).
Percona Server added a feature to expire logs based on the total number of files used instead of the age of the binlog files. So in that configuration, if you get a spike of traffic, it could cause binlogs to disappear sooner than you expect. For more information check Restricting the number of binlog files. In order to resolve this problem, the only clean solution I can think of is to re-create the slave server from a master server backup or from another slave in the replication topology.

Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.000525' at 175770780, the last event read from '/data/mysql/repl/mysql-bin.000525' at 175770780, the last byte read from '/data/mysql/repl/mysql-bin.000525' at 175771648.'

Usually, this is caused by sync_binlog <> 1 on the master server, which means binary log events may not be synchronized to disk. There might be a committed SQL statement or row change (depending on your replication format) on the master that did not make it to the slave because the event is truncated. The solution would be to move the slave thread to the next available binary log and initialize the slave thread with the first available position in that binary log, as below:

    mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000526', MASTER_LOG_POS=4;

[ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position; the first event 'mysql-bin.010711' at 55212580, the last event read from '/var/lib/mysql/log/mysql-bin.000711' at 4, the last byte read from '/var/lib/mysql/log/mysql-bin.010711' at 4.', Error_code: 1236

I suspect the master server crashed or rebooted, and hence binary log events were not synchronized to disk. This usually happens when sync_binlog != 1 on the master.
You can investigate it by inspecting the binary log contents as below:

    $ mysqlbinlog --base64-output=decode-rows --verbose --verbose --start-position=55212580 mysql-bin.010711

You will find that this is the last position of the binary log and the end of the binary log file. This issue can usually be fixed by moving the slave to the next binary log. In this case it would be:

    mysql> CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000712', MASTER_LOG_POS=4;

This will resume replication.

To avoid corrupted binlogs on the master, enabling sync_binlog=1 on the master helps in most cases. sync_binlog=1 will synchronize the binary log to disk after every commit. sync_binlog makes MySQL perform an fsync on the binary log in addition to the fsync by InnoDB. As a reminder, it has some cost, as it will synchronize the write to the binary log on disk after every commit. On the other hand, the sync_binlog=1 overhead can be minimal or negligible if the disk subsystem is SSD along with a battery-backed cache (BBU). You can read more about this here in the manual.

sync_binlog is a dynamic option that you can enable on the fly. Here’s how:

    mysql-master> SET GLOBAL sync_binlog=1;

To make the change persistent across reboots, you can add this parameter in my.cnf.

As a side note, along with replication fixes, it is always a better option to make sure your replica is in sync with the master and to validate data between master and slaves. Fortunately, Percona Toolkit has tools for this purpose: pt-table-checksum & pt-table-sync. Before checking for replication consistency, be sure to check the replication environment and then, later, to sync any differences.
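The "move the slave to the next binary log at position 4" fix used twice in this post lends itself to a small helper. The following is a minimal Java sketch that only builds the statement text and does not connect to MySQL; the class and method names (`NextBinlogSketch`, `nextBinlogName`, `changeMasterStatement`) are hypothetical, and it assumes binary log names of the form `<base>.<zero-padded number>`, as in the examples above.

```java
final class NextBinlogSketch {

    /** Returns the name of the binary log that follows the given one, keeping the zero padding. */
    static String nextBinlogName(String current) {
        int dot = current.lastIndexOf('.');
        if (dot < 0 || dot == current.length() - 1) {
            throw new IllegalArgumentException("not a binlog name: " + current);
        }
        String base = current.substring(0, dot);
        String sequence = current.substring(dot + 1);
        for (int i = 0; i < sequence.length(); i++) {
            if (!Character.isDigit(sequence.charAt(i))) {
                throw new IllegalArgumentException("non-numeric binlog suffix: " + current);
            }
        }
        long next = Long.parseLong(sequence) + 1;
        // Pad to the original width, e.g. 000525 -> 000526.
        return base + "." + String.format("%0" + sequence.length() + "d", next);
    }

    /** Builds the statement that points the slave at the start (position 4) of the next binlog. */
    static String changeMasterStatement(String truncatedBinlog) {
        return "CHANGE MASTER TO MASTER_LOG_FILE='" + nextBinlogName(truncatedBinlog)
                + "', MASTER_LOG_POS=4;";
    }

    public static void main(String[] args) {
        System.out.println(changeMasterStatement("mysql-bin.000525"));
    }
}
```

Skipping ahead like this assumes the next binary log actually exists on the master, and any event lost in the truncated file is not replayed, so it is worth following up with the consistency checks mentioned above.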

October 15, 2014

by Peter Zaitsev


JSR 199 - Compiler API

JSR 199 provides the compiler API to compile Java code from inside another Java program. The following are the important classes and interfaces provided to facilitate compilation from a Java program.

- JavaFileObject - Represents a compilation unit, typically a class source.
- SimpleJavaFileObject - Implementation of the methods defined in JavaFileObject.
- DiagnosticCollector - Collects the compilation errors and warnings into a list of Diagnostic type.
- Diagnostic - Reports the type of the problem and details like line number, character, error reason etc.
- JavaFileManager - To work on the Java source and class files.
- JavaCompiler - The compiler instance for compiling the compilation unit.
- CompilationTask - A nested interface of JavaCompiler which compiles and returns the status, with diagnostics, when its call method is invoked.

Where to start

To compile Java code, we need the Java source. The source can be a physical file on the disk or a string inside the program. Using the source, we need to create an instance of type JavaFileObject.

Using String literal

Create a class which implements JavaFileObject; here I am using SimpleJavaFileObject.
We need to create the path URI of the class file:

    package com.test;

    import java.io.IOException;
    import java.net.URI;
    import javax.tools.SimpleJavaFileObject;

    public class SampleSource extends SimpleJavaFileObject {
        private final String source;

        protected SampleSource(String name, String code) {
            // Turn the class name into a URI, e.g. com.test.Test -> string:///com/test/Test.java
            super(URI.create("string:///" + name.replaceAll("\\.", "/") + Kind.SOURCE.extension), Kind.SOURCE);
            this.source = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) throws IOException {
            return source;
        }
    }

Now, create the instance of JavaFileObject and, from it, create the compilation units (a collection of JavaFileObject):

    // The stray hyphen after println(...) is deliberate: it produces the errors shown in the output.
    String str = "package com.test;"
            + "\n" + "public class Test {"
            + "\npublic static void test() {"
            + "\nSystem.out.println(\"Comiler API Test\")-;" + ""
            + "\n}" + "\n}";
    SimpleJavaFileObject fileObject = new SampleSource("com.test.Test", str);
    JavaFileObject[] javaFileObjects = new JavaFileObject[] { fileObject };
    Iterable<? extends JavaFileObject> compilationUnits = Arrays.asList(javaFileObjects);

From File System

If the source is in a physical location, create the compilation units like this:

    File[] files = new File[] { file1, file2, file3, file4 };
    Iterable<? extends JavaFileObject> units = fileManager.getJavaFileObjectsFromFiles(Arrays.asList(files));

Create a JavaFileManager

We will now see how to create a file manager. getJavaFileObjectsFromFiles is declared on StandardJavaFileManager, so the variable uses that type:

    StandardJavaFileManager fileManager = compiler.getStandardFileManager(
            diagnostics, Locale.getDefault(), Charset.defaultCharset());

To get the file manager, we need:

- diagnostic - A DiagnosticCollector of JavaFileObject
- locale - The locale of the compilation
- charset - The charset to be used

Compiler

Get the compiler instance using ToolProvider. Finally, create the CompilationTask from the compiler instance using the diagnostics, file manager and compilation units (optionally a writer and compilation options).
    JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    CompilationTask task = compiler.getTask(null, fileManager, diagnostics, compilationOptionss, null, compilationUnits);

The arguments required to get the CompilationTask are:

- out - A writer which writes the output of the compiler. Defaults to System.err if null.
- fileManager - The file manager to use; the compiler's standard file manager if null.
- listener - A diagnostic listener through which the errors and warnings can be accessed.
- options - Compiler options (e.g. -d, as we give on the command line using javac).
- classes - Names of the classes to be processed.
- compilationUnits - List of compilation units.

Compile

Finally, call the method to compile. This method may be called only once; it throws IllegalStateException on multiple calls. Once compiled, it returns true for successful compilation, otherwise false. We need to look at the DiagnosticCollector to get the error/warning details.

    boolean status = task.call();

All together

Putting it all together:

    // Needs java.io.IOException, java.nio.charset.Charset, java.util.Arrays, java.util.Locale
    // and the javax.tools types used below (CompilationTask is javax.tools.JavaCompiler.CompilationTask).
    // Place these methods in a class in package com.test, next to SampleSource.
    public static void main(String[] args) {
        String str = "package com.test;"
                + "\n" + "public class Test {"
                + "\npublic static void test() {"
                + "\nSystem.out.println(\"Comiler API Test\")-;" + ""
                + "\n}" + "\n}";
        SimpleJavaFileObject fileObject = new SampleSource("com.test.Test", str);
        JavaFileObject[] javaFileObjects = new JavaFileObject[] { fileObject };
        Iterable<? extends JavaFileObject> compilationUnits = Arrays.asList(javaFileObjects);

        // The "classes" output directory must already exist.
        Iterable<String> compilationOptionss = Arrays.asList(new String[] { "-d", "classes" });
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<JavaFileObject>();
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        StandardJavaFileManager fileManager = compiler.getStandardFileManager(
                diagnostics, Locale.getDefault(), Charset.defaultCharset());
        CompilationTask task = compiler.getTask(null, fileManager, diagnostics, compilationOptionss, null, compilationUnits);

        boolean status = task.call();
        if (!status) {
            System.out.println("Found errors in compilation");
            int errors = 1;
            for (Diagnostic<? extends JavaFileObject> diagnostic : diagnostics.getDiagnostics()) {
                printError(errors, diagnostic);
                errors++;
            }
        } else {
            System.out.println("Compilation successful");
        }
        try {
            fileManager.close();
        } catch (IOException e) {
            // Nothing useful to do if closing the file manager fails.
        }
    }

    public static void printError(int number, Diagnostic<? extends JavaFileObject> diagnostic) {
        System.out.println();
        System.out.print(diagnostic.getKind() + " : " + number + " Type : " + diagnostic.getMessage(Locale.getDefault()));
        System.out.print(" at column : " + diagnostic.getColumnNumber());
        System.out.println(" Line number : " + diagnostic.getLineNumber());
        System.out.println("Source : " + diagnostic.getSource());
    }

Output

Output with an error will be (because of a hyphen after System.out.println in the test() method of Test):

    Found errors in compilation

    ERROR : 1 Type : illegal start of expression at column : 40 Line number : 4
    Source : com.test.SampleSource[string:///com/test/Test.java]

    ERROR : 2 Type : not a statement at column : 39 Line number : 4
    Source : com.test.SampleSource[string:///com/test/Test.java]

To read more about JSR 199, follow the official link. Happy Learning!!!!

Read more articles at blog
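As a follow-up to the JSR 199 walkthrough above, here is a compact, self-contained variant that compiles a class held in a String and reports how many errors the compiler found. It is a sketch under a few assumptions rather than the article's own code: the names `InMemoryCompileSketch`, `StringSource` and `errorCount` are made up, class files are written to a freshly created temporary directory instead of a `classes` folder, and it must run on a JDK (on a plain JRE `ToolProvider.getSystemJavaCompiler()` returns null).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

final class InMemoryCompileSketch {

    /** A compilation unit whose source text lives in memory. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension),
                    Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Compiles the given source and returns the number of ERROR diagnostics reported. */
    static long errorCount(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available; run on a JDK");
        }
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        try (StandardJavaFileManager fileManager =
                     compiler.getStandardFileManager(diagnostics, null, null)) {
            // Write class files to a temporary directory (it is not cleaned up here).
            Path outputDir = Files.createTempDirectory("jsr199-classes");
            List<String> options = Arrays.asList("-d", outputDir.toString());
            JavaCompiler.CompilationTask task = compiler.getTask(
                    null, fileManager, diagnostics, options, null,
                    Arrays.asList(new StringSource(className, code)));
            task.call(); // may be called only once per task
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return diagnostics.getDiagnostics().stream()
                .filter(d -> d.getKind() == Diagnostic.Kind.ERROR)
                .count();
    }

    public static void main(String[] args) {
        String source = "package com.test;\n"
                + "public class Test {\n"
                + "  public static void test() { System.out.println(\"Compiler API Test\"); }\n"
                + "}\n";
        System.out.println("errors: " + errorCount("com.test.Test", source));
    }
}
```

Using try-with-resources closes the file manager even when compilation fails, and returning a count keeps the example short; a real tool would more likely hand back the full diagnostics list so callers can print line and column information as the article does.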

October 15, 2014

by Veeresham Kardas


jQuery Mobile Tutorial: User Registration, Login and Logout Screens for the Meeting Room Booking App

In this jQuery Mobile tutorial we will create the screens that will handle user registration, login and logout in a real-world meeting room booking application. This article is part of a series of mobile application development tutorials that I have been publishing on my blog jorgeramon.me. If you are new to this series, I recommend that you read its first part, as well as this mobile UI patterns article where I provide a flowchart describing the user registration, login and logout screens in a mobile application. We will use this chart as a guide for this article. Here’s a screenshot:

In this part of the tutorial we will only create the static HTML for the screens. In future articles we will implement the programming logic that makes the pages work. The first step we are going to take is to set up a jQuery Mobile project for the app.

How to Set Up a jQuery Mobile Project

While you can use mobile SDKs such as Kendo UI Mobile and Intel XDK to create, debug and deploy jQuery Mobile apps, in this tutorial I will show you how to create a simple jQuery Mobile project without using the facilities provided by those SDKs. I think that it’s important to understand how you can create this type of project from scratch, and how the different pieces in the project work together in an app.

The Project’s Directories and Files

We need to pick a directory in our development workstations where we will place the project’s files. In my case I named that directory “apps”. In that directory, we will create a root directory for the application, which we will name “conf-rooms”. Make sure that this directory is set up so it can be accessed from your local web server. Under “conf-rooms” we will create a “css” directory, where we will place the CSS assets of the project, and an “img” directory for the images that we will use. At the same level as the “apps” directory, we will create a “lib” directory.
This is where we will place jQuery Mobile and any other libraries that our application will use. You also need to set up this directory so it can be accessed from your local web server. On my workstation the directories look as depicted below:

Now is a good time to download the jQuery Mobile and jQuery libraries from their respective websites, and place them in the “jqm” and “jquery” directories, all under the “lib” directory. This is how the files look on my workstation:

How jQuery Mobile Works

Here is a short overview of jQuery Mobile for those who aren’t very familiar with it yet. As its documentation clearly explains, jQuery Mobile is a unified user interface system with the following characteristics:

- It works seamlessly across all popular mobile device platforms.
- It uses jQuery and jQuery UI as its foundations.
- It has a lightweight codebase built on progressive enhancement.
- It has a flexible and easily themeable design.

An attribute that differentiates jQuery Mobile from other frameworks is that it targets a wide variety of mobile browsers. The reason this coverage is possible has to do with the way jQuery Mobile works.

jQuery Mobile works by applying CSS and JavaScript enhancements to HTML pages built with clean, semantic HTML. The usage of semantic HTML ensures compatibility with most web-enabled devices. The techniques applied by the framework to an HTML page transform the semantic page into a rich and interactive experience. We call these changes progressive enhancements, as they are applied progressively to the page, taking advantage of the capabilities of the browser on the web-enabled device. The enhancements result in pages that provide a great user experience on the latest mobile browsers and degrade gracefully on less capable browsers, without losing their intrinsic functionality.
In addition, the framework provides support for screen readers and other assistive technologies through a tight integration with the Web Accessibility Initiative - Accessible Rich Internet Applications Suite (WAI-ARIA) technical specification.

Creating the Landing Screen

The first application screen that we will create is the landing screen. This screen will come up when users launch our application. As reflected in the flowchart at the beginning of this article, the landing screen is the door to all the areas of the app, and it requires that users log in before navigating any further. In the “wireframes for signing in and signing up” section of the third part of this tutorial we created the following mockup for this screen.

Let’s create an empty index.html file in the “conf-rooms” directory, and add a jQuery Mobile page template to the file as follows:

book it welcome! existing users sign in don't have an account? sign up

Before we step through this code I want you to check out this file in a mobile browser or simulator. The result should look like this screenshot:

What do you think? Back in the index.html file, in the head section we have references to the jQuery and jQuery Mobile libraries. Double check that yours are pointing to the correct directories in your workstation.

How to Use a Custom Theme in jQuery Mobile

Now I want to direct your attention to the following lines:

These lines mean that we are using a custom jQuery Mobile theme that resides in the “conf-room1.min.css” file. I created this file using jQuery Mobile’s ThemeRoller. We will use this theme to give our app a look different from the standard jQuery Mobile themes. You can download the theme using this link. After downloading the zipped theme files, we will go to the “css” directory, create a “css/themes/1” directory and place the unzipped theme files there.
when done, the “css” directory should look like this: in the head section of the index.html file we also have this code: app.css is the css file where we will place any additional custom styles that we will use in the app. for the moment, we will add the following code to the “app.css” file: /* change html headers color */ h1,h2,h3,h4,h5 { color:#0071bc; } h2.mc-text-danger, h3.mc-text-danger { color:red; } /* change border radius of icon buttons */ .ui-btn-icon-notext.ui-corner-all { -webkit-border-radius: .3125em; border-radius: .3125em; } /* change color of jquery mobile page headers */ .ui-title { color:#fff; } /* center-aligned text */ .mc-text-center { text-align:center; } /* top margin for some elements */ .mc-top-margin-1-5 { margin-top:1.5em; } these are just a few cosmetic changes that will enhance the look of the app. notice that i prefixed non-jquerymobile classes with the characters “mc-” to avoid potential collisions with jquerymobile’s classes. the remaining lines in the head section of index.html are the references to the jquery and jquery mobile libraries. as i suggested earlier, make sure that yours are pointing to the correct directories in your project. let’s move on to the body section of the “index.html” file. there you will find the standard jquery mobile page template with header and main divs: book it welcome! existing users sign in don't have an account? sign up we have decorated the “header” div with the data-theme="c" attribute, which gives it the nice purple background color that we defined in the custom theme: in the “main” div we are using a couple of links to the sign in and sign up screens respectively. the links point to the sign-in.html and sign-up.html files that we will create in a few minutes. these links are decorated with the jquery mobile ui-btn, ui-btn-b and ui-corner-all classes, which make them look like buttons: this is all we need to do in the landing page for the moment. let’s move on to the log in screen. 
creating a log in screen with jquery mobile here’s the log in screen’s mockup that we built in the third part of this tutorial: we will use the log in screen to capture the user’s credentials and validate them against the application’s user accounts database. if validation succeeds, we will direct users to a “main menu” screen that we will create in an upcoming tutorial. let’s create an empty sign-in.html file in the project’s directory. in the file, we will write the following code: book it sign in email address password remember me submit can't access your account? login failed did you enter the right credentials? ok if you open this sign-in.html with a mobile browser or emulator, you will see something like this: the head section of the html document is similar to the index.html file we created a few minutes ago, with the exception of the document’s title. no need to explain much there. in the “main” section of the jquery mobile page that we added to the file, we dropped a few controls that will allow us to capture the user’s email and password, along with a checkbox that will let us know when the user wants the app to remember their credentials: sign in email address password remember me submit can't access your account? we also added a link that will allow users to initiate the password reset process if they have problems logging in. you will also notice that the “submit” link points to the “dlg-invalid-credentials” anchor defined in the same jquery mobile page. this link is decorated with the data-rel="popup", data-transition="pop" and data-position-to="window" attributes. when we do this, we are telling jquery mobile to open the link to the element with id="dlg-invalid-credentials" as a popup dialog, using a “pop” transition, and center the element relative to the document’s window. here’s the html for the popup: login failed did you enter the right credentials? 
ok notice that the “dlg-invalid-credentials” div is decorated with the data-role="popup" attribute, signaling to jquery mobile to apply popup styles to this div. if you click or tap the “submit” button, you will see the “invalid credentials” popup: one last thing on this screen. we have the popup linked directly to the “submit” button for testing purposes. in an upcoming part of this tutorial we will add programming logic that will activate the popup only when the login fails. for the moment we are only concerned with creating the html code for the pages and making sure that the jquery mobile enhancements work on them. creating an account locked screen with jquery mobile many business applications use an account locked feature as a measure to increase the app’s security. we will use this feature in our app, and this means that we need to create an account locked screen. the purpose of the screen is to notify the user that their account is locked. we will define under which conditions an account will be locked through programming logic that we will add in an upcoming chapter of this tutorial. let’s create an empty account-locked.html file, and drop the following code in it: back app name your account is locked please contact the helpdesk to resolve this issue. the file should look like the screenshot below when viewed with a mobile browser or emulator: the html code for this screen is very similar to that in the prior two screens, with the exception of one element that i want you to pay attention to: back app name the header section of the jquery mobile page we just created contains a link to the sign-in.html file. when we decorate it with the ui-btn-left, ui-btn, ui-btn-icon-notext, ui-corner-all and ui-icon-back classes, we are giving the link the appearance of a toolbar button, just like this: the data-rel="back" attribute causes any taps on the anchor to mimic a “back button”, going back one history entry and ignoring the anchor’s default href. 
you can read more about navigation and linking on jquery mobile by visiting jquery mobile’s navigation documentation. you should also visit the jquery mobile buttons guide to learn about how to create buttons. creating a sign up page with jquery mobile when users tap on the landing screen’s “sign up” button, they will open the sign up screen. this is where we will capture the user’s personal information so we can create an account for them. remember that our mockup of the sign up screen looks like this: let’s create an empty sign-up.html file and add the following code to it: book it sign up first name last name email address password confirm password submit almost done... confirm your email address we sent you an email with instructions on how to confirm your email address. please check your inbox and follow the instructions in the email. ok the file should look as depicted below when you open it with a mobile browser or emulator: the only difference with the mockup is that we are using a “submit” button at the bottom of the screen, instead of a “done” button in the toolbar. when you examine the html code, you will find that the “submit” button is wired to the “dlg-sign-up-sent” popup: almost done... confirm your email address we sent you an email with instructions on how to confirm your email address. please check your inbox and follow the instructions in the email. ok if you tap on the button, the popup will become visible: we will use this popup to notify the users that we have sent them a message asking them to confirm their email address. the message will contain a link to a webpage where users will need to re-enter the email address used to create their account in the app. with this step we are trying to make sure that it was a human with a valid email inbox who created the account. back in the popup’s html code, notice how the “ok” button links back to the sign in screen. you should be able to confirm that this link works when you tap the button. 
creating a password reset screen with jquery mobile the app’s sign in screen has a “can’t access your account?” link that helps users initiate the password reset workflow of the app. the first step of this workflow is to present the “begin password reset” screen to the user. we will use this screen to capture the user’s email address. if we find this email address in the user accounts database of the app, we will email the user a provisional password. next, we will activate the “end password reset” screen, where the user will need to enter the provisional password and a new password of their choosing. the picture below illustrates this process. let’s create an empty begin-password-reset.html file in the project’s directory. we will write the following code in the file: book it password reset enter your email address submit password reset check your inbox we sent you an email with instructions on how to reset your password. please check your inbox and follow the instructions in the email. ok this is how the screen should look when viewed on a mobile browser or emulator: there is nothing new in the html code of this screen. we wired the “submit” button so when a user taps it, the embedded “dlg-pwd-reset-sent” popup will become active: we did this for testing purposes. remember that we will add the programming logic that activates these popups in upcoming chapters of this tutorial. when a user taps the popup’s “ok” button, the application will navigate to the “end password reset” screen, which we will create next. the end password reset screen this is the screen where the user will enter the provisional password we sent them via email, along with a new password of their choosing. to create this screen we will add an empty “end-password-reset.html” file to the project. here’s the code that goes in the file: book it reset password provisional password new password confirm new password submit done your password was changed. 
ok the screen should look like the picture below when viewed with a mobile browser or emulator: we wired the “submit” button so it activates the embedded “dlg-pwd-changed” popup: this popup simply tells the user that their password was changed. tapping the “ok” button will make the app navigate back to the sign in screen, where the user can sign in with the new password. summary and next steps this concludes our first phase of work on the user registration, login and logout screens of the jquery mobile version of the app. i will emphasize again that in this phase we are not adding programming logic to the screens. we are simply creating a jquery mobile page for each screen and making sure that the visual elements within the screens adhere to the mockups that we created in previous chapters of this tutorial, as well as to the ui patterns flowchart that i mentioned at the beginning of this article. while we’ve made significant progress with the app at this point, it’s fair to say that we are just getting started. we are still missing the programming for this article’s screens, as well as the jquery mobile pages and programming for the screens that will allow users to browse and reserve meeting rooms, which is why we created the app in the first place. in the next chapter of this tutorial we will get started with the programming of the user profile screens we just created. don’t forget to sign up for my mailing list so you can be among the first to know when i publish the next update.

October 13, 2014

by Jorge Ramon

· 78,863 Views · 4 Likes

Neo4j: COLLECTing Multiple Values (Too Many Parameters for Function ‘Collect’)

One of my favourite functions in Neo4j’s Cypher query language is COLLECT, which allows us to group items into an array for later consumption. However, I’ve noticed that people sometimes have trouble working out how to collect multiple items with COLLECT and struggle to find a way to do so.

Consider the following data set:

    create (p:Person {name: "Mark"})
    create (e1:Event {name: "Event1", timestamp: 1234})
    create (e2:Event {name: "Event2", timestamp: 4567})
    create (p)-[:EVENT]->(e1)
    create (p)-[:EVENT]->(e2)

If we wanted to return each person along with a collection of the event names they’d participated in we could write the following:

    $ MATCH (p:Person)-[:EVENT]->(e)
    > RETURN p, COLLECT(e.name);
    +--------------------------------------------+
    | p                    | COLLECT(e.name)     |
    +--------------------------------------------+
    | Node[0]{name:"Mark"} | ["Event1","Event2"] |
    +--------------------------------------------+
    1 row

That works nicely, but what if we want to collect the event name and the timestamp but don’t want to return the entire event node? An approach I’ve seen a few people try during workshops is the following:

    MATCH (p:Person)-[:EVENT]->(e)
    RETURN p, COLLECT(e.name, e.timestamp)

Unfortunately this doesn’t compile:

    SyntaxException: Too many parameters for function 'collect' (line 2, column 11)
    "RETURN p, COLLECT(e.name, e.timestamp)"
               ^

As the error message suggests, the COLLECT function only takes one argument so we need to find another way to solve our problem.
One way is to put the two values into a literal array which will result in an array of arrays as our return result:

    $ MATCH (p:Person)-[:EVENT]->(e)
    > RETURN p, COLLECT([e.name, e.timestamp]);
    +----------------------------------------------------------+
    | p                    | COLLECT([e.name, e.timestamp])    |
    +----------------------------------------------------------+
    | Node[0]{name:"Mark"} | [["Event1",1234],["Event2",4567]] |
    +----------------------------------------------------------+
    1 row

The annoying thing about this approach is that as you add more items you’ll forget in which position you’ve put each bit of data, so I think a preferable approach is to collect a map of items instead:

    $ MATCH (p:Person)-[:EVENT]->(e)
    > RETURN p, COLLECT({eventName: e.name, eventTimestamp: e.timestamp});
    +--------------------------------------------------------------------------------------------------------------------------+
    | p                    | COLLECT({eventName: e.name, eventTimestamp: e.timestamp})                                         |
    +--------------------------------------------------------------------------------------------------------------------------+
    | Node[0]{name:"Mark"} | [{eventName -> "Event1", eventTimestamp -> 1234},{eventName -> "Event2", eventTimestamp -> 4567}] |
    +--------------------------------------------------------------------------------------------------------------------------+
    1 row

During the Clojure Neo4j Hackathon that we ran earlier this week this proved to be a particularly pleasing approach as we could easily destructure the collection of maps in our Clojure code.
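To make the positional-versus-named trade-off concrete, here is a small Python sketch of how client code might read the two result shapes. The two lists are hand-written stand-ins that mirror the query output above; they are assumptions for illustration and do not come from a real Neo4j driver call.

```python
# Hand-written stand-ins for the two collected shapes shown above
# (assumed data, not fetched from Neo4j).
events_as_arrays = [["Event1", 1234], ["Event2", 4567]]
events_as_maps = [
    {"eventName": "Event1", "eventTimestamp": 1234},
    {"eventName": "Event2", "eventTimestamp": 4567},
]

# With arrays the reader has to remember that index 0 is the name
# and index 1 is the timestamp.
names_from_arrays = [event[0] for event in events_as_arrays]

# With maps the field is named at the point of use.
names_from_maps = [event["eventName"] for event in events_as_maps]

# Both shapes carry the same information.
latest_timestamp = max(event["eventTimestamp"] for event in events_as_maps)

print(names_from_arrays, names_from_maps, latest_timestamp)
```

Adding a third collected value would change nothing for code reading the maps, whereas code reading the arrays would need every index rechecked.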

October 13, 2014

by Mark Needham

· 10,408 Views

R: Filtering data frames by column type ('x' must be numeric)

I’ve been working through the exercises from An Introduction to Statistical Learning and one of them required you to create a pairwise correlation matrix of variables in a data frame. The exercise uses the ‘Carseats’ data set which can be imported like so:

    > install.packages("ISLR")
    > library(ISLR)
    > head(Carseats)
      Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban  US
    1  9.50       138     73          11        276   120       Bad  42        17   Yes Yes
    2 11.22       111     48          16        260    83      Good  65        10   Yes Yes
    3 10.06       113     35          10        269    80    Medium  59        12   Yes Yes
    4  7.40       117    100           4        466    97    Medium  55        14   Yes Yes
    5  4.15       141     64           3        340   128       Bad  38        13   Yes  No
    6 10.81       124    113          13        501    72       Bad  78        16    No Yes

If we try to run the ‘cor’ function on the data frame we’ll get the following error:

    > cor(Carseats)
    Error in cor(Carseats) : 'x' must be numeric

As the error message suggests, we can’t pass non-numeric variables to this function so we need to remove the categorical variables from our data frame.
But first we need to work out which columns those are:

    > sapply(Carseats, class)
          Sales   CompPrice      Income Advertising  Population       Price   ShelveLoc         Age   Education
      "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"    "factor"   "numeric"   "numeric"
          Urban          US
       "factor"    "factor"

We can see a few columns of type ‘factor’ and luckily for us there’s a function which will help us identify those more easily:

    > sapply(Carseats, is.factor)
          Sales   CompPrice      Income Advertising  Population       Price   ShelveLoc         Age   Education
          FALSE       FALSE       FALSE       FALSE       FALSE       FALSE        TRUE       FALSE       FALSE
          Urban          US
           TRUE        TRUE

Now we can remove those columns from our data frame and create the correlation matrix:

    > cor(Carseats[sapply(Carseats, function(x) !is.factor(x))])
                      Sales   CompPrice       Income  Advertising   Population       Price          Age    Education
    Sales        1.00000000  0.06407873  0.151950979  0.269506781  0.050470984 -0.44495073 -0.231815440 -0.051955242
    CompPrice    0.06407873  1.00000000 -0.080653423 -0.024198788 -0.094706516  0.58484777 -0.100238817  0.025197050
    Income       0.15195098 -0.08065342  1.000000000  0.058994706 -0.007876994 -0.05669820 -0.004670094 -0.056855422
    Advertising  0.26950678 -0.02419879  0.058994706  1.000000000  0.265652145  0.04453687 -0.004557497 -0.033594307
    Population   0.05047098 -0.09470652 -0.007876994  0.265652145  1.000000000 -0.01214362 -0.042663355 -0.106378231
    Price       -0.44495073  0.58484777 -0.056698202  0.044536874 -0.012143620  1.00000000 -0.102176839  0.011746599
    Age         -0.23181544 -0.10023882 -0.004670094 -0.004557497 -0.042663355 -0.10217684  1.000000000  0.006488032
    Education   -0.05195524  0.02519705 -0.056855422 -0.033594307 -0.106378231  0.01174660  0.006488032  1.000000000

October 8, 2014

by Mark Needham

· 29,126 Views

Using Groovy To Import XML Into MongoDB

This year I’ve been demonstrating how easy it is to create modern web apps using AngularJS, Java and MongoDB. I also use Groovy during this demo to do the sorts of things Groovy is really good at - writing descriptive tests, and creating scripts. Due to the time pressures in the demo, I never really get a chance to go into the details of the script I use, so the aim of this long-overdue blog post is to go over this Groovy script in a bit more detail.

Firstly I want to clarify that this is not my original work - I borrowed most of the ideas for the demo from my colleague Ross Lawley. In this blog post he goes into detail of how he built up an application that finds the most popular pub names in the UK. There’s a section in there where he talks about downloading the open street map data and using Python to convert the XML into something more MongoDB-friendly - it’s this process that I basically stole, re-worked for coffee shops, and re-wrote for the JVM.

I’m assuming if you’ve worked with Java for any period of time, there has come a moment where you needed to use it to parse XML. Since my demo is supposed to be all about how easy it is to work with Java, I did not want to do this. When I wrote the demo I wasn’t really all that familiar with Groovy, but what I did know was that it has built-in support for parsing and manipulating XML, which is exactly what I wanted to do. In addition, creating Maps (the data structures, not the geographical ones) with Groovy is really easy, and this is effectively what we need to insert into MongoDB.

Goal Of The Script

- Parse an XML file containing open street map data of all coffee shops.
- Extract latitude and longitude XML attributes and transform into MongoDB GeoJSON.
- Perform some basic validation on the coffee shop data from the XML.
- Insert into MongoDB.
- Make sure MongoDB knows this contains query-able geolocation data.
The script is PopulateDatabase.groovy; that link will take you to the version I presented at JavaOne.

Firstly, We Need Data

I used the same service Ross used in his blog post to obtain the XML file containing “all” coffee shops around the world. Now, the open street map data is somewhat… raw and unstructured (which is why MongoDB is such a great tool for storing it), so I’m not sure I really have all the coffee shops, but I obtained enough data for an interesting demo using http://www.overpass-api.de/api/xapi?*[amenity=cafe][cuisine=coffee_shop]

The resulting XML file is in the github project, but if you try this yourself you might (in fact, probably will) get different results. Each XML record looks something like: Each coffee shop has a unique identifier and a latitude and longitude as attributes of a node element. Within this node is a series of tag elements, all with k and v attributes. Each coffee shop has a varying number of these tags, and they are not consistent from shop to shop (other than amenity and cuisine which we used to select this data).

Initialisation

Before doing anything else we want to prepare the database. The assumption of this script is that either the collection we want to store the coffee shops in is empty, or full of stale data. So we’re going to use the MongoDB Java Driver to get the collection that we’re interested in, and then drop it. There are two interesting things to note here:

- This Groovy script is simply using the basic Java driver. Groovy can talk quite happily to vanilla Java, it doesn’t need to use a Groovy library. There are Groovy-specific libraries for talking to MongoDB (e.g. the MongoDB GORM Plugin), but the Java driver works perfectly well.
- You don’t need to create databases or collections (collections are a bit like tables, but less structured) explicitly in MongoDB. You simply use the database and collection you’re interested in, and if they don’t already exist, the server will create them for you.
In this example, we’re just using the default constructor for the MongoClient, the class that represents the connection to the database server(s). This default is localhost:27017, which is where I happen to be running the database. However you can specify your own address and port - for more details on this see Getting Started With MongoDB and Java.

Turn The XML Into Something MongoDB-Shaped

So next we’re going to use Groovy’s XmlSlurper to read the open street map XML data that we talked about earlier. To iterate over every node we use: xmlSlurper.node.each. For those of you who are new to Groovy or new to Java 8, you might notice this is using a closure to define the behaviour to apply for every “node” element in the XML.

Create GeoJSON

Since MongoDB documents are effectively just maps of key-value pairs, we’re going to create a Map coffeeShop that contains the document structure that represents the coffee shop that we want to save into the database. Firstly, we initialise this map with the attributes of the node. Remember these attributes are something like: We’re going to save the ID as a value for a new field called openStreetMapId. We need to do something a bit more complicated with the latitude and longitude, since we need to store them as GeoJSON, which looks something like:

    { 'location' : { 'coordinates': [<longitude>, <latitude>], 'type' : 'Point' } }

In lines 12-14 you can see that we create a Map that looks like the GeoJSON, pulling the lat and lon attributes into the appropriate places.

Insert Remaining Fields

Now for every tag element in the XML, we get the k attribute and check if it’s a valid field name for MongoDB (it won’t let us insert fields with a dot in, and we don’t want to override our carefully constructed location field). If so we simply add this key as the field and its matching v attribute as the value into the map.
This effectively copies the OpenStreetMap key/value data into key/value pairs in the MongoDB document so we don’t lose any data, but we also don’t do anything particularly interesting to transform it.

Save Into MongoDB

Finally, once we’ve created a simple coffeeShop Map representing the document we want to save into MongoDB, we insert it into MongoDB if the map has a field called name. We could have checked this when we were reading the XML and putting it into the map, but it’s actually much easier just to use the pretty Groovy syntax to check for a key called name in coffeeShop. When we want to insert the Map we need to turn this into a BasicDBObject, the Java Driver’s document type, but this is easily done by calling the constructor that takes a Map. Alternatively, there’s a Groovy syntax which would effectively do the same thing, which you might prefer:

    collection.insert(coffeeShop as BasicDBObject)

Tell MongoDB That We Want To Perform Geo Queries On This Data

Because we’re going to do a nearSphere query on this data, we need to add a “2dsphere” index on our location field. We created the location field as GeoJSON, so all we need to do is call createIndex for this field.

Conclusion

So that’s it! Groovy is a nice tool for this sort of script-y thing - not only is it a scripting language, but its built-in support for XML, really nice Map syntax and support for closures make it the perfect tool for iterating over XML data and transforming it into something that can be inserted into a MongoDB collection.
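For readers who want to see the transformation end to end without the original script to hand, the steps described above can be sketched in Python using only the standard library. This is not the author's Groovy code: the sample XML, the function names and the conversion of lat/lon to floats are assumptions made for illustration, and the database insert is left out so the sketch runs on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the records described in the post:
# id/lat/lon attributes on <node>, k/v attributes on each <tag>.
SAMPLE_XML = """
<osm>
  <node id="123" lat="51.5" lon="-0.12">
    <tag k="amenity" v="cafe"/>
    <tag k="cuisine" v="coffee_shop"/>
    <tag k="name" v="Example Coffee"/>
    <tag k="addr.street" v="High Street"/>
    <tag k="location" v="should not overwrite the GeoJSON"/>
  </node>
  <node id="456" lat="40.7" lon="-74.0">
    <tag k="amenity" v="cafe"/>
  </node>
</osm>
"""


def is_valid_field_name(key):
    # Mirrors the two checks named in the post: no dots in the field
    # name, and never overwrite the location field.
    return "." not in key and key != "location"


def node_to_document(node):
    # GeoJSON points list longitude first, then latitude.
    document = {
        "openStreetMapId": node.get("id"),
        "location": {
            "coordinates": [float(node.get("lon")), float(node.get("lat"))],
            "type": "Point",
        },
    }
    # Copy every remaining key/value tag straight into the document.
    for tag in node.findall("tag"):
        key = tag.get("k")
        if is_valid_field_name(key):
            document[key] = tag.get("v")
    return document


def coffee_shop_documents(xml_text):
    root = ET.fromstring(xml_text.strip())
    documents = (node_to_document(node) for node in root.findall("node"))
    # Only shops that have a name are kept, as in the post.
    return [doc for doc in documents if "name" in doc]


shops = coffee_shop_documents(SAMPLE_XML)
print(shops)
```

Each dictionary in `shops` has the shape that would be handed to the driver's insert call; the second sample node is dropped because it has no name tag.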

October 8, 2014

by Trisha Gee

· 10,320 Views

How to Allow Only HTTPS on an S3 Bucket

It is possible to disable HTTP access on an S3 bucket, limiting S3 traffic to only HTTPS requests. The documentation is scattered around the Amazon AWS documentation, but the solution is actually straightforward. All you need to do to block HTTP traffic on an S3 bucket is add a Condition in your bucket's policy. AWS supports a global condition for verifying SSL. So you can add a condition like this:

    "Condition": {
      "Bool": {
        "aws:SecureTransport": "true"
      }
    }

Here's a complete example:

    {
      "Version": "2008-10-17",
      "Id": "some_policy",
      "Statement": [
        {
          "Sid": "AddPerm",
          "Effect": "Allow",
          "Principal": { "AWS": "*" },
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::my_bucket/*",
          "Condition": {
            "Bool": { "aws:SecureTransport": "true" }
          }
        }
      ]
    }

Now accessing the contents of my_bucket over HTTP will produce a 403 error, while using HTTPS will work fine.
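If you manage several buckets, it can help to generate the policy document rather than hand-edit the JSON. The following is a minimal Python sketch that builds the same policy for a given bucket name; the function name `https_only_read_policy` is hypothetical, and applying the result to a bucket (through the console or an SDK) is deliberately left out.

```python
import json


def https_only_read_policy(bucket_name):
    # Same statement as the example above, with the bucket name
    # substituted into the resource ARN.
    return {
        "Version": "2008-10-17",
        "Id": "some_policy",
        "Statement": [
            {
                "Sid": "AddPerm",
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::%s/*" % bucket_name,
                "Condition": {"Bool": {"aws:SecureTransport": "true"}},
            }
        ],
    }


policy_json = json.dumps(https_only_read_policy("my_bucket"), indent=2)
print(policy_json)
```

Note that this statement only grants read access when the request arrives over HTTPS; it assumes no other policy or ACL on the bucket grants the same access over plain HTTP.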

October 8, 2014

by Matt Butcher

· 17,783 Views

What is Write Concern in MongoDB?

In MongoDB there are multiple guarantee levels available for reporting the success of a write operation, called Write Concerns. The strength of the write concern determines the level of guarantee. A weak Write Concern has better performance at the cost of a weaker guarantee, while a strong Write Concern gives a stronger guarantee because clients wait for the write operation to be confirmed.

MongoDB provides different levels of write concern to better address the specific needs of applications. Clients may adjust write concern to ensure that the most important operations persist successfully to an entire MongoDB deployment. For other less critical operations, clients can adjust the write concern to ensure faster performance rather than ensure persistence to the entire deployment.

Write Concern Levels

MongoDB has the following levels of conceptual write concern, listed from weakest to strongest:

Unacknowledged

With an unacknowledged write concern, MongoDB does not acknowledge the receipt of write operations. Unacknowledged is similar to errors ignored; however, drivers will attempt to receive and handle network errors when possible. The driver’s ability to detect network errors depends on the system’s networking configuration.

Acknowledged

With a receipt acknowledged write concern, the mongod confirms the receipt of the write operation. Acknowledged write concern allows clients to catch network, duplicate key, and other errors. This is the default write concern.

Journaled

With a journaled write concern, MongoDB acknowledges the write operation only after committing the data to the journal. This write concern ensures that MongoDB can recover the data following a shutdown or power interruption. You must have journaling enabled to use this write concern.

Replica Acknowledged

Replica sets present additional considerations with regard to write concern. The default write concern only requires acknowledgement from the primary.
With replica acknowledged write concern, you can guarantee that the write operation propagates to additional members of the replica set. For example, a write operation to a replica set with a write concern level of w:2 is acknowledged only after it has been written to the primary and at least one secondary.
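As a quick reference, the four conceptual levels correspond to write concern option documents built from the `w` and `j` fields. The Python sketch below simply records that mapping as plain dictionaries; the level names used as keys and the helper function are illustrative choices, and no driver is involved, so nothing here talks to a server.

```python
# Conceptual level -> write concern options, weakest to strongest.
# "w" is the number of members that must acknowledge the write;
# "j" asks for acknowledgement only after the journal commit.
WRITE_CONCERNS = {
    "unacknowledged": {"w": 0},
    "acknowledged": {"w": 1},
    "journaled": {"w": 1, "j": True},
    "replica_acknowledged": {"w": 2},
}


def acknowledgements_required(level):
    # Number of replica set members that must confirm the write
    # before the client is told it succeeded.
    return WRITE_CONCERNS[level]["w"]


for level, options in WRITE_CONCERNS.items():
    print(level, options, acknowledgements_required(level))
```

Drivers usually accept these options per connection, per collection or per operation, so a single application can pair a stronger level for critical writes with a weaker one for less critical writes.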

October 7, 2014

by Rishav Rohit

· 26,412 Views · 2 Likes

PostgreSQL: ERROR: Column Does Not Exist

I’ve been playing around with PostgreSQL recently and in particular the Northwind dataset typically used as an introductory data set for relational databases. Having imported the data I wanted to take a quick look at the employees table:

    postgres=# SELECT * FROM employees LIMIT 1;
     EmployeeID | LastName | FirstName | Title | TitleOfCourtesy | BirthDate | HireDate | Address | City | Region | PostalCode | Country | HomePhone | Extension | Photo | Notes | ReportsTo | PhotoPath
    ------------+----------+-----------+----------------------+-----------------+------------+------------+-----------------------------+---------+--------+------------+---------+----------------+-----------+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--------------------------------------
     1 | Davolio | Nancy | Sales Representative | Ms. | 1948-12-08 | 1992-05-01 | 507 - 20th Ave. E.\nApt. 2A | Seattle | WA | 98122 | USA | (206) 555-9857 | 5467 | \x | Education includes a BA in psychology from Colorado State University in 1970. She also completed "The Art of the Cold Call." Nancy is a member of Toastmasters International. | 2 | http://accweb/emmployees/davolio.bmp
    (1 row)

That works fine but what if I only want to return the ‘EmployeeID’ field?

    postgres=# SELECT EmployeeID FROM employees LIMIT 1;
    ERROR:  column "employeeid" does not exist
    LINE 1: SELECT EmployeeID FROM employees LIMIT 1;

I hadn’t realised (or had forgotten) that unquoted identifiers are folded to lower case, so we need to quote the name if it’s been stored in mixed case:

    postgres=# SELECT "EmployeeID" FROM employees LIMIT 1;
     EmployeeID
    ------------
              1
    (1 row)

From my reading the suggestion seems to be to have your field names lower cased to avoid this problem but since it’s just a dummy data set I guess I’ll just put up with the quoting overhead for now.
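The folding rule is easy to model. Here is a small Python sketch that mimics how an identifier written in a query is resolved: unquoted names are lower-cased, while double-quoted names keep their case. The function names are made up for this illustration, and the sketch ignores details such as identifier length limits.

```python
def fold_identifier(written):
    # A double-quoted identifier keeps its exact case; a doubled
    # quote inside it stands for a single literal quote.
    if len(written) >= 2 and written[0] == '"' and written[-1] == '"':
        return written[1:-1].replace('""', '"')
    # Anything else is folded to lower case before lookup.
    return written.lower()


def column_exists(stored_columns, written):
    # stored_columns holds the names exactly as they were created.
    return fold_identifier(written) in stored_columns


# Column names as stored by the Northwind import (mixed case).
employees_columns = {"EmployeeID", "LastName", "FirstName"}

print(column_exists(employees_columns, "EmployeeID"))
print(column_exists(employees_columns, '"EmployeeID"'))
```

The first lookup fails because `EmployeeID` is folded to `employeeid`, which is the name the error message reports; the quoted form matches the stored column.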

October 7, 2014

by Mark Needham

· 17,621 Views

Comparison of SQL Server Compact, SQLite, SQL Server Express and LocalDB

Now that SQL Server 2014 and SQL Server Compact 4 have been released, some developers are curious about the differences between SQL Server Compact 4.0 and SQL Server Express 2014 (including LocalDB). I have updated the comparison table from the excellent discussion of the differences between Compact 3.5 and Express 2005 here to reflect the changes in the newer versions of each product. Information about LocalDB comes from here and SQL Server 2014 Books Online. LocalDB is the full SQL Server Express engine, but invoked directly from the client provider. It is a replacement for the current “User Instance” feature in SQL Server Express.

Feature | SQL Server Compact 3.5 SP2 | SQL Server Compact 4.0 | SQLite, incl SQLite ADO.NET Provider | SQL Server Express 2012 | SQL Server 2012 LocalDB

Deployment/Installation Features
Installation size | 2.5 MB download size, 12 MB expanded on disk | 2.5 MB download size, 18 MB expanded on disk | 10 MB download, 14 MB expanded on disk | 120 MB download size, > 300 MB expanded on disk | 32 MB download size, > 160 MB on disk
ClickOnce deployment | Yes | Yes | Yes | Yes | Yes
Privately installed, embedded, with the application | Yes | Yes | Yes | No | No
Non-admin installation option | Yes | Yes | Yes | No | No
Runs under ASP.NET | No | Yes | Yes | Yes | Yes
Runs on Windows Mobile / Windows Phone platform | Yes | No | Yes | No | No
Runs on WinRT (Phone/Store Apps) | No | No | Yes | No | No
Runs on non-Microsoft platforms | No | No | Yes | No | No
Installed centrally with an MSI | Yes | Yes | Yes | Yes | Yes
Runs in-process with application | Yes | Yes | Yes | No | No (as process started by app)
64-bit support | Yes | Yes | Yes | Yes | Yes
Runs as a service | No - in process with application | No - in process with application | No - in process with application | Yes | No - as launched process

Data file features
File format | Single file | Single file | Single file | Multiple files | Multiple files
Data file storage on a network share | No | No | No | No | No
Support for different file extensions | Yes | Yes | Yes | No | No
Database size support | 4 GB | 4 GB | 140 TB | 10 GB | 10 GB
XML storage | Yes - stored as ntext | Yes - stored as ntext | Yes, stored as text | Yes, native | Yes, native
Binary (BLOB) storage | Yes - stored as image | Yes - stored as image | Yes | Yes | Yes
FILESTREAM support | No | No | No | Yes | No
Code free, document safe, file format | Yes | Yes | Yes | No | No

Programmability
Transact-SQL - Common Query Features | Yes | Yes | No | Yes | Yes
Procedural T-SQL - Select Case, If, features | No | No | Limited | Yes | Yes
Remote Data Access (RDA) | Yes | No (not supported) | No | No | No
ADO.NET Sync Framework | Yes | No | No | Yes | Yes
LINQ to SQL | Yes | No (not supported) | No | Yes | Yes
ADO.NET Entity Framework 4.1 | Yes (no Code First) | Yes | Yes | Yes | Yes
ADO.NET Entity Framework 6 | Yes (fully) | Yes (fully) | Yes (limited) | Yes | Yes
Subscriber for merge replication | Yes | No | No | Yes | No
Simple transactions | Yes | Yes | Yes | Yes | Yes
Distributed transactions | No | No | No | Yes | Yes
Native XML, XQuery/XPath | No | No | No | Yes | Yes
Stored procedures, views, triggers | No | No | Views and triggers | Yes | Yes
Role-based security | No | No | No | Yes | Yes
Number of concurrent connections | 256 (100) | 256 | Unlimited | Unlimited | Unlimited (but only local)

There is also a table here that allows you to determine which Transact-SQL commands, features, and data types are supported by SQL Server Compact 3.5 (which are the same as 4.0 with very few exceptions), compared with SQL Server 2005 and 2008.

October 4, 2014

by Erik Ejlskov Jensen

· 24,948 Views

Product Catalog with MongoDB, Part 2: Product Search

Continue learning about product catalogs in MongoDB as we look at product searches.

September 26, 2014

by Antoine Girbal

· 19,539 Views · 4 Likes

Product Catalog with MongoDB, Part 1: Schema Design

This post is part of the Product Catalog MongoDB Series, in which we will cover many aspects of building a product catalog with MongoDB. This approach has been tested with a varied product catalog of 130 million items running on a single server (EC2 i2.2xlarge).

MongoDB seems to be the perfect fit for implementing a product catalog, since products map so well to documents. But as we shall see, it is not as easy as it seems! The data is fairly complex, with many relationships involved. Also, almost every other system will want to make use of the catalog instead of making its own copy, so a low-latency, scalable and geo-distributed catalog service is typically the ideal solution.

A product has at least the following information:
- Item: the overall product info (e.g. Levi's 501)
- Variant: a specific variant of an item (e.g. in black, size 6), which typically has a specific SKU / UPC
- Price: price information may vary based on the store, the variant, etc.
- Hierarchy: the item taxonomy
- Facet: facets to search products by
- Vendors: a given SKU may be available through different vendors if the site is a marketplace

A classic pitfall is to try to fit everything into a single document. As a result you end up with something very complex with many nested lists, which makes it difficult to navigate and index. Additionally, APIs find themselves sending back massive documents even if only partial info is needed. In certain cases, we've seen items with thousands of variants (e.g. automotive parts) which go beyond 16MB of pure JSON (in which case compression becomes mandatory)! Instead, here we are going to model the data in a way that is natural, maps well to the API, and sits at the sweet spot between normalization and denormalization.

Item Model

The item collection holds documents representing the high-level data of a product. Here is a sample item document for a shoe:

{
  "_id": "054VA72303012P",
  "desc": [ { "lang": "en", "val": "Give your dressy look a lift with these women's Kate high-heel shoes by Metaphor. These playful peep-toe pumps feature satin-wrapped stiletto heels and chiffon pompoms at the toes. Rhinestones on each of the silvertone buckles add just a touch of sparkle to these shoes for a flirty footwear look that's made for your next night out." } ],
  "name": "Women's Kate Ivory Peep-Toe Stiletto Heel",
  "lname": "women's kate ivory peep-toe stiletto heel",
  "category": "/84700/80009/1282094266/1200003270",
  "brand": { "id": "2483510", "img": { "src": "http://i.sears.com/s/i/bl/image/spin_prod_metadata_168138610" }, "name": "Metaphor" },
  "assets": {
    "imgs": [
      { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_967112812", "width": "1900" } },
      { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945877912", "width": "1900" } },
      { "img": { "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945878012", "width": "1900" } }
    ]
  },
  "shipping": { "dimensions": { "height": "13.0", "length": "1.8", "width": "26.8" }, "weight": "1.75" },
  "specs": [ { "name": "Heel Height (in.)", "val": "3.75" } ],
  "attrs": [
    { "name": "Heel Height", "value": "High (2-1/2 to 4 in.)" },
    { "name": "Upper Material", "value": "Synthetic" },
    { "name": "Toe", "value": "Open toe" },
    { "name": "Brand", "value": "Metaphor" }
  ],
  "variants": {
    "cnt": 9,
    "attrs": [
      { "dispType": "COMBOBOX", "name": "Width" },
      { "dispType": "DROPDOWN", "name": "Color" },
      { "dispType": "DROPDOWN", "name": "Shoe Size" }
    ]
  },
  "lastUpdated": 1400877254787
}

Fields of interest:
- _id: the product id
- lastUpdated: useful timestamp to see recently updated items
- category: the category path, made up of hierarchy nodes
- name: the product name
- lname: a lower-case version of the name. This can be useful for doing case-insensitive matching with an index
- brand: the brand
- desc: list of descriptions (website, retail box, etc.)
- assets: list of assets (images, etc.)
- attrs: list of attributes as name-value pairs. Will be used to implement faceting. Note that the brand is also included as one attribute.
- variants: some information on variants, but not the variants themselves

Common queries (indexed):
- find by id: { _id: "the product id" }
- find by category prefix: { category: { $regex: "^category prefix" } }
- find by case-insensitive name prefix: { lname: { $regex: "^name prefix" } }

Variant Model

The variant documents represent specific variations of a product. Certain products only exist in a unique variant (e.g. XBox, no options to pick) whereas other products may have thousands. Here is a sample variant document for the same shoe:

{
  "_id": "05458452563",
  "name": "Width:Medium,Color:Ivory,Shoe Size:6.5",
  "lname": "width:medium,color:ivory,shoe size:6.5",
  "itemId": "054VA72303012P",
  "altIds": { "upc": "632576103580" },
  "assets": {
    "imgs": [
      { "width": "1900", "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348512" },
      { "width": "1900", "height": "1900", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_945348612" }
    ]
  },
  "attrs": [
    { "name": "Width", "value": "Medium" },
    { "name": "Color", "family": "White", "value": "Ivory" },
    { "name": "Shoe Size", "value": "6.5" }
  ]
}

Fields of interest:
- _id: the SKU
- itemId: the parent item id
- attrs: a list of attributes specific to the variant. Note that some of the attributes may have both a specific value (e.g. ivory) and a family value (e.g. white).
- assets: assets specific to the variant (e.g. image with a specific color)

Common queries (indexed):
- find by SKU: { _id: "the sku" }
- find by item id: { itemId: "item id" }

Hierarchy Model

The hierarchy document represents a node of the hierarchical tree representing the product taxonomy. The top-level nodes represent departments, while further nodes represent specific categories.

{
  "_id": "1200003270",
  "name": "Women's Heels & Pumps",
  "count": 223,
  "parents": [ "1282094266" ],
  "facets": [ "Heel Height", "Toe", "Upper Material", "Width", "Shoe Size", "Color" ]
}

Fields of interest:
- _id: the category id
- name: the category name
- count: the number of items in this category. It can be a useful statistic to display.
- parents: list of parent nodes. Simpler implementations could make use of a single value.
- facets: list of facets that exist for this category (e.g. color, size). This info will be used when displaying the facets available on the search page.

Common queries (indexed):
- find by parent id: { parents: "parent id" }
- find top-level departments: { parents: null }

Facet Model

The facet document represents a name/value pair describing a product attribute.

{ "_id": "accessory type=hosiery", "name": "Accessory Type", "value": "Hosiery", "count": 14 }

Fields of interest:
- _id: the id, which is a concatenation of the lower-cased facet name and value
- name: the facet name with original casing, e.g. "Accessory Type"
- value: the facet value with original casing, e.g. "Hosiery". Important note: here the value should be the family value if possible, e.g. "White" rather than "Ivory". Those facets will be used for searching items, and the family value is better for that purpose.
- count: the number of items that have this facet. This count will be important in defining the order of attributes in a query when doing faceted search.

Common queries (indexed):
- find a specific facet: { _id: "name=value" }
- find facets for a name: { _id: { $regex: "^name=" } }

Price Model

The price document obviously represents the price of an item, but there is quite a bit more to it. We want to be able to vary the price per variant (e.g. gold color is more expensive) or per store (e.g. online store is cheaper). While we will not touch on the store model in this post, let's just imagine we have a few thousand stores which are grouped into a dozen store groups (e.g. online, west coast, etc.). If we implement this naively, we would end up with 1000 stores x 200m variants = 200 billion price documents! Instead, let's be a bit smarter and make good use of MongoDB's querying capability by allowing products to be priced at different levels as needed, thus keeping the number of documents in the millions. A document looks like:

{
  "_id": "SPM8824542513_1234",
  "price": "69.99",
  "sale": { "salePrice": "42.72", "saleEndDate": "2050-12-31 23:59:59" },
  "lastUpdated": 1374647707394
}

Fields of interest:
- _id: the id is built in a specific way. It is the concatenation of the item information and store information. The item information is either the item id or the variant id (SKU). The store information is either the store group id or the store id.
- price: the regular price
- sale: sales information, optional

Common queries (indexed):
- find all prices by item id: { _id: { $regex: "^itemId_" } }
- find all prices by SKU (price could be at item level): { _id: { $in: [ /^itemId_/, /^sku_/ ] } }
- find price for a given SKU and store (4 combinations are possible): { _id: { $in: [ "itemId_storeGroupId", "itemId_storeId", "sku_storeGroupId", "sku_storeId" ] } }
- find items on sale, starting with the ones ending soonest: { "sale.saleEndDate": { $ne: null } } with sort by { "sale.saleEndDate": 1 } (sparse index on "sale.saleEndDate")

Summary Model

The previous documents are now properly modeled, easy to maintain, and can efficiently power an API to serve product details pages. The last and most difficult issue left to tackle is how to do faceted searching and other kinds of browsing. We could leave it to a full text search system or similar software, but it's actually doable with MongoDB.
For this purpose we face some tough challenges:
- whatever the search is, we need a response within milliseconds, returning hundreds of items
- the search can be a combination of many facets: category, brand, etc.
- facets can be at both the item and variant levels: color, size, etc. If matching a specific variant, we should display that specific image (e.g. red shoes).
- hundreds of variants of the same item could match, in which case only the parent item should be returned as a result
- efficient sorting on several attributes: price, popularity
- a pagination feature, which requires deterministic ordering

To address these we create a separate collection called Summary, in which each document represents the summary information of an item and all its variants. The data is stripped down to the minimum needed to power a browse & search feature. Such a document looks like:

{
  "_id": "3ZZVA46759401P",
  "name": "Women's Chic - Black Velvet Suede",
  "lname": "women's chic - black velvet suede",
  "dep": "84700",
  "cat": "/84700/80009/1282094266/1200003270",
  "desc": [ { "lang": "en", "val": "This pointy toe slingback features a high quality upper and a classy, simple silhouette. This heel has a classic shape, an adjustable ankle strap for a vintage feel and a secure fit. The Chic is the perfect combination between dressy and professional." } ],
  "img": [ { "height": "330", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726201", "title": "spin_prod_591726201", "width": "450" } ],
  "attrs": [ "heel height=mid (1-3/4 to 2-1/4 in.)", "brand=metaphor" ],
  "sattrs": [ "upper material=synthetic", "toe=open toe" ],
  "vars": [
    {
      "id": "05497884001",
      "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ],
      "attrs": [ "width=medium", "color=black", "shoe size=6" ]
    },
    {
      "id": "05497884002",
      "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ],
      "attrs": [ "width=medium", "color=black", "shoe size=6.5" ]
    },
    {
      "id": "05497884004",
      "img": [ { "height": "400", "src": "http://c.shld.net/rpx/i/s/i/spin/image/spin_prod_591726301", "title": "spin_prod_591726301", "width": "450" } ],
      "attrs": [ "width=medium", "color=black", "shoe size=7.5" ]
    }
  ]
}

Fields of interest:
- _id: the item id
- name: the item name
- lname: the item name, lower-case
- img: list of images, ideally just the thumbnail
- dep: the department (top level of category). Needs to be separate for proper indexing
- cat: the category path
- attrs: the item attributes, to be indexed
- sattrs: the item secondary attributes, not to be indexed
- vars: list of variants
- vars.id: the variant SKU
- vars.attrs: the variant attributes, to be indexed
- vars.sattrs: the secondary variant attributes, not to be indexed

Indices:
- department + attrs + category + _id
- department + vars.attrs + category + _id
- department + category + _id
- department + price + _id
- department + rating + _id

Common queries (indexed):
- find by department: { dep: "department" }
- find by category prefix: { dep: "department", cat: { $regex: "^category prefix" } }
- find by item attribute: { dep: "department", attrs: "name=value" }
- find by several item attributes: { dep: "department", attrs: { $all: [ "name=value", ... ] } }
- find by variant attribute: { dep: "department", "vars.attrs": "name=value" }
- find by several variant attributes: { dep: "department", "vars.attrs": { $all: [ "name=value", ... ] } }
- find by item attributes, variant attributes, category: { dep: "department", attrs: { $all: [ "name=value", ... ] }, "vars.attrs": { $all: [ "name=value", ... ] }, cat: { $regex: "^category prefix" } }

A few interesting notes on indexing / querying:
- Each index starts with the department, which is a convenient way to subdivide our product catalog. It is an acceptable restriction to force the user to pick a department before displaying any kind of search facet (unless we're displaying a pre-computed list like "most popular"). Hence having the department there will ensure that there is always a large amount of filtering done for cheap by the index :)
- Each index ends with "_id", which is useful for pagination. It will give sorting on _id for free for some common queries. It's always better to avoid resorting to skip/limit for pagination, which is only fine for a low number of pages.
- For queries using "$all", the most restrictive attribute should be specified first (e.g. "color=red"). This information can be inferred from the "facet" collection described earlier. This piece is critical to make faceted searches efficient and keep them within a few milliseconds.

Conclusion

In conclusion, we've seen here how to model and index a product catalog in MongoDB in a way that allows high performance, flexibility, and easy maintenance. More details on this topic can be seen in the MongoDB World video Product Catalog.

Stay tuned for more information on our Product Catalog MongoDB solution, including:
- How to implement full text search within MongoDB or with a connector to an external FTS system
- More statistics and benchmarking on the faceted search capability
- Operational considerations: geo distribution for low latency queries, stringent read latency SLA, and spiky catalog updates

Also check out another interesting topic: how to log all user activities around the site and run useful analytics on them, in the MongoDB World video covering the Insight Component.
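
The note about putting the most restrictive attribute first in a "$all" clause can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names order_by_selectivity and build_summary_query are hypothetical, facet_counts is assumed to be a plain dictionary preloaded from the facet collection (facet _id mapped to its count), and attributes missing from that dictionary are arbitrarily placed last. The code only builds the query document; passing it to a driver is left out.

```python
import re


def order_by_selectivity(attrs, facet_counts):
    """Sort "name=value" attributes so the rarest facet comes first.

    facet_counts maps a facet _id (e.g. "color=red") to the number of
    items carrying it. Unknown facets are sorted last (an assumption).
    """
    return sorted(attrs, key=lambda a: facet_counts.get(a, float("inf")))


def build_summary_query(dep, facet_counts, item_attrs=(), variant_attrs=(),
                        cat_prefix=None):
    """Build a query document for the Summary collection (hypothetical helper)."""
    query = {"dep": dep}  # every index starts with the department
    if item_attrs:
        query["attrs"] = {"$all": order_by_selectivity(item_attrs, facet_counts)}
    if variant_attrs:
        query["vars.attrs"] = {
            "$all": order_by_selectivity(variant_attrs, facet_counts)
        }
    if cat_prefix:
        # Anchored prefix match, so the index on the category path can be used.
        query["cat"] = {"$regex": "^" + re.escape(cat_prefix)}
    return query


if __name__ == "__main__":
    counts = {"toe=open toe": 950, "color=red": 120, "brand=metaphor": 14}
    print(build_summary_query(
        "84700", counts,
        item_attrs=["toe=open toe", "brand=metaphor"],
        variant_attrs=["color=red"],
    ))
```

Keeping the ordering in a separate function makes it easy to refresh facet_counts on whatever schedule the facet collection is updated, without touching the query-building code.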

September 25, 2014

by Antoine Girbal

· 65,848 Views · 8 Likes

The No Fluff Introduction to Big Data

Big data traditionally has referred to a collection of data too massive to be handled efficiently by traditional database tools and methods. This original definition has expanded over the years to identify tools (big data tools) that tackle extremely large datasets (NoSQL databases, MapReduce, Hadoop, NewSQL, etc.), and to describe the industry challenge posed by having data harvesting abilities that far outstrip the ability to process, interpret, and act on that data. Technologists knew that those huge batches of user data and other data types were full of insights that could be extracted by analyzing the data in large aggregates. They just didn't have any cheap, simple technology for organizing and querying these large batches of raw, unstructured data.

The term quickly became a buzzword for every sort of data processing product's marketing team. Big data became a catchall term for anything that handled non-trivial sizes of data. Sean Owen, a data scientist at Cloudera, has suggested that big data is a stage where individual data points are irrelevant and only aggregate analysis matters [1]. But this is true for a 400 person survey as well, and most people wouldn't consider that very big. The key part missing from that definition is the transformation of unstructured data batches into structured datasets. It doesn't matter if the database is relational or non-relational. Big data is not defined by a number of terabytes; it's rooted in the push to discover hidden insights in data that companies used to disregard or throw away.

Due to the obstacles presented by large scale data management, the goal for developers and data scientists is two-fold: first, systems must be created to handle large scale data, and second, business intelligence and insights should be acquired from analysis of the data. Acquiring the tools and methods to meet these goals is a major focus in the data science industry, but it's a landscape where needs and goals are still shifting.
What Are the Characteristics of Big Data?

Tech companies are constantly amassing data from a variety of digital sources that is almost without end: everything from email addresses to digital images, MP3s, social media communication, server traffic logs, purchase history, and demographics. And it's not just the data itself, but data about the data (metadata). It is a barrage of information on every level. What is it that makes this mountain of data big data? One of the most helpful models for understanding the nature of big data is "the three Vs": volume, velocity, and variety.

Data Volume

Volume is the sheer size of the data being collected. There was a point in not-so-distant history where managing gigabytes of data was considered a serious task; now we have web giants like Google and Facebook handling petabytes of information about users' digital activities. The size of the data is often seen as the first challenge of characterizing big data storage, but even beyond that is the capability of programs to provide architecture that can not only store but query these massive datasets. One of the most popular models for big data architecture comes from Google's MapReduce concept, which was the basis for Apache Hadoop, a popular data management solution.

Data Velocity

Velocity is a problem that flows naturally from the volume characteristics of big data. Data velocity is the speed at which data is flowing into a business' infrastructure and the ability of software solutions to receive and ingest that data quickly. Certain types of high-velocity data, such as streaming data, need to be moved into storage and processed on the fly. This is often referred to as complex event processing (CEP). The ability to intercept and analyze data that has a lifespan of milliseconds is widely sought after. This kind of quick-fire data processing has long been the cornerstone of digital financial transactions, but it is also being used to track live consumer behavior or to bring instant updates to social media feeds.

Data Variety

Variety refers to the source and type of data that is being collected. This data could be anything from raw image data to sensor readings, audio recordings, social media communication, and metadata. The challenge of data variety is being able to take raw, unstructured data and organize it so that an application can use it. This kind of structure can be achieved through architectural models that traditionally favor relational databases, but there is often a need to tidy up this data before it will even be useful to store in a raw form. Sometimes a better option is to use a schema-less, non-relational database.

How Do You Manage Big Data?

The three Vs is a great model for getting an initial understanding of what makes big data a challenge for businesses. However, big data is not just about the data itself, but the way that it is handled. A popular way of thinking about these challenges is to look at how a business stores, processes, and accesses its data.
· Store: Can you store the vast amounts of data being collected?
· Process: Can you organize, clean, and analyze the data collected?
· Access: Can you search and query this data in an organized manner?

The store, process, and access model is useful for two reasons: it reminds businesses that big data is largely about managing data, and it demonstrates the problem of scale within big data management. "Big" is relative. The data batches that challenge some companies could be moved through a single Google datacenter in under a minute. The only question a company needs to ask itself is how it will store and access increasingly massive amounts of data for its particular use case. There are several high level approaches that companies have turned to in the last few years.
The Traditional Approach

The traditional method for handling most data is to use relational databases. Data warehouses are then used to integrate and analyze data from many sources. These databases are structured according to the concept of "early structure binding": essentially, the database has predetermined "questions" that can be asked based on a schema. Relational databases are highly functional, and the goal with this type of data processing is for the database to be fully transactional. Although relational databases are the most common persistence type by a large margin (see Key Findings pg. 4-5), a growing number of use cases are not well-suited for relational schema. Relational architectures tend to have difficulty when dealing with the velocity and variety of big data, since their structure is very rigid. When you perform functions such as JOIN on many large data sets, the volume can be a problem as well. Instead, businesses are looking to non-relational databases, or a mixture of both types, to meet data demand.

The Newer Approach - MapReduce, Hadoop, and NoSQL Databases

In the early 2000s, web giant Google released two helpful web technologies: Google File System (GFS) and MapReduce. Both were new and unique approaches to the growing problem of big data, but MapReduce was chief among them, especially when it comes to its role as a major influencer of later solution models. MapReduce is a programming paradigm that allows for low cost data analysis and clustered scale-out processing. MapReduce became the primary architectural influence for the next big thing in big data: the creation of the big data management infrastructure known as Hadoop. Hadoop's open source ecosystem and ease of use for handling large-scale data processing operations have secured a large part of the big data marketplace.

Besides Hadoop, there was a host of non-relational (NoSQL) databases that emerged around 2009 to meet a different set of demands for processing big data. Whereas Hadoop is used for its massive scalability and parallel processing, NoSQL databases are especially useful for handling data stored within large multi-structured datasets. This kind of discrete data handling is not traditionally seen as a strong point of relational databases, but it's also not the same kind of data operations that Hadoop is running. The solution for many businesses ends up being a combination of these approaches to data management.

Finding Hidden Data Insights

Once you get beyond storage and management, you still have the enormous task of creating actionable business intelligence (BI) from the datasets you've collected. This problem of processing and analyzing data is maybe one of the trickiest in the data management lifecycle. The best options for data analytics will favor an approach that is predictive and adaptable to changing data streams. The thing is, there are so many types of analytic models and different ways of providing infrastructure for this process. Your analytics solution should scale, but to what degree? Scalability can be an enormous pain in your analytical neck, due to the problem of decreasing performance returns when scaling out an algorithm.

Ultimately, analytics tools rely on a great deal of reasoning and analysis to extract data patterns and data insights, but this capacity means nothing for a business if it can't then create actionable intelligence. Part of this problem is that many businesses have the infrastructure to accommodate big data, but they aren't asking questions about what problems they're going to solve with the data. Implementing a big data-ready infrastructure before knowing what questions you want to ask is like putting the cart before the horse. But even if we do know the questions we want to ask, data analysis can always reveal many correlations with no clear causes. As organizations get better at processing and analyzing big data, the next major hurdle will be pinpointing the causes behind the trends by asking the right questions and embracing the complexity of our answers.

[1] http://www.quora.com/what-is-big-data

2014 Guide to Big Data

This guide explores the meaning of big data, how businesses use it, and uncovers new tools and techniques for the future of big data. This guide includes:
· Detailed profiles on 43 big data vendor solutions
· In-depth articles written by industry experts
· Results from our survey of 850 IT professionals
· "Finding the Database for Your Use Case"

September 25, 2014

by Benjamin Ball

· 10,690 Views · 1 Like

JPA tutorial: Mapping Entities – Part 1

In this article I will discuss the entity mapping procedure in JPA. For my examples I will use the same schema that I used in one of my previous articles. In my two previous articles I explained how to set up JPA in a Java SE environment. I do not intend to write the setup procedure for a web application, because most of the tutorials on the web do exactly that. So let's skip directly to object relational mapping, or entity mapping.

Wikipedia defines Object Relational Mapping as follows -

Object-relational mapping (ORM, O/RM, and O/R mapping) in computer science is a programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a "virtual object database" that can be used from within the programming language. There are both free and commercial packages available that perform object-relational mapping, although some programmers opt to create their own ORM tools.

Typically, mapping is the process through which you provide necessary information about your database to your ORM tool. The tool then uses this information to read/write objects into the database. Usually you tell your ORM tool the name of the table to which an object of a certain type will be saved. You also provide the column names to which an object's properties will be mapped. Relations between different object types also need to be specified. All of this seems like a lot of work, but fortunately JPA follows what is known as the "Convention over Configuration" approach, which means that if you choose to use the default values provided by JPA, you will have to configure very few parts of your application.

In order to properly map a type in JPA, you will at a minimum need to do the following -
- Mark your class with the @Entity annotation. These classes are called entities.
- Mark one of the properties/getter methods of the class with the @Id annotation.

And that's it. Your entities are ready to be saved into the database, because JPA configures all other aspects of the mapping automatically. This also shows the productivity gain that you can enjoy by using JPA. You do not need to manually populate your objects each time you query the database, saving you from writing lots of boilerplate code.

Let's see an example. Consider the following Address entity, which I have mapped according to the above two rules -

import javax.persistence.Entity;
import javax.persistence.Id;

@Entity
public class Address {
  @Id
  private Integer id;

  private String street;
  private String city;
  private String province;
  private String country;
  private String postcode;

  /**
   * @return the id
   */
  public Integer getId() {
    return id;
  }

  /**
   * @param id the id to set
   */
  public Address setId(Integer id) {
    this.id = id;
    return this;
  }

  /**
   * @return the street
   */
  public String getStreet() {
    return street;
  }

  /**
   * @param street the street to set
   */
  public Address setStreet(String street) {
    this.street = street;
    return this;
  }

  /**
   * @return the city
   */
  public String getCity() {
    return city;
  }

  /**
   * @param city the city to set
   */
  public Address setCity(String city) {
    this.city = city;
    return this;
  }

  /**
   * @return the province
   */
  public String getProvince() {
    return province;
  }

  /**
   * @param province the province to set
   */
  public Address setProvince(String province) {
    this.province = province;
    return this;
  }

  /**
   * @return the country
   */
  public String getCountry() {
    return country;
  }

  /**
   * @param country the country to set
   */
  public Address setCountry(String country) {
    this.country = country;
    return this;
  }

  /**
   * @return the postcode
   */
  public String getPostcode() {
    return postcode;
  }

  /**
   * @param postcode the postcode to set
   */
  public Address setPostcode(String postcode) {
    this.postcode = postcode;
    return this;
  }
}

Depending on your environment, you may or may not need to add this entity declaration to your persistence.xml file, which I explained in my previous article.
Ok then, let’s save some object! The following code snippet does exactly that - import com.keertimaan.javasamples.jpaexample.entity.Address; import javax.persistence.EntityManager; import com.keertimaan.javasamples.jpaexample.persistenceutil.PersistenceManager; public class Main { public static void main(String[] args) { EntityManager em = PersistenceManager.INSTANCE.getEntityManager(); Address address = new Address().setId(1) .setCity("Dhaka") .setCountry("Bangladesh") .setPostcode("1000") .setStreet("Poribagh"); em.getTransaction() .begin(); em.persist(address); em.getTransaction() .commit(); System.out.println("addess is saved! It has id: " + address.getId()); Address anotherAddress = new Address().setId(2) .setCity("Shinagawa-ku, Tokyo") .setCountry("Japan") .setPostcode("140-0002") .setStreet("Shinagawa Seaside Area"); em.getTransaction() .begin(); em.persist(anotherAddress); em.getTransaction() .commit(); em.close(); System.out.println("anotherAddress is saved! It has id: " + anotherAddress.getId()); PersistenceManager.INSTANCE.close(); } } Let’s take a step back at this point and think what we needed to do if we had used plain JDBC for persistence. We had to manually write the insert queries and map each of the attributes to the corresponding columns for both cases, which would have required a lot of code. An important point to note about the example is the way I am setting the id of the entities. This approach will only work for short examples like this, but for real applications this is not good. You’d typically want to use, say, auto-incremented id columns or database sequences to generate the id values for your entities. For my example, I am using a MySQL database, and all of my id columns are set to auto increment. To reflect this in my entity model, I can use an additional annotation called @GeneratedValue in the id property. 
This tells JPA that the id value for this entity will be automatically generated by the database during the insert, and that it should fetch that id after the insert using a select command. With the above modifications, my entity class becomes something like this -

    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.GeneratedValue;

    @Entity
    public class Address {
        @Id
        @GeneratedValue
        private Integer id;

        // Rest of the class code........
    }

And the insert procedure becomes this -

    Address anotherAddress = new Address()
        .setCity("Shinagawa-ku, Tokyo")
        .setCountry("Japan")
        .setPostcode("140-0002")
        .setStreet("Shinagawa Seaside Area");
    em.getTransaction().begin();
    em.persist(anotherAddress);
    em.getTransaction().commit();

How did JPA figure out which table to use to save Address entities? It turns out to be pretty straightforward. When no explicit table information is provided with the mapping, JPA tries to find a table whose name matches the entity name. The name of an entity can be explicitly specified by using the “name” attribute of the @Entity annotation. If no name attribute is found, then JPA assumes a default name for the entity. The default name of an entity is the simple name (not the fully qualified name) of the entity class, which in our case is Address. So our entity name is determined to be “Address”. Since our entity name is “Address”, JPA checks whether there is a table in the database whose name is “Address” (remember, in most cases database table names are case-insensitive). From our schema, we can see that this is indeed the case.

So how did JPA figure out which columns to use to save property values for Address entities? At this point I think you will be able to guess that easily. If you cannot, stay tuned for my next post! Until next time.

[Full working code can be found at GitHub.]
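The entity name rule described above can be sketched in a few lines of plain Java. This is only an illustration of the rule, not how any JPA provider implements it: the nested Entity annotation is a local stand-in for javax.persistence.Entity so that the sketch compiles without a JPA jar, and EntityNameSketch, RenamedAddress and "CustomerAddress" are made-up names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

final class EntityNameSketch {

    // Stand-in for javax.persistence.Entity (assumption: only the "name" attribute matters here).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Entity {
        String name() default "";
    }

    // No explicit name, so the simple class name is used.
    @Entity
    static class Address { }

    // Explicit name, which takes precedence over the class name.
    @Entity(name = "CustomerAddress")
    static class RenamedAddress { }

    // Returns the explicit entity name when one is set, otherwise the simple class name.
    static String entityName(Class<?> type) {
        Entity entity = type.getAnnotation(Entity.class);
        if (entity == null) {
            throw new IllegalArgumentException(type.getName() + " is not annotated with @Entity");
        }
        return entity.name().isEmpty() ? type.getSimpleName() : entity.name();
    }
}
```

Calling entityName(Address.class) yields "Address", while entityName(RenamedAddress.class) yields "CustomerAddress". With no explicit table mapping, that resolved name is what gets matched against the table names in the schema.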

September 22, 2014

by MD Sayem Ahmed


How to Setup Custom Remote Deployment Repositories for JBoss BPM Suite

In this article we want to share another configuration property that can provide surprising help when setting up your JBoss BPM Suite. Previously we outlined a basic set of configuration properties to provide you with a few tricks when installing your own JBoss BRMS or JBoss BPM Suite products. As JBoss BPM Suite is a superset that includes full JBoss BRMS functionality, the rest of this article refers only to JBoss BPM Suite but applies to both products. In this article we will show you how to modify your JBoss EAP container configuration to point the products at a custom deployment repository by adjusting a single configuration property.

Maven repository

By default, the products look for your Maven settings in the default settings.xml, found either via the M2_HOME variable or in the user's home directory at .m2/settings.xml. The following system property can be added to the JBoss EAP standalone.xml configuration file to point to any file containing your custom settings.

kie.maven.settings.custom
Location of the Maven configuration file that holds the custom settings. Default: M2_HOME/conf/settings.xml or .m2/settings.xml in the user's home directory.

Example usage in JBoss EAP

When initially setting up the product for use on JBoss EAP containers, one can adjust the configuration with the help of system properties. Below we show how to configure an installation to point to our custom Maven deployment repository by using a custom settings file we will call bpmsuite-settings.xml.

We hope this helps you with configuring your own custom deployment repositories and enables you to tie into existing continuous integration infrastructure that might exist in your organization.
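As a minimal sketch of the configuration described above, the property could be added to the system-properties section of standalone.xml roughly as follows. The file path is a placeholder, not a value taken from the original article; substitute the location of your own bpmsuite-settings.xml.

```xml
<!-- Fragment of standalone.xml; in a stock JBoss EAP file this element sits right after <extensions>. -->
<system-properties>
    <!-- Placeholder path (assumption): point this at your own custom Maven settings file. -->
    <property name="kie.maven.settings.custom" value="/path/to/bpmsuite-settings.xml"/>
</system-properties>
```

The server needs a restart for the change to take effect. After that, the products should read their Maven settings from the custom file instead of the defaults.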

September 19, 2014

by Eric D. Schabell
