The Latest Data Engineering Topics

MongoDB and its locks
Sometimes, you need your jobs to be persisted to a database. Existing solutions such as Gearman only offered relational or file-based persistence, so they were a no-go for us and we went with MongoDB. Fast-forward a few months, and we had some problems with the database load. However, it wasn't that the workers were pestering it too much: the problem was related to locks.

MongoDB locking model

As of 2.4, MongoDB holds a write lock on an entire database for each write operation. Since atomicity is guaranteed only on a single document, this isn't usually a problem: even if you are inserting thousands of documents, you are doing so in thousands of different operations that can be interleaved with queries and other inserts under a fair policy. This sometimes results in count() queries being inconsistent as documents are moved and indexes are asynchronously updated, but write corruption is nonexistent because each document is a very cohesive entity.

However, atomic operations over a single document still lock the whole database. That is the case for findAndModify(), which looks for a document matching a certain query and updates it with a $set operation before returning it, all in a single shot and with the guarantee that no other process will be able to perform the same read-and-write at the same time. You can see this operation is ideal for implementing workers based on a pull model, each asking the database for a new job to do and locking it with '$set: {locked: true}'. However, once the number of workers increases a little, locks become a problem.

Lock duration

We cleaned up the working-space collection of our MongoDB database by keeping only the unfinished jobs in it and moving all the rest (completed or failed) to a different collection for archival. As the load increased due to new contracts, we saw the locking time increase as well: the application and the workers were insisting on the same database. The first problem was that, after reducing the specs of our primary server, we started seeing timeouts in unrelated code even though CPU and I/O usage were low. The locks taken by workers to pick jobs were starting to take seconds, or tens of seconds. Moreover, the MongoDB server started filling the logs with:

Fri Dec 6 00:01:07 [conn280998] warning: ClientCursor::yield can't unlock b/c of recursive lock...

I'm a user, not a MongoDB guru, but that does not seem very good, especially given that hundreds of these messages were written every day (although the queues continued to work correctly). We did not find any explanation for them in the documentation, but I suppose they mean some operations are taking so long that they should yield to make room for others, yet in the case of atomic operations they can't, in order to preserve consistency.

An easy solution

Since MongoDB does not have collection-wide locks yet, we decided to move the job pool and the completed-jobs collections to a different database. In this way, we had a main database with the usual collections and one containing just these two, named with a '_queue' suffix. Note that we're still writing to the same database server: the same number of connections is created by each process. This solution preallocates more space, given that two databases are involved, but as you know, space is cheap nowadays. Both insertion of jobs and worker reads must take place on the same database. Here is where we discovered that cohesion pays: if you have this information in a single place, it is very easy to change the configuration.
If you have a singleton database object, because "we should only have one database in this application, it will never change," this change would cost you a lot. Fortunately, in our case it was about 10 lines of code, including the refactoring of the Factory Methods that created MongoDB database objects.

Long term

This solution is not for the long term, as we know the number of machines and their worker pools will increase in the future; a sufficiently high number of workers will saturate the connections available on the MongoDB server and lock the common collection until picking a job takes dozens of seconds. The design we are moving towards includes one "foreman" for each machine and many workers under its control; only the foreman polls the database and may lock the common collection. Distributing the job pool is not what we want, for ease of retrieving a job in case something goes bad (ever done a query across multiple databases?). We also don't want a push solution, as it would involve registering workers or foremen with a central point of failure that assigns them their jobs. Since most of our servers are shut down and rebooted according to the user load, we prefer a dynamic solution where a server can start picking jobs whenever it wants and stop without notifying remote machines.
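For concreteness, the pull-model pick described above, where each worker claims one unlocked job atomically with findAndModify(), looks roughly like the following mongo shell sketch. The collection and field names are illustrative, not taken from the original system:

```javascript
// Atomically claim one unlocked job: the match, the $set and the return of the
// claimed document happen in a single operation, which is exactly the kind of
// operation that holds the database-wide write lock discussed above.
var job = db.jobs.findAndModify({
    query:  { locked: false },
    sort:   { _id: 1 },  // oldest job first
    update: { $set: { locked: true, lockedAt: new Date() } },
    new:    true         // return the document after the update
});

if (job !== null) {
    // ... do the work, then move the job to the archive collection
}
```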
December 6, 2013
by Giorgio Sironi
· 27,249 Views
Adding Java 8 Lambda Goodness to JDBC
Data access, specifically SQL access from within Java, has never been nice. This is in large part due to the fact that the JDBC API has a lot of ceremony. Java 7 vastly improved things with ARM blocks by taking away a lot of the ceremony around managing database objects such as Statements and ResultSets, but fundamentally the code flow is still the same. Java 8 lambdas give us a very nice tool for improving the flow of JDBC.

Our first attempt at improving things here is very simple: make it easy to work with a java.sql.ResultSet. Here we simply wrap the ResultSet iteration and then delegate to a lambda function. This is very similar in concept to Spring's JdbcTemplate. NOTE: I've released all the code snippets you see here under an Apache 2.0 license on GitHub.

First we create a functional interface called ResultSetProcessor as follows:

@FunctionalInterface
public interface ResultSetProcessor {
    public void process(ResultSet resultSet, long currentRow) throws SQLException;
}

Very straightforward. This interface takes the ResultSet and the current row number of the ResultSet as parameters. Next we write a simple utility method which executes a query and then calls our ResultSetProcessor for each row as we iterate over the ResultSet:

public static void select(Connection connection, String sql, ResultSetProcessor processor, Object... params) {
    try (PreparedStatement ps = connection.prepareStatement(sql)) {
        int cnt = 0;
        for (Object param : params) {
            ps.setObject(++cnt, param);
        }
        try (ResultSet rs = ps.executeQuery()) {
            long rowCnt = 0;
            while (rs.next()) {
                processor.process(rs, rowCnt++);
            }
        } catch (SQLException e) {
            throw new DataAccessException(e);
        }
    } catch (SQLException e) {
        throw new DataAccessException(e);
    }
}

Note I've wrapped the SQLException in my own unchecked DataAccessException. Now when we write a query it's as simple as calling the select method with a connection and a query:

select(connection, "select * from MY_TABLE", (rs, cnt) -> {
    System.out.println(rs.getInt(1) + " " + cnt);
});

So that's great, but I think we can do more... One of the nifty lambda-related additions in Java 8 is the new Streams API, which would allow us to add very powerful functionality with which to process a ResultSet. Using the Streams API over a ResultSet, however, is a bit more of a challenge than the simple select with a lambda in the previous example. The way I decided to go about this is to create my own Tuple type which represents a single row from a ResultSet. My Tuple here is the relational version, where a tuple is a collection of elements in which each element is identified by an attribute, basically a collection of key-value pairs. In our case the Tuple is ordered in terms of the order of the columns in the ResultSet. The code for the Tuple ended up being quite a bit, so if you want to take a look, see the GitHub project in the resources at the end of the post.

Currently the Java 8 API provides the java.util.stream.StreamSupport class, which provides a set of static methods for creating instances of java.util.stream.Stream. We can use it to create an instance of a Stream, but in order to create a Stream it needs an instance of java.util.stream.Spliterator. This is a specialised type for iterating and partitioning a sequence of elements, which the Stream needs for handling operations in parallel. Fortunately, the Java 8 API also provides the java.util.stream.Spliterators class, which can wrap existing Collection and iterator types; one of those is a java.util.Iterator.
Now we wrap a query and ResultSet in an Iterator:

public class ResultSetIterator implements Iterator {

    private ResultSet rs;
    private PreparedStatement ps;
    private Connection connection;
    private String sql;

    public ResultSetIterator(Connection connection, String sql) {
        assert connection != null;
        assert sql != null;
        this.connection = connection;
        this.sql = sql;
    }

    public void init() {
        try {
            ps = connection.prepareStatement(sql);
            rs = ps.executeQuery();
        } catch (SQLException e) {
            close();
            throw new DataAccessException(e);
        }
    }

    @Override
    public boolean hasNext() {
        if (ps == null) {
            init();
        }
        try {
            boolean hasMore = rs.next();
            if (!hasMore) {
                close();
            }
            return hasMore;
        } catch (SQLException e) {
            close();
            throw new DataAccessException(e);
        }
    }

    private void close() {
        try {
            rs.close();
            try {
                ps.close();
            } catch (SQLException e) {
                // nothing we can do here
            }
        } catch (SQLException e) {
            // nothing we can do here
        }
    }

    @Override
    public Tuple next() {
        try {
            return SQL.rowAsTuple(sql, rs);
        } catch (DataAccessException e) {
            close();
            throw e;
        }
    }
}

This class basically delegates the iterator methods to the underlying ResultSet and then, on the next() call, transforms the current row in the ResultSet into my Tuple type. And that's the basics done (this class will need a little bit more work though). All that's left is to wire it all together to make a Stream object. Note that due to the nature of a ResultSet it's not a good idea to try to process it in parallel, so our stream cannot process in parallel:

public static Stream stream(final Connection connection, final String sql, final Object... parms) {
    return StreamSupport
            .stream(Spliterators.spliteratorUnknownSize(
                    new ResultSetIterator(connection, sql), 0), false);
}

Now it's straightforward to stream a query. The usage example below runs against a table TEST_TABLE with an integer column TEST_ID; it filters out all the non-even numbers and then runs a count:

long result = stream(connection, "select TEST_ID from TEST_TABLE")
        .filter((t) -> t.asInt("TEST_ID") % 2 == 0)
        .limit(100)
        .count();

And that's it! We now have a very powerful way of working with a ResultSet. So all this code is available under an Apache 2.0 license on GitHub here. I've rather lamely dubbed the project "lambda tuples," and the purpose really is to experiment and see where you can take Java 8 and relational DB access, so please download it or feel free to contribute.
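As a rough idea of the shape of the Tuple type used in the streaming example, a minimal sketch might look like the class below. This is my illustration of the concept, not the actual code from the lambda-tuples project:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal, illustrative Tuple: an ordered set of column name/value pairs for a
// single ResultSet row, with a typed accessor like the asInt() used above.
public class Tuple {

    // LinkedHashMap preserves the column order of the ResultSet.
    private final Map<String, Object> values = new LinkedHashMap<>();

    public void put(String column, Object value) {
        values.put(column, value);
    }

    public Object get(String column) {
        return values.get(column);
    }

    public int asInt(String column) {
        return ((Number) values.get(column)).intValue();
    }

    @Override
    public String toString() {
        return values.toString();
    }
}
```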
December 5, 2013
by Julian Exenberger
· 77,689 Views · 6 Likes
Implementing the “Card” UI Pattern in PhoneGap/HTML5 Applications
The Card UI pattern is a common look used by Pinterest and many other content sites. See how you can make a PhoneGap app with this look.
December 2, 2013
by Andrew Trice
· 115,684 Views · 2 Likes
Deconstructing the Azure Point-to-Site VPN for Command Line usage
When configuring an Azure virtual network, one of the most common things you'll want to do is set up a point-to-site VPN so that you can actually get to your servers to manage and maintain them. Azure point-to-site VPNs use client certificates to secure connections, which can be quite complicated to configure, so Microsoft has gone the extra mile to make it easy for you to configure and get set up – sadly at the cost of losing the ability to connect through the command line or through PowerShell. Let's change that.

Current state of play == no command line VPN connections

Normally when you want to launch a VPN from the CLI or PowerShell in Windows you can simply use the following command:

rasdial "my home vpn"

The Azure pre-packaged VPN doesn't allow this because it's really just not a normal VPN. It's something else, something mysterious - not a normal native Windows VPN connection. When you run the Azure VPN through the command line you get this (you'll see a hint as to why I'd be using Azure point-to-site in this screenshot): Azure VPNs don't appear to support it. If you want to keep your servers behind a private network in Azure and use continuous deployment to get your code into production, this makes it hard to deploy without a human being around. Not really the best-case scenario – especially when you remind yourself that automated builds aim to do away with human error altogether.

What the Azure point-to-site looks like out of the box

When you first go to set up a point-to-site VPN into your Azure virtual network, Microsoft points you at a page that walks you through creating a client certificate on your local machine to use as authentication. They then get you to download a package for setting up the Azure VPN RAS dialler on your local machine. This is accessed from within the Azure "Networks" page for your virtual network. You install this package, and then whenever connecting you're greeted with a connection screen that you might have seen in a previous life. And by seen I don't mean that Windows Azure virtual networks have been around for ages, but more that the login screen may look familiar. This is because this login screen is a Microsoft "Connection Manager" login screen and has been around for a while. Example from TechNet (note extremely dated bitmap awesomeness): Connection Manager is used to pre-package VPN and dial-up connections for easy-install distribution in a large organisation. This also means we can reconstruct the underlying VPN connection and use it as a normal VPN – claiming back our CLI super powers.

Digging through the details

So what we really want to know is: what is this mystical VPN technology the people at Microsoft have bestowed upon us? Here's how I started getting more information about the implementation: connect once successfully, then disconnect. Open it up again to connect, click on Properties, then click on View Log. You'll then be greeted by something that looks like this:

******************************************************************
operating system : windows nt 6.2
dialler version : 7.2.9200.16384
connection name : my azure virtual network
all users/single user : single user
start date/time : 24/11/2013, 7:50:31
******************************************************************
module name, time, log id, log item name, other info
for connection type, 0=dial-up, 1=vpn, 2=vpn over dial-up
******************************************************************
[cmdial32] 7:50:31 03 pre-init event callingprocess = c:\windows\system32\cmmon32.exe
[cmdial32] 7:50:39 04 pre-connect event connectiontype = 1
[cmdial32] 7:50:39 06 pre-tunnel event username = myclientsslcertificate domain = dunsetting = [obfuscated azure gateway id] tunnel devicename = tunneladdress = [obfuscated azure gateway id].cloudapp.net
[cmdial32] 7:50:44 07 connect event
[cmdial32] 7:50:44 08 custom action dll actiontype = connect actions description = to update your routing table actionpath = c:\users\doug\appdata\roaming\microsoft\network\connections\cm\[obfuscated azure gateway id]\cmroute.dll returnvalue = 0x0
[cmmon32] 7:56:21 23 external disconnect
[cmdial32] 7:56:21 13 disconnect event callingprocess = c:\windows\explorer.exe

More importantly, you'll see a path included in the connection log. Within that folder is all the magic Connection Manager odds and ends (apologies for the [obfuscated] parts; the path simply contains information about my Azure endpoint). Within this folder you'll see a bunch of files. Most importantly, there is a PBK file – a personal phonebook. This is what stores the connection settings for the VPN, and it is a commonly distributed way of sending out connection settings in the enterprise. If you run this on its own you'll actually be able to connect to the VPN directly (without your network routes being updated). This phonebook is where we can steal our settings from to recreate a command-line-driven connection.

Setting it up

Open up the properties of your Azure point-to-site VPN phonebook above, and copy the connection address. It will look like this: azuregateway-[guid].cloudapp.net

Open Network and Sharing Centre and create a new connection. Select "Connect to a workplace", then select "Use my Internet connection". Enter your Azure point-to-site VPN address and give your new connection a name; remember this name for later. Then click Create to save your VPN.

Now open the connection properties for your newly created VPN. This is where we'll use the settings from your Azure dialler's config to set up your connection. I'll save you the hassle of watching me copy the settings from one connection to another and instead just focus on what you need to set them to. Flick over to the Options tab and then click PPP Settings. Tick the two missing options, "Enable software compression" and "Negotiate multi-link for single-link connections". Set the type of VPN to Secure Socket Tunnelling Protocol (SSTP), turn on EAP and select "Microsoft: Smart card or other certificate" as the authentication type, then click on Properties. Select "Use a certificate on this computer", un-tick "Connect to these servers", select the certificate that uses your Azure endpoint URI as its certificate name, and save out. Then flick over to the Networking tab, open TCP/IPv4, then Advanced, and untick "Use default gateway on remote network". This setting stops Internet traffic going over the VPN while you're connected, so you can still surf Reddit while managing your Azure environment. Close the VPN configuration panel. You now have a working VPN connection to Azure.

When you connect using Windows you'll be asked to select the name of the client certificate you'll be authenticating with: the certificate you created and uploaded into Azure before you set up your connection. When you connect using the command line you don't need to specify your certificate:

rasdial "azure vpn"

But there's one catch: your local machine's route table doesn't know when to send any traffic to your Azure virtual network. The network link is there, but Windows doesn't know what to send over your Internet link and what to send over the VPN link. You see, Microsoft did a few things when they packaged your Connection Manager, and one of those things was to also copy a file called "cmroute.dll" and call it after connection to route your traffic onto your virtual network. This file alters your routing table to route traffic for your virtual network subnets through the VPN connection. We can do the same thing – so let's go about it.

What's this about routing... rooting (for the English speakers in the room)

My Azure virtual network consists of the following network range: 10.0.0.0/8. I also have the following subnets for different machine groups: 10.0.1.0/24 (web servers), 10.0.2.0/24 (application servers) and 10.0.3.0/24 (management services). My PPTP connections, or point-to-site connections, sit on the range 172.16.0.0/24. This means that when I connect to the Azure VPN I will get an IP address in this range, for example 172.16.0.17. When this happens we need to tell Windows to route all traffic going to my 10.0.x.x range IP addresses through the IP address that has been given to us by Azure's VPN RRAS service. You can see your current routing table by entering route print into a command prompt or PowerShell console.

Automating the routing additions

Luckily the Windows Task Scheduler supports event listeners that allow us to watch for VPN connections and run commands off the back of them. Take the PowerShell script below and save it, for argument's sake, as c:\scripts\updateroutetableforazurevpn.ps1:

#############################################################
# adds ip routes to azure vpn through the point-to-site vpn
#############################################################

# define your azure subnets
$ips = @("10.0.1.0", "10.0.2.0", "10.0.3.0")

# point-to-site ip address range
# should be the first three octets of the ip address: '172.16.0.14' == '172.16.0.'
$azurepptprange = "172.16.0."

# find the current dhcp-assigned ip address from azure
$azureipaddress = ipconfig | findstr $azurepptprange

# if azure hasn't given us one yet, exit and let you know
if (!$azureipaddress){
    "you do not currently have an ip address in your azure subnet."
    exit 1
}

$azureipaddress = $azureipaddress.split(": ")
$azureipaddress = $azureipaddress[$azureipaddress.length-1]
$azureipaddress = $azureipaddress.trim()

# delete any previously configured routes for these ip ranges
foreach($ip in $ips) {
    $routeexists = route print | findstr $ip
    if($routeexists) {
        "deleting route to azure: " + $ip
        route delete $ip
    }
}

# add our new routes to the azure virtual network
foreach($subnet in $ips) {
    "adding route to azure: " + $subnet
    echo "route add $subnet mask 255.255.255.0 $azureipaddress"
    route add $subnet mask 255.255.255.0 $azureipaddress
}

Now execute the following from an elevated command prompt window. This tells Windows to add an event-listener-based task that looks for events from our "azure vpn" connection and, if it sees them, runs our PowerShell script:

schtasks /create /f /tn "vpn connection update" /tr "powershell.exe -noninteractive -command c:\scripts\updateroutetableforazurevpn.ps1" /sc onevent /ec application /mo "*[system[(level=4 or level=0) and (eventid=20225)]] and *[eventdata[data='azure vpn']]"

If I then connect to my VPN, the above script should execute. After connecting, if I check my routing table by entering route print into a console application, our routes to Azure have been added correctly.

We're done!

With that we're now able to fully use an Azure point-to-site VPN simply from the command line. This means we can use it as part of a build server deployment, or, if you're working on it all the time, you can simply set it up to connect every time you log in to Windows.

Command line usage

rasdial "[connection name]"
rasdial "[connection name]" /disconnect

For my connection named "azure vpn" this becomes:

rasdial "azure vpn"
rasdial "azure vpn" /disconnect
November 29, 2013
by Douglas Rathbone
· 10,029 Views
New in Neo4j: Optional Relationships with OPTIONAL MATCH
One of the breaking changes in Neo4j 2.0.0-RC1 compared to previous versions is that the -[?]-> syntax for matching optional relationships has been retired and replaced with the OPTIONAL MATCH construct. An example where we might want to match an optional relationship could be if we want to find colleagues that we haven’t worked with given the following model: Suppose we have the following data set: CREATE (steve:Person {name: "Steve"}) CREATE (john:Person {name: "John"}) CREATE (david:Person {name: "David"}) CREATE (paul:Person {name: "Paul"}) CREATE (sam:Person {name: "Sam"}) CREATE (londonOffice:Office {name: "London Office"}) CREATE UNIQUE (steve)-[:WORKS_IN]->(londonOffice) CREATE UNIQUE (john)-[:WORKS_IN]->(londonOffice) CREATE UNIQUE (david)-[:WORKS_IN]->(londonOffice) CREATE UNIQUE (paul)-[:WORKS_IN]->(londonOffice) CREATE UNIQUE (sam)-[:WORKS_IN]->(londonOffice) CREATE UNIQUE (steve)-[:COLLEAGUES_WITH]->(john) CREATE UNIQUE (steve)-[:COLLEAGUES_WITH]->(david) We might write the following query to find people from the same office as Steve but that he hasn’t worked with: MATCH (person:Person)-[:WORKS_IN]->(office)<-[:WORKS_IN]-(potentialColleague) WHERE person.name = "Steve" AND office.name = "London Office" WITH person, potentialColleague MATCH (potentialColleague)-[c?:COLLEAGUES_WITH]-(person) WHERE c IS null RETURN potentialColleague ==> +----------------------+ ==> | potentialColleague | ==> +----------------------+ ==> | Node[4]{name:"Paul"} | ==> | Node[5]{name:"Sam"} | ==> +----------------------+ We first find which office Steve works in and find the people who also work in that office. Then we optionally match the ‘COLLEAGUES_WITH’ relationship and only return people who Steve doesn’t have that relationship with. If we run that query in 2.0.0-RC1 we get this exception: ==> SyntaxException: Question mark is no longer used for optional patterns - use OPTIONAL MATCH instead (line 1, column 199) ==> "MATCH (person:Person)-[:WORKS_IN]->(office)<-[:WORKS_IN]-(potentialColleague) WHERE person.name = "Steve" AND office.name = "London Office" WITH person, potentialColleague MATCH (potentialColleague)-[c?:COLLEAGUES_WITH]-(person) WHERE c IS null RETURN potentialColleague" ==> Based on that advice we might translate our query to read like this: MATCH (person:Person)-[:WORKS_IN]->(office)<-[:WORKS_IN]-(potentialColleague) WHERE person.name = "Steve" AND office.name = "London Office" WITH person, potentialColleague OPTIONAL MATCH (potentialColleague)-[c:COLLEAGUES_WITH]-(person) WHERE c IS null RETURN potentialColleague If we run that we get back more people than we’d expect: ==> +------------------------+ ==> | potentialColleague | ==> +------------------------+ ==> | Node[15]{name:"John"} | ==> | Node[14]{name:"David"} | ==> | Node[13]{name:"Paul"} | ==> | Node[12]{name:"Sam"} | ==> +------------------------+ The reason this query doesn’t work as we’d expect is because the WHERE clause immediately following OPTIONAL MATCH is part of the pattern rather than being evaluated afterwards as we’ve become used to. The OPTIONAL MATCH part of the query matches a ‘COLLEAGUES_WITH’ relationship where the relationship is actually null, something of a contradiction! However, since the match is optional a row is still returned. 
If we include ‘c’ in the RETURN part of the query we can see that this is the case: MATCH (person:Person)-[:WORKS_IN]->(office)<-[:WORKS_IN]-(potentialColleague) WHERE person.name = "Steve" AND office.name = "London Office" WITH person, potentialColleague OPTIONAL MATCH (potentialColleague)-[c:COLLEAGUES_WITH]-(person) WHERE c IS null RETURN potentialColleague, c ==> +---------------------------------+ ==> | potentialColleague | c | ==> +---------------------------------+ ==> | Node[15]{name:"John"} | | ==> | Node[14]{name:"David"} | | ==> | Node[13]{name:"Paul"} | | ==> | Node[12]{name:"Sam"} | | ==> +---------------------------------+ If we take out the WHERE part of the OPTIONAL MATCH the query is a bit closer to what we want: MATCH (person:Person)-[:WORKS_IN]->(office)<-[:WORKS_IN]-(potentialColleague) WHERE person.name = "Steve" AND office.name = "London Office" WITH person, potentialColleague OPTIONAL MATCH (potentialColleague)-[c:COLLEAGUES_WITH]-(person) RETURN potentialColleague, c ==> +-----------------------------------------------+ ==> | potentialColleague | c | ==> +-----------------------------------------------+ ==> | Node[2]{name:"John"} | :COLLEAGUES_WITH[5]{} | ==> | Node[3]{name:"David"} | :COLLEAGUES_WITH[6]{} | ==> | Node[4]{name:"Paul"} | | ==> | Node[5]{name:"Sam"} | | ==> +-----------------------------------------------+ If we introduce a WITH after the OPTIONAL MATCH we can choose to filter out those people that we’ve already worked with: MATCH (person:Person)-[:WORKS_IN]->(office)<-[:WORKS_IN]-(potentialColleague) WHERE person.name = "Steve" AND office.name = "London Office" WITH person, potentialColleague OPTIONAL MATCH (potentialColleague)-[c:COLLEAGUES_WITH]-(person) WITH potentialColleague, c WHERE c IS null RETURN potentialColleague If we evaluate that query it returns the same output as our original query: ==> +----------------------+ ==> | potentialColleague | ==> +----------------------+ ==> | Node[4]{name:"Paul"} | ==> | Node[5]{name:"Sam"} | ==> +----------------------+
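As an aside that isn't in the original post: Cypher 2.0 can also express "no COLLEAGUES_WITH relationship exists" directly as a negated pattern predicate in the WHERE clause, which avoids the OPTIONAL MATCH and the extra WITH entirely. This is my own sketch against the same data set, so verify it on your version before relying on it:

```cypher
MATCH (person:Person)-[:WORKS_IN]->(office)<-[:WORKS_IN]-(potentialColleague)
WHERE person.name = "Steve"
  AND office.name = "London Office"
  AND NOT (potentialColleague)-[:COLLEAGUES_WITH]-(person)
RETURN potentialColleague
```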
November 26, 2013
by Mark Needham
· 21,218 Views · 7 Likes
Groovy Goodness: Remove Part of String With Regular Expression Pattern
Since Groovy 2.2 we can subtract a part of a String value using a regular expression pattern. The first match found is replaced with an empty String. In the following sample code we see how the first match of the pattern is removed from the String: // Define regex pattern to find words starting with gr (case-insensitive). def wordStartsWithGr = ~/(?i)\s+Gr\w+/ assert ('Hello Groovy world!' - wordStartsWithGr) == 'Hello world!' assert ('Hi Grails users' - wordStartsWithGr) == 'Hi users' // Remove first match of a word with 5 characters. assert ('Remove first match of 5 letter word' - ~/\b\w{5}\b/) == 'Remove match of 5 letter word' // Remove first found numbers followed by a whitespace character. assert ('Line contains 20 characters' - ~/\d+\s+/) == 'Line contains characters' Code written with Groovy 2.2.
November 23, 2013
by Hubert Klein Ikkink
· 19,458 Views
How to Create a Range From 1 to 10 in SQL
How do you create a range from 1 to 10 in SQL? Have you ever thought about it? This is such an easy problem to solve in any imperative language, it’s ridiculous. Take Java (or C, whatever) for instance: for (int i = 1; i <= 10; i++) System.out.println(i); This was easy, right? Things even look more lean when using functional programming. Take Scala, for instance: (1 to 10) foreach { t => println(t) } We could fill about 25 pages about various ways to do the above in Scala, agreeing on how awesome Scala is (or what hipsters we are). But how to create a range in SQL? … And we’ll exclude using stored procedures, because that would be no fun. In SQL, the data source we’re operating on are tables. If we want a range from 1 to 10, we’d probably need a table containing exactly those ten values. Here are a couple of good, bad, and ugly options of doing precisely that in SQL. OK, they’re mostly bad and ugly. By creating a table The dumbest way to do this would be to create an actual temporary table just for that purpose: CREATE TABLE "1 to 10" AS SELECT 1 value FROM DUAL UNION ALL SELECT 2 FROM DUAL UNION ALL SELECT 3 FROM DUAL UNION ALL SELECT 4 FROM DUAL UNION ALL SELECT 5 FROM DUAL UNION ALL SELECT 6 FROM DUAL UNION ALL SELECT 7 FROM DUAL UNION ALL SELECT 8 FROM DUAL UNION ALL SELECT 9 FROM DUAL UNION ALL SELECT 10 FROM DUAL See also this SQLFiddle This table can then be used in any type of select. Now that’s pretty dumb but straightforward, right? I mean, how many actual records are you going to put in there? By using a VALUES() table constructor This solution isn’t that much better. You can create a derived table and manually add the values from 1 to 10 to that derived table using the VALUES() table constructor. In SQL Server, you could write: SELECT V FROM ( VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10) ) [1 to 10](V) See also this SQLFiddle By creating enough self-joins of a sufficent number of values Another “dumb”, yet a bit more generic solution would be to create only a certain amount of constant values in a table, view or CTE (e.g. two) and then self join that table enough times to reach the desired range length (e.g. four times). The following example will produce values from 1 to 10, “easily”: WITH T(V) AS ( SELECT 0 FROM DUAL UNION ALL SELECT 1 FROM DUAL ) SELECT V FROM ( SELECT 1 + T1.V + 2 * T2.V + 4 * T3.V + 8 * T4.V V FROM T T1, T T2, T T3, T T4 ) WHERE V <= 10 ORDER BY V See also this SQLFiddle By using grouping sets Another way to generate large tables is by using grouping sets, or more specifically by using the CUBE() function. This works much in a similar way as the previous example when self-joining a table with two records: SELECT ROWNUM FROM ( SELECT 1 FROM DUAL GROUP BY CUBE(1, 2, 3, 4) ) WHERE ROWNUM <= 10 See also this SQLFiddle By just taking random records from a “large enough” table In Oracle, you could probably use ALL_OBJECTs. If you’re only counting to 10, you’ll certainly get enough results from that table: SELECT ROWNUM FROM ALL_OBJECTS WHERE ROWNUM <= 10 See also this SQLFiddle What’s so “awesome” about this solution is that you can cross join that table several times to be sure to get enough values: SELECT ROWNUM FROM ALL_OBJECTS, ALL_OBJECTS, ALL_OBJECTS, ALL_OBJECTS WHERE ROWNUM <= 10 OK. Just kidding. Don’t actually do that. Or if you do, don’t blame me if your productive system runs low on memory. By using the awesome PostgreSQL GENERATE_SERIES() function Incredibly, this isn’t part of the SQL standard. 
Neither is it available in most databases but PostgreSQL, which has the GENERATE_SERIES() function. This is much like Scala’s range notation: (1 to 10) SELECT * FROM GENERATE_SERIES(1, 10) See also this SQLFiddle By using CONNECT BY If you’re using Oracle, then there’s a really easy way to create such a table using the CONNECT BY clause, which is almost as convenient as PostgreSQL’s GENERATE_SERIES() function: SELECT LEVEL FROM DUAL CONNECT BY LEVEL < 10 See also this SQLFiddle By using a recursive CTE Recursive common table expressions are cool, yet utterly unreadable. the equivalent of the above Oracle CONNECT BY clause when written using a recursive CTE would look like this: WITH "1 to 10"(V) AS ( SELECT 1 FROM DUAL UNION ALL SELECT V + 1 FROM "1 to 10" WHERE V < 10 ) SELECT * FROM "1 to 10" See also this SQLFiddle By using Oracle’s MODEL clause A decent “best of” comparison of how to do things in SQL wouldn’t be complete without at least one example using Oracle’s MODEL clause (see this awesome use-case for Oracle’s spreadsheet feature). Use this clause only to make your co workers really angry when maintaining your SQL code. Bow before this beauty! SELECT V FROM ( SELECT 1 V FROM DUAL ) T MODEL DIMENSION BY (ROWNUM R) MEASURES (V) RULES ITERATE (10) ( V[ITERATION_NUMBER] = CV(R) + 1 ) ORDER BY 1 See also this SQLFiddle Conclusion There aren’t actually many nice solutions to do such a simple thing in SQL. Clearly, PostgreSQL’s GENERATE_SERIES() table function is the most beautiful solution. Oracle’s CONNECT BY clause comes close. For all other databases, some trickery has to be applied in one way or another. Unfortunately.
November 20, 2013
by Lukas Eder
· 29,589 Views
Neo4j: Modeling Hyper Edges in a Property Graph
At the Graph Database meet-up in Antwerp, we discussed how you would model a hyper edge in a property graph like Neo4j, and I realized that I'd done this in my football graph without realizing it. A hyper edge is defined as follows: a hyperedge is a connection between two or more vertices, or nodes, of a hypergraph, and a hypergraph is a graph in which generalized edges (called hyperedges) may connect more than two nodes with discrete properties. In Neo4j, an edge (or relationship) can only connect a node to itself or to one other node; there's no way of creating a relationship between more than two nodes.

I had problems when trying to model the relationship between a player and a football match because I wanted to say that a player participated in a match and represented a specific team in that match. I started out with a direct relationship from the player to the match. Unfortunately, that means there's no way to work out which team they played for. This information is useful because sometimes players transfer teams in the middle of a season and we want to analyze how they performed for each team. In a property graph, we need to introduce an extra node which links the match, player and team together.

Although we are forced to adopt this design, it actually helps us realize an extra entity in our domain which wasn't visible before – a player's performance in a match. If we want to capture information about a player's performance in a match, we can store it on this node. We can also easily aggregate players' stats by following the played relationship without needing to worry about the matches they played in. The Neo4j manual has a few more examples of domain models containing hyper edges which are worth a look if you want to learn more.
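Since the post describes the model in prose only, here is a hedged Cypher sketch of the intermediate "performance" node it arrives at. The labels, property keys and relationship names are my own illustration rather than the actual football graph:

```cypher
// One intermediate node per (player, match) pairing; the team represented and
// any match statistics hang off this node instead of off a single relationship.
CREATE (player:Player {name: "Player 1"})
CREATE (team:Team {name: "Team A"})
CREATE (match:Match {date: "2013-11-09"})
CREATE (performance:Performance {goals: 1})
CREATE (player)-[:PLAYED]->(performance)
CREATE (performance)-[:IN_MATCH]->(match)
CREATE (performance)-[:FOR_TEAM]->(team)
```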
November 19, 2013
by Mark Needham
· 6,613 Views
Camel CXF Service With Multiple Query Parameters
While the awesome Apache Camel team is busy fixing the handling of the multiple parameters in the query, here’s a workaround. Hopefully, this post will become obsolete with the next versions of Camel. (Currently, I use 2.7.5) Problem Query parameters more than 1 is passed as a null value into a Camel-CXF service. Say, if the URL has four query parameters as in name=arun&email=arun@arunma.com&age=10&phone=123456 only the first one gets populated when you do a Multi Query Params @GET @Path("search") @Produces(MediaType.APPLICATION_JSON) public String sourceResultsFromTwoSources(@QueryParam("name") String name, @QueryParam("age") String age, @QueryParam("phone") String phone,@QueryParam("email") String email ); All other parameters are null. Final Output For url http://localhost:8181/cxf/karafcxfcamel/search?name=arun&email=arun@arunma.com&age=31&phone=232323 the result expected is : Workaround Interestingly, we could get the entire query string in the header. QueryStringHeader 1.String queryString = exchange.getIn().getHeader(Exchange.HTTP_QUERY, String.class); We could then do a ExtractingParams MultivaluedMap queryMap = JAXRSUtils.getStructuredParams(queryString, "&", false, false); to get the query parameters as a multi valued Map. The query parameters could then be set as a property to the Exchange and used across the exchange. Code The entire code could be downloaded from github Please note that I am running Camel as part of OSGi inside the Karaf container. While the workaround does not differ because of the environment in which you are using Camel-CXF, please be wary of this fact when you download the code from github. Watch out for the blueprint xmls for Camel configuration. The most important piece here is the Router Router RestToBeanRouter package me.rerun.karafcxfcamel.camel.beans; import org.apache.camel.Exchange; import org.apache.camel.Processor; import org.apache.camel.builder.RouteBuilder; import org.apache.camel.model.dataformat.JsonLibrary; import org.apache.cxf.jaxrs.utils.JAXRSUtils; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import javax.ws.rs.core.MultivaluedMap; import java.util.List; import java.util.Map; public class RestToBeanRouter extends RouteBuilder { private static Logger logger= LoggerFactory.getLogger(RouteBuilder.class); @Override public void configure() throws Exception { from ("cxfrs://bean://rsServer") .process(new ParamProcessor()) .multicast() .parallelProcessing() .aggregationStrategy(new ResultAggregator()) .beanRef("restServiceImpl", "getNameEmailResult") .beanRef("restServiceImpl", "getAgePhoneResult") .end() .marshal().json(JsonLibrary.Jackson) .to("log://camelLogger?level=DEBUG"); } private class ParamProcessor implements Processor { @Override public void process(Exchange exchange) throws Exception { String queryString = exchange.getIn().getHeader(Exchange.HTTP_QUERY, String.class); MultivaluedMap queryMap = JAXRSUtils.getStructuredParams(queryString, "&", false, false); for (Map.Entry> eachQueryParam : queryMap.entrySet()) { exchange.setProperty(eachQueryParam.getKey(), eachQueryParam.getValue()); } } } } Interface RestService package me.rerun.karafcxfcamel.rest; import javax.ws.rs.GET; import javax.ws.rs.Path; import javax.ws.rs.Produces; import javax.ws.rs.core.MediaType; public interface RestService { @GET @Path("search") @Produces(MediaType.APPLICATION_JSON) public String sourceResultsFromTwoSources(); } Implementation RestServiceImpl package me.rerun.karafcxfcamel.rest; import me.rerun.karafcxfcamel.model.AgePhoneResult; import 
me.rerun.karafcxfcamel.model.NameEmailResult; import me.rerun.karafcxfcamel.service.base.AgePhoneService; import me.rerun.karafcxfcamel.service.base.NameEmailService; import me.rerun.karafcxfcamel.service.impl.AgePhoneServiceImpl; import org.apache.camel.Exchange; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import java.util.List; public class RestServiceImpl implements RestService { private static Logger logger= LoggerFactory.getLogger(AgePhoneServiceImpl.class); private NameEmailService nameEmailService; private AgePhoneService agePhoneService; public RestServiceImpl(){ } //Do nothing. Camel intercepts and routes the requests public String sourceResultsFromTwoSources() { return null; } public NameEmailResult getNameEmailResult(Exchange exchange){ logger.info("Invoking getNameEmailResult from RestServiceImpl"); String name=getFirstEntrySafelyFromList(exchange.getProperty("name", List.class)); String email=getFirstEntrySafelyFromList(exchange.getProperty("email", List.class)); return nameEmailService.getNameAndEmail(name, email); } public AgePhoneResult getAgePhoneResult(Exchange exchange){ logger.info("Invoking getAgePhoneResult from RestServiceImpl"); String age=getFirstEntrySafelyFromList(exchange.getProperty("age", List.class)); String phone=getFirstEntrySafelyFromList(exchange.getProperty("phone", List.class)); return agePhoneService.getAgePhoneResult(age, phone); } public NameEmailService getNameEmailService() { return nameEmailService; } public AgePhoneService getAgePhoneService() { return agePhoneService; } public void setNameEmailService(NameEmailService nameEmailService) { this.nameEmailService = nameEmailService; } public void setAgePhoneService(AgePhoneService agePhoneService) { this.agePhoneService = agePhoneService; } private String getFirstEntrySafelyFromList(List list){ if (list!=null && !list.isEmpty()){ return list.get(0); } return null; } } Reference Camel Mailing List Question
November 17, 2013
by Arun Manivannan
· 18,153 Views
Revisiting The Purpose of Manager Classes
I came across a debate about whether naming a class [Something]Manager is proper. Mainly, I came across this article, but also found other references here and here. I already had my own thoughts on this topic and decided I would share them with you.

I started thinking about this a while ago when working on a Swing-powered desktop app. We decided we would make use of the listener model that Swing establishes for its components. To register a listener on a Swing component you say:

// Some-Type = {Action, Key, Mouse, Focus, ...}
component.add[Some-Type]Listener(listener);

That's not a problem until you need to dynamically remove those listeners, or override them (remove a listener and add a new one that listens to the same event but fires a different action), or do any other 'unnatural' thing with them. This turns out to be a little difficult if you use the interface Swing provides out of the box. For instance, to remove a listener, you need to pass the whole instance you want to remove; you would have to hold the reference to the listener if you intend to remove it some time later. So I decided I would take a step further and try to abstract the mess away by creating a Manager class. What you could do with this class was something like this:

EventsListenersManager.registerListener("some unique name", component, listener);
EventsListenersManager.overrideListener("some unique name", newListener);
EventsListenersManager.disableListener("some unique name");
EventsListenersManager.enableListener("some unique name");
EventsListenersManager.unregisterListener("some unique name");

What did we gain here? We got rid of the explicit specification of [Some-Type] and have a single function for every type of listener, and we defined a unique name for the binding, which is used as an id to remove, enable/disable, and override listeners; in other words, no need to hold references. Obviously, it is now convenient to always use the manager's interface, since it reduces the need to trick our code and gives us a simple, more readable interface for doing many things with listeners.

But that's not the only advantage of this manager class. Manager classes in general have one big advantage: they create an abstraction of a service and work as a centralized point to add logic of any nature, like security logic or performance logic. In the case of our EventsListenersManager, we put in some extra logic to enhance the way we can use listeners by providing an interface that simplifies their use: it is easier to switch between two listeners now than it was before. Another thing we could have done was to restrict the number of listeners registered on one component for performance's sake (hint: listener execution doesn't run in parallel).

So manager classes seem to be a good thing, but we can't make a manager for every object we want to control in an application. That would consume time and would make us start thinking in terms of a manager class even when we don't need one. But we can discover a pattern for when defining a manager is worthwhile. I've seen them used when the resources they control are expensive, like database connections, media resources, hardware resources, etc.; and as I see it, expensiveness can come in many flavors: listeners are expensive because they may be difficult to play with and can be a performance bottleneck (we must keep them short and fast); connections are expensive to construct; media resources are expensive to load; hardware resources are expensive because they are scarce.
So we create different managers for them: we create a manager for listeners to ease and control their use; we create a connection pool to limit and control the number of connections to a database; game developers create a resource manager to have a single point of access to resources and to reduce the chance of loading them multiple times; and we all know there is a driver for every piece of hardware, which is that hardware's manager. Here we can see multiple variations of manager class implementations, but they all attempt to solve a single thing: reducing the cost we may incur when using expensive resources directly. I think people should have this line of thought: make a manager class for every expensive resource you detect in your application. You will be more than thankful when you want to add extra logic and you only have to go to one place. Also, enforce the use of manager objects instead of accessing the resources directly. The next time you are planning to make a manager class, ask yourself: is the 'thing' I want to control expensive in any way?
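To make the listeners example above concrete, here is a minimal sketch of what such a manager could look like, restricted to ActionListeners on buttons for brevity. The structure and names are illustrative, not the code from the project described above:

```java
import java.awt.event.ActionListener;
import java.util.HashMap;
import java.util.Map;
import javax.swing.AbstractButton;

// Minimal, illustrative EventsListenersManager: bindings are tracked under a
// unique name so callers never have to hold listener references themselves.
public final class EventsListenersManager {

    private static final class Binding {
        final AbstractButton component;
        ActionListener listener;
        boolean enabled = true;

        Binding(AbstractButton component, ActionListener listener) {
            this.component = component;
            this.listener = listener;
        }
    }

    private static final Map<String, Binding> bindings = new HashMap<>();

    public static void registerListener(String name, AbstractButton component, ActionListener listener) {
        component.addActionListener(listener);
        bindings.put(name, new Binding(component, listener));
    }

    public static void overrideListener(String name, ActionListener newListener) {
        Binding b = bindings.get(name);
        if (b == null) return;
        if (b.enabled) {
            b.component.removeActionListener(b.listener);
            b.component.addActionListener(newListener);
        }
        b.listener = newListener;
    }

    public static void disableListener(String name) {
        Binding b = bindings.get(name);
        if (b != null && b.enabled) {
            b.component.removeActionListener(b.listener);
            b.enabled = false;
        }
    }

    public static void enableListener(String name) {
        Binding b = bindings.get(name);
        if (b != null && !b.enabled) {
            b.component.addActionListener(b.listener);
            b.enabled = true;
        }
    }

    public static void unregisterListener(String name) {
        Binding b = bindings.remove(name);
        if (b != null && b.enabled) {
            b.component.removeActionListener(b.listener);
        }
    }
}
```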
November 7, 2013
by Martín Proenza
· 10,383 Views
Deep Dive into Connection Pooling
As your application grows in functionality and/or usage, managing resources becomes increasingly important. Failure to properly utilize connection pooling is one major "gotcha" that we've seen greatly impact MongoDB performance and trip up developers of all levels.

Connection Pools

Creating new authenticated connections to the database is expensive. So, instead of creating and destroying connections for each request to the database, you want to re-use existing connections as much as possible. This is where connection pooling comes in. A connection pool is a cache of database connections maintained by your driver so that connections can be re-used when new connections to the database are required. When properly used, connection pools allow you to minimize the frequency and number of new connections to your database.

Connection Churn

Used improperly, however, or not at all, connection pooling leaves your application opening and closing new database connections too often, resulting in what we call "connection churn". In a high-throughput application this can result in a constant flood of new connection requests to your database, which will adversely affect the performance of your database and your application.

Opening Too Many Connections

A less common, alternate problem is creating too many MongoClient objects that are never closed. In this case, instead of churn, you get a steady increase in the number of connections to your database, such that you have tens of thousands of connections open when your application could almost certainly do with far fewer. Since each connection takes RAM, you may find yourself wasting a good portion of your memory on connections, which will also adversely affect your application's performance. Although every application is different and the total number of connections to your database will greatly depend on how many client processes or application servers are connected, in our experience any connection count greater than 1000 to 1500 connections should raise an eyebrow, and most of the time your application will require far fewer than that.

MongoClient and Connection Pooling

Most MongoDB language drivers implement the MongoClient class which, if used properly, will handle connection pooling for you automatically. The syntax differs per language, but often you do something like this to create a new connection-pool-enabled client to your database:

mongoClient = new MongoClient(URI, connectionOptions);

Here the mongoClient object holds your connection pool and will give your app connections as needed. You should strive to create this object once as your application initializes and re-use it throughout your application to talk to your database. The most common connection pooling problem we see results from applications that create a MongoClient object way too often, sometimes on each database request. If you do this you will not be using your connection pool, as each MongoClient object maintains a separate pool that is not being reused by your application.

Example with Node.js

Let's look at a concrete example using the Node.js driver. Creating new connections to the database using the Node.js driver is done like this:

mongodb.MongoClient.connect(URI, function(err, db) {
    // database operations
});

The syntax for using MongoClient is slightly different here than with other drivers given Node's single-threaded nature, but the concept is the same. You only want to call 'connect' once during your app's initialization phase vs. on each database request.
Let’s take a closer look at the difference between doing the right thing vs. doing the wrong thing. Note: If you clone the repo from here, the logger will output your logs in your console so you can follow along. Consider the following examples: var express = require('express'); var mongodb = require('mongodb'); var app = express(); var MONGODB_URI = 'mongo-uri'; app.get('/', function(req, res) { // BAD! Creates a new connection pool for every request mongodb.MongoClient.connect(MONGODB_URI, function(err, db) { if(err) throw err; var coll = db.collection('test'); coll.find({}, function(err, docs) { docs.each(function(err, doc) { if(doc) { res.write(JSON.stringify(doc) + "\n"); } else { res.end(); } }); }); }); }); // App may initialize before DB connection is ready app.listen(3000); console.log('Listening on port 3000'); The first (no pooling): calls connect() in every request handler establishes new connections for every request (connection churn) initializes the app (app.listen()) before database connections are made var express = require('express'); var mongodb = require('mongodb'); var app = express(); var MONGODB_URI = 'mongodb-uri'; var db; var coll; // Initialize connection once mongodb.MongoClient.connect(MONGODB_URI, function(err, database) { if(err) throw err; db = database; coll = db.collection('test'); app.listen(3000); console.log('Listening on port 3000'); }); // Reuse database/collection object app.get('/', function(req, res) { coll.find({}, function(err, docs) { docs.each(function(err, doc) { if(doc) { res.write(JSON.stringify(doc) + "\n"); } else { res.end(); } }); }); }); The second (with pooling): calls connect() once reuses the database/collection variable (reuses existing connections) waits to initialize the app until after the database connection is established If you run the first example and refresh your browser enough times, you’ll quickly see that your MongoDB has a hard time handling the flood of connections and will terminate. Further Consideration – Connection Pool Size Most MongoDB drivers support a parameter that sets the max number of connections (pool size) available to your application. The connection pool size can be thought of as the max number of concurrent requests that your driver can service. The default pool size varies from driver to driver, e.g. for Node it is 5, whereas for Python it is 100. If you anticipate your application receiving many concurrent or long-running requests, we recommend increasing your pool size- adjust accordingly!
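For example, with the Node.js driver used in the examples above, the pool size is passed through the connection options. The option shape below is an assumption based on drivers of that era and has changed across versions, so treat it as a sketch to check against your driver's documentation:

```javascript
var mongodb = require('mongodb');
var MONGODB_URI = 'mongodb-uri'; // as in the examples above

// One connect call at startup with a larger pool for a high-concurrency app;
// the resulting db object (and its pool) is then reused for every request,
// exactly as in the second example above.
mongodb.MongoClient.connect(MONGODB_URI, {
  server: { poolSize: 20 }  // assumed option shape; newer drivers take a top-level poolSize
}, function(err, db) {
  if (err) throw err;
  // hold on to db/collection objects here and reuse them
});
```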
November 7, 2013
by Chris Chang
· 24,122 Views · 2 Likes
Data Access Module using Groovy with Spock testing
This blog is more of a tutorial where we describe the development of a simple data access module, more for fun and learning than anything else. All code can be found here for those who don’t want to type along: https://github.com/ricston-git/tododb As a heads-up, we will be covering the following: Using Groovy in a Maven project within Eclipse Using Groovy to interact with our database Testing our code using the Spock framework We include Spring in our tests with ContextConfiguration A good place to start is to write a pom file as shown here. The only dependencies we want packaged with this artifact are groovy-all and commons-lang. The others are either going to be provided by Tomcat or are only used during testing (hence the scope tags in the pom). For example, we would put the jar with PostgreSQL driver in Tomcat’s lib, and tomcat-jdbc and tomcat-dbcp are already there. (Note: regarding the postgre jar, we would also have to do some minor configuration in Tomcat to define a DataSource which we can get in our app through JNDI – but that’s beyond the scope of this blog. See here for more info). Testing-wise, I’m depending on spring-test, spock-core, and spock-spring (the latter is to get spock to work with spring-test). Another significant addition in the pom is the maven-compiler-plugin. I have tried to get gmaven to work with Groovy in Eclipse, but I have found the maven-compiler-plugin to be a lot easier to work with. With your pom in an empty directory, go ahead and mkdir -p src/main/groovy src/main/java src/test/groovy src/test/java src/main/resources src/test/resources. This gives us a directory structure according to the Maven convention. Now you can go ahead and import the project as a Maven project in Eclipse (install the m2e plugin if you don’t already have it). It is important that you do not mvn eclipse:eclipse in your project. The .classpath it generates will conflict with your m2e plugin and (at least in my case), when you update your pom.xml the plugin will not update your dependencies inside Eclipse. So just import as a maven project once you have your pom.xml and directory structure set up. Okay, so our tests are going to be integration tests, actually using a PostgreSQL database. Since that’s the case, lets set up our database with some data. First go ahead and create a tododbtest database which will only be used for testing purposes. Next, put the following files in your src/test/resources: Note, fill in your username/password: DROP TABLE IF EXISTS todouser CASCADE; CREATE TABLE todouser ( id SERIAL, email varchar(80) UNIQUE NOT NULL, password varchar(80), registered boolean DEFAULT FALSE, confirmationCode varchar(280), CONSTRAINT todouser_pkey PRIMARY KEY (id) ); insert into todouser (email, password, registered, confirmationCode) values ('abc.j123@gmail.com', 'abc123', FALSE, 'abcdefg') insert into todouser (email, password, registered, confirmationCode) values ('def.123@gmail.com', 'pass1516', FALSE, '123456') insert into todouser (email, password, registered, confirmationCode) values ('anon@gmail.com', 'anon', FALSE, 'codeA') insert into todouser (email, password, registered, confirmationCode) values ('anon2@gmail.com', 'anon2', FALSE, 'codeB') Basically, testContext.xml is what we’ll be configuring our test’s context with. The sub-division into datasource.xml and initdb.xml may be a little too much for this example… but changes are usually easier that way. 
The gist is that we configure our data source in datasource.xml (this is what we will be injecting in our tests), and the initdb.xml will run the schema.sql and test-data.sql to create our table and populate it with data. So let’s create our test, or should I say, our specification. Spock is a specification framework that allows us to write more descriptive tests. In general, it makes our tests easier to read and understand, and since we’ll be using Groovy, we might as well make use of the extra readability Spock gives us.

package com.ricston.blog.sample.model.spec;

import javax.sql.DataSource

import org.springframework.beans.factory.annotation.Autowired
import org.springframework.test.annotation.DirtiesContext
import org.springframework.test.annotation.DirtiesContext.ClassMode
import org.springframework.test.context.ContextConfiguration

import spock.lang.Specification

import com.ricston.blog.sample.model.data.TodoUser
import com.ricston.blog.sample.model.dao.postgre.PostgreTodoUserDAO

// because it supplies a new application context after each test, the initialize-database in initdb.xml is
// executed for each test/specification
@DirtiesContext(classMode=ClassMode.AFTER_EACH_TEST_METHOD)
@ContextConfiguration('classpath:testContext.xml')
class PostgreTodoUserDAOSpec extends Specification {

    @Autowired
    DataSource dataSource

    PostgreTodoUserDAO postgreTodoUserDAO

    def setup() {
        postgreTodoUserDAO = new PostgreTodoUserDAO(dataSource)
    }

    def "findTodoUserByEmail when user exists in db"() {
        given: "a db populated with a TodoUser with email anon@gmail.com and the password given below"
        String email = 'anon@gmail.com'
        String password = 'anon'

        when: "searching for a TodoUser with that email"
        TodoUser user = postgreTodoUserDAO.findTodoUserByEmail email

        then: "the row is found such that the user returned by findTodoUserByEmail has the correct password"
        user.password == password
    }
}

One specification is enough for now, just to make sure that all the moving parts are working nicely together. The specification itself is easy enough to understand. We’re just exercising the findTodoUserByEmail method of PostgreTodoUserDAO – which we will be writing soon. Using the ContextConfiguration from Spring Test we are able to inject beans defined in our context (the dataSource in our case) through the use of annotations. This keeps our tests short and makes them easier to modify later on. Additionally, note the use of DirtiesContext. Basically, after each specification is executed, we cannot rely on the state of the database remaining intact. I am using DirtiesContext to get a new Spring context for each specification run. That way, the table creation and test data insertions happen all over again for each specification we run.
Before we can run our specification, we need to create at least the following two classes used in the spec: TodoUser and PostgreTodoUserDAO.

package com.ricston.blog.sample.model.data

import org.apache.commons.lang.builder.ToStringBuilder

class TodoUser {
    long id;
    String email;
    String password;
    String confirmationCode;
    boolean registered;

    @Override
    public String toString() {
        ToStringBuilder.reflectionToString(this);
    }
}

package com.ricston.blog.sample.model.dao.postgre

import groovy.sql.Sql
import javax.sql.DataSource

import com.ricston.blog.sample.model.dao.TodoUserDAO
import com.ricston.blog.sample.model.data.TodoUser

class PostgreTodoUserDAO implements TodoUserDAO {

    private Sql sql

    public PostgreTodoUserDAO(DataSource dataSource) {
        sql = new Sql(dataSource)
    }

    /**
     *
     * @param email
     * @return the TodoUser with the given email
     */
    public TodoUser findTodoUserByEmail(String email) {
        sql.firstRow """SELECT * FROM todouser WHERE email = $email"""
    }
}

package com.ricston.blog.sample.model.dao;

import com.ricston.blog.sample.model.data.TodoUser;

public interface TodoUserDAO {
    /**
     *
     * @param email
     * @return the TodoUser with the given email
     */
    public TodoUser findTodoUserByEmail(String email);
}

We’re just creating a POGO in TodoUser, implementing its toString using commons’ ToStringBuilder. In PostgreTodoUserDAO we’re using Groovy’s Sql class to access the database, for now only implementing the findTodoUserByEmail method. PostgreTodoUserDAO implements TodoUserDAO, an interface which specifies the required methods a TodoUserDAO must have. Okay, so now we have all we need to run our specification. Go ahead and run it as a JUnit test from Eclipse. You should get back the following error message:

org.codehaus.groovy.runtime.typehandling.GroovyCastException: Cannot cast object '{id=3, email=anon@gmail.com, password=anon, registered=false, confirmationcode=codeA}' with class 'groovy.sql.GroovyRowResult' to class 'com.ricston.blog.sample.model.data.TodoUser' due to: org.codehaus.groovy.runtime.metaclass.MissingPropertyExceptionNoStack: No such property: confirmationcode for class: com.ricston.blog.sample.model.data.TodoUser Possible solutions: confirmationCode
at com.ricston.blog.sample.model.dao.postgre.PostgreTodoUserDAO.findTodoUserByEmail(PostgreTodoUserDAO.groovy:23)
at com.ricston.blog.sample.model.spec.PostgreTodoUserDAOSpec.findTodoUserByEmail when user exists in db(PostgreTodoUserDAOSpec.groovy:37)

Go ahead and connect to your tododbtest database and run select * from todouser; As you can see, our confirmationCode varchar(280) ended up as the column confirmationcode with a lower case ‘c’. In PostgreTodoUserDAO’s findTodoUserByEmail, we are getting back a GroovyRowResult from our firstRow invocation. GroovyRowResult implements Map, and Groovy is able to create a POGO (in our case TodoUser) from a Map. However, in order for Groovy to be able to automatically coerce the GroovyRowResult into a TodoUser, the keys in the Map (or GroovyRowResult) must match the property names in our POGO. We are using confirmationCode in our TodoUser, and we would like to stick to the camel case convention. What can we do to get around this? Well, first of all, let’s change our schema to use confirmation_code. That’s a little more readable. Of course, we still have the same problem as before since confirmation_code will not map to confirmationCode by itself. (Note: remember to change the insert statements in test-data.sql too).
One way to get around this is to use Groovy’s propertyMissing methods as shown below:

def propertyMissing(String name, value) {
    if(isConfirmationCode(name)) {
        this.confirmationCode = value
    } else {
        unknownProperty(name)
    }
}

def propertyMissing(String name) {
    if(isConfirmationCode(name)) {
        return confirmationCode
    } else {
        unknownProperty(name)
    }
}

private boolean isConfirmationCode(String name) {
    'confirmation_code'.equals(name)
}

def unknownProperty(String name) {
    throw new MissingPropertyException(name, this.class)
}

By adding this to our TodoUser.groovy we are effectively tapping into how Groovy resolves property access. When we do something like user.confirmationCode, Groovy automatically calls getConfirmationCode(), a method which we got for free when we declared the property confirmationCode in our TodoUser. Now, when user.confirmation_code is invoked, Groovy doesn’t find any getters to invoke since we never declared the property confirmation_code; however, since we have now implemented the propertyMissing methods, before throwing any exceptions it will use those methods as a last resort when resolving properties. In our case we are effectively checking whether a get or set on confirmation_code is being made and mapping the respective operations to our confirmationCode property. It’s as simple as that. Now we can keep the auto coercion in our data access object and the property name we choose to have in our TodoUser. Assuming you’ve made the changes to the schema and test-data.sql to use confirmation_code, go ahead and run the spec file and this time it should pass. That’s it for this tutorial. In conclusion, I would like to discuss some finer points which someone who’s never used Groovy’s SQL support before might not know. As you can see in PostgreTodoUserDAO.groovy, our database interaction is pretty much a one-liner. What about resource handling (e.g. properly closing the connection when we’re done), error logging, and prepared statements? Resource handling and error logging are done automatically; you just have to worry about writing your SQL. When you do write your SQL, try to stick to using triple quotes as used in the PostgreTodoUserDAO.groovy example. This produces prepared statements, thereby protecting against SQL injection and avoiding having to put ‘?’ all over the place and line up the arguments to pass in to the SQL statement. Note that transaction management is something which the code using our artifact will have to take care of. Finally, note that a bunch of other operations (apart from findTodoUserByEmail) are implemented in the project on GitHub: https://github.com/ricston-git/tododb. Additionally, there is a specification test for TodoUser, making sure that the property mapping works correctly. Also, in the pom.xml, there is some maven-surefire-plugin configuration in order to get the surefire-plugin to pick up our Spock specifications as well as any JUnit tests which we might have in our project. This allows us to run our specifications when we, for example, mvn clean package. After implementing all the operations you require in PostgreTodoUserDAO.groovy, you can go ahead and compile the jar or include it in a Maven multi-module project to get a data access module you can use in other applications.
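As a closing illustration of the triple-quoted GString point above, one more operation in PostgreTodoUserDAO could look like the sketch below (the method name and column list are mine; the real operations live in the GitHub project):

/**
 * Inserts the given user. The GString parameters become a prepared statement,
 * so values are bound rather than concatenated into the SQL.
 */
public void insertTodoUser(TodoUser user) {
    sql.executeInsert """INSERT INTO todouser (email, password, registered, confirmation_code)
                         VALUES ($user.email, $user.password, $user.registered, $user.confirmationCode)"""
}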
November 6, 2013
by Justin Calleja
· 20,829 Views
article thumbnail
Modeling Data in Neo4j: Bidirectional Relationships
Transitioning from the relational world to the beautiful world of graphs requires a shift in thinking about data. Although graphs are often much more intuitive than tables, there are certain mistakes people tend to make when modelling their data as a graph for the first time. In this article, we look at one common source of confusion: bidirectional relationships.

Directed Relationships

Relationships in Neo4j must have a type, giving the relationship a semantic meaning, and a direction. Frequently, the direction becomes part of the relationship's meaning. In other words, the relationship would be ambiguous without it. For example, the following graph shows that the Czech Republic defeated Sweden in ice hockey. Had the direction of the relationship been reversed, the Swedes would be much happier. With no direction at all, the relationship would be ambiguous, since it would not be clear who the winner was. Note that the existence of this relationship implies a relationship of a different type going in the opposite direction, as the next graph illustrates. This is often the case. To give another example, the fact that Pulp Fiction was DIRECTED_BY Quentin Tarantino implies that Quentin Tarantino IS_DIRECTOR_OF Pulp Fiction. You could come up with a huge number of such relationship pairs. One common mistake people often make when modelling their domain in Neo4j is creating both types of relationships. Since one relationship implies the other, this is wasteful, both in terms of space and traversal time. Neo4j can traverse relationships in both directions. More importantly, thanks to the way Neo4j organizes its data, the speed of traversal does not depend on the direction of the relationships being traversed.

Bidirectional Relationships

Some relationships, on the other hand, are naturally bidirectional. A classic example is Facebook or real-life friendship. This relationship is mutual - when someone is your friend, you are (hopefully) his friend, too. Depending on how we look at the model, we could also say such a relationship is undirected. GraphAware and Neo Technology are partner companies. Since this is a mutual relationship, we could model it as a bidirectional or undirected relationship, respectively. But since none of this is directly possible in Neo4j, beginners often resort to the following model, which suffers from the exact same problem as the incorrect ice hockey model: an extra unnecessary relationship. Neo4j APIs allow developers to completely ignore relationship direction when querying the graph, if they so desire. For example, in Neo4j's own query language, Cypher, the key part of a query finding all partner companies of Neo Technology would look something like

MATCH (neo)-[:PARTNER]-(partner)

The result would be the same as executing and merging the results of the following two different queries:

MATCH (neo)-[:PARTNER]->(partner)

and

MATCH (neo)<-[:PARTNER]-(partner)

Therefore, the correct (or at least most efficient) way of modelling the partner relationships is using a single PARTNER relationship with an arbitrary direction.

Conclusion

Relationships in Neo4j can be traversed in both directions with the same speed. Moreover, direction can be completely ignored. Therefore, there is no need to create two different relationships between nodes, if one implies the other.
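To make the conclusion concrete, here is a minimal Cypher sketch (Neo4j 2.0-era syntax; the node properties are made up for illustration) that creates a single PARTNER relationship and then queries it while ignoring direction:

// create both companies and one PARTNER relationship with an arbitrary direction
CREATE (neo {name: 'Neo Technology'}), (ga {name: 'GraphAware'}),
       (ga)-[:PARTNER]->(neo)

// direction is ignored in the match, so either company finds the other
MATCH (c {name: 'Neo Technology'})-[:PARTNER]-(partner)
RETURN partner.name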
November 6, 2013
by Michal Bachman
· 27,539 Views · 2 Likes
article thumbnail
Automatically Collect and Process Visitors’ IP Addresses
(NOTE: A version of this article posted previously contained incorrect information. The below version corrects those errors. The author apologises for any inconvenience.) I’m not a programmer. I’m an art collection manager with a fierce DIY streak that has helped me to develop a database application, and build and manage a website that incorporates it. Ever since I accidentally lobotomised my first Windows 3.1 computer, I’ve taught myself how to seek out, find and apply the information I need, sometimes through long hours of trial and error, and I owe almost all of it to the Internet. If it weren’t for people’s willingness to share information for free on innumerable forums and websites like this one, I would not have been able even to scratch the surface of completing the sorts of tasks that are now behind me. In that spirit of sharing, I thought I might humbly offer the solution I’ve cobbled together from various sources, and tweaked to automate tasks related to collecting the IP addresses of visitors to my website. I’m sure it’s nothing earth-shattering to an experienced coder, but it works well for me, and I’ve never seen a complete solution like it presented anywhere on the Internet before. It is a Windows-centric solution, since that’s the platform I’ve always used. The first step is to gather and record the IP addresses of website visitors. I’ve chosen to do this using PHP. Insert this code into the HTML of each page for which you’d like to capture IP addresses, just before the closing </body> tag: If you want to record the IP addresses for each webpage to a different file, use a different name each time for filename.txt in the above example. Next, create the blank filename.txt file(s) in the same directory of your web server in which these HTML files reside. Now each time a visitor loads these pages, their IP address will be written to the text file(s) you’ve indicated. Next, you’ll need a way to download the text file(s) from the server to your local machine. If you’re writing to multiple files on the server, I’ve found it’s helpful to download them separately, then combine them into one list. Also, I like to sort the list and remove duplicate entries (you’ll see why a little later). Following is a Visual Basic script to do all of that. Let’s designate it C:\Folder\Subfolder\DloadCmbnDdupe.vbs. In the script below, substitute YourWebsite.com with the domain name of your website, C:\Folder\Subfolder with the actual location on your local computer and filename*.txt with the file name(s) on the web server to which you’ve chosen to write.

Option Explicit
On Error Resume Next

Download "http://www.YourWebsite.com/filename.txt", _
    "C:\Folder\Subfolder\filename.txt"
Download "http://www.YourWebsite.com/filename2.txt", _
    "C:\Folder\Subfolder\filename2.txt"
Download "http://www.YourWebsite.com/filename3.txt", _
    "C:\Folder\Subfolder\filename3.txt"

CmbnDdupe()

If Err <> 0 Then
    Wscript.echo "Error Type = " & Err.Description
End If

WScript.Quit

'-----------------------------------------------------------------------------------------

Function Download(strURL, strPath)
    Dim i, objFile, objFSO, objHTTP, strFile, strMsg
    Const ForReading = 1, ForWriting = 2, ForAppending = 8

    Set objFSO = CreateObject("Scripting.FileSystemObject")

    If objFSO.FolderExists(strPath) Then
        strFile = objFSO.BuildPath(strPath, Mid(strURL, InStrRev(strURL, "/") + 1))
    ElseIf objFSO.FolderExists(Left(strPath, InStrRev(strPath, "\") - 1)) Then
        strFile = strPath
    Else
        WScript.Echo "ERROR: Target folder not found."
        Exit Function
    End If

    Set objFile = objFSO.OpenTextFile(strFile, ForWriting, True)
    Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
    objHTTP.Open "GET", strURL, False
    objHTTP.Send

    For i = 1 To LenB(objHTTP.ResponseBody)
        objFile.Write Chr(AscB(MidB(objHTTP.ResponseBody, i, 1)))
    Next

    objFile.Close()
End Function

'-----------------------------------------------------------------------------------------

Function CmbnDdupe()
    Dim shell
    Set shell=createobject("wscript.shell")
    shell.run "CmbnDdupe.bat"
    Set shell=nothing
End Function

You’ll notice that Function CmbnDdupe calls a batch file, CmbnDdupe.bat, in the same directory (C:\Folder\Subfolder). Here it is, below. Again, substitute filename*.txt with the file name(s) you used at the beginning.

@echo off
for %%x in (filename.txt) do type %%x>>templist
for %%x in (filename2.txt) do type %%x>>templist
for %%x in (filename3.txt) do type %%x>>templist
ren templist IPlist.txt
setlocal disableDelayedExpansion
set file=IPlist.txt
set "sorted=%file%.sorted"
set "deduped=%file%.deduped"
::Define a variable containing a linefeed character
set LF=^


::The 2 blank lines above are critical, do not remove
sort "%file%" >"%sorted%"
>"%deduped%" (
  set "prev="
  for /f usebackq^ eol^=^%LF%%LF%^ delims^= %%A in ("%sorted%") do (
    set "ln=%%A"
    setlocal enableDelayedExpansion
    if /i "!ln!" neq "!prev!" (
      endlocal
      (echo %%A)
      set "prev=%%A"
    ) else endlocal
  )
)
>nul move /y "%deduped%" "%file%"
del "%sorted%"
exit

This routine combines the downloaded files into one (templist), then renames it to IPlist.txt. It then sorts the IP addresses into ascending order, saving to IPlist.txt.sorted, and then removes any duplicates, saving to IPlist.txt.deduped. Finally, it moves (overwrites and deletes) IPlist.txt.deduped to IPlist.txt and deletes IPlist.txt.sorted, leaving behind IPlist.txt (the sorted and de-dupe-ified list). Now, we have a list of one IP address per visitor to the pages from which we’re collecting. At this point I like to ping each IP address to collect whatever information is available about it. This is why I remove the duplicate entries, which are caused by a visitor viewing more than one of the collecting pages, or by a visitor returning to the pages. I don’t need to waste time and bandwidth pinging the same IP address more than once. If I want to see which IPs visited which pages multiple times, I can always just look at filename.txt, filename2.txt and filename3.txt. I’ve named the ping routine PingList.vbs, and put it in C:\Folder\Subfolder. Here it is:

Option Explicit
On Error Resume Next

Dim srcFile
srcFile = "IPlist.txt"

PingList(srcFile)

If Err <> 0 Then
    Wscript.echo "Error Type = " & Err.Description
End If

WScript.Quit

'-----------------------------------------------------------------------------------------

Function PingList(srcFile)
    Dim objFSO
    Dim objShell
    Dim strCommand
    Dim opnFile
    Dim strText
    Dim logFile

    Set objFSO = CreateObject("Scripting.FileSystemObject")
    Set objShell = Wscript.CreateObject("Wscript.Shell")
    logFile = "Log.txt"

    If objFSO.FileExists(srcFile) Then
        Set opnFile = objFSO.OpenTextFile(srcFile, 1)
        Do While Not opnFile.AtEndOfStream
            strText = opnFile.ReadLine
            If Trim(strText) <> "" Then
                strCommand = strText
                objShell.run "%comspec% /c ping -a -n 1 " & strText & " >> " & logFile, , True
            End If
        Loop
        opnFile.Close
    Else
        WScript.Echo "File '" & srcFile & "' was not found."
    End If
End Function

This script pings each IP address in IPlist.txt, resolves the hostname if possible and writes the results to Log.txt in the same directory.
Go ahead and create a blank Log.txt now in C:\Folder\Subfolder. The next thing you’ll want to do is clear the contents of the text files on the server, so that they will retain only the IP addresses from new visits. Create the following file in C:\Folder\Subfolder. I’ve named it Upload.cmd. It's critical that it has the .cmd file type.

@echo off
echo user YourUsername> ftpcmd.dat
echo YourPassword>> ftpcmd.dat
echo bin>> ftpcmd.dat
echo cd /YourWebDirectory/>> ftpcmd.dat
echo prompt>> ftpcmd.dat
echo mput %1 %2 %3>> ftpcmd.dat
echo rename filename_.txt filename.txt>> ftpcmd.dat
echo rename filename2_.txt filename2.txt>> ftpcmd.dat
echo rename filename3_.txt filename3.txt>> ftpcmd.dat
echo quit>> ftpcmd.dat
ftp -n -s:ftpcmd.dat ftp.YourWebsite.com
del ftpcmd.dat
exit

Substitute YourUsername and YourPassword with the username and password with which you access your website files, and YourWebDirectory with the location of your website files on the server. In C:\Folder\Subfolder, create the blank text file(s) that will overwrite the ones on the server. Give them a different name (for instance, add an underscore), as you’ll want to distinguish them from the files you downloaded at the beginning of this exercise. Hence, the blank filename_.txt will be copied to the server as filename.txt, overwriting the existing file. The number of per-cent-sign-plus-integer combinations (variables) needs to correspond with the number of files you upload and overwrite; in this case, three (%1 %2 %3 = filename_.txt, filename2_.txt and filename3_.txt). Substitute YourWebsite.com for the domain name of your website. Before moving to the final step, create a blank text file named ErrorLog.txt in C:\Folder\Subfolder. This is where we’ll record any errors encountered during the execution of the combined routines. Now to put it all together and automate it. Create the following batch file (I’ve named it DLPingUL.bat) and put it in C:\.

@echo off
title Download Ping Upload - Scheduled task, please wait
cd "C:\Folder\Subfolder"
start "" /wait CScript DloadCmbnDdupe.vbs 2>> ErrorLog.txt
start "" /wait Upload.cmd filename_.txt filename2_.txt filename3_.txt 2>> ErrorLog.txt
copy /d /y /a IPlist.txt NewIPlist.txt /a 2>> ErrorLog.txt
start "" /wait CScript PingList.vbs 2>> ErrorLog.txt
del IPlist.txt
exit

The reason we’ve put this in C:\ is so that Windows Task Scheduler will have no problem with permissions when running it. This routine moves to the directory in which you’ve stored all of the relevant files (C:\Folder\Subfolder, substitute with the actual location); executes the Visual Basic script that creates the list; executes the upload, passing the file names of the blank replacement files to the echo mput %1 %2 %3 command; copies IPlist.txt to NewIPlist.txt, overwriting the latter if it exists (this is so you have a list to which to refer if you want); and executes the VB Script to ping and record the results. Finally, it deletes IPlist.txt, as it needs to be created programmatically each time. Any errors are recorded in C:\Folder\Subfolder\ErrorLog.txt. The final step is to create a new task in Task Scheduler that runs C:\DLPingUL.bat at a time of your choosing. I run it every Saturday at 3:00 am, so that when I wake up I have C:\Folder\Subfolder\Log.txt waiting for me with all of its pinged IP address information. Having the hostname can be especially helpful; it can show you which bots and spiders crawled your site, or from which corporation the visit originated.
The only manual task I do is any further research on those IP addresses, like running them through Whois or whatismyipaddress.com/ip-lookup . These sites help me to determine, for example, if an IP address is static or dynamic, or if it is associated with hackers or spammers, among other useful bits of information. When I’m finished, I clear the contents of Log.txt so that it is ready for next time. If you’ve read all the way down to here then you’ve been very patient with me, and for that I thank you kindly. Addendum: For best results, when copying the above code, click on "View Source" in the upper right corner of the code box, and copy and paste the source. This will ensure consistency. _____________________________________________ Phillip Schubert is the founder of Schubert & Associates www.schubertassociates.com.au
November 5, 2013
by Phillip Schubert
· 18,356 Views
article thumbnail
LINQ - AddRange Method in C#
In this post I’m going to explain the LINQ AddRange method. This method is quite useful when you want to add multiple elements to the end of a list. Following is the method signature:

public void AddRange(IEnumerable<T> collection)

To understand it, let’s take a simple example like the following.

using System;
using System.Collections.Generic;

namespace Linq
{
    class Program
    {
        static void Main(string[] args)
        {
            List<string> names = new List<string> {"Jalpesh"};
            string[] newnames = new string[] {"Vishal", "Tushar", "Vikas", "Himanshu"};
            foreach (var newname in newnames)
            {
                names.Add(newname);
            }
            foreach (var n in names)
            {
                Console.WriteLine(n);
            }
        }
    }
}

Here in the above code I am adding the contents of an array to an already created list via a foreach loop. You can use the AddRange method instead of the loop as follows. It will produce the same output as above.

using System;
using System.Collections.Generic;

namespace Linq
{
    class Program
    {
        static void Main(string[] args)
        {
            List<string> names = new List<string> {"Jalpesh"};
            string[] newnames = new string[] {"Vishal", "Tushar", "Vikas", "Himanshu"};
            names.AddRange(newnames);
            foreach (var n in names)
            {
                Console.WriteLine(n);
            }
        }
    }
}

Now when you run that example, the output is: Jalpesh, Vishal, Tushar, Vikas, Himanshu.

AddRange in a more complex scenario: You can also use AddRange in more complex scenarios, combining it with other operators as follows.

using System;
using System.Collections.Generic;
using System.Linq;

namespace Linq
{
    class Program
    {
        static void Main(string[] args)
        {
            List<string> names = new List<string> {"Jalpesh"};
            string[] newnames = new string[] {"Vishal", "Tushar", "Vikas", "Himanshu"};
            names.AddRange(newnames.Where(nn => nn.StartsWith("Vi")));
            foreach (var n in names)
            {
                Console.WriteLine(n);
            }
        }
    }
}

Here in the above code I have created an array of strings and filtered it with the Where operator while adding it to an existing list. The output, as expected, is: Jalpesh, Vishal, Vikas. That’s it. Hope you like it. Stay tuned for more..
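Since AddRange accepts any IEnumerable<string>, it also composes with longer LINQ chains; here is a small sketch reusing the variable names from the examples above (not from the original post):

// filter and project before appending; names ends up as: Jalpesh, VISHAL, VIKAS
List<string> names = new List<string> { "Jalpesh" };
string[] newnames = new string[] { "Vishal", "Tushar", "Vikas", "Himanshu" };
names.AddRange(newnames.Where(nn => nn.StartsWith("Vi")).Select(nn => nn.ToUpper()));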
November 3, 2013
by Jalpesh Vadgama
· 61,626 Views
article thumbnail
Service Injection in Doctrine DBAL Type
When you think of a Doctrine 2 DBAL Type you think of an atomic thing, but how can you work programmatically on this type without defining an event? A DBAL Type doesn't allow access to the Symfony 2 service container, so you must use a hack. But before this, let me explain the classic way (using events), why you should use this hack and why you shouldn't. The classic way is defined in the Symfony 2 Cookbook: How to Register Event Listeners and Subscribers. Doctrine 2 events, unlike Symfony 2 events, aren't defined by the developer; the developer can only attach listeners to them. Why? Because Doctrine 2 isn't a framework that you can use for everything; persistence is its only job. When should you use this hack? When your stored object isn't a 1:1 representation of the PHP object and its conversion is memoizable or really fast. I use this hack for browscaps: with the BrowscapBundle I can convert from a user agent string to a stdClass object (like the get_browser function). Our object is container = $container; }

public function prePersist(LifecycleEventArgs $args)
{
    $this->doObjectToString($args);
}

public function postPersist(LifecycleEventArgs $args)
{
    $this->doStringToObject($args);
}

public function preUpdate(LifecycleEventArgs $args)
{
    $this->doObjectToString($args);
}

public function postUpdate(LifecycleEventArgs $args)
{
    $this->doStringToObject($args);
}

public function postLoad(LifecycleEventArgs $args)
{
    $this->doStringToObject($args);
}

private function doStringToObject($args)
{
    $entity = $args->getEntity();
    if ($entity instanceof Agent && !is_object($entity->getHeader())) {
        $browscap = $this->container->get('browscap');
        $browser = $browscap->getBrowser($entity->getHeader());
        $entity->setHeader($browser);
    }
}

private function doObjectToString($args)
{
    $entity = $args->getEntity();
    if ($entity instanceof Agent && is_object($entity->getHeader())) {
        $user_agent = $entity->getHeader()->browser_name;
        $entity->setHeader($user_agent);
    }
}
}

With this code, every time you persist, update or extract an Agent entity from/to the related storage system, it'll be converted between string and object. The problem is that these callbacks are invoked every time, and having numerous events isn't good for your application. But with this hack I can write:

services:
    acme.demo_bundle.event_listener.container_listener:
        arguments:
            - "@service_container"
        class: "Acme\DemoBundle\EventListener\ContainerListener"
        tags:
            - { name: doctrine.event_listener, event: getContainer }

Doctrine ignores this event, but it exists and the listener is attached!

// class header reconstructed from the service definition above
class ContainerListener
{
    private $container;

    public function __construct($container)
    {
        $this->container = $container;
    }

    public function getContainer()
    {
        return $this->container;
    }
}

This listener seems useless, but it's the only way for this hack, because a Doctrine 2 DBAL Type doesn't allow direct access to the service container but does allow access to event listeners.
getVarcharTypeDeclarationSQL($fieldDeclaration); }

public function convertToPHPValue($value, AbstractPlatform $platform)
{
    if (is_null($value)) {
        return null;
    }

    $listeners = $platform->getEventManager()->getListeners('getContainer');
    $listener = array_shift($listeners);
    $container = $listener->getContainer();

    return $container->get('browscap')->getBrowser($value);
}

public function convertToDatabaseValue($value, AbstractPlatform $platform)
{
    if ($value instanceof Browscap) {
        return $value->getBrowser()->browser_name;
    } elseif ($value instanceof stdClass) {
        return $value->browser_name;
    }

    return $value;
}

public function getName()
{
    return 'browscap';
}

public function requiresSQLCommentHint(AbstractPlatform $platform)
{
    return true;
}
}

I use this hack to define only the events related to application flow (fewer events is better). Now that you know when you can use this, you must read why you shouldn't use it. Let me explain the reason with one simple example: imagine that one day PHP allows external hooks into native class constructors; how could you work without knowing what happens while initializing a new stdClass? The same reason applies here: every time you extract a value from the database you want to extract it fast (hopefully you'll extract more than one record), but how can you be sure that extraction is fast if every attribute of a single record depends on external libraries and logic? Quoting Ocramius, member of the Doctrine 2 development team: DBAL types are not designed for Dependency Injection. We explicitly avoided using DI for DBAL types because they have to stay simple. We’ve been asked many many times to change this behaviour, but doctrine believes that complex data manipulation should NOT happen within the very core of the persistence layer itself. That should be handled in your service layer.
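For completeness, a custom type like this still has to be registered with the DBAL. In a Symfony 2 application that is typically done with configuration along these lines — a sketch in which the class namespace is my assumption for illustration, not taken from the bundle:

# app/config/config.yml
doctrine:
    dbal:
        types:
            browscap: Acme\DemoBundle\Doctrine\DBAL\Type\BrowscapType

The entity column can then be mapped with type="browscap", which is the name returned by getName() above.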
November 2, 2013
by Emanuele Minotto
· 7,628 Views
article thumbnail
EasyNetQ: Publisher Confirms
Publisher confirms are a RabbitMQ addition to AMQP to guarantee message delivery. You can read all about them here and here. In short, they provide an asynchronous confirmation that a publish has successfully reached all the queues that it was routed to. To turn on publisher confirms with EasyNetQ, set the publisherConfirms connection string parameter like this:

var bus = RabbitHutch.CreateBus("host=localhost;publisherConfirms=true");

When you set this flag, EasyNetQ will wait for the confirmation, or a timeout, before returning from the Publish method:

bus.Publish(new MyMessage { Text = "Hello World!" });
// here the publish has been confirmed.

Nice and easy. There’s a problem though. If I run the above code in a while loop without publisher confirms, I can publish around 4000 messages per second, but with publisher confirms switched on that drops to around 140 per second. Not so good. With EasyNetQ 0.15 we introduced a new PublishAsync method that returns a Task. The Task completes when the publish is confirmed:

bus.PublishAsync(message).ContinueWith(task =>
{
    if (task.IsCompleted)
    {
        Console.WriteLine("Publish completed fine.");
    }
    if (task.IsFaulted)
    {
        Console.WriteLine(task.Exception);
    }
});

Using this code in a while loop gets us back to 4000 messages per second with publisher confirms on. Happy confirms!
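If you are on C# 5, the same thing reads a little more naturally with async/await — a minimal sketch (the wrapper method and its IBus parameter are mine; only PublishAsync comes from the post):

public async Task PublishWithConfirmAsync(IBus bus, MyMessage message)
{
    try
    {
        // completes when RabbitMQ confirms the publish, faults on failure or timeout
        await bus.PublishAsync(message);
        Console.WriteLine("Publish completed fine.");
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
    }
}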
November 1, 2013
by Mike Hadlow
· 8,475 Views
article thumbnail
Securing Docker’s Remote API
One piece of Docker that is AMAZING is the Remote API that can be used to programmatically interact with Docker. I recently had a situation where I wanted to run many containers on a host with a single container managing the other containers through the API. But the problem I soon discovered is that, at the moment, when you turn networking on it is an all-or-nothing type of thing… you can’t turn networking off selectively on a container-by-container basis. You can disable IPv4 forwarding, but you can still reach the Docker remote API on the machine if you can guess its IP address. One solution I came up with for this is to use nginx to expose the Unix socket for Docker over HTTPS and utilize client-side SSL certificates to only allow trusted containers to have access. I liked this setup a lot so I thought I would share how it’s done. Disclaimer: assumes some knowledge of Docker! Generate The SSL Certificates We’ll use openssl to generate and self-sign the certs. Since this is for an internal service we’ll just sign it ourselves. We also remove the password from the keys so that we aren’t prompted for it each time we start nginx.

# Create the CA Key and Certificate for signing Client Certs
openssl genrsa -des3 -out ca.key 4096
openssl rsa -in ca.key -out ca.key # remove password!
openssl req -new -x509 -days 365 -key ca.key -out ca.crt

# Create the Server Key, CSR, and Certificate
openssl genrsa -des3 -out server.key 1024
openssl rsa -in server.key -out server.key # remove password!
openssl req -new -key server.key -out server.csr

# We're self signing our own server cert here. This is a no-no in production.
openssl x509 -req -days 365 -in server.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out server.crt

# Create the Client Key and CSR
openssl genrsa -des3 -out client.key 1024
openssl rsa -in client.key -out client.key # no password!
openssl req -new -key client.key -out client.csr

# Sign the client certificate with our CA cert. Unlike signing our own server cert, this is what we want to do.
openssl x509 -req -days 365 -in client.csr -CA ca.crt -CAkey ca.key -set_serial 01 -out client.crt

Another option may be to leave the passphrase in and provide it as an environment variable when running a docker container, or through some other means, as an extra layer of security. We’ll move ca.crt, server.key and server.crt to /etc/nginx/certs. Setup Nginx The nginx setup for this is pretty straightforward. We just listen for traffic on localhost on port 4242. We require client-side SSL certificate validation and reference the certificates we generated in the previous step. And most important of all, we set up an upstream proxy to the Docker Unix socket. I simply overwrote what was already in /etc/nginx/sites-enabled/default.
upstream docker {
    server unix:/var/run/docker.sock fail_timeout=0;
}

server {
    listen 4242;
    server_name localhost;

    ssl on;
    ssl_certificate /etc/nginx/certs/server.crt;
    ssl_certificate_key /etc/nginx/certs/server.key;
    ssl_client_certificate /etc/nginx/certs/ca.crt;
    ssl_verify_client on;

    access_log on;
    error_log /dev/null;

    location / {
        proxy_pass http://docker;
        proxy_redirect off;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        client_max_body_size 10m;
        client_body_buffer_size 128k;

        proxy_connect_timeout 90;
        proxy_send_timeout 120;
        proxy_read_timeout 120;

        proxy_buffer_size 4k;
        proxy_buffers 4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
    }
}

One important piece to make this work is you should add the user nginx runs as to the docker group so that it can read from the socket. This could be www-data, nginx, or something else! Hack It Up! With this setup and nginx restarted, let’s first run a curl command to make sure that this is set up correctly. First we’ll make a call without the client cert to double check that we get denied access, then a proper one.

# Is normal http traffic denied?
curl -v http://localhost:4242/info

# How about https, sans client cert and key?
curl -v -s -k https://localhost:4242/info

# And the final good request!
curl -v -s -k --key client.key --cert client.crt https://localhost:4242/info

For the first two we should get some run-of-the-mill 400 http response codes before we get a proper JSON response from the final command! Woot! But wait there’s more… let’s build a container that can call the service to launch other containers! For this example we’ll simply build two containers: one that has the client certificate and key and one that doesn’t. The code for these examples is pretty straightforward and to save space I’ll leave the untrusted container out. You can view the untrusted container on github (although it is nothing exciting). First, the node.js application that will connect and display information:

https = require 'https'
fs = require 'fs'

options =
  host: '172.42.1.62'
  port: 4242
  method: 'GET'
  path: '/containers/json'
  key: fs.readFileSync('ssl/client.key')
  cert: fs.readFileSync('ssl/client.crt')
  headers: { 'Accept': 'application/json' } # not required, but being semantic here!

req = https.request options, (res) ->
  console.log res

req.end()

And the Dockerfile used to build the container. Notice we add the client.crt and client.key as part of building it!

FROM shykes/nodejs
MAINTAINER James R. Carr
ADD ssl/client* /srv/app/ssl
ADD package.json /srv/app/package.json
ADD app.coffee /srv/app/app.coffee
RUN cd /srv/app && npm install .
CMD cd /srv/app && npm start

That’s about it. Run docker build . and docker run -n <IMAGE ID> and we should see a JSON dump to the console of the actively running containers. Doing the same in the untrusted directory should present us with some 400 error about not providing a client SSL certificate. I’ve shared a project with all this code plus a vagrant file on github for your own perusal. Enjoy!
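Going one step further, the same client certificate also lets a trusted container create and start new containers through the proxied API. A rough sketch against the remote API of that era (the image name and JSON fields here are illustrative assumptions, not from the original post):

# create a container (returns {"Id": "..."} on success)
curl -s -k --key client.key --cert client.crt \
     -H "Content-Type: application/json" \
     -d '{"Image": "ubuntu", "Cmd": ["echo", "hello from the API"]}' \
     https://localhost:4242/containers/create

# then start it using the returned Id
curl -s -k --key client.key --cert client.crt \
     -X POST https://localhost:4242/containers/<returned Id>/start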
October 31, 2013
by James Carr
· 13,972 Views
article thumbnail
Adding Appsec to Agile: Security Stories, Evil User Stories and Abuse(r) Stories
Because Agile development teams work from a backlog of stories, one way to inject application security into software development is by writing up application security risks and activities as stories, making them explicit and adding them to the backlog so that application security work can be managed, estimated, prioritized and done like everything else that the team has to do. Security Stories SAFECode has tried to do this by writing a set of common, non-functional Security Stories following the well-known “As a [type of user] I want {something} so that {reason}” template. These stories are not customer- or user-focused: not the kind that a Product Owner would understand or care about. Instead, they are meant for the development team (architects, developers and testers). Example: As a(n) architect/developer, I want to ensure AND as QA, I want to verify that sensitive data is kept restricted to actors authorized to access it. There are stories to prevent/check for the common security vulnerabilities in applications: XSS, path traversal, remote execution, CSRF, OS command injection, SQL injection, password brute forcing. Checks for information exposure through error messages, proper use of encryption, authentication and session management, transport layer security, restricted uploads and URL redirection to un-trusted sites; and basic code quality issues: NULL pointer checking, boundary checking, numeric conversion, initialization, thread/process synchronization, exception handling, use of unsafe/restricted functions. SAFECode also includes a list of secure development practices (operational tasks) for the team that includes making sure that you’re using the latest compiler, patching the run-time and libraries, static analysis, vulnerability scanning, code reviews of high-risk code, tracking and fixing security bugs; and more advanced practices that require help from security experts like fuzzing, threat modeling, pen tests, environmental hardening. Altogether this is a good list of problems that need to be watched out for and things that should be done on most projects. But although SAFECode’s stories look like stories, they can’t be used as stories by the team. These Security Stories are non-functional requirements (NFRs) and technical constraints that (like requirements for scalability and maintainability and supportability) need to be considered in the design of the system, and may need to be included as part of the definition of done and conditions of acceptance for every user story that the team works on. Security Stories can’t be pulled from the backlog and delivered like other stories and removed from the backlog when they are done, because they are never “done”. The team has to keep worrying about them throughout the life of the project and of the system. As Rohit Sethi points out, asking developers to juggle long lists of technical constraints like this is not practical: If you start adding in other NFR constraints, such as accessibility, the list of constraints can quickly grow overwhelming to developers. Once the list grows unwieldy, our experience is that developers tend to ignore the list entirely. They instead rely on their own memories to apply NFR constraints. Since the number of NFRs continues to grow in increasingly specialized domains such as application security, the cognitive burden on developers’ memories is substantial. 
OWASP Evil User Stories – Hacking the Backlog Someone at OWASP has suggested an alternative, much smaller set of non-functional Evil User Stories that can be "hacked" into the backlog: A way for a security guy to get security on the agenda of the development team is by “hacking the backlog”. The way to do this is by crafting Evil User Stories, a few general negative cases that the team needs to consider when they implement other stories. Example #1. "As a hacker, I can send bad data in URLs, so I can access data and functions for which I'm not authorized." Example #2. "As a hacker, I can send bad data in the content of requests, so I can access data and functions for which I'm not authorized." Example #3. "As a hacker, I can send bad data in HTTP headers, so I can access data and functions for which I'm not authorized." Example #4. "As a hacker, I can read and even modify all data that is input and output by your application." Thinking like a Bad Guy – Abuse Cases and Abuser Stories Another way to beef up security in software development is to get the team to carefully look at the system they are building from the bad guy's perspective. In “Misuse and Abuse Cases: Getting Past the Positive”, Dr. Gary McGraw at Cigital talks about the importance of anticipating things going wrong, and thinking about behaviour that the system needs to prevent. Assume that the customer/user is not going to behave, or is actively out to attack the application. Question all of the assumptions in the design (the can’ts and won’ts), especially trust conditions – what if the bad guy can be anywhere along the path of an action (for example, using an attack proxy between the client and the server)? Abuse Cases are created by security experts working with the team as part of a critical review – either of the design or of an existing application. The goal of a review like this is to understand how the system behaves under attack/failure conditions, and document any weaknesses or gaps that need to be addressed. At Agile 2013 Judy Neher presented a hands-on workshop on how to write Abuser Stories, a lighter-weight, Agile practice which makes “thinking like a bad guy” part of the team’s job of defining and refining user requirements. Take a story, and as part of elaborating the story and listing the scenarios, step back and look at the story through a security lens. Don’t just think of what the user wants to do and can do - think about what they don’t want to do and can’t do. Get the same people who are working on the story to “put their black hats on” and think evil for a little while, brainstorm to come up with negative cases. As {some kind of bad guy} I want to {do some bad thing}… The {bad guy} doesn’t have to be a hacker. They could be an insider with a grudge or a selfish customer who is willing to take advantage of other users, or an admin user who needs to be protected from making expensive mistakes, or an external system that may not always function correctly. Ask questions like: How do I know who the user is and that I can trust them? Who is allowed to do what, and where are the authorization checks applied? Look for holes in multi-step workflows – what happens if somebody bypasses a check or tries to skip a step or do something out of sequence? What happens if an action or a check times-out or blocks or fails – what access should be allowed, what kind of information should be shown, what kind shouldn’t be? Are we interacting with children? Are we dealing with money? 
With dangerous command-and-control/admin functions? With confidential or private data? Look closer at the data. Where is it coming from? Can I trust it? Is the source authenticated? Where is it validated – do I have to check it myself? Where is it stored (does it have to be stored)? If it has to be stored, should it be encrypted or masked (including in log files)? Who should be able to see it? Who shouldn’t be able to see it? Who can change it, and do the changes need to be audited? Do we need to make sure the data hasn't been tampered with (checksum, HMAC, digital signature)? Use this exercise to come up with refutation criteria (user can do this, but can’t do that; they can see this but they can’t see that), instead of, or as part of, the conditions of acceptance for the story. Prioritize these cases based on risk, add the cases that you agree need to be taken care of as scenarios to the current story, or as new stories to the backlog if they are big enough. “Thinking like a bad guy” as you are working on a story seems more useful and practical than other story-based approaches. It doesn’t take a lot of time, and it’s not expensive. You don’t need to write Abuser Stories for every user story, and the more Abuser Stories that you do, the easier it will get – you'll get better at it, and you’ll keep running into the same kinds of problems that can be solved with the same patterns. You end up with something concrete and functional and actionable, work that has to be done and can be tested. Concrete, actionable cases like this are easier for the team to understand and appreciate – including the Product Owner, which is critical in Scrum, because the Product Owner decides what is important and what gets done. And because Abuser Stories are done in phase, by the people who are working on the stories already (rather than a separate activity that needs to be set up and scheduled), they are more likely to get done. Simple, quick, informal threat modeling like this isn’t enough to make a system secure – the team won’t be able to find and plug all of the security holes in the system this way, even if the developers are well-trained in secure software development and take their work seriously. Abuser Stories are good for identifying business logic vulnerabilities, reviewing security features (authentication, access control, auditing, password management, licensing), improving error handling and basic validation, and keeping onside of privacy regulations. Effective software security involves a lot more work than this: choosing a good framework and using it properly, watching out for changes to the system's attack surface, carefully reviewing high-risk code for design and coding errors, writing good defensive code as much as possible, using static analysis to catch common coding mistakes, and regular security testing (pen testing and dynamic analysis). But getting developers and testers to think like a bad guy as they build a system should go a long way to improving the security and robustness of your app.
October 31, 2013
by Jim Bird
· 21,207 Views · 1 Like
article thumbnail
How to Use MongoDB as a Pure In-memory DB (Redis Style)
The Idea There has been a growing interest in using MongoDB as an in-memory database, meaning that the data is not stored on disk at all. This can be super useful for applications like: a write-heavy cache in front of a slower RDBMS system; embedded systems; PCI-compliant systems where no data should be persisted; and unit testing, where the database should be light and easily cleaned. That would be really neat indeed if it were possible: one could leverage the advanced querying / indexing capabilities of MongoDB without hitting the disk. As you probably know, disk IO (especially random) is the system bottleneck in 99% of cases, and if you are writing data you cannot avoid hitting the disk. One sweet design choice of MongoDB is that it uses memory-mapped files to handle access to data files on disk. This means that MongoDB does not know the difference between RAM and disk; it just accesses bytes at offsets in giant arrays representing files and the OS takes care of the rest! It is this design decision that allows MongoDB to run in RAM with no modification. How it is done This is all achieved by using a special type of filesystem called tmpfs. Linux will make it appear as a regular FS but it is entirely located in RAM (unless it is larger than RAM, in which case it can swap, which can be useful!). I have 32GB RAM on this server, let’s create a 16GB tmpfs:

# mkdir /ramdata
# mount -t tmpfs -o size=16000M tmpfs /ramdata/
# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvde1 5905712 4973924 871792 86% /
none 15344936 0 15344936 0% /dev/shm
tmpfs 16384000 0 16384000 0% /ramdata

Now let’s start MongoDB with the appropriate settings. smallfiles and noprealloc should be used to reduce the amount of RAM wasted, and will not affect performance since it’s all RAM based. nojournal should be used since it does not make sense to have a journal in this context!

dbpath=/ramdata
nojournal = true
smallFiles = true
noprealloc = true

After starting MongoDB, you will find that it works just fine and the files are as expected in the FS:

# mongo
MongoDB shell version: 2.3.2
connecting to: test
> db.test.insert({a:1})
> db.test.find()
{ "_id" : ObjectId("51802115eafa5d80b5d2c145"), "a" : 1 }

# ls -l /ramdata/
total 65684
-rw-------. 1 root root 16777216 Apr 30 15:52 local.0
-rw-------. 1 root root 16777216 Apr 30 15:52 local.ns
-rwxr-xr-x. 1 root root 5 Apr 30 15:52 mongod.lock
-rw-------. 1 root root 16777216 Apr 30 15:52 test.0
-rw-------. 1 root root 16777216 Apr 30 15:52 test.ns
drwxr-xr-x. 2 root root 40 Apr 30 15:52 _tmp

Now let’s add some data and make sure it behaves properly. We will create a 1KB document and add 4 million of them:

> str = ""
> aaa = "aaaaaaaaaa"
aaaaaaaaaa
> for (var i = 0; i < 100; ++i) { str += aaa; }
> for (var i = 0; i < 4000000; ++i) { db.foo.insert({a: Math.random(), s: str});}
> db.foo.stats()
{ "ns" : "test.foo", "count" : 4000000, "size" : 4544000160, "avgObjSize" : 1136.00004, "storageSize" : 5030768544, "numExtents" : 26, "nindexes" : 1, "lastExtentSize" : 536600560, "paddingFactor" : 1, "systemFlags" : 1, "userFlags" : 0, "totalIndexSize" : 129794000, "indexSizes" : { "_id_" : 129794000 }, "ok" : 1 }

The document average size is 1136 bytes and it takes up about 5GB of storage. The index on _id takes about 130MB. Now we need to verify something very important: is the data duplicated in RAM, existing both within MongoDB and the filesystem? Remember that MongoDB does not buffer any data within its own process; instead data is cached in the FS cache.
Let’s drop the FS cache and see what is in RAM: # echo 3 > /proc/sys/vm/drop_caches # free total used free shared buffers cached Mem: 30689876 6292780 24397096 0 1044 5817368 -/+ buffers/cache: 474368 30215508 Swap: 0 0 0 As you can see there is 6.3GB of used RAM of which 5.8GB is in FS cache (buffers). Why is there still 5.8GB of FS cache even after all caches were dropped?? The reason is that Linux is smart and it does not duplicate the pages between tmpfs and its cache… Bingo! That means your data exists with a single copy in RAM. Let’s access all documents and verify RAM usage is unchanged: > db.foo.find().itcount() 4000000 # free total used free shared buffers cached Mem: 30689876 6327988 24361888 0 1324 5818012 -/+ buffers/cache: 508652 30181224 Swap: 0 0 0 # ls -l /ramdata/ total 5808780 -rw-------. 1 root root 16777216 Apr 30 15:52 local.0 -rw-------. 1 root root 16777216 Apr 30 15:52 local.ns -rwxr-xr-x. 1 root root 5 Apr 30 15:52 mongod.lock -rw-------. 1 root root 16777216 Apr 30 16:00 test.0 -rw-------. 1 root root 33554432 Apr 30 16:00 test.1 -rw-------. 1 root root 536608768 Apr 30 16:02 test.10 -rw-------. 1 root root 536608768 Apr 30 16:03 test.11 -rw-------. 1 root root 536608768 Apr 30 16:03 test.12 -rw-------. 1 root root 536608768 Apr 30 16:04 test.13 -rw-------. 1 root root 536608768 Apr 30 16:04 test.14 -rw-------. 1 root root 67108864 Apr 30 16:00 test.2 -rw-------. 1 root root 134217728 Apr 30 16:00 test.3 -rw-------. 1 root root 268435456 Apr 30 16:00 test.4 -rw-------. 1 root root 536608768 Apr 30 16:01 test.5 -rw-------. 1 root root 536608768 Apr 30 16:01 test.6 -rw-------. 1 root root 536608768 Apr 30 16:04 test.7 -rw-------. 1 root root 536608768 Apr 30 16:03 test.8 -rw-------. 1 root root 536608768 Apr 30 16:02 test.9 -rw-------. 1 root root 16777216 Apr 30 15:52 test.ns drwxr-xr-x. 2 root root 40 Apr 30 16:04 _tmp # df Filesystem 1K-blocks Used Available Use% Mounted on /dev/xvde1 5905712 4973960 871756 86% / none 15344936 0 15344936 0% /dev/shm tmpfs 16384000 5808780 10575220 36% /ramdata And that verifies it! :) What about replication? You probably want to use replication since a server loses its RAM data upon reboot! Using a standard replica set you will get automatic failover and more read capacity. If a server is rebooted MongoDB will automatically rebuild its data by pulling it from another server in the same replica set (resync). This should be fast enough even in cases with a lot of data and indices since all operations are RAM only :) It is important to remember that write operations get written to a special collection called oplog which resides in the local database and takes 5% of the volume by default. In my case the oplog would take 5% of 16GB which is 800MB. In doubt, it is safer to choose a fixed oplog size using the oplogSize option. If a secondary server is down for a longer time than the oplog contains, it will have to be resynced. To set it to 1GB, use: oplogSize = 1000 What about sharding? Now that you have all the querying capabilities of MongoDB, what if you want to implement a large service with it? Well you can use sharding freely to implement a large scalable in-memory store. Still the config servers (that contain the chunk distribution) should be disk based since their activity is small and rebuilding a cluster from scratch is not fun. What to watch for RAM is a scarce resource, and in this case you definitely want the entire data set to fit in RAM. Even though tmpfs can resort to swapping the performance would drop dramatically. 
To make best use of the RAM you should consider: the usePowerOf2Sizes option, to normalize the storage buckets; running a compact command or resyncing the node periodically; and using a schema design that is fairly normalized (avoid large document growth) — see the shell sketch below for the first two. Conclusion Sweet, you can now use MongoDB and all its features as an in-memory RAM-only store! Its performance should be pretty impressive: during the test with a single thread / core I was achieving 20k writes per second, and it should scale linearly over the number of cores.
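As referenced above, the first two maintenance tips look like this in the mongo shell — a sketch assuming MongoDB 2.2+ and the test collection used earlier in the article:

// keep record allocation sizes normalized so freed space is easier to reuse
db.runCommand({ collMod: "foo", usePowerOf2Sizes: true })

// periodically defragment the collection (note: compact blocks the database while it runs)
db.runCommand({ compact: "foo" })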
October 28, 2013
by Antoine Girbal
· 60,641 Views