Data Engineering Resources

The Latest Data Engineering Topics

Mapping Enums Done Right With @Convert in JPA 2.1

If you ever worked with Java enums in JPA you are definitely aware of their limitations and traps. Usingenum as a property of your @Entity is often very good choice, however JPA prior to 2.1 didn’t handle them very well. It gave you 2+1 choices: @Enumerated(EnumType.ORDINAL) (default) will map enum values using Enum.ordinal(). Basically first enumerated value will be mapped to 0 in database column, second to 1, etc. This is very compact and works great to the point when you want to modify your enum. Removing or adding value in the middle or rearranging them will totally break existing records. Ouch! To make matters worse, unit and integration tests often work on clean database, so they won’t catch discrepancy in old data. @Enumerated(EnumType.STRING) is much safer because it stores string representation of enum. You can now safely add new values and move them around. However renaming enum in Java code will still break existing records in DB. Even more important, such representation is very verbose, unnecessarily consuming database resources. You can also use raw representation (e.g. single char or int) and map it manually back and forth in @PostLoad/@PrePersist/@PreUpdate events. Most flexible and safe from database perspective, but quite ugly. Luckily Java Persistence API 2.1 (JSR-388) released few days ago provides standardized mechanism of pluggable data converters. Such API was present for ages in proprietary forms and it’s not really rocket science, but having it as part of JPA is a big improvement. To my knowledge Eclipselink is the only JPA 2.1 implementation available to date, so we will use it to experiment a bit. We will start from sample Spring application developed as part of “Poor man’s CRUD: jqGrid, REST, AJAX, and Spring MVC in one house” article. That version had no persistence, so we will add thin DAO layer on top of Spring Data JPA backed by Eclipselink. Only entity so far is Book: @Entity public class Book { @Id @GeneratedValue(strategy = IDENTITY) private Integer id; //... private Cover cover; //... } Where Cover is an enum: public enum Cover { PAPERBACK, HARDCOVER, DUST_JACKET } Neither ORDINAL nor STRING is a good choice here. The former because rearranging first three values in any way will break loading of existing records. The latter is too verbose. Here is where custom converters in JPA come into play: import javax.persistence.AttributeConverter; import javax.persistence.Converter; @Converter public class CoverConverter implements AttributeConverter { @Override public String convertToDatabaseColumn(Cover attribute) { switch (attribute) { case DUST_JACKET: return "D"; case HARDCOVER: return "H"; case PAPERBACK: return "P"; default: throw new IllegalArgumentException("Unknown" + attribute); } } @Override public Cover convertToEntityAttribute(String dbData) { switch (dbData) { case "D": return DUST_JACKET; case "H": return HARDCOVER; case "P": return PAPERBACK; default: throw new IllegalArgumentException("Unknown" + dbData); } } } OK, I won’t insult you, my dear reader, explaining this. Converting enum to whatever will be stored in relational database and vice-versa. Theoretically JPA provider should apply converters automatically if they are declared with: @Converter(autoApply = true) It didn’t work for me. Moreover declaring them explicitly instead of @Enumerated in@Entity class didn’t work as well: import javax.persistence.Convert; //... @Convert(converter = CoverConverter.class) private Cover cover; Resulting in exception: Exception Description: The converter class [com.blogspot.nurkiewicz.CoverConverter] specified on the mapping attribute [cover] from the class [com.blogspot.nurkiewicz.Book] was not found. Please ensure the converter class name is correct and exists with the persistence unit definition. Bug or feature, I had to mention converter in orm.xml: And it flies! I have a freedom of modifying my Cover enum (adding, rearranging, renaming) without affecting existing records. One tip I would like to share with you is related to maintainability. Every time you have a piece of code mapping from or to enum, make sure it’s tested properly. And I don’t mean testing every possible existing value manually. I am more after a test making sure that newenum values are reflected in mapping code. Hint: code below will fail (by throwingIllegalArgumentException) if you add new enum value but forget to add mapping code from it: for (Cover cover : Cover.values()) { new CoverConverter().convertToDatabaseColumn(cover); } Custom converters in JPA 2.1 are much more useful than what we saw. If you combine JPA with Scala, you can use @Converter to map database columns directly toscala.math.BigDecimal, scala.Option or small case class. In Java there will finally be a portable way of mapping Joda time. Last but not least, if you like (very) strongly typed domain, you may wish to have PhoneNumber class (with isInternational(),getCountryCode() and custom validation logic) instead of String or long. This small addition in JPA 2.1 will surely improve domain objects quality significantly. If you wish to play a bit with this feature, sample Spring web application is available on GitHub.

June 6, 2013

by Tomasz Nurkiewicz

· 69,401 Views · 6 Likes

Serialization and injection

Serialization is a form of persistence: serialized data survives the process and the RAM where it was created and can be reconstituted inside different processes and machines that live in a different time or place. Sometimes serialization is a poor form of persistence in fact, one that confuses the boundary between the different schemas the data can fit in. However, what I found useful in the last years of development is to institute a strict separation: serialize Value Objects, Entities, and everything that represents the state of the application. Meanwhile, use Dependency Injection over services that are part of a larger object graph and never serialize this second kind of objects. In the discussion that follows, I make the assumption that serialization and deserialization occur on the same machine (e.g. like for web-oriented sessions.) The problem with serialization, which work transparently most of the time, is the need to serialize service objects instead of limiting the procedure to data structures. How can you store such objects? Not options Some options to solve this problems are really not options. Serialization by itself will fail because of the staleness of the references contained in these objects. For example, in PHP trying to serialize a database connections composed by a Repository or DAO object will rightly fail with an exception. Whenever an object represents a resource of the current machine, it cannot usually be serialized except in the case when the only resource involved is RAM. If the resource is disk space or other running processes such as a database daemon, the reconstitution of the object in another place and time will fail and it's best to just stop the developer immediately during storage. Quasi-options Some solutions to the problem try to avoid the staleness problem by serializing objects without their resources, and make them regrab a new version of them on deserialization. In PHP for example, this can be done with the __sleep() and __wakeup() magic methods, called automatically during serialization and deserializaton respectively. This deserialization mechanism introduces a dependency from the serialized Entity to external services: such a dependency is already in place when building the object the first time (passing the XService in the constructor) but it is aggravated when deserializing (depending on a XServiceFactory instead of just an XService). An improvement, from the dependencies point of view, is to reattach collaborators to deserialized objects like you would for other persistence-related tasks. For example, EntityRepository can inject the missing pieces of Entity every time its find() method is called. However, there is still another option, which is the most resilient from the modelling point of view and not only that of dependency management: injecting non-serializable collaborators through the stack. Objects can collaborate even without keeping field references to each other, and injecting dependencies as parameters move the dependency starting point from the server to the client object (which may or may not be desirable). What is most important is that Entities are relieved of having to manage external references in any context, not only that of persistence and in particular serialization. The metaphor for the 3rd option Misko Hevery likes to say: have you ever seen a credit card able to charge itself? If a CreditCard is an Entity in your domain, it would be very strange to keeping a wire attached to your wallet wherever you go. With the first option, you have the card spring a wire when it is taken out of the wallet, like in horror movies. This intelligent cable tries as its best to attach to the nearest Point of Sale (a bad case of bluetooth I think). With Repositories in mind, you're not dealing with automated wires anymore, but you're still attaching cables between cards and fixed devices. In reality, cards collaborate with the PoS in a fast process that does not last more than a few seconds. Actually, sometimes they don't touch it at all, as in all Internet-based purchases. Keeping services around to deal with external dependencies does not mean the API of your Domain Model has to be biased towards service objects: pos.charge(creditCard); // can equivalently be: creditCard.chargeOn(pos); This is a form of Double Dispatch since there are two objects collaborating and you can dispatch (send messages) to both, being polimorphic by substituting both objects. The sequence of calls is: client -> creditCard -> pos The client object still looks at CreditCard as a behaviorally complete object, but it is clear which dependency is necessary to run each use case (CreditCard method). You can persist a CreditCard easily and send it over the wire to caches or databases. When it comes the time to charge, it is the client that has to bring forward a service able to connect to a bank.

June 5, 2013

by Giorgio Sironi

· 7,227 Views

Write CSV Data into Hive and Python

Apache Hive is a high level SQL-like interface to Hadoop. It lets you execute mostly unadulterated SQL, like this: CREATE TABLE test_table(key string, stats map); The map column type is the only thing that doesn’t look like vanilla SQL here. Hive can actually use different backends for a given table. Map is used to interface with column oriented backends like HBase. Essentially, because we won’t know ahead of time all the column names that could be in the HBase table, Hive will just return them all as a key/value dictionary. There are then helpers to access individual columns by key, or even pivot the map into one key per logical row. As part of the Hadoop family, Hive is focused on bulk loading and processing. So it’s not a surprise that Hive does not support inserting raw values like the following SQL: INSERT INTO suppliers (supplier_id, supplier_name) VALUES (24553, 'IBM'); However, for unit testing Hive scripts, it would be nice to be able to insert a few records manually. Then you could run your map reduce HQL, and validate the output. Luckily, Hive can load CSV files, so it’s relatively easy to insert a handful or records that way. CREATE TABLE foobar(key string, stats map) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':' ; LOAD DATA LOCAL INPATH '/tmp/foobar.csv' INTO TABLE foobar; This will load a CSV file with the following data, where c4ca4-0000001-79879483-000000000124 is the key, and comments and likesare columns in a map. c4ca4-0000001-79879483-000000000124,comments:0|likes:0 c4ca4-0000001-79879483-000000000124,comments:0|likes:0 Because I’ve been doing this quite a bit in my unit tests, I wrote a quick Python helper to dump a list of key/map tuples to a temporary CSV file, and then load it into Hive. This uses hiver to talk to Hive over thrift. import hiver from django.core.files.temp import NamedTemporaryFile def _hql(self, hql): client = hiver.connect(settings.HIVE_HOST, settings.HIVE_PORT) try: client.execute(hql) finally: client.shutdown() def insert(self, table_name, rows): ''' cannot insert single rows via hive, need to save to a temp file and bulk load that ''' csv_file = NamedTemporaryFile(delete=True) for row in rows: map_repr = '|'.join('%s:%s' % (key, value) for key, value in row[1].items()) csv_file.write(row[0] + "," + map_repr + "\n") csv_file.flush() try: _hql('DROP TABLE IF EXISTS %s' % table_name) _hql(""" CREATE TABLE %s ( key string, map ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' MAP KEYS TERMINATED BY ':' """ % (table_name)) _hql(""" LOAD DATA LOCAL INPATH '%s' INTO TABLE %s """ % (csv_file.name, table_name) finally: csv_file.close() You can call it like this: insert('test_table', [ ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}), ('c4ca4-0000001-79879483-000000000124', {'comments': 1, 'likes': 2}), ('c4ca4-0000001-79879496-000000000124', {'comments': 1, 'likes': 2}), ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}), ('b4aed-0000002-79879783-000000000768', {'comments': 1, 'likes': 2}), ])

June 5, 2013

by Chase Seibert

· 14,729 Views

Hadoop REST API - WebHDFS

Hadoop provides a Java native API to support file system operations..

June 3, 2013

by Istvan Szegedi

· 57,473 Views · 5 Likes

Build an Arduino Motor/Stepper/Servo Shield – Part 1: Servos

this post starts a small (or larger?) series of tutorials using the arduino motor/stepper/servo shield with the frdm-kl25z board. that motor shield is probably one of the most versatile on the market, and features 2 servo and 4 motor connectors for dc or stepper motors. that makes it a great shield for any robotic project arduino motor stepper servo shield with frdm-kl25z the series starts with a tutorial how to drive two servo motors. and if this is not what you are expecting to do with this shield, then you can vote and tell me what you want to see instead on this motor shield . oem or original? the original arduino motor/stepper/servo shield is available from adaftruit industries and costs less than $20. i’m using a oem version, see this link . the functionality is the same, except that the oem version only runs with motors up to 16 vdc, while the original shield is for motors up to 25 vdc. motor stepper servo shield details the board has two stmicroelectronics l293d motor h-bridge ic’s which can drive up to 4 dc motors (or up to 2 stepper motors) with 0.6 a per bridge (1.2 a peak). the 74hct595n (my board has the sn74hc595 from texas instrument) is a shift register used for the h-bridges to reduce the number of pins needed (more about this in a next post). a terminal block with jumper is providing power to the dc/stepper motor. the 5 vdc for the servos is taken from the frdm board. the frdm-kl25z can only give a few hundred ma on the 5v arduino header. that works for small servos, but i recommend to cut the 5v supply to the servos and use a dedicated 5v (or 6v) for the servos. outline in this tutorial, i’m creating a project with codewarrior for mcu10.4 for the frdm-kl25z board, and then add support for two servo motors. processor expert components this tutorial uses added processor expert components which are not part of codewarrior distribution. the following other components are used: wait : allows waiting for a given time servo : high level driver for hobby servp motors make sure you have the latest and greatest components loaded from github . instructions how to download and install the additional components can be found here . creating codewarrior project to create a new project in codewarrior: file > new > bareboard project, give a project name specify the device to be used: mkl25z128 opensda as connection i/o support can be set to ‘no i/o’ processor expert as rapid application development option this creates the starting point for my project: new servo project created servo motor servo motors are used in rc (radio control) or (hobby) robotics. typical servo motor (hitec hs-303) the motor has 3 connectors: gnd (black) power (red), typically 5v, but can be 6v or even higher pwm (white or yellow), signal for position information the pwm signal typically has frequency of 50 hz (20 ms), with a duty (high duration) between 1 ms and 2 ms. the screenshot below shows such a 50 hz signal with 1.5 ms duty cycle (servo middle position): servo signal many servos go below 1 ms and beyond 2 ms. e.g. many hitec servos have a range of 0.9…2.1 ms. check the data sheet of your servos for details. if you do not have a data sheet, then you might just experiment with different values. with a pwm duty of 1 ms to 2 ms within a 20 ms period, this means that only 10% of the whole pwm duty are used. this means if you have a pwm resolution of only 8bits, then only 10% of 256 steps could be used. as such, an 8bit pwm signal does not give me a fine tuned servo positioning. the duration of the duty cycle (1..2 ms) is translated into a motor position. typically the servo has a built-in closed-loop control with a microcontroller and a potentiometer. i have found that it is not important to have an *exact* 50 hz pwm frequency. you need to experiment with your servo if it works as well with a lower or higher frequency, or with non-fixed frequency (e.g. if you do a software pwm). many servos build an average of the duty cycle, so you might need to send several pulses until the servo reacts to a changed value. servo processor expert component i’m using here my own ‘servo’ component which offers following capabilities: pwm configuration (duty and period) min/max and initialization values methods to change the duty cycle optional command line shell support: you can type in commands and control the servo. this is useful for testing or calibration. optional ‘timed’ moving, so you can move the servo faster or slower to the new position in an interrupt driven way of course it is possible to use servos without any special components. from the components view, i add the servo component. to add it to my project, i can double-click on it or use the ‘+’ icon in that view: servo component in components library view in case the processor expert views are not shown, use the menu processor expert > show views this will add a new ‘servo’ component to the project: servo component added but it shows errors as first the pwm and pin settings need to be configured. pwm configuration on the arduino motor/stepper/servo shield the two servo motor headers are connected to pwm1b and pwm1a (see schematic ): servo header on board (source: dk electronics shield schematic) following the signals, this ends up at following pins on the kl25z: servo 1 => pwm1b => arduino header d10 => frdm-kl25z d10 => kl25z pin 73 => ptd0/spi0_pcs0/ tpm0_ch0 servo 2 => pwm1a => arduino header d9 => frdm-kl25z d9 => kl25z pin 78 => adc0_se6b/ptd5/spi1_sck/uart2_tx/ tpm0_ch5 from the pin names on the kinets (tpm0_ch0 and tpm0_ch5) i can see that this would be the same timer (tpm0), but with different channel numbers (ch0 and ch5). for my first servo processor expert has created for me a ‘timerunit_ldd’ which i will be able to share (later more on this). the timerunit_ldd implements the ‘ l ogical d evice d river’ for my pwm: timerunit_ldd so i select the pwm component inside the servo component and configure it for tpm0_c0v and the pin ptd0/spi0_pcs0/tpm0_ch0 with low initial polarity. the period of 20 ms (50 hz) and starting pulse with of 1.5 ms (mid-point) should already be pre-configured: servo1 pwm configuration i recommend to give it a pin signal name (i used ‘servo1′) that i need to set the ‘initial polarity’ to low is a bug of processor expert in my view: the device supports an initial ‘high’ polarity, but somehow this is not implemented? what it means is that the polarity of the pwm signal is now inverted: a ‘high’ duty cycle will mean that the signal is low. we need to ‘revert’ the logic later in the servo component. because of the inverted pwm logic, i need to set the ‘inverted pwm’ attribute in the servo component: inverted pwm the other settings of the servo component we can keep ‘as is’ for now. the ‘min pos pwm’ and ‘max pos pwm’ define the range of the pwm duty cycle which we will use later for the servo position. adding second servo as with the first servo, i add the second servo from the components library view. as i already have a timerunit_ldd present in my system, processor expert asks me if i want to re-use the existing one or to create a new component: shared component dialog as explained above: i can use the same timer (just a different pin/channel), so i have my existing component selected and press ok. as above, i configure the timer channel and pin with initial polarity: servo2 pwm configuration and i should not forget to enable the inverted logic: inverted pwm for servo2 test application time to try things out. for this i create a simple demo application which changes the position of both servos. first i add the wait component to the project from the components library: added wait component as i have all my processor expert components configured, i can generate the code: generating processor expert code next i add a new header application.h file to my project. for this i select the ‘sources’ folder of my project and use the new > header file context menu to add my new header file: new application.h in that header file application.h i add a prototype for my application ‘run’ routine: added app_run prototype from the main() in processorexpert.c , i call that function (not to forget to include the header file): calling app_run from main the same way i add a new source file application.c: new application.c to test my servos, i’m using the setpos() method which accepts a 8bit (0 to 255) value which is the position. to slow things a bit, i’m waiting a few milliseconds between the different positions: #include "application.h" #include "wait1.h" #include "servo1.h" #include "servo2.h" void app_run(void) { uint16_t pos; for(;;) { for(pos=0;pos<=255;pos++) { servo1_setpos(pos); servo2_setpos(pos); wait1_waitms(50); } } } save all files, and we should be ready to try it out on the board. build, download and run that’s it! time to build the project (menu project > build project ) and to download it with the debugger (menu run > debug ) and to start the application. if everything is going right, then the two servos will slowly turn in one direction until the end position, and then return back to the starting position. summary using hobby servo motors with the frdm-kl25z, codewarrior, processor expert and the additional components plus the arduino/stepper/servo shield is very easy in my view. i hope this post is useful to start your own experiments with hobby servo motors to bring any robotic project to the next level. i have here on github a project which features what is explained in this post, but with a lot more components, bells and whistles

June 2, 2013

by Erich Styger

· 17,836 Views · 7 Likes

Avro's Built-In Sorting

avro has a little-known gem of a feature which allows you to control which fields in an avro record are used for partitioning , sorting and grouping in mapreduce. the following figure gives a quick refresher as to what these terms mean. oh, and don’t take the placement of the “sorting” literally - sorting actually occurs on both the map and reduce side - but it’s always performed in the context of a specific partition (i.e. for a specific reducer). by default all the fields in an avro map output key are used for partitioning, sorting and grouping in mapreduce. let’s walk through an example and see how this works. you’ll begin with a simple schema github source : {"type": "record", "name": "com.alexholmes.avro.weathernoignore", "doc": "a weather reading.", "fields": [ {"name": "station", "type": "string"}, {"name": "time", "type": "long"}, {"name": "temp", "type": "int"}, {"name": "counter", "type": "int", "default": 0} ] } we’re going to see what happens when we run this code against a small sample data set, which we’ll generate using avro code github source : file input = tmpfolder.newfile("input.txt"); avrofiles.createfile(input, weathernoignore.schema$, arrays.aslist( weathernoignore.newbuilder().setstation("sfo").settime(1).settemp(3).build(), weathernoignore.newbuilder().setstation("iad").settime(1).settemp(1).build(), weathernoignore.newbuilder().setstation("sfo").settime(2).settemp(1).build(), weathernoignore.newbuilder().setstation("sfo").settime(1).settemp(2).build(), weathernoignore.newbuilder().setstation("sfo").settime(1).settemp(1).build() ).toarray()); to understand how avro is partitioning, sorting and grouping the data, we’ll write an identity mapper and reducer, with a small enhancement to the reducer to increment the counter field for each record we see in an individual reducer instance github source : package com.alexholmes.avro.sort.basic; import com.alexholmes.avro.weathernoignore; import org.apache.avro.mapred.avrokey; import org.apache.avro.mapred.avrovalue; import org.apache.avro.mapreduce.avrojob; import org.apache.avro.mapreduce.avrokeyinputformat; import org.apache.avro.mapreduce.avrokeyoutputformat; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.nullwritable; import org.apache.hadoop.mapreduce.job; import org.apache.hadoop.mapreduce.mapper; import org.apache.hadoop.mapreduce.reducer; import org.apache.hadoop.mapreduce.lib.input.fileinputformat; import org.apache.hadoop.mapreduce.lib.output.fileoutputformat; import java.io.ioexception; public class avrosort { private static class sortmapper extends mapper, nullwritable, avrokey, avrovalue> { @override protected void map(avrokey key, nullwritable value, context context) throws ioexception, interruptedexception { context.write(key, new avrovalue(key.datum())); } } private static class sortreducer extends reducer, avrovalue, avrokey, nullwritable> { @override protected void reduce(avrokey key, iterable> values, context context) throws ioexception, interruptedexception { int counter = 1; for (avrovalue weathernoignore : values) { weathernoignore.datum().setcounter(counter++); context.write(new avrokey(weathernoignore.datum()), nullwritable.get()); } } } public boolean runmapreduce(final job job, path inputpath, path outputpath) throws exception { fileinputformat.setinputpaths(job, inputpath); job.setinputformatclass(avrokeyinputformat.class); avrojob.setinputkeyschema(job, weathernoignore.schema$); job.setmapperclass(sortmapper.class); avrojob.setmapoutputkeyschema(job, weathernoignore.schema$); avrojob.setmapoutputvalueschema(job, weathernoignore.schema$); job.setreducerclass(sortreducer.class); avrojob.setoutputkeyschema(job, weathernoignore.schema$); job.setoutputformatclass(avrokeyoutputformat.class); fileoutputformat.setoutputpath(job, outputpath); return job.waitforcompletion(true); } } if you look at the output of the job below, you’ll see that the output is sorted across all the fields, and that the sorting is in field ordinal order. what this means is that when mapreduce is sorting these records, it compares the station field first, then the time field second, and so on according to the ordering of the fields in the avro schema. this is pretty much what you’d expect if you write your own complex writable type, and your comparator compared all the fields in order. {"station": "iad", "time": 1, "temp": 1, "counter": 1} {"station": "sfo", "time": 1, "temp": 1, "counter": 1} {"station": "sfo", "time": 1, "temp": 2, "counter": 1} {"station": "sfo", "time": 1, "temp": 3, "counter": 1} {"station": "sfo", "time": 2, "temp": 1, "counter": 1} oh, and before we move on notice that the value for the counter field is always 1 , meaning that each reducer was only fed a single key/vaue pair, which makes sense since our identity mapper only emitted a single value for each key, the keys are unique, and the mapreduce partitioner, sorter and grouper were using all the fields in the record. excluding fields for sorting avro gives us the ability to indicate that specific fields should be ignored when performing ordering functions. in mapreduce these fields are ignored for sorting/partitioning and grouping in mapreduce, which basically means that we have the ability to perform secondary sorting. let’s examine the following schema github source : {"type": "record", "name": "com.alexholmes.avro.weather", "doc": "a weather reading.", "fields": [ {"name": "station", "type": "string"}, {"name": "time", "type": "long"}, {"name": "temp", "type": "int", "order": "ignore"}, {"name": "counter", "type": "int", "order": "ignore", "default": 0} ] } it’s pretty much identical to the first schema, the only difference being that the last two fields are flagged as being “ignored” for sorting/partitioning/grouping. let’s run the same (other than modified to work with the different schema) mapreduce code github source as above against this new schema and examine the outputs. {"station": "iad", "time": 1, "temp": 1, "counter": 1} {"station": "sfo", "time": 1, "temp": 3, "counter": 1} {"station": "sfo", "time": 1, "temp": 2, "counter": 2} {"station": "sfo", "time": 1, "temp": 1, "counter": 3} {"station": "sfo", "time": 2, "temp": 1, "counter": 1} there are a couple of notable differences between this output, and the output from the previous schema which didn’t have any ignored fields. first, it’s clear that the temp field isn’t being used in the sorting, which makes sense since we specified that it should be ignored in the schema. however, more interestingly, note the value of the counter field. all records that had identical station and time values went to the same reducer invocation, evidenced by the increasing value of counter . this is essentially secondary sort! now, all of this greatness isn’t without some limitations: you can’t support two mapreduce jobs that use the same avro key, but have different sorting/partitioning/grouping requirements. although it’s conceivable that you could create a new instance of the avro schema and set the ignored flags for these fields yourself. the partitioner, sorter and grouping functions in mapreduce all work off of the same fields (i.e. they all ignore fields that set as ignored in the schema). this means that your options for secondary sorting are limited. for example, you wouldn’t be able to partition all stations to the same reducer, and then group by station and time. ordering uses a field’s ordinal position to determine its order within the overall set of fields to be ordered. in other words, in a two-field record, the first field is always compared before the second. there’s no way to change this behavior other than flipping the order of the fields in the record. having said all of that - the “ignoring fields” feature for sorting is pretty awesome, and something that will no doubt come in handy in my future mapreduce work.

May 29, 2013

by Alex Holmes

· 8,115 Views

Accessing An Artifact’s Maven And SCM Versions At Runtime

You can easily tell Maven to include the version of the artifact and its Git/SVN/… revision in the JAR manifest file and then access that information at runtime via getClass().getPackage.getImplementationVersion(). (All credit goes to Markus Krüger and other colleagues.) Include Maven artifact version in the manifest (Note: You will actually not want to use it, if you also want to include a SCM revision; see below.) pom.xml: ... org.apache.maven.plugins maven-jar-plugin ... true true ... ... The resulting MANIFEST.MF of the JAR file will then include the following entries, with values from the indicated properties: Built-By: ${user.name} Build-Jdk: ${java.version} Specification-Title: ${project.name} Specification-Version: ${project.version} Specification-Vendor: ${project.organization.name Implementation-Title: ${project.name} Implementation-Version: ${project.version} Implementation-Vendor-Id: ${project.groupId} Implementation-Vendor: ${project.organization.name} (Specification-Vendor and Implementation-Vendor come from the POM’s organization/name.) Include SCM revision For this you can either use the Build Number Maven plugin that produces the property ${buildNumber}, or retrieve it from environment variables passed by Jenkinsor Hudson (SVN_REVISION for Subversion, GIT_COMMIT for Git). For git alone, you could also use the maven-git-commit-id-plugin that can either replace strings such as ${git.commit.id} in existing resource files (using maven’s resource filtering, which you must enable) with the actual values or output all of them into a git.properties file. Let’s use the buildnumber-maven-plugin and create the manifest entries explicitely, containing the build number (i.e. revision) org.codehaus.mojo buildnumber-maven-plugin 1.2 validate create false false org.apache.maven.plugins maven-jar-plugin 2.4 ${project.name} ${project.version} ${buildNumber} ... Accessing the version & revision As mentioned above, you can access the manifest entries from your code via getClass().getPackage.getImplementationVersion() andgetClass().getPackage.getImplementationTitle(). References SO: How to get Maven Artifact version at runtime? Maven Archiver documentation

May 28, 2013

by Jakub Holý

· 12,760 Views

Amazon S3 Parallel MultiPart File Upload

In this blog post, I will present a simple tutorial on uploading a large file to Amazon S3 as fast as the network supports. Amazon S3 is clustered storage service of Amazon. It is designed to make web-scale computing easier. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, secure, fast, inexpensive infrastructure that Amazon uses to run its own global network of web sites. The service aims to maximize benefits of scale and to pass those benefits on to developers. For using Amazon services, you'll need your AWS access key identifiers, which AWS assigned you when you created your AWS account. The following are the AWS access key identifiers: Access Key ID (a 20-character, alphanumeric sequence) For example: 022QF06E7MXBSH9DHM02 Secret Access Key (a 40-character sequence) For example: kWcrlUX5JEDGM/LtmEENI/aVmYvHNif5zB+d9+ct Caution Your Secret Access Key is a secret, which only you and AWS should know. It is important to keep it confidential to protect your account. Store it securely in a safe place. Never include it in your requests to AWS, and never e-mail it to anyone. Do not share it outside your organization, even if an inquiry appears to come from AWS or Amazon.com. No one who legitimately represents Amazon will ever ask you for your Secret Access Key. The Access Key ID is associated with your AWS account. You include it in AWS service requests to identify yourself as the sender of the request. The Access Key ID is not a secret, and anyone could use your Access Key ID in requests to AWS. To provide proof that you truly are the sender of the request, you also include a digital signature calculated using your Secret Access Key. The sample code handles this for you. Your Access Key ID and Secret Access Key are displayed to you when you create your AWS account. They are not e-mailed to you. If you need to see them again, you can view them at any time from your AWS account. To get your AWS access key identifiers Go to the Amazon Web Services web site at http://aws.amazon.com. Point to Your Account and click Security Credentials. Log in to your AWS account. The Security Credentials page is displayed. Your Access Key ID is displayed in the Access Identifiers section of the page. To display your Secret Access Key, click Show in the Secret Access Key column. You can use your Amazon keys from a properties file in your application. Here is a sample for properties file containing Amazon keys: # Fill in your AWS Access Key ID and Secret Access Key # http://aws.amazon.com/security-credentials accessKey = secretKey = Here is sample AmazonUtil class for getting AWS Credentials from properties file. public class AmazonUtil { private static final Logger logger = LogUtil.getLogger(); private static final String AWS_CREDENTIALS_CONFIG_FILE_PATH = ConfigUtil.CONFIG_DIRECTORY_PATH + File.separator + "aws-credentials.properties"; private static AWSCredentials awsCredentials; static { init(); } private AmazonUtil() { } private static void init() { try { awsCredentials = new PropertiesCredentials(IOUtil.getResourceAsStream(AWS_CREDENTIALS_CONFIG_FILE_PATH)); } catch (IOException e) { logger.error("Unable to initialize AWS Credentials from " + AWS_CREDENTIALS_CONFIG_FILE_PATH); } } public static AWSCredentials getAwsCredentials() { return awsCredentials; } } Amazon S3 has Multipart Upload service which allows faster, more flexible uploads into Amazon S3. Multipart Upload allows you to upload a single object as a set of parts. After all parts of your object are uploaded, Amazon S3 then presents the data as a single object. With this feature you can create parallel uploads, pause and resume an object upload, and begin uploads before you know the total object size. For more information on Multipart Upload, review the Amazon S3 Developer Guide In this tutorial, my sample application uploads each file parts to Amazon S3 with different threads for using network throughput as possible as much. Each file part is associated with a thread and each thread uploads its associated part with Amazon S3 API. Figure 1. Amazon S3 Parallel Multi-Part File Upload Mechanism Amazon S3 API suppots MultiPart File Upload in this way: 1. Send a MultipartUploadRequest to Amazon. 2. Get a response containing a unique id for this upload operation. 3. For i in ${partCount} 3.1. Calculate size and offset of split-i in whole file. 3.2. Build a UploadPartRequest with file offset, size of current split and unique upload id. 3.3. Give this request to a thread and starts upload by running thread. 3.3.1. Send associated UploadPartRequest to Amazon. 3.3.2. Get response after successful upload and save ETag property of response. 4. Wait all threads to terminate 5. Get ETags (ETag is an identifier for successfully completed uploads) of all terminated threads. 6. Send a CompleteMultipartUploadRequest to Amazon with unique upload id and all ETags. So Amazon joins all file parts as target objects. Here is implementation: public class AmazonS3Util { private static final Logger logger = LogUtil.getLogger(); public static final long DEFAULT_FILE_PART_SIZE = 5 * 1024 * 1024; // 5MB public static long FILE_PART_SIZE = DEFAULT_FILE_PART_SIZE; private static AmazonS3 s3Client; private static TransferManager transferManager; static { init(); } private AmazonS3Util() { } private static void init() { // ... s3Client = new AmazonS3Client(AmazonUtil.getAwsCredentials()); transferManager = new TransferManager(AmazonUtil.getAwsCredentials()); } // ... public static void putObjectAsMultiPart(String bucketName, File file) { putObjectAsMultiPart(bucketName, file, FILE_PART_SIZE); } public static void putObjectAsMultiPart(String bucketName, File file, long partSize) { List partETags = new ArrayList(); List uploaders = new ArrayList(); // Step 1: Initialize. InitiateMultipartUploadRequest initRequest = new InitiateMultipartUploadRequest(bucketName, file.getName()); InitiateMultipartUploadResult initResponse = s3Client.initiateMultipartUpload(initRequest); long contentLength = file.length(); try { // Step 2: Upload parts. long filePosition = 0; for (int i = 1; filePosition < contentLength; i++) { // Last part can be less than part size. Adjust part size. partSize = Math.min(partSize, (contentLength - filePosition)); // Create request to upload a part. UploadPartRequest uploadRequest = new UploadPartRequest(). withBucketName(bucketName).withKey(file.getName()). withUploadId(initResponse.getUploadId()).withPartNumber(i). withFileOffset(filePosition). withFile(file). withPartSize(partSize); uploadRequest.setProgressListener(new UploadProgressListener(file, i, partSize)); // Upload part and add response to our list. MultiPartFileUploader uploader = new MultiPartFileUploader(uploadRequest); uploaders.add(uploader); uploader.upload(); filePosition += partSize; } for (MultiPartFileUploader uploader : uploaders) { uploader.join(); partETags.add(uploader.getPartETag()); } // Step 3: complete. CompleteMultipartUploadRequest compRequest = new CompleteMultipartUploadRequest(bucketName, file.getName(), initResponse.getUploadId(), partETags); s3Client.completeMultipartUpload(compRequest); } catch (Throwable t) { logger.error("Unable to put object as multipart to Amazon S3 for file " + file.getName(), t); s3Client.abortMultipartUpload( new AbortMultipartUploadRequest( bucketName, file.getName(), initResponse.getUploadId())); } } // ... private static class UploadProgressListener implements ProgressListener { File file; int partNo; long partLength; UploadProgressListener(File file) { this.file = file; } @SuppressWarnings("unused") UploadProgressListener(File file, int partNo) { this(file, partNo, 0); } UploadProgressListener(File file, int partNo, long partLength) { this.file = file; this.partNo = partNo; this.partLength = partLength; } @Override public void progressChanged(ProgressEvent progressEvent) { switch (progressEvent.getEventCode()) { case ProgressEvent.STARTED_EVENT_CODE: logger.info("Upload started for file " + "\"" + file.getName() + "\""); break; case ProgressEvent.COMPLETED_EVENT_CODE: logger.info("Upload completed for file " + "\"" + file.getName() + "\"" + ", " + file.length() + " bytes data has been transferred"); break; case ProgressEvent.FAILED_EVENT_CODE: logger.info("Upload failed for file " + "\"" + file.getName() + "\"" + ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred"); break; case ProgressEvent.CANCELED_EVENT_CODE: logger.info("Upload cancelled for file " + "\"" + file.getName() + "\"" + ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred"); break; case ProgressEvent.PART_STARTED_EVENT_CODE: logger.info("Upload started at " + partNo + ". part for file " + "\"" + file.getName() + "\""); break; case ProgressEvent.PART_COMPLETED_EVENT_CODE: logger.info("Upload completed at " + partNo + ". part for file " + "\"" + file.getName() + "\"" + ", " + (partLength > 0 ? partLength : progressEvent.getBytesTransfered()) + " bytes data has been transferred"); break; case ProgressEvent.PART_FAILED_EVENT_CODE: logger.info("Upload failed at " + partNo + ". part for file " + "\"" + file.getName() + "\"" + ", " + progressEvent.getBytesTransfered() + " bytes data has been transferred"); break; } } } private static class MultiPartFileUploader extends Thread { private UploadPartRequest uploadRequest; private PartETag partETag; MultiPartFileUploader(UploadPartRequest uploadRequest) { this.s3Client = s3Client; this.uploadRequest = uploadRequest; } @Override public void run() { partETag = s3Client.uploadPart(uploadRequest).getPartETag(); } private PartETag getPartETag() { return partETag; } private void upload() { start(); } } }

May 28, 2013

by Serkan Özal

· 57,393 Views · 3 Likes

Web API in ASP.NET Web Forms Application

With the release of ASP.NET MVC 4 one of the exciting features packed in the release was ASP.NET Web API.

May 24, 2013

by Lohith Nagaraj

· 51,577 Views

Capturing camera/picture data without PhoneGap

As people know, I'm a huge fan of PhoneGap and what it allows me to do with JavaScript, HTML, and CSS. But I think it is crucial to remember that you don't always need PhoneGap. A great example of that is camera access. Did you know that recent mobile browsers support accessing the camera directly from HTML and JavaScript? Let's look at an example. Over a year ago I wrote a blog post where I created an application called "Color Thief." This application made use of PhoneGap's Camera API and a third party JavaScript library called Color Thief. I loved this example because it demonstrated how you could combine the extra power that PhoneGap provides along with existing JavaScript libraries. This morning I watched an excellent Google IO presentation (https://www.youtube.com/watch?v=EPYnGFEcis4&feature=youtube_gdata_player) on Mobile HTML. It was an overview of some of the exciting stuff you can now do with mobile HTML and JavaScript. To be clear, this was all without using wrappers like PhoneGap. In one of the examples the presenters discussed the new "capture" support for the input/file field type. This is rather simple to implement: If supported (recent Android and latest iOS), the user can then use their camera to select a picture. I decided to rebuild my old demo to skip PhoneGap completely and just make use of this feature. Here's the code: For the most part, this is pretty similar to the last version. I no longer wait for the deviceready event, but instead just listen for the document itself to load. Instead of listening for a button click, I've switched to a input field using type=file. I now listen for the change event, and on that, I see if I have access to a file. If I do, I can then use the URL object to create a pointer to the source and then simply add it to my DOM. After that, Color Thief takes over. The only tricky part I ran into was that in iOS the URL object is still prefixed. You can see how I get around that in the startup code. To be fair, this isn't 100% backwards compatible, I could add a few checks in here to ensure that things will work and gracefully let people on older phones know they can't use this feature. But the end result is nearly the exact same functionality in a web page - no PhoneGap, no native code. <br>

May 21, 2013

by Raymond Camden

· 17,587 Views

Azure Blob Storage - "The specified blob or block content is invalid"

If you’re uploading blobs by splitting blobs into blocks and you get the error – The specified blob or block content is invalid, then this post is for you. Short Version If you’re uploading blobs by splitting blobs into blocks and you get the above mentioned error, ensure that your block ids of your blocks are of same length. If the block ids of your blocks are of different length, you’ll get this error. Long Version Now for the longer version of this post . A few days back I was working with storage client library especially around uploading blobs in chunks and with one particular blob I was constantly getting the error – The specified blob or block content is invalid. I tried numerous combinations even resorting to REST API directly but to no avail. It only happened with just one blob. Furthermore if I uploaded the same blob without splitting it into blocks, all was well. I was at my wits’ end. Tried searching the Internet for this error but could not find a conclusive answer to my problem. After much trial and error, I was able to simulate the same problem on other blobs as well. Here’s how you can recreate it: Start uploading the blob by splitting it into blocks. For block id, let’s do a 7 character long string e.g. intValue.ToString(“d7”). This will ensure that my block ids would be “0000001”, “0000002”, …, ”0000010” ….. After one or two blocks are uploaded, cancel the operation. Now re-upload the blob by splitting it into blocks. However this time for block id, let’s do a 6 character long string e.g. intValue.ToString(“d6”). You’ll get the error as soon as you try to upload the 1st block. Possible Solutions Now that we know the root cause of this problem, let’s look at some of the possible solutions to solve this problem. Wait out One possible solution is to wait out. I know its lame but still a possible solution. We know that Windows Azure Blob Storage Service keeps all uncommitted blocks for a duration of 7 days and if within 7 days those uncommitted blocks are not committed, the storage service purges them. I wish storage service provided some mechanism to purge uncommitted blocks programmatically. Commit uncommitted blocks You could possibly commit the blocks which are in uncommitted state so that at least you get a blob (which would not be the blob we wanted to upload in the first place). You can then delete that blob and re-upload the blob by specifying block ids which are of same length. To fetch the list of uncommitted blocks, if you’re using REST API directly you can perform “Get Block List” operation and pass “blocklisttype=uncommitted” as one of the query string parameters. If you’re using storage client library (assuming you’re using the version 2.x of .Net storage client library), you can do something like the code below: private static List GetUncommittedBlockIds(CloudBlockBlob blob) { var sasUri = blob.GetSharedAccessSignature(new SharedAccessBlobPolicy() { SharedAccessExpiryTime = DateTime.UtcNow.AddMinutes(5), Permissions = SharedAccessBlobPermissions.Read, }); var blobUri = new Uri(string.Format("{0}{1}", blob.Uri, sasUri)); List uncommittedBlockIds = new List(); var request = BlobHttpWebRequestFactory.GetBlockList(blobUri, null, null, BlockListingFilter.Uncommitted, null, null); //request.Headers.Add("Authorization", using (var resp = (HttpWebResponse)request.GetResponse()) { using (var stream = resp.GetResponseStream()) { var getBlockListResponse = new GetBlockListResponse(stream); var blocks = getBlockListResponse.Blocks; foreach (var block in blocks.Where(b => !b.Committed)) { uncommittedBlockIds.Add(Encoding.UTF8.GetString(Convert.FromBase64String(block.Name))); } } } return uncommittedBlockIds; } A few things to keep in mind here: Microsoft.WindowsAzure.Storage.Blob namespace does not have the capability to get the list of uncommitted blocks. You would need to make use ofMicrosoft.WindowsAzure.Storage.Blob.Protocol namespace. Because we’re kind of invoking the REST API by executing an HttpWebRequest, I created a shared access signature on the blob so that I don’t have to create “Authorization” header. Fetch uncommitted blocks to see block id length You could fetch the list of uncommitted blocks just to find out the length of the block id used. You could then use that block id length for your new upload session and do the upload. Please see the code snippet above to find this information. Upload another blob with same name without splitting it into blocks You could also upload another blob with the same name without splitting it into blocks. It could very well be a zero byte blob. That way your uncommitted block list will be wiped clean. Then you could delete that dummy blob and re-upload the actual blob. A Few Words About Blocks Since we’re talking about blocks, I thought it might be useful to mention a few points about them: Blocks and block related operations are only applicable for “Block Blobs”. Duh!! You’ll get an error if you’re trying to do these operations on a “Page Blob”. For uploading large blobs, it is recommended that you split your blob into blocks. In fact if your blob size is more than 64 MB, then you have to split it into blocks. Minimum size of a block is 1 Byte and the maximum size of a block is 4 MB. It is recommended that you choose a block size based on your internet connectivity and number of parallel threads you want use to upload these blocks. A blob can be split into a maximum of 50000 blocks. It’s important to remember this limitation because you are reminded of this limit when you’re trying to upload 50001st block. The length of all the block ids must be same. So if you’re using an integer value to denote block id, you make sure that you pad that integer value with “0” so that you get same length. So you could do something likeint.ToString(“d6”). When passing the block id as a parameter, it must be Base64 encoded. While the order in which the blocks are uploaded is not important, the order is important when you commit the block list because that’s when the blob is constructed by the service. For example, let’s say you’re uploading a blob by splitting it into 5 blocks (with ids “000001”, “000002”, “000003”, “000004”, and “000005”). You could upload these blocks in any order – 000004, 000001, 000003, 000005, 000002 however when you commit the block list, ensure that the block ids are passed in proper order i.e. 000001, 000002, 000003, 000004, 000005. Summary That’s it for this post. I hope you’ve found this information useful. I spent considerable amount of time trying to fix this problem so I hope it will help some folks out. As always, if you find any issues with the post please let me know and I’ll fix it ASAP.

May 20, 2013

by Gaurav Mantri

· 10,898 Views

Bucketing, Multiplexing and Combining in Hadoop - Part 1

this is the first blog post in a series which looks at some data organization patterns in mapreduce. we’ll look at how to bucket output across multiple files in a single task, how to multiplex data across multiple files, and also how to coalesce data. these are all common patterns that are useful to have in your mapreduce toolkit. we’ll kick things off with a look at bucketing data outputs in your map or reduce tasks. by default when using a fileoutputformat-derived outputformat (such as textoutputformat), all the outputs for a reduce task (or a map task in a map-only job) are written to a single file in hdfs. imagine a situation where you have user activity logs being streamed into hdfs, and you want to write a mapreduce job to better organize the incoming data. as an example a large organization with multiple products may want to bucket the logs based on the product. to do this you’ll need the ability to write to multiple output files in a single task. let’s take a look at how we can make that happen. multipleoutputformat there are a few ways you can achieve your goal, and the first option we’ll look at is the multipleoutputformat class in hadoop. this is an abstract class that lets you do the following: define the output path for each and every key/value output record being emitted by a task. incorporate the input paths into the output directory for map-only jobs. redefine the key and value that are used to write to the underlying recordwriter . this is useful in situations where you want to remove data from the outputs as it duplicates data in the filename. for each output path, define the recordwriter that should be used to write the outputs. ok enough with the words - let’s look at some data and code. first up is the simple data we’ll use in our example - imagine you work at a fruit market with locations in multiple cities, and you have a purchase transaction stream which contains the store location along with the fruit that was purchased. cupertino apple sunnyvale banana cupertino pear to help bucket your data for future analysis, you want to bin each record into city-specific files. for the simple data set above you don’t want to filter, project or transform your data, just bucket it out, so a simple identity map-only job will do the job. to force more than one mapper, we’ll write the data to two separate files. $ tab="$(printf '\t')" $ hdfs -put - file1.txt << eof cupertino${tab}apple sunnyvale${tab}banana eof $ hdfs -put - file2.txt << eof cupertino${tab}pear eof here’s the code which will let you write city-specific output files. import org.apache.commons.lang.stringutils; import org.apache.hadoop.conf.configuration; import org.apache.hadoop.conf.configured; import org.apache.hadoop.fs.filesystem; import org.apache.hadoop.fs.path; import org.apache.hadoop.io.text; import org.apache.hadoop.mapred.*; import org.apache.hadoop.mapred.lib.identitymapper; import org.apache.hadoop.mapred.lib.multipletextoutputformat; import org.apache.hadoop.util.progressable; import org.apache.hadoop.util.tool; import org.apache.hadoop.util.toolrunner; import java.io.ioexception; import java.util.arrays; /** * an example of how to use {@link org.apache.hadoop.mapred.lib.multipleoutputformat}. */ public class mofexample extends configured implements tool { /** * create output files based on the output record's key name. */ static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } } /** * the main job driver. */ public int run(final string[] args) throws exception { string csvinputs = stringutils.join(arrays.copyofrange(args, 0, args.length - 1), ","); path outputdir = new path(args[args.length - 1]); jobconf jobconf = new jobconf(super.getconf()); jobconf.setjarbyclass(mofexample.class); jobconf.setnumreducetasks(0); jobconf.setmapperclass(identitymapper.class); jobconf.setinputformat(keyvaluetextinputformat.class); jobconf.setoutputformat(keybasedmultipletextoutputformat.class); fileinputformat.setinputpaths(jobconf, csvinputs); fileoutputformat.setoutputpath(jobconf, outputdir); return jobclient.runjob(jobconf).issuccessful() ? 0 : 1; } /** * main entry point for the utility. * * @param args arguments * @throws exception when something goes wrong */ public static void main(final string[] args) throws exception { int res = toolrunner.run(new configuration(), new mofexample(), args); system.exit(res); } } run this code and you’ll see the following files in hdfs, where /output is the job output directory: $ hadoop fs -lsr /output /output/cupertino/part-00000 /output/cupertino/part-00001 /output/sunnyvale/part-00000 if you look at the output files you’ll see that the files contain the correct buckets. $ hadoop fs -lsr /output/cupertino/* cupertino apple cupertino pear $ hadoop fs -lsr /output/sunnyvale/* sunnyvale banana awesome, you have your data bucketed by store. now that we have everything working, let’s look at what we did to get there. we had to do two things to get this working: extend multipletextoutputformat this is where the magic happened - let’s look at that class again. static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } } you are working with text, which is why you extended multipletextoutputformat , a class that in turn extends multipleoutputformat . multipletextoutputformat is a simple class which instructs the multipleoutputformat to use textoutputformat as the underlying output format for writing out the records. if you were to use multipleoutputformat as-is it behaves as if you were using the regular textoutputformat , which is to say that it’ll only write to a single output file. to write data to multiple files you had to extend it, as with the example above. the generatefilenameforkeyvalue method allows you to return the output path for an input record. the third argument, name , is the original fileoutputformat -created filename, which is in the form “part-nnnnn”, where “nnnnn” is the task index, to ensure uniqueness. to avoid file collisions, it’s a good idea to make sure your generated output paths are unique, and leveraging the original output file is certainly a good way of doing this. in our example we’re using the key as the directory name, and then writing to the original fileoutputformat filename within that directory. specify the outputformat the next step was easy - specify that this output format should be used for your job: jobconf.setoutputformat(keybasedmultipletextoutputformat.class); earlier we also mentioned that you can use the input path as part of the output path, which we will look at next. using the input filename as part of the output filename in map-only jobs what if we wanted to keep the input filename as part of the output filename? this only works for map-only jobs, and can be accomplished by overriding the getinputfilebasedoutputfilename method. let’s look at the following code to understand how this method fits into the overall sequence of actions that the multipleoutputformat class performs: public void write(k key, v value) throws ioexception { // get the file name based on the key string keybasedpath = generatefilenameforkeyvalue(key, value, myname); // get the file name based on the input file name string finalpath = getinputfilebasedoutputfilename(myjob, keybasedpath); // get the actual key k actualkey = generateactualkey(key, value); v actualvalue = generateactualvalue(key, value); recordwriter rw = this.recordwriters.get(finalpath); if (rw == null) { // if we don't have the record writer yet for the final path, create // one // and add it to the cache rw = getbaserecordwriter(myfs, myjob, finalpath, myprogressable); this.recordwriters.put(finalpath, rw); } rw.write(actualkey, actualvalue); }; the getinputfilebasedoutputfilename method is called with the output of generatefilenameforkeyvalue , which contains our already-customized output file. our new keybasedmultipletextoutputformat can now be updated to override getinputfilebasedoutputfilename and append the original input filename to the output filename: static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(object key, object value, string name) { return key.tostring() + "/" + name; } @override protected string getinputfilebasedoutputfilename(jobconf job, string name) { string infilename = new path(job.get("map.input.file")).getname(); return name + "-" + infilename; } if you run with your modified outputformat class you’ll see the following files in hdfs, confirming that the input filenames are now concatenated to the end of each output file. $ hadoop fs -lsr /output /output/cupertino/part-00000-file1.txt /output/cupertino/part-00001-file2.txt /output/sunnyvale/part-00000-file1.txt the implementation of getinputfilebasedoutputfilename in multipleoutputformat doesn’t do anything interesting by default, but if you set the value of the mapred.outputformat.numoftrailinglegs configurable to an integer greater than 0, then the getinputfilebasedoutputfilename will use part of the input path as the output path. let’s see what happens when we set the value to 1: jobconf.setint("mapred.outputformat.numoftrailinglegs", 1); the output files in hdfs now exactly mirror the input files used for the job: $ hadoop fs -lsr /output /output/file1.txt /output/file2.txt if we set mapred.outputformat.numoftrailinglegs to 2, and our input files exist in the /inputs directory, then our output directory looks like this: $ hadoop fs -lsr /output /output/input/file1.txt /output/input/file2.txt basically as you keep incrementing mapred.outputformat.numoftrailinglegs , then multipleoutputformat will continue to go up the parent directories of the input file and use them in the output path. modifying the output key and value it’s very possible that the actual key and value you want to emit are different from those that were used to determine the output file. in our example, we took the output key and wrote to a directory using the key name. if you do that keeping the key in the output file may be redundant. how would we modify the output record so that the key isn’t written? multipleoutputformat has your back with the generateactualkey method. class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } @override protected text generateactualkey(text key, text value) { return null; } } the returned value from this method replaces the key that’s supplied to the underlying recordwriter , so if you return null as in the above example, no key will be written to the file. $ hadoop fs -lsr /output/cupertino/* apple pear $ hadoop fs -lsr /output/sunnyvale/* banana you can achieve the same result for the output value by overriding the generateactualvalue method. changing the recordwriter in our final step we’ll look at how you can leverage multiple recordwriter classes for different output files. this is accomplished by overriding the getrecordwriter method. in the example below we’re leveraging the same textoutputformat for all the files, but it gives you a sense of what can be accomplished. static class keybasedmultipletextoutputformat extends multipletextoutputformat { @override protected string generatefilenameforkeyvalue(text key, text value, string name) { return key.tostring() + "/" + name; } @override public recordwriter getrecordwriter(filesystem fs, jobconf job, string name, progressable prog) throws ioexception { if (name.startswith("apple")) { return new textoutputformat().getrecordwriter(fs, job, name, prog); } else if (name.startswith("banana")) { return new textoutputformat().getrecordwriter(fs, job, name, prog); } return super.getrecordwriter(fs, job, name, prog); } } conclusion when using multipleoutputformat , give some thought to the number of distinct files that each reducer will create. it would be prudent to plan your bucketing so that you have a relatively small number of files. in this post we extended multipletextoutputformat , which is a simple extension of multipleoutputformat that supports text outputs. multiplesequencefileoutputformat also exists to support sequencefiles in a similar fashion. so what are the shortcomings with the multipleoutputformat class? if you have a job that uses both map and reduce phases, then multipleoutputformat can’t be used in the map-side to write outputs. of course, multipleoutputformat works fine in map-only jobs. all recordwriter classes must support exactly the same output record types. for example, you wouldn’t be able to support a recordwriter that emitted for one output file, and have another recordwriter that emitted . multipleoutputformat exists in the mapred package, so it won’t work with a job that requires use of the mapreduce package. all is not lost if you bump into either one of these issues, as you’ll discover in the next blog post.

May 20, 2013

by Alex Holmes

· 6,301 Views

Postgres Fuzzy Search Using Trigrams (+/- Django)

When building websites, you’ll often want users to be able to search for something by name. On LinerNotes, users can search for bands, albums, genres etc from a search bar that appears on the homepage and in the omnipresent nav bar. And we need a way to match those queries to entities in our Postgres database. At first, this might seem like a simple problem with a simple solution, especially if you’re using the ORM; just jam the user input into an ORM filter and retrieve every matching string. But there’s a problem: if you do Bands.objects.filter(name="beatles") You’ll probably get nothing back, because the name column in your “bands” table probably says “The Beatles” and as far as Postgres is concerned if it’s not exactly the same string, it’s not a match. Users are naturally terrible at spelling, and even if they weren’t they’d be bad at guessing exactly how the name is formatted in your database. Of course you can use the LIKE keyword in SQL (or the equivalent ‘__contains’ suffix in the ORM) to give yourself a little flexibility and make sure that “Beatles” returns “The Beatles”. But 1) the LIKE keyword requires you to evaluate a regex against every row in your table, or hope that you’ve configured your indices to support LIKE (a quick Google doesn’t tell me whether Django does that by default in the ORM) and 2) what if the user types “Beetles”? Well, then you’ve got a bit of a problem. No matter how obvious it is to human you that “beatles” is close to “beetles”[1], to the computer they’re just two non-identical byte sequences. If you want the computer to understand them as similar you’re going to have to give it a metric for similarity and a method to make the comparison. There are a few ways to do that. You can do what I did initially and whip out the power tools, i.e. a dedicated search system like Solr or ElasticSearch. These guys have notions of fuzziness built right in (Solr more automatically than ES). But they’re designed for full-text indexing of documents (e.g. full web pages) and they’re rather complex to set up and administer. ES has been enough of a hassle to keep running smoothly that I took the time to see if I could push the search workload to Postgres, and hence this article. Unless you need to do something real fancy, it’s probably overkill to use them for just matching names. Instead, we’re going to follow Starr Horne’s advice and use a Postgres EXTENSION that lets us build fuzziness into our query in a fast and fairly simple way. Specifically, we’re going to use an extension called pg_trgm (i.e. “Postgres Trigram”) which gives Postgres a “similarity” function that can evaluate how many three-character subsequences (i.e. “trigrams”) two strings share. This is actually a pretty good metric for fuzzy matching short strings like names. To use pg_trgm, you’ll need to install the “Postgres Contrib” package. On ubuntu: sudo apt-get install postgres-contrib **WARNING: THIS WILL TRY TO RESTART YOUR DATABASE** then pop open psql and install pg_trgm (NB: this only works on Postgres 9.1+; Google for the instructions if you’re on a lower version.) psql CREATE EXTENSION pg_trgm; \dx # to check it's installed Now you can do SELECT * FROM types_and_labels_view WHERE label %'Mountain Goats' ORDER BY similarity(label,'Mountain Goats') DESC LIMIT 100; And out will pop the 100 most similar names. This will still take a long time if your table is large, but we can improve that with a special type of index provided by pg_trgm: CREATE INDEX labels_trigram_index ON types_and_labels_table USING gist (label gist_trgm_ops); or CREATE INDEX labels_trigram_index ON types_and_labels_table USING gin (label gin_trgm_ops); (GIN is slower than GIST to build, but answers queries faster. That’ll take a while to build (possibly quite a while), but once it does you should be able to fuzzy search with ease and speed. If you’re using Django, you will have to drop into writing SQL to use this (until someone, maybe you, writes a Django extension to do this in the ORM.) And as a frustrating finishing note, my attempt to implement this on LinerNotes was not ultimately succesful. It seems that that index query performance is at least O(n) and with 50 million entities in my database queries take at least 10 seconds. I’ve read that performance is great up to about 100k records then drops off sharply from there. There are some apparently additional options for improving query performance, but I’ll be sticking with ElasticSearch for now. [1] Sorry, Googlebot! Not sorry, Bingbot.

May 19, 2013

by George London

· 9,604 Views

Lazy sequences implementation for Java 8

I just published the LazySeq library on GitHub - the result of my Java 8 experiments recently. I hope you will enjoy it. Even if you don't find it very useful, it's still a great lesson of functional programming in Java 8 (and in general). Also it's probably the first community library targeting Java 8! Introduction A Lazy sequence is a data structure that is computed only when its elements are actually needed. All operations on lazy sequences, like map() and filter() are lazy as well, postponing invocation up to the moment when it is really necessary. Lazy sequences are always traversed from the beginning using very cheap first/rest decomposition (head() and tail()). An important property of lazy sequences is that they can represent infinite streams of data, e.g. all natural numbers or temperature measurements over time. Lazy sequence remembers already computed values so if you access the Nth element, all elements from 1 to N-1 are computed as well and cached. Despite that LazySeq (being at the core of many functional languages and algorithms) is immutable and thread-safe. Rationale This library is heavily inspired by scala.collection.immutable.Stream and aims to provide immutable, thread-safe and easy to use lazy sequence implementation, possibly infinite. See Lazy sequences in Scala and Clojure for some use cases. Stream class name is already used in Java 8, therefore LazySeq was chosen, similar to lazy-seq in Clojure. Speaking of Stream, at first it looks like a lazy sequence implementation available out-of-the-box. However, quoting Javadoc: Streams are not data structures and: Once an operation has been performed on a stream, it is considered consumed and no longer usable for other operations. In other words java.util.stream.Stream is just a thin wrapper around existing collection, suitable for one time use. More akin to Iterator than to Stream in Scala. This library attempts to fill this niche. Of course implementing lazy sequence data structure was possible prior to Java 8, but lack of lambdas makes working with such data structure tedious and too verbose. Getting started Building and working with lazy sequences in 10 minutes. Infinite sequence of all natural numbers In order to create a lazy sequence you use LazySeq.cons() factory method that accepts first element (head) and a function that might be later used to compute rest (tail). For example in order to produce lazy sequence of natural numbers with given start element you simply say: private LazySeq naturals(int from) { return LazySeq.cons(from, () -> naturals(from + 1)); } There is really no recursion here. If there was, calling naturals() would quickly result in StackOverflowError as it calls itself without stop condition. However () -> naturals(from + 1) expression defines a function returning LazySeq (Supplier to be precise) that this data structure will invoke, but only if needed. Look at the code below, how many times do you think naturals() function was called (except the first line)? final LazySeq ints = naturals(2); final LazySeq strings = ints. map(n -> n + 10). filter(n -> n % 2 == 0). take(10). flatMap(n -> Arrays.asList(0x10000 + n, n)). distinct(). map(Integer::toHexString); First invocation of naturals(2) returns lazy sequence starting from 2 but rest (3, 4, 5, ...) is not computed yet. Later we map() over this sequence, filter() it, take() first 10 elements, remove duplicates, etc. All these operations do not evaluate the sequence and are as lazy as possible. For example take(10) doesn't evaluate first 10 elements eagerly to return them. Instead new lazy sequence is returned which remembers that it should truncate original sequence at 10th element. Same applies to distinct(). It doesn't evaluate the whole sequence to extract all unique values (otherwise code above would explode quickly, traversing infinite amount of natural numbers). Instead it returns a new sequence with only the first element. If you ever ask for the second unique element, it will lazily evaluate tail, but only as much as possible. Check out toString() output: System.out.println(strings); //[1000c, ?] Question mark (?) says: "there might be something more in that collection, but I don't know it yet". Do you understand where did 1000c came from? Look carefully: Start from an infinite stream of natural numbers starting from 2 Add 10 to each element (so the first element becomes 12 or C in hex) filter() out odd numbers (12 is even so it stays) take() first 10 elements from sequence so far Each element is replaced by two elements: that element plus 0x1000 and the element itself (flatMap()). This does not yield a sequence of pairs, but a sequence of integers that is twice as long We ensure only distinct() elements will be returned In the end we turn integers to hex strings. As you can see none of these operations really require evaluating the whole stream. Only head is being transformed and this is what we see in the end. So when this data structure is actually evaluated? When it absolutely must, e.g. during side-effect traversal: strings.force(); //or strings.forEach(System.out::println); //or final List list = strings.toList(); //or for (String s : strings) { System.out.println(s); } All the statements above alone will force evaluation of whole lazy sequence. Not very smart if our sequence was infinite, but strings was limited to first 10 elements so it will not run infinitely. If you want to force only part of the sequence, simply call strings.take(5).force(). BTW have you noticed that we can iterate over LazySeq strings using standard Java 5 for-each syntax? That's because LazySeq implements List interface, thus plays nicely with Java Collections Framework ecosystem: import java.util.AbstractList; public abstract class LazySeq extends AbstractList Please keep in mind that once lazy sequence is evaluated (computed) it will cache (memoize) them for later use. This makes lazy sequences great for representing infinite or very long streams of data that are expensive to compute. iterate() Building an infinite lazy sequence very often boils down to providing an initial element and a function that produces next item based on the previous one. In other words second element is a function of the first one, third element is a function of the second one, and so on. Convenience LazySeq.iterate() function is provided for such circumstances. ints definition can now look like this: final LazySeq ints = LazySeq.iterate(2, n -> n + 1); We start from 2 and each subsequent element is represented as previous element + 1. More examples: Fibonacci sequence and Collatz conjecture No article about lazy data structure can be left without Fibonacci numbers example: private static LazySeq lastTwoFib(int first, int second) { return LazySeq.cons( first, () -> lastTwoFib(second, first + second) ); } Fibonacci sequence is infinite as well but we are free to transform it in multiple ways: System.out.println( fib. drop(5). take(10). toList() ); //[5, 8, 13, 21, 34, 55, 89, 144, 233, 377] final int firstAbove1000 = fib. filter(n -> (n > 1000)). head(); fib.get(45); See how easy and natural it is to work with infinite stream of numbers? drop(5).take(10) skips first 5 elements and displays next 10. At this point first 15 numbers are already computed and will never by computed again. Finding first Fibonacci number above 1000 (happens to be 1597) is very straightforward. head() is always precomputed by filter() , so no further evaluation is needed. Last but not least we can simply just ask for 45th Fibonacci number (0-based) and get 1134903170. If you ever try to access any Fibonacci number up to this one, they are precomputed and fast to retrieve. Finite sequences (Collatz conjecture) Collatz conjecture is also quite interesting problem. For each positive integer n we compute next integer using following algorithm: n/2 if n is even 3n + 1 if n is odd For example starting from 10 series looks as follows: 10, 5, 16, 8, 4, 2, 1. The series ends when it reaches 1. Mathematicians believe that starting from any integer we will eventually reach 1 but it's not yet proven. Let us create a lazy sequence that generates Collatz series for given n, but only as many as needed. As stated above, this time our sequence will be finite: private LazySeq collatz(long from) { if (from > 1) { final long next = from % 2 == 0 ? from / 2 : from * 3 + 1; return LazySeq.cons(from, () -> collatz(next)); } else { return LazySeq.of(1L); } } This implementation is driven directly by the definition. For each number greater than 1 return that number + lazily evaluated (() -> collatz(next)) rest of the stream. As you can see if 1 is given, we return single element lazy sequence using special of() factory method. Let's test it with aforementioned 10: final LazySeq collatz = collatz(10); collatz.filter(n -> (n > 10)).head(); collatz.size(); filter() allows us to find first number in the sequence that is greater than 10. Remember that lazy sequence will have to traverse the contents (evaluate itself), but only to the point where it finds first matching element. Then it stops, ensuring it computes as little as possible. However size(), in order to calculate total number of elements, must traverse the whole sequence. Of course this can only work with finite lazy sequences, calling size() on an infinite sequence will end up poorly. If you play a bit with this sequence you will quickly realize that sequences for different numbers share the same suffix (always end with the same sequence of numbers). This begs for some caching/structural sharing. See CollatzConjectureTest for details. But can it be used to something, you know... useful? Real life? Infinite sequences of numbers are great, but not very practical in real life. Maybe some more down to earth examples? Imagine you have a collection and you need to pick few items from that collection randomly. Instead of collection I will use a function returning random latin characters: private char randomChar() { return (char) ('A' + (int) (Math.random() * ('Z' - 'A' + 1))); } But there is a twist. You need N (N < 26, number of latin characters) unique values. Simply calling randomChar() few times doesn't guarantee uniqueness. There are few approaches to this problem, with LazySeq it's pretty straightforward: LazySeq charStream = LazySeq.continually(this::randomChar); LazySeq uniqueCharStream = charStream.distinct(); continually() simply invokes given function for each element when needed. Thus charStream will be an infinite stream of random characters. Of course they can't be unique. However uniqueCharStream guarantees that its output is unique. It does so by examining next element of underlying charStream and rejecting items that already appeared. We can now say uniqueCharStream.take(4) and be sure that no duplicates will appear. Once again notice that continually(this::randomChar).distinct().take(4) really calls randomChar() only once! As long as you don't consume this sequence, it remains lazy and postpones evaluation as long as possible. Another example involves loading batches (pages) of data from database. Using ResultSet or Iterator is cumbersome but loading whole data set into memory often not feasible. An alternative involves loading first batch of data eagerly and then providing a function to load next batches. Data is loaded only when it's really needed and we don't suffer performance or scalability issues. First let's define abstract API for loading batches of data from database: public List loadPage(int offset, int max) { //load records from offset to offset + max } I abstract from the technology entirely, but you get the point. Imagine that we now define LazySeq that starts from row 0 and loads next pages only when needed: public static final int PAGE_SIZE = 5; private LazySeq records(int from) { return LazySeq.concat( loadPage(from, PAGE_SIZE), () -> records(from + PAGE_SIZE) ); } When creating new LazySeq instance by calling records(0) first page of 5 elements is loaded. This means that first 5 sequence elements are already computed. If you ever try to access 6th or above, sequence will automatically load all missing record and cache them. In other words you never compute the same element twice. More useful tools when working with sequences are grouped() and sliding() methods. First partitions input sequence into groups of equal size. Take this as an example, also proving that these methods are as always lazy: final LazySeq chars = LazySeq.of('A', 'B', 'C', 'D', 'E', 'F', 'G'); chars.grouped(3); //[[A, B, C], ?] chars.grouped(3).force(); //force evaluation //[[A, B, C], [D, E, F], [G]] and similarly for sliding(): chars.sliding(3); //[[A, B, C], ?] chars.sliding(3).force(); //force evaluation //[[A, B, C], [B, C, D], [C, D, E], [D, E, F], [E, F, G]] These two methods are extremely useful. You can look at your data through sliding window (e.g. to compute moving average) or partition it to equal-length buckets. Last interesting utility method you may find useful is scan() that iterates (lazily, of course) the input stream and constructs every element of output by applying a function on previous and current element of input. Code snippet is worth a thousand words: LazySeq list = LazySeq. numbers(1). scan(0, (a, x) -> a + x); list.take(10).force(); //[0, 1, 3, 6, 10, 15, 21, 28, 36, 45] LazySeq.numbers(1) is a sequence of natural numbers (1, 2, 3...). scan() creates a new sequence that starts from 0 and for each element of input (natural numbers) adds it to last element of itself. So we get: [0, 0+1, 0+1+2, 0+1+2+3, 0+1+2+3+4, 0+1+2+3+4+5...]. If you want a sequence of growing strings, just replace few types: LazySeq.continually("*"). scan("", (s, c) -> s + c). map(s -> "|" + s + "\\"). take(10). forEach(System.out::println); And enjoy this beautiful triangle: |\ |*\ |**\ |***\ |****\ |*****\ |******\ |*******\ |********\ |*********\ Alternatively (same output): lazySeq. stream(). map(n -> n + 1). flatMap(n -> asList(0, n - 1).stream()). filter(n -> n != 0). substream(4, 18). limit(10). sorted(). distinct(). collect(Collectors.toList()); Java collections framework interoperability LazySeq implements java.util.List interface, thus can be used in variety of places. Moreover it also implements Java 8 enhancements to collections, namely streams and collectors: lazySeq. stream(). map(n -> n + 1). flatMap(n -> asList(0, n - 1).stream()). filter(n -> n != 0). substream(4, 18). limit(10). sorted(). distinct(). collect(Collectors.toList()); However streams in Java 8 were created to work around feature that is a foundation of LazySeq - lazy evaluation. Example above postpones all intermediate steps until collect() is called. With LazySeq you can safely skip .stream() and work directly on sequence: lazySeq. map(n -> n + 1). flatMap(n -> asList(0, n - 1)). filter(n -> n != 0). slice(4, 18). limit(10). sorted(). distinct(); Moreover LazySeq provides special purpose collector (see: LazySeq.toLazySeq()) that avoids evaluation even when used with collect() - which normally forces full collection computation. Implementation details Each lazy sequence is built around the idea of eagerly computed head and lazily evaluated tail represented as function. This is very similar to classic single-linked list recursive definition: class List { private final T head; private final List tail; //... } However in case of lazy sequence tail is given as a function, not a value. Invocation of that function is postponed as long as possible: class Cons extends LazySeq { private final E head; private LazySeq tailOrNull; private final Supplier> tailFun; @Override public LazySeq tail() { if (tailOrNull == null) { tailOrNull = tailFun.get(); } return tailOrNull; } For full implementation see Cons.java and FixedCons.java used when tail is known at creation time (for example LazySeq.of(1, 2) as opposed to LazySeq.cons(1, () -> someTailFun()). Pitfalls and common dangers Below common issues and misunderstandings are described. Evaluating too much One of the biggest dangers of working with infinite sequences is trying to evaluate them completely, which obviously leads to infinite computation. The idea behind infinite sequence is not to evaluate it in its entirety but to take as much as we need without introducing artificial limits and accidental complexity (see database loading example). However evaluating whole sequence is way too simple to miss. For example calling LazySeq.size()must evaluate whole sequence and will run infinitely, eventually filling up stack or heap (implementation detail). There are other methods that require full traversal in order to function properly. E.g. allMatch() making sure all elements match given predicate. Some methods are even more dangerous, because whether they will finish or not depends on data in the sequence. For example anyMatch() may return immediately if head matches predicate - or never. Sometimes we can easily avoid costly operations by using more deterministic methods. For example: seq.size() <= 10 //BAD may not work or be extremely slow if seq is infinite. However we can achieve the same with (more) predictable: seq.drop(10).isEmpty() Remember that lazy sequences are immutable (so we don't really mutate seq), drop(n) is typically O(n) while isEmpty() is O(1). When in doubt, consult source code or JavaDoc to make sure your operation won't too eagerly evaluate your sequence. Also be very cautious when using LazySeq where java.util.Collection or java.util.List is expected. Holding unnecessary reference to head Lazy sequences be definition remember already computed elements. You have to be aware of that, otherwise your sequence (especially infinite) will quickly fill up available memory. However, because LazySeq is just a fancy linked list, if you no longer keep a reference to head (but only to some element in the middle), it becomes eligible for garbage collection. For example: //LazySeq first = seq.take(10); seq = seq.drop(10); First ten elements are dropped and we assume nothing holds a reference to what previously was hept in seq. This makes first ten elements eligible for garbage collection. However if we uncomment first line and keep reference to old head in first, JVM will not release any memory. Let's put that into perspective. The following piece of code will eventually throw OutOfMemoryError because infinite reference keeps holding the beginning of the sequence, therefore all the elements created so far: LazySeq infinite = LazySeq.continually(Big::new); for (Big arr : infinite) { // } However by inlining call to continually() or extracting it to a method this code works flawlessly (well, still runs forever, but uses almost no memory): private LazySeq getContinually() { return LazySeq.continually(Big::new); } for (Big arr : getContinually()) { // } What's the difference? For-each loop uses iterators underneath. LazySeqIterator underneath doesn't hold a reference to old head() when it advances, so if nothing else references that head, it will be eligible for garbage collection, see true javac output when for-each is used: for (Iterator cur = getContinually().iterator(); cur.hasNext(); ) { final Big arr = cur.next(); //... } TL;DR Your sequence grows while being traversed. If you keep holding one end while the other grows, it will eventually blow up. Just like your first level cache in Hibernate if you load too much in one transaction. Use only as much as needed. Converting to plain Java collections Converting is simple, but dangerous. This is a consequence of points above. You can convert lazy sequence to java.util.List by calling toList(): LazySeq even = LazySeq.numbers(0, 2); even.take(5).toList(); //[0, 2, 4, 6, 8] or using Collector from Java 8 having richer API: even. stream(). limit(5). collect(Collectors.toSet()) //[4, 6, 0, 2, 8] But remember that Java collections are finite from definition so avoid converting lazy sequences to collections explicitly. Note that LazySeq is already List, thus Iterable and Collection. It also has efficient LazySeq.iterator(). If you can, simply pass LazySeq instance directly and may just work. Performance, time and space complexity head() of every sequence (except empty) is always computed eagerly, thus accessing it is fast O(1). Computing tail() may take everything from O(1) (if it was already computed) to infinite time. As an example take this valid stream: import static com.blogspot.nurkiewicz.lazyseq.LazySeq.cons; import static com.blogspot.nurkiewicz.lazyseq.LazySeq.continually; LazySeq oneAndZeros = cons( 1, () -> continually(0) ). filter(x -> (x > 0)); It represents 1 followed by infinite number of 0s. By filtering all positive numbers (x > 0) we get a sequence with same head, but filtering of tail is delayed (lazy). However if we now carelessly call oneAndZeros.tail(), LazySeq will keep computing more and more of this infinite sequence, but since there is no positive element after initial 1, this operation will run forever, eventually throwing StackOverflowError or OutOfMemoryError (this is an implementation detail). However if you ever reach this state, it's probably a programming bug or misusing of the library. Typically tail() will be close to O(1). On the other hand if you have plenty of operations already "stacked", calling tail() will trigger them rapidly one after another, so tail() run time is heavily dependant on your data structure. Most operations on LazySeq are O(1) since they are lazy. Some operations, like get(n) or drop(n) are O(n) (n represents parameter, not sequence length). In general run time will be similar to normal linked list. Because LazySeq remembers all already computed values in a single linked list, memory consumption is always O(n), where nn is the number of already computed elements. Troubleshooting Error invalid target release: 1.8 during maven build If you see this error message during maven build: [INFO] BUILD FAILURE ... [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project lazyseq: Fatal error compiling: invalid target release: 1.8 -> [Help 1] it means you are not compiling using Java 8. Download JDK 8 with lambda support and let maven use it: $ export JAVA_HOME=/path/to/jdk8 I get StackOverflowError or program hangs infinitely When working with LazySeq you sometimes get StackOverflowError or OutOfMemoryError: java.lang.StackOverflowError at sun.misc.Unsafe.allocateInstance(Native Method) at java.lang.invoke.DirectMethodHandle.allocateInstance(DirectMethodHandle.java:426) at com.blogspot.nurkiewicz.lazyseq.LazySeq.iterate(LazySeq.java:118) at com.blogspot.nurkiewicz.lazyseq.LazySeq.lambda$0(LazySeq.java:118) at com.blogspot.nurkiewicz.lazyseq.LazySeq$$Lambda$2.get(Unknown Source) at com.blogspot.nurkiewicz.lazyseq.Cons.tail(Cons.java:32) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) at com.blogspot.nurkiewicz.lazyseq.LazySeq.size(LazySeq.java:325) When working with possibly infinite data structures, care must be taken. Avoid calling operations that must (size(), allMatch(), minBy(), forEach(), reduce(), ...) or can (filter(), distinct(), ...) traverse the whole sequence in order to give correct results. See Pitfalls for more examples and ways to avoid. Maturity Quality This project was started as an exercise and is not battle-proven. But a healthy 300+ unit-test suite (3:1 test code/production code ratio) guards quality and functional correctness. I also make sure LazySeq is as lazy as possible by mocking tail functions and verifying they are called as rarely as one can get. Contributions and bug reports In the event of finding a bug or missing feature, don't hesitate to open a new ticket or start pull request. I would also love to see more interesting usages of LazySeq in wild. Possible improvements Just like FixedCons is used when tail is known up-front, consider IterableCons that wraps existing Iterable in one node rather than building FixedCons hierarchy. This can be used for all concat methods. Parallel processing support (implementing spliterator?) License This project is released under version 2.0 of the Apache License.

May 15, 2013

by Tomasz Nurkiewicz

· 28,975 Views · 1 Like

Deploy a File Server in the Cloud (WebDav on Windows Azure)

this month, my fellow it pro technical evangelists and i are authoring a new series of articles on 20 key scenarios with windows azure infrastructure services . check out the list of articles here: http://mythoughtsonit.com/2013/05/20-key-scenarios-with-windows-azure-infrastructure-services/ . web-based distributed authoring and versioning, or webdav, is a set of protocols based on http that allows end-users to map a network drive over http and edit content and files stored on the web server. when webdav was first offered on microsoft server i had evaluated it and decided it did not perform well enough for me. the webdav extension to iis was completely rewritten back in the server 2008 timeframe and is worth taking a look at again. in this article i will guide you step by step through the process of setting up webdav on server 2012 in a windows azure iaas environment. this will give you a solid performing file share on the internet over port 80 and the http protocol. first you need an azure account. you can setup a free trail of azure. details can be found here: http://mythoughtsonit.com/2013/04/step-by-step-guide-to-setting-up-a-windows-azure-free-trial/ second provision a server 2012 machine. watch a video of what to do here: third open port 80 to this new server: in the azure portal select your 2012 server and choose the “endpoints” tab on the top. click “add endpoint” at the bottom of the screen enter the endpoint information for port 80 to port 80 done. next we need to install the iis webserver and webdav. installing webdav on iis 8.0 start server manager and go to “add roles and features” under server roles – add the web server (iis) role click through the wizard until you come to the role services section. then find and select “webdav publishing” and “windows authentication” click next and then install when the install is finished you are ready to move on to the next section. configuring iis 8 for webdav after the installation finishes you need to configure the box for access. start the iis manager tool. choose the “default web site” on the left side. then click on “authentication” open the windows authentication option and enable it. open the “webdav authoring rules” create a webdav rule. i choose to allow all users access to all content. a better security practice is to limit what users can use the service. it’s your data so you decide. make sure webdav is enabled and that your access rule is set: that is it… now your ready to access your webdav file share! test and insure you can hit the web server by using your browser: because you opened port 80 and installed iis 8 you should see the default web page when you browse to your servers internet dns name. example: http://yourdomainname.cloudapp.net/ how to map a drive to your webdav server: there are two ways i use to connect to the webdav server how to map a drive to your webdav server from the win 8 gui: from windows explorer, right click on “computer” and select “map a network drive” map your network drive by entering the address to your server example: http://yourdomainname.cloudapp.net/ i selected “connect using different credentials” because my workstation was not joined to the server in anyway and i needed to use an account in the servers local sam database. hit “finish” and enter your credentials. now you will have a connected drive that you can access from windows explorer or any tool via the drive mapping. how to map a drive to your webdav server from a cmd box: 1. hit windows start and type: cmd 2. enter the command: net use [drive letter] [url] example: net use e: http://yourdomainname.cloudapp.net/

May 15, 2013

by Brian Lewis

· 15,941 Views

JPA - Querydsl Projections

In my last post: JPA - Basic Projections - I've mentioned about two basic possibilities of building JPA Projections. This post brings you more examples, this time based on Querydsl framework. Note, that I'm referring Querydslversion 3.1.1 here. Reinvented constructor expressions Take a look at the following code:... import static com.blogspot.vardlokkur.domain.QEmployee.employee; import javax.persistence.EntityManager; import javax.persistence.PersistenceContext; import org.springframework.beans.factory.annotation.Autowired; import com.blogspot.vardlokkur.domain.EmployeeNameProjection; import com.mysema.query.jpa.JPQLTemplates; import com.mysema.query.jpa.impl.JPAQuery; import com.mysema.query.types.ConstructorExpression; ... public class ConstructorExpressionExample { ... @PersistenceContext private EntityManager entityManager; @Autowired private JPQLTemplates jpqlTemplates; public void someMethod() { ... final List projections = new JPAQuery(entityManager, jpqlTemplates) .from(employee) .orderBy(employee.name.asc()) .list(ConstructorExpression.create(EmployeeNameProjection.class, employee.employeeId, employee.name)); ... } ... } The above Querydsl construction means: create new JPQL query [1][2], using employee as the data source, order the data using employee name [3], and return the list of EmployeeNameProjection, built using the 2-arg constructor called with employee ID and name [4]. This is very similar to the constructor expressions example from my previous post (JPA - Basic Projections), and leads to the following SQL query: select EMPLOYEE_ID, EMPLOYEE_NAME from EMPLOYEE order by EMPLOYEE_NAME asc As you see above, the main advantage comparing to the JPA constructor expressions is using Java class, instead of its name hard-coded in JPQL query. Even more reinvented constructor expressions Querydsl documentation [4] describes another way of using constructor expressions, requiring @QueryProjectionannotation and Query Type [1] usage for projection, see example below. Let's start with the projection class modification - note that I added @QueryProjection annotation on the class constructor. package com.blogspot.vardlokkur.domain; import java.io.Serializable; import javax.annotation.concurrent.Immutable; import com.mysema.query.annotations.QueryProjection; @Immutable public class EmployeeNameProjection implements Serializable { private final Long employeeId; private final String name; @QueryProjection public EmployeeNameProjection(Long employeeId, String name) { super(); this.employeeId = employeeId; this.name = name; } public Long getEmployeeId() { return employeeId; } public String getName() { return name; } } Now we may use modified projection class (and corresponding Query Type [1] ) in following way: ... import static com.blogspot.vardlokkur.domain.QEmployee.employee; import javax.persistence.EntityManager; import javax.persistence.PersistenceContext; import org.springframework.beans.factory.annotation.Autowired; import com.blogspot.vardlokkur.domain.EmployeeNameProjection; import com.blogspot.vardlokkur.domain.QEmployeeNameProjection; import com.mysema.query.jpa.JPQLTemplates; import com.mysema.query.jpa.impl.JPAQuery; ... public class ConstructorExpressionExample { ... @PersistenceContext private EntityManager entityManager; @Autowired private JPQLTemplates jpqlTemplates; public void someMethod() { ... final List projections = new JPAQuery(entityManager, jpqlTemplates) .from(employee) .orderBy(employee.name.asc()) .list(new QEmployeeNameProjection(employee.employeeId, employee.name)); ... } ... } Which leads to SQL query: select EMPLOYEE_ID, EMPLOYEE_NAME from EMPLOYEE order by EMPLOYEE_NAME asc In fact, when you take a closer look at the Query Type [1] generated for EmployeeNameProjection(QEmployeeNameProjection), you will see it is some kind of "shortcut" for creating constructor expression the way described in first section of this post. Mapping projection Querydsl provides another way of building projections, using factories based on MappingProjection. package com.blogspot.vardlokkur.domain; import static com.blogspot.vardlokkur.domain.QEmployee.employee; import com.mysema.query.Tuple; import com.mysema.query.types.MappingProjection; public class EmployeeNameProjectionFactory extends MappingProjection { public EmployeeNameProjectionFactory() { super(EmployeeNameProjection.class, employee.employeeId, employee.name); } @Override protected EmployeeNameProjection map(Tuple row) { return new EmployeeNameProjection(row.get(employee.employeeId), row.get(employee.name)); } } The above class is a simple factory creating EmployeeNameProjection instances using employee ID and name. Note that the factory constructor defines which employee properties will be used for building the projection, and mapmethod defines how the instances will be created. Below you may find an example of using the factory: ... import static com.blogspot.vardlokkur.domain.QEmployee.employee; import javax.persistence.EntityManager; import javax.persistence.PersistenceContext; import org.springframework.beans.factory.annotation.Autowired; import com.blogspot.vardlokkur.domain.EmployeeNameProjection; import com.blogspot.vardlokkur.domain.EmployeeNameProjectionFactory import com.mysema.query.jpa.JPQLTemplates; import com.mysema.query.jpa.impl.JPAQuery; ... public class MappingProjectionExample { ... @PersistenceContext private EntityManager entityManager; @Autowired private JPQLTemplates jpqlTemplates; public void someMethod() { ... final List projections = new JPAQuery(entityManager, jpqlTemplates) .from(employee) .orderBy(employee.name.asc()) .list(new EmployeeNameProjectionFactory()); .... } ... } As you see, the one and only difference here, comparing to constructor expression examples, is the list method call. Above example leads again to the very simple SQL query: select EMPLOYEE_ID, EMPLOYEE_NAME from EMPLOYEE order by EMPLOYEE_NAME asc Building projections this way is much more powerful, and doesn't require existence of n-arg projection constructor. QBean based projection (JavaBeans strike again) There is at least one more possibility of creating projection with Querydsl - QBean based - in this case we build the result list using: ... .list(Projections.bean(EmployeeNameProjection.class, employee.employeeId, employee.name)) This way requires EmployeeNameProjection class to follow JavaBean conventions, which is not always desired in application. Use it if you want, but you have been warned ;) Few links for the dessert Using Query Types Querying Ordering Constructor projections

May 15, 2013

by Michal Jastak

· 39,449 Views · 3 Likes

Getting Started with Active Directory Lightweight Directory Services

introduction in preparation for some upcoming posts related to linq (what else?), windows powershell and rx, i had to set up a local ldap-capable directory service. (hint: it will pay off to read till the very end of the post if you’re wondering what i’m up to...) in this post i’ll walk the reader through the installation, configuration and use of active directory lightweight directory services (lds) , formerly known as active directory application mode (adam). having used the technology several years ago, in relation to the linq to active directory project (which as an extension to this blog series will receive an update), it was a warm and welcome reencounter. what’s lightweight directory services anyway? use of hierarchical storage and auxiliary services provided by technologies like active directory often has advantages over alternative designs, e.g. using a relational database. for example, user accounts may be stored in a directory service for an application to make use of. while active directory seems the natural habitat to store (and replicate, secure, etc.) additional user information, it admins will likely point you – the poor developer – at the door when asking to extend the schema. that’s one of the places where lds comes in, offering the ability to take advantage of the programming model of directory services while keeping your hands off “the one and only ad schema”. the lds website quotes other use cases, which i’ll just copy here verbatim: active directory lightweight directory service (ad lds), formerly known as active directory application mode, can be used to provide directory services for directory-enabled applications. instead of using your organization’s ad ds database to store the directory-enabled application data, ad lds can be used to store the data. ad lds can be used in conjunction with ad ds so that you can have a central location for security accounts (ad ds) and another location to support the application configuration and directory data (ad lds). using ad lds, you can reduce the overhead associated with active directory replication, you do not have to extend the active directory schema to support the application, and you can partition the directory structure so that the ad lds service is only deployed to the servers that need to support the directory-enabled application. install from media generation. the ability to create installation media for ad lds by using ntdsutil.exe or dsdbutil.exe. auditing. auditing of changed values within the directory service. database mounting tool. gives you the ability to view data within snapshots of the database files. active directory sites and services support. gives you the ability to use active directory sites and services to manage the replication of the ad lds data changes. dynamic list of ldif files. with this feature, you can associate custom ldif files with the existing default ldif files used for setup of ad lds on a server. recursive linked-attribute queries. ldap queries can follow nested attribute links to determine additional attribute properties, such as group memberships. obviously that last bullet point grabs my attention through i will retain myself from digressing here. getting started if you’re running windows 7, the following explanation is the right one for you. for older versions of the operating system, things are pretty similar though different downloads will have to be used. for windows server 2008, a server role exists for lds. so, assuming you’re on windows 7, start by downloading the installation media over here . after installing this, you should find an entry “active directory lightweight directory services setup wizard” under the “administrative tools” section in “control panel”: lds allows you to install multiple instances of directory services on the same machine, just like sql server allows multiple server instances to co-exist. each instance has a name and listens on certain ports using the ldp protocol. starting this wizard – which lives under %systemroot%\adam\adaminstall.exe, revealing the former product name – brings us here: after clicking next, we need to decide whether we create a new unique instance that hasn’t any ties with existing instances, or whether we want to create a replicate of an existing instance. for our purposes, the first option is what we need: next, we’re asked for an instance name. the instance name will be used for the creation of a windows service, as well as to store some settings. each instance will get its own windows service. in our sample, we’ll create a directory for the northwind employees tables, which we’ll use to create accounts further on. we’re almost there with the baseline configuration. the next question is to specify a port number, both for plain tcp and for ssl-encrypted traffic. the default ports, 389 and 636, are fine for us. later we’ll be able to connect to the instance by connecting to ldp over port 389, e.g. using the system.directoryservices namespace functionality in .net. notice every instance of lds should have its own port number, so only one can be using the default port numbers. now that we have completed the “physical administration”, the wizard moves on to a bit of “logical administration”. more specifically, we’re given the option to create a directory partition for the application. here we choose to create such a partition, though in many concrete deployment scenarios you’ll want the application’s setup to create this at runtime. our partition’s distinguished name will mimic a “northwind.local” domain containing a partition called “employees”: after this bit of logical administration, some more physical configuration has to be carried out, specifying the data files location and the account to run the services under. for both, the default settings are fine. also the administrative account assigned to manage the lds instance can be kept as the currently logged in user, unless you feel the need to change this in your scenario: finally, we’ve arrived at an interesting step where we’re given the option to import ldif files. and ldif file, with extension .ldf, contains the definition of a class that can be added to a directory service’s schema. basically those contain things like attributes and their types. under the %systemroot%\adam folder, a set of out-of-the-box .ldf files can be found: instead of having to run the ldifde.exe tool, the wizard gives us the option to import ldif files directly. those classes are documented in various places, such as rfc2798 for inetorgperson . on technet, information is presented in a more structured manner, e.g revealing that inetorgperson is a subclass of user . custom classes can be defined and imported after setup has completed. in this post, we won’t extend the schema ourselves but we will simply be using the built-in user class so let’s tick that one: after clicking next, we get a last chance to revisit our settings or can confirm the installation. at this point, the wizard will create the instance – setting up the service – and import the ldif files. congratulations! your first lds instance has materialized. if everything went alright, the northwindemployees service should show up: inspecting the directory to inspect the newly created directory instance, a bunch of tools exist. one is adsi edit which you could already see in the administrative tools. to set it up, open the mmc-based tool and go to action, connect to… in the dialog that appears, specify the server name and choose schema as the naming context. for example, if you want to inspect the user class, simply navigate to the schema node in the tree and show the properties of the user entry. to visualize the objects in the application partition, connect using the distinguished name specified during the installation: now it’s possible to create a new object in the directory using the context menu in the content pane: after specifying the class, we get to specify the “cn” name (for common name) of the object. in this case, i’ll use my full name: we can also set additional attributes, as shown below (using the “physicaldeliveryofficename” to specify the office number of the user): after clicking set, closing the attributes dialog and clicking finish to create the object, we see it pop up in the items view of the adsi editor snap-in: programmatic population of the directory obviously we’re much more interested in a programmatic way to program directory services. .net supports the use of directory services and related protocols (ldap in particular) through the system.directoryservices namespace. in a plain new console application, add a reference to the assembly with the same name (don’t both about other assemblies that deal with account management and protocol stuff): for this sample, i’ll also assume the reader got a northwind sql database sitting somewhere and knows how to get data out of its employees table as rich objects. below is how things look when using the linq to sql designer: we’ll just import a few details about the users; it’s left to the reader to map other properties onto attributes using the documentation about the user directory services class . just a few lines of code suffice to accomplish the task (assuming the system.directoryservices namespace is imported): static void main() { var path = "ldap://bartde-hp07/cn=employees,dc=northwind,dc=local"; var root = new directoryentry(path); var ctx = new northwinddatacontext(); foreach (var e in ctx.employees) { var cn = "cn=" + e.firstname + e.lastname; var u = root.children.add(cn, "user"); u.properties["employeeid"].value = e.employeeid; u.properties["sn"].value = e.lastname; u.properties["givenname"].value = e.firstname; u.properties["comment"].value = e.notes; u.properties["homephone"].value = e.homephone; u.properties["photo"].value = e.photo.toarray(); u.commitchanges(); } } after running this code – obviously changing the ldap path to reflect your setup – you should see the following in adsi edit (after hitting refresh): now it’s just plain easy to write an application that visualizes the employees with their data. we’ll leave that to the ui-savvy reader (just to tease that segment of my audience, i’ve also imported the employee’s photo as a byte-array). a small preview of what’s coming up to whet the reader’s appetite about next episodes on this blog, below is a single screenshot illustrating something – imho – rather cool (use of linq to active directory is just an implementation detail below): note: what’s shown here is the result of a very early experiment done as part of my current job on “linq to anything” here in the “cloud data programmability team”. please don’t fantasize about it as being a vnext feature of any product involved whatsoever. the core intent of those experiments is to emphasize the omnipresence of linq (and more widely, monads) in today’s (and tomorrow’s) world. while we’re not ready to reveal the “linq to anything” mission in all its glory (rather think of it as “linq to the unimaginable”), we can drop some hints. stay tuned for more!

May 11, 2013

by Bart De Smet

· 26,324 Views · 1 Like

Hibernate 3 with Spring

1. Overview This article will focus on setting up Hibernate 3 with Spring – we’ll look at how to configure Spring 3 with Hibernate 3 using both Java and XML Configuration. 2. Maven To add the Spring Persistence dependencies to the pom, please see the Spring with Maven article. Continuing with Hibernate 3, the Maven dependencies are simple: org.hibernate hibernate-core 3.6.10.Final Then, to enable Hibernate to use its proxy model, we need javassist as well: org.javassist javassist 3.17.1-GA And since we’re going to use MySQL for this tutorial, we’ll also need: mysql mysql-connector-java 5.1.25 runtime 3. Java Spring Configuration for Hibernate 3 Setting up Hibernate 3 with Spring and Java configuration is straightforward: import java.util.Properties; import javax.sql.DataSource; import org.springframework.beans.factory.annotation.Autowired; import org.springframework.context.annotation.Bean; import org.springframework.context.annotation.ComponentScan; import org.springframework.context.annotation.Configuration; import org.springframework.context.annotation.PropertySource; import org.springframework.core.env.Environment; import org.springframework.dao.annotation.PersistenceExceptionTranslationPostProcessor; import org.springframework.jdbc.datasource.DriverManagerDataSource; import org.springframework.orm.hibernate3.HibernateTransactionManager; import org.springframework.orm.hibernate3.annotation.AnnotationSessionFactoryBean; import org.springframework.transaction.annotation.EnableTransactionManagement; import com.google.common.base.Preconditions; @Configuration @EnableTransactionManagement @PropertySource({ "classpath:persistence-mysql.properties" }) @ComponentScan({ "org.baeldung.spring.persistence" }) public class PersistenceConfig { @Autowired private Environment env; @Bean public AnnotationSessionFactoryBean sessionFactory() { AnnotationSessionFactoryBean sessionFactory = new AnnotationSessionFactoryBean(); sessionFactory.setDataSource(restDataSource()); sessionFactory.setPackagesToScan(new String[] { "org.baeldung.spring.persistence.model" }); sessionFactory.setHibernateProperties(hibernateProperties()); return sessionFactory; } @Bean public DataSource restDataSource() { DriverManagerDataSource dataSource = new DriverManagerDataSource(); dataSource.setDriverClassName(env.getProperty("jdbc.driverClassName")); dataSource.setUrl(env.getProperty("jdbc.url")); dataSource.setUsername(env.getProperty("jdbc.user")); dataSource.setPassword(env.getProperty("jdbc.pass")); return dataSource; } @Bean public HibernateTransactionManager transactionManager() { HibernateTransactionManager txManager = new HibernateTransactionManager(); txManager.setSessionFactory(sessionFactory().getObject()); return txManager; } @Bean public PersistenceExceptionTranslationPostProcessor exceptionTranslation() { return new PersistenceExceptionTranslationPostProcessor(); } Properties hibernateProperties() { return new Properties() { { setProperty("hibernate.hbm2ddl.auto", env.getProperty("hibernate.hbm2ddl.auto")); setProperty("hibernate.dialect", env.getProperty("hibernate.dialect")); } }; } } Compared to the XML Configuration – described next – there is a small difference in the way one bean in the configuration access another. In XML there is no difference between pointing to a bean or pointing to a bean factory capable of creating that bean. Since the Java configuration is type-safe – pointing directly to the bean factory is no longer an option – we need to retrieve the bean from the bean factory manually: txManager.setSessionFactory(sessionFactory().getObject()); 4. XML Spring Configuration for Hibernate 3 Simillary, Hibernate 3 can be configured using XML Configuration as well: ${hibernate.hbm2ddl.auto} ${hibernate.dialect} Then, this XML file is boostrapped into the Spring context: @Configuration @EnableTransactionManagement @ImportResource({ "classpath:persistenceConfig.xml" }) public class PersistenceXmlConfig { // } For both types of configuration, the JDBC and Hibernate specific properties are stored in a properties file: # jdbc.X jdbc.driverClassName=com.mysql.jdbc.Driver jdbc.url=jdbc:mysql://localhost:3306/spring_hibernate_dev?createDatabaseIfNotExist=true jdbc.user=tutorialuser jdbc.pass=tutorialmy5ql # hibernate.X hibernate.dialect=org.hibernate.dialect.MySQL5Dialect hibernate.show_sql=false hibernate.hbm2ddl.auto=create-drop 5. Spring, Hibernate and MySQL The example above uses MySQL 5 as the underlying database configured with Hibernate – however, Hibernate supports several underlying SQL Databases. 5.1. The Driver The Driver class name is configured via the jdbc.driverClassName property provided to the DataSource. In the example above, it is set to com.mysql.jdbc.Driver from the mysql-connector-java dependency we defined in the pom, at the start of the article. 5.2. The Dialect The Dialect is configured via the hibernate.dialect property provided to the Hibernate SessionFactory. In the example above, this is set to org.hibernate.dialect.MySQL5Dialect as we are using MySQL 5 as the underlying Database. There are several other dialects supporting MySQL: org.hibernate.dialect.MySQL5InnoDBDialect – for MySQL 5.x with the InnoDB storage engine org.hibernate.dialect.MySQLDialect – for MySQL prior to 5.x org.hibernate.dialect.MySQLInnoDBDialect – for MySQL prior to 5.x with the InnoDB storage engine org.hibernate.dialect.MySQLMyISAMDialect – for all MySQL versions with the ISAM storage engine Hibernate supports SQL Dialects for every supported Database. 6. Usage At this point, Hibernate 3 is fully configured with Spring and we can inject the raw HibernateSessionFactory directly whenever we need to: public abstract class FooHibernateDAO{ @Autowired SessionFactory sessionFactory; ... protected Session getCurrentSession(){ return sessionFactory.getCurrentSession(); } } 7. Conclusion In this example, we configured Hiberate 3 with Spring – both with Java and XML configuration. The implementation of this simple project can be found in the github project – this is an Eclipse based project, so it should be easy to import and run as it is.

May 8, 2013

by Eugen Paraschiv

· 12,988 Views · 1 Like

Hebrew Search with ElasticSearch

Hebrew search is not an easy task, and HebMorph is a project I started several years ago to address that problem. After a certain period of inactivity I'm back actively working on it. I'm also happy to say there are already several live systems using it to enable Hebrew searches in their applications. This post is a short step-by-step guide on how to use HebMorph in an ElasticSearch installation. There are quite a few configuration options and things to consider when enabling Hebrew search, most are in the realm of performance vs relevance trade-offs, but I'll talk about those in a separate post. 0. What exactly is HebMorph HebMorph is a project a bit wider than just providing a Hebrew search plugin for ElasticSearch, but for the purpose of this post let us treat it in that narrow aspect. HebMorph has 3 main parts - the hspell dictionary files, the hebmorph-core package which is a wrapper around the dictionary files with important bits that allow for locating words even if they weren't written exactly as they appear in the dictionary, and the hebmorph-lucene package which contains various tools for processing streams of text into Lucene tokens - the searchable parts. To enable Hebrew search from ElasticSearch we are going to need to use the Hebrew analyzer class HebMorph provides to analyze incoming Hebrew texts. That is done by providing ElasticSearch with the HebMorph packages and then telling it to use the Hebrew analyzer on text fields as needed. 1. Get HebMorph and hspell At the moment you will have to compile HebMorph from sources yourself using Maven. In the future we might upload it to a centralized repository, but since we still actively working on a lot of stuff there it is still a bit too early for that. Probably the easiest way to get HebMorph is to do git clone from the main repository. The repository is located at https://github.com/synhershko/HebMorph and includes the latest hspell files already under /hspell-data-files. If you are new to git GitHub offers great tutorials for getting started with it, and they also enable you to download the entire source tree as a zip or a tarball. Once you have the sources, run mvn package or mvn install to create 2 jars - hebmorph-core and hebmorph-lucene. Those 2 packages are required before moving on to the next step. 2. Create an ElasticSearch plugin In this step we will create a new plugin which we will use in the next step to create the Hebrew analyzers in. If you already have a plugin you wish to use, skip to the next step. ElasticSearch plugins are compiled Java packages you simply drop to the plugins folder of your ElasticSearch installation and it gets detected automatically by the ElasticSearch instance once it is initialized. If you are new to this, you might want to read up a bit on that in the official ElasticSearch documentation. Here is a great guide to start with: http://jfarrell.github.io/ The gist of this is having a Java project with a es-plugin.properties file embedded as a resource and pointing to class that tells ElasticSearch what classes to load as plugins, and their plugin type. In the next section we will use this to add our own Analyzer implementation which makes use of HebMorph's capabilities. 3. Creating an Hebrew Analyzer HebMorph already comes with MorphAnalyzer - an Analyzer implementation which takes care of Hebrew-aware tokenization, lemmatization and whatnot. Because it is highly configurable, personally I prefer re-implementing it in the ElasticSearch plugin so it is easier to change the configurations in code. In case you wondered, I'm not planning in supporting external configurations for this as it is too subtle and you should really know what you are doing there. Don't forget to add dependencies to hebmorph-core and hebmorph-lucene to your project. My common Analyzer setup for Hebrew search looks like this: public abstract class HebrewAnalyzer extends ReusableAnalyzerBase { protected enum AnalyzerType { INDEXING, QUERY, EXACT } private static final DictRadix prefixesTree = LingInfo.buildPrefixTree(false); private static DictRadix dictRadix; private final StreamLemmatizer lemmatizer; private final LemmaFilterBase lemmaFilter; protected final Version matchVersion; protected final AnalyzerType analyzerType; protected final char originalTermSuffix = '$'; static { try { dictRadix = Loader.loadDictionaryFromHSpellData(new File(resourcesPath + "hspell-data-files"), true); } catch (IOException e) { // TODO log } } protected HebrewAnalyzer(final AnalyzerType analyzerType) throws IOException { this.matchVersion = matchVersion; this.analyzerType = analyzerType; lemmatizer = new StreamLemmatizer(null, dictRadix, prefixesTree, null); lemmaFilter = new BasicLemmaFilter(); } @Override protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) { // on query - if marked as keyword don't keep origin, else only lemmatized (don't suffix) // if word termintates with $ will output word$, else will output all lemmas or word$ if OOV if (analyzerType == AnalyzerType.QUERY) { final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter); src.setAlwaysSaveMarkedOriginal(true); src.setSuffixForExactMatch(originalTermSuffix); TokenStream tok = new SuffixKeywordFilter(src, '$'); return new TokenStreamComponents(src, tok); } if (analyzerType == AnalyzerType.EXACT) { // on exact - we don't care about suffixes at all, we always output original word with suffix only final HebrewTokenizer src = new HebrewTokenizer(reader, prefixesTree, null); TokenStream tok = new NiqqudFilter(src); tok = new LowerCaseFilter(matchVersion, tok); tok = new AlwaysAddSuffixFilter(tok, '$', false); return new TokenStreamComponents(src, tok); } // on indexing we should always keep both the stem and marked original word // will ignore $ && will always output all lemmas + origin word$ // basically, if analyzerType == AnalyzerType.INDEXING) final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter); src.setAlwaysSaveMarkedOriginal(true); TokenStream tok = new SuffixKeywordFilter(src, '$'); return new TokenStreamComponents(src, tok); } public static class HebrewIndexingAnalyzer extends HebrewAnalyzer { public HebrewIndexingAnalyzer() throws IOException { super(AnalyzerType.INDEXING); } } public static class HebrewQueryAnalyzer extends HebrewAnalyzer { public HebrewQueryAnalyzer() throws IOException { super(AnalyzerType.QUERY); } } public static class HebrewExactAnalyzer extends HebrewAnalyzer { public HebrewExactAnalyzer() throws IOException { super(AnalyzerType.EXACT); } } } You may notice how I created 3 separate analyzers - one for indexing, one for querying and the last for exact querying. I'll be talking more about this in future posts, but the idea is to be able to provide flexibility on querying while still allow for correct indexing. Configuring the analyzers to be picked up from ElasticSearch is rather easy now. First, you need to wrap each analyzer in a "provider", like so: public class HebrewQueryAnalyzerProvider extends AbstractIndexAnalyzerProvider { private final HebrewAnalyzer.HebrewQueryAnalyzer hebrewAnalyzer; @Inject public HebrewQueryAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) throws IOException { super(index, indexSettings, name, settings); hebrewAnalyzer = new HebrewAnalyzer.HebrewQueryAnalyzer(); } @Override public HebrewAnalyzer.HebrewQueryAnalyzer get() { return hebrewAnalyzer; } } After you've created such providers for all types of analyzers, create an AnalysisBinderProcessor like this (or update your existing one with definitions for the Hebrew analyzers): public class MyAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor { private final static HashMap> languageAnalyzers = new HashMap<>(); static { languageAnalyzers.put("hebrew", HebrewIndexingAnalyzerProvider.class); languageAnalyzers.put("hebrew_query", HebrewQueryAnalyzerProvider.class); languageAnalyzers.put("hebrew_exact", HebrewExactAnalyzerProvider.class); } public static boolean analyzerExists(final String analyzerName) { return languageAnalyzers.containsKey(analyzerName); } @Override public void processAnalyzers(final AnalyzersBindings analyzersBindings) { for (Map.Entry> entry : languageAnalyzers.entrySet()) { analyzersBindings.processAnalyzer(entry.getKey(), entry.getValue()); } } } Don't forget to update your Plugin class to catch the AnalysisBinderProcessor - it should look something like this (plus any other stuff you want to add there): public class MyPlugin extends AbstractPlugin { @Override public String name() { return "my-plugin"; } @Override public String description() { return "Implements custom actions required by me"; } @Override public void processModule(Module module) { if (module instanceof AnalysisModule) { ((AnalysisModule)module).addProcessor(new MyAnalysisBinderProcessor()); } } } 4. Using the Hebrew analyzers Compile the ElasticSearch plugin and drop it along with its dependencies in a folder under the /plugins folder of ElasticSearch. You now have 3 new types of analyzers at your disposal: "hebrew", "hebrew_query" and "hebrew_exact". For indexing, you want to use the "hebrew" analyzer. In your mapping, you can define a certain field or an entire set of fields to use that specific analyzer by setting the analyzer for that field. You can also leave the analyzer configuration blank, and specify the analyzer to use for those fields with unspecified analyzer using the _analyzer field in the index request. See more about both here and here. The "hebrew" analyzer will expand each term to all recognized lemmas; in case the word wasn't recognized it will try to tolerate spelling errors or missing Yud/Vav - most of the time it will be successful (with some rate of false positives, which the lemma-filters should remove to some degree). Some words will still remain unrecognized and thus will be indexed as-is. When querying using a QueryString query you can specify what analyzer to use - use the "hebrew_query" or "hebrew_exact" analyzer. The former will perform lemma expansion similar to the indexing analyzer, and the latter will avoid that and allow you to perform exact matches (useful when searching for names or exact phrases). I pretty much ignored a lot of the complexity involved in fine tuning searches for Hebrew, and many very cool things HebMorph allows you to do with Hebrew search for the sake of focus. I will revisit them in a later blog post. 5. Administration The hspell dictionary files are looked up by a physical location on disk - you will need to provide a path they are saved at. Since dictionaries update, it is sometimes easier to update them that way in a distributed environment like the one I'm working with. It may be desirable to have them compiled within the same jar file as the code itself - I'll be happy to accept a pull request to do that. The code above is working with ElasticSearch 0.90 GA and Lucene 4.2.1. I also had it running on earlier versions of both technologies, but may had to make a few minor changes. I assume the samples would break on future versions and I'll probably don't have much time going back and keeping it up to date, but bear in mind most of the time the changes are minor and easy to understand and make by yourself. Both HebMorph and the hspell dictionary are released under the AGPL3. For any questions on licensing, feel free to contact me.

May 6, 2013

by Itamar Syn-hershko

· 7,055 Views

Let's Talk ASM - String Concatenation

not a lot of developers today know assembly, which - regardless of your professional line of work - is a good skill to have. assembly teaches you think on a much lower level, going beyond the abstracted out layer provided by many of the high-level languages. today we're going to look at a way to implement a string concatenation function. specifically, i want to follow the following procedure for building the final result: ask the user for input append a crlf (carriage return + line feed) to the entered string append the entered string to the existing composite string follow back from step 1 until the user enters a terminator character display the composite string let's assume that you have zero knowledge of assembly. if that is the case, i would recommend starting here . in this example, i am using visual studio 2012 to test the code, but you might as well use an older version of the ide if you want. for convenience purposes, i would recommend downloading the basic framework code that comes for free from the writer of the introduction to 80x86 assembly language and computer architecture book: visual studio 2012 visual studio 2010 visual studio 2008 first, you have the standard declarations: .586 .model flat include io.h ; header file for input/output cr equ 0dh ; carriage return character lf equ 0ah ; line feed .stack 4096 .data prompt byte cr, lf, "original string? ",0 restitle byte "final result",0 stringin byte 1024 dup (?) stringout byte 1024 dup (?) linefeed byte cr, lf notice the reference to io.h - at this point you want a way to receive user input and display output data through standard winapi channels, and io.h does just that. some asm experts might argue that it is not a good idea to use winapi hooks in the context of a "pure" assembly program, for educational purposes, but in this situation the focus is on the inner workings of a different function. note: the program is adapted to the scenario where the execution of the string concatenation function is the sole purpose. as you will get a hang of the execution flow, you can easily adapt it to a scenario where some of the registers can be re-used. let's start by clearing the ecx and edx registers: .code _mainproc proc ; clear the ecx and edx registers because these will ; be used for length counters and sequential increments. xor ecx, ecx xor edx, edx once the strings will be entered by the user, i will need to find out the length of the string to append, in order to have a correct sequential memory address. now i need to get user input: input_data: ; prompt the user to enter the string he ultimately ; wants appended to the main string buffer. input prompt, stringin, 40 ; read ascii characters ; make sure that the string doesn't start with the $ character ; which would automatically mean that we need to terminate the ; reading process cmp stringin, '$' je done lea eax, [stringout + edx] ; destination address push eax ; push the destination on the stack lea eax, [stringin] ; source address push eax ; push the source on the stack call strcopy ; call the string copy procedure once the string is entered, i can check whether the terminator character - "$", was used. one of the great things about the cmp instruction is the fact that it checks the starting address of the entered string, therefore i can simply compare the entered data with a single character. in case the character is encountered, the program flow terminates at done, where the output is displayed: done: ; output the new data. output restitle, stringout mov eax, 0 ret strcopy is an internal procedure that will simply copy a string from one memory address to another: strcopy proc near32 push ebp mov ebp, esp push edi push esi pushf mov esi, [ebp+8] mov edi, [ebp+12] cld whilenonull: cmp byte ptr [esi], 0 je endwhilenonull movsb jmp whilenonull endwhilenonull: mov byte ptr [edi], 0 popf pop esi pop edi pop ebp ret 8 strcopy endp to make sure that the next string is properly appended, i need to find out the length of the previous one, for a correct memory address offset: ; let's get the length of the current string - move it ; to the proper register so that we can perform the measurement mov edi, eax ; find the length of the string that was just entered sub ecx, ecx sub al, al not ecx cld repne scasb not ecx dec ecx add edx, ecx repne scasb is used for an in-string iterative null terminator search (you can read more about it here ). it will decrement ecx for each character. ; we need to append the linefeed (crlf) to the string so we apply ; the same string concatenation procedure for that sequence. lea eax, [stringout + edx] ; destination address push eax ; first parameter lea eax, [linefeed] ; source push eax ; second parameter call strcopy ; call string copy procedure mov edi, eax ; we know that the crlf characters are 2 entities, therefore ; increment the overall counter by 2. add edx, 2 ; ask for more input because no terminator character was used. jmp input_data once the basic input data is processed, i can append the crlf sequence and increment edx for the proper offset, after which the program flow is being reset from the point where the user has to enter the next character sequence.

May 3, 2013

by Denzel D.

· 13,117 Views