Databases Resources

The Latest Databases Topics

Sometimes it is useful to “backcast” a time series, that is, to forecast in reverse time. Although there are no built-in R functions to do this, it is very easy to implement. Suppose x is our time series and we want to backcast for h periods. Here is some code that should work for most univariate time series. The example is non-seasonal, but the code will also work with seasonal data.

    library(forecast)
    x <- WWWusage
    h <- 20
    f <- frequency(x)
    # Reverse time
    revx <- ts(rev(x), frequency=f)
    # Forecast
    fc <- forecast(auto.arima(revx), h)
    plot(fc)
    # Reverse time again
    fc$mean <- ts(rev(fc$mean), end=tsp(x)[1] - 1/f, frequency=f)
    fc$upper <- fc$upper[h:1,]
    fc$lower <- fc$lower[h:1,]
    fc$x <- x
    # Plot result
    plot(fc, xlim=c(tsp(x)[1]-h/f, tsp(x)[2]))

February 28, 2014

by Rob J Hyndman

· 5,759 Views

Hibernate Query by Example (QBE)

What is It

Query by example is an alternative querying technique supported by the main JPA vendors but not by the JPA specification itself. QBE returns a result set depending on the properties that were set on an instance of the queried class. So if I create an Address entity and fill in the city field, the query will select all the Address entities having the same city field as the given Address entity. The typical use case of QBE is evaluating a search form where the user can fill in any search fields and gets the results based on the given search fields. In this case QBE can reduce code size significantly.

When to Use

· Using many fields of an entity in a query
· User selects which fields of an Entity to use in a query
· We are refactoring the entities frequently and don’t want to worry about breaking the queries that rely on them

Limitations

· QBE is not available in JPA 1.0 or 2.0
· Version properties, identifiers and associations are ignored
· The query object should be annotated with @Entity

Test Data

I used the following entities to test the QBE feature of Hibernate:

· Address (long id, String city, String street, String countryISO2Code, AddressType addressType)
· AddressType (Integer type, String description)

Imports

The examples will refer to the following classes:

    import org.hibernate.Criteria;
    import org.hibernate.Session;
    import org.hibernate.criterion.Example;
    import org.hibernate.criterion.Restrictions;
    import org.junit.Test;
    import java.util.List;

Utility Methods

I also made two utility methods to print a list of each of the two entity types:

    private void listAddresses(List<Address> addresses) {
        for (Address address : addresses) {
            System.out.println(address.getId() + ", "
                + address.getCountryISO2Code() + ", "
                + address.getCity() + ", "
                + address.getStreet() + ", "
                + address.getAddressType().getType() + ", "
                + address.getAddressType().getDescription());
        }
    }

    private void listAddressTypes(List<AddressType> addressTypes) {
        for (AddressType addressType : addressTypes) {
            System.out.println(addressType.getType() + ", " + addressType.getDescription());
        }
    }

Example 1: Equals

This example code returns the Address entities matching the given CountryISO2Code and City.

Method:

    @Test
    public void testEquals() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        Address address = new Address();
        address.setCountryISO2Code("US");
        address.setCity("CHICAGO");
        Example addressExample = Example.create(address);
        Criteria criteria = session.createCriteria(Address.class).add(addressExample);
        listAddresses(criteria.list());
    }

Result:

    75, US, CHICAGO, Los Angeles Way2, 6, Customer
    170, US, CHICAGO, Jackson Blvd 33a, 4, Delivery
    63, US, CHICAGO, Main Avenue 1, 5, Bill to
    37, US, CHICAGO, Jackson Blvd 33a, 4, Delivery
    36, US, CHICAGO, Jackson Blvd 33a, 4, Delivery

Example 2: Id Limitation

This example shows that id fields in the query object are ignored.

Method:

    @Test
    public void testIdLimitation() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        Address address = new Address();
        address.setCountryISO2Code("US");
        address.setCity("CHICAGO");
        address.setId(100); // setting id is ignored
        Example addressExample = Example.create(address);
        Criteria criteria = session.createCriteria(Address.class).add(addressExample);
        listAddresses(criteria.list());
    }

Result:

    75, US, CHICAGO, Los Angeles Way2, 6, Customer
    170, US, CHICAGO, Jackson Blvd 33a, 4, Delivery
    63, US, CHICAGO, Main Avenue 1, 5, Bill to
    37, US, CHICAGO, Jackson Blvd 33a, 4, Delivery
    36, US, CHICAGO, Jackson Blvd 33a, 4, Delivery

Example 3: Association Limitation

Associations of the query object are ignored, too.
Method:

    @Test
    public void testAssociationLimitation() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        Address address = new Address();
        address.setCountryISO2Code("US");
        address.setCity("CHICAGO");
        AddressType addressType = new AddressType();
        addressType.setType(5);
        address.setAddressType(addressType); // setting an association is ignored
        Example addressExample = Example.create(address);
        Criteria criteria = session.createCriteria(Address.class).add(addressExample);
        listAddresses(criteria.list());
    }

Result:

    75, US, CHICAGO, Los Angeles Way2, 6, Customer
    170, US, CHICAGO, Jackson Blvd 33a, 4, Delivery
    63, US, CHICAGO, Main Avenue 1, 5, Bill to
    37, US, CHICAGO, Jackson Blvd 33a, 4, Delivery
    36, US, CHICAGO, Jackson Blvd 33a, 4, Delivery

Example 4: Like

QBE supports like in the query object if we enable it with Example.enableLike().

Method:

    @Test
    public void testLike() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        Address address = new Address();
        address.setCountryISO2Code("US");
        address.setCity("AT%");
        Example addressExample = Example.create(address).enableLike();
        Criteria criteria = session.createCriteria(Address.class).add(addressExample);
        listAddresses(criteria.list());
    }

Result:

    83, US, ATLANTA, null, 6, Customer
    184, US, ATLANTA, null, 1, Shipper
    25, US, ATLANTA, null, 1, Shipper

Example 5: ExcludeProperty

We can exclude a property with Example.excludeProperty(String propertyName).
Method:

    @Test
    public void testExcludeProperty() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        Address address = new Address();
        address.setCountryISO2Code("US");
        address.setCity("AT%");
        Example addressExample = Example.create(address).enableLike()
            .excludeProperty("countryISO2Code"); // countryISO2Code is a property of Address
        Criteria criteria = session.createCriteria(Address.class).add(addressExample);
        listAddresses(criteria.list());
    }

Result:

    154, GR, ATHENS, BETA ALPHA Street 5, 2, Consignee
    83, US, ATLANTA, null, 6, Customer
    25, US, ATLANTA, null, 1, Shipper
    184, US, ATLANTA, null, 1, Shipper

Example 6: IgnoreCase

Case-insensitive search is supported by Example.ignoreCase().

Method:

    @Test
    public void testIgnoreCase() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        AddressType addressType = new AddressType();
        addressType.setDescription("customer");
        Example addressTypeExample = Example.create(addressType).ignoreCase();
        Criteria criteria = session.createCriteria(AddressType.class)
            .add(addressTypeExample);
        listAddressTypes(criteria.list());
    }

Result:

    6, Customer

Example 7: ExcludeZeroes

We can ignore 0 values of the query object with Example.excludeZeroes().

Method:

    @Test
    public void testExcludeZeroes() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        AddressType addressType = new AddressType();
        addressType.setType(0);
        addressType.setDescription("Customer");
        Example addressTypeExample = Example.create(addressType)
            .excludeZeroes();
        Criteria criteria = session.createCriteria(AddressType.class)
            .add(addressTypeExample);
        listAddressTypes(criteria.list());
    }

Result:

    6, Customer

Example 8: Combining with Criteria

QBE can be combined with a criteria query. In this example we add a further restriction to the query object using a criteria query.
Method:

    @Test
    public void testCombiningWithCriteria() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        AddressType addressType = new AddressType();
        addressType.setDescription("Customer");
        Example addressTypeExample = Example.create(addressType);
        Criteria criteria = session
            .createCriteria(AddressType.class).add(addressTypeExample)
            .add(Restrictions.eq("type", 6));
        listAddressTypes(criteria.list());
    }

Result:

    6, Customer

Example 9: Association

With a criteria query we can filter both sides of an association, using two query objects.

Method:

    @Test
    public void testAssociation() throws Exception {
        Session session = (Session) entityManager.getDelegate();
        Address address = new Address();
        address.setCountryISO2Code("US");
        AddressType addressType = new AddressType();
        addressType.setType(6);
        Example addressExample = Example.create(address);
        Example addressTypeExample = Example.create(addressType);
        Criteria criteria = session.createCriteria(Address.class).add(addressExample)
            .createCriteria("addressType").add(addressTypeExample); // addressType is a property of Address
        listAddresses(criteria.list());
    }

Result:

    84, US, BOSTON, null, 6, Customer
    83, US, ATLANTA, null, 6, Customer
    82, US, SAN FRANCISCO, null, 6, Customer
    75, US, CHICAGO, Los Angeles Way2, 6, Customer

EclipseLink

EclipseLink QBE uses QueryByExamplePolicy, ReadObjectQuery and JpaHelper:

    QueryByExamplePolicy qbePolicy = new QueryByExamplePolicy();
    qbePolicy.excludeDefaultPrimitiveValues();
    Address address = new Address();
    address.setCity("CHICAGO");
    ReadObjectQuery roq = new ReadObjectQuery(address, qbePolicy);
    Query query = JpaHelper.createQuery(roq, entityManager);

OpenJPA

OpenJPA uses OpenJPAQueryBuilder:

    CriteriaQuery<Address> cq = openJPAQueryBuilder.createQuery(Address.class);
    Address address = new Address();
    address.setCity("CHICAGO");
    cq.where(openJPAQueryBuilder.qbe(cq.from(Address.class), address));

References

Hibernate:

· Srinivas Guruzu and Gary Mak: Hibernate Recipes: A Problem-Solution Approach (Apress)
· http://docs.jboss.org/hibernate/core/3.3/reference/en/html/querycriteria.html#querycriteria-examples
· http://www.java2s.com/Code/Java/Hibernate/CriteriaQBEQueryByExampleCriteria.htm
· http://www.dzone.com/snippets/hibernate-query-example
· http://gal-levinsky.blogspot.de/2012/01/qbe-pattern.html

Hibernate associations:

· http://stackoverflow.com/questions/9309884/query-by-example-on-associations
· http://stackoverflow.com/questions/8236596/hibernate-query-by-example-equivalent-of-association-criteria-query

JPA:

· http://stackoverflow.com/questions/2880209/jpa-findbyexample

EclipseLink:

· http://www.coderanch.com/t/486528/ORM/databases/findByExample-JPA-book

OpenJPA:

· http://www.ibm.com/developerworks/java/library/j-typesafejpa/#N10C18
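To make the matching rule behind QBE concrete (unset properties add no restriction, identifiers are ignored), here is a minimal plain-Java sketch that filters an in-memory list by an example object. It is only an illustration of the idea and not how Hibernate builds its SQL; the QbeSketch class, its Address stand-in and the matches and demo methods are hypothetical names invented for this sketch.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QbeSketch {

    // Hypothetical stand-in for the article's Address entity (plain object, no JPA).
    public static class Address {
        Long id;
        String city;
        String countryISO2Code;

        public Address(Long id, String city, String countryISO2Code) {
            this.id = id;
            this.city = city;
            this.countryISO2Code = countryISO2Code;
        }
    }

    // True when every non-null field of the example, except "id",
    // equals the same field of the candidate.
    public static boolean matches(Object example, Object candidate) {
        try {
            for (Field field : example.getClass().getDeclaredFields()) {
                if (field.isSynthetic() || field.getName().equals("id")) {
                    continue; // identifiers are ignored, mirroring the limitation described above
                }
                field.setAccessible(true);
                Object wanted = field.get(example);
                if (wanted != null && !wanted.equals(field.get(candidate))) {
                    return false;
                }
            }
            return true;
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    // Filters a small in-memory list and returns the ids of the matching rows.
    public static List<Long> demo() {
        List<Address> rows = Arrays.asList(
                new Address(1L, "CHICAGO", "US"),
                new Address(2L, "ATLANTA", "US"),
                new Address(3L, "CHICAGO", "US"));
        Address example = new Address(100L, "CHICAGO", null); // id is set but ignored
        List<Long> ids = new ArrayList<>();
        for (Address row : rows) {
            if (matches(example, row)) {
                ids.add(row.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [1, 3]
    }
}
```

The country field is left null in the example object, so it places no restriction on the result, which is the same effect the search-form use case relies on.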

February 27, 2014

by Donat Szilagyi

· 62,572 Views · 3 Likes

A Deeper Look into the Java 8 Date and Time API

Within this post we will have a deeper look into the new Date/Time API we get with Java 8 (JSR 310). Please note that this post is mainly driven by code examples that show the new API functionality. I think the examples are self-explanatory so I did not spend much time writing text around them :-) Let's get started!

Working with Date and Time Objects

All classes of the Java 8 Date/Time API are located within the java.time package. The first class we want to look at is java.time.LocalDate. A LocalDate represents a year-month-day date without time. We start with creating new LocalDate instances:

    // the current date
    LocalDate currentDate = LocalDate.now();
    // 2014-02-10
    LocalDate tenthFeb2014 = LocalDate.of(2014, Month.FEBRUARY, 10);
    // month values start at 1 (2014-08-01)
    LocalDate firstAug2014 = LocalDate.of(2014, 8, 1);
    // the 65th day of 2010 (2010-03-06)
    LocalDate sixtyFifthDayOf2010 = LocalDate.ofYearDay(2010, 65);

LocalTime and LocalDateTime are the next classes we look at. Both work similarly to LocalDate. A LocalTime works with time (without dates) while LocalDateTime combines date and time in one class:

    LocalTime currentTime = LocalTime.now(); // current time
    LocalTime midday = LocalTime.of(12, 0); // 12:00
    LocalTime afterMidday = LocalTime.of(13, 30, 15); // 13:30:15
    // 12345th second of day (03:25:45)
    LocalTime fromSecondsOfDay = LocalTime.ofSecondOfDay(12345);
    // dates with times, e.g. 2014-02-18 19:08:37.950
    LocalDateTime currentDateTime = LocalDateTime.now();
    // 2014-10-02 12:30
    LocalDateTime secondOct2014 = LocalDateTime.of(2014, 10, 2, 12, 30);
    // 2014-12-24 12:00
    LocalDateTime christmas2014 = LocalDateTime.of(2014, Month.DECEMBER, 24, 12, 0);

By default LocalDate/Time classes will use the system clock in the default time zone.
We can change this by providing a time zone or an alternative Clock implementation:

    // current (local) time in Los Angeles
    LocalTime currentTimeInLosAngeles = LocalTime.now(ZoneId.of("America/Los_Angeles"));
    // current time in UTC time zone
    LocalTime nowInUtc = LocalTime.now(Clock.systemUTC());

From LocalDate/Time objects we can get all sorts of useful information we might need. Some examples:

    LocalDate date = LocalDate.of(2014, 2, 15); // 2014-02-15
    boolean isBefore = LocalDate.now().isBefore(date); // false
    // information about the month
    Month february = date.getMonth(); // FEBRUARY
    int februaryIntValue = february.getValue(); // 2
    int minLength = february.minLength(); // 28
    int maxLength = february.maxLength(); // 29
    Month firstMonthOfQuarter = february.firstMonthOfQuarter(); // JANUARY
    // information about the year
    int year = date.getYear(); // 2014
    int dayOfYear = date.getDayOfYear(); // 46
    int lengthOfYear = date.lengthOfYear(); // 365
    boolean isLeapYear = date.isLeapYear(); // false
    DayOfWeek dayOfWeek = date.getDayOfWeek();
    int dayOfWeekIntValue = dayOfWeek.getValue(); // 6
    String dayOfWeekName = dayOfWeek.name(); // SATURDAY
    int dayOfMonth = date.getDayOfMonth(); // 15
    LocalDateTime startOfDay = date.atStartOfDay(); // 2014-02-15 00:00
    // time information
    LocalTime time = LocalTime.of(15, 30); // 15:30:00
    int hour = time.getHour(); // 15
    int second = time.getSecond(); // 0
    int minute = time.getMinute(); // 30
    int secondOfDay = time.toSecondOfDay(); // 55800

Some information can be obtained without providing a specific date. For example, we can use the Year class if we need information about a specific year:

    Year currentYear = Year.now();
    Year twoThousand = Year.of(2000);
    boolean isLeap = currentYear.isLeap(); // false (in 2014)
    int length = currentYear.length(); // 365 (in 2014)
    // sixty-fourth day of 2014 (2014-03-05)
    LocalDate date = Year.of(2014).atDay(64);

We can use the plus and minus methods to add or subtract specific amounts of time.
Note that these methods always return a new instance (Java 8 date/time classes are immutable).

    LocalDate tomorrow = LocalDate.now().plusDays(1);
    // 5 hours and 30 minutes ago
    LocalDateTime dateTime = LocalDateTime.now().minusHours(5).minusMinutes(30);

TemporalAdjusters are another nice way to manipulate dates. TemporalAdjuster is a single-method interface that is used to separate the process of adjustment from actual date/time objects. A set of common TemporalAdjusters can be accessed using static methods of the TemporalAdjusters class.

    LocalDate date = LocalDate.of(2014, Month.FEBRUARY, 25); // 2014-02-25
    // first day of February 2014 (2014-02-01)
    LocalDate firstDayOfMonth = date.with(TemporalAdjusters.firstDayOfMonth());
    // last day of February 2014 (2014-02-28)
    LocalDate lastDayOfMonth = date.with(TemporalAdjusters.lastDayOfMonth());

Static imports make this more fluent to read:

    import static java.time.temporal.TemporalAdjusters.*;
    ...
    // last day of 2014 (2014-12-31)
    LocalDate lastDayOfYear = date.with(lastDayOfYear());
    // first day of next month (2014-03-01)
    LocalDate firstDayOfNextMonth = date.with(firstDayOfNextMonth());
    // next Sunday (2014-03-02)
    LocalDate nextSunday = date.with(next(DayOfWeek.SUNDAY));

Time Zones

Working with time zones is another big topic that is simplified by the new API. The LocalDate/Time classes we have seen so far do not contain information about a time zone.
If we want to work with a date/time in a certain time zone we can use ZonedDateTime or OffsetDateTime:

    ZoneId losAngeles = ZoneId.of("America/Los_Angeles");
    ZoneId berlin = ZoneId.of("Europe/Berlin");
    // 2014-02-20 12:00
    LocalDateTime dateTime = LocalDateTime.of(2014, 2, 20, 12, 0);
    // 2014-02-20 12:00, Europe/Berlin (+01:00)
    ZonedDateTime berlinDateTime = ZonedDateTime.of(dateTime, berlin);
    // 2014-02-20 03:00, America/Los_Angeles (-08:00)
    ZonedDateTime losAngelesDateTime = berlinDateTime.withZoneSameInstant(losAngeles);
    int offsetInSeconds = losAngelesDateTime.getOffset().getTotalSeconds(); // -28800
    // a collection of all available zones
    Set<String> allZoneIds = ZoneId.getAvailableZoneIds();
    // using offsets
    LocalDateTime date = LocalDateTime.of(2013, Month.JULY, 20, 3, 30);
    ZoneOffset offset = ZoneOffset.of("+05:00");
    // 2013-07-20 03:30 +05:00
    OffsetDateTime plusFive = OffsetDateTime.of(date, offset);
    // 2013-07-19 20:30 -02:00
    OffsetDateTime minusTwo = plusFive.withOffsetSameInstant(ZoneOffset.ofHours(-2));

Timestamps

Classes like LocalDate and ZonedDateTime provide a human view on time. However, often we need to work with time viewed from a machine perspective. For this we can use the Instant class, which represents timestamps. An Instant counts the time beginning from the first second of January 1, 1970 (1970-01-01 00:00:00), also called the epoch. Instant values can be negative if they occurred before the epoch. They follow ISO 8601, the standard for representing date and time.

    // current time
    Instant now = Instant.now();
    // from unix timestamp, 2010-01-01 12:00:00
    Instant fromUnixTimestamp = Instant.ofEpochSecond(1262347200);
    // same time in millis
    Instant fromEpochMilli = Instant.ofEpochMilli(1262347200000L);
    // parsing from ISO 8601
    Instant fromIso8601 = Instant.parse("2010-01-01T12:00:00Z");
    // toString() returns ISO 8601 format, e.g. 2014-02-15T01:02:03Z
    String toIso8601 = now.toString();
    // as unix timestamp
    long toUnixTimestamp = now.getEpochSecond();
    // in millis
    long toEpochMillis = now.toEpochMilli();
    // plus/minus methods are available too
    Instant nowPlusTenSeconds = now.plusSeconds(10);

Periods and Durations

Period and Duration are two other important classes. As the names suggest, they represent a quantity or amount of time. A Period uses date-based values (years, months, days) while a Duration uses seconds or nanoseconds to define an amount of time. Duration is most suitable when working with Instants and machine time. Periods and Durations can contain negative values if the end point occurs before the starting point.

    // periods
    LocalDate firstDate = LocalDate.of(2010, 5, 17); // 2010-05-17
    LocalDate secondDate = LocalDate.of(2015, 3, 7); // 2015-03-07
    Period period = Period.between(firstDate, secondDate);
    int days = period.getDays(); // 18
    int months = period.getMonths(); // 9
    int years = period.getYears(); // 4
    boolean isNegative = period.isNegative(); // false
    Period twoMonthsAndFiveDays = Period.ofMonths(2).plusDays(5);
    LocalDate sixthOfJanuary = LocalDate.of(2014, 1, 6);
    // add two months and five days to 2014-01-06, result is 2014-03-11
    LocalDate eleventhOfMarch = sixthOfJanuary.plus(twoMonthsAndFiveDays);

    // durations
    Instant firstInstant = Instant.ofEpochSecond(1294881180); // 2011-01-13 01:13
    Instant secondInstant = Instant.ofEpochSecond(1294708260); // 2011-01-11 01:11
    Duration between = Duration.between(firstInstant, secondInstant);
    // negative because firstInstant is after secondInstant (-172920)
    long seconds = between.getSeconds();
    // get absolute result in minutes (2882)
    long absoluteResult = between.abs().toMinutes();
    // two hours in seconds (7200)
    long twoHoursInSeconds = Duration.ofHours(2).getSeconds();

Formatting and Parsing

Formatting and parsing is another big topic when working with dates and times.
In Java 8 this can be accomplished by using the format() and parse() methods:

    // 2014-04-01 10:45
    LocalDateTime dateTime = LocalDateTime.of(2014, Month.APRIL, 1, 10, 45);
    // format as basic ISO date format (20140401)
    String asBasicIsoDate = dateTime.format(DateTimeFormatter.BASIC_ISO_DATE);
    // format as ISO week date (2014-W14-2)
    String asIsoWeekDate = dateTime.format(DateTimeFormatter.ISO_WEEK_DATE);
    // format ISO date time (2014-04-01T10:45:00)
    String asIsoDateTime = dateTime.format(DateTimeFormatter.ISO_DATE_TIME);
    // using a custom pattern (01/04/2014)
    String asCustomPattern = dateTime.format(DateTimeFormatter.ofPattern("dd/MM/yyyy"));
    // french date formatting (1. avril 2014)
    String frenchDate = dateTime.format(DateTimeFormatter.ofPattern("d. MMMM yyyy", new Locale("fr")));
    // using short german date/time formatting (01.04.14 10:45)
    DateTimeFormatter formatter = DateTimeFormatter.ofLocalizedDateTime(FormatStyle.SHORT)
        .withLocale(new Locale("de"));
    String germanDateTime = dateTime.format(formatter);
    // parsing date strings
    LocalDate fromIsoDate = LocalDate.parse("2014-01-20");
    LocalDate fromIsoWeekDate = LocalDate.parse("2014-W14-2", DateTimeFormatter.ISO_WEEK_DATE);
    LocalDate fromCustomPattern = LocalDate.parse("20.01.2014", DateTimeFormatter.ofPattern("dd.MM.yyyy"));

Conversion

Of course we do not always have objects of the type we need. Therefore, we need a way to convert between the different date/time related types.
The following examples show some of the possible conversion options:

    // LocalDate/LocalTime <-> LocalDateTime
    LocalDate date = LocalDate.now();
    LocalTime time = LocalTime.now();
    LocalDateTime dateTimeFromDateAndTime = LocalDateTime.of(date, time);
    LocalDate dateFromDateTime = LocalDateTime.now().toLocalDate();
    LocalTime timeFromDateTime = LocalDateTime.now().toLocalTime();
    // Instant <-> LocalDateTime
    Instant instant = Instant.now();
    LocalDateTime dateTimeFromInstant = LocalDateTime.ofInstant(instant, ZoneId.of("America/Los_Angeles"));
    Instant instantFromDateTime = LocalDateTime.now().toInstant(ZoneOffset.ofHours(-2));
    // convert old date/calendar/timezone classes
    Instant instantFromDate = new Date().toInstant();
    Instant instantFromCalendar = Calendar.getInstance().toInstant();
    ZoneId zoneId = TimeZone.getDefault().toZoneId();
    ZonedDateTime zonedDateTimeFromGregorianCalendar = new GregorianCalendar().toZonedDateTime();
    // convert to old classes
    Date dateFromInstant = Date.from(Instant.now());
    TimeZone timeZone = TimeZone.getTimeZone(ZoneId.of("America/Los_Angeles"));
    GregorianCalendar gregorianCalendar = GregorianCalendar.from(ZonedDateTime.now());

Conclusion

With Java 8 we get a very rich API for working with date and time located in the java.time package. The API can completely replace old classes like java.util.Date or java.util.Calendar with newer, more flexible classes. Because most of its classes are immutable, the new API helps in building thread-safe systems. The source of the examples can be found on GitHub.
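Most snippets in this post depend on the current date, but a few use fixed inputs and can be checked directly. The following self-contained sketch recomputes some of those values with the standard java.time classes; the DateTimeChecks class and its method names are made up for this illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.Month;
import java.time.Period;
import java.time.format.DateTimeFormatter;

public class DateTimeChecks {

    // Period between 2010-05-17 and 2015-03-07, rendered as years, months, days.
    public static String period() {
        Period p = Period.between(LocalDate.of(2010, 5, 17), LocalDate.of(2015, 3, 7));
        return p.getYears() + "y " + p.getMonths() + "m " + p.getDays() + "d";
    }

    // Duration from the later instant to the earlier one, so the result is negative.
    public static Duration between() {
        Instant first = Instant.ofEpochSecond(1294881180L);
        Instant second = Instant.ofEpochSecond(1294708260L);
        return Duration.between(first, second);
    }

    // 2014-04-01 10:45 formatted with two of the predefined ISO formatters.
    public static String basicIsoDate() {
        return LocalDateTime.of(2014, Month.APRIL, 1, 10, 45)
                .format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static String isoWeekDate() {
        return LocalDateTime.of(2014, Month.APRIL, 1, 10, 45)
                .format(DateTimeFormatter.ISO_WEEK_DATE);
    }

    public static void main(String[] args) {
        System.out.println(period());                   // 4y 9m 18d
        System.out.println(between().getSeconds());     // -172920
        System.out.println(between().abs().toMinutes()); // 2882
        System.out.println(basicIsoDate());             // 20140401
        System.out.println(isoWeekDate());              // 2014-W14-2
    }
}
```

Because every input is a fixed date or epoch second, the output does not depend on the system clock or default time zone.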

February 27, 2014

by Michael Scharhag

· 209,600 Views · 18 Likes

Choosing Columns for Agile Team Boards

"And let Reform her columns roll. With thunder peal, and lightning flash..." - Ignis, "The Genius of Liberty" Vol III No. 2 Introduction In the past couple of articles we've seen how a Kanban board is able to help in the attainment of transparency and the stabilization of an agile team. Today we'll see if we can resolve one of the most common queries that result from this usage: how does a team decide which columns should appear on the board for tracking the progress of work items? The simplest case...and why it may not be enough When we set up a Kanban board in the last article, there were only three columns - or to use the correct term, "stations". These were a "Backlog" station (essentially a "To Do" list of work that has not yet been started), a station for showing which work is "In Progress", and finally a station for representing the work that has been completed. You can't get much simpler than that, and it raises the question of why you would want to make it more complicated. In practice however, there are at least two situations in which this minimalist approach will be found wanting: A team isn't cross-trained, and its members effectively work in skill silos. Consequently we can expect dependencies between team members, some of whom may become blocked while waiting for others to complete their part of the work. The wasted time this incurs will not be apparent if it is all considered to be "work in progress". For example, the team may be split into developers and testers, and bottlenecks may arise as work passes between them. We may need to break Work In Progress down into further stations in order to expose this waste more fully. Bottlenecks arise due to constraints in the workflow, which is a different problem. In this situation a team might be fully cross-trained, and none of its members become blocked waiting for another. Rather, waste arises because the work itself is inefficiently staged. 
This often happens with activities like development, review, and test. For example if two people are required for a review, but only one is needed for development and test, then a bottleneck may well occur. Work will build-up awaiting review due to contention for these resources and the value of the investment in effort will start to depreciate. Again the incurral of this waste will not be apparent if it is all considered to be "work in progress". More stations are needed to expose it. Adding further stations These then are the two key things to consider when choosing additional stations. We're out to expose waste caused by work silos, or by the inappropriate staging of activities. Either of these can introduce constraints and become the source of bottlenecks in a value stream. Sometimes blockages can occur due to a dependency on something that must be done outside of the team. When this happens, it implies that the team are not fully in control of their own process, and consequently are unable to meet their own Definition of Done. They don't have all of the skills or resources needed. This is a problem and a contra-indication to agile practice. If it happens it's essential to make the dependency clear so that it can be challenged and removed. We may therefore choose to have an "externally blocked" column on the board to expose problems of this nature. It isn't really a station, because it doesn't represent a state in which value is added. Rather, it shows that an item has stalled within the value stream and that the team are not in a position to provide remedy. Another option is to place a red, day-glo sticky note on the ticket highlighting the seriousness of the problem. This is a clear signal that an impediment has occurred...that is to say, in Lean-Kanban terminology, it is an andon flag. In this case the flag shows that a major blockage has arisen and needs resolving. Challenging the boundaries Now we need to turn our attention to the boundaries of the board. 
There are two principal areas we should look at. Firstly, on the leftmost side of the board, we can see the work that a team inducts into its "Backlog" prior to actioning it. Secondly, on the rightmost side, we can see the work that the team considers to be "Done". These two boundaries are very often a source of waste. To understand why, just consider how backlogs are often allowed to grow without effective limit, and at how completed work may be permitted to accumulate in a "Done" column. These stations may not represent work in progress as far as the team is concerned, but it would be foolish to deny that they are batches too. After all, they are still part of the value stream. They represent inventory that is depreciating in value, or relevance, until something useful is done with it all. It behooves us to query the waste that is incurred, and to ask how the size of these batches may be constrained. Specifically: How can work be inducted into a backlog with minimal accumulation and delay? How can value be delivered to consumers as soon as work is completed? In short, what can be done to "lean" these process boundaries, so that inventory in the team's part of the value stream enters and exits in a "just-in-time" fashion? We can answer these questions by improving transparency still further. This can mean the refinement of the "Backlog" and "Done" columns into other, more finely-grained stations. For example, work might be building up in a Product Backlog because it is not being triaged appropriately, or perhaps because acceptance criteria are insufficiently well defined. We might be able to expose these problems by replacing a backlog with "Triaged", "Accepted", and "Ready" stations. At the other side of the board, completed work may be building up in the "Done" column because a release cannot yet be made. Additional stations such as "System Integrated", "In User Acceptance", and "Awaiting Release" could add clarity here. 
Removing stations The simple, 3 column board we started has now exploded into a behemoth of perhaps ten columns or more. This may seem like an excessively complex structure for a workflow and a casual observer may criticize it for being fundamentally unagile. After all, inventory should either be work in progress by an agile team, or it will be awaiting their attention or have already been completed. The criticism is a valid one but we need to bear one thing in mind: these stations are there to expose problems. Only once transparency has been attained can we hope to provide remedy. The bottlenecks, along with the diagnostic stations we added to reveal them, can then be removed. Conclusion Knowing how many "columns" to include on an agile board, and what they should be, is something of a black art to many agile teams. In this article we've looked at the issues involved in making this decision. The board of a fully cross-trained team should be elegant in its simplicity, but when problems arise we must be prepared to do some digging in order to root out their causes.

February 25, 2014

by $$anonymous$$

· 12,560 Views · 2 Likes

Voron & Time Series Data: Getting Real Data Outputs

So far, we have just put the data in and out. And we have had a pretty good track record doing so. However, what do we do with the data now that we have it? As you can expect, we need to read it out, usually by specific date ranges. The interesting thing is that we usually are not interested in just a single channel; we care about multiple channels. And for fun, those channels might be synchronized or not. An example of the first might be the current speed and the current engine temperature in a car. They generally share the exact same timestamps. An example of out-of-sync channels is when you have a sensor on a rooftop measuring rainfall, and another sensor in the sewer measuring water flow rates. (Again, thanks to Dan for helping me with the domain.)

This is interesting, because it presents quite a few interesting problems:

We need to merge different streams into a unified view.
We need to handle both matching and non-matching sequences.
We need to handle erroneous data: what happens when we have two readings for the same time for the same sensor? Yes, that shouldn’t happen, but it does.

I solved this with the following API:

    public class RangeEntry
    {
        public DateTime Timestamp;
        public double?[] Values;
    }

    IEnumerable<RangeEntry> results = dts.ScanRanges(DateTime.MinValue, DateTime.MaxValue, new[]
    {
        "6febe146-e893-4f64-89f8-527f2dbaae9b",
        "707dcb42-c551-4f1a-9203-e4b0852516cf",
        "74d5bee8-9a7b-4d4e-bd85-5f92dfc22edb",
        "7ae29feb-6178-4930-bc38-a90adf99cfd3",
    });

This API gives me the results in time order, with the values in the same positions as the ids requested. A position is null if that particular sensor channel has no value for that time. The actual implementation relies on this method:

    // The element type name "Entry" is assumed here; it exposes Timestamp and Value.
    IEnumerable<Entry> ScanRange(DateTime start, DateTime end, string id)

All this does is provide all the entries in a particular date range, for a particular channel.

Let us see how we implement multi-channel scanning on top of this:

    private class PendingEnumerator
    {
        public IEnumerator<Entry> Enumerator; // "Entry" is the assumed per-channel element type
        public int Index;
    }

    private class PendingEnumerators
    {
        private readonly SortedDictionary<DateTime, List<PendingEnumerator>> _values =
            new SortedDictionary<DateTime, List<PendingEnumerator>>();

        public void Enqueue(PendingEnumerator entry)
        {
            List<PendingEnumerator> list;
            var dateTime = entry.Enumerator.Current.Timestamp;
            if (_values.TryGetValue(dateTime, out list) == false)
            {
                _values.Add(dateTime, list = new List<PendingEnumerator>());
            }
            list.Add(entry);
        }

        public bool IsEmpty { get { return _values.Count == 0; } }

        public List<PendingEnumerator> Dequeue()
        {
            if (_values.Count == 0)
                return new List<PendingEnumerator>();
            var kvp = _values.First();
            _values.Remove(kvp.Key);
            return kvp.Value;
        }
    }

    public IEnumerable<RangeEntry> ScanRanges(DateTime start, DateTime end, string[] ids)
    {
        if (ids == null || ids.Length == 0)
            yield break;

        var pending = new PendingEnumerators();
        for (int i = 0; i < ids.Length; i++)
        {
            var enumerator = ScanRange(start, end, ids[i]).GetEnumerator();
            if (enumerator.MoveNext() == false)
                continue;
            pending.Enqueue(new PendingEnumerator { Enumerator = enumerator, Index = i });
        }

        var result = new RangeEntry { Values = new double?[ids.Length] };
        while (pending.IsEmpty == false)
        {
            Array.Clear(result.Values, 0, result.Values.Length);
            var entries = pending.Dequeue();
            if (entries.Count == 0)
                break;
            foreach (var entry in entries)
            {
                var current = entry.Enumerator.Current;
                result.Timestamp = current.Timestamp;
                result.Values[entry.Index] = current.Value;
                if (entry.Enumerator.MoveNext())
                    pending.Enqueue(entry);
            }
            yield return result;
        }
    }

We are getting a single entry from each channel into the pending enumerators. Then, we collate all the entries that share the same time into a single entry. We use the Index property to track the actual expected index of the entry in the output. And we handle duplicate times in the same channel by outputting multiple entries. Testing this on my data set of 1.1 million records, we can get 185 thousand records back in 0.15 seconds.
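The same collation idea can be sketched outside of Voron. The following is a rough Java translation that merges in-memory lists instead of storage scans, using a TreeMap keyed by timestamp in place of the SortedDictionary. It is an illustration under those assumptions and not the author's code; MergeChannels, Reading, Pending and demo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MergeChannels {

    // One reading from a single channel (stand-in for the per-channel entry type).
    public static final class Reading {
        final long timestamp;
        final double value;

        public Reading(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // A channel iterator, its output column, and the reading it currently points at.
    private static final class Pending {
        final Iterator<Reading> source;
        final int index;
        Reading current;

        Pending(Iterator<Reading> source, int index, Reading current) {
            this.source = source;
            this.index = index;
            this.current = current;
        }
    }

    private static void enqueue(TreeMap<Long, List<Pending>> pending, Pending p) {
        pending.computeIfAbsent(p.current.timestamp, k -> new ArrayList<>()).add(p);
    }

    // Merges time-ordered channels into rows of "timestamp [v0, v1, ...]",
    // with null where a channel has no reading at that timestamp.
    public static List<String> merge(List<List<Reading>> channels) {
        TreeMap<Long, List<Pending>> pending = new TreeMap<>();
        for (int i = 0; i < channels.size(); i++) {
            Iterator<Reading> it = channels.get(i).iterator();
            if (it.hasNext()) {
                enqueue(pending, new Pending(it, i, it.next()));
            }
        }
        List<String> rows = new ArrayList<>();
        while (!pending.isEmpty()) {
            Map.Entry<Long, List<Pending>> first = pending.pollFirstEntry();
            Double[] values = new Double[channels.size()];
            for (Pending p : first.getValue()) {
                values[p.index] = p.current.value;
                if (p.source.hasNext()) {
                    p.current = p.source.next();
                    enqueue(pending, p); // a duplicate timestamp lands in a fresh row
                }
            }
            rows.add(first.getKey() + " " + Arrays.toString(values));
        }
        return rows;
    }

    public static List<String> demo() {
        List<Reading> a = Arrays.asList(new Reading(1, 10.0), new Reading(2, 20.0), new Reading(2, 21.0));
        List<Reading> b = Arrays.asList(new Reading(2, 5.0), new Reading(3, 6.0));
        return merge(Arrays.asList(a, b));
    }

    public static void main(String[] args) {
        // Prints: 1 [10.0, null] / 2 [20.0, 5.0] / 2 [21.0, null] / 3 [null, 6.0]
        demo().forEach(System.out::println);
    }
}
```

Channel a contains two readings at timestamp 2, so the merge emits two rows for that timestamp, which mirrors the "outputting multiple entries" behavior described above.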

February 25, 2014

by Oren Eini

· 5,409 Views

Brief comparison of BDD frameworks

JDave, Concordion, Easyb, JBehave, Cucumber are all compared here briefly for your convenience.

February 24, 2014

by Sebastian Laskawiec

· 129,922 Views · 16 Likes

Voron and the FreeDB Dataset

I got tired of doing arbitrary performance testing, so I decided to take the FreeDB dataset and start working with that. FreeDB is a data set used to look up CD information based on a nearly unique disk id. This is a good dataset, because it contains a lot of data (over three million albums and over 40 million songs), and it is production data. That means that it is dirty. This makes it perfect for running all sorts of interesting scenarios.

The purpose of this post (and maybe the next few) is to show off a few things. First, we want to see how Voron behaves with a realistic data set. Second, we want to show off the way Voron works, its API, etc.

To start with, I ran my FreeDB parser, pointing it at /dev/null. The idea is to measure the cost of just going through the data. We are using freedb-complete-20130901.tar.bz2 from Sep 2013. After 1 minute, we had gone through 342,224 albums, and after 6 minutes we were at 2,066,871 albums. Reading all 3,328,488 albums took a bit over ten minutes. So just parsing and reading the FreeDB dataset is pretty expensive. The end result is a list of disk objects.

Now, let us see how we want to actually use this. We want to be able to:

· Look up an album by its disk ids.
· Look up all the albums by an artist*.
· Look up albums by album title*.

* This gets interesting, because we need to deal with questions such as: "Given Pearl Jam, if I search for Pearl, do I get them? Do I get them for Jam?" For now, we are going to go with case-insensitive matching. We won't be doing full text search, but we will allow prefix searches.

We are using the following abstraction for the destination:

    public abstract class Destination
    {
        public abstract void Accept(Disk d);
        public abstract void Done();
    }

Basically, we read data as fast as we can, and we shove it to the destination until we are done.
Here is the Voron implementation:

    public class VoronDestination : Destination
    {
        private readonly StorageEnvironment _storageEnvironment;
        private WriteBatch _currentBatch;
        private readonly JsonSerializer _serializer = new JsonSerializer();
        private int counter = 1;

        public VoronDestination()
        {
            _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("freedb"));
            using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
            {
                _storageEnvironment.CreateTree(tx, "albums");
                _storageEnvironment.CreateTree(tx, "ix_artists");
                _storageEnvironment.CreateTree(tx, "ix_titles");
                tx.Commit();
            }
            _currentBatch = new WriteBatch();
        }

        public override void Accept(Disk d)
        {
            var ms = new MemoryStream();
            _serializer.Serialize(new JsonTextWriter(new StreamWriter(ms)), d);
            ms.Position = 0;
            var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
            _currentBatch.Add(key, ms, "albums");
            if (d.Artist != null)
                _currentBatch.MultiAdd(d.Artist.ToLower(), key, "ix_artists");
            if (d.Title != null)
                _currentBatch.MultiAdd(d.Title.ToLower(), key, "ix_titles");
            // flush a batch every 1,000 albums
            if (counter % 1000 == 0)
            {
                _storageEnvironment.Writer.Write(_currentBatch);
                _currentBatch = new WriteBatch();
            }
        }

        public override void Done()
        {
            _storageEnvironment.Writer.Write(_currentBatch);
        }
    }

Let us go over this in detail, shall we? In the constructor, we first create a new storage environment. In this case, we just want to import the data, so we can create the storage inline. Inside the transaction, we then create the relevant trees. You can think about Voron trees in much the same way you think about tables: they are a way to separate data into different parts of the storage. Note that all of this still resides in a single file, so there isn't a physical separation.

We created an albums tree, which will contain the actual data, and the ix_artists and ix_titles trees, which are indexes into the albums tree. You can see them being used just a little lower.
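The relationship between the data tree and its index trees can be illustrated without Voron at all. The following is a rough in-memory sketch in JavaScript, not Voron's API: the AlbumStore class, its method names, and the album object shape are all hypothetical. It shows a primary store keyed by an incrementing integer, a multi-value index keyed by the lower-cased artist name, and the case-insensitive prefix lookup described earlier.

```javascript
// In-memory stand-in for an "albums" tree plus an "ix_artists" index tree.
// All names here are illustrative; none of this is Voron's actual API.
class AlbumStore {
  constructor() {
    this.albums = new Map();   // key -> serialized album (the data tree)
    this.byArtist = new Map(); // lower-cased artist -> [keys] (the index tree)
    this.counter = 1;          // simple incrementing integer key
  }

  add(album) {
    const key = this.counter++;
    this.albums.set(key, JSON.stringify(album));
    if (album.artist != null) {
      // Lower-casing at write time makes lookups case insensitive.
      const term = album.artist.toLowerCase();
      if (!this.byArtist.has(term)) this.byArtist.set(term, []);
      this.byArtist.get(term).push(key); // one term can point at many albums
    }
    return key;
  }

  findByArtistPrefix(prefix) {
    const wanted = prefix.toLowerCase();
    const hits = [];
    // A sorted tree could seek straight to the prefix; a Map has to scan.
    for (const [term, keys] of this.byArtist) {
      if (!term.startsWith(wanted)) continue;
      for (const key of keys) hits.push(JSON.parse(this.albums.get(key)));
    }
    return hits;
  }
}
```

With this layout, searching for "pearl" finds albums by "Pearl Jam", while searching for "jam" finds nothing, because only prefixes of the whole indexed value match.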
In the Accept method, you can see that we use a WriteBatch, a native Voron notion that allows us to batch multiple operations into a single transaction. In this case, for every album, we are making 3 writes. First, we write all of the data, as a JSON string, into a stream and put it in the albums tree, using a simple incrementing integer as the actual album key. Then we add the artist and title entries (lower case, so we don't have to worry about case sensitivity in searches) into the relevant indexes.

At 60 seconds, we had written 267,998 values to Voron. In fact, I explicitly designed it so we can see the relevant metrics. At 495 seconds we had read 1,995,385 entries from the FreeDB file, parsed 1,995,346 of them and written 1,610,998 to Voron. As you can imagine, each step runs in a dedicated thread, so we can see how they behave on an individual basis. The good thing about this is that I can physically see the various costs, which is actually pretty cool.

In the Voron directory at 60 seconds, we have two journal files active (they haven't been applied to the data file yet) and the db.voron file is at 512 MB. The compression buffer is at 32 MB (this is usually twice as big as the biggest transaction, uncompressed). The scratch buffer is used to hold in-flight transaction information (until we send it to the data file), and it is sitting at 256 MB in size.

At 15 minutes, we have the following numbers: 3,035,452 entries read from the file, 3,035,426 parsed and 2,331,998 written to Voron. Note that we are reading the file and writing to Voron on the same disk, so that might impact the read performance. Looking at the disk at that time, note that we increase the size of most of our files by a factor of 2, so some of the space in the db.voron file is probably not used. Note also that we needed more scratch space to handle the in-flight information.

The entire process took 22 minutes, start to finish.
I have to note, though, that this hasn't been optimized at all, and I know we are doing a lot of stupid stuff throughout. You might have noticed something else: we actually "crashed" the Voron db instead of closing it cleanly. This was done to see what would happen when we open a relatively large db after a disorderly shutdown. We'll actually get to play with the data in my next post. So far this has been pretty much just to see how things are behaving.

And... I just realized something: I forgot to actually add an index on disk id, which means that I have to import the data again. But before that, I also wrote the following:

    public class JsonFileDestination : Destination
    {
        private readonly GZipStream _stream;
        private readonly StreamWriter _writer;
        private readonly JsonSerializer _serializer = new JsonSerializer();

        public JsonFileDestination()
        {
            _stream = new GZipStream(
                new FileStream("freedb.json.gzip", FileMode.CreateNew, FileAccess.ReadWrite),
                CompressionLevel.Optimal);
            _writer = new StreamWriter(_stream);
        }

        public override void Accept(Disk d)
        {
            // one JSON document per line
            _serializer.Serialize(new JsonTextWriter(_writer), d);
            _writer.WriteLine();
        }

        public override void Done()
        {
            _writer.Flush();
            _stream.Dispose();
        }
    }

This completed in ten minutes for 3,328,488 entries, a rate of about 5,538 per second. The result is an 845 MB gzip file. I had two reasons to want to do this. First, it gave me something to compare ourselves to. Second, and more to the point, I can re-use this gzip file for my next tests without having to go through the expensive parsing of the FreeDB file.
I did just that and ended up with the following:

    public class VoronEntriesDestination : EntryDestination
    {
        private readonly StorageEnvironment _storageEnvironment;
        private WriteBatch _currentBatch;
        private int counter = 1;

        public VoronEntriesDestination()
        {
            _storageEnvironment = new StorageEnvironment(StorageEnvironmentOptions.ForPath("freedb"));
            using (var tx = _storageEnvironment.NewTransaction(TransactionFlags.ReadWrite))
            {
                _storageEnvironment.CreateTree(tx, "albums");
                _storageEnvironment.CreateTree(tx, "ix_diskids");
                _storageEnvironment.CreateTree(tx, "ix_artists");
                _storageEnvironment.CreateTree(tx, "ix_titles");
                tx.Commit();
            }
            _currentBatch = new WriteBatch();
        }

        public override int Accept(string d)
        {
            var disk = JObject.Parse(d);
            var ms = new MemoryStream();
            var writer = new StreamWriter(ms);
            writer.Write(d);
            writer.Flush();
            ms.Position = 0;
            var key = new Slice(EndianBitConverter.Big.GetBytes(counter++));
            _currentBatch.Add(key, ms, "albums");

            // count starts at 1 for the album itself; every index value adds one more entry
            int count = 1;
            foreach (var diskId in disk.Value<JArray>("diskids"))
            {
                count++;
                _currentBatch.MultiAdd(diskId.Value<string>(), key, "ix_diskids");
            }
            var artist = disk.Value<string>("artist");
            if (artist != null)
            {
                count++;
                _currentBatch.MultiAdd(artist.ToLower(), key, "ix_artists");
            }
            var title = disk.Value<string>("title");
            if (title != null)
            {
                count++;
                _currentBatch.MultiAdd(title.ToLower(), key, "ix_titles");
            }

            // flush a batch every 100 albums
            if (counter % 100 == 0)
            {
                _storageEnvironment.Writer.Write(_currentBatch);
                _currentBatch = new WriteBatch();
            }
            return count;
        }

        public override void Done()
        {
            _storageEnvironment.Writer.Write(_currentBatch);
            _storageEnvironment.Dispose();
        }
    }

Now we are actually properly disposing of things, and I also decreased the size of the batch, to see how it would respond. Note that it is now being fed directly from the gzip file, at a greatly reduced cost. I also added tracking not only for how many albums we write, but also for how many entries. By entries I mean Voron entries (which include the values we add to the indexes).
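Both destinations follow the same batching pattern: collect writes, commit every N items, and commit the partially filled batch when the input ends. A minimal JavaScript sketch of that pattern is shown below. The BatchingWriter class and its flush callback are hypothetical stand-ins for illustration; in the post, the commit is done by handing a WriteBatch to the storage environment's writer.

```javascript
// Collects items and hands them to a commit callback in fixed-size batches.
// `flush` is a hypothetical callback that commits one batch (an array).
class BatchingWriter {
  constructor(flush, batchSize) {
    this.flush = flush;
    this.batchSize = batchSize;
    this.current = [];
  }

  accept(item) {
    this.current.push(item);
    if (this.current.length >= this.batchSize) this.flushCurrent();
  }

  done() {
    // Commit whatever is left in the final, partially filled batch.
    if (this.current.length > 0) this.flushCurrent();
  }

  flushCurrent() {
    const batch = this.current;
    this.current = []; // start a fresh batch before committing the old one
    this.flush(batch);
  }
}
```

A smaller batch size means more, cheaper commits; a larger one means fewer commits that each hold more in-flight data, which is the trade-off the post probes by going from 1,000 to 100.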
I did find a bug where we would just double the file size without due consideration to its size, so now we are doing smaller file size increases.

Word of warning: I didn't realize until after I was done with all the benchmarks that I actually ran all of them in debug configuration, which basically means that they are utterly useless as performance metrics. That is especially true because we have a lot of verifier code that runs in debug mode. So please don't take those numbers as actual performance metrics; they aren't valid.

    Time          # of albums    # of entries
    4 minutes         773,398       3,091,146
    6 minutes       1,126,998       4,504,550
    8 minutes       1,532,858       6,126,413
    18 minutes      2,781,698      11,122,799
    24 minutes      3,328,488      13,301,496

Looking at the status of the file system midway through the run, you can see that we now increase the file in smaller increments, and that we are using more scratch space, probably because we are under very heavy write load.

After the run, the scratch and compression files are gone: they are only used while the database is running and are deleted on close. The database is 7 GB in size, which is quite respectable.

Now, on to working with it, but I'll save that for my next post; this one is long enough already.

February 20, 2014

by Oren Eini

· 3,805 Views

The Risks Of Big-Bang Deployments And Techniques For Step-wise Deployment

If you ever need to persuade management why it might be better to deploy a larger change in multiple stages and push it to customers gradually, read on.

A deployment of many changes is risky. We therefore want to deploy them in a way that minimizes the risk of harm to our customers and our companies. The deployment can be done either all at once (also known as big-bang) or gradually. We will argue here for the more gradual ("stepwise") approach.

Big-bang or stepwise deployment?

A big-bang deployment seems to be the natural thing to do: the full solution is developed and tested and then replaces the current system at once. However, it has two crucial flaws.

First, it assumes that most defects can be discovered by testing. However, due to differences between test and production environments, unknown dependencies, and the sheer scale of a typical larger system, there will always be problems that are not discovered until production deployment, or even until the application has run for a while in production (which applies even to airplanes). The more parts have been changed, the more of these production defects will happen at the same time. A gradual deployment makes it possible to discover and handle them one by one.

Second, the more complex the deployment, the higher the chance of human error, i.e. the deployment itself is a likely source of serious defects.

Some of the drawbacks of a big-bang deployment in more detail:

· Complexity: A big-bang deployment requires coordination of many people and "moving parts" that depend on each other, providing a huge opportunity for human mistakes (i.e. there will be mistakes).
· A lot of time: Such a deployment requires a lot of time (typically more than planned or expected) and thus a lot of downtime during which users cannot use the system.
· Hard troubleshooting: With a network of inter-dependent parts that all changed at the same time, perhaps together with the infrastructure (i.e. the connections between them), it is extremely hard to pinpoint the source of defects. This considerably increases the time to detect and correct defects, while also increasing the risk of people stepping on each other's toes and of "panic fixes" that either cause more problems than they remove or are not good enough (like the rollback that sped up Knight's downfall). Rollback is likely either impossible or as time-consuming and risky as the deployment itself, thus increasing the impact of defects and inviting even more human errors.
· Impact: Deploying everything to all users at the same time means that everybody will be impacted by a potential defect, error, or mistake.
· Long freeze: Everything needs to be tested together after all development is finished, which requires a lot of time during which the code is frozen and no more fixes and changes can get into production for weeks.

Risk mitigation

The goal of a good deployment plan is to mitigate the risk of the deployment and bring it to an acceptable level. There are two aspects to risk: the probability of a defect and the impact of the defect. The possible measures are listed below, starting with those that reduce defect probability and ending with those that reduce defect impact:

· testing
· stepwise deployment
· gradual migration of users to the new version (e.g. 1 in 1000 or particular subsets)
· rollback mechanism

=> these also lead to a much lower time to detect and fix defects

Practices for stepwise deployment

· Enable stepwise deployment: Use parallel change and other Continuous Delivery techniques to make it possible to deploy updated components independently of each other, to switch new features on and off, and to switch which versions of the components they depend on are currently used. (Parallel change, i.e. keeping the old and new code and being able to use one or the other, is crucial here. Also notice that parallel change applies to data as well: you will need to evolve your data schema gradually and keep both the old and the new one for a period of time.)
· Enable rollback: The previous measure, stepwise deployment, also makes it easy (or easier) to roll back changes by switching to a previous version of a dependency or by switching back to the old code.
· Migrate users gradually to the new version, i.e. expose the new version only to a small subset of the users initially and increase that subset until everybody uses it. This can be done e.g. by deploying to only a subset of servers and sending a random or particular subset of users to the new servers, but there are also ways to do it if you have only a single machine. (See e.g. my post Webapp Blue-Green Deployment Without Breaking Sessions/With Fallback With HAProxy.)
· Monitoring: Make sure you are able to monitor the flow of users through the system and detect any anomalies and errors early, long before angry calls from the business. Tools such as Logstash, Google Analytics (with custom events from JavaScript), and client-side error logging via one of the existing services or a custom solution are invaluable.
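One common way to realize the "1 in 1000 or particular subsets" migration is to hash a stable user identifier into a fixed number of buckets and route the lowest buckets to the new version. The sketch below illustrates that idea in JavaScript under stated assumptions: the function names, the per-mille granularity, and the choice of an FNV-1a style hash over the string's character codes are all illustrative, and the article itself does not prescribe any particular mechanism.

```javascript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
// Any stable hash would do; this one is short and has no dependencies.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash;
}

// Decide whether a user is routed to the new version.
// rolloutPerMille: 0 routes nobody, 1 routes roughly 1 user in 1000,
// 1000 routes everybody. allowList is a hypothetical set of user ids
// (for example internal testers) that always get the new version.
function usesNewVersion(userId, rolloutPerMille, allowList = new Set()) {
  if (allowList.has(userId)) return true;
  return fnv1a(userId) % 1000 < rolloutPerMille;
}
```

Because the bucket depends only on the user id, a user who has been migrated stays migrated as the rollout value grows, and lowering the value again acts as a simple rollback switch.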

February 20, 2014

by Jakub Holý

· 22,192 Views

Customize the Appearance of Pivot Table Reports inside Android Apps

This technical tip shows how developers can customize the appearance of pivot table reports inside their Android applications using Aspose.Cells for Android. Previously we have shown how to create a simple pivot table. This article goes further and discusses how to customize the appearance of a pivot table by setting its properties: setting the AutoFormat and PivotTableStyle types, setting format options, setting row, column and page field formats, modifying a pivot table quick style, and clearing PivotFields.

    //Setting the AutoFormat and PivotTableStyle Type

    //Setting the PivotTable report is automatically formatted for Excel 2003 formats
    pivotTable.setAutoFormat(true);
    //Setting the PivotTable autoformat type.
    pivotTable.setAutoFormatType(PivotTableAutoFormatType.CLASSIC);
    //Setting the PivotTable's Styles for Excel 2007/2010 formats e.g XLSX.
    pivotTable.setPivotTableStyleType(PivotTableStyleType.PIVOT_TABLE_STYLE_LIGHT_1);

    //Setting Format Options
    //The code sample that follows illustrates how to set a number of pivot table
    //formatting options, including adding grand totals for rows and columns.

    //Dragging the third field to the data area.
    pivotTable.addFieldToArea(PivotFieldType.DATA, 2);
    //Show grand totals for rows.
    pivotTable.setRowGrand(true);
    //Show grand totals for columns.
    pivotTable.setColumnGrand(true);
    //Display a custom string in cells that contain null values.
    pivotTable.setDisplayNullString(true);
    pivotTable.setNullString("null");
    //Setting the layout
    pivotTable.setPageFieldOrder(PrintOrderType.DOWN_THEN_OVER);

    //Setting Row, Column, and Page Fields Format
    //The code example that follows shows how to access row fields, access a particular
    //row, set subtotals, apply automatic sorting, and use the autoShow option.

    //Accessing the row fields.
    PivotFieldCollection pivotFields = pivotTable.getRowFields();
    //Accessing the first row field in the row fields.
    PivotField pivotField = pivotFields.get(0);
    //Setting Subtotals.
    pivotField.setSubtotals(PivotFieldSubtotalType.SUM, true);
    pivotField.setSubtotals(PivotFieldSubtotalType.COUNT, true);
    //Setting autosort options.
    //Setting the field auto sort.
    pivotField.setAutoSort(true);
    //Setting the field auto sort ascend.
    pivotField.setAscendSort(true);
    //Setting the field auto sort using the field itself.
    pivotField.setAutoSortField(-1);
    //Setting autoShow options.
    //Setting the field auto show.
    pivotField.setAutoShow(true);
    //Setting the field auto show ascend.
    pivotField.setAscendShow(false);
    //Setting the auto show using field(data field).
    pivotField.setAutoShowField(0);

    //The following lines of code illustrate how to format data fields.

    //Accessing the data fields.
    PivotFieldCollection pivotFields = pivotTable.getDataFields();
    //Accessing the first data field in the data fields.
    PivotField pivotField = pivotFields.get(0);
    //Setting data display format
    pivotField.setDataDisplayFormat(PivotFieldDataDisplayFormat.PERCENTAGE_OF);
    //Setting the base field.
    pivotField.setBaseField(1);
    //Setting the base item.
    pivotField.setBaseItem(PivotItemPosition.NEXT);
    //Setting number format
    pivotField.setNumber(10);

    //Modify a Pivot Table Quick Style
    //The code examples that follow show how to modify the quick style applied to a pivot table.

    File sdDir = Environment.getExternalStorageDirectory();
    String sdPath = sdDir.getCanonicalPath();
    //Open the template file containing the pivot table.
    Workbook wb = new Workbook(sdPath + "/Template.xlsx");
    //Add pivot table style
    Style style1 = wb.createStyle();
    com.aspose.cells.Font font1 = style1.getFont();
    font1.setColor(Color.getRed());
    Style style2 = wb.createStyle();
    com.aspose.cells.Font font2 = style2.getFont();
    font2.setColor(Color.getBlue());
    int i = wb.getWorksheets().getTableStyles().addPivotTableStyle("tt");
    //Get and Set the table style for different categories
    TableStyle ts = wb.getWorksheets().getTableStyles().get(i);
    int index = ts.getTableStyleElements().add(TableStyleElementType.FIRST_COLUMN);
    TableStyleElement e = ts.getTableStyleElements().get(index);
    e.setElementStyle(style1);
    index = ts.getTableStyleElements().add(TableStyleElementType.GRAND_TOTAL_ROW);
    e = ts.getTableStyleElements().get(index);
    e.setElementStyle(style2);
    //Set Pivot Table style name
    PivotTable pt = wb.getWorksheets().get(0).getPivotTables().get(0);
    pt.setPivotTableStyleName("tt");
    //Save the file.
    wb.save(sdPath + "/OutputFile.xlsx");

    //Clearing PivotFields
    //PivotFieldCollection has a method named clear() for the task. When you want to clear
    //all the PivotFields in the areas e.g., page, column, row or data, you can use it.
    //The code sample below shows how to clear all the PivotFields in data area.

    File sdDir = Environment.getExternalStorageDirectory();
    String sdPath = sdDir.getCanonicalPath();
    //Open the template file containing the pivot table.
    Workbook workbook = new Workbook(sdPath + "/PivotTable.xlsx");
    //Get the first worksheet
    Worksheet sheet = workbook.getWorksheets().get(0);
    //Get the pivot tables in the sheet
    PivotTableCollection pivotTables = sheet.getPivotTables();
    //Get the first PivotTable
    PivotTable pivotTable = pivotTables.get(0);
    //Clear all the data fields
    pivotTable.getDataFields().clear();
    //Add new data field
    pivotTable.addFieldToArea(PivotFieldType.DATA, "Betrag Netto FW");
    //Set the refresh data flag
    pivotTable.setRefreshDataFlag(false);
    //Refresh and calculate the pivot table data
    pivotTable.refreshData();
    pivotTable.calculateData();
    //Save the Excel file
    workbook.save(sdPath + "/out1.xlsx");

February 19, 2014

by David Zondray

· 2,730 Views

How to Build an iOS and Android App in 24 hours with HTML5 and Cordova

What can one create during the New Year and Christmas holidays? As it turned out, quite a lot, even if you have two kids and a bunch of family members whom you want to visit. The only thing you cannot accomplish in time is to finish an article for DZone. That takes a lot of time, nearly the entire January.

By the 5th of January I had a laptop and a couple of days to spend on some development. Having estimated what I could do, I decided to create a mobile app that would work faster than the original. For this, I needed to find communicative creators of a popular app. Hence, I found the "Spender" app in the App Store. It is a simple app for tracking your budget. With it, you can estimate how effectively you spent your money at the end of each month. By the 5th of January, this app was in the top 10 of the Russian App Store. I also found their dev-story on iphones.ru.

In their dev-story, the developers wrote that after completing their previous project, they had three or four free days, so they decided to create a new app during this free time. Their product manager and programmers helped them with positioning the app and its key features. This encouraged me, and I began to think about how to create nearly the same app in 2 days.

Note: the original app was updated in the middle of January, and now it looks a little different from my app. Anyway, you can find its screenshots in the dev-story.

I already had experience of mobile app development using C# and Cocoa. Since this was my personal free time, I wanted to use it with maximum effectiveness. Even if I didn't succeed, I was eager to learn a new framework or programming language. I worked for DevExpress from 2006 till 2011 and have been reading their announcements since I left the company. So, I knew that they had created a mobile JS framework based on Cordova/PhoneGap. They made it after I left the company, so I was curious to try it.
The Gartner research company reports that by August 2013 most enterprise mobile software was created using PhoneGap or PhoneGap-based products (like Kony). From my experience as a consumer, that's far from true. Maybe I was wrong?

I'm not so good at HTML and JavaScript. I can create markup with the help of stackoverflow.com and I can write simple selectors with jQuery. I can also find the required information in their documentation. In other words, HTML+JS was a gap in my knowledge and I was ready to fill it, or at least gain some experience.

Thus, I planned to create a cross-platform application, which could become an advantage over the original iOS-only Spender app. Moreover, I wanted to spend my time in the most effective way. On the one hand, I had a potentially effective JS framework; on the other, a lack of JS experience. I hoped that the advantages of the JS framework could balance my poor experience.

Since I like to use a VCS during development, I'll try to reconstruct my progress from it. You can download the complete apps here: iOS, Android. I'm not sure I can provide public access to my repo, because it contains images I bought from Fotolia and third-party libraries, each with a different license. I'm not a lawyer, so I'd prefer not to take the risk. The most curious of you can take a look into the app bundle itself; the JS wasn't minified.

Place: Tula, Russia. Date: January 5, 2014.

+20 minutes. Installed Node.js and the Cordova CLI.

+10 minutes. Downloaded a template app from Cordova. Added a template from PhoneJS. Created a git repo and registered it in WebStorm. Added a new record to httpd.conf in order to be able to debug my future app in the browser.

+38 minutes. Changed the app namespace to "io.nikitin.thriftbox". Added navigation. PhoneJS is an MVC framework. Each app screen is represented as a combination of HTML markup (the view) and a factory function (the view model).
Here is how it looks at its simplest: the view holds the content markup, and the view model is created by a function:

    thriftbox.home = function (params) { // request parameters taken from the URI
        return {}; // viewmodel instance
    };

Then the view and the view model are bound via Knockout bindings. To finish in time, I created only two screens: expense input and a monthly expense report.

+4 hours 20 minutes. Here I got stuck for the first time. I couldn't create the markup for the digit buttons. The original app had a huge keyboard that looked like a calculator or dialer. I found out that it was not that easy to create such a keyboard, even using a table tag. On the iPhone retina screen, the 1px borders between buttons changed their colors after the buttons were tapped. On my iPhone, the difference in colors was very noticeable. I had to figure out how to tackle this. I tried to implement the buttons using divs, but I couldn't achieve a border width of 1px and make all buttons look equal on different screens. Three hours later I gave up the idea of using divs and moved forward.

+28 minutes. Removed the tapped-button indicator on iOS. iOS displays a gray indicator around tapped links and objects with an onclick event handler. Since I had my own indicator of a tapped object (the tapped button became darker), I didn't need the default indicator. I solved this problem by switching the buttons from the click event to the dxAction event. This event is an extended variation of a "click" event: its handler supports URI navigation between views and works correctly in a scrollable area.

+14 minutes. The buttonpress event handler now validates numbers from user input.

    var number = ko.observable(null);
    var isvalidnumber = ko.computed(function () {
        return number() && parseFloat(number()) > 0;
    });
    ......
    function buttonpress(button) {
        if (button) {
            // a digit was pressed: append it to the current input
            if (number())
                number(number() + button);
            else
                number(button);
        }
        else if (number())
            // no digit means backspace: drop the last character
            number(number().substr(0, number().length - 1));
    }
    var viewmodel = {
        number: number,
        isvalidnumber: isvalidnumber,
        viewshowing: viewshowing,
        buttonpress: buttonpress
    };
    .....

+8 minutes. Added fastclick.js, which removes the delay between tapping the screen and raising the 'click' event on phones. By default, the mobile browser delays raising the click event to be sure the end-user is not performing a double tap. To the end-user, this looks as if the app is sluggish: you tap buttons much faster than the app responds. fastclick.js handles the touchstart event and then reproduces all the click event processing logic itself. By the way, adding this library was a mistake; I'll explain why later.

+4 minutes. Added a limit on the length of the numbers the user can input. Corrected the font size for a better look-and-feel.

+58 minutes. Added a choice of expense category. Added a scrollable pane with the available categories below the input field. It took less time than it could have. In the PhoneJS component collection, I found dxTileView. It provides kinetic scrolling with the required appearance out of the box. It's not easy to implement kinetic scrolling by yourself, so it's great to get it for free. This scrolling is enabled for iOS only; Android doesn't have it.

It was 7:40 pm, so I decided to continue the next day.

Place: Tula, Russia. Date: January 5, 2014.

+3 hours 9 minutes. Storing data in local storage. PhoneJS contains classes for working with data: selection, filtering, sorting, and grouping. There are several approaches to storing data: OData and localStorage. I didn't want to implement a server side for a free app, so I decided to use localStorage. Later I found out that this was not an ideal decision.
For example, when updating to iOS 5.1, user data is erased. Other people complained that localStorage is cleared regularly or even when shutting the device down. I didn't want to take the risk, so I used the File API of PhoneGap. The documentation says that this API is based on the W3C File API. In practice, this means that the API differs between Safari for Mac OS, Chrome for Mac OS, Cordova for iOS and Cordova for Android. The File API implementation is different for iOS and Android. E.g. the Android implementation doesn't contain the 'Blob' class and the 'window.permanent' constant. It does, however, implement 'LocalFileSystem' and 'LocalFileSystem.PERSISTENT'. The laptop browser provides an additional API for requesting additional storage space, which mobile browsers don't provide.

The available documentation for this API adds more problems. I found several articles by searching for "html5 file api", but I couldn't find one that covered all my questions. Finally, I created a new class for working with the File API. This class supports Cordova 3.3 on iOS and Android, and Chrome 32 for Mac OS and Windows 8. You can find it here: https://github.com/chebum/filestorage-for-phone.js/blob/master/filestorage.js

You can use it as follows:

    // in this example I create the data/records file in the documents folder of the app
    fs.initfileapi(1000000, true)
        .then(function () {
            var records = new fs.filearraystore({ key: "id", filename: "records" });
            return records.insert({ customer: "peter" });
        })
        .then(function () {
            alert("record saved!");
        });

    // or use the low-level API:
    fs.initfileapi(100000, true)
        .then(function () {
            return fs.writefile("file1", "file content");
        })
        .then(function () {
            alert("file saved!");
        });

+33 minutes. Saving the added records to the storage. The category list is stored in an ArrayStore to simplify the selection operations.

+26 minutes. Creating the layout for the app's views. PhoneJS provides several layouts that are the placeholders for the views.
My app's start page didn't fit into any of the available layouts, so I chose the empty layout. However, it doesn't provide animation effects when navigating between views. I copied the empty layout code and added an attribute that enabled animation effects.

+1 hour 51 minutes. The template's About screen was redesigned into a report screen, which was empty at that moment. Created a view model that selects data for the current month. Added localized date formatting for the screen caption.

+59 minutes. Added the display of expenses grouped by category for the current month.

+28 minutes. Added the selection of the month for which the report should be generated. End-users can tap the screen header to select the required month.

+1 hour 20 minutes. Added the Cordova StatusBar plugin, which didn't work out of the box. I found that the reference to cordova.js was commented out in the PhoneJS app template. As a result, the native part of my app didn't work.

+39 minutes. In the report screen, the upper part was changed to a dxToolbar.

+22 minutes. I discovered why the dxButton click event handler didn't work. Removing fastclick.js solved my problem, but brought back the delay between tapping and event raising. I changed the dxAction event subscription to 'touchstart'.

+25 minutes. Formatting output strings when generating a report.

At night I dreamed of crappy buttons on the application's main screen.

Places: Tula, Vnukovo airport. Date: January 7-8, 2014.

I had an early flight to Budapest from Vnukovo, and because I had no time in the afternoon, I gradually completed the app at the airport at night. As you know, it's not very comfortable to sleep or sit in a café chair for a long time, but it turned out that programming was OK.

+2 hours 5 minutes. In the morning, I decided to split the buttons apart in order to remove the borders between them. I took the iOS dialer keyboard as a sample. I created three keyboards; the button size changes depending on the screen size: for 3.5'', 4'' and 5'' phones.
Each table cell contained a div with configured alignment. Because HTML lacks complete vertical text alignment, the final CSS style for the buttons ended up being quite complex:

    .home-view .buttons td div {
        color: #4a5360;
        border: 1px solid #4a5360;
        text-align: center;
        position: absolute;
        left: 50%;
        /* small buttons - default */
        font-size: 26px;
        padding: 13px 0 13px 0;
        width: 52px;
        line-height: 26px;
        border-radius: 26px;
        margin-left: -27px;
        margin-top: -27px;
    }

+1 h. 50 minutes: I bought several vector icon sets on Fotolia. I cut out the required icons and converted them to PNG. It took me quite a long time, maybe because it was 1:30 am :)

+1 hour 10 minutes: Added a splash screen for the app.

+36 minutes: Created three sizes of the app icon. Localized the app name for iOS.

+20 minutes: Hiding the splash screen after the app is completely loaded.

+2 hours: Fixing multiple bugs.

+2 hours: Creating screenshots for the Play Store.

+30 minutes: Creating screenshots for the App Store.

+30 minutes: Writing an app description for the two app stores.

+1 h. 30 minutes: Submitting my app to the App Store. Here I faced an issue with the app certification.

My accountancy

Let's summarize the time I spent and divide it into categories.

Development: 21 hours 37 minutes
Graphics and texts: 8 hours 26 minutes
Total: 30 hours 3 minutes

As a result, I got a working app with a minimal feature set, though it is not as cool as the latest version of "Spender". I didn't get to splitting expenses by day or to income input. My app's UI could be more elegant as well. Comparing this with the work of the original 'Spender' developers, I got the following. They say that they involved four developers for three to four days. That is about 96-128 man-hours. I spent only 30 man-hours and got an app for three mobile platforms. The iOS and Android versions are already in the stores. The version for Windows Phone 8 requires a UI redesign. I can be proud of myself :). You can download the complete apps here: iOS, Android

February 12, 2014

by Ivan Nikitin

· 210,786 Views

Build Your Own Custom Lucene Query and Scorer

Every now and then we’ll come across a search problem that can’t simply be solved with plain Solr relevancy. This usually means a customer knows exactly how documents should be scored. They may have little tolerance for close approximations of this scoring through Solr boosts, function queries, etc. They want a Lucene-based technology for text analysis and performant data structures, but they need to be extremely specific in how documents should be scored relative to each other. Well, for those extremely specialized cases we can prescribe a little out-patient surgery to your Solr install – building your own Lucene Query.

This is the Nuclear Option

Before we dive in, a word of caution. Unless you just want the educational experience, building a custom Lucene Query should be the “nuclear option” for search relevancy. It’s very fiddly and there are many ins-and-outs. If you’re actually considering this to solve a real problem, you’ve already gone down the following paths:

1. You’ve utilized Solr’s extensive set of query parsers & features including function queries, joins, etc. None of this solved your problem.
2. You’ve exhausted the ecosystem of plugins that extend on the capabilities in (1). That didn’t work.
3. You’ve implemented your own query parser plugin that takes user input and generates existing Lucene queries to do this work. This still didn’t solve your problem.
4. You’ve thought carefully about your analyzers – massaging your data so that at index time and query time, text lines up exactly as it should to optimize the behavior of existing search scoring. This still didn’t get what you wanted.
5. You’ve implemented your own custom Similarity that modifies how Lucene calculates the traditional relevancy statistics – query norms, term frequency, etc.
6. You’ve tried to use Lucene’s CustomScoreQuery to wrap an existing Query and alter each document’s score via a callback. This still wasn’t low-level enough for you; you needed even more control.
If you’re still reading you either think this is going to be fun/educational (good for you!) or you’re one of the minority that must control exactly what happens with search. If you don’t know, you can of course contact us for professional services. Ok, back to the action…

Refresher – Lucene Searching 101

Recall that to search in Lucene, we need to get a hold of an IndexSearcher. This IndexSearcher performs search over an IndexReader. Assuming we’ve created an index, with these classes we can perform searches like in this code:

    Directory dir = new RAMDirectory();
    // IndexReader is abstract, so we open a reader on the directory
    IndexReader idxReader = DirectoryReader.open(dir);
    IndexSearcher idxSearcher = new IndexSearcher(idxReader);
    Query q = new TermQuery(new Term("field", "value"));
    idxSearcher.search(q, 10); // collect the top 10 hits

Let’s summarize the objects we’ve created:

· Directory – Lucene’s interface to a file system. This is pretty straightforward. We won’t be diving in here.
· IndexReader – Access to data structures in Lucene’s inverted index. If we want to look up a term, and visit every document it exists in, this is where we’d start. If we wanted to play with term vectors, offsets, or anything else stored in the index, we’d look here for that stuff as well.
· IndexSearcher – Wraps an IndexReader for the purpose of taking search queries and executing them.
· Query – How we expect the searcher to perform the search, encompassing both scoring and which documents are returned. In this case, we’re searching for “value” in field “field”. This is the bit we want to toy with.

In addition to these classes, we’ll mention a support class that exists behind the scenes:

· Similarity – Defines rules/formulas for calculating norms at index time and query normalization.

Now with this outline, let’s think about a custom Lucene Query we can implement to help us learn. How about a query that searches for terms backwards? If the document matches a term backwards (like ananab for banana), we’ll return a score of 5.0.
If the document matches the forwards version, let’s still return the document, with a score of 1.0 instead. We’ll call this Query “BackwardsTermQuery”. This example is hosted here on github.

A tale of 3 classes – A Query, A Weight, and a Scorer

Before we sling code, let’s talk about general architecture. A Lucene Query follows this general structure:

· A custom Query class, inheriting from Query
· A custom Weight class, inheriting from Weight
· A custom Scorer class, inheriting from Scorer

These three objects wrap each other. A Query creates a Weight, and a Weight in turn creates a Scorer.

A Query is itself a very straightforward class. One of its main responsibilities when passed to the IndexSearcher is to create a Weight instance. Other than that, there are additional responsibilities to Lucene and users of your Query to consider, that we’ll discuss in the “Query” section below.

A Query creates a Weight. Why? Lucene needs a way to track IndexSearcher level statistics specific to each query while retaining the ability to reuse the query across multiple IndexSearchers. This is the role of the Weight class. When performing a search, IndexSearcher asks the Query to create a Weight instance. This instance becomes the container for holding high-level statistics for the Query scoped to this IndexSearcher (we’ll go over these steps more in the “Weight” section below). The IndexSearcher safely owns the Weight, and can abuse and dispose of it as needed. If later the Query gets reused by another IndexSearcher, a new Weight simply gets created.

Once an IndexSearcher has a Weight, and has calculated any IndexSearcher level statistics, the IndexSearcher’s next task is to find matching documents and score them. To do this, the Weight in turn creates a Scorer. Just as the Weight is tied closely to an IndexSearcher, a Scorer is tied to an individual IndexReader. Now this may seem a little odd – in our code above the IndexSearcher always has exactly one IndexReader, right? Not quite.
See, a little hidden implementation detail is that IndexReaders may actually wrap other smaller IndexReaders – each tied to a different segment of the index. Therefore, an IndexSearcher needs to have the ability to score documents across multiple, independent IndexReaders. How your scorer should iterate over matches and score documents is outlined in the “Scorer” section below.

So to summarize, we can expand the idxSearcher.search(...) call that ends our example above into this pseudocode:

    Weight w = q.createWeight(idxSearcher);
    // IndexSearcher level calculations for weight
    Foreach IndexReader idxReader:
        Scorer s = w.scorer(idxReader);
        // collect matches and score them

Now that we have the basic flow down, let’s pick apart the three classes in a little more detail for our custom implementation.

Our Custom Query

What should our custom Query implementation look like? Query implementations always have two audiences: (1) Lucene and (2) users of your Query implementation. For your users, expose whatever methods you require to modify how a searcher matches and scores with your query. Want to only return as a match 1/3 of the documents that match the query? Want to punish the score because the document length is longer than the query length? Add the appropriate modifier on the query that impacts the scorer’s behavior.

For our BackwardsTermQuery, we don’t expose accessors to modify the behavior of the search. The user simply uses the constructor to specify the term and field to search. In our constructor, we will simply be reusing Lucene’s existing TermQuery for searching individual terms in a document.
    private TermQuery backwardsQuery;
    private TermQuery forwardsQuery;

    public BackwardsTermQuery(String field, String term) {
        // A wrapped TermQuery for the reverse string
        Term backwardsTerm = new Term(field, new StringBuilder(term).reverse().toString());
        backwardsQuery = new TermQuery(backwardsTerm);
        // A wrapped TermQuery for the forward string
        Term forwardsTerm = new Term(field, term);
        forwardsQuery = new TermQuery(forwardsTerm);
    }

Just as importantly, be sure your Query meets the expectations of Lucene. Most importantly, you MUST override the following:

· createWeight()
· hashCode()
· equals()

The method createWeight() we’ve discussed. This is where you’ll create a Weight instance for an IndexSearcher. Pass any parameters that will influence the scoring algorithm, as the Weight will in turn be creating a Scorer.

Even though they are not abstract methods, overriding the hashCode()/equals() methods is very important. These methods are used by Lucene/Solr to cache queries/results. If two queries are equal, there’s no reason to rerun the query. Without proper overrides, running another instance of your query could result in seeing the results of your first query multiple times. You’ll see your search for “peas” work great, then you’ll search for “bananas” and see “peas” search results. Override equals() and hashCode() so that “peas” != “bananas”.

Our BackwardsTermQuery implements createWeight() by creating a custom BackwardsWeight that we’ll cover below:

    @Override
    public Weight createWeight(IndexSearcher searcher) throws IOException {
        return new BackwardsWeight(searcher);
    }

BackwardsTermQuery has a fairly boilerplate equals() and hashCode() that passes through to the wrapped TermQuerys. Be sure equals() includes all the boilerplate stuff such as the check for self-comparison, the call to super.equals(), the class comparison, etc. By using Lucene’s unit test suite, we can get a lot of good checks that our implementation of these is correct.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!super.equals(other)) {
            return false;
        }
        if (getClass() != other.getClass()) {
            return false;
        }
        BackwardsTermQuery otherQ = (BackwardsTermQuery)(other);
        if (otherQ.getBoost() != getBoost()) {
            return false;
        }
        return otherQ.backwardsQuery.equals(backwardsQuery)
            && otherQ.forwardsQuery.equals(forwardsQuery);
    }

    @Override
    public int hashCode() {
        return super.hashCode() + backwardsQuery.hashCode() + forwardsQuery.hashCode();
    }

Our Custom Weight

You may choose to use Weight simply as a mechanism to create Scorers (where the real meat of search scoring lives). However, your custom Weight class must at least provide boilerplate implementations of the query normalization methods even if you largely ignore what is passed in:

· getValueForNormalization
· normalize

These methods participate in a little ritual that IndexSearcher puts your Weight through with the Similarity for query normalization. To summarize the query normalization code in the IndexSearcher:

    float v = weight.getValueForNormalization();
    float norm = getSimilarity().queryNorm(v);
    weight.normalize(norm, 1.0f);

Great, what does this code do? Well, a value is extracted from Weight. This value is then passed to a Similarity instance that “normalizes” that value. Weight then receives this normalized value back. In short, this is allowing IndexSearcher to give Weight some information about how its “value for normalization” compares to the rest of the stuff being searched by this searcher.

This is extremely high level. “Value for normalization” could mean anything, but here it generally means “what I think is my weight”, and what Weight receives back is the searcher saying “no really, here is your weight”. The details of what that means depend on the Similarity and Weight implementation. It’s expected that the Weight’s generated Scorer will use this normalized weight in scoring.
You can choose to do whatever you want in your own Scorer, including completely ignoring what’s passed to normalize(). While our Weight isn’t factoring into the scoring calculation, for consistency’s sake, we’ll participate in the little ritual by overriding these methods:

    @Override
    public float getValueForNormalization() throws IOException {
        return backwardsWeight.getValueForNormalization()
            + forwardsWeight.getValueForNormalization();
    }

    @Override
    public void normalize(float norm, float topLevelBoost) {
        backwardsWeight.normalize(norm, topLevelBoost);
        forwardsWeight.normalize(norm, topLevelBoost);
    }

Outside of these query normalization details, and implementing “scorer”, little else happens in the Weight. However, you may perform whatever else requires an IndexSearcher in the Weight constructor. In our implementation, we don’t perform any additional steps with IndexSearcher.

The final and most important requirement of Weight is to create a Scorer. For BackwardsWeight we construct our custom BackwardsScorer, passing scorers created from each of the wrapped queries to work with.

    @Override
    public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder,
                         boolean topScorer, Bits acceptDocs) throws IOException {
        Scorer backwardsScorer = backwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs);
        Scorer forwardsScorer = forwardsWeight.scorer(context, scoreDocsInOrder, topScorer, acceptDocs);
        return new BackwardsScorer(this, context, backwardsScorer, forwardsScorer);
    }

Our Custom Scorer

The Scorer is the real meat of the search work. Responsible for identifying matches and providing scores for those matches, this is where the lion’s share of our customization will occur.

It’s important to note that a Scorer is also a Lucene DocIdSetIterator. A DocIdSetIterator is a cursor into a set of documents in the index. It provides three important methods:

· docID() – what is the id of the current document? (this is an internal Lucene ID, not the Solr “id” field you might have in your index)
· nextDoc() – advance to the next document
· advance(target) – advance (seek) to the target

One uses a DocIdSetIterator by first calling nextDoc() or advance() and then reading the docID to get the iterator’s current location. The docID values only increase as they are iterated over.

By implementing this interface a Scorer acts as an iterator over matches in the index. A Scorer for the query “field1:cat” can be iterated over in this manner to return all the documents that match the cat query. In fact, if you recall from my article, this is exactly how the terms are stored in the search index.

You can choose to either figure out how to correctly iterate through the documents in a search index, or you can use the other Lucene queries as building blocks. The latter is often the simplest. For example, if you wish to iterate over the set of documents containing two terms, simply use the scorer corresponding to a BooleanQuery for iteration purposes.

The first method of our scorer to look at is docID(). It works by reporting the lowest docID() of our underlying scorers. This scorer can be thought of as being “before” the other in the index, and as we want to report numerically increasing docIDs, we always want to choose this value:

    @Override
    public int docID() {
        int backwardsDocId = backwardsScorer.docID();
        int forwardsDocId = forwardsScorer.docID();
        if (backwardsDocId <= forwardsDocId && backwardsDocId != NO_MORE_DOCS) {
            currScore = BACKWARDS_SCORE;
            return backwardsDocId;
        } else if (forwardsDocId != NO_MORE_DOCS) {
            currScore = FORWARDS_SCORE;
            return forwardsDocId;
        }
        return NO_MORE_DOCS;
    }

Similarly, we always want to advance the scorer with the lowest docID, moving it ahead. Then, we report our current position by returning docID() which, as we’ve just seen, will report the docID of the scorer that advanced the least in the nextDoc() operation.
    @Override
    public int nextDoc() throws IOException {
        int currDocId = docID();
        // increment one or both
        if (currDocId == backwardsScorer.docID()) {
            backwardsScorer.nextDoc();
        }
        if (currDocId == forwardsScorer.docID()) {
            forwardsScorer.nextDoc();
        }
        return docID();
    }

In our advance() implementation, we allow each Scorer to advance. An advance() implementation promises to land docID() either exactly on or past target. After both scorers have advanced, our call to docID() returns target if one or both of them landed on it, or otherwise the lowest docID past target.

    @Override
    public int advance(int target) throws IOException {
        backwardsScorer.advance(target);
        forwardsScorer.advance(target);
        return docID();
    }

What a Scorer adds on top of DocIdSetIterator is the “score” method. When score() is called, a score for the current document (the doc at docID) is expected to be returned. Using the full capabilities of the IndexReader, any information stored in the index can be consulted to arrive at a score, either in score() or while iterating documents in nextDoc()/advance(). Given the docID, you’ll be able to access the term vector for that document (if available) to perform more sophisticated calculations.

In our query, we’ll simply keep track of whether the current docID is from the wrapped backwards term scorer, indicating a match on the backwards term, or the forwards scorer, indicating a match on the normal, unreversed term. Recall docID() is always called on advance/nextDoc. You’ll notice we update currScore in docID(), updating it every time the document advances.

    @Override
    public float score() throws IOException {
        return currScore;
    }

A Note on Unit Testing

Now that we have an implementation of a search query, we’ll want to test it! I highly recommend using Lucene’s test framework. Lucene will randomly inject different implementations of various support classes and index implementations to throw your code off balance.
Additionally, Lucene creates test implementations of classes such as IndexReader that work to check whether your Query correctly fulfills its contract. In my work, I’ve had numerous cases where tests would fail intermittently, pointing to places where my use of Lucene’s data structures subtly violated the expected contract. An example unit test is included in the github project associated with this blog post.

Wrapping Up

That’s a lot of stuff! And I didn’t even cover everything there is to know! As an exercise for the reader, you can explore the Scorer methods cost() and freq(), as well as the rewrite() method of Query, used optionally for optimization. Additionally, I haven’t explored how most of the traditional search queries end up using a framework of Scorers/Weights that don’t actually inherit from Scorer or Weight, known as “SimScorer” and “SimWeight”. These support classes consult a Similarity instance to customize the calculation of certain search statistics such as tf, convert a payload to a boost, etc.

In short, there’s a lot here! So tread carefully, there’s plenty of fiddly bits out there! But have fun! Creating a custom Lucene query is a great way to really understand how search works, and it is the last resort in solving relevancy problems short of creating your own search engine. And if you have relevancy issues, contact us! If you don’t know whether you do, our search relevancy product, Quepid, might be able to tell you!
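To look at the BackwardsScorer merge logic in isolation, here is a minimal sketch that runs without Lucene. The Cursor class and the collect() helper are hypothetical stand-ins invented for this illustration and are not Lucene APIs; only the 5.0/1.0 scores, the "lowest docID wins" rule, and the NO_MORE_DOCS sentinel idea are taken from the article.

```java
import java.util.ArrayList;
import java.util.List;

public class BackwardsMergeSketch {
    // Sentinel meaning "exhausted"; mirrors the idea of Lucene's NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    static final float BACKWARDS_SCORE = 5.0f;
    static final float FORWARDS_SCORE = 1.0f;

    /** Hypothetical stand-in for a wrapped term scorer: a cursor over sorted doc IDs. */
    static final class Cursor {
        private final int[] docs;
        private int pos = -1; // -1 means "not positioned yet"

        Cursor(int... docs) {
            this.docs = docs;
        }

        int docID() {
            if (pos < 0) {
                return -1;
            }
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() {
            pos++;
            return docID();
        }
    }

    /** Merges both cursors and returns "docId:score" entries in increasing docId order. */
    static List<String> collect(Cursor backwards, Cursor forwards) {
        List<String> hits = new ArrayList<>();
        backwards.nextDoc();
        forwards.nextDoc();
        while (true) {
            int b = backwards.docID();
            int f = forwards.docID();
            int current = Math.min(b, f); // always report the lowest docID
            if (current == NO_MORE_DOCS) {
                break; // both cursors are exhausted
            }
            // A backwards match wins when both cursors sit on the same document.
            float score = (b <= f) ? BACKWARDS_SCORE : FORWARDS_SCORE;
            hits.add(current + ":" + score);
            // Advance every cursor that sits on the reported document.
            if (b == current) {
                backwards.nextDoc();
            }
            if (f == current) {
                forwards.nextDoc();
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [1:5.0, 2:1.0, 4:5.0, 7:5.0, 9:1.0]
        System.out.println(collect(new Cursor(1, 4, 7), new Cursor(2, 4, 9)));
    }
}
```

Document 4 appears in both lists and is reported once with the backwards score, which matches the `<=` comparison in the docID() method shown earlier.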

February 10, 2014

by Doug Turnbull

· 14,489 Views

Voron & Time Series: Working with Real Data

Dan Liebster has been kind enough to send me a real world time series database. The data has been sanitized to remove identifying details, but this is actually real world data, so we can learn a lot more from it. This is what it looks like:

The first thing that I did was take the code in this post, and try it out for size. I wrote the following:

    int i = 0;
    using (var parser = new TextFieldParser(@"c:\users\ayende\downloads\timeseries.csv"))
    {
        parser.HasFieldsEnclosedInQuotes = true;
        parser.Delimiters = new[] {","};
        parser.ReadLine(); // ignore headers
        var startNew = Stopwatch.StartNew();
        while (parser.EndOfData == false)
        {
            var fields = parser.ReadFields();
            Debug.Assert(fields != null);
            dts.Add(fields[1], DateTime.ParseExact(fields[2], "o", CultureInfo.InvariantCulture), double.Parse(fields[3]));
            i++;
            if (i == 25*1000)
            {
                break;
            }
            if (i%1000 == 0)
                Console.Write("\r{0,15:#,#} ", i);
        }
        Console.WriteLine();
        Console.WriteLine(startNew.Elapsed);
    }

Note that we are using a separate transaction per line, which means that we are really doing a lot of extra work. But this simulates very well incoming events arriving one at a time. We were able to process 25,000 events in 8.3 seconds, a rate of just over 3 events per millisecond.

Now, note that we have in here the notion of “channels”. From my investigation, it seems clear that some form of separation is actually very common in time series data. We are usually talking about sensors or some such, and we want to track data across different sensors over time. And there is little if any call for working over multiple sensors / channels at the same time.

Because of that, I made a relatively minor change in Voron that allows it to have an unlimited number of separate trees. That means that I can use as many trees as I want, and we can model a channel as a tree in Voron. I also changed things so that instead of doing a single transaction per line, we do a transaction per 1000 lines.

That dropped the time to insert 25,000 lines to 0.8 seconds, a full order of magnitude faster. That done, I inserted the full data set, which is 1,096,384 records. That took 36 seconds. In the data set I have, there are 35 channels. I just tried, and reading all the entries in a channel with 35,411 events takes 0.01 seconds. That allows doing things like averages over time, comparing data, etc. You can see the code implementing this in the following link.
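The batching idea above (one transaction per 1000 lines instead of one per line) can be sketched independently of Voron. The following is a minimal illustration in Java rather than C#; CountingStore and commitsFor() are hypothetical names made up for this sketch, and the store only counts commits instead of writing anything, so it says nothing about Voron's actual API or timings.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedWriterSketch {
    /** Hypothetical stand-in for a storage engine: it only counts commits. */
    static final class CountingStore {
        int commits = 0;
        int records = 0;

        void commit(List<double[]> batch) {
            commits++;
            records += batch.size();
        }
    }

    /** Writes `total` fake readings, committing once per `batchSize` records. */
    static int commitsFor(int total, int batchSize) {
        CountingStore store = new CountingStore();
        List<double[]> batch = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            batch.add(new double[] { i, i * 0.5 }); // {timestamp, value}, made-up data
            if (batch.size() == batchSize) {
                store.commit(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            store.commit(batch); // flush the final partial batch
        }
        return store.commits;
    }

    public static void main(String[] args) {
        System.out.println(commitsFor(25_000, 1));     // prints 25000
        System.out.println(commitsFor(25_000, 1_000)); // prints 25
    }
}
```

With a fixed cost per commit, going from 25,000 commits to 25 is the kind of change that could plausibly account for the order-of-magnitude improvement reported above.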

February 7, 2014

by Oren Eini

· 4,053 Views

Managing Disk Space in MongoDB

In our previous post on MongoDB storage structure and dbStats metrics, we covered how MongoDB stores data and the differences between the dataSize, storageSize and fileSize metrics. We can now apply this knowledge to evaluate strategies for re-using MongoDB disk space.

When documents or collections are deleted, empty record blocks within data files arise. MongoDB attempts to reuse this space when possible, but it will never return this space to the file system. This behavior explains why fileSize never decreases despite deletes on a database.

If your app frequently deletes or if your fileSize is significantly larger than the size of your data plus indexes, you can use one of the methods below to reclaim free space.

Getting your free space back

Compacting individual collections

You can compact individual collections using the compact command. This command rewrites and defragments all data in a collection, as well as all of the indexes on that collection.

Important notes on compacting:

· This operation blocks all other database activity when running and should be used only when downtime for your database is acceptable. If you are running a replica set, you can perform compaction on secondaries in order to avoid blocking the primary and use failover to make the primary a secondary before compacting it.
· Compacting individual collections will not reduce your storage footprint on disk (i.e., your fileSize) but it will defragment the collections you compact.

Compacting one or more databases

For a single-node MongoDB deployment, you can use the db.repairDatabase() command to compact all the collections in the database. This operation rewrites all the data and indexes for each collection in the database from scratch and thereby compacts and defragments the entire database.

To compact all the databases on your server process, you can stop your mongod process and run it with the “--repair” option.
Important notes on running a repair:

· This operation blocks all other database activity when running and should be used only when downtime for your database is acceptable.
· Running a repair requires free disk space equal to the size of your current data set plus 2 GB. You can use space in a different volume than the one that your mongod is running in by specifying the “--repairpath” option.

Compacting all databases on a server by re-syncing replica set nodes

For a multi-node MongoDB deployment, you can resync a secondary from scratch to reclaim space. By resyncing each node in your replica set you effectively rewrite the data files from scratch and thereby defragment your database.

Please note that if your cluster is comprised of only two electable nodes, you will sacrifice high availability during the resync because the secondary is completely wiped before syncing.

If your app is sensitive to downtime, we recommend a process similar to the one we use here at MongoLab which we call a “rolling node replacement.” This process replaces each node in your cluster in turn by bringing a new node into the cluster, replicating the data to that new node and removing the old node. In this way, your cluster can maintain the same level of redundancy during the compaction as during normal operations.

A tip about efficiently using space

usePowerOf2Sizes

Setting the usePowerOf2Sizes option is a proactive approach to reusing space in collections that experience frequent document moves or deletions. This option supersedes the default padding factor mechanism and reduces the impact of fragmentation within the collection by allocating additional space for each document in intervals that follow the powers of 2.
Setting this option for a specific collection makes it less likely that documents in that collection need to be moved when they grow in size, less likely that a document will need to be moved more than once in its lifetime, and more likely that space left by moving documents can be reused by new or other moved documents. Thanks for reading! We hope the above strategies help guide you in evaluating options for reusing empty space in your MongoDB.
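To make the "powers of 2" idea concrete, here is a small Java sketch of rounding a document size up to the next power of two. It only illustrates the arithmetic behind the description above; allocationSize() is a hypothetical helper, and MongoDB's real allocator may apply additional rules (such as minimum or maximum record sizes) that are not modeled here.

```java
public class PowerOf2SizingSketch {
    /** Rounds a document size in bytes up to the next power of two (illustration only). */
    static int allocationSize(int documentBytes) {
        if (documentBytes <= 0 || documentBytes > (1 << 30)) {
            throw new IllegalArgumentException("size out of range: " + documentBytes);
        }
        int highest = Integer.highestOneBit(documentBytes);
        // Exact powers of two are kept; everything else moves to the next power.
        return highest == documentBytes ? documentBytes : highest << 1;
    }

    public static void main(String[] args) {
        System.out.println(allocationSize(600));  // prints 1024
        System.out.println(allocationSize(900));  // prints 1024
        System.out.println(allocationSize(1025)); // prints 2048
    }
}
```

Under this scheme a 600-byte document that grows to 900 bytes still fits in its original 1024-byte slot, so it does not need to move. When it eventually outgrows the slot, the freed 1024 bytes are a standard size that another document can reuse.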

February 6, 2014

by Chris Chang

· 23,656 Views

Java: Handling a RuntimeException in a Runnable

At the end of last year I was playing around with running scheduled tasks to monitor a Neo4j cluster and one of the problems I ran into was that the monitoring would sometimes exit. I eventually realised that this was because a RuntimeException was being thrown inside the Runnable method and I wasn’t handling it. The following code demonstrates the problem: import java.util.ArrayList; import java.util.List; import java.util.concurrent.*; public class RunnableBlog { public static void main(String[] args) throws ExecutionException, InterruptedException { ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(); executor.scheduleAtFixedRate(new Runnable() { @Override public void run() { System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis()); throw new RuntimeException("game over"); } }, 0, 1000, TimeUnit.MILLISECONDS).get(); System.out.println("exit"); executor.shutdown(); } } If we run that code we’ll see the RuntimeException but the executor won’t exit because the thread died without informing it: Exception in thread "main" pool-1-thread-1 -> 1391212558074 java.util.concurrent.ExecutionException: java.lang.RuntimeException: game over at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:252) at java.util.concurrent.FutureTask.get(FutureTask.java:111) at RunnableBlog.main(RunnableBlog.java:11) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120) Caused by: java.lang.RuntimeException: game over at RunnableBlog$1.run(RunnableBlog.java:16) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

At the time I ended up adding a try/catch block and printing the exception like so:

public class RunnableBlog {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                try {
                    System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis());
                    throw new RuntimeException("game over");
                } catch (RuntimeException e) {
                    e.printStackTrace();
                }
            }
        }, 0, 1000, TimeUnit.MILLISECONDS).get();

        System.out.println("exit");
        executor.shutdown();
    }
}

This allows the exception to be recognised and, as far as I can tell, means that the thread executing the Runnable doesn’t die.

java.lang.RuntimeException: game over
pool-1-thread-1 -> 1391212651955
    at RunnableBlog$1.run(RunnableBlog.java:16)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
pool-1-thread-1 -> 1391212652956
java.lang.RuntimeException: game over
    at RunnableBlog$1.run(RunnableBlog.java:16)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
pool-1-thread-1 -> 1391212653955
java.lang.RuntimeException: game over
    at RunnableBlog$1.run(RunnableBlog.java:16)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

This worked well and allowed me to keep monitoring the cluster. However, I recently started reading 'Java Concurrency in Practice' (only 6 years after I bought it!) and realised that this might not be the proper way of handling the RuntimeException.

public class RunnableBlog {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run() {
                try {
                    System.out.println(Thread.currentThread().getName() + " -> " + System.currentTimeMillis());
                    throw new RuntimeException("game over");
                } catch (RuntimeException e) {
                    Thread t = Thread.currentThread();
                    t.getUncaughtExceptionHandler().uncaughtException(t, e);
                }
            }
        }, 0, 1000, TimeUnit.MILLISECONDS).get();

        System.out.println("exit");
        executor.shutdown();
    }
}

I don’t see much difference between the two approaches so it’d be great if someone could explain to me why this approach is better than my previous one of catching the exception and printing the stack trace.

February 6, 2014

by Mark Needham

· 19,655 Views

Using Database Views in Grails

This post is a quick explanation of how to use database views in Grails. For an introduction I tried to summarize what database views are. However, I noticed I cannot describe them better than Wikipedia already does. Therefore I will just quote the Wikipedia summary of View (SQL) here:

In database theory, a view is the result set of a stored query on the data, which the database users can query just as they would in a persistent database collection object. This pre-established query command is kept in the database dictionary. Unlike ordinary base tables in a relational database, a view does not form part of the physical schema: as a result set, it is a virtual table computed or collated from data in the database, dynamically when access to that view is requested. Changes applied to the data in a relevant underlying table are reflected in the data shown in subsequent invocations of the view. (Wikipedia)

Example

Let's assume we have a Grails application with the following domain classes:

class User {
    String name
    Address address
    ...
}

class Address {
    String country
    ...
}

For whatever reason we want a domain class that contains direct references to the name and the country of a user. However, we do not want to duplicate these two values in another database table. A view can help us here.

Creating the view

At this point I assume you are already using the Grails database-migration plugin. If you don't, you should definitely check it out. The plugin is automatically included with newer Grails versions and provides a convenient way to manage databases using change sets. To create a view we just have to create a new change set:

changeSet(author: '..', id: '..') {
    createView("""
        SELECT u.id, u.name, a.country
        FROM user u
        JOIN address a ON u.address_id = a.id
    """, viewName: 'user_with_country')
}

Here we create a view named user_with_country which contains three values: user id, user name and country.

Creating the domain class

Like normal tables, views can be mapped to domain classes. The domain class for our view looks very simple:

class UserWithCountry {
    String name
    String country

    static mapping = {
        table 'user_with_country'
        version false
    }
}

Note that we disable versioning by setting version to false (we don't have a version column in our view). At this point we just have to be sure that our database change set is executed before Hibernate tries to create/update tables on application start. This is typically done by disabling Hibernate's table creation in DataSource.groovy and enabling the automatic migration on application start by setting grails.plugin.databasemigration.updateOnStart to true. Alternatively, this can be achieved by manually executing all new change sets with the dbm-update command.

Usage

Now we can use our UserWithCountry class to access the view:

Address johnsAddress = new Address(country: 'england')
User john = new User(name: 'john', address: johnsAddress)
john.save(failOnError: true)

assert UserWithCountry.count() == 1

UserWithCountry johnFromEngland = UserWithCountry.get(john.id)
assert johnFromEngland.name == 'john'
assert johnFromEngland.country == 'england'

Advantages of views

I know the example I am using here is not the best. The relationship between User and Address is already very simple and a view isn't required here. However, if you have more sophisticated data structures, views can be a nice way to hide complex relationships that would require joining a lot of tables. Views can also be used as a security measure if you don't want to expose all columns of your tables to the application.

January 25, 2014

by Michael Scharhag

· 16,697 Views

Big Data Search, Part 4: The Index Format is Horrible

I have completed my own exercise, and while I wanted to try it with a “few allocations” rule, it is interesting to see just how far out there the code is. This isn’t something that you can really use for anything except as a basis to see how badly you are doing.

Let us start with the index format. It is just a CSV file with the value and the position in the original file. That means that any search we want to do on the file is actually a binary search, as discussed in the previous post. But doing a binary search like that is an absolute killer for performance.

Let us consider our 15TB data set. In my tests, a 1GB file with 4.2 million rows produced a roughly 80MB index. Assuming the same is true for the larger file, that gives us a 1.2 TB index file. In my small index, we have to do 24 seeks to get to the right position in the file. And as you should know, disk seeks are expensive. They are on the order of 10ms or so, so the cost of actually searching the index is close to a quarter of a second. Now, to be fair, there are going to be a lot of caching opportunities here, but probably not that many if we have a lot of queries to deal with.

Of course, the fun thing about this is that even with a 1.2 TB file, we are still talking about less than 40 seeks (the beauty of O(log N) in action), but that is still pretty expensive. Even worse, this is what happens when we are running a single query at a time. What do you think will happen if we are actually running this with multiple threads generating queries? Now we will have a lot of (effectively random) seeks that would generate a big performance sink. This is especially true if we consider that any storage solution big enough to store the data is going to be composed of an aggregate of HDDs. Sure, we get multiple spindles, so we get better performance overall, but still…

Obviously, there are multiple solutions for this issue.
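The seek arithmetic above can be reproduced with a few lines. This is only a back-of-the-envelope sketch: the row density and seek time are the figures quoted in the post, the function names are made up for illustration, and it assumes every probe is an uncached disk seek. It gives 23 probes for the 1GB index and 36 for the 15TB one, in line with the "24" and "less than 40" quoted above.

```python
# Back-of-the-envelope seek cost for binary search over the index file.
# ROWS_PER_GB and SEEK_MS come from the figures quoted in the post;
# the function names are illustrative assumptions.

ROWS_PER_GB = 4_200_000             # "a 1GB file with 4.2 million rows"
SEEK_MS = 10                        # "on the order of 10ms or so"
ROWS_15TB = ROWS_PER_GB * 15 * 1024  # same density scaled to 15 TB


def binary_search_seeks(entries: int) -> int:
    """Worst-case number of probes binary search needs over `entries` records."""
    # floor(log2(n)) + 1, computed exactly with integer arithmetic.
    return entries.bit_length()


def search_cost_ms(entries: int, seek_ms: int = SEEK_MS) -> int:
    """Total cost if every probe turns into one uncached disk seek."""
    return binary_search_seeks(entries) * seek_ms


if __name__ == "__main__":
    for label, rows in (("1 GB", ROWS_PER_GB), ("15 TB", ROWS_15TB)):
        print(label, rows, binary_search_seeks(rows), search_cost_ms(rows), "ms")
```

The point of the exercise is that the probe count grows very slowly with the data size, while the cost per probe stays fixed, so reducing the number of probes that hit the disk matters far more than the size of the index.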
B+Trees solve the problem by packing multiple keys into a single page, so instead of doing O(log₂ N), you are usually doing O(log₃₆ N) or O(log₁₀₀ N). Considering those fan-outs, we will have 6 to 8 seeks to do to get to our data. Much better than the 40 seeks required using plain binary search. It would actually be better than that in the common case, since the first few levels of the tree are likely to reside in memory (and probably in L1, if we are speaking about that).

However, given that we are storing sorted strings here, one must give some attention to Sorted Strings Tables. The way those work, you have the sorted strings in the file, and the footer contains two important bits of information. The first is the bloom filter, which allows you to quickly rule out missing values, but the more important factor is that it also contains the positions of (by default) every 16th entry in the file. This means that in our 15 TB data file (with 64.5 billion entries), we will use about 15GB just to store pointers to the different locations in the index file (which will be about 1.2 TB). Note that the numbers actually are probably worse. Because SSTs (note that when talking about SSTs I am talking specifically about the leveldb implementation) utilize many forms of compression, it is actually likely that the file size will be smaller (although, since the “value” we use is just a byte position in the data file, we won’t benefit from compression there). Key compression is probably a lot more important here.

However, note that this is a pretty poor way of doing things. Sure, the actual data format is better, in the sense that we don’t store as much, but in terms of the number of operations required? Not so much. We still need to do a binary search over the entire file. In particular, the leveldb implementation utilizes memory mapped files. What this ends up doing is relying on the OS to keep the midway points in the file in RAM, so we don’t have to do so much seeking.
Without that, the cost of actually seeking every time would make SSTs impractical. In fact, you would pretty much have to introduce another layer on top of this, but at that point, you are basically doing trees, and a binary tree is a better friend here.

This leads to an interesting question. SST is probably so popular inside Google because they deal with a lot of data, and the file format is very friendly to compression of various kinds. It is also a pretty simple format. That makes it much nicer to work with. On the other hand, a B+Tree implementation is a lot more complex, and it would probably be several orders of magnitude more complex if it had to try to do the same compression tricks that SSTs do. Another factor that is probably as important is that, as I understand it, SSTs are mostly used for actual sequential access (map/reduce stuff) and not necessarily for the random reads that are done in leveldb.

It is interesting to think about this in this fashion, at least, even if I don’t know what I’ll be doing with it.
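To make the "positions of every 16th entry" idea concrete, here is a minimal sketch of a sparse index lookup. It is not the leveldb file format: a Python list stands in for the sorted on-disk entries, the bloom filter and compression are left out, and the names `build_sparse_index` and `lookup` are invented for the example.

```python
from bisect import bisect_right

# Hypothetical illustration of a sparse index: keep every 16th key in memory,
# binary search that small list, then scan at most 16 entries of the big one.


def build_sparse_index(sorted_keys, every=16):
    """Sample every `every`-th key together with its position."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), every)]


def lookup(sorted_keys, sparse, key, every=16):
    """Return the position of `key` in `sorted_keys`, or None if absent."""
    samples = [k for k, _ in sparse]
    slot = bisect_right(samples, key) - 1   # last sampled key <= key
    if slot < 0:
        return None                         # key sorts before the first entry
    start = sparse[slot][1]
    # In a real file this scan is one sequential read of a single block.
    for i in range(start, min(start + every, len(sorted_keys))):
        if sorted_keys[i] == key:
            return i
    return None


if __name__ == "__main__":
    keys = [f"k{i:04d}" for i in range(100)]
    sparse = build_sparse_index(keys)
    print(len(sparse), lookup(keys, sparse, "k0037"))
```

The trade-off is the one discussed above: the sampled keys have to live somewhere cheap to search (RAM, or pages the OS keeps cached), otherwise the binary search over them turns back into disk seeks.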

January 24, 2014

by Oren Eini

· 12,111 Views

How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2, Part 1

Learn how to set up a four-node Hadoop cluster using AWS EC2, PuTTY(gen), and WinSCP.

January 23, 2014

by Hardik Pandya

· 136,010 Views · 3 Likes

Big Data Search, Part 2: Setting Up

The interesting thing about this problem is that I was very careful in how I phrased things. I said what I wanted to happen, but didn’t specify what needs to be done. That was quite intentional. For that matter, the fact that I am posting about what is going to be our acceptance criteria is also intentional. The idea is to have a non-trivial task, but something that should be very well understood and easy to research. It also means that the candidate needs to be able to write some non-trivial code. And I can tell a lot about a dev from such a project. At the same time, this is a very self-contained scenario. The idea is that this is something that you can do in a short amount of time.

The reason that this is an interesting exercise is that this is actually at least two totally different but related problems. First, in a 15TB file, we obviously cannot rely on just scanning the entire file. That means that we have to have an index. And that means we have to build it. Interestingly enough, an index being a sorted structure, that means that we have to solve the problem of sorting more data than can fit in main memory. The second problem is probably easier, since it is just an implementation of external sort, and there are plenty of algorithms around to handle that.

Note that I am not really interested in actual efficiencies for this particular scenario. I care about being able to see the code, see that it works, etc. My solution, for example, is a single-threaded system that makes no attempt at parallelism or I/O optimizations. It clocks at over 1 GB / minute and the memory consumption is under 150MB. Queries for a unique value return the result in 0.0004 seconds. Queries that returned 153K results completed in about 2 seconds. When increasing the used memory to about 650MB, there isn’t really any difference in performance, which surprised me a bit. Then again, the entire code is probably highly inefficient. But that is good enough for now.
The process is kicked off with indexing:

var options = new DirectoryExternalStorageOptions("/path/to/index/files");
var input = File.OpenRead(@"/path/to/data/crimes_-_2001_to_present.csv");
var sorter = new ExternalSorter(input, options, new int[]
{
    1, // case number
    4, // ichr
});

sorter.Sort();

I am actually using the Chicago crime data for this. This is a 1GB file that I downloaded from the Chicago city portal in CSV format. This is what the data looks like:

The ExternalSorter will read and parse the file, and start reading it into a buffer. When it gets to a certain size (about 64MB of source data, usually), it will sort the values in memory and output them into temporary files. Those files look like this:

Initially, I tried to do that with binary data, but it turns out that that was too complex to be easy, and writing this in a human readable format made it much easier to work with. The format is pretty simple: you have the value on the left, and on the right you have the start position of the row for this value.

We generate about 17 such temporary files for the 1GB file, one temporary file per 64 MB of the original file. This lets us keep our actual memory consumption very low, but for larger data sets, we’ll probably want to actually do the sort every 1 GB or maybe more. Our test machine has 16 GB of RAM, so doing a sort and outputting a temporary file every 8 GB can be a good way to handle things. But that is beside the point.

The end result is that we have multiple sorted files, but they aren’t sequential. In other words, in file #1 we have values 1,4,6,8 and in file #2 we have 1,2,6,7. We need to merge all of them together. Luckily, this is easy enough to do. We basically have a heap that we feed entries from the files into. And that pretty much takes care of this. See merge sort if you want more details about this.
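The sort-into-runs and heap-merge steps described above can be sketched as follows. This is not the ExternalSorter from the post: the function names, the `run_size` parameter and the use of temporary files are assumptions for illustration, and the heap is the one inside the standard library's `heapq.merge`. Each run is written as "value,position" lines, as in the post.

```python
import heapq
import tempfile

# Minimal external-sort sketch: sort fixed-size chunks in memory, spill each
# chunk to a temporary file, then k-way merge the files with a heap.


def write_sorted_runs(entries, run_size):
    """Split (value, position) pairs into sorted runs stored in temp files."""
    runs, buf = [], []

    def flush():
        buf.sort()
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{value},{position}\n" for value, position in buf)
        f.seek(0)                      # rewind so the merge can read it
        runs.append(f)
        buf.clear()

    for entry in entries:
        buf.append(entry)
        if len(buf) >= run_size:
            flush()
    if buf:
        flush()
    return runs


def parse(line):
    """Turn a 'value,position' line back into a (value, position) pair."""
    value, position = line.rstrip("\n").rsplit(",", 1)
    return value, int(position)


def merge_runs(runs):
    """Lazily yield all pairs in sorted order; one line per run is in memory."""
    return heapq.merge(*(map(parse, f) for f in runs))


if __name__ == "__main__":
    data = [("1", 0), ("4", 10), ("6", 20), ("8", 30),
            ("1", 40), ("2", 50), ("6", 60), ("7", 70)]
    for pair in merge_runs(write_sorted_runs(data, run_size=4)):
        print(pair)
```

With `run_size=4` the sample data produces exactly the two runs from the text (1,4,6,8 and 1,2,6,7). Memory use during the merge depends on the number of runs, not on their size, which is why larger runs (and fewer files) become attractive for bigger inputs.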
The end result of merging all of those files is… another file, just like them, that contains all of the data sorted. Then it is time to actually handle the other issue, actually searching the data. We can do that using simple binary search, with the caveat that because this is a text file, and there are no fixed-size records or pages, it is actually a bit hard to figure out where to start reading. In effect, what I am doing is selecting an arbitrary byte position, then walking backward until I find a ‘\n’. Once I find the new line character, I can read the full line, check the value, and decide where I need to look next. Assuming that I actually found my value, I can now go to the byte position of the value in the original file and read the original line, giving it to the user.

Assuming an indexing rate of 1 GB / minute, a 15 TB file would take about 10 days to index. There are ways around that as well, but I’ll touch on them in my next post. What all of this did was bring home just how much we usually don’t have to worry about such things. But I consider this research time well spent; we’ll be using this in the future.
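The "pick a byte position, walk back to the previous newline" search can be sketched like this. It is an illustration under assumptions, not the code from the post: the index is taken to be "value,position" lines sorted by the raw bytes of the value, the names `line_start` and `find` are invented, and the backward walk reads one byte at a time for clarity rather than speed.

```python
import io

# Binary search over a sorted text file with variable-length lines.


def line_start(f, pos):
    """Walk backward from `pos` to the first byte of the line containing it."""
    while pos > 0:
        f.seek(pos - 1)
        if f.read(1) == b"\n":
            break
        pos -= 1
    return pos


def find(f, size, key):
    """Return the data-file position stored for `key`, or None if absent."""
    lo, hi = 0, size                    # both always sit on line boundaries
    while lo < hi:
        mid = line_start(f, (lo + hi) // 2)
        f.seek(mid)
        line = f.readline()
        value, _, position = line.rstrip(b"\n").rpartition(b",")
        if value == key:
            return int(position)
        if value < key:
            lo = mid + len(line)        # continue after this line
        else:
            hi = mid                    # continue before this line
    return None


SAMPLE = b"HX100,0\nHX205,131\nHY001,262\nHZ999,400\n"

if __name__ == "__main__":
    index = io.BytesIO(SAMPLE)
    print(find(index, len(SAMPLE), b"HY001"))
```

Because `lo` and `hi` only ever move to line boundaries, every probe lands on a whole line and the range shrinks on each iteration, so the loop terminates even though the records have different lengths.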

January 21, 2014

by Oren Eini

· 3,454 Views

Node.js and N1QL

This post was originally written by Brett Lawson.

So, recently I added support to our Node.js client for executing N1QL queries against your cluster, provided you are running an instance of the N1QL engine (to get hold of the updated version of the Node.js client with this support, point npm to our GitHub master branch at https://github.com/couchbase/couchnode). When I implemented it, I didn’t have very much to test against at the time, so I figured it would be an interesting endeavor to see how nice the Node.js beer-sample example would look if we used N1QL queries entirely, rather than any views. I started by converting the basic queries which simply select all beers or breweries from the sample data, and then moved on to converting the live-search querying to use N1QL as well. I figured I would write a little blog post on the conversions and make some remarks about what I noticed along the way.

Here is our first query:

var q = {
  limit : ENTRIES_PER_PAGE,
  stale : false
};
db.view( "beer", "by_name", q).query(function(err, values) {
  var keys = _.pluck(values, 'id');
  db.getMulti( keys, null, function(err, results) {
    var beers = _.map(results, function(v, k) {
      v.value.id = k;
      return v.value;
    });
    res.render('beer/index', {'beers':beers});
  })
});

and the converted version:

db.query(
  "SELECT META().id AS id, * FROM beer-sample WHERE type='beer' LIMIT " + ENTRIES_PER_PAGE,
  function(err, beers) {
    res.render('beer/index', {'beers':beers});
  });

As you can see, we no longer need to do two separate operations to retrieve the list. We can execute our N1QL query, which returns all the information that we need and formats it appropriately; rather than needing to reformat the data and add our id values, we can simply select them as part of the result set. I find the N1QL version here much more concise and appreciate how simple it was to construct the query.

I then converted the brewery listing function following a similar path, and here is what I ended up with. As you can see, it is similarly beautiful and concise:

db.query(
  "SELECT META().id AS id, name FROM beer-sample WHERE type='brewery' LIMIT " + ENTRIES_PER_PAGE,
  function(err, breweries) {
    res.render('brewery/index', {'breweries':breweries});
  });

Next I converted the searching methods. These were a bit more of a challenge: looking at the original code directly, without thinking about what it was trying to achieve, the semantics were not immediately obvious. Here is what it looked like:

var q = {
  startkey : value,
  endkey : value + JSON.parse('"\u0FFF"'),
  stale : false,
  limit : ENTRIES_PER_PAGE
}
db.view( "beer", "by_name", q).query(function(err, values) {
  var keys = _.pluck(values, 'id');
  db.getMulti( keys, null, function(err, results) {
    var beers = [];
    for(var k in results) {
      beers.push({
        'id': k,
        'name': results[k].value.name,
        'brewery_id': results[k].value.brewery_id
      });
    }
    res.send(beers);
  });
});

Again, we have quite a bit of code to achieve something which you would expect to be quite simple. In case you can’t tell, the map/reduce query above retrieves a listing of beers whose names begin with the value entered by the user. We are going to convert this to a N1QL LIKE clause, and as an added bonus, we will allow the search term to appear anywhere in the string, instead of requiring it at the beginning:

db.query(
  "SELECT META().id, name, brewery_id FROM beer-sample WHERE type='beer' AND LOWER(name) LIKE '%" + term + "%' LIMIT " + ENTRIES_PER_PAGE,
  function(err, beers) {
    res.send(beers);
  });

We have again collapsed a large amount of vaguely understandable code down to a simple and concise query. I believe this begins to show the power of N1QL and why I am personally so excited to see N1QL.
There is, however, one caveat I noticed while doing this: as with SQL, you need to be careful about what kind of user data you are passing into your queries. I wrote a simple cleaning function to try and prevent any malicious intent (though N1QL is currently read-only anyway), but my cleaning code is by no means extensive. Another issue I noticed is that our second query with the LIKE clause executed significantly slower as a N1QL query than it did when using map/reduce. I believe this is simply a result of N1QL still being in developer preview, and there are lots of optimizations left to be done by the N1QL team.

If you want to see the fully converted source code, take a look at the n1ql branch of the beersample-node repository available here: https://github.com/couchbaselabs/beersample-node/tree/n1ql.

Thanks! Brett
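The cleaning function mentioned above is not shown in the post, so the following is only one conservative possibility, sketched in Python rather than Node.js. It assumes an allow-list approach (keep lowercase ASCII letters, digits and spaces, drop everything else), which removes quotes and LIKE wildcards without relying on any N1QL escaping rules. The names `clean_term` and `build_search_query` and the 50-character cap are hypothetical.

```python
import re

# Hypothetical allow-list cleaner for a search term that gets concatenated
# into a query string. Everything outside [0-9a-z ] is dropped.
_DISALLOWED = re.compile(r"[^0-9a-z ]")


def clean_term(raw, max_len=50):
    """Lowercase the input and strip every character not on the allow-list."""
    lowered = raw.lower()               # the query compares against LOWER(name)
    return _DISALLOWED.sub("", lowered)[:max_len].strip()


def build_search_query(raw, limit):
    """Build the LIKE query text from the post using a cleaned term."""
    term = clean_term(raw)
    return (
        "SELECT META().id, name, brewery_id FROM beer-sample "
        "WHERE type='beer' AND LOWER(name) LIKE '%" + term + "%' "
        "LIMIT " + str(int(limit))      # int() rejects non-numeric limits
    )


if __name__ == "__main__":
    print(build_search_query("Pale' OR 1=1 --", 30))
```

An allow-list this strict also discards legitimate input such as accented characters in beer names, so it trades recall for safety. Where the client library offers parameterized queries, passing the term as a parameter is generally preferable to cleaning and concatenating it.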

January 17, 2014

by Don Pinto

· 7,921 Views · 1 Like

A Beginner's Guide to ACID and Database Transactions

Read the original article here.

January 7, 2014

by Vlad Mihalcea

· 20,459 Views