The Latest and Popular SDLC Topics

The Latest Popular Topics

OpenStack + Private Cloud = Ideal Habitat for Devops

The use of OpenStack in the private cloud is invaluable for DevOps. It provides engineers the ability to innovate quickly and deal with uncertainty. It also maximizes existing infrastructure and provides a programmable, software-defined IaC. Openstack in the private cloud = agile development OpenStack has emerged as the de facto standard for IaaS in the private cloud. It gives engineers a vital self-service capability to provision (and de-provision) environments, allowing them to act autonomously, in the moment. This helps to eliminate the downstream bottleneck caused by waiting for operations staff to find time to do the provisioning. As OpenStack is open source it is vendor agnostic, allowing you to take advantage of competitive pricing rather than suffering from vendor lock-in. A private cloud means lower cost for the same capacity in a public cloud, which is especially useful for enterprises with high data needs. For security reasons, OpenStack is still mainly used in the private cloud by developers and QA, i.e. in a non-production context. However, OpenStack gives an ability to optimize application performance and/or security by having more control compared to public cloud. The software is increasingly backed by the critical mass of leading IT infrastructure vendors such as IBM, CICSO and HP. Gartner assumes that “by 2019, OpenStack enterprise deployments will grow tenfold, up from just hundreds of production deployments today, due to increased maturity and growing ecosystem support.”1 Challenges to consider OpenStack implementation skills are still rare in the market, so experimentation and self-learning is necessary. Although this takes time, it is offset by the fact the software is free and represents a good opportunity to gain internal expertise. This is particularly valid if you class infrastructure as a core competence. The maturity and functionality of OpenStack projects vary widely - while it covers storage, network and compute, the main adoption currently happens around compute (Nova) and block storage (Cinder), with object storage and network (Neutron) lacking significantly behind. However, without leveraging virtualized network services as part of a private cloud, full-stack environment provisioning is not possible, so don’t forget to add necessary network services to your private cloud. Where to begin Integrating OpenStack clouds with existing infrastructure can be a challenge. It is hardly plug and play. At first, it is best to focus on relatively isolated DevOps environments, such as Gartner’s “mode two”2 applications rather than introducing open stack across the board straight away, (Bimodal IT “refers to having two modes of IT, each designed to develop and deliver information – and technology – intensive services in its own way. Mode 1 is traditional, emphasizing scalability, efficiency, safety and accuracy. Mode 2 is nonsequential, emphasizing agility and speed.”3) As with any open source software, new functions and upgrades are frequently released. This means keeping up with changes in functionality and filling gaps with customizations or third-party products. Upgrades are complex and typically require planned downtime. For these reasons, we recommend choosing a hardened distribution and sticking with it. Openstack is the most complete vendor agnostic solution for storage, network and compute services. The ability for developers to instantly spin up environments at any time is invaluable for a fully agile DevOps environment, and is well worth the effort it takes to acclimatize to Openstack. 1 http://www.prnewswire.com/news-releases/suse-openstack-cloud-5-to-simplify-private-cloud-management-300048721.html 2 http://www.gartner.com/it-glossary/bimodal 3 http://www.gartner.com/it-glossary/bimodal

June 26, 2015

by Ron Gidron

· 3,971 Views · 2 Likes

How to Debug Your Maven Build with Eclipse

When running a Maven build with many plugins (e.g. the jOOQ or Flyway plugins), you may want to have a closer look under the hood to see what’s going on internally in those plugins, or in your extensions of those plugins. This may not appear obvious when you’re running Maven from the command line, e.g. via: C:\Users\jOOQ\workspace>mvn clean install Luckily, it is rather easy to debug Maven. In order to do so, just create the following batch file on Windows: @ECHO OFF IF "%1" == "off" ( SET MAVEN_OPTS= ) ELSE ( SET MAVEN_OPTS=-Xdebug -Xnoagent -Djava.compile=NONE -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5005 ) Of course, you can do the same also on a MacOS X or Linux box, by usingexport intead of SET. Now, run the above batch file and proceed again with building: C:\Users\jOOQ\workspace>mvn_debug C:\Users\jOOQ\workspace>mvn clean install Listening for transport dt_socket at address: 5005 Your Maven build will now wait for a debugger client to connect to your JVM on port 5005 (change to any other suitable port). We’ll do that now with Eclipse. Just add a new Remote Java Application that connects on a socket, and hit “Debug”: That’s it. We can now set breakpoints and debug through our Maven process like through any other similar kind of server process. Of course, things work exactly the same way with IntelliJ or NetBeans. Once you’re done debugging your Maven process, simply call the batch again with parameter off: C:\Users\jOOQ\workspace>mvn_debug off C:\Users\jOOQ\workspace>mvn clean install And your Maven builds will no longer be debugged. Happy debugging!

June 25, 2015

by Lukas Eder

· 25,037 Views

Spring Integration Kafka 1.2 is Available, With 0.8.2 Support and Performance Enhancements

Spring Integration Kafka 1.2 is out with a major performance overhaul.

June 25, 2015

by Pieter Humphrey

· 2,960 Views

What's Coming With JSF 2.3?

There seems to be a good deal of excitement in the Java EE community around the new MVC specification. This is certainly great and most understandable. Some (perhaps more established) parts of the Java EE community has in the meanwhile been more quietly contributing to the continuing evolution of JSF 2.3. So what is in JSF 2.3? The real answer is that it depends on what the JSF community needs. There is a small raft of work that's on the table now, but I think the JSF community should be very proactive in helping determining what needs to be done to keep the JSF community strong for years to come. Just as he did for JSF 2.2, Java EE community advocate Arjan Tijms has started maintaining a regularly updated blog entry listing the things the JSF 2.3 expert group is working on. So far he has detailed CDI injection improvements, the newly added post render view event, improved collections support and a few others. You should definitely check it out as a JSF developer and provide your input. Arjan also has an excellent collection of Java EE 8 blog entries generally onzeef.com. On a related note, JSF specification lead Ed Burns wrote up a very interesting recent blog entryoutlining the continuing momentum behind the strong JSF ecosystem. He highlighted a couple of brand new JSF plugins that we will explore in depth in future entries.

June 25, 2015

by Reza Rahman

· 4,551 Views · 2 Likes

Writing a Download Server Part I: Always Stream, Never Keep Fully in Memory

Downloading various files (either text or binary) is a bread and butter of every enterprise application. PDF documents, attachments, media, executables, CSV, very large files, etc. Almost every application, sooner or later, will have to provide some form of download. Downloading is implemented in terms of HTTP, so it's important to fully embrace this protocol and take full advantage of it. Especially in Internet facing applications features like caching or user experience are worth considering. This series of articles provides a list of aspects that you might want to consider when implementing all sorts of download servers. Note that I avoid "best practices" term, these are just guidelines that I find useful but are not necessarily always applicable. One of the biggest scalability issues is loading whole file into memory before streaming it. Loading full file into byte[] to later return it e.g. from Spring MVC controller is unpredictable and doesn't scale. The amount of memory your server will consume depends linearly on number of concurrent connections times average file size - factors you don't really want to depend on so much. It's extremely easy to stream contents of a file directly from your server to the client byte-by-byte (with buffering), there are actually many techniques to achieve that. The easiest one is to copy bytes manually: @RequestMapping(method = GET) public void download(OutputStream output) throws IOException { try(final InputStream myFile = openFile()) { IOUtils.copy(myFile, output); } } Your InputStream doesn't even have to be buffered, IOUtils.copy() will take care of that. However this implementation is rather low-level and hard to unit test. Instead I suggest returning Resource: @RestController @RequestMapping("/download") public class DownloadController { private final FileStorage storage; @Autowired public DownloadController(FileStorage storage) { this.storage = storage; } @RequestMapping(method = GET, value = "/{uuid}") public Resource download(@PathVariable UUID uuid) { return storage .findFile(uuid) .map(this::prepareResponse) .orElseGet(this::notFound); } private Resource prepareResponse(FilePointer filePointer) { final InputStream inputStream = filePointer.open(); return new InputStreamResource(inputStream); } private Resource notFound() { throw new NotFoundException(); } } @ResponseStatus(value= HttpStatus.NOT_FOUND) public class NotFoundException extends RuntimeException { } Two abstractions were created to decouple Spring controller from file storage mechanism.FilePointer is a file descriptor, irrespective to where that file was taken. Currently we use one method from it: public interface FilePointer { InputStream open(); //more to come } open() allows reading the actual file, no matter where it comes from (file system, database BLOB, Amazon S3, etc.) We will gradually extend FilePointer to support more advanced features, like file size and MIME type. The process of finding and creatingFilePointers is governed by FileStorage abstraction: public interface FileStorage { Optional findFile(UUID uuid); } Streaming allows us to handle hundreds of concurrent requests without significant impact on memory and GC (only a small buffer is allocated in IOUtils). BTW I am using UUID to identify files rather than names or other form of sequence number. This makes it harder to guess individual resource names, thus more secure (obscure). More on that in next articles. Having this basic setup we can reliably serve lots of concurrent connections with minimal impact on memory. Remember that many components in Spring framework and other libraries (e.g. servlet filters) may buffer full response before returning it. Therefore it's really important to have an integration test trying to download huge file (in tens of GiB) and making sure the application doesn't crash. Writing a download server Part I: Always stream, never keep fully in memory Part II: headers: Last-Modified, ETag and If-None-Match Part III: headers: Content-length and Range Part IV: Implement HEAD operation (efficiently) Part V: Throttle download speed Part VI: Describe what you send (Content-type, et.al.) The sample application developed throughout these articles is available on GitHub.

June 24, 2015

by Tomasz Nurkiewicz

· 17,129 Views

Overcoming Barriers to Performance and Scalability Test Automation

[This article was written by Ophir Prusak] Guest author Ophir Prusak is chief evangelist atBlazeMeter. To learn more about load and performance testing automation, he invites readers toattend a meetupthis Wednesday, June 24, at New Relic’s San Francisco offices. Performance and load testing are kind of like flossing your teeth. You know you need to do it, but you might not be doing it as much as you should. When your site goes down because it couldn’t handle the load, you look back and realize you might have easily prevented it with a little more testing in advance. That’s why companies are automating their application testing in an effort to lower costs, increase efficiency, and reduce the time needed to release new features. The importance of automated testing in a continuous delivery era Continuous Delivery (CD) is rapidly emerging as the “new normal” in software development, as Perforce discovered in an independent survey, with an estimated 80% of SaaS companies and 51% of non-SaaS companies adopting this practice. Companies that provide Software-as-a-Service know they need to be continuously creating new features, updating their websites, and optimizing their backend. But while software development has adapted nicely in terms of automation, the testing side has moved more slowly. For a fully Continuous Delivery and Integration process to be realized, performance testing must be automated. As the need for testing increases, doing it manually can dramatically increase your time to release. Automating testing throughout the CD process can help detect errors instantly and deliver software faster. Making it work JMeter is the de facto standard in open source load testing. It’s the most widely used open source tool for performance testing for a good reason. There’s virtually nothing it can’t test (websites, native mobile applications, APIs, and Web applications) and it’s extremely powerful and fully featured. Yet there are challenges. JMeter poses a steep learning curve in terms of integration and ease of use. Additionally, it doesn’t integrate easily with APM and Continuous Integration (CI) tools. Many developers have been looking for a way to conduct performance testing with less time and effort—and fewer hiccups along the way. Taurus: An effort to simplify test automation A new open source project called Taurus (Test AUtomation Running Smoothly) is designed to provide exactly that—a way to remove most of the pain of using JMeter on its own. Taurus can give you the ability to Create and define a load test even without using JMeter. Override existing JMeter files or tests configurations. Create human-readable configuration files and testing scripts that are easily added to source control systems like GitHub. Integrate into CI tools like Jenkins. Run multiple tests in parallel. Provide pass/fail criteria back into the CI tool for easier automation of test-results analysis. Make analysis of test results easier and more intuitive. Taurus still uses JMeter under the hood, but is designed to have a much easier learning curve, especially for simple tests. Taurus also offers a built-in result analysis engine that provides both console-based reporting features and result analysis. Performance testing and optimizing your applications is not simple, yet there are solutions available that make the process easier and more successful. I’m looking forward to seeing how the technology evolves even further in the near future. If you want to learn more about Taurus, check out the project on GitHub. Better yet, you are invited to come to a meetup this Wednesday, June 24, at New Relic’s San Francisco offices. You can learn a lot more about Taurus and how you can use it to help scale load and performance testing automation.

June 24, 2015

by Fredric Paul

· 1,763 Views

Build a search engine with strus

What is strus ? The project strus is a collection of libraries and tools written in C++ to build a competitive search engine. Currently it is a single person project that started in September 2014 and therefore the competitiveness in terms of features of the software is more a promise than a fact. It definitely needs more brain to be put into it to catch up with the big players for open source search engines Lucene and Xapian. But strus is not only a me-too-project for search. Strus introduces expression matching and information extraction on a different level than other known open source engines (read more…). Strus simplifies the architecture of a search engine by “outsourcing” of components like the key/value store database storing the data blocks. This componentization (see components of strus) reduces the amount of code drastically and it raises opportunities for experts on a specific topic to contribute (read more…). Strus is not the first attempt to try that, but it is the first attempt as open source project, that has a performance within reach of the big open source search engines. And it does that without a 10 years history of optimization in the back. Strus might not be there at eye level, but let’s see what happens, if more different reasoning and competition is put into it. For who is strus ? People I would primarily like to address with this blog are developers or hackers as potential contributors or for feedback. On the other hand the project could already be interesting for experimental projects that can afford to go along with the development of strus. As stakeholder you can influence the project too. As the demo project, the search on the complete Wikipedia collection (English) shows, it is already possible to build projects, but you have to be aware, that dead lines should not exist, because you might hit a point where a feature you need is not instantaneously available. Project planning gets difficult at the current stage. Furthermore the state of documentation is still quite poor. Programming paradigms All interfaces of strus are pure. No inheritance is used in the main header files. Strus is more a lego thing than a provider of solution classes. If you want for example to build a sequence of terms as feature for your search, you have to build its expression tree with help of a stack, rather than picking a class that implements a sequence query. In PHP this looks as follows: $terms = [ “hello”, “world” ]; $query->pushTerm( “word”/*feature type*/, $term[0] ); $query->pushTerm( “word”/*feature type*/, $term[1] ); $query->pushExpression( “sequence”, 2/*nof terms*/, 2/*position range*/); $query->defineFeature( “docfeat” /*name addressing this feature set*/); The number of interface classes is small (see for example the interface classes of the core), but you have to understand them. If you want to contribute, you should also have a closer look at the programing guidelines. Try it There exist a guide how to fetch, build and install strus. Unfortunately a tutorial is still missing. There will be one soon ! Support I will reply to questions. Please mail me to contact at project dash strus dot net. Thanks I want to thank the authors of LevelDB here. I was looking for some time for a key/value store database that had an upper bound seek function in the interface. The upper bound seek is crucial because it allows you to minimize block accesses on disk when joining sets. A key/value store without upper bound seek would have forced me to create virtual blocks that point to other blocks. This would mean more disk accesses to fetch the data blocks needed. LevelDB has it. Any other alternative candidate to implement the database interface has to have it too. Social Media Github: patrickfrey Twitter: @ProjectStrus

June 23, 2015

by Patrick Frey

· 931 Views

PostgreSQL Powers All New Apps for 77% of the Database's Users

Survey of open source PostgreSQL users found adoption continues to rise with 55% of users deploying it for mission-critical applications Bedford, MA – June 23, 2015 – EnterpriseDB (EDB), the leading provider of enterprise-class Postgres products and database compatibility solutions, today announced the results of its “PostgreSQL Adoption Survey 2015,” a biennial survey of open source PostgreSQL users. Conducted by EnterpriseDB, the survey found PostgreSQL adoption continuing to rise, with 55% of users – up from 40% two years ago – deploying it for mission-critical applications and 77% of users are dedicating all new application deployments to PostgreSQL. These findings give voice to end users and confirm such industry indicators as increasing job listings and monthly rankings on DB-Engines that have pointed to rising interest in and demand for PostgreSQL, also called Postgres. The growing popularity of Postgres also comes as traditional software vendors suffer setbacks in the marketplace. The enterprise-class performance, security and stability of Postgres, on par with traditional database vendors for most corporate workloads, meanwhile have helped position Postgres among the solutions from the world’s largest vendors. The opportunity to transform their data center economics has helped fuel downloads of Postgres as well. End users reported cutting costs with Postgres, with 41% reporting they had first-year cost savings of 50% or more. They’re using Postgres to build web 2.0 applications using unstructured data as evidenced by the 64% of respondents who said they were working with JSON/JSONB and the 47% who said they were using Postgres for collaboration applications. “Postgres is empowering organizations to transform the economics of IT. IT can invest in the customer engagement applications that differentiate their operations from their competition instead of continuing to pay the steep and rising licensing and support fees charged by traditional database vendors,” said Marc Linster, senior vice president of products and services of EnterpriseDB. “With the expanding adoption, EnterpriseDB has experienced dramatic growth year over year, providing the software, services and support that organizations need to be successful with Postgres.” Database Migrations, Replacements The findings also support statements in a recent Gartner report that reflect the widespread acceptance of open source databases. “By 2018, more than 70% of new in-house applications will be developed on an OSDBMS, and 50% of existing commercial RDBMS instances will have been converted or will be in process,” according to the April 2015 Gartner report, The State of Open-Source RDBMs, 2015.* Among Postgres users, the survey findings show migrations are already under way with 37% reporting they had migrated applications from Oracle or Microsoft SQL Server to Postgres. Many users were still planning further migrations, with 37% of PostgreSQL users saying they will gradually replace their legacy systems with Postgres, compared to 29% who said that in the 2013 survey. Further, end users predict their deployments of Postgres will expand significantly, with 32% saying they anticipate production deployments of Postgres to increase by at least 50% over the next year. The survey, conducted by EnterpriseDB using an online tool in May 2015, queried registered users of PostgreSQL and drew 274 respondents worldwide from government organizations and companies ranging in size and industry. *The State of Open-Source RDBMs, 2015, by Donald Feinberg and Merv Adrian, published on April 21, 2015. Connect with EnterpriseDB Read the blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb Become a fan on Facebook: http://www.facebook.com/EnterpriseDB?ref=ts Join us on Google+: https://plus.google.com/108046988421677398468 Connect on LinkedIn: http://www.linkedin.com/company/enterprisedb

June 23, 2015

by Fran Cator

· 961 Views

This Week In Modern Software: Inside Obama’s Geek Squad

[This article was written by Kevin Casey] Welcome to This Week in Modern Software, orTWiMS, New Relic’s weekly roundup of the need-to-know news, stories, and events of interest surrounding software analytics, cloud computing, application monitoring, development methodologies, programming languages, and the myriad of other issues that influence modern software. This week, our top story goes inside President Obama’s secret team of tech geeks, 140 of them and counting: TWiMS Top Story: Inside Obama’s Stealth Startup—Fast Company What it’s about:If the President of the United States walked into the room and personally recruited you to rebuild the country’s technology infrastructure, could you turn him down? He’s serious, and that room is theRoosevelt Room in the West Wing of the White House, by the way. AsLisa Gelobtersays: “What are you going to say that?” Gelobter’s answer was “Yes”—she’s now chief digital officer for the US Department of Education, part of a 140-person-and-counting tech team that’s functioning something like an elite startup embedded inside the federal government. Its business? Only modernizing the technical infrastructure, applications, and processes of just about every federal agency. Why you should care:What was once something of a tech desert—the federal government—is beginning to draw top private-sector talent inside the Beltway. The team, led by Mikey Dickerson (who helped lead the team that rescuedHealthcare.gov) andformer US CTO Todd Park, also includes the likes of former Googler Matthew Weaver, and it hopes to hit 500 people by the end 2016, shortly before President Obama will leave office. Its challenges are immense, from tackling government bureaucracy (to test just how entrenched the suits were, Weaver requested the official title “Rogue Leader”—and he got it) to the fact that its recruiting pitch includes the phrase: “You’ll have to take a pay cut.” But its mission is both noble and necessary, and the appeal of working on major problems with enormous public impacts appears to be working. Recommended reading. Further reading: Mikey Dickerson’s 10 Tips for Dealing with Bureaucracy—New Relic Blog [Video] Airbnb Open Sources Software to Lure Talent Amid ‘Insane’ Competition—CIO Journal What it’s about:Airbnb added three new apps to its open source portfolio earlier this month, but the motivation wasn’t just trying to give employees the best business tools or contribute to the software community at large. Sure, that might have been part of the equation, but the rental booking site hopes open-sourcing some of its toolkit will help recruit the best software talent in the face of what director of engineeringMike Curtiscalls “insane” competition in the Silicon Valley labor market. Why you should care:In the software arms race, any little edge counts. Curtis tellsCIO Journalthat Airbnb will keep the proprietary stuff closely guarded, of course. But it will open source “generic” tools with wider industry use cases, such as its recently releasedAerosolvemachine-learning package and itsAirpalcloud-based data querying tool. The latter, which works with Facebook’s open sourcePrestoDB, aims to simplify SQL queries to the point where you don’t need to be a big data wonk or business intelligence guru to run it. Indeed, one in three Airbnb employees have run a query on it in the year since it launched. Airbnb has contributed a dozen open source tools on its aptly namedNerds site(gotta love that!) to date, something the company hopes both contributes to greater good but also advertises its software innovation to potential hires. Google Is Wielding Its Own Secret Weapon in the Cloud—The New York Times What it’s about:In thecutthroat competitionfor public cloud business, Google may be its own best customer testimonial. In advance of this week’sOpen Network Summit, theTimes’Bits bloglooked at Google’s plan to not only unveil cloud customers such as HTC but reveal much more than ever before about its own infrastructure. Google did just that on Wednesday, offering a look inside itsdata center networking, including its massive-capacity, lightning-fast Jupiter network. Why you should care:As major cloud players continue to zap prices with their shrink-rays, it’s increasingly clear that features and underlying platforms will distinguish one from the other when enterprise users make their pick. Google is taking a big step toward writing its own story in this regard, and the synopsis might read something like: “We’re pretty good at this stuff.” Its Jupiter fabrics deliver 1 petabit per second of bisection bandwidth, according to Google, or “enough for 100,000 servers to exchange information at 10Gb/s each, enough to read the entire scanned contents of the Library of Congress in less than 1/10th of a second.” If it sounds like a bit of bragging, well, yeah—it is. But it’s bragging with a purpose: Attracting devs who want access to the same technology without having to build it themselves.Google’s Amin Vahdat connected the dots in a blog post: “The same networks that power all of Google’s internal infrastructure and services also power Google Cloud Platform.” Move Over, Meeker: Byron Deeter’s State of the Cloud Report—Bessemer Venture Partners What it’s about:With a nod to Mary Meeker’s classicState of the Internet report,Bessemer Venture Partners’Byron Deeterchecks in with his 2015 State of the Cloud Report. Given cloud computing’s relative youth and rampant ascension, it’s no surprise the stats are staggering. Here’s one to start: Cloud revenues have increased tenfold in the last six years, from a scant $5.6 billion in 2008 to more than $56 billion in 2014. And it’s going to double again in the next four years, according to BVP’s projections, to $127.5 billion in 2018. Why you should care:Deeter’s full presentation is worth a weekend watch or read, but it’s the forward-looking slides that may be most compelling for software pros. Deeter notes both the immense risks and opportunities in cloud security, unveiling a 10-point security plan for cloud startups on slide 37. To underscore the security landscape, Deeter quotes an unnamed cloud CEO who says aDDoSattack that took down the firm’s API caused more customer churn in one day than in the rest of its history. Wow. He also addresses the exploding market for cloud services built specifically for developers including, yes, New Relic. And for mobile developers, slide 44 underscores something we’ve talked about before in this space:the real money’s in enterprise apps, and it’s still a largely untapped market. Click through thefull slide deck hereorwatch video of Deeter’s presentation here. Bandwidth: The Next Frontier of Cloud Computing—ZDnet What it’s about:Is networking the next big thing in the everything-as-a-service age? It just might be, as firms likePacnetvie to deliver networking capacity on a pay-for-what-you-use model that some industry folks say better suits cloud environments facing significant but uneven networking needs. Why you should care:As author Drew Turney notes, there’s a common blind spot when it comes to cloud computing’s many shapes and sizes: Moving all that data from points A to Z, and everywhere in between, which can cause both performance problems and undue financial pressures. The promise of Networking-as-a-Service (NaaS), industry execs tell Turney, is that it can provide more efficient, scalable networking for short-term usage bursts such as customer traffic spikes or large cloud backup-and-storage jobs, enabling companies to later dial down their capacity as needed. Combined withSoftware-Defined Networking (SDN),NaaS makes it possible to build intelligent applications that manage their own networking needs, which might be the most significant enterprise potential of NaaS, saysNuage NetworksarchitectMarten Hauville. Page Bloat: Average Web Page Now More Than 2MB—The Performance Beacon (SOASTA) What it’s about:Do you need to put your website on a diet? Apparently so: The average Web page topped 2 MB as of May 2015, according to ongoing tracking atThe Performance Beacon. That’s double the average page weight from just three years ago. The site projects average page weight will exceed 3 MB in late 2017. Why you should care:Performance, performance, performance:Slow speedsare a killerin the modern software era. While author andSOASTAUX evangelistTammy Evertsrightly notes that page weight is not the only factor in Web optimization, we’re simply not paying it enough attention when designing and building Web pages. Images are the big culprit in the Web’s expanding waistline: they comprise nearly two-thirds of the average page’s weight, and video is a growing part of our Web diet, too. But other factors such as custom fonts play a role, adding weight even as the Web sheds previous performance hogs like Flash. The ideal weight? 1 MB, she says, which will save crucial seconds in load times. Sounds like it’s time to hit the virtual treadmill.

June 23, 2015

by Fredric Paul

· 1,061 Views

Lucene SIMD Codec Benchmark and Future Steps

We are happy to share results of our Lucene SIMD research announced earlier. Ivan integrated https://github.com/lemire/simdcomp as Lucene Codec and we could observe 18% gain on standard Lucene benchmark. Here are the fork, deck, recording from BerlinBuzzwords. Tech notes The prototype is limited to postings (IndexOptions.DOCS), so far it doesn’t support freqs, positions, payloads. Thus, full idf-tf scoring is not possible so far. The heap problem Currently, the bottleneck of the search performance is the scoring heap. Heap is hard for vectorization, and even hard to compute with regular instructions. Thus, benchmark retrieves only top 10 docs to limit efforts for managing heap. Here is a profiler snapshot for the default Lucene code, decoding takes more than collecting. This is hotspots with the SIMD codec, note that collecting is prevailing now and ForUtil takes relatively smaller time for decoding. Edge cases There are few special code paths which bypass generic FOR decoding which make it harder to observe vectorization gain. Very dense stopwords postings are encoded as a sequence of increasing numbers with by just specifying length of the sequence (see ForUtil.ALL_VALUES_EQUAL). Thus, we excluded stopwords from the benchmark to better observe the gain in FOR decoding. Another edge case is shortening postings on high segmentation. FOR compression is applied on blocks, and remaining tail is encoded by vInt. Thus, to observe the gain in FOR decoding, we merge segments to the single one. Due to the same reason, rare terms with short postings list is not a good use case to show a gain. Further Plans Here are some directions which we consider: provide codec and benchmark as a separate modules; apply SIMD codec for DocValues and Norms - it should improve generic sorting, scoring and faceting. Because ordinals in DocValues are not increasing like postings, https://github.com/lemire/FastPFor should be incorporated; complete codec for supporting frequencies, offsets and positions to make it fully functional; presumably, SIMD facet component might get some gain from vectorization, however decoding ordinals might not be the biggest problem in faceting, like it’s described here; execute binary operations like intersections on compressed data with SIMD instructions https://github.com/lemire/SIMDCompressionAndIntersection; native code might access mmapped index files without boundary checks or copying to heap arrays; implementing roaring bitmaps might help with dense postings; Which of of those directions are relevant your challenges? Leave a comment below! Here are still questions to clarify: will critical natives work for Java 9 and further? couldn’t it happen that vectorization heuristic by JIT makes explicit SIMD codec redundant? We’d like to thank all people who contributed their researches and let us to conduct ours.

June 23, 2015

by Mikhail Khludnev

· 1,935 Views

Spring Data Couchbase: Handle Unknown Class

Spring Data Couchbase provides transparent way to save and load Java classes to and from Couchbase. However, if a loaded class contains a property of unknown class, you will receive org.springframework.data.mapping.model.MappingException: No mapping metadata found for java.lang.Object This may happen if, for example, different versions of your code save and load information. In order to handle situation when we want to load an object, which contains another object on unknown class (in a map or list property) we should override the default SPMappingCouchbaseConverter. Let's see how we do this with Spring XML configuration: I replace my old XML: to the following XML: And create the following class: public class MyMappingCouchbaseConverter extends MappingCouchbaseConverter { public MyMappingCouchbaseConverter(final MappingContext, CouchbasePersistentProperty> mappingContext) { super(mappingContext); } @Override protected R read(final TypeInformation type, final CouchbaseDocument source, final Object parent) { if (Object.class == typeMapper.readType(source, type).type) { return null; } return super.read(type, source, parent); } } Now, if loaded object will contain a property of unknown class or an object of unknown class in a list or map, this property or object will be replaced by null. view source print?

June 22, 2015

by Pavel Bernshtam

· 4,066 Views

Optimized Text-Stamp Operations, Enhanced PDF to HTML & DOC Conversion in Java Apps

What's New in this Release? Aspose team is pleased to announce the release of Aspose.Pdf for Java 10.3.0. It provides better license initialization capabilities. As shared in earlier blogs, we introduced a method clear() in com.aspose.pdf.MemoryCleaner class, which provides Memory Cleanup features so that memory is set free from unused objects. This method optimizes API performance as system resources are released, leaving API with sample resources to perform various PDF creation and manipulation operations. In this new release, we have also optimized TextStamp operation. Other than these improvements, a better support for UTF8 and UTF16 characters is provided, when converting TEXT files to PDF format. Cross file format conversions are one of the salient features offered by our API. Therefore, the PDF to HTML, the PDF to DOC, transformation of PDF pages to Image format as well as the Image to PDF conversion features are specifically improved. Among these features, the text manipulation is also improved while searching and replacing TextFragments inside the PDF file. Starting this new release, we are providing a single code base (.jar) file targeting JDK 1.6 and its compatible with JDK 1.6, 1.7 and later versions. Some important improved features included in this release are given below Increase TextStamp creation performance com.aspose.pdf.MemoryCleaner.clear() method nulls the license object as well Aspose.Pdf 9.5.2 to HTML conversion issue on particular file UTF-8 characters not appearing properly License implementation difference in 9.3.0 and 10.2.0 with Java web application java.awt.HeadlessException in Headless Mode PDF to Image - Conversion process stucks in infinite loop Text to PDF: Incorrect rendering of UTF8 text in output PDF Text to PDF: Incorrect rendering of UTF16 text in output PDF gets wrong coordinates of seached Text Image to PDF: API throws IllegalArgumentException PDF to PNG - Process hangs during conversion PDF to HTML: text is distorted in output HTML PDF to DOC: Text renders incorrectly Image to PDF throws IllegalArgumentException exception PDF to HTML - StringIndexOutOfBoundsException being generated PDF to Image - conversion method stuck and never returns Hyperlink text/contents are not visible in PDF file Overview: Aspose.Pdf for Java Aspose.Pdf is a Java PDF component to create PDF documents without using Adobe Acrobat. It supports Floating box, PDF form field, PDF attachments, security, Foot note & end note, Multiple columns document, Table of Contents, List of Tables, Nested tables, Rich text format, images, hyperlinks, JavaScript, annotation, bookmarks, headers, footers and many more. Now you can create PDF by API, XML and XSL-FO files. It also enables you to converting HTML, XSL-FO and Excel files into PDF. Homepage of Aspose.Pdf for Java Download Aspose.Pdf for Java

June 22, 2015

by David Zondray

· 1,037 Views

Think about the last time something really aggravated you, whether it was a slow Internet connection, long store lines, or a rude cashier. Did you vow to go home and call an 800 number, punch through a bunch of option keys, and wait to talk to a customer service rep? Or did you take out your smart phone and hammer out your frustrations in 140 characters or less? Odds are you’re like the millions of consumers who express their grievances with friends, family, and colleagues on Twitter, Facebook, YouTube, or a host of other social sites. With more than 230 million people on Twitter and a billion or more on Facebook, companies now understand the importance of providing customer service over social media. According to a 2014 Forrester report, 62 percent of businesses believe they will lose ground if they don’t adopt social customer service technologies. Companies slow to embrace social media for customer service, also known as social care, are missing an opportunity to build their brands and customer loyalty. Ignoring customer problems on social media can spark a raging ﬁre of discontent. But by connecting with customers on social media, you can quickly respond and resolve issues in front of thousands of other prospective clients. A study by the International Customer Management Institute shows 61 percent of consumers who received social care were more satisﬁed with their support. And 58 percent said social care increased their customer loyalty. Are you ready to advance your customer support program to the social community or are you willing to sit on hold while your customers begin looking elsewhere? We’ve created an eBook, “Social Customer Care: How to Use Social Media to Improve Customer Support,” to explore the reasons for adding social care to your customer service programs. It also provides tips from industry experts on how to get there. Like this post? Click here to subscribe to our blog and receive the latest content on social learning, customer support, sales enablement, or all three.

June 22, 2015

by Bloomfire Marketing

· 977 Views · 1 Like

Query Autofiltering Revisited -- Let's Be More precise!

In a previous blog post, I introduced the concept of “query autofiltering”, which is the process of using the meta information (information about information) that has been indexed by a search engine to infer what the user is attempting to find. A lot of the information used to do faceted search can also be used in this way, but by employing this knowledge up front or at “query time”, we can answer questions right away and much more precisely than we could without techniques like this. A word about “precision” here – precision means having fewer “false positives” – unintended responses that creep in to a result set because they share some words with the best answers. Search applications with well tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call “noise hits”, to come back as well. In the previous post, I explained why the search engine will often “do the wrong thing” when multiple terms are used and why this is frustrating to users – they add more information to their query to make it less ambiguous and the responses often do not reward that extra effort – in many cases, the response has more noise hits simply because the query has more words. The solution that I discussed involves adding some semantic awareness to the search process, because how words are used together in phrases is meaningful and we need ways to detect user intent from these patterns. The traditional way to do this is to use Natural Language Processing or NLP to parse the user query. This can work well if the queries are spoken or written as if the person were asking another person, as in “Where can I find restaurants in Cleveland that serve Sushi?” Of course, this scenario –which goes back to the early AI days – has become much more important now that we can talk to our cell phones. For search applications like Google with a “box and a button” paradigm, user queries are usually one word or short phrases like “Sushi Restaurants in Cleveland”. These are often what linguists call “noun phrases” consisting of a word that means a person, place or thing (what of who they want to find or where) – e.g. “restaurant” and “Cleveland” and some words that add precision to their query by constraining the properties of the thing they want to find – in this case “sushi”. In other words, it is clear from this query that the user is not interested in just any restaurant – they want to find those that serve raw fish on a ball of rice or vegetable and seafood thingies wrapped in seaweed. The search engine often does the wrong thing because it doesn’t know how to combine these terms – and typically will use the wrong logical or boolean operator – OR when the users intent should be interpreted as AND. It turns out that in many cases now, our search indexes know the difference between Mexican Restaurants (which typically don’t serve Sushi) and Japanese Restaurants (which usually do) because of the metadata that we put into them to do faceted search (Funny story: after posting this, I ran across a Mexican Restaurant in Toms River, New Jersey that does serve sushi – but still, most of them don’t!). The goal of query autofiltering is to use that built in knowledge to answer the question right away and not wait for the user to “drill in” using the facets. If users don’t give us a precise query (like simply “restaurants”), we can still use faceting, but if they do, it would be cool if we could cut to the chase. As you’ll see, it turns out that we can do this. The previous post contained a solution which I called a “Simple” Category Extraction component. It works by seeing if single tokens in the query matched field values in the search index (using a cool Lucene feature that enable us to mine the “uninverted” index for all of the values that were indexed in a field). For example, if it sees the token “red” and discovers that “red” is one of the values of a “color” field, it would infer that the user was looking for things that are “red” in “color” and will constrain the query this way. The solution works well in a limited set of cases, but there are a number of problems with it that make it less useful in a production setting. It does a nice job in cases where the term “red” is used to qualify or more precisely specify a thing – such as “red sofa”. It does not do so well in cases where the term “red” is not used as a qualifier – such as when it is part of a brand or product name such as “Red Baron Pizza” or “Johnny Walker Red Label” (great Scotch, but “Black Label” is even better, maybe I’ll be rich enough to afford “Blue Label” some day – but I digress …). It is interesting to note that the simple extractor’s main shortcomings are due to the fact that it looks at single tokens at a time in isolation from the tokens around it. This turns out to be the same problem that the core search engine algorithms have – i.e., it’s a “bag of words” approach that doesn’t consider – wait for it – semantic context. The solution is to look for patterns of words that match patterns of content attributes. This does a much better job of disambiguation. We can use the same coding trick as before (upgraded for API changes introduced in Solr 5.0), but we need to account for context and usage – as much as we can without having to introduce full-blown NLP which needs lots of text to crunch. In contrast, this approach can work when we just have structured metadata. Searching vs Navigating A little historical background here. With modern search applications, there are basically two types of user activities that are intermingled: searching and navigating. The former involves typing into a box and the latter, clicking on facet links. In the old days, there was a third user interface called an “advanced” search form where users could pick from a set of metadata fields, put in a value and select their logical combination operators– an interface that would be ideally suited for precise searching given rich metadata. The problem is that nobody wants to use it. Not that people ever liked this interface anyway (except those with Master of Library Science degrees), but Google has also done much to demote this interface to a historical reference. Google still has the same problem of noise hits but they have built a reputation for getting the best results to the top (and usually, they do) – and they also eschew facets (they kinda have them at the bottom of the search page now as related searches). Users can also “markup” their query with quotation marks or boolean expressions or ‘+/-’ signs but trust me – they won’t do that either (typically that is). What this means is that the little search box – love it or hate it – is our main entry point – i.e. we have to deal with it, because that is what users want – to just type stuff and then get the “right” answer back. (If poor ease-of-use or the simple joy of Google didn’t kill the advanced search form completely, the migration to mobile devices absolutely will). A Little Solr/Lucene Technology – String fields, Text fields and “free-text” searching: In Solr, when talking about textual data, these two user activities are normally handled by two different types of index field: string and text. String fields are not analyzed (tokenized) and searching them requires an exact match on a value indexed within a field. This value can be a word or a phrase. In other words, you need to use : syntax in the query (and quoted “value here” syntax if the query is multi-term) – something that power users will be OK with but not something that we can expect of the average user. However, string fields are very good for faceted navigation. Text fields on the other hand are analyzed (tokenized and filtered) and can be searched with “freetext” queries – our little box in other words. The problem here is that tokenization turns a stream of text into a stream of tokens (words) and while we do preserve positional information so we can search on phrases, we don’t know a priori where those phrases are. Text fields can also be faceted (in fact, any field can be a facet field in Solr), but in this case, the facets are based on individual tokens which don’t tend to be too useful. So we have two basic field types for text data, one good for searching and one for navigating. In the harder-to-search type, we know exactly where the phrases are but we typically don’t in the easier-to-search type. A classic trade-off scenario. Since string fields are harder to search (at least within the Google paradigm that users love), we make them searchable by copying their data (using the Solr “copyField” directive) into a catchall text field called “text” by default. This works, but in the process we throw away information about which values are meant to be phrases and which are not. Not only that, we’ve lost the context of what these values represent (the string fields that they came from). So although we’ve made these string fields more searchable, we’ve had to do that by putting them into a “bag of words” blender. But the information is still somewhere in the search index, we just need a way to get it back at at “query time”. Then, we can both have our cake AND eat it! Noun Phrases and the Hierarchy of meta information When we talk about things, there are certain attributes that describe what the thing is (type attributes) and others that describe the properties or characteristics of the thing. In a structured database or search index, both of these kinds of attributes are stored the same way – as field/value pairs. There are however, natural or semantic relationships between these fields that the database or search engine can’t understand, but we do. That is, noun phrases that describe more specific sets of things are buried in the relationships between our metadata fields. All we have to do is dig them out. For example, if I have a database of local businesses, I can have a “what” field like business type that has values like “restaurant”, “hardware store”, “drug store”, “filling station” and so forth. Within some of these business types like restaurant, there may be refining information like restaurant type (“Mexican”, “Chinese”, “Italian”, etc) or brand/franchise (“Exxon”, “Sunoco”, “Hess”, “Rite-Aid”, “CVS”, “Walgreens”, etc.) for gas stations and drug stores. These fields form a natural hierarchy of metadata in which some attributes refine or narrow the set of things that are labeled by broader field types. Rebuilding Context: Identifying field name patterns to find relevant phrase patterns So now its time to put Humpty Dumpty back together again. With Solr/Lucene – it is likely that the information that we need to give precise answers to precise questions is available in the search index. If we can identify sub-phrases within a query that refer or map to a metadata field in the index, we can then add the appropriate metadata mapping on behalf of the user. We are then able to answer questions like “Where is the nearest Tru Value hardware store?” because we can identify the phrase “Tru Value” as a business name and “hardware store” as a specific type of store. Assuming that this information is in the index in the form of metadata fields, parsing the query is a matter of detecting these metadata values and associating them with their source fields. Some additional NLP magic can be used to infer other aspects of the question such as “where is the nearest”, which should trigger the addition of a spatial proximity query filter for example. The Query AutoFiltering Search Component To implement the idea set out above, I developed a Solr Search Component called QueryAutoFilteringComponent. Search components are executed as part of the search request handling process. Besides executing a search, they can also do other things like spell checking or query suggestion, return the set of terms that are indexed in a field or the term vectors (term frequency statistics) among other things. The SearchComponent interface defines a number of methods one of which – prepare( ) – is executed by all of the components in a search handler chain before the request is processed. By specifying that a non-standard component is in the “first-components” list – it will be executed before the query is sent to the index by the downstream QueryComponent. This gives these early components a chance to modify the query before it is executed by the Lucene engine (or distributed to other shards in SolrCloud). The QueryAutoFilteringComponent works by creating a mapping of term values to the index fields that contain them. It uses the Lucene UnivertedIndex and the Solr TermsComponent (in SolrCloud mode) to build this map. This “inverse” map of term value -> index field is then used to discover if any sub-phrase within a query maps to a particular index field. If so, a filter query (fq) or boost query (bq) – depending on the configuration – is created from that field:value pair and if the result is to be a filter query, the value is removed from the original query. The result is a series of query expressions for the phrases that were identified in the original query. An example will help to make this clearer. Assuming that we have indexed the following records: This example is admittedly a bit contrived in that the term “red” is deliberately ambiguous – it can occur as a color value or as part of a brand or product_type phrase. So, with the OOTB Solr /select handler, a search for “red lion socks” brings back all 16 records. However, with the QueryAutoFilterComponent, only 2 results are returned (4 and 5) for this query. Furthermore, searching for “red wine” will only bring back one record (11) whereas searching for “red wine vinegar” brings back just record 12. What the filter does is to match terms with fields, trying to find the longest contiguous phrases that match mapped field values. So for the query “red lion socks” – it will first discover that “red” is a color, but then it will discover that “red lion” is a brand and this will supercede the shorter match that starts with “red”. Likewise, with “red wine vinegar”, it will first find “red” == color, then “red wine” == product_type then “red wine vinegar” == product_type and the final match will win because it is the longest contiguous match. It will work across fields too. If the query is “blue red lion socks” – it will discover that “blue” is a color, then that “blue red” is nothing so it will move on to the next unmatched token – “red”. It will then, as before, discover that “red lion” is a brand, reject “red lion socks” which doesn’t map to anything and finally find that “socks” is a product_type. From these three field matches it will construct a filter (or boost) query with the appropriate mapping of field name to field value. The result of all of this is a translation of the Solr query: q=blue red lion socks to a filter query: fq=color:blue&fq=brand:”red lion”&fq=product_type:socks This final query brings back just 1 result as opposed to 16 for the unfiltered case. In other words, we have increased precision from 6.25% to 100%! Adding case sensitivity and synonym support: One of the problems with using string fields as the source of metadata for noun phrases is that they are not analyzed (as discussed above). This limits the set of user inputs that can match – without any changes, the user must type in exactly what is indexed, including case and plurality. To address this problem, support for basic text analysis such as case insensitivity and stemming (singular/plural) as well as support for synonyms was added to the QueryAutoFilteringComponent. This adds to the code complexity somewhat but it makes it possible for the filter to detect synonymous phrases in the query like “couch” or “lounge chair” when “Sofa” or “Chaise Lounge” were indexed. Another thing that can help at an application level is to develop a suggester for typeahead or autocomplete interfaces that uses the Solr terms component and facet maps to build a multi-field suggester that will guide users towards precise and actionable queries. I hope to have a post on this in the near future. Source Code For those that are interested in how the autofiltering component works or would like to use it in your search application, source code and design documentation are available on github. The component has also been submitted to Solr (SOLR-7539 if you want to track it). The source code on github is in two versions, one that compiles and runs with Solr 4.x and the other that uses the new UninvertingReader API that must be used in Solr 5.0 and above. Conclusions The QueryAutoFilteringComponent does a lot more than the simple implementation introduced in the previous post. Like the previous example, it turns a free form queries into a set of Solr filter queries (fq) – if it can. This will eliminate results that do not match the metadata field values (or their synonyms) and is a way to achieve high precision. Another way to go is to use the “boost query” or bq rather than fq to push the precise hits to the top but allow other hits to persist in the result set. Once contextual phrases are identified, we can boost documents that contain these phrases in the identified fields (one of the chicken-and-egg problems with query-time boosting is knowing what field/value pairs to boost). The boosting approach may make more sense for traditional search applications viewed on laptop or workstation computers whereas the filter query approach probably makes more sense for mobile applications. The component contains a configurable parameter “boostFactor” which when set, will cause it to operate in boost mode so that records with exact matches in identified fields will be boosted over records with random or partial token hits.

June 22, 2015

by Lisa Warner

· 2,141 Views

Long-Term Log Analysis with AWS Redshift

You will aggregate a lot of logs over the lifetime of your product and codebase, so it’s important to be able to search through them. In the rare case of a security issue, not having that capability is incredibly painful. You might be able to use services that allow you to search through the logs of the last two weeks quickly. But what if you want to search through the last six months, a year, or even further? That availability can be rather expensive or not even an option at all with existing services. Many hosted log services provide S3 archival support which we can use to build a long-term log analysis infrastructure with AWS Redshift. Recently I’ve set up scripts to be able to create that infrastructure whenever we need it at Codeship. AWS Redshift AWS Redshift is a data warehousing solution by AWS. It has an easy clustering and ingestion mechanism ideal for loading large log files and then searching through them with SQL. As it automatically balances your log files across several machines, you can easily scale up if you need more speed. As I said earlier, looking through large amounts of log files is a relatively rare occasion; you don’t need this infrastructure to be around all the time, which makes it a perfect use case for AWS. Setting Up Your Log Analysis Let’s walk through the scripts that drive our long-term log analysis infrastructure. You can check them out in the flomotlik/redshift-logging GitHub repository. I’ll take you step by step through configuring the whole setup of the environment variables needed, as well as starting the creation of the cluster and searching the logs. But first, let’s get a high-level overview of what the setup script is doing before going into all the different options that you can set: Creates an AWS Redshift cluster. You can configure the number of servers and which server type should be used. Waits for the cluster to become ready. Creates a SQL table inside the Redshift cluster to load the log files into. Ingests all log files into the Redshift cluster from AWS S3. Cleans up the database and prints the psql access command to connect into the cluster. Be sure to check out the script on GitHub before we go into all the different options that you can set through the .env file. Options to set The following is a list of all the options available to you. You can simply copy the .env.template file to .env and then fill in all the options to get picked up. AWS_ACCESS_KEY_ID AWS key of the account that should run the Redshift cluster. AWS_SECRET_ACCESS_KEY AWS secret key of the account that should run the Redshift cluster. AWS_REGION=us-east-1 AWS region the cluster should run in, default us-east-1. Make sure to use the same region that is used for archiving your logs to S3 to have them close. REDSHIFT_USERNAME Username to connect with psql into the cluster. REDSHIFT_PASSWORD Password to connect with psql into the cluster. S3_AWS_ACCESS_KEY_ID AWS key that has access to the S3 bucket you want to pull your logs from. We run the log analysis cluster in our AWS Sandbox account but pull the logs from our production AWS account so the Redshift cluster doesn’t impact production in any way. S3_AWS_SECRET_ACCESS_KEY AWS secret key that has access to the S3 bucket you want to pull your logs from. PORT=5439 Port to connect to with psql. CLUSTER_TYPE=single-node The cluster type can be single-node or multi-node. Multi-node clusters get auto-balanced which gives you more speed at a higher cost. NODE_TYPE Instance type that’s used for the nodes of the cluster. Check out the Redshift Documentation for details on the instance types and their differences. NUMBER_OF_NODES=10 Number of nodes when running in multi-mode. CLUSTER_IDENTIFIER=log-analysis DB_NAME=log-analysis S3_PATH=s3://your_s3_bucket/papertrail/logs/862693/dt=2015 Database format and failed loads When ingesting log statements into the cluster, make sure to check the amount of failed loads that are happening. You might have to edit the database format to fit to your specific log output style. You can debug this easily by creating a single-node cluster first that only loads a small subset of your logs and is very fast as a result. Make sure to have none or nearly no failed loads before you extend to the whole cluster. In case there are issues, check out the documentation of the copy command which loads your logs into the database and the parameters in the setup script for that. Example and benchmarks It’s a quick thing to set up the whole cluster and run example queries against it. For example, I’ll load all of our logs of the last nine months into a Redshift cluster and run several queries against it. I haven’t spent any time on optimizing the table, but you could definitely gain some more speed out of the whole system if necessary. It’s just fast enough already for us out of the box. As you can see here, loading all logs of May — more than 600 million log lines — took only 12 minutes on a cluster of 10 machines. We could easily load more than one month into that 10-machine cluster since there’s more than enough storage available, but for this post, one month is enough. After that, we’re able to search through the history of all of our applications and past servers through SQL. We connect with our psql client and send of SQL queries against the “events’ database. For example, what if we want to know how many build servers reported logs in May: loganalysis=# select count(distinct(source_name)) from events where source_name LIKE 'i-%'; count ------- 801 (1 row) So in May, we had 801 EC2 build servers running for our customers. That query took ~3 seconds to finish. Or let’s say we want to know how many people accessed the configuration page of our main repository (the project ID is hidden with XXXX): loganalysis=# select count(*) from events where source_name = 'mothership' and program LIKE 'app/web%' and message LIKE 'method=GET path=/projects/XXXX/configure_tests%'; count ------- 15 (1 row) So now we know that there were 15 accesses on that configuration page throughout May. We can also get all the details, including who accessed it when through our logs. This could help in case of any security issues we’d need to look into. The query took about 40 seconds to go though all of our logs, but it could be optimized on Redshift even more. Those are just some of the queries you could use to look through your logs, gaining more insight into your customers’ use of your system. And you et all of that with a setup that costs $2.50 an hour, can be shut down immediately, and recreated any time you need access to that data again. Conclusions Being able to search through and learn from your history is incredibly important for building a large infrastructure. You need to be able to look into your history easily, especially when it comes to security issues. With AWS Redshift, you have a great tool in hand that allows you to start an ad hoc analytics infrastructure that’s fast and cheap for short-term reviews. Of course, Redshift can do a lot more as well. Let us know what your processes and tools around logging, storage, and search are in the comments.

June 21, 2015

by Florian Motlik

· 1,446 Views

Java RegEx: How to Replace All With Pre-processing on a Captured Group

Need to replace all occurances of a pattern text and replace it with a captured group? Something like this in Java works nicely: String html = "myurl\n" + "myurl2\n" + "myurl3"; html = html.replaceAll("id=(\\w+)'?", "productId=$1'"); Here I swapped the query name from "id" to "productId" on all the links that matched my criteria. But what happen if I needed to pre-process the captured ID value before replacing it? Let's say now I want to do a lookup and transform the ID value to something else? This extra requirement would lead us to dig deeper into Java RegEx package. Here is what I come up with: import java.util.regex.*; ... public String replaceAndLookupIds(String html) { StringBuffer newHtml = new StringBuffer(); Pattern p = Pattern.compile("id=(\\w+)'?"); Matcher m = p.matcher(html); while (m.find()) { String id= m.group(1); String newId = lookup(id); String rep = "productId=" + newId + "'"; m.appendReplacement(newHtml, rep); } m.appendTail(newHtml); return newHtml.toString(); }

June 17, 2015

by Zemian Deng

· 14,016 Views · 1 Like

Why 12 Factor Application Patterns, Microservices and CloudFoundry Matter (Part 2)

Learn why 12 Factor Application Patterns, Microservices and CloudFoundry matter when trying to change the way your product is produced.

June 12, 2015

by Tim Spann

CORE

· 15,639 Views · 4 Likes

Spring Integration Tests with MongoDB Rulez

Spring integration tests allow you to test functionality against a running application. This article shows proper database set- and clean-up with MongoDB.

June 10, 2015

by Ralf Stuckert

· 21,459 Views · 2 Likes

Regular Expressions Denial of the Service (ReDOS) Attacks: From the Exploitation to the Prevention

autors :michael hidalgo, dinis cruz introduction when it comes to web application security, one of the recommendations to write software that is resilient to attacks is to perform a correct input data validation. however, as mobile applications and apis (application programming interface) proliferates, the number of untrusted sources where data comes from goes up, and a potential attacker can take advantage of the lack of validations to compromise our applications. regular expressions provides a versatile mechanism to perform input data validation. developers use them to validate email addresses, zip codes, phone numbers and many other task that are easily implemented thought them. unfortunately most of the time software engineers don't fully understand how regular expressions works in the background and by choosing a wrong regular expression pattern they can introduce a risk in the application. in this article we are going to discuss about the so called regular expression denial of the service (redos) vulnerability and how we can identify this problems early in the software development life cycle (sdlc) stages by enforcing a culture focused on unit testing. hardware features for this article in order to provide information about execution time, performance, cpu utilisation and other facts, we are relying on virtual machine that uses windows 7 32-bit operating system, 5.22 gb ram. intel(r) core (tm) it-3820qm cpu @2.7 ghz. we are also using 4 cores. understanding the problem. the owasp foundation (2012) defines a regular regular expression denial of service attack as follows: "the regular expression denial of service (redos) is a denial of service attack, that exploits the fact that most regular expression implementations may reach extreme situations that cause them to work very slowly (exponentially related to input size). an attacker can then cause a program using a regular expression to enter these extreme situations and then hang for a very long time." although a broad explanation about regular expression engines is out of the scope of this article,it is important to understand that, according to stubblebine,t (regular expressions pocket reference), a pattern matching consist of finding a section of text that is described (matched) by a regular expression. two main rules are used to match results: the earliest (leftmost) wins : the regular expression is applied to the input starting at the first character and moving toward the last. as soon as the regular expression engine finds a match,it returns. standard quantifiers are greedy : according to stubblebine, "quantifiers specify how many times something can be repeated. the standard quantifiers attempt to match as many times as possible. the process of giving up characters and trying less-greedy matches is called backtracking." for this article we are focused a regular expression engine called nondeterministic finite automaton (nfa).this engines usually compare each element of the regex to the input string, keeping track of positions where it chose between two options in the regex. if an option fails, the engine backtracks to the most recently saved position.(stubblebine,t 2007). it is important to note that this engine is also implemented in .net, java, python, php and ruby on rails. this article is focused on c# and therefore we are relying on the microsoft .net framework system.text.regularexpression classes which at the heart uses nfa engines. according to bryan sullivan "one important side effect of backtracking is that while the regex engine can fairly quickly confirm a positive match (that is, an input string does match a given regex), confirming a negative match (the input string does not match the regex) can take quite a bit longer. in fact, the engine must confirm that none of the possible “paths” through the input string match the regex, which means that all paths have to be tested. with a simple non-grouping regular expression, the time spent to confirm negative matches is not a huge problem." in order to illustrate the problem, let's use this regular expression (\w+\d+)+c which basically performs the following checks: between one and unlimited times, as many times as possible, giving back as needed. \w+ match any word character a-za-z0-9_ . \d+ match a digit 0-9 matches the character c literally (case sensitive) so matching values are 12c,1232323232c and !!!!cd4c and non matching values are for instance !!!!!c,aaaaaac and abababababc . the following unit test was created to verify both cases. const string regexpattern = @"(\w+\d+)+c"; public void testregularexpression() { var validinput = "1234567c"; var invalidinput = "aaaaaaac"; regex.ismatch(validinput, regexpattern).assert_is_true(); regex.ismatch(invalidinput, regexpattern).assert_is_false(); } execution time : 6 milliseconds now that we've verified that our regular expression works well, let's write a new unit test to understand the backtracking problem and the performance effects. note that the longer the string, the longer the time the regular expression engine will take to resolve it. we will generate 10 random strings, starting at the length of 15 characters, incrementing the length until get to 25 characters,and then we will see the execution times. const string regexpattern = @"(\w+\d+)+c"; [testmethod] public void isvalidinput() { var sw = new stopwatch(); int16 maxiterations = 25; for (var index = 15; index < maxiterations; index++) { sw.start(); //generating x random numbers using fluentsharp api var input = index.randomnumbers() + "!"; regex.ismatch(input, regexpattern).assert_false(); sw.stop(); sw.reset(); } } now let's take a look at the test results: random string character length elapsed time (ms) 360817709111694! 16 16ms 2639383945572745! 17 23ms 57994905459869261! 18 50ms 327218096525942566! 19 106ms 4700367489525396856! 20 207ms 24889747040739379138! 21 394ms 156014309536784168029! 22 795ms 8797112169446577775348! 23 1595ms 41494510101927739218368! 24 3200ms 112649159593822679584363! 25 6323ms by looking at this results we can understand that the execution time (total time to resolve the input text against the regular expression) goes up exponentially to the size of the input. we can also see that when we append a new character, the execution time almost duplicates. this is an important finding because shows how expensive this process is, if we do not have a correct input data validation we can introduce performance issues in our application. a real-life use-case and an appeal for a unit testing approach now that we have seen the problems we can face by selecting a wrong (evil) regular expression, let's discuss about a realistic scenario where we need to validate input data thought regular expressions. we strongly believe that unit testing techniques can not only help to write quality code but also we can use them to find vulnerabilities in the code we are writing. by writing unit test that performs security checks (like input data validation) a common task in web applications consist on request an email address to the user signing in our application. from a ux (user experience perspective) complaining browsers support friendly error messages when an input, that was supposed to be an email address, does not match with the requirements in terms of format. here is a ui validation when a input textbox (with the email type is set) and the value is not a valid email address. however relying on a ui validation is not longer enough. an eavesdropper can easily perform an http request without using a browser (namely by using a proxy to capture data in transit) and then send a payload that can compromise our application. in the following use case, we are using a backend validation for the email address by using a regular expression. we will show you the real power of regular expressions here, we are not only testing that the regular expression validates the input but also how it behaves when it receives any arbitrary input. we are using this evil regular expression to validate the email: ^( 0-9a-za-z @([0-9a-za-z][-\w][0-9a-za-z].)+[a-za-z]{2,9})$ . with the following test we are verifying that a valid email and invalid emails formats are correctly processed by the regular expression, which is the functional aspect from a development point of view. const string emailregex = @"^([0-9a-za-z]([-.\w]*[0-9a-za-z])*@([0-9a-za-z][-\w]*[0-9a-za-z]\.)+[a-za-z]{2,9})$"; [testmethod] public void validateemailaddress() { var validemailaddress = "[email protected]"; var invalidemailaddress = new string[] { "a", "abc.com", "1212", "aa.bb.cc", "aabcr@s" }; regex.ismatch(validemailaddress, emailregex).assert_is_true(); //looping throught invalid email address foreach (var email in invalidemailaddress) { regex.ismatch(email, emailregex).assert_is_false(); } } elapsed time: 6ms. so both cases are validate correctly. one could state that both scenarios supported by the unit test are enough to select this regular expression for our input data validations. however we can do a more extensive testing as you'll see. the exploit so far the previous regular expression selected to valid an email address seems to work well, we have added some unit test that verifies valid an invalid inputs. but how does it behaves when we send an arbitrary input?, from a variable length, do we face a denial of the service attack?. this kind of questions can be solved wit unit testing technique like this one: const string emailregex = @"^([0-9a-za-z]([-.\w]*[0-9a-za-z])*@([0-9a-za-z][-\w]*[0-9a-za-z]\.)+[a-za-z]{2,9})$"; [testmethod] public void validateemailaddress() { var validemailaddress = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!"; var watch = new stopwatch(); watch.start(); validemailaddress.regex(emailregex).assert_is_false(); watch.stop(); console.writeline("elapsed time {0}ms", watch.elapsedmilliseconds); watch.reset(); } **elapsed time : ~23 minutes (1423127 milliseconds).** results are disturbing. we can clearly see the performance problem introduced by evaluating the given input.it takes roughly 23 minutes to validate the input given the hardware characteristics described before. in the following images you will see the cpu behaviour when running this unit test. here is another cpu utlization: and this is another image from the cpu utilization while the test is running. fuzzing and unit testing: a perfect combination of techniques in the previous unit test we found that a given input string can lead to have denial of the service issue in our application. note that we didn't need an extreme large payload, in our scenario 34 characters can illustrate this problem or even less. when using any regular expression it is recomendable to always test it against unit testing to cover most of the possible ways a user (which can be a potential attacker) can send. here is where we can use fuzzing. tobias klein in his book a bug hunter's diary a guide tour throught the wilds of sofware security defines fuzzing as "a complete different approach to bug hunting is known as fuzzing. fuzzing is a dynamic-analysis technique that consist of testing an application by providing it with malformed or unexpected input. then klein continues adding that: "it isn't easy to identify the entry points of such complex applications, but complex software often tends to crash while processing malformed input data. page 05" mano paul in his book official (isc)2 guide to the csslp talking about fuzzing states that: "also known as fuzz testing or fault injection testing, fuzzing is a brute-force type of testing in which faults (random and pseudo-random input data) are injected into the software and it's behaviour is observed. it is a test whose results are indicative of the extended and effectiveness of the input validation.page 336". taking previous definitions into consideration, we are going to implement a new unit test that can allow us to generate random input data and test our regular expression. in this case, we are using this email regular expression "^[\w-.]{1,}\@([\w]{1,}.){1,}[a-z]{2,4}$"; and by doing an exhaustive testing we will see if we are not introducing a denial of the service problem. we want to make sure that the elapsed time to resolve if the random string matches the regular expression is evaluated in less than 3 seconds: const string emailregex = @"^[\w-\.]{1,}\@([\w]{1,}\.){1,}[a-z]{2,4}$"; //number of random strings to generte. const int maxiterations = 10000; [testmethod] public void fuzz_emailaddress() { //valid email should return true "[email protected]".regex(emailregex).assert_is_true(); //invalid email should return false "abce" .regex(emailregex).assert_is_false(); //testing maxiterations times for (int index = 0; index < maxiterations; index++) { //generating a random string var fuzzinput = (index * 5).randomstring(); var sw = new stopwatch(); sw.start(); fuzzinput.regex(emailregex).assert_is_false(); //elapsed time should be less than 3 seconds per input. sw.elapsed.seconds().assert_size_is_smaller_than(3); } } under the hardware features described before, this test passes. considering that we are using this computation (index * 5), the largest string generate is of 49995 character (which is 9999 *5). having said that we were able to test a large string against the regular expression and we confirmed that even thought it is quite large input value, the time involved to verify if it was or not a valid email, it was less than 3 seconds. now assuming that a check for the length of the email in the first place, it will guarantee that a malicious user can't inject a large payload in our application. countermeasures provided in microsoft .net 4.5 and upper if you are developing applications in microsoft .net 4.5 then you can take advantage of a new implementation on top of the ismatch method from the regex class . starting from .net 4.5 the ismatch method provides an overload that allows you to enter a timeout. note that this overload is not available in .net 4.0 . this new parameter is called matchtimeout and according to microsoft : "the matchtimeout parameter specifies how long a pattern matching method should try to find a match before it times out. setting a time-out interval prevents regular expressions that rely on excessive backtracking from appearing to stop responding when they process input that contains near matches. for more information, see best practices for regular expressions in the .net framework and backtracking in regular expressions . if no match is found in that time interval, the method throws a regexmatchtimeoutexception exception. matchtimeout overrides any default time-out value defined for the application domain in which the method executes." taken from here . we've written a new unit test where we're using a regular expression that we know can lead to denial of the service. in this case we'll test an email address that previously generated a significant side effect in the performance of the application. we'll see then how we can reduce the impact of this process by setting up a timeout. const string emailregexpattern = @"^([0-9a-za-z]([-.\w]*[0-9a-za-z])*@([0-9a-za-z][-\w]*[0-9a-za-z]\.)+[a-za-z]{2,9})$"; [testmethod] public void validateemailaddress() { var emailaddress = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!"; var watch = new stopwatch(); watch.start(); //timeout of 5 seconds try { regex.ismatch(emailaddress, emailregexpattern, regexoptions.ignorecase, timespan.fromseconds(5)); } catch (exception ex) { ex.message.assert_not_null(); ex.gettype().assert_is(typeof(regexmatchtimeoutexception)); } finally { watch.stop(); watch.elapsed.seconds().assert_size_is_smaller_than(5); watch.reset(); } } running this test in visual studio we can confirm it passes, which means that the backtracking mechanism is taking longer than 5 seconds to resolve. it will throw a regexmatchtimeoutexception exception indicating that it might take longer than 5 seconds to evaluate the input. ideally one would expect this process to take less than a second, however several conditions or requirements might lead to allow a timeout in seconds. note how this model provides a very needed defensive programming style where the software engineers make informed decisions on the code they write, in this case we can establish the next steps when our method times and that way we can decrease any denial of the service attack. final thoughts no one size fits all is so cliché that has to be true. we are not sure if the regular expressions you are currently using in your applications are vulnerable to this attack. what we can do for sure is to show you how you can take advantage of unit testing to write secure code. when we write code we want to make sure that each single line of code is covered by a unit testing, which at the end of the day will guarantee early detections of error. however if we can combine this exercise with the adoption and implementation of test that can also try to attack/compromise the application (and we are not talking about anything fancy) like sending random strings, using fuzzing techniques, using combination of characters, exceeding the expected length, we will be helping to write software that is resilient to attacks. as a recommendation always test your regular expressions agains uni test, make sure that they are resilient to the attack we have covered in this article and if you are able to identify those problematic patterns out there, do a contribution and report them so we are not introduce them in the software we write. references 1.cruz,dinis(2013) the email regex that (could had) dosed a site. 2.hollos,s. hollos,r (2013) finite automata and regular expressions problems and solutions. 3.kirrage,j. rathnayake , thielecke, h.: static analysis for regular expression denial-of-service attacks. university of birmingham, uk 4.klein, t. a bug hunter's diary a guided tour through the wilds of software security (2011). 5.the owasp foundation (2012) regular expression denial of service - redos. 6.stubblebine, t(2007) regular expression pocket reference, second edition. 7.sullivan, b (2010) regular expression denial of service attacks and defenses

June 7, 2015

by Michael Hidalgo

· 34,124 Views · 5 Likes

Purpose of ThreadLocal in Java and When to Use ThreadLocal

ThreadLocal is a simple way to have per-thread data that cannot be accessed concurrently by other threads, without requiring great effort or design compromises.

June 7, 2015

by Santosh Singh

· 21,454 Views · 3 Likes