DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest DevOps and CI/CD Topics

article thumbnail
10 Goals Related to DevOps
Looking for a "DevOps Manifesto" to help guide your own organizational transiton? Well, there's no 'manifesto' out there like we have for Agile and SOA, but I think if you look around, you'll find that this commuity does have some well-defined, agreed-upon goals. In Kris Buytaert's presentation: "Devops, the future is here, it's just not evenly distributed yet" he gives a great 10 point summary of organizational goals from Edward Deming that coincide with what the DevOps movment is looking to achieve in IT organizations: Adopt the new philosophy. We are in a new economic age. Western management must awaken to the challenge, and must learn their responsibilities, and take on leadership for a change. Cease dependence on inspection to achieve quality. Eliminate the need for massive inspection by building quality into the product in the first place. Constantly improve your system of production and service, to improve quality and productivity, and thus constantly decrease costs. Institute training on the job. Institute leadership. The aim of supervision should be to help people and machines and gadgets to a better job. Drive out fear, so that everyone may work effectively for the company. Break down barriers between departments. People in research, design, sales, and production must work as a team in order to forsee problems of production and usage that may be encountered with the product or service. Eliminate slogans, exhortation, and targets for the work force; asking for zero defects and new levels of productivity. Such exhortations only create adversarial relationships, since the bulk of the causes of low quality and low productivity belong to the system and thus lie beyond the power of the work force. *Eliminate management by objective. Eliminate management by numbers and numerical goals. Instead substitute with leadership. *Remove barriers that rob the hourly worker of his/her right to pride of workmanship. The responsibility of supervisors must be changed from sheer numbers to quality. *Remove barriers that rob people in management and in engineering of their right to pride of workmanship. Institute a vigorous program of education and self-improvement. Put everybody in the company to work to accomplish the transformation. The transformation is everybody's job. On a more technical level, Kris says you need the following things to get to a point where your team can deploy at a moment's notice, quickly and safely: Version Control Continuous Integration Built Pipelines Bug Tracking Integrated Testing Automated Deployment Here are all of the slides from his presentation: Devops, the future is here, it's just not evenly distributed yet. View more presentations from Kris Buytaert
December 9, 2011
by Mitch Pronschinske
· 31,233 Views · 1 Like
article thumbnail
Rolling Forward and Other Deployment Myths
There is more and more writing on DevOps lately, which is good and bad. There still remains a small core of thoughtful people that are worth listening to and learning from. There’s more and more marketing from vendors and consultants jumping on the Devops bandwagon. There’s some naïve silliness (“Hire wicked smart people and give them all access to root.”) which can probably be safely ignored. And then there’s stuff that is half-right and half-wrong, too dangerous to be ignored. Like this recent post on Rollbacks and Other Deployment Myths, in which John Vincent lists 5 “myths” about system deployment, which I want to take some time to respond to here. Change Is Change? The author tries to make the point that “Change is neither good or bad. It’s just change.” Therefore we do not need to be afraid of making changes. I don’t agree. This attitude to change ignores the fact that once a system is up and running and customers are relying on it to conduct their business, whatever change you are making to the system is almost never as important as making sure that the system keeps running properly. Unfortunately, changes often lead to problems. We know from Visible Ops that based on studies of hundreds of companies, 80% of operational failures are caused by mistakes made during changes. This is where heavyweight process control frameworks like ITIL and COBIT, and detective change control tools like Tripwire came from. To help companies get control over IT change, because people had to find some way to stop shit from breaking. Yes, I get the point that in IT we tend to over-compensate, and I agree that calling in the Release Police and trying to put up a wall around all changes isn’t sustainable. People don’t have to work this way. But trivializing change, pretending that changes don't lead to problems, is dangerous. Deploys Are Not Risky? You can be smart and careful and break changes down into small steps and try to automate code pushes and configuration changes, and plan ahead and stage and review and test all your changes, and after all of this you can still mess up the deploy. Even if you make frequent small changes and simplify the work and practice it a lot. For systems like Facebook and online games and a small number of other cases, maybe deployment really is a non-issue. I don’t care if Facebook deploys in the middle of the day – I can usually tell when they are doing a “zero downtime” deploy (or maybe they are “transparently” recovering from a failure) because data disappears temporarily or shows up in the wrong order, functions aren’t accessible for a while, forms don’t resolve properly, and other weird shit happens, and then things come back later or they don’t. As a customer, do I care? No. It’s an inconvenience, and it’s occasionally unsettling ("WTF just happened?"), but I get used to it and so do millions of others. That’s because most of us don’t use Facebook or systems like this for anything important. For business-critical systems handling thousands of transactions a second that are tied into hundreds of other company’s systems (the world that I work in) this doesn’t cut it. Maybe I spend too much time at this extreme, where even small problems with compatibility that only affect a small number of customers, or slight and temporary performance slow downs, are a big deal. But most people I work with and talk to in software development and maintenance and system operations agree that deployment is a big deal and needs to be done with care and attention, no matter how simple and small the changes are and no matter how clean and simple and automated the deployment process is. Rollbacks Are a Myth? Vincent wants us to “understand that it’s typically more risky to rollback than rolling forward. Always be rolling forward.” Not even the Continuous Deployment advocates (who are often some of the most radical – and I think some of the most irresponsible – voices in the Devops community) agree with this – they still roll back if they find problems with changes. “Rollbacks are a myth” is an echo of the “real men fail forward” crap I heard at Velocity last year and it is where I draw the line. It's one thing to state an extreme position for argument's sake or put up a straw man – but this is just plain wrong. If you're going to deploy, you have to anticipate roll back and think about it when you make changes and you have to test rolling back to make sure that it works. All of this is hard. But without a working roll back you have no choice other than to fail forward (whatever this means, because nobody who talks about it actually explains how to do it), and this is putting your customers and their business at unnecessary risk. It’s not another valid way of thinking. It’s irresponsible. James Hamilton wrote an excellent paper on Designing and Delivering Internet-Scale Services when he was at Microsoft (now he’s a an executive and Distinguished Engineer at Amazon). Hamilton’s paper remains one of the smartest things that anyone has written about how to deal with deployment and operational problems at scale. Everyone who designs, builds, maintains or operates an online system should read it. His position on roll back is simple and obvious and right: Reverting to the previous version is a rip cord that should always be available on any deployment. Everything Fails. Embrace Failure. I agree that everything can and will fail some day, that we can’t pretend that we can prevent failures in any system. But I don’t agree with embracing failure, at least in business-critical enterprise systems, where recovering from a failure means lost business and requires unraveling chains of transactions between different upstream and downstream systems and different companies, messing up other companies' businesses as well as your own and dealing the follow-on compliance problems. Failures in these kinds of systems, and a lot of other systems, are ugly and serious, and they should be treated seriously. We do whatever we can to make sure that failures are controlled and isolated, and we make sure that we can recover quickly if something goes wrong (which includes being able to roll back!). But we also do everything that we can to prevent failures. Embracing failure is fine for online consumer web site startups – let’s leave it to them. SLAs I wanted to respond to the points about SLAs, but it’s not clear to me what the author was trying to say. SLAs are not about servers. Umm, yes that’s right of course... SLAs are important to set business expectations with your customers (the people who are using the system) and with your partners and suppliers. So that your partners and suppliers know what you need from them and what you are paying them for, and so that you know if you can depend on them when you have to. So that your customers know what they are paying you for. And SLAs (not just high-level uptime SLAs, but SLAs for Recovery Time and Recovery Point goals and incident response and escalation) are important so that your team understands the constraints that they need to work under, what trade-offs to make in design and implementation and in operations. Under-Compensating Is Worse Than Over-Compensating I spent more time than I thought I would responding to this post, because some of what the author says is right – especially in the second part of his post, Deploy All the Things where he provides some good practical advice on how to reduce risk in deployment. He’s right that Operations main purpose isn’t to stop change – it can’t be. We have to be able to keep changing, and developers and operations have to work together to do this in safe and efficient ways. But trivializing the problems and risks of change and over-simplifying how to deal with these risks and how to deal with failures, isn’t the way to do this. There has to be a middle way between the ITIL and COBIT world of controls and paper and process, and cool Web startups failing forward, a way that can really work for the rest of us. Source: http://swreflections.blogspot.com/2011/10/rolling-forward-and-other-deployment.html
October 25, 2011
by Jim Bird
· 17,305 Views
article thumbnail
JDepend design metrics in CI
this article is intended to give the reader enough information to understand what jdepend is, what it does, and how to use it in a maven build. it’s a kind of cheat sheet, if you like. what is it? jdepend is more of a design metric than a code metric, it gives you information about your classes with regards to how they’re related to each other. using this information you should be able to identify any unwanted or dubious dependencies. how does it do that? it traverses java class files and generates design quality metrics, such as: number of classes and interfaces afferent couplings (ca) – what is this?? someone probably feels very proud of themselves for coming up with this phrase. afferent coupling means the number of other packages which depend on the package being measured, in a nutshell. jdepend define this as a measure of a package’s “responsibility” efferent couplings (ce) – sort of the opposite of ca. it’s a measure of the number of other packages that your package depends on abstractness (a) – the ratio of abstract classes to total classes. instability (i) – the ratio of efferent coupling (ce) to total coupling (ce + ca) distance from the main sequence (d) – this sounds fairly wishy-washy and i’ve never paid any attention to it. it’s defined as: “an indicator of the package’s balance between abstractness and stability”. meh. to use jdepend with maven you’ll need maven 2.0 or higher and jdk 1.4 or higher. you don’t need to install anything, as maven will sort this out for you by downloading it at build time. here’s a snippet from one of my project poms, it comes from in the section: org.codehaus.mojo jdepend-maven-plugin 1.6 build/maven/${pom.artifactid}/target/jdepend-reports what you’ll get is a jdepend entry under the project reports section of your maven site, like this: maven project reports page and this is what the actual report looks like (well, some of it): jdepend report summary: jdepend isn’t something i personally use very heavily, but i can understand how it could be used to good effect as a general measure of how closely related your classes are, which, in certain circumstances could prompt you to redesign or refactor your code. i don’t think this sort of information is required on a per commit basis, so i’d be tempted to only include it in my nightly reports. however, i also use sonar, and that has a built-in measure of afferent coupling, so if you’re only interested in that measurement and you’re already running sonar, then jdepend is probably a bit of an unnecessary overhead. also, sonar itself has some good plugins which can provide architectural and design governance features, at least one of which i know implemented jdepend. source: http://jamesbetteley.wordpress.com/2011/04/06/jdepend-design-metrics-in-ci/
October 21, 2011
by James Betteley
· 13,661 Views · 1 Like
article thumbnail
You can’t be Agile in Maintenance?
I’ve been going over a couple of posts by Steve Kilner that question whether Agile methods can be used effectively in software maintenance. It’s a surprising question really. There are a lot of maintenance teams who have had success following Agile methods like Scrum and Extreme Programming (XP) for some time now. We’ve been doing it for almost 5 years, enhancing and maintaining and supporting enterprise systems, and I know that it works. Agile development naturally leads into maintenance – the goal of incremental Agile development is to get working software out to customers as soon as possible, and get customers using it. At some point, when customers are relying on the software to get real business done and need support and help to keep the system running, teams cross from development over to maintenance. But there’s no reason for Agile development teams to fundamentally change the way that they work when this happens. It is harder to introduce Agile practices into a legacy maintenance team – there are a lot of technical requirements and some cultural changes that need to be made. But most maintenance teams have little to lose and lots to gain from borrowing from what Agile development teams are doing. Agile methods are designed to help small teams deal with a lot of change and uncertainty, and to deliver software quickly – all things that are at least as important in maintenance as they are in development. Technical practices in Extreme Programming especially help ensure that the code is always working – which is even more important in maintenance than it is in development, because the code has to work the first time in production. Agile methods have to be adapted to maintenance, but most teams have found it necessary to adapt these methods to fit their situations anyways. Let’s look at what works and what has to be changed to make Agile methods like Scrum and XP work in maintenance. What works well and what doesn’t Planning Game Managing maintenance isn’t the same as managing a development project – even an Agile development project. Although Agile development teams expect to deal with ambiguity and constant change, maintenance teams need to be even more flexible and responsive, to manage conflicts and unpredictable resourcing problems. Work has to be continuously reviewed and prioritized as it comes in – the customer can’t wait for 2 weeks for you to look at a production bug. The team needs a fast path for urgent changes and especially for hot fixes. You have to be prepared for support demands and interruptions. Structure the team so that some people can take care of second-level support, firefighting and emergency bug fixing and the rest of the team can keep moving forward and get something done. Build slack into schedules to allow for last-minute changes and support escalation. You will also have to be more careful in planning out maintenance work, to take into account technical and operational dependencies and constraints and risks. You’re working in the real world now, not the virtual reality of a project. Standups Standups play an important role in Agile projects to help teams come up to speed and bond. But most maintenance teams work fine without standups – since a lot of maintenance work can be done by one person working on their own, team members don’t need to listen to each other each morning talking about what they did yesterday and what they’re going to do – unless the team is working together on major changes. If someone has a question or runs into a problem, they can ask for help without waiting until the next day. Small releases Most changes and fixes that maintenance teams need to make are small, and there is almost always pressure from the business to get the code out as soon as it is ready, so an Agile approach with small and frequent releases makes a lot of sense. If the time boxes are short enough, the customer is less likely to interrupt and re-prioritize work in progress – most businesses can wait a few days or a couple of weeks to get something changed. Time boxing gives teams a way to control and structure their work, an opportunity to batch up related work to reduce development and testing costs, and natural opportunities to add in security controls and reviews and other gates. It also makes maintenance work more like a project, giving the team a chance to set goals and to see something get done. But time boxing comes with overhead – the planning and setup at the start, then deployment and reviews at the end – all of which adds up over time. Maintenance teams need to be ruthless with ceremonies and meetings, pare them down, keep only what’s necessary and what works. It’s even more important in maintenance than in development to remember that the goal is to deliver working code at the end of each time box. If some code is not working, or you’re not sure if it is working, then extend the deadline, back some of the changes out, or pull the plug on this release and start over. Don’t risk a production failure in order to hit an arbitrary deadline. If the team is having problems fitting work into time boxes, then stop and figure out what you’re doing wrong – the team is trying to do too much too fast, or the code is too unstable, or people don’t understand the code enough – and fix it and move on. Reviews and Retrospectives Retrospectives are important in maintenance to keep the team moving forward, to find better ways of working, and to solve problems. But like many practices, regular reviews reach a point of diminishing returns over time – people end up going through the motions. Once the team is setup, reviews don’t need to be done in each iteration unless the team runs into problems. Schedule reviews when you or the team need them. Collect data on how the team is working, on cycle time and bug report/fix ratios, correlate problems in production with changes, and get the team together to review if the numbers move off track. If the team runs into a serious problem like a major production failure, then get to the bottom of it through Root Cause Analysis. Sustainable pace / 40-hour week It’s not always possible to work a 40-hour week in maintenance. There are times when the team will be pushed to make urgent changes, spend late nights firefighting, releasing after hours and testing on weekends. But if this happens too often or goes on too long the team will burn out. It’s critical to establish a sustainable pace over the long term, to treat people fairly and give them a chance to do a good job. Pairing Pairing is hard to do in small teams where people are working on many different things. Pairing does make sense in some cases – people naturally pair-up when trying to debug a nasty problem or walking through a complicated change – but it’s not necessary to force it on people, and there are good reasons not to. Some teams (like mine) rely more on code reviews instead of pairing, or try to get developers to pair when first looking at a problem or change, and at the end again to review the code and tests. The important thing is to ensure that changes get looked at by at least one other person if possible, however this gets done. Collective Code Ownership Because maintenance teams are usually small and have to deal with a lot of different kinds of work, sooner or later different people will end up working on different parts of the code. It’s necessary, and it’s a good thing because people get a chance to learn more about the system and work with different technologies and on different problems. But there’s still a place for specialists in maintenance. You want the people who know the code the best to make emergency fixes or high-risk changes – or at least have them review the changes – because it has to work the first time. And sometimes you have no choice – sometimes there is only one person who understands a framework or language or technical problem well enough to get something done. Coding Guidelines – follow the rules Getting the team to follow coding guidelines is important in maintenance to help ensure the consistency and integrity of the code base over time – and to help ensure software security. Of course teams may have to compromise on coding standards and style conventions, depending on what they have inherited in the code base; and teams that maintain multiple systems will have to follow different guidelines for each system. Metaphor In XP, teams are supposed to share a Metaphor: a simple high-level expression of the system architecture (the system is a production line, or a bill of materials) and common names and patterns that can be used to describe the system. It’s a fuzzy concept at best, a weak substitute for more detailed architecture or design, and it’s not of much practical value in maintenance. Maintenance teams have to work with the architecture and patterns that are already in place in the system. What is important is making sure that the team has a common understanding of these patterns and the basic architecture so that the integrity isn’t lost – if it hasn’t been lost already. Getting the team together and reviewing the architecture, or reverse-engineering it, making sure that they all agree on it and documenting it in a simple way is important especially when taking over maintenance of a new system and when you are planning major changes. Simple Design Agile development teams start with simple designs and try to keep them simple. Maintenance teams have to work with whatever design and architecture that they inherit, which can be overwhelmingly complex, especially in bigger and older systems. But the driving principle should still be to design changes and new features as simple as the existing system lets you – and to simplify the system’s design further whenever you can. Especially when making small changes, simple, just-enough design is good – it means less documentation and less time and less cost. But maintenance teams need to be more risk adverse than development teams – even small mistakes can break compatibility or cause a run-time failure or open a security hole. This means that maintainers can’t be as iterative and free to take chances, and they need to spend more time upfront doing analysis, understanding the existing design and working through dependencies, as well as reviewing and testing their changes for regressions afterwards. Refactoring Refactoring takes on a lot of importance in maintenance. Every time a developer makes a change or fix they should consider how much refactoring work they should do and can do to make the code and design clearer and simpler, and to pay off technical debt. What and how much to refactor depends on what kind of work they are doing (making a well-thought-out isolated change, or doing shotgun surgery, or pushing out an emergency hot fix) and the time and risks involved, how well they understand the code, how good their tools are (development IDEs for Java and .NET at least have good built-in tools that make many refactorings simple and safe) and what kind of safety net they have in place to catch mistakes – automated tests, code reviews, static analysis. Some maintenance teams don’t refactor because they are too afraid of making mistakes. It’s a vicious circle – over time the code will get harder and harder to understand and change, and they will have more reasons to be more afraid. Others claim that a maintenance team is not working correctly if they don’t spend at least 50% of their time refactoring. The real answer is somewhere in between – enough refactoring to make changes and fixes safe. There are cases where extensive refactoring, restructuring or rewriting code is the right thing to do. Some code is too dangerous to change or too full of bugs to leave the way it is – studies show that in most systems, especially big systems, 80% of the bugs can cluster in 20% of the code. Restructuring or rewriting this code can pay off quickly, reducing problems in production, and significantly reducing the time needed to make changes and test them as you go forward. Continuous Testing Testing is even more important and necessary in maintenance than it is in development. And it’s a major part of maintenance costs. Most maintenance teams rely on developers to test their own changes and fixes by hand to make sure that the change worked and that they didn’t break anything as a side effect. Of course this makes testing expensive and inefficient and it limits how much work the team can do. In order to move fast, to make incremental changes and refactoring safe, the team needs a better safety net, by automating unit and functional tests and acceptance tests. It can take a long time to put in test scaffolding and tools and write a good set of automated tests. But even a simple test framework and a small set of core fat tests can pay back quickly in maintenance, because a lot changes (and bugs) tend to be concentrated in the same parts of the code – the same features, framework code and APIs get changed over and over again, and will need to be tested over and over again. You can start small, get these tests running quickly and reliably and get the team to rely on them, fill in the gaps with manual tests and reviews, and then fill out the tests over time. Once you have a basic test framework in place, developers can take advantage of TFD/TDD especially for bug fixes – the fix has to be tested anyways, so why not write the test first and make sure that you fixed what you were supposed to? Continuous Integration To get Continuous Testing to work, you need a Continuous Integration environment. Understanding, automating and streamlining the build and getting the CI server up and running and wiring in tests and static analysis checks and reporting can take a lot of work in an enterprise system, especially if you have to deal with multiple languages and platforms and dependencies between systems. But doing this work is also the foundation for simplifying release and deployment – frequent short releases means that release and deployment has to be made as simple as possible. Onsite Customer / Product Owner Working closely with the customer to make sure that the team is delivering what the customer needs when the customer needs it is as important in maintenance as it is in developing a new system. Getting a talented and committed Customer engaged is hard enough on a high-profile development project – but it’s even harder in maintenance. You may end up with too many customers with conflicting agendas competing for the team’s attention, or nobody who has the time or ability to answer questions and make decisions. Maintenance teams often have to make compromises and help fill in this role on their own. But it doesn’t all fit…. Kilner’s main point of concern isn’t really with Agile methods in maintenance. It’s with incremental design and development in general – that some work doesn’t fit nicely into short time boxes. Short iterations might work ok for bug fixes and small enhancements (they do), but sometimes you need to make bigger changes that have lots of dependencies. He argues that while Agile teams building new systems can stub out incomplete work and keep going in steps, maintenance teams have to get everything working all at once – it’s all or nothing. It’s not easy to see how big changes can be broken down into small steps that can be fit into short time boxes. I agree that this is harder in maintenance because you have to be more careful in understanding and untangling dependencies before you make changes, and you have to be more careful not to break things. The code and design will sometimes fight the kinds of changes that you need to make, because you need to do something that was never anticipated in the original design, or whatever design there was has been lost over time and any kind of change is hard to make. It’s not easy – but teams solve these problems all the time. You can use tools to figure out how much of a dependency mess you have in the code and what kind of changes you need to make to get out of this mess. If you are going to spend “weeks, months, or even years” to make changes to a system, then it makes sense to take time upfront to understand and break down build dependencies and isolate run-time dependencies, and put in test scaffolding and tests to protect the team from making mistakes as they go along. All of this can be done in time boxed steps. Just because you are following time boxes and simple, incremental design doesn’t mean that you start making changes without thinking them through. Read Working With Legacy Code – Michael Feathers walks through how to deal with these problems in detail, in both object oriented and procedural languages. What to do if it takes forever to make a change. How to break dependencies. How to find interception points and pinch points. How to find structure in the design and the code. What tests to write and how to get automated tests to work. Changing data in a production system, especially data shared with other systems, isn’t easy either. You need to plan out API changes and data structure changes as carefully as possible, but you can still make data and database changes in small, structured steps. To make code changes in steps you can use Branching by Abstraction where it makes sense (like making back-end changes) and you can protect customers from changes through Feature Flags and Dark Launching like Facebook and Twitter and Flickr do to continuously roll out changes – although you need to be careful, because if taken too far these practices can make code more fragile and harder to work with. Agile development teams follow incremental design and development to help them discover an optimal solution through trial-and-error. Maintenance teams work this way for a different reason – to manage technical risks by breaking big changes down and making small bets instead of big ones. Working this way means that you have to put in scaffolding (and remember to take it out afterwards) and plan out intermediate steps and review and test everything as you make each change. Sometimes it might feel like you are running in place, that it is taking longer and costing more. But getting there in small steps is much safer, and gives you a lot more control. Teams working on large legacy code bases and old technology platforms will have a harder time taking on these ideas and succeeding with them. But that doesn’t mean that they won’t work. Yes, you can be Agile in maintenance.
October 14, 2011
by Mitch Pronschinske
· 17,525 Views
article thumbnail
Git: Getting the history of a deleted file
We recently wanted to get the Git history of a file which we knew existed but had now been deleted so we could find out what had happened to it. Using a simple git log didn’t work: git log deletedFile.txt fatal: ambiguous argument 'deletedFile.txt': unknown revision or path not in the working tree. We eventually came across Francois Marier’s blog post which points out that you need to use the following command instead: git log -- deletedFile.txt I’ve tried reading through the man page but I’m still not entirely sure what the distinction between using – and not using it is supposed to be. If someone could explain it that’d be cool… From http://www.markhneedham.com/blog/2011/10/04/git-getting-the-history-of-a-deleted-file
October 10, 2011
by Mark Needham
· 48,898 Views · 1 Like
article thumbnail
The Goal of software development
The Goal by Eli Goldratt is a business book in the form of a novel, where the protagonist must save his factory from closing due to very low productivity. The Goal is not limited to the management of a large organization (not even to for-profit companies): you simply have to define different units of measurement, like goal units instead of making money, the default goal. In fact, from the applications of the Theory of Constraints in our field I think it applies to software development too. What follows is my translation of the themes of The Goal to our field. The Goal is... Mking money, of course. For the ones of you with knowledge in accounting, the original goal is: raising throghput, the amounts of items sold (not produced) in the unit of time. Lowering investment/inventory, all the money tied up in the system in the form of assets that could be sold or products that stay in a warehouse. Lowering operational expense, all the money that we have spent as support and that cannot be recovered. How does these measurements apply to software development? A team does not always have an impact on contract negotiation, so often talking about money is far from everyday reality (kudos to you if you can apply that point of the Agile Manifesto.) The goal for software development can be translated, in my opinion, to: raising the throughput, the amount of features delivered (deployed, not implemented or tested) in the unit of time. You can measure this amount in story points, since feature vary in size. lowering investment/inventory, all the time tied up in the system in the form of undeployed or untested features that clutter the code base. In a minor part, also investment in the form of hardware, but that's by far less important than the team's time. lower operational expense, the time spent by developers every day in order to support the development. Automation is a kind of time investment that will bring more time (and quality) in the future, lowering operational expense. Like for material products, WIP has storage and opportunity costs which goes into the operational expense. Kanban is a tool that tries to reduce WIP in order to foster the two latter points. Throughput accounting This kind of throughput accounting is emphasized in the novel, over the use of cost accounting, where each developer (ehm, factory worker) has to be occupied all the time, even if the work he is doing isn't moving towards the goal: neverending refactoring. Specification of a feature which cannot be implemented until two months are gone, and will have to be rewritten. Implementation of features which won't be merged with the main branch any time soon. With Test-Driven Development, we are getting good at moving a feature from implemented to tested directly in the same commit. Yet the missing step is getting the feature to the users: maybe that's also what Continuous Deployment is all about... Dependent events Dependent events and statistical fluctuations are production systems topics that make a balanced plant close to bankruptcy: however, we're not at the point in which we can model our team as precisely as a factory. The basic point is that a plant in which everyone is working all the time is inefficient: when an early stage (like defining a specification or implementing a feature) gets delayed, downstream step such as deployment are dalayed too. Converely, when an upstream step finish earlier, the downstream stage is already at maximum efficiency and cannot process the intermediate result faster. I wonder if this applies to software development too. In a factory, workers are specialized and can do just a few jobs across the plant. Since workers and machines have different production rates, there will be just one bottleneck: the slowest one. If products have to pass from the bottleneck, anyone producing faster than the bottleneck will just accumulate WIP in front of him. Continuing with our example, if the analyst or domain expert is churning out specifications for new features every day, most of them are just WIP in front of the development team. Once there is an established buffer, any additional specification won't raise throughput any faster; instead, it will raise the inventory (partial features) and the time spent in managing it. I think this is not always true in the most technical phases of development instead. For example, in a small team a developer may be moved to testing or refactoring, or setting up Continuous Integration or evaluation of a new library. Unless you have a DBA which can just manage databases, your developer is not fixed into a stage of the system. The bottleneck The previous example featured a bottleneck, the most famous concept of the Theory of Constraints introduced in the book. This translation to the software development case is mine and could be incomplete. A feature (or a user story) has to pass in a series of stations where different people will work on it to make it real: specification from a domain expert, implementation from a technical team, extended testing with optional customer validation, deployment which should be fast but at the same time must not kill the current version of the application. Each station has an average velocity. By definition, there is a station which is the slowest and can process fewest story points in the unit of time. This is the bottleneck. (This is not always true, as velocity may vary greatly in time with the addition of new people or hidden lines discovered in a feature. Becoming good at estimation and stabilizing a team are two objectives that help reach the assumption.) You can identify a bottleneck by looking at where is the WIP: it will accumulate in front of it. If you already have a kanban board, this phase is simpler... Once identified, the throughput of the system can only be improved by raising the bottleneck's capacity enough so that it is no more a bottleneck. You can move people to it (keeping an eye on communication costs); ensure it is used at maximum efficiency by freeing the specialized developers from other mundane tasks. Now you can restart and find a new bottleneck... Conclusions This is just a little introduction to the themes of the Theory of Constraints and Goldratt's teachings; I don't pretend to explain the whole book in an article. It is also against the Socratic method: you should reach the answers yourself, and these are just examples from my experience. There is more to Goldratt and The Goal than bottlenecks and throughput, such as continuous improvement. If you're working or managing in a software development team, I suggest you to read this book if you have the opportunity. Even when freelancing, it is an eye-opener in moving towards a Goal instead of busyworking; and it's written with a never boring teaching method.
October 4, 2011
by Giorgio Sironi
· 21,610 Views
article thumbnail
EC2 Interview – AWS Interview – Cloud Interview – 8 Questions
If you're looking for a cloud expert, specifically someone who knows Amazon Web Services and EC2, you'll want to have a battery of questions to assess their knowledge.
September 15, 2011
by Sean Hull
· 111,875 Views · 1 Like
article thumbnail
Cloud Integration with Apache Camel and Amazon Web Services (AWS): S3, SQS and SNS
The integration framework Apache Camel already supports several important cloud services (see my overview article at http://www.kai-waehner.de/blog/2011/07/09/cloud-computing-heterogeneity-will-require-cloud-integration-apache-camel-is-already-prepared for more details). This article describes the combination of Apache Camel and the Amazon Web Services (AWS) interfaces of Simple Storage Service (S3), Simple Queue Service (SQS) and Simple Notification Service (SNS). Thus, The concept of Infrastructure as a Service (IaaS) is used to access messaging systems and data storage without any need for configuration. Registration to AWS and Setup of Camel First, you have to register to the Amazon Web Services (for free). Most AWS services include a free monthly quota, which is absolutely sufficient to play around and develop some simple applications. As its name states, AWS uses technology-independent web services. Besides, APIs for several different programming languages are available to ease development. By the way, Camel uses the AWS SDK for Java (http://aws.amazon.com/sdkforjava), of course. The documentation is detailed and easy to understand, including tutorials, screenshots and code examples . Hint 1: You should read the introductions to S3, SQS and SNS (go to http://aws.amazon.com and click on „products“) and play around with the AWS Management Console (http://aws.amazon.com/console) before you continue. This step is very easy and takes less than one hour. Then, you will have a much better understanding about AWS and where Camel can help you! Hint 2: It really helps to look at the source code of the camel-aws component, It helps you to understand how Camel uses the AWS Java API internally. If you want to write tests, you can do it the same way. In the past, I was afraid of looking at „complex“ source code of open source frameworks. But there is no need to be scared! The camel-aws component (and most other camel components) contain only of a few classes. Everything is easy to understand. It helps you to understand Camel internals, the AWS API, and to spot and solve errors due to exceptions in your code. In the meanwhile, the current Camel version 2.8 supports three AWS services: S3, SQS and SNS. All of them use similar concepts. Therefore, they are included in one single camel component: „camel-aws“. You have to add the libraries to your existing Camel project. As always, the simplest way is to use Maven and add the following dependency to the pom.xml: org.apache.camel camel-aws ${camel-version} Configuration of the Camel Endpoint The implementation and configuration of all three services is very similar. The URI looks like this (the code shows the SQS service): aws-sqs://queue-name[?options] There are two alternatives to configure your endpoint. Using Parameters The easy way is to use two paramters in the URI of your endpoint: „accessKey“ and „secretKey“ (you receive both after your AWS registration). “aws-sqs://unique-queue-name?accessKey=“INSERT_ME“&secretKey=INSERT_ME” Be aware of the following problem, which can result in a strange, non-speaking exception (thanks to Brendan Long): You’ll need to URL encode any +’s in your secret key (otherwise, they’ll be treated as spaces). + = %2B, so if your secretkey was “my+secret\key”, your Camel URL should have “secretKey=my%2Bsecret\key”. “Within the query string, the plus sign is reserved as shorthand notation for a space. Therefore, real plus signs must be encoded. This method was used to make query URIs easier to pass in systems which did not allow spaces.” Source: WC3 URI Recommendations Adding a configured AmazonClient to the Registry If you need to do more configuration (e.g. because your system is behind a firewall), you have to add an AmazonClient object to your registry. The following code shows an example using SQS, but SNS and S3 use exactly the same concept. @Override protected JndiRegistry createRegistry() throws Exception { JndiRegistry registry = super.createRegistry(); AWSCredentials awsCredentials = new BasicAWSCredentials(“INSERT_ME”, “INSERT_ME”); ClientConfiguration clientConfiguration = new ClientConfiguration(); clientConfiguration.setProxyHost(“http://myProxyHost”); clientConfiguration.setProxyPort(8080); AmazonSQSClient client = new AmazonSQSClient(awsCredentials, clientConfiguration); registry.bind(“amazonSQSClient”, client); return registry; } This example overwrites the createRegistry() method of a JUnit test (extending CamelTestSupport). You can also add this information to your runtime Camel application, of course. Apache Camel and the Simple Storage Service (S3) Simple Storage Service (S3) is a key-value-store. You can store small to very large data. The usage is very easy. You create buckets and put key-value data into these buckets. You can also create folders within buckets to organize your data. That’s it. You can monitor your buckets using the AWS Management Console – an intuitive GUI supporting most AWS services. The following example shows both alternatives for accessing the Amazon services (as described above): Paramenters and the AmazonClient. // Transfer data from your file inbox to the AWS S3 service from(“file:files/inbox”) // This is the key of your key-value data .setHeader(S3Constants.KEY, simple(“This is a static key”)) // Using parameters for accessing the AWS service .to(“aws-s3://camel-integration-bucket-mwea-kw?accessKey=INSERT_ME&secretKey=INSERT_ME&region=eu-west-1″); // Transfer data from the AWS S3 service to your file outbox from(“aws-s3://camel-integration-bucket-mwea-kw?amazonS3Client=#amazonS3Client&region=eu-wes”) .to(“file:files/outbox”); There are some additional parameters, for instance you can submit the desired AWS region or delete data after receiving it (see http://camel.apache.org/aws-s3.html and the corresponding SQS and SNS sites for more details about parameters and message headers). As you see in the code, you can use the AWS-S3 endpoint for producing and for consuming messages. Each bucket must be unique, thus you have to add some specific information such as your company to its name. Hint: If a bucket does not exist, Camel is creating it automatically (as the AWS API does). This concept is also used for SQS queues and SNS topics. Apache Camel and the Simple Queue Service (SQS) The Simple Queue Service (SQS) is similar to a JMS provider such as WebSphere MQ or ActiveMQ (but with some differences). You create queues and send messages to them. Consumers receive the messages. Contrary to most other AWS services, you cannot monitor queues by using the AWS management console directly. You have to use the service „Cloudwatch“ (http://aws.amazon.com/cloudwatch) and start an EC2 instance to monitor queues and its content. As you can see in the following code example, the syntax and concepts are almost the same as for the S3 service: from(“file:inbox”) .to(“aws-sqs://camel-integration-queue-mwea-kw?accessKey=INSERT_ME&secretKey=INSERT_ME”); from(“aws-sqs://camel-integration-queue-mwea-kw?amazonSQSClient=#amazonSQSClient”) .to(“file:outbox?fileName=sqs-${date:now:yyyy.MM.dd-hh:mm:ss:SS}”); Again, you can use the AWS-SQS endpoint for producing and for consuming messages. Each queue name must be unique. There exist two important differences to JMS (copy & paste from the AWS documentation): Q: How many times will I receive each message? Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies. Q: Why are there separate ReceiveMessage and DeleteMessage operations? When Amazon SQS returns a message to you, that message stays in the queue, whether or not you actually received the message. You are responsible for deleting the message; the delete request acknowledges that you’re done processing the message. If you don’t delete the message, Amazon SQS will deliver it again on another receive request. Apache Camel and the Simple Notification Service (SNS) The Simple Notification Service (SNS) acts like JMS topics. You create a topic, consumers subscribe to the topic and then receive notifications. Several transport protocols are supported: HTTP(S), Email and SQS. Further interfaces will be added in the future, e.g. the Short Message Service (SMS) for mobile phones. Contrary to S3 and SQS, Camel only offers a producer endpoint for this AWS service. You can only create topics and send messages via Camel. The reason is simple: Camel already offers endpoints for consuming these messages: HTTP, Email and SQS are already available. There is one tradeoff: A consumer cannot subscribe to topics using Camel – at the moment. The AWS Management Console has to be used. A very interesting discussion can be read on the Camel JIRA issue regarding the following questions: Should Camel be able to subscribe to topics? Should the producer contain this feature or should there be a consumer? In my opinion, there should be a consumer which is able to subscribe to topics, otherwise Camel is missing a key part of the AWS SNS service! Please read the discussion and contribute your opinion: https://issues.apache.org/jira/browse/CAMEL-3476. Apache Camel is already ready for the Cloud Computing Era AWS offers many more services for the cloud. Probably, it does not make sense to integrate everyone into Camel, but more AWS services will be supported in the future. For instance, SimpleDB and the Relational Database Service (RDS) are already planned and make sende, too: http://camel.apache.org/aws.html. The conclusion is easy: Apache Camel is already ready for the cloud computing era. Several important cloud services are already supported. Cloud integration will become very important in the future. Thus, Camel is on a very good way. Hopefully, we will see more cloud components, soon. I will continue to write articles about other Camel cloud components (and new AWS addons, ouf course). For instance, a component for the Platform as a Service (PaaS) product Google App Engine (GAE) is already available. If you have any additional important information, questions or other feedback, please write a comment. Thank you in advance… Best regards, Kai Wähner (Twitter: @KaiWaehner) [Content from my Blog: Cloud Integration with Apache Camel and Amazon Web Services (AWS): S3, SQS and SNS]
August 30, 2011
by Kai Wähner DZone Core CORE
· 26,153 Views
article thumbnail
Lucene's near-real-time search is fast!
Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time. In some cases, such as modern social/news sites (e.g., LinkedIn, Twitter, Facebook, Stack Overflow, Hacker News, DZone, etc.), fast turnaround time is a hard requirement. Fortunately, it's trivial to use. Just open your initial NRT reader, like this: // w is your IndexWriter IndexReader r = IndexReader.open(w, true); (That's the 3.1+ API; prior to that use w.getReader() instead). The returned reader behaves just like one opened with IndexReader.open: it exposes the point-in-time snapshot of the index as of when it was opened. Wrap it in an IndexSearcher and search away! Once you've made changes to the index, call r.reopen() and you'll get another NRT reader; just be sure to close the old one. What's special about the NRT reader is that it searches uncommitted changes from IndexWriter, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often commit is called), something not previously possible. Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes. The reopen time is in proportion to how many changes you made since last reopening that reader. Lucene's approach is a nice compromise between immediate consistency, where changes are visible after each index change, and eventual consistency, where changes are visible "later" but you don't usually know exactly when. With NRT, your application has controlled consistency: you decide exactly when changes must become visible. Recently there have been some good improvements related to NRT: New default merge policy, TieredMergePolicy, which is able to select more efficient non-contiguous merges, and favors segments with more deletions. NRTCachingDirectory takes load off the IO system by caching small segments in RAM (LUCENE-3092). When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (LUCENE-2900). Segments that are 100% deleted are now dropped instead of inefficiently merged (LUCENE-2010). How fast is NRT search? I created a simple performance test to answer this. I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents. Using this index, the test then reindexes all the documents again, this time at a fixed rate of 1 MB/second plain text. This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as Twitter's recent peak during this year's superbowl (4,064 tweets/second), assuming every tweet is 140 bytes, and assuming Twitter indexed all tweets on a single shard. The test uses updateDocument, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments. In addition, 8 search threads run a fixed TermQuery at the same time. Finally, the NRT reader is reopened once per second. I ran the test on modern hardware, a 24 core machine (dual x5680 Xeon CPUs) with an OCZ Vertex 3 240 GB SSD, using Oracle's 64 bit Java 1.6.0_21 and Linux Fedora 13. I gave Java a 2 GB max heap, and used MMapDirectory. The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec; here's the resulting QPS and NRT reopen delay (milliseconds) over that time: The search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; the graph is an interactive Dygraph, so if you click through above, you can then zoom in to any interesting region by clicking and dragging. You can also apply smoothing by entering the size of the window into the text box in the bottom left part of the graph. Search QPS dropped substantially with time. While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted and thus are still visited but then filtered out, during searching. They are only truly deleted when the segments are merged. TermQuery is a worst-case query; harder queries, such as BooleanQuery, should see less slowdown from deleted, but not reclaimed, documents. Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped. It looks like TieredMergePolicy should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off. Once it finished the QPS recovered somewhat. Note that a real NRT application with deletions would see a more stable QPS since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical. Reopen delay during merging The reopen delay is mostly around 55-60 milliseconds (mean is 57.0), which is very fast (i.e., only 5.7% "duty cycle" of the every 1.0 second reopen rate). There are random single spikes, which is caused by Java running a full GC cycle. However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40). Many small merges (up to a few 100s of MB) were done but don't seem to impact reopen delay. Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching. I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities. It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen. However, if that were the case, NRTCachingDirectory (used by the test) should have prevented it, but didn't. It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've set Linux's swappiness to 0. Yet another possibility is Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's pdflush tunables could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables, even swappiness. Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling Directory implementations to pass appropriate flags to the OS when opening files for merging (LUCENE-2793 and LUCENE-2795). From past testing I know that passing O_DIRECT can prevent merges from evicting hot pages, so it's possible this will fix our slow reopen time as well since it bypasses the write cache. Finally, it's always possible other OSs do a better job managing the buffer cache, and wouldn't see such reopen delays during large merges. This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it. It could be we should simply add our own IO throttling, so we can control net MB/sec read and written by merging activity. This would make a nice addition to Lucene! Except for the slowdown during merging, the performance of NRT is impressive. Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough. While there are exciting ideas to bring true real-time search to Lucene, by directly searching IndexWriter's RAM buffer as Michael Busch has implemented at Twitter with some cool custom extensions to Lucene, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT. NIOFSDirectory vs MMapDirectory Out of curiosity, I ran the exact same test as above, but this time with NIOFSDirectory instead of MMapDirectory: There are some interesting differences. The search QPS is substantially slower -- starting at 107 QPS vs 151, though part of this could easily be from getting different compilation out of hotspot. For some reason TermQuery, in particular, has high variance from one JVM instance to another. The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur). The takeaway message seems clear: on Linux, use MMapDirectory not NIOFSDirectory! Optimizing your NRT turnaround time My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text). You should test specifically for your use-case what reopen rate works best. Generally, the more frequently you reopen the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate. Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements. There are also some simple steps you can take to reduce the turnaround time: Store the index on a fast IO system, ideally a modern SSD. Install a merged segment warmer (see IndexWriter.setMergedSegmentWarmer). This warmer is invoked by IndexWriter to warm up a newly merged segment without blocking the reopen of a new NRT reader. If your application uses Lucene's FieldCache or has its own caches, this is important as otherwise that warming cost will be spent on the first query to hit the new reader. Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices. The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk). If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the Pulsing codec for your id field since this gives faster lookup performance which will make your reopen faster. Use the new NRTCachingDirectory, which buffers small segments in RAM to take load off the IO system (LUCENE-3092). Pass false for applyDeletes when opening an NRT reader, if your application can tolerate seeing deleted doccs from the returned reader. While it's not clear that thread priorities actually work correctly (see this Google Tech Talk), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally lowest should be all searching threads. If the machine becomes saturated, ideally only the search threads should take the hit. Happy near-real-time searching!
July 11, 2011
by Michael Mccandless
· 19,020 Views
article thumbnail
Git Tutorial: Comparing Files With diff
The most common scenario to use diff is to see what changes you made after your last commit. Let’s see how to do it.
June 19, 2011
by Veera Sundar
· 271,836 Views · 2 Likes
article thumbnail
Git Tip : Restore a deleted tag
A little tip that can be very useful, how to restore a deleted Git tag. If you juste deleted a tag by error, you can easily restore it following these steps. First, use git fsck --unreachable | grep tag then, you will see the unreachable tag. If you have several tags on the list, use git show KEY to found the good tag and finally, when you know which tag to restore, use git update-ref refs/tags/NAME KEY and the previously deleted tag with restore with NAME. Thanks to Shawn Pearce for the tip. From http://www.baptiste-wicht.com/2011/06/git-tip-restore-a-deleted-tag/
June 16, 2011
by Baptiste Wicht
· 27,710 Views
article thumbnail
Merge Policy Internals in Solr
last week, a colleague asked me a really simple question about segments merging in solr. after discussing the answer for some minutes while playing around with solr, i realized that there are a lot of subtleties in this subject. so i started to look at the code and discovered some interesting things, which i’m going to summarize in this post. what is a merge policy? the process of choosing which segments are going to be merged, and in which way, is carried out by an abstract class named mergepolicy. this class builds a mergespecification, which is an object composed of a list of onemerge objects. each one of them represents a single merge operation, defined by the segments that will be merged into a new one. after a change in the index, the indexwriter invokes the mergepolicy to obtain a mergespecification. next, it invokes the mergescheduler, who is in charge of determining when the merges will be executed. there are mainly two implementations of mergescheduler: the concurrentmergescheduler, which runs each merge in a separate thread, and the serialmergescheduler, which runs all the merges sequentially in the current thread. finally, when the time to do a merge comes, the indexwriter does part of the job, and delegates the other part to a segmentmerger. so, if we want to know when a set of segments is going to be merged, why a segment is being merged and another one isn’t, and other things like these, we should take a look at the mergepolicy. there are many implementations of mergepolicy, but i’m going to focus in one of them (logbytesizemergepolicy), because it’s solr’s default, and i believe that it’s the one used by most people. mergepolicy defines three abstract methods to construct a mergespecification: findmerges is invoked whenever there is a change in the index. this is the method that i’ll study in this post. findmergesforoptimize is invoked whenever an optimize operation is called. findmergestoexpungedeletes is invoked whenever an expunge deletes operation is called. step by step i’ll start by giving a brief and conceptual description of how the merge policy works, that you can follow by looking at the figure below: sort the segments by name. group the existing segments into levels. each level is a contiguous set of segments. for each level, determine the interval of segments that are going to be merged. parameters lets define a couple of parameters involved in the process of merging the segments that compose an index: mergefactor: this parameter determines many things, like how many segments are going to be merged into a new one, the maximum number of segments that can be in a level and the span of each level. can be set in solrconfig.xml. minmergesize: all segments whose size is less than this parameter’s value will belong to the same level. this value is fixed. maxmergesize: all segments whose size is greater than this parameter’s value won’t be ever merged. this value is fixed. maxmergedocs: all segments containing more documents than this parameter’s value won’t be merged. can be set in solrconfig.xml. constructing the levels let’s see how levels are constructed. to define the first level, the algorithm searches the maximum segment’ size. let’s call this value levelmaxsize. if this value is less than minmergesize, then all the segments are considered to be at the same level. otherwise, let’s define levelminsize as: that’s quite an ugly arithmetic formula, but it means something like this: levelminsize will be almost mergefactor times less than levelmaxsize (it would have been mergefactor times if 1 had been used as exponent instead of 0.75), but if it goes below minmergesize, make it equal to minmergesize. after this calculation, the algorithm will choose which segments belong to the current level. to do this, it will find the newest segment whose size is greater or equal than levelminsize, and will consider this segment, and all the segments older than it, to be part of this level. the next levels will be constructed using the same algorithm, but constraining the set of available segments to the ones newer than the newest of the previous level. in the next figure you can see a simple example with mergefactor=10 and minmergesize=1.6mib. the intention behind using a mergefactor of 10 is to obtain levels with span of decreasing order of magnitude. but in some cases, this algorithm can result in a structure of levels that you won’t expect if you don’t know the internals. take, for example, the segments of the following table, assuming mergefactor=10 and minmergesize=1.6mib: segment size a 200 mib l 88 mib m 8.9 mib n 6.5 mib o 1.4 mib p 842 kib q 842 kib r 842 kib s 842 kib t 842 kib u 842 kib v 842 kib w 842 kib x 160 mib how many levels are in this case? let’s see: the maximum segment size is 200 mib, thus, levelminsize is 35 mib. the newest segment with size greater than levelminsize is x , so the first level will include x and all the segments older than it. this means that, in this case, there’s only one level! choosing which segments to merge after defining the levels, the mergepolicy will choose which segments to merge. to do this, it’ll analyze each level separately. if a level has less than mergefactor segments, then it’ll be skipped. otherwise, each group of mergefactor contiguous segments will be merged into a new one. if at least one of the segments in a group is bigger than maxmergefactor, or has more than maxmergedocs documents, then the group is skipped. resuming the second example of the previous section, where only one level is present, the result of the merge will be: segment size u 842 kib v 842 kib w 842 kib x 160 mib y 311 mib minmergesize and maxmergesize through the post, i’ve talked about these two parameters, which are involved in the process of merging. it’s worth noting that the value of them is hard coded in the source code of lucene, and their values are the following: minmergesize has a fixed value of 1.6 mib. this means that any segment whose size is less than 1.6 mib will be included in the last level. maxmergesize has a fixed value of 2 gib. this means that any segment whose size is greater than 2 gib will never be merged. conclusion while the algorithm itself isn’t extremely complex, you need to understand the internals if you want to predict what will happen to your index when more documents are added. also, knowing how the merge policy works will help you in determining the value of some parameters like mergefactor and maxmergedocs.
May 23, 2011
by Juan Grande
· 16,409 Views
article thumbnail
Git backups, and no, it's not just about pushing
Git is a backup system itself: for example, you can version your .txt folders containing TODO lists. Since Git version your files just like it does for code, after accidental deletion or modifications it will be able to bring you back. Yet, if you do not regularly push your commits, a problem with the drive containing the repository may cause the loss of all your work. You can put the repository in Dropbox or on a similar service, but I don't trust it. Dropbox syncs files in .git independently from the rest and from one another, and it may break temporarily or for good the repository. By the way, I only want to snapshot a backup at specific points in time, not always occupying my connection by instant mirroring. A note before beginning: with binary data Git is not proficient as a backup tool: text works a lot better (it's like code). This article is dedicated to the backup of code and textual content. Push is not a backup For example, because it may lack branches. In general, pushing to origin is not even an option as you may not want to push your changes yet, but still perform a backup. It's only in the open source world that backup corresponds to publishing online. However, thanks to decentralization there are some simple solutions, involving the creation of repositorite different from origin: git clone /path/to/working/copy #creates the backup git pull #origin master of course, updates the backup # you can specify better branches via the local configuration of the backup copy (git config) The inverse solution, involging pulling, is also possible: git init . #in the folder of your backup, or you can use a remote repository git remote add backup_repo /path/to/backup/repo #or a git:// repo git push backup #master usually, but also multiple branches git push --all backup #an alternative that pushes all branches All the commands, also the one that will follow, are just bash commands: it's easy to create a script and automate its execution with cron, anacron or whatever you want. The Force"del" Unix is powerful in you. git bundle git bundle is another command that may be used for backup purposes. It will create a single file containing all the refs you need to export from your local repository. It's often used for publishing commits via USB keys or other media in absence of a connection. For one branch, it's simple. This command will create a myrepo.bundle file. git bundle create myrepo.bundle master For more branches or tags, it's still simple: git bundle create myrepo.bundle master other_branch Restoring the content of the bundle is a single comment. Inside an empty repo, type: git bundle unbundle myrepo.bundle Instead if you do not have a repo, and just want to recreate the old one: git clone myrepo.bundle -b master myrepo_folder In emergency situations, bundle comes handy. But my issue with that command is that I always forget something when I use it: for example in my tutorial repository I had a lot of tags, but bundle did not include them by default (you have to specify the whole references list like for master other_branch.) Tarballs An alternative is just to archive the repository in a tar.gz or tar.bz file. tar -cf repository.tar repository/ gzip repository.tar # or bzip2 repository.tar After that, you can use scp or even rsync (but I don't think it will speed up much) to put repository.tar.gz on another medium. The weight is higher in this case, since the repository contains also the checked out working copy. But you don't have to learn new commands: apart from the weight and the lack of incremental updates, this solution works fine. Bare repositories You can use git clone --bare repository/ backup_folder/ to create a bare copy of the repository, as a backup. The bare repository does not maintain a checked out working tree, and as so saves space and time for its transferral. This method can be used in conjunction with the pull/push or the tarball method. For restoring the backup: git clone backup_folder/ new_repository/ will recreate the original situation in new_repository. In any of the cases the new folders are created automatically. I won't advise to just copy the folder as often on other backup filesystems (like an USB key's vfat) permissions, owner and other metadata are lost. Conclusion So now you have some alternatives for backing up your repositories or transporting them without setting up a server like Gitosis or passing from the publicly available Github. In fact, I researched this techniques for transporting my tutorial code to phpDay 2011 and the Dutch PHP Conference, and they have worked pretty well.
May 18, 2011
by Giorgio Sironi
· 72,937 Views · 1 Like
article thumbnail
Bootstrapping CDI in several environments
i feel like writing some posts about cdi (contexts and dependency injection). so this is the first one of a series of x posts ( 0 javax.enterprise cdi-api 1.0 provided an empty beans.xml will do to enable cdi you must have a beans.xml file in your project (under the meta-inf or web-inf). that’s because cdi needs to identify the beans in your classpath (this is called bean discovery) and build its internal metamodel. with the beans.xml file cdi knows it has beans to discover. so, for all the following examples i’ll make it simple and will leave this file completely empty. java ee 6 containers let’s start with the easiest possible environment : java ee 6 containers . why is it the simplest ? well, because you don’t have to do anything : cdi is part of java ee 6 as well as the web profile 1.0 so you don’t need to manually bootstrap it. let’s see how to inject a cdi bean within an ejb 3.1 and a servlet 3.0 . ejb 3.1 since ejb 3.1 you can use the ejbcontainer api to get an in-memory embedded ejb container and you can easily unit test your ejbs. so let’s write an ejb and a test class. first let’s have a look at the code of the ejb. as you can see, with version 3.1 an ejb is just a pojo : no inheritance, no interface, just one @stateless annotation. it gets a reference of the hello bean buy using the @inject annotation and uses it in the saysomething() method. @stateless public class mainejb31 { @inject hello hello; public string saysomething() { return hello.sayhelloworld(); } } you can now package the mainejb31, hello and world classes with the empty beans.xml file into a jar, deploy it to glassfish 3.x , and it will work. but if you don’t want to bother deploying it to glassfish and just unit test it, this is what you need to do : public class mainejbtest { private static ejbcontainer ec; private static context ctx; @beforeclass public static void initcontainer() throws exception { map properties = new hashmap(); properties.put(ejbcontainer.modules, new file("target/classes")); ec = ejbcontainer.createejbcontainer(properties); ctx = ec.getcontext(); } @afterclass public static void closecontainer() throws exception { if (ec != null) ec.close(); } @test public void shoulddisplayhelloworld() throws exception { // looks up the ejb mainejb31 mainejb = (mainejb31) ctx.lookup("java:global/classes/mainejb!org.antoniogoncalves.cdi.helloworld.mainejb"); assertequals("should say hello world !!!", "hello world !!!", mainejb.saysomething()); } } in the code above the method initcontainer() initializes the ejbcontainer. the shoulddisplayhelloworld() looks up the ejb (using the new portable jndi name ), invokes it and makes sure the saysomething() method returns hello world !!!. green test. that was pretty easy too. servlet 3.0 servlet 3.0 is part of java ee 6, so again, there is no needed configuration to bootstrap cdi. let’s use the new @webservlet annotation and write a very simple one that injects a reference of hello and displays an html page with hello world !!!. this is what the servlet looks like : @webservlet(urlpatterns = "/mainservlet") public class mainservlet30 extends httpservlet { @inject hello hello; @override protected void service(httpservletrequest req, httpservletresponse resp) throws servletexception, ioexception { resp.setcontenttype("text/html"); printwriter out = resp.getwriter(); out.println(""); out.println(""); out.println(""); out.println(saysomething()); out.println(""); out.println(""); out.close(); } public string saysomething() { return hello.sayhelloworld(); } } thanks to the @webservlet i don’t need any web.xml (it’s optional in servlet 3.0) to map the mainservlet30 to the /mainservlet url. you can now package the mainservlet30, hello and world classes with the empty beans.xml and no web.xml into a war, deploy it to glassfish 3.x , go to http://localhost:8080/bootstrapping-servlet30-1.0/mainservlet and it will work. unfortunately servlet 3.0 doesn’t have an api for the container (such as ejbcontainer). there is no servletcontainer api that would let you use an embedded servlet container in a standard way and, why not, easily unit test it. application client container not many people know it, but java ee (or even older j2ee versions) comes with an application client container (acc). it’s like an ejb or servlet container but for plain pojos. for example you can develop a swing application (yes, i’m sure that some of you still use swing), run it into the acc and get some extra services given by the container (security, naming, certain annotations…). glassfish v3 has an acc that you can launch in a command line : appclient -jar . so i thought, great, i can use cdi with acc the same way i use it within ejb or servlet container, no need to bootstrap anything, it’s all out of the box. i was wrong . as per the cdi specification (section 12.1), cdi is not required to support application client bean archives. so the glassfish application client container doesn’t support it. i haven’t tried the jboss acc , maybe it works. other containers the beauty of cdi is that it doesn’t require java ee 6 . you can use cdi with simple pojos in a java se environment, as well as some servlet 2.5 containers. of course it’s not as easy to bootstrap because you need a bit of configuration. but it then works fine (not always but). java se 6 ok, so until now there was nothing to do to bootstrap cdi. it is already bundled with the ejb 3.1 and servlet 3.0 containers of java ee 6 (and web profile). so the idea here is to use cdi in a simple java se environment. coming back to our hello and world classes, we need a pojo with an entry point that will bootstrap cdi so we can use injection to get those classes. in standard java se when we say entry point , we think of a public static void main(string[] args) method. well, we need something similar… but different. weld is the reference implementation of cdi. that means it implements the specification, the standard apis (mostly found in javax.inject and javax.enterprise.context packages) but also some proprietary code (in org.jboss.weld package). bootstrapping cdi in java se is not specified so you will need to use specific weld features. you can do that in two different flavors: by observing the containerinitialized event or using the programatic bootstrap api consisting of the weld and weldcontainer classes. the following code uses the containerinitialized event. as you can see, it uses the @observes annotation that i’ll explain in a future post. but the idea is that this class is listening to the event and processes the code once the event is triggered. import org.jboss.weld.environment.se.events.containerinitialized; import javax.enterprise.event.observes; import javax.inject.inject; public class mainjavase6 { @inject hello hello; public void saysomething(@observes containerinitialized event) { system.out.println(hello.sayhelloworld()); } } but who trigers the containerinitialized event ? well, it’s the org.jboss.weld.environment.se.startmain class. i’m using maven so a nice trick is to use the exec-maven-plugin to run the startmain class. download the code , have a look at the pom.xml and give it a try. the other possibility is to programmatically bootstrap the weld container. this can be handy in unit testing. the code below initializes the weld container (with new weld().initialize()) and then looks for the hello class (using weld.instance().select(hello.class).get()). import org.jboss.weld.environment.se.weld; import org.jboss.weld.environment.se.weldcontainer; import org.junit.beforeclass; import org.junit.test; import static junit.framework.assert.assertequals; public class hellotest { @test public void shoulddisplayhelloworld() { weldcontainer weld = new weld().initialize(); hello hello = weld.instance().select(hello.class).get(); assertequals("should say hello world !!!", "hello world !!!", hello.sayhelloworld()); } } execute the test with mvn test and it should be green. as you can see, there is a bit more work using cdi in a java se environment, but it’s not that complicated. tomcat 6.x ok, and what about your legacy servlet 2.5 containers ? the first one that comes in mind is tomcat 6.x ( note that tomcat 7.x will implement servlet 3.0 but is still in beta version at the time of writing this post ). weld provides support for tomcat but you need to configure it a bit to make cdi work. first of all, this is a servlet 2.5, not a 3.0. so the code of the servlet is slightly different from the one seen before (no annotation allowed) and of course, you need your good old web.xml file : public class mainservlet25 extends httpservlet { @inject hello hello; @override protected void service(httpservletrequest req, httpservletresponse resp) throws servletexception, ioexception { resp.setcontenttype("text/html"); printwriter out = resp.getwriter(); out.println(""); out.println(""); out.println(""); out.println(saysomething()); out.println(""); out.println(""); out.close(); } public string saysomething() { return hello.sayhelloworld(); } } because we don’t have a @webservlet annotation in servlet 2.5, we need to declare and map it in the web.xml (using the servlet and servlet-mapping tags). then, you need to explicitly specify the servlet listener to boot weld and control its interaction with requests (org.jboss.weld.environment.servlet.listener). tomcat has a read-only jndi, so weld can’t automatically bind the beanmanager extension spi. to bind the beanmanager into jndi, you should populate meta-inf/context.xml and make the beanmanager available to your deployment by adding it to your web.xml: mainservlet25 org.antoniogoncalves.cdi.bootstrapping.servlet.mainservlet25 mainservlet25 /mainservlet org.jboss.weld.environment.servlet.listener beanmanager javax.enterprise.inject.spi.beanmanager the meta-inf/context.xml file is an optional file which contains a context for a single tomcat web application. this can be used to define certain behaviours for your application, jndi resources and other settings. package all the files (mainservlet25, hello, world, meta-inf/context.xml, beans.xml and web.xml) into a war and deploy it into tomcat 6.x. go to http://localhost:8080/bootstrapping-servlet25-tomcat-1.0/mainservlet and you will see your hello world page. jetty 6.x another famous servlet 2.5 containers is jetty 6.x (at codehaus) and jetty 7.x ( note that jetty 8.x will implement servlet 3.0 but it’s still in experimental stage at the time of writing this post ). if you look at the weld documentation, there is actually support for jetty 6.x and 7.x . the code is the same one as tomcat (because it’s a servlet 2.5 container), but the configuration changes. with jetty you need to add two files under web-inf : jetty-env.xml and jetty-web.xml : beanmanager javax.enterprise.inject.spi.beanmanager org.jboss.weld.resources.managerobjectfactory true package all the files (mainservlet25, hello, world, web-inf/jetty-env.xml, web-inf/jetty-web.xml, beans.xml and web.xml) into a war and deploy it into jetty 6.x. go to http://localhost:8080/bootstrapping-servlet25-jetty6/mainservlet and you will see your hello world page. there was a mistake in the weld documentation so i couldn’t make it work. i started a thread on the weld forum and thanks to dan allen , pete muir and all the weld team, this was fixed and i managed to make it work. simple as posting an email to the forum . thanks for your help guys. spring 3.x here is the tricky part. spring 3.x implements the jsr 330 : dependency injection for java , which means that @inject works out of the box. but i didn’t find a way to integrate cdi with spring 3.x . the weld documentation mentions that because of its extension points, “ integration with third-party frameworks such as spring (…) was envisaged by the designers of cdi “. i did find this blog that simulates cdi features by enabling spring ones. what i didn’t find is a clear statement or roadmap on springsource about supporting cdi or not in future releases. the last trace of this topic is a comment on a long tss flaming thread . at that time (16 december 2009), juergen huller said “ with respect to implementing cdi on top of spring (…) trying to hammer it into the semantic frame of another framework such as cdi would be an exercise that is certainly achievable (…) but ultimately pointless “. but if you have any fresh news about it, let me know. conclusion as i said, this post is not about explaining cdi, i’ll do that in future posts. i just wanted to focus on how to bootstrap it in several environments so you can try by yourself. as you saw, it’s much simpler to use cdi within an ejb 3.1 or servlet 3.0 container in java ee 6. i’ve used glassfish 3.x but it should also work with other java ee 6 or web profile containers such as jboss 6 or resin . when you don’t use java ee 6, there is a bit more work to do. depending on your environment or servlet container you need some configuration to bootstrap weld. by the way, i’ve used weld because it’s the reference implementation, the one bunddled with glassfish and jboss. but you could also use openwebbeans , another cdi implementation. download the code , give it a try, and give me some feedback. from http://agoncal.wordpress.com/2011/01/12/bootstrapping-cdi-in-several-environments/
April 28, 2011
by Antonio Goncalves
· 31,394 Views
article thumbnail
A story about User Stories; Where do you start and what about the planning?
In this multi-part post, I’m going to share my personal experiences while working with user stories for gathering, tracking and planning requirements. It currently consists out of three parts: What are they and why do you need them? Who writes them and how do you control scope? Where do you start and what about the planning? You can also download all parts as one comprehensive PDF for easy printing or e-reading. Where do you start? Suppose that after intensive discussions and tough scoping sessions you ended up with a list of user stories and are about to start building the system. The first story not only needs to realize some particular feature, but also involves building a skeleton implementation of the system’s architecture. How do you avoid spending way too much time on plumbing and other general purpose stuff you need for the rest of the stories? The article Managing the Bootstrap Story by Jennitta Andrea addressed this challenge in more detail and offers some alternative solutions. One of these solutions is to find and define a user story with the product owner that offers minimal functionality yet still has project value. Such a story is often referred to as the backbone story because you realize the backbone of your system in it. It’s quite common to use the backbone story to realize a proof-of-concept (PoC) that verifies the chosen architecture. Since a working PoC can give the product owner confidence that the team is able to build such a product, that fact alone may be enough project value for the product owner. More storyotypes? You might have suspected it already, but that backbone story is just an example of another storyotype. In fact, after I started looking for an approach to capture the non-functional requirements of a project or system, I ran into a slide deck that mentioned a whole set of additional storyotypes. Dan Rawsthorne, the author, tried to define a storyotype for virtually every possible thing you might need to do in a project. Personally I think he went a bit too far, but a small set of additional storyotypes proved to be very useful anyway. Storyotype Description Compound Epic A composite user story that groups a number of stories in a logical sense. Complex Epic A user story whose content and impact must be determined later in the project, but for which it is clear that it involves a significant amount of work. Setup A story that is used to setup the project environment, including a source control environment, a project website, a build server. Technical A story that involves making a technical improvement or adjustment. Examples include introducing a coding standard, refactoring a poor design, executing a performance test. Documentation A story for writing a user manual, installation manual, etc. Training A story for developing and/or hosting a training, or having a workshop with end users. Quality Improvements A story which objective is to fix a collection of related bugs, or spent a fixed amount of time to improve the quality of the code base. Spike A story that aims to do a technical investigation to determine the usability of a specific technology, or for trying an alternative technical solution. When is the story complete? So how do you know that a user story has been successfully realized? Well, if all is good, all stories will conform with INVEST and are associated with a number of acceptance criteria (typically written down as the how-to demo) specified by the product owner. That should be enough to determine if it is functionally sound. But what you still miss is a way of explaining the stakeholders, including the product owner, when the team treats the story as finished. That may differ by team, but usually includes some or more of the following criteria. The code compiles and there are no warnings or errors. The code meets the coding standards setup by the project or the organization. The code is reviewed by a peer developer. All automated unit and integration tests have completed successfully. Visual Studio’s static code analysis tool does not report any violations. ReSharper reports no potential errors (a.k.a. everything is 'green'). The daily integration build has completed successfully. The functionality was tested by another member of the team (anybody but the developer). The feature or functionality has been signed off using the project checklist. The system functionality is tested by a tester. The visual look and feel is has been approved by an employee of the communications department. Together with the story’s how-to demo these criteria are commonly referred to as the definition-of-done. Usually, a team or project will have a default definition-of-done that applies to all stories and only mentions the particulars of that story if necessary. Then what about the planning? User stories are an excellent unit for tracking progress within your project. However, purists within the Agile community will tell you that an Agile project will have no long term plan. Instead, the functionality is realized iteratively according to the priority defined by the product owner. I agree with the latter and believe that its iterative nature is essential for dealing with the changing requirements that are common in all projects. It allows deferring decisions to the last responsible moment, and that’s always a good thing. But in reality you often can’t escape from providing at least a rough schedule to your management. How should you deal with that? What I often do to get all stakeholders to join me in a number of workshops. Using use case diagrams to illustrate the context of the discussions, I try to get enough stories on paper to represent the entire scope of the project. You need to beware though that you don’t write down too much details or have too much in-depth discussions. That would give the stakeholders a false sense of precision, and consequently, will cause them to see the stories as a formal functional design. Also, if you run into some high-level chunk of functionality for which nobody really knows what it will look like, add an epic story for it and include a spike to elaborate on the epic later on in the project. Then organize a number of shorter meetings with the team or, if the team hasn’t been formed yet, with a few experienced developers. Let them discuss every story one by one and then try to estimate the size of each story in so called story points. Some people from the Agile community say you should estimate using relative sizes only. In other words, a story that seems to require twice as much work as another story should also have twice as many story points. The story point as a unit does not have value. It’s the relative differences that are important. What works for me is that every story point corresponds to the ideal day of an experienced senior software developer. In other words, one story point means that an experienced developer familiar with the chosen architecture, technology and project methodology needs to work for 8 hours without being disturbed by telephone, email, coffee breaks, or any other distractions. Mike Cohn, author of User Stories Applied, has dedicated many chapters to this estimation technique. Ideally, each story is between 1 and 8 story points, but at the beginning of the project you still may have some epics to break up. After finishing those meetings you should have an estimate of the total size of the project. Now, in order to get from those story points to a total number of hours you need to estimate the expected productivity of the team. Mike Cohn does this by creating a table with the expected roles, their availability (to deal with part time employees), and the expected productivity compared to the ideal senior developer (as a percentage). By calculating the average productivity and multiplying it with the number of story points you’ll end up with the total number of estimated man-hours. It’s only an estimate and both the productivity can be disappointing as well as the estimate in story points may appear to be wrong. But it still gives you an initial estimate that can be used for global planning and budget discussions. Obviously it is important to ensure that you keep on continuously measuring the actual productivity. Wow, now what? By now, it should be clear that a user story is not an independent concept but something that closely resonates with many of the aspects of our work in the software industry. In this multi-part post I have tried to explain a number of those aspects and to clarify the relationship between them. But even though I’ve not touched everything as detailed as possible, I still hope I've managed to convince you about the power and potential of user stories. Last but not least, if you have any questions or comments, please do not hesitate to email me at [email protected] or tweet me at my Twitter ID ddoomen.
March 17, 2011
by Dennis Doomen
· 7,407 Views
article thumbnail
Clustering Tomcat Servers with High Availability and Disaster Fallback
There has been a lot of buzz lately on high-availability and clustering. Most developers don't care and why should they? These features should be transparent to the application architecture and not something of concern to the developers of that application. But knowledge never hurts, so I emerged myself into the world of load balancing, heartbeats and virtual IP addresses. And you know what? Next time we need a infrastructure like this, I can at least sit down with the guys from the infrastructure department and at least know what the hell they are talking about. So what exactly is a high-availability clustered infrastructure (HACI, as I'll call it from now on) ? In essence, it should be a zero-downtime infrastructure (or at least perceived as one by the end user, which means never ever returning a default browser 404 page), capable of horizontal scaling when the need for it arises and without a single point of failure. It's the SLA writer's dream. A basic HACI setup looks like this: The users enters through a virtual IP address, assigned to one of the two load balancers. Only one of the load-balancers is active (the active master, LB1), the other one is there in the event LB1 fails ((LB2, a passive slave). The two load balancers are redundant, ie. having the exact same configuration. The load balancers redirect all traffic to the real servers. This can be done through round-robin assignment or through other means like sticky sessions, where the same user is redirected to the same server each and every time within a session. Servers can be added at any moment and configured on the load balancers. Ideally, the load balancer configuration is aware of the hardware specification and balances the load accordingly, but that's beyond the scope of this article (it involves adding weights). If all servers balanced by the load balancer fail, a backup server should be used to redirect all traffic coming from the load balancer. This can be a very lightweight server, which purpose is only to provide a sensible error page to the user (something like 'Sorry, we are performing maintenance'). Again, perception and immediate feedback to the user is key. You don't want to show the user a plain 404 page. Off course, if the backup server goes down too, you're in trouble (off course, by that time, warning bells should have gone off on every level in the hierarchy). So how to achieve this with as little effort as possible? If you want to try this out, I suggest you start by installing a virtual machine like VirtualBox or VMWare. This way you can try out the configuration yourself. In this example, I'll be load-balancing 3 Tomcat servers using sticky sessions using 2 load balancers in active-passive mode. I'm assuming all 3 Tomcat servers share the same hardware configuration, so they are all able to handle the same amount of traffic each. I'm also throwing in a backup server, in case all 3 Tomcat servers go down (serving a custom 503 page kindly informing the user of a catastrophic failure, instead of dropping the standard 404 bomb). You want to start off by assigning IP addresses to the servers. This will make your life a bit easier. We'll need 7 addresses: 3 for the tomcat server, 1 for the backup server, 2 for the loadbalancers and 1 virtual IP address to be shared between the load balancers (and which will be the entry point for your users). So our assignment will be: Virtual IP 10.0.5.99 www.haci.local LB1 10.0.5.100 lb1.haci.local #MASTER LB2 10.0.5.101 lb2.haci.local #SLAVE WEB1 10.0.5.102 web1.haci.local WEB2 10.0.5.103 web2.haci.local WEB3 10.0.5.104 web3.haci.local BACKUP 10.0.5.105 backup.haci.local Setting up the web servers is easy. You just install Tomcat on each server and create a simple JSP file to be served to users (make a small change, like the background color, on each server to distinguish the servers). I won't be covering session replication between the Tomcat servers, as it'll take me too far. If you want, you can configure the appropriate session replication and storage (using multicast or JDBC for example). The backup server I'm using is a basic LAMP server that returns a simple 503 page on every request it gets. The 503 error code is important, because it reflects the current state of the system: currently unavailable. For the loadbalancers I'll be using 2 applications: HAProxy and keepalived. HAProxy is going to handle load balancing, while keepalived will handle the failover between the two load balancers. First, we're going to configure HAProxy for both LB1 and LB2. Installing HAProxy is quite easy on an ubuntu system. Just do a sudo apt-get install haproxy and you're off. After the install, backup the current HAProxy config and start editing away. cp /etc/haproxy.cfg /etc/haproxy.cfg_orig cat /dev/null > /etc/haproxy.cfg vi /etc/haproxy.cfg The content of the config to reflect our setup should become something like this (same config on LB1 and LB2): global log 127.0.0.1 local0 log 127.0.0.1 local1 notice #log loghost local0 info maxconn 4096 #debug #quiet user haproxy group haproxy defaults log global mode http option httplog option dontlognull retries 3 redispatch maxconn 2000 contimeout 5000 clitimeout 50000 srvtimeout 50000 frontend http-in bind 10.0.5.99:80 default_backend servers backend servers mode http stats enable stats auth someuser:somepassword balance roundrobin cookie JSESSIONID prefix option httpclose option forwardfor option httpchk HEAD /check.txt HTTP/1.0 server web1 10.0.5.102:80 cookie haci_web1 check server web2 10.0.5.103:80 cookie haci_web2 check server web3 10.0.5.104:80 cookie haci_web3 check server webbackup 10.0.5.105:80 backup After this, enable HAProxy on both LB1 and LB2 by editing /etc/defaults/haproxy # Set ENABLED to 1 if you want the init script to start haproxy. ENABLED=1 # Add extra flags here. #EXTRAOPTS="-de -m 16" So far for the HAProxy configuration. We can't start it up yet, as LB1 and LB2 aren't listening yet on the virtual IP address. Next we'll configure the failover of the loadbalancers using keepalived. Installing it on Ubuntu is as easy as it was for HAProxy: sudo apt-get install keepalived. But its configuration is slightly different on both load balancers. First, we need to configure the both servers to be able to listen to the shared IP address. Add the following line to /etc/sysctl.conf: net.ipv4.ip_nonlocal_bind=1 And run sysctl -p Now, we configure keepalived so that LB1 is configured as the main load balancer and binds to the shared IP address, while LB2 is on standby, ready to take over whenever LB1 goes down. The configuration for LB1 looks like this (edit /etc/keepalived/keepalived.conf): vrrp_script chk_haproxy { # Requires keepalived-1.1.13 script "killall -0 haproxy" # cheaper than pidof interval 2 # check every 2 seconds weight 2 # add 2 points of prio if OK } vrrp_instance VI_1 { interface eth0 state MASTER virtual_router_id 51 priority 101 # 101 on master, 100 on backup virtual_ipaddress { 10.0.5.99 } track_script { chk_haproxy } } Start up keepalived and check whether it is listening to the virtual IP address. /etc/init.d/keepalived start ip addr sh eth0 It should return something like this, indicating it is listening to the virtual IP address 2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000 link/ether 00:0c:29:a5:5b:93 brd ff:ff:ff:ff:ff:ff inet 10.0.5.100/24 brd 10.0.5.255 scope global eth0 inet 10.0.5.99/32 scope global eth0 inet6 fe80::20c:29ff:fea5:5b93/64 scope link valid_lft forever preferred_lft forever Next, we configure LB2. The configuration is almost the same, exception for the priority. vrrp_script chk_haproxy { # Requires keepalived-1.1.13 script "killall -0 haproxy" # cheaper than pidof interval 2 # check every 2 seconds weight 2 # add 2 points of prio if OK } vrrp_instance VI_1 { interface eth0 state MASTER virtual_router_id 51 priority 100 # 101 on master, 100 on backup virtual_ipaddress { 10.0.5.99 } track_script { chk_haproxy } } Start up keepalived and check the network interface. /etc/init.d/keepalived start ip addr sh eth0 It should return something like this, indicating it is not listening to the virtual IP address 2: eth0: mtu 1500 qdisc pfifo_fast qlen 1000 link/ether 00:0c:29:a5:5b:93 brd ff:ff:ff:ff:ff:ff inet 10.0.5.101/24 brd 10.0.5.255 scope global eth0 inet6 fe80::20c:29ff:fea5:5b93/64 scope link valid_lft forever preferred_lft forever Now, start up HAProxy on both LB1 and LB2. /etc/init.d/haproxy start Now you can issue requests to 10.0.5.99 (or www.haci.local), which will go to LB1, which in turn will load-balance the request to either WEB1, WEB2 and WEB3. You can test the load balancing by turning off WEB1 (or the server you're currently on). You can also the backup server by turning all main webservers (WEB1, WEB2 and WEB3). And you can test the loadbalancer failover by turning off LB1. At that point LB2 will kick in and act as the master, loadbalancing all requests. When you turn LB1 back on, it'll take over the master role once again. HAProxy allows you to add extra servers very easily, reloading the configuration without breaking existing sessions. See the HAProxy documentation for more info or on ServerFault. (http://serverfault.com/questions/165883/is-there-a-way-to-add-more-backend-server-to-haproxy-without-restarting-haproxy). Cheap and effective. While most enterprise shops have hardware load balancers, which also have these possibilities and more, if you're on a tight budget or need to simulate a HACI environment for development purposes (a lesson here: always simulate your production environment when you're testing during development), this might be the sane option. To finish, I'll quickly explain how to set up the backup server (a simple LAMP server). Create a vhost configuration on the apache for www.haci.local or any other domain pointing to the virtual IP address and set up mod_rewrite for it: RewriteEngine On RewriteCond %{REQUEST_URI} !\.(css|gif|ico|jpg|js|png|swf|txt)$ [NC] RewriteConf %{REQUEST_URI} !/503.php RewriteRule .* /503.php [L] Then create the 503.php file and add this to the top of it: Sorry, our servers are currently undergoing maintenance. Please check back with us in a while. Thank you for your patience. You can decorate the 503.php file any way you like. You can even use CSS, JavaScript and image files in the php file. Now, back to my IDE. I'm getting withdrawal symptoms.
March 11, 2011
by Lieven Doclo
· 57,985 Views
article thumbnail
How to Create a Statically Linked Version of git Binaries
Creating a statically linked version of a unix software saves you in all the circumstances where you can’t install the software as root, when you don’t have development tools on the target machine, when you cannot find a prepackaged version for the specific server, when the libraries available on a server are conflicting with the ones required by your software, etc. Examples: The unix server that hosts this blog, provides me a restricted shell to manage mysql and public web files with minimal access privileges. On this server I periodically update WordPress and other things, and I like to keep the changes under control and backed up with git. The problem is that on this server I haven’t git binaries, and I cannot install software and libraries required. Having the files of my WordPress installation under git version control saved me a lot of time in the past when somebody hacked into the server and changed the php files to do nasty stuff. I was able to easily check what was exactly changed since last legitimate update with “git status” command, and restore things as before. I own a little NAS (network attached storage) which runs a mini distribution of Linux which doesn’t have a prepackaged version of git to install, nor I cannot use any package management tool like apt rpm or yum etc. So I thought that I can compile by myself the git binaries and have them deployed on the target machine. It is possible, but there is an important note: when you build a software on a unix server, the resulting binaries are referencing the libraries installed there, so you cannot easily port the binaries between servers since they have dependencies. What you need is a “statically linked” version of the software, which fortunately it’s not so difficult to achieve. Binaries which are statically linked are usually bigger in size and will possibly require more memory to execute, but they won’t require the specific libraries to be present on the executing computer, since the libraries code is contained in the binaries themselves. Here is how I built a static version of git on an ubuntu virtual machine, that can be ported to other unix servers: # let's make sure we have all we need to proceed $ sudo apt-get install libexpat1-dev asciidoc libz-dev gettext curl # let's create the directory to host the built artifacts $ sudo mkdir /opt/git-1.7.4.1-static # we are ready to download and unpack latest version of git sources $ curl http://kernel.org/pub/software/scm/git/git-1.7.4.1.tar.bz2 | tar xvj $ cd git-1.7.4.1/ # then compile and install the files in the target directory we created $ ./configure --prefix=/opt/git-1.7.4.1-static CFLAGS="${CFLAGS} -static" NO_OPENSSL=1 NO_CURL=1 $ sudo make install $ sudo make install-doc On the above commands, the thing to notice is the CFLAGS=”${CFLAGS} -static” which is used to specify that the libraries must be statically linked with the binaries. The last thing to do is create a tarball of /opt/git-1.7.4.1-static folder and copy that on the target machine; adding the /opt/git-1.7.4.1-static/bin directory to the PATH variable and the /opt/git-1.7.4.1-static/share/man to the MANPATH variable. If possible, it’s a good idea to keep the same installation path on target machine (/opt/git-1.7.4.1-static) since this path gets hardcoded in some git scripts during the build process. But it shouldn’t give too many problems, anyway. From http://en.newinstance.it/2011/02/27/how-to-create-a-statically-linked-version-of-git-binaries/
February 28, 2011
by Luigi Viggiano
· 12,101 Views
article thumbnail
HOWTO: Partially Clone an SVN Repo to Git, and Work With Branches
I've blogged a few times now about Git (which I pronounce with a hard 'g' a la "get", as it's supposed to be named for Linus Torvalds, a self-described git, but which I've also heard called pronounced with a soft 'g' like "jet"). Either way, I'm finding it way more efficient and less painful than either CVS or SVN combined. So, to continue this series ([1], [2], [3]), here is how (and why) to pull an SVN repo down as a Git repo, but with the omission of old (irrelevant) revisions and branches. Using SVN for SVN repos In days of yore when working with the JBoss Tools and JBoss Developer Studio SVN repos, I would keep a copy of everything in trunk on disk, plus the current active branch (most recent milestone or stable branch maintenance). With all the SVN metadata, this would eat up substantial amounts of disk space but still require network access to pull any old history of files. The two repos were about 2G of space on disk, for each branch. Sure, there's tooling to be able to diff and merge between branches w/o having both branches physically checked out, but nothing beats the ability to place two folders side by side OFFLINE for deep comparisons. So, at times, I would burn as much as 6-8G of disk simply to have a few branches of source for comparison and merging. With my painfullly slow IDE drive, this would grind my machine to a halt, especially when doing any SVN operation or counting files / disk usage. Using Git for SVN repos naively Recently, I started using git-svn to pull the whole JBDS repo into a local Git repo, but it was slow to create and still unwieldy. And the JBoss Tools repo was too large to even create as a Git repo - the operation would run out of memory while processing old revisions of code to play forward. At this point, I was stuck having individual Git repos for each JBoss Tools component (major source folder) in SVN: archives, as, birt, bpel, build, etc. It worked, but replicating it when I needed to create a matching repo-collection for a branch was painful and time-consuming. As well, all the old revision information was eating even more disk than before: jbosstools' trunk as multiple git-svn clones: 6.1G devstudio's trunk as single git-svn clone: 1.3G So, now, instead of a couple Gb per branch, I was at nearly 4x as much disk usage. But at least I could work offline and not deal w/ network-intense activity just to check history or commit a change. Still, far from ideal. Cloning SVN with standard layout & partial history This past week, I discovered two ways to make the git-svn experience at least an order of magnitude better: Standard layout (-s) - this allows your generated Git repo to contain the usual trunk, branches/* and tags/* layout that's present in the source SVN repo. This is a win because it means your repo will contain the branch information so you can easily switch between branches within the same repo on disk. No more remote network access needed! Revision filter (-r) - this allows your generated Git repo to start from a known revision number instead of starting at its birth. Now instead of taking hours to generate, you can get a repo in minutes by excluding irrelevant (ancient) revisions. So, why is this cool? Because now, instead of having 2G of source+metadata to copy when I want to do a local comparison between branches, the size on disk is merely: jbosstools' trunk as single git-svn clone w/ trunk and single branch: 1.3G devstudio's trunk as single git-svn clone w/ trunk and single branch: 0.13G So, not only is the footprint smaller, but the performance is better and I need never do a full clone (or svn checkout) again - instead, I can just copy the existing Git repo, and rebase it to a different branch. Instead of hours, this operation takes seconds (or minutes) and happens without the need for a network connection. Okay, enough blather. Show me the code! Check out the repo, including only the trunk & most recent branch # Figure out the revision number based on when a branch was created, then # from r28571, returns -r28571:HEAD rev=$(svn log --stop-on-copy \ http://svn.jboss.org/repos/jbosstools/branches/jbosstools-3.2.x \ | egrep "r[0-9]+" | tail -1 | sed -e "s#\(r[0-9]\+\).\+#-\1:HEAD#") # now, fetch repo starting from the branch's initial commit git svn clone -s $rev http://svn.jboss.org/repos/jbosstools jbosstools_GIT Now you have a repo which contains trunk & a single branch git branch -a # list local (Git) and remote (SVN) branches * master remotes/jbosstools-3.2.x remotes/trunk Switch to the branch git checkout -b local/jbosstools-3.2.x jbosstools-3.2.x # connect a new local branch to remote one Checking out files: 100% (609/609), done. Switched to a new branch 'local/jbosstools-3.2.x' git svn info # verify now working in branch URL: http://svn.jboss.org/repos/jbosstools/branches/jbosstools-3.2.x Repository Root: http://svn.jboss.org/repos/jbosstools Switch back to trunk git checkout -b local/trunk trunk # connect a new local branch to remote trunk Switched to a new branch 'local/trunk' git svn info # verify now working in branch URL: http://svn.jboss.org/repos/jbosstools/trunk Repository Root: http://svn.jboss.org/repos/jbosstools Rewind your changes, pull updates from SVN repo, apply your changes; won't work if you have local uncommitted changes git svn rebase Fetch updates from SVN repo (ignoring local changes?) git svn fetch Create a new branch (remotely with SVN) svn copy \ http://svn.jboss.org/repos/jbosstools/branches/jbosstools-3.2.x \ http://svn.jboss.org/repos/jbosstools/branches/some-new-branch From http://divby0.blogspot.com/2011/01/howto-partially-clone-svn-repo-to-git.html
January 28, 2011
by Nick Boldt
· 35,544 Views
article thumbnail
Apache Solr: Get Started, Get Excited!
we've all seen them on various websites. crappy search utilities. they are a constant reminder that search is not something you should take lightly when building a website or application. search is not just google's game anymore. when a java library called lucene was introduced into the apache ecosystem, and then solr was built on top of that, open source developers began to wield some serious power when it came to customizing search features. in this article you'll be introduced to apache solr and a wealth of applications that have been built with it. the content is divided as follows: introduction setup solr applications summary 1. introduction apache solr is an open source search server. it is based on the full text search engine called apache lucene . so basically solr is an http wrapper around an inverted index provided by lucene. an inverted index could be seen as a list of words where each word-entry links to the documents it is contained in. that way getting all documents for the search query "dzone" is a simple 'get' operation. one advantage of solr in enterprise projects is that you don't need any java code, although java itself has to be installed. if you are unsure when to use solr and when lucene, these answers could help. if you need to build your solr index from websites, you should take a look into the open source crawler called apache nutch before creating your own solution. to be convinced that solr is actually used in a lot of enterprise projects, take a look at this amazing list of public projects powered by solr . if you encounter problems then the mailing list or stackoverflow will help you. to make the introduction complete i would like to mention my personal link list and the resources page which lists books, articles and more interesting material. 2. setup solr 2.1. installation as the very first step, you should follow the official tutorial which covers the basic aspects of any search use case: indexing - get the data of any form into solr. examples: json, xml, csv and sql-database. this step creates the inverted index - i.e. it links every term to its documents. querying - ask solr to return the most relevant documents for the users' query to follow the official tutorial you'll have to download java and the latest version of solr here . more information about installation is available at the official description . next you'll have to decide which web server you choose for solr. in the official tutorial, jetty is used, but you can also use tomcat. when you choose tomcat be sure you are setting the utf-8 encoding in the server.xml . i would also research the different versions of solr, which can be quite confusing for beginners: the current stable version is 1.4.1. use this if you need a stable search and don't need one of the latest features. the next stable version of solr will be 3.x the versions 1.5 and 2.x will be skipped in order to reach the same versioning as lucene. version 4.x is the latest development branch. solr 4.x handles advanced features like language detection via tika, spatial search , results grouping (group by field / collapsing), a new "user-facing" query parser ( edismax handler ), near real time indexing, huge fuzzy search performance improvements, sql join-a like feature and more. 2.2. indexing if you've followed the official tutorial you have pushed some xml files into the solr index. this process is called indexing or feeding. there are a lot more possibilities to get data into solr: using the data import handler (dih) is a really powerful language neutral option. it allows you to read from a sql database, from csv, xml files, rss feeds, emails, etc. without any java knowledge. dih handles full-imports and delta-imports. this is necessary when only a small amount of documents were added, updated or deleted. the http interface is used from the post tool, which you have already used in the official tutorial to index xml files. client libraries in different languages also exist. (e.g. for java (solrj) or python ). before indexing you'll have to decide which data fields should be searchable and how the fields should get indexed. for example, when you have a field with html in it, then you can strip irrelevant characters , tokenize the text into 'searchable terms', lower case the terms and finally stem the terms . in contrast, if you would have a field with text in it that should not be interpreted (e.g. urls) you shouldn't tokenize it and use the default field type string. please refer to the official documentation about field and field type definitions in the schema.xml file. when designing an index keep in mind the advice from mauricio : "the document is what you will search for. " for example, if you have tweets and you want to search for similar users, you'll need to setup a user index - created from the tweets. then every document is a user. if you want to search for tweets, then setup a tweet index; then every document is a tweet. of course, you can setup both indices with the multi index options of solr. please also note that there is a project called solr cell which lets you extract the relevant information out of several different document types with the help of tika. 2.3. querying for debugging it is very convenient to use the http interface with a browser to query solr and get back xml. use firefox and the xml will be displayed nicely: you can also use the velocity contribution , a cross-browser tool, which will be covered in more detail in the section about 'search application prototyping' . to query the index you can use the dismax handler or standard query handler . you can filter and sort the results: q=superman&fq=type:book&sort=price asc you can also do a lot more ; one other concept is boosting. in solr you can boost while indexing and while querying. to prefer the terms in the title write: q=title:superman^2 subject:superman when using the dismax request handler write: q=superman&qf=title^2 subject check out all the various query options like fuzzy search , spellcheck query input , facets , collapsing and suffix query support . 3. applications now i will list some interesting use cases for solr - in no particular order. to see how powerful and flexible this open source search server is. 3.1. drupal integration the drupal integration can be seen as generic use case to integrate solr into php projects. for the php integration you have the choice to either use the http interface for querying and retrieving xml or json. or to use the php solr client library . here is a screenshot of a typical faceted search in drupal : for more information about faceted search look into the wiki of solr . more php projects which integrates solr: open source typo3- solr module magento enterprise - solr module . the open source integration is out dated. oxid - solr module . no open source integration available. 3.2. hathi trust the hathi trust project is a nice example that proves solr's ability to search big digital libraries. to quote directly from the article : "... the index for our one million book index is over 200 gigabytes ... so we expect to end up with a two terabyte index for 10 million books" other examples for libraries: vufind - aims to replace opac internet archive national library of australia 3.3. auto suggestions mainly, there are two approaches to implement auto-suggestions (also called auto-completion) with solr: via facets or via ngramfilterfactory . to push it to the extreme you can use a lucene index entirely in ram. this approach is used in a large music shop in germany. live examples for auto suggestions: kaufda.de 3.4. spatial search applications when mentioning spatial search, people have geographical based applications in mind. with solr, this ordinary use case is attainable . some examples for this are : city search - city guides yellow pages kaufda.de spatial search can be useful in many different ways : for bioinformatics, fingerprints search, facial search, etc. (getting the fingerprint of a document is important for duplicate detection). the simplest approach is implemented in jetwick to reduce duplicate tweets, but this yields a performance of o(n) where n is the number of queried terms. this is okay for 10 or less terms, but it can get even better at o(1)! the idea is to use a special hash set to get all similar documents. this technique is called local sensitive hashing . read this nice paper about 'near similarity search and plagiarism analysis' for more information. 3.5. duckduckgo duckduckgo is made with open source and its "zero click" information is done with the help of solr using the dismax query handler: the index for that feature contains 18m documents and has a size of ~12gb. for this case had to tune solr: " i have two requirements that differ a bit from most sites with respect to solr: i generally only show one result, with sometimes a couple below if you click on them. therefore, it was really important that the first result is what people expected. false positives are really bad in 0-click, so i needed a way to not show anything if a match wasn't too relevant. i got around these by a) tweaking dismax and schema and b) adding my own relevancy filter on top that would re-order and not show anything in various situations. " all the rest is done with tuned open source products. to quote gabriel again: "the main results are a hybrid of a lot of things, including external apis, e.g. bing, wolframalpha, yahoo, my own indexes and negative indexes (spam removal), etc. there are a bunch of different types of data i'm working with. " check out the other cool features such as privacy or bang searches . 3.6. clustering support with carrot2 carrot2 is one of the "contributed plugins" of solr. with carrot2 you can support clustering : " clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. " see some research papers regarding clustering here . here is one visual example when applying clustering on the search "pannous" - our company : 3.7. near real time search solr isn't real time yet, but you can tune solr to the point where it becomes near real time, which means that the time ('real time latency') that a document takes to be searchable after it gets indexed is less than 60 seconds even if you need to update frequently. to make this work, you can setup two indices. one write-only index "w" for the indexer and one read-only index "r" for your application. index r refers to the same data directory of w, which has to be defined in the solrconfig.xml of r via: /pathto/indexw/data/ to make sure your users and the r index see the indexed documents of w, you have to trigger an empty commit every 60 seconds: wget -q http://localhost:port/solr/update?stream.body=%3ccommit/%3e -o /dev/null everytime such a commit is triggered a new searcher without any cache entries is created. this can harm performance for visitors hitting the empty cache directly after this commit, but you can fill the cache with static searches with the help of the newsearcher entry in your solrconfig.xml. additionally, the autowarmcount property needs to be tuned, which fills the cache with a newsearcher from old entries. also, take a look at the article 'scaling lucene and solr' , where experts explain in detail what to do with large indices (=> 'sharding') and what to do for high query volume (=> 'replicating'). 3.8. loggly = full text search in logs feeding log files into solr and searching them at near real-time shows that solr can handle massive amounts of data and queries the data quickly. i've setup a simple project where i'm doing similar things , but loggly has done a lot more to make the same task real-time and distributed. you'll need to keep the write index as small as possible otherwise commit time will increase too great. loggly creates a new solr index every 5 minutes and includes this when searching using the distributed capabilities of solr ! they are merging the cores to keep the number of indices small, but this is not as simple as it sounds. watch this video to get some details about their work. 3.9. solandra = solr + cassandra solandra combines solr and the distributed database cassandra , which was created by facebook for its inbox search and then open sourced. at the moment solandra is not intended for production use. there are still some bugs and the distributed limitations of solr apply to solandra too. tthe developers are working very hard to make solandra better. jetwick can now run via solandra just by changing the solrconfig.xml. solandra also has the advantages of being real-time (no optimize, no commit!) and distributed without any major setup involved. the same is true for solr cloud. 3.10. category browsing via facets solr provides facets , which make it easy to show the user some useful filter options like those shown in the "drupal integration" example. like i described earlier , it is even possible to browse through a deep category tree. the main advantage here is that the categories depend on the query. this way the user can further filter the search results with this category tree provided by you. here is an example where this feature is implemented for one of the biggest second hand stores in germany. a click on 'schauspieler' shows its sub-items: other shops: game-change 3.11. jetwick - open twitter search you may have noticed that twitter is using lucene under the hood . twitter has a very extreme use case: over 1,000 tweets per second, over 12,000 queries per second, but the real-time latency is under 10 seconds! however, the relevancy at that volume is often not that good in my opinion. twitter search often contains a lot of duplicates and noise. reducing this was one reason i created jetwick in my spare time. i'm mentioning jetwick here because it makes extreme use of facets which provides all the filters to the user. facets are used for the rss-alike feature (saved searches), the various filters like language and retweet-count on the left, and to get trending terms and links on the right: to make jetwick more scalable i'll need to decide which of the following distribution options to choose: use solr cloud with zookeeper use solandra move from solr to elasticsearch which is also based on apache lucene other examples with a lot of facets: cnet reviews - product reviews. electronics reviews, computer reviews & more. shopper.com - compare prices and shop for computers, cell phones, digital cameras & more. zappos - shoes and clothing. manta.com - find companies. connect with customers. 3.12. plaxo - online address management plaxo.com , which is now owned by comcast, hosts web addresses for more than 40 million people and offers smart search through the addresses - with the help of solr. plaxo is trying to get the latest 'social' information of your contacts through blog posts, tweets, etc. plaxo also tries to reduce duplicates . 3.13. replace fast or google search several users report that they have migrated from a commercial search solution like fast or google search appliance (gsa) to solr (or lucene). the reasons for that migration are different: fast drops linux support and google can make integration problems. the main reason for me is that solr isn't a black box —you can tweak the source code, maintain old versions and fix your bugs more quickly! 3.14. search application prototyping with the help of the already integrated velocity plugin and the data import handler it is possible to create an application prototype for your search within a few hours. the next version of solr makes the use of velocity easier. the gui is available via http://localhost:port/solr/browse if you are a ruby on rails user, you can take a look into flare. to learn more about search application prototyping, check out this video introduction and take a look at these slides. 3.15. solr as a whitelist imagine you are the new google and you have a lot of different types of data to display e.g. 'news', 'video', 'music', 'maps', 'shopping' and much more. some of those types can only be retrieved from some legacy systems and you only want to show the most appropriated types based on your business logic . e.g. a query which contains 'new york' should result in the selection of results from 'maps', but 'new yorker' should prefer results from the 'shopping' type. with solr you can set up such a whitelist-index that will help to decide which type is more important for the search query. for example if you get more or more relevant results for the 'shopping' type then you should prefer results from this type. without the whitelist-index - i.e. having all data in separate indices or systems, would make it nearly impossible to compare the relevancy. the whitelist-index can be used as illustrated in the next steps. 1. query the whitelist-index, 2. decide which data types to display, 3. query the sub-systems and 4. display results from the selected types only. 3.16. future solr is also useful for scientific applications, such as a dna search systems. i believe solr can also be used for completely different alphabets so that you can query nucleotide sequences - instead of words - to get the matching genes and determine which organism the sequence occurs in, something similar to blast . another idea you could harness would be to build a very personalized search. every user can drag and drop their websites of choice and query them afterwards. for example, often i only need stackoverflow, some wikis and some mailing lists with the expected results, but normal web search engines (google, bing, etc.) give me results that are too cluttered. my final idea for a future solr-based app could be a lucene/solr implementation of desktop search. solr's facets would be especially handy to quickly filter different sources (files, folders, bookmarks, man pages, ...). it would be a great way to wade through those extra messy desktops. 4. summary the next time you think about a problem, think about solr! even if you don't know java and even if you know nothing about search: solr should be in your toolbox. solr doesn't only offer professional full text search, it could also add valuable features to your application. some of them i covered in this article, but i'm sure there are still some exciting possibilities waiting for you!
January 25, 2011
by Peter Karussell
· 147,278 Views
article thumbnail
Maven Profile Best Practices
Maven profiles, like chainsaws, are a valuable tool, with whose power you can easily get carried away, wielding them upon problems to which they are unsuited. Whilst you're unlikely to sever a leg misusing Maven profiles, I thought it worthwhile to share some suggestions about when and when not to use them. These three best practices are all born from real-world mishaps: The build must pass when no profile has been activated Never use Use profiles to manage build-time variables, not run-time variables and not (with rare exceptions) alternative versions of your artifact I'll expand upon these recommendations in a moment. First, though, let's have a brief round-up of what Maven profiles are and do. Maven Profiles 101 A Maven profile is a sub-set of POM declarations that you can activate or disactivate according to some condition. When activated, they override the definitions in the corresponding standard tags of the POM. One way to activate a profile is to simply launch Maven with a -P flag followed by the desired profile name(s), but they can also be activated automatically according to a range of contextual conditions: JDK version, OS name and version, presence or absence of a specific file or property. The standard example is when you want certain declarations to take effect automatically under Windows and others under Linux. Almost all the tags that can be placed directly in a POM can also be enclosed within a tag. The easiest place to read up further about the basics is the Build Profiles chapter of Sonatype's Maven book. It's freely available, readable, and explains the motivation behind profiles: making the build portable across different environments. The build must pass when no profile has been activated (Thanks to for this observation.) Why? Good practice is to minimise the effort required to make a successful build. This isn't hard to achieve with Maven, and there's no excuse for a simple mvn clean package not to work. A maintainer coming to the project will not immediately know that profile wibblewibble has to be activated for the build to succeed. Don't make her waste time finding it out. How to achieve it It can be achieved simply by providing sensible defaults in the main POM sections, which will be overridden if a profile is activated. Never use Why not? This flag activates the profile if no other profile is activated. Consequently, it will fail to activate the profile if any other profile is activated. This seems like a simple rule which would be hard to misunderstand, but in fact it's surprisingly easy to be fooled by its behaviour. When you run a multimodule build, the activeByDefault flag will fail to operate when any profile is activated, even if the profile is not defined in the module where the activeByDefault flag occurs. (So if you've got a default profile in your persistence module, and a skinny war profile in your web module... when you build the whole project, activating the skinny war profile because you don't want JARs duplicated between WAR and EAR, you'll find your persistence layer is missing something.) activeByDefault automates profile activation, which is a good thing; activates implicitly, which is less good; and has unexpected behaviour, which is thoroughly bad. By all means activate your profiles automatically, but do it explicitly and automatically, with a clearly defined rule. How to avoid it There's another, less documented way to achieve what aims to achieve. You can activate a profile in the absence of some property: !foo.bar This will activate the profile "nofoobar" whenever the property foo.bar is not defined. Define that same property in some other profile: nofoobar will automatically become active whenever the other is not. This is admittedly more verbose than , but it's more powerful and, most importantly, surprise-free. Use profiles to adapt to build-time context, not run-time context, and not (with rare exceptions) to produce alternative versions of your artifact Profiles, in a nutshell, allow you to have multiple builds with a single POM. You can use this ability in two ways: Adapt the build to variable circumstances (developer's machine or CI server; with or without integration tests) whilst still producing the same final artifact, or Produce variant artifacts. We can further divide the second option into: structural variants, where the executable code in the variants is different, and variants which vary only in the value taken by some variable (such as a database connection parameter). If you need to vary the value of some variable at run-time, profiles are typically not the best way to achieve this. Producing structural variants is a rarer requirement -- it can happen if you need to target multiple platforms, such as JDK 1.4 and JDK 1.5 -- but it, too, is not recommended by the Maven people, and profiles are not the best way of achieving it. The most common case where profiles seem like a good solution is when you need different database connection parameters for development, test and production environments. It is tempting to meet this requirement by combining profiles with Maven's resource filtering capability to set variables in the deliverable artifact's configuration files (e.g. Spring context). This is a bad idea. Why? It's indirect: the point at which a variable's value is determined is far upstream from the point at which it takes effect. It makes work for the software's maintainers, who will need to retrace the chain of events in reverse It's error prone: when there are multiple variants of the same artifact floating around, it's easy to generate or use the wrong one by accident. You can only generate one of the variants per build, since the profiles are mutually exclusive. Therefore you will not be able to use the Maven release plugin if you need release versions of each variant (which you typically will). It's against Maven convention, which is to produce a single artifact per project (plus secondary artifacts such as documentation). It slows down feedback: changing the variable's value requires a rebuild. If you configured at run-time you would only need to restart the application (and perhaps not even that). One should always aim for rapid feedback. Profiles are there to help you ensure your project will build in a variety of environments: a Windows developer's machine and a CI server, for instance. They weren't intended to help you build variant artifacts from the same project, nor to inject run-time configuration into your project. How to achieve it If you need to get variable runtime configuration into your project, there are alternatives: Use JNDI for your database connections. Your project only contains the resource name of the datasource, which never changes. You configure the appropriate database parameters in the JNDI resource on the server. Use system properties: Spring, for example, will pick these up when attempting to resolve variables in its configuration. Define a standard mechanism for reading values from a configuration file that resides outside the project. For example, you could specify the path to a properties file in a system property. Structural variants are harder to achieve, and I confess I have no first-hand experience with them. I recommend you read this explanation of how to do them and why they're a bad idea, and if you still want to do them, take the option of multiple JAR plugin or assembly plugin executions, rather than profiles. At least that way, you'll be able to use the release plugin to generate all your artifacts in one build, rather than a single one at a time. Further reading Profiles chapter from the Sonatype Maven book. Deploying to multiple environments (prod, test, dev): Stackoverflow.com discussion; see the first and top-rated answer. Short of creating a specific project for the run-time configuration, you could simply use run-time parameters such as system properties. Creating multiple artifacts from one project: How to Create Two JARs from One Project (…and why you shouldn’t) by Tim O'Brien of Sonatype (the Maven people) Blog post explaining the same technique Maven best practices (not specifically about profiles): http://mindthegab.com/2010/10/21/boost-your-maven-build-with-best-practices/ http://blog.tallan.com/2010/09/16/maven-best-practices/ This article is a completely reworked version of a post from my blog.
November 27, 2010
by Andrew Spencer
· 141,001 Views · 4 Likes
  • Previous
  • ...
  • 328
  • 329
  • 330
  • 331
  • 332
  • 333
  • 334
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×