DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

The Latest Databases Topics

article thumbnail
Introducing Logentries NEW Query Language: LEQL
[This article was written by Matt Kiernan] We are excited to announce that Logentries’ new SQL-like query language, LEQL, is now available for more advanced analytics and easy extraction of valuable insights from your log data. A SQL-Like Query Language If you’ve ever used SQL, LEQL should feel familiar. In fact, Logentries already supports a number of SQL-like search functions, including: SUM: Sums a set of values COUNT: Counts the number of times a value occurs GROUPBY: Groups values by a unique key UNIQUE: Enables the count of only unique values With the rollout of LEQL, we’ll be introducing four new query functions: MIN: Calculate the minimum value of a specified key MAX: Calculate the maximum value of a specified key SORT: Display results sorted either ascending or descending TIMESLICE: Specifies how to group by time (e.g. by specific number of minutes, hours or days) A Consistent Yet Expressive Syntax We believe a reliable query language depends on a consistently enforced syntax. For this reason, we’ll be enforcing how queries are structured. Here’s an example of how an old query would change with LEQL: Old pages>0 | GroupBY(dbName) | SUM(pages) New where(pages>0) groupby (dbName) calculate(SUM:pages) *In this example, pages & dbName are Key names in log events Notice how the search logic gets wrapped in a where() clause, used for refining your search to return only results that match your search criteria (i.e. where events include the text or Key “pages”.) groupBy() is an optional clause that enables you to organize your search results into groups by specifying a Key from a Key-Value Pair (i.e. key: value). Calculations made within your query get utilized in the calculate()clause. When building your query, you no longer need to separate sections with pipes “|”. Though we believe in the value of a consistent query syntax, we also believe in the importance of giving users an expressive language that is easy to use and delivers expected results. We’re taking the following steps to make LEQL easy to use: Outdated saved queries will automatically be converted into LEQL – no effort required where clauses will automatically be added to any new query you write LEQL terms will not be case sensitive An updated search bar will provide a query builder and validator An updated search bar & query builder An Updated Search Bar & Query Builder As we rollout LEQL, we’ll be introducing a new search bar, allowing users to switch between a simple & advanced modes based on their preference. Simple mode “Simple mode” provides an easy way to build queries by providing a list of the available functions. Type-Assist will show a list of keys to associate with each functions, or new keys can be typed manually. Advanced mode “Advanced mode” will allow users to type their queries manually. Type-Assist will autocomplete key names while the new search bar will automatically validate query syntax. July 1st Rollout The LEQL rollout will take place in phases, beginning July 1st and will continue over the next few weeks to update all plans. If you’d like early beta access to LEQL, or have any questions, feel free to reach us at [email protected].
June 26, 2015
by Trevor Parsons
· 3,467 Views
article thumbnail
MaxScale: A New Tool to Solve Your MySQL Scalability Problems
Written by Yves Trudeau Ever since MySQL replication has existed, people have dreamed of a good solution to automatically split read from write operations, sending the writes to the MySQL master and load balancing the reads over a set of MySQL slaves. While if at first it seems easy to solve, the reality is far more complex. First, the tool needs to make sure it parses and analyses correctly all the forms of SQL MySQL supports in order to sort writes from reads, something that is not as easy as it seems. Second, it needs to take into account if a session is in a transaction or not. While in a transaction, the default transaction isolation level in InnoDB, Repeatable-read, and the MVCC framework insure that you’ll get a consistent view for the duration of the transaction. That means all statements executed inside a transaction must run on the master but, when the transaction commits or rollbacks, the following select statements on the session can be again load balanced to the slaves, if the session is in autocommit mode of course. Then, what do you do with sessions that set variables? Do you restrict those sessions to the master or you replay them to the slave? If you replay the set variable commands, you need to associate the client connection to a set of MySQL backend connections, made of at least a master and a slave. What about temporary objects like with “create temporary table…”? How do you deal when a slave lags behind or what if worse, replication is broken? Those are just a few of the challenges you face when you want to build a tool to perform read/write splitting. Over the last few years, a few products have tried to tackle the read/write split challenge. The MySQL_proxy was the first attempt I am aware of at solving this problem but it ended up with many limitations. ScaleARC does a much better job and is very usable but it stills has some limitations. The latest contender is MaxScale from MariaDB and this post is a road story of my first implementation of MaxScale for a customer. Let me first introduce what is MaxScale exactly. MaxScale is an open source project, developed by MariaDB, that aims to be a modular proxy for MySQL. Most of the functionality in MaxScale is implemented as modules, which includes for example, modules for the MySQL protocol, client side and server side. Other families of available modules are routers, monitors and filters. Routers are used to determine where to send a query, Read/Write splitting is accomplished by the readwritesplit router. The readwritesplit router uses an embedded MySQL server to parse the queries… quite clever and hard to beat in term of query parsing. There are other routers available, the readconnrouter is basically a round-robin load balancer with optional weights, the schemarouter is a way to shard your data by schema and the binlog router is useful to manage a large number of slaves (have a look at Booking.com’s Jean-François Gagné’s talk at PLMCE15 to see how it can be used). Monitors are modules that maintain information about the backend MySQL servers. There are monitors for a replicating setup, for Galera and for NDB cluster. Finally, the filters are modules that can be inserted in the software stack to manipulate the queries and the resultsets. All those modules have well defined APIs and thus, writing a custom module is rather easy, even for a non-developer like me, basic C skills are needed though. All event handling in MaxScale uses epoll and it supports multiple threads. Over the last few months I worked with a customer having a challenging problem. On a PXC cluster, they have more than 30k queries/s and because of their write pattern and to avoid certification issues, they want to have the possibility to write to a single node and to load balance the reads. The application is not able to do the Read/Write splitting so, without a tool to do the splitting, only one node can be used for all the traffic. Of course, to make things easy, they use a lot of Java code that set tons of sessions variables. Furthermore, for ISO 27001 compliance, they want to be able to log all the queries for security analysis (and also for performance analysis, why not?). So, high query rate, Read/Write splitting and full query logging, like I said a challenging problem. We experimented with a few solutions. One was a hardware load balancer that failed miserably – the implementation was just too simple, using only regular expressions. Another solution we tried was ScaleArc but it needed many rules to whitelist the set session variables and to repeat them to multiple servers. ScaleArc could have done the job but all the rules increases the CPU load and the cost is per CPU. The queries could have been sent to rsyslog and aggregated for analysis. Finally, the HA implementation is rather minimalist and we had some issues with it. Then, we tried MaxScale. At the time, it was not GA and was (is still) young. Nevertheless, I wrote a query logging filter module to send all the queries to a Kafka cluster and we gave it a try. Kafka is extremely well suited to record a large flow of queries like that. In fact, at 30k qps, the 3 Kafka nodes are barely moving with cpu under 5% of one core. Although we encountered some issues, remember MaxScale is very young, it appeared to be the solution with the best potential and so we moved forward. The folks at MariaDB behind MaxScale have been very responsive to the problems we encountered and we finally got to a very usable point and the test in the pilot environment was successful. The solution is now been deployed in the staging environment and if all goes well, it will be in production soon. The following figure is simplified view of the internals of MaxScale as configured for the customer: The blocks in the figure are nearly all defined in the configuration file. We define a TCP listener using the MySQL protocol (client side) which is linked with a router, either the readwritesplit router or the readconn router. The first step when routing a query is to assign the backends. This is where the read/write splitting decision is made. Also, as part of the steps required to route a query, 2 filters are called, regexp (optional) and Genlog. The regexp filter may be used to hot patch a query and the Genlog filter is the logging filter I wrote for them. The Genlog filter will send a json string containing about what can be found in the MySQL general query log plus the execution time. Authentication attempts are also logged but the process is not illustrated in the figure. A key point to note, the authentication information is cached by MaxScale and is refreshed upon authentication failure, the refresh process is throttled to avoid overloading the backend servers. The servers are continuously monitored, the interval is adjustable, and the server status are used when the decision to assign a backend for a query is done. In term of HA, I wrote a simple Pacemaker resource agent for MaxScale that does a few fancy things like load balancing with IPTables (I’ll talk about that in future post). With Pacemaker, we have a full fledge HA solution with quorum and fencing on which we can rely. Performance wise, it is very good – a single core in a virtual environment was able to read/write split and log to Kafka about 10k queries per second. Although MaxScale supports multiple threads, we are still using a single thread per process, simply because it yields a slightly higher throughput and the custom Pacemaker agent deals with the use of a clone set of MaxScale instances. Remember we started early using MaxScale and the beta versions were not dealing gracefully with threads so we built around multiple single threaded instances. So, since a conclusion is needed, MaxScale has proven to be a very useful and flexible tool that allows to elaborate solutions to problems that were very hard to tackle before. In particular, if you need to perform read/write splitting, then, try MaxScale, it is best solution for that purpose I have found so far. Keep in touch, I’ll surely write other posts about MaxScale in the near future.
June 26, 2015
by Peter Zaitsev
· 1,337 Views
article thumbnail
OpenStack + Private Cloud = Ideal Habitat for Devops
The use of OpenStack in the private cloud is invaluable for DevOps. It provides engineers the ability to innovate quickly and deal with uncertainty. It also maximizes existing infrastructure and provides a programmable, software-defined IaC. Openstack in the private cloud = agile development OpenStack has emerged as the de facto standard for IaaS in the private cloud. It gives engineers a vital self-service capability to provision (and de-provision) environments, allowing them to act autonomously, in the moment. This helps to eliminate the downstream bottleneck caused by waiting for operations staff to find time to do the provisioning. As OpenStack is open source it is vendor agnostic, allowing you to take advantage of competitive pricing rather than suffering from vendor lock-in. A private cloud means lower cost for the same capacity in a public cloud, which is especially useful for enterprises with high data needs. For security reasons, OpenStack is still mainly used in the private cloud by developers and QA, i.e. in a non-production context. However, OpenStack gives an ability to optimize application performance and/or security by having more control compared to public cloud. The software is increasingly backed by the critical mass of leading IT infrastructure vendors such as IBM, CICSO and HP. Gartner assumes that “by 2019, OpenStack enterprise deployments will grow tenfold, up from just hundreds of production deployments today, due to increased maturity and growing ecosystem support.”1 Challenges to consider OpenStack implementation skills are still rare in the market, so experimentation and self-learning is necessary. Although this takes time, it is offset by the fact the software is free and represents a good opportunity to gain internal expertise. This is particularly valid if you class infrastructure as a core competence. The maturity and functionality of OpenStack projects vary widely - while it covers storage, network and compute, the main adoption currently happens around compute (Nova) and block storage (Cinder), with object storage and network (Neutron) lacking significantly behind. However, without leveraging virtualized network services as part of a private cloud, full-stack environment provisioning is not possible, so don’t forget to add necessary network services to your private cloud. Where to begin Integrating OpenStack clouds with existing infrastructure can be a challenge. It is hardly plug and play. At first, it is best to focus on relatively isolated DevOps environments, such as Gartner’s “mode two”2 applications rather than introducing open stack across the board straight away, (Bimodal IT “refers to having two modes of IT, each designed to develop and deliver information – and technology – intensive services in its own way. Mode 1 is traditional, emphasizing scalability, efficiency, safety and accuracy. Mode 2 is nonsequential, emphasizing agility and speed.”3) As with any open source software, new functions and upgrades are frequently released. This means keeping up with changes in functionality and filling gaps with customizations or third-party products. Upgrades are complex and typically require planned downtime. For these reasons, we recommend choosing a hardened distribution and sticking with it. Openstack is the most complete vendor agnostic solution for storage, network and compute services. The ability for developers to instantly spin up environments at any time is invaluable for a fully agile DevOps environment, and is well worth the effort it takes to acclimatize to Openstack. 1 http://www.prnewswire.com/news-releases/suse-openstack-cloud-5-to-simplify-private-cloud-management-300048721.html 2 http://www.gartner.com/it-glossary/bimodal 3 http://www.gartner.com/it-glossary/bimodal
June 26, 2015
by Ron Gidron
· 3,965 Views · 2 Likes
article thumbnail
Generating CSV-files on .NET
I have project where I need to output some reports as CSV-files. I found a good library called CsvHelper from NuGet and it works perfect for me. After some playing with it I was able to generate CSV-files that were shown correctly in Excel. Here is some sample code and also extensions that make it easier to work with DataTables. Simple report Here’s the simple fragment of code that illustrates how to use CsvHelper. using (var writer = new StreamWriter(Response.OutputStream)) using (var csvWriter = new CsvWriter(writer)) { csvWriter.Configuration.Delimiter = ";"; csvWriter.WriteField("Task No"); csvWriter.WriteField("Customer"); csvWriter.WriteField("Title"); csvWriter.WriteField("Manager"); csvWriter.NextRecord(); foreach (var project in data) { csvWriter.WriteField(project.Code); csvWriter.WriteField(project.CustomerName); csvWriter.WriteField(project.Name); csvWriter.WriteField(project.ProjectManagerName); csvWriter.NextRecord(); } } Of course, you can use other methods to output whole object or object list with one shot. I just needed here custom headers that doesn’t match property names 1:1. Generic helper for DataTable Some of my projects come from service layer as DataTable. I don’t want to add new models or Data Transfer Objects (DTO) with no good reason and DataTable is actually flexible enough if you need to add new fields to report and you want to do it fast. As DataTables are not supported by default (yet?), I wrote simple extension methods that work on DataTable views. When called on DataTable it selects default view automatically. The idea is – you can set filter on default data view and leave out the rows you don’t need. If you just want to show DataTable to screen as table then check out my posting Simple view to display contents of DataTable. public static class CsvHelperExtensions { public static void WriteDataTable(this CsvWriter csvWriter, DataTable table) { WriteDataView(csvWriter, table.DefaultView); } public static void WriteDataView(this CsvWriter csvWriter, DataView view) { foreach (DataColumn col in view.Table.Columns) { csvWriter.WriteField(col.ColumnName); } csvWriter.NextRecord(); foreach (DataRowView row in view) { foreach (DataColumn col in view.Table.Columns) { csvWriter.WriteField(row[col.ColumnName]); } csvWriter.NextRecord(); } } } And here is simple MVC controller action that gets data as DataTable and returns it as CSV-file. The result is CSV-file that opens correctly in Excel. [HttpPost] public void ExportIncomesReport() { var data = // Get DataTable here Response.ContentType = "text/csv"; Response.AddHeader("Content-disposition", "attachment;filename=IncomesReport.csv"); var preamble = Encoding.UTF8.GetPreamble(); Response.OutputStream.Write(preamble, 0, preamble.Length); using (var writer = new StreamWriter(Response.OutputStream)) using (var csvWriter = new CsvWriter(writer)) { csvWriter.Configuration.Delimiter = ";"; csvWriter.WriteDataTable(data); } } One thing to notice – with CsvHelper we have full control over a stream where we write data and this way we can write more performant code. Related Posts .Net Framework 4.0: string.IsNullOrWhiteSpace() method Exporting GridView Data to Excel Code Contracts: Hiding ContractException How to dump object properties My object to object mapper source released The post Generating CSV-files on .NET appeared first on Gunnar Peipman - Programming Blog.
June 26, 2015
by Gunnar Peipman
· 4,705 Views · 1 Like
article thumbnail
[On-Demand Webinar] JSON+SQL: Query Without Compromise
watch the on-demand webinar » this webinar introduces n1ql, couchbase’s query language for json. n1ql is the first query language to leverage the complete flexibility of json and the full power of sql. while json benefits from sql because it enables developers to model and query data with relationships, sql benefits from json because it removes the “impedance mismatch” between the data model and the application model. join gerald sangudi, couchbase’s chief architect of query, for an introduction to the n1ql language, architecture, and ecosystem. watch and learn how you can: create a data model that is not based on query limitations query the same data in different ways without duplicating it build applications with ad-hoc, intelligent, and precise access to data leverage the entire sql ecosystem for enterprise integration watch the on-demand webinar »
June 26, 2015
by Chris Smith
· 1,268 Views · 4 Likes
article thumbnail
George Kadifa Joins Perfecto Mobile’s Board of Directors
Former Executive Vice President of HP Software and Operating Partner at Silver Lake Partners Brings Deep Experience to Accelerate Continuous Quality and Digital Engagement Boston, MA – June 25, 2015: Perfecto Mobile, the world’s leader in mobile app quality and experience, today announced the appointment of George Kadifa to its Board of Directors. As a Board member, Kadifa will expand Perfecto Mobile’s vision towards enterprise digital engagement and accelerate the momentum with Agile and DevOps teams. Kadifa has extensive expertise in growing and managing technology businesses, having held leadership positions at HP, IBM, Silver Lake Partners, Corio, Oracle, and Booz-Allen & Hamilton. As Operating Partner at Silver Lake Partners, Kadifa was responsible for driving the growth of a 24-company enterprise portfolio from the firm’s large-cap investment fund. Most recently, Kadifa served as Executive Vice President of HP Software and Strategic Relationships, where he led HP’s multi-billion dollar software portfolio under the direction of HP’s CEO. “We are delighted to welcome George Kadifa to Perfecto Mobile’s Board of Directors,” said David Reichman, Chairman of the Board at Perfecto Mobile. “His extensive leadership experience at the top global technology companies, paired with his deep operational knowledge, will add a valuable dimension to the Board as he supports Perfecto Mobile’s vision into the next phase of digital engagement.” Kadifa is currently the Managing Director at Sumeru Equity Partners, Director at Velocity Technology Solutions and serves as a trustee for the University of Chicago Booth School of Business. "As someone with first-hand experience leading both a new breed of companies as well as some of the largest technology organisations in the world, I have come across many companies who set out to change an industry,” said Mr. Kadifa. “It is quite rare to find a company such as Perfecto Mobile, with superior technology, a vast market to penetrate, and a visionary executive team. In addition, it offers a highly disruptive business that is transforming legacy tools and waterfall methodologies to an open and continuous approach, matching the way DevOps, Agile and Mobile teams work. I am excited to work with CEO Eran Yaniv, the Perfecto Mobile executive team and the Board to support Perfecto Mobile’s explosive growth becoming the standard in the mobile and digital quality market.”
June 25, 2015
by Fran Cator
· 1,020 Views
article thumbnail
Interoute’s cloud platform chosen by European technology company, BQ, to deploy its Unified Communications
BQ deploys its call centre and telephony solution on Interoute Virtual Data Centre to improve international customer and employee communications Madrid, June 25th, 2015 - Interoute, owner operator of Europe's largest cloud services platform has announced that BQ, a leading European technology company, has chosen Interoute Virtual Data Centre (VDC), to host its new customer and employee unified communications solution. BQ has deployed a new telephony and call centre solution on Interoute VDC, leveraging the throughput, flexibility and scalability provided by this cloud platform. The BQ solution supports its 1,000 employees across different international offices, using Interoute VDC to provide the global reach they need. The solution is complemented with telephony services and worldwide DDIs from Interoute with great cost savings thanks to the economies of scale provided by Interoute's global infrastructure. Since it was founded in Spain, BQ has grown its business inside and outside the country thanks to its latest generation technology devices catalogue and highly competitive prices, as well as its full commitment to its users through a comprehensive support service. Mario Fernández, IT Manager at BQ, has said: "One of the main BQ objectives is to give the best user support. So, we chose Interoute to provide and guarantee the performance of our telephony service. The VoIP solution provided by Interoute meets all our needs: hosted private cloud, high availability and the ability to quickly scale and expand when needed." Interoute Virtual Data Centre is Interoute's scalable, fully automated Infrastructure as a Service (IaaS) solution. Interoute VDC provides on-demand computing, storage and applications integrated into the heart of its customers' IT infrastructure. This networked cloud replaces the need to buy, manage and maintain physical IT infrastructure and is built into Interoute's fibre connected physical Data Centres world-wide. It's simple to provision, scalable, compliant and cost effective. Diego Matas, General Manager at Interoute Iberia, has added: "We are proud that a Spanish company such as BQ, committed to education and pioneering innovation in exciting fields like robotics and 3D printing, has chosen our cloud platform for its networked communications. Interoute's networked cloud will enable BQ to continue to build upon its excellent reputation for high quality service. We look forward to working with this innovative company to support its future ICT needs."
June 25, 2015
by Fran Cator
· 784 Views
article thumbnail
InfinityQS launches ProFicient Now! program to help manufacturers leverage cloud technology
- Now available with limited-time pricing, package brings together InfinityQS’ cloud-based enterprise quality hub, ProFicient on Demand, with training and ongoing services to ensure a successful deployment - InfinityQS International, Inc., the global authority on real-time quality and Manufacturing Intelligence, announces the launch of ProFicient Now!, a program that blends InfinityQS’ cloud-based enterprise quality hub, ProFicient on Demand, with training and ongoing services to ensure a successful deployment. Available with limited-time pricing, ProFicient Now! aims to give manufacturers the knowledge, tools and continued guidance needed to realise the benefits of a cloud-based quality management program and quickly gain a competitive advantage. “In today’s fast-paced market, manufacturers are looking for ways to better align their quality systems with overall manufacturing excellence goals,” said Doug Fair, Chief Operating Officer, InfinityQS. “By combining the power of the cloud with ongoing expert guidance from our engineering team, ProFicient Now! helps both new and existing clients track towards their goals and achieve a competitive edge through their quality initiatives.” With ProFicient Now!, manufacturers receive expert engineering guidance that leads them through their deployment. Included in the ProFicient Now! package is: Training: Administrators obtain comprehensive skills for building and maintaining the system. Solution Design: The client works closely with InfinityQS to examine the current environment and establish goals for the deployment. Onsite Services: An InfinityQS engineer creates the initial system configurations. Quarterly Consultations: InfinityQS experts guide clients in data analysis and help uncover opportunities for improvement and cost reduction. Executive Review: The InfinityQS engineer leads a review with the client senior management team to review successes, quality enhancements and opportunities for improvement that were uncovered during the use of ProFicient Now! InfinityQS ProFicient is a proven enterprise quality hub powered by a robust, centralised Statistical Process Control (SPC) software engine. ProFicient enables global manufacturers to proactively monitor, analyse and report on Manufacturing Intelligence to improve quality, decrease costs and make smarter business decisions. With a cloud-based deployment option, ProFicient streamlines global data collection and analysis with a unified data archive. For more information about ProFicient Now, including its limited-time pricing, visit here: http://www.infinityqs.com/ProFicient-Now-EMEA
June 25, 2015
by Fran Cator
· 974 Views
article thumbnail
Simplified API Monitoring for DevOps Teams
[This article was written by Laura Strassman] AlertSite is now integrated with Ready! API. This means that developers, testers and operations teams can collaborate together on API quality using the same tests and metrics, simplifying configuration of monitoring assets and ultimately turning around performance problems in real time. There are several advantages to this approach: You should be monitoring your APIs in production. When your API moves into production from test, the environment changes – there is no way to know if theAPI performanceis compromised unless you look. Furthermore, you can find problems that may be a result of the location or variance in traffic. There is no easier method. You can simply click a button from right in the Ready! API interface and see the status of your APIs in production it can’t get any easier. You take your already created test cases and turn them into monitors whenever you have a new API or test you want to keep an eye on. Troubleshooting is like shooting fish in a barrel. You wrote the functional test, you know it works, and when something comes back as not working you can isolate it quickly. All of this makes it easy to be ahead of problems, solve them quickly when they happen and keep customers happy. You can be monitoring your APIs in less than 3 minutes: <br>
June 25, 2015
by Denis Goodwin
· 1,518 Views
article thumbnail
7 Things I Didn’t Expect to Hear at Gartner’s IT Ops Summit
Last week’s Gartner IT Operations Strategies & Solutions Summit in Orlando, Fla., was exactly what you’d expect—a place to talk about the IT operations issues impacting some of the largest companies in the world. Even so, there were a few interesting surprises. Among them: 1. Bi-modal is big. Not everyone will succeed. Gartner continued to tell its customers to employ two modes of IT—a traditional, slower moving capability for older, typically internal systems of record; and a high-speed, experimental one for new, typically customer-facing Web and mobile apps. “This is a time of experimentation and innovation,” said Gartner VP and distinguished analyst Chris Howard in his opening keynote. Organizations can’t ignore that there are multiple speeds and they should participate in all. Gartner managing VPRonni Colville added that by 2017, 75% of IT orgs will have this “bi-modal” IT capability. See also: Bi-Modal IT: Gartner Endorses Both Disruptive and Conservative Approaches to Technology However, “50% will make a mess of it,” Colville said. Why? Not necessarily because of technology failings, but more often because of a lack of people skills. 2. IT success is all about people. Donna Scott, also a Gartner VP and distinguished analyst, told her keynote audience that “you will be judged on agility, speed, and innovation.” However, the biggest problems Gartner sees for infrastructure and operations team engagement and innovation are lack of time, company culture that’s not conducive to these approaches, and a lack of business skills in IT. More than half of the people responding to an in-room poll said “people” are the part of IT ops that must change first. Not technology. Gartner research director George Spafford underscored similar issues in large organizations trying to use DevOps at scale: people and “human factors” are the biggest concerns from his in-room poll. All these probably contributed to hiring best-selling author Daniel Pink as a keynote speaker on the opening day of the conference. His focus? Not IT or architecture. Instead, he pounded home the importance of influencing people and selling internally. 3. Big orgs are trying DevOps. But the issues are different at scale. In numerous sessions I saw many hands go up when analysts asked, “Who here is trying DevOps?” Clearly, the approach is getting traction in large companies. But there’s lots of learning still to do. In fact, that was Spafford’s biggest bit of advice. “Always be learning,” he said, “trying to see what works and what breaks, especially at scale.” And, even once you’ve had some initial success, keep learning. “If you’ve done ‪DevOps, stay humble,” he advised. 4. Looking to innovative organizations for ideas … analytics on the rise. Many sessions addressed how large organizations are taking on ideas fostered by smaller, more risk-tolerant companies, and offered advice for doing so successfully. In addition to multiple discussions of DevOps, an entire session was devoted to establishing your own “Genius Bar®—a “walk-up IT support center” as explained in this CIO article. As at previous conferences, Gartner research VP Cameron Haight ran several sessions on lessons learned from firms running massive, Web-scale IT systems. “You need lots of data … and access to it inexpensively,” he said. Some commercial monitoring companies (New Relic included!) got a shout out for taking the lessons of Web scale IT to heart in their offerings. In addition, Haight said, “Analytics are increasingly important for application performance monitoring given the huge amount of data now available.” 5. Cloud: Enterprises want it, but aren’t very good at it yet. Gartner research director Dennis Smith talked through the enterprise’s interest in cloud computing. A huge majority of his in-room poll wanted some mix of both public and private cloud, while only 9% wanted to use only a private cloud environment and a measly 4% were looking to move entirely to the public cloud. The most popular choice (41%) was an 80/20 split between private and public cloud infrastructure. “Enterprises don’t make the dean’s list,” for cloud usage, Smith said, earning no more than a C average in his opinion. Large organizations are doing well at visibility, governance, and delivering standardized stacks, he said, but are less skilled at optimizing for these new environments. Still, Smith said the trends point toward enterprises improving on all fronts. 6. Cloud security can be better than yours. Importantly, Gartner VP and distinguished analyst Neil MacDonald gave the cloud a vote of confidence: noting that, for a variety of reasons, “Well-managed public cloud can be more secure than your own data center.” For example, on-premise software can pose serious security risks, he said, because of “deployment lag” where customers are stuck using software releases with unpatched security vulnerabilities. With a cloud-based Software-as-a-Service (SaaS), security updates can be more quickly rolled out to all customers. But cloud security can be different, requiring a shift to information-level security from OS-level security. Best practices include doing away with a huge pool of all-powerful sysadmins in favor of JEA, or “just enough administration,” where sysadmins have just enough privileges to do their job, and no more. An analogous security practice for compute resources is “least privilege,” where apps and microservices can’t talk to each other unless they specifically need to do so. Audience polling supported MacDonald’s optimistic view of cloud security, which suggests that large enterprises may struggle less with their cloud policies moving forward. 7. Containers: Try ’em! Ahead of this week’s DockerCon in San Francisco, Gartner devoted significant airtime to educating the audience on containers and microservices. My summary of ‪Gartner VP and distinguished analyst Tom Bittman’s advice on containers was simple: Try ’em. Now. Complement them with VMs. ‪And Docker (the company) is important, but not the be-all and end-all in this space. Bittman (copping to some deja vu from Gartner presentations he made on server virtualization 13 years ago) noted that while virtualization has been focused on admin and ops functions, containers are focused on value for developers. But because containers are well suited for driving up VM utilization for workloads that share the same OS, we can expect to see more combinations of containers and server virtualization. Finally, Bittman underscored that Gartner doesn’t see containers having much impact on premise, but making a huge difference in the cloud. That doesn’t necessarily fit with what’s been shown in other research, such as this 2015 State of Containers Survey sponsored by VMblog.com and StackEngine, so we’ll want to watch how this plays out. This is all a lot to digest. The Gartner IT Operations Strategies & Solutions Summitacknowledges the importance of dealing with existing IT systems and practices as well as promising new technologies and thinking, and tries to point a way forward. In fact, Haight had a very good quote about microservices that I thought also served to wrap up the entire event: “If you want to run with the big dogs, you need to rethink application architecture,” he said. That can be very difficult for an enterprise to fully implement … but also very appealing. Note: Al Sargent contributed to this post. All product and company names herein may be trademarks of their registered owners. Server, tortoise and hare, business team, and cloud security images courtesy ofShutterstock.com.
June 24, 2015
by Fredric Paul
· 1,820 Views
article thumbnail
Don't Leave Your Database Behind
Microsoft’s recent announcement that Windows 10 will be the last big-bang release for the operating system was a surprise to many. But what on Earth is going on? Why have they suddenly turned round and decided to upgrade their software with a series of frequent updates instead? The answer is a term not many chief executives will be familiar with, but an important one nevertheless: continuous delivery. Continuous delivery for software applications means that new features can be released faster and companies can be more competitive. Although it requires a change in processes and an investment in the tools that make it possible, the return on investment is high – and proven. The 2014 State of DevOps Report from Puppet Labs found that high-performing IT organisations that use practices such as continuous delivery are twice as likely to exceed their profitability, market share and productivity goals. WHAT ABOUT THE DATABASE? A stumbling block many organisations find in their pursuit of continuous delivery is the database. Databases hold critical information and their stability is vital to the bottom line. Often they’re tied into those same applications that can gain from continuous delivery, but they’re seen as too risky to include in the process. In fact, the DZone Guide to Continuous Delivery 2015 showed that while 61 per cent of companies have already implemented continuous delivery for their applications, it falls to half that number when it comes to the database.The result? Application development is faster, smoother and more cost effective, while database development lags behind. “Clients such as Yorkshire Water are seeing a return on investment of 700 per cent after investing in continuous delivery for their databases” At Redgate, we’ve been working on a way to resolve the issue with a Database Lifecycle Management (DLM) solution: one that brings all the advantages of continuous delivery to the database, while protecting the data at the same time. We have some pedigree here. Our tools are already used in more than 90 per cent of Fortune 100 companies, and are trusted in areas such as finance, healthcare and manufacturing, where performance and reliability are not optional. It’s working too. Clients such as Yorkshire Water are seeing a return on investment of 700 per cent after investing in continuous delivery for their databases. Similarly, StateServ, a medical equipment supplier with customers across the United States, has adopted Redgate’s SQL Source Control and SQL Compare to reduce the time it takes to release new versions of its website by 50 per cent. Property services provider Move With Us has also adopted these tools to reduce the cost of releasing database updates by 97 per cent. As Anthony Hall, IT operations manager, says: “We spend less time deploying and more time developing better software, which keeps us ahead of our competitors. It results in a product that more closely matches what the customer wants, at a cheaper price. That equals happy customers, a happier development team and bigger profits for the company.” I love comments like that. It demonstrates that all the effort we’ve put into developing our tools and the DLM process to go with it is paying off. Not for us – for our clients. By aligning the continuous delivery of the application with the continuous delivery of the database, a company will inevitably see increased productivity, reduced risk and higher quality end-products. This suddenly means that your technology becomes a strategic advantage, as opposed to an unpredictable risk, and can be used to deliver real financial benefits to the business. CASE STUDY KEEPING DATA FLOWING FOR YORKSHIRE WATER Yorkshire Water manage the collection, treatment and distribution of water in Yorkshire, supplying around 1.24 billion litres of drinking water daily. At the same time they collect, treat and dispose of about one billion litres of waste water safely back into the environment. As might be imagined, their applications and databases are diverse and technically challenging, and deploying changes to the databases used to be manual and time consuming. The company turned to Redgate’s database development tools to automate the changes, saving time as well as avoiding errors. Using SQL Source Control and SQL Compare, they achieved every aim. In 25 days, they moved their first project from a manual deployment process to full automation and it now takes half a day to start auto-deploying a new project. As software development team leader Carl Davison says: “It’s now a minor overhead to deploy. We predict the return on investment to be in the order of 700 per cent over the next five years. We can deploy database changes as soon as the business needs them, without delays or problems.”
June 24, 2015
by Moe Long
· 1,504 Views
article thumbnail
PostgreSQL Powers All New Apps for 77% of the Database's Users
Survey of open source PostgreSQL users found adoption continues to rise with 55% of users deploying it for mission-critical applications Bedford, MA – June 23, 2015 – EnterpriseDB (EDB), the leading provider of enterprise-class Postgres products and database compatibility solutions, today announced the results of its “PostgreSQL Adoption Survey 2015,” a biennial survey of open source PostgreSQL users. Conducted by EnterpriseDB, the survey found PostgreSQL adoption continuing to rise, with 55% of users – up from 40% two years ago – deploying it for mission-critical applications and 77% of users are dedicating all new application deployments to PostgreSQL. These findings give voice to end users and confirm such industry indicators as increasing job listings and monthly rankings on DB-Engines that have pointed to rising interest in and demand for PostgreSQL, also called Postgres. The growing popularity of Postgres also comes as traditional software vendors suffer setbacks in the marketplace. The enterprise-class performance, security and stability of Postgres, on par with traditional database vendors for most corporate workloads, meanwhile have helped position Postgres among the solutions from the world’s largest vendors. The opportunity to transform their data center economics has helped fuel downloads of Postgres as well. End users reported cutting costs with Postgres, with 41% reporting they had first-year cost savings of 50% or more. They’re using Postgres to build web 2.0 applications using unstructured data as evidenced by the 64% of respondents who said they were working with JSON/JSONB and the 47% who said they were using Postgres for collaboration applications. “Postgres is empowering organizations to transform the economics of IT. IT can invest in the customer engagement applications that differentiate their operations from their competition instead of continuing to pay the steep and rising licensing and support fees charged by traditional database vendors,” said Marc Linster, senior vice president of products and services of EnterpriseDB. “With the expanding adoption, EnterpriseDB has experienced dramatic growth year over year, providing the software, services and support that organizations need to be successful with Postgres.” Database Migrations, Replacements The findings also support statements in a recent Gartner report that reflect the widespread acceptance of open source databases. “By 2018, more than 70% of new in-house applications will be developed on an OSDBMS, and 50% of existing commercial RDBMS instances will have been converted or will be in process,” according to the April 2015 Gartner report, The State of Open-Source RDBMs, 2015.* Among Postgres users, the survey findings show migrations are already under way with 37% reporting they had migrated applications from Oracle or Microsoft SQL Server to Postgres. Many users were still planning further migrations, with 37% of PostgreSQL users saying they will gradually replace their legacy systems with Postgres, compared to 29% who said that in the 2013 survey. Further, end users predict their deployments of Postgres will expand significantly, with 32% saying they anticipate production deployments of Postgres to increase by at least 50% over the next year. The survey, conducted by EnterpriseDB using an online tool in May 2015, queried registered users of PostgreSQL and drew 274 respondents worldwide from government organizations and companies ranging in size and industry. *The State of Open-Source RDBMs, 2015, by Donald Feinberg and Merv Adrian, published on April 21, 2015. Connect with EnterpriseDB Read the blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb Become a fan on Facebook: http://www.facebook.com/EnterpriseDB?ref=ts Join us on Google+: https://plus.google.com/108046988421677398468 Connect on LinkedIn: http://www.linkedin.com/company/enterprisedb
June 23, 2015
by Fran Cator
· 960 Views
article thumbnail
Neo4j: The Foul Revenge Graph
Last week I was showing the foul graph to my colleague Alistair who came up with the idea of running a ‘foul revenge’ query to find out which players gained revenge for a foul with one of their own later in them match. Queries like this are very path centric and therefore work well in a graph. To recap, this is what the foul graph looks like: The first thing that we need to do is connect the fouls in a linked list based on time so that we can query their order more easily. We can do this with the following query: MATCH (foul:Foul)-[:COMMITTED_IN_MATCH]->(match) WITH foul,match ORDER BY match.id, foul.sortableTime WITH match, COLLECT(foul) AS fouls FOREACH(i in range(0, length(fouls) -2) | FOREACH(foul1 in [fouls[i]] | FOREACH (foul2 in [fouls[i+1]] | MERGE (foul1)-[:NEXT]->(foul2) ))); This query collects fouls grouped by match and then adds a ‘NEXT’ relationship between adjacent fouls. The graph now looks like this: Now let’s find the revenge foulers in the Bayern Munich vs Barcelona match. We’re looking for the following pattern: This translates to the following cypher query: match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1), (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2), (foul1)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul2) WHERE (foul1)-[:NEXT*]->(foul2) RETURN player2.name AS firstFouler, player1.name AS revengeFouler, foul1.time, foul1.location, foul2.time, foul2.location I’ve added in a few extra parts to the pattern to pull out the players involved and to find the revenge foulers in a specific match – the Bayern Munich vs Barcelona Semi Final 2nd leg. We end up with the following revenge fouls: We can see here that Dani Alves actually gains revenge on Bastian Schweinsteiger twice for a foul he made in the 10th minute. If we tweak the query to the following we can get a visual representation of the revenge fouls as well: match (foul1:Foul)-[:COMMITTED_AGAINST]->(app1)-[:COMMITTED_FOUL]->(foul2)-[:COMMITTED_AGAINST]->(app2)-[:COMMITTED_FOUL]->(foul1), (player1)-[:MADE_APPEARANCE]->(app1), (player2)-[:MADE_APPEARANCE]->(app2), (foul1)-[:COMMITTED_IN_MATCH]->(match:Match {id: "32683310"})<-[:COMMITTED_IN_MATCH]-(foul2), (foul1)-[:NEXT*]->(foul2) RETURN * At the moment I’ve restricted the revenge concept to single matches but I wonder whether it’d be more interesting to create a linked list of fouls which crosses matches between teams in the same season. The code for all of this is on github – the README is a bit sketchy at the moment but I’ll be fixing that up soon.
June 23, 2015
by Mark Needham
· 1,010 Views
article thumbnail
FusionExperience announces successful partnership with Cloud Consulting
London, UK – FusionExperience, the business and data solutions provider, today announces the success of its first salesforce.com partnership with Cloud Consulting Ltd. (CCL). CCL was working with an international airline client to migrate a legacy charter and group booking application from one Salesforce.com instance to a new one. Very early on in the project CCL discovered that there were considerable elements of unsupported custom code and that these had to be redesigned and redeveloped. The airline took the opportunity at this stage to request changes and improve the application in line with their new business processes. CCL worked with FusionExperience to migrate the application to the latest salesforce.com environment and re-architected the booking engine functionality and complex pricing algorithms using Apex and VisualForce. For business reasons the airline had a strict project deadline and despite all the unknowns involved the project timescales were maintained and FusionExperience delivered on time and to budget. The airline went live with the application on schedule without any post-production problems or warranty fixes required. They now have an up to date system that has achieved a game changing transformation in the way it does business. Robin James, Platform Evangelist for FusionExperience said; “The ability to seamlessly work with our partners on salesforce.com projects enables rapid scaling of resources and capabilities. This ensures that the client is delighted by the results, yet unaware of the complex extended ecosystem that has been involved. This is facilitated by that fact that we all speak the same salesforce.com language. Cloud Consulting is an ideal partner to work with in this way, as our delivery and technical strengths are well matched with their intimate client facing approach.” Tim Pullen, Managing Director of CCL added: “We already had a close relationship with FusionExperience and it was natural for us to turn to them for help with this suddenly extremely challenging project. The combination of cleaning, segmenting and splitting the data in Salesforce.com, extracting the system configuration and custom code and then creating a new system was tough enough to start but then having to redevelop the application from scratch took it to a new level. Right from the start Robin James and his team took everything in their stride and provided a level of comfort, reassurance, skill and professionalism that we’d never experienced before from other partners. Bear in mind that the old system had no user or technical documentation plus undocumented code and you begin to understand just how good the end result has been for the airline. Thank you Fusion!”
June 22, 2015
by Fran Cator
· 828 Views
article thumbnail
Spring Data Couchbase: Handle Unknown Class
Spring Data Couchbase provides transparent way to save and load Java classes to and from Couchbase. However, if a loaded class contains a property of unknown class, you will receive org.springframework.data.mapping.model.MappingException: No mapping metadata found for java.lang.Object This may happen if, for example, different versions of your code save and load information. In order to handle situation when we want to load an object, which contains another object on unknown class (in a map or list property) we should override the default SPMappingCouchbaseConverter. Let's see how we do this with Spring XML configuration: I replace my old XML: to the following XML: And create the following class: public class MyMappingCouchbaseConverter extends MappingCouchbaseConverter { public MyMappingCouchbaseConverter(final MappingContext, CouchbasePersistentProperty> mappingContext) { super(mappingContext); } @Override protected R read(final TypeInformation type, final CouchbaseDocument source, final Object parent) { if (Object.class == typeMapper.readType(source, type).type) { return null; } return super.read(type, source, parent); } } Now, if loaded object will contain a property of unknown class or an object of unknown class in a list or map, this property or object will be replaced by null. view source print?
June 22, 2015
by Pavel Bernshtam
· 4,065 Views
article thumbnail
Thread Pools in NGINX Boost Performance 9x!
Introduction [This article was written by Valentin Bartenev] It’s well known that NGINX uses an asynchronous, event-driven approach to handling connections. This means that instead of creating another dedicated process or thread for each request (like servers with a traditional architecture), it handles multiple connections and requests in one worker process. To achieve this, NGINX works with sockets in a non-blocking mode and uses efficient methods such as epoll and kqueue. Because the number of full-weight processes is small (usually only one per CPU core) and constant, much less memory is consumed and CPU cycles aren’t wasted on task switching. The advantages of such an approach are well-known through the example of NGINX itself. It successfully handles millions of simultaneous requests and scales very well. Each process consumes additional memory, and each switch between them consumes CPU cycles and trashes L-caches But the asynchronous, event-driven approach still has a problem. Or, as I like to think of it, an “enemy”. And the name of the enemy is: blocking. Unfortunately, many third-party modules use blocking calls, and users (and sometimes even the developers of the modules) aren’t aware of the drawbacks. Blocking operations can ruin NGINX performance and must be avoided at all costs. Even in the current official NGINX code it’s not possible to avoid blocking operations in every case, and to solve this problem the new “thread pools” mechanism was implemented in NGINX version 1.7.11. What it is and how it supposed to be used, we will cover later. Now let’s meet face to face with our enemy. The Problem First, for better understanding of the problem a few words about how NGINX works. In general, NGINX is an event handler, a controller that receives information from the kernel about all events occurring on connections and then gives commands to the operating system about what to do. In fact, NGINX does all the hard work by orchestrating the operating system, while the operating system does the routine work of reading and sending bytes. So it’s very important for NGINX to respond fast and in a timely manner. The events can be timeouts, notifications about sockets ready to read or to write, or notifications about an error that occurred. NGINX receives a bunch of events and then processes them one by one, doing the necessary actions. Thus all the processing is done in a simple loop over a queue in one thread. NGINX dequeues an event from the queue and then reacts to it by, for example, writing or reading a socket. In most cases, this is extremely quick (perhaps just requiring a few CPU cycles to copy some data into memory) and NGINX proceeds through all of the events in the queue in an instant. All processing is done in a simple loop by one thread But what will happen if some long and heavy operation has occurred? The whole cycle of event processing will get stuck waiting for this operation to finish. So, by saying “a blocking operation” we mean any operation that stops the cycle of handling events for a significant amount of time. Operations can be blocking for various reasons. For example, NGINX might be busy with lengthy, CPU-intensive processing, or it might have to wait to access a resource (such as a hard drive, or a mutex or library function call that gets responses from a database in a synchronous manner, etc.). The key point is that while processing such operations, the worker process cannot do anything else and cannot handle other events, even if there are more system resources available and some events in the queue could utilize those resources. Imagine a salesperson in a store with a long queue in front of him. The first guy in the queue asks for something that is not in the store but is in the warehouse. The salesperson goes to the warehouse to deliver the goods. Now the entire queue must wait a couple of hours for this delivery and everyone in the queue is unhappy. Can you imagine the reaction of the people? The waiting time of every person in the queue is increased by these hours, but the items they intend to buy might be right there in the shop. Nearly the same situation happens with NGINX when it asks to read a file that isn’t cached in memory, but needs to be read from disk. Hard drives are slow (especially the spinning ones), and while the other requests waiting in the queue might not need access to the drive, they are forced to wait anyway. As a result, latencies increase and system resources are not fully utilized. Some operating systems provide an asynchronous interface for reading and sending files and NGINX can use this interface (see the aio directive). A good example here is FreeBSD. Unfortunately, we can’t say the same about Linux. Although Linux provides a kind of asynchronous interface for reading files, it has a couple of significant drawbacks. One of them is alignment requirements for file access and buffers, but NGINX handles that well. But the second problem is worse. The asynchronous interface requires the O_DIRECT flag to be set on the file descriptor, which means that any access to the file will bypass the cache in memory and increase load on the hard disks. That definitely doesn’t make it optimal for many cases. To solve this problem in particular, thread pools were introduced in NGINX 1.7.11. They are not included by default in NGINX Plus yet, but contact sales if you’d like to try a build of NGINX Plus R6 that has thread pools enabled. Now let’s dive into what thread pools are about and how they work. Thread Pools Let’s return to our poor sales assistant who delivers goods from a faraway warehouse. But he has become smarter (or maybe he became smarter after being beaten by the crowd of angry clients?) and hired a delivery service. Now when somebody asks for something from the faraway warehouse, instead of going to the warehouse himself, he just drops an order to a delivery service and they will handle the order while our sales assistant will continue serving other customers. Thus only those clients whose goods aren’t in the store are waiting for delivery, while others can be served immediately. In terms of NGINX, the thread pool is performing the functions of the delivery service. It consists of a task queue and a number of threads that handle the queue. When a worker process needs to do a potentially long operation, instead of processing the operation by itself it puts a task in the pool’s queue, from which it can be taken and processed by any free thread. It seems then we have another queue. Right. But in this case the queue is limited by a specific resource. We can’t read from a drive faster than the drive is capable of producing data. Now at least the drive doesn’t delay processing of other events and only the requests that need to access files are waiting. The “reading from disk” operation is often used as the most common example of a blocking operation, but actually the thread pools implementation in NGINX can be used for any tasks that aren’t appropriate to process in the main working cycle. At the moment, offloading to thread pools is implemented only for two essential operations: the read() syscall on most operating systems and sendfile() on Linux. We will continue to test and benchmark the implementation, and we may offload other operations to the thread pools in future releases if there’s a clear benefit. Benchmarking It’s time to move from theory to practice. To demonstrate the effect of using thread pools we are going to perform a synthetic benchmark that simulates the worst mix of blocking and non-blocking operations. It requires a data set that is guaranteed not to fit in memory. On a machine with 48 GB of RAM, we have generated 256 GB of random data in 4-MB files, and then have configured NGINX 1.9.0 to serve it. The configuration is pretty simple: worker_processes 16; events { accept_mutex off; } http { include mime.types; default_type application/octet-stream; access_log off; sendfile on; sendfile_max_chunk 512k; server { listen 8000; location / { root /storage; } } } As you can see, to achieve better performance some tuning was done: logging and accept_mutex were disabled,sendfile was enabled, and sendfile_max_chunk was set. The last directive can reduce the maximum time spent in blocking sendfile() calls, since NGINX won’t try to send the whole file at once, but will do it in 512-KB chunks. The machine has two Intel Xeon E5645 (12 cores, 24 HT-threads in total) processors and a 10-Gbps network interface. The disk subsystem is represented by four Western Digital WD1003FBYX hard drives arranged in a RAID10 array. All of this hardware is powered by Ubuntu Server 14.04.1 LTS. The clients are represented by two machines with the same specifications. On one of these machines, wrk creates load using a Lua script. The script requests files from our server in a random order using 200 parallel connections, and each request is likely to result in a cache miss and a blocking read from disk. Let’s call this load the random load. On the second client machine we will run another copy of wrk that will request the same file multiple times using 50 parallel connections. Since this file will be frequently accessed, it will remain in memory all the time. In normal circumstances, NGINX would serve these requests very quickly, but performance will fall if the worker processes are blocked by other requests. Let’s call this load the constant load. The performance will be measured by monitoring throughput of the server machine using ifstat and by obtaining wrkresults from the second client. Now, the first run without thread pools does not give us very exciting results: % ifstat -bi eth2 eth2 Kbps in Kbps out 5531.24 1.03e+06 4855.23 812922.7 5994.66 1.07e+06 5476.27 981529.3 6353.62 1.12e+06 5166.17 892770.3 5522.81 978540.8 6208.10 985466.7 6370.79 1.12e+06 6123.33 1.07e+06 As you can see, with this configuration the server is able to produce about 1 Gbps of traffic in total. In the output from top, we can see that all of worker processes spend most of the time in blocking I/O (they are in a D state): top - 10:40:47 up 11 days, 1:32, 1 user, load average: 49.61, 45.77 62.89 Tasks: 375 total, 2 running, 373 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.0 us, 0.3 sy, 0.0 ni, 67.7 id, 31.9 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 49453440 total, 49149308 used, 304132 free, 98780 buffers KiB Swap: 10474236 total, 20124 used, 10454112 free, 46903412 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4639 vbart 20 0 47180 28152 496 D 0.7 0.1 0:00.17 nginx 4632 vbart 20 0 47180 28196 536 D 0.3 0.1 0:00.11 nginx 4633 vbart 20 0 47180 28324 540 D 0.3 0.1 0:00.11 nginx 4635 vbart 20 0 47180 28136 480 D 0.3 0.1 0:00.12 nginx 4636 vbart 20 0 47180 28208 536 D 0.3 0.1 0:00.14 nginx 4637 vbart 20 0 47180 28208 536 D 0.3 0.1 0:00.10 nginx 4638 vbart 20 0 47180 28204 536 D 0.3 0.1 0:00.12 nginx 4640 vbart 20 0 47180 28324 540 D 0.3 0.1 0:00.13 nginx 4641 vbart 20 0 47180 28324 540 D 0.3 0.1 0:00.13 nginx 4642 vbart 20 0 47180 28208 536 D 0.3 0.1 0:00.11 nginx 4643 vbart 20 0 47180 28276 536 D 0.3 0.1 0:00.29 nginx 4644 vbart 20 0 47180 28204 536 D 0.3 0.1 0:00.11 nginx 4645 vbart 20 0 47180 28204 536 D 0.3 0.1 0:00.17 nginx 4646 vbart 20 0 47180 28204 536 D 0.3 0.1 0:00.12 nginx 4647 vbart 20 0 47180 28208 532 D 0.3 0.1 0:00.17 nginx 4631 vbart 20 0 47180 756 252 S 0.0 0.1 0:00.00 nginx 4634 vbart 20 0 47180 28208 536 D 0.0 0.1 0:00.11 nginx 4648 vbart 20 0 25232 1956 1160 R 0.0 0.0 0:00.08 top 25921 vbart 20 0 121956 2232 1056 S 0.0 0.0 0:01.97 sshd 25923 vbart 20 0 40304 4160 2208 S 0.0 0.0 0:00.53 zsh In this case the throughput is limited by the disk subsystem, while the CPU is idle most of the time. The results from wrkare also very low: Running 1m test @ http://192.0.2.1:8000/1/1/1 12 threads and 50 connections Thread Stats Avg Stdev Max +/- Stdev Latency 7.42s 5.31s 24.41s 74.73% Req/Sec 0.15 0.36 1.00 84.62% 488 requests in 1.01m, 2.01GB read Requests/sec: 8.08 Transfer/sec: 34.07MB And remember, this is for the file that should be served from memory! The excessively large latencies are because all the worker processes are busy with reading files from the drives to serve the random load created by 200 connections from the first client, and cannot handle our requests in good time. It’s time to put our thread pools in play. For this we just add the aio threads directive to the location block: location / { root /storage; aio threads; } and ask NGINX to reload its configuration. After that we repeat the test: % ifstat -bi eth2 eth2 Kbps in Kbps out 60915.19 9.51e+06 59978.89 9.51e+06 60122.38 9.51e+06 61179.06 9.51e+06 61798.40 9.51e+06 57072.97 9.50e+06 56072.61 9.51e+06 61279.63 9.51e+06 61243.54 9.51e+06 59632.50 9.50e+06 Now our server produces 9.5 Gbps, compared to ~1 Gbps without thread pools! It probably could produce even more, but it has already reached the practical maximum network capacity, so in this test NGINX is limited by the network interface. The worker processes spend most of the time just sleeping and waiting for new events (they are in S state in top): top - 10:43:17 up 11 days, 1:35, 1 user, load average: 172.71, 93.84, 77.90 Tasks: 376 total, 1 running, 375 sleeping, 0 stopped, 0 zombie %Cpu(s): 0.2 us, 1.2 sy, 0.0 ni, 34.8 id, 61.5 wa, 0.0 hi, 2.3 si, 0.0 st KiB Mem: 49453440 total, 49096836 used, 356604 free, 97236 buffers KiB Swap: 10474236 total, 22860 used, 10451376 free, 46836580 cached Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4654 vbart 20 0 309708 28844 596 S 9.0 0.1 0:08.65 nginx 4660 vbart 20 0 309748 28920 596 S 6.6 0.1 0:14.82 nginx 4658 vbart 20 0 309452 28424 520 S 4.3 0.1 0:01.40 nginx 4663 vbart 20 0 309452 28476 572 S 4.3 0.1 0:01.32 nginx 4667 vbart 20 0 309584 28712 588 S 3.7 0.1 0:05.19 nginx 4656 vbart 20 0 309452 28476 572 S 3.3 0.1 0:01.84 nginx 4664 vbart 20 0 309452 28428 524 S 3.3 0.1 0:01.29 nginx 4652 vbart 20 0 309452 28476 572 S 3.0 0.1 0:01.46 nginx 4662 vbart 20 0 309552 28700 596 S 2.7 0.1 0:05.92 nginx 4661 vbart 20 0 309464 28636 596 S 2.3 0.1 0:01.59 nginx 4653 vbart 20 0 309452 28476 572 S 1.7 0.1 0:01.70 nginx 4666 vbart 20 0 309452 28428 524 S 1.3 0.1 0:01.63 nginx 4657 vbart 20 0 309584 28696 592 S 1.0 0.1 0:00.64 nginx 4655 vbart 20 0 30958 28476 572 S 0.7 0.1 0:02.81 nginx 4659 vbart 20 0 309452 28468 564 S 0.3 0.1 0:01.20 nginx 4665 vbart 20 0 309452 28476 572 S 0.3 0.1 0:00.71 nginx 5180 vbart 20 0 25232 1952 1156 R 0.0 0.0 0:00.45 top 4651 vbart 20 0 20032 752 252 S 0.0 0.0 0:00.00 nginx 25921 vbart 20 0 121956 2176 1000 S 0.0 0.0 0:01.98 sshd 25923 vbart 20 0 40304 3840 2208 S 0.0 0.0 0:00.54 zsh There are still plenty of CPU resources. The results of wrk: Running 1m test @ http://192.0.2.1:8000/1/1/1 12 threads and 50 connections Thread Stats Avg Stdev Max +/- Stdev Latency 226.32ms 392.76ms 1.72s 93.48% Req/Sec 20.02 10.84 59.00 65.91% 15045 requests in 1.00m, 58.86GB read Requests/sec: 250.57 Transfer/sec: 0.98GB The average time to serve a 4-MB file has been reduced from 7.42 seconds to 226.32 milliseconds (33 times less), and the number of requests per second has increased by 31 times (250 vs 8)! The explanation is that our requests no longer wait in the events queue for processing while worker processes are blocked on reading, but are handled by free threads. As long as the disk subsystem is doing its job as best it can serving our random load from the first client machine, NGINX uses the rest of the CPU resources and network capacity to serve requests of the second client from memory. Still Not a Silver Bullet After all our fears about blocking operations and some exciting results, probably most of you already are going to configure thread pools on your servers. Don’t hurry. The truth is that fortunately most read and send file operations do not deal with slow hard drives. If you have enough RAM to store the data set, then an operating system will be clever enough to cache frequently used files in a so-called “page cache”. The “page cache” works pretty well and allows NGINX to demonstrate great performance in almost all common use cases. Reading from the page cache is quite quick and no one can call such operations “blocking.” On the other hand, offloading to a thread pool has some overhead. So if you have a reasonable amount of RAM and your working data set isn’t very big, then NGINX already works in the most optimal way without using thread pools. Offloading read operations to the thread pool is a technique applicable to very specific tasks. It is most useful where the volume of frequently requested content doesn’t fit into the operating system’s VM cache. This might be the case with, for instance, a heavily loaded NGINX-based streaming media server. This is the situation we’ve simulated in our benchmark. It would be great if we could improve the offloading of read operations into thread pools. All we need is an efficient way to know if the needed file data is in memory or not, and only in the latter case should the reading operation be offloaded to a separate thread. Turning back to our sales analogy, currently the salesman cannot know if the requested item is in the store and must either always pass all orders to the delivery service or always handle them himself. The culprit is that operating systems are missing this feature. The first attempts to add it to Linux as the fincore() syscall were in 2010 but that didn’t happen. Later there were a number of attempts to implement it as a new preadv2() syscall with the RWF_NONBLOCK flag (see Non-blocking buffered file read operations and Asynchronous buffered read operations at LWN.net for details). The fate of all these patches is still unclear. The sad point here is that it seems the main reason why these patches haven’t been accepted yet to the kernel is continuous bikeshedding. On the other hand, users of FreeBSD don’t need to worry at all. FreeBSD already has a sufficiently good asynchronous interface for reading files, which you should use instead of thread pools. Configuring Thread Pools So if you are sure that you can get some benefit out of using thread pools in your use case, then it’s time to dive deep into configuration. The configuration is quite easy and flexible. The first thing you should have is NGINX version 1.7.11 or later, compiled with the --with-threads configuration parameter. In the simplest case, the configuration looks very plain. All you need is to include the aio threads directive in the http, server, or location context: aio threads; This is the minimal possible configuration of thread pools. In fact, it’s a short version of the following configuration: thread_pool default threads=32 max_queue=65536; aio threads=default; It defines a thread pool called default with 32 working threads and a maximum length for the task queue of 65536 requests. If the task queue is overloaded, NGINX logs this error and rejects the request: thread pool "NAME" queue overflow: N tasks waiting The error means it’s possible that the threads aren’t able to handle the work as quickly as it is added to the queue. You can try increasing the maximum queue size, but if that doesn’t help, then it indicates that your system is not capable of serving so many requests. As you already noticed, with the thread_pool directive you can configure the number of threads, the maximum length of the queue, and the name of a specific thread pool. The last implies that you can configure several independent thread pools and use them in different places of your configuration file to serve different purposes: http { thread_pool one threads=128 max_queue=0; thread_pool two threads=32; server { location /one { aio threads=one; } location /two { aio threads=two; } } … } If the max_queue parameter isn’t specified, the value 65536 is used by default. As shown, it’s possible to set max_queueto zero. In this case the thread pool will only be able to handle as many tasks as there are threads configured; no tasks will wait in the queue. Now let’s imagine you have a server with three hard drives and you want this server to work as a «caching proxy» that caches all responses from your back ends. The expected amount of cached data far exceeds the available RAM. It’s actually a caching node for your personal CDN. Of course in this case the most important thing is to achieve maximum performance from the drives. One of your options is to configure a RAID array. This approach has its pros and cons. Now with NGINX you can take another one: # We assume that each of the hard drives is mounted on one of the directories: # /mnt/disk1, /mnt/disk2, or /mnt/disk3 accordingly proxy_cache_path /mnt/disk1 levels=1:2 keys_zone=cache_1:256m max_size=1024G use_temp_path=off; proxy_cache_path /mnt/disk2 levels=1:2 keys_zone=cache_2:256m max_size=1024G use_temp_path=off; proxy_cache_path /mnt/disk3 levels=1:2 keys_zone=cache_3:256m max_size=1024G use_temp_path=off; thread_pool pool_1 threads=16; thread_pool pool_2 threads=16; thread_pool pool_3 threads=16; split_clients $request_uri $disk { 33.3% 1; 33.3% 2; * 3; } location / { proxy_pass http://backend; proxy_cache_key $request_uri; proxy_cache cache_$disk; aio threads=pool_$disk; sendfile on; } In this configuration three independent caches are used, dedicated to each of the disks, and three independent thread pools are dedicated to the disks as well. The split_clients module is used for load balancing between the caches (and as a result between the disks), which perfectly fits this task. The use_temp_path=off parameter to the proxy_cache_path directive instructs NGINX to save temporary files into the same directories where the corresponding cache data is located. It is needed to avoid copying response data between the hard drives when updating our caches. All this together allows us to get maximum performance out of the current disk subsystem, because NGINX through separate thread pools interacts with the drives in parallel and independently. Each of the drives is served by 16 independent threads with a dedicated task queue for reading and sending files. I bet your clients like this custom-tailored approach. Be sure that your hard drives like it too. This example is a good demonstration of how flexibly NGINX can be tuned specifically for your hardware. It’s like you are giving instructions to NGINX about the best way to interact with the machine and your data set. And by fine-tuning NGINX in user space, you can ensure that your software, operating system, and hardware work together in the most optimal mode to utilize all the system resources as effectively as possible. Conclusion Summing up, thread pools is a great feature that pushes NGINX to new levels of performance by eliminating one of its well-known and long-time enemies – blocking – especially when we are speaking about really large volumes of content. And there is even more to come. As previously mentioned, this brand-new interface potentially allows offloading of any long and blocking operation without any loss of performance. NGINX opens up new horizons in terms of having a mass of new modules and functionality. Lots of popular libraries still do not provide an asynchronous non-blocking interface, which previously made them incompatible with NGINX. We may spend a lot of time and resources on developing our own non-blocking prototype of some library, but will it always be worth the effort? Now, with thread pools on board, it is possible to use such libraries relatively easily, making such modules without an impact on performance. Stay tuned.
June 22, 2015
by Patrick Nommensen
· 12,550 Views · 1 Like
article thumbnail
Query Autofiltering Revisited -- Let's Be More precise!
In a previous blog post, I introduced the concept of “query autofiltering”, which is the process of using the meta information (information about information) that has been indexed by a search engine to infer what the user is attempting to find. A lot of the information used to do faceted search can also be used in this way, but by employing this knowledge up front or at “query time”, we can answer questions right away and much more precisely than we could without techniques like this. A word about “precision” here – precision means having fewer “false positives” – unintended responses that creep in to a result set because they share some words with the best answers. Search applications with well tuned relevancy will bring the best results to the top of the result list, but it is common for other responses, which we call “noise hits”, to come back as well. In the previous post, I explained why the search engine will often “do the wrong thing” when multiple terms are used and why this is frustrating to users – they add more information to their query to make it less ambiguous and the responses often do not reward that extra effort – in many cases, the response has more noise hits simply because the query has more words. The solution that I discussed involves adding some semantic awareness to the search process, because how words are used together in phrases is meaningful and we need ways to detect user intent from these patterns. The traditional way to do this is to use Natural Language Processing or NLP to parse the user query. This can work well if the queries are spoken or written as if the person were asking another person, as in “Where can I find restaurants in Cleveland that serve Sushi?” Of course, this scenario –which goes back to the early AI days – has become much more important now that we can talk to our cell phones. For search applications like Google with a “box and a button” paradigm, user queries are usually one word or short phrases like “Sushi Restaurants in Cleveland”. These are often what linguists call “noun phrases” consisting of a word that means a person, place or thing (what of who they want to find or where) – e.g. “restaurant” and “Cleveland” and some words that add precision to their query by constraining the properties of the thing they want to find – in this case “sushi”. In other words, it is clear from this query that the user is not interested in just any restaurant – they want to find those that serve raw fish on a ball of rice or vegetable and seafood thingies wrapped in seaweed. The search engine often does the wrong thing because it doesn’t know how to combine these terms – and typically will use the wrong logical or boolean operator – OR when the users intent should be interpreted as AND. It turns out that in many cases now, our search indexes know the difference between Mexican Restaurants (which typically don’t serve Sushi) and Japanese Restaurants (which usually do) because of the metadata that we put into them to do faceted search (Funny story: after posting this, I ran across a Mexican Restaurant in Toms River, New Jersey that does serve sushi – but still, most of them don’t!). The goal of query autofiltering is to use that built in knowledge to answer the question right away and not wait for the user to “drill in” using the facets. If users don’t give us a precise query (like simply “restaurants”), we can still use faceting, but if they do, it would be cool if we could cut to the chase. As you’ll see, it turns out that we can do this. The previous post contained a solution which I called a “Simple” Category Extraction component. It works by seeing if single tokens in the query matched field values in the search index (using a cool Lucene feature that enable us to mine the “uninverted” index for all of the values that were indexed in a field). For example, if it sees the token “red” and discovers that “red” is one of the values of a “color” field, it would infer that the user was looking for things that are “red” in “color” and will constrain the query this way. The solution works well in a limited set of cases, but there are a number of problems with it that make it less useful in a production setting. It does a nice job in cases where the term “red” is used to qualify or more precisely specify a thing – such as “red sofa”. It does not do so well in cases where the term “red” is not used as a qualifier – such as when it is part of a brand or product name such as “Red Baron Pizza” or “Johnny Walker Red Label” (great Scotch, but “Black Label” is even better, maybe I’ll be rich enough to afford “Blue Label” some day – but I digress …). It is interesting to note that the simple extractor’s main shortcomings are due to the fact that it looks at single tokens at a time in isolation from the tokens around it. This turns out to be the same problem that the core search engine algorithms have – i.e., it’s a “bag of words” approach that doesn’t consider – wait for it – semantic context. The solution is to look for patterns of words that match patterns of content attributes. This does a much better job of disambiguation. We can use the same coding trick as before (upgraded for API changes introduced in Solr 5.0), but we need to account for context and usage – as much as we can without having to introduce full-blown NLP which needs lots of text to crunch. In contrast, this approach can work when we just have structured metadata. Searching vs Navigating A little historical background here. With modern search applications, there are basically two types of user activities that are intermingled: searching and navigating. The former involves typing into a box and the latter, clicking on facet links. In the old days, there was a third user interface called an “advanced” search form where users could pick from a set of metadata fields, put in a value and select their logical combination operators– an interface that would be ideally suited for precise searching given rich metadata. The problem is that nobody wants to use it. Not that people ever liked this interface anyway (except those with Master of Library Science degrees), but Google has also done much to demote this interface to a historical reference. Google still has the same problem of noise hits but they have built a reputation for getting the best results to the top (and usually, they do) – and they also eschew facets (they kinda have them at the bottom of the search page now as related searches). Users can also “markup” their query with quotation marks or boolean expressions or ‘+/-’ signs but trust me – they won’t do that either (typically that is). What this means is that the little search box – love it or hate it – is our main entry point – i.e. we have to deal with it, because that is what users want – to just type stuff and then get the “right” answer back. (If poor ease-of-use or the simple joy of Google didn’t kill the advanced search form completely, the migration to mobile devices absolutely will). A Little Solr/Lucene Technology – String fields, Text fields and “free-text” searching: In Solr, when talking about textual data, these two user activities are normally handled by two different types of index field: string and text. String fields are not analyzed (tokenized) and searching them requires an exact match on a value indexed within a field. This value can be a word or a phrase. In other words, you need to use : syntax in the query (and quoted “value here” syntax if the query is multi-term) – something that power users will be OK with but not something that we can expect of the average user. However, string fields are very good for faceted navigation. Text fields on the other hand are analyzed (tokenized and filtered) and can be searched with “freetext” queries – our little box in other words. The problem here is that tokenization turns a stream of text into a stream of tokens (words) and while we do preserve positional information so we can search on phrases, we don’t know a priori where those phrases are. Text fields can also be faceted (in fact, any field can be a facet field in Solr), but in this case, the facets are based on individual tokens which don’t tend to be too useful. So we have two basic field types for text data, one good for searching and one for navigating. In the harder-to-search type, we know exactly where the phrases are but we typically don’t in the easier-to-search type. A classic trade-off scenario. Since string fields are harder to search (at least within the Google paradigm that users love), we make them searchable by copying their data (using the Solr “copyField” directive) into a catchall text field called “text” by default. This works, but in the process we throw away information about which values are meant to be phrases and which are not. Not only that, we’ve lost the context of what these values represent (the string fields that they came from). So although we’ve made these string fields more searchable, we’ve had to do that by putting them into a “bag of words” blender. But the information is still somewhere in the search index, we just need a way to get it back at at “query time”. Then, we can both have our cake AND eat it! Noun Phrases and the Hierarchy of meta information When we talk about things, there are certain attributes that describe what the thing is (type attributes) and others that describe the properties or characteristics of the thing. In a structured database or search index, both of these kinds of attributes are stored the same way – as field/value pairs. There are however, natural or semantic relationships between these fields that the database or search engine can’t understand, but we do. That is, noun phrases that describe more specific sets of things are buried in the relationships between our metadata fields. All we have to do is dig them out. For example, if I have a database of local businesses, I can have a “what” field like business type that has values like “restaurant”, “hardware store”, “drug store”, “filling station” and so forth. Within some of these business types like restaurant, there may be refining information like restaurant type (“Mexican”, “Chinese”, “Italian”, etc) or brand/franchise (“Exxon”, “Sunoco”, “Hess”, “Rite-Aid”, “CVS”, “Walgreens”, etc.) for gas stations and drug stores. These fields form a natural hierarchy of metadata in which some attributes refine or narrow the set of things that are labeled by broader field types. Rebuilding Context: Identifying field name patterns to find relevant phrase patterns So now its time to put Humpty Dumpty back together again. With Solr/Lucene – it is likely that the information that we need to give precise answers to precise questions is available in the search index. If we can identify sub-phrases within a query that refer or map to a metadata field in the index, we can then add the appropriate metadata mapping on behalf of the user. We are then able to answer questions like “Where is the nearest Tru Value hardware store?” because we can identify the phrase “Tru Value” as a business name and “hardware store” as a specific type of store. Assuming that this information is in the index in the form of metadata fields, parsing the query is a matter of detecting these metadata values and associating them with their source fields. Some additional NLP magic can be used to infer other aspects of the question such as “where is the nearest”, which should trigger the addition of a spatial proximity query filter for example. The Query AutoFiltering Search Component To implement the idea set out above, I developed a Solr Search Component called QueryAutoFilteringComponent. Search components are executed as part of the search request handling process. Besides executing a search, they can also do other things like spell checking or query suggestion, return the set of terms that are indexed in a field or the term vectors (term frequency statistics) among other things. The SearchComponent interface defines a number of methods one of which – prepare( ) – is executed by all of the components in a search handler chain before the request is processed. By specifying that a non-standard component is in the “first-components” list – it will be executed before the query is sent to the index by the downstream QueryComponent. This gives these early components a chance to modify the query before it is executed by the Lucene engine (or distributed to other shards in SolrCloud). The QueryAutoFilteringComponent works by creating a mapping of term values to the index fields that contain them. It uses the Lucene UnivertedIndex and the Solr TermsComponent (in SolrCloud mode) to build this map. This “inverse” map of term value -> index field is then used to discover if any sub-phrase within a query maps to a particular index field. If so, a filter query (fq) or boost query (bq) – depending on the configuration – is created from that field:value pair and if the result is to be a filter query, the value is removed from the original query. The result is a series of query expressions for the phrases that were identified in the original query. An example will help to make this clearer. Assuming that we have indexed the following records: This example is admittedly a bit contrived in that the term “red” is deliberately ambiguous – it can occur as a color value or as part of a brand or product_type phrase. So, with the OOTB Solr /select handler, a search for “red lion socks” brings back all 16 records. However, with the QueryAutoFilterComponent, only 2 results are returned (4 and 5) for this query. Furthermore, searching for “red wine” will only bring back one record (11) whereas searching for “red wine vinegar” brings back just record 12. What the filter does is to match terms with fields, trying to find the longest contiguous phrases that match mapped field values. So for the query “red lion socks” – it will first discover that “red” is a color, but then it will discover that “red lion” is a brand and this will supercede the shorter match that starts with “red”. Likewise, with “red wine vinegar”, it will first find “red” == color, then “red wine” == product_type then “red wine vinegar” == product_type and the final match will win because it is the longest contiguous match. It will work across fields too. If the query is “blue red lion socks” – it will discover that “blue” is a color, then that “blue red” is nothing so it will move on to the next unmatched token – “red”. It will then, as before, discover that “red lion” is a brand, reject “red lion socks” which doesn’t map to anything and finally find that “socks” is a product_type. From these three field matches it will construct a filter (or boost) query with the appropriate mapping of field name to field value. The result of all of this is a translation of the Solr query: q=blue red lion socks to a filter query: fq=color:blue&fq=brand:”red lion”&fq=product_type:socks This final query brings back just 1 result as opposed to 16 for the unfiltered case. In other words, we have increased precision from 6.25% to 100%! Adding case sensitivity and synonym support: One of the problems with using string fields as the source of metadata for noun phrases is that they are not analyzed (as discussed above). This limits the set of user inputs that can match – without any changes, the user must type in exactly what is indexed, including case and plurality. To address this problem, support for basic text analysis such as case insensitivity and stemming (singular/plural) as well as support for synonyms was added to the QueryAutoFilteringComponent. This adds to the code complexity somewhat but it makes it possible for the filter to detect synonymous phrases in the query like “couch” or “lounge chair” when “Sofa” or “Chaise Lounge” were indexed. Another thing that can help at an application level is to develop a suggester for typeahead or autocomplete interfaces that uses the Solr terms component and facet maps to build a multi-field suggester that will guide users towards precise and actionable queries. I hope to have a post on this in the near future. Source Code For those that are interested in how the autofiltering component works or would like to use it in your search application, source code and design documentation are available on github. The component has also been submitted to Solr (SOLR-7539 if you want to track it). The source code on github is in two versions, one that compiles and runs with Solr 4.x and the other that uses the new UninvertingReader API that must be used in Solr 5.0 and above. Conclusions The QueryAutoFilteringComponent does a lot more than the simple implementation introduced in the previous post. Like the previous example, it turns a free form queries into a set of Solr filter queries (fq) – if it can. This will eliminate results that do not match the metadata field values (or their synonyms) and is a way to achieve high precision. Another way to go is to use the “boost query” or bq rather than fq to push the precise hits to the top but allow other hits to persist in the result set. Once contextual phrases are identified, we can boost documents that contain these phrases in the identified fields (one of the chicken-and-egg problems with query-time boosting is knowing what field/value pairs to boost). The boosting approach may make more sense for traditional search applications viewed on laptop or workstation computers whereas the filter query approach probably makes more sense for mobile applications. The component contains a configurable parameter “boostFactor” which when set, will cause it to operate in boost mode so that records with exact matches in identified fields will be boosted over records with random or partial token hits.
June 22, 2015
by Lisa Warner
· 2,138 Views
article thumbnail
Heroku PostgreSQL vs. Amazon RDS for PostgreSQL
Written by Barry Jones. PostgreSQL is becoming the relational database of choice for web development for a whole host of good reasons. That means that development teams have to make a decision on whether to host their own or use a database as a service provider. The two biggest players in the world of PostgreSQL are Heroku PostgreSQL and Amazon RDS for PostgreSQL. Today I’m going to compare both platforms. Heroku was the first big provider to make a push for PostgreSQL instead of MySQL for application development. They launched their Heroku PostgreSQL platform back in 2007. Amazon Web Services first announced their RDS for PostgreSQL service in November 2013 during the AWS re:Invent conference to an overwhelming ovation by the programmers in attendance. Pricing Comparison Before I get too far into the features, let’s cover the pricing differences up front. Of course, both services have areas with different value propositions for productivity and maintenance that go beyond these direct costs. However, it’s worth it to understand the basic costs so you can weigh those values against your needs later. Heroku PostgreSQL has the simplest pricing. The rates and what you get for them are very clearly set at a simple per-month rate that includes the database, storage, data transfer, I/O, backups, SLA, and any other features built into the pricing tier. With RDS for PostgreSQL, pricing is broken down into smaller units of individual resource usage. That means there are more factors involved in estimating the price, so it’s a little tougher to draw an exact comparison to Heroku PostgreSQL. You have the price per hour for the instance type, higher if it’s a multiple availability zone instance, cheaper if you pay an upfront cost to reserve the instance for one to three years; storage cost and storage class (both single and multi AZ); provisioned IOPs rate; backup storage, and data transfer… then there are a whole lot of special cases to consider. Also, keep in mind that you get one year free of the cheapest plan when you sign up. Here is a comparison of an RDS plan to the Heroku Premium 4 plan: Heroku Premium 4 $1,200 / Month 15 GB RAM 512 GB storage 500 connections High Availability Max 15 minutes downtime/month 1 week rollback Point in time recovery Encryption at rest Continuous protection (offsite Write-Ahead-Log) RDS for PostgreSQL $1,156/month on demand or $756/month 1 year reserved db.m3.xlarge Multi-AZ at $0.780/hr ($580) 4 vCPU, 15GB RAM Encryption at rest 512 GB provisioned (SSD) at $0.250/GB ($128) 2000 provisioned IOPS at $0.20/IOPS ($400) Estimated backup storage in excess of free for 1 week rollback, 512 GB at $0.095/GB ($48) Data transfer estimated at $0 for most use cases 22 minutes downtime/month (based on AWS RDS SLA 99.95% uptime) Now, here are the caveats with such a comparison: Heroku isn’t disclosing the number of CPUs associated with their plans. Heroku’s High Availability is equivalent to AWS RDS Multi-AZ. In both setups, a read replica is maintained in a different geographic region specifically for the purpose of automatic failover in the event of an outage. With Heroku, your storage is fully allocated, and you do not pay for IOPS. As such, we don’t know what the limits are for IOPS, but they are very high performance databases. I allocated the minimum IOPS that AWS would allow for 512 GB, which was 2,000. We could go as high as 5,000 IOPS which would increase the price by $600/month. The AWS RDS backups may cost nothing depending on how much of the provisioned storage is actually being used. Backup storage is free up to the level of provisioned storage, and backups are generally smaller, incremental, and do not include the significant space used by indexes. This estimate was based on the seven days of storage needed to allow for one week rollback. AWS RDS storage can be scaled up on the fly, so your specific needs for RAM versus storage could create a wildly different pricing pattern. This comparison is aiming to draw an equivalent. AWS only charges for data transfers out of your availability zone (not including multi-AZ transfers), so transfer rates will not apply in most cases. Clear as mud. Setup Complexity Heroku PostgreSQL setup is dead simple. Whenever you create a PostgreSQL project, a free dev plan is already created with it with a connection waiting. Upgrading the database simply gives you a new connection string with a set username, password, hostname, and database identifier that are all randomly generated by their system. The database connection must be secure but is accessible anywhere on the internet, including directly from your home computer. You can also choose whether to deploy it in the US East region or in the European region. RDS for PostgreSQL setup is slightly more involved; you must select the various options outlined in the pricing section, including the instance type, whether or not it should be Multi-AZ, whether to enable encryption at rest, type of storage, how much to provision, IOPs to provision (if any), backup retention period, whether or not to enable automatic minor version upgrades, selection of backup and maintenance windows, database identifier, name, port, master user and password, which availability zone you want it to be created in and the selection of your VPC group and subnet group, and your database configuration. Obviously, RDS gives you significantly more control over the details. Depending on your point of view, that could be good or bad. The database configuration, for example, has a set of defaults for each database version for each instance type. You can take these defaults and make modifications to them with your own custom settings and then save those as your own parameter group to assign to this and any future databases that you may choose to create. The initial setup time can be slightly more involved because of the various factors like VPC, subnet groups, and public accessibility. However, once these have been defined the first time for your account, everything gets much closer to a point-and-click experience. Host Locations, Regional Restrictions Heroku operates with the AWS US East Region (us-east-1) and Europe (eu-west-1). This also means that your database will be restricted to these regions. Availability Zones are managed internally. If you choose to use Heroku PostgreSQL with something hosted in a different AWS region than those two, you should expect more latency between database requests and transfer rates may apply. Likewise, if you wish to use AWS RDS for PostgreSQL with a Heroku application, just ensure that it is set up in the appropriate region. Security and Access Considerations Within Heroku PostgreSQL, you’re given a randomized username with a randomized password and a randomized database name that must be connected to over SSL. Their network (as well as Amazon’s) have built-in protections against scanners that could potentially brute-force access such a database. That is fairly secure. The downside is that anybody who needs access to the database and has the connection information can do so from anywhere in the world. This is more of a Human Resources-level risk from departed programmers on a project than anything else, but it is something to be aware of nonetheless. Swapping out the database credentials after having a programmer leave the team will generally alleviate this concern. On the other hand, AWS RDS for PostgreSQL has a much more comprehensive security policy. The ability to set and define a VPC and private subnet groups will allow you to restrict database access to only the servers and people who need it. You have the ability to create as many database users with various permission levels as you like in order to more easily manage multiple users or applications accessing the database with different permission levels, while providing a log trail. Thanks to VPC, even if somebody did have the connection information, they still couldn’t access the database without being able to get inside the VPC. For stricter (although more complex) security, RDS wins hands down. Depending on complexity, team, and the development state of your application, this level of security paranoia may not yet make sense and could be more of a headache than you want to manage. You can also configure it with the same access rules used by Heroku PostgreSQL. Backup/Restore/Upgrade Both platforms offer very similar options for backup and restore. Both have scheduled backups, point-in-time recovery, restoration to a new copy, and the ability to create snapshots. Upgrades are more involved. On both platforms, major version upgrades will involve some downtime, which can’t be avoided. Heroku provided three options that all involve some manual steps to complete: copying data, promoting an upgraded follower, or using the pg:upgrade command for an in-place upgrade of larger databases. The pg:upgrade most closely resembles the upgrade process on RDS. With RDS, you select the Modify option for your instance and change the version. It will create pre- and post-snapshots around the in-place upgrade while maintaining the exact same connection string. RDS will allow you to schedule the database upgrade automatically within your set maintenance window. Heroku PostgreSQL will automatically apply minor upgrades and security patches, while RDS allows you to choose whether or not you want them to do that automatically within your maintenance window. Both are fairly straightforward processes, although the RDS process is a little more hands-off in this case. Feature/Extension Availability As of this writing, AWS RDS for PostgreSQL has version 9.3.1–9.3.6 and 9.4.1, while Heroku PostgreSQL has 9.1, 9.2, 9.3, and 9.4. Minor version upgrades are automatic with Heroku, so the point releases are unnecessary. Heroku PostgreSQL has been around longer and because of that has more legacy versions available for their existing users. RDS launched with 9.3 and does not appear to have any intention to support older versions. In addition to all of the functionality built into PostgreSQL, there’s a constantly growing set of extensions. Both platforms have these extensions in common: hstore citext ltree isn cube dict_int unaccent PostGIS dblink earthdistance fuzzystrmatch intarray pg_stat_statements pgcrypto pg_trgm tablefunc uuid-ossp pgrowlocks btree_gist PL/pgSQL PL/Tcl PL/Perl PL/V8 Available on Heroku PostgreSQL: pgstattuple Available on AWS RDS for PostgreSQL: postgres_fdw chkpass intagg tsearch2 sslinfo Here are the full lists for both Heroku PostgreSQL and AWS RDS for PostgreSQL. Scaling Options “Scaling” is a tricky word with databases because it means different things depending on the needs of your application. Scaling for writes vs. reads is based on low intensity and high volume (web traffic) compared to low volume and high intensity (analytics). The most common scaling case on the web is scaling for read traffic. Both Heroku and RDS address this need with the ability to create read replicas. RDS calls them read replicas and Heroku calls them followers, but they’re essentially the same thing: a copy of the database, receiving live updates via the write-ahead-log over the wire to allow you to spread read traffic over multiple servers. This is commonly referred to as horizontal scaling. To create read replicas on either platform is a point-and-click operation. Vertical scaling refers to increasing or decreasing the power of the hardware of your database in place. AWS and Heroku each handle this scenario differently. Heroku instructs users to create a follower of the newly desired database class and then promote it to the primary database once it’s caught up, destroying the original afterwards. Your application will need to update its database connection information to use the new database. If your RDS database is a multi-AZ database, then the failover database will be upgraded first. Once ready, the connection will automatically failover to that instance while the primary is then upgraded, switching back to the primary afterwards. Without a Multi-AZ, you can do the upgrade in place, but downtime will vary depending on the size of the database. Your other option is to create a read replica with the newly desired stats and then promote it to primary when it is ready, just as Heroku recommends. To scale beyond the standard vertical and horizontal options for something that can handle distributed write scaling, neither option is a particularly good fit. It will probably be necessary to either manage your own Postgres-XC installation or restructure your application to isolate the write-heavy traffic into a more use-case specific data source. Monitoring AWS RDS for PostgreSQL comes with all of the standard AWS monitoring options via Cloudwatch. Cloudwatch provides extensive metrics that you can track history with a granular ability to set up alerts via email or SNS notifications (basically webhooks). These are great for integrating with tools like PagerDuty. Heroku PostgreSQL monitoring relies more on logs and command line tools. Their pgextras command line tool will show current information about what’s going on in the database, including bloat, blocking queries, cache and index hit ratios, identification of unused indexes, and the ability to kill specific queries. These tools, while not involving the stat tracking over time that you get from Cloudwatch, provide extremely valuable insights into what’s going on with your database that you don’t come close to getting from RDS. You can see more examples of pg-extras on GitHub. These type of insights are invaluable in tuning your application and database to avoid the problems you’d need a monitor to catch in the first place. Other historical data is available in the logs, although Heroku recommends trying out Librato (which can work with any PostgreSQL database but has a Heroku plugin available for automatic configuration). Additionally, free New Relic plans will provide a wealth of insight into what’s going on with your application and database. While Cloudwatch provides more detailed insight as to what’s going on within the machine, Heroku uses the metrics seen within pg-extras to monitor and notify you of the various problems they see that require correction on your end. If data corruption happens, Heroku identifies and fixes it. Security problems, they’ll handle it. A DBA or a DevOps position will care significantly more about the Cloudwatch metrics. Heroku PostgreSQL tries to focus on making sure you don’t have to worry about it. Dataclips One bonus feature that you get from Heroku PostgreSQL is Dataclips. Dataclips are basically a method for storing and sharing read-only queries among your team for the sake of reporting without having to grant access to every person who may need to see them. Just type in a query and view the results right there on the page. The queries are version controlled; if your team is passing them around and tweaking them, you’ll be able to see the changes over time. In my personal experience, I’ve found dataclips to be a lifesaver, specifically for working with non-programmer teams. When business or support staff need information on sales, fraud, user behavior, account activity, or anything else we happen to have in there, I’ve always had the ability to write up a query to get at the information. Before dataclips, this meant that I needed to write up the query, save it somewhere, usually export the result set to a CSV or spreadsheet, and then email it to whomever was requesting it. Eventually, this becomes a routine activity that you’re having to handle at every request. Enter dataclips. Now I can take that query and just send the random hashed link over to whoever requested the information. If they want more up-to-date information the next day, week, or month, they need only refresh the page. I write the query, then never hear that request again. That is a developer time-saver right there. You can save them and name them, as well as manage more strict access if need be. Summary and Recommendation Overall, AWS RDS for PostgreSQL will usually be cheaper and more tightly tailorable to exactly what your application’s needs are. You’ll have much more granular control over access, security, monitoring, alerts, geographic location, and maintenance plans. With Heroku PostgreSQL, you’ll pay a little bit more on a simplified pricing structure, although all of your development databases will be free. You won’t be able to control a lot of the details that RDS gives you access to, but that’s partially by design so that you don’t have to deal with managing those details. With Heroku, you’ll get insights directly into how your database is performing and using the internal resources to help you catch, tune, and improve your setup before it becomes a problem. If I had to choose, I’d probably go with Heroku and Heroku PostgreSQL as a startup while I focused on actually getting my application developed and getting customers in the door. The value proposition of saving time to focus on business goals so we can build a revenue stream would be of the greatest importance. Then when things grew to a point that the database was no longer changing as much, it might make sense to start migrating things over to RDS as we focus on locking things down to focus on stability, long-term maintenance, and security. In the end, it really boils down to what costs you more: time or infrastructure. If time costs you more, go with Heroku PostgreSQL. If infrastructure costs you more, go with RDS. Having both platforms living within the AWS datacenters makes switching between the two a lot easier as your needs change.
June 22, 2015
by Moritz Plassnig
· 3,907 Views
article thumbnail
R: dplyr - Removing Empty Rows
I’m still working my way through the exercises in Think Bayes and in Chapter 6 needed to do some cleaning of the data in a CSV file containing information about the Price is Right. I downloaded the file using wget: wget http://www.greenteapress.com/thinkbayes/showcases.2011.csv And then loaded it into R and explored the first few rows using dplyr library(dplyr) df2011 = read.csv("~/projects/rLearning/showcases.2011.csv") > df2011 %>% head(10) X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3 1 5631K 5632K 5633K 5634K 5635K 5641K 5642K 5643K 5644K 5645K 5681K 2 3 Showcase 1 50969 21901 32815 44432 24273 30554 20963 28941 25851 28800 37703 4 Showcase 2 45429 34061 53186 31428 22320 24337 41373 45437 41125 36319 38752 5 ... As you can see, we have some empty rows which we want to get rid of to ease future processing. I couldn’t find an easy way to filter those out but what we can do instead is have empty columns converted to ‘NA’ and then filter those. First we need to tell read.csv to treat empty columns as NA: df2011 = read.csv("~/projects/rLearning/showcases.2011.csv", na.strings = c("", "NA")) And now we can filter them out using na.omit: df2011 = df2011 %>% na.omit() > df2011 %>% head(5) X Sep..19 Sep..20 Sep..21 Sep..22 Sep..23 Sep..26 Sep..27 Sep..28 Sep..29 Sep..30 Oct..3 3 Showcase 1 50969 21901 32815 44432 24273 30554 20963 28941 25851 28800 37703 4 Showcase 2 45429 34061 53186 31428 22320 24337 41373 45437 41125 36319 38752 6 Bid 1 42000 14000 32000 27000 18750 27222 25000 35000 22500 21300 21567 7 Bid 2 34000 59900 45000 38000 23000 18525 32000 45000 32000 27500 23800 9 Difference 1 8969 7901 815 17432 5523 3332 -4037 -6059 3351 7500 16136 ... Much better!
June 21, 2015
by Mark Needham
· 58,620 Views · 1 Like
article thumbnail
Long-Term Log Analysis with AWS Redshift
You will aggregate a lot of logs over the lifetime of your product and codebase, so it’s important to be able to search through them. In the rare case of a security issue, not having that capability is incredibly painful. You might be able to use services that allow you to search through the logs of the last two weeks quickly. But what if you want to search through the last six months, a year, or even further? That availability can be rather expensive or not even an option at all with existing services. Many hosted log services provide S3 archival support which we can use to build a long-term log analysis infrastructure with AWS Redshift. Recently I’ve set up scripts to be able to create that infrastructure whenever we need it at Codeship. AWS Redshift AWS Redshift is a data warehousing solution by AWS. It has an easy clustering and ingestion mechanism ideal for loading large log files and then searching through them with SQL. As it automatically balances your log files across several machines, you can easily scale up if you need more speed. As I said earlier, looking through large amounts of log files is a relatively rare occasion; you don’t need this infrastructure to be around all the time, which makes it a perfect use case for AWS. Setting Up Your Log Analysis Let’s walk through the scripts that drive our long-term log analysis infrastructure. You can check them out in the flomotlik/redshift-logging GitHub repository. I’ll take you step by step through configuring the whole setup of the environment variables needed, as well as starting the creation of the cluster and searching the logs. But first, let’s get a high-level overview of what the setup script is doing before going into all the different options that you can set: Creates an AWS Redshift cluster. You can configure the number of servers and which server type should be used. Waits for the cluster to become ready. Creates a SQL table inside the Redshift cluster to load the log files into. Ingests all log files into the Redshift cluster from AWS S3. Cleans up the database and prints the psql access command to connect into the cluster. Be sure to check out the script on GitHub before we go into all the different options that you can set through the .env file. Options to set The following is a list of all the options available to you. You can simply copy the .env.template file to .env and then fill in all the options to get picked up. AWS_ACCESS_KEY_ID AWS key of the account that should run the Redshift cluster. AWS_SECRET_ACCESS_KEY AWS secret key of the account that should run the Redshift cluster. AWS_REGION=us-east-1 AWS region the cluster should run in, default us-east-1. Make sure to use the same region that is used for archiving your logs to S3 to have them close. REDSHIFT_USERNAME Username to connect with psql into the cluster. REDSHIFT_PASSWORD Password to connect with psql into the cluster. S3_AWS_ACCESS_KEY_ID AWS key that has access to the S3 bucket you want to pull your logs from. We run the log analysis cluster in our AWS Sandbox account but pull the logs from our production AWS account so the Redshift cluster doesn’t impact production in any way. S3_AWS_SECRET_ACCESS_KEY AWS secret key that has access to the S3 bucket you want to pull your logs from. PORT=5439 Port to connect to with psql. CLUSTER_TYPE=single-node The cluster type can be single-node or multi-node. Multi-node clusters get auto-balanced which gives you more speed at a higher cost. NODE_TYPE Instance type that’s used for the nodes of the cluster. Check out the Redshift Documentation for details on the instance types and their differences. NUMBER_OF_NODES=10 Number of nodes when running in multi-mode. CLUSTER_IDENTIFIER=log-analysis DB_NAME=log-analysis S3_PATH=s3://your_s3_bucket/papertrail/logs/862693/dt=2015 Database format and failed loads When ingesting log statements into the cluster, make sure to check the amount of failed loads that are happening. You might have to edit the database format to fit to your specific log output style. You can debug this easily by creating a single-node cluster first that only loads a small subset of your logs and is very fast as a result. Make sure to have none or nearly no failed loads before you extend to the whole cluster. In case there are issues, check out the documentation of the copy command which loads your logs into the database and the parameters in the setup script for that. Example and benchmarks It’s a quick thing to set up the whole cluster and run example queries against it. For example, I’ll load all of our logs of the last nine months into a Redshift cluster and run several queries against it. I haven’t spent any time on optimizing the table, but you could definitely gain some more speed out of the whole system if necessary. It’s just fast enough already for us out of the box. As you can see here, loading all logs of May — more than 600 million log lines — took only 12 minutes on a cluster of 10 machines. We could easily load more than one month into that 10-machine cluster since there’s more than enough storage available, but for this post, one month is enough. After that, we’re able to search through the history of all of our applications and past servers through SQL. We connect with our psql client and send of SQL queries against the “events’ database. For example, what if we want to know how many build servers reported logs in May: loganalysis=# select count(distinct(source_name)) from events where source_name LIKE 'i-%'; count ------- 801 (1 row) So in May, we had 801 EC2 build servers running for our customers. That query took ~3 seconds to finish. Or let’s say we want to know how many people accessed the configuration page of our main repository (the project ID is hidden with XXXX): loganalysis=# select count(*) from events where source_name = 'mothership' and program LIKE 'app/web%' and message LIKE 'method=GET path=/projects/XXXX/configure_tests%'; count ------- 15 (1 row) So now we know that there were 15 accesses on that configuration page throughout May. We can also get all the details, including who accessed it when through our logs. This could help in case of any security issues we’d need to look into. The query took about 40 seconds to go though all of our logs, but it could be optimized on Redshift even more. Those are just some of the queries you could use to look through your logs, gaining more insight into your customers’ use of your system. And you et all of that with a setup that costs $2.50 an hour, can be shut down immediately, and recreated any time you need access to that data again. Conclusions Being able to search through and learn from your history is incredibly important for building a large infrastructure. You need to be able to look into your history easily, especially when it comes to security issues. With AWS Redshift, you have a great tool in hand that allows you to start an ad hoc analytics infrastructure that’s fast and cheap for short-term reviews. Of course, Redshift can do a lot more as well. Let us know what your processes and tools around logging, storage, and search are in the comments.
June 21, 2015
by Florian Motlik
· 1,444 Views
  • Previous
  • ...
  • 487
  • 488
  • 489
  • 490
  • 491
  • 492
  • 493
  • 494
  • 495
  • 496
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×