Koby Holzer has 17 years of experience working with large infrastructure environments, with the last 4.5 of these at LivePerson as the director of cloud engineering, specifically focusing on OpenStack. His past experience includes working with prominent Israeli telecom companies in the area of technological infrastructure. I have personally known Koby for the past few years, through discussing, lecturing, and enjoying the great cloud and DevOps community in Israel.
Though unfortunately I didn’t have the pleasure of attending the last OpenStack Summit in April, I was thrilled to see Holzer pictured taking part in the keynote session at one of the most important global cloud events. The following is the result of another great interview with a true cloud leader.
Ofir: Let’s start with a simple question. How was the OpenStack Summit?
Koby: The summit was amazing, and it’s getting better every year. It was my fourth one. This was the most exciting one for me as I had the opportunity to talk about LivePerson, OpenStack, containers … and all the new and exciting stuff we are doing with the LivePerson cloud.
It was the biggest convention yet with over 7,500 people in attendance, and it just continues to get bigger every year in every aspect—logistics, keynote sessions, educational tracks.
On the educational side, I particularly like the practical, real-life cases. It’s very interesting to see how other companies are tackling OpenStack. My focus was on two specific tracks: the case studies, including those from AT&T and Comcast, and the containers track, which was very popular at the conference. The one session that I remember as particularly meaningful was ... the panel (that) included experts from the most popular container technologies in the market (i.e., Kubernetes, Docker, Mesos).
Koby: Early 2012, we started learning and evaluating OpenStack, which was the Diablo version at that time. We started to play with it, building our cloud labs and making the decision to go to production with a small portion of LivePerson (LP) services during the middle of that year. And when we reached production, we were already using Essex. In 2012, we had Essex in production and towards the end of that year, we decided to rewrite LivePerson’s service from one big monolithic service to many microservices.
The next step was adopting Puppet, which accelerated the consumption of our private cloud. R&D moved from consuming physical servers to virtual OpenStack instances. By the end of 2013, we had already created a large cluster with more than 2,000 instances on four data centers, and from then on it just continued to evolve.
In 2013-2014, we were dealing with the OpenStack upgrade challenge and managed to move to Folsom, then Havana, and Icehouse. We try to upgrade as often as possible; however, the bigger our cloud gets, the more difficult that is.
In 2015, we reached a point where we had finished rewriting the service, and our new “LiveEngage” service was ready for production and real users. Today, we have something like 8,000 instances on seven data centers, running on more than 20,000 physical cores. 2016 is the year for us to migrate to containers and Kubernetes, something which we expect to span well into 2017.
Ofir: If you were to look over these last 4.5 years, what would you say have been, or are still, your main challenges?
Koby: I manage cloud engineering, and there is a rather large team of software developers here. We were very lucky that the R&D organization decided to move to microservices at the same time that we introduced OpenStack and Puppet. Looking back, I am not sure if it was planned, but the timing was just perfect. While development built a modern microservices-based service, ops adopted and implemented cloud and automation.
In terms of management challenges, I will just narrow it down and talk about the challenges that I see for 2016. Migrating 150 services to containers is something that my team cannot accomplish alone. We are in a continuous effort to maintain the partnership with R&D and create a joint effort when it comes to educating ourselves and being able to optimally use the new technologies. That includes moving from continuous integration to continuous delivery and building a strong delivery pipeline.
The operations goal is to build an environment that enables R&D to own the service end-to-end, not only to develop it but also to be able to support a quality and robust production environment.
Ofir: Can you point to any specific challenge that you faced and overcame throughout your cloud journey?
Koby: One big challenge was the deployment and adoption across the organization of Puppet. If only the cloud production and operations team was using it, it wouldn’t have been enough. We needed our software developers to adopt Puppet as well and use it as a standard delivery method. And making 200 developers use a new technology doesn’t happen overnight, as you can imagine.
I learned that it’s not something that you can easily convince that number of smart people to do just by saying "guys, this is great technology and it’s the only way we can deal with delivery." We learned our lesson from that, and now we work much more closely with R&D, making decisions together from the start.
Remember that this was almost four years ago. It took a management decision from the very top, LivePerson's general manager, for everyone to understand that this was the way forward. Our entire R&D was instructed that all new updates (would) go to production using Puppet. A small team of DevOps experts was brought in to support and train the R&D teams and make sure Puppet was being used on a daily basis. This team carried out workshops and was the place to go with any questions. It took around a year to bring everyone up to speed, and today Puppet is the main delivery tool.
Another challenge, which is a common one for OpenStack, is upgrades, at least with older versions. After four years of practicing, the process of upgrading takes one engineer up to three months. This was the story for every upgrade until now. The most recent upgrade has been the biggest so far, mainly due to the fact that our cloud has grown significantly and that we also needed to upgrade the hosts in tandem.
Upgrading thousands of physical servers while maintaining the service uptime is no simple task. In order to do this, we need to take a group of servers, run a live migration of workloads to the other servers, then upgrade and ensure nothing was harmed before bringing the group back into the pool. There are lots of considerations and activities behind this, including understanding and segmenting the sensitive workloads.
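As a rough illustration of the rolling procedure described above, here is a minimal, self-contained simulation. It is only a sketch under stated assumptions, not LivePerson's tooling: the `Host` class, the capacity model, and the `live_migrate`, `healthy`, and `rolling_upgrade` helpers are all hypothetical, and no real OpenStack API is called.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical model of a compute host; not an OpenStack object."""
    name: str
    capacity: int                      # max number of VMs (assumed unit)
    version: str = "old"
    in_pool: bool = True
    vms: list = field(default_factory=list)


def live_migrate(vm, source, targets):
    """Move one VM to the least-loaded in-pool host with free capacity."""
    candidates = [h for h in targets if h.in_pool and len(h.vms) < h.capacity]
    if not candidates:
        raise RuntimeError(f"no capacity to migrate {vm} off {source.name}")
    target = min(candidates, key=lambda h: len(h.vms))
    source.vms.remove(vm)
    target.vms.append(vm)


def healthy(host):
    """Placeholder check; a real one would probe services on the host."""
    return host.version == "new" and not host.vms


def rolling_upgrade(hosts, batch_size):
    """Upgrade hosts group by group while the rest keep serving workloads."""
    pending = [h for h in hosts if h.version != "new"]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        # 1. Take the group out of the pool so nothing new lands on it.
        for host in batch:
            host.in_pool = False
        # 2. Live-migrate its workloads to hosts that remain in the pool.
        for host in batch:
            for vm in list(host.vms):
                live_migrate(vm, host, hosts)
        # 3. Upgrade, verify, and only then return the group to the pool.
        for host in batch:
            host.version = "new"
            if not healthy(host):
                raise RuntimeError(f"{host.name} failed post-upgrade checks")
            host.in_pool = True
```

The batch size is the main knob in this sketch: a larger group finishes sooner but needs more spare capacity on the remaining hosts. Segmenting sensitive workloads, which the procedure above also calls for, is left out here.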
The LivePerson team at OpenStack Day Israel. Photo: Lauren Sell.
Ofir: How do you manage to keep transparency throughout the upgrade process?
Koby: We built a smart cloud discovery solution which updates automatically. Transparency is key, and we have complete control over each individual VM and service. The system records all activities and can be accessed using an API and UI.
Ofir: What 3 takeaway tips can you give from your experience?
Koby: As the operations manager, you should be able to build an efficient and professional team, the size of which obviously depends on the size of your OpenStack cloud. For a cloud that consists of thousands of hosts, you need at least two network professionals, three talented operations/engineering guys who are responsible for automating everything, and one storage guy. This team does not include the teams that operate the daily tasks and use the OpenStack resources for the LivePerson service.
In addition, you need to think of every management aspect. Security is not part of our team, although ideally it should be. We are supported by our R&D team's security experts. When building your private cloud, remember that your R&D team (couldn't) care less whether it's OpenStack, physical servers, VMware, or whatever. They just need the resources and the flexibility that the cloud and DevOps promise.
Learning from the past with the Puppet challenge, it was like us telling R&D "we demand that you deliver with Puppet," but as an IT leader, you need to understand and market the values of the new technology. And it never ends, but once you have done it, the next time will be easier, as I see with our current move to Docker and Kubernetes. Eventually, we want to work together as equals, with everyone adopting the technology together, learning together and coping with all challenges together.
In order to accomplish that, you need to create a "feature team" that includes representation from all the parties involved, including the architects, leading developers, operations, network, and security.
Although that might be challenging, I strongly suggest (educating) the other parties, not only on the touch points between dev and ops but also on what goes on behind the scenes of your cloud, so that they have the knowledge they need to use the OpenStack/Kubernetes APIs in particular. This is something that we are still working on with our R&D team. And together with containers, our developers will be able to enjoy real (independence) with provisioning and consumption of resources. Connecting the software and the infrastructure, and letting the developers decide what they need and when, is the flexibility IT operations are responsible for.
Everyone should adopt the DevOps approach. R&D and Production are both developers, each with their own location in the delivery system. Although I am proud that LP is a cloud pioneer, we still have a way to go on that matter, and that’s exciting. Becoming Netflix or Google doesn’t happen overnight. The good news is that this road never ends, and there is always something new to learn, adopt and do better.
Ofir: What are your future thoughts about the private cloud/cloud market landscape for the next five years?
Koby: I’m not sure about the next five, so let’s start with the next two and move on. I think that in two years we will see hybrid clouds big time, and this is also what we are aiming for. By using Kubernetes, we will be able to use all the public clouds, as well as our own private one, in the same way, with the same teams and tools. What I want to see in LP is a very dynamic multi-cloud environment. For example, if Amazon changes its prices and I know in real time that I can get a better price with Google, I will want all workloads and traffic to move seamlessly to GCP, and if prices change again the day after, I will want them to move automatically to a third public cloud. The workload migration will be based on a price/performance equation while taking into consideration the SLA of each workload.
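One way to picture such a price/performance decision is the small sketch below. It is purely illustrative: the `CloudOffer` fields, the `pick_cloud` helper, the choice of availability as the SLA measure, and every number in the usage example are assumptions made for this sketch, not LivePerson's actual equation or any provider's real pricing.

```python
from dataclasses import dataclass


@dataclass
class CloudOffer:
    """Hypothetical snapshot of what one cloud offers right now."""
    name: str
    price_per_hour: float      # assumed cost per instance-hour
    performance: float         # assumed benchmark score, higher is better
    availability: float        # advertised availability, e.g. 0.999


def pick_cloud(offers, required_availability):
    """Return the offer with the lowest price per unit of performance
    among those meeting the workload's SLA, or None if none qualifies."""
    eligible = [o for o in offers if o.availability >= required_availability]
    if not eligible:
        return None
    return min(eligible, key=lambda o: o.price_per_hour / o.performance)


# Usage with made-up numbers: a strict SLA rules out the cheapest offer.
offers = [
    CloudOffer("private", price_per_hour=0.10, performance=1.0, availability=0.999),
    CloudOffer("public-a", price_per_hour=0.12, performance=1.5, availability=0.9995),
    CloudOffer("public-b", price_per_hour=0.07, performance=1.0, availability=0.995),
]
best = pick_cloud(offers, required_availability=0.999)
print(best.name if best else "no cloud meets the SLA")
```

Re-running the selection whenever prices change, and moving a workload only when the winner changes, would give the dynamic behaviour described above. Migration cost and data gravity are ignored in this sketch.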
With regard to OpenStack, there is no doubt that today the compute, network, and storage components are much better than four years ago, even enterprise-ready. I think that those core components will get much better and support larger scales, and that it will become easier and easier to upgrade seamlessly. The second priority is to have OpenStack integrate better with public clouds, with burst workloads, DR, and backup projects supporting us everywhere: on our OpenStack private cloud, in EC2, Google, and Azure. For example, Trove working the same way in the private cloud, on EC2 instances, Google Cloud, etc. Since the future is hybrid, it just makes sense to have those extra cool projects work for me everywhere I choose. I think it will make OpenStack much stronger.