Cloud Foundry Advisory Board Meeting - 2015 August
Some updates from the Cloud Foundry community, including changes to the development mailing list, and what core teams have been doing.
Chip Childers updated us on the Foundation, starting with conference events. They are currently looking for proposals for a CF Summit in Berlin, which will take place on the 2nd-3rd of November. The theme of the conference will be organizational stories of transformation and change. There will be a focus on technical discussions around how developers can take advantage of Cloud Foundry, as well as a focus on the platform engineering side for cluster operators within organizations. Chip highlighted the fact that the call for proposals does not ask for talks on the internals of the system. This is intentional, as they want the focus to be firmly on the user experience and how people are actually using it.
A second conference, for which they are still locking down specific dates, will be in Shanghai, China. It will also be a two-day event. They are looking at a December timeframe.
Scott Monson, VP of Business Development for the Foundation, has been working to get a number of special interest groups stood up for specific industry verticals. The intention is to let companies get together to talk about their experiences with Cloud Foundry (operating it, deploying it and using different distributions). Chip said the hope is that these companies will also share stories on how they are changing their organizations to take advantage of Cloud Foundry. He said it is just as important to ask them to collectively articulate any gaps they see for future development of the platform and the services ecosystem.
The first instance of a "SIG" happened in New York City for the financial services industry. It involved 12 companies from different parts of this industry. Chip said it was interesting to see the open and thorough discussions about some of the things they are looking for from Cloud Foundry. Financial services are notoriously heavily regulated and there was discussion on "how are you approaching the development-production pipeline?" Some of them are approaching it by having separate Cloud Foundry clusters for development and test and then another in production. Others are providing a common system and then expecting the application team to effectively organize themselves around the compliance rules. Chip is hoping they can clean up the feedback and make it public once they are sure that the notes they took reflect that specific industry as a whole. They will also be looking for commonality across different industry groups, especially between ones that are heavily regulated.
There has been some churn in the development mailing lists. The Linux IT team upgraded the system to Mailman v3 and added the HyperKitty web UI. This enables people to interact with the lists as online forums if they prefer that. There was also an outage over the weekend and through Monday. The IT team has resolved that problem and added some proactive end-to-end mail flow monitoring, so they can catch it earlier next time.
Cornelia Davis from Pivotal asked if it was possible to have a link at the bottom of the email to the web UI for that thread. Chip agreed this is very useful for sending somebody a link to say "hey, take a look at this conversation" and he has made the request for this feature, so it will be done.
Eric Malm from Pivotal gave an update on Diego's progress since the previous call. Last time he mentioned that they have been introducing a database service in front of their etcd database to give them more flexibility with data serialization and internal API versioning. They are almost done introducing this. One of the big tests of this will be switching the serialization of data in the database from JSON to the much more performant Protocol Buffers.
Larger-scale performance testing for Diego has been put on hold for the above-mentioned database service work. They are almost ready to re-run their 100-Cell, 10,000-application end-to-end test. A series of stress tests has been run on a single Diego Cell that is loaded up with a lot of resource usage from applications. This includes applications that are doing a lot of logging or using a lot of CPU, bandwidth and disk IO. This causes a slow-down on that particular Cell, but they didn't see any cascading failures throughout the system.
To improve security and robustness they have been working on Consul. They have improved their ability to operate the Consul cluster with BOSH, similar to how they operate etcd, and are currently working on securing access to the Consul cluster with mutual SSL authentication. They also have a strategy for encrypting the data that etcd keeps in its database and a strategy for rotating those keys.
Diego's support for Docker images has been improved. It now honours the USER directive which is included in the Docker image metadata. The caching registry, which SAP engineers have been working on, has also been improved. It can now run as a multi-node deployment with storage backed by an S3-compatible blob store.
Dmitry said BOSH is on a journey to simplify a lot of its manifest UI. The aim is to require less configuration and to make networking capabilities simpler to configure. There will also be availability configuration, which is still ongoing under the "global networking" track of work. An "AZ" (Availability Zone) track has also been started and they are making good progress with this. They will be putting these changes into BOSH-lite first, since BOSH-lite is not used in any critical environments. This is to get some early feedback.
Stemcell hardening is also being worked on. Features include checking permissions and boot-loader settings. A feature called "Trusted certificates" is being worked on, which allows for configuring all VMs with a custom CA certificate. Epics in the backlog include improving the CLI functionality around retrieving instance metadata. This is useful as Dmitry's team have occasionally wanted to know which persistent disks were attached to specific instances when doing debugging or recovery.
A new worker script called "pre-start" has been added, which is called when the job is started, but before the Monit processes are started. This allows for running long-running non-time-bounded migrations of data. This is in the early stages of development.
CPI improvements have been made based on pull requests and some bugs that were found while transitioning into using the Concourse CI system. This work is nearing completion. They have pipelines for all the officially supported CPIs and these have mostly stabilized. They are getting close to saying that people should be using the external CPIs instead of the internal CPIs.
Dr. Nic from Stark & Wayne asked if BOSH-init has a cache for compiled packages yet. Dmitry said it does not yet, but they are getting close to finishing compiled releases, which means they will soon be introducing compiled releases for BOSH-init to use. Dr. Nic said this was the main reason not to use BOSH-init six months ago and asked if it was a blocker for anyone else. Dmitry said that from the typical usage they have seen, even though it does add 10-15 minutes of bootstrapping time, people are not bootstrapping that often. He thinks this is why it has not come up in more conversations. Dmitry said they are also using BOSH-init for many other things and they will address the package cache at some point.
Amit Gupta from Pivotal gave an update on what the MEGA team has been up to. They are working on extracting etcd from cf-release and creating an independent release that is usable by Diego, CF and the general public. As part of this effort they have been testing it thoroughly. They now have a pipeline that exercises an etcd cluster through a robust cycle of scale up, scale down and rolling deploys. Dmitry's turbulence-release (a tool for injecting failure scenarios into a BOSH-deployed system) is also being used to exercise what happens when you kill a few VMs in the cluster. It then asserts that etcd is still available and does a bosh cck to bring the cluster back up. Amit would like to scrub some credentials from the CI output before making this a fully open pipeline, including its configuration.
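The availability assertion after killing VMs comes down to the majority rule of the Raft protocol that etcd uses: a cluster stays available only while more than half of its members are reachable. The following is a minimal sketch of that arithmetic only. The function names are hypothetical, and it does not call etcd, BOSH or turbulence-release, nor does it reflect how the MEGA team's pipeline is actually written.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a Raft cluster with `members` nodes."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can be lost while a majority remains."""
    return members - quorum_size(members)


def is_available(members: int, killed: int) -> bool:
    """True if the surviving members still form a majority."""
    if not 0 <= killed <= members:
        raise ValueError("killed must be between 0 and members")
    return members - killed >= quorum_size(members)
```

Under this rule a three-node cluster tolerates one killed VM and a five-node cluster tolerates two, so a failure-injection test that expects etcd to stay available has to kill fewer than half of the members.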
Work has gone into bootstrapping the pipelines to generate templates and manifests for Concourse CI and BOSH. This is also intended to be open and documented. How everything is tested will be transparent and reproducible. In the longer term, Amit said their goal is for anyone to be able to pick up the CloudFormation template and run the same scripts to create their own environment.
The next features are aimed at composing multiple releases into a single deployment. They want to extract things like etcd, Consul, UAA and Loggregator from cf-release, but still compose them as a single release with a single manifest. The development, testing and integration (with the rest of CF) processes for releases need to be considered, as well as what the downstream integration environment will look like. As the unofficial "etcd team" they will be guinea pigs in proving out how this will work. After that they will address the next level of etcd testing that they want to do.
In the chat, Josh McKenty of Pivotal asked, "Can we think about running the full Jepsen test suite?" Amit said that Jepsen tests etcd itself. It tests that etcd tolerates network partitions, that the Raft protocol is implemented correctly and that operations are linearizable. The work that the MEGA team has done is to make etcd operable using BOSH. For instance, can you add a new node to the cluster without having to delete your cluster and bring it back? In summary, the MEGA team is focused on operability, not on things like consistency and availability that Jepsen tests, said Amit.
Dr. Nic said it would be good if more people knew about this etcd BOSH release and how easy it makes deploying etcd. Amit said it would be good if these BOSH releases were all independently valuable and tested in a standalone way. He said as soon as the credentials are dropped from the pipeline he would like to blog about what has been done and possibly create a homepage for it.
Erik Jasiak from Pivotal gave an update on Loggregator from the past month. Major fixes have been made for syslog drain buffering, delays and Doppler buffering in general. He said you should now see increased throughput of log messages, whether from Java stack traces or from any other application. A one-millisecond delay was fixed and Doppler buffering was made configurable. This should be available in cf-release v215. Erik said he would send out information about how to change a BOSH configuration to increase the buffering size.
The next big push is to enable the firehose to have a nozzle for the varz endpoint. This is to continue to support the Collector for the short-term, even with Diego which does not expose a varz endpoint.
A Lattice nozzle prototype has been built, which is focused on Datadog.
MEGA integration and build pipeline work is being wrapped up. Erik apologized for falling behind with pull requests and other community work during this MEGA work, but they have had a big push to catch up and would like to know if there is anything that has been missed.
The final update from Erik was that the Dojo with ActiveState is still continuing.
Dieu Cao from Pivotal gave the update from her CAPI team. The team incepted in July with the mandate: "By providing a simple and well documented API, the CAPI team is responsible for extracting platform features and policy complexity from developers, so that they can easily use the platform from a variety of clients".
Versions 2.5 and 2.6 of the Service Broker API have been announced since the last CAB call. These include support for arbitrary service parameters, instance tags, service keys and including the service ID in the update request. Dieu said she saw a request from Mike Youngstrom of The LDS Church asking if they could also include that service ID in other requests. Stories have been created for this.
Dieu requested more feedback on the asynchronous operations for the Services API. She specifically wanted to hear from people who have implemented it and have had success with it or thoughts on whether the document could be improved. Once enough people have had success with it, she would like to bring it out of the experimental phase.
Work on application process types continues for v3 of the Cloud Controller API. They are also documenting the patterns they have used in a style guide. They are coming to a consensus within the team about how to represent relationships and how to include related resources in a request so that you do not have to make multiple API calls for related lookups.
Private brokers and service instance dashboard are also being worked on.
Dr. Max from IBM gave the CLI update for Greg Oehman from Pivotal, who was unavailable. The CLI team has been down to a single developer. A second developer is being ramped up so that they have one full pair for pair programming. They have been working on the plugin API and keeping up with pull requests and issues. Upcoming, they have some support for Diego planned, refactoring of the CLI help output and better manifest support.
Marco Nicosia from Pivotal left an update on Core Services in the agenda document as follows...
- v22 is GA, beginning to assemble a backlog for v23. Spending a lot of time thinking about making service plans more flexible.
- Team still working on various bugs, etc. including possibly working with a partner to enable ERT migration from singleton Postgres to HA mysql.
- A tremendous amount of effort going into switching to Concourse.
- Coming releases will start to focus on the proxy, to increase capacity and reduce possibility of deadlocks.
Marco also left an update for Lattice...
- Lattice v0.3.0 is GA, now includes Buildpacks support!
- Now beginning work to incorporate TCP router, including a lot of wrangling over CLI syntax, etc.
- The inclusion of TCP router will enable a form of service discovery
- Next items: Clean-up release, ssh and private docker registry
Shannon Coen from Pivotal gave an update on Routing. Shannon said that, as mentioned in Marco's notes, they have submitted their first pull request for TCP routing support for inclusion in an upcoming release of Lattice. They are looking to integrate TCP routing support in Cloud Foundry by collaborating with the CAPI team.
Route services work is continuing, to allow a service from the marketplace to be injected into the route path for transformation, rate limiting or authorization. They are at a point where they are ready to collaborate with the CAPI team to add this functionality to Cloud Foundry. Until now, they have been doing a lot of work in GoRouter to add the ability to do the forwarding, and also on the Routing API, which will eventually replace NATS.
Since there was nobody present to give a Garden update, Dieu said she had found some email updates which she would share instead.
The Garden team are working on security for Docker image support, as well as support for v2 of the Docker Registry. They are starting to put thought into performance, stability and what a runC-based future might look like. The disk quota accounting issue is being fixed.
Dieu also gave an update for the Greenhouse team, who have been exploring security in isolation and looking to implement CATS tests in a Windows context. They are also ensuring feature parity with Cloud Foundry and Diego.
Dieu said that the UAA team have incepted on SAML groups and user attribute handling in UAA. They have started work on role based access for Pivotal's SSO (single sign-on) service. Work on a standalone BOSH release for UAA continues.
The recent release of UAA was v2.5.1 which includes role-based access to identity zones and bug fixes around notifications integration for identity zones.
SSL support for the UAA BOSH release was completed and integration tests are being run as part of the BOSH release pipeline. Some low-severity CVEs are being addressed and the fixes will be released in the upcoming v2.5.2 release.
Dr. Max from IBM gave an update on the recently incubated Abacus project. He said they had an inception and thanked Dieu for helping with that.
The key epics they came up with are:
- Usage accumulation - automated, region and plan levels and different time dimensions
- Usage reporting - customize usage reports, audit trails, default reports
- Usage aggregation - region and plan level aggregations, collection of orgs (accounts)
- Rating - realtime notifications for threshold, pricing configuration for region, country, currency, org
- Onboarding - make Abacus self-onboarding for service brokers
- Security - secure APIs like all other CF APIs
- APIs - refine and make more consistent
- Pipeline - concourse, bosh-release
- Apps usage - add App usage component via CF public APIs
- Docs - APIs docs, better docs
- Misc - technical debt, minimal dashboard, be more defensive on data consumption
Dr. Max said that there is a lot of work to be done and they are hoping to get additional resources from interested parties such as SAP or Pivotal. The IPM occurred on 13th August and he posted the details on the mailing list so anyone could join the call. IPMs are generally only for the team doing the work, but since they are moving fast they thought they would open it up.
"Super Important Question"
Dieu and Dr. Max opened up the call for questions. Dr. Nic said he had a "super important question". He asked, with ActiveState selling Stackato to HP, where will the minutes of this call be found? As usual the link will be posted on the mailing list and if you are reading this, you found them!
Cloud Foundry v3?
Dr. Nic also asked if there were any plans to cut a v3 of Cloud Foundry or if this had been discussed. Dieu said no, but there is a v3 of the Cloud Controller API in development and versions of other things like Diego are separate. She said they have not talked at all about making a product "v3". Dr. Nic said people like new external versions and it would help with the news cycle side of things. Dieu said she is not on the marketing side, so she personally does not have any plans in that direction.
Erik in Boulder asked about Slack usage. He said teams are using Slack more and more and that they are still running on a free account. He believes that Slack is becoming a critical part of everybody's tooling, yet on the free plan the Slack history is limited to 10,000 messages. There are also limited integrations. He wanted to know if there are any plans to address this or whether other options should be considered. Chip said that it is actively being resolved, but that he would not go into details given the current state of negotiations. He could not give an estimate on the timeline, but he was due to receive an update by the end of the week from the people involved. Dr. Max said they found Gitter, which has nice integrations with GitHub and feels very much like Slack.
I asked how to get to the Slack channels. Dieu said the Slack account, because it is free, has a limited membership and so is being restricted to inter-team communication for now. She said Chip was looking at other ways for people to contact the teams.
Published at DZone with permission of Phil Whelan, DZone MVB.