Development of System Configuration Management: Summary and Reflections
Custom-built SCM enhances infrastructure reproducibility and automation, but persistent challenges require continuous architectural improvements and smarter decisions.
Series Overview
This article is Part 4 of a multi-part series: "Development of system configuration management."
The complete series:
- Introduction
- Migration and evolution
- Working with secrets, IaC, and deserializing data in Go
- Building the CLI and API
- Handling exclusive configurations and associated templates
- Performance consideration
- Summary and reflections
Unsatisfied Expectations
When we started working on this, we imagined something like Kubernetes, but for servers. Everything was supposed to happen automatically: creating or modifying a single file should lead to the creation of new servers, the application and refresh of configurations, cluster assembly, alignment with the desired state, and integration with other tools, all while remaining flexible for managing deployable hosts and software. In most cases, this works, but there are exceptions. Despite all the advancements in automation, some aspects remain unsatisfactory.
Disk Partition Manager
Currently, we are unable to use the disk partition manager effectively. We have made three attempts to write this module, but have not succeeded. The difficulties in managing partitions arise from the numerous methods available for interacting with disks, which can involve many nested levels. Each new nested level has a unique name that the code must track. For example, the main disk is named sda, its partition is sda1, the MDRAID is md1, and there can be LVM on top of that. To complicate matters, we can assemble LVM over GPT partitions, and we may also have encrypted partitions using LUKS, each with its own unique name.
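To make the naming problem concrete, here is a minimal sketch of how such a stack could be modeled as a tree. The `Layer` type, the `Paths` helper, and the volume name `vg0-data` are hypothetical and are not taken from our SCM; the point is only that every level adds a name that the code has to resolve before it can touch the level above it.

```go
package main

import "strings"

// Layer is one level of a hypothetical storage stack
// (disk, partition, MDRAID array, LUKS mapping, LVM volume).
type Layer struct {
	Kind     string // e.g. "disk", "part", "mdraid", "luks", "lvm"
	Name     string // kernel or device-mapper name at this level
	Children []*Layer
}

// Paths returns every root-to-leaf chain of names, joined with " -> ".
func Paths(root *Layer) []string {
	return walk(root, nil)
}

func walk(l *Layer, prefix []string) []string {
	// Copy the prefix so that sibling branches do not share a backing array.
	chain := append(append([]string{}, prefix...), l.Kind+":"+l.Name)
	if len(l.Children) == 0 {
		return []string{strings.Join(chain, " -> ")}
	}
	var out []string
	for _, child := range l.Children {
		out = append(out, walk(child, chain)...)
	}
	return out
}
```

For the example above, a single chain would read `disk:sda -> part:sda1 -> mdraid:md1 -> lvm:vg0-data`, and a real module has to discover and verify each of those names on the host before acting on the next one.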
While this structure is understandable, implementing it in code leads to numerous bugs and complicates debugging during development. Testing requires physical devices, and many of these actions are not repeatable: the disk layout has to be restored to a clean state before every test run, even after a small fix. Furthermore, Linux lacks a convenient and versatile API for working with LVM, LUKS, MDRAID, and GPT partitions.
Questions also arise, such as: What if the size does not match what is stored in the configuration? What happens if you need to format the filesystem, or if a disk format is not required? These and many other issues complicate the process.
As a solution, we have decided to use kickstart scripts on CentOS to manage partitioning instead of handling it at the SCM level. This approach may lead to issues, especially because disks can fail at any time, while the kickstart runs only once during the OS installation. Simple cases can be configured automatically by the SCM, but complex situations require manual intervention.
Redis/Sentinel Managers
The Redis/Sentinel managers are also not production-ready at this time. Software that rewrites its own configuration is problematic. Initially, when we developed this module, we faced a situation akin to a "machine war." The pull-mode SCM, which checks the configuration at least twice a minute, consistently identified changes and restarted Redis and Sentinel. Each restart made Redis and Sentinel rewrite the configuration again, leading to an endless cycle of conflicts.
We have since disabled the automatic restarts of Redis and Sentinel that were triggered by changes to the Redis configuration file, and we handle these changes manually. It could be beneficial to stop managing the Redis configuration file altogether and instead check the actual parameters through Redis protocol commands such as CONFIG GET parameter and CONFIG SET parameter value. This approach could entirely remove the need to modify the configuration file, and I have proposed it for future changes.
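A minimal sketch of that proposal, assuming the current values have already been read with CONFIG GET into a map: the hypothetical `ConfigSetCommands` function below only computes which CONFIG SET commands are needed and leaves sending them to whatever Redis client is in use. It is an illustration of the idea, not code from our SCM.

```go
package main

import "sort"

// ConfigSetCommands compares the desired parameters with the values
// reported by CONFIG GET and returns the CONFIG SET commands needed
// to converge. Parameters that already match produce no command, so
// an unchanged instance is never touched.
func ConfigSetCommands(desired, actual map[string]string) [][]string {
	keys := make([]string, 0, len(desired))
	for k := range desired {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable order makes the output predictable

	var cmds [][]string
	for _, k := range keys {
		if v, ok := actual[k]; ok && v == desired[k] {
			continue
		}
		cmds = append(cmds, []string{"CONFIG", "SET", k, desired[k]})
	}
	return cmds
}
```

One caveat with this approach is that Redis may report values in a normalized form (for example, sizes in bytes), so the desired values would need to be written in the same form, or normalized before the comparison.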
Multi-Instance API Locking
The multi-instance locking with multiple SCM APIs generally works well until something goes wrong. For example, if an API goes down during the configuration build and cannot release the lock, we must wait for it to free up. Throughout the development of the new SCM, the lock lifetime has changed multiple times. We base the new TTL on the speed of generating new configurations, as some features slow down the API while others speed it up. Often, this results in halting configuration rebuilds for ten minutes or more.
To address this, we have implemented an alert system that detects when configuration rebuilds have not occurred for the last N minutes and notifies the SRE team. SRE engineers can then manually delete the lock, which has been sufficient so far.
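The interplay between the lock TTL and the alert can be sketched as follows. The `Lock` record, `TryAcquire`, and `RebuildStalled` are hypothetical names for illustration; the real lock lives in an external store, while this sketch only models the decision logic with explicit timestamps.

```go
package main

import "time"

// Lock models a rebuild lock record as it might be kept in a KV store.
type Lock struct {
	Owner    string
	Acquired time.Time
	TTL      time.Duration
}

// Expired reports whether the lock has outlived its TTL at time now.
func (l Lock) Expired(now time.Time) bool {
	return now.Sub(l.Acquired) > l.TTL
}

// TryAcquire returns the lock record that should be stored and whether
// the caller holds it. A live lock owned by another API instance blocks
// the caller; an expired lock (for example, left behind by a crashed
// instance) is taken over.
func TryAcquire(current *Lock, owner string, ttl time.Duration, now time.Time) (Lock, bool) {
	if current != nil && !current.Expired(now) && current.Owner != owner {
		return *current, false
	}
	return Lock{Owner: owner, Acquired: now, TTL: ttl}, true
}

// RebuildStalled is the alert condition: no successful rebuild for
// longer than threshold.
func RebuildStalled(lastRebuild, now time.Time, threshold time.Duration) bool {
	return now.Sub(lastRebuild) > threshold
}
```

The trade-off described above is visible here: a TTL shorter than the slowest rebuild lets a second instance start a redundant rebuild, while a TTL that is too long extends the stall after a crash.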
Outdated Hosts
Another complex problem is the deletion of outdated and removed hosts. The SCM authorizes hosts by IP address and the X509 certificate signature. It deletes old keys through pipelines and integration with our inventory system, but this occurs only once an hour. The worst-case scenario is when one host is deleted and a new one is created with the same IP address, breaking the authentication logic. Consequently, SRE engineers must manually delete the X509 certificate signature from Consul.
Password Changes
Currently, the SCM can manage user passwords in /etc/shadow, and it also has an htpasswd manager. The main issue is that the SCM does not match user credentials effectively. To stay up-to-date, passwords are reset to new ones every time. While this implementation is easier because it does not require logging in to check passwords before changing them, it is somewhat alarming to the security department that our /etc/shadow file changes twice a minute. However, logging in twice a minute for any user might also be concerning for them.
TLS Certificate Management and Renewal
Due to the storage of TLS certificates in Vault, we currently lack an implementation to scan all certificates stored there. In the future, we should implement this functionality to facilitate certificate renewal. Renewing the CA, however, is more complex, and we need to research how to perform this without causing downtime in the infrastructure, as updating all certificates synchronously is currently not feasible.
service_restart Hook
This hook has two problems:
- If a certain number of configuration changes should result in a service restart, the service will restart multiple times. For example, we have a hook that restarts a service due to package reinstallation and configuration changes, resulting in the service being restarted twice. Modern SCMs can deduplicate such events, and we plan to address this in the future, as it currently does not lead to significant problems.
- If a change attempt fails, the restart hook is triggered anyway. This hook is activated every time a change attempt occurs. This problem is related to our implementation of package installation. Yum and DNF on CentOS do not have an API; thus, the SCM simply runs the command yum install {set of packages}. Sometimes, packages cannot be installed due to reasons such as:
- No such version (error in configuration)
- The mirror is down
- Corrupted metadata
- Corrupted RPM database
The SCM should parse the stdout/stderr for each package and make decisions on running hooks based on that output. This is currently part of our backlog.
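Both problems come down to collecting the results of a run before firing any hooks. A minimal sketch of that idea follows; the `Change` type, `ErrInstallFailed`, and `ServicesToRestart` are hypothetical and do not reflect the current implementation, which restarts on every change attempt.

```go
package main

import "errors"

// ErrInstallFailed is an example error for a failed package installation.
var ErrInstallFailed = errors.New("package installation failed")

// Change is the outcome of applying one resource during an agent run.
type Change struct {
	Resource string // e.g. "package:nginx" or "file:/etc/nginx/nginx.conf"
	Service  string // service to restart if this resource changed; may be empty
	Changed  bool   // true only if the resource was actually modified
	Err      error  // non-nil if the change attempt failed
}

// ServicesToRestart returns each affected service at most once, in
// first-seen order. Failed attempts and no-op changes never trigger
// a restart.
func ServicesToRestart(changes []Change) []string {
	seen := make(map[string]bool)
	var out []string
	for _, c := range changes {
		if c.Err != nil || !c.Changed || c.Service == "" || seen[c.Service] {
			continue
		}
		seen[c.Service] = true
		out = append(out, c.Service)
	}
	return out
}
```

With this shape, a package reinstallation and a configuration change for the same service collapse into a single restart, and a failed yum run contributes nothing to the restart list.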
Dependency on Other Infrastructure
Dependency on the functioning of infrastructure components such as Consul, Vault, and the inventory system is critical for the new SCM. Software that relies on multiple storage systems should have up-to-date snapshots of data from those systems to be independent of their speed and downtimes. This is the price of close integration with our company services.
Some downtime issues can be mitigated by introducing caching. For example, if the inventory system is down, the impact on the new SCM is minimal if it can still function for an hour. However, if it remains down beyond that, we risk using outdated inventory information, as the caches have only a 1-hour TTL.
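The caching behavior can be sketched as a stale-on-error cache: fresh data is always preferred, and the cached copy is served only while the source is failing and the copy is younger than the TTL. `InventoryCache` and its methods are hypothetical names, and the sketch is not safe for concurrent use without an added mutex.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrStaleInventory is returned when the source is down and the cached
// copy is missing or older than the TTL.
var ErrStaleInventory = errors.New("inventory unavailable and cached copy expired")

type cacheEntry struct {
	hosts   []string
	fetched time.Time
}

// InventoryCache wraps a fetch function with a TTL-bounded fallback.
type InventoryCache struct {
	ttl   time.Duration
	fetch func(group string) ([]string, error)
	data  map[string]cacheEntry
}

func NewInventoryCache(ttl time.Duration, fetch func(group string) ([]string, error)) *InventoryCache {
	return &InventoryCache{ttl: ttl, fetch: fetch, data: make(map[string]cacheEntry)}
}

// Hosts asks the source first. If the source fails, it falls back to
// the cached copy as long as that copy is not older than the TTL.
func (c *InventoryCache) Hosts(group string, now time.Time) ([]string, error) {
	hosts, err := c.fetch(group)
	if err == nil {
		c.data[group] = cacheEntry{hosts: hosts, fetched: now}
		return hosts, nil
	}
	if e, ok := c.data[group]; ok && now.Sub(e.fetched) <= c.ttl {
		return e.hosts, nil
	}
	return nil, fmt.Errorf("%w: %v", ErrStaleInventory, err)
}
```

Returning an explicit error after the TTL, rather than an empty list, matters: the caller can then skip the rebuild instead of treating the hostgroup as empty.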
Vault and Consul are critical dependencies, serving as master databases for secrets, dynamic configuration, caches, and authentication data.
Limitations of Horizontal Scaling
The SCM API can now run on multiple servers, with a locking system managing their operations. These servers deploy configurations simultaneously and respond to requests, allowing for horizontal scaling. However, one aspect that lacks scaling functionality is the configuration renewal process. As mentioned earlier, our SCM uses cached configurations, and updating this API requires regenerating the cache.
Currently, the cache generator cannot operate simultaneously on multiple servers. The locking system is designed to prevent the cache from being built simultaneously, as redundant rebuilding can lead to excessive loading of the system.
Currently, this is not a significant issue since the time required to rebuild configurations for a few thousand servers is manageable. However, if the infrastructure were to grow at least tenfold, we would need to address this limitation. I've identified two potential approaches to solve this problem:
- Shard-based separation: We could separate all hosts by a sharding key, allowing each API to generate configurations for only a subset of hosts. This would improve cache renewal performance and speed up the deployment of new changes. However, this approach could compromise high availability advantages. To mitigate this, all sharded APIs must be launched in pairs, each with its own locking system.
- Situational separation for configuration regeneration: This approach involves storing the date of the last built configuration in Consul or using a task manager like any Message Queue system. When configuration generation starts, the selector identifies outdated configurations in real-time and only regenerates those. If another configuration expires shortly thereafter, a different API scheduler will pick it up and regenerate it. In the case of an MQ system, a lightweight producer can store rebuild tasks in the queue, with multiple generators consuming tasks to rebuild simultaneously.
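As an illustration of the first approach, a sharding key can be as simple as a hash of the hostname. This is a sketch under the assumption that hostnames are stable and unique; `ShardFor` and `HostsForShard` are hypothetical names, and FNV-1a is just one reasonable choice of hash.

```go
package main

import "hash/fnv"

// ShardFor maps a hostname to one of n shards using the FNV-1a hash.
// The mapping is deterministic, so every API instance computes the
// same assignment without coordination.
func ShardFor(host string, n uint32) uint32 {
	if n == 0 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(host)) // Write on a hash.Hash never returns an error
	return h.Sum32() % n
}

// HostsForShard returns the subset of hosts that a given shard is
// responsible for rebuilding.
func HostsForShard(hosts []string, shard, n uint32) []string {
	var out []string
	for _, host := range hosts {
		if ShardFor(host, n) == shard {
			out = append(out, host)
		}
	}
	return out
}
```

A plain modulo scheme like this reassigns most hosts when the number of shards changes; if shards were added or removed often, consistent hashing would be a better fit.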
Incidents
Overload of Package Storage
Over the last three years of implementing our new SCM, we have experienced incidents where the package store became overloaded due to excessive requests from the SCM related to package update tasks. Typically, we have hundreds of servers receiving updates from the API at least twice a minute. However, there was one incident where, during the update of a package that we accidentally forgot to upload to our artifact store, we unintentionally created a large influx of update requests.
Due to errors during the package installation, the agent kept trying to install the package repeatedly. Since we use a pull model, these requests were not synchronized, leading to a domino effect: over time, the influx of requests grew so large that the artifact store failed. The only solution was to disable all SCM agents in the infrastructure.
The SCM agents, which were supposed to operate independently, were unaware of the traffic they were generating and did not know how many requests were being sent to the package store at any given time.
Such incidents are rare, and the artifact storage support team was able to implement caching mechanisms to prevent this from happening in the future. However, when using independent agents for mass updates, this should always be kept in mind.
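One common client-side complement to caching, which we did not have at the time of the incident, is exponential backoff with jitter on failed installs. The sketch below shows the general technique rather than our agent code; `BackoffCeiling` and `Jittered` are hypothetical names.

```go
package main

import (
	"math/rand"
	"time"
)

// BackoffCeiling returns the maximum delay before retry number attempt
// (0-based): base doubled once per attempt, capped at max.
func BackoffCeiling(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// Jittered picks a random delay in [0, d] ("full jitter") so that
// independent agents do not retry in lockstep. A nil rnd uses the
// package-level random source.
func Jittered(d time.Duration, rnd *rand.Rand) time.Duration {
	if d <= 0 {
		return 0
	}
	if rnd == nil {
		return time.Duration(rand.Int63n(int64(d) + 1))
	}
	return time.Duration(rnd.Int63n(int64(d) + 1))
}
```

An agent would sleep for `Jittered(BackoffCeiling(failures, time.Second, 10*time.Minute), nil)` before retrying a failed install, which bounds the request rate per host even when every host hits the same missing package.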
Issues With Bad Responses From Data Sources
We encountered a problem in the code when CMDB administrators decided to perform a full redeploy of the CMDB in Kubernetes. For a short time, it functioned without ingress until the redeployment process was completed. Our new SCM mistakenly accepted 404 responses as valid, leading it to conclude that there were no resources in the hostgroup, which resulted in the redeployment of many configurations and the disassembly of clusters.
We also experienced similar issues with Vault responses. Sometimes it returned incorrect responses (such as 4xx errors), which our new SCM interpreted as a valid response with empty keys. Since the SCM manages password and certificate generation, these modules attempted to regenerate all certificates and store them in Vault. As a result, our infrastructure immediately began regenerating passwords and certificates.
Fortunately, many of these problems did not lead to major incidents because our SCM agents checked the software for health before restarting the next backend. Now, our new SCM checks response codes and validates answers before deployment on the servers.
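The kind of check involved can be sketched as below. `CheckHostgroupResponse` and `ErrUnusableResponse` are hypothetical names, and the rule that a previously non-empty hostgroup must not suddenly become empty is an illustrative assumption rather than a description of our exact validation.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrUnusableResponse marks a response that must not be used to build
// configurations.
var ErrUnusableResponse = errors.New("data source returned an unusable response")

// CheckHostgroupResponse rejects anything other than HTTP 200, and also
// rejects an empty host list for a group that previously had hosts.
// On error the caller should keep the last known good data instead of
// rebuilding with an empty hostgroup.
func CheckHostgroupResponse(status int, hosts []string, previouslyNonEmpty bool) error {
	if status != http.StatusOK {
		return fmt.Errorf("%w: HTTP %d %s", ErrUnusableResponse, status, http.StatusText(status))
	}
	if len(hosts) == 0 && previouslyNonEmpty {
		return fmt.Errorf("%w: empty host list for a group that had hosts", ErrUnusableResponse)
	}
	return nil
}
```

The same principle applies to secrets: a 4xx from the secret store is an error to surface, not an empty key set that justifies generating new passwords and certificates.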
Summary
SCM is used in many infrastructures, and for good reason. It makes infrastructure reproducible, makes changes trackable, allows deploying a wide range of servers, and so on. Even a small infrastructure of only a handful of servers should be deployed by an SCM, although at that scale it can decrease the productivity of the SRE team. However, the benefits outweigh this cost: infrastructure reproducibility speeds up hostgroup scaling, makes migrations between IaaS providers easier, and lets you bring up replacement servers when old ones break down.
The bigger the infrastructure, the more productive the SRE team when it uses the SCM.
Every detail of the SCM architecture contributes to the overall productivity of the SRE team when the infrastructure is large.
This leads many teams to customize a classic SCM in their own way or to create something new. These approaches have different pros and cons. Choosing between them can lead to problems if the decisions and architecture are not well thought out. In this article, we considered the challenges of developing your own tool to deploy infrastructure. In our case, it gave us the ability to implement anything we could write in Golang. However, if poor decisions had been made, they could have limited our capabilities and made the tool so inconvenient that we might have had to abandon it over time.
At the start of the project, we didn’t pay much attention to documentation. Instead, we focused on providing examples of configurations for the new SCM. This included various YAML files that describe the installation of services and clusters, as well as the configuration and setup of hosts.
On the other hand, we decided to use a module called "dummy." This module outlines the code components of the project that can serve as placeholders for new modules. In other words, we have a template that allows us to copy, paste, and modify certain parts to create new modules for deploying software.
We concentrated our efforts on reducing the time spent maintaining documentation, and in most cases, this approach has worked. Currently, it is easier to configure hosts using YAML hostgroup files, but it has become more challenging to write code.
Now we see a potential solution for the future: moving all modules to structures instead of interfaces and using automation to generate documentation. Tools like Swagger can automatically create documentation from code, saving us time and improving both the code quality and the descriptive aspects of our SCM.
When we started working on our own SCM, we spent a lot of time maintaining it. One factor that simplified this was our decision to keep the old SCM operational alongside the new one. We introduced new functionality in the new SCM, and migrating all infrastructure to it took about three years. In my opinion, this process took longer than if we had migrated to any other SCM. I attribute this to several factors:
- We needed to develop playbooks in a Golang implementation, which took approximately the same amount of time. However, once we developed all the common managers, the process became easier. It works like an Ansible playbook, but instead of using YAML code, we use functions with parameters in Go.
- Building the SCM required significant time spent on architectural planning, development, testing, and maintenance. In contrast, a classic open-source SCM has already been developed.
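To show what "functions with parameters" can look like, here is a minimal sketch of an idempotent file manager. `EnsureFile` and `EnsureAbsent` are hypothetical names written for this article, not the signatures used in our SCM; each call reports whether it changed anything, which is what hooks would key on.

```go
package main

import (
	"bytes"
	"errors"
	"os"
)

// EnsureFile makes sure path exists with the given content and
// permission bits. It reports whether anything had to be changed.
func EnsureFile(path, content string, mode os.FileMode) (bool, error) {
	current, err := os.ReadFile(path)
	switch {
	case err == nil && bytes.Equal(current, []byte(content)):
		info, statErr := os.Stat(path)
		if statErr != nil {
			return false, statErr
		}
		if info.Mode().Perm() == mode.Perm() {
			return false, nil // already in the desired state
		}
		return true, os.Chmod(path, mode.Perm())
	case err != nil && !errors.Is(err, os.ErrNotExist):
		return false, err
	}
	if err := os.WriteFile(path, []byte(content), mode.Perm()); err != nil {
		return false, err
	}
	// WriteFile applies the mode only when creating the file and is
	// subject to the umask, so set the permissions explicitly.
	return true, os.Chmod(path, mode.Perm())
}

// EnsureAbsent removes path if it exists and reports whether it did.
func EnsureAbsent(path string) (bool, error) {
	err := os.Remove(path)
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```

A service-specific manager then becomes ordinary Go code that calls such helpers in sequence, where a playbook would list YAML tasks.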
On the positive side, many managers are written and tested once and do not require long-term development. This is similar to playbooks in Ansible or formulas in Salt: once you create a convenient configuration, you can reuse it multiple times with group variables and pillars. We have currently developed 36 custom managers (for example, for Nginx, Envoy, and Kafka) and 24 common managers (file, service, and so on). Now, we can create the simplest managers in less than an hour. The most challenging managers were the partition, PKI, and Consul managers. The foundational communication between the agent and API, goroutines, and caches also took considerable time, but that is a task that only needs to be done once. Because the SCM was developed in parallel with other work, we did not allocate dedicated time for it. Instead, we folded it into other tasks and implemented new functionality as needed.
For example, if someone requested the introduction of basic authentication for the project, we spent slightly more time on that task by creating a manager that could populate the htpasswd file with new users. Occasionally, if a task had a tight deadline, we implemented it in the old SCM. Overall, we could have developed and migrated to the new SCM faster (within a year), but the three-year transition was smoother and more evolutionary.
Generally speaking, it's difficult to estimate the actual labor costs associated with work done before and after introducing the new SCM. Sometimes, situations arise where the playbooks of the current SCM are suboptimal. Prolonged maintenance of some infrastructure, new engineers, significant changes to services, and a lack of time for code refactoring can result in playbooks that are not well optimized, even if they use universal templates that adapt to various situations. Some common issues include:
- When a developer of a playbook opts for non-universal defaults, resulting in most hosts needing explicit configuration. This significantly expands the variable files for each cluster.
- Hardcoding configuration parameters in a template filled with numerous conditional statements relevant only for one cluster can become unmaintainable.
- Employing poor architecture, such as including configurations that should be explicitly described and transferred to servers instead of using a single file with a template that accommodates all situations, increases the amount of configuration required.
- Versioning templates for software that can be applicable to different clusters leads to playbooks like 'redis_new' or 'redis_final' that contain relevant configurations and coexist with outdated modules that are unsupported and use old approaches.
These are just a few examples of issues that accumulate over time. Every team faces these challenges, leading to the need to rewrite such modules from scratch. This is true for any SCM. Any configuration in YAML must eventually be refactored and rewritten due to changes. Moving to a new SCM is similar; we alter everything in alignment with the new standards. Clearly, modules developed from scratch typically have far fewer mistakes and are better suited to fit your infrastructure. The problems are apparent in the old SCM configuration files, and new modules are developed based on the lessons learned from past mistakes. As a result, the new SCM has more tailored modules than the old one.
It is also psychologically challenging to refactor old modules compared to transitioning to a new repository with new configurations. With a new repository, we aim to keep it clean and only migrate refactored configurations. It does not matter whether you use open-source SCM or develop your own custom SCM project.
Measuring the Success of the New SCM Implementation
One significant question following the development and complete migration to the new SCM is whether it has improved our processes. First and foremost, common tasks can now be addressed much faster. We no longer spend as much time on development. For example, we now have only 3–4 small pushes to the repository per month in our Go code.
On the plus side, we enjoy many conveniences. For instance, there is no need to wait for a new VM to deploy configurations. SRE engineers can describe resources and services in a single file, push it to the repository, and be confident that in ten minutes, we will have a sufficient number of hosts with deployed configurations, assembled clusters, and generated passwords and certificates. This significantly reduces the time spent on routine tasks.
We attribute this efficiency to the high level of connectivity between services relevant to our company, including:
- Ordering a new VM.
- Monitoring settings that can be pushed to the API. If you specify the service, you need to link the host and service to the monitoring system. The new SCM can facilitate this without human intervention.
- Integration with Vault for secret storage.
- Integration with Consul for dynamic configuration.
- Various other APIs.
As we mentioned earlier, we wrote all modules for generating configurations for services while transitioning to the new SCM. All teams encounter the necessity to refactor old configurations, and regardless of the SCM used, playbooks need to be rewritten. So yes, we think it was the right decision, and we would be ready to take this path again. Many companies with large infrastructure do not use a stock SCM as is. They rewrite parts of it and build and maintain new modules that fit their own needs.
Developing your own tool from scratch is harder than customizing a classic SCM, but it opens up the opportunity to do anything with the infrastructure.
What Would We Like to Change Now if We Knew in Advance?
For a long time, we wanted to eliminate the usage of agent-level variables. However, when we began migrating to a new OS version, we had to set different parameters based on host variables, not just group variables. We resolved this issue, but the realization of this need came only two years after we started development. When we identified the need for host-level variables, we also implemented the ability to select individual hosts for deploying distinct configurations.
Additionally, we now believe that a better approach to migrating hosts would involve using infrastructure tests like Goss. This could help identify any neglected configurations on new hosts and highlight differences between the settings in the old SCM and the new one. We can ensure that we describe all necessary configurations, or we may need to enhance our tests.
The generated test file for Goss (or any other toolset) can be created manually or automatically by the SCM. For example, checking existing files and their restrictions that are known in advance could be saved in the SCM configuration. Then, the SCM can generate a test manifest for all files that it manages.
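As a sketch of the automatic variant, the function below renders a Goss-style `file` section from a list of files the SCM manages. `ManagedFile` and `GossManifest` are hypothetical names, and only the `exists`, `owner`, `group`, and `mode` attributes are emitted; a real generator would cover more resource types.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ManagedFile describes one file known to the SCM configuration.
type ManagedFile struct {
	Path  string
	Owner string
	Group string
	Mode  string // octal permission string, e.g. "0644"
}

// GossManifest renders a Goss-style YAML manifest with one entry per
// managed file, sorted by path so the output is stable between runs.
func GossManifest(files []ManagedFile) string {
	if len(files) == 0 {
		return "file: {}\n"
	}
	sorted := append([]ManagedFile(nil), files...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Path < sorted[j].Path })

	var b strings.Builder
	b.WriteString("file:\n")
	for _, f := range sorted {
		// %q yields a double-quoted string, which is valid YAML for
		// ordinary paths and names.
		fmt.Fprintf(&b, "  %q:\n", f.Path)
		b.WriteString("    exists: true\n")
		fmt.Fprintf(&b, "    owner: %q\n", f.Owner)
		fmt.Fprintf(&b, "    group: %q\n", f.Group)
		fmt.Fprintf(&b, "    mode: %q\n", f.Mode)
	}
	return b.String()
}
```

The resulting text can be written to a manifest file on the host and checked with the Goss validate command after each migration step.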
We think these tests are not very useful with classic SCM because it has already been tested and will work every time. However, when migrating from one SCM to another, the Goss tests generated on the old, validated SCM can be a good tool for testing the new one.
The second goal that can be achieved with the infrastructure testing tool is ensuring that your own developed SCM works as described in the manifests. This will help address the problem of forgotten configurations.
The next thing we would change in the possible next generation of SCM is the use of a task manager like RabbitMQ. This would open up opportunities to improve the speed of new changes. Agents could check only the queue in the MQ system and immediately react to changes. Additionally, using queues could also help solve the problem of scaling the scheduler by leveraging a lightweight producer and multiple schedulers as consumers.
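The producer and consumer split can be illustrated in-process with a channel standing in for the message queue. This is only a model of the pattern: `RebuildAll` is a hypothetical name, and a real deployment would replace the channel with a broker and run the consumers on separate servers.

```go
package main

import "sync"

// RebuildAll plays the role of a lightweight producer: it pushes one
// task per host into a queue, and a pool of workers consumes the tasks
// concurrently. It returns the rebuild result for every host.
func RebuildAll(hosts []string, workers int, rebuild func(host string) error) map[string]error {
	if workers < 1 {
		workers = 1
	}
	tasks := make(chan string)
	results := make(map[string]error, len(hosts))
	var mu sync.Mutex // guards results
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for host := range tasks {
				err := rebuild(host)
				mu.Lock()
				results[host] = err
				mu.Unlock()
			}
		}()
	}

	for _, host := range hosts {
		tasks <- host // blocks until a worker is free
	}
	close(tasks) // lets the workers exit their range loops
	wg.Wait()
	return results
}
```

Because each task is taken by exactly one consumer, adding consumers increases rebuild throughput without the global lock that the current cache generator needs.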
We also think that using pure interfaces in Go instead of structs was a mistake. The SCM interface is necessary for templating purposes, but for many managers, using structs is a more suitable choice.
What About the Developers and Users?
When I started developing this system, I was both a developer and a user for the first six months, using it in my own projects. After we began to expand the system, our entire team started using it. Some members of the SRE team chose to use it, while others focused on writing new features and also utilized the system. Our team has been changing over the last five years, but we have consistently found individuals who contribute new features to our SCM and implement them in their projects.
One of the challenges was finding people in the market who could work with our tool and improve it. However, at least two people were always involved in developing the new SCM, so this never became a problem. I decided to ask the team for their thoughts on our SCM. Their responses were generally similar: there was initial rejection due to its complexity, but once they started using it, they began to appreciate it. Additionally, the more motivated team members said that our SCM increased their interest in using and developing the tool.