As our (or our clients') infrastructure grows and runs for longer, I have noticed that certain parts of it are known only to certain people, and only to a certain extent. Due to the nature of IT operations, most engineers stay in firefighting mode and fix some problems with a manual hotfix (be it a stability, security, or performance issue).
Over time these pieces of infrastructure (or infrastructure services) accumulate features or functionality that are neither automated nor documented. Slowly they reach a state where, if you kill the server, it is difficult to recreate. That is not only because you don't know the exact steps needed to bring it back to its original state, but also because there are dependencies on other integration points you need to worry about. In the community we call them 'Works of Art'.
There are many ways to fix them, but this post is about how to catch them.
An ounce of prevention is worth a pound of cure.
I prefer to kill the whole environment (staging, pre-production, UAT) every weekend, or to have non-functional releases where I just recreate the production infrastructure at regular intervals. This does not stop manual fixes from accumulating, but it does reveal whether any manual fixes are present that are critical for the services to run. By doing this more frequently, I reduce the risk of large, accumulated manual fixes. To me this is a litmus test, or Gold Standard, for Automated Infrastructure.
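One way to make the result of such a rebuild visible is to compare a snapshot of the environment taken just before the kill with one taken after the automated recreation. The following is only a minimal sketch under my own assumptions: the snapshot format (a flat mapping of item names to versions or checksums), the `find_drift` function, and the sample data are all hypothetical, and how you collect the snapshots depends entirely on your tooling.

```python
from dataclasses import dataclass, field


@dataclass
class DriftReport:
    # Items that existed before the rebuild but are absent afterwards.
    missing: dict = field(default_factory=dict)
    # Items present in both snapshots with different values: key -> (before, after).
    changed: dict = field(default_factory=dict)

    @property
    def clean(self) -> bool:
        # True when the rebuild reproduced everything in the "before" snapshot.
        return not (self.missing or self.changed)


def find_drift(before: dict, after: dict) -> DriftReport:
    """Compare two snapshots and report what the automated rebuild did not reproduce."""
    report = DriftReport()
    for key, old_value in before.items():
        if key not in after:
            report.missing[key] = old_value
        elif after[key] != old_value:
            report.changed[key] = (old_value, after[key])
    return report


# Hypothetical snapshots: package versions and config file checksums.
before_rebuild = {
    "pkg:nginx": "1.24.0",
    "file:/etc/nginx/nginx.conf": "sha256:aaa",
    "file:/etc/cron.d/rotate-logs": "sha256:bbb",  # added by hand during an incident
}
after_rebuild = {
    "pkg:nginx": "1.24.0",
    "file:/etc/nginx/nginx.conf": "sha256:ccc",
}

report = find_drift(before_rebuild, after_rebuild)
for key in report.missing:
    print("not recreated by automation:", key)
for key in report.changed:
    print("differs after rebuild:", key)
```

Anything listed as missing or changed is a candidate manual fix that should either be moved into the automation or consciously dropped. An empty report does not prove the environment is fully automated, only that nothing in the snapshot was lost.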