DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Software Design and Architecture
  3. Performance
  4. Self-Inflicted Wounds: The SSL Failure on the Linux Build Server

Self-Inflicted Wounds: The SSL Failure on the Linux Build Server

A nasty cousin of It Works On My Machine is the It Fails On That Machine.

Oren Eini user avatar by
Oren Eini
·
Feb. 08, 19 · Analysis
Like (2)
Save
Tweet
Share
5.71K Views

Join the DZone community and get the full member experience.

Join For Free

A nasty cousin of It Works On My Machine is the It Fails On That Machine. It is nasty because you know that there is something wrong, but you can’t reproduce it.

The machine in question was our Linux build agent, and the failure in question was a set of failing tests that failed to perform a certain operation when TLS was enabled. The problem? They were failing with I/O errors, but only with TLS, and the connection was using localhost. Further investigation showed that the most likely reason for the failure was a timeout. But how could that be? For fun, sometimes, the test passed. So it wasn’t an issue of a firewall of some kind. Testing using openssl s_server and connecting to it manually didn’t show any issues.

The failures reproduced only on that particular build machine. Trying to reproduce the failure on other Linux machines failed. The problem was that these other machines had different kernel version (shouldn’t matter, probably) or different openssl versions (which likely mattered). We started to investigate what the issue was and tried to set up a secured server on the box and connect to it.

Which worked, so that sucked. Running the tests again showed them failing again… which was confusing.

Eventually, we figured out that we couldn’t make it fail when running a server outside the tests, but we could observe that it was slow. Slow as in, took ~2.5 seconds to reply to the client’s hello message.

We tried it in the debugger, and whenever we paused it, it was always some variant of the following:

image

That led us to believe that there is something with openssl that could cause this slowdown. We tried to use strace to understand what it was doing, and we got stuff like this:

image

To be rather exact, we got over 300,000 calls to openat with some file in that directory.

That… didn’t seem right to us. So we looked a bit deeper. As you can see, during the processing of an SSL connection, we can to X509Chain.Build, and that end up calling to X509Store.Certificates, and that end up reading all the certificates that you have on the “~/.dotnet/corefx/cryptography/x509stores/ca” directory.

That directory had…. 2,103 certificates files in it. And every single time that we had to do anything with a certificate, we would go and scan through all of them, including paying all the cryptographic costs of verifying all those certs. If we had multiple threads doing that at the same time, we would run into starvation issues. Just a single pass through that was 2.5 seconds for a single core. The test scenario had 20 connections at a minimum going at once, and that was the reason for the timeouts. We basically run out of CPU because we were spending all of those cycles on the certs.

And why did we have so many certs? The tests create a certificate every time that they run, and register that in the X509Store to make sure that certain features work. But we never delete it from the store…. And over time, we had more and more certs going on there, until the load was big enough to start breaking things.

Oops.

Build (game engine) Linux (operating system)

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Key Considerations When Implementing Virtual Kubernetes Clusters
  • The Quest for REST
  • A Complete Guide to AngularJS Testing
  • How to Develop a Portrait Retouching Function

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: