Who Gets Access to Production?

By Sam Bisbee, CTO

This is the third installment in our new series of weekly blog posts that dive into the role of SecDevOps. The series looks at why we need it in our lives, how we might go about implementing this methodology, and real-life stories of how SecDevOps can save the Cloud.

Remote access to production machines is a long-contested battlefield that has only gotten uglier since the rise of Software as a Service, which has obliterated the line between building the system and running the system. This gave rise to new methodologies, the most popularly touted being DevOps, which is really just an awful way of communicating that everyone is responsible for running the system now. One critical implementation detail that smaller SaaS companies have always understood, due to hiring constraints, is that the entire technical staff is required to be on call. Yes, even the engineers, developers, or whatever else you call them.

The New Policy

“Lock out the developers” is not an acceptable policy anymore. Developers inherently build better systems when they experience running them. Who would allow a bug to linger if it continuously woke them up throughout the night? This pain was not felt widely enough in the previous “throw it over the wall to operations” world. I can sense desperation rising from the PMs over their kanban story velocity: “If an engineer is on call, then they won’t be able to write code!” While this statement is factually accurate, the sentiment is not.

First, operations has an equally important and lengthy work queue. Second, those paging alerts are likely the most important bugs, regardless of whether they’re an uncaught exception (engineering issue) or a RAID alarm (operational issue). This typically confounds those new to the SaaS world because they have not fully grasped the ramifications of the Service with a capital “S”. The Service is always on and is the product through which you deliver value. This is one of the best examples of how SaaS companies differ culturally and operationally from companies that “ship” product. You are not running an IT department.

Don’t Overcorrect

This remote access policy may seem like an overcorrection, which is why proper controls are critical. One of the most-cited fears about granting more people access is the lack of change control. When you apply this fear to developers, what people really mean is that they are afraid of hot patches. This is completely and utterly reasonable.

Hot patches decrease visibility into the system, slowing down or outright preventing debugging. The worst-case scenario is a hot patch actually damaging the system or corrupting user data, which is far more likely because the patch was never tested. The technical community should fully understand by now that “it worked on my laptop” or “it shouldn’t do that” are not reasonable statements when releasing. The only true prevention for hot patching, especially when implementing a populist remote access policy, is to create a frictionless release mechanism. Make it trivial for your teams to build, test, and initiate a staggered release into any of your environments. Ideally your build server is testing every push to your master git branch and anyone can promote a successful build from that server.
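To make "staggered release" concrete, here is a minimal Python sketch of the idea, not a description of any particular build server. The function names, the rollout fractions, and the `deploy` and `healthy` callbacks are all hypothetical placeholders for whatever your own tooling provides. It refuses to promote a build that did not pass, rolls out in growing batches, and halts as soon as a batch looks unhealthy.

```python
import math
from typing import Callable, List, Sequence, Tuple


def staggered_batches(
    hosts: Sequence[str],
    fractions: Sequence[float] = (0.1, 0.5, 1.0),  # assumed cumulative rollout steps
) -> List[List[str]]:
    """Split hosts into batches; each fraction is the cumulative share released."""
    batches: List[List[str]] = []
    done = 0
    for fraction in fractions:
        target = min(len(hosts), math.ceil(len(hosts) * fraction))
        if target > done:
            batches.append(list(hosts[done:target]))
            done = target
    return batches


def release(
    build_passed: bool,
    hosts: Sequence[str],
    deploy: Callable[[str], None],   # hypothetical: pushes the build to one host
    healthy: Callable[[str], bool],  # hypothetical: post-deploy health check
    fractions: Sequence[float] = (0.1, 0.5, 1.0),
) -> Tuple[List[str], bool]:
    """Promote a tested build batch by batch, stopping at the first unhealthy batch."""
    if not build_passed:
        raise ValueError("only builds that passed on the build server can be promoted")
    released: List[str] = []
    for batch in staggered_batches(hosts, fractions):
        for host in batch:
            deploy(host)
        released.extend(batch)
        if not all(healthy(host) for host in batch):
            return released, False  # halt the rollout; remaining hosts are untouched
    return released, True
```

The point of the sketch is that the safe path is also the easy path: one call, no SSH session, and a bad build stops after touching a fraction of the fleet.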

Trust but Verify

If frictionless releases are our trust, then accordingly we must verify. Enter monitoring. Techniques such as the Pink Sombrero are good (digital sombreros are better), but you must introduce continuous security monitoring into your environment. For ages there have been tools and techniques that do this, but most teams do not employ them because of their complexity, outdated implementations (taking hashes of your entire multi-TB filesystem in an IO-bound cloud or virtual environment is asinine), and volume of false positives. It does not have to be so complicated though. For example, alerting when a user other than chef changes files in your production server’s application directory is a first step that a team of any size can take.
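The chef rule can be expressed in a few lines. The sketch below is an illustration only: it assumes your existing file monitoring already hands you the user and the absolute, normalized path for each change, and the `FileChange` type, the `/srv/app` directory, and the `should_alert` function are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import FrozenSet


@dataclass(frozen=True)
class FileChange:
    """One change event, as reported by whatever file monitoring you already run."""
    user: str
    path: str  # assumed to be absolute and already normalized


APP_DIR = PurePosixPath("/srv/app")    # hypothetical application directory
ALLOWED_USERS = frozenset({"chef"})    # only configuration management may write here


def should_alert(
    change: FileChange,
    app_dir: PurePosixPath = APP_DIR,
    allowed: FrozenSet[str] = ALLOWED_USERS,
) -> bool:
    """Return True when someone other than an allowed user changed a file under app_dir."""
    path = PurePosixPath(change.path)
    inside = path == app_dir or app_dir in path.parents
    return inside and change.user not in allowed
```

A rule this narrow produces very few false positives, which is exactly why it makes a good starting point.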

For those who are concerned about access to customer data, whether it be PII or something less toxic, this remote access policy does not apply to that data, as it should live in a segregated environment. They are also likely concerned with passing audits, and the prospect of listing their entire technical team as having production access is not appealing. In such scenarios, non-operators should be locked out of production unless they are on rotation. Adding their SSH public key to the gateway when they go on call, and revoking it when they come off, can make controlled access easier.
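As a rough sketch of that on-demand access, the Python below grants and revokes a key in an `authorized_keys`-style file. It assumes plain entries of the form "type key comment" with no leading options, and it matches on the key type and key material only, ignoring the comment. The `grant` and `revoke` names are hypothetical, and a real gateway would also want atomic writes and an audit trail.

```python
from pathlib import Path
from typing import List, Tuple


def _key_id(line: str) -> Tuple[str, ...]:
    """Identify a key by its type and base64 body, ignoring the trailing comment."""
    return tuple(line.split()[:2])


def _require_key(public_key: str) -> Tuple[str, ...]:
    wanted = _key_id(public_key)
    if len(wanted) < 2:
        raise ValueError("expected a public key of the form '<type> <key> [comment]'")
    return wanted


def _read_lines(path: Path) -> List[str]:
    return path.read_text().splitlines() if path.exists() else []


def _write_lines(path: Path, lines: List[str]) -> None:
    # Not atomic; a production version should write a temp file and rename it.
    path.write_text("".join(line + "\n" for line in lines))


def grant(authorized_keys: Path, public_key: str) -> bool:
    """Add the key when an engineer goes on rotation. Returns False if already present."""
    wanted = _require_key(public_key)
    lines = _read_lines(authorized_keys)
    if any(_key_id(line) == wanted for line in lines):
        return False
    _write_lines(authorized_keys, lines + [public_key.strip()])
    return True


def revoke(authorized_keys: Path, public_key: str) -> bool:
    """Remove the key when the rotation ends. Returns False if it was not present."""
    wanted = _require_key(public_key)
    lines = _read_lines(authorized_keys)
    kept = [line for line in lines if _key_id(line) != wanted]
    if len(kept) == len(lines):
        return False
    _write_lines(authorized_keys, kept)
    return True
```

Hooking these two calls into your on-call schedule means the list of people with production access at any moment is simply the current rotation, which is a much easier story to tell an auditor.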

You Get What You Need

All of this is to say that collectively we are still trying to figure out the security balance in the technical community. Too often people want security but see it as prohibiting productivity, so they punt. This is unfortunate for the obvious reasons, but also because properly operationalized security begins to enhance the developer’s and operator’s experience. Tools are leveraged that make the system easier to run and control. Different monitoring solutions are installed that make the system easier to debug and verify. And everyone gets access to production.

Stay tuned next Wednesday for our fourth installment in this series as we continue to dive deeper. Until then, be sure to check out our first and second posts in the series.