Day2Ops vs DevOps as a Service
I’ve increasingly seen two trends in describing technical professional service products. I’m particularly fond of (and biased towards) the use of “Day2Ops” as a way to focus on the fact that you will incur 80% of development costs AFTER your product is launched. The other trend is selling “DevOps as a Service” - which probably bothers me more than it reasonably should, but I think it’s worth explaining why.
I will cut to the chase. DevOps cannot be provided as a service. Many aspects of DevOps (tooling, training, coaching) can be provided as a service, but DevOps itself is a practice that evolved within the industry over the last couple of decades in response to very real inefficiencies. Those inefficiencies caused very real pain to employees and, I’m sure, the very real demise of entire organizations. What were those inefficiencies and how did DevOps help? I’m glad you asked…
Ten or so years ago, if you walked into a typical software company, you would see a clear division of tasks between three independent organizations: Product Development, Quality Assurance, and Operations. The following, mostly fictional, release is probably disturbingly familiar to you:
- Product Manager defines the scope of a new product release.
- Software developers build the features for that release. Testing is done on their local machines, and integration tests are occasionally run in a shared “development” environment that was built for a previous release by a sysadmin who no longer works at the company. Nobody understands this environment. The sysadmin created it as a favor, using non-standard tooling and practices, and the developers have further hacked at it to work around issues as they cropped up. At least once per release, developers are unable to deploy code to the environment because the magical “deploy_code.sh” script stops working. They put in a request to the Ops team to have a look, but it usually takes 2-3 days before Ops can find the time to determine that the dev box was re-IPed, or whatever.
- Once features have been deemed “complete”, QA tests them in a “test” environment that is mostly like production, except for the fact that production has multiple load-balanced servers that are much too expensive for QA. So QA instead runs multiple processes on a single machine and uses a script to ensure each process is listening on a different port. And the requests are load-balanced using an Apache instance with mod_lb because real load balancers are also too expensive.
- QA identifies bugs in the completed features. Most of these fall into one of three categories: a. Bugs that developers missed due to a lack of end-to-end integration testing in their shared environment. b. Bugs that don’t appear in the development environment, because the code relies on assumptions that are only true in the development environment. c. Bugs that only manifest in the QA environment, because half of it has been hacked together; these are actually bugs in the environment’s setup scripts.
- QA and software engineers go back and forth until all the bugs are ironed out.
- QA creates a tarball of the latest version of the code they tested.
- QA sends the tarball and release notes to the Operations team to have them deploy it to a staging environment. The staging environment is almost identical to production. In fact, it uses the same load balancer as production - since that load balancer is crazy expensive and needs to be shared.
- Operations team notices that the latest release won’t start. So they huddle with QA to figure it out. If they can’t figure it out, QA will figure out how to reproduce the issue in their environment, and then work through it with a software engineer.
- Once satisfied, an operations engineer will schedule and deploy the latest release to production.
- 10 hours later, at 2 am, the Operations engineer is paged about degraded performance in the production system. The engineer logs on to the system and sees that resource utilization has quadrupled since the previous release and is now causing processes to be OOMKilled every hour or so.
- The Operations engineer spends an hour trying to figure out what is causing the increased utilization. They then notice new log messages that regularly output stats about new feature usage. The engineer is pretty sure there is a memory leak in that “bookkeeping” code. They look for a feature flag to disable it, but nothing is stated in the release notes, nor in the --help output for the process. The engineer considers rolling back the release but remembers that the database schema had some update scripts run against it that probably aren’t compatible with the old process.
- The Operations engineer brings in the on-call software engineer who vaguely remembers the addition of an “analytics” feature that matches the newly recorded log messages. The software engineer spends another thirty minutes finding the code, and confirms, yes, there is a memory leak. High-cardinality statistics are being cached, bucketed by time, and never evicted. The fix is relatively simple but will need to be tested and packaged by QA before it can be deployed.
- QA is brought into the call so they can test and “release” the patched code.
- Ops deploys the patched code, and all three hang around for another hour to ensure the system is stable, before heading back to bed at 5 am.
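To make the 2 am bug concrete, here is a minimal Python sketch of the kind of leak described above: statistics cached, bucketed by time, and never evicted. It is not the code from the story (which is mostly fictional anyway). Every name in it (LeakyStats, BoundedStats, the ANALYTICS_ENABLED variable, the one-minute bucket size, the three-bucket retention) is an assumption made up for illustration. It also shows the two things the Operations engineer went looking for and could not find: a bound on memory, and a switch to turn the feature off.

```python
import os
from collections import defaultdict

BUCKET_SECONDS = 60  # hypothetical: one bucket per minute
MAX_BUCKETS = 3      # hypothetical retention: keep only the newest buckets


def analytics_enabled(env=os.environ):
    """Kill switch read from a hypothetical ANALYTICS_ENABLED variable (on by default)."""
    return env.get("ANALYTICS_ENABLED", "1").lower() not in ("0", "false", "no")


class LeakyStats:
    """Roughly the bug in the story: every time bucket is kept forever."""

    def __init__(self):
        self.buckets = defaultdict(lambda: defaultdict(int))

    def record(self, key, now):
        # A new bucket appears every BUCKET_SECONDS and is never removed.
        self.buckets[int(now // BUCKET_SECONDS)][key] += 1


class BoundedStats:
    """Same bookkeeping, but old buckets are evicted and the feature can be disabled."""

    def __init__(self, max_buckets=MAX_BUCKETS, enabled=True):
        self.max_buckets = max_buckets
        self.enabled = enabled
        self.buckets = {}

    def record(self, key, now):
        if not self.enabled:
            return  # feature flag off: do no bookkeeping at all
        bucket = self.buckets.setdefault(int(now // BUCKET_SECONDS), defaultdict(int))
        bucket[key] += 1
        # Evict the oldest buckets until we are back under the limit.
        while len(self.buckets) > self.max_buckets:
            del self.buckets[min(self.buckets)]


def simulate(stats, minutes=10, keys_per_minute=100):
    """Record high-cardinality keys for a while and return how many buckets remain."""
    for minute in range(minutes):
        for n in range(keys_per_minute):
            stats.record(f"user-{minute}-{n}", now=minute * BUCKET_SECONDS)
    return len(stats.buckets)


if __name__ == "__main__":
    print("leaky buckets:   ", simulate(LeakyStats()))
    print("bounded buckets: ", simulate(BoundedStats()))
    print("disabled buckets:", simulate(BoundedStats(enabled=analytics_enabled({"ANALYTICS_ENABLED": "0"}))))
```

In the leaky version the number of buckets grows with uptime, and each bucket holds one entry per distinct key, so memory grows until something like the OOM killer steps in. The bounded version holds at most MAX_BUCKETS buckets no matter how long the process runs. The flag is the cheaper half of the fix: had it existed and been mentioned in the release notes, Ops could have turned the feature off at 2 am without waking anyone.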