Reader, forgive me - I have sinned. A few days ago, our Platform experienced a number of stability issues of our Kubernetes cluster. Early yesterday morning, a member of our Operations team requested my assistance to troubleshoot a Production error. As I enjoy debugging/tracing issues, I agreed and went through the "usual" tracing process. I used my Development experience to troll through the Pod logs and found the issue. I had found the smoking gun and it pointed to further instability of the Platform.
Throughout the day, as alerts came through and heady from my Operations troubleshooting, I investigated some more - and guess what - I found more smoking guns. The evidence was incontrovertible - the culprit in this whodunnit mystery was the Platform. When a colleague suggested that it could be Application related, I smugly sent through a screenshot of the smoking gun. As errors were still coming in and Operations escalated, our Management team requested an urgent "focus call".
As a key witness for the prosecution, I eagerly requested an invitation for the trial/judgement and execution...
Upon joining the session, I actively participated, explaining our microservices architecture and attempted to replicate the issue - so we could get to the resolution faster. However, while reviewing the log of the latest error, I noticed something - just as our Business Owner was approving/requesting a rebuild of the Cluster. Without getting into deep technical detail, the issue was actually Application related - there was a flaw in our (my?) connection management code.
Not one to see an innocent person convicted for a crime they did not commit, I immediately highlighted that the issue was on our (my?) side. To be honest, I think our Management team was more relieved that the issue had been identified. As they were also not keen to punish one of the people who could rectify it, they graciously thanked everyone and dropped off.
I barely had time to fight off the first surge of adrenalin of finding the issue when I was awash by the second wave by saving our effort of fixing it - with a well placed npm rebuild command in our Docker build file. I had to pinch myself - my troubleshooting skills were possibly only matched by my fixing skills. Needless to say, I went to bed in the early hours of this morning, feeling quite accomplished.
As I got out of bed this morning and reflected on the night before - this time without adrenalin clouding my judgement - I felt awful...
The code that caused the connection management issue - whether it was written by my team or myself - was done on my watch. It was my fault. If I had just dug a bit deeper while troubleshooting with Ops, I would have found the issue sooner. Again, my fault. I felt like the sheriff who had just arrested the most wanted serial killer in town. However, the "hero sheriff" had also let that killer off the hook a few days ago, and in so doing, allowed the killer to knock off a few more victims.
Rest assured, reader - this lesson will not be forgotten with a "Hail Mary". This has been a humbling experience - and this post is my outlet to demonstrate my remorse, my fallibility - but more importantly, my learning. Going forward, as the Tech Lead I am trusted to be, I will try my best not to choose the easiest explanation but the right one.