Debugging Nightmare: A Go Service That Worked Fine For Months Till It Didn’t

So, one Thursday afternoon, just before we were leaving the office, a colleague told me he was getting errors from a service I was responsible for. I told him that the last change to the service had been made two months earlier and that it had been deployed as-is ever since. After checking that the service was working fine in the staging environment, we blamed it on the instability of the dev-test environment and assumed everything would be fine when we came back to the office after the weekend.

On Sunday, I got to the office a bit late and was welcomed by people trying to figure out why the service wasn’t working. I was surprised, since no changes had been made to the service that could cause it to stop working. The worst part was that it was working fine in the staging and production clusters.

The service was dockerized and running inside a Kubernetes cluster. The same Docker image was used in all three environments: dev-test, staging, and production. Since the image ran fine on staging, prod, and locally, we were all puzzled. It was clear that something was wrong with the dev environment, but we had no clue what it could be. I looked at the service’s logs and was greeted by the following panic message:

SIGILL: illegal instruction
PC=0xda1128 m=5 sigcode=2
signal arrived during cgo execution
instruction bytes: 0xc4 0xc2 0xe9 0xf7 0xd2 0x48 0x1 0xd8 0x48 0x8b 0x78 0x30 0x48 0x85 0xfa 0x75 

This made zero sense, since SIGILL: illegal instruction typically means the CPU doesn’t support an instruction the binary is trying to execute. But the service had run fine for months on the dev cluster without any issues. So the CPU not supporting an instruction couldn’t be the case, right?

At this point, some background on the service we’re discussing. It’s an image processing service that I had recently rewritten from Python to Go, mostly for performance gains. That meant the Go binary relied on some C libraries and used Go’s cgo heavily. cgo can sometimes be fragile, but since the same image ran fine in every other environment, I ruled that out as a possibility. Still, I wanted to know exactly which function caused the program to panic beyond recovery. So I added the Delve debugger to the Docker image and asked our DevOps engineer to deploy the service to the dev cluster with Delve attached. Connecting to the Delve session wasn’t much help: after a few calls, the binary crossed over into the C library, and Delve couldn’t trace the calls any further. A whole day was spent on this, trying to figure out which function was responsible.
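To give a rough idea of what that looks like (a minimal sketch with a made-up C function, not the actual service code), a cgo call is just Go code calling into a C preamble:

package main

/*
#cgo LDFLAGS: -lm
#include <math.h>

// Hypothetical stand-in for the real native image routine the service called.
static double scale_factor(double width, double target_width) {
    return target_width / sqrt(width * width);
}
*/
import "C"

import "fmt"

func main() {
    // This call crosses from Go into native code. If the native library was
    // built with instructions the host CPU doesn't support (BMI2 in our case),
    // the process dies with SIGILL, and Go's recover() can't catch it.
    f := C.scale_factor(C.double(1920), C.double(1280))
    fmt.Println("scale factor:", float64(f))
}

The important detail is that once execution is on the C side, Go’s tooling mostly loses visibility, which is exactly what we were seeing with Delve.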

The next day, one of my colleagues suggested that instead of trying to find which function caused this, we should try to find which instruction caused the SIGILL. Which was a brilliant idea! I immediately looked at the panic message, and the instruction bytes were right there before my eyes! The only problem was that they were in hex, so I googled and found a site that could disassemble raw x64 bytes into assembly. Putting the instruction bytes c4 c2 e9 f7 d2 48 01 d8 48 8b 78 30 48 85 fa 75 from the panic message into the site’s disassembler resulted in the following:

0:  c4 c2 e9 f7 d2          shlx   edx,edx,edx
5:  48                      dec    eax
6:  01 d8                   add    eax,ebx
8:  48                      dec    eax
9:  8b 78 30                mov    edi,DWORD PTR [eax+0x30]
c:  48                      dec    eax
d:  85 fa                   test   edx,edi
f:  75                      .byte 0x75

The instruction SHLX immediately jumped out at us as something special. A quick Google search told us that SHLX requires a CPU feature called BMI2. But according to Wikipedia and Intel’s ARK database, a CPU has to be about 10 years old to not have it! “This can’t be the case, right?”, I asked myself. So, to be 100% sure, I exec’d into the container running the service in the dev cluster and ran lscpu | grep BMI2, and to my surprise, BMI2 was missing! I then did the same for the staging service, and BMI2 was present there. I wanted to do the same in prod but didn’t have access, lol. I thought maybe the BMI2 CPUID feature flag had been disabled at the VM level or the Kubernetes level, so I notified my manager and the DevOps lead about my findings regarding the service failure and went home.
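In hindsight, the binary could have checked for this itself at startup. Here’s a rough sketch of what such a guard might look like, using the golang.org/x/sys/cpu package (this wasn’t part of the actual service; the package exposes the host’s CPUID feature flags to Go):

package main

import (
    "log"

    "golang.org/x/sys/cpu"
)

func main() {
    // cpu.X86.HasBMI2 reflects the BMI2 CPUID feature flag of the host CPU.
    // Failing fast here turns a cryptic SIGILL deep inside a C library into
    // an obvious startup error.
    if !cpu.X86.HasBMI2 {
        log.Fatal("this build requires a CPU with BMI2 support")
    }
    log.Println("BMI2 supported, continuing startup")
}

On the old dev node, a check like this would have failed loudly at startup instead of dying mid-request inside a C call.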

The next day, the DevOps lead let us know that only 1 of the 3 servers running the dev cluster supported BMI2. The other 2 were simply too old to have the feature. This meant we had just been lucky that, when the service was first deployed, it was provisioned on the server with BMI2 support. After a storage upgrade, the servers were restarted and the service landed on a server without BMI2 support. After hearing this, I laughed and said, “Dockerize once, run everywhere. Such a lie.” I already knew this, having gone through the migration from an Intel Mac to an M1 Mac. I also knew that in some special cases the Linux kernel version of the host machines has to match for containers to operate properly, since Docker containers share the kernel with the host.

So, in the end, the fix was to make sure the service was only provisioned on the server with BMI2 support, and to decommission the really old servers later. Which we did.
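For anyone wondering what that pinning looks like in Kubernetes terms, here’s a rough sketch (the node name and label key are made up, not our actual manifests): label the capable node and add a matching nodeSelector to the service’s Deployment.

kubectl label node dev-node-1 cpu/bmi2=true

# In the Deployment spec:
spec:
  template:
    spec:
      nodeSelector:
        cpu/bmi2: "true"

With a matching nodeSelector, the pods only get scheduled onto nodes carrying that label, which was exactly the guarantee we needed until the old servers were gone.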

That’s all for today! Thanks for reading and making it to the end. I’m not sure how much useful information was in this blog post, but anyway, you can reach out to me via [email protected].
