OpenAI Fixed an 18-Year-Old Bug in a Library Everyone Uses. Here’s How They Found It.

OpenAI engineers debugged a year's worth of data infrastructure crashes and found two separate causes: one piece of bad hardware, and a bug that had gone unnoticed in open source code for 18 years.

OpenAI published a detailed engineering post today about a debugging effort that spanned months, uncovered two unrelated crash causes in its ChatGPT data infrastructure, and ended with a fix for an 18-year-old bug in GNU libunwind, a C library widely used for stack unwinding on Linux.

The story is a masterclass in systems debugging. It also contains a lesson that applies way beyond infrastructure engineering: when you treat a population of failures as a single problem, you can chase red herrings forever. Split them by the data, and the answers usually fall out.

What was crashing?

The crashes were happening inside Rockset, a cloud-native data system OpenAI acquired in 2024 that handles search and real-time analytics for ChatGPT. Rockset powers the data plugins and conversation search features in ChatGPT. It’s written in C++ for performance, which means memory bugs can cause segfaults.

Engineers noticed a pattern where normal C++ functions appeared to finish executing and then return to a bogus address. The instruction pointer pointed nowhere. Sometimes the return address was NULL. Sometimes the stack pointer register itself was off by 8 bytes, as if it had been decremented in the middle of normal execution. Neither of these failure modes makes sense for ordinary application code.

For months, the team treated these as one bug and struggled to find a root cause. They examined individual core dumps, formed hypotheses, ruled them out. Rinse, repeat.

The turning point: epidemiologist mode

The key shift came when the team decided to stop acting like doctors diagnosing individual patients and start acting like epidemiologists analyzing a whole population.

They built a pipeline using ChatGPT to automatically analyze every Rockset core dump from the previous year by extracting registers, filtering known false positives, and labeling each crash by type. Once they had clean population data, the pattern was obvious: what looked like one bug was actually two separate crash populations.
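The labeling step can be pictured with a minimal sketch. This is not OpenAI's pipeline: the CoreSummary record, its rsp_delta field, and the label names are all hypothetical, and it assumes an earlier step has already pulled register values out of each core dump.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CoreSummary:
    """Hypothetical per-crash record produced by an earlier extraction step."""
    host: str
    rip: int        # instruction pointer at crash time
    rsp_delta: int  # observed minus expected stack pointer (assumed to be precomputed)


def label_crash(core: CoreSummary) -> str:
    """Assign a coarse label from register state alone."""
    if core.rip == 0:
        return "return-to-null"
    if core.rsp_delta == -8:
        return "misaligned-stack"
    return "other"


def label_counts(cores):
    """Count how many crashes fall under each label."""
    return Counter(label_crash(core) for core in cores)
```

Running something like label_counts over a year of summaries is what turns a pile of individual core dumps into population data.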

The return-to-null cores were spread across many clusters and geographic regions. The misaligned-stack crashes all came from one Azure region, had a clear start date, and never appeared on long-running nodes.
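A rough sketch of that population split, assuming each crash has already been reduced to a (label, region, date) tuple. The function name and the tuple layout are illustrative assumptions, not taken from OpenAI's post.

```python
from collections import defaultdict
from datetime import date


def summarize_by_label(crashes):
    """Group (label, region, day) tuples into per-label population stats."""
    summary = defaultdict(lambda: {"count": 0, "regions": set(), "first_seen": None})
    for label, region, day in crashes:
        entry = summary[label]
        entry["count"] += 1
        entry["regions"].add(region)
        if entry["first_seen"] is None or day < entry["first_seen"]:
            entry["first_seen"] = day
    return dict(summary)


# Made-up sample data; the region names are placeholders.
sample = [
    ("return-to-null", "region-a", date(2024, 3, 1)),
    ("return-to-null", "region-b", date(2024, 7, 9)),
    ("misaligned-stack", "region-c", date(2024, 10, 2)),
    ("misaligned-stack", "region-c", date(2024, 10, 5)),
]
stats = summarize_by_label(sample)
```

A label whose crashes span many regions points toward software. A label confined to one region with a sharp first-seen date points toward something local, such as a single machine.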

Bug #1: A bad CPU

The misaligned-stack crashes traced back to a single physical Azure host where the CPU just didn’t do math correctly. The hardware was silently corrupting register state. Once that host was taken out of service, those crashes stopped entirely.

OpenAI improved their tooling to detect similar issues from logs alone (no core dump needed) and changed their control plane to prefer VM reuse over recycling, making bad-node detection much easier.
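One simple way to flag a suspect host from logs alone is to look for crash concentration. The sketch below is an assumption about how such a check might look, not a description of OpenAI's tooling. The function name and both thresholds are hypothetical.

```python
from collections import Counter


def suspect_hosts(crash_hosts, min_crashes=5, min_share=0.5):
    """Return hosts that account for an outsized share of crashes.

    crash_hosts: one host identifier per logged crash.
    min_crashes and min_share are arbitrary illustrative thresholds.
    """
    counts = Counter(crash_hosts)
    total = sum(counts.values())
    return sorted(
        host
        for host, n in counts.items()
        if n >= min_crashes and n / total >= min_share
    )
```

A check like this only works if crashes can be attributed to a stable host identity over time, which is one plausible reason node reuse makes bad hardware easier to spot.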

Bug #2: An 18-year-old race condition in GNU libunwind

With the hardware crashes separated out, the remaining crashes became solvable. They all happened during C++ exception unwinding, the process where the runtime walks the stack to find catch blocks and cleanup handlers.

The root cause was a single-instruction race condition in _Ux86_64_setcontext, part of GNU libunwind. The function builds a ucontext_t struct on the stack, fills in the target register state, and then loads those registers. The problem: it changes %rsp (the stack pointer) on the very first instruction of the critical section, but keeps reading from the struct at the old stack location for the next several instructions. Once %rsp has moved, the struct sits in memory the stack no longer protects. If a signal arrives in that gap (a window of roughly 100 picoseconds), signal delivery and the handler's own stack usage overwrite the struct memory, and the restored instruction pointer goes to NULL.
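The race is easier to see in a toy model. The simulation below is not libunwind's real instruction sequence: the addresses, field list, and step numbering are invented for illustration. The one real detail is the 128-byte red zone that signal delivery skips below the stack pointer on x86-64 Linux.

```python
RED_ZONE = 128  # bytes below rsp that signal delivery leaves untouched on x86-64


def simulate_restore(signal_before_step, handler_stack_bytes):
    """Toy model: step 0 moves rsp, steps 1..4 read saved registers.

    signal_before_step: step index a signal arrives just before, or None.
    handler_stack_bytes: how far below rsp the signal frame and handler write.
    """
    struct_base = 0x7000                      # hypothetical address of the saved context
    fields = ["rbx", "rbp", "r12", "rip"]
    memory = {struct_base + 8 * i: 0x1000 + i for i in range(len(fields))}
    rsp = struct_base - 64                    # current frame: struct is above rsp, safe
    new_rsp = struct_base + 0x400             # target frame is higher up the stack
    restored = {}

    def deliver_signal(current_rsp):
        # Signal delivery and the handler write downward from rsp - RED_ZONE.
        top = current_rsp - RED_ZONE
        for addr in range(top - handler_stack_bytes, top, 8):
            if addr in memory:
                memory[addr] = 0              # clobber whatever was stored there

    for step in range(len(fields) + 1):
        if signal_before_step == step:
            deliver_signal(rsp)
        if step == 0:
            rsp = new_rsp                     # stack pointer changes first
        else:
            name = fields[step - 1]
            restored[name] = memory[struct_base + 8 * (step - 1)]
    return restored


no_signal = simulate_restore(None, 2048)
before_switch = simulate_restore(0, 2048)
in_window = simulate_restore(1, 2048)
small_handler = simulate_restore(1, 256)
```

In this model, a signal that lands before the stack pointer moves is harmless, while one that lands just after leaves the restored rip at zero. With a small handler_stack_bytes the writes never reach the struct, which matches the article's point that increased signal handler stack usage helped expose the bug.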

OpenAI was uniquely positioned to hit this bug because Rockset throws exceptions at a high rate for overload control, uses frequent CPU-time signals, and had recently increased the signal handler’s stack usage. The product of those three factors had crossed the threshold where the race became operationally visible.
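A back-of-the-envelope model shows why the product matters. Everything here except the roughly 100-picosecond window is a made-up assumption: the rates are placeholders, and clobber_fraction is a hypothetical stand-in for how often the handler's stack usage actually reaches the struct.

```python
def expected_hits_per_day(unwinds_per_sec, signals_per_sec, window_sec, clobber_fraction):
    """Estimate daily race hits for one process under a simple independence model."""
    # Chance that a signal lands inside the window during one unwind.
    p_signal_in_window = signals_per_sec * window_sec
    hits_per_sec = unwinds_per_sec * p_signal_in_window * clobber_fraction
    return hits_per_sec * 86_400  # seconds per day


# Placeholder inputs: 500 unwinds/s, 100 signals/s, 100 ps window, every hit clobbers.
per_process = expected_hits_per_day(500, 100, 100e-12, 1.0)
```

Because the three factors multiply, shrinking any one of them by 10x cuts the expected crash rate by 10x. That is how a bug like this can stay invisible for years in workloads where even one factor is small.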

The fix

OpenAI switched from GNU libunwind to libgcc’s unwinder, which doesn’t have the same issue. They also upstreamed a fix to GNU libunwind and a self-contained reproducer.

The immediate takeaway for anyone running C++ services: if you use GNU libunwind and deliver signals at high frequency during exception-heavy code paths, you could be hitting the same race. Switching to libgcc’s unwinder may be the right move. (This is a bit like the Codex SSD logging bug I covered recently, another case where production infrastructure debugging uncovered something unexpected.)

The team said it best: “Reliability is not just about fixing bugs after they happen. It’s about building the data, workflows, and skills that turn impossible problems into diagnosable and solvable ones.”

Reviewed & Written By Tony Simons

Independent tech reviewer and creator of Tony Reviews Things. 14 years of hands-on testing, software auditing, and workflow automation. I test the gear so you don't waste your money on junk.
