The Discovery
A few weeks before tapeout, I started running a systematic pre-tapeout audit. Not because anything was wrong. Everything looked clean. Every timing check passed. The gate-level simulations passed. The matched delay verification passed. The DRC and LVS checks were clean. The flow reported zero violations across the board.
But I know two things from experience. First, never trust the tools; verify for yourself. Second, don't ignore the warnings.
I started pulling apart the SDF files (Standard Delay Format, the file that captures actual cell and interconnect delays from the post-place-and-route extraction). I expected to see a distribution of wire delays, with a mix of short and long interconnects depending on the geometry of each net.
Instead, I found that 98.5% of the interconnect delays were zero.
Out of roughly 25,800 interconnects in the design, only 380 had real wire delay numbers. Every other wire, including every latch enable, every datapath net, every memory write line, was being analyzed as if it had no parasitic capacitance and no wire resistance. The timing engine had been telling me the chip was fine because, from its perspective, there was almost no wire delay anywhere in the design to worry about.
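A minimal sketch of this kind of audit in Python. It assumes the SDF writes each INTERCONNECT entry on its own line with its delay triples after the two pin names; the function name `interconnect_delay_stats` is hypothetical and is not part of any tool in the flow.

```python
import re

# Matches plain or scientific-notation numbers inside SDF delay triples.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")


def interconnect_delay_stats(sdf_text):
    """Return (total, nonzero) counts of INTERCONNECT entries.

    Assumption: one INTERCONNECT entry per line, in the form
    (INTERCONNECT <src pin> <dst pin> (<rise>) (<fall>))
    """
    total = 0
    nonzero = 0
    for line in sdf_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith("(INTERCONNECT"):
            continue
        # Split off the keyword and the two pin names; the rest is delays.
        parts = stripped.split(None, 3)
        if len(parts) < 4:
            continue
        total += 1
        values = [float(tok) for tok in NUMBER.findall(parts[3])]
        if any(v != 0.0 for v in values):
            nonzero += 1
    return total, nonzero


def zero_fraction(total, nonzero):
    """Fraction of interconnects whose delays are all zero."""
    return (total - nonzero) / total if total else 0.0
```

With the counts from this design, `zero_fraction(25800, 380)` comes out at about 0.985. A healthy post-route SDF should be nowhere near that number, so a check like this is cheap to run on every corner.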
The Root Cause
The problem turned out to be a subtle interaction between several tools in the flow.
The synthesis tool was preserving module hierarchy (which we discussed in the synthesis post). This is necessary for our async design because we need the hierarchy intact for analyzing individual function blocks and for SPICE simulation. The parasitic extraction tool was producing a SPEF file with hierarchical net names using the standard / separator. But the timing analysis step was reading a flattened netlist where those same nets had been collapsed into escaped flat names like \exec/decode_latch/latch_en.
When OpenSTA tried to annotate the parasitics from the SPEF onto the netlist, it could not match the hierarchical SPEF names to the flat escaped names. So it silently dropped the parasitic data for almost every net.
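A toy model of the mismatch, as I understand the mechanism. This is not OpenSTA's code. The nested-dict netlist representation and the `resolve` function are hypothetical, made up only to show why a lookup that walks the hierarchy finds nothing in a flat netlist.

```python
def resolve(module, spef_name, divider="/"):
    """Look up a hierarchical net name by walking instance scopes.

    `module` is a toy netlist: {"nets": set_of_names, "instances": {name: module}}.
    Returns True when the net is found, False otherwise.
    """
    *path, leaf = spef_name.split(divider)
    scope = module
    for inst in path:
        scope = scope["instances"].get(inst)
        if scope is None:
            return False  # no such instance, so the net cannot be annotated
    return leaf in scope["nets"]


# Hierarchical netlist: exec -> decode_latch -> net latch_en
hierarchical = {
    "nets": set(),
    "instances": {
        "exec": {
            "nets": set(),
            "instances": {
                "decode_latch": {"nets": {"latch_en"}, "instances": {}},
            },
        },
    },
}

# Flattened netlist: the same net is one escaped identifier at the top level.
flat = {"nets": {"exec/decode_latch/latch_en"}, "instances": {}}

spef_net = "exec/decode_latch/latch_en"
found_in_hierarchical = resolve(hierarchical, spef_net)  # True
found_in_flat = resolve(flat, spef_net)                  # False
```

The name strings look identical to a human reader. The lookup still fails on the flat netlist because there is no instance called `exec` to descend into.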
The flow has a step that filters out unannotated nets and reports how many were filtered. That step reported zero filtered nets. Why? Because it had decided that the unannotated nets were "acceptable" and masked the entire problem. The post-processing made the broken output look clean.
Every downstream check inherited the broken parasitic annotation. The SDF files used for gate-level simulation had no wire delay. The matched delay checks (the primary timing closure mechanism for an async bundled-data design) ran with no wire delay. SPICE simulations of the extracted netlist had no parasitics either, because the LVS extraction had wire effects explicitly disabled.
Every safety net had a hole in the same place. Because of course it did.
What That Bug Was Hiding
Once I fixed the annotation problem and re-ran the timing analysis with real parasitics, a new problem appeared. One that the broken flow had been hiding.
In a synchronous design, clock tree synthesis (CTS) handles buffer insertion on high-fanout clock nets. It looks at how many flip-flops are sitting on each clock branch, balances the loading, inserts buffers as needed, and makes sure every flip-flop sees a clean clock edge with appropriate drive strength.
Our design has no clock. CTS is disabled. There is no clock port to start from. So the buffering of high-fanout nets falls to the resizer, which is a different tool that runs during placement optimization. But the resizer explicitly skips clock nets, deferring to CTS for those.
Our latch enable signals are declared as clocks, even though they are self-timed handshake outputs. Because of this, they got skipped by the resizer. Nothing buffered them.
One of these enable signals was driving 70 latch gate inputs through a minimum-strength inverter, with a total net capacitance pushing 0.2 pF. The rise delay on that net was hundreds of picoseconds. The transition time was nearly a nanosecond. For a self-timed enable pulse that is supposed to be high just long enough for the handshake to complete, that pulse would never reach a valid logic high before it needed to go low. And now you're dead in the water.
This was not a timing margin issue. This was a functional failure waiting to happen on silicon. And it was completely invisible because the parasitic annotation bug meant the timing tools thought every wire was a perfect zero-delay connection.
The Fixes
Three things needed to change.
First, the timing analysis step needed to run inside OpenROAD, which holds the hierarchical design database, instead of in standalone OpenSTA, so that the SPEF annotation could correctly match net names. After the fix, real parasitic annotation worked across all PVT corners. The number of nets with real wire delays jumped from 1.5% to 37%, and each corner now showed correctly differentiated cell and interconnect delays. Though 37% doesn't seem like much, I found that the vast majority of the still-unannotated nets were for antenna diodes, which don't matter anyway. All of the critical paths in the control circuitry were annotated properly, which is what I cared about.
Second, the high-fanout enable signals needed stronger drivers. Since CTS was not run, I built a custom flow step that runs after floorplanning and upsizes the driving inverters based on fanout. The 70-load enable signal went from a minimum-size inverter to one with 16 times the drive strength. The rise transition time dropped from nearly 800 ps to under 200 ps. Plenty of margin for the self-timed pulse. And just for safety's sake, I bumped up all of the other latch enables as well, based on their load. Extra margin = good thing.
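The selection logic for fanout-based upsizing can be sketched in a few lines of Python. Everything here is an assumption for illustration: the list of available drive strengths and the `loads_per_unit_drive` target of 5 are hypothetical values, picked so that the 70-load net lands on 16x as described above. The real step works on the design database rather than on bare integers.

```python
# Hypothetical set of inverter drive strengths available in the cell library.
DRIVE_STRENGTHS = (1, 2, 4, 8, 16)


def pick_drive_strength(fanout, loads_per_unit_drive=5, available=DRIVE_STRENGTHS):
    """Pick the smallest drive strength that keeps the load per unit of
    drive at or below the target. Falls back to the largest cell if even
    that is not enough.
    """
    needed = -(-fanout // loads_per_unit_drive)  # ceiling division
    for strength in sorted(available):
        if strength >= needed:
            return strength
    return max(available)


enable_net_drive = pick_drive_strength(70)  # 70 loads / 5 -> needs 14 -> picks 16
```

A net that falls back to the largest cell is still under-driven by this rule, so a real step should report those nets rather than pass them silently.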
Third, parasitic-annotated SPICE extraction was added as a separate step in the flow, producing a transistor-level netlist with all the wire capacitances included. This is what you actually need for accurate analog-level verification, distinct from the structural-only netlist that LVS uses.
How You Actually Do This in LibreLane
The mechanism for all of this is a single block in the design's config.yaml:
meta:
  flow: Classic
  substituting_steps:
    OpenROAD.STAPostPNR: Kyttar.STAPostPNR
    "+OpenROAD.Floorplan": Kyttar.UpsizeClockDrivers
    "+Magic.SpiceExtraction": Kyttar.ParasiticSpiceExtraction
Three entries, three different behaviors. The first entry has no prefix. That is a replacement. LibreLane runs Kyttar.STAPostPNR in place of the stock OpenROAD.STAPostPNR step. The second and third entries have a + prefix. That is an insertion. LibreLane keeps the original step and adds the new one immediately after it in the flow.
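The two behaviors can be modeled in a few lines of Python. This is a sketch of the semantics described above, not LibreLane's implementation, and the three-step flow list is a shortened stand-in for the real Classic flow. The function name `apply_substitutions` is hypothetical.

```python
def apply_substitutions(steps, substitutions):
    """Apply replace / insert-after substitutions to a list of step IDs.

    A key with no prefix replaces the named step.
    A key with a "+" prefix inserts the new step right after the named step.
    """
    result = list(steps)
    for key, new_step in substitutions.items():
        insert_after = key.startswith("+")
        target = key[1:] if insert_after else key
        if target not in result:
            raise KeyError(f"step {target!r} is not in the flow")
        index = result.index(target)
        if insert_after:
            result.insert(index + 1, new_step)
        else:
            result[index] = new_step
    return result


# Shortened, illustrative flow; the real Classic flow has many more steps.
stock_flow = ["OpenROAD.Floorplan", "OpenROAD.STAPostPNR", "Magic.SpiceExtraction"]

custom_flow = apply_substitutions(
    stock_flow,
    {
        "OpenROAD.STAPostPNR": "Kyttar.STAPostPNR",
        "+OpenROAD.Floorplan": "Kyttar.UpsizeClockDrivers",
        "+Magic.SpiceExtraction": "Kyttar.ParasiticSpiceExtraction",
    },
)
```

In this model `custom_flow` keeps both stock steps that had a `+` entry and gains a Kyttar step right behind each one, while the stock STA step is gone.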
The reason STAPostPNR is a replacement rather than an insertion: the standalone OpenSTA tool that the stock step uses could not match the hierarchical SPEF net names to the netlist, because it does not have the design database that knows about the hierarchy. The replacement runs STA inside OpenROAD, which does hold the full chip database and can cross-reference hierarchical names back to the synthesized netlist. That is the entire fix for the annotation bug. You are not fixing OpenSTA, you are moving the work to the tool that already has the information it needs.
The UpsizeClockDrivers step is an insertion rather than a replacement because the stock floorplan step is doing useful work. You just need to add a pass behind it that walks the high-fanout clock-like nets and upsizes their drivers based on load, since CTS is not going to. Same pattern for ParasiticSpiceExtraction: the Magic SPICE extraction step still runs normally for LVS, and the new step runs after it to produce the parasitic-annotated netlist for analog simulation.
Custom steps are just Python classes that subclass a LibreLane step base class. They get the current flow state, run whatever tool or script they need, and return a new state. The substituting_steps block is how you wire them into an otherwise stock Classic flow without forking anything. Pretty slick in my opinion.
What I Took Away From This
A few things stand out.
Do not trust "0 violations." Verify the methodology. Every check in the flow said clean. The bug was in the design, but the checking infrastructure itself masked the problem. Reports of zero violations from a broken checker look identical to reports of zero violations from a working checker. You have to verify that the check is actually checking what you think it is checking. Again, never trust the tools!
Async designs expose corner cases that synchronous designs hide. The same parasitic annotation bug would affect a synchronous design's timing too. But CTS would have already buffered the high-fanout nets, so the functional failure would not have appeared. The wire delay error would just show up as reduced timing margin. For us, the combination of no CTS plus no parasitics was potentially fatal. When you do something the tool was not designed for, the tool's defensive layers stop protecting you.
Build verification into the flow, not around it. The fixes are now custom flow steps that produce correct results by construction. The original approach was to run the standard flow and check the outputs. That fails when the outputs themselves are wrong in a way the checker cannot detect. Now the steps that produce the data are part of my own flow, so I know they are doing what I expect.
Pre-tapeout audits save chips. This bug was found during a systematic review of every output file with a deliberately suspicious eye. Every individual check passed. It took looking at the data behind the checks, asking "is this distribution reasonable?", to find the problem. Without that audit, this chip would have been dead on arrival, and I would not have known why for months.
Open-source EDA tools have known issues with non-standard configurations. The hierarchical-name-vs-flat-name problem in static timing analysis is documented as a known limitation, going back years. It has not been fixed upstream because most users flatten their designs and never encounter it. Most designs are synchronous and follow standard patterns, so the gaps in tool coverage are invisible until someone steps off the path. If you are doing something unusual, you have to know enough about every tool in the chain to verify that your unusual choice is being handled correctly at every step.
The Cost of Catching It Late
The fixes took several days of intense work, but that's why the chip needed to be 'done' one month before tapeout. These things happen on every chip. Every. Single. One. Budget appropriately or you will get burned. Luckily it was caught. The alternative was sending a broken chip to the fab, getting silicon back in October, and discovering that the latch enables were not driving cleanly. At that point the only options would be expensive: another tapeout (another six months and another fifteen thousand dollars), or shipping a chip that mostly worked but failed on certain operations under certain conditions.
This is the part of chip design that nobody tells you about until you live through it. The bug that almost kills your chip is never the one you were worried about. It is the one your safety nets do not catch because the safety nets share the same blind spot. The cost of being paranoid before tapeout is days. The cost of not being paranoid is months and a chip that does not work.
Tapeout is in less than a month. The chip is now in good shape. This audit found one chip killer. Next week I'll talk about a couple more issues that were uncovered by IDDQ SPICE simulations and timing analysis, though these are more subtle and marginal. Not chip killers, but issues that will make a product engineer squirm.