Convention Versus Coverage
Static timing analysis at a mature node like 130nm can be bounded by a small number of process-voltage-temperature corners. Typical at nominal, fast at high voltage and cold (-40°C here), slow at low voltage and hot (100°C in this case). 3 points is the textbook simplification, and for a synchronous design at this node it is typically good enough. The flow tools assume it and the PDK characterization is organized around it. The convention works because synchronous timing has a single number to beat: the data path delay must be less than the clock period. Across PVT, the worst case for that comparison falls at the slow corner.
At advanced nodes, production signoff is a different animal. 16nm and below, you are looking at multi-mode multi-corner closure with aging corners, parametric on-chip variation, and setup-and-hold matrices that easily run to 30 corners or more. The diagonal bracketing assumption holds, but you check many more points along it because variation has gotten harder to model with a few representative cases. The convention scales with the node.
For an asynchronous bundled-data design, neither version of the convention is sufficient, and the reason is not about how many corners. It is about which corners. There is no clock. Instead, every data path has a matched delay chain alongside it: a small chain of inverters and gates whose delay is engineered to be longer than the data path's worst case. When the request signal propagates through the delay chain and arrives at the next stage, the data is guaranteed to have settled. No clock period to beat. The delay chain itself is the timing reference.
This is elegant in a few ways. There is no clock tree to balance and no global skew. Each block runs as fast as its data path will allow. But the timing problem is not gone. It has just changed shape.
The Async Failure Mode
For matched-delay async to work, the delay chain has to track the data path across every PVT corner. The data path is whatever logic the synthesizer placed: adders, multiplexers, all flavors of logic gates. The delay chain is a sequence of weak inverters and AND gates. They have different transistor sizings, different fanouts, different switching behaviors. What's more, each delay chain is a series of single-input, single-output elements, so the placement and routing algorithms put them right next to each other with minimal routing. More on this later.
If the delay through the chain ever drops below the data path delay at any operating condition, the latch captures data before the data is valid. There is no error message or timing exception. The chip just produces wrong answers.
This is the part that took me a while to internalize. In synchronous design, a timing failure is loud: setup violations show up in reports and propagate through gate-level simulation, and you will see them as long as your constraints are in reasonable shape. In matched-delay async, a timing failure is silent. The handshake completes, the pipeline advances, the data is corrupted. Hopefully you exercised all of the paths in SDF-annotated GLS. Otherwise you only find out when the chip does the wrong thing in the lab.
So signoff has to ask a different question. Not "does the data path beat the clock period," but "does the delay chain beat the data path everywhere, across the whole PVT envelope, with margin." And the answer to that question is not bracketed by the diagonal.
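That question can be written down as a check over every corner. Here is a minimal sketch in Python, assuming the per-corner delays have already been pulled out of the STA reports as plain numbers. The function name, the dictionary layout, the corner names, and every delay value are made up for illustration and are not from the actual flow.

```python
from typing import Dict, Tuple

# Hypothetical layout: corner name -> (delay_chain_ns, data_path_ns).
Delays = Dict[str, Tuple[float, float]]


def worst_matched_delay_slack(delays: Delays, margin_ns: float) -> Tuple[str, float]:
    """Return the corner with the smallest slack, and that slack.

    Slack is defined here as chain - data - margin, so a negative value
    means the delay chain does not beat the data path with the required margin.
    """
    slacks = {
        corner: chain - data - margin_ns
        for corner, (chain, data) in delays.items()
    }
    worst_corner = min(slacks, key=lambda c: slacks[c])
    return worst_corner, slacks[worst_corner]


# Illustrative numbers only, not measured data.
example = {
    "typical": (12.0, 9.0),
    "fast_cold": (8.0, 6.0),
    "slow_hot": (18.0, 14.0),
    "slow_cold": (15.0, 14.5),  # an off-diagonal corner
}
corner, slack = worst_matched_delay_slack(example, margin_ns=1.0)
print(corner, slack)
```

The point of the sketch is that the minimum is taken over whatever set of corners you feed in. If the set only contains diagonal corners, the answer only covers the diagonal.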
The Off-Diagonal Problem
Both versions of the synchronous convention, 3 corners at mature nodes or 30 (or more) at advanced ones, share an assumption: process pairs with voltage and temperature in physically sensible ways. Fast process gets high voltage and cold temperature, because that is where the chip runs fastest. Slow process gets low voltage and hot temperature, because that is where it runs slowest. The worst case for clock-versus-data lives somewhere along that diagonal, so even when you check many points, you check points on or near it.
For matched-delay async, the worst case is wherever the delay chain and the data path have the largest tracking mismatch. That is not necessarily on the diagonal. The data path and the delay chain are different cell mixes. They have different sensitivities to voltage and temperature. The corner where they disagree the most could be anywhere in the PVT space, including corners that no synchronous flow would visit at any node.
Worse, parasitic extraction adds another dimension. Wire delays at fast process with maximum parasitics behave differently than wire delays at slow process with minimum parasitics. The standard flow auto-pairs SPEF with silicon corner because for synchronous designs the worst case is consistent. Here it is not. Routing contributes a different fraction of the total delay in the data path than in the delay chain, because the delay chain has essentially minimal routing. Longer, more complex data paths therefore show a greater mismatch against the delay chain once parasitics are taken into account.
Taking all of this into account, the honest answer was that I did not know whether the matched-delay margins were safe across the full envelope. The chip is a test chip at 130nm, where 3 or 6 corners would have been entirely defensible if it were synchronous. The async nature is what forced the full grid. Comprehensive was strictly better than minimal.
48
The PDK has 16 characterized PVT points. Not a full grid: the foundry only characterizes physically meaningful combinations, like fast process at high voltage and slow process at low voltage. Sparse but realistic.
Each PVT point gets paired with 3 SPEF variants: minimum, nominal, and maximum parasitics. That gives 48 total static timing corners.
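The corner list is just a cross product. A small sketch of the enumeration follows. The source does not list the 16 characterized PVT points, so they are numbered with placeholder names here rather than guessed, and the naming scheme is an assumption.

```python
from itertools import product

# Placeholder names: the 16 characterized PVT points are not listed in the text.
pvt_points = [f"pvt_{i:02d}" for i in range(16)]

# The three parasitic extraction variants.
spef_variants = ["min", "nom", "max"]

# Every PVT point is paired with every SPEF variant instead of letting
# the flow auto-pair one SPEF per silicon corner.
corners = [f"{pvt}__spef_{spef}" for pvt, spef in product(pvt_points, spef_variants)]

print(len(corners))  # 48
```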
I did not parallelize. The flow worked at 3 corners, and 3 weeks before tapeout was the wrong moment to start changing infrastructure. Each 48 corner sweep ran serially, 8 to 12 hours per pass. The joy of overnight runs.
The first run came back with 240 failures. 79 of them were in the category I was actually worried about: raw matched-delay slack worse than -2 ns. These were real failures, not just margin warnings.
And almost all of them were at one specific corner: cold slow.
What Cold Slow Reveals
Cold slow process at -40°C is not a corner most synchronous flows worry about at a mature node. At advanced nodes it gets visited as one of many, and temperature inversion can even make it setup-critical there, but at 130nm it is not where you expect setup or hold violations to live for a clocked design. For matched-delay async it turned out to be where everything came unglued.
Here is the mechanism. The delay chain cells are weak inverters in series. They are characterized at the same PVT points as everything else, but they have a different temperature dependence than the gates that make up the data path. At cold temperatures, the gain difference between the delay-chain inverters and the data-path logic grows. The delay chain slows down by some factor, the data path slows down by a different, larger factor, and the routing parasitics mismatch adds up. In effect, the matched-delay margin collapses.
This was not a synthesis problem or a placement problem. It was a fundamental issue with the cells themselves not tracking each other across the temperature range. The fix was not subtle: more delay cells. The chain had to be long enough that even when its tracking degraded relative to the data path, it still won the race.
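A first estimate of "how many more cells" is simple arithmetic at the failing corner. The sketch below assumes you know the delay one extra pair of cells adds at that corner. The function and parameter names are hypothetical, and the numbers in the example call are made up.

```python
import math


def extra_pairs_needed(chain_ns: float, data_ns: float,
                       margin_ns: float, pair_delay_ns: float) -> int:
    """Delay-cell pairs to add so that chain >= data + margin at one corner."""
    if pair_delay_ns <= 0:
        raise ValueError("pair_delay_ns must be positive")
    shortfall = data_ns + margin_ns - chain_ns
    if shortfall <= 0:
        return 0  # the chain already wins the race with margin
    return math.ceil(shortfall / pair_delay_ns)


# Illustrative values: chain is 3 ns short of data + margin, each pair adds 0.75 ns.
print(extra_pairs_needed(chain_ns=15.0, data_ns=17.0, margin_ns=1.0, pair_delay_ns=0.75))  # 4
```

This is only a starting point. Adding cells changes the layout, so the result still has to be re-checked at every corner after the next place-and-route run.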
3 Rounds Of Whack-A-Mole
The first round of fixes added delay cells to the worst-affected blocks. 5 extra pairs to one block, 2 extra to another, 1 shift on a tap point. Re-ran the layout. Re-ran the 48 corner sweep. Most of the original violations were closed.
But a different block was now failing. Placement had drifted between runs. The same RTL, the same constraints, produced a slightly different layout, and one of the blocks that had been comfortable before was now showing cold slow problems. The data path delay had grown by almost 1 ns at typical, which scaled to over 2 ns at cold slow. The delay chain that had been fine for the previous layout was no longer enough.
Round 2 added delay cells to that block. Another layout. Another 48 corner sweep. A different block now had a problem.
At this point, I began questioning my life choices. Each round of fixes seemed to push the problem somewhere else. Matched-delay async timing closure is at the mercy of placement when margins are tight. The architecture is correct, the flow is fine, the cells track each other reasonably well, but until you get a layout where every block has comfortable margin at every corner simultaneously, you keep playing whack-a-mole. It also doesn't help that the C-elements are built from standard cells. Their combinational feedback loops needed a lot of false paths just so that STA would complete AT ALL, and with those in place there isn't an easy way to constrain the data paths. The whole design is effectively unconstrained, so every round depends on whatever the placement engine decides to do.
The Catch
Round 3 was supposed to be 3 more delay cell pairs on the long arm of one block. I had the script ready, the RTL change drafted, the layout queued.
The right question got asked before I ran any of it: did the short arm also fail?
I had not checked. The block had 2 paths through it, one for a long operation that needed the full delay, one for a short operation that needed a shorter delay. The way the design worked, both paths shared the same delay chain but tapped it at different points. My fix only extended the long arm. The short arm's tap was earlier in the chain, and it was also failing at cold slow. I had been about to ship a fix for one path while leaving the other broken.
The catch saved a layout iteration. Without it, the next 48 corner run would have shown all the short-arm modes still failing, and I would have spent another day figuring out why. Instead, I moved the tap and added the cells in one round.
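The shared-chain issue is easy to see in a toy model. In the sketch below the chain is a list of per-cell delays and each arm taps it after some number of cells. The cell count, the tap positions, and the 0.5 ns per-cell delay are all invented for illustration.

```python
from typing import Dict, List


def tap_delays(cell_delays_ns: List[float], taps: Dict[str, int]) -> Dict[str, float]:
    """Delay seen at each tap: the sum of the cells in front of it.

    taps maps an arm name to the number of cells before its tap point.
    """
    return {arm: sum(cell_delays_ns[:n]) for arm, n in taps.items()}


# Hypothetical chain: 10 identical cells, short arm taps after 4, long arm after 10.
chain = [0.5] * 10
before = tap_delays(chain, {"short": 4, "long": 10})

# Fix A: append 3 cells at the end of the chain. Only the long arm gets slower.
fix_a = tap_delays(chain + [0.5] * 3, {"short": 4, "long": 13})

# Fix B: append the cells and also move the short tap later in the chain.
fix_b = tap_delays(chain + [0.5] * 3, {"short": 6, "long": 13})

print(before, fix_a, fix_b)
```

Appending cells past the short tap leaves the short arm's delay unchanged, which is why extending the long arm alone would not have fixed it.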
This is the part of design that is hard to systematize. A second pair of eyes asking the right questions catches things that the original designer is too close to the work to notice.
Voltage Scaling Is Friendlier Than I Expected
One thing fell out of the 48 corner data that I did not expect. As voltage dropped from nominal toward near-threshold, the matched-delay margin did not collapse the way it would in a synchronous design. The data path slowed down and the delay chain slowed down by approximately the same factor. The ratio held. The margins stayed roughly constant down to surprisingly low supply voltages.
This is a real architectural property of matched-delay async. Both sides of the timing race are made from the same library, so process and voltage scaling apply to both sides equally. The temperature mismatch I had spent a week chasing was specifically about cell-type tracking. Voltage scaling does not introduce that mismatch the same way.
Below near-threshold the picture changes again, because non-linear effects make different cell types behave differently. But across the standard operating range, you can scale voltage down for power savings without watching your timing margins collapse. So that's cool.
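One way to put a number on "the ratio held" is to compute chain delay over data delay at each supply point and look at the spread. This is a sketch with invented delays and placeholder supply labels. It is not the data from the 48 corner runs.

```python
from typing import Dict, Tuple


def tracking_ratios(delays: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Map supply label -> delay_chain_ns / data_path_ns."""
    return {label: chain / data for label, (chain, data) in delays.items()}


def ratio_spread(ratios: Dict[str, float]) -> float:
    """Relative spread (max - min) / min of the tracking ratio across the sweep."""
    lo, hi = min(ratios.values()), max(ratios.values())
    return (hi - lo) / lo


# Made-up delays in ns: (delay_chain, data_path) at three supply settings.
sweep = {
    "nominal": (12.0, 10.0),
    "reduced": (18.5, 15.0),
    "low": (30.0, 25.0),
}
ratios = tracking_ratios(sweep)
print(ratios, ratio_spread(ratios))
```

A small spread means both sides of the race are scaling together even though the absolute delays more than double across the sweep.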
The Final Run
After the 3rd round of fixes, the timing closed cleanly. 240 failures became 40. None of the remaining 40 were real violations. They were all cosmetic: paths where the delay chain was slower than the data path, but the engineering margin requirement was not fully met. The chip would work at every operating-range corner. The cosmetic flags were artifacts of how strictly I had defined margin. Add in the time borrowing property of the latches, and everything was golden.
Spot checks at the worst-case near-threshold corners showed huge headroom. One ALU path that had been at -16 ns of slack in an early run was now at +3 ns. A handful of tweaks to the delay chains and now everything looks good across all corners.
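The split between real violations and cosmetic margin flags can be made mechanical. The sketch below assumes each path is reduced to one chain delay and one data delay per corner. The category names and the function are mine, not the output of any tool.

```python
def classify(chain_ns: float, data_ns: float, margin_ns: float) -> str:
    """Classify one path at one corner.

    raw_violation:    the delay chain is faster than the data path (broken).
    margin_shortfall: the chain wins the race, but by less than the margin.
    ok:               the chain wins with at least the required margin.
    """
    if chain_ns < data_ns:
        return "raw_violation"
    if chain_ns < data_ns + margin_ns:
        return "margin_shortfall"
    return "ok"


# Illustrative values only.
print(classify(chain_ns=14.0, data_ns=15.0, margin_ns=1.0))  # raw_violation
print(classify(chain_ns=15.5, data_ns=15.0, margin_ns=1.0))  # margin_shortfall
print(classify(chain_ns=17.0, data_ns=15.0, margin_ns=1.0))  # ok
```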
Lessons
The synchronous convention does not transfer to async. 3 corners at mature nodes or 30 at advanced ones are organized around the assumption that the worst case lives along the process-voltage-temperature diagonal. For matched-delay async the worst case can live anywhere, so the coverage strategy has to be different. 48 corners is not a heroic effort. It is just an overnight run. Do it early.
The corners that hurt are the ones nobody else checks. Cold slow is not on the standard signoff path. It turned out to be exactly where my design's tracking mismatch lived. If you are doing something unusual, the failure modes will be in unusual places, and the standard tooling will not point you there.
Distinguish raw violations from margin shortfalls. A path where the delay chain is genuinely faster than the data path is broken silicon. A path where the delay chain is slower than the data path but the engineering margin requirement is not quite met is probably a cosmetic issue. The first one needs a fix. The second one requires a deeper understanding of what your actual margin budget is. Know your design and run lots of tests. SDF-annotated GLS is your friend.
Voltage scaling in matched-delay async is a bonus. This is the most useful architectural finding from the whole exercise. Both sides of the matched-delay race scale together with voltage, so margins are preserved across the operating range. For products that need to scale voltage for power, this is a real advantage that synchronous designs do not have. Will this hold true at other process nodes? Probably not as well as it did here, but I'm always happy to take a win.
The chip tapes out next week. The signoff is comprehensive in a way it would not have been at 3 corners. Whatever comes back from the fab, I will at least know that the timing was characterized against the actual envelope, not against a convention written for a different kind of design.