The Audit That Cleared the Chip

Week 9: Kyttar taped out on May 20 to the ChipFoundry CI2605 shuttle. Silicon comes back the first week of November. Before submission, we ran a multi-week systematic audit because after the bug that almost killed the chip, every checker was green and we did not trust them anymore. Here is the audit infrastructure, what it found, and why the waivers turned out to be the most important artifact.

Why We Stopped Trusting Green Checkmarks

A few weeks ago I wrote about the parasitics-killer bug: 98.5% of our wire delays were silently zero because the timing analyzer was discarding macro-pin interconnect annotations. Every check in the flow said clean, but the chip would have been DOA.

That single near-miss reframed the entire pre-tape-out conversation. Up to that point, the discipline had been: run the flow, read the warnings, fix what looks broken. After that bug, the question became: what ELSE looks fine but is not? I needed something better before submitting silicon I would only get back in November.

What I built was not a tool so much as it was a discipline: a multi-week, structured, repeatable audit that interrogated every category of pre-tape-out check the flow already had, plus categories the flow did not have, and accepted nothing on appearance alone. This post is about that audit and the part nobody talks about: how you write down what you accept and why, so that six months later, when you have forgotten everything, there are enough breadcrumbs left to reconstruct your rationale.

The Audit Infrastructure

The audit lived in a tapeout_audit/ directory at the repo top level. It was made up of 3 pieces:

A checklist file pinned to specific run IDs. Every line was one of [ ] open, [x] passed, [!] finding, [W] waived, or [W/~] waived with a manual gate still pending. Pinning to specific run IDs forced re-evaluation on every respin instead of letting status drift. If a respin happened, the checklist had to be archived and re-cut against the new run. I wanted to eliminate the classic "we checked this last time" bugs that easily slip through.
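As a rough illustration of why run-ID pinning matters, here is a minimal sketch of a checklist reader. The five status markers are the ones described above; the one-item-per-line layout, the `run_id:` header, and the function names are assumptions made for this example, not the actual file format.

```python
import re

# The five status markers come from the post; the line layout and the
# "run_id:" header line are hypothetical, chosen only for this sketch.
STATUS = {
    "[ ]": "open",
    "[x]": "passed",
    "[!]": "finding",
    "[W]": "waived",
    "[W/~]": "waived_manual_pending",
}
ITEM_RE = re.compile(r"^\s*(\[(?: |x|!|W/~|W)\])\s+(.+)$")
RUN_RE = re.compile(r"^run_id:\s*(\S+)\s*$")

# Anything still open, flagged, or waiting on a manual gate blocks readiness.
BLOCKING = {"open", "finding", "waived_manual_pending"}


def parse_checklist(text):
    """Return (pinned_run_id, [(status, description), ...])."""
    run_id = None
    items = []
    for line in text.splitlines():
        run_match = RUN_RE.match(line)
        if run_match:
            run_id = run_match.group(1)
            continue
        item_match = ITEM_RE.match(line)
        if item_match:
            items.append((STATUS[item_match.group(1)], item_match.group(2)))
    return run_id, items


def is_ready(text, current_run_id):
    """Ready only if the checklist is pinned to this run and nothing blocks."""
    run_id, items = parse_checklist(text)
    if run_id != current_run_id:
        return False  # stale checklist: archive it and re-cut against the new run
    return bool(items) and all(status not in BLOCKING for status, _ in items)
```

The point of the `run_id` comparison is that a checklist full of passes from an older run reports not-ready by construction, rather than silently carrying its status forward.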

A waivers file with a strict template. Every accepted finding had to specify What, Why, Scope, Owner, Date, and Expires. Scopes were precise: regex-matched check names, specific file paths, severity bounds, ratio ceilings. There were no catch-all waivers. Manual-check waivers had to spell out what the engineer should look for during inspection, and were marked [W/~] until signed off.
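One way to picture the template is as a record whose scope is machine-checkable. The sketch below is an assumption-heavy illustration: the field names mirror the What/Why/Scope/Owner/Date/Expires template, but representing the scope as one regex plus one ratio ceiling, and the `covers` method itself, are hypothetical simplifications.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Waiver:
    what: str
    why: str
    scope_check_regex: str  # regex on the check name (assumed scope representation)
    scope_max_ratio: float  # ratio ceiling; anything above is outside scope
    owner: str
    granted: date           # the template's "Date" field
    expires: date

    def __post_init__(self):
        # Every text field must be filled in, and the scope must be specific.
        for name in ("what", "why", "scope_check_regex", "owner"):
            if not getattr(self, name).strip():
                raise ValueError(f"waiver field {name!r} must not be empty")
        if self.scope_check_regex in (".*", ".+"):
            raise ValueError("catch-all scopes are not allowed")

    def covers(self, check_name, ratio, today):
        """True only if the finding is inside scope and the waiver is unexpired."""
        if today > self.expires:
            return False
        if re.fullmatch(self.scope_check_regex, check_name) is None:
            return False
        return ratio <= self.scope_max_ratio
```

A finding that matches no waiver's `covers` stays a finding, which is what keeps a narrow waiver from quietly absorbing a new problem.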

A findings directory with one file per discipline: digital_verification.md, synthesis.md, apr.md, sta.md, spice_parasitics.md, physical_verification.md, power_integrity.md, as well as a summary consolidator. Each audit pass overwrote the per-discipline file; previous passes were archived to runs/<date>/.

The whole thing was driven by Claude sub-agents, one per discipline, each with a tight prompt: read the waivers file first and skip findings that matched scope, take the previous pass's findings as context but verify each "fixed" claim against current state, write a structured report with severity-tagged findings and a status-vs-previous table. For things deterministic enough to run as scripts (parasitic-annotation completeness, recovery-margin checks), I wrote shell scripts that emit findings-format markdown and exit 0 or non-zero depending on result. The sub-agents quoted those results rather than re-running them.

I ran 4 full audit passes across roughly 3 weeks. Each pass was about 15 minutes of wall-clock time with the sub-agents running in parallel. Without parallelization, sequential review would have been roughly 2 hours per pass.

Categories the Audit Found That No Checker Would

Overall, there were 25 distinct findings across the audit's lifetime. None of them would have shown as a red flag in the standard LibreLane or ChipFoundry precheck output. They generally fell into one of five categories:

Correctly green but only because of a regression check we had added. After the parasitics-killer post, I wrote a check that opens the SPEF parasitic files for every signoff corner and counts the wire-delay entries that actually terminate on the macro pins and primary input ports we care about. If those counts ever drop or vary across corners, the timing analyzer is silently throwing parasitics away again. The counts came back identical on every audit pass (2287 / 814 in the wrapper, 6994 / 148 in the cell), which is the green checkmark I actually wanted. Every other timing check in the audit depends on this one being true; without it, everything downstream is meaningless. It is the kind of check that should have existed from day one and only existed because the previous bug forced it.
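The comparison step of that regression check can be sketched in a few lines. This assumes the SPEF counting has already been done elsewhere and only shows the "identical on every corner, never lower than expected" gate; the corner names and the function name are hypothetical, while the 2287 / 814 pair is the wrapper count quoted above.

```python
def check_annotation_counts(counts_by_corner, expected):
    """Return a list of findings; an empty list means the check is green.

    counts_by_corner maps a corner name to a (macro_pin_count, input_port_count)
    tuple. `expected` is the known-good pair that every corner must reproduce.
    """
    findings = []
    if not counts_by_corner:
        findings.append("no corners were checked")
    for corner, counts in sorted(counts_by_corner.items()):
        if counts != expected:
            findings.append(f"{corner}: expected {expected}, got {counts}")
    return findings


# Hypothetical corner names; the counts are the wrapper numbers from the post.
wrapper_counts = {
    "corner_a": (2287, 814),
    "corner_b": (2287, 814),
    "corner_c": (2287, 814),
}
print(check_annotation_counts(wrapper_counts, expected=(2287, 814)))
```

Treating an empty corner set as a finding is deliberate: a check that silently ran on nothing is the same failure mode as the original bug.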

Borderline-green warnings nobody had time to read carefully. The standout was CVC, the gate-level simulator we use for post-PnR verification. Each GLS run emitted 316,431 ERROR lines reading [1391] no module sky130_fd_sc_hd__dfrtp_4 timing check matches SDF. The log file was effectively unreadable. Root cause: OpenROAD's SDF generator uses split REMOVAL and RECOVERY timing checks, while the sky130 PDK's Verilog flop model declares them as the combined $recrem. Other simulators auto-equate the two forms, but CVC requires a literal match. The 316K errors were noise, not real failures, because there are no async-reset-related timing conditions. Fix: a regex post-filter in the dodo runner that strips exactly that noise pattern, preserves the raw log separately, and writes a footer [dodo.py] filtered N CVC REMOVAL/RECOVERY SDF-match ERROR lines. This reduced the log size by 89%. CVC's own summary line (1 error(s), 58287 warning(s)) was preserved untouched. A real failure would still show up, and it would now be much easier to see.
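A post-filter of this kind might look roughly like the sketch below. It is not the actual dodo.py code: the regex is built only from the message text quoted above, the function name is hypothetical, and the caller is assumed to keep the raw log in a separate file.

```python
import re

# Built from the message text quoted in the post. Matching any cell name via
# \S+ is an assumption; the real filter may be narrower.
NOISE_RE = re.compile(r"\[1391\] no module \S+ timing check matches SDF")


def filter_cvc_log(raw_text):
    """Return (filtered_text, dropped_count). The raw log is left to the caller."""
    kept = []
    dropped = 0
    for line in raw_text.splitlines():
        if NOISE_RE.search(line):
            dropped += 1
        else:
            kept.append(line)  # everything else, including CVC's summary, survives
    kept.append(
        f"[dodo.py] filtered {dropped} CVC REMOVAL/RECOVERY SDF-match ERROR lines"
    )
    return "\n".join(kept) + "\n", dropped
```

Writing the dropped count into the footer keeps the filter honest: if that number ever changes sharply between runs, that is itself worth a look.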

Gaps where no check existed at all. The most consequential one was the latch-capture-window architectural margin, which the matched-delay STA check has no model of. Matched-delay compares combinational data-path arrival against delay-chain arrival, but in this design the downstream latch's latch_en pulse width is what actually determines whether data is captured cleanly. At the brownout corners (cold slow, sub-1.6V Vdd) the latch_en pulse width grows to 5 to 10 ns. The worst raw matched-delay gap I saw at those corners was -2.26 ns of slack: a negative number that looks like failure. But the latch GATE was open many ns longer than the data path took to settle, so the data was captured cleanly anyway. The matched-delay check was flagging a failure, but this built-in margin meant the capture was still clean. The waiver describes in detail how the finding was waived and, more importantly, how the check can be improved in a future chip.

Compounding artifacts that look like emergencies in isolation. The cell standalone IR drop report came back showing 44-51% worst-case Vdd drop across runs. At face value: catastrophic. After investigation: artifact. Two reasons. First, the standalone cell is intentionally analyzed with a sparse local PDN because on the full chip, each cell's straps tie into the wrapper's met5 mesh and the openframe core ring. That ends up being roughly 3,840 inter-tier power ties chip-wide that the cell's standalone analysis cannot see. Second, the current model the PDN solver uses comes from report_power with a 50 ns virtual clock and default toggle activity. For an async design with no clock, this is meaningless. The architectural peak (all 120 cells in exec simultaneously) happens to roughly match the STA aggregate, so the number is not as wrong as it would be for a synchronous design, but it is still wrong. Combined analytical estimate of real silicon worst-case IR drop: 60 to 90 mV or 3 to 5% of Vdd.

Things that needed an outside nudge to find. A reader emailed after the parasitics-killer post and pointed me to OpenROAD's report_parasitic_annotation -report_unannotated command, asking whether I had it in the flow. I did, and it was running on every PnR spin, but I was not systematically auditing its output as a first-class signal. There is a difference between "the tool is running" and "we know what it says, every run, and we react to changes." I added a standalone checker that reads every corner's filter_unannotated_metrics.json, reports raw vs filtered residual counts, and fails if any corner shows more than N filtered residuals or the raw count drifts more than M across corners. On the cell: stable at raw 936 / residual 0. On the openframe: raw 9802 / residual 3 (the 3 known input-only gpio_in bits, covered by a waiver). The reader's suggestion was not a fix. It was a category of check I had been running but not auditing. The lesson is "post in public, get free audits." Thanks, Matt.
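The shape of such a checker is sketched below. The file name filter_unannotated_metrics.json is from the flow described above, but the directory layout (one sub-directory per corner), the JSON keys `raw` and `filtered`, and both function names are assumptions for illustration. The thresholds N and M appear as `max_residual` and `max_raw_drift`.

```python
import json
from pathlib import Path


def load_metrics(run_dir):
    """Map corner name -> metrics dict.

    Assumes a hypothetical layout: <run_dir>/<corner>/filter_unannotated_metrics.json
    """
    return {
        path.parent.name: json.loads(path.read_text())
        for path in sorted(Path(run_dir).glob("*/filter_unannotated_metrics.json"))
    }


def check_unannotated(metrics, max_residual, max_raw_drift):
    """Return findings; an empty list means every corner is within bounds."""
    if not metrics:
        return ["no corner metrics found"]
    findings = []
    for corner, m in sorted(metrics.items()):
        # "raw" and "filtered" are assumed key names, not a documented schema.
        if m["filtered"] > max_residual:
            findings.append(
                f"{corner}: {m['filtered']} filtered residuals (limit {max_residual})"
            )
    raws = [m["raw"] for m in metrics.values()]
    drift = max(raws) - min(raws)
    if drift > max_raw_drift:
        findings.append(f"raw count drifts by {drift} across corners (limit {max_raw_drift})")
    return findings
```

With the openframe numbers above (raw 9802, residual 3 on every corner), `max_residual=3` passes only because a waiver covers those three bits; a fourth residual would fail the check.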

The audit also caught a race in my own custom STA plugin. The plugin builds a summary.rpt by reading per-corner metrics from OpenROAD via an os.environ-injected metrics path. Two of 9 corners showed ? in the summary with metrics borrowed from another corner's name. It was a classic threading bug: shared environment variable, parallel corner runs, last-writer-wins on the metrics path. I replaced it with thread-local command construction in run_corner, and the issue was resolved in the next respin. A second related race showed up only on the small cell design where 4+ OpenROAD processes finished within ~1 ms of each other: 3 of 9 corners had no or_metrics_out.json despite the sta.log containing all the right Writing metric lines. I suspect an OpenROAD-internal exit-time flush race, which is outside our control. The workaround was to scrape the missing metrics from sta.log whenever the JSON is not there. I tested it against the failing corners' logs and recovered all 21 metrics cleanly. Neither fix touched OpenROAD itself; both landed in my run scripts. The audit caught my own infrastructure bugs, always a good sign.
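The first race is easy to reproduce in miniature. The sketch below shows the general pattern of the fix rather than the plugin's real code: each worker builds its own environment dict and hands it to the child process, so nothing is written to the shared os.environ. The variable name METRICS_PATH, the path layout, and the stand-in child process are all hypothetical; only the name run_corner comes from the description above.

```python
import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the real tool: a child process that echoes the path it was given.
CHILD = "import os; print(os.environ['METRICS_PATH'])"


def run_corner(corner):
    """Run one corner with a private environment copy.

    The buggy pattern would be `os.environ["METRICS_PATH"] = ...` followed by a
    launch: with parallel threads, the last writer wins and corners swap paths.
    """
    env = dict(os.environ)  # private copy, never shared between threads
    env["METRICS_PATH"] = f"metrics/{corner}.json"
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return corner, result.stdout.strip()


def run_all(corners):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return dict(pool.map(run_corner, corners))
```

Because every child receives its path through its own `env` argument, the mapping from corner to metrics file no longer depends on thread scheduling.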

The Waivers Are the Artifact

Some of the waiver writeups in this audit are 80 lines long. That looks heavy until you imagine the alternative: a one-line note saying "antenna 4.99x, accepted." Six months later when you respin, the next person to look at the project (more likely than not, me) sees the ratio jump to 4.99 again and has to re-derive the entire reasoning chain. Does this fall under any waiver? Is this a regression? Why was 5x accepted last time?

The strict template forced the audit to write down, explicitly, what would invalidate each waiver. Here are 2 examples:

The antenna waiver covered a single ratio that oscillated between 4.99 and 3.44 across consecutive runs on the reset net. The waiver scope was: side-area only, P/R ratio at most 5x, diodes placed, MPW prototype only. If there was any gate-area violation, any ratio above 5x, any pin without a diode, or if this was a production tape-out, this would fall outside the waiver. If two consecutive audit passes had shown the ratio drift to 5.5x, the waiver would have auto-invalidated and the finding would have come back as [!] in the checklist. In the end it stabilized at 4.99x, indicating layout-induced steady state rather than router intermittency. The foundry's own checks accepted the violation. The SKY130 antenna margins are generous, and the violating pin goes to async reset pins on latches, not a high-speed data input where a Vth shift would actually hurt. There are antenna diodes on this net already as well. All of that reasoning is in the waiver.
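The four scope conditions translate directly into a predicate. The sketch below is a hypothetical encoding, not the waiver file's real format, and it evaluates each pass on its own, which is stricter than the two-consecutive-pass rule mentioned above for ratio drift.

```python
from dataclasses import dataclass

MAX_RATIO = 5.0  # ratio ceiling from the waiver scope


@dataclass(frozen=True)
class AntennaResult:
    # Field names are assumptions chosen to mirror the four scope conditions.
    ratio: float
    gate_area_violation: bool
    all_pins_have_diode: bool
    production_tapeout: bool


def in_scope(result):
    """True if this pass's antenna result is still covered by the waiver."""
    return (
        not result.gate_area_violation
        and result.ratio <= MAX_RATIO
        and result.all_pins_have_diode
        and not result.production_tapeout
    )


def checklist_status(result):
    """'[W]' while covered by the waiver, '[!]' once any condition breaks."""
    return "[W]" if in_scope(result) else "[!]"


# The observed 4.99 -> 3.44 -> 4.99 sequence stays inside scope on every pass.
for ratio in (4.99, 3.44, 4.99):
    print(ratio, checklist_status(AntennaResult(ratio, False, True, False)))
```

The hypothetical 5.5x case comes back as `[!]` for the ratio alone, with no need to re-read the waiver's reasoning to know it no longer applies.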

The IR-drop waiver had three compensating-control preconditions hardcoded into it: average IR less than 5% Vdd, 0 power-grid violations, and PSM-0040 connectivity report showing 0 critical disconnected pins. If any of those three fail in a future run, the waiver auto-invalidates. The original waiver also promised empirical confirmation by enabling chip-level IR drop on the next respin. That promise turned out to be structurally impossible at the wrapper hierarchy because vccd1 and vssd1 are floating ports until the I/O ring connects them at the level above. So the waiver was updated to drop the impossible promise and document the structural reason. That kind of update, in writing, in version control, is the audit's institutional memory.
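The three preconditions can be checked mechanically on every run. This is a minimal sketch with hypothetical parameter names, and it assumes the three values have already been extracted from the power-integrity reports.

```python
def failed_preconditions(avg_ir_drop_pct, power_grid_violations, critical_disconnected_pins):
    """Return the preconditions that no longer hold; empty means the waiver stands."""
    failed = []
    if not avg_ir_drop_pct < 5.0:
        failed.append("average IR drop must be below 5% of Vdd")
    if power_grid_violations != 0:
        failed.append("power-grid violations must be 0")
    if critical_disconnected_pins != 0:
        failed.append("PSM-0040 critical disconnected pins must be 0")
    return failed


def waiver_valid(avg_ir_drop_pct, power_grid_violations, critical_disconnected_pins):
    """The waiver auto-invalidates as soon as any single precondition fails."""
    return not failed_preconditions(
        avg_ir_drop_pct, power_grid_violations, critical_disconnected_pins
    )
```

Returning the list of failed preconditions, rather than a bare boolean, gives the next audit pass the exact reason the waiver fell over.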

These read as paranoid, and they kind of are, but mostly they are forensic. Every one of them stops a real category of "well, last time we said it was fine, I guess it is fine" drift.

The Volatility Was the Most Surprising Part

Across audit passes, margins moved 1 to 3 ns with no RTL change. The cell worst-case IR drop bounced 47.99 to 44.49 to 51.19% across consecutive runs. The antenna ratio went 4.99 to 3.44 to 4.99 again. Matched-delay failure counts went 41 to 51 to 79 across runs as different fixes interacted: the SS-cold max-slew upsize fix tightened slew, which also tightened matched-delay tracking, which re-opened brownout corners we had previously closed by adding a delay stage.

For this MPW prototype this is fine, but the lesson is that async self-timed timing closure is at the mercy of placement when margins are tight. As described in a previous blog post, there isn't a way to constrain the data paths with the approach I took implementing the Muller C-elements and mutex cells, so the design is basically unconstrained. Until you get a layout where every block has comfortable margin at every corner simultaneously, you are at the mercy of router decisions. The architectural answer is to draw these cells by hand and characterize them, which is my plan for the next chip.

The Audit Caught the Auditor

Two corrections to my own previous-pass audit summaries came out during this audit.

I had been quoting "29,524 decap cells in openframe" since the first audit pass. The real number is 36,328. The previous audit was counting only one of the two decap cell types in the design. Caught by the power-integrity sub-agent when it sanity-checked against the LEF cell count.

I had written that the PDN strap width (FP_PDN_VWIDTH) had changed 5 to 3.5 to 5 um across runs. Wrong. It had been steady at 5 um the whole time. Bad data that had found its way into the analysis and then into the waiver writeup. Caught on a careful re-check.

Neither of these changed any verdict on tape-out readiness, but both would have been permanent-record errors if not corrected, because they sat in waivers that survive across respins. The audit caught me. The lesson is that the audit needs its own audit. Even structured forensic discipline benefits from an explicit re-check pass for the auditor's own work.

What Worked and What I Would Change

Larger companies accumulate a list of 'gotchas' from previous mistakes or bugs that almost slipped through. Failure is the best teacher, but at this stage, I can't afford any tutoring sessions. While not as good as a seasoned engineering team, this agentic flow did uncover issues that I would have missed and did a very deep dive through the results that I would not have had time to verify by hand. So from a cost-benefit perspective, overall this approach worked very well for what I needed.

What worked well: Seven sub-agents running in parallel, one per discipline, structured prompts, status-vs-previous tables, severity-tagged findings. 15 minutes wall clock per pass instead of 2 hours sequential. The structure made each pass cheap enough to run more of them.

Pinning the checklist to specific run IDs. The single most important discipline detail. Without it, status quietly drifts and "we checked this" becomes "we checked something that no longer exists."

Strict waiver template with explicit auto-invalidating scopes. The antenna ratio oscillated inside the waiver scope every time. A hypothetical drift to 5.5x would have tripped the auto-invalidation and pulled the finding back into [!] status.

Standalone deterministic checks (parasitic annotation, recovery margin, CVC noise filter) alongside the sub-agents. Some checks are simple enough to script. The sub-agents quote those results rather than re-running them. Combining deterministic and judgment-based review let the judgment work focus on what actually required judgment.

Manual-check gates marked [W/~] until signed off. The PDN ring continuity check could not be automated; it needed a human in Magic looking at the layout. Until that human signed off, the manual gate was [W/~] and counted against tape-out readiness.

What I would change: Run the audit earlier. The first pass should be at 50% of the flow timeline, not at 80%. Several findings would have been much cheaper to fix earlier. The memory.v latch-enable architectural fix took 2 flow iterations because the script's upsize-glob bug fix landed mid-respin.

Build the standalone checks first, sub-agents second. The parasitic-annotation checker, the recovery-margin checker, the CVC-noise filter: these are the deterministic gates everything else quotes anyway. The sub-agents add value on the open-ended "is this report telling me something new" reading. Lead with the deterministic.

What Comes Next

The chip is at fab. First silicon comes back the first week of November. Between now and then, my work shifts from engineering to customer development, foundry conversations for the next chip, a Navy STTR grant proposal that submits on June 3, and the rest of the operational machinery of running a one-person semiconductor company.

That means the weekly cadence of these posts is going to slow down. The next two posts will be about the non-engineering side of getting here: what it took to submit a Navy STTR Phase I grant proposal as a solo-founded bootstrapped chip company, and what it actually looks like to chase non-dilutive funding when you are starting from zero. After those, posts will slow down until silicon comes back. When it does, you will hear about it.