The 180 nA That Wasn't There

Week 7: Our first IDDQ simulation said the slow corner at 100C leaked less than the fast corner at -40C. That is physically impossible. The numerical artifact hiding the truth was a single line in the SPICE convergence options.

The Anomaly

I was running a quiescent supply current characterization sweep across process, voltage, and temperature. Standard pre-tapeout work. The point of the exercise: figure out how much current the chip burns when it is doing nothing, so the datasheet has a real idle power number.

The first three measurements came back like this:

Corner          Temp     V       I_active
FF              -40C     1.95V   326 nA
TT              25C      1.80V   287 nA
SS              100C     1.60V   261 nA

That ordering is wrong. It is not slightly wrong, it is physics-violating wrong. Sub-threshold leakage is exponential in temperature. The slow process at 100 degrees should leak the most. What I had was the opposite.

Expanding The Sweep

The first instinct when something looks wrong is to take more measurements. Three corners is not enough data to see the shape of what is happening. I expanded to a full grid of process by voltage by temperature: twenty-seven simulations, with each axis varied independently.

This is the kind of sweep that gets put off as too expensive in time. Each run is a few hours. But measuring the shape of the leakage surface across the whole envelope is the only way to know whether you are looking at signal or noise.

The expanded grid showed something striking. The fast corner at hot temperatures behaved exactly as physics predicts: leakage rising sharply with temperature, doubling roughly every ten degrees, hitting nine microamps at the worst case.

But the slow corner did not move. Going from -40C to 100C at the slow process, leakage went from 236 nA to 242 nA. A 2.5 percent rise across 140 degrees of temperature. Real sub-threshold leakage in that temperature range should rise by a factor of fifty to one hundred. What is going on?

The First Diagnostic Question

Before chasing the physics, I needed to rule out something stupid: was the current actually DC, or was the simulation showing the average of an oscillating waveform? An async design without a clock could in principle have some weird metastable state that looks like a DC offset to a measurement window. The worry was sharper on this design, because this exact scenario had already played out once before.

A handful of probes ruled this out. Peak-to-peak current in the measurement window was 0.2 percent of the mean at the high-leakage corners and roughly 20 percent at the low-leakage corners. RMS matched the average to five significant figures. No characteristic frequency content. The current values were genuinely what the solver thought the DC leakage was.
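The three probes described above can be sketched in a few lines of Python. This is an illustration only, not the script used for the post: the sample data is synthetic, and the function assumes the current waveform has already been resampled onto a uniform time grid (transient output from SPICE normally has uneven timesteps).

```python
import math

def dc_metrics(samples):
    """Return (mean, peak-to-peak / mean, rms / mean) for uniformly spaced samples."""
    n = len(samples)
    mean = sum(samples) / n
    p2p = max(samples) - min(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    return mean, p2p / mean, rms / mean

# Hypothetical data: a flat 240 nA level with a 0.2 nA amplitude ripple, in amps.
samples = [240e-9 + 0.2e-9 * math.sin(2 * math.pi * k / 50) for k in range(1000)]

mean, ripple, rms_ratio = dc_metrics(samples)
print(f"mean = {mean:.4e} A")
print(f"peak-to-peak / mean = {ripple:.3%}")
print(f"rms / mean = {rms_ratio:.7f}")
```

An rms / mean ratio that stays at 1 to many digits means almost no energy sits in anything other than the DC term, which is the property being checked here.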

So the problem had to be with the measurement itself, not the circuit.

Looking For The Floor

I plotted the slow corner data as current divided by voltage:

SS at -40C      I_active     I_active / V
1.60V           236 nA       147 nA/V
1.80V           268 nA       149 nA/V
1.95V           296 nA       152 nA/V

The current was nearly linearly proportional to voltage. That is the defining characteristic of a resistor. Sub-threshold leakage scales exponentially with the gate-to-source voltage minus the threshold voltage, not linearly with supply voltage. Real leakage and a resistor look completely different on a current versus voltage plot.
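The resistor test is easy to reproduce from the three rows in the table above. A small sketch: if the current is resistive, I / V stays nearly constant and the current rises by about the same fraction as the supply.

```python
# Slow corner at -40C, from the table above: (supply voltage in V, current in nA).
points = [(1.60, 236.0), (1.80, 268.0), (1.95, 296.0)]

# For an ideal resistor this ratio is the same at every voltage.
ratios = [i / v for v, i in points]
spread = (max(ratios) - min(ratios)) / min(ratios)

v_rise = points[-1][0] / points[0][0] - 1.0   # fractional rise in supply voltage
i_rise = points[-1][1] / points[0][1] - 1.0   # fractional rise in current

for (v, i), g in zip(points, ratios):
    print(f"{v:.2f} V  {i:.0f} nA  {g:.1f} nA/V")
print(f"I/V spread across the sweep: {spread:.1%}")
print(f"voltage rose {v_rise:.1%}, current rose {i_rise:.1%}")
```

The two rises land within a few percent of each other, which is what a conductance does and not what an exponential does.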

About 150 nA/V works out to roughly 6.7 megohms of equivalent resistance from supply to ground. There was no resistive path in the design. The chip is logic. Where was this current coming from?

The Answer Was In The Solver

Most SPICE simulators add a tiny conductance across every PN junction and every transistor's drain-to-source path. The parameter is called "gmin" and the default value in ngspice is 1e-12 siemens. The reason it exists is purely numerical: when transistors are off, their conductance is supposed to be zero, and the matrix the solver works with becomes singular. Adding a small fake conductance everywhere keeps the matrix invertible.

For most simulations this is invisible. The gmin contribution is a few orders of magnitude below whatever real currents are flowing through your circuit. You never notice.

For an IDDQ measurement at the slow corner, you are measuring the smallest leakage current the chip is physically capable of producing. The physics produces something on the order of a few nanoamps in a small design like this. The solver helpfully adds its own contribution on top: 100,000 transistors times 1 picosiemens times 1.8 volts, or approximately 180 nanoamps of artifact current. Linear in voltage. Independent of process. Independent of temperature.

That floor was perfectly invisible to the fast process at one hundred degrees, where the real leakage was 9uA. The floor was still easy to miss at moderate corners, where it was maybe a third of the reading. But at the slow process at -40C, the real leakage was probably 1nA and the floor was 180nA. Ninety nine percent of what I was measuring was not leakage. It was the solver.

The flat slow-corner temperature curve was not the chip refusing to obey physics. It was the floor I was plotting, with real physics buried in the noise.
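The floor arithmetic is worth writing down as a back-of-envelope check. The sketch below uses the numbers quoted in this section and makes one simplifying assumption that is not from the simulator itself: every transistor contributes one gmin conductance that sees the full supply voltage. Real netlists will differ, so treat the result as an order-of-magnitude estimate.

```python
def gmin_floor_amps(n_devices, gmin_siemens, vdd_volts):
    """Artifact current if each device carries one gmin across the full supply."""
    return n_devices * gmin_siemens * vdd_volts

N_DEVICES = 100_000   # transistor count quoted above
GMIN = 1e-12          # ngspice default, in siemens

floor = gmin_floor_amps(N_DEVICES, GMIN, 1.8)
print(f"floor at 1.8 V: {floor * 1e9:.0f} nA")

# Share of the reading that is artifact, for the two extremes described above.
cases = [("SS at -40C, about 1 nA real", 1e-9), ("FF at 100C, 9 uA real", 9e-6)]
shares = {}
for label, real in cases:
    shares[label] = floor / (floor + real)
    print(f"{label}: {shares[label]:.1%} of the reading is floor")
```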

Dropping The Floor

The fix sounds simple: lower gmin. Two orders of magnitude tighter, from 1e-12 to 1e-14, would put the floor at 1.8 nA instead of 180 nA. Real leakage would dominate again.

One config line change and a day of simulations later, everything made sense. Leakage at the fast process rose exponentially with temperature. Leakage at the slow process rose exponentially with temperature, just smaller in absolute magnitude. The temperature scaling matched the expected behavior.

And... the simulation runtime exploded.

The Convergence Problem

Reducing gmin makes the solver matrix closer to singular. It still works, mathematically, but the numerical conditioning gets worse. Some sims that had taken ~2 hours now took 12 hours. Some failed outright with the kind of error message every SPICE user has stared at:

Warning: singular matrix
Note: Starting dynamic gmin stepping
Warning: Dynamic gmin stepping failed
Note: Starting true gmin stepping
Warning: True gmin stepping failed
Note: Starting source stepping
Warning: source stepping failed
Error: Transient op failed, timestep too small

The pattern was specific. Failures concentrated at the quiet middle: room temperature, moderate voltage, moderate process. The fast hot corners ran fine because real currents were large enough that the solver always knew what was happening. The slow cold corners ran fine because the parasitic settling was just slow but stable. The middle was where the solver had no clear signal to anchor to and the matrix conditioning collapsed.

Sanity Check Against Gate Level Simulation

Before settling on a final approach, I needed to rule out one more possibility: was the simulator chasing genuine post-reset settling? Maybe a latch or a C-element was metastable coming out of reset, and the solver was correctly tracking a circuit that was still moving. If so, the long runtimes and convergence failures were not solver problems, they were design problems.

The way to answer this is to ask the same question in a domain where the solver cannot lie to you. Gate level simulation knows nothing about analog parasitics. It tracks logical transitions and nothing else. If the design is genuinely settling after reset, gate level simulation will see the same activity that SPICE is seeing.

I wrote a small cocotb test that mirrored the SPICE stimulus: power up, assert reset, deassert reset, idle for 500 nanoseconds. Ran it with SDF annotation against the post-layout netlist. Dumped every signal inside one cell. Then a Python script analyzed the VCD: find the reset release edge, skip the propagation window, count logical transitions on every signal afterward.
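A minimal sketch of that kind of VCD analysis is below. It is not the script from this project: the reset name `rst_n`, its active-low polarity, the `settle` window, and the sample dump are all assumptions for illustration, and the parser only handles the common one-change-per-line VCD layout.

```python
def count_transitions_after_reset(vcd_text, reset_name="rst_n", settle=0):
    """Count value changes per signal later than reset release + settle (VCD time units)."""
    names, last, events = {}, {}, []   # id -> name, id -> last value, (time, id)
    now, release, in_defs = 0, None, True
    for line in vcd_text.splitlines():
        tok = line.split()
        if not tok:
            continue
        if in_defs:
            if tok[0] == "$var":                # $var <type> <width> <id> <name> $end
                names[tok[3]] = tok[4]
            elif tok[0] == "$enddefinitions":
                in_defs = False
            continue
        if tok[0].startswith("#"):              # timestamp line
            now = int(tok[0][1:])
            continue
        if tok[0].startswith("$"):              # $dumpvars / $end markers
            continue
        if tok[0][0] in "bBrR":                 # vector or real: "b1010 <id>"
            value, code = tok[0][1:], tok[1]
        else:                                   # scalar: "<value><id>", e.g. "1!"
            value, code = tok[0][0], tok[0][1:]
        prev = last.get(code)
        last[code] = value
        if prev is None or prev == value:
            continue                            # initial dump or redundant entry
        events.append((now, code))
        if release is None and names.get(code) == reset_name and (prev, value) == ("0", "1"):
            release = now                       # first rising edge of active-low reset
    if release is None:
        raise ValueError("reset release edge not found")
    counts = {n: 0 for n in names.values() if n != reset_name}
    for t, code in events:
        name = names.get(code)
        if name in counts and t > release + settle:
            counts[name] += 1
    return release, counts

# Hypothetical dump: reset releases at 100, two signals react at 102, then silence.
SAMPLE_VCD = """\
$timescale 1ns $end
$scope module top $end
$var wire 1 ! rst_n $end
$var wire 1 % q $end
$var wire 4 & state $end
$upscope $end
$enddefinitions $end
#0
$dumpvars
0!
0%
b0000 &
$end
#100
1!
#102
1%
b0001 &
#600
"""

release, counts = count_transitions_after_reset(SAMPLE_VCD, settle=5)
print(release, counts, sum(counts.values()))
```

With the propagation window set to 5 time units, the activity at 102 is excluded and the total comes out as zero, which is the "quiescent after reset" signature. Setting `settle=0` would count both transitions.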

The result was unambiguous. Zero logical activity after reset release. The gate level simulator saw a perfectly quiescent design. Whatever SPICE was struggling with, it was not the circuit doing anything.

It was pure analog parasitic charge redistribution. 229,000 parasitic capacitors slowly bleeding off through sub-threshold current paths. The lower the real leakage, the slower the bleed-off, which is why the low-leakage corners had longer settling tails than the high-leakage ones. Neither the simulator nor the circuit was wrong. It was just physics fighting back again.

The Final Approach

The compromise was per-corner gmin tuning. Different corners need different convergence aids depending on how much real current is flowing.

Where real leakage was large enough to dominate, like the fast process at one hundred degrees, gmin at 1e-12 was fine because the floor was invisible against a 9 microamp signal. Where real leakage was small but the solver could still converge, gmin at 1e-14 was right because the lower floor was needed to see the physics. For the awkward middle corners, gmin at 1e-13 was a working compromise: 18 nA floor, manageable convergence time, real currents still distinguishable.
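One way to keep per-corner settings manageable is a small lookup that emits the options line for each run. The sketch below is an assumption about how such a flow could look, not the flow used here: the regime names are hypothetical labels, and the floor estimate reuses the 100,000 transistor, full-supply approximation from earlier. The `.options gmin=` form is the usual ngspice netlist syntax, but check it against your simulator version.

```python
# Hypothetical regime labels mapped to the gmin values discussed above.
GMIN_BY_REGIME = {
    "high_leakage": 1e-12,  # floor invisible against a microamp-level signal
    "middle": 1e-13,        # compromise between floor and convergence time
    "low_leakage": 1e-14,   # lowest floor, needed to see the real physics
}

def floor_na(gmin, n_devices=100_000, vdd=1.8):
    """Estimated artifact floor in nA, assuming one gmin per device at full supply."""
    return n_devices * gmin * vdd * 1e9

def options_line(regime):
    """Netlist line selecting the gmin for a given regime."""
    return f".options gmin={GMIN_BY_REGIME[regime]:g}"

for regime, gmin in GMIN_BY_REGIME.items():
    print(f"{regime:13s} {options_line(regime):22s} floor ~ {floor_na(gmin):.1f} nA")
```

Which corner belongs to which regime still has to be decided from a first pass of measurements; the table only records the decision so the regression applies it consistently.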

The result was a regression that closed in a couple of days instead of running for weeks, with measurements that matched physics across the entire envelope. The worst case was the fast process at high voltage and 100C: 9uA per cell, scaling to about 1mA at the array level. I don't really like that result, but it is what it is.

Lessons

A few things stand out from this debugging experience:

Convergence parameters are noise parameters. Anything that helps the solver also adds something to the result. For most simulations the contribution is invisible. For low-current measurements like IDDQ, the convergence aid IS the signal floor. The number you read off the simulation is the sum of physics and the solver's helpfulness, and at small enough signal levels, the solver wins.

Use physics as the sanity check. Sub-threshold leakage doubles every 8 to 10C. If your simulation says leakage is flat across temperature, you are not seeing leakage. The first cut on any low-current measurement should be: does this scale the way physics says it should? If not, the measurement could be the problem, not the chip.
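That check can be made mechanical. Assuming leakage is exponential in temperature, two measurements imply a doubling interval, which can be compared against the 8 to 10C rule of thumb. The sketch below applies it to the slow-corner numbers from earlier in the post (236 nA at -40C, 242 nA at 100C).

```python
import math

def doubling_interval_c(i_cold, i_hot, t_cold_c, t_hot_c):
    """Degrees C per doubling implied by two readings, assuming exponential scaling."""
    return (t_hot_c - t_cold_c) * math.log(2) / math.log(i_hot / i_cold)

implied = doubling_interval_c(236e-9, 242e-9, -40, 100)
print(f"implied doubling interval: {implied:.0f} C")
```

An implied interval in the thousands of degrees, against an expected 8 to 10, says the reading is dominated by something that does not care about temperature.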

Cross-check across different abstraction layers. SPICE and gate level simulation measure different things. When SPICE is doing something confusing, gate level simulation can clue you in to whether the circuit is even moving. Designs are usually wrong in only one place at a time (unless you're very unlucky), and a multi-layer view narrows that place fast.

Per-corner tuning is legitimate. You do not have to use the same simulator settings for every corner of a sweep. A measurement that needs a tight gmin in one regime can use a loose gmin in another. The cost is a more complex flow. The benefit is correct numbers across the full envelope and a regression that actually finishes.

The chip goes to the fab in a week and a half. The IDDQ numbers are about where I would expect. The next post is about the digital signoff side of pre-tapeout characterization, and how going from 3 PVT corners to 48 revealed problems that the standard convention quietly hides.