You think you know how EEPROMs fail? Think again.

The Death and Deathlessness* project was my exploration of life, death, and immortality. EEPROMs, the permanent memory in some chips, can only be written a finite number of times. I had seen examples of folks killing EEPROMs, basically repeatedly writing and reading until they encountered an error, that is, an address in the EEPROM failing to correctly take a write. I therefore thought running down an EEPROM to the last failure would be the equivalent of death. And recording it for playback would be like immortality. Or would that still be life?

As I was building the project, I realized there were a few questions I could not find answers for. For sure, folks had shown that ATmega or ATtiny EEPROMs can actually be faithfully written and read way more than the 100k cycles the manufacturer guarantees. But everyone stopped their experiments at the first failure.

I wanted to know: when an address failed, was it just that address, or the whole EEPROM? It felt like it was just the address, as I’ve heard that folks simply stop using a failed address. But if an address failed, did it stay failed? And if it stayed failed, would other addresses also start failing until the whole EEPROM had failed? And because I was writing and then immediately reading, would failures be harder to find? That is, my test only caught a failure that showed up right after a write. What if failure after 100k cycles was really about the long-term stability of the stored value, which I’d never see, and which might be why folks could hit 1M or 2M quick read-write cycles before seeing a failure?

Well, since I was going to run this EEPROM down, I realized that I would be able to answer some of these questions.

When the whole thing started

Some parameters
I was using an ATtiny45, which has 256 bytes of EEPROM. I was only using the first 128 addresses (because my readout was an LED matrix with 128 pixels). For each byte I alternated writing 240 (binary 11110000) and 15 (binary 00001111), meaning all the bits alternated each cycle. I read the value after every write to see if it retained the expected written byte.
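Here's a rough sketch of that test loop. It is a Python simulation, not the firmware that ran on the ATtiny45: the `bytearray` stands in for the EEPROM, and the names `PATTERNS`, `run_cycle`, and `eeprom` are made up for illustration. It also assumes that one cycle writes the same pattern to all 128 addresses, and which pattern lands on odd versus even cycles is a guess.

```python
NUM_ADDRESSES = 128                   # only the first 128 of the 256 EEPROM bytes
PATTERNS = (0b11110000, 0b00001111)   # 240 and 15: every bit flips between cycles


def run_cycle(eeprom, cycle):
    """Write one pattern to every address, read it back, return failed addresses."""
    expected = PATTERNS[cycle % 2]    # assumption: pattern chosen by cycle parity
    failed = []
    for address in range(NUM_ADDRESSES):
        eeprom[address] = expected          # stand-in for an EEPROM write
        if eeprom[address] != expected:     # stand-in for the immediate read-back
            failed.append(address)
    return failed


eeprom = bytearray(256)               # simulated, fault-free EEPROM
for cycle in range(1, 5):
    print(cycle, run_cycle(eeprom, cycle))
```

Because 240 and 15 are bitwise complements, every one of the eight bits in a byte is exercised in both directions over any two consecutive cycles.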

My first try blew past 2M cycles with no errors showing, so, in disbelief, I stopped to check my code. It was good. So I got a fresh chip and started another experiment, and with extreme self-control, let the damn thing run for what felt like forever, without watching it.

Errors! Addresses 25, 33 and 115.
Image from day 28.

And then I saw an error. And some interesting behavior. After a few days, and some more interesting behavior, I stopped the experiment.

There was a challenge handling a CSV file with 4.9M rows, but I was able to see that the experiment did about 3.7M cycles, throwing about 1.2M errors.

And now for something completely different
The addresses that failed would only show a failure every other cycle. That’s the first insight: failure is at the BIT level, not the BYTE level. When an address wrote and read one value, it would fail, and when it wrote and read the other value (with every bit flipped), it would read fine. For example, address 33 only failed on the odd cycles, while address 41 only failed on the even cycles.
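To see why a single weak bit would produce failures on alternating cycles, here is a small Python sketch. It assumes a hypothetical fault model in which one bit can no longer hold a 1 (the `stuck_low_mask` parameter); the real failure mechanism in the chip may well differ. XOR-ing the expected and read-back bytes shows which bits disagree.

```python
def read_back(written, stuck_low_mask):
    """Simulate a read where the bits in stuck_low_mask always come back as 0."""
    return written & ~stuck_low_mask & 0xFF


def failed_bits(expected, actual):
    """Return the bit positions (0 = least significant) that differ."""
    diff = expected ^ actual
    return [bit for bit in range(8) if diff & (1 << bit)]


stuck = 0b00100000  # hypothetical: bit 5 cannot hold a 1
for pattern in (0b11110000, 0b00001111):
    actual = read_back(pattern, stuck)
    print(f"{pattern:08b} -> {actual:08b}, failed bits: {failed_bits(pattern, actual)}")
```

With bit 5 stuck low, 11110000 reads back as 11010000 and fails, while 00001111 never asks bit 5 to hold a 1 and so reads back fine. A weak bit in the other nibble would fail on the opposite cycles, which would fit two addresses failing on opposite parities.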

Bit-level failure
(morning of day 29)

Crazier still, while a failed address would fail consistently for a long stretch, it would then NOT fail for a long stretch, too. Indeed, I saw other failed addresses, also alternating between fail and succeed, pop up and then not show up for a long time. For example, address 33 was the first to fail, at cycle 666,517, and it failed 263,318 times in the subsequent 3,004,848 cycles.
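Those two numbers give a rough failure rate for address 33. The arithmetic below is just a back-of-the-envelope sketch: it assumes that, since address 33 only failed on odd cycles, about half of the subsequent cycles could have failed at all.

```python
failures = 263_318          # failures logged for address 33
cycles_after = 3_004_848    # cycles run after its first failure at cycle 666,517

rate_all = failures / cycles_after
# Assumption: only odd cycles can fail for this address, so halve the denominator.
rate_odd = failures / (cycles_after / 2)

print(f"share of all cycles: {rate_all:.1%}")
print(f"share of odd cycles: {rate_odd:.1%}")
```

That works out to about 8.8% of all cycles, or about 17.5% of the odd cycles, so even the worst address succeeded far more often than it failed.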

I did a histogram and saw that only 29 addresses had ever failed. Address 33 was the address with the most failures. But address 34 failed just 3 times (adjacent odd cycles 3,440,911; -913; -915) and then never again (that’s for the remaining 231k cycles before I turned it off).
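One way to put numbers on those stretches is to group an address's failing cycles into runs. This is a minimal sketch, not the analysis used for this post; `group_stretches` and its `max_gap` parameter are invented names, and a gap of 2 assumes failures land on every other cycle, as seen above.

```python
def group_stretches(fail_cycles, max_gap=2):
    """Group sorted failing cycle numbers into stretches.

    Two failures share a stretch when they are at most max_gap cycles apart
    (2 means consecutive cycles of the same parity).
    Returns a list of (first_cycle, last_cycle, failure_count) tuples.
    """
    stretches = []
    for cycle in fail_cycles:
        if stretches and cycle - stretches[-1][-1] <= max_gap:
            stretches[-1].append(cycle)
        else:
            stretches.append([cycle])
    return [(s[0], s[-1], len(s)) for s in stretches]


# The three failures reported for address 34.
print(group_stretches([3_440_911, 3_440_913, 3_440_915]))
```

For address 34 this gives a single stretch of three failures. Running the same grouping over address 33's failures, perhaps with a larger `max_gap`, would show how long its good and bad stretches actually were.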

Long stretches no errors
(evening of day 29)

This failing for stretches and then not failing suggests to me that this is a probabilistic fail rather than a catastrophic terminal fail. Think of it as the bit having trouble holding a charge (probabilistic) rather than no longer holding a charge at all (catastrophic fail). So it can go long stretches when its probability of holding a charge improves, until the probability gets bad again.

Of course, for the manufacturer and anyone looking to depend on the EEPROM, stepping into that probabilistic failure might as well be a terminal failure, as it makes that whole address useless. As for me, who thought that perhaps the whole address would fail, or that at least when a bit failed, the address would stay failed, this probabilistic nature basically killed the premise of my project.

Yup.

I could do a bit more statistical digging into the data, but I don’t really have that skill, or the right toolset. Nonetheless, what I can figure out is that reaching the condition where EVERY address fails is likely impossible (or would take way longer than I am willing to wait, though I am tempted).

Below is a link to the CSV if you’re interested in checking this out. The columns are Cycle, Address, Result (0 for failure). For my own logging and for the LED matrix output, I wrote out a row for every cycle. But, to save space and speed things up, I only wrote out Address and Result when there was a failure.

Please note, I share the original so I can avoid inadvertently introducing something in editing. There was some writing artefact that added a null row after every error (about 1.2M – sorry). And there’s some gibberish before the first cycle you need to delete. I suppose I could revisit the logging code. 
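If you want to work around those quirks rather than edit the file, here is a hedged Python sketch that counts failures per address. The exact row layout is an assumption: it treats a failure row as three comma-separated integers (Cycle, Address, Result) with Result 0, and silently skips everything else, including cycle-only rows, empty rows, and the leading gibberish. The sample rows are made up to show the layout, not copied from the real log, and `failures_by_address` is an invented name.

```python
from collections import Counter


def failures_by_address(lines):
    """Count failure rows per address, skipping any row that does not parse."""
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.strip().split(",")]
        if len(fields) != 3:
            continue                  # cycle-only rows, blank rows, gibberish
        try:
            _cycle, address, result = (int(field) for field in fields)
        except ValueError:
            continue                  # e.g. "123,," or non-numeric debris
        if result == 0:               # 0 marks a failure
            counts[address] += 1
    return counts


# Illustrative rows only (assumed layout, not real log data).
sample = ["??#garbage", "3440910", "3440911,34,0", "", "3440912", "3440913,34,0", ""]
print(failures_by_address(sample).most_common())
```

To run it on the real file, something like `failures_by_address(open("log.csv", errors="replace"))` should work, where `log.csv` is a placeholder for wherever you saved the download. The resulting `Counter` is effectively the per-address histogram.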

Summary
I had wondered a few things about how EEPROMs fail. I wondered whether the whole EEPROM failed when one bit failed. That’s not the case. Also, I couldn’t find anything that said whether errors were at the bit or byte level. I now know that they are at the bit level. I also wasn’t sure if a failure was terminal. Turns out failures are probabilistic, so an address can function properly after having failed numerous times. That also means I likely would never be able to kill off a whole EEPROM, where ALL the addresses fail on EVERY read and write cycle.

In the end, what I thought would be an achievable and straightforward running down of an EEPROM turned into an interesting revelation of resilience for the wee chip.

You’d think everything there is to know of ATtiny or ATmega EEPROM failure is known. Well, perhaps not (or maybe I just couldn’t find the answers to my questions). In any case, my Death and Deathlessness project did teach me many interesting things about EEPROM failure.

Do these findings jibe with the EEPROM behavior you’ve seen?

 

*Deathlessness is a key concept in The Lord of the Rings. I even wrote a short story on it in college, as extra credit, that the prof read out in class, to my embarrassment. In any case, when the movies came out, I had a chance to teach my kids how deathlessness was a key thread throughout JRRT’s stories of Middle-earth.