When was the last time you found yourself under attack from outer space? How about your computer? No, we’re not talking about a video game you played. What if we told you that you’re getting blasted from outer space at this very moment?
While it might sound like a joke, cosmic rays from outer space pose a real threat to virtually every computer of every size. Your laptop? Threatened. Your smartphone? Under threat. Your smart toaster? Believe it or not, it’s under threat.
Now, the good news is that there’s no evidence to suggest that cosmic rays do any real damage to us humans. Our devices, however, aren’t so lucky. And since we (along with our machines and even our cities) increasingly rely on those devices for pretty much everything, we’re not entirely out of harm’s way.
Fortunately, we do have the technology to protect our devices and applications from cosmic rays’ worst effects.
Hold on a minute. What are cosmic rays?
Cosmic rays are high-energy particles that reach us from outer space. When they penetrate the Earth’s atmosphere, they can cause something called air showers: cascades of particles such as electrons, neutrons, muons, pions, and alpha particles that reach and penetrate objects on Earth. They may blast right through you and your electronic devices.
OK, we’re all getting blasted with particles from space. What does that mean for me and my devices?
Particles that originate from cosmic rays can cause electromagnetic interference in data transmissions. For example, what was meant to be interpreted as a 0 gets “flipped” to a 1, or vice versa. This phenomenon is called a single-event upset, or SEU for short, and it’s a big problem in computing that often goes unnoticed.
In most consumer applications, however, the effects of such single-event upsets are usually limited in scope and severity, and they almost always go undiagnosed. For example, an SEU may cause a blue screen error on your computer or cause an app to “act up.” While that is undoubtedly frustrating, you’ll probably just curse the operating system, your computer, or the program you were using, and continue as usual after a device reset. What caused the error tends not to be something you investigate unless the error recurs, in which case it probably wasn’t caused by an SEU in the first place.
However, on an industrial scale and in applications where errors could have severe real-world impacts, these “small” errors can quickly turn quite… large.
While you may only face a single error in years of computer usage as a result of cosmic rays, imagine that instead of using a single computer (or smartphone, or tablet…), you are a company with thousands of servers. The risk of experiencing an error that causes real-world problems grows with every machine you add, and your thousands or even millions of customers may find themselves affected by a single stray particle. Of course, the only thing those customers would know is that you, their service provider, delivered disappointing service quality.
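To put rough numbers on how risk adds up across a fleet, here is a minimal sketch. It assumes every machine fails independently and uses a made-up per-machine probability (`P_SINGLE` is purely illustrative, not a measured rate).

```python
def prob_at_least_one_upset(per_machine_prob: float, machines: int) -> float:
    """Chance that at least one of `machines` independent machines sees an upset."""
    return 1.0 - (1.0 - per_machine_prob) ** machines


# Hypothetical chance that one machine suffers a harmful upset in a year.
# This value is an assumption for illustration only.
P_SINGLE = 0.001

for fleet_size in (1, 100, 1_000, 10_000):
    chance = prob_at_least_one_upset(P_SINGLE, fleet_size)
    print(f"{fleet_size:>6} machines: {chance:.2%} chance of at least one upset")
```

Under these assumptions, an event that is a curiosity on one machine becomes close to a certainty somewhere in a large fleet.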
In mission-critical applications and applications that could have severe real-world implications, the potential consequences of a single error can be enormous. Back in 2003, the town of Schaerbeek in Belgium saw a single flipped bit in a voting machine result in 4,096 extra votes for a candidate. The only reason the town discovered the error was that voters had seemingly cast more votes than there were eligible voters. Now imagine how many errors like this may have occurred around the world and gone undetected.
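The number 4,096 is exactly 2 to the power of 12, which is what you would see if the bit worth 2^12 in a stored counter flipped from 0 to 1. A minimal sketch of that arithmetic follows; the tally `true_count` is invented for illustration, since the real vote counts are not given here.

```python
def flip_bit(value: int, bit_index: int) -> int:
    """Return `value` with the bit at `bit_index` inverted (bit 0 is the lowest)."""
    return value ^ (1 << bit_index)


true_count = 514                       # hypothetical tally, not the real figure
corrupted = flip_bit(true_count, 12)   # a single upset in the bit worth 2**12

print(corrupted - true_count)          # difference caused by the one flipped bit
```

Flipping the same bit a second time restores the original value, which is why the damage from an SEU comes from not knowing that it happened, not from the flip being hard to undo.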
In high-risk environments, errors caused by cosmic rays can be even more impactful. For example, in 2008, a Qantas aircraft dove 690 feet in 23 seconds after the autopilot was disengaged by a bit flip, forcing the plane to land at a nearby airstrip after about a third of the passengers were injured in the incident.
In fact, airline computers have proven particularly prone to strange glitches, causing severe economic damage due to flight cancelations, delays, costly maintenance work, and so on. Experts believe that bit flips caused many of these glitches, not least because cosmic radiation is a lot stronger at 35,000 feet, which substantially increases the risk of such interference.
For aerospace companies and organizations like NASA that operate computers in high orbit and even in deep space, such interference is an even more significant challenge.
So, what can I do to protect my devices and systems?
Usually, when dealing with interference, you look for effective ways to isolate the sensitive parts from its source, for example by shielding the copper wires in a cable with plastic and metal foil.
With memory and cosmic rays, that type of solution isn’t viable. To effectively protect a DRAM module from cosmic rays, you’d have to cover it with around a dozen feet of concrete on each side.
But instead of designing industrial systems to account for memory covered in huge blocks of concrete, there are other, much more efficient solutions that can be used to mitigate the impact of cosmic radiation on our devices and computer systems.
Protecting you from space with Innodisk’s ECC DRAM
Innodisk offers a broad range of DRAM modules, in almost every form factor imaginable, that are equipped with something called error-correcting code (ECC), which can reduce the impact of cosmic rays and similar interference on memory modules to practically zero.
Instead of relying on massive amounts of concrete or other substances to keep electromagnetic interference away from the DRAM circuitry, ECC memory corrects for the rare SEUs when they happen using either Hamming code or triple modular redundancy.
Both Hamming code and triple modular redundancy ensure that if a bit has been flipped, the system notices this and only lets the correct value through.
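To make the Hamming idea concrete, here is a toy sketch of the classic Hamming(7,4) code, which protects 4 data bits with 3 parity bits. It is an illustration only: real ECC modules do this in hardware on much wider words, and the function names below are hypothetical, not part of any Innodisk product.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword layout (positions 1..7): p1 p2 d1 p3 d2 d3 d4
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome read as a 1-based bit position
    if position:
        c[position - 1] ^= 1         # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1                       # simulate a single-event upset at position 5
recovered, where = hamming74_correct(stored)
print(recovered == word, where)
```

The parity bits are chosen so that the pattern of failed checks spells out the position of the flipped bit, which lets the decoder repair it without knowing the original data.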
This feature is made possible by an extra integrated circuit (IC) on the DRAM module.
For example, on the Innodisk DDR4 DRAM module above (Picture 1), there are nine ICs (the black chips embedded on the green board), with one of them being responsible for ECC. Since the number of ICs on DRAM modules follows the 2-4-8-16 (and so on) sequence, when you see an odd number of ICs like five or nine, you can be almost sure that what you’re seeing is an ECC DRAM module.
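One way to see why a single extra chip can be enough is to count check bits. The sketch below assumes a Hamming-style code with one extra overall parity bit (often called SECDED) protecting a 64-bit data word, and assumes each chip contributes 8 bits. Both are common arrangements, but neither is stated above, so treat them as assumptions.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (enough to locate any single error)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Hamming check bits plus one overall parity bit to also detect double errors."""
    return hamming_check_bits(data_bits) + 1


DATA_BITS = 64       # assumed data word width
BITS_PER_CHIP = 8    # assumed width of each memory chip

check = secded_check_bits(DATA_BITS)
print(check, (DATA_BITS + check) // BITS_PER_CHIP)  # check bits, chips needed
```

Under these assumptions, the check bits for a 64-bit word fit exactly into one additional 8-bit chip, which matches the eight-plus-one layout described above.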
While ECC doesn’t protect your memory from interference, it does protect you from the adverse effects that interference can have on memory modules. SEUs are rare in the first place, so the likelihood of one affecting two circuits simultaneously and with the same charge is practically zero. The flipped bit is completely harmless as long as it is discarded in favor of the correct value, which is what happens with Innodisk’s ECC DRAM modules.
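The reasoning about two circuits being hit at once is easiest to see with triple modular redundancy. In this minimal sketch, three copies of a value are kept and a bitwise majority vote decides the result; the function name and sample values are hypothetical, and this is not a description of how any particular module is built.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three copies: each result bit agrees with at least two inputs."""
    return (a & b) | (a & c) | (b & c)


original = 0b1011_0010
copy_a = original ^ (1 << 3)   # simulate an upset in bit 3 of one copy
copy_b = original
copy_c = original

print(majority_vote(copy_a, copy_b, copy_c) == original)

# The vote only fails if the same bit flips in two copies at once.
double_hit = majority_vote(original ^ (1 << 3), original ^ (1 << 3), original)
print(double_hit == original)
```

A single flipped bit is simply outvoted. Upsets in two copies are still harmless when they hit different bits, so only the rare case of the same bit flipping in two copies gets through.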
Memory problems beyond outer space
In this post, we’ve put extra emphasis on the completely invisible challenges from space that DRAM modules face and that ECC DRAM can effectively address. However, factors far less interesting-sounding than cosmic rays can also cause bit flips. For instance, memory cells in DRAM modules can have ever-so-slight weaknesses that risk causing a one-off bit flip. Electromagnetic interference that risks causing bit flips can also have sources far more terrestrial than cosmic rays.
Whatever the reason for the bit flip, ECC DRAM helps ensure that it doesn’t result in any practical negative effects on the memory or device’s operation. Most importantly, ECC DRAM helps mitigate the risk of any critical issues arising from a bit flip, like ruining an election result, causing a plane to fall out of the air, or causing service disruption for one’s company or customers.
When working at scale or with critical applications, ECC DRAM is an easy choice.
The harder choice is trying to explain to your customers or bosses that your service outage (or worse!) was caused by cosmic rays from galaxies far, far away.