Action Optional means that the CPU detected some form of corruption in the background and tells the OS about it using a machine check exception. The blanket action of crashing the machine for all uncorrected soft and hard memory errors is sometimes an overreaction. Ignore, failure, and delay are all similar in that the page was not completely isolated, apart from being flagged as poisoned. That’s the stuff Andi Kleen and co. Alternatively, memory may be occasionally “scrubbed.” These delays include asynchronous hardware reporting of the machine check event. How can a machine check for accessing erroneous memory contents be asynchronous?


A CPU read or, better yet, a data prefetch (triggered either explicitly by an instruction or implicitly by a prefetch engine) may have issued the memory reference that triggered the MCA. While HWPOISON was developed for x86-based machines, interest has been expressed by supporters of other Linux server architectures, such as ia64 and sparc.

System programming guide https: EDAC is an alternative approach to reporting memory errors. However, this is infeasible for two reasons. This is in addition to the mcelog test suite included with the source (make test). Background scrubbing gives a machine check.


mcelog — further reading

First, the offending instruction and process cannot be determined due to delays between the data error consumption and execution of the poison handler. As of this writing, HWPOISON is enabled for all architectures to make testing on any machine possible via a user-mode fault injector, which is detailed below. Background scrubbing is entirely asynchronous to process execution.
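A user-mode injection of this kind can be sketched in Python. This is only an illustration under stated assumptions: it assumes the injector in question is the Linux madvise() call with the MADV_HWPOISON advice, which needs root privileges and a kernel built with memory-failure support. The function name poison_one_page and its dry_run flag are hypothetical, and the default is a dry run so that nothing is poisoned by accident.

```python
import mmap


def poison_one_page(dry_run=True):
    """Map one anonymous page and, unless dry_run is set, ask the kernel to poison it.

    Hypothetical helper for illustration only; returns a short status string.
    """
    page = mmap.PAGESIZE
    buf = mmap.mmap(-1, page)  # anonymous mapping of a single page
    try:
        buf[:4] = b"test"  # touch the page so it is backed by physical memory
        advice = getattr(mmap, "MADV_HWPOISON", None)
        if advice is None or not hasattr(buf, "madvise"):
            return "unsupported"  # constant or call not available on this platform
        if dry_run:
            return "skipped"  # safe default: do not actually inject
        # Assumption: requires root and a kernel with memory-failure support.
        buf.madvise(advice, 0, page)
        return "poisoned"
    except OSError as exc:
        return f"failed: {exc.strerror}"
    finally:
        buf.close()


if __name__ == "__main__":
    print(poison_one_page())
```

Calling poison_one_page(dry_run=False) should only be done on a test machine, since the kernel then treats the page as genuinely corrupted.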

There is a notion of an “action optional” machine check.


Introduction to platform hardware errors on modern x86 systems, including detailed flows and recent improvements to the Linux x86 machine check handling, with a focus on memory errors. Further reading: papers and presentations. mcelog – memory error handling in user space at Linux Kongress (paper, slides). Or are you asking about something much more subtle?

Otherwise, hardware poisoning will cause a system panic.

Unlike clean pages, dirty pages in these caches have differences between the memory and disk copies. Huge pages fail because reverse mapping, which is needed to identify the process that owns the page, is not supported for them.
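The distinction between these page types can be captured in a toy model. It is a sketch, not the kernel's logic: the Page class, the kind labels and the outcome strings are all hypothetical, and the "reload from disk" outcome for clean pages is an inference from the fact that their memory and disk copies match.

```python
from dataclasses import dataclass


@dataclass
class Page:
    kind: str            # hypothetical labels: "page_cache" or "huge"
    dirty: bool = False  # True if memory holds changes not yet written to disk


def poison_outlook(page):
    """Return a rough, illustrative outcome for a poisoned page."""
    if page.kind == "huge":
        # No reverse mapping, so the owning process cannot be identified.
        return "fail"
    if page.kind == "page_cache" and not page.dirty:
        # Memory and disk copies match, so the disk copy is still good.
        return "reload from disk"
    if page.kind == "page_cache":
        # The memory copy differs from disk, so those changes are gone.
        return "unwritten changes lost"
    return "unknown"


if __name__ == "__main__":
    for p in (Page("page_cache"), Page("page_cache", dirty=True), Page("huge")):
        print(p, "->", poison_outlook(p))
```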

A newer study that gets to the same conclusion. One downside to the ever-increasing memory size available on computers is an increase in memory failures. Thus, when HWPOISON is coupled with the appropriate fault-tolerant processors, Linux users can enjoy systems that are more tolerant to memory errors in spite of increased memory densities.


This document is dated June, so it’s not like it’s ancient.

The most recent Intel architectures support a notion of “recoverable machine check,” wherein the hardware tells the OS that no CPU state was corrupted when it noticed the problem. Once the poisoned data is actually used (loaded into a processor register, etc.), a machine check is raised.

MCE is the mechanism by which the hardware reports the bad page to the operating system. The hardware now supports a concept of recoverable machine check, and the software uses it.

For users:

But that’s not the case the article describes. Linux EDAC project on sourceforge. However, these pages containing critical kernel data cannot be isolated.

So there we have it. The OS can then take appropriate action, like killing the process with the corrupted data or logging the event properly to disk.
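One way to see whether the kernel has taken such action is to check how much memory it has marked as poisoned. The sketch below assumes a Linux kernel built with memory-failure support, which exposes a HardwareCorrupted line in /proc/meminfo. The helper names are hypothetical, and on kernels without that support the field is simply absent.

```python
import re


def hardware_corrupted_kb(meminfo_text):
    """Return the HardwareCorrupted value in kB, or None if the field is absent."""
    match = re.search(
        r"^HardwareCorrupted:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def read_hardware_corrupted(path="/proc/meminfo"):
    """Read the value from the running system; None if unavailable."""
    try:
        with open(path) as f:
            return hardware_corrupted_kb(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("HardwareCorrupted (kB):", read_hardware_corrupted())
```

A non-zero value means at least one page has been poisoned since boot.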