ESoftCheck: Removal of Non-vital Checks for Fault Tolerance
Jing Yu,
Maria Jesus Garzaran,
Marc Snir
Abstract:
As semiconductor technology scales into the deep submicron regime, the
occurrence of transient or soft errors will increase. This will require
new approaches to error detection. Software checking approaches are
attractive because they require little hardware modification and can be
easily adjusted to fit different reliability and performance
requirements. Unfortunately, software checking adds a significant
performance overhead.
In this paper we present ESoftCheck, a set of compiler optimization
techniques to determine which are the vital checks, that is, the minimum
number of checks that are necessary to detect an error and roll back to
a correct program state. ESoftCheck identifies the vital checks on
platforms where registers are hardware-protected with parity or ECC,
when there are redundant checks, and when checks appear in loops.
ESoftCheck also provides knobs to trade reliability for performance
based on the support for recovery and the degree of trustworthiness of
the operations. Our experimental results on a Pentium 4 show that
ESoftCheck can obtain a 27.1% performance improvement without losing
fault coverage.
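To make the idea of a non-vital check in a loop concrete, here is a minimal hand-written sketch in C. It is not the paper's algorithm or the compiler's output. It assumes a simple duplicate-and-compare style of software checking, and all names (`check`, `fault_detected`, `checks_executed`, and the two `sum_*` functions) are hypothetical. In this toy loop a corrupted running sum stays different from its duplicate, so a single comparison after the loop still sees the mismatch that per-iteration comparisons would have seen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error handler: a real system would roll back to a
 * correct program state instead of aborting. */
static void fault_detected(void) {
    fprintf(stderr, "soft error detected\n");
    abort();
}

/* Counts how many checks actually execute, for comparison only. */
static size_t checks_executed = 0;

/* Compare a value with its redundant copy. */
static void check(long original, long duplicate) {
    checks_executed++;
    if (original != duplicate) {
        fault_detected();
    }
}

/* Baseline: the duplicated computation is compared in every iteration. */
long sum_check_each_iteration(const int *data, size_t n) {
    long sum = 0, sum_dup = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
        sum_dup += data[i];      /* redundant copy of the computation */
        check(sum, sum_dup);     /* one check per iteration */
    }
    return sum;
}

/* Reduced: the per-iteration checks are dropped and one check runs
 * after the loop. A mismatch between sum and sum_dup introduced in any
 * iteration persists through the remaining additions. */
long sum_check_after_loop(const int *data, size_t n) {
    long sum = 0, sum_dup = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
        sum_dup += data[i];
    }
    check(sum, sum_dup);         /* single check before the result is used */
    return sum;
}
```

Both functions return the same sum. For an array of n elements the first executes n checks and the second executes one, which is the kind of overhead reduction the abstract refers to. Whether delaying a check is safe in general depends on the recovery support available, which this sketch does not model.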
To Appear:
"ESoftCheck: Removal of Non-vital Checks for Fault Tolerance"
Jing Yu, Maria Jesus Garzaran, and Marc Snir
Proceedings of the Seventh International Symposium on
Code Generation and Optimization (CGO '09),
Seattle, WA, March 2009.