Reliability, Availability, and Serviceability (RAS)

Concept of RAS

Reliability, Availability, and Serviceability (RAS) is a measure of the robustness of a system. A RAS-enabled platform aims to ensure that the system produces correct outputs, remains operational, and is easy to maintain. RAS reduces system downtime by detecting hardware errors and correcting them when possible. The level of RAS to be achieved is implementation dependent. Various techniques help achieve RAS targets, for example fault prevention and fault removal, error handling and recovery, and fault handling. In a well-designed RAS system, software and hardware work together to minimize the impact of hardware faults on overall system operation and hence boost performance.

Overview

The RAS specification divides the RAS architectural extension support into two categories:

ARMv8-A RAS Extension
RAS system architecture

The RAS architectural specification defines the hardware RAS extensions that the CPU and the system can implement to achieve the desired level of RAS support. This document outlines the concepts of the RAS architecture that are important for understanding the RAS software architecture.

The ARMv8-A RAS Extension defines the RAS extensions that are mandatory for CPU implementations based on ARMv8.2 and above. To enable architectural support for the RAS extension in software, the RAS_EXTENSION flag must be set to 1.

The RAS system architecture defines the architectural support required to enable system-level RAS support on a platform. It defines a reusable component architecture that can detect and record errors and signal them to a Processing Element (PE). The PE is implementation defined: it can be anything capable of handling the given error, for example the AP, SCP, or MCP. These architectural definitions make the software easier to design. The following are some of the component definitions that the RAS system architecture provides.

Node

A node is one such component architecture defined by RAS. A system can have one or more error nodes. Architecturally, a node:

Implements one or more standard error records.
Records detected and consumed errors.
Might include control to disable the error reporting and recording while the software initializes.
Reports recorded errors through an asynchronous error reporting mechanism such as an interrupt, for example the Fault Handling Interrupt (FHI).
Implements a counter for counting corrected errors.
Logs timestamps in each error record.
Reports uncorrected errors through in-band error signaling (external abort).
Reports critical error conditions via the Critical Error Interrupt (CRI).
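
The node behaviour listed above can be pictured with a small model. This is only an illustrative sketch: the class names, the fields, and the interrupt hook are hypothetical and do not correspond to any real register layout or driver interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ErrorRecord:
    """Simplified stand-in for one standard error record (hypothetical layout)."""
    valid: bool = False
    address: Optional[int] = None
    timestamp: Optional[int] = None
    corrected: bool = False


@dataclass
class Node:
    """Toy error node: records errors, counts corrected ones, signals interrupts."""
    records: List[ErrorRecord]
    raise_interrupt: Callable[[str], None]   # hypothetical interrupt hook
    reporting_enabled: bool = False          # kept off while software initializes
    corrected_count: int = 0

    def detect(self, index: int, address: int, now: int, corrected: bool) -> None:
        if not self.reporting_enabled:
            return                           # recording and reporting disabled
        self.records[index] = ErrorRecord(True, address, now, corrected)
        if corrected:
            self.corrected_count += 1
        self.raise_interrupt("FHI")          # asynchronous fault handling interrupt


raised: List[str] = []
node = Node(records=[ErrorRecord()], raise_interrupt=raised.append)
node.detect(0, address=0x8000_0000, now=100, corrected=True)   # ignored: disabled
node.reporting_enabled = True
node.detect(0, address=0x8000_0040, now=101, corrected=True)
```

The first call is dropped because reporting is still disabled, which mirrors the optional control a node might provide while software initializes.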

Error Record

The RAS system architecture defines a standard error record. A node captures the complete error information in these error records. The specification defines mechanisms to access error records through system registers or memory-mapped registers. A standard error record comprises:

ERR<n>STATUS: characterizes the error and marks valid status fields.
ERR<n>ADDR: error address register.
ERR<n>MISC<m>: miscellaneous error register. To be used for:
- Identifying the Field Replaceable Unit (FRU).
- Locating the error within the FRU.
- Implementing a corrected error counter to count the corrected errors.
- Storing the timestamp value for recorded errors.

An error record records the following component error states:

Corrected Error (CE).
Deferred Error (DE).
Uncorrected Error (UE): UE has the following sub-types:
- Uncontainable error (UC).
- Unrecoverable error (UEU).
- Recoverable error or Signaled error (UER).
- Restartable error or Latent error (UEO).
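
As a rough illustration of how software might tell these error states apart, the sketch below classifies a raw ERR<n>STATUS value. The bit positions and the sub-type encoding are assumptions based on the Arm RAS architecture and should be checked against the specification for a given implementation; the function name is hypothetical.

```python
# Bit positions and encodings below are assumptions taken from the Arm RAS
# architecture; verify them against the specification before relying on them.
STATUS_V = 1 << 30        # record valid
STATUS_UE = 1 << 29       # uncorrected error
STATUS_DE = 1 << 23       # deferred error
CE_SHIFT, CE_MASK = 24, 0x3      # corrected error field
UET_SHIFT, UET_MASK = 20, 0x3    # uncorrected error type field

UET_NAMES = {0b00: "UC", 0b01: "UEU", 0b10: "UEO", 0b11: "UER"}


def classify_status(status: int) -> list:
    """Return the error states flagged in a raw ERR<n>STATUS value."""
    if not status & STATUS_V:
        return []                 # nothing recorded in this error record
    states = []
    if (status >> CE_SHIFT) & CE_MASK:
        states.append("CE")
    if status & STATUS_DE:
        states.append("DE")
    if status & STATUS_UE:
        states.append("UE:" + UET_NAMES[(status >> UET_SHIFT) & UET_MASK])
    return states


recoverable = classify_status(STATUS_V | STATUS_UE | (0b11 << UET_SHIFT))
corrected = classify_status(STATUS_V | (0b10 << CE_SHIFT))
```

Under these assumptions, `recoverable` is `["UE:UER"]` and `corrected` is `["CE"]`; a value without the valid bit set yields an empty list.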

Software Error Handling

There are two approaches to error handling in software:

Firmware First Error Handling.
Kernel First Error Handling.

Firmware First Error Handling

Firmware First error handling requires that error events are handled in EL3 and then relayed to the OSPM for logging. On an error, the firmware consumes the error information and generates a Common Platform Error Record (CPER), an information buffer format defined by the UEFI specification for storing error information. The CPER is placed in firmware reserved memory that is later shared with the OSPM when it is notified about the error.

On Arm Neoverse Reference Design platforms, Firmware First error handling is achieved using the Hardware Error Source Table (HEST) and Software Delegated Exception Interface (SDEI) tables. The Secure Partition (Standalone MM driver) is used to generate the CPER information for the error. At boot, the HEST table is published and the OSPM is made aware of the hardware error sources that the platform supports.

At runtime, when a hardware fault is detected, the corresponding error or fault handling interrupt is generated. This interrupt is taken to the EL3 runtime firmware, which calls into the Secure Partition; the Secure Partition generates a CPER record and places it in firmware reserved memory. The EL3 runtime firmware then notifies the OSPM about the error using SDEI.
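
The runtime sequence can be sketched as a small model. This is not the firmware implementation: the class, the method names, and the reduced CPER fields are hypothetical, and the SDEI notification is represented by a plain callback.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cper:
    """Tiny placeholder for a CPER entry; the real format is defined by UEFI."""
    source_id: int
    severity: str
    address: int


class FirmwareFirstFlow:
    def __init__(self, notify_ospm: Callable[[int], None]):
        self.reserved_memory: List[Cper] = []   # firmware reserved CPER buffer
        self.notify_ospm = notify_ospm          # stands in for an SDEI event

    def secure_partition_handle(self, source_id: int, severity: str,
                                address: int) -> None:
        # Secure Partition: turn the raw error information into a CPER entry.
        self.reserved_memory.append(Cper(source_id, severity, address))

    def el3_interrupt_handler(self, source_id: int, severity: str,
                              address: int) -> None:
        # EL3 runtime firmware: call into the Secure Partition, then notify OSPM.
        self.secure_partition_handle(source_id, severity, address)
        self.notify_ospm(source_id)


events: List[int] = []
flow = FirmwareFirstFlow(notify_ospm=events.append)
flow.el3_interrupt_handler(source_id=1, severity="corrected", address=0x1000)
```

The ordering is the point of the sketch: the CPER entry is already in reserved memory by the time the OSPM is notified, so the OSPM can read it when it handles the event.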

Here are example platform implementations for Firmware First Error Handling.

Kernel First Error Handling

In Kernel First error handling, errors are handled directly by the OSPM without firmware intervention. The fault and error events generated by the platform are taken directly to the OSPM.

Arm Neoverse Reference Design platforms use the Arm Error Source Table (AEST) to achieve Kernel First error handling. The AEST is defined in the ACPI specification for RAS. It describes the hardware error sources that are present on the platform and comprises one or more error nodes. An AEST node entry identifies the component the node belongs to, for example Processor, Memory, SMMU, or GIC. It defines the interface type for accessing the node, for example memory mapped or system register. A node also defines the list of interrupts that it supports.

The OSPM implements an AEST driver module to traverse the AEST table. The module registers IRQ handlers for all supported node interrupts. A fault event occurring on a node or error source is then forwarded directly to the OSPM for handling.
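
The traversal and handler registration can be illustrated with a simplified model. The node fields, the interrupt numbers, and the function names below are hypothetical; a real AEST driver parses the binary ACPI table and uses the kernel's interrupt registration interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AestNode:
    """Hypothetical, simplified view of one AEST node entry."""
    component: str            # e.g. "Processor", "Memory", "SMMU", "GIC"
    interface: str            # "system-register" or "memory-mapped"
    interrupts: List[int]     # interrupt numbers the node can raise (made up)


def register_handlers(table: List[AestNode],
                      handler: Callable[[AestNode, int], str]
                      ) -> Dict[int, Callable[[], str]]:
    """Walk the table and bind one handler per node interrupt."""
    irq_table: Dict[int, Callable[[], str]] = {}
    for node in table:
        for irq in node.interrupts:
            # Default arguments capture the current node and irq values.
            irq_table[irq] = lambda n=node, i=irq: handler(n, i)
    return irq_table


def log_error(node: AestNode, irq: int) -> str:
    return f"{node.component} error on irq {irq} via {node.interface}"


aest = [
    AestNode("Processor", "system-register", [17, 18]),
    AestNode("Memory", "memory-mapped", [40]),
]
irq_table = register_handlers(aest, log_error)
```

Calling `irq_table[40]()` in this model stands in for the interrupt firing: the bound handler already knows which node and interface the event belongs to.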

Here is an example platform implementation for Kernel First Error Handling: N2 CPU RAS support.

Error Injection Software

Error injection is a micro-architectural feature defined by RAS for injecting errors into RAS-supported system components. Software can use the error injection registers to inject an error and test the error handling software implemented by the platform.

Arm Neoverse Reference Design platforms use the Error Injection (EINJ) ACPI table defined in the ACPI specification to implement the error injection feature. EINJ is an action and instruction based table that defines a set of actions and their corresponding instructions. Each action is also assigned a firmware reserved memory space to store action-specific data. An instruction is essentially a read or a write operation performed on that reserved memory.
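
The action and instruction structure can be pictured with a minimal interpreter. The action and instruction names follow the ACPI EINJ definitions, but the table entries, the addresses, the operand value, and the dictionary used as reserved memory are all made up for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List

# Instruction names follow the ACPI EINJ instruction set; everything else in
# this sketch (entries, addresses, flat dictionary as memory) is hypothetical.
READ_REGISTER = "READ_REGISTER"
WRITE_REGISTER = "WRITE_REGISTER"
WRITE_REGISTER_VALUE = "WRITE_REGISTER_VALUE"


@dataclass
class EinjEntry:
    action: str        # e.g. "SET_ERROR_TYPE"
    instruction: str
    address: int       # location in the firmware reserved memory
    value: int = 0     # constant used by WRITE_REGISTER_VALUE


def run_action(table: List[EinjEntry], memory: Dict[int, int],
               action: str, operand: int = 0) -> int:
    """Execute every instruction entry for one action; return the last read."""
    result = 0
    for entry in table:
        if entry.action != action:
            continue
        if entry.instruction == READ_REGISTER:
            result = memory.get(entry.address, 0)
        elif entry.instruction == WRITE_REGISTER:
            memory[entry.address] = operand
        elif entry.instruction == WRITE_REGISTER_VALUE:
            memory[entry.address] = entry.value
    return result


table = [
    EinjEntry("SET_ERROR_TYPE", WRITE_REGISTER, address=0x100),
    EinjEntry("EXECUTE_OPERATION", WRITE_REGISTER_VALUE, address=0x108, value=1),
    EinjEntry("GET_COMMAND_STATUS", READ_REGISTER, address=0x110),
]
memory: Dict[int, int] = {}
run_action(table, memory, "SET_ERROR_TYPE", operand=0x8)   # example error type
run_action(table, memory, "EXECUTE_OPERATION")
status = run_action(table, memory, "GET_COMMAND_STATUS")
```

In this model the injection request is just data written into reserved memory; on a real platform the write performed for the execute action is what hands control to the firmware.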

On Arm Neoverse Reference Design platforms, the platform firmware at EL3 implements the functionality to program the error injection registers. The OSPM initiates the injection and generates an SPI to call into the platform firmware. EINJ defines an action that programs the GICD register to trigger an SPI, which is handled in EL3.

Both Firmware First and Kernel First software use the EINJ ACPI table to validate their error handling functionality. The steps to exercise the EINJ feature can be found in Base RAM ECC RAS support and N2 CPU RAS support.