Reliability, Availability, and Serviceability (RAS)

Overview

Reliability, Availability and Serviceability (RAS) is a measure of the robustness of a system. A RAS-enabled platform aims to produce correct outputs, remain operational, and be easy to maintain. RAS reduces system downtime by detecting hardware errors and correcting them when possible. The level of RAS to be achieved is implementation dependent. Various techniques help achieve RAS targets, e.g., fault prevention and fault removal, error handling and recovery, and fault handling. A well-designed RAS system ensures that software and hardware work together to minimize the impact of hardware faults on overall system operation and performance.

The RAS specification divides RAS architectural extension support into two categories:

ARMv8-A RAS Extension
RAS System Architecture

The RAS architectural specification defines the hardware RAS extensions that the CPU and the system can implement to achieve the desired level of RAS support.

The ARMv8-A RAS Extension defines the RAS extensions that are mandatory for CPU implementations based on ARMv8.2 and above. To enable RAS extension architectural support in software, the RAS_EXTENSION flag must be set to 1.

The RAS System Architecture defines the architectural support required to enable system-level RAS support on a platform. It defines a reusable component architecture that can detect and record errors and signal them to a Processing Element (PE). The PE is implementation defined; it can be anything capable of handling the given error, e.g., the AP, SCP or MCP. These architectural definitions make designing the software easier.

Component Definitions by RAS System Architecture

Below are some component definitions that the RAS System architecture defines:

Node

A node is one of the components defined by the RAS System Architecture. A system can have one or more error nodes. Architecturally, a node:

Implements one or more standard error records.
Records detected and consumed errors.
Might include a control to disable error reporting and recording while software initializes.
Reports recorded errors through an asynchronous error reporting mechanism such as interrupts, e.g., the Fault Handling Interrupt (FHI).
Implements a counter for counting corrected errors.
Logs timestamps in each error record.
Reports uncorrected errors through in-band error signaling (external abort).
Reports critical error conditions via the Critical Error Interrupt (CRI).

Error Record

The RAS System Architecture defines a standard error record. A node captures all error information in these error records. The specification defines mechanisms to access error records either as system registers or as memory-mapped registers. A standard error record comprises:

ERR<n>STATUS: characterizes the error and marks valid status fields.
ERR<n>ADDR: error address register.
ERR<n>MISC<m>: miscellaneous error register. To be used for:
- Identifying the Field Replaceable Unit (FRU).
- Locating the error within the FRU.
- Implementing corrected error counter to count the corrected errors.
- Storing the timestamp value for recorded errors.

An error record records the following component error states:

Corrected Error (CE).
Deferred Error (DE).
Uncorrected Error (UE): UE has the following sub-types:
- Uncontainable error (UC).
- Unrecoverable error (UEU).
- Recoverable error or Signaled error (UER).
- Restartable error or Latent error (UEO).

Error Handling

There are two approaches to achieve error handling in software:

Firmware First Error Handling.
Kernel First Error Handling.

Firmware First Error Handling

Firmware First error handling requires that error events are handled in EL3 and then relayed to the OSPM for logging. On an error, firmware consumes the error information and generates a Common Platform Error Record (CPER), a standard information buffer defined by the UEFI specification to store error information. The CPER is placed in firmware-reserved memory, which is shared with the OSPM when it is notified about the error.

On Arm Neoverse Reference Design platforms, Firmware First error handling is achieved using the Hardware Error Source Table (HEST) and Software Delegated Exception Interface (SDEI) tables. The Secure Partition (Standalone MM driver) generates the CPER information for the error. At boot, the HEST table is published and the OSPM is made aware of the hardware error source(s) the platform supports.

At runtime, when a hardware fault is detected, the corresponding error or fault handling interrupt is generated. This interrupt is taken to the EL3 runtime firmware, which calls into the Secure Partition. The Secure Partition generates the CPER record and places it in firmware-reserved memory. The EL3 runtime firmware then notifies the OSPM about the error using SDEI.

Kernel First Error Handling

Kernel First errors are handled directly by the OSPM without firmware intervention. The fault and error events that are generated by the platform are taken directly to OSPM.

Arm Neoverse Reference Design platforms use the Arm Error Source Table (AEST) to achieve kernel first error handling. The AEST table is defined in the ACPI specification for RAS. It describes the hardware error sources that are present on the platform and comprises one or more error nodes. An AEST node entry identifies the component the node belongs to, e.g., Processor, Memory, SMMU or GIC. It defines the interface type for accessing the node, e.g., memory mapped or system register. A node also defines the list of interrupts the node supports.

The OSPM implements an AEST driver module to traverse the AEST table. The module registers IRQ handlers for all supported node interrupts. A fault event occurring on a node or error source is forwarded directly to the OSPM for handling.

Error Injection

Error injection is a micro-architectural feature defined by RAS to inject errors into RAS-supported system components. Software can use the error injection registers to inject an error and test the error handling software implemented by the platform.

Arm Neoverse Reference Designs use the Error Injection (EINJ) ACPI table defined in the ACPI specification to implement the error injection feature. EINJ is an action- and instruction-based table that defines a set of actions and their corresponding instructions. Each action is also assigned a firmware-reserved memory space to store action-specific data. An instruction is essentially a read or a write operation that is performed on that reserved memory.

On Arm Neoverse Reference Platforms, the firmware at EL3 implements the functionality to program the error injection registers. The OSPM initiates the injection and generates an SPI to call into firmware. EINJ defines an action to program the GICD register that triggers an SPI, which is handled in EL3.

Both firmware-first and kernel-first software use the EINJ ACPI table to validate their error handling functionality.

Note

Error injection, whether firmware-first or kernel-first, is initiated from the kernel.

Error Injection via Kernel

CPU Error Injection

The Neoverse RD-N2 platforms support two error nodes, and the presence of these nodes enables the RAS extension.

Node 0: Includes the L3 memory system in the DSU.
Node 1: Includes the private L1 and L2 memory systems in the CPU.

RD-V3 only supports one error node.

Node 0: Includes the private L1 and L2 memory systems in the CPU.

The CPU supports SED parity (Single Error Detect) and SECDED ECC (Single Error Correct Double Error Detect) capabilities.

The RD-V3-Cfg1 and RD-N2 platforms also support injecting errors to verify the error handling software.

Note

The Neoverse RD-V3 reference design platforms use a direct connect configuration and have no DSU. Hence they support only one error node, i.e., Node 0.

Error Injection Software Sequence

The CPU implements Pseudo Fault Generation (PFG) registers. With the help of these registers, software can inject a CE, DE or UE into the cache RAMs.

Detailed error injection software sequence:

Select error record for L1 and L2 memory systems i.e. Node0
- write_errselr_el1 (0)
Program the Error Control Register to enable Error Detection, FHI for CE, DE and UE.
- write_erxctlr_el1 (0x109) (Note: To enable ERI on UE write 0x10D)
Program the PFG Control Register to 0.
- write_cpu_pfg_ctrl_register (0)
Clear the Error Status Register by writing 1 to its write-one-to-clear status bits.
- write_erxstatus_el1 (0xFFC00000)
Set PFG countdown register to 1.
- write_cpu_pfg_cdn_register (1)
For Deferred Error injection write
- write_cpu_pfg_ctrl_register (0x80000020) [Generates FHI interrupt]
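
The register values used in the sequence above can be cross-checked by composing them from individual bits. The following Python sketch is illustrative only: the constant names are hypothetical, and the bit positions (ED at bit 0, UI at bit 2, FI at bit 3 and CFI at bit 8 of the error control register; DE at bit 5 and the countdown enable at bit 31 of the PFG control register; status bits 31:22 being write-one-to-clear) are assumptions based on the standard error record layout of the Arm RAS architecture, not values taken from the platform firmware.

```python
def bit(n: int) -> int:
    """Return an integer with only bit n set."""
    return 1 << n

# Error control register fields (assumed positions, see lead-in).
ERXCTLR_ED = bit(0)    # error detection enable
ERXCTLR_UI = bit(2)    # error recovery interrupt (ERI) on UE
ERXCTLR_FI = bit(3)    # fault handling interrupt (FHI)
ERXCTLR_CFI = bit(8)   # FHI for corrected errors

# PFG control register fields (assumed positions, see lead-in).
PFG_CTL_DE = bit(5)       # inject a deferred error
PFG_CTL_CDNEN = bit(31)   # enable the countdown that triggers injection

ctlr_fhi = ERXCTLR_ED | ERXCTLR_FI | ERXCTLR_CFI
ctlr_fhi_eri = ctlr_fhi | ERXCTLR_UI
pfg_de = PFG_CTL_CDNEN | PFG_CTL_DE

# Mask covering status bits 31:22, written to clear a previous record.
status_clear = sum(bit(n) for n in range(22, 32))

print(hex(ctlr_fhi), hex(ctlr_fhi_eri), hex(pfg_de), hex(status_clear))
```

The composed values match the constants written in the sequence: 0x109, 0x10D, 0x80000020 and 0xFFC00000.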

Procedure to Perform Error Injection

Note

This section assumes the user has completed the Getting Started chapter and has a functional working environment.

Error Handling Mode Selection

The CPU supports both Firmware First and Kernel First error handling modes; the default mode is Firmware First.

Important

Only one error handling mode can be enabled at a time.

The error handling mode is a build-time option. To select a mode, navigate to the <workspace>, edit the configuration file of the platform of interest, and set the TF_A_RAS_FW_FIRST flag.

As an example for RD-V3 Cfg1 platform:

vim <workspace>/build-scripts/configs/rdv3cfg1/rdv3cfg1

Firmware First Selection:

TF_A_RAS_FW_FIRST = 1

Kernel First Selection:

TF_A_RAS_FW_FIRST = 0
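
Switching modes repeatedly can be scripted. The following Python sketch rewrites the TF_A_RAS_FW_FIRST line in the text of a configuration file. The function name is hypothetical, and the sketch assumes the flag appears on its own line in the form shown above (with or without spaces around the equals sign); it does not know anything else about the configuration file format.

```python
import re

# Matches "TF_A_RAS_FW_FIRST = <value>" at the start of a line,
# tolerating optional spaces or tabs around "=".
_FLAG_RE = re.compile(r"^([ \t]*TF_A_RAS_FW_FIRST[ \t]*=[ \t]*)\S+", re.MULTILINE)

def set_ras_fw_first(config_text: str, firmware_first: bool) -> str:
    """Return config_text with TF_A_RAS_FW_FIRST set to 1 or 0."""
    value = "1" if firmware_first else "0"
    new_text, count = _FLAG_RE.subn(lambda m: m.group(1) + value, config_text)
    if count == 0:
        raise ValueError("TF_A_RAS_FW_FIRST not found in configuration")
    return new_text

sample = "SOME_OTHER_FLAG=1\nTF_A_RAS_FW_FIRST = 1\n"
print(set_ras_fw_first(sample, firmware_first=False))
```

To apply it to a real file, read the file into a string, pass it through the function, and write the result back before rebuilding.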

Note

Perform a clean build after switching the error handling mode.

Build and Boot Operating System(s)

Refer to the list of supported operating systems to build the reference design platform software stack and boot into the OS.

Inject Error

After the boot is complete, use the EINJ table debugfs entries to inject the error according to the selected error handling scheme.

Firmware First Error Injection.
Kernel First Error Injection.

The sel-firmware-first field in oem-einj toggles firmware first error injection; the default is kernel first error injection. The sel-error-type field selects the type of error to inject; the current implementation supports only deferred errors.

Firmware First Error Injection

mount -t debugfs none /sys/kernel/debug # Step needed for Buildroot only
echo 0x80020000 > /sys/kernel/debug/apei/einj/error_type
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-firmware-first
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-component
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-error-type
echo 1 > /sys/kernel/debug/apei/einj/error_inject
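
The same debugfs writes can be driven from a script, which makes it easier to switch between firmware first and kernel first injection. The Python sketch below is a minimal illustration: the function and constant names are hypothetical, the paths and values are the ones used in the commands of this section, and the meaning assigned to the numeric values (2 for the CPU component, 2 for a deferred error) is inferred from those commands rather than from a specification. It defaults to a dry run that only prints the equivalent shell commands.

```python
from pathlib import Path

EINJ_ROOT = Path("/sys/kernel/debug/apei/einj")
VENDOR_ERROR_TYPE = 0x80020000  # value written to error_type in this section
COMPONENT_CPU = 2               # sel-component value used for CPU injection
ERROR_TYPE_DE = 2               # sel-error-type value used for deferred errors

def einj_writes(firmware_first, component, error_type, root=EINJ_ROOT):
    """Return the ordered (path, value) writes for one injection."""
    oem = root / "oem-einj"
    return [
        (root / "error_type", hex(VENDOR_ERROR_TYPE)),
        (oem / "sel-firmware-first", "1" if firmware_first else "0"),
        (oem / "sel-component", str(component)),
        (oem / "sel-error-type", str(error_type)),
        (root / "error_inject", "1"),  # must be last: triggers the injection
    ]

def inject(writes, dry_run=True):
    """Print the writes, or perform them when dry_run is False."""
    for path, value in writes:
        if dry_run:
            print(f"echo {value} > {path.as_posix()}")
        else:
            path.write_text(value + "\n")

inject(einj_writes(True, COMPONENT_CPU, ERROR_TYPE_DE))
```

Running with dry_run=False requires root privileges and a mounted debugfs on the target.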

On successful error injection, the firmware logs the received error information on the console.

Check the secure UART terminal (the window named FVP terminal_sec_uart) for a log similar to the one below.

SP 8001: ErrAddr = 0x8F840
SP 8001: MmEntryPoint Done
INFO:    EINJ event received 83
INFO:    cpu_id 2
INFO:    Injecting DE...
INFO:    ErrStatus = 0x0
INFO:    [CPU RAS] CPU intr received = 17 on cpu_id = 2
INFO:    [CPU RAS] ERXMISC0_EL1 = 0x0
INFO:    [CPU RAS] ERXSTATUS_EL1 = 0x40800000
INFO:    [CPU RAS] ERXADDR_EL1 = 0x0 buff_base = 0xf4600000

Check the non-secure UART terminal (the window named FVP terminal_nsec_uart) for a log similar to the one below.

{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 10
{2}[Hardware Error]: event severity: recoverable
{2}[Hardware Error]:  Error 0, type: recoverable
{2}[Hardware Error]:   section_type: ARM processor error
{2}[Hardware Error]:   MIDR: 0x00000000410fd840
{2}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081020000
{2}[Hardware Error]:   running state: 0x1
{2}[Hardware Error]:   Power State Coordination Interface state: 0
{2}[Hardware Error]:   Error info structure 0:
{2}[Hardware Error]:   num errors: 1
{2}[Hardware Error]:    first error captured
{2}[Hardware Error]:    error_type: 0, cache error
{2}[Hardware Error]:    error_info: 0x000000000002001f
{2}[Hardware Error]:     transaction type: Generic
{2}[Hardware Error]:     operation type: Generic error (type cannot be determined)
{2}[Hardware Error]:     cache level: 0
{2}[Hardware Error]:     processor context not corrupted
{2}[Hardware Error]:     the error has not been corrected
{2}[Hardware Error]:    physical fault address: 0x0000000000000000
{2}[Hardware Error]:   Context info structure 0:
{2}[Hardware Error]:    register context type: AArch64 general purpose registers
{2}[Hardware Error]:    00000000: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000010: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000020: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000030: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000040: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000050: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000060: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000070: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000080: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000090: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000a0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000b0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000c0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000d0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000e0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000f0: 00000000 00000000 00000000 00000000

Kernel First Error Injection

mount -t debugfs none /sys/kernel/debug # Step needed for Buildroot only
echo 0x80020000 > /sys/kernel/debug/apei/einj/error_type
echo 0 > /sys/kernel/debug/apei/einj/oem-einj/sel-firmware-first
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-component
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-error-type
echo 1 > /sys/kernel/debug/apei/einj/error_inject

On successful error injection, the kernel receives an error event in its IRQ handler. The handler traverses the error record information and logs the error.

Check the non-secure UART terminal (the window named FVP terminal_nsec_uart) for a log similar to the one below.

[ 2365.760926] Injecting DE-
[ 2365.760928] ARM RAS: error from CPU7
[ 2365.760930] ERR0STATUS: 0x40800000

EDAC (Error Detection and Correction)

The EDAC (Error Detection and Correction) Linux interface provides a framework for reporting memory and CPU errors encountered on a system. It allows the kernel to detect and manage errors, providing valuable information for diagnosing and troubleshooting hardware issues.

EDAC support is currently enabled only for the CPU on both RD-N2 and RD-V3 platforms. Error counts are exposed through a sysfs interface, which allows the user to access information about Corrected (CE) and Uncorrected (UE) errors that have occurred in the system, aiding in monitoring and diagnosing hardware issues.

Note

This feature is supported only on RD-V3-Cfg1 and RD-N2-Cfg1 platforms, and only if Kernel First error handling is enabled.

cat /sys/devices/system/edac/cpu/cpu*/ue_count
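
For monitoring, the per-CPU counters can be collected into one structure. The Python sketch below walks the sysfs directory used in the command above. The function name is hypothetical, and the presence of a ce_count file next to ue_count is an assumption based on the statement that both corrected and uncorrected error counts are exposed; files that do not exist are skipped.

```python
from pathlib import Path

def read_edac_cpu_counts(root="/sys/devices/system/edac/cpu"):
    """Return {cpu_name: {"ce_count": n, "ue_count": n}} for each cpu* entry."""
    counts = {}
    for cpu_dir in sorted(Path(root).glob("cpu*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = cpu_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        if entry:
            counts[cpu_dir.name] = entry
    return counts

if __name__ == "__main__":
    for cpu, entry in read_edac_cpu_counts().items():
        print(cpu, entry)
```

On a system without the EDAC CPU driver loaded, the function returns an empty dictionary.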

Shared RAM Error Injection

RD-V3 and RD-N2 platforms support a Shared RAM that is shared between the AP, MCP, SCP and RSS. The Shared RAM is protected with SECDED (Single Error Correct Double Error Detect). The RD-V3 platform defines ECC RAS registers to log any ECC errors that occur during Shared RAM access from each master: AP, SCP, MCP or RSS. On RD-V3, four sets of ECC RAS registers are defined for each master to log errors based on the PAS of the master; on RD-N2, two sets of ECC RAS registers are defined.

RD-V3: The list for Shared RAM ECC RAS registers is defined below:

AP Secure RAM ECC RAS registers
AP Non-Secure RAM ECC RAS registers
AP Realm RAM ECC RAS registers
AP Root RAM ECC RAS registers
SCP Secure RAM ECC RAS registers
SCP Non-Secure RAM ECC RAS registers
SCP Realm RAM ECC RAS registers
SCP Root RAM ECC RAS registers
MCP Secure RAM ECC RAS registers
MCP Non-Secure RAM ECC RAS registers
MCP Realm RAM ECC RAS registers
MCP Root RAM ECC RAS registers

RD-N2: The list for Shared RAM ECC RAS registers is defined below:

AP Secure RAM ECC RAS registers
AP Non-Secure RAM ECC RAS registers
SCP Secure RAM ECC RAS registers
SCP Non-Secure RAM ECC RAS registers
MCP Secure RAM ECC RAS registers
MCP Non-Secure RAM ECC RAS registers

Note

This test is supported only on RD-V3-Cfg1 and RD-N2-Cfg1 platforms, using Firmware First Error Handling.

Error Injection on Shared RAM

Each ECC RAS register set implements an SRAMECC_ERRMISC1 register, which provides a way to inject a Corrected Error (CE) or an Uncorrected Error (UE) into the Shared RAM. The error injection takes effect only if the register programming is followed by a read access to the Shared RAM. If the injection is successful, the error records pertaining to the master and the respective access are populated with error information and an error interrupt is delivered to the master.

RD-V3 Shared SRAM

The following software sequence injects a 1-bit CE into the Root Shared RAM from the AP on RD-V3.

Add memory map for the Shared RAM ECC RAS registers memory space.
Add memory map for the Shared memory space.
Program the SRAMECC_ERRCTRL register to enable ED (Error Detection), FI (Fault Interrupt) and CFI (Corrected Fault Interrupt).
Program the SRAMECC_ERRMISC1 register to enable INJECT_CE.
Read from memory mapped shared memory space to inject the error.

RD-N2 Shared SRAM

The following software sequence injects a 1-bit CE into the Non-Secure Shared RAM from the AP on RD-N2.

Add memory map for the Shared RAM ECC RAS registers memory space.
Add memory map for the Shared memory space.
Program the SRAMECC_ERRCTRL register to enable RAM_ECC_EN and set INJECT_ERROR to [01] for Correctable error.
Read from memory mapped shared memory space to inject the error.

Procedure to Perform Error Injection on Shared RAM

Note

This section assumes the user has completed the Getting Started chapter and has a functional working environment.

Error Handling Mode Selection

Both platforms support only the Firmware First error handling mode for SRAM errors, which is also the default mode.

Important

Only Firmware First mode is supported for SRAM errors.

The error handling mode is a build-time option. To select it, navigate to the <workspace>, edit the configuration file of the platform of interest, and set the TF_A_RAS_FW_FIRST flag.

As an example for RD-V3 Cfg1 platform:

vim <workspace>/build-scripts/configs/rdv3cfg1/rdv3cfg1

Firmware First Selection:

TF_A_RAS_FW_FIRST = 1

Note

Perform a clean build after switching the error handling mode.

Build and Boot Operating System(s)

Refer to the list of supported operating systems to build the reference design platform software stack and boot into the OS.

Inject Error on Shared RAM

Run the commands below to inject a 1-bit CE into the Shared RAM. This test uses the EINJ ACPI table to perform error injection. Shared RAM is not a standard error_type defined in the EINJ ACPI table, so the vendor-defined error type is used. Bit 31 of the error_type field indicates a vendor error type. Use the error_type value 0x80020000 to represent Shared RAM errors.

mount -t debugfs none /sys/kernel/debug # Step needed for Buildroot only
echo 0x80020000 > /sys/kernel/debug/apei/einj/error_type
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-firmware-first
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-component
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-error-type
echo 1 > /sys/kernel/debug/apei/einj/error_inject

Shared RAM error handling happens in Firmware First mode. The EL3 firmware receives the Fault Handling Interrupt (FHI) for the detected corrected error and logs the error on the secure console.

EDAC MC0: 1 CE unknown error on unknown memory
( page:0x8f offset:0x840 grain:-281474976710655 syndrome:0x0 - APEI location: )
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 20
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:   section_type: memory error
{1}[Hardware Error]:   physical_address: 0x000000000008f840
{1}[Hardware Error]:   physical_address_mask: 0x0000ffffffffffff

Error Injection via SCP Utility

The error injection utility is referred to as einj-util in this document. Einj-util is a command-line utility designed for the SCP. It integrates with the SCP CLI Debugger, enabling users to enter commands at runtime. Einj-util injects errors into various RAS-supported components when the user provides an error injection command in the CLI. This utility helps validate the behavior of RAS-capable hardware components when an error is detected and reported.

The term “Component” defines the RAS-supported components for which error injection is supported. “Sub-component” signifies the next level of error categorization for each component, and it varies for different components. For instance, in the context of SRAM, sub-components represent error injection in different worlds: Root, Secure, Realm, and Non-Secure. “Type” defines the various types of errors supported by each component. The supported error types are Correctable Error (CE), Deferred Error (DE) and Uncorrectable Error (UE).

Procedure to Perform Error Injection into Various Components

Note

This section assumes the user has completed the Getting Started chapter and has a functional working environment.

Build Software Stack

This procedure doesn’t require a full host OS to be present, but the Busybox Boot is still recommended as it is the simplest method to build the required components.

Boot up to SCP CLI Debugger Shell

Once the build step is completed, boot the Busybox stack on the FVP as normal, and identify the window named FVP terminal_uart_scp once it shows up, as this is the window to interact with. The steps are as follows:

Launch the FVP and access the SCP UART.
Once in the SCP UART terminal, use Ctrl + e to enter the CLI.
To access the help menu for the einj-util utility, run the following command:

einj-util -h

The “help” command displays the CLI usage.

> einj-util -h
    Inject error into various components.

    Usage: einj-util -comp <n> -subcomp <n> -type <n>

    -comp: sram (0), tcm (1), cpu (2), rsm (3)

    -subcomp:

            sram: root (0), secure (1), non-secure (2), realm (3)

            tcm: itcm (0), dtcm (1)

            cpu: always 0 for now

            rsm: secure (0), non-secure (1)

    -type:

            sram/tcm/rsm: correctable (0), uncorrectable (1)

            cpu: correctable (0), uncorrectable (1), deferred (2)

    example:

            1) ce into shared sram from secure world:
                    einj-util -comp 0 -subcomp 1 -type 0
            2) ce into scp itcm:
                    einj-util -comp 1 -subcomp 0 -type 0
            3) cpu ue:
                    einj-util -comp 2 -subcomp 0 -type 1

To exit the CLI Debugger, press Ctrl + d.
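
When scripting test runs, the numeric arguments can be derived from names instead of being typed by hand. The Python sketch below builds einj-util command strings from the mapping printed in the help text above. The function and dictionary names are hypothetical, and the label "core" for the single CPU sub-component is chosen here for readability (the help text only states that the value is always 0 for now).

```python
COMPONENTS = {"sram": 0, "tcm": 1, "cpu": 2, "rsm": 3}

SUBCOMPONENTS = {
    "sram": {"root": 0, "secure": 1, "non-secure": 2, "realm": 3},
    "tcm": {"itcm": 0, "dtcm": 1},
    "cpu": {"core": 0},  # "core" is a label chosen for this sketch
    "rsm": {"secure": 0, "non-secure": 1},
}

TYPES = {"correctable": 0, "uncorrectable": 1, "deferred": 2}

def einj_util_command(comp: str, subcomp: str, err_type: str) -> str:
    """Build an einj-util command line from symbolic names."""
    if err_type == "deferred" and comp != "cpu":
        # The help text lists the deferred type only for the cpu component.
        raise ValueError("deferred errors are only listed for the cpu component")
    return (
        f"einj-util -comp {COMPONENTS[comp]} "
        f"-subcomp {SUBCOMPONENTS[comp][subcomp]} "
        f"-type {TYPES[err_type]}"
    )

print(einj_util_command("sram", "secure", "correctable"))
print(einj_util_command("cpu", "core", "uncorrectable"))
```

The two printed commands correspond to examples 1 and 3 of the help text.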

Various Error Injection Scenarios

Component	Subcomponent	Type of Error	Error Status
Shared SRAM	Secure World	CE	0x86000000
Shared SRAM	Root World	UE	0xa4000000
RSM SRAM	Secure World	CE	0x86000000
RSM SRAM	Non-Secure World	UE	0xa4000000
TCM	ITCM	CE	0x5
TCM	DTCM	UE	0x7
CPU	Core	CE	0xC6000000
CPU	Core	UE	0x60000000
CPU	Core	DE	0x40800000

Shared SRAM Error Injection

Run the following command to inject a correctable error into shared SRAM from the secure world.

> einj-util -comp 0 -subcomp 1 -type 0

After triggering the error, the interrupt handler is invoked, logging error records.

[SRAM_INT] ErrStatus = 0x86000000
[SRAM_INT] fwk_int number = 24
[SRAM_INT] ErrAddr   = 0x10

SRAM ECC Error Status Register Bit Descriptions

AV[31:31]  :  Address Valid

MV[26:26]  :  Miscellaneous Registers Valid

CE[25:24]  :  Correctable error has occurred

DE[23:23]  :  Deferred Error

UET[21:20] :  Uncorrected Error Type

SERR[7:0]  :  Primary Error code

CPU Error Injection

Run the following command to inject a CPU correctable error.

> einj-util -comp 2 -subcomp 0 -type 0

The ErrorStatus register captures information about the triggered CPU error.

Injecting CPU CE
ErrStatus  0xC6000000
ErrAddress 0x0

Core Error Injection ERXSTATUS_EL1 Register Description

AV[31:31]  :  Address Valid

V[30:30]   :  Status Register Valid

MV[26:26]  :  Miscellaneous Registers Valid

CE[25:24]  :  Corrected Error

DE[23:23]  :  Deferred Error

UET[21:20] :  Uncorrected Error Type

SERR[4:0]  :  Primary Error code
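
To make the status values in this document easier to read, the fields can be extracted programmatically. The Python sketch below decodes a 32-bit status value using the field positions listed above. Two points are assumptions: the UE flag at bit 29 is not part of the list above and is taken from the standard Arm RAS error status layout, and DE is treated as bit 23, which is what the 0x40800000 value logged for deferred errors implies. The function and table names are hypothetical.

```python
# (msb, lsb) for each field of the core error status register.
STATUS_FIELDS = {
    "AV": (31, 31),
    "V": (30, 30),
    "UE": (29, 29),   # assumption: standard Arm RAS layout, not listed above
    "MV": (26, 26),
    "CE": (25, 24),
    "DE": (23, 23),
    "UET": (21, 20),
    "SERR": (4, 0),
}

def decode_status(value: int) -> dict:
    """Return a dict mapping each field name to its integer value."""
    decoded = {}
    for name, (msb, lsb) in STATUS_FIELDS.items():
        width = msb - lsb + 1
        decoded[name] = (value >> lsb) & ((1 << width) - 1)
    return decoded

for status in (0xC6000000, 0x60000000, 0x40800000):
    nonzero = {k: v for k, v in decode_status(status).items() if v}
    print(hex(status), nonzero)
```

For the three CPU status values reported in this document, the decoder yields a non-zero CE field for 0xC6000000, the UE flag for 0x60000000, and the DE flag for 0x40800000.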

SCP ITCM/DTCM Error Injection

Invoke the following command to inject a correctable error into SCP ITCM.

> einj-util -comp 1 -subcomp 0 -type 0

The error record information will be logged in the following manner.

ITCM
Injecting CE
[TCM_INT] fwk_int number = 21
[TCM_INT] ErrCode   = 0x9
[TCM_INT] ErrStatus = 0x5
[TCM_INT] ErrAddr   = 0x34d8

TCMECC_ERRSTATUS Bit Descriptions

OF[2:2] : Multiple errors occurred before SW cleared the current error

UE[1:1] : Uncorrectable and uncontainable error has occurred

CE[0:0] : Correctable error has occurred
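
The TCM status values can be decoded in the same spirit. The Python sketch below tests the three flags listed above; the function name is hypothetical and the bit positions are exactly those given in the bit descriptions.

```python
TCM_CE = 1 << 0  # correctable error has occurred
TCM_UE = 1 << 1  # uncorrectable and uncontainable error has occurred
TCM_OF = 1 << 2  # multiple errors before software cleared the current one

def decode_tcm_status(value: int) -> dict:
    """Return the CE, UE and OF flags of a TCMECC_ERRSTATUS value."""
    return {
        "CE": bool(value & TCM_CE),
        "UE": bool(value & TCM_UE),
        "OF": bool(value & TCM_OF),
    }

print(decode_tcm_status(0x5))
print(decode_tcm_status(0x7))
```

With these positions, 0x5 reads as CE plus OF, and 0x7 reads as CE, UE and OF all set.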

RSM SRAM Error Injection

Invoke the following command to trigger a correctable error in RSM SRAM from the secure world.

> einj-util -comp 3 -subcomp 0 -type 0

The error record information is logged as follows:

Injecting CE into RSM SRAM
[RSM_INT] ErrStatus = 0x86000000
[RSM_INT] fwk_int number = 29
[RSM_INT] ErrAddr   = 0x10

Note

Refer to the SRAM ECC Error Status register bit descriptions to decode the error status for RSM SRAM errors.

Expected Output for the Various Scenarios

Description	Command	Expected Output
Shared SRAM Secure World CE	einj-util -comp 0 -subcomp 1 -type 0	Injecting CE into Shared SRAM [SRAM_INT] ErrStatus = 0x86000000 [SRAM_INT] fwk_int number = 24 [SRAM_INT] ErrAddr = 0x10
Shared SRAM Secure World UE	einj-util -comp 0 -subcomp 1 -type 1	Injecting UE into Shared SRAM [SRAM_INT] ErrStatus = 0xa4000000 [SRAM_INT] fwk_int number = 24 [SRAM_INT] ErrAddr = 0x1
Shared SRAM Root CE	einj-util -comp 0 -subcomp 0 -type 0	Injecting CE into Shared SRAM [SRAM_INT] ErrStatus = 0x86000000 [SRAM_INT] fwk_int number = 26 [SRAM_INT] ErrAddr = 0x10
Shared SRAM Root UE	einj-util -comp 0 -subcomp 0 -type 1	Injecting UE into Shared SRAM [SRAM_INT] ErrStatus = 0xa4000000 [SRAM_INT] fwk_int number = 26 [SRAM_INT] ErrAddr = 0x10
RSM SRAM Secure World CE	einj-util -comp 3 -subcomp 0 -type 0	Injecting CE into RSM SRAM [RSM_INT] ErrStatus = 0x86000000 [RSM_INT] fwk_int number = 29 [RSM_INT] ErrAddr = 0x10
RSM SRAM Secure World UE	einj-util -comp 3 -subcomp 0 -type 1	Injecting UE into RSM SRAM [RSM_INT] ErrStatus = 0xa4000000 [RSM_INT] fwk_int number = 29 [RSM_INT] ErrAddr = 0x10
RSM SRAM Non-secure World CE	einj-util -comp 3 -subcomp 1 -type 0	Injecting CE into RSM SRAM [RSM_INT] ErrStatus = 0x86000000 [RSM_INT] fwk_int number = 29 [RSM_INT] ErrAddr = 0x10
RSM SRAM Non-secure World UE	einj-util -comp 3 -subcomp 1 -type 1	Injecting UE into RSM SRAM [RSM_INT] ErrStatus = 0xa4000000 [RSM_INT] fwk_int number = 29 [RSM_INT] ErrAddr = 0x10
TCM ITCM CE	einj-util -comp 1 -subcomp 0 -type 0	ITCM Injecting CE [TCM_INT] ErrStatus = 0x5 [TCM_INT] fwk_int number = 21 [TCM_INT] ErrCode = 0x9 [TCM_INT] ErrAddr = 0x6b38
TCM ITCM UE	einj-util -comp 1 -subcomp 0 -type 1	ITCM Injecting UE [TCM_INT] ErrStatus = 0x7 [TCM_INT] fwk_int number = 21 [TCM_INT] ErrCode = 0x9 [TCM_INT] ErrAddr = 0x6a46
TCM DTCM CE	einj-util -comp 1 -subcomp 1 -type 0	DTCM Injecting CE [TCM_INT] ErrStatus = 0x7 [TCM_INT] fwk_int number = 21 [TCM_INT] ErrCode = 0xb [TCM_INT] ErrAddr = 0x6b3c
TCM DTCM UE	einj-util -comp 1 -subcomp 1 -type 1	DTCM Injecting UE [TCM_INT] ErrStatus = 0x7 [TCM_INT] fwk_int number = 21 [TCM_INT] ErrCode = 0xb [TCM_INT] ErrAddr = 0x6a46
CPU Core CE	einj-util -comp 2 -subcomp 0 -type 0	Injecting CPU CE ErrStatus 0xC6000000 ErrAddress 0x0
CPU Core UE	einj-util -comp 2 -subcomp 0 -type 1	Injecting CPU UE ErrStatus 0x60000000 ErrAddress 0x0
CPU Core DE	einj-util -comp 2 -subcomp 0 -type 2	Injecting CPU DE ErrStatus 0x40800000 ErrAddress 0x0
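
When the SCP console output is captured to a file, the logged fields can be compared against the expected values in the table above. The Python sketch below extracts the "key = value" pairs from the tagged log lines. The function name and the regular expression are hypothetical, and the sketch assumes the "[TAG] Key = value" line format shown in the SRAM, TCM and RSM logs of this document.

```python
import re

# Matches entries such as "[SRAM_INT] ErrStatus = 0x86000000"
# or "[SRAM_INT] fwk_int number = 24".
_ENTRY_RE = re.compile(
    r"\[(?P<tag>\w+)\]\s+(?P<key>[\w ]+?)\s*=\s*(?P<value>0x[0-9a-fA-F]+|\d+)"
)

def parse_scp_log(text: str) -> dict:
    """Return a dict of the logged fields, with values as integers."""
    record = {}
    for match in _ENTRY_RE.finditer(text):
        record[match.group("key")] = int(match.group("value"), 0)
    return record

log = """[SRAM_INT] ErrStatus = 0x86000000
[SRAM_INT] fwk_int number = 24
[SRAM_INT] ErrAddr   = 0x10
"""
record = parse_scp_log(log)
print(hex(record["ErrStatus"]), record["fwk_int number"], hex(record["ErrAddr"]))
```

A test script can then compare record["ErrStatus"] with the value expected for the injected scenario.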

Rasdaemon

Overview

Rasdaemon is an error logging tool that logs RAS (Reliability, Availability and Serviceability) events. The daemon uses the kernel trace subsystem to capture the error events reported by the kernel modules. The trace events that are captured in /sys/kernel/debug/tracing are reported by rasdaemon.

Enabling rasdaemon creates an “instances/rasdaemon” directory inside the “/sys/kernel/debug/tracing” debugfs directory. All the tracing events that are enabled by rasdaemon are captured in this directory.

Note

This test is supported only on RD-V3-Cfg1 and RD-N2-Cfg1 platforms, using Firmware First Error Handling.

Enabling Rasdaemon

Note

This section assumes the user has completed the chapter Getting Started and has a functional working environment.

RD-N2-Cfg1 and RD-V3-Cfg1 platforms have the rasdaemon package enabled by default on the buildroot file system. The Buildroot repository has support for rasdaemon, so any platform that performs a buildroot boot can enable the rasdaemon package.

To enable rasdaemon on other platform variants, add the following lines to the buildroot defconfig file.

BR2_PACKAGE_RASDAEMON=y
BR2_GLOBAL_PATCH_DIR="board/aarch64-efi/rdinfra/patches/"

To add rasdaemon support on the RD-V3 platform, add the above two lines to the file configs/rdv3/buildroot/aarch64_rdinfra_defconfig.

Build the software stack for buildroot. Refer to Build the platform software.

Perform a buildroot filesystem boot. Refer to Booting with Buildroot as the filesystem.

On the buildroot shell, type the following commands to enable rasdaemon:

mount -t debugfs none /sys/kernel/debug
rasdaemon -e

This command starts rasdaemon and enables trace events for memory controller, aer, non_standard error records, arm event and arm ras external events.

rasdaemon: ras:mc_event event enabled
rasdaemon: ras:aer_event event enabled
rasdaemon: ras:non_standard_event event enabled
rasdaemon: ras:arm_ras_ext_event event enabled
rasdaemon: ras:arm_event event enabled

Test to validate rasdaemon

Validating the logging of RAS events by rasdaemon requires a platform with RAS support enabled. Here we look at the 1-bit DE reported by the CPU on the RD-V3-Cfg1 platform, which has RAS support enabled. Perform the firmware first error handling test for a 1-bit DE on the CPU. The kernel logs this event and also reports an arm_event for this error to the tracing subsystem. Rasdaemon captures this arm_event trace log and prints it.

Refer to CPU Error Injection to perform the CPU firmware first error handling test on the RD-V3-Cfg1 platform. On error injection, the kernel logs the error and also the arm_event. The trace event is also recorded in the rasdaemon buffer. To print the trace from rasdaemon, run the following command.

cat /sys/kernel/debug/tracing/instances/rasdaemon/trace_pipe

The above command outputs a log similar to the following from rasdaemon.

<idle>-0       [004] d.h1.   555.977157: arm_event: affinity level: 255;
MPIDR: 0000000081040000; MIDR: 00000000410fd840; running state: 1; PSCI state: 0
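
The MPIDR value in the trace identifies the CPU that reported the error. The Python sketch below pulls the MPIDR and MIDR values out of an arm_event trace line and splits the MPIDR into its affinity fields. The function names are hypothetical; the affinity field positions (Aff0 in bits 7:0, Aff1 in bits 15:8, Aff2 in bits 23:16, Aff3 in bits 39:32) follow the architectural MPIDR_EL1 layout, and the line format is assumed to match the trace output shown above.

```python
import re

def parse_arm_event(text: str) -> dict:
    """Extract the MPIDR and MIDR values (printed in hex) from a trace line."""
    fields = {}
    for key in ("MPIDR", "MIDR"):
        match = re.search(rf"\b{key}: ([0-9a-fA-F]+)", text)
        if match:
            fields[key] = int(match.group(1), 16)
    return fields

def mpidr_affinity(mpidr: int) -> dict:
    """Split an MPIDR_EL1 value into its four affinity fields."""
    return {
        "Aff0": mpidr & 0xFF,
        "Aff1": (mpidr >> 8) & 0xFF,
        "Aff2": (mpidr >> 16) & 0xFF,
        "Aff3": (mpidr >> 32) & 0xFF,
    }

trace = (
    "<idle>-0       [004] d.h1.   555.977157: arm_event: affinity level: 255;\n"
    "MPIDR: 0000000081040000; MIDR: 00000000410fd840; running state: 1; PSCI state: 0"
)
event = parse_arm_event(trace)
print(hex(event["MIDR"]), mpidr_affinity(event["MPIDR"]))
```

For the trace above, only the Aff2 field is non-zero, with a value of 4.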

Other components supporting RAS

CMN Cyprus Kernel First Handling (KFH)

Important

This feature might not be applicable to all Platforms. Please check section Supported Features of individual platform pages to confirm if this feature is listed as supported. Also this feature can be validated only on a pre-silicon validation platform. Current support is limited to RASv1.

CMN Cyprus RAS support

CMN Cyprus implements RAS as a distributed architecture with a set of logging and reporting registers and a central interrupt handling unit. The logging and reporting registers are implemented in the XP, HN-I, HN-F/S, SBSX and CCG device nodes.

Logging registers implemented in the device node are:

Error Feature register (ErrFr)
Error Control register (ErrCtlr)
Error Status register (ErrStatus)
Error Address register (ErrAddr)
Error Misc register 0 (ErrMisc0)
Error Misc register 1 (ErrMisc1)

Each device node implements two sets of these registers: one logs errors that occur in the root address space and the other logs errors that occur in the non-secure address space. Each device node also implements ErrGsr (Error Group Status register), which is set when that node has a non-zero ErrStatus register. CMN Cyprus supports the following error types:

Corrected Error (CE)
Deferred Error (DE)
Uncorrected Error Unrecoverable (UEU)

Example: The RD-V3-Cfg1 platform implements a CMN mesh of size 3*3, which has 9 XPs, 8 HNS, 1 SBSX, 4 HN-I and 5 CCG device nodes. Each of these nodes implements a set of error records to log the detected RAS errors.

Each device node also implements the Pseudo Fault Generation (PFG) registers, which allow software to inject pseudo errors within the device node and validate the software error handling flow. The PFG registers defined for each node are:

Error Pseudo Fault Generation Feature register (ErrPfgf)
Error Pseudo Fault Generation Control register (ErrPfgctl)
Error Pseudo Fault Generation Count Down register (ErrPfgcdn)

Each device node implements two sets of PFG registers: one for root world error injection and the other for NS world error injection.

Error/Fault injection in CMN Cyprus

Follow this sequence to perform software-induced error injection:

Program the Error Control register to enable error detection and enable the FHI interrupt

mmio_write_errctlr ((CMN_BASE + NODE_OFF + ErrCtlr), (BIT3 | BIT0))

Program the PFG count down register to 1, to inject the error on the first clock tick.

mmio_write_pfgcdn ((CMN_BASE + NODE_OFF + ErrPfgcdn), 1)

Program the PFG control register with the following fields:
- Type of error: set BIT6 for CE, BIT5 for DE, or BIT2 for UEU
- Set BIT11 to update ErrStatus.AV field on fault injection
- Set BIT12 to update ErrStatus.MV field on fault injection
- Set BIT31 to enable the injection by reading the PFG count down register

mmio_write_pfgctl ((CMN_BASE + NODE_OFF + ErrPfgctl), (BIT<Error_Type> | BIT11 | BIT12 | BIT31))

Run the same sequence to inject an error into any of the CMN device nodes. The NODE_OFF of each node must be known before performing the injection; it can be determined from the CMN discovery process.
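
The register values written in the sequence above can be computed as follows. This is a minimal sketch in Python that only builds the values and the write order; it performs no MMIO access. The bit positions are the ones listed in the sequence, while the names (bit, ERROR_TYPE_BIT, pfg_ctl_value, injection_sequence) are hypothetical.

```python
def bit(n):
    """Return a mask with only bit n set."""
    return 1 << n

# Error type selection bits of the PFG control register, as listed above.
ERROR_TYPE_BIT = {"CE": 6, "DE": 5, "UEU": 2}

# Step 1: enable error detection and the FHI interrupt (BIT3 | BIT0).
ERRCTLR_VALUE = bit(3) | bit(0)

# Step 2: count down of 1, so the error is injected on the first clock tick.
PFGCDN_VALUE = 1

def pfg_ctl_value(error_type):
    """Step 3: error type bit, plus the AV (BIT11), MV (BIT12) and enable (BIT31) bits."""
    if error_type not in ERROR_TYPE_BIT:
        raise ValueError("unsupported error type: %s" % error_type)
    return bit(ERROR_TYPE_BIT[error_type]) | bit(11) | bit(12) | bit(31)

def injection_sequence(error_type):
    """Return the ordered (register name, value) pairs for one injection."""
    return [
        ("ErrCtlr", ERRCTLR_VALUE),
        ("ErrPfgcdn", PFGCDN_VALUE),
        ("ErrPfgctl", pfg_ctl_value(error_type)),
    ]

for name, value in injection_sequence("DE"):
    print("%-10s 0x%08x" % (name, value))
```

For a DE injection this gives 0x00000009 for the Error Control register, 0x00000001 for the count down register and 0x80001820 for the PFG control register. Each value would be written at CMN_BASE + NODE_OFF plus the offset of the corresponding register.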

CMN KFH Software

The following software components are required to enable CMN KFH.

Arm Error Source Table (AEST) ACPI table to represent CMN errors
SSDT table
AEST device driver for CMN.

SSDT Table

Add one entry to the SSDT table to define the _CRS object for the CMN Cyprus device memory. Refer to the ACPI for Arm Components specification for details on the individual fields.

// CMN 800 device
 Device (CMN8) { // CMN-800 device object for an X * Y
   Name (_HID, "ARMHC800")
   Name (_UID, Zero)
   Name (_CRS, ResourceTemplate () {
     // Descriptor for 1 GB of the CFG region at offset PERIPHBASE
     QWordMemory (
       ResourceConsumer,
       PosDecode,
       MinFixed,
       MaxFixed,
       NonCacheable,
       ReadWrite,
       0x00000000,       // Granularity
       0x100000000,      // Min, set to PERIPHBASE
       0x13FFFFFFF,      // Max
       0x000000000,      // Translation
       0x040000000,      // Range Length 1GB
       ,                 // ResourceSourceIndex
       ,                 // ResourceSource
       CFGM              // DescriptorName
     )
   })
 } // Device(CMN8)
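
Because the descriptor sets both MinFixed and MaxFixed, the Min, Max and Range Length fields must agree with each other. A quick arithmetic check, sketched in Python with the values from the example above (the variable and function names are hypothetical):

```python
# Values copied from the QWordMemory descriptor above.
RANGE_MIN = 0x100000000     # Min, set to PERIPHBASE
RANGE_MAX = 0x13FFFFFFF     # Max
RANGE_LENGTH = 0x040000000  # Range Length

GIB = 1 << 30

def range_is_consistent(range_min, range_max, length):
    """A fixed range covers [min, max] inclusive, so length == max - min + 1."""
    return range_max - range_min + 1 == length

print(range_is_consistent(RANGE_MIN, RANGE_MAX, RANGE_LENGTH))  # True
print(RANGE_LENGTH // GIB)                                      # 1 (GB)
```

If PERIPHBASE differs on a platform, Max would need to be recomputed as Min + Range Length - 1.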

AEST table

Each RAS capable device node is represented as an AEST node within the AEST table. For example, below is the AEST node entry for HNF0, where 0 represents the logical ID of the HNF. For more information, refer to the ACPI for the RAS and ACPI for Arm Components specifications. These specifications describe all the fields that must be populated to define an AEST node for a given CMN device node.

{
  .NodeResource = {
    .Vendor = {
      {
        EFI_ACPI_AEST_NODE_TYPE_VENDOR_DEFINED,    /* Type */
        sizeof (EFI_ACPI_AEST_NODE_DATA),          /* Length */
        0,                                         /* Reserved */
        sizeof (EFI_ACPI_AEST_NODE_STRUCT),        /* Offset to Node data */
        sizeof (EFI_ACPI_AEST_NODE_RESOURCE),      /* Offset to Node Interface */
        (sizeof (EFI_ACPI_AEST_NODE_RESOURCE) +    /* Offset to Node Interrupt */
         sizeof (EFI_ACPI_AEST_INTERFACE_STRUCT)),
        1,                                         /* Interrupt array size */
        0,                                         /* Timestamp */
        0,                                         /* Reserved1 */
        0,                                         /* Injection countdown rate */
      },
      // Vendor Node Structure
      AEST_NODE_TYPE_VENDOR_HID,                   /* Hardware ID */
      1,                                           /* Unique ID */
      // Vendor Data
      {
        0x00,                                      /* Offset HNF0 0x1700000 */
        0x00,
        0x70,
        0x01,
        0,
        0,
        0,
        0,
        0x00,                                      /* Offset HND 0x0000 */
        0x00,
        0,
        0,
      },
    },
  },
  {
    EFI_ACPI_AEST_INTERFACE_TYPE_MMIO,       /* Interface type */
    {0, 0, 0},                               /* Reserved */
    0,                                       /* Flags */
    0,                                       /* Base Address */
    0,                                       /* Record Index */
    0,                                       /* Num Error records */
    0,                                       /* Record implemented */
    0,                                       /* Group status reporting */
    0,                                       /* Addressing mode */
    0,                                       /* ACPI ARM error node device */
    0,                                       /* Processor Affinity */
    0,                                       /* ErrGsr base address */
  },
  {
    {
      EFI_ACPI_AEST_INTERRUPT_TYPE_FAULT_HANDLING,     /* Interrupt type */
      {0, 0},                                          /* Reserved */
      EFI_ACPI_AEST_INTERRUPT_FLAG_TRIGGER_TYPE_LEVEL, /* Flags */
      79,                                              /* GSIV */
      0,                                               /* ID */
      {0, 0, 0},                                       /* Reserved */
    },
  },
},

Note that the HNF0 error node does not define anything in the interface structure. CMN relies completely on the vendor-defined node data structure to communicate the device node offset and the respective HND node offset.
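
The vendor data bytes in the entry above are stored in little-endian order. The sketch below, in Python, shows how they decode back to the offsets named in the comments. It assumes that the first 8 bytes hold the device node offset and the next 4 bytes hold the HND offset, as the placement of the comments suggests; the names VENDOR_DATA and decode_vendor_data are hypothetical.

```python
import struct

# Vendor data bytes copied from the HNF0 AEST node entry above.
VENDOR_DATA = bytes([
    0x00, 0x00, 0x70, 0x01, 0, 0, 0, 0,  # device node offset (assumed 64-bit)
    0x00, 0x00, 0, 0,                    # HND node offset (assumed 32-bit)
])

def decode_vendor_data(data):
    """Unpack a little-endian 64-bit node offset followed by a 32-bit HND offset."""
    node_offset, hnd_offset = struct.unpack("<QI", data)
    return node_offset, hnd_offset

node_offset, hnd_offset = decode_vendor_data(VENDOR_DATA)
print(hex(node_offset), hex(hnd_offset))  # 0x1700000 0x0
```

The decoded values match the offsets given in the comments of the entry: 0x1700000 for HNF0 and 0x0000 for HND.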

AEST CMN driver for CMN

The AEST driver for CMN is implemented as an extension to the AEST ACPI table driver. At boot, the AEST CMN driver reads the _CRS object from the SSDT table to determine the CMN base address and size, and creates a virtual mapping of the CMN address space.

Each CMN device error node is described by the vendor-defined structure in the AEST ACPI table. The AEST ACPI driver registers a platform driver to process the vendor-defined error nodes. At boot, it parses the AEST table and, for each vendor-defined node it finds, adds the node data to a platform device structure and registers a platform device, which results in a call to the probe function. If the vendor HID of a registered platform device matches the CMN HID, the device is registered with the AEST CMN driver.

The AEST CMN driver reads the vendor platform device information into a driver specific data structure and maintains these device structures in a linked list. Each list entry holds the information for all the error nodes of the same device type. The driver also registers the IRQ handlers that process the FHI interrupt generated when a device node detects a CE, DE or UE. On an error event, the IRQ handler walks through all the device node structures and reads the ErrGsr register of each node. For each non-zero ErrGsr it finds, the handler logs the error records, clears the interrupt and returns. Below is an example log for a DE detected on HNS0 and HNI1.

[    2.117375] AEST_CMN: RAS v2 enabled = 0
[    2.118373] AEST_CMN: Error record registers for device node HNS0
[    2.119858] AEST_CMN: [HNS0] ErrFr_NS = 0x5200008012c9a2
[    2.121154] AEST_CMN: [HNS0] ErrCtlr_NS = 0x10d
[    2.122263] AEST_CMN: [HNS0] ErrStatus_NS = 0xc4800000
[    2.123512] AEST_CMN: [HNS0] ErrAddr_NS = 0x0
[    2.124573] AEST_CMN: [HNS0] ErrMisc0_NS = 0x0
[    2.125656] AEST_CMN: [HNS0] ErrMisc1_NS = 0x0
[    2.140341] AEST_CMN: RAS v2 enabled = 0
[    2.141305] AEST_CMN: Error record registers for device node HNI1
[    2.142784] AEST_CMN: [HNI1] ErrFr_NS = 0x120000801201a2
[    2.144077] AEST_CMN: [HNI1] ErrCtlr_NS = 0x10d
[    2.145181] AEST_CMN: [HNI1] ErrStatus_NS = 0xc4800000
[    2.146430] AEST_CMN: [HNI1] ErrAddr_NS = 0x0
[    2.147491] AEST_CMN: [HNI1] ErrMisc0_NS = 0x0
[    2.148570] AEST_CMN: [HNI1] ErrMisc1_NS = 0x0
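
The ErrStatus value in this log can be decoded to confirm that it reports a DE. The following is a minimal sketch in Python. It assumes that the CMN ErrStatus register follows the ERR<n>STATUS field layout of the Arm RAS architecture (AV bit 31, V bit 30, UE bit 29, OF bit 27, MV bit 26, CE bits 25:24, DE bit 23); the helper names are hypothetical and are not part of the AEST CMN driver.

```python
import re

# Assumed single-bit flag positions, following the Arm RAS ERR<n>STATUS layout.
ERRSTATUS_FLAGS = {"AV": 31, "V": 30, "UE": 29, "OF": 27, "MV": 26, "DE": 23}

# Matches lines such as "AEST_CMN: [HNS0] ErrStatus_NS = 0xc4800000".
LOG_PATTERN = re.compile(r"AEST_CMN: \[(\w+)\] (\w+) = (0x[0-9a-fA-F]+)")

def parse_log_line(line):
    """Return (node, register, value) for a register dump line, or None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    node, register, value = match.groups()
    return node, register, int(value, 16)

def decode_errstatus(value):
    """Return the flag bits as booleans plus the 2-bit CE field as an integer."""
    decoded = {name: bool((value >> pos) & 1) for name, pos in ERRSTATUS_FLAGS.items()}
    decoded["CE"] = (value >> 24) & 0x3
    return decoded

line = "[    2.122263] AEST_CMN: [HNS0] ErrStatus_NS = 0xc4800000"
node, register, value = parse_log_line(line)
print(node, register, decode_errstatus(value))
```

Under this assumed layout, 0xc4800000 has V, AV, MV and DE set and UE, OF and CE clear, which is consistent with a DE injected with BIT11 (AV) and BIT12 (MV) set in the PFG control register.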