Reliability, Availability, and Serviceability (RAS)

Overview

Reliability, Availability and Serviceability (RAS) is a measure of the robustness of a system. A RAS-enabled platform aims to ensure that the system produces correct outputs, stays operational, and is easy to maintain. RAS reduces system downtime by detecting hardware errors and correcting them when possible. The level of RAS to be achieved is implementation dependent. Various techniques help achieve RAS targets, e.g. fault prevention and fault removal, error handling and recovery, and fault handling. A well-designed RAS system ensures that software and hardware work together to minimize the impact of hardware faults on overall system operation and hence boost performance.

The RAS specification divides the RAS architectural extension support into two categories:

  • ARMv8-A RAS Extension

  • RAS System Architecture

The RAS architectural specification defines the hardware RAS extensions that the CPU and the system can implement to achieve the desired level of RAS support.

The ARMv8-A RAS Extension defines the RAS extensions that are mandatory for CPU implementations based on ARMv8.2 and above. To enable architectural support for the RAS extension in software, the RAS_EXTENSION flag must be set to 1.

The RAS system architecture defines the architectural support required to enable system-level RAS support on a platform. It defines a reusable component architecture that can detect and record errors and signal them to a Processing Element (PE). The PE is implementation defined; it can be anything capable of handling the given error, e.g. the AP, SCP or MCP. These architectural definitions make designing the software easier.

Component Definitions by RAS System Architecture

Below are some component definitions that the RAS System architecture defines:

Node

A node is one such component defined by the RAS system architecture. A system can have one or more error nodes. Architecturally, a node:

  • Implements one or more standard error records.

  • Records detected and consumed errors.

  • Might include control to disable the error reporting and recording while the software initializes.

  • Reports recorded errors with an asynchronous error reporting mechanism such as an interrupt, e.g. the Fault Handling Interrupt (FHI).

  • Implements a counter for counting corrected errors.

  • Logs timestamps in each error record.

  • Reports uncorrected errors by in-band error signaling (external abort).

  • Reports critical error conditions via the Critical Error Interrupt (CRI).

Error Record

The RAS system architecture defines a standard error record. A node captures the complete error information in these error records. The specification defines mechanisms to access error records through system registers or memory-mapped registers. A standard error record comprises:

  • ERR<n>STATUS: characterizes the error and marks valid status fields.

  • ERR<n>ADDR: error address register.

  • ERR<n>MISC<m>: miscellaneous error register. To be used for:

    • Identifying the Field Replaceable Unit (FRU).

    • Locating the error within the FRU.

    • Implementing corrected error counter to count the corrected errors.

    • Storing the timestamp value for recorded errors.

An error record records the following component error states:

  • Corrected Error (CE).

  • Deferred Error (DE).

  • Uncorrected Error (UE): UE has the following subtypes:

    • Uncontainable error (UC).

    • Unrecoverable error (UEU).

    • Recoverable error or Signaled error (UER).

    • Restartable error or Latent error (UEO).

Error Handling

There are two approaches to achieve error handling in software:

Firmware First Error Handling

Firmware First error handling requires that error events are handled in EL3 and then relayed to the OSPM for logging. On an error, the firmware consumes the error information and generates a standard Common Platform Error Record (CPER) information buffer, which is defined by the UEFI specification to store error information. The CPER is placed in firmware-reserved memory that is later shared with the OSPM when it is notified about the error.

On Arm Neoverse Reference Design platforms, Firmware First error handling is achieved using the Hardware Error Source Table (HEST) and Software Delegated Exception Interface (SDEI) tables. The Secure Partition (Standalone MM driver) is used to generate the CPER information for the error. At boot, the HEST table is published and the OSPM is made aware of the hardware error sources the platform supports.

At runtime, when a hardware fault is detected, the corresponding error or fault handling interrupt is generated. This interrupt is taken to the EL3 runtime firmware, which calls into the Secure Partition. The Secure Partition generates the CPER record and places it in firmware-reserved memory. The EL3 runtime firmware then notifies the OSPM about the error using SDEI.

Kernel First Error Handling

In Kernel First error handling, errors are handled directly by the OSPM without firmware intervention. The fault and error events generated by the platform are taken directly to the OSPM.

Arm Neoverse Reference Design platforms use the Arm Error Source Table (AEST) to achieve Kernel First error handling. The AEST is an ACPI table defined for the RAS specification. It describes the hardware error sources present on the platform and comprises one or more error nodes. An AEST node entry identifies the component the node belongs to, e.g. processor, memory, SMMU or GIC. It defines the interface type for accessing the node, e.g. memory mapped or system register, and lists the interrupts the node supports.

The OSPM implements an AEST driver module that traverses the AEST table. The module registers IRQ handlers for all supported node interrupts. A fault event occurring on a node or error source is forwarded directly to the OSPM for handling.

Error Injection

Error injection is a micro-architectural feature defined by RAS to inject errors into the RAS-supported system components. Software can use the error injection registers to inject an error and test the error handling software implemented by the platform.

Arm Neoverse Reference Designs use the Error Injection (EINJ) ACPI table defined in the ACPI specification to implement the error injection feature. EINJ is an action- and instruction-based table that defines a set of actions and their corresponding instructions. Each action is also assigned a firmware-reserved memory space to store action-specific data. An instruction is essentially a read or a write operation performed on that reserved memory.

On Arm Neoverse Reference Platforms, the firmware at EL3 implements the functionality to program the error injection registers. The OSPM initiates the injection and generates an SPI to call into the firmware. EINJ defines an action to program the GICD register that triggers an SPI, which is handled in EL3.

Both Firmware First and Kernel First software use the EINJ ACPI table to validate their functionality.

Note

Error injection, whether Firmware First or Kernel First, is always initiated from the kernel.

Error Injection via Kernel

CPU Error Injection

The Neoverse RD-N2 platforms support two error nodes, and the presence of these nodes enables the RAS extension.

  • Node 0: Includes the L3 memory system in the DSU.

  • Node 1: Includes the private L1 and L2 memory systems in the CPU.

RD-Fremont supports only one error node.

  • Node 0: Includes the private L1 and L2 memory systems in the CPU.

The CPU supports SED (Single Error Detect) parity and SECDED (Single Error Correct Double Error Detect) ECC capabilities.

The Rd-Fremont-Cfg1 and RdN2 platforms also support injecting errors to verify the error handling software.

Note

The Neoverse RD-Fremont reference design platforms are based on a direct connect configuration and have no DSU. Hence they support only one error node, i.e. Node 0.

Error Injection Software Sequence

The CPU implements Pseudo Fault Generation (PFG) registers. With these registers, software can inject a CE, DE or UE into the cache RAMs.

Detailed error injection software sequence:

  • Select error record for L1 and L2 memory systems i.e. Node0

    • write_errselr_el1 (0)

  • Program the Error Control Register to enable Error Detection, FHI for CE, DE and UE.

    • write_erxctlr_el1 (0x109) (Note: To enable ERI on UE write 0x10D)

  • Program the PFG Control Register to 0.

    • write_cpu_pfg_ctrl_register (0)

  • Clear the Error Status Register by writing 1 to its write-one-to-clear fields.

    • write_erxstatus_el1 (0xFFC00000)

  • Set PFG countdown register to 1.

    • write_cpu_pfg_cdn_register (1)

  • For Deferred Error (DE) injection write

    • write_cpu_pfg_ctrl_register (0x80000020) [Generates FHI interrupt]
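
The constants written in the sequence above can be cross-checked with a small sketch. The bit positions used here (ED at bit 0, UI at bit 2, FI at bit 3, CFI at bit 8, and the write-one-to-clear status range [31:22]) are assumptions taken from the generic Arm RAS register layout, not from this platform's TRM, and the helper name is hypothetical.

```python
# Bit positions assumed from the generic Arm RAS ERR<n>CTLR / ERR<n>STATUS
# layouts; confirm them against the TRM of the CPU in use.
ERXCTLR_ED = 1 << 0    # error detection and correction enable
ERXCTLR_UI = 1 << 2    # error recovery interrupt (ERI) on uncorrected errors
ERXCTLR_FI = 1 << 3    # fault handling interrupt (FHI) enable
ERXCTLR_CFI = 1 << 8   # fault handling interrupt on corrected errors


def erxctlr_value(eri_on_ue=False):
    """Compose the ERXCTLR_EL1 value used in the injection sequence."""
    value = ERXCTLR_ED | ERXCTLR_FI | ERXCTLR_CFI
    if eri_on_ue:
        value |= ERXCTLR_UI
    return value


# ERXSTATUS_EL1 bits [31:22] are assumed to be write-one-to-clear, so the
# clearing write sets every bit in that range.
ERXSTATUS_CLEAR = sum(1 << bit for bit in range(22, 32))

print(hex(erxctlr_value()))                # 0x109
print(hex(erxctlr_value(eri_on_ue=True)))  # 0x10d
print(hex(ERXSTATUS_CLEAR))                # 0xffc00000
```

Under these assumptions the composed values match the 0x109, 0x10D and 0xFFC00000 constants in the sequence.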

Procedure to Perform Error Injection

Note

This section assumes the user has completed the Getting Started chapter and has a functional working environment.

Error Handling Mode Selection

The CPU supports both Firmware First and Kernel First error handling modes; the default mode is Firmware First.

Important

Only one error handling mode can be enabled at a time.

The error handling mode is a build-time option. To select a mode, navigate to the <workspace>, edit the configuration file of the platform of interest, and look for the TF_A_RAS_FW_FIRST flag.

As an example, for the RD-Fremont Cfg1 platform:

vim <workspace>/build-scripts/configs/rdfremontcfg1/rdfremontcfg1
  • Firmware First Selection:

TF_A_RAS_FW_FIRST = 1
  • Kernel First Selection

TF_A_RAS_FW_FIRST = 0

Note

Perform a clean build after switching the error handling mode.

Build and Boot Operating System(s)

Refer to the documentation for any of the supported operating systems to build the reference design platform software stack and boot into the OS.

Inject Error

After the boot is complete, use the EINJ table debugfs entries to inject the error according to the error handling scheme selected.

The field sel-firmware-first in oem-einj toggles Firmware First error injection; the default is Kernel First error injection. The field sel-error-type chooses the type of error to inject; the current implementation supports only deferred errors.

Firmware First Error Injection
mount -t debugfs none /sys/kernel/debug # Step needed for Buildroot only
echo 0x80020000 > /sys/kernel/debug/apei/einj/error_type
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-firmware-first
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-component
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-error-type
echo 1 > /sys/kernel/debug/apei/einj/error_inject

On successful error injection, the firmware logs the error information on the console.

Check the secure uart terminal (window with the name FVP terminal_sec_uart) for a log similar to below.

SP 8001: ErrAddr = 0x8F840
SP 8001: MmEntryPoint Done
INFO:    EINJ event received 83
INFO:    cpu_id 2
INFO:    Injecting DE...
INFO:    ErrStatus = 0x0
INFO:    [CPU RAS] CPU intr received = 17 on cpu_id = 2
INFO:    [CPU RAS] ERXMISC0_EL1 = 0x0
INFO:    [CPU RAS] ERXSTATUS_EL1 = 0x40800000
INFO:    [CPU RAS] ERXADDR_EL1 = 0x0 buff_base = 0xf4600000

Check the non-secure uart terminal (window with the name FVP terminal_nsec_uart) for a log similar to below.

{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 10
{2}[Hardware Error]: event severity: recoverable
{2}[Hardware Error]:  Error 0, type: recoverable
{2}[Hardware Error]:   section_type: ARM processor error
{2}[Hardware Error]:   MIDR: 0x00000000410fd840
{2}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081020000
{2}[Hardware Error]:   running state: 0x1
{2}[Hardware Error]:   Power State Coordination Interface state: 0
{2}[Hardware Error]:   Error info structure 0:
{2}[Hardware Error]:   num errors: 1
{2}[Hardware Error]:    first error captured
{2}[Hardware Error]:    error_type: 0, cache error
{2}[Hardware Error]:    error_info: 0x000000000002001f
{2}[Hardware Error]:     transaction type: Generic
{2}[Hardware Error]:     operation type: Generic error (type cannot be determined)
{2}[Hardware Error]:     cache level: 0
{2}[Hardware Error]:     processor context not corrupted
{2}[Hardware Error]:     the error has not been corrected
{2}[Hardware Error]:    physical fault address: 0x0000000000000000
{2}[Hardware Error]:   Context info structure 0:
{2}[Hardware Error]:    register context type: AArch64 general purpose registers
{2}[Hardware Error]:    00000000: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000010: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000020: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000030: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000040: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000050: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000060: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000070: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000080: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    00000090: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000a0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000b0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000c0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000d0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000e0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]:    000000f0: 00000000 00000000 00000000 00000000
Kernel First Error Injection
mount -t debugfs none /sys/kernel/debug # Step needed for Buildroot only
echo 0x80020000 > /sys/kernel/debug/apei/einj/error_type
echo 0 > /sys/kernel/debug/apei/einj/oem-einj/sel-firmware-first
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-component
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-error-type
echo 1 > /sys/kernel/debug/apei/einj/error_inject
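
The Firmware First and Kernel First command sequences differ only in the value written to sel-firmware-first. As a minimal sketch, the writes can be generated from one helper. The debugfs paths and values come from the commands above; the function names `einj_steps` and `inject` are hypothetical, and `inject` is only meaningful on a target where debugfs is mounted.

```python
from pathlib import Path

EINJ_DIR = Path("/sys/kernel/debug/apei/einj")
VENDOR_ERROR_TYPE = 0x80020000  # vendor-defined error type used in this guide


def einj_steps(firmware_first, component, error_type):
    """Return the ordered (debugfs entry, value) pairs for one injection."""
    return [
        ("error_type", hex(VENDOR_ERROR_TYPE)),
        ("oem-einj/sel-firmware-first", "1" if firmware_first else "0"),
        ("oem-einj/sel-component", str(component)),
        ("oem-einj/sel-error-type", str(error_type)),
        ("error_inject", "1"),  # must be written last: it triggers the injection
    ]


def inject(firmware_first, component, error_type, einj_dir=EINJ_DIR):
    """Write the sequence to debugfs (requires root on the target)."""
    for name, value in einj_steps(firmware_first, component, error_type):
        (Path(einj_dir) / name).write_text(value + "\n")


# Show the Kernel First CPU deferred-error sequence without touching debugfs.
for name, value in einj_steps(firmware_first=False, component=2, error_type=2):
    print(f"echo {value} > {EINJ_DIR / name}")
```

Keeping the sequence as data makes it easy to review the order of the writes before running `inject` on the platform.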

On successful error injection, the kernel receives an error event in the IRQ handler. The handler traverses the error record information and logs the error.

Check the non-secure uart terminal (window with the name FVP terminal_nsec_uart) for a log similar to below.

[ 2365.760926] Injecting DE-
[ 2365.760928] ARM RAS: error from CPU7
[ 2365.760930] ERR0STATUS: 0x40800000

EDAC (Error Detection and Correction)

The EDAC (Error Detection and Correction) Linux interface provides a framework for reporting memory and CPU errors encountered on a system. It allows the kernel to detect and manage errors, providing valuable information for diagnosing and troubleshooting hardware issues.

EDAC support is currently enabled only for the CPU, on both RdN2 and RdFremont platforms. Error counts are exposed through the sysfs interface, which gives users access to the number of Corrected (CE) and Uncorrected (UE) errors that have occurred in the system, aiding in monitoring and diagnosing hardware issues.

Note

This feature is supported only on RD-Fremont Cfg1 and RdN2Cfg1 platforms, and only when Kernel First error handling is enabled.

cat /sys/devices/system/edac/cpu/cpu*/ue_count
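
For monitoring scripts, the same counters can be collected per CPU. This is a minimal sketch: the `ue_count` path comes from the command above, while the presence of a sibling `ce_count` file and the function name `read_edac_counts` are assumptions.

```python
from pathlib import Path


def read_edac_counts(edac_cpu_dir="/sys/devices/system/edac/cpu"):
    """Return {cpu_name: {counter_name: value}} for the counters that exist."""
    counts = {}
    for cpu_dir in sorted(Path(edac_cpu_dir).glob("cpu*")):
        entry = {}
        for name in ("ce_count", "ue_count"):  # ce_count is an assumption
            counter = cpu_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        if entry:
            counts[cpu_dir.name] = entry
    return counts


# On a system without the EDAC CPU sysfs nodes this prints an empty dict.
print(read_edac_counts())
```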

Shared RAM Error Injection

The RD-Fremont and RD-N2 platforms support a Shared RAM that is shared between the AP, MCP, SCP and RSS. The Shared RAM is protected with SECDED (Single Error Correct Double Error Detect) ECC. The RD-Fremont platform defines ECC RAS registers to log any ECC errors that occur during Shared RAM accesses from each master (AP, SCP, MCP or RSS). On RD-Fremont, four sets of ECC RAS registers are defined for each master, to log errors based on the master's Physical Address Space (PAS). On RD-N2, two sets of ECC RAS registers are defined.

RD-Fremont: the Shared RAM ECC RAS registers are listed below:

  • AP Secure RAM ECC RAS registers

  • AP Non-Secure RAM ECC RAS registers

  • AP Realm RAM ECC RAS registers

  • AP Root RAM ECC RAS registers

  • SCP Secure RAM ECC RAS registers

  • SCP Non-Secure RAM ECC RAS registers

  • SCP Realm RAM ECC RAS registers

  • SCP Root RAM ECC RAS registers

  • MCP Secure RAM ECC RAS registers

  • MCP Non-Secure RAM ECC RAS registers

  • MCP Realm RAM ECC RAS registers

  • MCP Root RAM ECC RAS registers

RD-N2: the Shared RAM ECC RAS registers are listed below:

  • AP Secure RAM ECC RAS registers

  • AP Non-Secure RAM ECC RAS registers

  • SCP Secure RAM ECC RAS registers

  • SCP Non-Secure RAM ECC RAS registers

  • MCP Secure RAM ECC RAS registers

  • MCP Non-Secure RAM ECC RAS registers

Note

This test is supported only on RD-Fremont Cfg1 and RdN2Cfg1 platforms, using Firmware First Error Handling.

Error Injection on Shared RAM

Each ECC RAS register set implements an SRAMECC_ERRMISC1 register, which provides a way to inject a Corrected Error (CE) or Uncorrected Error (UE) into the Shared RAM. The error injection takes effect only if the register programming is followed by a read access to the Shared RAM. If the injection is successful, the error records for the master and the respective access are populated with error information and an error interrupt is delivered to the master.

RD-Fremont Shared SRAM

The detailed error injection software sequence below injects a 1-bit CE into Root Shared RAM from the AP on RD-Fremont.

  • Add memory map for the Shared RAM ECC RAS registers memory space.

  • Add memory map for the Shared memory space.

  • Program the SRAMECC_ERRCTRL register to enable ED (Error Detection), FI (Fault Interrupt) and CFI (Corrected Fault Interrupt).

  • Program the SRAMECC_ERRMISC1 register to enable INJECT_CE.

  • Read from memory mapped shared memory space to inject the error.

RD-N2 Shared SRAM

The detailed error injection software sequence below injects a 1-bit CE into Non-Secure Shared RAM from the AP on RD-N2.

  • Add memory map for the Shared RAM ECC RAS registers memory space.

  • Add memory map for the Shared memory space.

  • Program the SRAMECC_ERRCTRL register to enable RAM_ECC_EN and set INJECT_ERROR to [01] for Correctable error.

  • Read from memory mapped shared memory space to inject the error.

Procedure to Perform Error Injection on Shared RAM

Note

This section assumes the user has completed the Getting Started chapter and has a functional working environment.

Error Handling Mode Selection

Both platforms support only the Firmware First mode for SRAM error handling, which is also the default mode.

Important

Only Firmware first mode is supported for SRAM-Errors.

The error handling mode is a build-time option. To select it, navigate to the <workspace>, edit the configuration file of the platform of interest, and look for the TF_A_RAS_FW_FIRST flag.

As an example, for the RD-Fremont Cfg1 platform:

vim <workspace>/build-scripts/configs/rdfremontcfg1/rdfremontcfg1
  • Firmware First Selection:

TF_A_RAS_FW_FIRST = 1

Note

Perform a clean build after switching the error handling mode.

Build and Boot Operating System(s)

Refer to the documentation for any of the supported operating systems to build the reference design platform software stack and boot into the OS.

Inject Error on Shared RAM

Run the commands below to inject a 1-bit CE into the Shared RAM. This test uses the EINJ ACPI table to perform error injection. Shared RAM is not a standard error type defined in the EINJ ACPI table, so the vendor-defined error type is used. Bit 31 of the error_type field indicates a vendor error type. Use the error_type value 0x8002_0000 to represent Shared RAM errors.

mount -t debugfs none /sys/kernel/debug # Step needed for Buildroot only
echo 0x80020000 > /sys/kernel/debug/apei/einj/error_type
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-firmware-first
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-component
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-error-type
echo 1 > /sys/kernel/debug/apei/einj/error_inject

Shared RAM error handling happens in Firmware first mode. The EL3 firmware receives the fault handling interrupt (FHI) for the corrected error detected and logs the error on the secure console.

EDAC MC0: 1 CE unknown error on unknown memory
( page:0x8f offset:0x840 grain:-281474976710655 syndrome:0x0 - APEI location: )
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 20
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:   section_type: memory error
{1}[Hardware Error]:   physical_address: 0x000000000008f840
{1}[Hardware Error]:   physical_address_mask: 0x0000ffffffffffff

Error Injection via SCP Utility

The error injection utility is referred to as einj-util in this document. Einj-util is a command-line utility designed for SCP. This utility integrates with the SCP CLI Debugger, enabling users to enter commands at runtime. Einj-util injects errors into various RAS-supported components when a user provides an error injection command in the CLI. This utility helps validate the behavior of RAS-capable hardware components when an error is detected and reported.

The term “Component” defines the RAS-supported components for which error injection is supported. “Sub-component” signifies the next level of error categorization for each component, and it varies between components. For instance, in the context of SRAM, subcomponents represent error injection in different worlds: Root, Secure, Realm, and Non-Secure. “Type” defines the types of errors supported by each component. The supported error types are Correctable Error (CE), Deferred Error (DE) and Uncorrectable Error (UE).

Procedure to Perform Error Injection into Various Components

Note

This section assumes the user has completed the Getting Started chapter and has a functional working environment.

Build Software Stack

This procedure doesn’t require a full host OS to be present, but the Busybox Boot is still recommended as it is the simplest method to build the required components.

Boot up to SCP CLI Debugger Shell

Once the build step is completed, boot the Busybox stack on FVP as normal but identify the window with the name FVP terminal_uart_scp once it shows up, as this window is the one to interact with. The steps are as follows:

  • Launch the FVP and access the SCP UART.

  • Once in the SCP UART terminal, use Ctrl + e to enter the CLI.

  • To access the help menu for the einj-util utility, run the command

einj-util -h
  • The “help” command displays the CLI usage.

> einj-util -h
    Inject error into various components.

    Usage: einj-util -comp <n> -subcomp <n> -type <n>

    -comp: sram (0), tcm (1), cpu (2), rsm (3)

    -subcomp:

            sram: root (0), secure (1), non-secure (2), realm (3)

            tcm: itcm (0), dtcm (1)

            cpu: always 0 for now

            rsm: secure (0), non-secure (1)

    -type:

            sram/tcm/rsm: correctable (0), uncorrectable (1)

            cpu: correctable (0), uncorrectable (1), deferred (2)

    example:

            1) ce into shared sram from secure world:
                    einj-util -comp 0 -subcomp 1 -type 0
            2) ce into scp itcm:
                    einj-util -comp 1 -subcomp 0 -type 0
            3) cpu ue:
                    einj-util -comp 2 -subcomp 0 -type 1
  • To exit the CLI Debugger, press Ctrl + d.
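
The numeric arguments in the help text can be wrapped in a small lookup so that test scripts use names instead of numbers. This is a minimal sketch: the numeric values are copied from the help output above, while the function name `einj_util_command` and the label "core" for the single CPU subcomponent are hypothetical.

```python
# Numeric encodings copied from the einj-util help text.
COMPONENTS = {"sram": 0, "tcm": 1, "cpu": 2, "rsm": 3}
SUBCOMPONENTS = {
    "sram": {"root": 0, "secure": 1, "non-secure": 2, "realm": 3},
    "tcm": {"itcm": 0, "dtcm": 1},
    "cpu": {"core": 0},  # help text: "always 0 for now"
    "rsm": {"secure": 0, "non-secure": 1},
}
ERROR_TYPES = {
    "sram": {"correctable": 0, "uncorrectable": 1},
    "tcm": {"correctable": 0, "uncorrectable": 1},
    "rsm": {"correctable": 0, "uncorrectable": 1},
    "cpu": {"correctable": 0, "uncorrectable": 1, "deferred": 2},
}


def einj_util_command(comp, subcomp, err_type):
    """Build the einj-util command line for the named combination."""
    try:
        return "einj-util -comp {} -subcomp {} -type {}".format(
            COMPONENTS[comp],
            SUBCOMPONENTS[comp][subcomp],
            ERROR_TYPES[comp][err_type],
        )
    except KeyError as exc:
        raise ValueError(f"unsupported combination: {exc}") from None


print(einj_util_command("sram", "secure", "correctable"))
print(einj_util_command("tcm", "itcm", "correctable"))
print(einj_util_command("cpu", "core", "uncorrectable"))
```

The three printed commands correspond to the three examples in the help output.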

Various Error Injection Scenarios

Component    Subcomponent      Type of Error  Error Status
-----------  ----------------  -------------  ------------
Shared SRAM  Secure World      CE             0x86000000
Shared SRAM  Root World        UE             0xa4000000
RSM SRAM     Secure World      CE             0x86000000
RSM SRAM     Non-Secure World  UE             0xa4000000
TCM          ITCM              CE             0x5
TCM          DTCM              UE             0x7
CPU          Core              CE             0xC6000000
CPU          Core              UE             0x60000000
CPU          Core              DE             0x40800000

Shared SRAM Error Injection

Run the following command to inject a correctable error into shared SRAM from the secure world.

> einj-util -comp 0 -subcomp 1 -type 0

After triggering the error, the interrupt handler is invoked, logging error records.

[SRAM_INT] ErrStatus = 0x86000000
[SRAM_INT] fwk_int number = 24
[SRAM_INT] ErrAddr   = 0x10
SRAM ECC Error Status Register Bit Descriptions
AV[31:31]  :  Address Valid

MV[26:26]  :  Miscellaneous Registers Valid

CE[25:24]  :  Correctable error has occurred

DE[23:23]  :  Deferred Error

UET[21:20] :  Uncorrected Error Type

SERR[7:0]  :  Primary Error code
CPU Error Injection

Run the following command to inject a CPU correctable error.

> einj-util -comp 2 -subcomp 0 -type 0

The ErrorStatus register captures information about the triggered CPU error.

Injecting CPU CE
ErrStatus  0xC6000000
ErrAddress 0x0
Core Error Injection ERXSTATUS_EL1 Register Description
AV[31:31]  :  Address Valid

V[30:30]   :  Status Register Valid

MV[26:26]  :  Miscellaneous Registers Valid

CE[25:24]  :  Corrected Error

DE[23:23]  :  Deferred Error

UET[21:20] :  Uncorrected Error Type

SERR[4:0]  :  Primary Error code
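
The status values shown in the logs can be decoded mechanically from the bit descriptions. This is a minimal sketch. It assumes DE at bit 23 (as in the SRAM description and the logged value 0x40800000), an 8-bit SERR field, and a UE flag at bit 29. The UE bit is not listed above and is taken from the generic Arm RAS status layout. The function names are hypothetical.

```python
# (high bit, low bit) of each field; see the lead-in for which are assumptions.
STATUS_FIELDS = {
    "AV": (31, 31),
    "V": (30, 30),
    "UE": (29, 29),   # assumption: not listed in the description above
    "MV": (26, 26),
    "CE": (25, 24),
    "DE": (23, 23),
    "UET": (21, 20),
    "SERR": (7, 0),
}


def decode_status(value):
    """Split a 32-bit error status value into its named fields."""
    fields = {}
    for name, (high, low) in STATUS_FIELDS.items():
        width = high - low + 1
        fields[name] = (value >> low) & ((1 << width) - 1)
    return fields


def set_fields(value):
    """Return the names of the fields that are non-zero."""
    return [name for name, field in decode_status(value).items() if field]


for status in (0xC6000000, 0x60000000, 0x40800000, 0x86000000, 0xA4000000):
    print(hex(status), set_fields(status))
```

With these assumptions, the CPU CE value 0xC6000000 decodes to AV, V, MV and CE = 0b10, and the DE value 0x40800000 decodes to V and DE.
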
SCP ITCM/DTCM Error Injection

Invoke the following command to inject a correctable error into SCP ITCM.

> einj-util -comp 1 -subcomp 0 -type 0

The error record information will be logged in the following manner.

ITCM
Injecting CE
[TCM_INT] fwk_int number = 21
[TCM_INT] ErrCode   = 0x9
[TCM_INT] ErrStatus = 0x5
[TCM_INT] ErrAddr   = 0x34d8
TCMECC_ERRSTATUS Bit Descriptions
OF[2:2] : Multiple errors occurred before SW cleared the current error

UE[1:1] : Uncorrectable and uncontainable error have occurred

CE[0:0] : Correctable error has occurred
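
The TCM status values can be read the same way. A minimal sketch using only the three bits described above; the function name `decode_tcm_status` is hypothetical.

```python
def decode_tcm_status(value):
    """Decode TCMECC_ERRSTATUS using the bit positions described above."""
    return {
        "OF": bool(value & (1 << 2)),  # further errors before software cleared
        "UE": bool(value & (1 << 1)),  # uncorrectable error
        "CE": bool(value & (1 << 0)),  # correctable error
    }


print(decode_tcm_status(0x5))
print(decode_tcm_status(0x7))
```

Read this way, the logged value 0x5 has both CE and OF set, not CE alone.
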
RSM SRAM Error Injection

Invoke the following command to trigger a correctable error in RSM SRAM from the secure world.

> einj-util -comp 3 -subcomp 0 -type 0

The error record information is logged as follows:

Injecting CE into RSM SRAM
[RSM_INT] ErrStatus = 0x86000000
[RSM_INT] fwk_int number = 29
[RSM_INT] ErrAddr   = 0x10

Note

Refer to the SRAM ECC Error Status register bit descriptions to decode the error status for RSM SRAM errors.

Expected Output for the Various Scenarios

Each scenario below is listed as description, command, and expected output.

Shared SRAM Secure World CE

einj-util -comp 0 -subcomp 1 -type 0

Injecting CE into Shared SRAM
[SRAM_INT] ErrStatus = 0x86000000
[SRAM_INT] fwk_int number = 24
[SRAM_INT] ErrAddr = 0x10

Shared SRAM Secure World UE

einj-util -comp 0 -subcomp 1 -type 1

Injecting UE into Shared SRAM
[SRAM_INT] ErrStatus = 0xa4000000
[SRAM_INT] fwk_int number = 24
[SRAM_INT] ErrAddr = 0x1

Shared SRAM Root CE

einj-util -comp 0 -subcomp 0 -type 0

Injecting CE into Shared SRAM
[SRAM_INT] ErrStatus = 0x86000000
[SRAM_INT] fwk_int number = 26
[SRAM_INT] ErrAddr = 0x10

Shared SRAM Root UE

einj-util -comp 0 -subcomp 0 -type 1

Injecting UE into Shared SRAM
[SRAM_INT] ErrStatus = 0xa4000000
[SRAM_INT] fwk_int number = 26
[SRAM_INT] ErrAddr = 0x10

RSM SRAM Secure World CE

einj-util -comp 3 -subcomp 0 -type 0

Injecting CE into RSM SRAM
[RSM_INT] ErrStatus = 0x86000000
[RSM_INT] fwk_int number = 29
[RSM_INT] ErrAddr = 0x10

RSM SRAM Secure World UE

einj-util -comp 3 -subcomp 0 -type 1

Injecting UE into RSM SRAM
[RSM_INT] ErrStatus = 0xa4000000
[RSM_INT] fwk_int number = 29
[RSM_INT] ErrAddr = 0x10

RSM SRAM Non-secure World CE

einj-util -comp 3 -subcomp 1 -type 0

Injecting CE into RSM SRAM
[RSM_INT] ErrStatus = 0x86000000
[RSM_INT] fwk_int number = 29
[RSM_INT] ErrAddr = 0x10

RSM SRAM Non-secure World UE

einj-util -comp 3 -subcomp 1 -type 1

Injecting UE into RSM SRAM
[RSM_INT] ErrStatus = 0xa4000000
[RSM_INT] fwk_int number = 29
[RSM_INT] ErrAddr = 0x10

TCM ITCM CE

einj-util -comp 1 -subcomp 0 -type 0

ITCM
Injecting CE
[TCM_INT] ErrStatus = 0x5
[TCM_INT] fwk_int number = 21
[TCM_INT] ErrCode = 0x9
[TCM_INT] ErrAddr = 0x6b38

TCM ITCM UE

einj-util -comp 1 -subcomp 0 -type 1

ITCM
Injecting UE
[TCM_INT] ErrStatus = 0x7
[TCM_INT] fwk_int number = 21
[TCM_INT] ErrCode = 0x9
[TCM_INT] ErrAddr = 0x6a46

TCM DTCM CE

einj-util -comp 1 -subcomp 1 -type 0

DTCM
Injecting CE
[TCM_INT] ErrStatus = 0x7
[TCM_INT] fwk_int number = 21
[TCM_INT] ErrCode = 0xb
[TCM_INT] ErrAddr = 0x6b3c

TCM DTCM UE

einj-util -comp 1 -subcomp 1 -type 1

DTCM
Injecting UE
[TCM_INT] ErrStatus = 0x7
[TCM_INT] fwk_int number = 21
[TCM_INT] ErrCode = 0xb
[TCM_INT] ErrAddr = 0x6a46

CPU Core CE

einj-util -comp 2 -subcomp 0 -type 0

Injecting CPU CE
ErrStatus  0xC6000000
ErrAddress 0x0

CPU Core UE

einj-util -comp 2 -subcomp 0 -type 1

Injecting CPU UE
ErrStatus  0x60000000
ErrAddress 0x0

CPU Core DE

einj-util -comp 2 -subcomp 0 -type 2

Injecting CPU DE
ErrStatus  0x40800000
ErrAddress 0x0

Rasdaemon

Overview

Rasdaemon is an error logging tool used to log RAS (Reliability, Availability and Serviceability) events. The daemon uses the kernel trace subsystem to capture the error events reported by kernel modules. The trace events captured in /sys/kernel/debug/tracing are reported by rasdaemon.

Enabling rasdaemon creates an “instances/rasdaemon” directory inside the “/sys/kernel/debug/tracing” debugfs directory. All the tracing events enabled by rasdaemon are captured in this directory.

Note

This test is supported only on RD-Fremont-Cfg1 and Rd-N2-Cfg1 platforms, using Firmware First Error Handling.

Enabling Rasdaemon

Note

This section assumes the user has completed the chapter Getting Started and has a functional working environment.

The Rd-N2-Cfg1 and Rd-Fremont-Cfg1 platforms have the rasdaemon package enabled by default in the buildroot file system. The Buildroot repository includes support for rasdaemon, so any platform that boots with buildroot can enable the rasdaemon package.

To enable rasdaemon on other platform variants, add the following lines to the buildroot defconfig file.

BR2_PACKAGE_RASDAEMON=y
BR2_GLOBAL_PATCH_DIR="board/aarch64-efi/rdinfra/patches/"

To add rasdaemon support on the Rd-Fremont platform, add the above two lines to the file configs/rdfremont/buildroot/aarch64_rdinfra_defconfig.

Build the software stack for buildroot. Refer to Build the platform software.

Perform a buildroot filesystem boot. Refer to Booting with Buildroot as the filesystem.

On the buildroot shell, run the following commands to enable rasdaemon:

mount -t debugfs none /sys/kernel/debug
rasdaemon -e

This command starts rasdaemon and enables trace events for memory controller errors, AER, non-standard error records, Arm events and Arm RAS external events.

rasdaemon: ras:mc_event event enabled
rasdaemon: ras:aer_event event enabled
rasdaemon: ras:non_standard_event event enabled
rasdaemon: ras:arm_ras_ext_event event enabled
rasdaemon: ras:arm_event event enabled

Test to validate rasdaemon

Validating the logging of RAS events by rasdaemon requires a platform with RAS support enabled. Here we look at the 1-bit DE reported by the CPU on the Rd-Fremont-Cfg1 platform, which has RAS support enabled. Perform the Firmware First error handling test for a 1-bit DE on the CPU. The kernel logs this event and also reports an arm_event for this error to the tracing subsystem. Rasdaemon captures this arm_event trace log and prints it.

Refer to CPU Error Injection to perform the CPU Firmware First error handling test on the Rd-Fremont-Cfg1 platform. On error injection, the kernel logs the error and the arm_event. The trace event is also recorded in the rasdaemon buffer. To print the trace from rasdaemon, run the following command.

cat /sys/kernel/debug/tracing/instances/rasdaemon/trace_pipe

The above command outputs the following log from rasdaemon.

<idle>-0       [004] d.h1.   555.977157: arm_event: affinity level: 255;
MPIDR: 0000000081040000; MIDR: 00000000410fd840; running state: 1; PSCI state: 0
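
When automating this check, the arm_event line can be split into named fields. This is a minimal sketch that assumes the trace line keeps the "key: value;" layout shown above; the function name `parse_arm_event` is hypothetical, and the sample is the log line from this section joined onto one line.

```python
SAMPLE = (
    "<idle>-0       [004] d.h1.   555.977157: arm_event: affinity level: 255; "
    "MPIDR: 0000000081040000; MIDR: 00000000410fd840; running state: 1; "
    "PSCI state: 0"
)


def parse_arm_event(line):
    """Return the arm_event fields as a dict, or None for other trace lines."""
    _, marker, payload = line.partition("arm_event:")
    if not marker:
        return None
    fields = {}
    for item in payload.split(";"):
        key, colon, value = item.rpartition(":")
        if colon:
            fields[key.strip()] = value.strip()
    return fields


event = parse_arm_event(SAMPLE)
print(event)
print(hex(int(event["MPIDR"], 16)))
```

A test script can then compare the MPIDR and MIDR values with the ones reported in the kernel's hardware error log.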