Reliability, Availability, and Serviceability (RAS)
Overview
Reliability, Availability and Serviceability (RAS) is a measure of the robustness of a system. A RAS-enabled platform aims to ensure that the system produces correct outputs, stays operational and is easy to maintain. RAS reduces system downtime by detecting hardware errors and correcting them when possible. The level of RAS to be achieved is implementation dependent. Various techniques help achieve RAS targets, e.g. fault prevention and fault removal, error handling and recovery, and fault handling. A well-designed RAS system ensures that software and hardware work together to minimize the impact of hardware faults on overall system operation and performance.
The RAS specification divides the RAS architectural extension support into two categories:
ARMv8-A RAS Extension
RAS System Architecture
The RAS architectural specification defines the hardware RAS extensions that the CPU and the system can implement to achieve the desired level of RAS support.
The ARMv8-A RAS Extension defines the RAS extensions that are mandatory for CPU implementations based on ARMv8.2 and above. To enable architectural support for the RAS extension in software, the RAS_EXTENSION flag must be set to 1.
The RAS System Architecture defines the architectural support required to enable system-level RAS support on a platform. It defines a reusable component architecture that can detect and record errors, and signal them to a Processing Element (PE). The PE is implementation defined; it can be anything capable of handling the given error, e.g. AP, SCP or MCP. These architectural definitions make designing the software easier.
Component Definitions by RAS System Architecture
Below are some component definitions that the RAS System architecture defines:
Node
A node is one such component defined by the RAS System Architecture. A system can have one or more error nodes. Architecturally, a node:
Implements one or more standard error records.
Records detected and consumed errors.
Might include a control to disable error reporting and recording while the software initializes.
Reports recorded errors through an asynchronous error reporting mechanism such as interrupts, e.g. the Fault Handling Interrupt (FHI).
Implements a counter for counting corrected errors.
Logs timestamps in each error record.
Reports uncorrected errors through in-band error signaling (external abort).
Reports critical error conditions via the Critical Error Interrupt (CRI).
Error Record
The RAS System Architecture defines a standard error record. A node captures the complete error information in these error records. The specification defines mechanisms to access error records either as system registers or as memory-mapped registers. A standard error record comprises:
ERR<n>STATUS: characterizes the error and marks valid status fields.
ERR<n>ADDR: error address register.
ERR<n>MISC<m>: miscellaneous error registers, used for:
Identifying the Field Replaceable Unit (FRU).
Locating the error within the FRU.
Implementing a corrected error counter.
Storing the timestamp value for recorded errors.
An error record records the following component error states:
Corrected Error (CE).
Deferred Error (DE).
Uncorrected Error (UE), which has the following sub-types:
Uncontainable error (UC).
Unrecoverable error (UEU).
Recoverable error or Signaled error (UER).
Restartable error or Latent error (UEO).
Error Handling
There are two approaches to achieve error handling in software:
Firmware First Error Handling
Firmware First error handling requires that error events are handled in EL3 and then relayed to the OSPM for logging. On an error, the firmware consumes the error information and generates a standard Common Platform Error Record (CPER) information buffer, which is defined by the UEFI specification to store error information. The CPER is placed in firmware reserved memory that is later shared with the OSPM when it is notified about the error.
On Arm Neoverse Reference Design platforms, Firmware First error handling is achieved using the Hardware Error Source Table (HEST) and Software Delegated Exception Interface (SDEI) tables. The Secure Partition (Standalone MM driver) is used to generate the CPER information for the error. At boot, the HEST table is published and the OSPM is made aware of the hardware error source(s) that the platform supports.
At runtime, when a hardware fault is detected, the corresponding error or fault handling interrupt is generated. This interrupt is taken to the EL3 runtime firmware, which calls into the Secure Partition. The Secure Partition generates the CPER record and places it in firmware reserved memory. The EL3 runtime firmware then uses SDEI to notify the OSPM about the error.
Kernel First Error Handling
Kernel First errors are handled directly by the OSPM without firmware intervention. The fault and error events generated by the platform are taken directly to the OSPM.
Arm Neoverse Reference Design platforms use the Arm Error Source Table (AEST) to achieve Kernel First error handling. The AEST table is defined in the ACPI specification for RAS. It describes the hardware error sources that are present on the platform and comprises one or more error nodes. An AEST node entry identifies the component the node belongs to, e.g. Processor, Memory, SMMU, GIC etc. It defines the interface type for accessing the node, e.g. memory mapped or system register. A node also defines the list of interrupts that the node supports.
The OSPM implements an AEST driver module to traverse the AEST table. The module registers IRQ handlers for all supported node interrupts. A fault event occurring on a node or error source is forwarded directly to the OSPM for handling.
Error Injection
Error injection is a micro-architecture feature defined by RAS for injecting errors into RAS-supported system components. Software can use the error injection registers to inject an error and test the error handling software implemented by the platform.
Arm Neoverse Reference Designs use the Error Injection (EINJ) ACPI table defined in the ACPI specification to implement the error injection feature. EINJ is an action and instruction based table that defines a set of actions and their corresponding instructions. Each action is also assigned a firmware reserved memory space to store action-specific data. An instruction is essentially a read or a write operation performed on that reserved memory.
On Arm Neoverse Reference Platforms, the firmware at EL3 implements the functionality to program the error injection registers. The OSPM initiates the injection and generates an SPI interrupt to call into the firmware. EINJ defines an action to program the GICD register that triggers an SPI interrupt, which is handled in EL3.
Both Firmware First and Kernel First software use the EINJ ACPI table to validate the software functionality.
Note
Error injection is initiated from the kernel in both the firmware-first and the kernel-first case.
Error Injection via Kernel
CPU Error Injection
The Neoverse RD-N2 platform supports two error nodes, and the presence of these nodes enables the RAS extension.
Node 0: Includes the L3 memory system in the DSU.
Node 1: Includes the private L1 and L2 memory systems in the CPU.
RD-V3 only supports one error node.
Node 0: Includes the private L1 and L2 memory systems in the CPU.
The CPU supports SED (Single Error Detect) parity and SECDED (Single Error Correct Double Error Detect) ECC capabilities.
The RD-V3-Cfg1 and RD-N2 platforms also support injecting errors to verify the error handling software.
Note
The Neoverse RD-V3 reference design platforms are based on a direct connect configuration and have no DSU. Hence they only support one error node, i.e. Node 0.
Error Injection Software Sequence
The CPU implements Pseudo Fault Generation (PFG) registers. With the help of these registers, software can inject a CE, DE or UE into the cache RAMs.
Detailed error injection software sequence:
Select error record for L1 and L2 memory systems i.e. Node0
write_errselr_el1 (0)
Program the Error Control Register to enable Error Detection, FHI for CE, DE and UE.
write_erxctlr_el1 (0x109) (Note: To enable ERI on UE write 0x10D)
Program the PFG Control Register to 0.
write_cpu_pfg_ctrl_register (0)
Clear the Error Status Register by writing 1 to its write-one-to-clear status fields.
write_erxstatus_el1 (0xFFC00000)
Set PFG countdown register to 1.
write_cpu_pfg_cdn_register (1)
To inject a Deferred Error, write:
write_cpu_pfg_ctrl_register (0x80000020) [Generates FHI interrupt]
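The constants in the sequence above can be built from individual bits. The following is a minimal Python sketch, not the platform firmware: the bit names (ED, UI, FI, CFI, DE, CDNEN) are assumptions taken from the Arm RAS register layout and chosen so that they reproduce the values 0x109, 0x10D and 0x80000020 shown above. The function names are hypothetical.

```python
# Bit positions assumed from the Arm RAS ERR<n>CTLR / PFG control layout;
# they reproduce the constants used in the sequence above.
ERXCTLR_ED = 1 << 0    # error detection enable
ERXCTLR_UI = 1 << 2    # error recovery interrupt (ERI) on UE
ERXCTLR_FI = 1 << 3    # fault handling interrupt (FHI) enable
ERXCTLR_CFI = 1 << 8   # FHI on corrected errors

PFG_DE = 1 << 5        # inject a deferred error
PFG_CDNEN = 1 << 31    # enable the countdown


def erxctlr_value(enable_eri_on_ue=False):
    """Return the Error Control Register value used in the sequence."""
    value = ERXCTLR_ED | ERXCTLR_FI | ERXCTLR_CFI
    if enable_eri_on_ue:
        value |= ERXCTLR_UI
    return value


def pfg_ctrl_for_deferred_error():
    """Return the PFG control value that injects a deferred error."""
    return PFG_CDNEN | PFG_DE


def injection_sequence(enable_eri_on_ue=False):
    """List the register writes of the sequence, in order, as (name, value)."""
    return [
        ("write_errselr_el1", 0),
        ("write_erxctlr_el1", erxctlr_value(enable_eri_on_ue)),
        ("write_cpu_pfg_ctrl_register", 0),
        ("write_erxstatus_el1", 0xFFC00000),
        ("write_cpu_pfg_cdn_register", 1),
        ("write_cpu_pfg_ctrl_register", pfg_ctrl_for_deferred_error()),
    ]


if __name__ == "__main__":
    for name, value in injection_sequence():
        print("{} ({:#x})".format(name, value))
```

Listing the writes as data keeps the order of the sequence explicit and makes it easy to compare against the firmware log.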
Procedure to Perform Error Injection
Note
This section assumes the user has completed the Getting Started chapter and has a functional working environment.
Error Handling Mode Selection
The CPU supports both Firmware First and Kernel First error handling modes. The default mode is Firmware First.
Important
Only one error handling mode can be enabled at a time.
The error handling mode is a build-time option. To select a mode, navigate to <workspace>, edit the configuration file of the platform of interest, and look for the TF_A_RAS_FW_FIRST flag.
As an example, for the RD-V3-Cfg1 platform:
vim <workspace>/build-scripts/configs/rdv3cfg1/rdv3cfg1
Firmware First Selection:
TF_A_RAS_FW_FIRST = 1
Kernel First Selection:
TF_A_RAS_FW_FIRST = 0
Note
Perform a clean build after switching the error handling mode.
Build and Boot Operating System(s)
Refer to the list of supported operating systems to build the reference design platform software stack and boot into the OS.
Inject Error
After the boot is complete, use the EINJ table debugfs entries to inject the error, based on the error handling scheme selected.
The field sel-firmware-first in oem-einj toggles firmware first error injection; the default is kernel first error injection. The field sel-error-type chooses the type of error to inject; the current implementation only supports deferred errors.
For firmware first error injection:
mount -t debugfs none /sys/kernel/debug # Step needed for Buildroot only
echo 0x80020000 > /sys/kernel/debug/apei/einj/error_type
echo 1 > /sys/kernel/debug/apei/einj/oem-einj/sel-firmware-first
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-component
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-error-type
echo 1 > /sys/kernel/debug/apei/einj/error_inject
On successful error injection, the firmware logs the error information on the console.
Check the secure UART terminal (window with the name FVP terminal_sec_uart) for a log similar to the one below.
SP 8001: ErrAddr = 0x8F840
SP 8001: MmEntryPoint Done
INFO: EINJ event received 83
INFO: cpu_id 2
INFO: Injecting DE...
INFO: ErrStatus = 0x0
INFO: [CPU RAS] CPU intr received = 17 on cpu_id = 2
INFO: [CPU RAS] ERXMISC0_EL1 = 0x0
INFO: [CPU RAS] ERXSTATUS_EL1 = 0x40800000
INFO: [CPU RAS] ERXADDR_EL1 = 0x0 buff_base = 0xf4600000
Check the non-secure UART terminal (window with the name FVP terminal_nsec_uart) for a log similar to the one below.
{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 10
{2}[Hardware Error]: event severity: recoverable
{2}[Hardware Error]: Error 0, type: recoverable
{2}[Hardware Error]: section_type: ARM processor error
{2}[Hardware Error]: MIDR: 0x00000000410fd840
{2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081020000
{2}[Hardware Error]: running state: 0x1
{2}[Hardware Error]: Power State Coordination Interface state: 0
{2}[Hardware Error]: Error info structure 0:
{2}[Hardware Error]: num errors: 1
{2}[Hardware Error]: first error captured
{2}[Hardware Error]: error_type: 0, cache error
{2}[Hardware Error]: error_info: 0x000000000002001f
{2}[Hardware Error]: transaction type: Generic
{2}[Hardware Error]: operation type: Generic error (type cannot be determined)
{2}[Hardware Error]: cache level: 0
{2}[Hardware Error]: processor context not corrupted
{2}[Hardware Error]: the error has not been corrected
{2}[Hardware Error]: physical fault address: 0x0000000000000000
{2}[Hardware Error]: Context info structure 0:
{2}[Hardware Error]: register context type: AArch64 general purpose registers
{2}[Hardware Error]: 00000000: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000040: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000050: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000060: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000070: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000080: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 00000090: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 000000a0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 000000b0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 000000c0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 000000d0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 000000e0: 00000000 00000000 00000000 00000000
{2}[Hardware Error]: 000000f0: 00000000 00000000 00000000 00000000
For kernel first error injection:
mount -t debugfs none /sys/kernel/debug # Step needed for Buildroot only
echo 0x80020000 > /sys/kernel/debug/apei/einj/error_type
echo 0 > /sys/kernel/debug/apei/einj/oem-einj/sel-firmware-first
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-component
echo 2 > /sys/kernel/debug/apei/einj/oem-einj/sel-error-type
echo 1 > /sys/kernel/debug/apei/einj/error_inject
On successful error injection, the kernel receives an error event in its IRQ handler. The handler traverses the error record information and logs the error.
Check the non-secure UART terminal (window with the name FVP terminal_nsec_uart) for a log similar to the one below.
[ 2365.760926] Injecting DE-
[ 2365.760928] ARM RAS: error from CPU7
[ 2365.760930] ERR0STATUS: 0x40800000
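The firmware first and kernel first command sequences differ only in the value written to sel-firmware-first. The following is a minimal Python sketch of the same debugfs writes, assuming debugfs is mounted and the entries shown in the commands exist on the target. The function names are illustrative and do not belong to any existing tool.

```python
import os

# Directory taken from the commands in this section.
EINJ_DIR = "/sys/kernel/debug/apei/einj"


def einj_writes(firmware_first, component=2, error_type=2):
    """Return the (relative path, value) pairs written by the commands.

    The defaults for component and error_type mirror the values used
    in the commands of this section.
    """
    return [
        ("error_type", "0x80020000"),
        ("oem-einj/sel-firmware-first", "1" if firmware_first else "0"),
        ("oem-einj/sel-component", str(component)),
        ("oem-einj/sel-error-type", str(error_type)),
        ("error_inject", "1"),
    ]


def inject(firmware_first, einj_dir=EINJ_DIR):
    """Perform the writes in order; needs root and a mounted debugfs."""
    for rel_path, value in einj_writes(firmware_first):
        with open(os.path.join(einj_dir, rel_path), "w") as entry:
            entry.write(value + "\n")


if __name__ == "__main__":
    # Print the plan instead of touching the system.
    for rel_path, value in einj_writes(firmware_first=True):
        print("echo {} > {}".format(value, os.path.join(EINJ_DIR, rel_path)))
```

Keeping the write list separate from the function that performs the writes allows the plan to be reviewed on a host machine before it is run on the target.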
EDAC (Error Detection and Correction)
The EDAC (Error Detection and Correction) Linux interface provides a framework for reporting memory and CPU errors encountered on a system. It allows the kernel to detect and manage errors, providing valuable information for diagnosing and troubleshooting hardware issues.
EDAC support is currently enabled only for the CPU, on both RD-N2 and RD-V3 platforms. Error counts are exposed through a sysfs interface, which lets the user read the number of Corrected (CE) and Uncorrected (UE) errors that have occurred in the system, aiding in monitoring and diagnosing hardware issues.
Note
This feature is only supported on RD-V3-Cfg1 and RD-N2-Cfg1 platforms, and only when Kernel First error handling is enabled.
cat /sys/devices/system/edac/cpu/cpu*/ue_count
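The per-CPU counters read by the command above can also be collected programmatically. The following is a minimal Python sketch, assuming the sysfs layout shown in the command (one ue_count file per cpu* directory). The function name and the dictionary return shape are illustrative choices.

```python
import glob
import os


def read_ue_counts(edac_cpu_dir="/sys/devices/system/edac/cpu"):
    """Return {cpu directory name: uncorrected error count}.

    Returns an empty dictionary when the EDAC CPU directory is absent,
    for example when EDAC support is not enabled.
    """
    counts = {}
    pattern = os.path.join(edac_cpu_dir, "cpu*", "ue_count")
    for path in sorted(glob.glob(pattern)):
        cpu_name = os.path.basename(os.path.dirname(path))
        with open(path) as counter:
            counts[cpu_name] = int(counter.read().strip())
    return counts


if __name__ == "__main__":
    for cpu_name, count in read_ue_counts().items():
        print("{}: {} uncorrected error(s)".format(cpu_name, count))
```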
Error Injection via SCP Utility
The error injection utility is referred to as einj-util in this document. Einj-util is a command-line utility designed for SCP. It integrates with the SCP CLI Debugger, enabling users to enter commands at runtime. Einj-util injects errors into various RAS-supported components when the user provides an error injection command in the CLI. This utility helps validate the behavior of RAS-capable hardware components when an error is detected and reported.
The term “Component” defines the RAS-supported components for which error injection is supported. “Sub-component” signifies the next level of error categorization for each component, and it varies between components. For instance, in the context of SRAM, sub-components represent error injection in different worlds: Root, Secure, Realm, and Non-Secure. “Type” defines the types of errors supported by each component. The supported error types are Correctable Error (CE), Deferred Error (DE) and Uncorrectable Error (UE).
Procedure to Perform Error Injection into Various Components
Note
This section assumes the user has completed the Getting Started chapter and has a functional working environment.
Build Software Stack
This procedure doesn’t require a full host OS to be present, but the Busybox Boot is still recommended as it is the simplest method to build the required components.
Boot up to SCP CLI Debugger Shell
Once the build step is completed, boot the Busybox stack on the FVP as normal, and identify the window with the name FVP terminal_uart_scp once it shows up, as this is the window to interact with. The steps are as follows:
Launch the FVP and access the SCP UART.
Once in the SCP UART terminal, use Ctrl + e to enter the CLI.
To access the help menu for the einj-util utility, run the command einj-util -h.
The “help” command displays the CLI usage.
> einj-util -h
Inject error into various components.
Usage: einj-util -comp <n> -subcomp <n> -type <n>
-comp: sram (0), tcm (1), cpu (2), rsm (3)
-subcomp:
sram: root (0), secure (1), non-secure (2), realm (3)
tcm: itcm (0), dtcm (1)
cpu: always 0 for now
rsm: secure (0), non-secure (1)
-type:
sram/tcm/rsm: correctable (0), uncorrectable (1)
cpu: correctable (0), uncorrectable (1), deferred (2)
example:
1) ce into shared sram from secure world:
einj-util -comp 0 -subcomp 1 -type 0
2) ce into scp itcm:
einj-util -comp 1 -subcomp 0 -type 0
3) cpu ue:
einj-util -comp 2 -subcomp 0 -type 1
To exit the CLI Debugger, press Ctrl + d.
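The numeric arguments in the help text can be looked up by name. The following is a minimal Python sketch that builds an einj-util command line from the tables printed by einj-util -h. The helper name and the "core" label for the single CPU sub-component are assumptions introduced for illustration.

```python
# Encodings copied from the einj-util help text.
COMPONENTS = {"sram": 0, "tcm": 1, "cpu": 2, "rsm": 3}
SUBCOMPONENTS = {
    "sram": {"root": 0, "secure": 1, "non-secure": 2, "realm": 3},
    "tcm": {"itcm": 0, "dtcm": 1},
    "cpu": {"core": 0},  # help text: "always 0 for now"; label is assumed
    "rsm": {"secure": 0, "non-secure": 1},
}
ERROR_TYPES = {
    "sram": {"ce": 0, "ue": 1},
    "tcm": {"ce": 0, "ue": 1},
    "rsm": {"ce": 0, "ue": 1},
    "cpu": {"ce": 0, "ue": 1, "de": 2},
}


def einj_util_command(component, subcomponent, error_type):
    """Return the einj-util command line for the named selection.

    Raises KeyError for a combination that the help text does not list.
    """
    return "einj-util -comp {} -subcomp {} -type {}".format(
        COMPONENTS[component],
        SUBCOMPONENTS[component][subcomponent],
        ERROR_TYPES[component][error_type],
    )


if __name__ == "__main__":
    print(einj_util_command("sram", "secure", "ce"))
    print(einj_util_command("tcm", "itcm", "ce"))
    print(einj_util_command("cpu", "core", "ue"))
```

The three calls in the usage section correspond to the three examples in the help output.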
Various Error Injection Scenarios
| Component | Subcomponent | Type of Error | Error Status |
|---|---|---|---|
| Shared SRAM | Secure World | CE | 0x86000000 |
| Shared SRAM | Root World | UE | 0xa4000000 |
| RSM SRAM | Secure World | CE | 0x86000000 |
| RSM SRAM | Non-Secure World | UE | 0xa4000000 |
| TCM | ITCM | CE | 0x5 |
| TCM | DTCM | UE | 0x7 |
| CPU | Core | CE | 0xC6000000 |
| CPU | Core | UE | 0x60000000 |
| CPU | Core | DE | 0x40800000 |
CPU Error Injection
Run the following command to inject a CPU correctable error.
> einj-util -comp 2 -subcomp 0 -type 0
The ErrorStatus register captures information about the triggered CPU error.
Injecting CPU CE
ErrStatus 0xC6000000
ErrAddress 0x0
Core Error Injection ERXSTATUS_EL1 Register Description
AV[31:31] : Address Valid
V[30:30] : Status Register Valid
MV[26:26] : Miscellaneous Registers Valid
CE[25:24] : Corrected Error
DE[23:23] : Deferred Error
UET[21:20] : Uncorrected Error Type
SERR[4:0] : Primary Error code
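The status values reported in this document can be decoded against the fields above. The following is a minimal Python sketch with two assumptions: DE is taken as bit 23, which is the bit set in the deferred error status 0x40800000 shown in the logs, and UE is taken as bit 29 following the Arm RAS status register layout, since it is not in the list above. The function name is illustrative.

```python
def decode_erxstatus(value):
    """Split an ERXSTATUS_EL1 value into the fields described above."""
    return {
        "AV": (value >> 31) & 0x1,    # address valid
        "V": (value >> 30) & 0x1,     # status register valid
        "UE": (value >> 29) & 0x1,    # uncorrected error (assumed bit 29)
        "MV": (value >> 26) & 0x1,    # miscellaneous registers valid
        "CE": (value >> 24) & 0x3,    # corrected error, bits [25:24]
        "DE": (value >> 23) & 0x1,    # deferred error (assumed bit 23)
        "UET": (value >> 20) & 0x3,   # uncorrected error type
        "SERR": value & 0x1F,         # primary error code
    }


if __name__ == "__main__":
    for status in (0xC6000000, 0x60000000, 0x40800000):
        fields = decode_erxstatus(status)
        set_fields = {k: v for k, v in fields.items() if v}
        print("{:#010x}: {}".format(status, set_fields))
```

Under these assumptions, the CE status 0xC6000000 decodes to AV, V, MV and a non-zero CE field; the UE status 0x60000000 to V and UE; and the DE status 0x40800000 to V and DE.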
SCP ITCM/DTCM Error Injection
Invoke the following command to inject a correctable error into SCP ITCM.
> einj-util -comp 1 -subcomp 0 -type 0
The error record information will be logged in the following manner.
ITCM
Injecting CE
[TCM_INT] fwk_int number = 21
[TCM_INT] ErrCode = 0x9
[TCM_INT] ErrStatus = 0x5
[TCM_INT] ErrAddr = 0x34d8
TCMECC_ERRSTATUS Bit Descriptions
OF[2:2] : Multiple errors occurred before SW cleared the current error
UE[1:1] : Uncorrectable and uncontainable error has occurred
CE[0:0] : Correctable error has occurred
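The TCM status values in the logs can be read against the three bits above. The following is a minimal Python sketch; the function name is illustrative and the bit positions are those listed above.

```python
# (flag name, bit position) pairs from the TCMECC_ERRSTATUS description.
TCM_FLAGS = (("OF", 2), ("UE", 1), ("CE", 0))


def decode_tcm_errstatus(value):
    """Return the names of the flags that are set in a TCM ErrStatus value."""
    return [name for name, bit in TCM_FLAGS if value & (1 << bit)]


if __name__ == "__main__":
    for status in (0x5, 0x7):
        print("{:#x}: {}".format(status, ", ".join(decode_tcm_errstatus(status))))
```

With this decoding, 0x5 has OF and CE set, and 0x7 has OF, UE and CE set.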
RSM SRAM Error Injection
Invoke the following command to trigger a correctable error in RSM SRAM from the secure world.
> einj-util -comp 3 -subcomp 0 -type 0
The error record information is logged as follows:
Injecting CE into RSM SRAM
[RSM_INT] ErrStatus = 0x86000000
[RSM_INT] fwk_int number = 29
[RSM_INT] ErrAddr = 0x10
Note
Refer to the SRAM ECC Error Status register bit descriptions to decode the error status for RSM SRAM errors.
Expected Output for the Various Scenarios
| Description | Command | Expected Output |
|---|---|---|
| Shared SRAM Secure World CE | einj-util -comp 0 -subcomp 1 -type 0 | Injecting CE into Shared SRAM<br>[SRAM_INT] ErrStatus = 0x86000000<br>[SRAM_INT] fwk_int number = 24<br>[SRAM_INT] ErrAddr = 0x10 |
| Shared SRAM Secure World UE | einj-util -comp 0 -subcomp 1 -type 1 | Injecting UE into Shared SRAM<br>[SRAM_INT] ErrStatus = 0xa4000000<br>[SRAM_INT] fwk_int number = 24<br>[SRAM_INT] ErrAddr = 0x1 |
| Shared SRAM Root CE | einj-util -comp 0 -subcomp 0 -type 0 | Injecting CE into Shared SRAM<br>[SRAM_INT] ErrStatus = 0x86000000<br>[SRAM_INT] fwk_int number = 26<br>[SRAM_INT] ErrAddr = 0x10 |
| Shared SRAM Root UE | einj-util -comp 0 -subcomp 0 -type 1 | Injecting UE into Shared SRAM<br>[SRAM_INT] ErrStatus = 0xa4000000<br>[SRAM_INT] fwk_int number = 26<br>[SRAM_INT] ErrAddr = 0x10 |
| RSM SRAM Secure World CE | einj-util -comp 3 -subcomp 0 -type 0 | Injecting CE into RSM SRAM<br>[RSM_INT] ErrStatus = 0x86000000<br>[RSM_INT] fwk_int number = 29<br>[RSM_INT] ErrAddr = 0x10 |
| RSM SRAM Secure World UE | einj-util -comp 3 -subcomp 0 -type 1 | Injecting UE into RSM SRAM<br>[RSM_INT] ErrStatus = 0xa4000000<br>[RSM_INT] fwk_int number = 29<br>[RSM_INT] ErrAddr = 0x10 |
| RSM SRAM Non-secure World CE | einj-util -comp 3 -subcomp 1 -type 0 | Injecting CE into RSM SRAM<br>[RSM_INT] ErrStatus = 0x86000000<br>[RSM_INT] fwk_int number = 29<br>[RSM_INT] ErrAddr = 0x10 |
| RSM SRAM Non-secure World UE | einj-util -comp 3 -subcomp 1 -type 1 | Injecting UE into RSM SRAM<br>[RSM_INT] ErrStatus = 0xa4000000<br>[RSM_INT] fwk_int number = 29<br>[RSM_INT] ErrAddr = 0x10 |
| TCM ITCM CE | einj-util -comp 1 -subcomp 0 -type 0 | ITCM<br>Injecting CE<br>[TCM_INT] ErrStatus = 0x5<br>[TCM_INT] fwk_int number = 21<br>[TCM_INT] ErrCode = 0x9<br>[TCM_INT] ErrAddr = 0x6b38 |
| TCM ITCM UE | einj-util -comp 1 -subcomp 0 -type 1 | ITCM<br>Injecting UE<br>[TCM_INT] ErrStatus = 0x7<br>[TCM_INT] fwk_int number = 21<br>[TCM_INT] ErrCode = 0x9<br>[TCM_INT] ErrAddr = 0x6a46 |
| TCM DTCM CE | einj-util -comp 1 -subcomp 1 -type 0 | DTCM<br>Injecting CE<br>[TCM_INT] ErrStatus = 0x7<br>[TCM_INT] fwk_int number = 21<br>[TCM_INT] ErrCode = 0xb<br>[TCM_INT] ErrAddr = 0x6b3c |
| TCM DTCM UE | einj-util -comp 1 -subcomp 1 -type 1 | DTCM<br>Injecting UE<br>[TCM_INT] ErrStatus = 0x7<br>[TCM_INT] fwk_int number = 21<br>[TCM_INT] ErrCode = 0xb<br>[TCM_INT] ErrAddr = 0x6a46 |
| CPU Core CE | einj-util -comp 2 -subcomp 0 -type 0 | Injecting CPU CE<br>ErrStatus 0xC6000000<br>ErrAddress 0x0 |
| CPU Core UE | einj-util -comp 2 -subcomp 0 -type 1 | Injecting CPU UE<br>ErrStatus 0x60000000<br>ErrAddress 0x0 |
| CPU Core DE | einj-util -comp 2 -subcomp 0 -type 2 | Injecting CPU DE<br>ErrStatus 0x40800000<br>ErrAddress 0x0 |
Rasdaemon
Overview
Rasdaemon is an error logging tool that is used to log RAS (Reliability, Availability and Serviceability) events. The daemon uses the kernel trace subsystem to capture the error events reported by the kernel modules. The trace events that are captured in /sys/kernel/debug/tracing are reported by rasdaemon.
Enabling rasdaemon creates an “instances/rasdaemon” directory inside the “/sys/kernel/debug/tracing” debugfs directory. All the tracing events that are enabled by rasdaemon are captured in this directory.
Note
This test is only supported on RD-V3-Cfg1 and RD-N2-Cfg1 platforms with Firmware First error handling.
Enabling Rasdaemon
Note
This section assumes the user has completed the chapter Getting Started and has a functional working environment.
The RD-N2-Cfg1 and RD-V3-Cfg1 platforms have the rasdaemon package enabled by default on the buildroot file system. The Buildroot repository includes support for rasdaemon, so any platform performing a buildroot boot can enable the rasdaemon package.
To enable rasdaemon on other platform variants, add the following lines to the buildroot defconfig file.
BR2_PACKAGE_RASDAEMON=y
BR2_GLOBAL_PATCH_DIR="board/aarch64-efi/rdinfra/patches/"
To add rasdaemon support on the RD-V3 platform, add the above two lines to the file configs/rdv3/buildroot/aarch64_rdinfra_defconfig.
Build the software stack for buildroot. Refer to Build the platform software.
Perform a buildroot filesystem boot. Refer to Booting with Buildroot as the filesystem.
On the buildroot shell, type the following commands to enable rasdaemon:
mount -t debugfs none /sys/kernel/debug
rasdaemon -e
This command starts rasdaemon and enables trace events for memory controller, aer, non_standard error records, arm event and arm ras external events.
rasdaemon: ras:mc_event event enabled
rasdaemon: ras:aer_event event enabled
rasdaemon: ras:non_standard_event event enabled
rasdaemon: ras:arm_ras_ext_event event enabled
rasdaemon: ras:arm_event event enabled
Test to validate rasdaemon
Validating the logging of RAS events by rasdaemon requires a platform with RAS support enabled. Here we look at the 1-bit DE reported by the CPU on the RD-V3-Cfg1 platform, which has RAS support enabled. Perform the firmware first error handling test for a 1-bit DE on the CPU. The kernel logs this event and also reports an arm_event for this error to the tracing subsystem. Rasdaemon captures this arm_event trace log and prints it.
Refer to CPU Error Injection to perform the CPU firmware first error handling test on the RD-V3-Cfg1 platform. On error injection, the kernel logs the error and also reports the arm_event. The trace event is also recorded in the rasdaemon buffer. To print the trace from rasdaemon, run the following command.
cat /sys/kernel/debug/tracing/instances/rasdaemon/trace_pipe
The above command outputs a log similar to the following from rasdaemon.
<idle>-0 [004] d.h1. 555.977157: arm_event: affinity level: 255;
MPIDR: 0000000081040000; MIDR: 00000000410fd840; running state: 1; PSCI state: 0
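The fields of an arm_event trace line can be pulled out with a regular expression. The following is a minimal Python sketch, assuming the line format shown in the log above. The function name and the returned keys are illustrative, and the affinity fields are extracted using the standard MPIDR_EL1 bit positions.

```python
import re

# Pattern written against the arm_event line shown in this section.
ARM_EVENT_RE = re.compile(
    r"arm_event: affinity level: (?P<affinity_level>\d+);\s*"
    r"MPIDR: (?P<mpidr>[0-9a-fA-F]+); MIDR: (?P<midr>[0-9a-fA-F]+); "
    r"running state: (?P<running_state>\d+); PSCI state: (?P<psci_state>\d+)"
)


def parse_arm_event(text):
    """Return the decoded fields of an arm_event line, or None if no match."""
    match = ARM_EVENT_RE.search(text)
    if match is None:
        return None
    mpidr = int(match.group("mpidr"), 16)
    return {
        "affinity_level": int(match.group("affinity_level")),
        "mpidr": mpidr,
        "midr": int(match.group("midr"), 16),
        "running_state": int(match.group("running_state")),
        "psci_state": int(match.group("psci_state")),
        "aff0": mpidr & 0xFF,          # MPIDR_EL1 bits [7:0]
        "aff1": (mpidr >> 8) & 0xFF,   # MPIDR_EL1 bits [15:8]
        "aff2": (mpidr >> 16) & 0xFF,  # MPIDR_EL1 bits [23:16]
    }


SAMPLE = (
    "<idle>-0 [004] d.h1. 555.977157: arm_event: affinity level: 255; "
    "MPIDR: 0000000081040000; MIDR: 00000000410fd840; "
    "running state: 1; PSCI state: 0"
)

if __name__ == "__main__":
    print(parse_arm_event(SAMPLE))
```

For the sample line, the Aff2 field of the MPIDR is 4, which matches the CPU index [004] in the trace prefix.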
Other components supporting RAS
CMN Cyprus Kernel First Handling (KFH)
Important
This feature might not be applicable to all platforms. Please check the Supported Features section of the individual platform pages to confirm whether this feature is listed as supported. Also, this feature can be validated only on a pre-silicon validation platform. Current support is limited to RASv1.
CMN Cyprus RAS support
CMN Cyprus implements RAS as a distributed architecture with a set of logging and reporting registers and a central interrupt handling unit. The logging and reporting registers are implemented in the XP, HN-I, HN-F/S, SBSX and CCG device nodes.
Logging registers implemented in the device node are:
Error Feature register (ErrFr)
Error Control register (ErrCtlr)
Error Status register (ErrStatus)
Error Address register (ErrAddr)
Error Misc register 0 (ErrMisc0)
Error Misc register 1 (ErrMisc1)
Each device node implements two sets of these registers: one to log errors that occur in the root address space, and the other to log errors that occur in the non-secure address space. Each device node also implements ErrGsr (Error Group Status Register), which is set when that node has a non-zero ErrStatus register. CMN Cyprus supports the following error types:
Corrected Error (CE)
Deferred Error (DE)
Uncorrected Error Unrecoverable (UEU)
Example: The RD-V3-Cfg1 platform implements a CMN mesh of size 3*3, which has 9 XP, 8 HNS, 1 SBSX, 4 HN-I and 5 CCG device nodes. Each of these nodes implements a set of error records to log the detected RAS errors.
Each device node also implements Pseudo Fault Generation (PFG) registers, which allow software to inject pseudo errors within the device node and validate the software error handling flow. The PFG registers defined for each node are:
Error Pseudo Fault Generation Feature register (ErrPfgf)
Error Pseudo Fault Generation Control register (ErrPfgctl)
Error Pseudo Fault Generation Count Down register (ErrPfgcdn)
Two sets of PFG registers are implemented per device node: one for root world error injection and the other for NS world error injection.
Error/Fault injection in CMN Cyprus
Sequence to be followed to perform SW induced error injection:
Program the Error Control register to enable error detection and enable the FHI interrupt
mmio_write_errctlr ((CMN_BASE + NODE_OFF + ErrCtlr), (BIT3 | BIT0))
Program the PFG count down register to 1, to inject error on first clock tick.
mmio_write_pfgcdn ((CMN_BASE + NODE_OFF + ErrPfgcdn), 1)
Program the PFG control register with following fields:
Type of error, if CE set BIT6, if DE set BIT5, if UEU set BIT2
Set BIT11 to update ErrStatus.AV field on fault injection
Set BIT12 to update ErrStatus.MV field on fault injection
Set BIT31 to enable the injection by reading the PFG count down register
mmio_write_pfgctl ((CMN_BASE + NODE_OFF + ErrPfgctl), (BIT<Error_Type> | BIT11 | BIT12 | BIT31))
Run the same sequence to inject an error into any CMN device node. The NODE_OFF of each node must be known before performing the injection; it can be determined from the CMN discovery process.
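The register values written by this sequence follow directly from the bit positions listed in the steps. The following is a minimal Python sketch that only computes the values; it performs no MMIO access. The constant names, the function name and the dictionary layout are hypothetical.

```python
def bit(position):
    """Return a mask with only the given bit set."""
    return 1 << position


# Bit positions taken from the steps above.
ERRCTLR_VALUE = bit(3) | bit(0)   # FHI enable and error detection enable
PFG_ERROR_BITS = {"CE": bit(6), "DE": bit(5), "UEU": bit(2)}
PFG_AV = bit(11)                  # update ErrStatus.AV on injection
PFG_MV = bit(12)                  # update ErrStatus.MV on injection
PFG_ENABLE = bit(31)              # enable the injection


def cmn_injection_values(error_type):
    """Return the values to program for one injection, keyed by register."""
    return {
        "ErrCtlr": ERRCTLR_VALUE,
        "ErrPfgcdn": 1,
        "ErrPfgctl": PFG_ERROR_BITS[error_type] | PFG_AV | PFG_MV | PFG_ENABLE,
    }


if __name__ == "__main__":
    for error_type in ("CE", "DE", "UEU"):
        values = cmn_injection_values(error_type)
        print(error_type, {name: hex(v) for name, v in values.items()})
```

Passing an error type other than CE, DE or UEU raises KeyError, which mirrors the three error types that CMN Cyprus supports.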
CMN KFH Software
To enable CMN KFH, the following SW components are required:
Arm Error Source Table (AEST) ACPI table to represent CMN errors
SSDT table
AEST device driver for CMN.
SSDT Table
Add one entry in the SSDT table to define the CMN Cyprus device memory _CRS object. Refer to the ACPI for Arm Components specification for details on the various fields.
// CMN 800 device
Device (CMN8) { // CMN-800 device object for an X * Y
Name (_HID, "ARMHC800")
Name (_UID, Zero)
Name (_CRS, ResourceTemplate () {
// Descriptor for 1 GB of the CFG region at offset PERIPHBASE
QWordMemory (
ResourceConsumer,
PosDecode,
MinFixed,
MaxFixed,
NonCacheable,
ReadWrite,
0x00000000, // Granularity
0x100000000, // Min, set to PERIPHBASE
0x13FFFFFFF, // Max
0x000000000, // Translation
0x040000000, // Range Length 1GB
, // ResourceSourceIndex
, // ResourceSource
CFGM // DescriptorName
)
})
} // Device(CMN8)
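The Min, Max and Range Length fields of the descriptor must describe the same window. The following is a small Python check of the values used above, given as a sketch; the constant and function names are illustrative.

```python
# Values copied from the QWordMemory descriptor above.
RANGE_MIN = 0x100000000
RANGE_MAX = 0x13FFFFFFF
RANGE_LENGTH = 0x040000000


def window_is_consistent(minimum, maximum, length):
    """True when [minimum, maximum] spans exactly `length` bytes."""
    return maximum - minimum + 1 == length


if __name__ == "__main__":
    print(window_is_consistent(RANGE_MIN, RANGE_MAX, RANGE_LENGTH))
    print("length in GiB:", RANGE_LENGTH / (1 << 30))
```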
AEST table
Each RAS-capable device node is represented as an AEST node within the AEST table. For example, below is the AEST node entry for HNF0, where 0 represents the logical ID of the HNF. For more information, refer to the ACPI for the RAS and ACPI for Arm Components specifications. These specifications describe all the fields that must be populated to define an AEST node for a given CMN device node.
{
.NodeResource = {
.Vendor = {
{
EFI_ACPI_AEST_NODE_TYPE_VENDOR_DEFINED, /* Type */
sizeof (EFI_ACPI_AEST_NODE_DATA), /* Length */
0, /* Reserved */
sizeof (EFI_ACPI_AEST_NODE_STRUCT), /* Offset to Node data */
sizeof (EFI_ACPI_AEST_NODE_RESOURCE), /* Offset to Node Interface */
(sizeof (EFI_ACPI_AEST_NODE_RESOURCE) + /* Offset to Node Interrupt */
sizeof (EFI_ACPI_AEST_INTERFACE_STRUCT)),
1, /* Interrupt array size */
0, /* Timestamp */
0, /* Reserved1 */
0, /* Injection countdown rate */
},
// Vendor Node Structure
AEST_NODE_TYPE_VENDOR_HID, /* Hardware ID */
1, /* Unique ID */
// Vendor Data
{
0x00, /* Offset HNF0 0x1700000 */
0x00,
0x70,
0x01,
0,
0,
0,
0,
0x00, /* Offset HND 0x0000 */
0x00,
0,
0,
},
},
},
{
EFI_ACPI_AEST_INTERFACE_TYPE_MMIO, /* Interface type */
{0, 0, 0}, /* Reserved */
0, /* Flags */
0, /* Base Address */
0, /* Record Index */
0, /* Num Error records */
0, /* Record implemented */
0, /* Group status reporting */
0, /* Addressing mode */
0, /* ACPI ARM error node device */
0, /* Processor Affinity */
0, /* ErrGsr base address */
},
{
{
EFI_ACPI_AEST_INTERRUPT_TYPE_FAULT_HANDLING, /* Interrupt type */
{0, 0}, /* Reserved */
EFI_ACPI_AEST_INTERRUPT_FLAG_TRIGGER_TYPE_LEVEL, /* Flags */
79, /* GSIV */
0, /* ID */
{0, 0, 0}, /* Reserved */
},
},
},
Note that the HNF0 error node does not define anything in the interface structure. CMN relies entirely on the vendor-defined node data structure to communicate the device node offset and the respective HND node offset.
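The vendor data bytes in the HNF0 entry can be read back as offsets. The following is a minimal Python sketch, assuming the 12 bytes hold a little-endian 64-bit device node offset followed by a little-endian 32-bit HND offset. That layout is inferred from the byte count and the comments in the entry, not taken from a specification, and the function name is illustrative.

```python
import struct

# Vendor data bytes copied from the HNF0 AEST node entry.
VENDOR_DATA = bytes([
    0x00, 0x00, 0x70, 0x01, 0, 0, 0, 0,   # device node offset
    0x00, 0x00, 0, 0,                     # HND node offset
])


def decode_vendor_data(data):
    """Return (device node offset, HND node offset) from the vendor data.

    "<QI" means little-endian, unsigned 64-bit then unsigned 32-bit,
    with no padding between the two fields (assumed layout).
    """
    node_offset, hnd_offset = struct.unpack("<QI", data)
    return node_offset, hnd_offset


if __name__ == "__main__":
    node_offset, hnd_offset = decode_vendor_data(VENDOR_DATA)
    print("HNF0 offset = {:#x}, HND offset = {:#x}".format(node_offset, hnd_offset))
```

Under this assumed layout, the first field decodes to 0x1700000, which matches the comment in the entry.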
AEST CMN driver for CMN
The AEST driver for CMN is implemented as an extension to the AEST ACPI table driver. At boot, the AEST CMN driver reads the SSDT table and its CRS object to determine the CMN base address and size, and creates a virtual mapping of the CMN address space.
Each CMN device error node is represented using the vendor-defined structure in the AEST ACPI table. At boot, the AEST ACPI driver parses the AEST table. When it locates a vendor node, it adds the node data to a platform device structure and registers a platform device. The AEST ACPI driver also registers a platform device driver to process the vendor-defined errors, so for each AEST node of type vendor error that it detects, it registers a platform device and calls into the probe function. Each registered platform device whose vendor HID is set to the CMN HID is registered with the AEST CMN driver.
The AEST CMN driver reads the vendor platform device information into a driver-specific data structure and maintains the device structures in a linked list. Each list entry holds the information for all the error nodes of the same device type. The driver also registers the IRQ handlers to process the FHI interrupt generated when a device node detects a CE, DE or UE. On an error event, the IRQ handler walks through all the device node structures and reads the ErrGsr register of each node. For each node with a non-zero ErrGsr, the handler logs the error records, clears the interrupt and returns. Below is an example log for a DE detected on HNS0 and HNI1.
[ 2.117375] AEST_CMN: RAS v2 enabled = 0
[ 2.118373] AEST_CMN: Error record registers for device node HNS0
[ 2.119858] AEST_CMN: [HNS0] ErrFr_NS = 0x5200008012c9a2
[ 2.121154] AEST_CMN: [HNS0] ErrCtlr_NS = 0x10d
[ 2.122263] AEST_CMN: [HNS0] ErrStatus_NS = 0xc4800000
[ 2.123512] AEST_CMN: [HNS0] ErrAddr_NS = 0x0
[ 2.124573] AEST_CMN: [HNS0] ErrMisc0_NS = 0x0
[ 2.125656] AEST_CMN: [HNS0] ErrMisc1_NS = 0x0
[ 2.140341] AEST_CMN: RAS v2 enabled = 0
[ 2.141305] AEST_CMN: Error record registers for device node HNI1
[ 2.142784] AEST_CMN: [HNI1] ErrFr_NS = 0x120000801201a2
[ 2.144077] AEST_CMN: [HNI1] ErrCtlr_NS = 0x10d
[ 2.145181] AEST_CMN: [HNI1] ErrStatus_NS = 0xc4800000
[ 2.146430] AEST_CMN: [HNI1] ErrAddr_NS = 0x0
[ 2.147491] AEST_CMN: [HNI1] ErrMisc0_NS = 0x0
[ 2.148570] AEST_CMN: [HNI1] ErrMisc1_NS = 0x0
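The scan performed by the IRQ handler can be modelled in a few lines. The following is a simplified Python model of the behaviour described above, not the driver code: the node dictionaries, the function name and the log format are hypothetical, and register reads are replaced by values stored in the dictionaries.

```python
def handle_fhi(nodes, log=print):
    """Log every node whose ErrGsr is non-zero and return their names.

    `nodes` is a list of dictionaries with "name", "ErrGsr" and
    "ErrStatus" keys standing in for the per-node register reads.
    """
    reported = []
    for node in nodes:
        if node["ErrGsr"] != 0:
            log("[{}] ErrStatus_NS = {:#x}".format(node["name"], node["ErrStatus"]))
            reported.append(node["name"])
    return reported


if __name__ == "__main__":
    example_nodes = [
        {"name": "HNS0", "ErrGsr": 1, "ErrStatus": 0xC4800000},
        {"name": "XP0", "ErrGsr": 0, "ErrStatus": 0x0},
        {"name": "HNI1", "ErrGsr": 1, "ErrStatus": 0xC4800000},
    ]
    handle_fhi(example_nodes)
```

Nodes with a zero ErrGsr are skipped, so the handler only reads out the records of nodes that reported an error.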