N2 CPU RAS Test

Overview

The Neoverse N2 core supports two error nodes. These nodes provide the RAS extension support on the Neoverse N2 core.

  • Node 0: Includes the L3 memory system in the DSU.

  • Node 1: Includes the private L1 and L2 memory systems in the core.

The RAMs in the N2 core support SED (Single Error Detect) parity and SECDED (Single Error Correct, Double Error Detect) ECC.

The Neoverse N2 core also supports injecting errors into the error detection logic, so that error handling software can be verified.

Note

The Neoverse N2 reference design platform uses a direct connect configuration and has no DSU. Hence, the Neoverse N2 reference design platform supports only one error node, i.e. Node 1.

1-bit CE error injection on N2 CPU

The Neoverse N2 core implements Pseudo Fault Generation (PFG) registers. Using these registers, software can inject a CE (Corrected Error), DE (Deferred Error) or UE (Uncorrected Error) into the cache RAMs.

The software sequence to inject a 1-bit CE on the N2 CPU is as follows.

  • Select the error record for the L1 and L2 memory systems, i.e. Node 1.
    • write_errselr_el1 (1)

  • Program the Error Control Register to enable error detection and the Fault Handling Interrupt (FHI) for CE, DE and UE.
    • write_erxctlr_el1 (0x109) (Note: to also enable the Error Recovery Interrupt (ERI) on UE, write 0x10D)

  • Program the PFG Control Register to 0.
    • write_cpu_pfg_ctrl_register (0)

  • Clear the Error Status Register by writing ones to its status bits.
    • write_erxstatus_el1 (0xFFC00000)

  • Set PFG countdown register to 1.
    • write_cpu_pfg_cdn_register (1)

  • To inject a Corrected Error, write the PFG Control Register.
    • write_cpu_pfg_ctrl_register (0xC0000040) // Generates FHI interrupt

  • Note: The CE is generated when the CE counter implemented in the ErrMisc register overflows, so clear the PFG Control Register after the overflow to stop the injection.
    • write_cpu_pfg_ctrl_register (0)
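
The sequence above can be sketched as a small model that records the register writes in order. This is an illustration only and does not touch real hardware: MockRegisters, inject_ce and stop_injection are hypothetical names, and the bit positions are assumptions based on the Arm RAS register layouts, so check them against the Neoverse N2 TRM. The model shows how the constants 0x109, 0x10D and 0xC0000040 used in the steps could be composed from individual fields.

```python
# Bit positions are assumptions drawn from the Arm RAS register layouts;
# verify them against the Neoverse N2 TRM before relying on them.
ERXCTLR_ED = 1 << 0    # error detection enable
ERXCTLR_UI = 1 << 2    # ERI on uncorrected errors
ERXCTLR_FI = 1 << 3    # FHI on deferred and uncorrected errors
ERXCTLR_CFI = 1 << 8   # FHI on corrected errors

PFG_CDNEN = 1 << 31    # countdown enable
PFG_R = 1 << 30        # restart the countdown
PFG_CE = 1 << 6        # generate a corrected error


class MockRegisters:
    """Hypothetical stand-in that records register writes in order."""

    def __init__(self):
        self.log = []

    def write(self, name, value):
        self.log.append((name, value))


def inject_ce(regs, enable_eri_on_ue=False):
    """Replay the documented 1-bit CE injection sequence on `regs`."""
    ctlr = ERXCTLR_ED | ERXCTLR_FI | ERXCTLR_CFI      # 0x109
    if enable_eri_on_ue:
        ctlr |= ERXCTLR_UI                            # 0x10D
    regs.write("errselr_el1", 1)                      # select Node 1
    regs.write("erxctlr_el1", ctlr)                   # enable detection and FHI
    regs.write("cpu_pfg_ctrl", 0)                     # disable fault generation
    regs.write("erxstatus_el1", 0xFFC00000)           # clear status bits
    regs.write("cpu_pfg_cdn", 1)                      # countdown value
    regs.write("cpu_pfg_ctrl", PFG_CDNEN | PFG_R | PFG_CE)  # 0xC0000040


def stop_injection(regs):
    """Clear the PFG control register once the CE counter has overflowed."""
    regs.write("cpu_pfg_ctrl", 0)


regs = MockRegisters()
inject_ce(regs)
stop_injection(regs)
for name, value in regs.log:
    print(f"{name:<14} <- {value:#x}")
```

Building the values from named bits makes it easier to see that 0x10D differs from 0x109 only by the assumed UI bit.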

Download the platform software

Skip this section if the required sources have been downloaded.

To obtain the required sources for the platform, follow the steps listed on the Setup Workspace page. Ensure that the platform software is downloaded before proceeding with the steps listed below. Also, note the host machine requirements listed on that page, which are essential for building and executing the platform software stack.

Select the Build option

The N2 CPU supports both Firmware First and Kernel First error handling, but only one of them can be enabled at a time. Firmware First support is enabled by default. To enable Kernel First support, enable the build option ARM_TF_RAS_KERNEL_FIRST and disable ARM_TF_RAS_FW_FIRST; to enable Firmware First support, do the reverse. Navigate to your workspace and edit the platform configuration file as follows:

  • For Firmware First
    • vim configs/rdn2cfg1/rdn2cfg1

    • Set ARM_TF_RAS_FW_FIRST = 1

    • Set ARM_TF_RAS_KERNEL_FIRST = 0

  • For Kernel First
    • vim configs/rdn2cfg1/rdn2cfg1

    • Set ARM_TF_RAS_FW_FIRST = 0

    • Set ARM_TF_RAS_KERNEL_FIRST = 1
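
Because only one of the two options may be enabled, a quick check before building can catch a misconfiguration. The sketch below is a hypothetical helper, not part of the platform software, and it assumes the configuration file holds plain NAME=VALUE lines (optionally with spaces around the equals sign); adjust the pattern if the actual file format differs.

```python
import re

# Assumed line format: NAME=VALUE or NAME = VALUE, one option per line.
OPTION_RE = re.compile(r"\s*(ARM_TF_RAS_(?:FW|KERNEL)_FIRST)\s*=\s*(\d+)")


def ras_mode(config_text):
    """Return the selected RAS handling mode, or raise if the pair is invalid."""
    values = {}
    for line in config_text.splitlines():
        match = OPTION_RE.match(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    fw_first = values.get("ARM_TF_RAS_FW_FIRST", 0)
    kernel_first = values.get("ARM_TF_RAS_KERNEL_FIRST", 0)
    if fw_first == 1 and kernel_first == 0:
        return "firmware-first"
    if fw_first == 0 and kernel_first == 1:
        return "kernel-first"
    raise ValueError("exactly one of the two RAS options must be set to 1")


print(ras_mode("ARM_TF_RAS_FW_FIRST=0\nARM_TF_RAS_KERNEL_FIRST=1\n"))
```

For example, the text of configs/rdn2cfg1/rdn2cfg1 could be read from the workspace and passed to ras_mode before starting the build.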

Procedure to perform 1-bit CE injection and handling on N2 CPU

Boot up to Busybox

Refer to the Busybox Boot page to build the reference design platform software stack and boot into busybox on the Neoverse RD FVP.

N2 CPU error handling test

After the busybox boot is complete, use the commands below to inject a 1-bit CE on the N2 CPU. The EINJ table debugfs entries are used to inject the error. The error_type field is set to 1, indicating a processor correctable error.

echo 1 > /sys/kernel/debug/apei/einj/error_type
echo 1 > /sys/kernel/debug/apei/einj/error_inject
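
Where a Python interpreter is available on the target, the same two writes could be scripted as sketched below. inject_einj_error is a hypothetical helper that mirrors the echo commands; the einj_dir parameter is an addition made only so that the function can be exercised against a scratch directory.

```python
import os

EINJ_DIR = "/sys/kernel/debug/apei/einj"


def inject_einj_error(error_type=1, einj_dir=EINJ_DIR):
    """Write error_type, then trigger the injection via error_inject."""
    # Order matters: the error type must be selected before the trigger.
    for name, value in (("error_type", error_type), ("error_inject", 1)):
        with open(os.path.join(einj_dir, name), "w") as node:
            node.write(f"{value}\n")
```

Calling inject_einj_error() with the defaults performs the same writes as the two echo commands and needs the same privileges and a mounted debugfs.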

Firmware First Error Handling

On successful error injection, the firmware reports the error to the kernel using the standard error record format (CPER) for processor errors. On receiving this error information, the kernel logs it on the console.

{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 10
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
{1}[Hardware Error]: event severity: corrected
{1}[Hardware Error]:  Error 0, type: corrected
{1}[Hardware Error]:   section_type: ARM processor error
{1}[Hardware Error]:   MIDR: 0x00000000410fd490
{1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081000000
{1}[Hardware Error]:   running state: 0x1
{1}[Hardware Error]:   Power State Coordination Interface state: 0
{1}[Hardware Error]:   Error info structure 0:
{1}[Hardware Error]:   num errors: 135
{1}[Hardware Error]:    overflow occurred, error info is incomplete
{1}[Hardware Error]:    error_type: 1, TLB error
{1}[Hardware Error]:    error_info: 0x000000000402001f
{1}[Hardware Error]:     transaction type: Generic
{1}[Hardware Error]:     operation type: Generic error (type cannot be determined)
{1}[Hardware Error]:     TLB level: 0
{1}[Hardware Error]:     processor context not corrupted
{1}[Hardware Error]:     the error has been corrected
{1}[Hardware Error]:    physical fault address: 0x0000000000000000
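
As a cross-check of the log, the MIDR value can be split into its architectural fields. The sketch below uses the standard MIDR_EL1 field layout; decode_midr is a hypothetical helper name.

```python
def decode_midr(midr):
    """Split a MIDR_EL1 value into its architectural fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,   # bits [31:24]
        "variant": (midr >> 20) & 0xF,        # bits [23:20]
        "architecture": (midr >> 16) & 0xF,   # bits [19:16]
        "part_number": (midr >> 4) & 0xFFF,   # bits [15:4]
        "revision": midr & 0xF,               # bits [3:0]
    }


fields = decode_midr(0x410FD490)  # value taken from the log above
print({name: hex(value) for name, value in fields.items()})
```

For the logged value, the implementer field is 0x41 (Arm) and the part number is 0xd49, which identifies a Neoverse N2 core.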

Kernel First Error Handling

On successful error injection, the kernel receives an error event in its IRQ handler. The handler traverses the error record information and logs the error. The logs from the Kernel First error handling test are shown below.

ERR1STATUS: 0x4e000012
ERR1MISC0: 0xe000000000
ERR1MISC1: 0x0
ERR1MISC2: 0x0
ERR1MISC3: 0x0
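
The ERR1STATUS value can be decoded into individual fields to confirm what was recorded. The sketch below assumes the generic Arm RAS ERR<n>STATUS layout (AV, V, UE, ER, OF, MV, CE, DE, PN in bits [31:22] and SERR in the low byte); decode_errstatus is a hypothetical helper, and the field positions should be checked against the Neoverse N2 TRM.

```python
# Assumed single-bit fields of ERR<n>STATUS and their bit positions.
STATUS_FLAGS = {"AV": 31, "V": 30, "UE": 29, "ER": 28,
                "OF": 27, "MV": 26, "DE": 23, "PN": 22}


def decode_errstatus(status):
    """Decode an ERR<n>STATUS value under the assumed field layout."""
    fields = {name: (status >> bit) & 1 for name, bit in STATUS_FLAGS.items()}
    fields["CE"] = (status >> 24) & 0x3   # corrected error field, bits [25:24]
    fields["SERR"] = status & 0xFF        # primary error code, low byte
    return fields


decoded = decode_errstatus(0x4E000012)  # value taken from the log above
for name, value in decoded.items():
    print(f"{name:<4} = {value:#x}")
```

Under this layout, the logged value has V, OF and MV set and a non-zero CE field, which is consistent with a corrected error being recorded after the CE counter overflowed.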

Copyright (c) 2022-2023, Arm Limited. All rights reserved.