Memory System Resource Partitioning and Monitoring (MPAM)

Important

This feature might not be applicable to all platforms. Please check the individual platform pages, under the Supported Features section, to confirm whether this feature is listed as supported.

MPAM-resctrl - A quick glance

MPAM stands for Memory System Resource Partitioning and Monitoring. As the name suggests, it deals with two things: partitioning and monitoring. MPAM's resource partitioning logic deals with partitioning resources such as shared CPU caches, interconnect caches, memory bandwidth, interconnect bandwidth, and so on. In MPAM terminology, such resources are classified as memory-system components (MSCs). How an MSC gets partitioned varies with the type of MSC; for instance, partitioning a cache can be very different from partitioning memory bandwidth. MPAM's resource monitoring logic deals with monitoring each MSC. A monitor can measure resource usage or capacity usage, depending on the resource. For instance, a cache can have cache-storage monitors that measure the usage of the cache. Reading a monitor can help in tuning the memory-system partitioning controls. For detailed information on MPAM, refer to the MPAM specification.

resctrl is a Linux kernel feature by which Arm's MPAM and Intel's RDT can be configured and controlled. resctrl exposes MPAM capabilities and configuration options via a file-system interface. On the latest kernel source tree, users would find resctrl adapted for x86 RDT. The file and folder names reflect RDT's feature sets rather than a generic resource-partitioning naming scheme or MPAM's feature names. In short, for the Arm64 architecture, resctrl is how user space configures MPAM. The steps by which MPAM can be configured via resctrl are described in the subsequent sections.

Exploring resctrl file-system

MPAM-resctrl is enabled by default on the platform (from here on, the platform/platform under test/platform under consideration is abbreviated as PuT). This documentation advises users to follow the Busybox build to enable MPAM-resctrl capabilities for the PuT. Once the necessary sources have been fetched, check out the RD-INFRA-2024.04.17-MPAM tag for the linux repository before proceeding with the build. Build and boot the system to the command prompt, then run the following command to mount the resctrl file-system. Note that MPAM's performance aspects cannot be tested on an FVP; only the register configurations can be exercised on it.

# mount -t resctrl resctrl /sys/fs/resctrl

It is useful to refer to the resctrl documentation in parallel, as many of the concepts discussed further along are presented there with better clarity. However, as mentioned in the beginning, be aware that the documentation currently covers the resctrl file-system as utilized by Intel's RDT.

Once the resctrl file-system has been mounted, change directory to /sys/fs/resctrl and list the files.

# cd /sys/fs/resctrl
/sys/fs/resctrl# ls

cpus        info     mon_data      schemata    tasks
cpus_list   mode     mon_groups    size

These are the files and folders through which MPAM's MSCs for the PuT are accessed and configured. Before proceeding further, it is important to understand more about MPAM's PARTID. A PARTID can be considered an ID or label associated with the MPAM configuration for a single software environment or a collection of software environments. Quoting the MPAM specification: “An MPAM resource control uses the PARTID that is set for one or more software environments. A PARTID for the current software environment labels each memory system request. Each MPAM resource control has control settings for each PARTID. The PARTID in a request selects the control settings for that PARTID, which are then used to control the partitioning of the performance resources of that memory-system component”. In short, each set of MPAM configurations is associated with a PARTID. The required configuration is selected/modified by programming the associated PARTID into the MPAMCFG_PART_SEL register present in the MSC's memory-mapped interface.

The MPAM driver is designed such that the default configuration uses a single PARTID (PARTID 0) with the default maximum partition configuration for the MSCs. This is done in the early stages of Linux kernel boot and will be covered in greater detail in the sections to come.

resctrl is organized such that each PARTID has a separate copy of all these files and folders. At this point, there is just one set of these files/folders, as shown above; the more PARTIDs there are, the more copies of this set of files and folders exist. To understand what these files/folders denote, the user can try the following.

/sys/fs/resctrl # cat cpus
ffff

/sys/fs/resctrl # cat cpus_list
0-15

The file named cpus lists the CPUs that have access to the MSCs under consideration for a given PARTID. The output is in bitmap format. For the PuT, it shows 0xffff, indicating the presence of 16 CPUs. Reading the cpus_list file shows the same information in a different style (CPUs listed as 0-15).

/sys/fs/resctrl # cat schemata
L3:49=ffff

schemata is one of the most important files exposed by resctrl. It shows the MPAM resource, its ID, and the partition for this particular PARTID. From the above output, it is clear that the MSC to be partitioned is an L3 cache with cache ID 49. The default cache portion bitmap assigned to this PARTID is 0xffff, which corresponds to the entire cache.

As discussed earlier in the MPAM-resctrl - A quick glance section, an MSC is partitioned in accordance with its type. When it comes to caches, two partitioning schemes can be used: cache portion partitioning and cache capacity partitioning. For cache portion partitioning, a cache is divided into an equal number of portions, each represented by a bit in a bitmap. A '1' indicates that the corresponding portion is allowed, and a '0' indicates that it is not. 0xffff represents the cache portion bitmap with all portions enabled. Since cache capacity partitioning is not exercised here, it is not discussed in this documentation; please refer to the MPAM specification to get a better idea about these partitioning schemes. A worked example of what a cache portion bitmap translates to in terms of capacity is given below.
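
As an illustration, assuming the 16-bit portion bitmap shown above and the 8 MB SLC described by the PPTT entry later in this document, each set bit corresponds to roughly 1/16 of the cache. Writing the following (hypothetical) value to a group's schemata would therefore allow allocation into only the lower eight portions, roughly half of the SLC (about 4 MB).

# echo "L3:49=00ff" > schemata
# cat schemata
L3:49=00ff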

Neoverse reference design platforms as of now don't have an L3 cache. Instead, the system level cache (SLC) on the interconnect acts as the shared cache for all DSU clusters. The SLC for the PuT has been added to the PPTT table. The cache topology parsing logic within the OS walks through all the available caches and associates each cache with a level; the SLC for the PuT is mapped as an L3 cache. For more details, refer to the PPTT and MPAM ACPI tables present in the source code.

/sys/fs/resctrl # cat tasks
1
2
3
4
~

Reading the tasks file gives an idea of the tasks that use this PARTID. Writing a task ID to the file adds that task to the group. Since this is the default configuration, the user should be able to find all the tasks in this file. An example where the tasks file gets modified is looked at in the latter part of this section.

/sys/fs/resctrl # cat mode
shareable

The mode of the resource group dictates the sharing of its allocations. A “shareable” resource group allows sharing of its allocations, while an “exclusive” resource group does not. The mode can be changed by writing to this file, as sketched below.
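
A hedged sketch of changing a group's mode (the group name is hypothetical; the write is rejected if the group's allocation overlaps with other groups, so whether it succeeds depends on the schemata configured at that point):

/sys/fs/resctrl/some_group # echo exclusive > mode
/sys/fs/resctrl/some_group # cat mode
exclusive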

/sys/fs/resctrl # cd info
/sys/fs/resctrl/info # ls
L3    L3_MON    last_cmd_status

The info directory contains information about the enabled resources, and each resource has its own sub-directory whose name reflects the resource's name. Since the SLC has been modeled as an L3 MPAM node, an L3 directory should be present. If the resource supports monitoring capabilities, a folder with the name <MSC>_MON should also exist; L3_MON in this case is the directory containing information about L3's monitoring capabilities.

/sys/fs/resctrl/info # cd L3
/sys/fs/resctrl/info/L3 # ls

bit_usage    min_cbm_bits    shareable_bits
cbm_mask     num_closids

The L3 sub-directory contains the files shown above. Enter the following commands to understand what each of these files denotes.

/sys/fs/resctrl/info/L3 # cat cbm_mask
ffff

cbm_mask shows the cache portion bitmask corresponding to a 100% allocation of the MSC. This value is in line with the cache portion bitmap given in schemata.

/sys/fs/resctrl/info/L3 # cat bit_usage
49=XXXXXXXXXXXXXXXX

bit_usage gives details about how each instance of the MSC gets used. Since schemata describes the cache portion bitmap for L3, bit_usage shows the status of each of these portions. Each portion, represented by a bit, can be one of the following types.

0: Corresponding region is unused. When the system’s resources have been allocated and a “0” is found in “bit_usage” it is a sign that resources are wasted.

H: Corresponding region is used by hardware only but available for software use. If a resource has bits set in “shareable_bits” but not all of these bits appear in the resource groups’ schematas then the bits appearing in “shareable_bits” but no resource group will be marked as “H”.

X: Corresponding region is available for sharing and used by hardware and software. These are the bits that appear in “shareable_bits” as well as a resource group’s allocation.

S: Corresponding region is used by software and available for sharing.

E: Corresponding region is used exclusively by one resource group. No sharing allowed.

P: Corresponding region is pseudo-locked. No sharing is allowed.

From the value read out, all 16 portions of the cache portion bitmap are of type 'X', that is, available for sharing and in use by both hardware and software.

/sys/fs/resctrl/info/L3 # cat min_cbm_bits
1

min_cbm_bits denotes the minimum number of consecutive bits which must be set when writing a mask. Setting anything lower than what min_cbm_bits suggests would lead to an error.

/sys/fs/resctrl/info/L3 # cat shareable_bits
ffff

shareable_bits is again a bitmask of all the shareable bits in the cache portion bitmask. For the PuT, it is 0xffff.

/sys/fs/resctrl/info/L3 # cat num_closids
32

num_closids denotes the number of CLOSIDs. CLOSID is again Intel's terminology and expands to “class of service ID”; it is the equivalent of a PARTID under MPAM. Therefore, num_closids tells us the number of valid PARTIDs the MSC supports.

/sys/fs/resctrl/info # cat last_cmd_status
ok

At the top level of the info directory, there is a file named last_cmd_status. This is reset with every “command” issued via the file-system (making new directories or writing to any of the control files). If the command was successful, it will read as “ok”. If the command fails, it will provide more information about the error generated during the operation. A simple example is shown below.

/sys/fs/resctrl # echo L3:49=0000 > schemata
sh: write error: Invalid argument

/sys/fs/resctrl # cat info/last_cmd_status
Mask out of range

As discussed earlier, min_cbm_bits indicates that at least one bit must be set in the cache portion bitmap programmed into the configuration register. If a mask with fewer set bits than min_cbm_bits is written (in this case, none), the resctrl file-system throws an error.

Configuring MPAM via resctrl file-system

The file-system interface for the default PARTID has been looked at in the last section. Real MPAM use-cases have multiple partition spaces (PARTIDs) with different MSC partitions. With resctrl, adding a new partition space (PARTID) is simple: create a new folder with any name in the root resctrl directory (users are advised to give a name that reflects the use-case so that maintenance becomes easier).

/sys/fs/resctrl # mkdir partid_space_2
/sys/fs/resctrl # ls

cpus                   mode               partid_space_2     tasks
cpus_list              mon_data           schemata
info                   mon_groups         size
/sys/fs/resctrl # cd partid_space_2/
/sys/fs/resctrl/partid_space_2 # ls

cpus                   mode               mon_groups         size
cpus_list              mon_data           schemata           tasks

Once a new folder named partid_space_2 is created, the MPAM driver internally allocates a new PARTID and associates it with this new resctrl directory. The user can then modify the configurations via the resctrl file-system. resctrl communicates with the MPAM driver, which in turn programs the required configuration registers of the MSC under consideration for the new PARTID. To define the schemata for this new PARTID, do the following.

/sys/fs/resctrl/partid_space_2 # cat schemata
L3:49=ffff

/sys/fs/resctrl/partid_space_2 # echo "L3:49=3ff" > schemata
/sys/fs/resctrl/partid_space_2 # cat schemata
L3:49=03ff

As shown above, defining a schemata requires a file write to the schemata file under the new PARTID's directory. Whenever a new folder is added under the resctrl root directory, its schemata always reflects the default maximum for the resource under consideration - in this case, the L3 cache with 0xffff. The value to be written has to follow the format by which schemata describes the MSC and its partitions; in this case, the new value should be of the form L3:<cache ID>=<cache portion bitmap>. Changing the schemata of the default PARTID space is also valid, and users could try changing the value of the default schemata as an experiment, as sketched below.
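
A hedged sketch of that experiment on the default PARTID space (the bitmap values are illustrative; keep at least min_cbm_bits bits set, and restore the original value afterwards):

/sys/fs/resctrl # echo "L3:49=0fff" > schemata
/sys/fs/resctrl # cat schemata
L3:49=0fff
/sys/fs/resctrl # echo "L3:49=ffff" > schemata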

Now that the new schemata values have been updated, the next step is to update the tasks file with the tasks that need to use this new partitioning scheme. Select one task at random from the output of ps -A.

/sys/fs/resctrl/partid_space_2 # cat tasks
/sys/fs/resctrl/partid_space_2 #
/sys/fs/resctrl/partid_space_2 # ps -A

PID   USER     TIME  COMMAND
1 0         0:00 sh
2 0         0:00 [kthreadd]
3 0         0:00 [rcu_gp]
~
23 0         0:00 [kworker/2:0H-ev]
24 0         0:00 [cpuhp/3]
25 0         0:00 [migration/3]

For this demonstration, task 23 has been selected to be added to the new PARTID/partition space. Before assigning the task, take a look at the tasks file under the default PARTID to make sure that the task is currently assigned to it. As discussed in the beginning, with just the default PARTID, all tasks should be part of the default PARTID's tasks file.

/sys/fs/resctrl/partid_space_2 # cd ../
/sys/fs/resctrl # cat tasks

1
2
3
4
~
23
24
~

Proceed to add the task to the tasks file under partid_space_2.

/sys/fs/resctrl # cd partid_space_2
/sys/fs/resctrl/partid_space_2 # echo 23 > tasks
/sys/fs/resctrl/partid_space_2 # cat tasks

23

A task can exist under only one configuration at any time. This means that the task is no longer present in the default PARTID's tasks file.

/sys/fs/resctrl/partid_space_2 # cd ../
/sys/fs/resctrl # cat tasks

1
2
3
4
~
24
~

Additional tasks can be added to the tasks file in the same manner by which the first task was added.

/sys/fs/resctrl # cd partid_space_2
/sys/fs/resctrl/partid_space_2 # echo 24 > tasks
/sys/fs/resctrl/partid_space_2 # cat tasks

23
24
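
Besides tasks, a resource group can also be bound to CPUs by writing a CPU bitmask to its cpus file. A hedged sketch (the mask value is illustrative): the CPUs written here are removed from the default group's mask, and tasks running on them that are not listed in any other group's tasks file use this group's PARTID.

/sys/fs/resctrl/partid_space_2 # echo 000f > cpus
/sys/fs/resctrl/partid_space_2 # cat cpus_list
0-3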

Multiple PARTIDs, up to the num_closids limit, can be added in the same fashion. Repeat the steps to configure the schemata and tasks, as shown above, for any new PARTID directory created.

/sys/fs/resctrl # cd ../
/sys/fs/resctrl # mkdir partid_space_3
/sys/fs/resctrl # ls

cpus                   mode               partid_space_2     size
cpus_list              mon_data           partid_space_3     tasks
info                   mon_groups         schemata
/sys/fs/resctrl # cd partid_space_3/
/sys/fs/resctrl/partid_space_3 # ls

cpus                   mode               mon_groups         size
cpus_list              mon_data           schemata           tasks
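
When a partition space is no longer needed, its directory can simply be removed. A hedged sketch of the expected behaviour, based on generic resctrl semantics rather than a capture from the PuT: rmdir frees the group's PARTID and moves the tasks that were assigned to it back to the default group.

/sys/fs/resctrl # rmdir partid_space_3
/sys/fs/resctrl # rmdir partid_space_2
/sys/fs/resctrl # cat tasks

1
2
~
23
24
~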

A closer look at MPAM software

Enabling MPAM on the PuT involves enabling MPAM EL1/EL2 register access from EL3 (trusted firmware), building the kernel drivers, and having proper ACPI tables to populate platform-specific MPAM data. The PPTT entry describing the SLC (modeled as an L3 cache) for the PuT is shown below.

EFI_ACPI_6_4_PPTT_STRUCTURE_CACHE_INIT (
  PPTT_CACHE_STRUCTURE_FLAGS,           /* Flag */
  0,                                    /* Next level of cache */
  SIZE_8MB,                             /* Size */
  8192,                                 /* Num of sets */
  16,                                   /* Associativity */
  PPTT_UNIFIED_CACHE_ATTR,              /* Attributes */
  64,                                   /* Line size */
  RD_PPTT_CACHE_ID(0, -1, -1, L3Cache)  /* Cache id */
)

For processor-side caches, MPAM references the cache/MSC of interest via its cache ID. The way an MSC gets referenced in the MPAM table changes from MSC to MSC; please refer to the MPAM ACPI specification for a detailed understanding of how MPAM tables are described. For a complete view of the PPTT table implemented on the PuT, please refer to Platform/ARM/SgiPkg/AcpiTables/<PuT>/Pptt.aslc under the uefi/edk2/edk2-platforms repository in the source files. The corresponding MPAM ACPI table entries, based on MPAM ACPI v2.0, are shown below.

/* MPAM_MSC_NODE 1 */
{
  RD_MPAM_MSC_NODE_INIT(0x1, RDN2CFG1_BASE_ADDRESS(0x141601000, 0),
     RDN2CFG1_MPAM_MMIO_SIZE, 0, RDN2CFG1_MPAM_MSC_COUNT,
     RDN2CFG1_RESOURCES_PER_MSC,
     RDN2CFG1_FUNCTIONAL_DEPENDENCY_PER_RESOURCE)
},

/* MPAM_MSC_NODE 2 */
{
  RD_MPAM_MSC_NODE_INIT(0x2, RDN2CFG1_BASE_ADDRESS(0x141641000, 0),
     RDN2CFG1_MPAM_MMIO_SIZE, 0, RDN2CFG1_MPAM_MSC_COUNT,
     RDN2CFG1_RESOURCES_PER_MSC,
     RDN2CFG1_FUNCTIONAL_DEPENDENCY_PER_RESOURCE)
},

The number of SLC cache slices can vary on each platform. Each cache slice is configured as an MSC. Unique indices should be used for each SLC slice, as the OS uses the index as one of the criteria to differentiate between MSC nodes. For a complete view of the MPAM ACPI table, please refer to the Platform/ARM/SgiPkg/AcpiTables/<PuT>/Mpam.aslc file under the uefi/edk2/edk2-platforms repository in the source files.

On the Linux side, MPAM software can be categorized into the MPAM ACPI driver, the MPAM platform driver, MPAM platform devices, the MPAM layer for resctrl, architecture support for MPAM, and so on. This is not the complete list, but it covers most of the major software layers MPAM touches.

Quite early into the Linux boot, __init_el2_mpam (arch/arm64/include/asm/el2_setup.h) is invoked from within head.S. __init_el2_mpam takes care of detecting MPAM, doing the basic MPAM system register setup and disabling MPAM traps to EL2.

.macro __init_el2_mpam
#ifdef CONFIG_ARM64_MPAM
    /* Memory Partitioning And Monitoring: disable EL2 traps */
    mrs     x1, id_aa64pfr0_el1
    ubfx    x0, x1, #ID_AA64PFR0_MPAM_SHIFT, #4
    cbz     x0, 1f                          // skip if no MPAM
    msr_s   SYS_MPAM0_EL1, xzr              // use the default partition..
    msr_s   SYS_MPAM2_EL2, xzr              // ..and disable lower traps
    msr_s   SYS_MPAM1_EL1, xzr
    mrs_s   x0, SYS_MPAMIDR_EL1
    tbz     x0, #17, 1f                     // skip if no MPAMHCR reg
    msr_s   SYS_MPAMHCR_EL2, xzr            // clear TRAP_MPAMIDR_EL1 -> EL2
    1:
#endif /* CONFIG_ARM64_MPAM */
.endm

As the kernel proceeds to boot, the MPAM platform driver initialization routine gets invoked (mpam_msc_driver_init). The total count of MPAM MSCs is queried from the MPAM ACPI table; this is also the first time the MPAM ACPI table gets queried since kernel boot up. The platform driver gets initialized only if a valid MPAM ACPI table with at least one MSC is defined. Once the platform driver is initialized, MPAM driver probing kicks off (mpam_msc_drv_probe). It is at this point that the MPAM ACPI table is completely parsed and the appropriate platform device data structures are populated. Each of the populated MSCs gets registered as an individual platform device. Once all the platform devices are probed, temporary CPU hotplug callbacks (mpam_discovery_cpu_online) are installed. For example, if the system has 128 MSCs, the callbacks would only get registered after the 128th platform device gets registered. The callbacks installed at this point are for discovering hardware details about the MSCs (known as hardware probing in MPAM driver terminology) and are replaced at a later point; this is the reason why they are described as temporary callbacks. More information on CPU hotplugging and the supported API sets can be found at CPU hot plugging on Linux. Please refer to drivers/platform/mpam/mpam_devices.c under the Linux kernel repository to see the detailed implementation of the routines discussed here.

Soon after the CPU hotplug callbacks are installed, the corresponding setup (mpam_discovery_cpu_online) callback gets called by each of the CPUs. If the PuT has 16 CPUs, the setup function is called 16 times with CPU IDs ranging from 0-15. At this stage, the setup callback proceeds with MSC hardware discovery. This includes discovering details such as the features supported, the maximum PARTID, the maximum PMG, and so on. To understand all the features a particular MSC could support, please refer to the MPAM specification, chapter 9. Once the supported features are discovered and the maximum PARTID and PMG values are established, a default configuration is programmed into the configuration registers (MPAMCFG_*) for each of these features, for all PARTIDs from 0 up to the maximum value. The setup function is defined in such a way that the first CPU to come online discovers the features of all the registered MSCs and programs the appropriate configurations for them; the remaining setup calls on the other CPUs skip hardware discovery. A small snippet of what happens in the setup function (mpam_discovery_cpu_online) is shown below.

/* For all MSCs, if the current CPU has access to the MSC and HW discovery
 * is yet to be carried out for the MSC under consideration, proceed with
 * the discovery.
 */

list_for_each_entry(msc, &mpam_all_msc, glbl_list) {
    if (!cpumask_test_cpu(cpu, &msc->accessibility))
        continue;

    spin_lock(&msc->lock);
    if (!msc->probed)
      err = mpam_msc_hw_probe(msc);
    spin_unlock(&msc->lock);

    if (!err)
      new_device_probed = true;

The logic to program any configuration register (MPAMCFG_*) is described in the MPAM specification, section 11.1.2.

After the first CPU to come up completes hardware probing and feature configuration, the kernel is free to enable MPAM. This is done with the help of workqueues.

static DECLARE_WORK(mpam_enable_work, &mpam_enable);

~

if (new_device_probed && !err)
    schedule_work(&mpam_enable_work);

The code scheduled on the workqueue shown above gets executed soon after the probing. This is where the MPAM resctrl configurations are set up. resctrl has a dependency on cacheinfo, and hence the workqueue task that is responsible for setting up resctrl stays in a wait state until cacheinfo is up and ready. cacheinfo deals with populating cache nodes from the PPTT and exporting them to /sys/devices/system/cpu/cpu*/cache/index* for user space to access. MPAM's resctrl layer internally queries the MSC cache node's size from cacheinfo and thus has to wait until proper data is available; an example of the exported data is shown after the snippets below.

wait_event(wait_cacheinfo_ready, cacheinfo_ready);

~

static int __init __cacheinfo_ready(void)
{
    cacheinfo_ready = true;
    wake_up(&wait_cacheinfo_ready);

    return 0;
}
device_initcall_sync(__cacheinfo_ready);
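
Once cacheinfo is ready, the cache nodes populated from the PPTT can be inspected from user space. A hedged example is shown below; the index number and the reported values are illustrative and depend on the PuT's cache topology, but the 8 MB size matches the SLC entry in the PPTT shown earlier.

# cat /sys/devices/system/cpu/cpu0/cache/index3/level
3
# cat /sys/devices/system/cpu/cpu0/cache/index3/size
8192K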

A teardown callback (mpam_cpu_offline) is also part of the hotplug callbacks installed earlier; it gets called when CPUs go offline. Atomic reference counters are added within the data structures that manage each MSC. In the case of a hotplug shutdown on the PuT, the MPAM driver does not reset any registers or initiate cleanup until the last CPU goes offline.

list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) {
    if (!cpumask_test_cpu(cpu, &msc->accessibility))
        continue;

    spin_lock(&msc->lock);
    if (msc->reenable_error_ppi)
        disable_percpu_irq(msc->reenable_error_ppi);

    if (atomic_dec_and_test(&msc->online_refs))
        mpam_reset_msc(msc, false);
    spin_unlock(&msc->lock);
}

Once cacheinfo is set up, MPAM's resctrl setup proceeds. With the completion of the resctrl setup, MPAM is ready to be enabled, and a new set of hotplug callbacks is installed, replacing the old ones. The maximum PARTID and PMG that the system can support have been established at this point and cannot be changed after the new callbacks are installed.

/*
 * Once the cpuhp callbacks have been changed, mpam_partid_max can no
 * longer change.
 */
 spin_lock(&partid_max_lock);
 partid_max_published = true;
 spin_unlock(&partid_max_lock);

 static_branch_enable(&mpam_enabled);
 mpam_register_cpuhp_callbacks(mpam_cpu_online);

As discussed earlier, the new setup function deals with marking CPUs online and reprogramming the MSCs in case all CPUs went down. Mirroring the teardown function, the first CPU to come up re-programs the feature registers for each PARTID, and the same atomic reference counter used in the teardown function is used here for this purpose.

rcu_read_lock();
list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) {
    if (!cpumask_test_cpu(cpu, &msc->accessibility))
        continue;

    spin_lock(&msc->lock);
    if (msc->reenable_error_ppi)
        _enable_percpu_irq(&msc->reenable_error_ppi);

    if (atomic_fetch_inc(&msc->online_refs) == 0)
        mpam_reprogram_msc(msc);
    spin_unlock(&msc->lock);
}
rcu_read_unlock();

if (mpam_is_enabled())
    mpam_resctrl_online_cpu(cpu);

Once the system boots up and resctrl is mounted, PARTID 0 with the default maximum cache portion bitmap comes into use. Whenever a new directory is added, the MPAM driver selects the new PARTID as the first free PARTID in the range from 0 to the maximum. More information about the PARTID allocator can be found in fs/resctrl/rdtgroup.c within the kernel source tree. Since the file-system interface is tied to Intel's feature set and conventions, the PARTID allocator is named the closid allocator.

for_each_set_bit(closid, &closid_free_map, closid_free_map_len) {
   if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID) &&
       resctrl_closid_is_dirty(closid))
           continue;

   clear_bit(closid, &closid_free_map);
   return closid;
}

Also, for every folder created, a default configuration needs to be programmed into the feature configuration registers of MPAM's supported MSCs for the new PARTID. For the PuT, this means programming the L3 cache portion bitmap with the default maximum portion bitmap. This is also taken care of by resctrl. The snippet below shows a portion of the MPAM driver API code that gets called when a new folder is created.

case RDT_RESOURCE_L3:
    cfg.cpbm = cfg_val;
    mpam_set_feature(mpam_feat_cpor_part, &cfg);
    break;

~
return mpam_apply_config(dom->comp, partid, &cfg);

The same API (mpam_apply_config) is used when the user makes any change in the schemata. Instead of the default config, the cache portion bitmap written by the user gets programmed into the MPAM configuration register for the PARTID.

Even if the L3 cache/SLC for the PuT supports a larger set of PARTIDs, resctrl has a limit of 32 PARTIDs at maximum due to the bitmap algorithm used for closid allocation. If the user tries to create more than 32 folders, including the root folder /sys/fs/resctrl, the system throws an error.

/*
 * MSC may raise an error interrupt if it sees an out-of-range partid/pmg,
 * and go on to truncate the value. Regardless of what the hardware
 * supports, only the system wide safe value is safe to use.
 */
u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored)
{
    return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID);
}
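
When the closid space is exhausted, creating an additional group fails. A hedged illustration (the directory name is arbitrary and the exact error text may vary; resctrl reports -ENOSPC when it runs out of closids, which the shell typically prints as shown):

/sys/fs/resctrl # mkdir partid_space_33
mkdir: can't create directory 'partid_space_33': No space left on device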

Please refer to drivers/platform/mpam/mpam_resctrl.c in the Linux source tree to get a detailed understanding of MPAM’s interaction with resctrl.

MPAM and task scheduling

In the last section, the main focus was on understanding how the MPAM driver is designed, how the resctrl file-system interacts with the MPAM driver, and the basic boot-time initialization sequence of the MPAM driver. In this section, an interesting topic is looked at: how MPAM works along with the task scheduler.

Once MPAM is enabled, each task should belong to a PARTID group. Since PARTID gets so tightly ingrained with a task’s basic identity, the thread_info (arch/arm64/include/asm/thread_info.h) struct has been modified to hold an additional member as shown below.

/*
 * low level task data that entry.S needs immediate access to.
 */
 struct thread_info {
 ~

 #ifdef CONFIG_ARM64_MPAM
 u64    mpam_partid_pmg;
 #endif

When a system boots up with MPAM enabled and resctrl mounted, all tasks belong to the default PARTID-PMG (0) group. Once new partitions are allocated and tasks are moved from one PARTID-PMG group to another, this member of the thread_info (mpam_partid_pmg) would have to be updated accordingly. Below is the stack dump for the case where a task is moved from the default PARTID group to a new one.

/sys/fs/resctrl/partid_space_2 # echo 26 > tasks

[  404.607377] CPU: 3 PID: 1 Comm: sh Not tainted 5.17.0-g5bf032719b99-dirty #19
[  404.607381] Hardware name: ARM LTD RdN2Cfg1, BIOS EDK II Jun 15 2022
[  404.607384] Call trace:
[  404.607386]  dump_backtrace.part.0+0xd0/0xe0
[  404.607391]  show_stack+0x1c/0x6c
[  404.607396]  dump_stack_lvl+0x68/0x84
[  404.607400]  dump_stack+0x1c/0x38
[  404.607405]  resctrl_arch_set_closid_rmid+0x50/0xac
[  404.607410]  rdtgroup_tasks_write+0x2b0/0x4a0
[  404.607414]  rdtgroup_file_write+0x24/0x40
[  404.607419]  kernfs_fop_write_iter+0x11c/0x1ac
[  404.607424]  new_sync_write+0xe8/0x184
[  404.607427]  vfs_write+0x230/0x290
[  404.607431]  ksys_write+0x68/0xf4
[  404.607435]  __arm64_sys_write+0x20/0x2c
[  404.607439]  invoke_syscall+0x48/0x114
[  404.607444]  el0_svc_common.constprop.0+0x44/0xec
[  404.607449]  do_el0_svc+0x28/0x90
[  404.607453]  el0_svc+0x20/0x60
[  404.607457]  el0t_64_sync_handler+0x1a8/0x1b0
[  404.607461]  el0t_64_sync+0x1a0/0x1a4

The write to the tasks file ends up as a synchronous exception from a 64-bit lower EL. The exception handler routes it to the appropriate resctrl routines, which then proceed to call resctrl_arch_set_closid_rmid with the new PARTID and PMG. As the snippets below show, the thread_info field of the very task that was moved between PARTID groups ends up being updated via mpam_set_task_partid_pmg, while the resctrl_arch_set_cpu_default_closid_rmid/mpam_set_cpu_defaults path performs the equivalent update of a CPU's default PARTID and PMG.

void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg)
{
    BUG_ON(closid > U16_MAX);
    BUG_ON(pmg > U8_MAX);

    if (!cdp_enabled) {
        mpam_set_cpu_defaults(cpu, closid, closid, pmg, pmg);
~
static inline void mpam_set_task_partid_pmg(struct task_struct *tsk,
                                      u16 partid_d, u16 partid_i,
                                      u8 pmg_d, u8 pmg_i)
{
#ifdef CONFIG_ARM64_MPAM
    u64 regval;

    regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d);
    regval |= FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i);
    regval |= FIELD_PREP(MPAM_SYSREG_PMG_D, pmg_d);
    regval |= FIELD_PREP(MPAM_SYSREG_PMG_I, pmg_i);

    WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval);
#endif
}

How does the mpam_partid_pmg field from thread_info get utilized? The actual use of this field is in enabling the propagation of the corresponding PARTID-PMG value pair downstream via the bus interface. Every memory request should be tagged with PARTID-PMG fields so that the MSCs downstream can respond according to the feature configuration that has been set up on them for that particular PARTID-PMG received from upstream. For the PuT, PARTID-PMG is propagated downstream via the CHI interface. To enable propagation of PARTID-PMG values, the system register MPAM0_EL1 has to be programmed with the PARTID-PMG value. The MPAM specification describes this register's purpose as follows - “Holds information to generate MPAM labels for memory requests when executing at EL0.” Please refer to the MPAM specification, chapter 4, for detailed information on MPAM information propagation.

When the system boots up with all tasks in the default configuration, the PARTID-PMG pair has a value of zero and MPAM0_EL1 holds this same value; the early boot call to __init_el2_mpam writes zero to this system register. As new PARTIDs are allocated and tasks are moved out of the default PARTID group, MPAM0_EL1 needs re-programming. When a task that has been moved from the default group to a new group gets scheduled, there has to be a check whether the PARTID-PMG pair that MPAM0_EL1 holds matches the one in the scheduled task's thread_info. mpam_thread_switch (arch/arm64/include/asm/mpam.h) does exactly this.

__notrace_funcgraph __sched
struct task_struct *__switch_to(struct task_struct *prev,
                            struct task_struct *next)
{
    struct task_struct *last;
~

    /*
     * MPAM thread switch happens after the DSB to ensure prev's accesses
     * use prev's MPAM settings.
     */
    mpam_thread_switch(next);

~

static inline void mpam_thread_switch(struct task_struct *tsk)
{
    u64 oldregval;
    int cpu = smp_processor_id();
    u64 regval = mpam_get_regval(tsk);

    if (!IS_ENABLED(CONFIG_ARM64_MPAM))
        return;

    if (!static_branch_likely(&mpam_enabled))
        return;


    oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
    if (oldregval == regval)
        return;

    if (!regval)
        regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu));

    write_sysreg_s(regval, SYS_MPAM0_EL1);
    WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval);
}

Every time a task switch happens via __switch_to, mpam_thread_switch gets called with the new task_struct (include/linux/sched.h) as its parameter. What has been programmed into MPAM0_EL1 for the CPU in context is held in a per-CPU variable called arm64_mpam_current. If there is a mismatch between the thread_info value and the value in MPAM0_EL1, the value from thread_info is copied into MPAM0_EL1. Re-programming MPAM0_EL1 generally happens when two tasks of different PARTID-PMG groups get scheduled on the same core; if such tasks keep switching back and forth on the CPU in context, the system register keeps getting programmed with the relevant PARTID-PMG pairs.

To conclude, a simple test done on the PuT is discussed below. As part of the test, a new PARTID space (partid_space_2) was created as soon as the system booted to the prompt, and a simple script was used to move tasks from the default PARTID space to partid_space_2.

/sys/fs/resctrl/partid_space_2 # cat  ~/mv_task.sh

#!/bin/sh

for i in `seq $1 $2`
do
    echo "$i" > tasks
done

Basic conditional debug logs were added within mpam_thread_switch for this build. The process list was dumped to get an idea of the processes that were planned to be moved to the new PARTID space (PIDs 5 to 20).

/sys/fs/resctrl/partid_space_2 # ps -A

PID   USER     TIME  COMMAND
 1 0         0:00 sh
 2 0         0:00 [kthreadd]
 3 0         0:00 [rcu_gp]
 4 0         0:00 [rcu_par_gp]
 6 0         0:00 [kworker/0:0H]
 8 0         0:00 [mm_percpu_wq]
 9 0         0:00 [rcu_tasks_kthre]
10 0         0:00 [ksoftirqd/0]
11 0         0:00 [rcu_preempt]
12 0         0:00 [migration/0]
13 0         0:00 [cpuhp/0]
14 0         0:00 [cpuhp/1]
15 0         0:00 [migration/1]
16 0         0:00 [ksoftirqd/1]
17 0         0:00 [kworker/1:0-mm_]
18 0         0:00 [kworker/1:0H]
19 0         0:00 [cpuhp/2]
20 0         0:00 [migration/2]

The following logs were observed as soon as the tasks were moved from the default PARTID space to the new PARTID space.

/sys/fs/resctrl/partid_space_2 #  ~/mv_task.sh 5 20

[  274.393977] oldregval (arm64_mpam_current) : 0         //chunk 1
[  274.393977] regval (thread_info field)     : 10001
[  274.393981] pid                            : 11
[  274.393981] tgid                           : 11
[  274.393981] cpu id                         : 1
[  274.393984] SYS_MPAM0_EL1 before update    : 0
[  274.393987] SYS_MPAM0_EL1 after update     : 10001
[  274.393987]
[  274.393991] oldregval (arm64_mpam_current) : 10001     //chunk 2
[  274.393993] regval (thread_info field)     : 0
[  274.393996] pid                            : 0
[  274.393999] tgid                           : 0
[  274.393999] cpu id                         : 1
[  274.401977] SYS_MPAM0_EL1 before update    : 10001
[  274.401980] SYS_MPAM0_EL1 after update     : 0
[  274.401983]
[  274.401985] oldregval (arm64_mpam_current) : 0         //chunk 3
[  274.401985] regval (thread_info field)     : 10001
[  274.401990] pid                            : 11
[  274.401992] tgid                           : 11
[  274.401995] cpu id                         : 1
[  274.401998] SYS_MPAM0_EL1 before update    : 0
[  274.401998] SYS_MPAM0_EL1 after update     : 10001
[  274.409975]
[  274.409978] oldregval (arm64_mpam_current) : 10001     //chunk 4
[  274.409980] regval (thread_info field)     : 0
[  274.409983] pid                            : 0
[  274.409983] tgid                           : 0
[  274.409987] cpu id                         : 1
[  274.409990] SYS_MPAM0_EL1 before update    : 10001
[  274.409992] SYS_MPAM0_EL1 after update     : 0
[  274.409995]

The above log can be divided into four chunks of data, each captured when one of the threads was being scheduled. The first chunk shows the task with tgid 11 (the tgid being the value equivalent to the PID visible from user space) being scheduled on CPU 1. Since PID 11 was moved to the new PARTID space, partid_space_2 with PARTID 1, the new PARTID-PMG value stored in its thread_info field, mpam_partid_pmg, is 10001. However, the last thread scheduled on this CPU belonged to the PARTID 0 group, as indicated by the per-CPU variable (oldregval) in the logs; this is the same value stored in MPAM0_EL1. Since there is a mismatch between these values, MPAM0_EL1 is updated with the new PARTID-PMG pair, and the per-CPU variable arm64_mpam_current is updated using the WRITE_ONCE macro to avoid store tearing and re-ordering.
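
The value 10001 can be decoded using the MPAM0_EL1 field layout that mpam_set_task_partid_pmg assembles: PARTID_D occupies bits [15:0] and PARTID_I bits [31:16], with the PMG fields above them left at zero here. PARTID 1 in both PARTID fields therefore yields 0x10001, which a quick (illustrative) shell check confirms.

# printf '0x%x\n' $(( (1 << 16) | 1 ))
0x10001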

The next chunk shows the thread with tgid/PID 0 getting scheduled on the same CPU. However, PID 0 still belongs to the default PARTID space, so there is a mismatch between its thread_info field and the newly programmed PARTID-PMG value in MPAM0_EL1/arm64_mpam_current. The default PARTID-PMG therefore gets programmed back into MPAM0_EL1 and arm64_mpam_current. Two more context switches have been captured, where chunk 3 is similar to chunk 1 and chunk 4 to chunk 2. The kernel changes for MPAM are quite large; for brevity, only the most essential parts have been covered in this documentation.