Memory System Resource Partitioning and Monitoring (MPAM)
MPAM-resctrl - A quick glance
MPAM stands for Memory System Resource Partitioning and Monitoring. As the name suggests, it deals with two things: partitioning and monitoring.
MPAM’s resource partitioning logic deals with partitioning resources such as shared CPU caches, interconnect caches, memory bandwidth and interconnect bandwidth. In MPAM terminology, such resources are called memory-system components (MSCs). How an MSC gets partitioned depends on the type of MSC. For instance, partitioning a cache can be very different from partitioning memory bandwidth.
MPAM’s resource monitoring logic deals with monitoring each MSC. A monitor can measure resource usage or capacity usage, depending on the resource. For instance, a cache can have a cache-storage monitor that measures how much of the cache is in use. Reading a monitor can help in tuning the memory-system partitioning controls. For detailed information on MPAM, refer to the MPAM specification.
resctrl is a Linux kernel feature through which Arm’s MPAM and Intel’s RDT can be configured and controlled. resctrl exposes MPAM capabilities and configuration options via a file-system interface. In the latest kernel source tree, resctrl is still adapted to x86 RDT: the file and folder names reflect RDT’s feature set rather than generic resource-partitioning names or MPAM’s feature names. In short, on the Arm64 architecture, resctrl is how user space configures MPAM. The steps to configure MPAM via resctrl are described in the next section.
Exploring resctrl file-system
MPAM-resctrl is enabled by default on the platform (from here on, the platform, platform under test or platform under consideration is abbreviated as PuT). This documentation advises users to follow the Busybox build to enable MPAM-resctrl capabilities for the PuT.
Once the necessary sources have been fetched, check out the RD-INFRA-2022.09.30-MPAM
tag for linux and repository before proceeding with the build. Build and boot
the system to the command prompt. Note that MPAM’s performance aspects cannot be
tested on an FVP; only the register configuration can be tested on it. Run the
following command to mount the resctrl file-system.
    # mount -t resctrl resctrl /sys/fs/resctrl
It helps to read the resctrl documentation in parallel, as many of the concepts discussed below are explained more clearly there. However, as mentioned at the beginning, be aware that the documentation currently covers the resctrl file-system as used by Intel’s RDT.
Once resctrl file-system has been mounted, change directory to /sys/fs/resctrl and list the files.
    # cd /sys/fs/resctrl
    /sys/fs/resctrl# ls
    cpus        info        mon_data    schemata    tasks
    cpus_list   mode        mon_groups  size
These are the files and folders through which the MPAM MSCs of the PuT are accessed and configured. Before proceeding further, it is important to understand more about MPAM’s PARTID. A PARTID can be considered an ID or label associated with the MPAM configuration for a single software environment or a collection of software environments.
Quoting the MPAM specification: “An MPAM resource control uses the PARTID that is set for one or more software environments. A PARTID for the current software environment labels each memory system request. Each MPAM resource control has control settings for each PARTID. The PARTID in a request selects the control settings for that PARTID, which are then used to control the partitioning of the performance resources of that memory-system component”.
In short, each set of MPAM configurations is associated with a PARTID. The required configuration is selected or modified by programming the associated PARTID into the MPAMCFG_PART_SEL register in the MSC’s memory-mapped interface.
The MPAM driver is designed so that the default configuration uses a single PARTID (PARTID 0) with the default maximum partition configuration for the MSCs. This is set up in the early stages of Linux kernel boot and is covered in greater detail in later sections.
resctrl is organized so that each PARTID has its own copy of all these files and folders. At this point, there is just one set of these files and folders, as shown above. The more PARTIDs there are, the more copies of this set exist. To understand what these files and folders denote, try the following.
    /sys/fs/resctrl # cat cpus
    ffff
    /sys/fs/resctrl # cat cpus_list
    0-15
The file named cpus lists the CPUs that have access to the MPAM MSCs under
consideration for a given PARTID. The output is a bitmap. For the PuT, it shows
0xffff, indicating the presence of 16 CPUs. The file named cpus_list shows the
same information as a range (CPUs 0-15).
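The relationship between the two files can be checked off-target with a few lines of Python. This is a minimal sketch, not part of resctrl: the function names are made up for illustration, and the only format assumptions are a hexadecimal mask (optionally split into comma-separated groups on larger systems) and a range list like the one shown above.

```python
def cpumask_to_cpus(mask_hex):
    """Return the CPU numbers whose bits are set in a hex CPU mask."""
    # Larger systems may print the mask as comma-separated groups.
    value = int(mask_hex.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def cpus_to_list_string(cpus):
    """Collapse CPU numbers into ranges, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(
        str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges
    )


cpus = cpumask_to_cpus("ffff")
print(len(cpus), cpus_to_list_string(cpus))   # 16 0-15
```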
    /sys/fs/resctrl # cat schemata
    L3:49=ffff
schemata is one of the most important files exposed by resctrl. It shows the
MPAM resource, its ID and the partition for this particular PARTID. From the
output above, the MSC to be partitioned is an L3 cache with cache ID 49. The
default cache portion bitmask assigned to this PARTID is 0xffff, which means
the entire cache.
As discussed earlier in the MPAM-resctrl - A quick glance section, an MSC is partitioned in accordance with its type. For caches, two partitioning schemes can be used: cache portion partitioning and cache capacity partitioning. With cache portion partitioning, a cache is divided into a number of equal-sized portions, each represented by one bit in a bitmap. A ‘1’ means that the corresponding portion may be used and a ‘0’ means that it may not. 0xffff is the cache portion bitmap with all 16 portions enabled. Cache capacity partitioning is not exercised here, so it is not discussed in this documentation. Please refer to the MPAM specification for more detail on both partitioning schemes.
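To make the bitmap arithmetic concrete, the following Python sketch computes which portions a bitmap enables and what share of the cache that represents. It is an illustration only: the helper names are hypothetical, and the 16-portion full bitmap is simply the 0xffff value read from the PuT.

```python
def enabled_portions(portion_bitmap):
    """Return the indices of the portions that a bitmap allows."""
    return [i for i in range(portion_bitmap.bit_length())
            if (portion_bitmap >> i) & 1]


def cache_share(portion_bitmap, full_bitmap):
    """Fraction of the cache a bitmap allows, relative to the full bitmap."""
    if portion_bitmap & ~full_bitmap:
        raise ValueError("bitmap uses portions the cache does not have")
    return bin(portion_bitmap).count("1") / bin(full_bitmap).count("1")


FULL = 0xFFFF                      # all 16 portions, as read on the PuT
print(cache_share(0xFFFF, FULL))   # 1.0
print(cache_share(0x03FF, FULL))   # 0.625
print(enabled_portions(0x000F))    # [0, 1, 2, 3]
```

A bitmap of 0x3ff therefore allows 10 of the 16 portions, or 62.5% of the cache.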
Neoverse reference design platforms do not currently have an L3 cache. Instead, the system level cache (SLC) on the interconnect acts as the shared cache for all DSU clusters. The SLC for the PuT has been added to the PPTT table. The cache topology parsing logic in the OS walks through all available caches and associates each cache with a level. The SLC for the PuT is mapped as an L3 cache. For more details, refer to the PPTT and MPAM ACPI tables in the source code.
    /sys/fs/resctrl # cat tasks
    1
    2
    3
    4
    ~
Reading the tasks file gives an idea of the tasks that use this PARTID. Writing
a task ID to the file adds that task to the group. Since this is the default
configuration, all tasks should be listed in this file. An example where the
tasks file gets modified is shown later in this section.
    /sys/fs/resctrl # cat mode
    shareable
The mode of the resource group dictates the sharing of its allocations. A
“shareable” resource group allows sharing of its allocations while an
“exclusive” resource group does not.
    /sys/fs/resctrl # cd info
    /sys/fs/resctrl/info # ls
    L3               L3_MON           last_cmd_status
The info directory contains information about the enabled resources. Each
resource has its own sub-directory, named after the resource. Since the SLC has
been modeled as an L3 MPAM node, an L3 directory should be present. If the
resource supports monitoring, a folder named <MSC>_MON should also exist. L3_MON
in this case is the directory holding information about L3’s monitoring
capabilities.
    /sys/fs/resctrl/info # cd L3
    /sys/fs/resctrl/info/L3 # ls
    bit_usage       min_cbm_bits    shareable_bits
    cbm_mask        num_closids
The L3 sub-directory contains the files shown above. Enter the following
commands to understand what each of these files denotes.
    /sys/fs/resctrl/info/L3 # cat cbm_mask
    ffff
cbm_mask shows the cache portion bitmask corresponding to 100% allocation of the
MSC. This value matches the cache portion bitmap shown in schemata.
    /sys/fs/resctrl/info/L3 # cat bit_usage
    49=XXXXXXXXXXXXXXXX
bit_usage gives details about how each instance of the MSC gets used. Since
schemata describes the cache portion bitmap for L3, bit_usage reports the status
of each of these portions. Each portion, represented by one character, can be
any of the types below.
0: Corresponding region is unused. When the system’s resources have been
allocated and a “0” is found in “bit_usage”, it is a sign that resources are
wasted.
H: Corresponding region is used by hardware only but available for software
use. If a resource has bits set in “shareable_bits” but not all of these bits
appear in the resource groups’ schemata, then the bits appearing in
“shareable_bits” but in no resource group are marked as “H”.
X: Corresponding region is available for sharing and used by hardware and
software. These are the bits that appear in “shareable_bits” as well as in a
resource group’s allocation.
S: Corresponding region is used by software and available for sharing.
E: Corresponding region is used exclusively by one resource group. No sharing
is allowed.
P: Corresponding region is pseudo-locked. No sharing is allowed.
From the value read out, all 16 portions of the cache portion bitmap are marked “X”, that is, shareable.
    /sys/fs/resctrl/info/L3 # cat min_cbm_bits
    1
min_cbm_bits denotes the minimum number of consecutive bits that must be set
when writing a mask. Writing a mask with fewer bits set than min_cbm_bits leads
to an error.
    /sys/fs/resctrl/info/L3 # cat shareable_bits
    ffff
shareable_bits is a bitmask of all the shareable bits in the cache portion
bitmask. For the PuT, it is 0xffff.
    /sys/fs/resctrl/info/L3 # cat num_closids
    32
num_closids denotes the number of closids. closid is Intel’s terminology and
expands to “class of service ID”. Under MPAM, a closid corresponds to a PARTID.
Therefore, num_closids gives the number of valid PARTIDs the MSC supports.
    /sys/fs/resctrl/info # cat last_cmd_status
    ok
At the top level of the info directory, there is a file named last_cmd_status.
It is reset with every “command” issued via the file-system (making new
directories or writing to any of the control files). If the command was
successful, it reads “ok”. If the command fails, it provides more information
about the error generated during the operation. A simple example is shown below.
    /sys/fs/resctrl # echo L3:49=0000 > schemata
    sh: write error: Invalid argument
    /sys/fs/resctrl # cat info/last_cmd_status
    Mask out of range
As discussed earlier, min_cbm_bits is 1 on the PuT, so a bitmask written to
schemata must have at least one bit set. If a mask with fewer set bits than
min_cbm_bits is written, as with 0000 above, the resctrl file-system reports an
error.
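The rule can be sketched as a pre-check to run before writing to schemata. This is a simplified model under stated assumptions, not the kernel's validation code: it only tests the two conditions described in this section (the mask stays inside cbm_mask and contains a run of at least min_cbm_bits consecutive set bits), the function names are hypothetical, and the reason strings are illustrative rather than the messages resctrl prints.

```python
def longest_run(mask):
    """Length of the longest run of consecutive set bits in mask."""
    run = best = 0
    while mask:
        run = run + 1 if mask & 1 else 0
        best = max(best, run)
        mask >>= 1
    return best


def check_portion_mask(mask, cbm_mask, min_cbm_bits):
    """Return None if the mask passes the checks, else a reason string."""
    if mask & ~cbm_mask:
        return "mask has bits outside cbm_mask"
    if longest_run(mask) < min_cbm_bits:
        return "mask has fewer consecutive set bits than min_cbm_bits"
    return None


# Values read from info/L3 on the PuT: cbm_mask=ffff, min_cbm_bits=1.
print(check_portion_mask(0x0000, 0xFFFF, 1))   # rejected, like the example above
print(check_portion_mask(0x03FF, 0xFFFF, 1))   # None
```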
Configuring MPAM via resctrl file-system
The previous section looked at the file-system interface for the default PARTID. Real MPAM use cases have multiple partition spaces (PARTIDs) with different MSC partitions. With resctrl, adding a new partition space (PARTID) is simple: create a new folder in the root resctrl directory. Any name works, but a name that reflects the use case makes maintenance easier.
    /sys/fs/resctrl # mkdir partid_space_2
    /sys/fs/resctrl # ls
    cpus            mode            partid_space_2  tasks
    cpus_list       mon_data        schemata
    info            mon_groups      size
    /sys/fs/resctrl # cd partid_space_2/
    /sys/fs/resctrl/partid_space_2 # ls
    cpus        mode        mon_groups  size
    cpus_list   mon_data    schemata    tasks
Once the new folder named partid_space_2 is created, the MPAM driver internally
allocates a new PARTID and associates it with this new resctrl directory. The
user can modify the configuration via the resctrl file-system. resctrl talks to
the MPAM driver, and the driver in turn programs the required configuration
registers of the MSC under consideration for the new PARTID. To define the
schemata for this new PARTID, do the following.
    /sys/fs/resctrl/partid_space_2 # cat schemata
    L3:49=ffff
    /sys/fs/resctrl/partid_space_2 # echo "L3:49=3ff" > schemata
    /sys/fs/resctrl/partid_space_2 # cat schemata
    L3:49=03ff
As shown above, a schemata is defined by writing to the schemata file under the
new PARTID’s root directory. Whenever a new folder is added under the resctrl
root directory, its schemata initially reflects the default maximum for the
resource under consideration, in this case the L3 cache with 0xffff. The value
written has to follow the format in which schemata describes the MSC and its
partitions. In this case, the new value should be of the format
L3:<cache ID>=<cache portion bitmap>. Changing the schemata of the default
PARTID space is also valid. Users could try changing the value of the default
schemata as an experiment.
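The format can be captured in a small parser and formatter, which is handy when scripting schemata updates from a host. This is a sketch with hypothetical function names. The support for several `;`-separated cache IDs on one line follows the general resctrl schemata convention; the PuT shown here has a single cache ID (49).

```python
def parse_schemata_line(line):
    """Split 'L3:49=03ff' into ('L3', {49: 0x3ff})."""
    resource, _, domains = line.strip().partition(":")
    masks = {}
    for entry in domains.split(";"):
        cache_id, _, bitmap = entry.partition("=")
        masks[int(cache_id)] = int(bitmap, 16)
    return resource.strip(), masks


def format_schemata_line(resource, masks):
    """Build a line suitable for writing to a schemata file."""
    body = ";".join(f"{cid}={bitmap:x}" for cid, bitmap in sorted(masks.items()))
    return f"{resource}:{body}"


resource, masks = parse_schemata_line("L3:49=03ff")
print(resource, masks)                            # L3 {49: 1023}
print(format_schemata_line("L3", {49: 0x3FF}))    # L3:49=3ff
```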
Now that the new schemata value has been set, the next step is to update the
tasks file with the tasks that need to use this new partitioning scheme. Select
one task at random from ps -A.
    /sys/fs/resctrl/partid_space_2 # cat tasks
    /sys/fs/resctrl/partid_space_2 #
    /sys/fs/resctrl/partid_space_2 # ps -A
    PID   USER     TIME  COMMAND
        1 0         0:00 sh
        2 0         0:00 [kthreadd]
        3 0         0:00 [rcu_gp]
    ~
       23 0         0:00 [kworker/2:0H-ev]
       24 0         0:00 [cpuhp/3]
       25 0         0:00 [migration/3]
For this demonstration, task 23 has been selected to be added to the new
PARTID/partition space. Before assigning the task, take a look at the tasks file
under the default PARTID to make sure that the task is currently assigned to it.
As discussed at the beginning, with just the default PARTID, all tasks should be
listed in the default PARTID’s tasks file.
    /sys/fs/resctrl/partid_space_2 # cd ../
    /sys/fs/resctrl # cat tasks
    1
    2
    3
    4
    ~
    23
    24
    ~
Proceed to add the task to the tasks file under partid_space_2.
    /sys/fs/resctrl # cd partid_space_2
    /sys/fs/resctrl/partid_space_2 # echo 23 > tasks
    /sys/fs/resctrl/partid_space_2 # cat tasks
    23
At any given time, a task can exist under only one configuration. This means
that the task is no longer present in the default PARTID’s tasks file.
    /sys/fs/resctrl/partid_space_2 # cd ../
    /sys/fs/resctrl # cat tasks
    1
    2
    3
    4
    ~
    24
    ~
Additional tasks can be added to the tasks file in the same way as the first
one.
    /sys/fs/resctrl # cd partid_space_2
    /sys/fs/resctrl/partid_space_2 # echo 24 > tasks
    /sys/fs/resctrl/partid_space_2 # cat tasks
    23
    24
Multiple PARTIDs, up to the num_closids limit, can be added in the same fashion.
Repeat the steps shown above to configure the schemata and tasks for any new
PARTID directory created.
    /sys/fs/resctrl/partid_space_2 # cd ../
    /sys/fs/resctrl # mkdir partid_space_3
    /sys/fs/resctrl # ls
    cpus            mode            partid_space_2  tasks
    cpus_list       mon_data        schemata        size
    partid_space_3  info            mon_groups
    /sys/fs/resctrl # cd partid_space_3/
    /sys/fs/resctrl/partid_space_3 # ls
    cpus        mode        mon_groups  size
    cpus_list   mon_data    schemata    tasks
A closer look at MPAM software
Enabling MPAM on the PuT involves enabling MPAM EL1/EL2 register access from EL3 (trusted firmware), building kernel drivers and having proper ACPI tables to populate platform-specific MPAM data.
    EFI_ACPI_6_4_PPTT_STRUCTURE_CACHE_INIT (
      PPTT_CACHE_STRUCTURE_FLAGS,             /* Flag */
      0,                                      /* Next level of cache */
      SIZE_8MB,                               /* Size */
      8192,                                   /* Num of sets */
      16,                                     /* Associativity */
      PPTT_UNIFIED_CACHE_ATTR,                /* Attributes */
      64,                                     /* Line size */
      RD_PPTT_CACHE_ID(0, -1, -1, L3Cache)    /* Cache id */
    )
For processor side caches, MPAM references the cache/MSC of interest via cache ID. The way an MSC gets referenced in the MPAM table changes from MSC to MSC. Please refer to MPAM ACPI Specification to get a detailed understanding of how MPAM tables are described. For a complete view of the PPTT table implemented on the PuT, please refer to Platform/ARM/SgiPkg/AcpiTables/<PuT>/Pptt.aslc under uefi/edk2/edk2-platforms repository in the source files. Corresponding MPAM ACPI table entries are as shown below.
    /* MPAM_MSC_NODE 1 */
    {
      RD_MPAM_MSC_NODE_INIT(0x1, 0x142601000, 0, MPAM_MSC_COUNT,
        RESOURCES_PER_MSC, FUNCTIONAL_DEPENDENCY_PER_RESOURCE)
    },
    /* MPAM_MSC_NODE 2 */
    {
      RD_MPAM_MSC_NODE_INIT(0x2, 0x142641000, 0, MPAM_MSC_COUNT,
        RESOURCES_PER_MSC, FUNCTIONAL_DEPENDENCY_PER_RESOURCE)
    },
The number of SLC slices can vary from platform to platform. Each cache slice is configured as an MSC. A unique index should be used for each SLC slice, because the OS uses the index as one of the criteria to differentiate between MSC nodes. For a complete view of the MPAM ACPI table, please refer to the Platform/ARM/SgiPkg/AcpiTables/<PuT>/Mpam.aslc file under the uefi/edk2/edk2-platforms repository in the source files.
On the Linux side, MPAM software can be categorized into MPAM ACPI driver, MPAM platform driver, MPAM platform devices, MPAM layer for resctrl, MPAM support for architecture, etc. This would not be the complete list, but still covers most of the major software layers MPAM touches.
Quite early in the Linux boot, __init_el2_mpam
(arch/arm64/include/asm/el2_setup.h) is invoked from head.S. __init_el2_mpam
takes care of detecting MPAM, doing basic MPAM system register setup and
disabling MPAM traps to EL2.
    .macro __init_el2_mpam
    #ifdef CONFIG_ARM64_MPAM
        /* Memory Partitioning And Monitoring: disable EL2 traps */
        mrs     x1, id_aa64pfr0_el1
        ubfx    x0, x1, #ID_AA64PFR0_MPAM_SHIFT, #4
        cbz     x0, 1f                      // skip if no MPAM
        msr_s   SYS_MPAM0_EL1, xzr          // use the default partition..
        msr_s   SYS_MPAM2_EL2, xzr          // ..and disable lower traps
        msr_s   SYS_MPAM1_EL1, xzr
        mrs_s   x0, SYS_MPAMIDR_EL1
        tbz     x0, #17, 1f                 // skip if no MPAMHCR reg
        msr_s   SYS_MPAMHCR_EL2, xzr        // clear TRAP_MPAMIDR_EL1 -> EL2
    1:
    #endif /* CONFIG_ARM64_MPAM */
    .endm
As the kernel proceeds to boot, the MPAM platform driver initialization routine
(mpam_msc_driver_init) gets invoked. The total count of MPAM MSCs is queried
from the MPAM ACPI table. This is also the first time the MPAM ACPI table is
queried after kernel boot. The platform driver is initialized only if a valid
MPAM ACPI table with at least one MSC is defined. Once the platform driver is
initialized, MPAM driver probing kicks off (mpam_msc_drv_probe). It is at this
point that the MPAM ACPI table is completely parsed and the appropriate platform
device data structures are populated. Each populated MSC is registered as an
individual platform device.
Once all the platform devices are probed, temporary CPU hotplug callbacks
(mpam_discovery_cpu_online) are installed. If the system supports 128 MSCs, the
callbacks get registered only after the 128th platform device is registered. The
callbacks installed at this point discover hardware details about the MSCs
(known as hardware probing in MPAM driver terminology) and are replaced at a
later point, which is why they are described as temporary. More information on
CPU hotplugging and the supported API sets can be found at CPU hot plugging on
Linux. Please refer to drivers/platform/mpam/mpam_devices.c in the Linux kernel
repository to see the detailed implementation of the routines discussed here.
Soon after the CPU hotplug callbacks are installed, the setup callback
(mpam_discovery_cpu_online) is called on each of the CPUs. If the PuT has 16
CPUs, the setup function is called 16 times, with CPU IDs ranging from 0 to 15.
At this stage, the setup callback proceeds with MSC hardware discovery. This
includes discovering details such as the features supported, the maximum PARTID
and the maximum PMG. To understand all the features a particular MSC could
support, please refer to MPAM Specification chapter 9.
Once the supported features are discovered and the maximum PARTID and PMG values
are established, a default config is programmed into the configuration registers
(MPAMCFG_*) for each of these features, for all PARTIDs from 0 to the maximum
value. The setup function is written so that the first CPU to come online
discovers the features of all the registered MSCs and programs appropriate
configs for them. The setup calls on the remaining CPUs skip hardware discovery.
A small snippet of what happens in the setup function
(mpam_discovery_cpu_online) is shown below.
    /* For all MSCs, if the current CPU has access to the MSC and HW discovery
     * is yet to be carried out for the MSC under consideration, proceed with
     * the discovery.
     */
    list_for_each_entry(msc, &mpam_all_msc, glbl_list) {
        if (!cpumask_test_cpu(cpu, &msc->accessibility))
            continue;

        spin_lock(&msc->lock);
        if (!msc->probed)
            err = mpam_msc_hw_probe(msc);
        spin_unlock(&msc->lock);

        if (!err)
            new_device_probed = true;
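The effect of the probed flag can be seen in a toy user-space model. The Python sketch below is not the driver code: the class, the MSC names and the idea of returning the list of MSCs probed by each call are all made up for illustration, and locking and error handling are left out.

```python
class Msc:
    """Toy stand-in for one registered MSC (names here are illustrative)."""

    def __init__(self, name, accessibility):
        self.name = name
        self.accessibility = set(accessibility)  # CPUs that can reach this MSC
        self.probed = False


def discovery_cpu_online(cpu, all_mscs):
    """Probe every not-yet-probed MSC that this CPU can access."""
    probed_now = []
    for msc in all_mscs:
        if cpu not in msc.accessibility:
            continue
        if not msc.probed:
            msc.probed = True            # stands in for mpam_msc_hw_probe()
            probed_now.append(msc.name)
    return probed_now


# Two SLC slices reachable from all 16 CPUs, as assumed for this sketch.
mscs = [Msc("slc0", range(16)), Msc("slc1", range(16))]
for cpu in range(16):
    print(cpu, discovery_cpu_online(cpu, mscs))
```

Only the call for CPU 0 reports any probing; the remaining 15 calls find every MSC already probed and return an empty list.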
The logic to program any config register (MPAMCFG_*) is described in the MPAM specification, section 11.1.2.
After the first CPU to come up completes hardware probing and feature configuration, the kernel is free to enable MPAM. This is done with the help of workqueues.
    static DECLARE_WORK(mpam_enable_work, &mpam_enable);
    ~
    if (new_device_probed && !err)
        schedule_work(&mpam_enable_work);
The code scheduled on the workqueue shown above gets executed soon after the probing. This is where the MPAM resctrl configuration is set up. resctrl depends on cacheinfo, so the workqueue task that is responsible for setting up resctrl waits until cacheinfo is up and ready. cacheinfo populates cache nodes from the PPTT and exports them to /sys/devices/system/cpu/cpu*/cache/index* for user space to access. MPAM’s resctrl layer internally queries the size of the MSC cache node from cacheinfo and thus has to wait until proper data is available.
    wait_event(wait_cacheinfo_ready, cacheinfo_ready);
    ~
    static int __init __cacheinfo_ready(void)
    {
        cacheinfo_ready = true;
        wake_up(&wait_cacheinfo_ready);

        return 0;
    }
    device_initcall_sync(__cacheinfo_ready);
A teardown callback (mpam_cpu_offline) is also part of the hotplug callbacks
installed earlier. The teardown callback gets called when a CPU goes offline.
Atomic reference counters are kept in the data structures that manage each MSC.
In case of a hotplug shutdown on the PuT, the MPAM driver does not reprogram any
register or initiate cleanup until the last CPU goes offline.
    list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) {
        if (!cpumask_test_cpu(cpu, &msc->accessibility))
            continue;

        spin_lock(&msc->lock);
        if (msc->reenable_error_ppi)
            disable_percpu_irq(msc->reenable_error_ppi);

        if (atomic_dec_and_test(&msc->online_refs))
            mpam_reset_msc(msc, false);
        spin_unlock(&msc->lock);
Once cacheinfo is set up, MPAM’s resctrl setup proceeds. When resctrl setup is complete, MPAM is ready to be enabled and a new set of hotplug callbacks is installed, replacing the old one. The maximum PARTID and PMG that the system can support have been established at this point and cannot be changed after the new callbacks are installed.
    /*
     * Once the cpuhp callbacks have been changed, mpam_partid_max can no
     * longer change.
     */
    spin_lock(&partid_max_lock);
    partid_max_published = true;
    spin_unlock(&partid_max_lock);

    static_branch_enable(&mpam_enabled);
    mpam_register_cpuhp_callbacks(mpam_cpu_online);
The new setup function (mpam_cpu_online) deals with marking CPUs online and
reprogramming MSCs in case all CPUs had gone down. It mirrors the teardown
function: where the last CPU to go offline resets the MSC, the first CPU to come
back up re-programs the feature registers for each PARTID. The same atomic
reference counter used in the teardown function is used here for this purpose.
    rcu_read_lock();
    list_for_each_entry_rcu(msc, &mpam_all_msc, glbl_list) {
        if (!cpumask_test_cpu(cpu, &msc->accessibility))
            continue;

        spin_lock(&msc->lock);
        if (msc->reenable_error_ppi)
            _enable_percpu_irq(&msc->reenable_error_ppi);

        if (atomic_fetch_inc(&msc->online_refs) == 0)
            mpam_reprogram_msc(msc);
        spin_unlock(&msc->lock);
    }
    rcu_read_unlock();

    if (mpam_is_enabled())
        mpam_resctrl_online_cpu(cpu);
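The interplay between the two callbacks and the reference counter can be illustrated with a small user-space model. This Python sketch only mimics the counting logic described above; the class name and the event strings are hypothetical, and locking, interrupts and the actual register programming are omitted.

```python
class MscRefcount:
    """Toy model of one MSC's online_refs handling (illustrative only)."""

    def __init__(self, accessibility):
        self.accessibility = set(accessibility)
        self.online_refs = 0
        self.events = []

    def cpu_online(self, cpu):
        if cpu not in self.accessibility:
            return
        if self.online_refs == 0:        # like atomic_fetch_inc(...) == 0
            self.events.append(f"reprogram (cpu {cpu})")
        self.online_refs += 1

    def cpu_offline(self, cpu):
        if cpu not in self.accessibility:
            return
        self.online_refs -= 1
        if self.online_refs == 0:        # like atomic_dec_and_test(...)
            self.events.append(f"reset (cpu {cpu})")


msc = MscRefcount(accessibility=range(4))
for cpu in range(4):
    msc.cpu_online(cpu)      # only CPU 0 triggers a reprogram
for cpu in range(4):
    msc.cpu_offline(cpu)     # only the last CPU (3) triggers a reset
msc.cpu_online(2)            # first CPU back up reprograms again
print(msc.events)
# ['reprogram (cpu 0)', 'reset (cpu 3)', 'reprogram (cpu 2)']
```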
Once the system boots up and resctrl is mounted, PARTID 0 with the default
maximum cache portion bitmap comes into use. Whenever a new directory is added,
the MPAM driver selects the first free PARTID in the range from 0 to the maximum
as the new PARTID. More information about the PARTID allocator can be found in
fs/resctrl/rdtgroup.c within the kernel source tree. Since the file-system
interface is tied to Intel’s feature set and conventions, the PARTID allocator
is named closid_allocator.
    for_each_set_bit(closid, &closid_free_map, closid_free_map_len) {
        if (IS_ENABLED(CONFIG_RESCTRL_RMID_DEPENDS_ON_CLOSID) &&
            resctrl_closid_is_dirty(closid))
            continue;

        clear_bit(closid, &closid_free_map);
        return closid;
    }
Also, for every folder created, a default config needs to be programmed for the new PARTID into the feature configuration registers of each MSC that MPAM supports. For the PuT, this means programming L3’s cache portion bitmap with the default maximum portion bitmap. This is also taken care of by resctrl. The snippet below shows a portion of the MPAM driver API code that gets called when a new folder is created.
    case RDT_RESOURCE_L3:
        cfg.cpbm = cfg_val;
        mpam_set_feature(mpam_feat_cpor_part, &cfg);
        break;
    ~
    return mpam_apply_config(dom->comp, partid, &cfg);
The same API (mpam_apply_config) is used when the user makes any change to the
schemata. Instead of the default config, the cache portion bitmap written by the
user gets programmed into the MPAM configuration register for the PARTID.
Even if the L3 cache/SLC of the PuT supports a larger set of PARTIDs, resctrl is limited to a maximum of 32 PARTIDs because of the bitmap algorithm used for closid allocation. If the user tries to create more than 32 groups, counting the root folder /sys/fs/resctrl, the system reports an error.
    /*
     * MSC may raise an error interrupt if it sees an out of range partid/pmg,
     * and go on to truncate the value. Regardless of what the hardware
     * supports, only the system wide safe value is safe to use.
     */
    u32 resctrl_arch_get_num_closid(struct rdt_resource *ignored)
    {
        return min((u32)mpam_partid_max + 1, (u32)RESCTRL_MAX_CLOSID);
    }
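The first-free allocation and the 32-group ceiling can be reproduced with a short user-space model. The Python sketch below is only an approximation of the behaviour described above: the class and method names are invented, the "dirty" closid check is left out, the exception stands in for the error the kernel would return, and closid 0 is treated as permanently taken by the default group.

```python
RESCTRL_MAX_CLOSID = 32   # limit quoted in the text above


def num_closids(partid_max):
    """Mirror of min(mpam_partid_max + 1, RESCTRL_MAX_CLOSID)."""
    return min(partid_max + 1, RESCTRL_MAX_CLOSID)


class ClosidAllocator:
    """Toy first-free allocator over a bitmap (illustrative only)."""

    def __init__(self, partid_max):
        self.count = num_closids(partid_max)
        # Bit n set means closid n is free; closid 0 is the default group.
        self.free_map = ((1 << self.count) - 1) & ~1

    def alloc(self):
        for closid in range(self.count):
            if self.free_map & (1 << closid):
                self.free_map &= ~(1 << closid)
                return closid
        raise RuntimeError("no free closid left")

    def free(self, closid):
        self.free_map |= 1 << closid


# Hypothetical MSC that supports 64 PARTIDs (partid_max = 63).
allocator = ClosidAllocator(partid_max=63)
groups = [allocator.alloc() for _ in range(31)]
print(groups[0], groups[-1])   # 1 31
```

With 31 groups created next to the root group, all 32 closids are in use and the next `alloc()` call fails, which matches the behaviour described for the 33rd folder.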
Please refer to drivers/platform/mpam/mpam_resctrl.c in the Linux source tree to get a detailed understanding of MPAM’s interaction with resctrl.
MPAM and task scheduling
The last section focused on how the MPAM driver is designed, how the resctrl file-system interacts with the MPAM driver and the basic boot initialization sequence of the MPAM driver. This section looks at how MPAM works along with the task scheduler.
Once MPAM is enabled, each task should belong to a PARTID group. Since the
PARTID is so tightly bound to a task’s basic identity, the thread_info struct
(arch/arm64/include/asm/thread_info.h) has been modified to hold an additional
member, as shown below.
    /*
     * low level task data that entry.S needs immediate access to.
     */
    struct thread_info {
    ~
    #ifdef CONFIG_ARM64_MPAM
        u64 mpam_partid_pmg;
    #endif
When a system boots up with MPAM enabled and resctrl mounted, all tasks belong
to the default PARTID-PMG (0) group. Once new partitions are allocated and tasks
are moved from one PARTID-PMG group to another, the mpam_partid_pmg member of
thread_info has to be updated accordingly. Below is the stack dump for the case
where a task is moved from the default PARTID group to a new one.
    /sys/fs/resctrl/partid_space_2 # echo 26 > tasks
    [ 404.607377] CPU: 3 PID: 1 Comm: sh Not tainted 5.17.0-g5bf032719b99-dirty #19
    [ 404.607381] Hardware name: ARM LTD RdN2Cfg1, BIOS EDK II Jun 15 2022
    [ 404.607384] Call trace:
    [ 404.607386] dump_backtrace.part.0+0xd0/0xe0
    [ 404.607391] show_stack+0x1c/0x6c
    [ 404.607396] dump_stack_lvl+0x68/0x84
    [ 404.607400] dump_stack+0x1c/0x38
    [ 404.607405] resctrl_arch_set_closid_rmid+0x50/0xac
    [ 404.607410] rdtgroup_tasks_write+0x2b0/0x4a0
    [ 404.607414] rdtgroup_file_write+0x24/0x40
    [ 404.607419] kernfs_fop_write_iter+0x11c/0x1ac
    [ 404.607424] new_sync_write+0xe8/0x184
    [ 404.607427] vfs_write+0x230/0x290
    [ 404.607431] ksys_write+0x68/0xf4
    [ 404.607435] __arm64_sys_write+0x20/0x2c
    [ 404.607439] invoke_syscall+0x48/0x114
    [ 404.607444] el0_svc_common.constprop.0+0x44/0xec
    [ 404.607449] do_el0_svc+0x28/0x90
    [ 404.607453] el0_svc+0x20/0x60
    [ 404.607457] el0t_64_sync_handler+0x1a8/0x1b0
    [ 404.607461] el0t_64_sync+0x1a0/0x1a4
The write to the tasks file enters the kernel as a system call, which is taken
as a synchronous exception from a 64-bit lower EL. The exception handler routes
it to the appropriate resctrl routines, which then call
resctrl_arch_set_closid_rmid. This function passes the new PARTID and PMG on to
mpam_set_task_partid_pmg, which updates the thread_info member field of the very
task that was moved between PARTID groups. The snippet below shows this helper,
preceded by the start of resctrl_arch_set_cpu_default_closid_rmid, which sets
the per-CPU defaults through mpam_set_cpu_defaults.
    void resctrl_arch_set_cpu_default_closid_rmid(int cpu, u32 closid, u32 pmg)
    {
        BUG_ON(closid > U16_MAX);
        BUG_ON(pmg > U8_MAX);

        if (!cdp_enabled) {
            mpam_set_cpu_defaults(cpu, closid, closid, pmg, pmg);
    ~
    static inline void mpam_set_task_partid_pmg(struct task_struct *tsk,
                                                u16 partid_d, u16 partid_i,
                                                u8 pmg_d, u8 pmg_i)
    {
    #ifdef CONFIG_ARM64_MPAM
        u64 regval;

        regval = FIELD_PREP(MPAM_SYSREG_PARTID_D, partid_d);
        regval |= FIELD_PREP(MPAM_SYSREG_PARTID_I, partid_i);
        regval |= FIELD_PREP(MPAM_SYSREG_PMG_D, pmg_d);
        regval |= FIELD_PREP(MPAM_SYSREG_PMG_I, pmg_i);

        WRITE_ONCE(task_thread_info(tsk)->mpam_partid_pmg, regval);
    #endif
    }
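The packing done by the FIELD_PREP calls can be tried out in user space. The Python sketch below is an illustration, not kernel code. The bit positions are an assumption stated here rather than taken from the snippet: PARTID_I in bits [15:0], PARTID_D in bits [31:16], PMG_I in bits [39:32] and PMG_D in bits [47:40], which is the MPAM0_EL1 layout as understood from the MPAM specification. The function names are hypothetical.

```python
# Assumed field layout (shift, width) of the packed PARTID/PMG value.
PARTID_I = (0, 16)
PARTID_D = (16, 16)
PMG_I = (32, 8)
PMG_D = (40, 8)


def field_prep(field, value):
    """Place value into the given (shift, width) field, like FIELD_PREP."""
    shift, width = field
    if value >> width:
        raise ValueError("value does not fit the field")
    return value << shift


def pack_partid_pmg(partid_d, partid_i, pmg_d, pmg_i):
    """Build the 64-bit value stored in thread_info->mpam_partid_pmg."""
    regval = field_prep(PARTID_D, partid_d)
    regval |= field_prep(PARTID_I, partid_i)
    regval |= field_prep(PMG_D, pmg_d)
    regval |= field_prep(PMG_I, pmg_i)
    return regval


# Without CDP the same PARTID is used for data and instruction accesses.
print(hex(pack_partid_pmg(1, 1, 0, 0)))   # 0x10001
print(hex(pack_partid_pmg(0, 0, 0, 0)))   # 0x0
```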
How is the mpam_partid_pmg field from thread_info used? Its purpose is to enable
the propagation of the corresponding PARTID-PMG value pair downstream via the
bus interface. Every memory request should be tagged with PARTID-PMG fields so
that the MSCs downstream can respond according to the feature configuration set
up for the particular PARTID-PMG received from upstream. For the PuT, PARTID-PMG
is propagated downstream via the CHI interface. To enable propagation of
PARTID-PMG values, the system register MPAM0_EL1 has to be programmed with the
PARTID-PMG value. The MPAM specification describes this register’s purpose as
follows: “Holds information to generate MPAM labels for memory requests when
executing at EL0.” Please refer to MPAM Specification chapter 4 for detailed
information on MPAM information propagation.
When the system boots up with all tasks in the default configuration, the
PARTID-PMG pair has a value of zero and MPAM0_EL1 holds this same value. The
early boot call to __init_el2_mpam writes zero to this system register. As new
PARTIDs are allocated and tasks are moved out of the default PARTID group,
MPAM0_EL1 needs re-programming. When a task that has been moved from the default
group to a new group gets scheduled, there has to be a check to see whether the
PARTID-PMG pair that MPAM0_EL1 holds is the one stored in the thread_info of the
task being scheduled. mpam_thread_switch (arch/arm64/include/asm/mpam.h)
performs exactly this check.
    __notrace_funcgraph __sched
    struct task_struct *__switch_to(struct task_struct *prev,
                                    struct task_struct *next)
    {
        struct task_struct *last;
    ~
        /*
         * MPAM thread switch happens after the DSB to ensure prev's accesses
         * use prev's MPAM settings.
         */
        mpam_thread_switch(next);
    ~
    static inline void mpam_thread_switch(struct task_struct *tsk)
    {
        u64 oldregval;
        int cpu = smp_processor_id();
        u64 regval = mpam_get_regval(tsk);

        if (!IS_ENABLED(CONFIG_ARM64_MPAM))
            return;

        if (!static_branch_likely(&mpam_enabled))
            return;

        oldregval = READ_ONCE(per_cpu(arm64_mpam_current, cpu));
        if (oldregval == regval)
            return;

        if (!regval)
            regval = READ_ONCE(per_cpu(arm64_mpam_default, cpu));

        write_sysreg_s(regval, SYS_MPAM0_EL1);
        WRITE_ONCE(per_cpu(arm64_mpam_current, cpu), regval);
    }
Every time a task switch happens via __switch_to, mpam_thread_switch is called
with the task_struct (include/linux/sched.h) of the incoming task as its
parameter. The value last programmed into MPAM0_EL1 on the current CPU is
cached in a per-CPU variable called arm64_mpam_current. If the value in
thread_info differs from this cached value, the thread_info value is written to
MPAM0_EL1 and arm64_mpam_current is updated to match. A thread_info value of
zero is replaced by the per-CPU default, arm64_mpam_default, before the write.
Re-programming MPAM0_EL1 generally happens when two tasks from different
PARTID-PMG groups get scheduled on the same core. If such tasks keep switching
back and forth on that CPU, the system register is re-programmed with the
relevant PARTID-PMG pair on each switch.
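The compare-then-write behaviour described above can be modelled outside the kernel. The following is a minimal user-space sketch, not kernel code: the arrays stand in for the per-CPU variables and the system register, and the names NR_CPUS_MODEL, struct model_task, model_thread_switch and sysreg_writes are hypothetical and introduced only for illustration.

```c
#include <stdint.h>

#define NR_CPUS_MODEL 4  /* hypothetical CPU count for this model */

/* Plain arrays standing in for kernel state (assumptions of the model). */
static uint64_t mpam_current[NR_CPUS_MODEL]; /* models arm64_mpam_current */
static uint64_t mpam_default[NR_CPUS_MODEL]; /* models arm64_mpam_default */
static uint64_t mpam0_el1[NR_CPUS_MODEL];    /* models MPAM0_EL1 per CPU  */
static unsigned int sysreg_writes;           /* modelled register writes  */

/* Hypothetical task type carrying only what the switch logic reads. */
struct model_task {
	int pid;
	uint64_t mpam_partid_pmg; /* value the kernel keeps in thread_info */
};

/*
 * Follows the control flow of mpam_thread_switch(): skip the write when
 * the cached value already matches, substitute the per-CPU default for a
 * zero value, then update both the register and the cache.
 * Returns 1 if the modelled register was written, 0 otherwise.
 */
static int model_thread_switch(unsigned int cpu, const struct model_task *next)
{
	uint64_t regval = next->mpam_partid_pmg;
	uint64_t oldregval = mpam_current[cpu];

	if (oldregval == regval)
		return 0;

	if (!regval)
		regval = mpam_default[cpu];

	mpam0_el1[cpu] = regval;
	mpam_current[cpu] = regval;
	sysreg_writes++;
	return 1;
}
```

Calling model_thread_switch alternately with a task in the default group and a task in a new group on the same CPU index causes a modelled register write on every call, while scheduling the same task twice in a row causes none.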
To conclude, a simple test done on the PuT is discussed below. As part of the
test, a new PARTID space (partid_space_2) was created as soon as the system
booted to the prompt. A simple script was then used to move tasks from the
default PARTID space to partid_space_2.
    /sys/fs/resctrl/partid_space_2 # cat ~/mv_task.sh
    #!/bin/sh
    for i in `seq $1 $2`
    do
            echo "$i" > tasks
    done
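The same operation can be sketched in C for cases where a shell is not convenient. This is an illustrative sketch only: the function name move_tasks is hypothetical, and it assumes, as the script does, that each PID is delivered to the tasks file in its own write.

```c
#include <stdio.h>

/*
 * Writes every PID in [first, last] to the file at tasks_path, reopening
 * the file for each PID so that each one goes out as a separate write,
 * as `echo "$i" > tasks` does in the script. A PID is counted only if
 * both the formatted write and the flush at fclose() succeed, so PIDs
 * the kernel rejects (for example, ones that do not exist) are skipped.
 * Returns the number of PIDs written successfully.
 */
static int move_tasks(const char *tasks_path, int first, int last)
{
	int moved = 0;

	for (int pid = first; pid <= last; pid++) {
		FILE *f = fopen(tasks_path, "w");
		int ok;

		if (!f)
			continue;

		ok = fprintf(f, "%d\n", pid) > 0;
		/* The data reaches the file at fclose(); check its result too. */
		if (fclose(f) == 0 && ok)
			moved++;
	}
	return moved;
}
```

On the PuT this would be called with the path of the tasks file inside the new group directory, for example move_tasks("/sys/fs/resctrl/partid_space_2/tasks", 5, 20).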
Basic conditional debug logs were added to mpam_thread_switch in the build.
The process list was dumped to identify the processes that would be moved to
the new PARTID space (PIDs 5 to 20).
    /sys/fs/resctrl/partid_space_2 # ps -A
    PID   USER     TIME  COMMAND
        1 0         0:00 sh
        2 0         0:00 [kthreadd]
        3 0         0:00 [rcu_gp]
        4 0         0:00 [rcu_par_gp]
        6 0         0:00 [kworker/0:0H]
        8 0         0:00 [mm_percpu_wq]
        9 0         0:00 [rcu_tasks_kthre]
       10 0         0:00 [ksoftirqd/0]
       11 0         0:00 [rcu_preempt]
       12 0         0:00 [migration/0]
       13 0         0:00 [cpuhp/0]
       14 0         0:00 [cpuhp/1]
       15 0         0:00 [migration/1]
       16 0         0:00 [ksoftirqd/1]
       17 0         0:00 [kworker/1:0-mm_]
       18 0         0:00 [kworker/1:0H]
       19 0         0:00 [cpuhp/2]
       20 0         0:00 [migration/2]
The following logs were observed as soon as the tasks were moved from the default PARTID space to the new PARTID space.
    /sys/fs/resctrl/partid_space_2 # ~/mv_task.sh 5 20
    [ 274.393977] oldregval (arm64_mpam_current) : 0         //chunk 1
    [ 274.393977] regval (thread_info field) : 10001
    [ 274.393981] pid : 11
    [ 274.393981] tgid : 11
    [ 274.393981] cpu id : 1
    [ 274.393984] SYS_MPAM0_EL1 before update : 0
    [ 274.393987] SYS_MPAM0_EL1 after update : 10001
    [ 274.393987]
    [ 274.393991] oldregval (arm64_mpam_current) : 10001     //chunk 2
    [ 274.393993] regval (thread_info field) : 0
    [ 274.393996] pid : 0
    [ 274.393999] tgid : 0
    [ 274.393999] cpu id : 1
    [ 274.401977] SYS_MPAM0_EL1 before update : 10001
    [ 274.401980] SYS_MPAM0_EL1 after update : 0
    [ 274.401983]
    [ 274.401985] oldregval (arm64_mpam_current) : 0         //chunk 3
    [ 274.401985] regval (thread_info field) : 10001
    [ 274.401990] pid : 11
    [ 274.401992] tgid : 11
    [ 274.401995] cpu id : 1
    [ 274.401998] SYS_MPAM0_EL1 before update : 0
    [ 274.401998] SYS_MPAM0_EL1 after update : 10001
    [ 274.409975]
    [ 274.409978] oldregval (arm64_mpam_current) : 10001     //chunk 4
    [ 274.409980] regval (thread_info field) : 0
    [ 274.409983] pid : 0
    [ 274.409983] tgid : 0
    [ 274.409987] cpu id : 1
    [ 274.409990] SYS_MPAM0_EL1 before update : 10001
    [ 274.409992] SYS_MPAM0_EL1 after update : 0
    [ 274.409995]
The above log can be divided into four chunks, each captured when a thread was
being scheduled. The first chunk shows the task with tgid 11 (the tgid is the
value that user space sees as the PID) being scheduled on CPU 1. Since PID 11
was moved to the new PARTID space, partid_space_2, with PARTID 1, the
PARTID-PMG value stored in its thread_info field, mpam_partid_pmg, is 10001.
However, the last thread scheduled on this CPU belonged to the PARTID 0 group,
as indicated by the per-CPU variable (oldregval) in the logs. This is the same
value held in MPAM0_EL1. Since the two values differ, MPAM0_EL1 is written with
the new PARTID-PMG pair using write_sysreg_s, and arm64_mpam_current is updated
using the WRITE_ONCE macro to avoid store tearing and re-ordering.
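To make the value 10001 easier to read, the sketch below splits a register value into its PARTID and PMG fields. It rests on two assumptions that the text above does not state: that the debug log prints the value in hexadecimal (so 10001 means 0x10001), and that the fields follow the MPAM0_EL1 layout in the Arm architecture (PARTID_I in bits [15:0], PARTID_D in bits [31:16], PMG_I in bits [39:32], PMG_D in bits [47:40]). The macro, struct and function names are hypothetical.

```c
#include <stdint.h>

/* Field positions assumed from the architectural MPAM0_EL1 layout. */
#define MODEL_PARTID_I_SHIFT 0   /* instruction PARTID, 16 bits */
#define MODEL_PARTID_D_SHIFT 16  /* data PARTID, 16 bits        */
#define MODEL_PMG_I_SHIFT    32  /* instruction PMG, 8 bits     */
#define MODEL_PMG_D_SHIFT    40  /* data PMG, 8 bits            */

/* Hypothetical container for the decoded fields. */
struct mpam_label {
	uint16_t partid_i;
	uint16_t partid_d;
	uint8_t pmg_i;
	uint8_t pmg_d;
};

/* Extracts each field by shifting it down and masking to its width. */
static struct mpam_label mpam_decode(uint64_t regval)
{
	struct mpam_label label;

	label.partid_i = (uint16_t)((regval >> MODEL_PARTID_I_SHIFT) & 0xffff);
	label.partid_d = (uint16_t)((regval >> MODEL_PARTID_D_SHIFT) & 0xffff);
	label.pmg_i = (uint8_t)((regval >> MODEL_PMG_I_SHIFT) & 0xff);
	label.pmg_d = (uint8_t)((regval >> MODEL_PMG_D_SHIFT) & 0xff);
	return label;
}
```

Under these assumptions, mpam_decode(0x10001) yields PARTID 1 for both instruction and data accesses and PMG 0 for both, which is consistent with PID 11 having been moved to the group with PARTID 1.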
The next chunk shows the thread with tgid/PID 0 being scheduled on the same
CPU. PID 0 still belongs to the default PARTID space, so its thread_info field
does not match the PARTID-PMG value just programmed into
MPAM0_EL1/arm64_mpam_current. The default PARTID-PMG is therefore programmed
into MPAM0_EL1 and arm64_mpam_current again. Two more context switches were
captured, where chunk 3 is similar to chunk 1 and chunk 4 to chunk 2. The
kernel changes for MPAM are quite large; for brevity, only the most essential
parts have been covered in this documentation.
Copyright (c) 2022-2023, Arm Limited. All rights reserved.