VCP 6 Study Note – NUMA Systems with ESXi

ESXi supports memory access optimization for Intel and AMD Opteron processors in server architectures that support NUMA (non-uniform memory access).

NUMA architecture was developed for multiprocessor systems, where the time a processor needs to access a piece of data depends on where the memory sits relative to that processor. In a NUMA system a processor can access its own local memory quickly, and memory that is shared or handled by another processor more slowly. The design exists because modern CPUs are much faster than the memory bus, so processors that contend for a single shared memory end up waiting for data. NUMA addresses this by limiting the memory each processor reaches directly to its own local bank. Larger caches can also help by keeping frequently accessed data close to the processor, but large memory footprints and the access patterns of the software can cancel out the benefit of the cache.
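To make the cost of remote access concrete, here is a minimal sketch of the weighted average latency a processor sees for a given mix of local and remote accesses. The function name and the latency figures (80 ns local, 140 ns remote) are hypothetical values chosen for illustration, not measurements from any particular server.

```python
def average_access_ns(local_ns, remote_ns, local_fraction):
    """Weighted average memory latency for a mix of local and remote accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies: 80 ns for local memory, 140 ns for remote memory.
all_local = average_access_ns(80, 140, 1.0)
mostly_local = average_access_ns(80, 140, 0.75)
half_remote = average_access_ns(80, 140, 0.5)
print(all_local, mostly_local, half_remote)
```

The lower the local fraction, the closer the average drifts toward the remote latency, which is why the scheduler described below works to keep a virtual machine's memory on its home node.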

 

How ESXi NUMA Scheduling Works

  1. Each virtual machine managed by the NUMA scheduler is assigned a home node. A home node is one of the system’s NUMA nodes containing processors and local memory, as indicated by the ACPI System Resource Affinity Table (SRAT), which the VMware documentation calls the System Resource Allocation Table.
  2. When memory is allocated to a virtual machine, the ESXi host preferentially allocates it from the home node. The virtual CPUs of the virtual machine are constrained to run on the home node to maximize memory locality.
  3. The NUMA scheduler can dynamically change a virtual machine’s home node to respond to changes in system load. The scheduler might migrate a virtual machine to a new home node to reduce processor load imbalance. Because this might cause more of its memory to be remote, the scheduler might migrate the virtual machine’s memory dynamically to its new home node to improve memory locality. The NUMA scheduler might also swap virtual machines between nodes when this improves overall memory locality.
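The first two steps above can be sketched as a toy model: pick a home node, then allocate memory from it before spilling to other nodes. This is an illustration only, not the ESXi scheduler. The Node and VM classes, the function names, and the "lowest vCPU-to-core ratio" placement rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: int
    cores: int
    free_mb: int
    vcpus_placed: int = 0


@dataclass
class VM:
    name: str
    vcpus: int
    home: Optional[int] = None
    mem_mb: Dict[int, int] = field(default_factory=dict)  # node_id -> MB held there


def assign_home(vm: VM, nodes: List[Node]) -> None:
    """Pick the node with the lowest vCPU-to-core ratio; break ties by free memory."""
    node = min(nodes, key=lambda n: (n.vcpus_placed / n.cores, -n.free_mb))
    node.vcpus_placed += vm.vcpus
    vm.home = node.node_id


def allocate(vm: VM, nodes: List[Node], mb: int) -> None:
    """Allocate from the home node first and spill to other nodes only if needed."""
    if mb > sum(n.free_mb for n in nodes):
        raise MemoryError("not enough free memory on the host")
    # False sorts before True, so the home node comes first.
    for node in sorted(nodes, key=lambda n: n.node_id != vm.home):
        take = min(mb, node.free_mb)
        if take:
            node.free_mb -= take
            vm.mem_mb[node.node_id] = vm.mem_mb.get(node.node_id, 0) + take
            mb -= take


def locality(vm: VM) -> float:
    """Fraction of the VM's memory that sits on its home node."""
    total = sum(vm.mem_mb.values())
    return vm.mem_mb.get(vm.home, 0) / total if total else 1.0


nodes = [Node(0, cores=8, free_mb=1024), Node(1, cores=8, free_mb=4096)]
vm = VM("db", vcpus=4)
assign_home(vm, nodes)
allocate(vm, nodes, 5000)
print(vm.home, vm.mem_mb, locality(vm))
```

In this run the request is larger than the home node's free memory, so part of it lands on the other node and locality drops below 1. That is the situation in which step 3 (memory migration) becomes relevant.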

Some virtual machines are not managed by the ESXi NUMA scheduler: if you manually set the processor or memory affinity for a virtual machine, the NUMA scheduler might not be able to manage this virtual machine.

The optimizations work seamlessly regardless of the type of guest operating system. ESXi provides NUMA support even to virtual machines that do not support NUMA hardware, such as Windows NT 4.0. As a result, you can take advantage of new hardware even with legacy operating systems.

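A virtual machine that has more virtual processors than the number of physical processor cores available on a single hardware node can still be managed automatically. The NUMA scheduler lets such a virtual machine span NUMA nodes by splitting it into multiple NUMA clients, each of which is assigned to a node and managed like a normal, non-spanning client.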

 

VMware NUMA Optimization Algorithms

When a virtual machine is powered on, ESXi assigns it a home node. ESXi combines the traditional initial placement approach with a dynamic rebalancing algorithm. Periodically (every two seconds by default), the system examines the loads of the various nodes and determines if it should rebalance the load by moving a virtual machine from one node to another.
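A rough sketch of what one rebalancing pass could look like: compare the most and least loaded nodes and, if moving a single virtual machine narrows the gap, propose that move. The selection rule, the function name, and the load units are assumptions for illustration; the real algorithm is not described in this note.

```python
def pick_migration(node_load, vm_home, vm_load):
    """Return (vm, source_node, target_node) for a move that narrows the load gap,
    or None if no single move helps."""
    src = max(node_load, key=node_load.get)
    dst = min(node_load, key=node_load.get)
    gap = node_load[src] - node_load[dst]
    # Moving a VM with load L changes the gap to |gap - 2L|,
    # so only VMs with L < gap actually reduce the imbalance.
    candidates = [vm for vm, home in vm_home.items()
                  if home == src and vm_load[vm] < gap]
    if not candidates:
        return None
    best = min(candidates, key=lambda vm: abs(gap - 2 * vm_load[vm]))
    return best, src, dst


node_load = {0: 10, 1: 2}
vm_home = {"a": 0, "b": 0, "c": 1}
vm_load = {"a": 6, "b": 4, "c": 2}
print(pick_migration(node_load, vm_home, vm_load))
```

A scheduler would call something like this once per rebalancing interval (two seconds by default, per the text above) and then update the chosen virtual machine's home node.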

Rebalancing is an effective solution to maintain fairness and ensure that all nodes are fully used. The rebalancer might need to move a virtual machine to a node on which it has allocated little or no memory. In this case, the virtual machine incurs a performance penalty associated with a large number of remote memory accesses. ESXi can eliminate this penalty by transparently migrating memory from the virtual machine’s original node to its new home node:

  1. The system selects a page (4 KB of contiguous memory) on the original node and copies its data to a page in the destination node.
  2. The system uses the virtual machine monitor layer and the processor’s memory management hardware to seamlessly remap the virtual machine’s view of memory, so that it uses the page on the destination node for all further references, eliminating the penalty of remote memory access.
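The copy-then-remap sequence in the two steps above can be modelled in a few lines. This is a simplified sketch, not the hypervisor's implementation: the page table is a plain dictionary from a guest page number to a (node, frame) pair, and all names are hypothetical.

```python
PAGE_SIZE = 4096  # 4 KB pages, as in the steps above


def migrate_page(page_table, memory, free_frames, guest_page, dest_node):
    """Copy one guest page to dest_node and remap it. Returns True if it moved."""
    src_node, src_frame = page_table[guest_page]
    if src_node == dest_node:
        return False  # already local, nothing to do
    # Step 1: copy the data into a free frame on the destination node.
    dst_frame = free_frames[dest_node].pop()
    memory[dest_node][dst_frame] = bytes(memory[src_node][src_frame])
    # Step 2: remap, so later references to this guest page hit the new frame.
    page_table[guest_page] = (dest_node, dst_frame)
    # Release the old frame on the original node.
    del memory[src_node][src_frame]
    free_frames[src_node].append(src_frame)
    return True


memory = {0: {5: b"x" * PAGE_SIZE}, 1: {}}   # node -> frame -> contents
free_frames = {0: [], 1: [9]}                # node -> list of free frame numbers
page_table = {42: (0, 5)}                    # guest page -> (node, frame)
migrate_page(page_table, memory, free_frames, guest_page=42, dest_node=1)
print(page_table[42])
```

The guest keeps using the same guest page number throughout; only the backing frame changes, which is what makes the migration transparent to the guest operating system.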

Many ESXi workloads present opportunities for sharing memory across virtual machines. Transparent page sharing for ESXi systems has also been optimized for use on NUMA systems. On NUMA systems, pages are shared per-node, so each NUMA node has its own local copy of heavily shared pages. When virtual machines use shared pages, they don’t need to access remote memory.
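Per-node sharing can be pictured as a content index keyed by both the NUMA node and the page contents, so identical pages are deduplicated within a node but never across nodes. The class below is a toy model under that assumption. It uses a SHA-256 digest as the content key for brevity, which is a choice made for this example and not a statement about how ESXi hashes or compares pages.

```python
import hashlib


class PerNodePageSharing:
    """Toy content index: one shared copy per (node, page content)."""

    def __init__(self):
        self._shared = {}  # (node, digest) -> [content, reference count]

    def map_page(self, node, content):
        """Map a page on `node`, reusing that node's existing copy if one exists."""
        key = (node, hashlib.sha256(content).digest())
        entry = self._shared.setdefault(key, [content, 0])
        entry[1] += 1
        return key

    def copies(self):
        """Number of physical copies currently held across all nodes."""
        return len(self._shared)

    def refcount(self, key):
        return self._shared[key][1]


tps = PerNodePageSharing()
zero_page = bytes(4096)
k0 = tps.map_page(0, zero_page)   # VM A on node 0
tps.map_page(0, zero_page)        # VM B on node 0 shares A's copy
k1 = tps.map_page(1, zero_page)   # VM C on node 1 gets a node-local copy
print(tps.copies(), tps.refcount(k0), tps.refcount(k1))
```

Three mappings of the same content end up as two physical copies, one per node, which trades a little memory for the guarantee that shared pages are always local.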

You can perform resource management with different types of NUMA architecture. Typically, you can use BIOS settings to enable and disable NUMA behavior. For scheduling fairness, NUMA optimizations are not enabled for systems with too few cores per NUMA node or too few cores overall. You can modify the numa.rebalancecorestotal and numa.rebalancecoresnode options to change this behavior.

 

Virtual NUMA

vSphere 5.0 and later includes support for exposing virtual NUMA topology to guest operating systems, which can improve performance by facilitating guest operating system and application NUMA optimizations.

When the number of virtual CPUs and the amount of memory used grow proportionately, you can use the default values. For virtual machines that consume a disproportionally large amount of memory, you can override the default values in one of the following ways:

  1. Increase the number of virtual CPUs, even if this number of virtual CPUs is not used (up to 128 vCPUs).
  2. Use advanced options to control virtual NUMA topology and its mapping over physical NUMA topology
    1. cpuid.coresPerSocket (number of virtual cores per virtual CPU socket)
    2. numa.vcpu.maxPerVirtualNode (If cpuid.coresPerSocket is too restrictive as a power of two, you can set numa.vcpu.maxPerVirtualNode directly. In this case, do not set cpuid.coresPerSocket)
    3. numa.autosize (if true the virtual NUMA topology has the same number of virtual CPUs per virtual node as there are cores on each physical node)
    4. numa.autosize.once (if true the settings are guaranteed to remain the same every time you subsequently power on the virtual machine)
    5. numa.vcpu.min (Minimum number of virtual CPUs in a virtual machine that are required in order to generate a virtual NUMA topology)
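As a rough guide to how these options interact, the sketch below estimates how many virtual NUMA nodes a virtual machine would be given from its vCPU count, a numa.vcpu.maxPerVirtualNode value, and a numa.vcpu.min threshold. It is a simplified model based only on the option descriptions above: the function name is hypothetical, the values are passed in explicitly instead of assuming defaults, and it ignores numa.autosize and the physical topology.

```python
import math


def virtual_numa_nodes(num_vcpus, max_per_virtual_node, vcpu_min):
    """Estimate the number of virtual NUMA nodes exposed to the guest.

    Below the numa.vcpu.min threshold no virtual NUMA topology is generated,
    which this model represents as a single node.
    """
    if num_vcpus < vcpu_min:
        return 1
    return math.ceil(num_vcpus / max_per_virtual_node)


# Hypothetical settings: at most 8 vCPUs per virtual node, threshold of 9 vCPUs.
for vcpus in (8, 12, 16, 20):
    print(vcpus, virtual_numa_nodes(vcpus, max_per_virtual_node=8, vcpu_min=9))
```

Lowering the per-node maximum, or adding vCPUs as in option 1 above, increases the number of virtual nodes in this model, which is how a memory-heavy virtual machine can be spread over more nodes.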

 

NUMA controls

If you have applications that use a lot of memory or have a small number of virtual machines, you might want to optimize performance by specifying virtual machine CPU and memory placement explicitly.

The vSphere Web Client lets you specify the following options:

  • NUMA Node Affinity
  • CPU Affinity
  • Memory Affinity