Non-Uniform Memory Access (NUMA)
Frequently Asked Questions
- What does NUMA stand for?
- OK, so what does Non-Uniform Memory Access really mean?
- What is the difference between NUMA and SMP?
- What is the difference between NUMA and ccNUMA?
- What is a node?
- What is meant by local and remote memory?
- What do you mean by distance?
- Could you give a real-world analogy of the NUMA architecture to help understand all these terms?
- Why should I use NUMA? What are the benefits of NUMA?
- What are the peculiarities of NUMA?
- What are some alternatives to NUMA?
- Could you give a brief description of the main NUMA architecture implementations?
Frequently Given Answers
- What does NUMA stand for?
NUMA stands for Non-Uniform Memory Access.
- OK, so what does Non-Uniform Memory Access really mean to me?
Non-Uniform Memory Access means that it takes longer to access some regions
of memory than others, because some regions of memory are on physically
different busses from other regions. As a result, programs that are not
NUMA-aware can perform poorly. It also introduces the concept of local and
remote memory. For a more visual description, please refer to the section on
NUMA architecture implementations. Also, see the real-world analogy for the
NUMA architecture.
- What is the difference between NUMA and SMP?
The NUMA architecture was designed to surpass the scalability limits of the
SMP architecture. With SMP, which stands for Symmetric Multi-Processing,
all memory accesses are posted to the same shared memory bus. This works fine
for a relatively small number of CPUs, but the shared bus becomes a
bottleneck when you have dozens, even hundreds, of CPUs competing for access
to it. NUMA alleviates these bottlenecks by limiting the number of CPUs on
any one memory bus, and connecting the various nodes by means of a
high-speed interconnect.
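
As a rough illustration of the bus-sharing argument above, here is a minimal sketch. It assumes each memory bus has a fixed bandwidth that is split evenly among the CPUs attached to it, and it ignores caches and interconnect traffic entirely. The function names and the bandwidth figure are hypothetical, chosen only for this example.

```python
def per_cpu_bandwidth(bus_bandwidth, cpus_on_bus):
    """Even share of one memory bus among the CPUs attached to it."""
    return bus_bandwidth / cpus_on_bus


def split_into_nodes(total_cpus, cpus_per_node):
    """Return the number of CPUs placed on each node's own memory bus."""
    full_nodes, leftover = divmod(total_cpus, cpus_per_node)
    layout = [cpus_per_node] * full_nodes
    if leftover:
        layout.append(leftover)
    return layout


BUS_BANDWIDTH = 1000.0  # hypothetical units per memory bus

# SMP: all 64 CPUs compete for the single shared bus.
smp_share = per_cpu_bandwidth(BUS_BANDWIDTH, 64)

# NUMA: 16 nodes, each with its own bus shared by only 4 CPUs.
numa_layout = split_into_nodes(64, 4)
numa_share = per_cpu_bandwidth(BUS_BANDWIDTH, numa_layout[0])

print(smp_share, numa_share)
```

In this toy model each CPU's share of its bus grows as fewer CPUs share it, which is the scalability point made above. Real machines pay for that with slower access to memory on other nodes.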
- What is the difference between NUMA and ccNUMA?
The difference is almost nonexistent at this point. ccNUMA stands for
Cache-Coherent NUMA, but NUMA and ccNUMA have really come to be synonymous.
Non-cache-coherent NUMA machines have almost no applications and are a real
pain to program for, so unless specifically stated otherwise, NUMA actually
means ccNUMA.
- What is a node?
One of the problems with describing NUMA is that there are many different ways
to implement this technology. This has led to a plethora of "definitions" for
node. A fairly technically correct and also fairly ugly definition of a
node is: a region of memory in which every byte has the same distance
from each CPU. A more common definition is: a block of memory and the CPUs,
I/O, etc. physically on the same bus as the memory. Some architectures do not
have memory, CPUs, and I/O all on the same physical bus, so the second
definition does not truly hold. In many cases, the less technical definition
should be sufficient, but often the technical definition is more correct.
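
The technical definition can be sketched in code: memory regions whose distances to every CPU are identical belong to the same node. This is only an illustration. The region names and distance values are made up, and `group_into_nodes` is a hypothetical helper, not part of any real API.

```python
from collections import defaultdict


def group_into_nodes(region_distances):
    """Group memory regions that have identical distances to every CPU.

    region_distances maps a region name to a tuple holding that region's
    distance from CPU 0, CPU 1, and so on.
    """
    nodes = defaultdict(list)
    for region, distances in region_distances.items():
        nodes[tuple(distances)].append(region)
    return [sorted(regions) for regions in nodes.values()]


# Hypothetical 4-CPU machine: regions A and B are close to CPUs 0-1,
# region C is close to CPUs 2-3.
example = {
    "A": (1, 1, 2, 2),
    "B": (1, 1, 2, 2),
    "C": (2, 2, 1, 1),
}

nodes = group_into_nodes(example)
print(nodes)  # [['A', 'B'], ['C']]
```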
- What is meant by local and remote memory?
The terms local memory and remote memory are typically used in
reference to a currently running process. That said, local memory is
typically defined to be the memory that is on the same node as the CPU
currently running the process. Any memory that does not belong to the node
on which the process is currently running is then, by that definition,
remote.
Local and remote memory can also be used in reference to things
other than the currently running process. When in interrupt context, there
technically is no currently executing process, but memory on the node
containing the CPU handling the interrupt is still called local memory.
Also, you could use local and remote memory in terms of a disk.
For example, if a disk attached to node 1 were doing a DMA, the memory
it is reading or writing would be called remote if it were located on
another node (e.g., node 0).
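
The distinction reduces to comparing two node numbers: the node of whatever is doing the access (a CPU running a process, a CPU handling an interrupt, or a device doing DMA) and the node that holds the memory. The sketch below assumes a hypothetical two-node machine. The mapping tables and names are invented for illustration.

```python
# Hypothetical topology: which node each CPU and each device is attached to.
CPU_TO_NODE = {0: 0, 1: 0, 2: 1, 3: 1}
DEVICE_TO_NODE = {"disk0": 1}


def memory_kind(accessor_node, memory_node):
    """Classify memory as seen from the node performing the access."""
    return "local" if accessor_node == memory_node else "remote"


# A process running on CPU 2 (node 1) touching memory on node 1.
process_view = memory_kind(CPU_TO_NODE[2], 1)

# A disk on node 1 doing DMA into memory that lives on node 0.
dma_view = memory_kind(DEVICE_TO_NODE["disk0"], 0)

print(process_view, dma_view)  # local remote
```

Note that the answer depends on where the accessor currently is: if the process migrates from CPU 2 to CPU 0, the same memory on node 1 becomes remote.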
- What do you mean by distance?
NUMA-based architectures necessarily introduce a notion of distance
between system components (e.g., CPUs, memory, and I/O busses). The metric used
to determine a distance varies, but hops is a popular metric, along
with latency and bandwidth. These terms all mean essentially the same thing
that they do when used in a networking context (mostly because a NUMA machine
is not all that different from a very tightly coupled cluster). So when used
to describe a node, we could say that a particular range of memory is 2
hops (busses) from CPUs 0..3 and SCSI Controller 0. Thus, CPUs 0..3 and the
SCSI Controller are a part of the same node.
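
When hops are the metric, distance can be computed the same way as in a network: count the links on the shortest path between two components. The sketch below does this with a breadth-first search over a hypothetical four-node ring interconnect. The topology is an assumption for illustration and does not describe any particular machine.

```python
from collections import deque


def hop_distances(links, start):
    """Return the hop count from `start` to every reachable node (BFS)."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical interconnect: four nodes joined in a ring 0-1-2-3-0.
RING = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

from_node_0 = hop_distances(RING, 0)
print(from_node_0)
```

Running the search from every node gives a full distance table. Latency or bandwidth could replace hop counts as the edge weight, at the cost of needing a weighted shortest-path search instead of plain BFS.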
- Could you give a real-world analogy of the NUMA architecture to help understand all these terms?
Imagine that you are baking a cake. You have a group of ingredients (= memory
pages) that you need to complete the recipe (= process). Some of the
ingredients you may have in your cabinet (= local memory), but some of the
ingredients you might not have, and have to ask a neighbor for (= remote
memory). The general idea is to try to have as many of the ingredients in
your own cabinet as possible, since this reduces your time and effort in
making the cake.
You also have to remember that your cabinets can only hold a fixed amount of
ingredients (= physical nodal memory). If you try to buy more, but you have
no room to store it, you may have to ask your neighbor to keep it in his/her
cabinet until you need it (= local memory full, so allocate pages remotely).
A bit of a strange example, I'll admit, but I think it works. If you have a
better analogy, I'm all ears! ;)
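
The cabinet rule (use your own cabinet first, then the nearest neighbor's) can be written as a tiny allocator. This is a sketch of a local-first policy under simplifying assumptions. Real operating systems offer several allocation policies, and the function name and data layout here are hypothetical.

```python
def allocate_page(free_pages, distance, home_node):
    """Take one page from the closest node that still has free pages.

    free_pages: list of free page counts per node (updated in place).
    distance:   distance[a][b] is the distance from node a to node b.
    Returns the node the page came from, or None if all nodes are full.
    """
    by_closeness = sorted(
        range(len(free_pages)), key=lambda node: distance[home_node][node]
    )
    for node in by_closeness:
        if free_pages[node] > 0:
            free_pages[node] -= 1
            return node
    return None


# Hypothetical two-node machine: node 0 has one free page, node 1 has two.
free = [1, 2]
dist = [[0, 1], [1, 0]]

# A process on node 0 asks for four pages in a row.
placements = [allocate_page(free, dist, 0) for _ in range(4)]
print(placements)  # [0, 1, 1, None]
```

The first page is local, the next two spill over to the neighbor once the local cabinet is full, and the last request fails because every node is out of space.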
- Why should I use NUMA? What are the benefits of NUMA?
The main benefit of NUMA is, as mentioned above, scalability. It is extremely
difficult to scale SMP past 8-12 CPUs. At that number of CPUs, the memory bus
is under heavy contention. NUMA is one way of reducing the number of CPUs
competing for access to a shared memory bus. This is accomplished by having
several memory busses and only having a small number of CPUs on each of those
busses. There are other ways of building massively multiprocessor machines,
but this is a NUMA FAQ, so we'll leave the discussion of other methods to
other FAQs.
- What are the peculiarities of NUMA?
CPU and/or node caches can result in NUMA effects. For example, the CPUs on a
particular node will have a higher bandwidth and/or a lower latency to access
the memory and CPUs on that same node. Due to this, you can see things like
lock starvation under high contention. This is because if CPU x in the node
requests a lock already held by another CPU y in the node, its request will
tend to beat out a request from a remote CPU z.
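
The starvation effect can be shown with a deliberately extreme toy model. It assumes every CPU re-requests the lock the instant it releases it, and that the lock always goes to the waiter whose request reaches the holder's node fastest. Real hardware only tends to favor local requests, so treat this as a worst case. The latency values and function names are hypothetical.

```python
LOCAL_LATENCY = 1   # hypothetical cost of a request within one node
REMOTE_LATENCY = 3  # hypothetical cost of a request crossing nodes


def request_latency(requester_node, holder_node):
    """Latency for a lock request to reach the node holding the lock."""
    return LOCAL_LATENCY if requester_node == holder_node else REMOTE_LATENCY


def simulate_lock(cpu_to_node, first_holder, handoffs):
    """Count lock acquisitions when the fastest request always wins."""
    acquisitions = {cpu: 0 for cpu in cpu_to_node}
    holder = first_holder
    acquisitions[holder] += 1
    for _ in range(handoffs):
        # Every other CPU is waiting; ties are broken by CPU name.
        waiters = [cpu for cpu in cpu_to_node if cpu != holder]
        holder = min(
            waiters,
            key=lambda cpu: (
                request_latency(cpu_to_node[cpu], cpu_to_node[holder]),
                cpu,
            ),
        )
        acquisitions[holder] += 1
    return acquisitions


# CPUs x and y share node 0; CPU z sits alone on node 1.
counts = simulate_lock({"x": 0, "y": 0, "z": 1}, "x", 10)
print(counts)  # {'x': 6, 'y': 5, 'z': 0}
```

In this model x and y pass the lock back and forth and z never acquires it, which is the starvation pattern described above taken to its limit.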
- What are some alternatives to NUMA?
Splitting memory up and (possibly arbitrarily) assigning it to groups of
CPUs can give some performance benefits similar to actual NUMA. A setup like
this would be like a regular NUMA machine where the line between local
and remote memory is blurred, since all the memory is actually on the
same bus. The PowerPC Regatta system is an example of this.
You can achieve some NUMA-like performance by using clusters as well. A
cluster is very similar to a NUMA machine, where each individual machine in
the cluster becomes a node in our virtual NUMA machine. The only real
difference is the nodal latency. In a clustered environment, the latency and
bandwidth on the internodal links are likely to be much worse.
- Could you give a brief description of the main NUMA architecture implementations?
Sure! The main types are IBM NUMA-Q, Compaq Wildfire, and SGI MIPS64.
Descriptions and diagrams of these system types, and of a standard SMP
system for comparison, are linked from the original FAQ page.
Source:
http://lse.sourceforge.net/numa/faq/