TL;DR
This blog post explains how computers running the Linux kernel receive packets, as well as how to monitor and tune each component of the networking stack as packets flow from the network toward userland programs.
UPDATE We’ve released the counterpart to this post: Monitoring and Tuning the Linux Networking Stack: Sending Data.
UPDATE Take a look at the Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data, which adds some diagrams for the information presented below.
It is impossible to tune or monitor the Linux networking stack without reading the source code of the kernel and having a deep understanding of what exactly is happening.
This blog post will hopefully serve as a reference to anyone looking to do this.
Special thanks
Special thanks to the folks at Private Internet Access who hired us to research this information in conjunction with other network research and who have graciously allowed us to build upon the research and publish this information.
The information presented here builds upon the work done for Private Internet Access, which was originally published as a 5 part series starting here.
General advice on monitoring and tuning the Linux networking stack
UPDATE We’ve released the counterpart to this post: Monitoring and Tuning the Linux Networking Stack: Sending Data.
UPDATE Take a look at the Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data, which adds some diagrams for the information presented below.
The networking stack is complex and there is no one size fits all solution. If the performance and health of your networking is critical to you or your business, you will have no choice but to invest a considerable amount of time, effort, and money into understanding how the various parts of the system interact.
Ideally, you should consider measuring packet drops at each layer of the network stack. That way you can determine and narrow down which component needs to be tuned.
This is where, I think, many operators go off track: the assumption is made that a set of sysctl settings or /proc
values can simply be reused wholesale. In some cases, perhaps, but it turns out that the entire system is so nuanced and intertwined that if you desire to have meaningful monitoring or tuning, you must strive to understand how the system functions at a deep level. Otherwise, you can simply use the default settings, which should be good enough until further optimization (and the required investment to deduce those settings) is necessary.
Many of the example settings provided in this blog post are used solely for illustrative purposes and are not a recommendation for or against a certain configuration or default setting. Before adjusting any setting, you should develop a frame of reference around what you need to be monitoring to notice a meaningful change.
Adjusting networking settings while connected to the machine over a network is dangerous; you could very easily lock yourself out or completely take out your networking. Do not adjust these settings on production machines; instead make adjustments on new machines and rotate them into production, if possible.
Overview
For reference, you may want to have a copy of the device data sheet handy. This post will examine the Intel I350 Ethernet controller, controlled by the igb
device driver. You can find that data sheet (warning: LARGE PDF) here for your reference.
The high level path a packet takes from arrival to socket receive buffer is as follows:
- Driver is loaded and initialized.
- Packet arrives at the NIC from the network.
- Packet is copied (via DMA) to a ring buffer in kernel memory.
- Hardware interrupt is generated to let the system know a packet is in memory.
- Driver calls into NAPI to start a poll loop if one was not running already.
ksoftirqd
processes run on each CPU on the system. They are registered at boot time. Theksoftirqd
processes pull packets off the ring buffer by calling the NAPIpoll
function that the device driver registered during initialization.- Memory regions in the ring buffer that have had network data written to them are unmapped.
- Data that was DMA’d into memory is passed up the networking layer as an ‘skb’ for more processing.
- Incoming network data frames are distributed among multiple CPUs if packet steering is enabled or if the NIC has multiple receive queues.
- Network data frames are handed to the protocol layers from the queues.
- Protocol layers process data.
- Data is added to receive buffers attached to sockets by protocol layers.
This entire flow will be examined in detail in the following sections.
The protocol layers examined below are the IP and UDP protocol layers. Much of the information presented will serve as a reference for other protocol layers, as well.
Detailed Look
UPDATE We’ve released the counterpart to this post: Monitoring and Tuning the Linux Networking Stack: Sending Data.
UPDATE Take a look at the Illustrated Guide to Monitoring and Tuning the Linux Networking Stack: Receiving Data, which adds some diagrams for the information presented below.
This blog post will be examining the Linux kernel version 3.13.0 with links to code on GitHub and code snippets throughout this post.
Understanding exactly how packets are received in the Linux kernel is very involved. We’ll need to closely examine and understand how a network driver works, so that parts of the network stack later are more clear.
This blog post will look at the igb
network driver. This driver is used for a relatively common server NIC, the Intel Ethernet Controller I350. So, let’s start by understanding how the igb
network driver works.
Network Device Driver
Initialization
A driver registers an initialization function which is called by the kernel when the driver is loaded. This function is registered by using the module_init
macro.
The igb
initialization function (igb_init_module
) and its registration with module_init
can be found in drivers/net/ethernet/intel/igb/igb_main.c.
Both are fairly straightforward:
The bulk of the work to initialize the device happens with the call to pci_register_driver
as we’ll see next.
PCI initialization
The Intel I350 network card is a PCI express device.
PCI devices identify themselves with a series of registers in the PCI Configuration Space.
When a device driver is compiled, a macro named MODULE_DEVICE_TABLE
(from include/module.h
) is used to export a table of PCI device IDs identifying devices that the device driver can control. The table is also registered as part of a structure, as we’ll see shortly.
The kernel uses this table to determine which device driver to load to control the device.
That’s how the OS can figure out which devices are connected to the system and which driver should be used to talk to the device.
This table and the PCI device IDs for the igb
driver can be found in drivers/net/ethernet/intel/igb/igb_main.c
and drivers/net/ethernet/intel/igb/e1000_hw.h
, respectively:
As seen in the previous section, pci_register_driver
is called in the driver’s initialization function.
This function registers a structure of pointers. Most of the pointers are function pointers, but the PCI device ID table is also registered. The kernel uses the functions registered by the driver to bring the PCI device up.
From drivers/net/ethernet/intel/igb/igb_main.c
:
PCI probe
Once a device has been identified by its PCI IDs, the kernel can then select the proper driver to use to control the device. Each PCI driver registers a probe function with the PCI system in the kernel. The kernel calls this function for devices which have not yet been claimed by a device driver. Once a device is claimed, other drivers will not be asked about the device. Most drivers have a lot of code that runs to get the device ready for use. The exact things done vary from driver to driver.
Some typical operations to perform include:
- Enabling the PCI device.
- Requesting memory ranges and IO ports.
- Setting the DMA mask.
- The ethtool (described more below) functions the driver supports are registered.
- Any watchdog tasks needed (for example, e1000e has a watchdog task to check if the hardware is hung).
- Other device specific stuff like workarounds or dealing with hardware specific quirks or similar.
- The creation, initialization, and registration of a
struct net_device_ops
structure. This structure contains function pointers to the various functions needed for opening the device, sending data to the network, setting the MAC address, and more. - The creation, initialization, and registration of a high level
struct net_device
which represents a network device.
Let’s take a quick look at some of these operations in the igb
driver in the function igb_probe
.
A peek into PCI initialization
The following code from the igb_probe
function does some basic PCI configuration. From drivers/net/ethernet/intel/igb/igb_main.c:
First, the device is initialized with pci_enable_device_mem
. This will wake up the device if it is suspended, enable memory resources, and more.
Next, the DMA mask will be set. This device can read and write to 64bit memory addresses, so dma_set_mask_and_coherent
is called with DMA_BIT_MASK(64)
.
Memory regions will be reserved with a call to pci_request_selected_regions
, PCI Express Advanced Error Reporting is enabled (if the PCI AER driver is loaded), DMA is enabled with a call to pci_set_master
, and the PCI configuration space is saved with a call to pci_save_state
.
Phew.
More Linux PCI driver information
Going into the full explanation of how PCI devices work is beyond the scope of this post, but this excellent talk, this wiki, and this text file from the linux kernel are excellent resources.
Network device initialization
The igb_probe
function does some important network device initialization. In addition to the PCI specific work, it will do more general networking and network device work:
- The
struct net_device_ops
is registered. ethtool
operations are registered.- The default MAC address is obtained from the NIC.
net_device
feature flags are set.- And lots more.
Let’s take a look at each of these as they will be interesting later.
struct net_device_ops
The struct net_device_ops
contains function pointers to lots of important operations that the network subsystem needs to control the device. We’ll be mentioning this structure many times throughout the rest of this post.
This net_device_ops
structure is attached to a struct net_device
in igb_probe
. From drivers/net/ethernet/intel/igb/igb_main.c)
And the functions that this net_device_ops
structure holds pointers to are set in the same file. From drivers/net/ethernet/intel/igb/igb_main.c:
As you can see, there are several interesting fields in this struct
like ndo_open
, ndo_stop
, ndo_start_xmit
, and ndo_get_stats64
which hold the addresses of functions implemented by the igb
driver.
We’ll be looking at some of these in more detail later.
ethtool
registration
ethtool
is a command line program you can use to get and set various driver and hardware options. You can install it on Ubuntu by running apt-get install ethtool
.
A common use of ethtool
is to gather detailed statistics from network devices. Other ethtool
settings of interest will be described later.
The ethtool
program talks to device drivers by using the ioctl
system call. The device drivers register a series of functions that run for the ethtool
operations and the kernel provides the glue.
When an ioctl
call is made from ethtool
, the kernel finds the ethtool
structure registered by the appropriate driver and executes the functions registered. The driver’s ethtool
function implementation can do anything from change a simple software flag in the driver to adjusting how the actual NIC hardware works by writing register values to the device.
The igb
driver registers its ethtool
operations in igb_probe
by calling igb_set_ethtool_ops
:
All of the igb
driver’s ethtool
code can be found in the file drivers/net/ethernet/intel/igb/igb_ethtool.c
along with the igb_set_ethtool_ops
function.
From drivers/net/ethernet/intel/igb/igb_ethtool.c
:
Above that, you can find the igb_ethtool_ops
structure with the ethtool
functions the igb
driver supports set to the appropriate fields.
From drivers/net/ethernet/intel/igb/igb_ethtool.c
:
It is up to the individual drivers to determine which ethtool
functions are relevant and which should be implemented. Not all drivers implement all ethtool
functions, unfortunately.
One interesting ethtool
function is get_ethtool_stats
, which (if implemented) produces detailed statistics counters that are tracked either in software in the driver or via the device itself.
The monitoring section below will show how to use ethtool
to access these detailed statistics.
IRQs
When a data frame is written to RAM via DMA, how does the NIC tell the rest of the system that data is ready to be processed?
Traditionally, a NIC would generate an interrupt request (IRQ) indicating data had arrived. There are three common types of IRQs: MSI-X, MSI, and legacy IRQs. These will be touched upon shortly. A device generating an IRQ when data has been written to RAM via DMA is simple enough, but if large numbers of data frames arrive this can lead to a large number of IRQs being generated. The more IRQs that are generated, the less CPU time is available for higher level tasks like user processes.
The New Api (NAPI) was created as a mechanism for reducing the number of IRQs generated by network devices on packet arrival. While NAPI reduces the number of IRQs, it cannot eliminate them completely.
We’ll see why that is, exactly, in later sections.
NAPI
NAPI differs from the legacy method of harvesting data in several important ways. NAPI allows a device driver to register a poll
function that the NAPI subsystem will call to harvest data frames.
The intended use of NAPI in network device drivers is as follows:
- NAPI is enabled by the driver, but is in the off position initially.
- A packet arrives and is DMA’d to memory by the NIC.
- An IRQ is generated by the NIC which triggers the IRQ handler in the driver.
- The driver wakes up the NAPI subsystem using a softirq (more on these later). This will begin harvesting packets by calling the driver’s registered
poll
function in a separate thread of execution. - The driver should disable further IRQs from the NIC. This is done to allow the NAPI subsystem to process packets without interruption from the device.
- Once there is no more work to do, the NAPI subsystem is disabled and IRQs from the device are re-enabled.
- The process starts back at step 2.
This method of gathering data frames has reduced overhead compared to the legacy method because many data frames can be consumed at a time without having to deal with processing each of them one IRQ at a time.
The device driver implements a poll
function and registers it with NAPI by calling netif_napi_add
. When registering a NAPI poll
function with netif_napi_add
, the driver will also specify the weight
. Most of the drivers hardcode a value of 64
. This value and its meaning will be described in more detail below.
Typically, drivers register their NAPI poll
functions during driver initialization.
NAPI initialization in the igb
driver
The igb
driver does this via a long call chain:
igb_probe
callsigb_sw_init
.igb_sw_init
callsigb_init_interrupt_scheme
.igb_init_interrupt_scheme
callsigb_alloc_q_vectors
.igb_alloc_q_vectors
callsigb_alloc_q_vector
.igb_alloc_q_vector
callsnetif_napi_add
.
This call trace results in a few high level things happening:
- If MSI-X is supported, it will be enabled with a call to
pci_enable_msix
. - Various settings are computed and initialized; most notably the number of transmit and receive queues that the device and driver will use for sending and receiving packets.
igb_alloc_q_vector
is called once for every transmit and receive queue that will be created.- Each call to
igb_alloc_q_vector
callsnetif_napi_add
to register apoll
function for that queue and an instance ofstruct napi_struct
that will be passed topoll
when called to harvest packets.
Let’s take a look at igb_alloc_q_vector
to see how the poll
callback and its private data are registered.
From drivers/net/ethernet/intel/igb/igb_main.c:
The above code is allocation memory for a receive queue and registering the function igb_poll
with the NAPI subsystem. It provides a reference to the struct napi_struct
associated with this newly created RX queue (&q_vector->napi
above). This will be passed into igb_poll
when called by the NAPI subsystem when it comes time to harvest packets from this RX queue.
This will be important later when we examine the flow of data from drivers up the network stack.
Bringing a network device up
Recall the net_device_ops
structure we saw earlier which registered a set of functions for bringing the network device up, transmitting packets, setting the MAC address, etc.
When a network device is brought up (for example, with ifconfig eth0 up
), the function attached to the ndo_open
field of the net_device_ops
structure is called.
The ndo_open
function will typically do things like:
- Allocate RX and TX queue memory
- Enable NAPI
- Register an interrupt handler
- Enable hardware interrupts
- And more.
In the case of the igb
driver, the function attached to the ndo_open
field of the net_device_ops
structure is called igb_open
.
Preparing to receive data from the network
Most NICs you’ll find today will use DMA to write data directly into RAM where the OS can retrieve the data for processing. The data structure most NICs use for this purpose resembles a queue built on circular buffer (or a ring buffer).
In order to do this, the device driver must work with the OS to reserve a region of memory that the NIC hardware can use. Once this region is reserved, the hardware is informed of its location and incoming data will be written to RAM where it will later be picked up and processed by the networking subsystem.
This seems simple enough, but what if the packet rate was high enough that a single CPU was not able to properly process all incoming packets? The data structure is built on a fixed length region of memory, so incoming packets would be dropped.
This is where something known as known as Receive Side Scaling (RSS) or multiqueue can help.
Some devices have the ability to write incoming packets to several different regions of RAM simultaneously; each region is a separate queue. This allows the OS to use multiple CPUs to process incoming data in parallel, starting at the hardware level. This feature is not supported by all NICs.
The Intel I350 NIC does support multiple queues. We can see evidence of this in the igb
driver. One of the first things the igb
driver does when it is brought up is call a function named igb_setup_all_rx_resources
. This function calls another function, igb_setup_rx_resources
, once for each RX queue to arrange for DMA-able memory where the device will write incoming data.
If you are curious how exactly this works, please see the Linux kernel’s DMA API HOWTO.
It turns out the number and size of the RX queues can be tuned by using ethtool
. Tuning these values can have a noticeable impact on the number of frames which are processed vs the number of frames which are dropped.
The NIC uses a hash function on the packet header fields (like source, destination, port, etc) to determine which RX queue the data should be directed to.
Some NICs let you adjust the weight of the RX queues, so you can send more traffic to specific queues.
Fewer NICs let you adjust this hash function itself. If you can adjust the hash function, you can send certain flows to specific RX queues for processing or even drop the packets at the hardware level, if desired.
We’ll take a look at how to tune these settings shortly.
Enable NAPI
When a network device is brought up, a driver will usually enable NAPI.
We saw earlier how drivers register poll
functions with NAPI, but NAPI is not usually enabled until the device is brought up.
Enabling NAPI is relatively straight forward. A call to napi_enable
will flip a bit in the struct napi_struct
to indicate that it is now enabled. As mentioned above, while NAPI will be enabled it will be in the off position.
In the case of the igb
driver, NAPI is enabled for each q_vector
that was initialized when the driver was loaded or when the queue count or size are changed with ethtool
.
From drivers/net/ethernet/intel/igb/igb_main.c:
Register an interrupt handler
After enabling NAPI, the next step is to register an interrupt handler. There are different methods a device can use to signal an interrupt: MSI-X, MSI, and legacy interrupts. As such, the code differs from device to device depending on what the supported interrupt methods are for a particular piece of hardware.
The driver must determine which method is supported by the device and register the appropriate handler function that will execute when the interrupt is received.
Some drivers, like the igb
driver, will try to register an interrupt handler with each method, falling back to the next untested method on failure.
MSI-X interrupts are the preferred method, especially for NICs that support multiple RX queues. This is because each RX queue can have its own hardware interrupt assigned, which can then be handled by a specific CPU (with irqbalance
or by modifying /proc/irq/IRQ_NUMBER/smp_affinity
). As we’ll see shortly, the CPU that handles the interrupt will be the CPU that processes the packet. In this way, arriving packets can be processed by separate CPUs from the hardware interrupt level up through the networking stack.
If MSI-X is unavailable, MSI still presents advantages over legacy interrupts and will be used by the driver if the device supports it. Read this useful wiki page for more information about MSI and MSI-X.
In the igb
driver, the functions igb_msix_ring
, igb_intr_msi
, igb_intr
are the interrupt handler methods for the MSI-X, MSI, and legacy interrupt modes, respectively.
You can find the code in the driver which attempts each interrupt method in drivers/net/ethernet/intel/igb/igb_main.c:
As you can see in the abbreviated code above, the driver first attempts to set an MSI-X interrupt handler with igb_request_msix
, falling back to MSI on failure. Next, request_irq
is used to register igb_intr_msi
, the MSI interrupt handler. If this fails, the driver falls back to legacy interrupts. request_irq
is used again to register the legacy interrupt handler igb_intr
.
And this is how the igb
driver registers a function that will be executed when the NIC raises an interrupt signaling that data has arrived and is ready for processing.
Enable Interrupts
At this point, almost everything is setup. The only thing left is to enable interrupts from the NIC and wait for data to arrive. Enabling interrupts is hardware specific, but the igb
driver does this in __igb_open
by calling a helper function named igb_irq_enable
.
Interrupts are enabled for this device by writing to registers:
The network device is now up
Drivers may do a few more things like start timers, work queues, or other hardware-specific setup. Once that is completed. the network device is up and ready for use.
Let’s take a look at monitoring and tuning settings for network device drivers.
Monitoring network devices
There are several different ways to monitor your network devices offering different levels of granularity and complexity. Let’s start with most granular and move to least granular.
Using ethtool -S
You can install ethtool
on an Ubuntu system by running: sudo apt-get install ethtool
.
Once it is installed, you can access the statistics by passing the -S
flag along with the name of the network device you want statistics about.
Monitor detailed NIC device statistics (e.g., packet drops) with `ethtool -S`.
$ sudo ethtool -S eth0
NIC statistics:
rx_packets: 597028087
tx_packets: 5924278060
rx_bytes: 112643393747
tx_bytes: 990080156714
rx_broadcast: 96
tx_broadcast: 116
rx_multicast: 20294528
....
Monitoring this data can be difficult. It is easy to obtain, but there is no standardization of the field values. Different drivers, or even different versions of the same driver might produce different field names that have the same meaning.
You should look for values with “drop”, “buffer”, “miss”, etc in the label. Next, you will have to read your driver source. You’ll be able to determine which values are accounted for totally in software (e.g., incremented when there is no memory) and which values come directly from hardware via a register read. In the case of a register value, you should consult the data sheet for your hardware to determine what the meaning of the counter really is; many of the labels given via ethtool
can be misleading.
Using sysfs
sysfs also provides a lot of statistics values, but they are slightly higher level than the direct NIC level stats provided.
You can find the number of dropped incoming network data frames for, e.g. eth0 by using cat
on a file.
Monitor higher level NIC statistics with sysfs.
$ cat /sys/class/net/eth0/statistics/rx_dropped
2
The counter values will be split into files like collisions
, rx_dropped
, rx_errors
, rx_missed_errors
, etc.
Unfortunately, it is up to the drivers to decide what the meaning of each field is, and thus, when to increment them and where the values come from. You may notice that some drivers count a certain type of error condition as a drop, but other drivers may count the same as a miss.
If these values are critical to you, you will need to read your driver source to understand exactly what your driver thinks each of these values means.
Using /proc/net/dev
An even higher level file is /proc/net/dev
which provides high-level summary-esque information for each network adapter on the system.
Monitor high level NIC statistics by reading /proc/net/dev
.
$ cat /proc/net/dev
Inter-| Receive | Transmit
face | bytes packets errs drop fifo frame compressed multicast | bytes packets errs drop fifo colls carrier compressed
eth0: 110346752214 597737500 0 2 0 0 0 20963860 990024805984 6066582604 0 0 0 0 0 0
lo: 428349463836 1579868535 0 0 0 0 0 0 428349463836 1579868535 0 0 0 0 0 0
This file shows a subset of the values you’ll find in the sysfs files mentioned above, but it may serve as a useful general reference.
The caveat mentioned above applies here, as well: if these values are important to you, you will still need to read your driver source to understand exactly when, where, and why they are incremented to ensure your understanding of an error, drop, or fifo are the same as your driver.
Tuning network devices
Check the number of RX queues being used
If your NIC and the device driver loaded on your system support RSS / multiqueue, you can usually adjust the number of RX queues (also called RX channels), by using ethtool
.
Check the number of NIC receive queues with ethtool
$ sudo ethtool -l eth0
Channel parameters for eth0:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 8
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4
This output is displaying the pre-set maximums (enforced by the driver and the hardware) and the current settings.
Note: not all device drivers will have support for this operation.
Error seen if your NIC doesn't support this operation.
$ sudo ethtool -l eth0
Channel parameters for eth0:
Cannot get device channel parameters
: Operation not supported
This means that your driver has not implemented the ethtool get_channels
operation. This could be because the NIC doesn’t support adjusting the number of queues, doesn’t support RSS / multiqueue, or your driver has not been updated to handle this feature.
Adjusting the number of RX queues
Once you’ve found the current and maximum queue count, you can adjust the values by using sudo ethtool -L
.
Note: some devices and their drivers only support combined queues that are paired for transmit and receive, as in the example in the above section.
Set combined NIC transmit and receive queues to 8 with ethtool -L
$ sudo ethtool -L eth0 combined 8
If your device and driver support individual settings for RX and TX and you’d like to change only the RX queue count to 8, you would run:
Set the number of NIC receive queues to 8 with ethtool -L
.
$ sudo ethtool -L eth0 rx 8
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
Adjusting the size of the RX queues
Some NICs and their drivers also support adjusting the size of the RX queue. Exactly how this works is hardware specific, but luckily ethtool
provides a generic way for users to adjust the size. Increasing the size of the RX queue can help prevent network data drops at the NIC during periods where large numbers of data frames are received. Data may still be dropped in software, though, and other tuning is required to reduce or eliminate drops completely.
Check current NIC queue sizes with ethtool -g
$ sudo ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 512
RX Mini: 0
RX Jumbo: 0
TX: 512
the above output indicates that the hardware supports up to 4096 receive and transmit descriptors, but it is currently only using 512.
Increase size of each RX queue to 4096 with ethtool -G
$ sudo ethtool -G eth0 rx 4096
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
Adjusting the processing weight of RX queues
Some NICs support the ability to adjust the distribution of network data among the RX queues by setting a weight.
You can configure this if:
- Your NIC supports flow indirection.
- Your driver implements the
ethtool
functionsget_rxfh_indir_size
andget_rxfh_indir
. - You are running a new enough version of
ethtool
that has support for the command line options-x
and-X
to show and set the indirection table, respectively.
Check the RX flow indirection table with ethtool -x
$ sudo ethtool -x eth0
RX flow hash indirection table for eth3 with 2 RX ring(s):
0: 0 1 0 1 0 1 0 1
8: 0 1 0 1 0 1 0 1
16: 0 1 0 1 0 1 0 1
24: 0 1 0 1 0 1 0 1
This output shows packet hash values on the left, with receive queue 0 and 1 listed. So, a packet which hashes to 2 will be delivered to receive queue 0, while a packet which hashes to 3 will be delivered to receive queue 1.
Example: spread processing evenly between first 2 RX queues
$ sudo ethtool -X eth0 equal 2
If you want to set custom weights to alter the number of packets which hit certain receive queues (and thus CPUs), you can specify those on the command line, as well:
Set custom RX queue weights with ethtool -X
$ sudo ethtool -X eth0 weight 6 2
The above command specifies a weight of 6 for rx queue 0 and 2 for rx queue 1, pushing much more data to be processed on queue 0.
Some NICs will also let you adjust the fields which be used in the hash algorithm, as we’ll see now.
Adjusting the rx hash fields for network flows
You can use ethtool
to adjust the fields that will be used when computing a hash for use with RSS.
Check which fields are used for UDP RX flow hash with ethtool -n
.
$ sudo ethtool -n eth0 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
For eth0, the fields that are used for computing a hash on UDP flows is the IPv4 source and destination addresses. Let’s include the source and destination ports:
Set UDP RX flow hash fields with ethtool -N
.
$ sudo ethtool -N eth0 rx-flow-hash udp4 sdfn
The sdfn
string is a bit cryptic; check the ethtool
man page for an explanation of each letter.
Adjusting the fields to take a hash on is useful, but ntuple
filtering is even more useful for finer grained control over which flows will be handled by which RX queue.
ntuple filtering for steering network flows
Some NICs support a feature known as “ntuple filtering.” This feature allows the user to specify (via ethtool
) a set of parameters to use to filter incoming network data in hardware and queue it to a particular RX queue. For example, the user can specify that TCP packets destined to a particular port should be sent to RX queue 1.
On Intel NICs this feature is commonly known as Intel Ethernet Flow Director. Other NIC vendors may have other marketing names for this feature.
As we’ll see later, ntuple filtering is a crucial component of another feature called Accelerated Receive Flow Steering (aRFS), which makes using ntuple much easier if your NIC supports it. aRFS will be covered later.
This feature can be useful if the operational requirements of the system involve maximizing data locality with the hope of increasing CPU cache hit rates when processing network data. For example consider the following configuration for a webserver running on port 80:
- A webserver running on port 80 is pinned to run on CPU 2.
- IRQs for an RX queue are assigned to be processed by CPU 2.
- TCP traffic destined to port 80 is ‘filtered’ with ntuple to CPU 2.
- All incoming traffic to port 80 is then processed by CPU 2 starting at data arrival to the userland program.
- Careful monitoring of the system including cache hit rates and networking stack latency will be needed to determine effectiveness.
As mentioned, ntuple filtering can be configured with ethtool
, but first, you’ll need to ensure that this feature is enabled on your device.
Check if ntuple filters are enabled with ethtool -k
$ sudo ethtool -k eth0
Offload parameters for eth0:
...
ntuple-filters: off
receive-hashing: on
As you can see, ntuple-filters
are set to off on this device.
Enable ntuple filters with ethtool -K
$ sudo ethtool -K eth0 ntuple on
Once you’ve enabled ntuple filters, or verified that it is enabled, you can check the existing ntuple rules by using ethtool
:
Check existing ntuple filters with ethtool -u
$ sudo ethtool -u eth0
40 RX rings available
Total 0 rules
As you can see, this device has no ntuple filter rules. You can add a rule by specifying it on the command line to ethtool
. Let’s add a rule to direct all TCP traffic with a destination port of 80 to RX queue 2:
Add ntuple filter to send TCP flows with destination port 80 to RX queue 2
$ sudo ethtool -U eth0 flow-type tcp4 dst-port 80 action 2
You can also use ntuple filtering to drop packets for particular flows at the hardware level. This can be useful for mitigating heavy incoming traffic from specific IP addresses. For more information about configuring ntuple filter rules, see the ethtool
man page.
You can usually get statistics about the success (or failure) of your ntuple rules by checking values output from ethtool -S [device name]
. For example, on Intel NICs, the statistics fdir_match
and fdir_miss
calculate the number of matches and misses for your ntuple filtering rules. Consult your device driver source and device data sheet for tracking down statistics counters (if available).
SoftIRQs
Before examining the network stack, we’ll need to take a short detour to examine something in the Linux kernel called SoftIRQs.
What is a softirq?
The softirq system in the Linux kernel is a mechanism for executing code outside of the context of an interrupt handler implemented in a driver. This system is important because hardware interrupts may be disabled during all or part of the execution of an interrupt handler. The longer interrupts are disabled, the greater chance that events may be missed. So, it is important to defer any long running actions outside of the interrupt handler so that it can complete as quickly as possible and re-enable interrupts from the device.
There are other mechanisms that can be used for deferring work in the kernel, but for the purposes of the networking stack, we’ll be looking at softirqs.
The softirq system can be imagined as a series of kernel threads (one per CPU) that run handler functions which have been registered for different softirq events. If you’ve ever looked at top and seen ksoftirqd/0
in the list of kernel threads, you were looking at the softirq kernel thread running on CPU 0.
Kernel subsystems (like networking) can register a softirq handler by executing the open_softirq
function. We’ll see later how the networking system registers its softirq handlers. For now, let’s learn a bit more about how softirqs work.
ksoftirqd
Since softirqs are so important for deferring the work of device drivers, you might imagine that the ksoftirqd
process is spawned pretty early in the life cycle of the kernel and you’d be correct.
Looking at the code found in kernel/softirq.c reveals how the ksoftirqd
system is initialized:
As you can see from the struct smp_hotplug_thread
definition above, there are two function pointers being registered: ksoftirqd_should_run
and run_ksoftirqd
.
Both of these functions are called from kernel/smpboot.c as part of something which resembles an event loop.
The code in kernel/smpboot.c
first calls ksoftirqd_should_run
which determines if there are any pending softirqs and, if there are pending softirqs, run_ksoftirqd
is executed. The run_ksoftirqd
does some minor bookkeeping before it calls __do_softirq
.
__do_softirq
The __do_softirq
function does a few interesting things:
- determines which softirq is pending
- softirq time is accounted for statistics purposes
- softirq execution statistics are incremented
- the softirq handler for the pending softirq (which was registered with a call to
open_softirq
) is executed.
So, when you look at graphs of CPU usage and see softirq
or si
you now know that this is measuring the amount of CPU usage happening in a deferred work context.
Monitoring
/proc/softirqs
The softirq
system increments statistic counters which can be read from /proc/softirqs
Monitoring these statistics can give you a sense for the rate at which softirqs for various events are being generated.
Check softIRQ stats by reading /proc/softirqs
.
$ cat /proc/softirqs
CPU0 CPU1 CPU2 CPU3
HI: 0 0 0 0
TIMER: 2831512516 1337085411 1103326083 1423923272
NET_TX: 15774435 779806 733217 749512
NET_RX: 1671622615 1257853535 2088429526 2674732223
BLOCK: 1800253852 1466177 1791366 634534
BLOCK_IOPOLL: 0 0 0 0
TASKLET: 25 0 0 0
SCHED: 2642378225 1711756029 629040543 682215771
HRTIMER: 2547911 2046898 1558136 1521176
RCU: 2056528783 4231862865 3545088730 844379888
This file can give you an idea of how your network receive (NET_RX
) processing is currently distributed across your CPUs. If it is distributed unevenly, you will see a larger count value for some CPUs than others. This is one indicator that you might be able to benefit from Receive Packet Steering / Receive Flow Steering described below. Be careful using just this file when monitoring your performance: during periods of high network activity you would expect to see the rate NET_RX
increments increase, but this isn’t necessarily the case. It turns out that this is a bit nuanced, because there are additional tuning knobs in the network stack that can affect the rate at which NET_RX
softirqs will fire, which we’ll see soon.
You should be aware of this, however, so that if you adjust the other tuning knobs you will know to examine /proc/softirqs
and expect to see a change.
Now, let’s move on to the networking stack and trace how network data is received from top to bottom.
Linux network device subsystem
Now that we’ve taken a look in to how network drivers and softirqs work, let’s see how the Linux network device subsystem is initialized. Then, we can follow the path of a packet starting with its arrival.
Initialization of network device subsystem
The network device (netdev) subsystem is initialized in the function net_dev_init
. Lots of interesting things happen in this initialization function.
Initialization of struct softnet_data
structures
net_dev_init
creates a set of struct softnet_data
structures for each CPU on the system. These structures will hold pointers to several important things for processing network data:
- List for NAPI structures to be registered to this CPU.
- A backlog for data processing.
- The processing
weight
. - The receive offload structure list.
- Receive packet steering settings.
- And more.
Each of these will be examined in greater detail later as we progress up the stack.
Initialization of softirq handlers
net_dev_init
registers a transmit and receive softirq handler which will be used to process incoming or outgoing network data. The code for this is pretty straight forward:
We’ll see soon how the driver’s interrupt handler will “raise” (or trigger) the net_rx_action
function registered to the NET_RX_SOFTIRQ
softirq.
Data arrives
At long last; network data arrives!
Assuming that the RX queue has enough available descriptors, the packet is written to RAM via DMA. The device then raises the interrupt that is assigned to it (or in the case of MSI-X, the interrupt tied to the rx queue the packet arrived on).
Interrupt handler
In general, the interrupt handler which runs when an interrupt is raised should try to defer as much processing as possible to happen outside the interrupt context. This is crucial because while an interrupt is being processed, other interrupts may be blocked.
Let’s take a look at the source for the MSI-X interrupt handler; it will really help illustrate the idea that the interrupt handler does as little work as possible.
From drivers/net/ethernet/intel/igb/igb_main.c:
This interrupt handler is very short and performs 2 very quick operations before returning.
First, this function calls igb_write_itr
which simply updates a hardware specific register. In this case, the register that is updated is one which is used to track the rate hardware interrupts are arriving.
This register is used in conjunction with a hardware feature called “Interrupt Throttling” (also called “Interrupt Coalescing”) which can be used to to pace the delivery of interrupts to the CPU. We’ll see soon how ethtool
provides a mechanism for adjusting the rate at which IRQs fire.
Secondly, napi_schedule
is called which wakes up the NAPI processing loop if it was not already active. Note that the NAPI processing loop executes in a softirq; the NAPI processing loop does not execute from the interrupt handler. The interrupt handler simply causes it to start executing if it was not already.
The actual code showing exactly how this works is important; it will guide our understanding of how network data is processed on multi-CPU systems.
NAPI and napi_schedule
Let’s figure out how the napi_schedule
call from the hardware interrupt handler works.
Remember, NAPI exists specifically to harvest network data without needing interrupts from the NIC to signal that data is ready for processing. As mentioned earlier, the NAPI poll
loop is bootstrapped by receiving a hardware interrupt. In other words: NAPI is enabled, but off, until the first packet arrives at which point the NIC raises an IRQ and NAPI is started. There are a few other cases, as we’ll see soon, where NAPI can be disabled and will need a hardware interrupt to be raised before it will be started again.
The NAPI poll loop is started when the interrupt handler in the driver calls napi_schedule
. napi_schedule
is actually just a wrapper function defined in a header file which calls down to __napi_schedule
.
From net/core/dev.c:
This code is using __get_cpu_var
to get the softnet_data
structure that is registered to the current CPU. This softnet_data
structure and the struct napi_struct
structure handed up from the driver are passed into ____napi_schedule
. Wow, that’s a lot of underscores ;)
Let’s take a look at ____napi_schedule
, from net/core/dev.c:
This code does two important things:
- The
struct napi_struct
handed up from the device driver’s interrupt handler code is added to thepoll_list
attached to thesoftnet_data
structure associated with the current CPU. __raise_softirq_irqoff
is used to “raise” (or trigger) a NET_RX_SOFTIRQ softirq. This will cause thenet_rx_action
registered during the network device subsystem initialization to be executed, if it’s not currently being executed.
As we’ll see shortly, the softirq handler function net_rx_action
will call the NAPI poll
function to harvest packets.
A note about CPU and network data processing
Note that all the code we’ve seen so far to defer work from a hardware interrupt handler to a softirq has been using structures associated with the current CPU.
While the driver’s IRQ handler itself does very little work itself, the softirq handler will execute on the same CPU as the driver’s IRQ handler.
This why setting the CPU a particular IRQ will be handled by is important: that CPU will be used not only to execute the interrupt handler in the driver, but the same CPU will also be used when harvesting packets in a softirq via NAPI.
As we’ll see later, things like Receive Packet Steering can distribute some of this work to other CPUs further up the network stack.
Monitoring network data arrival
Hardware interrupt requests
Note: monitoring hardware IRQs does not give a complete picture of packet processing health. Many drivers turn off hardware IRQs while NAPI is running, as we'll see later. It is one important part of your whole monitoring solution.
Check hardware interrupt stats by reading /proc/interrupts
.
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 46 0 0 0 IR-IO-APIC-edge timer
1: 3 0 0 0 IR-IO-APIC-edge i8042
30: 3361234770 0 0 0 IR-IO-APIC-fasteoi aacraid
64: 0 0 0 0 DMAR_MSI-edge dmar0
65: 1 0 0 0 IR-PCI-MSI-edge eth0
66: 863649703 0 0 0 IR-PCI-MSI-edge eth0-TxRx-0
67: 986285573 0 0 0 IR-PCI-MSI-edge eth0-TxRx-1
68: 45 0 0 0 IR-PCI-MSI-edge eth0-TxRx-2
69: 394 0 0 0 IR-PCI-MSI-edge eth0-TxRx-3
NMI: 9729927 4008190 3068645 3375402 Non-maskable interrupts
LOC: 2913290785 1585321306 1495872829 1803524526 Local timer interrupts
You can monitor the statistics in /proc/interrupts
to see how the number and rate of hardware interrupts change as packets arrive and to ensure that each RX queue for your NIC is being handled by an appropriate CPU. As we’ll see shortly, this number only tells us how many hardware interrupts have happened, but it is not necessarily a good metric for understanding how much data has been received or processed as many drivers will disable NIC IRQs as part of their contract with the NAPI subsystem. Further, using interrupt coalescing will also affect the statistics gathered from this file. Monitoring this file can help you determine if the interrupt coalescing settings you select are actually working.
To get a more complete picture of your network processing health, you’ll need to monitor /proc/softirqs
(as mentioned above) and additional files in /proc
that we’ll cover below.
Tuning network data arrival
Interrupt coalescing
Interrupt coalescing is a method of preventing interrupts from being raised by a device to a CPU until a specific amount of work or number of events are pending.
This can help prevent interrupt storms and can help increase throughput or latency, depending on the settings used. Fewer interrupts generated result in higher throughput, increased latency, and lower CPU usage. More interrupts generated result in the opposite: lower latency, lower throughput, but also increased CPU usage.
Historically, earlier versions of the igb
, e1000
, and other drivers included support for a parameter called InterruptThrottleRate
. This parameter has been replaced in more recent drivers with a generic ethtool
function.
Get the current IRQ coalescing settings with ethtool -c
.
$ sudo ethtool -c eth0
Coalesce parameters for eth0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
...
ethtool
provides a generic interface for setting various coalescing settings. Keep in mind, however, that not every device or driver will support every setting. You should check your driver documentation or driver source code to determine what is, or is not, supported. As per the ethtool documentation: “Anything not implemented by the driver causes these values to be silently ignored.”
One interesting option that some drivers support is “adaptive RX/TX IRQ coalescing.” This option is typically implemented in hardware. The driver usually needs to do some work to inform the NIC that this feature is enabled and some bookkeeping as well (as seen in the igb
driver code above).
The result of enabling adaptive RX/TX IRQ coalescing is that interrupt delivery will be adjusted to improve latency when packet rate is low and also improve throughput when packet rate is high.
Enable adaptive RX IRQ coalescing with ethtool -C
$ sudo ethtool -C eth0 adaptive-rx on
You can also use ethtool -C
to set several options. Some of the more common options to set are:
rx-usecs
: How many usecs to delay an RX interrupt after a packet arrives.rx-frames
: Maximum number of data frames to receive before an RX interrupt.rx-usecs-irq
: How many usecs to delay an RX interrupt while an interrupt is being serviced by the host.rx-frames-irq
: Maximum number of data frames to receive before an RX interrupt is generated while the system is servicing an interrupt.
And many, many more.
Reminder that your hardware and driver may only support a subset of the options listed above. You should consult your driver source code and your hardware data sheet for more information on supported coalescing options.
Unfortunately, the options you can set aren’t well documented anywhere except in a header file. Check the source of include/uapi/linux/ethtool.h to find an explanation of each option supported by ethtool
(but not necessarily your driver and NIC).
Note: while interrupt coalescing seems to be a very useful optimization at first glance, the rest of the networking stack internals also come into the fold when attempting to optimize. Interrupt coalescing can be useful in some cases, but you should ensure that the rest of your networking stack is also tuned properly. Simply modifying your coalescing settings alone will likely provide minimal benefit in and of itself.
Adjusting IRQ affinities
If your NIC supports RSS / multiqueue or if you are attempting to optimize for data locality, you may wish to use a specific set of CPUs for handling interrupts generated by your NIC.
Setting specific CPUs allows you to segment which CPUs will be used for processing which IRQs. These changes may affect how upper layers operate, as we’ve seen for the networking stack.
If you do decide to adjust your IRQ affinities, you should first check if you running the irqbalance
daemon. This daemon tries to automatically balance IRQs to CPUs and it may overwrite your settings. If you are running irqbalance
, you should either disable irqbalance
or use the --banirq
in conjunction with IRQBALANCE_BANNED_CPUS
to let irqbalance
know that it shouldn’t touch a set of IRQs and CPUs that you want to assign yourself.
Next, you should check the file /proc/interrupts
for a list of the IRQ numbers for each network RX queue for your NIC.
Finally, you can adjust the which CPUs each of those IRQs will be handled by modifying /proc/irq/IRQ_NUMBER/smp_affinity
for each IRQ number.
You simply write a hexadecimal bitmask to this file to instruct the kernel which CPUs it should use for handling the IRQ.
Example: Set the IRQ affinity for IRQ 8 to CPU 0
$ sudo bash -c 'echo 1 > /proc/irq/8/smp_affinity'
Network data processing begins
Once the softirq code determines that a softirq is pending, begins processing, and executes net_rx_action
, network data processing begins.
Let’s take a look at portions of the net_rx_action
processing loop to understand how it works, which pieces are tunable, and what can be monitored.
net_rx_action
processing loop
net_rx_action
begins the processing of packets from the memory the packets were DMA’d into by the device.
The function iterates through the list of NAPI structures that are queued for the current CPU, dequeuing each structure, and operating on it.
The processing loop bounds the amount of work and execution time that can be consumed by the registered NAPI poll
functions. It does this in two ways:
- By keeping track of a work
budget
(which can be adjusted), and - Checking the elapsed time
From net/core/dev.c:
This is how the kernel prevents packet processing from consuming the entire CPU. The budget
above is the total available budget that will be spent among each of the available NAPI structures registered to this CPU.
This is another reason why multiqueue NICs should have the IRQ affinity carefully tuned. Recall that the CPU which handles the IRQ from the device will be the CPU where the softirq handler will execute and, as a result, will also be the CPU where the above loop and budget computation runs.
Systems with multiple NICs each with multiple queues can end up in a situation where multiple NAPI structs are registered to the same CPU. Data processing for all NAPI structs on the same CPU spend from the same budget
.
If you don’t have enough CPUs to distribute your NIC’s IRQs, you can consider increasing the net_rx_action
budget
to allow for more packet processing for each CPU. Increasing the budget will increase CPU usage (specifically sitime
or si
in top
or other programs), but should reduce latency as data will be processed more promptly.
Note: the CPU will still be bounded by a time limit of 2 jiffies, regardless of the assigned budget.
NAPI poll
function and weight
Recall that network device drivers use netif_napi_add
for registering poll
function. As we saw earlier in this post, the igb
driver has a piece of code like this:
This registers a NAPI structure with a hardcoded weight of 64. We’ll see now how this is used in the net_rx_action
processing loop.
From net/core/dev.c:
This code obtains the weight which was registered to the NAPI struct (64
in the above driver code) and passes it into the poll
function which was also registered to the NAPI struct (igb_poll
in the above code).
The poll
function returns the number of data frames that were processed. This amount is saved above as work
, which is then subtracted from the overall budget
.
So, assuming:
- You are using a weight of
64
from your driver (all drivers were hardcoded with this value in Linux 3.13.0), and - You have your
budget
set to the default of300
Your system would stop processing data when either:
- The
igb_poll
function was called at most 5 times (less if no data to process as we’ll see next), OR - At least 2 jiffies of time have elapsed.
The NAPI / network device driver contract
One important piece of information about the contract between the NAPI subsystem and device drivers which has not been mentioned yet are the requirements around shutting down NAPI.
This part of the contract is as follows:
- If a driver’s
poll
function consumes its entire weight (which is hardcoded to64
) it must NOT modify NAPI state. Thenet_rx_action
loop will take over. - If a driver’s
poll
function does NOT consume its entire weight, it must disable NAPI. NAPI will be re-enabled next time an IRQ is received and the driver’s IRQ handler callsnapi_schedule
.
We’ll see how net_rx_action
deals with the first part of that contract now. Next, the poll
function is examined, we’ll see how the second part of that contract is handled.
Finishing the net_rx_action
loop
The net_rx_action
processing loop finishes up with one last section of code that deals with the first part of the NAPI contract explained in the previous section. From net/core/dev.c:
If the entire work is consumed, there are two cases that net_rx_action
handles:
- The network device should be shutdown (e.g. because the user ran
ifconfig eth0 down
), - If the device is not being shutdown, check if there’s a generic receive offload (GRO) list. If the timer tick rate is >= 1000, all GRO’d network flows that were recently updated will be flushed. We’ll dig into GRO in detail later. Move the NAPI structure to the end of the list for this CPU so the next iteration of the loop will get the next NAPI structure registered.
And that is how the packet processing loop invokes the driver’s registered poll
function to process packets. As we’ll see shortly, the poll
function will harvest network data and send it up the stack to be processed.
Exiting the loop when limits are reached
The net_rx_action
loop will exit when either:
- The poll list registered for this CPU has no more NAPI structures (
!list_empty(&sd->poll_list)
), or - The remaining budget is <= 0, or
- The time limit of 2 jiffies has been reached
Here’s this code we saw earlier again:
If you follow the softnet_break
label you stumble upon something interesting. From net/core/dev.c:
The struct softnet_data
structure has some statistics incremented and the softirq NET_RX_SOFTIRQ
is shut down. The time_squeeze
field is a measure of the number of times net_rx_action
had more work to do but either the budget was exhausted or the time limit was reached before it could be completed. This is a tremendously useful counter for understanding bottlenecks in network processing. We’ll see shortly how to monitor this value. The NET_RX_SOFTIRQ
is disabled to free up processing time for other tasks. This makes sense as this small stub of code is only executed when more work could have been done, but we don’t want to monopolize the CPU.
Execution is then transferred to the out
label. Execution can also make it to the out
label if there were no more NAPI structures to process, in other words, there is more budget than there is network activity and all the drivers have shut NAPI off and there is nothing left for net_rx_action
to do.
The out
section does one important thing before returning from net_rx_action
: it calls net_rps_action_and_irq_enable
. This function serves an important purpose if Receive Packet Steering is enabled; it wakes up remote CPUs to start processing network data.
We’ll see more about how RPS works later. For now, let’s see how to monitor the health of the net_rx_action
processing loop and move on to the inner working of NAPI poll
functions so we can progress up the network stack.
NAPI poll
Recall in previous sections that device drivers allocate a region of memory for the device to perform DMA to incoming packets. Just as it is the responsibility of the driver to allocate those regions, it is also the responsibility of the driver to unmap those regions, harvest the data, and send it up the network stack.
Let’s take a look at how the igb
driver does this to get an idea of how this works in practice.
igb_poll
At long last, we can finally examine our friend igb_poll
. It turns out the code for igb_poll
is deceptively simple. Let’s take a look. From drivers/net/ethernet/intel/igb/igb_main.c:
This code does a few interesting things:
- If Direct Cache Access (DCA) support is enabled in the kernel, the CPU cache is warmed so that accesses to the RX ring will hit CPU cache. You can read more about DCA in the Extras section at the end of this blog post.
- Next,
igb_clean_rx_irq
is called which does the heavy lifting, as we’ll see next. - Next,
clean_complete
is checked to determine if there was still more work that could have been done. If so, thebudget
(remember, this was hardcoded to64
) is returned. As we saw earlier,net_rx_action
will move this NAPI structure to the end of the poll list. - Otherwise, the driver turns off NAPI by calling
napi_complete
and re-enables interrupts by callingigb_ring_irq_enable
. The next interrupt that arrives will re-enable NAPI.
Let’s see how igb_clean_rx_irq
sends network data up the stack.
igb_clean_rx_irq
The igb_clean_rx_irq
function is a loop which processes one packet at a time until the budget
is reached or no additional data is left to process.
The loop in this function does a few important things:
- Allocates additional buffers for receiving data as used buffers are cleaned out. Additional buffers are added
IGB_RX_BUFFER_WRITE
(16) at a time. - Fetch a buffer from the RX queue and store it in an
skb
structure. - Check if the buffer is an “End of Packet” buffer. If so, continue processing. Otherwise, continue fetching additional buffers from the RX queue, adding them to the
skb
. This is necessary if a received data frame is larger than the buffer size. - Verify that the layout and headers of the data are correct.
- The number of bytes processed statistic counter is increased by
skb->len
. - Set the hash, checksum, timestamp, VLAN id, and protocol fields of the skb. The hash, checksum, timestamp, and VLAN id are provided by the hardware. If the hardware is signaling a checksum error, the
csum_error
statistic is incremented. If the checksum succeeded and the data is UDP or TCP data, theskb
is marked asCHECKSUM_UNNECESSARY
. If the checksum failed, the protocol stacks are left to deal with this packet. The protocol is computed with a call toeth_type_trans
and stored in theskb
struct. - The constructed
skb
is handed up the network stack with a call tonapi_gro_receive
. - The number of packets processed statistics counter is incremented.
- The loop continues until the number of packets processed reaches the budget.
Once the loop terminates, the function assigns statistics counters for rx packets and bytes processed.
Now it’s time to take two detours prior to proceeding up the network stack. First, let’s see how to monitor and tune the network subsystem’s softirqs. Next, let’s talk about Generic Receive Offloading (GRO). After that, the rest of the networking stack will make more sense as we enter napi_gro_receive
.
Monitoring network data processing
/proc/net/softnet_stat
As seen in the previous section, net_rx_action
increments a statistic when exiting the net_rx_action
loop and when additional work could have been done, but either the budget
or the time limit for the softirq was hit. This statistic is tracked as part of the struct softnet_data
associated with the CPU.
These statistics are output to a file in proc: /proc/net/softnet_stat
for which there is, unfortunately, very little documentation. The fields in the file in proc are not labeled and could change between kernel releases.
In Linux 3.13.0, you can find which values map to which field in /proc/net/softnet_stat
by reading the kernel source. From net/core/net-procfs.c:
Many of these statistics have confusing names and are incremented in places where you might not expect. An explanation of when and where each of these is incremented will be provided as the network stack is examined. Since the squeeze_time
statistic was seen in net_rx_action
, I thought it made sense to document this file now.
Monitor network data processing statistics by reading /proc/net/softnet_stat
.
$ cat /proc/net/softnet_stat
6dcad223 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000
6f0e1565 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000000 00000000
660774ec 00000000 00000003 00000000 00000000 00000000 00000000 00000000 00000000 00000000
61c99331 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
6794b1b3 00000000 00000005 00000000 00000000 00000000 00000000 00000000 00000000 00000000
6488cb92 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Important details about /proc/net/softnet_stat
:
- Each line of
/proc/net/softnet_stat
corresponds to astruct softnet_data
structure, of which there is 1 per CPU. - The values are separated by a single space and are displayed in hexadecimal
- The first value,
sd->processed
, is the number of network frames processed. This can be more than the total number of network frames received if you are using ethernet bonding. There are cases where the ethernet bonding driver will trigger network data to be re-processed, which would increment thesd->processed
count more than once for the same packet. - The second value,
sd->dropped
, is the number of network frames dropped because there was no room on the processing queue. More on this later. - The third value,
sd->time_squeeze
, is (as we saw) the number of times thenet_rx_action
loop terminated because the budget was consumed or the time limit was reached, but more work could have been. Increasing thebudget
as explained earlier can help reduce this. - The next 5 values are always 0.
- The ninth value,
sd->cpu_collision
, is a count of the number of times a collision occurred when trying to obtain a device lock when transmitting packets. This article is about receive, so this statistic will not be seen below. - The tenth value,
sd->received_rps
, is a count of the number of times this CPU has been woken up to process packets via an Inter-processor Interrupt - The last value,
flow_limit_count
, is a count of the number of times the flow limit has been reached. Flow limiting is an optional Receive Packet Steering feature that will be examined shortly.
If you decide to monitor this file and graph the results, you must be extremely careful that the ordering of these fields hasn’t changed and that the meaning of each field has been preserved. You will need to read the kernel source to verify this.
Tuning network data processing
Adjusting the net_rx_action
budget
You can adjust the net_rx_action
budget, which determines how much packet processing can be spent among all NAPI structures registered to a CPU by setting a sysctl value named net.core.netdev_budget
.
Example: set the overall packet processing budget to 600.
$ sudo sysctl -w net.core.netdev_budget=600
You may also want to write this setting to your /etc/sysctl.conf
file so that changes persist between reboots.
The default value on Linux 3.13.0 is 300.
Generic Receive Offloading (GRO)
Generic Receive Offloading (GRO) is a software implementation of a hardware optimization that is known as Large Receive Offloading (LRO).
The main idea behind both methods is that reducing the number of packets passed up the network stack by combining “similar enough” packets together can reduce CPU usage. For example, imagine a case where a large file transfer is occurring and most of the packets contain chunks of data in the file. Instead of sending small packets up the stack one at a time, the incoming packets can be combined into one packet with a huge payload. That packet can then be passed up the stack. This allows the protocol layers to process a single packet’s headers while delivering bigger chunks of data to the user program.
The problem with this sort of optimization is, of course, information loss. If a packet had some important option or flag set, that option or flag could be lost if the packet is coalesced into another. And this is exactly why most people don’t use or encourage the use of LRO. LRO implementations, generally speaking, had very lax rules for coalescing packets.
GRO was introduced as an implementation of LRO in software, but with more strict rules around which packets can be coalesced.
By the way: if you have ever used tcpdump
and seen unrealistically large incoming packet sizes, it is most likely because your system has GRO enabled. As you’ll see soon, packet capture taps are inserted further up the stack, after GRO has already happened.
Tuning: Adjusting GRO settings with ethtool
You can use ethtool
to check if GRO is enabled and also to adjust the setting.
Use ethtool -k
to check your GRO settings.
$ ethtool -k eth0 | grep generic-receive-offload
generic-receive-offload: on
As you can see, on this system I have generic-receive-offload
set to on.
Use ethtool -K
to enable (or disable) GRO.
$ sudo ethtool -K eth0 gro on
Note: making these changes will, for most drivers, take the interface down and then bring it back up; connections to this interface will be interrupted. This may not matter much for a one-time change, though.
napi_gro_receive
The function napi_gro_receive
deals processing network data for GRO (if GRO is enabled for the system) and sending the data up the stack toward the protocol layers. Much of this logic is handled in a function called dev_gro_receive
.
dev_gro_receive
This function begins by checking if GRO is enabled and, if so, preparing to do GRO. In the case where GRO is enabled, a list of GRO offload filters is traversed to allow the higher level protocol stacks to act on a piece of data which is being considered for GRO. This is done so that the protocol layers can let the network device layer know if this packet is part of a network flow that is currently being receive offloaded and handle anything protocol specific that should happen for GRO. For example, the TCP protocol will need to decide if/when to ACK a packet that is being coalesced into an existing packet.
Here’s the code from net/core/dev.c
which does this:
If the protocol layers indicated that it is time to flush the GRO’d packet, that is taken care of next. This happens with a call to napi_gro_complete
, which calls a gro_complete
callback for the protocol layers and then passes the packet up the stack by calling netif_receive_skb
.
Here’s the code from net/core/dev.c
which does this:
Next, if the protocol layers merged this packet to an existing flow, napi_gro_receive
simply returns as there’s nothing else to do.
If the packet was not merged and there are fewer than MAX_GRO_SKBS
(8) GRO flows on the system, a new entry is added to the gro_list
on the NAPI structure for this CPU.
Here’s the code from net/core/dev.c
which does this:
And that is how the GRO system in the Linux networking stack works.
napi_skb_finish
Once dev_gro_receive
completes, napi_skb_finish
is called which either frees unneeded data structures because a packet has been merged, or calls netif_receive_skb
to pass the data up the network stack (because there were already MAX_GRO_SKBS
flows being GRO’d).
Next, it’s time for netif_receive_skb
to see how data is handed off to the protocol layers. Before this can be examined, we’ll need to take a look at Receive Packet Steering (RPS) first.
Receive Packet Steering (RPS)
Recall earlier how we discussed that network device drivers register a NAPI poll
function. Each NAPI
poller instance is executed in the context of a softirq of which there is one per CPU. Further recall that the CPU which the driver’s IRQ handler runs on will wake its softirq processing loop to process packets.
In other words: a single CPU processes the hardware interrupt and polls for packets to process incoming data.
Some NICs (like the Intel I350) support multiple queues at the hardware level. This means incoming packets can be DMA’d to a separate memory region for each queue, with a separate NAPI structure to manage polling this region, as well. Thus multiple CPUs will handle interrupts from the device and also process packets.
This feature is typically called Receive Side Scaling (RSS).
Receive Packet Steering (RPS) is a software implementation of RSS. Since it is implemented in software, this means it can be enabled for any NIC, even NICs which have only a single RX queue. However, since it is in software, this means that RPS can only enter into the flow after a packet has been harvested from the DMA memory region.
This means that you wouldn’t notice a decrease in CPU time spent handling IRQs or the NAPI poll
loop, but you can distribute the load for processing the packet after it’s been harvested and reduce CPU time from there up the network stack.
RPS works by generating a hash for incoming data to determine which CPU should process the data. The data is then enqueued to the per-CPU receive network backlog to be processed. An Inter-processor Interrupt (IPI) is delivered to the CPU owning the backlog. This helps to kick-start backlog processing if it is not currently processing data on the backlog. The /proc/net/softnet_stat
contains a count of the number of times each softnet_data
struct has received an IPI (the received_rps
field).
Thus, netif_receive_skb
will either continue sending network data up the networking stack, or hand it over to RPS for processing on a different CPU.
Tuning: Enabling RPS
For RPS to work, it must be enabled in the kernel configuration (it is on Ubuntu for kernel 3.13.0), and a bitmask describing which CPUs should process packets for a given interface and RX queue.
You can find some documentation about these bitmasks in the kernel documentation.
In short, the bitmasks to modify are found in:
/sys/class/net/DEVICE_NAME/queues/QUEUE/rps_cpus
So, for eth0
and receive queue 0, you would modify the file: /sys/class/net/eth0/queues/rx-0/rps_cpus
with a hexadecimal number indicating which CPUs should process packets from eth0
’s receive queue 0. As the documentation points out, RPS may be unnecessary in certain configurations.
Note: enabling RPS to distribute packet processing to CPUs which were previously not processing packets will cause the number of `NET_RX` softirqs to increase for that CPU, as well as the `si` or `sitime` in the CPU usage graph. You can compare before and after of your softirq and CPU usage graphs to confirm that RPS is configured properly to your liking.
Receive Flow Steering (RFS)
Receive flow steering (RFS) is used in conjunction with RPS. RPS attempts to distribute incoming packet load amongst multiple CPUs, but does not take into account any data locality issues for maximizing CPU cache hit rates. You can use RFS to help increase cache hit rates by directing packets for the same flow to the same CPU for processing.
Tuning: Enabling RFS
For RFS to work, you must have RPS enabled and configured.
RFS keeps track of a global hash table of all flows and the size of this hash table can be adjusted by setting the net.core.rps_sock_flow_entries
sysctl.
Increase the size of the RFS socket flow hash by setting a sysctl
.
$ sudo sysctl -w net.core.rps_sock_flow_entries=32768
Next, you can also set the number of flows per RX queue by writing this value to the sysfs file named rps_flow_cnt
for each RX queue.
Example: increase the number of flows for RX queue 0 on eth0 to 2048.
$ sudo bash -c 'echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt'
Hardware accelerated Receive Flow Steering (aRFS)
RFS can be sped up with the use of hardware acceleration; the NIC and the kernel can work together to determine which flows should be processed on which CPUs. To use this feature, it must be supported by the NIC and your driver.
Consult your NIC’s data sheet to determine if this feature is supported. If your NIC’s driver exposes a function called ndo_rx_flow_steer
, then the driver has support for accelerated RFS.
Tuning: Enabling accelerated RFS (aRFS)
Assuming that your NIC and driver support it, you can enable accelerated RFS by enabling and configuring a set of things:
- Have RPS enabled and configured.
- Have RFS enabled and configured.
- Your kernel has
CONFIG_RFS_ACCEL
enabled at compile time. The Ubuntu kernel 3.13.0 does. - Have ntuple support enabled for the device, as described previously. You can use
ethtool
to verify that ntuple support is enabled for the device. - Configure your IRQ settings to ensure each RX queue is handled by one of your desired network processing CPUs.
Once the above is configured, accelerated RFS will be used to automatically move data to the RX queue tied to a CPU core that is processing data for that flow and you won’t need to specify an ntuple filter rule manually for each flow.
Moving up the network stack with netif_receive_skb
Picking up where we left off with netif_receive_skb
, which is called from a few places. The two most common (and also the two we’ve already looked at):
napi_skb_finish
if the packet is not going to be merged to an existing GRO’d flow, ORnapi_gro_complete
if the protocol layers indicated that it’s time to flush the flow, OR
Reminder: netif_receive_skb
and its descendants are operating in the context of a the softirq processing loop and you'll see the time spent here accounted for as sitime
or si
with tools like top
.
netif_receive_skb
begins by first checking a sysctl
value to determine if the user has requested receive timestamping before or after a packet hits the backlog queue. If this setting is enabled, the data is timestamped now, prior to it hitting RPS (and the CPU’s associated backlog queue). If this setting is disabled, it will be timestamped after it hits the queue. This can be used to distribute the load of timestamping amongst multiple CPUs if RPS is enabled, but will introduce some delay as a result.
Tuning: RX packet timestamping
You can tune when packets will be timestamped after they are received by adjusting a sysctl named net.core.netdev_tstamp_prequeue
:
Disable timestamping for RX packets by adjusting a sysctl
$ sudo sysctl -w net.core.netdev_tstamp_prequeue=0
The default value is 1. Please see the previous section for an explanation as to what this setting means, exactly.
netif_receive_skb
After the timestamping is dealt with, netif_receive_skb
operates differently depending on whether or not RPS is enabled. Let’s start with the simpler path first: RPS disabled.
Without RPS (default setting)
If RPS is not enabled, __netif_receive_skb
is called which does some bookkeeping and then calls __netif_receive_skb_core
to move data closer to the protocol stacks.
We’ll see precisely how __netif_receive_skb_core
works, but first let’s see how the RPS enabled code path works, as that code will also call __netif_receive_skb_core
.
With RPS enabled
If RPS is enabled, after the timestamping options mentioned above are dealt with, netif_receive_skb
will perform some computations to determine which CPU’s backlog queue should be used. This is done by using the function get_rps_cpu
. From net/core/dev.c:
get_rps_cpu
will take into account RFS and aRFS settings as described above to ensure the the data gets queued to the desired CPU’s backlog with a call to enqueue_to_backlog
.
enqueue_to_backlog
This function begins by getting a pointer to the remote CPU’s softnet_data
structure, which contains a pointer to the input_pkt_queue
. Next, the queue length of the input_pkt_queue
of the remote CPU is checked. From net/core/dev.c:
The length of input_pkt_queue
is first compared to netdev_max_backlog
. If the queue is longer than this value, the data is dropped. Similarly, the flow limit is checked and if it has been exceeded, the data is dropped. In both cases the drop count on the softnet_data
structure is incremented. Note that this is the softnet_data
structure of the CPU the data was going to be queued to. Read the section above about /proc/net/softnet_stat
to learn how to get the drop count for monitoring purposes.
enqueue_to_backlog
is not called in many places. It is called for RPS-enabled packet processing and also from netif_rx
. Most drivers should not be using netif_rx
and should instead be using netif_receive_skb
. If you are not using RPS and your driver is not using netif_rx
, increasing the backlog won’t produce any noticeable effect on your system as it is not used.
Note: You need to check the driver you are using. If it calls netif_receive_skb
and you are not using RPS, increasing the netdev_max_backlog
will not yield any performance improvement because no data will ever make it to the input_pkt_queue
.
Assuming that the input_pkt_queue
is small enough and the flow limit (more about this, next) hasn’t been reached (or is disabled), the data can be queued. The logic here is a bit funny, but can be summarized as:
- If the queue is empty: check if NAPI has been started on the remote CPU. If not, check if an IPI is queued to be sent. If not, queue one and start the NAPI processing loop by calling
____napi_schedule
. Proceed to queuing the data. - If the queue is not empty, or the previously described operation has completed, enqueue the data.
The code is a bit tricky with its use of goto
, so read it carefully. From net/core/dev.c:
Flow limits
RPS distributes packet processing load amongst multiple CPUs, but a single large flow can monopolize CPU processing time and starve smaller flows. Flow limits are a feature that can be used to limit the number of packets queued to the backlog for each flow to a certain amount. This can help ensure that smaller flows are processed even though much larger flows are pushing packets in.
The if statement above from net/core/dev.c checks the flow limit with a call to skb_flow_limit
:
This code is checking that there is still room in the queue and that the flow limit has not been reached. By default, flow limits are disabled. In order to enable flow limits, you must specify a bitmap (similar to RPS’ bitmap).
Monitoring: Monitor drops due to full input_pkt_queue
or flow limit
See the section above about monitoring /proc/net/softnet_stat
. The dropped
field is a counter that gets incremented each time data is dropped instead of queued to a CPU’s input_pkt_queue
.
Tuning
Tuning: Adjusting netdev_max_backlog
to prevent drops
Before adjusting this tuning value, see the note in the previous section.
You can help prevent drops in enqueue_to_backlog
by increasing the netdev_max_backlog
if you are using RPS or if your driver calls netif_rx
.
Example: increase backlog to 3000 with sysctl
.
$ sudo sysctl -w net.core.netdev_max_backlog=3000
The default value is 1000.
Tuning: Adjust the NAPI weight of the backlog poll
loop
You can adjust the weight of the backlog’s NAPI poller by setting the net.core.dev_weight
sysctl. Adjusting this value determines how much of the overall budget the backlog poll
loop can consume (see the section above about adjusting net.core.netdev_budget
):
Example: increase the NAPI poll
backlog processing loop with sysctl
.
$ sudo sysctl -w net.core.dev_weight=600
The default value is 64.
Remember, backlog processing runs in the softirq context similar to the device driver’s registered poll
function and will be limited by the overall budget
and a time limit, as described in previous sections.
Tuning: Enabling flow limits and tuning flow limit hash table size
Set the size of the flow limit table with a sysctl
.
$ sudo sysctl -w net.core.flow_limit_table_len=8192
The default value is 4096.
This change only affects newly allocated flow hash tables. So, if you’d like to increase the table size, you should do it before you enable flow limits.
To enable flow limits you should specify a bitmask in /proc/sys/net/core/flow_limit_cpu_bitmap
similar to the RPS bitmask which indicates which CPUs have flow limits enabled.
backlog queue NAPI poller
The per-CPU backlog queue plugs into NAPI the same way a device driver does. A poll
function is provided that is used to process packets from the softirq context. A weight
is also provided, just as a device driver would.
This NAPI struct is provided during initialization of the networking system. From net_dev_init
in net/core/dev.c
:
The backlog NAPI structure differs from the device driver NAPI structure in that the weight
parameter is adjustable, where as drivers hardcode their NAPI weight to 64. We’ll see in the tuning section below how to adjust the weight using a sysctl
.
process_backlog
The process_backlog
function is a loop which runs until its weight (as described in the previous section) has been consumed or no more data remains on the backlog.
Each piece of data on the backlog queue is removed from the backlog queue and passed on to __netif_receive_skb
. The code path once the data hits __netif_receive_skb
is the same as explained above for the RPS disabled case. Namely, __netif_receive_skb
does some bookkeeping prior to calling __netif_receive_skb_core
to pass network data up to the protocol layers.
process_backlog
follows the same contract with NAPI that device drivers do, which is: NAPI is disabled if the total weight will not be used. The poller is restarted with the call to ____napi_schedule
from enqueue_to_backlog
as described above.
The function returns the amount of work done, which net_rx_action
(described above) will subtract from the budget (which is adjusted with the net.core.netdev_budget
, as described above).
__netif_receive_skb_core
delivers data to packet taps and protocol layers
__netif_receive_skb_core
performs the heavy lifting of delivering the data to protocol stacks. Before it does this, it checks if any packet taps have been installed which are catching all incoming packets. One example of something that does this is the AF_PACKET
address family, typically used via the libpcap library.
If such a tap exists, the data is delivered there first then to the protocol layers next.
Packet tap delivery
If a packet tap is installed (usually via libpcap), the packet is delivered there with the following code from net/core/dev.c:
If you are curious about how the path of the data through pcap, read net/packet/af_packet.c.
Protocol layer delivery
Once the taps have been satisfied, __netif_receive_skb_core
delivers data to protocol layers. It does this by obtaining the protocol field from the data and iterating across a list of deliver functions registered for that protocol type.
This can be seen in __netif_receive_skb_core
in net/core/dev.c:
The ptype_base
identifier above is defined as a hash table of lists in net/core/dev.c:
Each protocol layer adds a filter to a list at a given slot in the hash table, computed with a helper function called ptype_head
:
Adding a filter to the list is accomplished with a call to dev_add_pack
. That is how protocol layers register themselves for network data delivery for their protocol type.
And now you know how network data gets from the NIC to the protocol layer.
Protocol layer registration
Now that we know how data is delivered to the protocol stacks from the network device subsystem, let’s see how a protocol layer registers itself.
This blog post is going to examine the IP protocol stack as it is a commonly used protocol and will be relevant to most readers.
IP protocol layer
The IP protocol layer plugs itself into the ptype_base
hash table so that data will be delivered to it from the network device layer described in previous sections.
This happens in the function inet_init
from net/ipv4/af_inet.c:
This registers the IP packet type structure defined at net/ipv4/af_inet.c:
__netif_receive_skb_core
calls deliver_skb
(as seen in the above section), which calls func
(in this case, ip_rcv
).
ip_rcv
The ip_rcv
function is pretty straight-forward at a high level. There are several integrity checks to ensure the data is valid. Statistics counters are bumped, as well.
ip_rcv
ends by passing the packet to ip_rcv_finish
by way of netfilter. This is done so that any iptables rules that should be matched at the IP protocol layer can take a look at the packet before it continues on.
We can see the code which hands the data over to netfilter at the end of ip_rcv
in net/ipv4/ip_input.c:
netfilter and iptables
In the interest of brevity (and my RSI), I’ve decided to skip my deep dive into netfilter, iptables, and conntrack.
The short version is that NF_HOOK_THRESH
will check if any filters are installed and attempt to return execution back to the IP protocol layer to avoid going deeper into netfilter and anything that hooks in below that like iptables and conntrack.
Keep in mind: if you have numerous or very complex netfilter or iptables rules, those rules will be executed in the softirq context and can lead to latency in your network stack. This may be unavoidable, though, if you need to have a particular set of rules installed.
ip_rcv_finish
Once net filter has had a chance to take a look at the data and decide what to do with it, ip_rcv_finish
is called. This only happens if the data is not being dropped by netfilter, of course.
ip_rcv_finish
begins with an optimization. In order to deliver the packet to proper place, a dst_entry
from the routing system needs to be in place. In order to obtain one, the code initially attempts to call the early_demux
function from the higher level protocol that this data is destined for.
The early_demux
routine is an optimization which attempts to find the dst_entry
needed to deliver the packet by checking if a dst_entry
is cached on the socket structure.
Here’s what that looks like from net/ipv4/ip_input.c:
As you can see above, this code is guarded by a sysctl sysctl_ip_early_demux
. By default early_demux
is enabled. The next section includes information about how to disable it and why you might want to.
If the optimization is enabled and there is no cached entry (because this is the first packet arriving), the packet will be handed off to the routing system in the kernel where the dst_entry
will be computed and assigned.
Once the routing layer completes, statistics counters are updated and the function ends by calling dst_input(skb)
which in turn calls the input function pointer on the packet’s dst_entry
structure that was affixed by the routing system.
If the packet’s final destination is the local system, the routing system will attach the function ip_local_deliver
to the input function pointer in the dst_entry
structure on the packet.
Tuning: adjusting IP protocol early demux
Disable the early_demux
optimization by setting a sysctl
.
$ sudo sysctl -w net.ipv4.ip_early_demux=0
The default value is 1; early_demux
is enabled.
This sysctl was added as some users saw a ~5% decrease in throughput with the early_demux
optimization in some cases.
ip_local_deliver
Recall how we saw the following pattern in the IP protocol layer:
- Calls to
ip_rcv
do some initial bookkeeping. - Packet is handed off to netfilter for processing, with a pointer to a callback to be executed when processing finishes.
ip_rcv_finish
is the callback which finished processing and continued working toward pushing the packet up the networking stack.
ip_local_deliver
has the same pattern. From net/ipv4/ip_input.c:
Once netfilter has had a chance to take a look at the data, ip_local_deliver_finish
will be called, assuming the data is not dropped first by netfilter.
ip_local_deliver_finish
ip_local_deliver_finish
obtains the protocol from the packet, looks up a net_protocol
structure registered for that protocol, and calls the function pointed to by handler
in the net_protocol
structure.
This hands the packet up to the higher level protocol layer.
Monitoring: IP protocol layer statistics
Monitor detailed IP protocol statistics by reading /proc/net/snmp
.
$ cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos InDiscards InDelivers OutRequests OutDiscards OutNoRoutes ReasmTimeout ReasmReqds ReasmOKs ReasmFails FragOKs FragFails FragCreates
Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 12987882 51
1 10129840 2196520 1 0 0 0
...
This file contains statistics for several protocol layers. The IP protocol layer appears first. The first line contains space separate names for each of the corresponding values in the next line.
In the IP protocol layer, you will find statistics counters being bumped. Those counters are referenced by a C enum. All of the valid enum values and the field names they correspond to in /proc/net/snmp
can be found in include/uapi/linux/snmp.h:
Monitor extended IP protocol statistics by reading /proc/net/netstat
.
$ cat /proc/net/netstat | grep IpExt
IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts InOctets OutOctets InMcastOctets OutMcastOctets InBcastOctets OutBcastOctets InCsumErrors InNoECTPkts InECT0Pktsu InCEPkts
IpExt: 0 0 0 0 277959 0 14568040307695 32991309088496 0 0 58649349 0 0 0 0 0
The format is similar to /proc/net/snmp
, except the lines are prefixed with IpExt
.
Some interesting statistics:
InReceives
: The total number of IP packets that reachedip_rcv
before any data integrity checks.InHdrErrors
: Total number of IP packets with corrupted headers. The header was too short, too long, non-existent, had the wrong IP protocol version number, etc.InAddrErrors
: Total number of IP packets where the host was unreachable.ForwDatagrams
: Total number of IP packets that have been forwarded.InUnknownProtos
: Total number of IP packets with unknown or unsupported protocol specified in the header.InDiscards
: Total number of IP packets discarded due to memory allocation failure or checksum failure when packets are trimmed.InDelivers
: Total number of IP packets successfully delivered to higher protocol layers. Keep in mind that those protocol layers may drop data even if the IP layer does not.InCsumErrors
: Total number of IP Packets with checksum errors.
Note that each of these is incremented in really specific locations in the IP layer. Code gets moved around from time to time and double counting errors or other accounting bugs can sneak in. If these statistics are important to you, you are strongly encouraged to read the IP protocol layer source code for the metrics that are important to you so you understand when they are (and are not) being incremented.
Higher level protocol registration
This blog post will examine UDP, but the TCP protocol handler is registered the same way and at the same time as the UDP protocol handler.
In net/ipv4/af_inet.c
, the structure definitions which contain the handler functions for connecting the UDP, TCP , and ICMP protocols to the IP protocol layer can be found. From net/ipv4/af_inet.c:
These structures are registered in the initialization code of the inet address family. From net/ipv4/af_inet.c:
We’re going to be looking at the UDP protocol layer. As seen above, the handler
function for UDP is called udp_rcv
.
This is the entry point into the UDP layer where the IP layer hands data. Let’s continue our journey there.
UDP protocol layer
The code for the UDP protocol layer can be found in: net/ipv4/udp.c.
udp_rcv
The code for the udp_rcv
function is just a single line which calls directly into __udp4_lib_rcv
to handle receiving the datagram.
__udp4_lib_rcv
The __udp4_lib_rcv
function will check to ensure the packet is valid and obtain the UDP header, UDP datagram length, source address, and destination address. Next, are some additional integrity checks and checksum verification.
Recall that earlier in the IP protocol layer, we saw that an optimization is performed to attach a dst_entry
to the packet before it is handed off to the upper layer protocol (UDP in our case).
If a socket and corresponding dst_entry
were found, __udp4_lib_rcv
will queue the packet to the socket:
If there is no socket attached from the early_demux operation, a receiving socket will now be looked up by calling __udp4_lib_lookup_skb
.
In both cases described above, the datagram will be queued to the socket:
If no socket was found, the datagram will be dropped:
udp_queue_rcv_skb
The initial parts of this function are as follows:
- Determine if the socket associated with the datagram is an encapsulation socket. If so, pass the packet up to that layer’s handler function before proceeding.
- Determine if the datagram is a UDP-Lite datagram and do some integrity checks.
- Verify the UDP checksum of the datagram and drop it if the checksum fails.
Finally, we arrive at the receive queue logic which begins by checking if the receive queue for the socket is full. From net/ipv4/udp.c
:
sk_rcvqueues_full
The sk_rcvqueues_full
function checks the socket’s backlog length and the socket’s sk_rmem_alloc
to determine if the sum is greater than the sk_rcvbuf
for the socket (sk->sk_rcvbuf
in the above code snippet):
Tuning these values is a bit tricky as there are many things that can be adjusted.
Tuning: Socket receive queue memory
The sk->sk_rcvbuf
(called limit in sk_rcvqueues_full
above) value can be increased to whatever the sysctl net.core.rmem_max
is set to.
Increase the maximum receive buffer size by setting a sysctl
.
$ sudo sysctl -w net.core.rmem_max=8388608
sk->sk_rcvbuf
starts at the net.core.rmem_default
value, which can also be adjusted by setting a sysctl, like so:
Adjust the default initial receive buffer size by setting a sysctl
.
$ sudo sysctl -w net.core.rmem_default=8388608
You can also set the sk->sk_rcvbuf
size by calling setsockopt
from your application and passing SO_RCVBUF
. The maximum you can set with setsockopt
is net.core.rmem_max
.
However, you can override the net.core.rmem_max
limit by calling setsockopt
and passing SO_RCVBUFFORCE
, but the user running the application will need the CAP_NET_ADMIN
capability.
The sk->sk_rmem_alloc
value is incremented by calls to skb_set_owner_r
which set the owner socket of a datagram. We’ll see this called later in the UDP layer.
The sk->sk_backlog.len
is incremented by calls to sk_add_backlog
, which we’ll see next.
udp_queue_rcv_skb
Once it’s been verified that the queue is not full, progress toward queuing the datagram can continue. From net/ipv4/udp.c:
The first step is determine if the socket currently has any system calls against it from a userland program. If it does not, the datagram can be added to the receive queue with a call to __udp_queue_rcv_skb
. If it does, the datagram is queued to the backlog with a call to sk_add_backlog
.
The datagrams on the backlog are added to the receive queue when socket system calls release the socket with a call to release_sock
in the kernel.
__udp_queue_rcv_skb
The __udp_queue_rcv_skb
function adds datagrams to the receive queue by calling sock_queue_rcv_skb
and bumps statistics counters if the datagram could not be added to the receive queue for the socket.
From net/ipv4/udp.c:
Monitoring: UDP protocol layer statistics
Two very useful files for getting UDP protocol statistics are:
/proc/net/snmp
/proc/net/udp
/proc/net/snmp
Monitor detailed UDP protocol statistics by reading /proc/net/snmp
.
$ cat /proc/net/snmp | grep Udp\:
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors
Udp: 16314 0 0 17161 0 0
Much like the detailed statistics found in this file for the IP protocol, you will need to read the protocol layer source to determine exactly when and where these values are incremented.
InDatagrams
: Incremented whenrecvmsg
was used by a userland program to read datagram. Also incremented when a UDP packet is encapsulated and sent back for processing.NoPorts
: Incremented when UDP packets arrive destined for a port where no program is listening.InErrors
: Incremented in several cases: no memory in the receive queue, when a bad checksum is seen, and ifsk_add_backlog
fails to add the datagram.OutDatagrams
: Incremented when a UDP packet is handed down without error to the IP protocol layer to be sent.RcvbufErrors
: Incremented whensock_queue_rcv_skb
reports that no memory is available; this happens ifsk->sk_rmem_alloc
is greater than or equal tosk->sk_rcvbuf
.SndbufErrors
: Incremented if the IP protocol layer reported an error when trying to send the packet and no error queue has been setup. Also incremented if no send queue space or kernel memory are available.InCsumErrors
: Incremented when a UDP checksum failure is detected. Note that in all cases I could find,InCsumErrors
is incrememnted at the same time asInErrors
. Thus,InErrors
-InCsumErros
should yield the count of memory related errors on the receive side.
/proc/net/udp
Monitor UDP socket statistics by reading /proc/net/udp
$ cat /proc/net/udp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode ref pointer drops
515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000 00000000 104 0 7518 2 0000000000000000 0
558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7408 2 0000000000000000 0
588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7511 2 0000000000000000 0
769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7673 2 0000000000000000 0
812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000 00000000 0 0 7407 2 0000000000000000 0
The first line describes each of the fields in the lines following:
sl
: Kernel hash slot for the socketlocal_address
: Hexadecimal local address of the socket and port number, separated by:
.rem_address
: Hexadecimal remote address of the socket and port number, separated by:
.st
: The state of the socket. Oddly enough, the UDP protocol layer seems to use some TCP socket states. In the example above,7
isTCP_CLOSE
.tx_queue
: The amount of memory allocated in the kernel for outgoing UDP datagrams.rx_queue
: The amount of memory allocated in the kernel for incoming UDP datagrams.tr
,tm->when
,retrnsmt
: These fields are unused by the UDP protocol layer.uid
: The effective user id of the user who created this socket.timeout
: Unused by the UDP protocol layer.inode
: The inode number corresponding to this socket. You can use this to help you determine which user process has this socket open. Check/proc/[pid]/fd
, which will contain symlinks tosocket[:inode]
.ref
: The current reference count for the socket.pointer
: The memory address in the kernel of thestruct sock
.drops
: The number of datagram drops associated with this socket. Note that this does not include any drops related to sending datagrams (on corked UDP sockets or otherwise); this is only incremented in receive paths as of the kernel version examined by this blog post.
The code which outputs this can be found in net/ipv4/udp.c
.
Queuing data to a socket
Network data is queued to a socket with a call to sock_queue_rcv
. This function does a few things before adding the datagram to the queue:
- The socket’s allocated memory is checked to determine if it has exceeded the receive buffer size. If so, the drop count for the socket is incremented.
- Next
sk_filter
is used to process any Berkeley Packet Filter filters that have been applied to the socket. sk_rmem_schedule
is run to ensure sufficient receive buffer space exists to accept this datagram.- Next the size of the datagram is charged to the socket with a call to
skb_set_owner_r
. This incrementssk->sk_rmem_alloc
. - The data is added to the queue with a call to
__skb_queue_tail
. - Finally, any processes waiting on data to arrive in the socket are notified with a call to the
sk_data_ready
notification handler function.
And that is how data arrives at a system and traverses the network stack until it reaches a socket and is ready to be read by a user program.
Extras
There are a few extra things worth mentioning that are worth mentioning which didn’t seem quite right anywhere else.
Timestamping
As mentioned in the above blog post, the networking stack can collect timestamps of incoming data. There are sysctl values controlling when/how to collect timestamps when used in conjunction with RPS; see the above post for more information on RPS, timestamping, and where, exactly, in the network stack receive timestamping happens. Some NICs even support timestamping in hardware, too.
This is a useful feature if you’d like to try to determine how much latency the kernel network stack is adding to receiving packets.
The kernel documentation about timestamping is excellent and there is even an included sample program and Makefile you can check out!.
Determine which timestamp modes your driver and device support with ethtool -T
.
$ sudo ethtool -T eth0
Time stamping parameters for eth0:
Capabilities:
software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE)
software-receive (SOF_TIMESTAMPING_RX_SOFTWARE)
software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
PTP Hardware Clock: none
Hardware Transmit Timestamp Modes: none
Hardware Receive Filter Modes: none
This NIC, unfortunately, does not support hardware receive timestamping, but software timestamping can still be used on this system to help me determine how much latency the kernel is adding to my packet receive path.
Busy polling for low latency sockets
It is possible to use a socket option called SO_BUSY_POLL
which will cause the kernel to busy poll for new data when a blocking receive is done and there is no data.
IMPORTANT NOTE: For this option to work, your device driver must support it. Linux kernel 3.13.0’s igb
driver does not support this option. The ixgbe
driver, however, does. If your driver has a function set to the ndo_busy_poll
field of its struct net_device_ops
structure (mentioned in the above blog post), it supports SO_BUSY_POLL
.
A great paper explaining how this works and how to use it is available from Intel.
When using this socket option for a single socket, you should pass a time value in microseconds as the amount of time to busy poll in the device driver’s receive queue for new data. When you issue a blocking read to this socket after setting this value, the kernel will busy poll for new data.
You can also set the sysctl value net.core.busy_poll
to a time value in microseconds of how long calls with poll
or select
should busy poll waiting for new data to arrive, as well.
This option can reduce latency, but will increase CPU usage and power consumption.
Netpoll: support for networking in critical contexts
The Linux kernel provides a way for device drivers to be used to send and receive data on a NIC when the kernel has crashed. The API for this is called Netpoll and it is used by a few things, but most notably: kgdb, netconsole.
Most drivers support Netpoll; your driver needs to implement the ndo_poll_controller
function and attach it to the struct net_device_ops
that is registered during probe (as seen above).
When the networking device subsystem performs operations on incoming or outgoing data, the netpoll system is checked first to determine if the packet is destined for netpoll.
For example, we can see the following code in __netif_receive_skb_core
from net/dev/core.c
:
The Netpoll checks happen early in most of the Linux network device subsystem code that deals with transmitting or receiving network data.
Consumers of the Netpoll API can register struct netpoll
structures by calling netpoll_setup
. The struct netpoll
structure has function pointers for attaching receive hooks, and the API exports a function for sending data.
If you are interested in using the Netpoll API, you should take a look at the netconsole
driver, the Netpoll API header file, ‘include/linux/netpoll.h`, and this excellent talk.
SO_INCOMING_CPU
The SO_INCOMING_CPU
flag was not added until Linux 3.19, but it is useful enough that it should be included in this blog post.
You can use getsockopt
with the SO_INCOMING_CPU
option to determine which CPU is processing network packets for a particular socket. Your application can then use this information to hand sockets off to threads running on the desired CPU to help increase data locality and CPU cache hits.
The mailing list message introducing SO_INCOMING_CPU
provides a short example architecture where this option is useful.
DMA Engines
A DMA engine is a piece of hardware that allows the CPU to offload large copy operations. This frees the CPU to do other tasks while memory copies are done with hardware. Enabling the use of a DMA engine and running code that takes advantage of it, should yield reduced CPU usage.
The Linux kernel has a generic DMA engine interface that DMA engine driver authors can plug into. You can read more about the Linux DMA engine interface in the kernel source Documentation.
While there are a few DMA engines that the kernel supports, we’re going to discuss one in particular that is quite common: the Intel IOAT DMA engine.
Intel’s I/O Acceleration Technology (IOAT)
Many servers include the Intel I/O AT bundle, which is comprised of a series of performance changes.
One of those changes is the inclusion of a hardware DMA engine. You can check your dmesg
output for ioatdma
to determine if the module is being loaded and if it has found supported hardware.
The DMA offload engine is used in a few places, most notably in the TCP stack.
Support for the Intel IOAT DMA engine was included in Linux 2.6.18, but was disabled later in 3.13.11.10 due to some unfortunate data corruption bugs.
Users on kernels before 3.13.11.10 may be using the ioatdma
module on their servers by default. Perhaps this will be fixed in future kernel releases.
Direct cache access (DCA)
Another interesting feature included with the Intel I/O AT bundle is Direct Cache Access (DCA).
This feature allows network devices (via their drivers) to place network data directly in the CPU cache. How this works, exactly, is driver specific. For the igb
driver, you can check the code for the function igb_update_dca
, as well as the code for igb_update_rx_dca
. The igb
driver uses DCA by writing a register value to the NIC.
To use DCA, you will need to ensure that DCA is enabled in your BIOS, the dca
module is loaded, and that your network card and driver both support DCA.
Monitoring IOAT DMA engine
If you are using the ioatdma
module despite the risk of data corruption mentioned above, you can monitor it by examining some entries in sysfs
.
Monitor the total number of offloaded memcpy
operations for a DMA channel.
$ cat /sys/class/dma/dma0chan0/memcpy_count
123205655
Similarly, to get the number of bytes offloaded by this DMA channel, you’d run a command like:
Monitor total number of bytes transferred for a DMA channel.
$ cat /sys/class/dma/dma0chan0/bytes_transferred
131791916307
Tuning IOAT DMA engine
The IOAT DMA engine is only used when packet size is above a certain threshold. That threshold is called the copybreak
. This check is in place because for small copies, the overhead of setting up and using the DMA engine is not worth the accelerated transfer.
Adjust the DMA engine copybreak with a sysctl
.
$ sudo sysctl -w net.ipv4.tcp_dma_copybreak=2048
The default value is 4096.
Conclusion
The Linux networking stack is complicated.
It is impossible to monitor or tune it (or any other complex piece of software) without understanding at a deep level exactly what’s going on. Often, out in the wild of the Internet, you may stumble across a sample sysctl.conf
that contains a set of sysctl values that should be copied and pasted on to your computer. This is probably not the best way to optimize your networking stack.
Monitoring the networking stack requires careful accounting of network data at every layer. Starting with the drivers and proceeding up. That way you can determine where exactly drops and errors are occurring and then adjust settings to determine how to reduce the errors you are seeing.
There is, unfortunately, no easy way out.
Help with Linux networking or other systems
Need some extra help navigating the network stack? Have questions about anything in this post or related things not covered? Send us an email and let us know how we can help.