
Linux Networking Optimisation Guide Part III (Cont.)

  • date: 2017-01-20
  • category: Networking
  • tags: networking, Linux

MTU

Changing the MTU to 9000 helps increase throughput and efficiency for large packets. However, every router/switch on the path must support jumbo frames, i.e. an MTU of 9000 or more.

ip link set eth1 mtu 9000
ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:1e:c9:b4:86:0e brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:1e:c9:b4:86:10 brd ff:ff:ff:ff:ff:ff
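
To verify that every device on the path really passes jumbo frames, a don't-fragment ping sized to a 9000-byte frame (8972 bytes of ICMP payload plus 28 bytes of IP/ICMP headers) can be used; the target address below is only a placeholder:

 ping -M do -s 8972 192.0.2.1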

QLEN

qlen is the length of the transmit queue feeding the queueing discipline in the packet transmit path. Increasing this value may lead to bufferbloat and therefore higher latency.

Normally a queue length of 1000 is enough for 10G/40G networks. However, if the error counters shown by ip -s link or ifconfig -a eth0 keep increasing, try increasing the qlen.

 ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether 00:22:19:5b:e2:f2 brd ff:ff:ff:ff:ff:ff

 ip -s link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT group default qlen 1000
    link/ether 00:22:19:5b:e2:f2 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast
    56272058545094 237996400274 10125   0       10125   790686739
    TX: bytes  packets  errors  dropped carrier collsns
    338460632172563 338969673742 0       0       0       0
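
If the counters keep growing, the queue length can be raised with ip link; 2000 below is just an illustrative value:

 ip link set eth0 txqueuelen 2000
 ip link show eth0 | grep qlen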

Power State

CPU power states other than C0 and C1 should all be disabled; ideally this is done first in the BIOS.

processor.max_cstate=1 and intel_idle.max_cstate=0 can be added to the kernel command line in GRUB to override the BIOS settings.

 vi /etc/default/grub

GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash processor.max_cstate=1 intel_idle.max_cstate=0"
GRUB_CMDLINE_LINUX="clocksource=tsc ipv6.disable=1"

 update-grub
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.16.0-4-amd64
Found initrd image: /boot/initrd.img-3.16.0-4-amd64
Found memtest86+ image: /boot/memtest86+.bin
Found memtest86+ multiboot image: /boot/memtest86+_multiboot.bin
done

It can be confirmed with:

cat /sys/module/intel_idle/parameters/max_cstate
9
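
The C-states still exposed by the idle driver can also be listed through sysfs (path valid for recent kernels; the list differs per CPU model):

 cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name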

Pause Frame

Pause frames are sent to the local switch port once the Tx/Rx buffers are full. If the switch supports pause frames, it pauses sending packets for on the order of milliseconds or less, which is usually enough time to drain the remaining Tx/Rx buffers. Both the switch and the server's ethernet card must support pause frames.

 ethtool -a eth4
Pause parameters for eth4:
Autonegotiate:    off
RX:        on
TX:        on

To turn on Rx and Tx pause frames:

 ethtool -A eth4 rx on tx on

TCP/UDP Parameter

There are a lot of TCP parameters on Linux that can be tuned:

net.ipv4.tcp_timestamps = 1

The timestamp option protects against wrapped sequence numbers and improves the window size and buffer calculations. If bandwidth is high and TCP sequence numbers wrap very quickly, turn the TCP timestamp option on, but keep in mind that it does increase CPU usage.

net.ipv4.tcp_sack = 0

Selective ACK allows the sender to retransmit only the lost segments rather than everything after the loss. Turning on SACK may increase CPU load. Unless the link has very high latency or high packet loss, it is suggested to turn tcp_sack off, although there is still some controversy about disabling this option.

net.ipv4.tcp_window_scaling = 1

Originally the TCP window field is only 16 bits, which is not enough for today's bandwidth. Turn on TCP window scaling to increase the usable window size; both sides must support window scaling for the option to be negotiated successfully.

net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 2097152
net.core.rmem_default = 2097152
net.core.optmem_max = 524287

wmem/rmem are the socket buffer sizes. The TCP and UDP Rx/Tx buffers can be set to larger values to avoid packet errors. Use netstat -us or netstat -ts to check whether there are UDP- or TCP-related errors.

net.core.somaxconn = 8192

To avoid new connections being dropped or SYN cookies being sent under heavy connection load, the TCP listen backlog should be increased. net.core.somaxconn caps the listen backlog, i.e. the maximum number of not-yet-accepted TCP connections queued per listening socket.

net.ipv4.tcp_adv_win_scale = 1
net.ipv4.tcp_fin_timeout = 15

Setting the FIN timeout to 15s releases resources more quickly; it is the time a connection stays in FIN-WAIT-2 before being closed.
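
To make these settings persistent, one approach (a sketch; the file name is arbitrary and only a subset of the parameters above is shown) is to put them in /etc/sysctl.d/ and reload:

 cat /etc/sysctl.d/90-network-tuning.conf
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.somaxconn = 8192
net.ipv4.tcp_fin_timeout = 15

 sysctl --system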

MSI-X/NAPI

Check that MSI-X is enabled; normally it is enabled by default.

# lspci -vvv | less
        Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
                Vector table: BAR=4 offset=00000000
                PBA: BAR=4 offset=00002000

Normally NAPI is enabled by default.
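
The per-queue MSI-X vectors actually in use can also be seen in /proc/interrupts (interface name assumed to be eth0):

 grep eth0 /proc/interrupts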

Module Parameters

Some module parameters (e.g. for bnx2x or ixgbe) may need to be adjusted to improve performance.

For example, the number of queues can be adjusted by reloading the module with a new value for 'num_queues'; do it from the console, since reloading the driver drops the network link. Normally this value does not need to be changed, as the default is the number of CPUs.

modprobe -r bnx2x
modprobe bnx2x num_queues=2

 modinfo bnx2x
parm:           num_queues: Set number of queues (default is as a number of CPUs) (int)
parm:           disable_tpa: Disable the TPA (LRO) feature (int)
parm:           int_mode: Force interrupt mode other than MSI-X (1 INT#x; 2 MSI) (int)
parm:           dropless_fc: Pause on exhausted host ring (int)
parm:           mrrs: Force Max Read Req Size (0..3) (for debug) (int)
parm:           debug: Default debug msglevel (int)

The value can be checked in sysfs:

 cat /sys/module/bnx2x/parameters/int_mode
0
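
To make a module parameter persist across reboots, it can be placed in a modprobe configuration file; the file name below is arbitrary:

 cat /etc/modprobe.d/bnx2x.conf
options bnx2x num_queues=2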

For the Intel ixgbe driver, there are many parameters that can be modified as well:

parm:           InterruptType:Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default IntMode (deprecated) (array of int)
parm:           IntMode:Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2 (array of int)
parm:           MQ:Disable or enable Multiple Queues, default 1 (array of int)
parm:           DCA:Disable or enable Direct Cache Access, 0=disabled, 1=descriptor only, 2=descriptor and data (array of int)
parm:           RSS:Number of Receive-Side Scaling Descriptor Queues, default 0=number of cpus (array of int)
parm:           VMDQ:Number of Virtual Machine Device Queues: 0/1 = disable, 2-16 enable (default=8) (array of int)
parm:           max_vfs:Number of Virtual Functions: 0 = disable (default), 1-63 = enable this many VFs (array of int)
parm:           VEPA:VEPA Bridge Mode: 0 = VEB (default), 1 = VEPA (array of int)
parm:           InterruptThrottleRate:Maximum interrupts per second, per vector, (0,1,956-488281), default 1 (array of int)
parm:           LLIPort:Low Latency Interrupt TCP Port (0-65535) (array of int)
parm:           LLIPush:Low Latency Interrupt on TCP Push flag (0,1) (array of int)
parm:           LLISize:Low Latency Interrupt on Packet Size (0-1500) (array of int)
parm:           LLIEType:Low Latency Interrupt Ethernet Protocol Type (array of int)
parm:           LLIVLANP:Low Latency Interrupt on VLAN priority threshold (array of int)
parm:           FdirPballoc:Flow Director packet buffer allocation level:
            1 = 8k hash filters or 2k perfect filters
            2 = 16k hash filters or 4k perfect filters
            3 = 32k hash filters or 8k perfect filters (array of int)
parm:           AtrSampleRate:Software ATR Tx packet sample rate (array of int)
parm:           FCoE:Disable or enable FCoE Offload, default 1 (array of int)
parm:           MDD:Malicious Driver Detection: (0,1), default 1 = on (array of int)
parm:           LRO:Large Receive Offload (0,1), default 0 = off (array of int)
parm:           allow_unsupported_sfp:Allow unsupported and untested SFP+ modules on 82599 based adapters, default 0 = Disable (array of int)
parm:           dmac_watchdog:DMA coalescing watchdog in microseconds (0,41-10000), default 0 = off (array of int)
parm:           vxlan_rx:VXLAN receive checksum offload (0,1), default 1 = Enable (array of int)

Conntrack Parameters

nf_conntrack_max and the related hash table size need to be tuned according to the free memory available.

# sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 1000000

The hash size can be set dynamically by changing the module parameter:

# echo 1000000 > /sys/module/nf_conntrack/parameters/hashsize
# cat /sys/module/nf_conntrack/parameters/hashsize
1000448

Normally the hash size is 1/8 of nf_conntrack_max. However, to increase lookup efficiency, a 1:1 ratio can be used if there is enough memory.

With nf_conntrack_max set to 1M (and a 1:1 hash size), the total memory used is:

total mem = conntrack_max * sizeof(struct ip_conntrack) + hash_size * sizeof(struct list_head)
          = 1M * 328B + 1M * 16B = 344MB
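
The same arithmetic can be reproduced with a quick shell calculation (per-entry sizes taken from the formula above; they vary between kernels and architectures):

 CONNTRACK_MAX=1000000
 HASHSIZE=$CONNTRACK_MAX                      # 1:1 ratio as suggested above
 echo "$(( (CONNTRACK_MAX * 328 + HASHSIZE * 16) / 1000000 )) MB"
344 MB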

There is a script for calculating the required memory. Note that it uses the default 1:8 hash-size-to-max ratio, so this is not the optimised case.

 python conn_table_mem.py 1000
On this machine, each conntrack entry requires 328 bytes of kernel memory, and each hash table entry requires 16.

Therefore to consume a maximum of 1000 MiB of kernel memory:
 - conntrack_max should be set to 3177503
 - Using the kernel's default ratio, the nf_conntrack module's `hashsize' parameter should be set to 397188

The timeout values can also be optimised.

It is OK to set the generic timeout to 30~120s; tcp_timeout_established should also be reduced to a smaller value.

net.netfilter.nf_conntrack_generic_timeout = 120
net.netfilter.nf_conntrack_tcp_timeout_established = 86400

Please note that for long-lived TCP sessions, for example a single online-gaming TCP connection that stays up for 5 days, it is better to keep tcp_timeout_established at its original value.

Hardware Offload

Hardware offloads are functions embedded in the ethernet card that take some load off the CPU. If performance is extremely poor, tweak the GRO/TSO/LRO configuration. If UDP performance is extremely poor, try turning off UFO; likewise, turn off TSO if TCP performance is low.

 ethtool -k eth0
Features for eth0:
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: on
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]

As a result of TSO/GSO, the amount of data sitting in the ring buffer also increases greatly, and consequently latency rises as well. Tweak these features to balance CPU load and throughput against latency.
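
As a sketch, individual offloads can be toggled with ethtool -K and the effect measured before deciding what to keep (eth0 and the chosen features are just examples):

 ethtool -K eth0 tso off gso off
 ethtool -K eth0 gro on
 ethtool -k eth0 | grep -E 'segmentation|receive-offload'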

Queueing Disciplines

The default qdisc on Linux is pfifo_fast, which is far from the best queueing strategy because of the deep buffer in a single queue. As a result latency grows, coupled with the bufferbloat effect. Moreover, different traffic classes may not be well prioritised by the default pfifo_fast strategy. Several other qdiscs are available; fq_codel currently seems to be the best general choice, but the right selection depends heavily on the actual traffic pattern.
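
A minimal way to try fq_codel, assuming a recent enough kernel, is to set it as the default qdisc for newly created interfaces and/or replace the root qdisc on an existing one:

 sysctl -w net.core.default_qdisc=fq_codel
 tc qdisc replace dev eth0 root fq_codel
 tc qdisc show dev eth0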

Memory

Each memory channel should have at least one DIMM (at least 4G) installed to maximise memory performance.

 dmidecode -t memory | grep Locator
    Locator: PROC 1 DIMM 1
    Bank Locator: Not Specified
    Locator: PROC 1 DIMM 2
    Bank Locator: Not Specified
    Locator: PROC 1 DIMM 3
    Bank Locator: Not Specified
    Locator: PROC 1 DIMM 4
    Bank Locator: Not Specified
    Locator: PROC 1 DIMM 5
    Bank Locator: Not Specified
    Locator: PROC 1 DIMM 6
    Bank Locator: Not Specified
    Locator: PROC 1 DIMM 7
    Bank Locator: Not Specified
    Locator: PROC 1 DIMM 8
    Bank Locator: Not Specified
    Locator: PROC 2 DIMM 1
    Bank Locator: Not Specified
    Locator: PROC 2 DIMM 2
    Bank Locator: Not Specified
    Locator: PROC 2 DIMM 3
    Bank Locator: Not Specified
    Locator: PROC 2 DIMM 4
    Bank Locator: Not Specified
    Locator: PROC 2 DIMM 5
    Bank Locator: Not Specified
    Locator: PROC 2 DIMM 6
    Bank Locator: Not Specified
    Locator: PROC 2 DIMM 7
    Bank Locator: Not Specified
    Locator: PROC 2 DIMM 8
    Bank Locator: Not Specified

PCI-E Slots

Gen3 PCI-E slots have higher throughput than Gen2 slots (8 GT/s per lane versus 5 GT/s).

Make sure the PCI-E slot the ethernet card is inserted into supports 20G or more of bandwidth; a narrow or Gen2 slot, depending on its width, may not.

 lspci -s 04:00.0 -vvv | grep LnkSta
        LnkSta:    Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
 dmidecode --type 9
# dmidecode 2.12
SMBIOS 2.8 present.

Handle 0x0900, DMI type 9, 17 bytes
System Slot Information
    Designation: PCIe Slot 1
    Type: x16 PCI Express 3
    Current Usage: Available
    Length: Long
    Characteristics:
        3.3 V is provided
        PME signal is supported

Handle 0x0901, DMI type 9, 17 bytes
System Slot Information
    Designation: PCIe Slot 2
    Type: x8 PCI Express 3 x16
    Current Usage: Available
    Length: Long
    Characteristics:
        3.3 V is provided
        PME signal is supported

Handle 0x0902, DMI type 9, 17 bytes
System Slot Information
    Designation: PCIe Slot 3
    Type: x8 PCI Express 3 x16
    Current Usage: In Use
    Length: Long
    Characteristics:
        3.3 V is provided
        PME signal is supported
    Bus Address: 0000:04:00.0
    

5 GT/s * 8 lanes = 40 GT/s; with 8b/10b encoding: 40 GT/s * 8/10 = 32 Gbps

So theoretically this slot can provide 32 Gbps in each direction.

May 26 11:37:30 kernel: [    2.899292] ixgbe 0000:04:00.0: PCI Express bandwidth of 32GT/s available
May 26 11:37:30 kernel: [    2.899294] ixgbe 0000:04:00.0: (Speed:5.0GT/s, Width: x8, Encoding Loss:20%)
May 26 11:37:30 kernel: [    2.899378] ixgbe 0000:04:00.0: MAC: 2, PHY: 15, SFP+: 5, PBA No: E66560-002

For PCIe 2.0 (8b/10b encoding) the overhead is 2 bits in every 10; for PCIe 3.0 (128b/130b encoding) it is 2 bits in every 130.

BIOS

  • Select max performance in Power Management Options
  • Disable CPU power states such as C6, C3, C1E or similar, leaving only C0 and C1
  • Turn on HT (Hyper-Threading)
  • The PCI 'extended_tag' configuration has a big impact on the small-packet performance of 40G ethernet interfaces (setpci can also be used to set it)

Intel specific

UDP Flow Hash

Intel ethernet cards support RSS, but for UDP packets the ixgbe driver sends packets with the fragmentation bit set to CPU0 rather than spreading them across the other CPUs. This default behaviour avoids UDP packet reordering, but it degrades UDP performance, especially VXLAN tunnel performance on 10G NICs.

ethtool -N eth0 rx-flow-hash udp4 sdfn
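
The fields currently used for the UDP flow hash can be checked afterwards (same interface name assumed):

 ethtool -n eth0 rx-flow-hash udp4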

Burst Length

setpci modifies the adapter's configuration register to allow it to read up to 4 KB at a time (Tx only). Use it with caution: it may put some systems into an unstable state. Reboot to restore the default, or use setpci to write the value 22 back.

 lspci -nn | grep 82599
04:00.0 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
04:00.1 Ethernet controller [0200]: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection [8086:10fb] (rev 01)
setpci -d 8086:1a48 e6.b=2e
