TCP-Tuning

This page describes how to enhance the performance of data transfers between distant computing sites. It mainly focuses on the tuning of Linux kernel parameters to improve TCP and disk I/O performance. Additional information specific to hardware vendors is given at the end of the document.


TCP Performance

Quattor

The Quattor TCP Tuning guide (http://lcg.in2p3.fr/wiki/images/Quattor_tcp_tuning.pdf) documents how to tune TCP performance with Quattor.

Kernel tuning

The following parameters are important for TCP tuning:

  • net.ipv4.tcp_rmem
  • net.ipv4.tcp_wmem
  • net.core.rmem_default
  • net.core.wmem_default
  • net.core.rmem_max
  • net.core.wmem_max
  • net.ipv4.tcp_dsack
  • net.ipv4.tcp_sack
  • net.ipv4.tcp_timestamps
  • net.core.netdev_max_backlog

They can be modified:

  • By passing the parameter directly to the sysctl command. This method is useful for testing a parameter, as the modification does not persist across reboots.
  • By adding the parameter to the /etc/sysctl.conf file and loading it with the sysctl command. This is the preferred method when the modification must survive a reboot (see the examples below).

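For example, a single parameter can be tested on the running kernel without making the change persistent; a minimal sketch using one of the parameters listed above:

sysctl net.core.rmem_max
sysctl -w net.core.rmem_max=2097152

The first command prints the current value; the second one changes it until the next reboot.
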
The following values are recommended:

  • net.ipv4.tcp_rmem = 131072 1048576 2097152
  • net.ipv4.tcp_wmem = 131072 1048576 2097152
  • net.core.rmem_default = 1048576
  • net.core.wmem_default = 1048576
  • net.core.rmem_max = 2097152
  • net.core.wmem_max = 2097152
  • net.ipv4.tcp_dsack = 0
  • net.ipv4.tcp_sack = 0
  • net.ipv4.tcp_timestamps = 0
  • net.core.netdev_max_backlog = 10000

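To apply the recommended values persistently, they can be appended to /etc/sysctl.conf and loaded with sysctl -p. For net.ipv4.tcp_rmem and net.ipv4.tcp_wmem the three numbers are the minimum, default and maximum socket buffer sizes in bytes. A minimal sketch:

cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_rmem = 131072 1048576 2097152
net.ipv4.tcp_wmem = 131072 1048576 2097152
net.core.rmem_default = 1048576
net.core.wmem_default = 1048576
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152
net.ipv4.tcp_dsack = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.core.netdev_max_backlog = 10000
EOF
sysctl -p
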
External links

  • http://fasterdata.es.net/
  • http://monalisa.cern.ch/FDT/documentation_syssettings.html
  • http://indico.cern.ch/contributionDisplay.py?sessionId=31&contribId=55&confId=61917
  • http://www.psc.edu/networking/projects/tcptune/
  • http://onlamp.com/pub/a/onlamp/2005/11/17/tcp_tuning.html
  • http://en.wikipedia.org/wiki/TCP_tuning

Tuning disk I/O

This section details disk I/O tuning. Disk benchmarking is covered on a separate page (Storage-Benches).

Kernel tuning

A few kernel parameters have a big impact on I/O performance:

  • getra
  • queue_depth
  • nr_requests
  • scheduler

To get your current configuration (e.g. for a /dev/sdb disk):

blockdev --getra /dev/sdb
cat /sys/block/sdb/device/queue_depth
cat /sys/block/sdb/queue/nr_requests
cat /sys/block/sdb/queue/scheduler

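On an untuned system these commands typically report the kernel defaults. The output below is only illustrative; the queue_depth value and the list of available schedulers depend on the controller driver and kernel version:

blockdev --getra /dev/sdb : 256
cat /sys/block/sdb/device/queue_depth : 32
cat /sys/block/sdb/queue/nr_requests : 128
cat /sys/block/sdb/queue/scheduler : noop deadline [cfq]
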
To modify your current configuration:

blockdev --setra 16384 /dev/sdb
echo 512 > /sys/block/sdb/queue/nr_requests
echo deadline > /sys/block/sdb/queue/scheduler
echo 256 > /sys/block/sdb/device/queue_depth

Note that the queue_depth parameter has a hardware-dependent maximum value (for example, on the old Sun Fire X4500 servers the queue_depth cannot exceed 31).

These kernel parameters have to be set for each block device at every boot. The simplest approach is to create a script that is called at boot time and sets the parameters for each device. Below is an example of such a script used at IPNO (http://ipnwww.in2p3.fr/); an example of calling it at boot time is given after the script.

#!/bin/bash

function f_get_disks_list {
  # Get the list of data disks
  # Assumes Linux software RAID (/dev/mdXX) and hardware RAID (/dev/sdXX) are not mixed on the same machine

  # Remove the system disks from the list returned by df
  # List of disks with their filesystem type (xfs, ext3, ext4)
  local l_disks_type_list=$(df -P -T |egrep -v "Filesystem|\/$|\/opt|\/tmp|\/var|\/usr|\/boot|tmpfs|varvol|tmpvol|usrvol|optvol" |egrep 'xfs|ext4|ext3' | awk '{print $1 " " $2 }')
  echo  ${l_disks_type_list}
}

function f_tune_io {
  # Usage   : f_tune_io block_device
  # Example : f_tune_io sdb
  [ $# != 1 ] && echo "Usage: f_tune_io block_device" && return 1
  local blockdev=$1
  #echo ${blockdev}
  echo "....................... Before tuning .........................."
  f_check_tune ${blockdev}
  QUEUE_DEPTH=$(cat /sys/block/${blockdev}/device/queue_depth)
  [ $QUEUE_DEPTH -lt 128 ] && QUEUE_DEPTH=128 # if the current value is larger, keep it
  dmidecode  -s system-product-name | egrep -q 'Sun Fire X4500'
  [ $? -eq 0 ] && QUEUE_DEPTH=31
  dmidecode  -s system-product-name | egrep -q 'Sun Fire X4540'
  [ $? -eq 0 ] && QUEUE_DEPTH=127
  echo ${QUEUE_DEPTH} > /sys/block/${blockdev}/device/queue_depth # 31 on Sun X4500, 128 or 256 on Dell
  echo 512 > /sys/block/${blockdev}/queue/nr_requests  # (instead of 128)
  echo deadline > /sys/block/${blockdev}/queue/scheduler # (instead of cfq)
  blockdev --setra 16384 /dev/${blockdev} # (instead of 256)
  echo "....................... After tuning .........................."
  f_check_tune ${blockdev}
}

function f_tune_md_io {
  # Usage   : f_tune_md_io md_device
  # Example : f_tune_md_io md11
  [ $# != 1 ] && echo "Usage: $0 md_device. Example: f_tune_md_io md11" && return 1

  local mddev=$1
  echo "+++++++++++++++++++++++++ MD DEVICE = ${mddev} +++++++++++++++++++++++"

  local disks=$(mdadm --query --detail /dev/${mddev}|grep 'active sync'|awk '{print $NF}')
  local i
  local j
  for i in ${disks}; do
    # /dev/sde1 becomes sde
    j=$(echo $i | awk -F/ '{print $NF}') # /dev/sde1 ==> sde1
    j=${j%%[0-9]}                     # sde1 ==> sde
    echo "Tuning ${j} ....................................................."
    f_tune_io $j
  done
}


function f_check_tune {
  # Usage   : f_check_tune block_device
  # Example : f_check_tune sdb
  [ $# != 1 ] && echo "Usage: f_check_tune block_device. Example: f_check_tune sdb" && return 1

  local i=$1

  echo -n "blockdev --getra /dev/$i : "
  blockdev --getra /dev/$i
  echo -n "cat /sys/block/$i/device/queue_depth : "
  cat /sys/block/$i/device/queue_depth
  echo -n "cat /sys/block/$i/queue/nr_requests : "
  cat /sys/block/$i/queue/nr_requests
  echo -n "cat /sys/block/$i/queue/scheduler : "
  cat /sys/block/$i/queue/scheduler
}

# Start of the tuning
echo " ======= I/O tuning : $(date) ======= "
disks_type_list=$(f_get_disks_list)
disks_list=$(echo ${disks_type_list} | sed -e 's/\b\(xfs\|ext4\|ext3\)\b//g' |sort -u)
echo ${disks_list}

# NB: the sed expression turns e.g. /dev/sda1 into sda
sd_devices=$(echo ${disks_list} | grep "\/dev\/sd" | sed 's/\(\/dev\/\|[0-9]\)//g' | tr '[:space:]' "\n" | sort -u | tr "\n" " ")
# NB: the sed expression turns e.g. /dev/md11 into md11
md_devices=$(echo ${disks_list} | grep "\/dev\/md" | sed 's/\/dev\///g')
lvm_devices=$(echo ${disks_list} | grep "\/dev\/mapper")

echo "sd_devices = ${sd_devices}"
echo "md_devices = ${md_devices}"
echo "lvm_devices = ${lvm_devices}"
[ ! -z "${sd_devices}" ] && for d in ${sd_devices}; do
  echo f_tune_io ${d}
  f_tune_io ${d}
done

[ ! -z "${md_devices}" ] && for d in ${md_devices}; do
  echo f_tune_md_io ${d}
  f_tune_md_io ${d}
done

# Find the VGs, then the PVs and the underlying whole disks, then tune them
#[ ! -z "${lvm_devices}" ] && for d in ${lvm_devices}; do
#  echo "tune_lvm_io ${d} not implemented yet"
#done

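To have the settings applied at every boot, the script can for instance be called from rc.local. A minimal sketch, assuming the script above has been installed as /usr/local/sbin/tune_io.sh (hypothetical path):

chmod 755 /usr/local/sbin/tune_io.sh
echo '/usr/local/sbin/tune_io.sh >> /var/log/tune_io.log 2>&1' >> /etc/rc.d/rc.local
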
External links

  • https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/main-io.html
  • https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/main-fs.html
  • http://www.gluster.org/community/documentation/index.php/Linux_Kernel_Tuning
  • http://insights.oetiker.ch/linux/raidoptimization/
  • http://mylabfr.kaminoweb.com/increase-disk-queue-depth-on-linux/
  • http://www.redhat.com/magazine/008jun05/features/schedulers/

Hardware recommendations

DELL Systems

R510 + PowerVault MD1200

  • Configure the RAID with a stripe size of 1 MB and the adaptive read-ahead mode
  • Use an XFS filesystem (take care of partition alignment and use the noatime mount option)
  • Increase the read_ahead kernel parameter
  • Modify the scheduler and the number of requests

Filesystem tuning:

# parted /dev/sdb mklabel gpt
# parted /dev/sdb mkpart primary xfs 1m 50%
# parted /dev/sdb mkpart primary xfs 50% 100%
# mkfs.xfs -d su=1m,sw=10 /dev/sdb1 -L R510_sdb1
# mkfs.xfs -d su=1m,sw=10 /dev/sdb2 -L R510_sdb2
# cat /etc/fstab | grep noatime
LABEL=R510_sdb1 /fs1 xfs defaults,noatime 0 0
LABEL=R510_sdb2 /fs2 xfs defaults,noatime 0 0

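To verify that the stripe geometry was taken into account by mkfs.xfs, the sunit and swidth values reported by xfs_info can be checked once the filesystem is mounted (the numbers depend on the RAID geometry):

# xfs_info /fs1 | grep -E 'sunit|swidth'
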
Kernel tuning:

# tail -6 /etc/rc.d/rc.local
blockdev --setra 16384 /dev/sdb
blockdev --setra 16384 /dev/sdc
echo 512 > /sys/block/sdb/queue/nr_requests
echo 512 > /sys/block/sdc/queue/nr_requests
echo deadline > /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdc/queue/scheduler


HP Systems

MDS 600 based system

In order to enhance the performance of the MDS 600 based systems, the following tunings have been applied:

  • Upgrade the hard disk firmware (HPD3); all hard disks should have the same firmware version
  • Change the power management profile (max CPU power)
  • Use an XFS filesystem aligned with the RAID stripe
  • Use RAID units composed of 5 disks carefully selected on different columns
  • Use the following kernel parameters for each disk:
echo "cfq" > /sys/block/cciss\!${disk}/queue/scheduler
echo 256 > /sys/block/cciss\!${disk}/queue/nr_requests
echo 4096 > /sys/block/cciss\!${disk}/queue/read_ahead_kb
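
The ${disk} variable in the commands above has to be set for each cciss logical drive. A minimal sketch of a loop applying these settings to every cciss device; it assumes the system disk is c0d0 and skips it (adapt the exclusion to your own layout):

for dev in /sys/block/cciss!*; do
  [ -e "${dev}" ] || continue                   # no cciss device found
  disk=$(basename ${dev} | sed 's/^cciss!//')   # e.g. c0d1
  [ "${disk}" = "c0d0" ] && continue            # skip the system disk (assumption)
  echo "cfq" > /sys/block/cciss\!${disk}/queue/scheduler
  echo 256 > /sys/block/cciss\!${disk}/queue/nr_requests
  echo 4096 > /sys/block/cciss\!${disk}/queue/read_ahead_kb
done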