  Infiniband HOWTO
  Guy Coates


  This document describes how to install and configure the OFED
  infiniband software on Debian.
  ______________________________________________________________________

  Table of Contents



  1. Introduction
     1.1 The latest version
     1.2 What is OFED?

  2. Installing the OFED Software
     2.1 Installing prebuilt packages
     2.2 Building packages from source
        2.2.1 Install the prerequisite development packages
        2.2.2 Checkout the svn tree
        2.2.3 Install the upstream source (optional)
        2.2.4 Build the packages.

  3. Install the kernel modules
     3.1 Building new kernel modules

  4. Setting up a basic infiniband network
     4.1 Upgrade your Infiniband card and switch firmware
     4.2 Physically Connect the network
     4.3 Choose a Subnet Manager
     4.4 Load the kernel modules
     4.5 (optional) Start opensm
     4.6 Check network health
     4.7 Check the extended network connectivity
     4.8 Testing connectivity with ibping
     4.9 Testing RDMA performance

  5. IP over Infiniband (IPoIB)
     5.1 List the network devices
     5.2 IP Configuration
     5.3 Connected vs Unconnected Mode
     5.4 TCP tuning
     5.5 ARP and dual ported cards

  6. OpenMPI
     6.1 Configure IPoIB
     6.2 Load the modules
     6.3 Check permissions and limits
     6.4 Install the mpi test programs
     6.5 Configure Host Access
     6.6 Run the MPI PingPong benchmark

  7. SDP
     7.1 Configuration
     7.2 Example Using SDP with Netpipe

  8. SRP
     8.1 Configuration
     8.2 SRP daemon configuration
        8.2.1 Determine the IDs of presented devices
        8.2.2 Configure srp_daemon to connect to the devices
     8.3 Multipathing, LVM and formatting

  9. Building Lustre against OFED
     9.1 Check Compatibility
     9.2 Build a lustre patched kernel
     9.3 Build OFED modules for the lustre patched kernel
     9.4 Configure lustre

  10. Troubleshooting
     10.1 General fabric troubleshooting
     10.2 ib_query_gid() failed errors on mlx4 platforms
     10.3 Missing XRC support

  11. Tips and Tricks
     11.1 Descriptive node names

  12. Further Information


  ______________________________________________________________________

  11..  IInnttrroodduuccttiioonn

  This document describes how to install and configure the OFED
  infiniband software on Debian. This document is intended to show you
  how to configure a simple Infiniband network as quickly as possible.
  It is not a replacement for the detailed documentation provided in the
  ofed-docs package!

  11..11..  TThhee llaatteesstt vveerrssiioonn

  The latest version of the howto can be found on the pkg-ofed alioth
  website:

  http://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html

  Source is kept in the SVN repository:

  http://svn.debian.org/wsvn/pkg-ofed/

  11..22..  WWhhaatt iiss OOFFEEDD??

  OFED (OpenFabrics Enterprise Distribution) is the de facto Infiniband
  software stack on Linux. OFED provides a consistent set of kernel
  modules and userspace libraries which have been tested together.

  Further details of the OpenFabrics Alliance and OFED can be found at
  http://www.openfabrics.org

  22..  IInnssttaalllliinngg tthhee OOFFEEDD SSooffttwwaarree

  Before you can use your infiniband network you will need to install
  the OFED software on your infiniband client machines.  You can choose
  to use the pre-built packages on alioth, or build your own packages
  straight from the alioth SVN repository.

  22..11..  IInnssttaalllliinngg pprreebbuuiilltt ppaacckkaaggeess

  Add the following lines to your sources.list file:


       deb http://pkg-ofed.alioth.debian.org/apt/ofed ./
       deb-src http://pkg-ofed.alioth.debian.org/apt/ofed ./



  and run:


       aptitude update
       aptitude install ofed



  22..22..  BBuuiillddiinngg ppaacckkaaggeess ffrroomm ssoouurrccee

  If you wish to build the OFED packages from the alioth svn repository,
  use the following procedure.
  22..22..11..  IInnssttaallll tthhee pprreerreeqquuiissiitteess ddeevveellooppmmeenntt ppaacckkaaggeess



       aptitude install svn-buildpackage build-essential devscripts



  22..22..22..  CChheecckkoouutt tthhee ssvvnn ttrreeee


       svn co svn://svn.debian.org/pkg-ofed/


  22..22..33..  IInnssttaallll tthhee uuppssttrreeaamm ssoouurrccee ((ooppttiioonnaall))

  The upstream source tarballs need to be available if you want to build
  proper Debian packages suitable for inclusion upstream. If you are
  simply building packages for your own use, you can ignore this step.


       cd pkg-ofed
       mkdir tarballs



  Original source tarballs can be downloaded from the repository:


         apt-get source libibverbs



  Alternatively, you can grab the source code directly from upstream.

  http://www.openfabrics.org/downloads/OFED/

  Upstream source is distributed via SRPMS; you can use alien to convert
  them into tarballs.

  22..22..44..  BBuuiilldd tthhee ppaacckkaaggeess..

  Change into the directory of the package you wish to build, e.g. for
  libibcommon:

       cd pkg-ofed/libibcommon


  Link in the upstream tarballs directory (optional)

       ln -s -f ../tarballs .


  Run svn-buildpackage from within the trunk directory.


        cd trunk
        svn-buildpackage -uc -us -rfakeroot



  The build process will generate a deb in the build-area directory.

  Repeat the process for the rest of the packages. Note that some
  packages have build dependencies on other OFED packages. The suggested
  build order is:


        libibverbs
        libnes
        libcxgb3
        libipathverbs
        libmlx4
        libmthca
        librdmacm
        libibcm
        libibcommon
        libibumad
        libibmad
        libsdp
        dapl
        opensm
        infiniband-diags
        ibutils
        mstflint
        perftest
        qlvnictools
        qperf
        rds-tools
        sdpnetstat
        srptools
        tvflash
        ibsim
        mpitests
        ofed-docs
        ofa_kernel
        ofed



  33..  IInnssttaallll tthhee kkeerrnneell mmoodduulleess

  You now need to build a set of OFED kernel modules which match the
  version of the OFED software you have installed.

  The Debian kernel contains a set of OFED infiniband drivers, but they
  may not match the OFED userspace version you have installed. Consult
  the table below to determine which OFED version the Debian kernel
  contains.



       Debian Kernel Version      OFED Version
       <=2.6.26                       1.3
       >=2.6.27                       1.4



  If the Debian kernel modules are the incorrect version, you can build
  a new set of modules using the ofa-kernel-source package.  If your
  kernel already includes the correct OFED kernel modules you can skip
  the rest of this section. If you are in doubt, you should build a new
  set of modules rather than relying on the modules shipped with the
  kernel.
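
  The table can also be expressed as a small shell function. This is
  only a sketch of the lookup: the function name ofed_for_kernel is
  hypothetical, it relies on the -V (version sort) option of GNU sort,
  and it knows nothing beyond the two rows of the table.

```shell
#!/bin/sh
# Map a kernel version to the OFED version of its in-tree drivers,
# following the two-row table above.

ofed_for_kernel() {
    # Drop any packaging suffix such as "-2-amd64" before comparing.
    kver=${1%%-*}
    # Find the smaller of the given version and 2.6.27.
    lowest=$(printf '%s\n' "$kver" 2.6.27 | sort -V | head -n 1)
    if [ "$lowest" = "2.6.27" ]; then
        echo 1.4    # kernel is 2.6.27 or newer
    else
        echo 1.3    # kernel is 2.6.26 or older
    fi
}

ofed_for_kernel 2.6.26-2-amd64
ofed_for_kernel 2.6.27
```

  On a running system you could call it as
  ofed_for_kernel "$(uname -r)".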



  33..11..  BBuuiillddiinngg nneeww kkeerrnneell mmoodduulleess

  You can build new kernel modules using module-assistant.


       aptitude install module-assistant



  Ensure you have the ofa-kernel-source package installed, and then run:


        module-assistant prepare
        module-assistant clean ofa-kernel
        module-assistant build ofa-kernel



  This procedure will create an ofa-kernel-modules deb in /usr/src. You
  can then install the deb using dpkg or by running:


        module-assistant install ofa-kernel



  The deb can also be copied to your other infiniband hosts and
  installed using dpkg.

  As the deb contains replacements for existing kernel modules you will
  need to either manually remove any infiniband modules which have
  already been loaded, or reboot the machine, before you can use the new
  modules.

  The new kernel modules will be installed into
  /lib/modules/<kernel-version>/updates. They will not overwrite the
  original kernel modules, but the module loader will pick up the
  modules from the updates directory in preference. You can verify that
  the system is using the new kernel modules by running the modinfo
  command.



       # modinfo ib_core
       filename:       /lib/modules/2.6.22.19/updates/kernel/drivers/infiniband/core/ib_core.ko
       author:         Roland Dreier
       description:    core kernel InfiniBand API
       license:        Dual BSD/GPL
       vermagic:       2.6.22.19 SMP mod_unload



  Note that if you wish to rebuild the kernel modules for any reason
  (e.g. for a new kernel version or to continue an interrupted build),
  you must issue the "module-assistant clean" command before trying a
  new build.

  44..  SSeettttiinngg uupp aa bbaassiicc iinnffiinniibbaanndd nneettwwoorrkk

  This section describes how to set up a basic infiniband network and
  test its functionality.


  44..11..  UUppggrraaddee yyoouurr IInnffiinniibbaanndd ccaarrdd aanndd sswwiittcchh ffiirrmmwwaarree

  Before proceeding you should ensure that the firmware in your switches
  and infiniband cards is at the latest release.  Older firmware
  versions may cause interoperability and fabric stability issues. Do
  not assume that just because your hardware has come fresh from the
  factory that it has the latest firmware on it.

  You should follow the documentation from your vendor as to how the
  firmware should be updated.

  44..22..  PPhhyyssiiccaallllyy CCoonnnneecctt tthhee nneettwwoorrkk

  Connect up your hosts and switches.

  44..33..  CChhoooossee aa SSuubbnneett MMaannaaggeerr

  Each infiniband network requires a subnet manager. You can choose to
  run the OFED opensm subnet manager on one of the Linux clients, or you
  may choose to use an embedded subnet manager running on one of the
  switches in your fabric. Note that not all switches come with a subnet
  manager; check your switch documentation.

  44..44..  LLooaadd tthhee kkeerrnneell mmoodduulleess

  Infiniband kernel modules are not loaded automatically. You should
  add them to /etc/modules so that they are automatically loaded on
  machine bootup. You will need to include the hardware specific modules
  and the protocol modules.

  /etc/modules:

       # Hardware drivers
       # Choose the appropriate modules from
       # /lib/modules/<kernel-version>/updates/kernel/drivers/infiniband/hw
       #
       #mlx4_ib  # Mellanox ConnectX cards
       #ib_mthca # some Mellanox cards
       #iw_cxgb3 # Chelsio T3 cards
       #iw_nes   # NetEffect cards
       #
       # Protocol modules
       # Common modules
       rdma_ucm
       ib_umad
       ib_uverbs
       # IP over IB
       ib_ipoib
       # SCSI over IB
       ib_srp
       # IB SDP protocol
       ib_sdp



  44..55..  ((ooppttiioonnaall)) SSttaarrtt ooppeennssmm

  If you are going to use the opensm subnet manager, edit
  /etc/default/opensm and add the port GUIDs of the interfaces on which
  you wish to start opensm.

  You can find the port GUIDs of your cards with the ibstat -p command:



       # ibstat -p
       0x0002c9030002fb05
       0x0002c9030002fb06



  /etc/default/opensm:


       PORTS="0x0002c9030002fb05 0x0002c9030002fb06"



  Note that if you want to start opensm on all ports you can set
  PORTS="ALL".
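
  If you have several ports, the PORTS line can be generated from the
  GUID list rather than typed by hand. The following is a minimal
  sketch; the helper name ports_line is hypothetical and the sample
  GUIDs are the ones shown above.

```shell
#!/bin/sh
# Turn a list of port GUIDs (one per line, as printed by "ibstat -p")
# into a PORTS= line suitable for /etc/default/opensm.

ports_line() {
    # Join the lines on stdin with single spaces.
    guids=$(paste -sd ' ' -)
    printf 'PORTS="%s"\n' "$guids"
}

# Sample input; on a real host you would run: ibstat -p | ports_line
printf '0x0002c9030002fb05\n0x0002c9030002fb06\n' | ports_line
```

  Review the generated line before copying it into /etc/default/opensm,
  since you may not want opensm on every port.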

  Start opensm:


       # /etc/init.d/opensm start



  If opensm has started correctly you should see SUBNET UP messages in
  the opensm logfile (/var/log/opensm.<PORTID>.log).


  Mar 04 14:56:06 600685 [4580A960] 0x02 -> SUBNET UP



  Note that you can start opensm on multiple nodes; one node will be the
  active subnet manager and the others will put themselves into standby.

  44..66..  CChheecckk nneettwwoorrkk hheeaalltthh

  You can now check the status of the local IB link with the ibstat
  command. Connected links should be in the "LinkUp" state. The
  following output is from a dual-ported card on which only port 1 is
  connected.



  # ibstat
  CA 'mlx4_0'
          CA type: MT25418
          Number of ports: 2
          Firmware version: 2.3.0
          Hardware version: a0
          Node GUID: 0x0002c9030002fb04
          System image GUID: 0x0002c9030002fb07
          Port 1:
                  State: Active
                  Physical state: LinkUp
                  Rate: 20
                  Base lid: 2
                  LMC: 0
                  SM lid: 1
                  Capability mask: 0x02510868
                  Port GUID: 0x0002c9030002fb05
          Port 2:
                  State: Down
                  Physical state: Polling
                  Rate: 10
                  Base lid: 0
                  LMC: 0
                  SM lid: 0
                  Capability mask: 0x02510868
                  Port GUID: 0x0002c9030002fb06
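
  On hosts with several cards it can be easier to reduce this output to
  one line per port. The awk filter below is a sketch that assumes the
  field layout shown above; port_summary is a hypothetical name, and
  the sample text is a shortened copy of the example output.

```shell
#!/bin/sh
# Summarise ibstat-style output read on stdin as
# "<CA> port <N>: <State> / <Physical state>", one line per port.

port_summary() {
    awk -v q="'" '
        # "CA <quoted name>" starts a new adapter; strip the quotes.
        $1 == "CA" && index($2, q) == 1 { ca = $2; gsub(q, "", ca) }
        # "Port N:" starts a port; "Port GUID:" does not match.
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:/, "", port) }
        $1 == "State:"                   { state = $2 }
        $1 == "Physical" && $2 == "state:" {
            print ca " port " port ": " state " / " $3
        }
    '
}

sample="CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Port 1:
                State: Active
                Physical state: LinkUp
        Port 2:
                State: Down
                Physical state: Polling"

# On a real host you would run: ibstat | port_summary
printf '%s\n' "$sample" | port_summary
```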



  44..77..  CChheecckk tthhee eexxtteennddeedd nneettwwoorrkk ccoonnnneeccttiivviittyy

  Once the host is connected to the infiniband network you can check the
  health of all of the other network components with the ibhosts,
  ibswitches and iblinkinfo commands.

  ibhosts displays all of the hosts visible on the network.



       # ibhosts
       Ca      : 0x0008f1040399d3d0 ports 2 "Voltaire HCA400Ex-D"
       Ca      : 0x0008f1040399d370 ports 2 "Voltaire HCA400Ex-D"
       Ca      : 0x0008f1040399d3fc ports 2 "Voltaire HCA400Ex-D"
       Ca      : 0x0008f1040399d3f4 ports 2 "Voltaire HCA400Ex-D"
       Ca      : 0x0002c9030002faf4 ports 2 "MT25408 ConnectX Mellanox Technologies"
       Ca      : 0x0002c9030002fc0c ports 2 "MT25408 ConnectX Mellanox Technologies"
       Ca      : 0x0002c9030002fc10 ports 2 "MT25408 ConnectX Mellanox Technologies"



  ibswitches will display all of the switches in the network.


       # ibswitches
       Switch  : 0x0008f104004121fa ports 24 "ISR9024D-M Voltaire" enhanced port 0 lid 1 lmc 0



  iblinkinfo will show the status and speed of all of the links in the
  network.



  # iblinkinfo.pl
  Switch 0x0008f104004121fa ISR9024D-M Voltaire:
        1    1[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       2    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1    2[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      13    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1    3[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       4    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1    4[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      26    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1    5[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      27    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1    6[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      24    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1    7[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      28    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1    8[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      25    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1    9[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      31    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1   10[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      32    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1   11[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      33    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1   12[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      29    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
        1   13[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      30    1[  ] "MT25408 ConnectX Mellanox Technologies" (  )
            14[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
        1   15[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       3    1[  ] "Voltaire HCA400Ex-D" (  )
        1   16[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      10    1[  ] "Voltaire HCA400Ex-D" (  )
            17[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
            18[  ]  ==( 4X 2.5 Gbps   Down /  Polling)==>             [  ] "" (  )
        1   19[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       7    2[  ] "Voltaire HCA400Ex-D" (  )
        1   20[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       6    2[  ] "Voltaire HCA400Ex-D" (  )
        1   21[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       5    2[  ] "Voltaire HCA400Ex-D" (  )
        1   22[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>      21    1[  ] "Voltaire HCA400Ex-D" (  )
        1   23[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       9    2[  ] "Voltaire HCA400Ex-D" (  )
        1   24[  ]  ==( 4X 5.0 Gbps Active /   LinkUp)==>       8    1[  ] "Voltaire HCA400Ex-D" (  )



  44..88..  tteessttiinngg ccoonnnneeccttiivviittyy wwiitthh iibbppiinngg

  ibping is an infiniband equivalent to the ICMP ping command. Choose a
  node on the fabric and run an ibping server:

       # ibping -S


  Choose another node on your network, and then ping the port GUID of
  the server. (ibstat on the server will list the port GUID).



       # ibping -G 0x0002c9030002fc1d
       Pong from test.example.com (Lid 13): time 0.072 ms
       Pong from test.example.com (Lid 13): time 0.043 ms
       Pong from test.example.com (Lid 13): time 0.045 ms
       Pong from test.example.com (Lid 13): time 0.045 ms



  44..99..  TTeessttiinngg RRDDMMAA ppeerrffoorrmmaannccee

  You can test the latency and bandwidth of a link with the ib_rdma_lat
  and ib_rdma_bw commands.

  To test the latency, start the server on a node:

       # ib_rdma_lat


  and then start a client on another node, giving it the hostname of the
  server.


  # ib_rdma_lat hostname-of-server
     local address: LID 0x0d QPN 0x18004a PSN 0xca58c4 RKey 0xda002824 VAddr 0x00000000509001
    remote address: LID 0x02 QPN 0x7c004a PSN 0x4b4eba RKey 0x82002466 VAddr 0x00000000509001
  Latency typical: 1.15193 usec
  Latency best   : 1.13094 usec
  Latency worst  : 5.48519 usec



  You can test the bandwidth of the link using the ib_rdma_bw command.
  Start the server on a node:

       # ib_rdma_bw


  and then start a client on another node, giving it the hostname of the
  server.


       # ib_rdma_bw hostname-of-server
       855: | port=18515 | ib_port=1 | size=65536 | tx_depth=100 | iters=1000 | duplex=0 | cma=0 |
       855: Local address:  LID 0x0d, QPN 0x1c004a, PSN 0xbf60dd RKey 0xde002824 VAddr 0x002aea4092b000
       855: Remote address: LID 0x02, QPN 0x004a, PSN 0xaad03c, RKey 0x86002466 VAddr 0x002b8a4e191000


       855: Bandwidth peak (#0 to #955): 1486.85 MB/sec
       855: Bandwidth average: 1486.47 MB/sec
       855: Service Demand peak (#0 to #955): 1970 cycles/KB
       855: Service Demand Avg  : 1971 cycles/KB



  The perftest package contains a number of other similar benchmarking
  programs to test various aspects of your network.

  55..  IIPP oovveerr IInnffiinniibbaanndd ((IIPPooIIBB))

  The OFED stack allows you to run TCP/IP over your infiniband network,
  allowing you to run non-infiniband aware applications across your
  network. Several native infiniband applications also use IPoIB for
  host resolution (e.g. Lustre and SDP).

  55..11..  LLiisstt tthhee nneettwwoorrkk ddeevviicceess

  Check that the IPoIB module is loaded.


       # modprobe ib_ipoib


  You will now have an "ib" network interface for each port on your
  infiniband cards.



  # ifconfig -a

  <snip>
  ib0       Link encap:UNSPEC  HWaddr 80-06-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
            BROADCAST MULTICAST  MTU:2044  Metric:1
            RX packets:0 errors:0 dropped:0 overruns:0 frame:0
            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:256
            RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

  ib1       Link encap:UNSPEC  HWaddr 80-06-00-49-FE-80-00-00-00-00-00-00-00-00-00-00
            BROADCAST MULTICAST  MTU:2044  Metric:1
            RX packets:0 errors:0 dropped:0 overruns:0 frame:0
            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:256
            RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
  <snip>



  55..22..  IIPP CCoonnffiigguurraattiioonn

  You can now configure the ib network devices using
  /etc/network/interfaces.


       auto ib0
       iface ib0 inet static
         address 172.31.128.50
         netmask 255.255.240.0
         broadcast 172.31.143.255
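
  The broadcast value follows from the address and netmask: each octet
  is the address octet ORed with the inverted netmask octet. The sketch
  below shows the arithmetic in plain shell; broadcast_addr is a
  hypothetical helper name, and the sample values are the ones used in
  the stanza above.

```shell
#!/bin/sh
# Derive the IPv4 broadcast address from an address and a netmask:
# broadcast octet = address octet OR (255 - netmask octet).

broadcast_addr() (
    # The body runs in a subshell, so changing IFS here is harmless.
    IFS=.
    # Split both dotted quads into $1..$4 (address) and $5..$8 (netmask).
    set -- $1 $2
    echo "$(( $1 | (255 - $5) )).$(( $2 | (255 - $6) )).$(( $3 | (255 - $7) )).$(( $4 | (255 - $8) ))"
)

broadcast_addr 172.31.128.50 255.255.240.0
```

  This is handy for checking a stanza before bringing the interface up.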



  Bring the network device up, as normal.

       ifup ib0


  55..33..  CCoonnnneecctteedd vvss UUnnccoonnnneecctteedd MMooddee

  IPoIB can run over two infiniband transports: Unreliable Datagram (UD)
  mode or Connected mode (CM). The differences between these two modes
  are described in:

  RFC4392 - IP over InfiniBand (IPoIB) Architecture
  RFC4391 - Transmission of IP over InfiniBand (IPoIB) (UD mode)
  RFC4755 - IP over InfiniBand: Connected Mode


  You can switch between these two modes at runtime with:



        echo datagram > /sys/class/net/ibX/mode
        echo connected > /sys/class/net/ibX/mode



  The default is datagram (UD) mode. If you wish to use CM then you can
  add a script to /etc/network/if-up.d to automatically set CM mode on
  your interfaces when they are configured.
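
  A hook script for this purpose might look like the sketch below. It
  relies on ifupdown exporting the interface name in the IFACE
  variable when it runs hook scripts. The function name and the
  SYSFS_ROOT variable are assumptions made for this example;
  SYSFS_ROOT exists only so that the function can be exercised without
  infiniband hardware, and defaults to /sys.

```shell
#!/bin/sh
# Switch an IPoIB interface to connected mode when it comes up.

set_connected_mode() {
    iface=$1
    mode_file="${SYSFS_ROOT:-/sys}/class/net/$iface/mode"
    case "$iface" in
        ib*)
            # Only touch interfaces that actually expose a mode file.
            if [ -w "$mode_file" ]; then
                echo connected > "$mode_file"
            fi
            ;;
    esac
}

# IFACE is empty when the script is run by hand, so this is then a no-op.
set_connected_mode "${IFACE:-}"
```

  Interfaces whose names do not start with "ib" are left alone, so the
  same hook is safe to install on hosts with ethernet interfaces.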

  55..44..  TTCCPP ttuunniinngg

  In order to obtain maximum IPoIB throughput you may need to tweak the
  MTU and various kernel TCP buffer and window settings.  See the
  details in the ipoib_release_notes.txt document in the ofed-docs
  package.

  55..55..  AARRPP aanndd dduuaall ppoorrtteedd ccaarrddss

  If you have a dual ported card with both ports on the same IB subnet,
  but different IP subnets, you will need to tweak the ARP settings for
  the IPoIB interfaces. See ipoib_release_notes.txt in the ofed-docs
  package for a full discussion of this issue.



          sysctl -w net.ipv4.conf.ib0.arp_ignore=1
          sysctl -w net.ipv4.conf.ib1.arp_ignore=1



  66..  OOppeennMMPPII

  This section describes how to configure OpenMPI to use Infiniband.

  66..11..  CCoonnffiigguurree IIPPooIIBB

  OpenMPI uses IPoIB for job startup and tear-down. You should configure
  IPoIB on all of your hosts.

  66..22..  LLooaadd tthhee mmoodduulleess

  Ensure the rdma_ucm module is loaded.

       modprobe rdma_ucm


  66..33..  CChheecckk ppeerrmmiissssiioonnss aanndd lliimmiittss

  Users who want to run MPI jobs will need to have write permissions for
  the following devices:


       /dev/infiniband/uverbs*
       /dev/infiniband/rdma_cm*



  The simplest way to do this is to add the users to the rdma group. If
  that is not suitable for your site, you can change the permissions
  and ownership of these devices by editing the following udev rules:


       /etc/udev/rules.d/50-udev.rules
       /etc/udev/rules.d/91-permissions.rules



  OpenMPI will need to pin memory. Edit /etc/security/limits.conf and
  add the line:

  * hard memlock unlimited
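
  After logging in again you can confirm that the new limit is in
  effect. The snippet below is a small sketch around the shell's
  ulimit builtin, which reports the hard locked-memory limit with
  -H -l; the function name memlock_status is hypothetical.

```shell
#!/bin/sh
# Describe a locked-memory limit as reported by "ulimit -H -l".

memlock_status() {
    case "$1" in
        unlimited) echo "hard memlock limit is unlimited" ;;
        *)         echo "hard memlock limit is $1 kbytes" ;;
    esac
}

# Report the hard limit of the current shell.
memlock_status "$(ulimit -H -l)"
```

  Run it as the user who will start the MPI jobs, in a fresh login
  session, since limits.conf is applied at login.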


  66..44..  IInnssttaallll tthhee mmppii tteesstt pprrooggrraammss

  Check the mpitests package is installed.

       aptitude install mpitests


  66..55..  CCoonnffiigguurree HHoosstt AAcccceessss

  OpenMPI uses ssh to spawn jobs on remote hosts. You should configure a
  public/private keypair to ensure that you can ssh between hosts
  without entering a password. You should also ensure that your login
  process is silent.

  66..66..  RRuunn tthhee MMPPII PPiinnggPPoonngg bbeenncchhmmaarrkk

  We will use the MPI PingPong benchmark for our testing. By default,
  openmpi should use infiniband networks in preference to any tcp
  networks it finds. However, we will force mpi to ignore tcp networks
  to ensure that it is using the infiniband network.


  #!/bin/bash
  # Infiniband MPI test program
  # Edit the hosts below to match your test hosts
  cat > /tmp/hostfile.$$.mpi <<EOF
  HostA slots=1
  HostB slots=1
  EOF

  # "--mca btl ^tcp" excludes the TCP transport, so the job cannot
  # silently fall back to ethernet
  mpirun --mca btl_openib_verbose 1 --mca btl ^tcp -n 2 -hostfile /tmp/hostfile.$$.mpi IMB-MPI1 PingPong



  If all goes well you should see openib debugging messages from both
  hosts, together with the job output.



  <snip>
  # PingPong
  [HostB][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
  [HostB][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
  [HostA][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
  [HostA][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)

  #---------------------------------------------------
  # Benchmarking PingPong
  # #processes = 2
  #---------------------------------------------------
         #bytes #repetitions      t[usec]   Mbytes/sec
              0         1000         1.53         0.00
              1         1000         1.44         0.66
              2         1000         1.42         1.34
              4         1000         1.41         2.70
              8         1000         1.48         5.15
             16         1000         1.50        10.15
             32         1000         1.54        19.85
             64         1000         1.79        34.05
            128         1000         3.01        40.56
            256         1000         3.56        68.66
            512         1000         4.46       109.41
           1024         1000         5.37       181.92
           2048         1000         8.13       240.25
           4096         1000        10.87       359.48
           8192         1000        15.97       489.17
          16384         1000        30.54       511.68
          32768         1000        55.01       568.12
          65536          640       122.20       511.46
         131072          320       207.20       603.27
         262144          160       377.10       662.96
         524288           80       706.21       708.00
        1048576           40      1376.93       726.25
        2097152           20      1946.00      1027.75
        4194304           10      3119.29      1282.34
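
  To compare runs it helps to reduce this table to two numbers. The
  awk filter below is a sketch that assumes the four-column layout
  shown above; pingpong_summary is a hypothetical name, and the sample
  rows are copied from the example output.

```shell
#!/bin/sh
# Report the smallest latency and largest bandwidth found in IMB
# PingPong output read on stdin. Data rows are recognised as lines
# with four fields whose first two fields are plain integers.

pingpong_summary() {
    awk '
        NF == 4 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            if (min_t == "" || $3 + 0 < min_t) min_t = $3 + 0
            if ($4 + 0 > max_bw) max_bw = $4 + 0
        }
        END {
            printf "min latency %.2f usec, peak bandwidth %.2f Mbytes/sec\n", min_t, max_bw
        }
    '
}

# Sample rows; with real data you would pipe the mpirun output in.
printf '%s\n' \
    '       #bytes #repetitions      t[usec]   Mbytes/sec' \
    '            0         1000         1.53         0.00' \
    '            4         1000         1.41         2.70' \
    '      4194304           10      3119.29      1282.34' | pingpong_summary
```

  The header and the openib debugging lines are skipped because they do
  not have two leading integer columns.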



  If you encounter any errors read the excellent OpenMPI troubleshooting
  guide: http://www.openmpi.org

  If you want to compare infiniband performance with your ethernet/TCP
  networks, you can re-run the tests using flags to tell openmpi to use
  your ethernet network. (The example below assumes that your test nodes
  are connected via eth0).


  #!/bin/bash
  # TCP MPI test program
  # Edit the hosts below to match your test hosts
  cat > /tmp/hostfile.$$.mpi <<EOF
  HostA slots=1
  HostB slots=1
  EOF

  # "--mca btl ^openib" excludes the infiniband transport;
  # btl_tcp_if_include restricts TCP traffic to eth0
  mpirun --mca btl ^openib --mca btl_tcp_if_include eth0 --hostfile /tmp/hostfile.$$.mpi -n 2 IMB-MPI1 PingPong



  You should notice significantly higher latencies than for the
  infiniband test.



  77..  SSDDPP

  Sockets Direct Protocol (SDP) is a network protocol which provides an
  RDMA accelerated alternative to TCP over infiniband networks. OFED
  provides an LD_PRELOADable library (libsdp.so) which allows programs
  which use TCP to use the more efficient SDP protocol instead.  The use
  of an LD_PRELOADable library means that the switch in protocol is
  transparent, and does not require the application to be recompiled.

  77..11..  CCoonnffiigguurraattiioonn

  SDP uses IPoIB for address resolution, so you must configure IPoIB
  before using SDP.

  You should also ensure the ib_sdp kernel module is loaded.

       modprobe ib_sdp



  You can use libsdp in two ways; you can either manually LD_PRELOAD the
  library whilst invoking your application, or create a config file
  which specifies which applications will use SDP.

  To manually LD_PRELOAD a library, simply set the LD_PRELOAD variable
  before invoking your application.

  LD_PRELOAD=libsdp.so ./path/to/your/application ...


  If you wish to choose which programs will use SDP you can edit
  /etc/sdp.conf and specify which programs, ports and addresses are
  eligible for use.

  77..22..  EExxaammppllee UUssiinngg SSDDPP wwiitthh NNeettppiippee

  The following example shows how to use libsdp to make the TCP
  benchmarking application, netpipe, use SDP rather than TCP.  NodeA is
  the server and NodeB is the client. IPoIB is configured on both nodes,
  and NodeA's IPoIB address is 10.0.0.1

  Install netpipe on both nodes.

  aptitude install netpipe-tcp



  First, run the netpipe benchmark over TCP in order to obtain a
  baseline number.



       nodeA# NPtcp
       nodeB# NPtcp -h 10.0.0.1
       Send and receive buffers are 16384 and 87380 bytes
       (A bug in Linux doubles the requested buffer sizes)
       Now starting the main loop
         0:       1 bytes   2778 times -->      0.22 Mbps in      34.04 usec
         1:       2 bytes   2937 times -->      0.45 Mbps in      33.65 usec
         2:       3 bytes   2971 times -->      0.69 Mbps in      33.41 usec
       <snip>
       121: 8388605 bytes      3 times -->   2951.89 Mbps in   21680.99 usec
       122: 8388608 bytes      3 times -->   3008.08 Mbps in   21276.00 usec
       123: 8388611 bytes      3 times -->   2941.76 Mbps in   21755.66 usec


  Now repeat the test, but force netpipe to use SDP rather than TCP.



       nodeA# LD_PRELOAD=libsdp.so NPtcp
       nodeB# LD_PRELOAD=libsdp.so  NPtcp -h 10.0.0.1
       Send and receive buffers are 16384 and 87380 bytes
       (A bug in Linux doubles the requested buffer sizes)
       Now starting the main loop
         0:       1 bytes   9765 times -->      1.45 Mbps in       5.28 usec
         1:       2 bytes  18946 times -->      2.80 Mbps in       5.46 usec
         2:       3 bytes  18323 times -->      4.06 Mbps in       5.63 usec
       <snip>
       121: 8388605 bytes      5 times -->   7665.51 Mbps in    8349.08 usec
       122: 8388608 bytes      5 times -->   7668.62 Mbps in    8345.70 usec
       123: 8388611 bytes      5 times -->   7629.04 Mbps in    8389.00 usec



  You should see a significant increase in performance when using SDP.
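
  To put a number on the improvement you can compare the peak
  throughput of the two runs. The following sketch assumes the NPtcp
  output layout shown above, with the throughput in the seventh field;
  the helper names peak_mbps and speedup are made up for this example,
  and the sample lines are copied from the two runs.

```shell
#!/bin/sh
# Compare the peak NetPIPE throughput of two runs.

# Print the largest Mbps value found in NPtcp output read on stdin.
peak_mbps() {
    awk '$8 == "Mbps" && $7 + 0 > max { max = $7 + 0 }
         END { printf "%.2f\n", max }'
}

# Print the ratio of the second value to the first (baseline).
speedup() {
    awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.2f\n", cur / base }'
}

tcp=$(printf '%s\n' \
    '121: 8388605 bytes      3 times -->   2951.89 Mbps in   21680.99 usec' \
    '122: 8388608 bytes      3 times -->   3008.08 Mbps in   21276.00 usec' \
    '123: 8388611 bytes      3 times -->   2941.76 Mbps in   21755.66 usec' | peak_mbps)
sdp=$(printf '%s\n' \
    '121: 8388605 bytes      5 times -->   7665.51 Mbps in    8349.08 usec' \
    '122: 8388608 bytes      5 times -->   7668.62 Mbps in    8345.70 usec' \
    '123: 8388611 bytes      5 times -->   7629.04 Mbps in    8389.00 usec' | peak_mbps)

echo "TCP peak $tcp Mbps, SDP peak $sdp Mbps, speed-up $(speedup "$tcp" "$sdp")x"
```

  With real runs you would save the output of each NPtcp client to a
  file and feed the files to peak_mbps.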

  88..  SSRRPP

  SRP (SCSI RDMA Protocol, also known as SCSI Remote Protocol) is a
  protocol that allows the use of SCSI devices across infiniband. If you
  have infiniband storage, you can use SRP to access the devices.

  88..11..  CCoonnffiigguurraattiioonn

  Ensure that your infiniband storage is presented to the host in
  question. Check your storage controller documentation.  Ensure that
  the ib_srp kernel module is loaded and that the srptools package is
  installed.


       modprobe ib_srp


  88..22..  SSRRPP ddaaeemmoonn ccoonnffiigguurraattiioonn

  srp_daemon is responsible for discovering and connecting to SRP
  targets. The default configuration shipped with srp_daemon is to
  ignore all presented devices; this is a failsafe to prevent devices
  from being mounted by accident on the wrong hosts.

  The srp_daemon config file /etc/srp_daemon.conf has a simple syntax,
  and is described in the srp_daemon(1) manpage. Each line in this file
  is a rule that either allows or disallows connection to a target. The
  first character of the line selects the action (a to allow, d to
  disallow) and the rest of the line gives the ID of the storage device.

  88..22..11..  DDeetteerrmmiinnee tthhee IIDDss ooff pprreesseenntteedd ddeevviicceess

  You can determine the IDs of SRP devices presented to your hosts by
  running the ibsrpdm -c command.


       # ibsrpdm -c
       id_ext=50001ff10005052a,ioc_guid=50001ff10005052a,dgid=fe8000000000000050001ff10005052a,pkey=ffff,service_id=2a050500f11f0050



  88..22..22..  CCoonnffiigguurree ssrrpp__ddeeaammoonn ttoo ccoonnnneecctt ttoo tthhee ddeevviicceess

  Once we have the IDs of the devices, we can add them to
  /etc/srp_daemon.conf. You can also specify other SRP-related options
  for the target, such as max_cmd_per_lun and max_sect. These are
  storage specific; check your vendor documentation for recommended
  values.


       # This rule allows connection to our target
       a id_ext=50001ff10005052a,ioc_guid=50001ff10005052a,max_cmd_per_lun=32,max_sect=65535
       # This rule disallows everything else
       d
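
  Allow rules of this form can be derived from the ibsrpdm -c output
  rather than typed by hand. The filter below is a sketch only: the
  function name srp_allow_rules is made up, and it assumes each output
  line starts with id_ext and ioc_guid, as in the listing shown earlier.
  It does not add storage-specific options such as max_cmd_per_lun.

```shell
# Read "ibsrpdm -c" output on standard input and print one srp_daemon
# "allow" rule per target, keeping only the id_ext and ioc_guid fields.
srp_allow_rules() {
    sed -n 's/^\(id_ext=[0-9a-fA-F]*,ioc_guid=[0-9a-fA-F]*\),.*$/a \1/p'
}
```

  Review the generated rules before adding them to the config file, and
  keep the catch-all d rule last, as in the example above:

       ibsrpdm -c | srp_allow_rules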



  Restart the srp_daemon and the storage target should now become visi-
  ble; check the kernel log to see if the disk has been detected.


       /etc/init.d/srptools restart



  In the example kernel log output below, the disk has been discovered
  as scsi device sdb.


       scsi 3:0:0:1: Direct-Access     IBM      DCS9900          5.03 PQ: 0 ANSI: 5
       sd 3:0:0:1: [sdb] 1953458176 4096-byte hardware sectors (8001365 MB)
       sd 3:0:0:1: [sdb] Write Protect is off
       sd 3:0:0:1: [sdb] Mode Sense: 97 00 10 08
       sd 3:0:0:1: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
       sd 3:0:0:1: [sdb] 1953458176 4096-byte hardware sectors (8001365 MB)
       sd 3:0:0:1: [sdb] Write Protect is off
       sd 3:0:0:1: [sdb] Mode Sense: 97 00 10 08
       sd 3:0:0:1: [sdb] Write cache: disabled, read cache: enabled, supports DPO and FUA
        sdb:<6>scsi4 : SRP.T10:50001FF10005052A
        unknown partition table
       sd 3:0:0:1: [sdb] Attached SCSI disk
       sd 3:0:0:1: Attached scsi generic sg5 type 0
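
  On a busy host the relevant lines can be hard to spot. As a sketch, the
  filter below prints only the names of disks that the kernel reports as
  attached. The function name attached_disks is invented for this
  example, and the pattern assumes the message wording shown in the log
  above.

```shell
# Read kernel log text on standard input and print the name of every
# disk reported as attached, e.g. "sdb" for
#   sd 3:0:0:1: [sdb] Attached SCSI disk
attached_disks() {
    sed -n 's/^.*\[\([a-z][a-z0-9]*\)\] Attached SCSI disk.*$/\1/p'
}
```

  It can be fed from the kernel log, for example:

       dmesg | attached_disks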



  88..33..  MMuullttiippaatthhiinngg,, LLVVMM aanndd ffoorrmmaattttiinngg

  The newly detected SRP device can be treated like any other SCSI
  device. If you have multiple infiniband adapters you can use
  multipath-tools on top of the SRP devices to protect against a network
  failure. If you are not using multipathed IO you can simply format the
  device as normal.

  99..  BBuuiillddiinngg LLuussttrree aaggaaiinnsstt OOFFEEDD

  Lustre is a scalable cluster filesystem popular on high performance
  compute clusters. See http://www.lustre.org <http://www.lustre.org>
  for more information. Lustre can use infiniband as one of its network
  transports in order to increase performance. This section describes
  how to compile Lustre against the OFED infiniband stack.

  99..11..  CChheecckk CCoommppaattiibbiilliittyy

  Not all Lustre versions are compatible with all OFED or kernel
  versions. Check the Lustre release notes to see which versions are
  supported.

  99..22..  BBuuiilldd aa lluussttrree ppaattcchheedd kkeerrnneell

  Build a lustre patched kernel as per the instructions on the lustre
  wiki. Once you have built the kernel, keep the configured source tree;
  it is required for the next step.
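
  Before moving on, it can be worth checking that the tree really is
  still configured. The test below is only a heuristic sketch: the
  function name kernel_tree_configured is made up, and looking for a
  top-level Makefile and .config is an assumption about what a usable
  tree contains, not a check that the build tools themselves perform.

```shell
# Return success if the directory looks like a configured kernel source
# tree: it must contain a top-level Makefile and a .config file.
kernel_tree_configured() {
    [ -f "$1/Makefile" ] && [ -f "$1/.config" ]
}
```

  For example:

       kernel_tree_configured /path/to/lustre/patched/kernel \
           || echo "kernel tree is missing or not configured"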

  99..33..  BBuuiilldd OOFFEEDD mmoodduulleess ffoorr tthhee lluussttrree ppaattcchheedd kkeerrnneell

  Build OFED modules against the newly built lustre patched kernel.



        module-assistant prepare
        module-assistant clean ofa-kernel
        module-assistant -k/path/to/lustre/patched/kernel build ofa-kernel



  Do not issue a "module-assistant clean" command after the build. The
  ofa-kernel-module source tree is needed for the next step.

  99..44..  CCoonnffiigguurree lluussttrree

  You can now configure lustre to build against the lustre patched
  kernel and the ofa-kernel-module sources.



        cd lustre-source
        ./configure --with-o2ib=/usr/src/modules/ofa-kernel  --with-linux=/path/to/patched/linux/source \
        --other-options



  1100..  TTrroouubblleesshhoooottiinngg

  This section covers general troubleshooting and commonly reported
  problems.

  1100..11..  GGeenneerrnnaall ffaabbrriicc ttrroouubblleesshhoooottiinngg

  The ibdiagnet program can be used to troubleshoot potential issues
  with your infiniband fabric.

       ibdiagnet -r


  1100..22..  iibb__qquueerryy__ggiidd(()) ffaaiilleedd eerrrroorrss oonn mmllxx44 ppllaattffoorrmmss

  ibstat or opensm hangs and the following kernel messages are printed:



       kernel: [   78.170077] ib0: ib_query_gid() failed
       kernel: [   89.272789] ib0: ib_query_port failed



  Fix: Load the mlx4_core module with the msi_x=0 option.


       cat > /etc/modprobe.d/mlx4_core.conf <<EOF
       options mlx4_core msi_x=0
       EOF

       update-initramfs -u
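
  After the module has been reloaded (or the host rebooted), you can
  check that the option took effect. The helper below is a sketch: the
  function name module_param is made up, and it assumes that the driver
  exposes msi_x under /sys/module/mlx4_core/parameters, which may not
  hold for every driver version.

```shell
# Print the current value of a module parameter as exposed in sysfs.
# $1 is the module name and $2 the parameter name. The sysfs root
# defaults to /sys; a different root can be passed as the third argument
# (useful for testing). Fails if the parameter file does not exist, for
# example because the module is not loaded.
module_param() {
    cat "${3:-/sys}/module/$1/parameters/$2" 2>/dev/null
}
```

  If the fix is in place, the following should print 0:

       module_param mlx4_core msi_x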



  1100..33..  MMiissssiinngg XXRRCC ssuuppppoorrtt

  If you see error messages pertaining to missing support for XRC, it
  means you have mismatched kernel modules and userspace libraries.


       mlx4: There is a mismatch between the kernel and the userspace
       libraries: Kernel does not support XRC. Exiting.



  Fix: Make sure that you build and install the OFED kernel modules as
  described in section X.

  1111..  TTiippss aanndd TTrriicckkss

  This section details an assortment of miscellaneous tips.

  1111..11..  DDeessccrriippttiivvee nnooddee nnaammeess

  You can give your hosts descriptive names by echoing text to the
  following file:


       echo `uname -n` > /sys/class/infiniband/<driver>/node_desc
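
  On hosts with more than one adapter, the same description can be
  written to every device in one go. The function below is a sketch; the
  name set_node_desc is made up for this example. Files under /sys are
  not preserved across reboots, so the setting has to be reapplied at
  boot, for instance from a local startup script.

```shell
# Write a description to the node_desc file of every InfiniBand device.
# $1 is the description; $2 is the sysfs class directory, defaulting to
# /sys/class/infiniband (pass another directory for testing).
set_node_desc() {
    for f in "${2:-/sys/class/infiniband}"/*/node_desc; do
        [ -e "$f" ] || continue    # the glob matched nothing
        printf '%s\n' "$1" > "$f"
    done
}
```

  For example, to use the hostname as the description:

       set_node_desc "$(uname -n)"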



  1122..  FFuurrtthheerr IInnffoorrmmaattiioonn

  Extensive documentation on the OFED software is present in the ofed-
  docs package.

  The OpenFabrics Alliance webpage can be found here:

  http://www.openfabrics.org/ <http://www.openfabrics.org/>

  The following mailing lists are also useful:

  http://lists.alioth.debian.org/mailman/listinfo/pkg-ofed-devel
  <http://lists.alioth.debian.org/mailman/listinfo/pkg-ofed-devel>: pkg-
  ofed-devel: Discussion of Debian-specific problems or issues.

  http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
  <http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general>: ofa-
  general: General discussion of the OFED software.

  Books:

  Infiniband Network Architecture
  by MindShare, Inc.; Tom Shanley
  Publisher: Addison-Wesley Professional
  Pub Date: October 31, 2002
  Print ISBN-10: 0-321-11765-4



