Xen GPU cluster

Hardware

Machine: Dell OptiPlex 755
Nodes: 9
CPU: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz
Memory: 6 GB per node
Storage: 160 GB per node
Video Card: NVIDIA GeForce 9800 GT, 1 GB per node

Software

OS #1: Ubuntu 8.10, kernel 2.6.27-11-server x86_64 (unpatched kernel)
OS #2: Ubuntu 8.10, kernel 2.6.22-9 x86_64 (Xen-3.3.1 + Lustre patched kernel)


Part 1: Building the Essential Environment

1.1 Basic Environment

# NVIDIA CUDA driver #
rock@cloud:~/nvidia/cuda$ wget http://developer.download.nvidia.com/compute/cuda/2_1/drivers/NVIDIA-Linux-x86_64-180.22-pkg2.run
# NVIDIA CUDA toolkit #
rock@cloud:~/nvidia/cuda$ wget http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/cudatoolkit_2.1_linux64_ubuntu8.04.run
# NVIDIA CUDA SDK #
rock@cloud:~/nvidia/cuda$ wget http://developer.download.nvidia.com/compute/cuda/2_1/SDK/cuda-sdk-linux-2.10.1215.2015-3233425.run
rock@cloud:~$ sudo apt-get install autoconf automake build-essential gcc make libtool initramfs-tools libxi6 libxi-dev libxmu6 libxmu-dev linux-kernel-devel linux-headers-2.6.27-11-server xserver-xorg-core xserver-xorg-dev
rock@cloud:~$ sudo ln -sf /usr/src/linux-2.6.22 /usr/src/linux
rock@cloud:~/nvidia/cuda$ sudo sh NVIDIA-Linux-x86_64-180.22-pkg2.run
rock@cloud:~/nvidia/cuda$ sudo sh cudatoolkit_2.1_linux64_ubuntu8.04.run

Enter install path (default /usr/local/cuda, '/cuda' will be appended): /usr/local/cuda

# Note:

* Please make sure your PATH includes /usr/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH includes /usr/local/cuda/lib
*   or add /usr/local/cuda/lib to /etc/ld.so.conf and run ldconfig as root

* Please read the release notes in /usr/local/cuda/doc/

* To uninstall CUDA, delete /usr/local/cuda
* Installation Complete

rock@cloud:~/nvidia/cuda$ sudo sh cuda-sdk-linux-2.10.1215.2015-3233425.run

# Note:

{{{
Enter install path (default /usr/local/cuda, '/cuda' will be appended): /usr/local/NVIDIA_CUDA_SDK
}}}

Configuring SDK Makefile (/usr/local/NVIDIA_CUDA_SDK/common/common.mk)...

* Please make sure your PATH includes /usr/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH includes /usr/local/cuda/lib

* To uninstall the NVIDIA CUDA SDK, please delete /usr/local/NVIDIA_CUDA_SDK

rock@cloud:~$ sudo vim /etc/profile

Add:
export PATH=$PATH:/usr/local/cuda/bin

rock@cloud:~$ source /etc/profile
rock@cloud:~$ sudo vim /etc/ld.so.conf

Add:
/usr/local/cuda/lib

rock@cloud:~$ sudo ldconfig
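
To confirm the environment changes took effect, it may help to check that nvcc resolves from the new PATH and that the dynamic linker can now find the CUDA libraries. A minimal check (the exact version strings will differ):

{{{
# Check that /usr/local/cuda/bin is picked up from /etc/profile
rock@cloud:~$ which nvcc
/usr/local/cuda/bin/nvcc
rock@cloud:~$ nvcc --version

# Check that the linker cache now contains the CUDA libraries
rock@cloud:~$ ldconfig -p | grep cuda
}}}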

1.2 NVIDIA Driver HowTo on the Non-Xen Kernel

# Rock suggested that the VGA device being identified as "unknown" might be caused by an outdated pci.ids database.
Solution 1: update pci.ids via the distribution tool (may still fetch an older version)
rock@cloud:~$ sudo update-pciids
Solution 2: fetch the latest pci.ids directly
rock@cloud:~$ wget http://pciids.sourceforge.net/v2.2/pci.ids
rock@cloud:~$ sudo cp pci.ids /usr/share/misc/
rock@cloud:~$ sudo lspci -v -v

01:00.0 VGA compatible controller: nVidia Corporation GeForce 9800 GT (rev a2)
	Subsystem: ASUSTeK Computer Inc. Device 82a0
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
	Region 5: I/O ports at dc80 [size=128]
	[virtual] Expansion ROM at fea00000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
		Address: 0000000000000000  Data: 0000
	Capabilities: [78] Express (v1) Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 <4us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <1us
			ClockPM- Suprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
	Capabilities: [100] Virtual Channel <?>
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [600] Vendor Specific Information <?>
	Kernel driver in use: nvidia
	Kernel modules: nvidia, nvidiafb

rock@cloud:~$ less /var/log/Xorg.0.log | grep nVidia

(--) PCI:*(0@1:0:0) nVidia Corporation GeForce 9800 GT rev 162, Mem @ 0xfd000000/16777216, 0xd0000000/268435456, 0xfa000000/33554432, I/O @ 0x0000dc80/128, BIOS @ 0x????????/131072

rock@cloud:~$ less /usr/share/misc/pci.ids | grep 9800

	0601  GeForce 9800 GT 512
	0604  GeForce 9800 GX2
	0605  GeForce 9800 GT
	0612  GeForce 9800 GTX
	0613  GeForce 9800 GTX+
	0614  GeForce 9800 GT
	0617  GeForce 9800M GTX
	10de  GeForce 9800M GTX

rock@cloud:~$ sudo Xorg -scanpci

Probing for PCI devices (Bus:Device:Function)

(0:0:0) unknown card (0x1028/0x0211) using a Intel Corporation DRAM Controller
(0:1:0) Intel Corporation PCI Express Root Port
(0:3:0) unknown card (0x1028/0x0211) using a Intel Corporation MEI Controller
(0:3:2) unknown card (0x1028/0x0211) using a Intel Corporation PT IDER Controller
(0:3:3) unknown card (0x1028/0x0211) using a Intel Corporation Serial KT Controller
(0:25:0) unknown card (0x1028/0x0211) using a Intel Corporation 82566DM-2 Gigabit Network Connection
(0:26:0) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4
(0:26:1) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5
(0:26:7) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2
(0:27:0) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) HD Audio Controller
(0:28:0) Intel Corporation 82801I (ICH9 Family) PCI Express Port 1
(0:29:0) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
(0:29:1) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
(0:29:2) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
(0:29:7) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
(0:30:0) Intel Corporation 82801 PCI Bridge
(0:31:0) Intel Corporation LPC Interface Controller
(0:31:2) unknown card (0x1028/0x0211) using a Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller
(0:31:3) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) SMBus Controller
(0:31:5) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller

---> (1:0:0) unknown card (0x1043/0x82a0) using an unknown chip (DeviceId 0x0605) from nVidia Corporation

rock@cloud:~$ sudo vim /etc/X11/xorg.conf

# Set the BusID for the VGA device
Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    BusID          "PCI:1:0:0"
    VendorName     "NVIDIA Corporation"
    BoardName      "GeForce 9800 GT"
    Option         "RenderAccel" "True"
    Option         "UseEdidDpi" "False"
EndSection
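
The X server only reads xorg.conf at startup, so the display manager has to be restarted for the new Device section to take effect. A minimal sketch, assuming gdm is the display manager on this Ubuntu 8.10 installation:

{{{
# Restart the display manager so X reloads /etc/X11/xorg.conf
rock@cloud:~$ sudo /etc/init.d/gdm restart
}}}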

rock@cloud:~$ sudo glxinfo -display :0

# 3D acceleration appears to work fine without any trouble.
name of display: :0.0
display: :0  screen: 0
direct rendering: Yes
server glx vendor string: NVIDIA Corporation
server glx version string: 1.4
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce 9800 GT/PCI/SSE2
OpenGL version string: 3.0.0 NVIDIA 180.29
OpenGL shading language version string: 1.30 NVIDIA via Cg compiler

1.3 NVIDIA Driver HowTo on the Xen Kernel

In this case, we use driver version 180.22 (x86_64) on the Xen + Lustre patched kernel.
# Test 1 - Success
rider@cloud:~/nvidia/driver$ export IGNORE_XEN_PRESENCE=1
rider@cloud:~/nvidia/driver$ export SYSSRC=/lib/modules/2.6.22.9/source
rider@cloud:~/nvidia/driver$ export SYSOUT=/lib/modules/2.6.22.9/build
rider@cloud:~/nvidia/driver$ sudo IGNORE_XEN_PRESENCE=1 ./NVIDIA-Linux-x86_64-180.22-pkg2.run --x-module-path=/usr/lib/xorg/modules/ --kernel-source-path=/usr/src/linux/
rider@cloud:~$ sudo modprobe -l | grep nv

/lib/modules/2.6.22.9/kernel/drivers/video/nvidia.ko

rider@cloud:~/nvidia/driver$ sudo modprobe nvidia
rider@cloud:~/nvidia/driver$ dmesg

NVRM: loading NVIDIA UNIX x86_64 Kernel Module  180.22  Tue Jan  6 09:15:58 PST 2009
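
Note: if a node runs headless under the Xen kernel (no X server started), the /dev/nvidia* device nodes, which are normally created when X starts, may be missing, so CUDA programs cannot open the GPU. A sketch for creating them by hand, assuming a single GPU (195 is the NVIDIA character device major number):

{{{
# Create the NVIDIA device nodes manually (one GPU assumed)
rider@cloud:~$ sudo mknod -m 666 /dev/nvidia0 c 195 0
rider@cloud:~$ sudo mknod -m 666 /dev/nvidiactl c 195 255
}}}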

# Test 2 - Testing
rider@cloud:~/nvidia/driver$ export IGNORE_XEN_PRESENCE=1
rider@cloud:~/nvidia/driver$ export SYSSRC=/lib/modules/2.6.22.9/source
rider@cloud:~/nvidia/driver$ export SYSOUT=/lib/modules/2.6.22.9/build
rider@cloud:~/nvidia/driver$ ./NVIDIA-Linux-x86_64-180.22-pkg2.run --extract-only
rider@cloud:~/nvidia/driver$ cd ./NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ CC="gcc -DNV_VMAP_4_PRESENT -DNV_SIGNAL_STRUCT_RLIM" make SYSSRC=/lib/modules/2.6.22.9/source SYSOUT=/lib/modules/2.6.22.9/build module
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ sudo mkdir -p /lib/modules/2.6.22.9/extra
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ sudo cp nvidia.ko /lib/modules/2.6.22.9/extra/
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ sudo depmod -a
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ sudo modprobe nvidia

errMsg: nvidia: Unknown symbol __phys_addr

PS: Files modified in Test 1 & Test 2

# Kernel source (Test 1)
/usr/src/linux/include/asm/smp.h
/usr/src/linux/include/xen/interface/memory.h

# NVIDIA source (Test 2)
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/nv.c
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/nv-vm.c
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/conftest.sh
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/Makefile.kbuild
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/nv-linux.h
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/os-interface.c
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/nv-linux.h_old
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/conftest.sh_old

1.4 NVIDIA GPU Status Check

rock@cloud:~$ sudo nvidia-xconfig -query-gpu-info

# GPU Status check
Number of GPUs: 1

GPU #0:
  Name      : GeForce 9800 GT
  PCI BusID : PCI:1:0:0

  Number of Display Devices: 1

  Display Device 0 (CRT-0):
     EDID Name             : ViewSonic VA721
     Minimum HorizSync     : 30.000 kHz
     Maximum HorizSync     : 82.000 kHz
     Minimum VertRefresh   : 50 Hz
     Maximum VertRefresh   : 85 Hz
     Maximum PixelClock    : 140.000 MHz
     Maximum Width         : 1280 pixels
     Maximum Height        : 1024 pixels
     Preferred Width       : 1280 pixels
     Preferred Height      : 1024 pixels
     Preferred VertRefresh : 60 Hz
     Physical Width        : 340 mm
     Physical Height       : 270 mm

rock@cloud:~$ sudo nvidia-smi

Gpus found in probe:
Found Gpuid 0x1000
Attaching all probed Gpus...OK
Getting unit information...OK
Getting all static information..

Part 2: Xen PCI Express Configuration HowTo

2.1 DEV_IDs Confirmation

rider@cloud:~$ lspci -vvn

01:00.0 0300: 10de:0605 (rev a2)  ---> DEV_IDs
	Subsystem: 1043:82a0
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Region 3: Memory at fa000000 (64-bit, non-prefetchable) [size=32M]
	Region 5: I/O ports at dc80 [size=128]
	Expansion ROM at fea00000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidia
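
The "Capabilities: <access denied>" line above simply means lspci was run without root privileges; if the full capability list is needed for debugging, it can be shown by running lspci as root against just this device:

{{{
rider@cloud:~$ sudo lspci -vvn -s 01:00.0
}}}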

rider@cloud:~$ sudo vim /etc/xen/xend-pci-permissive.sxp

(unconstrained_dev_ids
     #('0123:4567:89AB:CDEF')
     ('0000:01:00.0')
)

rider@cloud:~$ sudo vim /etc/xen/xend-pci-quirks.sxp

(pci_ids
   # Entries are formatted as follows:
   #     <vendor>:<device>[:<subvendor>:<subdevice>]

   ('10DE:0605'   # NVIDIA 9800GT
   )
)

rider@cloud:~$ sudo vim /etc/xen/vm01.cfg

# We create a new virtual machine named "vm01"; the pci passthrough entry is configured as below
# (a fuller configuration sketch follows the device list).
# In this case we pass through the PCI Express device 01:00.0 (the GeForce 9800 GT).
pci = ['01:00.0']

Examples of device IDs on this machine:
01:00.0 --> VGA compatible controller: nVidia GeForce 9800 GT (PCI Express)
00:01.0 --> PCI bridge: Intel Corporation 82Q35 Express PCI Express Root Port
00:1d.0 --> USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
00:1d.1 --> USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
00:1d.2 --> USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
00:1d.7 --> USB Controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
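
Only the pci line is shown above; a fuller vm01.cfg might look like the sketch below. The domU kernel/initrd paths, disk image location, memory size, and network settings are assumptions for illustration and must be adapted to the local setup.

{{{
# /etc/xen/vm01.cfg -- minimal sketch; paths and sizes are assumptions
name    = "vm01"
kernel  = "/boot/vmlinuz-2.6.22.9-xen"        # domU kernel (assumed path)
ramdisk = "/boot/initrd.img-2.6.22.9-xen"     # domU initrd (assumed path)
memory  = 2048
vcpus   = 2
disk    = ['file:/srv/xen/vm01.img,xvda1,w']  # assumed disk image
vif     = ['bridge=eth0']
root    = "/dev/xvda1 ro"
pci     = ['01:00.0']                         # pass the GeForce 9800 GT through to vm01
}}}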

rider@cloud:~$ sudo su -
root@cloud:~$ echo -n "0000:01:00.0" > /sys/bus/pci/drivers/pciback/new_slot
root@cloud:~$ echo -n "0000:01:00.0" > /sys/bus/pci/drivers/pciback/bind
root@cloud:~$ cat /sys/bus/pci/drivers/pciback/slots

0000:01:00.0

root@cloud:~$ exit
rider@cloud:/etc/xen$ sudo xm create -c vm01.cfg
rider@cloud:/etc/xen$ dmesg | grep pciback

pciback 0000:01:00.0: seizing device
pciback: vpci: 0000:01:00.0: assign to virtual slot 0
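
Binding the device to pciback by hand has to be repeated after every reboot. As an alternative, the card can be hidden from dom0 at boot time; a sketch, depending on whether pciback was built into the dom0 kernel or as a module:

{{{
# pciback built into the kernel: append to the dom0 kernel line in /boot/grub/menu.lst
pciback.hide=(0000:01:00.0)

# pciback built as a module: e.g. add to a file under /etc/modprobe.d/
options pciback hide=(0000:01:00.0)
}}}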

rider@cloud:~$ sudo xm console vm01
vm01:~# dmesg | grep pci

pcifront pci-0: Installing PCI frontend
pcifront pci-0: Creating PCI Frontend Bus 0000:00
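
Inside the guest, the card should appear at the virtual slot pciback assigned to it (slot 0 above); a quick check from the vm01 console:

{{{
# The GeForce 9800 GT should be listed on the virtual PCI bus
vm01:~# lspci -s 00:00.0
}}}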

Part 3: CUDA HowTo

3.1 NVIDIA CUDA Example

In this case, we have to use gcc-4.1 and g++-4.1 instead of gcc-4.3 to avoid the following stdio error (one possible workaround is sketched below):
---> /usr/include/bits/stdio2.h(35): error: identifier "builtin_va_arg_pack" is undefined
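
One way to do this without changing the system default compiler is to install gcc-4.1/g++-4.1 and point nvcc at them with --compiler-bindir; a sketch (the ~/cuda-gcc41 directory is an arbitrary choice, and test.cu stands for any CUDA source file):

{{{
rock@cloud:~$ sudo apt-get install gcc-4.1 g++-4.1
# Build a directory of symlinks that nvcc can use as its host compiler
rock@cloud:~$ mkdir -p ~/cuda-gcc41
rock@cloud:~$ ln -s /usr/bin/gcc-4.1 ~/cuda-gcc41/gcc
rock@cloud:~$ ln -s /usr/bin/g++-4.1 ~/cuda-gcc41/g++
# Pass --compiler-bindir to nvcc directly (or set the equivalent in the SDK makefiles)
rock@cloud:~$ nvcc --compiler-bindir=$HOME/cuda-gcc41 -o test test.cu
}}}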

Compiled CUDA examples are placed in: /usr/local/NVIDIA_CUDA_SDK/bin/linux/release

For example, choose a project from /usr/local/NVIDIA_CUDA_SDK/projects:
rock@cloud:~$ cd /usr/local/NVIDIA_CUDA_SDK/projects/bandwidthTest/
rock@cloud:/usr/local/NVIDIA_CUDA_SDK/projects/bandwidthTest$ sudo make
rock@cloud:/usr/local/NVIDIA_CUDA_SDK/projects/bandwidthTest$ cd ../../bin/linux/release/
rock@cloud:/usr/local/NVIDIA_CUDA_SDK/bin/linux/release$ ./bandwidthTest
rock@cloud:/usr/local/NVIDIA_CUDA_SDK/bin/linux/release$ ./deviceQuery

Running on......
      device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)	Bandwidth(MB/s)
 33554432		1574.6

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)	Bandwidth(MB/s)
 33554432		1187.9

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)	Bandwidth(MB/s)
 33554432		41442.7

&&&& Test PASSED

Press ENTER to exit...
There is 1 device supporting CUDA (Running on Xen + Lustre Kernel)

Device 0: "GeForce 9800 GT"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 1073414144 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.51 GHz
  Concurrent copy and execution:                 Yes

References:
1. NVIDIA CUDA: http://www.nvidia.com/object/cuda_home.html
2. openSUSE NVIDIA + Xen: http://en.opensuse.org/Use_Nvidia_driver_with_Xen
3. NVIDIA GPU DEV_IDs: http://www.laptopvideo2go.com/forum/index.php?showtopic=7664
4. PCI ID database: http://www.pcidatabase.com/
5. Xen: assigning PCI devices to a domain: http://www.bestgrid.org/index.php/Xen:_assigning_PCI_devices_to_a_domain
