= Xen GPU cluster =

== Hardware ==
||Machine|| Dell !OptiPlex 755 ||
||Node|| 9 nodes ||
||CPU|| Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz ||
||Memory|| 6GB/node ||
||Storage|| 160GB/node ||
||Video Card|| NVIDIA !GeForce 9800GT 1GB/node ||

== Software ==
||OS!#1|| Ubuntu 8.10 with Kernel: 2.6.27-11-server x86_64 (non-patched kernel) ||
||OS!#2|| Ubuntu 8.10 with Kernel: 2.6.22-9 x86_64 (Xen-3.3.1+Lustre patched kernel) ||
[[BR]]

= Part 1 Build essential environment =

== 1.1 - Basic Environment ==
# NVIDIA CUDA driver # [[BR]]
rock@cloud:~/nvidia/cuda$ wget http://developer.download.nvidia.com/compute/cuda/2_1/drivers/NVIDIA-Linux-x86_64-180.22-pkg2.run [[BR]]
# NVIDIA CUDA toolkit # [[BR]]
rock@cloud:~/nvidia/cuda$ wget http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/cudatoolkit_2.1_linux64_ubuntu8.04.run [[BR]]
# NVIDIA CUDA SDK # [[BR]]
rock@cloud:~/nvidia/cuda$ wget http://developer.download.nvidia.com/compute/cuda/2_1/SDK/cuda-sdk-linux-2.10.1215.2015-3233425.run [[BR]]
rock@cloud:~$ sudo apt-get install autoconf automake build-essential gcc make libtool initramfs-tools libxi6 libxi-dev libxmu6 libxmu-dev linux-kernel-devel linux-headers-2.6.27-11-server xserver-xorg-core xserver-xorg-dev [[BR]]
rock@cloud:~$ sudo ln -sf /usr/src/linux-2.6.22 /usr/src/linux [[BR]]
rock@cloud:~/nvidia/cuda$ sudo sh NVIDIA-Linux-x86_64-180.22-pkg2.run [[BR]]
rock@cloud:~/nvidia/cuda$ sudo sh cudatoolkit_2.1_linux64_ubuntu8.04.run [[BR]]
{{{
Enter install path (default /usr/local/cuda, '/cuda' will be appended): /usr/local/cuda
}}}
# Note:
{{{
* Please make sure your PATH includes /usr/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH includes /usr/local/cuda/lib
* or add /usr/local/cuda/lib to /etc/ld.so.conf and run ldconfig as root
* Please read the release notes in /usr/local/cuda/doc/
* To uninstall CUDA, delete /usr/local/cuda
* Installation Complete
}}}
rock@cloud:~/nvidia/cuda$ sudo sh cuda-sdk-linux-2.10.1215.2015-3233425.run [[BR]]
# Note:
{{{
{{{
Enter install path
(default /usr/local/cuda, '/cuda' will be appended): /usr/local/NVIDIA_CUDA_SDK
}}}
}}}
{{{
Configuring SDK Makefile (/usr/local/NVIDIA_CUDA_SDK/common/common.mk)...
* Please make sure your PATH includes /usr/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH includes /usr/local/cuda/lib
* To uninstall the NVIDIA CUDA SDK, please delete /usr/local/NVIDIA_CUDA_SDK
}}}
rock@cloud:~$ sudo vim /etc/profile [[BR]]
{{{
Add:
export PATH=$PATH:/usr/local/cuda/bin
}}}
rock@cloud:~$ source /etc/profile [[BR]]
rock@cloud:~$ sudo vim /etc/ld.so.conf [[BR]]
{{{
Add:
/usr/local/cuda/lib
}}}
rock@cloud:~$ sudo ldconfig [[BR]]

== 1.2 NVIDIA Driver !HowTo !OnNonXenKernel ==
# Rock suggested that the VGA device being reported as unknown might be caused by the "pci.ids" database. [[BR]]
Solution 1: [[BR]]
rock@cloud:~$ sudo update-pciids [[BR]]
Solution 2: [[BR]]
rock@cloud:~$ wget http://pciids.sourceforge.net/v2.2/pci.ids [[BR]]
rock@cloud:~$ sudo cp pci.ids /usr/share/misc/ [[BR]]
rock@cloud:~$ sudo lspci -v -v [[BR]]
{{{
01:00.0 VGA compatible controller: nVidia Corporation GeForce 9800 GT (rev a2)
        Subsystem: ASUSTeK Computer Inc.
Device 82a0
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
        Capabilities: [128] Power Budgeting
        Capabilities: [600] Vendor Specific Information
        Kernel driver in use: nvidia
        Kernel modules: nvidia, nvidiafb
}}}
rock@cloud:~$ less /var/log/Xorg.0.log | grep nVidia [[BR]]
{{{
(--) PCI:*(0@1:0:0) nVidia Corporation GeForce 9800 GT rev 162, Mem @ 0xfd000000/16777216, 0xd0000000/268435456, 0xfa000000/33554432, I/O @ 0x0000dc80/128, BIOS @ 0x????????/131072
}}}
rock@cloud:~$ less /usr/share/misc/pci.ids | grep 9800 [[BR]]
{{{
0601 GeForce 9800 GT 512
0604 GeForce 9800 GX2
0605 GeForce 9800 GT
0612 GeForce 9800 GTX
0613 GeForce 9800 GTX+
0614 GeForce 9800 GT
0617 GeForce 9800M GTX
10de GeForce 9800M GTX
}}}
rock@cloud:~$ sudo Xorg -scanpci [[BR]]
{{{
Probing for PCI devices (Bus:Device:Function)
(0:0:0) unknown card (0x1028/0x0211) using a Intel Corporation DRAM Controller
(0:1:0) Intel Corporation PCI Express Root Port
(0:3:0) unknown card (0x1028/0x0211) using a Intel Corporation MEI Controller
(0:3:2) unknown card (0x1028/0x0211) using a Intel Corporation PT IDER Controller
(0:3:3) unknown card (0x1028/0x0211) using a Intel Corporation Serial KT Controller
(0:25:0) unknown card (0x1028/0x0211) using a Intel Corporation 82566DM-2 Gigabit Network Connection
(0:26:0) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4
(0:26:1) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5
(0:26:7) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2
(0:27:0) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) HD Audio Controller
(0:28:0) Intel Corporation 82801I (ICH9 Family) PCI Express Port 1
(0:29:0) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1
(0:29:1) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2
(0:29:2) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3
(0:29:7) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1
(0:30:0) Intel Corporation 82801 PCI Bridge
(0:31:0) Intel Corporation LPC Interface Controller
(0:31:2) unknown card (0x1028/0x0211) using a Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller
(0:31:3) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) SMBus Controller
(0:31:5) unknown card (0x1028/0x0211) using a Intel Corporation 82801I (ICH9 Family) 2 port SATA IDE Controller
---> (1:0:0) unknown card (0x1043/0x82a0) using an unknown chip (DeviceId 0x0605) from nVidia Corporation
}}}
rock@cloud:~$ sudo vim /etc/X11/xorg.conf [[BR]]
{{{
# Assign the BusID of the VGA device
Section "Device"
    Identifier  "Device0"
    Driver      "nvidia"
    BusID       "PCI:1:0:0"
    VendorName  "NVIDIA Corporation"
    BoardName   "GeForce 9800 GT"
    Option      "RenderAccel" "True"
    Option      "UseEdidDpi" "False"
EndSection
}}}
rock@cloud:~$ sudo glxinfo -display :0
{{{
# 3D acceleration appears to work without any trouble.
name of display: :0.0
display: :0  screen: 0
direct rendering: Yes
server glx vendor string: NVIDIA Corporation
server glx version string: 1.4
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce 9800 GT/PCI/SSE2
OpenGL version string: 3.0.0 NVIDIA 180.29
OpenGL shading language version string: 1.30 NVIDIA via Cg compiler
}}}

== 1.3 NVIDIA Driver !HowTo !OnXenKernel ==
''In this case, we use driver version 180.22 (x86_64) with the Xen + Lustre kernel.''[[BR]]
# Test 1 - Success [[BR]]
rider@cloud:~/nvidia/driver$ export IGNORE_XEN_PRESENCE=1 [[BR]]
rider@cloud:~/nvidia/driver$ export SYSSRC=/lib/modules/2.6.22.9/source [[BR]]
rider@cloud:~/nvidia/driver$ export SYSOUT=/lib/modules/2.6.22.9/build [[BR]]
rider@cloud:~/nvidia/driver$ sudo IGNORE_XEN_PRESENCE=1 ./NVIDIA-Linux-x86_64-180.22-pkg2.run --x-module-path=/usr/lib/xorg/modules/ --kernel-source-path=/usr/src/linux/ [[BR]]
rider@cloud:~$ sudo modprobe -l | grep nv
{{{
/lib/modules/2.6.22.9/kernel/drivers/video/nvidia.ko
}}}
rider@cloud:~/nvidia/driver$ sudo modprobe nvidia [[BR]]
rider@cloud:~/nvidia/driver$ dmesg [[BR]]
{{{
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 180.22 Tue Jan 6 09:15:58 PST 2009
}}}
# Test 2 - Testing [[BR]]
rider@cloud:~/nvidia/driver$ export IGNORE_XEN_PRESENCE=1 [[BR]]
rider@cloud:~/nvidia/driver$ export SYSSRC=/lib/modules/2.6.22.9/source [[BR]]
rider@cloud:~/nvidia/driver$ export SYSOUT=/lib/modules/2.6.22.9/build [[BR]]
rider@cloud:~/nvidia/driver$ ./NVIDIA-Linux-x86_64-180.22-pkg2.run --extract-only [[BR]]
rider@cloud:~/nvidia/driver$ cd ./NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/ [[BR]]
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ CC="gcc -DNV_VMAP_4_PRESENT -DNV_SIGNAL_STRUCT_RLIM" make SYSSRC=/lib/modules/2.6.22.9/source SYSOUT=/lib/modules/2.6.22.9/build module [[BR]]
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ sudo mkdir -p /lib/modules/2.6.22.9/extra [[BR]]
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ sudo cp nvidia.ko /lib/modules/2.6.22.9/extra/ [[BR]]
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ sudo depmod -a [[BR]]
rider@cloud:~/nvidia/driver/NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv$ sudo modprobe nvidia [[BR]]
{{{
errMsg: nvidia: Unknown symbol __phys_addr
}}}
PS: Files modified for Test 1 and Test 2 [[BR]]
{{{
# Kernel source (Test 1)
/usr/src/linux/include/asm/smp.h
/usr/src/linux/include/xen/interface/memory.h

# NVIDIA source (Test 2)
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/nv.c
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/nv-vm.c
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/conftest.sh
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/Makefile.kbuild
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/nv-linux.h
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/os-interface.c
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/nv-linux.h_old
NVIDIA-Linux-x86_64-180.22-pkg2/usr/src/nv/conftest.sh_old
}}}

== 1.4 NVIDIA GPU !StatusCheck ==
rock@cloud:~$ sudo nvidia-xconfig -query-gpu-info [[BR]]
{{{
# GPU Status check
Number of GPUs: 1

GPU #0:
  Name      : GeForce 9800 GT
  PCI BusID : PCI:1:0:0

  Number of Display Devices: 1

  Display Device 0 (CRT-0):
      EDID Name             : ViewSonic VA721
      Minimum HorizSync     : 30.000 kHz
      Maximum HorizSync     : 82.000 kHz
      Minimum VertRefresh   : 50 Hz
      Maximum VertRefresh   : 85 Hz
      Maximum PixelClock    : 140.000 MHz
      Maximum Width         : 1280 pixels
      Maximum Height        : 1024 pixels
      Preferred Width       : 1280 pixels
      Preferred Height      : 1024 pixels
      Preferred VertRefresh : 60 Hz
      Physical Width        : 340 mm
      Physical Height       : 270 mm
}}}
rock@cloud:~$ sudo nvidia-smi [[BR]]
{{{
Gpus found in probe:
Found Gpuid 0x1000
Attaching all probed Gpus...OK
Getting unit information...OK
Getting all static information..
}}}

= Part 2 Xen PCI Express configuration !HowTo =

== 2.1 Xen_Kernel_config ==
{{{
CONFIG_XEN_PCIDEV_FRONTEND=y
CONFIG_XEN_PCIDEV_BACKEND=y
CONFIG_XEN_PCIDEV_BACKEND_PASS=y
# CONFIG_XEN_PCIDEV_BACKEND_VPCI is not set
# CONFIG_XEN_PCIDEV_BACKEND_SLOT is not set
}}}

== 2.2 DEV_IDs confirmation ==
rider@cloud:~$ lspci -vvn [[BR]]
{{{
01:00.0 0300: 10de:0605 (rev a2) ---> vendorID + DEV_IDs
        Subsystem: 1043:82a0
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
        Kernel driver in use: nvidia
        Kernel modules: nvidia
}}}

== 2.3 PCI Backend Configuration ==

=== 2.3.1 Binding at Boot ===
rider@cloud:~$ sudo vim /boot/grub/menu.lst [[BR]]
{{{
module /boot/vmlinuz-2.6.22.9 root=UUID=d3fa560e-7071-46d8-a168-036f40960c7b ro console=tty0 pciback.hide=(0000:01:00.0)
}}}

=== 2.3.2 Late Binding ===
rider@cloud:~$ sudo su - [[BR]]
# Hide the device from dom0 so pciback can take control. [[BR]]
root@cloud:~$ echo -n "0000:01:00.0" > /sys/bus/pci/drivers/nvidia/unbind [[BR]]
# Register the device address with pciback as a new slot, then bind it. [[BR]]
root@cloud:~$ echo -n "0000:01:00.0" > /sys/bus/pci/drivers/pciback/new_slot [[BR]]
root@cloud:~$ echo -n "0000:01:00.0" > /sys/bus/pci/drivers/pciback/bind [[BR]]
# An initialization script can repeat these steps so that the PCIe device is bound at startup.
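
The late-binding steps above could be collected into a small shell function and called from an initialization script. This is only a sketch under assumptions: the function name `late_bind` and the `SYSFS_ROOT` variable are hypothetical (the variable exists only so the function can be tried against a scratch directory instead of the real /sys), and the device address is the one used on this page.

```shell
#!/bin/sh
# Sketch: hand a PCI device over from the nvidia driver to pciback.
# late_bind and SYSFS_ROOT are illustrative names, not part of Xen.

late_bind() {
    bdf="$1"                                      # e.g. 0000:01:00.0
    drivers="${SYSFS_ROOT:-/sys}/bus/pci/drivers"

    # Release the device from the nvidia driver if it currently holds it.
    if [ -e "$drivers/nvidia/$bdf" ]; then
        printf '%s' "$bdf" > "$drivers/nvidia/unbind" || return 1
    fi

    # Register the slot with pciback, then bind the device to it.
    printf '%s' "$bdf" > "$drivers/pciback/new_slot" || return 1
    printf '%s' "$bdf" > "$drivers/pciback/bind" || return 1
}
```

Run as root, a call such as `late_bind 0000:01:00.0` performs the same three writes as the manual commands; where to hook it into startup (for example /etc/rc.local) is an assumption that depends on the system.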
[[BR]]
root@cloud:~$ cat /sys/bus/pci/drivers/pciback/slots [[BR]]
{{{
0000:01:00.0
}}}

=== 2.3.3 Permissive Flag ===
rider@cloud:~$ sudo vim /etc/xen/xend-pci-permissive.sxp [[BR]]
{{{
(unconstrained_dev_ids
    #('0123:4567:89AB:CDEF')
    ('0000:01:00.0')
)
}}}

=== 2.3.4 User-space Quirks ===
rider@cloud:~$ sudo vim /etc/xen/xend-pci-quirks.sxp [[BR]]
{{{
(pci_ids
    # Entries are formatted as follows:
    #     <vendor>:<device>[:<subvendor>:<subdevice>]
    ('10DE:0605'   # NVIDIA 9800GT
    )
)
}}}

=== 2.3.5 PCI Frontend Configuration ===
rider@cloud:~$ sudo vim /etc/xen/vm01.cfg [[BR]]
{{{
# We create a new virtual machine named "vm01"; an example pci configuration is shown below.
# In this case, we use the address of the "PCI Express" device as the example.
pci = ['01:00.0']
}}}
{{{
01:00.0 --> PCI Express
00:01.0 --> PCI bridge: Intel Corporation 82Q35 Express PCI Express Root Port
00:1d.0 --> USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller
00:1d.1 --> USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller
00:1d.2 --> USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller
00:1d.7 --> USB Controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller
}}}
rider@cloud:/etc/xen$ sudo xm create vm01.cfg [[BR]]
rider@cloud:/etc/xen$ dmesg | grep pciback [[BR]]
{{{
pciback 0000:01:00.0: seizing device
pciback: vpci: 0000:01:00.0: assign to virtual slot 0
}}}
rider@cloud:~$ sudo xm console vm01 [[BR]]
vm01:~# dmesg | grep pci [[BR]]
{{{
pcifront pci-0: Installing PCI frontend
pcifront pci-0: Creating PCI Frontend Bus 0000:00
}}}

= Part 3 Running CUDA on Xen !HowTo =

== 3.1 Create a virtual machine for CUDA ==
rider@cloud:~$ sudo vim /etc/xen-tools/xen-tools.conf [[BR]]
{{{
dir = /home
install-method = debootstrap
size = 6Gb # Disk image size.
memory = 256Mb # Memory size
swap = 128Mb # Swap size
fs = ext3 # use the EXT3 filesystem for the disk image.
dist = hardy # Default distribution to install. ---> For CUDA Support (Ubuntu 8.04)
image = sparse # Specify sparse vs.
# full disk images.
gateway = 140.XXX.XXX.XXX
netmask = 255.255.255.0
broadcast = 140.XXX.XXX.XXX
kernel = /boot/vmlinuz-`uname -r`
initrd = /boot/initrd.img-`uname -r`
mirror = http://gb.archive.ubuntu.com/ubuntu/
ext3_options = noatime,nodiratime,errors=remount-ro
ext2_options = noatime,nodiratime,errors=remount-ro
xfs_options = defaults
reiser_options = defaults
}}}
rider@cloud:~$ sudo xen-create-image --hostname cuda --ip 140.XXX.XXX.XXX [[BR]]

== 3.2 Running CUDA Example on !VirtualMachine ==
Step 1: [[BR]]
# !VirtualMachine startup [[BR]]
rider@cloud:~$ sudo xm create cuda.cfg [[BR]]
Step 2: [[BR]]
# Remote login [[BR]]
rider@350Z:~$ ssh 140.xxx.xxx.xxx [[BR]]
# Local login [[BR]]
rider@cloud:~$ sudo xm console cuda [[BR]]
Step 3: [[BR]]
# Install the NVIDIA CUDA toolkit and SDK. Reference: Part 1, 1.1 - Basic Environment [[BR]]
Step 4: [[BR]]
# Build your own CUDA project or run a CUDA example as a test. Reference: Part 3, 3.3 - Running CUDA Example on Xen [[BR]]
# Example: Device Bandwidth [[BR]]
rider@cuda:/usr/local/NVIDIA_CUDA_SDK/bin/linux/release$ sudo ./bandwidthTest [[BR]]
{{{
(Running on Xen_VirtualMachine)
Device 0: "GeForce 9800 GT"
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               31999998.0

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               320000000.0

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               640000000.0

&&&& Test PASSED
}}}
# Example: Device Query [[BR]]
rider@cuda:/usr/local/NVIDIA_CUDA_SDK/bin/linux/release$ sudo ./deviceQuery [[BR]]
{{{
(Running on Xen_VirtualMachine)
Device 0: "GeForce 9800 GT"
  Major revision number:                         0
  Minor revision number:                         0
  Total amount of global memory:                 6385920 bytes
  Number of multiprocessors:                     11007
  Number of cores:                               88056
  Total amount of constant memory:               6385872 bytes
  Total amount of shared memory per block:       3236702400 bytes
  Total number of registers available per block: 6385904
  Warp size:                                     0
  Maximum number of threads per block:           0
  Maximum sizes of each dimension of a block:    0 x 6385808 x 0
  Maximum sizes of each dimension of a grid:     0 x 0 x 2
  Maximum memory pitch:                          3234490924 bytes
  Texture alignment:                             3236702608 bytes
  Clock rate:                                    0.00 GHz
  Concurrent copy and execution:                 Yes
}}}

== 3.3 Running CUDA Example on Xen ==
'''Important:'''[[BR]]
In this case, we have to use "gcc-4.1" and "g++-4.1" instead of "gcc-4.3" to avoid the following stdio error.[[BR]]
{{{
/usr/include/bits/stdio2.h(...): error: identifier "__builtin_va_arg_pack" is undefined
}}}
Example !HowTo: [[BR]]
rider@cloud:~/opt/NVIDIA_CUDA_SDK$ sudo make [[BR]]
rider@cloud:~/opt/NVIDIA_CUDA_SDK$ cd ./bin/linux/release/ [[BR]]
rider@cloud:~/opt/NVIDIA_CUDA_SDK/bin/linux/release$ ./bandwidthTest [[BR]]
rider@cloud:~/opt/NVIDIA_CUDA_SDK/bin/linux/release$ ./deviceQuery [[BR]]
### Demo Example ###
{{{
(Running on Xen + Lustre Kernel)
Running on......
device 0:GeForce 9800 GT
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1574.6

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1187.9

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               41442.7

&&&& Test PASSED

Press ENTER to exit...
}}}
{{{
(Running on Xen + Lustre Kernel)
There is 1 device supporting CUDA

Device 0: "GeForce 9800 GT"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 1073414144 bytes
  Number of multiprocessors:                     14
  Number of cores:                               112
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.51 GHz
  Concurrent copy and execution:                 Yes
}}}

Reference: [[BR]]
1. NVIDIA CUDA: http://www.nvidia.com/object/cuda_home.html [[BR]]
2. openSUSE NVIDIA + Xen: http://en.opensuse.org/Use_Nvidia_driver_with_Xen [[BR]]
3. NVIDIA GPUs DEV_IDs: http://www.laptopvideo2go.com/forum/index.php?showtopic=7664 [[BR]]
4. pci_ids db: http://www.pcidatabase.com/ [[BR]]
5. Xen: assigning PCI devices to a domain: http://www.bestgrid.org/index.php/Xen:_assigning_PCI_devices_to_a_domain [[BR]]
6. Xen PCI Passthrough: http://www.wlug.org.nz/XenPciPassthrough [[BR]]
7. Xen Users' Manual v3.0: http://www.cl.cam.ac.uk/research/srg/netos/xen/readmes/user/ [[BR]]