Disclaimers
-----------

The files in this directory are provided by IBM on an "AS IS" basis
without warranty of any kind.  In addition, the results that you obtain
from using these files to measure the general performance of your
General Parallel File System (GPFS) file systems are "AS IS."  Your
reliance on any measurements is at your own risk, and IBM does not
assume any liability whatsoever arising from your use of these files or
your use of the resultant performance measurements.

The performance of GPFS file systems is affected by many factors,
including the access patterns of application programs, the
configuration and amount of memory on the SP nodes, the number and
characteristics of IBM Virtual Shared Disk (VSD) servers, the number
and speed of disks and disk adapters attached to the VSD servers, GPFS,
VSD, and SP switch configuration parameters, other traffic through the
SP switch, etc.  As a result, GPFS file system performance may vary,
and IBM does not make any particular performance claims for GPFS file
systems.

Introduction
------------

The files in this directory serve two purposes:

- Provide a simple benchmark program (gpfsperf) that can be used to
  measure the performance of GPFS for several common file access
  patterns.

- Give examples of how to use some of the gpfs_fcntl hints and
  directives that are new in GPFS version 1.3.

There are four versions of the program binary, built from a single set
of source files.  The four versions correspond to all possible
combinations of single node/multiple node and with/without features
that are only supported on GPFS version 1.3.  Multinode versions of the
gpfsperf program contain -mpi as part of their names, while versions
that do not use features requiring GPFS version 1.3 have a suffix of
-v12 in their names.

Parallelism
-----------

There are two independent ways to achieve parallelism in the gpfsperf
program.  More than one instance of the program can be run on multiple
nodes, using the Message Passing Interface (MPI) to synchronize their
execution, or a single instance of the program can execute several
threads in parallel on a single node.  These two techniques can also be
combined.  When describing the behavior of the program, 'threads'
should be understood to mean any of the threads of the gpfsperf program
on any node where MPI runs it.  When gpfsperf runs on multiple nodes,
the instances of the program use MPI to synchronize their execution and
to combine their measurements into an aggregate throughput result.

Access patterns
---------------

The gpfsperf program operates on a file that is assumed to consist of a
collection of records, each of the same size.  It can generate three
different types of access patterns: sequential, strided, and random.
The meaning of these access patterns in some cases depends on whether
or not parallelism is employed when the benchmark is run.

The simplest access pattern is random.  The gpfsperf program generates
a sequence of random record numbers and reads or writes the
corresponding records.  When run on multiple nodes, or when multiple
threads are used within instances of the program, each thread of the
gpfsperf program uses a different seed for its random number generator,
so each thread accesses an independent sequence of records.  Two
threads may access the same record if the same random record number
occurs in both sequences.
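The following sketch illustrates one way such independent random
streams can be produced.  The seeding scheme and the names used here
are invented for illustration; they are not the actual gpfsperf source:

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical illustration: give each thread its own record stream
     by seeding a reentrant generator with a per-thread value.  Threads
     with different seeds produce independent sequences, though the
     same record number may still appear in more than one sequence. */
  static void random_records(int threadId, long nRecords, int count)
  {
      unsigned int seed = 12345u + (unsigned int)threadId;
      for (int i = 0; i < count; i++)
      {
          long recNum = rand_r(&seed) % nRecords;
          printf("thread %d accesses record %ld\n", threadId, recNum);
      }
  }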
In the sequential access pattern, each gpfsperf thread reads from or
writes to a contiguous partition of the file sequentially.  For
example, suppose that a 10 billion byte file consists of one million
records of 10,000 bytes each.  If 10 threads read the file according to
the sequential access pattern, the first thread will read sequentially
through the partition consisting of the first 100,000 records, the next
thread will read from the next 100,000 records, and so on.

In a strided access pattern, each thread skips some number of records
between each record that it reads or writes.  Reading the file from the
example above in a strided pattern, the first thread would read records
0, 10, 20, ..., 999,990.  The second thread would read records 1, 11,
21, ..., 999,991, and so on.  By default, gpfsperf uses a stride, or
distance between records, equal to the total number of threads
operating on the file, in this case 10.

Amount of data to be transferred
--------------------------------

One of the input parameters to gpfsperf is the amount of data to be
transferred.  This is the total number of bytes to be read or written
by all threads of the program.  If there are T threads in total, and
the total number of bytes to be transferred is N, each thread will read
or write about N/T bytes, rounded to a multiple of the record size.

By default, gpfsperf sets N to the size of the file.  For the
sequential or strided access pattern, this default means that every
record in the file will be read or written exactly once.  If N is
greater than the size of the file, each thread will read or write its
partition of the file repeatedly until reaching its share (N/T) of the
bytes to be transferred.  For example, suppose 10 threads sequentially
read a 10 billion byte file of 10,000 byte records when N is 15
billion.  The first thread will read the first 100,000 records, then
reread the first 50,000 records.  The second thread will read records
100,000 through 199,999, then reread records 100,000 through 149,999,
and so on.

When using strided access patterns with other than the default stride,
this behavior of gpfsperf can cause unexpected results.  For example,
suppose that 10 threads read the 10 billion byte file using a strided
access pattern, but instead of the default stride of 10 records
gpfsperf is told to use a stride of 10,000 records.  The file partition
read by the first thread will be records 0, 10,000, 20,000, ...,
990,000.  This is only 100 distinct records of 10,000 bytes each, or a
total of only 1 million bytes of data.  This data will likely remain in
the GPFS buffer pool after it is read the first time.  If N is more
than 10 million bytes, gpfsperf will "read" the same buffered data
multiple times, and the reported data rate will appear anomalously
high.  To avoid this effect, performance tests using non-default
strides should reduce N in the same proportion as the stride was
increased from its default.

Computation of aggregate data rate and utilization
--------------------------------------------------

The results of a run of gpfsperf are reported as an aggregate data
rate.  Data rate is defined as the total number of bytes read or
written by all threads divided by the total time of the test, and is
reported in units of 1000 bytes/second.  The test time is measured from
before any node opens the test file until after the last node closes
the test file.  To ensure a consistent environment for each test,
before beginning the timed test period gpfsperf issues the
GPFS_CLEAR_FILE_CACHE hint on all nodes.  This flushes the GPFS buffer
cache and releases byte-range tokens.
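In code, issuing this hint looks roughly like the sketch below, which
follows the gpfs_fcntl interface as documented in the GPFS Guide and
Reference; consult gpfs_fcntl.h for the authoritative structure
definitions, since the layout shown here is an assumption based on that
documentation:

  #include <stdio.h>
  #include <gpfs_fcntl.h>

  /* Sketch: flush and invalidate cached blocks of an open file and
     release its byte-range tokens, as gpfsperf does before each
     timed test. */
  static int clear_file_cache(int fd)
  {
      struct
      {
          gpfsFcntlHeader_t    hdr;
          gpfsClearFileCache_t inv;
      } arg;

      arg.hdr.totalLength   = sizeof(arg);
      arg.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
      arg.hdr.fcntlReserved = 0;
      arg.inv.structLen     = sizeof(arg.inv);
      arg.inv.structType    = GPFS_CLEAR_FILE_CACHE;

      if (gpfs_fcntl(fd, &arg) != 0)
      {
          perror("gpfs_fcntl(GPFS_CLEAR_FILE_CACHE)");
          return -1;
      }
      return 0;
  }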
Note that versions of GPFS prior to v1.3 do not support GPFS file
hints, so with those versions the state of the buffer cache at the
beginning of a test will be influenced by the state left by the prior
test.

Since all threads of a gpfsperf run do approximately the same amount of
work (read or write N/T bytes), in principle they should all run for
the same amount of time.  In practice, however, variations in disk and
switch response time, as well as lock contention in GPFS, lead to
variations in execution times among the threads.  A large degree of
non-uniformity is undesirable, since it means that some nodes are idle
while they wait for threads on other nodes to finish.  To measure the
degree of uniformity of thread execution times, gpfsperf computes a
quantity it calls "utilization."  Utilization is the fraction of the
total number of thread-seconds in a test during which threads actively
perform reads or writes.  A value of 1.0 indicates perfect overlap,
while lower values denote that some threads were idle while others were
still running.

The following timeline illustrates how gpfsperf computes utilization
for a test involving one thread on each of two nodes, reading a total
of 100M:

  time  event
  ----  -----
   0.0  Node 0 captures timestamp for beginning of test
   ...  both nodes open file
   0.1  Node 0 about to do first read
   0.1  Node 1 about to do first read
   ...  many file reads
   4.9  Node 0 finishes last read
   4.9  Node 1 finishes last read
   ...  both nodes close file
   5.0  Node 0 captures timestamp for end of test

The utilization is ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0)) = 0.96, and
the reported aggregate data rate would be 100M / (5.0-0.0) = 20M/sec.
If node 0 ran significantly slower than node 1, it might have finished
its last read at time 9.9 instead of at time 4.9, and the end of test
timestamp might be 10.0 instead of 5.0.  In this case the utilization
would drop to ((4.9-0.1) + (9.9-0.1)) / (2*(10.0-0.0)) = 0.73, and the
data rate would be 100M / (10.0-0.0) = 10M/sec.
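Expressed as code, the utilization computation is just the sum of
per-thread active times divided by the thread count times the elapsed
test time (a sketch with illustrative names, not the gpfsperf source):

  /* Utilization: fraction of available thread-seconds spent actively
     reading or writing.  activeSecs[i] is the time thread i spent
     between its first and last I/O; elapsedSecs is the whole timed
     test period from first open to last close. */
  static double utilization(const double activeSecs[], int nThreads,
                            double elapsedSecs)
  {
      double busy = 0.0;
      for (int i = 0; i < nThreads; i++)
          busy += activeSecs[i];
      return busy / (nThreads * elapsedSecs);
  }

For the timeline above, utilization((double[]){4.8, 4.8}, 2, 5.0)
evaluates to 0.96.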
Command line parameters
-----------------------

There are four versions of gpfsperf in this directory:

  gpfsperf-mpi     - runs on multiple nodes under MPI, requires GPFS
                     v1.3 or later
  gpfsperf         - runs only on a single node, requires GPFS v1.3 or
                     later
  gpfsperf-mpi-v12 - runs on multiple nodes under MPI, does not require
                     GPFS v1.3
  gpfsperf-v12     - runs only on a single node, does not require GPFS
                     v1.3

The command line for any of the versions of gpfsperf is:

  gpfsperf[-mpi][-v12] operation pattern fn [options]

The order of parameters on the command line is not significant.

The operation must be one of "create", "read", "write", or "uncache".
All threads in a multinode or multithreaded run of gpfsperf perform the
same operation on the same file, but at different offsets.  Create
means to ensure that the file exists, then do a write test.  The
uncache operation does not read or write the file, but only removes any
buffered data for the file from the GPFS buffer cache.

The pattern must be one of "rand", "randhint", "strided", or "seq".
The meaning of each of these was explained earlier, except for
"randhint".  The "randhint" pattern is the same as "rand", except that
the GPFS multiple access range hint is used, through the library
functions in irreg.c, to prefetch blocks before they are accessed by
gpfsperf.

The filename parameter fn should resolve to a file in a GPFS file
system that is mounted on all nodes where gpfsperf is to run.  The file
must already exist unless the "create" operation was specified.  Use of
gpfsperf on files not in GPFS may be meaningful in some situations, but
this use has not been tested.

Each optional parameter is described below, along with its default
value.

  -nolabels - Produce a single line of output containing all parameters
      and the measured data rate.  The format of the output line is
      'op pattern fn recordSize nBytes fileSize nProcs nThreads
      strideRecs inv ds dio fsync reltoken aio osync rate util'.  This
      format may be useful for importing results into a spreadsheet or
      other program for further analysis.  The default is to produce
      multi-line labelled output.

  -r recsize - Record size.  Defaults to the file system block size.
      Must be specified for create operations.

  -n nBytes - Number of bytes to transfer.  Defaults to the file size.
      Must be specified for create operations.

  -s stride - Number of bytes between successive accesses by the same
      thread.  Only meaningful for strided access patterns.  Must be a
      multiple of the record size.  See the earlier cautions about
      combining large values of -s with large values of -n.  The
      default stride is the record size times the total number of
      threads (threads per process times number of processes).

  -th nThreads - Number of threads per process.  The default is 1.
      When there are multiple threads per process, they read adjacent
      blocks of the file for the sequential and strided access
      patterns.  For example, suppose a file of 60 records is being
      read by 3 nodes with 2 threads per node.  Under the sequential
      pattern, thread 0 on node 0 will read records 0-9, thread 1 on
      node 0 will read records 10-19, thread 0 on node 1 will read
      records 20-29, etc.  Under a strided pattern, thread 0 on node 0
      will read records 0, 6, 12, ..., 54, thread 1 on node 0 will read
      records 1, 7, 13, ..., 55, etc.

  -noinv - Do not clear blocks of fn from the GPFS file cache before
      starting the test.  The default is to clear the cache.  If this
      option is given, the results of the test can depend strongly on
      the lock and buffer state left by the last test.  For example, a
      multinode sequential read with -noinv will run more slowly after
      a strided write test than after a sequential write test.

  -ds - Use GPFS data shipping.  Data shipping avoids lock conflicts by
      partitioning the file among the nodes running gpfsperf, turning
      off byte-range locking, and sending messages to the appropriate
      agent node to handle each read or write request.  However, since
      (n-1)/n of the accesses are remote, the gpfsperf threads cannot
      take advantage of local block caching, although they may still
      benefit from prefetching and write-behind.  Also, since
      byte-range locking is not in effect, use of data shipping
      suspends the atomicity guarantees of X/Open file semantics.  See
      the GPFS Guide and Reference manual for more details.  Data
      shipping should show the largest performance benefit for strided
      writes that have small record sizes.  The default is not to use
      data shipping.

  -aio depth - Use asynchronous I/O, prefetching to the given depth
      (default 0, maximum 1000).  This can be used with any of the
      seq/rand/strided test patterns.

  -dio - Use the direct I/O flag when opening the file.  This allows
      sector-aligned and sector-sized buffers to be transferred
      directly between the application buffer and the disks where the
      blocks are allocated.

  -reltoken - Release the entire file byte-range token after the file
      is newly created.  In a multinode (MPI) environment, only the
      first process creates the file; all the other processes wait and
      open the file after the creation has occurred.  This flag tells
      the first process to release the byte-range token it
      automatically acquires during the create.  This may increase
      performance, because other nodes that work on different ranges of
      the file will not need to revoke the range held by the node
      running the first process.

  -fsync - Ensure that no dirty data remain buffered at the conclusion
      of a write or create test.  The time to perform the necessary
      fsync operation is included in the test time, so this option
      reduces the reported aggregate data rate.  The default is not to
      fsync the file.

  -osync - Turn on the O_SYNC flag when opening the file.  This causes
      every write operation to force the data to disk before returning.
      The default is not to use O_SYNC.

  -v - Verbose tracing.  In a multinode test, output from each instance
      of the program will be intermingled.  Telling MPI to label the
      output from each node (set the MP_LABELIO environment variable to
      yes) makes the verbose output easier to follow.

  -V - Very verbose tracing.  This option displays the offset of every
      read or write operation on every node.  As with -v, labelling the
      output by node is suggested.

Numbers in options can be given using K, M, or G suffixes, in upper or
lower case, to denote 2**10, 2**20, or 2**30, respectively, or can have
an R or r suffix to denote a multiple of the record size.  For example,
to specify a record size of 4096 bytes and a size to read or write of
409600 bytes, one could write "-r 4k -n 100r".
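A sketch of how such suffixes might be decoded (an illustrative helper,
not the actual gpfsperf parser):

  #include <ctype.h>
  #include <stdlib.h>

  /* Decode a number with an optional K/M/G suffix (powers of 2) or an
     R suffix (multiple of the record size).  Returns -1 on an
     unrecognized suffix. */
  static long long parse_size(const char *s, long long recSize)
  {
      char *end;
      long long n = strtoll(s, &end, 10);

      switch (tolower((unsigned char)*end))
      {
          case '\0': return n;            /* plain byte count */
          case 'k':  return n << 10;
          case 'm':  return n << 20;
          case 'g':  return n << 30;
          case 'r':  return n * recSize;  /* record size multiple */
          default:   return -1;
      }
  }

With a 4096-byte record size, parse_size("100r", 4096) yields 409600,
matching the example above.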
AIX only: If the Asynchronous I/O (AIO) kernel extension has not been
loaded, running the gpfsperf program will fail and display output like:

  exec(): 0509-036 Cannot load program gpfsperf because of the
                   following errors:
  0509-130 Symbol resolution failed for /usr/lib/threads/libc.a(aio.o)
           because:
  0509-136   Symbol kaio_rdwr (number 0) is not exported from
             dependent module /unix.
  0509-136   Symbol listio (number 1) is not exported from
             dependent module /unix.
  0509-136   Symbol acancel (number 2) is not exported from
             dependent module /unix.
  0509-136   Symbol iosuspend (number 3) is not exported from
             dependent module /unix.
  0509-136   Symbol aio_nwait (number 4) is not exported from
             dependent module /unix.
  0509-192 Examine .loader section symbols with the 'dump -Tv' command.

If you do not wish to use AIO, you can recompile the gpfsperf program
without the AIO calls:

  rm gpfsperf.o gpfsperf-mpi.o
  make OTHERINCL="-DNO_AIO"

To enable AIO on your system, use commands like the following:

  lsattr -El aio0
  chdev -l aio0 -P -a autoconfig=available -a minservers=10 -a maxservers=128
  mkdev -l aio0

Minservers tells AIX how many AIO kprocs to create immediately, and
maxservers limits the total number created.  On AIX 5.2 the meaning of
maxservers changed to "maximum number of servers per CPU", so a 16-way
SMP should set maxservers=8 to get a total of 128 kprocs.

Examples
--------

Suppose that /gpfs is a GPFS file system that was formatted with 256K
blocks and that has at least a gigabyte of free space.  Assuming that
the gpfsperf programs have been copied into /gpfs/test, and that
/gpfs/test is the current directory, the following ksh commands
illustrate how to run gpfsperf:

  # Number of nodes on which the test will run.  If this is increased,
  # the size of the test file should also be increased.
  export MP_PROCS=8

  # File containing a list of nodes on which gpfsperf will run.  There
  # are other ways to specify where the test runs besides using an
  # explicit host list.  See the Parallel Operating Environment
  # documentation for details.
  export MP_HOSTFILE=/etc/cluster.nodes

  # Name of test file to be manipulated by the tests that follow.
  export fn=/gpfs/test/testfile

  # Verify block size
  mmlsfs gpfs -B
  # Create test file.  All write tests in these examples specify
  # -fsync, so the reported data rate includes the overhead of flushing
  # all dirty buffers to disk.  The size of the test file should be
  # increased if more than 8 nodes are used or if GPFS pagepool sizes
  # have been increased from their defaults.  It may be necessary to
  # increase the maximum size file the user is allowed to create.  See
  # the ulimit command.
  ./gpfsperf-mpi create seq $fn -n 999m -r 256k -fsync

  # Read entire test file sequentially
  ./gpfsperf-mpi read seq $fn -r 256k

  # Rewrite test file sequentially using full block writes
  ./gpfsperf-mpi write seq $fn -r 256k -fsync

  # Rewrite test file sequentially using small writes.  This requires
  # GPFS to read blocks in order to update them, so it will perform
  # worse than the full block rewrite.
  ./gpfsperf-mpi write seq $fn -r 64k -fsync

  # Strided read using big records
  ./gpfsperf-mpi read strided $fn -r 256k

  # Strided read using medium sized records.  Performance is worse
  # because the average I/O size has gone down.  This behavior will not
  # be seen unless the stride is larger than a block (8*50000 > 256K).
  ./gpfsperf-mpi read strided $fn -r 50000

  # Strided read using a very large stride.  Reported performance is
  # misleading because each node just reads the same records over and
  # over from its GPFS buffer cache.
  ./gpfsperf-mpi read strided $fn -r 50000 -s 2400r

  # Strided write using a record size equal to the block size.  Decent
  # performance, since the record size matches GPFS lock granularity.
  ./gpfsperf-mpi write strided $fn -r 256k -fsync

  # Strided write using small records.  Since GPFS lock granularity is
  # larger than a record, performance is much worse.  The number of
  # bytes written is less than the entire file to keep the test time
  # reasonable.
  ./gpfsperf-mpi write strided $fn -r 10000 -n 100m -fsync

  # Strided write using small records and data shipping.  Data shipping
  # trades additional communication overhead for less lock contention,
  # improving performance.
  ./gpfsperf-mpi write strided $fn -r 10000 -n 100m -ds -fsync

  # Random read of small records
  ./gpfsperf-mpi read rand $fn -r 10000 -n 100m

  # Random read of small records using the GPFS multiple access range
  # hint.  Better performance (assuming more than MP_PROCS disks)
  # because each node has more than one disk read in progress at once
  # due to prefetching.
  ./gpfsperf-mpi read randhint $fn -r 10000 -n 100m
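The prefetching behind randhint is performed by the library functions
in irreg.c using the GPFS multiple access range hint.  The following is
a rough sketch of declaring a single future block access with that
hint; the structure layout is an assumption based on the GPFS Guide and
Reference, so check gpfs_fcntl.h for the exact definitions:

  #include <stdio.h>
  #include <gpfs_fcntl.h>

  /* Sketch: tell GPFS that one block of the file will be accessed
     soon, so it can begin prefetching before the read is issued. */
  static int hint_one_block(int fd, long long blockNum, int blockLen)
  {
      struct
      {
          gpfsFcntlHeader_t         hdr;
          gpfsMultipleAccessRange_t marh;
      } arg;

      arg.hdr.totalLength   = sizeof(arg);
      arg.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
      arg.hdr.fcntlReserved = 0;
      arg.marh.structLen    = sizeof(arg.marh);
      arg.marh.structType   = GPFS_MULTIPLE_ACCESS_RANGE;
      arg.marh.accRangeCnt  = 1;   /* one new range to prefetch */
      arg.marh.relRangeCnt  = 0;   /* nothing released yet */
      arg.marh.accRangeArray[0].blockNumber = blockNum;
      arg.marh.accRangeArray[0].start       = 0;
      arg.marh.accRangeArray[0].length      = blockLen;
      arg.marh.accRangeArray[0].isWrite     = 0;

      if (gpfs_fcntl(fd, &arg) != 0)
      {
          perror("gpfs_fcntl(GPFS_MULTIPLE_ACCESS_RANGE)");
          return -1;
      }
      /* On return, accRangeCnt tells how many ranges GPFS accepted;
         accepted ranges must later be released via relRangeArray. */
      return 0;
  }

Issuing hints ahead of the actual reads is what lets each node keep
more than one disk read in progress at once, as noted in the final
example above.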