Disclaimers
-----------

The files in this directory are provided by IBM on an "AS IS" basis
without warranty of any kind.  In addition, the results that you obtain
from using these files to measure the general performance of your
General Parallel File System (GPFS) file systems are "AS IS."  Your
reliance on any measurements is at your own risk, and IBM does not
assume any liability whatsoever arising from your use of these files or
your use of the resultant performance measurements.

The performance of GPFS file systems is affected by many factors,
including the access patterns of application programs, the
configuration and amount of memory on the SP nodes, the number and
characteristics of IBM Virtual Shared Disk (VSD) servers, the number
and speed of disks and disk adapters attached to the VSD servers, GPFS,
VSD, and SP switch configuration parameters, other traffic through the
SP switch, etc.  As a result, GPFS file system performance may vary,
and IBM does not make any particular performance claims for GPFS file
systems.

Introduction
------------

The files in this directory serve two purposes:

- Provide a simple benchmark program (gpfsperf) that can be used to
  measure the performance of GPFS for several common file access
  patterns.

- Give examples of how to use some of the gpfs_fcntl hints and
  directives that are new in GPFS version 1.3.

There are four versions of the program binary, built from a single set
of source files.  The four versions correspond to all possible
combinations of single node/multiple node and with/without features
that are only supported on GPFS version 1.3.  Multinode versions of the
gpfsperf program contain -mpi as part of their names, while versions
that do not use features requiring GPFS version 1.3 have a suffix of
-v12 in their names.

Parallelism
-----------

There are two independent ways to achieve parallelism in the gpfsperf
program.  More than one instance of the program can be run on multiple
nodes, using the Message Passing Interface (MPI) to synchronize their
execution, or a single instance of the program can execute several
threads in parallel on a single node.  These two techniques can also be
combined.  When describing the behavior of the program, 'threads'
should be understood to mean any of the threads of the gpfsperf program
on any node where MPI runs it.  When gpfsperf runs on multiple nodes,
the instances of the program use MPI to synchronize their execution and
to combine their measurements into an aggregate throughput result.

Access patterns
---------------

The gpfsperf program operates on a file that is assumed to consist of a
collection of records, each of the same size.  It can generate three
different types of access patterns: sequential, strided, and random.
The meaning of these access patterns in some cases depends on whether
or not parallelism is employed when the benchmark is run.

The simplest access pattern is random.  The gpfsperf program generates
a sequence of random record numbers and reads or writes the
corresponding records.  When run on multiple nodes, or when multiple
threads are used within instances of the program, each thread of the
gpfsperf program uses a different seed for its random number generator,
so each thread accesses an independent sequence of records.  Two
threads may access the same record if the same random record number
occurs in both sequences.
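The following sketch illustrates one way such independent random
streams can be produced.  The seeding scheme and the names used here
are invented for illustration; they are not the actual gpfsperf source:

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical illustration: give each thread its own record stream
     by seeding a reentrant generator with a per-thread value.  Threads
     with different seeds produce independent sequences, though the
     same record number may still appear in more than one sequence. */
  static void random_records(int threadId, long nRecords, int count)
  {
      unsigned int seed = 12345u + (unsigned int)threadId;
      for (int i = 0; i < count; i++)
      {
          long recNum = rand_r(&seed) % nRecords;
          printf("thread %d accesses record %ld\n", threadId, recNum);
      }
  }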
In the sequential access pattern, each gpfsperf thread reads from or
writes to a contiguous partition of the file sequentially.  For
example, suppose that a 10 billion byte file consists of one million
records of 10,000 bytes each.  If 10 threads read the file according to
the sequential access pattern, the first thread will read sequentially
through the partition consisting of the first 100,000 records, the next
thread will read from the next 100,000 records, and so on.

In a strided access pattern, each thread skips some number of records
between each record that it reads or writes.  Reading the file from the
example above in a strided pattern, the first thread would read records
0, 10, 20, ..., 999,990.  The second thread would read records 1, 11,
21, ..., 999,991, and so on.  By default, gpfsperf uses a stride, or
distance between records, equal to the total number of threads
operating on the file, in this case 10.

Amount of data to be transferred
--------------------------------

One of the input parameters to gpfsperf is the amount of data to be
transferred.  This is the total number of bytes to be read or written
by all threads of the program.  If there are T threads in total, and
the total number of bytes to be transferred is N, each thread will read
or write about N/T bytes, rounded to a multiple of the record size.

By default, gpfsperf sets N to the size of the file.  For the
sequential or strided access pattern, this default means that every
record in the file will be read or written exactly once.  If N is
greater than the size of the file, each thread will read or write its
partition of the file repeatedly until reaching its share (N/T) of the
bytes to be transferred.  For example, suppose 10 threads sequentially
read a 10 billion byte file of 10,000 byte records when N is 15
billion.  The first thread will read the first 100,000 records, then
reread the first 50,000 records.  The second thread will read records
100,000 through 199,999, then reread records 100,000 through 149,999,
and so on.

When using strided access patterns with other than the default stride,
this behavior of gpfsperf can cause unexpected results.  For example,
suppose that 10 threads read the 10 billion byte file using a strided
access pattern, but instead of the default stride of 10 records
gpfsperf is told to use a stride of 10,000 records.  The file partition
read by the first thread will be records 0, 10,000, 20,000, ...,
990,000.  This is only 100 distinct records of 10,000 bytes each, or a
total of only 1 million bytes of data.  This data will likely remain in
the GPFS buffer pool after it is read the first time.  If N is more
than 10 million bytes, gpfsperf will "read" the same buffered data
multiple times, and the reported data rate will appear anomalously
high.  To avoid this effect, performance tests using non-default
strides should reduce N in the same proportion as the stride was
increased from its default.

Computation of aggregate data rate and utilization
--------------------------------------------------

The results of a run of gpfsperf are reported as an aggregate data
rate.  Data rate is defined as the total number of bytes read or
written by all threads divided by the total time of the test, and is
reported in units of 1000 bytes/second.  The test time is measured from
before any node opens the test file until after the last node closes
the test file.  To ensure a consistent environment for each test,
before beginning the timed test period gpfsperf issues the
GPFS_CLEAR_FILE_CACHE hint on all nodes.  This flushes the GPFS buffer
cache and releases byte-range tokens.
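In code, issuing this hint looks roughly like the sketch below, which
follows the gpfs_fcntl interface as documented in the GPFS Guide and
Reference; consult gpfs_fcntl.h for the authoritative structure
definitions, since the layout shown here is an assumption based on that
documentation:

  #include <stdio.h>
  #include <gpfs_fcntl.h>

  /* Sketch: flush and invalidate cached blocks of an open file and
     release its byte-range tokens, as gpfsperf does before each
     timed test. */
  static int clear_file_cache(int fd)
  {
      struct
      {
          gpfsFcntlHeader_t    hdr;
          gpfsClearFileCache_t inv;
      } arg;

      arg.hdr.totalLength   = sizeof(arg);
      arg.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
      arg.hdr.fcntlReserved = 0;
      arg.inv.structLen     = sizeof(arg.inv);
      arg.inv.structType    = GPFS_CLEAR_FILE_CACHE;

      if (gpfs_fcntl(fd, &arg) != 0)
      {
          perror("gpfs_fcntl(GPFS_CLEAR_FILE_CACHE)");
          return -1;
      }
      return 0;
  }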
Note that versions of GPFS prior to v1.3 do not support GPFS file
hints, so with those versions the state of the buffer cache at the
beginning of a test will be influenced by the state left by the prior
test.

Since all threads of a gpfsperf run do approximately the same amount of
work (read or write N/T bytes), in principle they should all run for
the same amount of time.  In practice, however, variations in disk and
switch response time, as well as lock contention in GPFS, lead to
variations in execution times among the threads.  A large degree of
non-uniformity is undesirable, since it means that some nodes are idle
while they wait for threads on other nodes to finish.  To measure the
degree of uniformity of thread execution times, gpfsperf computes a
quantity it calls "utilization."  Utilization is the fraction of the
total number of thread-seconds in a test during which threads actively
perform reads or writes.  A value of 1.0 indicates perfect overlap,
while lower values denote that some threads were idle while others were
still running.

The following timeline illustrates how gpfsperf computes utilization
for a test involving one thread on each of two nodes, reading a total
of 100M:

  time  event
  ----  -----
   0.0  Node 0 captures timestamp for beginning of test
   ...  both nodes open file
   0.1  Node 0 about to do first read
   0.1  Node 1 about to do first read
   ...  many file reads
   4.9  Node 0 finishes last read
   4.9  Node 1 finishes last read
   ...  both nodes close file
   5.0  Node 0 captures timestamp for end of test

The utilization is ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0)) = 0.96, and
the reported aggregate data rate would be 100M / (5.0-0.0) = 20M/sec.
If node 0 ran significantly slower than node 1, it might have finished
its last read at time 9.9 instead of at time 4.9, and the end of test
timestamp might be 10.0 instead of 5.0.  In this case the utilization
would drop to ((4.9-0.1) + (9.9-0.1)) / (2*(10.0-0.0)) = 0.73, and the
data rate would be 100M / (10.0-0.0) = 10M/sec.
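Expressed as code, the utilization computation is just the sum of
per-thread active times divided by the thread count times the elapsed
test time (a sketch with illustrative names, not the gpfsperf source):

  /* Utilization: fraction of available thread-seconds spent actively
     reading or writing.  activeSecs[i] is the time thread i spent
     between its first and last I/O; elapsedSecs is the whole timed
     test period from first open to last close. */
  static double utilization(const double activeSecs[], int nThreads,
                            double elapsedSecs)
  {
      double busy = 0.0;
      for (int i = 0; i < nThreads; i++)
          busy += activeSecs[i];
      return busy / (nThreads * elapsedSecs);
  }

For the timeline above, utilization((double[]){4.8, 4.8}, 2, 5.0)
evaluates to 0.96.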
Command line parameters
-----------------------

There are four versions of gpfsperf in this directory:

  gpfsperf-mpi     - runs on multiple nodes under MPI, requires GPFS
                     v1.3 or later
  gpfsperf         - runs only on a single node, requires GPFS v1.3 or
                     later
  gpfsperf-mpi-v12 - runs on multiple nodes under MPI, does not require
                     GPFS v1.3
  gpfsperf-v12     - runs only on a single node, does not require GPFS
                     v1.3

The command line for any of the versions of gpfsperf is:

  gpfsperf[-mpi][-v12] operation pattern fn [options]

The order of parameters on the command line is not significant.

The operation must be one of "create", "read", "write", or "uncache".
All threads in a multinode or multithreaded run of gpfsperf perform the
same operation on the same file, but at different offsets.  Create
means to ensure that the file exists, then do a write test.  The
uncache operation does not read or write the file, but only removes any
buffered data for the file from the GPFS buffer cache.

The pattern must be one of "rand", "randhint", "strided", or "seq".
The meaning of each of these was explained earlier, except for
"randhint".  The "randhint" pattern is the same as "rand", except that
the GPFS multiple access range hint is used, through the library
functions in irreg.c, to prefetch blocks before they are accessed by
gpfsperf.

The filename parameter fn should resolve to a file in a GPFS file
system that is mounted on all nodes where gpfsperf is to run.  The file
must already exist unless the "create" operation was specified.  Use of
gpfsperf on files not in GPFS may be meaningful in some situations, but
this use has not been tested.

Each optional parameter is described below, along with its default
value.

  -nolabels - Produce a single line of output containing all parameters
      and the measured data rate.  The format of the output line is
      'op pattern fn recordSize nBytes fileSize nProcs nThreads
      strideRecs inv ds dio fsync reltoken aio osync rate util'.  This
      format may be useful for importing results into a spreadsheet or
      other program for further analysis.  The default is to produce
      multi-line labelled output.

  -r recsize - Record size.  Defaults to the file system block size.
      Must be specified for create operations.

  -n nBytes - Number of bytes to transfer.  Defaults to the file size.
      Must be specified for create operations.

  -s stride - Number of bytes between successive accesses by the same
      thread.  Only meaningful for strided access patterns.  Must be a
      multiple of the record size.  See the earlier cautions about
      combining large values of -s with large values of -n.  The
      default stride is the record size times the total number of
      threads (threads per process times number of processes).

  -th nThreads - Number of threads per process.  The default is 1.
      When there are multiple threads per process, they read adjacent
      blocks of the file for the sequential and strided access
      patterns.  For example, suppose a file of 60 records is being
      read by 3 nodes with 2 threads per node.  Under the sequential
      pattern, thread 0 on node 0 will read records 0-9, thread 1 on
      node 0 will read records 10-19, thread 0 on node 1 will read
      records 20-29, etc.  Under a strided pattern, thread 0 on node 0
      will read records 0, 6, 12, ..., 54, thread 1 on node 0 will read
      records 1, 7, 13, ..., 55, etc.

  -noinv - Do not clear blocks of fn from the GPFS file cache before
      starting the test.  The default is to clear the cache.  If this
      option is given, the results of the test can depend strongly on
      the lock and buffer state left by the last test.  For example, a
      multinode sequential read with -noinv will run more slowly after
      a strided write test than after a sequential write test.

  -ds - Use GPFS data shipping.  Data shipping avoids lock conflicts by
      partitioning the file among the nodes running gpfsperf, turning
      off byte-range locking, and sending messages to the appropriate
      agent node to handle each read or write request.  However, since
      (n-1)/n of the accesses are remote, the gpfsperf threads cannot
      take advantage of local block caching, although they may still
      benefit from prefetching and write-behind.  Also, since
      byte-range locking is not in effect, use of data shipping
      suspends the atomicity guarantees of X/Open file semantics.  See
      the GPFS Guide and Reference manual for more details.  Data
      shipping should show the largest performance benefit for strided
      writes that have small record sizes.  The default is not to use
      data shipping.

  -aio depth - Use asynchronous I/O, prefetching to the given depth
      (default 0, maximum 1000).  This can be used with any of the
      seq/rand/strided test patterns.

  -dio - Use the direct I/O flag when opening the file.  This allows
      sector-aligned and sector-sized buffers to be transferred
      directly between the application buffer and the disks where the
      blocks are allocated.

  -reltoken - Release the entire file byte-range token after the file
      is newly created.  In a multinode (MPI) environment, only the
      first process creates the file; all the other processes wait and
      open the file after the creation has occurred.  This flag tells
      the first process to release the byte-range token it
      automatically acquires during the create.  This may increase
      performance, because other nodes that work on different ranges of
      the file will not need to revoke the range held by the node
      running the first process.

  -fsync - Ensure that no dirty data remain buffered at the conclusion
      of a write or create test.  The time to perform the necessary
      fsync operation is included in the test time, so this option
      reduces the reported aggregate data rate.  The default is not to
      fsync the file.

  -osync - Turn on the O_SYNC flag when opening the file.  This causes
      every write operation to force the data to disk before returning.
      The default is not to use O_SYNC.

  -v - Verbose tracing.  In a multinode test, output from each instance
      of the program will be intermingled.  Telling MPI to label the
      output from each node (set the MP_LABELIO environment variable to
      yes) makes the verbose output easier to follow.

  -V - Very verbose tracing.  This option displays the offset of every
      read or write operation on every node.  As with -v, labelling the
      output by node is suggested.

Numbers in options can be given using K, M, or G suffixes, in upper or
lower case, to denote 2**10, 2**20, or 2**30, respectively, or can have
an R or r suffix to denote a multiple of the record size.  For example,
to specify a record size of 4096 bytes and a size to read or write of
409600 bytes, one could write "-r 4k -n 100r".
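A sketch of how such suffixes might be decoded (an illustrative helper,
not the actual gpfsperf parser):

  #include <ctype.h>
  #include <stdlib.h>

  /* Decode a number with an optional K/M/G suffix (powers of 2) or an
     R suffix (multiple of the record size).  Returns -1 on an
     unrecognized suffix. */
  static long long parse_size(const char *s, long long recSize)
  {
      char *end;
      long long n = strtoll(s, &end, 10);

      switch (tolower((unsigned char)*end))
      {
          case '\0': return n;            /* plain byte count */
          case 'k':  return n << 10;
          case 'm':  return n << 20;
          case 'g':  return n << 30;
          case 'r':  return n * recSize;  /* record size multiple */
          default:   return -1;
      }
  }

With a 4096-byte record size, parse_size("100r", 4096) yields 409600,
matching the example above.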
AIX only: If the Asynchronous I/O (AIO) kernel extension has not been
loaded, running the gpfsperf program will fail and display output like:

  exec(): 0509-036 Cannot load program gpfsperf because of the
                   following errors:
  0509-130 Symbol resolution failed for /usr/lib/threads/libc.a(aio.o)
           because:
  0509-136   Symbol kaio_rdwr (number 0) is not exported from
             dependent module /unix.
  0509-136   Symbol listio (number 1) is not exported from
             dependent module /unix.
  0509-136   Symbol acancel (number 2) is not exported from
             dependent module /unix.
  0509-136   Symbol iosuspend (number 3) is not exported from
             dependent module /unix.
  0509-136   Symbol aio_nwait (number 4) is not exported from
             dependent module /unix.
  0509-192 Examine .loader section symbols with the 'dump -Tv' command.

If you do not wish to use AIO, you can recompile the gpfsperf program
without the AIO calls:

  rm gpfsperf.o gpfsperf-mpi.o
  make OTHERINCL="-DNO_AIO"

To enable AIO on your system, use commands like the following:

  lsattr -El aio0
  chdev -l aio0 -P -a autoconfig=available -a minservers=10 -a maxservers=128
  mkdev -l aio0

Minservers tells AIX how many AIO kprocs to create immediately, and
maxservers limits the total number created.  On AIX 5.2 the meaning of
maxservers changed to "maximum number of servers per CPU", so a 16-way
SMP should set maxservers=8 to get a total of 128 kprocs.

Examples
--------

Suppose that /gpfs is a GPFS file system that was formatted with 256K
blocks and that has at least a gigabyte of free space.  Assuming that
the gpfsperf programs have been copied into /gpfs/test, and that
/gpfs/test is the current directory, the following ksh commands
illustrate how to run gpfsperf:

  # Number of nodes on which the test will run.  If this is increased,
  # the size of the test file should also be increased.
  export MP_PROCS=8

  # File containing a list of nodes on which gpfsperf will run.  There
  # are other ways to specify where the test runs besides using an
  # explicit host list.  See the Parallel Operating Environment
  # documentation for details.
  export MP_HOSTFILE=/etc/cluster.nodes

  # Name of test file to be manipulated by the tests that follow.
  export fn=/gpfs/test/testfile

  # Verify block size
  mmlsfs gpfs -B
  # Create test file.  All write tests in these examples specify
  # -fsync, so the reported data rate includes the overhead of flushing
  # all dirty buffers to disk.  The size of the test file should be
  # increased if more than 8 nodes are used or if GPFS pagepool sizes
  # have been increased from their defaults.  It may be necessary to
  # increase the maximum size file the user is allowed to create.  See
  # the ulimit command.
  ./gpfsperf-mpi create seq $fn -n 999m -r 256k -fsync

  # Read entire test file sequentially
  ./gpfsperf-mpi read seq $fn -r 256k

  # Rewrite test file sequentially using full block writes
  ./gpfsperf-mpi write seq $fn -r 256k -fsync

  # Rewrite test file sequentially using small writes.  This requires
  # GPFS to read blocks in order to update them, so it will perform
  # worse than the full block rewrite.
  ./gpfsperf-mpi write seq $fn -r 64k -fsync

  # Strided read using big records
  ./gpfsperf-mpi read strided $fn -r 256k

  # Strided read using medium sized records.  Performance is worse
  # because the average I/O size has gone down.  This behavior will not
  # be seen unless the stride is larger than a block (8*50000 > 256K).
  ./gpfsperf-mpi read strided $fn -r 50000

  # Strided read using a very large stride.  Reported performance is
  # misleading because each node just reads the same records over and
  # over from its GPFS buffer cache.
  ./gpfsperf-mpi read strided $fn -r 50000 -s 2400r

  # Strided write using a record size equal to the block size.  Decent
  # performance, since the record size matches GPFS lock granularity.
  ./gpfsperf-mpi write strided $fn -r 256k -fsync

  # Strided write using small records.  Since GPFS lock granularity is
  # larger than a record, performance is much worse.  The number of
  # bytes written is less than the entire file to keep the test time
  # reasonable.
  ./gpfsperf-mpi write strided $fn -r 10000 -n 100m -fsync

  # Strided write using small records and data shipping.  Data shipping
  # trades additional communication overhead for less lock contention,
  # improving performance.
  ./gpfsperf-mpi write strided $fn -r 10000 -n 100m -ds -fsync

  # Random read of small records
  ./gpfsperf-mpi read rand $fn -r 10000 -n 100m

  # Random read of small records using the GPFS multiple access range
  # hint.  Better performance (assuming more than MP_PROCS disks)
  # because each node has more than one disk read in progress at once
  # due to prefetching.
  ./gpfsperf-mpi read randhint $fn -r 10000 -n 100m
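The prefetching behind randhint is performed by the library functions
in irreg.c using the GPFS multiple access range hint.  The following is
a rough sketch of declaring a single future block access with that
hint; the structure layout is an assumption based on the GPFS Guide and
Reference, so check gpfs_fcntl.h for the exact definitions:

  #include <stdio.h>
  #include <gpfs_fcntl.h>

  /* Sketch: tell GPFS that one block of the file will be accessed
     soon, so it can begin prefetching before the read is issued. */
  static int hint_one_block(int fd, long long blockNum, int blockLen)
  {
      struct
      {
          gpfsFcntlHeader_t         hdr;
          gpfsMultipleAccessRange_t marh;
      } arg;

      arg.hdr.totalLength   = sizeof(arg);
      arg.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
      arg.hdr.fcntlReserved = 0;
      arg.marh.structLen    = sizeof(arg.marh);
      arg.marh.structType   = GPFS_MULTIPLE_ACCESS_RANGE;
      arg.marh.accRangeCnt  = 1;   /* one new range to prefetch */
      arg.marh.relRangeCnt  = 0;   /* nothing released yet */
      arg.marh.accRangeArray[0].blockNumber = blockNum;
      arg.marh.accRangeArray[0].start       = 0;
      arg.marh.accRangeArray[0].length      = blockLen;
      arg.marh.accRangeArray[0].isWrite     = 0;

      if (gpfs_fcntl(fd, &arg) != 0)
      {
          perror("gpfs_fcntl(GPFS_MULTIPLE_ACCESS_RANGE)");
          return -1;
      }
      /* On return, accRangeCnt tells how many ranges GPFS accepted;
         accepted ranges must later be released via relRangeArray. */
      return 0;
  }

Issuing hints ahead of the actual reads is what lets each node keep
more than one disk read in progress at once, as noted in the final
example above.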