Disclaimers
-----------

The files in this directory are provided by IBM on an "AS IS" basis
without warranty of any kind.  In addition, the results that you obtain
from using these files to measure the general performance of your
General Parallel File System (GPFS) file systems are "AS IS."  Your
reliance on any measurements is at your own risk and IBM does not assume
any liability whatsoever from your use of these files or your use of
resultant performance measurements.  The performance of GPFS file
systems is affected by many factors, including the access patterns of
application programs, the configuration and amount of memory on the SP
nodes, the number and characteristics of IBM Virtual Shared Disk (VSD)
servers, the number and speed of disks and disk adapters attached to the
VSD servers, GPFS, VSD and SP switch configuration parameters, other
traffic through the SP switch, etc.  As a result, GPFS file system
performance may vary and IBM does not make any particular performance
claims for GPFS file systems.


Introduction
------------

The files in this directory serve two purposes:

- Provide a simple benchmark program (gpfsperf) that can be used to
  measure the performance of GPFS for several common file access patterns.
- Give examples of how to use some of the gpfs_fcntl hints and
  directives that are new in GPFS version 1.3.

There are four versions of the program binary built from a single set of
source files.  The four versions correspond to all of the possible
combinations of single node/multiple node and with/without features that
are only supported on GPFS version 1.3.  Multinode versions of the gpfsperf
program contain -mpi as part of their names, while versions that do not use
features of GPFS requiring version 1.3 have a suffix of -v12 in their names.


Parallelism
-----------

There are two independent ways to achieve parallelism in the gpfsperf
program.  More than one instance of the program can be run on multiple
nodes using Message Passing Interface (MPI) to synchronize their
execution, or a single instance of the program can execute several
threads in parallel on a single node.  These two techniques can also be
combined.  When describing the behavior of the program, it should be
understood that 'threads' means any of the threads of the gpfsperf
program on any node where MPI runs it.

When gpfsperf runs on multiple nodes, the multiple instances of the
program communicate using MPI to synchronize their execution and to
combine their measurements into an aggregate throughput result.
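
As a minimal sketch of combining the two techniques (assuming the
MP_PROCS/MP_HOSTFILE environment shown in the Examples section below and the
-th option described under "Command line parameters"), the following ksh
commands would start four MPI instances of gpfsperf-mpi, each running two
threads, for eight threads in total:

export MP_PROCS=4
export MP_HOSTFILE=/etc/cluster.nodes
./gpfsperf-mpi read seq /gpfs/test/testfile -th 2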


Access patterns
---------------

The gpfsperf program operates on a file that is assumed to consist of a
collection of records, each of the same size.  It can generate three
different types of access patterns: sequential, strided, and random.  The
meaning of these access patterns in some cases depends on whether or not
parallelism is employed when the benchmark is run.

The simplest access pattern is random.  The gpfsperf program generates a
sequence of random record numbers and reads or writes the corresponding
records.  When run on multiple nodes, or when multiple threads are used
within instances of the program, each thread of the gpfsperf program uses a
different seed for its random number generator, so each thread will access
an independent sequence of records.  Two threads may access the same record
if the same random record number occurs in both sequences.

In the sequential access pattern, each gpfsperf thread reads from or writes
to a contiguous partition of a file sequentially.  For example, suppose that
a 10 billion byte file consists of one million records of 10000 bytes each.
If 10 threads read the file according to the sequential access
pattern, then the first thread will read sequentially through the partition
consisting of the first 100,000 records, the next thread will read from the
next 100,000 records, and so on.

In a strided access pattern, each thread skips some number of records
between each record that it reads or writes.  Reading the file from the
example above in a strided pattern, the first thread would read records 0,
10, 20, ..., 999,990.  The second thread would read records 1, 11, 21, ...,
999,991, and so on.  The gpfsperf program by default uses a stride, or
distance between records, equal to the total number of threads operating on
the file, in this case 10.
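
Using the command syntax described later under "Command line parameters", a
sketch of the 10-thread strided read from this example (assuming the
hypothetical 10 billion byte file exists as /gpfs/test/bigfile and that one
thread runs in each of 10 MPI processes) would be:

export MP_PROCS=10
./gpfsperf-mpi read strided /gpfs/test/bigfile -r 10000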


Amount of data to be transferred
--------------------------------

One of the input parameters to gpfsperf is the amount of data to be
transferred.  This is the total number of bytes to be read or written by all
threads of the program.  If there are T threads in total, and the total
number of bytes to be transferred is N, each thread will read or write about
N/T bytes, rounded to a multiple of the record size.  By default, gpfsperf
sets N to the size of the file.  For the sequential or strided access
pattern, this default means that every record in the file will be read or
written exactly once.

If N is greater than the size of the file, each thread will read or write
its partition of the file repeatedly until reaching its share (N/T) of the
bytes to be transferred.  For example, suppose 10 threads sequentially read
a 10 billion byte file of 10000 byte records when N is 15 billion.  The
first thread will read the first 100,000 records, then reread the first
50,000 records.  The second thread will read records 100,000 through
199,999, then reread records 100,000 through 149,999, and so on.

When a strided access pattern is used with a stride other than the default,
this behavior of gpfsperf can cause unexpected results.  For example, suppose
that 10 threads read the 10 billion byte file using a strided access
pattern, but instead of the default stride of 10 records gpfsperf is told to
use a stride of 10000 records.  The file partition read by the first thread
will be records 0, 10000, 20000, ..., 990,000.  This is only 100 distinct
records of 10000 bytes each, or a total of only 1 million bytes of data.
This data will likely remain in the GPFS buffer pool after it is read the
first time.  If N is more than 10 million bytes, gpfsperf will "read" the
same buffered data multiple times, and the reported data rate will appear
anomalously high.  To avoid this effect, performance tests using non-default
strides should reduce N in the same proportion as the stride was increased
from its default.
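
Continuing the example, a sketch of that adjustment (again assuming the
hypothetical 10 billion byte file as /gpfs/test/bigfile): the stride of 10000
records is 1000 times the default stride of 10 records, so -n is reduced from
the 10 billion byte default by roughly the same factor:

export MP_PROCS=10
./gpfsperf-mpi read strided /gpfs/test/bigfile -r 10000 -s 10000r -n 10m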


Computation of aggregate data rate and utilization
--------------------------------------------------

The results of a run of gpfsperf are reported as an aggregate data rate.
Data rate is defined as the total number of bytes read or written by all
threads divided by the total time of the test.  It is reported in units of
1000 bytes/second.  The test time is measured from before any node opens the
test file until after the last node closes the test file.  To ensure a
consistent environment for each test, before beginning the timed test period
gpfsperf issues the GPFS_CLEAR_FILE_CACHE hint on all nodes.  This flushes
the GPFS buffer cache and releases byte-range tokens.  Note that versions of
GPFS prior to v1.3 do not support GPFS file hints, so the state of the
buffer cache at the beginning of a test will be influenced by the state left
by the prior test.

Since all threads of a gpfsperf run do approximately the same amount of work
(read or write N/T bytes), in principle they should all run for the same
amount of time.  In practice, however, variations in disk and switch
response time lead to variations in execution times among the threads.  Lock
contention in GPFS further contributes to these variations in execution
times.  A large degree of non-uniformity is undesirable, since it means that
some nodes are idle while they wait for threads on other nodes to finish.
To measure the degree of uniformity of thread execution times, gpfsperf
computes a quantity it calls "utilization."  Utilization is the fraction of
the total number of thread-seconds in a test during which threads actively
perform reads or writes.  A value of 1.0 indicates perfect overlap, while
lower values denote that some threads were idle while others still ran.

The following timeline illustrates how gpfsperf computes utilization for a
test involving one thread on each of two nodes, reading a total of 100M:

time   event
----   -----
0.0    Node 0 captures timestamp for beginning of test
       ... both nodes open file
0.1    Node 0 about to do first read
0.1    Node 1 about to do first read
       ... many file reads
4.9    Node 0 finishes last read
4.9    Node 1 finishes last read
       ... both nodes close file
5.0    Node 0 captures timestamp for end of test

The utilization is ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0)) = 0.96.  The
reported aggregate data rate would be 100M / (5.0-0.0) = 20M/sec.

If node 0 ran significantly slower than node 1, it might have finished its
last read at time 9.9 instead of at time 4.9, and the end of test timestamp
might be 10.0 instead of 5.0.  In this case the utilization would drop to
((4.9-0.1) + (9.9-0.1)) / (2*(10.0-0.0)) = 0.73, and the data rate would be
100M / (10.0-0.0) = 10M/sec.
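
The same arithmetic is easy to reproduce outside of gpfsperf.  The following
is only an illustrative sketch (thread_times.txt is a hypothetical file
holding each thread's active read/write seconds, one value per line, and
total is the wall-clock test time):

# utilization = active thread-seconds / (number of threads * test time)
awk -v total=10.0 '{ active += $1; n++ } END { print active / (n * total) }' thread_times.txt

With active times of 4.8 and 9.8 seconds and a total time of 10.0 seconds,
this prints 0.73, matching the second case above.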


Command line parameters
-----------------------

There are four versions of gpfsperf in this directory:
gpfsperf-mpi       - runs on multiple nodes under MPI, requires GPFS v1.3 or later
gpfsperf           - runs only on a single node, requires GPFS v1.3 or later
gpfsperf-mpi-v12   - runs on multiple nodes under MPI, does not require GPFS v1.3
gpfsperf-v12       - runs only on a single node, does not require GPFS v1.3

The command line for any of the versions of gpfsperf is:

  gpfsperf[-mpi] operation pattern fn [options]

The order of parameters on the command line is not significant.

The operation must be one of "create", "read", "write", or "uncache".  All
threads in a multinode or multithreaded run of gpfsperf do the same
operation to the same file, but at different offsets.  Create means to
ensure that the file exists, then do a write test.  The uncache operation
does not read or write the file, but only removes any buffered data for the
file from the GPFS buffer cache.
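
For example, a sketch of clearing cached blocks of the test file between runs
(assuming, per the command syntax above, that a pattern argument must still
be supplied even though uncache neither reads nor writes the file):

./gpfsperf uncache seq /gpfs/test/testfile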

Pattern must be one of "rand", "randhint", "strided", or "seq".  The
meaning of each of these was explained earlier, except for "randhint".  The
"randhint" pattern is the same as "rand", except that the GPFS multiple
access range hint is used through the library functions in irreg.c to
prefetch blocks before they are accessed by gpfsperf.

The filename parameter fn should resolve to a file in a GPFS file system
that is mounted on all nodes where gpfsperf is to run.  The file must
already exist unless the "create" operation was specified.  Use of gpfsperf
on files not in GPFS may be meaningful in some situations, but this use has
not been tested.

Each optional parameter is described below, along with its default value.

-nolabels    - Produce a single line of output containing all parameters
               and the measured data rate.  Format of the output line is 'op
               pattern fn recordSize nBytes fileSize nProcs nThreads
               strideRecs inv ds dio fsync reltoken aio osync rate util'.
               This format may be useful for importing results into a
               spreadsheet or other program for further analysis.  The
               default is to produce multi-line labelled output.
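
               As an illustrative sketch of post-processing (not part of
               gpfsperf itself), the single-line output could be appended to
               a file and the space-separated fields converted to commas for
               a spreadsheet:

               ./gpfsperf read seq /gpfs/test/testfile -r 256k -nolabels | \
                   tr -s ' ' ',' >> results.csv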

-r recsize   - Record size.  Defaults to filesystem block size.  Must be
               specified for create operations.

-n nBytes    - Number of bytes to transfer.  Defaults to file size.  Must be
               specified for create operations.

-s stride    - Number of bytes between successive accesses by the
               same thread.  Only meaningful for strided access patterns.
               Must be a multiple of the record size.  See earlier cautions
               about combining large values of -s with large values of -n.
               Default is number of threads times number of processes.

-th nThreads - Number of threads per process.  Default is 1.  When there
               are multiple threads per process they read adjacent blocks of
               the file for the sequential and strided access patterns.  For
               example, suppose a file of 60 records is being read by 3
               nodes with 2 threads per node.  Under the sequential pattern,
               thread 0 on node 0 will read records 0-9, thread 1 on node 0
               will read records 10-19, thread 0 on node 1 will read records
               20-29, etc.  Under a strided pattern, thread 0 on node 0 will
               read records 0, 6, 12, ..., 54, thread 1 on node 0 will read
               records 1, 7, 13, ..., 55, etc.

-noinv       - Do not clear blocks of fn from the GPFS file cache before
               starting the test.  The default is to clear the cache.  If
               this option is given, the results of the test can depend
               strongly on the lock and buffer state left by the last test.
               For example, a multinode sequential read with -noinv will run
               more slowly after a strided write test than after a
               sequential write test.

-ds          - Use GPFS data shipping.  Data shipping avoids lock conflicts
               by partitioning the file among the nodes running gpfsperf,
               turning off byte range locking, and sending messages to the
               appropriate agent node to handle each read or write request.
               However, since (n-1)/n of the accesses are remote, the
               gpfsperf threads cannot take advantage of local block
               caching, although they may still benefit from prefetching and
               writebehind.  Also, since byte range locking is not in
               effect, use of data shipping suspends the atomicity
               guarantees of X/Open file semantics.  See the GPFS Guide and
               Reference manual for more details.  Data shipping should show
               the largest performance benefit for strided writes that have
               small record sizes.  The default is not to use data shipping.

-aio depth   - Use asynchronous I/O (AIO), prefetching to the given depth
               (default 0, max 1000).  This can be used with any of the
               seq/rand/strided test patterns.
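
               A sketch of a random read driven by AIO prefetching (the
               record size, byte count, and depth shown here are only
               illustrative):

               ./gpfsperf read rand /gpfs/test/testfile -r 10000 -n 100m -aio 16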

-dio         - Use the direct I/O flag when opening the file.  This allows
               sector aligned/sized buffers to be read/written directly
               from the application buffer to the disks where the blocks
               are allocated.

-reltoken    - Release the entire file byte-range token after the file
               is newly created.  In a multi-node environment (MPI), only
               the first process will create the file and all the other
               processes will wait and open the file after the creation
               has occurred.  This flag tells the first process to release
               the byte-range token it automatically gets during the create.
               This may increase performance because other nodes that work
               on different ranges of the file will not need to revoke the
               range held by the node running the first process.

-fsync       - Ensure that no dirty data remain buffered at the conclusion
               of a write or create test.  The time to perform the necessary
               fsync operation is included in the test time, so this option
               reduces the reported aggregate data rate.  The default is not
               to fsync the file.

-osync       - Turn on the O_SYNC flag when opening the file.
               This causes every write operation to force the data to disk
               on each call.  The default is not to use O_SYNC.

-v           - Verbose tracing.  In a multinode test using gpfsperf, output
               from each instance of the program will be intermingled.
               Telling MPI to label the output from each node (by setting
               the MP_LABELIO environment variable to yes) makes the verbose
               output easier to follow.

-V           - Very verbose tracing.  This option will display the offset
               of every read or write operation on every node.  As with -v,
               labelling the output by node is suggested.

Numbers in options can be given using K, M, or G suffixes, in upper or lower
case, to denote 2**10, 2**20, or 2**30, respectively, or can have an R or r
suffix to denote a multiple of the record size.  For example, to specify a
record size of 4096 bytes and a size to read or write of 409600, one could
write "-r 4k -n 100r".

AIX only:
If the Asynchronous I/O (AIO) kernel extension has not been loaded yet,
running the gpfsperf program will fail and display output like:

  exec(): 0509-036 Cannot load program gpfsperf because of the following errors:
        0509-130 Symbol resolution failed for /usr/lib/threads/libc.a(aio.o) because:
        0509-136   Symbol kaio_rdwr (number 0) is not exported from
                   dependent module /unix.
        0509-136   Symbol listio (number 1) is not exported from
                   dependent module /unix.
        0509-136   Symbol acancel (number 2) is not exported from
                   dependent module /unix.
        0509-136   Symbol iosuspend (number 3) is not exported from
                   dependent module /unix.
        0509-136   Symbol aio_nwait (number 4) is not exported from
                   dependent module /unix.
        0509-192 Examine .loader section symbols with the
                 'dump -Tv' command.

If you do not wish to use AIO, you can recompile the gpfsperf program to not
use the AIO calls:
  rm gpfsperf.o gpfsperf-mpi.o
  make OTHERINCL="-DNO_AIO"

Alternatively, enable AIO on your system by running commands like the following:
  lsattr -El aio0
  chdev -l aio0 -P -a autoconfig=available -a minservers=10 -a maxservers=128
  mkdev -l aio0
Minservers just tells AIX how many AIO kprocs to create immediately, and
maxservers limits the total number created.  On AIX 5.2 the meaning of
maxservers has changed to mean "maximum number of servers per CPU", so a
16-way SMP should set maxservers=8 to get a total of 128 kprocs.


Examples
--------

Suppose that /gpfs is a GPFS file system that was formatted with 256K
blocks and that it has at least a gigabyte of free space.  Assuming that
the gpfsperf programs have been copied into /gpfs/test, and that
/gpfs/test is the current directory, the following ksh commands
illustrate how to run gpfsperf:

# Number of nodes on which the test will run.  If this is increased, the
# size of the test file should also be increased.
export MP_PROCS=8

# File containing a list of nodes on which gpfsperf will run.  There are
# other ways to specify where the test runs besides using an explicit
# host list.  See the Parallel Operating Environment documentation for
# details.
export MP_HOSTFILE=/etc/cluster.nodes

# Name of test file to be manipulated by the tests that follow.
export fn=/gpfs/test/testfile

# Verify block size
mmlsfs gpfs -B

# Create test file.  All write tests in these examples specify -fsync, so
# the reported data rate includes the overhead of flushing all dirty buffers
# to disk.  The size of the test file should be increased if more than 8
# nodes are used or if GPFS pagepool sizes have been increased from their
# defaults.  It may be necessary to increase the maximum size file the user
# is allowed to create.  See the ulimit command.
./gpfsperf-mpi create seq $fn -n 999m -r 256k -fsync

# Read entire test file sequentially
./gpfsperf-mpi read seq $fn -r 256k

# Rewrite test file sequentially using full block writes
./gpfsperf-mpi write seq $fn -r 256k -fsync

# Rewrite test file sequentially using small writes.  This requires GPFS to
# read blocks in order to update them, so will have worse performance than
# the full block rewrite.
./gpfsperf-mpi write seq $fn -r 64k -fsync

# Strided read using big records
./gpfsperf-mpi read strided $fn -r 256k

# Strided read using medium sized records.  Performance is worse because
# average I/O size has gone down.  This behavior will not be seen unless
# the stride is larger than a block (8*50000 > 256K).
./gpfsperf-mpi read strided $fn -r 50000

# Strided read using a very large stride.  Reported performance is
# misleading because each node just reads the same records over and over
# from its GPFS buffer cache.
./gpfsperf-mpi read strided $fn -r 50000 -s 2400r

# Strided write using a record size equal to the block size.  Decent
# performance, since record size matches GPFS lock granularity.
./gpfsperf-mpi write strided $fn -r 256k -fsync

# Strided write using small records.  Since GPFS lock granularity is
# larger than a record, performance is much worse.  Number of bytes
# written is less than the entire file to keep test time reasonable.
./gpfsperf-mpi write strided $fn -r 10000 -n 100m -fsync

# Strided write using small records and data shipping.  Data shipping
# trades additional communication overhead for less lock contention,
# improving performance.
./gpfsperf-mpi write strided $fn -r 10000 -n 100m -ds -fsync

# Random read of small records
./gpfsperf-mpi read rand $fn -r 10000 -n 100m

# Random read of small records using the GPFS multiple access range hint.
# Better performance (assuming more than MP_PROCS disks) because each node
# has more than one disk read in progress at once due to prefetching.
./gpfsperf-mpi read randhint $fn -r 10000 -n 100m
|---|