Disclaimers
-----------

The files in this directory are provided by IBM on an "AS IS" basis
without warranty of any kind.  In addition, the results that you obtain
from using these files to measure the general performance of your
General Parallel File System (GPFS) file systems are "AS IS."  Your
reliance on any measurements is at your own risk and IBM does not assume
any liability whatsoever from your use of these files or your use of
resultant performance measurements.  The performance of GPFS file
systems is affected by many factors, including the access patterns of
application programs, the configuration and amount of memory on the SP
nodes, the number and characteristics of IBM Virtual Shared Disk (VSD)
servers, the number and speed of disks and disk adapters attached to the
VSD servers, GPFS, VSD and SP switch configuration parameters, other
traffic through the SP switch, etc.  As a result, GPFS file system
performance may vary and IBM does not make any particular performance
claims for GPFS file systems.


Introduction
------------

The files in this directory serve two purposes:

- Provide a simple benchmark program (gpfsperf) that can be used to
  measure the performance of GPFS for several common file access patterns.
- Give examples of how to use some of the gpfs_fcntl hints and
  directives that are new in GPFS version 1.3.

There are four versions of the program binary built from a single set of
source files.  The four versions correspond to all of the possible
combinations of single node/multiple node and with/without features that
are only supported on GPFS version 1.3.  Multinode versions of the gpfsperf
program contain -mpi as part of their names, while versions that do not use
features of GPFS requiring version 1.3 have a suffix of -v12 in their names.


Parallelism
-----------

There are two independent ways to achieve parallelism in the gpfsperf
program.  More than one instance of the program can be run on multiple
nodes, using the Message Passing Interface (MPI) to synchronize their
execution, or a single instance of the program can execute several
threads in parallel on a single node.  These two techniques can also be
combined.  In the descriptions that follow, 'threads' means all of the
threads of the gpfsperf program on every node where MPI runs it.

When gpfsperf runs on multiple nodes, the multiple instances of the
program communicate using MPI to synchronize their execution and to
combine their measurements into an aggregate throughput result.
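
As a rough illustration of how such an aggregate result can be formed, the
following C sketch combines per-node measurements with MPI_Reduce.  It is
not taken from the gpfsperf source; the variable names and the placement of
the barriers are illustrative assumptions.

    /* Sketch: combine per-node byte counts and times into an aggregate
     * data rate and utilization.  localBytes and localActiveTime are
     * hypothetical names, not identifiers from gpfsperf.c. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nProcs;
        double localBytes = 0.0;      /* bytes read or written by this node */
        double localActiveTime = 0.0; /* time from this node's first to last
                                         read or write                       */
        double t0, t1, totalBytes, totalActive;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nProcs);

        MPI_Barrier(MPI_COMM_WORLD);          /* start all nodes together */
        t0 = MPI_Wtime();

        /* ... open the test file, do the reads or writes, and accumulate
         * localBytes and localActiveTime here ... */

        MPI_Barrier(MPI_COMM_WORLD);          /* wait for the slowest node */
        t1 = MPI_Wtime();

        MPI_Reduce(&localBytes, &totalBytes, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);
        MPI_Reduce(&localActiveTime, &totalActive, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0) {
            double elapsed = t1 - t0;
            printf("rate %.1f KB/sec  utilization %.2f\n",
                   totalBytes / 1000.0 / elapsed,
                   totalActive / (nProcs * elapsed));
        }
        MPI_Finalize();
        return 0;
    }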


Access patterns
---------------

The gpfsperf program operates on a file that is assumed to consist of a
collection of records, each of the same size.  It can generate three
different types of access patterns: sequential, strided, and random.  The
meaning of these access patterns in some cases depends on whether or not
parallelism is employed when the benchmark is run.

The simplest access pattern is random.  The gpfsperf program generates a
sequence of random record numbers and reads or writes the corresponding
records.  When run on multiple nodes, or when multiple threads are used
within instances of the program, each thread of the gpfsperf program uses a
different seed for its random number generator, so each thread accesses an
independent sequence of records.  Two threads may still access the same
record if the same random record number occurs in both sequences.

In the sequential access pattern, each gpfsperf thread reads from or writes
to a contiguous partition of a file sequentially.  For example, suppose that
a 10 billion byte file consists of one million records of 10000 bytes each.
If 10 threads read the file according to the sequential access pattern, the
first thread will read sequentially through the partition consisting of the
first 100,000 records, the next thread will read from the next 100,000
records, and so on.

In a strided access pattern, each thread skips some number of records
between each record that it reads or writes.  Reading the file from the
example above in a strided pattern, the first thread would read records 0,
10, 20, ..., 999,990.  The second thread would read records 1, 11, 21, ...,
999,991, and so on.  The gpfsperf program by default uses a stride, or
distance between records, equal to the total number of threads operating on
the file, in this case 10.
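
To make the partitioning concrete, the following small C sketch computes
which record a given thread touches on its k-th access under the two
patterns, using the numbers from the example above.  The helper names are
illustrative and are not taken from the gpfsperf source.

    /* Sketch: record selection for the sequential and strided patterns.
     * nRecords = 1,000,000 and nThreads = 10 reproduce the example above.
     * These helper names are hypothetical, not gpfsperf identifiers. */
    #include <stdio.h>

    /* k-th record touched by thread t when each thread owns one contiguous
     * partition of nRecords/nThreads records */
    static long seqRecord(long t, long k, long nRecords, long nThreads)
    {
        long perThread = nRecords / nThreads;   /* 100,000 in the example */
        return t * perThread + k;
    }

    /* k-th record touched by thread t when threads interleave with the
     * default stride of nThreads records */
    static long stridedRecord(long t, long k, long nThreads)
    {
        return t + k * nThreads;                /* 0, 10, 20, ... for thread 0 */
    }

    int main(void)
    {
        printf("seq: thread 0 starts at record %ld, thread 1 at record %ld\n",
               seqRecord(0, 0, 1000000, 10), seqRecord(1, 0, 1000000, 10));
        printf("strided: thread 0 reads record %ld then %ld\n",
               stridedRecord(0, 0, 10), stridedRecord(0, 1, 10));
        return 0;
    }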


Amount of data to be transferred
--------------------------------

One of the input parameters to gpfsperf is the amount of data to be
transferred.  This is the total number of bytes to be read or written by all
threads of the program.  If there are T threads in total, and the total
number of bytes to be transferred is N, each thread will read or write about
N/T bytes, rounded to a multiple of the record size.  By default, gpfsperf
sets N to the size of the file.  For the sequential or strided access
pattern, this default means that every record in the file will be read or
written exactly once.

If N is greater than the size of the file, each thread will read or write
its partition of the file repeatedly until reaching its share (N/T) of the
bytes to be transferred.  For example, suppose 10 threads sequentially read
a 10 billion byte file of 10000 byte records when N is 15 billion.  The
first thread will read the first 100,000 records, then reread the first
50,000 records.  The second thread will read records 100,000 through
199,999, then reread records 100,000 through 149,999, and so on.

When using strided access patterns with other than the default stride, this
behavior of gpfsperf can cause unexpected results.  For example, suppose
that 10 threads read the 10 billion byte file using a strided access
pattern, but instead of the default stride of 10 records gpfsperf is told to
use a stride of 10000 records.  The file partition read by the first thread
will be records 0, 10000, 20000, ..., 990,000.  This is only 100 distinct
records of 10000 bytes each, or a total of only 1 million bytes of data.
This data will likely remain in the GPFS buffer pool after it is read the
first time.  If N is more than 10 million bytes, gpfsperf will "read" the
same buffered data multiple times, and the reported data rate will appear
anomalously high.  To avoid this effect, performance tests using non-default
strides should reduce N in the same proportion as the stride was increased
from its default.
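
The proportional reduction can be worked out mechanically.  The following C
fragment is only a sketch of that arithmetic for the example above; the
variable names are illustrative, not gpfsperf identifiers.

    #include <stdio.h>

    int main(void)
    {
        long recSize       = 10000;      /* bytes per record                 */
        long nRecords      = 1000000;    /* records in the file              */
        long nThreads      = 10;
        long defaultStride = nThreads;   /* default stride: 10 records       */
        long usedStride    = 10000;      /* stride used in the example       */

        /* Distinct data actually touched by one thread */
        long distinctRecs  = nRecords / usedStride;       /* 100 records     */
        long distinctBytes = distinctRecs * recSize;      /* 1,000,000 bytes */

        /* Shrink N by the same factor the stride grew from its default */
        long long defaultN = (long long)nRecords * recSize;         /* 10 billion */
        long long scaledN  = defaultN * defaultStride / usedStride; /* 10 million */

        printf("distinct bytes per thread %ld, suggested N %lld\n",
               distinctBytes, scaledN);
        return 0;
    }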


Computation of aggregate data rate and utilization
--------------------------------------------------

The results of a run of gpfsperf are reported as an aggregate data rate.
Data rate is defined as the total number of bytes read or written by all
threads divided by the total time of the test.  It is reported in units of
1000 bytes/second.  The test time is measured from before any node opens the
test file until after the last node closes the test file.  To ensure a
consistent environment for each test, before beginning the timed test period
gpfsperf issues the GPFS_CLEAR_FILE_CACHE hint on all nodes.  This flushes
the GPFS buffer cache and releases byte-range tokens.  Note that versions of
GPFS prior to v1.3 do not support GPFS file hints, so the state of the
buffer cache at the beginning of a test will be influenced by the state left
by the prior test.
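
For readers who want to issue the same hint from their own code, the sketch
below shows the general shape of a gpfs_fcntl call that clears cached data
for an open file, assuming the GPFS v1.3 programming interface (gpfs_fcntl.h)
is installed.  It is only a sketch; see gpfsperf.c and the GPFS Guide and
Reference manual for the authoritative usage.

    /* Sketch: clearing cached blocks of an open file with a GPFS hint. */
    #include <stdio.h>
    #include <gpfs_fcntl.h>

    static int clearFileCache(int fd)
    {
        struct {
            gpfsFcntlHeader_t    hdr;
            gpfsClearFileCache_t inv;
        } arg;

        arg.hdr.totalLength   = sizeof(arg);
        arg.hdr.fcntlVersion  = GPFS_FCNTL_CURRENT_VERSION;
        arg.hdr.fcntlReserved = 0;
        arg.inv.structLen     = sizeof(arg.inv);
        arg.inv.structType    = GPFS_CLEAR_FILE_CACHE;

        if (gpfs_fcntl(fd, &arg) != 0) {
            perror("gpfs_fcntl(GPFS_CLEAR_FILE_CACHE)");
            return -1;
        }
        return 0;
    }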

Since all threads of a gpfsperf run do approximately the same amount of work
(read or write N/T bytes), in principle they should all run for the same
amount of time.  In practice, however, variations in disk and switch
response time lead to variations in execution times among the threads.  Lock
contention in GPFS further contributes to these variations in execution
times.  A large degree of non-uniformity is undesirable, since it means that
some nodes are idle while they wait for threads on other nodes to finish.
To measure the degree of uniformity of thread execution times, gpfsperf
computes a quantity it calls "utilization."  Utilization is the fraction of
the total number of thread-seconds in a test during which threads actively
perform reads or writes.  A value of 1.0 indicates perfect overlap, while
lower values denote that some threads were idle while others still ran.

The following timeline illustrates how gpfsperf computes utilization for a
test involving one thread on each of two nodes, reading a total of 100M:

time   event
----   -----
0.0   Node 0 captures timestamp for beginning of test
... both nodes open file
0.1   Node 0 about to do first read
0.1   Node 1 about to do first read
... many file reads
4.9   Node 0 finishes last read
4.9   Node 1 finishes last read
... both nodes close file
5.0   Node 0 captures timestamp for end of test

The utilization is ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0)) = 0.96.  The
reported aggregate data rate would be 100M / (5.0-0.0) = 20M/sec.

If node 0 ran significantly slower than node 1, it might have finished its
last read at time 9.9 instead of at time 4.9, and the end of test timestamp
might be 10.0 instead of 5.0.  In this case the utilization would drop to
((4.9-0.1) + (9.9-0.1)) / (2*(10.0-0.0)) = 0.73, and the data rate would be
100M / (10.0-0.0) = 10M/sec.
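
A compact way to express the same computation is sketched below; it
reproduces the two examples above.  The function and variable names are
illustrative only, not taken from the gpfsperf source.

    /* Sketch: utilization and data rate as defined above.  busySeconds[i]
     * is the time thread i spent between its first and last read or write. */
    #include <stdio.h>

    static double utilization(const double busySeconds[], int nThreads,
                              double wallClockSeconds)
    {
        double busy = 0.0;
        int i;
        for (i = 0; i < nThreads; i++)
            busy += busySeconds[i];
        return busy / (nThreads * wallClockSeconds);
    }

    int main(void)
    {
        double even[2]   = { 4.9 - 0.1, 4.9 - 0.1 };
        double skewed[2] = { 4.9 - 0.1, 9.9 - 0.1 };

        printf("balanced run: %.2f\n", utilization(even, 2, 5.0));    /* 0.96 */
        printf("skewed run:   %.2f\n", utilization(skewed, 2, 10.0)); /* 0.73 */
        return 0;
    }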


Command line parameters
-----------------------

There are four versions of gpfsperf in this directory:
gpfsperf-mpi       - runs on multiple nodes under MPI, requires GPFS v1.3 or later
gpfsperf           - runs only on a single node, requires GPFS v1.3 or later
gpfsperf-mpi-v12   - runs on multiple nodes under MPI, does not require GPFS v1.3
gpfsperf-v12       - runs only on a single node, does not require GPFS v1.3

The command line for any of the versions of gpfsperf is:

gpfsperf[-mpi][-v12] operation pattern fn [options]

The order of parameters on the command line is not significant.

The operation must be either "create", "read", "write", or "uncache".  All
threads in a multinode or multithreaded run of gpfsperf do the same
operation to the same file, but at different offsets.  Create means to
ensure that the file exists, then do a write test.  The uncache operation
does not read or write the file, but only removes any buffered data for the
file from the GPFS buffer cache.

Pattern must be one of "rand", "randhint", "strided", or "seq".  The
meaning of each of these was explained earlier, except for "randhint".  The
"randhint" pattern is the same as "rand", except that the GPFS multiple
access range hint is used, through the library functions in irreg.c, to
prefetch blocks before they are accessed by gpfsperf.

The filename parameter fn should resolve to a file in a GPFS file system
that is mounted on all nodes where gpfsperf is to run.  The file must
already exist unless the "create" operation was specified.  Use of gpfsperf
on files not in GPFS may be meaningful in some situations, but this use has
not been tested.

Each optional parameter is described below, along with its default value.

-nolabels    - Produce a single line of output containing all parameters
               and the measured data rate.  Format of the output line is 'op
               pattern fn recordSize nBytes fileSize nProcs nThreads
               strideRecs inv ds dio fsync reltoken aio osync rate util'.
               This format may be useful for importing results into a
               spreadsheet or other program for further analysis.  The
               default is to produce multi-line labelled output.

-r recsize   - Record size.  Defaults to the file system block size.  Must
               be specified for create operations.

-n nBytes    - Number of bytes to transfer.  Defaults to the file size.
               Must be specified for create operations.

-s stride    - Number of bytes between successive accesses by the same
               thread.  Only meaningful for strided access patterns.  Must
               be a multiple of the record size.  See the earlier cautions
               about combining large values of -s with large values of -n.
               The default stride is the total number of threads (threads
               per process times number of processes) records.

-th nThreads - Number of threads per process.  Default is 1.  When there
               are multiple threads per process they read adjacent blocks of
               the file for the sequential and strided access patterns.  For
               example, suppose a file of 60 records is being read by 3
               nodes with 2 threads per node.  Under the sequential pattern,
               thread 0 on node 0 will read records 0-9, thread 1 on node 0
               will read records 10-19, thread 0 on node 1 will read records
               20-29, etc.  Under a strided pattern, thread 0 on node 0 will
               read records 0, 6, 12, ..., 54, thread 1 on node 0 will read
               records 1, 7, 13, ..., 55, etc.

-noinv       - Do not clear blocks of fn from the GPFS file cache before
               starting the test.  The default is to clear the cache.  If
               this option is given, the results of the test can depend
               strongly on the lock and buffer state left by the last test.
               For example, a multinode sequential read with -noinv will run
               more slowly after a strided write test than after a
               sequential write test.

-ds          - Use GPFS data shipping.  Data shipping avoids lock conflicts
               by partitioning the file among the nodes running gpfsperf,
               turning off byte-range locking, and sending messages to the
               appropriate agent node to handle each read or write request.
               However, since (n-1)/n of the accesses are remote, the
               gpfsperf threads cannot take advantage of local block
               caching, although they may still benefit from prefetching and
               write-behind.  Also, since byte-range locking is not in
               effect, use of data shipping suspends the atomicity
               guarantees of X/Open file semantics.  See the GPFS Guide and
               Reference manual for more details.  Data shipping should show
               the largest performance benefit for strided writes that have
               small record sizes.  The default is not to use data shipping.

-aio depth   - Use asynchronous I/O (AIO), prefetching to the given depth
               (default 0, maximum 1000).  This can be used with any of the
               seq/rand/strided test patterns.  A sketch of one possible
               prefetch loop appears after this list.

-dio         - Use the direct I/O flag when opening the file.  This allows
               sector-aligned, sector-sized buffers to be transferred
               directly between the application buffer and the disks where
               the blocks are allocated.

-reltoken    - Release the entire-file byte-range token after the file is
               newly created.  In a multi-node (MPI) run, only the first
               process creates the file; all the other processes wait and
               open the file after the creation has occurred.  This flag
               tells the first process to release the byte-range token it
               automatically acquires during the create.  This may increase
               performance because other nodes that work on different ranges
               of the file will not need to revoke the range held by the
               node running the first process.

-fsync       - Ensure that no dirty data remains buffered at the conclusion
               of a write or create test.  The time to perform the necessary
               fsync operation is included in the test time, so this option
               reduces the reported aggregate data rate.  The default is not
               to fsync the file.

-osync       - Turn on the O_SYNC flag when opening the file.  This causes
               every write operation to force the data to disk before it
               returns.  The default is not to use O_SYNC.

-v           - Verbose tracing.  In a multinode test, output from each
               instance of the program will be intermingled.  Telling MPI to
               label the output from each node (set the MP_LABELIO
               environment variable to yes) makes the verbose output easier
               to follow.

-V           - Very verbose tracing.  This option displays the offset of
               every read or write operation on every node.  As with -v,
               labelling the output by node is suggested.
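
As referenced in the -aio description above, the following is a minimal
sketch of a depth-limited prefetch loop using the POSIX AIO interface.  It
is illustrative only and does not reproduce the logic in gpfsperf.c; the
depth, record size, buffer handling, and names are assumptions.

    /* Sketch: sequential reads with up to DEPTH asynchronous requests
     * outstanding, using POSIX AIO. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define DEPTH   4
    #define RECSIZE (256 * 1024)

    int main(int argc, char **argv)
    {
        struct aiocb cb[DEPTH];
        char *buf[DEPTH];
        off_t nextOffset = 0;
        int fd, i;

        if (argc < 2) { fprintf(stderr, "usage: aioread file\n"); return 1; }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        memset(cb, 0, sizeof(cb));
        /* Prime the window: issue DEPTH reads before waiting for any */
        for (i = 0; i < DEPTH; i++) {
            buf[i] = malloc(RECSIZE);
            cb[i].aio_fildes = fd;
            cb[i].aio_buf    = buf[i];
            cb[i].aio_nbytes = RECSIZE;
            cb[i].aio_offset = nextOffset;
            cb[i].aio_sigevent.sigev_notify = SIGEV_NONE;
            nextOffset += RECSIZE;
            if (aio_read(&cb[i]) != 0) { perror("aio_read"); return 1; }
        }

        for (;;) {
            /* Wait for each request in turn, consume it, reissue at the
             * next offset so DEPTH requests stay outstanding */
            for (i = 0; i < DEPTH; i++) {
                const struct aiocb *wait[1] = { &cb[i] };
                ssize_t n;

                while (aio_error(&cb[i]) == EINPROGRESS)
                    aio_suspend(wait, 1, NULL);
                n = aio_return(&cb[i]);
                if (n <= 0) goto done;            /* EOF or error */

                /* ... the data in buf[i] would be used here ... */

                cb[i].aio_offset = nextOffset;    /* reissue for next record */
                nextOffset += RECSIZE;
                if (aio_read(&cb[i]) != 0) { perror("aio_read"); return 1; }
            }
        }
    done:
        close(fd);
        return 0;
    }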

Numbers in options can be given using K, M, or G suffixes, in upper or lower
case, to denote 2**10, 2**20, or 2**30, respectively, or can have an R or r
suffix to denote a multiple of the record size.  For example, to specify a
record size of 4096 bytes and a size to read or write of 409600, one could
write "-r 4k -n 100r".
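
A hypothetical helper that applies these suffix rules might look like the
following.  It is not the parser used by gpfsperf, just an illustration of
the rules stated above.

    /* Sketch: apply the K/M/G/R suffix rules described above. */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static long long parseSize(const char *s, long long recordSize)
    {
        char *end;
        long long v = strtoll(s, &end, 10);

        switch (tolower((unsigned char)*end)) {
        case 'k': v <<= 10; break;            /* 2**10 */
        case 'm': v <<= 20; break;            /* 2**20 */
        case 'g': v <<= 30; break;            /* 2**30 */
        case 'r': v *= recordSize; break;     /* multiple of the record size */
        default:  break;                      /* plain number of bytes */
        }
        return v;
    }

    int main(void)
    {
        long long rec = parseSize("4k", 0);                         /* 4096 */
        printf("-r 4k   -> %lld bytes\n", rec);
        printf("-n 100r -> %lld bytes\n", parseSize("100r", rec));  /* 409600 */
        return 0;
    }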

AIX only:
If the asynchronous I/O (AIO) kernel extension has not yet been loaded,
running the gpfsperf program will fail and display output like:
exec(): 0509-036 Cannot load program gpfsperf because of the following errors:
0509-130 Symbol resolution failed for /usr/lib/threads/libc.a(aio.o) because:
0509-136   Symbol kaio_rdwr (number 0) is not exported from
           dependent module /unix.
0509-136   Symbol listio (number 1) is not exported from
           dependent module /unix.
0509-136   Symbol acancel (number 2) is not exported from
           dependent module /unix.
0509-136   Symbol iosuspend (number 3) is not exported from
           dependent module /unix.
0509-136   Symbol aio_nwait (number 4) is not exported from
           dependent module /unix.
0509-192 Examine .loader section symbols with the
         'dump -Tv' command.

If you do not wish to use AIO, you can recompile the gpfsperf program so
that it does not use the AIO calls:
  rm gpfsperf.o gpfsperf-mpi.o
  make OTHERINCL="-DNO_AIO"

To enable AIO on your system, run commands like the following:
  lsattr -El aio0
  chdev -l aio0 -P -a autoconfig=available -a minservers=10 -a maxservers=128
  mkdev -l aio0
Minservers just tells AIX how many AIO kprocs to create immediately, and
maxservers limits the total number created.  On AIX 5.2 the meaning of
maxservers has changed to mean "maximum number of servers per CPU", so a
16-way SMP should set maxservers=8 to get a total of 128 kprocs.


Examples
--------

Suppose that /gpfs is a GPFS file system that was formatted with 256K
blocks and that it has at least a gigabyte of free space.  Assuming that
the gpfsperf programs have been copied into /gpfs/test, and that
/gpfs/test is the current directory, the following ksh commands
illustrate how to run gpfsperf:

# Number of nodes on which the test will run.  If this is increased, the
# size of the test file should also be increased.
export MP_PROCS=8

# File containing a list of nodes on which gpfsperf will run.  There are
# other ways to specify where the test runs besides using an explicit
# host list.  See the Parallel Operating Environment documentation for
# details.
export MP_HOSTFILE=/etc/cluster.nodes

# Name of test file to be manipulated by the tests that follow.
export fn=/gpfs/test/testfile

# Verify block size
mmlsfs gpfs -B

# Create test file.  All write tests in these examples specify -fsync, so
# the reported data rate includes the overhead of flushing all dirty buffers
# to disk.  The size of the test file should be increased if more than 8
# nodes are used or if GPFS pagepool sizes have been increased from their
# defaults.  It may be necessary to increase the maximum size file the user
# is allowed to create.  See the ulimit command.
./gpfsperf-mpi create seq $fn -n 999m -r 256k -fsync

# Read entire test file sequentially
./gpfsperf-mpi read seq $fn -r 256k

# Rewrite test file sequentially using full block writes
./gpfsperf-mpi write seq $fn -r 256k -fsync

# Rewrite test file sequentially using small writes.  This requires GPFS to
# read blocks in order to update them, so will have worse performance than
# the full block rewrite.
./gpfsperf-mpi write seq $fn -r 64k -fsync

# Strided read using big records
./gpfsperf-mpi read strided $fn -r 256k

# Strided read using medium sized records.  Performance is worse because
# average I/O size has gone down.  This behavior will not be seen unless
# the stride is larger than a block (8*50000 > 256K).
./gpfsperf-mpi read strided $fn -r 50000

# Strided read using a very large stride.  Reported performance is
# misleading because each node just reads the same records over and over
# from its GPFS buffer cache.
./gpfsperf-mpi read strided $fn -r 50000 -s 2400r

# Strided write using a record size equal to the block size.  Decent
# performance, since record size matches GPFS lock granularity.
./gpfsperf-mpi write strided $fn -r 256k -fsync

# Strided write using small records.  Since GPFS lock granularity is
# larger than a record, performance is much worse.  Number of bytes
# written is less than the entire file to keep test time reasonable.
./gpfsperf-mpi write strided $fn -r 10000 -n 100m -fsync

# Strided write using small records and data shipping.  Data shipping
# trades additional communication overhead for less lock contention,
# improving performance.
./gpfsperf-mpi write strided $fn -r 10000 -n 100m -ds -fsync

# Random read of small records
./gpfsperf-mpi read rand $fn -r 10000 -n 100m

# Random read of small records using the GPFS multiple access range hint.
# Better performance (assuming more than MP_PROCS disks) because each node
# has more than one disk read in progress at once due to prefetching.
./gpfsperf-mpi read randhint $fn -r 10000 -n 100m
|---|