Disclaimers
-----------

The files in this directory are provided by IBM on an "AS IS" basis
without warranty of any kind.  In addition, the results that you obtain
from using these files to measure the general performance of your
General Parallel File System (GPFS) file systems are "AS IS."  Your
reliance on any measurements is at your own risk and IBM does not assume
any liability, whatsoever from your use of these files or your use of
resultant performance measurements.  The performance of GPFS file
systems is affected by many factors, including the access patterns of
application programs, the configuration and amount of memory on the SP
nodes, the number and characteristics of IBM Virtual Shared Disk (VSD)
servers, the number and speed of disks and disk adapters attached to the
VSD servers, GPFS, VSD and SP switch configuration parameters, other
traffic through the SP switch, etc.  As a result, GPFS file system
performance may vary and IBM does not make any particular performance
claims for GPFS file systems.


Introduction
------------

The files in this directory serve two purposes:

 - Provide a simple benchmark program (gpfsperf) that can be used to
   measure the performance of GPFS for several common file access patterns.
 - Give examples of how to use some of the gpfs_fcntl hints and
   directives that are new in GPFS version 1.3.

There are four versions of the program binary built from a single set of
source files.  The four versions correspond to all of the possible
combinations of single node/multiple node and with/without features that
are supported only on GPFS version 1.3.  Multinode versions of the gpfsperf
program contain -mpi as part of their names, while versions that do not use
features of GPFS requiring version 1.3 have a suffix of -v12 in their names.


Parallelism
-----------

There are two independent ways to achieve parallelism in the gpfsperf
program.  More than one instance of the program can be run on multiple
nodes using the Message Passing Interface (MPI) to synchronize their
execution, or a single instance of the program can execute several
threads in parallel on a single node.  These two techniques can also be
combined.  When describing the behavior of the program, it should be
understood that 'threads' means any of the threads of the gpfsperf
program on any node where MPI runs it.

When gpfsperf runs on multiple nodes, the multiple instances of the
program communicate using the Message Passing Interface (MPI) to
synchronize their execution and to combine their measurements into an
aggregate throughput result.


Access patterns
---------------

The gpfsperf program operates on a file that is assumed to consist of a
collection of records, each of the same size.  It can generate three
different types of access patterns: sequential, strided, and random.  The
meaning of these access patterns in some cases depends on whether or not
parallelism is employed when the benchmark is run.

The simplest access pattern is random.  The gpfsperf program generates a
sequence of random record numbers and reads or writes the corresponding
records.  When run on multiple nodes, or when multiple threads are used
within instances of the program, each thread of the gpfsperf program uses a
different seed for its random number generator, so each thread will access
independent sequences of records.  Two threads may access the same record
if the same random record number occurs in two sequences.
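
As a rough illustration of per-thread seeding, the Python sketch below
gives each thread its own generator.  It is not taken from the gpfsperf
source: the function name, the base seed, and the rule "base seed plus
thread index" are all assumptions made only to show that distinct seeds
yield independent record sequences.

```python
import random

def random_record_sequence(thread_index, num_records, count, base_seed=12345):
    """Return `count` random record numbers for one thread.

    Seeding with base_seed + thread_index is an assumption of this sketch;
    it only guarantees that every thread starts from a distinct seed.
    """
    rng = random.Random(base_seed + thread_index)
    return [rng.randrange(num_records) for _ in range(count)]

# Four hypothetical threads drawing from a file of one million records.
sequences = [random_record_sequence(t, 1_000_000, 5) for t in range(4)]
for t, seq in enumerate(sequences):
    print(f"thread {t}: {seq}")
```

Nothing in this scheme prevents two threads from drawing the same record
number, which matches the behavior described above.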

In the sequential access pattern, each gpfsperf thread reads from or writes
to a contiguous partition of a file sequentially.  For example, suppose that
a 10 billion byte file consists of one million records of 10000 bytes each.
If 10 threads read the file according to the sequential access
pattern, then the first thread will read sequentially through the partition
consisting of the first 100,000 records, the next thread will read from the
next 100,000 records, and so on.
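
The partitioning in this example can be sketched in Python as follows.
The function name is hypothetical, and the sketch assumes the record
count divides evenly among the threads, as it does in the example.

```python
def sequential_partition(thread_index, num_threads, num_records):
    """Return (first_record, last_record), inclusive, for one thread.

    Assumes num_records is an exact multiple of num_threads.
    """
    per_thread = num_records // num_threads
    first = thread_index * per_thread
    return first, first + per_thread - 1

# One million records split among 10 threads.
for t in (0, 1, 9):
    print(t, sequential_partition(t, 10, 1_000_000))
```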

In a strided access pattern, each thread skips some number of records
between each record that it reads or writes.  Reading the file from the
example above in a strided pattern, the first thread would read records 0,
10, 20, ..., 999,990.  The second thread would read records 1, 11, 21, ...,
999,991, and so on.  The gpfsperf program by default uses a stride, or
distance between records, equal to the total number of threads operating on
the file, in this case 10.
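
A minimal Python sketch of the strided record sequence follows.  The
function name is hypothetical; the default stride mirrors the rule stated
above (stride equals the total number of threads).

```python
def strided_records(thread_index, num_threads, num_records, stride=None):
    """Return the record numbers one thread touches in a strided pattern."""
    if stride is None:
        stride = num_threads  # default: one record per thread per "row"
    return range(thread_index, num_records, stride)

first = strided_records(0, 10, 1_000_000)
second = strided_records(1, 10, 1_000_000)
print(list(first[:3]), first[-1])    # [0, 10, 20] 999990
print(list(second[:3]), second[-1])  # [1, 11, 21] 999991
```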


Amount of data to be transferred
--------------------------------

One of the input parameters to gpfsperf is the amount of data to be
transferred.  This is the total number of bytes to be read or written by all
threads of the program.  If there are T threads in total, and the total
number of bytes to be transferred is N, each thread will read or write about
N/T bytes, rounded to a multiple of the record size.  By default, gpfsperf
sets N to the size of the file.  For the sequential or strided access
pattern, this default means that every record in the file will be read or
written exactly once.

If N is greater than the size of the file, each thread will read or write
its partition of the file repeatedly until reaching its share (N/T) of the
bytes to be transferred.  For example, suppose 10 threads sequentially read
a 10 billion byte file of 10000 byte records when N is 15 billion.  The
first thread will read the first 100,000 records, then reread the first
50,000 records.  The second thread will read records 100,000 through
199,999, then reread records 100,000 through 149,999, and so on.
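
The wrap-around behavior in this example can be modelled with the Python
sketch below.  It is an illustration only: the function name is
hypothetical, and it assumes sequential access with N/T an exact multiple
of the record size.

```python
def sequential_passes(thread_index, num_threads, num_records,
                      record_size, total_bytes):
    """List the inclusive (first, last) record ranges one thread reads."""
    per_thread = num_records // num_threads   # records in this partition
    first = thread_index * per_thread
    # This thread's share of N, expressed in records.
    remaining = total_bytes // num_threads // record_size
    passes = []
    while remaining > 0:
        n = min(remaining, per_thread)        # stop at the partition end
        passes.append((first, first + n - 1))
        remaining -= n
    return passes

# 10 threads, one million 10000-byte records, N = 15 billion bytes.
print(sequential_passes(0, 10, 1_000_000, 10_000, 15_000_000_000))
print(sequential_passes(1, 10, 1_000_000, 10_000, 15_000_000_000))
```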

When using strided access patterns with other than the default stride, this
behavior of gpfsperf can cause unexpected results.  For example, suppose
that 10 threads read the 10 billion byte file using a strided access
pattern, but instead of the default stride of 10 records gpfsperf is told to
use a stride of 10000 records.  The file partition read by the first thread
will be records 0, 10000, 20000, ..., 990,000.  This is only 100 distinct
records of 10000 bytes each, or a total of only 1 million bytes of data.
This data will likely remain in the GPFS buffer pool after it is read the
first time.  If N is more than 10 million bytes, gpfsperf will "read" the
same buffered data multiple times, and the reported data rate will appear
anomalously high.  To avoid this effect, performance tests using non-default
strides should reduce N in the same proportion as the stride was increased
from its default.
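
The arithmetic behind this caution can be checked with a short Python
sketch.  The helper names are hypothetical; the scaling rule in
scaled_nbytes is simply the advice above (reduce N in proportion to the
stride increase) written out.

```python
def distinct_data(thread_index, num_records, stride, record_size):
    """Return (distinct records, distinct bytes) touched by one thread."""
    records = range(thread_index, num_records, stride)
    return len(records), len(records) * record_size

def scaled_nbytes(file_size, default_stride, stride):
    """Shrink N by the same factor by which the stride was increased."""
    return file_size * default_stride // stride

# 10 threads, one million 10000-byte records, stride of 10000 records.
print(distinct_data(0, 1_000_000, 10_000, 10_000))     # (100, 1000000)
print(scaled_nbytes(10_000_000_000, 10, 10_000))       # 10000000
```

With these numbers the suggested N is 10 million bytes, which is exactly
the amount of distinct data the 10 threads can reach.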


Computation of aggregate data rate and utilization
--------------------------------------------------

The results of a run of gpfsperf are reported as an aggregate data rate.
Data rate is defined as the total number of bytes read or written by all
threads divided by the total time of the test.  It is reported in units of
1000 bytes/second.  The test time is measured from before any node opens the
test file until after the last node closes the test file.  To ensure a
consistent environment for each test, before beginning the timed test period
gpfsperf issues the GPFS_CLEAR_FILE_CACHE hint on all nodes.  This flushes
the GPFS buffer cache and releases byte-range tokens.  Note that versions of
GPFS prior to v1.3 do not support GPFS file hints, so the state of the
buffer cache at the beginning of a test will be influenced by the state left
by the prior test.

Since all threads of a gpfsperf run do approximately the same amount of work
(read or write N/T bytes), in principle they should all run for the same
amount of time.  In practice, however, variations in disk and switch
response time lead to variations in execution times among the threads.  Lock
contention in GPFS further contributes to these variations in execution
times.  A large degree of non-uniformity is undesirable, since it means that
some nodes are idle while they wait for threads on other nodes to finish.
To measure the degree of uniformity of thread execution times, gpfsperf
computes a quantity it calls "utilization."  Utilization is the fraction of
the total number of thread-seconds in a test during which threads actively
perform reads or writes.  A value of 1.0 indicates perfect overlap, while
lower values denote that some threads were idle while others still ran.

The following timeline illustrates how gpfsperf computes utilization for a
test involving one thread on each of two nodes, reading a total of 100M:

  time   event
  ----   -----
   0.0   Node 0 captures timestamp for beginning of test
         ... both nodes open file
   0.1   Node 0 about to do first read
   0.1   Node 1 about to do first read
         ... many file reads
   4.9   Node 0 finishes last read
   4.9   Node 1 finishes last read
         ... both nodes close file
   5.0   Node 0 captures timestamp for end of test

The utilization is ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0)) = 0.96.  The
reported aggregate data rate would be 100M / (5.0-0.0) = 20M/sec.

If node 0 ran significantly slower than node 1, it might have finished its
last read at time 9.9 instead of at time 4.9, and the end of test timestamp
might be 10.0 instead of 5.0.  In this case the utilization would drop to
((4.9-0.1) + (9.9-0.1)) / (2*(10.0-0.0)) = 0.73, and the data rate would be
100M / (10.0-0.0) = 10M/sec.
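
The two calculations above can be reproduced with the Python sketch
below.  The function names are hypothetical, and "100M" is taken to mean
100 * 10**6 bytes purely for the purpose of the illustration.

```python
def utilization(active_intervals, test_start, test_end):
    """Fraction of thread-seconds spent between first and last I/O.

    active_intervals holds one (first_io_time, last_io_time) pair per thread.
    """
    busy = sum(end - start for start, end in active_intervals)
    return busy / (len(active_intervals) * (test_end - test_start))

def data_rate(total_bytes, test_start, test_end):
    """Aggregate bytes per second over the whole timed test."""
    return total_bytes / (test_end - test_start)

balanced = [(0.1, 4.9), (0.1, 4.9)]   # both nodes finish together
skewed = [(0.1, 9.9), (0.1, 4.9)]     # node 0 runs much longer
print(round(utilization(balanced, 0.0, 5.0), 2))   # 0.96
print(round(utilization(skewed, 0.0, 10.0), 2))    # 0.73
print(data_rate(100_000_000, 0.0, 5.0))            # 20000000.0
print(data_rate(100_000_000, 0.0, 10.0))           # 10000000.0
```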


Command line parameters
-----------------------

There are four versions of gpfsperf in this directory:
  gpfsperf-mpi       - runs on multiple nodes under MPI, requires GPFS v1.3 or later
  gpfsperf           - runs only on a single node, requires GPFS v1.3 or later

The command line for any of the versions of gpfsperf is:

  gpfsperf[-mpi] operation pattern fn [options]

The order of parameters on the command line is not significant.

The operation must be either "create", "read", "write", or "uncache".  All
threads in a multinode or multithreaded run of gpfsperf do the same
operation to the same file, but at different offsets.  Create means to
ensure that the file exists, then do a write test.  The uncache operation
does not read or write the file, but only removes any buffered data for the
file from the GPFS buffer cache.

Pattern must be one of "rand", "randhint", "strided", or "seq".  The
meaning of each of these was explained earlier, except for "randhint".  The
"randhint" pattern is the same as "rand", except that the GPFS multiple
access range hint is used through the library functions in irreg.c to
prefetch blocks before they are accessed by gpfsperf.

The filename parameter fn should resolve to a file in a GPFS file system
that is mounted on all nodes where gpfsperf is to run.  The file must
already exist unless the "create" operation was specified.  Use of gpfsperf
on files not in GPFS may be meaningful in some situations, but this use has
not been tested.

Each optional parameter is described below, along with its default value.

-nolabels    - Produce a single line of output containing all parameters
               and the measured data rate.  Format of the output line is 'op
               pattern fn recordSize nBytes fileSize nProcs nThreads
               strideRecs inv ds dio fsync reltoken aio osync rate util'.
               This format may be useful for importing results into a
               spreadsheet or other program for further analysis.  The
               default is to produce multi-line labelled output.

-r recsize   - Record size.  Defaults to filesystem block size.  Must be
               specified for create operations.

-n nBytes    - Number of bytes to transfer.  Defaults to file size.  Must be
               specified for create operations.

-s stride    - Number of bytes between successive accesses by the
               same thread.  Only meaningful for strided access patterns.
               Must be a multiple of the record size.  See earlier cautions
               about combining large values of -s with large values of -n.
               Default is the total number of threads (threads per process
               times number of processes), in records.

-th nThreads - Number of threads per process.  Default is 1.  When there
               are multiple threads per process they read adjacent blocks of
               the file for the sequential and strided access patterns.  For
               example, suppose a file of 60 records is being read by 3
               nodes with 2 threads per node.  Under the sequential pattern,
               thread 0 on node 0 will read records 0-9, thread 1 on node 0
               will read records 10-19, thread 0 on node 1 will read records
               20-29, etc.  Under a strided pattern, thread 0 on node 0 will
               read records 0, 6, 12, ..., 54, thread 1 on node 0 will read
               records 1, 7, 13, ..., 55, etc.

-noinv       - Do not clear blocks of fn from the GPFS file cache before
               starting the test.  The default is to clear the cache.  If
               this option is given, the results of the test can depend
               strongly on the lock and buffer state left by the last test.
               For example, a multinode sequential read with -noinv will run
               more slowly after a strided write test than after a
               sequential write test.

-ds          - Use GPFS data shipping.  Data shipping avoids lock conflicts
               by partitioning the file among the nodes running gpfsperf,
               turning off byte range locking, and sending messages to the
               appropriate agent node to handle each read or write request.
               However, since (n-1)/n of the accesses are remote, the
               gpfsperf threads cannot take advantage of local block
               caching, although they may still benefit from prefetching and
               writebehind.  Also, since byte range locking is not in
               effect, use of data shipping suspends the atomicity
               guarantees of X/Open file semantics.  See the GPFS Guide and
               Reference manual for more details.  Data shipping should show
               the largest performance benefit for strided writes that have
               small record sizes.  The default is not to use data shipping.

-aio depth   - Use asynchronous I/O, prefetching to depth (default 0, max
               1000).  This can be used with any of the seq/rand/strided
               test patterns.

-dio         - Use the direct I/O flag when opening the file.  This will
               allow sector aligned/sized buffers to be read/written
               directly from the application buffer to the disks where the
               blocks are allocated.

-reltoken    - Release the entire file byte-range token after the file
               is newly created.  In a multi-node environment (MPI), only
               the first process will create the file and all the other
               processes will wait and open the file after the creation
               has occurred.  This flag tells the first process to release
               the byte-range token it automatically gets during the create.
               This may increase performance because other nodes that work
               on different ranges of the file will not need to revoke the
               range held by the node running the first process.

-fsync       - Ensure that no dirty data remain buffered at the conclusion
               of a write or create test.  The time to perform the necessary
               fsync operation is included in the test time, so this option
               reduces the reported aggregate data rate.  The default is not
               to fsync the file.

-osync       - Turn on the O_SYNC flag when opening the file.
               This causes every write operation to force the data to disk
               on each call.  The default is not to use O_SYNC.

-v           - Verbose tracing.  In a multinode test using gpfsperf, output
               from each instance of the program will be intermingled.  By
               telling MPI to label the output from each node (set the
               MP_LABELIO environment variable to yes), the verbose output
               will make more sense.

-V           - Very verbose tracing.  This option will display the offset
               of every read or write operation on every node.  As with -v,
               labelling the output by node is suggested.
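
The 60-record example given under -th can be reproduced with the Python
sketch below.  It assumes that threads are numbered globally as
node * nThreads + thread; that numbering is inferred from the example and
is not taken from the gpfsperf source.  All names are hypothetical.

```python
def records_for(node, thread, nodes, threads_per_node, num_records, pattern):
    """Return the records read by one thread of one node."""
    total_threads = nodes * threads_per_node
    # Assumed global numbering: threads of a node are adjacent.
    g = node * threads_per_node + thread
    if pattern == "seq":
        per_thread = num_records // total_threads
        return range(g * per_thread, (g + 1) * per_thread)
    if pattern == "strided":
        return range(g, num_records, total_threads)
    raise ValueError(f"unsupported pattern: {pattern}")

# 60 records, 3 nodes, 2 threads per node.
for node, thread in [(0, 0), (0, 1), (1, 0)]:
    seq = records_for(node, thread, 3, 2, 60, "seq")
    print(f"node {node} thread {thread}: seq {seq[0]}-{seq[-1]}")
print(list(records_for(0, 1, 3, 2, 60, "strided")))
```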

Numbers in options can be given using K, M, or G suffixes, in upper or lower
case, to denote 2**10, 2**20, or 2**30, respectively, or can have an R or r
suffix to denote a multiple of the record size.  For example, to specify a
record size of 4096 bytes and a size to read or write of 409600, one could
write "-r 4k -n 100r".
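
The suffix rules can be sketched in Python as follows.  This is only an
illustration of the arithmetic; parse_size is a hypothetical helper and
does not reflect how gpfsperf itself parses its arguments.

```python
def parse_size(text, record_size=None):
    """Convert a number with an optional K/M/G/R suffix to bytes."""
    multipliers = {"k": 2**10, "m": 2**20, "g": 2**30}
    suffix = text[-1].lower()
    if suffix in multipliers:
        return int(text[:-1]) * multipliers[suffix]
    if suffix == "r":
        if record_size is None:
            raise ValueError("an R suffix needs a known record size")
        return int(text[:-1]) * record_size
    return int(text)

record_size = parse_size("4k")
print(record_size)                      # 4096
print(parse_size("100r", record_size))  # 409600
```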

AIX only:
If the Asynchronous I/O (AIO) kernel extension has not been loaded yet,
running the gpfsperf program will fail and display output like:
exec(): 0509-036 Cannot load program gpfsperf because of the following errors:
        0509-130 Symbol resolution failed for /usr/lib/threads/libc.a(aio.o) because:
        0509-136   Symbol kaio_rdwr (number 0) is not exported from
                   dependent module /unix.
        0509-136   Symbol listio (number 1) is not exported from
                   dependent module /unix.
        0509-136   Symbol acancel (number 2) is not exported from
                   dependent module /unix.
        0509-136   Symbol iosuspend (number 3) is not exported from
                   dependent module /unix.
        0509-136   Symbol aio_nwait (number 4) is not exported from
                   dependent module /unix.
        0509-192 Examine .loader section symbols with the
                 'dump -Tv' command.

If you do not wish to use AIO, you can recompile the gpfsperf program so
that it does not use the AIO calls:
    rm gpfsperf.o gpfsperf-mpi.o
    make OTHERINCL="-DNO_AIO"

To enable AIO on your system, run commands like the following:
    lsattr -El aio0
    chdev -l aio0 -P -a autoconfig=available -a minservers=10 -a maxservers=128
    mkdev -l aio0
  Minservers tells AIX how many AIO kprocs to create immediately, and
  maxservers limits the total number created.  On AIX 5.2 the meaning of
  maxservers has changed to mean "maximum number of servers per CPU", so a
  16-way SMP should set maxservers=8 to get a total of 128 kprocs.



Examples
--------

Suppose that /gpfs is a GPFS file system that was formatted with 256K
blocks and that it has at least a gigabyte of free space.  Assuming that
the gpfsperf programs have been copied into /gpfs/test, and that
/gpfs/test is the current directory, the following ksh commands
illustrate how to run gpfsperf:

# Number of nodes on which the test will run.  If this is increased, the
# size of the test file should also be increased.
export MP_PROCS=8

# File containing a list of nodes on which gpfsperf will run.  There are
# other ways to specify where the test runs besides using an explicit
# host list.  See the Parallel Operating Environment documentation for
# details.
export MP_HOSTFILE=/etc/cluster.nodes

# Name of test file to be manipulated by the tests that follow.
export fn=/gpfs/test/testfile

# Verify block size
mmlsfs gpfs -B

# Create test file.  All write tests in these examples specify -fsync, so
# the reported data rate includes the overhead of flushing all dirty buffers
# to disk.  The size of the test file should be increased if more than 8
# nodes are used or if GPFS pagepool sizes have been increased from their
# defaults.  It may be necessary to increase the maximum size file the user
# is allowed to create.  See the ulimit command.
./gpfsperf-mpi create seq $fn -n 999m -r 256k -fsync

# Read entire test file sequentially
./gpfsperf-mpi read seq $fn -r 256k

# Rewrite test file sequentially using full block writes
./gpfsperf-mpi write seq $fn -r 256k -fsync

# Rewrite test file sequentially using small writes.  This requires GPFS to
# read blocks in order to update them, so will have worse performance than
# the full block rewrite.
./gpfsperf-mpi write seq $fn -r 64k -fsync

# Strided read using big records
./gpfsperf-mpi read strided $fn -r 256k

# Strided read using medium sized records.  Performance is worse because
# average I/O size has gone down.  This behavior will not be seen unless
# the stride is larger than a block (8*50000 > 256K).
./gpfsperf-mpi read strided $fn -r 50000

# Strided read using a very large stride.  Reported performance is
# misleading because each node just reads the same records over and over
# from its GPFS buffer cache.
./gpfsperf-mpi read strided $fn -r 50000 -s 2400r

# Strided write using a record size equal to the block size.  Decent
# performance, since record size matches GPFS lock granularity.
./gpfsperf-mpi write strided $fn -r 256k -fsync

# Strided write using small records.  Since GPFS lock granularity is
# larger than a record, performance is much worse.  Number of bytes
# written is less than the entire file to keep test time reasonable.
./gpfsperf-mpi write strided $fn -r 10000 -n 100m -fsync

# Strided write using small records and data shipping.  Data shipping
# trades additional communication overhead for less lock contention,
# improving performance.
./gpfsperf-mpi write strided $fn -r 10000 -n 100m -ds -fsync

# Random read of small records
./gpfsperf-mpi read rand $fn -r 10000 -n 100m

# Random read of small records using the GPFS multiple access range hint.
# Better performance (assuming more than MP_PROCS disks) because each node
# has more than one disk read in progress at once due to prefetching.
./gpfsperf-mpi read randhint $fn -r 10000 -n 100m