Disclaimers
-----------

The files in this directory are provided by IBM on an "AS IS" basis
without warranty of any kind.  In addition, the results that you obtain
from using these files to measure the general performance of your
General Parallel File System (GPFS) file systems are "AS IS."  Your
reliance on any measurements is at your own risk, and IBM does not assume
any liability whatsoever from your use of these files or your use of
resultant performance measurements.  The performance of GPFS file
systems is affected by many factors, including the access patterns of
application programs, the configuration and amount of memory on the SP
nodes, the number and characteristics of IBM Virtual Shared Disk (VSD)
servers, the number and speed of disks and disk adapters attached to the
VSD servers, GPFS, VSD, and SP switch configuration parameters, other
traffic through the SP switch, etc.  As a result, GPFS file system
performance may vary, and IBM does not make any particular performance
claims for GPFS file systems.


Introduction
------------

The files in this directory serve two purposes:

- Provide a simple benchmark program (gpfsperf) that can be used to
  measure the performance of GPFS for several common file access patterns.
- Give examples of how to use some of the gpfs_fcntl hints and
  directives that are new in GPFS version 1.3.

There are four versions of the program binary built from a single set of
source files.  The four versions correspond to all of the possible
combinations of single node/multiple node and with/without features that
are supported only on GPFS version 1.3.  Multinode versions of the gpfsperf
program contain -mpi as part of their names, while versions that do not use
features of GPFS requiring version 1.3 have a suffix of -v12 in their names.


Parallelism
-----------

There are two independent ways to achieve parallelism in the gpfsperf
program.  More than one instance of the program can be run on multiple
nodes, using the Message Passing Interface (MPI) to synchronize their
execution, or a single instance of the program can execute several
threads in parallel on a single node.  These two techniques can also be
combined.  When describing the behavior of the program, 'threads' should
be understood to mean any of the threads of the gpfsperf program on any
node where MPI runs it.

When gpfsperf runs on multiple nodes, the multiple instances of the
program communicate using MPI to synchronize their execution and to
combine their measurements into an aggregate throughput result.
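
As an illustration, the two techniques can be combined by running the MPI
version on several nodes and giving each process more than one thread with
the -th option described later in this document.  A minimal sketch, assuming
the POE environment is set up as in the Examples section and using a
placeholder file name:

  # 4 MPI processes, 2 threads per process, 8 gpfsperf threads in total
  export MP_PROCS=4
  ./gpfsperf-mpi read seq /gpfs/test/testfile -r 256k -th 2

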
Access patterns
---------------

The gpfsperf program operates on a file that is assumed to consist of a
collection of records, each of the same size.  It can generate three
different types of access patterns: sequential, strided, and random.  The
meaning of these access patterns in some cases depends on whether or not
parallelism is employed when the benchmark is run.

The simplest access pattern is random.  The gpfsperf program generates a
sequence of random record numbers and reads or writes the corresponding
records.  When run on multiple nodes, or when multiple threads are used
within instances of the program, each thread of the gpfsperf program uses a
different seed for its random number generator, so each thread will access
an independent sequence of records.  Two threads may access the same record
if the same random record number occurs in both sequences.

In the sequential access pattern, each gpfsperf thread reads from or writes
to a contiguous partition of a file sequentially.  For example, suppose that
a 10 billion byte file consists of one million records of 10000 bytes each.
If 10 threads read the file according to the sequential access
pattern, then the first thread will read sequentially through the partition
consisting of the first 100,000 records, the next thread will read from the
next 100,000 records, and so on.

In a strided access pattern, each thread skips some number of records
between each record that it reads or writes.  Reading the file from the
example above in a strided pattern, the first thread would read records 0,
10, 20, ..., 999,990.  The second thread would read records 1, 11, 21, ...,
999,991, and so on.  By default, the gpfsperf program uses a stride, or
distance between records, equal to the total number of threads operating on
the file, in this case 10.
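
As a preview of the command syntax described later in this document, the
three patterns are selected with the keywords "seq", "strided", and "rand".
A brief sketch, using a placeholder file name and accepting the defaults for
record size and transfer size:

  ./gpfsperf read seq     /gpfs/test/testfile
  ./gpfsperf read strided /gpfs/test/testfile
  ./gpfsperf read rand    /gpfs/test/testfile

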
Amount of data to be transferred
--------------------------------

One of the input parameters to gpfsperf is the amount of data to be
transferred.  This is the total number of bytes to be read or written by all
threads of the program.  If there are T threads in total, and the total
number of bytes to be transferred is N, each thread will read or write about
N/T bytes, rounded to a multiple of the record size.  By default, gpfsperf
sets N to the size of the file.  For the sequential or strided access
pattern, this default means that every record in the file will be read or
written exactly once.

If N is greater than the size of the file, each thread will read or write
its partition of the file repeatedly until reaching its share (N/T) of the
bytes to be transferred.  For example, suppose 10 threads sequentially read
a 10 billion byte file of 10000 byte records when N is 15 billion.  The
first thread will read the first 100,000 records, then reread the first
50,000 records.  The second thread will read records 100,000 through
199,999, then reread records 100,000 through 149,999, and so on.

When using strided access patterns with other than the default stride, this
behavior of gpfsperf can cause unexpected results.  For example, suppose
that 10 threads read the 10 billion byte file using a strided access
pattern, but instead of the default stride of 10 records gpfsperf is told to
use a stride of 10000 records.  The file partition read by the first thread
will be records 0, 10000, 20000, ..., 990,000.  This is only 100 distinct
records of 10000 bytes each, or a total of only 1 million bytes of data.
This data will likely remain in the GPFS buffer pool after it is read the
first time.  If N is more than 10 million bytes, gpfsperf will "read" the
same buffered data multiple times, and the reported data rate will appear
anomalously high.  To avoid this effect, performance tests using non-default
strides should reduce N in the same proportion as the stride was increased
from its default.
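
Continuing the example above, the stride was increased by a factor of 1000
from its default of 10 records, so N should be reduced by the same factor,
from the 10 billion byte file size down to 10 million bytes.  A minimal
sketch of such a run, using a placeholder file name and the sizes from the
example (the -r, -s, and -n options are described later in this document):

  # stride of 10000 records instead of the default 10; read only 1000
  # records (10 million bytes) in total instead of the whole file
  ./gpfsperf-mpi read strided /gpfs/test/testfile -r 10000 -s 10000r -n 1000r

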
Computation of aggregate data rate and utilization
--------------------------------------------------

The results of a run of gpfsperf are reported as an aggregate data rate.
Data rate is defined as the total number of bytes read or written by all
threads divided by the total time of the test.  It is reported in units of
1000 bytes/second.  The test time is measured from before any node opens the
test file until after the last node closes the test file.  To ensure a
consistent environment for each test, before beginning the timed test period
gpfsperf issues the GPFS_CLEAR_FILE_CACHE hint on all nodes.  This flushes
the GPFS buffer cache and releases byte-range tokens.  Note that versions of
GPFS prior to v1.3 do not support GPFS file hints, so the state of the
buffer cache at the beginning of a test will be influenced by the state left
by the prior test.

Since all threads of a gpfsperf run do approximately the same amount of work
(read or write N/T bytes), in principle they should all run for the same
amount of time.  In practice, however, variations in disk and switch
response time lead to variations in execution times among the threads.  Lock
contention in GPFS further contributes to these variations in execution
times.  A large degree of non-uniformity is undesirable, since it means that
some nodes are idle while they wait for threads on other nodes to finish.
To measure the degree of uniformity of thread execution times, gpfsperf
computes a quantity it calls "utilization."  Utilization is the fraction of
the total number of thread-seconds in a test during which threads actively
perform reads or writes.  A value of 1.0 indicates perfect overlap, while
lower values denote that some threads were idle while others still ran.

The following timeline illustrates how gpfsperf computes utilization for a
test involving one thread on each of two nodes, reading a total of 100M:

  time   event
  ----   -----
   0.0   Node 0 captures timestamp for beginning of test
   ...   both nodes open file
   0.1   Node 0 about to do first read
   0.1   Node 1 about to do first read
   ...   many file reads
   4.9   Node 0 finishes last read
   4.9   Node 1 finishes last read
   ...   both nodes close file
   5.0   Node 0 captures timestamp for end of test

The utilization is ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0)) = 0.96.  The
reported aggregate data rate would be 100M / (5.0-0.0) = 20M/sec.

If node 0 ran significantly slower than node 1, it might have finished its
last read at time 9.9 instead of at time 4.9, and the end-of-test timestamp
might be 10.0 instead of 5.0.  In this case the utilization would drop to
((4.9-0.1) + (9.9-0.1)) / (2*(10.0-0.0)) = 0.73, and the data rate would be
100M / (10.0-0.0) = 10M/sec.
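
The same arithmetic can be reproduced by hand; a small sketch using the
numbers from the timeline above:

  # utilization = (sum of per-thread busy time) / (threads * elapsed time)
  echo "scale=2; ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0))" | bc
  # prints .96

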
Command line parameters
-----------------------

There are four versions of gpfsperf in this directory:
  gpfsperf-mpi     - runs on multiple nodes under MPI, requires GPFS v1.3 or later
  gpfsperf         - runs only on a single node, requires GPFS v1.3 or later
  gpfsperf-mpi-v12 - runs on multiple nodes under MPI, does not require GPFS v1.3
  gpfsperf-v12     - runs only on a single node, does not require GPFS v1.3

The command line for any of the versions of gpfsperf is:

  gpfsperf[-mpi] operation pattern fn [options]

The order of parameters on the command line is not significant.

The operation must be either "create", "read", "write", or "uncache".  All
threads in a multinode or multithreaded run of gpfsperf do the same
operation to the same file, but at different offsets.  Create means to
ensure that the file exists, then do a write test.  The uncache operation
does not read or write the file, but only removes any buffered data for the
file from the GPFS buffer cache.

Pattern must be one of "rand", "randhint", "strided", or "seq".  The
meaning of each of these was explained earlier, except for "randhint".  The
"randhint" pattern is the same as "rand", except that the GPFS multiple
access range hint is used through the library functions in irreg.c to
prefetch blocks before they are accessed by gpfsperf.

The filename parameter fn should resolve to a file in a GPFS file system
that is mounted on all nodes where gpfsperf is to run.  The file must
already exist unless the "create" operation was specified.  Use of gpfsperf
on files not in GPFS may be meaningful in some situations, but this use has
not been tested.
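
For example, a single-node run might create a test file and later remove its
blocks from the GPFS buffer cache.  A minimal sketch using a placeholder
file name and arbitrary sizes (the -r and -n options are described below):

  # create a 100 MB file of 256 KB records, then drop it from the buffer cache
  ./gpfsperf create seq /gpfs/test/testfile -r 256k -n 100m
  ./gpfsperf uncache seq /gpfs/test/testfile
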
Each optional parameter is described below, along with its default value.

-nolabels  - Produce a single line of output containing all parameters
             and the measured data rate.  The format of the output line is
             'op pattern fn recordSize nBytes fileSize nProcs nThreads
             strideRecs inv ds dio fsync reltoken aio osync rate util'.
             This format may be useful for importing results into a
             spreadsheet or other program for further analysis (an example
             appears at the end of this parameter list).  The default is to
             produce multi-line labelled output.

-r recsize - Record size.  Defaults to the file system block size.  Must be
             specified for create operations.

-n nBytes  - Number of bytes to transfer.  Defaults to the file size.  Must
             be specified for create operations.

-s stride  - Number of bytes between successive accesses by the
             same thread.  Only meaningful for strided access patterns.
             Must be a multiple of the record size.  See the earlier cautions
             about combining large values of -s with large values of -n.
             The default is the number of threads times the number of
             processes, in records.

-th nThreads - Number of threads per process.  Default is 1.  When there
             are multiple threads per process they read adjacent blocks of
             the file for the sequential and strided access patterns.  For
             example, suppose a file of 60 records is being read by 3
             nodes with 2 threads per node.  Under the sequential pattern,
             thread 0 on node 0 will read records 0-9, thread 1 on node 0
             will read records 10-19, thread 0 on node 1 will read records
             20-29, etc.  Under a strided pattern, thread 0 on node 0 will
             read records 0, 6, 12, ..., 54, thread 1 on node 0 will read
             records 1, 7, 13, ..., 55, etc.

-noinv     - Do not clear blocks of fn from the GPFS file cache before
             starting the test.  The default is to clear the cache.  If
             this option is given, the results of the test can depend
             strongly on the lock and buffer state left by the last test.
             For example, a multinode sequential read with -noinv will run
             more slowly after a strided write test than after a
             sequential write test.

-ds        - Use GPFS data shipping.  Data shipping avoids lock conflicts
             by partitioning the file among the nodes running gpfsperf,
             turning off byte-range locking, and sending messages to the
             appropriate agent node to handle each read or write request.
             However, since (n-1)/n of the accesses are remote, the
             gpfsperf threads cannot take advantage of local block
             caching, although they may still benefit from prefetching and
             write-behind.  Also, since byte-range locking is not in
             effect, use of data shipping suspends the atomicity
             guarantees of X/Open file semantics.  See the GPFS Guide and
             Reference manual for more details.  Data shipping should show
             the largest performance benefit for strided writes that have
             small record sizes.  The default is not to use data shipping.

-aio depth - Use asynchronous I/O, prefetching to the given depth
             (default 0, maximum 1000).  This can be used with any of the
             seq/rand/strided test patterns.

-dio       - Use the direct I/O flag when opening the file.  This allows
             sector-aligned and sector-sized buffers to be read or written
             directly between the application buffer and the disks where
             the blocks are allocated.

-reltoken  - Release the byte-range token covering the entire file after
             the file is newly created.  In a multinode (MPI) environment,
             only the first process creates the file; all the other
             processes wait and open the file after the creation has
             occurred.  This flag tells the first process to release the
             byte-range token it automatically acquires during the create.
             This may increase performance because other nodes that work
             on different ranges of the file will not need to revoke the
             range held by the node running the first process.

-fsync     - Ensure that no dirty data remain buffered at the conclusion
             of a write or create test.  The time to perform the necessary
             fsync operation is included in the test time, so this option
             reduces the reported aggregate data rate.  The default is not
             to fsync the file.

-osync     - Turn on the O_SYNC flag when opening the file.
             This causes every write operation to force the data to disk
             on each call.  The default is not to use O_SYNC.

-v         - Verbose tracing.  In a multinode test using gpfsperf, output
             from each instance of the program will be intermingled.  By
             telling MPI to label the output from each node (set the
             MP_LABELIO environment variable to yes), the verbose output
             will make more sense.

-V         - Very verbose tracing.  This option will display the offset
             of every read or write operation on every node.  As with -v,
             labelling the output by node is suggested.
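
As mentioned under -nolabels above, the single-line output format is
convenient for collecting the results of many runs into one file for later
analysis.  A brief sketch, using placeholder file names:

  ./gpfsperf read seq /gpfs/test/testfile -r 256k -nolabels >> results.txt
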
Numbers in options can be given using K, M, or G suffixes, in upper or lower
case, to denote 2**10, 2**20, or 2**30, respectively, or can have an R or r
suffix to denote a multiple of the record size.  For example, to specify a
record size of 4096 bytes and a size to read or write of 409600, one could
write "-r 4k -n 100r".

AIX only:
If the Asynchronous I/O (AIO) kernel extension has not been loaded yet,
running the gpfsperf program will fail and display output like:

  exec(): 0509-036 Cannot load program gpfsperf because of the following errors:
        0509-130 Symbol resolution failed for /usr/lib/threads/libc.a(aio.o) because:
        0509-136   Symbol kaio_rdwr (number 0) is not exported from
                   dependent module /unix.
        0509-136   Symbol listio (number 1) is not exported from
                   dependent module /unix.
        0509-136   Symbol acancel (number 2) is not exported from
                   dependent module /unix.
        0509-136   Symbol iosuspend (number 3) is not exported from
                   dependent module /unix.
        0509-136   Symbol aio_nwait (number 4) is not exported from
                   dependent module /unix.
        0509-192 Examine .loader section symbols with the
                 'dump -Tv' command.

If you do not wish to use AIO, you can recompile the gpfsperf program so
that it does not use the AIO calls:

  rm gpfsperf.o gpfsperf-mpi.o
  make OTHERINCL="-DNO_AIO"

Alternatively, enable AIO on your system with commands like the following:

  lsattr -El aio0
  chdev -l aio0 -P -a autoconfig=available -a minservers=10 -a maxservers=128
  mkdev -l aio0

Minservers just tells AIX how many AIO kprocs to create immediately, and
maxservers limits the total number created.  On AIX 5.2 the meaning of
maxservers has changed to mean "maximum number of servers per CPU", so a
16-way SMP should set maxservers=8 to get a total of 128 kprocs.


Examples
--------

Suppose that /gpfs is a GPFS file system that was formatted with 256K
blocks and that it has at least a gigabyte of free space.  Assuming that
the gpfsperf programs have been copied into /gpfs/test, and that
/gpfs/test is the current directory, the following ksh commands
illustrate how to run gpfsperf:

# Number of nodes on which the test will run.  If this is increased, the
# size of the test file should also be increased.
export MP_PROCS=8

# File containing a list of nodes on which gpfsperf will run.  There are
# other ways to specify where the test runs besides using an explicit
# host list.  See the Parallel Operating Environment documentation for
# details.
export MP_HOSTFILE=/etc/cluster.nodes

# Name of test file to be manipulated by the tests that follow.
export fn=/gpfs/test/testfile

# Verify block size
mmlsfs gpfs -B

# Create test file.  All write tests in these examples specify -fsync, so
# the reported data rate includes the overhead of flushing all dirty buffers
# to disk.  The size of the test file should be increased if more than 8
# nodes are used or if GPFS pagepool sizes have been increased from their
# defaults.  It may be necessary to increase the maximum size file the user
# is allowed to create.  See the ulimit command.
./gpfsperf-mpi create seq $fn -n 999m -r 256k -fsync

# Read entire test file sequentially
./gpfsperf-mpi read seq $fn -r 256k

# Rewrite test file sequentially using full block writes
./gpfsperf-mpi write seq $fn -r 256k -fsync

# Rewrite test file sequentially using small writes.  This requires GPFS to
# read blocks in order to update them, so will have worse performance than
# the full block rewrite.
./gpfsperf-mpi write seq $fn -r 64k -fsync

# Strided read using big records
./gpfsperf-mpi read strided $fn -r 256k

# Strided read using medium sized records.  Performance is worse because
# average I/O size has gone down.  This behavior will not be seen unless
# the stride is larger than a block (8*50000 > 256K).
./gpfsperf-mpi read strided $fn -r 50000

# Strided read using a very large stride.  Reported performance is
# misleading because each node just reads the same records over and over
# from its GPFS buffer cache.
./gpfsperf-mpi read strided $fn -r 50000 -s 2400r

# Strided write using a record size equal to the block size.  Decent
# performance, since record size matches GPFS lock granularity.
./gpfsperf-mpi write strided $fn -r 256k -fsync

# Strided write using small records.  Since GPFS lock granularity is
# larger than a record, performance is much worse.  Number of bytes
# written is less than the entire file to keep test time reasonable.
./gpfsperf-mpi write strided $fn -r 10000 -n 100m -fsync

# Strided write using small records and data shipping.  Data shipping
# trades additional communication overhead for less lock contention,
# improving performance.
./gpfsperf-mpi write strided $fn -r 10000 -n 100m -ds -fsync

# Random read of small records
./gpfsperf-mpi read rand $fn -r 10000 -n 100m

# Random read of small records using the GPFS multiple access range hint.
# Better performance (assuming more than MP_PROCS disks) because each node
# has more than one disk read in progress at once due to prefetching.
./gpfsperf-mpi read randhint $fn -r 10000 -n 100m
---|