README @ 16

Last change on this file since 16 was 16, checked in by rock, 17 years ago

File size: 20.9 KB

Disclaimers
-----------

The files in this directory are provided by IBM on an "AS IS" basis
without warranty of any kind. In addition, the results that you obtain
from using these files to measure the general performance of your
General Parallel File System (GPFS) file systems are "AS IS." Your
reliance on any measurements is at your own risk and IBM does not assume
any liability, whatsoever from your use of these files or your use of
resultant performance measurements. The performance of GPFS file
systems is affected by many factors, including the access patterns of
application programs, the configuration and amount of memory on the SP
nodes, the number and characteristics of IBM Virtual Shared Disk (VSD)
servers, the number and speed of disks and disk adapters attached to the
VSD servers, GPFS, VSD and SP switch configuration parameters, other
traffic through the SP switch, etc. As a result, GPFS file system
performance may vary and IBM does not make any particular performance
claims for GPFS file systems.


Introduction
------------

The files in this directory serve two purposes:

- Provide a simple benchmark program (gpfsperf) that can be used to
  measure the performance of GPFS for several common file access patterns.
- Give examples of how to use some of the gpfs_fcntl hints and
  directives that are new in GPFS version 1.3.

There are four versions of the program binary built from a single set of
source files. The four versions correspond to all of the possible
combinations of single node/multiple node and with/without features that
are supported only on GPFS version 1.3. Multinode versions of the gpfsperf
program contain -mpi as part of their names, while versions that do not use
features of GPFS requiring version 1.3 have a suffix of -v12 in their names.


Parallelism
-----------

There are two independent ways to achieve parallelism in the gpfsperf
program. More than one instance of the program can be run on multiple
nodes using Message Passing Interface (MPI) to synchronize their
execution, or a single instance of the program can execute several
threads in parallel on a single node. These two techniques can also be
combined. When describing the behavior of the program, it should be
understood that 'threads' means any of the threads of the gpfsperf
program on any node where MPI runs it.

When gpfsperf runs on multiple nodes, the multiple instances of the
program communicate using the Message Passing Interface (MPI) to
synchronize their execution and to combine their measurements into an
aggregate throughput result.


Access patterns
---------------

The gpfsperf program operates on a file that is assumed to consist of a
collection of records, each of the same size. It can generate three
different types of access patterns: sequential, strided, and random. The
meaning of these access patterns in some cases depends on whether or not
parallelism is employed when the benchmark is run.

The simplest access pattern is random. The gpfsperf program generates a
sequence of random record numbers and reads or writes the corresponding
records. When run on multiple nodes, or when multiple threads are used
within instances of the program, each thread of the gpfsperf program uses a
different seed for its random number generator, so each thread will access
independent sequences of records. Two threads may access the same record
if the same random record number occurs in two sequences.

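The per-thread seeding described above can be sketched as follows. This is
only an illustration: the seeding scheme (a base seed plus the thread index)
and the function name are assumptions, not taken from the gpfsperf source.

```python
import random


def random_record_sequence(thread_index, n_records, n_accesses, base_seed=12345):
    """Return the record numbers one thread would access.

    Hypothetical scheme: every thread seeds a private generator with
    base_seed + thread_index, so each thread gets its own sequence.
    """
    rng = random.Random(base_seed + thread_index)
    return [rng.randrange(n_records) for _ in range(n_accesses)]


# Two threads drawing from the same one-million-record file. The sequences
# are independent, but a given record number can still show up in both.
seq0 = random_record_sequence(0, 1_000_000, 5)
seq1 = random_record_sequence(1, 1_000_000, 5)
print(seq0)
print(seq1)
```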
In the sequential access pattern, each gpfsperf thread reads from or writes
to a contiguous partition of a file sequentially. For example, suppose that
a 10 billion byte file consists of one million records of 10000 bytes each.
If 10 threads read the file according to the sequential access
pattern, then the first thread will read sequentially through the partition
consisting of the first 100,000 records, the next thread will read from the
next 100,000 records, and so on.

In a strided access pattern, each thread skips some number of records
between each record that it reads or writes. Reading the file from the
example above in a strided pattern, the first thread would read records 0,
10, 20, ..., 999,990. The second thread would read records 1, 11, 21, ...,
999,991, and so on. The gpfsperf program by default uses a stride, or
distance between records, equal to the total number of threads operating on
the file, in this case 10.

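The sequential and strided examples above can be reproduced with a short
sketch. The function names are hypothetical, and the sketch assumes that
thread t starts at record t in the strided case and that the record count
divides evenly among the threads; gpfsperf itself may handle remainders
differently.

```python
def sequential_records(thread, n_threads, n_records):
    """Half-open range [first, last) of records in this thread's partition."""
    per_thread = n_records // n_threads
    first = thread * per_thread
    return first, first + per_thread


def strided_records(thread, n_threads, n_records, stride=None):
    """Records one thread touches in a single pass over the file.

    The stride is measured in records and defaults to the total
    number of threads, as described in the text.
    """
    if stride is None:
        stride = n_threads
    return range(thread, n_records, stride)


# One million records, 10 threads.
print(sequential_records(1, 10, 1_000_000))         # (100000, 200000)
print(list(strided_records(1, 10, 1_000_000)[:3]))  # [1, 11, 21]
print(strided_records(1, 10, 1_000_000)[-1])        # 999991
```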

Amount of data to be transferred
--------------------------------

One of the input parameters to gpfsperf is the amount of data to be
transferred. This is the total number of bytes to be read or written by all
threads of the program. If there are T threads in total, and the total
number of bytes to be transferred is N, each thread will read or write about
N/T bytes, rounded to a multiple of the record size. By default, gpfsperf
sets N to the size of the file. For the sequential or strided access
pattern, this default means that every record in the file will be read or
written exactly once.

If N is greater than the size of the file, each thread will read or write
its partition of the file repeatedly until reaching its share (N/T) of the
bytes to be transferred. For example, suppose 10 threads sequentially read
a 10 billion byte file of 10000 byte records when N is 15 billion. The
first thread will read the first 100,000 records, then reread the first
50,000 records. The second thread will read records 100,000 through
199,999, then reread records 100,000 through 149,999, and so on.

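The reread arithmetic in the example above can be checked with a small
sketch. The function name is hypothetical, and rounding each thread's share
down to a whole number of records is an assumption; the text only says the
share is rounded to a multiple of the record size.

```python
def sequential_passes(thread, n_threads, file_bytes, record_bytes, total_bytes):
    """Describe how one thread covers its partition when N exceeds the file size.

    Returns (first_record, partition_records, full_passes, extra_records):
    the thread reads its partition full_passes times, then rereads the
    first extra_records records of the partition.
    """
    records_in_file = file_bytes // record_bytes
    partition = records_in_file // n_threads
    # Per-thread share N/T, rounded down to whole records (assumption).
    share = (total_bytes // n_threads) // record_bytes
    full_passes, extra = divmod(share, partition)
    return thread * partition, partition, full_passes, extra


# Second thread (index 1) of 10, 10 billion byte file, N = 15 billion.
print(sequential_passes(1, 10, 10_000_000_000, 10_000, 15_000_000_000))
# (100000, 100000, 1, 50000)
```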
When using strided access patterns with other than the default stride, this
behavior of gpfsperf can cause unexpected results. For example, suppose
that 10 threads read the 10 billion byte file using a strided access
pattern, but instead of the default stride of 10 records gpfsperf is told to
use a stride of 10000 records. The file partition read by the first thread
will be records 0, 10000, 20000, ..., 990,000. This is only 100 distinct
records of 10000 bytes each, or a total of only 1 million bytes of data.
This data will likely remain in the GPFS buffer pool after it is read the
first time. If N is more than 10 million bytes, gpfsperf will "read" the
same buffered data multiple times, and the reported data rate will appear
anomalously high. To avoid this effect, performance tests using non-default
strides should reduce N in the same proportion as the stride was increased
from its default.

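A sketch of the arithmetic behind this caution, using the numbers from the
example. Both helper names are hypothetical, and the scaling rule is simply
the "same proportion" advice from the text written out.

```python
def distinct_bytes(thread, n_records, record_bytes, stride):
    """Bytes of distinct data one thread touches in one strided pass."""
    return len(range(thread, n_records, stride)) * record_bytes


def scaled_total_bytes(file_bytes, default_stride, stride):
    """Reduce N in the same proportion as the stride grew from its default."""
    return file_bytes * default_stride // stride


# One million records of 10000 bytes, stride of 10000 records.
print(distinct_bytes(0, 1_000_000, 10_000, 10_000))       # 1000000
# Default stride is 10 records (10 threads); the stride grew by a factor of 1000.
print(scaled_total_bytes(10_000_000_000, 10, 10_000))     # 10000000
```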

Computation of aggregate data rate and utilization
--------------------------------------------------

The results of a run of gpfsperf are reported as an aggregate data rate.
Data rate is defined as the total number of bytes read or written by all
threads divided by the total time of the test. It is reported in units of
1000 bytes/second. The test time is measured from before any node opens the
test file until after the last node closes the test file. To ensure a
consistent environment for each test, before beginning the timed test period
gpfsperf issues the GPFS_CLEAR_FILE_CACHE hint on all nodes. This flushes
the GPFS buffer cache and releases byte-range tokens. Note that versions of
GPFS prior to v1.3 do not support GPFS file hints, so the state of the
buffer cache at the beginning of a test will be influenced by the state left
by the prior test.

Since all threads of a gpfsperf run do approximately the same amount of work
(read or write N/T bytes), in principle they should all run for the same
amount of time. In practice, however, variations in disk and switch
response time lead to variations in execution times among the threads. Lock
contention in GPFS further contributes to these variations in execution
times. A large degree of non-uniformity is undesirable, since it means that
some nodes are idle while they wait for threads on other nodes to finish.
To measure the degree of uniformity of thread execution times, gpfsperf
computes a quantity it calls "utilization." Utilization is the fraction of
the total number of thread-seconds in a test during which threads actively
perform reads or writes. A value of 1.0 indicates perfect overlap, while
lower values denote that some threads were idle while others still ran.

The following timeline illustrates how gpfsperf computes utilization for a
test involving one thread on each of two nodes, reading a total of 100M:

    time   event
    ----   -----
    0.0    Node 0 captures timestamp for beginning of test
    ...    both nodes open file
    0.1    Node 0 about to do first read
    0.1    Node 1 about to do first read
    ...    many file reads
    4.9    Node 0 finishes last read
    4.9    Node 1 finishes last read
    ...    both nodes close file
    5.0    Node 0 captures timestamp for end of test

The utilization is ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0)) = 0.96. The
reported aggregate data rate would be 100M / (5.0-0.0) = 20M/sec.

If node 0 ran significantly slower than node 1, it might have finished its
last read at time 9.9 instead of at time 4.9, and the end of test timestamp
might be 10.0 instead of 5.0. In this case the utilization would drop to
((4.9-0.1) + (9.9-0.1)) / (2*(10.0-0.0)) = 0.73, and the data rate would be
100M / (10.0-0.0) = 10M/sec.

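The two worked examples above can be reproduced with a few lines of code.
This is a sketch of the definitions as stated in the text, not the code
gpfsperf uses; the function names and the (first read, last read) interval
representation are assumptions.

```python
def utilization(busy_intervals, test_start, test_end):
    """Fraction of thread-seconds spent between first and last read or write.

    busy_intervals holds one (first_io_time, last_io_time) pair per thread.
    """
    busy = sum(end - start for start, end in busy_intervals)
    return busy / (len(busy_intervals) * (test_end - test_start))


def data_rate(total_amount, test_start, test_end):
    """Aggregate rate: everything transferred divided by the whole test time."""
    return total_amount / (test_end - test_start)


# Both nodes finish together.
print(round(utilization([(0.1, 4.9), (0.1, 4.9)], 0.0, 5.0), 2))   # 0.96
print(data_rate(100, 0.0, 5.0))                                     # 20.0

# Node 0 finishes its last read at 9.9 instead of 4.9.
print(round(utilization([(0.1, 9.9), (0.1, 4.9)], 0.0, 10.0), 2))  # 0.73
print(data_rate(100, 0.0, 10.0))                                    # 10.0
```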

Command line parameters
-----------------------

The following versions of gpfsperf are in this directory:
    gpfsperf-mpi - runs on multiple nodes under MPI, requires GPFS v1.3 or later
    gpfsperf     - runs only on a single node, requires GPFS v1.3 or later

The command line for any of the versions of gpfsperf is:

    gpfsperf[-mpi] operation pattern fn [options]

The order of parameters on the command line is not significant.

The operation must be one of "create", "read", "write", or "uncache". All
threads in a multinode or multithreaded run of gpfsperf do the same
operation to the same file, but at different offsets. Create means to
ensure that the file exists, then do a write test. The uncache operation
does not read or write the file, but only removes any buffered data for the
file from the GPFS buffer cache.

Pattern must be one of "rand", "randhint", "strided", or "seq". The
meaning of each of these was explained earlier, except for "randhint". The
"randhint" pattern is the same as "rand", except that the GPFS multiple
access range hint is used through the library functions in irreg.c to
prefetch blocks before they are accessed by gpfsperf.

The filename parameter fn should resolve to a file in a GPFS file system
that is mounted on all nodes where gpfsperf is to run. The file must
already exist unless the "create" operation was specified. Use of gpfsperf
on files not in GPFS may be meaningful in some situations, but this use has
not been tested.

Each optional parameter is described below, along with its default value.

-nolabels - Produce a single line of output containing all parameters
    and the measured data rate. Format of the output line is 'op
    pattern fn recordSize nBytes fileSize nProcs nThreads
    strideRecs inv ds dio fsync reltoken aio osync rate util'.
    This format may be useful for importing results into a
    spreadsheet or other program for further analysis. The
    default is to produce multi-line labelled output.

-r recsize - Record size. Defaults to filesystem block size. Must be
    specified for create operations.

-n nBytes - Number of bytes to transfer. Defaults to file size. Must be
    specified for create operations.

-s stride - Number of bytes between successive accesses by the
    same thread. Only meaningful for strided access patterns.
    Must be a multiple of the record size. See earlier cautions
    about combining large values of -s with large values of -n.
    Default is the total number of threads (threads per process
    times number of processes), measured in records.

-th nThreads - Number of threads per process. Default is 1. When there
    are multiple threads per process they read adjacent blocks of
    the file for the sequential and strided access patterns. For
    example, suppose a file of 60 records is being read by 3
    nodes with 2 threads per node. Under the sequential pattern,
    thread 0 on node 0 will read records 0-9, thread 1 on node 0
    will read records 10-19, thread 0 on node 1 will read records
    20-29, etc. Under a strided pattern, thread 0 on node 0 will
    read records 0, 6, 12, ..., 54, thread 1 on node 0 will read
    records 1, 7, 13, ..., 55, etc.

-noinv - Do not clear blocks of fn from the GPFS file cache before
    starting the test. The default is to clear the cache. If
    this option is given, the results of the test can depend
    strongly on the lock and buffer state left by the last test.
    For example, a multinode sequential read with -noinv will run
    more slowly after a strided write test than after a
    sequential write test.

-ds - Use GPFS data shipping. Data shipping avoids lock conflicts
    by partitioning the file among the nodes running gpfsperf,
    turning off byte range locking, and sending messages to the
    appropriate agent node to handle each read or write request.
    However, since (n-1)/n of the accesses are remote, the
    gpfsperf threads cannot take advantage of local block
    caching, although they may still benefit from prefetching and
    writebehind. Also, since byte range locking is not in
    effect, use of data shipping suspends the atomicity
    guarantees of X/Open file semantics. See the GPFS Guide and
    Reference manual for more details. Data shipping should show
    the largest performance benefit for strided writes that have
    small record sizes. The default is not to use data shipping.

-aio depth - Use asynchronous I/O, prefetching to the given depth
    (default 0, max 1000). This can be used with any of the
    seq/rand/strided test patterns.

-dio - Use direct IO flag when opening the file. This will allow
    sector aligned/sized buffers to be read/written directly
    from the application buffer to the disks where the blocks
    are allocated.

-reltoken - Release the entire file byte-range token after the file
    is newly created. In a multi-node environment (MPI), only
    the first process will create the file and all the other
    processes will wait and open the file after the creation
    has occurred. This flag tells the first process to release
    the byte-range token it automatically gets during the create.
    This may increase performance because other nodes that work
    on different ranges of the file will not need to revoke the
    range held by the node running the first process.

-fsync - Ensure that no dirty data remain buffered at the conclusion
    of a write or create test. The time to perform the necessary
    fsync operation is included in the test time, so this option
    reduces the reported aggregate data rate. The default is not
    to fsync the file.

-osync - Turn on the O_SYNC flag when opening the file.
    This causes every write operation to force the data to disk
    on each call. The default is not to use O_SYNC.

-v - Verbose tracing. In a multinode test using gpfsperf, output
    from each instance of the program will be intermingled. By
    telling MPI to label the output from each node (MP_LABELIO
    environment variable =yes), the verbose output will make more
    sense.

-V - Very verbose tracing. This option will display the offset
    of every read or write operation on every node. As with -v,
    labelling the output by node is suggested.

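For importing -nolabels output into another program, a line can be split
into named fields using the field order given in the -nolabels description
above. This is a sketch only: the sample line is invented for illustration
and is not real gpfsperf output, the encoding of the flag fields is not
specified in this README, and the parser assumes that fn contains no
whitespace.

```python
# Field order as listed in the -nolabels description.
FIELDS = ("op pattern fn recordSize nBytes fileSize nProcs nThreads "
          "strideRecs inv ds dio fsync reltoken aio osync rate util").split()


def parse_nolabels(line):
    """Map one -nolabels output line to a dict of field name -> string value."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


# Hypothetical sample line; every value here is made up.
sample = ("read seq /gpfs/test/testfile 262144 1047527424 1047527424 "
          "8 1 8 1 0 0 0 0 0 0 12345.6 0.95")
record = parse_nolabels(sample)
print(record["op"], record["pattern"], record["rate"], record["util"])
```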
Numbers in options can be given using K, M, or G suffixes, in upper or lower
case, to denote 2**10, 2**20, or 2**30, respectively, or can have an R or r
suffix to denote a multiple of the record size. For example, to specify a
record size of 4096 bytes and a size to read or write of 409600, one could
write "-r 4k -n 100r".

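The suffix rules above can be expressed as a small parser. This is a sketch
of the rules as described, not the parsing code in gpfsperf; the function
name and its error handling are assumptions.

```python
def parse_size(text, record_size=None):
    """Convert an option value such as '4k', '999m' or '100r' to bytes."""
    multipliers = {"k": 2**10, "m": 2**20, "g": 2**30}
    suffix = text[-1].lower()
    if suffix in multipliers:
        return int(text[:-1]) * multipliers[suffix]
    if suffix == "r":
        # An R suffix means a multiple of the record size.
        if record_size is None:
            raise ValueError("record size needed to interpret an R suffix")
        return int(text[:-1]) * record_size
    return int(text)


record_size = parse_size("4k")
print(record_size)                        # 4096
print(parse_size("100r", record_size))    # 409600
```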
AIX only:
If the Asynchronous I/O (AIO) kernel extension has not been loaded yet,
running the gpfsperf program will fail and display output like:
    exec(): 0509-036 Cannot load program gpfsperf because of the following errors:
    0509-130 Symbol resolution failed for /usr/lib/threads/libc.a(aio.o) because:
    0509-136 Symbol kaio_rdwr (number 0) is not exported from
    dependent module /unix.
    0509-136 Symbol listio (number 1) is not exported from
    dependent module /unix.
    0509-136 Symbol acancel (number 2) is not exported from
    dependent module /unix.
    0509-136 Symbol iosuspend (number 3) is not exported from
    dependent module /unix.
    0509-136 Symbol aio_nwait (number 4) is not exported from
    dependent module /unix.
    0509-192 Examine .loader section symbols with the
    'dump -Tv' command.

If you do not wish to use AIO, you can recompile the gpfsperf program to not
use the AIO calls:
    rm gpfsperf.o gpfsperf-mpi.o
    make OTHERINCL="-DNO_AIO"

To enable AIO on your system, run commands like the following:
    lsattr -El aio0
    chdev -l aio0 -P -a autoconfig=available -a minservers=10 -a maxservers=128
    mkdev -l aio0
The minservers attribute tells AIX how many AIO kprocs to create
immediately, and maxservers limits the total number created. On AIX 5.2 the
meaning of maxservers changed to "maximum number of servers per CPU", so a
16-way SMP should set maxservers=8 to get a total of 128 kprocs.


Examples
--------

Suppose that /gpfs is a GPFS file system that was formatted with 256K
blocks and that it has at least a gigabyte of free space. Assuming that
the gpfsperf programs have been copied into /gpfs/test, and that
/gpfs/test is the current directory, the following ksh commands
illustrate how to run gpfsperf:

    # Number of nodes on which the test will run. If this is increased, the
    # size of the test file should also be increased.
    export MP_PROCS=8

    # File containing a list of nodes on which gpfsperf will run. There are
    # other ways to specify where the test runs besides using an explicit
    # host list. See the Parallel Operating Environment documentation for
    # details.
    export MP_HOSTFILE=/etc/cluster.nodes

    # Name of test file to be manipulated by the tests that follow.
    export fn=/gpfs/test/testfile

    # Verify block size
    mmlsfs gpfs -B

    # Create test file. All write tests in these examples specify -fsync, so
    # the reported data rate includes the overhead of flushing all dirty buffers
    # to disk. The size of the test file should be increased if more than 8
    # nodes are used or if GPFS pagepool sizes have been increased from their
    # defaults. It may be necessary to increase the maximum size file the user
    # is allowed to create. See the ulimit command.
    ./gpfsperf-mpi create seq $fn -n 999m -r 256k -fsync

    # Read entire test file sequentially
    ./gpfsperf-mpi read seq $fn -r 256k

    # Rewrite test file sequentially using full block writes
    ./gpfsperf-mpi write seq $fn -r 256k -fsync

    # Rewrite test file sequentially using small writes. This requires GPFS to
    # read blocks in order to update them, so will have worse performance than
    # the full block rewrite.
    ./gpfsperf-mpi write seq $fn -r 64k -fsync

    # Strided read using big records
    ./gpfsperf-mpi read strided $fn -r 256k

    # Strided read using medium sized records. Performance is worse because
    # average I/O size has gone down. This behavior will not be seen unless
    # the stride is larger than a block (8*50000 > 256K).
    ./gpfsperf-mpi read strided $fn -r 50000

    # Strided read using a very large stride. Reported performance is
    # misleading because each node just reads the same records over and over
    # from its GPFS buffer cache.
    ./gpfsperf-mpi read strided $fn -r 50000 -s 2400r

    # Strided write using a record size equal to the block size. Decent
    # performance, since record size matches GPFS lock granularity.
    ./gpfsperf-mpi write strided $fn -r 256k -fsync

    # Strided write using small records. Since GPFS lock granularity is
    # larger than a record, performance is much worse. Number of bytes
    # written is less than the entire file to keep test time reasonable.
    ./gpfsperf-mpi write strided $fn -r 10000 -n 100m -fsync

    # Strided write using small records and data shipping. Data shipping
    # trades additional communication overhead for less lock contention,
    # improving performance.
    ./gpfsperf-mpi write strided $fn -r 10000 -n 100m -ds -fsync

    # Random read of small records
    ./gpfsperf-mpi read rand $fn -r 10000 -n 100m

    # Random read of small records using the GPFS multiple access range hint.
    # Better performance (assuming more than MP_PROCS disks) because each node
    # has more than one disk read in progress at once due to prefetching.
    ./gpfsperf-mpi read randhint $fn -r 10000 -n 100m



source: gpfs_3.1_ker2.6.20/lpp/mmfs/samples/perf/README @ 16
