Disclaimers
-----------

The files in this directory are provided by IBM on an "AS IS" basis
without warranty of any kind. In addition, the results that you obtain
from using these files to measure the general performance of your
General Parallel File System (GPFS) file systems are "AS IS." Your
reliance on any measurements is at your own risk and IBM does not assume
any liability whatsoever from your use of these files or your use of
resultant performance measurements. The performance of GPFS file
systems is affected by many factors, including the access patterns of
application programs, the configuration and amount of memory on the SP
nodes, the number and characteristics of IBM Virtual Shared Disk (VSD)
servers, the number and speed of disks and disk adapters attached to the
VSD servers, GPFS, VSD and SP switch configuration parameters, other
traffic through the SP switch, etc. As a result, GPFS file system
performance may vary and IBM does not make any particular performance
claims for GPFS file systems.


Introduction
------------

The files in this directory serve two purposes:

- Provide a simple benchmark program (gpfsperf) that can be used to
  measure the performance of GPFS for several common file access patterns.
- Give examples of how to use some of the gpfs_fcntl hints and
  directives that are new in GPFS version 1.3.

There are four versions of the program binary built from a single set of
source files. The four versions correspond to all of the possible
combinations of single node/multiple node and with/without features that
are only supported on GPFS version 1.3. Multinode versions of the gpfsperf
program contain -mpi as part of their names, while versions that do not use
features of GPFS requiring version 1.3 have a suffix of -v12 in their names.


Parallelism
-----------

There are two independent ways to achieve parallelism in the gpfsperf
program. More than one instance of the program can be run on multiple
nodes using the Message Passing Interface (MPI) to synchronize their
execution, or a single instance of the program can execute several
threads in parallel on a single node. These two techniques can also be
combined. In the descriptions that follow, 'threads' refers to all of the
threads of all gpfsperf instances, on whichever nodes MPI runs them.

When gpfsperf runs on multiple nodes, the instances of the program
communicate using MPI to synchronize their execution and to combine
their measurements into an aggregate throughput result.

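As an illustration of how such a multinode aggregation can work, the sketch
below uses MPI to combine per-node byte counts and timestamps into a single
rate. It is a minimal, hypothetical example, not code taken from gpfsperf;
all names are invented.

    /* aggregate.c - hedged sketch of combining per-node measurements.
       Compile with an MPI C compiler, e.g. mpicc aggregate.c */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      double tStart, tEnd, bytesThisNode;
      double totalBytes, firstStart, lastEnd;
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      tStart = MPI_Wtime();
      bytesThisNode = 0.0;   /* ...each node would do its reads or writes here... */
      tEnd = MPI_Wtime();

      /* Total bytes moved by all nodes; the test interval runs from the
         earliest start to the latest finish over all nodes. */
      MPI_Reduce(&bytesThisNode, &totalBytes, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      MPI_Reduce(&tStart, &firstStart, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
      MPI_Reduce(&tEnd, &lastEnd, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

      if (rank == 0 && lastEnd > firstStart)
        printf("aggregate rate %.1f KB/sec\n",
               totalBytes / (lastEnd - firstStart) / 1000.0);

      MPI_Finalize();
      return 0;
    }
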
Access patterns
---------------

The gpfsperf program operates on a file that is assumed to consist of a
collection of records, each of the same size. It can generate three
different types of access patterns: sequential, strided, and random. The
meaning of these access patterns in some cases depends on whether or not
parallelism is employed when the benchmark is run.

The simplest access pattern is random. The gpfsperf program generates a
sequence of random record numbers and reads or writes the corresponding
records. When run on multiple nodes, or when multiple threads are used
within instances of the program, each thread of the gpfsperf program uses a
different seed for its random number generator, so each thread will access
an independent sequence of records. Two threads may access the same record
if the same random record number occurs in both sequences.

In the sequential access pattern, each gpfsperf thread reads from or writes
to a contiguous partition of a file sequentially. For example, suppose that
a 10 billion byte file consists of one million records of 10000 bytes each.
If 10 threads read the file according to the sequential access
pattern, then the first thread will read sequentially through the partition
consisting of the first 100,000 records, the next thread will read from the
next 100,000 records, and so on.

In a strided access pattern, each thread skips some number of records
between each record that it reads or writes. Reading the file from the
example above in a strided pattern, the first thread would read records 0,
10, 20, ..., 999,990. The second thread would read records 1, 11, 21, ...,
999,991, and so on. The gpfsperf program by default uses a stride, or
distance between records, equal to the total number of threads operating on
the file, in this case 10.

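To make the record numbering concrete, the hypothetical C fragment below
computes which record a given thread touches on its i-th access under each
pattern. It is only an illustration of the arithmetic described above, not
code from gpfsperf.

    /* pattern.c - sketch of the sequential, strided, and random patterns. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Record read or written on the i-th access by thread t, where nThreads
       threads share a file of nRecords records. */
    long seqRecord(long t, long i, long nRecords, long nThreads)
    {
      long partition = nRecords / nThreads;   /* contiguous share per thread */
      return t * partition + i % partition;   /* wraps within its partition  */
    }

    long stridedRecord(long t, long i, long nRecords, long stride)
    {
      long perPass = nRecords / stride;       /* records visited in one pass */
      return t + (i % perPass) * stride;      /* default stride == nThreads  */
    }

    long randRecord(unsigned int *seed, long nRecords)
    {
      return rand_r(seed) % nRecords;         /* each thread gets its own seed */
    }

    int main(void)
    {
      unsigned int seed = 1;                  /* a real run would vary this per thread */
      long nRecords = 1000000, nThreads = 10;

      /* Matches the example above: thread 1 reads records 1, 11, 21, ... */
      printf("strided, thread 1, 3rd access: record %ld\n",
             stridedRecord(1, 2, nRecords, nThreads));
      printf("sequential, thread 0, 3rd access: record %ld\n",
             seqRecord(0, 2, nRecords, nThreads));
      printf("random: record %ld\n", randRecord(&seed, nRecords));
      return 0;
    }
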
Amount of data to be transferred
--------------------------------

One of the input parameters to gpfsperf is the amount of data to be
transferred. This is the total number of bytes to be read or written by all
threads of the program. If there are T threads in total, and the total
number of bytes to be transferred is N, each thread will read or write about
N/T bytes, rounded to a multiple of the record size. By default, gpfsperf
sets N to the size of the file. For the sequential or strided access
pattern, this default means that every record in the file will be read or
written exactly once.

If N is greater than the size of the file, each thread will read or write
its partition of the file repeatedly until reaching its share (N/T) of the
bytes to be transferred. For example, suppose 10 threads sequentially read
a 10 billion byte file of 10000 byte records when N is 15 billion. The
first thread will read the first 100,000 records, then reread the first
50,000 records. The second thread will read records 100,000 through
199,999, then reread records 100,000 through 149,999, and so on.

When using strided access patterns with other than the default stride, this
behavior of gpfsperf can cause unexpected results. For example, suppose
that 10 threads read the 10 billion byte file using a strided access
pattern, but instead of the default stride of 10 records gpfsperf is told to
use a stride of 10000 records. The file partition read by the first thread
will be records 0, 10000, 20000, ..., 990,000. This is only 100 distinct
records of 10000 bytes each, or a total of only 1 million bytes of data.
This data will likely remain in the GPFS buffer pool after it is read the
first time. If N is more than 10 million bytes, gpfsperf will "read" the
same buffered data multiple times, and the reported data rate will appear
anomalously high. To avoid this effect, performance tests using non-default
strides should reduce N in the same proportion as the stride was increased
from its default.

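The arithmetic above is simple enough to show in a few lines. The
hypothetical fragment below (not part of gpfsperf) computes a thread's share
of N for the example file, and how much N should shrink when the stride is
raised from its default of 10 records to 10000 records.

    /* sizing.c - worked example of the sizing rules described above. */
    #include <stdio.h>

    int main(void)
    {
      long long N = 10000000000LL;        /* total bytes to transfer (-n)   */
      long long T = 10;                   /* total number of threads        */
      long long recSize = 10000;          /* record size in bytes (-r)      */

      /* Each thread moves about N/T bytes, rounded to whole records. */
      long long perThread = (N / T / recSize) * recSize;
      printf("bytes per thread: %lld\n", perThread);

      /* A non-default stride shrinks the set of distinct records each
         thread touches, so shrink N in the same proportion. */
      long long defaultStride = T;        /* records */
      long long requestedStride = 10000;  /* records */
      long long adjustedN = N * defaultStride / requestedStride;
      printf("suggested -n for stride %lld: %lld bytes\n",
             requestedStride, adjustedN);
      return 0;
    }
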
Computation of aggregate data rate and utilization
--------------------------------------------------

The results of a run of gpfsperf are reported as an aggregate data rate.
Data rate is defined as the total number of bytes read or written by all
threads divided by the total time of the test. It is reported in units of
1000 bytes/second. The test time is measured from before any node opens the
test file until after the last node closes the test file. To ensure a
consistent environment for each test, before beginning the timed test period
gpfsperf issues the GPFS_CLEAR_FILE_CACHE hint on all nodes. This flushes
the GPFS buffer cache and releases byte-range tokens. Note that versions of
GPFS prior to v1.3 do not support GPFS file hints, so the state of the
buffer cache at the beginning of a test will be influenced by the state left
by the prior test.

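For readers who want to issue the same hint from their own code, the sketch
below shows roughly how a gpfs_fcntl hint such as GPFS_CLEAR_FILE_CACHE is
constructed. It assumes the gpfs_fcntl.h structure names from GPFS 1.3;
verify against the header on your system, since this is an illustration
rather than an excerpt from gpfsperf.

    /* clearcache.c - sketch of issuing the GPFS_CLEAR_FILE_CACHE hint. */
    #include <stdio.h>
    #include <gpfs_fcntl.h>

    int clearFileCache(int fd)
    {
      struct
      {
        gpfsFcntlHeader_t    hdr;
        gpfsClearFileCache_t inv;
      } hint;

      hint.hdr.totalLength = sizeof(hint);
      hint.hdr.fcntlVersion = GPFS_FCNTL_CURRENT_VERSION;
      hint.hdr.fcntlReserved = 0;
      hint.inv.structLen = sizeof(hint.inv);
      hint.inv.structType = GPFS_CLEAR_FILE_CACHE;

      /* Flush and invalidate this file's cached blocks on this node and
         release its byte-range tokens. */
      if (gpfs_fcntl(fd, &hint) != 0)
      {
        perror("gpfs_fcntl(GPFS_CLEAR_FILE_CACHE)");
        return -1;
      }
      return 0;
    }
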
Since all threads of a gpfsperf run do approximately the same amount of work
(read or write N/T bytes), in principle they should all run for the same
amount of time. In practice, however, variations in disk and switch
response time lead to variations in execution times among the threads. Lock
contention in GPFS further contributes to these variations in execution
times. A large degree of non-uniformity is undesirable, since it means that
some nodes are idle while they wait for threads on other nodes to finish.
To measure the degree of uniformity of thread execution times, gpfsperf
computes a quantity it calls "utilization." Utilization is the fraction of
the total number of thread-seconds in a test during which threads actively
perform reads or writes. A value of 1.0 indicates perfect overlap, while
lower values denote that some threads were idle while others still ran.

The following timeline illustrates how gpfsperf computes utilization for a
test involving one thread on each of two nodes, reading a total of 100M:

    time   event
    ----   -----
     0.0   Node 0 captures timestamp for beginning of test
     ...   both nodes open file
     0.1   Node 0 about to do first read
     0.1   Node 1 about to do first read
     ...   many file reads
     4.9   Node 0 finishes last read
     4.9   Node 1 finishes last read
     ...   both nodes close file
     5.0   Node 0 captures timestamp for end of test

The utilization is ((4.9-0.1) + (4.9-0.1)) / (2*(5.0-0.0)) = 0.96. The
reported aggregate data rate would be 100M / (5.0-0.0) = 20M/sec.

If node 0 ran significantly slower than node 1, it might have finished its
last read at time 9.9 instead of at time 4.9, and the end of test timestamp
might be 10.0 instead of 5.0. In this case the utilization would drop to
((4.9-0.1) + (9.9-0.1)) / (2*(10.0-0.0)) = 0.73, and the data rate would be
100M / (10.0-0.0) = 10M/sec.

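The same computation expressed as code may be easier to follow. The
hypothetical fragment below reproduces the first example: utilization is
busy thread-seconds divided by total thread-seconds, and the data rate is
total bytes divided by elapsed test time. It is not the gpfsperf source.

    /* util.c - utilization and data rate for the two-node example above. */
    #include <stdio.h>

    int main(void)
    {
      double testStart = 0.0, testEnd = 5.0;     /* node 0's test timestamps */
      double ioStart[2] = { 0.1, 0.1 };          /* first read, per thread   */
      double ioEnd[2]   = { 4.9, 4.9 };          /* last read, per thread    */
      double totalBytes = 100.0e6;               /* 100M read in total       */
      int    nThreads = 2, i;
      double busy = 0.0;

      for (i = 0; i < nThreads; i++)
        busy += ioEnd[i] - ioStart[i];           /* thread-seconds of I/O    */

      printf("utilization %.2f\n",
             busy / (nThreads * (testEnd - testStart)));   /* 0.96           */
      printf("data rate %.0f KB/sec\n",
             totalBytes / (testEnd - testStart) / 1000.0); /* 20000 = 20M/sec */
      return 0;
    }
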
Command line parameters
-----------------------

There are four versions of gpfsperf in this directory:
  gpfsperf-mpi     - runs on multiple nodes under MPI, requires GPFS v1.3 or later
  gpfsperf         - runs only on a single node, requires GPFS v1.3 or later
  gpfsperf-mpi-v12 - runs on multiple nodes under MPI, does not use GPFS v1.3 features
  gpfsperf-v12     - runs only on a single node, does not use GPFS v1.3 features

The command line for any of the versions of gpfsperf is:

  gpfsperf[-mpi] operation pattern fn [options]

The order of parameters on the command line is not significant.

The operation must be either "create", "read", "write", or "uncache". All
threads in a multinode or multithreaded run of gpfsperf do the same
operation to the same file, but at different offsets. Create means to
ensure that the file exists, then do a write test. The uncache operation
does not read or write the file, but only removes any buffered data for the
file from the GPFS buffer cache.

Pattern must be one of "rand", "randhint", "strided", or "seq". The
meaning of each of these was explained earlier, except for "randhint". The
"randhint" pattern is the same as "rand", except that the GPFS multiple
access range hint is used through the library functions in irreg.c to
prefetch blocks before they are accessed by gpfsperf.

The filename parameter fn should resolve to a file in a GPFS file system
that is mounted on all nodes where gpfsperf is to run. The file must
already exist unless the "create" operation was specified. Use of gpfsperf
on files not in GPFS may be meaningful in some situations, but this use has
not been tested.

Each optional parameter is described below, along with its default value.

-nolabels     - Produce a single line of output containing all parameters
                and the measured data rate. Format of the output line is 'op
                pattern fn recordSize nBytes fileSize nProcs nThreads
                strideRecs inv ds dio fsync reltoken aio osync rate util'.
                This format may be useful for importing results into a
                spreadsheet or other program for further analysis. The
                default is to produce multi-line labelled output.

-r recsize    - Record size. Defaults to the file system block size. Must
                be specified for create operations.

-n nBytes     - Number of bytes to transfer. Defaults to the file size.
                Must be specified for create operations.

-s stride     - Number of bytes between successive accesses by the
                same thread. Only meaningful for strided access patterns.
                Must be a multiple of the record size. See the earlier
                cautions about combining large values of -s with large
                values of -n. The default is the total number of threads
                (threads per process times number of processes), in records.

-th nThreads  - Number of threads per process. Default is 1. When there
                are multiple threads per process they read adjacent blocks
                of the file for the sequential and strided access patterns.
                For example, suppose a file of 60 records is being read by 3
                nodes with 2 threads per node. Under the sequential pattern,
                thread 0 on node 0 will read records 0-9, thread 1 on node 0
                will read records 10-19, thread 0 on node 1 will read records
                20-29, etc. Under a strided pattern, thread 0 on node 0 will
                read records 0, 6, 12, ..., 54, thread 1 on node 0 will read
                records 1, 7, 13, ..., 55, etc.

-noinv        - Do not clear blocks of fn from the GPFS file cache before
                starting the test. The default is to clear the cache. If
                this option is given, the results of the test can depend
                strongly on the lock and buffer state left by the last test.
                For example, a multinode sequential read with -noinv will
                run more slowly after a strided write test than after a
                sequential write test.

-ds           - Use GPFS data shipping. Data shipping avoids lock conflicts
                by partitioning the file among the nodes running gpfsperf,
                turning off byte range locking, and sending messages to the
                appropriate agent node to handle each read or write request.
                However, since (n-1)/n of the accesses are remote, the
                gpfsperf threads cannot take advantage of local block
                caching, although they may still benefit from prefetching
                and writebehind. Also, since byte range locking is not in
                effect, use of data shipping suspends the atomicity
                guarantees of X/Open file semantics. See the GPFS Guide and
                Reference manual for more details. Data shipping should show
                the largest performance benefit for strided writes that have
                small record sizes. The default is not to use data shipping.

-aio depth    - Use asynchronous I/O, prefetching to the given depth
                (default 0, max 1000). This can be used with any of the
                seq/rand/strided test patterns. An illustrative sketch of
                this style of prefetching appears below, just before the
                AIX notes.

-dio          - Use the direct I/O flag when opening the file. This allows
                sector aligned/sized buffers to be read/written directly
                from the application buffer to the disks where the blocks
                are allocated. A sketch of the open() flags behind -dio and
                -osync appears after this list of options.

-reltoken     - Release the entire file byte-range token after the file
                is newly created. In a multi-node environment (MPI), only
                the first process will create the file and all the other
                processes will wait and open the file after the creation
                has occurred. This flag tells the first process to release
                the byte-range token it automatically gets during the create.
                This may increase performance because other nodes that work
                on different ranges of the file will not need to revoke the
                range held by the node running the first process.

-fsync        - Ensure that no dirty data remain buffered at the conclusion
                of a write or create test. The time to perform the necessary
                fsync operation is included in the test time, so this option
                reduces the reported aggregate data rate. The default is not
                to fsync the file.

-osync        - Turn on the O_SYNC flag when opening the file.
                This causes every write operation to force the data to disk
                on each call. The default is not to use O_SYNC.

-v            - Verbose tracing. In a multinode test using gpfsperf, output
                from each instance of the program will be intermingled. By
                telling MPI to label the output from each node (set the
                MP_LABELIO environment variable to yes), the verbose output
                will make more sense.

-V            - Very verbose tracing. This option will display the offset
                of every read or write operation on every node. As with -v,
                labelling the output by node is suggested.

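The -dio and -osync options simply change the flags used on the open() call.
The hypothetical fragment below illustrates the idea; O_SYNC is standard,
while the direct I/O flag is platform dependent, so treat this as a sketch
rather than what gpfsperf itself does.

    /* openflags.c - sketch of the open() flags behind -osync and -dio. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
      int flags = O_RDWR;
      int fd;

      if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }

      flags |= O_SYNC;        /* -osync: each write is forced to disk        */
    #ifdef O_DIRECT
      flags |= O_DIRECT;      /* -dio: bypass the buffer cache; buffers must */
    #endif                    /* then be sector aligned and sector sized     */

      fd = open(argv[1], flags);
      if (fd < 0) { perror("open"); return 1; }
      close(fd);
      return 0;
    }
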
Numbers in options can be given using K, M, or G suffixes, in upper or lower
case, to denote 2**10, 2**20, or 2**30, respectively, or can have an R or r
suffix to denote a multiple of the record size. For example, to specify a
record size of 4096 bytes and a size to read or write of 409600, one could
write "-r 4k -n 100r".

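As mentioned under -aio above, asynchronous I/O lets a thread keep several
reads in flight at once. The fragment below is a minimal, hypothetical
illustration of that idea using the POSIX AIO interface; it is not taken
from gpfsperf, whose use of AIO on AIX depends on the AIO kernel extension
described below.

    /* aiodepth.c - sketch of keeping 'depth' reads in flight with POSIX AIO. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define DEPTH    4          /* like -aio 4            */
    #define RECSIZE  65536      /* record size in bytes   */

    int main(int argc, char **argv)
    {
      struct aiocb cb[DEPTH];
      char *buf[DEPTH];
      off_t nextOffset = 0;
      int i, fd;

      if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
      fd = open(argv[1], O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      /* Start DEPTH reads before waiting for any of them. */
      for (i = 0; i < DEPTH; i++)
      {
        buf[i] = malloc(RECSIZE);
        memset(&cb[i], 0, sizeof(cb[i]));
        cb[i].aio_fildes = fd;
        cb[i].aio_buf = buf[i];
        cb[i].aio_nbytes = RECSIZE;
        cb[i].aio_offset = nextOffset;
        nextOffset += RECSIZE;
        if (aio_read(&cb[i]) != 0) { perror("aio_read"); return 1; }
      }

      /* Harvest each read; a benchmark would issue a new read here to keep
         the pipeline full. */
      for (i = 0; i < DEPTH; i++)
      {
        const struct aiocb *list[1] = { &cb[i] };
        while (aio_error(&cb[i]) == EINPROGRESS)
          aio_suspend(list, 1, NULL);
        printf("read %zd bytes at offset %lld\n",
               aio_return(&cb[i]), (long long)cb[i].aio_offset);
      }
      return 0;
    }
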
AIX only:
If the Asynchronous I/O (AIO) kernel extension has not been loaded yet,
running the gpfsperf program will fail and display output like:

  exec(): 0509-036 Cannot load program gpfsperf because of the following errors:
          0509-130 Symbol resolution failed for /usr/lib/threads/libc.a(aio.o) because:
          0509-136 Symbol kaio_rdwr (number 0) is not exported from
                   dependent module /unix.
          0509-136 Symbol listio (number 1) is not exported from
                   dependent module /unix.
          0509-136 Symbol acancel (number 2) is not exported from
                   dependent module /unix.
          0509-136 Symbol iosuspend (number 3) is not exported from
                   dependent module /unix.
          0509-136 Symbol aio_nwait (number 4) is not exported from
                   dependent module /unix.
          0509-192 Examine .loader section symbols with the
                   'dump -Tv' command.

If you do not wish to use AIO, you can recompile the gpfsperf program to not
use the AIO calls:

  rm gpfsperf.o gpfsperf-mpi.o
  make OTHERINCL="-DNO_AIO"

Alternatively, enable AIO on your system by issuing commands like the
following:

  lsattr -El aio0
  chdev -l aio0 -P -a autoconfig=available -a minservers=10 -a maxservers=128
  mkdev -l aio0

Minservers just tells AIX how many AIO kprocs to create immediately, and
maxservers limits the total number created. On AIX 5.2 the meaning of
maxservers changed to "maximum number of servers per CPU", so a
16-way SMP should set maxservers=8 to get a total of 128 kprocs.


Examples
--------

Suppose that /gpfs is a GPFS file system that was formatted with 256K
blocks and that it has at least a gigabyte of free space. Assuming that
the gpfsperf programs have been copied into /gpfs/test, and that
/gpfs/test is the current directory, the following ksh commands
illustrate how to run gpfsperf:

  # Number of nodes on which the test will run. If this is increased, the
  # size of the test file should also be increased.
  export MP_PROCS=8

  # File containing a list of nodes on which gpfsperf will run. There are
  # other ways to specify where the test runs besides using an explicit
  # host list. See the Parallel Operating Environment documentation for
  # details.
  export MP_HOSTFILE=/etc/cluster.nodes

  # Name of test file to be manipulated by the tests that follow.
  export fn=/gpfs/test/testfile

  # Verify block size
  mmlsfs gpfs -B

  # Create test file. All write tests in these examples specify -fsync, so
  # the reported data rate includes the overhead of flushing all dirty buffers
  # to disk. The size of the test file should be increased if more than 8
  # nodes are used or if GPFS pagepool sizes have been increased from their
  # defaults. It may be necessary to increase the maximum size file the user
  # is allowed to create. See the ulimit command.
  ./gpfsperf-mpi create seq $fn -n 999m -r 256k -fsync

  # Read entire test file sequentially
  ./gpfsperf-mpi read seq $fn -r 256k

  # Rewrite test file sequentially using full block writes
  ./gpfsperf-mpi write seq $fn -r 256k -fsync

  # Rewrite test file sequentially using small writes. This requires GPFS to
  # read blocks in order to update them, so it will have worse performance
  # than the full block rewrite.
  ./gpfsperf-mpi write seq $fn -r 64k -fsync

  # Strided read using big records
  ./gpfsperf-mpi read strided $fn -r 256k

  # Strided read using medium sized records. Performance is worse because
  # average I/O size has gone down. This behavior will not be seen unless
  # the stride is larger than a block (8*50000 > 256K).
  ./gpfsperf-mpi read strided $fn -r 50000

  # Strided read using a very large stride. Reported performance is
  # misleading because each node just reads the same records over and over
  # from its GPFS buffer cache.
  ./gpfsperf-mpi read strided $fn -r 50000 -s 2400r

  # Strided write using a record size equal to the block size. Decent
  # performance, since record size matches GPFS lock granularity.
  ./gpfsperf-mpi write strided $fn -r 256k -fsync

  # Strided write using small records. Since GPFS lock granularity is
  # larger than a record, performance is much worse. Number of bytes
  # written is less than the entire file to keep test time reasonable.
  ./gpfsperf-mpi write strided $fn -r 10000 -n 100m -fsync

  # Strided write using small records and data shipping. Data shipping
  # trades additional communication overhead for less lock contention,
  # improving performance.
  ./gpfsperf-mpi write strided $fn -r 10000 -n 100m -ds -fsync

  # Random read of small records
  ./gpfsperf-mpi read rand $fn -r 10000 -n 100m

  # Random read of small records using the GPFS multiple access range hint.
  # Better performance (assuming more than MP_PROCS disks) because each node
  # has more than one disk read in progress at once due to prefetching.
  ./gpfsperf-mpi read randhint $fn -r 10000 -n 100m