MPICL INSTRUMENTATION COMMANDS

The instrumentation aspects of MPICL are identical to those of PICL 2.0,
with two exceptions.

1) New communication events have been defined to represent the MPI
   commands for which there are no PICL equivalents. The basic format of
   the trace data is unchanged.

2) All MPICL instrumentation commands can be called using the
   MPI_PCONTROL interface, allowing codes containing MPICL
   instrumentation calls to be used both for performance evaluation
   (linking in the MPICL library) and for production (without linking in
   the MPICL library and collecting no performance data).

The MPI_PCONTROL interface is discussed in the PCONTROL INTERFACE section
later in this document. The new event types are described in the file
mpicl.format.

Note that MPICL includes the original PICL library. Use of the PICL
message-passing commands and PICL-specific instrumentation commands is
not discussed here. See the file picl2.commands for this information.

-----------------------------------------------------------------------------
MPICL INSTRUMENTATION OVERVIEW

The instrumentation interface supports a great deal of flexibility, but
the following "template" describes how instrumentation is most often
used. Note that the data is never flushed explicitly in the template.
Rather, MPI_Finalize takes care of the handshaking required to guarantee
that flushing does not interfere with the performance being measured.
Also, although it is not a strict requirement, MPI_INIT typically must be
called before any of the MPICL commands.

  tracefiles      - specify temporary and/or permanent trace files
                    (required. If node process 0 opens a tracefile, then
                    data from other processes that have not done so will
                    be funneled through process 0.)
  tracestatistics - specify which user events to collect statistics for
                    (optional)
  tracelevel      - specify level of tracing
                    (optional, but no data collected if levels not set)
  tracenode       - begin tracing
                    (sync option required if collecting data for
                    ParaGraph; sync option not required if only
                    collecting statistics)
          .
          .
          .
  traceevent      - to record user events
                    (optional, and called as often as needed)
  tracedata       - to record special user event data
                    (optional, and called as often as needed)
          .
          .
          .
  MPI_Finalize    - turn off tracing, wait until all processes are
                    finished, renormalize clocks if necessary, then flush
                    trace data to disk one process at a time

-----------------------------------------------------------------------------
ENABLING COMMANDS

void tracenode(int tracesize, int flush, int sync)
  : node initialization routine
    - tracesize is the number of bytes to be allocated for data storage
    - flush == 1, if space runs out, send the data to secondary storage
                  and reinitialize
            == 2, if space runs out, overwrite the data
            otherwise, if space runs out, stop collecting data
    - sync  == 0, do nothing
            == 1, synchronize the processor clocks.

-----------------------------------------------------------------------------
CONTROL COMMANDS

void tracefiles(char *tempfile, char *permfile, int verbose)
  : used for specifying temporary and permanent disk storage for trace
    data
    - tempfile is the prefix (including directory) of the name of the
      disk file to be used for temporary storage of trace data. A suffix
      (the node number) is appended to make all temporary files unique.
      If a null string is specified for this parameter, a temporary file
      is not used.
    - permfile is the name of the disk file where this node's trace data
      should be sent for "permanent" storage. If a null string is
      specified for this parameter, the data is sent to processor 0. If
      processor 0 does not specify a permanent tracefile name, the data
      sent to it or generated locally is not saved.
    - verbose == 1, fields in trace records are labelled
              != 1, fields are not labelled (ParaGraph-readable form)

void tracestatistics(int events, int picltime, int piclcnt, int piclvol,
                     int usertime, int usercnt)
  : trace events initialization routine
    - events is the maximum number of user events for which statistics
      are to be collected. The event types for which statistics will be
      recorded are types {0,...,events-1}
    - picltime is a switch (0/1) indicating whether time spent in PICL
      events within user events should be recorded
    - piclcnt is a switch (0/1) indicating whether PICL event occurrences
      within user events should be recorded
    - piclvol is a switch (0/1) indicating whether PICL event volumes
      within user events should be recorded
    - usertime is a switch (0/1) indicating whether time spent in (other)
      user events within user events should be recorded
    - usercnt is a switch (0/1) indicating whether (other) user event
      occurrences within user events should be recorded

void tracelevel(int mpi, int user, int trace)
  : set the types of tracing data collected
    - mpi: tracing level for MPI commands
    - user: tracing level for user-specified events
    - trace: tracing level for MPICL commands
    For each of these levels:
      if  < 0, then instrumentation is disabled.
      if >= 0, then statistics are collected.
      if  > 0, then event records are generated.

void traceinfo(int *remaining, int *mpi, int *user, int *trace)
  : get instrumentation information
    - remaining: approximate number of event records that can be saved in
      the remaining free storage in the instrumentation work space
    - mpi: tracing level for MPI commands
    - user: tracing level for user-specified events
    - trace: tracing level for MPICL commands

void traceexit()
  : stop tracing

void traceflush()
  : send data to the temporary or permanent trace file and reinitialize
    the data storage area. (Implicitly called in MPI_Finalize if
    instrumentation has ever been enabled.)
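
The three conditions listed for tracelevel overlap (a level greater than
zero also satisfies ">= 0"), so each level effectively selects one of
three modes. The following is a minimal sketch of that reading in C. The
names trace_mode, classify_trace_level, and trace_mode_name are
hypothetical and are not part of the MPICL library; they only restate the
rules given above.

```c
/* Hypothetical helper: not part of MPICL. It restates the tracelevel
 * rules from the text: < 0 disables instrumentation, >= 0 collects
 * statistics, and > 0 additionally generates event records. */
enum trace_mode {
    TRACE_DISABLED,      /* level <  0: no data collected            */
    TRACE_STATISTICS,    /* level == 0: statistics only              */
    TRACE_EVENT_RECORDS  /* level >  0: statistics and event records */
};

/* Map one tracing level (mpi, user, or trace) to its mode. */
enum trace_mode classify_trace_level(int level)
{
    if (level < 0)
        return TRACE_DISABLED;
    if (level == 0)
        return TRACE_STATISTICS;
    return TRACE_EVENT_RECORDS;
}

/* Human-readable name for a mode, e.g. for a debugging message. */
const char *trace_mode_name(enum trace_mode mode)
{
    switch (mode) {
    case TRACE_DISABLED:      return "disabled";
    case TRACE_STATISTICS:    return "statistics";
    case TRACE_EVENT_RECORDS: return "statistics+records";
    }
    return "unknown";
}
```

Under this reading, tracelevel(0,0,0) corresponds to statistics-only
profiling and tracelevel(1,1,1) to full event tracing, while
tracelevel(-1,-1,-1) turns instrumentation off.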

Statistics records are a good way to determine general performance
without requiring the collection of detailed trace data. But collecting
"nested" statistics (recording occurrences of events within a given
event) can require a large amount of internal storage. tracestatistics is
used to specify what user event statistics are to be collected, and
whether nested statistics are to be collected. Note that MPI calls and
MPICL instrumentation commands are not nested, so tracestatistics only
affects user events. tracestatistics only allocates memory for the
collection. tracelevel must still be used to enable the collection of
statistics.

-----------------------------------------------------------------------------
USER EVENT COMMANDS

void traceevent(char *recordstring, int event, int nparams, int *params)
  : used to record information about a user-defined event
    - recordstring: record the beginning ("entry"), ending ("exit"), or
      simple occurrence ("mark") of a user event, label the event
      ("label"), or write a message to the trace file immediately
      ("message").
    - event: user-specified event identifier. It should be a nonnegative
      integer. If statistics for this event are to be collected, the
      event id should also be less than the "events" field specified in
      the call to tracestatistics.
    - nparams: number of integers or characters (see the params
      description below) in the params data.
    - params: The data associated with the entry, exit, and mark records
      should be integer. The data associated with the label and message
      records should be character.

void tracedata(int event, int dataid, char *datatype, int items,
               char *data)
  : used to save (additional) data associated with a user-defined event.
    - event: user-specified event identifier. It should be a nonnegative
      integer.
    - dataid: user-specified data identifier. This is used by the user to
      identify the data, and the only restriction is that it be an
      integer.
    - datatype: character string indicating type of data.
      Supported data types are "character", "integer", "long", "float",
      "real", and "double".
    - items: number of data elements (of the specified type)
    - data: user event data, of the specified type.

-----------------------------------------------------------------------------
FORTRAN COMMANDS

For most platforms, the Fortran callable routines have the same names and
parameters as the C routines, modulo the usual differences in parameter
passing. But, on some platforms, C and Fortran external names are
indistinguishable. To force the correct routines to be linked, MPICL also
provides versions of the Fortran callable routines with an "f" suffix in
the name. Thus, for example, to be portable between SUN and RS6000
workstations, use CALL TRACENODEF instead of CALL TRACENODE when enabling
instrumentation in a Fortran program.

-----------------------------------------------------------------------------
PCONTROL INTERFACE

The MPI standard provides the MPI_Pcontrol command as a way to support
instrumentation without making instrumentation commands part of the
standard. When an instrumentation library is linked in, a call to
MPI_Pcontrol will invoke one of the instrumentation commands. When the
library is missing, the call to MPI_Pcontrol does nothing.

MPI_Pcontrol uses an integer to identify which instrumentation command to
call. This information is encoded in the file pcontrol.h, and is repeated
below.

  #define PCDISABLE         0
  #define PCENABLE          1
  #define PCFLUSH           2
  #define TRACENODE         3
  #define TRACEFLUSH        5
  #define TRACELEVEL        6
  #define TRACEDATA         7
  #define TRACEEVENT        8
  #define TRACEFILES      101
  #define TRACEEXIT       102
  #define TRACEINFO       103
  #define TRACESTATISTICS 104

For Fortran programs, include the file pcontrolf.h. Both pcontrol.h and
pcontrolf.h can be found in mpicl/INCLUDE .

For example, the command tracefiles can now be called by

  tracefiles("","tracefile",0);

or

  MPI_Pcontrol(101,"","tracefile",0);

There are three new commands described here: PCDISABLE, PCENABLE, and
PCFLUSH.
These are specified in the MPI standard, and correspond to the following.

  PCENABLE:  tracefiles("", "mpicltrace", 0);
             tracenode(100000, 0, 0);
  PCDISABLE: tracelevel(-1,-1,-1);
  PCFLUSH:   traceflush();

This particular encoding was based partly on that developed by Loretta
Elwood and Michael Heath in their MPI tracing package.

-----------------------------------------------------------------------------
EXAMPLES

a) Profiling the MPI communication in an application code using the MPICL
   routines directly.

      CALL MPI_INIT(IERR)
      CALL TRACELEVELF(0,0,0)
      CALL TRACEFILESF('','tracefile',0)
      CALL TRACENODEF(1000000,0,1)
      .
      .
      .
      CALL MPI_FINALIZE(IERR)

   This concatenates all trace data in a single file named 'tracefile'.
   On some systems this may not work well, and it may be safer to specify
   the permanent tracefile on process 0 only.

b) Tracing the MPI communication in an application code using the
   MPI_Pcontrol interface.

      CALL MPI_INIT(IERR)
      CALL MPI_PCONTROL(6,1,1,1,IERR)
      CALL MPI_PCONTROL(101,'','tracefile',0,IERR)
      CALL MPI_PCONTROL(3,1000000,0,1,IERR)
      .
      .
      .
      CALL MPI_FINALIZE(IERR)

   Note that the only difference between these two examples, other than
   the use of MPI_PCONTROL, is that one sets the instrumentation levels
   to (0,0,0), while the other sets them to (1,1,1). Note also that
   instrumentation will be disabled here as soon as the data storage
   fills up, because the flush argument to tracenode is 0.

c) Profiling user-events.

      INCLUDE 'pcontrolf.h'

      CALL MPI_INIT(IERR)
      CALL MPI_PCONTROL(TRACELEVEL,1,1,1,IERR)
      CALL MPI_PCONTROL(TRACEFILES,'','tracefile',0,IERR)
      CALL MPI_PCONTROL(TRACESTATISTICS,200,0,0,0,0,0,IERR)
      CALL MPI_PCONTROL(TRACENODE,1000000,0,1,IERR)
      .
      .
      .
      CALL MPI_PCONTROL(TRACEEVENT, 'entry', 0, 0, 0, IERR)
      CALL X(...)
      CALL MPI_PCONTROL(TRACEEVENT, 'exit', 0, 0, 0, IERR)
      .
      .
      .
      DO I=1,100
         CALL MPI_PCONTROL(TRACEEVENT, 'entry', I+100, 0, 0, IERR)
         ...
         CALL MPI_PCONTROL(TRACEEVENT, 'entry', 1, 1, I, IERR)
         CALL Y(...)
         CALL MPI_PCONTROL(TRACEEVENT, 'exit', 1, 1, I, IERR)
         ...
         CALL MPI_PCONTROL(TRACEEVENT, 'entry', 2, 1, I, IERR)
         CALL Z(...)
         CALL MPI_PCONTROL(TRACEEVENT, 'exit', 2, 1, I, IERR)
         ...
         CALL MPI_PCONTROL(TRACEEVENT, 'exit', I+100, 0, 0, IERR)
      ENDDO
      .
      .
      CALL MPI_FINALIZE(IERR)

   This code identifies calls to routines X, Y, and Z as events 0, 1, and
   2, respectively. It also identifies each time through the DO loop as a
   separate event, denoted as the DO loop index plus 100. Each event
   record for calls to Y and Z also contains the DO loop index current
   during the call.

-----------------------------------------------------------------------------
FINAL COMMENTS

While the user interface to the MPICL instrumentation library is fairly
simple, there are subtleties to using the library, many of which are
machine or MPI-implementation dependent. If problems arise, we are
interested in hearing about them. However, we cannot guarantee that we
will be able to respond in a timely manner (if at all).
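
As a closing note on example c): the tracestatistics description says
statistics are recorded for event types {0,...,events-1}, and the example
numbers its loop events as the DO loop index plus 100. The sketch below
checks which of those event ids fall inside the statistics range. The
helper names are hypothetical and not part of MPICL, and the check is
only an inference from the rule as written in this document, not a
statement about how the library itself behaves.

```c
/* Hypothetical helpers: not part of MPICL. They apply the rule stated
 * for tracestatistics, namely that statistics are kept for event types
 * {0,...,events-1}. */

/* Returns 1 if event_id lies in the statistics range, 0 otherwise. */
int event_has_statistics(int event_id, int max_events)
{
    return event_id >= 0 && event_id < max_events;
}

/* Event id that example c) assigns to pass i of the DO loop. */
int loop_event_id(int i)
{
    return i + 100;
}

/* Count the loop passes first..last whose event id lies in the range. */
int count_loop_events_with_statistics(int first, int last, int max_events)
{
    int count = 0;
    for (int i = first; i <= last; i++) {
        if (event_has_statistics(loop_event_id(i), max_events))
            count++;
    }
    return count;
}
```

Read literally, the rule means that with events set to 200 the final pass
of the loop (I = 100, event id 200) lies just outside the statistics
range, while a value of 201 would cover all 100 passes.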