Working Note 1: Runtime versus Compile Time Parameters

Author: Pat Worley

June 2000. Please send review comments to worleyph@ornl.gov

Here are my current results. To summarize, we have a programming "style" that should make supporting both runtime and compile-time loop bounds and array dimensions "trivial". I also have results indicating that runtime loop bounds do not cost performance on the IBM SP. My next task is to look at runtime vs. compile-time array declarations, and to repeat all of this on the AlphaServer SC.

1) The first "result" is an observation by Bill Putman that it is trivial to support both the compile-time and runtime options. If the number of columns is declared in the "physics data structures module", which seems to be consistent with the rest of the design, then it can be defined as either a parameter or a runtime variable without touching the rest of the code. That is, the number of columns is not passed into the physics routines. Only a pointer to the data structure is passed, and that structure is defined in a module. If the number of columns per segment needs to vary, we can still use this approach by computing on bogus columns or by using masks.

Note that I feel that this is a crucial observation. No matter how many tests we make now, we cannot be sure that some platform in the near term will not behave differently. By making the number of columns, "nlon", a global variable, we can address any performance issues when they arise without changing the body of the code.
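
As an illustration only, the style described in 1) might be sketched as follows. This is a Python stand-in, not code from CRM: the names FIXED_NLON, resolve_nlon, PhysicsSegment, make_segment and warm_columns are all hypothetical, and Python cannot show the compile-time optimization itself. The sketch only shows that the routine never receives the column count directly, and that a segment with fewer real columns can be padded and masked.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical stand-in for the "physics data structures module".
# FIXED_NLON plays the role of a compile-time parameter; leaving it as None
# means the segment width is supplied at run time instead.
FIXED_NLON: Optional[int] = None


def resolve_nlon(runtime_nlon: Optional[int] = None) -> int:
    """Return the segment width from the fixed constant or the runtime value."""
    if FIXED_NLON is not None:
        return FIXED_NLON
    if runtime_nlon is None:
        raise ValueError("nlon must be supplied at run time")
    return runtime_nlon


@dataclass
class PhysicsSegment:
    nlon: int                 # declared number of columns in the segment
    mask: List[bool]          # True where a column holds real data
    temperature: List[float]  # one value per column, padded to nlon


def make_segment(values: List[float], nlon: int) -> PhysicsSegment:
    """Pad a shorter list of columns out to nlon and record which are real."""
    if len(values) > nlon:
        raise ValueError("more columns than the segment can hold")
    pad = nlon - len(values)
    return PhysicsSegment(
        nlon=nlon,
        mask=[True] * len(values) + [False] * pad,
        temperature=list(values) + [0.0] * pad,
    )


def warm_columns(seg: PhysicsSegment, delta: float) -> None:
    """Physics-style routine: it receives only the structure, never nlon."""
    for i in range(seg.nlon):      # loop bound comes from the structure
        if seg.mask[i]:            # skip the bogus padding columns
            seg.temperature[i] += delta
```

Switching between the two options then means changing FIXED_NLON in one place. The body of warm_columns is the same either way, which is the property the observation relies on.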

2) "The Experiments"

- CRM experiments with 18 vertical levels (NVER).
- Arrays defined as (PLOND,NVER,PLAT).
- PLOND and PLAT declared at compile-time.
- Modified CRM to either read in PLON or define it at compile time. The number of latitudes actually computed, rlat <= PLAT, is determined at runtime.
- Running on ORNL SP (375 MHz POWER3-II processors with 8MB L2 cache).
- 2 experiments, each run with both the -O3 and -O3 -qhot compiler options on the SP:
           i) runtime PLON=1 vs. compile-time PLON=1; PLOND=1,..,512
          ii) runtime PLON=1,..,512 vs. compile-time PLON=1,..,512; PLOND=PLON

These experiments examine the effects of runtime loop bounds and compile-time array size declarations. In the original compile-time experiments, -O3 produced consistently mediocre performance as PLON varied, while -O3 -qhot produced good performance when PLON > 16.

i) MFlop/s for PLON=1; PLOND=1,..,512; PLAT=512

PLOND   runtime   runtime      compile-time   compile-time
        -O3       -O3 -qhot    -O3            -O3 -qhot
    1     111        75          115            112
    2     113        74          114            112
    4     110        73          111            109
    8     102        70          103            103
   16      85        62           85             89
   32      71        58           72             79
   64      47        46           45             62
  128      22        29           22             35
  256      11        16           11             17
  512       6         9            6              9

So, CRM performance IS cache sensitive on the IBM. If the columns being computed are widely separated in memory (PLON=1, PLOND >> 1), then performance becomes VERY poor. With -O3, compile-time and runtime performance are essentially identical. With -O3 -qhot, compile-time performance matches -O3 for small PLOND and is somewhat better for PLOND >= 16, while runtime -O3 -qhot performance is clearly worse than -O3 for PLOND <= 32. (Not knowing loop bounds and poor cache locality leads -qhot optimizations astray?)
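
To make "widely separated in memory" concrete, here is a rough sketch of the layout arithmetic for an array dimensioned (PLOND,NVER,PLAT). It assumes column-major storage and 8-byte reals, neither of which is stated above; the 8MB L2 size is the one listed for the ORNL SP. The function names are made up for this illustration.

```python
REAL_BYTES = 8                   # assumption: 8-byte reals
L2_BYTES = 8 * 1024 * 1024       # 8MB L2 cache, as listed for the ORNL SP
NVER = 18                        # vertical levels used in the experiments


def column_major_offset(i: int, k: int, j: int, plond: int, nver: int = NVER) -> int:
    """Zero-based element offset of A(i,k,j) in an array dimensioned
    (plond, nver, plat), with 1-based indices and column-major storage."""
    return (i - 1) + plond * ((k - 1) + nver * (j - 1))


def level_stride_bytes(plond: int) -> int:
    """Bytes between vertical levels k and k+1 of the same column."""
    step = column_major_offset(1, 2, 1, plond) - column_major_offset(1, 1, 1, plond)
    return step * REAL_BYTES


def array_bytes(plond: int, plat: int, nver: int = NVER) -> int:
    """Total storage for one (plond, nver, plat) array."""
    return plond * nver * plat * REAL_BYTES


def fits_in_l2(plond: int, plat: int) -> bool:
    """True if a single array of this shape is no larger than the L2 cache."""
    return array_bytes(plond, plat) <= L2_BYTES
```

Under these assumptions, with PLOND=512 and PLAT=512 successive levels of one column are 4096 bytes apart and a single array occupies about 36MB, of which only 1/512 is touched when PLON=1. With PLOND=1 the same array is 72KB. This is only the address arithmetic; it does not by itself explain the measured rates.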

ii) MFlop/s for PLON=1,..,512; PLOND=PLON; PLAT=512,..,1

PLON    runtime   runtime      compile-time   compile-time
        -O3       -O3 -qhot    -O3            -O3 -qhot
    1     111        75          115            112
    2     119       109          120            115
    4     123       166          123            123
    8     124       192          124            166
   16     123       235          123            236
   32     122       273          122            269
   64     120       282          120            278
  128     117       274          117            276
  256     116       269            -            265
  512     113       263          113            259

With -O3, runtime and compile-time performance are again essentially identical as the number of columns in the segment varies (PLON loop bound increases), and both are relatively insensitive to this variation. -O3 -qhot performance, in contrast, is very sensitive to this parameter, and compile-time -qhot does a better job for the smallest PLON (1 and 2), although the table shows runtime -qhot ahead at PLON = 4 and 8. However, once PLON is large enough that the improved -qhot performance becomes evident (PLON >= 16), runtime performance is equivalent to that of compile-time.
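
The comparison can be checked directly from the -O3 -qhot columns of table ii. The short sketch below computes the runtime-to-compile-time ratio for each PLON. The 2% tolerance used to call two rates "equivalent" is an arbitrary choice made for this illustration, not something taken from the experiments.

```python
# (PLON, runtime -O3 -qhot, compile-time -O3 -qhot) in MFlop/s, from table ii.
QHOT_ROWS = [
    (1, 75, 112), (2, 109, 115), (4, 166, 123), (8, 192, 166),
    (16, 235, 236), (32, 273, 269), (64, 282, 278), (128, 274, 276),
    (256, 269, 265), (512, 263, 259),
]


def runtime_over_compile(rows):
    """Map each PLON to the ratio of runtime to compile-time MFlop/s."""
    return {plon: runtime / compiled for plon, runtime, compiled in rows}


def equivalent_plons(ratios, tolerance=0.02):
    """PLON values whose ratio is within the chosen tolerance of 1.0."""
    return [plon for plon, ratio in ratios.items() if abs(ratio - 1.0) <= tolerance]


ratios = runtime_over_compile(QHOT_ROWS)
for plon, ratio in ratios.items():
    print(f"PLON={plon:4d}  runtime/compile-time = {ratio:.2f}")
```

With this tolerance, equivalent_plons(ratios) selects every PLON from 16 upward and none below it.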
