The following short progress report, written by a student in geology, provides an excellent example of how concrete and affirmative a progress report can be. Note the specificity even in the title, and how sections such as "Remaining Questions" and "Expected Results" demonstrate that the writer, even though he is two months away from the completion of his thesis, is thinking about the work in a professional manner.
"Stratigraphic Architecture of Deep-Ramp Carbonates: Implications for Deposition
of Volcanic Ashes, Salona and Coburn Formations, Central Pennsylvania"
by John Lerner
SCOPE AND PURPOSE
The Late Middle Ordovician-age Salona and Coburn formations of central Pennsylvania show cyclic patterns on a scale of tens of meters. Little research has been done on sequence stratigraphy of deep-water mixed carbonate/siliciclastic systems, and a depositional model for this environment is necessary to understand the timing and processes of deposition. The stratigraphic position of the bentonites at the base of the larger cycles is significant because it indicates that they accumulated during a time of non-deposition in a deep water environment.
To date, I have described five lithofacies present in the Salona and Coburn formations. Two lithofacies are interpreted as storm deposits and make up the limestone component of the thinly-bedded couplets. Some trends were observed in the raw data; however, because of the "noisy" nature of the data, a plot of the five-point moving average of bed thickness was created to define the cycles better.
Two key tasks are to be completed in the coming weeks. With the results of these tests and the field observations, I will create a model for deposition of a deep-ramp mixed carbonate/siliciclastic system in a foreland basin environment. The model will include depositional processes, stratigraphic architecture, and tectonic setting.
Questions remain regarding the depositional processes responsible for the featureless micrite at the base of the Salona Formation. . . . How rapid was the transition? What record (if any) remains of the transition? Were bentonites not deposited, or were they selectively removed at certain locations by erosive storm processes?
I expect to find that the large-scale cycles represent parasequences. Flooding surfaces are marked by bentonites and shales, with bentonites removed in some locations. If the cycles are true parasequences, the implication is that eustatic sea level changes and not tectonic influences controlled the depositional changes over the interval.
Parallel Computer Systems
Blocked Matrix Multiply
Tutor: Krister Dackland (email@example.com)
Nico, C95 (firstname.lastname@example.org)
Selander, C93 (email@example.com)
17th September 1998
An implementation of a matrix multiplication subroutine, optimized for use on an IBM RS/6000 25T, is presented. The optimizations include matrix multiplication with scalar temporary variables, cache blocking, register blocking by loop unrolling, block copying and function inlining.
This report covers an assignment on implementing a fast matrix multiplication subroutine on the IBM RS/6000 25T. Aside from a variety of compiler optimizations, we have used matrix multiplication with scalar temporary variables, cache blocking, register blocking by loop unrolling, block copying and function inlining to improve performance.
This is a compulsory exercise in the course Parallel Computer Systems (Parallelldatorsystem), given by the Department of Computing Science at Umeå University in 1998.
Acquisition and user guidance
The source-code files are located in the directory on the computer system of the Department of Computing Science. The files are:
Contains our highly tuned matrix multiplication subroutine:
The given test program. It is most advisable to use an unmodified version of the test program. However, since we are to achieve more than 30 MFLOPS on matrices larger than 500 by 500, we have modified it to include some huge matrices, as the original program never goes beyond 256 in size.
The Makefile is quite ordinary. The only thing worth mentioning is the compiler options: . That is, aggressive optimizations, inlining where possible, no use of , target architecture Power PC, target processor 601, do inter-procedural analysis and unroll loops.
To perform a test, simply run . First some random matrices are multiplied and the results are checked for correctness. Then the performance for some quad-word aligned sizes and some arbitrary sizes is measured. The output is of the form .
Example:
    anaris ~/src/pds/lab1>mm_contest
    Checking for correctness on sizes: 238 65 31 78 109 39 181 184 180 248
    Checking quad-word aligned sizes
    16 52.390698
    32 58.099291
    64 48.415417
    128 39.099444
    256 42.182715
    512 41.943040
    1024 41.690616
    Checking arbitrary sizes
    23 52.614054
    43 47.545714
    61 49.667615
    79 48.423473
    99 47.870539
    119 49.772629
    151 50.209702
    255 38.118103
    257 36.864861
    501 49.802575
    633 49.106706
The figure below pretty much shows how our program works:
This figure depicts the cache blocking. The three outermost loops step by the block-size and thus determine the darker sub-matrices on which the innermost loops perform the actual matrix multiplication. This way, each C sub-matrix is used only once and the sub-matrices of A and B are visited less often than in the naïve matrix multiplication algorithm. We have used a block-size of 32. This size was chosen after extensive empirical testing of sizes around the one suggested by the formula in [IBM93] (see the section Cache Blocking below).
This figure also depicts the data copying, which is used to boost performance for the otherwise problematic matrices whose sizes are even multiples of 32. Each A and B sub-matrix is copied into a temporary array that also has room for the resulting C sub-matrix (the C sub-matrix is not copied into the array, just zeroed). The blocks of this temporary array are then fed to the same general function that handles all other matrix sizes. After the current C sub-matrix is completely computed, it is copied into the corresponding place in the C matrix. See also the section Data Copying below.
We have used the same block-size on all levels of the program (the register blocking is achieved by loop unrolling to a depth of four, not by explicit loops). However, our code supports one block-size for the cache blocking in the general function and another for the data-copying. That is, they are determined by two separate defines in the source code, BLOCKSIZE and BSIZEPRIM.
We have been a bit troubled over how to present the algorithm we have used. We first tried to follow the examples found in the references, but that quickly turned into something more like the source code than a presentation of an algorithm. Instead we chose to cover each optimization trick independently of the others. Hopefully, this makes the report a lot more comprehensible.
Common Sense Optimizations
This is a small but very important section. The optimizations in this section are not found in fancy papers on fast computations. Instead, they are sometimes found in the curriculum of good and thorough educational programmes, and most often learned only after long programming experience. We are referring to small details like moving invariants out of loops, always precalculating loop conditions and such. Easily overlooked, they can give you that extra MFLOP you were looking for. ;-)
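As an illustration of the kind of detail meant here, the following is a minimal C sketch, not taken from the authors' source. It assumes a row-major n by n matrix of doubles, and the function name scale_row is hypothetical. The row offset i * n never changes inside the loop, so it is computed once before the loop starts.

```c
#include <assert.h>
#include <stddef.h>

/* Scale row i of a row-major n x n matrix by factor.
   Assumption: row-major storage; scale_row is a hypothetical name. */
void scale_row(double *m, size_t n, size_t i, double factor)
{
    assert(i < n);

    /* Loop invariant: the start of row i is computed once,
       instead of evaluating m[i * n + j] on every iteration. */
    double *row = m + i * n;

    for (size_t j = 0; j < n; j++)
        row[j] *= factor;
}
```

An optimizing compiler will often perform this transformation itself, but writing it out keeps the inner loop simple when the surrounding code grows more complicated.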
By replacing the expression C[i, j] with a scalar temporary variable the memory traffic may be reduced since the scalar is placed in a register and hopefully stays there.
Example:
    for i := 0 to N
        for j := 0 to N
            s := C[i, j]
            for k := 0 to N
                s += A[i, k] * B[k, j]
            end
            C[i, j] := s
        end
    end
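The pseudocode above might look as follows in C. This is a sketch under stated assumptions rather than the authors' subroutine: the matrices are taken to be square, row-major arrays of doubles, and the name matmul_scalar_temp is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for row-major n x n matrices (storage order is an assumption). */
void matmul_scalar_temp(size_t n, const double *A, const double *B, double *C)
{
    assert(A != C && B != C);   /* the output must not alias an input */

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            /* Accumulate in a local scalar so the compiler can keep it
               in a register instead of storing to C on every iteration. */
            double s = C[i * n + j];
            for (size_t k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
    }
}
```

Because C is read before the k loop and written once after it, the function accumulates into whatever C already holds; zero C first to obtain the plain product.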
The above example is simple, but used carefully together with loop unrolling, this technique achieves register blocking (good register reuse), which can give good performance, especially for small matrices.
In the innermost main-loop, we use 16 scalar temporary variables to hold values for a 4 by 4 sub-matrix of C (we unroll both i and j with a depth of 4). Whenever i or j is not divisible by 4, special cases arise where we use fewer registers (when both i and j are indivisible by 4, our code resembles the example above).
Cache Blocking
The scheme of cache blocking is to keep blocks of the matrices in the cache for as long as possible, to maximize cache reuse and minimize memory traffic.
Example:
    # Cache blocking:
    for ib := 0 to N by BLOCKSIZE
        for jb := 0 to N by BLOCKSIZE
            for kb := 0 to N by BLOCKSIZE
                # Computations local to cache
                for i := ib to MIN(ib + BLOCKSIZE, N)
                    for j := jb to MIN(jb + BLOCKSIZE, N)
                        for k := kb to MIN(kb + BLOCKSIZE, N)
                            C[i, j] += A[i, k] * B[k, j]
                        end
                    end
                end
            end
        end
    end
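A C rendering of the blocked loop nest could look like the sketch below. It is an illustration only: row-major storage is assumed, the block-size is passed as the parameter bs instead of the report's BLOCKSIZE define so that small sizes can be tried, and the names matmul_blocked and min_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for row-major n x n matrices, processed in bs x bs blocks. */
void matmul_blocked(size_t n, size_t bs,
                    const double *A, const double *B, double *C)
{
    assert(bs > 0);

    for (size_t ib = 0; ib < n; ib += bs) {
        const size_t iend = min_size(ib + bs, n);   /* precalculated bounds */
        for (size_t jb = 0; jb < n; jb += bs) {
            const size_t jend = min_size(jb + bs, n);
            for (size_t kb = 0; kb < n; kb += bs) {
                const size_t kend = min_size(kb + bs, n);

                /* Computations local to one block triple */
                for (size_t i = ib; i < iend; i++) {
                    for (size_t j = jb; j < jend; j++) {
                        double s = C[i * n + j];
                        for (size_t k = kb; k < kend; k++)
                            s += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = s;
                    }
                }
            }
        }
    }
}
```

Calling matmul_blocked(n, 32, A, B, C) corresponds to the block-size reported in the text. The clamped bounds let the same code handle sizes that are not multiples of the block-size.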
[IBM93] uses this formula to compute the block-size:
in our case these figures apply:
However, some of the cache should be saved for access of scalars. Empirically, we found 34 to be a good block-size and used it for some time, but we now get the best overall performance with a block-size of 32. We have also lost about one MFLOP of our peak performance and we cannot find out why. :-( (Now we think we have figured it out - see the trailing comment in the test section.)
Register blocking follows the same principle as cache blocking, except that the register file is much smaller than the cache. The scheme is to maximize register reuse and minimize memory traffic.
There are two ways to achieve register blocking. You can use loops as in cache blocking, with a block-size small enough to fit a mini-block in the registers. Alternatively, you can unroll the inner loops of the matrix multiplication and load scalar temporary variables by hand.
We have used unrolling to a depth of four and 16 temporary doubles to store a 4 by 4 sub-matrix of C. That is, we have used half of the 32 8-byte floating point registers of the IBM RS/6000 25T. We also tried to use an unrolling depth of 5 and 25 temporary doubles, but this did not give any extra performance.
By unrolling loops you can minimize loop overhead and create larger basic blocks for the compiler optimizer to work on.
Example:
    for i := ...
        for j := 0 to BLOCKSIZE by DEPTH
            s0 := C[i, j    ]
            s1 := C[i, j + 1]
            s2 := C[i, j + 2]
            s3 := C[i, j + 3]
            for k := ...
                ...
            end
            C[i, j    ] := s0
            C[i, j + 1] := s1
            C[i, j + 2] := s2
            C[i, j + 3] := s3
        end
    end
This is a one-dimensional loop unrolling. We have used a two-dimensional unrolling. That is, we have also unrolled the outer i loop and used 16 temporary doubles to store a 4 by 4 sub-matrix of C. We have tried to also unroll the innermost k loop but this did not give any extra performance. Probably the compiler does a better job of unrolling innermost loops.
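To make the two-dimensional unrolling concrete, here is a reduced C sketch that unrolls i and j to a depth of 2 and keeps a 2 by 2 sub-matrix of C in four scalars. The report uses a depth of 4 and 16 scalars; the smaller depth is chosen here only to keep the example short. Row-major storage and the name matmul_regblock2x2 are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* C += A * B for row-major n x n matrices, with a 2 x 2 register block. */
void matmul_regblock2x2(size_t n, const double *A, const double *B, double *C)
{
    assert(A != C && B != C);
    const size_t n2 = n - n % 2;            /* largest even number <= n */

    for (size_t i = 0; i < n2; i += 2) {
        for (size_t j = 0; j < n2; j += 2) {
            /* Four accumulators hold a 2 x 2 sub-matrix of C. */
            double c00 = C[i * n + j],       c01 = C[i * n + j + 1];
            double c10 = C[(i + 1) * n + j], c11 = C[(i + 1) * n + j + 1];
            for (size_t k = 0; k < n; k++) {
                /* Each loaded element of A and B is used twice. */
                const double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                const double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i * n + j] = c00;       C[i * n + j + 1] = c01;
            C[(i + 1) * n + j] = c10; C[(i + 1) * n + j + 1] = c11;
        }
    }

    /* Special case for odd n: the last column of the even rows and the
       whole last row fall back to a single scalar accumulator. */
    if (n2 < n) {
        for (size_t i = 0; i < n; i++) {
            for (size_t j = (i < n2 ? n2 : 0); j < n; j++) {
                double s = C[i * n + j];
                for (size_t k = 0; k < n; k++)
                    s += A[i * n + k] * B[k * n + j];
                C[i * n + j] = s;
            }
        }
    }
}
```

In the unrolled body, four loads feed four multiply-adds, whereas the single-scalar loop needs two loads per multiply-add. This is the register reuse the text refers to.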
Data Copying
For some matrices ordinary cache blocking does not work so well. The elements of the different matrices may be stored in memory such that they map onto the same cache sets, which can cause performance to drop considerably. Another problem is that you do not get good cache line reuse when whole cache lines are exchanged between cache hits. One solution to these problems is to copy chunks of large matrices into temporary arrays that fit into the cache. Carefully used, the overhead of block copying can be less than the performance drop caused by cache set interference.
We have found that the performance of our program drops for even multiples of 32 larger than 64, but for 32 by 32 matrices we get our peak performance. Therefore we use a block copying scheme in these cases, where we copy 32 by 32 blocks out of A and B into a 3 * 32 * 32 temporary array. The remaining room is zeroed and reserved for the computed C sub-matrix. Then we call our general function (redundantly inlined, of course) with local A, B and C pointing to parts of this array.
Example:
    for ib := 0 to N by 32
        for jb := 0 to N by 32
            zero_local_C
            for kb := 0 to N by 32
                copy_block_of_A
                copy_block_of_B
                Call_general_function
            end
            copy_local_C_back_to_block_of_C
        end
    end
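The copying scheme could be sketched in C as follows. This is a simplified illustration under assumptions, not the authors' code: storage is row-major, n must be a multiple of the block edge bs, a plain triple loop stands in for the general function, and the names matmul_copy_blocks and MAXBS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXBS 32   /* largest supported block edge; the report uses 32 */

/* C = A * B for row-major n x n matrices, n a multiple of bs.
   Blocks of A and B are copied into one contiguous buffer first. */
void matmul_copy_blocks(size_t n, size_t bs,
                        const double *A, const double *B, double *C)
{
    assert(bs > 0 && bs <= MAXBS && n % bs == 0);

    double buf[3 * MAXBS * MAXBS];          /* room for local A, B and C */
    double *la = buf;
    double *lb = buf + bs * bs;
    double *lc = buf + 2 * bs * bs;

    for (size_t ib = 0; ib < n; ib += bs) {
        for (size_t jb = 0; jb < n; jb += bs) {
            for (size_t x = 0; x < bs * bs; x++)   /* zero local C */
                lc[x] = 0.0;

            for (size_t kb = 0; kb < n; kb += bs) {
                /* Copy one block of A and one block of B, row by row. */
                for (size_t r = 0; r < bs; r++) {
                    memcpy(la + r * bs, A + (ib + r) * n + kb, bs * sizeof *la);
                    memcpy(lb + r * bs, B + (kb + r) * n + jb, bs * sizeof *lb);
                }
                /* Multiply the contiguous local blocks. */
                for (size_t i = 0; i < bs; i++) {
                    for (size_t j = 0; j < bs; j++) {
                        double s = lc[i * bs + j];
                        for (size_t k = 0; k < bs; k++)
                            s += la[i * bs + k] * lb[k * bs + j];
                        lc[i * bs + j] = s;
                    }
                }
            }
            /* Copy the finished local C block back into C. */
            for (size_t r = 0; r < bs; r++)
                memcpy(C + (ib + r) * n + jb, lc + r * bs, bs * sizeof *lc);
        }
    }
}
```

Because the local C block is zeroed and then copied back, this variant overwrites C rather than accumulating into it, matching the description in the text.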
Invoking sub-functions has a cost. Each call introduces a delay as the current environment and parameters are pushed onto the stack, local variables are created and the flow of control shifts to the new piece of code. Therefore you can gain performance by inlining functions. For small functions that do not call other functions, this can be done by the compiler. For more complex functions, you have to do it by hand. For the price of rather ugly code you can get a little extra performance.
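For the compiler-assisted case, a small C sketch is given below. It uses the C99 inline keyword as a hint, which is an assumption about the toolchain and not necessarily what the authors' compiler offered in 1998. The names dot_row_col and matmul_with_helper are hypothetical, and row-major storage is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* A small leaf function: a good candidate for inlining by the compiler.
   "static inline" is only a hint; the compiler may still decide otherwise. */
static inline double dot_row_col(size_t n, const double *A, const double *B,
                                 size_t i, size_t j)
{
    double s = 0.0;
    for (size_t k = 0; k < n; k++)
        s += A[i * n + k] * B[k * n + j];
    return s;
}

/* C += A * B for row-major n x n matrices, written with the helper. */
void matmul_with_helper(size_t n, const double *A, const double *B, double *C)
{
    assert(A != C && B != C);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            C[i * n + j] += dot_row_col(n, A, B, i, j);
}
```

If the compiler declines to inline the helper, pasting its body into the j loop by hand gives the manual variant described in the text.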
Tests
    anaris ~/src/pds/lab1>mm_contest
    Checking for correctness on sizes: 238 65 31 78 109 39 181 184 180 248
    Checking quad-word aligned sizes
    16 52.390698
    32 58.099291
    64 48.415417
    128 39.099444
    256 42.182715
    512 41.943040
    1024 41.690616
    Checking arbitrary sizes
    23 52.614054
    43 47.545714
    61 49.667615
    79 48.423473
    99 47.870539
    119 49.772629
    151 50.209702
    255 38.118103
    257 36.864861
    501 49.802575
    633 49.106706
All tests are performed on the machine . Since this workstation resides in the master thesis lab, our fellow students in the course have not found it, and thus we have had it to ourselves! ;-)
The time taken by a floating point operation depends on the actual values of the operands. These are randomized in the test program, which seeds its random function with the current time. As a result, the output of the test varies by up to one MFLOPS depending on when the test is run and which values happen to be used in the A and B matrices. This can make a serious student feel very sick when earlier peak performances are suddenly impossible to reproduce...
- Many thanks to hamlet and Psycho (Fredrik Augustsson and Tomas Halvarsson) for pointing out to us our far from optimal wrapper condition.
- Thanks to the ma446-community for providing a nutritious environment for creative assignment solving.
- [IBM93] IBM manuals, Optimization and Tuning Guide for Fortran, C, and C++ (SC09-1705-00), IBM 1993
- [LAM91] Lam, M S, Rothberg, E E, Wolf, M E, The Cache Performance and Optimizations of Blocked Algorithms, Stanford University, 1991
This assignment was made with , , and .
I think we better leave the source out, don't you think?