MIMS EPrints
Not a member yet
2151 research outputs found
Sort by
Performance Evaluation of Mixed Precision Algorithms for Solving Sparse Linear Systems
It is well established that mixed precision algorithms that factorize
a matrix at a precision lower than the working precision can reduce
the execution time of parallel solvers for dense linear systems. Much
less is known about the efficiency of mixed precision parallel
algorithms for sparse linear systems, and existing work focuses on
single core experiments. We evaluate the benefits of using single
precision arithmetic in solving a double precision sparse linear
systems using multiple cores, focusing on the key components of LU
factorization and matrix–vector products. We find that single
precision sparse LU factorization is prone to a severe loss of
performance due to the intrusion of subnormal numbers. We identify a
mechanism that allows cascading fill-ins to generate subnormal numbers
and show that automatically flushing subnormals to zero avoids the
performance penalties. Our results show that the anticipated speedup
of 2 over a double precision LU factorization is obtained only for the
very largest of our test problems. For iterative solvers, we find that
for the majority of the matrices computing or applying incomplete
factorization preconditioners in single precision does not present
sufficient performance benefits to justify the loss of accuracy
compared with the use of double precision
Algorithms for stochastically rounded elementary arithmetic operations in IEEE 754 floating-point arithmetic
We present algorithms for performing the five elementary arithmetic operations (+, -, ×, ÷, and √) in floating point arithmetic with stochastic rounding, and demonstrate the value of these algorithms by discussing various applications where stochastic rounding is beneficial. The algorithms require that the hardware be compliant with the IEEE 754 floating-point standard and that a floating-point pseudorandom number generator be available. The goal of these techniques is to emulate stochastic rounding when the underlying hardware does not support this rounding mode, as is the case for most existing CPUs and GPUs. Simulating stochastically rounded floating-point operations can be used to explore the behavior of this rounding, as well as to develop applications before hardware with stochastic rounding is available-once such hardware becomes available, the proposed algorithms can be replaced by calls to the relevant hardware routines. When stochastically rounding double precision operations, the algorithms we propose are between 7.3 and 19 times faster than the implementations that use the GNU MPFR library to simulate extended precision. We test our algorithms on various problems where stochastic rounding is expected to bring advantages, which includes summation algorithms and ordinary differential equation solvers
A spectral-in-time Newton-Krylov method for nonlinear PDE-constrained optimization
We devise a method for nonlinear time-dependent PDE-constrained optimization problems that uses a spectral-in-time representation of the residual, combined with a Newton-Krylov method to drive the residual to zero. We also propose a preconditioner to accelerate this scheme. Numerical results indicate that this method can achieve fast and accurate solution of nonlinear problems for a range of mesh sizes and problem parameters
Alan Turing’s Manchester by Jonathon Swinton
This is a delightful book, enjoyable for anyone with an interest in science or Manchester or its university. The title precisely captures the subject matter, namely what Turing would have experienced through his window from abstract academia on his arrival in 1948. It describes the scientific–academic–social environment in Manchester during the 1940s and 1950s, when the University of Manchester was drawing talent, which included Turing, away from the Oxford–London–Cambridge triangle
Stochastic Rounding and its Probabilistic Backward Error Analysis
Stochastic rounding rounds a real number to the next larger or smaller floating-point number with probabilities minus the relative distances to those numbers. % It has a larger worst-case error than round to nearest % but has useful statistical properties. It is gaining attention in deep learning because it can improve the accuracy of the computations. We compare basic properties of stochastic rounding with those for round to nearest, finding properties in common as well as significant differences. We prove that for stochastic rounding the rounding errors are mean independent random variables with zero mean. We derive a new version of our probabilistic error analysis theorem from [{\em SIAM J. Sci. Comput.}, 41 (2019), pp.\ A2815--A2835], weakening the assumption of independence of the random variables to mean independence. These results imply that for a wide range of linear algebra computations the backward error for stochastic rounding is unconditionally bounded by a multiple of to first order, with a certain probability, where is the problem size and is the unit roundoff. This is the first scenario where the rule of thumb that one can replace by in a rounding error bound has been shown to hold without any additional assumptions on the rounding errors. We also explain how stochastic rounding avoids the phenomenon of stagnation in sums, whereby small addends are obliterated by round to nearest when they are too small relative to the sum
Numerical Behavior of NVIDIA Tensor Cores
We explore the floating-point arithmetic implemented in NVIDIA tensor cores, which are hardware accelerators for mixed-precision matrix multiplication available on the Volta, Turing, and Ampere microarchitectures. Using Volta V100 and Turing T4 graphics cards, we determine what precision is used for the intermediate results, whether subnormal numbers are supported, what rounding mode is used, in which order the operations underlying the matrix multiplication are performed, and whether partial sums are normalized. These aspects are not documented by NVIDIA, and we gain insight by running carefully designed numerical experiments on these hardware units. Knowing the answers to these questions is important if one wishes to: 1) accurately simulate NVIDIA tensor cores on conventional hardware; 2) understand the differences between results produced by code that utilizes tensor cores and code that uses only IEEE 754-compliant arithmetic operations; and 3) build hardware that computes a matrix-matrix product matching the results of the NVIDIA tensor cores. As part of this work we provide a testsuite that can be easily adapted to test the latest tensor cores available in the NVIDIA Ampere A100, once those graphics cards become easily accessible. Moreover, we identify a non-monotonicity issue that arises in floating-point multi-operand addition if the intermediate results are not normalized
CPFloat: A C library for simulating low-precision arithmetic
One can simulate low-precision floating-point arithmetic via software by executing each arithmetic operation in hardware and then rounding the result to the desired number of significant bits. For IEEE-compliant formats, rounding requires only standard mathematical library functions, but handling subnormals, underflow, and overflow demands special attention, and numerical errors can cause mathematically correct formulae to behave incorrectly in finite arithmetic. Moreover, the ensuing implementations are not necessarily efficient, as the library functions these techniques build upon are typically designed to handle a broad range of cases and may not be optimized for the specific needs of rounding algorithms. CPFloat is a C library for simulating low-precision arithmetics. It offers efficient routines for rounding, performing mathematical computations, and querying properties of the simulated low-precision format. The software exploits the bit-level floating-point representation of the format in which the numbers are stored, and replaces costly library calls with low-level bit manipulations and integer arithmetic. In numerical experiments, the new techniques bring a considerable speedup (typically one order of magnitude or more) over existing alternatives in C, C++, and MATLAB. To our knowledge, CPFloat is currently the most efficient and complete library for experimenting with custom low-precision floating-point arithmetic available in any language
Exploiting Fast Matrix Arithmetic in Block Low-Rank Factorizations
We consider the LU factorization of an matrix represented as a block low-rank (BLR) matrix: most of its off-diagonal blocks are approximated by matrices of small rank , which reduces the asymptoticcomplexity of computing the LU factorization of down to \O(n^2r). In this article, our aim is to further reduce this complexity by exploiting fast matrix arithmetic, that is, the ability to multiply two full-rank matrices together for \O(n^\w) flops, where \w<3. This is not straightforward: simply accelerating the intermediate operations performed in the standard BLR factorization algorithm does not suffice to reduce the quadratic complexity in , because these operations are performed on matrices whose size is too small. To overcome this obstacle, we devise a new BLR factorization algorithm that, by recasting the operations so as to work on intermediate matrices of larger size, can exploit more efficiently fast matrix arithmetic. This new algorithm achieves an asymptotic complexity of \O(n^{(\w+1)/2}r^{(\w-1)/2}), which represents an asymptotic improvement compared to the standard BLR factorization as soon as \w<3. In particular, for Strassen's algorithm, \w\approx2.81 yields a complexity of \O(n^{1.904}r^{0.904}). Our numerical experiments are in good agreement with this theoretical result
SIMULATING LOW PRECISION FLOATING-POINT ARITHMETIC
The half precision (fp16) floating-point format, defined in the 2008 revision of the IEEE standard for floating-point arithmetic, and a more recently proposed half precision format bfloat16, are increasingly available in GPUs and other accelerators. While the support for low precision arithmetic is mainly motivated by machine learning applications, general purpose numerical algorithms can benefit from it, too, gaining in speed, energy usage, and reduced communication costs. Since the appropriate hardware is not always available, and one may wish to experiment with new arithmetics not yet implemented in hardware, software simulations of low precision arithmetic are needed. We discuss how to simulate low precision arithmetic using arithmetic of higher precision. We examine the correctness of such simulations and explain via rounding error analysis why a natural method of simulation can provide results that are more accurate than actual computations at low precision. We provide a MATLAB function chop that can be used to efficiently simulate fp16 and bfloat16 arithmetics, with or without the representation of subnormal numbers and with the options of round to nearest, directed rounding, stochastic rounding, and random bit flips in the significand. We demonstrate the advantages of this approach over defining a new MATLAB class and overloading operators