It is more convenient to implement this algorithm in a slightly different form: suppose we have the bits b[i] of the number n, so that n=b[0]+2*b[1]+...+b[m]*2^m. Then x^n=x^b[0]*(x^2)^b[1]*...*(x^(2^m))^b[m].
In Yacas script form, the algorithm looks like this:
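power(x_IsPositiveInteger, n_IsPositiveInteger) <--
[
  Local(result);
  result := 1;
  While(n != 0)
  [
    // the lowest bit of n is set: multiply in the current power of x
    if ((n&1) = 1)
    [
      result := result*x;
    ];
    x := x*x;   // x runs through x, x^2, x^4, ...
    n := n>>1;  // move on to the next bit of n
  ];
  result;
];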
Since the "binary squaring" algorithm scans bits in the number n, the execution time is linear in the number of digits of n, i.e. logarithmic in n.
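For readers who want to experiment outside Yacas, the same bit-scanning loop can be sketched in Python. This is only an illustration of the technique described above; the function name power is reused for convenience and negative exponents are rejected by assumption.

```python
def power(x: int, n: int) -> int:
    """Return x**n for n >= 0 by scanning the bits of n from the lowest up."""
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    while n != 0:
        if n & 1:      # bit b[i] is set: multiply in x^(2^i)
            result *= x
        x *= x         # x^(2^i) -> x^(2^(i+1))
        n >>= 1        # move on to bit i+1
    return result


print(power(3, 13))  # 1594323
```

The loop body runs once per bit of n, so computing x^13 takes four squarings and three multiplications by a power of x.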
The square root is computed by using the bisection method, which works well for integers. The general approach is to scan each bit of the input number and to see if a certain bit should be set in the resulting integer. The time is linear in the number of decimals, or logarithmic in the input number. The method is very similar in approach to the repeated squaring method described above for raising numbers to a power.
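For integer N, the following steps are performed: find the highest set bit l2 of N; set u to 1<<(l2/2), which is the highest bit set in the result; then, for each lower bit down to bit 0, set v to that single bit and add v to u if (u+v)^2<=N.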
The intermediate results u^2, v^2 and 2*u*v can also be maintained cheaply, because v has only one bit set and it is known which bit that is.
For floating point numbers, first the required number of decimals p after the decimal point is determined. Then the input number N is multiplied by a power of 10 until it has 2*p decimals. Then the integer square root calculation is performed, and the resulting number has p digits of precision.
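The scaling step can be illustrated with a short Python sketch. It assumes the input is given as an exact rational and uses the standard library's math.isqrt as a stand-in for the integer square root routine; the helper name sqrt_decimals is hypothetical.

```python
from fractions import Fraction
from math import isqrt


def sqrt_decimals(value, p: int) -> str:
    """Square root of a non-negative rational, truncated to p decimal places."""
    scaled = Fraction(value) * 10 ** (2 * p)   # shift by 2*p decimal places
    root = isqrt(int(scaled))                  # integer part of sqrt(value)*10^p
    digits = str(root).rjust(p + 1, "0")       # keep a leading zero for roots < 1
    return digits[:-p] + "." + digits[-p:] if p > 0 else digits


print(sqrt_decimals(2, 10))  # 1.4142135623
```

Shifting by 2*p places before taking the integer root is what leaves exactly p decimal places in the result.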
Below is some Yacas script code to perform the calculation for integers.
// sqrt(1) = 1, sqrt(0) = 0
10 # BisectSqrt(0) <-- 0;
10 # BisectSqrt(1) <-- 1;

20 # BisectSqrt(N_IsPositiveInteger) <--
[
  Local(l2,u,v,u2,v2,uv2,n);

  // Find highest set bit, l2
  u := N;
  l2 := 0;
  While (u!=0)
  [
    u:=u>>1;
    l2++;
  ];
  l2--;

  // 1<<(l2/2) now would be a good under estimate
  // for the square root. 1<<(l2/2) is definitely
  // set in the result. Also it is the highest
  // set bit.
  l2 := l2>>1;

  // initialize u and u2 (u2==u^2).
  u := 1 << l2;
  u2 := u << l2;

  // Now for each lower bit, down to and including bit 0:
  l2--;
  While( l2 >= 0 )
  [
    // Get that bit in v, and v2 == v^2.
    v := 1<<l2;
    v2 := v<<l2;

    // uv2 == 2*u*v, where 2==1<<1, and
    // v==1<<l2, thus 2*u*v ==
    // (1<<1)*u*(1<<l2) == u<<(l2+1)
    uv2 := u<<(l2 + 1);

    // n = (u+v)^2 = u^2 + 2*u*v + v^2
    //   = u2+uv2+v2
    n := u2 + uv2 + v2;

    // if n (the possible new best estimate for
    // sqrt(N)^2) is not larger than N, then the
    // bit l2 is set in the result, and
    // we add v to u.
    if( n <= N )
    [
      u := u+v;   // u <- u+v
      u2 := n;    // u^2 <- u^2 + 2*u*v + v^2
    ];
    l2--;
  ];
  u; // return result, accumulated in u.
];
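As a cross-check of the bit-scanning idea, here is a minimal Python sketch of the same procedure. It is a transcription for illustration, not the Yacas source; the name bisect_sqrt is hypothetical, and the loop runs down to and including bit 0.

```python
def bisect_sqrt(N: int) -> int:
    """Integer part of the square root of N >= 0, found one bit at a time."""
    if N < 0:
        raise ValueError("N must be non-negative")
    if N < 2:
        return N
    l2 = (N.bit_length() - 1) >> 1   # index of the highest bit of the result
    u = 1 << l2                      # current estimate
    u2 = u << l2                     # u^2, maintained incrementally
    for bit in range(l2 - 1, -1, -1):
        v = 1 << bit
        # (u+v)^2 = u^2 + 2*u*v + v^2, all obtained by shifts
        n = u2 + (u << (bit + 1)) + (v << bit)
        if n <= N:                   # the bit belongs to the result
            u += v
            u2 = n
    return u


print(bisect_sqrt(9), bisect_sqrt(10 ** 20))  # 3 10000000000
```

Only shifts, additions and comparisons are used, which is why the cost grows linearly with the number of bits of the result.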
A separate function IntNthRoot is provided to compute the integer part of n^(1/s) for integer n and s. For a given s, it evaluates the integer part of n^(1/s) using only integer arithmetic with integers of size n^(1+1/s). This can be done by Halley's iteration method, solving the equation x^s=n. For this function, the Halley iteration sequence is monotonic. The initial guess is x[0]=2^(b(n)/s) where b(n) is the number of bits in n obtained by bit counting or using the integer logarithm function. It is clear that the initial guess is accurate to within a factor of 2. Since the relative error is squared at every iteration, we need as many iteration steps as bits in n^(1/s).
Since we only need the integer part of the root, it is enough to use integer division in the Halley iteration. The sequence x[k] will monotonically approximate the number n^(1/s) from below if we start from an initial guess that is less than the exact value. (We start from below so that we have to deal with smaller integers rather than with larger integers.) If n=p^s, then after enough iterations the floating-point value of x[k] would be slightly less than p; our value is the integer part of x[k]. Therefore, at each step we check whether 1+x[k] is a solution of x^s=n, in which case we are done; and we also check whether (1+x[k])^s>n, in which case the integer part of the root is x[k]. To speed up the Halley iteration in the worst case when s^s>n, it is combined with bisection. The root bracket interval x1<x<x2 is maintained and the next iteration x[k+1] is assigned to the midpoint of the interval if the Newton formula does not give sufficiently rapid convergence. The initial root bracket interval can be taken as x[0], 2*x[0].
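The following Python sketch illustrates the integer-only Halley iteration for x^s=n. It is a simplified illustration under stated assumptions, not the IntNthRoot source: the initial guess is taken as 2^((b-1)//s) with b the bit count of n, so that it lies below the root, and the bisection safeguard is replaced by a short final correction loop. Halley's formula applied to f(x)=x^s-n gives x'=x*((s-1)*x^s+(s+1)*n)/((s+1)*x^s+(s-1)*n).

```python
def int_nth_root(n: int, s: int) -> int:
    """Integer part of n^(1/s) for n >= 0, s >= 1, using integer arithmetic only."""
    if n < 0 or s < 1:
        raise ValueError("need n >= 0 and s >= 1")
    if n < 2 or s == 1:
        return n
    # Initial guess from bit counting; it satisfies x^s <= n < (2*x)^s.
    x = 1 << ((n.bit_length() - 1) // s)
    while True:
        xs = x ** s
        # Halley step with integer (floor) division.
        nxt = x * ((s - 1) * xs + (s + 1) * n) // ((s + 1) * xs + (s - 1) * n)
        if nxt <= x:          # no further progress from below
            break
        x = nxt
    # Final correction: make sure x^s <= n < (x+1)^s holds exactly.
    while x ** s > n:
        x -= 1
    while (x + 1) ** s <= n:
        x += 1
    return x


print(int_nth_root(10 ** 30, 3))  # 10000000000
```

The correction loop also covers the slow case where x is small compared with s and the Halley step rounds down to no progress.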
Real powers (as opposed to integer powers and roots) are computed by using the exponential and logarithm functions, a^b=Exp(b*Ln(a)).
The "integer logarithm", defined as the integer part of Ln(x)/Ln(b), where x and b are integers, is computed using a special routine IntLog(x, b) with purely integer math. This is much faster than evaluating the full logarithm when both arguments are integers and only the integer part of the logarithm is needed. The algorithm consists of (integer) dividing x by b repeatedly until x becomes smaller than b and counting the number of divisions. A speed-up for large x is achieved by first comparing x with b, then with b^2, b^4, etc., until the factor b^(2^n) is larger than x. At this point, x is divided by the largest power of b found that does not exceed it, and the remaining value is iteratively compared with and divided by successively smaller powers of b.
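A minimal Python sketch of the integer logarithm with the repeated-squaring speed-up might look as follows. The name int_log and the argument checks are assumptions made for the illustration; this is not the IntLog source.

```python
def int_log(x: int, b: int) -> int:
    """Integer part of Ln(x)/Ln(b) for integers x >= 1 and b >= 2."""
    if x < 1 or b < 2:
        raise ValueError("need x >= 1 and b >= 2")
    # Collect b, b^2, b^4, ... as long as they do not exceed x.
    powers = []
    p = b
    while p <= x:
        powers.append(p)
        p *= p
    result = 0
    # Divide by the largest usable power first, then by smaller ones.
    for k in range(len(powers) - 1, -1, -1):
        if x >= powers[k]:
            x //= powers[k]
            result += 1 << k      # dividing by b^(2^k) adds 2^k to the logarithm
    return result


print(int_log(10 ** 50, 10), int_log(10 ** 50 - 1, 10))  # 50 49
```

Each division removes one binary digit of the answer, so about log2 of the result divisions suffice instead of one division per unit of the logarithm.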
The logarithm function Ln(x) for general (complex) x can be computed using its Taylor series, Ln(1+x)=x-x^2/2+x^3/3-..., which converges only for Abs(x)<1.
Currently the routine LnNum uses the Halley method for the equation Exp(x)=a to find x=Ln(a), x'=x-2*(Exp(x)-a)/(Exp(x)+a).
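Applying Halley's formula to f(x)=Exp(x)-a gives the iteration x'=x-2*(Exp(x)-a)/(Exp(x)+a). The sketch below tries it in ordinary double precision; the starting point x=0, the tolerance and the iteration cap are assumptions for the illustration and say nothing about how LnNum chooses them.

```python
import math


def ln_halley(a: float, tol: float = 1e-15, max_iter: int = 1000) -> float:
    """Approximate Ln(a) for a > 0 by Halley's iteration on Exp(x) = a."""
    if a <= 0:
        raise ValueError("a must be positive")
    x = 0.0                                   # assumed starting point
    for _ in range(max_iter):
        e = math.exp(x)
        step = 2.0 * (e - a) / (e + a)        # Halley correction
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            break
    return x


print(ln_halley(10.0), math.log(10.0))
```

Far from the root each step moves x by nearly 2, so a better initial guess matters for very large or very small a; close to the root the number of correct digits roughly triples per step.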
A much faster algorithm based on the AGM sequence was given by Salamin (see R. P. Brent: Multiple-precision zero-finding methods and the complexity of elementary function evaluation, in Analytic Computational Complexity, ed. by J. F. Traub, Academic Press, 1975, p. 151; also available online from Oxford Computing Laboratory, as the paper rpb028). The formula is based on an asymptotic relation,
The required number of AGM iterations is approximately 2*Ln(P)/Ln(2). For smaller values of x (but x>1), one can either raise x to a large integer power r (this is quick only if x is an integer or a rational) and compute 1/r*Ln(x^r), or multiply x by a large integer power of 2 (this is better for floating-point x) and compute Ln(2^s*x)-s*Ln(2). Here the required powers are
If x<1, then (-Ln(1/x)) is computed. Finally, there is a special case when x is very close to 1, where the Taylor series converges quickly but the AGM algorithm requires multiplying x by a large power of 2 and then subtracting two almost equal numbers, leading to a great loss of precision. Suppose 1<x<1+10^(-M), where M is large (say of order P). The Taylor series for Ln(1+epsilon) needs about N= -P*Ln(10)/Ln(epsilon)=P/M terms. If we evaluate the Taylor series using the rectangular scheme, we need 2*Sqrt(N) multiplications and Sqrt(N) units of storage. On the other hand, the main slow operation for the AGM sequence is the geometric mean Sqrt(a*b). If Sqrt(a*b) takes an equivalent of c multiplications (Brent's estimate would be c=13/2 but it may be more in practice), then the AGM sequence requires 2*c*Ln(P)/Ln(2) multiplications. Therefore the Taylor series method is more efficient for M>1/c^2*P*(Ln(2)/Ln(P))^2, which follows from comparing the two costs, 2*Sqrt(P/M)<2*c*Ln(P)/Ln(2).
For larger x>1+10^(-M), the AGM method is more efficient. It is necessary to increase the working precision to P+M*Ln(2)/Ln(10) but this does not decrease the asymptotic speed of the algorithm. To compute Ln(x) with P digits of precision for any x, only O(Ln(P)) long multiplications are required.
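The AGM approach can be tried in double precision with the Python sketch below. It assumes the standard large-argument form of the asymptotic relation, Ln(X) is approximately Pi/(2*AGM(1,4/X)) with a relative error of order 1/X^2, together with the shift Ln(x)=Ln(2^s*x)-s*Ln(2) described above. The helper names, the target of 40 bits and the use of math.pi and math.log(2.0) as ready-made constants are assumptions of the illustration; an arbitrary-precision routine has to supply Pi and Ln(2) itself.

```python
import math


def agm(a: float, b: float) -> float:
    """Arithmetic-geometric mean of two positive numbers."""
    for _ in range(64):
        if abs(a - b) <= 1e-15 * a:
            break
        a, b = (a + b) / 2.0, math.sqrt(a * b)
    return a


def ln_agm(x: float, bits: int = 40) -> float:
    """Approximate Ln(x) for x > 0 through the AGM sequence."""
    if x <= 0:
        raise ValueError("x must be positive")
    if x < 1:
        return -ln_agm(1.0 / x, bits)          # Ln(x) = -Ln(1/x)
    # Choose s so that 2^s * x >= 2^(bits-1), large enough for the asymptotics.
    s = max(0, bits - math.frexp(x)[1])
    big = math.ldexp(x, s)                     # 2^s * x
    return math.pi / (2.0 * agm(1.0, 4.0 / big)) - s * math.log(2.0)


print(ln_agm(3.0), math.log(3.0))
```

For x very close to 1 the final subtraction cancels most of the leading digits, which is the loss of precision discussed above.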
An alternative way to compute x=Exp(a) at large precision would be to solve the equation Ln(x)=a using a fast logarithm routine. A cubically convergent formula is obtained if we replace Ln(x)=a by an equivalent equation
Efficient iterative algorithms for computing pi with arbitrary precision have been recently developed by Brent, Salamin, Borwein and others. However, limitations of the current multiple-precision implementation in Yacas (compiled with the "internal" math option) make these advanced algorithms run slower because they require many more arbitrary-precision multiplications at each iteration.
The file examples/pi.ys implements five different algorithms that duplicate the functionality of Pi(). See http://numbers.computation.free.fr/Constants/ for details of computations of pi and generalizations of Newton-Raphson iteration.
PiMethod0(), PiMethod1(), PiMethod2() are all based on a generalized Newton-Raphson method of solving equations.
Since pi is a solution of Sin(x)=0, one may start sufficiently close, e.g. at x0=3.14159265 and iterate x'=x-Tan(x). In fact it is faster to iterate x'=x+Sin(x) which solves a different equation for pi. PiMethod0() is the straightforward implementation of the latter iteration. A significant speed improvement is achieved by doing calculations at each iteration only with the precision of the root that we expect to get from that iteration. Any imprecision introduced by round-off will be automatically corrected at the next iteration.
If at some iteration x=pi+epsilon for small epsilon, then from the Taylor expansion of Sin(x) it follows that the value x' at the next iteration will differ from pi by O(epsilon^3). Therefore, the number of correct digits triples at each iteration. If we know the number of correct digits of pi in the initial approximation, we can decide in advance how many iterations to compute and what precision to use at each iteration.
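The tripling of correct digits is easy to observe in ordinary double precision. The sketch below iterates x'=x+Sin(x) from x0=3; the function name and the choice of starting point are assumptions for the illustration. With x=pi+epsilon the next value is pi+epsilon-Sin(epsilon), so the new error is close to epsilon^3/6.

```python
import math


def pi_sine_iteration(x0: float = 3.0, steps: int = 3) -> float:
    """Iterate x' = x + Sin(x), which converges cubically to pi near x0 = 3."""
    x = x0
    for _ in range(steps):
        x = x + math.sin(x)
    return x


for k in range(4):
    # the error shrinks roughly as epsilon -> epsilon^3/6 at each step
    print(k, pi_sine_iteration(3.0, k) - math.pi)
```

Three steps already exhaust double precision; at higher precision each further step would again triple the number of correct digits.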
The final speed-up in PiMethod0() is to avoid computing at unnecessarily high precision. This may happen if, for example, we need to evaluate 200 digits of pi starting with 20 correct digits. After 2 iterations we would be calculating with 180 digits; the next iteration would have given us 540 digits but we only need 200, so the third iteration would be wasteful. This can be avoided by first computing pi to just over 1/3 of the required precision, i.e. to 67 digits, and then executing the last iteration at full 200 digits. There is still a wasteful step when we would go from 60 digits to 67, but much less time would be wasted than in the calculation with 200 digits of precision.
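The precision schedule described above can be written down as a small helper. This is a hypothetical sketch of the bookkeeping only (the name precision_schedule is not part of Yacas): it triples the precision until just over one third of the target is reached and then performs the last iteration at full precision.

```python
def precision_schedule(start_digits: int, target_digits: int) -> list:
    """Working precision (in digits) for each iteration of a cubically convergent method."""
    if start_digits >= target_digits:
        return []                                  # nothing left to compute
    last_but_one = -(-target_digits // 3)          # ceiling of target/3
    schedule = []
    digits = start_digits
    while digits < last_but_one:
        digits = min(3 * digits, last_but_one)     # never exceed what the last step needs
        schedule.append(digits)
    schedule.append(target_digits)                 # final iteration at full precision
    return schedule


print(precision_schedule(20, 200))  # [60, 67, 200]
```

For the example in the text this reproduces the steps 20 to 60, 60 to 67, and 67 to 200 digits.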
Newton's method is based on approximating the function f(x) by a straight line. One can achieve better approximation and therefore faster convergence to the root if one approximates the function with a polynomial curve of higher order. The routine PiMethod1() uses the iteration
Both PiMethod0() and PiMethod1() require a computation of Sin(x) at every iteration. An industrial-strength arbitrary precision library such as gmp can multiply numbers much faster than it can evaluate a trigonometric function. Therefore, it would be good to have a method which does not require trigonometrics. PiMethod2() is a simple attempt to remedy the problem. It computes the Taylor series for ArcTan(x),
The routines PiBrentSalamin() and PiBorwein() are based on much more advanced mathematics. (See papers of P. Borwein for review and explanations of the methods.) They do not require evaluations of trigonometric functions, but they do require taking a few square roots at each iteration, and all calculations must be done using full precision. Using modern algorithms, one can compute a square root roughly in the same time as a division; but Yacas's internal math is not yet up to it. Therefore, these two routines perform poorly compared to the more simple-minded PiMethod0().
Inverse trigonometric functions are computed by Newton's method (for ArcSin) or by continued fraction expansion (for ArcTan),
By the identity ArcCos(x):=Pi/2-ArcSin(x), the inverse cosine is reduced to the inverse sine. Newton's method for ArcSin(x) consists of solving the equation Sin(y)=x for y. Implementation is similar to the calculation of pi in PiMethod0().
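Newton's method for Sin(y)=x reads y'=y-(Sin(y)-x)/Cos(y). A double-precision Python sketch, with the initial guess y=x and the stopping tolerance chosen here as assumptions:

```python
import math


def arcsin_newton(x: float, tol: float = 1e-15, max_iter: int = 100) -> float:
    """Approximate ArcSin(x) for -1 <= x <= 1 by Newton's method on Sin(y) = x."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("x must lie in [-1, 1]")
    y = x                                   # assumed initial guess
    for _ in range(max_iter):
        step = (math.sin(y) - x) / math.cos(y)
        y -= step
        if abs(step) <= tol:
            break
    return y


def arccos_via_arcsin(x: float) -> float:
    """ArcCos(x) = Pi/2 - ArcSin(x)."""
    return math.pi / 2 - arcsin_newton(x)


print(arcsin_newton(0.5), math.asin(0.5))
```

Near x=1 the derivative Cos(y) approaches zero at the root, so this plain iteration loses its quadratic convergence there.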
For x close to 1, Newton's method for ArcSin(x) converges very slowly. An identity
Inverse tangent can also be related to inverse sine by ArcTan(x)=ArcSin(x/Sqrt(1+x^2)) for real x.
Hyperbolic and inverse hyperbolic functions are reduced to exponentials and logarithms: Cosh(x)=1/2*(Exp(x)+Exp(-x)), Sinh(x)=1/2*(Exp(x)-Exp(-x)), Tanh(x)=Sinh(x)/Cosh(x), ArcCosh(x)=Ln(x+Sqrt(x^2-1)), ArcSinh(x)=Ln(x+Sqrt(x^2+1)), ArcTanh(x)=1/2*Ln((1+x)/(1-x)).
The idea to use continued fraction expansions for ArcTan comes from the book by Jack W. Crenshaw, MATH Toolkit for REAL-TIME Programming (CMP Media Inc., 2000). The author explains that, since the Taylor series for ArcTan(x) converges slowly, he had a hunch that the continued fraction expansion would converge rapidly, and he then shows that for ArcTan(x) this is true in a big way. This might not hold for every slowly converging series: the book demonstrates it only empirically, and no articles or books that prove it have been found yet.
One disadvantage of both continued fraction expansions and approximation by rational functions, compared to a simple series, is that it is in general not easy to do the calculation with one step more precision, due to the nature of the form of the expressions, and the way in which they change when expressions with one order better precision are considered. The coefficients of the terms in the polynomials defining the numerator and the denominator of the rational function change. This contrasts with a Taylor series expansion, where each additional term improves the accuracy of the result, and the calculation can be terminated when sufficient accuracy is achieved.
The convergence of the continued fraction expansion of ArcTan(x) is indeed better than convergence of the Taylor series. Namely, the Taylor series converges only for Abs(x)<1 while the continued fraction converges for all x. However, the speed of its convergence is not uniform in x; the larger the value of x, the slower the convergence. The necessary number of terms of the continued fraction is in any case proportional to the required number of digits of precision, but the constant of proportionality depends on x.
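The dependence of the convergence speed on x can be observed with a short Python sketch. It assumes the classical continued fraction ArcTan(x)=x/(1+x^2/(3+(2*x)^2/(5+(3*x)^2/(7+...)))), evaluated from the innermost retained term outwards; whether Yacas truncates or arranges the fraction in exactly this way is not stated here.

```python
import math


def arctan_cf(x: float, terms: int) -> float:
    """Truncated continued fraction x/(1 + x^2/(3 + (2x)^2/(5 + ...)))."""
    acc = 2.0 * terms + 1.0                 # innermost partial denominator
    for k in range(terms, 0, -1):
        acc = (2 * k - 1) + (k * x) ** 2 / acc
    return x / acc


for x in (0.5, 1.0, 10.0):
    # the same number of terms gives fewer correct digits for larger x
    print(x, abs(arctan_cf(x, 10) - math.atan(x)))
```

Because the truncation depth must be fixed before the evaluation starts, adding one more term means recomputing the whole fraction, in contrast to a Taylor series.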
This can be understood by the following elementary argument. The difference between two partial continued fractions that differ only by one extra last term can be estimated by
There are two tasks related to the factorial: the exact integer calculation and an approximate calculation to some floating-point precision. Factorial of n has approximately n*Ln(n)/Ln(10) decimal digits, so an exact calculation is practical only for relatively small n. In the current implementation, exact factorials for n>65535 are not computed; instead, an error message is printed advising the user to avoid exact computations. For example, LnGammaNum(n+1) is able to compute Ln(n!) for very large n to the desired floating-point precision.
A second method uses a binary tree arrangement of the numbers 1, 2, ..., n similar to the recursive sorting routine ("merge-sort"). If we denote by a *** b the "partial factorial" product a*(a+1)*...*(b-1)*b, then the tree-factorial algorithm consists of replacing n! by 1***n and recursively evaluating (1***m)*((m+1)***n) for some integer m near n/2. The partial factorials of nearby numbers such as m***(m+2) are evaluated explicitly. The binary tree algorithm requires one multiplication of P/2 digit integers at the last step, two P/4 digit multiplications at the last-but-one step and so on. There are O(Ln(n)) total steps of the recursion. If the cost of multiplication is M(P)=P^(1+a)*Ln(P)^b, then one can show that the total cost of the binary tree algorithm is O(M(P)) if a>0 and O(M(P)*Ln(n)) if a=0 (which corresponds to the asymptotically best multiplication algorithms).
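A minimal Python sketch of the tree arrangement follows. The function names are hypothetical, and the cut-off of three factors for explicit evaluation mirrors the m***(m+2) remark above.

```python
def partial_factorial(a: int, b: int) -> int:
    """Product a*(a+1)*...*b, the 'a *** b' of the text; 1 if a > b."""
    if a > b:
        return 1
    if b - a <= 2:                       # up to three nearby factors: multiply directly
        result = a
        for k in range(a + 1, b + 1):
            result *= k
        return result
    m = (a + b) // 2                     # split near the middle
    return partial_factorial(a, m) * partial_factorial(m + 1, b)


def tree_factorial(n: int) -> int:
    """n! computed as 1 *** n with the binary tree arrangement."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return partial_factorial(1, n)


print(tree_factorial(20))  # 2432902008176640000
```

The two factors of each product have similar sizes, which is what lets a fast multiplication routine pay off.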
Therefore, the tree method wins over the simple method if the cost of multiplication is lower than quadratic.
The tree method can also be used to compute "staggered factorials" (n!!). This is faster than using the identities (2*n)!!=2^n*n! and (2*n-1)!!=(2*n)!/(2^n*n!).
Binomial coefficients Bin(n,m) are found by first selecting the smaller of m, n-m and using the identity Bin(n,m)=Bin(n,n-m). Then a partial factorial is used to compute Bin(n,m)=((n-m+1)***n)/(m!). This is always much faster than computing the three factorials in the definition of Bin(n,m).
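The reduction to a partial factorial can be sketched in Python as follows. For brevity the sketch uses math.prod for both products rather than the tree arrangement; the function name binomial is an assumption.

```python
from math import prod


def binomial(n: int, m: int) -> int:
    """Bin(n,m) computed as ((n-m+1) *** n) / m! after replacing m by min(m, n-m)."""
    if m < 0 or m > n:
        return 0
    m = min(m, n - m)                              # Bin(n,m) = Bin(n,n-m)
    numerator = prod(range(n - m + 1, n + 1))      # partial factorial (n-m+1) *** n
    denominator = prod(range(1, m + 1))            # m!
    return numerator // denominator                # the division is exact


print(binomial(52, 5))  # 2598960
```

Only 2*m factors are multiplied in total, compared with roughly 2*n for the three full factorials.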