 +---------------------------------------------------------------------------+
 |  wm-FPU-emu   an FPU emulator for 80386 and 80486SX microprocessors.      |
 |                                                                           |
 |  Copyright (C) 1992,1993,1994,1995,1996,1997,1999                         |
 |                       W. Metzenthen, 22 Parker St, Ormond, Vic 3163,      |
 |                       Australia.  E-mail   billm@melbpc.org.au            |
 |                                                                           |
 |    This program is free software; you can redistribute it and/or modify   |
 |    it under the terms of the GNU General Public License version 2 as      |
 |    published by the Free Software Foundation.                             |
 |                                                                           |
 |    This program is distributed in the hope that it will be useful,        |
 |    but WITHOUT ANY WARRANTY; without even the implied warranty of         |
 |    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the          |
 |    GNU General Public License for more details.                           |
 |                                                                           |
 |    You should have received a copy of the GNU General Public License      |
 |    along with this program; if not, write to the Free Software            |
 |    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.              |
 |                                                                           |
 +---------------------------------------------------------------------------+



wm-FPU-emu is an FPU emulator for Linux. It is derived from wm-emu387
which was my 80387 emulator for early versions of djgpp (gcc under
msdos); wm-emu387 was in turn based upon emu387 which was written by
DJ Delorie for djgpp. The interface to the Linux kernel is based upon
the original Linux math emulator by Linus Torvalds.

My target FPU for wm-FPU-emu is that described in the Intel486
Programmer's Reference Manual (1992 edition). Unfortunately, numerous
facets of the functioning of the FPU are not well covered in the
Reference Manual. The information in the manual has been supplemented
with measurements on real 80486's. Unfortunately, it is simply not
possible to be sure that all of the peculiarities of the 80486 have
been discovered, so there are always likely to be obscure differences
in the detailed behaviour of the emulator and a real 80486.

wm-FPU-emu does not implement all of the behaviour of the 80486 FPU,
but is very close. See "Limitations" later in this file for a list of
some differences.

Please report bugs, etc to me at:
       billm@melbpc.org.au
or     b.metzenthen@medoto.unimelb.edu.au

For more information on the emulator and on floating point topics, see
my web pages, currently at  http://www.suburbia.net/~billm/


--Bill Metzenthen
  December 1999


----------------------- Internals of wm-FPU-emu -----------------------

Numeric algorithms:
(1) Add, subtract, and multiply. Nothing remarkable in these.
(2) Divide has been tuned to get reasonable performance. The algorithm
    is not the obvious one which most people seem to use, but is designed
    to take advantage of the characteristics of the 80386. I expect that
    it has been invented many times before I discovered it, but I have not
    seen it. It is based upon one of those ideas which one carries around
    for years without ever bothering to check it out.
(3) The sqrt function has been tuned to get good performance. It is based
    upon Newton's classic method. Performance was improved by capitalizing
    upon the properties of Newton's method, and the code is once again
    structured taking account of the 80386 characteristics.
(4) The trig, log, and exp functions are based in each case upon quasi-
    "optimal" polynomial approximations. My definition of "optimal" was
    based upon getting good accuracy with reasonable speed.
(5) The argument reducing code for the trig function effectively uses
    a value of pi which is accurate to more than 128 bits. As a consequence,
    the reduced argument is accurate to more than 64 bits for arguments up
    to a few pi, and accurate to more than 64 bits for most arguments,
    even for arguments approaching 2^63. This is far superior to an
    80486, which uses a value of pi which is accurate to 66 bits.
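
As an illustration of item (3), the following is a minimal sketch of
Newton's iteration applied to an integer square root. It is not taken
from the emulator source: the function name isqrt_newton is hypothetical,
it works on a plain 64-bit integer rather than on the emulator's extended
precision significand, and it makes no attempt at the 80386-specific
tuning mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (an assumption for illustration, not emulator code):
 * returns floor(sqrt(n)) using Newton's iteration  x' = (x + n/x) / 2.
 */
static uint64_t isqrt_newton(uint64_t n)
{
    uint64_t x, y;

    if (n < 2)
        return n;               /* sqrt(0) = 0, sqrt(1) = 1 */

    /* Start at or above the true root; n/2 + 1 cannot overflow. */
    x = n / 2 + 1;
    y = (x + n / x) / 2;

    /* The estimates decrease until they reach floor(sqrt(n)). */
    while (y < x) {
        x = y;
        assert(x != 0);         /* x never drops below 1 for n >= 2 */
        y = (x + n / x) / 2;
    }
    return x;
}
```

For example, isqrt_newton(17) returns 4. Each iteration roughly doubles
the number of correct bits, which is the property of Newton's method
that makes it attractive for a software sqrt.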

The code of the emulator is complicated slightly by the need to
account for a limited form of re-entrancy. Normally, the emulator will
emulate each FPU instruction to completion without interruption.
However, it may happen that when the emulator is accessing the user
memory space, swapping may be needed. In this case the emulator may be
temporarily suspended while disk i/o takes place. During this time
another process may use the emulator, thereby perhaps changing static
variables. The code which accesses user memory is confined to five
files:
    fpu_entry.c
    reg_ld_str.c
    load_store.c
    get_address.c
    errors.c
As from version 1.12 of the emulator, no static variables are used
(apart from those in the kernel's per-process tables). The emulator is
therefore now fully re-entrant, rather than having just the restricted
form of re-entrancy which is required by the Linux kernel.

----------------------- Limitations of wm-FPU-emu -----------------------

There are a number of differences between the current wm-FPU-emu
(version 2.01) and the 80486 FPU (apart from bugs). The differences
are fewer than those which applied to the 1.xx series of the emulator.
Some of the more important differences are listed below:

The Roundup flag does not have much meaning for the transcendental
functions and its 80486 value with these functions is likely to differ
from its emulator value.

In a few rare cases the Underflow flag obtained with the emulator will
be different from that obtained with an 80486. This occurs when the
following conditions apply simultaneously:
(a) the operands have a higher precision than the current setting of the
    precision control (PC) flags.
(b) the underflow exception is masked.
(c) the magnitude of the exact result (before rounding) is less than 2^-16382.
(d) the magnitude of the final result (after rounding) is exactly 2^-16382.
(e) the magnitude of the exact result would be exactly 2^-16382 if the
    operands were rounded to the current precision before the arithmetic
    operation was performed.
If all of these apply, the emulator will set the Underflow flag but a real
80486 will not.

NOTE: Certain formats of Extended Real are UNSUPPORTED. They are
unsupported by the 80486. They are the Pseudo-NaNs, Pseudoinfinities,
and Unnormals. None of these will be generated by an 80486 or by the
emulator. Do not use them. The emulator treats them differently in
detail from the way an 80486 does.

Self modifying code can cause the emulator to fail. An example of such
code is:
          movl %esp,[%ebx]
          fld1
The FPU instruction may be (usually will be) loaded into the pre-fetch
queue of the CPU before the mov instruction is executed. If the
destination of the 'movl' overlaps the FPU instruction then the bytes
in the prefetch queue and memory will be inconsistent when the FPU
instruction is executed. The emulator will be invoked but will not be
able to find the instruction which caused the device-not-present
exception. For this case, the emulator cannot emulate the behaviour of
an 80486DX.

Handling of the address size override prefix byte (0x67) has not been
extensively tested yet. A major problem exists because using it in
vm86 mode can cause a general protection fault. Address offsets
greater than 0xffff appear to be illegal in vm86 mode but are quite
acceptable (and work) in real mode. A small test program developed to
check the addressing, and which runs successfully in real mode,
crashes dosemu under Linux and also brings Windows down with a general
protection fault message when run under the MS-DOS prompt of Windows
3.1. (The program simply reads data from a valid address).

The emulator supports 16-bit protected mode, with one difference from
an 80486DX.
An 80486DX will allow some floating point instructions to
write a few bytes below the lowest address of the stack. The emulator
will not allow this in 16-bit protected mode: no instructions are
allowed to write outside the bounds set by the protection.

----------------------- Performance of wm-FPU-emu -----------------------

Speed.
-----

The speed of floating point computation with the emulator will depend
upon instruction mix. Relative performance is best for the instructions
which require most computation. The simple instructions are adversely
affected by the FPU instruction trap overhead.


Timing: Some simple timing tests have been made on the emulator functions.
The times include load/store instructions. All times are in microseconds
measured on a 33MHz 386 with 64k cache. The Turbo C tests were under
ms-dos, the next two columns are for emulators running with the djgpp
ms-dos extender. The final column is for wm-FPU-emu in Linux 0.97,
using libm4.0 (hard).

function      Turbo C        djgpp 1.06        WM-emu387      wm-FPU-emu

   +          60.5           154.8             76.5           139.4
   -          61.1-65.5      157.3-160.8       76.2-79.5      142.9-144.7
   *          71.0           190.8             79.6           146.6
   /          61.2-75.0      261.4-266.9       75.3-91.6      142.2-158.1

 sin()        310.8          4692.0            319.0          398.5
 cos()        284.4          4855.2            308.0          388.7
 tan()        495.0          8807.1            394.9          504.7
 atan()       328.9          4866.4            601.1          419.5-491.9

 sqrt()       128.7          crashed           145.2          227.0
 log()        413.1-419.1    5103.4-5354.21    254.7-282.2    409.4-437.1
 exp()        479.1          6619.2            469.1          850.8


The performance under Linux is improved by the use of look-ahead code.
The following results show the improvement which is obtained under
Linux due to the look-ahead code. Also given are the times for the
original Linux emulator with the 4.1 'soft' lib.

 [ Linus' note: I changed look-ahead to be the default under linux, as
   there was no reason not to use it after I had edited it to be
   disabled during tracing ]

              wm-FPU-emu w     original w
              look-ahead       'soft' lib
   +          106.4            190.2
   -          108.6-111.6      192.4-216.2
   *          113.4            193.1
   /          108.8-124.4      700.1-706.2

 sin()        390.5            2642.0
 cos()        381.5            2767.4
 tan()        496.5            3153.3
 atan()       367.2-435.5      2439.4-3396.8

 sqrt()       195.1            4732.5
 log()        358.0-387.5      3359.2-3390.3
 exp()        619.3            4046.4


These figures are now somewhat out-of-date. The emulator has become
progressively slower for most functions as more of the 80486 features
have been implemented.


----------------------- Accuracy of wm-FPU-emu -----------------------


The accuracy of the emulator is in almost all cases equal to or better
than that of an Intel 80486 FPU.

The results of the basic arithmetic functions (+,-,*,/), and fsqrt
match those of an 80486 FPU. They are the best possible; the error for
these never exceeds 1/2 an lsb. The fprem and fprem1 instructions
return exact results; they have no error.


The following table compares the emulator accuracy for the sqrt(),
trig and log functions against the Turbo C "emulator". For this table,
each function was tested at about 400 points. Ideal worst-case results
would be 64 bits. The reduced Turbo C accuracy of cos() and tan() for
arguments greater than pi/4 can be thought of as being related to the
precision of the argument x; e.g. an argument of pi/2-(1e-10) which is
accurate to 64 bits can result in a relative accuracy in cos() of
about 64 + log2(cos(x)) = 31 bits.


Function    Tested x range           Worst result              Turbo C
                                     (relative bits)

sqrt(x)     1 .. 2                   64.1                      63.2
atan(x)     1e-10 .. 200             64.2                      62.8
cos(x)      0 .. pi/2-(1e-10)        64.4 (x <= pi/4)          62.4
                                     64.1 (x = pi/2-(1e-10))   31.9
sin(x)      1e-10 .. pi/2            64.0                      62.8
tan(x)      1e-10 .. pi/2-(1e-10)    64.0 (x <= pi/4)          62.1
                                     64.1 (x = pi/2-(1e-10))   31.9
exp(x)      0 .. 1                   63.1 **                   62.9
log(x)      1+1e-6 .. 2              63.8 **                   62.1

** The accuracy for exp() and log() is low because the FPU (emulator)
does not compute them directly; two operations are required.


The emulator passes the "paranoia" tests (compiled with gcc 2.3.3 or
later) for 'float' variables (24 bit precision numbers) when precision
control is set to 24, 53 or 64 bits, and for 'double' variables (53
bit precision numbers) when precision control is set to 53 bits (a
properly performing FPU cannot pass the 'paranoia' tests for 'double'
variables when precision control is set to 64 bits).

The code for reducing the argument for the trig functions (fsin, fcos,
fptan and fsincos) has been improved and now effectively uses a value
for pi which is accurate to more than 128 bits precision. As a
consequence, the accuracy of these functions for large arguments has
been dramatically improved (and is now very much better than an 80486
FPU). There is also now no degradation of accuracy for fcos and fptan
for operands close to pi/2. Measured results are (note that the
definition of accuracy has changed slightly from that used for the
above table):

Function    Tested x range         Worst result
                                   (absolute bits)

cos(x)      0 .. 9.22e+18          62.0
sin(x)      1e-16 .. 9.22e+18      62.1
tan(x)      1e-16 .. 9.22e+18      61.8

It is possible with some effort to find very large arguments which
give much degraded precision. For example, the integer number
           8227740058411162616.0
is within about 1e-7 of a multiple of pi. To find the tan (for
example) of this number to 64 bits precision it would be necessary to
have a value of pi which had about 150 bits precision. The FPU
emulator computes the result to about 42.6 bits precision (the correct
result is about -9.739715e-8).
On the other hand, an 80486 FPU returns
0.01059, which in relative terms is hopelessly inaccurate.

For arguments close to critical angles (which occur at multiples of
pi/2) the emulator is more accurate than an 80486 FPU. For very large
arguments, the emulator is far more accurate.


Prior to version 1.20 of the emulator, the accuracy of the results for
the transcendental functions (in their principal range) was not as
good as the results from an 80486 FPU. From version 1.20, the accuracy
has been considerably improved and these functions now give measured
worst-case results which are better than the worst-case results given
by an 80486 FPU.

The following table gives the measured results for the emulator. The
number of randomly selected arguments in each case is about half a
million. The group of three columns gives the frequency of the given
accuracy in number of times per million, thus the second of these
columns shows that an accuracy of between 63.80 and 63.89 bits was
found at a rate of 133 times per one million measurements for fsin.
The results show that the fsin, fcos and fptan instructions return
results which are in error (i.e. less accurate than the best possible
result (which is 64 bits)) for about one per cent of all arguments
between -pi/2 and +pi/2. The other instructions have a lower
frequency of results which are in error. The last two columns give
the worst accuracy which was found (in bits) and the approximate value
of the argument which produced it.

                                frequency (per M)
                                -------------------   ---------------
instr    arg range     # tests  63.7   63.8    63.9   worst   at arg
                                bits   bits    bits    bits
-----    ------------  -------  ----   ----   -----   -----   --------
fsin     (0,pi/2)       547756     0    133   10673   63.89   0.451317
fcos     (0,pi/2)       547563     0    126   10532   63.85   0.700801
fptan    (0,pi/2)       536274    11    267   10059   63.74   0.784876
fpatan   4 quadrants    517087     0      8    1855   63.88   0.435121 (4q)
fyl2x    (0,20)         541861     0      0    1323   63.94   1.40923  (x)
fyl2xp1  (-.293,.414)   520256     0      0    5678   63.93   0.408542 (x)
f2xm1    (-1,1)         538847     4    481    6488   63.79   0.167709


Tests performed on an 80486 FPU showed results of lower accuracy. The
following table gives the results which were obtained with an AMD
486DX2/66 (other tests indicate that an Intel 486DX produces
identical results). The tests were basically the same as those used
to measure the emulator (the values, being random, were in general not
the same). The total number of tests for each instruction is given
at the end of the table, in each case about 100k tests were performed.
Another line of figures at the end of the table shows that most of the
instructions return results which are in error for more than 10
percent of the arguments tested.

The numbers in the body of the table give the approx number of times a
result of the given accuracy in bits (given in the left-most column)
was obtained per one million arguments. For three of the instructions,
two columns of results are given: * The second column for f2xm1 gives
the number of cases where the results of the first column were for a
positive argument, this shows that this instruction gives better
results for positive arguments than it does for negative.  * In the
cases of fcos and fptan, the first column gives the results when all
cases where arguments greater than 1.5 were removed from the results
given in the second column.
Unlike the emulator, an 80486 FPU returns
results of relatively poor accuracy for these instructions when the
argument approaches pi/2. The table does not show those cases when the
accuracy of the results was less than 62 bits, which occurs quite
often for fsin and fptan when the argument approaches pi/2. This poor
accuracy is discussed above in relation to the Turbo C "emulator", and
the accuracy of the value of pi.


bits   f2xm1   f2xm1  fpatan    fcos    fcos   fyl2x fyl2xp1    fsin   fptan   fptan
62.0       0       0       0       0     437       0       0       0       0     925
62.1       0       0      10       0     894       0       0       0       0    1023
62.2      14       0       0       0    1033       0       0       0       0     945
62.3      57       0       0       0    1202       0       0       0       0    1023
62.4     385       0       0      10    1292       0      23       0       0    1178
62.5    1140       0       0     119    1649       0      39       0       0    1149
62.6    2037       0       0     189    1620       0      16       0       0    1169
62.7    5086      14       0     646    2315      10     101      35      39    1402
62.8    8818      86       0     984    3050      59     287     131     224    2036
62.9   11340    1355       0    2126    4153      79     605     357     321    1948
63.0   15557    4750       0    3319    5376     246    1281     862     808    2688
63.1   20016    8288       0    4620    6628     511    2569    1723    1510    3302
63.2   24945   11127      10    6588    8098    1120    4470    2968    2990    4724
63.3   25686   12382      69    8774   10682    1906    6775    4482    5474    7236
63.4   29219   14722      79   11109   12311    3094    9414    7259    8912   10587
63.5   30458   14936     393   13802   15014    5874   12666    9609   13762   15262
63.6   32439   16448    1277   17945   19028   10226   15537   14657   19158   20346
63.7   35031   16805    4067   23003   23947   18910   20116   21333   25001   26209
63.8   33251   15820    7673   24781   25675   24617   25354   24440   29433   30329
63.9   33293   16833   18529   28318   29233   31267   31470   27748   29676   30601

Per cent with error:
        30.9             3.2            18.5     9.8    13.1    11.6            17.4
Total arguments tested:
       70194   70099  101784  100641  100641  101799  128853  114893  102675  102675


------------------------- Contributors -------------------------------

A number of people have contributed to the development of the
emulator, often by just reporting bugs, sometimes with suggested
fixes, and a few kind people have provided me with access in one way
or another
to an 80486 machine. Contributors include (to those people
who I may have forgotten, please forgive me):

Linus Torvalds
Tommy.Thorn@daimi.aau.dk
Andrew.Tridgell@anu.edu.au
Nick Holloway, alfie@dcs.warwick.ac.uk
Hermano Moura, moura@dcs.gla.ac.uk
Jon Jagger, J.Jagger@scp.ac.uk
Lennart Benschop
Brian Gallew, geek+@CMU.EDU
Thomas Staniszewski, ts3v+@andrew.cmu.edu
Martin Howell, mph@plasma.apana.org.au
M Saggaf, alsaggaf@athena.mit.edu
Peter Barker, PETER@socpsy.sci.fau.edu
tom@vlsivie.tuwien.ac.at
Dan Russel, russed@rpi.edu
Daniel Carosone, danielce@ee.mu.oz.au
cae@jpmorgan.com
Hamish Coleman, t933093@minyos.xx.rmit.oz.au
Bruce Evans, bde@kralizec.zeta.org.au
Timo Korvola, Timo.Korvola@hut.fi
Rick Lyons, rick@razorback.brisnet.org.au
Rick, jrs@world.std.com

...and numerous others who responded to my request for help with
a real 80486.