11/26/07: Benchmarking the CoCo by jdiffend | Category: General | 39 comments (Comments are closed)

Benchmarking the CoCo

Back in the '80s, Creative Computing published a BASIC benchmark written by David H. Ahl and at least one article comparing many of the machines of the day. I’ve found the main article and the program’s source on the web, but I can’t find the results anymore. If anyone has the magazine, I’d appreciate it if you would post the results.

I know the CoCo and MC10 were evaluated, and I’m pretty sure the CoCo times were about average and that the author was not a fan of the CoCo. I never saw an evaluation of the CoCo running with the high-speed poke, or of the CoCo3 (it didn’t exist yet).

Here is the article minus the benchmark results and program (VOL. 9, NO. 11 / NOVEMBER 1983):
[url]http://www.atarimagazines.com/creative/v9n11/259_Benchmark_comparison_test.php[/url]

And here is the program:
[url]http://setiathome.berkeley.edu/~korpela/ahl/basic.html[/url]
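
For anyone who can’t pull up the links, the benchmark is essentially the following (a sketch from memory, not the exact published listing, so the line numbers here won’t match the ones quoted in the comments). The published version uses RND(1), which on the CoCo has to become RND(0) to get a 0-1 float:
[code]
10 ' Ahl's simple benchmark (sketch; see the link for the exact listing)
20 FOR N=1 TO 100: A=N
30 FOR I=1 TO 10: A=SQR(A): R=R+RND(0): NEXT I
40 FOR I=1 TO 10: A=A^2: R=R+RND(0): NEXT I
50 S=S+A: NEXT N
60 PRINT "ACCURACY"; ABS(1010-S/5)
70 PRINT "RANDOM"; ABS(1000-R)
[/code]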

39 comments to Benchmarking the CoCo

  • RobertGault

    Don’t have that issue but can report results for a Coco3 in fast clock.
    time 1min12sec
    accuracy 5.96284867E-04
    random 7.3876276 test1, 24.5601945 test2, 9.40813446 test3

I used the GW-BASIC version of the program, modified to use RND(0), and changed the TIME$ statements (lines 15 and 115) to a command that reads the Smartwatch on my system.

The article has an interesting typo where the sum of 1 to 100 is said to be 1010; it’s 5050. Of course, Ahl knew that, as line 100 of the program has 1010-S/5, not 1010-S (1010 is just 5050/5, so the printed error is the true error scaled down by 5). Why not 5050-S?

It would seem that a Coco running at 2MHz has good speed but limited precision, which is a function of the method used to represent floating-point numbers. The article does not indicate whether the units tested used single or double precision; if that data was not in the missing table, the results would be misleading.

  • jdiffend

    Don’t have that issue but can report results for a Coco3 in fast clock.
    time 1min12sec
    accuracy 5.96284867E-04
    random 7.3876276 test1, 24.5601945 test2, 9.40813446 test3

    Is that with or without the faster keyboard handling?
    I’m guessing without.

  • jdiffend

    BTW, Ahl’s assumption that random numbers should be evenly distributed is wrong. The numbers may appear random but in reality they are actually too random to be real.
    A real random function should return different results every time this program is run.

  • RobertGault

    Patching the ROM at $A1C1 for a faster keyboard routine gives a time of 1min9sec vs 1min11sec, an insignificant improvement.

    You can seed the random number generator to prevent repeats from a cold start by adding a line at the beginning of the program: A=RND(-TIMER)

  • jdiffend

    Does a 6309 in native mode have much of an impact on the numbers?

  • RobertGault

    Assuming the benchmark would run without crashing in 6309 native mode, I’d not expect much improvement. 10-15% might be about right. I’ll need to write an ml routine to switch the Coco into native mode after the program is loaded in order to run a test.

    EDIT 5:30
    Just ran a test in native 6309 mode and the time decreased by about 21%. The benchmark now completes in 56 seconds.

  • jdiffend

    The fastest system tested before was a 4MHz Z80 that completed it in 1:09 if the webpage I saw is correct.

  • jdiffend

    An Atari running Turbo Basic XL yielded: Time=41.6 Accuracy=.013649
    Turbo Basic XL is highly optimized.
Standard Atari BASIC yielded: Time=6:45 (!) Accuracy=.013959

  • RobertGault

Clearly the benchmark is more dependent on the language than on the hardware. We have already established that Disk Basic takes about 1min12sec to run, and about 56 seconds in 6309 native mode; not much difference.

    What would happen if the benchmark were run on the same machine in Basic09? Here is the code I used.
    [code:1:955318fd32]
    DIM i,n:INTEGER
    s:=0 \ r:=0
    a:=RND(-VAL(RIGHT$(DATE$,2)))
    PRINT DATE$
    FOR n=1 TO 100 \ a:=n
    FOR i=1 TO 10
    a:=SQR(a) \ r:=r+RND(0)
    NEXT i
    FOR i=1 TO 10
    a:=SQ(a) \ r:=r+RND(0)
    NEXT i
    s:=s+a
    NEXT n
    PRINT "Accuracy ";
    IF s<5050 THEN
    PRINT 1010-s/5
    ELSE
    PRINT s/5-1010
    ENDIF
    PRINT "Random "; ABS(1000-r)
    PRINT DATE$[/code:1:955318fd32]

    The benchmark ran in 6 seconds!

    Note that the version of Basic09 I used could not handle ABS() of a float in E notation, which is why I used the IF/THEN/ELSE. Also note that in Basic09, as opposed to ROM Basic, it is necessary to initialize variables, as they are not cleared with each RUN command.

    Why is Basic09 so much faster? Probably because integers are used as the indexes in the FOR/NEXT loops rather than floating point numbers.

  • jdiffend

    Anyone with a 4MHz 6309 want to give this a try?

  • jdiffend

    Why is Basic09 so much faster? Probably because integers are used as the indexes in the FOR/NEXT loops rather than floating point numbers.

    I think there’s more to it than that.

    I’m not sure if BASIC09 uses a P-code interpreter or if it generates native code. Either way, numbers are already converted from strings into their actual types. Depending on the BASIC, numbers embedded in source code are stored as strings and have to be converted every time the line is executed; not very efficient if you ask me. You also have pointers to the lines you jump to rather than searching through a linked list. There are lots of advantages.
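
    You can actually see the constant re-parsing cost on a CoCo. A quick sketch (assuming Extended Color BASIC, where TIMER counts 60ths of a second and can be reset by assignment):
    [code]
    10 ' a literal constant is re-parsed on every pass through the loop...
    20 TIMER=0: FOR I=1 TO 1000: A=3.14159: NEXT I: PRINT TIMER/60; "SEC"
    30 ' ...but a variable is just copied
    40 P=3.14159
    50 TIMER=0: FOR I=1 TO 1000: A=P: NEXT I: PRINT TIMER/60; "SEC"
    [/code]
    The second loop should come out measurably faster for exactly this reason.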

  • RobertGault

    The Basic09 code runs at the same speed packed or unpacked.

    There is a significant difference between ROM Basic and Basic09. Basic09 permits defining variable types. The relevant types here are BYTE, INTEGER, and REAL. ROM Basic always uses REAL numbers.

    When you run a FOR/NEXT loop using real numbers, you place a significant overhead on the operation. Let’s look at a simple “do nothing” test program.
    [code:1:81ab56c145]
    DIM i,n:INTEGER
    PRINT DATE$
    FOR n= 1 TO 200
    FOR i= 1 TO 200
    NEXT i
    NEXT n
    PRINT DATE$[/code:1:81ab56c145]
    The above takes 3-4 seconds to run. Remove the DIM statement, or change INTEGER to REAL, and the program takes 16 seconds to run, 4-5 times slower.

    That's a good part of the speed-up between ROM Basic and Basic09. I'd guess the random number routine may account for the rest of the difference but I've not made a comparison. I would also need to test the square and square root routines for speed.

  • jdiffend

    The benchmark ran in 6 seconds!

    Note that the version of Basic09 I used could not handle ABS() of a float in E notation, which is why I used the IF/THEN/ELSE. Also note that in Basic09, as opposed to ROM Basic, it is necessary to initialize variables, as they are not cleared with each RUN command.

    Why is Basic09 so much faster? Probably because integers are used as the indexes in the FOR/NEXT loops rather than floating point numbers.

    BTW, what was the accuracy?

  • RobertGault

    I thought you would run the benchmark yourself since I posted the Basic09 source :) Random is variable but essentially the same as for ROM Basic.

    The accuracy is 9.45091248E-04, slightly different from ROM Basic but not much. Both languages use the same sized real, 5 bytes.

  • jdiffend

    1. My CoCos are in another state.
    2. I rarely mess with OS-9.
    3. I don’t own Basic09.

    BTW, the Turbo Basic Compiler for the Atari ran it in 33.1 seconds, and the accuracy wasn’t even close to Basic09’s. But then it’s way faster than standard Atari BASIC, which took something like 6:30.

    The Plus/4 managed 1:49. Compiled should be about half that.

    The Apple IIc Plus ran it in about 31 seconds. Compiled would be about half that. The IIc Plus has a 4MHz CPU.

    Until you go to a 65816 running at a much higher clock rate, I think the Basic09 number will be the one to beat.

  • RobertGault

    Well MESS runs at the same speed as a real Coco so you could run these tests via emulation.

    I can provide Basic09 if you want to try it. The NitrOS-9 distribution contains RunB which means you could run packed Basic09 programs.

  • jdiffend

    Emulation would be close but not exact.
    I already have the numbers now so there’s really no need anyway.

  • Lomonster

    BTW, Ahl’s assumption that random numbers should be evenly distributed is wrong. The numbers may appear random but in reality they are actually too random to be real.
    A real random function should return different results every time this program is run.

    Actually Ahl is correct in his assumptions. RAND is generally assumed (and actually required in many languages) to provide uniform random numbers in some range. Uniform random numbers are, by definition, evenly distributed.

    He is also using the strong law of large numbers, which in this case says that as the sample size grows, the sample average converges to the mean, which for uniform values on [0,1] is 1/2. So his test of adding 2000 uniformly random numbers on [0,1] and comparing the result to 1000 is a pretty good test of the generator’s quality (for such a short piece of code).
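
    This is easy to check on the CoCo itself; a quick sketch (RND(0) is Color BASIC’s 0-1 float):
    [code]
    10 ' sum 2000 uniform 0-1 numbers; the total should land near 1000
    20 R=0: FOR I=1 TO 2000: R=R+RND(0): NEXT I
    30 PRINT "SUM ="; R; "ERROR ="; ABS(1000-R)
    [/code]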

    There is no definitive test for randomness. If you’re inclined, try to precisely define “random number.”

    It is impossible to generate true randomness using deterministic methods. Thus computer random number generators are called PRNGs – Pseudo-Random Number Generators. You can gather entropy into a computer from outside noise, which is what cryptographically secure RNGs use.

    Check out Wikipedia or similar articles on these issues.

    I don’t know what you mean by “too random to be real.”

  • Lomonster

    http://www.random.org/

    Those sites are neat. I have an article in Game Programming Gems 7 (coming out in the spring) on random number generation, and that site (and a few others like it) are listed.

    Two others from my article:
    http://www.fourmilab.ch/hotbits
    http://www.lavarnd.org

    However they are not using deterministic methods, except to whiten the output.

    Try running the Ahl averaging test on them. You’ll find random numbers do behave like those in the test. For example, I just used http://www.random.org to get 2000 integers in [0,100], obtaining a sum of 101912, for an average value of 50.956, very close to the expected mean.

  • jdiffend

    Do you have a good random number generator for the 6809?
    I wrote one for use in a C library but I haven’t tested how well it works yet. It was pretty much a conversion of a simple one for the 6502. It was designed more for speed than perfect distribution of results.
    Now if I can just remember where I put it. The machine it’s on crashes if I try to do much with it. I hate windows.

  • RobertGault

    There is one by Mark McDougall in Mary’s Oct. 2006 newsletter. There is one in my Symmetry program which you can download from my web pages. The source is included.
    [url]http://home.att.net/~robert.gault/Coco/Downloads/Downloads.htm[/url]
    I think there is one on Lomonster’s site.

  • Lomonster

    Do you have a good random number generator for the 6809?
    I wrote one for use in a C library but I haven’t tested how well it works yet. It was pretty much a conversion of a simple one for the 6502. It was designed more for speed than perfect distribution of results.
    Now if I can just remember where I put it. The machine it’s on crashes if I try to do much with it. I hate windows.

    For a decently fast one, try one of these http://en.wikipedia.org/wiki/Linear_feedback_shift_register

    However, they do not have great properties, nor are they secure. Making a cryptographically secure one on the Coco is a very hard task, so you should go for speed.
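
    As a concrete sketch, here is a small Galois LFSR in Color BASIC, using a commonly listed maximal 8-bit polynomial, x^8+x^6+x^5+x^4+1 (mask &HB8). Color BASIC has no XOR operator, so it is synthesized from OR and AND:
    [code]
    10 ' 8-bit Galois LFSR, period 255 (sketch)
    20 S=1 ' any nonzero seed
    30 FOR I=1 TO 16
    40 B=S AND 1 ' the output bit
    50 S=INT(S/2) ' shift right
    60 IF B=1 THEN S=(S OR 184)-(S AND 184) ' S = S XOR &HB8
    70 PRINT B; ' the pseudo-random bit stream
    80 NEXT I
    [/code]
    In assembly this reduces to a shift and a conditional EOR per bit, which is why LFSRs are so cheap.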

    Many modern C rand() functions use a linear congruential method, which would be slow on a CoCo due to a large multiply (and possibly an expensive modulus operation).
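
    The general form is X = (A*X + C) MOD M; a toy version in BASIC (the constants here are purely illustrative, not taken from any real rand()):
    [code]
    10 ' toy linear congruential generator (illustrative constants only)
    20 X=12345 ' seed
    30 FOR I=1 TO 5
    40 X=X*2053+13849 ' one LCG step...
    50 X=X-INT(X/65536)*65536 ' ...reduced MOD 65536 (Color BASIC has no MOD)
    60 PRINT X/65536 ' scaled to a 0-1 float
    70 NEXT I
    [/code]
    Even a toy version like this needs a multi-byte multiply and divide per number, which is where the cost comes from.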

    Another one that would work well on a Coco and provide tremendously good quality is the WELL512 family of algorithms by L’Ecuyer ( http://www.iro.umontreal.ca/%7Epanneton/WELLRNG.html ), which is even better than the popular Mersenne-twister-based PRNGs.

    I’ve created a few PRNGs for embedded projects where bits and memory are constrained, as on the CoCo. The original HypnoCube ( http://www.hypnocube.com ) used a home-grown one based on Galois relations. The newer HypnoCubes (and the soon-to-be-released HypnoSquare) use the WELL512 algorithm, which provides state-of-the-art, high-quality pseudo-random numbers and is fast and easy to implement on the PICs.

    Hope this sets you in a good direction :).

  • RobertGault

    Chris, do you know if any LFSR sequences can pass a spectral test, say plotting N vs N+1? I just get two parallel lines, which is pretty bad.

  • Lomonster

    Chris, do you know if any LFSR sequences can pass a spectral test, say plotting N vs N+1? I just get two parallel lines, which is pretty bad.

    No – none of the LFSRs pass linear relations tests, since they are *linear* feedback. The output bit is a linear function of the state, hence it only takes a small amount of output to deduce the internal state.

    Same thing goes for linear congruential methods.

    The only benefit for LFSRs is they are fast and easy to implement and plenty good for many things such as games. They are not good when quality RNGs are needed, like weather simulation :)

  • RobertGault

    I’ve just tried some tests on an LFSR: 32 bits (to match the Coco floating-point notation) with taps at 1, 2, 22, and 32. The period of this LFSR is 4,294,967,295, so there is lots of room for skipping output. If I take every seventh output (i.e. skip 6) and plot the N vs N+1 values, a 192×192 screen is almost filled.

    That’s a massive improvement (I think) in the spectral test for pairs. I’ve not tried to plot triplets. However, there does not seem to be any degradation in the sums of sets of 10,000 values; they are within about ±40 of the expected sum of 5000. If I plot the count of hits for each value 0-191, the bar graph is fairly smooth.

    What would you say about that type of PRNG, as a working mathematician?

  • Lomonster

    I’ve just tried some tests on an LFSR: 32 bits (to match the Coco floating-point notation) with taps at 1, 2, 22, and 32. The period of this LFSR is 4,294,967,295, so there is lots of room for skipping output. If I take every seventh output (i.e. skip 6) and plot the N vs N+1 values, a 192×192 screen is almost filled.

    That’s a massive improvement (I think) in the spectral test for pairs. I’ve not tried to plot triplets. However, there does not seem to be any degradation in the sums of sets of 10,000 values; they are within about ±40 of the expected sum of 5000. If I plot the count of hits for each value 0-191, the bar graph is fairly smooth.

    What would you say about that type of PRNG, as a working mathematician?

    Again – it all depends on what you are trying to do. If you’re getting your values 0-191 by taking 8 bits (0-255 value) of output from the PRNG, and then wrapping, you’ll get twice as many occurrences of small numbers as you should expect. If you discard, you’ll get more uniform values, at the cost of losing some of the period.

    If you just want to pass spectral tests for a *fixed* dimension, just pick a value n relatively prime to the size of the space and repeatedly add it to itself, converting to pixel coords. This will hit each point exactly once before repeating. For example, 256×256 = 65536, so taking a variable p = 0, repeatedly adding, say, 29501, wrapping around at 65536, converting p to an x,y pair (x = p mod 256, y = floor(p/256)), and plotting, you’ll get perfect 2D coverage (and in fact perfect coverage for 256×256×256… to any dimension). However, this is weak for other purposes.
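
    A sketch of that in BASIC (29501 is odd, hence relatively prime to 65536):
    [code]
    10 ' perfect 2D coverage of a 256x256 grid, no PRNG needed
    20 P=0
    30 FOR I=1 TO 65536
    40 P=P+29501: IF P>=65536 THEN P=P-65536
    50 Y=INT(P/256): X=P-Y*256
    60 ' plot X,Y here (on a Coco you'd clip Y to the visible 192 rows)
    70 NEXT I
    [/code]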

    For general use, your 32-bit LFSR will show bias in some experiments and situations. Otherwise it is a decent PRNG for many uses.

  • jdiffend

    Some more benchmarking over on a C128 forum:
    [url]http://landover.no-ip.com/128/viewtopic.php?pid=6532#p6532[/url]

  • RobertGault

    Chris,

    Actually all 32 bits are being used since they fit exactly into the Basic mantissa for floating point numbers. So, the output of the LFSR becomes a fraction between 0 and 1, and times 191 gives an answer between 0 and 191.
    I was just curious to see what it would take to improve the output for pairs. A better way to use this LFSR to plot random points on a 192×192 grid would be to use single results, not pairs: R=LFSR*191^2, Y=INT(R/191+.5), and X=INT(R-Y*191+.5). That produces no spectral-test problems, no lines are seen, and there is total coverage of the grid.
    It would seem that non-perfect PRNGs can be of value if used appropriately. The trick for those of us who are not mathematicians is finding the right methods or having a good “cookbook”.

  • Lomonster

    Chris,

    Actually all 32 bits are being used since they fit exactly into the Basic mantissa for floating point numbers. So, the output of the LFSR becomes a fraction between 0 and 1, and times 191 gives an answer between 0 and 191.
    I was just curious to see what it would take to improve the output for pairs. A better way to use this LFSR to plot random points on a 192×192 grid would be to use single results, not pairs: R=LFSR*191^2, Y=INT(R/191+.5), and X=INT(R-Y*191+.5). That produces no spectral-test problems, no lines are seen, and there is total coverage of the grid.
    It would seem that non-perfect PRNGs can be of value if used appropriately. The trick for those of us who are not mathematicians is finding the right methods or having a good “cookbook”.

    If you’re taking the entire 32-bit state then there will be a lot of bias – for example, half of the numbers will be roughly twice as large as the previous one (or half as large, depending on which way you shift), since the state is shifted one bit per pass. If you’re taking every 7th, then there will again be too many sequences where numbers tend to double, and fewer where they halve, leaving a significant bias. The only output you should take from an LFSR is the single bit shifted off the end. An LFSR state is nowhere near random, since only one bit changes per pass, but the bits coming off the end are pretty random.
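
    So the right way to get, say, an 8-bit value is to clock the register eight times and keep only the output bits. A sketch, using the same 8-bit taps as in my earlier example:
    [code]
    10 ' one 8-bit value from 8 successive LFSR output bits (sketch)
    20 S=1 ' LFSR state, any nonzero seed
    30 V=0
    40 FOR J=1 TO 8
    50 B=S AND 1: S=INT(S/2)
    60 IF B=1 THEN S=(S OR 184)-(S AND 184) ' S = S XOR &HB8
    70 V=V*2+B ' shift the output bit into the value
    80 NEXT J
    90 PRINT V ' a value in 0-255
    [/code]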

    If you’re just trying to pass the spectral test, a simple counter passes perfectly :)

  • jdiffend

    BTW, the CoCo3 with a 6309 in Fast mode should run the longer BASIC benchmark off the C128 forum in almost half the time of the C128 in Fast mode.

  • jdiffend

    Summary of results so far

    * all machines marked with an asterisk were benchmarked on an emulator and results may or may not be accurate

    Atari 8 bit—————————-
    Atari Advan Basic 456.2 sec
    Accuracy .036606
    Random 18.92, 11.51
    Atari 8K Basic 405 sec.
    Basic XL Time 395.88
    Basic XE Time 49.7
    Turbo Basic XL 41.6
    Turbo Basic Compiler 33.1
    Accuracy = .013959
    Random = 16.77, 8.80 (actually varies with each run on all versions of BASIC)
    Atari Microsoft Basic Time=101.4
    Accuracy .150879
    Random=2.06506, 2.06506 (5.60 when RANDOMIZEd)

    Tandy Color Computer 3 in high speed mode: ———————
    Time: 71 sec.
    With the fast keyboard routine enabled: 69 sec
    With a 6309 running in native mode (no 6309 BASIC patches): 56 sec
    accuracy 5.96284867E-04
    random 7.3876276 test1, 24.5601945 test2, 9.40813446 test3
    Basic09 under OS-9 (NitrOS-9?): 6 seconds
    accuracy 9.45091248E-04

    Apple IIc Plus ————————
    ~ 31 seconds (timed by hand)
    Apple IIgs 2.8MHz ——————-
    Time 42.5 (timed by hand)
    Accuracy 1.04141235E-03
    Random 8.42399192

    The Commodore Plus/4 (PAL) —————
    Time: 109 sec
    Accuracy 1.04141235E-03
    Random 11.1208959

    Oric Atmos—————-
    Time 144 sec
    Accuracy : 1.04141235E-03
    Random : 3.47206211

    Dick Smith VZ300 ——————————-
    Time : 108
    Accuracy : .0338745
    Random : 13.5997

    C64 timed by hand —————————————–
    A^2 method Time =123 sec.
    Accuracy=1.041 e-3
    Random 10.09
    A*A method Time=76 sec.
    Accuracy 3.227 e-4
    Random 4.19, 13.83

    Sinclair Spectrum* ————————————–
    Time 290 sec (278 appears to have been a math error)
    Time 175 (using “let a=a*a” instead of let a=a^2 and iterative square root instead of SQR)
    Accuracy 0.0006685257
    Random 4.322937

    Amstrad CPC 6128 Plus Locomotive BASIC 1.1 * —————————
    Time 38.9666667 sec
    Accuracy 3.50475E-05
    Random 3.07867885

  • jdiffend

    Well, after a lot of benchmarking we discovered that if you replace:
    a=a^2
    with
    a=a*a
    it will speed up the program.
    On VCC the speed-up was around 30%, but VCC’s times don’t match the real thing.
    If someone could rerun the benchmarks using that change, I’ll add the new numbers to the results.

  • RobertGault

    It does not make sense to change the benchmark program just to get a faster result; the faster result is not relevant. The whole point of a benchmark is to determine the effects of a single variable, in this case the OS and Basic.

    It is a reasonable question to ask what the difference in speed is between exponentiation and multiplication. While this would be better tested as a separate benchmark, using the code I originally reported on, the time changes from 1min12sec to 44sec when A=A^2 is changed to A=A*A.

    So what impact does this have on Coco programming? It just means you should only use exponentiation when multiplication won’t work or space is at a premium, e.g. A=A^3.141… or A=A^500. Neither of these examples is appropriate for multiplication.

  • Lomonster

    It just means you should only use exponentiation when multiplication won’t work or space is at a premium, e.g. A=A^3.141… or A=A^500. Neither of these examples is appropriate for multiplication.

    Don’t be so quick to assume A=A^500 is faster than multiplying cleverly. If the exponentiation algorithm is general purpose (like the CoCo’s), it might be much slower than computing A^500 with eight squarings and five extra multiplies (and there might be faster ways – eight and five is just the easy way to do it). Raising to a positive integral power is almost always faster with multiplies than with a general-purpose exponentiation routine.

    Later if I get inspired I’ll time it.

  • jdiffend

    I think the real lesson here is there are many ways to do the same thing and doing a little timing and experimenting will help you write faster programs.

  • RobertGault

    Indeed, there are many ways to program the same task.

    Here is a short test program to try.
    [code:1:aab2bbde16]
    10 A=1.1:B=A:C=900
    20 PRINT DATE$ 'DATE$ here is a custom function reading the Smartwatch
    30 PRINT A^(C+1)
    40 PRINT DATE$
    50 FOR I=1 TO C
    60 B=B*A
    70 NEXT
    80 PRINT DATE$
    90 PRINT B[/code:1:aab2bbde16]

    The first thing you will find as C is changed is that exponentiation and multiplication drift apart as C increases on a Coco. The above takes 1 sec for line 30 and 7 seconds for the FOR/NEXT loop. For values of C less than about 100, it is not possible to measure the time accurately, as both methods take about 1 sec or less.

    One point I was trying to make was that a long string of B=A*A*A*A... is not practical, and eventually a loop will have too much overhead compared with exponentiation.
    The main point was that you don't care how fast a benchmark runs when the object is to compare systems. It is the system-to-system difference that matters, not the absolute time. Ahl's benchmark times are meaningless if A^2 is used on one system and A*A on another.

  • jdiffend

    That’s why I’m posting *both* times.

  • Lomonster

    Here is a short test program to try.
    [code:1:0b0b73691f]
    10 A=1.1:B=A:C=900
    20 PRINT DATE$ 'DATE$ here is a custom function reading the Smartwatch
    30 PRINT A^(C+1)
    40 PRINT DATE$
    50 FOR I=1 TO C
    60 B=B*A
    70 NEXT
    80 PRINT DATE$
    90 PRINT B[/code:1:0b0b73691f]

    The first thing you will find as C is changed is that exponentiation and multiplication drift apart as C increases on a Coco. The above takes 1 sec for line 30 and 7 seconds for the FOR/NEXT loop. For values of C less than about 100, it is not possible to measure the time accurately, as both methods take about 1 sec or less.

    You're still doing far more work than needed for the multiplication case, so it is not a very good way to test. You can compute the 901st power of something with just 13 multiplies rather than 900; computing powers naively with a loop does exponentially more work than necessary.

    Expand 901 in binary {1, 1, 1, 0, 0, 0, 0, 1, 0, 1}, keep squaring B, and multiply B into a temp T for each 1 in the binary expansion, like so

    T=B:B=B*B:B=B*B:T=T*B:B=B*B:B=B*B:B=B*B:B=B*B:B=B*B:T=T*B:B=B*B:T=T*B:B=B*B:T=T*B

    and T is your answer. No loop. That saves you 887 multiplies. I'd guess this cuts the multiplication method's time by a factor of 50-80, making it pretty fast, and perhaps faster than the exponentiation.
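
    The same idea as a general loop, for any positive integer exponent (a sketch; E is consumed as the loop runs):
    [code]
    10 ' square-and-multiply: computes T = B^E for integer E > 0
    20 B=1.1: E=901: T=1
    30 IF INT(E/2)*2<>E THEN T=T*B ' low bit of E set: fold B into the result
    40 E=INT(E/2)
    50 IF E=0 THEN 80
    60 B=B*B ' square for the next bit
    70 GOTO 30
    80 PRINT T
    [/code]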

    Besides, the direct multiplication method is also more accurate. Using a general purpose exponentiation routine for integral powers is almost always the wrong way to do computation.

    Your method would be like using 900 adds to compute 901 times a number instead of doing a simple multiply. Of course all those adds would be slower, since you're doing far more work than needed.

    Since the CoCo has software multiplication, it might still be slow :)

    The CoCo (as far as I can tell) does exponentiation using a^x = e^(x ln a), so it does a natural log and a natural exponentiation, both of which lose precision compared to simple multiplies.
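
    You can check that on the machine; a sketch (LOG and EXP are Color BASIC’s natural log and exponential):
    [code]
    10 A=2
    20 PRINT A*A*A*A*A*A*A*A*A*A ' A^10 by repeated multiplication: exactly 1024
    30 PRINT A^10 ' the general routine; may drift in the last digits
    40 PRINT EXP(10*LOG(A)) ' the same computation spelled out by hand
    [/code]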