
Re-review of multiplication based PRNGs

IgneousRed
2024-09-13
2025-08-23
  • IgneousRed

    IgneousRed - 2024-09-13

    From what I understand, your stance is that multiplication should be avoided since it is too slow? I would have totally agreed with that stance 10 years ago.
    But for the last several years every CPU has been able to do a full multiply at decent speed. Given that it does a lot more mixing than an add, I think it is more than worth its weight.

    When choosing a PRNG for low-end CPUs, it should be avoided.
    But why not list such RNGs so people have a choice, and can read their pros and cons?

    MWC is a classic at this point, but adding a simple output permutation improves it quite a lot.

    MWC:
        32bit state, 8bit output: 2^16 bytes
        64bit state, 16bit output: 2^30 bytes
        128bit state, 32bit output: >2^45 bytes
        256bit state, 64bit output: >2^45 bytes
    MWCP:
        32bit state, 8bit output: 2^27 bytes
        64bit state, 16bit output: >2^45 bytes
        128bit state, 32bit output: >2^45 bytes
        256bit state, 64bit output: >2^45 bytes
    

    And it runs almost twice as fast as SFC on my M2 Pro.

    This guy: https://tom-kaitchuck.medium.com/designing-a-new-prng-1c4ffd27124d
    ...has sub-8-bit word tests that suggest MWC scales linearly with the state size, so I expect that the 64bit MWCP is very random.

    Here is a link to my implementation in Odin: https://github.com/IgneousRed/PrngBoard/blob/main/MWCP/MWCP.odin

    Will you revisit multiplication-based PRNGs? And if not, why?

     
  • - 2024-09-13

    Multiplication is... decent? I mean, it's fast on a lot of CPUs, and tends to mix things better than add/xor. The class of CPUs it's fast on is smaller than the class of CPUs that the basics (add, xor, shift) are fast on, so a PRNG that uses multiplication is inherently targeting a smaller niche. Not much smaller these days, since fast multiplication is pretty common.

    And low multiplication alone is never enough: you inherently need shifts, and probably add or xor too, or maybe just decimated output in extreme cases. By "low multiplication" I mean integer multiplication that returns the result truncated to the same size as the inputs, like multiplying a 32 bit integer by a 32 bit integer and truncating the result to a 32 bit integer. High multiplication could maybe replace the need for shifts, but it is slower, and using it that way would be a bit awkward. So multiplication seems good, but not great, and a tiny bit specialized.

    I've considered several multiplication-based PRNGs for PractRand, but they've rarely worked out. I had some small chaotic multiplication-based PRNGs I was going to recommend for a while as alternatives to jsf/sfc when multiplication is fast, but none of the variants felt completely competitive so I dropped them eventually. I had some LCGs with hashed output I was considering... but I ended up happier with the larger-state variants that used add-with-carry instead of the multiplication-based ones.

    And it runs almost twice as fast as SFC on my M2 Pro.

    That sounds really fast. I have a hard time getting even tiny (3 to 5 operation) PRNGs to go significantly faster than sfc and jsf in my testing. Though that's probably largely a function of the compilers used and whatnot.

    Typically I see 5.5 to 8.0 GB/s for jsf64 and sfc64 (if called directly, half that if called through a vtable), and if I try for something faster I can get 8-10 GB/s, but either the 8 & 16 bit word variants fail tests badly, or I can't guarantee a minimum cycle length, or, often, both. Or worse.

     
  • - 2025-06-21

    I've been looking at multiplication-based PRNGs again lately. Some I've found interesting:

        Uint64 old = a + b;
        a = b * 0x9e3779b97f4a7c15ull;
        b = rotate64(old, 23);
        return a;
    

    That one is the fastest decent-ish PRNG I have in testing at the moment. Quality isn't particularly good - the 16 bit variant fails at a measly 8 megabytes, and the 32 bit variant fails at 64 GB. Ultimately, only the full 64 bit one seems to be of a quality I could possibly recommend, and even then I'd worry that I missed something or that my seeding method wasn't avoiding bad cycles well enough. But when simple speed is very important, it's hard to compete with something like this (without doing something more awkward, like a SIMD PRNG that directly dumps its output where you want it, or a GPU-based PRNG, or a fast hardware block cypher or something). I'm currently calling this multwistfast64, might change it to mtf64 or somesuch.

        Uint64 old = a + b;
        a = (b * 0x9e3779b97f4a7c15ull) ^ ++counter;
        b = rotate64(old, 23);
        return a;
    

    This is the same thing, just with a counter added to force a minimum cycle length. Actually, adding a counter helps in other ways too, but they're harder to analyze or communicate so they don't get discussed. Anyway, this is consistently slower than without the counter, but still slightly faster than sfc or jsf for me. The 16 bit variant fails at 1 GB, while the 32 & 64 bit variants have
    no detected failures so far. This could be a reasonable competitor to jsf64 and sfc64, being both smaller and faster. I'm currently calling this multwistcount64, could change that to mtc64 or something.

        Uint64 tmp;
        a = _wide_multiply(a, 0x9e3779b97f4a7c15ull, &tmp);
        a ^= tmp;
        return a;
    

    That one is of interest because the implementation can inline to literally just two opcodes (under ideal circumstances) and state fits in a single GPR. Unfortunately it's mediocre or worse by pretty much every other conceivable metric, so I can't actually recommend it for anything, but at least it's interesting. A decent PRNG of reasonable size inherently needs both leftward entropy spreading and rightward entropy spreading, and wide multiplication manages to act as both in one operation, since the high word of output is effectively the full output rightshifted. The standard approach requires at least two shift/rotate instructions plus at least 3 arithmetic instructions. The problems: the average cycle length is just 2^32 (it's irreversible and has a tiny state), the minimum cycle length is literally 1 (the zero state for starters, and there are probably other short cycles too, though they might not be quite that short), wide multiplication has serious portability problems, it's not as fast as the extremely low operation count would suggest, etc.

        Uint64 old;
        old = _wide_multiply(a, 0x9e3779b97f4a7c15ull, &a);
        a ^= b;
        b = old + ++counter;
        return a;
    

    Here's a variation on the above that drastically improves cycle length - the expected average went from 2^32 all the way up to 2^191, and the minimum cycle length from 1 all the way up to 2^64. The quality is quite decent at that point, probably superior to multwistcount64 above. Unfortunately, that's about the only positive thing I can say - it's notably slower than multwistcount64, it inlines notably worse, and it's far less portable.

    I'm also looking at a standard 128 bit power-of-2-modulus LCG (decimated to 64 bit output), but it requires wide multiplication (with the corresponding portability issues) to be implemented at a reasonable speed and even then is slower than other random access PRNGs, and generally doesn't seem competitive so I'll likely drop it soon.

    And I've also rewritten one of my random access PRNGs as... a non-maximal-cycle power-of-2-modulus LCG that, oddly enough, doesn't use multiplication in the implementation of its LCG state transitions, but does use multiplication in its output hashing. It has some portability issues because the implementation is based upon ADC instructions, but I consider that acceptable because the performance penalty when ADC intrinsics can't be found is relatively minor.

     

    Last edit: 2025-06-24
  • IgneousRed

    IgneousRed - 2025-07-22

    Sorry for the late reply!
    Interesting ones, hope you had fun.
    This is how fast they run on my M2 Pro (ns/out):

    Mul1: 0.7171630859375
    Mul2: 0.8544921875
    Mul3: 1.129150390625
    Mul4: 1.129150390625
    
    WYRand: 0.3509521484375
    MWCP: 0.5645751953125
    SFC: 0.885009765625
    

    WYRand:

        r^ += WYR64_INC
        mix := u128(r^) * u128(r^ ~ WYR64_MIX)
        return u64(mix >> 64) ~ u64(mix)
    

    MWCP:

        mix := u128(r[1]) * u128(0xfeb344657c0af413) + u128(r[0])
        r^ = {u64(mix >> 64), r[2], r[3], u64(mix)}
        return u64(mix >> 64) + u64(mix)
    

    Odin uses LLVM compiler.

    I don't see how they can compete with WYRand and MWCP (MWC + permutation), as those are both fast and surprisingly random (as far as I have tested).
    I tried many times to create a 2 or 3 state WYRand, but I could not get the performance right.
    Also I tested MWC with 2 or 3 states, but it seems to slow it down; I'm not sure why.

    As I stated before, I am making RNGs for games, which assumes mid-to-high-end PCs, and those all have fast multiply and ByteSwap.
    High-end ARM ones also have BitReverse, though it seems to me slightly inferior to ByteSwap; still, it is awesome to have another shift alternative.

     

    Last edit: IgneousRed 2025-07-22
    • - 2025-07-23

      First, updates on my last post: The first two multiplication-based PRNGs I listed, I'm not thrilled with. They have incredibly bad avalanche properties. The output tests may be decent, but with avalanche that bad I just don't trust the output test results. I tried some variations and found one that was both faster and had better avalanche behavior... except when I went back and benchmarked it again later it was way slower. I don't know if I somehow screwed up the earlier benchmarks, or if my CPU just changed behavior slightly - that's unfortunately a persistent problem I face when benchmarking small bits of code, ever since the Pentium Pro or thereabouts. So I've gone back to the code listed there, despite their horrible avalanche properties. Also, while I was trying out weird things like wide-multiplication, I also tried the SHLD opcode (double-wide shift) for PRNG purposes, and it worked decently enough, but I don't think it's worth pursuing - SHLD support is rare.

      Anyway. Your post. It's clear our CPUs have different behaviors, as I can't get anything to work 2x faster per call than multwistfast64 unless I give it code simple enough that the optimizer can eliminate entire PRNG calls.

      WYRand: From context, I'm guessing that "r^" is part of a variable name or otherwise ignorable, not an XOR operation like it is in C, and that "~" is an XOR operation, not a NOT operation like it is in C. So, assuming the optimizer knows what it's doing, that works out to a Weyl with the result hashed by a wide multiplication with the low and high words of the result xored with each other. Sort of like the 3rd PRNG in my previous post, what with the wide multiplication that has its high and low results xored with each other, it's using that as an output function instead of a state transition function. Anyway, that isn't very fast in my benchmarks, about 40% of the speed of multwistfast64. It's passing all output tests I've tried on it so far, though I'm feeling dubious - most of the bits are probably good, but regardless of what RNG_test is saying I wouldn't trust the bottom bit of the output from that code without a lot of analysis and testing - if I'm reading that right then mix should alternate even and odd values if WYR64_MIX is even, or just always be even if WYR64_MIX is odd, and either way I'd be dubious that xoring it by bit 64 of mix is entirely enough to fix that.

      And MWCP is... another PRNG based upon wide multiplication? And "r^ :=" is an assignment of an entire array, where last time it was a 64 bit scalar assignment? Maybe I'm just misreading all of your code? Well, assuming I'm reading it correctly... I haven't gotten around to actually testing anything empirically on it, but just looking at the code I'm pretty sure it will be slower than multwistfast64 or sfc64 on my computer, but quite good quality. Well, a high quality state transition function... not 100% sure but that output function looks slightly biased to me. I'd trust u64(mix) alone more than the sum of the high and low products, and if u64(mix) wasn't enough then I'd add two adjacent elements from the array rather than that.

       
  • IgneousRed

    IgneousRed - 2025-07-23

    I think you got how they work. They are still available on the link I posted.
    Wrote C long ago; here is an LLM translation. It assumes __uint128_t:

    #define WYR64_MIX 0x8bb84b93962eacc9ULL
    #define WYR64_INC 0x2d358dccaa6c78a5ULL
    #define ODD_PHI_64 0x9e3779b97f4a7c15ULL
    
    typedef uint64_t WYR64;
    
    WYR64 WYR64_init(uint64_t seed) {
        return (seed + WYR64_MIX) * WYR64_INC * ODD_PHI_64;
    }
    
    uint64_t WYR64_u64(WYR64 *r) {
        *r += WYR64_INC;
        __uint128_t mix = (__uint128_t)(*r) * (__uint128_t)(*r ^ WYR64_MIX);
        return (uint64_t)(mix >> 64) ^ (uint64_t)mix;
    }
    
    typedef uint64_t MWCP64[4];
    
    uint64_t MWCP64_u64(MWCP64 r);
    
    /* C cannot return arrays, so init fills the caller's state in place. */
    void MWCP64_init(MWCP64 rng, uint64_t seed) {
        rng[0] = 0x14057b7ef767814fULL;
        rng[1] = 0xcafef00dd15ea5e5ULL;
        rng[2] = seed;
        rng[3] = seed;
        for (int i = 0; i < 8; i++) {
            rng[2] ^= MWCP64_u64(rng);
        }
    }
    
    uint64_t MWCP64_u64(MWCP64 r) {
        __uint128_t mix = (__uint128_t)r[1] * 0xfeb344657c0af413ULL + (__uint128_t)r[0];
        r[0] = (uint64_t)(mix >> 64);
        r[1] = r[2];
        r[2] = r[3];
        r[3] = (uint64_t)mix;
        return (uint64_t)(mix >> 64) + (uint64_t)mix;
    }
    

    MWC has a particular initialisation for r0 and r1; I am not a math guy, sooo, more at the article linked.
    The final sum may be biased, not sure, but it works much better than any 2 states.

    I find it surprising that you find them slow; maybe a different compiler or way of measuring.
    I am testing throughput (filling a large array), as it is easier to measure and, let's be real, if you only need one value every so often, the RNG will not be a bottleneck.

    I have said this before, but thank you for making this marvelous tester; I have had a lot of fun.

     
  • - 2025-07-26

    Well, "slow" relative to the fastest decent simple PRNGs around. On my CPU. In the loops I measure performance in, which are different from those others use in benchmarks. In practice, anything over about 2.5 to 3.0 GB/s is difficult for me and I'm not quite sure why, past about twice that is absurdly difficult, and past 13 GB/s seems impossible - and multwistfast64 occasionally manages 10 GB/s.

    Though I am a bit surprised that wide multiplication is considered an appropriate operation for high speed PRNGs... I thought it was often slower or less portable than regular multiplication. Something about taking multiple operations on some ISAs, and being higher latency on a lot of CPUs... I thought some of those were for ARM variants though admittedly I have a hard time keeping track of which ARM variants matter and which ones don't. When I have easy access to wide multiplication and don't care about portability, it's my default method for mapping raw random bits to uniform integer ranges on modern CPUs, but I've tried to avoid it in PractRand code (otherwise there would be a randli_fast implemented with wide multiplication... I might add one ifdefed to int128 availability sometime anyway if I can find a good way to do that).

    WYR64_MIX seems to be an odd number. I think it would work slightly better if that were an even number.

    WYR64 is of interest simply for its easy ability to skip forwards/backwards within its cycle combined with decent quality. I'm probably going to try playing around with a few variations on that output hashing function. Maybe rotating the left operand.

    MWCP64 apparently also supports fastforwarding/rewinding? It's not quite the sort of structure I would normally use even if I were allowing wide multiplication, but aside from the output function it's quite close. It's outperforming my large-hashed-LCG-variant atm. That's looking pretty tempting at the moment.

     
  • IgneousRed

    IgneousRed - 2025-07-26

    I think I heard that most PC processors always calculate the upper multiply, even when not asked... I'm not sure for what % that is true, but it does seem to be quite fast on most CPUs.

    https://github.com/wangyi-fudan/wyhash
    The mix is a prime number where each byte has 4 zero bits and 4 one bits.
    As I mentioned, I tried to tinker with it multiple times, but I just end up destroying its speed.

    https://tom-kaitchuck.medium.com/designing-a-new-prng-1c4ffd27124d
    And this was my intro to MWC, but the math is a bit above my head.
    I think MWC is better than LCG in most ways.

    I will look at "randli_fast" at some point, sounds like something I would enjoy.

     
  • - 2025-08-22

    I think I heard that most PC processors always calculate the upper multiply, even when not asked... I'm not sure for what % that is true, but it does seem to be quite fast on most CPUs.

    Sounds unlikely to me. All of this is limited by my understanding, but this is how I understand the general picture. Integer wide multiplication almost always involves a different opcode than regular low multiplication. That makes it easy to save power by not calculating the upper half for low integer multiplication, which is much more common. Everyone is very, very keen to save power, since basically everything ends up limited by that one way or another these days. Floating point might be a different story, though... they normally only need the upper half for FP multiplication, but they might need much or all of the lower half simply to produce the correct upper half, depending upon the details of the algorithm they use.

    I will look at "randli_fast" at some point, sounds like something I would enjoy.

    There is no randli_fast at the moment, only randi_fast. In PractRand, every PRNG in the namespace PractRand::RNGs::Polymorphic or PractRand::RNGs::LightWeight (cross my fingers that no emoticons corrupt that...) has methods randi, randli, and randi_fast. randi ("random integer") is simply the output of the PRNG converted to a uniform range of 32 bit integers. randli ("random long integer") is the same thing but for 64 bit integers.

    randi_fast ("fast random integer") is the same thing as randi conceptually, but it uses a distorted approximation of the target distribution, because, aside from the call to raw32() to get raw random bits to work with, it's basically just a single wide multiplication returning the high result. Before I refactored my RNG code for release as PractRand, that was literally implemented as an inline assembly wide multiplication, returning the high word of output. It probably still works out that way if the compiler is smart, but now the code uses 64 bit integer multiplication on 32 bit integers to achieve that result.

    I'd like to have another method, randli_fast, to do the same thing for 64 bit values, but that requires 128 bit integer types, which aren't supported by some of the compilers I use.

    The code in question is very simple stuff, it works out to something like this:

    class some_RNG {
    public:
        //blah blah blah PRNG code snipped out
        Uint32 randi_fast(Uint32 max) {
            // range is [0,max)
            return Uint32((Uint64(raw32()) * Uint64(max)) >> 32);
        }
        Uint32 randi_fast(Uint32 min, Uint32 max) {
            // range is [min,max)
            return min + randi_fast(max-min);
        }
    };
    

    Using that code you can say something like "rng->randi_fast(1,7);" to get a distribution roughly equivalent to a normal 6 sided die. (you don't want to have to actually find the code, because some paths to it involve a hideous mess of templates and macros - I wanted it able to call raw32 without having to go through the vtable, yet not require the code to be rewritten for each algorithm, plus a bunch of other stuff, and the solution I ended up with was ugly)

    I was just saying that I would like to have a 64/128 bit version of that in addition to the 32/64 bit version. Which, by the existing naming scheme, would be called randli_fast ("fast random long integer"). Though I hate that naming scheme.

     
    • - 2025-08-23

      On further thought, given that I hate that naming scheme, I may rename them to randi32, randi32_fast, randi64, and, if I add it, randi64_fast. I think I avoided those kinds of names to maximize the difference relative to raw8/raw16/raw32/raw64, but at this point that doesn't matter anymore.

      To be clear, randli_fast / randi64_fast could be added, I'd just have to #ifdef different codepaths for different compilers, and in rare cases (mostly 32 bit compilers) randli_fast aka randi64_fast would end up super slow due to having to emulate 64x64->128 multiplication.

       
