#A4#
#2#
#2#
#2#
#2...#
#3#
#3#
#3...#
#Sieving cost:
Init to 0 and pattern sieve 2..64: I/64 SSE2 writes
Pattern sieving 3: I/64 SSE2 read/write
Sieving 5, 7, 9, 11: I/5 + I/9 + I/7 + I/ 11 = 0.54I byte read/write#
#5#
#5...#
#-I/2#
#I/2#
#-I/2#
#-I/2+64#
#2#
#2#
#2#
#2...#
#-I/2#
#-I/2+192#
#Copy pattern 3 times: 3 SSE2 writes#
#3#
#3#
#3#
#3...#
#Sieve 3 over interval of length 192: 64 byte read-writes#
#Expanding wheel sieve#
#Sieve 2..64 over interval of length 64#
#Copy pattern 5 times: 15 SSE2 read/writes#
#Sieve 5 over interval of length 960: 192 byte read/writes#
#Total cost dominated by sieving 7 and 11:
Assume I=65536.
Sieving 7:
Previous pattern has length 64*5*9=2880 (45 SSE2 words)
Copy pattern 7 times: 315 SSE2 read/writes
Sieve 7 over interval of length 64*5*9*7 = 20160: 2880 byte read/writes
Sieving 11:
Previous pattern has length 64*5*9*7 = 20160, 315 SSE2 words
Copy pattern 2.25 times (to fill sieve line): 709 SSE2 read/writes
Sieve 11 over interval of length 65536: I/11 = 5958 byte read/writes
Total cost:
I/64 = 1024 SSE2 read/writes for copying patterns
Sieving 9, 7 and 11: ~320 + 2880 + 5958 = 9158 byte read/writes
That's only about 0.14I = I/7.16 hits,
little less than the cost of sieving only 7 in the old code.
The savings get bigger if small primes have multiple roots:
old code did full pass of length I for each root,
expanding wheel sieves each root over shorter interval#
#Next sieve 9:
Copy pattern 3 times, 45 SSE2 read/writes
Sieve 9 over interval of length 64*9*5 = 2880: 320 byte read/writes
Next sieve 7, 11 the same way.#
#Old small sieve:#
#Sieve each prime (power) over whole line of length I#
#Sum 1/p over primes 11 < p < 65536 is ~1.4.
Assuming exactly 1 root per prime,
total number of byte read/writes with old sieve: ~1.94I.
total number of byte read/writes with expanding wheel: ~1.54I.
Relative saving: ~20%.#