#A4#
#bucket_update_t **slice_start points into bucket storage; entry k indicates where in a bucket the
updates from the k-th slice that was sieved into this bucket array begin. The 0-th sieved slice's
updates necessarily start at the respective bucket's start.#
#bucket 0 storage#
#bucket 1 storage#
#bucket 2 storage#
#bucket 15 storage#
#...#
#...#
# s0,b0#
# s0,b1#
# s0,b2#
# s0,b15#
##
#...#
# s1,b0#
# s1,b1#
# s1,b2#
# s1,b15#
# s2,b0#
#...#
#Suppose we have nr_buckets = 16 buckets.
bucket_update **bucket_start is an array of 16 pointers to the start of each bucket's storage. Each
bucket is an array that stores the updates generated by slices s0, s1, s2, ...#
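The slice-major layout of the slice_start pointers, and the per-bucket gather it forces, can be sketched as follows (illustrative only; the function name collect_bucket and the flat-array representation of slice_start are assumptions, not from the source):

```cpp
#include <cstddef>
#include <vector>

struct bucket_update_t {};  // stand-in; the real type holds a sieve update

// slice_start is stored slice-major: a flat array of nr_slices * nr_buckets
// pointers laid out s0,b0 .. s0,b15, s1,b0 .. s1,b15, ...
// Gathering every slice's start pointer for one bucket b therefore walks
// the array with stride nr_buckets.
std::vector<bucket_update_t *>
collect_bucket(bucket_update_t *const *slice_start,
               size_t nr_slices, size_t nr_buckets, size_t b)
{
    std::vector<bucket_update_t *> starts;
    starts.reserve(nr_slices);
    for (size_t k = 0; k < nr_slices; k++)
        starts.push_back(slice_start[k * nr_buckets + b]);  // strided access
    return starts;
}
```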
#When we purge or apply one bucket, we want all the updates generated by one slice, as the slice's
index is used for converting short hints to long hints during purge(), and for looking up the log(p)
value of the slice during apply_buckets(). Thus, for bucket 0, we want the pointers
s0,b0, s1,b0, s2,b0, etc., which is a stride-16 access (with 16 buckets) through the slice_start array.
We do this for each bucket; with 256 buckets and possibly thousands of slices, this results in poor data
locality: 256 passes of stride 256, each pass of thousands of accesses.
#Instead we should do a matrix transpose. With, say, m = 10 slices and n = 16 buckets, we transpose
the m * n matrix into an n * m matrix with the following data layout#
#...#
# s0,b0#
# s1,b0#
# s2,b0#
# s9,b0#
#...#
# s0,b1#
# s1,b1#
# s2,b1#
# s9,b1#
# s0,b2#
#...#
#After this, accesses to the slice_start pointers are sequential during purge(). We need
a cache-friendly matrix transpose.#
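A cache-friendly transpose of the pointer matrix can be sketched as a blocked (tiled) out-of-place transpose; each tile of pointers is read and written while it stays resident in cache. This is a minimal sketch: the function name, the template, and the block size of 32 are assumptions, not taken from the source.

```cpp
#include <algorithm>
#include <cstddef>

// Blocked out-of-place transpose: `in` is an m x n row-major matrix,
// `out` receives the n x m row-major transpose, so out[j][i] = in[i][j].
// Processing BLOCK x BLOCK tiles keeps both the rows read from `in` and
// the rows written to `out` hot in cache, unlike a naive row-by-column pass.
template <typename T>
void transpose_blocked(const T *in, T *out, size_t m, size_t n)
{
    const size_t BLOCK = 32;  // tile side; tune so a tile fits in L1
    for (size_t ii = 0; ii < m; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < std::min(ii + BLOCK, m); i++)
                for (size_t j = jj; j < std::min(jj + BLOCK, n); j++)
                    out[j * m + i] = in[i * n + j];
}
```

With m = number of slices and n = number of buckets, applying this to the slice_start array turns the stride-n gather per bucket into a sequential scan.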
#...#
#s0#
#s1#
#s0#
#s0#
#s0#