Commit 8b8a2e9f, authored by Peter Dillinger, committed by Facebook GitHub Bot

Ribbon: major re-work of hashing, seeds, and more (#7635)

Summary:
* Fully optimized StandardHasher, in terms of efficiently generating Start, CoeffRow, and ResultRow from a stock hash value, with sufficient independence between them to have no measurably degraded behavior. (Degraded behavior would be an FP rate higher than explainable by 2^-b and, if using a 32-bit stock hash function, expected stock hash collisions.) Details in code comments.
* Our standard 64-bit and 32-bit hash functions do not exhibit sufficient independence on sequential seeds (for one Ribbon construction attempt to have independent probability from the next). I have worked around this in the Ribbon code by "pre-mixing" "ordinal seeds," sequentially tried and appropriate for storage in persisted metadata, into "raw seeds," ready for application and appropriate for in-memory storage. This way the pre-mixing step (though fast) is only applied on loading or configuring the structure, not on each query or banding add.
* Fix a subtle flaw in which backtracking not clearing ResultRow data could lead to elevated FP rate on keys that were backtracked on and should (for generality) exhibit the same FP rate as novel keys.
* Added a basic test for PhsfQuery and construction algorithms (map or "retrieval structure" rather than set or filter), and made a few trivial related fixes.
* Better random configuration generation in unit tests
* Some other minor cleanup / clarification / etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7635

Test Plan: unit tests included

Reviewed By: jay-zhuang

Differential Revision: D24738978

Pulled By: pdillinger

fbshipit-source-id: f9d03599d9e2ca3e30e9d3e7d81cd936b56f76f0
Parent 1e40696d
......@@ -29,6 +29,8 @@ namespace ROCKSDB_NAMESPACE {
// Stable/persistent 64-bit hash. Higher quality and generally faster than
// Hash(), especially for inputs > 24 bytes.
// KNOWN FLAW: incrementing seed by 1 might not give sufficiently independent
// results from previous seed. Recommend incrementing by a large odd number.
extern uint64_t Hash64(const char* data, size_t n, uint64_t seed);
// Specific optimization without seed (same as seed = 0)
......@@ -37,6 +39,8 @@ extern uint64_t Hash64(const char* data, size_t n);
// Non-persistent hash. Must only be used for in-memory data structures.
// The hash results are thus subject to change. (Thus, it rarely makes
// sense to specify a seed for this function.)
// KNOWN FLAW: incrementing seed by 1 might not give sufficiently independent
// results from previous seed. Recommend incrementing by a large odd number.
inline uint64_t NPHash64(const char* data, size_t n, uint32_t seed) {
// Currently same as Hash64
return Hash64(data, n, seed);
......@@ -51,6 +55,8 @@ inline uint64_t NPHash64(const char* data, size_t n) {
// Stable/persistent 32-bit hash. Moderate quality and high speed on
// small inputs.
// TODO: consider rename to Hash32
// KNOWN FLAW: incrementing seed by 1 might not give sufficiently independent
// results from previous seed. Recommend pseudorandom or hashed seeds.
extern uint32_t Hash(const char* data, size_t n, uint32_t seed);
// TODO: consider rename to LegacyBloomHash32
......
......@@ -405,7 +405,10 @@ namespace ribbon {
// // big enough for the largest number of columns allowed.
// typename ResultRow;
// // An unsigned integer type sufficient for representing the number of
// // rows in the solution structure. (TODO: verify any extra needed?)
// // rows in the solution structure, and at least the arithmetic
// // promotion size (usually 32 bits). uint32_t recommended because a
// // single Ribbon construction doesn't really scale to billions of
// // entries.
// typename Index;
// };
......@@ -554,11 +557,10 @@ bool BandingAdd(BandingStorage *bs, typename BandingStorage::Index start,
int tz = CountTrailingZeroBits(cr);
i += static_cast<Index>(tz);
cr >>= tz;
} else {
assert((cr & 1) == 1);
}
for (;;) {
assert((cr & 1) == 1);
CoeffRow other = *(bs->CoeffRowPtr(i));
if (other == 0) {
*(bs->CoeffRowPtr(i)) = cr;
......@@ -568,16 +570,19 @@ bool BandingAdd(BandingStorage *bs, typename BandingStorage::Index start,
return true;
}
assert((other & 1) == 1);
// Gaussian row reduction
cr ^= other;
rr ^= *(bs->ResultRowPtr(i));
if (cr == 0) {
// Inconsistency or (less likely) redundancy
break;
}
// Find relative offset of next non-zero coefficient.
int tz = CountTrailingZeroBits(cr);
i += static_cast<Index>(tz);
cr >>= tz;
}
// Failed, unless result row == 0 because e.g. a duplicate input or a
// stock hash collision, with same result row. (For filter, stock hash
// collision implies same result row.) Or we could have a full equation
......@@ -674,7 +679,11 @@ bool BandingAddRange(BandingStorage *bs, BacktrackStorage *bts,
--backtrack_pos;
Index i = bts->BacktrackGet(backtrack_pos);
*(bs->CoeffRowPtr(i)) = 0;
// Not required: *(bs->ResultRowPtr(i)) = 0;
// Not strictly required, but is required for good FP rate on
// inputs that might have been backtracked out. (We don't want
// anything we've backtracked on to leak into final result, as
// that might not be "harmless".)
*(bs->ResultRowPtr(i)) = 0;
}
}
return false;
......@@ -1088,8 +1097,8 @@ typename InterleavedSolutionStorage::ResultRow InterleavedPhsfQuery(
const Hash hash = hasher.GetHash(key);
const Index start_slot = hasher.GetStart(hash, iss.GetNumStarts());
const Index upper_start_block = iss->GetUpperStartBlock();
Index num_columns = iss->GetUpperNumColumns();
const Index upper_start_block = iss.GetUpperStartBlock();
Index num_columns = iss.GetUpperNumColumns();
Index start_block_num = start_slot / kCoeffBits;
Index segment = start_block_num * num_columns -
std::min(start_block_num, upper_start_block);
......@@ -1103,14 +1112,14 @@ typename InterleavedSolutionStorage::ResultRow InterleavedPhsfQuery(
ResultRow sr = 0;
const CoeffRow cr_left = cr << start_bit;
for (Index i = 0; i < num_columns; ++i) {
sr ^= BitParity(iss->LoadSegment(segment + i) & cr_left) << i;
sr ^= BitParity(iss.LoadSegment(segment + i) & cr_left) << i;
}
if (start_bit > 0) {
segment += num_columns;
const CoeffRow cr_right = cr >> (kCoeffBits - start_bit);
for (Index i = 0; i < num_columns; ++i) {
sr ^= BitParity(iss->LoadSegment(segment + i) & cr_right) << i;
sr ^= BitParity(iss.LoadSegment(segment + i) & cr_right) << i;
}
}
......@@ -1158,6 +1167,9 @@ bool InterleavedFilterQuery(const typename FilterQueryHasher::Key &key,
const ResultRow expected = hasher.GetResultRowFromHash(hash);
// TODO: consider optimizations such as
// * mask fetched values and shift cr, rather than shifting fetched values
// * get rid of start_bit == 0 condition with careful fetching & shifting
if (start_bit == 0) {
for (Index i = 0; i < num_columns; ++i) {
if (BitParity(iss.LoadSegment(segment + i) & cr) !=
......
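The BandingAdd hunks above reduce an incoming equation against the rows already stored, using XOR as the row operation. As a rough illustration only, the same loop can be sketched on a toy storage type. `ToyBanding`, `ToyBandingAdd`, and `CountTrailingZeros` are hypothetical names introduced here and are not part of ribbon_alg.h. The sketch assumes a 64-bit coefficient row and an 8-bit result row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy storage: one coefficient row and one result row per slot.
struct ToyBanding {
  std::vector<uint64_t> coeff;
  std::vector<uint8_t> result;
  explicit ToyBanding(size_t num_slots)
      : coeff(num_slots, 0), result(num_slots, 0) {}
};

// Portable trailing-zero count; caller guarantees v != 0.
inline int CountTrailingZeros(uint64_t v) {
  int n = 0;
  while ((v & 1) == 0) {
    v >>= 1;
    ++n;
  }
  return n;
}

// Adds the equation (cr, rr) whose lowest coefficient bit refers to slot
// `start`. Returns true if the equation was stored or was redundant but
// consistent; returns false if it contradicts what is already stored.
inline bool ToyBandingAdd(ToyBanding* bs, size_t start, uint64_t cr,
                          uint8_t rr) {
  assert(start + 64 <= bs->coeff.size());
  if (cr == 0) {
    return rr == 0;
  }
  size_t i = start;
  for (;;) {
    // Move to the first non-zero coefficient so that bit 0 of cr is set.
    int tz = CountTrailingZeros(cr);
    i += static_cast<size_t>(tz);
    cr >>= tz;
    uint64_t other = bs->coeff[i];
    if (other == 0) {
      // Empty slot: store the reduced equation here.
      bs->coeff[i] = cr;
      bs->result[i] = rr;
      return true;
    }
    // Gaussian row reduction over GF(2).
    cr ^= other;
    rr ^= bs->result[i];
    if (cr == 0) {
      // All coefficients cancelled: consistent only if result is zero too.
      return rr == 0;
    }
  }
}
```

In this sketch, adding the same equation twice succeeds (redundant), while adding an equation with the same coefficients and a different result fails.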
......@@ -39,7 +39,8 @@ namespace ribbon {
// static constexpr bool kFirstCoeffAlwaysOne;
//
// // An unsigned integer type for identifying a hash seed, typically
// // uint32_t or uint64_t.
// // uint32_t or uint64_t. Importantly, this is the amount of data
// // stored in memory for identifying a raw seed. See StandardHasher.
// typename Seed;
//
// // When true, the PHSF implements a static filter, expecting just
......@@ -65,12 +66,7 @@ namespace ribbon {
// // A seedable stock hash function on Keys. All bits of Hash must
// // be reasonably high quality. XXH functions recommended, but
// // Murmur, City, Farm, etc. also work.
// //
// // If sequential seeds are not sufficiently independent for your
// // stock hash function, consider multiplying by a large odd constant.
// // If seed 0 is still undesirable, consider adding 1 before the
// // multiplication.
// static Hash HashFn(const Key &, Seed);
// static Hash HashFn(const Key &, Seed raw_seed);
// };
// A bit of a hack to automatically construct the type for
......@@ -114,6 +110,12 @@ struct AddInputSelector<Key, ResultRow, true /*IsFilter*/> {
0, \
"avoid unused warnings, semicolon expected after macro call")
#ifdef _MSC_VER
#pragma warning(push)
#pragma warning(disable : 4309) // cast truncating constant
#pragma warning(disable : 4307) // arithmetic constant overflow
#endif
// StandardHasher: A standard implementation of concepts RibbonTypes,
// PhsfQueryHasher, FilterQueryHasher, and BandingHasher from ribbon_alg.h.
//
......@@ -126,15 +128,31 @@ struct AddInputSelector<Key, ResultRow, true /*IsFilter*/> {
// can do" with available hash information in terms of FP rate and
// compactness. (64 bits recommended and sufficient for PHSF practical
// purposes.)
//
// Another feature of this hasher is a minimal "premixing" of seeds before
// they are provided to TypesAndSettings::HashFn in case that function does
// not provide sufficiently independent hashes when iterating merely
// sequentially on seeds. (This for example works around a problem with the
// preview version 0.7.2 of XXH3 used in RocksDB, a.k.a. XXH3p or Hash64, and
// MurmurHash1 used in RocksDB, a.k.a. Hash.) We say this pre-mixing step
// translates "ordinal seeds," which we iterate sequentially to find a
// solution, into "raw seeds," with many more bits changing for each
// iteration. The translation is an easily reversible lightweight mixing,
// not suitable for hashing on its own. An advantage of this approach is that
// StandardHasher can store just the raw seed (e.g. 64 bits) for fast query
// times, while from the application perspective, we can limit to a small
// number of ordinal seeds (e.g. 64 in 6 bits) for saving in metadata.
//
// The default constructor initializes the seed to ordinal seed zero, which
// is equal to raw seed zero.
//
template <class TypesAndSettings>
class StandardHasher {
public:
IMPORT_RIBBON_TYPES_AND_SETTINGS(TypesAndSettings);
StandardHasher(Seed seed = 0) : seed_(seed) {}
inline Hash GetHash(const Key& key) const {
return TypesAndSettings::HashFn(key, seed_);
return TypesAndSettings::HashFn(key, raw_seed_);
};
// For when AddInput == pair<Key, ResultRow> (kIsFilter == false)
inline Hash GetHash(const std::pair<Key, ResultRow>& bi) const {
......@@ -180,18 +198,59 @@ class StandardHasher {
}
}
inline CoeffRow GetCoeffRow(Hash h) const {
// This is a reasonably cheap but empirically effective remix/expansion
// of the hash data to fill CoeffRow. (Large primes)
// This is not so much "critical path" code because it can be done in
// parallel (instruction level) with memory lookup.
Unsigned128 a = Multiply64to128(h, 0x85EBCA77C2B2AE63U);
Unsigned128 b = Multiply64to128(h, 0x27D4EB2F165667C5U);
auto cr = static_cast<CoeffRow>(b ^ (a << 64) ^ (a >> 64));
//
// We do not need exhaustive remixing for CoeffRow, but just enough that
// (a) every bit is reasonably independent from Start.
// (b) every Hash-length bit subsequence of the CoeffRow has full or
// nearly full entropy from h.
// (c) if nontrivial bit subsequences within are correlated, it needs to
// be more complicated than exact copy or bitwise not (at least without
// kFirstCoeffAlwaysOne), or else there seems to be a kind of
// correlated clustering effect.
// (d) the CoeffRow is not zero, so that no one input on its own can
// doom construction success. (Preferably a mix of 1's and 0's if
// satisfying above.)
// First, establish sufficient bitwise independence from Start, with
// multiplication by a large random prime.
// Note that we cast to Hash because if we use product bits beyond
// original input size, that's going to correlate with Start (FastRange)
// even with a (likely) different multiplier here.
Hash a = h * kCoeffAndResultFactor;
// If that's big enough, we're done. If not, we have to expand it,
// maybe up to 4x size.
uint64_t b = a;
static_assert(
sizeof(Hash) == sizeof(uint64_t) || sizeof(Hash) == sizeof(uint32_t),
"Supported sizes");
if (sizeof(Hash) < sizeof(uint64_t)) {
// Almost-trivial hash expansion (OK - see above), favoring roughly
// equal number of 1's and 0's in result
b = (b << 32) ^ b ^ kCoeffXor32;
}
Unsigned128 c = b;
static_assert(sizeof(CoeffRow) == sizeof(uint64_t) ||
sizeof(CoeffRow) == sizeof(Unsigned128),
"Supported sizes");
if (sizeof(uint64_t) < sizeof(CoeffRow)) {
// Almost-trivial hash expansion (OK - see above), favoring roughly
// equal number of 1's and 0's in result
c = (c << 64) ^ c ^ kCoeffXor64;
}
auto cr = static_cast<CoeffRow>(c);
// Now ensure the value is non-zero
if (kFirstCoeffAlwaysOne) {
cr |= 1;
} else if (sizeof(CoeffRow) == sizeof(Hash)) {
// Still have to ensure some bit is non-zero
cr |= (cr == 0) ? 1 : 0;
} else {
// Still have to ensure non-zero
cr |= static_cast<unsigned>(cr == 0);
// (We did trivial expansion with constant xor, which ensures some
// bits are non-zero.)
}
return cr;
}
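The "almost-trivial" expansion in GetCoeffRow can be looked at in isolation for the case of a 32-bit Hash and a 64-bit CoeffRow. The following is a minimal sketch under that assumption. `CoeffRow64FromHash32` is a hypothetical standalone function, not a member of StandardHasher, and it leaves out the kFirstCoeffAlwaysOne handling. The constants are copied from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Constants as they appear in the diff.
constexpr uint64_t kCoeffAndResultFactor = 0xc28f82822b650bedULL;
constexpr uint32_t kCoeffXor32 = 0xa6293635U;

// Hypothetical helper: expands a 32-bit hash into a 64-bit coefficient row.
inline uint64_t CoeffRow64FromHash32(uint32_t h) {
  // Multiply in the Hash width (32 bits here) to decorrelate from Start.
  uint32_t a = h * static_cast<uint32_t>(kCoeffAndResultFactor);
  uint64_t b = a;
  // Upper half becomes a, lower half becomes a ^ kCoeffXor32.
  b = (b << 32) ^ b ^ kCoeffXor32;
  // Because kCoeffXor32 is non-zero, the two halves cannot both be zero.
  assert(b != 0);
  return b;
}
```

The two halves of the result always differ by the XOR constant, which is why the final branch of GetCoeffRow does not need a separate non-zero fix-up after this kind of expansion.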
......@@ -203,11 +262,19 @@ class StandardHasher {
}
inline ResultRow GetResultRowFromHash(Hash h) const {
if (TypesAndSettings::kIsFilter) {
// In contrast to GetStart, here we draw primarily from lower bits,
// but not literally, which seemed to cause FP rate hit in some cases.
// This is not so much "critical path" code because it can be done in
// parallel (instruction level) with memory lookup.
auto rr = static_cast<ResultRow>(h ^ (h >> 13) ^ (h >> 26));
//
// There is no evidence that ResultRow needs to be independent from
// CoeffRow, so we draw from the same bits computed for CoeffRow,
// which are reasonably independent from Start. (Inlining and common
// subexpression elimination with GetCoeffRow should make this
// a single shared multiplication in generated code.)
Hash a = h * kCoeffAndResultFactor;
// The bits here that are *most* independent of Start are the highest
// order bits (as in Knuth multiplicative hash). To make those the
// most preferred for use in the result row, we do a bswap here.
auto rr = static_cast<ResultRow>(EndianSwapValue(a));
return rr & GetResultRowMask();
} else {
// Must be zero
......@@ -226,33 +293,80 @@ class StandardHasher {
return bi.second;
}
bool NextSeed(Seed max_seed) {
if (seed_ >= max_seed) {
return false;
} else {
++seed_;
return true;
}
// Seed tracking APIs - see class comment
void SetRawSeed(Seed seed) { raw_seed_ = seed; }
Seed GetRawSeed() { return raw_seed_; }
void SetOrdinalSeed(Seed count) {
// A simple, reversible mixing of any size (whole bytes) up to 64 bits.
// This allows casting the raw seed to any smaller size we use for
// ordinal seeds without risk of duplicate raw seeds for unique ordinal
// seeds.
// Seed type might be smaller than numerical promotion size, but Hash
// should be at least that size, so we use Hash as intermediate type.
static_assert(sizeof(Seed) <= sizeof(Hash),
"Hash must be at least size of Seed");
// Multiply by a large random prime (one-to-one for any prefix of bits)
Hash tmp = count * kToRawSeedFactor;
// Within-byte one-to-one mixing
static_assert((kSeedMixMask & (kSeedMixMask >> kSeedMixShift)) == 0,
"Illegal mask+shift");
tmp ^= (tmp & kSeedMixMask) >> kSeedMixShift;
raw_seed_ = static_cast<Seed>(tmp);
// dynamic verification
assert(GetOrdinalSeed() == count);
}
Seed GetOrdinalSeed() {
Hash tmp = raw_seed_;
// Within-byte one-to-one mixing (its own inverse)
tmp ^= (tmp & kSeedMixMask) >> kSeedMixShift;
// Multiply by 64-bit multiplicative inverse
static_assert(kToRawSeedFactor * kFromRawSeedFactor == Hash{1},
"Must be inverses");
return static_cast<Seed>(tmp * kFromRawSeedFactor);
}
Seed GetSeed() const { return seed_; }
void ResetSeed(Seed seed = 0) { seed_ = seed; }
protected:
Seed seed_;
// For expanding hash:
// large random prime
static constexpr Hash kCoeffAndResultFactor =
static_cast<Hash>(0xc28f82822b650bedULL);
// random-ish data
static constexpr uint32_t kCoeffXor32 = 0xa6293635U;
static constexpr uint64_t kCoeffXor64 = 0xc367844a6e52731dU;
// For pre-mixing seeds
static constexpr Hash kSeedMixMask = static_cast<Hash>(0xf0f0f0f0f0f0f0f0ULL);
static constexpr unsigned kSeedMixShift = 4U;
static constexpr Hash kToRawSeedFactor =
static_cast<Hash>(0xc78219a23eeadd03ULL);
static constexpr Hash kFromRawSeedFactor =
static_cast<Hash>(0xfe1a137d14b475abULL);
// See class description
Seed raw_seed_ = 0;
};
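The ordinal-to-raw seed translation in SetOrdinalSeed and GetOrdinalSeed can be sketched as two free functions. This is an illustration that assumes 64-bit Seed and Hash types. `OrdinalToRawSeed` and `RawToOrdinalSeed` are hypothetical names, and the constants are copied from the diff.

```cpp
#include <cassert>
#include <cstdint>

// Constants as they appear in the diff.
constexpr uint64_t kSeedMixMask = 0xf0f0f0f0f0f0f0f0ULL;
constexpr unsigned kSeedMixShift = 4U;
constexpr uint64_t kToRawSeedFactor = 0xc78219a23eeadd03ULL;
constexpr uint64_t kFromRawSeedFactor = 0xfe1a137d14b475abULL;

// The two factors are multiplicative inverses modulo 2^64.
static_assert(kToRawSeedFactor * kFromRawSeedFactor == 1ULL,
              "Must be inverses");

// Hypothetical helper: recovers the ordinal seed from a raw seed.
inline uint64_t RawToOrdinalSeed(uint64_t raw) {
  uint64_t tmp = raw;
  // XOR each byte's high nibble into its low nibble (its own inverse,
  // because the high nibbles are left unchanged).
  tmp ^= (tmp & kSeedMixMask) >> kSeedMixShift;
  return tmp * kFromRawSeedFactor;
}

// Hypothetical helper: mixes a small sequential seed into a raw seed.
inline uint64_t OrdinalToRawSeed(uint64_t ordinal) {
  uint64_t tmp = ordinal * kToRawSeedFactor;
  tmp ^= (tmp & kSeedMixMask) >> kSeedMixShift;
  // Dynamic verification, as in the diff.
  assert(RawToOrdinalSeed(tmp) == ordinal);
  return tmp;
}
```

With this scheme an application could persist only the small ordinal seed and recompute the raw seed once when loading the structure, which matches the intent described in the class comment.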
// StandardRehasher (and StandardRehasherAdapter): A variant of
// StandardHasher that uses the same type for keys as for hashes.
// This is primarily intended for building a Ribbon filter/PHSF
// from existing hashes without going back to original inputs in order
// to apply a different seed. This hasher seeds a 1-to-1 mixing
// transformation to apply a seed to an existing hash (or hash-sized key).
// This is primarily intended for building a Ribbon filter
// from existing hashes without going back to original inputs in
// order to apply a different seed. This hasher seeds a 1-to-1 mixing
// transformation to apply a seed to an existing hash. (Untested for
// hash-sized keys that are not already uniformly distributed.) This
// transformation builds on the seed pre-mixing done in StandardHasher.
//
// Testing suggests essentially no degradation of solution success rate
// vs. going back to original inputs when changing hash seeds. For example:
// Average re-seeds for solution with r=128, 1.02x overhead, and ~100k keys
// is about 1.10 for both StandardHasher and StandardRehasher.
//
// StandardRehasher is not really recommended for general PHSFs (not
// filters) because a collision in the original hash could prevent
// construction despite re-seeding the Rehasher. (Such collisions
// do not interfere with filter construction.)
//
// concept RehasherTypesAndSettings: like TypesAndSettings but
// does not require Key or HashFn.
template <class RehasherTypesAndSettings>
......@@ -262,28 +376,20 @@ class StandardRehasherAdapter : public RehasherTypesAndSettings {
using Key = Hash;
using Seed = typename RehasherTypesAndSettings::Seed;
static Hash HashFn(const Hash& input, Seed seed) {
static_assert(sizeof(Hash) <= 8, "Hash too big");
if (sizeof(Hash) > 4) {
// XXH3_avalanche / XXH3p_avalanche (64-bit), modified for seed
uint64_t h = input;
h ^= h >> 37;
h ^= seed * uint64_t{0xC2B2AE3D27D4EB4F};
h *= uint64_t{0x165667B19E3779F9};
h ^= h >> 32;
return static_cast<Hash>(h);
} else {
// XXH32_avalanche (32-bit), modified for seed
uint32_t h32 = static_cast<uint32_t>(input);
h32 ^= h32 >> 15;
h32 ^= seed * uint32_t{0x27D4EB4F};
h32 *= uint32_t{0x85EBCA77};
h32 ^= h32 >> 13;
h32 *= uint32_t{0xC2B2AE3D};
h32 ^= h32 >> 16;
return static_cast<Hash>(h32);
}
static Hash HashFn(const Hash& input, Seed raw_seed) {
// Note: raw_seed is already lightly pre-mixed, and this multiplication
// by a large prime is sufficient mixing (low-to-high bits) on top of
// that for good FastRange results, which depends primarily on highest
// bits. (The hashed CoeffRow and ResultRow are less sensitive to
// mixing than Start.)
// Also note: did consider adding ^ (input >> some) before the
// multiplication, but doesn't appear to be necessary.
return (input ^ raw_seed) * kRehashFactor;
}
private:
static constexpr Hash kRehashFactor =
static_cast<Hash>(0x6193d459236a3a0dULL);
};
// See comment on StandardRehasherAdapter
......@@ -291,6 +397,10 @@ template <class RehasherTypesAndSettings>
using StandardRehasher =
StandardHasher<StandardRehasherAdapter<RehasherTypesAndSettings>>;
#ifdef _MSC_VER
#pragma warning(pop)
#endif
// Especially with smaller hashes (e.g. 32 bit), there can be noticeable
// false positives due to collisions in the Hash returned by GetHash.
// This function returns the expected FP rate due to those collisions,
......@@ -442,9 +552,17 @@ class StandardBanding : public StandardHasher<TypesAndSettings> {
// Iteratively (a) resets the structure for `num_slots`, (b) attempts
// to add the range of inputs, and (c) if unsuccessful, chooses next
// hash seed, until either successful or unsuccessful with max_seed
// (minimum one seed attempted). Returns true if successful. In that
// case, use GetSeed() to get the successful seed.
// hash seed, until either successful or unsuccessful with all the
// allowed seeds. Returns true if successful. In that case, use
// GetOrdinalSeed() or GetRawSeed() to get the successful seed.
//
// The allowed sequence of hash seeds is determined by
// `starting_ordinal_seed,` the first ordinal seed to be attempted
// (see StandardHasher), and `ordinal_seed_mask,` a bit mask (power of
// two minus one) for the range of ordinal seeds to consider. The
// max number of seeds considered will be ordinal_seed_mask + 1.
// For filters we suggest `starting_ordinal_seed` be chosen randomly
// or round-robin, to minimize false positive correlations between keys.
//
// If unsuccessful, how best to continue is going to be application
// specific. It should be possible to choose parameters such that
......@@ -459,16 +577,27 @@ class StandardBanding : public StandardHasher<TypesAndSettings> {
// significant correlation in success, rather than independence.)
template <typename InputIterator>
bool ResetAndFindSeedToSolve(Index num_slots, InputIterator begin,
InputIterator end, Seed max_seed) {
StandardHasher<TypesAndSettings>::ResetSeed();
InputIterator end,
Seed starting_ordinal_seed = 0U,
Seed ordinal_seed_mask = 63U) {
// power of 2 minus 1
assert((ordinal_seed_mask & (ordinal_seed_mask + 1)) == 0);
// starting seed is within mask
assert((starting_ordinal_seed & ordinal_seed_mask) ==
starting_ordinal_seed);
starting_ordinal_seed &= ordinal_seed_mask; // if not debug
Seed cur_ordinal_seed = starting_ordinal_seed;
do {
StandardHasher<TypesAndSettings>::SetOrdinalSeed(cur_ordinal_seed);
Reset(num_slots);
bool success = AddRange(begin, end);
if (success) {
return true;
}
} while (StandardHasher<TypesAndSettings>::NextSeed(max_seed));
// No seed through max_seed worked.
cur_ordinal_seed = (cur_ordinal_seed + 1) & ordinal_seed_mask;
} while (cur_ordinal_seed != starting_ordinal_seed);
// Reached limit by circling around
return false;
}
......
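The circular seed search in ResetAndFindSeedToSolve can be separated from the banding logic. The sketch below is a simplified illustration. `FindSeedCircular` and its `try_seed` callback are hypothetical names standing in for the SetOrdinalSeed, Reset, and AddRange sequence in the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical helper: tries ordinal seeds starting at `starting_seed`,
// wrapping around within `seed_mask` (a power of two minus one), until
// `try_seed` succeeds or every seed in the range has been attempted once.
inline bool FindSeedCircular(const std::function<bool(uint32_t)>& try_seed,
                             uint32_t starting_seed, uint32_t seed_mask,
                             uint32_t* found_seed) {
  // Mask must be a power of two minus one.
  assert((seed_mask & (seed_mask + 1)) == 0);
  // Keep the starting seed within the mask.
  starting_seed &= seed_mask;
  uint32_t cur = starting_seed;
  do {
    if (try_seed(cur)) {
      *found_seed = cur;
      return true;
    }
    cur = (cur + 1) & seed_mask;
  } while (cur != starting_seed);
  // Reached the limit by circling around.
  return false;
}
```

Starting from a random or round-robin seed, as the comment in the diff suggests for filters, only changes where the loop begins. The total number of attempts is still bounded by `seed_mask + 1`.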