• P
    Warn on excessive keys for legacy Bloom filter with 32-bit hash (#6317) · 8aa99fc7
    Peter Dillinger 提交于
    Summary:
    With many millions of keys, the old Bloom filter implementation
    for the block-based table (format_version <= 4) would have excessive FP
    rate due to the limitations of feeding the Bloom filter with a 32-bit hash.
    This change computes an estimated inflated FP rate due to this effect
    and warns in the log whenever an SST filter is constructed (almost
    certainly a "full" not "partitioned" filter) that exceeds 1.5x FP rate
    due to this effect. The detailed condition is only checked if 3 million
    keys or more have been added to a filter, as this should be a lower
    bound for common bits/key settings (< 20).
    
    Recommended remedies include smaller SST file size, using
    format_version >= 5 (for new Bloom filter), or using partitioned
    filters.
    
    This does not change behavior other than generating warnings for some
    constructed filters using the old implementation.
    Pull Request resolved: https://github.com/facebook/rocksdb/pull/6317
    
    Test Plan:
    Example with warning, 15M keys @ 15 bits / key: (working_mem_size_mb is just to stop after building one filter if it's large)
    
        $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=15000000 2>&1 | grep 'FP rate'
        [WARN] [/block_based/filter_policy.cc:292] Using legacy SST/BBT Bloom filter with excessive key count (15.0M @ 15bpk), causing estimated 1.8x higher filter FP rate. Consider using new Bloom with format_version>=5, smaller SST file size, or partitioned filters.
        Predicted FP rate %: 0.766702
        Average FP rate %: 0.66846
    
    Example without warning (150K keys):
    
        $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=150000 2>&1 | grep 'FP rate'
        Predicted FP rate %: 0.422857
        Average FP rate %: 0.379301
        $
    
    With more samples at 15 bits/key:
      150K keys -> no warning; actual: 0.379% FP rate (baseline)
      1M keys -> no warning; actual: 0.396% FP rate, 1.045x
      9M keys -> no warning; actual: 0.563% FP rate, 1.485x
      10M keys -> warning (1.5x); actual: 0.564% FP rate, 1.488x
      15M keys -> warning (1.8x); actual: 0.668% FP rate, 1.76x
      25M keys -> warning (2.4x); actual: 0.880% FP rate, 2.32x
    
    At 10 bits/key:
      150K keys -> no warning; actual: 1.17% FP rate (baseline)
      1M keys -> no warning; actual: 1.16% FP rate
      10M keys -> no warning; actual: 1.32% FP rate, 1.13x
      25M keys -> no warning; actual: 1.63% FP rate, 1.39x
      35M keys -> warning (1.6x); actual: 1.81% FP rate, 1.55x
    
    At 5 bits/key:
      150K keys -> no warning; actual: 9.32% FP rate (baseline)
      25M keys -> no warning; actual: 9.62% FP rate, 1.03x
      200M keys -> no warning; actual: 12.2% FP rate, 1.31x
      250M keys -> warning (1.5x); actual: 12.8% FP rate, 1.37x
      300M keys -> warning (1.6x); actual: 13.4% FP rate, 1.43x
    
    The reason for the modest inaccuracy at low bits/key is that the assumption of independence between a collision between 32-hash values feeding the filter and an FP in the filter is not quite true for implementations using "simple" logic to compute indices from the stock hash result. There's math on this in my dissertation, but I don't think it's worth the effort just for these extreme cases (> 100 million keys and low-ish bits/key).
    
    Differential Revision: D19471715
    
    Pulled By: pdillinger
    
    fbshipit-source-id: f80c96893a09bf1152630ff0b964e5cdd7e35c68
    8aa99fc7
filter_policy.cc 26.6 KB