1. 30 Jun, 2020 1 commit
    • A
      Extend Get/MultiGet deadline support to table open (#6982) · 9a5886bd
      Anand Ananthabhotla committed
      Summary:
      Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.
      
      The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.
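
      The idea of checking the deadline before every step of table open, not only before random reads, can be sketched as follows. This is a self-contained model, not RocksDB code: `OpenStatus`, `OpenStep`, and `RunTableOpen` are hypothetical names, and the clock is simulated so the behavior is deterministic.

      ```cpp
      #include <cassert>
      #include <chrono>
      #include <string>
      #include <vector>

      using Micros = std::chrono::microseconds;

      // Hypothetical stand-ins for illustration; these are not RocksDB types.
      enum class OpenStatus { kOk, kTimedOut };

      struct OpenStep {
        std::string name;  // e.g. "file open", "prefetch", "preload"
        Micros cost;       // simulated time the step takes
      };

      // Runs the steps in order and checks the deadline before each one.
      // A zero deadline is treated as "no deadline" (an assumption of this sketch).
      OpenStatus RunTableOpen(const std::vector<OpenStep>& steps, Micros start,
                              Micros deadline, std::vector<std::string>* completed) {
        assert(completed != nullptr);
        Micros now = start;
        for (const OpenStep& step : steps) {
          if (deadline != Micros::zero() && now >= deadline) {
            return OpenStatus::kTimedOut;
          }
          now += step.cost;  // the simulated clock advances by the step's cost
          completed->push_back(step.name);
        }
        return OpenStatus::kOk;
      }
      ```

      With steps costing 10, 20, and 30 microseconds, a start time of 100, and a deadline of 125, the first two steps complete and the third is rejected.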
      
      Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982
      
      Test Plan: Add new unit tests in db_basic_test
      
      Reviewed By: riversand963
      
      Differential Revision: D22219515
      
      Pulled By: anand1976
      
      fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
  2. 29 Apr, 2020 1 commit
    • P
      Basic MultiGet support for partitioned filters (#6757) · bae6f586
      Peter Dillinger committed
      Summary:
      In MultiGet, access each applicable filter partition only once
      per batch, rather than for each applicable key. Also,
      
      * Fix Bloom stats for MultiGet
      * Fix/refactor MultiGetContext::Range::KeysLeft, including
      * Add efficient BitsSetToOne implementation
      * Assert that MultiGetContext::Range does not go beyond shift range
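
      A bit-counting helper and a range-limited "keys left" computation of the kind listed above might look like the sketch below. It is not the PR's implementation: `CountBitsSet`, `LowBits`, and `KeysLeft` are hypothetical names, and the 64-position batch mask is an assumption.

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Portable population count (Kernighan's method): each iteration clears the
      // lowest set bit, so the loop runs once per set bit.
      inline int CountBitsSet(uint64_t v) {
        int count = 0;
        while (v != 0) {
          v &= v - 1;
          ++count;
        }
        return count;
      }

      // Mask with the low n bits set. n == 64 is handled separately because
      // shifting a 64-bit value by 64 is undefined behavior.
      inline uint64_t LowBits(int n) {
        assert(n >= 0 && n <= 64);
        return n == 64 ? ~uint64_t{0} : (uint64_t{1} << n) - 1;
      }

      // Hypothetical helper: number of keys in batch positions [start, end) that
      // are not marked in skip_mask.
      inline int KeysLeft(uint64_t skip_mask, int start, int end) {
        assert(start <= end);
        uint64_t in_range = LowBits(end) & ~LowBits(start);
        return CountBitsSet(in_range & ~skip_mask);
      }
      ```

      The assertion in `LowBits` plays the same role as the shift-range assertion mentioned in the list: it rejects a range that the mask width cannot represent.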
      
      Performance test: Generate db:
      
          $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true
          ...
      
      Before (middle performing run of three; note some missing Bloom stats):
      
          $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
          multireadrandom :      26.403 micros/op 597517 ops/sec; (548427 of 671968 found)
          rocksdb.block.cache.filter.hit COUNT : 83443275
          rocksdb.bloom.filter.useful COUNT : 0
          rocksdb.bloom.filter.full.positive COUNT : 0
          rocksdb.bloom.filter.full.true.positive COUNT : 7931450
          rocksdb.number.multiget.get COUNT : 385984
          rocksdb.number.multiget.keys.read COUNT : 12351488
          rocksdb.number.multiget.bytes.read COUNT : 793145000
          rocksdb.number.multiget.keys.found COUNT : 7931450
      
      After (middle performing run of three):
      
          $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget'
          multireadrandom :      21.024 micros/op 752963 ops/sec; (705188 of 863968 found)
          rocksdb.block.cache.filter.hit COUNT : 49856682
          rocksdb.bloom.filter.useful COUNT : 45684579
          rocksdb.bloom.filter.full.positive COUNT : 10395458
          rocksdb.bloom.filter.full.true.positive COUNT : 9908456
          rocksdb.number.multiget.get COUNT : 481984
          rocksdb.number.multiget.keys.read COUNT : 15423488
          rocksdb.number.multiget.bytes.read COUNT : 990845600
          rocksdb.number.multiget.keys.found COUNT : 9908456
      
      So that's about 25% higher throughput even for random keys
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757
      
      Test Plan: unit test included
      
      Reviewed By: anand1976
      
      Differential Revision: D21243256
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
  3. 21 Feb, 2020 1 commit
    • S
      Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong committed
      Summary:
      When two binaries are dynamically linked together, different builds of RocksDB from the two sources might cause errors. To give users a way to solve this problem, the RocksDB namespace is changed to a flag that can be overridden at build time.
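
      The general technique of an overridable namespace macro can be sketched as below. `MYLIB_NAMESPACE` and the helper macros are hypothetical names chosen for this example; only the pattern is the point.

      ```cpp
      #include <string>

      // Default namespace name unless the build overrides it, for example with
      // -DMYLIB_NAMESPACE=mylib_v2 (MYLIB_NAMESPACE is a hypothetical macro).
      #ifndef MYLIB_NAMESPACE
      #define MYLIB_NAMESPACE mylib
      #endif

      // Two-level stringification so the macro is expanded before being quoted.
      #define MYLIB_STRINGIFY_INNER(x) #x
      #define MYLIB_STRINGIFY(x) MYLIB_STRINGIFY_INNER(x)

      namespace MYLIB_NAMESPACE {
      // Reports which namespace this copy of the library was compiled into.
      inline std::string NamespaceName() { return MYLIB_STRINGIFY(MYLIB_NAMESPACE); }
      }  // namespace MYLIB_NAMESPACE
      ```

      Two builds compiled with different values of the macro export differently mangled symbols, so they can coexist in one process.
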
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  4. 28 Jan, 2020 1 commit
    • P
      Clean up PartitionedFilterBlockBuilder (#6299) · 986df371
      Peter Dillinger committed
      Summary:
      Remove the redundant PartitionedFilterBlockBuilder::num_added_ and ::NumAdded since the parent class, FullFilterBlockBuilder, already provides them.
      Also rename filters_in_partition_ and filters_per_partition_ to keys_added_to_partition_ and keys_per_partition_ to improve readability.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6299
      
      Test Plan: make check
      
      Differential Revision: D19413278
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 04926ee7874477d659cb2b6ae03f2d995fb747e5
  5. 19 Oct, 2019 1 commit
    • L
      Store the filter bits reader alongside the filter block contents (#5936) · 29ccf207
      Levi Tamasi committed
      Summary:
      Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that
      only the filter block contents are stored in the block cache (as opposed to the
      earlier design where the cache stored the filter block reader itself, leading to
      potentially dangling pointers and concurrency bugs). However, this change
      introduced a performance hit since with the new code, the metadata fields are
      re-parsed upon every access. This patch reunites the block contents with the
      filter bits reader to eliminate this overhead; since this is still a self-contained
      pure data object, it is safe to store it in the cache. (Note: this is similar to how
      the zstd digest is handled.)
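
      The design of keeping parsed metadata next to the raw contents in one self-contained object can be sketched as follows. `ParsedFilterBlock` and its one-byte trailer layout are assumptions made for this example, not the actual filter format.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #include <string>
      #include <utility>

      // Hypothetical layout: payload bytes followed by one byte holding num_probes.
      class ParsedFilterBlock {
       public:
        // Parses the metadata once, at construction, instead of on every access.
        explicit ParsedFilterBlock(std::string contents)
            : contents_(std::move(contents)) {
          assert(!contents_.empty());
          num_probes_ = static_cast<uint8_t>(contents_.back());
          payload_len_ = contents_.size() - 1;
        }

        int num_probes() const { return num_probes_; }
        size_t payload_len() const { return payload_len_; }

       private:
        // The object owns its bytes and holds no pointers to other objects, so it
        // stays valid for as long as the cache keeps it.
        std::string contents_;
        int num_probes_ = 0;
        size_t payload_len_ = 0;
      };
      ```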
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936
      
      Test Plan:
      make asan_check
      
      filter_bench results for the old code:
      
      ```
      $ ./filter_bench -quick
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.7153
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 33.4258
        Single filter ns/op: 42.5974
        Random filter ns/op: 217.861
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.4217
        Single filter ns/op: 50.9855
        Random filter ns/op: 219.167
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      
      $ ./filter_bench -quick -use_full_block_reader
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.5172
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 32.3556
        Single filter ns/op: 83.2239
        Random filter ns/op: 370.676
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.2265
        Single filter ns/op: 93.5651
        Random filter ns/op: 408.393
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      ```
      
      With the new code:
      
      ```
      $ ./filter_bench -quick
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 25.4285
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 31.0594
        Single filter ns/op: 43.8974
        Random filter ns/op: 226.075
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 31.0295
        Single filter ns/op: 50.3824
        Random filter ns/op: 226.805
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      
      $ ./filter_bench -quick -use_full_block_reader
      WARNING: Assertions are enabled; benchmarks unnecessarily slow
      Building...
      Build avg ns/key: 26.5308
      Number of filters: 16669
      Total memory (MB): 200.009
      Bits/key actual: 10.0647
      ----------------------------
      Inside queries...
        Dry run (46b) ns/op: 33.2968
        Single filter ns/op: 58.6163
        Random filter ns/op: 291.434
      ----------------------------
      Outside queries...
        Dry run (25d) ns/op: 32.1839
        Single filter ns/op: 66.9039
        Random filter ns/op: 292.828
          Average FP rate %: 1.13993
      ----------------------------
      Done. (For more info, run with -legend or -help.)
      ```
      
      Differential Revision: D17991712
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29
  6. 25 Sep, 2019 1 commit
    • M
      Fix a bug in format_version 3 + partition filters + prefix search (#5835) · 6652c94f
      Maysam Yabandeh committed
      Summary:
      Partitioned filters make use of a top-level index to find the partition in which the filter resides. The top-level index has a key per partition. The key is guaranteed to be greater than or equal to any key in that partition. When used with format_version 3, which excludes the sequence number from index keys, the separator key in the index could be equal to the prefix of the keys in the next partition. In that case, when searching for the key, the top-level index will lead us to the previous partition, which has no key with that prefix. The prefix bloom test thus returns false, although the prefix exists in the bloom of the next partition.
      The patch fixes that by a hack: It always adds the prefix of the first key of the next partition to the bloom of the current partition. In this way, in the corner cases that the index will lead us to the previous partition, we still can find the bloom filter there.
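
      The corner case and the workaround can be reproduced with a small model. This is a sketch under simplifying assumptions: an exact set of prefixes stands in for each Bloom filter, and `FilterPartition`, `BuildPartitions`, and `PrefixMayMatch` are hypothetical names.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <set>
      #include <string>
      #include <vector>

      // Hypothetical model: an exact set of prefixes stands in for a Bloom filter.
      struct FilterPartition {
        std::string separator;           // top-level index key for this partition
        std::set<std::string> prefixes;  // prefixes recorded in this partition
      };

      std::vector<FilterPartition> BuildPartitions(
          const std::vector<std::vector<std::string>>& key_groups,
          const std::vector<std::string>& separators, size_t prefix_len,
          bool add_next_prefix) {
        assert(key_groups.size() == separators.size());
        std::vector<FilterPartition> parts;
        for (size_t i = 0; i < key_groups.size(); ++i) {
          FilterPartition p;
          p.separator = separators[i];
          for (const std::string& key : key_groups[i]) {
            p.prefixes.insert(key.substr(0, prefix_len));
          }
          // The workaround: also record the prefix of the next partition's first key.
          if (add_next_prefix && i + 1 < key_groups.size() &&
              !key_groups[i + 1].empty()) {
            p.prefixes.insert(key_groups[i + 1].front().substr(0, prefix_len));
          }
          parts.push_back(p);
        }
        return parts;
      }

      // Picks the first partition whose separator is >= the prefix, then tests it.
      bool PrefixMayMatch(const std::vector<FilterPartition>& parts,
                          const std::string& prefix) {
        for (const FilterPartition& p : parts) {
          if (p.separator >= prefix) return p.prefixes.count(prefix) > 0;
        }
        return false;
      }
      ```

      With key groups {"a1", "a2"} and {"b1", "b2"}, separators "b" and "c", and a one-character prefix, a lookup for prefix "b" lands on the first partition. It misses there unless the next partition's prefix was added.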
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5835
      
      Differential Revision: D17513585
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: e2d1ff26c759e6e03875c4d57f4228316ecf50e9
  7. 17 Jul, 2019 1 commit
    • L
      Move the filter readers out of the block cache (#5504) · 3bde41b5
      Levi Tamasi committed
      Summary:
      Currently, when the block cache is used for the filter block, it is not
      really the block itself that is stored in the cache but a FilterBlockReader
      object. Since this object is not pure data (it has, for instance, pointers that
      might dangle, including in one case a back pointer to the TableReader), it's not
      really sharable. To avoid the issues around this, the current code erases the
      cache entries when the TableReader is closed (which, BTW, is not sufficient
      since a concurrent TableReader might have picked up the object in the meantime).
      Instead of doing this, the patch moves the FilterBlockReader out of the cache
      altogether, and decouples the filter reader object from the filter block.
      In particular, instead of the TableReader owning, or caching/pinning the
      FilterBlockReader (based on the customer's settings), with the change the
      TableReader unconditionally owns the FilterBlockReader, which in turn
      owns/caches/pins the filter block. This change also enables us to reuse the code
      paths historically used for data blocks for filters as well.
      
      Note:
      Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
      separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
      
      Test Plan: make asan_check
      
      Differential Revision: D16036974
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
  8. 11 Jun, 2019 1 commit
    • H
      Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421) · 5efa0d6b
      haoyuhuang committed
      Summary:
      BlockCacheLookupContext only contains the caller for now.
      We will trace block accesses at five places:
      1. BlockBasedTable::GetFilter.
      2. BlockBasedTable::GetUncompressedDict.
      3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
      4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
      5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
      
      We create the context at:
      1. BlockBasedTable::Get. (kUserGet)
      2. BlockBasedTable::MultiGet. (kUserMGet)
      3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
      4. BlockBasedTable::Open. (kPrefetch)
      5. Index/Filter::CacheDependencies. (kPrefetch)
      6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).
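
      A context object that carries only the caller, created at the entry points and passed down to the places where accesses are traced, could be modeled like this. The caller names mirror the ones listed above; the types `LookupContext`, `TraceRecord`, and `TracingCache` are hypothetical and are not the classes added by the PR.

      ```cpp
      #include <string>
      #include <vector>

      // Caller names follow the list above; everything else here is hypothetical.
      enum class Caller {
        kUserGet,
        kUserMGet,
        kUserIterator,
        kCompaction,
        kPrefetch,
        kUserApproximateSize
      };

      // Created once at an entry point and passed down to every cache lookup.
      struct LookupContext {
        explicit LookupContext(Caller c) : caller(c) {}
        Caller caller;
      };

      struct TraceRecord {
        std::string block_type;
        Caller caller;
      };

      // Hypothetical cache front end that records who asked for each block.
      class TracingCache {
       public:
        void Lookup(const std::string& block_type, const LookupContext& ctx) {
          trace_.push_back(TraceRecord{block_type, ctx.caller});
        }
        const std::vector<TraceRecord>& trace() const { return trace_; }

       private:
        std::vector<TraceRecord> trace_;
      };
      ```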
      
      I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
      Throughput of this PR: 231334 ops/s.
      Throughput of the master branch: 238428 ops/s.
      
      Experiment setup:
      RocksDB:    version 6.2
      Date:       Mon Jun 10 10:42:51 2019
      CPU:        24 * Intel Core Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       20 bytes each
      Values:     100 bytes each (100 bytes after compression)
      Entries:    1000000
      Prefix:    20 bytes
      Keys per prefix:    0
      RawSize:    114.4 MB (estimated)
      FileSize:   114.4 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: NoCompression
      Compression sampling rate: 0
      Memtablerep: skip_list
      Perf Level: 1
      
      Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000
      
      Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120
      
      TODOs:
      1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
      2. Integrate tracer to trace block cache accesses.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421
      
      Differential Revision: D15704258
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
  9. 31 May, 2019 2 commits
  10. 11 May, 2019 1 commit
    • L
      Turn CachableEntry into a proper resource handle (#5252) · f0bf3bf3
      Levi Tamasi committed
      Summary:
      CachableEntry is used in a variety of contexts: it may refer to a cached
      object (i.e. an object in the block cache), an owned object, or an
      unowned object; also, in some cases (most notably with iterators), the
      responsibility of managing the pointed-to object gets handed off to
      another object. Each of the above scenarios has different implications
      for the lifecycle of the referenced object. For the most part, the patch
      does not change the lifecycle of managed objects; however, it makes
      these relationships explicit, and it also enables us to eliminate some
      hacks and accident-prone code around releasing cache handles and
      deleting/cleaning up objects. (The only places where the patch changes
      how objects are managed are the partitions of partitioned indexes and
      filters.)
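
      A move-only handle that makes the three ownership modes explicit can be sketched as below. It is a simplified model rather than the real `CachableEntry`: `ResourceHandle` and `FakeCacheHandle` are hypothetical, and the cache is reduced to a reference count.

      ```cpp
      #include <cassert>
      #include <utility>

      // Hypothetical stand-in for a block cache handle: only a reference count.
      struct FakeCacheHandle {
        int refs = 0;
      };

      template <typename T>
      class ResourceHandle {
       public:
        ResourceHandle() = default;

        // Value lives in the cache; the handle holds one reference until released.
        static ResourceHandle Cached(T* value, FakeCacheHandle* handle) {
          assert(value != nullptr && handle != nullptr);
          ++handle->refs;
          return ResourceHandle(value, handle, false);
        }
        // Value is heap-allocated and deleted when the handle goes away.
        static ResourceHandle Owned(T* value) {
          return ResourceHandle(value, nullptr, true);
        }
        // Value is managed elsewhere; the handle never frees it.
        static ResourceHandle Unowned(T* value) {
          return ResourceHandle(value, nullptr, false);
        }

        ResourceHandle(const ResourceHandle&) = delete;
        ResourceHandle& operator=(const ResourceHandle&) = delete;

        // Moving hands off responsibility and leaves the source empty.
        ResourceHandle(ResourceHandle&& other) noexcept
            : value_(other.value_), handle_(other.handle_), own_(other.own_) {
          other.Clear();
        }
        ResourceHandle& operator=(ResourceHandle&& other) noexcept {
          if (this != &other) {
            Release();
            value_ = other.value_;
            handle_ = other.handle_;
            own_ = other.own_;
            other.Clear();
          }
          return *this;
        }
        ~ResourceHandle() { Release(); }

        T* get() const { return value_; }

       private:
        ResourceHandle(T* value, FakeCacheHandle* handle, bool own)
            : value_(value), handle_(handle), own_(own) {}

        void Release() {
          if (handle_ != nullptr) {
            --handle_->refs;  // drop the cache reference
          } else if (own_) {
            delete value_;  // free the owned object
          }
          Clear();
        }
        void Clear() {
          value_ = nullptr;
          handle_ = nullptr;
          own_ = false;
        }

        T* value_ = nullptr;
        FakeCacheHandle* handle_ = nullptr;
        bool own_ = false;
      };
      ```

      Because copying is disabled and moving empties the source, exactly one handle is responsible for releasing the cache reference or deleting the object at any time.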
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5252
      
      Differential Revision: D15101358
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 9eb59e9ae5a7230e3345789762d0ba1f189485be
  11. 18 Sep, 2018 1 commit
    • M
      Fix bug in partition filters with format_version=4 (#4381) · 65ac72ed
      Maysam Yabandeh committed
      Summary:
      Value delta encoding in format_version 4 requires the difference between the sizes of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also to the indexes on indexes and filters in partitioned indexes and filters, respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than its difference from the previous size.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381
      
      Differential Revision: D9879505
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
  12. 10 Aug, 2018 1 commit
    • M
      Index value delta encoding (#3983) · caf0f53a
      Maysam Yabandeh committed
      Summary:
      Given that an index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding to the values. The first value at each index restart interval encodes the full BlockHandle, but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the details of the encoding. This reduces the index size, which helps use the block cache more efficiently. The feature is enabled by using format_version 4.
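
      The scheme described in this paragraph, a full handle at each restart point and size only in between, can be sketched as follows. This is an illustration and not the on-disk format: `Handle`, `EncodedValue`, `EncodeHandles`, and `DecodeHandles` are hypothetical, and `gap` is an assumed fixed number of bytes between consecutive blocks.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      struct Handle {
        uint64_t offset;
        uint64_t size;
      };

      // One encoded index value: a full handle at restart points, size only otherwise.
      struct EncodedValue {
        bool full;
        uint64_t offset;  // meaningful only when full is true
        uint64_t size;
      };

      // `gap` is a hypothetical fixed number of bytes between consecutive blocks.
      std::vector<EncodedValue> EncodeHandles(const std::vector<Handle>& handles,
                                              size_t restart_interval, uint64_t gap) {
        assert(restart_interval > 0);
        std::vector<EncodedValue> out;
        for (size_t i = 0; i < handles.size(); ++i) {
          if (i % restart_interval == 0) {
            out.push_back({true, handles[i].offset, handles[i].size});
          } else {
            // Size-only encoding is valid only for back-to-back blocks.
            assert(handles[i].offset ==
                   handles[i - 1].offset + handles[i - 1].size + gap);
            out.push_back({false, 0, handles[i].size});
          }
        }
        return out;
      }

      std::vector<Handle> DecodeHandles(const std::vector<EncodedValue>& values,
                                        uint64_t gap) {
        std::vector<Handle> out;
        for (const EncodedValue& v : values) {
          if (v.full) {
            out.push_back({v.offset, v.size});
          } else {
            assert(!out.empty());
            const Handle prev = out.back();  // copy before the vector grows
            out.push_back({prev.offset + prev.size + gap, v.size});
          }
        }
        return out;
      }
      ```

      The offset of a non-restart entry is never stored; the decoder recomputes it from the previous handle, which is where the space saving comes from.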
      
      The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
      Results with sysbench read-only using 4k blocks and using 16 index restart interval:
      Format 2:
      19585   rocksdb read-only range=100
      Format 3:
      19569   rocksdb read-only range=100
      Format 4:
      19352   rocksdb read-only range=100
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
      
      Differential Revision: D8361343
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
  13. 07 Jun, 2018 1 commit
  14. 22 May, 2018 1 commit
    • Z
      Move prefix_extractor to MutableCFOptions · c3ebc758
      Zhongyi Xie committed
      Summary:
      Currently it is not possible to change the bloom filter config without restarting the DB, which causes a lot of operational complexity for users.
      This PR aims to make it possible to dynamically change bloom filter config.
      Closes https://github.com/facebook/rocksdb/pull/3601
      
      Differential Revision: D7253114
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
  15. 10 Apr, 2018 1 commit
    • M
      Fix the memory leak with pinned partitioned filters · d2bcd761
      Maysam Yabandeh committed
      Summary:
      The existing unit test did not set the level, so the check that pinned partitioned filters/indexes are properly released from the block cache was not exercised, as pinning only takes effect in level 0. As a result, a memory leak in pinned partitioned filters was hidden. The patch fixes the test as well as the bug.
      Closes https://github.com/facebook/rocksdb/pull/3692
      
      Differential Revision: D7559763
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 55eff274945838af983c764a7d71e8daff092e4a
  16. 22 Mar, 2018 1 commit
  17. 23 Aug, 2017 1 commit
  18. 12 Aug, 2017 1 commit
    • S
      Support prefetch last 512KB with direct I/O in block based file reader · 666a005f
      Siying Dong committed
      Summary:
      Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except for compaction inputs or when readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in the block based table if direct I/O is enabled. The prefetched buffer is passed in together with the random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
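
      The "try the buffer first, then fall back to the file" read path can be sketched with an in-memory model. `TailPrefetchBuffer` and `ReadAt` are hypothetical names, and a `std::string` stands in for the file; this is not the `FilePrefetchBuffer` API.

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <string>

      // Hypothetical buffer holding the last tail_bytes of a file.
      class TailPrefetchBuffer {
       public:
        TailPrefetchBuffer(const std::string& file, size_t tail_bytes) {
          offset_ = file.size() > tail_bytes ? file.size() - tail_bytes : 0;
          data_ = file.substr(offset_);
        }

        // Succeeds only when [offset, offset + n) lies fully inside the buffer.
        bool TryRead(size_t offset, size_t n, std::string* out) const {
          if (offset < offset_ || offset + n > offset_ + data_.size()) return false;
          out->assign(data_, offset - offset_, n);
          return true;
        }

       private:
        size_t offset_ = 0;  // file offset of the first buffered byte
        std::string data_;
      };

      // Reads from the buffer when possible; otherwise reads the "file" and counts
      // the access so the saved I/O is visible.
      std::string ReadAt(const std::string& file, const TailPrefetchBuffer* buffer,
                         size_t offset, size_t n, int* file_reads) {
        assert(file_reads != nullptr && offset <= file.size());
        std::string out;
        if (buffer != nullptr && buffer->TryRead(offset, n, &out)) return out;
        ++*file_reads;
        return file.substr(offset, n);
      }
      ```

      Reads that fall in the buffered tail cost no file access; reads elsewhere behave as before.
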
      Closes https://github.com/facebook/rocksdb/pull/2708
      
      Differential Revision: D5593091
      
      Pulled By: siying
      
      fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
  19. 16 Jul, 2017 1 commit
  20. 03 Jul, 2017 1 commit
  21. 06 May, 2017 1 commit
    • M
      Object lifetime in cache · 40af2381
      Maysam Yabandeh committed
      Summary:
      Any non-raw-data dependent object must be destructed before the table
      closes. There was a bug of not doing that for the filter object. This patch
      fixes the bug and adds a unit test to prevent such bugs in the future.
      Closes https://github.com/facebook/rocksdb/pull/2246
      
      Differential Revision: D5001318
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 6d8772e58765485868094b92964da82ef9730b6d
  22. 03 May, 2017 1 commit
  23. 02 May, 2017 1 commit
    • M
      Delete filter before closing the table · 89833577
      Maysam Yabandeh committed
      Summary:
      Some filters, such as partitioned filters, have pointers to the table for which they are created. Therefore, if they are stored in the block cache, they should be forcibly erased from the block cache before closing the table, which results in deleting the object. Otherwise the destructor will be called later, when the cache lazily erases the object; with the parent table no longer in existence, this could result in undefined behavior.
      
      Update: there will still be cases where the filter is not removed from the cache, since the table has not kept a pointer to the cache handle to be able to forcibly release it later. We make sure that the filter destructor does not access the table pointer to get around such cases.
      Closes https://github.com/facebook/rocksdb/pull/2207
      
      Differential Revision: D4941591
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 56fbab2a11cf447e1aa67caa30b58d7bd7ce5bbd
  24. 28 Apr, 2017 1 commit
  25. 23 Mar, 2017 1 commit
  26. 08 Mar, 2017 1 commit
  27. 04 Mar, 2017 1 commit