1. June 30, 2020 (1 commit)
    • Extend Get/MultiGet deadline support to table open (#6982) · 9a5886bd
      Authored by Anand Ananthabhotla
      Summary:
      The current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches, and preloads as part of table open.
      
      The main changes are in ```BlockBasedTable```, the partitioned index and filter readers, and ```TableCache```, which now take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t. checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.
      
      There are additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
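      
      A hedged usage sketch (not part of the PR): ```ReadOptions::deadline``` is an absolute time in microseconds relative to ```Env::NowMicros()```, and with this change it is honored during table open (file open, prefetch, preload) as well as during data block reads.
      
      ```
      #include <chrono>
      #include <rocksdb/db.h>
      
      rocksdb::Status GetWithDeadline(rocksdb::DB* db, const rocksdb::Slice& key,
                                      std::string* value) {
        rocksdb::ReadOptions ro;
        // Absolute deadline ~10ms from now; reads past it return TimedOut.
        ro.deadline = std::chrono::microseconds(db->GetEnv()->NowMicros() + 10000);
        return db->Get(ro, key, value);
      }
      ```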
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982
      
      Test Plan: Add new unit tests in db_basic_test
      
      Reviewed By: riversand963
      
      Differential Revision: D22219515
      
      Pulled By: anand1976
      
      fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
  2. June 27, 2020 (1 commit)
    • Add unity build to CircleCI (#7026) · f9817201
      Authored by sdong
      Summary:
      We are still keeping the unity build working, so it's a good idea to add it to a pre-commit CI.
      A recent GCC Docker image is used just to get a little more coverage. Fix three small issues to make it pass.
      Also make unity_test run db_basic_test rather than db_test to cut the test time. There is no point running expensive tests here; it was set to run db_test before db_basic_test was separated out.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7026
      
      Test Plan: Watch tests pass.
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D22223197
      
      fbshipit-source-id: baa3b6cbb623bf359829b63ce35715c75bcb0ed4
  3. June 25, 2020 (2 commits)
    • Add a new option for BackupEngine to store table files under shared_checksum using DB session id in the backup filenames (#6997) · be41c61f
      Authored by Zitan Chen
      
      Summary:
      `BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`.
      
      Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true.
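      
      A hedged configuration sketch (the option name is taken from this summary; this is not code from the PR):
      
      ```
      #include <rocksdb/utilities/backupable_db.h>
      
      rocksdb::BackupableDBOptions MakeBackupOptions(const std::string& backup_dir) {
        rocksdb::BackupableDBOptions opts(backup_dir);
        // The new naming only takes effect when both of these are true.
        opts.share_table_files = true;
        opts.share_files_with_checksum = true;
        opts.new_naming_for_backup_files = true;  // option added by this PR
        return opts;
      }
      ```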
      
      Three new test cases are added.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997
      
      Test Plan: Passed make check.
      
      Reviewed By: ajkr
      
      Differential Revision: D22098895
      
      Pulled By: gg814
      
      fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76
    • Test CircleCI with CLANG-10 (#7025) · 9cc25190
      Authored by sdong
      Summary:
      It's useful to build RocksDB using a more recent clang version in CI. Add a CircleCI build and fix some issues with it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7025
      
      Test Plan: See all tests pass.
      
      Reviewed By: pdillinger
      
      Differential Revision: D22215700
      
      fbshipit-source-id: 914a729c2cd3f3ac4a627cc0ac58d4691dca2168
  4. June 23, 2020 (1 commit)
    • Minimize memory internal fragmentation for Bloom filters (#6427) · 5b2bbacb
      Authored by Peter Dillinger
      Summary:
      New experimental option BBTO::optimize_filters_for_memory builds
      filters that maximize their use of "usable size" from malloc_usable_size,
      which is also used to compute block cache charges.
      
      Rather than always "rounding up," we track state in the
      BloomFilterPolicy object to mix essentially "rounding down" and
      "rounding up" so that the average FP rate of all generated filters is
      the same as without the option. (YMMV, as heavily accessed filters might
      unluckily have lower accuracy.)
      
      Thus, the option near-minimizes what the block cache considers as
      "memory used" for a given target Bloom filter false positive rate and
      Bloom filter implementation. There are no forward or backward
      compatibility issues with this change, though it only works on the
      format_version=5 Bloom filter.
      
      With Jemalloc, we see about 10% reduction in memory footprint (and block
      cache charge) for Bloom filters, but 1-2% increase in storage footprint,
      due to encoding efficiency losses (FP rate is non-linear with bits/key).
      
      Why not weighted random round up/down rather than state tracking? By
      only requiring malloc_usable_size, we don't actually know what the next
      larger and next smaller usable sizes for the allocator are. We pick a
      requested size, accept and use whatever usable size it has, and use the
      difference to inform our next choice. This allows us to narrow in on the
      right balance without tracking/predicting usable sizes.
      
      Why not weight history of generated filter false positive rates by
      number of keys? This could lead to excess skew in small filters after
      generating a large filter.
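      
      A minimal enabling sketch (not from the PR; it assumes only the option named above):
      
      ```
      #include <rocksdb/filter_policy.h>
      #include <rocksdb/options.h>
      #include <rocksdb/table.h>
      
      rocksdb::Options MakeOptions() {
        rocksdb::BlockBasedTableOptions bbto;
        bbto.format_version = 5;  // the option only works on format_version=5 filters
        bbto.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10 /* bits per key */));
        bbto.optimize_filters_for_memory = true;  // new experimental option
        rocksdb::Options options;
        options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
        return options;
      }
      ```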
      
      Results from filter_bench with jemalloc (irrelevant details omitted):
      
          (normal keys/filter, but high variance)
          $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9
          Build avg ns/key: 29.6278
          Number of filters: 5516
          Total size (MB): 200.046
          Reported total allocated memory (MB): 220.597
          Reported internal fragmentation: 10.2732%
          Bits/key stored: 10.0097
          Average FP rate %: 0.965228
          $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
          Build avg ns/key: 30.5104
          Number of filters: 5464
          Total size (MB): 200.015
          Reported total allocated memory (MB): 200.322
          Reported internal fragmentation: 0.153709%
          Bits/key stored: 10.1011
          Average FP rate %: 0.966313
      
          (very few keys / filter, optimization not as effective due to ~59 byte
           internal fragmentation in blocked Bloom filter representation)
          $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9
          Build avg ns/key: 29.5649
          Number of filters: 162950
          Total size (MB): 200.001
          Reported total allocated memory (MB): 224.624
          Reported internal fragmentation: 12.3117%
          Bits/key stored: 10.2951
          Average FP rate %: 0.821534
          $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
          Build avg ns/key: 31.8057
          Number of filters: 159849
          Total size (MB): 200
          Reported total allocated memory (MB): 208.846
          Reported internal fragmentation: 4.42297%
          Bits/key stored: 10.4948
          Average FP rate %: 0.811006
      
          (high keys/filter)
          $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9
          Build avg ns/key: 29.7017
          Number of filters: 164
          Total size (MB): 200.352
          Reported total allocated memory (MB): 221.5
          Reported internal fragmentation: 10.5552%
          Bits/key stored: 10.0003
          Average FP rate %: 0.969358
          $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory
          Build avg ns/key: 30.7131
          Number of filters: 160
          Total size (MB): 200.928
          Reported total allocated memory (MB): 200.938
          Reported internal fragmentation: 0.00448054%
          Bits/key stored: 10.1852
          Average FP rate %: 0.963387
      
      And from db_bench (block cache) with jemalloc:
      
          $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
          $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false
          $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
          17063835
          $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }'
          17430747
          $ #^ 2.1% additional filter storage
          $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
          rocksdb.block.cache.index.add COUNT : 33
          rocksdb.block.cache.index.bytes.insert COUNT : 8440400
          rocksdb.block.cache.filter.add COUNT : 33
          rocksdb.block.cache.filter.bytes.insert COUNT : 21087528
          rocksdb.bloom.filter.useful COUNT : 4963889
          rocksdb.bloom.filter.full.positive COUNT : 1214081
          rocksdb.bloom.filter.full.true.positive COUNT : 1161999
          $ #^ 1.04 % observed FP rate
          $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000
          rocksdb.block.cache.index.add COUNT : 33
          rocksdb.block.cache.index.bytes.insert COUNT : 8448592
          rocksdb.block.cache.filter.add COUNT : 33
          rocksdb.block.cache.filter.bytes.insert COUNT : 18220328
          rocksdb.bloom.filter.useful COUNT : 5360933
          rocksdb.bloom.filter.full.positive COUNT : 1321315
          rocksdb.bloom.filter.full.true.positive COUNT : 1262999
          $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters
      
      (Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427
      
      Test Plan: unit test added, 'make check' with gcc, clang and valgrind
      
      Reviewed By: siying
      
      Differential Revision: D22124374
      
      Pulled By: pdillinger
      
      fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e
  5. June 20, 2020 (1 commit)
    • Fix block checksum for >=4GB, refactor (#6978) · 25a0d0ca
      Authored by Peter Dillinger
      Summary:
      Although RocksDB falls over in various other ways with KVs
      around 4GB or more, this change fixes how XXH32 and XXH64 were being
      called by the block checksum code to support >= 4GB, in case that should
      ever happen, or in case the code is copied for other uses.
      
      This change is not a schema compatibility issue because the checksum
      verification code would checksum the first (block_size + 1) mod 2^32
      bytes, while the checksum construction code would checksum the first
      block_size mod 2^32 bytes plus the compression type byte, meaning the
      XXH32/64 checksums for a >=4GB block would not match about 255/256 times.
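      
      An illustrative sketch of the general technique, checksumming a >4GB buffer with the xxhash streaming API in bounded chunks (this is the shape of such a fix, not the PR's actual code):
      
      ```
      #include <algorithm>
      #include <cstdint>
      #include "xxhash.h"
      
      uint64_t ChecksumLarge(const char* data, uint64_t len, char type_byte) {
        XXH64_state_t* state = XXH64_createState();
        XXH64_reset(state, 0 /* seed */);
        while (len > 0) {
          // Feed at most 1GB per call so no size is ever truncated.
          size_t chunk = static_cast<size_t>(std::min<uint64_t>(len, 1u << 30));
          XXH64_update(state, data, chunk);
          data += chunk;
          len -= chunk;
        }
        XXH64_update(state, &type_byte, 1);  // include the compression type byte
        uint64_t hash = XXH64_digest(state);
        XXH64_freeState(state);
        return hash;
      }
      ```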
      
      While touching this code, I refactored to consolidate redundant
      implementations, improving diagnostics and performance tracking in some
      cases. Also used less confusing language in those diagnostics.
      
      Makes https://github.com/facebook/rocksdb/issues/6875 obsolete.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6978
      
      Test Plan:
      I was able to write a test for this using an SST file writer
      and VerifyChecksum in a reader. The test fails before the fix, though
      I'm leaving the test disabled because I don't think it's worth the
      expense of running regularly.
      
      Reviewed By: gg814
      
      Differential Revision: D22143260
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 982993d16134e8c50bea2269047f901c1783726e
  6. June 18, 2020 (2 commits)
    • Fix the bug that compressed cache is disabled in read-only DBs (#6990) · 223b57ee
      Authored by sdong
      Summary:
      The compressed block cache was disabled for read-only DBs in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990
      
      Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache.
      
      Reviewed By: ltamasi
      
      Differential Revision: D22072755
      
      fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1
    • Store DB identity and DB session ID in SST files (#6983) · 94d04529
      Authored by Zitan Chen
      Summary:
      `db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.
      
      The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.
      
      In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.
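      
      A hedged sketch of reading the two new properties back (not from the PR; the member names follow the summary above):
      
      ```
      #include <cstdio>
      #include <rocksdb/db.h>
      
      void PrintFileIdentities(rocksdb::DB* db) {
        rocksdb::TablePropertiesCollection props;
        if (!db->GetPropertiesOfAllTables(&props).ok()) return;
        for (const auto& entry : props) {
          // db_id and db_session_id are the TableProperties members added here
          std::printf("%s: db_id=%s db_session_id=%s\n", entry.first.c_str(),
                      entry.second->db_id.c_str(),
                      entry.second->db_session_id.c_str());
        }
      }
      ```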
      
      A table property test is added.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983
      
      Test Plan: make check and some manual tests.
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D22048826
      
      Pulled By: gg814
      
      fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
  7. June 16, 2020 (1 commit)
    • Fix uninitialized memory read in table_test (#6980) · aa8f1331
      Authored by Levi Tamasi
      Summary:
      When using parameterized tests, `gtest` sometimes prints the test
      parameters. If no other printing method is available, it essentially
      produces a hex dump of the object. This can cause issues with valgrind
      with types like `TestArgs` in `table_test`, where the object layout has
      gaps (with uninitialized contents) due to the members' alignment
      requirements. The patch fixes the uninitialized reads by providing an
      `operator<<` for `TestArgs` and also makes sure all members are
      initialized (in a consistent order) on all code paths.
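      
      The general fix pattern, in a self-contained sketch (not the actual table_test code): initialize all members on all code paths, and give the parameter struct a printer so gtest never dumps the raw object bytes, padding included.
      
      ```
      #include <ostream>
      
      struct TestArgsLike {
        bool reverse_compare = false;  // initialized on all paths
        int restart_interval = 16;
        // alignment padding between members is what valgrind flagged
      };
      
      // With a printer available, gtest no longer hex-dumps the raw object
      // bytes, so the uninitialized padding is never read.
      std::ostream& operator<<(std::ostream& os, const TestArgsLike& args) {
        return os << "reverse_compare: " << args.reverse_compare
                  << ", restart_interval: " << args.restart_interval;
      }
      ```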
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6980
      
      Test Plan: `valgrind --leak-check=full ./table_test`
      
      Reviewed By: siying
      
      Differential Revision: D22045536
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 6f5920ac28c712d0aa88162fffb80172ed769c32
  8. June 14, 2020 (1 commit)
    • Fix persistent cache on windows (#6932) · 9c24a5cb
      Authored by Zhen Li
      Summary:
      The persistent cache feature caused a RocksDB crash on Windows. I posted an issue for it: https://github.com/facebook/rocksdb/issues/6919. I found this is because no "persistent_cache_key_prefix" is generated for the persistent cache. Looking at the repo history, "GetUniqueIdFromFile" is not implemented on Windows. So my fix adds a "NewId()" function in "persistent_cache" and uses it to generate the prefix for the persistent cache. In this PR, I also re-enable the related test cases defined in "db_test2" and "persistent_cache_test" for Windows.
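      
      A hedged setup sketch for the feature this fixes (signature per rocksdb/persistent_cache.h at the time; not code from the PR):
      
      ```
      #include <memory>
      #include <rocksdb/env.h>
      #include <rocksdb/persistent_cache.h>
      #include <rocksdb/table.h>
      
      rocksdb::Status SetUpReadCache(rocksdb::BlockBasedTableOptions* bbto) {
        std::shared_ptr<rocksdb::PersistentCache> pcache;
        rocksdb::Status s = rocksdb::NewPersistentCache(
            rocksdb::Env::Default(), "/tmp/read_cache" /* path */,
            1024 * 1024 * 1024 /* size */, nullptr /* logger */,
            false /* optimized_for_nvm */, &pcache);
        if (s.ok()) {
          bbto->persistent_cache = pcache;  // backs the block cache with local storage
        }
        return s;
      }
      ```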
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932
      
      Test Plan:
      1. Run the related test cases in "db_test2" and "persistent_cache_test" on Windows and see them pass.
      2. Manually run db_bench.exe with "read_cache_path" and verify.
      
      Reviewed By: riversand963
      
      Differential Revision: D21911608
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2
  9. June 13, 2020 (1 commit)
    • Turn HarnessTest in table_test into a parameterized test (#6974) · bacd6edc
      Authored by Levi Tamasi
      Summary:
      `HarnessTest` in `table_test.cc` currently tests many parameter
      combinations sequentially in a loop. This is problematic from
      a testing perspective, since if the test fails, we have no way of
      knowing how many/which combinations have failed. It can also cause timeouts on
      our test system due to the sheer number of combinations tested.
      (Specifically, the parallel compression threads parameter added by
      https://github.com/facebook/rocksdb/pull/6262 seems to have been the last straw.)
      There is some DIY code there that splits the load among eight test cases
      but that does not appear to be sufficient anymore.
      
      Instead, the patch turns `HarnessTest` into a parameterized test, so all the
      parameter combinations can be tested separately and potentially
      concurrently. It also cleans up the tests a little, fixes
      `RandomizedLongDB`, which did not get updated when the parallel
      compression threads parameter was added, and turns `FooterTests` into a
      standalone test case (since it does not actually need a fixture class).
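      
      The general gtest pattern the patch applies, in a generic sketch (not the actual table_test code):
      
      ```
      #include <gtest/gtest.h>
      
      struct Args {
        int restart_interval;
        bool parallel_compression;
      };
      
      class HarnessLikeTest : public ::testing::TestWithParam<Args> {};
      
      TEST_P(HarnessLikeTest, RoundTrip) {
        const Args args = GetParam();
        // ... build a table with args and verify reads; each combination now
        // runs, reports, and fails independently ...
        (void)args;
      }
      
      INSTANTIATE_TEST_CASE_P(AllCombos, HarnessLikeTest,
                              ::testing::Values(Args{16, false}, Args{16, true},
                                                Args{1, false}));
      ```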
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6974
      
      Test Plan: `make check`
      
      Reviewed By: siying
      
      Differential Revision: D22029572
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 51baea670771c33928f2eb3902bd69dcf540aa41
  10. June 11, 2020 (1 commit)
    • save a key comparison in block seeks (#6646) · e6be168a
      Authored by Andrew Kryczka
      Summary:
      This saves up to two key comparisons in block seeks. The first key
      comparison saved is a redundant key comparison against the restart key
      where the linear scan starts. This comparison is saved in all cases
      except when the found key is in the first restart interval. The
      second key comparison saved is a redundant key comparison against the
      restart key where the linear scan ends. This is only saved in cases
      where all keys in the restart interval are less than the target
      (probability roughly `1/restart_interval`).
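      
      An illustrative sketch of the first saved comparison (simplified and self-contained; not RocksDB's actual BlockIter code): the binary search over restart keys already compared the restart key against the target, so the linear scan can reuse that result instead of repeating it.
      
      ```
      #include <cstddef>
      #include <string>
      #include <vector>
      
      struct BinarySeekResult {
        size_t restart_index;    // largest restart key <= target
        bool key_equals_target;  // result of the comparison already done
      };
      
      // Find the first key >= target within [bs.restart_index, interval_end).
      size_t LinearScan(const std::vector<std::string>& keys,
                        const BinarySeekResult& bs, size_t interval_end,
                        const std::string& target) {
        if (bs.key_equals_target) {
          return bs.restart_index;  // reuse the binary search's comparison
        }
        // The restart key is known to be < target, so start one past it.
        for (size_t i = bs.restart_index + 1; i < interval_end; ++i) {
          if (keys[i] >= target) return i;
        }
        return interval_end;  // whole interval < target
      }
      ```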
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646
      
      Test Plan:
      ran a benchmark with mostly default settings and counted key comparisons
      
      before: `user_key_comparison_count = 19399529`
      after: `user_key_comparison_count = 18431498`
      
      setup command:
      
      ```
      $ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000
      ```
      
      benchmark command:
      
      ```
      $ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D20849707
      
      Pulled By: ajkr
      
      fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac
  11. June 10, 2020 (1 commit)
  12. June 08, 2020 (1 commit)
    • Remove unnecessary inclusion of version_edit.h in env (#6952) · 3020df9d
      Authored by Yanqin Jin
      Summary:
      In `db_options.cc`, we should avoid including header files from the `db` directory to avoid introducing unnecessary dependencies. The reason `version_edit.h` has been included in `db_options.cc` is that we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can put these two constants as `constexpr` in the public header `file_checksum.h`.
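      
      A sketch of the shape of the change (the constant values here are placeholders; the real definitions live in include/rocksdb/file_checksum.h):
      
      ```
      // file_checksum.h (public header): constexpr constants carry no
      // dependency on db/version_edit.h for the code that needs them.
      constexpr char kUnknownChecksum[] = "";                  // placeholder value
      constexpr char kUnknownChecksumFuncName[] = "Unknown";   // placeholder value
      ```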
      
      Test plan (devserver):
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D21925341
      
      Pulled By: riversand963
      
      fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2
  13. June 06, 2020 (2 commits)
    • Clean up the dead code (#6946) · f941adef
      Authored by Zhichao Cao
      Summary:
      Remove the dead code in table_test.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6946
      
      Test Plan: run table_test
      
      Reviewed By: riversand963
      
      Differential Revision: D21913563
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: c0aa9f3b95dfe87dd7fb2cd4823784f08cb3ddd3
    • Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909) · 98b0cbea
      Authored by anand76
      Summary:
      The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller.
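      
      The general pattern of the fix, as a sketch (not the exact RocksDB code): when ```Valid()``` turns ```false```, consult ```status()``` to distinguish end-of-data from a read error.
      
      ```
      #include <rocksdb/iterator.h>
      
      rocksdb::Status VerifyAll(rocksdb::Iterator* index_iter) {
        for (index_iter->SeekToFirst(); index_iter->Valid(); index_iter->Next()) {
          // ... verify the checksum of the block this index entry points to ...
        }
        // Valid() == false can mean end-of-index or an error (e.g. a checksum
        // mismatch in a 2nd level index block); return it to the caller.
        return index_iter->status();
      }
      ```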
      
      Tests:
      Add a test in block_based_table_reader_test.cc.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909
      
      Reviewed By: pdillinger
      
      Differential Revision: D21833922
      
      Pulled By: anand1976
      
      fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae
  14. June 05, 2020 (1 commit)
  15. June 04, 2020 (2 commits)
    • Revert "Update googletest from 1.8.1 to 1.10.0 (#6808)" (#6923) · afa35188
      Authored by sdong
      Summary:
      This reverts commit 8d87e9ce.
      
      Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate with some other systems. Revert it for now.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923
      
      Reviewed By: pdillinger
      
      Differential Revision: D21864799
      
      fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc
    • Fix handling of too-small filter partition size (#6905) · 9360776c
      Authored by Peter Dillinger
      Summary:
      Because ARM and some other platforms have a larger cache line
      size, they have a larger minimum filter size, which causes the recently
      added PartitionedMultiGet test in db_bloom_filter_test to fail on those
      platforms. The code would actually end up using larger partitions,
      because keys_per_partition_ would be 0 and would never equal the number
      of keys added.
      
      The code now attempts to get as close as possible to the small target
      size, while fully utilizing that filter size, if the target partition
      size is smaller than the minimum filter size.
      
      Also updated the test to break more uniformly across platforms.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6905
      
      Test Plan: updated test, tested on ARM
      
      Reviewed By: anand1976
      
      Differential Revision: D21840639
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 11684b6d35f43d2e98b85ddb2c8dcfd59d670817
  16. June 03, 2020 (3 commits)
    • Fix potential overflow of unsigned type in for loop (#6902) · 2adb7e37
      Authored by Zhichao Cao
      Summary:
      x.size() - 1 or y - 1 can wrap around to an extremely large value when x.size() or y is 0, since they are unsigned types. The end condition of i in the for loop will then be extremely large, potentially causing a segfault. Fix them.
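      
      The failure mode in miniature (a generic sketch, not the patched code):
      
      ```
      #include <cstddef>
      #include <vector>
      
      void Process(const std::vector<int>& x) {
        // BAD: with unsigned size_t, x.size() - 1 wraps to SIZE_MAX when x is
        // empty, so this loop would read far past the end:
        //   for (size_t i = 0; i < x.size() - 1; ++i) { ... }
        // Writing the bound as i + 1 < x.size() cannot wrap:
        for (size_t i = 0; i + 1 < x.size(); ++i) {
          // compare x[i] and x[i + 1] safely
        }
      }
      ```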
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902
      
      Test Plan: pass make asan_check
      
      Reviewed By: ajkr
      
      Differential Revision: D21843767
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1
    • For ApproximateSizes, pro-rate table metadata size over data blocks (#6784) · 14eca6bf
      Authored by Peter Dillinger
      Summary:
      The implementation of GetApproximateSizes was inconsistent in
      its treatment of the size of non-data blocks of SST files, sometimes
      including them and sometimes not. This was at its worst when a large
      portion of a table file was used by filters and a query covered a small
      range that crossed a table boundary: the size estimate would include the
      large filter size.
      
      It's conceivable that someone might want only to know the size in terms
      of data blocks, but I believe that's unlikely enough to ignore for now.
      Similarly, there's no evidence the internal function ApproximateOffsetOf
      is used for anything other than a one-sided ApproximateSize, so I intend
      to refactor to remove redundancy in a follow-up commit.
      
      So to fix this, GetApproximateSizes (and implementation details
      ApproximateSize and ApproximateOffsetOf) now consistently include in
      their returned sizes a portion of the table file metadata (incl. filters
      and indexes) proportional to the size of the data blocks in range. In
      other words, if a key range covers data blocks that are X% by size of all
      the table's data blocks, the returned approximate size is X% of the total
      file size. It would technically be more accurate to attribute metadata
      based on the number of keys, but that's not computationally efficient with
      the data available and rarely a meaningful difference.
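      
      The accounting rule in miniature (a sketch of the idea, not the actual implementation):
      
      ```
      #include <cstdint>
      
      uint64_t ApproximateSizeWithMetadata(uint64_t data_bytes_in_range,
                                           uint64_t total_data_bytes,
                                           uint64_t total_file_bytes) {
        if (total_data_bytes == 0) return 0;
        double fraction =
            static_cast<double>(data_bytes_in_range) / total_data_bytes;
        // A range covering X% of data-block bytes is charged X% of the whole
        // file, which folds in filters and indexes proportionally.
        return static_cast<uint64_t>(fraction * total_file_bytes);
      }
      ```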
      
      Also includes miscellaneous comment improvements / clarifications.
      
      Also included is a new approximatesizerandom benchmark for db_bench.
      No significant performance difference was seen with this change, whether at ~700 ops/sec with cache_index_and_filter_blocks and a small cache, or at ~150k ops/sec without cache_index_and_filter_blocks.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
      
      Test Plan:
      Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
      Old code running new test...
      
          [ RUN      ] DBTest.ApproximateSizesFilesWithErrorMargin
          db/db_test.cc:1562: Failure
          Expected: (size) <= (11 * 100), actual: 9478 vs 1100
      
      Other tests updated to reflect consistent accounting of metadata.
      
      Reviewed By: siying
      
      Differential Revision: D21334706
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
    • Reduce dependency on gtest dependency in release code (#6907) · 298b00a3
      Authored by sdong
      Summary:
      Release code now depends on gtest, indirectly through including "test_util/testharness.h". This creates multiple problems. One important cause is the definition of IGNORE_STATUS_IF_ERROR() in test_util/testharness.h; move it to sync_point.h instead.
      Note that utilities/cassandra/format.h still depends on "test_util/testharness.h". This will be resolved in a separate diff.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6907
      
      Test Plan: Run all existing tests.
      
      Reviewed By: ajkr
      
      Differential Revision: D21829884
      
      fbshipit-source-id: 9253c19ffde2936f3ae68998210f8e54f645a6e6
  17. June 02, 2020 (3 commits)
  18. May 29, 2020 (2 commits)
    • avoid `IterKey::UpdateInternalKey()` in `BlockIter` (#6843) · c5abf78b
      Authored by Andrew Kryczka
      Summary:
      `IterKey::UpdateInternalKey()` is an error-prone API as it's
      incompatible with `IterKey::TrimAppend()`, which is used for
      decoding delta-encoded internal keys. This PR stops using it in
      `BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s
      buffer when needed. The logic for safely getting a Slice with global
      seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`.
      `BinarySeek()` is also migrated to use this API (previously it ignored
      global seqno entirely).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843
      
      Test Plan:
      benchmark setup -- single file DBs, in-memory, no compression. "normal_db"
      created by regular flush; "ingestion_db" created by ingesting a file. Both
      DBs have same contents.
      
      ```
      $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000
      $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}')
      $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/
      ```
      
      benchmark run command:
      ```
      TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000
      ```
      
      results:
      
      | DB | code | throughput |
      |---|---|---|
      | normal_db | master |  267.9 |
      | normal_db   |    PR6843 | 254.2 (-5.1%) |
      | ingestion_db |   master |  259.6 |
      | ingestion_db |   PR6843 | 250.5 (-3.5%) |
      
      Reviewed By: pdillinger
      
      Differential Revision: D21562604
      
      Pulled By: ajkr
      
      fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264
    • Add timestamp to delete (#6253) · 961c7590
      Authored by Yanqin Jin
      Summary:
      Preliminary user-timestamp support for delete.
      
      If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.
      
      Implementation
      A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp.
      The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, if we were to reuse `kTypeDeletion`, we might drop a tombstone with a more recent timestamp, causing deleted keys to re-appear.
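      
      A hedged usage sketch matching the summary (it assumes the column family was opened with a timestamp-aware comparator; `write_options.timestamp` as a `Slice` pointer follows this PR's interface):
      
      ```
      #include <rocksdb/db.h>
      
      rocksdb::Status DeleteAt(rocksdb::DB* db, const rocksdb::Slice& key,
                               const rocksdb::Slice& ts /* encodes e.g. 101 */) {
        rocksdb::WriteOptions write_options;
        write_options.timestamp = &ts;  // must be higher than the existing ts=100
        return db->Delete(write_options, key);
      }
      ```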
      
      Test plan (dev server)
      ```
      make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253
      
      Reviewed By: ltamasi
      
      Differential Revision: D20995328
      
      Pulled By: riversand963
      
      fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
  19. May 28, 2020 (1 commit)
    • Allow MultiGet users to limit cumulative value size (#6826) · bcefc59e
      Authored by Akanksha Mahajan
      Summary:
      1. Add a value_size in read options which limits the cumulative value size of keys read in batches. Once the size exceeds read_options.value_size, all the remaining keys are returned with status Aborted without further fetching any key (see the sketch after this list).
      2. Add a unit test case MultiGetBatchedValueSizeSimple that reads keys from memory and sst files.
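      
      A hedged usage sketch (the `value_size` field name is taken from this summary, not verified against the header):
      
      ```
      #include <string>
      #include <vector>
      #include <rocksdb/db.h>
      
      void BoundedMultiGet(rocksdb::DB* db,
                           const std::vector<rocksdb::Slice>& keys) {
        rocksdb::ReadOptions ro;
        ro.value_size = 1024 * 1024;  // stop fetching after ~1MB of values
        std::vector<std::string> values;
        std::vector<rocksdb::Status> statuses = db->MultiGet(ro, keys, &values);
        // statuses[i].IsAborted() marks keys skipped once the limit was hit
      }
      ```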
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6826
      
      Test Plan:
      1. make check -j64
      2. Add a new unit test case
      
      Reviewed By: anand1976
      
      Differential Revision: D21471483
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: dea51b8e76d5d1df38ece8cdb29933b1d798b900
  20. May 27, 2020 (1 commit)
  21. May 22, 2020 (2 commits)
  22. May 21, 2020 (2 commits)
    • Clean up some code related to file checksums (#6861) · c7aedf1b
      Authored by Peter Dillinger
      Summary:
      * Add missing unit test for schema stability of FileChecksumGenCrc32c
        (previously was only comparing to itself)
      * A lot of clarifying comments
      * Add some assertions for preconditions
      * Rename WritableFileWriter::CalculateFileChecksum -> UpdateFileChecksum
      * Simplify FileChecksumGenCrc32c with shared functions
      * Implement EndianSwapValue to replace unused EndianTransform
      
      And incidentally, since I had trouble with the 'make check-format' GitHub action disagreeing with a local run,
      * Output full diagnostic information when 'make check-format' fails in CI
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6861
      
      Test Plan: new unit test passes before & after other changes
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D21667115
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 6a99970f87605aa024fa540c78cd519ff322c3e6
    • Generate file checksum in SstFileWriter (#6859) · 545e14b5
      Authored by Zhichao Cao
      Summary:
      If `Options.file_checksum_gen_factory` is set, RocksDB generates the file checksum during flush and compaction based on the checksum generator created by the factory, and stores the checksum and function name in vstorage and the Manifest.
      
      This PR enables file checksum generation in SstFileWriter and stores the checksum and checksum function name in the ExternalSstFileInfo, so that applications can use them for other purposes, for example, ingesting the file checksum with files in IngestExternalFile().
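      
      A hedged usage sketch (the built-in CRC32c factory is declared in rocksdb/file_checksum.h; this is not code from the PR):
      
      ```
      #include <rocksdb/env.h>
      #include <rocksdb/file_checksum.h>
      #include <rocksdb/options.h>
      #include <rocksdb/sst_file_writer.h>
      
      rocksdb::Status WriteFileWithChecksum(rocksdb::Options options) {
        options.file_checksum_gen_factory =
            rocksdb::GetFileChecksumGenCrc32cFactory();
        rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
        rocksdb::Status s = writer.Open("/tmp/example.sst");
        if (!s.ok()) return s;
        s = writer.Put("key", "value");
        if (!s.ok()) return s;
        rocksdb::ExternalSstFileInfo info;
        s = writer.Finish(&info);
        // info.file_checksum / info.file_checksum_func_name (fields added here)
        return s;
      }
      ```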
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6859
      
      Test Plan: Add a unit test; passes make asan_check.
      
      Reviewed By: ajkr
      
      Differential Revision: D21656247
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 78a3570c76031d8832e3d2de3d6c79cdf2b675d0
  23. May 15, 2020 (1 commit)
    • Enable IO Uring in MultiGet in direct IO mode (#6815) · 91b75532
      Authored by Cheng Chang
      Summary:
      Currently, in direct IO mode, `MultiGet` retrieves the data blocks one by one instead of in parallel; see `BlockBasedTable::RetrieveMultipleBlocks`.
      
      Since direct IO is supported in `RandomAccessFileReader::MultiRead` in https://github.com/facebook/rocksdb/pull/6446, this PR applies `MultiRead` to `MultiGet` so that the data blocks can be retrieved in parallel.
      
      Also, in direct IO mode, when data blocks are compressed and need to be uncompressed, this PR allocates only one continuous aligned buffer to hold the data blocks and then uncompresses the blocks directly to insert into the block cache; there are no longer intermediate copies to scratch buffers.
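      
      A sketch of the FileSystem-level building block this relies on (FSReadRequest/MultiRead per rocksdb/file_system.h; not the PR's actual code):
      
      ```
      #include <rocksdb/file_system.h>
      
      rocksdb::IOStatus ReadTwoBlocks(rocksdb::FSRandomAccessFile* file,
                                      char* buf0, char* buf1) {
        rocksdb::FSReadRequest reqs[2];
        reqs[0].offset = 0;
        reqs[0].len = 4096;
        reqs[0].scratch = buf0;
        reqs[1].offset = 8192;
        reqs[1].len = 4096;
        reqs[1].scratch = buf1;
        // A direct-IO capable FileSystem may service the batch in parallel.
        rocksdb::IOStatus s =
            file->MultiRead(reqs, 2, rocksdb::IOOptions(), nullptr /* dbg */);
        // Each reqs[i].status / reqs[i].result reports its own outcome.
        return s;
      }
      ```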
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6815
      
      Test Plan:
      1. Added a new unit test `BlockBasedTableReaderTest::MultiGet`.
      2. Existing unit tests and stress tests contain tests against `MultiGet` in direct IO mode.
      
      Reviewed By: anand1976
      
      Differential Revision: D21426347
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: b8446ae0e74152444ef9111e97f8e402ac31b24f
  24. May 13, 2020 (2 commits)
    • sst_dump to reduce number of file reads (#6836) · 4a4b8a13
      Authored by sdong
      Summary:
      sst_dump can issue many file reads from the file system. This doesn't work well with file systems without an OS cache, especially remote file systems. In order to mitigate this problem, several improvements are made:
      1. --readahead_size is added, so that users can specify a readahead size when scanning the data.
      2. Force a 512KB tail readahead, which avoids three I/Os for the footer, metaindex and property blocks, and hopefully covers the index and filter blocks too.
      3. Consolidate sst_dump's I/Os before opening the file for read; use the same file prefetch buffer.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6836
      
      Test Plan: Add a test that covers this new feature.
      
      Reviewed By: pdillinger
      
      Differential Revision: D21516607
      
      fbshipit-source-id: 3ae43526286f67b2f4a5bdedfbc92719d579b87e
    • Add tests for compression failure in BlockBasedTableBuilder (#6709) · c384c08a
      Authored by Ziyue Yang
      Summary:
      Currently there is no check for whether BlockBasedTableBuilder will expose
      the compression error status if compression fails during table building.
      This commit adds fake faulting compressors and a unit test to cover such
      cases.
      
      This check finds 5 bugs, and this commit also fixes them:
      
      1. Not handling compression failure well in
         BlockBasedTableBuilder::BGWorkWriteRawBlock.
      2. verify_compression failing in BlockBasedTableBuilder when used with ZSTD.
      3. Wrongly passing the same reference of block contents to
         BlockBasedTableBuilder::CompressAndVerifyBlock in parallel compression.
      4. Wrongly setting block_rep->first_key_in_next_block to nullptr in
         BlockBasedTableBuilder::EnterUnbuffered when there are still incoming data
         blocks.
      5. Not maintaining variables for compression ratio estimation and first_block
         in BlockBasedTableBuilder::EnterUnbuffered.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6709
      
      Reviewed By: ajkr
      
      Differential Revision: D21236254
      
      fbshipit-source-id: 101f6e62b2bac2b7be72be198adf93cd32a1ff46
  25. May 08, 2020 (1 commit)
    • Fix false NotFound from batched MultiGet with kHashSearch (#6821) · b27a1448
      Authored by Peter Dillinger
      Summary:
      The error was assigning KeyContext::s to NotFound status in a
      table reader for a "not found in this table" case, which skips searching
      in later tables, as only a delete should. (The hash search index iterator
      is the only one that can return status NotFound even if Valid() == false.)
      
      This was detected by intermittent failure in
      MultiThreadedDBTest.MultiThreaded/5, a kHashSearch configuration.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6821
      
      Test Plan: modified existing unit test to reproduce problem
      
      Reviewed By: anand1976
      
      Differential Revision: D21450469
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 7478003684d637dbd491cdac81468041a791be2c
  26. May 06, 2020 (1 commit)
    • Add OptionTypeInfo::Enum and related methods (#6423) · 394f2bbd
      Authored by mrambacher
      Summary:
      Add methods and constructors for handling enums to the OptionTypeInfo.  This change allows enums to be converted/compared without adding a special "type" to the OptionType.
      
      This change addresses several issues (a self-contained sketch of the pattern follows the list):
      - It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods)
      - It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes
      - It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values
      - It reduces the number of types and cases that need to be handled in the various OptionType methods
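      
      The registration pattern, illustrated in a self-contained sketch (this shows the general idea of a file-local string<->enum map, not RocksDB's actual OptionTypeInfo API):
      
      ```
      #include <string>
      #include <unordered_map>
      
      enum class MyHint { kNone, kFast, kBest };
      
      // String <-> enum map, static and local to the file that owns the enum.
      static const std::unordered_map<std::string, MyHint> kMyHintMap = {
          {"none", MyHint::kNone}, {"fast", MyHint::kFast}, {"best", MyHint::kBest}};
      
      static bool ParseMyHint(const std::string& s, MyHint* out) {
        auto it = kMyHintMap.find(s);
        if (it == kMyHintMap.end()) return false;
        *out = it->second;
        return true;
      }
      
      static std::string SerializeMyHint(MyHint v) {
        for (const auto& kv : kMyHintMap) {
          if (kv.second == v) return kv.first;  // reverse lookup for output
        }
        return "unknown";
      }
      ```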
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423
      
      Reviewed By: siying
      
      Differential Revision: D21408713
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626
  27. May 01, 2020 (2 commits)
    • Disallow BlockBasedTableBuilder to set status from non-OK (#6776) · 079e50d2
      Authored by sdong
      Summary:
      There is no systematic mechanism to prevent BlockBasedTableBuilder's status from being set from non-OK back to OK. Adding a mechanism to enforce this will help us prevent failures in the future.
      
      The solution is to only make it possible to set the status code if the status code to set is not OK.
      
      Since the status code passed to CompressAndVerifyBlock() is changed, a mini refactoring is done too, so that the output arguments are changed from references to pointers, per the Google C++ Style Guide.
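      
      The mechanism in miniature (a sketch of the idea, not the actual builder code):
      
      ```
      #include <utility>
      #include <rocksdb/status.h>
      
      // Status may move from OK to an error, never back, so a late failure
      // can never be masked by a subsequent OK.
      void SetStatus(rocksdb::Status s, rocksdb::Status* current) {
        if (!s.ok() && current->ok()) {
          *current = std::move(s);  // first error wins
        }
        // Setting OK is a no-op; overwriting an error with OK is impossible.
      }
      ```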
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6776
      
      Test Plan: Run all existing tests.
      
      Reviewed By: pdillinger
      
      Differential Revision: D21314382
      
      fbshipit-source-id: 27000c10f1e4c121661e026548d6882066409375
    • Pass a timeout to FileSystem for random reads (#6751) · ab13d43e
      Authored by anand76
      Summary:
      Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.
      
      For now, TableReader creation, which might do file opens and reads, is not covered. It will be implemented in another PR.
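      
      A hedged sketch of the deadline-to-timeout conversion described above (not the PR's actual code):
      
      ```
      #include <chrono>
      #include <rocksdb/env.h>
      
      // Returns false if the absolute deadline has already passed, in which
      // case the caller should return Status::TimedOut() without reading.
      bool SetTimeoutFromDeadline(rocksdb::Env* env,
                                  std::chrono::microseconds deadline,
                                  std::chrono::microseconds* io_timeout) {
        auto now = std::chrono::microseconds(env->NowMicros());
        if (now >= deadline) return false;
        *io_timeout = deadline - now;  // remaining budget for this IO
        return true;
      }
      ```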
      
      Tests:
      Update existing unit tests to verify the correct timeout value is being passed
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
      
      Reviewed By: riversand963
      
      Differential Revision: D21285631
      
      Pulled By: anand1976
      
      fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567