1. 03 March 2021, 1 commit
    • Break down the amount of data written during flushes/compactions per file type (#8013) · a46f080c
      Levi Tamasi committed
      Summary:
      The patch breaks down the "bytes written" (as well as the "number of output files")
      compaction statistics into two, so the values are logged separately for table files
      and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
      files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
      This will also come in handy for fixing the write amplification statistics, which currently
      do not consider the amount of data read from blob files during compaction. (This will
      be fixed by an upcoming patch.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013
      
      Test Plan: Ran `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D26742156
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
  2. 02 March 2021, 2 commits
    • Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003) · f1961297
      Akanksha Mahajan committed
      Summary:
      The checkpointing logic supports passing file-level checksums
      to the copy_file_cb callback function, which is used by the backup code
      to detect corruption during file copies.
      However, this is currently implemented only for table files.
      
      This PR extends the checksum retrieval to blob files as well.
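
      For context, a minimal sketch of the workflow this touches, assuming full-file checksums are enabled via the built-in CRC32C factory (the paths here are placeholders, not part of the patch):

      ```
      #include "rocksdb/db.h"
      #include "rocksdb/file_checksum.h"
      #include "rocksdb/utilities/checkpoint.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Record full-file checksums (for table files and, with this patch,
        // blob files) in the MANIFEST.
        options.file_checksum_gen_factory =
            rocksdb::GetFileChecksumGenCrc32cFactory();

        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blobdb", &db);

        // Checkpointing can now retrieve blob file checksums from the
        // MANIFEST and pass them to the copy_file_cb callback.
        rocksdb::Checkpoint* checkpoint = nullptr;
        if (s.ok()) s = rocksdb::Checkpoint::Create(db, &checkpoint);
        if (s.ok()) s = checkpoint->CreateCheckpoint("/tmp/blobdb_checkpoint");

        delete checkpoint;
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```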
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003
      
      Test Plan: Add new unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D26680701
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
    • Enable compact filter for blob in dbstress and dbbench (#8011) · 1f11d07f
      Yanqin Jin committed
      Summary:
      As title.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8011
      
      Test Plan:
      ```
      ./db_bench -enable_blob_files=1 -use_keep_filter=1 -disable_auto_compactions=1
      ./db_stress -enable_blob_files=1 -enable_compaction_filter=1 -acquire_snapshot_one_in=0 -compact_range_one_in=0 -iterpercent=0 -test_batches_snapshots=0 -readpercent=10 -prefixpercent=20 -writepercent=55 -delpercent=15 -continuous_verification_interval=0
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D26736061
      
      Pulled By: riversand963
      
      fbshipit-source-id: 1c7834903c28431ce23324c4f259ed71255614e2
  3. 27 February 2021, 2 commits
    • Still use SystemClock* instead of shared_ptr in StepPerfTimer (#8006) · 9fdc9fbe
      Yanqin Jin committed
      Summary:
      This is likely a temp fix before we figure out a better way.
      
      PerfStepTimer is used intensively in certain benchmarking/testing scenarios. https://github.com/facebook/rocksdb/issues/7858 stores a `shared_ptr` to the system clock in `PerfStepTimer`, and a new `shared_ptr` copy is created each time a `PerfStepTimer` object is constructed. The atomic reference-count operations in `shared_ptr` may add overhead in CPU cycles. Therefore, we change it back to a raw `SystemClock*` for now.
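
      As an illustration of the overhead in question (a standalone sketch, not the actual PerfStepTimer code): copying a `shared_ptr` touches an atomic reference count on construction and destruction, while a raw pointer does not.

      ```
      #include <cstdint>
      #include <memory>

      struct SystemClock {
        virtual uint64_t NowNanos() = 0;
        virtual ~SystemClock() = default;
      };

      // Copying the shared_ptr argument bumps an atomic refcount; destroying
      // the timer decrements it. On hot paths that construct timers
      // constantly, these atomic operations add measurable CPU cycles.
      struct SharedPtrTimer {
        explicit SharedPtrTimer(std::shared_ptr<SystemClock> c)
            : clock_(std::move(c)) {}
        std::shared_ptr<SystemClock> clock_;
      };

      // Storing a raw pointer involves no atomic traffic; the caller must
      // simply guarantee the clock outlives the timer.
      struct RawPtrTimer {
        explicit RawPtrTimer(SystemClock* c) : clock_(c) {}
        SystemClock* clock_;
      };
      ```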
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8006
      
      Test Plan: make check
      
      Reviewed By: pdillinger
      
      Differential Revision: D26703560
      
      Pulled By: riversand963
      
      fbshipit-source-id: 519d0769b28da2334bea7d86c848fcc26ee8a17f
    • Refine Ribbon configuration, improve testing, add Homogeneous (#7879) · a8b3b9a2
      Peter Dillinger committed
      Summary:
      This change only affects non-schema-critical aspects of the production candidate Ribbon filter. Specifically, it refines choice of internal configuration parameters based on inputs. The changes are minor enough that the schema tests in bloom_test, some of which depend on this, are unaffected. There are also some minor optimizations and refactorings.
      
      This would be a schema change for "smash" Ribbon, to fix some known issues with small filters, but "smash" Ribbon is not accessible in public APIs. Unit test CompactnessAndBacktrackAndFpRate updated to test small and medium-large filters. Run with --thoroughness=100 or so for much better detection power (not appropriate for continuous regression testing).
      
      Homogeneous Ribbon:
      This change adds internally a Ribbon filter variant we call Homogeneous Ribbon, in collaboration with Stefan Walzer. The expected "result" value for every key is zero, instead of being computed from a hash. Entropy for queries not to be false positives comes from free variables ("overhead") in the solution structure, which are populated pseudorandomly. Construction is slightly faster for not tracking result values, and never fails. Instead, FP rate can jump up whenever and wherever entries are packed too tightly. For small structures, we can choose overhead to make this FP rate jump unlikely, as seen in updated unit test CompactnessAndBacktrackAndFpRate.
      
      Unlike standard Ribbon, Homogeneous Ribbon seems to scale to arbitrary number of keys when accepting an FP rate penalty for small pockets of high FP rate in the structure. For example, 64-bit ribbon with 8 solution columns and 10% allocated space overhead for slots seems to achieve about 10.5% space overhead vs. information-theoretic minimum based on its observed FP rate with expected pockets of degradation. (FP rate is close to 1/256.) If targeting a higher FP rate with fewer solution columns, Homogeneous Ribbon can be even more space efficient, because the penalty from degradation is relatively smaller. If targeting a lower FP rate, Homogeneous Ribbon is less space efficient, as more allocated overhead is needed to keep the FP rate impact of degradation relatively under control. The new OptimizeHomogAtScale tool in ribbon_test helps to find these optimal allocation overheads for different numbers of solution columns and Ribbon widths, with 128-bit Ribbon apparently cutting space overheads in half vs. 64-bit.
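
      To make the space-overhead arithmetic concrete (illustrative only; the numbers line up if the observed FP rate sits slightly above 1/256, e.g. around 1/250, reflecting the expected pockets of degradation):

      ```
      bits/key            = 8 solution columns * 1.10 slot overhead = 8.8
      theoretical minimum = log2(1/FP) ~= log2(250) ~= 7.96 bits/key
      overhead vs. min    = 8.8 / 7.96 - 1 ~= 10.5%
      ```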
      
      Other misc item specifics:
      * Ribbon APIs in util/ribbon_config.h now provide configuration data for not just 5% construction failure rate (95% success), but also 50% and 0.1%.
        * Note that the Ribbon structure does not exhibit "threshold" behavior as standard Xor filter does, so there is a roughly fixed space penalty to cut construction failure rate in half. Thus, there isn't really an "almost sure" setting.
        * Although we can extrapolate settings for large filters, we don't have a good formula for configuring smaller filters (< 2^17 slots or so), and efforts to summarize with a formula have failed. Thus, small data is hard-coded from updated FindOccupancy tool.
      * Enhances ApproximateNumEntries for public API Ribbon using more precise data (new API GetNumToAdd), thus a more accurate but not perfect reversal of CalculateSpace. (bloom_test updated to expect the greater precision)
      * Move EndianSwapValue from coding.h to coding_lean.h to keep Ribbon code easily transferable from RocksDB
      * Add some missing 'const' to member functions
      * Small optimization to 128-bit BitParity
      * Small refactoring of BandingStorage in ribbon_alg.h to support Homogeneous Ribbon
      * CompactnessAndBacktrackAndFpRate now has an "expand" test: on construction failure, a possible alternative to re-seeding hash functions is simply to increase the number of slots (allocated space overhead) and try again with essentially the same hash values. (Start locations will be different roundings of the same scaled hash values--because fastrange not mod.) This seems to be as effective or more effective than re-seeding, as long as we increase the number of slots (m) by roughly m += m/w where w is the Ribbon width. This way, there is effectively an expansion by one slot for each ribbon-width window in the banding. (This approach assumes that getting "bad data" from your hash function is as unlikely as it naturally should be, e.g. no adversary.)
      * 32-bit and 16-bit Ribbon configurations are added to ribbon_test for understanding their behavior, e.g. with FindOccupancy. They are not considered useful at this time and not tested with CompactnessAndBacktrackAndFpRate.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7879
      
      Test Plan: unit test updates included
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26371245
      
      Pulled By: pdillinger
      
      fbshipit-source-id: da6600d90a3785b99ad17a88b2a3027710b4ea3a
  4. 26 February 2021, 2 commits
    • Remove unused/incorrect fwd declaration (#8002) · c370d8aa
      Yanqin Jin committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8002
      
      Reviewed By: anand1976
      
      Differential Revision: D26659354
      
      Pulled By: riversand963
      
      fbshipit-source-id: 6b464dbea9fd8240ead8cc5af393f0b78e8f9dd1
    • Compaction filter support for (new) BlobDB (#7974) · cef4a6c4
      Yanqin Jin committed
      Summary:
      Allow applications to implement a custom compaction filter and pass it to BlobDB.
      
      The compaction filter's custom logic can operate on blobs.
      To do so, the application needs to subclass the `CompactionFilter` abstract class and implement the `FilterV2()` method.
      Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if the application's custom logic relies solely
      on the key to make a decision without reading the blob, thus saving extra I/O. Examples can be found in
      db/blob/db_blob_compaction_test.cc.
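
      A minimal sketch of such a filter (the "tmp_" key-prefix policy is hypothetical; `FilterV2()` is the standard `CompactionFilter` entry point named above):

      ```
      #include <string>
      #include "rocksdb/compaction_filter.h"
      #include "rocksdb/slice.h"

      class TempKeyFilter : public rocksdb::CompactionFilter {
       public:
        // Key-only decision: entries under the (hypothetical) "tmp_" prefix
        // are dropped during compaction without the blob value being read.
        Decision FilterV2(int /*level*/, const rocksdb::Slice& key,
                          ValueType /*value_type*/,
                          const rocksdb::Slice& /*existing_value*/,
                          std::string* /*new_value*/,
                          std::string* /*skip_until*/) const override {
          return key.starts_with("tmp_") ? Decision::kRemove : Decision::kKeep;
        }
        const char* Name() const override { return "TempKeyFilter"; }
      };
      // Usage: point options.compaction_filter at an instance that outlives
      // the DB.
      ```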
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D26509280
      
      Pulled By: riversand963
      
      fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
  5. 25 February 2021, 1 commit
  6. 24 February 2021, 5 commits
    • Append all characters not captured by xsputn() in overflow() function (#7991) · b085ee13
      xinyuliu committed
      Summary:
      In the adapter class `WritableFileStringStreamAdapter`, which wraps `WritableFile` for use as a `std::ostream`, previously only `std::endl` was treated as a special case, because `endl` is written by `os.put()` directly without going through `xsputn()`. `os.put()` calls `sputc()`, and if we check the internal implementation of `sputc()`, we see it is
      ```
      int_type __CLR_OR_THIS_CALL sputc(_Elem _Ch) {  // put a character
          return 0 < _Pnavail() ? _Traits::to_int_type(*_Pninc() = _Ch) : overflow(_Traits::to_int_type(_Ch));
      }
      ```
      As we explicitly disabled buffering, `_Pnavail()` is always 0. Thus every write not captured by `xsputn()` becomes an overflow.
      
      When I ran tests on Windows, I found that not only does `std::endl` fall into this case; writing an unsigned long long also calls `os.put()`, followed by `sputc()`, and eventually `overflow()`. Therefore, instead of only checking for `std::endl`, we should try to append any other character as well, unless the append operation fails.
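
      A standalone sketch of the underlying streambuf behavior (not the RocksDB adapter itself): with buffering disabled, every character that bypasses `xsputn()` arrives via `overflow()`, so `overflow()` must append whatever character it receives rather than special-casing newline.

      ```
      #include <iostream>
      #include <streambuf>
      #include <string>

      class StringSink : public std::streambuf {
       public:
        std::string contents;

       protected:
        // Bulk writes (e.g. string literals) arrive here.
        std::streamsize xsputn(const char* s, std::streamsize n) override {
          contents.append(s, static_cast<size_t>(n));
          return n;
        }
        // With no put area, single-character writes (std::endl, and on some
        // platforms the digits produced by number formatting) all land here.
        int_type overflow(int_type ch) override {
          if (ch != traits_type::eof()) {
            contents.push_back(static_cast<char>(ch));  // append any character
            return ch;
          }
          return traits_type::eof();
        }
      };

      int main() {
        StringSink sink;
        std::ostream os(&sink);
        os << "value: " << 42ULL << std::endl;
        std::cout << sink.contents;  // prints "value: 42" plus newline
      }
      ```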
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7991
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26615692
      
      Pulled By: ajkr
      
      fbshipit-source-id: 4c0003de1645b9531545b23df69b000e07014468
    • Make BlockBasedTable::kMaxAutoReadAheadSize configurable (#7951) · cd79a009
      Akanksha Mahajan committed
      Summary:
      RocksDB performs auto-readahead for iterators once it notices more
      than two reads for a table file. The readahead starts at 8KB and doubles on every
      additional read, up to BlockBasedTable::kMaxAutoReadAheadSize, which is
      256*1024 (256KB).
      This PR adds a new, configurable option, BlockBasedTableOptions::max_auto_readahead_size,
      which replaces BlockBasedTable::kMaxAutoReadAheadSize.
      If max_auto_readahead_size is set to 0, no implicit auto prefetching is
      done. If the max_auto_readahead_size provided is less than
      8KB (the initial readahead size RocksDB uses for
      auto-readahead), the readahead size simply stays at max_auto_readahead_size.
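
      Configuration is straightforward; a brief sketch (the 64KB cap is just an example value):

      ```
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      int main() {
        rocksdb::BlockBasedTableOptions table_options;
        // Cap iterator auto-readahead at 64KB instead of the former
        // hard-coded 256KB maximum; 0 would disable implicit prefetching.
        table_options.max_auto_readahead_size = 64 * 1024;

        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
      }
      ```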
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7951
      
      Test Plan: Add new unit test case.
      
      Reviewed By: anand1976
      
      Differential Revision: D26568085
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: b6543520fc74e97d859f2002328d4c5254d417af
    • Fix testcase failures on windows (#7992) · e017af15
      sherriiiliu committed
      Summary:
      Fixed 5 test case failures found on Windows 10/Windows Server 2016:
      1. In `flush_job_test`, the DestroyDir function fails in the destructor because some file handles are still held by VersionSet. This happens on Windows Server 2016, so we need to manually reset the versions_ pointer to release all file handles.
      2. In the `StatsHistoryTest.InMemoryStatsHistoryPurging` test, the capped memory cost of stats_history_size on Windows becomes 14000 bytes with the latest changes, not 13000 bytes.
      3. In the `SSTDumpToolTest.RawOutput` test, the output file handle is not closed at the end.
      4. In the `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on Windows, so `total_mem` is always equal to `total_size`. The internal memory fragmentation assertion does not apply in this case.
      5. In the `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach an 87.5% compression ratio with the original CreateTable method, so I append extra zeros to the string value to improve the compression ratio. Besides, since XPRESS allocates memory internally and thus does not support custom allocator verification, we skip the allocator verification for XPRESS.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26615283
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d
    • Always expose WITH_GFLAGS option to user (#7990) · 75c6ffb9
      sherriiiliu committed
      Summary:
      The WITH_GFLAGS option does not work on MSVC.

      I checked the usage of [CMAKE_DEPENDENT_OPTION](https://cmake.org/cmake/help/latest/module/CMakeDependentOption.html). It says that if the `depends` condition is not true, it sets the `option` to the value given by `force` and hides the option from the user. Therefore, `CMAKE_DEPENDENT_OPTION(WITH_GFLAGS "build with GFlags" ON "NOT MSVC;NOT MINGW" OFF)` hides the WITH_GFLAGS option from the user when running on MSVC or MINGW and always sets WITH_GFLAGS to OFF. To expose the WITH_GFLAGS option to the user, I removed CMAKE_DEPENDENT_OPTION and split the logic into if-else statements.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7990
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26615755
      
      Pulled By: ajkr
      
      fbshipit-source-id: 33ca39a73423d9516510c15aaf9efb5c4072cdf9
    • Extract test cases correctly in run_ci_db_test.ps1 script (#7989) · f91fd0c9
      sherriiiliu committed
      Summary:
      Extract test cases correctly in run_ci_db_test.ps1 script.
      
      Some new test groups end with # comments. Previously, when the script extracted test groups and test cases, its regex rule did not cover this case, so the concatenation of some test groups and test cases failed; see the examples in the comments.

      Also removed useless trailing whitespace in the script.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7989
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26615909
      
      Pulled By: ajkr
      
      fbshipit-source-id: 8e68d599994f17d6fefde0daa925c3018179521a
  7. 23 February 2021, 2 commits
  8. 22 February 2021, 1 commit
    • Attempt to speed up tests by adding test to "slow" tests (#7973) · 59d91796
      mrambacher committed
      Summary:
      I noticed tests frequently timing out on CircleCI when I submit a PR.  I did some investigation and found the SeqAdvanceConcurrentTest suite (OneWriteQueue, TwoWriteQueues) tests were all taking a long time to complete (30 tests each taking at least 15K ms).
      
      This PR adds those tests to the "slow reg" list in order to move them earlier in the execution sequence so that they are not the "long tail".
      
      For completeness, other tests that were also slow are:
      NumLevels/DBTestUniversalCompaction.UniversalCompactionTrivialMoveTest : 12 tests all taking 12K+ ms
      ReadSequentialFileTest with ReadaheadSize: 8 tests all 12K+ ms
      WriteUnpreparedTransactionTest.RecoveryTest : 2 tests at 22K+ ms
      DBBasicTest.EmptyFlush: 1 test at 35K+ ms
      RateLimiterTest.Rate: 1 test at 23K+ ms
      BackupableDBTest.ShareTableFilesWithChecksumsTransition: 1 test at 16K+ ms
      MultiThreadedDBTest.MultiThreaded: 78 tests at 10K+ ms
      TransactionStressTest.DeadlockStress: 7 tests at 11K+ ms
      DBBasicTestDeadline.IteratorDeadline: 3 tests at 10K+ ms
      
      No effort was made to determine why the tests were slow.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7973
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26519130
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 11555c9115acc207e45e210a7fc7f879170a3853
  9. 21 February 2021, 1 commit
  10. 20 February 2021, 5 commits
    • Update HISTORY and bump version (#7984) · 7343eb4a
      Yanqin Jin committed
      Summary:
      Prepare to cut 6.18.fb branch
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7984
      
      Reviewed By: ajkr
      
      Differential Revision: D26557151
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8c144c807090cdae67e6655e7a17056ce8c50bc0
    • Limit buffering for collecting samples for compression dictionary (#7970) · d904233d
      Andrew Kryczka committed
      Summary:
      For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
      
      However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data-block content is buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). The limit is not strict, as we currently buffer more than just data blocks; keys are buffered as well. But it does make a step towards giving users predictable memory usage.
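
      A brief configuration sketch (the sizes are example values, not recommendations):

      ```
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.compression = rocksdb::kZSTD;
        // Target dictionary size, and ZSTD training buffer (0 means no
        // training; samples are concatenated into the dictionary instead).
        options.compression_opts.max_dict_bytes = 16 * 1024;
        options.compression_opts.zstd_max_train_bytes = 100 * 16 * 1024;
        // New in this PR: cap how much data is buffered for sampling before
        // the output file switches to unbuffered mode.
        options.compression_opts.max_dict_buffer_bytes = 64 << 20;  // 64MB
      }
      ```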
      
      Related changes include:
      
      - Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
      - Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
      - Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
      
      Test Plan:
      - updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
      - looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
      
      Reviewed By: pdillinger
      
      Differential Revision: D26467994
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
    • Avoid self-move-assign in pop operation of binary heap. (#7942) · cf14cb3e
      Max Neunhoeffer committed
      Summary:
      The current implementation of a binary heap in `util/heap.h` does a move-assign in the `pop` method. In the case that there is exactly one element stored in the heap, this ends up being a self-move-assign. This can cause trouble with certain classes, which are not prepared for this. Furthermore, it trips up the glibc STL debugger (`-D_GLIBCXX_DEBUG`), which produces an assertion failure in this case.
      
      This PR addresses this problem by not doing the (unnecessary in this case) move-assign if there is only one element in the heap.
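
      The idea of the fix, as a standalone sketch (a hypothetical heap, not util/heap.h verbatim):

      ```
      #include <cassert>
      #include <utility>
      #include <vector>

      template <typename T>
      struct HeapSketch {
        std::vector<T> data_;

        void pop() {
          assert(!data_.empty());
          if (data_.size() > 1) {
            // Source and destination are distinct elements, so the
            // move-assign is safe; the new root is then sifted down.
            data_.front() = std::move(data_.back());
            data_.pop_back();
            // ... restore the heap property from the root (omitted) ...
          } else {
            // Exactly one element: data_.front() and data_.back() are the
            // same object, so a move-assign here would be a self-move-assign
            // (and trips -D_GLIBCXX_DEBUG). Just remove the element.
            data_.pop_back();
          }
        }
      };
      ```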
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7942
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26528739
      
      Pulled By: ajkr
      
      fbshipit-source-id: 5ca570e0c4168f086b10308ad766dff84e6e2d03
    • gitignore cmake-build-* for CLion integration (#7933) · ec76f031
      tison committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7933
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26529429
      
      Pulled By: ajkr
      
      fbshipit-source-id: 244344b70b1db161f9b224c25fe690c663264d7d
    • Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936) · 4bc9df94
      mrambacher committed
      
      Summary:
      Added a "only_mutable_options" flag to the ConfigOptions.  When set, the Configurable methods will only look at/update options that are marked as kMutable.
      
      Fixed DB::SetOptions to allow for the update of any mutable TableFactory options.  Fixes https://github.com/facebook/rocksdb/issues/7385.
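
      A sketch of the new capability (the option-string key/value syntax shown is an assumption based on RocksDB's usual options-string format, not quoted from the PR):

      ```
      #include <string>
      #include <unordered_map>
      #include "rocksdb/db.h"

      rocksdb::Status UpdateBlockSize(rocksdb::DB* db) {
        // With this fix, mutable BlockBasedTableOptions fields can be
        // updated on a live DB; immutable ones are rejected.
        return db->SetOptions(
            {{"block_based_table_factory", "{block_size=16384;}"}});
      }
      ```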
      
      Added tests for the new flag.  Updated HISTORY.md
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D26389646
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
  11. 19 February 2021, 9 commits
  12. 18 February 2021, 3 commits
    • Bug fix for status overridden by Status::NotFound in db_impl_readonly (#7972) · 6a85aea5
      Akanksha Mahajan committed
      Summary:
      Bug fix for the returned status being overridden by Status::NotFound in
      DBImpl::OpenForReadOnlyCheckExistence. This was causing some service
      owners to misinterpret the actual error and take steps based on the wrong status.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7972
      
      Reviewed By: riversand963
      
      Differential Revision: D26499598
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 05e9fedbe2a2e0e53135760f8ff578a2816d2b8e
    • Add checkpoint support to BlobDB (#7959) · dab4fe5b
      Levi Tamasi committed
      Summary:
      The patch adds checkpoint support to BlobDB. Blob files are hard linked or
      copied, depending on whether the checkpoint directory is on the same filesystem
      or not, similarly to table files.
      
      TODO: Add support for blob files to `ExportColumnFamily` and to the checksum
      verification logic used by backup/restore.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7959
      
      Test Plan: Ran `make check` and the crash test for a while.
      
      Reviewed By: riversand963
      
      Differential Revision: D26434768
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 994be55a8dc08133028250760fca440d2c7c4dc5
    • Add support for the integrated BlobDB to db_bench (#7956) · 0743eba0
      Levi Tamasi committed
      Summary:
      The patch adds the configuration options of the new BlobDB implementation
      to `db_bench` and adjusts the help messages of the old (`StackableDB`-based)
      BlobDB's options to make it clear which implementation they pertain to.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7956
      
      Test Plan: Ran `make check` and `db_bench` with the new options.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26384808
      
      Pulled By: ltamasi
      
      fbshipit-source-id: b4405bb2c56cfd3506d4c32e3329c08dfdf69c94
  13. 17 February 2021, 2 commits
  14. 16 February 2021, 1 commit
  15. 12 February 2021, 1 commit
  16. 11 February 2021, 2 commits
    • Handoff checksum Implementation (#7523) · d1c510ba
      Zhichao Cao committed
      Summary:
      In PR https://github.com/facebook/rocksdb/issues/7419, we introduced the new Append and PositionedAppend APIs for WritableFile at the FileSystem layer, which enable RocksDB to pass data verification information (e.g., a checksum of the data) to the lower layer. In this PR, we use the new APIs in WritableFileWriter, so that files created via WritableFileWriter can pass their checksums to the storage layer. To control which file types should apply checksum handoff, we add checksum_handoff_file_types to DBOptions. Users can use this option to control which file types (currently supported: kLogFile, kTableFile, and kDescriptorFile) should use the new Append and PositionedAppend APIs to hand off verification information.
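
      A configuration sketch (hedged: the `FileTypeSet::Add` helper and the exact `FileType` enum spellings used here are assumptions; the summary above refers to the types as kLogFile, kTableFile, and kDescriptorFile):

      ```
      #include "rocksdb/options.h"
      #include "rocksdb/types.h"

      int main() {
        rocksdb::Options options;
        // Hand off checksums for WAL/log and table files to the storage
        // layer; files of other types are written without verification info.
        options.checksum_handoff_file_types.Add(rocksdb::FileType::kWalFile);
        options.checksum_handoff_file_types.Add(rocksdb::FileType::kTableFile);
      }
      ```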
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
      
      Test Plan: add new unit test; pass make check / make asan_check
      
      Reviewed By: pdillinger
      
      Differential Revision: D24313271
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
    • Add prefetching (batched MultiGet) for experimental Ribbon filter (#7889) · e4f1e64c
      Peter Dillinger committed
      Summary:
      Adds support for prefetching data in Ribbon queries,
      which especially optimizes batched Ribbon queries for MultiGet
      (~222ns/key to ~97ns/key) but also single key queries on cold memory
      (~333ns to ~226ns) because many queries span more than one cache line.
      
      This required some refactoring of the query algorithm, and there
      does not appear to be a noticeable regression in "hot memory" query
      times (perhaps from 48ns to 50ns).
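
      The core idea, as an illustrative sketch (with hypothetical helpers, not the actual Ribbon query code): issue all prefetches for a batch up front so the memory loads overlap, then probe.

      ```
      #include <cstddef>
      #include <cstdint>

      size_t SlotOffset(uint64_t hash);                // hypothetical: hash -> byte offset
      bool ProbeOne(const char* data, uint64_t hash);  // hypothetical: single query

      void BatchedMayMatch(const char* filter_data, const uint64_t* hashes,
                           size_t n, bool* may_match) {
        // Phase 1: touch each query's first cache line; the prefetches
        // proceed in parallel instead of each probe stalling on a cold line.
        for (size_t i = 0; i < n; ++i) {
          __builtin_prefetch(filter_data + SlotOffset(hashes[i]));
        }
        // Phase 2: by the time we probe, most lines are already in cache.
        for (size_t i = 0; i < n; ++i) {
          may_match[i] = ProbeOne(filter_data, hashes[i]);
        }
      }
      ```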
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7889
      
      Test Plan:
      existing unit tests, plus performance validation with
      filter_bench:
      
      Each data point is the best of two runs. I saturated the machine
      CPUs with other filter_bench runs in the background.
      
      Before:
      
          $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
          WARNING: Assertions are enabled; benchmarks unnecessarily slow
          Building...
          Build avg ns/key: 125.86
          Number of filters: 1993
          Total size (MB): 168.166
          Reported total allocated memory (MB): 183.211
          Reported internal fragmentation: 8.94626%
          Bits/key stored: 7.05341
          Prelim FP rate %: 0.951827
          ----------------------------
          Mixed inside/outside queries...
            Single filter net ns/op: 48.0111
            Batched, prepared net ns/op: 222.384
            Batched, unprepared net ns/op: 343.908
            Skewed 50% in 1% net ns/op: 252.916
            Skewed 80% in 20% net ns/op: 320.579
            Random filter net ns/op: 332.957
      
      After:
      
          $ ./filter_bench -impl=3 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
          WARNING: Assertions are enabled; benchmarks unnecessarily slow
          Building...
          Build avg ns/key: 128.117
          Number of filters: 1993
          Total size (MB): 168.166
          Reported total allocated memory (MB): 183.211
          Reported internal fragmentation: 8.94626%
          Bits/key stored: 7.05341
          Prelim FP rate %: 0.951827
          ----------------------------
          Mixed inside/outside queries...
            Single filter net ns/op: 49.8812
            Batched, prepared net ns/op: 97.1514
            Batched, unprepared net ns/op: 222.025
            Skewed 50% in 1% net ns/op: 197.48
            Skewed 80% in 20% net ns/op: 212.457
            Random filter net ns/op: 226.464
      
      Bloom comparison, for reference:
      
          $ ./filter_bench -impl=2 -m_keys_total_max=200 -average_keys_per_filter=100000 -m_queries=50
          WARNING: Assertions are enabled; benchmarks unnecessarily slow
          Building...
          Build avg ns/key: 35.3042
          Number of filters: 1993
          Total size (MB): 238.488
          Reported total allocated memory (MB): 262.875
          Reported internal fragmentation: 10.2255%
          Bits/key stored: 10.0029
          Prelim FP rate %: 0.965327
          ----------------------------
          Mixed inside/outside queries...
            Single filter net ns/op: 9.09931
            Batched, prepared net ns/op: 34.21
            Batched, unprepared net ns/op: 88.8564
            Skewed 50% in 1% net ns/op: 139.75
            Skewed 80% in 20% net ns/op: 181.264
            Random filter net ns/op: 173.88
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26378710
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 058428967c55ed763698284cd3b4bbe3351b6e69