1. March 19, 2021 (3 commits)
    • Fix a test failure when built with ASSERT_STATUS_CHECKED=1 (#8075) · 7ee41a5d
      Committed by Yanqin Jin
      Summary:
      As title.
      Test Plan:
      ```
      ASSERT_STATUS_CHECKED=1 make -j20 backupable_db_test error_handler_fs_test
      ./backupable_db_test
      ./error_handler_fs_test
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8075
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27173832
      
      Pulled By: riversand963
      
      fbshipit-source-id: 37dac50f7c89127804ff2572abddd4174642de30
    • Separate handling of WAL Sync io error with SST flush io error (#8049) · c8109471
      Committed by Zhichao Cao
      Summary:
      Previously, if the WAL was in use, all retryable IO errors were treated as hard errors, so writes were stalled. In this PR, the retryable IO error from WAL sync is separated from the SST file flush IO error. If the WAL sync succeeds and a retryable IO error happens only during SST flush, the error is mapped to a soft error, so the user can continue inserting into the memtable and appending to the WAL.

      This PR also resolves a bug where, if the WAL sync fails, the memtable state is not rolled back, because PickMemtable is called earlier than SyncClosedLog is called and checked.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8049
      
      Test Plan: added new unit test, make check
      
      Reviewed By: anand1976
      
      Differential Revision: D26965529
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: f5fecb66602212523c92ee49d7edcb6065982410
    • Revamp WriteController (#8064) · e7a60d01
      Committed by Peter Dillinger
      Summary:
      WriteController had a number of issues:
      * It could introduce a delay of 1ms even if the write rate never exceeded the
      configured delayed_write_rate.
      * The DB-wide delayed_write_rate could be exceeded in a number of ways
      with multiple column families:
        * Wiping all pending delay "debts" when another column family joins
        the delay with GetDelayToken().
        * Resetting last_refill_time_ to (now + sleep amount) means each
        column family can write with delayed_write_rate for large writes.
        * Updating bytes_left_ for a partial refill without updating
        last_refill_time_ would essentially give out random bonuses,
        especially to medium-sized writes.
      
      Now the code is much simpler, with these issues fixed. See comments in
      the new code and new (replacement) tests.
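
      To make the fixed accounting concrete, here is a minimal sketch of a shared write pacer; it is not the actual WriteController code, and the names (`DelayPacer`, `Charge`) are hypothetical:

      ```
      #include <algorithm>
      #include <cstdint>

      // Hypothetical sketch, not the real WriteController: one shared "virtual
      // time" budget paces all column families, so joining writers never wipe
      // pending debt and partial refills never hand out random bonuses.
      class DelayPacer {
       public:
        explicit DelayPacer(uint64_t bytes_per_sec) : rate_(bytes_per_sec) {}

        // Charge `bytes` against the shared budget and return how many
        // microseconds the caller should sleep before writing.
        uint64_t Charge(uint64_t now_usec, uint64_t bytes) {
          const uint64_t cost_usec = bytes * 1000000 / rate_;
          // Start from the later of "now" and the time the existing debt is
          // paid off; never reset the clock to now + sleep amount.
          const uint64_t start = std::max(now_usec, next_free_usec_);
          next_free_usec_ = start + cost_usec;
          return start - now_usec;  // zero while the rate bound is not exceeded
        }

       private:
        const uint64_t rate_;          // configured delayed_write_rate (bytes/sec)
        uint64_t next_free_usec_ = 0;  // virtual time when all charged bytes drain
      };
      ```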
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8064
      
      Test Plan: new tests, better than old tests
      
      Reviewed By: mrambacher
      
      Differential Revision: D27064936
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 497c23fe6819340b8f3d440bd634d8a2bc47323f
  2. March 18, 2021 (2 commits)
    • Add the statistics and info log for Error handler (#8050) · 08ec5e73
      Committed by Zhichao Cao
      Summary:
      Add statistics and an info log for the error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume success; a histogram for the auto resume retry count in each recovery call.
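
      A hedged usage sketch for reading the new counters; the ticker and histogram names below are taken from this summary and should be verified against include/rocksdb/statistics.h for your build:

      ```
      #include <cinttypes>
      #include <cstdio>

      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"

      // Hedged sketch: set options.statistics (CreateDBStatistics()) before
      // opening the DB, then poll the error-handler tickers added by this PR.
      void DumpErrorHandlerStats(const rocksdb::Options& options) {
        uint64_t bg_errors = options.statistics->getTickerCount(
            rocksdb::ERROR_HANDLER_BG_ERROR_COUNT);
        uint64_t autoresume = options.statistics->getTickerCount(
            rocksdb::ERROR_HANDLER_AUTORESUME_COUNT);
        rocksdb::HistogramData retry_hist;
        options.statistics->histogramData(
            rocksdb::ERROR_HANDLER_AUTORESUME_RETRY_COUNT, &retry_hist);
        std::printf("bg errors: %" PRIu64 ", auto resumes: %" PRIu64
                    ", avg retries per recovery: %f\n",
                    bg_errors, autoresume, retry_hist.average);
      }
      ```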
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050
      
      Test Plan: make check and add test to error_handler_fs_test
      
      Reviewed By: anand1976
      
      Differential Revision: D26990565
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
    • Use SST file manager to track blob files as well (#8037) · 27d57a03
      Committed by Akanksha Mahajan
      Summary:
      Extend SstFileManager to track blob files as well. This PR notifies the
      SstFileManager via OnAddFile whenever a new blob file is created, via
      OnDeleteFile whenever an obsolete blob file is deleted, and schedules
      blob file deletions via ScheduleFileDeletion.
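
      A hedged configuration sketch of the user-facing side; `NewSstFileManager` and `enable_blob_files` are existing public options of this era, while the path and setup are illustrative only:

      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/sst_file_manager.h"

      // Hedged sketch: with an SstFileManager installed and blob files enabled,
      // tracked sizes and deletion scheduling now also cover blob files.
      rocksdb::Status OpenWithFileManager(rocksdb::DB** db) {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.enable_blob_files = true;  // integrated BlobDB
        options.sst_file_manager.reset(
            rocksdb::NewSstFileManager(rocksdb::Env::Default()));
        return rocksdb::DB::Open(options, "/tmp/blob_sfm_db", db);
      }
      ```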
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037
      
      Test Plan: Add new unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D26891237
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
  3. March 16, 2021 (2 commits)
    • Fix a bug in key comparison when index type is kBinarySearchWithFirstKey (#8062) · 03043528
      Committed by Yanqin Jin
      Summary:
      When timestamp is enabled, key comparison should take it into account.
      In `BlockBasedTableReader::Get()` and `BlockBasedTableReader::MultiGet()`,
      assume the target key is `key` and the timestamp upper bound is `ts`.
      Suppose the highest key in the current block is (key, ts1), while the lowest
      key in the next block is (key, ts2).
      If
      ```
      ts1 > ts > ts2
      ```
      then
      ```
      (key, ts1) < (key, ts) < (key, ts2)
      ```
      It can be shown that if `Compare()` is used, then we will mistakenly skip the next
      block. Instead, we should use `CompareWithoutTimestamp()`.
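
      To make the ordering concrete, here is a toy sketch (not RocksDB's actual comparator API) of comparing keys while ignoring a fixed-size timestamp suffix:

      ```
      #include <string>

      // Toy sketch: a user key carries a fixed-size timestamp suffix. The full
      // ordering sorts equal user keys by *descending* timestamp, so (key, ts1)
      // sorts before (key, ts) whenever ts1 > ts. Block-boundary checks must
      // therefore compare only the user-key prefix.
      int CompareWithoutTimestamp(const std::string& a, const std::string& b,
                                  size_t ts_sz) {
        return a.compare(0, a.size() - ts_sz, b, 0, b.size() - ts_sz);
      }
      ```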
      
      The majority of this PR makes some existing tests in `db_with_timestamp_basic_test.cc`
      parameterized so that different index types can be tested. A new unit test is
      also added for more coverage.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8062
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D27057557
      
      Pulled By: riversand963
      
      fbshipit-source-id: c1062fa7c159ed600a1ad7e461531d52265021f1
    • Move a test file to a better location (#8054) · 85d4f2c8
      Committed by Yanqin Jin
      Summary:
      As title.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8054
      
      Test Plan: make check
      
      Reviewed By: mrambacher
      
      Differential Revision: D27017955
      
      Pulled By: riversand963
      
      fbshipit-source-id: 829497d507bc89afbe982f8a8cf3555e52fd7098
  4. March 15, 2021 (2 commits)
    • Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033) · 3dff28cf
      Committed by mrambacher
      Summary:
      For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared_ptr causes some performance degradation on certain hardware classes.

      For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env, which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "shortcut" to gain access to this constant resource.

      There were a few classes (PeriodicWorkScheduler?) where the "shortcut" property did not hold. In those cases, the shared pointer was preserved.
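
      A minimal sketch of the pattern, with hypothetical names rather than RocksDB's actual classes:

      ```
      #include <cstdint>
      #include <memory>

      // Hypothetical sketch: the owning object keeps the shared_ptr alive for
      // the DB's lifetime, and hot paths hold a plain pointer "shortcut" to
      // skip atomic refcount traffic on every call.
      struct SystemClockLike {
        virtual ~SystemClockLike() = default;
        virtual uint64_t NowMicros() = 0;
      };

      struct OwningOptions {
        std::shared_ptr<SystemClockLike> clock;  // owns for the whole lifetime
      };

      class HotPath {
       public:
        explicit HotPath(const OwningOptions& opts) : clock_(opts.clock.get()) {}
        uint64_t Now() const { return clock_->NowMicros(); }  // no refcount ops
       private:
        SystemClockLike* clock_;  // valid while OwningOptions outlives HotPath
      };
      ```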
      
      Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
      
      6.17: readrandom   :      28.046 micros/op 854902 ops/sec;   61.3 MB/s (355999 of 355999 found)
      6.18: readrandom   :      32.615 micros/op 735306 ops/sec;   52.7 MB/s (290999 of 290999 found)
      PR: readrandom   :      27.500 micros/op 871909 ops/sec;   62.5 MB/s (367999 of 367999 found)
      
      (Note that the times for 6.18 are prior to revert of the SystemClock).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
      
      Reviewed By: pdillinger
      
      Differential Revision: D27014563
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
    • Deflake tests of compaction based on compensated file size (#8036) · b8f40f7f
      Committed by Andrew Kryczka
      Summary:
      CompactionDeletionTriggerReopen was observed to be flaky recently:
      https://app.circleci.com/pipelines/github/facebook/rocksdb/6030/workflows/787af4f3-b9f7-4645-8e8d-1fb0ebf05539/jobs/101451.
      
      I went through it and the related tests and arrived at different
      conclusions on what constraints we can expect on DB size. Some
      constraints got looser and some got tighter. The particular constraint
      that flaked got a lot looser so at least the flake linked above would have been prevented.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8036
      
      Reviewed By: riversand963
      
      Differential Revision: D26862566
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3512b86b4fb41aeecae32e1c7382c03916d88d88
  5. March 13, 2021 (2 commits)
    • Fix a harmless data race affecting two test cases (#8055) · b708b166
      Committed by Levi Tamasi
      Summary:
      `DBTest.GetLiveBlobFiles` and `ObsoleteFilesTest.BlobFiles` both modify the
      current `Version` in their setup phase, implicitly assuming that no other
      threads would touch the `Version` while this is happening. The periodic
      stats dumper thread violates this assumption; the patch fixes this by
      disabling it in the affected test cases. (Note: the data race is
      harmless in the sense that it only affects test code.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8055
      
      Test Plan:
      ```
      COMPILE_WITH_TSAN=1 make db_test -j24
      gtest-parallel --repeat=10000 ./db_test --gtest_filter="*GetLiveBlobFiles"
      COMPILE_WITH_TSAN=1 make obsolete_files_test -j24
      gtest-parallel --repeat=10000 ./obsolete_files_test --gtest_filter="*BlobFiles"
      ```
      
      Reviewed By: riversand963
      
      Differential Revision: D27022715
      
      Pulled By: ltamasi
      
      fbshipit-source-id: b6cc77ed63d8bc1cbe0603522ff1a572182fc9ab
    • Instantiate tests DBIteratorTestForPinnedData (#8051) · 119dda21
      Committed by Peter Dillinger
      Summary:
      A trial gtest upgrade discovered some parameterized tests missing instantiation. By some miracle, they still pass.
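
      A generic gtest sketch (not the actual RocksDB test code) of what was missing:

      ```
      #include <gtest/gtest.h>

      // A value-parameterized suite runs zero tests unless it is instantiated;
      // newer gtest versions flag the missing INSTANTIATE_* macro as an error.
      class PinnedDataTest : public ::testing::TestWithParam<bool> {};

      TEST_P(PinnedDataTest, RunsForEachParam) { EXPECT_EQ(GetParam(), GetParam()); }

      INSTANTIATE_TEST_CASE_P(DBIteratorTestForPinnedData, PinnedDataTest,
                              ::testing::Values(false, true));
      ```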
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8051
      
      Test Plan: thisisthetest
      
      Reviewed By: mrambacher
      
      Differential Revision: D27003684
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cde1cab1551fb282f67d462d46574bd30bd5e61f
  6. March 11, 2021 (2 commits)
    • Enable backward iterator for keys with user-defined timestamp (#8035) · 82b38884
      Committed by Yanqin Jin
      Summary:
      This PR does the following:
      
      - Enable backward iteration for keys with user-defined timestamp. Note that merge, single delete, and range delete are not supported yet.
      - Introduces a new helper API `Comparator::EqualWithoutTimestamp()`.
      - Fix a typo in `SetTimestamp()`.
      - Add/update unit tests
      
      Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced for CPU-intensive workloads with a lot of `Prev()` calls. Results of iterating keys with timestamps are also provided.
      
      1. Disable timestamp, run:
      ```
      ./db_bench -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
      ```
      Results:
      > Baseline
      > - seekrandom [AVG    6 runs] : 96115 ops/sec;   53.2 MB/sec
      > - seekrandom [MEDIAN 6 runs] : 98075 ops/sec;   54.2 MB/sec
      >
      > This PR
      > - seekrandom [AVG    6 runs] : 95521 ops/sec;   52.8 MB/sec
      > - seekrandom [MEDIAN 6 runs] : 96338 ops/sec;   53.3 MB/sec
      
      2. Enable timestamp, run:
      ```
      ./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5
      ```
      Result:
      > Baseline: not supported
      >
      > This PR
      > - seekrandom [AVG    6 runs] : 90514 ops/sec;   50.1 MB/sec
      > - seekrandom [MEDIAN 6 runs] : 90834 ops/sec;   50.2 MB/sec
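
      For context, a hedged usage sketch of reverse iteration under a timestamp bound; it assumes the DB was opened with an application-provided comparator configured for an 8-byte timestamp (later releases ship one):

      ```
      #include <memory>
      #include <string>

      #include "rocksdb/db.h"

      // Hedged sketch: keys are walked in reverse; versions newer than the
      // timestamp upper bound in ReadOptions are hidden from the read.
      void IterateBackwards(rocksdb::DB* db) {
        std::string ts(8, '\xff');  // timestamp upper bound for this read
        rocksdb::Slice ts_ub(ts);
        rocksdb::ReadOptions ropts;
        ropts.timestamp = &ts_ub;

        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ropts));
        for (it->SeekToLast(); it->Valid(); it->Prev()) {
          // it->key()/it->value() visit the visible versions in reverse order.
        }
      }
      ```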
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8035
      
      Reviewed By: ltamasi
      
      Differential Revision: D26926668
      
      Pulled By: riversand963
      
      fbshipit-source-id: 95330cc2242397c03e09d29e5417dfb0adc98ef5
    • Make secondary instance use ManifestTailer (#7998) · 64517d18
      Committed by Yanqin Jin
      Summary:
      This PR
      
      - adds a class `ManifestTailer` that inherits from `VersionEditHandlerPointInTime`. `ManifestTailer::Iterate()` can be called multiple times to tail the primary instance's MANIFEST and apply the changes to the secondary,
      - updates the implementation of `ReactiveVersionSet::ReadAndApply` to use this class,
      - removes unused code in version_set.cc,
      - updates existing tests, e.g. removing deleted sync points from unit tests,
      - adds a new test to address the bug in https://github.com/facebook/rocksdb/issues/7815.
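
      A hedged usage sketch of the secondary-instance path this affects; `OpenAsSecondary` and `TryCatchUpWithPrimary` are the existing public entry points, and the paths are illustrative:

      ```
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      // Hedged sketch: TryCatchUpWithPrimary() drives
      // ReactiveVersionSet::ReadAndApply, which now tails the primary's
      // MANIFEST via ManifestTailer.
      rocksdb::Status CatchUp(const std::string& primary_path,
                              const std::string& secondary_path,
                              rocksdb::DB** secondary) {
        rocksdb::Options options;
        options.max_open_files = -1;  // required for secondary instances
        rocksdb::Status s = rocksdb::DB::OpenAsSecondary(
            options, primary_path, secondary_path, secondary);
        if (s.ok()) {
          s = (*secondary)->TryCatchUpWithPrimary();  // replay new version edits
        }
        return s;
      }
      ```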
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7998
      
      Test Plan:
      make check
      Existing and newly-added tests in version_set_test.cc and db_secondary_test.cc
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26926641
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8d4dd15db0ba863c213f743e33b5a207e948c980
  7. March 10, 2021 (3 commits)
  8. March 9, 2021 (2 commits)
  9. March 4, 2021 (1 commit)
    • Update compaction statistics to include the amount of data read from blob files (#8022) · cb25bc11
      Committed by Levi Tamasi
      Summary:
      The patch does the following:
      1) Exposes the amount of data (number of bytes) read from blob files from
      `BlobFileReader::GetBlob` / `Version::GetBlob`.
      2) Tracks the total number and size of blobs read from blob files during a
      compaction (due to garbage collection or compaction filter usage) in
      `CompactionIterationStats` and propagates this data to
      `InternalStats::CompactionStats` / `CompactionJobStats`.
      3) Updates the formulae for write amplification calculations to include the
      amount of data read from blob files.
      4) Extends the compaction stats dump with a new column `Rblob(GB)` and
      a new line containing the total number and size of blob files in the current
      `Version` to complement the information about the shape and size of the LSM tree
      that's already there.
      5) Updates `CompactionJobStats` so that the number of files and amount of data
      written by a compaction are broken down per file type (i.e. table/blob file).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022
      
      Test Plan: Ran `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D26801199
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
  10. March 3, 2021 (2 commits)
    • Possibly bump NUMBER_OF_RESEEKS_IN_ITERATION (#8015) · 72d1e258
      Committed by Yanqin Jin
      Summary:
      When changing db iterator direction, we may perform a reseek.
      Therefore, we should bump the NUMBER_OF_RESEEKS_IN_ITERATION counter.
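
      A hedged sketch of how an application observes the counter (statistics must be enabled, via CreateDBStatistics(), before the workload runs):

      ```
      #include <cstdint>

      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"

      // Hedged sketch: direction changes that trigger a reseek now show up in
      // this existing ticker.
      uint64_t CountReseeks(const rocksdb::Options& options) {
        // Assumes the DB was opened with options.statistics set and an
        // iterator workload mixing Next() and Prev() has run.
        return options.statistics->getTickerCount(
            rocksdb::NUMBER_OF_RESEEKS_IN_ITERATION);
      }
      ```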
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8015
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D26755415
      
      Pulled By: riversand963
      
      fbshipit-source-id: 211f51f1a454bcda768fc46c0dce51edeb7f05fe
    • Break down the amount of data written during flushes/compactions per file type (#8013) · a46f080c
      Committed by Levi Tamasi
      Summary:
      The patch breaks down the "bytes written" (as well as the "number of output files")
      compaction statistics into two, so the values are logged separately for table files
      and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
      files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
      This will also come in handy for fixing the write amplification statistics, which currently
      do not consider the amount of data read from blob files during compaction. (This will
      be fixed by an upcoming patch.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013
      
      Test Plan: Ran `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D26742156
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
  11. March 2, 2021 (1 commit)
  12. February 26, 2021 (2 commits)
    • Remove unused/incorrect fwd declaration (#8002) · c370d8aa
      Committed by Yanqin Jin
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8002
      
      Reviewed By: anand1976
      
      Differential Revision: D26659354
      
      Pulled By: riversand963
      
      fbshipit-source-id: 6b464dbea9fd8240ead8cc5af393f0b78e8f9dd1
    • Compaction filter support for (new) BlobDB (#7974) · cef4a6c4
      Committed by Yanqin Jin
      Summary:
      Allow applications to implement a custom compaction filter and pass it to BlobDB.
      
      The compaction filter's custom logic can operate on blobs.
      To do so, the application needs to subclass the `CompactionFilter` abstract class and implement the `FilterV2()` method.
      Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if the application's custom logic relies solely
      on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
      db/blob/db_blob_compaction_test.cc.
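
      A hedged sketch of such a filter; `FilterV2()` is the standard hook, while the exact signature of the key-only hook mentioned above should be taken from include/rocksdb/compaction_filter.h of your version:

      ```
      #include <string>

      #include "rocksdb/compaction_filter.h"
      #include "rocksdb/slice.h"

      // Hedged sketch: drop all keys under a "temp/" prefix. Because the
      // decision depends only on the key, the blob value never needs to be
      // read when the key-only hook is also implemented.
      class DropTempKeysFilter : public rocksdb::CompactionFilter {
       public:
        Decision FilterV2(int /*level*/, const rocksdb::Slice& key,
                          ValueType /*value_type*/,
                          const rocksdb::Slice& /*existing_value*/,
                          std::string* /*new_value*/,
                          std::string* /*skip_until*/) const override {
          return key.starts_with("temp/") ? Decision::kRemove : Decision::kKeep;
        }
        const char* Name() const override { return "DropTempKeysFilter"; }
      };
      ```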
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D26509280
      
      Pulled By: riversand963
      
      fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
  13. February 24, 2021 (1 commit)
    • Fix testcase failures on windows (#7992) · e017af15
      Committed by sherriiiliu
      Summary:
      Fixed 5 test case failures found on Windows 10/Windows Server 2016:
      1. In `flush_job_test`, the DestroyDir function fails in the destructor because some file handles are still being held by VersionSet. This happens on Windows Server 2016, so we need to manually reset the versions_ pointer to release all file handles.
      2. In the `StatsHistoryTest.InMemoryStatsHistoryPurging` test, the capping memory cost of stats_history_size on Windows becomes 14000 bytes with the latest changes, not just 13000 bytes.
      3. In the `SSTDumpToolTest.RawOutput` test, the output file handle is not closed at the end.
      4. In the `FullBloomTest.OptimizeForMemory` test, ROCKSDB_MALLOC_USABLE_SIZE is undefined on Windows, so `total_mem` is always equal to `total_size`. The internal memory fragmentation assertion does not apply in this case.
      5. In the `BlockFetcherTest.FetchAndUncompressCompressedDataBlock` test, XPRESS cannot reach an 87.5% compression ratio with the original CreateTable method, so I append extra zeros to the string value to improve the compression ratio. Besides, since XPRESS allocates memory internally and thus does not support custom allocator verification, we skip the allocator verification for XPRESS.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7992
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D26615283
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3632612f84b99e2b9c77c403b112b6bedf3b125d
  14. February 23, 2021 (1 commit)
  15. February 20, 2021 (2 commits)
    • Limit buffering for collecting samples for compression dictionary (#7970) · d904233d
      Committed by Andrew Kryczka
      Summary:
      For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
      
      However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
      
      Related changes include:
      
      - Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
      - Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
      - Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
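
      A hedged configuration sketch; `max_dict_buffer_bytes` is the field added by this PR, while the other fields predate it:

      ```
      #include "rocksdb/advanced_options.h"
      #include "rocksdb/options.h"

      // Hedged sketch: cap how much data is buffered for dictionary sampling
      // during flush/compaction before switching to unbuffered mode.
      rocksdb::Options MakeDictCompressionOptions() {
        rocksdb::Options options;
        options.compression = rocksdb::kZSTD;
        options.compression_opts.max_dict_bytes = 16 * 1024;          // dict size
        options.compression_opts.zstd_max_train_bytes = 1024 * 1024;  // train input
        options.compression_opts.max_dict_buffer_bytes = 64 << 20;    // new cap
        return options;
      }
      ```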
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
      
      Test Plan:
      - updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
      - looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
      
      Reviewed By: pdillinger
      
      Differential Revision: D26467994
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
    • Fix handling of Mutable options; Allow DB::SetOptions to update mutable TableFactory Options (#7936) · 4bc9df94
      Committed by mrambacher
      
      Summary:
      Added a "only_mutable_options" flag to the ConfigOptions.  When set, the Configurable methods will only look at/update options that are marked as kMutable.
      
      Fixed DB::SetOptions to allow for the update of any mutable TableFactory options.  Fixes https://github.com/facebook/rocksdb/issues/7385.
      
      Added tests for the new flag.  Updated HISTORY.md
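
      A hedged usage sketch; the `table_factory.` key prefix shown here is an assumption about the supported syntax, so see the PR and options_test.cc for the exact forms:

      ```
      #include <unordered_map>

      #include "rocksdb/db.h"

      // Hedged sketch: updating a mutable BlockBasedTableOptions field on a
      // live DB, which this PR enables.
      rocksdb::Status UpdateBlockSize(rocksdb::DB* db) {
        return db->SetOptions({{"table_factory.block_size", "16384"}});
      }
      ```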
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D26389646
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
  16. February 19, 2021 (2 commits)
    • Introduce a new trace file format (v 0.2) for better extension (#7977) · b0fd1cc4
      Committed by Zhichao Cao
      Summary:
      The trace file record and payload encoding used to be fixed, which required complex backward-compatibility handling. This PR introduces a new trace file format that makes it easier to add new entries to the payload and does not have backward-compatibility issues. V 0.1 is still supported in this PR. Tracing of lower_bound and upper_bound for iterators is also added.
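
      A hedged usage sketch of writing a trace; the tracer chooses the format version internally, and per this summary v0.1 files remain readable:

      ```
      #include <memory>
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/trace_reader_writer.h"

      // Hedged sketch: record a workload to a trace file for later replay
      // or analysis.
      rocksdb::Status TraceWorkload(rocksdb::DB* db, const std::string& path) {
        std::unique_ptr<rocksdb::TraceWriter> writer;
        rocksdb::Status s = rocksdb::NewFileTraceWriter(
            rocksdb::Env::Default(), rocksdb::EnvOptions(), path, &writer);
        if (!s.ok()) return s;
        s = db->StartTrace(rocksdb::TraceOptions(), std::move(writer));
        // ... run workload; iterator lower/upper bounds are now recorded ...
        return s.ok() ? db->EndTrace() : s;
      }
      ```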
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977
      
      Test Plan: make check. Tested with an old trace file in replay and analysis.
      
      Reviewed By: anand1976
      
      Differential Revision: D26529948
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76
    • Fix txn `MultiGet()` return un-committed data with snapshot (#7963) · 59ba104e
      Committed by Jay Zhuang
      Summary:
      TransactionDB uses a read callback to filter out un-committed data before
      a snapshot. But the `MultiGet()` API doesn't use it at all, which causes
      it to return unwanted data.
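
      A hedged repro-style sketch (`txn_db` and `txn` are assumed to already exist):

      ```
      #include <string>
      #include <vector>

      #include "rocksdb/utilities/transaction.h"
      #include "rocksdb/utilities/transaction_db.h"

      // Hedged sketch: reads through a transaction with a snapshot set must
      // not see other writers' un-committed data, including via MultiGet().
      std::vector<rocksdb::Status> SnapshotMultiGet(
          rocksdb::TransactionDB* txn_db, rocksdb::Transaction* txn) {
        rocksdb::ReadOptions ropts;
        ropts.snapshot = txn_db->GetSnapshot();
        std::vector<rocksdb::Slice> keys{"k1", "k2"};
        std::vector<rocksdb::ColumnFamilyHandle*> cfs(
            keys.size(), txn_db->DefaultColumnFamily());
        std::vector<std::string> values;
        auto statuses = txn->MultiGet(ropts, cfs, keys, &values);
        txn_db->ReleaseSnapshot(ropts.snapshot);
        return statuses;
      }
      ```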
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963
      
      Test Plan: Added unittest to reproduce
      
      Reviewed By: anand1976
      
      Differential Revision: D26455851
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
  17. February 18, 2021 (1 commit)
  18. February 17, 2021 (1 commit)
  19. February 16, 2021 (1 commit)
  20. February 11, 2021 (1 commit)
    • Handoff checksum Implementation (#7523) · d1c510ba
      Committed by Zhichao Cao
      Summary:
      In PR https://github.com/facebook/rocksdb/issues/7419, we introduced the new Append and PositionedAppend APIs to WritableFile at the FileSystem layer, which enable RocksDB to pass data verification information (e.g., a checksum of the data) to the lower layer. In this PR, we use the new APIs in WritableFileWriter, such that files created via WritableFileWriter can pass the checksum to the storage layer. To control which file types should apply checksum handoff, we add checksum_handoff_file_types to DBOptions. Users can use this option to control which file types (currently supported: kLogFile, kTableFile, kDescriptorFile) should use the new Append and PositionedAppend APIs to hand off the verification information.
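
      A hedged configuration sketch; the `FileType` enum spellings are assumptions (the summary's kLogFile corresponds to the WAL), so check the headers of your version:

      ```
      #include "rocksdb/options.h"
      #include "rocksdb/types.h"

      // Hedged sketch: opt specific file types into checksum handoff.
      rocksdb::Options MakeHandoffOptions() {
        rocksdb::Options options;
        options.checksum_handoff_file_types.Add(rocksdb::FileType::kWalFile);
        options.checksum_handoff_file_types.Add(rocksdb::FileType::kTableFile);
        return options;
      }
      ```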
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
      
      Test Plan: add new unit test, pass make check/ make asan_check
      
      Reviewed By: pdillinger
      
      Differential Revision: D24313271
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
  21. February 9, 2021 (1 commit)
  22. February 7, 2021 (1 commit)
  23. February 6, 2021 (1 commit)
  24. February 5, 2021 (1 commit)
  25. January 30, 2021 (2 commits)
    • Fix a SingleDelete related optimization for blob indexes (#7904) · e5311a8e
      Committed by Levi Tamasi
      Summary:
      There is a small `SingleDelete` related optimization in the
      `CompactionIterator` code: when a `SingleDelete`-`Put` pair is preserved
      solely for the purposes of transaction conflict checking, the value
      itself gets cleared. (This is referred to as "optimization 3" in the
      `CompactionIterator` code.) Though the rest of the code got updated to
      support `SingleDelete`'ing blob indexes, this chunk was apparently
      missed, resulting in an assertion failure (or `ROCKS_LOG_FATAL` in release
      builds) when triggered. Note: in addition to clearing the value, we also
      need to update the type of the KV to regular value when dealing with
      blob indexes here.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7904
      
      Test Plan: `make check`
      
      Reviewed By: ajkr
      
      Differential Revision: D26118009
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 6bf78043d20265e2b15c2e1ab8865025040c42ae
    • Integrity protection for live updates to WriteBatch (#7748) · 78ee8564
      Committed by Andrew Kryczka
      Summary:
      This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.).
      
      The foundation classes (`ProtectionInfo.*`) embed the coverage info in their type, and provide `Protect.*()` and `Strip.*()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer.
      
      When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer.
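
      A hedged usage sketch; the constructor parameter order is assumed from this release's write_batch.h:

      ```
      #include "rocksdb/db.h"
      #include "rocksdb/write_batch.h"

      // Hedged sketch: opt a batch into 8-bytes-per-key protection info, which
      // is verified once each entry is encoded into the MemTable buffer.
      rocksdb::Status ProtectedWrite(rocksdb::DB* db) {
        rocksdb::WriteBatch batch(/*reserved_bytes=*/0, /*max_bytes=*/0,
                                  /*protection_bytes_per_key=*/8);
        batch.Put("key", "value");
        return db->Write(rocksdb::WriteOptions(), &batch);
      }
      ```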
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748
      
      Test Plan:
      - an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught
      - add to stress/crash test to verify it works in variety of configs/operations without intentional corruption
      - [deferred] unit tests for `ProtectionInfo.*` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc.
      
      Reviewed By: pdillinger
      
      Differential Revision: D25754492
      
      Pulled By: ajkr
      
      fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866