1. 05 Aug 2022, 1 commit
    • Break TableReader MultiGet into filter and lookup stages (#10432) · bf4532eb
      Authored by anand76
      Summary:
      This PR is the first step in enhancing the coroutines MultiGet to be able to look up a batch in parallel across levels. By having a separate TableReader function for probing the Bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
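      The filter-then-lookup split can be pictured with a minimal standalone sketch (hypothetical types and names, not RocksDB's actual TableReader API): the whole batch is first probed against a file's Bloom filter, and only keys that may be present move on to the expensive lookup stage, while definite misses can immediately move on to the next level.

      ```cpp
      #include <cassert>
      #include <set>
      #include <string>
      #include <vector>

      // Stand-in for a table's Bloom filter: false positives allowed,
      // false negatives are not.
      struct MockFilter {
        std::set<std::string> maybe_present;
        bool KeyMayMatch(const std::string& k) const {
          return maybe_present.count(k) > 0;
        }
      };

      // Stage 1: probe the filter for a whole batch; keep only possible keys.
      std::vector<std::string> FilterStage(const MockFilter& f,
                                           const std::vector<std::string>& batch) {
        std::vector<std::string> survivors;
        for (const auto& k : batch) {
          if (f.KeyMayMatch(k)) survivors.push_back(k);  // definite misses drop out
        }
        return survivors;
      }

      int main() {
        MockFilter f{{"a", "c"}};
        // "b" and "d" are definitely not in this file, so the lookup stage
        // (and its IO) is only attempted for "a" and "c".
        auto survivors = FilterStage(f, {"a", "b", "c", "d"});
        assert((survivors == std::vector<std::string>{"a", "c"}));
        return 0;
      }
      ```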
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D38245910
      
      Pulled By: anand1976
      
      fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
  2. 03 Aug 2022, 1 commit
    • Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) · 27f3af59
      Authored by Peter Dillinger
      Summary:
      TL;DR: due to a recent change, if you drop a column family,
      often that DB will no longer fsync after writing new SST files
      to remaining or new column families, which could lead to data
      loss on power loss.
      
      More bug detail:
      The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
      DB::Close time rather than waiting for DB object destruction.
      Unfortunately, it also closes shared FSDirectory objects on
      DropColumnFamily (& destroy remaining handles), which can lead
      to use-after-Close on FSDirectory shared with remaining column
      families. Those "uses" are only Fsyncs (or redundant Closes). In
      the default Posix filesystem, an Fsync on a closed FSDirectory is a
      quiet no-op. Consequently, under most configurations, if you drop
      a column family, that DB will no longer fsync after writing new SST
      files to column families sharing the same directory.
      
      More fix detail:
      Basically, this removes unnecessary Close ops on destroying
      ColumnFamilyData. We let `shared_ptr` take care of calling the
      destructor at the right time. If the intent was to require Close be
      called before destroying FSDirectory, that was not made clear by the
      author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
      could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
      not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
      timely destruction of FSDirectory to suffice as Close (in
      CountedFileSystem). Added a TODO to revisit that.
      
      Also in this PR:
      * Added a TODO to share FSDirectory instances between DB and its column
      families. (Already shared among column families.)
      * Made DB::Close attempt to close all its open FSDirectory objects even
      if there is a failure in closing one. Also code clean-up around this
      logic.
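      The essence of the fix can be sketched in isolation (hypothetical, simplified types, not the real FSDirectory class): with ownership through `shared_ptr`, dropping a column family only releases a reference, and the destructor closes the directory only once the last owner is gone, so remaining column families keep getting real fsyncs.

      ```cpp
      #include <cassert>
      #include <memory>

      // Simplified stand-in for FSDirectory (hypothetical; the real class differs).
      struct Dir {
        bool closed = false;
        int fsyncs = 0;
        void Fsync() {
          if (!closed) ++fsyncs;  // after Close, the Posix impl is a quiet no-op
        }
        void Close() { closed = true; }
        ~Dir() { Close(); }  // destructor closes at the right time
      };

      int main() {
        auto dir = std::make_shared<Dir>();
        auto cf1 = dir;  // two column families share one directory object
        auto cf2 = dir;

        cf1.reset();   // drop a column family: only releases a reference...
        cf2->Fsync();  // ...so the remaining family still gets a real fsync
        assert(!dir->closed && dir->fsyncs == 1);

        // The bug was equivalent to calling dir->Close() on the drop above,
        // silently turning all later Fsync() calls into no-ops.
        return 0;
      }
      ```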
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
      
      Test Plan:
      add an assert to check for use-after-Close. With that
      existing tests can detect the misuse. With fix, tests pass (except noted
      relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)
      
      Reviewed By: ajkr
      
      Differential Revision: D38357922
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
  3. 16 Jul 2022, 1 commit
  4. 13 Jul 2022, 1 commit
    • Stop tracking syncing live WAL for performance (#10330) · b283f041
      Authored by Yanqin Jin
      Summary:
      With https://github.com/facebook/rocksdb/issues/10087, applications calling `SyncWAL()` or writing with `WriteOptions::sync=true` can suffer
      from performance regression. This PR reverts to original behavior of tracking the syncing of closed WALs.
      After reverting to the old behavior, recovery, whether kPointInTime or kAbsoluteConsistency, may fail to
      detect corruption if the corruption is in the live WAL, since syncing of the live WAL is no longer tracked.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10330
      
      Test Plan:
      make check
      
      Before https://github.com/facebook/rocksdb/issues/10087
      ```bash
      fillsync     :     750.269 micros/op 1332 ops/sec 75.027 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync     :     776.492 micros/op 1287 ops/sec 77.649 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 2 runs] : 1310 (± 44) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync     :     805.625 micros/op 1241 ops/sec 80.563 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [AVG    3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [MEDIAN 3 runs] : 1287 ops/sec;    0.1 MB/sec
      ```
      
      Before this PR and after https://github.com/facebook/rocksdb/issues/10087
      ```bash
      fillsync     :    1479.601 micros/op 675 ops/sec 147.960 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync     :    1626.080 micros/op 614 ops/sec 162.608 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 2 runs] : 645 (± 59) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync     :    1588.402 micros/op 629 ops/sec 158.840 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [AVG    3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [MEDIAN 3 runs] : 629 ops/sec;    0.1 MB/sec
      ```
      
      After this PR
      ```bash
      fillsync     :     749.621 micros/op 1334 ops/sec 74.962 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync     :     865.577 micros/op 1155 ops/sec 86.558 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 2 runs] : 1244 (± 175) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync     :     845.837 micros/op 1182 ops/sec 84.584 seconds 100000 operations;    0.1 MB/s (100 ops)
      fillsync [AVG 3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [AVG    3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
      fillsync [MEDIAN 3 runs] : 1182 ops/sec;    0.1 MB/sec
      ```
      
      Reviewed By: ajkr
      
      Differential Revision: D37725212
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8fa7d13b3c7662be5d56351c42caf3266af937ae
  5. 24 Jun 2022, 1 commit
    • Don't count no prefix as Bloom hit (#10244) · f81ea75d
      Authored by Peter Dillinger
      Summary:
      When a key is "out of domain" for the prefix_extractor (no
      prefix assigned) then the Bloom filter is not queried. PerfContext
      was counting this as a Bloom "hit" while Statistics doesn't count this
      as a prefix Bloom checked. I think it's more accurate to call it neither
      hit nor miss, so this changes the counting to make the PerfContext counting
      more like Statistics.
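      A minimal sketch of the counting rule (hypothetical counters and prefix extractor, not RocksDB's actual PerfContext): when the key has no prefix assigned, the filter is never consulted, so neither the "checked" nor the "useful" counter moves.

      ```cpp
      #include <cassert>
      #include <optional>
      #include <string>

      // Hypothetical, simplified counters in the spirit of PerfContext.
      struct BloomCounters {
        int checked = 0;  // filter actually queried
        int useful = 0;   // filter query avoided a read
      };

      // Returns the prefix if the key is "in domain", otherwise nothing.
      std::optional<std::string> ExtractPrefix(const std::string& key) {
        if (key.size() < 3) return std::nullopt;  // out of domain: no prefix
        return key.substr(0, 3);
      }

      void CheckPrefixBloom(const std::string& key, bool filter_says_no,
                            BloomCounters& c) {
        if (!ExtractPrefix(key)) {
          return;  // no prefix assigned: count neither hit nor miss
        }
        ++c.checked;
        if (filter_says_no) ++c.useful;
      }

      int main() {
        BloomCounters c;
        CheckPrefixBloom("ab", false, c);     // out of domain: nothing counted
        CheckPrefixBloom("abcdef", true, c);  // in domain, filter rejected the key
        assert(c.checked == 1 && c.useful == 1);
        return 0;
      }
      ```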
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10244
      
      Test Plan:
      tests updates and expanded (Get and MultiGet). Iterator test
      coverage of the change will come in next PR
      
      Reviewed By: bjlemaire
      
      Differential Revision: D37371297
      
      Pulled By: pdillinger
      
      fbshipit-source-id: fed132fba6a92b2314ab898d449fce2d1586c157
  6. 22 Jun 2022, 1 commit
    • Expose the initial logger creation error (#10223) · 9586dcf1
      Authored by Yanqin Jin
      Summary:
      https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`,
      `DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
      directly expose the error to caller without some additional work.
      This is a first version proposal which:
      - Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation
      - Checks the error during `DB::Open()` and return it to caller if non-ok
      
      This is not ideal. We could alternatively move the logger creation logic out of `SanitizeOptions()`.
      Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything
      in case other callers of `SanitizeOptions()` assume that a logger should be created.
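      The first-version mechanism can be sketched as follows (hypothetical stand-ins for Status, DBImpl, and DB::Open, since the real signatures differ): because a constructor cannot return a Status, the result of logger creation is stashed in a member and surfaced later in Open().

      ```cpp
      #include <cassert>
      #include <string>

      // Minimal Status stand-in (hypothetical; RocksDB's real Status differs).
      struct Status {
        bool ok_ = true;
        std::string msg;
        bool ok() const { return ok_; }
        static Status OK() { return {}; }
        static Status IOError(std::string m) { return {false, std::move(m)}; }
      };

      struct DBImpl {
        Status init_logger_creation_s;  // saved: the ctor cannot return a Status
        explicit DBImpl(bool logger_fails) {
          init_logger_creation_s =
              logger_fails ? Status::IOError("cannot create LOG") : Status::OK();
        }
      };

      // Open checks the saved status and returns it to the caller if non-ok.
      Status Open(bool logger_fails, DBImpl** out) {
        DBImpl* db = new DBImpl(logger_fails);
        if (!db->init_logger_creation_s.ok()) {
          Status s = db->init_logger_creation_s;
          delete db;
          *out = nullptr;
          return s;
        }
        *out = db;
        return Status::OK();
      }

      int main() {
        DBImpl* db = nullptr;
        assert(!Open(true, &db).ok() && db == nullptr);  // failure now exposed
        assert(Open(false, &db).ok() && db != nullptr);  // normal path unaffected
        delete db;
        return 0;
      }
      ```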
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223
      
      Test Plan: make check
      
      Reviewed By: pdillinger
      
      Differential Revision: D37321717
      
      Pulled By: riversand963
      
      fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
  7. 17 Jun 2022, 1 commit
    • Update stats to help users estimate MultiGet async IO impact (#10182) · a6691d0f
      Authored by anand76
      Summary:
      Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements -
      1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for the number of levels at which MultiGet had to read from a file
      2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used
      
      The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D37213296
      
      Pulled By: anand1976
      
      fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb
  8. 16 Jun 2022, 1 commit
    • Fix handling of accidental truncation of IDENTITY file (#10173) · 3d358a7e
      Authored by Peter Dillinger
      Summary:
      A consequence of https://github.com/facebook/rocksdb/issues/9990 was requiring a non-empty DB ID to generate
      new SST files. But if the DB ID is not tracked in the manifest and the IDENTITY file
      is somehow truncated to 0 bytes, then an empty DB ID would be assigned, leading
      to a crash. This change ensures a non-empty DB ID is assigned and set in the
      IDENTITY file.
      
      Also,
      * Some light refactoring to clean up the logic
      * (I/O efficiency) If the ID is tracked in the manifest and already matches the
      IDENTITY file, don't needlessly overwrite the file.
      * (Debugging) Log the DB ID to info log on open, because sometimes IDENTITY
      can change if DB is moved around (though it would be unusual for info log to
      be copied/moved without IDENTITY file)
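      The resolution order can be sketched in a few lines (hypothetical helper, not the real RocksDB function): prefer the manifest-tracked ID, fall back to the IDENTITY file, and only when both are empty generate a fresh, guaranteed non-empty ID.

      ```cpp
      #include <cassert>
      #include <random>
      #include <string>

      // Hypothetical sketch: always end up with a non-empty DB ID, even when
      // the IDENTITY file was truncated to 0 bytes and the manifest has nothing.
      std::string ResolveDbId(const std::string& from_manifest,
                              const std::string& from_identity_file) {
        if (!from_manifest.empty()) return from_manifest;
        if (!from_identity_file.empty()) return from_identity_file;
        // Neither source has an ID: generate a fresh one instead of using "".
        std::mt19937_64 rng(std::random_device{}());
        return "db-" + std::to_string(rng());
      }

      int main() {
        assert(ResolveDbId("manifest-id", "") == "manifest-id");
        assert(ResolveDbId("", "file-id") == "file-id");
        assert(!ResolveDbId("", "").empty());  // truncated IDENTITY: still non-empty
        return 0;
      }
      ```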
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10173
      
      Test Plan: unit tests expanded/updated
      
      Reviewed By: ajkr
      
      Differential Revision: D37176545
      
      Pulled By: pdillinger
      
      fbshipit-source-id: a9b414cd35bfa33de48af322a36c24538d50bef1
  9. 04 Jun 2022, 1 commit
  10. 02 Jun 2022, 1 commit
    • Explicitly closing all directory file descriptors (#10049) · 65893ad9
      Authored by Zichen Zhu
      Summary:
      Currently, the DB directory file descriptor is left open until destruction (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (lines 512-515 in ldb_cmd.cc) to leak the `db_` object, build the `ldb` tool, and run
      ```
      strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
      ```
      There is one directory file descriptor that is not closed in the strace log.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
      
      Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
      
      Reviewed By: ajkr, hx235
      
      Differential Revision: D36722135
      
      Pulled By: littlepig2013
      
      fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
  11. 01 Jun 2022, 1 commit
  12. 27 May 2022, 1 commit
    • Fail DB::Open() if logger cannot be created (#9984) · 514f0b09
      Authored by Yanqin Jin
      Summary:
      For both regular and secondary DB instances, we return an error and refuse to open the DB if Logger creation fails.
      
      Our current code allows it, but it is really difficult to debug because
      there will be no LOG files. The same for OPTIONS file, which will be explored in another PR.
      
      Furthermore, Arena::AllocateAligned(size_t bytes, size_t huge_page_size, Logger* logger) has an
      assertion as the following:
      
      ```cpp
      #ifdef MAP_HUGETLB
      if (huge_page_size > 0 && bytes > 0) {
        assert(logger != nullptr);
      }
      #endif
      ```
      
      It can be removed.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9984
      
      Test Plan: make check
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36347754
      
      Pulled By: riversand963
      
      fbshipit-source-id: 529798c0511d2eaa2f0fd40cf7e61c4cbc6bc57e
  13. 20 May 2022, 1 commit
    • Multi file concurrency in MultiGet using coroutines and async IO (#9968) · 57997dda
      Authored by anand76
      Summary:
      This PR implements a coroutine version of batched MultiGet in order to concurrently read from multiple SST files in a level using async IO, thus reducing the latency of the MultiGet. The API from the user perspective is still synchronous and single threaded, with the RocksDB part of the processing happening in the context of the caller's thread. In Version::MultiGet, the decision is made whether to call synchronous or coroutine code.
      
      A good way to review this PR is to review the first 4 commits in order - de773b3, 70c2f70, 10b50e1, and 377a597 - before reviewing the rest.
      
      TODO:
      1. Figure out how to build it in CircleCI (requires some dependencies to be installed)
      2. Do some stress testing with coroutines enabled
      
      No regression in synchronous MultiGet between this branch and main -
      ```
      ./db_bench -use_existing_db=true --db=/data/mysql/rocksdb/prefix_scan -benchmarks="readseq,multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=64 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -adaptive_readahead=true -threads=16 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -statistics
      ```
      Branch - ```multireadrandom :       4.025 micros/op 3975111 ops/sec 60.001 seconds 238509056 operations; 2062.3 MB/s (14767808 of 14767808 found)```
      
      Main - ```multireadrandom :       3.987 micros/op 4013216 ops/sec 60.001 seconds 240795392 operations; 2082.1 MB/s (15231040 of 15231040 found)```
      
      More benchmarks in various scenarios are given below. The measurements were taken with ```async_io=false``` (no coroutines) and ```async_io=true``` (use coroutines). For an IO bound workload (with every key requiring an IO), the coroutines version shows a clear benefit, being ~2.6X faster. For CPU bound workloads, the coroutines version has ~6-15% higher CPU utilization, depending on how many keys overlap an SST file.
      
      1. Single thread IO bound workload on remote storage with sparse MultiGet batch keys (~1 key overlap/file) -
      No coroutines - ```multireadrandom :     831.774 micros/op 1202 ops/sec 60.001 seconds 72136 operations;    0.6 MB/s (72136 of 72136 found)```
      Using coroutines - ```multireadrandom :     318.742 micros/op 3137 ops/sec 60.003 seconds 188248 operations;    1.6 MB/s (188248 of 188248 found)```
      
      2. Single thread CPU bound workload (all data cached) with ~1 key overlap/file -
      No coroutines - ```multireadrandom :       4.127 micros/op 242322 ops/sec 60.000 seconds 14539384 operations;  125.7 MB/s (14539384 of 14539384 found)```
      Using coroutines - ```multireadrandom :       4.741 micros/op 210935 ops/sec 60.000 seconds 12656176 operations;  109.4 MB/s (12656176 of 12656176 found)```
      
      3. Single thread CPU bound workload with ~2 key overlap/file -
      No coroutines - ```multireadrandom :       3.717 micros/op 269000 ops/sec 60.000 seconds 16140024 operations;  139.6 MB/s (16140024 of 16140024 found)```
      Using coroutines - ```multireadrandom :       4.146 micros/op 241204 ops/sec 60.000 seconds 14472296 operations;  125.1 MB/s (14472296 of 14472296 found)```
      
      4. CPU bound multi-threaded (16 threads) with ~4 key overlap/file -
      No coroutines - ```multireadrandom :       4.534 micros/op 3528792 ops/sec 60.000 seconds 211728728 operations; 1830.7 MB/s (12737024 of 12737024 found) ```
      Using coroutines - ```multireadrandom :       4.872 micros/op 3283812 ops/sec 60.000 seconds 197030096 operations; 1703.6 MB/s (12548032 of 12548032 found) ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9968
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D36348563
      
      Pulled By: anand1976
      
      fbshipit-source-id: c0ce85a505fd26ebfbb09786cbd7f25202038696
  14. 07 May 2022, 1 commit
    • Remove own ToString() (#9955) · 736a7b54
      Authored by sdong
      Summary:
      ToString() was created because some platforms don't support std::to_string(). However, we've already been using std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just removes ToString().
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
      
      Test Plan: Watch CI tests
      
      Reviewed By: riversand963
      
      Differential Revision: D36176799
      
      fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
  15. 12 Apr 2022, 1 commit
    • Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_during_recovery set true (#9634) · ae82d914
      Authored by Akanksha Mahajan
      Summary:
      1) In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
      flush the data from WAL to L0 for all column families if possible. As a
      result, not all column families can increase their log_numbers, and
      min_log_number_to_keep won't change.
      2) For a TransactionDB (allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions, and min_log_number_to_keep won't change.
      
      If we persist a new MANIFEST with
      advanced log_numbers for some column families, then during a second
      crash after persisting the MANIFEST, RocksDB will see some column
      families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.
      
      As a solution,
      1. the WALs whose numbers are larger than the
      corrupted WAL and smaller than the new WAL will be moved to the archive folder.
      2. Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST only after recovery is successful.
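      Step 1 above amounts to a simple range selection over WAL numbers, sketched here with a hypothetical helper (not the actual RocksDB recovery code): everything strictly between the corrupted WAL and the newly created WAL is archived rather than deleted.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>

      // Hypothetical sketch: select WAL numbers strictly between the corrupted
      // WAL and the newly created WAL; those get archived, not deleted.
      std::vector<uint64_t> WalsToArchive(const std::vector<uint64_t>& wal_numbers,
                                          uint64_t corrupted_wal,
                                          uint64_t new_wal) {
        std::vector<uint64_t> to_archive;
        for (uint64_t n : wal_numbers) {
          if (n > corrupted_wal && n < new_wal) to_archive.push_back(n);
        }
        return to_archive;
      }

      int main() {
        // WALs 5..9 exist; WAL 6 is corrupted; recovery creates WAL 10.
        auto archived = WalsToArchive({5, 6, 7, 8, 9}, /*corrupted=*/6, /*new=*/10);
        assert((archived == std::vector<uint64_t>{7, 8, 9}));
        return 0;
      }
      ```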
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9634
      
      Test Plan:
      1. Added new unit tests
      2. make crash_test -j
      
      Reviewed By: riversand963
      
      Differential Revision: D34463666
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e233d3af0ed4e2028ca0cf051e5a334a0fdc9d19
  16. 18 Feb 2022, 1 commit
    • Fix some MultiGet batching stats (#9583) · 627deb7c
      Authored by anand76
      Summary:
      The NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats were being recorded only when the last file in a level happened to have hits. They are supposed to be updated for every level. Also, there was some overcounting of GetContextStats. This PR fixes both the problems.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9583
      
      Test Plan: Update the unit test in db_basic_test
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D34308044
      
      Pulled By: anand1976
      
      fbshipit-source-id: b3b36020fda26ba91bc6e0e47d52d58f4d7f656e
  17. 28 Jan 2022, 1 commit
    • Disallow a combination of options (#9348) · dd203ed6
      Authored by Yanqin Jin
      Summary:
      Disallow `immutable_db_opts.use_direct_io_for_flush_and_compaction == true` and
      `mutable_db_opts.writable_file_max_buffer_size == 0`, since it causes `WritableFileWriter::Append()`
      to loop forever and does not make much sense in direct IO.
      
      This combination of options itself does not make much sense: asking RocksDB to do direct IO but not allowing
      RocksDB to allocate a buffer. We should detect this false combination and warn user early, no matter whether
      the application is running on a platform that supports direct IO or not. In the case of platform **not** supporting
      direct IO, it's ok if the user learns about this and then finds that direct IO is not supported.
      
      One tricky thing: the constructor of `WritableFileWriter` is used in our unit tests, and it's impossible
      to return a status code from a constructor. Since we do not throw, I put in an assertion for now. Fortunately,
      the constructor is not exposed to external applications.
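      The early check itself is tiny, sketched here with a hypothetical validation function (the real code returns an InvalidArgument Status during option sanitization):

      ```cpp
      #include <cassert>
      #include <cstddef>

      // Hypothetical sketch of the sanity check: direct IO with a zero
      // write-buffer size would make WritableFileWriter::Append() loop forever,
      // so reject the combination up front, regardless of platform support.
      bool ValidateOptions(bool use_direct_io_for_flush_and_compaction,
                           size_t writable_file_max_buffer_size) {
        if (use_direct_io_for_flush_and_compaction &&
            writable_file_max_buffer_size == 0) {
          return false;  // InvalidArgument in the real code
        }
        return true;
      }

      int main() {
        assert(!ValidateOptions(true, 0));       // bad combination rejected
        assert(ValidateOptions(true, 1 << 20));  // direct IO with a buffer: fine
        assert(ValidateOptions(false, 0));       // no direct IO: buffer may be 0
        return 0;
      }
      ```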
      
      Closing https://github.com/facebook/rocksdb/issues/7109
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9348
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D33371924
      
      Pulled By: riversand963
      
      fbshipit-source-id: 2a3701ab541cee23bffda8a36cdf37b2d235edfa
  18. 12 Jan 2022, 1 commit
  19. 06 Jan 2022, 1 commit
  20. 05 Jan 2022, 1 commit
  21. 11 Dec 2021, 2 commits
    • Add commit marker with timestamp (#9266) · bd513fd0
      Authored by Yanqin Jin
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
      
      This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
      this tag to WAL, thus it is unavailable to users.
      This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
      This diff also stipulates that all column families that may potentially participate in the same
      transaction must either disable timestamps or have the same timestamp format, since the
      `CommitWithTimestamp` tag is followed by a single byte array denoting the commit
      timestamp of the transaction. We will enforce this check in a future diff. We keep this
      diff small.
      
      Reviewed By: ltamasi
      
      Differential Revision: D31721350
      
      fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
    • More refactoring ahead of footer & meta changes (#9240) · 653c392e
      Authored by Peter Dillinger
      Summary:
      I'm working on a new format_version=6 to support context
      checksum (https://github.com/facebook/rocksdb/issues/9058) and this includes much of the refactoring and test
      updates to support that change.
      
      Test coverage data and manual inspection agree on dead code in
      block_based_table_reader.cc (removed).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9240
      
      Test Plan:
      tests enhanced to cover more cases etc.
      
      Extreme case performance testing indicates small % regression in fillseq (w/ compaction), though CPU profile etc. doesn't suggest any explanation. There is enhanced correctness checking in Footer::DecodeFrom, but this should be negligible.
      
      TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=1 --disable_wal={false,true}
      
      (Each is ops/s averaged over 50 runs, run simultaneously with competing configuration for load fairness)
      Before w/ wal: 454512
      After w/ wal: 444820 (-2.1%)
      Before w/o wal: 1004560
      After w/o wal: 998897 (-0.6%)
      
      Since this doesn't modify WAL code, one would expect real effects to be larger in w/o wal case.
      
      This regression will be corrected in a follow-up PR.
      
      Reviewed By: ajkr
      
      Differential Revision: D32813769
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 444a244eabf3825cd329b7d1b150cddce320862f
  22. 09 Nov 2021, 1 commit
    • Enable a few unit tests to use custom Env objects (#9087) · dddb791c
      Authored by anand76
      Summary:
      Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
      
      Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
      
      Reviewed By: riversand963
      
      Differential Revision: D32007222
      
      Pulled By: anand1976
      
      fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
  23. 03 Nov 2021, 1 commit
  24. 29 Oct 2021, 1 commit
    • Implement XXH3 block checksum type (#9069) · a7d4bea4
      Authored by Peter Dillinger
      Summary:
      XXH3 is the latest hash function in the xxHash family, extremely fast on large
      data and easily faster than crc32c on most any x86_64 hardware. In
      integrating this hash function, I have handled the compression type byte
      in a non-standard way to avoid using the streaming API (extra data
      movement and active code size because of hash function complexity). This
      approach got a thumbs-up from Yann Collet.
      
      Existing functionality change:
      * reject bad ChecksumType in options with InvalidArgument
      
      This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
      likely to be handled through different configuration than ChecksumType.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
      
      Test Plan:
      tests updated, and substantially expanded. Unit tests now check
      that we don't accidentally change the values generated by the checksum
      algorithms ("schema test") and that we properly handle
      invalid/unrecognized checksum types in options or in file footer.
      
      DBTestBase::ChangeOptions (etc.) updated from two to one configuration
      changing from default CRC32c ChecksumType. The point of this test code
      is to detect possible interactions among features, and the likelihood of
      some bad interaction being detected by including configurations other
      than XXH3 and CRC32c--and then not detected by stress/crash test--is
      extremely low.
      
      Stress/crash test also updated (manual run long enough to see it accepts
      new checksum type). db_bench also updated for microbenchmarking
      checksums.
      
       ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      ```bash
      ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
      crc32c       :       0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
      xxhash       :       0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
      xxhash64     :       0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
      xxh3         :       0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
      crc32c       :       0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
      xxhash       :       0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
      xxhash64     :       0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
      xxh3         :       0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
      crc32c       :       0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
      xxhash       :       0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
      xxhash64     :       0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
      xxh3         :       0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
      ```
      
      As you can see, especially once warmed up, xxh3 is fastest.
      
       ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
      
      Test
      
          for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
      
      Results (ops/sec)
      
          for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
      
      results-0 252118 # kNoChecksum
      results-1 251588 # kCRC32c
      results-2 251863 # kxxHash
      results-3 252016 # kxxHash64
      results-4 252038 # kXXH3
      
      Reviewed By: mrambacher
      
      Differential Revision: D31905249
      
      Pulled By: pdillinger
      
      fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
  25. 19 Oct 2021, 1 commit
    • Make `DB::Close()` thread-safe (#8970) · 314de7e7
      Authored by Jay Zhuang
      Summary:
      If `DB::Close()` is called from multiple threads, resources
      could be double-released, causing an exception or assertion failure.
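      The hazard and one standard remedy can be sketched in isolation (hypothetical, simplified type; the actual fix in the PR may differ in detail): making the release path idempotent, for example with `std::call_once`, guarantees resources are freed exactly once no matter how many threads race into Close().

      ```cpp
      #include <cassert>
      #include <mutex>
      #include <thread>
      #include <vector>

      // Hypothetical sketch: an idempotent Close() so concurrent callers
      // cannot double-release resources.
      struct Db {
        std::once_flag close_once;
        int release_count = 0;  // must end up exactly 1
        void Close() {
          std::call_once(close_once, [this] { ++release_count; });
        }
      };

      int main() {
        Db db;
        std::vector<std::thread> threads;
        for (int i = 0; i < 8; ++i) {
          threads.emplace_back([&db] { db.Close(); });
        }
        for (auto& t : threads) t.join();
        assert(db.release_count == 1);  // released exactly once, never twice
        return 0;
      }
      ```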
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970
      
      Test Plan:
      Tested with a multi-threaded benchmark, with each thread trying to
      close the DB at the end.
      
      Reviewed By: pdillinger
      
      Differential Revision: D31242042
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb
  26. 13 Sep 2021, 1 commit
  27. 08 Sep 2021, 1 commit
    • Improve support for using regexes (#8740) · 0ef88538
      Authored by Peter Dillinger
      Summary:
      * Consolidate use of std::regex for testing to testharness.cc, to
      minimize Facebook linters constantly flagging uses in non-production
      code.
      * Improve syntax and error messages for asserting some string matches a
      regex in tests.
      * Add a public Regex wrapper class to encapsulate existing usage in
      ObjectRegistry.
      * Remove unnecessary include <regex>
      * Add warnings that use of Regex in production code could cause bad
      performance or stack overflow.
      
      Intended follow-up work:
      * Replace std::regex with another underlying implementation like RE2
      * Improve ObjectRegistry interface in terms of possibly confusing literal
      string matching vs. regex and in terms of reporting invalid regex.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740
      
      Test Plan:
      tests updated, basic unit test for public Regex, and some manual
      testing of temporary changes to see example error messages:
      
      utilities/backupable/backupable_db_test.cc:917: Failure
      000010_1162373755_138626.blob (child.name)
      does not match regex
      [0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern)
      
      db/db_basic_test.cc:74: Failure
      R3SHSBA8C4U0CIMV2ZB0 (sid3)
      does not match regex [0-9A-Z]{20}HAHAHA
      
      Reviewed By: mrambacher
      
      Differential Revision: D30706246
      
      Pulled By: pdillinger
      
      fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd
      0ef88538
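      A minimal sketch of the kind of helper this change adds to the test harness (the function name and signature here are illustrative, not the actual testharness API): a full-match predicate that test assertions can use, which makes failure messages like the samples above possible.

      ```cpp
      #include <regex>
      #include <string>

      // Illustrative test-only helper: returns whether `value` fully matches
      // `pattern`. Per the warnings above, std::regex should stay confined to
      // test code because of its performance and stack-overflow risks.
      bool MatchesRegex(const std::string& value, const std::string& pattern) {
        const std::regex re(pattern);
        return std::regex_match(value, re);
      }
      ```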
  28. 17 Aug 2021, 1 commit
    • P
      Stable cache keys using DB session ids in SSTs (#8659) · a207c278
      Committed by Peter Dillinger
      Summary:
      Use DB session ids in SST table properties to make cache keys
      stable across DB re-open and copy / move / restore / etc.
      
      These new cache keys are currently only enabled when FileSystem does not
      provide GetUniqueId. For now, they are typically larger, so slightly
      less efficient.
      
      Relevant to https://github.com/facebook/rocksdb/issues/7405
      
      This change has a minor regression in PersistentCache functionality:
      metaindex blocks are no longer cached in PersistentCache. Table properties
      blocks already were not but ideally should be. I didn't spend effort to
      fix & test these issues because we don't believe PersistentCache is used much,
      if at all, and expect SecondaryCache to replace it. (Though PRs are welcome.)
      
      FIXME: there is more to be fixed for stable cache keys on external SST files
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659
      
      Test Plan:
      new unit test added, which fails when disabling new
      functionality
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D30297705
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a
      a207c278
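      The idea can be sketched as follows, assuming a much-simplified key encoding (RocksDB's real encoding is binary and more compact; this string form is purely illustrative): the key is derived only from values recorded in the SST file itself, such as the DB session id and file number from the table properties, which survive re-open and copy / move / restore, rather than from OS-level file identifiers.

      ```cpp
      #include <cstdint>
      #include <string>

      // Illustrative stable cache key: built only from data stored in the SST
      // file, so the same block yields the same key after DB re-open or copy.
      std::string StableCacheKey(const std::string& db_session_id,
                                 uint64_t sst_file_number,
                                 uint64_t block_offset) {
        return db_session_id + "#" + std::to_string(sst_file_number) + "#" +
               std::to_string(block_offset);
      }
      ```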
  29. 07 Aug 2021, 1 commit
    • L
      Fix the sorting of KeyContexts for batched MultiGet (#8633) · 87882736
      Committed by Levi Tamasi
      Summary:
      `CompareKeyContext::operator()` on the trunk has a bug: when comparing
      column family IDs, `lhs` is used for both sides of the comparison. This
      results in the `KeyContext`s getting sorted solely based on key, which
      in turn means that keys with the same column family do not necessarily
      form a single range in the sorted list. This violates an assumption of the
      batched `MultiGet` logic, leading to the same column family
      showing up multiple times in the list of `MultiGetColumnFamilyData`.
      The end result is the code attempting to check out the thread-local
      `SuperVersion` for the same CF multiple times, causing an
      assertion violation in debug builds and memory corruption/crash in
      release builds.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D30169182
      
      Pulled By: ltamasi
      
      fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023
      87882736
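      This bug class is easy to reproduce in isolation. Below is a self-contained sketch (the `KeyContext` struct here is a simplified stand-in, not the real one): with the corrected comparator, entries sort by column family first, so each CF forms exactly one contiguous range, which is the invariant batched `MultiGet` relies on.

      ```cpp
      #include <algorithm>
      #include <cstdint>
      #include <set>
      #include <string>
      #include <vector>

      // Simplified stand-in for the real KeyContext.
      struct KeyContext {
        uint32_t cf_id;
        std::string key;
      };

      // Corrected comparator: note the CF comparison reads BOTH lhs and rhs.
      // The original bug compared lhs.cf_id against itself, so the sort
      // degenerated to key-only order.
      struct CompareKeyContext {
        bool operator()(const KeyContext& lhs, const KeyContext& rhs) const {
          if (lhs.cf_id != rhs.cf_id) {
            return lhs.cf_id < rhs.cf_id;
          }
          return lhs.key < rhs.key;
        }
      };

      // Checks the invariant: each cf_id occupies one contiguous range.
      bool CfRangesContiguous(const std::vector<KeyContext>& v) {
        std::set<uint32_t> seen;
        for (size_t i = 0; i < v.size(); ++i) {
          if (i == 0 || v[i].cf_id != v[i - 1].cf_id) {
            if (!seen.insert(v[i].cf_id).second) {
              return false;  // this cf_id already started a range earlier
            }
          }
        }
        return true;
      }
      ```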
  30. 23 Jul 2021, 1 commit
  31. 14 May 2021, 1 commit
    • A
      Initial support for secondary cache in LRUCache (#8271) · feb06e83
      Committed by anand76
      Summary:
      Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and is evicted from memory, it is inserted into the secondary cache. On an LRU cache miss followed by a successful lookup in the secondary cache, the item is promoted to the LRU cache. Only synchronous lookup is supported currently. The secondary cache could be used to implement a persistent (flash) cache or a compressed cache.
      
      Tests:
      Results from cache_bench and db_bench don't show any regression due to these changes.
      
      cache_bench results before and after this change -
      Command
      ```./cache_bench -ops_per_thread=10000000 -threads=1```
      Before
      ```Complete in 40.688 s; QPS = 245774```
      ```Complete in 40.486 s; QPS = 246996```
      ```Complete in 42.019 s; QPS = 237989```
      After
      ```Complete in 40.672 s; QPS = 245869```
      ```Complete in 44.622 s; QPS = 224107```
      ```Complete in 42.445 s; QPS = 235599```
      
      db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
      Commands
      ```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
      
      ```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
      Before
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
      ```
      After
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28357511
      
      Pulled By: anand1976
      
      fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
      feb06e83
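      The promotion/demotion flow described above can be sketched with plain maps standing in for the real LRUCache and SecondaryCache classes (everything here is illustrative and far simpler than the actual `SecondaryCache` interface in include/rocksdb/secondary_cache.h): eviction demotes an item to the secondary tier, and a primary miss that hits the secondary tier promotes it back.

      ```cpp
      #include <map>
      #include <optional>
      #include <string>

      // Two-tier cache sketch: `primary_` plays the LRU cache, `secondary_`
      // plays the (flash/compressed) secondary cache.
      class TwoTierCache {
       public:
        void Insert(const std::string& k, const std::string& v) {
          primary_[k] = v;
        }
        // Eviction from memory demotes the item to the secondary tier.
        void Evict(const std::string& k) {
          auto it = primary_.find(k);
          if (it != primary_.end()) {
            secondary_[k] = it->second;
            primary_.erase(it);
          }
        }
        // Primary miss + secondary hit promotes the item back (synchronous
        // lookup, matching the current limitation described above).
        std::optional<std::string> Lookup(const std::string& k) {
          if (auto it = primary_.find(k); it != primary_.end()) {
            return it->second;
          }
          if (auto it = secondary_.find(k); it != secondary_.end()) {
            primary_[k] = it->second;  // promote on secondary hit
            secondary_.erase(it);
            return primary_[k];
          }
          return std::nullopt;
        }
        bool InPrimary(const std::string& k) const {
          return primary_.count(k) > 0;
        }

       private:
        std::map<std::string, std::string> primary_, secondary_;
      };
      ```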
  32. 07 Jan 2021, 1 commit
    • M
      Create a CustomEnv class; Add WinFileSystem; Make LegacyFileSystemWrapper private (#7703) · e628f59e
      Committed by mrambacher
      Summary:
      This PR does the following:
      -> Creates a WinFileSystem class.  This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
      -> Introduces a CustomEnv class.  A CustomEnv is an Env that takes a FileSystem as constructor argument.  I believe there will only ever be two implementations of this class (PosixEnv and WinEnv).  There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls.
      -> Eliminates the public uses of the LegacyFileSystemWrapper.
      
      With this change in place, there are effectively the following patterns of Env:
      - "Base Env classes" (PosixEnv, WinEnv).  These classes inherit from CompositeEnv, implement the core Env functions (e.g. threads), and delegate the FileSystem-like calls to a hard-coded input FileSystem.
      - Wrapped Composite Env classes (MemEnv).  These classes take in an Env and a FileSystem.  The core env functions are re-directed to the wrapped env.  The file system calls are redirected to the input file system.
      - Legacy Wrapped Env classes.  These classes take in an Env input (but no FileSystem).  The core env functions are re-directed to the wrapped env.  A "Legacy File System" is created using this env, and the file system calls are directed to the env itself.
      
      With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created.  Any other use of the PosixEnv is via another wrapped env.  This cleans up some of the issues with the env construction and destruction.
      
      Additionally, there were places in the code that had an Env when they required a FileSystem.  Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using env->GetFileSystem().  These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D25762190
      
      Pulled By: anand1976
      
      fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
      e628f59e
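      The composition pattern described above can be sketched with toy interfaces (everything below is illustrative and far simpler than the real Env/FileSystem classes): a CustomEnv-style class handles core Env duties itself but routes file operations to the FileSystem it was constructed with.

      ```cpp
      #include <memory>
      #include <string>
      #include <utility>

      // Toy stand-in for the FileSystem interface.
      struct FileSystem {
        virtual ~FileSystem() = default;
        virtual std::string Name() const = 0;
      };

      struct PosixLikeFileSystem : FileSystem {
        std::string Name() const override { return "posix-like"; }
      };

      // Sketch of the CustomEnv pattern: core Env functions are implemented
      // here, FileSystem-like calls are delegated to the injected file system.
      class CustomEnvSketch {
       public:
        explicit CustomEnvSketch(std::shared_ptr<FileSystem> fs)
            : fs_(std::move(fs)) {}
        // FileSystem-like call, delegated to the constructor-injected FS.
        std::string FileSystemName() const { return fs_->Name(); }

       private:
        std::shared_ptr<FileSystem> fs_;
      };
      ```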
  33. 05 Jan 2021, 1 commit
    • A
      Verify file checksum generator name (#7824) · 225abffd
      Committed by Andrew Kryczka
      Summary:
      Previously we only had a debug assertion to check that the right generator was being used for verification. However, a user hit a problem in production where their factory was creating the wrong generator for some files, leading to checksum mismatches. It would have been easier to debug if we verified in optimized builds that the generator with the proper name is used. This PR adds such verification.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7824
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D25740254
      
      Pulled By: ajkr
      
      fbshipit-source-id: a6231521747605021bad3231484b5d4f99f4044f
      225abffd
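      The added check can be sketched as follows (the function name and signature are illustrative, not the actual RocksDB code): before verifying a file checksum, confirm the factory produced a generator whose name matches the name recorded with the file, and report a descriptive error instead of only asserting in debug builds.

      ```cpp
      #include <string>

      // Illustrative sketch: fail verification up front if the generator's
      // name does not match the name recorded in the file, so a misbehaving
      // factory surfaces as a clear error even in optimized builds.
      bool VerifyGeneratorName(const std::string& generator_name,
                               const std::string& name_recorded_in_file,
                               std::string* error_msg) {
        if (generator_name != name_recorded_in_file) {
          *error_msg = "Checksum generator mismatch: file expects '" +
                       name_recorded_in_file + "' but factory created '" +
                       generator_name + "'";
          return false;
        }
        return true;
      }
      ```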
  34. 24 Dec 2020, 2 commits
    • M
      No elide constructors (#7798) · 55e99688
      Committed by mrambacher
      Summary:
      Added "no-elide-constructors" to the ASSERT_STATUS_CHECKED builds.  This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it.  In this case, without this flag, the returned status may be treated as already checked in the caller, thereby bypassing the check in the calling code.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D25680451
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917
      55e99688
    • A
      Update "num_data_read" stat in RetrieveMultipleBlocks (#7770) · 30a5ed9c
      Committed by Akanksha Mahajan
      Summary:
      RetrieveMultipleBlocks, which is used by MultiGet to read data blocks, was not updating the num_data_read stat in
      GetContextStats.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7770
      
      Test Plan: make check -j64
      
      Reviewed By: anand1976
      
      Differential Revision: D25538982
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e3daedb035b1be8ab6af6f115cb3793ccc7b1ec6
      30a5ed9c
  35. 10 Dec 2020, 2 commits
  36. 09 Dec 2020, 1 commit
    • C
      Do not track obsolete WALs in MANIFEST even if they are synced (#7725) · 07030c6f
      Committed by Cheng Chang
      Summary:
      Consider the case:
      1. All column families are flushed, so all WALs become obsolete, but no WAL is removed from disk yet because the removal is asynchronous, a VersionEdit is written to MANIFEST indicating that WALs before a certain WAL number are obsolete, let's say this number is 3;
      2. `SyncWAL` is called, so all the on-disk WALs are synced, and if track_and_verify_wal_in_manifest=true, the WALs will be tracked in MANIFEST, let's say the WAL numbers are 1 and 2;
      3. DB crashes;
      4. During DB recovery, when replaying MANIFEST, we first see that WALs with number < 3 are obsolete, then we see that WAL 1 and 2 are synced, so according to the current implementation of `WalSet`, the `WalSet` will be recovered to include WAL 1 and 2;
      5. WAL 1 and 2 are asynchronously deleted from disk, then the WAL verification algorithm fails with `Corruption: missing WAL`.
      
      The above case is reproduced in a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal`.
      
      The fix is to maintain the upper bound of the obsolete WAL numbers, any WAL with number less than the maintained number is considered to be obsolete, so shouldn't be tracked even if they are later synced. The number is maintained in `WalSet`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7725
      
      Test Plan:
      1. a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal` is added.
      2. run `make crash_test` on devserver.
      
      Reviewed By: riversand963
      
      Differential Revision: D25238914
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: f5dccd57c3d89f19565ec5731f2d42f06d272b72
      07030c6f
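      The fix can be sketched with a much-simplified stand-in for the real `WalSet` (the class below is illustrative only): keep an upper bound of obsolete WAL numbers, and ignore a "WAL synced" record whose number is below the bound, so obsolete WALs never re-enter the tracked set during recovery.

      ```cpp
      #include <cstdint>
      #include <map>

      // Simplified WalSet sketch: WALs with number < min_wal_to_keep_ are
      // obsolete and must not be tracked, even if a sync record for them is
      // replayed later.
      class WalSetSketch {
       public:
        // Records that all WALs before `number` are obsolete.
        void MarkObsoleteBefore(uint64_t number) {
          if (number > min_wal_to_keep_) {
            min_wal_to_keep_ = number;
          }
          wals_.erase(wals_.begin(), wals_.lower_bound(number));
        }
        // Records that a WAL was synced; ignored if the WAL is obsolete.
        void AddSyncedWal(uint64_t number) {
          if (number < min_wal_to_keep_) {
            return;  // obsolete WAL: do not track
          }
          wals_[number] = true;
        }
        bool IsTracked(uint64_t number) const {
          return wals_.count(number) > 0;
        }

       private:
        uint64_t min_wal_to_keep_ = 0;
        std::map<uint64_t, bool> wals_;
      };
      ```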
  37. 17 Nov 2020, 1 commit
    • Y
      Clean up after two test failures in db_basic_test (#7682) · 869f0538
      Committed by Yanqin Jin
      Summary:
      In db_basic_test.cc, there are two tests that rely on the underlying
      system's `LockFile` support to function correctly:
      DBBasicTest.OpenWhenOpen and DBBasicTest.CheckLock. In both tests,
      re-opening a db using `DB::Open` is expected to fail because the second
      open cannot lock the LOCK file. Some distributed file systems, e.g. HDFS,
      do not support POSIX-style file locks. Therefore, these unit tests will cause an
      assertion failure, and the second `Open` will create a db instance.
      Currently, these db instances are not closed after the assertion
      failure. Since these db instances are registered with some process-wide, static
      data structures, e.g. `PeriodicWorkScheduler::Default()`, they can still be
      accessed after the unit tests. However, the `Env` object created for this db
      instance is destroyed when the test finishes in `~DBTestBase()`. Consequently,
      it causes illegal memory access.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7682
      
      Test Plan:
      Run the following on a distributed file system:
      ```
      make check
      ```
      
      Reviewed By: anand1976
      
      Differential Revision: D25004215
      
      Pulled By: riversand963
      
      fbshipit-source-id: f4327d7716c0e72b13bb43737ec9a5d156da4d52
      869f0538