1. 18 Dec 2019 (4 commits)
    • crash_test: two fixes (#6200) · 9f250dd8
      sdong committed
      Summary:
      Fix two crash test issues:
      1. sync mode should not run with disable_wal=true
      2. disable "compaction_readahead_size" for now. With it on, some block checksum verification failures will happen in compaction paths. Not sure why, but disable it for now to keep the test clean.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6200
      
      Test Plan: Run "make crash_test" and "make crash_test_with_atomic_flush" and see that they run much longer than before the fix without failing.
      
      Differential Revision: D19143493
      
      fbshipit-source-id: 438fad52fbda60aafd142e1b65578addbe7d72b1
    • Small tidy and speed up of the travis build (#6181) · 2d167094
      Adam Retter committed
      Summary:
      Cuts about 30-60 seconds from each Travis Linux build, and about 15 minutes from each macOS build
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6181
      
      Differential Revision: D19098357
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 863dd1ab09076ad9b03c2b7914908359628315ae
    • delete superversions in BackgroundCallPurge (#6146) · 39fcaf82
      解轶伦 committed
      Summary:
      I found that CleanupSuperVersion() may block Get() for 30ms+ (with 256MB memtables).
      
      Then I found that the `delete sv` in ~SuperVersion() takes up that time.
      
      The backtrace looks like this:
      
      DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
      DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()
      
      I think it is better to do the deletion in a background thread; please review.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146
      
      Differential Revision: D18972066
      
      fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
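      The deferral described above can be sketched as a small background purge queue: the read path hands the expensive `delete` to a dedicated thread and returns immediately. This is an illustrative stand-in for the idea, not the actual BackgroundCallPurge implementation; all names are hypothetical.

      ```cpp
      #include <cassert>
      #include <condition_variable>
      #include <functional>
      #include <mutex>
      #include <queue>
      #include <thread>

      // Toy background purger: instead of running an expensive destructor on the
      // read path, hand the work to a dedicated thread.
      class BackgroundPurger {
       public:
        BackgroundPurger() : worker_([this] { Run(); }) {}
        ~BackgroundPurger() {
          {
            std::lock_guard<std::mutex> lk(mu_);
            done_ = true;
          }
          cv_.notify_one();
          worker_.join();
        }
        // Called from the foreground (e.g. after a Get()); returns immediately.
        void ScheduleDelete(std::function<void()> deleter) {
          {
            std::lock_guard<std::mutex> lk(mu_);
            queue_.push(std::move(deleter));
          }
          cv_.notify_one();
        }

       private:
        void Run() {
          std::unique_lock<std::mutex> lk(mu_);
          while (true) {
            cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
            while (!queue_.empty()) {
              std::function<void()> task = std::move(queue_.front());
              queue_.pop();
              lk.unlock();
              task();  // the expensive teardown runs here, off the read path
              lk.lock();
            }
            if (done_) return;
          }
        }
        std::mutex mu_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> queue_;
        bool done_ = false;
        std::thread worker_;  // declared last so Run() sees initialized members
      };
      ```

      The foreground thread only pays for a queue push and a notify; the destructor drains any pending deleters before joining.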
    • Set CompactionIterator::valid_ to false when PrepareBlobOutput indicates error · 02aa2295
      Levi Tamasi committed
      Summary:
      With https://github.com/facebook/rocksdb/pull/6121, errors returned by `PrepareBlobValue`
      result in `CompactionIterator::status_` being set to `Corruption` or `IOError`
      as appropriate; however, `valid_` is not set to `false`. The error is eventually propagated in
      `CompactionJob::ProcessKeyValueCompaction` but only after the main loop completes.
      Setting `valid_` to `false` upon errors enables us to terminate the loop early and fail the
      compaction sooner.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6170
      
      Test Plan:
      Ran `make check` and used `db_bench` in BlobDB mode.
      
      fbshipit-source-id: a2ca88a3ca71115e2605bd34a4c795d8a28bef27
  2. 17 Dec 2019 (9 commits)
    • Fix crash in Transaction::MultiGet() when num_keys > 32 · 1be48cb8
      anand1976 committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6192
      
      Test Plan:
      Add a unit test that fails without the fix and passes now
      make check
      
      Differential Revision: D19124781
      
      Pulled By: anand1976
      
      fbshipit-source-id: 8c8cb6fa16c3fc23ec011e168561a13f76bbd783
    • Use Env::LoadEnv to create custom Env objects (#6196) · 7678cf2d
      Yanqin Jin committed
      Summary:
      As title. The previous assumption was that the underlying lib can always return
      a shared_ptr<Env>, which is too strong. Therefore, we use Env::LoadEnv to relax
      it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6196
      
      Test Plan: make check
      
      Differential Revision: D19133199
      
      Pulled By: riversand963
      
      fbshipit-source-id: c83a0c02a42610d077054f2de1acfc45126b3a75
    • Wait for CancelAllBackgroundWork before Close in db stress (#6191) · 68d5d82d
      Maysam Yabandeh committed
      Summary:
      In https://github.com/facebook/rocksdb/issues/6174 we fixed the stress test to respect the CancelAllBackgroundWork + Close order for WritePrepared transactions. The fix did not take into account that some invocations of CancelAllBackgroundWork pass the wait=false parameter, which essentially breaks the order.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6191
      
      Differential Revision: D19102709
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f4e7b5fdae47ff1c1ac284ba1cf67d5d3f3d03eb
    • Merge adjacent file block reads in RocksDB MultiGet() and Add uncompressed block to cache (#6089) · cddd6379
      Zhichao Cao committed
      Summary:
      In the current MultiGet, if the KV-pairs do not belong to data blocks already in the block cache, multiple blocks are read from an SST. This triggers one block read per block request, executed in parallel. In some cases, when data blocks are adjacent in the SST, the reads for these blocks can be combined into a single large read, which reduces system calls and can reduce read latency.
      
      When filling the block cache, data blocks that share one memory buffer must each be copied to the heap separately. Therefore, we only do a combined read when 1) data block compression is enabled, and 2) the compressed block cache is null; otherwise, the extra memory copies may cause extra overhead. In this case, data blocks are uncompressed into new memory anyway.
      
      Also, when 1) data block compression is enabled and 2) the compressed block cache is null, it is possible that a data block is actually not compressed. In the current logic, such data blocks would not be added to the uncompressed cache. So if the memory buffer is shared and the data block is not compressed, the block is copied to the heap and used to fill the cache.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6089
      
      Test Plan: Added test case to ParallelIO.MultiGet. Pass make asan_check
      
      Differential Revision: D18734668
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 67c5615ed373e51e42635fd74b36f8f3a66d5da4
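      The combining step described above boils down to merging byte-adjacent (offset, length) requests before issuing reads. A minimal sketch of that merge, using a hypothetical helper rather than the actual RocksDB code:

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>

      // One pending block read from an SST file.
      struct ReadRequest {
        uint64_t offset;
        uint64_t length;
      };

      // Given requests sorted by offset, merge runs whose byte ranges touch,
      // so each run can be served by a single larger file read.
      std::vector<ReadRequest> CoalesceAdjacent(
          const std::vector<ReadRequest>& sorted_reqs) {
        std::vector<ReadRequest> merged;
        for (const ReadRequest& r : sorted_reqs) {
          if (!merged.empty() &&
              merged.back().offset + merged.back().length == r.offset) {
            merged.back().length += r.length;  // extend the previous read
          } else {
            merged.push_back(r);
          }
        }
        return merged;
      }
      ```

      Individual blocks are then sliced back out of the shared buffer, which is why the memory-copy caveats above matter.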
    • Add some new options to crash_test (#6176) · bcc372c0
      sdong committed
      Summary:
      Several options are trivially added to crash_test, with random values picked for them.
      The simple test now runs with non-dynamic levels and the normal test with dynamic levels.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6176
      
      Test Plan: Run crash_test and watch the printed output
      
      Differential Revision: D19053955
      
      fbshipit-source-id: 958cb43c968541ebd87ed4d91e778bd1d40e7502
    • Update HISTORY.md with the recent memtable trimming fixes · 2d095b4d
      Levi Tamasi committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6194
      
      Differential Revision: D19125292
      
      Pulled By: ltamasi
      
      fbshipit-source-id: d41aca2755ec4bec07feedd6b561e8d18606a931
    • db_stress: preserve all historic manifest files (#6142) · 35126dd8
      sdong committed
      Summary:
      Compaction history is stored in manifest files, so preserving all of them in db_stress helps debugging.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6142
      
      Test Plan: Run db_stress and observe that manifest files are preserved. Run the whole crash_test and see what the DB directory looks like.
      
      Differential Revision: D19047026
      
      fbshipit-source-id: f83c3e0bb5332b1b4768be5dcee56a24f9b760a9
    • db_stress: generate the key based on Zipfian distribution (hot key) (#6163) · fbda25f5
      Zhichao Cao committed
      Summary:
      In the current db_stress, all keys are generated randomly and follow a uniform distribution. In order to test corner cases in which some keys are always updated or read, we need to generate keys based on other distributions. In this PR, keys are generated based on a Zipfian distribution, and the skewness can be controlled by setting hot_key_alpha (0.8 to 1.5 is suggested). The larger hot_key_alpha is, the more skewed the distribution will be. Note that, usually, if hot_key_alpha is larger than 2, only 1 or 2 distinct keys tend to be generated. If hot_key_alpha is 0, keys follow a uniform distribution (random keys).
      
      Testing plan: passed db_stress and printed the keys to make sure they follow the distribution.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6163
      
      Differential Revision: D18978480
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: e123b4865477f7478e83fb581f9576bada334680
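      The key generation described above can be sketched with a weight table proportional to 1/i^alpha, so that a larger hot_key_alpha yields a hotter distribution. This is an illustrative stand-in, not the db_stress implementation:

      ```cpp
      #include <cassert>
      #include <cmath>
      #include <cstdint>
      #include <random>
      #include <vector>

      // Draws key rank i (0-based) with probability proportional to
      // 1 / (i + 1)^alpha; alpha = 0 degenerates to a uniform distribution.
      class ZipfianKeyGenerator {
       public:
        ZipfianKeyGenerator(uint64_t num_keys, double alpha, uint64_t seed = 42)
            : rng_(seed) {
          std::vector<double> weights;
          weights.reserve(num_keys);
          for (uint64_t i = 1; i <= num_keys; ++i) {
            weights.push_back(1.0 / std::pow(static_cast<double>(i), alpha));
          }
          dist_ = std::discrete_distribution<uint64_t>(weights.begin(),
                                                       weights.end());
        }
        uint64_t Next() { return dist_(rng_); }  // key rank in [0, num_keys)

       private:
        std::mt19937_64 rng_;
        std::discrete_distribution<uint64_t> dist_;
      };
      ```

      With alpha around 1, a small set of low ranks absorbs most of the draws, which is exactly the "hot key" behavior the commit wants to exercise.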
    • Fix a data race related to memtable trimming (#6187) · db7c6875
      Levi Tamasi committed
      Summary:
      https://github.com/facebook/rocksdb/pull/6177 introduced a data race
      involving `MemTableList::InstallNewVersion` and `MemTableList::NumFlushed`.
      The patch fixes this by caching whether the current version has any
      memtable history (i.e. flushed memtables that are kept around for
      transaction conflict checking) in an `std::atomic<bool>` member called
      `current_has_history_`, similarly to how `current_memory_usage_excluding_last_`
      is handled.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6187
      
      Test Plan:
      ```
      make clean
      COMPILE_WITH_TSAN=1 make db_test -j24
      ./db_test
      ```
      
      Differential Revision: D19084059
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 327a5af9700fb7102baea2cc8903c085f69543b9
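      The caching pattern the patch describes, deriving a flag under the lock and publishing it through a `std::atomic<bool>` so other threads can read it without locking, can be sketched as follows (illustrative names mirroring `current_has_history_`, not the actual MemTableList code):

      ```cpp
      #include <atomic>
      #include <cassert>
      #include <mutex>
      #include <vector>

      class MemTableHistory {
       public:
        void Add(int memtable_id) {
          std::lock_guard<std::mutex> lk(mu_);
          history_.push_back(memtable_id);
          // Re-derive the cached flag every time the guarded state changes.
          has_history_.store(true, std::memory_order_release);
        }
        void Clear() {
          std::lock_guard<std::mutex> lk(mu_);
          history_.clear();
          has_history_.store(false, std::memory_order_release);
        }
        // Safe to call from any thread without holding the mutex: the atomic
        // load races with nothing, unlike reading history_.empty() directly.
        bool HasHistory() const {
          return has_history_.load(std::memory_order_acquire);
        }

       private:
        mutable std::mutex mu_;
        std::vector<int> history_;          // guarded by mu_
        std::atomic<bool> has_history_{false};  // cached summary of history_
      };
      ```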
  3. 16 Dec 2019 (1 commit)
    • Optimize memory and CPU for building new Bloom filter (#6175) · a92bd0a1
      Peter Dillinger committed
      Summary:
      The filter bits builder collects all the hashes to add in memory before adding them (because the number of keys is not known until we've walked over all the keys). Existing code uses a std::vector for this, which can mean up to 2x the necessary space allocated (and not freed) and up to ~2x write amplification in memory. Using std::deque uses close to minimal space (for large filters, the only time it matters), no write amplification, frees memory while building, and needs no large contiguous memory area. The only cost is more calls to the allocator, which does not appear to matter, at least in the benchmark test.
      
      For now, this change only applies to the new (format_version=5) Bloom filter implementation, to ease before-and-after comparison downstream.
      
      Temporary memory use during build is about the only way the new Bloom filter could regress vs. the old (because of upgrade to 64-bit hash) and that should only matter for full filters. This change should largely mitigate that potential regression.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6175
      
      Test Plan:
      Using filter_bench with -new_builder option and 6M keys per filter is like large full filter (improvement). 10k keys and no -new_builder is like partitioned filters (about the same). (Corresponding configurations run simultaneously on devserver.)
      
      std::vector impl (before)
      
          $ /usr/bin/time -v ./filter_bench -impl=2 -quick -new_builder -working_mem_size_mb=1000 -average_keys_per_filter=6000000
          Build avg ns/key: 52.2027
          Maximum resident set size (kbytes): 1105016
          $ /usr/bin/time -v ./filter_bench -impl=2 -quick -working_mem_size_mb=1000 -average_keys_per_filter=10000
          Build avg ns/key: 30.5694
          Maximum resident set size (kbytes): 1208152
      
      std::deque impl (after)
      
          $ /usr/bin/time -v ./filter_bench -impl=2 -quick -new_builder -working_mem_size_mb=1000 -average_keys_per_filter=6000000
          Build avg ns/key: 39.0697
          Maximum resident set size (kbytes): 1087196
          $ /usr/bin/time -v ./filter_bench -impl=2 -quick -working_mem_size_mb=1000 -average_keys_per_filter=10000
          Build avg ns/key: 30.9348
          Maximum resident set size (kbytes): 1207980
      
      Differential Revision: D19053431
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2888e748723a19d9ea40403934f13cbb8483430c
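      The up-to-2x overhead mentioned above comes from std::vector's geometric capacity growth; a tiny illustration of the allocated-but-unused slack (the exact numbers are implementation dependent):

      ```cpp
      #include <cassert>
      #include <cstddef>
      #include <cstdint>
      #include <vector>

      // After N push_backs, a std::vector's capacity is typically the next
      // point on its geometric growth curve, so up to ~2x N memory stays
      // allocated; a std::deque would instead hold close to N elements' worth
      // of fixed-size chunks and could free them as it drains.
      size_t WastedSlots(size_t n) {
        std::vector<uint64_t> hashes;  // stand-in for the collected key hashes
        for (size_t i = 0; i < n; ++i) {
          hashes.push_back(i);
        }
        return hashes.capacity() - hashes.size();  // allocated but unused
      }
      ```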
  4. 15 Dec 2019 (2 commits)
  5. 14 Dec 2019 (10 commits)
    • Do not schedule memtable trimming if there is no history (#6177) · bd8404fe
      Levi Tamasi committed
      Summary:
      We have observed an increase in CPU load caused by frequent calls to
      `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
      when using `max_write_buffer_size_to_maintain` to limit the amount of
      memtable history maintained for transaction conflict checking. Part of the issue
      is that trimming can potentially be scheduled even if there is no memtable
      history. The patch adds a check that fixes this.
      
      See also https://github.com/facebook/rocksdb/pull/6169.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6177
      
      Test Plan:
      Compared `perf` output for
      
      ```
      ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
      ```
      
      before and after the change. There is a significant reduction for the call chain
      `rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` ->
      `rocksdb::ThreadLocalPtr::StaticMeta::Scrape` even without https://github.com/facebook/rocksdb/pull/6169.
      
      Differential Revision: D19057445
      
      Pulled By: ltamasi
      
      fbshipit-source-id: dff81882d7b280e17eda7d9b072a2d4882c50f79
    • CancelAllBackgroundWork before Close in db stress (#6174) · 349bd3ed
      Maysam Yabandeh committed
      Summary:
      Close asserts that there are no unreleased snapshots. For WritePrepared transactions, this means that the background work that holds on to a snapshot must be canceled first. The stress tests are updated to respect this sequence.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6174
      
      Test Plan:
      ```
      make -j32 crash_test
      ```
      
      Differential Revision: D19057322
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: c9e9e24f779bbfb0ab72c2717e34576c01bc6362
    • Env should also load the native library (#6167) · edbf0e2d
      Adam Retter committed
      Summary:
      Closes https://github.com/facebook/rocksdb/issues/6118
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6167
      
      Differential Revision: D19053577
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 86aca9a5bec0947a641649b515da17b3cb12bdde
    • Make it possible to enable periodic compactions for BlobDB (#6172) · 0d2172f1
      Levi Tamasi committed
      Summary:
      Periodic compactions ensure that even SSTs that do not get picked up
      otherwise eventually go through compaction; used in conjunction with
      BlobDB's garbage collection, they enable BlobDB to reclaim space when
      old blob files are used by such straggling SSTs.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6172
      
      Test Plan: Ran `make check` and used the BlobDB mode of `db_bench`.
      
      Differential Revision: D19045045
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 04636ecc4b6cfe8d495bf656faa65d54a5eb1a93
    • Introduce a new storage specific Env API (#5761) · afa2420c
      anand76 committed
      Summary:
      The current Env API encompasses both storage/file operations and OS-related operations. Most of the APIs return a Status, which does not carry enough metadata about an error, such as whether it is retryable or not, the scope (i.e. fault domain) of the error, etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, and hinting about placement and redundancy.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
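      The composition described above, one object forwarding OS calls to an Env and storage calls to a FileSystem, can be sketched with simplified stand-in interfaces (these are not the real RocksDB classes, just the delegation shape of CompositeEnvWrapper):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <string>

      // Simplified stand-ins: one OS-side interface, one storage-side one.
      struct Env {
        virtual ~Env() = default;
        virtual uint64_t NowMicros() = 0;  // OS-related operation
      };

      struct FileSystem {
        virtual ~FileSystem() = default;
        virtual std::string ReadFile(const std::string& fname) = 0;  // storage
      };

      // Presents a single Env-like surface while routing each call to the
      // appropriate backend, so the rest of the code needs no changes.
      class CompositeEnv : public Env {
       public:
        CompositeEnv(Env* env, FileSystem* fs) : env_(env), fs_(fs) {}
        uint64_t NowMicros() override { return env_->NowMicros(); }  // -> Env
        std::string ReadFile(const std::string& fname) {
          return fs_->ReadFile(fname);  // -> FileSystem
        }

       private:
        Env* env_;
        FileSystem* fs_;
      };
      ```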
    • Add useful idioms to Random API (OneInOpt, PercentTrue) (#6154) · 58d46d19
      Peter Dillinger committed
      Summary:
      And clean up related code, especially in stress test.
      
      (More clean up of db_stress_test_base.cc coming after this.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6154
      
      Test Plan: make check, make blackbox_crash_test for a bit
      
      Differential Revision: D18938180
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 524d27621b8dbb25f6dff40f1081e7c00630357e
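      The two idioms named in the title can be sketched as thin wrappers over a PRNG: OneInOpt(n) is true with probability 1/n but simply false when n <= 0 (handy when n comes from a disabled or optional flag), and PercentTrue(pct) is true pct% of the time. This is a hypothetical re-implementation for illustration, not the actual RocksDB Random code.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <random>

      class Random {
       public:
        explicit Random(uint32_t seed) : rng_(seed) {}
        uint32_t Uniform(uint32_t n) {  // uniform value in [0, n)
          return std::uniform_int_distribution<uint32_t>(0, n - 1)(rng_);
        }
        bool OneIn(uint32_t n) { return Uniform(n) == 0; }
        // Like OneIn, but treats n <= 0 as "never" instead of misbehaving.
        bool OneInOpt(int n) {
          return n > 0 && OneIn(static_cast<uint32_t>(n));
        }
        // True with probability pct / 100; clamps the degenerate cases.
        bool PercentTrue(int pct) {
          if (pct <= 0) return false;
          if (pct >= 100) return true;
          return static_cast<int>(Uniform(100)) < pct;
        }

       private:
        std::mt19937 rng_;
      };
      ```

      Helpers like these let stress-test code express "do X one time in flag_n" without sprinkling the zero/negative guards at every call site.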
    • Do not create/install new SuperVersion if nothing was deleted during memtable trim (#6169) · 6d54eb3d
      Levi Tamasi committed
      Summary:
      We have observed an increase in CPU load caused by frequent calls to
      `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
      when using `max_write_buffer_size_to_maintain` to limit the amount of
      memtable history maintained for transaction conflict checking. As it turns out,
      this is caused by the code creating and installing a new `SuperVersion` even if
      no memtables were actually trimmed. The patch adds a check to avoid this.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6169
      
      Test Plan:
      Compared `perf` output for
      
      ```
      ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
      ```
      
      before and after the change. With the fix, the call chain `rocksdb::DBImpl::TrimMemtableHistory` ->
      `rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape`
      no longer registers in the `perf` report.
      
      Differential Revision: D19031509
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 02686fce594e5b50eba0710e4b28a9b808c8aa20
    • cmake: do not build tests for Release build and cleanups (#5916) · ac304adf
      Kefu Chai committed
      Summary:
      fixes https://github.com/facebook/rocksdb/issues/2445
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5916
      
      Differential Revision: D19031236
      
      fbshipit-source-id: bc3107b6b25a01958677d7cb411b1f381aae91c6
    • Enable unordered_write in stress tests (#6164) · fec7302a
      Maysam Yabandeh committed
      Summary:
      With WritePrepared transactions configured with two_write_queues, unordered_write will offer the same guarantees as vanilla rocksdb and thus can be enabled in stress tests.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6164
      
      Test Plan:
      ```
      make -j32 crash_test_with_txn
      ```
      
      Differential Revision: D18991899
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: eece5e96b4169b67d7931e5c0afca88540a113e1
    • Move out valid blobs from the oldest blob files during compaction (#6121) · 583c6953
      Levi Tamasi committed
      Summary:
      The patch adds logic that relocates live blobs from the oldest N non-TTL
      blob files as they are encountered during compaction (assuming the BlobDB
      configuration option `enable_garbage_collection` is `true`), where N is defined
      as the number of immutable non-TTL blob files multiplied by the value of
      a new BlobDB configuration option called `garbage_collection_cutoff`.
      (The default value of this parameter is 0.25, that is, by default the valid blobs
      residing in the oldest 25% of immutable non-TTL blob files are relocated.)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121
      
      Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`.
      
      Differential Revision: D18785357
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e
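      The cutoff arithmetic above reduces to a simple product: with `garbage_collection_cutoff` c, the oldest N = num_immutable_non_ttl_files × c blob files are eligible. A sketch with a hypothetical helper name, assuming the fractional count is truncated:

      ```cpp
      #include <cassert>
      #include <cstddef>

      // Number of oldest immutable non-TTL blob files whose live blobs are
      // relocated during compaction (truncating any fractional file count).
      size_t FilesUnderCutoff(size_t num_immutable_non_ttl_files,
                              double cutoff) {
        return static_cast<size_t>(num_immutable_non_ttl_files * cutoff);
      }
      ```

      For example, with 8 immutable non-TTL blob files and the default cutoff of 0.25, the valid blobs in the oldest 2 files would be moved.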
  6. 13 Dec 2019 (9 commits)
  7. 12 Dec 2019 (5 commits)