1. May 10, 2019 (4 commits)
    • Add C bindings for LowerThreadPoolIO/CPUPriority (#5285) · 6451673f
      Jelte Fennema committed
      Summary:
      There were no C bindings for lowering thread pool priority. This adds those.
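
      A minimal usage sketch of the new bindings from the C API; the function names below are assumptions based on the existing rocksdb_env_* naming and should be checked against c.h as merged.
      ```
      #include <rocksdb/c.h>

      int main() {
        rocksdb_env_t* env = rocksdb_create_default_env();
        // Assumed names following the rocksdb_env_* pattern; verify against c.h.
        rocksdb_env_lower_thread_pool_io_priority(env);
        rocksdb_env_lower_thread_pool_cpu_priority(env);
        rocksdb_env_destroy(env);
        return 0;
      }
      ```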
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5285
      
      Differential Revision: D15290050
      
      Pulled By: siying
      
      fbshipit-source-id: b2ed94d0c39d27434ace2204829a242b53d0d67a
    • Merging iterator to avoid child iterator reseek for some cases (#5286) · 9fad3e21
      Siying Dong committed
      Summary:
      When a reseek happens in the merging iterator, reseeking a child iterator can be avoided if:
      (1) the iterator represents immutable data,
      (2) the reseek() target is larger than the current key, and
      (3) the current key of the child iterator is larger than the seek key,
      because it is guaranteed that the result will fall into the same position.
      
      This optimization will be useful for use cases where users keep seeking to keys nearby in ascending order.
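
      A hedged sketch of the check, not the actual MergingIterator code; the names and signature are illustrative only.
      ```
      #include <rocksdb/comparator.h>
      #include <rocksdb/slice.h>

      // A child iterator's reseek can be skipped when all three conditions from the
      // summary hold, since the child would land exactly where it already is.
      bool CanSkipChildReseek(bool child_is_immutable,
                              const rocksdb::Slice& seek_target,
                              const rocksdb::Slice& prev_position,
                              const rocksdb::Slice& child_current_key,
                              const rocksdb::Comparator* cmp) {
        return child_is_immutable &&                              // (1) immutable data
               cmp->Compare(seek_target, prev_position) > 0 &&    // (2) seeking forward
               cmp->Compare(child_current_key, seek_target) > 0;  // (3) child already past the target
      }
      ```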
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5286
      
      Differential Revision: D15283635
      
      Pulled By: siying
      
      fbshipit-source-id: 35f79ffd5ce3609146faa8cd55f2bfd733502f83
    • Fix bugs in FilePickerMultiGet (#5292) · 181bb43f
      anand76 committed
      Summary:
      This PR fixes a couple of bugs in FilePickerMultiGet that were causing db_stress test failures. The failures were caused by -
      1. Improper handling of a key that matches the user key portion of an L0 file's largest key. In this case, the curr_index_in_curr_level file index in L0 for that key was getting incremented, but batch_iter_ was not advanced. By design, all keys in a batch are supposed to be checked against an L0 file before advancing to the next L0 file. Not advancing to the next key in the batch was causing a double increment of curr_index_in_curr_level due to the same key being processed again.
      2. Improper handling of a key that matches the user key portion of the largest key in the last file of L1 and higher. This was resulting in a premature end to the processing of the batch for that level when the next key in the batch is a duplicate. Typically, the keys in MultiGet will not be duplicates, but it's good to handle that case correctly.
      
      Test -
      asan_crash
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5292
      
      Differential Revision: D15282530
      
      Pulled By: anand1976
      
      fbshipit-source-id: d1a6a86e0af273169c3632db22a44d79c66a581f
    • DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244) · 25d81e45
      Siying Dong committed
      Summary:
      Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden from the user. However, if the previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
      
      We could extend it from the seqnum 0 case to simply prev_seqno >= current_seqno. However, that is less robust against bugs or unexpected situations, while the gain is relatively low. We can always extend it later when needed.
      
      In a readseq benchmark against a fully formed LSM-tree, the number of key comparisons is reduced from 2.981 to 2.165. In readseq against a fully compacted DB, no key comparisons are made at all. Performance in this benchmark didn't show an obvious improvement, which is expected because key comparisons take only a small percentage of CPU. The change may turn out to be more effective for users with an expensive customized comparator.
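
      An illustrative sketch of the shortcut; the function and parameter names are not the real DBIter members.
      ```
      #include <cstdint>
      #include <rocksdb/comparator.h>
      #include <rocksdb/slice.h>

      // If the previous entry's sequence number is 0 (the oldest possible), the next
      // internal key cannot be an older version of the same user key, so the
      // potentially expensive user key comparison can be skipped.
      bool StartsNewUserKey(const rocksdb::Comparator* ucmp, uint64_t prev_seqno,
                            const rocksdb::Slice& prev_user_key,
                            const rocksdb::Slice& curr_user_key) {
        if (prev_seqno == 0) {
          return true;  // no comparison needed
        }
        return !ucmp->Equal(prev_user_key, curr_user_key);
      }
      ```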
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
      
      Differential Revision: D15067257
      
      Pulled By: siying
      
      fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
  2. May 9, 2019 (1 commit)
  3. May 8, 2019 (1 commit)
    • avoid updating index type during iterator creation (#5288) · eea1cad8
      Zhongyi Xie committed
      Summary:
      Right now there is a potential race condition when two threads are created to iterate through the DB (https://gist.github.com/miasantreble/88f5798a397ee7cb8e7baff9db2d9e85). The problem is that in `BlockBasedTable::NewIndexIterator`, if both threads fail to find index_reader in the block cache, they will call `CreateIndexReader->UpdateIndexType()`, which creates a race to update `index_type` in the shared rep_ object. By checking the code, we realized that the index type is always populated by `PrefetchIndexAndFilterBlocks` during the table `Open` call, so there is no need to update the index type every time during iterator creation. This PR attempts to fix the race condition by removing the unnecessary call to `UpdateIndexType`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5288
      
      Differential Revision: D15252509
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 6e3258652121d5c76d267f7ac457e15c5e84756e
  4. May 7, 2019 (1 commit)
  5. May 4, 2019 (3 commits)
    • Refresh snapshot list during long compactions (2nd attempt) (#5278) · 6a40ee5e
      Maysam Yabandeh committed
      Summary:
      Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, a compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list.
      For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction shares the same snapshot list, and ii) when Range Delete is used, in which case the range delete aggregator has its own copy of the snapshot list.
      This fixes the range-delete issue that caused https://github.com/facebook/rocksdb/pull/5099 to be reverted.
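
      A hedged usage sketch: the option name `snap_refresh_nanos` is an assumption about how this feature is surfaced and is not stated in the summary above; treat both the name and semantics as illustrative.
      ```
      #include <rocksdb/options.h>

      // Assumed knob controlling how often a long compaction re-fetches the snapshot
      // list; 0 would disable the refresh.
      rocksdb::Options MakeOptionsWithSnapshotRefresh() {
        rocksdb::Options options;
        options.snap_refresh_nanos = 100ULL * 1000 * 1000;  // refresh roughly every 100ms of compaction work
        return options;
      }
      ```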
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
      
      Differential Revision: D15203291
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
    • multiget: fix memory issues due to vector auto resizing (#5279) · 5d27d65b
      Zhongyi Xie committed
      Summary:
      This PR fixes three memory issues found by ASAN
      * in db_stress, the key vector for MultiGet is created using `emplace_back`, which could potentially invalidate references to the underlying storage (vector<string>) due to auto resizing. Fix by calling reserve in advance (see the sketch after this list).
      * Similar issue in construction of GetContext autovector in version_set.cc
      * In multiget_context.h use T[] specialization for unique_ptr that holds a char array
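
      A minimal sketch of the first fix above; this is an illustrative helper, not the actual db_stress code. Reserving the vector's final size keeps the Slices pointing at stable string storage.
      ```
      #include <string>
      #include <vector>
      #include <rocksdb/slice.h>

      std::vector<rocksdb::Slice> BuildKeySlices(const std::vector<std::string>& raw_keys,
                                                 std::vector<std::string>* key_storage) {
        // The fix: reserve up front so emplace_back never reallocates the storage
        // that the Slices below point into.
        key_storage->reserve(raw_keys.size());
        std::vector<rocksdb::Slice> key_slices;
        key_slices.reserve(raw_keys.size());
        for (const auto& k : raw_keys) {
          key_storage->emplace_back(k);
          key_slices.emplace_back(key_storage->back());  // Slice references the stored string
        }
        return key_slices;
      }
      ```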
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5279
      
      Differential Revision: D15202893
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 14cc2cda0ed64d29f2a1e264a6bfdaa4294ee75d
    • fix implicit conversion error reported by clang check (#5277) · 3e994809
      Zhongyi Xie committed
      Summary:
      fix the following clang check errors
      ```
      tools/db_stress.cc:3609:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32]
          int num_keys = rand_keys.size();
              ~~~~~~~~   ~~~~~~~~~~^~~~~~
      tools/db_stress.cc:3888:30: error: implicit conversion loses integer precision: 'std::vector::size_type' (aka 'unsigned long') to 'int' [-Werror,-Wshorten-64-to-32]
          int num_keys = rand_keys.size();
              ~~~~~~~~   ~~~~~~~~~~^~~~~~
      2 errors generated.
      make: *** [tools/db_stress.o] Error 1
      ```
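
      An assumed shape of the fix (the merged change may differ): make the 64-to-32 narrowing explicit so -Wshorten-64-to-32 is satisfied.
      ```
      #include <string>
      #include <vector>

      void Example(const std::vector<std::string>& rand_keys) {
        int num_keys = static_cast<int>(rand_keys.size());  // explicit, intentional narrowing
        (void)num_keys;
      }
      ```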
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5277
      
      Differential Revision: D15196620
      
      Pulled By: miasantreble
      
      fbshipit-source-id: d56b1420d4a9f1df875fc52877a5fbb342bc7cae
  6. May 3, 2019 (1 commit)
  7. May 2, 2019 (5 commits)
  8. May 1, 2019 (5 commits)
  9. April 30, 2019 (3 commits)
    • Disable pipelined write in atomic flush stress test (#5266) · 210b49ca
      Yanqin Jin committed
      Summary:
      Since pipelined write currently allows one thread to perform memtable writes
      while another thread is traversing the `flush_scheduler_`, it can cause an
      assertion failure in `FlushScheduler::Clear`. To unblock crash recovery tests,
      we temporarily disable pipelined write when atomic flush is enabled.
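
      A sketch of the option combination described above; these are the public DBOptions fields, but the exact stress-test wiring is assumed.
      ```
      #include <rocksdb/options.h>

      rocksdb::Options MakeAtomicFlushStressOptions() {
        rocksdb::Options options;
        options.atomic_flush = true;
        options.enable_pipelined_write = false;  // avoid the FlushScheduler::Clear assertion for now
        return options;
      }
      ```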
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5266
      
      Differential Revision: D15142285
      
      Pulled By: riversand963
      
      fbshipit-source-id: a0c20fe4ac543e08feaed602414f982054df7831
    • CMake has stock FindZLIB in upper case. (#5261) · 18864567
      Tongliang Liao committed
      Summary:
      More details in https://cmake.org/cmake/help/v3.14/module/FindZLIB.html
      
      This resolves the CMake configuration error of not finding `Findzlib` on Linux (CentOS 7 + cmake 3.14.3 + gcc-8).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5261
      
      Differential Revision: D15138052
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 2f4445f49a36c16e6f1e05c090018c02379c0de4
    • Fix a bug when trigger atomic flush and close db (#5254) · 35e6ba73
      Yanqin Jin committed
      Summary:
      With atomic flush, RocksDB background flush will flush memtables of a column family up to the largest memtable id in the immutable memtable list. This can introduce a bug in the following scenario. A user thread inserts into a column family until the memtable is full and triggers a flush. This adds the column family to flush_scheduler_. Then the user thread writes another record to the column family. In the PreprocessWrite function, the user thread picks the column family from flush_scheduler_ and schedules a flush request. The flush request guarantees to flush all the memtables up to the current largest memtable ID of the immutable memtable list. Then the user thread writes new data to the newly-created active memtable. After the write returns, the user thread closes the db. This can cause an assertion failure when the background flush thread tries to install the superversion for the column family. The solution is to not install flush results if the db has already set `shutting_down_` to true.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5254
      
      Differential Revision: D15124149
      
      Pulled By: riversand963
      
      fbshipit-source-id: 0a667a41339dedb5a18bcb01b0bf11c275c04df0
  10. April 27, 2019 (2 commits)
    • Improve explicit user readahead performance (#5246) · 3548e422
      Sagar Vemuri committed
      Summary:
      Improve iterator performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size` (usage sketch after the list below).
      
      1. Stop creating new table readers when the user explicitly sets readahead size.
      2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
      3. Add `readahead_size` to db_bench.
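
      A hedged usage sketch of the explicit-readahead path this change optimizes; the 1 MB value matches one of the benchmarked configurations below.
      ```
      #include <memory>
      #include <rocksdb/db.h>
      #include <rocksdb/options.h>

      void ScanWithReadahead(rocksdb::DB* db) {
        rocksdb::ReadOptions ro;
        ro.readahead_size = 1 << 20;  // 1 MB, now served via an internal prefetch buffer
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // consume it->key() / it->value()
        }
      }
      ```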
      
      **Benchmarks:**
      https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
      
      For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
      For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
      
      **Test Plan:**
      Updated `DBIteratorTest.ReadAhead` test to make sure that:
      - no new table readers are created for iterators when ReadOptions.readahead_size is set
      - at least the configured readahead number of bytes are actually being read on each iterator read
      
      TODO later:
      - Use similar logic for compactions as well.
      - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAccessFile later.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
      
      Differential Revision: D15107946
      
      Pulled By: sagar0
      
      fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
    • Fix ubsan failure in snapshot refresh (#5257) · 8c7eb598
      Maysam Yabandeh committed
      Summary:
      The newly added test CompactionJobTest.SnapshotRefresh sets the snapshot refresh period to 0 to stress the feature. This results in a large number of refresh events, which in turn results in a UBSAN failure when a bitwise shift operand goes beyond the width of uint64_t.
      The patch fixes that by simplifying the shift logic to shift by only 2 bits after each refresh. Furthermore, it verifies that the shift operation does not result in decreasing the refresh period.
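
      An illustrative version of the simplified backoff described above; this is not the actual compaction-job code.
      ```
      #include <cstdint>

      uint64_t NextRefreshPeriod(uint64_t current_period) {
        uint64_t next = current_period << 2;  // grow by only 2 bits (x4) after each refresh
        // Never let an overflowed shift decrease the refresh period.
        return next > current_period ? next : current_period;
      }
      ```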
      
      Testing:
      COMPILE_WITH_UBSAN=1 make -j32 compaction_job_test
      ./compaction_job_test --gtest_filter=CompactionJobTest.SnapshotRefresh
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5257
      
      Differential Revision: D15106463
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f2718898ea7ba4fa9f7e87b70cf98fe647c0de80
  11. April 26, 2019 (4 commits)
    • Refresh snapshot list during long compactions (#5099) · 506e8448
      Maysam Yabandeh committed
      Summary:
      Part of compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, a compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
      
      Differential Revision: D15086710
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
    • Option string/map/file can set env from object registry (#5237) · 6eb317bb
      Andrew Kryczka committed
      Summary:
      - By providing the "env" field in any text-based options (i.e., string, map, or file), we can use `NewCustomObject` to deserialize the text value into an actual `Env` object (usage sketch after this list).
      - Currently, factory functions for `Env` registered with the object registry should only return pointers to static `Env` objects. That's because `DBOptions::env` is a raw pointer, so we cannot easily delegate cleanup.
      - Note I did not add `env` to `db_option_type_info`. It wasn't needed for (de)serialization, and I believe we don't want to do verification on `env`, even by checking the name. That's because the user should be able to copy their DB from Linux to Windows, change envs, and not see an option verification error.
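
      A hedged sketch of selecting a registered Env from an options string; `custom-env://instance` is a placeholder name, and the 2019-era `GetDBOptionsFromString` overload is assumed.
      ```
      #include <string>
      #include <rocksdb/convenience.h>
      #include <rocksdb/options.h>

      rocksdb::Status LoadOptionsWithCustomEnv(const rocksdb::DBOptions& base,
                                               rocksdb::DBOptions* out) {
        // "env=<registered name>" asks the object registry to produce the Env object.
        return rocksdb::GetDBOptionsFromString(base, "env=custom-env://instance", out);
      }
      ```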
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5237
      
      Differential Revision: D15056360
      
      Pulled By: siying
      
      fbshipit-source-id: 4b5f0b83297a5058f8949ec955dbf27d98d73d7e
    • add missing rocksdb_flush_cf in c (#5243) · 084a3c69
      niukuo committed
      Summary:
      Same as #5229.
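
      A hedged usage sketch; the prototype is assumed to mirror rocksdb_flush() with an extra column family handle, so check c.h in the merged revision for the exact signature.
      ```
      #include <rocksdb/c.h>

      void FlushOneColumnFamily(rocksdb_t* db, rocksdb_column_family_handle_t* cf) {
        char* err = nullptr;
        rocksdb_flushoptions_t* fo = rocksdb_flushoptions_create();
        rocksdb_flushoptions_set_wait(fo, 1);  // block until the flush finishes
        rocksdb_flush_cf(db, fo, cf, &err);    // flush just this column family
        rocksdb_flushoptions_destroy(fo);
        // if err was set, the caller is responsible for handling and freeing it
      }
      ```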
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5243
      
      Differential Revision: D15082800
      
      Pulled By: siying
      
      fbshipit-source-id: f4a68a480db0e40e1ba7cf37e18b88e43dff7c08
    • Close WAL files before deletion (#5233) · da96f2fe
      Yanqin Jin committed
      Summary:
      Currently one thread in RocksDB keeps a WAL file open while another thread
      deletes it. Although the first thread never writes to the WAL again, it still
      tries to close it in the end. This is fine on POSIX, but can be problematic on
      other platforms, e.g. HDFS, where it will either cause a lot of warning messages or
      throw exceptions. The solution is to let the second thread close the WAL before deleting it.
      
      RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233
      
      Differential Revision: D15032670
      
      Pulled By: riversand963
      
      fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754
  12. April 25, 2019 (3 commits)
    • update history.md (#5245) · 66d8360b
      Zhongyi Xie committed
      Summary:
      update history.md for `BottommostLevelCompaction::kForceOptimized` to mention possible user impact.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5245
      
      Differential Revision: D15073712
      
      Pulled By: miasantreble
      
      fbshipit-source-id: d40f698c42e8a6368be4eac0a00d02279615edea
    • Don't call FindObsoleteFiles() in ~ColumnFamilyHandleImpl() if CF is not dropped (#5238) · cd77d3c5
      Mike Kolupaev committed
      Summary:
      We have a DB with ~4k column families and ~70k files. On shutdown, destroying the 4k ColumnFamilyHandle-s takes over 2 minutes. Most of this time is spent in VersionSet::AddLiveFiles() called from FindObsoleteFiles() from ~ColumnFamilyHandleImpl(). It's just iterating over the list of files in memory. This seems completely unnecessary as no obsolete files are actually found since the CFs are not even dropped. This PR fixes that.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5238
      
      Differential Revision: D15056342
      
      Pulled By: siying
      
      fbshipit-source-id: 2aa342ef3770b4aa384ce81f8768e485480e4f08
    • secondary instance: add support for WAL tailing on `OpenAsSecondary` · aa56b7e7
      Zhongyi Xie committed
      Summary: PR https://github.com/facebook/rocksdb/pull/4899 implemented the general framework for RocksDB secondary instances. This PR adds support for WAL tailing in `OpenAsSecondary`, which means that after the `OpenAsSecondary` call, the secondary is now able to see the primary's writes that are yet to be flushed. The secondary can see the primary's writes in the WAL up to the moment the `OpenAsSecondary` call starts.
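
      A hedged usage sketch of the secondary-instance API this change extends; the paths are placeholders.
      ```
      #include <string>
      #include <rocksdb/db.h>
      #include <rocksdb/options.h>

      rocksdb::Status OpenSecondary(rocksdb::DB** db) {
        rocksdb::Options options;
        options.max_open_files = -1;  // a secondary typically keeps all table files open
        // The secondary keeps its own info log and OPTIONS under the scratch path.
        return rocksdb::DB::OpenAsSecondary(options, "/path/to/primary_db",
                                            "/path/to/secondary_scratch", db);
      }
      ```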
      
      Differential Revision: D15059905
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 44f71f548a30b38179a7940165e138f622de1f10
  13. April 24, 2019 (4 commits)
  14. April 23, 2019 (3 commits)
    • Fix compilation errors for 32bits/LITE/ios build. (#5220) · 78a6e07c
      Yuchi Chen committed
      Summary:
      When I build RocksDB for a 32bits/LITE/iOS environment, errors like the following occur.
      
      ```
      table/block_based_table_reader.cc:971:44: error: implicit conversion loses integer precision: 'uint64_t'
            (aka 'unsigned long long') to 'size_t' (aka 'unsigned long') [-Werror,-Wshorten-64-to-32]
          size_t block_size = props_block_handle.size();
                 ~~~~~~~~~~   ~~~~~~~~~~~~~~~~~~~^~~~~~
      
      ./util/file_reader_writer.h:177:8: error: private field 'env_' is not used [-Werror,-Wunused-private-field]
        Env* env_;
             ^
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5220
      
      Differential Revision: D15023481
      
      Pulled By: siying
      
      fbshipit-source-id: 1b5d121d3016f2b0a8a9a2cc1bd638479357f9f7
    • Log file_creation_time table property (#5232) · 47fd5748
      Sagar Vemuri committed
      Summary:
      Log file_creation_time table property when a new table file is created.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5232
      
      Differential Revision: D15033069
      
      Pulled By: sagar0
      
      fbshipit-source-id: aaac56a4c03a8f96c338cad1b0cdb7fbfb887647
    • Optionally wait on bytes_per_sync to smooth I/O (#5183) · 8272a6de
      Andrew Kryczka committed
      Summary:
      The existing implementation does not guarantee bytes reach disk every `bytes_per_sync` when writing SST files, or every `wal_bytes_per_sync` when writing WALs. This can cause confusing behavior for users who enable this feature to avoid large syncs during flush and compaction, but then end up hitting them anyways.
      
      My understanding of the existing behavior is we used `sync_file_range` with `SYNC_FILE_RANGE_WRITE` to submit ranges for async writeback, such that we could continue processing the next range of bytes while that I/O is happening. I believe we can preserve that benefit while also limiting how far the processing can get ahead of the I/O, which prevents huge syncs from happening when the file finishes.
      
      Consider this `sync_file_range` usage: `sync_file_range(fd_, 0, static_cast<off_t>(offset + nbytes), SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE)`. Expanding the range to start at 0 and adding the `SYNC_FILE_RANGE_WAIT_BEFORE` flag causes any pending writeback (like from a previous call to `sync_file_range`) to finish before it proceeds to submit the latest `nbytes` for writeback. The latest `nbytes` are still written back asynchronously, unless processing exceeds I/O speed, in which case the following `sync_file_range` will need to wait on it.
      
      There is a second change in this PR to use `fdatasync` when `sync_file_range` is unavailable (determined statically) or has some known problem with the underlying filesystem (determined dynamically).
      
      The above two changes only apply when the user enables a new option, `strict_bytes_per_sync`.
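
      A sketch of enabling the behavior described above; `bytes_per_sync` and `wal_bytes_per_sync` already existed, and `strict_bytes_per_sync` is the new option named in this summary.
      ```
      #include <rocksdb/options.h>

      rocksdb::Options MakeSmoothSyncOptions() {
        rocksdb::Options options;
        options.bytes_per_sync = 1 << 20;      // request writeback every 1 MB of SST writes
        options.wal_bytes_per_sync = 1 << 20;  // same cadence for the WAL
        options.strict_bytes_per_sync = true;  // bound how far writes may run ahead of completed writeback
        return options;
      }
      ```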
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5183
      
      Differential Revision: D14953553
      
      Pulled By: siying
      
      fbshipit-source-id: 445c3862e019fb7b470f9c7f314fc231b62706e9