1. Feb 14, 2020 · 1 commit
    • Fix flaky test DecreaseNumBgThreads (#6393) · 46516778
      Committed by Cheng Chang
      Summary:
      The DecreaseNumBgThreads test keeps failing on Windows in AppVeyor.
      It fails because it depends on a timed wait for the tasks to be dequeued from the thread pool's internal queue, but within the specified time the tasks might not have been scheduled onto the newly created threads.
      https://github.com/facebook/rocksdb/pull/6232 tries to fix this by waiting longer for the threads to be scheduled.
      This PR instead replaces the timed wait with synchronization on the task's internal condition variable, as sketched below.
      When the number of threads increases, instead of guessing the time needed for the task to be scheduled, the test blocks on the condition variable until the task starts running.
      When the number of threads is reduced, the test still does a timed wait; this no longer causes flakiness, but a future PR will try to remove these timed waits as well.
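      A minimal sketch of the synchronization pattern described above (illustrative only, not the actual RocksDB test code; the StartSignal helper is hypothetical):
      ```
      #include <condition_variable>
      #include <mutex>

      // Hypothetical helper: the task marks itself running and notifies;
      // the test blocks until the thread pool has actually started the task.
      struct StartSignal {
        std::mutex mu;
        std::condition_variable cv;
        bool running = false;

        void MarkRunning() {  // called from inside the task body
          std::lock_guard<std::mutex> lock(mu);
          running = true;
          cv.notify_all();
        }

        void WaitUntilRunning() {  // called from the test thread
          std::unique_lock<std::mutex> lock(mu);
          cv.wait(lock, [this] { return running; });
        }
      };
      ```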
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6393
      
      Test Plan: Wait to see whether AppVeyor tests pass.
      
      Differential Revision: D19890928
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: 4e56e4addf625c98c0876e62d9d57a6f0a156f76
  2. Feb 8, 2020 · 1 commit
    • Allow readahead when reading option files. (#6372) · 876c2dbf
      Committed by sdong
      Summary:
      Currently, when reading option files, no readahead is used and an 8KB buffer is used. This can introduce high latency if the file system has high latency and does not do readahead itself. Instead, introduce readahead for the file. When called from inside the DB, infer the value from options.log_readahead_size. Otherwise, a default 512KB readahead size is used.
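      A hedged configuration sketch, assuming the DBOptions::log_readahead_size member that the db_bench flag below maps to (defaults and exact semantics may vary by version):
      ```
      #include <rocksdb/db.h>
      #include <rocksdb/options.h>

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // Readahead size RocksDB uses when reading its own log/option files;
        // larger values help on high-latency file systems with no readahead.
        options.log_readahead_size = 512 * 1024;  // 512KB
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
        delete db;
        return s.ok() ? 0 : 1;
      }
      ```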
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372
      
      Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace.
      
      Differential Revision: D19727739
      
      fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583
  3. Dec 14, 2019 · 1 commit
    • Introduce a new storage specific Env API (#5761) · afa2420c
      Committed by anand76
      Summary:
      The current Env API encompasses both storage/file operations and OS-related operations. Most of the APIs return a Status, which does not carry enough metadata about an error, such as whether it is retryable or the scope (i.e., fault domain) of the error, that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeouts, prioritization, and hints about placement and redundancy.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates the ```IOStatus``` return code to a ```Status``` and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. A hedged sketch of the new interface follows.
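      A hedged sketch of reading a file through the new interface, following the names in this description (exact signatures may differ across RocksDB versions):
      ```
      #include <rocksdb/file_system.h>

      #include <memory>
      #include <string>

      // Reads the first bytes of a file via FileSystem; calls return IOStatus
      // (carrying, e.g., retryability metadata) and take IOOptions per call.
      rocksdb::IOStatus ReadHead(const std::shared_ptr<rocksdb::FileSystem>& fs,
                                 const std::string& fname) {
        std::unique_ptr<rocksdb::FSSequentialFile> file;
        rocksdb::IOStatus s = fs->NewSequentialFile(
            fname, rocksdb::FileOptions(), &file, /*dbg=*/nullptr);
        if (!s.ok()) {
          return s;  // e.g. s.GetRetryable() can drive recovery decisions
        }
        char buf[4096];
        rocksdb::Slice result;
        return file->Read(sizeof(buf), rocksdb::IOOptions(), &result, buf,
                          /*dbg=*/nullptr);
      }
      ```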
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
  4. Jul 17, 2019 · 1 commit
  5. Jul 10, 2019 · 1 commit
  6. Jun 4, 2019 · 1 commit
  7. May 31, 2019 · 2 commits
  8. Nov 14, 2018 · 1 commit
    • Backup engine support for direct I/O reads (#4640) · ea945470
      Committed by Andrew Kryczka
      Summary:
      Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes.
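      A hedged usage sketch, using the backupable_db.h interface current at the time of this commit (the backup API has since been renamed):
      ```
      #include <rocksdb/db.h>
      #include <rocksdb/env.h>
      #include <rocksdb/utilities/backupable_db.h>

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.use_direct_reads = true;  // backups now read SST files with direct I/O

        rocksdb::DB* db = nullptr;
        rocksdb::DB::Open(options, "/tmp/db", &db);

        rocksdb::BackupEngine* backup_engine = nullptr;
        rocksdb::BackupEngine::Open(rocksdb::Env::Default(),
                                    rocksdb::BackupableDBOptions("/tmp/db_backups"),
                                    &backup_engine);
        backup_engine->CreateNewBackup(db);  // checksum and copy reads honor use_direct_reads
        delete backup_engine;
        delete db;
        return 0;
      }
      ```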
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640
      
      Differential Revision: D13015268
      
      Pulled By: ajkr
      
      fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269
  9. Nov 10, 2018 · 1 commit
    • Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Committed by Sagar Vemuri
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
  10. Sep 6, 2018 · 1 commit
  11. Aug 24, 2018 · 1 commit
  12. Aug 15, 2018 · 1 commit
  13. Jun 21, 2018 · 1 commit
  14. Jun 5, 2018 · 1 commit
    • Extend some tests to format_version=3 (#3942) · d0c38c0c
      Committed by Maysam Yabandeh
      Summary:
      format_version=3 changes the format of the SST index. This is, however, not currently tested, since tests only run with the default format_version, which is currently 2. The patch extends the most relevant tests to also cover format_version=3.
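      A hedged sketch of opting a table into the format under test, via the standard BlockBasedTableOptions knob:
      ```
      #include <rocksdb/options.h>
      #include <rocksdb/table.h>

      rocksdb::Options MakeFormatV3Options() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.format_version = 3;  // the default at the time was 2
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```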
      Closes https://github.com/facebook/rocksdb/pull/3942
      
      Differential Revision: D8238413
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
  15. May 26, 2018 · 1 commit
    • Exclude seq from index keys · 402b7aa0
      Committed by Maysam Yabandeh
      Summary:
      Index blocks have the same format as data blocks. Their keys, like the keys in data blocks, are therefore internal keys, which means that in addition to the user key they also carry 8 bytes encoding the sequence number and value type. These extra 8 bytes are, however, unnecessary in index blocks, since index keys only act as separators between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which case the sequence number is required to act as a separator.
      The patch excludes the sequence number from index keys only if the above special case does not occur for any of the index keys. It then records that fact in the property block, and the reader consults the property block to see whether it should expect sequence numbers in the keys of the index blocks.
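      For illustration, a hedged sketch of the internal-key layout described above: the user key followed by 8 bytes packing the sequence number and value type (appended as a little-endian fixed64). This mirrors the layout only, not RocksDB's actual helpers:
      ```
      #include <cstdint>
      #include <string>

      std::string MakeInternalKey(const std::string& user_key, uint64_t seq,
                                  uint8_t value_type) {
        uint64_t packed = (seq << 8) | value_type;  // 56-bit seqno, 8-bit type
        std::string ikey = user_key;
        for (int i = 0; i < 8; ++i) {
          ikey.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
        }
        return ikey;
      }
      ```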
      Closes https://github.com/facebook/rocksdb/pull/3894
      
      Differential Revision: D8118775
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
  16. May 22, 2018 · 1 commit
    • Assert keys/values pinned by range deletion meta-block iterators · 7b655214
      Committed by Andrew Kryczka
      Summary:
      `RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()`, so it requires that the data they point to stays pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, the restart interval is set to one). I think `BlockIter::value()` is always pinned. Added asserts for these and removed the wrong TODO about increasing the restart interval, which would enable key prefix encoding and break the assertion.
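      For context, a hedged illustration of the restart-interval knob mentioned above:
      ```
      #include <rocksdb/table.h>

      rocksdb::BlockBasedTableOptions MakePinningFriendlyOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        // A restart interval of 1 stores every key in full (no prefix
        // encoding), which keeps BlockIter::key() pointing at stable
        // block memory as described above.
        table_options.block_restart_interval = 1;
        return table_options;
      }
      ```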
      Closes https://github.com/facebook/rocksdb/pull/3875
      
      Differential Revision: D8063667
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
  17. Mar 6, 2018 · 1 commit
  18. Feb 23, 2018 · 2 commits
  19. Sep 12, 2017 · 1 commit
    • Make InternalKeyComparator final and directly use it in merging iterator · 64b6452e
      Committed by Siying Dong
      Summary:
      Merging iterator invokes InternalKeyComparator.Compare() frequently for heap merging. By making InternalKeyComparator final and having the merging iterator use InternalKeyComparator directly rather than through a virtual comparator interface, we give the compiler a chance to avoid one extra virtual function call where possible. I ran the readseq benchmark in a memory-only use case to make sure the performance at least doesn't regress.
      
      I had to disable the final keyword in debug builds, as a test hack class depends on overriding the class.
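      A hedged illustration (not RocksDB code) of why final helps here: when the static type of the object is final, the compiler may devirtualize the call:
      ```
      class ComparatorBase {
       public:
        virtual ~ComparatorBase() = default;
        virtual int Compare(int a, int b) const = 0;
      };

      class ConcreteComparator final : public ComparatorBase {
       public:
        int Compare(int a, int b) const override { return a - b; }
      };

      int UseDirectly(const ConcreteComparator& cmp) {
        // The static type is final, so Compare() can be called non-virtually.
        return cmp.Compare(1, 2);
      }
      ```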
      Closes https://github.com/facebook/rocksdb/pull/2860
      
      Differential Revision: D5800461
      
      Pulled By: siying
      
      fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53
  20. Jul 22, 2017 · 2 commits
  21. Jul 16, 2017 · 1 commit
  22. Apr 28, 2017 · 1 commit
  23. Oct 19, 2016 · 1 commit
    • Support SST files with Global sequence numbers [reland] · b88f8e87
      Committed by Islam AbdelRahman
      Summary:
      reland https://reviews.facebook.net/D62523
      
      - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
      - Update TableProperties to be aware of the offset of each property in the file
      - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
      
      Something worth mentioning: we don't update the seqno in the index block when doing a binary search. The reason is that SST files with a global seqno are guaranteed to contain each user_key only once, with seqno=0 encoded in every key; such a key is greater than any other key with seqno > 0, so we can keep the current logic for these blocks.
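      A hedged sketch of looking up the property named above through the table-properties API (the access pattern is illustrative):
      ```
      #include <rocksdb/db.h>
      #include <rocksdb/table_properties.h>

      #include <iostream>

      void PrintGlobalSeqnoFiles(rocksdb::DB* db) {
        rocksdb::TablePropertiesCollection props;
        db->GetPropertiesOfAllTables(&props);
        for (const auto& entry : props) {
          const auto& user_props = entry.second->user_collected_properties;
          if (user_props.count("rocksdb.external_sst_file.global_seqno") > 0) {
            std::cout << entry.first << " carries a global seqno property\n";
          }
        }
      }
      ```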
      
      Test Plan: unit tests
      
      Reviewers: sdong, yhchiang
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D65211
  24. Oct 8, 2016 · 1 commit
  25. Oct 4, 2016 · 2 commits
    • Fix Mac build · 9d6c9613
      Committed by Islam AbdelRahman
    • Support SST files with Global sequence numbers · ab01da54
      Committed by Islam AbdelRahman
      Summary:
      - Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
      - Update TableProperties to be aware of the offset of each property in the file
      - Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
      
      Something worth mentioning: we don't update the seqno in the index block when doing a binary search. The reason is that SST files with a global seqno are guaranteed to contain each user_key only once, with seqno=0 encoded in every key; such a key is greater than any other key with seqno > 0, so we can keep the current logic for these blocks.
      
      Test Plan: unit tests
      
      Reviewers: andrewkr, yhchiang, yiwu, sdong
      
      Reviewed By: sdong
      
      Subscribers: hcz, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D62523
  26. Sep 28, 2016 · 1 commit
    • Add SeekForPrev() to Iterator · f517d9dd
      Committed by Aaron Gao
      Summary:
      Add a new Iterator API, `SeekForPrev`: find the last key <= the target key (usage sketch below).
      - supports prefix_extractor
      - supports prefix_same_as_start
      - supports upper_bound
      - not supported in iterators without Prev()
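      A hedged usage sketch of the new API:
      ```
      #include <rocksdb/db.h>

      #include <memory>

      void FindFloor(rocksdb::DB* db, const rocksdb::Slice& target) {
        std::unique_ptr<rocksdb::Iterator> it(
            db->NewIterator(rocksdb::ReadOptions()));
        // Unlike Seek() (first key >= target), SeekForPrev() positions
        // the iterator at the last key <= target.
        it->SeekForPrev(target);
        if (it->Valid()) {
          // it->key() is the largest key not greater than target.
        }
      }
      ```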
      
      Also add tests in db_iter_test and db_iterator_test
      
      Pass all tests
      Cheers!
      
      Test Plan: make all check -j64
      
      Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D64149
  27. Sep 8, 2016 · 1 commit
  28. Jul 21, 2016 · 1 commit
    • Introduce FullMergeV2 (eliminate memcpy from merge operators) · 68a8e6b8
      Committed by Islam AbdelRahman
      Summary:
      This diff updates the code to pin the merge operator's operands while the merge operation is in progress, so that we can eliminate the memcpy cost. To do that we need a new public API for FullMerge that replaces std::deque<std::string> with std::vector<Slice> (a hedged sketch of the resulting interface follows the list below).
      
      This diff is stacked on top of D56493 and D56511
      
      In this diff we
      - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
      - Replace std::deque<std::string> with std::vector<Slice> to pass operands
      - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
      - Allow FullMergeV2 output to be an existing operand
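      A hedged sketch of a merge operator on the new interface (a max operator similar to the one benchmarked below; details follow the description above and may differ from the shipped API):
      ```
      #include <rocksdb/merge_operator.h>

      class MaxOperator : public rocksdb::MergeOperator {
       public:
        bool FullMergeV2(const MergeOperationInput& merge_in,
                         MergeOperationOutput* merge_out) const override {
          // Operands arrive as Slices pinned for the duration of the merge.
          const rocksdb::Slice* max = merge_in.existing_value;
          for (const rocksdb::Slice& op : merge_in.operand_list) {
            if (max == nullptr || op.compare(*max) > 0) {
              max = &op;
            }
          }
          if (max == nullptr) {
            return false;  // nothing to merge
          }
          // The output may reference an existing operand: no memcpy needed.
          merge_out->existing_operand = *max;
          return true;
        }

        const char* Name() const override { return "MaxOperator"; }
      };
      ```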
      
      ```
      [Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]
      
      DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
      
      [FullMergeV2]
      readseq      :       0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
      readseq      :       0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
      readseq      :       0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
      readseq      :       0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
      readseq      :       0.247 micros/op 4043927 ops/sec; 39553.2 MB/s
      
      [master]
      readseq      :       3.935 micros/op 254140 ops/sec; 2485.7 MB/s
      readseq      :       3.722 micros/op 268657 ops/sec; 2627.7 MB/s
      readseq      :       3.149 micros/op 317605 ops/sec; 3106.5 MB/s
      readseq      :       3.125 micros/op 320024 ops/sec; 3130.1 MB/s
      readseq      :       4.075 micros/op 245374 ops/sec; 2400.0 MB/s
      ```
      
      ```
      [Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]
      
      DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
      
      [FullMergeV2]
      readseq      :       3.472 micros/op 288018 ops/sec; 2817.1 MB/s
      readseq      :       2.304 micros/op 434027 ops/sec; 4245.2 MB/s
      readseq      :       1.163 micros/op 859845 ops/sec; 8410.0 MB/s
      readseq      :       1.192 micros/op 838926 ops/sec; 8205.4 MB/s
      readseq      :       1.250 micros/op 800000 ops/sec; 7824.7 MB/s
      
      [master]
      readseq      :      24.025 micros/op 41623 ops/sec;  407.1 MB/s
      readseq      :      18.489 micros/op 54086 ops/sec;  529.0 MB/s
      readseq      :      18.693 micros/op 53495 ops/sec;  523.2 MB/s
      readseq      :      23.621 micros/op 42335 ops/sec;  414.1 MB/s
      readseq      :      18.775 micros/op 53262 ops/sec;  521.0 MB/s
      
      ```
      
      ```
      [Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]
      
      [FullMergeV2]
      $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
      readseq      :      14.741 micros/op 67837 ops/sec;  663.5 MB/s
      readseq      :       1.029 micros/op 971446 ops/sec; 9501.6 MB/s
      readseq      :       0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
      readseq      :       0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
      readseq      :       0.943 micros/op 1060657 ops/sec; 10374.2 MB/s
      
      [master]
      readseq      :      16.735 micros/op 59755 ops/sec;  584.5 MB/s
      readseq      :       3.029 micros/op 330151 ops/sec; 3229.2 MB/s
      readseq      :       3.136 micros/op 318883 ops/sec; 3119.0 MB/s
      readseq      :       3.065 micros/op 326245 ops/sec; 3191.0 MB/s
      readseq      :       3.014 micros/op 331813 ops/sec; 3245.4 MB/s
      ```
      
      ```
      [Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]
      
      DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
      
      [FullMergeV2]
      readseq      :      24.325 micros/op 41109 ops/sec;  402.1 MB/s
      readseq      :       1.470 micros/op 680272 ops/sec; 6653.7 MB/s
      readseq      :       1.231 micros/op 812347 ops/sec; 7945.5 MB/s
      readseq      :       1.091 micros/op 916590 ops/sec; 8965.1 MB/s
      readseq      :       1.109 micros/op 901713 ops/sec; 8819.6 MB/s
      
      [master]
      readseq      :      27.257 micros/op 36687 ops/sec;  358.8 MB/s
      readseq      :       4.443 micros/op 225073 ops/sec; 2201.4 MB/s
      readseq      :       5.830 micros/op 171526 ops/sec; 1677.7 MB/s
      readseq      :       4.173 micros/op 239635 ops/sec; 2343.8 MB/s
      readseq      :       4.150 micros/op 240963 ops/sec; 2356.8 MB/s
      ```
      
      Test Plan: COMPILE_WITH_ASAN=1 make check -j64
      
      Reviewers: yhchiang, andrewkr, sdong
      
      Reviewed By: sdong
      
      Subscribers: lovro, andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D57075
  29. May 23, 2016 · 1 commit
  30. Feb 10, 2016 · 1 commit
  31. Jan 27, 2016 · 1 commit
  32. Jan 9, 2016 · 1 commit
    • plain table reader: non-mmap mode to keep two recent buffers · 9a8e3f73
      Committed by sdong
      Summary: In the plain table reader's non-mmap mode, we only keep the most recent read buffer. For binary search, however, it is likely that we come back to a recently read location. To avoid an extra pread in such a case, we keep two read buffers. This should cover most of the cases.
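      A hedged illustration of the idea (not the actual reader code): keep the two most recently used buffers so a binary search that revisits a nearby offset skips the pread:
      ```
      #include <cstdint>
      #include <string>

      struct ReadBuffer {
        uint64_t offset = 0;
        std::string data;
        bool valid = false;
      };

      class TwoBufferCache {
       public:
        // Returns the cached buffer covering `offset`, or nullptr on a miss.
        ReadBuffer* Find(uint64_t offset) {
          for (ReadBuffer* b : {&bufs_[0], &bufs_[1]}) {
            if (b->valid && offset >= b->offset &&
                offset < b->offset + b->data.size()) {
              return b;
            }
          }
          return nullptr;
        }
        // Hands out the older buffer for the next read, evicting its contents.
        ReadBuffer* Evict() {
          ReadBuffer* victim = &bufs_[next_victim_];
          next_victim_ ^= 1;
          victim->valid = false;
          return victim;
        }

       private:
        ReadBuffer bufs_[2];
        int next_victim_ = 0;
      };
      ```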
      
      Test Plan:
      1. run tests
      2. check the optimization works through strace when running
      ./table_reader_bench -mmap_read=false --num_keys2=1 -num_keys1=5000 -table_factory=plain_table --iterator --through_db
      
      Reviewers: anthony, rven, kradhakrishnan, igor, yhchiang, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D51171
  33. Dec 10, 2015 · 1 commit
    • Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit · 56e77f09
      Committed by sdong
      Summary: Deprecate options.soft_rate_limit, which is hard to tune, in favor of options.soft_pending_compaction_bytes_limit, which triggers the write slowdown when estimated pending compaction bytes exceed the threshold. The hope is to make it more straightforward to tune.
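      A hedged configuration sketch of the replacement option:
      ```
      #include <rocksdb/options.h>

      rocksdb::Options MakeOptions() {
        rocksdb::Options options;
        // Slow down writes once estimated pending compaction bytes exceed this.
        options.soft_pending_compaction_bytes_limit = 64ull << 30;  // e.g. 64GB
        return options;
      }
      ```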
      
      Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.
      
      Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony
      
      Reviewed By: anthony
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D51117
  34. Dec 8, 2015 · 1 commit
    • Fix occasional failure of DBTest.DynamicCompactionOptions · 770dea93
      Committed by sdong
      Summary: DBTest.DynamicCompactionOptions occasionally fails during valgrind runs. We send a sleeping task to block the compaction thread pool, but we don't wait for it to start running.
      
      Test Plan: Run the test multiple times in an environment which can cause failure.
      
      Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D51687
  35. Nov 12, 2015 · 1 commit
    • Add OptionsUtil::LoadOptionsFromFile() API · e11f676e
      Committed by Yueh-Hsuan Chiang
      Summary:
      This patch adds OptionsUtil::LoadOptionsFromFile() and
      OptionsUtil::LoadLatestOptionsFromDB(), which allow developers
      to construct DBOptions and ColumnFamilyOptions from a RocksDB
      options file.  Note that most pointer-typed options such as
      merge_operator will not be constructed.
      
      With this API, developers no longer need to remember all the
      options in order to reopen an existing rocksdb instance like
      the following:
      
        DBOptions db_options;
        std::vector<std::string> cf_names;
        std::vector<ColumnFamilyOptions> cf_opts;
      
        // Load primitive-typed options from an existing DB
        OptionsUtil::LoadLatestOptionsFromDB(
            dbname, &db_options, &cf_names, &cf_opts);
      
        // Initialize necessary pointer-typed options
        cf_opts[0].merge_operator.reset(new MyMergeOperator());
        ...
      
        // Construct the vector of ColumnFamilyDescriptor
        std::vector<ColumnFamilyDescriptor> cf_descs;
        for (size_t i = 0; i < cf_opts.size(); ++i) {
          cf_descs.emplace_back(cf_names[i], cf_opts[i]);
        }
      
        // Open the DB
        DB* db = nullptr;
        std::vector<ColumnFamilyHandle*> cf_handles;
        auto s = DB::Open(db_options, dbname, cf_descs,
                          &cf_handles, &db);
      
      Test Plan:
      Augment existing tests in column_family_test
      options_test
      db_test
      
      Reviewers: igor, IslamAbdelRahman, sdong, anthony
      
      Reviewed By: anthony
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D49095
  36. Oct 20, 2015 · 1 commit
    • Fix iOS build · 4e07c99a
      Committed by Igor Canadi
      Summary: We don't yet have a CI build for iOS, so our iOS compile gets broken sometimes. Most of the errors come from the assumption that size_t is 64-bit, while it's actually 32-bit on some (all?) iOS platforms. This diff fixes the compile.
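      A hedged illustration of the class of fix involved (illustrative, not the actual diff):
      ```
      #include <cstddef>
      #include <cstdint>

      // Widen before multiplying so a 32-bit size_t (as on iOS) cannot overflow.
      uint64_t FileOffset(uint64_t base, size_t index, size_t entry_size) {
        return base + static_cast<uint64_t>(index) * entry_size;
      }
      ```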
      
      Test Plan:
      TARGET_OS=IOS make static_lib
      
      Observe there are no warnings
      
      Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D49029