1. 14 Feb 2020, 1 commit
    • Fix flaky test DecreaseNumBgThreads (#6393) · 46516778
      Committed by Cheng Chang
      Summary:
      The DecreaseNumBgThreads test keeps failing on Windows in AppVeyor.
      It fails because it depends on a timed wait for the tasks to be dequeued from the threadpool's internal queue, but within the specified time the task might not have been scheduled onto the newly created threads.
      https://github.com/facebook/rocksdb/pull/6232 tries to fix this by waiting longer so that the threads have time to be scheduled.
      This PR instead fixes it by replacing the timed wait with synchronization on the task's internal condition variable.
      When the number of threads increases, instead of guessing the time needed for the task to be scheduled, the test blocks directly on the condition variable until the task starts running.
      When the number of threads is reduced, it still does a timed wait, but this no longer causes the flakiness; a future PR will try to remove these remaining timed waits.
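      A minimal sketch of this synchronization pattern (hypothetical names, not the actual test code):

      ```cpp
      #include <condition_variable>
      #include <mutex>

      // Sketch of the pattern described above: the test blocks until the
      // task has actually started running on a pool thread, instead of
      // guessing how long scheduling will take.
      struct TaskStartedSignal {
        std::mutex mu;
        std::condition_variable cv;
        bool started = false;

        // First action of the task once a pool thread picks it up.
        void MarkStarted() {
          std::lock_guard<std::mutex> lock(mu);
          started = true;
          cv.notify_all();
        }

        // Replaces the timed wait: returns only after the task has started.
        void WaitUntilStarted() {
          std::unique_lock<std::mutex> lock(mu);
          cv.wait(lock, [this] { return started; });
        }
      };
      ```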
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6393
      
      Test Plan: Wait to see whether AppVeyor tests pass.
      
      Differential Revision: D19890928
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: 4e56e4addf625c98c0876e62d9d57a6f0a156f76
  2. 08 Feb 2020, 1 commit
    • Allow readahead when reading option files. (#6372) · 876c2dbf
      Committed by sdong
      Summary:
      Currently, when reading option files, no readahead is used and an 8KB buffer is used. This can introduce high latency if the file system has high latency and doesn't do readahead itself. Instead, introduce readahead when reading the file. When called from inside the DB, the value is inferred from options.log_readahead; otherwise, a default 512KB readahead size is used.
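      A sketch of the size-selection logic described above (hypothetical helper, not the actual RocksDB internals; it assumes the DBOptions::log_readahead_size field):

      ```cpp
      #include <cstddef>
      #include "rocksdb/options.h"

      // Illustrative sketch: pick the readahead size for reading an
      // option file, as described in the summary above.
      size_t OptionFileReadaheadSize(const rocksdb::DBOptions* db_options) {
        constexpr size_t kDefaultReadahead = 512 * 1024;  // 512KB default
        // When called from inside the DB, reuse the configured log readahead.
        if (db_options != nullptr && db_options->log_readahead_size > 0) {
          return db_options->log_readahead_size;
        }
        return kDefaultReadahead;
      }
      ```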
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372
      
      Test Plan: Add --log_readahead_size to db_bench. Run it with several option settings and observe the read sizes from option files using strace.
      
      Differential Revision: D19727739
      
      fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583
  3. 04 Feb 2020, 1 commit
    • Add an option to prevent DB::Open() from querying sizes of all sst files (#6353) · 637e64b9
      Committed by Mike Kolupaev
      Summary:
      When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness: if the filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents in addition to sizes, and RocksDB doesn't check contents on open.
      
      If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of the local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to skip GetFileSize(); instead, it calls GetChildren() on the parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
      
      We can't just disable paranoid_checks instead, because paranoid_checks does a few other important things: making the DB read-only on write errors, printing error messages on read errors, etc.
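      A sketch of the presence check described above (illustrative, not the actual DBImpl::CheckConsistency() code):

      ```cpp
      #include <string>
      #include <unordered_set>
      #include <vector>
      #include "rocksdb/env.h"

      // Illustrative sketch: one GetChildren() call on the parent directory
      // replaces a GetFileSize() call per sst file; only presence is checked.
      rocksdb::Status CheckFilesPresent(rocksdb::Env* env, const std::string& dir,
                                        const std::vector<std::string>& expected) {
        std::vector<std::string> children;
        rocksdb::Status s = env->GetChildren(dir, &children);
        if (!s.ok()) {
          return s;
        }
        std::unordered_set<std::string> present(children.begin(), children.end());
        for (const std::string& file : expected) {
          if (present.find(file) == present.end()) {
            return rocksdb::Status::Corruption("missing sst file: " + file);
          }
        }
        return rocksdb::Status::OK();
      }
      ```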
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
      
      Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
      
      Differential Revision: D19656425
      
      Pulled By: al13n321
      
      fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
  4. 11 Jan 2020, 1 commit
  5. 03 Jan 2020, 1 commit
    • Prevent an incompatible combination of options (#6254) · 48a678b7
      Committed by Maysam Yabandeh
      Summary:
      allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led stress tests to fail when this combination is tried in ::SetOptions. The patch fixes that.
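      A sketch of the kind of check involved (illustrative, not the exact sanitization code in the patch):

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/status.h"

      // Illustrative sketch: reject the incompatible combination up front
      // instead of letting it fail at runtime.
      rocksdb::Status ValidateMemtableOptions(
          const rocksdb::DBOptions& db_opts,
          const rocksdb::ColumnFamilyOptions& cf_opts) {
        if (db_opts.allow_concurrent_memtable_write &&
            cf_opts.max_successive_merges > 0) {
          return rocksdb::Status::InvalidArgument(
              "allow_concurrent_memtable_write is incompatible with "
              "non-zero max_successive_merges");
        }
        return rocksdb::Status::OK();
      }
      ```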
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254
      
      Differential Revision: D19265819
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179
  6. 14 Dec 2019, 1 commit
    • Introduce a new storage specific Env API (#5761) · afa2420c
      Committed by anand76
      Summary:
      The current Env API encompasses both storage/file operations and OS-related operations. Most of the APIs return a Status, which does not carry enough metadata about an error, such as whether it is retryable, the scope (i.e., fault domain) of the error, etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy, etc.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates ```IOStatus``` return codes to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
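      A sketch of the configuration described above, following the field names in this summary (hedged, since the public API may differ across releases; the remote FileSystem implementation is hypothetical):

      ```cpp
      #include <memory>
      #include "rocksdb/db.h"
      #include "rocksdb/env.h"
      #include "rocksdb/file_system.h"

      // Sketch following the description above (field names taken from this
      // summary; remote_fs is a hypothetical user-provided FileSystem).
      rocksdb::Status OpenWithCustomFileSystem(
          std::shared_ptr<rocksdb::FileSystem> remote_fs, rocksdb::DB** db) {
        rocksdb::Options options;
        options.env = rocksdb::Env::Default();  // OS-related operations
        options.file_system = remote_fs;        // storage operations
        // During DB::Open, options.env is replaced with a CompositeEnvWrapper
        // that redirects storage calls to the FileSystem.
        return rocksdb::DB::Open(options, "/path/to/db", db);
      }
      ```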
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
  7. 14 Nov 2019, 1 commit
    • New Bloom filter implementation for full and partitioned filters (#6007) · f059c7d9
      Committed by Peter Dillinger
      Summary:
      Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
      
      Speed
      
      The improved speed, at least on recent x86_64, comes from
      * Using fastrange instead of modulo (%); see the sketch after this list
      * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
      * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
      * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
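      For reference, a minimal sketch of the fastrange technique from the first bullet (the general idea rather than RocksDB's exact code):

      ```cpp
      #include <cstdint>

      // Fastrange: map a 32-bit hash onto [0, range) with a multiply and a
      // shift, avoiding the slow integer division that modulo (%) requires.
      inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
        return static_cast<uint32_t>(
            (static_cast<uint64_t>(hash) * static_cast<uint64_t>(range)) >> 32);
      }
      ```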
      
      Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
      
      $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
      Build avg ns/key: 47.7135
      Mixed inside/outside queries...
        Single filter net ns/op: 26.2825
        Random filter net ns/op: 150.459
          Average FP rate %: 0.954651
      $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
      Build avg ns/key: 47.2245
      Mixed inside/outside queries...
        Single filter net ns/op: 63.2978
        Random filter net ns/op: 188.038
          Average FP rate %: 1.13823
      
      Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
      
      The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
      
      Accuracy
      
      The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
      within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
      
      Accuracy data (generalizes, except old impl gets worse with millions of keys):
      Memory bits per key: FP rate percent old impl -> FP rate percent new impl
      6: 5.70953 -> 5.69888
      8: 2.45766 -> 2.29709
      10: 1.13977 -> 0.959254
      12: 0.662498 -> 0.411593
      16: 0.353023 -> 0.0873754
      24: 0.261552 -> 0.0060971
      50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
      
      Fixes https://github.com/facebook/rocksdb/issues/5857
      Fixes https://github.com/facebook/rocksdb/issues/4120
      
      Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
      
      Compatibility
      
      Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
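      A minimal sketch of the early opt-in via the block-based table options (a configuration example, not code from this PR):

      ```cpp
      #include "rocksdb/filter_policy.h"
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      // Opt in to the new Bloom filter by requesting format_version=5.
      rocksdb::Options NewBloomOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.format_version = 5;  // enables the new implementation
        table_options.filter_policy.reset(
            rocksdb::NewBloomFilterPolicy(10 /* bits_per_key */));
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```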
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
      
      Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
      
      Differential Revision: D18294749
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
  8. 21 Sep 2019, 1 commit
  9. 17 Sep 2019, 1 commit
    • Divide file_reader_writer.h and .cc (#5803) · b931f84e
      Committed by sdong
      Summary:
      file_reader_writer.h and .cc contain several classes and helper functions, and they are hard to navigate. Separate them into multiple files and put them under file/
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803
      
      Test Plan: Build whole project using make and cmake.
      
      Differential Revision: D17374550
      
      fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
  10. 24 Aug 2019, 1 commit
    • Refactor trimming logic for immutable memtables (#5022) · 2f41ecfe
      Committed by Zhongyi Xie
      Summary:
      MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of the memtables. When memtables are small, it may not keep enough history; when memtables are large, it may consume too much memory.
      We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain`, and it takes precedence over the old `max_write_buffer_number_to_maintain` if both are set to non-zero values. The new option accounts for the total memory usage of the flushed immutable memtables and the mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (also called trimming history), starting from the oldest one.
      The old option actually works as both an upper bound and a lower bound: history trimming starts once the number of immutable memtables exceeds the limit, but the count never goes below (limit-1) due to trimming.
      In order to mimic that behavior with the new option, history trimming stops if dropping the next immutable memtable would cause the total memory usage to go below the size limit. For example, assume the size limit is set to 64MB and there are 3 immutable memtables with sizes of 20, 30, and 30MB. Although the total memory usage of 80MB exceeds 64MB, dropping the oldest memtable would reduce usage to 60MB, below the limit, so in this case no memtable is dropped.
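      A minimal sketch of the new configuration (illustrative; the worked numbers repeat the example above):

      ```cpp
      #include "rocksdb/options.h"

      // Cap memtable history by memory usage instead of by count.
      rocksdb::Options MemtableHistoryOptions() {
        rocksdb::Options options;
        // Per the example above: with a 64MB limit and immutable memtables
        // of 20, 30, and 30MB (80MB total), nothing is dropped, because
        // dropping the oldest would take usage to 60MB, below the limit.
        options.max_write_buffer_size_to_maintain = 64 << 20;  // 64MB
        // The old count-based option is superseded when both are non-zero:
        options.max_write_buffer_number_to_maintain = 0;
        return options;
      }
      ```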
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
      
      Differential Revision: D14394062
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
  11. 17 Jul 2019, 1 commit
  12. 10 Jul 2019, 1 commit
  13. 25 Jun 2019, 1 commit
    • Add an option to put first key of each sst block in the index (#5289) · b4d72094
      Committed by Mike Kolupaev
      Summary:
      The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
      
      Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
      
      So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
      
      Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
      
      This PR does the deferred block loading inside the value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch the deferred value and get the status. I'll do that in a separate PR.
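      A sketch of how a user might opt in (the index type name kBinarySearchWithFirstKey is my assumption about how the option surfaces in BlockBasedTableOptions; it is not stated in this summary):

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      // Illustrative opt-in; the enum value name is assumed, not taken
      // from the summary above.
      rocksdb::Options FirstKeyIndexOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.index_type =
            rocksdb::BlockBasedTableOptions::kBinarySearchWithFirstKey;
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```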
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
      
      Differential Revision: D15256423
      
      Pulled By: al13n321
      
      fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
  14. 22 Jun 2019, 1 commit
  15. 07 Jun 2019, 1 commit
  16. 04 Jun 2019, 1 commit
  17. 01 Jun 2019, 1 commit
  18. 31 May 2019, 2 commits