1. 16 Apr, 2020 1 commit
    • Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Mike Kolupaev committed
      Summary:
      Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
      
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      
      Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
      
      Reviewed By: siying
      
      Differential Revision: D20786930
      
      Pulled By: al13n321
      
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
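The calling convention described in the summary above can be sketched with a toy iterator. This is not RocksDB code: `ToyIterator`, `ToyStatus`, `ReadValue`, and the `fail_load` flag are hypothetical names used only for illustration; `PrepareValue()`, `value()`, and `allow_unprepared_value` are the only names taken from the summary.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a status object; not the RocksDB Status class.
struct ToyStatus {
  bool ok = true;
  std::string msg;
};

// Toy iterator positioned on one entry whose key came from the index and
// whose value lives in a data block that has not been read yet.
class ToyIterator {
 public:
  ToyIterator(bool allow_unprepared_value, bool fail_load)
      : fail_load_(fail_load) {
    // Without allow_unprepared_value the value is loaded eagerly, so such
    // call sites never need to call PrepareValue() themselves.
    if (!allow_unprepared_value) PrepareValue();
  }

  // Loads the deferred value. On failure, records the error, invalidates
  // the iterator, and returns false.
  bool PrepareValue() {
    if (prepared_) return true;
    if (fail_load_) {
      status_.ok = false;
      status_.msg = "IO error: could not read data block";
      valid_ = false;
      return false;
    }
    value_ = "loaded-value";
    prepared_ = true;
    return true;
  }

  bool Valid() const { return valid_; }
  const ToyStatus& status() const { return status_; }
  const std::string& key() const { return key_; }

  // Only meaningful once the value has been prepared.
  const std::string& value() const {
    assert(prepared_);
    return value_;
  }

 private:
  bool fail_load_;
  bool prepared_ = false;
  bool valid_ = true;
  ToyStatus status_;
  std::string key_ = "first-key-from-index";
  std::string value_;
};

// A call site that opted in to deferred loading: it checks PrepareValue()
// before touching value() and reports the error otherwise.
std::pair<bool, std::string> ReadValue(ToyIterator& it) {
  if (!it.Valid()) return {false, it.status().msg};
  if (!it.PrepareValue()) return {false, it.status().msg};
  return {true, it.value()};
}
```

In this sketch, a call site that passes `allow_unprepared_value = false` keeps using `value()` directly, which mirrors why most existing call sites did not need to change.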
  2. 11 Apr, 2020 2 commits
    • Compaction with timestamp: input boundaries (#6645) · 0c05624d
      Yanqin Jin committed
      Summary:
      Towards making compaction logic compatible with user timestamp.
      When computing boundaries and overlapping ranges for inputs of compaction, we need to compare SSTs by user key without timestamp.
      
      Test plan (devserver):
      ```
      make check
      ```
      Several individual tests:
      ```
      ./version_set_test --gtest_filter=VersionStorageInfoTimestampTest.GetOverlappingInputs
      ./db_with_timestamp_compaction_test
      ./db_with_timestamp_basic_test
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6645
      
      Reviewed By: ltamasi
      
      Differential Revision: D20960012
      
      Pulled By: riversand963
      
      fbshipit-source-id: ad377fa9eb481bf7a8a3e1824aaade48cdc653a4
    • make iterator return versions between timestamp bounds (#6544) · 9e89ffb7
      Huisheng Liu committed
      Summary:
      (Based on Yanqin's idea) Add a new field in `ReadOptions` as the lower timestamp bound for the iterator. When the parameter is not supplied (nullptr), the iterator returns the latest visible version of a record. When it is supplied, the existing timestamp field is the upper bound. Together the two serve as a bounded time window. The iterator returns all versions of a record falling in the window.
      
      SeekRandom perf test (10 minutes) on the same development machine RAM drive with the same DB data shows no regression (within margin of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
      baseline (commit e860f884):
      seekrandom   : 7.836 micros/op 4082449 ops/sec; (0 of 73481999 found)
      This PR:
      seekrandom   : 7.764 micros/op 4120935 ops/sec; (0 of 71303999 found)
      
      db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=seekrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6544
      
      Reviewed By: ltamasi
      
      Differential Revision: D20844069
      
      Pulled By: riversand963
      
      fbshipit-source-id: d97f2bf38a323c8c6a68db213b2d3c694b1c1f74
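The window semantics in the summary above can be illustrated with a small model. This is a sketch, not the RocksDB implementation: `Version` and `Scan` are hypothetical names, timestamps are modeled as plain integers, and treating both bounds as inclusive is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One version of a record (hypothetical model type).
struct Version {
  std::string key;
  uint64_t ts;
  std::string value;
};

// `sorted` must be ordered by key ascending, then timestamp descending.
// `lower_ts == nullptr` models the "parameter not supplied" case.
std::vector<Version> Scan(const std::vector<Version>& sorted,
                          const uint64_t* lower_ts, uint64_t upper_ts) {
  assert(lower_ts == nullptr || *lower_ts <= upper_ts);
  std::vector<Version> out;
  for (const Version& v : sorted) {
    if (v.ts > upper_ts) continue;  // Newer than the read timestamp.
    if (lower_ts == nullptr) {
      // No lower bound: keep only the newest visible version per key.
      if (out.empty() || out.back().key != v.key) out.push_back(v);
    } else if (v.ts >= *lower_ts) {
      // Lower bound given: keep every version inside the window.
      out.push_back(v);
    }
  }
  return out;
}
```

With versions of `a` at timestamps 30, 20, and 10 and an upper bound of 25, the unbounded scan yields only the version at 20, while a lower bound of 10 yields the versions at 20 and 10.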
  3. 07 Mar, 2020 1 commit
    • Iterator with timestamp (#6255) · d93812c9
      Yanqin Jin committed
      Summary:
      Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.
      
      Create an iterator with timestamp.
      ```
      ...
      read_opts.timestamp = &ts;
      auto* iter = db->NewIterator(read_opts);
      // target is key without timestamp.
      for (iter->Seek(target); iter->Valid(); iter->Next()) {}
      for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
      delete iter;
      read_opts.timestamp = &ts1;
      // lower_bound and upper_bound are without timestamp.
      read_opts.iterate_lower_bound = &lower_bound;
      read_opts.iterate_upper_bound = &upper_bound;
      auto* iter1 = db->NewIterator(read_opts);
      // Do Seek or SeekToFirst()
      delete iter1;
      ```
      
      Test plan (dev server)
      ```
      $make check
      ```
      
      Simple benchmarking (dev server)
      1. The overhead introduced by this PR even when timestamp is disabled.
      key size: 16 bytes
      value size: 100 bytes
      Entries: 1000000
      Data reside in main memory, and try to stress iterator.
      Repeated three times on master and this PR.
      - Seek without next
      ```
      ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
      ```
      master: 159047.0 ops/sec
      this PR: 158922.3 ops/sec (about 0.08% drop in throughput)
      - Seek and next 10 times
      ```
      ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
      ```
      master: 109539.3 ops/sec
      this PR: 107519.7 ops/sec (2% drop in throughput)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
      
      Differential Revision: D19438227
      
      Pulled By: riversand963
      
      fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
  4. 22 Feb, 2020 1 commit
  5. 21 Feb, 2020 1 commit
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong committed
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a way to solve the problem, the RocksDB namespace is changed to a flag that can be overridden at build time.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
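A minimal sketch of the macro-based namespace pattern the summary describes. The default name `rocksdb` and the macro name come from the commit title; `LibraryName` and `CallThroughMacro` are hypothetical functions that exist only to show a symbol living in the configurable namespace.

```cpp
#include <cassert>
#include <string>

// Default the namespace name unless the build overrides it, for example
// with -DROCKSDB_NAMESPACE=my_rocksdb on the compiler command line.
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

namespace ROCKSDB_NAMESPACE {
// Hypothetical function, present only to put a symbol in the namespace.
inline std::string LibraryName() { return "toy-rocksdb"; }
}  // namespace ROCKSDB_NAMESPACE

// Callers spell the namespace through the macro too, so the same source
// compiles whichever name was chosen at build time.
inline std::string CallThroughMacro() {
  return ROCKSDB_NAMESPACE::LibraryName();
}
```

Two builds compiled with different values of the macro export differently named symbols, which is what lets them coexist in one process.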
  6. 29 Jan, 2020 1 commit
    • Add ReadOptions.auto_prefix_mode (#6314) · 8f2bee67
      sdong committed
      Summary:
      Add a new option ReadOptions.auto_prefix_mode. When set to true, the iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extractor changes: (1) the reverse iterator should not rely on the upper bound to determine the prefix. Fix it by skipping the prefix check. (2) The block-based filter is not handled properly.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314
      
      Test Plan: (1) add a unit test; (2) add the check to the stress test and run it to see whether it can pass at least one run.
      
      Differential Revision: D19458717
      
      fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
  7. 20 Nov, 2019 1 commit
  8. 17 Sep, 2019 1 commit
    • Improve readability of DBIter's two seek functions (#5794) · 6287f0d7
      sdong committed
      Summary:
      Doing some code reordering in DBIter::Seek() and DBIter::SeekForPrev().
      The logic largely remains the same, except for a slight difference in how some stats are handled when valid_ = false, where they are not supposed to be used anyway.
      Also remove prefix_start_key_, which sometimes points to a part of the seek target and sometimes to prefix_start_buf_, which is confusing.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5794
      
      Test Plan: Run all tests.
      
      Differential Revision: D17375257
      
      fbshipit-source-id: 7339a23898cecd3a8475bf72340fcd6f82b933c5
  9. 14 Sep, 2019 1 commit
  10. 12 Sep, 2019 1 commit
  11. 23 Jul, 2019 1 commit
    • WriteUnPrepared: improve read your own write functionality (#5573) · eae83274
      Manuel Ung committed
      Summary:
      There are a number of fixes in this PR (with most bugs found via the added stress tests):
      1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
      2. Verify that ReadOptions.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snapshot validation would not be enforced, breaking some assumptions in Prev() iteration.
      3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
      4. Add stress test to test reading own writes.
      5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
      6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
      7. Add some more comments everywhere.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573
      
      Differential Revision: D16276089
      
      Pulled By: lth
      
      fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
  12. 03 Jul, 2019 1 commit
  13. 12 Jun, 2019 1 commit
  14. 08 Jun, 2019 1 commit
  15. 04 Jun, 2019 1 commit
  16. 01 Jun, 2019 1 commit
  17. 31 May, 2019 1 commit
  18. 30 May, 2019 1 commit
  19. 18 May, 2019 1 commit
  20. 10 May, 2019 1 commit
    • DBIter::Next() can skip user key checking if previous entry's seqnum is 0 (#5244) · 25d81e45
      Siying Dong committed
      Summary:
      Right now, DBIter::Next() always checks whether an entry is for the same user key as the previous entry to see whether the key should be hidden to the user. However, if previous entry's sequence number is 0, the check is not needed because 0 is the oldest possible sequence number.
      
      We could extend it from the seqnum 0 case to simply prev_seqno >= current_seqno. However, that is less robust against bugs or unexpected situations, while the gain is relatively low. We can always extend it later when needed.
      
      In a readseq benchmark with a fully formed LSM-tree, the number of key comparisons is reduced from 2.981 to 2.165. In readseq against a fully compacted DB, no key comparison is made. Performance in this benchmark didn't show obvious improvement, which is expected because key comparisons only take a small percentage of CPU. But it may turn out to be more effective if users have an expensive customized comparator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5244
      
      Differential Revision: D15067257
      
      Pulled By: siying
      
      fbshipit-source-id: b7e1ef3ec4fa928cba509683d2b3246e35d270d9
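The seqnum-0 shortcut from the summary above can be modeled as follows. This is a simplified sketch rather than the `DBIter::Next()` code: `Entry`, `ScanResult`, and `ScanNewestVersions` are hypothetical names, and the model assumes each user key has at most one entry per sequence number, with entries ordered by user key ascending and sequence number descending.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified internal entry (hypothetical model type).
struct Entry {
  std::string user_key;
  uint64_t seq;
};

struct ScanResult {
  std::vector<std::string> visible_keys;
  int key_comparisons = 0;
};

// Emits the newest version of each user key and counts how many user key
// comparisons were needed to hide the older versions.
ScanResult ScanNewestVersions(const std::vector<Entry>& entries,
                              bool skip_check_after_seq_zero) {
  ScanResult r;
  for (std::size_t i = 0; i < entries.size(); ++i) {
    bool same_as_prev = false;
    if (i > 0) {
      const Entry& prev = entries[i - 1];
      if (skip_check_after_seq_zero && prev.seq == 0) {
        // Nothing can be older than seq 0, so this entry must belong to a
        // new user key; no comparison is needed.
      } else {
        ++r.key_comparisons;
        same_as_prev = (prev.user_key == entries[i].user_key);
      }
    }
    if (!same_as_prev) r.visible_keys.push_back(entries[i].user_key);
  }
  return r;
}
```

In a fully compacted model where every entry has sequence number 0, the shortcut removes every comparison, which matches the observation in the summary.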
  21. 24 Apr, 2019 1 commit
    • DBIter to use IteratorWrapper for inner iterator (#5214) · 72c8533f
      Siying Dong committed
      Summary:
      It's hard to get DBIter to directly use InternalIterator::NextAndGetResult() because the code change would be complicated. Instead, use IteratorWrapper, where Next() is already using NextAndGetResult(). The performance gain is hard to measure because it is small and there is variation. I ran readseq many times, and there seems to be a 1% gain.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5214
      
      Differential Revision: D15003635
      
      Pulled By: siying
      
      fbshipit-source-id: 17af1965c409c2fe90cd85037fbd2c5a1364f82a
  22. 20 Apr, 2019 1 commit
    • Add some "inline" annotation to DBIter functions (#5217) · 7a73adda
      Siying Dong committed
      Summary:
      My compiler doesn't inline DBIter::Next() into the arena wrapped iterator, even though it is a direct forward. Adding this annotation makes it inlined. It might not always work, but inlining this function into the arena wrapped iterator always feels like the right decision.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5217
      
      Differential Revision: D15004086
      
      Pulled By: siying
      
      fbshipit-source-id: a4cffd79c6fb092669a3a90633c9aa5e494f8a66
  23. 19 Apr, 2019 2 commits
    • Some small code changes to improve Next() (#5200) · 01cfea66
      Siying Dong committed
      Summary:
      Several small changes for Next():
      1. Reducing branching by always updating local_stats_.next_count_++ even if statistics is null. This should be faster than branching.
      2. Replacing ResetInternalKeysSkippedCounter() in Next() because the valid_ check is not needed in this case.
      3. iter_->Valid() should always be true for the non-merge case. Remove this check.
      4. Adding an inline annotation. It ended up not being picked up by my compiler, but it shouldn't hurt.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5200
      
      Differential Revision: D15000391
      
      Pulled By: siying
      
      fbshipit-source-id: be97f61c708968234fb8e5cf272b5c2ac07dc4dd
    • Introduce InternalIteratorBase::NextAndGetResult() (#5197) · 992dfc78
      Siying Dong committed
      Summary:
      In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
      Also did some inline tricks and added "final" to a few functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final they are inlined at -O2 too. It doesn't hurt to add those final annotations.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197
      
      Differential Revision: D14945977
      
      Pulled By: siying
      
      fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
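The consolidation described in the summary above can be sketched with a toy class hierarchy. Only the method names `Next()`, `Valid()`, `key()`, and `NextAndGetResult()` come from the summary; `ToyIteratorBase`, `ToyVectorIterator`, `ToyIterateResult`, `CollectKeys`, and the exact signature of `NextAndGetResult` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical result struct; the field name is an assumption.
struct ToyIterateResult {
  std::string key;
};

class ToyIteratorBase {
 public:
  virtual ~ToyIteratorBase() = default;
  virtual bool Valid() const = 0;
  virtual void Next() = 0;
  virtual const std::string& key() const = 0;

  // Default implementation built from the three separate virtual calls.
  // Concrete iterators can override it to do the work in a single call.
  virtual bool NextAndGetResult(ToyIterateResult* result) {
    Next();
    bool is_valid = Valid();
    if (is_valid) result->key = key();
    return is_valid;
  }
};

class ToyVectorIterator final : public ToyIteratorBase {
 public:
  explicit ToyVectorIterator(std::vector<std::string> keys)
      : keys_(std::move(keys)) {}

  bool Valid() const override { return pos_ < keys_.size(); }
  void Next() override {
    assert(Valid());
    ++pos_;
  }
  const std::string& key() const override {
    assert(Valid());
    return keys_[pos_];
  }

  // Advance, check validity, and fetch the key behind one virtual call.
  bool NextAndGetResult(ToyIterateResult* result) override {
    ++pos_;
    if (pos_ >= keys_.size()) return false;
    result->key = keys_[pos_];
    return true;
  }

 private:
  std::vector<std::string> keys_;
  std::size_t pos_ = 0;
};

// A scan loop that pays one virtual call per step instead of three.
std::vector<std::string> CollectKeys(ToyIteratorBase& it) {
  std::vector<std::string> out;
  if (!it.Valid()) return out;
  out.push_back(it.key());
  ToyIterateResult r;
  while (it.NextAndGetResult(&r)) out.push_back(r.key);
  return out;
}
```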
  24. 03 Apr, 2019 1 commit
    • WriteUnPrepared: less virtual in iterator callback (#5049) · 14b3f683
      Maysam Yabandeh committed
      Summary:
      WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
      The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and makes use of that to statically initialize the read callback and thus avoid the virtual call.
      Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.
      
      The following benchmark shows +0.26% higher throughput in seekrandom benchmark.
      
      Benchmark:
      ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
      
      ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
      seekrandom [AVG    10 runs] : 20355 ops/sec;  225.2 MB/sec
      seekrandom [MEDIAN 10 runs] : 20425 ops/sec;  225.9 MB/sec
      
      ./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
      seekrandom [AVG    10 runs] : 20409 ops/sec;  225.8 MB/sec
      seekrandom [MEDIAN 10 runs] : 20487 ops/sec;  226.6 MB/sec
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049
      
      Differential Revision: D14366459
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
  25. 28 Mar, 2019 1 commit
    • Fix perf_context.user_key_comparison_count for range scan (#5098) · d6924158
      Yi Wu committed
      Summary:
      Currently `perf_context.user_key_comparison_count` is bumped only in `InternalKeyComparator`. In places where the user comparator is used directly, the counter is not bumped. This fixes the majority of them.
      
      Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098
      
      Differential Revision: D14603753
      
      Pulled By: siying
      
      fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
  26. 27 Mar, 2019 1 commit
  27. 22 Mar, 2019 1 commit
    • Reorder DBIter fields to reduce memory usage (#5078) · c84fad7a
      Maysam Yabandeh committed
      Summary:
      The patch reorders DBIter fields to put 1-byte fields together and let the compiler optimize the memory usage by using fewer 64-bit allocations for bools and enums.
      
      This might have a negative side effect of putting the variables that are accessed together into different cache lines and hence increasing the cache misses. Not sure what benchmark would verify that thought. I ran simple, single-threaded seekrandom benchmarks but the variance in the results is too much to be conclusive.
      
      ./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
      ./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5078
      
      Differential Revision: D14562676
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 2284655d46e079b6e9a860e94be5defb6f482167
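The effect of grouping 1-byte fields can be seen with two toy structs. The field names are hypothetical and unrelated to the actual DBIter members; the sketch only shows how padding changes with field order.

```cpp
#include <cassert>
#include <cstdint>

// 1-byte fields interleaved with 8-byte fields: each bool is typically
// followed by padding so the next uint64_t is aligned.
struct Interleaved {
  bool a;
  uint64_t x;
  bool b;
  uint64_t y;
  bool c;
};

// Same fields with the 1-byte members grouped at the end, so they can
// share a single aligned slot.
struct Grouped {
  uint64_t x;
  uint64_t y;
  bool a;
  bool b;
  bool c;
};

static_assert(sizeof(Grouped) <= sizeof(Interleaved),
              "grouping small fields should not grow the struct");
```

On a typical 64-bit ABI this comes out to 40 bytes for `Interleaved` and 24 bytes for `Grouped`, though the exact sizes depend on the platform's alignment rules.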
  28. 15 Feb, 2019 1 commit
    • Apply modernize-use-override (2nd iteration) · ca89ac2b
      Michael Liu committed
      Summary:
      Use C++11’s override and remove virtual where applicable.
      Changes are automatically generated.
      
      Reviewed By: Orvid
      
      Differential Revision: D14090024
      
      fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
  29. 18 Dec, 2018 2 commits
  30. 29 Nov, 2018 1 commit
    • Clean up FragmentedRangeTombstoneList (#4692) · 8fe1e06c
      Abhishek Madan committed
      Summary:
      Removed `one_time_use` flag, which removed the need for some
      tests, and changed all `NewRangeTombstoneIterator` methods to return
      `FragmentedRangeTombstoneIterators`.
      
      These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
      and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
      
      Differential Revision: D13106570
      
      Pulled By: abhimadan
      
      fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
  31. 22 Nov, 2018 1 commit
    • Introduce RangeDelAggregatorV2 (#4649) · 457f77b9
      Abhishek Madan committed
      Summary:
      The old RangeDelAggregator did expensive pre-processing work
      to create a collapsed, binary-searchable representation of range
      tombstones. With FragmentedRangeTombstoneIterator, much of this work is
      now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
      in each iterator to find a covering tombstone in ShouldDelete, while
      doing minimal work in AddTombstones. The old RangeDelAggregator is still
      used during flush/compaction for now, though RangeDelAggregatorV2 will
      support those uses in a future PR.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
      
      Differential Revision: D13146964
      
      Pulled By: abhimadan
      
      fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
  32. 14 Nov, 2018 1 commit
    • Divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE` (#4498) · 5945e16d
      Soli Como committed
      Summary:
      Currently, `Statistics` can record a tick by `recordTick()`, whose second parameter is a `uint64_t`.
      That means a tick can only increase.
      If we want to reduce a tick, we have to use a workaround like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`.
      That's kind of a hack.
      
      So, this PR divides `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase only.
      
      Fixes #3013 .
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498
      
      Differential Revision: D10395010
      
      Pulled By: sagar0
      
      fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2
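The difference between the old workaround and the two-counter scheme can be sketched with a toy statistics struct. `ToyStats` and its members are hypothetical; only the ticker names in the comments and the `uint64_t(-1)` trick come from the summary above.

```cpp
#include <cassert>
#include <cstdint>

struct ToyStats {
  uint64_t no_iterators = 0;         // Models the old NO_ITERATORS ticker.
  uint64_t no_iterator_created = 0;  // Models NO_ITERATOR_CREATED.
  uint64_t no_iterator_delete = 0;   // Models NO_ITERATOR_DELETE.

  // Old scheme: "decrement" by adding 2^64 - 1 and relying on unsigned
  // wraparound.
  void OldOnCreate() { no_iterators += 1; }
  void OldOnDelete() { no_iterators += uint64_t(-1); }

  // New scheme: both counters only ever increase.
  void OnCreate() { no_iterator_created += 1; }
  void OnDelete() { no_iterator_delete += 1; }

  // The number of live iterators is still derivable from the two counters.
  uint64_t LiveIterators() const {
    return no_iterator_created - no_iterator_delete;
  }
};
```

Keeping both counters monotonic means a consumer that only ever sees increasing tickers can still recover the live count by subtraction.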
  33. 11 Oct, 2018 1 commit
    • Fix merge operand reappearing when covered by DeleteRange (#4481) · 7e560722
      Andrew Kryczka committed
      Summary:
      Even during `DBIter::Prev()`, there is a case where we need to use `RangeDelPositioningMode::kForwardTraversal`. In particular, when we hit too many internal keys for a single user key, we use seek to find the newest internal key. If it's a merge operand, we then scan forwards, collecting the merge operands. This forward scan should be using `RangeDelPositioningMode::kForwardTraversal`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4481
      
      Differential Revision: D10319507
      
      Pulled By: ajkr
      
      fbshipit-source-id: b5ce7352461f3a7696b28a5136ae0076f2bde51f
  34. 10 Oct, 2018 1 commit
  35. 11 Aug, 2018 1 commit
  36. 13 Jul, 2018 1 commit
    • Range deletion performance improvements + cleanup (#4014) · 5f3088d5
      Nikhil Benesch committed
      Summary:
      This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.
      
      I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.
      
      With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.
      
      ajkr please take a look when you have a minute!
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014
      
      Differential Revision: D8773253
      
      Pulled By: ajkr
      
      fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
  37. 28 Jun, 2018 1 commit
    • WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955) · a16e00b7
      Manuel Ung committed
      Summary:
      This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data.
      
      These are the places where reads had to be modified.
      - Get and Seek/Next were just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible.
      - Prev did not need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
      - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
      Closes https://github.com/facebook/rocksdb/pull/3955
      
      Differential Revision: D8286688
      
      Pulled By: lth
      
      fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
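The seek target computation mentioned in the summary above can be sketched as follows. `ToyReadCallback` and `SeekSequence` are hypothetical; only `MaxUnpreparedSequenceNumber` and the `max(snapshot_seq, ...)` expression come from the summary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using SequenceNumber = uint64_t;

// Toy version of the callback described above; not the RocksDB ReadCallback.
class ToyReadCallback {
 public:
  explicit ToyReadCallback(SequenceNumber max_unprepared)
      : max_unprepared_(max_unprepared) {}

  // Zero means the transaction has no uncommitted data written to the DB.
  SequenceNumber MaxUnpreparedSequenceNumber() const {
    return max_unprepared_;
  }

 private:
  SequenceNumber max_unprepared_;
};

// Sequence number to seek to so that both the snapshot and the
// transaction's own unprepared writes are covered.
SequenceNumber SeekSequence(SequenceNumber snapshot_seq,
                            const ToyReadCallback& cb) {
  return std::max(snapshot_seq, cb.MaxUnpreparedSequenceNumber());
}
```

When the callback returns zero the result is simply the snapshot sequence number, so a transaction with no unprepared data reads exactly as before.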