1. 31 5月, 2019 2 次提交
  2. 30 5月, 2019 1 次提交
  3. 14 5月, 2019 1 次提交
    • M
      Unordered Writes (#5218) · f383641a
      Maysam Yabandeh 提交于
      Summary:
      Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
      The patch is prepared based on an original by siying
      Existing unit tests are extended to include unordered_write option.
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
      f383641a
  4. 27 4月, 2019 1 次提交
    • S
      Improve explicit user readahead performance (#5246) · 3548e422
      Sagar Vemuri 提交于
      Summary:
      Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.
      
      1. Stop creating new table readers when the user explicitly sets readahead size.
      2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
      3. Add `readahead_size` to db_bench.
      
      **Benchmarks:**
      https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
      
      For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
      For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
      
      **Test Plan:**
      Updated `DBIteratorTest.ReadAhead` test to make sure that:
      - no new table readers are created for iterators on setting ReadOptions.readahead_size
      - At least "readahead" number of bytes are actually getting read on each iterator read.
      
      TODO later:
      - Use similar logic for compactions as well.
      - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
      
      Differential Revision: D15107946
      
      Pulled By: sagar0
      
      fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
      3548e422
  5. 12 4月, 2019 1 次提交
    • A
      Introduce a new MultiGet batching implementation (#5011) · fefd4b98
      anand76 提交于
      Summary:
      This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
      
      Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
      1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
      2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
      
      The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
      
      Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
      
      Batch   Sizes
      
      1        | 2        | 4         | 8      | 16  | 32
      
      Random pattern (Stride length 0)
      4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
      4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
      4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)
      
      Good locality (Stride length 16)
      4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
      4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
      4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
      
      Good locality (Stride length 256)
      4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
      4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
      4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
      
      Medium locality (Stride length 4096)
      4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
      4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
      4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
      
      dbbench command used (on a DB with 4 levels, 12 million keys)-
      TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
      
      Differential Revision: D14348703
      
      Pulled By: anand1976
      
      fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
      fefd4b98
  6. 08 2月, 2019 1 次提交
  7. 03 1月, 2019 1 次提交
    • A
      Lock free MultiGet (#4754) · b9d6ecca
      Anand Ananthabhotla 提交于
      Summary:
      Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex.
      
      Tests:
      1. Unit tests
      2. make check
      3. db_bench -
      
      TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
      readrandom   :       0.167 micros/op 5983920 ops/sec;  426.2 MB/s (1000000 of 1000000 found)
      
      Multireadrandom with batch size 1:
      multireadrandom :       0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754
      
      Differential Revision: D13363550
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
      b9d6ecca
  8. 21 12月, 2018 1 次提交
    • S
      Introduce a CPU time counter in perf_context (#4741) · da1c64b6
      Siying Dong 提交于
      Summary:
      Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
      Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
      Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741
      
      Differential Revision: D13287798
      
      Pulled By: siying
      
      fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
      da1c64b6
  9. 10 11月, 2018 1 次提交
    • S
      Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Sagar Vemuri 提交于
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
      dc352807
  10. 02 11月, 2018 2 次提交
  11. 27 10月, 2018 1 次提交
  12. 22 10月, 2018 1 次提交
    • Y
      Fix RepeatableThreadTest::MockEnvTest hang (#4560) · 933250e3
      Yi Wu 提交于
      Summary:
      When `MockTimeEnv` is used in test to mock time methods, we cannot use `CondVar::TimedWait` because it is using real time, not the mocked time for wait timeout. On Mac the method can return immediately without awaking other waiting threads, if the real time is larger than `wait_until` (which is a mocked time). When that happen, the `wait()` method will fall into an infinite loop.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560
      
      Differential Revision: D10472851
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19
      933250e3
  13. 09 10月, 2018 1 次提交
    • Z
      move dump stats to a separate thread (#4382) · cac87fcf
      Zhongyi Xie 提交于
      Summary:
      Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all.
      We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior.
      
      Tested with db_bench using `--stats_dump_period_sec=20` in command line:
      > LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      
      LOG content:
      > 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
      2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606]
      ** DB Stats **
      Uptime(secs): 20.0 total, 20.0 interval
      Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s
      Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s
      Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent
      Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s
      Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s
      Interval stall: 00:00:0.012 H:M:S, 0.1 percent
      ** Compaction Stats [default] **
      Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382
      
      Differential Revision: D9933051
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa
      cac87fcf
  14. 27 9月, 2018 1 次提交
    • Y
      Improve log handling when recover without flush (#4405) · dc813e4b
      Yi Wu 提交于
      Summary:
      Improve log handling when avoid_flush_during_recovery=true.
      1. restore total_log_size_ after recovery, by summing up existing log sizes. Fixes #4253.
      2. truncate the last existing log, since this log can contain preallocated space and it will be a waste to keep the space. It avoids a crash loop of user application cause a lot of log with non-trivial size being created and ultimately take up all disk space.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4405
      
      Differential Revision: D9953933
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 967780fee8acec7f358b6eb65190fb4684f82e56
      dc813e4b
  15. 18 9月, 2018 1 次提交
    • M
      Fix bug in partition filters with format_version=4 (#4381) · 65ac72ed
      Maysam Yabandeh 提交于
      Summary:
      Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381
      
      Differential Revision: D9879505
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
      65ac72ed
  16. 12 9月, 2018 1 次提交
  17. 10 8月, 2018 1 次提交
    • M
      Index value delta encoding (#3983) · caf0f53a
      Maysam Yabandeh 提交于
      Summary:
      Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the  block cache more efficiently. The feature is enabled with using format_version 4.
      
      The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
      Results with sysbench read-only using 4k blocks and using 16 index restart interval:
      Format 2:
      19585   rocksdb read-only range=100
      Format 3:
      19569   rocksdb read-only range=100
      Format 4:
      19352   rocksdb read-only range=100
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
      
      Differential Revision: D8361343
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
      caf0f53a
  18. 14 7月, 2018 2 次提交
  19. 05 6月, 2018 1 次提交
    • M
      Extend some tests to format_version=3 (#3942) · d0c38c0c
      Maysam Yabandeh 提交于
      Summary:
      format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
      Closes https://github.com/facebook/rocksdb/pull/3942
      
      Differential Revision: D8238413
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
      d0c38c0c
  20. 01 6月, 2018 1 次提交
    • M
      Fix the bug of some test scenarios being put after kEnd · 44cf8493
      Maysam Yabandeh 提交于
      Summary:
      DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have  been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests.
      Closes https://github.com/facebook/rocksdb/pull/3935
      
      Differential Revision: D8230748
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db
      44cf8493
  21. 04 5月, 2018 1 次提交
    • S
      Skip deleted WALs during recovery · d5954929
      Siying Dong 提交于
      Summary:
      This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
      
      Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
      
      This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
      Closes https://github.com/facebook/rocksdb/pull/3765
      
      Differential Revision: D7747618
      
      Pulled By: siying
      
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
      d5954929
  22. 24 4月, 2018 1 次提交
    • S
      Revert "Skip deleted WALs during recovery" · d5afa737
      Siying Dong 提交于
      Summary:
      This reverts commit 73f21a7b.
      
      It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error:
      
      "Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"
      
      This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
      Closes https://github.com/facebook/rocksdb/pull/3762
      
      Differential Revision: D7730035
      
      Pulled By: siying
      
      fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77
      d5afa737
  23. 11 4月, 2018 1 次提交
  24. 06 4月, 2018 1 次提交
    • P
      Support for Column family specific paths. · 446b32cf
      Phani Shekhar Mantripragada 提交于
      Summary:
      In this change, an option to set different paths for different column families is added.
      This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
      To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
      
      Changes :
      1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
      2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
      3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
      4) Unit tests are added appropriately.
      Closes https://github.com/facebook/rocksdb/pull/3102
      
      Differential Revision: D6951697
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
      446b32cf
  25. 31 3月, 2018 1 次提交
    • M
      Skip deleted WALs during recovery · 73f21a7b
      Maysam Yabandeh 提交于
      Summary:
      This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
      Closes https://github.com/facebook/rocksdb/pull/3488
      
      Differential Revision: D6967893
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1
      73f21a7b
  26. 06 3月, 2018 1 次提交
  27. 23 2月, 2018 2 次提交
  28. 16 2月, 2018 1 次提交
    • M
      Unbreak MemTableRep API change · 8eb1d445
      Maysam Yabandeh 提交于
      Summary:
      The MemTableRep API was broken by this commit: 813719e9
      This patch reverts the changes and instead adds InsertKey (and etc.) overloads to extend the MemTableRep API without breaking the existing classes that inherit from it.
      Closes https://github.com/facebook/rocksdb/pull/3513
      
      Differential Revision: D7004134
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: e568d91fe1e17dd76c0c1f6c7dd51a18633b1c4f
      8eb1d445
  29. 01 2月, 2018 1 次提交
    • M
      WritePrepared Txn: Duplicate Keys, Memtable part · 813719e9
      Maysam Yabandeh 提交于
      Summary:
      Currently DB does not accept duplicate keys (keys with the same user key and the same sequence number). If Memtable returns false when receiving such keys, we can benefit from this signal to properly increase the sequence number in the rare cases when we have a duplicate key in the write batch written to DB under WritePrepared transactions.
      Closes https://github.com/facebook/rocksdb/pull/3418
      
      Differential Revision: D6822412
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: adea3ce5073131cd38ed52b16bea0673b1a19e77
      813719e9
  30. 17 1月, 2018 1 次提交
    • Y
      Fix multiple build failures · dc360df8
      Yi Wu 提交于
      Summary:
      * Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure
      * Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366
      * Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled
      * Fix ASAN failure with DBBasicTest::DBClose test
      Closes https://github.com/facebook/rocksdb/pull/3373
      
      Differential Revision: D6732313
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5
      dc360df8
  31. 22 11月, 2017 1 次提交
  32. 17 11月, 2017 1 次提交
  33. 02 11月, 2017 1 次提交
    • M
      Added support for differential snapshots · 7fe3b328
      Mikhail Antonov 提交于
      Summary:
      The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
      
      This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
      
      From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
      
      This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
      
      For now, what's done here according to initial discussions:
      
      Preserving deletes:
       - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
       - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
       - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
      
      Iterator changes:
       - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
      
      TableCache changes:
       - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
      
      What's left:
      
       - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
      Closes https://github.com/facebook/rocksdb/pull/2999
      
      Differential Revision: D6175602
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
      7fe3b328
  34. 24 10月, 2017 1 次提交
    • Y
      Add DB::Properties::kEstimateOldestKeyTime · 66a2c44e
      Yi Wu 提交于
      Summary:
      With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.
      
      My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
      Closes https://github.com/facebook/rocksdb/pull/2842
      
      Differential Revision: D5770600
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
      66a2c44e
  35. 22 7月, 2017 2 次提交