1. 03 Oct, 2020 1 commit
  2. 02 Oct, 2020 1 commit
  3. 30 Sep, 2020 1 commit
  4. 17 Sep, 2020 1 commit
  5. 15 Sep, 2020 1 commit
    • A
      Add a new IOStatus subcode to indicate that writes are fenced off (#7374) · 18a3227b
      anand76 committed
      Summary:
      In a distributed file system, directory ownership is enforced by fencing
      off the previous owner once they've been preempted by a new owner. This
      PR adds an IOStatus subcode for ```StatusCode::IOError``` to indicate this.
      Once this error is returned for a file write, the DB is put in read-only
      mode and not allowed to resume in read-write mode.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374
      
      Test Plan: Add new unit tests in ```error_handler_fs_test```
      
      Reviewed By: riversand963
      
      Differential Revision: D23687777
      
      Pulled By: anand1976
      
      fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f
  6. 22 Aug, 2020 1 commit
    • A
      Bug Fix for memtables not trimmed down. (#7296) · 38446126
      Akanksha Mahajan committed
      Summary:
      When a memtable is trimmed in MemTableListVersion, it is added to the
      delete list only if that was the last reference. However, it is not the
      last reference, because the super version still holds one, and the super
      version is not switched if the delete list is empty. So the memtable is
      never destroyed and memory usage grows beyond write_buffer_size +
      max_write_buffer_size_to_maintain.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296
      
      Test Plan:
      1.  ./db_bench -benchmarks=randomtransaction
      -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1
      -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000
      --transaction_set_snapshot
      
      Reviewed By: ltamasi
      
      Differential Revision: D23267395
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650
  7. 08 Aug, 2020 1 commit
  8. 16 Jul, 2020 1 commit
    • Z
      Auto resume the DB from Retryable IO Error (#6765) · a10f12ed
      Zhichao Cao committed
      Summary:
      In the current codebase, if a retryable IO error happens in the write path, SetBGError is called. The retryable IO error is converted to a hard error and the DB enters read-only mode; the user or application then has to resume it. With this PR, if a retryable IO error happens in a DB, SetBGError creates a new thread to call Resume (auto resume). options.max_bgerror_resume_count controls whether auto resume is enabled (if max_bgerror_resume_count<=0, auto resume is not enabled). options.bgerror_resume_retry_interval controls the time interval before Resume is called again if the previous resume failed due to a retryable IO error. If a non-retryable error happens during resume, auto resume terminates.
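
The retry policy described above can be sketched as a small standalone loop. This is a simplified model and not RocksDB's implementation: `ResumeResult`, `AutoResume`, and the `try_resume` callback are hypothetical names, only the two option names come from the commit message, and treating `max_bgerror_resume_count` as an upper bound on the number of attempts is an assumption. The real code runs in a separate thread.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical outcome of a single resume attempt (not a RocksDB type).
enum class ResumeResult { kOk, kRetryableIOError, kNonRetryableError };

// Simplified model of the auto-resume policy. Returns true if the DB resumed.
// Assumption: max_bgerror_resume_count also caps the number of attempts.
bool AutoResume(int max_bgerror_resume_count,
                std::chrono::microseconds bgerror_resume_retry_interval,
                const std::function<ResumeResult()>& try_resume) {
  if (max_bgerror_resume_count <= 0) {
    return false;  // auto resume is not enabled
  }
  for (int attempt = 0; attempt < max_bgerror_resume_count; ++attempt) {
    switch (try_resume()) {
      case ResumeResult::kOk:
        return true;  // DB is back in read-write mode
      case ResumeResult::kNonRetryableError:
        return false;  // auto resume terminates
      case ResumeResult::kRetryableIOError:
        // Wait for the configured interval, then call Resume again.
        std::this_thread::sleep_for(bgerror_resume_retry_interval);
        break;
    }
  }
  return false;  // attempts exhausted; the DB stays read-only
}
```

A caller would pass a callback that performs one resume attempt and reports whether the failure, if any, was retryable.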
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
      
      Test Plan: Added unit test cases in error_handler_fs_test; passes make asan_check
      
      Reviewed By: anand1976
      
      Differential Revision: D21916789
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
  9. 03 Jul, 2020 1 commit
  10. 29 May, 2020 1 commit
    • Y
      Add timestamp to delete (#6253) · 961c7590
      Yanqin Jin committed
      Summary:
      Preliminary user-timestamp support for delete.
      
      If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100.
      
      Implementation
      A new ValueType, `kTypeDeletionWithTimestamp`, is added for deletion markers with timestamps.
      The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, if we still reused `kTypeDeletion`, we might drop a tombstone with a more recent timestamp, causing deleted keys to reappear.
      
      Test plan (dev server)
      ```
      make check
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253
      
      Reviewed By: ltamasi
      
      Differential Revision: D20995328
      
      Pulled By: riversand963
      
      fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619
  11. 28 Apr, 2020 1 commit
  12. 28 Mar, 2020 1 commit
    • Z
      Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487) · 42468881
      Zhichao Cao committed
      Summary:
      In the current code base, we use Status to get and store the status returned from a call. For IO-related functions in particular, Status cannot reflect IO error details such as the error scope and whether the error is retryable, among others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have a new wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is discarded at the lower level of the write path and converted to Status.
      
      The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and compaction). The second job is to identify a retryable IO error as a HardError and set bg_error_ to HardError. In this case, the DB instance becomes read-only. The user is informed of the Status and needs to take action to deal with it (e.g., call db->Resume()).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
      
      Test Plan: Added a test case to error_handler_fs_test. Passes make asan_check
      
      Reviewed By: anand1976
      
      Differential Revision: D20685017
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
  13. 03 Mar, 2020 1 commit
  14. 21 Feb, 2020 1 commit
    • S
      Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong committed
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a way to solve the problem, the RocksDB namespace is changed to a flag that can be overridden at build time.
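
The general technique of a build-time overridable namespace can be sketched as follows. `DEMO_NAMESPACE`, the stringify helpers, and `NamespaceName()` are hypothetical names used only for illustration; they stand in for `ROCKSDB_NAMESPACE` and are not taken from the RocksDB sources.

```cpp
#include <string>

// DEMO_NAMESPACE is a hypothetical stand-in for ROCKSDB_NAMESPACE.
// A build can override it on the compiler command line, for example
// with -DDEMO_NAMESPACE=demo_v2, so two builds get distinct symbols.
#ifndef DEMO_NAMESPACE
#define DEMO_NAMESPACE demo
#endif

// Two-step stringification so the macro is expanded before being quoted.
#define DEMO_STRINGIFY_IMPL(x) #x
#define DEMO_STRINGIFY(x) DEMO_STRINGIFY_IMPL(x)

namespace DEMO_NAMESPACE {
// Every symbol of the library lives inside the configurable namespace.
inline std::string NamespaceName() { return DEMO_STRINGIFY(DEMO_NAMESPACE); }
}  // namespace DEMO_NAMESPACE
```

Callers refer to symbols as `DEMO_NAMESPACE::NamespaceName()`, so their code keeps compiling whichever name the build selects.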
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  15. 19 Feb, 2020 1 commit
    • A
      Fix concurrent full purge and WAL recycling (#5900) · c6abe30e
      Andrew Kryczka committed
      Summary:
      We were removing the file from `log_recycle_files_` before renaming it
      with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside
      the DB mutex, it was possible for a concurrent full purge to sneak in
      and delete the file before it could be renamed. Consequently, `SwitchMemtable()`
      would fail and the DB would enter read-only mode.
      
      The fix is to hold the old file number in `log_recycle_files_` until
      after the file has been renamed. Full purge uses that list to decide
      which files to keep, so it can no longer delete a file pending recycling.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900
      
      Test Plan: new unit test
      
      Differential Revision: D19771719
      
      Pulled By: ajkr
      
      fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8
  16. 11 Feb, 2020 1 commit
  17. 28 Jan, 2020 1 commit
  18. 14 Dec, 2019 1 commit
    • L
      Do not create/install new SuperVersion if nothing was deleted during memtable trim (#6169) · 6d54eb3d
      Levi Tamasi committed
      Summary:
      We have observed an increase in CPU load caused by frequent calls to
      `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory`
      when using `max_write_buffer_size_to_maintain` to limit the amount of
      memtable history maintained for transaction conflict checking. As it turns out,
      this is caused by the code creating and installing a new `SuperVersion` even if
      no memtables were actually trimmed. The patch adds a check to avoid this.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6169
      
      Test Plan:
      Compared `perf` output for
      
      ```
      ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32
      ```
      
      before and after the change. With the fix, the call chain `rocksdb::DBImpl::TrimMemtableHistory` ->
      `rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape`
      no longer registers in the `perf` report.
      
      Differential Revision: D19031509
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 02686fce594e5b50eba0710e4b28a9b808c8aa20
  19. 13 Dec, 2019 2 commits
    • J
      Support concurrent CF iteration and drop (#6147) · c2029f97
      Jermy Li committed
      Summary:
      It's easy to cause a core dump when closing a ColumnFamilyHandle with unreleased iterators, especially when iterator release is controlled by the Java GC when using JNI.
      
      This patch fixes concurrent CF iteration and drop: iterators (actually SuperVersion) now hold a ColumnFamilyData reference to prevent the CF from being released too early.
      
      Fixes https://github.com/facebook/rocksdb/issues/5982
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147
      
      Differential Revision: D18926378
      
      fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
    • C
      wait pending memtable writes on file ingestion or compact range (#6113) · a8445912
      Connor committed
      Summary:
      This PR fixes two unordered_write related issues:
      - ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026
      - compact range may cause the memtable to be flushed before pending unordered writes have finished
          1. `CompactRange` triggers a memtable flush but doesn't wait for pending writes
          2. there are some pending writes but the memtable is already flushed
          3. the WAL related to that memtable is removed (note that the pending writes were recorded in that WAL)
          4. the pending writes are written to the newly created memtable
          5. there is a restart
          6. the previous pending writes are lost because the WAL was removed but they aren't included in any SST
      
      **How to solve:**
      - Wait for pending memtable writes before the ingestion job checks the memtable key range
      - Wait for pending memtable writes before flushing the memtable.
      **Note that: `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending writes, but I'm not sure whether it affects the correctness.**
      
      **Test Plan:**
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113
      
      Differential Revision: D18895674
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf
  20. 16 Nov, 2019 1 commit
  21. 30 Oct, 2019 1 commit
    • S
      Move pipeline write waiting logic into WaitForPendingWrites() (#5716) · a3960fc8
      sdong committed
      Summary:
      In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush().
      This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716
      
      Test Plan: Run all tests with ASAN and TSAN.
      
      Differential Revision: D18217658
      
      fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3
  22. 13 Sep, 2019 1 commit
    • L
      Add insert hints for each writebatch (#5728) · 1a928c22
      Lingjing You committed
      Summary:
      Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it.
      
      Bench result (qps):
      
      `./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4`
      
      master:
      
      | batch size \ thread num | 1       | 2       | 4       | 8       |
      | ----------------------- | ------- | ------- | ------- | ------- |
      | 1                       | 387883  | 220790  | 308294  | 490998  |
      | 10                      | 1397208 | 978911  | 1275684 | 1733395 |
      | 100                     | 2045414 | 1589927 | 1798782 | 2681039 |
      | 1000                    | 2228038 | 1698252 | 1839877 | 2863490 |
      
      fillseq with writebatch hint:
      
      | batch size \ thread num | 1       | 2       | 4       | 8       |
      | ----------------------- | ------- | ------- | ------- | ------- |
      | 1                       | 286005  | 223570  | 300024  | 466981  |
      | 10                      | 970374  | 813308  | 1399299 | 1753588 |
      | 100                     | 1962768 | 1983023 | 2676577 | 3086426 |
      | 1000                    | 2195853 | 2676782 | 3231048 | 3638143 |
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728
      
      Differential Revision: D17297240
      
      fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c
  23. 07 Sep, 2019 1 commit
  24. 24 Aug, 2019 1 commit
    • Z
      Refactor trimming logic for immutable memtables (#5022) · 2f41ecfe
      Zhongyi Xie committed
      Summary:
      MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
      We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
      The old option actually works as both an upper bound and a lower bound. History trimming starts if the number of immutable memtables exceeds the limit, but trimming never takes the count below (limit-1).
      In order to mimic this behavior with the new option, history trimming stops if dropping the next immutable memtable would cause the total memory usage to go below the size limit. For example, assume the size limit is set to 64MB and there are 3 immutable memtables with sizes of 20MB, 30MB, and 30MB. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable would reduce the memory usage to 60MB < 64MB, so in this case no memtable is dropped.
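
The stopping rule in the example above can be checked with a small standalone model. This is a sketch, not the RocksDB code: `TrimHistory` is a hypothetical helper, sizes are plain megabyte counts, and the mutable memtable that the real accounting also includes is left out for simplicity.

```cpp
#include <cstddef>
#include <deque>
#include <numeric>

// Simplified model of size-based history trimming.
// memtable_sizes_mb holds immutable memtable sizes in MB, oldest first.
// Returns the number of memtables dropped.
std::size_t TrimHistory(std::deque<std::size_t>& memtable_sizes_mb,
                        std::size_t limit_mb) {
  std::size_t total = std::accumulate(memtable_sizes_mb.begin(),
                                      memtable_sizes_mb.end(), std::size_t{0});
  std::size_t dropped = 0;
  // Drop the oldest memtable only while the remaining usage would still be
  // at or above the limit; stop as soon as a drop would go below it.
  while (!memtable_sizes_mb.empty() &&
         total - memtable_sizes_mb.front() >= limit_mb) {
    total -= memtable_sizes_mb.front();
    memtable_sizes_mb.pop_front();
    ++dropped;
  }
  return dropped;
}
```

With sizes {20, 30, 30} and a 64MB limit, the first candidate drop would leave 60MB, so the loop exits immediately and nothing is trimmed, matching the example in the commit message.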
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
      
      Differential Revision: D14394062
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
  25. 26 Jul, 2019 1 commit
    • Y
      Avoid user key copying for Get/Put/Write with user-timestamp (#5502) · ae152ee6
      Yanqin Jin committed
      Summary:
      In the previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamps to `DB::Get()` and `DB::Put()`. The limitation is that these two functions may cause extra memory allocation and key copying. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of the timestamp size, and we did not provide an API to assign/update the timestamp of each key within a `WriteBatch`.
      We address these issues in this PR by doing the following.
      1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
      2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`.
      3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.
      
      Test plan (on devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      If the API extension looks good, I will add more unit tests.
      
      Some simple benchmark using db_bench.
      ```
      $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
      ```
      Master is at a78503bd.
      ```
      |        | readrandom | fillrandom |
      | master | 15.53 MB/s | 25.97 MB/s |
      | PR5502 | 16.70 MB/s | 25.80 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502
      
      Differential Revision: D16340894
      
      Pulled By: riversand963
      
      fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
  26. 23 Jul, 2019 1 commit
    • M
      WriteUnPrepared: improve read your own write functionality (#5573) · eae83274
      Manuel Ung committed
      Summary:
      There are a number of fixes in this PR (with most bugs found via the added stress tests):
      1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
      2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snapshot validation would not be enforced, breaking some assumptions in Prev() iteration.
      3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
      4. Add stress test to test reading own writes.
      5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
      6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
      7. Add some more comments everywhere.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573
      
      Differential Revision: D16276089
      
      Pulled By: lth
      
      fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
  27. 02 Jul, 2019 1 commit
    • Z
      force flushing stats CF to avoid holding old logs (#5509) · 3886dddc
      Zhongyi Xie committed
      Summary:
      WAL records RocksDB writes to all column families. When the user flushes a column family, the old WAL will not accept new writes but cannot be deleted yet because it may still contain live data for other column families. (See https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log#life-cycle-of-a-wal for a detailed explanation.)
      Because of this, if there is a column family that receives very infrequent writes and no manual flush is called for it, it could prevent a lot of WALs from being deleted. PR https://github.com/facebook/rocksdb/pull/5046 introduced the persistent stats column family, which is a good example of such a column family. Depending on the config, it may have long intervals between writes, and the user is unaware of it, which makes it difficult to call manual flush for it.
      This PR addresses the problem for the persistent stats column family by forcing a flush for it when 1) another column family is flushed and 2) the persistent stats column family's log number is the smallest among all column families. This way the persistent stats column family keeps advancing its log number when necessary, allowing RocksDB to delete old WAL files.
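
The second condition can be illustrated with a standalone check. This is a sketch under stated assumptions: `ColumnFamilyInfo` and `ShouldForceFlushStatsCF` are hypothetical names, and treating "smallest" as strictly smaller than every other column family's log number (so a tie does not force a flush) is an assumption, not something the commit message states.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical minimal view of a column family for this sketch.
struct ColumnFamilyInfo {
  bool is_persistent_stats;  // true for the persistent stats column family
  std::uint64_t log_number;  // oldest WAL this column family still needs
};

// Intended to be evaluated when some other column family is being flushed.
// Returns true if the stats column family is the one holding back WAL
// deletion, i.e. its log number is strictly the smallest of all.
bool ShouldForceFlushStatsCF(const std::vector<ColumnFamilyInfo>& cfs) {
  const ColumnFamilyInfo* stats = nullptr;
  for (const ColumnFamilyInfo& cf : cfs) {
    if (cf.is_persistent_stats) {
      stats = &cf;
      break;
    }
  }
  if (stats == nullptr) {
    return false;  // no persistent stats column family present
  }
  for (const ColumnFamilyInfo& cf : cfs) {
    if (&cf == stats) continue;
    if (cf.log_number <= stats->log_number) {
      return false;  // another column family pins an equally old or older WAL
    }
  }
  return true;
}
```

When the check returns false, flushing the stats column family would not free any WAL, so the extra flush is skipped.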
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5509
      
      Differential Revision: D16045896
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 286837b633e988417f0096ff38384742d3b40ef4
  28. 11 Jun, 2019 1 commit
    • M
      WritePrepared: reduce prepared_mutex_ overhead (#5420) · c292dc85
      Maysam Yabandeh committed
      Summary:
      The patch reduces the contention over prepared_mutex_ using these techniques:
      1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
      2) Use two separate mutexes for PreparedHeap: prepared_mutex_, needed for ::RemovePrepared, and ::push_pop_mutex(), needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, the two write queues no longer compete with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up calling ::pop().
      3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420
      
      Differential Revision: D15741985
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
  29. 07 Jun, 2019 1 commit
  30. 06 Jun, 2019 1 commit
    • Y
      Add support for timestamp in Get/Put (#5079) · 340ed4fa
      Yanqin Jin committed
      Summary:
      It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exists in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide the timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc., and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
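
One way to order <key, timestamp> tuples can be sketched without any RocksDB types. Everything here is illustrative: `TimestampedKey` and `KeyTimestampLess` are hypothetical names, the timestamp is modelled as a fixed-width integer rather than a slice, and placing newer timestamps first for equal keys is an assumption about a convenient ordering, not a requirement stated in the commit message.

```cpp
#include <cstdint>
#include <string>

// Hypothetical <key, timestamp> tuple; the timestamp has a fixed width,
// mirroring the rule that all timestamps in a database share one length.
struct TimestampedKey {
  std::string user_key;
  std::uint64_t timestamp;
};

// Strict weak ordering: user key ascending, then timestamp descending,
// so the newest version of a key sorts first (assumed convention).
bool KeyTimestampLess(const TimestampedKey& a, const TimestampedKey& b) {
  if (a.user_key != b.user_key) {
    return a.user_key < b.user_key;
  }
  return a.timestamp > b.timestamp;
}
```

Sorting a collection with this predicate groups all versions of a key together, with the most recent version at the front of each group.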
      
      Test plan (on devserver):
      ```
      $COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      All tests must pass.
      
      We also run the following db_bench tests to verify whether there is a regression on Get/Put while timestamp is not enabled.
      ```
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
      ```
      Repeat 6 times for both versions.
      
      Results are as follows:
      ```
      |        | readrandom | fillrandom |
      | master | 16.77 MB/s | 47.05 MB/s |
      | PR5079 | 16.44 MB/s | 47.03 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
      
      Differential Revision: D15132946
      
      Pulled By: riversand963
      
      fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
  31. 01 Jun, 2019 1 commit
  32. 31 May, 2019 1 commit
  33. 20 May, 2019 1 commit
    • M
      WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313) · 5c0e3041
      Maysam Yabandeh committed
      Summary:
      WritePrepared transactions, when configured with two_write_queues=true, offer higher throughput with the unordered_write feature without compromising the RocksDB guarantees. This is because ordering among writes is performed in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step is provided only when we use two_write_queues=true, where WritePrepared, after performing the writes, uses the 2nd queue in a 2nd step to assign order to the writes.
      The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover, it extends the stress tests of WritePrepared to unordered_write.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
      
      Differential Revision: D15379977
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
  34. 16 May, 2019 1 commit
    • M
      WritePrepared: Fix deadlock in WriteRecoverableState (#5306) · f0e82161
      Maysam Yabandeh committed
      Summary:
      The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: when writing recoverable state, we also commit its sequence number to the commit table, which could result in evicting an existing commit entry, which could result in advancing max_evicted_seq_, which would need to get snapshots from the database, which requires obtaining the db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests so that the issue manifests in the tests.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306
      
      Differential Revision: D15341458
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
  35. 14 May, 2019 1 commit
    • M
      Unordered Writes (#5218) · f383641a
      Maysam Yabandeh committed
      Summary:
      Perform unordered writes in RocksDB when the unordered_write option is set to true. When enabled, writes to the memtable are done without joining any write thread. This offers much higher write throughput since upcoming writes do not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around it. Using TransactionDB with the WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without compromising the snapshot guarantees.
      The patch is prepared based on an original by siying.
      Existing unit tests are extended to include unordered_write option.
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITE_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITE_PREPARED with unordered_write: 259.3 MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITE_PREPARED with unordered_write disable concurrency control: 185.3 MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
  36. 16 Apr, 2019 1 commit
    • V
      Consolidating WAL creation which currently has duplicate logic in... · 71a82a0a
      Vijay Nadimpalli committed
      Consolidating WAL creation which currently has duplicate logic in db_impl_write.cc and db_impl_open.cc (#5188)
      
      Summary:
      Right now, two separate pieces of code are used to create WAL files: one in the DBImpl::Open function of db_impl_open.cc and one in the DBImpl::SwitchMemtable function of db_impl_write.cc. This code change simply creates one function called DBImpl::CreateWAL in db_impl_open.cc, which replaces the existing WAL creation logic in DBImpl::Open and DBImpl::SwitchMemtable.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188
      
      Differential Revision: D14942832
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
  37. 05 Apr, 2019 1 commit
    • A
      Fix many bugs in log statement arguments (#5089) · c06c4c01
      Adam Simpkins committed
      Summary:
      Annotate all of the logging functions to inform the compiler that these
      use printf-style formatting arguments.  This allows the compiler to emit
      warnings if the format arguments are incorrect.
      
      This also fixes many problems reported now that format string checking
      is enabled.  Many of these are simply mix-ups in the argument type (e.g.,
      int vs uint64_t), but in several cases the wrong number of arguments
      were being passed in which can cause the code to crash.
      
      The primary motivation for this was to fix the log message in
      `DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
      format parameter with no argument supplied.
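
The kind of annotation described above can be sketched with the GCC/Clang `format` attribute. `DEMO_PRINTF_FORMAT` and `FormatLog` are hypothetical names for illustration and are not the macros or logging functions used in RocksDB.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// On GCC and Clang, tell the compiler that argument `fmt_idx` is a printf
// format string and that the variadic arguments start at `first_arg_idx`.
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_PRINTF_FORMAT(fmt_idx, first_arg_idx) \
  __attribute__((format(printf, fmt_idx, first_arg_idx)))
#else
#define DEMO_PRINTF_FORMAT(fmt_idx, first_arg_idx)
#endif

// Hypothetical logging helper; the attribute goes on the declaration.
std::string FormatLog(const char* format, ...) DEMO_PRINTF_FORMAT(1, 2);

std::string FormatLog(const char* format, ...) {
  char buf[256];
  va_list args;
  va_start(args, format);
  std::vsnprintf(buf, sizeof(buf), format, args);  // truncates if too long
  va_end(args);
  return std::string(buf);
}
```

With format warnings enabled (for example via `-Wall`), a call such as `FormatLog("%s")` with no matching argument is flagged at compile time instead of crashing at run time.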
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089
      
      Differential Revision: D14574795
      
      Pulled By: simpkins
      
      fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
  38. 03 Apr, 2019 1 commit
    • M
      Mark logs with prepare in PreReleaseCallback (#5121) · 5234fc1b
      Maysam Yabandeh committed
      Summary:
      In the prepare phase of 2PC, the db promises to remember the prepared data for possible future commits. To fulfill the promise, the prepared data must be persisted in the WAL so that it can be recovered after a crash. The log that contains a prepare batch that is not committed yet is marked so that it is not garbage collected before the transaction commits or rolls back. The bug was that the write to the log file and the marking of the file were not atomic, and WAL gc could have happened before the WAL log was actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads sees the WAL write and the WAL mark atomically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121
      
      Differential Revision: D14665210
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
  39. 28 Mar, 2019 1 commit
    • S
      Apply automatic formatting to some files (#5114) · 89ab1381
      Siying Dong committed
      Summary:
      The following files were run through the automatic formatter:
      db/db_impl.cc
      db/db_impl.h
      db/db_impl_compaction_flush.cc
      db/db_impl_debug.cc
      db/db_impl_files.cc
      db/db_impl_readonly.h
      db/db_impl_write.cc
      db/dbformat.cc
      db/dbformat.h
      table/block.cc
      table/block.h
      table/block_based_filter_block.cc
      table/block_based_filter_block.h
      table/block_based_filter_block_test.cc
      table/block_based_table_builder.cc
      table/block_based_table_reader.cc
      table/block_based_table_reader.h
      table/block_builder.cc
      table/block_builder.h
      table/block_fetcher.cc
      table/block_prefix_index.cc
      table/block_prefix_index.h
      table/block_test.cc
      table/format.cc
      table/format.h
      
      I could easily run all the files, but I don't want people to feel that
      I'm doing it for lines of code changes :)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114
      
      Differential Revision: D14633040
      
      Pulled By: siying
      
      fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52