1. 16 11月, 2019 1 次提交
  2. 30 10月, 2019 1 次提交
    • S
      Move pipeline write waiting logic into WaitForPendingWrites() (#5716) · a3960fc8
      sdong 提交于
      Summary:
      In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush().
      This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716
      
      Test Plan: Run all tests with ASAN and TSAN.
      
      Differential Revision: D18217658
      
      fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3
      a3960fc8
  3. 13 9月, 2019 1 次提交
    • L
      Add insert hints for each writebatch (#5728) · 1a928c22
      Lingjing You 提交于
      Summary:
      Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it.
      
      Bench result (qps):
      
      `./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4`
      
      master:
      
      | batch size \ thread num | 1       | 2       | 4       | 8       |
      | ----------------------- | ------- | ------- | ------- | ------- |
      | 1                       | 387883  | 220790  | 308294  | 490998  |
      | 10                      | 1397208 | 978911  | 1275684 | 1733395 |
      | 100                     | 2045414 | 1589927 | 1798782 | 2681039 |
      | 1000                    | 2228038 | 1698252 | 1839877 | 2863490 |
      
      fillseq with writebatch hint:
      
      | batch size \ thread num | 1       | 2       | 4       | 8       |
      | ----------------------- | ------- | ------- | ------- | ------- |
      | 1                       | 286005  | 223570  | 300024  | 466981  |
      | 10                      | 970374  | 813308  | 1399299 | 1753588 |
      | 100                     | 1962768 | 1983023 | 2676577 | 3086426 |
      | 1000                    | 2195853 | 2676782 | 3231048 | 3638143 |
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728
      
      Differential Revision: D17297240
      
      fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c
      1a928c22
  4. 07 9月, 2019 1 次提交
  5. 24 8月, 2019 1 次提交
    • Z
      Refactor trimming logic for immutable memtables (#5022) · 2f41ecfe
      Zhongyi Xie 提交于
      Summary:
      MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
      We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
      The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
      In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
      
      Differential Revision: D14394062
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
      2f41ecfe
  6. 26 7月, 2019 1 次提交
    • Y
      Avoid user key copying for Get/Put/Write with user-timestamp (#5502) · ae152ee6
      Yanqin Jin 提交于
      Summary:
      In previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`.
      We address these issues in this PR by doing the following.
      1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
      2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`.
      3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.
      
      Test plan (on devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      If the API extension looks good, I will add more unit tests.
      
      Some simple benchmark using db_bench.
      ```
      $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
      ```
      Master is at a78503bd.
      ```
      |        | readrandom | fillrandom |
      | master | 15.53 MB/s | 25.97 MB/s |
      | PR5502 | 16.70 MB/s | 25.80 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502
      
      Differential Revision: D16340894
      
      Pulled By: riversand963
      
      fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
      ae152ee6
  7. 23 7月, 2019 1 次提交
    • M
      WriteUnPrepared: improve read your own write functionality (#5573) · eae83274
      Manuel Ung 提交于
      Summary:
      There are a number of fixes in this PR (with most bugs found via the added stress tests):
      1. Re-enable reseek optimization. This was initially disabled to avoid infinite loops in https://github.com/facebook/rocksdb/pull/3955 but this can be resolved by remembering not to reseek after a reseek has already been done. This problem only affects forward iteration in `DBIter::FindNextUserEntryInternal`, as we already disable reseeking in `DBIter::FindValueForCurrentKeyUsingSeek`.
      2. Verify that ReadOption.snapshot can be safely used for iterator creation. Some snapshots would not give correct results because snaphsot validation would not be enforced, breaking some assumptions in Prev() iteration.
      3. In the non-snapshot Get() case, reads done at `LastPublishedSequence` may not be enough, because unprepared sequence numbers are not published. Use `std::max(published_seq, max_visible_seq)` to do lookups instead.
      4. Add stress test to test reading own writes.
      5. Minor bug in the allow_concurrent_memtable_write case where we forgot to pass in batch_per_txn_.
      6. Minor performance optimization in `CalcMaxUnpreparedSequenceNumber` by assigning by reference instead of value.
      7. Add some more comments everywhere.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5573
      
      Differential Revision: D16276089
      
      Pulled By: lth
      
      fbshipit-source-id: 18029c944eb427a90a87dee76ac1b23f37ec1ccb
      eae83274
  8. 02 7月, 2019 1 次提交
    • Z
      force flushing stats CF to avoid holding old logs (#5509) · 3886dddc
      Zhongyi Xie 提交于
      Summary:
      WAL records RocksDB writes to all column families. When user flushes a a column family, the old WAL will not accept new writes but cannot be deleted yet because it may still contain live data for other column families. (See https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log#life-cycle-of-a-wal for detailed explanation)
      Because of this, if there is a column family that receive very infrequent writes and no manual flush is called for it, it could prevent a lot of WALs from being deleted. PR https://github.com/facebook/rocksdb/pull/5046 introduced persistent stats column family which is a good example of such column families. Depending on the config, it may have long intervals between writes, and user is unaware of it which makes it difficult to call manual flush for it.
      This PR addresses the problem for persistent stats column family by forcing a flush for persistent stats column family when 1) another column family is flushed 2) persistent stats column family's log number is the smallest among all column families, this way persistent stats column family will  keep advancing its log number when necessary, allowing RocksDB to delete old WAL files.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5509
      
      Differential Revision: D16045896
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 286837b633e988417f0096ff38384742d3b40ef4
      3886dddc
  9. 11 6月, 2019 1 次提交
    • M
      WritePrepared: reduce prepared_mutex_ overhead (#5420) · c292dc85
      Maysam Yabandeh 提交于
      Summary:
      The patch reduces the contention over prepared_mutex_ using these techniques:
      1) Move ::RemovePrepared() to be called from the commit callback when we have two write queues.
      2) Use two separate mutex for PreparedHeap, one prepared_mutex_ needed for ::RemovePrepared, and one ::push_pop_mutex() needed for ::AddPrepared(). Given that we call ::AddPrepared only from the first write queue and ::RemovePrepared mostly from the 2nd, this will result into each the two write queues not competing with each other over a single mutex. ::RemovePrepared might occasionally need to acquire ::push_pop_mutex() if ::erase() ends up with calling ::pop()
      3) Acquire ::push_pop_mutex() on the first callback of the write queue and release it on the last.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5420
      
      Differential Revision: D15741985
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 84ce8016007e88bb6e10da5760ba1f0d26347735
      c292dc85
  10. 07 6月, 2019 1 次提交
  11. 06 6月, 2019 1 次提交
    • Y
      Add support for timestamp in Get/Put (#5079) · 340ed4fa
      Yanqin Jin 提交于
      Summary:
      It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
      
      Test plan (on devserver):
      ```
      $COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      All tests must pass.
      
      We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
      ```
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
      ```
      Repeat for 6 times for both versions.
      
      Results are as follows:
      ```
      |        | readrandom | fillrandom |
      | master | 16.77 MB/s | 47.05 MB/s |
      | PR5079 | 16.44 MB/s | 47.03 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
      
      Differential Revision: D15132946
      
      Pulled By: riversand963
      
      fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
      340ed4fa
  12. 01 6月, 2019 1 次提交
  13. 31 5月, 2019 1 次提交
  14. 20 5月, 2019 1 次提交
    • M
      WritePrepared: Clarify the need for two_write_queues in unordered_write (#5313) · 5c0e3041
      Maysam Yabandeh 提交于
      Summary:
      WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes.
      The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
      
      Differential Revision: D15379977
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
      5c0e3041
  15. 16 5月, 2019 1 次提交
    • M
      WritePrepared: Fix deadlock in WriteRecoverableState (#5306) · f0e82161
      Maysam Yabandeh 提交于
      Summary:
      The recent improvement in https://github.com/facebook/rocksdb/pull/3661 could cause a deadlock: When writing recoverable state, we also commit its sequence number to commit table, which could result into evicting existing commit entry, which could result into advancing max_evicted_seq_, which would need to get snapshots from database, which requires obtaining db mutex. The patch releases db_mutex before calling the callback in WriteRecoverableState to avoid the potential deadlock. It also improves the stress tests to let the issue be manifested in the tests.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5306
      
      Differential Revision: D15341458
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 05dcbed7e21b789fd1e5fd5ee8eea08077162323
      f0e82161
  16. 14 5月, 2019 1 次提交
    • M
      Unordered Writes (#5218) · f383641a
      Maysam Yabandeh 提交于
      Summary:
      Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
      The patch is prepared based on an original by siying
      Existing unit tests are extended to include unordered_write option.
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
      f383641a
  17. 16 4月, 2019 1 次提交
    • V
      Consolidating WAL creation which currently has duplicate logic in... · 71a82a0a
      Vijay Nadimpalli 提交于
      Consolidating WAL creation which currently has duplicate logic in db_impl_write.cc and db_impl_open.cc (#5188)
      
      Summary:
      Right now, two separate pieces of code are used to create WAL files in DBImpl::Open function of db_impl_open.cc and DBImpl::SwitchMemtable function of db_impl_write.cc. This code change simply creates 1 function called DBImpl::CreateWAL in db_impl_open.cc which is used to replace existing WAL creation logic in DBImpl::Open and DBImpl::SwitchMemtable.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188
      
      Differential Revision: D14942832
      
      Pulled By: vjnadimpalli
      
      fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
      71a82a0a
  18. 05 4月, 2019 1 次提交
    • A
      Fix many bugs in log statement arguments (#5089) · c06c4c01
      Adam Simpkins 提交于
      Summary:
      Annotate all of the logging functions to inform the compiler that these
      use printf-style formatting arguments.  This allows the compiler to emit
      warnings if the format arguments are incorrect.
      
      This also fixes many problems reported now that format string checking
      is enabled.  Many of these are simply mix-ups in the argument type (e.g,
      int vs uint64_t), but in several cases the wrong number of arguments
      were being passed in which can cause the code to crash.
      
      The primary motivation for this was to fix the log message in
      `DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
      format parameter with no argument supplied.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089
      
      Differential Revision: D14574795
      
      Pulled By: simpkins
      
      fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
      c06c4c01
  19. 03 4月, 2019 1 次提交
    • M
      Mark logs with prepare in PreReleaseCallback (#5121) · 5234fc1b
      Maysam Yabandeh 提交于
      Summary:
      In prepare phase of 2PC, the db promises to remember the prepared data, for possible future commits. To fulfill the promise the prepared data must be persisted in the WAL so that they could be recovered after a crash. The log that contains a prepare batch that is not committed yet, is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the mark of the file was not atomic, and WAL gc could have happened before the WAL log is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads would see the WAL write and WAL mark atomically.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121
      
      Differential Revision: D14665210
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
      5234fc1b
  20. 28 3月, 2019 1 次提交
    • S
      Apply automatic formatting to some files (#5114) · 89ab1381
      Siying Dong 提交于
      Summary:
      Following files were run through automatic formatter:
      db/db_impl.cc
      db/db_impl.h
      db/db_impl_compaction_flush.cc
      db/db_impl_debug.cc
      db/db_impl_files.cc
      db/db_impl_readonly.h
      db/db_impl_write.cc
      db/dbformat.cc
      db/dbformat.h
      table/block.cc
      table/block.h
      table/block_based_filter_block.cc
      table/block_based_filter_block.h
      table/block_based_filter_block_test.cc
      table/block_based_table_builder.cc
      table/block_based_table_reader.cc
      table/block_based_table_reader.h
      table/block_builder.cc
      table/block_builder.h
      table/block_fetcher.cc
      table/block_prefix_index.cc
      table/block_prefix_index.h
      table/block_test.cc
      table/format.cc
      table/format.h
      
      I could easily run all the files, but I don't want people to feel that
      I'm doing it for lines of code changes :)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114
      
      Differential Revision: D14633040
      
      Pulled By: siying
      
      fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
      89ab1381
  21. 16 3月, 2019 1 次提交
    • A
      Update bg_error when log flush fails in SwitchMemtable() (#5072) · b4fa51df
      anand76 提交于
      Summary:
      There is a potential failure case in DBImpl::SwitchMemtable() that is not handled properly. The call to cur_log_writer->WriteBuffer() can fail due to an IO error. In that case, we need to call SetBGError() in order set the background error since the WriteBuffer() failure may result in data loss.
      
      Also, the asserts for !new_mem and !new_log are incorrect, as those would have been allocated by the time this failure is detected.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5072
      
      Differential Revision: D14461384
      
      Pulled By: anand1976
      
      fbshipit-source-id: fb59bce9d61378f37d2dfcd28c0b704b0f43c3cf
      b4fa51df
  22. 01 3月, 2019 2 次提交
    • M
      Call PreReleaseCallback between WAL and memtable write (#5015) · 77ebc82b
      Maysam Yabandeh 提交于
      Summary:
      PreReleaseCallback meant to be called before the writes are visible to the readers. Since the sequence number is known after the WAL write, there is no reason to delay calling PreReleaseCallback to after the memtable write, which would complicates the reader's logic in presence of our memtable writes that are made visible by the other write thread.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5015
      
      Differential Revision: D14221670
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a504dd665cf923226d7af09cc8e9c7739a25edc6
      77ebc82b
    • S
      Add two more StatsLevel (#5027) · 5e298f86
      Siying Dong 提交于
      Summary:
      Statistics cost too much CPU for some use cases. Add two stats levels
      so that people can choose to skip two types of expensive stats, timers and
      histograms.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027
      
      Differential Revision: D14252765
      
      Pulled By: siying
      
      fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592
      5e298f86
  23. 30 1月, 2019 1 次提交
  24. 04 1月, 2019 1 次提交
  25. 03 1月, 2019 1 次提交
  26. 13 11月, 2018 1 次提交
    • Y
      Remove redundant member var and set options (#4631) · 05dec0c7
      Yanqin Jin 提交于
      Summary:
      In the past, both `DBImpl::atomic_flush_` and
      `DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set
      `immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is
      set correctly. This does not lead to incorrect behavior, but is a duplicate of
      information.
      
      Since `immutable_db_options_` is always there and has `atomic_flush`, we should
      use it as source of truth and remove `DBImpl::atomic_flush_`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631
      
      Differential Revision: D12928371
      
      Pulled By: riversand963
      
      fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d
      05dec0c7
  27. 10 11月, 2018 1 次提交
    • S
      Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Sagar Vemuri 提交于
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
      dc352807
  28. 30 10月, 2018 1 次提交
    • Y
      Avoid memtable cut when active memtable is empty (#4595) · 92b44015
      Yanqin Jin 提交于
      Summary:
      For flush triggered by RocksDB due to memory usage approaching certain
      threshold (WriteBufferManager or Memtable full), we should cut the memtable
      only when the current active memtable is not empty, i.e. contains data. This is
      what we do for non-atomic flush. If we always cut memtable even when the active
      memtable is empty, we will generate extra, empty immutable memtable.
      This is not ideal since it may cause write stall. It also causes some
      DBAtomicFlushTest to fail because cfd->imm()->NumNotFlushed() is different from
      expectation.
      
      Test plan
      ```
      $make clean && make J=1 -j32 all check
      $make clean && OPT="-DROCKSDB_LITE -g" make J=1 -j32 all check
      $make clean && TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make J=1 -j32 valgrind_test
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4595
      
      Differential Revision: D12818520
      
      Pulled By: riversand963
      
      fbshipit-source-id: d867bdbeacf4199fdd642debb085f94703c41a18
      92b44015
  29. 27 10月, 2018 1 次提交
  30. 13 10月, 2018 1 次提交
    • Y
      Add listener to sample file io (#3933) · 729a617b
      Yanqin Jin 提交于
      Summary:
      We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
      To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933
      
      Differential Revision: D10219571
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
      729a617b
  31. 11 10月, 2018 1 次提交
  32. 10 10月, 2018 1 次提交
    • A
      Handle mixed slowdown/no_slowdown writer properly (#4475) · 854a4be0
      Anand Ananthabhotla 提交于
      Summary:
      There is a bug when the write queue leader is blocked on a write
      delay/stop, and the queue has writers with WriteOptions::no_slowdown set
      to true. They are not woken up until the write stall is cleared.
      
      The fix introduces a dummy writer inserted at the tail to indicate a
      write stall and prevent further inserts into the queue, and a condition
      variable that writers who can tolerate slowdown wait on before adding
      themselves to the queue. The leader calls WriteThread::BeginWriteStall()
      to add the dummy writer and then walk the queue to fail any writers with
      no_slowdown set. Once the stall clears, the leader calls
      WriteThread::EndWriteStall() to remove the dummy writer and signal the
      condition variable.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4475
      
      Differential Revision: D10285827
      
      Pulled By: anand1976
      
      fbshipit-source-id: 747465e5e7f07a829b1fb0bc1afcd7b93f4ab1a9
      854a4be0
  33. 16 9月, 2018 1 次提交
    • A
      Auto recovery from out of space errors (#4164) · a27fce40
      Anand Ananthabhotla 提交于
      Summary:
      This commit implements automatic recovery from a Status::NoSpace() error
      during background operations such as write callback, flush and
      compaction. The broad design is as follows -
      1. Compaction errors are treated as soft errors and don't put the
      database in read-only mode. A compaction is delayed until enough free
      disk space is available to accomodate the compaction outputs, which is
      estimated based on the input size. This means that users can continue to
      write, and we rely on the WriteController to delay or stop writes if the
      compaction debt becomes too high due to persistent low disk space
      condition
      2. Errors during write callback and flush are treated as hard errors,
      i.e the database is put in read-only mode and goes back to read-write
      only fater certain recovery actions are taken.
      3. Both types of recovery rely on the SstFileManagerImpl to poll for
      sufficient disk space. We assume that there is a 1-1 mapping between an
      SFM and the underlying OS storage container. For cases where multiple
      DBs are hosted on a single storage container, the user is expected to
      allocate a single SFM instance and use the same one for all the DBs. If
      no SFM is specified by the user, DBImpl::Open() will allocate one, but
      this will be one per DB and each DB will recover independently. The
      recovery implemented by SFM is as follows -
        a) On the first occurance of an out of space error during compaction,
      subsequent
        compactions will be delayed until the disk free space check indicates
        enough available space. The required space is computed as the sum of
        input sizes.
        b) The free space check requirement will be removed once the amount of
        free space is greater than the size reserved by in progress
        compactions when the first error occured
        c) If the out of space error is a hard error, a background thread in
        SFM will poll for sufficient headroom before triggering the recovery
        of the database and putting it in write-only mode. The headroom is
        calculated as the sum of the write_buffer_size of all the DB instances
        associated with the SFM
      4. EventListener callbacks will be called at the start and completion of
      automatic recovery. Users can disable the auto recov ery in the start
      callback, and later initiate it manually by calling DB::Resume()
      
      Todo:
      1. More extensive testing
      2. Add disk full condition to db_stress (follow-on PR)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
      
      Differential Revision: D9846378
      
      Pulled By: anand1976
      
      fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
      a27fce40
  34. 25 8月, 2018 1 次提交
    • Y
      Refactor flush request queueing and processing (#3952) · 7daae512
      Yanqin Jin 提交于
      Summary:
      RocksDB currently queues individual column family for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB.
      
      This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it.
      
      This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952
      
      Differential Revision: D8529933
      
      Pulled By: riversand963
      
      fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539
      7daae512
  35. 24 8月, 2018 1 次提交
  36. 07 8月, 2018 1 次提交
  37. 01 8月, 2018 1 次提交
    • S
      Trace and Replay for RocksDB (#3837) · 12b6cdee
      Sagar Vemuri 提交于
      Summary:
      A framework for tracing and replaying RocksDB operations.
      
      A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.
      
      - Column-families are supported
      - Multi-threaded tracing is supported.
      - TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it.
      - This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.
      
      Currently supported DB operations:
      - Writes:
      -- Put
      -- Merge
      -- Delete
      -- SingleDelete
      -- DeleteRange
      -- Write
      - Reads:
      -- Get (point lookups)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837
      
      Differential Revision: D7974837
      
      Pulled By: sagar0
      
      fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
      12b6cdee
  38. 24 7月, 2018 1 次提交
    • M
      WriteUnPrepared: Implement unprepared batches for transactions (#4104) · ea212e53
      Manuel Ung 提交于
      Summary:
      This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.
      
      Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.
      
      A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104
      
      Differential Revision: D8785717
      
      Pulled By: lth
      
      fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
      ea212e53
  39. 29 6月, 2018 1 次提交
    • A
      Allow DB resume after background errors (#3997) · 52d4c9b7
      Anand Ananthabhotla 提交于
      Summary:
      Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
      1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
      2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
      3. Provide an API for the user to clear the error and resume the DB instance
      
      This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
      Closes https://github.com/facebook/rocksdb/pull/3997
      
      Differential Revision: D8653831
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
      52d4c9b7