1. 16 Jun, 2018 1 commit
  2. 04 May, 2018 1 commit
    • Skip deleted WALs during recovery · d5954929
      Siying Dong committed
      Summary:
      This patch records the min log number to keep in the manifest while flushing SST files, so that recovery ignores any WAL older than that number. This avoids scenarios where there is a gap between the WAL files fed to the recovery procedure. Such a gap can arise from, for example, out-of-order WAL deletion. It can cause problems in 2PC recovery, where the prepare and commit entries are placed in two separate WALs: a gap in the WALs could mean the WAL with the commit entry is never processed, breaking the 2PC recovery logic.
      
      Before this commit, in the 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or with prepare entries whose respective commit or abort is still in a memtable. With this commit, the same calculation is done when we apply the SST flush. Just before installing the flushed file, we precompute the earliest log file to keep after the flush finishes, using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the newly flushed SST file. This precomputed value is also remembered in memory and is later used to determine whether a log file can be deleted. The value is unlikely to change until the next flush because the commit entry will stay in the memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. That is not yet done. Even if we do it, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway, because the obsolete file clean-up function is only executed after a flush or compaction.)
      
      This min log number to keep is stored in the manifest using the safe-to-ignore customized field of the AddFile entry, which guarantees that a DB generated by a newer release can be opened by previous releases no older than 4.2.
      Closes https://github.com/facebook/rocksdb/pull/3765
      
      Differential Revision: D7747618
      
      Pulled By: siying
      
      fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
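The recovery rule described in this summary can be sketched as follows. This is a minimal illustration, not RocksDB's recovery code: `WalsToReplay` and its parameters are hypothetical names, and WALs are modeled as plain log numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not RocksDB's actual recovery code.
// Returns the WALs that recovery should replay: every WAL whose number is
// at least the min log number to keep that was read from the manifest.
// Older WALs are skipped even if they are still present on disk.
std::vector<uint64_t> WalsToReplay(const std::vector<uint64_t>& wals_on_disk,
                                   uint64_t min_log_number_to_keep) {
  std::vector<uint64_t> to_replay;
  for (uint64_t wal : wals_on_disk) {
    if (wal >= min_log_number_to_keep) {
      to_replay.push_back(wal);
    }
  }
  return to_replay;
}
```

With WALs 3, 5, 8 and 9 on disk (4 having been deleted out of order) and a recorded value of 8, only 8 and 9 are replayed, so the gap before 5 never reaches the 2PC logic.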
  3. 19 Apr, 2018 1 commit
  4. 13 Apr, 2018 1 commit
  5. 12 Apr, 2018 1 commit
  6. 02 Mar, 2018 1 commit
    • Add "rocksdb.live-sst-files-size" DB property · bf937cf1
      Yi Wu committed
      Summary:
      Add the "rocksdb.live-sst-files-size" DB property, which only includes files of the latest version. The existing "rocksdb.total-sst-files-size" includes files from all versions and thus includes files that are obsolete but not yet deleted. I'm going to use this new property to cap the total size of blob db SST + blob files.
      Closes https://github.com/facebook/rocksdb/pull/3548
      
      Differential Revision: D7116939
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
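The difference between the two properties can be sketched with a toy model. Everything here is an assumption made for illustration: a version is modeled as a set of file numbers, the last element of `versions` is taken to be the latest one, and a file shared by several versions is counted once in the total.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

using FileNumber = uint64_t;
// Toy model: a version is the set of SST file numbers it references.
using Version = std::set<FileNumber>;

// Size of the files referenced by the latest version only.
uint64_t LiveSstFilesSize(const std::vector<Version>& versions,
                          const std::map<FileNumber, uint64_t>& sizes) {
  if (versions.empty()) return 0;
  uint64_t total = 0;
  for (FileNumber f : versions.back()) total += sizes.at(f);
  return total;
}

// Size of the files referenced by any version, each file counted once.
uint64_t TotalSstFilesSize(const std::vector<Version>& versions,
                           const std::map<FileNumber, uint64_t>& sizes) {
  std::set<FileNumber> seen;
  uint64_t total = 0;
  for (const Version& v : versions) {
    for (FileNumber f : v) {
      if (seen.insert(f).second) total += sizes.at(f);
    }
  }
  return total;
}
```

If an older version still references file 1 while the latest version holds files 2 and 3, file 1 contributes to the total size but not to the live size.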
  7. 24 Oct, 2017 1 commit
    • Add DB::Properties::kEstimateOldestKeyTime · 66a2c44e
      Yi Wu committed
      Summary:
      With FIFO compaction we would like to get the oldest data time for monitoring. The problem is that we don't have a timestamp for each key in the DB. As an approximation, we expose the earliest "creation_time" property among the SST files.
      
      My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
      Closes https://github.com/facebook/rocksdb/pull/2842
      
      Differential Revision: D5770600
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
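The approximation amounts to taking a minimum over the files' creation times. A minimal sketch, assuming the creation times are already collected into a vector; the function name and the choice to return no value for an empty DB are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: estimates the oldest key time as the earliest
// "creation_time" among the SST files. Returns no value when there are
// no files to inspect (a choice made for this sketch).
std::optional<uint64_t> EstimateOldestKeyTime(
    const std::vector<uint64_t>& sst_creation_times) {
  if (sst_creation_times.empty()) return std::nullopt;
  return *std::min_element(sst_creation_times.begin(),
                           sst_creation_times.end());
}
```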
  8. 08 Sep, 2017 1 commit
  9. 31 Aug, 2017 1 commit
    • Extend property map with compaction stats · 8a6708f5
      Artem Danilov committed
      Summary:
      This branch extends the existing property map, which keeps values as doubles, to keep values as strings so that it can provide a wider range of properties. The immediate need is to provide IO stall stats to MyRocks in an easily parseable way, which is also part of this branch.
      Closes https://github.com/facebook/rocksdb/pull/2794
      
      Differential Revision: D5717676
      
      Pulled By: Tema
      
      fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
  10. 16 Jul, 2017 1 commit
  11. 01 Jul, 2017 1 commit
  12. 30 Jun, 2017 1 commit
    • Add a fetch_add variation to AddDBStats · e9f91a51
      Maysam Yabandeh committed
      Summary:
      AddDBStats works in two steps, a load followed by a store, which is more efficient than fetch_add. This is, however, not thread-safe. Currently we have to protect concurrent access to AddDBStats with a mutex, which is less efficient than fetch_add.
      
      This patch adds the option to use fetch_add in AddDBStats. The results for my 2pc benchmark on sysbench are:
      - vanilla: 68618 tps
      - removing mutex on AddDBStats (unsafe): 69767 tps
      - fetch_add for all AddDBStats: 69200 tps
      - fetch_add only for concurrently accessed AddDBStats (this patch): 69579 tps
      Closes https://github.com/facebook/rocksdb/pull/2505
      
      Differential Revision: D5330656
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: af64d7bee135b0e86b4fac323a4f9d9113eaa383
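The two update paths can be sketched with `std::atomic`. The class below is a hypothetical stand-in, not the InternalStats code: the `concurrent` flag selects between the load-then-store path, which is only correct when the caller already serializes writers, and the `fetch_add` path, which is safe under concurrent writers.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical counter illustrating the two update strategies.
class DbStat {
 public:
  void Add(uint64_t value, bool concurrent) {
    if (concurrent) {
      // Single atomic read-modify-write: safe with concurrent writers.
      counter_.fetch_add(value, std::memory_order_relaxed);
    } else {
      // Separate load and store: an update can be lost if two threads
      // interleave here, so the caller must guarantee exclusive access.
      uint64_t current = counter_.load(std::memory_order_relaxed);
      counter_.store(current + value, std::memory_order_relaxed);
    }
  }

  uint64_t Get() const { return counter_.load(std::memory_order_relaxed); }

 private:
  std::atomic<uint64_t> counter_{0};
};
```

A caller would pass `concurrent = true` only for the stats that are actually updated from several threads, which matches the last variant in the benchmark list above.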
  13. 28 Apr, 2017 1 commit
  14. 22 Apr, 2017 1 commit
  15. 19 Apr, 2017 1 commit
  16. 11 Apr, 2017 1 commit
  17. 06 Apr, 2017 1 commit
    • Use a human readable size for level report · c50e3750
      Islam AbdelRahman committed
      Summary:
      Current
      ```
      ** Compaction Stats [default] **
      Level    Files   Size(MB} Score Read(GB}  Rn(GB} Rnp1(GB} Write(GB} Wnew(GB} Moved(GB} W-Amp Rd(MB/s} Wr(MB/s} Comp(sec} Comp(cnt} Avg(sec} KeyIn KeyDrop
      ----------------------------------------------------------------------------------------------------------------------------------------------------------
        L0      2/0      49.02   0.5      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0     76.1         1         2    0.322       0      0
       Sum      2/0      49.02   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     76.1         1         2    0.322       0      0
       Int      0/0       0.00   0.0      0.0     0.0      0.0       0.0      0.0       0.0   1.0      0.0     76.1         1         2    0.322       0      0
      ```
      
      New
      ```
      ** Compaction Stats [default] **
      Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn Key
      ```
      Closes https://github.com/facebook/rocksdb/pull/2055
      
      Differential Revision: D4804576
      
      Pulled By: IslamAbdelRahman
      
      fbshipit-source-id: 719be6a
  18. 30 Mar, 2017 1 commit
  19. 09 Feb, 2017 1 commit
  20. 29 Dec, 2016 1 commit
  21. 12 Nov, 2016 1 commit
  22. 22 Sep, 2016 1 commit
  23. 17 Jun, 2016 1 commit
    • Add InternalStats and logging for AddFile() · 30a24f2d
      Islam AbdelRahman committed
      Summary:
      We don't report the bytes ingested through AddFile(), which makes the write amplification numbers incorrect.
      Update InternalStats and add logging for AddFile().
      
      Test Plan: Make sure the code compiles and existing tests pass
      
      Reviewers: lightmark, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D59763
  24. 26 Apr, 2016 1 commit
  25. 21 Apr, 2016 1 commit
    • Add per-level compression ratio property · 73a847ef
      Andrew Kryczka committed
      Summary:
      This is needed so we can measure compression ratio improvements
      achieved by D52287.
      
      The property compares raw data size against the total file size for a given
      level. If the level is empty it should return 0.0.
      
      Test Plan: new unit test
      
      Reviewers: IslamAbdelRahman, yhchiang, sdong
      
      Reviewed By: sdong
      
      Subscribers: andrewkr, dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D56967
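A minimal sketch of the computation described above. The function name is hypothetical, and dividing raw size by file size (rather than the inverse) is an assumption of this sketch; the summary only says the two sizes are compared and that an empty level yields 0.0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: compression ratio of one level.
// Assumes ratio = raw (uncompressed) data size / total file size.
double LevelCompressionRatio(uint64_t raw_data_bytes,
                             uint64_t total_file_bytes) {
  if (total_file_bytes == 0) {
    return 0.0;  // empty level
  }
  return static_cast<double>(raw_data_bytes) /
         static_cast<double>(total_file_bytes);
}
```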
  26. 05 Mar, 2016 1 commit
    • Change Property name from "rocksdb.current_version_number" to... · 294bdf9e
      sdong committed
      Change Property name from "rocksdb.current_version_number" to "rocksdb.current-super-version-number"
      
      Summary: I realized I was wrong again about the naming convention. Let me change it to the correct one.
      
      Test Plan: Run unit tests.
      
      Reviewers: IslamAbdelRahman, kradhakrishnan, yhchiang, andrewkr
      
      Reviewed By: andrewkr
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D55041
  27. 02 Mar, 2016 1 commit
  28. 10 Feb, 2016 1 commit
  29. 03 Feb, 2016 1 commit
    • Eliminate duplicated property constants · 284aa613
      Andrew Kryczka committed
      Summary:
      Before this diff, there were duplicated constants to refer to properties (user-
      facing API had strings and InternalStats had an enum). I noticed these were
      inconsistent in terms of which constants are provided, names of constants, and
      documentation of constants. Overall it seemed annoying/error-prone to maintain
      these duplicated constants.
      
      So, this diff gets rid of InternalStats's constants and replaces them with a map
      keyed on the user-facing constant. The value in that map contains a function
      pointer to get the property value, so we don't need to do string matching while
      holding db->mutex_. This approach has a side benefit of making many small
      handler functions rather than a giant switch-statement.
      
      Test Plan: db_properties_test passes, running "make commit-prereq -j32"
      
      Reviewers: sdong, yhchiang, kradhakrishnan, IslamAbdelRahman, rven, anthony
      
      Reviewed By: anthony
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D53253
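The map-of-handlers design from this summary can be sketched as below. All names are hypothetical (`Stats`, the `example.*` property names, the handler functions); the point is only that the string lookup yields a function pointer, so the lookup can happen before a lock is taken and the handler can run under it.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stats holder standing in for the internal state.
struct Stats {
  uint64_t num_files;
  uint64_t total_bytes;
};

// A handler writes the property value and reports success.
using Handler = bool (*)(const Stats&, std::string* value);

bool HandleNumFiles(const Stats& stats, std::string* value) {
  *value = std::to_string(stats.num_files);
  return true;
}

bool HandleTotalBytes(const Stats& stats, std::string* value) {
  *value = std::to_string(stats.total_bytes);
  return true;
}

// Single table keyed on the user-facing property name.
const std::unordered_map<std::string, Handler>& PropertyTable() {
  static const std::unordered_map<std::string, Handler> table = {
      {"example.num-files", &HandleNumFiles},
      {"example.total-bytes", &HandleTotalBytes},
  };
  return table;
}

// String matching happens here, outside any lock; the caller can then
// invoke the returned handler while holding the mutex.
Handler FindHandler(const std::string& name) {
  auto it = PropertyTable().find(name);
  return it == PropertyTable().end() ? nullptr : it->second;
}
```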
  30. 26 Dec, 2015 1 commit
    • support for concurrent adds to memtable · 7d87f027
      Nathan Bronson committed
      Summary:
      This diff adds support for concurrent adds to the skiplist memtable
      implementations.  Memory allocation is made thread-safe by the addition of
      a spinlock, with small per-core buffers to avoid contention.  Concurrent
      memtable writes are made via an additional method and don't impose a
      performance overhead on the non-concurrent case, so parallelism can be
      selected on a per-batch basis.
      
      Write thread synchronization is an increasing bottleneck for higher levels
      of concurrency, so this diff adds --enable_write_thread_adaptive_yield
      (default off).  This feature causes threads joining a write batch
      group to spin for a short time (default 100 usec) using sched_yield,
      rather than going to sleep on a mutex.  If the timing of the yield calls
      indicates that another thread has actually run during the yield then
      spinning is avoided.  This option improves performance for concurrent
      situations even without parallel adds, although it has the potential to
      increase CPU usage (and the heuristic adaptation is not yet mature).
      
      Parallel writes are not currently compatible with
      inplace updates, update callbacks, or delete filtering.
      Enable it with --allow_concurrent_memtable_write (and
      --enable_write_thread_adaptive_yield).  Parallel memtable writes
      are performance neutral when there is no actual parallelism, and in
      my experiments (SSD server-class Linux and varying contention and key
      sizes for fillrandom) they are always a performance win when there is
      more than one thread.
      
      Statistics are updated earlier in the write path, dropping the number
      of DB mutex acquisitions from 2 to 1 for almost all cases.
      
      This diff was motivated and inspired by Yahoo's cLSM work.  It is more
      conservative than cLSM: RocksDB's write batch group leader role is
      preserved (along with all of the existing flush and write throttling
      logic) and concurrent writers are blocked until all memtable insertions
      have completed and the sequence number has been advanced, to preserve
      linearizability.
      
      My test config is "db_bench -benchmarks=fillrandom -threads=$T
      -batch_size=1 -memtablerep=skip_list -value_size=100 --num=1000000/$T
      -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
      -disable_auto_compactions --max_write_buffer_number=8
      -max_background_flushes=8 --disable_wal --write_buffer_size=160000000
      --block_size=16384 --allow_concurrent_memtable_write" on a two-socket
      Xeon E5-2660 @ 2.2GHz with lots of memory and an SSD hard drive.  With 1
      thread I get ~440Kops/sec.  Peak performance for 1 socket (numactl
      -N1) is slightly more than 1Mops/sec, at 16 threads.  Peak performance
      across both sockets happens at 30 threads, and is ~900Kops/sec, although
      with fewer threads there is less performance loss when the system has
      background work.
      
      Test Plan:
      1. concurrent stress tests for InlineSkipList and DynamicBloom
      2. make clean; make check
      3. make clean; DISABLE_JEMALLOC=1 make valgrind_check; valgrind db_bench
      4. make clean; COMPILE_WITH_TSAN=1 make all check; db_bench
      5. make clean; COMPILE_WITH_ASAN=1 make all check; db_bench
      6. make clean; OPT=-DROCKSDB_LITE make check
      7. verify no perf regressions when disabled
      
      Reviewers: igor, sdong
      
      Reviewed By: sdong
      
      Subscribers: MarkCallaghan, IslamAbdelRahman, anthony, yhchiang, rven, sdong, guyg8, kradhakrishnan, dhruba
      
      Differential Revision: https://reviews.facebook.net/D50589
  31. 18 Dec, 2015 1 commit
    • Slowdown when writing to the last write buffer · d72b3177
      sdong committed
      Summary: Currently, if inserting into the memtable is much faster than writing to files, there is no mechanism users can rely on to avoid stopping when options.max_write_buffer_number is reached. With this commit, if more than four maximum write buffers are configured, we slow down to the rate of options.delayed_write_rate once we reach the last one.
      
      Test Plan:
      1. Add a new unit test.
      2. Run db_bench with
      
      ./db_bench --benchmarks=fillrandom --num=10000000 --max_background_flushes=6 --batch_size=32 -max_write_buffer_number=4 --delayed_write_rate=500000 --statistics
      
      on a hard drive and confirm that stopping is avoided with this commit.
      
      Reviewers: yhchiang, IslamAbdelRahman, anthony, rven, kradhakrishnan, igor
      
      Reviewed By: igor
      
      Subscribers: MarkCallaghan, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D52047
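The slowdown rule in this summary can be sketched as a small decision function. This is a hypothetical model, not the actual write controller: memtables are reduced to a count of buffers waiting to be flushed, and the minimum number of configured buffers needed for the slowdown is passed in as a parameter instead of being hard-coded.

```cpp
#include <cassert>

enum class WriteCondition { kNormal, kDelayed, kStopped };

// Hypothetical model of the rule described in the summary.
// num_unflushed_memtables: write buffers that are full and not yet flushed.
// min_buffers_for_slowdown: the slowdown applies only when more buffers
// than this are configured (the summary says "more than four").
WriteCondition MemtableWriteCondition(int num_unflushed_memtables,
                                      int max_write_buffer_number,
                                      int min_buffers_for_slowdown) {
  if (num_unflushed_memtables >= max_write_buffer_number) {
    return WriteCondition::kStopped;  // no buffer left to write into
  }
  if (max_write_buffer_number > min_buffers_for_slowdown &&
      num_unflushed_memtables == max_write_buffer_number - 1) {
    return WriteCondition::kDelayed;  // writing into the last buffer
  }
  return WriteCondition::kNormal;
}
```

In the delayed state a real implementation would throttle writes to options.delayed_write_rate; the sketch only reports which state applies.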
  32. 10 Dec, 2015 1 commit
    • Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit · 56e77f09
      sdong committed
      Summary: Deprecate options.soft_rate_limit, which is hard to tune, and replace it with options.soft_pending_compaction_bytes_limit, which triggers the slowdown if the estimated pending compaction bytes exceed the threshold. The hope is to make it more straightforward to tune.
      
      Test Plan: Modify DBTest.SoftLimit to cover options.soft_pending_compaction_bytes_limit instead; run all unit tests.
      
      Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, igor, anthony
      
      Reviewed By: anthony
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D51117
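How the soft limit combines with options.hard_pending_compaction_bytes_limit (added in an earlier commit in this history) can be sketched as a threshold check. The function and enum names are hypothetical, and treating a limit of 0 as "disabled" is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

enum class WriteStall { kNone, kSlowdown, kStop };

// Hypothetical decision function based on estimated compaction debt.
// A limit of 0 is treated as disabled (assumption of this sketch).
WriteStall StallForCompactionDebt(uint64_t estimated_pending_bytes,
                                  uint64_t soft_limit, uint64_t hard_limit) {
  if (hard_limit > 0 && estimated_pending_bytes > hard_limit) {
    return WriteStall::kStop;
  }
  if (soft_limit > 0 && estimated_pending_bytes > soft_limit) {
    return WriteStall::kSlowdown;
  }
  return WriteStall::kNone;
}
```

Both thresholds are expressed in bytes of compaction debt, which is what makes them easier to reason about than a rate-based limit.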
  33. 17 Oct, 2015 1 commit
  34. 14 Oct, 2015 1 commit
  35. 15 Sep, 2015 2 commits
    • Add options.hard_pending_compaction_bytes_limit to stop writes if compaction lagging behind · 5de807ac
      sdong committed
      Summary: Add an option to stop writes if compaction falls behind. If the estimated pending compaction bytes exceed the threshold specified by options.hard_pending_compaction_bytes_limit, writes will stop until compactions bring it back under the threshold.
      
      Test Plan: Add unit test DBTest.HardLimit
      
      Reviewers: rven, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, igor
      
      Reviewed By: igor
      
      Subscribers: MarkCallaghan, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D45999
    • Add counters for L0 stall while L0-L1 compaction is taking place · 03ddce9a
      Ari Ekmekji committed
      Summary:
      Although there are currently counters to keep track of the
      stall caused by having too many L0 files, they do not distinguish
      whether, when that stall occurs, either (A) L0-L1 compaction is taking
      place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
      at the moment. This diff adds a counter for (A) so that the nature of L0
      stalls can be better understood.
      
      Test Plan: make all && make check
      
      Reviewers: sdong, igor, anthony, noetzli, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: MarkCallaghan, dhruba
      
      Differential Revision: https://reviews.facebook.net/D46749
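The distinction between cases (A) and (B) can be sketched with two counters. The struct and function names are hypothetical; case (B) is not stored but can be derived as the difference between the two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counters: the existing total plus the new counter for (A).
struct L0StallCounters {
  uint64_t total_stalls = 0;
  uint64_t stalls_with_l0_l1_compaction = 0;  // case (A)
};

// Records one L0 stall and notes whether an L0-L1 compaction was running.
void RecordL0Stall(L0StallCounters* counters, bool l0_l1_compaction_running) {
  ++counters->total_stalls;
  if (l0_l1_compaction_running) {
    ++counters->stalls_with_l0_l1_compaction;
  }
}
```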
  36. 26 Aug, 2015 1 commit
    • Expose per-level aggregated table properties via GetProperty() · 6996de87
      Yueh-Hsuan Chiang committed
      Summary:
      This patch adds "rocksdb.aggregated-table-properties"
      and "rocksdb.aggregated-table-properties-at-levelN". The former
      returns the aggregated table properties of a column family,
      while the latter returns the aggregated table properties
      of the specified level N.
      
      Test Plan: Added tests in db_test
      
      Reviewers: igor, sdong, IslamAbdelRahman, anthony
      
      Reviewed By: anthony
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D45087
  37. 21 Aug, 2015 2 commits
    • Add a counter about estimated pending compaction bytes · 07d2d341
      sdong committed
      Summary:
      Add a counter of the estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
      In the future, we can use a threshold on this counter to replace the soft rate limit and hard rate limit. A single threshold of estimated compaction debt in bytes makes it easier for users to reason about when writes should slow down or stop than the more abstract soft and hard rate limits do.
      
      Test Plan: Add unit tests
      
      Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor
      
      Reviewed By: igor
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D44205
    • Total SST files size DB Property · 027ca5b2
      Islam AbdelRahman committed
      Summary: Add a new DB property that calculates the total size of files used by all RocksDB Versions
      
      Test Plan: Unittests for the new property
      
      Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba
      
      Differential Revision: https://reviews.facebook.net/D44799
  38. 20 Aug, 2015 1 commit
    • Introduce GetIntProperty("rocksdb.size-all-mem-tables") · df79eafc
      Yueh-Hsuan Chiang committed
      Summary:
      Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
      the memory usage of those memtables which have not yet been flushed.
      
      This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
      which includes the memory usage of all the memtables, including those
      that have been flushed but are still pinned by iterators.
      
      Test Plan: Added a test in db_test
      
      Reviewers: igor, anthony, IslamAbdelRahman, sdong
      
      Reviewed By: sdong
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D44229
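The difference between the two properties can be sketched with a toy memtable list. `MemtableInfo` and both functions are hypothetical; a flushed memtable that no iterator pins is assumed to be freed and therefore counted by neither property.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy description of one memtable.
struct MemtableInfo {
  uint64_t bytes;
  bool flushed;             // already written out to an SST file
  bool pinned_by_iterator;  // still referenced, so its memory is not freed
};

// Models "rocksdb.cur-size-all-mem-tables": unflushed memtables only.
uint64_t CurSizeAllMemTables(const std::vector<MemtableInfo>& memtables) {
  uint64_t total = 0;
  for (const MemtableInfo& m : memtables) {
    if (!m.flushed) total += m.bytes;
  }
  return total;
}

// Models "rocksdb.size-all-mem-tables": unflushed memtables plus flushed
// ones that are still pinned by iterators.
uint64_t SizeAllMemTables(const std::vector<MemtableInfo>& memtables) {
  uint64_t total = 0;
  for (const MemtableInfo& m : memtables) {
    if (!m.flushed || m.pinned_by_iterator) total += m.bytes;
  }
  return total;
}
```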