1. 06 Jul 2017, 1 commit
    • Fix GetCurrentTime() initialization for valgrind · 33042573
      Committed by Andrew Kryczka
      Summary:
      Valgrind raised false-positive complaints about the initialization pattern for `GetCurrentTime()`'s argument in #2480. We can instead have the client initialize the time variable before calling `GetCurrentTime()`, and have `GetCurrentTime()` promise to overwrite it only in the success case.
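      A minimal sketch of the calling pattern described above (illustrative; the helper name is hypothetical, not part of the diff):
      ```
      #include <cstdint>
      #include "rocksdb/env.h"

      // The caller initializes the output variable itself; GetCurrentTime()
      // promises to overwrite it only on success, so valgrind never sees a
      // read of uninitialized memory on the failure path.
      int64_t GetCreationTime(rocksdb::Env* env) {
        int64_t current_time = 0;  // client-side initialization (fallback value)
        rocksdb::Status s = env->GetCurrentTime(&current_time);
        return s.ok() ? current_time : 0;
      }
      ```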
      Closes https://github.com/facebook/rocksdb/pull/2526
      
      Differential Revision: D5358689
      
      Pulled By: ajkr
      
      fbshipit-source-id: 857b189f24c19196f6bb299216f3e23e7bc4be42
2. 01 Jul 2017, 1 commit
    • Simplify and document sync rules for logs_ etc · 1e34d07e
      Committed by Maysam Yabandeh
      Summary:
      Add/correct inline comments and clarify the sync rules. To keep them simple to reason about, the rules are a bit general, which results in some extra synchronization. However, such synchronization is not on the fast path, and it is worth the simplicity.
      Closes https://github.com/facebook/rocksdb/pull/2517
      
      Differential Revision: D5348239
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: ff2e59fb1e568c122d2cdbf598310f3613b7d212
3. 28 Jun 2017, 1 commit
    • FIFO Compaction with TTL · 1cd45cd1
      Committed by Sagar Vemuri
      Summary:
      Introducing FIFO compactions with TTL.
      
      FIFO compaction is based on size only, which makes it tricky to enable in production, as use cases can grow organically. A user requested an option to drop files based on their creation time instead of the total size.
      
      To address that request:
      - Added a new TTL option to FIFO compaction options.
      - Updated FIFO compaction score to take TTL into consideration.
      - Added a new table property, creation_time, to keep track of when the SST file is created.
      - `creation_time` is set as follows:
        - On Flush: Set to the time of flush.
        - On Compaction: Set to the max creation_time of all the files involved in the compaction.
        - On Repair and Recovery: Set to the time of repair/recovery.
        - Old files created prior to this code change will have a creation_time of 0.
      - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction, i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This enables cases where you might want to delete all files older than, say, 1 day (see the configuration sketch after this description).
      - FIFO compaction will fall back to the prior way of deleting files based on size if:
        - the creation_time of all files involved in compaction is 0.
        - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
      
      This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
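      A minimal configuration sketch, assuming the `ttl` field this change adds to `CompactionOptionsFIFO` (in seconds); the path and sizes are illustrative:
      ```
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.compaction_style = rocksdb::kCompactionStyleFIFO;
        options.compaction_options_fifo.max_table_files_size = 100 * 1024 * 1024;
        options.compaction_options_fifo.ttl = 24 * 60 * 60;  // drop files older than 1 day
        options.max_open_files = -1;  // required: TTL is not supported otherwise
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/fifo_ttl_db", &db);
        delete db;  // no-op if Open failed and db is still nullptr
        return s.ok() ? 0 : 1;
      }
      ```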
      
      **Test Plan:**
      Added tests.
      
      **Benchmark results:**
      Base: FIFO with max size: 100MB ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
      
      readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
      ```
      
      With TTL (a low one for testing) ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
      
      readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
      ```
      Example Log lines:
      ```
      2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
      2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
      ...
      2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
      2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
      ```
      
      SST Files remaining in the dbbench dir, after db_bench execution completed:
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
      -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
      -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
      -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
      ```
      Closes https://github.com/facebook/rocksdb/pull/2480
      
      Differential Revision: D5305116
      
      Pulled By: sagar0
      
      fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
4. 25 Jun 2017, 1 commit
    • Optimize for serial commits in 2PC · 499ebb3a
      Committed by Maysam Yabandeh
      Summary:
      Throughput: 46k tps in our sysbench settings (details to be filled in later)
      
      The idea is to have the simplest change that gives us a reasonable boost
      in 2PC throughput.
      
      Major design changes:
      1. The WAL file's internal buffer is not flushed after each write. Instead,
      it is flushed before critical operations (WAL copy via fs) or when
      FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
      via mutex_.
      2. Use two sequence numbers: last seq, and last seq for write. Last seq
      is the last visible sequence number for reads. Last seq for write is the
      next sequence number that should be used to write to WAL/memtable. This
      allows a memtable write to proceed in parallel with WAL writes (a sketch
      of this two-counter scheme follows the list).
      3. BatchGroup is not used for writes. This means that we can have
      parallel writers, which changes a major assumption in the code base. To
      accommodate that, i) allow only one WriteImpl at a time to write to the
      memtable, guarded by mem_mutex_, which is fine since in 2PC almost all
      memtable writes come via the group commit phase, which is serial anyway;
      ii) make all the parts of the code base that assumed they were the only
      writer (via EnterUnbatched) also acquire mem_mutex_; iii) protect stat
      updates via a stat_mutex_.
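      A hypothetical sketch of the two-counter scheme in point 2 (names and structure are illustrative, not RocksDB's actual implementation):
      ```
      #include <atomic>
      #include <cstdint>

      class SequenceTracker {
       public:
        // Reserve `count` sequence numbers for a batch about to be written
        // to the WAL/memtable; returns the first reserved number.
        uint64_t AllocateForWrite(uint64_t count) {
          return last_seq_for_write_.fetch_add(count) + 1;
        }
        // Publish a completed write so reads can see it.
        void AdvanceVisible(uint64_t seq) {
          last_seq_.store(seq, std::memory_order_release);
        }
        // Reads/snapshots use the last published sequence number.
        uint64_t Visible() const {
          return last_seq_.load(std::memory_order_acquire);
        }

       private:
        std::atomic<uint64_t> last_seq_{0};            // last visible to reads
        std::atomic<uint64_t> last_seq_for_write_{0};  // last handed to a writer
      };
      ```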
      
      Note: the first commit has the approach figured out but is not clean.
      Submitting the PR anyway to get early feedback on the approach. If
      we are ok with the approach, I will go ahead with these updates:
      0) Rebase with Yi's pipelining changes
      1) Currently batching is disabled by default to make sure that it will be
      consistent with all unit tests. Will make this optional via a config.
      2) A couple of unit tests are disabled. They need to be updated with the
      serial commit of 2PC taken into account.
      3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
      releasing mutex_ beforehand (the same way EnterUnbatched does). This
      needs to be cleaned up.
      Closes https://github.com/facebook/rocksdb/pull/2345
      
      Differential Revision: D5210732
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
5. 25 May 2017, 2 commits
    • Introduce max_background_jobs mutable option · bb01c188
      Committed by Andrew Kryczka
      Summary:
      - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
      - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
      - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs. compactions. Currently it's very simple: we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions (see the sketch after this list).
      - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
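      A minimal sketch of the new option (the value 8 is an arbitrary example):
      ```
      #include "rocksdb/options.h"

      rocksdb::Options MakeOptions() {
        rocksdb::Options options;
        // With 8 background jobs, roughly one-fourth (2) are allocated to
        // flushes and the remaining 6 can be used for compactions.
        options.max_background_jobs = 8;
        // max_background_flushes / max_background_compactions are left
        // unset; they remain only for backwards compatibility.
        return options;
      }
      ```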
      Closes https://github.com/facebook/rocksdb/pull/2205
      
      Differential Revision: D4937461
      
      Pulled By: ajkr
      
      fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
    • options.delayed_write_rate uses the rate of rate_limiter by default · 41cbb727
      Committed by Siying Dong
      Summary:
      It's hard for RocksDB to come up with a good default for delayed_write_rate. Use the rate given by the rate limiter if it is available. This gets the order of magnitude of the I/O rate right.
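      A sketch of the resulting behavior, assuming a generic rate limiter (the 64 MB/s figure is an arbitrary example):
      ```
      #include "rocksdb/options.h"
      #include "rocksdb/rate_limiter.h"

      rocksdb::Options MakeOptions() {
        rocksdb::Options options;
        options.rate_limiter.reset(
            rocksdb::NewGenericRateLimiter(64 << 20));  // 64 MB/s
        // options.delayed_write_rate is left unset, so it now defaults to
        // the rate limiter's rate rather than a hard-coded constant.
        return options;
      }
      ```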
      Closes https://github.com/facebook/rocksdb/pull/2357
      
      Differential Revision: D5115324
      
      Pulled By: siying
      
      fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754
6. 23 May 2017, 1 commit
7. 08 May 2017, 1 commit
    • Add bulk create/drop column family API · 2cd00773
      Committed by Yi Wu
      Summary:
      Adding DB::CreateColumnFamilies() and DB::DropColumnFamilies() to bulk create/drop column families. This addresses the problem that creating/dropping 1k column families takes minutes: the bottleneck is that we persist the options file on every single column family create/drop and parse the persisted options file for verification, which takes a lot of CPU time.
      
      The new APIs create/drop the column families individually and persist the options file once at the end. This brings creating 1k column families down to within ~0.1s. A further improvement would be to merge the manifest writes into one IO.
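      A minimal sketch of the bulk API (the count of 1000 is an arbitrary example):
      ```
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"

      void BulkCreateDrop(rocksdb::DB* db) {
        std::vector<std::string> names;
        for (int i = 0; i < 1000; i++) {
          names.push_back("cf_" + std::to_string(i));
        }
        // One options-file write for the whole batch instead of one per
        // column family.
        std::vector<rocksdb::ColumnFamilyHandle*> handles;
        rocksdb::Status s = db->CreateColumnFamilies(
            rocksdb::ColumnFamilyOptions(), names, &handles);
        if (s.ok()) {
          s = db->DropColumnFamilies(handles);
          for (auto* h : handles) {
            delete h;  // handles must still be released after the drop
          }
        }
      }
      ```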
      Closes https://github.com/facebook/rocksdb/pull/2248
      
      Differential Revision: D5001578
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
8. 06 May 2017, 1 commit
9. 04 May 2017, 1 commit
    • Max open files mutable · e7ae4a3a
      Committed by Leonidas Galanis
      Summary:
      Makes the max_open_files DB option dynamically settable via SetDBOptions. During the call to SetDBOptions we call SetCapacity on the table cache, which is an LRUCache.
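      A minimal sketch of the runtime change (the value 5000 is an arbitrary example):
      ```
      #include <unordered_map>
      #include "rocksdb/db.h"

      // SetDBOptions internally calls SetCapacity on the table cache (an
      // LRUCache), so the new limit takes effect without reopening the DB.
      rocksdb::Status RaiseMaxOpenFiles(rocksdb::DB* db) {
        return db->SetDBOptions({{"max_open_files", "5000"}});
      }
      ```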
      Closes https://github.com/facebook/rocksdb/pull/2185
      
      Differential Revision: D4979189
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda
10. 28 Apr 2017, 1 commit
11. 14 Apr 2017, 1 commit
    • change use_direct_writes to use_direct_io_for_flush_and_compaction · 44fa8ece
      Committed by Aaron Gao
      Summary:
      Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction.
      Now if Options::use_direct_io_for_flush_and_compaction = true, we enable direct I/O for both reads and writes in flush and compaction jobs, whereas Options::use_direct_reads controls user reads like iterators and Get().
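      A minimal sketch of the two options after the rename:
      ```
      #include "rocksdb/options.h"

      rocksdb::Options MakeOptions() {
        rocksdb::Options options;
        // Direct I/O for background flush and compaction jobs (the renamed
        // option), independent of direct I/O for user reads.
        options.use_direct_io_for_flush_and_compaction = true;
        options.use_direct_reads = true;  // user reads: Get() and iterators
        return options;
      }
      ```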
      Closes https://github.com/facebook/rocksdb/pull/2117
      
      Differential Revision: D4860912
      
      Pulled By: lightmark
      
      fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
12. 07 Apr 2017, 1 commit
    • Refactor compaction picker code · ff972870
      Committed by Siying Dong
      Summary:
      1. Move universal compaction picker to separate files compaction_picker_universal.cc and compaction_picker_universal.h.
      2. Rename some functions to make the code easier to understand.
      3. Move leveled compaction picking code to a dedicated class, so that we don't need to pass some common variables around when calling functions. It also allowed us to break down LevelCompactionPicker::PickCompaction() into smaller functions.
      Closes https://github.com/facebook/rocksdb/pull/2100
      
      Differential Revision: D4845948
      
      Pulled By: siying
      
      fbshipit-source-id: efa0ab4
13. 06 Apr 2017, 2 commits