1. 28 Jun, 2017 1 commit
    • S
      FIFO Compaction with TTL · 1cd45cd1
      Sagar Vemuri committed
      Summary:
      Introducing FIFO compactions with TTL.
      
      FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.
      
      To address that request:
      - Added a new TTL option to FIFO compaction options.
      - Updated FIFO compaction score to take TTL into consideration.
      - Added a new table property, creation_time, to keep track of when the SST file is created.
      - Creation_time is set as below:
        - On Flush: Set to the time of flush.
        - On Compaction: Set to the max creation_time of all the files involved in the compaction.
        - On Repair and Recovery: Set to the time of repair/recovery.
        - Old files created prior to this code change will have a creation_time of 0.
      - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
      - FIFO compaction will fall back to the prior way of deleting files based on size if:
        - the creation_time of all files involved in compaction is 0.
        - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
      
      This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
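
The selection rules above can be sketched as a small, self-contained model. This is only an illustration under stated assumptions, not the RocksDB implementation: `SstFile`, `pick_files_to_delete` and `compaction_output_creation_time` are hypothetical names, files are assumed to be listed oldest first, files with a creation_time of 0 are never treated as expired, and a total size exactly equal to the cap is treated as fitting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SstFile:
    number: int
    size: int           # bytes
    creation_time: int  # seconds since epoch; 0 means unknown (file predates the change)


def compaction_output_creation_time(inputs: List[SstFile]) -> int:
    """On compaction, the output takes the max creation_time of its inputs."""
    return max(f.creation_time for f in inputs)


def pick_files_to_delete(files: List[SstFile], now: int, ttl: int,
                         max_table_files_size: int) -> List[int]:
    """Return the numbers of the files a FIFO compaction would drop.

    `files` is assumed to be ordered oldest first (an assumption of this sketch).
    """
    total = sum(f.size for f in files)
    all_unknown = all(f.creation_time == 0 for f in files)

    if ttl > 0 and not all_unknown:
        # TTL path: file.creation_time < (current_time - ttl) => delete.
        # Skipping creation_time == 0 is an assumption of this sketch.
        expired = [f for f in files if 0 < f.creation_time < now - ttl]
        remaining = total - sum(f.size for f in expired)
        if remaining <= max_table_files_size:
            return [f.number for f in expired]

    # Size-based fallback: drop the oldest files until the total fits the cap.
    doomed = []
    for f in files:
        if total <= max_table_files_size:
            break
        doomed.append(f.number)
        total -= f.size
    return doomed
```

With three 40-byte files created at times 100, 200 and 300, a current time of 350 and a ttl of 100, the first two files are older than the cutoff of 250 and are picked; if every creation_time were 0, the same call would fall back to trimming by size.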
      
      **Test Plan:**
      Added tests.
      
      **Benchmark results:**
      Base: FIFO with max size: 100MB ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
      
      readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
      ```
      
      With TTL (a low one for testing) ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
      
      readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
      ```
      Example Log lines:
      ```
      2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
      2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
      ...
      2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
      2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
      ```
      
      SST Files remaining in the dbbench dir, after db_bench execution completed:
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
      -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
      -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
      -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
      ```
      Closes https://github.com/facebook/rocksdb/pull/2480
      
      Differential Revision: D5305116
      
      Pulled By: sagar0
      
      fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
      1cd45cd1
  2. 27 Jun, 2017 2 commits
  3. 14 Jun, 2017 1 commit
  4. 13 Jun, 2017 1 commit
  5. 12 Jun, 2017 1 commit
    • S
      Sample number of reads per SST file · 5582123d
      Siying Dong committed
      Summary:
      We estimate the number of reads per SST file by updating a per-file counter in sampled read requests. This information can later be used to trigger compactions to improve read performance.
      Closes https://github.com/facebook/rocksdb/pull/2417
      
      Differential Revision: D5193528
      
      Pulled By: siying
      
      fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df
      5582123d
  6. 10 Jun, 2017 1 commit
  7. 06 Jun, 2017 1 commit
  8. 03 Jun, 2017 2 commits
    • A
      using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public… · 7f6c02dd
      Aaron Gao committed
      Summary:
      … headers
      
      https://github.com/facebook/rocksdb/pull/2199 should not have added references to RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users then have to provide these compiler flags when building their binary with RocksDB.
      
      We should hide the thread-local global variables inside our implementation and just expose a function API to retrieve them. It may break some users for now, but it is good for the long term.
      
      make check -j64
      Closes https://github.com/facebook/rocksdb/pull/2380
      
      Differential Revision: D5177896
      
      Pulled By: lightmark
      
      fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
      7f6c02dd
    • S
      Improve write buffer manager (and allow the size to be tracked in block cache) · 95b0e89b
      Siying Dong committed
      Summary:
      Improve write buffer manager in several ways:
      1. Size is tracked when arena block is allocated, rather than every allocation, so that it can better track actual memory usage and the tracking overhead is slightly lower.
      2. We start to trigger memtable flush when 7/8 of the memory cap is reached, instead of 100%, and make 100% much harder to hit.
      3. Allow a cache object to be passed into buffer manager and the size allocated by memtable can be costed there. This can help users have one single memory cap across block cache and memtable.
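
A minimal sketch of points 1 and 2, assuming a simplified accounting model. `WriteBufferManagerSketch` and its method names are hypothetical, and using `>=` at the 7/8 mark is an assumption; the commit only says that flushing starts at 7/8 of the cap.

```python
class WriteBufferManagerSketch:
    """Toy model: memory is tracked per arena block, flush starts at 7/8 of the cap."""

    def __init__(self, buffer_size: int):
        self.buffer_size = buffer_size
        # Flush threshold at 7/8 of the cap, so that 100% is harder to reach.
        self.flush_threshold = buffer_size * 7 // 8
        self.memory_used = 0

    def reserve_arena_block(self, block_size: int) -> None:
        # Called once per arena block, not once per individual allocation.
        self.memory_used += block_size

    def free_arena_block(self, block_size: int) -> None:
        self.memory_used -= block_size

    def should_flush(self) -> bool:
        # A buffer_size of 0 is treated as "no cap" in this sketch.
        return self.buffer_size > 0 and self.memory_used >= self.flush_threshold
```

For a cap of 800 bytes the threshold is 700: reserving 600 bytes does not trigger a flush, and reserving another 100 does.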
      Closes https://github.com/facebook/rocksdb/pull/2350
      
      Differential Revision: D5110648
      
      Pulled By: siying
      
      fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
      95b0e89b
  9. 25 May, 2017 2 commits
    • A
      Introduce max_background_jobs mutable option · bb01c188
      Andrew Kryczka committed
      Summary:
      - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
      - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
      - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
      - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
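
The partitioning described above can be sketched in a few lines. The function name is hypothetical, and clamping each side to at least one job is an assumption of this sketch; the commit message only states the one-fourth split.

```python
def partition_background_jobs(max_background_jobs: int):
    """Split a job budget: one-fourth to flushes, the remainder to compactions."""
    # Clamping both sides to at least 1 is an assumption, so small budgets still work.
    max_flushes = max(1, max_background_jobs // 4)
    max_compactions = max(1, max_background_jobs - max_flushes)
    return max_flushes, max_compactions
```

With `max_background_jobs = 8` this yields 2 flushes and 6 compactions.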
      Closes https://github.com/facebook/rocksdb/pull/2205
      
      Differential Revision: D4937461
      
      Pulled By: ajkr
      
      fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
      bb01c188
    • S
      options.delayed_write_rate use the rate of rate_limiter by default. · 41cbb727
      Siying Dong committed
      Summary:
      It's hard for RocksDB to come up with a good default for the delayed write rate. Use the rate given by the rate limiter if it is available. This provides the order of magnitude of the I/O rate.
      Closes https://github.com/facebook/rocksdb/pull/2357
      
      Differential Revision: D5115324
      
      Pulled By: siying
      
      fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754
      41cbb727
  10. 24 May, 2017 2 commits
    • A
      New API for background work in single thread pool · 6cc9aef1
      Andrew Kryczka committed
      Summary:
      Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.
      
      Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
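
The routing rule can be sketched as follows. `Priority` and `pool_for_job` are hypothetical names for illustration; the sketch assumes that flushes normally go to the high-pri pool and compactions to the low-pri pool.

```python
from enum import Enum


class Priority(Enum):
    LOW = "low"
    HIGH = "high"


def pool_for_job(job_kind: str, high_pri_threads: int) -> Priority:
    """Pick the thread pool for a background job ("flush" or "compaction")."""
    # With a zero-sized high-pri pool, every background job shares the low-pri pool.
    if job_kind == "flush" and high_pri_threads > 0:
        return Priority.HIGH
    return Priority.LOW
```

In this model the thread count passed in would come from something like `Env::GetBackgroundThreads()`, which the commit adds for exactly this check.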
      Closes https://github.com/facebook/rocksdb/pull/2204
      
      Differential Revision: D4936256
      
      Pulled By: ajkr
      
      fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
      6cc9aef1
    • A
      Core-local statistics · ac39d6be
      Andrew Kryczka committed
      Summary:
      This diff changes `StatisticsImpl` from a thread-local approach to a core-local one. The goal is to perform faster aggregations, particularly for applications that have many threads. There should be no behavior change.
      Closes https://github.com/facebook/rocksdb/pull/2258
      
      Differential Revision: D5016258
      
      Pulled By: ajkr
      
      fbshipit-source-id: 7d4d165b4a91d8110f0409d113d1be91f22d31a9
      ac39d6be
  11. 20 May, 2017 1 commit
    • Y
      New WriteImpl to pipeline WAL/memtable write · 07bdcb91
      Yi Wu committed
      Summary:
      PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allowed to write at a time. This thread does both WAL and memtable writes for all write threads in the write group. Pending writers wait in a queue until the current writer finishes. In the pipelined write approach, two queues are maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) still need to first join the WAL writer queue, and after the housekeeping work and WAL writing, they join the memtable writer queue if needed. The benefits of this approach are:
      1. Writers without memtable writes (e.g. the prepare phase of two-phase commit) can exit the write thread once the WAL write is finished. They don't need to wait for memtable writes in the case of group commit.
      2. Pending writers only need to wait for the previous WAL writer to finish before joining the write thread, instead of also waiting for the previous memtable writes.
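
The two benefits can be illustrated with a toy timing model. This is a deliberate simplification rather than the actual write-thread logic: each writer is modeled as a `(wal_cost, memtable_cost)` pair, every writer is treated as its own write group, and both function names are hypothetical.

```python
from typing import List, Tuple

Writer = Tuple[int, int]  # (wal_cost, memtable_cost); memtable_cost 0 = WAL-only writer


def serial_finish_times(writers: List[Writer]) -> List[int]:
    """Single queue: each writer waits for the previous WAL *and* memtable write."""
    finish, clock = [], 0
    for wal, mem in writers:
        clock += wal + mem
        finish.append(clock)
    return finish


def pipelined_finish_times(writers: List[Writer]) -> List[int]:
    """Two queues: the WAL stage and the memtable stage can overlap."""
    finish = []
    wal_done = 0  # time at which the WAL stage becomes free
    mem_done = 0  # time at which the memtable stage becomes free
    for wal, mem in writers:
        wal_done += wal
        if mem == 0:
            # WAL-only writers leave as soon as their WAL write finishes.
            finish.append(wal_done)
        else:
            mem_done = max(wal_done, mem_done) + mem
            finish.append(mem_done)
    return finish
```

For writers `[(1, 2), (1, 0), (1, 2)]`, the serial model finishes at times 3, 4 and 7, while the pipelined model finishes at 3, 2 and 5: the WAL-only writer exits early and the third writer overlaps its WAL write with the first writer's memtable write.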
      
      Merging #2056 and #2058 into this PR.
      Closes https://github.com/facebook/rocksdb/pull/2286
      
      Differential Revision: D5054606
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d
      07bdcb91
  12. 18 May, 2017 2 commits
  13. 13 May, 2017 1 commit
    • A
      Add GetAllKeyVersions API · 3fa9a39c
      Andrew Kryczka committed
      Summary:
      - Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
      - Added debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to internal key without exposing any internal type.
      - Migrated the "ldb idump" subcommand to use this function
      - The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
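
One common way to work around the inclusive-exclusive range for a single user key, assuming the default bytewise key ordering, is to use the key followed by a zero byte as the exclusive upper bound. The sketch below models that on plain Python tuples; `versions_in_range` and `single_key_range` are hypothetical helpers, not part of the RocksDB API.

```python
from typing import List, Tuple

Entry = Tuple[bytes, int, bytes]  # (user_key, sequence, value)


def versions_in_range(entries: List[Entry], begin_key: bytes,
                      end_key: bytes) -> List[Entry]:
    """Return entries whose user key lies in [begin_key, end_key)."""
    return [e for e in entries if begin_key <= e[0] < end_key]


def single_key_range(user_key: bytes) -> Tuple[bytes, bytes]:
    """Smallest [begin, end) range that contains exactly one user key.

    Assumes bytewise ordering: user_key + b"\\x00" is the next possible key.
    """
    return user_key, user_key + b"\x00"
```

Calling `versions_in_range(entries, *single_key_range(b"a"))` then returns every version of `b"a"` and nothing else.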
      Closes https://github.com/facebook/rocksdb/pull/2232
      
      Differential Revision: D4976007
      
      Pulled By: ajkr
      
      fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
      3fa9a39c
  14. 08 May, 2017 1 commit
    • Y
      Add bulk create/drop column family API · 2cd00773
      Yi Wu committed
      Summary:
      Adding DB::CreateColumnFamilies() and DB::DropColumnFamilies() to bulk create/drop column families. This addresses the problem that creating/dropping 1k column families takes minutes. The bottleneck is that we persist the options file for every single column family create/drop, and then parse the persisted options file for verification, which takes a lot of CPU time.
      
      The new APIs simply create/drop column families individually, and persist the options file once at the end. This brings creating 1k column families down to within ~0.1s. A further improvement would be to merge the manifest writes into one IO.
      Closes https://github.com/facebook/rocksdb/pull/2248
      
      Differential Revision: D5001578
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
      2cd00773
  15. 05 May, 2017 2 commits
  16. 04 May, 2017 2 commits
  17. 28 Apr, 2017 1 commit
    • D
      Remove double buffering on RandomRead on Windows. · cdad04b0
      Dmitri Smirnov committed
      Summary:
      Remove double buffering on RandomRead on Windows.
        With more logic appearing in the file reader/writer, Read no longer
        obeys forwarding calls to the Windows implementation.
        Previously direct_io (unbuffered) was only available on Windows,
        but it is now supported generically.
        We remove intermediate buffering on Windows.
        Remove the random_access_max_buffer_size option, which was Windows-specific.
        Non-zero values for that option introduced unnecessary lock contention.
        Remove Env::EnableReadAhead() and Env::ShouldForwardRawRequest(), which are
        no longer necessary.
        Add aligned buffer reads for cases when requested reads exceed the read-ahead size.
      Closes https://github.com/facebook/rocksdb/pull/2105
      
      Differential Revision: D4847770
      
      Pulled By: siying
      
      fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f
      cdad04b0
  18. 27 Apr, 2017 1 commit
    • A
      Add user stats Reset API · efc361ef
      Andrew Kryczka committed
      Summary:
      It resets all the ticker and histogram stats to zero. Needed to change the locking a bit since Reset() is the only operation that manipulates multiple tickers/histograms together, and that operation should be seen as atomic by other operations that access tickers/histograms.
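
The atomicity requirement can be sketched with a single lock shared by all operations. This is a simplified model with hypothetical names (`StatsSketch`, `record_tick`, `snapshot`), and the real locking scheme in the commit may well differ; it only shows why a reader should never observe a half-reset set of tickers.

```python
import threading
from typing import Dict, Iterable


class StatsSketch:
    """Toy ticker store where reset() is atomic with respect to other operations."""

    def __init__(self, ticker_names: Iterable[str]):
        self._lock = threading.Lock()
        self._tickers: Dict[str, int] = {name: 0 for name in ticker_names}

    def record_tick(self, name: str, count: int = 1) -> None:
        with self._lock:
            self._tickers[name] += count

    def snapshot(self) -> Dict[str, int]:
        # Taken under the same lock, so it sees all tickers either before or
        # after a reset, never a mix of the two.
        with self._lock:
            return dict(self._tickers)

    def reset(self) -> None:
        # The only operation that touches every ticker together.
        with self._lock:
            for name in self._tickers:
                self._tickers[name] = 0
```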
      Closes https://github.com/facebook/rocksdb/pull/2213
      
      Differential Revision: D4952232
      
      Pulled By: ajkr
      
      fbshipit-source-id: c0475c3e4c7b940120d53891b69c3091149a0679
      efc361ef
  19. 21 Apr, 2017 1 commit
    • S
      tools/check_format_compatible.sh to cover option file loading too · 97005dbd
      Siying Dong committed
      Summary:
      tools/check_format_compatible.sh will check that a newer version of RocksDB can open option files generated by older releases. To achieve that, a new parameter "--try_load_options" is added to ldb. With this parameter set, if an option file exists, we load it and use it to open the DB. With this option set, we can validate the option loading logic.
      Closes https://github.com/facebook/rocksdb/pull/2178
      
      Differential Revision: D4914989
      
      Pulled By: siying
      
      fbshipit-source-id: db114f7724fcb41e5e9483116d84d7c4b8389ca4
      97005dbd
  20. 19 Apr, 2017 2 commits
  21. 14 Apr, 2017 1 commit
    • A
      change use_direct_writes to use_direct_io_for_flush_and_compaction · 44fa8ece
      Aaron Gao committed
      Summary:
      Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction
      Now if Options::use_direct_io_for_flush_and_compaction = true, we will enable direct io for both reads and writes for flush and compaction job. Whereas Options::use_direct_reads controls user reads like iterator and Get().
      Closes https://github.com/facebook/rocksdb/pull/2117
      
      Differential Revision: D4860912
      
      Pulled By: lightmark
      
      fbshipit-source-id: d93575a8a5e780cf7e40797287edc425ee648c19
      44fa8ece
  22. 13 Apr, 2017 1 commit
  23. 06 Apr, 2017 2 commits
  24. 05 Apr, 2017 1 commit
    • A
      Level-based L0->L0 compaction · d659faad
      Andrew Kryczka committed
      Summary:
      Level-based L0->L0 compaction operates on spans of files that aren't currently being compacted. It reduces the number of L0 files, thus making write stall conditions harder to reach.
      
      - L0->L0 is triggered when base level is unavailable due to pending compactions
      - L0->L0 always outputs one file of at most `max_level0_burst_file_size` bytes.
      - Subcompactions are disabled for L0->L0 since we want to output one file.
      - Input files are chosen as the longest span of available files that will fit within the size limit. This minimizes number of files in L0.
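
The input selection in the last bullet can be sketched as a sliding window over the L0 files. This is an illustration only: `pick_l0_span` is a hypothetical name, files are modeled as `(size, being_compacted)` pairs with positive sizes, and the order in which ties between equally long spans are resolved (first one wins) is an assumption.

```python
from typing import List, Optional, Tuple


def pick_l0_span(files: List[Tuple[int, bool]],
                 max_bytes: int) -> Optional[Tuple[int, int]]:
    """Return the longest run [start, end) of available files within max_bytes.

    `files` holds (size_in_bytes, being_compacted) pairs; returns None if no
    available file fits.
    """
    best = None
    start = 0
    total = 0
    for end, (size, busy) in enumerate(files):
        if busy:
            # A file already being compacted breaks the span.
            start = end + 1
            total = 0
            continue
        total += size
        # Shrink from the left until the span fits the size limit again.
        while total > max_bytes:
            total -= files[start][0]
            start += 1
        if start <= end and (best is None or end + 1 - start > best[1] - best[0]):
            best = (start, end + 1)
    return best
```

Choosing the longest span removes the most files from L0 per compaction, which matches the stated goal of keeping the L0 file count low.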
      Closes https://github.com/facebook/rocksdb/pull/2027
      
      Differential Revision: D4760318
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9d07183
      d659faad
  25. 04 Apr, 2017 1 commit
  26. 31 Mar, 2017 1 commit
    • S
      Option to fail a request as incomplete when skipping too many internal keys · c6d04f2e
      Sagar Vemuri committed
      Summary:
      Operations like Seek/Next/Prev sometimes take too long to complete when there are many internal keys to be skipped. Adding an option, max_skippable_internal_keys -- which could be used to set a threshold for the maximum number of keys that can be skipped, will help to address these cases where it is much better to fail a request (as incomplete) than to wait for a considerable time for the request to complete.
      
      This feature -- to fail an iterator seek request as incomplete, is disabled by default when max_skippable_internal_keys = 0. It is enabled only when max_skippable_internal_keys > 0.
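
The threshold behavior can be sketched on a toy iterator. The `Incomplete` exception and `next_visible` function are hypothetical stand-ins, entries are modeled as `(user_key, visible)` pairs, and failing only once the skip count exceeds the threshold (rather than when it reaches it) is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


class Incomplete(Exception):
    """Stand-in for an 'incomplete' status returned to the caller."""


def next_visible(internal_keys: List[Tuple[str, bool]], start: int,
                 max_skippable_internal_keys: int = 0) -> Optional[int]:
    """Return the index of the next visible entry at or after `start`.

    Entries marked not visible (e.g. tombstones or hidden versions) are
    skipped. A threshold of 0 disables the check.
    """
    skipped = 0
    for i in range(start, len(internal_keys)):
        _, visible = internal_keys[i]
        if visible:
            return i
        skipped += 1
        if 0 < max_skippable_internal_keys < skipped:
            raise Incomplete(f"skipped more than {max_skippable_internal_keys} internal keys")
    return None
```

With three hidden entries in front of a visible one, the default threshold of 0 scans through to the visible entry, while a threshold of 2 fails fast instead.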
      
      This feature is based on the discussion mentioned in the PR https://github.com/facebook/rocksdb/pull/1084.
      Closes https://github.com/facebook/rocksdb/pull/2000
      
      Differential Revision: D4753223
      
      Pulled By: sagar0
      
      fbshipit-source-id: 1c973f7
      c6d04f2e
  27. 29 Mar, 2017 1 commit
  28. 23 Mar, 2017 1 commit
  29. 21 Mar, 2017 1 commit
  30. 14 Mar, 2017 1 commit
  31. 10 Mar, 2017 1 commit