1. 13 Sep 2017 (1 commit)
    • support opening zero backups during engine init · f5148ade
      Committed by Andrew Kryczka
      Summary:
      There are internal users who open BackupEngine for writing new backups only, and they don't care whether old backups can be read or not. The condition `BackupableDBOptions::max_valid_backups_to_open == 0` should be supported (previously in df74b775 I made the mistake of choosing 0 as a special value to disable the limit).
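      A minimal sketch of a write-only open under this option; the backup path is hypothetical, and `db` is assumed to be an already-open `rocksdb::DB*`:
      ```
      #include "rocksdb/env.h"
      #include "rocksdb/utilities/backupable_db.h"

      rocksdb::Status BackupWriteOnly(rocksdb::DB* db) {
        // Skip reading metadata of any existing backups on open.
        rocksdb::BackupableDBOptions backup_opts("/path/to/backups");  // hypothetical path
        backup_opts.max_valid_backups_to_open = 0;

        rocksdb::BackupEngine* backup_engine = nullptr;
        rocksdb::Status s = rocksdb::BackupEngine::Open(
            rocksdb::Env::Default(), backup_opts, &backup_engine);
        if (!s.ok()) {
          return s;
        }
        s = backup_engine->CreateNewBackup(db);  // old backups stay unread
        delete backup_engine;
        return s;
      }
      ```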
      Closes https://github.com/facebook/rocksdb/pull/2819
      
      Differential Revision: D5751599
      
      Pulled By: ajkr
      
      fbshipit-source-id: e73ac19eb5d756d6b68601eae8e43407ee4f2752
  2. 31 Aug 2017 (1 commit)
  3. 30 Aug 2017 (1 commit)
  4. 24 Aug 2017 (2 commits)
  5. 22 Aug 2017 (1 commit)
  6. 19 Aug 2017 (3 commits)
  7. 17 Aug 2017 (2 commits)
    • Allow merge operator to be called even with a single operand · 9a44b4c3
      Committed by Sagar Vemuri
      Summary:
      Added a function `MergeOperator::DoesAllowSingleMergeOperand()` to allow invoking a merge operator even with a single merge operand, if overridden.
      
      This is needed for the Cassandra-on-RocksDB work. All Cassandra writes go through merges, and this allows a single merge value to be updated by the merge operator invoked during a compaction, if needed, e.g., due to an expired TTL.
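      As a sketch, a TTL-aware operator might opt in as below. Note that in `rocksdb/merge_operator.h` the hook is exposed as `AllowSingleOperand()`, and the merge logic here is a hypothetical stand-in:
      ```
      #include <string>
      #include "rocksdb/merge_operator.h"

      // Hypothetical stand-in for the Cassandra use case: opting in lets a
      // compaction invoke the operator on a lone operand, e.g., to purge
      // expired cells.
      class TtlAwareMergeOperator : public rocksdb::MergeOperator {
       public:
        bool FullMergeV2(const MergeOperationInput& merge_in,
                         MergeOperationOutput* merge_out) const override {
          merge_out->new_value.clear();
          for (const rocksdb::Slice& op : merge_in.operand_list) {
            // Real logic would drop expired columns here; we just concatenate.
            merge_out->new_value.append(op.data(), op.size());
          }
          return true;
        }

        // Opt in to being called with a single operand and no existing value.
        bool AllowSingleOperand() const override { return true; }

        const char* Name() const override { return "TtlAwareMergeOperator"; }
      };
      ```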
      Closes https://github.com/facebook/rocksdb/pull/2721
      
      Differential Revision: D5608706
      
      Pulled By: sagar0
      
      fbshipit-source-id: f299f9f91c4d1ac26e48bd5906e122c1c5e5f3fc
    • fix deleterange with memtable prefix bloom · af012c0f
      Committed by Andrew Kryczka
      Summary:
      The range-delete tombstones in the memtable should be added to the aggregator even when the memtable's prefix bloom filter says the lookup key isn't there. This bug could cause deleted data to temporarily reappear until the memtable containing the range deletions was flushed.
      
      Reported in #2743.
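      A sketch of the failure scenario, with hypothetical path and keys:
      ```
      #include <string>
      #include "rocksdb/db.h"
      #include "rocksdb/slice_transform.h"

      void ReproSketch() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(3));
        options.memtable_prefix_bloom_size_ratio = 0.1;  // enable memtable bloom

        rocksdb::DB* db = nullptr;
        rocksdb::DB::Open(options, "/tmp/deleterange_repro", &db);  // hypothetical path

        db->Put(rocksdb::WriteOptions(), "abc1", "v");
        db->Flush(rocksdb::FlushOptions());  // "abc1" now lives only in an SST file
        // The only memtable entry is now a range tombstone covering "abc1".
        db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                        "abc0", "abc9");

        std::string value;
        // Before the fix: the memtable's bloom filter reports the key absent,
        // the tombstone never reaches the aggregator, and "v" wrongly reappears.
        db->Get(rocksdb::ReadOptions(), "abc1", &value);
        delete db;
      }
      ```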
      Closes https://github.com/facebook/rocksdb/pull/2745
      
      Differential Revision: D5639007
      
      Pulled By: ajkr
      
      fbshipit-source-id: 04fc6facb6f978340a3f639536f4ca7c0d73dfc9
  8. 12 Aug 2017 (1 commit)
    • fix deletion dropping in intra-L0 · acf935e4
      Committed by Andrew Kryczka
      Summary:
      `KeyNotExistsBeyondOutputLevel` didn't consider L0 files' key ranges. So if a key was covered only by older L0 files' key ranges, we would incorrectly drop deletions of that key. This PR simply skips the deletion-dropping optimization when the output level is L0.
      Closes https://github.com/facebook/rocksdb/pull/2726
      
      Differential Revision: D5617286
      
      Pulled By: ajkr
      
      fbshipit-source-id: 4bff1396b06d49a828ba4542f249191052915bce
  9. 04 Aug 2017 (1 commit)
    • Introduce bottom-pri thread pool for large universal compactions · cc01985d
      Committed by Andrew Kryczka
      Summary:
      When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls, as they can quickly reduce the number of L0 files / sorted runs.
      
      This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
      
      - Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if the user explicitly configures it to have a positive number of threads (see the sketch after this list).
      - Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
      - Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
      - Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
      - Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
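      A sketch of how a user might opt in; the thread count is an arbitrary example:
      ```
      #include "rocksdb/env.h"
      #include "rocksdb/options.h"

      void EnableBottomPool(rocksdb::Options* options) {
        rocksdb::Env* env = rocksdb::Env::Default();
        // The BOTTOM pool defaults to zero threads, i.e., the feature is off.
        env->SetBackgroundThreads(2, rocksdb::Env::Priority::BOTTOM);  // example count
        options->env = env;
        // Universal compactions that include the bottommost level now run in
        // the BOTTOM pool instead of tying up LOW-pool threads.
      }
      ```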
      Closes https://github.com/facebook/rocksdb/pull/2580
      
      Differential Revision: D5422916
      
      Pulled By: ajkr
      
      fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
  10. 01 Aug 2017 (1 commit)
    • fix db get/write stats · 6a36b3a7
      Committed by Andrew Kryczka
      Summary:
      We were passing `record_read_stats` (a bool) as the `hist_type` argument, which meant we were updating either `rocksdb.db.get.micros` (`hist_type == 0`) or `rocksdb.db.write.micros` (`hist_type == 1`) with the wrong data.
      Closes https://github.com/facebook/rocksdb/pull/2666
      
      Differential Revision: D5520384
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2f7c956aec32f8b58c5c18845ac478e0230c9516
  11. 29 Jul 2017 (1 commit)
    • Replace dynamic_cast<> · 21696ba5
      Committed by Siying Dong
      Summary:
      Replace dynamic_cast<> so that users can choose to build with RTTI off, saving several bytes per object and making a little more memory available.
      Some nontrivial changes:
      1. Add Comparator::GetRootComparator() to get around the internal comparator hack
      2. Add the two experimental functions to DB
      3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
      4. Since 3 is done, move the option-parsing functions for table factories into the table factory files too, for symmetry.
      Closes https://github.com/facebook/rocksdb/pull/2645
      
      Differential Revision: D5502723
      
      Pulled By: siying
      
      fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
  12. 26 Jul 2017 (1 commit)
  13. 25 Jul 2017 (1 commit)
    • Add Iterator::Refresh() · e67b35c0
      Committed by Siying Dong
      Summary:
      Add and implement Iterator::Refresh(). When this function is called and the super version hasn't changed, the iterator's sequence number is updated to the latest one and the iterator's position is invalidated; if the super version has changed, the whole iterator is recreated. This makes it easier for users to reuse iterators.
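      A usage sketch, with hypothetical key names:
      ```
      #include <memory>
      #include "rocksdb/db.h"

      void RefreshSketch(rocksdb::DB* db) {
        std::unique_ptr<rocksdb::Iterator> it(
            db->NewIterator(rocksdb::ReadOptions()));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // ... scan the current view ...
        }

        db->Put(rocksdb::WriteOptions(), "new_key", "new_value");

        // Instead of destroying and recreating the iterator, refresh it to
        // pick up the latest sequence number; the position must be re-set.
        rocksdb::Status s = it->Refresh();
        if (s.ok()) {
          it->Seek("new_key");
        }
      }
      ```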
      Closes https://github.com/facebook/rocksdb/pull/2621
      
      Differential Revision: D5464500
      
      Pulled By: siying
      
      fbshipit-source-id: f548bd35e85c1efca2ea69273802f6704eba6ba9
  14. 14 Jul 2017 (1 commit)
  15. 01 Jul 2017 (1 commit)
  16. 28 Jun 2017 (1 commit)
    • FIFO Compaction with TTL · 1cd45cd1
      Committed by Sagar Vemuri
      Summary:
      Introducing FIFO compactions with TTL.
      
      FIFO compaction is based on size only, which makes it tricky to enable in production, as use cases can grow organically. A user requested an option to drop files based on the time of their creation instead of the total size.
      
      To address that request:
      - Added a new TTL option to FIFO compaction options.
      - Updated FIFO compaction score to take TTL into consideration.
      - Added a new table property, creation_time, to keep track of when the SST file is created.
      - `creation_time` is set as follows:
        - On Flush: Set to the time of flush.
        - On Compaction: Set to the max creation_time of all the files involved in the compaction.
        - On Repair and Recovery: Set to the time of repair/recovery.
        - Old files created prior to this code change will have a creation_time of 0.
      - FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
      - FIFO compaction will fall back to the prior way of deleting files based on size if:
        - the creation_time of all files involved in compaction is 0.
        - the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
      
      This feature is not supported if `max_open_files != -1` or with table formats other than block-based.
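      A configuration sketch; the cap and TTL values are arbitrary examples:
      ```
      #include "rocksdb/options.h"

      rocksdb::Options FifoWithTtlOptions() {
        rocksdb::Options options;
        options.compaction_style = rocksdb::kCompactionStyleFIFO;
        options.compaction_options_fifo.max_table_files_size =
            100ULL * 1024 * 1024;  // 100MB size cap, as in the benchmark below
        options.compaction_options_fifo.ttl = 24 * 60 * 60;  // 1 day, in seconds
        options.max_open_files = -1;  // required for TTL-based FIFO
        // The default block-based table format is also required.
        return options;
      }
      ```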
      
      **Test Plan:**
      Added tests.
      
      **Benchmark results:**
      Base: FIFO with max size: 100MB ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
      
      readwhilewriting :       1.924 micros/op 519858 ops/sec;   13.6 MB/s (1176277 of 5000000 found)
      ```
      
      With TTL (a low one for testing) ::
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
      
      readwhilewriting :       1.902 micros/op 525817 ops/sec;   13.7 MB/s (1185057 of 5000000 found)
      ```
      Example Log lines:
      ```
      2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
      2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
      ...
      2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
      2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
      ```
      
      SST Files remaining in the dbbench dir, after db_bench execution completed:
      ```
      svemuri@dev15905 ~/rocksdb (fifo-compaction)  $ ls -l /dev/shm//dbbench/*.sst
      -rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
      -rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
      -rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
      ```
      Closes https://github.com/facebook/rocksdb/pull/2480
      
      Differential Revision: D5305116
      
      Pulled By: sagar0
      
      fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
  17. 27 Jun 2017 (2 commits)
  18. 14 Jun 2017 (1 commit)
  19. 13 Jun 2017 (1 commit)
  20. 12 Jun 2017 (1 commit)
    • Sample number of reads per SST file · 5582123d
      Committed by Siying Dong
      Summary:
      We estimate the number of reads per SST file by updating a per-file counter on sampled read requests. This information can later be used to trigger compactions that improve read performance.
      Closes https://github.com/facebook/rocksdb/pull/2417
      
      Differential Revision: D5193528
      
      Pulled By: siying
      
      fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df
  21. 10 Jun 2017 (1 commit)
  22. 06 Jun 2017 (1 commit)
  23. 03 Jun 2017 (2 commits)
    • using ThreadLocalPtr to hide ROCKSDB_SUPPORT_THREAD_LOCAL from public headers · 7f6c02dd
      Committed by Aaron Gao
      Summary:
      
      https://github.com/facebook/rocksdb/pull/2199 should not have referenced RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) in public headers such as `iostats_context.h` and `perf_context.h`. We shouldn't do that because users would then have to provide these compiler flags when building their binaries with RocksDB.
      
      We should hide the thread-local global variables inside our implementation and just expose a function API to retrieve them. This may break some users for now, but it is good for the long term.
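      A sketch of the resulting function-based access; the counter choices are arbitrary:
      ```
      #include <cstdint>
      #include "rocksdb/iostats_context.h"
      #include "rocksdb/perf_context.h"

      void PerfSketch() {
        // Accessor functions replace the exported thread-local globals.
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableTimeExceptForMutex);
        rocksdb::get_perf_context()->Reset();
        rocksdb::get_iostats_context()->Reset();

        // ... issue some reads/writes against the DB ...

        uint64_t memtable_gets =
            rocksdb::get_perf_context()->get_from_memtable_count;
        uint64_t bytes_written = rocksdb::get_iostats_context()->bytes_written;
        (void)memtable_gets;
        (void)bytes_written;
      }
      ```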
      
      make check -j64
      Closes https://github.com/facebook/rocksdb/pull/2380
      
      Differential Revision: D5177896
      
      Pulled By: lightmark
      
      fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
    • Improve write buffer manager (and allow the size to be tracked in block cache) · 95b0e89b
      Committed by Siying Dong
      Summary:
      Improve write buffer manager in several ways:
      1. Size is tracked when an arena block is allocated rather than on every allocation, so the manager tracks actual memory usage more closely and the tracking overhead is slightly lower.
      2. We start to trigger memtable flushes when 7/8 of the memory cap is reached, instead of 100%, and make the 100% limit much harder to hit.
      3. Allow a cache object to be passed into the buffer manager so that the memory allocated by memtables can be costed there. This helps users keep one single memory cap across the block cache and memtables (see the sketch after this list).
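      A sketch of item 3; the sizes are arbitrary examples:
      ```
      #include <memory>
      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"
      #include "rocksdb/write_buffer_manager.h"

      rocksdb::Options SharedCapOptions() {
        // One shared cap: memtable memory is costed against the block cache.
        auto cache = rocksdb::NewLRUCache(1024 * 1024 * 1024);  // 1GB (example)
        rocksdb::Options options;
        options.write_buffer_manager =
            std::make_shared<rocksdb::WriteBufferManager>(
                256 * 1024 * 1024 /* memtable cap, example */, cache);
        // Flushes are triggered at ~7/8 of the 256MB cap, per item 2 above.
        return options;
      }
      ```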
      Closes https://github.com/facebook/rocksdb/pull/2350
      
      Differential Revision: D5110648
      
      Pulled By: siying
      
      fbshipit-source-id: b4238113094bf22574001e446b5d88523ba00017
  24. 25 May 2017 (2 commits)
    • Introduce max_background_jobs mutable option · bb01c188
      Committed by Andrew Kryczka
      Summary:
      - `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
      - `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
      - `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs. compactions. Currently the split is simple: one-fourth of the jobs are allocated to flushes and the remainder can be used for compactions (see the sketch after this list).
      - The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
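      A sketch of the new option; the value is an arbitrary example:
      ```
      #include "rocksdb/options.h"

      rocksdb::Options BackgroundJobOptions() {
        rocksdb::Options options;
        // One knob instead of max_background_flushes/max_background_compactions:
        // one-fourth of the jobs go to flushes, the rest to compactions.
        options.max_background_jobs = 8;  // example: ~2 flushes + ~6 compactions
        return options;
      }
      ```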
      Closes https://github.com/facebook/rocksdb/pull/2205
      
      Differential Revision: D4937461
      
      Pulled By: ajkr
      
      fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
    • options.delayed_write_rate uses the rate of rate_limiter by default · 41cbb727
      Committed by Siying Dong
      Summary:
      It's hard for RocksDB to come up with a good default for the delayed write rate, so use the rate given by the rate limiter when one is available, since it indicates the right order of magnitude for the available I/O.
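      A sketch; the rate is an arbitrary example:
      ```
      #include "rocksdb/options.h"
      #include "rocksdb/rate_limiter.h"

      rocksdb::Options RateLimitedOptions() {
        rocksdb::Options options;
        options.rate_limiter.reset(
            rocksdb::NewGenericRateLimiter(64 << 20));  // 64MB/s (example)
        // delayed_write_rate is left unset, so it now defaults to the
        // limiter's 64MB/s rather than a hard-coded constant.
        return options;
      }
      ```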
      Closes https://github.com/facebook/rocksdb/pull/2357
      
      Differential Revision: D5115324
      
      Pulled By: siying
      
      fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754
  25. 24 May 2017 (2 commits)
    • New API for background work in single thread pool · 6cc9aef1
      Committed by Andrew Kryczka
      Summary:
      Previously users could set `max_background_flushes=0` to force RocksDB to use a single thread pool for both background flushes and compactions. That will no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when the high-pri pool has zero threads, all background jobs will be submitted to the low-pri pool.
      
      Note that the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool (see the sketch below).
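      A sketch of the recipe as I read the summary; the pool sizes are arbitrary, and the exact interaction with DB::Open's thread-count adjustment is an assumption:
      ```
      #include "rocksdb/env.h"
      #include "rocksdb/options.h"

      void SingleThreadPool(rocksdb::Options* options) {
        rocksdb::Env* env = rocksdb::Env::Default();
        env->SetBackgroundThreads(0, rocksdb::Env::Priority::HIGH);  // no flush pool
        env->SetBackgroundThreads(4, rocksdb::Env::Priority::LOW);   // example size
        options->env = env;
        // With a zero-sized high-pri pool, flushes are submitted to the
        // low-pri pool alongside compactions.
      }
      ```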
      Closes https://github.com/facebook/rocksdb/pull/2204
      
      Differential Revision: D4936256
      
      Pulled By: ajkr
      
      fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
    • Core-local statistics · ac39d6be
      Committed by Andrew Kryczka
      Summary:
      This diff changes `StatisticsImpl` from a thread-local approach to a core-local one. The goal is to perform faster aggregations, particularly for applications that have many threads. There should be no behavior change.
      Closes https://github.com/facebook/rocksdb/pull/2258
      
      Differential Revision: D5016258
      
      Pulled By: ajkr
      
      fbshipit-source-id: 7d4d165b4a91d8110f0409d113d1be91f22d31a9
  26. 20 May 2017 (1 commit)
    • New WriteImpl to pipeline WAL/memtable write · 07bdcb91
      Committed by Yi Wu
      Summary:
      PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allowed to write at a time; this thread does both the WAL and memtable writes for all write threads in the write group, and pending writers wait in a queue until the current writer finishes. In the pipelined write approach, two queues are maintained: a WAL writer queue and a memtable writer queue. All writers (regardless of whether they need to write the WAL) first join the WAL writer queue; after the housekeeping work and WAL writing, they join the memtable writer queue if needed. The benefits of this approach are that
      1. Writers without memtable writes (e.g., the prepare phase of two-phase commit) can exit the write thread once the WAL write is finished; they don't need to wait for memtable writes in the case of group commit.
      2. Pending writers only need to wait for the previous WAL writer to finish before joining the write thread, instead of also waiting for previous memtable writes (see the sketch below).
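      A sketch of opting in to the pipelined path:
      ```
      #include "rocksdb/options.h"

      rocksdb::Options PipelinedWriteOptions() {
        rocksdb::Options options;
        // Use the separate WAL writer and memtable writer queues described above.
        options.enable_pipelined_write = true;  // defaults to false
        return options;
      }
      ```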
      
      Merging #2056 and #2058 into this PR.
      Closes https://github.com/facebook/rocksdb/pull/2286
      
      Differential Revision: D5054606
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d
  27. 18 May 2017 (2 commits)
  28. 13 May 2017 (1 commit)
    • Add GetAllKeyVersions API · 3fa9a39c
      Committed by Andrew Kryczka
      Summary:
      - Introduced an include/ file dedicated to db-related debug functions to avoid making db.h more complex
      - Added a debugging function, `GetAllKeyVersions()`, to return a listing of internal data for a range of user keys. The new `struct KeyVersion` exposes data similar to the internal key without exposing any internal types (see the sketch after this list).
      - Migrated the "ldb idump" subcommand to use this function
      - The API takes an inclusive-exclusive range to match behavior of "ldb idump". This will be quite annoying for users who want to query a single user key's versions :(.
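      A usage sketch; the range endpoints are hypothetical:
      ```
      #include <vector>
      #include "rocksdb/db.h"
      #include "rocksdb/utilities/debug.h"

      void DumpVersions(rocksdb::DB* db) {
        std::vector<rocksdb::KeyVersion> versions;
        // Inclusive-exclusive range, matching "ldb idump".
        rocksdb::Status s =
            rocksdb::GetAllKeyVersions(db, "begin_key", "end_key", &versions);
        if (s.ok()) {
          for (const auto& kv : versions) {
            // kv.user_key, kv.value, kv.sequence, kv.type are all public data.
          }
        }
      }
      ```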
      Closes https://github.com/facebook/rocksdb/pull/2232
      
      Differential Revision: D4976007
      
      Pulled By: ajkr
      
      fbshipit-source-id: cab375da53a7595d6575af2b7e3b776aa3ad793e
  29. 08 May 2017 (1 commit)
    • Add bulk create/drop column family API · 2cd00773
      Committed by Yi Wu
      Summary:
      Adding `DB::CreateColumnFamilies()` and `DB::DropColumnFamilies()` to bulk create/drop column families. This addresses the problem that creating/dropping 1k column families takes minutes. The bottleneck is that we persist the options file on every single column family create/drop and parse the persisted options file for verification, which takes a lot of CPU time.
      
      The new APIs create/drop the column families individually and persist the options file once at the end. This brings creating 1k column families down to ~0.1s. A further improvement would be to merge the manifest writes into one IO.
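      A usage sketch; the column family names are hypothetical:
      ```
      #include <string>
      #include <vector>
      #include "rocksdb/db.h"

      void BulkCreateDrop(rocksdb::DB* db) {
        std::vector<std::string> names = {"cf_a", "cf_b", "cf_c"};
        std::vector<rocksdb::ColumnFamilyHandle*> handles;
        // One options-file write for the whole batch instead of one per CF.
        rocksdb::Status s = db->CreateColumnFamilies(
            rocksdb::ColumnFamilyOptions(), names, &handles);
        if (s.ok()) {
          s = db->DropColumnFamilies(handles);
          for (auto* h : handles) {
            db->DestroyColumnFamilyHandle(h);
          }
        }
      }
      ```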
      Closes https://github.com/facebook/rocksdb/pull/2248
      
      Differential Revision: D5001578
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
  30. 05 May 2017 (2 commits)