1. 14 Jun 2019, 1 commit
  2. 31 May 2019, 2 commits
  3. 30 May 2019, 1 commit
  4. 27 Apr 2019, 1 commit
    • Improve explicit user readahead performance (#5246) · 3548e422
      Committed by Sagar Vemuri
      Summary:
      Improve iterator performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.
      
      1. Stop creating new table readers when the user explicitly sets readahead size.
      2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
      3. Add `readahead_size` to db_bench.
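      A minimal usage sketch of the option this PR optimizes (the path and window size are illustrative):
      ```cpp
      #include <cassert>
      #include <memory>

      #include "rocksdb/db.h"

      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        options.create_if_missing = true;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/readahead_demo", &db);
        assert(s.ok());

        rocksdb::ReadOptions read_opts;
        read_opts.readahead_size = 1 << 20;  // explicit 1 MB readahead window

        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_opts));
        for (it->SeekToFirst(); it->Valid(); it->Next()) {
          // consume it->key() / it->value()
        }
        assert(it->status().ok());
        it.reset();  // destroy the iterator before closing the DB
        delete db;
        return 0;
      }
      ```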
      
      **Benchmarks:**
      https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
      
      For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
      For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
      
      **Test Plan:**
      Updated the `DBIteratorTest.ReadAhead` test to make sure that:
      - No new table readers are created for iterators when `ReadOptions.readahead_size` is set.
      - At least `readahead_size` bytes are actually read on each iterator read.
      
      TODO later:
      - Use similar logic for compactions as well.
      - This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAccessFile later.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
      
      Differential Revision: D15107946
      
      Pulled By: sagar0
      
      fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
  5. 12 Apr 2019, 1 commit
    • Introduce a new MultiGet batching implementation (#5011) · fefd4b98
      Committed by anand76
      Summary:
      This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold: the definition allows callers to allocate storage for statuses and values on the stack instead of in a std::vector, and to receive values as PinnableSlices in order to avoid copying; and it keeps the original MultiGet() implementation intact while we experiment with batching.
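      A hedged sketch of the batched call (the signature shown matches current public headers; the keys and column family are illustrative):
      ```cpp
      #include <array>

      #include "rocksdb/db.h"

      void BatchedLookup(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
        constexpr size_t kNumKeys = 4;
        std::array<rocksdb::Slice, kNumKeys> keys = {
            rocksdb::Slice("k1"), rocksdb::Slice("k2"),
            rocksdb::Slice("k3"), rocksdb::Slice("k4")};
        // Statuses and values live on the stack, and values come back as
        // PinnableSlices, so found values need not be copied.
        std::array<rocksdb::PinnableSlice, kNumKeys> values;
        std::array<rocksdb::Status, kNumKeys> statuses;

        db->MultiGet(rocksdb::ReadOptions(), cf, kNumKeys, keys.data(),
                     values.data(), statuses.data());

        for (size_t i = 0; i < kNumKeys; ++i) {
          if (statuses[i].ok()) {
            // values[i] pins the underlying data; released on destruction.
          }
        }
      }
      ```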
      
      Batching is useful when there is some spatial locality among the keys being queried, and with larger batch sizes. The main benefits come from:
      1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
      2. Prefetching bloom filter cachelines, which hides the cache-miss latency
      
      The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
      
      Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
      
      Batch-size results (micros/op; row labels follow the first group):
      
      Random pattern (stride length 0)
      
      | Batch size             | 1     | 2     | 4     | 8     | 16    | 32    |
      | ---------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
      | Get                    | 4.158 | 4.109 | 4.026 | 4.05  | 4.1   | 4.074 |
      | MultiGet (no batching) | 4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 |
      | MultiGet (w/ batching) | 4.461 | 4.256 | 4.277 | 4.11  | 4.182 | 4.14  |
      
      Good locality (stride length 16)
      
      | Batch size             | 1     | 2     | 4     | 8     | 16    | 32    |
      | ---------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
      | Get                    | 4.048 | 3.659 | 3.248 | 2.99  | 2.84  | 2.753 |
      | MultiGet (no batching) | 4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781 |
      | MultiGet (w/ batching) | 4.452 | 3.45  | 2.833 | 2.451 | 2.233 | 2.135 |
      
      Good locality (stride length 256)
      
      | Batch size             | 1     | 2     | 4     | 8     | 16    | 32    |
      | ---------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
      | Get                    | 4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232 |
      | MultiGet (no batching) | 4.406 | 4.005 | 3.644 | 3.49  | 3.381 | 3.268 |
      | MultiGet (w/ batching) | 4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62  |
      
      Medium locality (stride length 4096)
      
      | Batch size             | 1     | 2     | 4     | 8     | 16    | 32    |
      | ---------------------- | ----- | ----- | ----- | ----- | ----- | ----- |
      | Get                    | 4.012 | 3.922 | 3.768 | 3.61  | 3.582 | 3.555 |
      | MultiGet (no batching) | 4.364 | 4.057 | 3.791 | 3.65  | 3.57  | 3.465 |
      | MultiGet (w/ batching) | 4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891 |
      
      db_bench command used (on a DB with 4 levels and 12 million keys):
      ```
      TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
      
      Differential Revision: D14348703
      
      Pulled By: anand1976
      
      fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
  6. 21 Dec 2018, 1 commit
    • Introduce a CPU time counter in perf_context (#4741) · da1c64b6
      Committed by Siying Dong
      Summary:
      Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens the door to more CPU counters in the future.
      Only the Posix Env implements it, using clock_gettime() with CLOCK_THREAD_CPUTIME_ID; how accurate the counter is depends on the platform.
      Also make PerfStepTimer take an Env as an argument and pass it in where needed. The immediate reason is to let the unit tests use SpecialEnv, where we can inject logic, but in the long term this is a good change.
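      A hedged sketch of reading the new counter (the perf level and accessor names are as in current releases):
      ```cpp
      #include <iostream>
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/perf_context.h"
      #include "rocksdb/perf_level.h"

      void TimedGet(rocksdb::DB* db) {
        rocksdb::SetPerfLevel(
            rocksdb::PerfLevel::kEnableTimeAndCPUTimeExceptForMutex);
        rocksdb::get_perf_context()->Reset();

        std::string value;
        rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "some_key", &value);
        (void)s;  // status handling elided in this sketch

        // get_cpu_nanos is only populated where the Env implements CPU
        // timing (Posix, per the summary above).
        std::cout << "Get() CPU nanos: "
                  << rocksdb::get_perf_context()->get_cpu_nanos << "\n";
        rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
      }
      ```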
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741
      
      Differential Revision: D13287798
      
      Pulled By: siying
      
      fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
  7. 18 Dec 2018, 1 commit
  8. 29 Nov 2018, 1 commit
    • Clean up FragmentedRangeTombstoneList (#4692) · 8fe1e06c
      Committed by Abhishek Madan
      Summary:
      Removed the `one_time_use` flag, which removed the need for some
      tests, and changed all `NewRangeTombstoneIterator` methods to return
      `FragmentedRangeTombstoneIterator`s.
      
      These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
      and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
      
      Differential Revision: D13106570
      
      Pulled By: abhimadan
      
      fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
  9. 22 Nov 2018, 2 commits
    • Fix ticker stat for number of files closed (#4703) · 07cf0ee5
      Committed by Andrew Kryczka
      Summary:
      We haven't been populating `NO_FILE_CLOSES` since v1.5.8 even though it was never marked as deprecated. Start populating it again. Conveniently `DeleteTableReader` has an unused `void*` argument that we can use...
      
      Blame: 63f216ee
      
      Closes #4700.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4703
      
      Differential Revision: D13146769
      
      Pulled By: ajkr
      
      fbshipit-source-id: ad8d6fb0493e701f60a165a3bca1787d255be008
    • Introduce RangeDelAggregatorV2 (#4649) · 457f77b9
      Committed by Abhishek Madan
      Summary:
      The old RangeDelAggregator did expensive pre-processing work
      to create a collapsed, binary-searchable representation of range
      tombstones. With FragmentedRangeTombstoneIterator, much of this work is
      now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
      in each iterator to find a covering tombstone in ShouldDelete, while
      doing minimal work in AddTombstones. The old RangeDelAggregator is still
      used during flush/compaction for now, though RangeDelAggregatorV2 will
      support those uses in a future PR.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
      
      Differential Revision: D13146964
      
      Pulled By: abhimadan
      
      fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
  10. 15 Nov 2018, 1 commit
    • Modify FragmentedRangeTombstoneList member layout (#4632) · 6bee36a7
      Committed by Abhishek Madan
      Summary:
      Rather than storing a `vector<RangeTombstone>`, we now store a
      `vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A
      `RangeTombstoneStack` contains the start and end keys of a range tombstone
      fragment, and indices into the seqnum vector to indicate which sequence
      numbers the fragment is located at. The diagram below illustrates an
      example:
      
      ```
      tombstones_:     [a, b) [c, e) [h, k)
                         | \   /  \   /  |
                         |  \ /    \ /   |
                         v   v      v    v
      tombstone_seqs_: [ 5 3 10 7 2 8 6  ]
      ```
      
      This format allows binary searching the tombstone list to use less key
      comparisons, which helps in cases where there are many overlapping
      tombstones. Also, this format makes it easier to add DBIter-like
      semantics to `FragmentedRangeTombstoneIterator` in the future.
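      A hedged sketch of the layout described above (field names are illustrative, not the exact internal declaration):
      ```cpp
      #include <cstddef>

      #include "rocksdb/slice.h"

      // One entry per fragment: its key range plus a half-open index range
      // into the shared sequence-number vector.
      struct RangeTombstoneStack {
        rocksdb::Slice start_key;  // fragment start key (inclusive)
        rocksdb::Slice end_key;    // fragment end key (exclusive)
        size_t seq_start_idx;      // first index into tombstone_seqs_
        size_t seq_end_idx;        // one past the last index
      };

      // In the diagram above, the fragment [c, e) would reference indices
      // 2..5 of tombstone_seqs_, i.e. sequence numbers {10, 7, 2}.
      ```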
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632
      
      Differential Revision: D13053103
      
      Pulled By: abhimadan
      
      fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0
  11. 10 Nov 2018, 1 commit
    • Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Committed by Sagar Vemuri
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
  12. 26 Oct 2018, 1 commit
    • Cache fragmented range tombstones in BlockBasedTableReader (#4493) · 7528130e
      Committed by Abhishek Madan
      Summary:
      This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
      
      On the same DB used in #4449, running `readrandom` results in the following:
      ```
      readrandom   :       0.983 micros/op 1017076 ops/sec;   78.3 MB/s (63103 of 100000 found)
      ```
      
      Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
      ```
         Tombstones?    | avg micros/op | stddev micros/op |  avg ops/s   | stddev ops/s
      ----------------- | ------------- | ---------------- | ------------ | ------------
      None              |        0.6186 |          0.04637 | 1,625,252.90 | 124,679.41
      500 Expanded      |        0.6019 |          0.03628 | 1,666,670.40 | 101,142.65
      500 Unexpanded    |        0.6435 |          0.03994 | 1,559,979.40 | 104,090.52
      1k Expanded       |        0.6034 |          0.04349 | 1,665,128.10 | 125,144.57
      1k Unexpanded     |        0.6261 |          0.03093 | 1,600,457.50 |  79,024.94
      5k Expanded       |        0.6163 |          0.05926 | 1,636,668.80 | 154,888.85
      5k Unexpanded     |        0.6402 |          0.04002 | 1,567,804.70 | 100,965.55
      10k Expanded      |        0.6036 |          0.05105 | 1,667,237.70 | 142,830.36
      10k Unexpanded    |        0.6128 |          0.02598 | 1,634,633.40 |  72,161.82
      25k Expanded      |        0.6198 |          0.04542 | 1,620,980.50 | 116,662.93
      25k Unexpanded    |        0.5478 |          0.0362  | 1,833,059.10 | 121,233.81
      50k Expanded      |        0.5104 |          0.04347 | 1,973,107.90 | 184,073.49
      50k Unexpanded    |        0.4528 |          0.03387 | 2,219,034.50 | 170,984.32
      ```
      
      After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
      
      Differential Revision: D10842844
      
      Pulled By: abhimadan
      
      fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
  13. 25 Oct 2018, 1 commit
    • Use only "local" range tombstones during Get (#4449) · 8c78348c
      Committed by Abhishek Madan
      Summary:
      Previously, range tombstones were accumulated from every level, which
      was necessary if a range tombstone in a higher level covered a key in a lower
      level. However, RangeDelAggregator::AddTombstones's complexity is based on
      the number of tombstones that are currently stored in it, which is wasteful in
      the Get case, where we only need to know the highest sequence number of range
      tombstones that cover the key from higher levels, and compute the highest covering
      sequence number at the current level. This change introduces this optimization, and
      removes the use of RangeDelAggregator from the Get path.
      
      In the benchmark results, the following command was used to initialize the database:
      ```
      ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
      ```
      
      ...and the following command was used to measure read throughput:
      ```
      ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
      ```
      
      The filluniquerandom command was only run once, and the resulting database was used
      to measure read performance before and after the PR. Both binaries were compiled with
      `DEBUG_LEVEL=0`.
      
      Readrandom results before PR:
      ```
      readrandom   :       4.544 micros/op 220090 ops/sec;   16.9 MB/s (63103 of 100000 found)
      ```
      
      Readrandom results after PR:
      ```
      readrandom   :      11.147 micros/op 89707 ops/sec;    6.9 MB/s (63103 of 100000 found)
      ```
      
      So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
      
      ----
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
      
      Differential Revision: D10370575
      
      Pulled By: abhimadan
      
      fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
  14. 13 Oct 2018, 1 commit
    • Add listener to sample file io (#3933) · 729a617b
      Committed by Yanqin Jin
      Summary:
      We would like to collect file-system-level statistics, including file name, offset, length, return code, latency, etc., which requires adding callbacks to intercept file IO function calls while RocksDB is running.
      To collect file-system-level statistics, users can inherit from the `EventListener` class, as in `TestFileOperationListener`. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
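      A hedged sketch of such a listener, modeled on the `TestFileOperationListener` mentioned above (`FileIoSampler` is a hypothetical name):
      ```cpp
      #include <atomic>
      #include <cstdint>

      #include "rocksdb/listener.h"

      class FileIoSampler : public rocksdb::EventListener {
       public:
        // Opt in to file-IO notifications.
        bool ShouldBeNotifiedOnFileIO() override { return true; }

        void OnFileReadFinish(const rocksdb::FileOperationInfo& info) override {
          // info carries the per-operation details (file name, offset,
          // length, return code, latency) listed in the summary.
          reads_.fetch_add(1, std::memory_order_relaxed);
        }

       private:
        std::atomic<uint64_t> reads_{0};
      };
      // Register before opening the DB:
      //   options.listeners.push_back(std::make_shared<FileIoSampler>());
      ```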
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933
      
      Differential Revision: D10219571
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
  15. 10 Oct 2018, 1 commit
  16. 10 Aug 2018, 1 commit
    • Index value delta encoding (#3983) · caf0f53a
      Committed by Maysam Yabandeh
      Summary:
      Given that an index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding to the values. The first value at each index restart interval encodes the full BlockHandle; the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the details of the encoding. This reduces the index size, which helps use the block cache more efficiently. The feature is enabled with format_version 4.
      
      The feature comes with a bit of CPU overhead, which should be paid back by higher cache hit rates due to the smaller index block size.
      Results with sysbench read-only, using 4k blocks and an index restart interval of 16:
      Format 2:
      19585   rocksdb read-only range=100
      Format 3:
      19569   rocksdb read-only range=100
      Format 4:
      19352   rocksdb read-only range=100
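      A hedged sketch of opting in (option names as in the public BlockBasedTableOptions):
      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      rocksdb::Options MakeFormat4Options() {
        rocksdb::BlockBasedTableOptions table_opts;
        table_opts.format_version = 4;  // enables index value delta encoding
        table_opts.index_block_restart_interval = 16;  // as in the runs above

        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_opts));
        return options;
      }
      ```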
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
      
      Differential Revision: D8361343
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
  17. 28 Jul 2018, 1 commit
    • Remove random writes from SST file ingestion (#4172) · 54de5684
      Committed by Yanqin Jin
      Summary:
      RocksDB used to store `global_seqno` in external SST files written by
      SstFileWriter. During file ingestion, RocksDB used `pwrite` to update the
      `global_seqno`. Since random writes are not supported in some non-POSIX-compliant
      file systems, external SST file ingestion was not supported on those file
      systems. To address this limitation, we no longer update `global_seqno` during
      file ingestion; instead, RocksDB uses the MANIFEST and other information in table
      properties to deduce the global seqno for externally-ingested SST files.
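      A hedged end-to-end sketch of writing and ingesting an external SST file (the path and keys are illustrative):
      ```cpp
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/env.h"
      #include "rocksdb/sst_file_writer.h"

      rocksdb::Status WriteAndIngest(rocksdb::DB* db,
                                     const rocksdb::Options& options) {
        const std::string file = "/tmp/file1.sst";
        rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);

        rocksdb::Status s = writer.Open(file);
        if (!s.ok()) return s;
        s = writer.Put("key1", "value1");  // keys must be added in order
        if (!s.ok()) return s;
        s = writer.Finish();
        if (!s.ok()) return s;

        // With this change, ingestion no longer issues a random write to
        // patch global_seqno inside the file.
        return db->IngestExternalFile({file},
                                      rocksdb::IngestExternalFileOptions());
      }
      ```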
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172
      
      Differential Revision: D8961465
      
      Pulled By: riversand963
      
      fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
  18. 14 Jul 2018, 1 commit
    • Relax VersionStorageInfo::GetOverlappingInputs check (#4050) · 90fc4069
      Committed by Peter Mattis
      Summary:
      Do not consider the range tombstone sentinel key as causing 2 adjacent
      sstables in a level to overlap. When a range tombstone's end key is the
      largest key in an sstable, the sstable's end key is set to a "sentinel"
      value: the smallest key in the next sstable, with a sequence number of
      kMaxSequenceNumber. This "sentinel" is guaranteed not to overlap in
      internal-key space with the next sstable. Unfortunately,
      GetOverlappingFiles uses user keys to determine overlap and was thus
      considering 2 adjacent sstables in a level to overlap if they were
      separated by this sentinel key. This in turn would cause compactions to
      be larger than necessary.
      
      Note that this conflicts with
      https://github.com/facebook/rocksdb/pull/2769 and causes
      `DBRangeDelTest.CompactionTreatsSplitInputLevelDeletionAtomically` to
      fail.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4050
      
      Differential Revision: D8844423
      
      Pulled By: ajkr
      
      fbshipit-source-id: df3f9f1db8f4cff2bff77376b98b83c2ae1d155b
  19. 28 Jun 2018, 1 commit
  20. 26 Jun 2018, 1 commit
  21. 22 May 2018, 1 commit
    • Move prefix_extractor to MutableCFOptions · c3ebc758
      Committed by Zhongyi Xie
      Summary:
      Currently it is not possible to change the bloom filter configuration without restarting the DB, which causes a lot of operational complexity for users.
      This PR aims to make it possible to change the bloom filter configuration dynamically, by moving prefix_extractor into MutableCFOptions.
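      A hedged sketch of a runtime change this enables (the SetOptions string form for prefix_extractor is an assumption based on current option parsing):
      ```cpp
      #include "rocksdb/db.h"

      // Switch the default column family to a 4-byte fixed-prefix extractor
      // on a live DB, without a restart.
      rocksdb::Status SwitchPrefixExtractor(rocksdb::DB* db) {
        return db->SetOptions({{"prefix_extractor", "rocksdb.FixedPrefix.4"}});
      }
      ```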
      Closes https://github.com/facebook/rocksdb/pull/3601
      
      Differential Revision: D7253114
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
  22. 09 May 2018, 1 commit
    • Disable readahead when using mmap for reads · 4bf169f0
      Committed by Andrew Kryczka
      Summary:
      `ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the `mmap`'d memory region. This PR:
      
      - prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files
      - adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer
      Closes https://github.com/facebook/rocksdb/pull/3813
      
      Differential Revision: D7891513
      
      Pulled By: ajkr
      
      fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d
  23. 05 May 2018, 1 commit
    • Recommit "Avoid adding tombstones of the same file to RangeDelAggregator multiple times" · 72942ad7
      Committed by daheiantian
      Summary:
      The original commit #3635 hurt performance for users who aren't using range deletions, because of unneeded std::set operations, so it was reverted by commit 44653c7b. (see #3672)
      
      To fix this, move the set to […] and add a check in […], i.e., the file will be added only if […] is non-nullptr.
      
      The db_bench command which found the performance regression:
      ```
      ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none
      ```
      
      Before and after the modification, I re-ran this command on the same machine; the results are as follows:
      
        **fillrandom**
       Table | P50 | P75 | P99 | P99.9 | P99.99 |
        ---- | --- | --- | --- | ----- | ------ |
       before commit | 5.92 | 8.57 | 19.63 | 980.97 | 12196.00 |
       after commit  | 5.91 | 8.55 | 19.34 | 965.56 | 13513.56 |
      
       **seekrandomwhilewriting**
        Table | P50 | P75 | P99 | P99.9 | P99.99 |
         ---- | --- | --- | --- | ----- | ------ |
       before commit | 1418.62 | 1867.01 | 3823.28 | 4980.99 | 9240.00 |
       after commit  | 1450.54 | 1880.61 | 3962.87 | 5429.60 | 7542.86 |
      Closes https://github.com/facebook/rocksdb/pull/3800
      
      Differential Revision: D7874245
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17
  24. 06 Apr 2018, 1 commit
    • Support for Column family specific paths. · 446b32cf
      Committed by Phani Shekhar Mantripragada
      Summary:
      In this change, an option to set different paths for different column families is added.
      This option is set via the cf_paths setting of ColumnFamilyOptions and works similarly to the db_paths setting. cf_paths is a vector of DbPath values, each containing an absolute path and a target size. Multiple levels in a column family can go to different paths if cf_paths has more than one path.
      To maintain backward compatibility, if cf_paths is not specified for a column family, the db_paths setting is used. Note that if db_paths is also not specified, RocksDB already has code to use db_name as the only path.
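      A hedged sketch of the setting (paths and target sizes are illustrative):
      ```cpp
      #include "rocksdb/options.h"

      rocksdb::ColumnFamilyOptions MakeCfOptions() {
        rocksdb::ColumnFamilyOptions cf_opts;
        // Roughly the first 10 GB of this column family's data goes to fast
        // storage; the remainder spills to bulk storage.
        cf_opts.cf_paths = {
            rocksdb::DbPath("/mnt/fast_ssd/cf_hot", 10ull << 30),
            rocksdb::DbPath("/mnt/bulk_hdd/cf_cold", 500ull << 30)};
        return cf_opts;
      }
      ```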
      
      Changes:
      1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
      2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
      3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
      4) Unit tests are added appropriately.
      Closes https://github.com/facebook/rocksdb/pull/3102
      
      Differential Revision: D6951697
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
  25. 03 Apr 2018, 1 commit
    • Revert "Avoid adding tombstones of the same file to RangeDelAggregato… · 44653c7b
      Committed by Zhongyi Xie
      Summary:
      …r multiple times"
      
      This reverts commit e80709a3.
      
      lingbin's PR https://github.com/facebook/rocksdb/pull/3635 is causing a performance regression for seekrandom workloads.
      I'm reverting the commit for now, but feel free to submit new patches 😃
      
      To reproduce the regression, run the following db_bench command:
      ```
      ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none
      ```
      
      write stats printed by db_bench:
      
      |               | P50    | P75    | P99     | P99.9   | P99.99  |
      | ------------- | ------ | ------ | ------- | ------- | ------- |
      | revert commit | 80.77  | 102.94 | 1786.44 | 1892.39 | 2645.10 |
      | keep commit   | 221.72 | 686.62 | 1842.57 | 1899.70 | 2814.29 |
      Closes https://github.com/facebook/rocksdb/pull/3672
      
      Differential Revision: D7463315
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 8e779c87591127f2c3694b91a56d9b459011959d
  26. 24 Mar 2018, 1 commit
  27. 06 Mar 2018, 1 commit
  28. 23 Feb 2018, 2 commits
  29. 16 Feb 2018, 1 commit
    • Several small "fixes" · 4e7a182d
      Committed by jsteemann
      Summary:
      - removed a few unneeded variables
      - fused some variable declarations with their assignments
      - fixed the right-trimming code in string_util.cc to not underflow
      - simplified an assertion
      - moved a non-nullptr assertion before the dereference of that pointer
      - passed a std::string function parameter by const reference instead of by value (avoiding a potential copy)
      Closes https://github.com/facebook/rocksdb/pull/3507
      
      Differential Revision: D7004679
      
      Pulled By: sagar0
      
      fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
  30. 17 Nov 2017, 1 commit
  31. 04 Nov 2017, 1 commit
  32. 18 Oct 2017, 1 commit
    • expose a hook to skip tables during iteration · 7891af8b
      Committed by Nikhil Benesch
      Summary:
      As discussed on the mailing list (["Skipping entire SSTs while iterating"](https://groups.google.com/forum/#!topic/rocksdb/ujHCJVLrHlU)), this patch adds a `table_filter` to `ReadOptions` that allows specifying a callback to be executed during iteration before each table in the database is scanned. The callback is passed the table's properties; the table is scanned iff the callback returns true.
      
      This can be used in conjunction with a `TablePropertiesCollector` to dramatically speed up scans by skipping tables that are known to contain irrelevant data for the scan at hand.
      
      We're using this [downstream in CockroachDB](https://github.com/cockroachdb/cockroach/blob/master/pkg/storage/engine/db.cc#L2009-L2022) already. With this feature, under ideal conditions, we can reduce the time of an incremental backup in CockroachDB from hours to seconds.
      
      FYI, the first commit in this PR fixes a segfault that I unfortunately have not figured out how to reproduce outside of CockroachDB. I'm hoping you accept it on the grounds that it is not correct to return 8-byte aligned memory from a call to `malloc` on some 64-bit platforms; one correct approach is to infer the necessary alignment from `std::max_align_t`, as done here. As noted in the first commit message, the bug is tickled by having a `std::function` in `struct ReadOptions`. That is, the following patch alone is enough to cause RocksDB to segfault when run from CockroachDB on Darwin.
      
      ```diff
       --- a/include/rocksdb/options.h
      +++ b/include/rocksdb/options.h
      @@ -1546,6 +1546,13 @@ struct ReadOptions {
         // Default: false
         bool ignore_range_deletions;
      
      +  // A callback to determine whether relevant keys for this scan exist in a
      +  // given table based on the table's properties. The callback is passed the
      +  // properties of each table during iteration. If the callback returns false,
      +  // the table will not be scanned.
      +  // Default: empty (every table will be scanned)
      +  std::function<bool(const TableProperties&)> table_filter;
      +
         ReadOptions();
         ReadOptions(bool cksum, bool cache);
       };
      ```
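      A hedged usage sketch of the hook; "max_timestamp" is a hypothetical property written by a custom `TablePropertiesCollector`:
      ```cpp
      #include <memory>
      #include <string>

      #include "rocksdb/db.h"
      #include "rocksdb/table_properties.h"

      void ScanRecent(rocksdb::DB* db, const std::string& min_ts) {
        rocksdb::ReadOptions read_opts;
        read_opts.table_filter =
            [&min_ts](const rocksdb::TableProperties& props) {
              auto it = props.user_collected_properties.find("max_timestamp");
              // Scan the table unless its newest entry is provably too old.
              return it == props.user_collected_properties.end() ||
                     it->second >= min_ts;
            };
        std::unique_ptr<rocksdb::Iterator> iter(db->NewIterator(read_opts));
        for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
          // process iter->key() / iter->value()
        }
      }
      ```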
      
      /cc danhhz
      Closes https://github.com/facebook/rocksdb/pull/2265
      
      Differential Revision: D5054262
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: dd6b28f2bba6cb8466250d8c5c542d3c92785476
  33. 01 Aug 2017, 1 commit
    • fix db get/write stats · 6a36b3a7
      Committed by Andrew Kryczka
      Summary:
      We were passing `record_read_stats` (a bool) as the `hist_type` argument, which meant we were updating either `rocksdb.db.get.micros` (`hist_type == 0`) or `rocksdb.db.write.micros` (`hist_type == 1`) with the wrong data.
      Closes https://github.com/facebook/rocksdb/pull/2666
      
      Differential Revision: D5520384
      
      Pulled By: ajkr
      
      fbshipit-source-id: 2f7c956aec32f8b58c5c18845ac478e0230c9516
  34. 28 Jul 2017, 1 commit
  35. 22 Jul 2017, 2 commits
  36. 18 Jul 2017, 1 commit
    • enable PinnableSlice for RowCache · 0655b585
      Committed by Sushma Devendrappa
      Summary:
      This patch enables using PinnableSlice for the RowCache. Changes include
      not releasing the cache handle immediately after the lookup in TableCache::Get; instead, a Cleanable function that performs Cache::ReleaseHandle is passed along.
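      A hedged sketch of the PinnableSlice-based lookup that benefits from this change (this Get overload is part of the public API):
      ```cpp
      #include "rocksdb/db.h"

      rocksdb::Status PinnedGet(rocksdb::DB* db, const rocksdb::Slice& key,
                                rocksdb::PinnableSlice* value) {
        // With the row cache holding a Cleanable, the returned value can stay
        // pinned in cache instead of being copied into a std::string.
        return db->Get(rocksdb::ReadOptions(), db->DefaultColumnFamily(), key,
                       value);
      }
      ```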
      Closes https://github.com/facebook/rocksdb/pull/2492
      
      Differential Revision: D5316216
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362