1. July 17, 2018 · 1 commit
    • Separate some IndexBlockIter logic from BlockIter (#4136) · 8f06b4fa
      Committed by Siying Dong
      Summary:
      Logic specific to IndexBlockIter is separated out of BlockIter into IndexBlockIter. This is done by writing exclusive Seek() and SeekForPrev() implementations for DataBlockIter; all metadata block iterators and the tombstone block iterator now use DataBlockIter. The BinarySeek() sharing problem is dealt with by passing in the comparator to use, as in the sketch below.
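      For illustration, a minimal sketch of the comparator-passing idea, with invented names and a much-simplified structure (the real BlockIter internals differ):
      ```
      #include <cstdint>
      #include <string>
      #include <vector>

      struct Comparator {
        virtual ~Comparator() = default;
        virtual int Compare(const std::string& a, const std::string& b) const = 0;
      };

      class BlockIterBase {
       protected:
        std::vector<std::string> restart_keys_;  // stand-in for the restart array

        // Shared binary search; the caller passes the comparator to use, so
        // data and index iterators can reuse it with different comparators.
        uint32_t BinarySeek(const std::string& target, const Comparator* cmp) const {
          uint32_t lo = 0, hi = static_cast<uint32_t>(restart_keys_.size());
          while (lo < hi) {
            uint32_t mid = lo + (hi - lo) / 2;
            if (cmp->Compare(restart_keys_[mid], target) < 0) {
              lo = mid + 1;
            } else {
              hi = mid;
            }
          }
          return lo;  // first restart index whose key is >= target
        }
      };

      class DataBlockIter : public BlockIterBase {
       public:
        explicit DataBlockIter(const Comparator* ucmp) : ucmp_(ucmp) {}
        void Seek(const std::string& target) { index_ = BinarySeek(target, ucmp_); }

       private:
        const Comparator* ucmp_;  // comparator chosen by this iterator type
        uint32_t index_ = 0;
      };
      ```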
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136
      
      Reviewed By: maysamyabandeh
      
      Differential Revision: D8859673
      
      Pulled By: siying
      
      fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e
  2. July 14, 2018 · 3 commits
  3. July 13, 2018 · 1 commit
    • Refactor BlockIter (#4121) · d4ad32d7
      Committed by Maysam Yabandeh
      Summary:
      BlockIter is getting crowded with details that are specific to either index or data blocks. The patch moves such details down into DataBlockIter and IndexBlockIter, both inheriting from BlockIter (hierarchy sketched below).
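      A rough sketch of the resulting split; member names here are invented, not the actual fields:
      ```
      #include <cstdint>
      #include <string>

      // Shared state and block parsing stay in the base class.
      class BlockIter {
       protected:
        std::string key_;       // current entry's key
        uint32_t current_ = 0;  // offset of the current entry in the block
      };

      class DataBlockIter : public BlockIter {
        // data-block-only logic, e.g. seeking with the user-key comparator
      };

      class IndexBlockIter : public BlockIter {
        // index-block-only logic, e.g. decoding child block handles
      };
      ```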
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121
      
      Differential Revision: D8816832
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197
  4. July 12, 2018 · 1 commit
  5. July 7, 2018 · 1 commit
    • Fix a bug caused by not copying the block trailer. (#4096) · d4d9fe8e
      Committed by Yanqin Jin
      Summary:
      This was caught by the crash test, and the following is a simple way to reproduce it and verify the fix.
      One way to trigger this code path is to use the following configuration (an options sketch follows the list):
      - Compress SST file
      - Enable direct IO and prefetch buffer
      - Do NOT use compressed block cache
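      A hedged sketch of options matching that repro configuration; the exact values are illustrative, not taken from the crash test:
      ```
      #include <rocksdb/db.h>
      #include <rocksdb/options.h>
      #include <rocksdb/table.h>

      rocksdb::Options MakeReproOptions() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.compression = rocksdb::kSnappyCompression;  // compress SST files
        options.use_direct_reads = true;  // direct IO, which enables the prefetch buffer
        rocksdb::BlockBasedTableOptions table_options;
        table_options.block_cache_compressed = nullptr;  // no compressed block cache
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```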
      Closes https://github.com/facebook/rocksdb/pull/4096
      
      Differential Revision: D8742009
      
      Pulled By: riversand963
      
      fbshipit-source-id: f13381078bbb0dce92f60bd313a78ab602bcacd2
  6. July 6, 2018 · 1 commit
  7. June 30, 2018 · 1 commit
  8. June 29, 2018 · 1 commit
    • Charging block cache more accurately (#4073) · 29ffbb8a
      Committed by Maysam Yabandeh
      Summary:
      Currently the block cache is charged only for the size of the raw data block and excludes the overhead of the C++ objects that contain it. The patch improves the accuracy of the charge by including the C++ object overhead, as sketched below.
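      Conceptually (a simplified stand-in, not the actual RocksDB types):
      ```
      #include <cstddef>

      // Before: charge = raw payload size only. After: the charge also counts
      // the C++ object that wraps the payload.
      struct Block {
        const char* data = nullptr;
        size_t data_size = 0;
        size_t ApproximateMemoryUsage() const {
          return data_size + sizeof(*this);  // payload plus object overhead
        }
      };

      size_t CacheChargeOld(const Block& b) { return b.data_size; }
      size_t CacheChargeNew(const Block& b) { return b.ApproximateMemoryUsage(); }
      ```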
      Closes https://github.com/facebook/rocksdb/pull/4073
      
      Differential Revision: D8686552
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
  9. June 28, 2018 · 4 commits
    • PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067) · 14f409c0
      Committed by Zhongyi Xie
      Summary:
      with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is no longer really used in the block-based filter and full filter versions of `PrefixMayMatch`, because `prefix_extractor` is now passed in as an argument. It is also now possible for `prefix_extractor_` to be initialized to nullptr when a non-standard prefix extractor is used, as well as under ROCKSDB_LITE. Removing these checks should not break any existing tests.
      Closes https://github.com/facebook/rocksdb/pull/4067
      
      Differential Revision: D8669002
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a
    • Add bottommost_compression_opts for bottommost_compression (#3985) · 1f6efabe
      Committed by Zhichao Cao
      Summary:
      For `CompressionType` we have the options `compression` and `bottommost_compression`. Thus, to make the compression options consistent with the compression type when bottommost_compression is enabled, we add bottommost_compression_opts (usage sketched below).
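      A usage sketch; the specific values are illustrative, and the `enabled` flag (which, as I understand it, opts these options in) is an assumption worth verifying against the options documentation:
      ```
      #include <rocksdb/options.h>

      rocksdb::Options MakeTieredCompressionOptions() {
        rocksdb::Options options;
        options.compression = rocksdb::kLZ4Compression;   // non-bottommost levels
        options.bottommost_compression = rocksdb::kZSTD;  // bottommost level
        options.bottommost_compression_opts.max_dict_bytes = 16 * 1024;
        options.bottommost_compression_opts.enabled = true;  // use these opts
        return options;
      }
      ```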
      Closes https://github.com/facebook/rocksdb/pull/3985
      
      Reviewed By: riversand963
      
      Differential Revision: D8385911
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc
    • Pin mmap files in ReadOnlyDB (#4053) · 235ab9dd
      Committed by Maysam Yabandeh
      Summary:
      https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice would pin mmap'd files that could be deleted by background compaction. This is, however, a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch re-enables the pinning feature for that case (usage sketched below).
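      A usage sketch of the re-enabled case; the path and key are placeholders:
      ```
      #include <rocksdb/db.h>
      #include <rocksdb/options.h>
      #include <rocksdb/slice.h>
      #include <string>

      rocksdb::Status ReadOnlyPinnedGet(const std::string& path, std::string* out) {
        rocksdb::Options options;
        options.max_open_files = -1;     // table readers stay open: pinning is safe
        options.allow_mmap_reads = true;
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::OpenForReadOnly(options, path, &db);
        if (!s.ok()) return s;
        rocksdb::PinnableSlice value;  // may point straight into the mmap'd file
        s = db->Get(rocksdb::ReadOptions(), db->DefaultColumnFamily(), "key",
                    &value);
        if (s.ok()) out->assign(value.data(), value.size());
        delete db;
        return s;
      }
      ```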
      Closes https://github.com/facebook/rocksdb/pull/4053
      
      Differential Revision: D8662546
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd
    • WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955) · a16e00b7
      Committed by Manuel Ung
      Summary:
      This is implemented by extending ReadCallback with another function, `MaxUnpreparedSequenceNumber`, which returns the largest visible sequence number for the current transaction if there is uncommitted data written to the DB. Otherwise, it returns zero, indicating no uncommitted data.
      
      These are the places where reads had to be modified (a simplified sketch follows the list):
      - Get and Seek/Next were updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key is visible.
      - Prev did not need updates, since it does not use the seek-to-sequence-number optimization. Assuming that locks are held when writing unprepared keys, and that ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key.
      - The reseek-to-skip-keys optimization was also disabled for write-unprepared, since it is possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case.
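      A simplified sketch of the extended callback; the real ReadCallback interface lives in RocksDB internals and differs in detail:
      ```
      #include <algorithm>
      #include <cstdint>

      using SequenceNumber = uint64_t;

      class ReadCallback {
       public:
        virtual ~ReadCallback() = default;
        virtual bool IsVisible(SequenceNumber seq) = 0;
        // Largest visible sequence number when the current transaction has
        // uncommitted (unprepared) data in the DB; 0 otherwise.
        virtual SequenceNumber MaxUnpreparedSequenceNumber() { return 0; }
      };

      // Get/Seek/Next start from the larger of the snapshot and the unprepared
      // max, then iterate until a visible key is found.
      SequenceNumber SeekSequence(ReadCallback* cb, SequenceNumber snapshot_seq) {
        return std::max(snapshot_seq, cb->MaxUnpreparedSequenceNumber());
      }
      ```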
      Closes https://github.com/facebook/rocksdb/pull/3955
      
      Differential Revision: D8286688
      
      Pulled By: lth
      
      fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe
  10. June 27, 2018 · 2 commits
  11. June 26, 2018 · 1 commit
  12. June 23, 2018 · 1 commit
    • Pin top-level index on partitioned index/filter blocks (#4037) · 80ade9ad
      Committed by Maysam Yabandeh
      Summary:
      The top-level index in partitioned index/filter blocks is small and could be pinned in memory. So far we achieve that by setting cache_index_and_filter_blocks to false. This however makes it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which, in combination with cache_index_and_filter_blocks=true, keeps the top-level index in cache yet pinned, avoiding both cache misses and cache lookup overhead (usage sketched below).
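      A usage sketch combining the two options; the other settings shown are the usual companions for partitioned index/filters, not mandated by the commit:
      ```
      #include <rocksdb/options.h>
      #include <rocksdb/table.h>

      rocksdb::Options MakePinnedTopLevelOptions() {
        rocksdb::BlockBasedTableOptions t;
        t.index_type = rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
        t.partition_filters = true;
        t.cache_index_and_filter_blocks = true;   // charge them to the block cache
        t.pin_top_level_index_and_filter = true;  // but keep the top level pinned
        rocksdb::Options options;
        options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
        return options;
      }
      ```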
      Closes https://github.com/facebook/rocksdb/pull/4037
      
      Differential Revision: D8596218
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
  13. June 22, 2018 · 1 commit
    • Improve direct IO range scan performance with readahead (#3884) · 7103559f
      Committed by Sagar Vemuri
      Summary:
      This PR extends the improvements in #3282 to also work when using Direct IO.
      We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.
      
      **Description:**
      This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead, prefetching additional data on each disk IO and storing it in a local buffer. This prefetching is automatically enabled upon noticing more than 2 IOs for the same table file during iteration. The readahead size starts at 8KB and is exponentially increased on each additional sequential IO, up to a maximum of 256 KB (see the sketch below). This helps cut down the number of IOs needed to complete the range scan.
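      A sketch of that readahead schedule, with invented names (the actual logic lives in FilePrefetchBuffer):
      ```
      #include <algorithm>
      #include <cstddef>

      class ReadaheadSizer {
       public:
        // Returns the readahead size to use for this IO, then doubles it for
        // the next sequential IO, capped at 256 KB.
        size_t Next() {
          size_t cur = readahead_size_;
          readahead_size_ = std::min(readahead_size_ * 2, kMaxReadahead);
          return cur;
        }

       private:
        static constexpr size_t kMaxReadahead = 256 * 1024;
        size_t readahead_size_ = 8 * 1024;  // starts at 8 KB
      };
      ```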
      
      **Implementation Details:**
      - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
      - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
      - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
      - Made sure not to re-read, from the device, partial chunks of data that were already available in the buffer.
      - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.
      
      **Constraints:**
      - Similar to #3282, this currently gets enabled only when ReadOptions.readahead_size = 0 (which is the default value).
      - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
      - Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.
      
      **Benchmarks:**
      I used the same benchmark as used in #3282.
      Data fill:
      ```
      TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
      ```
      
      Do a long range scan: Seekrandom with large number of nexts
      ```
      TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
      ```
      
      ```
      Before:
      seekrandom   :   37939.906 micros/op 26 ops/sec;   29.2 MB/s (1636 of 1999 found)
      With this change:
      seekrandom   :   8527.720 micros/op 117 ops/sec;  129.7 MB/s (6530 of 7999 found)
      ```
      ~4.5X perf improvement, averaged over 3 runs.
      Closes https://github.com/facebook/rocksdb/pull/3884
      
      Differential Revision: D8082143
      
      Pulled By: sagar0
      
      fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
  14. June 20, 2018 · 2 commits
    • Fix the bug with duplicate prefix in partition filters (#4024) · 28a9d891
      Committed by Maysam Yabandeh
      Summary:
      https://github.com/facebook/rocksdb/pull/3764 introduced an optimization to skip duplicate prefix entries in full bloom filters. Unfortunately it also introduced a bug in partitioned full filters, where a duplicate prefix should still be inserted if it falls in a new partition. The patch fixes the bug by resetting the duplicate-detection logic each time a partition is cut (sketched below).
      This bug could result in false negatives, meaning the DB could skip an existing key.
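      A conceptual sketch of the fix, with invented names:
      ```
      #include <string>

      class PartitionedPrefixAdder {
       public:
        void MaybeAddPrefix(const std::string& prefix) {
          if (prefix != last_prefix_added_) {
            // ... add `prefix` to the current partition's filter ...
            last_prefix_added_ = prefix;
          }
        }

        void CutPartition() {
          // The fix: reset duplicate detection so the first occurrence of a
          // prefix in the new partition is always inserted. Forgetting this
          // caused false negatives across partition boundaries.
          last_prefix_added_.clear();
        }

       private:
        std::string last_prefix_added_;
      };
      ```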
      Closes https://github.com/facebook/rocksdb/pull/4024
      
      Differential Revision: D8518866
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf
    • BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004) · 92ee3350
      Committed by Siying Dong
      Summary:
      b555ed30 made BlockBasedTableIterator invalidate itself if the current position is over the upper bound. However, this can bring a performance regression in the case of multiple Seek()s hitting the same data block but all landing beyond the upper bound.
      
      For example, suppose an SST file has a data block containing the following keys: {a, z}
      
      The user sets the upper bound to "x" and executes the following queries:
      Seek("b")
      Seek("c")
      Seek("d")
      
      Before the upper-bound optimization, these queries always came to the same current data block of the iterator, but now inside each Seek() the data block is read from the block cache and then released again.
      
      To prevent this regression, we keep the current data block iterator even when it is out of the upper bound (see the sketch below).
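      The regression pattern as a usage sketch, reusing the keys from the example above:
      ```
      #include <rocksdb/db.h>
      #include <rocksdb/iterator.h>
      #include <rocksdb/options.h>
      #include <rocksdb/slice.h>
      #include <memory>

      void RepeatedSeeksPastBound(rocksdb::DB* db) {
        rocksdb::Slice upper("x");
        rocksdb::ReadOptions ro;
        ro.iterate_upper_bound = &upper;
        std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
        for (const char* k : {"b", "c", "d"}) {
          it->Seek(k);
          // Each Seek() lands in the {a, z} block; "z" >= "x", so !it->Valid().
          // Keeping the block iterator avoids re-fetching that block each time.
        }
      }
      ```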
      Closes https://github.com/facebook/rocksdb/pull/4004
      
      Differential Revision: D8463192
      
      Pulled By: siying
      
      fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81
  15. June 16, 2018 · 1 commit
  16. June 13, 2018 · 3 commits
    • Fix regression bug of Prev() with upper bound (#3989) · d82f1421
      Committed by Siying Dong
      Summary:
      A recent change pushed upper-bound checking down to child iterators. However, this makes the logic of the following sequence wrong:
        Seek(key);
        if (!Valid()) SeekToLast();
      Because !Valid() may be caused by the upper bound rather than by the end of the iterator, SeekToLast() can point to totally wrong places in this case, causing wrong results, infinite loops, or segfaults.
      This sequence is called when changing direction from forward to backward, and it also happens implicitly during the reseeking optimization in Prev().
      
      Fix this bug by using SeekForPrev() rather than this sequence, as is already done in the prefix extractor case (sketched below).
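      The broken sequence and the fix, as a sketch against the public Iterator API:
      ```
      #include <rocksdb/iterator.h>
      #include <rocksdb/slice.h>

      // Broken: !Valid() after Seek() may mean "past the upper bound", not
      // "past the end", so SeekToLast() can land in the wrong place.
      void ChangeDirectionBroken(rocksdb::Iterator* it, const rocksdb::Slice& key) {
        it->Seek(key);
        if (!it->Valid()) it->SeekToLast();
      }

      // Fixed: SeekForPrev() goes directly to the last key <= `key`.
      void ChangeDirectionFixed(rocksdb::Iterator* it, const rocksdb::Slice& key) {
        it->SeekForPrev(key);
      }
      ```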
      Closes https://github.com/facebook/rocksdb/pull/3989
      
      Differential Revision: D8385422
      
      Pulled By: siying
      
      fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4
    • Fix argument mismatch in BlockBasedTableBuilder (#3974) · 9d347332
      Committed by Andrew Kryczka
      Summary:
      The sixth argument should be the `key_includes_seq` bool, the seventh a `GetContext*`. We were mistakenly passing the `GetContext*` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least. The sketch below shows why the mismatch compiled silently.
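      A sketch of the bug pattern; the signatures are hypothetical simplifications, not the actual ones:
      ```
      struct GetContext {};

      void ReadIntoBlock(int block_id, bool key_includes_seq,
                         GetContext* get_context = nullptr) { /* ... */ }

      void Caller(GetContext* ctx) {
        // Bug pattern: the pointer implicitly converts to the bool parameter
        // (non-null => true), and get_context silently defaults to nullptr.
        ReadIntoBlock(/*block_id=*/0, ctx);
      }
      ```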
      
      Blame: 402b7aa0
      Closes https://github.com/facebook/rocksdb/pull/3974
      
      Differential Revision: D8344907
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e
    • Remove restart point from the properties_block (#3970) · 35932753
      Committed by Fenggang Wu
      Summary:
      The properties block is read sequentially and cached in a heap-allocated object, so there is no need for restart points. Thus we set the restart interval to infinity to save space.
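      Conceptually (the constant names here are invented, not from the commit):
      ```
      #include <cstdint>
      #include <limits>

      // Keys within a restart interval are delta-encoded against the previous
      // key; restart points exist mainly to support binary search. A block
      // that is only ever read sequentially needs just the single mandatory
      // restart point, so the interval can be effectively infinite.
      constexpr uint32_t kDefaultBlockRestartInterval = 16;
      constexpr uint32_t kPropertiesBlockRestartInterval =
          std::numeric_limits<uint32_t>::max();
      ```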
      Closes https://github.com/facebook/rocksdb/pull/3970
      
      Differential Revision: D8332586
      
      Pulled By: fgwu
      
      fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134
  17. June 9, 2018 · 1 commit
  18. June 7, 2018 · 1 commit
  19. June 6, 2018 · 2 commits
  20. June 5, 2018 · 2 commits
    • Extend some tests to format_version=3 (#3942) · d0c38c0c
      Committed by Maysam Yabandeh
      Summary:
      format_version=3 changes the format of the SST index. This is however not currently being tested, since tests only work with the default format_version, which is currently 2. The patch extends the most related tests to also test format_version=3 (options sketched below).
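      A usage sketch of opting into the new format:
      ```
      #include <rocksdb/options.h>
      #include <rocksdb/table.h>

      rocksdb::Options MakeFormatV3Options() {
        rocksdb::BlockBasedTableOptions t;
        t.format_version = 3;  // new index format (seq excluded from index keys)
        rocksdb::Options options;
        options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(t));
        return options;
      }
      ```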
      Closes https://github.com/facebook/rocksdb/pull/3942
      
      Differential Revision: D8238413
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
    • Provide a way to override windows memory allocator with jemalloc for ZSTD · f4b72d70
      Committed by Dmitri Smirnov
      Summary:
      Windows does not have an LD_PRELOAD mechanism to override all memory allocation functions, and ZSTD makes use of the C-runtime calloc. During flushes and compactions the default system allocator fragments, and the system slows down considerably.
      
      For builds with jemalloc we employ an advanced ZSTD context creation API that redirects memory allocation to jemalloc (sketched below). To reduce the cost of context creation on each block, we cache the ZSTD context within the block-based table builder while a new SST file is being built; this will help all platform builds, including those without jemalloc. It avoids system allocator fragmentation and improves performance.
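      A sketch of the redirection; `je_malloc`/`je_free` assume a prefixed jemalloc build, and ZSTD's custom-allocator API lives in its static-linking-only section:
      ```
      #define ZSTD_STATIC_LINKING_ONLY  // exposes ZSTD_customMem / _advanced APIs
      #include <zstd.h>
      #include <jemalloc/jemalloc.h>

      static void* JemallocAlloc(void* /*opaque*/, size_t size) {
        return je_malloc(size);
      }
      static void JemallocFree(void* /*opaque*/, void* address) {
        je_free(address);
      }

      // Create once per SST file build and reuse for every block, instead of
      // paying the context-creation cost per block.
      ZSTD_CCtx* CreateCachedZstdContext() {
        ZSTD_customMem mem = {JemallocAlloc, JemallocFree, /*opaque=*/nullptr};
        return ZSTD_createCCtx_advanced(mem);
      }
      ```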
      
      The change does not address random reads; currently, on Windows, reads with ZSTD regress as compared with SNAPPY compression.
      Closes https://github.com/facebook/rocksdb/pull/3838
      
      Differential Revision: D8229794
      
      Pulled By: miasantreble
      
      fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
  21. June 2, 2018 · 1 commit
    • Copy Get() result when file reads use mmap · fea2b1df
      Committed by Andrew Kryczka
      Summary:
      For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
      
      For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes wrong results were returned (I repro'd this a bunch of times with `db_stress`).
      
      This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer, as is the case with mmap file reads. When it is false we ensure the result of `Get()` is copied, as sketched below.
      
      This feels like a short-term solution; ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. That seemed hard, so I chose this approach to fix correctness in the meantime.
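      A conceptual sketch of the copy-vs-pin decision; the helper's shape is invented, though `PinSlice`/`PinSelf` are real `PinnableSlice` methods:
      ```
      #include <rocksdb/slice.h>

      void FillGetResult(rocksdb::PinnableSlice* result,
                         const rocksdb::Slice& value, bool block_owns_memory,
                         rocksdb::Cleanable* block_releaser) {
        if (block_owns_memory) {
          // Zero-copy: keep the block alive as long as the slice is pinned.
          result->PinSlice(value, block_releaser);
        } else {
          // mmap case: the file may be unmapped later, so copy the bytes out.
          result->PinSelf(value);
        }
      }
      ```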
      Closes https://github.com/facebook/rocksdb/pull/3881
      
      Differential Revision: D8076288
      
      Pulled By: ajkr
      
      fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
  22. June 1, 2018 · 2 commits
  23. May 30, 2018 · 1 commit
  24. May 26, 2018 · 3 commits
    • Exclude seq from index keys · 402b7aa0
      Committed by Maysam Yabandeh
      Summary:
      Index blocks have the same format as data blocks. Their keys, like the keys in data blocks, are therefore internal keys, which means that in addition to the user key they carry 8 extra bytes encoding the sequence number and value type. These extra 8 bytes are, however, unnecessary in index blocks, since index keys act as separators between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which case the sequence number is required to act as a separator.
      The patch excludes the sequence number from index keys only if the above special case does not occur for any of the index keys. It then records that in the properties block, and the reader consults the properties block to see whether it should expect sequence numbers in the keys of the index blocks (writer-side check sketched below).
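      A conceptual sketch of the writer-side check; the function shape is invented:
      ```
      #include <string>
      #include <vector>

      // The 8-byte seq/type suffix may be dropped from index keys only if no
      // two adjacent blocks share a boundary user key; the result is recorded
      // in the properties block for the reader.
      bool CanOmitSeqFromIndexKeys(
          const std::vector<std::string>& block_last_user_keys,
          const std::vector<std::string>& block_first_user_keys) {
        for (size_t i = 0; i + 1 < block_first_user_keys.size(); ++i) {
          if (block_last_user_keys[i] == block_first_user_keys[i + 1]) {
            return false;  // seq is needed to separate these two blocks
          }
        }
        return true;
      }
      ```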
      Closes https://github.com/facebook/rocksdb/pull/3894
      
      Differential Revision: D8118775
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
    • Check status when reading HashIndexPrefixesMetadataBlock · 8c3bf080
      Committed by Nathan VanBenschoten
      Summary:
      This was missed in a refactor of `ReadBlockContents` (2f1a3a4d).
      Closes https://github.com/facebook/rocksdb/pull/3906
      
      Differential Revision: D8172648
      
      Pulled By: ajkr
      
      fbshipit-source-id: 27e453b19795fea974bfed4721105be6f3a12090
    • Fix segfault caused by object premature destruction · aa53579d
      Committed by Yanqin Jin
      Summary:
      Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
      There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires a complex change.
      
      To summarize the cause of the problem: upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, running the destructor of the `std::shared_ptr`; since no other `std::shared_ptr` exists, the underlying memory is also freed.
      Later, when the db exits, it releases all table readers, including those that have been operating on the dropped column family. This requires access to the `table_options` owned by the `BlockBasedTableFactory` that has already been deleted; therefore, a segfault is raised.
      The previous workaround was to purge all obsolete files upon `ColumnFamilyData` destruction, which forces the release of the table readers of the dropped column family. However, this does not work when the user disables file deletion.
      
      Our solution in this PR is to make a copy of `table_options` in `BlockBasedTable::Rep` (sketched below). This solution increases memory copying and usage, but is much simpler.
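      The shape of the fix, sketched; the real `Rep` has many more members:
      ```
      #include <rocksdb/table.h>

      struct Rep {
        explicit Rep(const rocksdb::BlockBasedTableOptions& opts)
            : table_options(opts) {}  // a copy, not a reference into the factory

        // Owned by the table reader, so it outlives the BlockBasedTableFactory
        // even if the column family (and its factory) is destroyed first.
        const rocksdb::BlockBasedTableOptions table_options;
      };
      ```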
      
      Test plan
      ```
      $ make -j16
      $ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
      ```
      
      Expected behavior:
      All tests should pass.
      Closes https://github.com/facebook/rocksdb/pull/3898
      
      Differential Revision: D8149421
      
      Pulled By: riversand963
      
      fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
  25. May 23, 2018 · 1 commit
  26. May 22, 2018 · 1 commit
    • Move prefix_extractor to MutableCFOptions · c3ebc758
      Committed by Zhongyi Xie
      Summary:
      Currently it is not possible to change the bloom filter config without restarting the db, which causes a lot of operational complexity for users.
      This PR aims to make it possible to dynamically change the bloom filter config (usage sketched below).
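      A usage sketch of a dynamic change via SetOptions; the option-string format here is an assumption, so consult the options documentation before relying on it:
      ```
      #include <rocksdb/db.h>

      rocksdb::Status SwitchPrefixExtractor(rocksdb::DB* db) {
        // Hypothetical: switch to a capped, 4-byte prefix extractor at runtime.
        return db->SetOptions(db->DefaultColumnFamily(),
                              {{"prefix_extractor", "rocksdb.CappedPrefix.4"}});
      }
      ```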
      Closes https://github.com/facebook/rocksdb/pull/3601
      
      Differential Revision: D7253114
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c