1. 21 Aug 2019 (2 commits)
  2. 17 Aug 2019 (3 commits)
  3. 16 Aug 2019 (1 commit)
    • Add command "list_file_range_deletes" in ldb (#5615) · bd2c753d
      Committed by sdong
      Summary:
      Add a command in ldb so that users can print out tombstones in SST files.
      To make the code testable, change the interface of LDBCommandRunner::RunCommand() so that it returns the status code instead of exiting the program.
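      A minimal sketch of how a test might use the new interface (the header path and exact signature are assumptions, not taken from the PR):
      ```
      #include <vector>
      #include "rocksdb/utilities/ldb_cmd.h"

      int main(int argc, char** argv) {
        rocksdb::Options options;
        rocksdb::LDBOptions ldb_options;
        // e.g. argv could be {"ldb", "--db=/tmp/testdb", "list_file_range_deletes"}
        // With the changed interface, the status code is returned to the caller
        // instead of RunCommand() terminating the process.
        return rocksdb::LDBCommandRunner::RunCommand(
            argc, argv, options, ldb_options, /*column_families=*/nullptr);
      }
      ```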
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615
      
      Test Plan: Add a new unit test
      
      Differential Revision: D16550326
      
      fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
      bd2c753d
  4. 10 Aug 2019 (1 commit)
    • Support loading custom objects in unit tests (#5676) · 5d9a67e7
      Committed by Yanqin Jin
      Summary:
      Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc.
      This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. A RocksDB unit test can then call `RegisterCustomObjects()` at startup.
      By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop.
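      A minimal sketch of what such a hook could look like in the static library (the function body and signature are hypothetical; only the hook name comes from this PR):
      ```
      // Compiled into a static library linked with the unit tests when
      // ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS is defined.
      void RegisterCustomObjects(int argc, char** argv) {
        (void)argc;
        (void)argv;
        // Register custom objects here, e.g. a custom Env such as HdfsEnv,
        // so that the unit tests run against a non-default environment.
      }
      ```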
      Test plan (on devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 all
      $make check
      ```
      All unit tests must pass.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676
      
      Differential Revision: D16679157
      
      Pulled By: riversand963
      
      fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d
      5d9a67e7
  5. 31 Jul 2019 (3 commits)
  6. 26 Jul 2019 (1 commit)
  7. 24 Jul 2019 (3 commits)
    • The ObjectRegistry class replaces the Registrar and NewCustomObjects.… (#5293) · cfcf045a
      Committed by Mark Rambacher
      Summary:
      The ObjectRegistry class replaces the Registrar and NewCustomObjects.  Objects are registered with the registry by Type (the class must implement the static const char *Type() method).
      
      This change is necessary for a few reasons:
      - By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
      - By having a class with instances, different units could have different objects registered.  This could be useful if, for example, one Option allowed for a dynamic library and one did not.
      
      When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.
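      A hypothetical sketch of the registration pattern described above; the method names and factory signature are assumptions rather than the exact API of this PR:
      ```
      // MyCustomEnv is a hypothetical Env subclass used for illustration.
      auto registry = rocksdb::ObjectRegistry::NewInstance();
      registry->Register<rocksdb::Env>(
          "custom-env://.*",  // pattern matched against the target string
          [](const std::string& /*uri*/, std::unique_ptr<rocksdb::Env>* guard,
             std::string* /*errmsg*/) -> rocksdb::Env* {
            guard->reset(new MyCustomEnv(rocksdb::Env::Default()));
            return guard->get();
          });

      // Later, possibly in a different compilation unit or shared library:
      std::unique_ptr<rocksdb::Env> guard;
      std::string errmsg;
      rocksdb::Env* env = registry->NewObject<rocksdb::Env>(
          "custom-env://test", &guard, &errmsg);
      ```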
      
      Test plan (on riversand963's  devserver)
      ```
      $COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
      ```
      All tests pass.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293
      
      Differential Revision: D16363396
      
      Pulled By: riversand963
      
      fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
      cfcf045a
    • Move the uncompression dictionary object out of the block cache (#5584) · 092f4170
      Committed by Levi Tamasi
      Summary:
      RocksDB has historically stored uncompression dictionary objects in the block
      cache as opposed to storing just the block contents. This necessitated
      evicting the object upon table close. With the new code, only the raw blocks
      are stored in the cache, eliminating the need for eviction.
      
      In addition, the patch makes the following improvements:
      
      1) Compression dictionary blocks are now prefetched/pinned similarly to
      index/filter blocks.
      2) A copy operation got eliminated when the uncompression dictionary is
      retrieved.
      3) Errors related to retrieving the uncompression dictionary are propagated as
      opposed to silently ignored.
      
      Note: the patch temporarily breaks the compression dictionary eviction stats.
      They will be fixed in a separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584
      
      Test Plan: make asan_check
      
      Differential Revision: D16344151
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b
      092f4170
    • ldb sometimes specify a string-append merge operator (#5607) · 3782accf
      Committed by sdong
      Summary:
      Right now, ldb cannot scan a DB containing merge operands with the default ldb settings. There is no harm in specifying a general-purpose merge operator so that it can at least print out something.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5607
      
      Test Plan: Run ldb against a DB with merge operands and see the outputs.
      
      Differential Revision: D16442634
      
      fbshipit-source-id: c66c414ec07f219cfc6e6ec2cc14c783ee95df54
      3782accf
  8. 23 Jul 2019 (2 commits)
    • Disable refresh snapshot feature by default (#5606) · 327c4807
      Committed by Maysam Yabandeh
      Summary:
      There are concerns about the correctness of this patch. Disabling by default until the concerns are resolved.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5606
      
      Differential Revision: D16428064
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: a89280f0ea85796c9c9dfbfd9a8e91dad9b000b3
      327c4807
    • row_cache to share entry for recent snapshots (#5600) · 66b5613d
      Committed by sdong
      Summary:
      Right now, users cannot take advantage of the row cache unless no snapshot is used, or Get() is repeated for the same snapshot. This limits the usefulness of the row cache.
      This change eliminates this restriction in some cases: if the snapshot used is newer than the largest sequence number in the file, and no write callback function is registered, the same row cache key is used as when no snapshot is given. We still need the callback-function restriction for now because the callback function may filter out different keys for different snapshots even if the snapshots are new.
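      A minimal sketch of the usage that now benefits from row-cache sharing:
      ```
      rocksdb::Options options;
      options.row_cache = rocksdb::NewLRUCache(64 << 20);  // 64 MB row cache
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rowcache_db", &db);

      // If the snapshot is newer than the largest sequence number in the file
      // (and no write callback is registered), this Get() can now share the
      // row cache entry with snapshot-less reads.
      const rocksdb::Snapshot* snap = db->GetSnapshot();
      rocksdb::ReadOptions ro;
      ro.snapshot = snap;
      std::string value;
      s = db->Get(ro, "key", &value);
      db->ReleaseSnapshot(snap);
      ```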
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5600
      
      Test Plan: Add a unit test.
      
      Differential Revision: D16386616
      
      fbshipit-source-id: 6b7d214bd215d191b03ccf55926ad4b703ec2e53
      66b5613d
  9. 19 Jul 2019 (1 commit)
  10. 17 Jul 2019 (1 commit)
    • Move the filter readers out of the block cache (#5504) · 3bde41b5
      Committed by Levi Tamasi
      Summary:
      Currently, when the block cache is used for the filter block, it is not
      really the block itself that is stored in the cache but a FilterBlockReader
      object. Since this object is not pure data (it has, for instance, pointers that
      might dangle, including in one case a back pointer to the TableReader), it's not
      really sharable. To avoid the issues around this, the current code erases the
      cache entries when the TableReader is closed (which, BTW, is not sufficient
      since a concurrent TableReader might have picked up the object in the meantime).
      Instead of doing this, the patch moves the FilterBlockReader out of the cache
      altogether, and decouples the filter reader object from the filter block.
      In particular, instead of the TableReader owning, or caching/pinning the
      FilterBlockReader (based on the customer's settings), with the change the
      TableReader unconditionally owns the FilterBlockReader, which in turn
      owns/caches/pins the filter block. This change also enables us to reuse the code
      paths historically used for data blocks for filters as well.
      
      Note:
      Eviction statistics for filter blocks are temporarily broken. We plan to fix this in a
      separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5504
      
      Test Plan: make asan_check
      
      Differential Revision: D16036974
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 770f543c5fb4ed126fd1e04bfd3809cf4ff9c091
      3bde41b5
  11. 10 Jul 2019 (1 commit)
    • Allow ldb to open DB as secondary (#5537) · aa0367aa
      Committed by sdong
      Summary:
      Right now, ldb can open a running DB through a read-only DB handle. However, this might leave info log files in the read-only DB directory. Add an option to open the DB as a secondary instance to avoid that.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5537
      
      Test Plan:
      Run
      ./ldb scan  --max_keys=10 --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp --no_value --hex
      and
      ./ldb get 0x00000000000000103030303030303030 --hex --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp
      against a normal db_bench run and observe the output changes. Also observe that no new info log files are created under /tmp/rocksdbtest-2491/dbbench.
      Run without --secondary_path and observe that new info log files are created under /tmp/rocksdbtest-2491/dbbench.
      
      Differential Revision: D16113886
      
      fbshipit-source-id: 4e09dec47c2528f6ca08a9e7a7894ba2d9daebbb
      aa0367aa
  12. 08 Jul 2019 (1 commit)
    • Support GetAllKeyVersions() for non-default cf (#5544) · 7c76a7fb
      Committed by Yanqin Jin
      Summary:
      Previously, `GetAllKeyVersions()` supported the default column family only. This PR adds support for other column families.
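      A minimal sketch of calling the extended API (assuming an open `db` and a `cf_handle` for the target column family):
      ```
      #include "rocksdb/utilities/debug.h"

      std::vector<rocksdb::KeyVersion> versions;
      // The overload taking a ColumnFamilyHandle* is the one added here;
      // empty begin/end keys are assumed to mean an unbounded range.
      rocksdb::Status s = rocksdb::GetAllKeyVersions(
          db, cf_handle, rocksdb::Slice(), rocksdb::Slice(),
          /*max_num_ikeys=*/100, &versions);
      ```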
      
      Test plan (devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test
      $./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions
      ```
      All other unit tests must pass.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544
      
      Differential Revision: D16147551
      
      Pulled By: riversand963
      
      fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168
      7c76a7fb
  13. 07 Jul 2019 (1 commit)
  14. 04 Jul 2019 (1 commit)
  15. 03 Jul 2019 (1 commit)
  16. 01 Jul 2019 (1 commit)
    • MultiGet parallel IO (#5464) · 7259e28d
      Committed by anand76
      Summary:
      Enhance MultiGet batching to read the data blocks required for the keys in a batch in parallel from disk. It uses the Env::MultiRead() API to read multiple blocks and reduce latency.
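      A minimal sketch of the batched MultiGet interface that benefits from the parallel reads (assuming an open `db`):
      ```
      constexpr size_t kNumKeys = 3;
      rocksdb::Slice keys[kNumKeys] = {"k1", "k2", "k3"};
      rocksdb::PinnableSlice values[kNumKeys];
      rocksdb::Status statuses[kNumKeys];
      // Data blocks needed by the batch can now be fetched in one
      // Env::MultiRead() call instead of one read at a time.
      db->MultiGet(rocksdb::ReadOptions(), db->DefaultColumnFamily(),
                   kNumKeys, keys, values, statuses, /*sorted_input=*/true);
      ```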
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5464
      
      Test Plan:
      1. make check
      2. make asan_check
      3. make asan_crash
      
      Differential Revision: D15911771
      
      Pulled By: anand1976
      
      fbshipit-source-id: 605036b9af0f90ca0020dc87c3a86b4da6e83394
      7259e28d
  17. 28 Jun 2019 (1 commit)
    • LRU Cache to enable mid-point insertion by default (#5508) · 15fd3be0
      Committed by sdong
      Summary:
      Mid-point insertion is a useful feature and is mature now; make it the default. Also change the default of cache_index_and_filter_blocks_with_high_priority to true accordingly, so that index and filter blocks are not evicted more easily after the change, avoiding surprises for users.
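      A minimal sketch of the cache configuration these new defaults correspond to; setting it explicitly like this has the same effect:
      ```
      auto cache = rocksdb::NewLRUCache(
          /*capacity=*/1 << 30, /*num_shard_bits=*/-1,
          /*strict_capacity_limit=*/false,
          /*high_pri_pool_ratio=*/0.5);  // mid-point insertion enabled
      rocksdb::BlockBasedTableOptions table_options;
      table_options.block_cache = cache;
      table_options.cache_index_and_filter_blocks = true;
      table_options.cache_index_and_filter_blocks_with_high_priority = true;
      ```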
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508
      
      Test Plan: Run all existing tests.
      
      Differential Revision: D16021179
      
      fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb
      15fd3be0
  18. 27 Jun 2019 (1 commit)
  19. 25 Jun 2019 (1 commit)
    • Add an option to put first key of each sst block in the index (#5289) · b4d72094
      Committed by Mike Kolupaev
      Summary:
      The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
      
      Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
      
      So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in a small number of sst files; in this case, the cost of the initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
      
      Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
      
      This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
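      A minimal sketch of enabling the option (assuming the index-type enum value introduced by this PR):
      ```
      rocksdb::BlockBasedTableOptions table_options;
      table_options.index_type =
          rocksdb::BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey;
      rocksdb::Options options;
      options.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_options));
      ```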
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
      
      Differential Revision: D15256423
      
      Pulled By: al13n321
      
      fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
      b4d72094
  20. 22 Jun 2019 (1 commit)
  21. 19 Jun 2019 (3 commits)
    • Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474) · fe90ed7a
      Committed by Simon Grätzer
      
      Summary:
      When tailing the WAL with TransactionLogIterator, it used to return a Corruption status to indicate that the WAL has a new tail that is not visible to the iterator, which is misleading. The patch replaces it with TryAgain, a more descriptive status indicating that the user needs to create a new iterator to fetch the recent tail.
      Fixes https://github.com/facebook/rocksdb/issues/5455
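      A minimal sketch of a tailing loop reacting to the new status (assuming an open `db` and a starting sequence number `start_seq`):
      ```
      std::unique_ptr<rocksdb::TransactionLogIterator> iter;
      rocksdb::Status s = db->GetUpdatesSince(start_seq, &iter);
      while (s.ok() && iter->Valid()) {
        rocksdb::BatchResult batch = iter->GetBatch();
        // ... apply batch.writeBatchPtr ...
        iter->Next();
      }
      if (iter && iter->status().IsTryAgain()) {
        // The WAL has a newer tail not visible to this iterator:
        // create a new iterator to fetch it.
      }
      ```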
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5474
      
      Differential Revision: D15898953
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 40966f6457cb539e1aeb104daeada6b0e46059fc
      fe90ed7a
    • Make the 'block read count' performance counters consistent (#5484) · 5355e527
      Committed by Levi Tamasi
      Summary:
      The patch brings the semantics of per-block-type read performance
      context counters in sync with the generic block_read_count by only
      incrementing the counter if the block was actually read from the file.
      It also fixes index_block_read_count, which fell victim to the
      refactoring in PR https://github.com/facebook/rocksdb/issues/5298.
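      A minimal sketch of reading the now-consistent counters (assuming an open `db`):
      ```
      #include "rocksdb/perf_context.h"

      rocksdb::get_perf_context()->Reset();
      std::string value;
      db->Get(rocksdb::ReadOptions(), "key", &value);
      auto* ctx = rocksdb::get_perf_context();
      // Per-block-type counters now increment only when the block was
      // actually read from the file, matching block_read_count.
      uint64_t total_reads = ctx->block_read_count;
      uint64_t index_reads = ctx->index_block_read_count;
      uint64_t filter_reads = ctx->filter_block_read_count;
      ```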
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5484
      
      Test Plan: Extended the unit tests.
      
      Differential Revision: D15887431
      
      Pulled By: ltamasi
      
      fbshipit-source-id: a3889759d0ac5759d56625d692cd828d1b9207a6
      5355e527
    • Fix a bug caused by secondary not skipping the beginning of new MANIFEST (#5472) · f287f8dc
      Committed by Yanqin Jin
      Summary:
      While the secondary is replaying after the primary, the primary may switch to a new MANIFEST. The secondary is already able to detect and follow the primary to the new MANIFEST. However, the current implementation has a bug, described as follows.
      The new MANIFEST's first records have been generated by VersionSet::WriteSnapshot to describe the current state of the column families and the db as of the MANIFEST creation. Since the secondary instance has already finished recovering upon start, there is no need for the secondary to process these records. Actually, if the secondary were to replay these records, the secondary may end up adding the same SST files **again** to each column family, causing consistency checks done by VersionBuilder to fail. Therefore, we record the number of records to skip at the beginning of the new MANIFEST and ignore them.
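      A minimal sketch of the secondary-instance usage this fix applies to:
      ```
      rocksdb::DB* secondary = nullptr;
      rocksdb::Status s = rocksdb::DB::OpenAsSecondary(
          options, "/path/to/primary_db", "/path/to/secondary_info_log_dir",
          &secondary);
      // Periodically catch up with the primary; with this fix, the
      // WriteSnapshot records at the head of a newly created MANIFEST are
      // skipped instead of being replayed.
      s = secondary->TryCatchUpWithPrimary();
      ```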
      
      Test plan (on dev server)
      ```
      $make clean && make -j32 all
      $./db_secondary_test
      ```
      All existing unit tests must pass as well.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5472
      
      Differential Revision: D15866771
      
      Pulled By: riversand963
      
      fbshipit-source-id: a1eec4837fb2ad13059398efb0f437e74fd53bed
      f287f8dc
  22. 15 Jun 2019 (1 commit)
    • Validate CF Options when creating a new column family (#5453) · f1219644
      Committed by Sagar Vemuri
      Summary:
      It seems like CF Options are not properly validated when creating a new column family with the `CreateColumnFamily` API; only a select few checks are done. Calling `ColumnFamilyData::ValidateOptions`, which is the single source for all CFOptions validations, will help fix this. (`ColumnFamilyData::ValidateOptions` is already called at the time of `DB::Open`.)
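      A minimal sketch of the behavior change; the specific incompatible option shown is illustrative, not taken from the PR:
      ```
      rocksdb::ColumnFamilyOptions cf_opts;
      // Hypothetical example of an option that
      // ColumnFamilyData::ValidateOptions may reject in combination
      // with the DB's other settings.
      cf_opts.inplace_update_support = true;
      rocksdb::ColumnFamilyHandle* handle = nullptr;
      rocksdb::Status s = db->CreateColumnFamily(cf_opts, "new_cf", &handle);
      if (!s.ok()) {
        // Invalid CFOptions are now rejected here rather than
        // surfacing problems later.
      }
      ```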
      
      **Test Plan:**
      Added a new test: `DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions`
      ```
      TEST_TMPDIR=/dev/shm ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions
      ```
      Also ran gtest-parallel to make sure the new test is not flaky.
      ```
      TEST_TMPDIR=/dev/shm ~/gtest-parallel/gtest-parallel ./db_test --gtest_filter=DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions --repeat=10000
      [10000/10000] DBTest.CreateColumnFamilyShouldFailOnIncompatibleOptions (15 ms)
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5453
      
      Differential Revision: D15816851
      
      Pulled By: sagar0
      
      fbshipit-source-id: 9e702b9850f5c4a7e0ef8d39e1e6f9b81e7fe1e5
      f1219644
  23. 12 Jun 2019 (1 commit)
  24. 11 Jun 2019 (1 commit)
    • Improve memtable earliest seqno assignment for secondary instance (#5413) · 6ce55808
      Committed by Yanqin Jin
      Summary:
      In a regular RocksDB instance, `MemTable::earliest_seqno_` is the "db sequence number at the time of creation". However, we cannot use the db sequence number to set the value of `MemTable::earliest_seqno_` for a secondary instance, i.e. `DBImplSecondary`, due to the logic of MANIFEST and WAL replay.
      When replaying the log files of the primary, the secondary instance first replays MANIFEST and updates the db sequence number if necessary. Next, the secondary replays WAL files, creates new memtables if necessary and inserts key-value pairs into memtables. The following can occur when the db has two or more column families.
      Assume the db has column families "default" and "cf1". At a certain point in time, both "default" and "cf1" have data in memtables.
      1. Primary triggers a flush and flushes "cf1". "default" is **not** flushed.
      2. Secondary replays the MANIFEST and updates its db sequence number to the latest value learned from the MANIFEST.
      3. Secondary starts to replay the WAL that contains the writes to "default". It is possible that the write batches' sequence numbers are smaller than the db sequence number. In this case, these write batches will be skipped, and these updates will not be visible to the reader until "default" is later flushed.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5413
      
      Differential Revision: D15637407
      
      Pulled By: riversand963
      
      fbshipit-source-id: 3de3fe35cfc6f1b9f844f3f926f0df29717b6580
      6ce55808
  25. 07 Jun 2019 (1 commit)
    • Refactor the handling of cache related counters and statistics (#5408) · bee2f48a
      Committed by Levi Tamasi
      Summary:
      The patch cleans up the handling of cache hit/miss/insertion related
      performance counters, get context counters, and statistics by
      eliminating some code duplication and factoring out the affected logic
      into separate methods. In addition, it makes the semantics of cache hit
      metrics more consistent by changing the code so that accessing a
      partition of partitioned indexes/filters through a pinned reference no
      longer counts as a cache hit.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5408
      
      Differential Revision: D15610883
      
      Pulled By: ltamasi
      
      fbshipit-source-id: ee749c18965077aca971d8f8bee8b24ed8fa76f1
      bee2f48a
  26. 06 Jun 2019 (1 commit)
    • Add support for timestamp in Get/Put (#5079) · 340ed4fa
      Committed by Yanqin Jin
      Summary:
      It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exists in DBBasicTestWithTimestampWithParam. Support for timestamps requires the user to provide the timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps within the same database must share the same length and format, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
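      A hypothetical sketch of the usage described above (assuming an open `db`; the timestamp fields follow the description of `ReadOptions`/`WriteOptions` and may differ from the final API):
      ```
      #include <cstring>

      uint64_t ts_value = 42;
      char ts_buf[sizeof(ts_value)];
      std::memcpy(ts_buf, &ts_value, sizeof(ts_value));  // fixed-width timestamp
      rocksdb::Slice ts(ts_buf, sizeof(ts_buf));

      rocksdb::WriteOptions wo;
      wo.timestamp = &ts;
      db->Put(wo, "key", "value");

      rocksdb::ReadOptions ro;
      ro.timestamp = &ts;
      std::string value;
      db->Get(ro, "key", &value);
      ```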
      
      Test plan (on devserver):
      ```
      $COMPILE_WITH_ASAN=1 make -j32 all
      $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
      $make check
      ```
      All tests must pass.
      
      We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
      ```
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
      $TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
      ```
      Repeat 6 times for both versions.
      
      Results are as follows:
      ```
      |        | readrandom | fillrandom |
      | master | 16.77 MB/s | 47.05 MB/s |
      | PR5079 | 16.44 MB/s | 47.03 MB/s |
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
      
      Differential Revision: D15132946
      
      Pulled By: riversand963
      
      fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
      340ed4fa
  27. 05 Jun 2019 (1 commit)
  28. 01 Jun 2019 (1 commit)
    • Auto roll logger to enforce options.keep_log_file_num immediately after a new file is created (#5370) · cb094e13
      Committed by Siying Dong
      
      Summary:
      Right now, with the auto roll logger, options.keep_log_file_num enforcement is triggered only by events like a DB reopen or a full obsolete-file scan. In the meantime, the size and number of log files can grow without limit. We add stronger enforcement of the option, so that the number of log files always stays under control.
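      A minimal sketch of the options this enforcement applies to:
      ```
      rocksdb::Options options;
      options.max_log_file_size = 10 << 20;  // roll the info log at 10 MB
      options.keep_log_file_num = 5;  // now enforced right after each roll
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/db", &db);
      ```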
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5370
      
      Differential Revision: D15570413
      
      Pulled By: siying
      
      fbshipit-source-id: 0916c3c4d42ab8fdd29389ee7fd7e1557b03176e
      cb094e13
  29. 31 May 2019 (2 commits)
    • Fix WAL replay by skipping old write batches (#5170) · b9f59006
      Committed by Yanqin Jin
      Summary:
      1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables.
      2. Add support for benchmarking a secondary instance to db_bench_tool.
      With the changes made in this PR, we can start benchmarking a secondary instance
      using two processes. It is also possible to vary the frequency at which the
      secondary instance tries to catch up with the primary. The info log of the
      secondary can be found in a directory whose path can be specified with
      '-secondary_path'.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170
      
      Differential Revision: D15564608
      
      Pulled By: riversand963
      
      fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8
      b9f59006
    • Move the index readers out of the block cache (#5298) · 1e355842
      Committed by Levi Tamasi
      Summary:
      Currently, when the block cache is used for index blocks as well, it is
      not really the index block that is stored in the cache but an
      IndexReader object. Since this object is not pure data (it has, for
      instance, pointers that might dangle), it's not really sharable. To
      avoid the issues around this, the current code uses a dummy unique cache
      key for each TableReader to store the IndexReader, and erases the
      IndexReader entry when the TableReader is closed. Instead of doing this,
      the new code moves the IndexReader out of the cache altogether. In
      particular, instead of the TableReader owning, or caching/pinning the
      IndexReader based on the customer's settings, the TableReader
      unconditionally owns the IndexReader, which in turn owns/caches/pins
      the index block (which is itself sharable and thus can be safely put in
      the cache without any hacks).
      
      Note: the change has two side effects:
      1) Partitions of partitioned indexes no longer affect the read
      amplification statistics.
      2) Eviction statistics for index blocks are temporarily broken. We plan to fix
      this in a separate phase.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298
      
      Differential Revision: D15303203
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
      1e355842