1. 19 9月, 2019 1 次提交
  2. 11 8月, 2019 1 次提交
    • Z
      exclude TEST_ENV_URI from rocksdb lite (#5686) · de3fb9a6
      Zhongyi Xie 提交于
      Summary:
      PR https://github.com/facebook/rocksdb/pull/5676 added some test coverage for `TEST_ENV_URI`, which unfortunately isn't supported in lite mode, causing some test failures for rocksdb lite. For example,
      ```
      db/db_test_util.cc: In constructor ‘rocksdb::DBTestBase::DBTestBase(std::__cxx11::string)’:
      db/db_test_util.cc:57:16: error: ‘ObjectRegistry’ has not been declared
           Status s = ObjectRegistry::NewInstance()->NewSharedObject(test_env_uri,
                      ^
      ```
      This PR fixes these errors by excluding the new code from test functions for lite mode.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5686
      
      Differential Revision: D16749000
      
      Pulled By: miasantreble
      
      fbshipit-source-id: e8b3088c31a78b3dffc5fe7814261909d2c3e369
      de3fb9a6
  3. 10 8月, 2019 1 次提交
    • Y
      Support loading custom objects in unit tests (#5676) · 5d9a67e7
      Yanqin Jin 提交于
      Summary:
      Most existing RocksDB unit tests run on `Env::Default()`. It will be useful to port the unit tests to non-default environments, e.g. `HdfsEnv`, etc.
      This pull request is one step towards this goal. If RocksDB unit tests are built with a static library exposing a function `RegisterCustomObjects()`, then it is possible to implement custom object registrar logic in the library. RocksDB unit test can call `RegisterCustomObjects()` at the beginning.
      By default, `ROCKSDB_UNITTESTS_WITH_CUSTOM_OBJECTS_FROM_STATIC_LIBS` is not defined, thus this PR has no impact on existing RocksDB because `RegisterCustomObjects()` is a noop.
      Test plan (on devserver):
      ```
      $make clean && COMPILE_WITH_ASAN=1 make -j32 all
      $make check
      ```
      All unit tests must pass.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5676
      
      Differential Revision: D16679157
      
      Pulled By: riversand963
      
      fbshipit-source-id: aca571af3fd0525277cdc674248d0fe06e060f9d
      5d9a67e7
  4. 14 5月, 2019 1 次提交
    • M
      Unordered Writes (#5218) · f383641a
      Maysam Yabandeh 提交于
      Summary:
      Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
      The patch is prepared based on an original by siying
      Existing unit tests are extended to include unordered_write option.
      
      Benchmark Results:
      ```
      TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions  --unordered_write=1
      ```
      With WAL
      - Vanilla RocksDB: 78.6 MB/s
      - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
      - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
      
      Without WAL
      - Vanilla RocksDB: 111.3 MB/s
      - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
      - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
      
      - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
      
      Limitations:
      - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
      
      Differential Revision: D15219029
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
      f383641a
  5. 12 4月, 2019 1 次提交
    • A
      Introduce a new MultiGet batching implementation (#5011) · fefd4b98
      anand76 提交于
      Summary:
      This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
      
      Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
      1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
      2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
      
      The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
      
      Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
      
      Batch   Sizes
      
      1        | 2        | 4         | 8      | 16  | 32
      
      Random pattern (Stride length 0)
      4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074        - Get
      4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
      4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14        - MultiGet (w/ batching)
      
      Good locality (Stride length 16)
      4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
      4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
      4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
      
      Good locality (Stride length 256)
      4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
      4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
      4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
      
      Medium locality (Stride length 4096)
      4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
      4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
      4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
      
      dbbench command used (on a DB with 4 levels, 12 million keys)-
      TEST_TMPDIR=/dev/shm numactl -C 10  ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
      
      Differential Revision: D14348703
      
      Pulled By: anand1976
      
      fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
      fefd4b98
  6. 08 2月, 2019 1 次提交
  7. 03 1月, 2019 1 次提交
    • A
      Lock free MultiGet (#4754) · b9d6ecca
      Anand Ananthabhotla 提交于
      Summary:
      Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex.
      
      Tests:
      1. Unit tests
      2. make check
      3. db_bench -
      
      TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
      readrandom   :       0.167 micros/op 5983920 ops/sec;  426.2 MB/s (1000000 of 1000000 found)
      
      Multireadrandom with batch size 1:
      multireadrandom :       0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754
      
      Differential Revision: D13363550
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
      b9d6ecca
  8. 29 12月, 2018 1 次提交
    • S
      Preload some files even if options.max_open_files (#3340) · f0dda35d
      Siying Dong 提交于
      Summary:
      Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340
      
      Differential Revision: D6686945
      
      Pulled By: siying
      
      fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
      f0dda35d
  9. 18 12月, 2018 2 次提交
  10. 29 11月, 2018 1 次提交
    • A
      Clean up FragmentedRangeTombstoneList (#4692) · 8fe1e06c
      Abhishek Madan 提交于
      Summary:
      Removed `one_time_use` flag, which removed the need for some
      tests, and changed all `NewRangeTombstoneIterator` methods to return
      `FragmentedRangeTombstoneIterators`.
      
      These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
      and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
      
      Differential Revision: D13106570
      
      Pulled By: abhimadan
      
      fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
      8fe1e06c
  11. 22 11月, 2018 1 次提交
    • A
      Introduce RangeDelAggregatorV2 (#4649) · 457f77b9
      Abhishek Madan 提交于
      Summary:
      The old RangeDelAggregator did expensive pre-processing work
      to create a collapsed, binary-searchable representation of range
      tombstones. With FragmentedRangeTombstoneIterator, much of this work is
      now unnecessary. RangeDelAggregatorV2 takes advantage of this by seeking
      in each iterator to find a covering tombstone in ShouldDelete, while
      doing minimal work in AddTombstones. The old RangeDelAggregator is still
      used during flush/compaction for now, though RangeDelAggregatorV2 will
      support those uses in a future PR.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4649
      
      Differential Revision: D13146964
      
      Pulled By: abhimadan
      
      fbshipit-source-id: be29a4c020fc440500c137216fcc1cf529571eb3
      457f77b9
  12. 14 11月, 2018 1 次提交
    • A
      Backup engine support for direct I/O reads (#4640) · ea945470
      Andrew Kryczka 提交于
      Summary:
      Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640
      
      Differential Revision: D13015268
      
      Pulled By: ajkr
      
      fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269
      ea945470
  13. 10 11月, 2018 1 次提交
    • S
      Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Sagar Vemuri 提交于
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
      dc352807
  14. 02 11月, 2018 2 次提交
  15. 27 10月, 2018 1 次提交
  16. 18 9月, 2018 1 次提交
    • M
      Fix bug in partition filters with format_version=4 (#4381) · 65ac72ed
      Maysam Yabandeh 提交于
      Summary:
      Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381
      
      Differential Revision: D9879505
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
      65ac72ed
  17. 10 8月, 2018 1 次提交
    • M
      Index value delta encoding (#3983) · caf0f53a
      Maysam Yabandeh 提交于
      Summary:
      Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the  block cache more efficiently. The feature is enabled with using format_version 4.
      
      The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
      Results with sysbench read-only using 4k blocks and using 16 index restart interval:
      Format 2:
      19585   rocksdb read-only range=100
      Format 3:
      19569   rocksdb read-only range=100
      Format 4:
      19352   rocksdb read-only range=100
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
      
      Differential Revision: D8361343
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
      caf0f53a
  18. 14 7月, 2018 2 次提交
    • M
      Per-thread unique test db names (#4135) · 8581a93a
      Maysam Yabandeh 提交于
      Summary:
      The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
      Example: ``` ~/gtest-parallel/gtest-parallel ./table_test```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135
      
      Differential Revision: D8846653
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
      8581a93a
    • A
      Re-enable kUniversalSubcompactions option_config (#4125) · e3eba52a
      Anand Ananthabhotla 提交于
      Summary:
      1. Move kUniversalSubcompactions up before kEnd in db_test_util.h, so
      tests that cycle through all the option_configs include this
      2. Skip kUniversalSubcompactions wherever kUniversalCompaction and
      kUniversalCompactionMultilevel are skipped
      
      Related to #3935
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4125
      
      Differential Revision: D8828637
      
      Pulled By: anand1976
      
      fbshipit-source-id: 650dee15fd27d85281cf9bb4ca8ab460e04cac6f
      e3eba52a
  19. 07 6月, 2018 1 次提交
  20. 05 6月, 2018 1 次提交
    • M
      Extend some tests to format_version=3 (#3942) · d0c38c0c
      Maysam Yabandeh 提交于
      Summary:
      format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
      Closes https://github.com/facebook/rocksdb/pull/3942
      
      Differential Revision: D8238413
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
      d0c38c0c
  21. 02 6月, 2018 1 次提交
  22. 06 4月, 2018 1 次提交
    • P
      Support for Column family specific paths. · 446b32cf
      Phani Shekhar Mantripragada 提交于
      Summary:
      In this change, an option to set different paths for different column families is added.
      This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
      To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
      
      Changes :
      1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions.  This member is used to identify the path information whenever files are accessed.
      2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
      3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
      4) Unit tests are added appropriately.
      Closes https://github.com/facebook/rocksdb/pull/3102
      
      Differential Revision: D6951697
      
      Pulled By: ajkr
      
      fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
      446b32cf
  23. 06 3月, 2018 1 次提交
  24. 23 2月, 2018 2 次提交
  25. 17 1月, 2018 1 次提交
    • Y
      Fix multiple build failures · dc360df8
      Yi Wu 提交于
      Summary:
      * Fix DBTest.CompactRangeWithEmptyBottomLevel lite build failure
      * Fix DBTest.AutomaticConflictsWithManualCompaction failure introduce by #3366
      * Fix BlockBasedTableTest::IndexUncompressed should be disabled if snappy is disabled
      * Fix ASAN failure with DBBasicTest::DBClose test
      Closes https://github.com/facebook/rocksdb/pull/3373
      
      Differential Revision: D6732313
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: 1eb9b9d9a8d795f56188fa9770db9353f6fdedc5
      dc360df8
  26. 19 12月, 2017 1 次提交
  27. 11 11月, 2017 1 次提交
  28. 02 11月, 2017 1 次提交
    • M
      Added support for differential snapshots · 7fe3b328
      Mikhail Antonov 提交于
      Summary:
      The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
      
      This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
      
      From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
      
      This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
      
      For now, what's done here according to initial discussions:
      
      Preserving deletes:
       - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
       - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
       - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
      
      Iterator changes:
       - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
      
      TableCache changes:
       - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
      
      What's left:
      
       - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
      Closes https://github.com/facebook/rocksdb/pull/2999
      
      Differential Revision: D6175602
      
      Pulled By: mikhail-antonov
      
      fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
      7fe3b328
  29. 25 10月, 2017 1 次提交
  30. 20 10月, 2017 1 次提交
  31. 01 9月, 2017 1 次提交
  32. 27 7月, 2017 3 次提交
  33. 22 7月, 2017 2 次提交