1. 16 Feb 2019 (1 commit)
    • Deprecate ttl option from CompactionOptionsFIFO (#4965) · 3231a2e5
      Authored by Aubin Sanyal
      Summary:
      We introduced the ttl option in CompactionOptionsFIFO when TTL-based file
      deletion (compaction) was supported only as part of FIFO Compaction. With
      the extension of TTL semantics to Level compaction as well,
      CompactionOptionsFIFO.ttl can now be deprecated. Instead, we will start
      using ColumnFamilyOptions.ttl for FIFO compaction too.
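
      For illustration, a minimal sketch of the replacement usage. This is an assumption-laden example rather than code from the PR; the path and the 30-day TTL are arbitrary:

      #include <rocksdb/db.h>
      #include <rocksdb/options.h>

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.compaction_style = rocksdb::kCompactionStyleFIFO;
        // New style: set TTL via ColumnFamilyOptions (which Options inherits),
        // instead of the now-deprecated CompactionOptionsFIFO::ttl.
        options.ttl = 30 * 24 * 60 * 60;  // 30 days, in seconds
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/fifo_ttl_db", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
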
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965
      
      Differential Revision: D14072960
      
      Pulled By: sagar0
      
      fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e
  2. 15 Feb 2019 (1 commit)
    • Dictionary compression for files written by SstFileWriter (#4978) · c8c8104d
      Authored by Andrew Kryczka
      Summary:
      If `CompressionOptions::max_dict_bytes` and/or `CompressionOptions::zstd_max_train_bytes` are set, `SstFileWriter` will now generate files respecting those options.
      
      I refactored the logic a bit for deciding when to use dictionary compression. Previously we plumbed `is_bottommost_level` down to the table builder and used that. However, that was confusing in `SstFileWriter`'s context, since we don't know what level the file will be ingested to. Instead, the higher-level callers (e.g., flush, compaction, file writer) are now responsible for building the right `CompressionOptions` to give the table builder.
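
      A minimal sketch of the newly supported usage, assuming ZSTD support is compiled in; the path, keys, and sizes are arbitrary:

      #include <rocksdb/options.h>
      #include <rocksdb/sst_file_writer.h>

      int main() {
        rocksdb::Options options;
        options.compression = rocksdb::kZSTD;
        // With these set, SstFileWriter now produces dictionary-compressed files.
        options.compression_opts.max_dict_bytes = 16 * 1024;        // 16 KB dictionary
        options.compression_opts.zstd_max_train_bytes = 64 * 1024;  // 64 KB of training data
        rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
        rocksdb::Status s = writer.Open("/tmp/example.sst");
        if (!s.ok()) return 1;
        s = writer.Put("key1", "value1");  // keys must be added in sorted order
        if (s.ok()) s = writer.Put("key2", "value2");
        if (s.ok()) s = writer.Finish();
        return s.ok() ? 0 : 1;
      }
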
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4978
      
      Differential Revision: D14060763
      
      Pulled By: ajkr
      
      fbshipit-source-id: dc802c327896df2b319dc162d6acc82b9cdb452a
  3. 14 Feb 2019 (1 commit)
  4. 13 Feb 2019 (1 commit)
    • Atomic ingest (#4895) · a69d4dee
      Authored by Yanqin Jin
      Summary:
      Make file ingestion atomic. Ingesting external SST files into multiple
      column families should be atomic: if a crash occurs and the db reopens,
      either all column families have successfully ingested the files before
      the crash, or none of the ingestions have any effect on the state of
      the db.
      
      Also add unit tests for atomic ingestion.
      
      Note that the unit test here does not cover the case of incomplete atomic group
      in the MANIFEST, which is covered in VersionSetTest already.
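
      A sketch of the atomic multi-column-family call, assuming the API landed as DB::IngestExternalFiles taking a vector of IngestExternalFileArg; the handles and file paths are hypothetical:

      #include <vector>
      #include <rocksdb/db.h>

      // db, cf1, cf2 are assumed to be an open DB and two column family handles.
      rocksdb::Status AtomicIngest(rocksdb::DB* db,
                                   rocksdb::ColumnFamilyHandle* cf1,
                                   rocksdb::ColumnFamilyHandle* cf2) {
        rocksdb::IngestExternalFileArg arg1, arg2;
        arg1.column_family = cf1;
        arg1.external_files = {"/tmp/cf1.sst"};
        arg2.column_family = cf2;
        arg2.external_files = {"/tmp/cf2.sst"};
        // One call ingests into both column families atomically: after a crash
        // and reopen, either both ingestions are visible or neither is.
        return db->IngestExternalFiles({arg1, arg2});
      }
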
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895
      
      Differential Revision: D13718245
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198
  5. 12 Feb 2019 (3 commits)
    • Reduce scope of compression dictionary to single SST (#4952) · 62f70f6d
      Authored by Andrew Kryczka
      Summary:
      Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
      
      So, this PR changes the compression dictionary to be scoped per-SST. It accepts a tradeoff of more memory and CPU during table building. Important changes include (a sketch of the buffering flow follows the list):
      
      - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
      - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
      - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
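
      An illustrative, non-RocksDB sketch of the buffer-then-train flow described above; the class and helper names are hypothetical stand-ins for the internal `BlockBasedTableBuilder` logic:

      #include <cstddef>
      #include <string>
      #include <vector>

      enum class BuildState { kBuffered, kUnbuffered };

      class DictCompressingBuilder {
       public:
        explicit DictCompressingBuilder(size_t buffer_limit) : limit_(buffer_limit) {}

        void AddBlock(std::string block) {
          if (state_ == BuildState::kBuffered) {
            buffered_bytes_ += block.size();
            buffered_.push_back(std::move(block));
            if (buffered_bytes_ >= limit_) EnterUnbuffered();
          } else {
            CompressAndWrite(block);  // dictionary is already trained
          }
        }

        void Finish() {
          if (state_ == BuildState::kBuffered) EnterUnbuffered();
        }

       private:
        void EnterUnbuffered() {
          // Samples are whole buffered blocks (real units of compression),
          // not tiny fixed-size KV fragments.
          TrainDictionary(buffered_);
          for (auto& b : buffered_) CompressAndWrite(b);
          buffered_.clear();
          buffered_bytes_ = 0;
          state_ = BuildState::kUnbuffered;
        }

        void TrainDictionary(const std::vector<std::string>&) { /* e.g., ZDICT */ }
        void CompressAndWrite(const std::string&) { /* compress with dict, emit */ }

        BuildState state_ = BuildState::kBuffered;
        size_t limit_;
        size_t buffered_bytes_ = 0;
        std::vector<std::string> buffered_;
      };
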
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
      
      Differential Revision: D13967980
      
      Pulled By: ajkr
      
      fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
    • Increment NUMBER_BLOCK_NOT_COMPRESSED when !GoodCompressionRatio (#4929) · 79496d71
      Authored by Peter (Stig) Edwards
      Summary:
      See https://github.com/facebook/rocksdb/issues/4884
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4929
      
      Differential Revision: D14028333
      
      Pulled By: sagar0
      
      fbshipit-source-id: eed12bceae85385a34aaa6dd303bf0f53c4c7b06
    • Checksum properties block for block-based table (#4956) · 2d049ab7
      Authored by Yanqin Jin
      Summary:
      Always enable properties-block checksum verification for block-based tables. For external SST files ingested with `write_global_seqno == true`, we use `DecodeEntrySlow` to parse the blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956
      
      Differential Revision: D14012741
      
      Pulled By: riversand963
      
      fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186
  6. 09 Feb 2019 (2 commits)
  7. 08 Feb 2019 (2 commits)
    • Deprecate CompactionFilter::IgnoreSnapshots() = false (#4954) · f48758e9
      Authored by Siying Dong
      Summary:
      We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't
      what we expected. We thought that snapshots would always be preserved.
      However, we just realized that if no snapshot exists when a compaction
      starts, and a snapshot is created after that, the data seen from the
      snapshot can still be dropped by the compaction. This gives the feature
      strange behavior that is hard to explain. As documented in the code
      comment, the feature is not very useful with snapshots anyway. The
      decision is to deprecate it.
      
      We keep the function to avoid breaking users' code. However, we will fail
      compactions if false is returned.
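
      A minimal sketch of a filter written against the new contract (IgnoreSnapshots returning true); the drop-by-prefix rule is an arbitrary example, not from the PR:

      #include <string>
      #include <rocksdb/compaction_filter.h>
      #include <rocksdb/slice.h>

      class ExampleFilter : public rocksdb::CompactionFilter {
       public:
        bool Filter(int /*level*/, const rocksdb::Slice& key,
                    const rocksdb::Slice& /*existing_value*/,
                    std::string* /*new_value*/,
                    bool* /*value_changed*/) const override {
          // Arbitrary example: drop keys with a "tmp/" prefix.
          return key.starts_with("tmp/");
        }

        // Returning false is deprecated; compactions now fail if a filter
        // still returns false here.
        bool IgnoreSnapshots() const override { return true; }

        const char* Name() const override { return "ExampleFilter"; }
      };
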
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954
      
      Differential Revision: D13981900
      
      Pulled By: siying
      
      fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36
    • Remove cuckoo hash memtable (#4953) · cf3a6717
      Authored by Siying Dong
      Summary:
      Cuckoo Hash is less useful than we initially expected. Remove it.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953
      
      Differential Revision: D13979264
      
      Pulled By: siying
      
      fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
  8. 30 Jan 2019 (1 commit)
  9. 29 Jan 2019 (1 commit)
  10. 24 Jan 2019 (3 commits)
  11. 23 Jan 2019 (1 commit)
  12. 20 Jan 2019 (1 commit)
  13. 17 Jan 2019 (1 commit)
  14. 12 Jan 2019 (1 commit)
  15. 11 Jan 2019 (1 commit)
  16. 05 Jan 2019 (1 commit)
    • Fix point lookup on range tombstone sentinel endpoint (#4829) · 9e2c804f
      Authored by Andrew Kryczka
      Summary:
      Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:
      
      - L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
      - L1's file2 gets compacted to L2.
      - User issues `Get()` for b#3.
      - L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
      - `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.
      
      The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829
      
      Differential Revision: D13561355
      
      Pulled By: ajkr
      
      fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
  17. 04 Jan 2019 (1 commit)
  18. 03 Jan 2019 (2 commits)
    • fix accounting for range tombstones in TableProperties (#4841) · ace543a8
      Authored by Andrew Kryczka
      Summary:
      - To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`.
      - Updated assertions in the stress test's `OnTableFileCreated` handler to accept files that contain only range tombstones.
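
      A hedged sketch of inspecting these counters through the public properties API (the db is assumed to be open already):

      #include <iostream>
      #include <rocksdb/db.h>

      void PrintEntryCounts(rocksdb::DB* db) {
        rocksdb::TablePropertiesCollection props;
        if (!db->GetPropertiesOfAllTables(&props).ok()) return;
        for (const auto& [file, tp] : props) {
          // num_entries now includes range tombstones, and num_deletions
          // counts both point and range deletions.
          std::cout << file << ": entries=" << tp->num_entries
                    << " deletions=" << tp->num_deletions << '\n';
        }
      }
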
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841
      
      Differential Revision: D13568424
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026
    • Lock free MultiGet (#4754) · b9d6ecca
      Authored by Anand Ananthabhotla
      Summary:
      Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread-local cached SuperVersion for each column family in the list. This depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few attempts, we give up and lock the DB mutex.
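
      For reference, a minimal sketch of the MultiGet overload exercised here, on the default column family; the keys are arbitrary:

      #include <string>
      #include <vector>
      #include <rocksdb/db.h>

      void MultiGetExample(rocksdb::DB* db) {
        std::vector<rocksdb::Slice> keys = {"k1", "k2", "k3"};
        std::vector<std::string> values;
        // On the common path this now references SuperVersions through
        // thread-local caches instead of taking the DB mutex.
        std::vector<rocksdb::Status> statuses =
            db->MultiGet(rocksdb::ReadOptions(), keys, &values);
        for (size_t i = 0; i < statuses.size(); ++i) {
          if (statuses[i].ok()) { /* use values[i] */ }
        }
      }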
      
      Tests:
      1. Unit tests
      2. make check
3. db_bench:
      
      TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
      readrandom   :       0.167 micros/op 5983920 ops/sec;  426.2 MB/s (1000000 of 1000000 found)
      
      Multireadrandom with batch size 1:
      multireadrandom :       0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754
      
      Differential Revision: D13363550
      
      Pulled By: anand1976
      
      fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
  19. 29 Dec 2018 (1 commit)
    • Preload some files even if options.max_open_files != -1 (#3340) · f0dda35d
      Authored by Siying Dong
      Summary:
      Choose to preload some files if options.max_open_files != -1. This can slightly narrow the performance gap between options.max_open_files == -1 and a large value. To avoid a significant regression in DB reopen speed when options.max_open_files != -1, the number of files preloaded at DB open time is limited to 16.
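
      A minimal configuration sketch; the value 5000 and the path are arbitrary. With any setting other than -1, up to 16 table files are now preloaded at open:

      #include <rocksdb/db.h>

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        // -1 would keep every table file open; a bounded value now still
        // preloads a handful of files at DB::Open, narrowing the gap.
        options.max_open_files = 5000;
        rocksdb::DB* db = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/example_db", &db);
        if (s.ok()) delete db;
        return s.ok() ? 0 : 1;
      }
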
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340
      
      Differential Revision: D6686945
      
      Pulled By: siying
      
      fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
  20. 21 Dec 2018 (1 commit)
    • fix DeleteRange memory leak for mmap and block cache (#4810) · e0be1bc4
      Authored by Andrew Kryczka
      Summary:
      Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`.
      
      Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810
      
      Differential Revision: D13534296
      
      Pulled By: ajkr
      
      fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599
  21. 14 Dec 2018 (1 commit)
    • Improve flushing multiple column families (#4708) · 4fce44fc
      Authored by Yanqin Jin
      Summary:
      If one column family is dropped, we should simply skip it and continue to flush
      other active ones.
      Currently we use Status::ShutdownInProgress to notify the caller of column
      families being dropped. In the future, we should consider using a different
      Status code.
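
      A sketch of the multi-column-family flush entry point this touches, assuming the vector overload of DB::Flush from the atomic-flush work:

      #include <vector>
      #include <rocksdb/db.h>

      rocksdb::Status FlushBoth(rocksdb::DB* db,
                                rocksdb::ColumnFamilyHandle* cf1,
                                rocksdb::ColumnFamilyHandle* cf2) {
        // If one of the column families has been dropped, it is now simply
        // skipped while the remaining active ones are still flushed.
        std::vector<rocksdb::ColumnFamilyHandle*> cfs = {cf1, cf2};
        return db->Flush(rocksdb::FlushOptions(), cfs);
      }
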
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708
      
      Differential Revision: D13378954
      
      Pulled By: riversand963
      
      fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
  22. 08 Dec 2018 (1 commit)
    • Enable checkpoint of read-only db (#4681) · f307479b
      Authored by Yanqin Jin
      Summary:
      1. DBImplReadOnly::GetLiveFiles should not return NotSupported. Instead, it
         should call DBImpl::GetLiveFiles(flush_memtable=false).
      2. In DBImpl::Recover, we should also recover the OPTIONS file name and/or
         number so that an immediately subsequent GetLiveFiles will get the
         correct OPTIONS name.
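
      A sketch of the newly enabled flow using the standard Checkpoint utility; the paths are arbitrary:

      #include <rocksdb/db.h>
      #include <rocksdb/utilities/checkpoint.h>

      int main() {
        rocksdb::DB* db = nullptr;
        rocksdb::Options options;
        // Open the database read-only; creating a checkpoint from it now works.
        rocksdb::Status s =
            rocksdb::DB::OpenForReadOnly(options, "/tmp/source_db", &db);
        if (!s.ok()) return 1;
        rocksdb::Checkpoint* checkpoint = nullptr;
        s = rocksdb::Checkpoint::Create(db, &checkpoint);
        if (s.ok()) s = checkpoint->CreateCheckpoint("/tmp/checkpoint_dir");
        delete checkpoint;
        delete db;
        return s.ok() ? 0 : 1;
      }
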
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4681
      
      Differential Revision: D13069205
      
      Pulled By: riversand963
      
      fbshipit-source-id: 3e6a0174307d06db5a01feb099b306cea1f7f88a
  23. 07 Dec 2018 (1 commit)
    • Extend Transaction::GetForUpdate with do_validate (#4680) · b878f93c
      Authored by Maysam Yabandeh
      Summary:
      Transaction::GetForUpdate is extended with a do_validate parameter, with a default value of true. If false, it skips validating the snapshot (if there is any) before doing the read, and after the read it also returns the latest value (it expects ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly, ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked, with a default value of false. If true, it indicates that the call is assumed to come after a ::GetForUpdate(do_validate=false).
      The Java APIs are updated accordingly.
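
      A sketch of the InnoDB-style usage this enables, assuming a pessimistic TransactionDB and a GetForUpdate(options, key, value, exclusive, do_validate) overload; the key and value are arbitrary:

      #include <string>
      #include <rocksdb/utilities/transaction_db.h>

      rocksdb::Status ReadModifyWrite(rocksdb::TransactionDB* txn_db) {
        rocksdb::Transaction* txn = txn_db->BeginTransaction(rocksdb::WriteOptions());
        std::string value;
        // do_validate=false: skip snapshot validation and read the latest
        // value, similar to how InnoDB locks rows.
        rocksdb::Status s = txn->GetForUpdate(rocksdb::ReadOptions(), "key", &value,
                                              /*exclusive=*/true,
                                              /*do_validate=*/false);
        if (s.ok() || s.IsNotFound()) {
          s = txn->Put("key", "new_value");
          if (s.ok()) s = txn->Commit();
        }
        delete txn;
        return s;
      }
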
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680
      
      Differential Revision: D13068508
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d
  24. 06 Dec 2018 (1 commit)
  25. 01 Dec 2018 (2 commits)
  26. 29 Nov 2018 (1 commit)
  27. 28 Nov 2018 (1 commit)
  28. 22 Nov 2018 (2 commits)
  29. 21 Nov 2018 (1 commit)
    • Fix range tombstone covering short-circuit logic (#4698) · ed5aec5b
      Authored by Abhishek Madan
      Summary:
      Since a range tombstone seen at one level will cover all keys
      in the range at lower levels, there was a short-circuiting check in Get
      that reported a key was not found at most one file after the range
      tombstone was discovered. However, this was incorrect for merge
      operands, since a deletion might only cover some merge operands,
      which implies that the key should be found. This PR fixes this logic in
      the Version portion of Get, and removes the logic from the MemTable
      portion of Get, since the performance benefit provided there is minimal.
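
      An illustrative sketch of the semantics being fixed, using a hypothetical append-style merge operator: when a DeleteRange falls between two Merge calls, Get must still return the later operand rather than NotFound:

      #include <memory>
      #include <string>
      #include <rocksdb/db.h>
      #include <rocksdb/merge_operator.h>

      // Hypothetical associative operator that concatenates operands.
      class AppendOperator : public rocksdb::AssociativeMergeOperator {
       public:
        bool Merge(const rocksdb::Slice& /*key*/, const rocksdb::Slice* existing,
                   const rocksdb::Slice& value, std::string* new_value,
                   rocksdb::Logger* /*logger*/) const override {
          *new_value = existing ? existing->ToString() + "," + value.ToString()
                                : value.ToString();
          return true;
        }
        const char* Name() const override { return "AppendOperator"; }
      };

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.merge_operator = std::make_shared<AppendOperator>();
        rocksdb::DB* db = nullptr;
        if (!rocksdb::DB::Open(options, "/tmp/merge_db", &db).ok()) return 1;
        rocksdb::WriteOptions wo;
        db->Merge(wo, "k", "a");
        // The range deletion covers only the first operand...
        db->DeleteRange(wo, db->DefaultColumnFamily(), "a", "z");
        db->Merge(wo, "k", "b");
        std::string value;
        // ...so Get must report "b", not NotFound.
        rocksdb::Status s = db->Get(rocksdb::ReadOptions(), "k", &value);
        delete db;
        return (s.ok() && value == "b") ? 0 : 1;
      }
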
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4698
      
      Differential Revision: D13142484
      
      Pulled By: abhimadan
      
      fbshipit-source-id: cbd74537c806032f2bfa564724d01a80df7c8f10
  30. 14 Nov 2018 (2 commits)
    • Move MemoryAllocator option from Cache to BlockBasedTableOptions (#4676) · b32d087d
      Authored by Yi Wu
      Summary:
      Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decoupled. The idea is that the memory allocator handles memory allocation, while the cache handles cache policy.
      
      It is normal for external cache libraries to couple the two components for better optimization. If we want to integrate with such a library in the future, we can make a wrapper around it that implements both the `Cache` and `MemoryAllocator` interfaces.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676
      
      Differential Revision: D13047662
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
    • Divide `NO_ITERATORS` into two counters `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE` (#4498) · 5945e16d
      Authored by Soli Como
      Summary:
      Currently, `Statistics` can record a tick via `recordTick()`, whose second parameter is a `uint64_t`.
      That means a tick can only increase.
      If we want to reduce a tick, we have to resort to a workaround like `RecordTick(statistics_, NO_ITERATORS, uint64_t(-1));`.
      That's kind of a hack.

      So, this PR divides `NO_ITERATORS` into two counters, `NO_ITERATOR_CREATED` and `NO_ITERATOR_DELETE`, making the counters increase-only.
      
      Fixes #3013.
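
      A sketch of reading the new tickers, assuming they are exposed in statistics.h as NO_ITERATOR_CREATED and NO_ITERATOR_DELETED (the exact enum spelling may differ by version):

      #include <iostream>
      #include <rocksdb/db.h>
      #include <rocksdb/statistics.h>

      int main() {
        rocksdb::Options options;
        options.create_if_missing = true;
        options.statistics = rocksdb::CreateDBStatistics();
        rocksdb::DB* db = nullptr;
        if (!rocksdb::DB::Open(options, "/tmp/stats_db", &db).ok()) return 1;
        delete db->NewIterator(rocksdb::ReadOptions());
        // Two increase-only tickers replace the old up/down NO_ITERATORS
        // counter; live iterators = created - deleted.
        uint64_t created =
            options.statistics->getTickerCount(rocksdb::NO_ITERATOR_CREATED);
        uint64_t deleted =
            options.statistics->getTickerCount(rocksdb::NO_ITERATOR_DELETED);
        std::cout << "live iterators: " << (created - deleted) << '\n';
        delete db;
        return 0;
      }
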
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4498
      
      Differential Revision: D10395010
      
      Pulled By: sagar0
      
      fbshipit-source-id: cfb523b22a37411c794b4e9da090f1ae30293db2