1. 26 Aug 2020, 1 commit
    • Pass SST file checksum information through OnTableFileCreated (#7108) · d51f88c9
      Zhichao Cao authored
      Summary:
      When an SST file is created, the application can learn the file information through the OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful to the application at that point, we add file_checksum and file_checksum_func_name to TableFileCreationInfo, which is passed through OnTableFileCreated.
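
      A minimal sketch of a listener that consumes the new fields (the field names come from this change; the listener class itself is illustrative):

      ```cpp
      #include <iostream>
      #include "rocksdb/listener.h"

      // Logs the checksum reported for each newly created SST file.
      class ChecksumLogger : public rocksdb::EventListener {
       public:
        void OnTableFileCreated(const rocksdb::TableFileCreationInfo& info) override {
          // file_checksum is a raw byte string; hex-encode it for real logging.
          std::cout << "SST " << info.file_path
                    << " checksum=" << info.file_checksum
                    << " (" << info.file_checksum_func_name << ")" << std::endl;
        }
      };
      // Usage: options.listeners.emplace_back(std::make_shared<ChecksumLogger>());
      ```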
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108
      
      Test Plan: make check, listener_test.
      
      Reviewed By: ajkr
      
      Differential Revision: D22470240
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
  2. 15 Aug 2020, 1 commit
    • Disable manual compaction during `ReFitLevel()` (#7250) · a1aa3f83
      Andrew Kryczka authored
      Summary:
      Manual compaction with `CompactRangeOptions::change_levels` set could
      refit to a level targeted by another manual compaction. If
      force_consistency_checks was disabled, it was possible for
      overlapping files to be written at that target level.
      
      This PR prevents the possibility by calling `DisableManualCompaction()`
      prior to `ReFitLevel()`. It also improves the manual compaction disabling
      mechanism to wait for pending manual compactions to complete before
      returning, and support disabling from multiple threads.
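
      A hedged sketch of the kind of manual compaction that goes through `ReFitLevel()` (the option and method names are RocksDB's; the target level and helper function are illustrative):

      ```cpp
      #include "rocksdb/db.h"

      rocksdb::Status RefitToLevel(rocksdb::DB* db) {
        rocksdb::CompactRangeOptions cro;
        cro.change_levels = true;  // output is moved to target_level via ReFitLevel()
        cro.target_level = 4;
        // With this fix, other manual compactions are disabled while the refit runs.
        return db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
      }
      ```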
      
      Fixes https://github.com/facebook/rocksdb/issues/6432.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250
      
      Test Plan:
      crash test command that repro'd the bug reliably:
      
      ```
      $ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D23090800
      
      Pulled By: ajkr
      
      fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
  3. 13 Aug 2020, 1 commit
    • Store FileSystemPtr object that contains FileSystem ptr (#7180) · 1f9f630b
      Akanksha Mahajan authored
      Summary:
      As part of the IOTracing project, this PR
          1. Caches a "FileSystemPtr" object (a wrapper class that returns the appropriate file system pointer depending on whether tracing is enabled) instead of a "FileSystem" pointer.
          2. Creates the FileSystemPtr object from the FileSystem pointer and the IOTracer pointer.
          3. Creates the IOTracer shared_ptr in DBImpl and passes it to other classes through their constructors.
          4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr returns the FileSystemTracingWrapper pointer for tracing; when tracing is disabled, the underlying FileSystem pointer is returned (see the sketch below).
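
      A hedged sketch of turning IO tracing on and off around a workload. DB::StartIOTrace and EndIOTrace are the entry points named above, but their exact signatures changed across releases (an earlier variant of StartIOTrace also took an Env*), so treat the parameter lists here as assumptions:

      ```cpp
      #include <memory>
      #include <utility>
      #include "rocksdb/db.h"
      #include "rocksdb/env.h"
      #include "rocksdb/trace_reader_writer.h"

      void TraceIO(rocksdb::DB* db) {
        std::unique_ptr<rocksdb::TraceWriter> writer;
        rocksdb::Status s = rocksdb::NewFileTraceWriter(
            rocksdb::Env::Default(), rocksdb::EnvOptions(), "/tmp/io_trace", &writer);
        if (!s.ok()) {
          return;
        }
        s = db->StartIOTrace(rocksdb::TraceOptions(), std::move(writer));
        // ... run the workload; FileSystemPtr now hands out FileSystemTracingWrapper ...
        db->EndIOTrace();
      }
      ```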
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180
      
      Test Plan:
      make check -j64
                      COMPILE_WITH_TSAN=1 make check -j64
      
      Reviewed By: anand1976
      
      Differential Revision: D22987117
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
  4. 07 Aug 2020, 1 commit
  5. 04 Aug 2020, 1 commit
    • dedup ReadOptions in iterator hierarchy (#7210) · a4a4a2da
      Andrew Kryczka authored
      Summary:
      Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator`
      and every `LevelIterator`. This redundancy consumes extra memory, causes the
      `Arena` to make more allocations, and hurts iteration's cache performance.
      
      This PR migrates callers of `NewInternalIterator()` and
      `MakeInputIterator()` to provide a `ReadOptions` object guaranteed to
      outlive the returned iterator. When the iterator's lifetime will be managed by the
      user, this lifetime guarantee is achieved by storing the `ReadOptions`
      value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and
      `MakeInputIterator()` can hold a reference-to-const `ReadOptions`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210
      
      Test Plan:
      - `make check` under ASAN and valgrind
      - benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes.
      
      Reviewed By: anand1976
      
      Differential Revision: D22861323
      
      Pulled By: ajkr
      
      fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce
  6. 30 Jul 2020, 1 commit
    • Compaction Read/Write Stats by Compaction Type (#7165) · 56ed601d
      Aaron Kabcenell authored
      Summary:
      Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.
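
      A hedged sketch of surfacing these counters from an application. Enabling statistics is standard RocksDB; exactly which ticker names the new byte counts appear under is taken from this change's description, so check your release's statistics.h rather than relying on names here:

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"

      rocksdb::Options MakeStatsOptions() {
        rocksdb::Options options;
        options.statistics = rocksdb::CreateDBStatistics();
        options.periodic_compaction_seconds = 24 * 60 * 60;  // illustrative
        return options;
      }
      // After running a workload, options.statistics->ToString() dumps all tickers,
      // including the per-reason compaction read/write byte counters.
      ```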
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165
      
      Test Plan:
      TTL and periodic compaction can be checked by running db_bench with the options activated:

      ./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
      ./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1
      
      Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.
      
      Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking
      
      Reviewed By: ajkr
      
      Differential Revision: D22693050
      
      Pulled By: akabcenell
      
      fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
  7. 25 Jul 2020, 1 commit
    • SST Partitioner interface that allows to split SST files (#6957) · cd4592c2
      Tomas Kolda authored
      Summary:
      An SST Partitioner interface that allows splitting SST files during compactions.

      It basically instructs compaction to create a new output file when needed. When you use well-defined prefixes and a prefix-based way of defining tables, it is good to also define partitioning so that promoting an SST file does not cover a huge key space on the next level (in the worst case, the complete key space).
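
      A hedged sketch of wiring a partitioner into the options. The fixed-prefix factory shown here is the built-in helper assumed to ship alongside this interface; a custom `SstPartitionerFactory` is plugged in the same way:

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/sst_partitioner.h"

      rocksdb::Options MakePartitionedOptions() {
        rocksdb::Options options;
        // Cut SST files at 4-byte key-prefix boundaries so a file promoted to the
        // next level never spans more than one prefix.
        options.sst_partitioner_factory =
            rocksdb::NewSstPartitionerFixedPrefixFactory(4);
        return options;
      }
      ```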
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957
      
      Reviewed By: ajkr
      
      Differential Revision: D22461239
      
      fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
  8. 23 Jul 2020, 1 commit
  9. 16 Jul 2020, 1 commit
    • Auto resume the DB from Retryable IO Error (#6765) · a10f12ed
      Zhichao Cao authored
      Summary:
      In the current codebase, if a retryable IO error happens in the write path, SetBGError is called: the retryable IO error is converted to a hard error, the DB enters read-only mode, and the user or application needs to resume it. With this PR, if a retryable IO error happens in a DB, SetBGError creates a new thread to call Resume (auto resume). options.max_bgerror_resume_count controls whether auto resume is enabled (if max_bgerror_resume_count <= 0, auto resume is disabled). options.bgerror_resume_retry_interval controls the time interval before Resume is called again if the previous resume fails due to another retryable IO error. If a non-retryable error happens during resume, auto resume terminates.
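
      A hedged sketch of the two knobs described above (the option names come from this change; the values are illustrative):

      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeAutoResumeOptions() {
        rocksdb::Options options;
        options.max_bgerror_resume_count = 10;            // <= 0 disables auto resume
        options.bgerror_resume_retry_interval = 1000000;  // microseconds between retries
        return options;
      }
      ```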
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
      
      Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
      
      Reviewed By: anand1976
      
      Differential Revision: D21916789
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
  10. 15 Jul 2020, 1 commit
    • Report corrupted keys during compaction (#7124) · 27735dea
      Yanqin Jin authored
      Summary:
      Currently, RocksDB lets compaction go through even in the case of
      corrupted keys, the number of which is reported in CompactionJobStats.
      However, RocksDB does not check this value. We should let compaction run
      in a stricter mode.
      
      Temporarily disable two tests that allow corrupted keys in compaction.
      With this PR, the two tests will assert(false) and terminate. We still need
      to investigate the recommended google-test way of doing this; death tests
      (EXPECT_DEATH) in gtest currently produce warnings.
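
      A hedged sketch of how an application can watch the corrupted-key count that compaction reports (CompactionJobStats::num_corrupt_keys, surfaced through the compaction-completed listener; the listener class itself is illustrative):

      ```cpp
      #include <iostream>
      #include "rocksdb/listener.h"

      class CorruptKeyWatcher : public rocksdb::EventListener {
       public:
        void OnCompactionCompleted(rocksdb::DB* /*db*/,
                                   const rocksdb::CompactionJobInfo& info) override {
          if (info.stats.num_corrupt_keys > 0) {
            std::cerr << "compaction job " << info.job_id << " saw "
                      << info.stats.num_corrupt_keys << " corrupt keys" << std::endl;
          }
        }
      };
      ```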
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D22530722
      
      Pulled By: riversand963
      
      fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6
  11. 27 Jun 2020, 1 commit
    • Fix data race to VersionSet::io_status_ (#7034) · d47c8711
      Yanqin Jin authored
      Summary:
      After https://github.com/facebook/rocksdb/issues/6949, VersionSet::io_status_ can be concurrently accessed by multiple
      threads without a lock, causing the TSAN test to fail. For example, a bg flush thread
      resets io_status_ before calling LogAndApply(), while another thread already in
      the process of LogAndApply() reads io_status_. This is a bug.
      
      We do not have to reset io_status_ each time we call LogAndApply(). io_status_
      is part of the state of VersionSet, and it indicates the outcome of preceding
      MANIFEST/CURRENT file IO operations. Its value should be updated only when:

      1. MANIFEST/CURRENT file IO fails for the first time.
      2. MANIFEST/CURRENT file IO succeeds as part of recovering from a prior
         failure without process restart, e.g. by calling Resume().
      
      Test Plan (devserver):
      COMPILE_WITH_TSAN=1 make check
      COMPILE_WITH_TSAN=1 make db_test2
      ./db_test2 --gtest_filter=DBTest2.CompactionStall
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7034
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D22247137
      
      Pulled By: riversand963
      
      fbshipit-source-id: 77b83e05390f3ee3cd2d96d3fdd6fe4f225e3216
  12. 25 Jun 2020, 1 commit
    • First step towards handling MANIFEST write error (#6949) · e66199d8
      Yanqin Jin authored
      Summary:
      This PR provides preliminary support for handling IO error during MANIFEST write.
      File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) are actually persistent in the MANIFEST, then the next recovery attempt will process the version edit(s) and then fail since the SST files have already been deleted.
      One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
      If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
      Note that multiple threads can call `LogAndApply()` at the same time, though only one of them actually performs the MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST-writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()`, in which file deletions are disabled.
      
      Possible future directions:
      - Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
      
      Test plan (dev server):
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
      
      Reviewed By: anand1976
      
      Differential Revision: D22026020
      
      Pulled By: riversand963
      
      fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
  13. 18 Jun 2020, 1 commit
    • Store DB identity and DB session ID in SST files (#6983) · 94d04529
      Zitan Chen authored
      Summary:
      `db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.
      
      The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.
      
      In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.
      
      A table property test is added.
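
      A hedged sketch of reading the new identifiers back, both from the DB handle and from table properties (assuming an open rocksdb::DB* db; the helper function is illustrative):

      ```cpp
      #include <iostream>
      #include "rocksdb/db.h"
      #include "rocksdb/table_properties.h"

      void DumpIds(rocksdb::DB* db) {
        std::string db_id, session_id;
        db->GetDbIdentity(db_id);
        db->GetDbSessionId(session_id);
        std::cout << "db_id=" << db_id << " db_session_id=" << session_id << std::endl;

        rocksdb::TablePropertiesCollection props;
        db->GetPropertiesOfAllTables(&props);
        for (const auto& p : props) {
          std::cout << p.first << " created by db_session_id="
                    << p.second->db_session_id << std::endl;
        }
      }
      ```
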
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983
      
      Test Plan: make check and some manual tests.
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D22048826
      
      Pulled By: gg814
      
      fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
  14. 10 Jun 2020, 1 commit
  15. 03 Jun 2020, 1 commit
    • Fix potential overflow of unsigned type in for loop (#6902) · 2adb7e37
      Zhichao Cao authored
      Summary:
      x.size() - 1 or y - 1 can overflow to an extremely large value when x.size() or y is 0, because they are of unsigned type. The end condition for i in the for loop then becomes extremely large, potentially causing a segmentation fault. Fix them.
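
      A hedged illustration of this bug class (not the exact RocksDB code):

      ```cpp
      #include <cstddef>
      #include <vector>

      void Example(const std::vector<int>& x) {
        // BUG: if x is empty, x.size() - 1 wraps around to SIZE_MAX because
        // size() is unsigned, so the loop would run (and index) far out of bounds:
        //   for (size_t i = 0; i <= x.size() - 1; ++i) { ... }

        // FIX: keep the comparison in a form that cannot underflow.
        for (size_t i = 0; i < x.size(); ++i) {
          (void)x[i];  // safe even when x is empty: the loop body never runs
        }
      }
      ```
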
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902
      
      Test Plan: pass make asan_check
      
      Reviewed By: ajkr
      
      Differential Revision: D21843767
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1
  16. 16 Apr 2020, 1 commit
    • Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Mike Kolupaev authored
      Summary:
      Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
      
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
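
      A hedged sketch of opting into this index type (the enum value is RocksDB's; the helper function is illustrative). `allow_unprepared_value`/`PrepareValue()` are internal-iterator details and need nothing extra from the user:

      ```cpp
      #include "rocksdb/options.h"
      #include "rocksdb/table.h"

      rocksdb::Options MakeFirstKeyIndexOptions() {
        rocksdb::BlockBasedTableOptions table_options;
        table_options.index_type =
            rocksdb::BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey;
        rocksdb::Options options;
        options.table_factory.reset(
            rocksdb::NewBlockBasedTableFactory(table_options));
        return options;
      }
      ```
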
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      
      Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
      
      Reviewed By: siying
      
      Differential Revision: D20786930
      
      Pulled By: al13n321
      
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
  17. 02 Apr 2020, 1 commit
    • Add pipelined & parallel compression optimization (#6262) · 03a781a9
      Ziyue Yang authored
      Summary:
      This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
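
      A hedged sketch of enabling it (the option name comes from this change; the compression type and thread count are illustrative):

      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeParallelCompressionOptions() {
        rocksdb::Options options;
        options.compression = rocksdb::kZSTD;
        options.compression_opts.parallel_threads = 4;  // > 1 enables the pipeline
        return options;
      }
      ```
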
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
      
      Reviewed By: ajkr
      
      Differential Revision: D20651306
      
      fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
  18. 30 Mar 2020, 1 commit
    • Use FileChecksumGenFactory for SST file checksum (#6600) · e8d332d9
      Zhichao Cao authored
      Summary:
      In the current implementation, the SST file checksum is calculated by a shared checksum function object, which makes some checksum methods, such as SHA1, hard to apply here. With this implementation, each SST file has its own checksum generator object, created by FileChecksumGenFactory. Users need to implement their own FileChecksumGenerator and factory to plug in their checksum calculation method.
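
      A hedged sketch of plugging in a checksum generator factory. The built-in crc32c accessor shown here is assumed to be available alongside this interface; a custom FileChecksumGenFactory is set the same way:

      ```cpp
      #include "rocksdb/file_checksum.h"
      #include "rocksdb/options.h"

      rocksdb::Options MakeChecksumOptions() {
        rocksdb::Options options;
        options.file_checksum_gen_factory =
            rocksdb::GetFileChecksumGenCrc32cFactory();
        return options;
      }
      ```
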
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
      
      Test Plan: tested with make asan_check
      
      Reviewed By: riversand963
      
      Differential Revision: D20717670
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
  19. 28 Mar 2020, 1 commit
    • Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487) · 42468881
      Zhichao Cao authored
      Summary:
      In the current code base, we use Status to get and store the status returned from a call. For IO-related functions specifically, Status cannot reflect the IO error details such as the error scope, the retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is discarded at the lower levels of the write path and converted to Status.

      The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and compaction). The second job is to identify a retryable IO error as a HardError and set bg_error_ accordingly. In this case, the DB instance becomes read-only; the user is informed of the Status and needs to take action to deal with it (e.g., call db->Resume()).
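
      A hedged sketch of what the application side looks like once a retryable IO error has been promoted to a hard error (assuming an open rocksdb::DB* db; Status::severity() and DB::Resume() are existing APIs, while the helper function and keys are illustrative):

      ```cpp
      #include "rocksdb/db.h"

      rocksdb::Status PutWithRecovery(rocksdb::DB* db) {
        rocksdb::Status s = db->Put(rocksdb::WriteOptions(), "key", "value");
        if (s.severity() >= rocksdb::Status::Severity::kHardError) {
          // The DB is read-only until the background error is cleared.
          rocksdb::Status r = db->Resume();
          if (r.ok()) {
            s = db->Put(rocksdb::WriteOptions(), "key", "value");
          }
        }
        return s;
      }
      ```
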
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
      
      Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
      
      Reviewed By: anand1976
      
      Differential Revision: D20685017
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
  20. 03 Mar 2020, 1 commit
  21. 21 Feb 2020, 1 commit
    • Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong authored
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
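
      A hedged sketch of how application code looks when it uses the macro rather than the hard-coded namespace (how the flag is injected into the build is an assumption; passing -DROCKSDB_NAMESPACE=... in the compile flags is one option):

      ```cpp
      // Compile RocksDB and the application with e.g. -DROCKSDB_NAMESPACE=my_rocksdb.
      // With no override, the macro defaults to "rocksdb".
      #include "rocksdb/db.h"

      ROCKSDB_NAMESPACE::Status OpenDb(ROCKSDB_NAMESPACE::DB** db) {
        ROCKSDB_NAMESPACE::Options options;
        options.create_if_missing = true;
        return ROCKSDB_NAMESPACE::DB::Open(options, "/tmp/testdb", db);
      }
      ```
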
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  22. 11 Feb 2020, 1 commit
    • Checksum for each SST file and stores in MANIFEST (#6216) · 4369f2c7
      Zhichao Cao authored
      Summary:
      In the current code base, RocksDB generates a checksum for each block and verifies it at usage. This PR enables SST file checksums. After an SST file is generated by flush or compaction, RocksDB generates the SST file checksum and stores the checksum value and checksum method name in the vs_info and MANIFEST as part of the FileMetadata.

      Added enable_sst_file_checksum to Options to enable or disable file checksums. Added sst_file_checksum to Options so that users can plug in their own SST file checksum calculation method by overriding the SstFileChecksum class. The checksum information includes a uint32_t checksum value and a checksum name (string). A new tool is added to LDB so that users can dump out a list of file checksum information from the MANIFEST. If the user enables file checksums but does not provide an sst_file_checksum instance, RocksDB uses the default crc32 checksum implemented in table/sst_file_checksum_crc32c.h
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
      
      Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
      
      Differential Revision: D19171461
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
  23. 24 Jan 2020, 1 commit
    • Fix the "records dropped" statistics (#6325) · f34782a6
      Levi Tamasi authored
      Summary:
      The earlier code used two conflicting definitions for the number of
      input records going into a compaction, one based on the
      `rocksdb.num.entries` table property and one based on
      `CompactionIterationStats`. The first one is correct and in line
      with how output records are counted, while the second one incorrectly
      ignores input records in various cases when the `CompactionIterator`
      advances or reseeks the input iterator (this can happen, amongst other
      cases, when dealing with `SingleDelete`s, regular `Delete`s, `Merge`s,
      and compaction filters). This can result in the code undercounting the
      input records and computing an incorrect value for "records dropped"
      during the compaction. The patch fixes this by switching over to the
      correct (table property based) input record count for "records dropped".
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6325
      
      Test Plan: Tested using `make check` and `db_bench`.
      
      Differential Revision: D19525491
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 4340b0b2f41546db8e356db70ca02199e48fa636
  24. 14 Dec 2019, 1 commit
    • Introduce a new storage specific Env API (#5761) · afa2420c
      anand76 authored
      Summary:
      The current Env API encompasses both storage/file operations and OS-related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether it is retryable or not, the scope (i.e. fault domain) of the error, etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy, etc.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
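
      A hedged sketch of plugging a FileSystem in through a composite Env. NewCompositeEnv and the exact ownership rules are assumptions based on later releases; a custom FileSystem implementation would be passed in place of FileSystem::Default():

      ```cpp
      #include <memory>
      #include "rocksdb/env.h"
      #include "rocksdb/file_system.h"
      #include "rocksdb/options.h"

      std::unique_ptr<rocksdb::Env> SetUpEnv(rocksdb::Options* options) {
        std::shared_ptr<rocksdb::FileSystem> fs = rocksdb::FileSystem::Default();
        std::unique_ptr<rocksdb::Env> env = rocksdb::NewCompositeEnv(fs);
        options->env = env.get();  // the returned Env must outlive the DB
        return env;
      }
      ```
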
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
  25. 27 Nov 2019, 1 commit
    • Support options.max_open_files != -1 with periodic_compaction_seconds (#6090) · aa1857e2
      sdong authored
      Summary:
      options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. This is because the file creation time is stored in table properties, which are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in the MANIFEST.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090
      
      Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.
      
      Differential Revision: D18702268
      
      fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
  26. 23 Nov 2019, 1 commit
    • Support options.ttl with options.max_open_files != -1 (#6060) · d8c28e69
      sdong authored
      Summary:
      Previously, options.ttl could not be set with options.max_open_files != -1, because it makes use of the creation_time field in table properties, which is not guaranteed to be available unless max_open_files = -1. With this commit, the information is stored in the MANIFEST and, when available, is used instead.

      Note that this change will break forward compatibility for release 5.1 and older.
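
      A hedged sketch of the combination this change enables (the values are illustrative):

      ```cpp
      #include "rocksdb/options.h"

      rocksdb::Options MakeTtlOptions() {
        rocksdb::Options options;
        options.max_open_files = 1000;    // no longer needs to be -1
        options.ttl = 30 * 24 * 60 * 60;  // 30 days, in seconds
        return options;
      }
      ```
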
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060
      
      Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatibility in one test case by forcing the value to be 0.
      
      Differential Revision: D18631623
      
      fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
  27. 12 Nov 2019, 1 commit
    • Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992) · c17384fe
      Sagar Vemuri authored
      Summary:
      When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for newly created files during compaction. The creation time of new files was set to the maximum of all compaction-input-file creation times, which essentially reset the ttl as the key range moved across levels. This behavior is now fixed by basing `creation_time` on the minimum of all compaction-input-file creation times; this causes cascading compactions across levels for the ttl-expired data to move to the bottom level, getting rid of tombstones/deleted data faster.
      
      This will help start cascading compactions to move the expired key range to the bottom-most level faster.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992
      
      Test Plan: `make check`
      
      Differential Revision: D18257883
      
      Pulled By: sagar0
      
      fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc
  28. 31 Oct 2019, 1 commit
    • Turn compaction asserts to runtime check (#5935) · dccaf9f0
      Maysam Yabandeh authored
      Summary:
      Compaction iterator has many assert statements that are active only during test runs. Some rare bugs would show up only at runtime and could violate the assert condition but go unnoticed, since assert statements are not compiled in release mode. Turning the assert statements into runtime checks has some pros and cons:
      Pros:
      - A bug that would result in incorrect data would be detected early, before the incorrect data is written to disk.

      Cons:
      - Runtime overhead, which should be negligible since compaction CPU is a minority of the overall CPU usage.
      - The assert statements might already be violated at runtime, and turning them into runtime failures might result in reliability issues.

      The patch takes a conservative step in this direction by logging the assert violations at runtime. If we see any violations reported in the logs, we investigate. Otherwise, we can go ahead and turn them into runtime errors.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5935
      
      Differential Revision: D18229697
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: f1890eca80ccd7cca29737f1825badb9aa8038a8
  29. 15 Oct 2019, 1 commit
    • BlobDB GC: add SST <-> oldest blob file referenced mapping (#5903) · 5f025ea8
      Levi Tamasi authored
      Summary:
      This is groundwork for adding garbage collection support to BlobDB. The
      patch adds logic that keeps track of the oldest blob file referred to by
      each SST file. The oldest blob file is identified during flush/
      compaction (similarly to how the range of keys covered by the SST is
      identified), and persisted in the manifest as a custom field of the new
      file edit record. Blob indexes with TTL are ignored for the purposes of
      identifying the oldest blob file (since such blob files are cleaned up by the
      TTL logic in BlobDB).
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903
      
      Test Plan:
      Added new unit tests; also ran db_bench in BlobDB mode, inspected the
      manifest using ldb, and confirmed (by scanning the SST files using
      sst_dump) that the value of the oldest blob file number field matches
      the contents of the file for each SST.
      
      Differential Revision: D17859997
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90
  30. 21 Sep 2019, 1 commit
  31. 20 Sep 2019, 1 commit
  32. 19 Sep 2019, 1 commit
  33. 17 Sep 2019, 2 commits
    • Allow users to stop manual compactions (#3971) · 62268300
      andrew authored
      Summary:
      Manual compaction may bring very high load because sometimes the amount of data involved in a compaction can be large, which may affect online service. So it would be good if a running compaction that is making the server busy could be stopped immediately. In this implementation, the stop-manual-compaction condition is only checked in the slow path; deletion compactions and trivial moves are let through.
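
      A hedged sketch of stopping a long-running manual compaction from another thread (assuming an open rocksdb::DB* db; the cancelled CompactRange typically comes back with a non-OK status):

      ```cpp
      #include <thread>
      #include "rocksdb/db.h"

      void CompactThenCancel(rocksdb::DB* db) {
        std::thread compactor([db] {
          rocksdb::CompactRangeOptions cro;
          db->CompactRange(cro, nullptr, nullptr);  // may be stopped mid-way
        });
        // ... decide the server has become too busy ...
        db->DisableManualCompaction();  // stops running and queued manual compactions
        compactor.join();
        db->EnableManualCompaction();   // allow manual compactions again later
      }
      ```
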
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/3971
      
      Test Plan: add tests at more spots.
      
      Differential Revision: D17369043
      
      fbshipit-source-id: 575a624fb992ce0bb07d9443eb209e547740043c
    • Divide file_reader_writer.h and .cc (#5803) · b931f84e
      sdong authored
      Summary:
      file_reader_writer.h and .cc contain several classes and helper functions, and they are hard to navigate. Separate them into multiple files and put them under file/.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803
      
      Test Plan: Build whole project using make and cmake.
      
      Differential Revision: D17374550
      
      fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
  34. 31 Jul 2019, 1 commit
    • Improve CPU Efficiency of ApproximateSize (part 2) (#5609) · 4834dab5
      Eli Pozniansky authored
      Summary:
      In some cases, we don't have to get a really accurate number; something like 10% off is fine, so we can create a new option for that use case. In that case, we can calculate the size of fully covered files first, and avoid estimation inside SST files if the full files already give us a huge number. For example, if we have already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB each.
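
      A hedged sketch of the relaxed-accuracy query. SizeApproximationOptions and its files_size_error_margin field are taken to be the new knob this change refers to; the key range, margin, and helper function are illustrative:

      ```cpp
      #include "rocksdb/db.h"

      uint64_t RoughRangeSize(rocksdb::DB* db) {
        rocksdb::SizeApproximationOptions sao;
        sao.include_files = true;
        sao.files_size_error_margin = 0.1;  // accept ~10% error to skip partial file dives
        rocksdb::Range range("a", "z");
        uint64_t size = 0;
        db->GetApproximateSizes(sao, db->DefaultColumnFamily(), &range, 1, &size);
        return size;
      }
      ```
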
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609
      
      Differential Revision: D16433481
      
      Pulled By: elipoz
      
      fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
  35. 21 Jun 2019, 1 commit
    • Add more callers for table reader. (#5454) · 705b8eec
      haoyuhuang authored
      Summary:
      This PR adds more callers for table readers. This information is only used for block cache analysis so that we can know which caller accesses a block.
      1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
      2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.
      
      This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454
      
      Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.
      
      Differential Revision: D15819451
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
  36. 11 Jun 2019, 1 commit
    • Create a BlockCacheLookupContext to enable fine-grained block cache tracing. (#5421) · 5efa0d6b
      haoyuhuang authored
      Summary:
      BlockCacheLookupContext only contains the caller for now.
      We will trace block accesses at five places:
      1. BlockBasedTable::GetFilter.
      2. BlockBasedTable::GetUncompressedDict.
      3. BlockBasedTable::MaybeReadAndLoadToCache. (To trace access on data, index, and range deletion block.)
      4. BlockBasedTable::Get. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
      5. BlockBasedTable::MultiGet. (To trace the referenced key and whether the referenced key exists in a fetched data block.)
      
      We create the context at:
      1. BlockBasedTable::Get. (kUserGet)
      2. BlockBasedTable::MultiGet. (kUserMGet)
      3. BlockBasedTable::NewIterator. (either kUserIterator, kCompaction, or external SST ingestion calls this function.)
      4. BlockBasedTable::Open. (kPrefetch)
      5. Index/Filter::CacheDependencies. (kPrefetch)
      6. BlockBasedTable::ApproximateOffsetOf. (kCompaction or kUserApproximateSize).
      
      I loaded 1 million key-value pairs into the database and ran the readrandom benchmark with a single thread. I gave the block cache 10 GB to make sure all reads hit the block cache after warmup. The throughput is comparable.
      Throughput of this PR: 231334 ops/s.
      Throughput of the master branch: 238428 ops/s.
      
      Experiment setup:
      RocksDB:    version 6.2
      Date:       Mon Jun 10 10:42:51 2019
      CPU:        24 * Intel Core Processor (Skylake)
      CPUCache:   16384 KB
      Keys:       20 bytes each
      Values:     100 bytes each (100 bytes after compression)
      Entries:    1000000
      Prefix:    20 bytes
      Keys per prefix:    0
      RawSize:    114.4 MB (estimated)
      FileSize:   114.4 MB (estimated)
      Write rate: 0 bytes/second
      Read rate: 0 ops/second
      Compression: NoCompression
      Compression sampling rate: 0
      Memtablerep: skip_list
      Perf Level: 1
      
      Load command: ./db_bench --benchmarks="fillseq" --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000
      
      Run command: ./db_bench --benchmarks="readrandom,stats" --use_existing_db --threads=1 --duration=120 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --statistics --cache_index_and_filter_blocks --cache_size=10737418240 --disable_auto_compactions=1 --disable_wal=1 --compression_type=none --min_level_to_compress=-1 --compression_ratio=1 --num=1000000 --duration=120
      
      TODOs:
      1. Create a caller for external SST file ingestion and differentiate the callers for iterator.
      2. Integrate tracer to trace block cache accesses.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5421
      
      Differential Revision: D15704258
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: 4aa8a55f8cb1576ffb367bfa3186a91d8f06d93a
  37. 07 Jun 2019, 1 commit
  38. 01 Jun 2019, 2 commits