1. 23 Jul, 2020 1 commit
  2. 30 Jun, 2020 1 commit
    • A
      Extend Get/MultiGet deadline support to table open (#6982) · 9a5886bd
      Anand Ananthabhotla committed
      Summary:
      Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.
      
      The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.
      
      Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982
      
      Test Plan: Add new unit tests in db_basic_test
      
      Reviewed By: riversand963
      
      Differential Revision: D22219515
      
      Pulled By: anand1976
      
      fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
  3. 01 May, 2020 1 commit
    • A
      Pass a timeout to FileSystem for random reads (#6751) · ab13d43e
      anand76 committed
      Summary:
      Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support them, check the deadline in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if it is exceeded.
      
      For now, TableReader creation, which might do file opens and reads, is not covered. That will be implemented in another PR.
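The deadline-to-timeout calculation described in this summary can be pictured with a small sketch. It uses only `std::chrono`; the helper name `RemainingTimeout`, the `Micros` alias, and the convention that zero means "no deadline" are assumptions made for illustration, not the code from the PR.

```cpp
#include <cassert>
#include <chrono>

using Micros = std::chrono::microseconds;

// Hypothetical helper, not a RocksDB function: derive a per-read timeout
// from an absolute deadline. Both values are measured on the same clock.
// Assumption of this sketch: a zero deadline means "no deadline", and a
// zero timeout means "no timeout".
// Returns false if the deadline has already passed, in which case a caller
// would report a timed-out status instead of issuing the read.
inline bool RemainingTimeout(Micros deadline, Micros now, Micros* timeout) {
  assert(timeout != nullptr);
  if (deadline == Micros::zero()) {
    *timeout = Micros::zero();
    return true;
  }
  if (now >= deadline) {
    return false;
  }
  *timeout = deadline - now;  // remaining budget for this read
  return true;
}
```

A reader wrapper could call such a helper before every read, so the deadline is still enforced when the underlying file system ignores the timeout.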
      
      Tests:
      Update existing unit tests to verify the correct timeout value is being passed
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
      
      Reviewed By: riversand963
      
      Differential Revision: D21285631
      
      Pulled By: anand1976
      
      fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
  4. 16 Apr, 2020 1 commit
    • M
      Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621) · e45673de
      Mike Kolupaev committed
      Summary:
      Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
      
      Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
      
      It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
      
      Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
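The opt-in pattern described in this summary (call `PrepareValue()` before `value()` when `allow_unprepared_value` is set) might look roughly like the sketch below. The class `DeferredValueIterSketch` and its simulated IO failure are hypothetical stand-ins; only the two names taken from the commit message come from the source.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical miniature iterator illustrating deferred value loading.
// It is not the RocksDB InternalIterator; the IO error is simulated.
class DeferredValueIterSketch {
 public:
  DeferredValueIterSketch(std::string key, std::string stored_value,
                          bool load_fails, bool allow_unprepared_value)
      : key_(std::move(key)),
        stored_value_(std::move(stored_value)),
        load_fails_(load_fails) {
    // Without the opt-in, the value is loaded eagerly, as before.
    if (!allow_unprepared_value) PrepareValue();
  }

  const std::string& key() const { return key_; }

  // Loads the value if it is not loaded yet. Returns false on a
  // (simulated) IO error, giving callers a place to observe the failure.
  bool PrepareValue() {
    if (prepared_) return true;
    if (load_fails_) {
      failed_ = true;
      return false;
    }
    value_ = stored_value_;
    prepared_ = true;
    return true;
  }

  // Only valid after a successful PrepareValue().
  const std::string& value() const {
    assert(prepared_);
    return value_;
  }

  bool failed() const { return failed_; }

 private:
  std::string key_;
  std::string stored_value_;
  bool load_fails_;
  bool prepared_ = false;
  bool failed_ = false;
  std::string value_;
};
```

Call sites that never opt in keep calling `value()` directly, which is the point of making the deferred loading optional.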
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
      
      Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
      
      Reviewed By: siying
      
      Differential Revision: D20786930
      
      Pulled By: al13n321
      
      fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
  5. 07 Mar, 2020 1 commit
    • C
      Remove memcpy from RandomAccessFileReader::Read in direct IO mode (#6455) · 0a0151fb
      Cheng Chang committed
      Summary:
      In direct IO mode, RandomAccessFileReader::Read allocates an internal aligned buffer, and then copies the result into the scratch buffer. If the result is only used temporarily inside a function, there is no need to do the memcpy; the result Slice can simply refer to the internally allocated buffer.
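A rough sketch of the idea, under stated assumptions: `ViewSketch` stands in for a Slice, `DirectReaderSketch` for the reader, and an in-memory vector for the file. The convention that a null `scratch` means "return a view into the internal buffer" is chosen for this illustration and is not necessarily the signature used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical view type standing in for Slice.
struct ViewSketch {
  const char* data;
  std::size_t size;
};

// Hypothetical reader: `file_` plays the role of the on-disk bytes and
// `internal_buf_` the aligned buffer that a direct-IO read would fill.
class DirectReaderSketch {
 public:
  explicit DirectReaderSketch(std::vector<char> file)
      : file_(std::move(file)) {}

  // If `scratch` is non-null, the bytes are copied into it (the old path).
  // If `scratch` is null, the result refers to the internal buffer and
  // stays valid only until the next Read() on this object.
  ViewSketch Read(std::size_t offset, std::size_t n, char* scratch) {
    assert(offset <= file_.size());
    if (n > file_.size() - offset) n = file_.size() - offset;
    const auto first = file_.begin() + static_cast<std::ptrdiff_t>(offset);
    internal_buf_.assign(first, first + static_cast<std::ptrdiff_t>(n));
    if (scratch == nullptr) {
      return ViewSketch{internal_buf_.data(), n};  // no memcpy
    }
    if (n > 0) std::memcpy(scratch, internal_buf_.data(), n);
    return ViewSketch{scratch, n};
  }

 private:
  std::vector<char> file_;
  std::vector<char> internal_buf_;
};
```

The trade-off is lifetime: the copy-free result is only safe for callers that finish with it before the buffer is reused.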
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6455
      
      Test Plan: make check
      
      Differential Revision: D20106753
      
      Pulled By: cheng-chang
      
      fbshipit-source-id: 44f505843837bba47a56e3fa2c4dd3bd76486b58
  6. 21 Feb, 2020 1 commit
    • S
      Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong committed
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a flag which can be overridden at build time.
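The general technique of a build-time overridable namespace can be sketched as follows. The macro `MYLIB_NAMESPACE` and the function `NamespaceName` are hypothetical names for this illustration; the sketch shows the pattern, not the RocksDB headers themselves.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical macro name. A build can override it on the command line,
// for example with -DMYLIB_NAMESPACE=mylib_v2.
#ifndef MYLIB_NAMESPACE
#define MYLIB_NAMESPACE mylib
#endif

// Two-step stringification so the macro is expanded before being quoted.
#define MYLIB_STR_IMPL(x) #x
#define MYLIB_STR(x) MYLIB_STR_IMPL(x)

namespace MYLIB_NAMESPACE {
// Every symbol lives in whatever namespace the build selected, so two
// differently configured builds no longer define the same names.
inline const char* NamespaceName() { return MYLIB_STR(MYLIB_NAMESPACE); }
}  // namespace MYLIB_NAMESPACE
```

Code inside the library refers to the macro rather than a literal namespace name, so the override applies everywhere at once.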
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
  7. 14 Dec, 2019 1 commit
    • A
      Introduce a new storage specific Env API (#5761) · afa2420c
      anand76 committed
      Summary:
      The current Env API encompasses both storage/file operations and OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether it's retryable or not, the scope (i.e. fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
      
      This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
      
      The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
      
      This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
      
      The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
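The delegation and status translation described above can be sketched in miniature. All types below (`IOStatusSketch`, `StatusSketch`, `EnvSketch`, `FileSystemSketch`, `CompositeEnvSketch`) are hypothetical, heavily reduced stand-ins with one storage method and one OS method each; what happens to non-retryable errors is left at a default here because the summary does not say.

```cpp
#include <cassert>
#include <string>

// Hypothetical reduced types; the real classes carry far more state.
struct IOStatusSketch {
  bool ok;
  bool retryable;
};
enum class SeveritySketch { kNoError, kSoftError };
struct StatusSketch {
  bool ok;
  SeveritySketch severity;
};

struct FileSystemSketch {
  virtual ~FileSystemSketch() = default;
  virtual IOStatusSketch DeleteFile(const std::string& name) = 0;
};

struct EnvSketch {
  virtual ~EnvSketch() = default;
  virtual StatusSketch DeleteFile(const std::string& name) = 0;  // storage
  virtual unsigned long long NowMicros() = 0;                    // OS
};

// Forwards OS calls to `env` and storage calls to `fs`, translating the
// richer IO status into the plain status the rest of the code expects.
class CompositeEnvSketch : public EnvSketch {
 public:
  CompositeEnvSketch(EnvSketch* env, FileSystemSketch* fs)
      : env_(env), fs_(fs) {
    assert(env_ != nullptr && fs_ != nullptr);
  }

  StatusSketch DeleteFile(const std::string& name) override {
    const IOStatusSketch ios = fs_->DeleteFile(name);
    if (ios.ok) return StatusSketch{true, SeveritySketch::kNoError};
    // Retryable IO errors are marked soft so recovery can be automatic.
    // Other errors keep the default severity in this sketch.
    return StatusSketch{false, ios.retryable ? SeveritySketch::kSoftError
                                             : SeveritySketch::kNoError};
  }

  unsigned long long NowMicros() override { return env_->NowMicros(); }

 private:
  EnvSketch* env_;
  FileSystemSketch* fs_;
};
```

Because the wrapper itself is an `EnvSketch`, existing callers need no changes, which mirrors the compatibility argument made in the summary.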
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
      
      Differential Revision: D18868376
      
      Pulled By: anand1976
      
      fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
  8. 21 Sep, 2019 1 commit
  9. 19 Sep, 2019 1 commit
  10. 17 Sep, 2019 1 commit
    • S
      Divide file_reader_writer.h and .cc (#5803) · b931f84e
      sdong committed
      Summary:
      file_reader_writer.h and .cc contain several file-related classes and helper functions, and they are hard to navigate. Separate them into multiple files and put them under file/
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803
      
      Test Plan: Build whole project using make and cmake.
      
      Differential Revision: D17374550
      
      fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987
  11. 21 Jun, 2019 1 commit
    • H
      Add more callers for table reader. (#5454) · 705b8eec
      haoyuhuang committed
      Summary:
      This PR adds more callers for table readers. This information is only used for block cache analysis so that we can know which caller accesses a block.
      1. It renames the BlockCacheLookupCaller to TableReaderCaller as passing the caller from upstream requires changes to table_reader.h and TableReaderCaller is a more appropriate name.
      2. It adds more table reader callers in table/table_reader_caller.h, e.g., kCompactionRefill, kExternalSSTIngestion, and kBuildTable.
      
      This PR is long as it requires modification of interfaces in table_reader.h, e.g., NewIterator.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5454
      
      Test Plan: make clean && COMPILE_WITH_ASAN=1 make check -j32.
      
      Differential Revision: D15819451
      
      Pulled By: HaoyuHuang
      
      fbshipit-source-id: b6caa704c8fb96ddd15b9a934b7e7ea87f88092d
  12. 20 Jun, 2019 1 commit
  13. 04 May, 2019 1 commit
    • M
      Refresh snapshot list during long compactions (2nd attempt) (#5278) · 6a40ee5e
      Maysam Yabandeh committed
      Summary:
      Part of the compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list.
      For simplicity, the feature is disabled in two cases: i) when more than one sub-compaction shares the same snapshot list, ii) when Range Delete is used, in which case the range delete aggregator has its own copy of the snapshot list.
      This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
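To illustrate why refreshing the list helps, here is a small sketch of a loop that re-fetches snapshots through a callback. `CompactSketch`, `SnapshotList`, and the key-count trigger are assumptions of this example; the commit message only says the refresh happens when a compaction is taking long, so the real trigger may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using SnapshotList = std::vector<std::uint64_t>;  // snapshot sequence numbers

// Hypothetical compaction loop: processes `num_keys` keys and, every
// `refresh_every` keys, asks the callback for the current snapshot list.
// Returns the total number of snapshot entries examined, a rough proxy
// for the per-key snapshot overhead.
inline std::size_t CompactSketch(
    std::size_t num_keys, std::size_t refresh_every, SnapshotList snapshots,
    const std::function<SnapshotList()>& refresh) {
  assert(refresh_every > 0);
  std::size_t examined = 0;
  for (std::size_t i = 0; i < num_keys; ++i) {
    if (refresh && i > 0 && i % refresh_every == 0) {
      snapshots = refresh();  // released snapshots drop out of the list
    }
    examined += snapshots.size();
  }
  return examined;
}
```

Passing an empty callback reproduces the conservative behaviour of working from the initial list for the whole run.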
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
      
      Differential Revision: D15203291
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258
  14. 02 May, 2019 1 commit
  15. 26 Apr, 2019 1 commit
    • M
      Refresh snapshot list during long compactions (#5099) · 506e8448
      Maysam Yabandeh committed
      Summary:
      Part of the compaction CPU goes to processing the snapshot list; the larger the list, the bigger the overhead. Although the lifetime of most snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction continue more efficiently with a much smaller snapshot list.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
      
      Differential Revision: D15086710
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
  16. 10 Nov, 2018 1 commit
    • S
      Update all unique/shared_ptr instances to be qualified with namespace std (#4638) · dc352807
      Sagar Vemuri committed
      Summary:
      Ran the following commands to recursively change all the files under RocksDB:
      ```
      find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
      find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
      ```
      Running `make format` updated some formatting on the files touched.
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
      
      Differential Revision: D12934992
      
      Pulled By: sagar0
      
      fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
  17. 24 Aug, 2018 1 commit
  18. 26 Jun, 2018 1 commit
  19. 22 May, 2018 1 commit
    • Z
      Move prefix_extractor to MutableCFOptions · c3ebc758
      Zhongyi Xie committed
      Summary:
      Currently it is not possible to change the bloom filter config without restarting the DB, which causes a lot of operational complexity for users.
      This PR aims to make it possible to dynamically change bloom filter config.
      Closes https://github.com/facebook/rocksdb/pull/3601
      
      Differential Revision: D7253114
      
      Pulled By: miasantreble
      
      fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c
  20. 06 Apr, 2018 1 commit
    • M
      Stats for false positive rate of full filters · 67182678
      Maysam Yabandeh committed
      Summary:
      Adds two stats to allow us to measure the false positive rate of full filters:
      - The total count of positives: rocksdb.bloom.filter.full.positive
      - The total count of true positives: rocksdb.bloom.filter.full.true.positive
      Note the term "full" in the stat names, which indicates that they are meaningful only for full filters. Block-based filters are to be deprecated soon, and supporting them is not worth the additional cost of if-then-else branches.
      
      Closes #3680
      
      Tested by:
      $ ./db_bench -benchmarks=fillrandom  -db /dev/shm/rocksdb-tmpdb --num=1000000 -bloom_bits=10
      $ ./db_bench -benchmarks="readwhilewriting"  -db /dev/shm/rocksdb-tmpdb --statistics -bloom_bits=10 --duration=60 --num=2000000 --use_existing_db 2>&1 > /tmp/full.log
      $ grep filter.full /tmp/full.log
      rocksdb.bloom.filter.full.positive COUNT : 3628593
      rocksdb.bloom.filter.full.true.positive COUNT : 3536026
      which gives the false positive rate of 2.5%
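The 2.5% figure follows from the two counters as (positive - true positive) / positive, that is, the share of filter positives that turned out to be false. A small sketch of that arithmetic, with a hypothetical function name:

```cpp
#include <cassert>
#include <cstdint>

// Fraction of filter positives that were not true positives, computed
// the same way as in the test output above:
//   (positive - true_positive) / positive
inline double FullFilterFalsePositiveRate(std::uint64_t positive,
                                          std::uint64_t true_positive) {
  assert(positive > 0 && true_positive <= positive);
  return static_cast<double>(positive - true_positive) /
         static_cast<double>(positive);
}
```

With the counters from the log, 3628593 and 3536026, the difference is 92567, and 92567 / 3628593 is a little over 0.0255, which matches the quoted rate.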
      Closes https://github.com/facebook/rocksdb/pull/3681
      
      Differential Revision: D7517570
      
      Pulled By: maysamyabandeh
      
      fbshipit-source-id: 630ab1a473afdce404916d297035b6318de4c052
  21. 06 Mar, 2018 1 commit
  22. 23 Feb, 2018 2 commits
  23. 28 Jul, 2017 1 commit
  24. 22 Jul, 2017 2 commits
  25. 19 Jul, 2017 1 commit
  26. 17 Jul, 2017 1 commit
    • Y
      CodeMod: Prefer ADD_FAILURE() over EXPECT_TRUE(false), et cetera · f1a056e0
      Yedidya Feldblum committed
      Summary:
      CodeMod: Prefer `ADD_FAILURE()` over `EXPECT_TRUE(false)`, et cetera.
      
      The tautologically-conditioned and tautologically-contradicted boolean expectations/assertions have better alternatives: unconditional passes and failures.
      
      Reviewed By: Orvid
      
      Differential Revision: D5432398
      
      Tags: codemod, codemod-opensource
      
      fbshipit-source-id: d16b447e8696a6feaa94b41199f5052226ef6914
  27. 16 Jul, 2017 1 commit
  28. 06 May, 2017 1 commit
    • A
      do not read next datablock if upperbound is reached · a30a6960
      Aaron Gao committed
      Summary:
      Currently, if iterate_upper_bound is set, we continue reading until we get a key >= upper_bound. In many cases neighboring data blocks have a user key gap between them, and our index key will be a user key in the middle, chosen to be shorter. For example, if we have blocks:
      [a b c d][f g h]
      then the index key for the first block will be 'e'.
      If the upper bound is any key between 'd' and 'e', for example d1, d2, ..., d99999999999, we don't have to read the second block, and we also know the iteration is done because we have already reached the last key smaller than the upper bound.
      
      This diff can reduce RA in most cases.
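The check described in the summary can be sketched by counting how many data blocks a forward scan has to read. `IndexEntrySketch` and `BlocksRead` are hypothetical names, keys are compared bytewise, and the scan is assumed to start at the first block.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical index entry: `separator` is >= every key in its block and
// < every key in the next block (e.g. "e" for blocks [a b c d][f g h]).
struct IndexEntrySketch {
  std::string separator;
};

// Counts the data blocks a forward scan from the first block must read
// before it can stop at an exclusive upper bound. After finishing block
// i, if separator(i) >= upper_bound, every key in later blocks is
// > separator(i) >= upper_bound, so block i+1 does not need to be read.
inline std::size_t BlocksRead(const std::vector<IndexEntrySketch>& index,
                              const std::string& upper_bound) {
  std::size_t read = 0;
  for (const IndexEntrySketch& e : index) {
    ++read;
    if (e.separator >= upper_bound) break;
  }
  return read;
}
```

For the example blocks with an upper bound of `d1`, the loop stops after the first block because `e` compares greater than `d1`.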
      Closes https://github.com/facebook/rocksdb/pull/2239
      
      Differential Revision: D4990693
      
      Pulled By: lightmark
      
      fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
  29. 28 Apr, 2017 1 commit
  30. 01 Nov, 2016 1 commit
    • A
      DeleteRange flush support · 40a2e406
      Andrew Kryczka committed
      Summary:
      Changed BuildTable() (used for flush) to (1) add range
      tombstones to the aggregator, which is used by CompactionIterator to
      determine which keys can be removed; and (2) add aggregator's range
      tombstones to the table that is output for the flush.
      Closes https://github.com/facebook/rocksdb/pull/1438
      
      Differential Revision: D4100025
      
      Pulled By: ajkr
      
      fbshipit-source-id: cb01a70
  31. 21 Jul, 2016 1 commit
    • O
      Only cache level 0 indexes and filter when opening table reader · e70020e4
      omegaga committed
      Summary: In T8216281 we decided to disable prefetching the index and filter when opening table handlers during startup (max_open_files = -1).
      
      Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)
      
      Reviewers: sdong, andrewkr
      
      Reviewed By: andrewkr
      
      Subscribers: andrewkr, dhruba
      
      Differential Revision: https://reviews.facebook.net/D59913
  32. 10 Feb, 2016 1 commit
  33. 24 Dec, 2015 1 commit
    • A
      Skip bottom-level filter block caching when hit-optimized · e089db40
      Andrew Kryczka committed
      Summary:
      When Get() or NewIterator() trigger file loads, skip caching the filter block if
      (1) optimize_filters_for_hits is set and (2) the file is on the bottommost
      level. Also skip checking filters under the same conditions, which means that
      for a preloaded file or a file that was trivially-moved to the bottom level, its
      filter block will eventually expire from the cache.
      
      - added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
      - in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
      - in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
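The condition that drives "skip_filters" in the summary reduces to a two-part predicate. The function below is a hypothetical illustration of that predicate; the parameter names and the way levels are passed in are assumptions, not the signature used in version_set.

```cpp
#include <cassert>

// Hypothetical predicate for the "skip_filters" decision: filters are
// skipped only when the hit-optimized option is on AND the file sits on
// the bottommost level.
inline bool ShouldSkipFilters(bool optimize_filters_for_hits, int level,
                              int bottommost_level) {
  assert(level >= 0 && level <= bottommost_level);
  return optimize_filters_for_hits && level == bottommost_level;
}
```

The reasoning behind the option is that with mostly-hit workloads a lookup reaching the bottommost level is likely to find its key there, so the filter rarely saves a read.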
      
      Test Plan:
      updated unit test:
      
        $ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
      
      will also run 'make check'
      
      Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
      
      Reviewed By: yhchiang
      
      Subscribers: leveldb
      
      Differential Revision: https://reviews.facebook.net/D51633
  34. 29 Oct, 2015 1 commit
    • D
      Fix MockTable ID storage · 5c8f2ee7
      Dmitri Smirnov committed
        On Windows two tests fail that use MockTable:
        flush_job_test and compaction_job_test with the following message:
        compaction_job_test_je.exe : Assertion failed: result.size() == 4,
        file c:\dev\rocksdb\rocksdb\table\mock_table.cc, line 110
      
        Investigation reveals that this failure occurs when a 4 byte
        ID written to the beginning of the physically open file (main
        contents remain in an in-memory map) cannot be read back.
      
        The reason for the failure is that the ID is written directly
        to a WritableFile bypassing WritableFileWriter. The side effect of that
        is that pending_sync_ never becomes true so the file is never flushed,
        however, the direct cause of the failure is that the filesize_ member
        of the WritableFileWriter remains zero. At Close() the file is truncated
        to that size and the file becomes empty so the ID can not be read back.
  35. 14 Oct, 2015 1 commit
    • S
      Separate InternalIterator from Iterator · 35ad531b
      sdong committed
      Summary:
      Separate a new class InternalIterator from class Iterator, for use when the look-up is done internally, which also means it operates on keys with sequence ID and type.
      
      This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
      At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
      
      Test Plan: Run all existing tests.
      
      Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
      
      Reviewed By: rven
      
      Subscribers: leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D48549
  36. 10 Oct, 2015 1 commit
    • S
      Pass column family ID to table property collector · 776bd8d5
      sdong committed
      Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.
      
      Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.
      
      Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman
      
      Reviewed By: IslamAbdelRahman
      
      Subscribers: yoshinorim, leveldb, dhruba
      
      Differential Revision: https://reviews.facebook.net/D48411
  37. 12 Sep, 2015 1 commit
  38. 03 Sep, 2015 1 commit
    • A
      Unified maps with Comparator for sorting, other cleanup · 3c9cef1e
      Andres Noetzli committed
      Summary:
      This diff is a collection of cleanups that were initially part of D43179.
      Additionally it adds a unified way of defining key-value maps that use a
      Comparator for sorting (this was previously implemented in four different
      places).
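One way to define a key-value map sorted by a three-way comparator is to adapt the comparator into the "less" functor that `std::map` expects. The names below (`ComparatorSketch`, `LessOfComparatorSketch`, `KVMapSketch`, `ReverseBytewiseSketch`) are hypothetical and only illustrate the technique; they are not claimed to match the definitions added by the diff.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical minimal comparator interface with a three-way Compare.
struct ComparatorSketch {
  virtual ~ComparatorSketch() = default;
  virtual int Compare(const std::string& a, const std::string& b) const = 0;
};

// Adapter turning a three-way comparator into the strict "less than"
// ordering that std::map requires.
struct LessOfComparatorSketch {
  explicit LessOfComparatorSketch(const ComparatorSketch* c) : cmp(c) {
    assert(c != nullptr);
  }
  bool operator()(const std::string& a, const std::string& b) const {
    return cmp->Compare(a, b) < 0;
  }
  const ComparatorSketch* cmp;
};

// A single map type that any code needing comparator-ordered keys can use.
using KVMapSketch =
    std::map<std::string, std::string, LessOfComparatorSketch>;

// Example comparator for the sketch: reverse bytewise order.
struct ReverseBytewiseSketch : ComparatorSketch {
  int Compare(const std::string& a, const std::string& b) const override {
    return b.compare(a);
  }
};
```

Defining the adapter once avoids each test or helper re-implementing its own ordering functor.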
      
      Test Plan: make clean check all
      
      Reviewers: rven, anthony, yhchiang, sdong, igor
      
      Reviewed By: igor
      
      Subscribers: dhruba, leveldb
      
      Differential Revision: https://reviews.facebook.net/D45993