1. 11 Jun 2021 (3 commits)
    • A
      Support for Merge in Integrated BlobDB with base values (#8292) · 3897ce31
      Akanksha Mahajan committed
      Summary:
      This PR adds support for the Merge operation in the integrated BlobDB with base values (i.e. DB::Put). Merged values can be retrieved through DB::Get, DB::MultiGet, DB::GetMergeOperands, and iterator operations.
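      A minimal C++ sketch of exercising this path, assuming a caller-defined associative merge operator; the option names (`enable_blob_files`, `min_blob_size`) follow the public API, while the operator, path, and keys are illustrative:
      
      ```
      #include <cassert>
      #include <memory>
      #include <string>
      
      #include "rocksdb/db.h"
      #include "rocksdb/merge_operator.h"
      #include "rocksdb/options.h"
      
      // Toy associative operator that joins operands with ','.
      class AppendOperator : public ROCKSDB_NAMESPACE::AssociativeMergeOperator {
       public:
        bool Merge(const ROCKSDB_NAMESPACE::Slice& /*key*/,
                   const ROCKSDB_NAMESPACE::Slice* existing,
                   const ROCKSDB_NAMESPACE::Slice& value, std::string* new_value,
                   ROCKSDB_NAMESPACE::Logger* /*logger*/) const override {
          new_value->clear();
          if (existing) {
            new_value->assign(existing->data(), existing->size());
            new_value->push_back(',');
          }
          new_value->append(value.data(), value.size());
          return true;
        }
        const char* Name() const override { return "AppendOperator"; }
      };
      
      int main() {
        ROCKSDB_NAMESPACE::Options options;
        options.create_if_missing = true;
        options.merge_operator = std::make_shared<AppendOperator>();
        options.enable_blob_files = true;  // integrated BlobDB
        options.min_blob_size = 0;         // store all base values as blobs (illustrative)
      
        ROCKSDB_NAMESPACE::DB* db = nullptr;
        auto s = ROCKSDB_NAMESPACE::DB::Open(options, "/tmp/blob_merge_demo", &db);
        assert(s.ok());
      
        s = db->Put(ROCKSDB_NAMESPACE::WriteOptions(), "k", "base");     // base value
        assert(s.ok());
        s = db->Merge(ROCKSDB_NAMESPACE::WriteOptions(), "k", "delta");  // merge operand
        assert(s.ok());
      
        std::string value;
        s = db->Get(ROCKSDB_NAMESPACE::ReadOptions(), "k", &value);  // "base,delta"
        assert(s.ok());
        delete db;
        return 0;
      }
      ```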
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8292
      
      Test Plan: Add new unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D28415896
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e9b3478bef51d2f214fb88c31ed3c8d2f4a531ff
      3897ce31
    • B
      Fixed manifest_dump issues when printing keys and values containing null characters (#8378) · d61a4493
      Baptiste Lemaire committed
      Summary:
      Changed the fprintf call to fputc in ApplyVersionEdit, and replaced null characters with whitespace.
      Added a unit test in ldb_test.py that verifies the manifest_dump --verbose output is correct when keys and values containing null characters are inserted.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8378
      
      Reviewed By: pdillinger
      
      Differential Revision: D29034584
      
      Pulled By: bjlemaire
      
      fbshipit-source-id: 50833687a8a5f726e247c38457eadc3e6dbab862
      d61a4493
    • Z
      Use DbSessionId as cache key prefix when secondary cache is enabled (#8360) · f44e69c6
      Zhichao Cao committed
      Summary:
      Currently, we use either the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, a monotonically incrementing runtime ID (used when the file system does not support inode id generation) cannot ensure uniqueness in some cases, e.g., when a secondary cache is migrated from host to host. We now use the DbSessionId (20 bytes) + current file number (at most 10 bytes) as the cache block key prefix when the secondary cache is enabled, so it can accommodate scenarios such as transferring cache state across hosts.
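      For illustration only, a sketch of how such a host-portable prefix can be composed from values the public API exposes (DB::GetDbSessionId plus a file number); the real prefix is assembled internally by the table reader and its exact encoding may differ, and `IllustrativeCachePrefix` is a made-up helper:
      
      ```
      #include <cstdint>
      #include <string>
      
      #include "rocksdb/db.h"
      
      // Illustration only: a prefix that stays unique across hosts because it is
      // derived from the DB session id rather than an inode or runtime counter.
      std::string IllustrativeCachePrefix(ROCKSDB_NAMESPACE::DB* db,
                                          uint64_t file_number) {
        std::string session_id;  // ~20 bytes, unique per DB session
        if (!db->GetDbSessionId(session_id).ok()) {
          return {};
        }
        return session_id + "_" + std::to_string(file_number);
      }
      ```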
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
      
      Test Plan: add the test to lru_cache_test
      
      Reviewed By: pdillinger
      
      Differential Revision: D29006215
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
      f44e69c6
  2. 10 Jun 2021 (1 commit)
    • L
      Add a clipping internal iterator (#8327) · db325a59
      Levi Tamasi committed
      Summary:
      Logically, subcompactions process a key range [start, end); however, the way
      this is currently implemented is that the `CompactionIterator` for any given
      subcompaction keeps processing key-values until it actually outputs a key that
      is out of range, which is then discarded. Instead of doing this, the patch
      introduces a new type of internal iterator called `ClippingIterator` which wraps
      another internal iterator and "clips" its range of key-values so that any KVs
      returned are strictly in the [start, end) interval. This does eliminate a (minor)
      inefficiency by stopping processing in subcompactions exactly at the limit;
      however, the main motivation is related to BlobDB: namely, we need this to be
      able to measure the amount of garbage generated by a subcompaction
      precisely and prevent off-by-one errors.
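      The actual `ClippingIterator` wraps RocksDB's internal iterator type; purely to illustrate the clipping idea, here is a simplified wrapper over the public `rocksdb::Iterator` that confines forward iteration to [start, end). The class name and shape are illustrative, not the internal implementation:
      
      ```
      #include <string>
      #include <utility>
      
      #include "rocksdb/comparator.h"
      #include "rocksdb/iterator.h"
      #include "rocksdb/slice.h"
      
      // Simplified illustration: forward iteration confined to [start, end).
      class ClippedForwardIterator {
       public:
        ClippedForwardIterator(ROCKSDB_NAMESPACE::Iterator* it, std::string start,
                               std::string end)
            : it_(it),
              start_(std::move(start)),
              end_(std::move(end)),
              cmp_(ROCKSDB_NAMESPACE::BytewiseComparator()) {}
      
        void SeekToFirst() { it_->Seek(start_); }  // never positions before start
        void Next() { it_->Next(); }
        bool Valid() const {
          // Stop as soon as the wrapped iterator reaches or passes end.
          return it_->Valid() && cmp_->Compare(it_->key(), end_) < 0;
        }
        ROCKSDB_NAMESPACE::Slice key() const { return it_->key(); }
        ROCKSDB_NAMESPACE::Slice value() const { return it_->value(); }
      
       private:
        ROCKSDB_NAMESPACE::Iterator* it_;
        std::string start_, end_;
        const ROCKSDB_NAMESPACE::Comparator* cmp_;
      };
      ```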
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8327
      
      Test Plan: `make check`
      
      Reviewed By: siying
      
      Differential Revision: D28761541
      
      Pulled By: ltamasi
      
      fbshipit-source-id: ee0e7229f04edabbc7bed5adb51771fbdc287f69
      db325a59
  3. 08 Jun 2021 (2 commits)
    • P
      Fix a major performance bug in 6.21 for cache entry stats (#8369) · 2f93a3b8
      Peter Dillinger committed
      Summary:
      In final polishing of https://github.com/facebook/rocksdb/issues/8297 (after most manual testing), I
      broke my own caching layer by sanitizing an input parameter with
      std::min(0, x) instead of std::max(0, x). I resisted unit testing the
      timing part of the result caching because historically, these tests
      are either flaky or difficult to write, and this was not a correctness
      issue. This bug is essentially unnoticeable with a small number
      of column families but can explode background work with a
      large number of column families.
      
      This change fixes the logical error, removes some unnecessary related
      optimization, and adds mock time/sleeps to the unit test to ensure we
      can cache hit within the age limit.
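      The one-character bug is easy to see in isolation; an illustrative clamp (not the actual RocksDB code, and `SanitizeAgeLimit` is a made-up name):
      
      ```
      #include <algorithm>
      #include <cstdint>
      
      // Clamp a configured age limit to be non-negative.
      int64_t SanitizeAgeLimit(int64_t x) {
        // Buggy form: std::min<int64_t>(0, x) can never be positive, so every
        // positive input collapses to 0 and the cached result always looks too old.
        return std::max<int64_t>(0, x);  // correct: only negative inputs become 0
      }
      ```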
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8369
      
      Test Plan: added time testing logic to existing unit test
      
      Reviewed By: ajkr
      
      Differential Revision: D28950892
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e79cd4ff3eec68fd0119d994f1ed468c38026c3b
      2f93a3b8
    • D
      Cancel compact range (#8351) · 80a59a03
      David Devecsery committed
      Summary:
      Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to by the CompactRangeOptions structure.
      
      Tested via two tests added to db_tests2.cc.
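      A minimal sketch of the intended usage, assuming the `CompactRangeOptions::canceled` field added here is a pointer to `std::atomic<bool>` as described; the surrounding threading setup is illustrative:
      
      ```
      #include <atomic>
      #include <thread>
      
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      
      void CancelableFullCompaction(ROCKSDB_NAMESPACE::DB* db) {
        std::atomic<bool> canceled{false};
      
        ROCKSDB_NAMESPACE::CompactRangeOptions cro;
        cro.canceled = &canceled;  // the running compaction polls this flag
      
        // Kick off a full-range compaction on a worker thread...
        std::thread worker([&] {
          // The returned status reflects whether the compaction finished or was
          // canceled part-way through.
          auto s = db->CompactRange(cro, /*begin=*/nullptr, /*end=*/nullptr);
          (void)s;
        });
      
        // ...and cancel it from the caller at any point.
        canceled.store(true, std::memory_order_relaxed);
        worker.join();
      }
      ```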
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351
      
      Reviewed By: ajkr
      
      Differential Revision: D28808894
      
      Pulled By: ddevec
      
      fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
      80a59a03
  4. 04 Jun 2021 (1 commit)
    • A
      Snapshot release triggered compaction without multiple tombstones (#8357) · 9167ece5
      Andrew Kryczka committed
      Summary:
      This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase.
      
      This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop.
      
      The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are:
      
      - files could have `largest_seqno > 0` forever, making it impossible to tell that they cannot contain any covering keys
      - it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use cases actually don't use Delete but instead call Put with an empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357
      
      Test Plan: - make check
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28855340
      
      Pulled By: ajkr
      
      fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798
      9167ece5
  5. 02 Jun 2021 (1 commit)
    • P
      Fix "Interval WAL" bytes to say GB instead of MB (#8350) · 2655477c
      PiyushDatta committed
      Summary:
      Reference: https://github.com/facebook/rocksdb/issues/7201
      
      Before fix:
      `/tmp/rocksdb_test_file/LOG.old.1622492586055679:Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s`
      
      After fix:
      `/tmp/rocksdb_test_file/LOG:Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s`
      
      Tests:
      ```
      Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
      ETA: 0s Left: 0 AVG: 0.05s  local:0/7720/100%/0.0s
      rm -rf /dev/shm/rocksdb.CLRh
      /usr/bin/python3 tools/check_all_python.py
      No syntax errors in 34 .py files
      /usr/bin/python3 tools/ldb_test.py
      Running testCheckConsistency...
      .Running testColumnFamilies...
      .Running testCountDelimDump...
      .Running testCountDelimIDump...
      .Running testDumpLiveFiles...
      .Running testDumpLoad...
      Warning: 7 bad lines ignored.
      .Running testGetProperty...
      .Running testHexPutGet...
      .Running testIDumpBasics...
      .Running testIngestExternalSst...
      .Running testInvalidCmdLines...
      .Running testListColumnFamilies...
      .Running testManifestDump...
      .Running testMiscAdminTask...
      Sequence,Count,ByteSize,Physical Offset,Key(s)
      .Running testSSTDump...
      .Running testSimpleStringPutGet...
      .Running testStringBatchPut...
      .Running testTtlPutGet...
      .Running testWALDump...
      .
      ----------------------------------------------------------------------
      Ran 19 tests in 15.945s
      
      OK
      sh tools/rocksdb_dump_test.sh
      make check-format
      make[1]: Entering directory '/home/piydatta/Documents/rocksdb'
      $DEBUG_LEVEL is 1
      Makefile:176: Warning: Compiling in debug mode. Don't use the resulting binary in production
      build_tools/format-diff.sh -c
      Checking format of uncommitted changes...
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8350
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28790567
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: dcb1e4c124361156435122f21f0a288335b2c8c8
      2655477c
  6. 28 May 2021 (1 commit)
    • P
      Do not truncate WAL if in read_only mode (#8313) · c75ef03e
      Peter (Stig) Edwards committed
      Summary:
      I noticed an ```openat``` system call with the ```O_WRONLY``` flag, plus ```sync_file_range``` and ```truncate``` calls on the WAL file, when using ```rocksdb::DB::OpenForReadOnly``` by way of ```db_bench --readonly=true --benchmarks=readseq --use_existing_db=1 --num=1 ...```
      
      This was noticed in ```strace``` after seeing the last modification time of the WAL file change after each run (with ```--readonly=true```).
      
      I think this was introduced by https://github.com/facebook/rocksdb/commit/7d7f14480e135a4939ed6903f46b3f7056aa837a from https://github.com/facebook/rocksdb/pull/8122
      
      I added a test to catch the WAL file being truncated and the modification time on it changing.
      I am not sure if a mock filesystem with mock clock could be used to avoid having to sleep 1.1s.
      The test could also check the set of files is the same and that the sizes are also unchanged.
      
      Before:
      
      ```
      [ RUN      ] DBBasicTest.ReadOnlyReopenMtimeUnchanged
      db/db_basic_test.cc:182: Failure
      Expected equality of these values:
        file_mtime_after_readonly_reopen
          Which is: 1621611136
        file_mtime_before_readonly_reopen
          Which is: 1621611135
        file is: 000010.log
      [  FAILED  ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms)
      ```
      
      After:
      
      ```
      [ RUN      ] DBBasicTest.ReadOnlyReopenMtimeUnchanged
      [       OK ] DBBasicTest.ReadOnlyReopenMtimeUnchanged (1108 ms)
      ```
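      For context, a read-only open looks like the sketch below (the path and key are illustrative); after this fix such an open should no longer sync or truncate the WAL, or otherwise modify files on disk:
      
      ```
      #include <cassert>
      #include <string>
      
      #include "rocksdb/db.h"
      
      int main() {
        ROCKSDB_NAMESPACE::Options options;
        ROCKSDB_NAMESPACE::DB* db = nullptr;
      
        // Open an existing DB strictly for reads.
        auto s = ROCKSDB_NAMESPACE::DB::OpenForReadOnly(options, "/tmp/existing_db",
                                                        &db);
        assert(s.ok());
      
        std::string value;
        s = db->Get(ROCKSDB_NAMESPACE::ReadOptions(), "some_key", &value);
        // NotFound is fine here; the point is that nothing on disk changes.
        delete db;
        return 0;
      }
      ```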
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8313
      
      Reviewed By: pdillinger
      
      Differential Revision: D28656925
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: ea9e215cb53e7c830e76bc5fc75c45e21f12a1d6
      c75ef03e
  7. 25 May 2021 (1 commit)
  8. 22 May 2021 (4 commits)
    • J
      Fix clang-analyze: use of uninitialized variable (#8325) · 55853de6
      Jay Zhuang committed
      Summary:
      Error:
      ```
      db/db_compaction_test.cc:5211:47: warning: The left operand of '*' is a garbage value
      uint64_t total = (l1_avg_size + l2_avg_size * 10) * 10;
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8325
      
      Test Plan: `$ make analyze`
      
      Reviewed By: pdillinger
      
      Differential Revision: D28620916
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: f6d58ab84eefbcc905cda45afb9522b0c6d230f8
      55853de6
    • Z
      Use new Insert and Lookup APIs in table reader to support secondary cache (#8315) · 7303d02b
      Zhichao Cao committed
      Summary:
      The secondary cache is implemented as a secondary tier for the block cache. New Insert and Lookup APIs were introduced in https://github.com/facebook/rocksdb/issues/8271. To support and use the secondary cache in the block-based table reader, this PR introduces the corresponding callback functions that will be used by the secondary cache, and updates the Insert and Lookup APIs accordingly.
      
      benchmarking:
      ./db_bench --benchmarks="fillrandom" -num=1000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/tmp/rocks_t/db -partition_index_and_filters=true
      
      ./db_bench -db=/tmp/rocks_t/db -use_existing_db=true -benchmarks=readrandom -num=1000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=5 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -stats_dump_period_sec=30 -reads=50000000
      
      master benchmarking results:
      readrandom   :       3.923 micros/op 254881 ops/sec;   33.4 MB/s (23849796 of 50000000 found)
      rocksdb.db.get.micros P50 : 2.820992 P95 : 5.636716 P99 : 16.450553 P100 : 8396.000000 COUNT : 50000000 SUM : 179947064
      
      Current PR benchmarking results
      readrandom   :       4.083 micros/op 244925 ops/sec;   32.1 MB/s (23849796 of 50000000 found)
      rocksdb.db.get.micros P50 : 2.967687 P95 : 5.754916 P99 : 15.665912 P100 : 8213.000000 COUNT : 50000000 SUM : 187250053
      
      About a 3.8% throughput reduction.
      P50: 5.2% increase; P95: 2.09% increase; P99: 4.77% improvement.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8315
      
      Test Plan: added the testing case
      
      Reviewed By: anand1976
      
      Differential Revision: D28599774
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: 098c4df0d7327d3a546df7604b2f1602f13044ed
      7303d02b
    • P
      Add table properties for number of entries added to filters (#8323) · 3469d60f
      Peter Dillinger committed
      Summary:
      With Ribbon filter work and possible variance in actual bits
      per key (or prefix; general term "entry") to achieve certain FP rates,
      I've received a request to be able to track actual bits per key in
      generated filters. This change adds a num_filter_entries table
      property, which can be combined with filter_size to get bits per key
      (entry).
      
      This can vary from num_entries in at least these ways:
      * Different versions of same key are only counted once in filters.
      * With prefix filters, several user keys map to the same filter entry.
      * A single filter can include both prefixes and user keys.
      
      Note that FilterBlockBuilder::NumAdded() didn't do anything useful
      except distinguish empty from non-empty.
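      A sketch of computing bits per entry from table properties, assuming the `num_filter_entries` field added here alongside the existing `filter_size`; the helper name is illustrative:
      
      ```
      #include <cstdio>
      
      #include "rocksdb/db.h"
      #include "rocksdb/table_properties.h"
      
      // Print approximate filter bits per entry for each live table file.
      void PrintFilterBitsPerEntry(ROCKSDB_NAMESPACE::DB* db) {
        ROCKSDB_NAMESPACE::TablePropertiesCollection props;
        if (!db->GetPropertiesOfAllTables(&props).ok()) {
          return;
        }
        for (const auto& file_and_props : props) {
          const auto& tp = *file_and_props.second;
          if (tp.num_filter_entries > 0) {
            double bits_per_entry =
                8.0 * static_cast<double>(tp.filter_size) / tp.num_filter_entries;
            std::printf("%s: %.2f filter bits/entry\n", file_and_props.first.c_str(),
                        bits_per_entry);
          }
        }
      }
      ```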
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323
      
      Test Plan: basic unit test included, others updated
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28596210
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376
      3469d60f
    • J
      Fix manual compaction `max_compaction_bytes` under-calculated issue (#8269) · 6c865435
      Jay Zhuang committed
      Summary:
      Fix a bug where, for manual compaction, `max_compaction_bytes` only
      limits the SST files from the input level, but not the overlapping files
      on the output level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8269
      
      Test Plan: `make check`
      
      Reviewed By: ajkr
      
      Differential Revision: D28231044
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 9d7d03004f30cc4b1b9819830141436907554b7c
      6c865435
  9. 21 May 2021 (2 commits)
    • S
      Compare memtable insert and flush count (#8288) · 2f1984dd
      sdong committed
      Summary:
      When a memtable is flushed, it will validate the number of entries it reads and compare that number with how many entries were inserted into the memtable. This serves as one sanity check against memory corruption. This change will also allow more counters to be added in the future for better validation.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288
      
      Test Plan: Pass all existing tests
      
      Reviewed By: ajkr
      
      Differential Revision: D28369194
      
      fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
      2f1984dd
    • J
      Deflake ExternalSSTFileTest.PickedLevelBug (#8307) · 94b4faa0
      Jay Zhuang committed
      Summary:
      The test wants to make sure there's no compaction during `AddFile`
      (between `DBImpl::AddFile:MutexLock` and `DBImpl::AddFile:MutexUnlock`),
      but the mutex could be unlocked by `EnterUnbatched()`.
      Move the lock start point to after bumping the ingest file number.
      
      Also fix the deadlock when an ASSERT fails.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8307
      
      Reviewed By: ajkr
      
      Differential Revision: D28479849
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: b3c50f66aa5d5f59c5c27f815bfea189c4cd06cb
      94b4faa0
  10. 20 May 2021 (2 commits)
    • J
      Add remote compaction public API (#8300) · 3786181a
      Jay Zhuang committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8300
      
      Reviewed By: ajkr
      
      Differential Revision: D28464726
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 49e9f4fb791808a6cbf39a7b1a331373f645fc5e
      3786181a
    • P
      Use deleters to label cache entries and collect stats (#8297) · 311a544c
      Peter Dillinger committed
      Summary:
      This change gathers and publishes statistics about the
      kinds of items in block cache. This is especially important for
      profiling relative usage of cache by index vs. filter vs. data blocks.
      It works by iterating over the cache during periodic stats dump
      (InternalStats, stats_dump_period_sec) or on demand when
      DB::Get(Map)Property(kBlockCacheEntryStats) is called, except that for
      efficiency and sharing among column families, saved data from
      the last scan is used when the data is not considered too old.
      
      The new information can be seen in info LOG, for example:
      
          Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
          Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
      
      And also through DB::GetProperty and GetMapProperty (here using
      ldb just for demonstration):
      
          $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
          rocksdb.block-cache-entry-stats.bytes.data-block: 0
          rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.bytes.index-block: 178992
          rocksdb.block-cache-entry-stats.bytes.misc: 0
          rocksdb.block-cache-entry-stats.bytes.other-block: 0
          rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
          rocksdb.block-cache-entry-stats.capacity: 8388608
          rocksdb.block-cache-entry-stats.count.data-block: 0
          rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.count.index-block: 215
          rocksdb.block-cache-entry-stats.count.misc: 1
          rocksdb.block-cache-entry-stats.count.other-block: 0
          rocksdb.block-cache-entry-stats.count.write-buffer: 0
          rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
          rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
          rocksdb.block-cache-entry-stats.percent.misc: 0.000000
          rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
          rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
          rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
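      The same data can be read programmatically; a minimal sketch using DB::GetMapProperty with the property name shown above (the helper name is illustrative):
      
      ```
      #include <cstdio>
      #include <map>
      #include <string>
      
      #include "rocksdb/db.h"
      
      // Dump the same block cache entry stats shown in the ldb output above.
      void DumpBlockCacheEntryStats(ROCKSDB_NAMESPACE::DB* db) {
        std::map<std::string, std::string> stats;
        if (db->GetMapProperty("rocksdb.block-cache-entry-stats", &stats)) {
          for (const auto& kv : stats) {
            std::printf("%s: %s\n", kv.first.c_str(), kv.second.c_str());
          }
        }
      }
      ```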
      
      Solution detail - We need some way to flag what kind of blocks each
      entry belongs to, preferably without changing the Cache API.
      One of the complications is that Cache is a general interface that could
      have other users that don't adhere to whichever convention we decide
      on for keys and values. Or we would pay for an extra field in the Handle
      that would only be used for this purpose.
      
      This change uses a back-door approach, the deleter, to indicate the
      "role" of a Cache entry (in addition to the value type, implicitly).
      This has the added benefit of ensuring proper code origin whenever we
      recognize a particular role for a cache entry; if the entry came from
      some other part of the code, it will use an unrecognized deleter, which
      we simply attribute to the "Misc" role.
      
      An internal API makes for simple instantiation and automatic
      registration of Cache deleters for a given value type and "role".
      
      Another internal API, CacheEntryStatsCollector, solves the problem of
      caching the results of a scan and sharing them, to ensure scans are
      neither excessive nor redundant so as not to harm Cache performance.
      
      Because code is added to BlocklikeTraits, it is pulled out of
      block_based_table_reader.cc into its own file.
      
      This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
      (could still be added), and with actual stat gathering.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
      
      Test Plan: manual testing with db_bench, and a couple of basic unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D28488721
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
      311a544c
  11. 19 May 2021 (1 commit)
    • A
      Sync ingested files only if reopen is supported by the FS (#8296) · 9d61a085
      anand76 committed
      Summary:
      Some file systems (especially distributed FS) do not support reopening a file for writing. The ExternalSstFileIngestionJob calls ReopenWritableFile in order to sync the ingested file, which typically makes sense only on a local file system with a page cache (i.e., Posix). So this change tries to sync the ingested file only if ReopenWritableFile doesn't return Status::NotSupported().
      
      Tests:
      Add a new unit test in external_sst_file_basic_test
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8296
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28420865
      
      Pulled By: anand1976
      
      fbshipit-source-id: 380e7f5ff95324997f7a59864a9ac96ebbd0100c
      9d61a085
  12. 18 May 2021 (3 commits)
    • S
      Expose CompressionOptions::parallel_threads through C API (#8302) · 83d1a665
      Stanislav Tkach committed
      Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8302
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28499262
      
      Pulled By: ajkr
      
      fbshipit-source-id: 7b17b79af871d874dfca76db9bca0d640a6cd854
      83d1a665
    • L
      Make it possible to apply only a subrange of table property collectors (#8298) · d83542ca
      Levi Tamasi committed
      Summary:
      This patch does two things:
      1) Introduces some aliases in order to eliminate/prevent long-winded type names
      w/r/t the internal table property collectors (see e.g.
      `std::vector<std::unique_ptr<IntTblPropCollectorFactory>>`).
      2) Makes it possible to apply only a subrange of table property collectors during
      table building by turning `TableBuilderOptions::int_tbl_prop_collector_factories`
      from a pointer to a `vector` into a range (i.e. a pair of iterators).
      
      Rationale: I plan to introduce a BlobDB related table property collector, which
      should only be applied during table creation if blob storage is enabled at the moment
      (which can be changed dynamically). This change will make it possible to include/
      exclude the BlobDB related collector as needed without having to introduce
      a second `vector` of collectors in `ColumnFamilyData` with pretty much the same
      contents.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8298
      
      Test Plan: `make check`
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28430910
      
      Pulled By: ltamasi
      
      fbshipit-source-id: a81d28f2c59495865300f43deb2257d2e6977c8e
      d83542ca
    • S
      Write file temperature information to manifest (#8284) · 0ed8cb66
      sdong committed
      Summary:
      As a part of tiered storage, writing temperature information to the manifest is needed so that after DB recovery, RocksDB still has the tiering information required to implement some further necessary functionality.
      
      Also fix some issues in simulated hybrid FS.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8284
      
      Test Plan: Add a new unit test to validate that the information is indeed written and read back.
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28335801
      
      fbshipit-source-id: 56aeb2e6ea090be0200181dd968c8a7278037def
      0ed8cb66
  13. 14 May 2021 (2 commits)
    • A
      Initial support for secondary cache in LRUCache (#8271) · feb06e83
      anand76 committed
      Summary:
      Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and is evicted from memory, it is inserted into the secondary cache. On an LRU cache miss and a successful lookup in the secondary cache, the item is promoted to the LRU cache. Only synchronous lookup is currently supported. The secondary cache can be used to implement a persistent (flash) cache or a compressed cache.
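      A sketch of wiring a secondary cache into the block cache via LRUCacheOptions; `NewMySecondaryCache` stands in for a user-provided implementation of the new SecondaryCache interface and is not part of RocksDB:
      
      ```
      #include <memory>
      
      #include "rocksdb/cache.h"
      #include "rocksdb/options.h"
      #include "rocksdb/secondary_cache.h"
      #include "rocksdb/table.h"
      
      // Hypothetical factory for a user-provided SecondaryCache implementation
      // (e.g. flash-backed or compressed).
      extern std::shared_ptr<ROCKSDB_NAMESPACE::SecondaryCache> NewMySecondaryCache();
      
      ROCKSDB_NAMESPACE::Options MakeOptionsWithSecondaryCache() {
        ROCKSDB_NAMESPACE::LRUCacheOptions cache_opts;
        cache_opts.capacity = 1 << 30;                       // primary (LRU) tier
        cache_opts.secondary_cache = NewMySecondaryCache();  // evictions spill here
      
        ROCKSDB_NAMESPACE::BlockBasedTableOptions table_opts;
        table_opts.block_cache = ROCKSDB_NAMESPACE::NewLRUCache(cache_opts);
      
        ROCKSDB_NAMESPACE::Options options;
        options.table_factory.reset(
            ROCKSDB_NAMESPACE::NewBlockBasedTableFactory(table_opts));
        return options;
      }
      ```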
      
      Tests:
      Results from cache_bench and db_bench don't show any regression due to these changes.
      
      cache_bench results before and after this change -
      Command
      ```./cache_bench -ops_per_thread=10000000 -threads=1```
      Before
      ```Complete in 40.688 s; QPS = 245774```
      ```Complete in 40.486 s; QPS = 246996```
      ```Complete in 42.019 s; QPS = 237989```
      After
      ```Complete in 40.672 s; QPS = 245869```
      ```Complete in 44.622 s; QPS = 224107```
      ```Complete in 42.445 s; QPS = 235599```
      
      db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
      Commands
      ```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
      
      ```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
      Before
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
      ```
      After
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
      ```
      ```
      DB path: [/home/anand76/nvm_cache/db]
      readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D28357511
      
      Pulled By: anand1976
      
      fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
      feb06e83
    • J
      Refactor Option obj address from char* to void* (#8295) · d15fbae4
      Jay Zhuang committed
      Summary:
      And replace `reinterpret_cast` with `static_cast` or no cast.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8295
      
      Test Plan: `make check`
      
      Reviewed By: mrambacher
      
      Differential Revision: D28420303
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 645be123a0df624dc2bea37cd54a35403fc494fa
      d15fbae4
  14. 13 May 2021 (1 commit)
  15. 12 May 2021 (1 commit)
    • P
      New Cache API for gathering statistics (#8225) · 78a309bf
      Peter Dillinger committed
      Summary:
      Adds a new Cache::ApplyToAllEntries API that we expect to use
      (in follow-up PRs) for efficiently gathering block cache statistics.
      Notable features vs. old ApplyToAllCacheEntries:
      
      * Includes key and deleter (in addition to value and charge). We could
      have passed in a Handle but then more virtual function calls would be
      needed to get the "fields" of each entry. We expect to use the 'deleter'
      to identify the origin of entries, perhaps even more.
      * Heavily tuned to minimize latency impact on operating cache. It
      does this by iterating over small sections of each cache shard while
      cycling through the shards.
      * Supports tuning roughly how many entries to operate on for each
      lock acquire and release, to control the impact on the latency of other
      operations without excessive lock acquire & release. The right balance
      can depend on the cost of the callback. A good default seems to be
      around 256.
      * There should be no need to disable thread safety. (I would expect
      uncontended locks to be sufficiently fast.)
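      A sketch of a caller-side scan, assuming the callback shape described in the bullets above (key, value, charge, deleter) and the per-lock tuning option; the helper name is illustrative:
      
      ```
      #include <cstddef>
      #include <cstdio>
      
      #include "rocksdb/cache.h"
      
      // Count entries and total charge across the whole cache.
      void SummarizeCache(ROCKSDB_NAMESPACE::Cache* cache) {
        size_t count = 0;
        size_t total_charge = 0;
      
        ROCKSDB_NAMESPACE::Cache::ApplyToAllEntriesOptions opts;
        opts.average_entries_per_lock = 256;  // entries visited per lock hold
      
        cache->ApplyToAllEntries(
            [&](const ROCKSDB_NAMESPACE::Slice& /*key*/, void* /*value*/,
                size_t charge, ROCKSDB_NAMESPACE::Cache::DeleterFn /*deleter*/) {
              ++count;
              total_charge += charge;
            },
            opts);
      
        std::printf("entries=%zu total_charge=%zu\n", count, total_charge);
      }
      ```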
      
      I have enhanced cache_bench to validate this approach:
      
      * Reports a histogram of ns per operation, so we can look at the
      distribution of times, not just throughput (average).
      * Can add a thread for simulated "gather stats" which calls
      ApplyToAllEntries at a specified interval. We also generate a histogram
      of time to run ApplyToAllEntries.
      
      To make the iteration over some entries of each shard work as cleanly as
      possible, even with resize between next set of entries, I have
      re-arranged which hash bits are used for sharding and which for indexing
      within a shard.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
      
      Test Plan:
      A couple of unit tests are added, but primary validation is manual, as
      the primary risk is to performance.
      
      The primary validation is using cache_bench to ensure that neither
      the minor hashing changes nor the simulated stats gathering
      significantly impact QPS or latency distribution. Note that adding op
      latency histogram seriously impacts the benchmark QPS, so for a
      fair baseline, we need the cache_bench changes (except remove simulated
      stat gathering to make it compile). In short, we don't see any
      reproducible difference in ops/sec or op latency unless we are gathering
      stats nearly continuously. Test uses 10GB block cache with
      8KB values to be somewhat realistic in the number of items to iterate
      over.
      
      Baseline typical output:
      
      ```
      Complete in 92.017 s; Rough parallel ops/sec = 869401
      Thread ops/sec = 54662
      
      Operation latency (ns):
      Count: 80000000 Average: 11223.9494  StdDev: 29.61
      Min: 0  Median: 7759.3973  Max: 9620500
      Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
      ------------------------------------------------------
      [       0,       1 ]       68   0.000%   0.000%
      (    2900,    4400 ]       89   0.000%   0.000%
      (    4400,    6600 ] 33630240  42.038%  42.038% ########
      (    6600,    9900 ] 18129842  22.662%  64.700% #####
      (    9900,   14000 ]  7877533   9.847%  74.547% ##
      (   14000,   22000 ] 15193238  18.992%  93.539% ####
      (   22000,   33000 ]  3037061   3.796%  97.335% #
      (   33000,   50000 ]  1626316   2.033%  99.368%
      (   50000,   75000 ]   421532   0.527%  99.895%
      (   75000,  110000 ]    56910   0.071%  99.966%
      (  110000,  170000 ]    16134   0.020%  99.986%
      (  170000,  250000 ]     5166   0.006%  99.993%
      (  250000,  380000 ]     3017   0.004%  99.996%
      (  380000,  570000 ]     1337   0.002%  99.998%
      (  570000,  860000 ]      805   0.001%  99.999%
      (  860000, 1200000 ]      319   0.000% 100.000%
      ( 1200000, 1900000 ]      231   0.000% 100.000%
      ( 1900000, 2900000 ]      100   0.000% 100.000%
      ( 2900000, 4300000 ]       39   0.000% 100.000%
      ( 4300000, 6500000 ]       16   0.000% 100.000%
      ( 6500000, 9800000 ]        7   0.000% 100.000%
      ```
      
      New, gather_stats=false. Median thread ops/sec of 5 runs:
      
      ```
      Complete in 92.030 s; Rough parallel ops/sec = 869285
      Thread ops/sec = 54458
      
      Operation latency (ns):
      Count: 80000000 Average: 11298.1027  StdDev: 42.18
      Min: 0  Median: 7722.0822  Max: 6398720
      Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
      ------------------------------------------------------
      [       0,       1 ]      109   0.000%   0.000%
      (    2900,    4400 ]      793   0.001%   0.001%
      (    4400,    6600 ] 34054563  42.568%  42.569% #########
      (    6600,    9900 ] 17482646  21.853%  64.423% ####
      (    9900,   14000 ]  7908180   9.885%  74.308% ##
      (   14000,   22000 ] 15032072  18.790%  93.098% ####
      (   22000,   33000 ]  3237834   4.047%  97.145% #
      (   33000,   50000 ]  1736882   2.171%  99.316%
      (   50000,   75000 ]   446851   0.559%  99.875%
      (   75000,  110000 ]    68251   0.085%  99.960%
      (  110000,  170000 ]    18592   0.023%  99.983%
      (  170000,  250000 ]     7200   0.009%  99.992%
      (  250000,  380000 ]     3334   0.004%  99.997%
      (  380000,  570000 ]     1393   0.002%  99.998%
      (  570000,  860000 ]      700   0.001%  99.999%
      (  860000, 1200000 ]      293   0.000% 100.000%
      ( 1200000, 1900000 ]      196   0.000% 100.000%
      ( 1900000, 2900000 ]       69   0.000% 100.000%
      ( 2900000, 4300000 ]       32   0.000% 100.000%
      ( 4300000, 6500000 ]       10   0.000% 100.000%
      ```
      
      New, gather_stats=true, 1 second delay between scans. Scans take about
      1 second here so it's spending about 50% time scanning. Still the effect on
      ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
      
      ```
      Complete in 91.890 s; Rough parallel ops/sec = 870608
      Thread ops/sec = 54551
      
      Operation latency (ns):
      Count: 80000000 Average: 11311.2629  StdDev: 45.28
      Min: 0  Median: 7686.5458  Max: 10018340
      Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
      ------------------------------------------------------
      [       0,       1 ]       71   0.000%   0.000%
      (    2900,    4400 ]      291   0.000%   0.000%
      (    4400,    6600 ] 34492060  43.115%  43.116% #########
      (    6600,    9900 ] 16727328  20.909%  64.025% ####
      (    9900,   14000 ]  7845828   9.807%  73.832% ##
      (   14000,   22000 ] 15510654  19.388%  93.220% ####
      (   22000,   33000 ]  3216533   4.021%  97.241% #
      (   33000,   50000 ]  1680859   2.101%  99.342%
      (   50000,   75000 ]   439059   0.549%  99.891%
      (   75000,  110000 ]    60540   0.076%  99.967%
      (  110000,  170000 ]    14649   0.018%  99.985%
      (  170000,  250000 ]     5242   0.007%  99.991%
      (  250000,  380000 ]     3260   0.004%  99.995%
      (  380000,  570000 ]     1599   0.002%  99.997%
      (  570000,  860000 ]     1043   0.001%  99.999%
      (  860000, 1200000 ]      471   0.001%  99.999%
      ( 1200000, 1900000 ]      275   0.000% 100.000%
      ( 1900000, 2900000 ]      143   0.000% 100.000%
      ( 2900000, 4300000 ]       60   0.000% 100.000%
      ( 4300000, 6500000 ]       27   0.000% 100.000%
      ( 6500000, 9800000 ]        7   0.000% 100.000%
      ( 9800000, 14000000 ]        1   0.000% 100.000%
      
      Gather stats latency (us):
      Count: 46 Average: 980387.5870  StdDev: 60911.18
      Min: 879155  Median: 1033777.7778  Max: 1261431
      Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
      ------------------------------------------------------
      (  860000, 1200000 ]       45  97.826%  97.826% ####################
      ( 1200000, 1900000 ]        1   2.174% 100.000%
      
      Most recent cache entry stats:
      Number of entries: 1295133
      Total charge: 9.88 GB
      Average key size: 23.4982
      Average charge: 8.00 KB
      Unique deleters: 3
      ```
      
      Reviewed By: mrambacher
      
      Differential Revision: D28295742
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
      78a309bf
  16. 11 May 2021 (2 commits)
    • M
      Add ObjectRegistry to ConfigOptions (#8166) · 9f2d255a
      mrambacher committed
      Summary:
      This change enables a couple of things:
      - Different ConfigOptions instances can have different registries/factories associated with them, thereby allowing things like a "Test" ConfigOptions versus a "Production" one
      - The ObjectRegistry is created fewer times and can be re-used
      
      The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166
      
      Reviewed By: zhichao-cao
      
      Differential Revision: D27657952
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
      9f2d255a
    • M
      Add Merge Operator support to WriteBatchWithIndex (#8135) · ff463742
      mrambacher committed
      Summary:
      The WBWI has two differing modes of operation dependent on the value
      of the constructor parameter `overwrite_key`.
      Currently, regardless of the parameter, neither mode performs as
      expected when using Merge. This PR remedies this by correctly invoking
      the appropriate Merge Operator before returning results from the WBWI.
      
      Examples of issues that exist which are solved by this PR:
      
      ## Example 1 with `overwrite_key=false`
      Currently, from an empty database, the following sequence:
      ```
      Put('k1', 'v1')
      Merge('k1', 'v2')
      Get('k1')
      ```
      Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.
      
      ## Example 2 with `overwrite_key=true`
      Currently, from an empty database, the following sequence:
      ```
      Put('k1', 'v1')
      Merge('k1', 'v2')
      Get('k1')
      ```
      Incorrectly yields `ERROR: kMergeInProgress`.
      
      ## Example 3 with `overwrite_key=false`
      Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
      ```
      Merge('k1', 'v2')
      GetFromBatchAndDB('k1')
      ```
      Incorrectly yields `v1,v2`
      
      ## Example 4 with `overwrite_key=true`
      Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
      ```
      Merge('k1', 'v1')
      GetFromBatchAndDB('k1')
      ```
      Incorrectly yields `ERROR: kMergeInProgress`.
      
      ## Example 5 with `overwrite_key=false`
      Currently, from an empty database, the following sequence:
      ```
      Put('k1', 'v1')
      Merge('k1', 'v2')
      GetFromBatchAndDB('k1')
      ```
      Incorrectly yields `v1,v2`
      
      ## Example 6 with `overwrite_key=true`
      Currently, from an empty database, the following sequence:
      ```
      Put('k1', 'v1')
      Merge('k1', 'v2')
      GetFromBatchAndDB('k1')
      ```
      Incorrectly yields `ERROR: kMergeInProgress`.
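      A minimal sketch of the now-correct behavior, assuming `db` was opened with a merge operator configured on its default column family (for example, a simple string-append operator); the function name is illustrative:
      
      ```
      #include <cassert>
      #include <string>
      
      #include "rocksdb/db.h"
      #include "rocksdb/options.h"
      #include "rocksdb/utilities/write_batch_with_index.h"
      
      // Assumes db's default column family has a merge operator configured.
      void WbwiMergeSketch(ROCKSDB_NAMESPACE::DB* db) {
        ROCKSDB_NAMESPACE::WriteBatchWithIndex batch;  // overwrite_key = false
        batch.Put("k1", "v1");
        batch.Merge("k1", "v2");
      
        std::string value;
        // With this fix, the column family's merge operator is invoked to combine
        // "v1" with the pending operand "v2"; previously this either behaved like
        // a plain Put or failed with kMergeInProgress, as in the examples above.
        auto s = batch.GetFromBatchAndDB(db, ROCKSDB_NAMESPACE::ReadOptions(), "k1",
                                         &value);
        assert(s.ok());
      }
      ```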
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135
      
      Reviewed By: pdillinger
      
      Differential Revision: D27657938
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
      ff463742
  17. 08 May 2021 (2 commits)
    • A
      Allow applying `CompactionFilter` outside of compaction (#8243) · a639c02f
      Andrew Kryczka committed
      Summary:
      From HISTORY.md release note:
      
      - Allow `CompactionFilter`s to apply in more table file creation scenarios such as flush and recovery. For compatibility, `CompactionFilter`s by default apply during compaction. Users can customize this behavior by overriding `CompactionFilterFactory::ShouldFilterTableFileCreation()`.
      - Removed unused structure `CompactionFilterContext`
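      A sketch of the customization point named in the release note, assuming the hook receives the table-file creation reason (check compaction_filter.h for the exact signature); the factory name is illustrative and its CreateCompactionFilter body is intentionally left trivial:
      
      ```
      #include <memory>
      
      #include "rocksdb/compaction_filter.h"
      
      // Factory that opts in to filtering during flushes as well as compactions
      // (the default remains compaction-only for compatibility).
      class MyFilterFactory : public ROCKSDB_NAMESPACE::CompactionFilterFactory {
       public:
        bool ShouldFilterTableFileCreation(
            ROCKSDB_NAMESPACE::TableFileCreationReason reason) const override {
          using Reason = ROCKSDB_NAMESPACE::TableFileCreationReason;
          return reason == Reason::kCompaction || reason == Reason::kFlush;
        }
      
        std::unique_ptr<ROCKSDB_NAMESPACE::CompactionFilter> CreateCompactionFilter(
            const ROCKSDB_NAMESPACE::CompactionFilter::Context& /*context*/)
            override {
          return nullptr;  // kept empty to keep the sketch short
        }
      
        const char* Name() const override { return "MyFilterFactory"; }
      };
      ```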
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8243
      
      Test Plan: added unit tests
      
      Reviewed By: pdillinger
      
      Differential Revision: D28088089
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0799be7908e3b39fea09fc3f1ab00e13ad817fae
      a639c02f
    • S
      Cap automatic arena block size to 1 MB (#7907) · a4919d6b
      sdong committed
      Summary:
      A larger arena block size does provide the benefit of reducing allocation overhead; however, it may cause other trouble. For example, the allocator is more likely not to map such blocks to physical memory, triggering page faults. Weighing the risk, we cap the automatic arena block size at 1 MB. Users can always set a larger value explicitly if they want.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7907
      
      Test Plan: Run all existing tests
      
      Reviewed By: pdillinger
      
      Differential Revision: D26135269
      
      fbshipit-source-id: b7f55afd03e6ee1d8715f90fa11b6c33944e9ea8
      a4919d6b
  18. 06 May 2021 (3 commits)
    • S
      Refactor kill point (#8241) · e19908cb
      sdong committed
      Summary:
      Refactor the kill points into one single class, rather than several extern variables. The intention was to drop unflushed data before killing to simulate some jobs, and I tried to pass a pointer to the fault injection FS to the killing class, but it ended up being harder than I thought. Perhaps we'll need to do this another way. But I thought the refactoring itself is good, so I'm sending it out.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8241
      
      Test Plan: make release and run crash test for a while.
      
      Reviewed By: anand1976
      
      Differential Revision: D28078486
      
      fbshipit-source-id: f9182c1455f52e6851c13f88a21bade63bcec45f
      e19908cb
    • M
      Make ImmutableOptions struct that inherits from ImmutableCFOptions and ImmutableDBOptions (#8262) · 8948dc85
      mrambacher committed
      Summary:
      The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions.  This change cleans that up by introducing an ImmutableOptions struct.  Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
      
      Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR.  All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
      
      Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
      
      Reviewed By: pdillinger
      
      Differential Revision: D28226540
      
      Pulled By: mrambacher
      
      fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
      8948dc85
    • A
      Fix `GetLiveFiles()` returning OPTIONS-000000 (#8268) · 0f42e50f
      Andrew Kryczka committed
      Summary:
      See release note in HISTORY.md.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268
      
      Test Plan: unit test repro
      
      Reviewed By: siying
      
      Differential Revision: D28227901
      
      Pulled By: ajkr
      
      fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339
      0f42e50f
  19. 05 May 2021 (2 commits)
    • A
      Fix ConcurrentTaskLimiter token release for shutdown (#8253) · c70bae1b
      Andrew Kryczka committed
      Summary:
      Previously the shutdown process did not properly wait for all
      `compaction_thread_limiter` tokens to be released before proceeding to
      delete the DB's C++ objects. When this happened, we saw tests like
      "DBCompactionTest.CompactionLimiter" flake with the following error:
      
      ```
      virtual
      rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl():
      Assertion `outstanding_tasks_ == 0' failed.
      ```
      
      There is a case where a token can still be alive even after the shutdown
      process has waited for BG work to complete. In particular, this happens
      because the shutdown process only waits for flush/compaction scheduled/unscheduled counters to all
      reach zero. These counters are decremented in `BackgroundCallCompaction()`
      functions. However, tokens are released in `BGWork*Compaction()` functions, which
      actually wrap the `BackgroundCallCompaction()` function.
      
      A simple sleep could repro the race condition:
      
      ```
      $ diff --git a/db/db_impl/db_impl_compaction_flush.cc
      b/db/db_impl/db_impl_compaction_flush.cc
      index 806bc548a..ba59efa89 100644
       --- a/db/db_impl/db_impl_compaction_flush.cc
      +++ b/db/db_impl/db_impl_compaction_flush.cc
      @@ -2442,6 +2442,7 @@ void DBImpl::BGWorkCompaction(void* arg) {
             static_cast<PrepickedCompaction*>(ca.prepicked_compaction);
         static_cast_with_check<DBImpl>(ca.db)->BackgroundCallCompaction(
             prepicked_compaction, Env::Priority::LOW);
      +  sleep(1);
         delete prepicked_compaction;
       }
      
      $ ./db_compaction_test --gtest_filter=DBCompactionTest.CompactionLimiter
      db_compaction_test: util/concurrent_task_limiter_impl.cc:24: virtual rocksdb::ConcurrentTaskLimiterImpl::~ConcurrentTaskLimiterImpl(): Assertion `outstanding_tasks_ == 0' failed.
      Received signal 6 (Aborted)
      #0   /usr/local/fbcode/platform007/lib/libc.so.6(gsignal+0xcf) [0x7f02673c30ff] ??      ??:0
      #1   /usr/local/fbcode/platform007/lib/libc.so.6(abort+0x134) [0x7f02673ac934] ??       ??:0
      ...
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8253
      
      Test Plan: sleeps to expose race conditions
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D28168064
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9e5167c74398d323e7975980c5cc00f450631160
      c70bae1b
    • A
      Deflake DBTest.L0L1L2AndUpHitCounter (#8259) · c2a3424d
      Andrew Kryczka committed
      Summary:
      Previously we saw flakes on platforms like arm on CircleCI, such as the following:
      
      ```
      Note: Google Test filter = DBTest.L0L1L2AndUpHitCounter
      [==========] Running 1 test from 1 test case.
      [----------] Global test environment set-up.
      [----------] 1 test from DBTest
      [ RUN      ] DBTest.L0L1L2AndUpHitCounter
      db/db_test.cc:5345: Failure
      Expected: (TestGetTickerCount(options, GET_HIT_L0)) > (100), actual: 30 vs 100
      [  FAILED  ] DBTest.L0L1L2AndUpHitCounter (150 ms)
      [----------] 1 test from DBTest (150 ms total)
      
      [----------] Global test environment tear-down
      [==========] 1 test from 1 test case ran. (150 ms total)
      [  PASSED  ] 0 tests.
      [  FAILED  ] 1 test, listed below:
      [  FAILED  ] DBTest.L0L1L2AndUpHitCounter
      ```
      
      The test was totally non-deterministic, e.g., flush/compaction timing would affect how many files on each level. Furthermore, it depended heavily on platform-specific details, e.g., by having a 32KB memtable, it could become full with a very different number of entries depending on the platform.
      
      This PR rewrites the test to build a deterministic LSM with one file per level.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8259
      
      Reviewed By: mrambacher
      
      Differential Revision: D28178100
      
      Pulled By: ajkr
      
      fbshipit-source-id: 0a03b26e8d23c29d8297c1bccb1b115dce33bdcd
      c2a3424d
  20. 04 May 2021 (1 commit)
    • S
      Hint temperature of bottommost level files to FileSystem (#8222) · c3ff14e2
      sdong committed
      Summary:
      As the first part of the effort of placing different files on different storage types, this change introduces several things:
      (1) An experimental interface in FileSystem that specifies the temperature of a newly created file.
      (2) A test FileSystemWrapper, SimulatedHybridFileSystem, that simulates an HDD for files of "warm" temperature.
      (3) A simple experimental option, ColumnFamilyOptions.bottommost_temperature; RocksDB passes this value to the FileSystem when creating any bottommost file (see the sketch after the motivation below).
      (4) A db_bench parameter that applies (2) and (3) to db_bench.
      
      The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development.
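      A minimal sketch of using item (3), assuming the experimental option and the Temperature enum from the public options headers; the function name is illustrative:
      
      ```
      #include "rocksdb/options.h"
      
      ROCKSDB_NAMESPACE::Options MakeTieredOptions() {
        ROCKSDB_NAMESPACE::Options options;
        options.create_if_missing = true;
        // Experimental: hint that files written to the bottommost level should be
        // treated as "warm" by the FileSystem (e.g. placed on slower storage).
        options.bottommost_temperature = ROCKSDB_NAMESPACE::Temperature::kWarm;
        return options;
      }
      ```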
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222
      
      Test Plan:
      ./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000  -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000
      
      followed by
      
      ./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000
      
      and see results as expected.
      
      Reviewed By: ajkr
      
      Differential Revision: D28003028
      
      fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce
      c3ff14e2
  21. 01 May 2021 (1 commit)
    • P
      Add more LSM info to FilterBuildingContext (#8246) · d2ca04e3
      Peter Dillinger committed
      Summary:
      Add `num_levels`, `is_bottommost`, and table file creation
      `reason` to `FilterBuildingContext`, in anticipation of more powerful
      Bloom-like filter support.
      
      To support this, added `is_bottommost` and `reason` to
      `TableBuilderOptions`, which allowed removing `reason` parameter from
      `rocksdb::BuildTable`.
      
      I attempted to remove `skip_filters` from `TableBuilderOptions`, because
      filter construction decisions should arise from options, not one-off
      parameters. I could not completely remove it because the public API for
      SstFileWriter takes a `skip_filters` parameter, and translating this
      into an option change would mean awkwardly replacing the table_factory
      if it is BlockBasedTableFactory with new filter_policy=nullptr option.
      I marked this public skip_filters option as deprecated because of this
      oddity. (skip_filters on the read side probably makes sense.)
      
      At least `skip_filters` is now largely hidden for users of
      `TableBuilderOptions` and is no longer used for implementing the
      optimize_filters_for_hits option. Bringing the logic for that option
      closer to handling of FilterBuildingContext makes it more obvious that
      these two are using the same notion of "bottommost." (Planned:
      configuration options for Bloom-like filters that generalize
      `optimize_filters_for_hits`)
      
      Recommended follow-up: Try to get away from "bottommost level" naming of
      things, which is inaccurate (see
      VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
      "bottommost run" or just "bottommost."
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
      
      Test Plan:
      extended an existing unit test to exercise and check various
      filter building contexts. Also, existing tests for
      optimize_filters_for_hits validate some of the "bottommost" handling,
      which is now closely connected to FilterBuildingContext::is_bottommost
      through TableBuilderOptions::is_bottommost
      
      Reviewed By: mrambacher
      
      Differential Revision: D28099346
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
      d2ca04e3
  22. 29 Apr 2021 (2 commits)
    • P
      Refactor: use TableBuilderOptions to reduce parameter lists (#8240) · 85becd94
      Peter Dillinger committed
      Summary:
      Greatly reduced the not-quite-copy-paste giant parameter lists
      of rocksdb::NewTableBuilder, rocksdb::BuildTable,
      BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.
      
      Moved weird separate parameter `uint32_t column_family_id` of
      TableFactory::NewTableBuilder into TableBuilderOptions.
      
      Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
      target_file_size` is not randomly placed between uint64_t timestamps
      (was easy to mix up).
      
      Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
      FilterBuildingContext. The motivation for this change is making it
      easier to pass along more data into new fields in FilterBuildingContext
      (follow-up PR).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240
      
      Test Plan: ASAN make check
      
      Reviewed By: mrambacher
      
      Differential Revision: D28075891
      
      Pulled By: pdillinger
      
      fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
      85becd94
    • A
      Fix a memory leak in c_test (#8237) · 0db4cde6
      anand76 committed
      Summary:
      Don't call ```rocksdb_cache_disown_data()``` as it causes the memory allocated for ```shards_``` to be leaked.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8237
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28039061
      
      Pulled By: anand1976
      
      fbshipit-source-id: c3464efe2c006b93b4be87030116a12a124598c4
      0db4cde6
  23. 28 Apr 2021 (1 commit)