1. Sep 14, 2021 (1 commit)
    • Use GetBlobFileSize instead of GetTotalBlobBytes in DB properties (#8902) · 306b7799
      Committed by Levi Tamasi
      Summary:
      The patch adjusts the definition of BlobDB's DB properties a bit by
      switching to `GetBlobFileSize` from `GetTotalBlobBytes`. The
      difference is that the value returned by `GetBlobFileSize` includes
      the blob file header and footer as well, and thus matches the on-disk
      size of blob files. In addition, the patch removes the `Version` number
      from the `blob_stats` property, and updates/extends the unit tests a little.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8902
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D30859542
      
      Pulled By: ltamasi
      
      fbshipit-source-id: e3426d2d567bd1bd8c8636abdafaafa0743c854c
  2. Sep 9, 2021 (1 commit)
    • Add DB properties for BlobDB (#8734) · 0cb0fc6f
      Committed by Zhiyi Zhang
      Summary:
      RocksDB exposes certain internal statistics via the DB property interface.
      However, there are currently no properties related to BlobDB.
      
      For starters, we would like to add the following BlobDB properties:
      `rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
      `rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
      `rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
      `rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
      `rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.
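      The live-data computation described for the last property can be sketched as follows; the struct and field names are illustrative, not RocksDB's actual blob file metadata API:

      ```cpp
      #include <cstdint>
      #include <vector>

      // Hypothetical per-blob-file metadata (illustrative names).
      struct BlobFileInfo {
        uint64_t total_bytes;    // on-disk size of the blob file
        uint64_t garbage_bytes;  // bytes known to be garbage in the file
      };

      // Exact live bytes for blobs, as described above: total minus garbage,
      // summed over the blob files in the current Version.
      uint64_t LiveBlobBytes(const std::vector<BlobFileInfo>& files) {
        uint64_t live = 0;
        for (const auto& f : files) {
          live += f.total_bytes - f.garbage_bytes;
        }
        return live;
      }
      ```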
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734
      
      Test Plan:
      ```
      ➜  rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
      [==========] Running 16 tests from 2 test cases.
      [----------] Global test environment set-up.
      [----------] 10 tests from DBBlobBasicTest
      [ RUN      ] DBBlobBasicTest.GetBlob
      [       OK ] DBBlobBasicTest.GetBlob (12 ms)
      [ RUN      ] DBBlobBasicTest.MultiGetBlobs
      [       OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_CorruptIndex
      [       OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
      [       OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
      [ RUN      ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
      [       OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
      [ RUN      ] DBBlobBasicTest.GenerateIOTracing
      [       OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
      [ RUN      ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
      [       OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
      [ RUN      ] DBBlobBasicTest.GetMergeBlobWithPut
      [       OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
      [ RUN      ] DBBlobBasicTest.MultiGetMergeBlobWithPut
      [       OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
      [ RUN      ] DBBlobBasicTest.BlobDBProperties
      [       OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
      [----------] 10 tests from DBBlobBasicTest (124 ms total)
      
      [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
      [ RUN      ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
      [       OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
      [----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)
      
      [----------] Global test environment tear-down
      [==========] 16 tests from 2 test cases ran. (2190 ms total)
      [  PASSED  ] 16 tests.
      ```
      
      Reviewed By: ltamasi
      
      Differential Revision: D30690849
      
      Pulled By: Zhiyi-Zhang
      
      fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346
  3. Aug 25, 2021 (1 commit)
    • Add port::GetProcessID() (#8693) · 318fe694
      Committed by Peter Dillinger
      Summary:
      Useful in some places for object uniqueness across processes.
      Currently used for generating a host-wide identifier of Cache objects
      but expected to be used soon in some unique id generation code.
      
      `int64_t` is chosen as the return type because POSIX uses a signed
      integer type (usually `int`) for `pid_t`, while Windows uses `DWORD`,
      which is `uint32_t`.
      
      Future work: avoid copy-pasted declarations in port_*.h, perhaps with
      port_common.h always included from port.h
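      A minimal sketch of such a portable process-ID getter, following the type rationale above (the `#ifdef` structure and naming are assumptions, not the actual port layer code):

      ```cpp
      #include <cstdint>
      #ifdef _WIN32
      #include <windows.h>
      #else
      #include <unistd.h>
      #endif

      // Returns the current process ID widened to int64_t, which can hold
      // both POSIX's signed pid_t and Windows' unsigned 32-bit DWORD.
      int64_t GetProcessID() {
      #ifdef _WIN32
        return static_cast<int64_t>(GetCurrentProcessId());
      #else
        return static_cast<int64_t>(getpid());
      #endif
      }
      ```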
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8693
      
      Test Plan: manual for now
      
      Reviewed By: ajkr, anand1976
      
      Differential Revision: D30492876
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 39fc2788623cc9f4787866bdb67a4d183dde7eef
  4. Aug 16, 2021 (1 commit)
  5. Jul 17, 2021 (1 commit)
    • Don't hold DB mutex for block cache entry stat scans (#8538) · df5dc73b
      Committed by Peter Dillinger
      Summary:
      I previously didn't notice that the DB mutex was being held during
      block cache entry stat scans, probably because I primarily checked for
      read performance regressions: reads require the block cache and are
      traditionally latency-sensitive.
      
      This change does some refactoring to avoid holding DB mutex and to
      avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
      Some tests have to be updated because now the stats collector is
      populated in the Cache aggressively on DB startup rather than lazily.
      (I hope to clean up some of this added complexity in the future.)
      
      This change also ensures proper treatment of need_out_of_mutex for
      non-int DB properties.
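      The `need_out_of_mutex` treatment can be sketched roughly as follows, with simplified types that are not the actual `InternalStats` interface:

      ```cpp
      #include <functional>
      #include <mutex>
      #include <string>

      // Simplified property descriptor (illustrative, not RocksDB's).
      struct PropertyInfo {
        bool need_out_of_mutex;  // true for potentially slow scans
        std::function<std::string()> compute;
      };

      // Properties flagged need_out_of_mutex are computed without holding
      // the DB mutex; everything else is computed under it.
      std::string GetPropertyValue(const PropertyInfo& info,
                                   std::mutex& db_mutex) {
        if (info.need_out_of_mutex) {
          return info.compute();  // e.g. a block cache entry stat scan
        }
        std::lock_guard<std::mutex> lock(db_mutex);
        return info.compute();
      }
      ```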
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538
      
      Test Plan:
      Added unit test logic that uses sync points to fail if the DB mutex
      is held during a scan, covering the various ways that a scan might be
      triggered.
      
      Performance test - the known impact to holding the DB mutex is on
      TransactionDB, and the easiest way to see the impact is to hack the
      scan code to almost always miss and take an artificially long time
      scanning. Here I've injected an unconditional 5s sleep at the call to
      ApplyToAllEntries.
      
      Before (hacked):
      
          $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     433.219 micros/op 2308 ops/sec;    0.1 MB/s ( transactions:78999 aborts:0)
          rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
          $ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     448.802 micros/op 2228 ops/sec;    0.1 MB/s ( transactions:75999 aborts:0)
          rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323
      
      Notice the 5s P100 write time.
      
      After (hacked):
      
          $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     303.645 micros/op 3293 ops/sec;    0.1 MB/s ( transactions:98999 aborts:0)
          rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
          $ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     310.383 micros/op 3221 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
          rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918
      
      P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:
      
          $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     311.365 micros/op 3211 ops/sec;    0.1 MB/s ( transactions:96999 aborts:0)
          rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
          $ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
          randomtransaction :     308.395 micros/op 3242 ops/sec;    0.1 MB/s ( transactions:97999 aborts:0)
          rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832
      
      No substantial difference.
      
      Reviewed By: siying
      
      Differential Revision: D29738847
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
  6. Jun 14, 2021 (1 commit)
    • Pin CacheEntryStatsCollector to fix performance bug (#8385) · d5a46c40
      Committed by Peter Dillinger
      Summary:
      If the block Cache is full with strict_capacity_limit=false,
      then our CacheEntryStatsCollector could be immediately evicted on
      release, so iterating through column families with shared block cache
      could trigger re-scan for each CF. This change fixes that problem by
      pinning the CacheEntryStatsCollector from InternalStats so that it's not
      evicted.
      
      I had originally thought that this object could participate in LRU like
      everything else, but even though a re-load+re-scan only touches memory,
      it can be orders of magnitude more expensive than other cache misses.
      One service in Facebook has scans that take ~20s over 100GB block cache
      that is mostly 4KB entries. (The up-side of this bug and https://github.com/facebook/rocksdb/issues/8369 is that
      we had a natural experiment on the effect on some service metrics even
      with block cache scans running continuously in the background--a kind
      of worst case scenario. Metrics like latency were not affected enough
      to trigger warnings.)
      
      Other smaller fixes:
      
      20s is already a sizable portion of 600s stats dump period, or 180s
      default max age to force re-scan, so added logic to ensure that (for
      each block cache) we don't spend more than 0.2% of our background thread
      time scanning it. Nevertheless, "foreground" requests for cache entry
      stats (calls to `db->GetMapProperty(DB::Properties::kBlockCacheEntryStats)`)
      are permitted to consume more CPU.
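      Assuming the 0.2% budget is enforced as a minimum gap between background scans (an interpretation of the description above, not confirmed by the commit), the arithmetic looks like this:

      ```cpp
      // Illustrative constant: cap background scanning at 0.2% of
      // background thread time per block cache.
      constexpr double kMaxBgScanFraction = 0.002;

      // Minimum seconds to wait after a scan lasting scan_secs before the
      // next background scan may start, so that scan time stays within the
      // budget. A 20s scan would imply a gap on the order of 10000s.
      inline double MinSecsBetweenBgScans(double scan_secs) {
        return scan_secs / kMaxBgScanFraction;
      }
      ```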
      
      Renamed field to cache_entry_stats_ to match code style.
      
      This change is intended for patching in 6.21 release.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8385
      
      Test Plan:
      unit test expanded to cover new logic (detect regression),
      some manual testing with db_bench
      
      Reviewed By: ajkr
      
      Differential Revision: D29042759
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 236faa902397f50038c618f50fbc8cf3f277308c
  7. Jun 8, 2021 (1 commit)
    • Fix a major performance bug in 6.21 for cache entry stats (#8369) · 2f93a3b8
      Committed by Peter Dillinger
      Summary:
      In final polishing of https://github.com/facebook/rocksdb/issues/8297 (after most manual testing), I
      broke my own caching layer by sanitizing an input parameter with
      std::min(0, x) instead of std::max(0, x). I resisted unit testing the
      timing part of the result caching because historically, such tests
      are either flaky or difficult to write, and this was not a correctness
      issue. This bug is essentially unnoticeable with a small number
      of column families but can explode background work with a
      large number of column families.
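      The bug pattern reduces to a one-word clamp mistake; a tiny illustration (function names are hypothetical):

      ```cpp
      #include <algorithm>
      #include <cstdint>

      // The buggy sanitization: std::min(0, x) always returns <= 0,
      // so the clamped value is never positive.
      inline int64_t BuggyClamp(int64_t x) { return std::min<int64_t>(0, x); }

      // The fix: std::max(0, x) clamps the value to be non-negative.
      inline int64_t FixedClamp(int64_t x) { return std::max<int64_t>(0, x); }
      ```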
      
      This change fixes the logical error, removes some unnecessary related
      optimization, and adds mock time/sleeps to the unit test to ensure we
      can cache hit within the age limit.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8369
      
      Test Plan: added time testing logic to existing unit test
      
      Reviewed By: ajkr
      
      Differential Revision: D28950892
      
      Pulled By: pdillinger
      
      fbshipit-source-id: e79cd4ff3eec68fd0119d994f1ed468c38026c3b
  8. Jun 2, 2021 (1 commit)
    • Fix "Interval WAL" bytes to say GB instead of MB (#8350) · 2655477c
      Committed by PiyushDatta
      Summary:
      Reference: https://github.com/facebook/rocksdb/issues/7201
      
      Before fix:
      `/tmp/rocksdb_test_file/LOG.old.1622492586055679:Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s`
      
      After fix:
      `/tmp/rocksdb_test_file/LOG:Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s`
      
      Tests:
      ```
      Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
      ETA: 0s Left: 0 AVG: 0.05s  local:0/7720/100%/0.0s
      rm -rf /dev/shm/rocksdb.CLRh
      /usr/bin/python3 tools/check_all_python.py
      No syntax errors in 34 .py files
      /usr/bin/python3 tools/ldb_test.py
      Running testCheckConsistency...
      .Running testColumnFamilies...
      .Running testCountDelimDump...
      .Running testCountDelimIDump...
      .Running testDumpLiveFiles...
      .Running testDumpLoad...
      Warning: 7 bad lines ignored.
      .Running testGetProperty...
      .Running testHexPutGet...
      .Running testIDumpBasics...
      .Running testIngestExternalSst...
      .Running testInvalidCmdLines...
      .Running testListColumnFamilies...
      .Running testManifestDump...
      .Running testMiscAdminTask...
      Sequence,Count,ByteSize,Physical Offset,Key(s)
      .Running testSSTDump...
      .Running testSimpleStringPutGet...
      .Running testStringBatchPut...
      .Running testTtlPutGet...
      .Running testWALDump...
      .
      ----------------------------------------------------------------------
      Ran 19 tests in 15.945s
      
      OK
      sh tools/rocksdb_dump_test.sh
      make check-format
      make[1]: Entering directory '/home/piydatta/Documents/rocksdb'
      $DEBUG_LEVEL is 1
      Makefile:176: Warning: Compiling in debug mode. Don't use the resulting binary in production
      build_tools/format-diff.sh -c
      Checking format of uncommitted changes...
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8350
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D28790567
      
      Pulled By: zhichao-cao
      
      fbshipit-source-id: dcb1e4c124361156435122f21f0a288335b2c8c8
  9. May 20, 2021 (1 commit)
    • Use deleters to label cache entries and collect stats (#8297) · 311a544c
      Committed by Peter Dillinger
      Summary:
      This change gathers and publishes statistics about the
      kinds of items in block cache. This is especially important for
      profiling relative usage of cache by index vs. filter vs. data blocks.
      It works by iterating over the cache during periodic stats dump
      (InternalStats, stats_dump_period_sec) or on demand when
      DB::Get(Map)Property(kBlockCacheEntryStats), except that for
      efficiency and sharing among column families, saved data from
      the last scan is used when the data is not considered too old.
      
      The new information can be seen in info LOG, for example:
      
          Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
          Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
      
      And also through DB::GetProperty and GetMapProperty (here using
      ldb just for demonstration):
      
          $ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
          rocksdb.block-cache-entry-stats.bytes.data-block: 0
          rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-block: 0
          rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.bytes.index-block: 178992
          rocksdb.block-cache-entry-stats.bytes.misc: 0
          rocksdb.block-cache-entry-stats.bytes.other-block: 0
          rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
          rocksdb.block-cache-entry-stats.capacity: 8388608
          rocksdb.block-cache-entry-stats.count.data-block: 0
          rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-block: 0
          rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
          rocksdb.block-cache-entry-stats.count.index-block: 215
          rocksdb.block-cache-entry-stats.count.misc: 1
          rocksdb.block-cache-entry-stats.count.other-block: 0
          rocksdb.block-cache-entry-stats.count.write-buffer: 0
          rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
          rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
          rocksdb.block-cache-entry-stats.percent.misc: 0.000000
          rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
          rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
          rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
          rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
      
      Solution detail - We need some way to flag what kind of blocks each
      entry belongs to, preferably without changing the Cache API.
      One of the complications is that Cache is a general interface that could
      have other users that don't adhere to whichever convention we decide
      on for keys and values. Or we would pay for an extra field in the Handle
      that would only be used for this purpose.
      
      This change uses a back-door approach, the deleter, to indicate the
      "role" of a Cache entry (in addition to the value type, implicitly).
      This has the added benefit of ensuring proper code origin whenever we
      recognize a particular role for a cache entry; if the entry came from
      some other part of the code, it will use an unrecognized deleter, which
      we simply attribute to the "Misc" role.
      
      An internal API makes for simple instantiation and automatic
      registration of Cache deleters for a given value type and "role".
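      The deleter-to-role mapping can be sketched with simplified types as follows; the registry shape and names are illustrative, not the internal RocksDB API:

      ```cpp
      #include <map>
      #include <string>

      // Simplified deleter signature; each value type/role pair gets its
      // own unique deleter function.
      using Deleter = void (*)(void* value);

      void DeleteDataBlock(void* v) { delete static_cast<std::string*>(v); }
      void DeleteFilterBlock(void* v) { delete static_cast<std::string*>(v); }

      // Registry mapping a deleter's address back to its role name.
      const std::map<Deleter, std::string>& RoleRegistry() {
        static const std::map<Deleter, std::string> registry = {
            {&DeleteDataBlock, "DataBlock"},
            {&DeleteFilterBlock, "FilterBlock"},
        };
        return registry;
      }

      // Unrecognized deleters (entries from other code) fall back to "Misc".
      std::string RoleOf(Deleter d) {
        auto it = RoleRegistry().find(d);
        return it == RoleRegistry().end() ? "Misc" : it->second;
      }
      ```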
      
      Another internal API, CacheEntryStatsCollector, solves the problem of
      caching the results of a scan and sharing them, to ensure scans are
      neither excessive nor redundant so as not to harm Cache performance.
      
      Because code is added to BlocklikeTraits, it is pulled out of
      block_based_table_reader.cc into its own file.
      
      This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
      (could still be added), and with actual stat gathering.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
      
      Test Plan: manual testing with db_bench, and a couple of basic unit tests
      
      Reviewed By: ltamasi
      
      Differential Revision: D28488721
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
  10. Apr 23, 2021 (1 commit)
    • Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176) · 01e460d5
      Committed by mrambacher
      Summary:
      This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code.  With this change, a `DBOptions` and a `ColumnFamilyOptions` can be reconstructed from their Mutable and Immutable equivalents, respectively.
      
      readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).
      
      There are still fields in the ImmutableCFOptions that are not CF options but DB options.  Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions).  But that will be part of a future PR to minimize changes and disruptions.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176
      
      Reviewed By: pdillinger
      
      Differential Revision: D27954339
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
  11. Apr 20, 2021 (1 commit)
    • Fix a data race related to DB properties (#8206) · 0c6e4674
      Committed by Levi Tamasi
      Summary:
      Historically, the DB properties `rocksdb.cur-size-active-mem-table`,
      `rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables` called
      the method `MemTable::ApproximateMemoryUsage` for mutable memtables,
      which is not safe without synchronization. This resulted in data races with
      memtable inserts. The patch changes the code handling these properties
      to use `MemTable::ApproximateMemoryUsageFast` instead, which returns a
      cached value backed by an atomic variable. Two test cases had to be updated
      for this change. `MemoryTest.MemTableAndTableReadersTotal` was fixed by
      increasing the value size used so each value ends up in its own memtable,
      which was the original intention (note: the test has been broken in the sense
      that the test code didn't consider that memtable sizes below 64 KB get
      increased to 64 KB by `SanitizeOptions`, and has been passing only by
      accident). `DBTest.MemoryUsageWithMaxWriteBufferSizeToMaintain` relies on
      completely up-to-date values and thus was changed to use `ApproximateMemoryUsage`
      directly instead of going through the DB properties. Note: this should be safe in this case
      since there's only a single thread involved.
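      The idea behind the "fast" variant can be sketched with an atomic counter; this is an illustration, not the actual `MemTable` code:

      ```cpp
      #include <atomic>
      #include <cstddef>

      // Writers keep an atomic byte counter up to date, so readers can
      // query memory usage without synchronizing with inserts (the value
      // may be slightly stale, but the read is race-free).
      class MemUsageTracker {
       public:
        void AddAllocated(size_t bytes) {
          allocated_.fetch_add(bytes, std::memory_order_relaxed);
        }
        // Analogous to ApproximateMemoryUsageFast: cheap, lock-free read.
        size_t ApproximateUsageFast() const {
          return allocated_.load(std::memory_order_relaxed);
        }

       private:
        std::atomic<size_t> allocated_{0};
      };
      ```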
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8206
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D27866811
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 7bd754d0565e0a65f1f7f0e78ffc093beef79394
  12. Mar 4, 2021 (1 commit)
    • Update compaction statistics to include the amount of data read from blob files (#8022) · cb25bc11
      Committed by Levi Tamasi
      Summary:
      The patch does the following:
      1) Exposes the amount of data (number of bytes) read from blob files from
      `BlobFileReader::GetBlob` / `Version::GetBlob`.
      2) Tracks the total number and size of blobs read from blob files during a
      compaction (due to garbage collection or compaction filter usage) in
      `CompactionIterationStats` and propagates this data to
      `InternalStats::CompactionStats` / `CompactionJobStats`.
      3) Updates the formulae for write amplification calculations to include the
      amount of data read from blob files.
      4) Extends the compaction stats dump with a new column `Rblob(GB)` and
      a new line containing the total number and size of blob files in the current
      `Version` to complement the information about the shape and size of the LSM tree
      that's already there.
      5) Updates `CompactionJobStats` so that the number of files and amount of data
      written by a compaction are broken down per file type (i.e. table/blob file).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022
      
      Test Plan: Ran `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D26801199
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
  13. Mar 3, 2021 (1 commit)
    • Break down the amount of data written during flushes/compactions per file type (#8013) · a46f080c
      Committed by Levi Tamasi
      Summary:
      The patch breaks down the "bytes written" (as well as the "number of output files")
      compaction statistics into two, so the values are logged separately for table files
      and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
      files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
      This will also come in handy for fixing the write amplification statistics, which currently
      do not consider the amount of data read from blob files during compaction. (This will
      be fixed by an upcoming patch.)
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013
      
      Test Plan: Ran `make check` and `db_bench`.
      
      Reviewed By: riversand963
      
      Differential Revision: D26742156
      
      Pulled By: ltamasi
      
      fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
  14. Jan 26, 2021 (1 commit)
    • Add a SystemClock class to capture the time functions of an Env (#7858) · 12f11373
      Committed by mrambacher
      Summary:
      Introduces and uses a SystemClock class to RocksDB.  This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
      
      Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead.  There are likely more places that can be changed, but this is a start to show what can/should be done.  Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
      
      There are several Env classes that implement these functions.  Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR.  It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
      
      Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
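      A minimal sketch of the `SystemClock` split, including the kind of mock clock mentioned above (the exact RocksDB interface may differ in detail):

      ```cpp
      #include <chrono>
      #include <cstdint>
      #include <thread>

      // Interface owning only the time-related functions, so they can be
      // overridden (e.g. mocked) independently of the rest of Env.
      class SystemClock {
       public:
        virtual ~SystemClock() = default;
        virtual uint64_t NowMicros() = 0;
        virtual void SleepForMicroseconds(int micros) = 0;
      };

      // Real clock backed by the standard library.
      class DefaultClock : public SystemClock {
       public:
        uint64_t NowMicros() override {
          return std::chrono::duration_cast<std::chrono::microseconds>(
                     std::chrono::steady_clock::now().time_since_epoch())
              .count();
        }
        void SleepForMicroseconds(int micros) override {
          std::this_thread::sleep_for(std::chrono::microseconds(micros));
        }
      };

      // Mock clock for tests: time advances only when told to.
      class MockClock : public SystemClock {
       public:
        uint64_t NowMicros() override { return now_; }
        void SleepForMicroseconds(int micros) override { now_ += micros; }

       private:
        uint64_t now_ = 0;
      };
      ```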
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
      
      Reviewed By: pdillinger
      
      Differential Revision: D26006406
      
      Pulled By: mrambacher
      
      fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
  15. Dec 20, 2020 (1 commit)
    • aggregated-table-properties with GetMapProperty (#7779) · 4d1ac19e
      Committed by Peter Dillinger
      Summary:
      So that we can more easily get aggregate live table data such
      as total filter, index, and data sizes.
      
      Also adds ldb support for getting properties
      
      Also fixed some missing/inaccurate related comments in db.h
      
      For example:
      
          $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties
          rocksdb.aggregated-table-properties.data_size: 102871
          rocksdb.aggregated-table-properties.filter_size: 0
          rocksdb.aggregated-table-properties.index_partitions: 0
          rocksdb.aggregated-table-properties.index_size: 2232
          rocksdb.aggregated-table-properties.num_data_blocks: 100
          rocksdb.aggregated-table-properties.num_deletions: 0
          rocksdb.aggregated-table-properties.num_entries: 15000
          rocksdb.aggregated-table-properties.num_merge_operands: 0
          rocksdb.aggregated-table-properties.num_range_deletions: 0
          rocksdb.aggregated-table-properties.raw_key_size: 288890
          rocksdb.aggregated-table-properties.raw_value_size: 198890
          rocksdb.aggregated-table-properties.top_level_index_size: 0
          $ ./ldb --db=testdb get_property rocksdb.aggregated-table-properties-at-level1
          rocksdb.aggregated-table-properties-at-level1.data_size: 80909
          rocksdb.aggregated-table-properties-at-level1.filter_size: 0
          rocksdb.aggregated-table-properties-at-level1.index_partitions: 0
          rocksdb.aggregated-table-properties-at-level1.index_size: 1787
          rocksdb.aggregated-table-properties-at-level1.num_data_blocks: 81
          rocksdb.aggregated-table-properties-at-level1.num_deletions: 0
          rocksdb.aggregated-table-properties-at-level1.num_entries: 12466
          rocksdb.aggregated-table-properties-at-level1.num_merge_operands: 0
          rocksdb.aggregated-table-properties-at-level1.num_range_deletions: 0
          rocksdb.aggregated-table-properties-at-level1.raw_key_size: 238210
          rocksdb.aggregated-table-properties-at-level1.raw_value_size: 163414
          rocksdb.aggregated-table-properties-at-level1.top_level_index_size: 0
          $
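      The property value arrives as a single newline-separated string of `name: value` lines. A minimal self-contained sketch of parsing it into a map (the sample lines are copied from the output above; this parser is illustrative and not part of the RocksDB API):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <sstream>
      #include <string>

      // Parse a multi-line property value such as the one shown above into
      // a name -> integer map. Each line has the form "<name>: <value>".
      std::map<std::string, uint64_t> ParseTableProperties(const std::string& text) {
        std::map<std::string, uint64_t> props;
        std::istringstream in(text);
        std::string line;
        while (std::getline(in, line)) {
          auto colon = line.find(':');
          if (colon == std::string::npos) continue;
          std::string name = line.substr(0, colon);
          // std::stoull skips the leading space after the colon.
          uint64_t value = std::stoull(line.substr(colon + 1));
          props[name] = value;
        }
        return props;
      }

      int main() {
        const std::string sample =
            "rocksdb.aggregated-table-properties.data_size: 102871\n"
            "rocksdb.aggregated-table-properties.index_size: 2232\n";
        auto props = ParseTableProperties(sample);
        assert(props.at("rocksdb.aggregated-table-properties.data_size") == 102871);
        assert(props.at("rocksdb.aggregated-table-properties.index_size") == 2232);
        return 0;
      }
      ```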
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7779
      
      Test Plan: Added a test to ldb_test.py
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D25653103
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 2905469a08a64dd6b5510cbd7be2e64d3234d6d3
      4d1ac19e
  16. 08 Dec, 2020 1 commit
  17. 13 Nov, 2020 1 commit
  18. 08 Oct, 2020 1 commit
    • L
      Introduce a blob file reader class (#7461) · 22655a39
      Levi Tamasi authored
      Summary:
      The patch adds a class called `BlobFileReader` that can be used to retrieve blobs
      using the information available in blob references (e.g. blob file number, offset, and
      size). This will come in handy when implementing blob support for `Get`, `MultiGet`,
      and iterators, and also for compaction/garbage collection.
      
      When a `BlobFileReader` object is created (using the factory method `Create`),
      it first checks whether the specified file is potentially valid by comparing the file
      size against the combined size of the blob file header and footer (files smaller than
      the threshold are considered malformed). Then, it opens the file, and reads and verifies
      the header and footer. The verification involves magic number/CRC checks
      as well as checking for unexpected header/footer fields, e.g. incorrect column family ID
      or TTL blob files.
      
      Blobs can be retrieved using `GetBlob`. `GetBlob` validates the offset and compression
      type passed by the caller (because of the presence of the header and footer, the
      specified offset cannot be too close to the start/end of the file; also, the compression type
      has to match the one in the blob file header), and retrieves and potentially verifies and
      uncompresses the blob. In particular, when `ReadOptions::verify_checksums` is set,
      `BlobFileReader` reads the blob record header as well (as opposed to just the blob itself)
      and verifies the key/value size, the key itself, as well as the CRC of the blob record header
      and the key/value pair.
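      A self-contained sketch of the size-based sanity checks described above (the header/footer size constants are illustrative placeholders, not the real on-disk sizes):

      ```cpp
      #include <cassert>
      #include <cstdint>

      // Illustrative sizes only; the actual blob file header/footer differ.
      constexpr uint64_t kHeaderSize = 30;
      constexpr uint64_t kFooterSize = 32;

      // A file smaller than header + footer cannot be a well-formed blob file.
      bool IsPotentiallyValidBlobFile(uint64_t file_size) {
        return file_size >= kHeaderSize + kFooterSize;
      }

      // A blob cannot start inside the header or extend into the footer.
      bool IsValidBlobOffset(uint64_t offset, uint64_t blob_size,
                             uint64_t file_size) {
        return offset >= kHeaderSize && blob_size <= file_size - kFooterSize &&
               offset <= file_size - kFooterSize - blob_size;
      }

      int main() {
        assert(!IsPotentiallyValidBlobFile(10));
        assert(IsPotentiallyValidBlobFile(1024));
        assert(IsValidBlobOffset(30, 100, 1024));
        assert(!IsValidBlobOffset(0, 100, 1024));    // overlaps the header
        assert(!IsValidBlobOffset(950, 100, 1024));  // runs into the footer
        return 0;
      }
      ```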
      
      In addition, the patch exposes the compression type from `BlobIndex` (both using an
      accessor and via `DebugString`), and adds a blob file read latency histogram to
      `InternalStats` that can be used with `BlobFileReader`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/7461
      
      Test Plan: `make check`
      
      Reviewed By: riversand963
      
      Differential Revision: D23999219
      
      Pulled By: ltamasi
      
      fbshipit-source-id: deb6b1160d251258b308d5156e2ec063c3e12e5e
      22655a39
  19. 15 Sep, 2020 1 commit
  20. 25 Jun, 2020 1 commit
    • Y
      First step towards handling MANIFEST write error (#6949) · e66199d8
      Yanqin Jin authored
      Summary:
      This PR provides preliminary support for handling IO error during MANIFEST write.
      File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then the next recovery attempt will process the version edit(s) and then fail since the SST files have already been deleted.
      One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
      If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
      Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will actually go through the MANIFEST write process, possibly batching the version edits of the other threads. When the leading MANIFEST writer finishes, all of the MANIFEST-writing threads in the batch will observe the same IOError. They will all call `ErrorHandler::SetBGError()`, which disables file deletion.
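      A toy state machine illustrating the disable/re-enable flow described above (this is a hypothetical sketch, not the real `ErrorHandler` class):

      ```cpp
      #include <cassert>

      // Sketch: a MANIFEST write error disables file deletions; a successful
      // Resume() (which switches to a fresh MANIFEST) re-enables them.
      class ErrorHandlerSketch {
       public:
        bool file_deletions_enabled() const { return deletions_enabled_; }
        void OnManifestWriteError() {
          deletions_enabled_ = false;
          has_bg_error_ = true;
        }
        bool Resume(bool new_manifest_write_ok) {
          if (!has_bg_error_) return true;
          if (!new_manifest_write_ok) return false;  // stay in error state
          has_bg_error_ = false;
          deletions_enabled_ = true;
          return true;
        }
       private:
        bool deletions_enabled_ = true;
        bool has_bg_error_ = false;
      };

      int main() {
        ErrorHandlerSketch h;
        h.OnManifestWriteError();
        assert(!h.file_deletions_enabled());
        assert(!h.Resume(/*new_manifest_write_ok=*/false));
        assert(!h.file_deletions_enabled());
        assert(h.Resume(/*new_manifest_write_ok=*/true));
        assert(h.file_deletions_enabled());
        return 0;
      }
      ```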
      
      Possible future directions:
      - Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
      
      Test plan (dev server):
      make check
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
      
      Reviewed By: anand1976
      
      Differential Revision: D22026020
      
      Pulled By: riversand963
      
      fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
      e66199d8
  21. 21 Feb, 2020 1 commit
    • S
      Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) · fdf882de
      sdong authored
      Summary:
      When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.
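      The mechanism can be sketched as follows. `ROCKSDB_NAMESPACE` is the macro introduced by this change (defaulted in `rocksdb/rocksdb_namespace.h`); the demo function below is hypothetical:

      ```cpp
      #include <cassert>

      // The namespace defaults to "rocksdb" but can be overridden at build
      // time, e.g. with -DROCKSDB_NAMESPACE=my_rocksdb, so two differently
      // built copies of the library can coexist in one binary.
      #ifndef ROCKSDB_NAMESPACE
      #define ROCKSDB_NAMESPACE rocksdb
      #endif

      namespace ROCKSDB_NAMESPACE {
      int MajorVersion() { return 6; }  // placeholder function for the demo
      }  // namespace ROCKSDB_NAMESPACE

      int main() {
        // Resolves to rocksdb::MajorVersion() unless the macro was overridden.
        assert(ROCKSDB_NAMESPACE::MajorVersion() == 6);
        return 0;
      }
      ```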
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
      
      Test Plan: Built the release, all, and jtest targets. Also tried building with ROCKSDB_NAMESPACE overridden to another value.
      
      Differential Revision: D19977691
      
      fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
      fdf882de
  22. 09 Jan, 2020 1 commit
  23. 08 Jan, 2020 1 commit
  24. 21 Sep, 2019 1 commit
  25. 07 Sep, 2019 1 commit
  26. 07 Jun, 2019 1 commit
  27. 01 Jun, 2019 1 commit
  28. 31 May, 2019 1 commit
  29. 19 Apr, 2019 1 commit
  30. 20 Mar, 2019 1 commit
    • Z
      Collect compaction stats by priority and dump to info LOG (#5050) · a291f3a1
      Zhongyi Xie authored
      Summary:
      In order to better understand compactions done by thread pools of different priorities, we now collect compaction stats by priority and also print them to the info LOG through the stats dump.
      
      ```
      ** Compaction Stats [default] **
      Priority    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
      -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       Low      0/0    0.00 KB   0.0     16.8    11.3      5.5       5.6      0.1       0.0   0.0    406.4    136.1     42.24             34.96        45    0.939     13M  8865K
      High      0/0    0.00 KB   0.0      0.0     0.0      0.0      11.4     11.4       0.0   0.0      0.0     76.2    153.00             35.74     12185    0.013       0      0
      ```
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/5050
      
      Differential Revision: D14408583
      
      Pulled By: miasantreble
      
      fbshipit-source-id: e53746586ea27cb8abc9fec35805bd80ed30f608
      a291f3a1
  31. 30 Jan, 2019 1 commit
  32. 06 Nov, 2018 1 commit
    • A
      Add DB property for SST files kept from deletion (#4618) · fffac43c
      Andrew Kryczka authored
      Summary:
      This property can help debug why SST files aren't being deleted. Previously we only had the property "rocksdb.is-file-deletions-enabled". However, even when that returned true, obsolete SSTs may still not be deleted due to the coarse-grained mechanism we use to prevent newly created SSTs from being accidentally deleted. That coarse-grained mechanism uses a lower bound file number for SSTs that should not be deleted, and this property exposes that lower bound.
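      A self-contained sketch of the coarse-grained guard described above (illustrative only, not the real deletion logic): an obsolete SST may only be deleted if its file number is below the lower bound this property exposes.

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <vector>

      // Keep any obsolete file whose number is at or above the lower bound,
      // since it may belong to a pending, not-yet-installed version.
      std::vector<uint64_t> FilesSafeToDelete(const std::vector<uint64_t>& obsolete,
                                              uint64_t min_obsolete_sst_number_to_keep) {
        std::vector<uint64_t> safe;
        for (uint64_t file_number : obsolete) {
          if (file_number < min_obsolete_sst_number_to_keep) {
            safe.push_back(file_number);
          }
        }
        return safe;
      }

      int main() {
        // Files 7 and 9 fall below the lower bound 10 and may be deleted;
        // file 12 is protected by the coarse-grained mechanism.
        auto safe = FilesSafeToDelete({7, 9, 12}, 10);
        assert(safe == std::vector<uint64_t>({7, 9}));
        return 0;
      }
      ```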
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/4618
      
      Differential Revision: D12898179
      
      Pulled By: ajkr
      
      fbshipit-source-id: fe68acc041ddbcc9276bbd48976524d95aafc776
      fffac43c
  33. 16 Jun, 2018 1 commit
  34. 29 May, 2018 1 commit
  35. 05 May, 2018 1 commit
  36. 19 Apr, 2018 1 commit
  37. 13 Apr, 2018 1 commit
  38. 12 Apr, 2018 1 commit
  39. 06 Mar, 2018 1 commit
  40. 02 Mar, 2018 1 commit
    • Y
      Add "rocksdb.live-sst-files-size" DB property · bf937cf1
      Yi Wu authored
      Summary:
      Add the "rocksdb.live-sst-files-size" DB property, which only includes files in the latest version. The existing "rocksdb.total-sst-files-size" includes files from all versions, and thus includes files that are obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
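      A minimal sketch of the distinction between the two properties (hypothetical file names and sizes; not RocksDB code):

      ```cpp
      #include <cassert>
      #include <cstdint>
      #include <map>
      #include <set>
      #include <string>

      // total-sst-files-size sums every tracked file, including
      // obsolete-but-undeleted ones.
      uint64_t TotalSize(const std::map<std::string, uint64_t>& all_files) {
        uint64_t total = 0;
        for (const auto& [name, size] : all_files) total += size;
        return total;
      }

      // live-sst-files-size sums only files in the latest version.
      uint64_t LiveSize(const std::map<std::string, uint64_t>& all_files,
                        const std::set<std::string>& live) {
        uint64_t total = 0;
        for (const auto& [name, size] : all_files) {
          if (live.count(name)) total += size;
        }
        return total;
      }

      int main() {
        std::map<std::string, uint64_t> files = {
            {"000007.sst", 4096}, {"000009.sst", 8192}, {"000012.sst", 1024}};
        // Only two of the files belong to the latest version; 000007.sst is
        // obsolete but not yet deleted.
        std::set<std::string> live = {"000009.sst", "000012.sst"};
        assert(TotalSize(files) == 13312);
        assert(LiveSize(files, live) == 9216);
        return 0;
      }
      ```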
      Closes https://github.com/facebook/rocksdb/pull/3548
      
      Differential Revision: D7116939
      
      Pulled By: yiwu-arbug
      
      fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
      bf937cf1