1. 06 May 2022, 3 commits
    • Use std::numeric_limits<> (#9954) · 49628c9a
      Committed by sdong
      Summary:
      Right now we still don't fully use std::numeric_limits<> but rely on a macro, mainly to support VS 2013. Since we now only support VS 2017 and up, that is no longer a problem. The code comment claims that MinGW still needs the macro, but we don't have a CI running MinGW, so it's hard to validate; since we now require C++17, it's hard to imagine a MinGW that can still build RocksDB yet doesn't support std::numeric_limits<>.
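      A minimal sketch of the idiom this change standardizes on (the macro name below is hypothetical, for illustration only):

      ```
      #include <cstdint>
      #include <limits>

      // Before: a portability macro (hypothetical name), kept for VS 2013.
      // #define ROCKSDB_MAX_UINT64 ...

      // After: the standard facility, available on any toolchain meeting the
      // C++17 baseline RocksDB now requires.
      constexpr uint64_t kMax = std::numeric_limits<uint64_t>::max();
      static_assert(kMax == UINT64_MAX, "sanity check");
      ```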
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
      
      Test Plan: See CI Runs.
      
      Reviewed By: riversand963
      
      Differential Revision: D36173954
      
      fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
    • platform010 gcc (#9946) · 46f8889b
      Committed by sdong
      Summary:
      Make platform010 gcc build work.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9946
      
      Test Plan:
      ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 make release -j
      ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 make all check -j
      
      Reviewed By: pdillinger, mdcallag
      
      Differential Revision: D36152684
      
      fbshipit-source-id: ca7b0916c51501a72bb15ad33a85e8c5cac5b505
    • Generate pkg-config file via CMake (#9945) · e62c23cc
      Committed by Trynity Mirell
      Summary:
      Fixes https://github.com/facebook/rocksdb/issues/7934
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9945
      
      Test Plan:
      Built via Homebrew pointing to my fork/branch:
      
      ```
        ~/src/github.com/facebook/fbthrift on main ❯ cat ~/.homebrew/opt/rocksdb/lib/pkgconfig/rocksdb.pc
      prefix="/Users/trynity/.homebrew/Cellar/rocksdb/HEAD-968e4dd"
      exec_prefix="${prefix}"
      libdir="${prefix}/lib"
      includedir="${prefix}/include"
      
      Name: rocksdb
      Description: An embeddable persistent key-value store for fast storage
      URL: https://rocksdb.org/
      Version: 7.3.0
      Cflags: -I"${includedir}"
      Libs: -L"${libdir}" -lrocksdb
      ```
      
      Reviewed By: riversand963
      
      Differential Revision: D36161635
      
      Pulled By: trynity
      
      fbshipit-source-id: 0f1a9c30e43797ee65e6696896e06fde0658456e
  2. 05 May 2022, 5 commits
    • Rename kRemoveWithSingleDelete to kPurge (#9951) · 9d634dd5
      Committed by Yanqin Jin
      Summary:
      PR 9929 adds a new CompactionFilter::Decision, i.e.
      kRemoveWithSingleDelete so that CompactionFilter can indicate to
      CompactionIterator that a PUT can only be removed with SD. However, how
      CompactionIterator handles such a key is an implementation detail that
      should not be implied in the public API. In fact,
      such a PUT can just be dropped. This is an optimization which we will apply in the near future.
      
      Discussion thread: https://github.com/facebook/rocksdb/pull/9929#discussion_r863198964
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9951
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D36156590
      
      Pulled By: riversand963
      
      fbshipit-source-id: 7b7d01f47bba4cad7d9cca6ca52984f27f88b372
    • Printing IO Error in DumpDBFileSummary (#9940) · 68ac507f
      Committed by sdong
      Summary:
      Right now DumpDBFileSummary doesn't print out IO errors, but they are sometimes helpful. Print them out instead.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9940
      
      Test Plan: Watch existing tests to pass.
      
      Reviewed By: riversand963
      
      Differential Revision: D36113016
      
      fbshipit-source-id: 13002080fa4dc76589e2c1c5a1079df8a3c9391c
    • Print elapsed time and number of operations completed (#9886) · bf68d1c9
      Committed by Mark Callaghan
      Summary:
      This is inspired by debugging a regression test that runs for ~0.05 seconds; the short
      running time makes it prone to variance. While db_bench ran for ~60 seconds, 59.95 seconds
      were spent opening 128 databases (and doing recovery), so it was easy to miss that the
      benchmark itself only ran for 0.05 seconds.

      Normally I add output to the end of the line to make life easier for existing tools that parse it,
      but in this case the output near the end of the line has two optional parts, and one of the
      optional parts adds an extra newline.
      
      This is for https://github.com/facebook/rocksdb/issues/9856
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9886
      
      Test Plan:
      ./db_bench --benchmarks=overwrite,readrandom --num=1000000 --threads=4
      
      old output:
       DB path: [/tmp/rocksdbtest-2260/dbbench]
       overwrite    :      14.108 micros/op 283338 ops/sec;   31.3 MB/s
       DB path: [/tmp/rocksdbtest-2260/dbbench]
       readrandom   :       7.994 micros/op 496788 ops/sec;   55.0 MB/s (1000000 of 1000000 found)
      
      new output:
       DB path: [/tmp/rocksdbtest-2260/dbbench]
       overwrite    :      14.117 micros/op 282862 ops/sec 14.141 seconds 4000000 operations;   31.3 MB/s
       DB path: [/tmp/rocksdbtest-2260/dbbench]
       readrandom   :       8.649 micros/op 458475 ops/sec 8.725 seconds 4000000 operations;   49.8 MB/s (981548 of 1000000 found)
      
      Reviewed By: ajkr
      
      Differential Revision: D36102269
      
      Pulled By: mdcallag
      
      fbshipit-source-id: 5cd8a9e11f5cbe2a46809571afd83335b6b0caa0
    • do not call DeleteFile for not-created sst files (#9920) · 95663ff7
      Committed by jsteemann
      Summary:
      When a memtable flush would produce a 0-byte .sst file, RocksDB does not
      write the empty .sst file out to disk. However, it still calls
      Env::DeleteFile() on the file as part of the cleanup procedure at the
      end of BuildTable(). Because the to-be-deleted file never existed, this
      forced implementors of the DeleteFile() API to check for the file's
      existence in their own code, or otherwise risk running into PathNotFound
      errors when DeleteFile() is invoked on non-existing files.
      This PR fixes the situation so that when no .sst file is created,
      DeleteFile() will not be called either; see the sketch below.
      TableFileCreationStarted() will still be called as before.
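      As a sketch, this is the kind of workaround (hypothetical Env wrapper) that implementors needed before the fix:

      ```
      #include <string>

      #include "rocksdb/env.h"
      #include "rocksdb/status.h"

      // Hypothetical custom Env illustrating the pre-fix workaround.
      class TolerantEnv : public rocksdb::EnvWrapper {
       public:
        explicit TolerantEnv(rocksdb::Env* base) : rocksdb::EnvWrapper(base) {}

        rocksdb::Status DeleteFile(const std::string& fname) override {
          rocksdb::Status s = target()->DeleteFile(fname);
          // Swallow the error for never-created files. With this PR,
          // DeleteFile() is no longer invoked for them, so this special
          // case becomes unnecessary.
          if (s.IsPathNotFound()) {
            return rocksdb::Status::OK();
          }
          return s;
        }
      };
      ```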
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9920
      
      Reviewed By: ajkr
      
      Differential Revision: D36107102
      
      Pulled By: riversand963
      
      fbshipit-source-id: 15881ba3fa3192dd448f906280a1cfc7a68a114a
    • Fix a comment in RateLimiter::RequestToken (#9933) · de537dca
      Committed by Hui Xiao
      Summary:
      **Context/Summary:**
      - As titled
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9933
      
      Test Plan: - No code change
      
      Reviewed By: ajkr
      
      Differential Revision: D36086544
      
      Pulled By: hx235
      
      fbshipit-source-id: 2bdd19f67e45df1e3af4121b0c1a5e866a57826d
  3. 04 May 2022, 7 commits
    • Default `try_load_options` to true when DB is specified (#9937) · 270179bb
      Committed by Jay Zhuang
      Summary:
      If the DB path is specified, the user would expect ldb to load the
      options from that path, but it doesn't:
      ```
      $ ldb list_live_files_metadata --db=`pwd`
      ```
      Default `try_load_options` to true in that case. The user can still
      disable that by:
      ```
      $ ldb list_live_files_metadata --db=`pwd` --try_load_options=false
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9937
      
      Test Plan:
      `ldb list_live_files_metadata --db=`pwd`` is able to work for
      a db generated with different options.num_levels.
      
      Reviewed By: ajkr
      
      Differential Revision: D36106708
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 2732fdc027a4d172436b2c9b6a9787b56b10c710
    • Reduce comparator objects init cost in BlockIter (#9611) · 8b74cea7
      Committed by Xinyu Zeng
      Summary:
      This PR solves the problem discussed in https://github.com/facebook/rocksdb/issues/7149. By storing a pointer to the InternalKeyComparator as icmp_ in BlockIter, the object size remains the same, and each call to CompareCurrentKey no longer needs to create Comparator objects: one can use icmp_ directly, or the user comparator obtained from icmp_.
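      A simplified sketch of the pattern (names abbreviated; not the actual BlockIter code):

      ```
      #include "rocksdb/comparator.h"
      #include "rocksdb/slice.h"

      class BlockIterSketch {
       public:
        // The comparator outlives the iterator, so storing a pointer keeps the
        // object size unchanged while avoiding per-comparison construction.
        explicit BlockIterSketch(const rocksdb::Comparator* icmp) : icmp_(icmp) {}

        int CompareCurrentKey(const rocksdb::Slice& other) const {
          // No temporary comparator object is created; the call goes straight
          // through the stored pointer.
          return icmp_->Compare(key_, other);
        }

       private:
        const rocksdb::Comparator* icmp_;  // not owned
        rocksdb::Slice key_;               // current entry's key
      };
      ```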
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9611
      
      Test Plan:
      with https://github.com/facebook/rocksdb/issues/9903,
      
      ```
      $ TEST_TMPDIR=/dev/shm python3.6 ../benchmark/tools/compare.py benchmarks ./db_basic_bench ../rocksdb-pr9611/db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1 --benchmark_repetitions=50
      ...
      Comparing ./db_basic_bench to ../rocksdb-pr9611/db_basic_bench
      Benchmark                                                                                                                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
      ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      ...
      DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_pvalue                 0.0001          0.0001      U Test, Repetitions: 50 vs 50
      DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_mean                  -0.0483         -0.0483          3924          3734          3924          3734
      DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_median                -0.0713         -0.0713          3971          3687          3970          3687
      DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_stddev                -0.0342         -0.0344           225           217           225           217
      DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_cv                    +0.0148         +0.0146             0             0             0             0
      OVERALL_GEOMEAN                                                                                                                                                      -0.0483         -0.0483             0             0             0             0
      ```
      
      Reviewed By: akankshamahajan15
      
      Differential Revision: D35882037
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9e5337bbad8f1239dff7aa9f6549020d599bfcdf
    • Improve comments to options.allow_mmap_reads (#9936) · b82edffc
      Committed by Siying Dong
      Summary:
      It confused users that, with options.allow_mmap_reads = true, CPU usage is high due to checksum verification. Add a comment to explain it.
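      For illustration, a minimal sketch of the configuration the new comment targets:

      ```
      #include "rocksdb/options.h"

      rocksdb::Options MakeMmapReadOptions() {
        rocksdb::Options options;
        // Per the new comment: with mmap reads enabled, CPU usage can be high
        // because reads still go through checksum verification.
        options.allow_mmap_reads = true;
        return options;
      }
      ```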
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9936
      
      Reviewed By: anand1976
      
      Differential Revision: D36106529
      
      fbshipit-source-id: 3d723bd686f96a84c694c8b2d91ad28d9ccfd979
    • db_basic_bench fix for DB object cleanup (#9939) · 440c7f63
      Committed by Andrew Kryczka
      Summary:
      Use `unique_ptr<DB>` to make sure the DB object is deleted. Previously it was not, which led to accumulating file descriptors for deleted directories because a `DBImpl::db_dir_` from each test remained alive.
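      A minimal sketch of the fix, using a hypothetical benchmark helper:

      ```
      #include <cassert>
      #include <memory>
      #include <string>

      #include "rocksdb/db.h"

      void RunOneBenchmark(const std::string& path) {
        rocksdb::Options options;
        options.create_if_missing = true;
        rocksdb::DB* raw = nullptr;
        rocksdb::Status s = rocksdb::DB::Open(options, path, &raw);
        assert(s.ok());
        std::unique_ptr<rocksdb::DB> db(raw);
        // ... benchmark body using db.get() ...
      }  // DB deleted here, releasing the FD held by DBImpl::db_dir_
      ```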
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9939
      
      Test Plan: run `lsof -p $(pidof db_basic_bench)` while benchmark runs; verify no FDs for deleted directories.
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D36108761
      
      Pulled By: ajkr
      
      fbshipit-source-id: cfe02646b038a445af7d5db8989eb1f40d658359
    • Fork and simplify LRUCache for developing enhancements (#9917) · bb87164d
      Committed by Peter Dillinger
      Summary:
      To support a project to prototype and evaluate algorithmic
      enhancements and alternatives to LRUCache, here I have separated out
      LRUCache into internal-only "FastLRUCache" and cut it down to
      essentials, so that details like secondary cache handling and
      priorities do not interfere with prototyping. These can be
      re-integrated later as needed, along with refactoring to minimize code
      duplication (which would slow down prototyping for now).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9917
      
      Test Plan:
      unit tests updated to ensure basic functionality has (likely)
      been preserved
      
      Reviewed By: anand1976
      
      Differential Revision: D35995554
      
      Pulled By: pdillinger
      
      fbshipit-source-id: d67b20b7ada3b5d3bfe56d897a73885894a1d9db
    • Fix db_crashtest.py call inconsistency in crash_test.mk (#9935) · 4b9a1a2f
      Committed by Peter Dillinger
      Summary:
      Some tests were crashing because they were not using the custom DB_STRESS_CMD.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9935
      
      Test Plan: internal tests
      
      Reviewed By: riversand963
      
      Differential Revision: D36104347
      
      Pulled By: pdillinger
      
      fbshipit-source-id: 23f080704a124174203f54ffd85578c2047effe5
    • Make --benchmarks=flush flush the default column family (#9887) · b6ec3328
      Committed by Mark Callaghan
      Summary:
      db_bench --benchmarks=flush wasn't flushing the default column family.
      
      This is for https://github.com/facebook/rocksdb/issues/9880
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9887
      
      Test Plan:
      Confirm that flush works (*.log is empty) when "flush" is added to the benchmark list,
      and that *.log is not empty otherwise.

      Repeat for all combinations of: uses column families, uses multiple databases.
      
      ./db_bench --benchmarks=overwrite --num=10000
      ls -lrt /tmp/rocksdbtest-2260/dbbench/*.log
      -rw-r--r-- 1 me users 1380286 Apr 21 10:47 /tmp/rocksdbtest-2260/dbbench/000004.log
      
      ./db_bench --benchmarks=overwrite,flush --num=10000
      ls -lrt /tmp/rocksdbtest-2260/dbbench/*.log
       -rw-r--r-- 1 me users 0 Apr 21 10:48 /tmp/rocksdbtest-2260/dbbench/000008.log
      
      ./db_bench --benchmarks=overwrite --num=10000 --num_column_families=4
      ls -lrt /tmp/rocksdbtest-2260/dbbench/*.log
        -rw-r--r-- 1 me users 1387823 Apr 21 10:49 /tmp/rocksdbtest-2260/dbbench/000004.log
      
      ./db_bench --benchmarks=overwrite,flush --num=10000 --num_column_families=4
      ls -lrt /tmp/rocksdbtest-2260/dbbench/*.log
      -rw-r--r-- 1 me users 0 Apr 21 10:51 /tmp/rocksdbtest-2260/dbbench/000014.log
      
      ./db_bench --benchmarks=overwrite --num=10000 --num_multi_db=2
      ls -lrt /tmp/rocksdbtest-2260/dbbench/[01]/*.log
       -rw-r--r-- 1 me users 1380838 Apr 21 10:55 /tmp/rocksdbtest-2260/dbbench/0/000004.log
       -rw-r--r-- 1 me users 1379734 Apr 21 10:55 /tmp/rocksdbtest-2260/dbbench/1/000004.log
      
      ./db_bench --benchmarks=overwrite,flush --num=10000 --num_multi_db=2
      ls -lrt /tmp/rocksdbtest-2260/dbbench/[01]/*.log
      -rw-r--r-- 1 me users 0 Apr 21 10:57 /tmp/rocksdbtest-2260/dbbench/0/000013.log
      -rw-r--r-- 1 me users 0 Apr 21 10:57 /tmp/rocksdbtest-2260/dbbench/1/000013.log
      
      ./db_bench --benchmarks=overwrite --num=10000 --num_column_families=4 --num_multi_db=2
      ls -lrt /tmp/rocksdbtest-2260/dbbench/[01]/*.log
      -rw-r--r-- 1 me users 1395108 Apr 21 10:52 /tmp/rocksdbtest-2260/dbbench/1/000004.log
      -rw-r--r-- 1 me users 1380411 Apr 21 10:52 /tmp/rocksdbtest-2260/dbbench/0/000004.log
      
      ./db_bench --benchmarks=overwrite,flush --num=10000 --num_column_families=4 --num_multi_db=2
      ls -lrt /tmp/rocksdbtest-2260/dbbench/[01]/*.log
      -rw-r--r-- 1 me users 0 Apr 21 10:54 /tmp/rocksdbtest-2260/dbbench/0/000022.log
      -rw-r--r-- 1 me users 0 Apr 21 10:54 /tmp/rocksdbtest-2260/dbbench/1/000022.log
      
      Reviewed By: ajkr
      
      Differential Revision: D36026777
      
      Pulled By: mdcallag
      
      fbshipit-source-id: d42d3d7efceea7b9a25bbbc0f04461d2b7301122
  4. 03 May 2022, 4 commits
    • Remove ifdef for try_emplace after upgrading to c++17 (#9932) · 2b5df21e
      Committed by Yanqin Jin
      Summary:
      Test plan: make check
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9932
      
      Reviewed By: ajkr
      
      Differential Revision: D36085404
      
      Pulled By: riversand963
      
      fbshipit-source-id: 2ece14ca0e2e4c1288339ff79e7e126b76eaf786
    • Allow consecutive SingleDelete() in stress/crash test (#9930) · cda34dd6
      Committed by Andrew Kryczka
      Summary:
      We need to support consecutive SingleDelete(), so this PR adds it to the stress/crash tests.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9930
      
      Test Plan: `python3 tools/db_crashtest.py blackbox --simple --nooverwritepercent=50 --writepercent=90 --delpercent=10 --readpercent=0 --prefixpercent=0 --delrangepercent=0 --iterpercent=0 --max_key=1000000 --duration=3600 --interval=10 --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --value_size_mult=33`
      
      Reviewed By: riversand963
      
      Differential Revision: D36081863
      
      Pulled By: ajkr
      
      fbshipit-source-id: 3566cdbaed375b8003126fc298968eb1a854317f
    • Fix a bug of CompactionIterator/CompactionFilter using `Delete` (#9929) · 06394ff4
      Committed by Yanqin Jin
      Summary:
      When the compaction filter determines that a key should be removed, it updates the internal key's type
      to `Delete`. If this internal key is preserved in the current compaction but seen by a later compaction
      together with `SingleDelete`, it will cause the compaction iterator to return Corruption.

      To fix the issue, the compaction filter should return more information in addition to the intention of removing
      a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
      `kRemoveWithSingleDelete`, the compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.
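      A hedged sketch of a compaction filter using the new decision (ShouldSingleDelete() is a stand-in for application logic; the decision is later renamed to kPurge by #9951 above):

      ```
      #include <string>

      #include "rocksdb/compaction_filter.h"

      class SDAwareFilter : public rocksdb::CompactionFilter {
       public:
        Decision FilterV2(int /*level*/, const rocksdb::Slice& key,
                          ValueType value_type, const rocksdb::Slice& /*value*/,
                          std::string* /*new_value*/,
                          std::string* /*skip_until*/) const override {
          if (value_type == ValueType::kValue && ShouldSingleDelete(key)) {
            // This PUT may only be cancelled with SingleDelete; the compaction
            // iterator rewrites the op type to kTypeSingleDelete.
            return Decision::kRemoveWithSingleDelete;
          }
          return Decision::kKeep;
        }

        const char* Name() const override { return "SDAwareFilter"; }

       private:
        // Hypothetical application predicate.
        bool ShouldSingleDelete(const rocksdb::Slice&) const { return false; }
      };
      ```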
      
      In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
      reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
      when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.
      
      Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
      not quite work with `MultiOpsTxnsStressTest`.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929
      
      Test Plan:
      make check
      make crash_test
      make crash_test_with_txn
      
      Reviewed By: anand1976
      
      Differential Revision: D36069678
      
      Pulled By: riversand963
      
      fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
    • Specify largest_seqno in VerifyChecksum (#9919) · 37f49083
      Committed by Changyu Bi
      Summary:
      `VerifyChecksum()` does not specify `largest_seqno` when creating a `TableReader`. As a result, the `TableReader` uses the `TableReaderOptions` default value (0) for `largest_seqno`. This causes the following error when the file has a nonzero global seqno in its properties:
      ```
      Corruption: An external sst file with version 2 have global seqno property with value , while largest seqno in the file is 0
      ```
      This PR fixes this by specifying `largest_seqno` in `VerifyChecksumInternal` with `largest_seqno` from the file metadata.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9919
      
      Test Plan: `make check`
      
      Reviewed By: ajkr
      
      Differential Revision: D36028824
      
      Pulled By: cbi42
      
      fbshipit-source-id: 428d028a79386f46ef97bb6b6051dc76c83e1f2b
  5. 29 Apr 2022, 2 commits
    • Enforce the contract of SingleDelete (#9888) · 2b5c29f9
      Committed by Yanqin Jin
      Summary:
      Enforce the contract of SingleDelete so that it is not mixed with
      Delete for the same key; otherwise, it leads to undefined behavior.
      See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes.
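      A minimal usage sketch of the contract being enforced:

      ```
      #include <cassert>

      #include "rocksdb/db.h"

      void SingleDeleteContract(rocksdb::DB* db) {
        rocksdb::WriteOptions wo;
        assert(db->Put(wo, "key", "value").ok());
        // OK: exactly one Put() for "key" since the last deletion.
        assert(db->SingleDelete(wo, "key").ok());
        // NOT OK (the undefined behavior this PR guards against): also
        // issuing db->Delete(wo, "key") for a key managed with SingleDelete().
      }
      ```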
      
      Also fix unit tests and write-unprepared.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888
      
      Test Plan: make check
      
      Reviewed By: ajkr
      
      Differential Revision: D35837817
      
      Pulled By: riversand963
      
      fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635
    • Update protection info on recovered logs data (#9875) · aafb377b
      Committed by Anvesh Komuravelli
      Summary:
      Update protection info on recovered logs data
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9875
      
      Test Plan:
      - Benchmark setup: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000`
      - Benchmark command: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true`
      - Results before this PR
      ```
      OpenDb:     2350.14 milliseconds
      OpenDb:     2296.94 milliseconds
      OpenDb:     2184.29 milliseconds
      OpenDb:     2167.59 milliseconds
      OpenDb:     2231.24 milliseconds
      OpenDb:     2109.57 milliseconds
      OpenDb:     2197.71 milliseconds
      OpenDb:     2120.8 milliseconds
      OpenDb:     2148.12 milliseconds
      OpenDb:     2207.95 milliseconds
      ```
      - Results after this PR
      ```
      OpenDb:     2424.52 milliseconds
      OpenDb:     2359.84 milliseconds
      OpenDb:     2317.68 milliseconds
      OpenDb:     2339.4 milliseconds
      OpenDb:     2325.36 milliseconds
      OpenDb:     2321.06 milliseconds
      OpenDb:     2353.98 milliseconds
      OpenDb:     2344.64 milliseconds
      OpenDb:     2384.09 milliseconds
      OpenDb:     2428.58 milliseconds
      ```
      
      Mean regressed 7.2% (2201.4 -> 2359.9)
      
      Reviewed By: ajkr
      
      Differential Revision: D36012787
      
      Pulled By: akomurav
      
      fbshipit-source-id: d2aba09f29c6beb2fd0fe8e1e359be910b4ef02a
  6. 28 Apr 2022, 3 commits
    • Fix bug in async_io path which reads incorrect length (#9916) · fce65e7e
      Committed by Akanksha Mahajan
      Summary:
      In FilePrefetchBuffer, when data overlapped between two buffers and more
      data needed to be read and copied to a third buffer, an incorrect length
      was updated, resulting in:
      ```
      Iterator diverged from control iterator which has value 00000000000310C3000000000000012B0000000000000274 total_order_seek: 1 auto_prefix_mode: 0 S 000000000002C37F000000000000012B000000000000001C NNNPPPPPNN; total_order_seek: 1 auto_prefix_mode: 0 S 000000000002F10B00000000000000BF78787878787878 NNNPNNNNPN; total_order_seek: 1 auto_prefix_mode: 0 S 00000000000310C3000000000000012B000000000000026B
      iterator is not valid
      Control CF default
      db_stress: db_stress_tool/db_stress_test_base.cc:1388: void rocksdb::StressTest::VerifyIterator(rocksdb::ThreadState*, rocksdb::ColumnFamilyHandle*, const rocksdb::ReadOptions&, rocksdb::Iterator*, rocksdb::Iterator*, rocksdb::StressTest::LastIterateOp, const rocksdb::Slice&, const string&, bool*): Assertion `false' failed.
      Aborted (core dumped)
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9916
      
      Test Plan:
      ```
      - CircleCI jobs
      - Ran db_stress with OPTIONS file which caught the bug
       ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=42.26248932628998 --bottommost_compression_type=disable --cache_index_and_filter_blocks=0 --cache_size=8388608 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_buffer_bytes=1073741823 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/ --db_write_buffer_size=134217728 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --fail_if_options_file_error=0 --file_checksum_impl=none --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=12 --index_type=2 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=0 --mark_for_compaction_one_file_in=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_prefix_bloom_size_ratio=0.001 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --read_only=0 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --reserve_table_reader_memory=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_catch_up_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_blob_db=0 --use_block_based_filter=0 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=0 --use_txn=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35 --options_file=/home/akankshamahajan/OPTIONS.orig -column_families=1
      
      db_bench with async_io enabled to make sure db_bench completes successfully without any failure.
      - ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1
      ```
      
      crash_test in progress
      
      Reviewed By: anand1976
      
      Differential Revision: D35985789
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: 5abe185f34caa99ca587d4bdc8954bd0802b1bf9
    • Improve stress test for MultiOpsTxnsStressTest (#9829) · 94e245a1
      Committed by Yanqin Jin
      Summary:
      Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions.
      
      1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw.
      2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true.
      3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829
      
      Test Plan:
      ```
      TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn
      ```
      
      Reviewed By: pdillinger
      
      Differential Revision: D35806678
      
      Pulled By: riversand963
      
      fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616
    • Fix locktree accesses to PessimisticTransactions (#9898) · d9d456de
      Committed by Herman Lee
      Summary:
      The current locktree implementation stores the address of the
      PessimisticTransactions object as the TXNID. However, when a transaction
      is blocked on a lock, it records the list of waitees with conflicting
      locks using the rocksdb assigned TransactionID. This is performed by
      calling GetID() on PessimisticTransactions objects of the waitees,
      and then recorded in the waiter's list.
      
      However, there is no guarantee the objects are valid when recording the
      waitee list during the conflict callbacks because the waitee
      could have released the lock and freed the PessimisticTransactions
      object.
      
      The waitee/txnid values only correspond to valid PessimisticTransaction
      objects while the mutex for the root of the locktree is held.
      
      The simplest fix for this problem is to use the address of the
      PessimisticTransaction as the TransactionID so that it is consistent
      with its usage in the locktree. The TXNID is only converted back to a
      PessimisticTransaction for the report_wait callbacks. Since
      these callbacks are now all made within the critical section where the
      lock_request queue mutex is held, these conversions will be safe.
      Otherwise, only the uint64_t TXNID of the waitee is registered
      with the waiter transaction. The PessimisticTransaction object of the
      waitee is never referenced.
      
      The main downside of this approach is the TransactionID will not change
      if the PessimisticTransaction object is reused for new transactions.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9898
      
      Test Plan:
      Add a new test case and run unit tests.
      Also verified with MyRocks workloads using range locks that the
      crash no longer happens.
      
      Reviewed By: riversand963
      
      Differential Revision: D35950376
      
      Pulled By: hermanlee
      
      fbshipit-source-id: 8c9cae272e23e487fc139b6a8ed5b8f8f24b1570
  7. 27 Apr 2022, 5 commits
    • RocksDB: fix bug in crash-recovery correctness testing (#9897) · 68ee228d
      Committed by Paras Sethia
      Summary:
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9897
      
      Fixes https://github.com/facebook/rocksdb/issues/9385.
      
      Update State to reflect the value in the DB after a crash
      
      Reviewed By: ajkr
      
      Differential Revision: D35788808
      
      fbshipit-source-id: 2d21d8537ab380a17cad3e90ac72b3eb1b56de9f
    • Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) · 9d0cae71
      Committed by Peter Dillinger
      Summary:
      When MultiGet() determines that multiple query keys can be
      served by examining the same data block in block cache (one Lookup()),
      each PinnableSlice referring to data in that data block needs to hold
      on to the block in cache so that they can be released at arbitrary
      times by the API user. Historically this is accomplished with extra
      calls to Ref() on the Handle from Lookup(), with each PinnableSlice
      cleanup calling Release() on the Handle, but this creates extra
      contention on the block cache for the extra Ref()s and Release()es,
      especially because they hit the same cache shard repeatedly.
      
      In the case of merge operands (possibly more cases?), the problem was
      compounded by doing an extra Ref()+eventual Release() for each merge
      operand for a key reusing a block (which could be the same key!), rather
      than one Ref() per key. (Note: the non-shared case with `biter` was
      already one per key.)
      
      This change optimizes MultiGet not to rely on these extra, contentious
      Ref()+Release() calls by instead, in the shared block case, wrapping
      the cache Release() cleanup in a refcounted object referenced by the
      PinnableSlices, such that after the last wrapped reference is released,
      the cache entry is Release()ed. Relaxed atomic refcounts should be
      much faster than mutex-guarded Ref() and Release(), and much less prone
      to a performance cliff when MultiGet() does a lot of block sharing.
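      A simplified sketch of the refcounted-cleanup idea (not the actual SharedCleanablePtr implementation):

      ```
      #include <atomic>
      #include <cstdint>
      #include <functional>

      // The last dropped reference runs the wrapped cleanup exactly once,
      // e.g. a block cache Release(handle).
      class SharedCleanupSketch {
       public:
        explicit SharedCleanupSketch(std::function<void()> cleanup)
            : refs_(1), cleanup_(std::move(cleanup)) {}

        void Ref() { refs_.fetch_add(1, std::memory_order_relaxed); }

        void Unref() {
          // acquire/release so the cleanup observes all prior writes.
          if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            cleanup_();
            delete this;
          }
        }

       private:
        std::atomic<uint32_t> refs_;
        std::function<void()> cleanup_;
      };
      ```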
      
      Note that I did not use std::shared_ptr, because that would require an
      extra indirection object (shared_ptr itself new/delete) in order to
      associate a ref increment/decrement with a Cleanable cleanup entry. (If
      I assumed it was the size of two pointers, I could do some hackery to
      make it work without the extra indirection, but that's too fragile.)
      
      Some details:
      * Fixed (removed) extra block cache tracing entries in cases of cache
      entry reuse in MultiGet, but it's likely that in some other cases traces
      are missing (XXX comment inserted)
      * Moved existing implementations for cleanable.h from iterator.cc to
      new cleanable.cc
      * Improved API comments on Cleanable
      * Added a public SharedCleanablePtr class to cleanable.h in case others
      could benefit from the same pattern (potentially many Cleanables and/or
      smart pointers referencing a shared Cleanable)
      * Add a typedef for MultiGetContext::Mask
      * Some variable renaming for clarity
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899
      
      Test Plan:
      Added unit tests for SharedCleanablePtr.
      
      Greatly enhanced ability of existing tests to detect cache use-after-free.
      * Release PinnableSlices from MultiGet as they are read rather than in
      bulk (in db_test_util wrapper).
      * In ASAN build, default to using a trivially small LRUCache for block_cache
      so that entries are immediately erased when unreferenced. (Updated two
      tests that depend on caching.) New ASAN testsuite running time seems
      OK to me.
      
      If I introduce a bug into my implementation where we skip the shared
      cleanups on block reuse, ASAN detects the bug in
      `db_basic_test *MultiGet*`. If I remove either of the above testing
      enhancements, the bug is not detected.
      
      Consider for follow-up work: manipulate or randomize ordering of
      PinnableSlice use and release from MultiGet db_test_util wrapper. But in
      typical cases, natural ordering gives pretty good functional coverage.
      
      Performance test:
      In the extreme (but possible) case of MultiGetting the same or adjacent keys
      in a batch, throughput can improve by an order of magnitude.
      `./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200`
      Before ops/sec, num=5: 1,384,394
      Before ops/sec, num=500: 6,423,720
      After ops/sec, num=500: 10,658,794
      After ops/sec, num=5: 16,027,257
      
      Also note that previously, with high parallelism, having query keys
      concentrated in a single block was worse than spreading them out a bit. Now
      concentrated in a single block is faster than spread out, which is hopefully
      consistent with natural expectation.
      
      Random query performance: with num=1000000, over 999 x 10s runs running before & after simultaneously (each -threads=12):
      Before: multireadrandom [AVG    999 runs] : 1088699 (± 7344) ops/sec;  120.4 (± 0.8 ) MB/sec
      After: multireadrandom [AVG    999 runs] : 1090402 (± 7230) ops/sec;  120.6 (± 0.8 ) MB/sec
      Possibly better, possibly in the noise.
      
      Reviewed By: anand1976
      
      Differential Revision: D35907003
      
      Pulled By: pdillinger
      
      fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e
    • fix clang-analyze in corruption_test (#9908) · ce2d8a42
      Committed by Andrew Kryczka
      Summary:
      This PR fixes a clang-analyze error that I introduced in https://github.com/facebook/rocksdb/issues/9906:
      
      ```
      db/corruption_test.cc:358:15: warning: Called C++ object pointer is null
          ASSERT_OK(db_->Put(WriteOptions(), cfhs[0], "k", "v"));
                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ./test_util/testharness.h:76:62: note: expanded from macro 'ASSERT_OK'
        ASSERT_PRED_FORMAT1(ROCKSDB_NAMESPACE::test::AssertStatus, s)
                                                                   ^
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19909:36: note: expanded
      from macro 'ASSERT_PRED_FORMAT1'
        GTEST_PRED_FORMAT1_(pred_format, v1, GTEST_FATAL_FAILURE_)
                                         ^~
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19892:34: note: expanded
      from macro 'GTEST_PRED_FORMAT1_'
        GTEST_ASSERT_(pred_format(#v1, v1), \
                                       ^~
      third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19868:52: note: expanded
      from macro 'GTEST_ASSERT_'
        if (const ::testing::AssertionResult gtest_ar = (expression)) \
                                                         ^~~~~~~~~~
      1 warning generated.
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9908
      
      Reviewed By: riversand963
      
      Differential Revision: D35953147
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9b837bd7581c6e1e2cdbc961c099652256eb9d4b
    • Add mmap DBGet microbench parameters (#9903) · 1eb279dc
      Committed by Andrew Kryczka
      Summary:
      I tried evaluating https://github.com/facebook/rocksdb/issues/9611 using DBGet microbenchmarks but mostly found the change is well within the noise even for hundreds of repetitions; meanwhile, the InternalKeyComparator CPU it saves is 1-2% according to perf so it should be measurable. In this PR I tried adding a mmap mode that will bypass compression/checksum/block cache/file read to focus more on the block lookup paths, and also increased the Get() count.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9903
      
      Reviewed By: jay-zhuang, riversand963
      
      Differential Revision: D35907375
      
      Pulled By: ajkr
      
      fbshipit-source-id: 69490d5040ef0863e1ce296724104d0aa7667215
    • Revert open logic changes in #9634 (#9906) · c5d367f4
      Committed by Andrew Kryczka
      Summary:
      This reverts the DB open logic changes from #9634 while leaving its HISTORY.md entry and unit tests in place.
      Also added a new unit test to repro the corruption scenario that this PR fixes, and a HISTORY.md line for that.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9906
      
      Reviewed By: riversand963
      
      Differential Revision: D35940093
      
      Pulled By: ajkr
      
      fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b
  8. 26 Apr 2022, 4 commits
    • Add stats related to async prefetching (#9845) · 3653029d
      Committed by Akanksha Mahajan
      Summary:
      Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS.
      PREFETCHED_BYTES_DISCARDED records the number of prefetched bytes discarded by
      FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by the underlying
      file_system Poll API.
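      A hedged sketch of reading the new stats; it assumes PREFETCHED_BYTES_DISCARDED is exposed as a ticker and POLL_WAIT_MICROS as a histogram (RocksDB typically records timings as histograms):

      ```
      #include <iostream>

      #include "rocksdb/options.h"
      #include "rocksdb/statistics.h"

      void DumpPrefetchStats(const rocksdb::Options& options) {
        // Assumes options.statistics = rocksdb::CreateDBStatistics() was set
        // before the DB was opened.
        auto stats = options.statistics;
        if (!stats) return;
        std::cout << "prefetched bytes discarded: "
                  << stats->getTickerCount(rocksdb::PREFETCHED_BYTES_DISCARDED)
                  << std::endl;
        rocksdb::HistogramData poll_wait;
        stats->histogramData(rocksdb::POLL_WAIT_MICROS, &poll_wait);
        std::cout << "poll wait micros (avg): " << poll_wait.average << std::endl;
      }
      ```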
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845
      
      Test Plan: Update existing tests
      
      Reviewed By: anand1976
      
      Differential Revision: D35909694
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36
    • Bugfix/fix manual flush blocking bug (#9893) · 6d2577e5
      Committed by RoeyMaor
      Summary:
      Fix https://github.com/facebook/rocksdb/issues/9892
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9893
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35880959
      
      Pulled By: ajkr
      
      fbshipit-source-id: dad1139ad0983cfbd5c5cd6fa6b71022f889735a
    • Add 95% confidence intervals to db_bench output (#9882) · fb9a167a
      Committed by Jaromir Vanek
      Summary:
      Enhancing `db_bench` output with 95% statistical confidence intervals for better performance evaluation. The goal is to unambiguously separate random variance when running benchmark over multiple iterations.
      
      Output enhanced with confidence intervals exposed in brackets:
      
      ```
      $ ./db_bench --benchmarks=fillseq[-X10]
      
      Running benchmark for 10 times
      fillseq      :       4.961 micros/op 201578 ops/sec;   22.3 MB/s
      fillseq      :       5.030 micros/op 198824 ops/sec;   22.0 MB/s
      fillseq [AVG 2 runs] : 200201 (± 2698) ops/sec;   22.1 (± 0.3) MB/sec
      fillseq      :       4.963 micros/op 201471 ops/sec;   22.3 MB/s
      fillseq [AVG 3 runs] : 200624 (± 1765) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq      :       5.035 micros/op 198625 ops/sec;   22.0 MB/s
      fillseq [AVG 4 runs] : 200124 (± 1586) ops/sec;   22.1 (± 0.2) MB/sec
      fillseq      :       4.979 micros/op 200861 ops/sec;   22.2 MB/s
      fillseq [AVG 5 runs] : 200272 (± 1262) ops/sec;   22.2 (± 0.1) MB/sec
      fillseq      :       4.893 micros/op 204367 ops/sec;   22.6 MB/s
      fillseq [AVG 6 runs] : 200954 (± 1688) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq      :       4.914 micros/op 203502 ops/sec;   22.5 MB/s
      fillseq [AVG 7 runs] : 201318 (± 1595) ops/sec;   22.3 (± 0.2) MB/sec
      fillseq      :       4.998 micros/op 200074 ops/sec;   22.1 MB/s
      fillseq [AVG 8 runs] : 201163 (± 1415) ops/sec;   22.3 (± 0.2) MB/sec
      fillseq      :       4.946 micros/op 202188 ops/sec;   22.4 MB/s
      fillseq [AVG 9 runs] : 201277 (± 1267) ops/sec;   22.3 (± 0.1) MB/sec
      fillseq      :       5.093 micros/op 196331 ops/sec;   21.7 MB/s
      fillseq [AVG 10 runs] : 200782 (± 1491) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq [AVG    10 runs] : 200782 (± 1491) ops/sec;   22.2 (± 0.2) MB/sec
      fillseq [MEDIAN 10 runs] : 201166 ops/sec;   22.3 MB/s
      ```
      
      For more explicit interval representation, use `--confidence_interval_only` flag:
      
      ```
      $ ./db_bench --benchmarks=fillseq[-X10] --confidence_interval_only
      
      Running benchmark for 10 times
      fillseq      :       4.935 micros/op 202648 ops/sec;   22.4 MB/s
      fillseq      :       5.078 micros/op 196943 ops/sec;   21.8 MB/s
      fillseq [CI95 2 runs] : (194205, 205385) ops/sec; (21.5, 22.7) MB/sec
      fillseq      :       5.159 micros/op 193816 ops/sec;   21.4 MB/s
      fillseq [CI95 3 runs] : (192735, 202869) ops/sec; (21.3, 22.4) MB/sec
      fillseq      :       4.947 micros/op 202158 ops/sec;   22.4 MB/s
      fillseq [CI95 4 runs] : (194721, 203061) ops/sec; (21.5, 22.5) MB/sec
      fillseq      :       4.908 micros/op 203756 ops/sec;   22.5 MB/s
      fillseq [CI95 5 runs] : (196113, 203615) ops/sec; (21.7, 22.5) MB/sec
      fillseq      :       5.063 micros/op 197528 ops/sec;   21.9 MB/s
      fillseq [CI95 6 runs] : (196319, 202631) ops/sec; (21.7, 22.4) MB/sec
      fillseq      :       5.214 micros/op 191799 ops/sec;   21.2 MB/s
      fillseq [CI95 7 runs] : (194953, 201803) ops/sec; (21.6, 22.3) MB/sec
      fillseq      :       5.260 micros/op 190095 ops/sec;   21.0 MB/s
      fillseq [CI95 8 runs] : (193749, 200937) ops/sec; (21.4, 22.2) MB/sec
      fillseq      :       5.076 micros/op 196992 ops/sec;   21.8 MB/s
      fillseq [CI95 9 runs] : (194134, 200474) ops/sec; (21.5, 22.2) MB/sec
      fillseq      :       5.388 micros/op 185603 ops/sec;   20.5 MB/s
      fillseq [CI95 10 runs] : (192487, 199781) ops/sec; (21.3, 22.1) MB/sec
      fillseq [AVG    10 runs] : 196134 (± 3647) ops/sec;   21.7 (± 0.4) MB/sec
      fillseq [MEDIAN 10 runs] : 196968 ops/sec;   21.8 MB/sec
      ```
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9882
      
      Reviewed By: pdillinger
      
      Differential Revision: D35796148
      
      Pulled By: vanekjar
      
      fbshipit-source-id: 8313712d16728ff982b8aff28195ee56622385b8
    • Add experimental new FS API AbortIO to cancel read request (#9901) · 5bd374b3
      Committed by Akanksha Mahajan
      Summary:
      Add experimental new API AbortIO in FileSystem to abort the
      read requests submitted asynchronously through ReadAsync API.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9901
      
      Test Plan: Existing tests
      
      Reviewed By: anand1976
      
      Differential Revision: D35885591
      
      Pulled By: akankshamahajan15
      
      fbshipit-source-id: df3944e6e9e6e487af1fa688376b4abb6837fb02
  9. 23 Apr 2022, 1 commit
  10. 22 Apr 2022, 1 commit
  11. 21 Apr 2022, 4 commits
    • Add rollback_deletion_type_callback to TxnDBOptions (#9873) · d13825e5
      Committed by Yanqin Jin
      Summary:
      This PR does not affect write-committed.
      
      Add a member, `rollback_deletion_type_callback` to TransactionDBOptions
      so that a write-prepared transaction, when rolling back, can call this
      callback to decide if a `Delete` or `SingleDelete` should be used to
      cancel a prior `Put` written to the database during prepare phase.
      
      The purpose of this PR is to prevent mixing `Delete` and `SingleDelete`
      for the same key, causing undefined behaviors. Without this PR, the
      following can happen:
      
      ```
      // The application always issues SingleDelete when deleting keys.
      
      txn1->Put('a');
      txn1->Prepare(); // writes to memtable and potentially gets flushed/compacted to Lmax
      txn1->Rollback();  // inserts DELETE('a')
      
      txn2->Put('a');
      txn2->Commit();  // writes to memtable and potentially gets flushed/compacted
      ```
      
      In the database, we may have
      ```
      L0:   [PUT('a', s=100)]
      L1:   [DELETE('a', s=90)]
      Lmax: [PUT('a', s=0)]
      ```
      
      If a compaction compacts L0 and L1, then we have
      ```
      L1:    [PUT('a', s=100)]
      Lmax:  [PUT('a', s=0)]
      ```
      
      If a future transaction issues a SingleDelete, we have
      ```
      L0:    [SD('a', s=110)]
      L1:    [PUT('a', s=100)]
      Lmax:  [PUT('a', s=0)]
      ```
      
      Then, a compaction including L0, L1 and Lmax leads to
      ```
      Lmax:  [PUT('a', s=0)]
      ```
      
      which is incorrect.
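      A hedged sketch of wiring up the new option; the callback signature shown (returning true to roll back a prepared Put with SingleDelete) is an assumption based on the description above:

      ```
      #include "rocksdb/utilities/transaction_db.h"

      rocksdb::TransactionDBOptions MakeWritePreparedOptions() {
        rocksdb::TransactionDBOptions txn_db_options;
        txn_db_options.write_policy = rocksdb::TxnDBWritePolicy::WRITE_PREPARED;
        // Assumed signature: return true to cancel a prepared Put with
        // SingleDelete rather than Delete, so the two are never mixed for
        // keys the application only ever SingleDelete()s.
        txn_db_options.rollback_deletion_type_callback =
            [](rocksdb::TransactionDB* /*db*/,
               rocksdb::ColumnFamilyHandle* /*cf*/,
               const rocksdb::Slice& /*key*/) { return true; };
        return txn_db_options;
      }
      ```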
      
      Similar bugs reported and addressed in
      https://github.com/cockroachdb/pebble/issues/1255. Based on our team's
      current priority, we have decided to take this approach for now. We may
      come back and revisit in the future.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9873
      
      Test Plan: make check
      
      Reviewed By: ltamasi
      
      Differential Revision: D35762170
      
      Pulled By: riversand963
      
      fbshipit-source-id: b28d56eefc786b53c9844b9ef4a7807acdd82c8d
    • Mark GetLiveFilesStorageInfo ready for production use (#9868) · 1bac873f
      Committed by Peter Dillinger
      Summary:
      ... by filling out a remaining testing hole: handling of
      db_paths + cf_paths. (Note that while GetLiveFilesStorageInfo works
      with db_paths / cf_paths, Checkpoint and BackupEngine do not, and are
      marked appropriately.)
      
      Also improved comments for "live files" APIs, and grouped them
      together in db.h.
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9868
      
      Test Plan: Adding to existing unit tests
      
      Reviewed By: jay-zhuang
      
      Differential Revision: D35752254
      
      Pulled By: pdillinger
      
      fbshipit-source-id: c70eb67748fad61826e2f554b674638700abefb2
    • Add 7.2 to compatible check (#9858) · 2ea4205a
      Committed by Jay Zhuang
      Summary:
      Add 7.2 to the compatibility check (this should be updated with each version bump).
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9858
      
      Reviewed By: riversand963
      
      Differential Revision: D35722897
      
      Pulled By: jay-zhuang
      
      fbshipit-source-id: 08c782b9344599d7296543eb0c61afcd9a869a1a
    • Add --decode_blob_index option to idump and dump commands (#9870) · 9b5790f0
      Committed by yuzhangyu
      Summary:
      This patch completes the first part of the task: "Extend all three commands so they can decode and print blob references if a new option --decode_blob_index is specified"
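      A usage sketch (DB path hypothetical):

      ```
      ldb --db=/path/to/db idump --decode_blob_index
      ldb --db=/path/to/db dump --decode_blob_index
      ```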
      
      Pull Request resolved: https://github.com/facebook/rocksdb/pull/9870
      
      Reviewed By: ltamasi
      
      Differential Revision: D35753932
      
      Pulled By: jowlyzhang
      
      fbshipit-source-id: 9d2bbba0eef2ed86b982767eba9de1b4881f35c9
  12. 20 Apr 2022, 1 commit