...
 
Commits (5)
    https://gitcode.net/kvdb/rocksdb/-/commit/c53d604f4114baa6e06e90e204850c36d6f35765 `sst_dump --command=verify` should verify block checksums (#11576) 2023-07-05T14:12:06-07:00 Changyu Bi changyubi@meta.com Summary: `sst_dump --command=verify` did not set read_options.verify_checksum to true, so it was not verifying checksums. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11576 Test Plan: ran the same command on an SST file with a bad checksum: ``` sst_dump --command=verify --file=...sst_file_with_bad_block_checksum Before this PR: options.env is 0x6ba048 Process ...sst_file_with_bad_block_checksum Sst file format: block-based The file is ok After this PR: options.env is 0x7f43f6690000 Process ...sst_file_with_bad_block_checksum Sst file format: block-based ... is corrupted: Corruption: block checksum mismatch: stored = 2170109798, computed = 2170097510, type = 4 ... ``` Reviewed By: ajkr Differential Revision: D47136284 Pulled By: cbi42 fbshipit-source-id: 07d68db715c00347145e5b83d649aef2c3f2acd9 https://gitcode.net/kvdb/rocksdb/-/commit/df082c8d1ddf5a90b195941064b56e853f104ff0 Deprecate option `periodic_compaction_seconds` for FIFO compaction (#11550) 2023-07-05T14:40:45-07:00 Changyu Bi 102700264+cbi42@users.noreply.github.com Summary: Both options `ttl` and `periodic_compaction_seconds` have the same meaning for FIFO compaction, which is redundant and can be confusing to use. For example, setting TTL to 0 does not disable TTL: the user needs to also set periodic_compaction_seconds to 0. Another example is that dynamically setting `periodic_compaction_seconds` (surprisingly) has no effect on TTL compaction. This is because the FIFO compaction picker internally only looks at the value of `ttl`.
The value of `ttl` is set in `SanitizeOptions()`, which takes into account the value of `periodic_compaction_seconds`, but dynamically setting an option does not invoke this method. This PR clarifies the usage of both options for FIFO compaction: only `ttl` should be used; `periodic_compaction_seconds` will not have any effect on FIFO compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11550 Test Plan: - updated existing unit test `DBOptionsTest.SanitizeFIFOPeriodicCompaction` - checked existing values of both options in feature matrix: https://fburl.com/daiquery/xxd0gs9w. All current use cases either have `periodic_compaction_seconds = 0` or have `periodic_compaction_seconds > ttl`, so this should not cause a change of behavior. Reviewed By: ajkr Differential Revision: D46902959 Pulled By: cbi42 fbshipit-source-id: a9ede235b276783b4906aaec443551fa62ceff4c https://gitcode.net/kvdb/rocksdb/-/commit/1f410ff95f623216c6d1c72f8d0788ed333e829c Make `rocksdb_options_add_compact_on_deletion_collector_factory` backward com... 2023-07-07T13:16:20-07:00 Changyu Bi changyubi@meta.com Summary: https://github.com/facebook/rocksdb/issues/11542 added a parameter to the C API `rocksdb_options_add_compact_on_deletion_collector_factory`, which causes some internal builds to fail. External users of this API would also need to change their code. This PR makes the API backward compatible by restoring the old C API and adding the parameter to a new C API `rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio`. Also updated the change log for 8.4; this change will be backported to the 8.4 branch once landed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11593 Test Plan: `make c_test && ./c_test` Reviewed By: akankshamahajan15 Differential Revision: D47299555 Pulled By: cbi42 fbshipit-source-id: 517dc093ef4cf02cac2fe4af4f1af13754bbda63 https://gitcode.net/kvdb/rocksdb/-/commit/baf37a0e818dc334a0ed94f3d315155e2c138c93 Fix a unit test hole for recovering UDTs with WAL files (#11577) 2023-07-07T16:47:49-07:00 Yu Zhang yuzhangyu@fb.com Summary: Thanks pdillinger for pointing out this test hole. The test `DBWALTestWithTimestamp.Recover`, which is intended to test recovery from WAL including user-defined timestamps, doesn't achieve its promised coverage. Specifically, after https://github.com/facebook/rocksdb/issues/11557, timestamps will be removed during flush, and RocksDB by default flushes memtables during recovery because `avoid_flush_during_recovery` defaults to false. This test didn't fail even though all the timestamps were quickly lost due to the default flush behavior. This PR renamed test `Recover` to `RecoverAndNoFlush`, and updated it to verify timestamps are successfully recovered from WAL with some time-travel reads. `avoid_flush_during_recovery` is set to true to help do this verification. On the other hand, for test `DBWALTestWithTimestamp.RecoverAndFlush`, since flush on reopen is the DB's default behavior, setting the flags `max_write_buffer` and `arena_block_size` is not really what enforces the flush, so these flags are removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11577 Test Plan: ./db_wal_test Reviewed By: pdillinger Differential Revision: D47142892 Pulled By: jowlyzhang fbshipit-source-id: 9465e278806faa5885b541b4e32d99e698edef7d https://gitcode.net/kvdb/rocksdb/-/commit/f74526341dc2dee11e3e66b725b8ba40890eb20a Handle file boundaries when timestamps should not be persisted (#11578) 2023-07-10T11:03:25-07:00 Yu Zhang yuzhangyu@fb.com Summary: Handle the file boundaries `FileMetaData.smallest`, `FileMetaData.largest` for when `persist_user_defined_timestamps` is false: 1) on the manifest write path, the original user-defined timestamps in file boundaries are stripped. This stripping is done during `VersionEdit::Encode` to limit the effect of the stripping to only the persisted version of the file boundaries. 2) on the manifest read path during DB open, a min timestamp is padded to the file boundaries. Ideally, this padding would happen during `VersionEdit::Decode` so that all in-memory file boundaries have a user key format compatible with the running user comparator. However, the user-defined timestamp size information is not available at that time, so this handling is added to `VersionEditHandler::OnNonCfOperation` instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11578 Test Plan: ``` make all check ./version_edit_test --gtest_filter="*EncodeDecodeNewFile4HandleFileBoundary*" ./db_with_timestamp_basic_test --gtest_filter="*HandleFileBoundariesTest*" ``` Reviewed By: pdillinger Differential Revision: D47309399 Pulled By: jowlyzhang fbshipit-source-id: 21b4d54d2089a62826b31d779094a39cb2bbbd51
......@@ -14,7 +14,7 @@
* Add `WriteBatch::Release()` that releases the batch's serialized data to the caller.
### Public API Changes
* Add parameter `deletion_ratio` to C API `rocksdb_options_add_compact_on_deletion_collector_factory`.
* Add C API `rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio`.
* Change the `FileSystem::use_async_io()` API to the `SupportedOps` API in order to extend it to various operations supported by the underlying FileSystem. Right now it contains `FSSupportedOps::kAsyncIO` and `FSSupportedOps::kFSBuffer`. More details about `FSSupportedOps` are in filesystem.h.
* Add new tickers: `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then.
* Overload the API CreateColumnFamilyWithImport() to support creating a ColumnFamily by importing multiple ColumnFamilies. It requires that the CFs do not overlap in user key range.
......
......@@ -3916,6 +3916,14 @@ void rocksdb_options_set_row_cache(rocksdb_options_t* opt,
}
void rocksdb_options_add_compact_on_deletion_collector_factory(
rocksdb_options_t* opt, size_t window_size, size_t num_dels_trigger) {
std::shared_ptr<ROCKSDB_NAMESPACE::TablePropertiesCollectorFactory>
compact_on_del =
NewCompactOnDeletionCollectorFactory(window_size, num_dels_trigger);
opt->rep.table_properties_collector_factories.emplace_back(compact_on_del);
}
void rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio(
rocksdb_options_t* opt, size_t window_size, size_t num_dels_trigger,
double deletion_ratio) {
std::shared_ptr<ROCKSDB_NAMESPACE::TablePropertiesCollectorFactory>
......
......@@ -720,7 +720,9 @@ int main(int argc, char** argv) {
rocksdb_compactoptions_set_exclusive_manual_compaction(coptions, 1);
rocksdb_options_add_compact_on_deletion_collector_factory(options, 10000,
10001, 0.0);
10001);
rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio(
options, 10000, 10001, 0.0);
StartPhase("destroy");
rocksdb_destroy_db(options, dbname, &err);
......
......@@ -382,8 +382,8 @@ ColumnFamilyOptions SanitizeOptions(const ImmutableDBOptions& db_options,
const uint64_t kAdjustedTtl = 30 * 24 * 60 * 60;
if (result.ttl == kDefaultTtl) {
if (is_block_based_table &&
result.compaction_style != kCompactionStyleFIFO) {
if (is_block_based_table) {
// For FIFO, max_open_files is checked in ValidateOptions().
result.ttl = kAdjustedTtl;
} else {
result.ttl = 0;
......@@ -403,16 +403,12 @@ ColumnFamilyOptions SanitizeOptions(const ImmutableDBOptions& db_options,
result.periodic_compaction_seconds = kAdjustedPeriodicCompSecs;
}
} else {
// result.compaction_style == kCompactionStyleFIFO
if (result.ttl == 0) {
if (is_block_based_table) {
if (result.periodic_compaction_seconds == kDefaultPeriodicCompSecs) {
result.periodic_compaction_seconds = kAdjustedPeriodicCompSecs;
}
result.ttl = result.periodic_compaction_seconds;
}
} else if (result.periodic_compaction_seconds != 0) {
result.ttl = std::min(result.ttl, result.periodic_compaction_seconds);
if (result.periodic_compaction_seconds != kDefaultPeriodicCompSecs &&
result.periodic_compaction_seconds > 0) {
ROCKS_LOG_WARN(
db_options.info_log.get(),
"periodic_compaction_seconds does not support FIFO compaction. You"
"may want to set option TTL instead.");
}
}
......
......@@ -880,9 +880,13 @@ TEST_F(DBOptionsTest, SanitizeFIFOPeriodicCompaction) {
Options options;
options.compaction_style = kCompactionStyleFIFO;
options.env = CurrentOptions().env;
// Default value allows RocksDB to set ttl to 30 days.
ASSERT_EQ(30 * 24 * 60 * 60, dbfull()->GetOptions().ttl);
// Disable
options.ttl = 0;
Reopen(options);
ASSERT_EQ(30 * 24 * 60 * 60, dbfull()->GetOptions().ttl);
ASSERT_EQ(0, dbfull()->GetOptions().ttl);
options.ttl = 100;
Reopen(options);
......@@ -892,15 +896,12 @@ TEST_F(DBOptionsTest, SanitizeFIFOPeriodicCompaction) {
Reopen(options);
ASSERT_EQ(100 * 24 * 60 * 60, dbfull()->GetOptions().ttl);
options.ttl = 200;
options.periodic_compaction_seconds = 300;
Reopen(options);
ASSERT_EQ(200, dbfull()->GetOptions().ttl);
// periodic_compaction_seconds should have no effect
// on FIFO compaction.
options.ttl = 500;
options.periodic_compaction_seconds = 300;
Reopen(options);
ASSERT_EQ(300, dbfull()->GetOptions().ttl);
ASSERT_EQ(500, dbfull()->GetOptions().ttl);
}
TEST_F(DBOptionsTest, SetFIFOCompactionOptions) {
......
......@@ -318,16 +318,25 @@ class DBWALTestWithTimestamp
DBWALTestWithTimestamp()
: DBBasicTestWithTimestampBase("db_wal_test_with_timestamp") {}
void SetUp() override {
persist_udt_ = test::ShouldPersistUDT(GetParam());
DBBasicTestWithTimestampBase::SetUp();
}
Status CreateAndReopenWithCFWithTs(const std::vector<std::string>& cfs,
const Options& options) {
const Options& options,
bool avoid_flush_during_recovery = false) {
CreateColumnFamilies(cfs, options);
return ReopenColumnFamiliesWithTs(cfs, options);
return ReopenColumnFamiliesWithTs(cfs, options,
avoid_flush_during_recovery);
}
Status ReopenColumnFamiliesWithTs(const std::vector<std::string>& cfs,
Options ts_options) {
Options ts_options,
bool avoid_flush_during_recovery = false) {
Options default_options = CurrentOptions();
default_options.create_if_missing = false;
default_options.avoid_flush_during_recovery = avoid_flush_during_recovery;
ts_options.create_if_missing = false;
std::vector<Options> cf_options(cfs.size(), ts_options);
......@@ -345,46 +354,88 @@ class DBWALTestWithTimestamp
}
void CheckGet(const ReadOptions& read_opts, uint32_t cf, const Slice& key,
const std::string& expected_value) {
const std::string& expected_value,
const std::string& expected_ts) {
std::string actual_value;
ASSERT_OK(db_->Get(read_opts, handles_[cf], key, &actual_value));
std::string actual_ts;
ASSERT_OK(
db_->Get(read_opts, handles_[cf], key, &actual_value, &actual_ts));
ASSERT_EQ(expected_value, actual_value);
ASSERT_EQ(expected_ts, actual_ts);
}
protected:
bool persist_udt_;
};
TEST_F(DBWALTestWithTimestamp, Recover) {
TEST_P(DBWALTestWithTimestamp, RecoverAndNoFlush) {
// Set up the option that enables user defined timestamp size.
std::string ts = Timestamp(1, 0);
const size_t kTimestampSize = ts.size();
std::string ts1 = Timestamp(1, 0);
const size_t kTimestampSize = ts1.size();
TestComparator test_cmp(kTimestampSize);
Options ts_options;
ts_options.create_if_missing = true;
ts_options.comparator = &test_cmp;
// Test that user-defined timestamps are recovered from WAL regardless of
// the value of this flag because UDTs are saved in WAL nonetheless.
// However, we need to explicitly disable flush during recovery by setting
// `avoid_flush_during_recovery=true` to avoid timestamps getting stripped
// when the `persist_user_defined_timestamps` flag is false, so that all
// written timestamps are available for testing user-defined time travel
// reads.
ts_options.persist_user_defined_timestamps = persist_udt_;
bool avoid_flush_during_recovery = true;
ReadOptions read_opts;
Slice ts_slice = ts;
read_opts.timestamp = &ts_slice;
do {
ASSERT_OK(CreateAndReopenWithCFWithTs({"pikachu"}, ts_options));
ASSERT_OK(Put(1, "foo", ts, "v1"));
ASSERT_OK(Put(1, "baz", ts, "v5"));
ASSERT_OK(ReopenColumnFamiliesWithTs({"pikachu"}, ts_options));
CheckGet(read_opts, 1, "foo", "v1");
CheckGet(read_opts, 1, "baz", "v5");
ASSERT_OK(Put(1, "bar", ts, "v2"));
ASSERT_OK(Put(1, "foo", ts, "v3"));
ASSERT_OK(ReopenColumnFamiliesWithTs({"pikachu"}, ts_options));
CheckGet(read_opts, 1, "foo", "v3");
ASSERT_OK(Put(1, "foo", ts, "v4"));
CheckGet(read_opts, 1, "foo", "v4");
CheckGet(read_opts, 1, "bar", "v2");
CheckGet(read_opts, 1, "baz", "v5");
Slice ts_slice = ts1;
read_opts.timestamp = &ts_slice;
ASSERT_OK(CreateAndReopenWithCFWithTs({"pikachu"}, ts_options,
avoid_flush_during_recovery));
ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"), 0U);
ASSERT_OK(Put(1, "foo", ts1, "v1"));
ASSERT_OK(Put(1, "baz", ts1, "v5"));
ASSERT_OK(ReopenColumnFamiliesWithTs({"pikachu"}, ts_options,
avoid_flush_during_recovery));
ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"), 0U);
// Do a timestamped read with ts1 after second reopen.
CheckGet(read_opts, 1, "foo", "v1", ts1);
CheckGet(read_opts, 1, "baz", "v5", ts1);
// Write more value versions for key "foo" and "bar" before and after second
// reopen.
std::string ts2 = Timestamp(2, 0);
ASSERT_OK(Put(1, "bar", ts2, "v2"));
ASSERT_OK(Put(1, "foo", ts2, "v3"));
ASSERT_OK(ReopenColumnFamiliesWithTs({"pikachu"}, ts_options,
avoid_flush_during_recovery));
ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"), 0U);
std::string ts3 = Timestamp(3, 0);
ASSERT_OK(Put(1, "foo", ts3, "v4"));
// Do a timestamped read with ts1 after third reopen.
CheckGet(read_opts, 1, "foo", "v1", ts1);
std::string value;
ASSERT_TRUE(db_->Get(read_opts, handles_[1], "bar", &value).IsNotFound());
CheckGet(read_opts, 1, "baz", "v5", ts1);
// Do a timestamped read with ts2 after third reopen.
ts_slice = ts2;
CheckGet(read_opts, 1, "foo", "v3", ts2);
CheckGet(read_opts, 1, "bar", "v2", ts2);
CheckGet(read_opts, 1, "baz", "v5", ts1);
// Do a timestamped read with ts3 after third reopen.
ts_slice = ts3;
CheckGet(read_opts, 1, "foo", "v4", ts3);
CheckGet(read_opts, 1, "bar", "v2", ts2);
CheckGet(read_opts, 1, "baz", "v5", ts1);
} while (ChangeWalOptions());
}
TEST_F(DBWALTestWithTimestamp, RecoverInconsistentTimestamp) {
TEST_P(DBWALTestWithTimestamp, RecoverInconsistentTimestamp) {
// Set up the option that enables user defined timestamp size.
std::string ts = Timestamp(1, 0);
const size_t kTimestampSize = ts.size();
......@@ -392,11 +443,19 @@ TEST_F(DBWALTestWithTimestamp, RecoverInconsistentTimestamp) {
Options ts_options;
ts_options.create_if_missing = true;
ts_options.comparator = &test_cmp;
ts_options.persist_user_defined_timestamps = persist_udt_;
ASSERT_OK(CreateAndReopenWithCFWithTs({"pikachu"}, ts_options));
ASSERT_OK(Put(1, "foo", ts, "v1"));
ASSERT_OK(Put(1, "baz", ts, "v5"));
// In real use cases, switching to a different user comparator is prohibited
// by a sanity check during DB open that does a user comparator name
// comparison. This test mocks and bypasses that sanity check because the
// before and after user comparators are both named "TestComparator". This is
// to test that the user-defined timestamp recovery logic for WAL files has
// the intended consistency check.
// `HandleWriteBatchTimestampSizeDifference` in udt_util.h has more details.
TestComparator diff_test_cmp(kTimestampSize + 1);
ts_options.comparator = &diff_test_cmp;
ASSERT_TRUE(
......@@ -404,7 +463,7 @@ TEST_F(DBWALTestWithTimestamp, RecoverInconsistentTimestamp) {
}
TEST_P(DBWALTestWithTimestamp, RecoverAndFlush) {
// Set up the option that enables user defined timestmp size.
// Set up the option that enables user defined timestamp size.
std::string min_ts = Timestamp(0, 0);
std::string write_ts = Timestamp(1, 0);
const size_t kTimestampSize = write_ts.size();
......@@ -412,22 +471,21 @@ TEST_P(DBWALTestWithTimestamp, RecoverAndFlush) {
Options ts_options;
ts_options.create_if_missing = true;
ts_options.comparator = &test_cmp;
bool persist_udt = test::ShouldPersistUDT(GetParam());
ts_options.persist_user_defined_timestamps = persist_udt;
ts_options.persist_user_defined_timestamps = persist_udt_;
std::string smallest_ukey_without_ts = "baz";
std::string largest_ukey_without_ts = "foo";
ASSERT_OK(CreateAndReopenWithCFWithTs({"pikachu"}, ts_options));
// No flush, no sst files, because of no data.
ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"), 0U);
ASSERT_OK(Put(1, largest_ukey_without_ts, write_ts, "v1"));
ASSERT_OK(Put(1, smallest_ukey_without_ts, write_ts, "v5"));
// Very small write buffer size to force flush memtables recovered from WAL.
ts_options.write_buffer_size = 16;
ts_options.arena_block_size = 16;
ASSERT_OK(ReopenColumnFamiliesWithTs({"pikachu"}, ts_options));
ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"),
static_cast<uint64_t>(1));
// Memtable recovered from WAL flushed because `avoid_flush_during_recovery`
// defaults to false, created one L0 file.
ASSERT_EQ(GetNumberOfSstFilesForColumnFamily(db_, "pikachu"), 1U);
std::vector<std::vector<FileMetaData>> level_to_files;
dbfull()->TEST_GetFilesMetaData(handles_[1], &level_to_files);
......@@ -435,7 +493,7 @@ TEST_P(DBWALTestWithTimestamp, RecoverAndFlush) {
// L0 only has one SST file.
ASSERT_EQ(level_to_files[0].size(), 1);
auto meta = level_to_files[0][0];
if (persist_udt) {
if (persist_udt_) {
ASSERT_EQ(smallest_ukey_without_ts + write_ts, meta.smallest.user_key());
ASSERT_EQ(largest_ukey_without_ts + write_ts, meta.largest.user_key());
} else {
......@@ -446,7 +504,7 @@ TEST_P(DBWALTestWithTimestamp, RecoverAndFlush) {
// Param 0: test mode for the user-defined timestamp feature
INSTANTIATE_TEST_CASE_P(
RecoverAndFlush, DBWALTestWithTimestamp,
DBWALTestWithTimestamp, DBWALTestWithTimestamp,
::testing::Values(
test::UserDefinedTimestampTestMode::kStripUserDefinedTimestamp,
test::UserDefinedTimestampTestMode::kNormal));
......
......@@ -3273,6 +3273,76 @@ TEST_F(UpdateFullHistoryTsLowTest, ConcurrentUpdate) {
Close();
}
// Tests the effect of flag `persist_user_defined_timestamps` on the file
// boundaries contained in the Manifest, a.k.a. FileMetaData.smallest,
// FileMetaData.largest.
class HandleFileBoundariesTest
: public DBBasicTestWithTimestampBase,
public testing::WithParamInterface<test::UserDefinedTimestampTestMode> {
public:
HandleFileBoundariesTest()
: DBBasicTestWithTimestampBase("/handle_file_boundaries") {}
};
TEST_P(HandleFileBoundariesTest, ConfigurePersistUdt) {
Options options = CurrentOptions();
options.env = env_;
// Write a timestamp that is not the min timestamp to help test the behavior
// of flag `persist_user_defined_timestamps`.
std::string write_ts = Timestamp(1, 0);
std::string min_ts = Timestamp(0, 0);
std::string smallest_ukey_without_ts = "bar";
std::string largest_ukey_without_ts = "foo";
const size_t kTimestampSize = write_ts.size();
TestComparator test_cmp(kTimestampSize);
options.comparator = &test_cmp;
bool persist_udt = test::ShouldPersistUDT(GetParam());
options.persist_user_defined_timestamps = persist_udt;
DestroyAndReopen(options);
ASSERT_OK(
db_->Put(WriteOptions(), smallest_ukey_without_ts, write_ts, "val1"));
ASSERT_OK(
db_->Put(WriteOptions(), largest_ukey_without_ts, write_ts, "val2"));
// Create an L0 SST file whose record is added to the Manifest.
ASSERT_OK(Flush());
Close();
options.create_if_missing = false;
// Reopen the DB and process manifest file.
Reopen(options);
std::vector<std::vector<FileMetaData>> level_to_files;
dbfull()->TEST_GetFilesMetaData(dbfull()->DefaultColumnFamily(),
&level_to_files);
ASSERT_GT(level_to_files.size(), 1);
// L0 only has one SST file.
ASSERT_EQ(level_to_files[0].size(), 1);
auto file_meta = level_to_files[0][0];
if (persist_udt) {
ASSERT_EQ(smallest_ukey_without_ts + write_ts,
file_meta.smallest.user_key());
ASSERT_EQ(largest_ukey_without_ts + write_ts, file_meta.largest.user_key());
} else {
// If `persist_user_defined_timestamps` is false, the file boundaries should
// have the min timestamp. Behind the scenes, when file boundaries in
// FileMetaData are persisted to the Manifest, the original user-defined
// timestamps in the user key are stripped. When the Manifest is read and
// processed during DB open, a min timestamp is padded to the file boundaries.
// This test's writes use a non-min timestamp to verify this logic end-to-end.
ASSERT_EQ(smallest_ukey_without_ts + min_ts, file_meta.smallest.user_key());
ASSERT_EQ(largest_ukey_without_ts + min_ts, file_meta.largest.user_key());
}
Close();
}
INSTANTIATE_TEST_CASE_P(
ConfigurePersistUdt, HandleFileBoundariesTest,
::testing::Values(
test::UserDefinedTimestampTestMode::kStripUserDefinedTimestamp,
test::UserDefinedTimestampTestMode::kNormal));
TEST_F(DBBasicTestWithTimestamp,
GCPreserveRangeTombstoneWhenNoOrSmallFullHistoryLow) {
Options options = CurrentOptions();
......
......@@ -93,7 +93,8 @@ void VersionEdit::Clear() {
full_history_ts_low_.clear();
}
bool VersionEdit::EncodeTo(std::string* dst) const {
bool VersionEdit::EncodeTo(std::string* dst,
std::optional<size_t> ts_sz) const {
if (has_db_id_) {
PutVarint32(dst, kDbId);
PutLengthPrefixedSlice(dst, db_id_);
......@@ -133,6 +134,8 @@ bool VersionEdit::EncodeTo(std::string* dst) const {
}
bool min_log_num_written = false;
assert(new_files_.empty() || ts_sz.has_value());
for (size_t i = 0; i < new_files_.size(); i++) {
const FileMetaData& f = new_files_[i].second;
if (!f.smallest.Valid() || !f.largest.Valid() ||
......@@ -142,8 +145,7 @@ bool VersionEdit::EncodeTo(std::string* dst) const {
PutVarint32(dst, kNewFile4);
PutVarint32Varint64(dst, new_files_[i].first /* level */, f.fd.GetNumber());
PutVarint64(dst, f.fd.GetFileSize());
PutLengthPrefixedSlice(dst, f.smallest.Encode());
PutLengthPrefixedSlice(dst, f.largest.Encode());
EncodeFileBoundaries(dst, f, ts_sz.value());
PutVarint64Varint64(dst, f.fd.smallest_seqno, f.fd.largest_seqno);
// Customized fields' format:
// +-----------------------------+
......@@ -458,6 +460,23 @@ const char* VersionEdit::DecodeNewFile4From(Slice* input) {
return nullptr;
}
void VersionEdit::EncodeFileBoundaries(std::string* dst,
const FileMetaData& meta,
size_t ts_sz) const {
if (ts_sz == 0 || meta.user_defined_timestamps_persisted) {
PutLengthPrefixedSlice(dst, meta.smallest.Encode());
PutLengthPrefixedSlice(dst, meta.largest.Encode());
return;
}
std::string smallest_buf;
std::string largest_buf;
StripTimestampFromInternalKey(&smallest_buf, meta.smallest.Encode(), ts_sz);
StripTimestampFromInternalKey(&largest_buf, meta.largest.Encode(), ts_sz);
PutLengthPrefixedSlice(dst, smallest_buf);
PutLengthPrefixedSlice(dst, largest_buf);
}
Status VersionEdit::DecodeFrom(const Slice& src) {
Clear();
#ifndef NDEBUG
......
......@@ -9,6 +9,7 @@
#pragma once
#include <algorithm>
#include <optional>
#include <set>
#include <string>
#include <utility>
......@@ -489,6 +490,8 @@ class VersionEdit {
using NewFiles = std::vector<std::pair<int, FileMetaData>>;
const NewFiles& GetNewFiles() const { return new_files_; }
NewFiles& GetMutableNewFiles() { return new_files_; }
// Retrieve all the compact cursors
using CompactCursors = std::vector<std::pair<int, InternalKey>>;
const CompactCursors& GetCompactCursors() const { return compact_cursors_; }
......@@ -639,7 +642,17 @@ class VersionEdit {
}
// return true on success.
bool EncodeTo(std::string* dst) const;
// `ts_sz` is the size in bytes for the user-defined timestamp contained in
// a user key. This argument is optional because it's only required for
// encoding a `VersionEdit` with new SST files to add. It's used to handle the
// file boundaries: `smallest`, `largest` when
// `FileMetaData.user_defined_timestamps_persisted` is false. When reading
// the Manifest file, the mirroring change needed to handle
// file boundaries is not added to `VersionEdit::DecodeFrom` because the
// timestamp size is not available at `VersionEdit` decoding time; it is
// instead added to `VersionEditHandler::OnNonCfOperation`.
bool EncodeTo(std::string* dst,
std::optional<size_t> ts_sz = std::nullopt) const;
Status DecodeFrom(const Slice& src);
std::string DebugString(bool hex_key = false) const;
......@@ -660,6 +673,12 @@ class VersionEdit {
const char* DecodeNewFile4From(Slice* input);
// Encode file boundaries `FileMetaData.smallest` and `FileMetaData.largest`.
// User-defined timestamps in the user key will be stripped if they shouldn't
// be persisted.
void EncodeFileBoundaries(std::string* dst, const FileMetaData& meta,
size_t ts_sz) const;
int max_level_ = 0;
std::string db_id_;
std::string comparator_;
......
......@@ -308,6 +308,17 @@ Status VersionEditHandler::OnNonCfOperation(VersionEdit& edit,
tmp_cfd = version_set_->GetColumnFamilySet()->GetColumnFamily(
edit.column_family_);
assert(tmp_cfd != nullptr);
// It's important to handle file boundaries before `MaybeCreateVersion`
// because `VersionEditHandlerPointInTime::MaybeCreateVersion` does
// `FileMetaData` verification that involves the file boundaries.
// All `VersionEditHandlerBase` subclasses that need to deal with
// `FileMetaData` for new files are also subclasses of
// `VersionEditHandler`, so it's sufficient to do the file boundaries
// handling in this method.
s = MaybeHandleFileBoundariesForNewFiles(edit, tmp_cfd);
if (!s.ok()) {
return s;
}
s = MaybeCreateVersion(edit, tmp_cfd, /*force_create_version=*/false);
if (s.ok()) {
s = builder_iter->second->version_builder()->Apply(&edit);
......@@ -647,6 +658,47 @@ Status VersionEditHandler::ExtractInfoFromVersionEdit(ColumnFamilyData* cfd,
return s;
}
Status VersionEditHandler::MaybeHandleFileBoundariesForNewFiles(
VersionEdit& edit, const ColumnFamilyData* cfd) {
if (edit.GetNewFiles().empty()) {
return Status::OK();
}
auto ucmp = cfd->user_comparator();
assert(ucmp);
size_t ts_sz = ucmp->timestamp_size();
if (ts_sz == 0) {
return Status::OK();
}
VersionEdit::NewFiles& new_files = edit.GetMutableNewFiles();
assert(!new_files.empty());
bool file_boundaries_need_handling = false;
for (auto& new_file : new_files) {
FileMetaData& meta = new_file.second;
if (meta.user_defined_timestamps_persisted) {
// `FileMetaData.user_defined_timestamps_persisted` field is the value of
// the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps`
// at the time when the SST file was created. As a result, all added SST
// files in one `VersionEdit` should have the same value for it.
if (file_boundaries_need_handling) {
return Status::Corruption(
"New files in one VersionEdit has different "
"user_defined_timestamps_persisted value.");
}
break;
}
file_boundaries_need_handling = true;
std::string smallest_buf;
std::string largest_buf;
PadInternalKeyWithMinTimestamp(&smallest_buf, meta.smallest.Encode(),
ts_sz);
PadInternalKeyWithMinTimestamp(&largest_buf, meta.largest.Encode(), ts_sz);
meta.smallest.DecodeFrom(smallest_buf);
meta.largest.DecodeFrom(largest_buf);
}
return Status::OK();
}
VersionEditHandlerPointInTime::VersionEditHandlerPointInTime(
bool read_only, std::vector<ColumnFamilyDescriptor> column_families,
VersionSet* version_set, const std::shared_ptr<IOTracer>& io_tracer,
......
......@@ -206,6 +206,17 @@ class VersionEditHandler : public VersionEditHandlerBase {
private:
Status ExtractInfoFromVersionEdit(ColumnFamilyData* cfd,
const VersionEdit& edit);
// When `FileMetaData.user_defined_timestamps_persisted` is false and the
// user-defined timestamp size is non-zero, user-defined timestamps are
// stripped from the file boundaries `smallest` and `largest` in
// `VersionEdit::EncodeTo` before they are written to the Manifest.
// This is the mirroring change to handle file boundaries on the Manifest read
// path for that scenario: it pads a minimum timestamp to the user key in
// `smallest` and `largest` so their format is consistent with the running
// user comparator.
Status MaybeHandleFileBoundariesForNewFiles(VersionEdit& edit,
const ColumnFamilyData* cfd);
};
// A class similar to its base class, i.e. VersionEditHandler.
......
......@@ -21,12 +21,20 @@
namespace ROCKSDB_NAMESPACE {
static void TestEncodeDecode(const VersionEdit& edit) {
// Encoding one `VersionEdit` and decoding it again should result in the
// exact same `VersionEdit`. However, special handling is applied to the file
// boundaries `FileMetaData.smallest`, `FileMetaData.largest` when
// user-defined timestamps should not be persisted. In that scenario, this
// invariant does not hold. We disable that scenario in this util method so
// that all other test cases continue to verify this invariant, while the
// special case is separately covered in test
// `EncodeDecodeNewFile4HandleFileBoundary`.
std::string encoded, encoded2;
edit.EncodeTo(&encoded);
edit.EncodeTo(&encoded, 0 /* ts_sz */);
VersionEdit parsed;
Status s = parsed.DecodeFrom(encoded);
ASSERT_TRUE(s.ok()) << s.ToString();
parsed.EncodeTo(&encoded2);
parsed.EncodeTo(&encoded2, 0 /* ts_sz */);
ASSERT_EQ(encoded, encoded2);
}
......@@ -93,7 +101,7 @@ TEST_F(VersionEditTest, EncodeDecodeNewFile4) {
TestEncodeDecode(edit);
std::string encoded, encoded2;
edit.EncodeTo(&encoded);
edit.EncodeTo(&encoded, 0 /* ts_sz */);
VersionEdit parsed;
Status s = parsed.DecodeFrom(encoded);
ASSERT_TRUE(s.ok()) << s.ToString();
......@@ -119,6 +127,57 @@ TEST_F(VersionEditTest, EncodeDecodeNewFile4) {
ASSERT_TRUE(new_files[3].second.user_defined_timestamps_persisted);
}
TEST_F(VersionEditTest, EncodeDecodeNewFile4HandleFileBoundary) {
static const uint64_t kBig = 1ull << 50;
size_t ts_sz = 16;
static std::string min_ts(ts_sz, static_cast<unsigned char>(0));
VersionEdit edit;
std::string smallest = "foo";
std::string largest = "zoo";
// In real manifest writing scenarios, one `VersionEdit` should not contain
// files with different `user_defined_timestamps_persisted` flag value.
// This is just for testing file boundaries handling w.r.t persisting user
// defined timestamps during `VersionEdit` encoding.
edit.AddFile(
3, 300, 3, 100, InternalKey(smallest + min_ts, kBig + 500, kTypeValue),
InternalKey(largest + min_ts, kBig + 600, kTypeDeletion), kBig + 500,
kBig + 600, true, Temperature::kUnknown, kInvalidBlobFileNumber,
kUnknownOldestAncesterTime, kUnknownFileCreationTime,
300 /* epoch_number */, kUnknownFileChecksum,
kUnknownFileChecksumFuncName, kNullUniqueId64x2,
0 /* compensated_range_deletion_size */, 0 /* tail_size */,
false /* user_defined_timestamps_persisted */);
edit.AddFile(3, 300, 3, 100,
InternalKey(smallest + min_ts, kBig + 500, kTypeValue),
InternalKey(largest + min_ts, kBig + 600, kTypeDeletion),
kBig + 500, kBig + 600, true, Temperature::kUnknown,
kInvalidBlobFileNumber, kUnknownOldestAncesterTime,
kUnknownFileCreationTime, 300 /* epoch_number */,
kUnknownFileChecksum, kUnknownFileChecksumFuncName,
kNullUniqueId64x2, 0 /* compensated_range_deletion_size */,
0 /* tail_size */, true /* user_defined_timestamps_persisted */);
std::string encoded;
edit.EncodeTo(&encoded, ts_sz);
VersionEdit parsed;
Status s = parsed.DecodeFrom(encoded);
ASSERT_TRUE(s.ok()) << s.ToString();
auto& new_files = parsed.GetNewFiles();
ASSERT_TRUE(new_files.size() == 2);
ASSERT_FALSE(new_files[0].second.user_defined_timestamps_persisted);
// First file's boundaries do not contain user-defined timestamps.
ASSERT_EQ(InternalKey(smallest, kBig + 500, kTypeValue).Encode(),
new_files[0].second.smallest.Encode());
ASSERT_EQ(InternalKey(largest, kBig + 600, kTypeDeletion).Encode(),
new_files[0].second.largest.Encode());
ASSERT_TRUE(new_files[1].second.user_defined_timestamps_persisted);
// Second file's boundaries contain user-defined timestamps.
ASSERT_EQ(InternalKey(smallest + min_ts, kBig + 500, kTypeValue).Encode(),
new_files[1].second.smallest.Encode());
ASSERT_EQ(InternalKey(largest + min_ts, kBig + 600, kTypeDeletion).Encode(),
new_files[1].second.largest.Encode());
}
TEST_F(VersionEditTest, ForwardCompatibleNewFile4) {
static const uint64_t kBig = 1ull << 50;
VersionEdit edit;
......@@ -158,7 +217,7 @@ TEST_F(VersionEditTest, ForwardCompatibleNewFile4) {
}
});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
edit.EncodeTo(&encoded);
edit.EncodeTo(&encoded, 0 /* ts_sz */);
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
VersionEdit parsed;
......@@ -198,7 +257,7 @@ TEST_F(VersionEditTest, NewFile4NotSupportedField) {
PutLengthPrefixedSlice(str, str1);
});
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
edit.EncodeTo(&encoded);
edit.EncodeTo(&encoded, 0 /* ts_sz */);
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
VersionEdit parsed;
......@@ -214,7 +273,7 @@ TEST_F(VersionEditTest, EncodeEmptyFile) {
1 /*epoch_number*/, kUnknownFileChecksum,
kUnknownFileChecksumFuncName, kNullUniqueId64x2, 0, 0, true);
std::string buffer;
ASSERT_TRUE(!edit.EncodeTo(&buffer));
ASSERT_TRUE(!edit.EncodeTo(&buffer, 0 /* ts_sz */));
}
TEST_F(VersionEditTest, ColumnFamilyTest) {
......@@ -583,7 +642,7 @@ TEST_F(VersionEditTest, IgnorableTags) {
edit.SetColumnFamily(kColumnFamilyId);
std::string encoded;
ASSERT_TRUE(edit.EncodeTo(&encoded));
ASSERT_TRUE(edit.EncodeTo(&encoded, 0 /* ts_sz */));
VersionEdit decoded;
ASSERT_OK(decoded.DecodeFrom(encoded));
......
......@@ -5118,6 +5118,12 @@ Status VersionSet::ProcessManifestWrites(
assert(manifest_writers_.front() == &first_writer);
autovector<VersionEdit*> batch_edits;
// This vector keeps track of the corresponding user-defined timestamp size
// for `batch_edits` side by side, which is only needed for encoding a
// `VersionEdit` that adds new SST files.
// Note that any time `batch_edits` has a new element added or an existing
// element removed, `batch_edits_ts_sz` should be updated too.
autovector<std::optional<size_t>> batch_edits_ts_sz;
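One way to keep two side-by-side vectors from drifting, as the comment above warns, is to funnel every mutation through a single helper; the sketch below uses simplified stand-in types, not RocksDB's `autovector` or `VersionEdit`:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

struct Edit { int id; };  // stand-in for VersionEdit*

// Sketch: push edits and their timestamp sizes through one helper so the
// two parallel vectors can never get out of sync.
struct EditBatch {
  std::vector<Edit> edits;
  std::vector<std::optional<size_t>> ts_sizes;

  void Push(Edit e, std::optional<size_t> ts_sz) {
    edits.push_back(e);
    ts_sizes.push_back(ts_sz);
    assert(edits.size() == ts_sizes.size());  // lockstep invariant
  }
};
```

`std::nullopt` marks edits (such as column family add/drop) that never encode new files and therefore need no timestamp size.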
autovector<Version*> versions;
autovector<const MutableCFOptions*> mutable_cf_options_ptrs;
std::vector<std::unique_ptr<BaseReferencedVersionBuilder>> builder_guards;
......@@ -5133,6 +5139,7 @@ Status VersionSet::ProcessManifestWrites(
// No group commits for column family add or drop
LogAndApplyCFHelper(first_writer.edit_list.front(), &max_last_sequence);
batch_edits.push_back(first_writer.edit_list.front());
batch_edits_ts_sz.push_back(std::nullopt);
} else {
auto it = manifest_writers_.cbegin();
size_t group_start = std::numeric_limits<size_t>::max();
......@@ -5209,6 +5216,9 @@ Status VersionSet::ProcessManifestWrites(
TEST_SYNC_POINT_CALLBACK("VersionSet::ProcessManifestWrites:NewVersion",
version);
}
const Comparator* ucmp = last_writer->cfd->user_comparator();
assert(ucmp);
std::optional<size_t> edit_ts_sz = ucmp->timestamp_size();
for (const auto& e : last_writer->edit_list) {
if (e->is_in_atomic_group_) {
if (batch_edits.empty() || !batch_edits.back()->is_in_atomic_group_ ||
......@@ -5229,6 +5239,7 @@ Status VersionSet::ProcessManifestWrites(
return s;
}
batch_edits.push_back(e);
batch_edits_ts_sz.push_back(edit_ts_sz);
}
}
for (int i = 0; i < static_cast<int>(versions.size()); ++i) {
......@@ -5394,9 +5405,11 @@ Status VersionSet::ProcessManifestWrites(
#ifndef NDEBUG
size_t idx = 0;
#endif
for (auto& e : batch_edits) {
assert(batch_edits.size() == batch_edits_ts_sz.size());
for (size_t bidx = 0; bidx < batch_edits.size(); bidx++) {
auto& e = batch_edits[bidx];
std::string record;
if (!e->EncodeTo(&record)) {
if (!e->EncodeTo(&record, batch_edits_ts_sz[bidx])) {
s = Status::Corruption("Unable to encode VersionEdit:" +
e->DebugString(true));
break;
......@@ -6505,8 +6518,10 @@ Status VersionSet::WriteCurrentStateToManifest(
edit.SetLastSequence(descriptor_last_sequence_);
const Comparator* ucmp = cfd->user_comparator();
assert(ucmp);
std::string record;
if (!edit.EncodeTo(&record)) {
if (!edit.EncodeTo(&record, ucmp->timestamp_size())) {
return Status::Corruption("Unable to Encode VersionEdit:" +
edit.DebugString(true));
}
......
......@@ -3320,7 +3320,7 @@ class VersionSetTestMissingFiles : public VersionSetTestBase,
++last_seqno_;
assert(log_writer_.get() != nullptr);
std::string record;
ASSERT_TRUE(edit.EncodeTo(&record));
ASSERT_TRUE(edit.EncodeTo(&record, 0 /* ts_sz */));
Status s = log_writer_->AddRecord(record);
ASSERT_OK(s);
}
......
......@@ -858,24 +858,23 @@ struct AdvancedColumnFamilyOptions {
// Dynamically changeable through SetOptions() API
bool report_bg_io_stats = false;
// Files containing updates older than TTL will go through the compaction
// process. This usually happens in a cascading way so that those entries
// will be compacted to bottommost level/file.
// The feature is used to remove stale entries that have been deleted or
// updated from the file system.
// Pre-req: This needs max_open_files to be set to -1.
// In Level: Non-bottom-level files older than TTL will go through the
// compaction process.
// In FIFO: Files older than TTL will be deleted.
// This option has different meanings for different compaction styles:
//
// Leveled: Non-bottom-level files with all keys older than TTL will go
// through the compaction process. This usually happens in a cascading
// way so that those entries will be compacted to bottommost level/file.
// The feature is used to remove stale entries that have been deleted or
// updated from the file system.
//
// FIFO: Files with all keys older than TTL will be deleted. TTL is only
// supported if option max_open_files is set to -1.
//
// unit: seconds. Ex: 1 day = 1 * 24 * 60 * 60
// In FIFO, this option will have the same meaning as
// periodic_compaction_seconds. Whichever stricter will be used.
// 0 means disabling.
// UINT64_MAX - 1 (0xfffffffffffffffe) is special flag to allow RocksDB to
// pick default.
//
// Default: 30 days for leveled compaction + block based table. disable
// otherwise.
// Default: 30 days if using block based table. 0 (disable) otherwise.
//
// Dynamically changeable through SetOptions() API
uint64_t ttl = 0xfffffffffffffffe;
......@@ -891,12 +890,9 @@ struct AdvancedColumnFamilyOptions {
// age is based on the file's last modified time (given by the underlying
// Env).
//
// Supported in all compaction styles.
// Supported in leveled and universal compaction.
// In Universal compaction, rocksdb will try to do a full compaction when
// possible, see more in UniversalCompactionBuilder::PickPeriodicCompaction().
// In FIFO compaction, this option has the same meaning as TTL and whichever
// stricter will be used.
// Pre-req: max_open_file == -1.
// unit: seconds. Ex: 7 days = 7 * 24 * 60 * 60
//
// Values:
......@@ -905,9 +901,6 @@ struct AdvancedColumnFamilyOptions {
// as needed. For now, RocksDB will change this value to 30 days
// (i.e 30 * 24 * 60 * 60) so that every file goes through the compaction
// process at least once every 30 days if not compacted sooner.
// In FIFO compaction, since the option has the same meaning as ttl,
// when this value is left default, and ttl is left to 0, 30 days will be
// used. Otherwise, min(ttl, periodic_compaction_seconds) will be used.
//
// Default: UINT64_MAX - 1 (allow RocksDB to auto-tune)
//
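The pre-PR FIFO behavior removed from the comment above, where `SanitizeOptions()` folded `periodic_compaction_seconds` into `ttl` so that "whichever is stricter" won, can be sketched as follows; the constants and helper name are illustrative, not RocksDB's actual sanitize code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr uint64_t kAdjustToDefault = 0xfffffffffffffffeULL;  // UINT64_MAX - 1
constexpr uint64_t k30Days = 30 * 24 * 60 * 60ULL;

// Sketch of the old FIFO TTL resolution: the stricter (smaller) nonzero
// value wins, with the UINT64_MAX - 1 sentinel meaning "pick a default".
uint64_t SanitizedFifoTtl(uint64_t ttl, uint64_t periodic_secs) {
  if (periodic_secs == kAdjustToDefault) {
    // periodic_compaction_seconds left at default: 30 days applies only
    // when ttl is also 0; otherwise ttl wins.
    return ttl == 0 ? k30Days : ttl;
  }
  if (ttl == 0) return periodic_secs;
  if (periodic_secs == 0) return ttl;
  return std::min(ttl, periodic_secs);  // stricter value wins
}
```

The PR's point is that this folding only happened inside `SanitizeOptions()`, so dynamically changing `periodic_compaction_seconds` via `SetOptions()` never reached the FIFO picker, which is why `ttl` is now the single source of truth.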
......
......@@ -1626,6 +1626,9 @@ extern ROCKSDB_LIBRARY_API void rocksdb_options_set_row_cache(
extern ROCKSDB_LIBRARY_API void
rocksdb_options_add_compact_on_deletion_collector_factory(
rocksdb_options_t*, size_t window_size, size_t num_dels_trigger);
extern ROCKSDB_LIBRARY_API void
rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio(
rocksdb_options_t*, size_t window_size, size_t num_dels_trigger,
double deletion_ratio);
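The backward-compatibility pattern from the commit message, keep the old C symbol with its original signature and route it through a new extended one, looks roughly like this; the struct and function bodies are simplified stand-ins, not the actual `c.cc` implementations:

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for rocksdb_options_t.
struct rocksdb_options_stub {
  size_t window = 0;
  size_t trigger = 0;
  double deletion_ratio = 0.0;
};

// New, extended entry point carrying the added parameter.
void add_collector_factory_del_ratio(rocksdb_options_stub* opt,
                                     size_t window_size,
                                     size_t num_dels_trigger,
                                     double deletion_ratio) {
  opt->window = window_size;
  opt->trigger = num_dels_trigger;
  opt->deletion_ratio = deletion_ratio;
}

// Old entry point, signature unchanged: existing callers keep compiling and
// linking, and simply get a neutral default for the new parameter.
void add_collector_factory(rocksdb_options_stub* opt, size_t window_size,
                           size_t num_dels_trigger) {
  add_collector_factory_del_ratio(opt, window_size, num_dels_trigger,
                                  /*deletion_ratio=*/0.0);
}
```

Because C has no overloading or default arguments, adding a suffixed sibling function is the conventional way to extend a stable C ABI without breaking existing users.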
extern ROCKSDB_LIBRARY_API void rocksdb_options_set_manual_wal_flush(
......
......@@ -196,6 +196,7 @@ Status SstFileDumper::NewTableReader(
}
Status SstFileDumper::VerifyChecksum() {
assert(read_options_.verify_checksums);
// We could pass specific readahead setting into read options if needed.
return table_reader_->VerifyChecksum(read_options_,
TableReaderCaller::kSSTDumpTool);
......
......@@ -419,6 +419,10 @@ int SSTDumpTool::Run(int argc, char const* const* argv, Options options) {
filename = std::string(dir_or_file) + "/" + filename;
}
if (command == "verify") {
verify_checksum = true;
}
ROCKSDB_NAMESPACE::SstFileDumper dumper(
options, filename, Temperature::kUnknown, readahead_size,
verify_checksum, output_hex, decode_blob_index);
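The fix above, forcing checksum verification whenever the `verify` command is chosen regardless of the flag's parsed value, reduces to a small flag-defaulting rule; the helper below is hypothetical, not part of `SSTDumpTool`:

```cpp
#include <cassert>
#include <string>

// Sketch: the `verify` command implies block-checksum verification, so the
// command overrides whatever the flag parser produced. Before the fix,
// `sst_dump --command=verify` silently skipped checksum checks.
bool ResolveVerifyChecksum(const std::string& command, bool flag_value) {
  if (command == "verify") return true;
  return flag_value;
}
```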
......
Option `periodic_compaction_seconds` is no longer supported for FIFO compaction: setting it has no effect on FIFO compactions. FIFO compaction users should set option `ttl` instead.
\ No newline at end of file