Unverified    Commit 8a2a0008    authored by zifeihan, committed by GitHub

Add `slowTraceSegmentThreshold` to forcibly sample slow traces (#5707)

Parent af41da0e
......@@ -12,6 +12,7 @@ Release Notes.
#### OAP-Backend
* Add the `@SuperDataset` annotation for BrowserErrorLog.
* Support keeping collecting slow trace segments when the sampling mechanism is activated.
* Support choosing files to activate the meter analyzer.
* Improve Kubernetes service registry for ALS analysis.
......
......@@ -145,6 +145,7 @@ core|default|role|Option values, `Mixed/Receiver/Aggregator`. **Receiver** mode
| - | - |forceSampleErrorSegment|When the sampling mechanism is activated, this config makes error status segments sampled, ignoring the sampling rate.|SW_FORCE_SAMPLE_ERROR_SEGMENT|true|
| - | - |segmentStatusAnalysisStrategy|Determine the final segment status from the status of spans. Available values are `FROM_SPAN_STATUS` , `FROM_ENTRY_SPAN` and `FROM_FIRST_SPAN`. `FROM_SPAN_STATUS` represents the segment status would be error if any span is in error status. `FROM_ENTRY_SPAN` means the segment status would be determined by the status of entry spans only. `FROM_FIRST_SPAN` means the segment status would be determined by the status of the first span only.|SW_SEGMENT_STATUS_ANALYSIS_STRATEGY|FROM_SPAN_STATUS|
| - | - |noUpstreamRealAddressAgents|Exit spans with a component in this list would not generate client-side instance relation metrics, because some tracing plugins, such as Nginx-LUA and Envoy, can't collect the real peer IP address.|SW_NO_UPSTREAM_REAL_ADDRESS|6000,9000|
| - | - |slowTraceSegmentThreshold|Setting this latency threshold (unit: millisecond) makes slow trace segments sampled if they cost more time, even when the sampling mechanism is activated. The default value is `-1`, which means slow traces would not be sampled.|SW_SLOW_TRACE_SEGMENT_THRESHOLD|-1|
| - | - |meterAnalyzerActiveFiles|Which files could be meter analyzed, files split by ","|SW_METER_ANALYZER_ACTIVE_FILES||
| receiver-sharing-server|default| Sharing server provides new gRPC and restful servers for data collection, and makes the servers in the core module work for internal communication only.| - | - |
| - | - | restHost| Binding IP of restful service. Services include GraphQL query and HTTP data report| - | - |
......
......@@ -12,6 +12,7 @@ Right now, SkyWalking supports following dynamic configurations.
|core.default.apdexThreshold| The apdex threshold settings, will override `service-apdex-threshold.yml`. | same as [`service-apdex-threshold.yml`](apdex-threshold.md) |
|core.default.endpoint-name-grouping| The endpoint name grouping setting, will override `endpoint-name-grouping.yml`. | same as [`endpoint-name-grouping.yml`](endpoint-grouping-rules.md) |
|agent-analyzer.default.sampleRate| Trace sampling, overrides `receiver-trace/default/sampleRate` of `application.yml`. | 10000 |
|agent-analyzer.default.slowTraceSegmentThreshold| Setting this latency threshold (unit: millisecond) makes slow trace segments sampled if they cost more time, even when the sampling mechanism is activated. The default value is `-1`, which means slow traces would not be sampled. Overrides `receiver-trace/default/slowTraceSegmentThreshold` of `application.yml`. | -1 |
This feature depends on upstream service, so it is **DISABLED** by default.
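For illustration, the sketch below models this dynamic setting as a configuration item on the backend side. It is a hedged example only: the key and the `ConfigTable`/`ConfigItem` API mirror the `MockConfigWatcherRegister` in the unit test added by this commit, while the class and method names here are purely illustrative.

```java
import org.apache.skywalking.oap.server.configuration.api.ConfigTable;

// Illustrative only: builds the dynamic setting as a ConfigTable item, the same shape
// the MockConfigWatcherRegister in this commit's test returns to the watcher.
public class SlowTraceDynamicConfigSketch {
    public static ConfigTable slowTraceThresholdConfig() {
        ConfigTable table = new ConfigTable();
        // Sample any trace segment slower than 3000 ms, even when trace sampling is active.
        table.add(new ConfigTable.ConfigItem("agent-analyzer.default.slowTraceSegmentThreshold", "3000"));
        return table;
    }
}
```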
......
......@@ -16,14 +16,18 @@ agent-analyzer:
...
sampleRate: ${SW_TRACE_SAMPLE_RATE:1000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
forceSampleErrorSegment: ${SW_FORCE_SAMPLE_ERROR_SEGMENT:true} # When the sampling mechanism is activated, this config makes error status segments sampled, ignoring the sampling rate.
slowTraceSegmentThreshold: ${SW_SLOW_TRACE_SEGMENT_THRESHOLD:-1} # Setting this latency threshold (unit: millisecond) makes slow trace segments sampled if they cost more time, even when the sampling mechanism is activated. The default value is `-1`, which means slow traces would not be sampled.
```
`sampleRate` is for you to set the sample rate for this backend.
The sample rate precision is 1/10000; 10000 means 100% sampling by default. For example, `sampleRate: 1000` means 10% of the trace segments are sampled.
`forceSampleErrorSegment` is for you to save all error segments when the sampling mechanism is activated.
When the sampling mechanism is activated, this config makes error status segments sampled, ignoring the sampling rate.
`slowTraceSegmentThreshold` is for you to save all slow trace segments when the sampling mechanism is activated.
Setting this latency threshold (unit: millisecond) makes slow trace segments sampled if they cost more time, even when the sampling mechanism is activated. The default value is `-1`, which means slow traces would not be sampled.
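To make the interaction between `sampleRate`, `forceSampleErrorSegment`, and `slowTraceSegmentThreshold` concrete, here is a minimal sketch of the per-segment decision. It follows the ordering of the if/else chain added to `SegmentAnalysisListener` in this commit; the method and parameter names are illustrative, and the exact sample-rate check is an assumption, not the real OAP implementation.

```java
// A minimal sketch, not the actual OAP code. The ordering mirrors the decision chain
// added to SegmentAnalysisListener in this commit; the sample-rate check is assumed.
public class SamplingDecisionSketch {
    static boolean shouldKeepSegment(int sampledValue, int sampleRate,
                                     boolean isError, boolean forceSampleErrorSegment,
                                     int durationMs, int slowTraceSegmentThreshold) {
        if (sampledValue < sampleRate) {
            return true;  // hit by the normal 1/10000-precision sample rate (assumed check)
        } else if (isError && forceSampleErrorSegment) {
            return true;  // error segments are always kept when forceSampleErrorSegment=true
        } else if (slowTraceSegmentThreshold > -1 && durationMs >= slowTraceSegmentThreshold) {
            return true;  // slow segments are kept once the threshold is set (> -1)
        }
        return false;     // otherwise the segment is ignored by this backend
    }
}
```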
# Recommendation
You could set different backend instances with different `sampleRate` values, but we recommend setting the same value.
......@@ -37,6 +41,6 @@ Then the 35% traces in the global will be collected and saved in storage consist
because they are reported to Backend-Instance**A** and ignored.
# Note
When you enable sampling, the actual sample rate could exceed `sampleRate`, because currently all error/slow segments are saved, while their upstream and downstream segments may not be sampled. This feature makes sure you have the error/slow stacks and segments, but doesn't guarantee you get the whole trace.
Also, the side effect would be that, if most of the accesses fail or are slow, the sampling rate approaches 100%, which could crash the backend or storage clusters.
......@@ -23,6 +23,7 @@ import lombok.Setter;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.apache.skywalking.oap.server.analyzer.provider.trace.DBLatencyThresholdsAndWatcher;
import org.apache.skywalking.oap.server.analyzer.provider.trace.TraceLatencyThresholdsAndWatcher;
import org.apache.skywalking.oap.server.analyzer.provider.trace.TraceSampleRateWatcher;
import org.apache.skywalking.oap.server.analyzer.provider.trace.UninstrumentedGatewaysConfig;
import org.apache.skywalking.oap.server.analyzer.provider.trace.parser.listener.strategy.SegmentStatusStrategy;
......@@ -56,6 +57,12 @@ public class AnalyzerModuleConfig extends ModuleConfig {
@Setter
@Getter
private String slowDBAccessThreshold = "default:200";
/**
* Setting this latency threshold (unit: millisecond) makes the slow trace segments sampled if they cost more time, even when the sampling mechanism is activated. The default value is `-1`, which means slow traces would not be sampled.
*/
@Setter
@Getter
private int slowTraceSegmentThreshold = -1;
@Setter
@Getter
private DBLatencyThresholdsAndWatcher dbLatencyThresholdsAndWatcher;
......@@ -65,6 +72,9 @@ public class AnalyzerModuleConfig extends ModuleConfig {
@Setter
@Getter
private TraceSampleRateWatcher traceSampleRateWatcher;
@Setter
@Getter
private TraceLatencyThresholdsAndWatcher traceLatencyThresholdsAndWatcher;
/**
* Analysis trace status.
* <p>
......
......@@ -25,6 +25,7 @@ import org.apache.skywalking.oap.server.analyzer.provider.meter.config.MeterConf
import org.apache.skywalking.oap.server.analyzer.provider.meter.process.IMeterProcessService;
import org.apache.skywalking.oap.server.analyzer.provider.meter.process.MeterProcessService;
import org.apache.skywalking.oap.server.analyzer.provider.trace.DBLatencyThresholdsAndWatcher;
import org.apache.skywalking.oap.server.analyzer.provider.trace.TraceLatencyThresholdsAndWatcher;
import org.apache.skywalking.oap.server.analyzer.provider.trace.TraceSampleRateWatcher;
import org.apache.skywalking.oap.server.analyzer.provider.trace.UninstrumentedGatewaysConfig;
import org.apache.skywalking.oap.server.analyzer.provider.trace.parser.ISegmentParserService;
......@@ -58,6 +59,8 @@ public class AnalyzerModuleProvider extends ModuleProvider {
private SegmentParserServiceImpl segmentParserService;
@Getter
private TraceSampleRateWatcher traceSampleRateWatcher;
@Getter
private TraceLatencyThresholdsAndWatcher traceLatencyThresholdsAndWatcher;
private List<MeterConfig> meterConfigs;
@Getter
......@@ -89,10 +92,12 @@ public class AnalyzerModuleProvider extends ModuleProvider {
uninstrumentedGatewaysConfig = new UninstrumentedGatewaysConfig(this);
traceSampleRateWatcher = new TraceSampleRateWatcher(this);
traceLatencyThresholdsAndWatcher = new TraceLatencyThresholdsAndWatcher(this);
moduleConfig.setDbLatencyThresholdsAndWatcher(thresholds);
moduleConfig.setUninstrumentedGatewaysConfig(uninstrumentedGatewaysConfig);
moduleConfig.setTraceSampleRateWatcher(traceSampleRateWatcher);
moduleConfig.setTraceLatencyThresholdsAndWatcher(traceLatencyThresholdsAndWatcher);
segmentParserService = new SegmentParserServiceImpl(getManager(), moduleConfig);
this.registerServiceImplementation(ISegmentParserService.class, segmentParserService);
......
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
package org.apache.skywalking.oap.server.analyzer.provider.trace;
import java.util.concurrent.atomic.AtomicReference;
import lombok.extern.slf4j.Slf4j;
import org.apache.skywalking.oap.server.analyzer.module.AnalyzerModule;
import org.apache.skywalking.oap.server.analyzer.provider.AnalyzerModuleConfig;
import org.apache.skywalking.oap.server.configuration.api.ConfigChangeWatcher;
import org.apache.skywalking.oap.server.library.module.ModuleProvider;
/**
* This watcher holds the latency threshold that makes slow trace segments sampled if they cost more time,
* even when the sampling mechanism is activated. The default value is `-1`, which means slow traces would not be sampled. Unit: millisecond.
*/
@Slf4j
public class TraceLatencyThresholdsAndWatcher extends ConfigChangeWatcher {
private AtomicReference<Integer> slowTraceSegmentThreshold;
public TraceLatencyThresholdsAndWatcher(ModuleProvider provider) {
super(AnalyzerModule.NAME, provider, "slowTraceSegmentThreshold");
slowTraceSegmentThreshold = new AtomicReference<>();
slowTraceSegmentThreshold.set(getDefaultValue());
}
private void activeSetting(String config) {
if (log.isDebugEnabled()) {
log.debug("Updating using new static config: {}", config);
}
try {
slowTraceSegmentThreshold.set(Integer.parseInt(config));
} catch (NumberFormatException ex) {
log.error("Cannot load slowTraceThreshold from: {}", config, ex);
}
}
@Override
public void notify(ConfigChangeEvent value) {
if (EventType.DELETE.equals(value.getEventType())) {
activeSetting(String.valueOf(getDefaultValue()));
} else {
activeSetting(value.getNewValue());
}
}
@Override
public String value() {
return String.valueOf(slowTraceSegmentThreshold.get());
}
private int getDefaultValue() {
return ((AnalyzerModuleConfig) this.getProvider().createConfigBeanIfAbsent()).getSlowTraceSegmentThreshold();
}
public int getSlowTraceSegmentThreshold() {
return slowTraceSegmentThreshold.get();
}
public boolean shouldSample(int duration) {
return (slowTraceSegmentThreshold.get() > -1) && (duration >= slowTraceSegmentThreshold.get());
}
}
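Below is a hedged, self-contained sketch of how a dynamic-configuration change flows through this watcher. The `ConfigChangeEvent` constructor and the `new AnalyzerModuleProvider()` usage mirror the unit test at the end of this commit; the class name and `main` method are illustrative only, since in the OAP the `ConfigWatcherRegister` drives `notify()`.

```java
import org.apache.skywalking.oap.server.analyzer.provider.AnalyzerModuleProvider;
import org.apache.skywalking.oap.server.analyzer.provider.trace.TraceLatencyThresholdsAndWatcher;
import org.apache.skywalking.oap.server.configuration.api.ConfigChangeWatcher;

// Illustrative usage only; in the OAP, the ConfigWatcherRegister calls notify().
public class TraceLatencyWatcherSketch {
    public static void main(String[] args) {
        AnalyzerModuleProvider provider = new AnalyzerModuleProvider(); // as in the unit test below
        TraceLatencyThresholdsAndWatcher watcher = new TraceLatencyThresholdsAndWatcher(provider);

        System.out.println(watcher.shouldSample(5000));  // false: default threshold is -1 (disabled)

        // Dynamic update: agent-analyzer.default.slowTraceSegmentThreshold = 3000 (ms)
        watcher.notify(new ConfigChangeWatcher.ConfigChangeEvent("3000", ConfigChangeWatcher.EventType.MODIFY));
        System.out.println(watcher.shouldSample(5000));  // true: 5000 ms >= 3000 ms
        System.out.println(watcher.shouldSample(1000));  // false: below the threshold

        // Deleting the dynamic value falls back to the static default (-1), disabling slow sampling again.
        watcher.notify(new ConfigChangeWatcher.ConfigChangeEvent("3000", ConfigChangeWatcher.EventType.DELETE));
        System.out.println(watcher.shouldSample(5000));  // false
    }
}
```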
......@@ -26,8 +26,9 @@ import org.apache.skywalking.apm.network.language.agent.v3.SegmentObject;
import org.apache.skywalking.apm.network.language.agent.v3.SpanObject;
import org.apache.skywalking.apm.util.StringUtil;
import org.apache.skywalking.oap.server.analyzer.provider.AnalyzerModuleConfig;
import org.apache.skywalking.oap.server.analyzer.provider.trace.TraceLatencyThresholdsAndWatcher;
import org.apache.skywalking.oap.server.analyzer.provider.trace.parser.listener.strategy.SegmentStatusAnalyzer;
import org.apache.skywalking.oap.server.analyzer.provider.trace.parser.listener.strategy.SegmentStatusStrategy;
import org.apache.skywalking.oap.server.core.Const;
import org.apache.skywalking.oap.server.core.CoreModule;
import org.apache.skywalking.oap.server.core.analysis.IDManager;
......@@ -53,6 +54,7 @@ public class SegmentAnalysisListener implements FirstAnalysisListener, EntryAnal
private final NamingControl namingControl;
private final List<String> searchableTagKeys;
private final SegmentStatusAnalyzer segmentStatusAnalyzer;
private final TraceLatencyThresholdsAndWatcher traceLatencyThresholdsAndWatcher;
private final Segment segment = new Segment();
private SAMPLE_STATUS sampleStatus = SAMPLE_STATUS.UNKNOWN;
......@@ -148,6 +150,8 @@ public class SegmentAnalysisListener implements FirstAnalysisListener, EntryAnal
sampleStatus = SAMPLE_STATUS.SAMPLED;
} else if (isError && forceSampleErrorSegment) {
sampleStatus = SAMPLE_STATUS.SAMPLED;
} else if (traceLatencyThresholdsAndWatcher.shouldSample(duration)) {
sampleStatus = SAMPLE_STATUS.SAMPLED;
} else {
sampleStatus = SAMPLE_STATUS.IGNORE;
}
......@@ -189,6 +193,7 @@ public class SegmentAnalysisListener implements FirstAnalysisListener, EntryAnal
private final NamingControl namingControl;
private final List<String> searchTagKeys;
private final SegmentStatusAnalyzer segmentStatusAnalyzer;
private final TraceLatencyThresholdsAndWatcher traceLatencyThresholdsAndWatcher;
public Factory(ModuleManager moduleManager, AnalyzerModuleConfig config) {
this.sourceReceiver = moduleManager.find(CoreModule.NAME).provider().getService(SourceReceiver.class);
......@@ -203,6 +208,7 @@ public class SegmentAnalysisListener implements FirstAnalysisListener, EntryAnal
.getService(NamingControl.class);
this.segmentStatusAnalyzer = SegmentStatusStrategy.findByName(config.getSegmentStatusAnalysisStrategy())
.getExceptionAnalyzer();
this.traceLatencyThresholdsAndWatcher = config.getTraceLatencyThresholdsAndWatcher();
}
@Override
......@@ -213,7 +219,8 @@ public class SegmentAnalysisListener implements FirstAnalysisListener, EntryAnal
forceSampleErrorSegment,
namingControl,
searchTagKeys,
segmentStatusAnalyzer
segmentStatusAnalyzer,
traceLatencyThresholdsAndWatcher
);
}
}
......
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
package org.apache.skywalking.oap.server.analyzer.provider.trace;
import java.util.Optional;
import java.util.Set;
import org.apache.skywalking.oap.server.analyzer.provider.AnalyzerModuleProvider;
import org.apache.skywalking.oap.server.configuration.api.ConfigChangeWatcher;
import org.apache.skywalking.oap.server.configuration.api.ConfigTable;
import org.apache.skywalking.oap.server.configuration.api.ConfigWatcherRegister;
import org.junit.Assert;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.runners.MockitoJUnitRunner;
import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.MatcherAssert.assertThat;
@RunWith(MockitoJUnitRunner.class)
public class TraceLatencyThresholdsAndWatcherTest {
private AnalyzerModuleProvider provider;
@Before
public void init() {
provider = new AnalyzerModuleProvider();
}
@Test
public void testInit() {
TraceLatencyThresholdsAndWatcher traceLatencyThresholdsAndWatcher = new TraceLatencyThresholdsAndWatcher(provider);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.getSlowTraceSegmentThreshold(), -1);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.value(), "-1");
}
@Test(timeout = 20000)
public void testDynamicUpdate() throws InterruptedException {
ConfigWatcherRegister register = new MockConfigWatcherRegister(3);
TraceLatencyThresholdsAndWatcher watcher = new TraceLatencyThresholdsAndWatcher(provider);
register.registerConfigChangeWatcher(watcher);
register.start();
while (watcher.getSlowTraceSegmentThreshold() == -1) {
Thread.sleep(2000);
}
assertThat(watcher.getSlowTraceSegmentThreshold(), is(3000));
assertThat(provider.getModuleConfig().getSlowTraceSegmentThreshold(), is(-1));
}
@Test
public void testNotify() {
TraceLatencyThresholdsAndWatcher traceLatencyThresholdsAndWatcher = new TraceLatencyThresholdsAndWatcher(provider);
ConfigChangeWatcher.ConfigChangeEvent value1 = new ConfigChangeWatcher.ConfigChangeEvent(
"8000", ConfigChangeWatcher.EventType.MODIFY);
traceLatencyThresholdsAndWatcher.notify(value1);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.getSlowTraceSegmentThreshold(), 8000);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.value(), "8000");
ConfigChangeWatcher.ConfigChangeEvent value2 = new ConfigChangeWatcher.ConfigChangeEvent(
"8000", ConfigChangeWatcher.EventType.DELETE);
traceLatencyThresholdsAndWatcher.notify(value2);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.getSlowTraceSegmentThreshold(), -1);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.value(), "-1");
ConfigChangeWatcher.ConfigChangeEvent value3 = new ConfigChangeWatcher.ConfigChangeEvent(
"800", ConfigChangeWatcher.EventType.ADD);
traceLatencyThresholdsAndWatcher.notify(value3);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.getSlowTraceSegmentThreshold(), 800);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.value(), "800");
ConfigChangeWatcher.ConfigChangeEvent value4 = new ConfigChangeWatcher.ConfigChangeEvent(
"abc", ConfigChangeWatcher.EventType.MODIFY);
traceLatencyThresholdsAndWatcher.notify(value4);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.getSlowTraceSegmentThreshold(), 800);
Assert.assertEquals(traceLatencyThresholdsAndWatcher.value(), "800");
}
public static class MockConfigWatcherRegister extends ConfigWatcherRegister {
public MockConfigWatcherRegister(long syncPeriod) {
super(syncPeriod);
}
@Override
public Optional<ConfigTable> readConfig(Set<String> keys) {
ConfigTable table = new ConfigTable();
table.add(new ConfigTable.ConfigItem("agent-analyzer.default.slowTraceSegmentThreshold", "3000"));
return Optional.of(table);
}
}
}
......@@ -191,6 +191,7 @@ agent-analyzer:
# Nginx and Envoy agents can't get the real remote address.
# Exit spans with the component in the list would not generate the client-side instance relation metrics.
noUpstreamRealAddressAgents: ${SW_NO_UPSTREAM_REAL_ADDRESS:6000,9000}
slowTraceSegmentThreshold: ${SW_SLOW_TRACE_SEGMENT_THRESHOLD:-1} # Setting this latency threshold (unit: millisecond) makes slow trace segments sampled if they cost more time, even when the sampling mechanism is activated. The default value is `-1`, which means slow traces would not be sampled.
meterAnalyzerActiveFiles: ${SW_METER_ANALYZER_ACTIVE_FILES:} # Which files could be meter analyzed, files split by ","
receiver-sharing-server:
......