Commit 5577f67d, author: wu-sheng, committer: 彭勇升 (pengys)

Support backend sampling (#1977)

* Support sampling trace at server side and keep metric right.

* Add a trace sampling document

* Fix wrong default value.

* Fix document issue.

* Fix assemble issue.

* Fix wrong settings and doc.
Parent f5545a87
......@@ -41,6 +41,7 @@
<includes>
<include>*.yml</include>
<include>*.xml</include>
<include>*.properties</include>
</includes>
</fileSet>
<fileSet>
......
......@@ -68,6 +68,7 @@ receiver-trace:
bufferOffsetMaxFileSize: ${SW_RECEIVER_BUFFER_OFFSET_MAX_FILE_SIZE:100} # Unit is MB
bufferDataMaxFileSize: ${SW_RECEIVER_BUFFER_DATA_MAX_FILE_SIZE:500} # Unit is MB
bufferFileCleanWhenRestart: ${SW_RECEIVER_BUFFER_FILE_CLEAN_WHEN_RESTART:false}
sampleRate: ${SW_TRACE_SAMPLE_RATE:10000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
receiver-jvm:
default:
service-mesh:
......
......@@ -23,6 +23,7 @@ receiver-trace:
bufferOffsetMaxFileSize: 100 # Unit is MB
bufferDataMaxFileSize: 500 # Unit is MB
bufferFileCleanWhenRestart: false
sampleRate: ${SW_TRACE_SAMPLE_RATE:1000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
receiver-jvm:
default:
service-mesh:
......
......@@ -63,6 +63,8 @@ DB. But clearly, it doesn't fit the product env. In here, you could find what ot
Choose the one you like; we also welcome anyone to contribute a new storage implementor.
1. [Set receivers](backend-receivers.md). You can choose receivers according to your requirements. Most receivers
are harmless, at least our default receivers are, so you can set and activate all the receivers provided.
1. Do [trace sampling](trace-sampling.md) at the backend. This sampling keeps the metrics accurate; it only skips saving some of the traces
to storage, based on the rate.
1. Official [OAL scripts](../../guides/backend-oal-scripts.md). As you know from our [OAL introduction](../../concepts-and-designs/oal.md),
most of the backend analysis capabilities are based on these scripts. Here is the description of the official scripts,
which helps you understand which metric data are being processed and can also be used in alarms.
......
......@@ -58,7 +58,7 @@ storage:
```
All connection related settings including link url, username and password
are in `databsource-settings.properties`.
are in `datasource-settings.properties`.
This settings file follows the [HikariCP](https://github.com/brettwooldridge/HikariCP) connection pool document.
## TiDB
......@@ -71,7 +71,7 @@ storage:
```
All connection related settings including link url, username and password
are in `databsource-settings.properties`. And these settings can refer to the configuration of *MySQL* above.
are in `datasource-settings.properties`. And these settings can refer to the configuration of *MySQL* above.
## More storage solution extension
Follow [Storage extension development guide](../../guides/storage-extention.md)
......
# Trace Sampling at server side
When we run a distributed tracing system, traces give us detailed information, but they cost a lot of storage.
With the server-side trace sampling mechanism enabled, the metrics of services, service instances, endpoints and topology stay as accurate
as before; the backend just doesn't save every trace into storage.
Of course, even with sampling enabled, traces are kept as consistent as possible. **Consistent** means that once the trace
segments have been collected and reported by agents, the backend does its best not to break the trace. See [Recommendation](#recommendation)
to understand why we say `as consistent as possible` and `does its best not to break the trace`.
## Set the sample rate
In the **receiver-trace** receiver, you will find the `sampleRate` setting.
```yaml
receiver-trace:
  default:
    bufferPath: ../trace-buffer/  # Path to trace buffer files, suggest to use absolute path
    bufferOffsetMaxFileSize: 100 # Unit is MB
    bufferDataMaxFileSize: 500 # Unit is MB
    bufferFileCleanWhenRestart: false
    sampleRate: ${SW_TRACE_SAMPLE_RATE:1000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
```
`sampleRate` sets the sample rate of this backend.
The sample rate precision is 1/10000. The default, 10000, means 100% of traces are sampled; the value 1000 shown above keeps 10%.
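As a rough illustration of how a `sampleRate` value maps to a kept fraction, the sketch below mirrors the modulo check used by `TraceSegmentSampler` in this commit. The class `SampleRateSketch` and its helper methods are hypothetical names used only for this example; they are not part of the backend.

```java
public class SampleRateSketch {
    // The setting is expressed in units of 1/10000.
    public static final int PRECISION = 10000;

    // Hypothetical helper: converts a sampleRate setting into the percentage of traces kept.
    public static double toPercent(int sampleRate) {
        return sampleRate * 100.0 / PRECISION;
    }

    // Mirrors the committed check: keep the segment when (last id part % 10000) < sampleRate.
    public static boolean shouldSample(long lastIdPart, int sampleRate) {
        return lastIdPart % PRECISION < sampleRate;
    }

    public static void main(String[] args) {
        System.out.println(toPercent(10000));          // 100.0
        System.out.println(toPercent(1000));           // 10.0
        System.out.println(shouldSample(999, 1000));   // true, 999 is below the rate
        System.out.println(shouldSample(1000, 1000));  // false, 1000 is not below the rate
    }
}
```

Because the decision depends only on the trace ID, every backend instance configured with the same rate reaches the same decision for the same trace.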
## Recommendation
You could set different `sampleRate` values on different backend instances, but we recommend using the same value on all of them.
When the rates differ, let's say
* Backend-Instance**A**.sampleRate = 3500 (35%, given the 1/10000 precision)
* Backend-Instance**B**.sampleRate = 5500 (55%)

and we assume the agents report all trace segments to the backend,
then 35% of all traces will be collected and saved in storage consistently and completely, with all spans.
For another 20% of traces, only the segments reported to Backend-Instance**B** will be saved in storage. These traces may miss some segments,
because those segments were reported to Backend-Instance**A** and ignored.
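
The effect of mixing rates can be checked with a small simulation. This is a minimal sketch under stated assumptions: every trace has segments on both backend instances, the last ID part is spread evenly over 0..9999, and the settings 3500 and 5500 stand for the 35% and 55% rates at 1/10000 precision. `MixedRateSketch` and `classify` are hypothetical names, not backend APIs.

```java
public class MixedRateSketch {
    public static final int PRECISION = 10000;

    // Same modulo check as the sampler in this commit.
    public static boolean shouldSample(long lastIdPart, int sampleRate) {
        return lastIdPart % PRECISION < sampleRate;
    }

    // Returns {complete, partial, dropped} counts over every possible sequence value.
    public static int[] classify(int rateA, int rateB) {
        int complete = 0;
        int partial = 0;
        int dropped = 0;
        for (long seq = 0; seq < PRECISION; seq++) {
            boolean keptByA = shouldSample(seq, rateA);
            boolean keptByB = shouldSample(seq, rateB);
            if (keptByA && keptByB) {
                complete++;   // both instances keep their segments
            } else if (keptByA || keptByB) {
                partial++;    // only one instance keeps its segments
            } else {
                dropped++;    // neither instance keeps anything
            }
        }
        return new int[] {complete, partial, dropped};
    }

    public static void main(String[] args) {
        int[] r = classify(3500, 5500);
        // prints complete=3500 partial=2000 dropped=4500
        System.out.printf("complete=%d partial=%d dropped=%d%n", r[0], r[1], r[2]);
    }
}
```

Out of 10000 possible sequence values, 3500 traces are stored whole and 2000 are stored with gaps, which matches the 35% and 20% figures above.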
\ No newline at end of file
......@@ -20,13 +20,22 @@ package org.apache.skywalking.oap.server.receiver.trace.provider;
import java.io.IOException;
import org.apache.skywalking.oap.server.core.CoreModule;
import org.apache.skywalking.oap.server.core.server.*;
import org.apache.skywalking.oap.server.library.module.*;
import org.apache.skywalking.oap.server.core.server.GRPCHandlerRegister;
import org.apache.skywalking.oap.server.core.server.JettyHandlerRegister;
import org.apache.skywalking.oap.server.library.module.ModuleConfig;
import org.apache.skywalking.oap.server.library.module.ModuleDefine;
import org.apache.skywalking.oap.server.library.module.ModuleProvider;
import org.apache.skywalking.oap.server.library.module.ModuleStartException;
import org.apache.skywalking.oap.server.library.module.ServiceNotProvidedException;
import org.apache.skywalking.oap.server.receiver.trace.module.TraceModule;
import org.apache.skywalking.oap.server.receiver.trace.provider.handler.v5.grpc.TraceSegmentServiceHandler;
import org.apache.skywalking.oap.server.receiver.trace.provider.handler.v5.rest.TraceSegmentServletHandler;
import org.apache.skywalking.oap.server.receiver.trace.provider.handler.v6.grpc.TraceSegmentReportServiceHandler;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.*;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.ISegmentParserService;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.SegmentParse;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.SegmentParseV2;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.SegmentParserListenerManager;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.SegmentParserServiceImpl;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.endpoint.MultiScopesSpanListener;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.segment.SegmentSpanListener;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.service.ServiceMappingSpanListener;
......@@ -61,14 +70,14 @@ public class TraceModuleProvider extends ModuleProvider {
SegmentParserListenerManager listenerManager = new SegmentParserListenerManager();
listenerManager.add(new MultiScopesSpanListener.Factory());
listenerManager.add(new ServiceMappingSpanListener.Factory());
listenerManager.add(new SegmentSpanListener.Factory());
listenerManager.add(new SegmentSpanListener.Factory(moduleConfig.getSampleRate()));
segmentProducer = new SegmentParse.Producer(getManager(), listenerManager);
listenerManager = new SegmentParserListenerManager();
listenerManager.add(new MultiScopesSpanListener.Factory());
listenerManager.add(new ServiceMappingSpanListener.Factory());
listenerManager.add(new SegmentSpanListener.Factory());
listenerManager.add(new SegmentSpanListener.Factory(moduleConfig.getSampleRate()));
segmentProducerV2 = new SegmentParseV2.Producer(getManager(), listenerManager);
......
......@@ -29,4 +29,9 @@ public class TraceServiceModuleConfig extends ModuleConfig {
@Setter @Getter private int bufferOffsetMaxFileSize;
@Setter @Getter private int bufferDataMaxFileSize;
@Setter @Getter private boolean bufferFileCleanWhenRestart;
/**
* The sample rate precision is 1/10000.
* 10000 means 100% sample in default.
*/
@Setter @Getter private int sampleRate = 10000;
}
......@@ -21,12 +21,20 @@ package org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener
import org.apache.skywalking.apm.network.language.agent.UniqueId;
import org.apache.skywalking.oap.server.core.CoreModule;
import org.apache.skywalking.oap.server.core.cache.EndpointInventoryCache;
import org.apache.skywalking.oap.server.core.source.*;
import org.apache.skywalking.oap.server.core.source.Segment;
import org.apache.skywalking.oap.server.core.source.SourceReceiver;
import org.apache.skywalking.oap.server.library.module.ModuleManager;
import org.apache.skywalking.oap.server.library.util.*;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.decorator.*;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.*;
import org.slf4j.*;
import org.apache.skywalking.oap.server.library.util.BooleanUtils;
import org.apache.skywalking.oap.server.library.util.TimeBucketUtils;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.decorator.SegmentCoreInfo;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.decorator.SpanDecorator;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.EntrySpanListener;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.FirstSpanListener;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.GlobalTraceIdsListener;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.SpanListener;
import org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.SpanListenerFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* @author peng-yongsheng
......@@ -36,12 +44,15 @@ public class SegmentSpanListener implements FirstSpanListener, EntrySpanListener
private static final Logger logger = LoggerFactory.getLogger(SegmentSpanListener.class);
private final SourceReceiver sourceReceiver;
private final TraceSegmentSampler sampler;
private final Segment segment = new Segment();
private final EndpointInventoryCache serviceNameCacheService;
private SAMPLE_STATUS sampleStatus = SAMPLE_STATUS.UNKNOWN;
private int entryEndpointId = 0;
private int firstEndpointId = 0;
private SegmentSpanListener(ModuleManager moduleManager) {
private SegmentSpanListener(ModuleManager moduleManager, TraceSegmentSampler sampler) {
this.sampler = sampler;
this.sourceReceiver = moduleManager.find(CoreModule.NAME).provider().getService(SourceReceiver.class);
this.serviceNameCacheService = moduleManager.find(CoreModule.NAME).provider().getService(EndpointInventoryCache.class);
}
......@@ -52,6 +63,10 @@ public class SegmentSpanListener implements FirstSpanListener, EntrySpanListener
@Override
public void parseFirst(SpanDecorator spanDecorator, SegmentCoreInfo segmentCoreInfo) {
if (sampleStatus.equals(SAMPLE_STATUS.IGNORE)) {
return;
}
long timeBucket = TimeBucketUtils.INSTANCE.getSecondTimeBucket(segmentCoreInfo.getStartTime());
segment.setSegmentId(segmentCoreInfo.getSegmentId());
......@@ -75,6 +90,18 @@ public class SegmentSpanListener implements FirstSpanListener, EntrySpanListener
}
@Override public void parseGlobalTraceId(UniqueId uniqueId, SegmentCoreInfo segmentCoreInfo) {
if (sampleStatus.equals(SAMPLE_STATUS.UNKNOWN) || sampleStatus.equals(SAMPLE_STATUS.IGNORE)) {
if (sampler.shouldSample(uniqueId)) {
sampleStatus = SAMPLE_STATUS.SAMPLED;
} else {
sampleStatus = SAMPLE_STATUS.IGNORE;
}
}
if (sampleStatus.equals(SAMPLE_STATUS.IGNORE)) {
return;
}
StringBuilder traceIdBuilder = new StringBuilder();
for (int i = 0; i < uniqueId.getIdPartsList().size(); i++) {
if (i == 0) {
......@@ -91,6 +118,10 @@ public class SegmentSpanListener implements FirstSpanListener, EntrySpanListener
logger.debug("segment listener build, segment id: {}", segment.getSegmentId());
}
if (sampleStatus.equals(SAMPLE_STATUS.IGNORE)) {
return;
}
if (entryEndpointId == 0) {
segment.setEndpointId(firstEndpointId);
segment.setEndpointName(serviceNameCacheService.get(firstEndpointId).getName());
......@@ -102,10 +133,19 @@ public class SegmentSpanListener implements FirstSpanListener, EntrySpanListener
sourceReceiver.receive(segment);
}
private enum SAMPLE_STATUS {
UNKNOWN, SAMPLED, IGNORE
}
public static class Factory implements SpanListenerFactory {
private TraceSegmentSampler sampler;
public Factory(int segmentSamplingRate) {
this.sampler = new TraceSegmentSampler(segmentSamplingRate);
}
@Override public SpanListener create(ModuleManager moduleManager) {
return new SegmentSpanListener(moduleManager);
return new SegmentSpanListener(moduleManager, sampler);
}
}
}
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
package org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.segment;
import java.util.List;
import org.apache.skywalking.apm.network.language.agent.UniqueId;
/**
* The sampler makes the sampling mechanism work at the backend side.
*
* The sample check is simple and effective when the backend runs in cluster mode. It is based on the traceId, which
* consists of 3 Longs. According to GlobalIdGenerator, the last four digits of the last Long are a sequence, so they
* are suitable for sampling.
*
* Set rate = x
*
* Then take the last Long in the TraceId modulo 10000: y = lastLong % 10000
*
* Sample result: y in [0,x) is sampled, y in [x,10000) is ignored
*
* @author wusheng
*/
public class TraceSegmentSampler {
private int sampleRate = 10000;
public TraceSegmentSampler(int sampleRate) {
this.sampleRate = sampleRate;
}
public boolean shouldSample(UniqueId uniqueId) {
List<Long> idPartsList = uniqueId.getIdPartsList();
if (idPartsList.size() == 3) {
Long lastLong = idPartsList.get(2);
long sampleValue = lastLong % 10000;
if (sampleValue < sampleRate) {
return true;
}
}
return false;
}
}
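
Review note on the modulo check above: Java's `%` operator keeps the sign of the dividend, so a negative last ID part yields a negative remainder, which is always below any positive `sampleRate`. Whether a negative last part can occur in practice is not stated in this commit, so the sketch below is only an illustration under that assumption. `FloorModSketch` and both method names are hypothetical; `Math.floorMod(long, long)` is the standard JDK method.

```java
public class FloorModSketch {
    public static final long PRECISION = 10000L;

    // Same arithmetic as the committed sampler: the remainder takes the sign of lastIdPart.
    public static boolean remainderCheck(long lastIdPart, int sampleRate) {
        return lastIdPart % PRECISION < sampleRate;
    }

    // Hypothetical variant: Math.floorMod always yields a value in [0, 10000).
    public static boolean floorModCheck(long lastIdPart, int sampleRate) {
        return Math.floorMod(lastIdPart, PRECISION) < sampleRate;
    }

    public static void main(String[] args) {
        long negative = -5L;
        System.out.println(remainderCheck(negative, 100)); // true: -5 % 10000 is -5
        System.out.println(floorModCheck(negative, 100));  // false: floorMod(-5, 10000) is 9995
        System.out.println(remainderCheck(50L, 100));      // true
        System.out.println(floorModCheck(50L, 100));       // true
    }
}
```

For non-negative inputs the two checks agree, so the variant would not change the behavior covered by the unit tests in this commit.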
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
*/
package org.apache.skywalking.oap.server.receiver.trace.provider.parser.listener.segment;
import org.apache.skywalking.apm.network.language.agent.UniqueId;
import org.junit.Assert;
import org.junit.Test;
public class TraceSegmentSamplerTest {
@Test
public void sample() {
TraceSegmentSampler sampler = new TraceSegmentSampler(100);
Assert.assertTrue(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(0).build()));
Assert.assertTrue(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(50).build()));
Assert.assertTrue(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(99).build()));
Assert.assertFalse(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(100).build()));
Assert.assertFalse(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(101).build()));
Assert.assertTrue(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(10000).build()));
Assert.assertTrue(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(10001).build()));
Assert.assertFalse(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(1019903).build()));
}
@Test
public void IllegalTraceIDSample() {
TraceSegmentSampler sampler = new TraceSegmentSampler(100);
Assert.assertFalse(sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).build()));
Assert.assertFalse(
sampler.shouldSample(UniqueId.newBuilder().addIdParts(123).addIdParts(2).addIdParts(23).addIdParts(3).build()));
}
}
......@@ -141,6 +141,7 @@
<exclude>log4j2.xml</exclude>
<exclude>alarm-settings.yml</exclude>
<exclude>component-libraries.yml</exclude>
<exclude>datasource-settings.properties</exclude>
</excludes>
</configuration>
</plugin>
......
......@@ -67,6 +67,7 @@ receiver-trace:
bufferOffsetMaxFileSize: ${SW_RECEIVER_BUFFER_OFFSET_MAX_FILE_SIZE:100} # Unit is MB
bufferDataMaxFileSize: ${SW_RECEIVER_BUFFER_DATA_MAX_FILE_SIZE:500} # Unit is MB
bufferFileCleanWhenRestart: ${SW_RECEIVER_BUFFER_FILE_CLEAN_WHEN_RESTART:false}
sampleRate: ${SW_TRACE_SAMPLE_RATE:10000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
receiver-jvm:
default:
service-mesh:
......
......@@ -39,6 +39,7 @@
<include>application.yml</include>
<include>alarm-settings.yml</include>
<include>alarm-settings-sample.yml</include>
<include>datasource-settings.properties</include>
</includes>
</fileSet>
<fileSet>
......
......@@ -16,8 +16,8 @@
cluster:
standalone:
# Please check your ZooKeeper is 3.5+, However, it is also compatible with ZooKeeper 3.4.x. Replace the ZooKeeper 3.5+
# library the oap-libs folder with your ZooKeeper 3.4.x library.
# Please check your ZooKeeper is 3.5+, However, it is also compatible with ZooKeeper 3.4.x. Replace the ZooKeeper 3.5+
# library the oap-libs folder with your ZooKeeper 3.4.x library.
# zookeeper:
# hostPort: ${SW_CLUSTER_ZK_HOST_PORT:localhost:2181}
# #Retry Policy
......@@ -68,6 +68,7 @@ receiver-trace:
bufferOffsetMaxFileSize: ${SW_RECEIVER_BUFFER_OFFSET_MAX_FILE_SIZE:100} # Unit is MB
bufferDataMaxFileSize: ${SW_RECEIVER_BUFFER_DATA_MAX_FILE_SIZE:500} # Unit is MB
bufferFileCleanWhenRestart: ${SW_RECEIVER_BUFFER_FILE_CLEAN_WHEN_RESTART:false}
sampleRate: ${SW_TRACE_SAMPLE_RATE:10000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
receiver-jvm:
default:
service-mesh:
......