Search Head Cluster: How do I resolve "Error Fixup - failed to kick off replication..."

kbecker
Communicator

Does anybody know what the following error means and how to resolve it? I traced it back to a saved search via the scheduler log and verified that the search's expiration is 15 minutes, which should be enough time to replicate the data.

04-28-2015 10:11:37.454 -0500 ERROR Fixup - failed to kick off replication from src=FA5DD091-47DE-44C0-BF4C-FF60B8DF4B72 tgt=6458DFA1-C04B-43D2-BCE6-C2D3B5AB74C3 aid=scheduler__jdoe__search__RMD5ebb33acee6403ee2_at_1430233560_69456_15082AA6-AAE2-47B5-BD24-7643F4C96F15 err='src FA5DD091-47DE-44C0-BF4C-FF60B8DF4B72 cannot be valid source for scheduler__jdoe__search__RMD5ebb33acee6403ee2_at_1430233560_69456_15082AA6-AAE2-47B5-BD24-7643F4C96F15'

Thanks in advance.

kbecker
Communicator

We are running 6.5.1 and no longer have this issue. I think there was some confusion within Splunk about which release actually fixed it.

rbal_splunk
Splunk Employee

We have two bugs filed for errors like "Error Fixup - failed to kick off replication...".
The bug numbers are:

SPL-94508::Search Head Clustering: Captain's splunkd.log spam ERROR Fixup - failed to kick off replication from src= tgt= aid= err=...
SPL-98488::SHC - Peers incorrectly report 4 billion replications in high latency environments

i) The Fixup error occurs when the number of outstanding replications on a peer exceeds the configured limit, "max_peer_rep_load = 5".
For example, if 100 scheduled searches run at the 30-minute mark, they may all finish and try to replicate at the same time, and the Fixup code may then throw this error.
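One way to check whether the errors cluster around scheduling boundaries, as the 30-minute example suggests, is to count them per minute. The following is a minimal sketch, not a Splunk tool: it assumes the splunkd.log timestamp layout shown in the original post, and the sample lines (including the src/tgt/aid values) are made up for illustration.

```python
import re
from collections import Counter

# Leading timestamp of a splunkd.log line, truncated to the minute, e.g.
# "04-28-2015 10:11:37.454 -0500 ERROR Fixup - ..." -> "04-28-2015 10:11"
_TS = re.compile(r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}):\d{2}\.\d{3} ")

def fixup_errors_per_minute(lines):
    """Count 'failed to kick off replication' Fixup errors per minute."""
    counts = Counter()
    for line in lines:
        if "ERROR Fixup - failed to kick off replication" not in line:
            continue
        match = _TS.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines, shortened for readability.
sample = [
    "04-28-2015 10:30:01.100 -0500 ERROR Fixup - failed to kick off replication from src=A tgt=B aid=x err='...'",
    "04-28-2015 10:30:02.200 -0500 ERROR Fixup - failed to kick off replication from src=A tgt=C aid=y err='...'",
    "04-28-2015 10:31:05.000 -0500 INFO  Metrics - unrelated line",
    "04-28-2015 10:45:09.300 -0500 ERROR Fixup - failed to kick off replication from src=A tgt=B aid=z err='...'",
]

for minute, count in sorted(fixup_errors_per_minute(sample).items()):
    print(minute, count)
```

If most of the counts land on the minutes when many scheduled searches fire, that would be consistent with the burst explanation above.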

To clarify the impact of this bug: if users hit any of the peers looking for the job, the request is proxied instead of being read locally. This does not affect access to artifacts, but it disables replication, which is not good. Splunk is working to get bug SPL-98488 fixed as soon as possible.

Also, the error messages refer to "aid" and "sid". An aid is short for artifact ID, and a sid becomes an artifact when the captain manages it for replication. Sids that are not replicated, such as those of ad hoc searches, are therefore not artifacts.
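To make the src, tgt, and aid fields easier to work with, the error line can be split with a regular expression. This is only a sketch based on the message layout in the original post; the pattern and the function name `parse_fixup` are illustrative and would need adjusting if the format differs between versions.

```python
import re

# Pattern derived from the error line quoted in the original post.
_FIXUP = re.compile(
    r"ERROR Fixup - failed to kick off replication from "
    r"src=(?P<src>\S+) tgt=(?P<tgt>\S+) aid=(?P<aid>\S+) err='(?P<err>.*)'"
)

def parse_fixup(line):
    """Return a dict with src, tgt, aid and err, or None if the line does not match."""
    match = _FIXUP.search(line)
    return match.groupdict() if match else None

# The log line from the original post.
line = (
    "04-28-2015 10:11:37.454 -0500 ERROR Fixup - failed to kick off replication from "
    "src=FA5DD091-47DE-44C0-BF4C-FF60B8DF4B72 tgt=6458DFA1-C04B-43D2-BCE6-C2D3B5AB74C3 "
    "aid=scheduler__jdoe__search__RMD5ebb33acee6403ee2_at_1430233560_69456_15082AA6-AAE2-47B5-BD24-7643F4C96F15 "
    "err='src FA5DD091-47DE-44C0-BF4C-FF60B8DF4B72 cannot be valid source for "
    "scheduler__jdoe__search__RMD5ebb33acee6403ee2_at_1430233560_69456_15082AA6-AAE2-47B5-BD24-7643F4C96F15'"
)

fields = parse_fixup(line)
print("src:", fields["src"])
print("tgt:", fields["tgt"])
print("aid:", fields["aid"])
```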

ii) To confirm whether you are hitting bug SPL-98488 when you see this message, go to any node in the search head cluster, run "splunk list shcluster-members", and look at the value of "replication_count" for the problematic source. If replication_count is very high, it is the same issue. (Command: ./splunk list shcluster-members | grep replication_count)
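The grep above can also be scripted against saved command output. The sketch below assumes each member's count appears as `replication_count:<n>` (or with `=`); the sample text and the threshold of 1000 are invented for illustration and do not come from Splunk documentation.

```python
import re

# Assumed output shape: "replication_count:<n>" or "replication_count = <n>".
_COUNT = re.compile(r"replication_count\s*[:=]\s*(\d+)")

def suspicious_counts(cli_output, threshold=1000):
    """Return replication_count values above a hypothetical threshold."""
    values = [int(n) for n in _COUNT.findall(cli_output)]
    return [v for v in values if v > threshold]

# Made-up sample of saved "splunk list shcluster-members" output.
sample_output = """\
member-a
    replication_count:2
member-b
    replication_count:4294967295
"""

print(suspicious_counts(sample_output))
```

A value near 4 billion, as in the title of SPL-98488, would stand out immediately against normal single-digit counts.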

iii) Can you hit https://:/services/shcluster/captain/replications on the captain node with admin credentials and share what the output is?
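If a browser is not convenient, the same endpoint can be requested from a script. This sketch only builds the request; the host name, credentials, and the helper `build_captain_request` are placeholders, and port 8089 is the default management port, which may differ in your deployment.

```python
import base64
import urllib.request

def build_captain_request(host, port, endpoint, user, password):
    """Build a Basic-auth GET request for a captain endpoint, asking for JSON output."""
    url = f"https://{host}:{port}/services/shcluster/captain/{endpoint}?output_mode=json"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

# Placeholder host and credentials; replace with the captain's real values.
req = build_captain_request("captain.example.com", 8089, "replications", "admin", "changeme")
print(req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen`. A captain with a self-signed certificate will likely also need an `ssl` context that trusts that certificate.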

Other things to verify:

  1. When you get a Fixup error for a SID, does that SID eventually reach its replica count in your environment? If so, the errors are just transient, annoying spam, which we can correct in a maintenance release. You can validate this by tailing splunkd.log, taking the latest error, and then checking for the same artifact a few minutes later in https://:/services/shcluster/captain/artifacts

  2. As these SIDs/artifacts are generated, is there any interactive access to them?
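The first check can be roughed out in code: take the aid from the most recent Fixup error, then look for it in a saved response from the captain's artifacts endpoint. This is a sketch under assumptions. The log lines and the response body below are invented, and a plain substring match is used precisely because the real response layout is not shown in this thread.

```python
import re

_AID = re.compile(r"ERROR Fixup - failed to kick off replication .*? aid=(\S+)")

def latest_fixup_aid(log_lines):
    """Return the aid from the last matching Fixup error, or None."""
    aid = None
    for line in log_lines:
        match = _AID.search(line)
        if match:
            aid = match.group(1)
    return aid

def artifact_listed(aid, artifacts_body):
    """True if the aid string appears anywhere in the saved response text."""
    return aid is not None and aid in artifacts_body

# Hypothetical, shortened inputs.
log = [
    "... ERROR Fixup - failed to kick off replication from src=A tgt=B aid=job_1 err='...'",
    "... ERROR Fixup - failed to kick off replication from src=A tgt=C aid=job_2 err='...'",
]
body = '{"entry": [{"name": "job_2"}]}'

aid = latest_fixup_aid(log)
print(aid, artifact_listed(aid, body))
```

Finding the aid in the listing is only a first signal. Whether it has actually reached its replica count still has to be read from the endpoint's output.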

baker987
Explorer

I am on version 6.2.7 and I am experiencing this issue as well.

asifkh11
New Member

I'm running 6.3.2 and seeing the same issue.

aakwah
Builder

I have the same errors with 6.2.9.

somesoni2
Revered Legend

The Splunk change log says this was fixed in Splunk 6.2.4. I'm using Splunk 6.2.6 and still see the same error messages. Does anyone know of a workaround?

http://docs.splunk.com/Documentation/Splunk/6.2.6/ReleaseNotes/6.2.4#Distributed_search_and_search_h...

ppohar
Explorer

Is this issue resolved? We are on Splunk 6.2.3 (build 264711).

We are seeing continuous "Fixup - failed to kick off replication" errors on our search head captain.

08-26-2015 10:52:59.037 -0500 ERROR Fixup - failed to kick off replication from src=E865F266-125C-465F-BFF7-10773D2D3536 tgt=EC6D891C-FF0D-47E9-9D83-864D13A58B04 aid=scheduler_jmonettecoreapiRMD51ae1d00f2e3ed31a_at_1440604320_21214_E865F266-125C-465F-BFF7-10773D2D3536 err='src E865F266-125C-465F-BFF7-10773D2D3536 cannot be valid source for schedulerjmonettecoreapi_RMD51ae1d00f2e3ed31a_at_1440604320_21214_E865F266-125C-465F-BFF7-10773D2D3536'
