In a two-site indexer cluster, the same search returns different results depending on whether it is run from the search head on site 1 or the search head on site 2.
1) Here is the list of indexers:
indx06 site2 (GUID XXXXXX-BE48-4A47-A6CA-DA7EC3C098EC)
indx04 site2 (GUID XXXXXX-82F1-460E-861F-D69A40A5CF32)
indx01 site1 (GUID XXXXXX-1658-4C4C-884E-5F2CA1CD33DF)
indx05 site1 (GUID XXXXXX-15E1-4B4F-AE4F-7D5022DB5876)
indx02 site2 (GUID XXXXXX-BDD2-4ABC-9E0A-CC775F2A5F7D)
indx03 site1 (GUID XXXXXX-44B0-4AAE-975F-5A2F2EF9CF1E)
2) When the following search was run from the search heads SH_SITE1 and SH_SITE2, it returned different result sets:
index=Test_fd_q05_bfi_eu | eval bkt=_bkt | stats count by source, bkt, splunk_server
The bucket has a count of 178 on indexer indx06 (site2).
The same bucket did not show up in the results from indx01 (site1).
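To spot which buckets differ without eyeballing two result tables, the rows from each search head can be compared programmatically. Below is a minimal sketch, assuming the results of the search above have been exported from each search head as lists of dicts with the fields source, bkt, splunk_server and count. The function names and the sample source path are made up for illustration.

```python
from collections import defaultdict


def index_counts(rows):
    """Map (source, bkt) -> {splunk_server: count} for one search head's rows."""
    counts = defaultdict(dict)
    for row in rows:
        key = (row["source"], row["bkt"])
        counts[key][row["splunk_server"]] = int(row["count"])
    return counts


def diff_results(rows_a, rows_b):
    """Return {(source, bkt): (total_a, total_b)} for buckets whose totals differ."""
    a, b = index_counts(rows_a), index_counts(rows_b)
    diffs = {}
    for key in sorted(set(a) | set(b)):
        total_a = sum(a.get(key, {}).values())
        total_b = sum(b.get(key, {}).values())
        if total_a != total_b:
            diffs[key] = (total_a, total_b)
    return diffs


# Illustrative rows only: the source path is hypothetical, the count is from this case.
rows_sh_site1 = []  # the bucket did not show up from site1
rows_sh_site2 = [
    {
        "source": "/var/log/app.log",
        "bkt": "Test~6~XXXXXX-BE48-4A47-A6CA-DA7EC3C098EC",
        "splunk_server": "indx06",
        "count": "178",
    }
]

# Reports the bucket with totals (0, 178): missing on site1, present on site2.
print(diff_results(rows_sh_site1, rows_sh_site2))
```

Grouping by (source, bkt) rather than by splunk_server is deliberate: with multisite clustering each site is expected to answer from a different peer, so only the per-bucket totals should match.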
3) For this specific search, the difference was narrowed down to the hot bucket Test~6~XXXXXX-BE48-4A47-A6CA-DA7EC3C098EC.
3.1) This bucket originates from peer indx06 (site2) and is searchable on site2. A physical check of the bucket on indx06 shows the following:
du -h ./hot_v1_7
40k ./hot_v1_7/rawdata
312k ./hot_v1_7
3.2) A physical check of the same bucket on indx01 (site1) shows that the copy 6_XXXXXX-BE48-4A47-A6CA-DA7EC3C098EC is smaller:
du -h ./6_2E4F6FF5-BE48-4A47-A6CA-DA7EC3C098EC
32k ./6_2E4F6FF5-BE48-4A47-A6CA-DA7EC3C098EC/rawdata
72K ./6_2E4F6FF5-BE48-4A47-A6CA-DA7EC3C098EC
Note: The Cluster Master dashboard at Settings > Indexer clustering shows that both the replication factor and the search factor are met.
This issue can occur when a hot bucket is first replicated and no further slices/events arrive for that hot bucket.
Say peerA has started a new hot bucket, bucketA. At some point it adds the bucket to the master and triggers replication, which follows this flow:
1) start of replication : peerA bucketA -> peerB bucketA
2) adding slice : peerA bucketA(slice) -> peerB bucketA(slice) ... repeat #2 for each new slice.
3) end bucket : peerA bucketA rolls -> peerB bucketA rolls
At each occurrence of #2, as a side effect, peerB updates bucketA's tsidx files (at most once every 5 seconds by default) and metadata files. The bug is that this does not happen at #1, although it should: the first slice should also generate tsidx and metadata. So if there are no further slices beyond the initial onFileOpened slice, peerB's copy of bucketA never gets any tsidx files (until it rolls, or until it actually receives another slice).
This bug is present in all 6.1.x releases from what I can see. However, in 6.2.x the default setting was changed from
[clustering]
searchable_target_sync_timeout = 0
to
[clustering]
searchable_target_sync_timeout = 60
which sets a bucket replication sync timeout that fires 60 seconds later (the timeout is set every time data is processed for the replication). When it fires, it triggers an update of the tsidx and metadata files, so in 6.2.x this bug is mitigated by the timeout.
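The interaction between the missing first update and the sync timeout can be illustrated with a toy model. This is only a sketch of the behavior described above, not Splunk code: the class, its method names and the notion of "covered" slices are invented for illustration, and the 5-second throttle on tsidx updates is left out for simplicity.

```python
class ReplicaBucket:
    """Toy model of peerB's copy of a replicated hot bucket (not Splunk code)."""

    def __init__(self, sync_timeout):
        self.sync_timeout = sync_timeout  # models searchable_target_sync_timeout
        self.slices = 0                   # slices received so far
        self.indexed_slices = 0           # slices covered by tsidx/metadata
        self.deadline = None              # time at which the sync timeout fires

    def _set_timeout(self, now):
        # A timeout of 0 models the 6.1.x default: no sync timeout at all.
        if self.sync_timeout > 0:
            self.deadline = now + self.sync_timeout

    def start_replication(self, now):
        # Step 1: the initial slice arrives, but tsidx/metadata are NOT updated.
        self.slices = 1
        self._set_timeout(now)

    def add_slice(self, now):
        # Step 2: every later slice updates tsidx/metadata as a side effect.
        self.slices += 1
        self.indexed_slices = self.slices
        self._set_timeout(now)

    def tick(self, now):
        # When the sync timeout fires, tsidx/metadata are brought up to date.
        if self.deadline is not None and now >= self.deadline:
            self.indexed_slices = self.slices
            self.deadline = None

    @property
    def searchable(self):
        return self.slices > 0 and self.indexed_slices == self.slices


# 6.1.x default: only the initial slice arrives, nothing ever builds the tsidx.
old = ReplicaBucket(sync_timeout=0)
old.start_replication(now=0)
old.tick(now=120)
print(old.searchable)  # False

# 6.2.x default: the timeout fires after 60 seconds and triggers the update.
new = ReplicaBucket(sync_timeout=60)
new.start_replication(now=0)
new.tick(now=60)
print(new.searchable)  # True
```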
In your case the workaround is to set the following on the cluster peers:
server.conf
[clustering]
searchable_target_sync_timeout = 60
This change must be made on the peers, so apply it to the cluster peers using the cluster bundle.
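If you script your configuration changes, the stanza can be written into the bundle staging directory on the master before pushing it out. A minimal sketch, assuming the usual master-side location $SPLUNK_HOME/etc/master-apps/_cluster/local/server.conf and a file simple enough for Python's configparser to read. The function name is made up, and rewriting the file this way drops any comments it contained.

```python
import configparser
from pathlib import Path


def set_sync_timeout(server_conf, seconds=60):
    """Set [clustering] searchable_target_sync_timeout in the given server.conf."""
    server_conf = Path(server_conf)
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case as-is (configparser lowercases by default)
    if server_conf.exists():
        parser.read(server_conf)
    if not parser.has_section("clustering"):
        parser.add_section("clustering")
    parser.set("clustering", "searchable_target_sync_timeout", str(seconds))
    server_conf.parent.mkdir(parents=True, exist_ok=True)
    with server_conf.open("w") as fh:
        parser.write(fh)


# Example call (the path is an assumption about your install layout):
# set_sync_timeout("/opt/splunk/etc/master-apps/_cluster/local/server.conf")
```

After the file is in place, the bundle still has to be distributed from the master, for example with `splunk apply cluster-bundle`.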
Another possibility is frozen buckets. Once a bucket is frozen, no further fixups are performed on it, so it is possible that a copy that was primary gets frozen and that primary is then lost. To check for this, query the master endpoint:
https://master_uri:mgmt_port/services/cluster/master/buckets?filter=frozen=true&filter=has_primary=false
If any buckets are listed there, check the primaries_by_site section: they are probably missing one or more primaries, which causes different search results per site.
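If the endpoint returns many buckets, the check can be scripted. The sketch below builds the same query and scans an already-fetched response for buckets that lack a primary on some site. It assumes the response was requested with output_mode=json and has the usual entry / content layout, with primaries_by_site mapping site names to peer GUIDs. Those field shapes, the function names and the sample data are assumptions, so verify them against your own output.

```python
from urllib.parse import urlencode


def frozen_without_primary_url(master_host, mgmt_port):
    """Build the master endpoint URL with both filters, asking for JSON output."""
    query = urlencode([
        ("filter", "frozen=true"),
        ("filter", "has_primary=false"),
        ("output_mode", "json"),
    ])
    return f"https://{master_host}:{mgmt_port}/services/cluster/master/buckets?{query}"


def buckets_missing_primaries(response, sites):
    """Return {bucket_name: [sites without a primary]} from a parsed JSON response."""
    missing = {}
    for entry in response.get("entry", []):
        primaries = entry.get("content", {}).get("primaries_by_site", {})
        absent = [site for site in sites if not primaries.get(site)]
        if absent:
            missing[entry["name"]] = absent
    return missing


# Hypothetical response fragment, for illustration only.
sample = {
    "entry": [
        {
            "name": "example~12~XXXXXX",
            "content": {"frozen": True, "primaries_by_site": {"site2": "XXXXXX"}},
        }
    ]
}
print(buckets_missing_primaries(sample, ["site1", "site2"]))
```

Passing the list of sites explicitly avoids guessing the cluster topology from the response itself.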
Applying the searchable_target_sync_timeout workaround resolved the issue.