Hi,
I'm experiencing some strangeness with the following query:
index=main_index | dedup _raw | sort _raw | rename index as index_from | join type=left _raw [search index=dedupped_index | dedup _raw | sort _raw | rename index as index_to ] | where isNull(index_to) | collect dedupped_index
Essentially, I want to take the contents of main_index, find out which events are not in dedupped_index, and move them into dedupped_index. However, if I run this query multiple times, I get the same results constantly dumped into the dedupped_index, even though they are already there.
Does the order of the events matter for the join command? Is there a different/better way of doing this that will work?
Thanks!
(index=main_index OR index=dedupped_index) | dedup _raw, index | stats count as count values(index) as index by _raw | table count, _raw, index | where count=1 AND index="main_index" | collect index=dedupped_index
(index=main_index OR index=dedupped_index) | dedup _raw, index | stats count as count values(index) as index by _raw | table count, _raw, index | where count=1 AND index="main_index" | collect index=dedupped_index
It's always a pleasure to point to this nice answer http://answers.splunk.com/answers/129424/how-to-compare-fields-over-multiple-sourcetypes-without-joi... 🙂
Ingenious, I like this. It seems like it shouldn't be necessary in my environment very often, but having the option could be handy.
Anyway.
If you leave off the " ... | collect dedupped_index" at the end, what does the search/join/subsearch show you? If that's not the right list of events, then I'd pare down the search bit by bit confirming each part really does what you expect.
It's strange. If I add fields to my query, those fields will always be (correctly) absent on the second run. It's only when I query the entire index that I see data when I shouldn't.
In any case, my current work around is as follows (it seems faster, but also feels hackier):
(index=main_index OR index=dedupped_index) | dedup _raw, index | stats count as count values(index) as index by _raw | table count, _raw, index | where count=1 AND index="main_index" | collect index=dedupped_index
Joins are expensive and the your workaround is indeed proper way to do this 🙂
I agree with somesoni2 that you original way was not very efficient.
If the search you posted as the comment works correctly, I would recommend write that up as the answer to your own question and mark it as answered. That way people stumbling across this question in the future will know an "answer" was found.