Solved: Why does lookup return null when there are multipl...

yuanliu · ‎02-15-2023

I have a lookup with multiple columns (keys). Some combinations make a unique match, but I need an ambiguous search on a single key to return all matched items of a particular field. In a simplified form, the lookup is like this

QID	IP	Detected
12345	127.0.0.1	2022-12-10
45678	127.0.0.1	2023-01-21
12345	127.0.0.2	2023-01-01
45678	127.0.0.2	2022-12-15
23456	...	...

QID and IP determines a unique Detected value; you can say the combination is a primary key. No problem with search by primary key. My requirement is to search by QID alone. For 12345, for example, I expect the return to be multivalued (2022-12-10, 2023-01-01).

If I hard code QID in an emulation, that's exactly what I get.

| makeresults
| eval QID=12345
| lookup mylookup QID
| table QID Detected

This will give me

QID	Detected
12345	2022-12-10 2023-01-01

But if use the same lookup in a search, e.g.,

index=myindex QID=12345
| stats count by QID ``` result is the same whether or not stats precedes lookup ```
| lookup mylookup QID
| table QID Detected

the result is blank

QID	Detected
12345

The behavior can be more complex if the search returns more than one QID (e.g., QID IN (12345, 45678)). Sometimes one of them will get Detected populated, but not others.

How can I make sure multiple matches are all returned?

bowesmana · ‎02-22-2023

It looks like it's related to the size of the lookup, which exceeds max_memtable_bytes (default 25MB) and there is a memory issue, which can give somewhat random results.

When a lookup exceeds that value, Splunk uses an indexing mechanism to access the lookup, but it seems that does not work properly.

If you increase max_memtable_bytes in limits.conf to a size larger than your lookup, the problem goes away, e.g. doubling it to 50MB

[lookup]
max_memtable_bytes = 52428800

View solution in original post

bowesmana · ‎02-22-2023

It looks like it's related to the size of the lookup, which exceeds max_memtable_bytes (default 25MB) and there is a memory issue, which can give somewhat random results.

When a lookup exceeds that value, Splunk uses an indexing mechanism to access the lookup, but it seems that does not work properly.

If you increase max_memtable_bytes in limits.conf to a size larger than your lookup, the problem goes away, e.g. doubling it to 50MB

[lookup]
max_memtable_bytes = 52428800

yuanliu · ‎02-23-2023

Thank you so much, @bowesmana! This had been bugging me for years. Given the unpredictability, I should have suspected a memory-related problem.

So, there seems to be two separate problems. One is that index mechanism (mylookup.csv_XXXXX.idx) is not performing as designed. The second problem is the lack of any indication in splunkd.log that suggests max_memtable_bytes might be exceeded.

bowesmana · ‎02-15-2023

As your makeresults example shows, you should get both results, so it points to something different about the true value of QID in the index=myindex case.

If you add the following after stats count for a bit of diagnostics

| eval len=len(QID)
| eval Q=12345
| lookup mylookup QID OUTPUT Detected as D_1
| lookup mylookup QID as Q OUTPUT Detected as D_2
| eval match=if(QID=Q, "true", "false")
| table count QID len Q match D_*

what do you get?

yuanliu · ‎02-18-2023

Thanks for the diagnostic sequence, @bowesmana. As I try to implement, I realized two things. First, I didn't record the real-world QID in the original problem statement, even if the statement was precise. Now I have doubts about the precision because if I filter down to a single QID, I haven't found one that can trigger a blank return.

Second, I notice that even with makeresults, I can construct a group of QIDs to trigger this behavior. (The behavior seems to be the same as the group of QIDs appear in an index search.) For example,

| makeresults
| fields - _time
| eval QID = split("257210,257212,257100", ",")
| mvexpand QID
| lookup mylookup QID output Detected

and

index=myindex QID IN ("257210","257212","257100")
| stats count by QID
| lookup mylookup QID output Detected

both return something like

QID	Detected
257210
257212
257100	2022-12-10 2023-01-01 ...

Within this group, which one returns null and which real values do not change no matter the order of QIDs. If I search individual ones, each of them has non-null return.

In short, I couldn't reproduce a difference between simulated QID and real-world QID. The difference seems to be whether a group of events contain different QIDs or just a single QID.

How can I diagnose this further?

bowesmana · ‎02-19-2023

Is it KV store or CSV - if CSV is mylookup a definition or the CSV directly. If the definition, remove the layer of abstraction and try CSV. If KV store, try writing to the CSV and trying CSV.

Try making it a wildcard(QID) in the definition and adding for a couple of the failing examples, the ** around the QID.

Is it always the case that one of the QID Detected outputs is MV, as in your example? If so, disable one of the entries in the lookup. If it's a definition, set it to return 1 max result to see if that then causes the others to populate.

Just stabbing in the dark here to see what gives...

yuanliu · ‎02-19-2023

This is CSV with definition. I didn't even know that you can use CSV directly. So I tried. It made no difference. I then redefined QID's match type to wildcard. It also made no difference; if I surround any test value with **, no value can return because the entries themselves do not have wildcard characters.

The third test, unfortunately also doesn't give me additional info. I use the following as a basic test

| inputlookup mylookup
| stats count by QID
| lookup mylookup QID
| where isnull(Detected)

Somewhat to my surprise, there are many null outputs with count of 1. (Also, because how many QIDs are in search events is known to make a difference in which QIDs return null, I cannot say that the above test is definitive. It only shows that it is possible for single-entries to return null.)

bowesmana · ‎02-19-2023

For the wildcard definition, it is the CSV version that needs the **, not the data itself, so the lookup should contain

*257210*

and the data has 257210 - just in case there was something odd.

Can you DM me your lookup - as I can't reproduce it

Why does lookup return null when there are multiple matches?

lookup

Detecting Remote Code Executions With the Splunk Threat Research Team

Observability | Use Synthetic Monitoring for Website Metadata Verification

More Ways To Control Your Costs With Archived Metrics | Register for Tech Talk