Splunk Search

Lookup Table vs. Summary Index

genesiusj
Builder

Hello,
I need to create a table(?) to use for populating 4 dashboard dropdowns: time picker, user, user-id, and IP address.
Would this be better achieved with a summary index or a lookup table?
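For illustration, one dropdown's populating search would look something like this in each case (the summary index name summary_ssh_sessions is only a placeholder; the lookup file is the one shown further down).

Lookup-backed dropdown for the user field:

| inputlookup lup_linuxProdAuditHostAcctAuid.csv
| stats count by user
| fields user

Summary-index-backed dropdown for the same field (the dashboard time picker would filter it by _time):

index=summary_ssh_sessions
| stats count by user
| fields user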

As you can see from the code below, there is a lookup command in my search.

index=linuxevents
    AND source = "/var/log/secure"
    AND process=sshd
    `comment("There are 3 different ssh connection methods, and 1 ssh disconnection method.")`
    AND (_raw="*accepted password*" OR _raw="*accepted keyboard-interactive*" OR _raw="*accepted publickey*" OR _raw="*session closed for user*") 
    `comment("Extract user's ID, both (dis)connecting, as well IP address.")` 
| rex field=_raw ".*sshd.*(Accepted\spublickey\sfor\s|Accepted\skeyboard-interactive\/pam\sfor\s|Accepted\spassword\sfor\s)(?<sshUserA>\w*)\sfrom\s(?<sshIp>(\d{1,3}.){3}\d{1,3})|.*sshd.*closed\sfor\suser\s(?<sshUserC>\w*)" 
| transaction sshUserA, sshUserC, sshIp, host, pid 
| dedup _raw 
    `comment("Time stamp formatting will be used to discuss with Linux SysAdmins.")` 
| rex field=_raw "^(?<formatTimeStamp>(\d{4}(-\d{2}){2}T(\d{2}:){2}\d{2}.\d{6}-\d{2}:\d{2}\s|\w{3}(\s+\d{1,2}){2}(:\d{2}){2}\s))" 
    `comment("Converting start/end transaction times for readability.")` 
| eval startTime=strftime(_time,"%Y-%m-%d %H:%M:%S.%Q") 
| eval endTime=strftime(_time+duration,"%Y-%m-%d %H:%M:%S.%Q") 
    `comment("Convert duration for readability.")` 
| eval duration=tostring(duration,"duration") 
| sort 0 +_time 
    `comment("Merge into one user for readability.")` 
| eval user=mvdedup(mvappend(sshUserA, sshUserC)) 
    `comment("Determine if either start/end transaction time is missings.")` 
| eval startTime=case(sshUserA=="","",isnotnull(sshUserA),startTime) 
| eval endTime=case(sshUserC=="","",isnotnull(sshUserC),endTime) 
    `comment("Indicate if time range needs to broadened because of missing login/off event.")` 
| eval connection=case(isnull(sshUserA),"No login",isnull(sshUserC),"No logoff",isnotnull(sshUserA) AND isnotnull(sshUserC),"Both captured")
    `comment("Lookup auid for this user and host.")` 
| lookup lup_linuxProdAuditHostAcctAuid.csv host AS host, user AS user OUTPUTNEW user_id AS user_id 
| rename startTime AS "Start Time", endTime AS "End Time", duration AS "Duration (H:M:S.m)", user AS acct, sshIp AS addr, formatTimeStamp AS "Time Stamp Format", user_id AS auid, connection AS "Connection events" 
| table acct, auid, addr, host, "Connection events", "Start Time", "End Time", "Duration (H:M:S.m)", eventcount, pid, "Time Stamp Format"

Here is the code that populates the lookup table used in the search above.

index=linuxevents
        AND source = "Unix:UserAccounts" 
    | dedup host, user, user_id 
    | table host, user, user_id 
    | appendpipe 
        [| inputlookup lup_linuxProdAuditHostAcctAuid.csv ] 
    | dedup host, user, user_id 
    | outputlookup lup_linuxProdAuditHostAcctAuid.csv

As you can also see, the table(?) I need to generate does not involve any stats or chart commands/fields; therefore, I believe that a lookup table (or a KV Store, should I open up that can of data) would better suit this.
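If I did open the KV Store can, my understanding is it would only need a collection and a lookup definition, roughly like this (the stanza and collection names are placeholders), after which the inputlookup/outputlookup/lookup commands above would point at the definition name instead of the CSV file:

collections.conf:

[linux_host_acct_auid]
field.host = string
field.user = string
field.user_id = string

transforms.conf:

[kv_linuxProdAuditHostAcctAuid]
external_type = kvstore
collection = linux_host_acct_auid
fields_list = _key, host, user, user_id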

What do you experts think?

Thanks and God bless,
Genesius

UPDATE:
To put some further context around this.

This is for a dashboard that combines the /var/log/secure, /var/log/audit/audit.log, and /var/log/sudo logs into one view of a user's entire ssh session. Sorry, but the audit app on Splunkbase does not capture EVERYTHING.
So my thinking is I require these fields to accurately search those 3 Linux logs: host, acct, auid, addr, pid, and time (connect/disconnect from ssh).
Therefore, the summary index or lookup table would contain these field-value pairs to improve the speed and efficiency of the dashboard. Previous versions of this dashboard could take well over 3 minutes (sometimes 10 or more) to pull all the events, because some of the above fields had to be "discovered" through various searches at runtime.

For example, in order to discover the addr in /var/log/audit/audit.log, we had to first search for a known value: acct. The acct does not appear until the 5th event of the session. Then a search has to be performed on audit.log with both the addr and acct values (which appends events 1 to 4 to the initial results). The same happens for auid, which doesn't appear in audit.log until event number 11, when the old auid=4294967295 is changed to the new auid=10014.
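In SPL terms, that runtime discovery today is roughly a subsearch along these lines (the acct value is only an example):

index=linuxevents source="/var/log/audit/audit.log"
    (acct="exampleuser" OR
    [ search index=linuxevents source="/var/log/audit/audit.log" acct="exampleuser"
    | dedup addr
    | fields addr ])

The subsearch finds the addr tied to the known acct, and OR-ing it back into the outer search picks up the earlier events that carry the addr but not the acct.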

Another issue with audit.log is that some events do not include ANY of these fields but ARE IN FACT part of the user's ssh session. These are found by running the transaction command against the event_id in audit.log, which requires running yet another search. And this is just for audit.log.
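That extra search is essentially a transaction keyed on the audit event id, something like this (assuming event_id still has to be pulled out of the msg=audit(...) header with rex):

index=linuxevents source="/var/log/audit/audit.log"
| rex field=_raw "msg=audit\(\d+\.\d+:(?<event_id>\d+)\)"
| transaction event_id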

There is still the issue of additional pids whose events don't include any of the above fields other than a pid. The initial (parent) pid from the lookup or summary index will be used to determine the child pids, and then the various logs can be searched for those related events as well.
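For the child pids I picture something along these lines, assuming ppid is extracted from the audit events (the pid value is only a placeholder):

index=linuxevents source="/var/log/audit/audit.log"
    (pid=12345 OR
    [ search index=linuxevents source="/var/log/audit/audit.log" ppid=12345
    | dedup pid
    | fields pid ])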

I hope this context helps.

So back to my original question.

Which would be the more efficient and effective solution? Using a lookup table or a summary index?

Thanks and God bless,
Genesius


to4kawa
Ultra Champion

Update CSV:

index=linuxevents source="Unix:UserAccounts" 
| stats count by host, user, user_id 
| table host, user, user_id 
| inputlookup append=t lup_linuxProdAuditHostAcctAuid.csv 
| outputlookup lup_linuxProdAuditHostAcctAuid.csv

Recommend query:

index=linuxevents
    AND source = "/var/log/secure"
    AND process=sshd
    `comment("There are 3 different ssh connection methods, and 1 ssh disconnection method.")`
    AND (_raw="*accepted password*" OR _raw="*accepted keyboard-interactive*" OR _raw="*accepted publickey*" OR _raw="*session closed for user*") 
    `comment("Extract user's ID, both (dis)connecting, as well IP address.")` 
| rex field=_raw ".*sshd.*(Accepted\spublickey\sfor\s|Accepted\skeyboard-interactive\/pam\sfor\s|Accepted\spassword\sfor\s)(?<sshUserA>\w*)\sfrom\s(?<sshIp>(\d{1,3}.){3}\d{1,3})|.*sshd.*closed\sfor\suser\s(?<sshUserC>\w*)" 
    `comment("Time stamp formatting will be used to discuss with Linux SysAdmins.")` 
| rex field=_raw "^(?<formatTimeStamp>(\d{4}(-\d{2}){2}T(\d{2}:){2}\d{2}.\d{6}-\d{2}:\d{2}\s|\w{3}(\s+\d{1,2}){2}(:\d{2}){2}\s))" 
| transaction sshUserA, sshUserC, sshIp, host, pid 
    `comment("Exclude extra fields. for speed up")` 
| fields - _raw 
| dedup sshUserA sshUserC sshIp host pid 
| fields _time sshUserA sshUserC sshIp host pid duration formatTimeStamp eventcount
    `comment("Converting start/end transaction times for readability.")` 
| eval startTime=strftime(_time,"%Y-%m-%d %H:%M:%S.%Q") 
| eval endTime=strftime(_time+duration,"%Y-%m-%d %H:%M:%S.%Q") 
    `comment("Convert duration for readability.")` 
| eval duration=tostring(duration,"duration") 
| sort 0 +_time 
    `comment("Merge into one user for readability.")` 
    `comment("can you try  | eval user = coalesce("sshUserA, sshUserC) ")` 
| eval user=mvdedup(mvappend(sshUserA, sshUserC)) 
    `comment("Determine if either start/end transaction time is missings.")` 
| eval startTime=case(sshUserA=="","",isnotnull(sshUserA),startTime) 
| eval endTime=case(sshUserC=="","",isnotnull(sshUserC),endTime) 
    `comment("Indicate if time range needs to broadened because of missing login/off event.")` 
| eval connection=case(isnull(sshUserA),"No login",isnull(sshUserC),"No logoff",isnotnull(sshUserA) AND isnotnull(sshUserC),"Both captured")
    `comment("Lookup auid for this user and host.")` 
| lookup lup_linuxProdAuditHostAcctAuid.csv host AS host, user AS user OUTPUTNEW user_id AS user_id 
| rename startTime AS "Start Time", endTime AS "End Time", duration AS "Duration (H:M:S.m)", user AS acct, sshIp AS addr, formatTimeStamp AS "Time Stamp Format", user_id AS auid, connection AS "Connection events" 
| table acct, auid, addr, host, "Connection events", "Start Time", "End Time", "Duration (H:M:S.m)", eventcount, pid, "Time Stamp Format"

Hi, @genesiusj
I adjusted the command order to speed up the search and removed the extra fields.
Also, a small change in the query will speed things up.
Judging from the job inspector results, I don't think the matching against the csv is what is consuming the time.


genesiusj
Builder

@to4kawa
I tried the code, but it didn't work.
First, there was a minor error because the lookup command was incomplete.
Error in 'lookup' command: Must specify one or more lookup fields.
My code: | lookup lup_linuxProdAuditHostAcctAuid.csv host AS host, user AS user OUTPUTNEW user_id AS user_id
Your code: | lookup lup_linuxProdAuditHostAcctAuid.csv OUTPUT user_id

After correcting it, the error was gone.

However, your code produced 0 results; mine produced 3,369 events.
I understand the case for stats vs. transaction in theory, but running with transaction took only 6 seconds over All time.

Thank you for your help.

Would like to know if you have any ideas on my original question about summary index vs lookup table.

Thanks again and God bless,
Genesius


to4kawa
Ultra Champion

I see. Sorry, it does not work.
transaction is no problem.
Which commands are taking the time, according to the job inspector?
I don't think lookup is taking the time.

so, my answer is updated a little.


genesiusj
Builder

@to4kawa
That is correct. The lookup does not use time (no need for me to check the job inspector). The lookup is used strictly to populate a list of user IDs and authentication IDs (auids) per host. This search was one of the most time-intensive queries in my initial SPL. By running the append-to-lookup search every evening, any new user credentials added by our SysAdmins the previous day will be captured.

However, giving this some more thought, I need to check for duplicate auids with a different user ID. Although our SysAdmins don't recycle auids at this time, I could see this being an issue in the distant future.

Back to your comment above, should the overall data for the dropdowns be populated from a summary index or a lookup table? I might have missed the answer.

Thanks again and God bless,
Genesius


genesiusj
Builder

@to4kawa
I tried your updated code and have a few questions.

| fields -_raw and
| fields _time sshUserA sshUserC sshIp host pid duration formatTimeStamp eventcount
Thanks, I have a tendency to forget that command. The field _raw is not needed later on, so....

I coded
| eval user=mvdedup(mvappend(sshUserA, sshUserC))
; you coded
| eval user = coalesce(sshUserA, sshUserC)
This was to merge into one user field for readability. What is the difference? What benefit is there with either one?

I ran a | set diff using both of our SPL searches. There are 7 events in my results that are not in yours. These are events missing either the login or the logoff, and they are required to be captured. I am not sure whether the fields commands or the modified dedup (from | dedup _raw to | dedup sshUserA sshUserC sshIp host pid) is the cause.

The duration to run either set of code is < 1 min; therefore, I don't believe I need to decrease this time further. Besides, this saved search will run at 1:00 AM. What I do need to understand is which method of storing these events would be more efficient: lookup table or summary index?

Thanks and God bless,
Genesius


to4kawa
Ultra Champion
| eval user=mvdedup(mvappend(sshUserA, sshUserC)) 

As a result of transaction, sshUserA and sshUserC end up in the same event (same row); mvappend combines them and mvdedup removes the duplicate values.

| eval user = coalesce(sshUserA, sshUserC)

This puts whichever of sshUserA or sshUserC is present into user when they appear in separate events.

My previous query did not use transaction, so I used coalesce.
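A quick way to see the difference on a throwaway row (the field values here are made up):

| makeresults
| eval sshUserA="jdoe", sshUserC=null()
| eval user_merged=mvdedup(mvappend(sshUserA, sshUserC))
| eval user_coalesced=coalesce(sshUserA, sshUserC)

Both come out as jdoe here, because mvappend skips the NULL field. They only differ when sshUserA and sshUserC are both present with different values: mvappend keeps both as a multivalue, coalesce keeps only the first.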


genesiusj
Builder

@to4kawa
The | eval coalesce... was in a comment, so I did not realize it was meant for a version without | transaction... when I uncommented it.

To clarify, time is NOT used with the initial lookup table. This table is strictly a list of hosts, users, and their user IDs (auids). The list is appended every night with any new users added by the SysAdmins the previous day.

The final data (summary index or lookup table) WILL use time. The time picker on the dashboard will be compared against the "Start Time" and "End Time" in the final data. This is also why I can't afford to lose those 7 events I mentioned earlier when I ran | set diff... against both of our sets of code.

The idea behind the summary index or lookup table is to create a growing list of times, hosts, users, auids, IP addresses, and pids on a daily basis. This would be the data that populates the dropdowns.
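On the lookup-table side, I picture the same nightly saved search simply ending with an append to a second, growing lookup (the file name here is just a placeholder):

| outputlookup append=true lup_sshSessions.csv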

BTW, I took the first 4 lines of your CSV code and mine and ran them against All time.
Yours

index=oit_linuxevents
AND source = "Unix:UserAccounts" 
| stats count by host, user, user_id 
| table host, user, user_id

Mine

index=oit_linuxevents
AND source = "Unix:UserAccounts" 
| dedup host, user, user_id 
| table host, user, user_id

Yours returned 2.78 million events and took <11 mins to run.
Mine returned 2,814 events and <10 mins to run.

This is another scheduled saved search, with the time range set to "Yesterday". It takes about 1 sec to run.

BTW, I need to use | transaction in order to use the duration field in my eval commands for End Time.
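I realize the duration could probably also be rebuilt with stats along these rough lines (assuming the login and logoff lines really do share host and pid, which I have not verified), but transaction is fast enough here:

index=linuxevents source="/var/log/secure" process=sshd
| rex field=_raw "(Accepted\s\S+\sfor\s(?<sshUserA>\w+)|session\sclosed\sfor\suser\s(?<sshUserC>\w+))"
| eval user=coalesce(sshUserA, sshUserC)
| stats earliest(_time) AS startEpoch, latest(_time) AS endEpoch, count AS eventcount by host, pid, user
| eval duration=endEpoch-startEpoch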

Thanks and God bless,
Genesius


to4kawa
Ultra Champion
Yours returned 2.78 million events and took <11 mins to run.
Mine returned 2,814 events and <10 mins to run.

Wow, that's quite a difference. dedup is faster, I see.

If you specify "Start Time" and "End Time" using tokens, it is easier to create SPL against the summary index.
Right now, "Start Time" is a string. That will be difficult unless you convert it back to epoch time when building the dashboard.

I am sorry for making a lot of noise. Thanks a lot.


genesiusj
Builder

@to4kawa,
You are NOT "making a lot of noise". You have been very helpful, and have reminded me of commands I too often forget.

If you specify "Start Time" and "End Time" using tokens, it is easier to create SPL against the summary index.
Right now, "Start Time" is a string. That will be difficult unless you convert it back to epoch time when building the dashboard.

I would use this for the conversion:
| eval startTime=strptime('Start Time', "%Y-%m-%d %H:%M:%S.%3Q")

Thanks again for ALL of your help.
Since I decided to use a summary index, now I just need to create it. The results from that will soon follow.
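My rough plan is to keep the scheduled search above as-is and simply finish it with a collect into the new summary index (the index name is a placeholder I still have to create):

| collect index=summary_ssh_sessions marker="report=ssh_sessions"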

God bless,
Genesius


to4kawa
Ultra Champion
 | table acct, auid, addr, host, "Connection events", "Start Time", "End Time", "Duration (H:M:S.m)", eventcount, pid, "Time Stamp Format"

These results do not say when they happened.
A summary index keeps the time information, which will be useful for later searches.

However, if you don't care when the information was captured, or can work that out some other way, a lookup will save you time and search head space.

It depends on the search conditions, I think.
The summary index is also useful for fine-grained control over who can see the same results, because access can be controlled per index.


acfecondo75
Path Finder

Hey @genesiusj

My main consideration when choosing between a summary index and a lookup is whether I need to see multiple snapshots at different points in time (summary index) or just a list of the most recent data (lookup). In this case, you just need the latter.

However, I don't know if a lookup is really your best option either. The problem with using a lookup for this is that it will only be as up to date as the last scheduled run of the search that populates it, so there may be times when your dropdown selections don't accurately reflect the data in your panels. That said, if the search is too resource-intensive to run on the dashboard, you may not have another option.

Another approach could be to have a base search that pulls all the information that would normally populate that lookup and then use that as the base for each dropdown input search. That way you can run the search once and have it fuel all four of those dropdowns.
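In Simple XML that could look roughly like this (the lookup name is the one from above; the id and token names are made up):

<search id="dropdown_base">
  <query>| inputlookup lup_linuxProdAuditHostAcctAuid.csv | fields host user user_id</query>
</search>

<input type="dropdown" token="user_tok">
  <label>User</label>
  <search base="dropdown_base">
    <query>| stats count by user | fields user</query>
  </search>
  <fieldForLabel>user</fieldForLabel>
  <fieldForValue>user</fieldForValue>
</input>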

Hopefully that helps!


genesiusj
Builder

@acfecondo75
As I mentioned in my comment to to4kawa below, the search over All time took only 6 seconds. I am sure the search will take longer as the data grows, but not by much.

The problem with using a lookup for this is that it will only be as up to date as the last scheduled run of the search that populates it, so there may be times when your dropdown selections don't accurately reflect the data in your panels.

I don't think this will be an issue because the lookup table would be appended on a nightly basis.

I will update my original post to put some context around what I am trying to accomplish.

Thanks again and God bless,
Genesius
