Getting Data In

Only include events that match a list of 2000 different users

johnjohnson2
Explorer

I have some logs that can include any one of 50,000+ users. But I only need to index and keep a subset of that -- approximately 2000 users. I'm looking for the most efficient way to only include logs that are associated with these users.

I thought of using transforms.conf and doing a ridiculously long regex to match those users, but, looking for any better ideas.

props.conf
[host::blah]
TRANSFORMS-null= setnull

transforms.conf
[setnull]
REGEX=
DEST_KEY=queue
FORMAT=nullQueue


lukejadamec
Super Champion

I have an automatic lookup table of all Oracle returncodes/descriptions, which is a few times larger than what you’re looking to do, and I see zero performance impact.

The Splunk docs (http://docs.splunk.com/Documentation/Splunk/5.0.4/Indexer/Indextimeversussearchtime) say there is a performance hit from index-time extractions, so you should avoid them if you can – the index gets larger, which makes all searches slower. However, it looks like you're doing a nullQueue as opposed to adding a new field, so it may work just fine.

If you really need to do this at index time, then you should figure out a way to automate the management of the regex and then just drop it in what Kristian posted.

It will be far easier to manage a csv lookup table than it would be to manage a regex of that size.

Please post your results if you do index-time filtering with a regex on this, because I am curious about the impact.
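One way to automate the regex management mentioned above: generate the alternation from the user list with a small script, and paste (or template) the result into transforms.conf whenever the list changes. A minimal sketch, with a hypothetical in-line list standing in for the ~2000-user file; the \b anchoring would need tuning for real IIS log layouts:

```python
import re

# Hypothetical list standing in for the ~2000 usernames to keep;
# in practice this would be read from a file, e.g. users.txt.
keep_users = ["alice", "bob.smith", "carol-jones"]

# Escape each name so regex metacharacters (like '.') are treated
# literally, then join into one alternation for transforms.conf.
alternation = "|".join(re.escape(u) for u in sorted(keep_users))

# Word boundaries so substrings don't match (e.g. 'bob' inside 'bobby').
keep_regex = r"\b(?:%s)\b" % alternation

print(keep_regex)
```

The output is what would go on the REGEX line of the keepsome stanza.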


johnjohnson2
Explorer

These are IIS logs that include usernames (cs_username)


kristian_kolb
Ultra Champion

Do these accounts have some sort of distinguishing pattern, like da_xxxxx, admxxxxx, sys-xxxxxx?
Otherwise the regex would be awful to maintain.

Is there perhaps some other field in the events that could be used to filter on a broader scope?

Also, as per the docs on nullQueueing, you'll need to add an extra transform to keep some of the events;

http://docs.splunk.com/Documentation/Splunk/5.0.1/Deploy/Routeandfilterdatad#Keep_specific_events_an...

props.conf

[host::your_host]
TRANSFORMS-blah = setnull, keepsome

transforms.conf

[setnull]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue

[keepsome]
REGEX = here is where you write your super regex
DEST_KEY = queue
FORMAT = indexQueue
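With no shared pattern in the account names, the keepsome regex ends up as a literal alternation of usernames. A sketch with made-up names (the real one would have ~2000 alternatives and is matched against the raw event at index time):

```
[keepsome]
REGEX = \b(?:alice|bob\.smith|carol-jones)\b
DEST_KEY = queue
FORMAT = indexQueue
```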

K

kristian_kolb
Ultra Champion

That pretty much answers the question I was asking. Is there any other distinguishing feature that could be used for filtering, e.g. the c-ip, if the users you want to keep come from a certain IP range?

Are you constrained license-wise? Otherwise you might index more data than you need and use tags or automatic lookups to your advantage. Not sure that it would consume fewer resources, but it would likely be more manageable.
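If license headroom allows indexing everything, the search-time version of that idea could look something like this, assuming a hypothetical lookup table file keep_users.csv with a cs_username column listing the accounts of interest:

```
sourcetype=iis
| lookup keep_users.csv cs_username OUTPUT cs_username AS keep_match
| where isnotnull(keep_match)
```

Maintaining the csv is then just editing a file (or using the lookup editor) rather than regenerating a regex.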

/k


johnjohnson2
Explorer

These are AD usernames, so they are all different, if that answers what you are trying to ask.


kristian_kolb
Ultra Champion

My question was rather, what differs between the usernames you want to keep, and those you want to throw out?

Are all the usernames just arbitrary strings, e.g. bob, apple, horse, crane, alice? And there is no pattern that can be used to filter out the unwanted ones. You simply have to know that 'crane' and 'horse' are the ones to keep.
