How to assign index and sourcetype to multiple different file types that live in the same directory

tpsplunk
Communicator

I have a handful of different sourcetypes that all get written to log files in /var/log/app. I also have more than one index that the files can be sent to, depending on their type. From my testing, and from what I can understand from the documentation, I can't have multiple [monitor] stanzas with whitelists/blacklists for logs that live in the same directory. Therefore, to assign sourcetype and index based on a filename regex, it seems my choices are limited to the following:

  1. Don't use wildcards or whitelists/blacklists in inputs.conf. Instead, in the LWF's inputs.conf, create a monitor stanza for each possible log file name and assign the sourcetype and index in that stanza. This is highly undesirable for me because I have about 300-400 different log file names, and I would have to update the list each time a new log name is created.
  2. In my LWF's inputs.conf, create a single monitor stanza that captures all the logs in the directory. On my indexers, create a props.conf with multiple source stanzas, use the priority option to order overlapping wildcard source stanzas, and define a transform to assign the index. With this option I'm most concerned about the performance of assigning the index via a transform. Is this intensive?

An example of option 2 would be as follows:

Example log files:

  • coolservice.log (sourcetype: log4j, index: main)
  • coolservice-web.log (sourcetype: access_common, index: main)
  • coolservice-req.log (sourcetype: access_common, index: main)
  • coolservice-billing.log (sourcetype: custom-billing, index: billing)
  • radservice.log (sourcetype: log4j, index: main)
  • radservice-web.log (sourcetype: access_common, index: main)
  • radservice-req.log (sourcetype: access_common, index: main)
  • radservice-billing.log (sourcetype: custom-billing, index: billing)
  • ... (with 50 different service names)

inputs.conf (on the LWF)

[monitor:///var/log/app/]
whitelist = \w+\.(?:\d{4}-\d{2}-\d{2}|log)$
# i want to ignore zipped files
blacklist = \.(gz|bz2|z|zip)$

props.conf (on my indexers)

# billing logs
# e.g. coolservice-billing.log or coolservice-billing.2011-03-03
# or coolservice-billing.2011-03-03.log
[source::\w+-billing\.(?:\d{4}-\d{2}-\d{2}|log)$]
sourcetype = custom-billing
TRANSFORMS-index = billingindex
priority = 200

# web logs
[source::\w+-(?:web|req)\.(?:\d{4}-\d{2}-\d{2}|log)$]
sourcetype = access_common
TRANSFORMS-index = mainindex

# log4j service logs
[source::\w+\.(?:\d{4}-\d{2}-\d{2}|log)$]
sourcetype = log4j
TRANSFORMS-index = mainindex

transforms.conf (on my indexers)

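# Route matching events to the "billing" index from the example list
[billingindex]
REGEX = .*
DEST_KEY = _MetaData:Index
FORMAT = billing

# Route matching events to the "main" index from the example list
[mainindex]
REGEX = .*
DEST_KEY = _MetaData:Index
FORMAT = main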
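
The routing intended by option 2 can be checked offline before anything is deployed. The sketch below is plain Python, not Splunk. The patterns are taken from the source stanzas, and the sourcetypes and index names from the example log file list. It assumes that, among the patterns that match a file name, the one with the highest priority wins, which is a simplification of Splunk's real precedence rules. RULES and classify are hypothetical names. The priority of 100 on the web rule and the default of 0 are assumptions added here, because the generic log4j pattern also matches the web and req file names.

```python
import re

# (pattern, sourcetype, index, priority)
# Priorities 100 and 0 are assumptions made for this sketch only.
RULES = [
    (r"\w+-billing\.(?:\d{4}-\d{2}-\d{2}|log)$", "custom-billing", "billing", 200),
    (r"\w+-(?:web|req)\.(?:\d{4}-\d{2}-\d{2}|log)$", "access_common", "main", 100),
    (r"\w+\.(?:\d{4}-\d{2}-\d{2}|log)$", "log4j", "main", 0),
]


def classify(filename):
    """Return (sourcetype, index) of the highest-priority matching rule, or None."""
    matches = [rule for rule in RULES if re.search(rule[0], filename)]
    if not matches:
        return None
    _, sourcetype, index, _ = max(matches, key=lambda rule: rule[3])
    return sourcetype, index


for name in ["coolservice.log", "coolservice-web.log",
             "radservice-req.2011-03-03", "radservice-billing.log",
             "coolservice.log.gz"]:
    print(name, "->", classify(name))
```

Running it over a real directory listing is a cheap way to spot file names that fall through to the wrong sourcetype before they reach an index.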

So my questions are: Is this the best (or only) way to do it? Will I suffer any indexing performance problems doing it this way?

1 Solution

Stephen_Sorkin
Splunk Employee

If you are using Splunk 4.1 or newer, you can have multiple stanzas in inputs.conf where the whitelist is implied by the stanza name. For example:

[monitor:///var/log/app/\w+.log*]
sourcetype = log4j
index = main

[monitor:///var/log/app/\w+-(web|req).log*]
sourcetype = access_common
index = main

[monitor:///var/log/app/\w+-billing.log*]
sourcetype = custom-billing
index = billing

Note that the trailing * is needed to convince us to treat the pattern as a regex.

However, your props/transforms approach will work fine without any performance degradation.

gkanapathy
Splunk Employee

Actually, the "sourcetype" settings in props.conf source stanzas are processed on the forwarders, the same as inputs.conf, so if you were going to go that route, you would put them on the forwarders. See: http://www.splunk.com/wiki/Where_do_I_configure_my_Splunk_settings%3F . But Stephen's method is probably best, especially since you are also setting "index".

tpsplunk
Communicator

Note that the simple regexes in your original answer do seem to work, but I never could get any complex [monitor:///] regexes (e.g. with optional capture groups) to work consistently.

Stephen_Sorkin
Splunk Employee

The .* here is like .* in path globbing from a Unix shell, in that it's a literal . followed by anything up to the path separator. The way to interpret this is: if we see a "*" or a "...", we transition to globbing mode. We first translate "*" to "[^/]*", "..." to ".*", and "." to "\.". At this point, any remaining regexes are left in as-is. So the regex above will find files that start with one or more \w characters, followed by a literal ".", followed by any characters until the end of the filename.
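
The translation described here can be sketched in a few lines of Python. This is an illustration of the stated rules, not Splunk's actual code: monitor_to_regex and matches are hypothetical names, passing backslash escapes such as \w through untouched is an assumption, and so is anchoring the result to the whole path. The [^/] class reflects Unix path separators only.

```python
import re

# Tokens the sketch recognises: an escaped character (kept as is, an assumption),
# the "..." wildcard, the "*" wildcard, and a bare ".".
_TOKEN = re.compile(r"\\.|\.\.\.|\*|\.")
_REPLACEMENTS = {"...": ".*", "*": "[^/]*", ".": r"\."}


def monitor_to_regex(path_pattern):
    """Translate a monitor path pattern into a regex string using the three rules."""
    def substitute(match):
        token = match.group(0)
        # Escapes such as \w are not in the table and pass through unchanged.
        return _REPLACEMENTS.get(token, token)
    return _TOKEN.sub(substitute, path_pattern)


def matches(path_pattern, path):
    """True if the whole path matches the translated pattern (anchoring is assumed)."""
    return re.fullmatch(monitor_to_regex(path_pattern), path) is not None


pattern = r"/var/log/app/\w+.*"
print(monitor_to_regex(pattern))                                    # /var/log/app/\w+\.[^/]*
print(matches(pattern, "/var/log/app/coolservice.log.2011-03-07"))  # True
print(matches(pattern, "/var/log/app/coolservice-web.log"))         # False
```

The last line shows why a hyphenated name is skipped: \w+ stops at the "-", and the next character the pattern accepts is a literal ".".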

tpsplunk
Communicator

I'm pulling my hair out trying to get this to work. I really don't understand how you can mix syntax. Your example, [monitor://F:\var\log\app\\w+.*], uses a '.*' at the end, while the documentation shows that an ellipsis, '...', should be used in place of .* in a monitor stanza: http://www.splunk.com/base/Documentation/4.1.7/Admin/Specifyinputpathswithwildcards. Where can I find out definitively what regex syntax I can use in a monitor definition? Should I be able to use advanced regex features like capture groups and optional terms?

Stephen_Sorkin
Splunk Employee

Right, we should be able to use [monitor:///var/log/app/\w+.*], [monitor:///var/log/app/\w+-web.*], [monitor:///var/log/app/\w+-billing.*]. The Windows variants should be [monitor://F:\var\log\app\\w+.*] and so forth.

tpsplunk
Communicator

No, this is to expand the regexes above to allow for file rotation suffixes and also to set up regexes for Windows paths. I should have just used the previous log names as an example. So:
coolservice.log, coolservice.1.log, coolservice.log.2011-03-07
coolservice-web.log, coolservice-web.1.log, coolservice-web.log.2011-03-07
etc. (same as original question)
and the Windows path for those logs will be F:\weblogs\
I can paste in what I've tried if you want. The Splunk debug logs show the regex getting parsed all funky (e.g. a \ becomes \\, so things like \w become \\w and don't work).

Stephen_Sorkin
Splunk Employee

If the characteristic here is to distinguish the above files from service-* files, I'd suggest [monitor:///var/log/app/service.*].

tpsplunk
Communicator

Can you help me with some more complicated Windows path regexes? Say the logs are in F:\weblogs, and the log names I'd like to match with a single monitor stanza regex are service.log, service.1.log, and service.log.2011-03-04. Ideally this would be reusable for a bunch of other services. Let me know if this is better asked as a new question.
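
As a starting point, the three name shapes listed here can be captured by one ordinary regular expression. The sketch below only tests such a pattern with Python's re module. Nothing in this thread confirms that a monitor stanza name accepts it, so a whitelist setting, which the original question already uses with a regex, is the more plausible home for it. ROTATED_LOG and service_of are hypothetical names, and the pattern assumes service names contain only word characters.

```python
import re

# Accepts name.log, name.<number>.log, and name.log.<YYYY-MM-DD>.
ROTATED_LOG = re.compile(
    r"(?P<service>\w+)"            # service name: word characters only (assumption)
    r"(?:\.\d+)?"                  # optional numeric rotation index, e.g. ".1"
    r"\.log"                       # the literal ".log"
    r"(?:\.\d{4}-\d{2}-\d{2})?"    # optional date suffix, e.g. ".2011-03-04"
)


def service_of(filename):
    """Return the service name if the whole filename has a recognised shape, else None."""
    match = ROTATED_LOG.fullmatch(filename)
    return match.group("service") if match else None


for name in ["service.log", "service.1.log", "service.log.2011-03-04",
             "service-web.log", "service.log.gz"]:
    print(name, "->", service_of(name))
```

Because the service name is a capture group rather than a literal, the same pattern covers other services without modification.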

Stephen_Sorkin
Splunk Employee

No, I can't find that clearly documented, but I will make sure it gets into the docs. Also, there is no need to escape the '.'; we will treat it as a literal.

tpsplunk
Communicator

I'm testing this now. Is there a document somewhere that explains that putting a * at the end of a monitor stanza makes Splunk treat the pattern as a regex? And is there a need to escape the '.' to make it a literal match, e.g. \. in [monitor:///var/log/app/\w+\.log*]?

Stephen_Sorkin
Splunk Employee

I designed this to not have any overlap. \w will only match [A-Za-z0-9_], so the "-" will not match.

You can absolutely have a blacklist per-stanza.
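
The no-overlap claim can be checked with ordinary regular expressions. The sketch below uses Python's re module as a stand-in for Splunk's matcher. The three patterns are hand-written regex equivalents of the monitor stanzas from the accepted answer, which is an assumption about how Splunk expands them (the dot treated as a literal, the trailing * as "any further characters"). STANZAS and stanzas_matching are hypothetical names.

```python
import re

# Hand-translated equivalents of the three monitor stanzas (an assumption).
STANZAS = {
    "plain": r"\w+\.log.*",
    "web_req": r"\w+-(web|req)\.log.*",
    "billing": r"\w+-billing\.log.*",
}


def stanzas_matching(filename):
    """Return the sorted names of every stanza whose pattern matches the whole filename."""
    return sorted(name for name, pattern in STANZAS.items()
                  if re.fullmatch(pattern, filename))


for filename in ["coolservice.log", "coolservice-web.log",
                 "coolservice-req.log", "coolservice-billing.log"]:
    print(filename, stanzas_matching(filename))

# \w accepts letters, digits, and underscore, but never the hyphen.
print(bool(re.fullmatch(r"\w+", "cool_service2")))  # True
print(bool(re.fullmatch(r"\w+", "cool-service")))   # False
```

Each example name lands in exactly one stanza, because the hyphen ends the \w+ run before the first pattern can reach its literal dot.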

tpsplunk
Communicator

How are the regex overlaps handled? For example, I think a log named coolservice-web.log will match the monitor regex for your first monitor stanza and also your second monitor stanza. Should I expect to be able to use a blacklist in each stanza?
