We have in-house web apps which log stuff, and are considering moving to Splunk for analysis. This would entail adopting a new log format, which is easy - we can write it out however Splunk wants. We understand this to be the canonical format:
timestamp key1=value1 key2="value two" key3=value3
Problem is, sometimes we need to log a LOT of stuff in the 'value' part. One example is an exception, where we would want to store a fairly large Python traceback (newlines and all). Yet we still want the value to be findable/searchable/readable in reports. Another situation is when we want to log POST params from a web form; the values might be multiline text, Unicode characters or whatever.
Does Splunk support a standard system for quoting or encoding multiline text and 'problem' characters in the "value" part of the format? I was expecting to find some well-documented system like base64 or URL-encoding supported, but have been unable to find any docs on this.
If you are going to be logging potentially multiline values, then I would suggest that you use a different format for those events. You will have to define some kind of marker string, both to separate events from each other and to delimit long values within an event. For example, I would define a sourcetype in props.conf as follows:
[mycustomsourcetype]
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+\+-\+-\+==breaker==\+---[\r\n]+)
EXTRACT-longkv = (?ms)\v+--kvbegin--:(?<_KEY_1>\w+)\v+(?<_VAL_1>[\V\v]+?(?=\v+---kvend---(?:\v|$)))
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N
TIME_PREFIX = ^
MAX_TIMESTAMP_LOOKAHEAD = 24
KV_MODE = auto
Then log entries should be output as:
2010-12-17T12:34:56.789 abc=123 xyz=blah
--kvbegin--:fieldname1
long field value with other stuff and btw, splunk will handle UTF-8 just fine by default
though you might want to set
the CHARSET property for a source or
*(Y*(*&(()**
sourcetype
---kvend---
--kvbegin--:anotherfieldname
well, here's another value
---kvend---
+-+-+==breaker==+---
2010-12-17T12:22:33.444 fieldname1=somenewvaluesagain
+-+-+==breaker==+---
2010-12-17T13:12:11.000 something
The breaker string matched by LINE_BREAKER should be output between events, and the kvbegin and kvend markers delimit long KV pairs. Short pairs on the timestamp line will still be autoextracted. Note that the breakers between events will be removed by Splunk, but the markers around KV pairs will not (and need to be left in, since the EXTRACT regex relies on them). The marker strings can of course be changed to anything you like or can stand.
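Since the question mentions Python tracebacks, here is a minimal sketch of a writer for the format above. The function names format_event and join_events are made up for illustration, and the quoting of short values (double quotes, with embedded double quotes swapped for single quotes) is a simplifying assumption rather than anything Splunk requires. It also assumes you would rather reject a long value that happens to contain one of the marker strings than emit an ambiguous event.

```python
import re
import traceback
from datetime import datetime

BREAKER = "+-+-+==breaker==+---"  # event separator line from the example
KV_BEGIN = "--kvbegin--:"         # opens a long value; the field name follows
KV_END = "---kvend---"            # closes a long value


def format_event(when, short_fields=None, long_fields=None):
    """Render one event: timestamp line, short key=value pairs, long blocks."""
    # Millisecond timestamp, matching TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%3N
    stamp = when.strftime("%Y-%m-%dT%H:%M:%S.") + "%03d" % (when.microsecond // 1000)
    head = [stamp]
    for key, value in (short_fields or {}).items():
        text = str(value)
        # Assumption: quote empty values and values containing whitespace.
        if text == "" or any(ch.isspace() for ch in text):
            text = '"%s"' % text.replace('"', "'")
        head.append("%s=%s" % (key, text))
    lines = [" ".join(head)]
    for key, value in (long_fields or {}).items():
        # The EXTRACT regex captures the field name with \w+.
        if not re.fullmatch(r"\w+", key):
            raise ValueError("long field name must match \\w+: %r" % key)
        text = str(value).strip("\r\n")
        # A marker inside the value would end the field or the event early.
        if KV_END in text or BREAKER in text:
            raise ValueError("value for %r contains a marker string" % key)
        lines.extend([KV_BEGIN + key, text, KV_END])
    return "\n".join(lines)


def join_events(events):
    """Put the breaker line between consecutive events."""
    return ("\n" + BREAKER + "\n").join(events) + "\n"


# Usage: log a real traceback as a long field, followed by a short event.
try:
    1 / 0
except ZeroDivisionError:
    tb = traceback.format_exc()

print(join_events([
    format_event(datetime(2010, 12, 17, 12, 34, 56, 789000),
                 {"abc": 123, "xyz": "blah"}, {"traceback": tb}),
    format_event(datetime(2010, 12, 17, 12, 22, 33, 444000),
                 {"fieldname1": "somenewvaluesagain"}),
]), end="")
```

If rejecting values is too strict for your data, you could instead pick marker strings that are even less likely to occur, as noted above.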