I have data in JSON format, but I only want to keep the value of the MESSAGE field. I created a transform to extract the value of MESSAGE and put it in _raw. This works fine until the value is longer than about 680 characters; beyond that, the replacement no longer happens. I can't figure out which limit I'm hitting so that I can increase it. Any ideas?
transforms.conf:
[replace-raw-with-message]
DEST_KEY = _raw
REGEX = MESSAGE": \"((\\.|[^\"])*)\"
FORMAT = $1
Our data looks like this. If the value in MESSAGE is over about 680 characters the replacement no longer works.
{ "MESSAGE": "the value" }
A long time overdue, but just in case anyone else stumbles upon this issue: try adding LOOKAHEAD with a larger number (e.g. 10000) to your transform stanza. You can find out more in the transforms.conf documentation.
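As a rough sketch, this is what that suggestion might look like applied to the stanza from the question. LOOKAHEAD sets how many characters into an event Splunk searches for the REGEX, and it defaults to 4096. The value 10000 is only the example figure mentioned above, and the thread never confirms that this setting resolves the ~680-character cutoff. Since the default is already well above 680, another limit may be involved.

```ini
[replace-raw-with-message]
DEST_KEY = _raw
REGEX = MESSAGE": \"((\\.|[^\"])*)\"
FORMAT = $1
# Assumption: 10000 is an illustrative value; size it to your longest expected event.
LOOKAHEAD = 10000
```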
Updated - never mind, found it.
This part of your regex, (\\.|[^\"])*, reads in English, after the escaped slash is resolved, as
"a group made up of any number, from 0 to infinity, of either an actual period OR something that isn't a quote."
Well, first, since an actual period is not a quote, the OR is a redundant option that costs extra work. Since you put the period test first, any time there's an actual period, Splunk has to keep track of that point as an option it may need to backtrack to.
Try this -
MESSAGE": \"([^\"]*)\"
Chances are pretty good that you are looking at a catastrophic backtracking failure, where the machine runs out of memory while trying to match your data.
If you post a nonconfidential example of the data, then we may be able to revise your regex to avoid the problem.
The value of MESSAGE can contain JSON data with escaped double quotes, and the original regex includes them. For example:
"MESSAGE": "{ \"thefield\":\"thevalue\"}"
The \\. was intended to catch all escaped characters. I tried changing it to \\" to only catch escaped double quotes but ran into the same issue. Like this:
REGEX = MESSAGE": \"((\\"|[^\"])*)\"
Any other thoughts? Can I allocate more memory somehow?
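One variation that may be worth testing, offered as a sketch rather than a confirmed fix, is to keep support for escaped characters but move the alternation out of the repeated group. The engine can then consume runs of ordinary characters in one step and only handles a backslash escape when one actually occurs. The stanza name below is hypothetical, and the pattern has not been verified against the data in this thread.

```ini
# Hypothetical stanza name; same DEST_KEY and FORMAT as the original transform.
[replace-raw-with-message-unrolled]
DEST_KEY = _raw
# [^\"\\]*            a run of characters that are neither a quote nor a backslash
# (?:\\.[^\"\\]*)*    a backslash plus the escaped character, then another run
REGEX = MESSAGE": \"([^\"\\]*(?:\\.[^\"\\]*)*)\"
FORMAT = $1
```

Because the inner group is non-capturing, $1 still holds the whole value, escaped quotes included, just as with the original pattern.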
You can put anything in there. I was able to reproduce it on a test instance by putting 700 numbers in as the data, like this:
{ "MESSAGE": "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789" }